Methods to impute missing genotypes for population data

Zhaoxia Yu; Daniel J. Schaid

doi:10.1007/s00439-007-0427-y

Quantitative Health Sciences

Research output: Contribution to journal › Article › peer-review

38 Scopus citations

Abstract

For large-scale genotyping studies, it is common for most subjects to have some missing genetic markers, even if the missing rate per marker is low. This compromises association analyses, with varying numbers of subjects contributing to analyses when performing single-marker or multi-marker analyses. In this paper, we consider eight methods to infer missing genotypes, including two haplotype reconstruction methods (local expectation maximization-EM, and fastPHASE), two k-nearest neighbor methods (original k-nearest neighbor, KNN, and a weighted k-nearest neighbor, wtKNN), three linear regression methods (backward variable selection, LM.back, least angle regression, LM.lars, and singular value decomposition, LM.svd), and a regression tree, Rtree. We evaluate their accuracy using single nucleotide polymorphism (SNP) data from the HapMap project, under a variety of conditions and parameters. We find that fastPHASE has the lowest error rates across different analysis panels and marker densities. LM.lars gives slightly less accurate estimates of missing genotypes than fastPHASE, but has better performance than the other methods.

Original language	English (US)
Pages (from-to)	495-504
Number of pages	10
Journal	Human Genetics
Volume	122
Issue number	5
DOIs	https://doi.org/10.1007/s00439-007-0427-y
State	Published - Dec 2007

ASJC Scopus subject areas

Genetics
Genetics (clinical)

Cite this

@article{da0fd1c7465e4213919bc560c293205d,

title = "Methods to impute missing genotypes for population data",

abstract = "For large-scale genotyping studies, it is common for most subjects to have some missing genetic markers, even if the missing rate per marker is low. This compromises association analyses, with varying numbers of subjects contributing to analyses when performing single-marker or multi-marker analyses. In this paper, we consider eight methods to infer missing genotypes, including two haplotype reconstruction methods (local expectation maximization-EM, and fastPHASE), two k-nearest neighbor methods (original k-nearest neighbor, KNN, and a weighted k-nearest neighbor, wtKNN), three linear regression methods (backward variable selection, LM.back, least angle regression, LM.lars, and singular value decomposition, LM.svd), and a regression tree, Rtree. We evaluate the accuracy of them using single nucleotide polymorphism (SNP) data from the HapMap project, under a variety of conditions and parameters. We find that fastPHASE has the lowest error rates across different analysis panels and marker densities. LM.lars gives slightly less accurate estimate of missing genotypes than fastPHASE, but has better performance than the other methods.",

author = "Yu, Zhaoxia and Schaid, {Daniel J.}",

note = "Funding Information: The authors are grateful to the three anonymous reviewers for their constructive suggestions. This work was supported by the U.S. Public Health Service, National Institutes of Health, contract grant number GM065450.",

year = "2007",

month = dec,

doi = "10.1007/s00439-007-0427-y",

language = "English (US)",

volume = "122",

pages = "495--504",

journal = "Human Genetics",

issn = "0340-6717",

publisher = "Springer Verlag",

number = "5",

}

TY - JOUR

T1 - Methods to impute missing genotypes for population data

AU - Yu, Zhaoxia

AU - Schaid, Daniel J.

N1 - Funding Information: The authors are grateful to the three anonymous reviewers for their constructive suggestions. This work was supported by the U.S. Public Health Service, National Institutes of Health, contract grant number GM065450.

PY - 2007/12

Y1 - 2007/12

AB - For large-scale genotyping studies, it is common for most subjects to have some missing genetic markers, even if the missing rate per marker is low. This compromises association analyses, with varying numbers of subjects contributing to analyses when performing single-marker or multi-marker analyses. In this paper, we consider eight methods to infer missing genotypes, including two haplotype reconstruction methods (local expectation maximization-EM, and fastPHASE), two k-nearest neighbor methods (original k-nearest neighbor, KNN, and a weighted k-nearest neighbor, wtKNN), three linear regression methods (backward variable selection, LM.back, least angle regression, LM.lars, and singular value decomposition, LM.svd), and a regression tree, Rtree. We evaluate the accuracy of them using single nucleotide polymorphism (SNP) data from the HapMap project, under a variety of conditions and parameters. We find that fastPHASE has the lowest error rates across different analysis panels and marker densities. LM.lars gives slightly less accurate estimate of missing genotypes than fastPHASE, but has better performance than the other methods.

UR - http://www.scopus.com/inward/record.url?scp=36348978013&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36348978013&partnerID=8YFLogxK

U2 - 10.1007/s00439-007-0427-y

DO - 10.1007/s00439-007-0427-y

M3 - Article

C2 - 17851696

AN - SCOPUS:36348978013

SN - 0340-6717

VL - 122

SP - 495

EP - 504

JO - Human Genetics

JF - Human Genetics

IS - 5

ER -
