GCTA: A Tool for Genome-wide Complex Trait Analysis



Am J Hum Genet. 2011 Jan 7; 88(1): 76–82.

doi:10.1016/j.ajhg.2010.11.011

PMCID: PMC3014363

PMID: 21167468

Jian Yang,¹,∗ S. Hong Lee,¹ Michael E. Goddard,²,³ and Peter M. Visscher¹


Abstract

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which is based on a method we recently developed to address the “missing heritability” problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP with the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

Main Text

Despite the great success of genome-wide association studies (GWAS), which have identified hundreds of SNPs conferring the genetic variation of human complex diseases and traits,¹ the genetic architecture of human complex traits remains largely unexplained. For most traits, the associated SNPs from GWAS explain only a small fraction of the heritability.²,³ There has not been any consensus on the explanation of the “missing heritability.” Possible explanations include a large number of common variants with small effects, rare variants with large effects, and DNA structural variation.²,⁴ We recently proposed a method of estimating the total amount of phenotypic variance captured by all SNPs on the current generation of commercial genotyping arrays and estimated that ∼45% of the phenotypic variance for human height can be explained by all common SNPs.⁵ Thus, most of the heritability for height is hiding rather than missing because of many SNPs with small effects.⁵,⁶ In contrast to single-SNP association analysis, the basic concept behind our method is to fit the effects of all the SNPs as random effects by a mixed linear model (MLM),

$y = Xβ + Wu + ɛ \quad \text{with} \quad \mathrm{var}(y) = V = W W^{'} σ_{u}^{2} + I σ_{ɛ}^{2},$

(Equation 1)

where y is an n × 1 vector of phenotypes with n being the sample size, β is a vector of fixed effects such as sex, age, and/or one or more eigenvectors from principal component analysis (PCA), u is a vector of SNP effects with $u \sim N (0, I σ_{u}^{2})$ , I is an n × n identity matrix, and ɛ is a vector of residual effects with $ɛ \sim N (0, I σ_{ɛ}^{2})$ . W is a standardized genotype matrix with the ij^th element $w_{i j} = (x_{i j} - 2 p_{i}) / \sqrt{2 p_{i} (1 - p_{i})}$ , where x_ij is the number of copies of the reference allele for the i^th SNP of the j^th individual and p_i is the frequency of the reference allele. If we define A = WW′/N and define $σ_{g}^{2}$ as the variance explained by all the SNPs, i.e., $σ_{g}^{2} = N σ_{u}^{2}$ , with N being the number of SNPs, then Equation 1 will be equivalent to:⁷⁻⁹

$y = Xβ + g + ɛ \quad \text{with} \quad V = A σ_{g}^{2} + I σ_{ɛ}^{2},$

(Equation 2)

where g is an n × 1 vector of the total genetic effects of the individuals with $g \sim N (0, A σ_{g}^{2})$ , and A is interpreted as the genetic relationship matrix (GRM) between individuals. We can therefore estimate $σ_{g}^{2}$ by the restricted maximum likelihood (REML) approach,¹⁰ relying on the GRM estimated from all the SNPs. Here we report a versatile tool called genome-wide complex trait analysis (GCTA), which implements the method of estimating variance explained by all SNPs, and extend the method to partition the genetic variance onto each of the chromosomes and also to estimate the variance explained by the X chromosome and test for dosage compensation in females. We developed GCTA in five function domains: data management, estimation of the GRM from a set of SNPs, estimation of the variance explained by all the SNPs on a single chromosome or the whole genome, estimation of linkage disequilibrium (LD) structure, and simulation.
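
The equivalence of Equations 1 and 2 can be checked in one line. The step below uses only the definitions already given (A = WW′/N and $σ_{g}^{2} = N σ_{u}^{2}$) and writes g = Wu; it is a worked restatement, not an additional result:

$\mathrm{var}(g) = \mathrm{var}(Wu) = W \, \mathrm{var}(u) \, W^{'} = W W^{'} σ_{u}^{2} = \frac{W W^{'}}{N} \, N σ_{u}^{2} = A σ_{g}^{2},$

so that V = var(g) + var(ɛ) = $A σ_{g}^{2} + I σ_{ɛ}^{2}$, as stated in Equation 2.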

Estimation of the Genetic Relationship from Genome-wide SNPs

One of the core functions of GCTA is to estimate the genetic relationships between individuals from the SNPs. From the definition above, the genetic relationship between individuals j and k can be estimated by the following equation:

$A_{j k} = \frac{1}{N} \sum_{i = 1}^{N} \frac{(x_{i j} - 2 p_{i}) (x_{i k} - 2 p_{i})}{2 p_{i} (1 - p_{i})} .$

(Equation 3)
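
As a minimal sketch of Equation 3, the following assumes that the genotypes fit in memory as a NumPy array with SNPs in rows and individuals in columns, that there are no missing calls, and that allele frequencies are estimated from the sample itself. The function name `grm_from_genotypes` and the simulated toy data are illustrative assumptions and are not part of GCTA.

```python
import numpy as np

def grm_from_genotypes(x):
    """Equation 3 for an (N SNPs) x (n individuals) array of 0/1/2 allele counts."""
    x = np.asarray(x, dtype=float)
    n_snps = x.shape[0]
    p = x.mean(axis=1) / 2.0  # sample frequency of the reference allele (assumption)
    if np.any((p <= 0.0) | (p >= 1.0)):
        raise ValueError("remove monomorphic SNPs before computing the GRM")
    # Standardize: w_ij = (x_ij - 2 p_i) / sqrt(2 p_i (1 - p_i))
    w = (x - 2.0 * p[:, None]) / np.sqrt(2.0 * p * (1.0 - p))[:, None]
    # With SNPs in rows, the n x n relationship matrix is W'W / N.
    return w.T @ w / n_snps

# Toy data (hypothetical): 500 simulated SNPs for 50 unrelated individuals.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.2, 0.8, size=500)
geno = rng.binomial(2, freqs[:, None], size=(500, 50))
A = grm_from_genotypes(geno)
```

For unrelated simulated individuals such as these, the off-diagonal elements of `A` scatter around zero and the diagonal elements scatter around one.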

We provide a function to iteratively exclude one individual of a pair whose relationship is greater than a specified cutoff value, e.g., 0.025, while retaining the maximum number of individuals in the data. For data collected from family or twin studies, we recommend that users estimate the genetic relationships with all of the autosomal SNPs and then use this option to exclude close relatives. The reason for exclusion is that the objective of the analysis is to estimate genetic variation captured by all the SNPs, just as GWAS does for single SNPs. Including close relatives, such as parent-offspring pairs and siblings, would result in the estimate of genetic variance being driven by the phenotypic correlations for these pairs (just as in pedigree analysis), and this estimate could be a biased estimate of total genetic variance, for example because of common environmental effects. Even if the estimate is not biased, its interpretation is different from the estimate from “unrelated” individuals: a pedigree-based estimator captures the contribution from all causal variants (across the entire allele frequency spectrum), whereas our method captures the contribution from causal variants that are in LD with the genotyped SNPs.
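
The exclusion step can be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions, not a description of GCTA's actual algorithm, which the text does not specify: at each iteration it drops the individual involved in the largest number of pairs above the cutoff, which tends to retain many individuals but does not guarantee the maximum. The function name `prune_related` is hypothetical.

```python
import numpy as np

def prune_related(grm, cutoff=0.025):
    """Return indices of individuals kept so that no retained pair exceeds the cutoff."""
    grm = np.asarray(grm, dtype=float)
    related = grm > cutoff
    np.fill_diagonal(related, False)  # ignore each individual's relationship with itself
    keep = np.ones(grm.shape[0], dtype=bool)
    while True:
        # For each retained individual, count retained relatives above the cutoff.
        counts = (related & keep[None, :] & keep[:, None]).sum(axis=1)
        if counts.max() == 0:
            break
        keep[np.argmax(counts)] = False  # greedy choice: drop the most-connected individual
    return np.flatnonzero(keep)

# Toy GRM (hypothetical): individual 0 is related to individuals 1 and 2.
toy = np.eye(4)
toy[0, 1] = toy[1, 0] = 0.5
toy[0, 2] = toy[2, 0] = 0.25
kept = prune_related(toy, cutoff=0.025)
```

In the toy example, removing individual 0 alone resolves both related pairs, so individuals 1, 2, and 3 are retained.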

As a by-product, we provide a function in GCTA to calculate the eigenvectors of the GRM, which are asymptotically equivalent to those from the PCA implemented in EIGENSTRAT¹¹ because the GRM (A_jk) defined in GCTA is approximately half of the covariance matrix (Ψ_jk) used in EIGENSTRAT. The only purpose of developing this function is to calculate eigenvectors and then include them in the model as covariates to capture variance due to population structure. More sophisticated analyses of the population structure can be found in programs such as EIGENSTRAT¹¹ and STRUCTURE.¹²
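
A minimal sketch of this step, assuming the GRM is available as a dense symmetric NumPy array: the leading eigenvectors are obtained with `numpy.linalg.eigh` and can then be used as covariates. The function name `grm_eigenvectors` and the toy matrix are hypothetical.

```python
import numpy as np

def grm_eigenvectors(grm, k=2):
    """Return the k largest eigenvalues of a symmetric GRM and their eigenvectors."""
    grm = np.asarray(grm, dtype=float)
    values, vectors = np.linalg.eigh(grm)  # eigh returns eigenvalues in ascending order
    order = np.argsort(values)[::-1][:k]   # reorder so the largest come first
    return values[order], vectors[:, order]

# Toy symmetric matrix (hypothetical) with eigenvalues 3 and 1.
vals, vecs = grm_eigenvectors(np.array([[2.0, 1.0], [1.0, 2.0]]), k=2)
```

Each column of `vecs` would be supplied as one covariate column of X in the mixed model.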

Estimation of the Variance Explained by Genome-wide SNPs by REML

The GRM estimated from the SNPs can be fitted subsequently in an MLM to estimate the variance explained by these SNPs via the REML method.¹⁰ Previously, we included only one genetic factor in the model. Here we extend the model in a general form as

Estimation of the Variance Explained by the SNPs on the X Chromosome

The method of estimating the genetic relationship from the X chromosome is different from that for the autosomal SNPs, because males have only one X chromosome. We modified Equation 3 for the X chromosome as:

$A_{j k}^{M} = \frac{1}{N} \sum_{i = 1}^{N} \frac{(x_{i j}^{M} - p_{i}) (x_{i k}^{M} - p_{i})}{p_{i} (1 - p_{i})} \quad \text{for a male-male pair,}$

$A_{j k}^{F} = \frac{1}{N} \sum_{i = 1}^{N} \frac{(x_{i j}^{F} - 2 p_{i}) (x_{i k}^{F} - 2 p_{i})}{2 p_{i} (1 - p_{i})} \quad \text{for a female-female pair, and}$

$A_{j k}^{MF} = \frac{1}{N} \sum_{i = 1}^{N} \frac{(x_{i j}^{M} - p_{i}) (x_{i k}^{F} - 2 p_{i})}{\sqrt{2} p_{i} (1 - p_{i})} \quad \text{for a male-female pair,}$

where $x_{i j}^{M}$ and $x_{i j}^{F}$ are the number of copies of the reference allele for an X chromosome SNP for a male and a female, respectively.
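
The three pair types can be handled in a single pass by standardizing each genotype with its own ploidy (one X copy for males, two for females) and averaging over the N SNPs as in Equation 3. The sketch below assumes that allele frequencies p_i are supplied by the caller and that there are no missing genotypes; the function name `x_chromosome_grm` is hypothetical.

```python
import numpy as np

def x_chromosome_grm(x, is_male, p):
    """x: (N SNPs) x (n individuals) allele counts (0/1 for males, 0/1/2 for females)."""
    x = np.asarray(x, dtype=float)
    is_male = np.asarray(is_male, dtype=bool)
    p = np.asarray(p, dtype=float)
    ploidy = np.where(is_male, 1.0, 2.0)  # number of X copies per individual
    # Center by ploidy * p and scale by sqrt(ploidy * p * (1 - p)).
    centered = x - p[:, None] * ploidy[None, :]
    scale = np.sqrt(p * (1.0 - p))[:, None] * np.sqrt(ploidy)[None, :]
    w = centered / scale
    # Products of standardized values give the male-male, female-female,
    # and male-female (sqrt(2) denominator) forms; averaging gives the 1/N factor.
    return w.T @ w / x.shape[0]

# One SNP with p = 0.5: a male carrying the allele, a homozygous female, a male without it.
a_x = x_chromosome_grm([[1, 2, 0]], is_male=[True, False, True], p=[0.5])
```

For the male-female pair in this toy example the result is (1 − 0.5)(2 − 1) / (√2 × 0.25) = √2, matching the male-female formula term by term.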

Assuming the male-female genetic correlation to be 1, the X-linked phenotypic covariance between a pair of individuals is:²⁰

$\mathrm{cov}_{X} (y_{j}^{M}, y_{k}^{M}) = E (A_{j k}^{M}) σ_{X (M)}^{2} \quad \text{for a male-male pair,}$

$\mathrm{cov}_{X} (y_{j}^{F}, y_{k}^{F}) = E (A_{j k}^{F}) σ_{X (F)}^{2} \quad \text{for a female-female pair, and}$

$\mathrm{cov}_{X} (y_{j}^{M}, y_{k}^{F}) = E (A_{j k}^{MF}) σ_{X (M)} σ_{X (F)} \quad \text{for a male-female pair,}$

where $σ_{X (M)}^{2}$ and $σ_{X (F)}^{2}$ are the genetic variance attributed to the X chromosome for males and females, respectively.

The relative values of $σ_{X (M)}^{2}$ and $σ_{X (F)}^{2}$ depend on the assumption made regarding dosage compensation for X chromosome genes. There are two alleles per locus in females, but only one in males. If we assume that each allele has a similar effect on the trait (i.e., no dosage compensation), the genetic variance on the X chromosome for females is twice that for males: i.e., $σ_{X}^{2} = σ_{X (F)}^{2} = 2 σ_{X (M)}^{2}$ . Thus,

$\mathrm{cov}_{X} (y_{j}^{M}, y_{k}^{M}) = \frac{1}{2} E (A_{j k}^{M}) σ_{X}^{2} \quad \text{for a male-male pair,}$

$\mathrm{cov}_{X} (y_{j}^{F}, y_{k}^{F}) = E (A_{j k}^{F}) σ_{X}^{2} \quad \text{for a female-female pair, and}$

$\mathrm{cov}_{X} (y_{j}^{M}, y_{k}^{F}) = \frac{1}{\sqrt{2}} E (A_{j k}^{MF}) σ_{X}^{2} \quad \text{for a male-female pair.}$

This can be implemented by redefining GRM for the X chromosome as $A_{X}^{ND} = 1 / 2 A_{X}$ for male-male pairs, $A_{X}^{ND} = A_{X}$ for female-female pairs, and $A_{X}^{ND} = 1 / \sqrt{2} A_{X}$ for male-female pairs. If we assume that each allele in females has only half the effect of an allele in males (i.e., full dosage compensation), the X-linked genetic variance for females is half that for males: i.e., $σ_{X}^{2} = σ_{X (F)}^{2} = 1 / 2 σ_{X (M)}^{2}$ . Thus,

$\mathrm{cov}_{X} (y_{j}^{M}, y_{k}^{M}) = 2 E (A_{j k}^{M}) σ_{X}^{2} \quad \text{for a male-male pair,}$

$\mathrm{cov}_{X} (y_{j}^{F}, y_{k}^{F}) = E (A_{j k}^{F}) σ_{X}^{2} \quad \text{for a female-female pair, and}$

$\mathrm{cov}_{X} (y_{j}^{M}, y_{k}^{F}) = \sqrt{2} E (A_{j k}^{MF}) σ_{X}^{2} \quad \text{for a male-female pair.}$

Therefore, the raw A_X matrix should be parameterized as $A_{X}^{FD} = 2 A_{X}$ for male-male pairs, $A_{X}^{FD} = A_{X}$ for female-female pairs, and $A_{X}^{FD} = \sqrt{2} A_{X}$ for male-female pairs. The third possibility is to assume equal genetic variance on the X chromosome for males and females, i.e., $σ_{X}^{2} = σ_{X (F)}^{2} = σ_{X (M)}^{2}$ , in which case the A_X matrix is not redefined at all.
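
The three parameterizations differ only in how the male rows and columns of A_X are scaled, so they can be sketched with one per-individual factor: √(1/2) for males under no dosage compensation, √2 under full dosage compensation, and 1 under equal variance, with females always scaled by 1. The function name `rescale_x_grm` and the model labels are hypothetical.

```python
import numpy as np

def rescale_x_grm(a_x, is_male, model="equal"):
    """Rescale a raw X-chromosome GRM under one of three dosage compensation assumptions."""
    a_x = np.asarray(a_x, dtype=float)
    is_male = np.asarray(is_male, dtype=bool)
    # Multiplier applied to male-male pairs under each assumption.
    male_factor = {"none": 0.5, "full": 2.0, "equal": 1.0}[model]
    # sqrt(male_factor) per male gives male_factor for male-male pairs,
    # sqrt(male_factor) for male-female pairs, and 1 for female-female pairs.
    s = np.where(is_male, np.sqrt(male_factor), 1.0)
    return a_x * np.outer(s, s)

raw = np.ones((2, 2))  # toy raw matrix (hypothetical): individual 0 male, individual 1 female
a_nd = rescale_x_grm(raw, [True, False], model="none")
a_fd = rescale_x_grm(raw, [True, False], model="full")
```

Writing the scaling as an outer product keeps the rescaled matrix symmetric and makes the male-female factor the geometric mean of the male-male and female-female factors, as in the expressions above.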

We can estimate $σ_{X}^{2}$ by fitting the model y = Xβ + g_X + g + ɛ, where g_X is a vector of genetic effects attributable to the X chromosome, with $var (g_{X}) = A_{X}^{ND} σ_{X}^{2}$ assuming no dosage compensation, $var (g_{X}) = A_{X}^{FD} σ_{X}^{2}$ assuming full dosage compensation, and $var (g_{X}) = A_{X} σ_{X}^{2}$ assuming equal X-linked genetic variance for males and females. A test of dosage compensation can be achieved by comparing the likelihoods of the models fitted under the three assumptions.

Estimation of the Variance Explained by Genome-wide SNPs for a Case-Control Study

The methodology described above is also applicable for case-control data, for which the estimate of variance explained by the SNPs corresponds to variation on the observed 0–1 scale. Under the assumption of a threshold-liability model for a disease, i.e., disease liability on the underlying scale follows a standard normal distribution,²¹ the estimate of variance explained by the SNPs on the observed 0–1 scale can be transformed to that on the unobserved continuous liability scale by a linear transformation.²² The relationship between additive genetic variance on the observed 0–1 and unobserved liability scales was proposed more than half a century ago,²³,²⁴ and we recently extended this transformation to account for ascertainment bias in a case-control study, i.e., a much higher proportion of cases in the sample than in the general population (unpublished data). We provide options in GCTA to analyze a binary trait and to transform the estimate on the 0–1 scale to that on the liability scale with an adjustment for ascertainment bias. There is an important caveat in applying the methods described herein to case-control data. Any batch, plate, or other technical artifact that causes allele frequencies to differ between cases and controls, on average, by more than expected under the null hypothesis that the samples come from the same population will contribute to the estimation of spurious genetic variation, because cases will appear to be more related to other cases than to controls. Therefore, stringent quality control is essential when applying GCTA to case-control data. Quantitative traits are less likely to suffer from technical genotyping artifacts because such artifacts will generally not lead to spurious association between continuous phenotypes and genotypes.
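
The scale transformation can be sketched as follows. With K the population prevalence, T the liability threshold, and z the standard normal density at T, the classical transformation for a random population sample multiplies the observed-scale estimate by K(1 − K)/z². The ascertainment-adjusted branch below, which multiplies further by K(1 − K)/[P(1 − P)] with P the proportion of cases in the sample, is stated here as an assumption about the form of the adjustment, since the text cites it only as unpublished data. The function name `observed_to_liability` is hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence, case_fraction=None):
    """Transform variance explained on the observed 0-1 scale to the liability scale."""
    k = prevalence
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1.0 - k)  # liability threshold T
    z = std_normal.pdf(t)            # standard normal density at the threshold
    if case_fraction is None:
        # Random population sample: classical linear transformation.
        return h2_observed * k * (1.0 - k) / z ** 2
    # Assumed form of the ascertainment adjustment (not given in the text).
    p = case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))

h2_liability = observed_to_liability(0.2, prevalence=0.5)
```

When the proportion of cases in the sample equals the prevalence (P = K), the adjusted branch reduces to the classical transformation, which is a useful sanity check on any implementation.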

Estimation of the Inbreeding Coefficient from Genome-wide SNPs

Apart from estimating the genetic relatedness between individuals, GCTA also has a function to estimate the inbreeding coefficient (F) from SNP data, i.e., the relationship between haplotypes within an individual. Two estimates have been used: one based on the variance of additive genetic values (diagonal of the SNP-derived GRM) and the other based on SNP homozygosity (implemented in PLINK).²⁵ Let (1 – p_i)² + p_i(1 – p_i)F, 2p_i(1 – p_i)(1 – F), and p_i² + p_i(1 – p_i)F be the frequencies of the three genotypes of a SNP i and let h_i = 2p_i(1 – p_i). The estimate based on the variance of additive genotype values is

${\hat{F}}_{i}^{I} = {[x_{i} - E (x_{i})]}^{2} / h_{i} - 1 = {(x_{i} - 2 p_{i})}^{2} / h_{i} - 1 \quad \text{and} \quad \mathrm{var} ({\hat{F}}_{i}^{I} | F) = (1 - h_{i}) / h_{i} + 7 (1 - 2 h_{i}) F / h_{i} - F^{2},$

where x_i is the number of copies of the reference allele for the i^th SNP. This is a special case of Equation 3 for a single SNP when j = k. The estimate based upon excess homozygosity is

${\hat{F}}_{i}^{II} = [O (\#\text{hom}) - E (\#\text{hom})] / [1 - E (\#\text{hom})] = 1 - x_{i} (2 - x_{i}) / h_{i} \quad \text{and} \quad \mathrm{var} ({\hat{F}}_{i}^{II} | F) = (1 - h_{i}) / h_{i} - (1 - 2 h_{i}) F / h_{i} - F^{2},$

where O(# hom) and E(# hom) are the observed and expected number of homozygous genotypes in the sample, respectively. Both estimators are unbiased estimates of F in the sense that $E ({\hat{F}}_{i}^{I} | F) = E ({\hat{F}}_{i}^{II} | F) = F$ , but their sampling variances are dependent on allele frequency, i.e., $var ({\hat{F}}_{i}^{I}) = var ({\hat{F}}_{i}^{II}) =$ (1 – h_i) / h_i if F = 0. In addition, the covariance between the two estimators is (3h_i – 1) / h_i + (1 – 2h_i)F / h_i – F², so that the sampling covariance between the estimators is (3h_i – 1) / h_i and the sampling correlation is (3h_i – 1) / (1 – h_i) when F = 0. We proposed an estimator based upon the correlation between uniting gametes:⁵

${\hat{F}}_{i}^{III} = [x_{i}^{2} - (1 + 2 p_{i}) x_{i} + 2 p_{i}^{2}] / h_{i} \quad \text{and} \quad \mathrm{var} ({\hat{F}}_{i}^{III} | F) = 1 + 2 (1 - 2 h_{i}) F / h_{i} - F^{2} .$

${\hat{F}}_{i}^{III}$ is also an unbiased estimator of F in the sense that $E ({\hat{F}}_{i}^{III} | F) = F$ . If F = 0, $var ({\hat{F}}_{i}^{III}) = 1$ regardless of allele frequency, which is smaller than the sampling variance of ${\hat{F}}_{i}^{I}$ and ${\hat{F}}_{i}^{II}$ , i.e., 1 ≤ (1 – h_i) / h_i. When 0 < F < 1/3, ${\hat{F}}_{i}^{III}$ also has a smaller variance than ${\hat{F}}_{i}^{I}$ and ${\hat{F}}_{i}^{II}$ . In GCTA, we use 1 + ${\hat{F}}_{i}^{III}$ rather than 1 + ${\hat{F}}_{i}^{I}$ to calculate the diagonal of the GRM. For multiple SNPs, we average the estimates over all of the SNPs, i.e., $\hat{F} = 1 / N \sum_{i = 1}^{N} {\hat{F}}_{i}$ .
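
The three single-SNP estimators can be written out directly from the formulas above. The sketch below also evaluates their expectations under the stated genotype frequencies, as a numerical check of unbiasedness. The function names `inbreeding_estimates` and `expected_estimates` are hypothetical.

```python
def inbreeding_estimates(x, p):
    """Single-SNP estimators of F for allele count x (0, 1, or 2) and allele frequency p."""
    h = 2.0 * p * (1.0 - p)
    f1 = (x - 2.0 * p) ** 2 / h - 1.0                       # variance of additive values
    f2 = 1.0 - x * (2.0 - x) / h                            # excess homozygosity
    f3 = (x ** 2 - (1.0 + 2.0 * p) * x + 2.0 * p ** 2) / h  # correlation between uniting gametes
    return f1, f2, f3

def expected_estimates(p, f):
    """Expectation of each estimator under the genotype frequencies given in the text."""
    q = 1.0 - p
    # Probabilities of carrying 0, 1, or 2 copies of the reference allele.
    probs = {0: q * q + p * q * f, 1: 2.0 * p * q * (1.0 - f), 2: p * p + p * q * f}
    totals = [0.0, 0.0, 0.0]
    for x, prob in probs.items():
        for idx, value in enumerate(inbreeding_estimates(x, p)):
            totals[idx] += prob * value
    return tuple(totals)

means = expected_estimates(p=0.3, f=0.1)
```

All three expectations equal the assumed F. The formulas also imply that, for any genotype, the third estimator is the average of the first two, which is consistent with its smaller sampling variance at F = 0.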

Estimating LD Structure

In a standard GWAS, particularly with a large sample size, the mean (λ_mean) or median (λ_median) of the test statistics for single-SNP associations often deviates from its expected value under the null hypothesis of no association between any SNP and the phenotype, which is usually interpreted as the effect due to population stratification and/or cryptic relatedness.¹¹,²⁶,²⁷ An alternative explanation is that polygenic variation causes the observed inflated test statistic.¹⁸ To predict the genomic inflation factors, λ_mean and λ_median, from polygenic parameters such as the total amount of variance that is explained by all SNPs, we need to quantify the LD structure between SNPs and putative causal variants (unpublished data). GCTA provides a function to search for all the SNPs in LD with the “causal variants” (mimicked by a set of SNPs chosen by the user). Given a causal variant, we use simple regression to test for SNPs in LD with the causal variant within d Mb distance in either direction. PLINK has an option (“--show-tags”) to select SNPs in LD with a set of target SNPs with LD r² larger than a user-specified cutoff value. This function is very useful to distinguish independent association signals but less suited to predict λ_mean and λ_median, because the test statistics of the SNPs in modest LD with causal variants (SNPs at Mb distance with low r²) will also be inflated to a certain extent, and these test statistics will contribute to the genomic inflation factors.
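
A minimal sketch of the window search, assuming genotypes in memory with SNPs in rows, positions in base pairs on a single chromosome, and no monomorphic SNPs. For simple regression of one SNP on another, the regression R² equals the squared correlation, so the sketch reports r² and applies a user-chosen threshold in place of a formal significance test; that substitution, the function name `snps_in_ld`, and the toy data are assumptions.

```python
import numpy as np

def snps_in_ld(geno, positions, causal_index, window_mb=1.0, r2_min=0.0):
    """List (index, r2) for SNPs within window_mb of the causal SNP with r2 >= r2_min."""
    geno = np.asarray(geno, dtype=float)
    positions = np.asarray(positions)
    causal = geno[causal_index]
    # Candidate SNPs within the distance window in either direction.
    near = np.abs(positions - positions[causal_index]) <= window_mb * 1e6
    near[causal_index] = False  # the causal SNP itself is not reported
    hits = []
    for i in np.flatnonzero(near):
        r = np.corrcoef(causal, geno[i])[0, 1]  # simple regression: R^2 = r^2
        if r * r >= r2_min:
            hits.append((int(i), float(r * r)))
    return hits

# Toy data (hypothetical): SNP 1 copies SNP 0, SNP 2 is weakly correlated, SNP 3 is 8 Mb away.
toy_geno = [[0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2], [2, 0, 1, 1, 0, 2], [0, 1, 2, 1, 0, 2]]
toy_pos = [1_000_000, 1_500_000, 1_800_000, 9_000_000]
hits = snps_in_ld(toy_geno, toy_pos, causal_index=0, window_mb=1.0, r2_min=0.8)
```

Setting `r2_min` close to zero keeps SNPs in modest LD with the causal variant, which is the case the paragraph above argues matters for predicting λ_mean and λ_median.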

GWAS Simulation

We provide a function to simulate GWAS data based on the observed genotype data. For a quantitative trait, the phenotypes are simulated by the simple additive genetic model y = Wu + ɛ, where the notation is the same as above. Given a set of SNPs assigned as causal variants, the effects of the causal variants are generated from a standard normal distribution, and the residual effects are generated from a normal distribution with mean of 0 and variance of $σ_{g}^{2} (1 / h^{2} - 1)$ , where $σ_{g}^{2}$ is the empirical variance of Wu and h² is the user-specified heritability. For a case-control study, assuming a threshold-liability model, disease liabilities are simulated in the same way as that for the phenotypes of a quantitative trait. Any individual with disease liability exceeding a certain threshold T is assigned to be a case and a control otherwise, where T is the threshold of the normal distribution truncating the proportion of K (disease prevalence). The only purpose of this function is to do a simple simulation based on the observed genotype data. More complicated simulation can be performed with programs such as ms,²⁸ GENOME,²⁹ FREGENE,³⁰ and HAPGEN.³¹
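
The simulation procedure can be sketched as follows, assuming the standardized genotypes of the causal SNPs are supplied as an n × m NumPy array. For the case-control branch, the sketch standardizes the simulated liabilities before applying the standard normal threshold; that standardization, the function name `simulate_phenotypes`, and the toy input are assumptions rather than details given in the text.

```python
import numpy as np
from statistics import NormalDist

def simulate_phenotypes(w_causal, h2, rng, prevalence=None):
    """Simulate y = Wu + e; return 0/1 case status instead if a prevalence is given."""
    w_causal = np.asarray(w_causal, dtype=float)
    n, m = w_causal.shape
    u = rng.standard_normal(m)  # causal effects drawn from N(0, 1)
    g = w_causal @ u
    var_g = g.var(ddof=1)       # empirical variance of Wu
    # Residual variance var_g * (1 / h2 - 1) gives heritability h2 in expectation.
    e = rng.normal(0.0, np.sqrt(var_g * (1.0 / h2 - 1.0)), size=n)
    y = g + e
    if prevalence is None:
        return y
    # Assumption: standardize liabilities so the N(0, 1) threshold applies.
    liability = (y - y.mean()) / y.std(ddof=1)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    return (liability > threshold).astype(int)

rng = np.random.default_rng(1)
w = rng.standard_normal((1000, 50))  # stand-in for standardized causal genotypes
y = simulate_phenotypes(w, h2=0.5, rng=rng)
status = simulate_phenotypes(w, h2=0.5, rng=rng, prevalence=0.1)
```

In a real analysis `w` would be built from observed genotypes rather than random normal draws, so that the simulated phenotypes inherit the LD structure of the data.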

Data Management

We chose the PLINK²⁵ compact binary file format (*.bed, *.bim, and *.fam) as the input data format for GCTA because of its popularity in the genetics community and its efficiency of data storage. For the imputed dosage data, we use the output files of the imputation program MACH³² (*.mldose.gz and *.mlinfo.gz) as the inputs for GCTA. For the convenience of analysis, we provide options to extract a subset of individuals and/or SNPs and to filter SNPs based on certain criteria, such as chromosome position, minor allele frequency (MAF), and imputation R² (for the imputed data). However, we do not provide functions for a thorough quality control (QC) of the data, such as Hardy-Weinberg equilibrium test and missingness, because these functions have been well developed in many other genetic analysis packages, e.g., PLINK, GenABEL,³³ and SNPTEST.³⁴ We assume that the data have been cleaned by a standard QC process before entering into GCTA.

Estimating Total Heritability

The method implemented in GCTA is to estimate the variance explained by chromosome- or genome-wide SNPs rather than the trait heritability. Estimating the heritability (i.e., variance explained by all the causal variants), however, relies on the genetic relationship at causal variants that is predicted with error by the genetic relationship derived from the SNPs as a result of imperfect tagging. We have previously established that the prediction error is c + 1 / N, with c depending on the distribution of the MAF of causal variants. We therefore developed a method based on simple regression to correct for the prediction error by

$A_{j k}^{*} = \begin{cases} 1 + β (A_{j j} - 1), & j = k \\ β A_{j k}, & j \neq k, \end{cases}$

where β = 1 − (c + 1/N) / var(A_jk). The estimate of variance explained by all of the SNPs after such adjustment is an unbiased estimate of heritability only if the assumption about the MAF distribution of causal variants is correct.
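
The adjustment can be sketched in a few lines. The sketch takes var(A_jk) to be the sample variance of the off-diagonal GRM elements and treats c as a user-supplied constant; both are assumptions, as is the function name `adjust_grm`.

```python
import numpy as np

def adjust_grm(grm, n_snps, c):
    """Shrink a GRM toward the identity with beta = 1 - (c + 1/N) / var(A_jk)."""
    grm = np.asarray(grm, dtype=float)
    off_diag = grm[np.triu_indices_from(grm, k=1)]  # each pair j < k counted once
    beta = 1.0 - (c + 1.0 / n_snps) / off_diag.var(ddof=1)
    adjusted = beta * grm                           # off-diagonal: beta * A_jk
    diag = np.diag_indices_from(grm)
    adjusted[diag] = 1.0 + beta * (grm[diag] - 1.0) # diagonal: 1 + beta * (A_jj - 1)
    return adjusted, beta

# Toy GRM (hypothetical) whose off-diagonal elements have sample variance 0.01.
toy = np.array([[1.2, 0.1, -0.1], [0.1, 1.0, 0.0], [-0.1, 0.0, 0.9]])
adjusted, beta = adjust_grm(toy, n_snps=1000, c=0.004)
```

In this toy example (c + 1/N) / var(A_jk) = 0.005 / 0.01, so β = 0.5: the off-diagonal element 0.1 becomes 0.05 and the diagonal element 1.2 becomes 1.1.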

Efficiency of GCTA Computing Algorithm

GCTA implements the REML method based on the variance-covariance matrix V and the projection matrix P. In some mixed model analysis packages, such as ASREML,³⁵ the inversion of the n × n V matrix is avoided by using Gaussian elimination of the mixed model equations (MME) to obtain the average information (AI) matrix based on sparse matrix techniques. The SNP-derived GRM, however, is typically dense, so the sparse matrix technique will bring an extra cost of memory and CPU time. Moreover, the dimension of MME depends on the number of random effects in the model, whereas the V matrix does not. For example, when fitting the 22 chromosomes simultaneously in the model, the dimension of MME is 22n × 22n (ignoring the fixed effects), whereas the dimension of the V matrix is still n × n. We compared the computational efficiency of GCTA and ASREML. When the sample size is small, e.g., n < 3000, both GCTA and ASREML take a few minutes to run. When the sample size is large, e.g., n > 10,000, especially when fitting multiple GRMs, it takes days for ASREML to finish the analysis, whereas GCTA needs only a few hours.

System Requirements

We have released executable versions of GCTA for the three major operating systems: MS Windows, Linux/Unix, and Mac OS. We have also released the source code so that users can compile it for other specific platforms. GCTA requires a large amount of memory when calculating the GRM or performing an REML analysis with multiple genetic components. For example, it requires ∼4.8 GB memory to calculate the GRM for a data set with 3925 individuals genotyped by 294,831 SNPs, and it takes ∼4 CPU hours (AMD Opteron 2.8 GHz) to finish the computation. We therefore recommend using the 64-bit version of GCTA for large memory support.

Nonadditive Genetic Variance

The analysis approach we have adopted is a logical extension of estimation methods based on pedigrees. It allows estimation of additive genetic variation that is captured by SNP arrays and is therefore informative with respect to the genetic architecture of complex traits. The estimate of variance captured by all of the SNPs obtained in GCTA is directly comparable to the heritability estimated from pedigree analysis in family and twin studies, as well as the variance explained by GWAS hits, so that missing and hiding heritability can be quantified.⁵ Other sources of genetic variation such as dominance, gene-gene interaction, and gene-environment interaction are also important for complex trait variation but are less relevant to the “missing heritability” problem if the total heritability refers to the narrow-sense heritability, i.e., the proportion of phenotypic variance due to additive genetic variance. The current version of GCTA only provides functions to estimate and partition the variances of additive and additive-environment interaction effects. It is technically feasible to extend the analysis to include dominance and/or gene-gene interaction effects in the future. However, the power to detect the high-order genetic variation will be limited, i.e., the sampling variance of estimated variance components will be very large. Future developments will also include options to do multivariate analyses, to read genotype or imputed probability data in different formats, and to implement other applications of whole-genome or chromosome segment approaches.

In summary, we have developed a versatile tool to estimate genetic relationships from genome-wide SNPs that can subsequently be used to estimate variance explained by SNPs via a mixed model approach. We provide flexible options to specify different genetic models to partition genetic variance onto each of the chromosomes. We developed methods to estimate genetic relationships from the SNPs on the X chromosome and to test the hypotheses of dosage compensation. GCTA is not limited to the analysis of data on human complex traits, but in this report we only use examples and specifications (e.g., the number of autosomes) for humans.

Acknowledgments

We thank Bruce Weir for discussions on the sampling variance of estimators of inbreeding coefficients. We thank Allan McRae and David Duffy for discussions and Anna Vinkhuyzen for software testing. We acknowledge funding from the Australian National Health and Medical Research Council (grants 389892 and 613672) and the Australian Research Council (grants DP0770096 and DP1093900).

Web Resources

The URLs for data presented herein are as follows:

Genome-wide Complex Trait Analysis (GCTA), http://gump.qimr.edu.au/gcta
MACH 1.0: A Markov Chain-based haplotyper, http://www.sph.umich.edu/csg/yli/mach
PLINK, http://pngu.mgh.harvard.edu/~purcell/plink

References

1. Hindorff L.A., Sethupathy P., Junkins H.A., Ramos E.M., Mehta J.P., Collins F.S., Manolio T.A. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA. 2009;106:9362–9367.

2. Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753.

3. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21.

4. Eichler E.E., Flint J., Gibson G., Kong A., Leal S.M., Moore J.H., Nadeau J.H. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 2010;11:446–450.

5. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569.

6. Gibson G. Hints of hidden heritability in GWAS. Nat. Genet. 2010;42:558–560.

7. Hayes B.J., Visscher P.M., Goddard M.E. Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. 2009;91:47–60.

8. Strandén I., Garrick D.J. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci. 2009;92:2971–2975.

9. VanRaden P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008;91:4414–4423.

10. Patterson H.D., Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–554.

11. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909.

12. Falush D., Stephens M., Pritchard J.K. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587.

13. Gilmour A.R., Thompson R., Cullis B.R. Average information REML: An efficient algorithm for variance parameters estimation in linear mixed models. Biometrics. 1995;51:1440–1450.

14. Henderson C.R. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–447.

15. Kruuk L.E. Estimating genetic parameters in natural populations using the “animal model.” Philos. Trans. R. Soc. Lond. B Biol. Sci. 2004;359:873–890.

16. Goddard M.E., Wray N.R., Verbyla K., Visscher P.M. Estimating effects and making predictions from genome-wide marker data. Stat. Sci. 2009;24:517–529.

17. de Los Campos G., Gianola D., Allison D.B. Predicting genetic predisposition in humans: The promise of whole-genome markers. Nat. Rev. Genet. 2010;11:880–886.

18. Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O'Donovan M.C., Sullivan P.F., Sklar P., International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752.

19. Lango Allen H., Estrada K., Lettre G., Berndt S.I., Weedon M.N., Rivadeneira F., Willer C.J., Jackson A.U., Vedantam S., Raychaudhuri S. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–838.

20. Kent J.W., Jr., Dyer T.D., Blangero J. Estimating the additive genetic effect of the X chromosome. Genet. Epidemiol. 2005;29:377–388.

21. Lynch M., Walsh B. Genetics and Analysis of Quantitative Traits. Sinauer Associates; Sunderland, MA: 1998.

22. Falconer D.S. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann. Hum. Genet. 1965;29:51–76.

23. Dempster E.R., Lerner I.M. Heritability of threshold characters. Genetics. 1950;35:212–236.

24. Robertson A., Lerner I.M. The heritability of all-or-none traits; viability of poultry. Genetics. 1949;34:395–411.

25. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575.

26. Campbell C.D., Ogburn E.L., Lunetta K.L., Lyon H.N., Freedman M.L., Groop L.C., Altshuler D., Ardlie K.G., Hirschhorn J.N. Demonstrating stratification in a European American population. Nat. Genet. 2005;37:868–872.

27. Cardon L.R., Palmer L.J. Population stratification and spurious allelic association. Lancet. 2003;361:598–604.

28. Hudson R.R. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology. 1990;7:1–44.

29. Liang L., Zöllner S., Abecasis G.R. GENOME: A rapid coalescent-based whole genome simulator. Bioinformatics. 2007;23:1565–1567.

30. Hoggart C.J., Chadeau-Hyam M., Clark T.G., Lampariello R., Whittaker J.C., De Iorio M., Balding D.J. Sequence-level population simulations over large genomic regions. Genetics. 2007;177:1725–1731.

31. Spencer C.C., Su Z., Donnelly P., Marchini J. Designing genome-wide association studies: Sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009;5:e1000477.

32. Li Y., Abecasis G.R. Mach 1.0: Rapid Haplotype Reconstruction and Missing Genotype Inference. Am. J. Hum. Genet. 2006;S79:2290.

33. Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: An R library for genome-wide association analysis. Bioinformatics. 2007;23:1294–1296.

34. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678.

35. Gilmour A.R., Gogel B.J., Cullis B.R., Thompson R. ASReml User Guide Release 2.0. VSN International; Hemel Hempstead, UK: 2006.
