User:Slrubenstein/molecular genetics

Molecular genetics: lineages and clusters

With the recent availability of large amounts of human genetic data from many geographically distant human groups, scientists have again started to investigate the relationships between people from various parts of the world. One method is to investigate DNA molecules that are passed down from mother to child (mtDNA) or from father to son (Y chromosomes). These form molecular lineages and can be informative about prehistoric population migrations. Alternatively, autosomal alleles are investigated in an attempt to understand how much genetic material groups of people share. Some researchers insist that classifying people into groups based on ancestry may be important from medical and social policy points of view, and claim to be able to do so accurately. Others claim that individuals from different groups share far too much of their genetic material for group membership to have any medical implications. This work has reignited the debate among geneticists, molecular anthropologists and medical doctors over the validity of human classification and of concepts such as "race".

Molecular lineages, Y chromosomes and mitochondrial DNA

Mitochondria are intracellular organelles that contain their own DNA. This mitochondrial DNA (mtDNA) is passed in a direct female line of descent, from mother to child. Human Y chromosomes are male-specific sex chromosomes: a person who carries a Y chromosome is, with rare exceptions, morphologically male, so Y chromosomes are passed from father to son. When a mutation arises in mtDNA or on a Y chromosome, it is passed down a specific maternal or paternal line, and because mutations accumulate on these molecules they can be used to identify specific molecular lineages. Many of these mutations are copying mistakes: when DNA is copied, a single base in the sequence may be copied incorrectly. Single-base differences of this kind are called single nucleotide polymorphisms (SNPs).

**Molecular lineages.** (Figure labels: Ancestral Haplogroup; Haplogroup A (Hg A); Haplogroup B (Hg B).)

All of the molecules shown are part of the ancestral haplogroup. At some point in the past a mutation, mutation A, occurred in the ancestral molecule and produced a new lineage; this is haplogroup A, and it is defined by mutation A. At some more recent point a new mutation, mutation B, occurred in a person carrying haplogroup A. Mutation B defines haplogroup B, which is a subgroup, or subclade, of haplogroup A. Both haplogroups A and B are subclades of the ancestral haplogroup.
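The nesting described in this caption can be sketched as a small tree in which each haplogroup records its parent and its defining mutation. This is only an illustrative sketch: the dictionary layout and the function names `lineage` and `assign_haplogroup` are hypothetical, and real haplogroup trees contain many branches, each usually defined by several markers.

```python
# Hypothetical toy tree mirroring the caption: each haplogroup records its
# parent and the single mutation that defines it.
HAPLOGROUPS = {
    "Ancestral": {"parent": None, "mutation": None},
    "Hg A": {"parent": "Ancestral", "mutation": "A"},
    "Hg B": {"parent": "Hg A", "mutation": "B"},
}


def lineage(haplogroup):
    """Return the path from the ancestral haplogroup down to `haplogroup`."""
    path = []
    while haplogroup is not None:
        path.append(haplogroup)
        haplogroup = HAPLOGROUPS[haplogroup]["parent"]
    return path[::-1]


def assign_haplogroup(mutations):
    """Return the most derived haplogroup whose defining mutations are all present."""
    carried = set(mutations)
    best = "Ancestral"
    for name in HAPLOGROUPS:
        # A haplogroup requires its own mutation plus those of all its ancestors.
        required = {HAPLOGROUPS[h]["mutation"] for h in lineage(name)} - {None}
        if required <= carried and len(lineage(name)) > len(lineage(best)):
            best = name
    return best


print(assign_haplogroup({"A", "B"}))  # Hg B
print(lineage("Hg B"))                # ['Ancestral', 'Hg A', 'Hg B']
```

A molecule carrying only mutation A is placed in Hg A, and one carrying neither mutation stays in the ancestral haplogroup, which is the subclade relationship the caption describes.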

Mitochondrial DNA and Y chromosome research has produced three reproducible observations relevant to race and human evolution. ^[1]

Firstly, all mtDNA and Y chromosome lineages derive from a common ancestral molecule. For mtDNA this ancestor is estimated to have lived about 140,000-290,000 years ago (Mitochondrial Eve), while for Y chromosomes the ancestor is estimated to have lived about 70,000 years ago (Y chromosome Adam). These observations are robust, and the individuals who originally carried these ancestral molecules are the direct female-line and male-line most recent common ancestors (MRCAs) of all extant anatomically modern humans. This should not be interpreted as meaning that either was the first anatomically modern human, nor that no other modern humans lived concurrently with Mitochondrial Eve or Y chromosome Adam. A more reasonable explanation is that other humans who lived at the same time did reproduce and pass their genes down to extant humans, but that their mitochondrial and Y chromosomal lineages have been lost over time, probably through random events (e.g. a woman producing only sons, or a man producing only daughters). It is impossible to know how many lineages have been lost in this way, or how much they differed from the mtDNA or Y chromosome of our maternal-line and paternal-line MRCAs. The difference in dates between Y chromosome Adam and Mitochondrial Eve is usually attributed to a higher extinction rate for Y chromosomes, probably because a few very successful men produce a great many children while a larger number of less successful men produce far fewer.

Secondly, mtDNA and Y chromosome work supports a recent African origin for anatomically modern humans, with the ancestors of all extant non-African populations leaving Africa somewhere between 100,000 and 50,000 years ago.^[1]^[2]^[3]^[4]

Thirdly, studies show that specific types (haplogroups) of mtDNA or Y chromosomes do not always cluster by geography, ethnicity or race. This implies that multiple lineages were involved in founding modern human populations, with many closely related lineages spread over large geographic areas and many populations containing distantly related lineages.^[1] Keita et al. (2004) say, with reference to Y chromosome and mtDNA studies and their relevance to concepts of "race":

Y-chromosome and mitochondrial DNA genealogies are especially interesting because they demonstrate the lack of concordance of lineages with morphology and facilitate a phylogenetic analysis. Individuals with the same morphology do not necessarily cluster with each other by lineage, and a given lineage does not include only individuals with the same trait complex (or 'racial type'). Y-chromosome DNA from Africa alone suffices to make this point. Africa contains populations whose members have a range of external phenotypes. This variation has usually been described in terms of 'race' (Caucasoids, Pygmoids, Congoids, Khoisanoids). But the Y-chromosome clade defined by the PN2 transition (PN2/M35, PN2/M2) [see haplogroup E3b and Haplogroup E3a] shatters the boundaries of phenotypically defined races and true breeding populations across a great geographical expanse. African peoples with a range of skin colors, hair forms and physiognomies have substantial percentages of males whose Y chromosomes form closely related clades with each other, but not with others who are phenotypically similar. The individuals in the morphologically or geographically defined 'races' are not characterized by 'private' distinct lineages restricted to each of them.^[5]

How much are genes shared? Clustering analyses and what they tell us

Multi-Locus Allele Clusters

In a haploid population, consider a single locus with two alleles, + and -. The alleles show a differential geographical distribution: the + allele has a frequency of 70% in Population I and 30% in Population II.

To assign an individual to one of these populations using this single locus, we assign any individual carrying + to Population I. If the two populations are equally represented, the probability (p) that a + allele comes from Population I is p = 0.7, and the probability (q) that the assignment is wrong is q = 1 − p = 0.3. This amounts to a Bernoulli trial, because the answer to the question "is this the correct population?" is a simple yes or no; equivalently, the test is binomially distributed with a single trial.

When three loci per individual are taken into account, each with p = 0.7 for a + allele in Population I, the average number of + alleles per individual in Population I becomes kp = 2.1 (number of trials, k = 3, × probability for each allele, p = 0.7), compared with 0.9 (3 × 0.3) + alleles per individual in Population II. This average is sometimes referred to as the population trait value. Because alleles are discrete entities, we can only assign an individual to a population based on the whole number of + alleles it carries. We therefore assign any individual with two or three + alleles to Population I, and any individual with one or zero + alleles to Population II.

The binomial distribution with three trials and a success probability of 0.7 shows that the probability of an individual from Population I having exactly one + allele is 0.189, and for zero + alleles it is 0.027. The misclassification rate is therefore 0.189 + 0.027 = 0.216, which is smaller than the 0.3 obtained with a single locus. Each additional locus adds an extra test to the binomial distribution (assuming the loci are independent), so the chance of misclassification becomes much smaller as more loci are used.
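The arithmetic above can be reproduced, and extended to more loci, with a few lines of code. This is a minimal sketch under the infobox's own assumptions: independent loci, the same + frequency (0.7) at every locus, and assignment to Population I whenever more than half of the loci carry +. The function names are hypothetical, and an odd number of loci is required so that ties cannot occur.

```python
from math import comb


def binomial_pmf(successes, trials, p):
    """Probability of exactly `successes` + alleles among `trials` independent loci."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


def misclassification_rate(loci, p=0.7):
    """Probability that a Population I individual carries + at fewer than half its loci."""
    if loci % 2 == 0:
        raise ValueError("use an odd number of loci so that no ties occur")
    return sum(binomial_pmf(x, loci, p) for x in range(0, loci // 2 + 1))


for loci in (1, 3, 11, 101):
    print(loci, misclassification_rate(loci))
```

With one locus the function returns 0.3, and with three loci it returns 0.189 + 0.027 = 0.216, matching the figures in the text. The rate keeps falling as loci are added.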

Using modern computer software and the abundance of genetic data now available, it is possible not only to detect such correlations across hundreds or even thousands of alleles, which form clusters, but also to assign individuals to given populations with very little chance of error.^[citation needed] It should be noted, however, that gene frequencies tend to vary clinally, and there are likely to be intermediate populations living in the geographical areas between the sampled populations (Population III, for example, may lie equidistant from Population I and Population II). Such a population may display characteristics of both Population I and Population II and have intermediate frequencies for many of the alleles used for classification, making it more prone to misclassification.
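The effect of an intermediate population can be illustrated with the same three-locus rule. In this sketch the intermediate frequencies (0.6, 0.5, 0.4) and the population labels between I and II are invented for illustration and are not taken from any study; the function name is also hypothetical.

```python
from math import comb


def prob_assigned_to_pop1(loci, plus_freq):
    """Probability that more than half of the loci carry +, i.e. assignment to Population I."""
    return sum(
        comb(loci, x) * plus_freq**x * (1 - plus_freq) ** (loci - x)
        for x in range(loci // 2 + 1, loci + 1)
    )


# Hypothetical cline: the + allele frequency falls steadily from Population I to II.
cline = {
    "Population I": 0.7,
    "between I and III": 0.6,
    "Population III": 0.5,
    "between III and II": 0.4,
    "Population II": 0.3,
}

for name, freq in cline.items():
    share = prob_assigned_to_pop1(3, freq)
    print(f"{name:20s} assigned to Population I with probability {share:.3f}")
```

Under these assumptions, members of the equidistant Population III are split evenly between the two clusters, so neither label describes them well.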

Human genetic variation is not distributed uniformly throughout the global population. The global range of human habitation means that there are great distances between some human populations (e.g. between South America and Southern Africa), and this reduces gene flow between them. Environmental selection is also likely to play a role in differences between human populations, although it is now believed that the majority of genetic differences between populations are selectively neutral. The existence of differences between peoples from different regions of the world is relevant to discussions about the concept of "race", and some biologists believe that the language of "race" is relevant in describing human genetic variation. It is now possible to make a reasonable estimate of the continents of origin of an individual's ancestors from genetic data.^[6]

Richard Lewontin has claimed that "race" is a meaningless classification because the majority of human variation (~85%) is found within groups, so that two individuals from different "races" are almost as likely to be similar to each other as two individuals from the same "race". In 2003 A. W. F. Edwards challenged this argument, claiming that Lewontin's conclusion ignores the fact that most of the information that distinguishes populations is hidden in the correlation structure of the data and not simply in the variation of the individual factors (see Infobox: Multi-Locus Allele Clusters). Edwards concludes that "It is not true that 'racial classification is ... of virtually no genetic or taxonomic significance' or that 'you can't predict someone's race by their genes'."^[7] Researchers such as Neil Risch and Noah Rosenberg have argued that a person's biological and cultural background may have important implications for medical treatment decisions, both for genetic and non-genetic reasons.^[8]^[9]^[10]

The results obtained by clustering analyses depend on several factors:

The sampling strategy. The clusters produced are relative rather than absolute: each cluster is the product of comparisons between the sets of data gathered for the study, so results are highly influenced by sampling strategy. (Edwards, 2003)
The geographic distribution of the populations sampled. Because human genetic diversity is marked by isolation by distance, populations from geographically distant regions will form much more discrete clusters than those from geographically close regions. (Kittles and Weiss, 2003)
The number of genes used. The more genes used in a study, the greater the resolution and therefore the greater the number of clusters that will be identified. (Tang, 2005)

Rosenberg et al.'s (2002) paper "Genetic Structure of Human Populations" in particular was taken up by Nicholas Wade in the New York Times as evidence that genetic studies supported the "popular conception" of race.^[11] Rosenberg's work, however, used samples from the Human Genome Diversity Project (HGDP), a project that has collected samples from individuals from 52 ethnic groups at various locations around the world. The HGDP has itself been criticised for collecting samples on an "ethnic group" basis, on the grounds that ethnic groups represent constructed categories rather than categories that are solely natural or biological. Scientists such as the molecular anthropologist Jonathan Marks, the geneticists David Serre, Svante Pääbo and Mary-Claire King, and the medical doctor Arno G. Motulsky argue that this is a biased sampling strategy and that human samples should have been collected geographically, i.e. from points on a grid overlaying a map of the world. They maintain that human genetic variation is not partitioned into discrete racial groups (clustered) but is spread in a clinal manner (isolation by distance), a pattern that is masked by this biased sampling strategy.^[12]^[13]^[14] The existence of allelic clines, and the observation that the bulk of human variation is continuously distributed, has led scientists such as Kittles and Weiss (2003) to conclude that any categorization schema attempting to partition that variation meaningfully will necessarily create artificial truncations.^[15] It is for this reason, Reanne Frank argues, that attempts to allocate individuals into ancestry groupings based on genetic information have yielded varying results that are highly dependent on methodological design.^[16]

In a follow-up paper in 2005, "Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure", Rosenberg et al. maintain that their clustering analysis is robust, but they also agree that there is evidence for clinality (isolation by distance). In addition, they distance themselves from the language of race, and do not use the term "race" in any of their publications: "The arguments about the existence or nonexistence of 'biological races' in the absence of a specific context are largely orthogonal to the question of scientific utility, and they should not obscure the fact that, ultimately, the primary goals for studies of genetic variation in humans are to make inferences about human evolutionary history, human biology, and the genetic causes of disease."^[17]

One of the underlying questions regarding the distribution of human genetic diversity is the degree to which genes are shared between the observed clusters, and therefore the extent to which membership of a cluster can accurately predict an individual's genetic makeup or susceptibility to disease. This is at the core of Lewontin's argument. Lewontin used Sewall Wright's fixation index (F_ST) to estimate that on average 85% of human genetic diversity is contained within groups. Are members of the same cluster always more genetically similar to each other than they are to members of a different cluster? Lewontin's argument is that within-group differences are almost as large as between-group differences, and therefore that two individuals from different groups are almost as likely to be similar to each other as two individuals from the same group. Can clusters correct for this finding? In 2004 Bamshad et al. used the data from Rosenberg et al. (2002) to investigate the extent of genetic differences between individuals within continental groups relative to genetic differences between individuals from different continental groups. They found that although these individuals could be classified very accurately into continental clusters, there was a significant degree of genetic overlap at the individual level.^[18]

Percentage similarity between two individuals from different clusters when 377 microsatellite markers are considered.^[18]
Population	Africans	Europeans	Asians
Europeans	36.5	—	—
Asians	35.5	38.3	—
Indigenous Americans	26.1	33.4	35
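The within-group share of diversity that the fixation index F_ST expresses can be made concrete with a worked example. The sketch below uses the simplest textbook form, F_ST = (H_T − H_S) / H_T, for one biallelic locus and two equally weighted populations, and reuses the 70% and 30% frequencies from the infobox as toy inputs. It is not the calculation Lewontin performed on real data, and the function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity (gene diversity) at a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst_two_populations(p1, p2):
    """F_ST = (H_T - H_S) / H_T for two equally weighted populations.

    H_S is the mean within-population diversity and H_T the diversity of the
    pooled population. The result is undefined when the pooled frequency is
    0 or 1, because H_T is then zero.
    """
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2
    h_t = expected_heterozygosity((p1 + p2) / 2)
    return (h_t - h_s) / h_t


fst = fst_two_populations(0.7, 0.3)
print(round(fst, 2))      # 0.16 of the diversity lies between the populations
print(round(1 - fst, 2))  # 0.84 lies within them
```

Even a locus that differs as sharply as 70% against 30% leaves most of the diversity within each population under this definition, which is why a high within-group share and accurate multilocus classification are not contradictory.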

The question of how much individuals from different clusters overlap was addressed in more detail in a 2007 paper by Witherspoon et al. entitled "Genetic Similarities Within and Between Human Populations",^[19] in which they make the following observations:

Genetic differences between human continental populations account for only a small fraction of the differences between people.
Multilocus clusters provide accurate and reproducible results for dividing people into the correct populations.
Two individuals from different populations are often more genetically alike to each other than they are to individuals from their own population.

The paper states that "All three of the claims listed above appear in disputes over the significance of human population variation and 'race'" and asks "If multilocus statistics are so powerful, then how are we to understand this [last] finding?"

Witherspoon et al. (2007) attempt to reconcile these apparently contradictory findings, and show that the observed clustering of human populations into relatively discrete groups is a product of using what they call "population trait values". This means that each individual is compared to the "typical" trait for several populations, and assigned to a population based on the individual's overall similarity to one of the populations as a whole. They therefore claim that clustering analyses cannot necessarily be used to make inferences about the similarity or dissimilarity of individuals between or within clusters, but only about the similarity or dissimilarity of individuals to the "trait values" of any given cluster. The paper measures the rate of misclassification using these "trait values" and calls this the "population trait value misclassification rate" (C_T). It investigates the similarities between individuals using what the authors term the "dissimilarity fraction" (ω): "the probability that a pair of individuals randomly chosen from different populations is genetically more similar than an independent pair chosen from any single population." Witherspoon et al. show that two individuals can be more genetically similar to each other than to the typical genetic type of their own respective populations, and yet be correctly assigned to their respective populations.

An important observation is that the likelihood that two individuals from different populations will be more similar to each other genetically than two individuals from the same population depends on several factors, most importantly the number of genes studied and the distinctiveness of the populations under investigation. For example, when 10 loci are used to compare three geographically disparate populations (sub-Saharan African, East Asian and European), individuals are more similar to members of a different group about 30% of the time. If the number of loci is increased to 100, individuals are more genetically similar to members of a different population ~20% of the time, and even with 1000 loci ω is ~10%. They do state that for these very geographically separated populations it is possible to reduce this statistic to 0% when tens of thousands of loci are used, meaning that individuals will then always be more similar to members of their own population. But the paper notes that humans are not distributed into geographically separated populations, and that omitting intermediate regions may produce a false distinctiveness for human diversity. The paper supports the observation that "highly accurate classification of individuals from continuously sampled (and therefore closely related) populations may be impossible". Furthermore, the results indicate that clustering analyses and self-reported ethnicity may not be good estimates of genetic susceptibility to disease risk. Witherspoon et al. conclude that:

[The fact that,] given enough genetic data, individuals can be correctly assigned to their populations of origin is compatible with the observation that most human genetic variation is found within populations, not between them. It is also compatible with our finding that, even when the most distinct populations are considered and hundreds of loci are used, individuals are frequently more similar to members of other populations than to members of their own population.
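The dependence of the dissimilarity fraction ω on the number of loci can be explored with a small simulation. This is a rough sketch and not the method of Witherspoon et al.: it assumes haploid individuals, independent loci, the same toy + frequencies (0.7 and 0.3) at every locus, a simple count of mismatching loci as the genetic distance, and ties counted as half. All function names are hypothetical. Because these toy populations differ far more sharply than real human populations, ω falls much faster here than the figures quoted from the paper.

```python
import random


def sample_individual(freq, loci, rng):
    """Haploid genotype: 1 marks a + allele, 0 a - allele."""
    return [1 if rng.random() < freq else 0 for _ in range(loci)]


def distance(a, b):
    """Number of loci at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def dissimilarity_fraction(loci, p1=0.7, p2=0.3, trials=2000, seed=0):
    """Estimate how often a between-population pair is more similar than a within-population pair."""
    rng = random.Random(seed)
    score = 0.0
    for _ in range(trials):
        # One individual from each population.
        between = distance(sample_individual(p1, loci, rng),
                           sample_individual(p2, loci, rng))
        # An independent pair drawn from a single, randomly chosen population.
        freq = p1 if rng.random() < 0.5 else p2
        within = distance(sample_individual(freq, loci, rng),
                          sample_individual(freq, loci, rng))
        if between < within:
            score += 1.0
        elif between == within:
            score += 0.5  # assumption: ties count as half
    return score / trials


for loci in (5, 50, 500):
    print(loci, dissimilarity_fraction(loci))
```

The estimate shrinks as loci are added, which is the qualitative pattern the paper reports for well-separated populations. Moving `p1` and `p2` closer together, as for neighbouring populations along a cline, keeps ω high even with many loci.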

[1] Rebecca L. Cann, Mark Stoneking, Allan C. Wilson (1987) "Mitochondrial DNA and human evolution" Nature 325: 31-36.
[2] S Horai, K Hayasaka, R Kondo, K Tsugane, and N Takahata (1995) "Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs" Proceedings of the National Academy of Sciences 92: 532-536.
[3] Mark Seielstad, Endashaw Bekele, Muntaser Ibrahim, Amadou Touré, and Mamadou Traoré (1999) "A View of Modern Human Origins from Y Chromosome Microsatellite Variation" Genome Research 9: 558-567.
[4] Gibbons, A. (2001) "Modern Men Trace Ancestry to African Migrants" Science 292: 1051-1052. doi:10.1126/science.292.5519.1051b
[5] S O Y Keita, R A Kittles, C D M Royal, G E Bonney, P Furbert-Harris, G M Dunston & C N Rotimi (2004) "Conceptualizing human variation" Nature Genetics 36: S17-S20.
[6] Lynn B Jorde & Stephen P Wooding (2004) "Genetic variation, classification and 'race'" Nature Genetics 36: S28-S33.
[7] Edwards AW., Gonville and Caius College, Cambridge (2003) "Human genetic diversity: Lewontin's fallacy" BioEssays 25(8): 798-801.
[8] Hua Tang, Tom Quertermous, Beatriz Rodriguez, Sharon L. R. Kardia, Xiaofeng Zhu, Andrew Brown, James S. Pankow, Michael A. Province, Steven C. Hunt, Eric Boerwinkle, Nicholas J. Schork, and Neil J. Risch (2005) "Genetic Structure, Self-Identified Race/Ethnicity, and Confounding in Case-Control Association Studies" Am J Hum Genet 76(2): 268-275.
[9] Neil Risch, Esteban Burchard, Elad Ziv and Hua Tang (2002) "Categorization of humans in biomedical research: genes, race and disease" Genome Biology 3:comment
[10] Noah A. Rosenberg, Jonathan K. Pritchard, James L. Weber, Howard M. Cann, Kenneth K. Kidd, Lev A. Zhivotovsky, Marcus W. Feldman (2002) "Genetic Structure of Human Populations" Science 298: 2381-5.
[11] Wade, N. (2002) "Gene Study Identifies 5 Main Human Populations, Linking Them to Geography" New York Times, 20 December.
[12] Marks, J. (2002) What it means to be 98% chimpanzee (paperback ed.), pp. 202-203. Berkeley: University of California Press.
[13] Serre, D. and Pääbo, S. (2004) "Evidence for Gradients of Human Genetic Diversity Within and Among Continents" Genome Research 14: 1679-1685.
[14] Mary-Claire King and Arno G. Motulsky (2002) "Mapping Human History" Science 298: 2342-2343. doi:10.1126/science.1080373
[15] Kittles, R. A. and Weiss, K. M. (2003) "Race, Ancestry, and Genes: Implications for Defining Disease Risk" Annual Review of Genomics and Human Genetics 4: 33-67. doi:10.1146/annurev.genom.4.070802.110356
[16] Reanne Frank, "Back with a Vengeance: the Reemergence of a Biological Conceptualization of Race in Research on Race/Ethnic Disparities in Health".
[17] Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, et al. (2005) "Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure" PLoS Genet 1(6): e70. doi:10.1371/journal.pgen.0010070
[18] Bamshad, Wooding, Salisbury and Stephens (2004) "Deconstructing the relationship between genetics and race" Nature Reviews Genetics 5(8): 598-609. doi:10.1038/nrg1401
[19] D. J. Witherspoon, S. Wooding, A. R. Rogers, E. E. Marchani, W. S. Watkins, M. A. Batzer, and L. B. Jorde (2007) "Genetic Similarities Within and Between Human Populations" Genetics 176(1): 351-359. doi:10.1534/genetics.106.067355
