Title: The Human Genomes
1The Human Genomes
- Gil McVean, Department of Statistics, Oxford
2Genetic variation among humans
http://www.ncbi.nlm.nih.gov/genome/guide/human/
3How do we differ? Let me count the ways
- Single nucleotide polymorphisms
  - 1 every few hundred bp, mutation rate 10^-9 per generation
  - e.g. TGCATTGCGTAGGC vs TGCATTCCGTAGGC
- Short indels (insertion/deletion)
  - 1 every few kb, mutation rate very variable
  - e.g. TGCATT---TAGGC vs TGCATTCCGTAGGC
- Microsatellite (STR) repeat number
  - 1 every few kb, mutation rate 10^-3 per generation
  - e.g. TGCTCATCATCATCAGC vs TGCTCATCA------GC
- Minisatellites
  - 1 every few kb, mutation rate 10^-1 per generation
- Repeated genes
  - rRNA, histones
- Large inversions, deletions
  - Rare, e.g. Y chromosome
[Figure labels: 100bp; 1-5kb]
4Y chromosome variation
- Non-pathological rearrangements of the AZFc
region on the Y chromosome
Tyler-Smith and McVean (2003)
5Serological techniques for detecting variation
[Figure labels: Rabbit; Human; blood groups A, B, AB, O]
6Blood group systems in humans
Blood group system | Locus and chromosomal location | Number of genes | Function of product | Number of alleles | Gene alterations
ABO | ABO, 9q34 | 1 | Enzyme (glycosyltransferase) | 102 | Mutations, insertions, deletions, gene rearrangements
Chido-Rodgers | C4A, C4B, 6p21.3 | 2 | Complement factors | 7 | Mutations, duplications, gene rearrangements
Colton | AQP1, 7p14 | 1 | Channel | 7 | Mutations, insertions, deletions
Cromer | DAF, 1q32 | 1 | Complement binding protein | 10 | Mutations
Diego | SLC4A1, 17q21-q22 (erythroid, non-erythroid) | 1 | Anion exchanger, adhesion | 78 | Mutations, insertions, deletions
Dombrock | DO, 12p12.3 | 1 | Not known | 9 | Mutations, one deletion
Duffy | FY, 1q22-q23 | 1 | Receptor | 9 | Mutations, one deletion
Gerbich (Ge) | GYPC, 2q14-q21 | 1 | Cytoskeleton? | 9 | Mutations, gene rearrangements
GIL | AQP3, 9p13 | 1 | Channel | 2 | Mutation, splice site
H/h | FUT1, FUT2 (pseudogene), 19q13.3 | 2 | Enzymes (glycosyltransferases) | 27 FUT1, 22 FUT2 | Mutations, insertions, deletions, one unequal homologous recombination
I | GCNT2 (IGnT), 6p24 | 1 | Enzyme (glycosyltransferase) | 7 | Mutations, exon deletion
Indian (IN) | CD44, 11p13 | 1 | Adhesion molecule | 2 | Mutations
Kell (with Kx blood group system) | KEL, 7q33; XK, Xp21 | 2 (KEL, XK) | KEL: enzyme; XK: transporter? | 33 KEL, 30 XK | Mutations, deletions, insertion, gene deletions in XK
Kidd | SLC14A1, 18q12-q21 | 1 | Transporter | 8 | Mutations
Knops | CR1, 1q32 | 1 | Receptor | 24 (tentative, because of multiple mutations and gene rearrangements) | Mutations, deletions, duplications
Landsteiner-Wiener | ICAM4 (LW), 19p13.3 | 1 | Adhesion molecule | 3 | Mutation, one deletion
Lewis | FUT3 (FUT6, FUT7 also included: same family but do not result in a blood group phenotype), 19p13 | 1 / 2 | Enzyme (glycosyltransferase) / Enzymes (glycosyltransferases) | 14 / 20 | Mutations / Mutations, one insertion
Lutheran | LU, 19q13.2-13.3 | 1 | Adhesion molecule | 16 | Mutations
MNS | GYPA, GYPB, GYPE, 4q28-31 | 3 (GYPA, GYPB, GYPE) | Not known | 43 | Unequal homologous recombinations, gene conversions, mutations
OK | BSG, 19p13.3 | 1 | Factor adhesion | 2 | Mutations
P-related (includes P1 and globoside blood group systems) | A4GALT, 22q11.2-q13.2; B3GALT3, 3q25 | 2 | Enzymes (glycosyltransferases) | 14 A4GALT, 5 B3GALT3 | Mutations, insertions, deletions
RAPH-MER2 | CD151, 11p15.5 | 1 | Adhesion molecule | 3 | Mutations
Rh | RHCE, RHD, 1p34-36; RHAG, 6p11-21.1; RHBG, RHCG | 5 | Transporters | 116 RHCE, RHD; 13 RHAG | Gene conversions, mutations, deletions for RHCE, RHD; recombinations for RHD; mutations for RHAG
Scianna | ERMAP, 1p34.1 | 1 | Adhesion, receptor molecule? | 4 | Mutations
Xg | XG, CD99 (MIC2), Xp22-33 | 2 | Unknown, adhesion molecule | So far none documented | Polymorphism based on level of expression?
YT | ACHE, 7q22.1 | 1 | Enzyme | 4 | Mutations, one deletion
- 28 known systems
- 39 genes, 643 alleles
System Genes Alleles
ABO ABO 102
Chido-Rodgers C4A, C4B 7
Colton AQP1 7
Cromer DAF 10
Diego SLC4A1 78
Dombrock DO 9
Duffy FY 9
Gerbich GYPC 9
GIL AQP3 2
H/h FUT1, FUT2 27/22
I GCNT2 7
Indian CD44 2
Kell KEL, XK 33/30
Kidd SLC14A1 8
Knops CR1 24
Landsteiner-Wiener ICAM4 3
Lewis FUT3, FUT6 14/20
Lutheran LU 16
MNS GYPA,GYPB,GYPE 43
OK BSG 2
P-related A4GALT, B3GALT3 14/5
RAPH-MER2 CD151 3
Rh RHCE, RHD, RHAG 129
Scianna ERMAP 4
Xg XG, CD99 -
YT ACHE 4
http://www.bioc.aecom.yu.edu/bgmut/summary.htm
7HLA diversity at the MHC locus
[Figure: map of the MHC at 6p21.3 (4 Mbp, c. 127 genes). Class II: HLA-D (DP, DQ, DR); Class III: C4, C2, TNFa,b; Class I: HLA-B, HLA-C, HLA-A. Other labels: (18 genes); HLA-A]
8Protein electrophoresis
[Figure: starch or agar gel, with the direction of travel marked]
Lewontin and Hubby (1966), Harris (1966)
9The rise of DNA sequence analysis
- RFLPs
  - Cann et al (1987)
- Sequencing of small regions
  - Vigilant et al (1991)
- Whole genome sequencing
  - Ingman et al (2000)
10Different, but not that different
- Humans are one of the least diverse organisms
(excepting cheetahs)
Species Diversity (percent)
Humans 0.08 - 0.1
Chimpanzees 0.12 - 0.17
Drosophila simulans 2
E. coli 5
HIV1 30
Photos from UN photo gallery www.un.org/av/photo
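The diversity figures in this table are averages of pairwise differences between sequences. As a minimal sketch of that calculation (the function name `pairwise_diversity` and the three-sequence sample are illustrative assumptions, not part of the lecture), diversity per site can be computed from aligned sequences of equal length:

```python
from itertools import combinations

def pairwise_diversity(seqs):
    """Average number of differences per site over all pairs of aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        # Count the sites at which this pair of sequences differs.
        total_diffs += sum(1 for x, y in zip(a, b) if x != y)
        n_pairs += 1
    return total_diffs / (n_pairs * length)

# Hypothetical sample: two copies of one haplotype and one copy differing at a single site.
example = ["TGCATTGCGTAGGC", "TGCATTCCGTAGGC", "TGCATTGCGTAGGC"]
pi = pairwise_diversity(example)
print(f"diversity = {100 * pi:.2f} percent")
```

Gap characters are treated here like any other symbol; a real analysis would normally exclude or handle alignment gaps separately.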
11The biological significance of genetic variation
- Genetic variation must underlie both pathological and non-pathological traits that show significant heritability
  - How do we locate these variants, and what use is finding them?
- Genetic variation has been influenced by several million years of human existence
  - How have human populations evolved over prehistoric times?
- The distribution of variation is influenced by fundamental evolutionary processes
  - How have mutation, selection and recombination shaped the human genome?
12Differences between autosomes, sex chromosomes,
mtDNA
Genome | Average pairwise differences / kb | Relative copy number (a)
Autosomes | 0.5 - 0.85 | 1
X chromosome | 0.47 | 3/4
Y chromosome | 0.15 | 1/4
mtDNA | 2.8 | 1/4
TISMWG (2001); Jobling, Hurles and Tyler-Smith (2004)
- Under very simple models of populations, the average pairwise difference is predicted by the formula π = 4 a Ne μ, where μ is the mutation rate per site per generation and a is the relative copy number
- If μ = 1.5x10^-9 per site per generation, this implies that the human population size is < 15,000
- Population geneticists refer to this number as the effective population size (Ne)
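The arithmetic behind this estimate can be sketched as follows, assuming the standard neutral expectation π = 4 a Ne μ (an assumption here, with a the relative copy number from the table above). The function name `effective_size` is illustrative. Under that formula, a bound of under 15,000 is obtained from the autosomal diversity of 0.85 per kb with μ near 1.5x10^-8; with 1.5x10^-9 the estimate is ten times larger, so the exponent of the mutation rate is worth checking against the original slide.

```python
def effective_size(pi_per_site, mu, a=1.0):
    """Solve pi = 4 * a * Ne * mu for Ne (assumed standard neutral expectation).

    pi_per_site: average pairwise differences per site
    mu: mutation rate per site per generation
    a: copy number relative to autosomes (1, 3/4 or 1/4 in the table)
    """
    if mu <= 0 or a <= 0:
        raise ValueError("mu and a must be positive")
    return pi_per_site / (4.0 * a * mu)

pi_autosome = 0.85 / 1000  # upper autosomal value from the table, converted to per site

# Two candidate mutation rates, to show how strongly the estimate depends on mu.
for mu in (1.5e-8, 1.5e-9):
    print(f"mu = {mu:g}: Ne is about {effective_size(pi_autosome, mu):,.0f}")
```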
13Demographic factors affecting diversity
- Diversity is influenced by demographic factors such as
  - Variance in reproductive success
  - Differences in variance of success between males and females
  - Heritability of reproductive success
  - Changes in population size (growth, bottlenecks, natural fluctuations)
- Which effects are most important?
  - Iceland: faster drift in matrilines due to shorter generation interval, but no differences between the sexes (Helgason et al 2003)
  - Quebec: heritability of reproductive success reduces diversity by more than an order of magnitude (Austerlitz and Heyer 1998)
- The effective population size (Ne) is an approximation that allows simple mathematical models of populations to be applied to real data
  - Ne << N
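Two textbook approximations illustrate why Ne can be much smaller than N. They are not taken from the lecture and are offered only as a sketch: under fluctuating population size, Ne is roughly the harmonic mean of the per-generation sizes, and with unequal numbers of breeding males and females, Ne is roughly 4 Nm Nf / (Nm + Nf). The function names and example numbers are hypothetical.

```python
def ne_fluctuating(sizes):
    """Harmonic mean of per-generation population sizes."""
    if not sizes or any(n <= 0 for n in sizes):
        raise ValueError("sizes must be a non-empty list of positive numbers")
    return len(sizes) / sum(1.0 / n for n in sizes)

def ne_sex_ratio(n_males, n_females):
    """Effective size with unequal numbers of breeding males and females."""
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes must be present")
    return 4.0 * n_males * n_females / (n_males + n_females)

# Hypothetical history: nine generations at 10,000 and a one-generation bottleneck of 100.
history = [10_000] * 9 + [100]
print(f"arithmetic mean N = {sum(history) / len(history):.0f}")
print(f"harmonic mean Ne  = {ne_fluctuating(history):.0f}")

# Hypothetical breeding population of 10 males and 1,000 females.
print(f"Ne with skewed sex ratio = {ne_sex_ratio(10, 1000):.1f}")
```

A single bottleneck generation dominates the harmonic mean, which is one way a census size in the millions can coexist with an effective size in the thousands.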
14Diversity is not randomly distributed across the
genome
Chromosome 6
TISMWG (2001)
15Correlates and determinants of diversity
- There is systematic variation in the mutation rate along chromosomes
  - Wolfe and Sharp (1987), Lercher et al (2001)
- Levels of diversity correlate with recombination rates
  - Nachman et al (1998)
- Diversity and the allele frequency spectrum of SNPs are influenced by the local GC content (over and above CpG frequency)
  - Eyre-Walker (1999), Smith and Eyre-Walker (2001), Lercher et al (2002)
- Recombination rates are correlated (to some degree) with GC content
  - Eyre-Walker (1993), Fullerton et al (2001), Kong et al (2002)
Lercher and Hurst 2002
Lercher et al (2001)
16What is the link between recombination and
diversity?
- A positive correlation between recombination rate and diversity could mean
  - Recombination is mutagenic
  - Diversity promotes recombination
  - Recombination and mutation are linked by a third factor (chromatin accessibility, transcription, Hill-Robertson effects)
Mutation
Hellmann et al 2003
Hitch-hiking
17Diversity is not evenly distributed across genes I
- Adaptive evolution wipes out diversity nearby due to the hitch-hiking effects of a selective sweep
  - e.g. Duffy-null locus in sub-Saharan Africa, protects against P. vivax
  - Hamblin and Di Rienzo (2000)
[Figure: FY*O mutation; sample labels: African, Pop1, Pop2, European; key: ancestral allele, derived allele, missing data]
18Diversity is not evenly distributed across genes
II
- Purifying selection eliminates deleterious
mutations and reduces diversity in regions of
strong functional constraint
Zhao et al (2003)
19Diversity is not evenly distributed across genes
III
- Some genes are under balancing or diversifying selection, where diversity is actively selected for
  - MHC complex: heterozygote advantage and frequency-dependent selection driven by recognition of pathogens
Horton et al (1998)
20Diversity is not evenly distributed across
populations I
- African populations are more diverse than non-African populations
  - More polymorphisms
  - Polymorphisms at less skewed frequencies
- Why?
  - Out-of-Africa event associated with a bottleneck?
  - Selection on genome in adaptation to novel habitats?
Population | Segregating sites per kb (n = 30) | Diversity per kb | Tajima D statistic
Hausa (African) | 4.8 | 0.11 | -0.33
Italian | 3.2 | 0.10 | 1.18
Chinese | 3.0 | 0.07 | 1.19
Frisse et al (2001)
21The Tajima D statistic
- Measures departure from neutral coalescent expectations in allele frequency distribution
  - +ve values indicate excess of intermediate-frequency variants
  - -ve values indicate excess of low-frequency variants
- E.g. human mtDNA
[Figure: number of sites against rare allele frequency, observed and expected. Data from Ingman et al (2000)]
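As a minimal sketch of how the statistic is computed, the function below follows the standard published definition of Tajima's D (the difference between average pairwise diversity and the number of segregating sites scaled by the harmonic number, divided by its estimated standard deviation). The function name, input format and example counts are assumptions made for illustration.

```python
from math import sqrt

def tajima_d(allele_counts, n):
    """Tajima's D from the count of one allele at each segregating site.

    allele_counts: one integer k per segregating site, with 0 < k < n
    n: number of sampled chromosomes (at least 4, so the variance is positive)
    """
    s = len(allele_counts)
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    if any(k <= 0 or k >= n for k in allele_counts):
        raise ValueError("each count must satisfy 0 < k < n")
    # Average pairwise differences summed over sites.
    pi = sum(2.0 * k * (n - k) for k in allele_counts) / (n * (n - 1))
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical samples of 10 chromosomes with 5 segregating sites each.
d_rare = tajima_d([1, 1, 1, 1, 1], 10)          # all variants are singletons
d_intermediate = tajima_d([5, 5, 5, 5, 5], 10)  # all variants at 50 percent
print(f"singletons only: D = {d_rare:.2f}")
print(f"intermediate frequencies: D = {d_intermediate:.2f}")
```

The two toy samples show the sign convention: an excess of rare variants gives a negative value, an excess of intermediate-frequency variants a positive one.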
22Diversity is not evenly distributed across
populations II
- Small, isolated populations often have skewed allele frequencies (+ve Tajima D) due to founder effects and a high degree of genetic drift
  - Marginal populations (Evenki, Saami)
  - Island populations (Iceland, Sardinia)
[Figure: minor allele frequencies at 50 SNPs in Finns, Saami, Swedes and Evenki (Kaessmann et al 2002)]
23The second dimension of human diversity!
- The distributions of alleles at different loci are not independent
- Correlations between SNPs are particularly strong for those <50kb apart
- These correlations indicate shared evolutionary history
[Figure axes: Chromosomes, Sites]
- Chromosome 22: 1Mb, 57 Europeans
- Lipoprotein lipase: 10kb, 48 African Americans
- Xq13: 10kb, 69 worldwide
24Correlations between SNPs are measured by linkage
disequilibrium
Linkage equilibrium
Linkage disequilibrium
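The usual pairwise measures of linkage disequilibrium can be sketched as below: D is the difference between the observed haplotype frequency and the product of the allele frequencies, D' rescales D by its maximum possible magnitude, and r^2 is the squared correlation between the two loci. The function name and the 0/1 coding of alleles are assumptions made for this example.

```python
def ld_stats(haplotypes):
    """Return (D, D', r^2) for two biallelic loci.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2), alleles coded 0 or 1
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n   # frequency of allele 1 at locus 1
    p_b = sum(h[1] for h in haplotypes) / n   # frequency of allele 1 at locus 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # Largest magnitude D can take given the allele frequencies and its sign.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, d / d_max, d * d / denom

# Linkage equilibrium: all four haplotypes equally common.
equilibrium = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Complete disequilibrium: only two of the four haplotypes are present.
disequilibrium = [(0, 0)] * 5 + [(1, 1)] * 5

print(ld_stats(equilibrium))
print(ld_stats(disequilibrium))
```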
25Why are SNPs correlated?
The mutation arises on a particular genetic
background
If the mutation increases in frequency by drift
(or selection) the associated haplotype will also
increase in frequency
Over time the association between the new
mutation and linked mutations will decay by
recombination
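The decay described in the last step can be sketched with the standard deterministic result that, ignoring drift and selection, disequilibrium between two loci falls by a factor (1 - c) each generation, where c is the recombination fraction between them. This formula is a textbook approximation and not something stated in the lecture; the function names are illustrative.

```python
from math import log

def ld_after(d0, c, generations):
    """Expected D after a number of generations: D_t = D_0 * (1 - c)**t."""
    if not 0 <= c <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - c) ** generations

def ld_half_life(c):
    """Generations needed for D to fall to half its starting value."""
    if not 0 < c <= 0.5:
        raise ValueError("recombination fraction must lie in (0, 0.5]")
    return log(0.5) / log(1.0 - c)

# Hypothetical recombination fractions between a new mutation and a linked site.
for c in (0.5, 0.01, 0.0001):
    print(f"c = {c}: half-life of about {ld_half_life(c):.0f} generations")
```

Tightly linked sites keep their association for thousands of generations, while unlinked sites lose half of it every generation.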
26What generates and destroys LD?
- Genetic drift
  - Stochastic sampling process in finite population
- Population structure and admixture
  - Correlations between mutations arising through shared population history
- Natural selection
  - Combinations of favoured/unfavoured alleles (weak force)
- Recombination is the ONLY force which breaks down LD
- LD is a balance between recombination and other forces
27Empirical patterns of LD
- Large-scale surveys of LD in humans
  - e.g. Huttley et al. (1999), Abecasis et al. (2001), Reich et al. (2001)
- LD extends over considerable distance (>>10kb) in most populations
[Figure: D against distance (kb): 1, 5, 10, 20, 40, 80, 160, unlinked; with the Kruglyak prediction. Reich et al. (2001)]
28Differences between populations
[Figure axis: r^2]
- African populations show less LD than European populations (e.g. Frisse et al. 2001)
- Small, isolated populations (e.g. Saami, Evenki) show increased LD (Kaessmann et al 2002)
- Founder populations (e.g. Finland, Sardinia) do not always show increased LD (e.g. Eaves et al. 2000)
29Assessing the contribution of structure to LD
- Rosenberg et al. (2002)
- Population differences in allele frequency exist, but many markers/loci are required in order to estimate ethnic origin with accuracy
- Admixture between populations has played an important historical role
[Figure labels: Oceania, America, Asia, Middle East, Europe, Africa]
30Differences between genomic regions
[Figure axis: Average D. Dawson et al (2002); Reich et al (2001)]
- Evidence for heterogeneity in LD along/between chromosomes
  - Taillon-Miller et al (2000), Jeffreys et al (2001), Daly et al (2001), Patil et al (2001), Reich et al (2001), Reich et al (2002), Gabriel et al (2002), Dawson et al (2002), Phillips et al (2003)
31Differences within genomic regions
Jeffreys et al (2001)
32Recombination hot-spots in the MHC region
Jeffreys et al (2001)
- Other genes with recombination hot-spots
- β-globin
- PAR/SHOX
- MS32
- (Chi sequences)
33In an ideal block world...
Pääbo (2003)
- Blocks extend many (>100) kb.
- All alleles within blocks are in strong association.
- There are no associations between blocks.
- In each block, only a few (4-5) haplotypes account for the majority (>90%) of variation.
- In each block, only a few SNPs are required to map the majority of haplotype variation.
- Block boundaries correspond to recombination hot-spots.
Association studies suddenly look much less difficult... Goldstein (2001)
34The international Hapmap project
- International partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals
  - Gibbs et al (2003)
- Aims to survey variation across the entire human genome at 1 SNP per 5kb or less, in three populations (CEPH Europeans, Chinese/Japanese, Yoruban Africans). More than 600,000 SNPs with MAF > 5%
  - http://www.hapmap.org/
- All data are public access and available through the Data Coordination Center (DCC)
35How are blocks defined?
- Incompatibility through the four-gamete test
  - Wang et al. (2002)
- Regions with consistently high pairwise LD measures
  - Gabriel et al. (2002)
- Dynamic programming solutions based on
  - Measures of pairwise LD structure: Zhang et al. (2002)
  - Minimum description length (information theoretic principles): Koivisto et al. (2003), Anderson and Novembre (2003)
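The four-gamete test in the first definition can be sketched as follows: for two biallelic sites, observing all four combinations (00, 01, 10, 11) among the sampled haplotypes cannot be explained by single mutations without recombination, so the pair is called incompatible. This is only the pairwise test, not the block-building procedure of any cited paper; the function names and the 0/1 string encoding are assumptions.

```python
from itertools import combinations

def four_gametes_present(haplotypes, i, j):
    """True if all four two-site gametes occur at sites i and j."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4

def incompatible_pairs(haplotypes):
    """List every pair of sites that fails the four-gamete test.

    haplotypes: equal-length strings of '0' and '1', one per sampled chromosome
    """
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("haplotypes must have the same number of sites")
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if four_gametes_present(haplotypes, i, j)]

# Hypothetical sample of four haplotypes typed at three sites.
sample = ["000", "110", "111", "001"]
print(incompatible_pairs(sample))
```

In this toy sample, sites 0 and 1 always carry the same allele and so are compatible, while each is incompatible with site 2; a block definition of this kind would place a boundary between sites 1 and 2.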
36Empirical block pattern
[Figure labels: Blocks; Length; match; frequencies. Daly et al (2001)]
37Problems with blocks
- Block definitions depend on marker spacing, allele frequency and algorithm.
- Blocks (as defined by some algorithms) may not reflect variation in the recombination rate
All reported mean block lengths consistent with uniform recombination (± 1 SD)
Phillips et al (2003)
38Do we need haplotype blocks?
- The key determinant of LD is recombination
  - True haplotype blocks are formed by regions of low recombination separated by recombination hotspots
  - If we knew the fine-scale (<<Mb) structure of recombination-rate variation, blocks would not be necessary
- Genetic maps estimated from pedigree studies show recombination rate variation
  - BUT do not have the resolution to define recombination hotspots
[Figure: Chromosome 3, Kong et al (2002)]
39Learning about recombination from diversity
- We can estimate the fine-scale structure of
recombination rates from patterns of genetic
variation
[Figure labels: rate estimates from sperm (Jeffreys et al 2001); genes; n = 50 unrelated European genotypes]
40Comparison with pedigree-based maps
- Summing fine-scale estimates over 2Mb intervals
accurately recovers variation in recombination
rate detected by pedigree studies
[Figure: sex-averaged recombination rate (cM/Mb) against position (kb) for chromosome 22 and chromosome 19, comparing pedigree and population genetic estimates; markers for the pedigree-based map are indicated]
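Summing fine-scale estimates over wider intervals amounts to adding up genetic distance within each window and dividing by the window length. The sketch below assumes the fine-scale map is given as (start_bp, end_bp, rate in cM/Mb) segments; that input format, the function name and the example numbers are all hypothetical.

```python
def window_rates(segments, window_bp):
    """Average recombination rate (cM/Mb) in consecutive windows.

    segments: list of (start_bp, end_bp, rate_cM_per_Mb) fine-scale estimates
    window_bp: window length in base pairs, e.g. 2_000_000
    Regions not covered by any segment contribute zero genetic distance.
    """
    if not segments:
        return []
    region_start = min(start for start, _, _ in segments)
    region_end = max(end for _, end, _ in segments)
    results = []
    w_start = region_start
    while w_start < region_end:
        w_end = min(w_start + window_bp, region_end)
        centimorgans = 0.0
        for start, end, rate in segments:
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                centimorgans += rate * overlap / 1e6
        results.append((w_start, w_end, centimorgans / ((w_end - w_start) / 1e6)))
        w_start = w_end
    return results

# Hypothetical fine-scale map: low background rate, a narrow hotspot, then a uniform region.
fine_map = [(0, 1_900_000, 0.5), (1_900_000, 2_000_000, 20.0), (2_000_000, 4_000_000, 1.0)]
coarse = window_rates(fine_map, 2_000_000)
for w_start, w_end, rate in coarse:
    print(f"{w_start}-{w_end}: {rate:.3f} cM/Mb")
```

In the example a hotspot covering 5 percent of the first window contributes most of its genetic distance, yet the window average looks unremarkable, which is the kind of detail a coarse pedigree-based map cannot resolve.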
41A chromosomal view of recombination rate variation
- 10Mb of Chromosome 20, 96 CEPH genotypes, 4337
SNPs
Sex-averaged recombination rate (cM/Mb)
Position
42- What is the probability that there exists a SNP
in this region that is NOT in LD with currently
observed SNPs?
44The answer depends on recombination
Recombination rate
If recombination is high, the untyped SNP is
unlikely to be in association
45 If recombination is low, the untyped SNP is
likely to be in association
Recombination rate
We can use population genetic methods to estimate
the recombination rate and predict the
distribution of the untyped SNP
46Hapmap challenges
- Prediction
  - Do the SNPs currently genotyped provide an accurate representation of variation at linked SNPs in other samples from the same population?
- Selection of tagging SNPs
  - What is the smallest number of SNPs I need to type in order to achieve a given level of power?
- Demography
  - Are the results from one population transferable to other populations?
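One common way to approach the tagging question is a greedy heuristic: repeatedly pick the SNP that is in strong LD (r^2 above a threshold) with the largest number of SNPs not yet covered. This is offered only as an illustrative sketch, not as the method used by the HapMap project, and a greedy choice is not guaranteed to give the smallest possible set. The function name, the 0.8 threshold and the r^2 matrix are assumptions.

```python
def greedy_tags(r2, threshold=0.8):
    """Choose tag SNPs so that every SNP has r^2 >= threshold with some tag.

    r2: symmetric matrix (list of lists) of pairwise r^2 values
    Returns the indices of the chosen tags in the order they were picked.
    """
    n = len(r2)
    uncovered = set(range(n))
    tags = []

    def n_covered(i):
        # Number of still-uncovered SNPs that SNP i would tag (itself included).
        return sum(1 for j in uncovered if j == i or r2[i][j] >= threshold)

    while uncovered:
        # Pick the SNP covering the most uncovered SNPs; ties go to the lowest index.
        best = max(range(n), key=lambda i: (n_covered(i), -i))
        tags.append(best)
        uncovered -= {j for j in uncovered if j == best or r2[best][j] >= threshold}
    return tags

# Hypothetical r^2 matrix: SNPs 0-2 are strongly correlated, SNP 3 stands alone.
r2_matrix = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.95, 0.20],
    [0.85, 0.95, 1.00, 0.10],
    [0.10, 0.20, 0.10, 1.00],
]
tags = greedy_tags(r2_matrix)
print(tags)
```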
47Suggested reading
- Jobling MA, Hurles ME and Tyler-Smith C. 2004. Human Evolutionary Genetics: Origins, Peoples & Disease. Garland Science
- Balding DJ, Bishop M and Cannings C. 2001. Handbook of Statistical Genetics. John Wiley and Sons Ltd.
- Li W-H. 2001. Molecular evolution. Sinauer.
48References
1. E. C. Anderson and J. Novembre, Am.J.Hum.Genet. 73, 336-354 (2003).
2. F. Austerlitz and E. Heyer, Proc.Natl.Acad.Sci.U.S.A 95, 15140-15144 (1998).
3. R. L. Cann, M. Stoneking, A. C. Wilson, Nature 325, 31-36 (1987).
4. M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, E. S. Lander, Nat.Genet. 29, 229-232 (2001).
5. E. Dawson et al., Nature 418, 544-548 (2002).
6. I. A. Eaves et al., Nat.Genet. 25, 320-323 (2000).
7. A. Eyre-Walker, Proc.R.Soc.Lond B Biol.Sci. 252, 237-243 (1993).
8. A. Eyre-Walker, Genetics 152, 675-683 (1999).
9. L. Frisse et al., Am.J.Hum.Genet. 69, 831-843 (2001).
10. S. M. Fullerton, C. A. Bernardo, A. G. Clark, Mol.Biol.Evol. 18, 1139-1142 (2001).
11. S. B. Gabriel et al., Science 296, 2225-2229 (2002).
12. R. A. Gibbs et al., Nature 426, 789-796 (2003).
13. D. B. Goldstein, Nat.Genet. 29, 109-111 (2001).
14. M. T. Hamblin and A. Di Rienzo, Am.J.Hum.Genet. 66, 1669-1679 (2000).
15. A. Helgason, B. Hrafnkelsson, J. R. Gulcher, R. Ward, K. Stefansson, Am.J.Hum.Genet. 72, 1370-1388 (2003).
16. I. Hellmann, I. Ebersberger, S. E. Ptak, S. Pääbo, M. Przeworski, Am.J.Hum.Genet. 72, 1527-1535 (2003).
17. R. Horton et al., J.Mol.Biol. 282, 71-97 (1998).
18. M. Ingman, H. Kaessmann, S. Pääbo, U. Gyllensten, Nature 408, 708-713 (2000).
19. A. J. Jeffreys, L. Kauppi, R. Neumann, Nat.Genet. 29, 217-222 (2001).
20. M. A. Jobling, M. E. Hurles, C. Tyler-Smith, Human Evolutionary Genetics: Origins, Peoples & Disease (Garland Science, New York, 2004).
21. H. Kaessmann et al., Am.J.Hum.Genet. 70, 673-685 (2002).
22. M. Koivisto et al., Pac.Symp.Biocomput. 502-513 (2003).
23. A. Kong et al., Nat.Genet. 31, 241-247 (2002).
24. M. J. Lercher and L. D. Hurst, Trends Genet. 18, 337-340 (2002).
25. M. J. Lercher, N. G. Smith, A. Eyre-Walker, L. D. Hurst, Genetics 162, 1805-1810 (2002).
26. M. J. Lercher, E. J. Williams, L. D. Hurst, Mol.Biol.Evol. 18, 2032-2039 (2001).
27. M. W. Nachman, V. L. Bauer, S. L. Crowell, C. F. Aquadro, Genetics 150, 1133-1141 (1998).
28. S. Pääbo, Nature 421, 409-412 (2003).
29. N. Patil et al., Science 294, 1719-1723 (2001).
30. M. S. Phillips et al., Nat.Genet. 33, 382-387 (2003).
31. D. E. Reich et al., Nature 411, 199-204 (2001).
32. D. E. Reich et al., Nat.Genet. 32, 135-142 (2002).
33. N. A. Rosenberg et al., Science 298, 2381-2385 (2002).
34. R. Sachidanandam et al., Nature 409, 928-933 (2001).
35. N. G. Smith and A. Eyre-Walker, Mol.Biol.Evol. 18, 982-986 (2001).
36. P. Taillon-Miller et al., Nat.Genet. 25, 324-328 (2000).
37. C. Tyler-Smith and G. McVean, Nat.Genet. 35, 201-202 (2003).
38. L. Vigilant, M. Stoneking, H. Harpending, K. Hawkes, A. C. Wilson, Science 253, 1503-1507 (1991).
39. N. Wang, J. M. Akey, K. Zhang, R. Chakraborty, L. Jin, Am.J.Hum.Genet. 71, 1227-1234 (2002).
40. K. Zhang, M. Deng, T. Chen, M. S. Waterman, F. Sun, Proc.Natl.Acad.Sci.U.S.A 99, 7335-7339 (2002).
41. Z. Zhao, Y. X. Fu, D. Hewett-Emmett, E. Boerwinkle, Gene 312, 207-213 (2003).