Title: Introduction to Gene-Finding: Linkage and Association
1. Introduction to Gene-Finding: Linkage and Association
- Danielle Dick, Sarah Medland, (Ben Neale)
2. Aim of QTL mapping
- LOCALIZE and then IDENTIFY a locus that regulates a trait (QTL)
- Locus: a nucleotide or sequence of nucleotides with variation in the population, with different variants associated with different trait levels.
3. Location and Identification
- Linkage
  - Localize the region of the genome where a QTL that regulates the trait is likely to be harboured
  - Family-specific phenomenon: affected individuals in a family share the same ancestral predisposing DNA segment at a given QTL
4. Location and Identification
- Association
  - Identify a QTL that regulates the trait
  - Population-specific phenomenon: affected individuals in a population share the same ancestral predisposing DNA segment at a given QTL
5. Linkage
6. Progress of the Human Genome Project
[Figure: Human Chromosome 4]
7. Genetic markers (DNA polymorphisms)
ATGCTTGCCACGCE ATGCTTCTTGCCATGCE
Microsatellite markers: can be di- (2), tri- (3), or tetra- (4) nucleotide repeats
ATGCTTGCCACGCE ATGCTTGCCATGCE
Single nucleotide polymorphism (SNP)
8. DNA polymorphisms
- Can occur in gene, but be silent
- Can change gene product (protein)
  - Alter amino acid sequence (a lot or a little)
- Can regulate gene product
  - Upregulate or downregulate protein production
  - Turn off or on gene
- Can occur in noncoding region
  - This happens most often!
9. Mutations
10. How do we map genes?
- Deviation from Mendel's Law of Independent Assortment
- Aa Bb → ¼ AB, ¼ Ab, ¼ aB, ¼ ab
- We're looking for variation from this
11. Recombination
12. Recombination
- Another way of introducing genetic diversity
- Allows us to map genes!
- Crossovers are more likely to occur between genes that are further apart: over short distances, the likelihood of a recombination event is approximately proportional to the distance
- Interference: we tend not to see 2 crossovers in a small area
- Alleles that are very close together are more likely to stay together; they don't assort independently
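As a rough illustration of the deviation from independent assortment described above, the sketch below gives the expected gamete frequencies for a double heterozygote as a function of the recombination fraction. The function name, the symbol `theta`, and the assumed AB/ab phase are illustrative choices, not part of the original slides.

```python
def gamete_frequencies(theta: float) -> dict[str, float]:
    """Expected gamete frequencies for an AB/ab double heterozygote.

    theta is the recombination fraction between the two loci
    (0 = completely linked, 0.5 = independent assortment).
    The AB/ab phase is an assumption made for this illustration.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    parental = (1.0 - theta) / 2.0   # each non-recombinant gamete (AB, ab)
    recombinant = theta / 2.0        # each recombinant gamete (Ab, aB)
    return {"AB": parental, "ab": parental, "Ab": recombinant, "aB": recombinant}


# Unlinked loci reproduce Mendel's 1/4 : 1/4 : 1/4 : 1/4.
print(gamete_frequencies(0.5))
# Closely linked loci: parental gametes are in excess.
print(gamete_frequencies(0.05))
```

Linkage mapping looks for exactly this excess of parental over recombinant gametes.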
13. Linkage Mapping (is a marker linked to the disease gene?)
- Collect families with affected individuals
- Genome scan: test markers evenly spaced across the entire genome (every 10 cM, 400 markers)
- Lod score (log of the odds): what are the odds of observing the family marker data if the marker is linked to the disease (less recombination than expected) compared to if the marker is not linked to the disease?
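The lod score idea above can be sketched for the simplest textbook case: phase-known, fully informative meioses in which recombinants can be counted directly. This is only an illustration under those assumptions; real linkage programs compute likelihoods over whole pedigrees, and the function and argument names here are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point lod score for phase-known, fully informative meioses.

    Compares the likelihood of the data at recombination fraction theta
    with the likelihood under no linkage (theta = 0.5), on a log10 scale.
    """
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants

    def log10_term(count: int, prob: float) -> float:
        # count * log10(prob), treating 0 * log10(0) as 0
        if count == 0:
            return 0.0
        if prob == 0.0:
            return float("-inf")
        return count * math.log10(prob)

    log_linked = log10_term(recombinants, theta) + log10_term(non_recombinants, 1.0 - theta)
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


# Scan a grid of theta values for 1 recombinant in 10 meioses (made-up counts).
for theta in (0.01, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(theta, round(lod_score(1, 10, theta), 2))
```

The theta with the highest lod score is the estimate of the recombination fraction between marker and disease locus.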
14. Thomas Hunt Morgan: discoverer of linkage
15. Linkage: Co-segregation
[Figure: pedigree with marker genotypes A3A4, A1A2, A2A4, A1A3, A2A3, A1A2, A1A4, A3A4, A3A2]
Marker allele A1 cosegregates with dominant disease
16. Lod scores
- > 3.0: evidence for linkage
- < -2.0: can rule out linkage
- In between: inconclusive, collect more families
17. Linkage: Co-segregation
- Parametric linkage: used very successfully to map disease genes for Mendelian disorders
- Problematic for complex disorders: requires disease model, penetrance, assumes gene of major effect, phenotypic precision
[Figure: pedigree with marker genotypes, repeated from slide 15]
18. Nonparametric Linkage
- Based on allele-sharing
- More appropriate for phenotypes with multiple genes of small effect, environment; no disease model assumed
- Basic unit of data: affected relative (often sibling) pairs
19. [Figure: parental cross (x) with four outcomes, each labelled 1/4]
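20. IDENTITY BY DESCENT
[Figure: 16-cell grid of Sib 1 by Sib 2 genotypes; each cell gives the number of alleles (0, 1 or 2) the pair shares IBD]
- 4/16 = 1/4 of sib pairs share BOTH parental alleles IBD (IBD = 2)
- 8/16 = 1/2 of sib pairs share ONE parental allele IBD (IBD = 1)
- 4/16 = 1/4 of sib pairs share NO parental alleles IBD (IBD = 0)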
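The 1/4, 1/2, 1/4 proportions above can be checked by enumerating the 16 equally likely sib-pair outcomes. In this sketch the parental allele labels (A, B, C, D) are hypothetical and chosen to be fully distinguishable, so that sharing by state and by descent coincide.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_ibd_distribution() -> dict[int, Fraction]:
    """Enumerate IBD sharing for a sib pair at one locus.

    Father carries alleles A and B, mother carries C and D (hypothetical
    labels). Each sib receives one paternal and one maternal allele.
    """
    paternal = ("A", "B")
    maternal = ("C", "D")
    sib_genotypes = list(product(paternal, maternal))  # 4 equally likely genotypes

    counts: Counter = Counter()
    for sib1, sib2 in product(sib_genotypes, repeat=2):  # 16 equally likely pairs
        shared = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[shared] += 1

    total = sum(counts.values())
    return {ibd: Fraction(n, total) for ibd, n in sorted(counts.items())}


print(sib_ibd_distribution())
```

The expected proportion of alleles shared is 0 x 1/4 + 0.5 x 1/2 + 1 x 1/4 = 0.5, the genome-wide average for full siblings.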
21. Genotypic similarity between relatives
IBS: alleles shared Identical By State "look the same" and may have the same DNA sequence, but they are not necessarily derived from a known common ancestor (focus for association)
IBD: alleles shared Identical By Descent are a copy of the same ancestor allele (focus for linkage)
[Figure: pedigree with marker alleles M1, M2, M3 and QTL alleles Q1, Q2, Q3, Q4; the two sibs carry M1/M3 (Q1, Q3) and M1/M3 (Q1, Q4), so IBS = 2 but IBD = 1]
22. Genotypic similarity: basic principles
- Loci that are close together are more likely to be inherited together than loci that are further apart
- Loci are likely to be inherited in context, i.e. with their surrounding loci
- Because of this, knowing that an allele is transmitted from a common ancestor is more informative than simply observing that it is the same allele
- Critical to have parental data when possible
23. Linkage Markers
24. For disease traits (affected/unaffected): affected sib pairs selected
[Figure: numbers of affected sib pairs sharing IBD 0, IBD 1 and IBD 2 across markers, compared with the expected values]
25. For continuous measures: unselected sib pairs
26. So how does all this fit into Mx?
27. IDENTITY BY DESCENT
[Figure: 16-cell grid of Sib 1 by Sib 2 genotypes, repeated from slide 20]
- 4/16 = 1/4 of sib pairs share BOTH parental alleles IBD (IBD = 2)
- 8/16 = 1/2 of sib pairs share ONE parental allele IBD (IBD = 1)
- 4/16 = 1/4 of sib pairs share NO parental alleles IBD (IBD = 0)
28.
- In biometrical modeling, A is correlated at 1 for MZ twins and .5 for DZ twins
- .5 is the average genome-wide sharing of genes between full siblings (DZ twin relationship)
29.
- In linkage analysis we will be estimating an additional variance component, Q
- For each locus under analysis, the coefficient of sharing for this parameter will vary for each pair of siblings
- The coefficient will be the estimated proportion of alleles that the pair of siblings share identical by descent at that locus, i.e. inherited from a common ancestor
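30. [Path diagram: latent factors A, C, E and Q for each twin, each with variance 1, with paths a, c, e and q to the phenotypes PTwin1 and PTwin2; across twins, A is correlated MZ = 1.0 / DZ = 0.5 and C is correlated MZ = DZ = 1.0]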
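The path model above implies an expected covariance matrix for each twin or sib pair. The following is a minimal Python sketch of that expectation (it is not Mx script), assuming that the Q factors are correlated at the pair's IBD sharing coefficient for DZ/sib pairs and at 1 for MZ pairs; the function name and arguments are illustrative.

```python
def expected_pair_covariance(a: float, c: float, e: float, q: float,
                             pi_hat: float, mz: bool = False) -> list[list[float]]:
    """Expected 2 x 2 phenotypic covariance matrix for a twin/sib pair.

    a, c, e, q are path coefficients (variance components are their squares).
    pi_hat is the pair's estimated proportion of alleles shared IBD at the
    locus under analysis; it is ignored for MZ pairs, who share everything.
    """
    r_a = 1.0 if mz else 0.5      # correlation between the A factors
    r_q = 1.0 if mz else pi_hat   # correlation between the Q factors (assumed)
    variance = a**2 + c**2 + e**2 + q**2
    covariance = r_a * a**2 + c**2 + r_q * q**2
    return [[variance, covariance], [covariance, variance]]


# Same parameters, three DZ pairs with different IBD sharing at the locus.
for pi in (0.0, 0.5, 1.0):
    print(pi, expected_pair_covariance(0.5, 0.5, 0.5, 0.5, pi_hat=pi))
```

If Q has an effect, pairs with higher sharing at the locus are expected to covary more; that difference between pairs is what the linkage test exploits.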
31. Linkage
- How do we do this?
- 1. Genotyping data.
32. Microsatellite data
- Ideally positioned at equal genetic distances across chromosome
- Mostly di/tri nucleotide repeats
- http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp
33. Microsatellite data
- Raw data consists of allele lengths/calls (bp)
- Different primers give different lengths
  - So to compare data you MUST know which primers were used
- http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp
34. Binning
- Raw allele lengths are converted to allele numbers or lengths
- Example: D1S1646, tri-nucleotide repeat, size range 130-150
  - Logically: work with binned lengths
  - Commonly: assign allele 1 to the 130 allele, 2 to the 133 allele, and so on
  - Commercially: allele numbers are often assigned based on reference populations (CEPH). So if the first CEPH allele was 136, that would be assigned 1, and 130 and 133 would be assigned the next free allele numbers
- Conclusions: whenever possible, start from the RAW allele size and work with allele length
35. Error checking
- After binning, check for errors
  - Family relationships (GRR, Rel-pair)
  - Mendelian errors (Sib-pair)
  - Double recombinants (MENDEL, ASPEX, ALLEGRO)
- An iterative process
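As an illustration of what a Mendelian error check looks for, the sketch below tests whether a child's genotype at one marker could have been inherited from the two parental genotypes. It is a simplified stand-in, not the algorithm used by the programs named above, and it assumes complete genotypes for all three individuals.

```python
Genotype = tuple[str, str]


def mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if the child could have received one allele from each parent."""
    first, second = child
    return (first in father and second in mother) or (second in father and first in mother)


# Consistent trio: the child takes allele 2 from one parent and 3 from the other.
print(mendelian_consistent(("2", "3"), father=("1", "2"), mother=("3", "3")))
# Inconsistent: neither parent carries allele 4.
print(mendelian_consistent(("2", "4"), father=("1", "2"), mother=("3", "3")))
```

An inconsistency can reflect a genotyping error or a wrong family relationship, which is why the relationship and Mendelian checks are run iteratively.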
36. Clean data
- ped file
  - Family, individual, father, mother, sex, dummy, genotypes
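A ped file line with the columns listed above could be read along these lines. This sketch assumes whitespace-separated fields and two allele columns per marker; the exact layout expected by any particular program should be checked against its documentation, and the example line is made up.

```python
from dataclasses import dataclass


@dataclass
class PedRecord:
    family: str
    individual: str
    father: str
    mother: str
    sex: str
    dummy: str
    genotypes: list  # one (allele1, allele2) tuple per marker


def parse_ped_line(line: str) -> PedRecord:
    """Parse one ped line: six fixed columns, then two alleles per marker."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 fixed columns followed by allele pairs")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)


# Hypothetical line: family 1, individual 3, parents 1 and 2, two markers.
record = parse_ped_line("1 3 1 2 1 x 130 133 2 4")
print(record.genotypes)
```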
37. Estimating genotypic sharing
- The ped file is used with map files to obtain estimates of genotypic sharing between relatives at each of the locations under analysis
38. Estimating genotypic sharing
Merlin will give you probabilities of sharing 0, 1, 2 alleles for every pair of individuals.
39. Estimating genotypic sharing
40. Estimating genotypic sharing
Why aren't P0, P1, P2 exact for everyone?
41. Estimating genotypic sharing
Why aren't P0, P1, P2 exact for everyone?
- missing parental genotypes
- low informativeness at marker
[Figure: pedigree with marker genotypes 1/2, 2/2, 2/2, 1/2]
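42. [Path diagram repeated from slide 30: latent factors A, C, E and Q for each twin with paths a, c, e and q to PTwin1 and PTwin2; A correlated MZ = 1.0 / DZ = 0.5, C correlated MZ = DZ = 1.0]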
43. Genotypic similarity between relatives
IBD: alleles shared Identical By Descent are a copy of the same ancestor allele. Pairs of siblings may share 0, 1 or 2 alleles IBD. The estimated proportion of alleles that a pair of relatives share IBD is known as pi-hat.
[Figure: pedigree with marker alleles M1, M2, M3 and QTL alleles Q1, Q2, Q3, Q4, repeated from slide 21; the two sibs have IBS = 2 but IBD = 1]
44. Estimating genotypic sharing
45. Distribution of pi-hat
- Adult Dutch DZ pairs: distribution of pi-hat at 65 cM on chromosome 19
  - < 0.25: IBD = 0 group
  - > 0.75: IBD = 2 group
  - others: IBD = 1 group
- pi65cat (0,1,2)
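Given the sharing probabilities P0, P1 and P2 for a pair, pi-hat is conventionally computed as P1/2 + P2, and the grouping above is a simple threshold on that value. A small sketch, with hypothetical function names:

```python
def pi_hat(p0: float, p1: float, p2: float) -> float:
    """Estimated proportion of alleles shared IBD: P1/2 + P2."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * p1 + p2


def pi_category(value: float) -> int:
    """Assign a pair to the IBD = 0, 1 or 2 group using the slide's cut-offs."""
    if value < 0.25:
        return 0
    if value > 0.75:
        return 2
    return 1


# An uninformative marker leaves the prior 1/4, 1/2, 1/4, so pi-hat is 0.5.
print(pi_hat(0.25, 0.5, 0.25), pi_category(pi_hat(0.25, 0.5, 0.25)))
# A fully informative pair known to share both alleles.
print(pi_hat(0.0, 0.0, 1.0), pi_category(pi_hat(0.0, 0.0, 1.0)))
```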
46. Linkage Analyses
- Advantage
  - Systematically scan the genome
- Disadvantages
  - Not very powerful
  - Need hundreds to thousands of family members
  - Broad peaks
47. Lod scores
1 cM ~ 1 Mb; 1 Mb = 1000 kb; 1 kb = 1000 bp; so 1 cM ~ 1,000,000 bp
48. Strategy
1. Ascertain families with multiple affecteds
2. Linkage analyses to identify chromosomal regions
   → allele-sharing among affecteds within a family
3. Association analyses to identify specific genes
[Figure: chromosomal region containing Gene A, Gene B, Gene C]
49.
50. Linkage vs. Association
- Linkage analyses look for a relationship between a marker and disease within a family (could be a different marker in each family)
- Association analyses look for a relationship between a marker and disease between families (must be the same marker in all families)
51. Allelic Association: Extension of linkage to the population
[Figure: two pedigrees with marker genotypes 3/5, 2/6, 3/5, 2/6, 3/2, 3/6, 5/2, 5/6]
Both families are linked with the marker, but a different allele is involved
52. Allelic Association: Extension of linkage to the population
[Figure: two pedigrees with marker genotypes 3/6, 2/4, 4/6, 2/6, 3/2, 6/2, 6/6, 6/6]
All families are linked with the marker; allele 6 is associated with disease
53. Localization
- Linkage analysis yields broad chromosome regions harbouring many genes
  - Resolution comes from recombination events (meioses) in families assessed
  - Good in terms of needing few markers, poor in terms of finding specific variants involved
- Association analysis yields fine-scale resolution of genetic variants
  - Resolution comes from ancestral recombination events
  - Good in terms of finding specific variants, poor in terms of needing many markers
54. Allelic Association: Three Common Forms
- Direct association
  - Mutant or "susceptible" polymorphism
  - Allele of interest is itself involved in phenotype
- Indirect association
  - Allele itself is not involved, but a nearby correlated marker changes phenotype
- Spurious association
  - Apparent association not related to genetic aetiology (most common outcome)
55. Indirect and Direct Allelic Association
Direct association: measure disease relevance directly, ignoring correlated markers nearby
[Figure: disease locus D]
Semantic distinction between
- Linkage disequilibrium: correlation between (any) markers in population
- Allelic association: correlation between marker allele and trait
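The "correlation between markers" in the definition of linkage disequilibrium is usually summarised by D, D' and r². The sketch below computes all three for two biallelic loci from one haplotype frequency and two allele frequencies; the function name, argument names and example frequencies are illustrative.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> dict[str, float]:
    """Pairwise LD between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    variance_product = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if variance_product <= 0.0:
        raise ValueError("both loci must be polymorphic")

    d = p_ab - p_a * p_b                    # raw disequilibrium coefficient
    r2 = d * d / variance_product           # squared allelic correlation
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max                # |D| scaled to its maximum
    return {"D": d, "D_prime": d_prime, "r2": r2}


# Allele A is only ever found with B (D' near 1), but B is also common
# without A, so the correlation r2 is well below 1.
print(ld_measures(p_ab=0.3, p_a=0.3, p_b=0.6))
```

For indirect association it is r² that matters most in practice, since it reflects how well the marker allele predicts the trait allele.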
56. Decay of Linkage Disequilibrium
Reich et al., Nature 2001
57. Average Levels of LD along chromosomes
[Figure: LD along Chr22 in CEPH, W. Eur and Estonian samples]
Dawson et al., Nature 2002
58. Characterizing Patterns of Linkage Disequilibrium
59. Linkage Disequilibrium Maps & Allelic Association
[Figure: disease locus D among background markers 1, 2, 3, ..., n, with LD between each marker and D]
Primary aim of LD maps: use relationships amongst background markers (M1, M2, M3, ..., Mn) to learn something about D for association studies
"Something" =
- Efficient association study design by reduced genotyping
- Predict approx. location (fine-map) of disease loci
- Assess complexity of local regions
- Attempt to quantify/predict underlying (unobserved) patterns
60. Deliverables: Sets of haplotype tagging SNPs
61. Building Haplotype Maps for Gene-finding
1. Human Genome Project → Good for consensus, not good for individual differences
2. Identify genetic variants → Anonymous with respect to traits
3. Assay genetic variants → Verify polymorphisms, catalogue correlations amongst sites → Anonymous with respect to traits
62. Haplotype Tagging for Efficient Genotyping
Cardon & Abecasis, TIG 2003
- Some genetic variants within haplotype blocks give redundant information
- A subset of variants, "htSNPs", can be used to "tag" the conserved haplotypes with little loss of information (Johnson et al., Nat Genet, 2001)
- Initial detection of htSNPs should facilitate future genetic association studies
63. HapMap Strategy
- Samples
  - Four populations, small samples
- Genotyping
  - 5 kb initial density across genome (600K markers)
  - Subsequent focus on low LD regions
  - Recent NIH RFA for deeper coverage
64.
- HapMap validating millions of SNPs.
- Are they the right SNPs?
Distribution of allele frequencies in public markers is biased toward common alleles
[Figure: expected allele frequency in population vs. frequency of public markers]
Updated with phase 2: more similar to expectation
Phillips et al., Nat Genet 2003
65. Summary of Role of Linkage Disequilibrium on Association Studies
- Marker characterization is becoming extensive and genotyping throughput is high
- Tagging studies will yield panels for immediate use
  - Need to be clear about assumptions/aims of each panel
- Density of eventual HapMap will probably cover much of the genome in high LD, but not all
- Challenges
  - Just having more markers doesn't mean that success rate will improve
  - Expectations of association success via LD are too high.
66. Two types of association studies
- Case-control
- Family-based
67. Allelic Association
[Figure: marker genotypes of controls and cases]
Allele 6 is associated with disease
68. Main Blame
Primary concern with case-control analyses: population stratification
Analysis of mixed samples having different allele frequencies is a primary concern in human genetics, as it leads to false evidence for allelic association.
69. Population Stratification
- Leads to spurious association
- Requirements
  - Group differences in allele frequencies AND
  - Group differences in outcome
- In epidemiology, this is a classic matching problem, with genetics as a confounding variable
70. Population Stratification
χ² (1 df) = 14.84, p < 0.001
Spurious Association
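The two requirements for stratification can be demonstrated with made-up allele counts: two subpopulations that differ in both allele frequency and disease rate, with no association inside either one. The counts below are invented for illustration and are not the data behind the chi-square value on the slide.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("table has an empty row or column")
    return n * (a * d - b * c) ** 2 / denominator


def p_value_1df(chi_square: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# Allele counts as (case A, case a, control A, control a); hypothetical data.
population_1 = (128, 32, 32, 8)    # allele A frequency 0.8, mostly cases
population_2 = (8, 32, 32, 128)    # allele A frequency 0.2, mostly controls
pooled = tuple(x + y for x, y in zip(population_1, population_2))

for label, table in (("pop 1", population_1), ("pop 2", population_2), ("pooled", pooled)):
    stat = chi_square_2x2(*table)
    print(label, round(stat, 2), p_value_1df(stat))
```

Within each population the statistic is zero, yet the pooled sample shows a large chi-square: the association is produced entirely by the mixture.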
71. Family-based association methods
TDT: Transmission Disequilibrium Test
[Figure: trio with parental genotypes 1/2 and 3/3 and offspring genotype 2/3]
- 50/50 chance the 2 is transmitted
- Looking for overtransmission of a particular allele across affected individuals (undertransmission to unaffecteds)
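The overtransmission test above reduces to a McNemar-type statistic on transmissions from heterozygous parents: with b transmissions and c non-transmissions of the allele, chi-square = (b - c)² / (b + c) on 1 df. A minimal sketch with hypothetical counts:

```python
import math


def tdt_statistic(transmitted: int, not_transmitted: int) -> tuple[float, float]:
    """TDT chi-square (1 df) and p-value for one allele.

    Counts are taken over heterozygous parents of affected offspring:
    how often the allele was transmitted versus not transmitted.
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi_square = (transmitted - not_transmitted) ** 2 / total
    p_value = math.erfc(math.sqrt(chi_square / 2.0))  # chi-square tail, 1 df
    return chi_square, p_value


# Hypothetical example: allele transmitted 60 times, not transmitted 40 times.
print(tdt_statistic(60, 40))
```

Under the null hypothesis each heterozygous parent transmits the allele with probability 1/2, which is why only heterozygous parents contribute.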
72. TDT Advantages/Disadvantages
Advantages
- Robust to stratification
- Genotyping error detectable via Mendelian inconsistencies
- Estimates of haplotypes possible
Disadvantages
- Detection/elimination of genotyping errors causes bias (Gordon et al., 2001)
- Uses only heterozygous parents
- Inefficient for genotyping: 3 individuals yield 2 founders, so 1/3 of the information is not used
- Can be difficult/impossible to collect: late-onset disorders, psychiatric conditions, pharmacogenetic applications
73. Association studies < 2000: TDT
- TDT virtually ubiquitous over past decade
- Grant and manuscript referees and editors mandated the design
- View of case/control association studies greatly diminished due to perceived role of stratification
Association studies > 2000: return to population
- Case/controls, using extra genotyping
- Families, when available
74. Detecting and Controlling for Population Stratification with Genetic Markers
Idea
- Take advantage of availability of large N genetic markers
- Use case/control design
- Genotype genetic markers across genome (number depends on different factors)
- Look for any evidence of background population substructure and account for it
75. Two types of association studies
- Case-control
  - Adv: more powerful
  - Disadv: population stratification; limited by case/control definition
- Family-based
  - Adv: population stratification not a problem
  - Disadv: less powerful, hard to collect parents for some phenotypes
76. Association Analyses vs Linkage
- Advantage
  - More powerful
- Disadvantage
  - Not systematic (in the past)
- Now!
  - Genome wide association scans
77. Current Association Study Challenges: 1) Genome-wide screen or candidate gene
- Genome-wide screen
  - Hypothesis-free
  - High cost: large genotyping requirements
  - Multiple-testing issues
  - Possibly many false positives, fewer misses
- Candidate gene
  - Hypothesis-driven
  - Low cost: small genotyping requirements
  - Multiple-testing less important
  - Possibly many misses, fewer false positives
78. Current Association Study Challenges: 2) What constitutes a replication?
GOLD standard for association studies: replicating association results in different laboratories is often seen as the most compelling piece of evidence for a "true" finding
But... in any sample, we measure
- Multiple traits
- Multiple genes
- Multiple markers in genes
and we analyse all this using multiple statistical tests
What is a true replication?
79. What is a true replication?
Replication outcome → Explanation
- Association to same trait, but different gene → Genetic heterogeneity
- Association to same trait, same gene, different SNPs (or haplotypes) → Allelic heterogeneity
- Association to same trait, same gene, same SNP but in opposite direction (protective ↔ disease) → Allelic heterogeneity/pop differences
- Association to different, but correlated phenotype(s) → Phenotypic heterogeneity
- No association at all → Sample size too small
80. Measuring Success by Replication
- Define objective criteria for what is/is not a replication in advance
- Design initial and replication study to have enough power
- "Lumper": use most samples to obtain robust results in first place
  - Great initial detection, may be weak in replication
  - Skol et al. 2006: lumping is better for power
- "Splitter": take otherwise large sample, split into initial and replication groups
  - One good study → two bad studies.
  - Poor initial detection, poor replication
81. Current Association Study Challenges: 3) Do we have the best set of genetic markers?
- There exist 6 million putative SNPs in the public domain. Are they the right markers?
Allele frequency distribution is biased toward common alleles
[Figure: expected allele frequency in population vs. frequency of public markers]
82. Current Association Study Challenges: 3) Do we have the best set of genetic markers?
Tabor et al., Nat Rev Genet 2003
83. Greatest power comes from markers that match allele freq with trait loci
λs = 1.5, α = 5 x 10^-8, Spielman TDT (Müller-Myhsok and Abel, 1997)
84. Current Association Study Challenges: 4) Integrating the sampling, LD and genetic effects
Questions that don't stand alone:
- How much LD is needed to detect complex disease genes?
- What effect size is big enough to be detected?
- How common (rare) must a disease variant(s) be to be identifiable?
- What marker allele frequency threshold should be used to find complex disease genes?
85. Complexity of System
- In any indirect association study, we measure marker alleles that are correlated with trait variants
  - We do not measure the trait variants themselves
- But, for study design and power, we concern ourselves with frequencies and effect sizes at the trait locus.
  - This can only lead to underpowered studies and inflated expectations
- We should concern ourselves with the apparent effect size at the marker, which results from
  - 1) difference in frequency of marker and trait alleles
  - 2) LD between the marker and trait loci
  - 3) effect size of trait allele
86. Practical Implications of Allele Frequencies
- Strongest argument for using common markers is not CD-CV. It is practical:
  - For small effects, common markers are the only ones for which sufficient sample sizes can be collected
- There are situations where indirect association analysis will not work
  - Discrepant marker/disease freqs, low LD, heterogeneity, ...
  - Linkage approach may be the only genetics approach in these cases
- At present, no way to know when association will/will not work
  - Balance with linkage
87. Current Association Study Challenges: 5) How to analyse the data
- Allele-based test?
  - 2 alleles → 1 df
  - E(Y) = a + bX, X = 0/1 for presence/absence
- Genotype-based test?
  - 3 genotypes → 2 df
  - E(Y) = a + b1A + b2D, A = 0/1 additive (hom), D = 0/1 dom (het)
- Haplotype-based test?
  - For M markers, 2^M possible haplotypes → 2^M - 1 df
  - E(Y) = a + ΣbH, H coded for haplotype effects
- Multilocus test?
  - Epistasis, G x E interactions, many possibilities
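The genotype-based model above, with one 0/1 indicator for the homozygote and one for the heterozygote, is saturated for three genotype groups, so its least-squares estimates are simply differences between genotype means. The sketch below assumes that coding, a biallelic marker and at least one observation per genotype; the names and the toy data are illustrative.

```python
from statistics import mean


def code_genotype(genotype: tuple[str, str], risk_allele: str) -> tuple[int, int]:
    """Return (A, D): A = 1 for risk-allele homozygotes, D = 1 for heterozygotes."""
    copies = sum(1 for allele in genotype if allele == risk_allele)
    return (1 if copies == 2 else 0, 1 if copies == 1 else 0)


def fit_genotype_model(genotypes, phenotypes, risk_allele):
    """Estimate a, b1, b2 in E(Y) = a + b1*A + b2*D from genotype group means."""
    groups = {(0, 0): [], (1, 0): [], (0, 1): []}
    for genotype, y in zip(genotypes, phenotypes):
        groups[code_genotype(genotype, risk_allele)].append(y)
    if any(len(values) == 0 for values in groups.values()):
        raise ValueError("every genotype group needs at least one observation")
    a = mean(groups[(0, 0)])          # mean of the reference homozygote
    b1 = mean(groups[(1, 0)]) - a     # homozygote effect
    b2 = mean(groups[(0, 1)]) - a     # heterozygote effect
    return a, b1, b2


# Toy data: allele "2" raises the trait, heterozygotes fall below the midpoint.
genotypes = [("1", "1"), ("1", "2"), ("2", "2"), ("1", "1"), ("1", "2"), ("2", "2")]
phenotypes = [1.0, 2.0, 4.0, 1.0, 2.0, 4.0]
print(fit_genotype_model(genotypes, phenotypes, risk_allele="2"))
```

Testing b1 = b2 = 0 jointly gives the 2 df genotype test; constraining the heterozygote to the midpoint of the two homozygotes gives a 1 df additive test.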
88. Current Association Study Challenges: 6) Multiple Testing
- Candidate genes: a few tests (probably correlated)
- Linkage regions: 100s - 1000s of tests (some correlated)
- Whole genome association: 100,000s - 1,000,000s of tests (many correlated)
- What to do?
  - Bonferroni (conservative)
  - False discovery rate?
  - Permutations?
  - ... Area of active research
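Two of the options above can be sketched in a few lines: the Bonferroni threshold, and the Benjamini-Hochberg step-up procedure as one common way of controlling the false discovery rate. Both are shown in their simplest form, which does not exploit the correlation between tests that the slide mentions; function names and p-values are illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[int]:
    """Indices of tests declared significant at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            largest_rank = rank                          # keep the largest passing rank
    return sorted(order[:largest_rank])


# One million tests at alpha = 0.05 gives a per-test threshold of 5 x 10^-8.
print(bonferroni_threshold(0.05, 1_000_000))

p_values = [0.01, 0.02, 0.03, 0.5]  # made-up p-values
print([p for p in p_values if p <= bonferroni_threshold(0.05, len(p_values))])
print(benjamini_hochberg(p_values, q=0.05))
```

With these four p-values Bonferroni keeps only the smallest, while the FDR procedure keeps the first three, which illustrates why Bonferroni is described as conservative.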
89. Despite challenges, upcoming association studies hold some promise
- Availability of millions of genetic markers
- Genotyping costs decreasing rapidly
  - Cost per SNP: 2001 (0.25) → 2003 (0.10) → 2004 (0.01)
- Background LD patterns being characterized
  - International HapMap and other projects
90. Genome Wide Association Studies (GWAS) Underway
- Genetic Association Information Network (GAIN)
  - Psoriasis, ADHD, Schizophrenia, Bipolar Disorder, Depression, Type 1 Diabetes
- Wellcome Trust Case Control Consortium
  - Bipolar Disorder, Coronary Artery Disease, Crohn's Disease, Rheumatoid Arthritis, Type 1 Diabetes, Type 2 Diabetes
- Genes, Environment, Health Initiative (Gene/Environment Association Studies, GENEVA)
  - Addiction, Diabetes, Heart Disease, Oral Clefts, Maternal Metabolism and Birth Weight, Lung Cancer, Pre-Term Birth, Dental Caries
- Genes, Environment, Development Initiative (GEDI)