Title: Gene mapping: Linkage and association methods
1Gene mapping: Linkage and association methods
- Disease gene mapping is one of the main purposes of genotyping
- Two major approaches: linkage and association analyses
2Linkage analysis
- Try to localize genes affecting specific phenotypes
- Search for co-segregation of disease and marker alleles
3Basics of Linkage Analysis
- Idea of Linkage Analysis
- Types of Linkage Analysis
- Parametric Linkage Analysis
- Conclusions
5Linkage Analysis
- One of the two main approaches in gene mapping.
- Uses pedigree data.
6Genetic linkage and linkage analysis
- Two loci are linked if they lie close together on the same chromosome.
- The task of linkage analysis is to find markers that are linked to the hypothetical disease locus
- Complex diseases in focus → usually need to search for one gene at a time
- Requires mathematical modelling of meiosis
7Meiosis and crossover
- The number of crossover sites is thought to follow a Poisson distribution.
- Their locations are generally random and independent of each other.
8The simple idea
Recombination fraction θ
Always 0 ≤ θ ≤ 0.5
- Task: find the θ that maximises L(θ | data)
- Obtain a measure for the degree of evidence in favour of linkage (LOD score)
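The idea can be illustrated with a minimal sketch. It assumes the simplest possible case, which the slides do not spell out: n phase-known, fully informative meioses of which r are recombinant, so that L(θ) = θ^r (1 - θ)^(n - r). The function names and the grid search are illustrative choices, not part of the lecture.

```python
import math

def log10_likelihood(theta, recombinants, meioses):
    """log10 of L(theta) = theta^r * (1 - theta)^(n - r) for phase-known meioses."""
    r, n = recombinants, meioses
    return r * math.log10(theta) + (n - r) * math.log10(1.0 - theta)

def lod_score(theta, recombinants, meioses):
    """LOD(theta) = log10[L(theta) / L(0.5)]; 0.5 corresponds to no linkage."""
    return (log10_likelihood(theta, recombinants, meioses)
            - log10_likelihood(0.5, recombinants, meioses))

def maximise_lod(recombinants, meioses, step=0.01):
    """Grid search over 0 < theta <= 0.5 for the theta with the highest LOD score."""
    grid = [i * step for i in range(1, int(round(0.5 / step)) + 1)]
    best = max(grid, key=lambda t: lod_score(t, recombinants, meioses))
    return best, lod_score(best, recombinants, meioses)

if __name__ == "__main__":
    # Hypothetical data: 2 recombinants observed in 20 meioses.
    theta_hat, lod = maximise_lod(2, 20)
    print(f"theta_hat = {theta_hat:.2f}, LOD = {lod:.2f}")
```

In this simple setting the maximum lies at the observed proportion of recombinants, r/n, as long as that proportion does not exceed 0.5.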
9Markers and inheritance
- Polymorphic loci whose locations are known
- Most often SNPs or microsatellites
- Inherited within the chromosomes
10Markers and information
- Two individuals share the same allele label → they share the allele IBS (identical by state)
- Two individuals share an allele with the same (grand)parental origin → they share the allele IBD (identical by descent)
- IBS sharing can easily be deduced from genotypes.
- IBD sharing requires more information. One can try to deduce IBD sharing based on family structure and inheritance.
11Markers and information
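Parents: 1,2 and 2,3
Children: 1,2 and 1,3
The children share allele 1 IBS.
They also share it IBD (only the 1,2 parent carries allele 1, so both children inherited the same copy).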
12Markers and information
Parents: 1,2 and 1,3
Children: 1,2 and 1,3
The children share allele 1 IBS.
They do not share alleles IBD (each child's allele 1 comes from a different parent).
13Markers and information
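Parents: 1,1 and 2,3
Children: 1,2 and 1,3
The children share allele 1 IBS.
They either share or do not share it IBD (the homozygous 1,1 parent may have passed on the same copy or different copies).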
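The reasoning behind the three sib-pair examples can be sketched in code. This is an illustrative sketch under stated assumptions: a nuclear family with both parents genotyped at one marker, and a hypothetical helper `possible_ibd_counts` that enumerates every way each child could have received one allele copy from each parent, then reports how many copies the two children could share IBD.

```python
from itertools import product

def possible_ibd_counts(father, mother, child1, child2):
    """Return the set of IBD allele counts (0, 1 or 2) consistent with the genotypes.

    Genotypes are pairs of allele labels, e.g. (1, 2). The position within a
    parental pair identifies the physical allele copy.
    """
    def transmissions(child):
        # All (father copy, mother copy) index pairs that reproduce the
        # child's unordered genotype.
        return [(i, j) for i, j in product(range(2), repeat=2)
                if sorted((father[i], mother[j])) == sorted(child)]

    counts = set()
    for (i, j), (k, l) in product(transmissions(child1), transmissions(child2)):
        # A copy is shared IBD when both children received the same parental copy.
        counts.add(int(i == k) + int(j == l))
    return counts

if __name__ == "__main__":
    print(possible_ibd_counts((1, 2), (2, 3), (1, 2), (1, 3)))  # IBD certain
    print(possible_ibd_counts((1, 2), (1, 3), (1, 2), (1, 3)))  # IBD excluded
    print(possible_ibd_counts((1, 1), (2, 3), (1, 2), (1, 3)))  # IBD ambiguous
```

A set with a single element means the IBD status is fully determined by the genotypes; more than one element means the marker is not fully informative in that family.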
14Building blocks of linkage analysis
Marker maps
Pedigree structures
Genotypes
Phenotypes
15Building blocks of linkage analysis
- Information about disease model (in parametric analysis)
  - P(affected | aa): probability of a disease-allele homozygote being affected
  - P(affected | Aa): probability of a heterozygote being affected
  - P(affected | AA): probability of a non-carrier being affected (phenocopy rate)
- Assumed disease allele frequency
- Marker allele frequencies
- Information about environmental variables
17Types of linkage analysis
- Parametric vs. non-parametric
- Dichotomous vs. continuous phenotypes
- Elston-Stewart vs. Lander-Green vs. heuristic
- Two-point vs. multipoint
- Genome scan vs. candidate gene
19Maximum likelihood estimation
- A common approach in statistical estimation
- Define hypotheses
- Generate likelihood function
- Estimate
- Test hypotheses
- Draw statistical conclusions
20Hypotheses in linkage analysis
- H0: θ = 0.5
- the disease locus is not linked to the marker(s)
- HA: θ ≠ 0.5 (in practice θ < 0.5, since 0 ≤ θ ≤ 0.5)
- the disease locus is linked to the marker(s)
21Likelihood function for a single nuclear family
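- $L_j = \sum_{g_F} P(g_F)\, P(y_F \mid g_F) \sum_{g_M} P(g_M)\, P(y_M \mid g_M) \prod_i \sum_{g_{O_i}} P(g_{O_i} \mid g_F, g_M)\, P(y_{O_i} \mid g_{O_i})$
- g: genotypes, y: phenotypes (F = father, M = mother, O_i = offspring i)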
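The structure of this likelihood (sum over the father's genotype, sum over the mother's genotype, product over offspring of a sum over each offspring's genotype) can be sketched for the simplest case. The assumptions here are mine, not the lecture's: a single biallelic disease locus with no marker data, Hardy-Weinberg genotype frequencies for the parents, Mendelian transmission, and a penetrance tuple indexed by the number of disease alleles. All function names are hypothetical.

```python
from itertools import product

def hwe_prior(q):
    """Frequencies of genotypes with 0, 1, 2 disease alleles (Hardy-Weinberg)."""
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]

def offspring_prob(g_child, g_father, g_mother):
    """Mendelian P(g_child | g_father, g_mother); genotypes count disease alleles."""
    t_f, t_m = g_father / 2.0, g_mother / 2.0  # chance of passing the disease allele
    if g_child == 0:
        return (1.0 - t_f) * (1.0 - t_m)
    if g_child == 1:
        return t_f * (1.0 - t_m) + (1.0 - t_f) * t_m
    return t_f * t_m

def pheno_prob(affected, genotype, penetrance):
    """P(y | g): penetrance if affected, its complement if unaffected."""
    f = penetrance[genotype]
    return f if affected else 1.0 - f

def family_likelihood(y_father, y_mother, y_offspring, q, penetrance):
    """L_j for one nuclear family, summing over all unobserved genotypes."""
    prior = hwe_prior(q)
    total = 0.0
    for g_f, g_m in product(range(3), repeat=2):
        term = (prior[g_f] * pheno_prob(y_father, g_f, penetrance)
                * prior[g_m] * pheno_prob(y_mother, g_m, penetrance))
        for y in y_offspring:
            term *= sum(offspring_prob(g_o, g_f, g_m) * pheno_prob(y, g_o, penetrance)
                        for g_o in range(3))
        total += term
    return total

if __name__ == "__main__":
    # Hypothetical model: phenocopy rate 0.01, heterozygote 0.8, homozygote 0.9.
    model = (0.01, 0.8, 0.9)
    print(family_likelihood(True, False, [True, False], q=0.05, penetrance=model))
```

A real linkage likelihood also includes the marker genotypes and the recombination fraction; this sketch shows only how the nested sums and the product over offspring fit together.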
22Several independent families
- The likelihood functions of multiple independent families are combined
- $L = \prod_j L_j$ or $\log L = \sum_j \log L_j$
23Testing of hypotheses
- Compute values of likelihood function under null
and alternative hypotheses. - Their relationship is expressed by LOD score
(essentially derived from the likelihood ratio
test statistic.
24On significance levels
- The significance level is the probability that a null hypothesis is rejected even though it is true; the p-value is the probability, under H0, of a result at least as extreme as the one observed.
- A LOD-score threshold of 3 corresponds to a single-test p-value of approximately 0.0001
- Often, the significant areas pointed out are quite large, from 10-40 cM (tens of millions of base pairs)
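The correspondence between a LOD score of 3 and a p-value of about 0.0001 can be checked with a short sketch. It relies on the usual large-sample approximation, stated here as an assumption: 2 ln(10) × LOD behaves like a χ² variable with 1 degree of freedom, and the test is one-sided because θ can only deviate below 0.5. The function name is illustrative.

```python
import math

def lod_to_p_value(lod):
    """Approximate pointwise p-value for a LOD score (asymptotic, one-sided)."""
    chi2 = 2.0 * math.log(10.0) * lod          # likelihood ratio statistic
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    two_sided = math.erfc(math.sqrt(chi2 / 2.0))
    return 0.5 * two_sided

if __name__ == "__main__":
    for lod in (1.0, 2.0, 3.0):
        print(f"LOD {lod:.0f} -> p ~ {lod_to_p_value(lod):.2g}")
```

This is a single-test (pointwise) value; a genome scan involves many tests, so genome-wide significance is lower than this number suggests.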
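25[Figure: LOD score plotted against recombination fraction (both axes 0.0 to 0.5); marked values 0.56 on the LOD score axis and 0.14 on the recombination fraction axis]
LOD > 3 taken as evidence of linkage.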
27Conclusions
- Linkage analysis is a pedigree-based approach to gene mapping.
- Parametric vs. nonparametric methods.
- Hypothesis-driven vs. explorative analysis.
- Meta-analysis (integration of several studies into one big study) is becoming increasingly popular.
28Fine mapping and association analysis
- After successful linkage analysis, what to do?
- How to refine the linked area to find where the disease susceptibility locus actually is?
- Outline of the rest of the lecture
- Allelic association
- χ² test
- LD mapping
29Allelic association
- An example: a leukaemia study, where a number of affected persons and healthy controls have been contacted for DNA samples
- A candidate gene has been suggested: GSTM1, which functions in the metabolism of benzene
- GSTM1 has two different alleles, 1 and 2, where
- A person is positive for allele 1 if his or her genotype is 1/1 or 1/2
- A person is null if his or her genotype is 2/2
- The numbers of leukaemic and control individuals who are either positive or null with respect to allele 1 are compared by a χ² test in order to find out whether there is a statistically significant difference
30Allelic association
- Results: observed frequencies
- Expected frequencies
31Test statistic
- The observed frequencies are compared to the expected frequencies (null hypothesis H0: carrier status and disease occurrence are independent of each other).
- Test statistic:
- $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
- where
- $o_i$ is the observed frequency for class i, $e_i$ the expected frequency for class i
- k is the number of classes
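The computation of expected frequencies and the test statistic can be sketched as follows. The counts in the usage example are hypothetical and are not the leukaemia study data; the function name is also illustrative. Expected frequencies are derived from the row and column totals under the independence hypothesis.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic for an r x s table of observed counts.

    Returns (statistic, degrees of freedom, expected frequencies).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence: expected = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(table, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

if __name__ == "__main__":
    # Hypothetical counts: rows = affected / control, columns = positive / null.
    observed = [[10, 20],
                [30, 40]]
    stat, df, expected = chi_square_independence(observed)
    print(f"chi-square = {stat:.3f}, df = {df}")
    print("expected:", expected)
```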
32Allelic association
- Now, χ² = 111.39.
- Degrees of freedom for the test: df = (r-1)(s-1), where r = number of rows, s = number of columns
- Here, df = (2-1)(2-1) = 1
- The χ² value is then compared to the critical values of the χ² null distribution (for the given df)
33χ²-distribution: critical values for chosen significance levels
df \ p    0.10    0.05   0.025    0.01   0.005
 1        2.71    3.84    5.02    6.63    7.88
 2        4.61    5.99    7.38    9.21   10.60
 3        6.25    7.81    9.35   11.34   12.84
 4        7.78    9.49   11.14   13.28   14.86
 5        9.24   11.07   12.83   15.09   16.75
 6       10.64   12.59   14.45   16.81   18.55
 7       12.02   14.07   16.01   18.48   20.28
 8       13.36   15.51   17.53   20.09   21.96
 9       14.68   16.92   19.02   21.67   23.59
10       15.99   18.31   20.48   23.21   25.19
11       17.28   19.68   21.92   24.73   26.76
When the observed value of the test statistic is greater than the critical value (for the chosen significance level) given in the table, the null hypothesis can be rejected.
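The decision rule can be sketched for the df = 1 row of the table. The dictionary copies that row; the helper names are illustrative. As an addition that is not on the slide, the second function computes the p-value directly, using the fact that for 1 degree of freedom P(χ² > x) = erfc(√(x/2)).

```python
import math

# Critical values for df = 1, copied from the table above (p -> critical value).
CRITICAL_DF1 = {0.10: 2.71, 0.05: 3.84, 0.025: 5.02, 0.01: 6.63, 0.005: 7.88}

def rejected_levels(statistic, critical=CRITICAL_DF1):
    """Significance levels at which H0 is rejected for the given statistic."""
    return sorted(p for p, value in critical.items() if statistic > value)

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

if __name__ == "__main__":
    for stat in (4.0, 111.39):
        print(stat, rejected_levels(stat), p_value_df1(stat))
```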
34Allelic association
- The value we obtained, χ² = 111.39, exceeds all critical values for df = 1 given in the table. We conclude that H0 can be rejected, and thus there is a statistically significant difference between the affected and the healthy with respect to GSTM1 genotypes.
- The relative frequencies of null and positive genotypes show the same
- It seems that different GSTM1 genotypes, by changing the benzene metabolism, considerably affect the probability of getting leukaemia
35- Note: compared to linkage analysis, which is based on the observed inheritance patterns in pedigrees, association analysis studies the correlation between allele presence and a disease at the population level
- We find an allele or a haplotype overrepresented in affected individuals
- BUT the statistical correlation does not imply a causal relationship!
- Quite often, the associating allele or haplotype is not the cause of the disease itself, but is merely correlated with the presence of the actual susceptibility gene on the same chromosome. It is then said to be in linkage disequilibrium with the disease gene.
36[Figure: original mutation in one chromosome in the founder population, passed down over time to the current generation; labels A, B and C; an affected pedigree]
37LD mapping
- The marker itself is NOT the reason for the disease, but it is located near the disease susceptibility gene, and there is correlation between the presence of a certain marker allele and the disease gene allele (LD)
- The correlation, i.e. LD, is based on the founder effect: the disease allele arose a long time ago on a certain ancestral chromosome, and the majority of disease alleles existing at present descend from that original mutation
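The correlation between a marker allele and a disease allele can be quantified. The sketch below uses two standard LD measures that the slides do not name: the disequilibrium coefficient D = P(AB) - P(A)P(B) and the squared correlation r². The haplotype counts and the function name are hypothetical; A stands for the marker allele and B for the disease allele.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n                  # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n          # frequency of marker allele A
    p_B = (n_AB + n_aB) / n          # frequency of disease allele B
    d = p_AB - p_A * p_B             # zero when the loci are independent
    r_squared = d * d / (p_A * (1.0 - p_A) * p_B * (1.0 - p_B))
    return d, r_squared

if __name__ == "__main__":
    # Hypothetical haplotype counts: A and B occur together more often than by chance.
    d, r2 = ld_measures(40, 10, 10, 40)
    print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

Both alleles must be polymorphic in the sample, otherwise the denominator of r² is zero.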
38LD mapping: Utilizing the founder effect
39Data
[Figure: table of haplotype data with columns for disease status (a, c or ?), the disease locus and the markers SNP1, S2, ...; alleles coded 1 or 2, with ? for missing values]
40Many approaches, several programs
- Old-fashioned allele association with some simple test (problem: multiple testing)
- TDT
- Modelling of the LD process: Bayesian, EM algorithm, integrated linkage and LD
41Limitations: LD is a random process
- The amount of LD changes continuously but slowly, under the natural forces of
- genetic drift
- population structure
- natural selection
- new mutations
- founder effect
- ... Even if two pairs of loci are exactly the same distance apart, their amounts of LD may vary a lot.
- → This limits the accuracy of LD mapping, though it is much more accurate in pinpointing the location of a disease gene compared to linkage