Introduction to Gene-Finding: Linkage and Association


1
Introduction to Gene-Finding: Linkage and
Association

Danielle Dick, Sarah Medland, (Ben Neale)

2
Aim of QTL mapping

LOCALIZE and then IDENTIFY a locus that
regulates a trait (QTL)
Locus: nucleotide or sequence of nucleotides with
variation in the population, with different
variants associated with different trait levels.

3
Location and Identification

Linkage
Localize the region of the genome where a QTL that
regulates the trait is likely to be harboured
Family-specific phenomenon: affected individuals
in a family share the same ancestral predisposing
DNA segment at a given QTL

4
Location and Identification

Association
Identify a QTL that regulates the trait
Population-specific phenomenon: affected
individuals in a population share the same
ancestral predisposing DNA segment at a given QTL

5
Linkage

Overview

6
Progress of the Human Genome Project
Human Chromosome 4
7
Genetic markers (DNA polymorphisms)
Microsatellite markers: can be di- (2), tri- (3), or
tetra- (4) nucleotide repeats
ATGCTTGCCACGC... ATGCTTCTTGCCATGC...
Single nucleotide polymorphism (SNP)
ATGCTTGCCACGC... ATGCTTGCCATGC...
8
DNA polymorphisms

Can occur in gene, but be silent
Can change gene product (protein)
Alter amino acid sequence (a lot or a little)
Can regulate gene product
Upregulate or downregulate protein production
Turn off or on gene
Can occur in noncoding region
This happens most often!

9
Mutations
10
How do we map genes?

Deviation from Mendel's Law of Independent
Assortment
Aa Bb → ¼ AB, ¼ Ab, ¼ aB, ¼ ab
We're looking for variation from this

11
Recombination
12
Recombination

Another way of introducing genetic diversity
Allows us to map genes!
Crossovers are more likely to occur between genes
that are further apart: the likelihood of a
recombination event is proportional to the
distance
Interference: we tend not to see 2 crossovers in a
small area
Alleles that are very close together are more
likely to stay together; they don't assort
independently
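
The relationship between distance and recombination described above can be sketched in a few lines. This is an illustration only: it uses Haldane's map function, which assumes NO interference (so it ignores the interference point on this slide), and the function names are hypothetical rather than anything from the presentation.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM.

    Uses Haldane's map function, which assumes no crossover interference.
    """
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def gamete_frequencies(theta: float) -> dict:
    """Expected gamete frequencies for a double heterozygote in phase AB/ab."""
    return {
        "AB": (1.0 - theta) / 2.0,  # non-recombinant
        "ab": (1.0 - theta) / 2.0,  # non-recombinant
        "Ab": theta / 2.0,          # recombinant
        "aB": theta / 2.0,          # recombinant
    }


if __name__ == "__main__":
    for cm in (1, 10, 50, 200):
        theta = haldane_theta(cm)
        print(cm, round(theta, 4), gamete_frequencies(theta))
```

For small distances theta is close to distance/100, and as the distance grows theta approaches 0.5, which recovers the ¼ : ¼ : ¼ : ¼ ratio of independent assortment.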

13
Linkage Mapping (is a marker linked to the
disease gene?)

Collect families with affected individuals
Genome scan: test markers evenly spaced across
the entire genome (every 10 cM, about 400 markers)
Lod score (log of the odds): what are the odds
of observing the family marker data if the marker
is linked to the disease (less recombination than
expected) compared to if the marker is not linked
to the disease?
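
As a rough sketch of the lod score idea, the simplest textbook case is a two-point lod score for phase-known, fully informative meioses, where each meiosis can be scored as recombinant or non-recombinant. This is a simplification of what linkage programs actually compute, and the function name and example counts are made up for illustration.

```python
import math


def lod_score(recombinants: int, nonrecombinants: int, theta: float) -> float:
    """Two-point lod score: log10 of L(theta) / L(theta = 0.5).

    Assumes phase-known, fully informative meioses (a simplification).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # any recombinant rules out theta = 0
    log_linked = 0.0
    if recombinants:
        log_linked += recombinants * math.log10(theta)
    if nonrecombinants:
        log_linked += nonrecombinants * math.log10(1.0 - theta)
    n = recombinants + nonrecombinants
    log_unlinked = n * math.log10(0.5)  # free recombination
    return log_linked - log_unlinked


if __name__ == "__main__":
    # Hypothetical data: 1 recombinant out of 10 meioses
    for theta in (0.01, 0.05, 0.1, 0.2, 0.3, 0.5):
        print(theta, round(lod_score(1, 9, theta), 3))
```

Ten non-recombinant meioses with no recombinants give a lod of 10 x log10(2), about 3.01, at theta = 0.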

14
Thomas Hunt Morgan: discoverer of linkage
15
Linkage: Co-segregation

[Pedigree figure, genotypes shown: A3A4, A1A2, A2A4, A1A3, A2A3, A1A2, A1A4, A3A4, A3A2]
Marker allele A1 cosegregates with dominant
disease
16
Lod scores

> 3.0: evidence for linkage
< -2.0: can rule out linkage
In between: inconclusive, collect more families

17
Linkage: Co-segregation

Parametric linkage: used very successfully to map
disease genes for Mendelian disorders
Problematic for complex disorders: requires a
disease model and penetrance, and assumes a gene of
major effect and phenotypic precision

[Pedigree figure, genotypes shown: A3A4, A1A2, A2A4, A1A3, A2A3, A1A2, A1A4, A3A4, A3A2]
18
Nonparametric Linkage

Based on allele-sharing
More appropriate for phenotypes with multiple
genes of small effect and environmental influences;
no disease model assumed
Basic unit of data: affected relative (often
sibling) pairs

19
[Figure: parental mating (x) with four possible offspring genotypes, each with probability 1/4]
20
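IDENTITY BY DESCENT
[Table: 4 x 4 grid of Sib 1 by Sib 2 genotypes, giving the number of alleles shared IBD (0, 1 or 2) in each of the 16 equally likely cells]
4/16 = 1/4 of sib pairs share BOTH parental alleles IBD (IBD = 2)
8/16 = 1/2 of sib pairs share ONE parental allele IBD (IBD = 1)
4/16 = 1/4 of sib pairs share NO parental alleles IBD (IBD = 0)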
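
The 1/4, 1/2, 1/4 proportions can be checked by brute-force enumeration. The sketch below assumes a fully informative mating (father A/B, mother C/D, four distinct alleles), so that sharing the same allele label implies sharing it by descent; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_pair_ibd_distribution() -> dict:
    """Enumerate IBD sharing for two sibs from an A/B x C/D mating."""
    father = ("A", "B")
    mother = ("C", "D")
    # Each child gets one paternal and one maternal allele: 4 genotypes
    children = list(product(father, mother))
    counts = Counter()
    # 4 x 4 = 16 equally likely sib-pair combinations
    for sib1, sib2 in product(children, repeat=2):
        shared = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[shared] += 1
    total = sum(counts.values())
    return {ibd: Fraction(n, total) for ibd, n in counts.items()}


if __name__ == "__main__":
    for ibd, prob in sorted(sib_pair_ibd_distribution().items()):
        print(f"IBD = {ibd}: {prob}")
```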
21
Genotypic similarity between relatives
IBS: alleles shared Identical By State look the
same and may have the same DNA sequence, but they
are not necessarily derived from a known common
ancestor (focus for association)
IBD: alleles shared Identical By Descent are
a copy of the same ancestor allele (focus for
linkage)
[Pedigree figure: parental marker alleles M1, M2, M3, M3 with linked QTL alleles Q1, Q2, Q3, Q4; both sibs are M1/M3, carrying Q1/Q3 and Q1/Q4. The sibs share 2 alleles IBS but only 1 allele IBD.]
22
Genotypic similarity: basic principles

Loci that are close together are more likely to
be inherited together than loci that are further
apart
Loci are likely to be inherited in context, i.e.
with their surrounding loci
Because of this, knowing that an allele at a locus
was transmitted from a common ancestor is more
informative than simply observing that it is the
same allele
Critical to have parental data when possible

23
Linkage Markers
24
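For disease traits (affected/unaffected): affected
sib pairs selected
[Figure: number of sib pairs (axis 250 to 1000) in the IBD 0, IBD 1 and IBD 2 groups at each marker (1, 2, 3, ..., 127, ..., 310), compared with the expected counts]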
25
For continuous measures: unselected sib pairs
26
So how does all this fit into Mx?
27
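IDENTITY BY DESCENT
[Table: 4 x 4 grid of Sib 1 by Sib 2 genotypes, giving the number of alleles shared IBD (0, 1 or 2) in each of the 16 equally likely cells]
4/16 = 1/4 of sib pairs share BOTH parental alleles IBD (IBD = 2)
8/16 = 1/2 of sib pairs share ONE parental allele IBD (IBD = 1)
4/16 = 1/4 of sib pairs share NO parental alleles IBD (IBD = 0)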
28

In biometrical modeling A is correlated at 1 for
MZ twins and .5 for DZ twins
.5 is the average genome-wide sharing of genes
between full siblings (DZ twin relationship)

In linkage analysis we will be estimating an
additional variance component Q
For each locus under analysis the coefficient of
sharing for this parameter will vary for each
pair of siblings
The coefficient will be the probability that the
pair of siblings have both inherited the same
alleles from a common ancestor
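
A minimal sketch of how the extra component Q enters the expected sib-pair covariance. It assumes the usual variance-components parameterisation, in which the pair-specific sharing coefficient (called pi-hat later in these slides) weights the QTL variance; the slides imply this but do not write it out. The path-coefficient names a, c, e, q follow the path diagram, and the numeric values are arbitrary.

```python
def expected_sib_moments(a: float, c: float, e: float, q: float,
                         pi_hat: float) -> tuple:
    """Expected variance and sib-pair covariance under an A, C, E + Q model.

    a, c, e, q are path coefficients; pi_hat is the pair's estimated
    proportion of alleles shared IBD at the locus being tested.
    """
    variance = a**2 + c**2 + e**2 + q**2
    # A is shared 0.5 on average, C fully, Q in proportion to pi_hat
    covariance = 0.5 * a**2 + c**2 + pi_hat * q**2
    return variance, covariance


if __name__ == "__main__":
    # Arbitrary illustrative path coefficients
    for pi_hat in (0.0, 0.5, 1.0):
        var, cov = expected_sib_moments(0.6, 0.4, 0.5, 0.3, pi_hat)
        print(pi_hat, round(var, 3), round(cov, 3))
```

If there is a QTL at the locus (q not zero), pairs that share more alleles IBD there are expected to be more alike, which is what the linkage test looks for.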

30
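[Path diagram: latent factors Q, A, C and E for each twin, each with variance 1, with paths q, a, c and e to the phenotypes PTwin1 and PTwin2. Correlation between the twins' A factors: MZ = 1.0, DZ = 0.5. Correlation between their C factors: MZ = DZ = 1.0.]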
PTwin2
31
Linkage

How do we do this?
1. Genotyping data

32
Microsatellite data

Ideally positioned at equal genetic distances
across chromosome
Mostly di/tri nucleotide repeats
http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

33
Microsatellite data

Raw data consists of allele lengths/calls (bp)
Different primers give different lengths
So to compare data you MUST know which primers
were used
http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

34
Binning

Raw allele lengths are converted to allele
numbers or lengths
Example: D1S1646, tri-nucleotide repeat, size
range 130-150
Logically: work with binned lengths
Commonly: assign allele 1 to the 130 allele, 2 to
the 133 allele, and so on
Commercially: allele numbers are often assigned based
on reference populations (CEPH). So if the first
CEPH allele was 136, that would be assigned 1, and
130 and 133 would be assigned the next free allele
numbers
Conclusion: whenever possible start from the RAW
allele size and work with allele length
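
The "commonly" scheme above can be sketched as follows, using the D1S1646 numbers from the slide (smallest allele 130 bp, repeat unit 3 bp). The assumption that raw calls arrive as fractional base-pair sizes, the rounding rule, and the function name are illustrative and not taken from any particular genotyping pipeline.

```python
def bin_alleles(raw_lengths_bp, smallest_bp=130, repeat_unit=3):
    """Bin raw allele sizes to the nearest repeat step.

    Returns (binned_length, allele_number) pairs, where the smallest
    allele is numbered 1, the next repeat step 2, and so on.
    """
    binned = []
    for length in raw_lengths_bp:
        steps = round((length - smallest_bp) / repeat_unit)
        binned_length = smallest_bp + steps * repeat_unit
        binned.append((binned_length, steps + 1))
    return binned


if __name__ == "__main__":
    # Hypothetical raw calls from a sizing instrument
    print(bin_alleles([130.2, 132.8, 135.9]))
```

Keeping the binned length alongside the allele number makes it possible to merge data sets that used different numbering schemes, which is the point of the slide's conclusion.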

35
Error checking

After binning check for errors
Family relationships (GRR, Rel-pair)
Mendelian Errors (Sib-pair)
Double Recombinants (MENDEL, ASPEX, ALLEGRO)
An iterative process

36
Clean data

ped file
Family, individual, father, mother, sex, dummy,
genotypes
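
A small sketch of reading one line of such a ped file. It assumes whitespace-delimited columns in the order listed above, with two allele columns per marker; real ped files vary between programs, so the exact layout here is an assumption and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PedRecord:
    family: str
    individual: str
    father: str
    mother: str
    sex: str
    dummy: str
    genotypes: list  # one (allele1, allele2) tuple per marker


def parse_ped_line(line: str) -> PedRecord:
    """Split one ped line into the six leading fields plus genotype pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 fields followed by allele pairs")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1])
                 for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)


if __name__ == "__main__":
    # Hypothetical line: family 1, individual 3, parents 1 and 2, two markers
    print(parse_ped_line("1 3 1 2 1 0 130 133 2 4"))
```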

37
Estimating genotypic sharing

The ped file is used with map files to obtain
estimates of genotypic sharing between relatives
at each of the locations under analysis

38
Estimating genotypic sharing
Merlin will give you probabilities of sharing 0,
1, 2 alleles for every pair of individuals.
39
Estimating genotypic sharing

Output

40
Estimating genotypic sharing

Output

Why aren't P0, P1, P2 exact for everyone?
41
Estimating genotypic sharing

Output

Why aren't P0, P1, P2 exact for everyone?
- missing parental genotypes
- low informativeness at marker
[Example pedigree, genotypes shown: 1/2, 2/2, 2/2, 1/2]
42
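[Path diagram: latent factors Q, A, C and E for each twin, each with variance 1, with paths q, a, c and e to the phenotypes PTwin1 and PTwin2. Correlation between the twins' A factors: MZ = 1.0, DZ = 0.5. Correlation between their C factors: MZ = DZ = 1.0.]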
43
Genotypic similarity between relatives
IBD: alleles shared Identical By Descent are a
copy of the same ancestor allele
Pairs of siblings may share 0, 1 or 2 alleles IBD
The estimated proportion of alleles that a pair of
relatives shares IBD is known as pi-hat
[Pedigree figure: parental marker alleles M1, M2, M3, M3 with linked QTL alleles Q1, Q2, Q3, Q4; both sibs are M1/M3, carrying Q1/Q3 and Q1/Q4. The sibs share 2 alleles IBS but only 1 allele IBD.]
44
Estimating genotypic sharing

Output

45
Distribution of pi-hat

Adult Dutch DZ pairs: distribution of pi-hat
at 65 cM on chromosome 19
< 0.25: IBD0 group
> 0.75: IBD2 group
others: IBD1 group
pi65cat (0,1,2)
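
For illustration, pi-hat can be computed from the three sharing probabilities as P1/2 + P2 and then cut into the three groups using the thresholds on this slide. The function names are hypothetical; only the 0.25 and 0.75 cut-offs come from the slide.

```python
def pi_hat(p0: float, p1: float, p2: float) -> float:
    """Estimated proportion of alleles shared IBD: P1 / 2 + P2."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * p1 + p2


def ibd_category(value: float) -> int:
    """Assign a pair to the IBD0, IBD1 or IBD2 group from its pi-hat."""
    if value < 0.25:
        return 0
    if value > 0.75:
        return 2
    return 1


if __name__ == "__main__":
    # Hypothetical sharing probabilities for three sib pairs
    for probs in [(0.9, 0.1, 0.0), (0.25, 0.5, 0.25), (0.0, 0.2, 0.8)]:
        ph = pi_hat(*probs)
        print(probs, round(ph, 3), ibd_category(ph))
```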

46
Linkage Analyses

Advantage
Systematically scan the genome
Disadvantages
Not very powerful
Need hundreds to thousands of family members
Broad peaks

47
Lod scores
1 cM ≈ 1 Mb; 1 Mb = 1000 kb; 1 kb = 1000 bp; so 1 cM ≈ 1,000,000 bp
48
Strategy
1. Ascertain families with multiple affecteds
2. Linkage analyses to identify chromosomal
regions
→ allele-sharing among affecteds within a
family
3. Association analyses to identify specific
genes
[Figure: linked region containing Gene A, Gene B, Gene C]
49

BREAK

50
Linkage vs. Association

Linkage analyses look for relationship between a
marker and disease within a family (could be
different marker in each family)
Association analyses look for relationship
between a marker and disease between families
(must be same marker in all families)

51
Allelic Association: Extension of linkage to the
population
[Pedigree figure, genotypes shown: 3/5, 2/6, 3/5, 2/6, 3/2, 3/6, 5/2, 5/6]
Both families are linked with the marker, but a
different allele is involved
52
Allelic Association: Extension of linkage to the
population
[Pedigree figure, genotypes shown: 3/6, 2/4, 4/6, 2/6, 3/2, 6/2, 6/6, 6/6]
All families are linked with the marker
Allele 6 is associated with disease
53
Localization

Linkage analysis yields broad chromosome regions
harbouring many genes
Resolution comes from recombination events
(meioses) in families assessed
Good in terms of needing few markers, poor in
terms of finding specific variants involved
Association analysis yields fine-scale resolution
of genetic variants
Resolution comes from ancestral recombination
events
Good in terms of finding specific variants,
poor in terms of needing many markers

54
Allelic Association: Three Common Forms

Direct association
Mutant or susceptible polymorphism
Allele of interest is itself involved in
phenotype
Indirect association
Allele itself is not involved, but a nearby
correlated marker changes phenotype
Spurious association
Apparent association not related to genetic
aetiology (most common outcome)

55
Indirect and Direct Allelic Association
Direct association
[Figure: disease locus D]

Measure disease relevance directly, ignoring
correlated markers nearby
Semantic distinction between
Linkage disequilibrium: correlation between (any) markers in population
Allelic association: correlation between marker allele and trait
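
To make "correlation between markers" concrete, the sketch below computes the standard pairwise LD measures D, D' and r² for two biallelic loci from a haplotype frequency and the two allele frequencies. These are textbook definitions rather than anything specific to the presentation, and the example frequencies are invented.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> tuple:
    """Return (D, D', r^2) for alleles A and B at two biallelic loci.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    allele frequencies, each strictly between 0 and 1.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must be strictly between 0 and 1")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Invented example: D' = 1 although r^2 is well below 1
    print(ld_measures(p_ab=0.3, p_a=0.3, p_b=0.6))
```

The example shows why allele frequency matching matters: two markers can be in complete LD by D' while one is still a poor proxy for the other by r².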
56
Decay of Linkage Disequilibrium
Reich et al., Nature 2001
57
Average Levels of LD along chromosomes
CEPH W.Eur Estonian
Chr22
Dawson et al Nature 2002
58
Characterizing Patterns of Linkage Disequilibrium
59
Linkage Disequilibrium Maps and Allelic Association
[Figure: disease locus D among markers 1, 2, 3, ..., n, with LD between the markers and D]
Primary aim of LD maps: use relationships
amongst background markers (M1, M2, M3, ..., Mn) to
learn something about D for association studies
Something =
Efficient association study design by reduced genotyping
Predict approx. location (fine-map) of disease loci
Assess complexity of local regions
Attempt to quantify/predict underlying (unobserved) patterns
60
Deliverables: sets of haplotype tagging SNPs
61
Building Haplotype Maps for Gene-finding
1. Human Genome Project → Good for consensus,
not good for individual differences
2. Identify genetic variants → Anonymous with
respect to traits
3. Assay genetic variants → Verify
polymorphisms, catalogue correlations
amongst sites → Anonymous with respect to
traits
62
Haplotype Tagging for Efficient Genotyping
Cardon & Abecasis, TIG 2003

Some genetic variants within haplotype blocks
give redundant information
A subset of variants, htSNPs, can be used to
tag the conserved haplotypes with little loss
of information (Johnson et al., Nat Genet, 2001)
Initial detection of htSNPs should facilitate
future genetic association studies

63
HapMap Strategy

Samples
Four populations, small samples
Genotyping
5 kb initial density across genome (600K markers)
Subsequent focus on low LD regions
Recent NIH RFA for deeper coverage

HapMap validating millions of SNPs.
Are they the right SNPs?

Distribution of allele frequencies in public
markers is biased toward common alleles
[Figure: expected allele frequency distribution in the population vs. frequency distribution of public markers]
Updated with phase 2: more similar to expectation
Phillips et al. Nat Genet 2003
65
Summary of Role of Linkage Disequilibrium on
Association Studies

Marker characterization is becoming extensive and
genotyping throughput is high
Tagging studies will yield panels for immediate
use
Need to be clear about assumptions/aims of each
panel
Density of eventual HapMap will probably cover much
of the genome in high LD, but not all
Challenges
Just having more markers doesn't mean that
success rate will improve
Expectations of association success via LD are
too high.

66
Two types of association studies

Case-control
Family-based

67
Allelic Association
[Figure: genotypes of controls and cases: 6/6, 6/2, 3/5, 3/4, 3/6, 5/6, 2/4, 3/2, 3/6, 6/6, 4/6, 2/6, 2/6, 5/2]
Allele 6 is associated with disease
68
Main Blame
Primary concern with case-control analyses:
population stratification
Analysis of mixed samples having different allele
frequencies is a primary concern in human genetics, as it
leads to false evidence for allelic association.
69
Population Stratification

Leads to spurious association
Requirements
Group differences in allele frequencies AND
Group differences in outcome
In epidemiology, this is a classic matching
problem, with genetics as a confounding variable
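
The two requirements above can be illustrated with invented allele counts. In the sketch below, two subpopulations differ in both allele frequency and case/control ratio; within each there is no association at all, yet pooling them produces a large chi-square. All counts are made up, and the helper is a plain Pearson chi-square for a 2 x 2 table without continuity correction.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


# Rows: cases, controls. Columns: allele 6, other alleles (allele counts).
# Population 1: allele 6 frequency 0.8, mostly cases (invented numbers)
pop1 = (160, 40, 40, 10)
# Population 2: allele 6 frequency 0.2, mostly controls (invented numbers)
pop2 = (10, 40, 40, 160)
# Pooled sample ignoring population membership
pooled = tuple(x + y for x, y in zip(pop1, pop2))

if __name__ == "__main__":
    print("population 1:", chi_square_2x2(*pop1))
    print("population 2:", chi_square_2x2(*pop2))
    print("pooled:      ", chi_square_2x2(*pooled))
```

A value above 3.84 is significant at the 0.05 level for 1 df, so the pooled analysis would report an association that exists in neither subpopulation.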

70
Population Stratification

$\chi^2_1 = 14.84$, $p < 0.001$
Spurious association
71
Family-based association methods
TDT: Transmission Disequilibrium Test
[Trio figure, genotypes shown: 1/2, 3/3, 2/3]

50/50 chance the 2 is transmitted
Looking for overtransmission of a particular
allele across affected individuals
(undertransmission to unaffecteds)
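
The test statistic itself is short enough to sketch. For heterozygous parents, let b be the number of times the allele of interest is transmitted to an affected child and c the number of times it is not; the TDT statistic is (b - c)² / (b + c), compared against a chi-square distribution with 1 df. The counts below are invented for illustration.

```python
def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """TDT (McNemar-type) chi-square with 1 df.

    Counts come from heterozygous parents only: how often the allele of
    interest was or was not passed to an affected offspring.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


if __name__ == "__main__":
    # Invented counts: allele transmitted 60 times, not transmitted 40 times
    print(tdt_statistic(60, 40))
```

With the invented 60 versus 40 split the statistic is 4.0, just past the 3.84 threshold for p < 0.05; an even 50/50 split gives 0.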

72
TDT Advantages/Disadvantages
Advantages
Robust to stratification
Genotyping error detectable via Mendelian inconsistencies
Estimates of haplotypes possible
Disadvantages
Detection/elimination of genotyping errors causes
bias (Gordon et al., 2001)
Uses only heterozygous parents
Inefficient for genotyping: 3 individuals yield
2 founders, so 1/3 of the information is not used
Can be difficult/impossible to collect:
late-onset disorders, psychiatric
conditions, pharmacogenetic applications
73
Association studies < 2000: TDT

TDT virtually ubiquitous over past decade
Grant and manuscript referees and editors mandated
the design
View of case/control association studies greatly
diminished due to perceived role of
stratification

Association studies from 2000: return to population

Case/controls, using extra genotyping
Families, when available

74
Detecting and Controlling for Population
Stratification with Genetic Markers
Idea

Take advantage of availability of large N
genetic markers
Use case/control design
Genotype genetic markers across genome
(Number depends on different factors)
Look if any evidence for background population
substructure exists and account for it

75
Two types of association studies

Case-control
Adv: more powerful
Disadv: population stratification;
limited by case/control definition
Family-based
Adv: population stratification not a problem
Disadv: less powerful, hard to collect parents
for some phenotypes

76
Association Analyses vs Linkage

Advantage
More powerful
Disadvantage
Not systematic (in the past)
Now!
Genome wide association scans

77
Current Association Study Challenges: 1)
Genome-wide screen or candidate gene

Genome-wide screen
Hypothesis-free
High cost: large genotyping requirements
Multiple-testing issues
Possibly many false positives, fewer misses

Candidate gene
Hypothesis-driven
Low cost: small genotyping requirements
Multiple testing less important
Possibly many misses, fewer false positives

78
Current Association Study Challenges: 2) What
constitutes a replication?
GOLD standard for association studies: replicating
association results in different laboratories is
often seen as the most compelling piece of evidence
for a true finding
But in any sample, we measure
Multiple traits
Multiple genes
Multiple markers in genes
and we analyse all this using multiple statistical tests
What is a true replication?
79
What is a true replication?
Replication outcome | Explanation
Association to same trait, but different gene | Genetic heterogeneity
Association to same trait, same gene, different SNPs (or haplotypes) | Allelic heterogeneity
Association to same trait, same gene, same SNP but in opposite direction (protective ↔ disease) | Allelic heterogeneity/pop differences
Association to different, but correlated phenotype(s) | Phenotypic heterogeneity
No association at all | Sample size too small

80
Measuring Success by Replication

Define objective criteria for what is/is not a
replication in advance
Design initial and replication study to have
enough power
Lumper: use most samples to obtain robust
results in first place
Great initial detection, may be weak in
replication
Skol et al. 2006: lumping is better for power
Splitter: take otherwise large sample, split
into initial and replication groups
One good study → two bad studies.
Poor initial detection, poor replication

81
Current Association Study Challenges: 3) Do we
have the best set of genetic markers?

There exist 6 million putative SNPs in the
public domain. Are they the right markers?

Allele frequency distribution is biased toward
common alleles
[Figure: expected allele frequency distribution in the population vs. frequency distribution of public markers]
82
Current Association Study Challenges: 3) Do we
have the best set of genetic markers?
Tabor et al, Nat Rev Genet 2003
83
Greatest power comes from markers that match
allele freq with trait loci
$\lambda_s = 1.5$, $\alpha = 5 \times 10^{-8}$, Spielman TDT
(Müller-Myhsok and Abel, 1997)
84
Current Association Study Challenges: 4)
Integrating the sampling, LD and genetic effects
Questions that don't stand alone
How much LD is needed to detect complex disease
genes?
What effect size is big enough to be detected?
How common (rare) must a disease variant(s) be to
be identifiable?
What marker allele frequency threshold should be
used to find complex disease genes?
85
Complexity of System

In any indirect association study, we measure
marker alleles that are correlated with trait
variants
We do not measure the trait variants themselves
But, for study design and power, we concern
ourselves with frequencies and effect sizes at
the trait locus.
This can only lead to underpowered studies and
inflated expectations
We should concern ourselves with the apparent
effect size at the marker, which results from
1) difference in frequency of marker and trait
alleles
2) LD between the marker and trait loci
3) effect size of trait allele

86
Practical Implications of Allele Frequencies

Strongest argument for using common markers is
not CD-CV. It is practical:
For small effects, common markers are
the only ones for which sufficient sample sizes
can be collected
→ There are situations where indirect association
analysis will not work
Discrepant marker/disease freqs, low LD,
heterogeneity,
Linkage approach may be the only genetics approach
in these cases
At present, no way to know when association
will/will not work
Balance with linkage

87
Current Association Study Challenges: 5) How to
analyse the data

Allele-based test?
2 alleles → 1 df
$E(Y) = a + bX$, with $X = 0/1$ for presence/absence
Genotype-based test?
3 genotypes → 2 df
$E(Y) = a + b_1 A + b_2 D$, with $A = 0/1$ additive (hom) and $D = 0/1$ dom (het)
Haplotype-based test?
For $M$ markers, $2^M$ possible haplotypes → $2^M - 1$ df
$E(Y) = a + \sum bH$, with $H$ coded for haplotype effects
Multilocus test?
Epistasis, G x E interactions, many possibilities

88
Current Association Study Challenges: 6) Multiple
Testing

Candidate genes: a few tests (probably
correlated)
Linkage regions: 100s to 1000s of tests (some
correlated)
Whole genome association: 100,000s to 1,000,000s
of tests (many correlated)
What to do?
Bonferroni (conservative)
False discovery rate?
Permutations?
Area of active research
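
Two of the options listed above are simple enough to sketch: a Bonferroni-corrected threshold and the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. Note that the Benjamini-Hochberg guarantee is usually stated for independent (or positively dependent) tests, so with correlated markers this is only a rough guide. Function names and p-values are illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the test is declared significant.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr and
    reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # One million tests at an overall alpha of 0.05
    print(bonferroni_threshold(0.05, 1_000_000))
    # Invented p-values from four tests
    print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))
```

With the four invented p-values, Bonferroni (threshold 0.0125) keeps only the first test, while Benjamini-Hochberg at a 5% false discovery rate keeps the first three.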

89
Despite challenges, upcoming association studies
hold some promise

Availability of millions of genetic markers
Genotyping costs decreasing rapidly
Cost per SNP: 2001 (0.25) → 2003 (0.10) → 2004 (0.01)
Background LD patterns being characterized
International HapMap and other projects

90
Genome Wide Association Studies (GWAS) Underway

Genetic Analysis Information Network (GAIN)
Psoriasis, ADHD, Schizophrenia, Bipolar Disorder,
Depression, Type 1 Diabetes
Wellcome Trust Case Control Consortium
Bipolar Disorder, Coronary Artery Disease,
Crohn's Disease, Rheumatoid Arthritis, Type 1
Diabetes, Type 2 Diabetes
Genes, Environment, Health Initiative
(Gene/Environment Association Studies GENEVA)
Addiction, Diabetes, Heart Disease, Oral Clefts,
Maternal Metabolism and Birth Weight, Lung
Cancer, Pre-Term Birth, Dental Caries
Genes, Environment, Development Initiative
(GEDI)
