Introduction to Gene-Finding: Linkage and Association

1
Introduction to Gene-Finding: Linkage and
Association
  • Danielle Dick, Sarah Medland, (Ben Neale)

2
Aim of QTL mapping
  • LOCALIZE and then IDENTIFY a locus that
    regulates a trait (QTL)
  • Locus: nucleotide or sequence of nucleotides with
    variation in the population, with different
    variants associated with different trait levels.

3
Location and Identification
  • Linkage
  • localize region of the genome where a QTL that
    regulates the trait is likely to be harboured
  • Family-specific phenomenon: affected individuals
    in a family share the same ancestral predisposing
    DNA segment at a given QTL

4
Location and Identification
  • Association
  • identify a QTL that regulates the trait
  • Population-specific phenomenon: affected
    individuals in a population share the same
    ancestral predisposing DNA segment at a given QTL

5
Linkage
  • Overview

6
Progress of the Human Genome Project
Human Chromosome 4
7
Genetic markers (DNA polymorphisms)
ATGCTTGCCACGCE   ATGCTTCTTGCCATGCE
Microsatellite markers: can be di- (2), tri- (3), or
tetra- (4) nucleotide repeats
ATGCTTGCCACGCE   ATGCTTGCCATGCE
Single Nucleotide Polymorphism (SNP)
8
DNA polymorphisms
  • Can occur in gene, but be silent
  • Can change gene product (protein)
  • Alter amino acid sequence (a lot or a little)
  • Can regulate gene product
  • Upregulate or downregulate protein production
  • Turn off or on gene
  • Can occur in noncoding region
  • This happens most often!

9
Mutations
10
How do we map genes?
  • Deviation from Mendel's Law of Independent
    Assortment
  • AaBb → ¼ AB, ¼ Ab, ¼ aB, ¼ ab
  • We're looking for departures from this

11
Recombination
12
Recombination
  • Another way of introducing genetic diversity
  • Allows us to map genes!
  • Crossovers are more likely to occur between genes
    that are farther apart: the likelihood of a
    recombination event is proportional to the
    distance
  • Interference: we tend not to see 2 crossovers in a
    small area
  • Alleles that are very close together are more
    likely to stay together; they don't assort
    independently
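
As a rough illustration of the bullets above, the sketch below turns a map distance into a recombination fraction and then into expected gamete frequencies for a doubly heterozygous parent. It assumes Haldane's map function, which ignores interference, and it assumes the parent carries A and B on the same chromosome (phase AB/ab). The function names are made up for this example.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM (Haldane's map function).

    Assumption: crossovers occur independently (no interference).
    """
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def gamete_frequencies(theta: float) -> dict:
    """Expected gamete frequencies from an AB/ab parent for a given theta."""
    return {
        "AB": (1.0 - theta) / 2.0,  # non-recombinant
        "ab": (1.0 - theta) / 2.0,  # non-recombinant
        "Ab": theta / 2.0,          # recombinant
        "aB": theta / 2.0,          # recombinant
    }


# Unlinked loci (theta = 0.5) reproduce Mendel's 1/4 : 1/4 : 1/4 : 1/4.
unlinked = gamete_frequencies(0.5)
# Loci 10 cM apart show an excess of the parental AB and ab gametes.
linked = gamete_frequencies(haldane_theta(10.0))
```

For short distances the recombination fraction is close to the map distance itself (10 cM gives a theta just under 0.10), which is why distance and recombination are treated as roughly proportional.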

13
Linkage Mapping (is a marker "linked" to the
disease gene?)
  • Collect families with affected individuals
  • Genome scan: test markers evenly spaced across
    the entire genome (about every 10 cM, about 400 markers)
  • LOD score (log of the odds): what are the odds
    of observing the family marker data if the marker
    is linked to the disease (less recombination than
    expected) compared to if the marker is not linked
    to the disease?
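
The LOD score idea can be sketched for the simplest possible case: phase-known, fully informative meioses in which recombinants can be counted directly. Real pedigree likelihoods are more involved; the function names and the grid search below are assumptions made for illustration only.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD(theta) = log10[ L(theta) / L(0.5) ] for counted recombinants.

    L(theta) = theta^r * (1 - theta)^(n - r); theta = 0.5 means "not linked".
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    r, n = recombinants, meioses
    log_l_linked = r * math.log10(theta) + (n - r) * math.log10(1.0 - theta)
    log_l_unlinked = n * math.log10(0.5)
    return log_l_linked - log_l_unlinked


def max_lod(recombinants: int, meioses: int, step: float = 0.01):
    """Grid search for the theta that maximises the LOD score."""
    thetas = [i * step for i in range(1, int(round(0.5 / step)) + 1)]
    best = max(thetas, key=lambda t: lod_score(recombinants, meioses, t))
    return best, lod_score(recombinants, meioses, best)


# Hypothetical family data: 1 recombinant in 10 informative meioses.
best_theta, best_lod = max_lod(1, 10)
```

With these made-up counts the maximum falls at theta = r/n = 0.1, and the LOD is well short of 3, which shows why several families usually have to be combined.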

14
Thomas Hunt Morgan: discoverer of linkage
15
Linkage = Co-segregation
[Pedigree figure with marker genotypes A3A4, A1A2, A2A4, A1A3, A2A3, A1A2, A1A4, A3A4, A3A2]
Marker allele A1 cosegregates with dominant
disease
16
Lod scores
  • > 3.0: evidence for linkage
  • < -2.0: can rule out linkage
  • In between: inconclusive, collect more families

17
Linkage = Co-segregation
  • Parametric linkage: used very successfully to map
    disease genes for Mendelian disorders
  • Problematic for complex disorders: requires
    disease model, penetrance, assumes gene of major
    effect, phenotypic precision

[Pedigree figure with marker genotypes A3A4, A1A2, A2A4, A1A3, A2A3, A1A2, A1A4, A3A4, A3A2]
18
Nonparametric Linkage
  • Based on allele-sharing
  • More appropriate for phenotypes influenced by multiple
    genes of small effect and by environment; no disease
    model is assumed
  • Basic unit of data: affected relative (often
    sibling) pairs

19
[Figure: parental mating (x); each of the four possible offspring genotypes has probability 1/4]
20
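IDENTITY BY DESCENT
[Figure: 4 x 4 table of Sib 1 by Sib 2 genotypes; each of the 16 equally likely cells shows the number of alleles shared IBD (0, 1 or 2)]
4/16 = 1/4 of sib pairs share BOTH parental alleles IBD (IBD = 2)
8/16 = 1/2 of sib pairs share ONE parental allele IBD (IBD = 1)
4/16 = 1/4 of sib pairs share NO parental alleles IBD (IBD = 0)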
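
The 1/4, 1/2, 1/4 split can be checked by brute-force enumeration. The sketch below assumes a fully informative mating (father A/B, mother C/D, all four alleles distinguishable); the allele labels and the function name are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_pair_ibd_distribution() -> dict:
    """Enumerate IBD sharing for two sibs from an A/B x C/D mating."""
    paternal = ["A", "B"]
    maternal = ["C", "D"]
    # Each child gets one paternal and one maternal allele: 4 equally likely genotypes.
    child_genotypes = list(product(paternal, maternal))
    counts = Counter()
    # 4 x 4 = 16 equally likely sib-pair combinations.
    for sib1, sib2 in product(child_genotypes, repeat=2):
        shared = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(v, total) for k, v in sorted(counts.items())}


ibd = sib_pair_ibd_distribution()
```

Because all four parental alleles are distinct here, sharing by state and sharing by descent coincide; with less informative markers they do not, which is where probabilistic IBD estimation comes in.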
21
Genotypic similarity between relatives
IBS: alleles shared Identical By State "look the
same" and may have the same DNA sequence, but they
are not necessarily derived from a known common
ancestor (focus for association)
IBD: alleles shared Identical By Descent are
a copy of the same ancestral allele (focus for
linkage)
[Figure: pedigree with marker alleles M1, M2, M3 and QTL alleles Q1, Q2, Q3, Q4; two sibs with marker genotype M1 M3 (QTL genotypes Q1 Q3 and Q1 Q4) share 2 alleles IBS but 1 allele IBD]
22
Genotypic similarity: basic principles
  • Loci that are close together are more likely to
    be inherited together than loci that are further
    apart
  • Loci are likely to be inherited in context, i.e.
    with their surrounding loci
  • Because of this, knowing that a locus is
    transmitted from a common ancestor is more
    informative than simply observing that it is the
    same allele
  • Critical to have parental data when possible

23
Linkage Markers
24
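For disease traits (affected/unaffected): affected
sib pairs selected
[Figure: number of sib pairs (axis ticks 250 to 1000) in the IBD 0, IBD 1 and IBD 2 groups across markers (axis labels 1, 2, 3, 127, 310), shown against the expected counts]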
25
For continuous measures: unselected sib pairs
26
So how does all this fit into Mx?
27
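IDENTITY BY DESCENT
[Figure: 4 x 4 table of Sib 1 by Sib 2 genotypes; each of the 16 equally likely cells shows the number of alleles shared IBD (0, 1 or 2)]
4/16 = 1/4 of sib pairs share BOTH parental alleles IBD (IBD = 2)
8/16 = 1/2 of sib pairs share ONE parental allele IBD (IBD = 1)
4/16 = 1/4 of sib pairs share NO parental alleles IBD (IBD = 0)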
28
  • In biometrical modeling A is correlated at 1 for
    MZ twins and .5 for DZ twins
  • .5 is the average genome-wide sharing of genes
    between full siblings (DZ twin relationship)

29
  • In linkage analysis we will be estimating an
    additional variance component Q
  • For each locus under analysis the coefficient of
    sharing for this parameter will vary for each
    pair of siblings
  • The coefficient will be the probability that the
    pair of siblings have both inherited the same
    alleles from a common ancestor
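
A minimal sketch of what the extra variance component does to the expected covariance between two siblings. It assumes the usual path-tracing result for a Q + A + C + E model, with the Q coefficient set to the pair's pi-hat at the locus (1.0 for MZ twins); the parameter names q2, a2, c2 and e2 stand for the squared path coefficients and are labels chosen for this example, not Mx syntax.

```python
def expected_pair_covariance(pi_hat: float, q2: float, a2: float, c2: float,
                             zygosity: str = "DZ") -> float:
    """Expected trait covariance for a pair under a Q + A + C + E model."""
    if zygosity == "MZ":
        q_coeff, a_coeff = 1.0, 1.0   # MZ twins share everything IBD
    else:
        q_coeff, a_coeff = pi_hat, 0.5  # locus-specific vs genome-wide average
    return q_coeff * q2 + a_coeff * a2 + c2


def expected_variance(q2: float, a2: float, c2: float, e2: float) -> float:
    """Total phenotypic variance: all four components contribute."""
    return q2 + a2 + c2 + e2


# Hypothetical variance components.
cov_ibd0 = expected_pair_covariance(0.0, q2=0.2, a2=0.3, c2=0.1)
cov_ibd1 = expected_pair_covariance(0.5, q2=0.2, a2=0.3, c2=0.1)
cov_ibd2 = expected_pair_covariance(1.0, q2=0.2, a2=0.3, c2=0.1)
```

If the locus matters (q2 > 0), pairs sharing more alleles IBD at that position are expected to be more alike; that gradient is what the linkage test picks up.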

30
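[Path diagram: latent factors Q, A, C and E (each with variance 1) influence PTwin1 and PTwin2 through paths q, a, c and e. Correlation between the twins' A factors: MZ = 1.0, DZ = 0.5; between their C factors: MZ = DZ = 1.0]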
PTwin2
31
Linkage
  • How do we do this?
  • 1. Genotyping data

32
Microsatellite data
  • Ideally positioned at equal genetic distances
    across chromosome
  • Mostly di/tri nucleotide repeats
  • http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

33
Microsatellite data
  • Raw data consists of allele lengths/calls (bp)
  • Different primers give different lengths
  • So to compare data you MUST know which primers
    were used
  • http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

34
Binning
  • Raw allele lengths are converted to allele
    numbers or lengths
  • Example: D1S1646, tri-nucleotide repeat, size
    range 130-150
  • Logically: work with binned lengths
  • Commonly: assign allele 1 to the 130 allele, 2 to
    the 133 allele, and so on
  • Commercially: allele numbers are often assigned based
    on reference populations (CEPH). So if the first
    CEPH allele was 136, that would be assigned 1, and
    130 and 133 would be assigned the next free allele
    numbers
  • Conclusion: whenever possible start from the RAW
    allele size and work with allele length

35
Error checking
  • After binning check for errors
  • Family relationships (GRR, Rel-pair)
  • Mendelian Errors (Sib-pair)
  • Double Recombinants (MENDEL, ASPEX, ALLEGRO)
  • An iterative process
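
The Mendelian-error step can be sketched for a single trio at one marker. The programs named above do far more (whole pedigrees, missing parents); this only shows the basic rule that a child must carry one allele from each parent. Genotypes are represented as pairs of allele labels, an assumption of this example.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child) -> bool:
    """True if the child's genotype is one paternal plus one maternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


# Father 1/2 and mother 3/3 can only have 1/3 or 2/3 children.
ok = is_mendelian_consistent((1, 2), (3, 3), (2, 3))
error = is_mendelian_consistent((1, 2), (3, 3), (1, 2))
```

A flagged trio does not say which genotype is wrong, which is one reason error checking ends up being iterative.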

36
Clean data
  • ped file
  • Family, individual, father, mother, sex, dummy,
    genotypes
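
A sketch of reading one line of such a file, following the column order listed above. It assumes whitespace-separated fields and two allele columns per marker; the record type and function name are made up for illustration, and the exact layout expected by any given program should be checked against its documentation.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family: str
    individual: str
    father: str
    mother: str
    sex: str
    dummy: str
    genotypes: List[Tuple[str, str]]


def parse_ped_line(line: str) -> PedRecord:
    """Split a ped line into six leading fields plus allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)


# Hypothetical line: family 1, individual 3, parents 1 and 2, two markers.
record = parse_ped_line("1 3 1 2 2 0 130 133 4 6")
```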

37
Estimating genotypic sharing
  • The ped file is used with map files to obtain
    estimates of genotypic sharing between relatives
    at each of the locations under analysis

38
Estimating genotypic sharing
Merlin will give you probabilities of sharing 0,
1, 2 alleles for every pair of individuals.
39
Estimating genotypic sharing
  • Output

40
Estimating genotypic sharing
  • Output

Why aren't P0, P1, P2 exact for everyone?
41
Estimating genotypic sharing
  • Output

Why aren't P0, P1, P2 exact for everyone?
  • Missing parental genotypes
  • Low informativeness at marker
[Pedigree figure with genotypes 1/2, 2/2, 2/2, 1/2]
42
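[Path diagram: latent factors Q, A, C and E (each with variance 1) influence PTwin1 and PTwin2 through paths q, a, c and e. Correlation between the twins' A factors: MZ = 1.0, DZ = 0.5; between their C factors: MZ = DZ = 1.0]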
43
Genotypic similarity between relatives
IBD: alleles shared Identical By Descent are a
copy of the same ancestral allele
Pairs of siblings may share 0, 1 or 2 alleles IBD
The estimated proportion of alleles that a pair of
relatives shares IBD is known as pi-hat
[Figure: pedigree with marker alleles M1, M2, M3 and QTL alleles Q1, Q2, Q3, Q4; two sibs with marker genotype M1 M3 (QTL genotypes Q1 Q3 and Q1 Q4) share 2 alleles IBS but 1 allele IBD]
44
Estimating genotypic sharing
  • Output

45
Distribution of pi-hat
  • Adult Dutch DZ pairs: distribution of pi-hat
    at 65 cM on chromosome 19
  • < 0.25: IBD = 0 group
  • > 0.75: IBD = 2 group
  • others: IBD = 1 group
  • pi65cat = (0, 1, 2)
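
The grouping above can be sketched in a few lines. pi-hat is computed here with the standard weighting, P(IBD = 2) plus half of P(IBD = 1); the function names are illustrative, and the input probabilities would come from a program such as Merlin.

```python
def pi_hat(p0: float, p1: float, p2: float) -> float:
    """Estimated proportion of alleles shared IBD at a locus."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * p1 + p2


def ibd_category(pihat: float) -> int:
    """Slide thresholds: < 0.25 -> 0, > 0.75 -> 2, otherwise 1."""
    if pihat < 0.25:
        return 0
    if pihat > 0.75:
        return 2
    return 1


# A completely uninformative pair sits at the prior: pi-hat = 0.5, group 1.
prior_pair = pi_hat(0.25, 0.5, 0.25)
```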

46
Linkage Analyses
  • Advantage
  • Systematically scan the genome
  • Disadvantages
  • Not very powerful
  • Need hundreds to thousands of family members
  • Broad peaks

47
Lod scores
1 cM ≈ 1 Mb; 1 Mb = 1000 kb; 1 kb = 1000 bp; so 1 cM ≈ 1,000,000
bp
48
Strategy
1. Ascertain families with multiple affecteds
2. Linkage analyses to identify chromosomal
regions (excess allele-sharing among affecteds
within a family)
3. Association analyses to identify specific
genes
[Figure: chromosomal region containing Gene A, Gene B, Gene C]
49
  • BREAK

50
Linkage vs. Association
  • Linkage analyses look for relationship between a
    marker and disease within a family (could be
    different marker in each family)
  • Association analyses look for relationship
    between a marker and disease between families
    (must be same marker in all families)

51
Allelic Association: Extension of linkage to the
population
[Figure: two pedigrees with marker genotypes 3/5, 2/6, 3/5, 2/6, 3/2, 3/6, 5/2, 5/6]
Both families are linked with the marker, but a
different allele is involved
52
Allelic Association: Extension of linkage to the
population
[Figure: pedigrees with marker genotypes 3/6, 2/4, 4/6, 2/6, 3/2, 6/2, 6/6, 6/6]
All families are linked with the marker
Allele 6 is associated with disease
53
Localization
  • Linkage analysis yields broad chromosome regions
    harbouring many genes
  • Resolution comes from recombination events
    (meioses) in families assessed
  • Good in terms of needing few markers, poor in
    terms of finding specific variants involved
  • Association analysis yields fine-scale resolution
    of genetic variants
  • Resolution comes from ancestral recombination
    events
  • Good in terms of finding specific variants,
    poor in terms of needing many markers

54
Allelic Association Three Common Forms
  • Direct Association
  • Mutant or susceptible polymorphism
  • Allele of interest is itself involved in
    phenotype
  • Indirect Association
  • Allele itself is not involved, but a nearby
    correlated marker changes phenotype
  • Spurious association
  • Apparent association not related to genetic
    aetiology
  • (most common outcome)

55
Indirect and Direct Allelic Association
Direct Association
[Figure: disease locus D]
Measure disease relevance directly, ignoring
correlated markers nearby
Semantic distinction between
  • Linkage Disequilibrium: correlation between (any)
    markers in population
  • Allelic Association: correlation between marker
    allele and trait
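
To make "correlation between markers" concrete, the sketch below computes three common pairwise LD measures (D, |D'| and r²) for two biallelic loci from a haplotype frequency and the two allele frequencies. The formulas are the standard ones; the function name and the example frequencies are assumptions.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) for alleles A (locus 1) and B (locus 2).

    p_ab is the frequency of the haplotype carrying both A and B.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Allele A (freq 0.2) always travels with B (freq 0.5), but not vice versa.
d, d_prime, r2 = ld_measures(p_ab=0.2, p_a=0.2, p_b=0.5)
```

In this made-up case |D'| is 1 while r² is only 0.25: the two loci have never been separated by recombination, yet one is a poor proxy for the other because their allele frequencies differ.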
56
Decay of Linkage Disequilibrium
Reich et al., Nature 2001
57
Average Levels of LD along chromosomes
[Figure: LD along Chr22 in CEPH, W. Eur and Estonian samples]
Dawson et al., Nature 2002
58
Characterizing Patterns of Linkage Disequilibrium
59
Linkage Disequilibrium Maps and Allelic Association
[Figure: disease locus D among markers 1, 2, 3, ... n, with LD between the markers and D]
Primary aim of LD maps: use relationships
amongst background markers (M1, M2, M3, ... Mn) to
learn something about D for association studies
Something =
  • Efficient association study design by reduced
    genotyping
  • Predict approx. location (fine-map) of disease loci
  • Assess complexity of local regions
  • Attempt to quantify/predict underlying
    (unobserved) patterns
60
Deliverables: sets of haplotype tagging SNPs
61
Building Haplotype Maps for Gene-finding
1. Human Genome Project: good for consensus,
not good for individual differences
2. Identify genetic variants: anonymous with
respect to traits
3. Assay genetic variants: verify
polymorphisms, catalogue correlations
amongst sites; anonymous with respect to
traits
62
Haplotype Tagging for Efficient Genotyping
Cardon & Abecasis, TIG 2003
  • Some genetic variants within haplotype blocks
    give redundant information
  • A subset of variants, htSNPs, can be used to
    tag the conserved haplotypes with little loss
    of information (Johnson et al., Nat Genet, 2001)
  • Initial detection of htSNPs should facilitate
    future genetic association studies
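
A toy version of the redundancy idea: within a block, SNPs whose alleles are perfectly correlated across the observed haplotypes carry the same information, so one of them can stand in for the rest. This is a deliberately simple grouping, not the htSNP selection method of Johnson et al.; the 0/1 haplotype strings and the function name are assumptions.

```python
def pick_tag_snps(haplotypes):
    """Group perfectly correlated SNPs and keep the first of each group.

    haplotypes: list of equal-length strings of '0'/'1', one per haplotype.
    Returns a list of (tag_index, [indices of SNPs it tags]).
    """
    n_snps = len(haplotypes[0])
    flip = str.maketrans("01", "10")
    groups = {}
    for j in range(n_snps):
        column = "".join(h[j] for h in haplotypes)
        # A column and its complement are equally informative, so normalise
        # every column to start with '0' before grouping.
        if column[0] == "1":
            column = column.translate(flip)
        groups.setdefault(column, []).append(j)
    return [(members[0], members) for members in groups.values()]


# Four hypothetical haplotypes over four SNPs.
tags = pick_tag_snps(["0001", "1100", "1100", "0011"])
```

Here SNPs 0, 1 and 3 are mutually redundant, so genotyping SNP 0 and SNP 2 recovers all four haplotypes.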

63
HapMap Strategy
  • Samples
  • Four populations, small samples
  • Genotyping
  • 5 kb initial density across genome (600K markers)
  • Subsequent focus on low LD regions
  • Recent NIH RFA for deeper coverage

64
  • HapMap validating millions of SNPs.
  • Are they the right SNPs?

Distribution of allele frequencies in public
markers is biased toward common alleles
[Figure: expected frequency in population vs. frequency of public markers]
Updated with phase 2: more similar to expectation
Phillips et al., Nat Genet 2003
65
Summary of Role of Linkage Disequilibrium on
Association Studies
  • Marker characterization is becoming extensive and
    genotyping throughput is high
  • Tagging studies will yield panels for immediate
    use
  • Need to be clear about assumptions/aims of each
    panel
  • Density of eventual HapMap will probably cover much of
    the genome in high LD, but not all
  • Challenges
  • Just having more markers doesn't mean that
    success rate will improve
  • Expectations of association success via LD are
    too high.

66
Two types of association studies
  • Case-control
  • Family-based

67
Allelic Association
[Figure: marker genotypes in controls and cases: 6/6, 6/2, 3/5, 3/4, 3/6, 5/6, 2/4, 3/2, 3/6, 6/6, 4/6, 2/6, 2/6, 5/2]
Allele 6 is associated with disease
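
A minimal sketch of a case-control allelic test: count copies of the allele of interest in each group and compare them with a 1 df Pearson chi-square. The genotype lists below are invented (the figure does not say which genotypes belong to which group), and the helper names are illustrative.

```python
def count_allele(genotypes, allele: str):
    """Return (copies of allele, copies of all other alleles)."""
    carried = sum(g.split("/").count(allele) for g in genotypes)
    return carried, 2 * len(genotypes) - carried


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical samples.
cases = ["6/6", "6/2", "3/6", "6/6", "4/6", "2/6"]
controls = ["3/5", "3/4", "5/6", "2/4", "3/2", "5/2"]
case_6, case_other = count_allele(cases, "6")
ctrl_6, ctrl_other = count_allele(controls, "6")
chi2 = chi_square_2x2(case_6, case_other, ctrl_6, ctrl_other)
```

A value above 3.84 corresponds to p < 0.05 for 1 df, although samples this small would normally call for an exact test.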
68
Main Blame
Primary Concern with Case-Control Analyses:
Population stratification
Analysis of mixed samples having different allele
frequencies is a primary concern in human genetics,
as it leads to false evidence for allelic association.
69
Population Stratification
  • Leads to spurious association
  • Requirements
  • Group differences in allele frequencies AND
  • Group differences in outcome
  • In epidemiology, this is a classic matching
    problem, with genetics as a confounding variable

70
Population Stratification

χ² (1 df) = 14.84, p < 0.001
Spurious Association
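
How a spurious association arises can be shown with made-up counts (these are not the numbers behind the slide's statistic). In each of two subpopulations the allele is equally common in cases and controls, but the subpopulations differ in both allele frequency and the proportion of cases, so the pooled table looks associated.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom


# Rows: cases, controls. Columns: allele present, allele absent.
# Hypothetical subpopulation 1: allele frequency 80%, mostly cases.
pop1 = (80, 20, 40, 10)
# Hypothetical subpopulation 2: allele frequency 20%, mostly controls.
pop2 = (5, 20, 20, 80)

chi2_pop1 = chi_square_2x2(*pop1)
chi2_pop2 = chi_square_2x2(*pop2)
pooled = tuple(x + y for x, y in zip(pop1, pop2))
chi2_pooled = chi_square_2x2(*pooled)
```

Both requirements from the previous slide are visible here: the groups differ in allele frequency and in outcome, and only then does pooling create a signal.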
71
Family-based association methods
TDT: Transmission Disequilibrium Test
[Pedigree figure with genotypes 1/2, 3/3, 2/3]
  • 50/50 chance the 2 is transmitted
  • Looking for overtransmission of a particular
    allele across affected individuals
    (undertransmission to unaffecteds)
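
The test statistic itself is short. The sketch assumes the transmissions from heterozygous parents to affected offspring have already been counted (b = times the allele was transmitted, c = times it was not) and uses the usual McNemar-type form; the counts are hypothetical.

```python
def tdt_statistic(transmitted: int, not_transmitted: int) -> float:
    """TDT chi-square (1 df): (b - c)^2 / (b + c).

    Only heterozygous parents contribute, because only they can
    transmit or fail to transmit the allele of interest.
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - not_transmitted) ** 2 / total


# 100 heterozygous parents: allele passed on 60 times, withheld 40 times.
tdt = tdt_statistic(60, 40)
```

Under the null each heterozygous parent transmits the allele with probability 1/2, so b and c should be about equal and the statistic close to 0.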

72
TDT Advantages/Disadvantages
Advantages
  • Robust to stratification
  • Genotyping error detectable via Mendelian
    inconsistencies
  • Estimates of haplotypes possible
Disadvantages
  • Detection/elimination of genotyping errors causes
    bias (Gordon et al., 2001)
  • Uses only heterozygous parents
  • Inefficient for genotyping: 3 individuals yield
    2 founders, so 1/3 of the information is not used
  • Can be difficult/impossible to collect: late-onset
    disorders, psychiatric conditions,
    pharmacogenetic applications
73
Association studies < 2000: TDT
  • TDT virtually ubiquitous over past decade
  • Grant and manuscript referees and editors mandated
    the design
  • View of case/control association studies greatly
    diminished due to perceived role of
    stratification

Association studies > 2000: Return to population
  • Case/controls, using extra genotyping
  • families, when available

74
Detecting and Controlling for Population
Stratification with Genetic Markers
Idea
  • Take advantage of availability of large N
    genetic markers
  • Use case/control design
  • Genotype genetic markers across genome
  • (Number depends on different factors)
  • Check whether any evidence for background population
    substructure exists, and account for it
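
One published way of doing this is genomic control, which the slide does not name and which is used here purely as an illustration: estimate an inflation factor from the test statistics of markers assumed to be unrelated to the trait, then scale the candidate statistic down. The constant is the median of a 1 df chi-square distribution; the marker statistics are invented.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation(null_marker_chi2) -> float:
    """Inflation factor lambda from 'background' marker statistics (floored at 1)."""
    return max(1.0, median(null_marker_chi2) / CHI2_1DF_MEDIAN)


def corrected_chi2(candidate_chi2: float, lam: float) -> float:
    """Deflate a candidate marker's statistic by the inflation factor."""
    return candidate_chi2 / lam


# Hypothetical statistics from unlinked background markers.
background = [0.2, 0.7, 0.9, 1.1, 3.0]
lam = genomic_inflation(background)
adjusted = corrected_chi2(10.0, lam)
```

With these numbers the median background statistic is about twice its expected value, so the candidate's chi-square is roughly halved.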

75
Two types of association studies
  • Case-control
  • Adv: more powerful
  • Disadv: population stratification; limited by
    case/control definition
  • Family-based
  • Adv: population stratification not a problem
  • Disadv: less powerful, hard to collect parents
    for some phenotypes

76
Association Analyses vs Linkage
  • Advantage
  • More powerful
  • Disadvantage
  • Not systematic (in the past)
  • Now!
  • Genome wide association scans

77
Current Association Study Challenges: 1)
Genome-wide screen or candidate gene
  • Genome-wide screen
  • Hypothesis-free
  • High cost: large genotyping requirements
  • Multiple-testing issues
  • Possibly many false positives, fewer misses
  • Candidate gene
  • Hypothesis-driven
  • Low cost: small genotyping requirements
  • Multiple-testing less important
  • Possibly many misses, fewer false positives

78
Current Association Study Challenges: 2) What
constitutes a replication?
GOLD Standard for association studies
Replicating association results in different
laboratories is often seen as the most compelling
piece of evidence for a true finding
But in any sample, we measure
  • Multiple traits
  • Multiple genes
  • Multiple markers in genes
and we analyse all this using multiple statistical tests
What is a true replication?
79
What is a true replication?
Replication outcome: explanation
  • Association to same trait, but different gene:
    genetic heterogeneity
  • Association to same trait, same gene, different
    SNPs (or haplotypes): allelic heterogeneity
  • Association to same trait, same gene, same SNP
    but in opposite direction (protective vs. disease):
    allelic heterogeneity/pop differences
  • Association to different, but correlated
    phenotype(s): phenotypic heterogeneity
  • No association at all: sample size too small

80
Measuring Success by Replication
  • Define objective criteria for what is/is not a
    replication in advance
  • Design initial and replication study to have
    enough power
  • Lumper: use most samples to obtain robust
    results in first place
  • Great initial detection, may be weak in
    replication
  • Skol et al. 2006: lumping is better for power
  • Splitter: take otherwise large sample, split
    into initial and replication groups
  • One good study becomes two bad studies.
  • Poor initial detection, poor replication

81
Current Association Study Challenges: 3) Do we
have the best set of genetic markers?
  • There exist 6 million putative SNPs in the
    public domain. Are they the right markers?

Allele frequency distribution is biased toward
common alleles
[Figure: expected frequency in population vs. frequency of public markers]
82
Current Association Study Challenges: 3) Do we
have the best set of genetic markers?
Tabor et al, Nat Rev Genet 2003
83
Greatest power comes from markers that match
allele freq with trait loci
λs = 1.5, α = 5 × 10^-8, Spielman TDT
(Müller-Myhsok and Abel, 1997)
84
Current Association Study Challenges: 4)
Integrating the sampling, LD and genetic effects
Questions that don't stand alone
  • How much LD is needed to detect complex disease
    genes?
  • What effect size is big enough to be
    detected?
  • How common (rare) must a disease
    variant(s) be to be identifiable?
  • What marker allele frequency threshold should be
    used to find complex disease genes?
85
Complexity of System
  • In any indirect association study, we measure
    marker alleles that are correlated with trait
    variants
  • We do not measure the trait variants themselves
  • But, for study design and power, we concern
    ourselves with frequencies and effect sizes at
    the trait locus.
  • This can only lead to underpowered studies and
    inflated expectations
  • We should concern ourselves with the apparent
    effect size at the marker, which results from
  • 1) difference in frequency of marker and trait
    alleles
  • 2) LD between the marker and trait loci
  • 3) effect size of trait allele

86
Practical Implications of Allele Frequencies
  • Strongest argument for using common markers is
    not CD-CV. It is practical
  • For small effects, common markers are
    the only ones for which sufficient sample sizes
    can be collected
  • There are situations where indirect association
    analysis will not work
  • Discrepant marker/disease freqs, low LD,
    heterogeneity,
  • Linkage approach may be the only genetic approach in
    these cases
  • At present, no way to know when association
    will/will not work
  • Balance with linkage

87
Current Association Study Challenges: 5) How to
analyse the data
  • Allele-based test?
  • 2 alleles → 1 df
  • E(Y) = a + bX, X = 0/1 for presence/absence
  • Genotype-based test?
  • 3 genotypes → 2 df
  • E(Y) = a + b1*A + b2*D, A = 0/1 additive (hom), D =
    0/1 dom (het)
  • Haplotype-based test?
  • For M markers, 2^M possible haplotypes → 2^M - 1 df
  • E(Y) = a + Σ bH, H coded for haplotype effects
  • Multilocus test?
  • Epistasis, G x E interactions, many possibilities
88
Current Association Study Challenges: 6) Multiple
Testing
  • Candidate genes: a few tests (probably
    correlated)
  • Linkage regions: 100s to 1000s of tests (some
    correlated)
  • Whole genome association: 100,000s to 1,000,000s of
    tests (many correlated)
  • What to do?
  • Bonferroni (conservative)
  • False discovery rate?
  • Permutations?
  • Area of active research
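
Two of the options listed above can be sketched briefly. The false discovery rate procedure shown is the Benjamini-Hochberg step-up rule, chosen here as one common example; both functions treat the tests as if they were independent, which the slide notes they are not, and the p-values are invented.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q: float = 0.05):
    """Indices of tests declared significant at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Step-up rule: find the largest rank k with p_(k) <= q * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# One million tests at alpha = 0.05 gives a per-test threshold of 5e-8.
gw_threshold = bonferroni_threshold(0.05, 1_000_000)
hits = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

Permutation testing, the third option, handles correlated tests directly but needs the raw genotype and phenotype data, so it is not sketched here.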

89
Despite challenges, upcoming association studies
hold some promise
  • Availability of millions of genetic markers
  • Genotyping costs decreasing rapidly
  • Cost per SNP: 2001 (0.25) → 2003 (0.10) → 2004
    (0.01)
  • Background LD patterns being characterized
  • International HapMap and other projects

90
Genome Wide Association Studies (GWAS) Underway
  • Genetic Analysis Information Network (GAIN)
  • Psoriasis, ADHD, Schizophrenia, Bipolar Disorder,
    Depression, Type 1 Diabetes
  • Wellcome Trust Case Control Consortium
  • Bipolar Disorder, Coronary Artery Disease,
    Crohn's disease, Rheumatoid Arthritis, Type 1
    Diabetes, Type 2 Diabetes
  • Genes, Environment, Health Initiative
    (Gene/Environment Association Studies, GENEVA)
  • Addiction, Diabetes, Heart Disease, Oral Clefts,
    Maternal Metabolism and Birth Weight, Lung
    Cancer, Pre-Term Birth, Dental Caries
  • Genes, Environment, Development Initiative
    (GEDI)