Title: Association analysis
Slides: 34
Provided by: pivio

Transcript and Presenter's Notes

1
Association analysis
  • Genetics for Computer Scientists
  • 15.3.-19.3.2004
  • Biomedicum Department of Computer Science,
    Helsinki
  • Päivi Onkamo

2
Lecture outline
  • Genetic association analysis
  • Allelic association
  • χ² test
  • Linkage disequilibrium (LD) process
  • Formulation of the computational problem for LD
    mapping
  • Limitations of the LD mapping
  • Approaches, for example HPM

3
Genetic association analysis
  • Search for significant correlations between gene
    variants and phenotype
  • For example: locus A for SLE, with 100 cases and
    100 controls genotyped

4
Allelic association: an allele is associated with
a trait
  • Allele 1 seems to be associated, based on sheer
    numbers, but how sure can one be about it?

6
  • The idea is to compare the observed frequencies
    to the frequencies expected under the hypothesis
    of no association between alleles and the
    occurrence of the disease (independence between
    variables)
  • Test statistic:
    $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
  • where
  • $o_i$ is the observed class frequency for class i
    and $e_i$ the expected frequency (under H0 of no
    association)
  • k is the number of classes in the table
  • Degrees of freedom for the test: df = (r-1)(s-1),
    for a table with r rows and s columns
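
The test statistic described above, the sum over all classes of (observed - expected)² / expected, can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own material: the allele counts are hypothetical (the slide's table is not in the transcript), and the function names `chi_square` and `p_value_df1` are made up for this sketch.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an
    r x s contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under H0 (independence of rows and columns).
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical allele counts (NOT the slide's data).
# Rows: cases, controls; columns: allele 1, allele 2.
# 100 individuals per group gives 200 alleles per row.
counts = [[130, 70], [90, 110]]
stat, df = chi_square(counts)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p:.2g}")
```

For tables larger than 2 x 2 the statistic is computed the same way, but the p-value then has to come from a χ² distribution with the corresponding degrees of freedom (from a table or a statistics library).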

7
Expected
df = 1, p << 0.001
9
Interpretation of the test results
  • The p-value is low enough that H0 can be rejected:
    the probability that the observed frequencies
    would differ this much (or even more) from the
    expected ones just by coincidence is < 0.001
  • χ² tables (Appendix), internet resources, etc.

10
  • Genetic association is a population-level
    correlation between some known genetic variant
    and a trait: an allele is over-represented in
    affected individuals
  • From a genetic point of view, an association does
    not imply a causal relationship
  • Often, a gene is not a direct cause of the
    disease, but is in LD with a causative gene

11
Linkage disequilibrium (LD)
  • Closely located genes often exhibit linkage
    disequilibrium with each other
  • Locus 1 with alleles A and a, and locus 2 with
    alleles B and b, at a distance of a few
    centiMorgans from each other
  • At equilibrium, the frequency of the AB haplotype
    should equal the product of the allele
    frequencies of A and B: $p_{AB} = p_A p_B$. If
    this holds, then $p_{Ab} = p_A p_b$,
    $p_{aB} = p_a p_B$ and $p_{ab} = p_a p_b$ as well.
    Any deviation from these values implies LD.
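
The deviation from equilibrium is commonly summarized by the coefficient D, the observed AB haplotype frequency minus the product of the A and B allele frequencies. A minimal sketch, assuming haplotypes are given as two-character strings; the sample counts and the function name `ld_coefficient` are hypothetical:

```python
from collections import Counter

def ld_coefficient(haplotypes):
    """Return (D, observed AB frequency, AB frequency expected at equilibrium)
    for two-locus haplotypes written as strings such as "AB" or "ab"."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    freq_A = sum(c for h, c in counts.items() if h[0] == "A") / n
    freq_B = sum(c for h, c in counts.items() if h[1] == "B") / n
    freq_AB = counts["AB"] / n
    expected_AB = freq_A * freq_B
    return freq_AB - expected_AB, freq_AB, expected_AB

# Hypothetical sample of 100 chromosomes in which AB and ab are over-represented.
sample = ["AB"] * 40 + ["Ab"] * 10 + ["aB"] * 10 + ["ab"] * 40
D, observed_AB, expected_AB = ld_coefficient(sample)
print(D, observed_AB, expected_AB)

# A sample at equilibrium gives D = 0.
balanced = ["AB"] * 25 + ["Ab"] * 25 + ["aB"] * 25 + ["ab"] * 25
D_balanced = ld_coefficient(balanced)[0]
```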

12
Linkage disequilibrium (LD)
  • LD follows from the fact that closely located
    genes are transmitted as a block which only
    rarely breaks up in meioses
  • An example:
  • Locus 1: marker gene
  • Locus 2: disease locus, with allele b as the
    dominant susceptibility allele with 100%
    penetrance

13
An example
14
  • Association evaluated: locus 1 also seems
    associated, even though it has nothing to do
    with the disease; the association is observed
    just due to LD
  • LD mapping: utilizing the founder effect
  • A new disease mutation was born n generations ago
    in a relatively small, isolated population
  • The original ancestral haplotype slowly decays as
    a function of generations
  • In the last generation, only small stretches of
    founder haplotype can be observed in the
    disease-associated chromosomes

15
LD mapping: utilizing founder effect
16
Data: searching for a needle in a haystack
[Figure: haplotype data matrix, one row per chromosome.
Columns: disease status (a, c or ?) and a series of
markers (..., S2, ..., SNP1, ...) with alleles coded 1
or 2 and ? for missing values. The disease gene lies
somewhere among the markers.]

17
  • Task is to find either an allele or an allele
    string (haplotype) which is overrepresented in
    disease-associated chromosomes
  • Markers may vary: SNPs, microsatellites
  • Populations vary in the strength of
    marker-to-marker LD
  • Many approaches:
  • old-fashioned allele association with some
    simple test (problem: multiple testing)
  • TDT; modelling of the LD process: Bayesian, EM
    algorithm, integrated linkage & LD

18
Limitations of the LD mapping
  • The relationship between the distance between
    the markers and the strength of LD: theoretical
    curve
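
One common textbook form of such a theoretical curve assumes that LD between two loci decays geometrically with the number of generations n and the recombination fraction θ between them: D_n = D_0 (1 - θ)^n. The sketch below assumes exactly that model (random mating, no drift, selection or new mutation), which may differ from the curve shown on the slide; the function name `ld_after` and all parameter values are illustrative.

```python
def ld_after(d0, theta, generations):
    """LD remaining after the given number of generations, assuming
    geometric decay D_n = D_0 * (1 - theta) ** n."""
    return d0 * (1.0 - theta) ** generations

# Hypothetical setting: initial D of 0.25, markers at increasing
# recombination fractions from the disease locus, 20 generations
# since the founding event.
for theta in (0.001, 0.01, 0.05, 0.1):
    remaining = ld_after(0.25, theta, 20)
    print(f"theta = {theta}: D after 20 generations = {remaining:.4f}")
```

Under this model, markers further from the disease locus (larger θ) lose their LD faster, which is why only short stretches of the founder haplotype remain after many generations.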

19
Linkage disequilibrium (D) for the African
American (red) and European (blue) populations,
binned in 5 kb classes after removing all SNPs
with minor allele frequencies less than 20%. 3429
SNPs were included. (Source: http://www.fhcrc.org/labs/kruglyak/PGA/pga.html)
20
Limitations: LD is a random process
  • LD is a continuous process, which is created and
    decreased by several factors:
  • genetic drift
  • population structure
  • natural selection
  • new mutations
  • founder effect
  • → limits the accuracy of association mapping

21
Research challenges
  • Haplotyping methods needed as prerequisite for
    association/LD methods
  • or, searching association directly from genotype
    data (without the haplotyping stage)
  • Better methods for measurement of the association
    (and/or the effects of the genes)
  • Taking disease models into consideration

22
A methodological project: Haplotype Pattern
Mining (HPM), AJHG 67:133-145, 2000
  • Search the haplotype data for recurrent patterns
    with no pre-specified sequence
  • Patterns may contain gaps, taking into
    consideration missing and erroneous data
  • The patterns are evaluated for their strength of
    association
  • Markerwise score of association is calculated

23
  • Algorithm
  • Find a set of associated haplotype patterns
  • number of gaps allowed (2)
  • maximum gap length (1 marker)
  • maximum pattern length (7 markers)
  • association threshold (χ² = 9)
  • Score loci based on the patterns
  • Evaluate significance by permutation tests
  • Extendable to quantitative traits
  • Extendable to multiple genes
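
The pattern evaluation and marker scoring steps listed above can be sketched roughly as follows. This is a simplified illustration and not the published HPM implementation. It assumes that a pattern is a start position plus a tuple of alleles with `None` for gap positions, that a missing allele in the data matches anything, that only patterns over-represented in cases count, and that a marker's score is the number of associated patterns spanning it. All names and data are hypothetical.

```python
def matches(haplotype, start, symbols):
    """True if the haplotype agrees with the pattern at every non-gap position.
    Assumption: a missing allele (None) in the data matches anything."""
    for offset, symbol in enumerate(symbols):
        allele = haplotype[start + offset]
        if symbol is not None and allele is not None and allele != symbol:
            return False
    return True

def pattern_chi_square(pattern, cases, controls):
    """Chi-square of the 2 x 2 table (pattern present/absent x case/control).
    Returns 0.0 unless the pattern is over-represented in cases."""
    start, symbols = pattern
    a = sum(matches(h, start, symbols) for h in cases)     # cases with pattern
    b = len(cases) - a                                     # cases without
    c = sum(matches(h, start, symbols) for h in controls)  # controls with pattern
    d = len(controls) - c                                  # controls without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0 or a * (c + d) <= c * (a + b):
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def marker_scores(patterns, cases, controls, n_markers, threshold=9.0):
    """For each marker, count the associated patterns that span it."""
    scores = [0] * n_markers
    for start, symbols in patterns:
        if pattern_chi_square((start, symbols), cases, controls) >= threshold:
            for marker in range(start, start + len(symbols)):
                scores[marker] += 1
    return scores

# Hypothetical toy data: 4 markers, 10 case and 10 control chromosomes.
cases = [[1, 2, 2, 1] for _ in range(10)]
controls = [[2, 1, 1, 2] for _ in range(10)]
patterns = [(1, (2, 2)), (0, (1, None, 2)), (3, (2,))]
scores = marker_scores(patterns, cases, controls, n_markers=4)
print(scores)
```

The marker with the highest score is then the natural point estimate for the location of the disease gene.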

24
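Example: a set of associated patterns
(the alignment of each pattern to markers 01-08 is not
preserved in this transcript)
Pattern  Alleles, in marker order  χ²
P1       2 1 2 2 2                 9.6
P2       2 1 2 2 2 1               9.2
P3       2 1 2 2 1 1               8.9
P4       2 1 2 1                   8.1
P5       1 1 2 2                   7.4
P6       1 2 2 1 2                 7.1
P7       2 1 2                     7.1
P8       2 1 1 2                   6.9
P9       2 1 1                     6.8

Marker   01 02 03 04 05 06 07 08
Score     5  6  7  7  6  3  2  0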
25
Pattern selection
  • The set of potential patterns is large.
  • Depth-first search for all potential patterns
  • Search parameters limit search space
  • number of gaps
  • maximum gap length
  • maximum pattern length
  • association threshold
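
A depth-first enumeration bounded by the first three of these parameters might look like the sketch below. It is a simplified illustration under stated assumptions, not the published algorithm: patterns start and end with an allele, gaps count towards the pattern length, and no pruning by the association threshold is done during the search (the enumerated patterns would be filtered by their χ² afterwards). The function name and representation are hypothetical.

```python
def enumerate_patterns(n_markers, alleles=(1, 2), max_length=7,
                       max_gaps=2, max_gap_length=1):
    """Depth-first enumeration of candidate patterns.

    Each pattern is (start, symbols), where symbols is a tuple holding an
    allele or None (gap position) for consecutive markers from start on.
    """
    patterns = []

    def extend(start, symbols, gaps_used, gap_run):
        # Record the pattern only when it ends with an allele.
        if gap_run == 0:
            patterns.append((start, tuple(symbols)))
        position = start + len(symbols)
        if position >= n_markers or len(symbols) >= max_length:
            return
        # Extend with every possible allele at the next marker.
        for allele in alleles:
            extend(start, symbols + [allele], gaps_used, 0)
        # Extend with a gap position, respecting both gap limits.
        if 0 < gap_run < max_gap_length:
            extend(start, symbols + [None], gaps_used, gap_run + 1)
        elif gap_run == 0 and gaps_used < max_gaps and max_gap_length > 0:
            extend(start, symbols + [None], gaps_used + 1, 1)

    for start in range(n_markers):
        for allele in alleles:
            extend(start, [allele], 0, 0)
    return patterns

# Small example: 3 biallelic markers, at most one gap of length one.
small = enumerate_patterns(3, max_length=3, max_gaps=1, max_gap_length=1)
print(len(small))
```

Even in this tiny case the count grows quickly with the number of markers, which is why the search parameters are needed to keep the search space manageable.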

26
Score and localization: an example
27
Permutation tests
  • random permutation of the status fields of the
    chromosomes
  • 10,000 permutations
  • HPM and marker scores recalculated for each
    permuted data set
  • proportion of permuted data sets in which score >
    true score → empirical p-value.
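
The permutation procedure can be sketched generically: shuffle the status labels, recompute the score, and report the proportion of permutations whose score exceeds the observed one. In this minimal sketch the statistic is a stand-in (a simple allele count difference at one marker) rather than the HPM marker score, and the names, labels and data are hypothetical.

```python
import random

def empirical_p_value(statistic, haplotypes, statuses,
                      n_permutations=10000, seed=0):
    """Proportion of label permutations whose score exceeds the observed score."""
    rng = random.Random(seed)
    observed = statistic(haplotypes, statuses)
    shuffled = list(statuses)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the status fields only
        if statistic(haplotypes, shuffled) > observed:
            exceed += 1
    return exceed / n_permutations

def allele_count_difference(haplotypes, statuses):
    """Stand-in score: allele 1 count at marker 0 in cases minus in controls."""
    in_cases = sum(1 for h, s in zip(haplotypes, statuses)
                   if s == "case" and h[0] == 1)
    in_controls = sum(1 for h, s in zip(haplotypes, statuses)
                      if s == "control" and h[0] == 1)
    return in_cases - in_controls

# Hypothetical toy data: allele 1 occurs only on case chromosomes.
haplotypes = [[1]] * 5 + [[2]] * 5
statuses = ["case"] * 5 + ["control"] * 5
p = empirical_p_value(allele_count_difference, haplotypes, statuses,
                      n_permutations=1000)
print(p)
```

Because the haplotypes themselves are left untouched, the marker-to-marker LD structure of the data is preserved in every permuted data set.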

28
Permutation surface (A = 7.5%). The solid line is
the observed frequency.
29
Localization power with simulated SNP data
(density: 3 SNPs per 1 cM). An isolated population
with a 500-year history was simulated. The disease
model was monogenic, with disease allele frequency
varying from 2.5% to 10% in the affecteds. 12.5% of
the data was missing. Sample size: 100 cases and
100 controls.
30
Benefits and drawbacks
  • Non-parametric, yet efficient approach: no
    disease model specification is needed
  • Powerful even with weak genetic effects and small
    data sets
  • Robust to genotyping errors, mutations, missing
    data
  • Allows for gaps in haplotypes

31
  • Flexible: easily extended to different types of
    markers, environmental covariates, and
    quantitative measurements
  • Drawback: optimal pattern search parameters may
    need to be specified case-wise
  • Drawback: no rigid statistical theory background
  • Requires a dense enough map to find the area
    where the DS gene is in LD with nearby markers

32
  • Search for the susceptibility gene
  • With good luck, and information from gene banks,
    pick the correct candidate gene
  • The genetic region with a positive linkage signal
    is saturated with markers, and this data is then
    searched for a secondary correlation: the
    correlation of marker allele(s) with the actual
    disease mutation (LD)

33
  • Improved statistical methods to detect LD
  • Terwilliger (1995)
  • Devlin, Risch, Roeder (1996)
  • McPeek and Strahs (1999)
  • Service, Lang et al. (1999)
  • Statistical power of association test statistics
  • Long, Langley (1999).
  • Review on statistical approaches to gene mapping
  • Ott, Hoh (2000)