The Complexities of Data Analysis in Human Genetics - PowerPoint PPT Presentation

1
The Complexities of Data Analysis in Human
Genetics
  • Marylyn DeRiggi Ritchie, Ph.D.
  • Center for Human Genetics Research
  • Vanderbilt University
  • Nashville, TN

2
Biology is complex
BioCarta
3
(No Transcript)
4
Single nucleotide polymorphisms (SNPs)
5
Mendelian Traits
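[Figure: individuals labeled with genotypes at two loci, and a grid of Locus 1 genotypes by Locus 2 genotypes in which some cells are marked "affected"; which cells are marked cannot be recovered from this transcript.]

             Locus 2
             BB     Bb     bb
Locus 1  AA  AABB   AABb   AAbb
         Aa  AaBB   AaBb   Aabb
         aa  aaBB   aaBb   aabb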
6
Complex Traits
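[Figure: individuals labeled with genotypes at two loci, and a grid of Locus 1 genotypes by Locus 2 genotypes in which some cells are marked "affected"; which cells are marked cannot be recovered from this transcript.]

             Locus 2
             BB     Bb     bb
Locus 1  AA  AABB   AABb   AAbb
         Aa  AaBB   AaBb   Aabb
         aa  aaBB   aaBb   aabb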
7
Complex Traits
  • Complex trait implies the involvement of multiple
    genes and/or environmental factors
  • Mendelian trait implies a single mutation
  • Mendelian traits are generally rare
  • Complex traits are common and of substantial
    public health impact

8
Genetic Analysis
  • Two main areas of genetic analysis
  • Linkage analysis
  • Association analysis
  • Methods have been developed for each approach for
    a variety of different study designs

9
Association Analysis
  • In disease studies, when the disease gene is
    unknown, we look for association between genetic
    markers and the disease
  • If a marker occurs more frequently or less
    frequently in affected individuals than in
    unaffected individuals, then it is associated
    with the disease.
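
A minimal sketch of the frequency comparison described above, assuming a single biallelic marker and a Pearson chi-square test on a 2x2 table of allele counts. The function name and the counts are hypothetical and are not taken from the presentation.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test on a 2x2 table of allele counts.

    case_counts and control_counts are (count_A, count_a) tuples.
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical data: 100 cases and 100 controls, two alleles per person.
chi2, p = allelic_chi_square(case_counts=(120, 80), control_counts=(90, 110))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A small p-value here only says that allele A is more (or less) frequent in cases than in controls; it does not by itself identify the disease gene.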

10
Association Analysis
  • Case-control studies
  • Test for association between marker alleles and
    the disease phenotype in a group of affected and
    unaffected individuals sampled randomly from the
    population
  • Family-based studies
  • Test for association between marker alleles and
    the disease phenotype in a group of affected
    individuals and unaffected family members

11
Case-control data structure
12
Association Analysis
  • Single marker tests
  • Haplotype association
  • Epistasis

13
Single marker tests
[Figure: markers SNP1, SNP2, and SNP3, each shown with a question mark]
14
Haplotype
15
Haplotype Analysis
  • May be able to increase power by testing for
    association with marker haplotypes
  • A haplotype is a block of DNA that stays intact
    through generations
  • Marker haplotypes are not observed directly
  • Use likelihood methods to infer them
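
One common likelihood approach to inferring unobserved haplotypes is the EM algorithm. The sketch below is an illustration under stated assumptions, not necessarily the method used in the presentation: it assumes two biallelic SNPs, genotypes coded as the number of copies of one allele (0, 1, or 2), and random pairing of haplotypes. The function name is hypothetical.

```python
def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate two-SNP haplotype frequencies by EM.

    genotypes: list of (g1, g2), each the count (0, 1, 2) of allele "1"
    at SNP 1 and SNP 2. Only double heterozygotes (1, 1) are ambiguous:
    their haplotype pair is either 00/11 or 01/10.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # start from equal frequencies
    n_chromosomes = 2 * len(genotypes)

    def alleles(g):
        # The two alleles carried at one SNP for genotype count g.
        return (0, 0) if g == 0 else (1, 1) if g == 2 else (0, 1)

    for _ in range(iterations):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: posterior weight of the 00/11 phase.
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is determined when at most one SNP is heterozygous.
                a1, a2 = alleles(g1), alleles(g2)
                counts[(a1[0], a2[0])] += 1
                counts[(a1[1], a2[1])] += 1
        # M-step: expected counts become the new frequency estimates.
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq


# Hypothetical sample: 4 people 00/00, 4 people 11/11, 2 double heterozygotes.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(sample))
```

In this toy sample the unambiguous individuals carry only the 00 and 11 haplotypes, so the EM iterations push the double heterozygotes toward the 00/11 phase.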

16
Haplotype Analysis
17
Epistasis: Gene-Gene Interactions
W. Bateson, Mendel's Principles of Heredity (1909); A.R.
Templeton, in Wade et al. (eds), Epistasis and
the Evolutionary Process (2000)
  • The term "epistasis" was first used by William Bateson (1909)
  • Literal translation is "standing upon" (i.e., one
    gene masks the effects of another gene).

Cordell, Human Molecular Genetics 11:2463-8 (2002)
18
Gene-gene Interactions
  • Searching for gene-gene interactions brings about
    a whole new suite of problems and challenges
  • Types of interactions
  • Additive
  • Multiplicative
  • Epistatic
  • Curse of dimensionality is a big problem

19
Curse of Dimensionality
N = 100
50 Cases, 50 Controls
[Figure: SNP 1 genotypes AA, Aa, aa]
20
Curse of Dimensionality
N = 100
50 Cases, 50 Controls
[Figure: SNP 1 genotypes crossed with SNP 2 genotypes BB, Bb, bb]
21
Curse of Dimensionality
N = 100
50 Cases, 50 Controls
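
The arithmetic behind these slides can be sketched as follows, assuming three genotypes per SNP and the sample of 100 individuals shown above. The function name is hypothetical.

```python
def average_per_cell(n_samples, n_snps, genotypes_per_snp=3):
    """Return (number of genotype combinations, average samples per combination)."""
    cells = genotypes_per_snp ** n_snps
    return cells, n_samples / cells


# With N = 100, each added SNP multiplies the number of cells by 3.
for n_snps in range(1, 6):
    cells, avg = average_per_cell(100, n_snps)
    print(f"{n_snps} SNP(s): {cells} cells, {avg:.2f} individuals per cell on average")
```

With five SNPs there are already more genotype combinations (243) than individuals, so many cells must be empty.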
22
Three Other Issues to Consider
  • 1. Variable selection
  • 2. Model selection
  • 3. Interpretation

23
1. Variable Selection
  • How can you determine which variables to select?
  • Not computationally feasible to evaluate all
    possible combinations
  • Need to select correct variables to detect
    interactions

24
How many combinations are there?
  • 500,000 SNPs span 80% of common variation in
    the genome (HapMap)

[Figure: number of possible combinations versus SNPs in each subset]
25
How many combinations are there?
  • 500,000 SNPs span 80% of common variation in
    the genome (HapMap)

[Figure: number of possible combinations versus SNPs in each subset]
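
The values from the original figure are not preserved in this transcript, but the counts can be recomputed as binomial coefficients. A minimal sketch, assuming 500,000 SNPs and unordered subsets without repetition; the function name is hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math


def count_subsets(n_snps, subset_size):
    """Number of distinct unordered subsets of subset_size SNPs."""
    return math.comb(n_snps, subset_size)


for k in range(1, 6):
    print(f"{k} SNP(s) per subset: {count_subsets(500_000, k):.3e} combinations")
```

Pairs alone give 124,999,750,000 combinations (about 1.25 x 10^11), which is why exhaustive evaluation quickly becomes infeasible as the subset size grows.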
26
2. Model Selection
  • For each variable subset, evaluate a statistical
    model
  • Goal is to identify the best subset of variables
    that compose the best model

27
Finding the best model
Choose variable subset → Choose statistical
model → Evaluate model fitness → Best model
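
The loop on this slide can be sketched as an exhaustive search over variable subsets. Everything below is an illustrative assumption rather than the presenter's method: the "statistical model" is a majority vote within each genotype combination, "fitness" is its accuracy on the same data (no cross-validation), and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def fitness(rows, labels, subset):
    """Accuracy of predicting the majority class within each genotype cell."""
    tallies = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, label in zip(rows, labels):
        cell = tuple(row[i] for i in subset)
        tallies[cell][label] += 1
    correct = sum(max(tally) for tally in tallies.values())
    return correct / len(rows)


def best_subset(rows, labels, subset_size):
    """Evaluate every subset of the given size and return the fittest one."""
    n_vars = len(rows[0])
    return max(
        combinations(range(n_vars), subset_size),
        key=lambda subset: fitness(rows, labels, subset),
    )


# Toy data: case status (1) depends only on the combination of SNP 0 and
# SNP 2; no single SNP is informative on its own.
rows = [
    (0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1),  # controls
    (0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0),  # cases
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

winner = best_subset(rows, labels, subset_size=2)
print(winner, fitness(rows, labels, winner))
```

Exhaustive search is only practical for a handful of variables; with hundreds of thousands of SNPs, the number of subsets makes this loop infeasible.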
28
Simple Fitness Landscape
[Figure: fitness plotted against model]
29
Complex Fitness Landscape
[Figure: fitness plotted against model]
30
3. Interpretation
  • Selection of best statistical model in a vast
    search space of possible models
  • Statistical or computational model may not
    translate into biology
  • May not be able to identify prevention or
    treatment strategies directly
  • Wet lab experiments will be necessary, but may
    not be sufficient

31
3. Interpretation
  • Strategies to assess biological interpretation of
    gene-gene interaction models
  • Consider current knowledge about the biochemistry
    of the system and the biological plausibility of
    the models
  • Perform experiments in the wet lab to measure the
    effect of small perturbations to the system
  • Computer simulation algorithms to model
    biochemical systems

32
Additional Challenges (true of all association
studies)
  • Sample size and power/type I error
  • Population specific effects
  • Age, gender
  • Poorly matched cases and controls
  • Ethnic background
  • Controls must be at risk
  • Bias
  • Heterogeneity

33
Heterogeneity
Thornton-Wells TA, Moore JH, Haines JL. Trends in
Genetics, 2004;20(12):640-7.
  • Phenotypic (Clinical, Trait)
  • Affected individuals vary in clinical expression
  • Genetic
  • Different inheritance patterns for same disease
  • Locus
  • Different genes lead to the same disease
  • Allelic
  • Different alleles at the same gene lead to
    the same or different diseases

34
New Statistical Approaches
  • Data Reduction
  • Combinatorial Partitioning Method (CPM)
  • Multifactor Dimensionality Reduction (MDR)
  • Detection of informative combined effects (DICE)
  • Logic Regression
  • Set Association Analysis
  • Pattern Recognition
  • Symbolic Discriminant Analysis (SDA)
  • Cellular Automata (CA)
  • Neural Networks (NN)

35
Areas of Future Work (possible collaborations)
  • More analytical methods for gene-gene and
    gene-environment interactions
  • Especially including categorical and continuous
    variables simultaneously
  • Inclusion of pathway information into analyses
  • Ways of dealing with heterogeneity of all kinds