Title: The Complexities of Data Analysis in Human Genetics
1The Complexities of Data Analysis in Human Genetics
- Marylyn DeRiggi Ritchie, Ph.D.
- Center for Human Genetics Research
- Vanderbilt University
- Nashville, TN
2Biology is complex
BioCarta
4Single nucleotide polymorphisms (SNPs)
5Mendelian Traits
[Figure: pedigree with genotypes at two loci (AA, Aa; BB, Bb, bb) and a 3 x 3 grid of two-locus genotypes, Locus 1 (AA, Aa, aa) by Locus 2 (BB, Bb, bb), with cells AABB through aabb; some genotypes are labeled "affected".]
6Complex Traits
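[Figure: pedigree with genotypes at two loci (AA, Aa, aa; BB, Bb, bb) and a 3 x 3 grid of two-locus genotypes, Locus 1 (AA, Aa, aa) by Locus 2 (BB, Bb, bb), with cells AABB through aabb; some genotypes are labeled "affected".]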
7Complex Traits
- Complex trait implies the involvement of multiple genes and/or environmental factors
- Mendelian trait implies a single mutation
- Mendelian traits are generally rare
- Complex traits are common and of substantial public health impact
8Genetic Analysis
- Two main areas of genetic analysis
- Linkage analysis
- Association analysis
- Methods have been developed for each approach for
a variety of different study designs
9Association Analysis
- In disease studies, when the disease gene is unknown, we look for association between genetic markers and the disease
- If a marker occurs more frequently or less frequently in affected individuals than in unaffected individuals, then it is associated with the disease.
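As an illustration of the idea above, here is a minimal sketch of a single-marker association test: a Pearson chi-square test on a 2 x 2 table of allele counts in cases versus controls. The slides do not name a specific test, so this choice is an assumption; the function name `allelic_chi_square` and the allele counts are made up for illustration. It also assumes every row and column total is nonzero.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    case_counts and control_counts are (count of allele A, count of allele a).
    Returns (chi-square statistic, p-value).
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under no association between allele and status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical data: 100 cases and 100 controls, i.e. 200 alleles per group.
chi2, p = allelic_chi_square(case_counts=(120, 80), control_counts=(90, 110))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A small p-value here only says that allele frequencies differ between the two groups; it does not by itself show that the marker is causal.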
10Association Analysis
- Case-control studies
- Test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals sampled randomly from the population
- Family-based studies
- Test for association between marker alleles and the disease phenotype in a group of affected individuals and unaffected family members
11Case-control data structure
12Association Analysis
- Single marker tests
- Haplotype association
- Epistasis
13Single marker tests
[Figure: markers SNP1, SNP2, and SNP3, each shown with a question mark]
14Haplotype
15Haplotype Analysis
- May be able to increase power by testing for association with marker haplotypes
- A haplotype is a block of DNA that stays intact through generations
- Marker haplotypes are not observed directly
- Use likelihood methods to infer them
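One common likelihood method for inferring haplotypes is the expectation-maximization (EM) algorithm; the slides do not say which method is meant, so treating EM as the example is an assumption. The sketch below handles only two biallelic SNPs and assumes random mating (Hardy-Weinberg proportions). The function name `em_haplotype_frequencies` and the genotype coding (number of "a" alleles at SNP 1, number of "b" alleles at SNP 2) are hypothetical. The only phase-ambiguous genotype is the double heterozygote AaBb, which may be AB/ab or Ab/aB.

```python
from collections import Counter

HAPLOTYPES = ("AB", "Ab", "aB", "ab")


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), where g1 is the number of 'a' alleles at
    SNP 1 and g2 is the number of 'b' alleles at SNP 2 (each 0, 1, or 2).
    """
    fixed = Counter()  # haplotype counts from phase-unambiguous people
    n_double_het = 0   # AaBb individuals, whose phase is unknown
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        alleles1 = "A" * (2 - g1) + "a" * g1
        alleles2 = "B" * (2 - g2) + "b" * g2
        # With at most one heterozygous locus, pairing is unambiguous.
        for x, y in zip(alleles1, alleles2):
            fixed[x + y] += 1

    total = 2 * len(genotypes)  # two haplotypes per person
    freqs = {h: 0.25 for h in HAPLOTYPES}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected counts.
        counts = {h: float(fixed[h]) for h in HAPLOTYPES}
        counts["AB"] += n_double_het * w
        counts["ab"] += n_double_het * w
        counts["Ab"] += n_double_het * (1 - w)
        counts["aB"] += n_double_het * (1 - w)
        freqs = {h: counts[h] / total for h in HAPLOTYPES}
    return freqs


# Hypothetical sample: 4 AABB, 4 aabb, and 2 AaBb individuals.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(sample))
```

A fixed iteration count keeps the sketch short; a practical implementation would stop when the likelihood or the frequencies stop changing.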
16Haplotype Analysis
17Epistasis: Gene-Gene Interactions
- The term "epistasis" was first used by William Bateson (1909)
- Literal translation is "standing upon" (i.e., one gene masks the effects of another gene).
W. Bateson, Mendel's Principles of Heredity (1909)
A.R. Templeton, in Wade et al. (eds), Epistasis and the Evolutionary Process (2000)
Cordell, Human Molecular Genetics 11:2463-8 (2002)
18Gene-gene Interactions
- Searching for gene-gene interactions brings about a whole new suite of problems and challenges
- Types of interactions
- Additive
- Multiplicative
- Epistatic
- The curse of dimensionality is a big problem
19Curse of Dimensionality
N = 100 (50 cases, 50 controls)
[Figure: samples divided among the three genotypes at SNP 1 (AA, Aa, aa)]
20Curse of Dimensionality
N = 100 (50 cases, 50 controls)
[Figure: samples divided among genotype combinations of SNP 1 and SNP 2 (BB, Bb, bb)]
21Curse of Dimensionality
N = 100 (50 cases, 50 controls)
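The arithmetic behind these slides can be sketched in a few lines. With three genotypes per SNP, the number of genotype cells grows as 3 to the power of the number of SNPs, so the same 100 samples are spread ever more thinly. The function name `average_per_cell` is hypothetical, and the rows for three and four SNPs extend the pattern as an assumption, since only the one-SNP and two-SNP figures are recoverable from the transcript.

```python
def average_per_cell(n_samples, n_snps, genotypes_per_snp=3):
    """Return (number of genotype cells, average samples per cell)."""
    n_cells = genotypes_per_snp ** n_snps
    return n_cells, n_samples / n_cells


# N = 100 as on the slides (50 cases, 50 controls).
for n_snps in (1, 2, 3, 4):
    cells, avg = average_per_cell(100, n_snps)
    print(f"{n_snps} SNP(s): {cells} cells, {avg:.1f} samples per cell on average")
```

By four SNPs there are 81 cells for 100 samples, so many cells will be empty or hold only one or two individuals.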
22Three Other Issues to Consider
- 1. Variable selection
- 2. Model selection
- 3. Interpretation
231. Variable Selection
- How can you determine which variables to select?
- Not computationally feasible to evaluate all possible combinations
- Need to select correct variables to detect interactions
24How many combinations are there?
- 500,000 SNPs span 80% of common variation in the genome (HapMap)
[Figure: Number of Possible Combinations by SNPs in each subset]
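The values in the figure are not transcribed, but the counts can be recomputed: the number of ways to choose k SNPs out of n is the binomial coefficient "n choose k". The sketch below uses Python's `math.comb` (available from Python 3.8); the function name `n_combinations` is hypothetical, and the range of subset sizes shown is an assumption.

```python
import math


def n_combinations(n_snps, subset_size):
    """Number of distinct SNP subsets of the given size."""
    return math.comb(n_snps, subset_size)


# 500,000 SNPs, as on the slide; subset sizes 1 through 5 are illustrative.
for k in range(1, 6):
    print(f"{k} SNP(s) per subset: {n_combinations(500_000, k):.3e} combinations")
```

Pairs alone already number about 1.25 x 10^11, which is why evaluating every combination quickly stops being feasible.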
262. Model Selection
- For each variable subset, evaluate a statistical model
- Goal is to identify the best subset of variables that compose the best model
27Finding the best model
Choose variable subset -> Choose statistical model -> Evaluate model fitness -> Best model
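The loop on this slide can be sketched as an exhaustive search over small SNP subsets. Everything specific here is an assumption made for brevity, not a method taken from the slides: the "statistical model" is a majority-vote classifier over genotype cells, the "fitness" is its training accuracy, and the names `fitness` and `best_subset` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def fitness(genotypes, labels, subset):
    """Training accuracy of a majority-vote classifier over genotype cells."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, label in zip(genotypes, labels):
        cells[tuple(row[i] for i in subset)][label] += 1
    # Each cell predicts its majority class, so it gets max(counts) right.
    correct = sum(max(counts) for counts in cells.values())
    return correct / len(labels)


def best_subset(genotypes, labels, subset_size):
    """Evaluate every SNP subset of the given size and keep the fittest."""
    n_snps = len(genotypes[0])
    return max(
        combinations(range(n_snps), subset_size),
        key=lambda subset: fitness(genotypes, labels, subset),
    )


# Toy data: 3 SNPs coded 0/1/2, one individual per genotype combination.
# Status depends only on an interaction between SNP 0 and SNP 2.
genotypes = list(product(range(3), repeat=3))
labels = [int((g[0] == 1) != (g[2] == 1)) for g in genotypes]

best = best_subset(genotypes, labels, subset_size=2)
print(best, fitness(genotypes, labels, best))
```

Training accuracy rewards overfitting, so a realistic version would score each subset on held-out data, for example by cross-validation. Exhaustive search is also only practical for small numbers of SNPs and small subset sizes.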
28Simple Fitness Landscape
Fitness
Model
29Complex Fitness Landscape
Fitness
Model
303. Interpretation
- Selection of best statistical model in a vast search space of possible models
- Statistical or computational model may not translate into biology
- May not be able to identify prevention or treatment strategies directly
- Wet lab experiments will be necessary, but may not be sufficient
313. Interpretation
- Strategies to assess biological interpretation of gene-gene interaction models
- Consider current knowledge about the biochemistry of the system and the biological plausibility of the models
- Perform experiments in the wet lab to measure the effect of small perturbations to the system
- Computer simulation algorithms to model biochemical systems
32Additional Challenges (true of all association studies)
- Sample size and power/type I error
- Population specific effects
- Age, gender
- Poorly matched cases and controls
- Ethnic background
- Controls must be at risk
- Bias
- Heterogeneity
33Heterogeneity
Thornton-Wells TA, Moore JH, Haines JL. Trends in Genetics, 2004;20(12):640-7.
- Phenotypic (Clinical, Trait)
- Affected individuals vary in clinical expression
- Genetic
- Different inheritance patterns for same disease
- Locus
- Different genes lead to the same disease
- Allelic
- Different alleles at the same gene lead to the same or different diseases
34New Statistical Approaches
- Data Reduction
- Combinatorial Partitioning Method (CPM)
- Multifactor Dimensionality Reduction (MDR)
- Detection of informative combined effects (DICE)
- Logic Regression
- Set Association Analysis
- Pattern Recognition
- Symbolic Discriminant Analysis (SDA)
- Cellular Automata (CA)
- Neural Networks (NN)
35Areas of Future Work (possible collaborations)
- More analytical methods for gene-gene and gene-environment interactions
- Especially including categorical and continuous variables simultaneously
- Inclusion of pathway information into analyses
- Ways of dealing with heterogeneity of all kinds