Title: The Causes of Variation
1The Causes of Variation
- Lindon Eaves and Tim York
- Boulder, CO
- March 2001
2One Issue (Among Many!)
- Identifying genes that cause complex diseases and
genes that contribute to variation in
quantitative traits
3Quantitative Trait Locus (QTL)
- Any gene whose contribution to variation in a
quantitative trait is large enough to stand out
against the background noise of other genetic and
environmental factors
4Quantitative Trait
- A continuously variable trait (in which variation
may be caused by multiple genetic and/or
environmental factors), or any categorical trait in
which differences between categories may be
mapped onto variation in a continuous trait
5Common diseases
- Estimated lifetime risk c. 60%
- Substantial genetic component
- Non-Mendelian inheritance
- Non-genetic risk factors
- Multiple interacting pathways
- Most genes still not mapped
6Examples
- Ischaemic heart disease (30-50%, F-M)
- Breast cancer (12%, F)
- Colorectal cancer (5%)
- Recurrent major depression (10%)
- ADHD (5%)
- Non-insulin dependent diabetes (5%)
- Essential hypertension (10-25%)
7Even for simple diseases: number of alleles is large (Wright et al., 1999)
- Ischaemic heart disease (LDLR): >190
- Breast cancer (BRCA1): >300
- Colorectal cancer (MLH1): >140
8Definitions
- Locus: one of c. 30-40,000 genes
- Allele: one of several variants of a specific gene
- Gene: a sequence of DNA that codes for a specific function
- Base pair: chemical "letter" of the genome (a gene has many 1000s of base pairs)
- Genome: all the genes considered together
9Finding QTLs
10Linkage
- Finds QTLs by correlating phenotypic similarity
with genetic similarity (IBD) in specific parts
of genome
11Linkage
- Doesn't depend on guessing gene
- Works over broad regions (good for getting in the
right ball-park) and whole genome ("genome scan")
- Only detects large effects (>10%)
- Requires large samples (10,000s?)
- Can't guarantee close to gene
12Association
- Looks for correlation between specific alleles
and phenotype (trait value, disease risk)
13Association
- More sensitive to small effects
- Need to guess gene/alleles ("candidate gene")
or be close enough for linkage disequilibrium
with nearby loci
- May get spurious association ("stratification"):
need to have genetic controls to be convinced
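A minimal sketch of an allele-phenotype association test on a 2x2 table of carrier status by disease status. The counts are hypothetical, and the Pearson chi-square statistic is one common choice rather than anything the slides specify.

```python
# Hypothetical counts (not from the slides): rows are carriers / non-carriers
# of one allele, columns are affected / unaffected.
carriers_affected, carriers_unaffected = 30, 70
noncarriers_affected, noncarriers_unaffected = 20, 180


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def odds_ratio(a, b, c, d):
    """Odds of disease in carriers divided by odds of disease in non-carriers."""
    return (a * d) / (b * c)


table = (carriers_affected, carriers_unaffected,
         noncarriers_affected, noncarriers_unaffected)
chi2 = chi_square_2x2(*table)
oratio = odds_ratio(*table)
print(f"chi-square = {chi2:.1f}, odds ratio = {oratio:.2f}")
```

A statistic like this cannot by itself rule out stratification: a sample that mixes subpopulations differing in both allele frequency and disease risk can produce the same table without any causal allele.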
14Reality: for complex disorders and quantitative
traits
- Large number of alleles at large number of genes
15Defining the Haystack
- 3 x 10^9 base pairs
- Markers every 6-10 kb for association in
populations with no recent bottleneck history
- 1 SNP per 721 bp (Wang et al., 1998)
- c. 14 SNPs per 10 kb: 1000s of haplotypes/alleles
- O(10^4-10^5) genes
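The figures on this slide can be checked with back-of-envelope arithmetic. The sketch below uses only the quoted numbers; the haplotype bound assumes every SNP is diallelic and that all combinations occur, which is an assumption for illustration (linkage disequilibrium makes the real count smaller).

```python
GENOME_BP = 3e9        # base pairs in the genome (from the slide)
BP_PER_SNP = 721       # 1 SNP per 721 bp (Wang et al., 1998, as quoted)
WINDOW_BP = 10_000     # a 10 kb window

snps_per_window = WINDOW_BP / BP_PER_SNP      # about 14
snps_genome_wide = GENOME_BP / BP_PER_SNP     # about 4 million

# Assumed upper bound: diallelic SNPs, every combination present.
max_haplotypes = 2 ** round(snps_per_window)

# Marker counts implied by one marker every 10 kb or every 6 kb.
markers_low = GENOME_BP / 10_000
markers_high = GENOME_BP / 6_000

print(f"SNPs per 10 kb:        {snps_per_window:.1f}")
print(f"SNPs genome-wide:      {snps_genome_wide:,.0f}")
print(f"Haplotype upper bound: {max_haplotypes:,}")
print(f"Markers needed:        {markers_low:,.0f} to {markers_high:,.0f}")
```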
16Problems
- Large number of loci and alleles/haplotypes
- Possible interactions between genes
- Possible interactions between genes and environment
- Relatively low frequencies of individual risk factors
- Functional form of genotype-phenotype relations not known
- Sorting out signal from noise: minimizing errors within budget
- Scaling of phenotype (continuous, discontinuous)
- Spurious association (stratification)
17Prepare for the worst
- Need statistical approaches that can screen
enormous numbers of loci and alleles to identify
reliably those that have impact on risk to disease
18System Chosen for Study
- 100 loci
- 20 loci affect outcome, 80 nuisance genes
- 257 alleles/locus
- Allele frequencies c. 20%-0.1%
- Disease genes each explain 2.5% of variance in risk
(c. 2-fold risk increase)
- 40 rarest alleles increase risk
- 50% of variance non-genetic
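Reading the slide's figures as percentages, the variance budget adds up: 20 disease loci at 2.5% each give 50% genetic variance, leaving 50% non-genetic. The sketch below checks this with a toy liability simulation. Drawing each locus effect from a normal distribution is a stand-in chosen for brevity, not the allele-based scheme the authors used.

```python
import random
import statistics

random.seed(1)
N_PEOPLE = 20_000
N_CAUSAL = 20            # loci that affect outcome (from the slide)
VAR_PER_LOCUS = 0.025    # 2.5% of variance per disease locus
VAR_ENV = 0.50           # 50% non-genetic variance

genetic, liability = [], []
for _ in range(N_PEOPLE):
    # Assumption: each locus contributes an independent normal effect.
    g = sum(random.gauss(0.0, VAR_PER_LOCUS ** 0.5) for _ in range(N_CAUSAL))
    e = random.gauss(0.0, VAR_ENV ** 0.5)
    genetic.append(g)
    liability.append(g + e)

expected_genetic_share = N_CAUSAL * VAR_PER_LOCUS / (N_CAUSAL * VAR_PER_LOCUS + VAR_ENV)
observed_genetic_share = statistics.pvariance(genetic) / statistics.pvariance(liability)

print(f"expected genetic share: {expected_genetic_share:.2f}")
print(f"observed genetic share: {observed_genetic_share:.2f}")
```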
20It's a Mess!
- Don't know which genes might have clues
- Don't know which alleles: unordered categories
- >250^100 locus/allele combinations
- More predictor combinations than people ("curse
of dimensionality")
- Reality worse
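To give a sense of scale for the "curse of dimensionality", the sketch below counts the ways of picking one allele at each locus in the study system (257 alleles at each of 100 loci). Counting one allele per locus, rather than diploid genotypes, is a simplifying assumption, as is the rough world-population figure.

```python
ALLELES_PER_LOCUS = 257   # from the study system
N_LOCI = 100              # from the study system
WORLD_POPULATION = 6e9    # rough figure, assumed for comparison only

# Python integers are arbitrary precision, so this is exact.
combinations = ALLELES_PER_LOCUS ** N_LOCI
digits = len(str(combinations))

print(f"combinations: a {digits}-digit number")
print(f"exceeds 250^100: {combinations > 250 ** N_LOCI}")
print(f"exceeds world population: {combinations > WORLD_POPULATION}")
```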
21Problems
- Informatics: large volume of data
- Computational: large number of combinations
- Statistical: large number of chance associations
- Genetic-epidemiological: secondary associations
26How are we going to figure it out?
27Data Mining (Steinberg and Cartel)
- Attempt to discover possibly very complex
structure in huge databases (large number of
records and large number of variables)
- Problems include classification, regression,
clustering, association (market analysis)
- Need tools to partially or fully automate the
discovery process
- Large databases support search for rare but
important patterns and interactions (epistasis,
GxE)
28Some Approaches to DM
- Logistic regression
- Neural networks
- CART (Breiman et al. 1984)
- MARS (Friedman, 1991)
29MARS
- Multivariate
- Adaptive
- Regression
- Splines
30Key references
- Friedman, J.H. (1991) Multivariate Adaptive
Regression Splines (with discussion), Annals of
Statistics, 19: 1-141.
- Steinberg, D., Bernstein, B., Colla, P., Martin,
K., Friedman, J.H. (1999) MARS User Guide. San
Diego, CA: Salford Systems.
31 The MARS Advantage
- Allows large number of predictors
(loci/alleles/environments) to be screened
- Non-parametric
- Continuous and discontinuous outcomes
- Systematic search for detailed interactions
- Testing and cross-validation
- Continuous and categorical predictors
- Decides best form of relationship
32Example Regression Spline: Impact of Non-Retail
Business on Median Boston House Prices
[Figure: median house price against industrial business, with the spline knot marked]
33Fitting functions with Splines
- Piece-wise linear regression.
- Simplest form: allow the regression to bend.
- Knots define where the function changes behavior.
- Local fit vs. global fit.
[Figure: actual data overlaid with a spline with 3 knots]
34One predictor example
- True knots at 20 and 45 (left)
- Best single knot at about 35 (right)
[Figure: two panels plotting Y against X, with X running from 10 to 60]
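The idea of searching for a best single knot can be sketched as a grid search: for each candidate knot, fit a one-hinge regression by least squares and keep the knot with the smallest residual sum of squares. The data-generating curve below (flat, rising between the true knots at 20 and 45, then flat) is hypothetical, and the one-hinge model is a simplification, so the knot it selects need not match the value shown on the slide.

```python
import random


def hinge(x, c):
    """Basis function max(0, x - c)."""
    return max(0.0, x - c)


def true_curve(x):
    # Hypothetical shape with knots at 20 and 45: flat, rising, flat.
    return hinge(x, 20) - hinge(x, 45)


def sse_single_knot(xs, ys, c):
    """Residual sum of squares for y = b0 + b1 * max(0, x - c), fitted by least squares."""
    h = [hinge(x, c) for x in xs]
    n = len(xs)
    mean_h, mean_y = sum(h) / n, sum(ys) / n
    s_hh = sum((v - mean_h) ** 2 for v in h)
    s_hy = sum((v - mean_h) * (y - mean_y) for v, y in zip(h, ys))
    slope = s_hy / s_hh if s_hh > 0 else 0.0
    intercept = mean_y - slope * mean_h
    return sum((y - intercept - slope * v) ** 2 for v, y in zip(h, ys))


random.seed(0)
xs = [random.uniform(10, 60) for _ in range(500)]
ys = [true_curve(x) + random.gauss(0, 1.0) for x in xs]

candidates = range(11, 60)
best_knot = min(candidates, key=lambda c: sse_single_knot(xs, ys, c))
print("best single knot:", best_knot)
```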
36Re-express variables as basis functions
- Done to generalize the search for knots.
Difficult to illustrate splines with > one
dimension.
- Core building block of MARS model:
- max(0, X - c)
- Example: BF1 = max(0, ENV - 5)
- BF2 = max(0, ENV - 8)
- Slope of the fitted function: 0 for ENV < 5
- β1 for 5 < ENV < 8
- β1 + β2 for ENV > 8
- Weighted sum of basis functions used to
approximate the global function.
- i.e. y = constant + β1 BF1 + β2 BF2 + error
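The two basis functions on this slide can be written out directly. In the sketch below the constant and the two weights are hypothetical values chosen for illustration; the point is that the slope of the fitted function is 0 below the first knot, the first weight between the knots, and the sum of both weights above the second knot.

```python
def bf1(env):
    """BF1 = max(0, ENV - 5)."""
    return max(0.0, env - 5.0)


def bf2(env):
    """BF2 = max(0, ENV - 8)."""
    return max(0.0, env - 8.0)


# Hypothetical coefficients, not taken from the slides.
CONSTANT, BETA1, BETA2 = 10.0, 2.0, -3.0


def predict(env):
    """Weighted sum of basis functions: constant + beta1 * BF1 + beta2 * BF2."""
    return CONSTANT + BETA1 * bf1(env) + BETA2 * bf2(env)


def slope(env, step=1e-3):
    """Numerical slope of the fitted function at env (central difference)."""
    return (predict(env + step) - predict(env - step)) / (2 * step)


for env in (3.0, 6.0, 10.0):
    print(f"ENV = {env:4.1f}  prediction = {predict(env):5.1f}  slope = {slope(env):5.2f}")
```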
37Adaptive Spline
- Optimal placement of knots
- Optimal selection of predictors and interactions
38Adaptive splines
- Problem
- What is the optimal location of knots?
- How many knots do you need?
- Best to test all variable / knot locations, but
computationally burdensome.
- MARS solution
- Develop an overfit model with too many knots.
- Remove all knots that contribute little to model
quality.
- The final model should have approximately correct
knot locations.
39Optimal
- Explains salient features of data
- Ignores irrelevant features
- Stands up to replication
- Several ways to operationalize mathematically
40MARS 2-step model building
- Step 1. Growing phase
- Begins with only a constant in the model.
- Serially adds basis functions up to a user-defined
limit, testing each for improvement when added to
the model.
- Basis functions are added until an overly large
model is found (theoretically, the true model is
captured).
- Step 2. Pruning phase
- Delete the basis function that contributes least to
model fit.
- Refit the model and delete the next term; repeat.
- The most parsimonious model is selected.
- GCV criterion to select optimal model (Craven
1979).
- MARS option uses 10-fold cross-validation to
estimate DF.
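The two phases can be sketched for a single predictor. This is a toy illustration under stated assumptions, not the MARS implementation: candidate knots sit on an integer grid, only right-facing hinges are used, least squares is solved through the normal equations, and the GCV penalty per knot (3.0) is an assumed value. The simulated data and all function names are hypothetical.

```python
import random


def fit_sse(columns, y):
    """Residual sum of squares from least squares of y on the given columns."""
    k = len(columns)
    # Augmented normal equations [X'X | X'y]; the tiny ridge keeps pivots nonzero.
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) + (1e-8 if i == j else 0.0)
          for j in range(k)] + [sum(u * v for u, v in zip(columns[i], y))]
         for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(b * c[n] for b, c in zip(beta, columns))) ** 2
               for n, yi in enumerate(y))


def gcv(sse, n, n_hinges, penalty=3.0):
    """Simplified GCV: mean squared error inflated by the DF charged to the model."""
    charged_df = 1 + n_hinges + penalty * n_hinges  # constant + hinges + knot penalty
    return (sse / n) / (1.0 - charged_df / n) ** 2


def design(xs, knots):
    """Constant column plus one hinge column max(0, x - c) per knot."""
    return [[1.0] * len(xs)] + [[max(0.0, x - c) for x in xs] for c in knots]


random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [10 + 2 * max(0, x - 5) - 3 * max(0, x - 8) + random.gauss(0, 0.5) for x in xs]
n, candidates, max_hinges = len(xs), range(1, 10), 6

# Step 1, growing: repeatedly add whichever hinge lowers the SSE most.
knots = []
while len(knots) < max_hinges:
    unused = [c for c in candidates if c not in knots]
    knots.append(min(unused, key=lambda c: fit_sse(design(xs, knots + [c]), ys)))
grown = list(knots)

# Step 2, pruning: drop the least useful hinge, refit, and track the best GCV.
best_knots = list(knots)
best_gcv = gcv(fit_sse(design(xs, knots), ys), n, len(knots))
while knots:
    drop = min(knots, key=lambda c: fit_sse(design(xs, [k for k in knots if k != c]), ys))
    knots.remove(drop)
    score = gcv(fit_sse(design(xs, knots), ys), n, len(knots))
    if score < best_gcv:
        best_knots, best_gcv = list(knots), score

print("grown:", sorted(grown), "kept after pruning:", sorted(best_knots))
```

The growing phase only ever looks at fit, so it keeps adding hinges up to the limit; the pruning phase is where the charged degrees of freedom decide how many survive.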
41Cross-validation
- Protects against overfitting the data.
- Develops a model on a subset of the data; tests fit
on the remaining set.
- Systematically assesses how many DF to charge
each variable entered into the model.
- Adding a basis function will always lower MSE.
- This reduction is penalized by the DF charged.
- Only the backwards deletion step is penalized.
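The "develop on a subset, test on the remainder" idea can be sketched as plain 10-fold cross-validation. The example below compares one-hinge models with different knots by held-out mean squared error; it illustrates cross-validation in general, not the specific way MARS uses it to estimate degrees of freedom. The simulated data and function names are hypothetical.

```python
import random


def fit_hinge(xs, ys, knot):
    """Closed-form least squares for y = b0 + b1 * max(0, x - knot)."""
    h = [max(0.0, x - knot) for x in xs]
    n = len(xs)
    mean_h, mean_y = sum(h) / n, sum(ys) / n
    s_hh = sum((v - mean_h) ** 2 for v in h)
    b1 = sum((v - mean_h) * (y - mean_y) for v, y in zip(h, ys)) / s_hh if s_hh > 0 else 0.0
    return mean_y - b1 * mean_h, b1


def cv_mse(xs, ys, knot, folds=10):
    """Mean squared prediction error on held-out folds."""
    idx = list(range(len(xs)))
    random.Random(0).shuffle(idx)
    total = 0.0
    for f in range(folds):
        test = idx[f::folds]                      # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_hinge([xs[i] for i in train], [ys[i] for i in train], knot)
        total += sum((ys[i] - b0 - b1 * max(0.0, xs[i] - knot)) ** 2 for i in test)
    return total / len(xs)


rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [10 + 2 * max(0.0, x - 5) + rng.gauss(0, 1.0) for x in xs]  # true knot at 5

scores = {knot: cv_mse(xs, ys, knot) for knot in (2, 5, 8)}
for knot, score in scores.items():
    print(f"knot {knot}: held-out MSE = {score:.2f}")
```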
42Genetic Example: regression spline for
multi-allelic locus
46So Far
- Does quite well for largish random samples and
continuous outcomes.
- What about disease (dichotomous) outcomes?
- What about selected (extreme) samples?
50So?
- Can detect signal due to relatively large numbers
of relatively rare unordered alleles of
relatively small effect at relatively many loci
amid the noise of still more loci and
environmental effects
- MARS may provide elements for analyzing such
data in this and similar contexts (?microarrays,
SNPs, expression arrays?)
- Works with continuous data on random samples and
dichotomous outcomes on selected samples
51GAW12 Simulated data
- Provided for two populations
- large general pop.
- pop. isolate founded 20 generations ago by 100
ind.
- limited migration b/w.
- Common disease
- prevalence of 25%; increases with age
- middle age disease, some early onset
- more common in females than males
52- General population
- 7 genes simulated
- 13 to 20 kb
- 12 to 40 diallelic sites at start of simulation
- passed through 120K to 200K generations of random mating
- mutation, intragenic recombination, gene
conversion allowed at diff. rates for diff.
genes
- each gene contains a 500 bp recombination hotspot
(15% to 65% of intragenic recombinations)
- 8 to 13 mutational hotspots per gene (6 to 300
times the mutation rate)
- 25 of genes isolated for 35K to 85K generations.
53Parameter | GENE1 | GENE5
Length (kb) | 20 | 17
Start of SNP | 40 | 20
Random Mating | 150K | 165K
Rec. rate | .01 | .002
Mutation rate | 4 x 10^-8 | 6 x 10^-9
Gene conv. | .01 | .002
Mean length conv. | 1000 | 1600
Start of rec. hotspot / % in hotspot | 10349 / 50 | 4197 / 65
Mutat. hotspots | 13 | 8
Incr. mut. rate | 200 | 20
54- Isolate population
- loosely modeled after pop. history of Old Order
Amish in Lancaster Co., PA
- Founders: 200 chr.s sampled from general pop.
- 20,000 chr.s sampled from general pop. to create
an outside pop.
- Isolate: children <12, mean 4. Outside: children
<12, 1
- migration allowed b/w pop.s at each generation
- rate: migrants 5% of current isolate size
- evolution progressed for 20 generations with
recombination (no mutations, no intragenic rec.)
- founders were then sampled to create the isolate
pop.
55- 23 extended pedigrees with 1,497 individuals from
each population (1,000 living).
- Pedigrees include the proband, spouse, and all
first, second, and third degree relatives of
each.
- Provided for living individuals:
- affected status, fid, mid, sex
- age at last exam
- age of onset if affected
- 5 quantitative risk factors
- 2 environmental risk factors (binary and
quantitative)
- marker genotype for 1 cM whole genome screen:
2,855 total markers with an average of 9.1
alleles
- sequence data for 7 candidate genes: 1,176
sequence variants
- 50 replicates provided for each pop.
57Sequence data
- Isolate and General population
- Intron and Exon sequence from 7 candidate genes.
- Kept only those individuals with sequence data.
Each set contains 7,000 individuals (64 MB MARS
limit).
- 5 sets of 7 randomly selected replicates (used 35
of 50 replicates provided)
- 5 associated quantitative risk factors.
- Covariates included E1, E2, Age, Sex, Age of
onset.
58- Affected status: binary.
- Exon sequence coded for each individual as having
0, 1, or 2 ancestral variants.
- If intron variant present (whether 1 or 2 copies),
given a value of 1. Coded in binary form as
haplotypes of length four.
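A minimal sketch of the two per-site coding rules described here. It assumes a genotype is stored as a pair of allele letters, which is a representation chosen for illustration; the function names are hypothetical, and the further recoding into binary haplotypes of length four is not attempted.

```python
def code_exon(genotype, ancestral):
    """Count of ancestral alleles (0, 1, or 2) in a two-allele genotype."""
    return sum(1 for allele in genotype if allele == ancestral)


def code_intron(genotype, variant):
    """1 if the variant is present in one or both copies, otherwise 0."""
    return 1 if variant in genotype else 0


# Hypothetical genotypes, each a pair of allele letters.
print(code_exon(("A", "G"), ancestral="A"))   # one ancestral copy
print(code_exon(("A", "A"), ancestral="A"))   # two ancestral copies
print(code_intron(("T", "T"), variant="T"))   # variant present
print(code_intron(("C", "C"), variant="T"))   # variant absent
```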
59[Figure: diagram with nodes Aff Status, Age of onset, Liability, Q1-Q5, MG1-MG6, CG1, CG2, CG6, E1, E2 and Age]
60Outcome | True Model | Isolate pop. | General pop.
AFF | E1, Q1-Q5, MG6 557 | E1, Q1-Q5, MG6 (435 547 548 557) 5244 5268 6912 7281 | E1, Q1-Q5, MG6 (27 57 76 110)(435 547 548 557)
Q1 | E1, MG1 5782 | MG1 5007 | MG1 5782
Q2 | E1, MG1 5782 | E1, MG1 5007 | E1, MG1 5782
Q3 | E1, E2 | E1, E2 | E1, E2
Q4 | E1, AGE | E1, AGE | E1, AGE
Q5 | E1, MG5 multi-allelic | E1, MG5 1289 3745 8657 8817 | E1, MG5 1289 3745 8657 8817
ONSET | MG6 557 | MG6 15625 | none
61Conclusions
- MARS works well to capture functional form of
disease etiology in simulated data with
dichotomous outcome.
- In most cases was within 1 kb of functional
variant.
- Generated a predictive model that was replicable
in at least 4 of 5 data sets.
- Highly interpretable output in the form of basis
functions and Importance values.
- MARS may have problems with highly correlated
variables.
- Pattern-recognition tools can be useful to narrow
down search for genes.
62Comparison of MARS and ANN
Both are non-parametric estimation schemes; both allow for a high number of input predictors, interactions, and non-linear mappings.
MARS | ANN
Maximum allowable basis functions and degree of interactions need to be specified. | Type of network architecture needs to be specified.
Models are developed fast. | Models are trained more slowly (DeVeaux et al. 1993).
Backwards elimination stage to remove unnecessary basis functions. | Problem of overfitting the data, esp. with small data sets.
Easily interpretable basis functions. Local interpretation of the function. | Black box: weights have little meaning. Difficult to interpret predictor contribution.
Penalizes model complexity. Tries to develop a low-order, interpretable model. | Non-linear transformations and high connectivity allow for greater complexity.
63But the Haystack is Very Large
- Reality worse than simulations
- More alleles at more loci
- Phenotypes more complex (multivariate)
- More irrelevant loci (?1000s)
- Interactions with environment and between loci
- Spurious associations
64It Needs Collaboration
- Clinical
- Statistical
- Molecular
- Epidemiological
- Physiological
- Developmental
- Informational
- Evolutionary