Title: An introduction to Genetical Genomics and Systems Genetics to better understand cancer and other com
1An introduction to Genetical Genomics and Systems
Genetics to better understand cancer and other
complex diseases
- Ina Hoeschele
- September 2006
2Genetic base of complex disease traits
- Understanding the genetic determination of
complex, disease-related traits is a long-standing
goal
- Some standard approaches
- Mapping of quantitative trait loci (QTL) via
linkage or association mapping
- In human populations
- In animal models of specific diseases
- Collaboration with Professor Miller (Cancer
Biology) to identify candidate genes responsible
for the observed phenotypic variance in lung
tumor incidence of adult mice following in utero
exposure to chemical carcinogens
- Still no ready-to-use software for some animal
models (AIL, RIX)
3Genetic base of complex disease traits
- Some standard approaches
- Gene expression studies
- Global gene expression profiling of individuals
with a disease and individuals without the
disease, or two groups of individuals with
different sub-types of a disease (observational
data)
- List of differentially expressed genes (candidate
genes for QTL)
- Identification of differentially expressed and/or
differentially regulated gene networks
- Collaboration with Professor Miller:
identification of time-dependent alterations in
signaling networks that drive tumor progression
from benign to advanced
4Genetic base of complex disease traits
- Systems Biology: reconstruction of cellular
networks involving genes, proteins, metabolites
- Gene networks
- phenomenological, not physical
- network of interactions or regulations among
genes
- provide valuable information about the genetic
architecture of complex diseases
- classical genetics concepts such as dominance and
epistasis can be understood in terms of gene
networks and their properties
- important practical applications: identification
of drug targets
6Gene networks from observational data
- This network can be obtained using linear
graphical methods
- This is an interaction network: edges between any
pair of genes that directly interact with each
other
- the edges are undirected: it is not a causal or
regulatory network
- it does not tell us which gene regulates which
other gene(s)
- We compute such a network as an Undirected
Dependency Graph
7Gene networks from observational data
- Undirected Dependency Graph (UDG)
- Based on partial correlations
- Corr(gene A, gene B) = 0.7, Corr(A,C) = 0.4
- A - B - C: Corr(A,C | B) = 0
- Sometimes, we can find regulation
- A → C ← B: A and B jointly regulate C
- A and B do not regulate each other
- Corr(A,B | C) ≠ 0 BUT Corr(A,B) = 0
- Observational data
- Power and Sample Size
- Several simulation studies: 100 genes, 200
edges, sample sizes 50 to 200; power < 0.20 to
0.60
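The two partial-correlation patterns on this slide can be checked numerically. The following is a minimal sketch on simulated data; the function name `partial_corr`, the effect sizes, and the sample size are illustrative assumptions, not values from the talk. It uses the standard first-order partial correlation formula.

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation Corr(x, y | z)."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(0)
n = 5000  # large n so the patterns are easy to see; real studies have far fewer samples

# Chain A - B - C: A and C are related only through B
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
chain_marginal = np.corrcoef(a, c)[0, 1]   # clearly non-zero
chain_partial = partial_corr(a, c, b)      # close to zero

# Joint regulation A -> C <- B: A and B are independent regulators of C
a2 = rng.normal(size=n)
b2 = rng.normal(size=n)
c2 = a2 + b2 + rng.normal(size=n)
joint_marginal = np.corrcoef(a2, b2)[0, 1]  # close to zero
joint_partial = partial_corr(a2, b2, c2)    # clearly non-zero (negative here)

print(chain_marginal, chain_partial, joint_marginal, joint_partial)
```

In the chain, conditioning on B removes the A-C correlation; in the joint-regulation case, conditioning on C induces a correlation between A and B that is absent marginally.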
8Causal, regulatory gene networks
- Also represented by graph with gene nodes but now
edges are directed
- A → B: gene A regulates gene B
- Two approaches
- Time series experiment
- Specific perturbation experiment
- One-at-a-time specific perturbations in same
genetic background (several genes are knocked
down, one at a time, RNA interference)
- Multifactorial perturbations: genetical
genomics or expression genetics (Jansen and
Nap 2003 Trends in Genetics)
9Design and Data
- Specific Perturbation Experiment
10Design and Data
- Multi-factorial perturbation experiment:
Genetical Genomics (M = marker = tested genome
location)
11Genetical Genomics / Systems Genetics
- Segregating population of n x 100 individuals
(n = 1, 2, 3, ...); each individual is
- DNA marker genotyped (genome-wide)
- Expression profiled (genome-wide)
- Phenotyped for disease-complex related traits
We need to find out which DNA markers are
expression QTL (eQTL)
12Genetical Genomics / Systems Genetics
- Goal: causal, regulatory gene network
- Identification of DNA markers affecting the
expression profiles (etraits) of genes: eQTLs
- Identification of regulator-target gene pairs
from
- set of genes physically located in an eQTL region
(candidate regulators)
- set of genes affected by the eQTL region
- Construction of Encompassing Network (EN)
- Search for set of optimal networks within EN
13Genetical Genomics / Systems Genetics: Identification of expression QTLs
- Identification of DNA marker influencing an
etrait via linear regression
- $E_{ig} = \mu_g + b_g X_{ik} + e_{ig}$, with $X_{ik} = 0$ (AA) / $1$ (BB)
- $H_0: b_g = 0$
- Standard approach: genome-wide search of each
etrait separately
- Mapping of principal components based on PCA of
all genes or subsets of genes obtained by cluster
analysis
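As an illustration of the single-marker regression on this slide, here is a minimal sketch on simulated data. The function name `marker_regression`, the simulated effect size, and the noise level are assumptions made for the example; only the model form (etrait regressed on a 0/1 marker code, testing the slope against zero) comes from the slide.

```python
import numpy as np

def marker_regression(expr, geno):
    """OLS fit of expr = mu + b * geno + e; returns (b_hat, t statistic for H0: b = 0)."""
    expr = np.asarray(expr, dtype=float)
    geno = np.asarray(geno, dtype=float)
    n = expr.size
    X = np.column_stack([np.ones(n), geno])      # intercept and marker code (0 = AA, 1 = BB)
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    resid = expr - X @ coef
    sigma2 = resid @ resid / (n - 2)             # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the coefficient estimates
    b_hat = coef[1]
    return b_hat, b_hat / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
n = 112                                          # hypothetical number of segregants
geno = rng.integers(0, 2, size=n)                # simulated marker genotypes
expr_linked = 5.0 + 1.0 * geno + rng.normal(scale=0.5, size=n)  # etrait with a marker effect
expr_null = 5.0 + rng.normal(scale=0.5, size=n)                 # etrait with no marker effect

b_linked, t_linked = marker_regression(expr_linked, geno)
b_null, t_null = marker_regression(expr_null, geno)
print(b_linked, t_linked, b_null, t_null)
```

A genome-wide search would repeat this test for every marker and etrait, which is why a multiple-testing correction is needed in practice.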
14Genetical Genomics / Systems Genetics: Genetic Mapping of Expression QTLs
- Cis- and trans-eQTL mapping
- Cis-eQTL
- eQTL affects the expression of a gene located at
the eQTL; the eQTL is a DNA polymorphism in the
promoter of the gene
- Test only the marker(s) closest to the gene (not
genome-wide)
- Cis-Trans-Regulation
- Test the effects of any cis-eQTL on other genes:
cis-eQTL → Gene A → Gene B
- cis-eQTL will affect expression of gene B
15Genetical Genomics / Systems Genetics: Genetic Mapping of Expression QTLs
- Cis- and trans-eQTL mapping
- Trans-eQTL: A → B ← trans-eQTL_A
- Coding region polymorphism in gene A which
regulates gene B
- Test jointly the effect of candidate regulator
gene (k = A) and its nearest DNA marker on
expression of target gene (g = B)
- $E_{ig} = \mu_g + b_{1g} X_{ik} + b_{2g} E_{ik} \; (+\, b_{3g} E_{ik} \times X_{ik}) + e_{ig}$
- $X_{ik} = 0/1$
- Intersection-Union test to identify cases where
$b_{1g}$ and $b_{2g}$ are both non-zero
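The intersection-union idea can be sketched as follows: the joint claim "both coefficients are non-zero" is accepted only if each coefficient is individually significant, so the combined p-value is the larger of the two. This is a minimal sketch under stated assumptions: the optional interaction term is omitted, p-values use a normal approximation to the t distribution, and all function names and simulated values are hypothetical.

```python
import math
import numpy as np

def ols_t_stats(y, X):
    """t statistics of all OLS coefficients for y = X @ coef + e."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef / se

def two_sided_p_normal(t):
    """Two-sided p-value from a normal approximation (assumption; fine for n around 100)."""
    return math.erfc(abs(t) / math.sqrt(2))

def intersection_union_p(target_expr, marker, regulator_expr):
    """p-value for H0: b1 = 0 or b2 = 0, i.e. the maximum of the two individual p-values."""
    X = np.column_stack([np.ones(len(marker)), marker, regulator_expr]).astype(float)
    t = ols_t_stats(np.asarray(target_expr, dtype=float), X)
    return max(two_sided_p_normal(t[1]), two_sided_p_normal(t[2]))

rng = np.random.default_rng(2)
n = 112
marker = rng.integers(0, 2, size=n)     # marker nearest to candidate regulator A
expr_a = rng.normal(size=n)             # expression of A (not driven by the marker here)
noise = rng.normal(scale=0.5, size=n)

expr_b_both = 1.0 + 0.8 * marker + 0.8 * expr_a + noise   # marker and A both affect B
expr_b_marker_only = 1.0 + 0.8 * marker + noise           # only the marker affects B

p_both = intersection_union_p(expr_b_both, marker, expr_a)
p_marker_only = intersection_union_p(expr_b_marker_only, marker, expr_a)
print(p_both, p_marker_only)
```

Taking the maximum makes the test conservative: a strong marker effect alone cannot produce a small combined p-value.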
16eQTL overlap for SPA, PC-mapping, Cis-mapping and
Trans-mapping
[Venn diagram: overlap of the eQTL sets found by SPA, Cis-mapping, PC-mapping and Trans-mapping; the region counts cannot be assigned to regions from the transcript]
17Genetical Genomics / Systems Genetics: Identification of regulator-target pairs
- eQTL
- Regulators: genes physically located in eQTL
- Targets: genes whose etraits are affected by
eQTL
18Genetical Genomics / Systems Genetics: Identification of regulator-target pairs
- Cis- versus cis-trans regulation
19Genetical Genomics / Systems Genetics: Identification of regulator-target pairs
- Cis-trans versus trans regulation
20Genetical Genomics / Systems Genetics: Identification of regulator-target pairs
- Trans regulation (is gene A a trans-regulator?)
- Intersection-Union test for b1D and b2D
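21Regulator-target pairs - SPA
[Figure: regulator-target pairs for SPA; counts not recoverable from the transcript]
22Regulator-target pairs - Cis mapping
[Figure: regulator-target pairs for Cis mapping; counts not recoverable from the transcript]
23Regulator-target pairs - PC mapping
[Figure: regulator-target pairs for PC mapping; counts not recoverable from the transcript]
24Regulator-target pairs - Trans-mapping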
25Genetical Genomics / Systems Genetics: Encompassing (Directed) Network EDN
- EDN is obtained by combining all retained
regulator-target pairs
- cis-eQTL → target gene
- cis-regulated gene → target gene (cis-trans
regulation)
- trans-eQTL → target gene
- regulator gene → target gene (trans-regulation)
- Next step: constrained network search within the
EDN
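Combining the retained regulator-target pairs into one directed network amounts to taking the union of several edge lists. A minimal sketch, with hypothetical gene and eQTL names and a hypothetical function `build_encompassing_network`, might look like this:

```python
from collections import defaultdict

def build_encompassing_network(*pair_lists):
    """Union of (regulator, target) pairs as an adjacency map: regulator -> set of targets."""
    targets_of = defaultdict(set)
    for pairs in pair_lists:
        for regulator, target in pairs:
            targets_of[regulator].add(target)   # duplicates across lists collapse in the set
    return dict(targets_of)

# Hypothetical pairs retained by the different mapping steps
cis_pairs = [("eQTL_1", "geneA")]
cis_trans_pairs = [("geneA", "geneB"), ("geneA", "geneC")]
trans_pairs = [("eQTL_2", "geneC"), ("geneD", "geneC"), ("geneA", "geneB")]

edn = build_encompassing_network(cis_pairs, cis_trans_pairs, trans_pairs)
n_edges = sum(len(targets) for targets in edn.values())
top_regulator = max(edn, key=lambda r: len(edn[r]))
print(n_edges, top_regulator)
```

Summary statistics such as the number of edges or the regulator with the most targets then follow directly from the adjacency map.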
26Yeast Data Encompassing Network
- Yeast data of Brem and Kruglyak (2005)
- 112 haploid offspring of cross between wild and
laboratory strain of yeast
- Expression-profiled for 6000 genes
- DNA marker genotyped for 3000 markers
- Used 4589 etraits and 2956 markers
- Encompassing network
- 28,609 regulator-target pairs
- 4,274 gene nodes
- 2118 gene regulators, 4116 gene targets
27Yeast Data Encompassing Network
- Encompassing network
- Regulator with most targets (PHM7): 468
- Target with most regulators (YLR152C): 32
- Confirmed regulators
- AMN1: top cis-trans regulator with 408 targets
- MAK5: 110 trans targets
- GPA1: 60 targets (half trans)
- Heme-dependent transcription factor HAP1: 141
targets (100 cis-trans)
28Genetical Genomics / Systems Genetics: Gene network reconstruction
- Popular tool: Bayesian Network (BN)
- Represents conditional independence: A → B → C
- Suitable for noisy data
- Search among equivalence classes (equivalent
models cannot be distinguished based on available
data): A → B, A ← B, or A - B
- Limited to DAG (directed acyclic graph): no
cycles or feedback loops
- Time dimension: A1 → B2 → C3 → A4 (dynamic BN)
- Usually, the expression data are discretized
[Figure: example graph with nodes A, B, C]
29Genetical Genomics / Systems Genetics: Gene network reconstruction
- Our tool: Structural Equation Modeling (SEM)
- Represents conditional independence: A → B → C
- Uses continuous expression data with normality
assumption (robustness?)
- Suitable for noisy data
- Can model DCG (directed cyclic graph)
- Search among models, not equivalence classes
- Edge directions are fixed
- Among two equivalent models with different
numbers of edges (possible in DCG), we prefer the
sparser
- No efficient algorithm for identification of
equivalence classes
30Genetical Genomics / Systems Genetics: Gene network reconstruction
- Structural Equation Modeling (SEM)
- y_n = expression data, x_n = eQTL genotype codes
[Path diagram: regulator gene A, regulator gene C, target gene B, eQTL 1, eQTL 2; edge labels bBA, bBC, fB1, fC2, hBA1, kB12]
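The slide does not spell out the model equation. A commonly used linear SEM form, assumed here, is y = B y + F x + e, where B holds gene-to-gene effects (such as bBA, bBC) and F holds eQTL-to-gene effects (such as fB1, fC2). The sketch below simulates from that assumed form; the coefficient values, the added feedback edge, and the function name `simulate_sem` are hypothetical, and the terms labeled hBA1 and kB12 in the diagram are left out.

```python
import numpy as np

def simulate_sem(B, F, X, noise_sd, rng):
    """Draw Y (samples x genes) from y = B y + F x + e, i.e. y = (I - B)^{-1} (F x + e)."""
    n_genes = B.shape[0]
    E = rng.normal(scale=noise_sd, size=(X.shape[0], n_genes))
    A = np.linalg.inv(np.eye(n_genes) - B)   # requires I - B to be invertible
    return (X @ F.T + E) @ A.T

# Gene order: A, B, C. Entry B_mat[i, j] is the effect of gene j on gene i.
B_mat = np.array([
    [0.0, 0.3, 0.0],   # hypothetical feedback B -> A, which makes the graph cyclic
    [0.5, 0.0, 0.4],   # A -> B and C -> B
    [0.0, 0.0, 0.0],
])
# eQTL order: eQTL 1, eQTL 2. Entry F_mat[i, k] is the effect of eQTL k on gene i.
F_mat = np.array([
    [0.0, 0.0],
    [0.7, 0.0],        # eQTL 1 -> B
    [0.0, 0.6],        # eQTL 2 -> C
])

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, 2)).astype(float)   # simulated 0/1 genotype codes
Y = simulate_sem(B_mat, F_mat, X, noise_sd=0.5, rng=rng)
print(Y.shape)
```

Because the equations are solved jointly through (I - B)^{-1}, cycles such as A → B → A are handled without any special treatment, which is the property that distinguishes this model class from a DAG-only Bayesian network.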
31RESULTS
- SEM widely used in econometrics, sociology and
psychology (confirmatory, not exploratory)
- Typically applied to at most 10s of variables
- Available in many software packages (LISREL, Mx,
...) but not feasible for genome-size data
- Own implementation based on C can handle 100s
of genes (in Genetical Genomics context)
- SEM applied to sub-network of Yeast Encompassing
Network
- 265 genes, 241 eQTL, 832 gene-gene edges, 640
eQTL-gene edges, cycle with 168 genes
- Sparsified network with 475 gene-gene edges and
468 eQTL-gene edges
32Yeast network
- EDN: 265 genes, 241 QTLs, 832 gene → gene edges,
and 640 QTL → gene edges
- After sparsification: 475 gene → gene edges and
468 QTL → gene edges
33RESULTS
- SEM applied to artificial gene expression data
(known network structure)
- 10 data sets with different, random network
topologies, 100 genes, 100 eQTLs
- mRNA levels simulated with non-linear ODE
(Gepasi)
- On average 148 gene-gene and 123 eQTL-gene edges
- EN with 360 gene-gene and 301 eQTL-gene edges
- On average 42 genes in cycles (1-3)
- False Discovery Rate (# wrongly identified edges
/ # total identified edges): 0 - 15%
- Power (# edges correctly inferred / # total edges
in the true network): 72 - 100%
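The two evaluation measures defined on this slide can be computed directly from sets of directed edges. The sketch below uses a small hypothetical network; the function name `edge_fdr_and_power` and the edge lists are illustrative only.

```python
def edge_fdr_and_power(inferred_edges, true_edges):
    """FDR = wrong inferred edges / all inferred edges; power = correct edges / all true edges."""
    inferred = set(inferred_edges)
    truth = set(true_edges)
    correct = inferred & truth
    fdr = (len(inferred) - len(correct)) / len(inferred) if inferred else 0.0
    power = len(correct) / len(truth) if truth else float("nan")
    return fdr, power

# Directed edges are (source, target); a reversed edge counts as wrong
true_edges = {("A", "B"), ("B", "C"), ("C", "A"), ("eQTL_1", "A")}
inferred_edges = {("A", "B"), ("B", "C"), ("A", "C"), ("eQTL_1", "A")}

fdr, power = edge_fdr_and_power(inferred_edges, true_edges)
print(fdr, power)  # 0.25 0.75
```

Here the inferred edge A → C has the wrong direction, so it counts against the FDR and the true edge C → A counts as missed.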
34Systems Genetics
- The merger of Systems Biology with the study of
genetic variation
- The integration and anchoring of
multi-dimensional data-types to underlying
genetic variation (transcriptomic, phenomic,
metabolomic, ...)
35Systems Genetics
- Using a segregating population, we can
- Reconstruct a network of direct relationships
among various phenotypic measurements related to
a disease complex
- Reconstruct a causal gene regulatory network
(with DNA marker and expression profiling)
- Combine the above into a causal network of genes
and disease phenotypes
- Determine the extent of genetic control of
metabolomic variation (with additional
metabolomic profiling) and transcriptional
control of metabolic reactions
36Systems Genetics
- Using a segregating population, we can
- Reconstruct a network of direct relationships
among various phenotypic measurements related to
a disease complex
- e.g., Cardiovascular System
- Bone Fragility (morphology, mineral content,
mechanical properties, body composition)
37Systems Genetics
- Our prospects for investigating the complex
interactions between gene variants, disease, and
the environment will be significantly improved
- Our understanding of the gene and metabolic
regulatory circuitry and its relationship with
disease phenotypes will be greatly enhanced
38Questions / Comments
39Gene networks from observational data
- Undirected Dependency Graph (UDG)
- Observational data: tumor and normal samples
with measured gene expression for all genes
- Method A
- Start with a network that has an edge between any
pair of genes which are significantly correlated
(many of these interactions are indirect through
other genes)
- For each pair of genes, compute Corr(G1,G2 | Gk),
where k denotes any gene other than G1 and G2;
retain only those edges with significant 1st
order partial correlation
- For each pair of genes, compute Corr(G1,G2 | Gk,
Gm), where k and m are any two genes other than
G1 and G2; retain only those edges with
significant 2nd order partial correlation
- Etc.
40Gene networks from observational data
- Undirected Dependency Graph (UDG)
- Method B
- Estimate the covariance matrix of all genes and
invert this matrix.
- # genes >> # observations: sample covariance
matrix does not have an inverse
- use of special shrinkage estimator
- Inverse → matrix of partial correlations
Corr(G1,G2 | all other genes)
- Begin with a network having edges between any
genes with significant partial correlation
- For each pair of genes, compute Corr(G1,G2);
retain only those edges with significant simple
correlation
- For each pair of genes, compute Corr(G1,G2 | Gk);
retain only those edges with significant 1st
order partial correlation
- For each pair of genes, compute Corr(G1,G2 | Gk,
Gm); retain only those edges with significant 2nd
order partial correlation
- Etc.
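The first part of Method B (shrink, invert, rescale) can be sketched as below. The slide does not say which shrinkage estimator is used; this example assumes a simple convex combination of the sample covariance with its own diagonal and a fixed, hand-picked intensity, purely for illustration. The function name `shrunk_partial_correlations` is hypothetical.

```python
import numpy as np

def shrunk_partial_correlations(data, shrinkage=0.2):
    """Full-order partial correlations from a shrunken covariance estimate.

    data: samples x genes. The shrinkage target (diagonal of the sample covariance)
    and the fixed intensity are assumptions made for this sketch.
    """
    S = np.cov(data, rowvar=False)
    target = np.diag(np.diag(S))
    S_shrunk = (1 - shrinkage) * S + shrinkage * target   # positive definite for shrinkage > 0
    P = np.linalg.inv(S_shrunk)                           # precision matrix
    d = np.sqrt(np.diag(P))
    pcor = -P / np.outer(d, d)                            # rho_ij = -P_ij / sqrt(P_ii * P_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(4)
data = rng.normal(size=(20, 50))    # 20 samples, 50 genes: the sample covariance is singular
rank = np.linalg.matrix_rank(np.cov(data, rowvar=False))
pcor = shrunk_partial_correlations(data)
print(rank, pcor.shape)
```

The resulting matrix gives Corr(G1,G2 | all other genes) for every pair; the subsequent pruning by simple, 1st order and 2nd order correlations described on the slide is not shown here.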