Title: Medical Informatics
1Medical Informatics
Genomics and Bioinformatics
2Identification and Prioritization of Novel
Disease Candidate Genes: Systems Biology Based
Integrative Approaches
Bioinformatics to Systems Biology, November 16,
2007
- Anil Jegga
- Division of Biomedical Informatics,
- Cincinnati Children's Hospital Medical Center
(CCHMC)
- Department of Pediatrics, University of
Cincinnati
- Cincinnati, Ohio - 45229
- Anil.Jegga_at_cchmc.org
- http://anil.cchmc.org
3Acknowledgements
- Jing Chen
- Eric Bardes
- Bruce Aronow
- All the publicly available gene annotation
resources especially NCBI, MGI and UCSC
Support
4Two Separate Worlds..
Medical Informatics
Bioinformatics the omes
PubMed
Proteome
Disease Database
Patient Records
OMIM Clinical Synopsis
Clinical Trials
382 "omes" so far, and there is an UNKNOME too -
genes with no known function
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
(as of November 15, 2007)
With Some Data Exchange
5PubMed
miRNAome
Pharmacogenome
OMIM
6No Integrative Genomics is Complete without
Ontologies
- Unified Medical Language System (UMLS)
7The 3 Gene Ontologies
- Molecular Function: elemental activity/task
- the tasks performed by individual gene products;
examples are carbohydrate binding and ATPase
activity
- What a product does, precise activity
- Biological Process: biological goal or objective
- broad biological goals, such as DNA repair or
purine metabolism, that are accomplished by
ordered assemblies of molecular functions
- Biological objective, accomplished via one or
more ordered assemblies of functions
- Cellular Component: location or complex
- subcellular structures, locations, and
macromolecular complexes; examples include
nucleus, telomere, and RNA polymerase II
holoenzyme
- "is located in" ("is a subcomponent of")
http://www.geneontology.org
8Example: Gene Product = hammer
Function (what)                 Process (why)
Drive a nail into wood          Carpentry
Drive a stake into soil         Gardening
Smash a bug                     Pest control
A performer's juggling object   Entertainment
http://www.geneontology.org
9Unified Medical Language System Knowledge Server
UMLSKS
- The UMLS Metathesaurus contains information about
biomedical concepts and terms from many
controlled vocabularies and classifications used
in patient records, administrative health data,
bibliographic and full-text databases, and expert
systems.
- The Semantic Network, through its semantic types,
provides a consistent categorization of all
concepts represented in the UMLS Metathesaurus.
The links between the semantic types provide the
structure for the Network and represent important
relationships in the biomedical domain.
- The SPECIALIST Lexicon is an English language
lexicon with many biomedical terms, containing
syntactic, morphological, and orthographic
information for each term or word.
http://umlsks.nlm.nih.gov/kss
10Unified Medical Language System Metathesaurus
- Over 1 million biomedical concepts
- About 5 million concept names from more than 100
controlled vocabularies and classifications (some
in multiple languages) used in patient records,
administrative health data, bibliographic and
full-text databases and expert systems.
- The Metathesaurus is organized by concept or
meaning. Alternate names for the same concept
(synonyms, lexical variants, and translations)
are linked together.
- Each Metathesaurus concept has attributes that
help to define its meaning, e.g., the semantic
type(s) or categories to which it belongs, its
position in the hierarchical contexts from
various source vocabularies, and, for many
concepts, a definition.
- Customizable: Users can exclude vocabularies that
are not relevant for specific purposes or not
licensed for use in their institutions.
MetamorphoSys, the multi-platform Java install
and customization program distributed with the
UMLS resources, helps users to generate
pre-defined or custom subsets of the
Metathesaurus.
- Uses:
- linking between different clinical or biomedical
vocabularies
- information retrieval from databases with
human-assigned subject index terms and from
free-text information sources
- linking patient records to related information in
bibliographic, full-text, or factual databases
- natural language processing and automated
indexing research
11Open biomedical ontologies
http://obo.sourceforge.net/
12Mammalian Phenotype Ontology
- The Mammalian Phenotype (MP) Ontology enables
robust annotation of mammalian phenotypes in the
context of mutations, quantitative trait loci and
strains that are used as models of human biology
and disease.
- Each node in the MP Ontology represents a category
of phenotypes, and each MP Ontology term has a unique
identifier, a definition, and synonyms, and is
associated with gene variants causing these
phenotypes in genetically engineered or
mutagenesis experiments.
- In the current version of the MP Ontology, there are
>4250 terms associated with >4300 unique Entrez mouse
genes (extrapolated to 4300 orthologous human
genes).
http://www.informatics.jax.org
13Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or
cause disease share membership in any of several
functional relationships, OR functionally similar
or related genes cause similar phenotypes.
- Functional Similarity: Common/shared
- Gene Ontology term
- Pathway
- Phenotype
- Chromosomal location
- Expression
- Cis regulatory elements (Transcription factor
binding sites)
- miRNA regulators
- Interactions
- Other features..
14Background, Problems & Issues
- Most of the common diseases are multi-factorial
and modified by genetically and mechanistically
complex polygenic interactions and environmental
factors.
- High-throughput genome-wide studies, like linkage
analysis and gene expression profiling, tend to
be most useful for classification and
characterization but do not provide sufficient
information to identify or prioritize specific
disease causal genes.
15Background, Problems & Issues
- Since multiple genes are associated with the same or
similar disease phenotypes, it is reasonable to
expect the underlying genes to be functionally
related.
- Such functional relatedness (common pathway,
interaction, biological process, etc.) can be
exploited to aid in the finding of novel disease
genes. For example, genetically heterogeneous
hereditary diseases such as Hermansky-Pudlak
syndrome and Fanconi anaemia have been shown to
be caused by mutations in different interacting
proteins.
16Background, Problems & Issues
- Disease candidate gene studies
Biological experiments (expensive, time
consuming)
17Background, Problems & Issues
Current candidate gene prioritization tools
- Assumption: genes involved in the same complex
disease will have similar functions
dilated cardiomyopathy
Approach with training
Training: Known disease genes (10 from OMIM)
Test: 68 genes at 10q25-26
Score test genes based on their similarity to the
training set
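The training/test idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the Jaccard overlap of annotation terms stands in for whatever similarity measure a given tool actually uses, and the gene names and annotation terms are hypothetical placeholders, not real data.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of annotation terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_candidates(training, candidates):
    """Rank candidates by their mean similarity to the training genes."""
    scores = {}
    for gene, terms in candidates.items():
        sims = [jaccard(terms, t) for t in training.values()]
        scores[gene] = sum(sims) / len(sims) if sims else 0.0
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical annotation sets (placeholders, not real annotations).
training = {
    "TRAIN1": {"GO:heart_development", "MP:enlarged_heart"},
    "TRAIN2": {"GO:heart_development", "PATH:contraction"},
}
candidates = {
    "CAND_A": {"GO:heart_development", "MP:enlarged_heart"},
    "CAND_B": {"GO:dna_repair"},
}

ranking = score_candidates(training, candidates)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

A candidate that shares annotations with the known disease genes rises to the top of the list; one with unrelated annotations scores zero.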
18TOPPGene: Transcriptome Ontology Pathway based
Prioritization of Genes
http://toppgene.cchmc.org
Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved
human disease candidate gene prioritization using
mouse phenotype. BMC Bioinformatics 8(1):392
[Epub ahead of print]
- Applications
- For functional enrichment
- For candidate gene prioritization
Why another gene prioritization method?
19Comparison with other related approaches
Feature type POCUS Prospectr SUSPECTS ENDEAVOUR ToppGene
Year 2003 2005 2006 2006 2007
Sequence Features
GO Annotations
Transcript Features
Protein Features
Literature
Phenotype Annotations
Training set
20Comparison with other related approaches: Feature
Details
Feature type POCUS Prospectr SUSPECTS ENDEAVOUR ToppGene
Year 2003 2005 2006 2006 2007
Sequence Features Annotations Gene length Homology Base composition Gene length Homology Base composition Blast cis-element Cytoband cis-element miRNA targets GeneSets
Gene Annotations Gene Ontology Gene Ontology Gene Ontology Gene Ontology Mouse Phenotype
Transcript Features Gene expression Gene expression EST expression Gene expression
Protein Features domains Protein domains domains interactions pathways domains interactions pathways
Literature Keywords Co-citation
Training set No No Yes Yes Yes
21Mammalian Phenotype Ontology
We do not check whether the human ortholog
of a mouse gene causes a similar phenotype.
Rather, we assume that orthologous genes cause
orthologous phenotypes and test the potential of
the extrapolated mouse phenotype terms as a
similarity measure to prioritize human disease
candidate genes.
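The extrapolation step can be pictured as a simple join between two lookup tables. The sketch below assumes a mouse-gene-to-MP-term table and a mouse-to-human ortholog table are already available as dictionaries; the function name, gene symbols, and term identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def extrapolate_phenotypes(mouse_to_mp, mouse_to_human):
    """Carry MP terms from mouse genes over to their human orthologs."""
    human_to_mp = defaultdict(set)
    for mouse_gene, mp_terms in mouse_to_mp.items():
        # Mouse genes without a known human ortholog are skipped.
        for human_gene in mouse_to_human.get(mouse_gene, ()):
            human_to_mp[human_gene].update(mp_terms)
    return dict(human_to_mp)


# Hypothetical inputs (placeholders, not real MGI data).
mouse_to_mp = {
    "MouseGeneA": {"MP:term_1", "MP:term_2"},
    "MouseGeneB": {"MP:term_3"},
    "MouseGeneC": {"MP:term_4"},
}
mouse_to_human = {
    "MouseGeneA": ["HUMAN_A"],
    "MouseGeneB": ["HUMAN_A", "HUMAN_B"],
}

human_to_mp = extrapolate_phenotypes(mouse_to_mp, mouse_to_human)
print(human_to_mp)
```

Note that nothing here verifies that the human gene really produces the same phenotype; the terms are inherited purely by orthology, which is exactly the assumption stated on the slide.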
22Mammalian Phenotype Ontology
23ToppGene General Schema
24TOPPGene - Data Sources
- Gene Ontology: GO and NCBI Entrez Gene
- Mouse Phenotype: MGI (used for the first time for
human disease gene prioritization)
- Pathways: KEGG, BioCarta, BioCyc, Reactome,
GenMAPP, MSigDB
- Domains: UniProt (Pfam, InterPro, etc.)
- Interactions: NCBI Entrez Gene (BioGRID,
Reactome, BIND, HPRD, etc.)
- PubMed IDs: NCBI Entrez Gene
- Expression: GEO
- Cytoband: MSigDB
- Cis-Elements: MSigDB
- miRNA Targets: MSigDB
New features added
25TOPPGene - Validation
- Random-gene cross-validation
- Disease-gene relations from OMIM and GAD
databases
- Training set: disease genes with one gene
(target) removed
- Test set: 100 genes (target gene + 99 random
genes)
- Rank of target gene
- Control: random training sets
- AUC and Sensitivity/Specificity
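The validation loop above can be sketched as follows. This is a minimal illustration, not the ToppGene implementation: the scoring function is a toy overlap count, ties are resolved in the target's favour, and all gene names and annotations are synthetic placeholders.

```python
import random


def overlap_score(terms, training_sets):
    """Toy score: mean number of terms shared with each training gene."""
    if not training_sets:
        return 0.0
    return sum(len(terms & t) for t in training_sets) / len(training_sets)


def leave_one_out_ranks(disease_genes, background, annotations,
                        n_random=99, seed=0):
    """For each disease gene, hide it, mix it with random genes, and rank it."""
    rng = random.Random(seed)
    disease = set(disease_genes)
    pool = [g for g in background if g not in disease]
    ranks = {}
    for target in disease_genes:
        # Training set: all disease genes except the held-out target.
        training = [annotations[g] for g in disease_genes if g != target]
        # Test set: the target plus n_random randomly drawn genes.
        test_genes = [target] + rng.sample(pool, n_random)
        scores = {g: overlap_score(annotations[g], training) for g in test_genes}
        # Rank 1 is best; only strictly higher scores push the target down.
        better = sum(1 for g in test_genes if scores[g] > scores[target])
        ranks[target] = 1 + better
    return ranks


# Synthetic data: disease genes share one term, background genes share none.
disease_genes = [f"D{i}" for i in range(5)]
background = [f"B{i}" for i in range(200)]
annotations = {g: {"shared_term", f"own_{g}"} for g in disease_genes}
annotations.update({g: {f"own_{g}"} for g in background})

ranks = leave_one_out_ranks(disease_genes, background, annotations)
print(ranks)
```

The control described on the slide would repeat the same loop with randomly chosen training sets in place of the disease genes.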
26TOPPGene - Validation
- Random-gene cross-validation: breast cancer
example
27- Random-gene cross-validation result
- Training: 19 diseases with 693 genes
- Control: 20 random sets of 35 genes each
- Sensitivity/Specificity: 77%/90%
- AUC: 0.916
- Sensitivity: frequency of target genes that are
ranked above a particular threshold position
- Specificity: the percentage of genes ranked below
the threshold
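Given the rank of the target gene in each validation run, the two definitions above translate directly into code. The sketch below assumes every run has one target among `n_test` genes and that "genes" in the specificity definition means the non-target genes; the function names and the example ranks are hypothetical.

```python
def sens_spec(target_ranks, k, n_test=100):
    """Sensitivity and specificity when the top k positions count as 'hits'."""
    n_runs = len(target_ranks)
    hits = sum(r <= k for r in target_ranks)
    sensitivity = hits / n_runs
    # Non-target genes inside the top k: k slots per run minus the targets there.
    false_pos = k * n_runs - hits
    specificity = 1 - false_pos / ((n_test - 1) * n_runs)
    return sensitivity, specificity


def auc(target_ranks, n_test=100):
    """Area under the ROC curve, sweeping the threshold k from 0 to n_test."""
    points = []
    for k in range(n_test + 1):
        sens, spec = sens_spec(target_ranks, k, n_test)
        points.append((1 - spec, sens))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule
    return area


# Hypothetical target ranks from five validation runs.
example_ranks = [1, 1, 2, 5, 50]
sens, spec = sens_spec(example_ranks, k=5)
print(f"top-5 threshold: sensitivity={sens:.2f}, specificity={spec:.3f}")
print(f"AUC={auc(example_ranks):.3f}")
```

An AUC of 1.0 corresponds to every target being ranked first, and 0.0 to every target being ranked last.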
28Using Mouse Phenotype as a feature of similarity
measure improves human disease gene prioritization
- Random-gene cross-validation with only one feature
29Using Mouse Phenotype as a feature of similarity
measure improves human disease gene prioritization
Random-gene cross-validation by leaving one
feature out
Overall performance: All features 0.913; All
without MP 0.893; All without MP and PubMed 0.888
Sensitivity: true positive rate at a cutoff score
Specificity: true negative rate at the same cutoff
30- Locus-region cross-validation using different
feature sets
Features                Average rank ratio   Times target genes   Times target genes
                        of target genes      ranked top 5         ranked top 10
All                     7.39                 118                  125
GO + MP + PubMed        7.50                 118                  126
MP + PubMed             7.08                 121                  126
Without GO              6.84                 117                  123
Without Pathway         7.66                 118                  124
Without Domain          6.71                 118                  124
Without Interaction     7.17                 120                  124
Without Expression      7.28                 118                  128
Without MP              9.77                 110                  117
Without PubMed          9.91                 100                  111
Without MP and PubMed   22.61                71                   80
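The three summary columns can be computed from per-run results along these lines. Two things here are assumptions rather than definitions taken from the slides: the rank ratio is taken as 100 x rank / number of test genes, and "top 5" / "top 10" are read as rank positions. The example runs are invented.

```python
def summarize_ranks(results):
    """results: list of (rank_of_target, number_of_test_genes) per run."""
    # Assumed definition: rank ratio = 100 * rank / size of the test set.
    ratios = [100.0 * rank / n for rank, n in results]
    return {
        "avg_rank_ratio": sum(ratios) / len(ratios),
        "top5": sum(rank <= 5 for rank, _ in results),
        "top10": sum(rank <= 10 for rank, _ in results),
    }


# Hypothetical locus-region runs: (target rank, genes in the region).
results = [(1, 100), (4, 80), (12, 120)]
summary = summarize_ranks(results)
print(summary)
```

A lower average rank ratio means the target genes sit closer to the top of their candidate lists, which is how the rows of the table compare feature sets.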
31- ToppGene web server (http://toppgene.cchmc.org)
- For functional enrichment analysis
35PPI - Predicting Disease Genes
- Direct protein-protein interactions (PPI) are one
of the strongest manifestations of a functional
relation between genes.
- Hypothesis: Interacting proteins lead to the same or
similar disease phenotypes when mutated.
- Several genetically heterogeneous hereditary
diseases have been shown to be caused by mutations in
different interacting proteins, for example
Hermansky-Pudlak syndrome and Fanconi anaemia.
Hence, protein-protein interactions might in
principle be used to identify potentially
interesting disease gene candidates.
36- Prioritize candidate genes in the interacting
partners of the disease-related genes
- Training sets: disease-related genes
- Test sets: interacting partners of the training
genes
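Building the test set from interaction partners amounts to a breadth-first expansion over the interaction network. The sketch below assumes the interactome is available as a list of undirected gene pairs; the function names and the tiny example network are hypothetical.

```python
from collections import defaultdict


def build_graph(interactions):
    """Undirected adjacency sets from a list of (gene_a, gene_b) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def interaction_levels(graph, seeds, max_level=2):
    """Level 0 = seed genes, level 1 = direct partners, level 2 = their partners."""
    levels = {0: set(seeds)}
    seen = set(seeds)
    for level in range(1, max_level + 1):
        frontier = set()
        for gene in levels[level - 1]:
            frontier |= graph.get(gene, set())
        frontier -= seen  # keep each gene at its lowest level only
        levels[level] = frontier
        seen |= frontier
    return levels


# Hypothetical interaction pairs (placeholders, not real interactome data).
interactions = [("S1", "A"), ("S2", "A"), ("A", "B"), ("B", "C"), ("S1", "S2")]
levels = interaction_levels(build_graph(interactions), seeds={"S1", "S2"})
print(levels)
```

Here the level 0 genes would serve as the training set and the level 1 genes as the test set; each additional level typically enlarges the candidate list considerably.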
37OMIM genes (level 0)   Directly interacting genes (level 1)   Indirectly interacting genes (level 2)
15                       342                                    2469!
38- ToppGene web server (http://toppgene.cchmc.org)
- For candidate gene prioritization
41- Example: Breast cancer study. Genome-wide
association study identifies novel breast cancer
susceptibility loci. Nature. 2007 May 27.
rs id       Location   Gene    Training set    Test set
rs2981582   10q26      FGFR2   15 OMIM genes   83 genes in the region
Prioritization result
Rank   Gene    Description                                       P-value
1      BUB3    budding uninhibited by benzimidazoles 3 homolog   0.003865
2      FGFR2   fibroblast growth factor receptor 2               0.018906
3      BCCIP   BRCA2 and CDKN1A interacting protein              0.04784
43ToppGene Prioritization
Training set    Test set
15 OMIM genes   342 interacting genes
Ranked Interactants
Rank   Gene         Description
1      ATR          ataxia telangiectasia and Rad3 related
2      FANCD2       Fanconi anemia, complementation group D2
3      NBN (NBS1)   Nibrin
44Limitations
- General limitations of any training-test
strategy
- Prior knowledge of disease-gene associations.
- Assumption that the disease genes yet to be
discovered will be consistent with what is already
known about a disease.
- Dependence on the accuracy and completeness of the
functional annotations.
- Only one-fifth of the known human genes have
pathway or phenotype annotations, and there are
still more than 40% of genes whose functions are not
defined!
Chen et al., 2007 BMC Bioinformatics
45Mouse Phenotype - Limitations
- MP is not a disease-centric ontology, and the
phenotype of the same gene mutation can vary
depending on specific mouse strains or their
genetic backgrounds.
- Orthologous genes need not necessarily result in
orthologous phenotypes.
Possible Solutions - Future Directions: More
efficient cross-species phenome extrapolation,
wherein the mouse phenotype terms are mapped to
human phenotype concepts (from UMLS) semantically
(orthologous phenotype) and the resultant
orthologous genes associated with an orthologous
phenotype are identified.
Chen et al., 2007 BMC Bioinformatics
46PPIs for disease gene identification: Limitations
- Noisy interactome data
- In vitro vs. in vivo (e.g., only 5.8% of yeast
two-hybrid predicted interactions were confirmed
by HPRD)
- Extrapolation of interactions from one species to
another
- Bias towards well-studied genes/proteins
- Too many interactants! Hub proteins
- Two interacting proteins need not lead to a similar
phenotype when mutated
- Disease proteins may lie at different points in a
pathway and need not interact directly
- Lastly, disease mutations need not always involve
proteins
Oti et al., 2006 J Med Genet
47http://anil.cchmc.org (under presentations)
http://sbw.kgi.edu/