Title: THE CHALLENGES OF GENOME INFORMATION MANAGEMENT
1 THE CHALLENGES OF GENOME INFORMATION MANAGEMENT
Shamkant B. Navathe (1), sham_at_cc.gatech.edu
in collaboration with Douglas C. Wallace (2), dwallace_at_gmm.gen.emory.edu
(1) Bioengineering Program - Database Group, College of Computing, Georgia Institute of Technology
(2) Center for Molecular Medicine, Emory University School of Medicine
2 Acknowledgements
- Students
- Andreas Kogelnik, M.D., Ph.D.
- Girish Katdare, M.S.
- Mondira Deb, Ph.D. student (ECE)
- Ken Kladitis, B.S.
- Martin Brandon, Ph.D. student
- Faculty
- Mike Brown, Ph.D., Asst. Prof., Emory
- Marie Lott, Ph.D., Research Scientist, Emory
3 General Challenges for Data Management in Biological Applications
- Collection and Curation
- Analysis
- comparative analysis
- Integration
- cross linking of data
- Understanding
- multiple interpretations
- Dissemination
- traditional means / web-based
4 Desired Properties of Proposed Solutions
- Scalability
- applicable to large volumes of data
- high rates of data acquisition
- No loss of information between systems
- Constructive approach to data management vs. absolute representation
- identification
- accommodation (of viewpoints)
- manipulation
- conflict resolution
- This represents a very tall order for existing
DBMSs
5 Current State of Affairs in Genetic Data Management
- Lots of genetic info lies in non-electronic form
- Biology is becoming very "data rich"
- laboratory automation is increasing data output
- Comprehensive data analysis and interpretation is no longer feasible by manual means
- many databases, many tools, many nomenclatures
6 Human Genome Initiative (HGI)
- Begun in 1988
- $200 million/year for 15 years from Congress
- Capturing, analyzing, and interpreting the collective human genetic information for the 24 distinct chromosomes (22 autosomes plus X and Y)
- 3-4 billion nucleotides per genome
- Initial estimate: 100,000-300,000 genes
This initiative was overtaken by industry, particularly Celera, which announced in Feb. 2001 that it had sequenced the complete genome within 2 years and had identified about 30K genes. The gene sets reported by HGI and Celera do not fully match, and the exact number of genes is unknown.
7 Current Prominent Genome-related Databases
- GenBank
- DNA sequence: 1,053,000,000 bases
- 1,611,000 sequences
- Molecular modeling DB (MMDB)
- 3-D structures
- Online Mendelian Inheritance in Man (OMIM)
- Clinical phenotypes - 8,700 entries
- Swiss-Prot
- Protein sequence: 70,000 proteins
- Genome Database (GDB)
- human genome mapping data
- Locus specific databases such as PaHDb, CFTR
- PIR International and PDB (Protein data bank)
8 Types of Database Content
- Full genome databases of human and other organisms (C. elegans, E. coli, Drosophila, mouse, ...)
- Specialized subject databases
- TRANSFAC: transcription factors and their binding sites
- REBASE: Restriction Enzyme Resource
- Derived databases providing annotations and novel structuring of the content
- Protein motif databases (PROSITE)
- Protein structure-sequence alignment database (HSSP)
- Protein domains (PFAM)
- Structural Classification of Proteins (SCOP)
9 TYPE OF DATA
- Sequence data (different sequences for DNA data and protein data); sequences are linked to structures, motifs, and metabolic pathways
- Structure and function data
- Annotations
- Evolutionary relationships
- Visual representations
- Audio and video data related to phenotypes (patient symptoms and behaviors)
Databases of metabolic pathways have intrinsic complexity because nodes represent data from sequences while edges represent chemical reactions, which are independent and not sequence-related (Karp 1998).
10 Nature of Biological Data
- Representation of biological macromolecules
- Combined with associated fuzzy information
- Incomplete and sometimes subjective
- Open to interpretation
- Using different nomenclature
- Quality control is a major issue
- New data is based on experiments without confirmation
- Previous annotations of data may be inherited, but may not match new results
- Submissions have to be checked for accuracy
- The same data may occur in multiple submissions
Question: should genomic and proteomic databases be passive repositories, or active in the form of annotations and links to other databases?
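One way to approach the point that the same data may occur in multiple submissions is to compare checksums of normalized sequences. The following is only a minimal sketch under stated assumptions: the function names `sequence_key` and `find_duplicates`, the submission IDs, and the toy sequences are all hypothetical, and real curation would also need to handle near-duplicates and reverse complements.

```python
import hashlib
from collections import defaultdict


def sequence_key(seq: str) -> str:
    """Checksum of a sequence after removing whitespace and case differences."""
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def find_duplicates(submissions: dict[str, str]) -> list[list[str]]:
    """Group submission IDs whose normalized sequences are identical."""
    groups = defaultdict(list)
    for sub_id, seq in submissions.items():
        groups[sequence_key(seq)].append(sub_id)
    # Keep only groups that contain more than one submission.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical toy submissions; sub1 and sub2 differ only in formatting.
submissions = {
    "sub1": "ACGTACGT",
    "sub2": "acgt acgt\n",
    "sub3": "ACGTTCGT",
}
print(find_duplicates(submissions))
```

Exact-match hashing only catches verbatim resubmissions; it says nothing about which copy carries the better annotation.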
11 Quality control of data
- Easier for structural data
- rules of stereochemistry and protein architecture apply
- Experimental techniques exist to verify structure
- NMR (nuclear magnetic resonance)
- X-ray crystallography
- Classical tradeoff: whether to make data available quickly or to wait until its accuracy is verified
- Existing databases like ESTs (expressed sequence tags) have a lot of related and possibly redundant information.
12 Related Areas of Computer Science/Computational Science/Computer Technology
- Algorithm design, algorithm complexity analysis
- applicable to comparison of sequence data
- applicable to prediction of protein folding
- Database modeling
- entities, relationships, attributes, constraints
- objects and object references
- Database design
- schemas, content/data organization design,
loading, curating process design
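As an illustration of the kind of algorithm meant by "comparison of sequence data", the sketch below computes the Levenshtein edit distance between two DNA strings by dynamic programming. The function name `edit_distance` and the example strings are assumptions made for illustration; production sequence comparison typically uses scored alignment rather than plain edit distance. The nested loops make the O(len(a) x len(b)) cost visible, which is the kind of fact complexity analysis is meant to expose.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("GATTACA", "GACTATA"))  # 2 (two substitutions)
```

Only two rows of the table are kept at a time, so memory grows with the length of one sequence rather than with the product of both lengths.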
13 Areas of Computer Science/Computational Science/Computer Technology (cont.)
- Knowledge system design
- incorporation of heuristic rules
- automated detection of patterns
- deduction or derivation of new information using distributed computing and neural networks
- Parallel processing and supercomputing
- Animation and visualization
- Virtual environments
14 OUR WORK IN THE AREA
- Started a project around 1993, later named MITOMAP, to create a database of the mitochondrial genome
- Resulted in the PhD dissertation of Andreas Kogelnik in 1997 on Biological Information Management
- Work in the dissertation included an approach to manage all aspects of the mitochondrial genome
- A system called GENOME was proposed
- Currently being maintained and further developed by Martin Brandon
15 Human Mitochondrial Map: MITOMAP
http://www.gen.emory.edu
[Figure: circular map of the human mitochondrial genome. Labeled regions: D-Loop; 12S and 16S rRNA; protein-coding genes ND1, ND2, ND3, ND4, ND4L, ND5, ND6, COI, COII, COIII, Cyt b, ATPase6, ATPase8; tRNA genes marked by single-letter codes. Disease-associated positions: DEAF 1555, MELAS 3243, LHON 3460, ADPD 4336, MERRF 8344, NARP 8993/Leigh's 8993, LHON 11778, LDYS 14459, LHON 14484. Population markers: Africa L; America A, B, C, D; Asia B, F; Europe H.]
16 GENOME: Georgia Tech Emory Networked Object Management Environment
- Focus on mitochondrial genome
- 16,569 base pairs
- Develop capabilities for collecting, storing, distributing, and analyzing the data produced
- Integration of multiple types of data to create a comprehensive research data repository
17 Data Organization Problem: Relational Model of Data
- Best for structured information
- Naturally appealing for tabular data
- Well-founded and mathematically sound theory of sets and relations
- Limitation: no accounting of the semantics of data
- Limitation: does not provide simple features like subtyping and inheritance
- Limitation: SQL as a language is not powerful enough
18 Data Organization Problem: Object-Oriented Model of Data
- Captures objects of greater complexity
- Easier to deal with unstructured information
- Easier to deal with relationships, behavior, and interpretation of data
- Limitation: query languages are not well developed
- Limitation: schema evolution techniques are lacking
- Limitation: industrial support and experience are much weaker compared to relational
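The subtyping and inheritance that the relational model lacks, and that the object-oriented model provides, can be sketched as follows. This is a hypothetical illustration only: the class names `Feature`, `ProteinCodingGene`, and `TransferRNAGene`, the feature names, and the coordinates are placeholders, not the schema of GENOME or MITOMAP.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Base class: any annotated region of a genome (hypothetical)."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    def length(self) -> int:
        return self.end - self.start + 1

    def describe(self) -> str:
        return f"{self.name}: {self.length()} bp"


@dataclass
class ProteinCodingGene(Feature):
    """Subtype that adds a property and specializes behavior."""
    product: str = "unknown protein"

    def describe(self) -> str:
        return f"{super().describe()}, encodes {self.product}"


@dataclass
class TransferRNAGene(Feature):
    """Subtype with its own extra property."""
    amino_acid: str = "?"

    def describe(self) -> str:
        return f"{super().describe()}, tRNA for {self.amino_acid}"


# Placeholder features; names and coordinates are made up.
features = [
    ProteinCodingGene("geneA", 101, 400, product="example enzyme"),
    TransferRNAGene("trnX", 401, 470, amino_acid="Leu"),
]
for f in features:
    print(f.describe())
```

Both subtypes inherit `length()` unchanged and can be processed through the common `Feature` interface, which is the behavior a purely tabular schema has to emulate with extra tables and joins.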
19 CASE STUDY OF DDBJ
- Human Genome Database
- Maintained in Kyoto, Japan
- Linked to GenBank and EMBL
20 Schema Levels
21 Conceptual Schema
22 External Schema
23 Data Organization: Our Approach in Initial Design
- Use of standardized notation
- ASN.1
- Tailoring the approach by defining our own schemas, classes, properties, and functions
- Creating an open architecture for future expansion/evolution of data models and incorporation of databases
24 Data Organization: General Trends
- Combining relational and O-O features into one system
- Providing system functionality with pre-defined classes, then adding user-defined facilities
- Active data: use of triggers and rules to create new data
- Allowing support for heterogeneous/federated data collections
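The "active data" trend, where triggers and rules create new data, can be sketched with SQLite through Python's standard `sqlite3` module. Everything here is an assumption made for illustration: the table names `variant` and `variant_class`, the trigger name, and the two inserted rows are hypothetical, and the transition/transversion rule was chosen only because it is simple to state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant (
    position INTEGER NOT NULL,
    ref      TEXT NOT NULL,
    alt      TEXT NOT NULL
);
CREATE TABLE variant_class (
    position INTEGER NOT NULL,
    kind     TEXT NOT NULL
);
-- Rule: every stored variant yields a derived classification row.
CREATE TRIGGER classify_variant AFTER INSERT ON variant
BEGIN
    INSERT INTO variant_class (position, kind)
    VALUES (
        NEW.position,
        CASE
            WHEN (NEW.ref || NEW.alt) IN ('AG', 'GA', 'CT', 'TC')
                THEN 'transition'
            ELSE 'transversion'
        END
    );
END;
""")

# Made-up example rows; the trigger fires once per inserted row.
conn.executemany(
    "INSERT INTO variant (position, ref, alt) VALUES (?, ?, ?)",
    [(100, "A", "G"), (200, "T", "G")],
)
rows = conn.execute(
    "SELECT position, kind FROM variant_class ORDER BY position"
).fetchall()
print(rows)
```

The application only inserts raw variants; the derived table is filled by the database itself, which is what distinguishes an active repository from a passive one.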
25 OUR FUTURE APPROACH
- Data Integration
- integrate sequence-based genomic information with mutation/disease-related information, functional and biochemical information
- interaction between nuclear and mitochondrial genome data using microarray experiments
- combine with existing mutation databases
- Long-Term Maintainable Repository
- use standard commercial approaches
- likely to implement system using Oracle 9i
26 Data Integration
[Diagram: Genomic DBs, Mutational DBs, and Protein DBs linked to a Central Repository]
27 MITOMAP Data Interactions
[Diagram: interactions among Functional Data, Gene-gene Interactions Data, mtDNA Sequence Data, Population Data (population database), and Disease Data]
28 Complexities vs. Ease of Use
- Complex objects with ill-defined and uncertain data
- Multiplicity of data types
- Numbers, text, images, audio, video
- Easy interface for scientists, drug designers, etc.
- Better interfaces
- More visualization
- More animation
- More user interactivity
- Varied search paradigms: querying, browsing, navigation, interactive exploration
29 Biological Application Challenges
- Raw data
- sequence data, anthropological evolution, microarray studies of gene expression, electronic patient records
- Multiple dimensions of information
- Content-relationships and links among data
- Incomplete, ill-defined, ill-structured, ill-formed information
- Missing and erroneous information
- Going beyond raw data to meaningful information
- Extraction - selection
- Derivation - deduction
- Exploration - discovery (data mining)
30 Challenges for Database Professionals
- Learning applications
- Jargon
- Process model of the environment
- Complexities, typical scenarios, rules, constraints
- Apply database techniques to help in application
- Conceptual modeling
- Views, indexing, text analysis
- Specification, normalization, query optimization
- Apply techniques from outside the database area
- AI, information retrieval, software engineering, user interfaces
31 Challenges for Biomedical Scientists
- Integration of multiple, disparate data sources
- Appreciation of data modeling as a precursor to information utilization
- Ability to deal with multiple models, interfaces, and environments
- Awareness of the limitations of information technology
32
"A lot of biologists and scientists don't realize that if you build a database and you tolerate 5% sloppiness in the definition of individual concepts, when you execute a query that joins across 15 concepts, you've got less than a 50-50 chance of getting the answer you want."
- Robert Robbins, Director, Applied Research Laboratory, Johns Hopkins
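The arithmetic behind this quote can be checked directly. The sketch assumes, as a simplification not stated in the quote, that each of the 15 concepts is correct with probability 0.95 independently of the others; the function name `chance_all_correct` is hypothetical.

```python
def chance_all_correct(per_concept_accuracy: float, n_concepts: int) -> float:
    """Probability that every concept in a join is correct, assuming
    independent errors (an illustrative simplification)."""
    return per_concept_accuracy ** n_concepts


p = chance_all_correct(0.95, 15)
print(f"{p:.3f}")  # 0.463, i.e. below 50-50
```

Under the same assumption, the chance of a fully correct answer first drops below one half at 14 joined concepts.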