THE CHALLENGES OF GENOME INFORMATION MANAGEMENT - PowerPoint PPT Presentation

Slides: 33

Provided by: amolna

Transcript and Presenter's Notes

Title: THE CHALLENGES OF GENOME INFORMATION MANAGEMENT

1
THE CHALLENGES OF GENOME INFORMATION MANAGEMENT
Shamkant B. Navathe (1), sham_at_cc.gatech.edu
in collaboration with Douglas C. Wallace (2), dwallace_at_gmm.gen.emory.edu
(1) Bioengineering Program - Database Group, College of Computing, Georgia Institute of Technology
(2) Center for Molecular Medicine, Emory University School of Medicine
2
Acknowledgements

Students
Andreas Kogelnik, M.D., Ph.D.
Girish Katdare, M.S.
Mondira Deb, Ph.D. student (ECE)
Ken Kladitis, B.S.
Martin Brandon, Ph.D. student
Faculty
Mike Brown, Ph.D. Asst Prof., Emory
Marie Lott, Ph.D., Research Scientist, Emory

3
General Challenges for Data Management in
Biological Applications

Collection and Curation
Analysis
comparative analysis
Integration
cross linking of data
Understanding
multiple interpretations
Dissemination
traditional means / web-based

4
Desired Properties of Proposed Solutions

Scalability
applicable to large volumes of data
high rates of data acquisition
No loss of information between systems
Constructive approach to data management vs.
Absolute representation
identification
accommodation (of viewpoints)
manipulation
conflict resolution
This represents a very tall order for existing
DBMSs

5
Current State of Affairs in Genetic Data
Management

Lots of genetic info lies in non-electronic form
Biology is becoming very "data rich"
laboratory automation is increasing data output
Comprehensive data analysis and interpretation is
no longer feasible by manual means
many databases, many tools, many nomenclatures

6
Human Genome Initiative (HGI)

Begun in 1988
$200 million/year for 15 years from Congress
Capturing, analyzing, and interpreting the
collective human genetic information across the
24 distinct chromosomes (22 autosomes plus X and Y)
3-4 billion nucleotides per genome
100,000-300,000 genes

This initiative was superseded by industry,
particularly Celera, which announced in Feb. 2001
that it had sequenced the complete genome
within 2 years and had identified about 30K
genes. There are mismatches between the genes
identified by HGI and those identified by Celera;
the exact number of genes is unknown.
7
Current Genome-related prominent databases

GenBank
DNA sequence: 1,053,000,000 bases in 1,611,000
sequences
Molecular modeling DB (MMDB) - 3-D structures
Online Mendelian Inheritance in Man (OMIM)
Clinical phenotypes - 8,700 entries
Swiss-Prot
Protein sequence: 70,000 proteins
Genome Database (GDB) - human genome mapping data
Locus specific databases such as PaHDb, CFTR
PIR International and PDB (Protein data bank)

8
Types of Database Content

Full genome databases of human and other
organisms (C. elegans, E. coli, Drosophila,
mouse, ...)
Specialized subject databases
TRANSFAC: transcription factors and their binding
sites
REBASE: Restriction Enzyme Resource
Derived Databases providing annotations and
novel structuring of the content
Protein motif databases (PROSITE)
Protein structure-sequence alignment database
(HSSP)
Protein Domains (PFAM)
Structural Classification of Proteins (SCOP)

9
TYPE OF DATA

Sequence Data (different sequences for DNA data
and protein data). Sequences are linked to
structures, motifs and metabolic pathways
Structure and function data
Annotations
Evolutionary relationships
Visual Representations
Audio and Video data related to phenotypes
(patient symptoms and behaviors)

Databases of metabolic pathways have intrinsic
complexity because nodes represent data from
sequences while edges represent chemical
reactions which are independent and non-sequence
related (Karp 1998)
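
The note above describes pathway databases as graphs whose nodes carry sequence-derived data and whose edges carry chemical reactions. A minimal sketch of that separation follows. The class names, node names, sequences, and reaction label are hypothetical placeholders, not taken from any real pathway database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SequenceNode:
    """A node: data derived from a sequence (placeholder values only)."""
    name: str
    sequence: str


@dataclass(frozen=True)
class ReactionEdge:
    """An edge: a chemical reaction, described independently of any sequence."""
    source: str
    target: str
    reaction: str


@dataclass
class Pathway:
    nodes: Dict[str, SequenceNode] = field(default_factory=dict)
    edges: List[ReactionEdge] = field(default_factory=list)

    def add_node(self, node: SequenceNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, edge: ReactionEdge) -> None:
        # Reject reactions that refer to nodes the pathway does not contain.
        for endpoint in (edge.source, edge.target):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node: {endpoint}")
        self.edges.append(edge)

    def successors(self, name: str) -> List[str]:
        """Names of nodes reachable from `name` through one reaction."""
        return [e.target for e in self.edges if e.source == name]


pathway = Pathway()
pathway.add_node(SequenceNode("enzyme_a", "MKTAYIAK"))  # made-up sequence
pathway.add_node(SequenceNode("enzyme_b", "MSLNQEVR"))  # made-up sequence
pathway.add_edge(ReactionEdge("enzyme_a", "enzyme_b", "substrate -> product"))
print(pathway.successors("enzyme_a"))  # ['enzyme_b']
```

Keeping the two record types separate means the sequence data and the reaction data can be curated and validated by different rules, which is one way to read the "intrinsic complexity" the note refers to.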
10
Nature of Biological Data

Representation of biological macromolecules
Combined with associated fuzzy information
Incomplete and sometimes subjective
Open to interpretation
Using different nomenclature
Quality Control is a major Issue
New data is based on experiments without
confirmation
Previous annotations of data may be inherited,
but may not match with new results
Submissions have to be checked for accuracy
Same data may occur in multiple submissions.

Question: should genomic and proteomic databases
be passive repositories, or active ones in the form
of annotations and links to other databases?
11
Quality control of data

Easier for structural data
rules of stereochemistry and protein
architecture apply
Experimental techniques exist to verify structure
NMR (nuclear magnetic resonance)
X-ray crystallography
Classical Tradeoff - whether to make data
available quickly or whether to wait to verify
its accuracy before it is made available
Existing databases, such as those of ESTs
(expressed sequence tags), contain a lot of
related and possibly redundant information.
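
One simple quality-control check for the redundancy mentioned above is to group submissions whose sequences are identical after normalization. The sketch below is illustrative only: the function names and submission identifiers are hypothetical, and it detects exact duplicates only. Near-duplicates such as overlapping ESTs would need alignment-based comparison instead.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def sequence_key(sequence: str) -> str:
    """Hash a sequence after removing whitespace and upper-casing it."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def group_duplicates(submissions: Dict[str, str]) -> List[List[str]]:
    """Return groups of submission ids that share an identical sequence."""
    groups = defaultdict(list)
    for submission_id, sequence in submissions.items():
        groups[sequence_key(sequence)].append(submission_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical submissions: the first two differ only in case and spacing.
submissions = {
    "sub-1": "ACGT ACGT",
    "sub-2": "acgtacgt",
    "sub-3": "TTTT",
}
print(group_duplicates(submissions))  # [['sub-1', 'sub-2']]
```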

12
Related Areas of Computer Science/Computational
Science/Computer Technology

Algorithm design / algorithm complexity analysis
applicable to comparison of sequence data
applicable to prediction of protein folding
Database modeling
entities, relationships, attributes, constraints
objects and object references
Database design
schemas, content/data organization design,
loading, curating process design
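
As an illustration of the algorithm design and complexity analysis listed above for sequence comparison, here is a sketch of the classic dynamic-programming edit distance (Levenshtein distance). It is a generic textbook technique chosen for illustration, not a method the presentation names. For sequences of lengths m and n it runs in O(m*n) time and, by keeping only two rows, O(n) memory.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("GATTACA", "GACTATA"))  # 2
```

The quadratic cost is why complexity analysis matters here: comparing one query against a database of millions of sequences this way is expensive, which motivates heuristic search methods.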

13
Areas of Computer Science/Computational
Science/Computer Technology (cont.)

Knowledge system design
incorporation of heuristic rules
automated detection of patterns
deduction or derivation of new information using
distributed computing and neural networks
Parallel processing and supercomputing
Animation and visualization
Virtual environments

14
OUR WORK IN THE AREA

Started a project around 1993 - later on named
MITOMAP to create a database of the mitochondrial
genome
Resulted in the PhD dissertation of Andreas
Kogelnik in 1997 on Biological Information
Management
Work in dissertation included an approach to
manage all aspects of mitochondrial genome
System called GENOME was proposed
Currently being maintained and further developed
by Martin Brandon

15
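Human Mitochondrial Map: MITOMAP
http://www.gen.emory.edu
[Figure: circular map of the human mitochondrial genome, origin at position 0. Labels shown:
gene regions D-Loop, 12S rRNA, 16S rRNA, ND1, ND2, ND3, ND4, ND4L, ND5, ND6, COI, COII, COIII, ATPase6, ATPase8, Cyt b;
tRNA genes by single-letter amino acid code (F, V, L, I, Q, M, W, A, N, C, Y, S, D, K, G, R, H, S, L, E, T, P);
disease-associated positions DEAF 1555, MELAS 3243, LHON 3460, ADPD 4336, MERRF 8344, NARP 8993/Leighs 8993, LHON 11778, LDYS 14459, LHON 14484;
population markers Africa L; America A, B, C, D; Asia B, F; Europe H.]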
16
GENOME: Georgia Tech Emory Networked Object
Management Environment

Focus on mitochondrial genome
16,569 base pairs
Develop capabilities for collecting, storing,
distributing, and analyzing the data produced
Integration of multiple types of data to create a
comprehensive research data repository

17
Data Organization Problem: Relational Model of
Data

Best for structured information
Naturally appealing for tabular data
Well founded and mathematically sound theory
of sets and relations
- No accounting of semantics of data
- Does not provide simple features like subtyping
and inheritance
- SQL as a language is not powerful enough

18
Data Organization Problem: Object-Oriented Model
of Data

Captures objects of greater complexity
Easier to deal with unstructured information
Easier to deal with relationships, behavior, and
interpretation of data
- Query languages are not well developed
- Schema evolution techniques are lacking
- Industrial support / experience is much weaker
compared to relational
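
The subtyping and behavior that the object-oriented model is credited with above can be sketched in a few lines. This is an illustrative assumption about what such a schema might look like; the class and attribute names are hypothetical and are not taken from the GENOME system.

```python
from dataclasses import dataclass


@dataclass
class BioSequence:
    """Common supertype: what every stored sequence has."""
    identifier: str
    residues: str

    def length(self) -> int:
        return len(self.residues)


@dataclass
class DnaSequence(BioSequence):
    """Subtype adding behavior that only makes sense for DNA."""

    def gc_content(self) -> float:
        if not self.residues:
            return 0.0
        gc_count = sum(1 for base in self.residues.upper() if base in "GC")
        return gc_count / len(self.residues)


@dataclass
class ProteinSequence(BioSequence):
    """Subtype adding an attribute specific to proteins."""
    organism: str = "unknown"


dna = DnaSequence("seq-1", "GGCCAT")        # made-up example data
protein = ProteinSequence("seq-2", "MKTAY")  # made-up example data

# Both subtypes inherit length() and can be handled uniformly as BioSequence.
print([s.length() for s in (dna, protein)])  # [6, 5]
print(round(dna.gc_content(), 2))            # 0.67
```

In a plain relational schema the same design typically needs one table per subtype plus joins, or one wide table with nullable columns, and the behavior lives in application code.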

19
CASE STUDY OF DDBJ

Human Genome Database
Maintained in Kyoto, Japan
Linked to GenBank and EMBL

20
Schema Levels
21
Conceptual Schema
22
External Schema
23
Data Organization: Our Approach in Initial Design

Use of standardized notation
ASN.1
Tailoring the approach by defining our own
schemas, classes, properties, and functions
Creating an open architecture for future
expansion/evolution of data models and
incorporation of databases

24
Data Organization: General Trends

Combining relational and O-O features into one
system
Providing system functionality with pre-defined
classes then adding user defined facilities
Active data - use of triggers and rules to
create new data
Allowing support for heterogeneous/federated
data collections
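
The "active data" trend above, where triggers and rules create new data, can be illustrated with a database trigger. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in. The table names, trigger name, and the derived attribute (sequence length) are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE submission (
    id INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL
);
CREATE TABLE sequence_stats (
    submission_id INTEGER NOT NULL REFERENCES submission(id),
    length INTEGER NOT NULL
);
-- Rule: every new submission automatically yields a derived record.
CREATE TRIGGER derive_stats AFTER INSERT ON submission
BEGIN
    INSERT INTO sequence_stats (submission_id, length)
    VALUES (NEW.id, length(NEW.sequence));
END;
"""


def open_repository() -> sqlite3.Connection:
    """Create an in-memory database with the schema and trigger installed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = open_repository()
# The application inserts only the raw submission ...
conn.execute("INSERT INTO submission (sequence) VALUES (?)", ("ACGTACGT",))
# ... and the trigger has already created the derived row.
derived = conn.execute(
    "SELECT submission_id, length FROM sequence_stats"
).fetchall()
print(derived)  # [(1, 8)]
```

The derived row is written by the database itself, so every client that inserts a submission gets the same derived data without repeating the rule in application code.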

25
OUR FUTURE APPROACH

Data Integration
integrate sequence based genomic information with
mutation/disease related information, functional
and biochemical information
interaction between nuclear and mitochondrial
genome data using microarray experiments
combine with existing mutation databases
Long Term Maintainable Repository
use standard commercial approaches
likely to implement system using Oracle 9i
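
The integration of mutation records with disease information described above amounts to joining sources on a shared key. The sketch below is a toy illustration under stated assumptions: the record types and function name are hypothetical, the positions and disease codes are the labels shown on the MITOMAP slide (MELAS 3243, MERRF 8344, LHON 11778), and the disease names are the standard expansions of those abbreviations. Unmatched records are kept, not dropped, in the spirit of "no loss of information between systems".

```python
from typing import Dict, List, NamedTuple


class Mutation(NamedTuple):
    position: int       # nucleotide position in the mitochondrial genome
    disease_code: str   # abbreviation used by the mutation source


class IntegratedRecord(NamedTuple):
    position: int
    disease_code: str
    disease_name: str


# Source 1: mutation records (labels as shown on the MITOMAP slide).
MUTATIONS = [
    Mutation(11778, "LHON"),
    Mutation(3243, "MELAS"),
    Mutation(8344, "MERRF"),
]

# Source 2: disease descriptions keyed by abbreviation.
DISEASES: Dict[str, str] = {
    "LHON": "Leber hereditary optic neuropathy",
    "MELAS": "Mitochondrial encephalomyopathy, lactic acidosis, "
             "and stroke-like episodes",
}


def integrate(
    mutations: List[Mutation], diseases: Dict[str, str]
) -> List[IntegratedRecord]:
    """Join the two sources on disease code, keeping unmatched mutations."""
    records = [
        IntegratedRecord(m.position, m.disease_code,
                         diseases.get(m.disease_code, "unknown"))
        for m in mutations
    ]
    return sorted(records, key=lambda record: record.position)


for record in integrate(MUTATIONS, DISEASES):
    print(record.position, record.disease_code, "-", record.disease_name)
```

MERRF is deliberately missing from the second source here, so it appears in the output with the name "unknown" instead of disappearing from the integrated view.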

26
Data Integration
Genomic DBs
Mutational DBs
Protein DBs
Central Repository
27
MITOMAP Data Interactions
Functional Data
Gene-gene Interactions Data
mtDNA Sequence Data
Population Data
population database
Disease Data
28
Complexities vs. Ease of Use

Complex objects with ill-defined and uncertain
data
Multiplicity of data types
Numbers, text, images, audio, video
Easy interface for the scientists and drug
designers etc.
Better interfaces
More visualization
More animation
More user interactivity
Varied search paradigms - querying, browsing,
navigation, interactive exploration

29
Biological Application Challenges

Raw Data
sequence data, anthropological evolution,
microarray studies of gene expression, electronic
patient records
Multiple dimensions of information
Content-relationships and links among data
Incomplete, ill-defined, ill-structured,
ill-formed information
Missing and erroneous information
Going beyond raw data to meaningful
information
Extraction - selection
Derivation - deduction
Exploration - discovery (data mining)

30
Challenges for Database Professionals

Learning applications
Jargon
Process model of the environment
Complexities, typical scenarios, rules,
constraints
Apply database techniques to help in application
Conceptual modeling
Views, indexing, text analysis
Specification, normalization, query optimization
Apply techniques from outside the database area
AI, information retrieval, software engineering,
user interfaces

31
Challenges for Biomedical Scientists

Integration of multiple, disparate data sources
Appreciation of data modeling as a precursor to
information utilization
Ability to deal with multiple models, interfaces
and environments

Awareness of the limitations of information
technology

32
"A lot of biologists and scientists don't realize
that if you build a database and you tolerate 5%
sloppiness in the definition of individual
concepts, when you execute a query that joins
across 15 concepts, you've got less than a 50-50
chance of getting the answer you want." -
Robert Robbins, Director, Applied Research
Laboratory, Johns Hopkins
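
The arithmetic behind the quote can be checked directly. The sketch below reads the tolerated sloppiness as a 5% error rate per concept and assumes errors in different concepts are independent. Both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def chance_all_correct(per_concept_accuracy: float, concepts: int) -> float:
    """Probability that every joined concept is correct, assuming independence."""
    return per_concept_accuracy ** concepts


# 95% accuracy per concept, joined across 15 concepts.
probability = chance_all_correct(0.95, 15)
print(f"{probability:.3f}")  # 0.463
```

Under these assumptions the chance of a fully correct answer is about 46%, which is indeed below even odds.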
