Title: The MoBIoS Project: Molecular Biological Information System
1. The MoBIoS Project: Molecular Biological Information System
- Daniel P. Miranker
- Dept. of Computer Sciences
- Center for Computational Biology and Bioinformatics
- University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Shulin Ni, Kai Yan, Ving Lei
2. Compared to Business Databases, Biological Databases
- are not that big
- GenBank FTP mirror download, << 1 terabyte
- CMS spectrometer at CERN, > petabyte/year
- but,
- data management in biology is a big problem
- There must be another problem
3. You Can't Sort
- Sequences
- DNA, RNA, Protein databases
- Mass Spectra
- proteomics
- Small Molecules and Protein Structure
- Protein interaction
- Rational drug design
- Pathways (graphs)
- Phylogenies (graphs, trees in particular)
4. In Life Sciences, Database Management Systems Are Souped-Up File Systems
- Primary data is stored in text or blob fields
- Annotations may be relational
- Data retrieval
- Filter DB, sequential dump, O(n), to utilities
- e.g., BLAST
5. Scope: To Find Common Ground, Both Biology and DBMS Have to Move
DBMS
Biological Information System
Metric-Space Database as the Common Ground
6. Metric Space is
- a pair, M = (D, d),
- where
- D is a set of points
- d is a metric distance function with the following properties:
- d(x, y) = d(y, x) (symmetry)
- d(x, y) ≥ 0, d(x, x) = 0 (non-negativity)
- d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
[Figure: triangle with vertices x, y, z]
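The three properties can be checked by brute force on a small finite sample. The sketch below is only an illustration: the helper names `hamming` and `is_metric` are hypothetical and not part of MoBIoS. It tests Hamming distance on all DNA 3-mers and contrasts it with squared difference, which violates the triangle inequality.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def is_metric(points, d) -> bool:
    """Brute-force check of the metric properties on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):            # symmetry
            return False
        if d(x, y) < 0 or d(x, x) != 0:   # non-negativity and d(x, x) = 0
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):   # triangle inequality
            return False
    return True


# All 3-mers over the DNA alphabet (4^3 = 64 points).
kmers = ["".join(p) for p in product("ACGT", repeat=3)]
print(is_metric(kmers, hamming))

# Squared difference is symmetric and non-negative but fails the triangle
# inequality: d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2.
print(is_metric([0, 1, 2], lambda x, y: (x - y) ** 2))
```

A check like this can only refute a candidate metric on the sample it sees; it does not prove the properties in general.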
7. Definition - By Analogy
- A Spatial Database Management System
- Extend relational DBMS
- Special indexes for 2D and 3D data: k-d trees and R-trees
- New data types
- Geographic information systems
- Topographic maps
- Buildings and the like
- A Metric-Space Database Management System
- Extend Relational DBMS
- Special indexes for metric-spaces
- New data types
- Biological information system
- Life science data types
8. Results to Date
- Developed and Validated Metrics for
- protein sequence homology
- protein mass-spectra
- volumetric (3-d) models of protein electrostatics
- Feasibility prototype,
- validated scalable biological data retrieval
9. A Broad Range of Problems
- General purpose metric-space index
- Use of metrics as similarity functions on biological data
- Integration of biological sequences into the SQL programming model
10. Metric Space Indexing
Vantage point algorithm [Burkhard & Keller 73]
Choose a point, VP
and a radius, R
11. Range Search
- If d(q, VP) > R + d
- then
- search outside the sphere
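A minimal sketch of vantage-point indexing and range search follows. It assumes the query radius is called `r`, that the first point of each subset serves as VP, and that R is the median distance to VP. These choices and all function names are illustrative, not the MoBIoS implementation. The triangle inequality justifies both pruning tests: when d(q, VP) - r > R no point inside the sphere can match, and when d(q, VP) + r ≤ R no point outside can.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two equal-length strings (a metric)."""
    return sum(x != y for x, y in zip(a, b))


def build_vp_tree(points, d, leaf_size=4):
    """Split points by distance to a vantage point VP: inside or outside radius R."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    vp, rest = points[0], points[1:]      # assumption: first point serves as VP
    dists = sorted(d(vp, p) for p in rest)
    radius = dists[len(dists) // 2]       # assumption: R is the median distance
    return {
        "vp": vp,
        "radius": radius,
        "inside": build_vp_tree([p for p in rest if d(vp, p) <= radius], d, leaf_size),
        "outside": build_vp_tree([p for p in rest if d(vp, p) > radius], d, leaf_size),
    }


def range_search(node, q, r, d, out):
    """Collect every indexed point within distance r of the query q."""
    if "leaf" in node:
        out.extend(p for p in node["leaf"] if d(q, p) <= r)
        return out
    dq = d(q, node["vp"])
    if dq <= r:
        out.append(node["vp"])
    if dq - r <= node["radius"]:          # query ball may reach inside the sphere
        range_search(node["inside"], q, r, d, out)
    if dq + r > node["radius"]:           # query ball may reach outside the sphere
        range_search(node["outside"], q, r, d, out)
    return out


kmers = ["".join(p) for p in product("ACGT", repeat=4)]
tree = build_vp_tree(kmers, hamming)
hits = sorted(range_search(tree, "ACGT", 1, hamming, []))
print(len(hits), hits[:3])
```

The result should agree with a linear scan that evaluates the distance to every point; the index only reduces how many distances are computed.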
12. Biological Models are Usually Based on Similarity
- Similarity
- Biologists like scoring functions that reward each similar feature with a positive number
- Intuitive
- Distance
- More similar → smaller numbers
- Identical → 0
13. 1) Do Metric Models of Similarity Capture Biology?
- Metrics are a subset of possible mathematical models
14. Sequence Similarity
- Sequence similarity based on weighted edit distance
- [Sellers 74]
- Accepted weight matrices, PAM and BLOSUM, are not metric
- Log-odds matrices: negative values
- Defy simple algebraic normalization
- [Taylor & Jones 93], [Linial et al. 97], [Halperin et al. 04]
15. PAM vs. mPAM: t = 1/f
[Xu & Miranker 04]
- Using original substitution counts
- PAM: frequency of substitution
- S(a, b | t) = log( P(b | a, t) / q_b )
- mPAM: expected time between substitutions
- D(a,b) 1/log(1 ?(P(a,x)P(b,x))
x
16. Metrics for Biological Similarity
- Biosequences
- mPAM
- mBLOSSUM (not yet published)
- Hamming distance
- Protein Mass and Protein Fragment Mass Spectra
- derived an effective metric from cosine distance
- Volumetric models of chemical fields of proteins
- (with C. Bajaj)
17. Matching Electrostatic Shape of Molecules
18. Metric-Space Indexing (search)
- Well studied in main-memory
- by no means a closed problem
- In databases (external/disk based methods)
- Embryonic
- Success only in specific domains
- Biological sequences, nucleotide and peptide
- Databases of analytically determined mass-spectra
- Volumetric models of proteins (collaboration in progress with C. Bajaj)
19. MoBIoS System Overview
[Figure: layered architecture. Top: Active Application Efforts and Other Opportunities (Mass-Spec Protein Identification, Combi-Chem Library Management, Comparative and Phylo Genomics, Homology Search, Ligand Docking). Below: MoBIoS SQL (mSQL); Query Engine and Mining Engine; MoBIoS Java Interface (MJI); Metric-Space Based Storage Manager.]
20. SQL as a Bioinformatic Programming Model
- Simple retrieval of similar objects.
21. Protein identification by database look-up
- 1. Build a reference database
- Given a complete genome → putative proteome
- Given a proteome → set of ideal mass-spectra (database)
- 2. Wet lab work
- In a laboratory, experimentally determine the mass-spectra of an unknown protein
- 3. Analyze the data
- Find the closest match with the database
- (a very noisy proposition)
22. Analytic Mass-Spectra
- At each resolvable mass, either a peptide is present or not
- Vector-space model: binary vector, one bit for each resolvable mass
- Similarity: shared peaks count = inner product
- (0100101) · (0111100) = 2
23. Cosine Distance
- D_rs = 1 - (x_r · x_s) / ((x_r · x_r)^(1/2) (x_s · x_s)^(1/2))
- Document retrieval uses
- the vector-space model
- known
- similarity: inner product
- as distance: cosine distance
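The shared peaks count and the cosine distance can be worked through on the binary spectra (0100101) and (0111100) from the previous slide. This is a minimal sketch; the function names are hypothetical, and real spectra would have one bit per resolvable mass rather than seven.

```python
import math


def inner_product(x, y):
    """For binary spectra this is the shared peaks count."""
    return sum(a * b for a, b in zip(x, y))


def cosine_distance(x, y):
    """D = 1 - (x . y) / (sqrt(x . x) * sqrt(y . y))."""
    norm = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - inner_product(x, y) / norm


def to_bits(s):
    """Turn a string such as '0100101' into a list of 0/1 integers."""
    return [int(c) for c in s]


r = to_bits("0100101")
s = to_bits("0111100")
print(inner_product(r, s))    # shared peaks count: 2
print(cosine_distance(r, s))  # 1 - 2 / sqrt(3 * 4)
```

Identical spectra get distance 0 and spectra with no shared peaks get distance 1, so smaller values mean more similar.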
24. Given
- the spectra of an unknown, S
- reference database, Spectra
SELECT Spectra.accesionNumber
FROM Spectra
WHERE
Cosine_Distance(S, Spectra, tolerance)
25. Protein Identification by Database Lookup of Mass Spectra
26. Sequences Pose Additional Problems
- Sequences: long units (identity for storage and retrieval)
- Genes
- Chromosomes
- Analysis comprises comparing small substrings
27. Local-Alignment (Homology Search)
- For each pair of arguments
- Find matching subsequences
[Figure: matching subsequences between Sequence 1 and Sequence 2]
28. Solution: Sequence View
- New view type
- Breaks sequences into k-mers
create SEQUENCEVIEW rice_sview as
SELECT CREATE FRAGMENTS (, 3) // k = 3
FROM
WHERE
USING HAMMING-DISTANCE
[Dan: Instead of miRNA later, maybe Sequence view of only exons.]
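What a sequence view holds can be sketched in a few lines. The example assumes overlapping fragments taken at every offset (stride 1) and a hypothetical `create_fragments` helper; how mSQL actually materializes fragments is not stated in these slides.

```python
def create_fragments(seq_id, seq, k):
    """Break one sequence into overlapping k-mers, keeping each fragment's offset.

    Assumption: one fragment per starting offset (stride 1).
    """
    return [(seq_id, i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


for row in create_fragments("rice_1", "ACGTAC", 3):
    print(row)
```

Each row keeps the parent sequence identifier and the offset, so fragments can later be merged back into longer subsequences.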
29. Materialize as an Index
D(AAA) 2
30. mSQL Sequence Operators
- CreateFragment
- Merge and Groupby
- Merge fragments back into longer subsequences
- Algebraic constraints
- Consider relative offsets when applied to pairs of sequences
- Parameterize this for additional biological semantics (gaps)
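One way to picture the merge step: given the offsets of matching k-mer pairs from two sequences, matches that share the same relative offset and overlap or abut are fused into one longer match. The sketch below is an assumption about the operator's effect, not its mSQL implementation; `merge_matches` is a hypothetical name, and gaps are not modeled.

```python
def merge_matches(matches, k):
    """Merge matching k-mer pairs into longer runs.

    matches: iterable of (offset_in_seq1, offset_in_seq2) for matching k-mers.
    Returns a list of (start1, start2, length) runs.
    Sorting dominates the cost, so the merge is O(n log n) in the number of matches.
    """
    runs = []
    # Group by relative offset (the "diagonal"), then walk each diagonal in order.
    for o1, o2 in sorted(set(matches), key=lambda m: (m[1] - m[0], m[0])):
        if runs:
            s1, s2, length = runs[-1]
            same_diagonal = (s2 - s1) == (o2 - o1)
            if same_diagonal and o1 <= s1 + length:   # overlaps or abuts the last run
                runs[-1] = (s1, s2, max(length, o1 - s1 + k))
                continue
        runs.append((o1, o2, k))
    return runs


# Three consecutive 3-mer matches on one diagonal merge into a single run of length 5.
print(merge_matches([(0, 5), (1, 6), (2, 7), (10, 3)], k=3))
```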
31. Local-Alignment (Homology Search)
- For each pair of arguments
- Find matching subsequences
[Figure: matching subsequences between Sequence 1 and Sequence 2]
32. [Dan: Insert example of n log n homology on k-mers with merge]
33. Conserved (PCR) Primer Pair Discovery (with R. Linder)
- Goal
- To find evidence of horizontal gene transfer
- Method
- Sample and sequence many orthologous regions in many plants
- Build a phylogenetic tree for each region, look for inconsistent topology
34. Pattern Definition of a Conserved Primer Pair
- Compare Arabidopsis Genome × Rice Genome
- Locate nucleotide patterns of form
- PCR primer pair candidate
- Matching: < 5 mismatches
- Eliminate non-unique primer candidates
- Usual implementations O(n^2), n = 10^9
[Figure: in both Rice and Arab., two regions of ≥ 18 matching nucleotides separated by a gap 400 to 3000 long]
35. Query Plan
- Arab. Genome, O(n); Rice Genome, O(m)
- Offline: Build Sequence View, O(n log n)
- Compare, O(m log n): Indexed Nested Loop
- Eliminate Duplicates
- Eliminate Low-Complexity Primers (LZ compression)
- Merge Overlapping Primers
- 10,000 conserved primer pair candidates
36. mSQL to locate candidate conserved primer pairs
// Create sequenceview of 18-mers of Rice and Arab.
SELECT merge(R1.fragment, A1.fragment)
FROM Rice_sview R1, Rice_sview R2, Arab_sview A1, Arab_sview A2
WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) < 1.0
  AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) < 1.0
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) > 400
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) < 3000
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) > 400
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) < 3000
GROUP BY R1.fragment, A1.fragment
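The logic of the query can be imitated on toy data with a brute-force sketch. Everything here is illustrative: the function names are hypothetical, the sequences are made up, k is 4 rather than 18, and the nested loops are the quadratic approach that the indexed query plan is meant to avoid.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length fragments."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq, k):
    """All overlapping k-mers of seq, each with its offset."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def primer_pair_candidates(rice, arab, k, max_dist, min_gap, max_gap):
    """Pairs of conserved fragments whose offsets differ by a bounded gap in both genomes."""
    # Conserved fragments: rice/arab k-mers within max_dist mismatches (brute force).
    matches = [(ro, ao)
               for ro, rf in kmers(rice, k)
               for ao, af in kmers(arab, k)
               if hamming(rf, af) <= max_dist]
    # Pair up two conserved fragments subject to the offset constraints.
    return [((r1, a1), (r2, a2))
            for r1, a1 in matches
            for r2, a2 in matches
            if min_gap < r2 - r1 < max_gap and min_gap < a2 - a1 < max_gap]


rice = "ACGA" + "TTTTTT" + "GCAG"    # made-up toy sequences
arab = "ACGA" + "CCCCCCC" + "GCAG"
print(primer_pair_candidates(rice, arab, k=4, max_dist=0, min_gap=5, max_gap=20))
```

Each result pairs the (rice offset, arab offset) of the first conserved fragment with that of the second.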
37. Detailed Analysis, Wet Lab Validation
- Found 13,418 possible primer pairs from MoBIoS,
- < 8 processor days including database load
- 100 best candidates BLASTed for matches in GenBank
- 15 matched other plant genes and the primers
- At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.
38. mSQL Queries Developed for Other Sequence Analysis Problems [Miranker et al., Data Eng. Bull., Sept. 04]
- Micro RNA RNAi predictions
- Rosetta analysis to determine protein function
- Electronic-PCR
- Sequence Homology
39. Query Engine Can Consider All Data
- From the miRNA problem
SELECT merge(G7.dna_seq), merge(M7.seq) mseq, gene
FROM M7, G7, Features // consider both 7-mers and their annotations
WHERE distance(hamming, G7.dna_seq, M7.seq) <= 0
  AND G7.SID = Features.SID
  AND Features.name = 'miRNA'
  AND G7.dna_seq.offset > Features.first
  AND G7.dna_seq.offset + G7.dna_seq.length < Features.last
HAVING mseq.offset < 1 AND mseq.offset + mseq.length > 7
40. Status
- All components of the system integrated and functioning
- Not a product
- Scalability of index on nearest-neighbor search
- Range search is a better fit for the relational model
- Actively seeking collaborations in anticipation of distributing the software.
41. Protein Identification by Database Lookup of Mass Spectra
42. Matching Electrostatic Shape of Molecules
43. Status
- Started with McKoi
- A Java open source object-relational DBMS
- (Think of Postgres written in Java)
- Added
- Biological data types
- Metric-space index
- Extending SQL engine (in finishing stages)
44. Other Results
- Protein identification by database lookup
- match against a database of theoretical mass-spectrometer signatures.
- Protein similarity by 3-D electrostatic shape
45. Three classes of algorithm
- Vantage-Point
- Generalized Hyperplane
- Radius-based Methods
46. Hyper-planes [Uhlmann 91]
- If d(x, h1) < d(x, h2) then x assigned to h1
[Figure: point x between pivots h1 and h2]
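The assignment rule, and the pruning test it allows during range search, can be sketched as follows. The names are hypothetical and ties are sent to h2 here, an arbitrary choice. The pruning test follows from the triangle inequality: for any x assigned to h1, d(q, h1) - d(q, h2) < 2 d(q, x), so when the difference exceeds 2r no point on the h1 side lies within r of the query.

```python
def gh_partition(points, h1, h2, d):
    """Assign each point to the closer of two pivots (ties go to h2)."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2


def can_prune_h1_side(q, r, h1, h2, d):
    """True when no point assigned to h1 can lie within distance r of q."""
    return d(q, h1) - d(q, h2) > 2 * r


def absdiff(a, b):
    """A simple metric on numbers, used only for this illustration."""
    return abs(a - b)


side1, side2 = gh_partition(list(range(10)), h1=2, h2=7, d=absdiff)
print(side1, side2)
print(can_prune_h1_side(q=9, r=1, h1=2, h2=7, d=absdiff))
```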
47. Develop a Hierarchical Clustering
[Figure: clusters labeled A, B, C, D, E, F]
- Hierarchy of bounding spheres, (center, radius)
- Bounding spheres may overlap
- Inspired by R-trees
48. Multi-vantage point method
- Consider d(VP_i, x) a projection onto an axis
- Looks like a k-d tree
- Choose number k d
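The projection idea can be sketched with a flat scan instead of a k-d tree. Each point is mapped to its vector of distances to the vantage points; by the triangle inequality, the largest coordinate difference between two such vectors is a lower bound on the true distance, so points whose bound exceeds the query radius are skipped without evaluating d. The vantage points below are arbitrary and all names are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def project(x, vantage_points, d):
    """Coordinates of x: its distance to each vantage point VP_i."""
    return tuple(d(vp, x) for vp in vantage_points)


def lower_bound(pq, px):
    """max_i |d(VP_i, q) - d(VP_i, x)| never exceeds d(q, x)."""
    return max(abs(a - b) for a, b in zip(pq, px))


def range_search(points, coords, q, r, vantage_points, d):
    """Return the points within r of q and how many distances were evaluated."""
    pq = project(q, vantage_points, d)
    hits, evaluated = [], 0
    for x, px in zip(points, coords):
        if lower_bound(pq, px) > r:   # pruned using the projection alone
            continue
        evaluated += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, evaluated


kmers = ["".join(p) for p in product("ACGT", repeat=4)]
vps = ["AAAA", "CCCC", "GGGG"]        # assumption: arbitrary vantage points
coords = [project(x, vps, hamming) for x in kmers]
hits, evaluated = range_search(kmers, coords, "ACGT", 1, vps, hamming)
print(len(hits), evaluated, len(kmers))
```

A tree over the projected coordinates would avoid scanning every coordinate vector; the scan is kept here only to show the bound.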
49. Hyper-planes [Uhlmann 91]
- If d(x, h1) < d(x, h2) then x assigned to h1
[Figure: point x between pivots h1 and h2]
50. Status
- Implemented an algorithm from each class.
- Examined performance on
- Synthetic Euclidean vectors typical of other work
- Peptide k-mers, mPAM distance
- Mass spectra, fuzzy cosine distance
- Image database
51. Comparison of three methods
- Vector data not typical of other data sets
- Multi-vantage point wins
- but not yet exciting
52. Our First Result: mPAM [Xu & Miranker 04]
- Dayhoff et al.'s PAM Derivation [74]
- Took a set of closely related protein sequences
- Developed a phylogenetic tree based on parsimony
- Counted substitutions to transform one sequence to another
53. mPAM250 vs. PAM250 on NCBI benchmark
- 103 queries over yeast proteome
- Hand curated correct answers.