Title: The MoBIoS Project Molecular Biological Information System
1 The MoBIoS Project: Molecular Biological Information System
- Daniel P. Miranker
- Dept. of Computer Sciences
- Center for Computational Biology and Bioinformatics
- University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang
2 Problem: In the life sciences, database management systems (DBMS) serve as glorified file managers.
- Little use of sophisticated data- and pattern-based retrieval
- Real scientific and technological problems
3 When biological data is put into an RDBMS
- Primary data is stored in text or blob fields
- Annotations may be relational
- Data retrieval
- Filter DB, sequential dump, O(n), to utilities
- E.g., BLAST
Organism  Function  Sequence
Yeast     membrane  AACCGGTTT
Yeast     mitosis   TATCGAAA
E. coli   membrane  AGGCCTA
4 Linear Data Scans, O(n), Endemic in Life Sciences
- Sequences
- DNA, RNA, Protein databases
- Mass Spectra
- proteomics
- Small Molecules, Protein Structure
- Protein interaction
- Rational drug design
- Pathways (graphs)
- Phylogenies (graphs, trees in particular)
5 Scope: To Find Common Ground, Both Biology and DBMS Have to Move
DBMS
Biological Information System
Metric-Space Database as the Common Ground
6 Metric Space is
- a pair, M = (D, d),
- where
- D is a set of points
- d is a metric distance function with the following properties:
- d(x, y) = d(y, x) (symmetry)
- d(x, y) > 0 for x ≠ y, d(x, x) = 0 (non-negativity)
- d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
(Figure labels: x, y, z)
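The three properties can be checked by brute force on a small sample. The sketch below is only an illustration: the function names `hamming` and `is_metric` and the sample fragments are assumptions made for this example, not part of MoBIoS.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def is_metric(points, d):
    """Brute-force check of the metric properties on a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):              # symmetry
            return False
        if d(x, y) < 0:                     # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):      # zero only for identical points
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):     # triangle inequality
            return False
    return True

fragments = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
print(is_metric(fragments, hamming))

# Squared difference is symmetric and non-negative but breaks the
# triangle inequality: d(0, 3) = 9 > d(0, 1) + d(1, 3) = 5.
print(is_metric([0, 1, 3], lambda a, b: (a - b) ** 2))
```

A check like this only samples the space; it can refute a candidate distance function but cannot prove it is a metric in general.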
7 Definition: By Analogy
- A Spatial Database Management System
- Extend relational DBMS
- Special indexes for 2D and 3D data: k-d trees and R-trees
- New data types
- Geographic information systems
- Topographic maps
- Buildings and the like
- A Metric-Space Database Management System
- Extend Relational DBMS
- Special indexes for metric-spaces
- New data types
- Biological information system
- Life science data types
8 Develop index structures to support distance and nearest-neighbor queries
- Well studied in main-memory
- But by no means a closed problem
- In databases (external/disk based methods)
- Embryonic
- Many myths
- Often assumed to be the basis of multimedia database systems
9 How to build a metric-space index
- Three algorithmic classes [Tasan, Ozsoyoglu 04]
- Vantage points
- Hyperplanes
- Bounding spheres
10 Vantage Point Method [Burkhard, Keller 73]
11 Vantage Point Method
Choose a point, VP
And a radius, R
12 Vantage Point Method
- Given VP, R
- The predicates
- d(VP, x) < R
- d(VP, x) ≥ R
- Divide the set into two equal halves
- Apply recursively
Choose a point, VP
And a radius, R
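A minimal sketch of the recursive split, assuming the radius R is taken as the median distance to VP so that the two predicates divide the set roughly in half. With a discrete distance such as Hamming, ties all fall on the d(VP, x) ≥ R side, so the halves are not always equal. The dictionary node layout and the names `build_vp_tree` and `collect` are illustrative assumptions, not the MoBIoS implementation.

```python
import random
from statistics import median

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_vp_tree(points, d, rng):
    """Split points by d(VP, x) < R and d(VP, x) >= R, then recurse."""
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))    # choose a point, VP
    if not points:
        return {"vp": vp, "R": 0, "inside": None, "outside": None}
    dists = [d(vp, x) for x in points]
    R = median(dists)                              # choose a radius, R
    inside = [x for x, dx in zip(points, dists) if dx < R]
    outside = [x for x, dx in zip(points, dists) if dx >= R]
    return {"vp": vp, "R": R,
            "inside": build_vp_tree(inside, d, rng),
            "outside": build_vp_tree(outside, d, rng)}

def collect(node, d):
    """Return every point under node, asserting both predicates hold."""
    if node is None:
        return []
    inside = collect(node["inside"], d)
    outside = collect(node["outside"], d)
    assert all(d(node["vp"], x) < node["R"] for x in inside)
    assert all(d(node["vp"], x) >= node["R"] for x in outside)
    return [node["vp"]] + inside + outside

fragments = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
tree = build_vp_tree(fragments, hamming, random.Random(0))
print(sorted(collect(tree, hamming)))
```

Choosing VP at random is the simplest option; the deck itself notes that vantage-point selection heuristics are an open question for this kind of data.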
13 Query, q, range r
(Figure labels: r, q)
14 Query, q, range r
- if
- d(q, VP) > R + r
- then
- all neighbors are outside the sphere
(Figure labels: VP, R, r, q)
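The pruning rule follows from the triangle inequality: if d(q, VP) > R + r, any x within r of q has d(VP, x) ≥ d(q, VP) - r > R. A sketch of a range search built on that rule, plus the symmetric rule for the outer half (d(q, VP) + r < R), is shown below. It uses one-dimensional points with absolute difference as the distance purely to keep the example small; the tuple layout and function names are assumptions.

```python
from statistics import median

def dist(a, b):
    """Absolute difference, a simple metric on numbers."""
    return abs(a - b)

def build(points):
    """Vantage-point tree node: (vp, R, inside subtree, outside subtree)."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return (vp, 0, None, None)
    R = median(dist(vp, x) for x in rest)
    inside = [x for x in rest if dist(vp, x) < R]
    outside = [x for x in rest if dist(vp, x) >= R]
    return (vp, R, build(inside), build(outside))

def range_search(node, q, r, hits, visited):
    """Collect all points within r of q, skipping halves that cannot match."""
    if node is None:
        return
    vp, R, inside, outside = node
    visited.append(vp)
    dq = dist(q, vp)
    if dq <= r:
        hits.append(vp)
    # Skip the inner half when d(q, VP) > R + r: all neighbors are outside.
    if not dq > R + r:
        range_search(inside, q, r, hits, visited)
    # Skip the outer half when d(q, VP) + r < R: all neighbors are inside.
    if not dq + r < R:
        range_search(outside, q, r, hits, visited)

points = list(range(0, 100, 3))
tree = build(points)
hits, visited = [], []
range_search(tree, 50, 4, hits, visited)
print(sorted(hits), len(visited), "of", len(points), "points visited")
```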
15 Multi-vantage point method
16 Multi-vantage point method
- Consider d(VPi, x) a projection onto an axis
- Looks like a k-d tree
- Choose number k d
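Treating each d(VPi, x) as a coordinate gives every point a k-dimensional projection. By the triangle inequality, |d(q, VPi) - d(x, VPi)| never exceeds d(q, x), so the largest coordinate difference is a lower bound that can rule out a point without computing the real distance. The sketch below uses a flat table of projections rather than a tree, which is a simplification for illustration; the pivots and names are assumptions.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def project(x, pivots, d):
    """Coordinates of x: its distance to each vantage point."""
    return tuple(d(vp, x) for vp in pivots)

def range_query(q, r, data, pivots, d):
    """Return (matches, number of real distance computations against data)."""
    table = {x: project(x, pivots, d) for x in data}   # built once, offline
    qc = project(q, pivots, d)
    matches, computed = [], 0
    for x, xc in table.items():
        lower_bound = max(abs(a - b) for a, b in zip(qc, xc))
        if lower_bound > r:
            continue                  # pruned without computing d(q, x)
        computed += 1
        if d(q, x) <= r:
            matches.append(x)
    return matches, computed

data = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
pivots = ["AAA", "TCA"]               # hypothetical choice of vantage points
matches, computed = range_query("CAA", 1, data, pivots, hamming)
print(sorted(matches), computed)
```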
17 Myths
- Solved problem: M-trees [Ciaccia et al. 96, 97]
- I can't get them to work on anything but their original synthetic data generator
- Good choice for vantage points is to find corners [Yianilos 93] (farthest-first clustering)
- Might be true for Euclidean spaces
- Early result, not true for our data
- High-dimensional indexing always asymptotically reduces to linear scans.
- Formal result based on an assumption of uniform data distributions.
18 Comparison of Three Methods of Metric-Space Indexing
19 Open problems
- Is there a general metric-space index structure that is generally good for most workloads?
- We are optimistic that mvp-trees, with further tuning, will be a useful answer
- Hyperplane methods are fair game; there is circumstantial evidence that they are a key component in Google's search engine.
- No work addresses clustering data pages on disk.
- Metric-space join algorithms
20 Biological Models are Usually Based on Similarity
- Similarity
- Biologists like scoring functions that reward each similar feature with a positive number
- Intuitive
- Distance
- More similar → smaller numbers
- Identical → 0
21 But Do Metric Models Capture Biology?
- Metrics are a subset of possible mathematical models
22 Sequence Problem 1
- Sequence similarity based on weighted edit distance
- Accepted weight matrices, PAM and BLOSUM, are not metric
- Log-odds matrices: negative values
- Defy simple algebraic normalization [Taylor, Jones 93], [Linial et al. 97]
23 Our First Result: mPAM [Xu, Miranker 04]
- Dayhoff et al.'s PAM derivation [74]
- Took a set of closely related protein sequences
- Developed a phylogenetic tree
- Counted substitutions to transform one sequence to another
- Tree determines a measure of time
24 PAM vs. mPAM: t = 1/f
- Using original substitution counts
- PAM: frequency of substitution
- S(a, b | t) = log(P(b | a, t) / q_b)
- mPAM: expected time between substitutions
- D(a,b) 1/log(1 ?(P(a,x)P(b,x))
x
25 Sequence Problem 2
- Sequences are long units (the identity for storage and retrieval)
- Genes
- Chromosomes
- Analysis comprises comparing small substrings
26 Solution: Sequence View
- New view type
- Breaks sequences into q-grams
CREATE SEQUENCEVIEW rice_sview AS SELECT CREATE FRAGMENTS (…, 3, 1) FROM … WHERE … USING HAMMING-DISTANCE
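A small sketch of what breaking a sequence into q-grams could look like. It assumes the two numeric arguments of CREATE FRAGMENTS mean fragment length 3 and step 1, and that offsets are 1-based; both readings are guesses made for this example, and `make_fragments` is a hypothetical helper, not an mSQL function.

```python
def make_fragments(seq, length=3, step=1):
    """Return (offset, fragment) pairs; offsets are 1-based by assumption."""
    return [(i + 1, seq[i:i + length])
            for i in range(0, len(seq) - length + 1, step)]

for offset, fragment in make_fragments("ATCAAA"):
    print(offset, fragment)
# 1 ATC
# 2 TCA
# 3 CAA
# 4 AAA
```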
27 Materialize as an Index
(Index figure labels: D(AAA) = 2, D(ACA) = 1, D(CAA) = 0, D(ATC) = 1)
Rowid  Offset  Logical Fragment
R1     1       A C A
R1     2       C A A
R1     3       A A C
R1     4       A C A
R2     1       A T C
R2     2       T C A
R2     3       C A A
R2     4       A A A
Genomes
Rowid  Seq
R1     CAACA
R2     ATCAAA
R3
28 Status
- Started with McKoi
- A Java open source object-relational DBMS
- (Think of Postgres written in Java)
- Added
- Biological data types
- Metric-space index
- Extending SQL engine (in progress)
29 Computed in MoBIoS
- Compare Arabidopsis Genome × Rice Genome
- Locate nucleotide patterns of the form sketched in the figure
- primer pair candidate
- Eliminate non-unique primer candidates
- Merge overlapping primer candidates
- Usual implementations O(n^2), n ≈ 10^9
(Figure: in both Rice and Arab., two stretches of ≥18 matching nucleotides separated by a gap 400 to 3000 long)
30 mSQL Query to locate candidate primer pairs
SELECT merge(R1.fragment, A1.fragment)
FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) < 1.0
  AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) < 1.0
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) > 400
  AND (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) < 3000
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) > 400
  AND (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) < 3000
GROUP BY R1.fragment, A1.fragment
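The logic of the query can be mimicked in a few lines on toy data. The sketch below is not how MoBIoS evaluates it: since an integer Hamming distance below 1.0 means an exact match, a plain hash table stands in for the metric-space index, and the fragment length and gap bounds are shrunk so the example fits on a slide. All names are assumptions.

```python
from collections import defaultdict

def make_fragments(seq, q):
    """All (offset, q-gram) pairs of seq, 0-based offsets."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def candidate_pairs(rice, arab, q, min_gap, max_gap):
    """Offsets (r1, a1, r2, a2) of matching fragment pairs with gaps in range."""
    index = defaultdict(list)                 # fragment -> offsets in arab
    for off, frag in make_fragments(arab, q):
        index[frag].append(off)
    rice_frags = make_fragments(rice, q)
    pairs = set()
    for r1_off, r1 in rice_frags:
        for a1_off in index.get(r1, ()):      # R1 matches A1
            for r2_off, r2 in rice_frags:
                if not min_gap < r2_off - r1_off < max_gap:
                    continue                  # gap constraint on the rice side
                for a2_off in index.get(r2, ()):   # R2 matches A2
                    if min_gap < a2_off - a1_off < max_gap:
                        pairs.add((r1_off, a1_off, r2_off, a2_off))
    return pairs

rice = "ACGTTTTTGGCA"
arab = "ACGTAAAAAAGGCA"
print(candidate_pairs(rice, arab, q=4, min_gap=3, max_gap=12))
```

With a tolerance above zero mismatches the hash lookup no longer works, which is where a metric-space index on the fragments would take over.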
31 Query Plan
- Arab. Genome, O(n); Rice Genome, O(m)
- Offline: Build Sequence View, O(n log n)
- Compare, O(m log n): Indexed Nested Loop
- Eliminate Duplicates
- Eliminate Low Complexity Primers (LZ compression)
- Merge Overlapping Primers
- 10,000 conserved
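One way to picture the low-complexity filter: a repetitive primer parses into few Lempel-Ziv phrases, a varied one into many. The slide only says "LZ compression", so the LZ78-style phrase count and the threshold below are assumptions chosen for illustration.

```python
def lz78_phrase_count(s):
    """Number of phrases in an LZ78-style parse of s."""
    seen = set()
    phrase = ""
    count = 0
    for ch in s:
        phrase += ch
        if phrase not in seen:      # new phrase: record it and start over
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:                      # leftover that repeats an earlier phrase
        count += 1
    return count

def is_low_complexity(s, min_phrases=8):
    """Flag sequences that parse into fewer than min_phrases phrases.

    min_phrases is a hypothetical cutoff, not a value from the slides.
    """
    return lz78_phrase_count(s) < min_phrases

print(lz78_phrase_count("A" * 18))               # 6
print(lz78_phrase_count("ACGTTGCAAGCTTAGGCA"))   # 11
```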
32 Preliminary Results
- Found 13,418 possible primer pairs from MoBIoS
- 100 best candidates BLASTed for matches in GenBank
- 15 matched other plant genes and the primers
- At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.
33 MoBIoS Architecture (Molecular Biological Information System)
34 Analysing Mass-Spectra
- Spectrum: histogram of mass/charge ratios of a collection of peptides
- Similarity: shared peaks count = inner product
- (0100101) · (0111100) = 2
35 Cosine Distance Approx. Inner Product
- D_rs = 1 - (x_r · x_s) / ((x_r · x_r)^(1/2) (x_s · x_s)^(1/2))
- Shown: store and retrieve mass-spectra using cosine distance, and it scales
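A worked example of both measures on the two binary spectra from the shared-peaks slide. The helper names are assumptions; the formula is the cosine distance given above.

```python
import math

def inner(x, y):
    """Inner product; for 0/1 spectra this is the shared peaks count."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - (x . y) / (|x| |y|)."""
    return 1.0 - inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]
print(inner(r, s))                         # 2 shared peaks
print(round(cosine_distance(r, s), 4))     # 1 - 2 / sqrt(3 * 4)
```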
36 mSQL Query for Protein Identification by Mass-Spec. Signature Database Lookup
SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
  Cosine_Distance(S, MS.spectrum, range1) and
  DS.accession_id = MS.accession_id = Prot.accesion_id and
  DS.ms_peak = P and MPAM250(PS, DS.sequence, range2)
37 Matching Electrostatic Shape of Molecules
38 Still benefit from grid-services
- Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
- Rational drug design: O(log n) finite element solutions to traverse the search tree.
- Make a service call to the grid for these operations only
- Mirror data contents to minimize I/O
- Since the need is intermittent, one grid serves many MoBIoS servers
(Figure labels: recluster, GRID, MoBIoS Server, New index, Shape match (FEM), Distance(real), High speed I/O, Mirror DB-Contents)
39 Hyperplanes [Uhlmann 91]
- If d(x, h1) < d(x, h2), then x is assigned to h1
(Figure labels: h1, x, h2)
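A sketch of the assignment rule, together with a pruning test for range queries that is not on the slide but follows from the triangle inequality: a side can be skipped when the query is more than 2r closer to the other center. One-dimensional points and absolute difference keep the example small; all names are assumptions.

```python
def dist(a, b):
    """Absolute difference, a simple metric on numbers."""
    return abs(a - b)

def partition(points, h1, h2, d):
    """Assign x to h1 if d(x, h1) < d(x, h2), otherwise to h2."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2

def sides_to_search(q, r, h1, h2, d):
    """Sides that may hold a point within r of q."""
    d1, d2 = d(q, h1), d(q, h2)
    sides = []
    if d1 - d2 <= 2 * r:      # otherwise every neighbor is nearer to h2
        sides.append("h1")
    if d2 - d1 <= 2 * r:      # otherwise every neighbor is nearer to h1
        sides.append("h2")
    return sides

side1, side2 = partition(list(range(11)), 2, 8, dist)
print(side1, side2)
print(sides_to_search(0, 1, 2, 8, dist))
print(sides_to_search(5, 1, 2, 8, dist))
```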
40 Develop a Hierarchical Clustering
(Figure labels: C, A, E, B, D, F)
- Hierarchy of bounding spheres, (center, radius)
- Bounding spheres may overlap
- Inspired by R-trees
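A toy hierarchy of bounding spheres, as a sketch only. Each node stores a (center, radius) pair covering everything beneath it, and a range query skips a node when d(q, center) > radius + r. The way the children are formed here (sorting by distance from the center and halving) is an arbitrary assumption, not the clustering method the slides refer to.

```python
def dist(a, b):
    """Absolute difference, a simple metric on numbers."""
    return abs(a - b)

def make_node(points, d, leaf_size=2):
    """Bounding sphere over points, split into two child spheres if large."""
    center = points[0]
    radius = max(d(center, x) for x in points)
    node = {"center": center, "radius": radius, "points": [], "children": []}
    if len(points) <= leaf_size:
        node["points"] = list(points)
        return node
    ordered = sorted(points, key=lambda x: d(center, x))
    mid = len(ordered) // 2
    node["children"] = [make_node(ordered[:mid], d, leaf_size),
                        make_node(ordered[mid:], d, leaf_size)]
    return node

def search(node, q, r, d, hits):
    """Collect points within r of q, skipping spheres that cannot reach q."""
    if d(q, node["center"]) > node["radius"] + r:
        return                                  # sphere is entirely too far
    hits.extend(x for x in node["points"] if d(q, x) <= r)
    for child in node["children"]:
        search(child, q, r, d, hits)

points = list(range(0, 40, 2))
root = make_node(points, dist)
hits = []
search(root, 21, 2, dist, hits)
print(sorted(hits))
```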