Title: MoBIoS
1 - MoBIoS
- A Metric-space DBMS to Support Biological
Discovery - Presenter Enohi I. Ibekwe
2Overview
- MoBIoS Project
- Motivation
- The challenge
- Established similarity measures
- Metric-space distance measure
- Disk-based metric tree index
- MoBIoS as a DBMS
- Application of MoBIoS
3MoBIoS Project
- Molecular Biological Information System
- Project at UT-Austin center for computational
biology and bioinformatics. - DBMS based on metric-space indexing techniques,
object-relational model of genomic and proteomic
data types and a database query language that
embodies the semantics of genomic and proteomic
data.
4Motivation
- Develop a DBMS to power Biological Information
System
5The Challenge
- Established biological model of similarity
measure do not form a metrics. -
- Scalable disk-based metric-indexes suffer from
the Curse of dimensionality
6Established Similarity Measure (I)
- Sequence Homology
- Query Sequence
- Database of sequences
- Substitution Matrix (PAM / BLOSUM)
- Similarity Measure
- Global Sequence Alignment (Edit distance)
- Local Sequence Alignment (Most important)
7Established Similarity Measure (II)
- Local Sequence Alignment
- A local sequence alignment query asks, given a
query sequence S, a database of sequences T and
a similarity matrix corresponding to an
evolutionary model, return all subsequences of T
that are sufficiently similar to a subsequence of
S - Main issue Result is a set of answer.
- A metric distance function must return a single
value for each pair of argument
8Established Similarity Measure (III)
- Global Sequence Alignment
- Given an alphabet A , a similarity substitution
matrix M corresponding to an evolutionary model,
the global sequence alignment for two sequences s
and t is to find a strings a and b which are
obtained from s and t respectively by inserting
spaces either into or at the ends of s and t and
whose score computed using M is at a maximum
(Similarity measure) over all pairs of such
strings obtained from s and t. (example) - Issue Result maybe negative since substitution
matrix is based on log-odd probability.
Similarity measure favors greater positive
number.
9Metric-space Distance measure (I)
- Homology Search
- Query Sequence Sub strings of length q (q-grams)
- Database of sequences Metric indexed records of
fixed length q (indexed q-grams) strings. - Substitution Matrix (mPAM)
- Similarity Measure (distance measure)
- Local Alignments is computed from global
alignment.
10Metric-space Distance measure (II)
- mPAM substitution Matrix
- Accepted Point Mutation Model.
- PAM calculates scores based on frequency in which
individual pairs of amino acids substituted for
each other. - mPAM instead of calculating frequency of
substitutions (PAM), computes expected time
between substitution. - mPAM has been validated.(Validation)
11Metric-space Distance measure (III)
- Computing Local Alignment from Global Alignment
(Algorithm) - Offline
- Divide database of sequence into sub strings
(q-grams) - Build metric-space index structure on q-grams
- Online
- Divide query sequence into sub strings (q-grams)
- Using global alignment as a distance function to
match query q-grams.
12Disk-based metric-tree index
- Phases
- Initialization
- Searching
- Query performance metric
- Number of disk I/O ( nodes visited)
- Number of distance computation
- Options Exploited
- M-Tree
- Generalized Hyper plane tree
- MVP-Tree (optimal)
13Disk-based metric-tree index (initialization)
- M-Tree initialization
- Best case O(nlogn)
- worst case O(n3)
- Generalized Hyper plane (GH-Tree) initialization
- Best case O(nlogn)
- worst case O(n2)
- GH-tree Bi-direction
- M-Tree Bottom-up
- In practice, both M-Tree and GH-Tree scale
linearly
14Disk-based metric-tree index (Searching)
15MoBIoS as a DBMS (I)
- Mckoi (Java RDBMS).
- Plus metric-space indexing
- Plus Biological data types
- Plus biological semantics
- Life science data store
- Biological sequence data
- Mass-spectrometry protein signature
16MoBIoS as a DBMS (III)
- Language Extension
- M-SQL
- Data type Extension
- Data type for Sequences (DNA,RNA,peptide)
- Data type for Mass spectrum
- Semantics Extension
- Subsequence Operators
- Local alignment
17MoBIoS as a DBMS (IV)
- Semantics Extension
- Similarity (metric distance) between data types
- mPAM250
- Cosine distance
- Lk norms
- Keys Extension
- Primary key (metrickey)
- Index (metric)
18Application of MoBIoS (I)
- MS/MS Protein Identification
- Breakdown protein into fragments called peptide
using a protease enzyme - Identify protein by using a mass-spectrometer to
measure the mass-charge ratio of the fragments
and comparing the experiment result to a database
of precomputed spectra.
19Application of MoBIoS(II)
- M-SQL Solution
- Create table protein_sequences (accesion_id int,
- sequence peptide,
- primary metrickey(sequence, mPAM250)
- Create table digested_sequences (accession_id
int, - fragment peptide,
- enzyme varchar,
- ms_peak int, primary key(enzyme,
accession_id)
- Create index fragment_sequence on
digested_sequences (fragment) - metric(mPAM250)
- Create table mass_spectra
- (accession_id int,
- enzyme varchar,
- spectrum spectrum,
- primary metrickey(spectrum, cosine_distance)
20Application of MoBIoS(III)
- M-SQL Solution
- SELECT Prot.accesion_id, Prot.sequence
- FROM protein_sequences Prot, digested_sequences
DS,mass_spectra MS - WHERE
- MS.enzyme DS.enzyme E and
- Cosine_Distance(S, MS.spectrum, range1) and
- DS.accession_id MS.accession_id
Prot.accesion_id and - DS.ms_peak P and
- MPAM250(PS, DS.sequence, range2)
21BLAST vs MoBIoS
- MoBIoS
- Molecular Biological Information System
- DBMS specialized for storage, retrieval and
mining of biological data - Sequence Database and query sequence is divided
into q-grams and Database is indexed offline.
- BLAST
- Basic Local Alignment Search Tool
- Utility specialized for retrieval and mining of
biological data outside a database
- Only query sequence is divide and hot-point index
is done at query time
22MoBIoS Demo
- MoBIoS http//ccvweb.csres.utexas.edu9080/msfoun
d/ccForm.jsp - PDB http//www.rcsb.org/pdb/
23Conclusion
- Biological data is not random and very likely
exhibit the intrinsic structure necessary for
metric-space indexing to succeed.
24References
- http//www.cs.utexas.edu/users/mobios/Publications
/miranker-mobios-final-03.pdf - http//www.cs.utexas.edu/users/mobios/Publications
/mao-bibe-03.pdf - http//www.cs.utexas.edu/users/mobios/
- http//www.mckoi.com/database/
25Appendix
Return
26Appendix I- Metric
- A metric-space is a set of objects S, with a
distance function d, such that given any three
objects x, y, z, - Non-Negativity
- d(x,y) gt 0 for x y d(x,y) 0 for x y
- Symmetry
- d(x,y) d(y,x)
- Triangular inequality
- d(x,y) d(y,z) d(x,y)
-
Return
27Appendix II - Sequence
- 2 RNA sequences from a DNA strand.
Return
28Appendix III - PAM
- Percent Accepted Mutation(PAM)
- A PAM(x) substitution matrix is a look-up table
in which scores for each amino acid substitution
have been calculated based on the frequency of
that substitution in closely related proteins
that have experienced a certain amount (x) of
evolutionary divergence. (e.g PAM250) - A unit to quantify the amount of evolutionary
change in a protein sequence. Based on log-odd
probability.
Return
29Appendix IV PAM250
- At this evolutionary distance (250 substitutions
per hundred residues)
Return
30Appendix V - BLOSUM
- Blocks Substitution Matrix (BLOSUM)
- A substitution matrix in which scores for each
position are derived from observations of the
frequencies of substitutions in blocks of local
alignments in related ( e.g BLOSUM62) - A unit to quantify the amount of evolutionary
change in a protein sequence. Based on log-odd
probability
Return
31Appendix VI BLOSUM62
- BLOSUM62 matrix is calculated from protein blocks
such that if two sequences are more than 62
identical
Return
32Appendix VII mPAM250
- Expected time based on 250 PAM distance as a
unit.
Return
33Appendix VIII mPAM Validation
- Based on benchmark query set by Smith-Waterman.
- Graph shows ROC50 values (Receiver Operating
Characteristics) - Negative x- axis indicate mPAM has better
performance
- Difference between ROC50 values using mPAM and
PAM250
Return
34Appendix IX - Distance measure
- Global Sequence Alignment
- Given an alphabet A , a similarity substitution
matrix M corresponding to an evolutionary model,
the global sequence alignment for two sequences s
and t is to find a strings a and b which are
obtained from s and t respectively by inserting
spaces either into or at the ends of s and t and
whose score computed using M is at a maximum
(Similarity measure) or minimum (distance
measure) over all pairs of such strings obtained
from s and t.
Return
35Appendix X Homology Search
- Build Index Structure(Offline)
- Divide the database sequences into a set of
overlapping sub strings of length q (q-grams)
with step size 1. - Build a metric-space index D based on global
alignment to support constant time lookup of
exact match. - Homology Search Query (Online)
- Divide the query sequence W into overlapping sub
string , F wi i 0.. W -q , of length q
with step size 1. - For each wi in F, run range query Q(wi, r)
against database D to find a set of matching
q-grams, Ri f i,j d( f i,j , wi) lt r, f i,j
E D wi E F , where d is the distance function. - Using a greedy heuristic algorithm to extend and
chain all fragments in R0UR1URw-t to deduce the
result of homology search based on local
alignment for query W
Return
36Appendix XI - GSA
Return