MoBIoS - PowerPoint PPT Presentation

About This Presentation

Title:

MoBIoS

Description:

Disk-based metric tree index. MoBIoS as a DBMS. Application of ... Generalized Hyper plane (GH-Tree) initialization. Best case : O(nlogn) ... Identification ... – PowerPoint PPT presentation

Number of Views:22

Avg rating:3.0/5.0

Slides: 37

Provided by: veri76

Learn more at: https://www2.cs.uh.edu

Category:

more less

Transcript and Presenter's Notes

Title: MoBIoS

1

MoBIoS
A Metric-space DBMS to Support Biological
Discovery
Presenter Enohi I. Ibekwe

2
Overview

MoBIoS Project
Motivation
The challenge
Established similarity measures
Metric-space distance measure
Disk-based metric tree index
MoBIoS as a DBMS
Application of MoBIoS

3
MoBIoS Project

Molecular Biological Information System
Project at UT-Austin center for computational
biology and bioinformatics.
DBMS based on metric-space indexing techniques,
object-relational model of genomic and proteomic
data types and a database query language that
embodies the semantics of genomic and proteomic
data.

4
Motivation

Develop a DBMS to power Biological Information
System

5
The Challenge

Established biological model of similarity
measure do not form a metrics.
Scalable disk-based metric-indexes suffer from
the Curse of dimensionality

6
Established Similarity Measure (I)

Sequence Homology
Query Sequence
Database of sequences
Substitution Matrix (PAM / BLOSUM)
Similarity Measure
Global Sequence Alignment (Edit distance)
Local Sequence Alignment (Most important)

7
Established Similarity Measure (II)

Local Sequence Alignment
A local sequence alignment query asks, given a
query sequence S, a database of sequences T and
a similarity matrix corresponding to an
evolutionary model, return all subsequences of T
that are sufficiently similar to a subsequence of
S
Main issue Result is a set of answer.
A metric distance function must return a single
value for each pair of argument

8
Established Similarity Measure (III)

Global Sequence Alignment
Given an alphabet A , a similarity substitution
matrix M corresponding to an evolutionary model,
the global sequence alignment for two sequences s
and t is to find a strings a and b which are
obtained from s and t respectively by inserting
spaces either into or at the ends of s and t and
whose score computed using M is at a maximum
(Similarity measure) over all pairs of such
strings obtained from s and t. (example)
Issue Result maybe negative since substitution
matrix is based on log-odd probability.
Similarity measure favors greater positive
number.

9
Metric-space Distance measure (I)

Homology Search
Query Sequence Sub strings of length q (q-grams)
Database of sequences Metric indexed records of
fixed length q (indexed q-grams) strings.
Substitution Matrix (mPAM)
Similarity Measure (distance measure)
Local Alignments is computed from global
alignment.

10
Metric-space Distance measure (II)

mPAM substitution Matrix
Accepted Point Mutation Model.
PAM calculates scores based on frequency in which
individual pairs of amino acids substituted for
each other.
mPAM instead of calculating frequency of
substitutions (PAM), computes expected time
between substitution.
mPAM has been validated.(Validation)

11
Metric-space Distance measure (III)

Computing Local Alignment from Global Alignment
(Algorithm)
Offline
Divide database of sequence into sub strings
(q-grams)
Build metric-space index structure on q-grams
Online
Divide query sequence into sub strings (q-grams)
Using global alignment as a distance function to
match query q-grams.

12
Disk-based metric-tree index

Phases
Initialization
Searching
Query performance metric
Number of disk I/O ( nodes visited)
Number of distance computation
Options Exploited
M-Tree
Generalized Hyper plane tree
MVP-Tree (optimal)

13
Disk-based metric-tree index (initialization)

M-Tree initialization
Best case O(nlogn)
worst case O(n3)
Generalized Hyper plane (GH-Tree) initialization
Best case O(nlogn)
worst case O(n2)

GH-tree Bi-direction
M-Tree Bottom-up
In practice, both M-Tree and GH-Tree scale
linearly

14
Disk-based metric-tree index (Searching)
15
MoBIoS as a DBMS (I)

Mckoi (Java RDBMS).
Plus metric-space indexing
Plus Biological data types
Plus biological semantics
Life science data store
Biological sequence data
Mass-spectrometry protein signature

16
MoBIoS as a DBMS (III)

Language Extension
M-SQL
Data type Extension
Data type for Sequences (DNA,RNA,peptide)
Data type for Mass spectrum
Semantics Extension
Subsequence Operators
Local alignment

17
MoBIoS as a DBMS (IV)

Semantics Extension
Similarity (metric distance) between data types
mPAM250
Cosine distance
Lk norms
Keys Extension
Primary key (metrickey)
Index (metric)

18
Application of MoBIoS (I)

MS/MS Protein Identification
Breakdown protein into fragments called peptide
using a protease enzyme
Identify protein by using a mass-spectrometer to
measure the mass-charge ratio of the fragments
and comparing the experiment result to a database
of precomputed spectra.

19
Application of MoBIoS(II)

M-SQL Solution
Create table protein_sequences (accesion_id int,
sequence peptide,
primary metrickey(sequence, mPAM250)
Create table digested_sequences (accession_id
int,
fragment peptide,
enzyme varchar,
ms_peak int, primary key(enzyme,
accession_id)

Create index fragment_sequence on
digested_sequences (fragment)
metric(mPAM250)
Create table mass_spectra
(accession_id int,
enzyme varchar,
spectrum spectrum,
primary metrickey(spectrum, cosine_distance)

20
Application of MoBIoS(III)

M-SQL Solution
SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences
DS,mass_spectra MS
WHERE
MS.enzyme DS.enzyme E and
Cosine_Distance(S, MS.spectrum, range1) and
DS.accession_id MS.accession_id
Prot.accesion_id and
DS.ms_peak P and
MPAM250(PS, DS.sequence, range2)

21
BLAST vs MoBIoS

MoBIoS
Molecular Biological Information System
DBMS specialized for storage, retrieval and
mining of biological data
Sequence Database and query sequence is divided
into q-grams and Database is indexed offline.

BLAST
Basic Local Alignment Search Tool
Utility specialized for retrieval and mining of
biological data outside a database
Only query sequence is divide and hot-point index
is done at query time

22
MoBIoS Demo

MoBIoS http//ccvweb.csres.utexas.edu9080/msfoun
d/ccForm.jsp
PDB http//www.rcsb.org/pdb/

23
Conclusion

Biological data is not random and very likely
exhibit the intrinsic structure necessary for
metric-space indexing to succeed.

24
References

http//www.cs.utexas.edu/users/mobios/Publications
/miranker-mobios-final-03.pdf
http//www.cs.utexas.edu/users/mobios/Publications
/mao-bibe-03.pdf
http//www.cs.utexas.edu/users/mobios/
http//www.mckoi.com/database/

25
Appendix

Return
26
Appendix I- Metric

A metric-space is a set of objects S, with a
distance function d, such that given any three
objects x, y, z,
Non-Negativity
d(x,y) gt 0 for x y d(x,y) 0 for x y
Symmetry
d(x,y) d(y,x)
Triangular inequality
d(x,y) d(y,z) d(x,y)

Return
27
Appendix II - Sequence

2 RNA sequences from a DNA strand.

Return
28
Appendix III - PAM

Percent Accepted Mutation(PAM)
A PAM(x) substitution matrix is a look-up table
in which scores for each amino acid substitution
have been calculated based on the frequency of
that substitution in closely related proteins
that have experienced a certain amount (x) of
evolutionary divergence. (e.g PAM250)
A unit to quantify the amount of evolutionary
change in a protein sequence. Based on log-odd
probability.

Return
29
Appendix IV PAM250

At this evolutionary distance (250 substitutions
per hundred residues)

Return
30
Appendix V - BLOSUM

Blocks Substitution Matrix (BLOSUM)
A substitution matrix in which scores for each
position are derived from observations of the
frequencies of substitutions in blocks of local
alignments in related ( e.g BLOSUM62)
A unit to quantify the amount of evolutionary
change in a protein sequence. Based on log-odd
probability

Return
31
Appendix VI BLOSUM62

BLOSUM62 matrix is calculated from protein blocks
such that if two sequences are more than 62
identical

Return
32
Appendix VII mPAM250

Expected time based on 250 PAM distance as a
unit.

Return
33
Appendix VIII mPAM Validation

Based on benchmark query set by Smith-Waterman.
Graph shows ROC50 values (Receiver Operating
Characteristics)
Negative x- axis indicate mPAM has better
performance

Difference between ROC50 values using mPAM and
PAM250

Return
34
Appendix IX - Distance measure

Global Sequence Alignment
Given an alphabet A , a similarity substitution
matrix M corresponding to an evolutionary model,
the global sequence alignment for two sequences s
and t is to find a strings a and b which are
obtained from s and t respectively by inserting
spaces either into or at the ends of s and t and
whose score computed using M is at a maximum
(Similarity measure) or minimum (distance
measure) over all pairs of such strings obtained
from s and t.

Return
35
Appendix X Homology Search

Build Index Structure(Offline)
Divide the database sequences into a set of
overlapping sub strings of length q (q-grams)
with step size 1.
Build a metric-space index D based on global
alignment to support constant time lookup of
exact match.
Homology Search Query (Online)
Divide the query sequence W into overlapping sub
string , F wi i 0.. W -q , of length q
with step size 1.
For each wi in F, run range query Q(wi, r)
against database D to find a set of matching
q-grams, Ri f i,j d( f i,j , wi) lt r, f i,j
E D wi E F , where d is the distance function.
Using a greedy heuristic algorithm to extend and
chain all fragments in R0UR1URw-t to deduce the
result of homology search based on local
alignment for query W