MoBIoS - PowerPoint PPT Presentation

Slides: 37
Provided by: veri76
Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes


1
  • MoBIoS
  • A Metric-space DBMS to Support Biological
    Discovery
  • Presenter: Enohi I. Ibekwe

2
Overview
  • MoBIoS Project
  • Motivation
  • The challenge
  • Established similarity measures
  • Metric-space distance measure
  • Disk-based metric tree index
  • MoBIoS as a DBMS
  • Application of MoBIoS

3
MoBIoS Project
  • Molecular Biological Information System
  • Project at the UT-Austin Center for Computational
    Biology and Bioinformatics.
  • A DBMS built on metric-space indexing techniques,
    an object-relational model of genomic and
    proteomic data types, and a database query
    language that embodies the semantics of genomic
    and proteomic data.

4
Motivation
  • Develop a DBMS to power biological information
    systems

5
The Challenge
  • Established biological models of similarity do
    not form a metric.
  • Scalable disk-based metric indexes suffer from
    the curse of dimensionality.

6
Established Similarity Measure (I)
  • Sequence Homology
  • Query Sequence
  • Database of sequences
  • Substitution Matrix (PAM / BLOSUM)
  • Similarity Measure
  • Global Sequence Alignment (Edit distance)
  • Local Sequence Alignment (Most important)

7
Established Similarity Measure (II)
  • Local Sequence Alignment
  • A local sequence alignment query asks, given a
    query sequence S, a database of sequences T and
    a similarity matrix corresponding to an
    evolutionary model, return all subsequences of T
    that are sufficiently similar to a subsequence of
    S
  • Main issue: the result is a set of answers.
  • A metric distance function must return a single
    value for each pair of arguments.

8
Established Similarity Measure (III)
  • Global Sequence Alignment
  • Given an alphabet A and a similarity substitution
    matrix M corresponding to an evolutionary model,
    the global sequence alignment of two sequences s
    and t is the pair of strings a and b, obtained
    from s and t respectively by inserting spaces
    either into or at the ends of s and t, whose
    score computed using M is maximal (similarity
    measure) over all such pairs of strings obtained
    from s and t. (example)
  • Issue: the result may be negative, since the
    substitution matrix is based on log-odds
    probabilities, and a similarity measure favors
    larger positive numbers, whereas a metric
    distance is non-negative and smallest for
    identical objects.
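
The definition above can be illustrated with the standard Needleman-Wunsch dynamic program. This is a minimal sketch, not the MoBIoS implementation: the scoring function `toy_score` and the linear gap penalty are assumptions chosen for readability and stand in for a real substitution matrix such as PAM or BLOSUM.

```python
def global_alignment_score(s, t, score, gap):
    """Best global alignment score of s and t (Needleman-Wunsch recurrence).

    score(a, b) plays the role of the substitution matrix M;
    gap is the (negative) score of aligning a symbol with a space.
    """
    # prev[j] = best score of aligning the empty prefix of s with t[:j]
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # s[:i] aligned against the empty prefix of t
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # align two symbols
                prev[j] + gap,                            # space inserted in t
                curr[j - 1] + gap,                        # space inserted in s
            ))
        prev = curr
    return prev[-1]


def toy_score(a, b):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1


similar = global_alignment_score("ACGT", "ACGT", toy_score, gap=-2)
dissimilar = global_alignment_score("AAAA", "TTTT", toy_score, gap=-2)
```

With these toy scores, identical sequences receive a large positive score and unrelated ones a negative score, which is the property the slide flags as incompatible with a metric distance.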

9
Metric-space Distance measure (I)
  • Homology Search
  • Query sequence: substrings of length q (q-grams)
  • Database of sequences: metric-indexed records of
    fixed-length strings of length q (indexed q-grams)
  • Substitution Matrix (mPAM)
  • Similarity Measure (distance measure)
  • Local alignments are computed from global
    alignments.

10
Metric-space Distance measure (II)
  • mPAM substitution Matrix
  • Accepted Point Mutation Model.
  • PAM calculates scores based on the frequency with
    which individual pairs of amino acids substitute
    for each other.
  • Instead of calculating the frequency of
    substitutions (as PAM does), mPAM computes the
    expected time between substitutions.
  • mPAM has been validated. (Validation)

11
Metric-space Distance measure (III)
  • Computing Local Alignment from Global Alignment
    (Algorithm)
  • Offline
  • Divide the database of sequences into substrings
    (q-grams)
  • Build a metric-space index structure on the
    q-grams
  • Online
  • Divide the query sequence into substrings (q-grams)
  • Use global alignment as the distance function to
    match the query q-grams against the index.
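
The offline and online phases can be sketched as follows. This is an illustration under stated assumptions: a plain list scanned linearly stands in for the metric-space index, and Hamming distance stands in for the global-alignment distance the slide describes. The names `qgrams`, `build_index`, and `match_query` are hypothetical.

```python
def qgrams(sequence, q):
    """All overlapping substrings of length q, paired with their start offset."""
    return [(i, sequence[i:i + q]) for i in range(len(sequence) - q + 1)]


def hamming(a, b):
    """Stand-in distance for equal-length q-grams: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))


def build_index(database, q):
    """Offline: collect every database q-gram with its (sequence id, offset).

    A real system would build a metric tree here; a list keeps the sketch short.
    """
    return [(seq_id, offset, gram)
            for seq_id, seq in database.items()
            for offset, gram in qgrams(seq, q)]


def match_query(index, query, q, radius):
    """Online: for each query q-gram, report database q-grams within the radius."""
    hits = []
    for q_off, q_gram in qgrams(query, q):
        for seq_id, d_off, d_gram in index:
            if hamming(q_gram, d_gram) <= radius:
                hits.append((q_off, seq_id, d_off))
    return hits


database = {"seq1": "MKVLAT", "seq2": "GKVLST"}  # made-up peptide strings
index = build_index(database, q=4)
hits = match_query(index, "KVLA", q=4, radius=1)
```

Each hit records where a query q-gram matched in the database, which is the raw material for assembling local alignments.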

12
Disk-based metric-tree index
  • Phases
  • Initialization
  • Searching
  • Query performance metric
  • Number of disk I/Os (nodes visited)
  • Number of distance computations
  • Options Exploited
  • M-Tree
  • Generalized Hyperplane tree (GH-Tree)
  • MVP-Tree (optimal)

13
Disk-based metric-tree index (initialization)
  • M-Tree initialization
  • Best case: O(n log n)
  • Worst case: O(n^3)
  • Generalized Hyperplane (GH-Tree) initialization
  • Best case: O(n log n)
  • Worst case: O(n^2)
  • GH-Tree: bi-directional
  • M-Tree: bottom-up
  • In practice, both M-Tree and GH-Tree scale
    linearly
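
To make the generalized hyperplane idea concrete, here is a minimal in-memory sketch. It is not the disk-based, bi-directional construction the slide refers to: it builds the tree top-down with randomly chosen pivots, and the names `GHNode`, `build_gh_tree`, and `range_search` are assumptions. Each node holds two pivots, and every remaining object goes to the side of the pivot it is closer to. The search skips a subtree when the triangle inequality rules out any hit.

```python
import random


class GHNode:
    def __init__(self, p1, p2=None, left=None, right=None):
        self.p1, self.p2 = p1, p2            # the two pivots (p2 may be absent)
        self.left, self.right = left, right  # objects closer to p1 / closer to p2


def build_gh_tree(points, dist, rng):
    """Recursively split the points by which of two random pivots is closer."""
    if not points:
        return None
    if len(points) == 1:
        return GHNode(points[0])
    pts = list(points)
    rng.shuffle(pts)
    p1, p2, rest = pts[0], pts[1], pts[2:]
    left, right = [], []
    for x in rest:
        (left if dist(x, p1) <= dist(x, p2) else right).append(x)
    return GHNode(p1, p2,
                  build_gh_tree(left, dist, rng),
                  build_gh_tree(right, dist, rng))


def range_search(node, query, radius, dist, out):
    """Append to out every stored object within radius of query."""
    if node is None:
        return out
    d1 = dist(query, node.p1)
    if d1 <= radius:
        out.append(node.p1)
    if node.p2 is None:
        return out
    d2 = dist(query, node.p2)
    if d2 <= radius:
        out.append(node.p2)
    # Any hit x on the left satisfies d1 <= d2 + 2 * radius, so the left
    # subtree can be skipped when that fails; symmetrically for the right.
    if d1 - d2 <= 2 * radius:
        range_search(node.left, query, radius, dist, out)
    if d2 - d1 <= 2 * radius:
        range_search(node.right, query, radius, dist, out)
    return out


def abs_diff(a, b):
    return abs(a - b)


data = [random.Random(0).randint(0, 1000) for _ in range(1)]  # placeholder, replaced below
data = [v for v in random.Random(0).sample(range(1000), 200)]
tree = build_gh_tree(data, abs_diff, random.Random(1))
found = sorted(range_search(tree, 500, 25, abs_diff, []))
```

Every skipped subtree saves both distance computations and node visits, the two query-performance measures listed for the disk-based index.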

14
Disk-based metric-tree index (Searching)

15
MoBIoS as a DBMS (I)
  • Mckoi (Java RDBMS).
  • Plus metric-space indexing
  • Plus Biological data types
  • Plus biological semantics
  • Life science data store
  • Biological sequence data
  • Mass-spectrometry protein signature

16
MoBIoS as a DBMS (III)
  • Language Extension
  • M-SQL
  • Data type Extension
  • Data type for Sequences (DNA,RNA,peptide)
  • Data type for Mass spectrum
  • Semantics Extension
  • Subsequence Operators
  • Local alignment

17
MoBIoS as a DBMS (IV)
  • Semantics Extension
  • Similarity (metric distance) between data types
  • mPAM250
  • Cosine distance
  • Lk norms
  • Keys Extension
  • Primary key (metrickey)
  • Index (metric)

18
Application of MoBIoS (I)
  • MS/MS Protein Identification
  • Break down the protein into fragments called
    peptides using a protease enzyme
  • Identify the protein by using a mass spectrometer
    to measure the mass-to-charge ratio of the
    fragments and comparing the experimental result
    to a database of precomputed spectra.
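
The digestion step can be simulated in software to precompute the fragments for a database. The sketch below assumes trypsin as the protease (the slide does not name one) and uses the usual rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name and the example sequence are made up.

```python
def trypsin_digest(protein):
    """Split a protein sequence into peptides at trypsin cleavage sites."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        # Cleave after K or R, but not when the next residue is P.
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing fragment with no cleavage site
    return peptides


fragments = trypsin_digest("MKVLRPAGKTS")
```

Each resulting fragment would then be stored with its enzyme and spectrum so that an experimental spectrum can be compared against the precomputed ones.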

19
Application of MoBIoS(II)
  • M-SQL Solution
  • Create table protein_sequences (accesion_id int,
      sequence peptide,
      primary metrickey(sequence, mPAM250))
  • Create table digested_sequences (accession_id int,
      fragment peptide,
      enzyme varchar,
      ms_peak int,
      primary key(enzyme, accession_id))
  • Create index fragment_sequence on
      digested_sequences (fragment) metric(mPAM250)
  • Create table mass_spectra (accession_id int,
      enzyme varchar,
      spectrum spectrum,
      primary metrickey(spectrum, cosine_distance))

20
Application of MoBIoS(III)
  • M-SQL Solution
  • SELECT Prot.accesion_id, Prot.sequence
      FROM protein_sequences Prot, digested_sequences DS,
        mass_spectra MS
      WHERE MS.enzyme = DS.enzyme = E and
        Cosine_Distance(S, MS.spectrum, range1) and
        DS.accession_id = MS.accession_id = Prot.accesion_id and
        DS.ms_peak = P and
        MPAM250(PS, DS.sequence, range2)

21
BLAST vs MoBIoS
  • MoBIoS
  • Molecular Biological Information System
  • DBMS specialized for storage, retrieval and
    mining of biological data
  • Both the sequence database and the query sequence
    are divided into q-grams, and the database is
    indexed offline.
  • BLAST
  • Basic Local Alignment Search Tool
  • Utility specialized for retrieval and mining of
    biological data outside a database
  • Only the query sequence is divided, and the
    hot-point index is built at query time

22
MoBIoS Demo
  • MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp
  • PDB: http://www.rcsb.org/pdb/

23
Conclusion
  • Biological data is not random and very likely
    exhibits the intrinsic structure necessary for
    metric-space indexing to succeed.

24
References
  • http://www.cs.utexas.edu/users/mobios/Publications/miranker-mobios-final-03.pdf
  • http://www.cs.utexas.edu/users/mobios/Publications/mao-bibe-03.pdf
  • http://www.cs.utexas.edu/users/mobios/
  • http://www.mckoi.com/database/

25
Appendix

26
Appendix I- Metric
  • A metric space is a set of objects S with a
    distance function d such that, for any three
    objects x, y, z:
  • Non-negativity
  • d(x,y) > 0 for x ≠ y; d(x,y) = 0 for x = y
  • Symmetry
  • d(x,y) = d(y,x)
  • Triangle inequality
  • d(x,y) + d(y,z) ≥ d(x,z)
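
The three axioms can be spot-checked on a finite sample of objects. The helper below is a sketch with a hypothetical name, `metric_violations`. Passing the check on a sample does not prove that a function is a metric, but a single reported violation proves that it is not.

```python
from itertools import product


def metric_violations(objects, dist, tol=1e-9):
    """Return a list of axiom violations found among the given objects."""
    problems = []
    for x, y in product(objects, repeat=2):
        dxy = dist(x, y)
        if x == y and abs(dxy) > tol:
            problems.append(("identity", x, y))         # d(x,x) must be 0
        if x != y and dxy <= 0:
            problems.append(("non-negativity", x, y))   # d(x,y) must be > 0
        if abs(dxy - dist(y, x)) > tol:
            problems.append(("symmetry", x, y))
    for x, y, z in product(objects, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z) - tol:
            problems.append(("triangle", x, y, z))
    return problems


sample = [0, 1, 2, 5]
absolute = metric_violations(sample, lambda a, b: abs(a - b))
squared = metric_violations(sample, lambda a, b: (a - b) ** 2)
```

Absolute difference passes on this sample, while squared difference fails the triangle inequality (for example, 1 + 1 < 4 for the objects 0, 1, 2).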

27
Appendix II - Sequence
  • 2 RNA sequences from a DNA strand.

28
Appendix III - PAM
  • Percent Accepted Mutation (PAM)
  • A PAM(x) substitution matrix is a look-up table
    in which scores for each amino acid substitution
    have been calculated based on the frequency of
    that substitution in closely related proteins
    that have experienced a certain amount (x) of
    evolutionary divergence (e.g., PAM250).
  • A unit to quantify the amount of evolutionary
    change in a protein sequence. Based on log-odds
    probability.

29
Appendix IV PAM250
  • At this evolutionary distance (250 substitutions
    per hundred residues)

30
Appendix V - BLOSUM
  • Blocks Substitution Matrix (BLOSUM)
  • A substitution matrix in which scores for each
    position are derived from observations of the
    frequencies of substitutions in blocks of local
    alignments in related proteins (e.g., BLOSUM62)
  • A unit to quantify the amount of evolutionary
    change in a protein sequence. Based on log-odds
    probability.

31
Appendix VI BLOSUM62
  • The BLOSUM62 matrix is calculated from protein
    blocks such that if two sequences are more than
    62% identical

32
Appendix VII mPAM250
  • Expected time based on 250 PAM distance as a
    unit.

33
Appendix VIII mPAM Validation
  • Based on benchmark query set by Smith-Waterman.
  • Graph shows ROC50 values (Receiver Operating
    Characteristics)
  • Negative x-axis values indicate that mPAM
    performs better
  • Difference between ROC50 values using mPAM and
    PAM250

34
Appendix IX - Distance measure
  • Global Sequence Alignment
  • Given an alphabet A and a similarity substitution
    matrix M corresponding to an evolutionary model,
    the global sequence alignment of two sequences s
    and t is the pair of strings a and b, obtained
    from s and t respectively by inserting spaces
    either into or at the ends of s and t, whose
    score computed using M is maximal (similarity
    measure) or minimal (distance measure) over all
    such pairs of strings obtained from s and t.

35
Appendix X Homology Search
  • Build Index Structure (Offline)
  • Divide the database sequences into a set of
    overlapping substrings of length q (q-grams)
    with step size 1.
  • Build a metric-space index D based on global
    alignment to support constant-time lookup of
    exact matches.
  • Homology Search Query (Online)
  • Divide the query sequence W into overlapping
    substrings of length q with step size 1:
    F = { w_i | i = 0, ..., |W| - q }.
  • For each w_i in F, run the range query Q(w_i, r)
    against database D to find the set of matching
    q-grams R_i = { f_{i,j} | d(f_{i,j}, w_i) < r,
    f_{i,j} ∈ D, w_i ∈ F }, where d is the distance
    function.
  • Use a greedy heuristic algorithm to extend and
    chain all fragments in R_0 ∪ R_1 ∪ ... ∪ R_{|W|-q}
    to deduce the result of the homology search based
    on local alignment for query W.
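
The slide does not spell out the greedy heuristic used for extending and chaining. As a stand-in only, the sketch below merges q-gram hits that lie on the same diagonal (same difference between database offset and query offset) and overlap or touch, which is one simple way to grow hits into longer ungapped segments. The function name `chain_hits` and the hit format `(query_offset, db_offset)` are assumptions.

```python
from collections import defaultdict


def chain_hits(hits, q):
    """Merge q-gram hits on the same diagonal into longer segments.

    hits: iterable of (query_offset, db_offset) pairs for one database sequence.
    Returns (query_start, db_start, length) triples.
    """
    by_diagonal = defaultdict(list)
    for q_off, d_off in hits:
        by_diagonal[d_off - q_off].append(q_off)

    segments = []
    for diagonal, offsets in sorted(by_diagonal.items()):
        offsets = sorted(set(offsets))
        start = prev = offsets[0]
        for off in offsets[1:]:
            if off > prev + q:  # gap: this hit no longer overlaps or touches
                segments.append((start, start + diagonal, prev + q - start))
                start = off
            prev = off
        segments.append((start, start + diagonal, prev + q - start))
    return segments


segments = chain_hits([(0, 10), (1, 11), (2, 12), (7, 17), (3, 1)], q=3)
```

In this example the three consecutive hits on one diagonal collapse into a single segment of length 5, while the isolated hits remain segments of length q.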

36
Appendix XI - GSA