The MoBIoS Project: Molecular Biological Information System

Slides: 54
Provided by: danielm47
1
The MoBIoS Project: Molecular Biological
Information System
  • Daniel P. Miranker
  • Dept. of Computer Sciences
  • Center for Computational Biology and
    Bioinformatics
  • University of Texas

Weijia Xu, Rui Mao, Will Briggs, Smriti
Ramakrishnan, Shu Wang, Shulin Ni, Kai Yan, Ving
Lei
2
Compared to Business Databases, Biological
Databases
  • are not that big
  • GenBank FTP mirror download, << 1 terabyte
  • CMS spectrometer at CERN, > 1 petabyte/year
  • but,
  • data management in biology is a big problem
  • There must be another problem

3
You Can't Sort
  • Sequences
  • DNA, RNA, Protein databases
  • Mass Spectra
  • proteomics
  • Small molecules and protein structure
  • Protein interaction
  • Rational drug design
  • Pathways (graphs)
  • Phylogenies (graphs, trees in particular)

4
In the Life Sciences, Database Management Systems Are
Souped-Up File Systems
  • Primary data is stored in text or blob fields
  • Annotations may be relational
  • Data retrieval
  • Filter DB, sequential dump, O(n), to utilities
  • E.g., BLAST

5
Scope: To Find Common Ground, Both Biology and
DBMS Have to Move
[Figure: DBMS and Biological Information System, with the Metric-Space Database as the Common Ground]
6
Metric Space is
  • a pair, M = (D, d),
  • where
  • D is a set of points
  • d is a metric distance function with the
    following properties
  • d(x, y) = d(y, x)
    (symmetry)
  • d(x, y) ≥ 0, d(x, x) = 0
    (non-negativity)
  • d(x, z) ≤ d(x, y) + d(y, z)
    (triangle inequality)

[Figure: points x, y, z]
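The three properties can be checked mechanically on a finite sample of points. The sketch below is only an illustration: the helper names `hamming` and `is_metric` are hypothetical and not part of MoBIoS, and a brute-force check on a sample can refute a candidate metric but cannot prove the properties for a whole space.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def is_metric(points, d) -> bool:
    """Brute-force check of the metric properties on a finite set of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):              # symmetry
            return False
        if d(x, y) < 0 or d(x, x) != 0:     # non-negativity
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):     # triangle inequality
            return False
    return True


# All 64 DNA 3-mers under Hamming distance.
kmers = ["".join(p) for p in product("ACGT", repeat=3)]
print(is_metric(kmers, hamming))

# Squared difference is symmetric and non-negative but breaks the triangle inequality.
print(is_metric([0, 1, 2], lambda x, y: (x - y) ** 2))
```

Hamming distance passes on this sample; the squared difference fails because d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2.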
7
Definition - By Analogy
  • A Spatial Database Management System
  • Extend relational DBMS
  • Special indexes for 2D and 3D data: k-d trees and
    R-trees
  • New data types
  • Geographic information systems
  • Topographic maps
  • Buildings and the like
  • A Metric-Space Database Management System
  • Extend Relational DBMS
  • Special indexes for metric-spaces
  • New data types
  • Biological information system
  • Life science data types

8
Results to date
  • Developed and Validated Metrics for
  • protein sequence homology
  • protein mass-spectra
  • volumetric (3-d) models of protein electrostatics
  • Feasibility prototype,
  • validated scalable biological data retrieval

9
A Broad Range of Problems
  • General purpose metric-space index
  • Use of metrics as similarity functions on
    biological data
  • Integration of biological sequences into the SQL
    programming model

10
Metric Space Indexing
Vantage point algorithm [Burkhard & Keller 73]
Choose a point, VP,
and a radius, R
11
Range Search
  • Query q with search radius δ
  • If d(q, VP) > R + δ
  • then
  • search only outside the sphere
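The pruning rule follows from the triangle inequality, and a minimal vantage-point tree makes it concrete. This is a sketch under assumptions, not the MoBIoS index: the names `Node`, `build`, and `range_search` are hypothetical, the vantage point is picked at random, and R is taken as the median distance to the remaining points.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    vp: object                   # vantage point
    radius: float                # R: median distance from vp to the other points
    inside: Optional["Node"]     # subtree of points with d(vp, x) <= R
    outside: Optional["Node"]    # subtree of points with d(vp, x) > R


def build(points: List, d: Callable, rng: random.Random) -> Optional[Node]:
    """Recursively split the points around a randomly chosen vantage point."""
    if not points:
        return None
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return Node(vp, 0.0, None, None)
    dists = sorted(d(vp, p) for p in points)
    radius = dists[len(dists) // 2]
    inside = [p for p in points if d(vp, p) <= radius]
    outside = [p for p in points if d(vp, p) > radius]
    return Node(vp, radius, build(inside, d, rng), build(outside, d, rng))


def range_search(node, q, delta, d, out):
    """Collect every stored point x with d(q, x) <= delta."""
    if node is None:
        return out
    dist = d(q, node.vp)
    if dist <= delta:
        out.append(node.vp)
    if dist - delta <= node.radius:    # query ball can reach inside the sphere
        range_search(node.inside, q, delta, d, out)
    if dist + delta > node.radius:     # query ball can reach outside the sphere
        range_search(node.outside, q, delta, d, out)
    return out


d = lambda a, b: abs(a - b)
tree = build([2, 9, 14, 21, 30, 42, 57], d, random.Random(0))
print(sorted(range_search(tree, q=20, delta=6, d=d, out=[])))
```

When d(q, VP) > R + δ the first recursive call is skipped, which is the "search only outside the sphere" case.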

12
Biological Models are Usually Based on Similarity
  • Similarity
  • Biologists like scoring functions that reward
    each similar feature with a positive number
  • Intuitive
  • Distance
  • More similar → smaller numbers
  • Identical → 0

13
1) Do Metric Models of Similarity Capture
Biology?
  • Metrics are a subset of possible mathematical
    models

14
Sequence Similarity
  • Sequence similarity based on weighted edit
    distance
  • [Sellers 74]
  • Accepted weight matrices, PAM and BLOSUM, are not
    metric
  • Log-odds matrices: negative values
  • Defy simple algebraic normalization
  • [Taylor & Jones 93], [Linial et al. 97], [Halperin et al. 04]

15
PAM vs. mPAM: t = 1/f
[Xu & Miranker 04]
  • Using original substitution counts
  • PAM: frequency of substitution
  • S(a, b | t) = log( P(b | a, t) / q_b )
  • mPAM: expected time between substitutions
  • D(a, b) = 1 / log(1 […] Σ_x P(a, x) P(b, x))
16
Metrics for Biological Similarity
  • Biosequences
  • mPAM
  • mBLOSSUM (not yet published)
  • Hamming distance
  • Protein Mass and Protein Fragment Mass Spectra
  • derived an effective metric from cosine distance
  • Volumetric models of chemical fields of proteins
  • (with C. Bajaj)

17
Matching Electrostatic Shape of Molecules
18
Metric-Space Indexing (search)
  • Well studied in main-memory
  • By no means a closed problem
  • In databases (external/disk based methods)
  • Embryonic
  • Success only in specific domains
  • Biological sequences, nucleotide and peptide
  • Databases of analytically determined mass-spectra
  • Volumetric models of proteins (collaboration in
    progress with C. Bajaj)

19
MoBIoS System Overview
[Figure: layered architecture, top to bottom.
 Applications, grouped as Active Application Efforts and Other Opportunities: Mass-Spec Protein Identification; Combi-Chem Ligand Library Management; Comparative and Phylo Genomics; Homology Search; Docking.
 MoBIoS SQL (mSQL): Query Engine, Mining Engine.
 MoBIoS Java Interface (MJI).
 Metric-Space Based Storage Manager.]
20
SQL as a Bioinformatic Programming Model
  • Simple retrieval of similar objects.

21
Protein identification by database look-up
  • 1. Build a reference database
  • Given a complete genome → putative proteome
  • Given a proteome → set of ideal mass-spectra
    (database)
  • 2. Wet lab work
  • In a laboratory, experimentally determine the
    mass-spectra of an unknown protein
  • 3. Analyze the data
  • Find the closest match with the database
  • (a very noisy proposition)

22
Analytic Mass-Spectra
  • At each resolvable mass, either a peptide is
    present or not
  • Vector-space model: binary vector, one bit for
    each resolvable mass
  • Similarity: shared peaks count = inner product
  • (0100101) · (0111100) = 2

23
Cosine Distance
  • D_rs = 1 − (x_r · x_s) / ((x_r · x_r)^(1/2) (x_s · x_s)^(1/2))
  • Document retrieval uses
  • the vector-space model
  • Well known
  • As similarity: inner product
  • As distance: cosine distance
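As a worked instance of the two formulas, the sketch below computes the shared peaks count and the cosine distance for the binary spectra from the Analytic Mass-Spectra slide. The function names are illustrative only, and spectra are assumed to be equal-length 0/1 lists.

```python
import math


def shared_peaks(r, s):
    """Inner product of two binary spectra = number of peaks present in both."""
    return sum(a * b for a, b in zip(r, s))


def cosine_distance(r, s):
    """1 minus the cosine of the angle between the two spectrum vectors."""
    norm = math.sqrt(shared_peaks(r, r)) * math.sqrt(shared_peaks(s, s))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - shared_peaks(r, s) / norm


r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]
print(shared_peaks(r, s))       # the slide's example: 2 shared peaks
print(cosine_distance(r, s))    # 1 - 2 / sqrt(3 * 4)
```

Here r has 3 peaks and s has 4, so the distance is 1 − 2/√12, roughly 0.42.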

24
Given: the spectra of an unknown, S, and a
reference database, Spectra
  • SELECT Spectra.accesionNumber
  • FROM Spectra
  • WHERE
  • Cosine_Distance(S,Spectra,tolerance)

25
Protein Identification by Database Lookup of
Mass Spectra
26
Sequences Pose Additional Problems
  • Sequences: long units (identity for storage and
    retrieval)
  • Genes
  • Chromosomes
  • Analysis comprises comparing small substrings

27
Local-Alignment (Homology Search)
  • For each pair of arguments
  • Find matching subsequences

Sequence 1
Sequence 2
28
Solution: Sequence View
  • New view type
  • Breaks sequences into k-mers

create SEQUENCEVIEW rice_sview as
  SELECT CREATE FRAGMENTS (, 3)   // k = 3
  FROM
  WHERE
  USING HAMMING-DISTANCE
[Dan: instead of miRNA later, maybe a sequence view of only exons.]
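What a sequence view holds can be sketched in a few lines. This assumes the fragments are overlapping k-mers that carry their offset in the parent sequence; the slides do not say whether fragments overlap, and `fragments` and `hamming` are illustrative names, not mSQL operators.

```python
def fragments(seq: str, k: int):
    """Yield (offset, k-mer) for every overlapping window of length k."""
    for offset in range(len(seq) - k + 1):
        yield offset, seq[offset:offset + k]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


view = list(fragments("ACGTAC", 3))
print(view)

# Range query over the view: fragments within Hamming distance 1 of "ACT".
near = [(off, kmer) for off, kmer in view if hamming(kmer, "ACT") <= 1]
print(near)
```

The linear scan in the last step is what a metric-space index over the view is meant to replace.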
29
Materialize as an Index
D(AAA) 2


30
mSQL Sequence Operators
  • CreateFragment
  • Merge and Groupby
  • Merge fragments back into longer subsequences
  • Algebraic constraints
  • Consider relative offsets when applied to pairs
    of sequences
  • Parameterize this for additional biological
    semantics (gaps)
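One way to read the merge step: k-mer matches between two sequences that share the same relative offset (the same diagonal) and overlap or abut are fused into one longer matching subsequence. The sketch below handles exact matches only and ignores gaps; `exact_kmer_matches` and `merge_matches` are hypothetical helpers, not the mSQL operators themselves.

```python
from collections import defaultdict


def exact_kmer_matches(s1: str, s2: str, k: int):
    """Return (offset1, offset2) for every k-mer of s1 that also occurs in s2."""
    index = defaultdict(list)
    for o2 in range(len(s2) - k + 1):
        index[s2[o2:o2 + k]].append(o2)
    return [(o1, o2)
            for o1 in range(len(s1) - k + 1)
            for o2 in index.get(s1[o1:o1 + k], [])]


def merge_matches(pairs, k: int):
    """Fuse k-mer matches on the same diagonal into (start1, start2, length) runs."""
    by_diagonal = defaultdict(list)
    for o1, o2 in pairs:
        by_diagonal[o2 - o1].append(o1)      # group by relative offset
    merged = []
    for diag, offsets in by_diagonal.items():
        offsets.sort()
        start = prev = offsets[0]
        for o in offsets[1:]:
            if o - prev > k:                 # not overlapping or abutting: close the run
                merged.append((start, start + diag, prev + k - start))
                start = o
            prev = o
        merged.append((start, start + diag, prev + k - start))
    return sorted(merged)


s1, s2 = "TTACGTACGG", "CCACGTACAA"
runs = merge_matches(exact_kmer_matches(s1, s2, 3), 3)
print(runs)
```

In this example the four 3-mer matches on diagonal 0 merge into one run of length 6, the shared substring ACGTAC.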

31
Local-Alignment (Homology Search)
  • For each pair of arguments
  • Find matching subsequences

Sequence 1
Sequence 2
32
[Dan: insert example of n log n homology on k-mers
with merge]
33
Conserved (PCR) Primer Pair Discovery (with R.
Linder)
  • Goal
  • To find evidence of horizontal gene transfer
  • Method
  • Sample and sequence many orthologous regions in
    many plants
  • Build a phylogenetic tree for each region, look
    for inconsistent topology

34
Pattern Definition of a Conserved Primer Pair
  • Compare Arabidopsis genome × Rice genome
  • Locate nucleotide patterns of form
  • PCR primer pair candidate
  • Matching: < 5 mismatches
  • Eliminate non-unique primer candidates
  • Usual implementations O(n²), n ≈ 10⁹

[Figure: a primer pair candidate. In each genome (Rice, Arab.), two regions of 18 matching nucleotides flank a gap 400 to 3000 nucleotides long.]
35
Query Plan
  • Arab. genome, O(n); Rice genome, O(m)
  • Offline: build sequence view, O(n log n)
  • Compare: indexed nested loop, O(m log n)
  • Eliminate duplicates
  • Eliminate low-complexity primers (LZ compression)
  • Merge overlapping primers
  • 10,000 conserved primer pair candidates
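The low-complexity filter can be approximated with any LZ-family compressor: repetitive primers compress much better than diverse ones. The sketch below uses Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in. The slides do not say which compressor or cutoff MoBIoS used, so `is_low_complexity` and the threshold `max_ratio` are assumptions for illustration.

```python
import zlib


def compressed_size(seq: str) -> int:
    """Size in bytes of the zlib-compressed sequence (includes fixed overhead)."""
    return len(zlib.compress(seq.encode("ascii")))


def is_low_complexity(seq: str, max_ratio: float = 1.0) -> bool:
    """Flag a sequence whose compressed/original size ratio falls below max_ratio.

    For 18-mers the fixed zlib overhead dominates, so the threshold is an
    assumption that would need tuning on real data.
    """
    if not seq:
        raise ValueError("empty sequence")
    return compressed_size(seq) / len(seq) < max_ratio


candidates = ["A" * 18, "AC" * 9, "ACGTTGCAAGCTTAGCCA"]
kept = [p for p in candidates if not is_low_complexity(p)]
print(kept)
```

On such short strings only the relative ordering of compressed sizes is meaningful; a production filter would more likely compress longer windows or use an entropy measure.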

36
mSQL to locate candidate conserved primer pairs
// Create sequenceview of 18-mers of Rice and Arab.
SELECT merge(R1.fragment, A1.fragment)
FROM
  Rice_sview R1, Rice_sview R2, Arab_sview A1, Arab_sview A2
WHERE
  distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) < 1.0 AND
  distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) < 1.0 AND
  (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) > 400 AND
  (FRAGOFFSET(R2.fragment) - FRAGOFFSET(R1.fragment)) < 3000 AND
  (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) > 400 AND
  (FRAGOFFSET(A2.fragment) - FRAGOFFSET(A1.fragment)) < 3000
GROUP BY R1.fragment, A1.fragment

37
Detailed Analysis, Wet Lab Validation
  • Found 13,418 possible primer pairs from MoBIoS,
  • < 8 processor days including database load
  • 100 best candidates BLASTed for matches in
    GenBank
  • 15 matched other plant genes and the primers
  • At least 2 of 15 showed potential after PCR
    amplification against Helianthus and
    Phalaenopsis.

38
mSQL Queries Developed for Other Sequence
Analysis Problems [Miranker et al., Data Eng.
Bull., Sept. 04]
  • Micro RNA RNAi predictions
  • Rosetta analysis to determine protein function
  • Electronic-PCR
  • Sequence Homology

39
Query Engine Can Consider All Data
  • From the miRNA problem

Select merge(G7.dna_seq), merge(M7.seq) mseq, gene
From M7, G7, Features  // consider both 7-mers and their annotations
Where
  distance(hamming, G7.dna_seq, M7.seq) <= 0
  and G7.SID = Features.SID
  and Features.name = 'miRNA'
  and G7.dna_seq.offset > Features.first
  and G7.dna_seq.offset + G7.dna_seq.length < Features.last
having mseq.offset < 1 and mseq.offset + mseq.length > 7

40
Status
  • All components of the system are integrated and
    functional
  • Not a product
  • Scalability of index on nearest-neighbor search
  • Range search is a better fit for the relational
    model
  • Actively seeking collaborations in anticipation
    of distributing the software.

41
Protein Identification by Database Lookup of
Mass Spectra
42
Matching Electrostatic Shape of Molecules
43
Status
  • Started with McKoi
  • A Java open source object-relational DBMS
  • (Think of Postgres written in Java)
  • Added
  • Biological data types
  • Metric-space index
  • Extending SQL engine (in finishing stages)

44
Other Results
  • Protein identification by database lookup
  • match against a database of theoretical
    mass-spectrometer signatures.
  • Protein similarity by 3-D electrostatic shape

45
Three classes of algorithm
  • Vantage-Point
  • Generalized Hyperplane
  • Radius-based Methods

46
Hyper-planes [Uhlmann 91]
  • If d(x, h1) < d(x, h2) then x is assigned to h1

[Figure: a point x and the two pivots h1 and h2]
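The assignment rule, plus the pruning it permits at query time, can be sketched as follows. The pruning test (skip a side when the query is more than 2δ closer to the other pivot) is a standard consequence of the triangle inequality rather than something stated on the slide, and `partition` and `range_search` are hypothetical names for a single level of such an index.

```python
def partition(points, h1, h2, d):
    """Assign each point to the pivot it is closer to (ties go to h2)."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2


def range_search(side1, side2, h1, h2, q, delta, d):
    """Return points within delta of q, skipping a side when it cannot hold a hit."""
    d1, d2 = d(q, h1), d(q, h2)
    candidates = []
    if d1 - d2 <= 2 * delta:     # the h1 side may still contain a result
        candidates += side1
    if d2 - d1 <= 2 * delta:     # the h2 side may still contain a result
        candidates += side2
    return [x for x in candidates if d(q, x) <= delta]


d = lambda a, b: abs(a - b)
points = list(range(100))
side1, side2 = partition(points, 20, 80, d)
print(range_search(side1, side2, 20, 80, q=10, delta=5, d=d))
```

For q = 10 and δ = 5 the h2 side is never scanned, since q is 60 units closer to h1 than to h2.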
47
Develop a Hierarchical Clustering
[Figure: clusters labeled A through F]
  • Hierarchy of Bounding spheres, (center, radius),
  • Bounding spheres may overlap
  • Inspired by R-trees

48
Multi-vantage point method
  • Consider d(VPi, x) a projection onto an axis
  • Looks like a k-d tree
  • Choose number k d
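The projection idea can be illustrated without building the tree itself. Each object x is mapped to the coordinate vector (d(VP1, x), ..., d(VPk, x)); by the triangle inequality the largest coordinate difference between q and x is a lower bound on d(q, x), so many candidates can be discarded without computing the real distance. The sketch below is a linear scan over precomputed coordinates, not the k-d-tree-like structure the slide refers to, and all names in it are assumptions.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def project(x, vps, d):
    """Coordinates of x: its distance to each vantage point."""
    return tuple(d(vp, x) for vp in vps)


def lower_bound(qc, xc):
    """max_i |d(q, VPi) - d(x, VPi)| never exceeds d(q, x)."""
    return max(abs(a - b) for a, b in zip(qc, xc))


def range_search(points, coords, vps, q, delta, d):
    """Range query that filters on projected coordinates before calling d."""
    qc = project(q, vps, d)
    hits = []
    for x, xc in zip(points, coords):
        if lower_bound(qc, xc) > delta:    # cannot be within delta of q
            continue
        if d(q, x) <= delta:
            hits.append(x)
    return hits


points = ["".join(p) for p in product("ACGT", repeat=4)]
vps = ["AAAA", "CCCC"]
coords = [project(x, vps, hamming) for x in points]   # computed once, offline
print(range_search(points, coords, vps, "ACGT", 1, hamming))
```

With an expensive distance function such as a weighted edit distance, the cheap coordinate comparison is where the savings come from.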

49
Hyper-planes [Uhlmann 91]
  • If d(x, h1) < d(x, h2) then x is assigned to h1

[Figure: a point x and the two pivots h1 and h2]
50
Status
  • Implemented an algorithm from each class.
  • Examined performance on
  • Synthetic Euclidean vectors, typical of other work
  • Peptide k-mers, mPAM distance
  • Mass spectra, fuzzy cosine distance
  • Image database

51
Comparison of three methods
  • Vector data not typical of other data sets
  • Multivantage point wins
  • but not yet exciting

52
Our First Result: mPAM [Xu & Miranker 04]
  • Dayhoff et al.'s PAM derivation, '74
  • Took a set of closely related protein sequences
  • Developed a phylogenetic tree based on parsimony
  • Counted substitutions to transform one sequence
    to another

53
mPAM250 vs. PAM250 on NCBI benchmark
  • 103 queries over yeast proteome
  • Hand-curated correct answers.