1
Exploiting Regularities in Data for
Bioinformatics
Laxmi Parida IBM T J Watson Research Center
2
Overview
  • Background
  • Regularities/Pattern Discovery
  • on sequences
  • ID patterns
  • permutation patterns
  • multi-dimensional patterns
  • network motifs
  • Open Problems

3
DNA
4
Base pairs
5
Gene
6
Gene Expression
7
Codons
8
Translation
9
Protein
10
DNA to Protein
DNA codes for protein sequence
FIBRINOGEN GAMMA CHAIN QIHDITGKDCQDIANKGAKQSGLYFIK
PLKANQQFLVYCEIDG SGNGWTVFQKRLDGSVDFKKNWIQYKEGFGHLS
PTGTTEFWLG NEKIHLISTQSAIPYALRVELEDWNGRTSTADYAMFKVG
PEAD KYRLTYAYFAGGDAGDAFDGFDFGDDPSDKFFTSHNGMQFSTW D
NDNDKFEGNCAEQDGSGWWMNKCHAGHLNGVYYQGGTYSKAS TPNGYDN
GIIWATWKTRWYSMKKTTMKIIPFNRL
There are 20 natural amino acids with different
physicochemical properties, such as shape,
volume, flexibility, hydrophobicity, hydrophilicity,
and charge
Amino acid
"Beads on a string"
Three-dimensional structure
Function
Structural: keratin (skin, hair, nail), collagen (tendon), fibrin (clot)
Motive: actomyosin (muscle)
Transport: hemoglobin (blood)
11
Regularities/Patterns/Motifs
  • DNA
  • protein
  • others

12
Similar sequence, similar structure
13
Patterns (serine proteases)
14
Pattern Discovery
Serine Proteases Trypsin subfamily PS00134
15
Dissimilar Sequences, Similar Structures
ARTVKLLLLGAGESGKSTIVKQMKIIHQDGXTGIIETQFSFKDLNFRMFD
VGGQRSERKK WIHCFEGVTCIIFIAALSAYDMVLVEDDEVNRMHESLHL
FNSICNHRYFATTSIVLFLNK KDVFSEKIKKAHLSICFPDYNGPNTYED
AGNYIKVQFLELNMRRDVKEIYSHMTCATDTQ NVKFVFDAVTDIII
PDB 1dts, SCOP d1dts__ (3.032.001): ATP-dependent
carboxylase, dethiobiotin synthetase
PDB 1tag, SCOP d1tag_2 (3.032.001), residues
27-56, 178-340: GTP-binding protein
SKRYFVTGTDTEVGKTVASCALLQAAKAAGYRTAGYKPVASGSEKTPEGL
RNSDALALQR NSSLQLDYATVNPYTFAEPTSPHIISAQEGRPIESLVMS
AGLRALEQQADWVLVEGAGGW FTPLSDTFTFADWVTQEQLPVILVVGVK
LGCINHAMLTAQVIQHAGLTLAGWVANDVTPP GKRHAEYMTTLTRMIPA
PLLGEIPWLAEAATGKYINLALL
Sequence similarity: 18%
Fold: P-loop containing nucleotide triphosphate hydrolase
16
Sequence Pattern Discovery
  • Define the phenomenon
  • non-unique (occurs at least k > 1 times)
  • define occurrence
  • Example: s = abcdabcd; abcd is a pattern that
    "occurs" at positions 1 and 5 on s
  • Discover the phenomenon
  • Let D be the discovered set on input s; then
  • if p ∈ D, then p is a pattern
  • if p is a pattern, then p ∈ D
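The definition of "occurrence" and the k-times requirement can be sketched in a few lines of Python. This is only an illustration for plain substrings; the function names `occurrences` and `is_pattern` are made up here and are not from the talk.

```python
def occurrences(s, p):
    """Return the 1-based start positions of p in s (overlaps included)."""
    return [i + 1 for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


def is_pattern(s, p, k=2):
    """p counts as a pattern on s if it occurs at least k times."""
    return len(occurrences(s, p)) >= k


print(occurrences("abcdabcd", "abcd"))  # positions of abcd on s
print(is_pattern("abcdabcd", "da"))     # da occurs only once
```

A discovered set D is correct when `is_pattern(s, p)` holds for exactly the members of D.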

17
Modeling Patterns
o Probabilistic patterns
(assign a probability to the occurrence)
o Deterministic patterns
(yes or no occurrence)
Brazma, Jonassen, Eidhammer, Gilbert: Approaches
to the automatic discovery of patterns in
biosequences, JCB 1998.
18
Substring Patterns
  • Example: s = abcdabcd
  • ab, abc, abcd, bc, bcd, ... are patterns
  • maximality in length (only abcd is maximal, at
    positions 1 and 5)
  • Number of maximal patterns is linear in the input size
  • A Note on Patterns with No Wild Cards, Laxmi
    Parida
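A brute-force sketch of maximality in length for substring patterns follows. It assumes that a pattern is non-maximal when it can be extended by one character on either side without losing an occurrence; that reading, and the helper names, are assumptions made for illustration. The enumeration is quadratic in the number of substrings and is not the linear-size construction the slide refers to.

```python
from collections import defaultdict


def repeated_substrings(s, k=2):
    """Map every substring occurring at least k times to its 1-based positions."""
    locs = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            locs[s[i:j]].append(i + 1)
    return {p: pos for p, pos in locs.items() if len(pos) >= k}


def maximal_in_length(s, k=2):
    """Keep only patterns that cannot be extended without losing occurrences."""
    pats = repeated_substrings(s, k)
    alphabet = set(s)
    maximal = {}
    for p, pos in pats.items():
        # p is extendable if some one-character extension occurs just as often
        extendable = any(
            len(pats.get(c + p, [])) == len(pos)
            or len(pats.get(p + c, [])) == len(pos)
            for c in alphabet
        )
        if not extendable:
            maximal[p] = pos
    return maximal


print(maximal_in_length("abcdabcd"))
```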

19
Fixed-Gap Substring Patterns
  • Example: s = abcdabxd
  • ab, b.d, ab.d, a..d are patterns
  • maximality
  • in length (ab and b.d are not maximal)
  • in composition (a..d is not maximal)
  • Number of maximal patterns?

20
Maximality based on locations
Example: s = abcdabxdab
Maximal patterns: ab.d at pos 1 and 5; ab at pos 1, 5 and 9
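Location lists for fixed-gap patterns can be computed with a direct scan, treating "." as a wildcard that matches any single character. The function name `occurrences` is illustrative only.

```python
def occurrences(s, m):
    """1-based positions where fixed-gap pattern m ('.' = any character) matches s."""
    hits = []
    for i in range(len(s) - len(m) + 1):
        if all(c == "." or c == s[i + j] for j, c in enumerate(m)):
            hits.append(i + 1)
    return hits


s = "abcdabxdab"
print(occurrences(s, "ab.d"))  # location list of ab.d
print(occurrences(s, "ab"))    # location list of ab
```

Both patterns are kept here because their location lists differ: ab has an occurrence that ab.d does not.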
21
Pursuit of the Preposterous
  • s = aaaaaXaaaaa
  • aa, aaa, a.a, aaaa, a.aa, aa.a, aaaaa, a.aaa,
    aa.aa, aaa.a, a..aaaa, a.a.aaa, a.aa.aa, a.aaa.a,
    aa..aaa, aa.a.aa, aa.aa.a, aaaa.aa, aaa.a.a,
    aaaa..a, aa..aaaa, aa.aa.aa, aaa.a.aa, aaaa..aa,
    aaa..aaa, aa.a.aaa, aaa.a.aaa, aaaa..aaa,
    aaa..aaaa, aaaa..aaaa
  • Number is exponential in the size of input!

22
Deal with the explosion
  • A subset (basis) of all the maximal patterns that
  • "defines" all the maximal patterns
  • is manageable in size

23
Irredundancy
  • maximal patterns
  • s = abcdxaecdxxxabfd
  • ab.d at pos 1, 13
  • a.cd at pos 1, 6
  • a..d at pos 1, 6, 13 (redundant)
  • s = abcdxaecdxxxabfdayzd
  • ab.d at pos 1, 13
  • a.cd at pos 1, 6
  • a..d at pos 1, 6, 13, 17 (irredundant)

24
Irredundancy Basis ?
  • A redundant motif can be obtained from the basis, along with its
    location list
  • a..d is obtained from a.cd and ab.d
  • {1, 6, 13} = {1, 6} ∪ {1, 13}
  • Size of the irredundant motif basis is O(n) for k = 2
  • Pattern Discovery on character sets and
    real-valued data: linear bound on irredundant
    motifs and polynomial time algorithms, L Parida
    et al, Proceedings of the eleventh ACM-SIAM
    Symposium on Discrete Algorithms (SODA 2000),
    January 2000.

25
Redundancy
Redundant: a..d
Irredundant (basis): a.cd, ab.d
Given maximal patterns m, m1, m2, ..., mp, with L_m the
location list of m, such that
L_m = (L_m1 + d1) ∪ (L_m2 + d2) ∪ ... ∪ (L_mp + dp),
then m is redundant.
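The redundancy test can be sketched as a comparison of location lists. Two assumptions are made here: all offsets d_i are taken to be 0 (every covering motif starts at the same position as m), and the example string is spelled with three x's before the third occurrence so that positions 1, 6, 13 hold. The function names are hypothetical.

```python
def occurrences(s, m):
    """1-based positions where fixed-gap pattern m ('.' = any character) matches s."""
    return [
        i + 1
        for i in range(len(s) - len(m) + 1)
        if all(c == "." or c == s[i + j] for j, c in enumerate(m))
    ]


def is_redundant(s, m, others):
    """m is redundant if its location list is the union of the others' lists.

    Assumes all offsets d_i are 0, which is a simplification.
    """
    union = set()
    for q in others:
        union |= set(occurrences(s, q))
    return set(occurrences(s, m)) == union


s1 = "abcdxaecdxxxabfd"   # spelling assumed, see lead-in
s2 = s1 + "ayzd"
print(is_redundant(s1, "a..d", ["ab.d", "a.cd"]))
print(is_redundant(s2, "a..d", ["ab.d", "a.cd"]))
```

On the longer string, a..d gains an occurrence that neither ab.d nor a.cd covers, so it is no longer implied by them.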
26
How does redundancy help?
  • Economic description
  • use in (lossy) data compression
  • Compression and the Wheel of Fortune, A
    Apostolico, L Parida, DCC 2003.
  • Gapped Codes and their Induced Grammars:
    Bridging Lossy and Lossless Compression by Motif
    Pattern Discovery, A Apostolico, M Comin, L
    Parida, under review.
  • smaller domain descriptions in biology
  • Efficient algorithm to detect all patterns
  • An Output-sensitive Flexible Pattern Discovery
    Algorithm, L Parida, I Rigoutsos, D Platt,
    Combinatorial Pattern Matching (CPM 2001),
    LNCS vol 2089, 2001.

27
Algorithm
  • Step 1: Detect the core (irredundant) motifs, O(n^3 log n)
  • Step 2: Compute all redundant patterns from the
    core (output-sensitive, O(N log n))

28
Step 1 Detecting the core motifs
  • Input: s, d, k
  • For all x in the alphabet, construct L_x = {i |
    s_i = x}, with |L_x| >= k: O(n)
  • For j = 1 to d, for all y, construct subsets of
    L_x of the form {i | i ∈ L_x,
    i + j ∈ L_y}: O(dn)
  • Prune the sets for redundancy in L_x:
    O(n^3)
  • Prune the sets for "maximality";
    the number of sets is < n
  • For each L, construct subsets using set L': O(n^3)
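The first two bullets of Step 1 (per-character location lists, then two-character subsets at each offset) can be sketched as below. This is only a reading of the slide: the offset convention (y sits j positions after x) and all names are assumptions, and the pruning steps are not implemented.

```python
from collections import defaultdict


def char_locations(s, k):
    """L_x = 1-based positions of character x, kept only if it occurs >= k times."""
    loc = defaultdict(list)
    for i, ch in enumerate(s, start=1):
        loc[ch].append(i)
    return {x: pos for x, pos in loc.items() if len(pos) >= k}


def pair_locations(s, d, k):
    """For each (x, j, y), keep positions i in L_x with i + j in L_y, if >= k remain."""
    L = char_locations(s, k)
    cells = {}
    for x, Lx in L.items():
        for j in range(1, d + 1):
            for y, Ly in L.items():
                Ly_set = set(Ly)
                sub = [i for i in Lx if i + j in Ly_set]
                if len(sub) >= k:
                    cells[(x, j, y)] = sub
    return cells


print(pair_locations("abcdabxd", d=3, k=2))
```

With this convention, the key `('a', 3, 'd')` corresponds to the pattern a..d and `('b', 2, 'd')` to b.d.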

29
Algorithm Step 2
Given O(n) irredundant patterns, we detect all
the N redundant patterns (that are possibly
exponential in the input size n) in time O(N log
n).
This is done by solving the following problem:
Given S sets on n elements, find all N distinct
intersection sets of size >= 2.
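The set-intersection problem can be illustrated with a naive closure computation. This sketch is not the O(N log n) algorithm from the talk; it simply intersects sets pairwise until nothing new appears, and it counts the input sets themselves among the results. The function name and the `min_size` parameter are assumptions.

```python
def intersection_closure(sets, min_size=2):
    """All distinct intersections of the input sets with at least min_size elements."""
    found = {frozenset(x) for x in sets if len(x) >= min_size}
    frontier = set(found)
    while frontier:
        new = set()
        for a in frontier:
            for b in found:
                c = a & b
                if len(c) >= min_size and c not in found:
                    new.add(c)
        found |= new
        frontier = new  # only newly found sets need to be intersected again
    return found


result = intersection_closure([{1, 2, 3}, {2, 3, 4}, {2, 3, 4, 5}])
print(sorted(sorted(x) for x in result))
```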
30
Definition
Given substrings m1 and m2, m1 < m2 if m1 occurs
at position 1 of m2. Example: m1 = ab.de, m2 =
a..d, m3 = ac.d, m4 = a.cde
m2 < m1: True
m3 < m1: False
m4 < m1: False
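The relation can be checked mechanically. The sketch below assumes that a "." in the smaller pattern matches anything in the larger one, while a solid character must match the same solid character; the function name `precedes` is made up for illustration.

```python
def precedes(m1, m2):
    """True if m1 occurs at position 1 of m2 ('.' in m1 matches any symbol of m2)."""
    if len(m1) > len(m2):
        return False
    return all(a == "." or a == b for a, b in zip(m1, m2))


m1 = "ab.de"
for m in ("a..d", "ac.d", "a.cde"):
    print(m, precedes(m, m1))
```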

31
Exercise

Given a sequence s:
1) label each occurrence of maximal m on s with the pattern m
2) for each co-occurrence m1, m2, remove m1 if m1 < m2
3) collect the motifs that remain, M
What is M?
32
Redundancy 'operational' view

- M is the basis (irredundant motifs)
- This observation is used to design an online algorithm
(joint work with Apostolico): Incremental
Paradigms of Motif Discovery, A Apostolico, L
Parida. Under review.
33
Variable-Gap Substring Patterns
(Extensible Patterns/Motifs)
s = axbcaxxbcaxxxbc
m = a1-3bc (a, then a gap of 1 to 3 wildcards, then bc), at pos 1, 5 and 10
Main Issues
1) a location list corresponds to multiple patterns
Eg. axbcpdaycbqd (at positions 1 and 7)
m1 = a1-2b1-2d
m2 = a1-2c1-2d
2) multiple occurrences at a location
Eg. abbc (at position 1)
m = a0-1b0-1c
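Occurrences of a variable-gap pattern can be found by translating it to a regular expression. The token-list representation used here (solid strings alternating with `(min, max)` gap tuples) is a hypothetical encoding chosen for the sketch, not notation from the talk; a lookahead is used so that overlapping occurrences are all reported.

```python
import re


def to_regex(tokens):
    """Turn e.g. ['a', (1, 3), 'bc'] into the regex 'a.{1,3}bc'."""
    parts = []
    for t in tokens:
        if isinstance(t, tuple):
            lo, hi = t
            parts.append(".{%d,%d}" % (lo, hi))
        else:
            parts.append(re.escape(t))
    return "".join(parts)


def occurrences(s, tokens):
    """1-based positions where the extensible pattern can start in s."""
    rx = re.compile("(?=" + to_regex(tokens) + ")")
    return [m.start() + 1 for m in rx.finditer(s)]


print(occurrences("axbcaxxbcaxxxbc", ["a", (1, 3), "bc"]))
print(occurrences("axbcpdaycbqd", ["a", (1, 2), "b", (1, 2), "d"]))
print(occurrences("axbcpdaycbqd", ["a", (1, 2), "c", (1, 2), "d"]))
```

The last two calls return the same location list for two different patterns, which is the first issue named above.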
34
Definitions (extensible patterns)
Realization of m: for m = a2-3b, the realizations
m' are a..b and a...b
m < m': for every realization of m there exists
a realization of m' such that the former < the latter
Maximality: in composition, in length, in extension
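Realizations can be enumerated by expanding each gap range into every allowed number of wildcards. As an assumption for this sketch, the pattern is again encoded as a list of solid strings and `(min, max)` gap tuples.

```python
from itertools import product


def realizations(tokens):
    """Expand every gap range into fixed runs of '.' and list all combinations."""
    choices = []
    for t in tokens:
        if isinstance(t, tuple):
            lo, hi = t
            choices.append(["." * g for g in range(lo, hi + 1)])
        else:
            choices.append([t])
    return ["".join(c) for c in product(*choices)]


print(realizations(["a", (2, 3), "b"]))
print(realizations(["a", (0, 1), "b", (0, 1), "c"]))
```

Each realization is a fixed-gap pattern, so the fixed-gap definitions apply to it directly.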
35
Motivation
Fibronectin is a plasma protein that binds cell
surfaces and various compounds including
collagen, fibrin, heparin, DNA, and actin. The
major part of the sequence of fibronectin
consists of the repetition of three types of
domains, which are called type I, II, and III.
Type II domain is approximately forty residues
long, contains four conserved cysteines
involved in disulfide bonds and is part of
the collagen-binding region of fibronectin.
In fibronectin the type II domain is
duplicated.
36
Motivation (cont)
Type II domains have also been found in the
following proteins:
- Blood coagulation factor XII (Hageman factor) (1 copy)
- Bovine seminal plasma proteins PDC-109 (BSP-A1/A2) and BSP-A3 (twice)
- Cation-independent mannose-6-phosphate receptor (which is also the
  insulin-like growth factor II receptor) (1 copy)
- Mannose receptor of macrophages (1 copy)
- 180 Kd secretory phospholipase A2 receptor (1 copy)
- DEC-205 receptor (1 copy)
- 72 Kd type IV collagenase (EC 3.4.24.24) (MMP-2) (3 copies)
- 92 Kd type IV collagenase (EC 3.4.24.24) (MMP-9) (3 copies)
- Hepatocyte growth factor activator (1 copy)
37
Motivation (cont)
"fibronectin type II" domain and the pattern
C..PF.[FYWI].......C.(8,10)WC....[DNSR][FYW].(3,5)[FYW].[FYWI]C
A schematic representation of the position of
the invariant residues and the topology of the
disulfide bonds in the fibronectin type II domain:

xxCxxPFxxxxxxxxCxxxx-xxxWCxxxxxx-xxCxx
[disulfide bond connector lines of the original schematic]

'C': conserved cysteine involved in a disulfide bond.
The original schematic also marks large
hydrophobic residues.
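A pattern of this kind can be applied to a sequence by rewriting it as a regular expression. The sketch assumes the bracketed reading of the pattern (character classes such as [FYWI] and gaps written .(8,10)), which is a reconstruction rather than the slide's literal text; the toy sequence is made up purely to exercise the match.

```python
import re


def slide_pattern_to_regex(p):
    """Rewrite gap ranges '(a,b)' as regex quantifiers '{a,b}'; the rest is already regex."""
    return re.sub(r"\((\d+),(\d+)\)", r"{\1,\2}", p)


# Assumed reading of the type II domain pattern (brackets reconstructed).
PATTERN = "C..PF.[FYWI].......C.(8,10)WC....[DNSR][FYW].(3,5)[FYW].[FYWI]C"
rx = re.compile(slide_pattern_to_regex(PATTERN))

# Made-up sequence built to satisfy the pattern, not a real protein.
toy = "CAAPFAF" + "A" * 7 + "C" + "A" * 8 + "WC" + "A" * 4 + "DF" + "A" * 3 + "FAFC"
print(rx.fullmatch(toy) is not None)
```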
38
Inexact Suffix Tree
Varun: An Inexact-suffix Tree Based Algorithm for
Detecting Extensible Patterns, A Chattaraj, L
Parida, under submission.
39
Variations in input
  • sequence of sets
  • a bcdbcd ba bed
  • bcd..d is a maximal pattern at pos 1 and 4
  • a b.d is a maximal pattern at pos 1 and 7
  • sequence of real numbers
  • notion of "occurrence" uses a tolerance value
  • infinite number of solutions!
  • s = 1.5 3 4 1.7 6 4, tolerance 0.4
  • patterns are possibly on intervals (why?)
  • [1.3, 1.9] . 4 is a maximal pattern at pos 1 and 4
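One way to read the interval in the last bullet: a real number r can stand for a group of values if it lies within the tolerance of each of them, and the set of such r is an interval. The sketch below follows that reading, which is an interpretation and not stated on the slide; the names are illustrative, and `None` is used as the wildcard.

```python
def representative_interval(values, tol):
    """Interval of all r with |r - v| <= tol for every v (empty if lo > hi)."""
    return (max(values) - tol, min(values) + tol)


def occurrences(s, pattern, tol):
    """1-based positions where each non-wildcard r is within tol of the value."""
    hits = []
    for i in range(len(s) - len(pattern) + 1):
        if all(r is None or abs(s[i + j] - r) <= tol for j, r in enumerate(pattern)):
            hits.append(i + 1)
    return hits


s = [1.5, 3, 4, 1.7, 6, 4]
lo, hi = representative_interval([1.5, 1.7], 0.4)
print(round(lo, 6), round(hi, 6))
print(occurrences(s, [1.6, None, 4], 0.4))  # 1.6 is one choice inside the interval
```

Any r inside the interval gives the same location list, which is why a single interval pattern can replace infinitely many real-valued ones.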

40
Handling variations
  • Sets
  • redefine "occurs" s1gts2 w s1r s2
  • proceed as before accounting for the new
    definitions
  • Real numbers
  • map to an instance of patterns on sets
  • solve
  • map solutions to reals

41
Permutation Patterns (π patterns)
  • Example: s = abcdacdb
  • p1 = {c, d} at pos 3, 6; p2 = {a, b, c, d} at
    pos 1, 2, 5
  • Example on sets: s = {a,e} b c d {e,c} c d b
  • p1 = {e, b, c, d} at pos 1, 2, 5
  • Application in finding gene/domain clusters

Related Work: Fast Algorithms to Enumerate all
common intervals of two permutations, Uno,
Yagiura, Algorithmica 2000. Finding All Common
Intervals of K permutations, Heber, Stoye, CPM
2001.
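For contiguous permutation patterns on plain characters, an occurrence is a window whose characters are a rearrangement of the pattern. The sliding-window check below is a simple sketch with an illustrative function name; it is not the linear-time algorithm cited in the related work.

```python
from collections import Counter


def pi_occurrences(s, p):
    """1-based positions of windows of s that are a permutation of the characters in p."""
    want = Counter(p)
    w = len(p)
    return [i + 1 for i in range(len(s) - w + 1) if Counter(s[i:i + w]) == want]


s = "abcdacdb"
print(pi_occurrences(s, "cd"))
print(pi_occurrences(s, "abcd"))
```

Because `Counter` compares multiplicities, the same check also covers patterns in which a character appears more than once.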
42
Permutation Patterns (π patterns)
Notion of minimality (instead of maximality)
(used by Heber, Stoye)
Example: s = abcdefbacdfe
p1 = {a,b}
p2 = {c,d}
p3 = {e,f}
p4 = {a,b,c}
p5 = {d,e,f}
p6 = {c,d,e,f}
p7 = {a,b,c,d}
p8 = {a,b,c,d,e,f}
43
Maximality in π patterns
s = abcdefbacdfe
p1 = {a,b}
p2 = {c,d}
p3 = {e,f}
p4 = {a,b,c}
p5 = {d,e,f}
p6 = {c,d,e,f}
p7 = {a,b,c,d}
p8 = {a,b,c,d,e,f}
Maximal pattern: p = (a,b)-c-d-(e,f)
A Combinatorial Approach to Automatic Discovery
of Cluster-Patterns, R Eres, G Landau, L Parida,
WABI 2003.
44
π Patterns
  • contiguous
  • linear time algorithm
  • (intervals) Heber, Stoye
  • with multiplicity of character
  • s = abbcdabcbd, p = {a, b(2), c, d} at pos 1, 2, 6
  • linear time algorithm
  • Joint work with Eres, Landau

45
Open Problems in π Patterns
  • Efficient algorithm for π patterns with gaps
  • s = abcdaedcb, p = {a, b, c, d} at pos 1, 2,
    5
  • s = abffcdaedcb, p = {a, b, c, d} at pos 1, 2,
    7
  • Number of maximal patterns?
  • "Natural" notion of irreducibility/irredundancy?

46
Network Motifs
Joint work with A Tsirigos
47
Deterministic Patterns
Forbidden words
π Patterns
Substring Patterns
On reals
On chars/sets
Gapped
0-Gap
Var-Gaps (extensible)
Fixed-Gap
Gene-chip Analysis
Association Discovery
Association Discovery
Simple Time-series
48
http://www.research.ibm.com/bioinformatics
49
Thanks
www.nhgri.nih.gov
Structural Biology: Ajay Royyuru
Bioinformatics & Pattern Discovery: Isidore Rigoutsos, Daniel Platt, Tien Huynh
http://www.research.ibm.com/bioinformatics
http://www.research.ibm.com/people/p/parida