Pattern Discovery in Biological Sequences: A Review

Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes



1
Pattern Discovery in Biological Sequences: A Review
ChengXiang Zhai
Language Technologies Institute, School of Computer Science,
Carnegie Mellon University
Presentation at the Biological Language Modeling
Seminar, June 17, 2002
2
Outline
[Diagram labels: Computer Science, Algorithm, Pattern Discovery,
Application, Biology, Motivation, Formalization, Basic Concepts
(Common Language)]
3
Basic Concepts
  • Alphabet & Language
  • Alphabet: set of symbols, e.g., Σ = {A, T, G, C}
    is the nucleotide alphabet
  • String/Sequence (over an alphabet): finite sequence
    of symbols, e.g., w = AGCTGC (How many different
    nucleotide strings of length 3 are there?)
  • Language (over an alphabet): set of strings,
    e.g., L = {AAA, AAT, ATA, AGC, ..., AGG}, all
    nucleotide triplets starting with A.
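
The question on this slide can be checked by brute-force enumeration. The following is a minimal sketch; the names `ALPHABET`, `strings_of_length`, `triplets`, and `language_a` are illustrative and not part of the original slides.

```python
from itertools import product

ALPHABET = "ATGC"  # nucleotide alphabet from the slide


def strings_of_length(alphabet, n):
    """Enumerate every string of length n over the alphabet."""
    return ["".join(symbols) for symbols in product(alphabet, repeat=n)]


triplets = strings_of_length(ALPHABET, 3)

# The example language from the slide: all nucleotide triplets starting with A.
language_a = {w for w in triplets if w.startswith("A")}

print(len(triplets))    # 64, i.e. 4 ** 3
print(len(language_a))  # 16, i.e. 4 ** 2
```

A language here is just a Python set of strings, so membership is a set lookup (`"AGC" in language_a`).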

4
Example: Essential AA Language
The language (set) of essential amino acids on
the alphabet {A, U, C, G}: L = {CAC, CAU, ..., UAC,
UAU}
[Figure: The Genetic Code]
5
Questions to Ask about a Language (L)
  • Syntax & Semantics: How do we describe L and interpret L?
  • Recognition: Is sequence s in L or not?
  • Learning: Given example sequences in L and not in L, how do
    we learn L? What if we are given sequences that either
    match or do not match a sub-sequence in L?

6
Syntax & Semantics of Language
  • Syntax: description of the form of sequences
  • Surface description: enumeration
  • Deep description: a concise decision rule or a
    characterizing pattern, e.g.,
  • L contains all the triplets ending with A, or
  • L contains all sequences that match AGGGGGA
  • Semantics: meaning of sequences
  • Functional description of an amino acid sequence
  • Gene regulation of a nucleotide sequence

7
Recognizing Sequences in L
  • Recognizer (for L): given a sequence s, it tells
    us whether s is in L or not. An operational way of
    describing L!

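[Diagram: a sequence from Σ* (all protein sequences) goes into an
algorithm (the G-receptor recognizer), which outputs 0 (no) or 1 (yes)
according to whether the sequence is in L (G-receptors)]
Is the sequence SNASCTTNAPTGAK a G-receptor?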
8
More than recognizing...
  • Can the recognizer explain why a sequence is a
    G-receptor? Is the explanation biologically
    meaningful?
  • The explanatory power reflects the recognizer's
    understanding of the language.
  • Two possible explanations/decision rules:
  • It is longer than 300 AAs
  • The four AAs A, P, K, B co-occur within a
    window of 50 AAs
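
The two decision rules on this slide can each be wrapped as a recognizer that returns 0 or 1. This is only a sketch of the slide's illustrative rules, not a real G-receptor classifier; the function names and the reading of "co-occur within a window" as "all four residues appear in some stretch of at most 50 consecutive positions" are assumptions.

```python
def longer_than(seq, min_len=300):
    """Rule 1 from the slide: the sequence is longer than 300 AAs."""
    return len(seq) > min_len


def cooccur_in_window(seq, residues="APKB", window=50):
    """Rule 2 from the slide: all listed residues occur inside one window."""
    needed = set(residues)
    # Slide a fixed-width window; a short sequence is checked as one window.
    for start in range(max(1, len(seq) - window + 1)):
        if needed <= set(seq[start:start + window]):
            return True
    return False


def recognize(seq, rule):
    """Recognizer: 1 (yes) if the rule accepts the sequence, else 0 (no)."""
    return 1 if rule(seq) else 0


print(recognize("SNASCTTNAPTGAK", longer_than))        # 0: only 14 AAs
print(recognize("SNASCTTNAPTGAK", cooccur_in_window))  # 0: no B in the sequence
```

Both rules give an operational description of a language, but only the second one points at specific residues, which is the kind of explanation the slide asks about.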

9
Learning a Language (from Examples)
[Diagram: positive examples lie inside L; negative examples lie in the
rest of Σ*]
Learn a recognizer (Classification): given a
new sequence, decide if it is in L
Learn meaningful features (Feature
Extraction/Selection): characterize, in a
meaningful way, how L is different from the rest
of Σ*
10
More Basic Concepts
  • Pattern/Motif: a sequence template, e.g., A..GT
  • Different views of a pattern:
  • A pattern defines a language: L = {seqs that
    match the pattern} =>
  • Language learning = pattern learning?
  • Given a language, can we summarize it with a
    pattern?
  • A pattern is a feature: the feature is "on" for a
    sequence that matches the pattern =>
  • Feature extraction = pattern extraction?
  • A pattern is a sequence of a pattern language
11
The Need for Probabilities
  • We have many uncertainties due to:
  • incomplete data and knowledge
  • noise in data (incorrect labels, measurement
    errors, etc.)
  • So, we relax our criteria:
  • L could potentially contain all the sequences,
    but with different probabilities (statistical
    LM)
  • How likely is it that a sequence s is in L?
  • How do we learn such an L? (LM estimation)
12
Biological Motivation for Pattern Discovery
  • Motifs, or preserved sequence patterns, are
    believed to exist
  • Motifs determine a sequence's biological function
    or structure
  • Many success stories (Brejova et al. 2000):
  • Tuberculosis: detecting secretory proteins, 90%
    confirmed
  • Coiled coils in histidine kinases: detecting
    coiled coil
13
Amino Acid Patterns: Patterns in Protein
  • Possible biological meanings: they may
  • determine a protein's 3-D structure
  • determine a protein's function
  • indicate a protein's evolutionary history or
    family
  • Suspected properties: they may be
  • long and with gaps
  • flexible to permit substitutions
  • weak in their primary sequence form
  • strong in their structural form

14
Nucleotide Patterns I: Non-coding Regions
  • Possible biological meanings: they may
  • determine the global function of a genome, e.g.,
    where all the promoters are
  • regulate specific gene expression
  • play other roles in the complex gene regulatory network
  • Suspected properties: they may be
  • in the non-coding regions
  • relatively more continuous and short
  • working together with many other factors
15
Nucleotide Patterns II: Patterns in RNA
  • Possible biological meanings: they may
  • determine RNA's 3-D structure, and thus indirectly
    transcription behavior
  • Suspected properties: they may
  • be long and with gaps(?)
  • contain many coordinating/interacting elements
  • be weak in their primary sequence form
  • be strong in their structural form
16
Nucleotide Patterns III: Tandem Repeats
  • Possible biological meanings: they may
  • be a result of mutations from the same original
    segment
  • play a role in gene regulation
  • be related to several diseases
  • Suspected properties: they may
  • be contiguous
  • be approximate copies of the same root form
  • be hard to detect
17
Pattern Discovery: Problem Formulation
  • The ultimate goal is to find "Meaningful
    Patterns"
  • Broadly, three types of sub-problems:
  • Pattern Generation/Enumeration
  • Sequence Classification/Retrieval/Mining
  • Pattern Extraction

18
Map of Pattern Discovery Problems
[Diagram nodes: Sequences, Pattern Generation, Seq. Classification,
Candidate Patterns, Function Info, Pattern Extraction]
19
Pattern Generation/Enumeration
  • Given a (usually big) collection of sequences
  • Generate/enumerate all the significant patterns
    that satisfy certain constraints
  • Issues:
  • Design a pattern language (e.g., max. length?)
  • Design significance criteria (e.g., freq > 3)
  • Design a search/enumeration strategy
  • Algorithm has to be efficient

20
Sequence Classification
  • Finding structures in a group of sequences
  • Categorization: group sequences into known
    families
  • Clustering: explore any natural grouping tendency
  • Retrieval: find sequences that satisfy a certain
    need
  • Goal: maximize classification accuracy
  • Issues:
  • Dealing with noise; using good features/patterns
  • Breaking the limit of linear similarity

21
Sequence Categorization
  • 2 or more meaningful classes
  • Examples available for each class
  • Goal is to predict the class label of a new
    instance as accurately as possible
  • E.g., protein categorization, G-receptor
    recognition

[Diagram: examples from classes C1, C2, C3; learn the boundaries]
22
Sequence Clustering
  • Given sequences, but no pre-defined classes
  • Design similarity criteria to group sequences
    that are similar
  • Goal is to reveal any interesting structure
  • E.g., gene clustering based on expression
    information

[Diagram: learn the boundaries]
23
Sequence Retrieval
  • Given some desired property of sequences
  • Find all sequences with the desired property
  • E.g., find all Enzyme sequences that are similar
    to my G-receptor sequence

[Diagram: a query, and the sequences to find]
24
Pattern Extraction
  • Suppose you are given a lot of text in a foreign
    language unknown to you
  • Can you identify proper names in the text?
  • Issues:
  • Need to know the possible form of a meaningful
    pattern (Will a name have more than three words?)
  • Need to identify useful clues (e.g.,
    capitalization)
  • The extraction criteria must be based on some
    information about the functions or structures of
    a sequence

25
Entering the Algorithm Zone ...
  • The most influential ones seem to be:
  • Pattern Generation Algorithms
  • TEIRESIAS & SPLASH
  • Pattern Classification Algorithms
  • I believe that most standard classification
    algorithms have been tried
  • HMMs are very popular
  • Pattern Extraction Algorithms: ???

26
TEIRESIAS & SPLASH
  • Find deterministic patterns (exact algorithm)
  • Pattern language:
  • Allowing gaps, e.g. A..HC
  • Constraints on the density of the wild-card "."
  • Less powerful than regular languages/expressions
  • Significance criteria:
  • Longer: more significant
  • Higher frequency: more significant
  • Statistical test: how likely is it a random
    effect?

27
Basic Idea in TEIRESIAS & SPLASH
  • Generate & Test
  • Pruning strategy: if a (short) pattern occurs
    fewer than 5 times, so do ALL longer patterns
    containing it!
  • A bottom-up inductive procedure:
  • Start with high-frequency short patterns
  • At any step, try to extend the short patterns
    slightly
28
Possible Applications of TEIRESIAS & SPLASH
  • Defining a feature space (biological words)
  • Suggesting structures for probabilistic models
    (e.g. HMM structure)
  • A general tool for any sequence mining task
    (e.g., mining the web click-log data?)

29
Map of Pattern Discovery Problems
Pattern ? Meaningful Structure?
[Diagram nodes: Sequences, Structure Analysis (Alignment), Function
Analysis (Classification), Structure Info, Function Info, Pattern
Extraction/Interpretation]
30
Probabilistic Pattern Finding
  • Probabilistic vs. deterministic patterns
  • Functional comparison:
  • A deterministic pattern either matches or does
    not match a sequence
  • A probabilistic pattern potentially matches every
    sequence, but with different probabilities
  • Deterministic patterns are special cases of
    probabilistic patterns
  • Structural comparison:
  • Deterministic patterns are easier to interpret

31
Hidden Markov Models (HMMs)
  • Probabilistic Models for Sequence Data
  • The system is in one of K states at any moment
  • At the next moment, the system:
  • moves to another state probabilistically
  • outputs a symbol probabilistically
  • To generate a sequence of n symbols, the system
    makes n moves (transitions)

32
Examples
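  • Unigram LM: a single state s with a self-transition of
    probability 1.0 and emission distribution p(w|s):
    P(w_1 ... w_n) = p(w_1|s) ... p(w_n|s) = p(w_1) ... p(w_n)
  • Position Weight Matrix (PWM): a chain Start, s_1, s_2, ...,
    s_k, End in which every transition has probability 1.0 and
    state s_i has emission distribution p(w|s_i):
    P(w_1 ... w_n) = p(w_1|s_1) ... p(w_n|s_n)
  • Deterministic Pattern AUGUAGUGAUAA
    [Diagram: Start, states 1, 2, 3 emitting A, U, G with p(A)=1,
    p(U)=1, p(G)=1, further states emitting U, G, or A, and End;
    one state is annotated p(A) + p(U) + p(G) + p(C) = 1]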
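
A position weight matrix can be scored directly as a product of per-position emission probabilities, and a deterministic pattern then appears as the special case where each position puts probability 1 on a single symbol. The matrices below are made-up numbers for illustration, and `pwm_prob` is a hypothetical helper name.

```python
import math


def pwm_prob(seq, pwm):
    """P(w_1 ... w_k) = p(w_1|s_1) ... p(w_k|s_k); zero if the lengths differ."""
    if len(seq) != len(pwm):
        return 0.0
    return math.prod(column.get(w, 0.0) for w, column in zip(seq, pwm))


# A probabilistic pattern: one emission distribution per position.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "U": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "U": 0.1},
]

# A deterministic pattern (AUG) as a degenerate PWM.
deterministic = [{"A": 1.0}, {"U": 1.0}, {"G": 1.0}]

print(pwm_prob("AUG", pwm))            # about 0.343
print(pwm_prob("AUG", deterministic))  # 1.0
print(pwm_prob("AUC", deterministic))  # 0.0
```

The probabilistic matrix gives every length-3 sequence some probability, while the deterministic one only separates matches from non-matches.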
33
Three Tasks/Uses
  • Prediction (Forward/Backward algorithms)
  • Given an HMM, how likely is it that the HMM
    generates a particular sequence?
  • Useful for, e.g., recognizing unknown proteins
  • Decoding (Viterbi algorithm)
  • Given an HMM, what is the most likely transition
    path for a sequence? (Discovers hidden structure
    or alignment.)
  • Training (Baum-Welch algorithm)
  • HMM unknown: how do we estimate its parameters?
  • Supervised (known state transitions) vs.
    unsupervised
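
The first two tasks can be sketched compactly for a small HMM. The code assumes a non-empty sequence, dictionary-based parameters, and a toy two-state model with made-up probabilities; it works with raw probabilities for readability, whereas practical implementations usually work in log space or rescale to avoid underflow. Training (Baum-Welch) is not shown.

```python
def forward(seq, states, initial, transitions, emissions):
    """Prediction: P(seq | HMM), summed over all state paths."""
    alpha = {s: initial[s] * emissions[s][seq[0]] for s in states}
    for w in seq[1:]:
        alpha = {
            s: emissions[s][w] * sum(alpha[r] * transitions[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(seq, states, initial, transitions, emissions):
    """Decoding: (probability, path) of the single most likely state path."""
    best = {s: (initial[s] * emissions[s][seq[0]], [s]) for s in states}
    for w in seq[1:]:
        new_best = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * transitions[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emissions[s][w], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


states = ("AT", "GC")
initial = {"AT": 0.5, "GC": 0.5}
transitions = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emissions = {
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

print(forward("AATTGGCC", states, initial, transitions, emissions))
print(viterbi("AATTGGCC", states, initial, transitions, emissions)[1])
```

For this toy model the decoded path switches from the AT-rich state to the GC-rich state halfway through, which is the kind of hidden structure the decoding task is meant to expose.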

34
Applications of HMM
  • MANY!!!
  • Protein family characterization (Profile HMM)
  • A generative model for proteins in one family
  • Useful for classifying/recognizing unknown
    proteins
  • Discovering weak structure
  • Gene finding
  • A generative model for DNA sequences
  • Identify coding regions and non-coding regions

35
An Example Profile HMM
  • Three types of states
  • Match
  • Insert
  • Delete
  • One delete and one match per position in model
  • One insert per transition in model
  • Start and end dummy states

[Diagram: a profile HMM with delete, insert, and match states (the
match state at position 2 is labeled), shown next to an alignment]
Example borrowed from Cline, 1999
36
Profile HMMs: Basic Idea
  • Goal: use an HMM to represent the common pattern of
    proteins in the same family/domain
  • First proposed in (Krogh et al. 1994)
  • Trained on multiple sequence alignments
  • match states = consensus columns
  • Supervised learning
  • Trained on a set of raw sequences
  • number of match states = average length
  • Unsupervised learning

37
Uses of Profile HMMs
  • Identify new proteins of a known family
  • Match a profile HMM with a database of sequences
  • Score a sequence by likelihood ratio (w.r.t. a
    background model), apply a threshold
  • Identify families/domains of a given sequence
  • Match a sequence with a database of profile
    HMMs
  • Return top N domains
  • Multiple alignments
  • Identify similar sequences: iterative search
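
Scoring by likelihood ratio against a background model can be sketched as a log-odds score followed by a threshold. To stay short, the model below is a strong simplification of a profile HMM: match states only, with no insert or delete states, over a toy DNA alphabet with made-up probabilities. The names `log_odds` and `search` and the threshold of 0 are assumptions.

```python
import math


def log_odds(seq, match_emissions, background):
    """log [ P(seq | model) / P(seq | background) ] for a gap-free model."""
    if len(seq) != len(match_emissions):
        return float("-inf")
    score = 0.0
    for w, column in zip(seq, match_emissions):
        p = column.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p / background[w])
    return score


def search(database, match_emissions, background, threshold=0.0):
    """Return (sequence, score) pairs scoring above the threshold, best first."""
    hits = [(s, log_odds(s, match_emissions, background)) for s in database]
    return sorted((h for h in hits if h[1] > threshold), key=lambda h: -h[1])


model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for seq, score in search(["AG", "CC", "AC"], model, background):
    print(seq, round(score, 3))
```

A positive score means the family model explains the sequence better than the background does; where to set the threshold is a separate calibration question.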

38
Profile HMMs: Major Issues
  • Architecture: explain sub-families, more
    constrained (motif HMMs)
  • Local vs. global alignment
  • Avoid over-fitting: mixture Dirichlet prior, use
    labeled data
  • Avoid local maxima: annealing, labeled data
  • Sequence weighting: address sample bias
  • Computational efficiency

39
Profile HMMs: More Details
  • Dirichlet Mixture Prior
  • Generate an AA from a Dirichlet distribution
    Dir(p|α) in two stages
  • Given observed AA counts, we can estimate the
    prior parameters α
  • Assume a mixture of k Dirichlet distributions
    Dir(p|α)
  • For each column of the multiple alignment
  • Assume that the counts (of different AAs) are a
    sample of the mixture model
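
One common use of a Dirichlet mixture prior is to smooth the emission estimates of a column: each component contributes pseudocounts, weighted by how well it explains the observed counts. The sketch below computes that posterior mean for a toy 4-symbol alphabet with made-up mixture parameters; it assumes the mixture weights and α vectors are already known and does not show how they are fitted.

```python
import math


def log_marginal(counts, alpha):
    """log P(counts | alpha) for one Dirichlet component.

    The multinomial coefficient is omitted: it is identical for every
    component and cancels when the posterior weights are normalized.
    """
    n, a = sum(counts), sum(alpha)
    result = math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        result += math.lgamma(c + al) - math.lgamma(al)
    return result


def posterior_mean(counts, weights, alphas):
    """Posterior mean emission probabilities under a Dirichlet mixture prior."""
    logs = [math.log(q) + log_marginal(counts, a) for q, a in zip(weights, alphas)]
    top = max(logs)
    post = [math.exp(value - top) for value in logs]
    z = sum(post)
    post = [p / z for p in post]  # P(component | counts)
    n = sum(counts)
    return [
        sum(pk * (c + a[i]) / (n + sum(a)) for pk, a in zip(post, alphas))
        for i, c in enumerate(counts)
    ]


counts = [5, 0, 0, 0]                    # observed counts in one column
weights = [0.5, 0.5]                     # mixture coefficients
alphas = [[2, 1, 1, 1], [1, 1, 1, 2]]    # one alpha vector per component
estimate = posterior_mean(counts, weights, alphas)
print([round(p, 4) for p in estimate])
```

With a single component this reduces to ordinary pseudocount smoothing, (count + α_i) / (N + Σα), which is a convenient sanity check.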

40
Protein Structure Prediction with HMMs
  • SAM-T98
  • Best method that made use of no direct structural
    information at CASP3 (Critical Assessment of
    Structure Prediction)
  • Create a model of your target sequence
  • Search a database of proteins using that model
  • Whichever sequence scores highest, predict that
    structure

41
How do we build a model using only one sequence?
42
Application Example: Pfam (HMMER)
  • Pfam is a large collection of protein multiple
    sequence alignments and profile hidden Markov
    models. Pfam is available on the World Wide Web
    in the UK, Sweden, France, and the US. The latest
    version (6.6) of Pfam contains 3071 families,
    which match 69% of proteins in SWISS-PROT 39 and
    TrEMBL 14. Structural data, where available, have
    been utilised to ensure that Pfam families
    correspond with structural domains, and to
    improve domain-based annotation. Predictions of
    non-domain regions are now also included. In
    addition to secondary structure, Pfam multiple
    sequence alignments now contain active site
    residue mark-up. New search tools, including
    taxonomy search and domain query, greatly add to
    the functionality and usability of the Pfam
    resource.

43
HMM Gene Finders
  • Goal: use an HMM to find the exact boundaries of genes
  • Usually generalized HMMs
  • With Class (GeneMark & GeneMark.hmm?)
  • State = Neural Network (Genie)
  • Architecture: 2 modules interleaved
  • Boundary module: start codon, stop codon, binding
    sites, transcription factors, etc.
  • Region module: exons, introns, etc.
  • A lot of domain knowledge encoded

44
HMMs: Pros & Cons
  • Advantages
  • Statistics
  • Modularity
  • Transparency
  • Prior knowledge
  • Disadvantages
  • State independence
  • Over-fitting
  • Local maxima
  • Speed

45
More Applications & Discussions
  • Ultimately, how useful are these algorithms for
    biological discovery?
  • Integrated with biological experiment design
    (reinforcement learning?)
  • Biological verification of patterns/classification
  • Evaluation of these algorithms is generally hard
    and expensive?

46
Some Fundamental Questions
  • How powerful should the pattern language be? Are
    regular expressions sufficient?
  • How do we formulate biologically meaningful or
    biologically motivated classification/extraction
    criteria?
  • How do we evaluate a pattern without expensive
    biological experiments?

47
The End. Thank you!