Some Computer Science Techniques in Bioinformatics - PowerPoint PPT Presentation

Slides: 68
Provided by: csieNt
1
Some Computer Science Techniques in Bioinformatics

2
Problems in Biological Computing
  • Laboratory data is full of errors.
  • Less amenable to rigid rule-based algorithms.
  • Multi-objective: difficult to model as a
    regular single-objective approximation problem
  • Dependency among different objectives
  • Local noise: less suitable for global
    optimization
  • To tackle these problems, one should focus on
    real data rather than worst-case analysis.

3
Distribution of Problem Instances with Respect to
an Algorithm
[Figure: proportion of instances versus complexity
(n, n^2, n^3, exponential)]
5
How to Exploit the Structure of a Problem
  • Best case: closed-form solutions
  • e.g. the solutions of a quadratic equation
  • Solve by an algorithm
  • Rule-based algorithms
  • Rules (if-then-else) reside inside the
    algorithm
  • Most discrete algorithms (e.g. special
    graph algorithms)
  • Corpus-based algorithms (training set, testing
    set)
  • Machine-learning algorithms
  • Knowledge-based algorithms
  • Represent the structure by a collection of
    explicit features and their weights

6
Some Computer Science Techniques in Bioinformatics
  • String matching
  • Alignment (BLAST), multiple sequence alignment
    (PSI-BLAST), motif finding
  • Clustering, classification
  • Neural net, SVM, maximum entropy
  • Graph algorithm
  • Natural language processing
  • IR, IE, Machine translation, HMM
  • Image, signal processing

7
Guideline for Bioinformatics Research
  • Bioinformatics tools are used as decision support
    systems
  • The emphasis is on biological fact finding rather
    than methodological sophistication
  • The more biological knowledge is embedded in the
    tools, the better the results can be.
  • Corpus-based algorithms seem to be more
    appropriate
  • Pure machine-learning algorithms usually suffer
    from low prediction accuracies

8
Outline of Examples (in our lab)
  • Protein Secondary Structure Prediction
  • NMR Backbone Assignment
  • Mass Spectrometry-ICAT Technique
  • Literature Mining

9
Protein Secondary Structure Prediction
10
Hierarchy of Protein Structures
11
Sequence to 2D Structure Prediction
  • Secondary structure elements
  • 3 states: α-helix (H), β-strand (E), others (L).
  • Secondary structure prediction
  • Assign one of the states to each amino acid.
  • Useful for 3D prediction
  • Can be regarded as a translation problem
  • Need to consider evolutionary information
  • What are the words, phrases?

12
Previous Works
  • Statistical approaches
  • Chou-Fasman, GOR
  • Chemistry-based approaches
  • Lim, Cohen
  • Machine learning approaches
  • PHD, PSIPRED, Hyprosp

13
Characteristics of Our Method
  • Sequence similarity ⇒ structure similarity
  • There seems to be a gap for sequence identity
    between 0 and 25%
  • A knowledge-based approach (PROSP)
  • Provides a monotone similarity measure
  • Other NN methods such as PSIPRED
  • A hybrid method (HYPROSP)
  • Knowing when to use PROSP or PSIPRED
  • Based on match rate

14
A Comparison of PROSP and PSIPRED
15
Sequence Homology: Global vs. Local
  • Protein similarity
  • Global: based on sequence or structure alignment
  • Local: based on k-peptide similarity (Wu CSB
    2003)
  • Global similarity ⇒ local similarity
  • but not vice versa
  • We adopt local similarity here, based on
    PSI-BLAST and HSP.

16
Divide the Target Protein into 7-mer Peptides
[Figure: target protein divided into 7-mers]
17
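Knowledge-based Prediction Algorithm: PROSP
[Figure: a residue x of the target is searched
against the knowledge base with PSI-BLAST; the hits
vote H, H, H, L, E, L, giving the voting scores
H(x), E(x), L(x); x is assigned as helix]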
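
The voting idea can be sketched in Python. This is a minimal illustration and not the PROSP implementation: the knowledge base is assumed to be a plain dictionary from 7-mer peptides to secondary-structure strings, an exact lookup stands in for the PSI-BLAST search, and the names `KNOWLEDGE_BASE`, `vote` and `predict` (as well as the toy entries) are hypothetical.

```python
from collections import Counter

K = 7  # peptide length used on the slides

# Hypothetical toy knowledge base: 7-mer peptide -> structure string
# (H = helix, E = strand, L = other). Entries are made up for illustration.
KNOWLEDGE_BASE = {
    "AEELLKK": "HHHHHHH",
    "EELLKKA": "HHHHHHL",
    "VKVTVEG": "EEEEELL",
}


def vote(sequence):
    """Return one Counter of H/E/L votes per residue of `sequence`."""
    votes = [Counter() for _ in sequence]
    for start in range(len(sequence) - K + 1):
        peptide = sequence[start:start + K]
        # Assumption: exact dictionary lookup replaces the PSI-BLAST search.
        structure = KNOWLEDGE_BASE.get(peptide)
        if structure is None:
            continue
        for offset, state in enumerate(structure):
            votes[start + offset][state] += 1
    return votes


def predict(sequence):
    """Assign each residue the state with the highest voting score.

    Residues without any vote get '-'; ties go to the first of H, E, L.
    """
    states = []
    for counter in vote(sequence):
        if counter:
            states.append(max("HEL", key=lambda s: counter[s]))
        else:
            states.append("-")
    return "".join(states)
```

Calling `predict("AEELLKKA")` lets two overlapping 7-mers vote on the middle residues, so most positions receive two votes and the end positions one.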
18
Hybrid Method: HYPROSP
  • Determine match rate of the target protein
  • Hybrid method: HYPROSP
  • Match rate > 80%: use PROSP
  • We only predict this part
  • Match rate < 80%: use PSIPRED
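
The switching rule can be written as a small dispatch function. The sketch below assumes the match rate is already available as a percentage and that the two predictors are passed in as callables; `hybrid_predict` and its parameters are hypothetical names, and sending a match rate of exactly 80 to PSIPRED is an assumption, since the slide does not cover that case.

```python
def hybrid_predict(sequence, match_rate, prosp, psipred, threshold=80.0):
    """Pick a predictor by match rate (in percent) and run it.

    `prosp` and `psipred` are any callables taking a sequence and
    returning a predicted structure string.
    """
    # Assumption: a match rate exactly at the threshold falls back to PSIPRED.
    predictor = prosp if match_rate > threshold else psipred
    return predictor(sequence)
```

For example, `hybrid_predict(seq, 92.5, prosp=f, psipred=g)` calls `f(seq)`, while a match rate of 40 would call `g(seq)`.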

19
Match Rate
[Figure: match rate computed over the target
protein; the formula is not recoverable from the
transcript]
20
A Comparison of PROSP and PSIPRED
21
New Experimental Result
22
NMR Backbone Assignment
23
NMR 3D Structure Determination
NMR
24
Blind Man's Elephant
  • We cannot directly see the positions of these
    atoms (the structure)
  • But we can measure a set of parameters (with
    constraints) on these atoms,
  • which can help us infer their coordinates

Each experiment can only determine a subset of
the parameters (with noise).
To combine the parameters of different
experiments, we need to stitch them together.
25
Chemical Shift Acquisition
26
NMR Experiment
27
Ambiguities
  • All 4-point experiments are mixed together
  • All 2-point experiments are mixed together
  • Each spin system can be mapped to several amino
    acids in the protein sequence
  • False positives, false negatives

28
A Peculiar Parking Lot
Information you have: the make of your car and
who parked behind or in front of you
(approximately). Together with others, try to
identify as many cars as possible (maximizing the
overall satisfaction).
29
Potential errors: some people's memory is wrong,
and some just forget their neighbors.
  • When there is perfect information, this can be
    formulated as the constrained bipartite matching
    problem (if you find your car, your neighbor
    should also find his/her car).

Legal matching
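
One way to make "legal matching" concrete is a checker for a proposed assignment of people to parking slots. The sketch below is a simplified model, not the formulation used in the talk: slots are numbered along a row, each person remembers the make of their car and, optionally, who parked directly behind them, and all names (`is_legal`, `owner_make`, `slot_make`, `behind`) are hypothetical.

```python
def is_legal(matching, owner_make, slot_make, behind):
    """Check a matching of people to slots against the constraints.

    matching:   dict person -> slot index
    owner_make: dict person -> make of that person's car
    slot_make:  list of makes, one per slot, in row order
    behind:     dict person -> person who parked directly behind them
    """
    # No two people may claim the same car.
    if len(set(matching.values())) != len(matching):
        return False
    for person, slot in matching.items():
        # The car in the slot must have the make the person remembers.
        if slot_make[slot] != owner_make[person]:
            return False
        # If the neighbour behind is also matched, they must be in the next slot.
        neighbour = behind.get(person)
        if neighbour in matching and matching[neighbour] != slot + 1:
            return False
    return True
```

In the NMR setting the people correspond to spin systems, the slots to residues of the protein sequence, and the neighbour relation to links between adjacent spin systems.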
30
Problems Caused by False Positives
Two people claim the same car
Many cars are of the same make
Your two neighbors are split
31
Spin System Group Generation
  • Three types of spin system groups are generated
    based on the quality of CBCANH data
  • Perfect
  • Weak false-negative
  • Severe false-negative

32
Linking and Mapping
Gradually include more information. More
reliable information has higher priority. Use a
maximum independent set algorithm in each iteration.
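
To illustrate the role of the independent set step, here is a greedy heuristic in Python. It is a stand-in and not the maximum independent set algorithm used in the talk: candidates (for example, possible assignments) carry a reliability weight, conflicting candidates are joined by an edge, and the most reliable non-conflicting candidates are kept. The function name and the data layout are assumptions.

```python
def greedy_independent_set(weights, conflicts):
    """Pick mutually non-conflicting candidates, most reliable first.

    weights:   dict candidate -> reliability score (higher is better)
    conflicts: iterable of (a, b) pairs that cannot both be chosen
    """
    neighbours = {c: set() for c in weights}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    chosen = []
    blocked = set()
    # Visit candidates from most to least reliable (name breaks ties).
    for cand in sorted(weights, key=lambda c: (-weights[c], c)):
        if cand in blocked:
            continue
        chosen.append(cand)
        blocked.add(cand)
        blocked |= neighbours[cand]
    return chosen
```

A greedy pass does not guarantee a maximum-weight set, but it respects the stated priority: a more reliable candidate always wins over a conflicting, less reliable one.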
33
Experimental Results
  • The accuracy on two real datasets
  • SBD: 91.4%
  • LBD: 83.6%
  • The average accuracy on perfect BMRB datasets
    (902 proteins) is 98.28%.

34
Mass Spectrometry-ICAT Technique
35
Mass Spectrometry in Biology
  • Protein quantification
  • Protein-protein interaction
  • Protein structure
  • Protein sequencing
  • Confirm synthetic chemical component

36
"Advances in mass spectrometry and the generation
of large quantities of nucleotide sequence
information, combined with computational
algorithms that could correlate the two, led to
the emergence of proteomics as a field."
Patterson and Aebersold (2003). Proteomics: the
first decade and beyond. Nat Genet 33 Suppl:
311-23.
37
Goal of MS
Schematic representation of the systems biology
paradigm
38
Mass Spectrometers
  • Ionization source: ionizes the sample.
  • Mass analyzer: separates the ions by their
    mass-to-charge (m/z) ratio.
  • Detector: records the output of the mass
    analyzer as a spectrum.
(From Dr. Khoo's PDF file)
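
As a worked example of the m/z ratio, the sketch below computes the value observed for a molecule of neutral mass M that carries z extra protons, m/z = (M + z * m_p) / z. It assumes positive-mode ionization by protonation; the function name `mz` is hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz(neutral_mass, charge):
    """m/z of a molecule of mass `neutral_mass` (Da) carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge
```

The same peptide therefore appears at different m/z values for different charge states: `mz(1000.0, 1)` is about 1001.007, while `mz(1000.0, 2)` is about 501.007.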
39
Tandem Mass Spectrometry (MS/MS)
[Figure: sample mixtures → ionized → MS 1 →
select sample for MS/MS → MS 2]
(From Dr. Khoo's PDF file)
41
Mass Spectrometry: Example
42
Workflow of MS
computational algorithms
The raw data (tandem mass spectra) are further
processed by software to produce information
about the identity, quantity and characteristics
of the proteins detected.
43
Bioinformatics in MS
Institute for Systems Biology
44
Protein Identification
Associate with which protein?
46
Quantification (ICAT)
47
Quantification (ICAT)
  1. Pair selection
  2. MS database
  3. LC reconstruction
  4. Peptide ratio
  5. Protein ratio
  6. Mobility shift
  7. Expert validation
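
Steps 4 and 5 can be sketched as follows. The slides do not give the formulas, so this is only one plausible reading: a peptide ratio is taken as the heavy peak area divided by the light peak area of an ICAT pair, and the protein ratio as the median of its peptide ratios. Both function names and the choice of the median are assumptions.

```python
from statistics import median


def peptide_ratio(light_area, heavy_area):
    """Heavy-to-light ratio of one ICAT peptide pair (peak areas)."""
    if light_area <= 0:
        raise ValueError("light peak area must be positive")
    return heavy_area / light_area


def protein_ratio(peptide_pairs):
    """Combine peptide ratios into one protein ratio.

    peptide_pairs: list of (light_area, heavy_area) tuples for the
    peptides assigned to the same protein. The median is an assumed
    choice; it is less sensitive to a single bad pair than the mean.
    """
    return median(peptide_ratio(light, heavy) for light, heavy in peptide_pairs)
```

With three pairs whose ratios are 2.0, 3.0 and 1.0, `protein_ratio` returns 2.0.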
48
Quantification (ICAT): Chromatogram Reconstruction
[Figure: RP-LC chromatogram reconstruction; axes:
retention time (min), m/z, intensity; MS and MS/MS
scans marked]
49
Quantification (ICAT): Chromatogram Reconstruction
50
Quantification (ICAT): Protein Ratio
[Figure: peaks labeled MH, M2H2, M3H3, M4H4]
51
Quantification (ICAT): Expert Validation
  • Retention time: peak maximum (< 20 sec)
  • Peak shape: normal distribution; peak width
    (30-40 sec); scans (45-50)
  • Peak intensity (S/N ratio > 3)
52
Literature Mining
53
Literature Mining Problems
  • Named Entity Recognition (NER)
  • Named Entity Relation Recognition (NERR)
  • Document Classification

Biologists need a search engine better than
Google: one that is more semantically oriented
and can search and make use of relations.
54
Named Entity Recognition
  • The basic problem in information extraction.
  • Problem Definition
  • Extract Named Entities from Text
  • Example
  • Find protein names in the following text
  • The ZAP-70 mutant studied here could be
    phosphorylated on tyrosine when associated to the
    TCR zeta chain and was able to bind p56(lck).

55
Named Entity Disambiguation: Another Example
(Acronyms)
  • Example
  • PI can be an abbreviation of the following
    named entities
  • glutathione transferase
  • Permeability Index
  • alpha-1-antitrypsin
  • Without the full name written in the text, it is
    hard to disambiguate the meaning of PI

56
Named Entity Relation Recognition
  • Problem Definition
  • Given a text with named entities annotation, find
    out the relations among these named entities.
  • Example
  • Find protein interaction relations in the
    following text
  • The ZAP-70 mutant studied here could be
    phosphorylated on tyrosine when associated to the
    TCR zeta chain and was able to bind p56(lck).

Relation(N1, R, N2) = (ZAP-70, bind, p56(lck))
57
Named Entity Disambiguation
  • Problem Definition
  • In database query services, the user submits a
    gene symbol and the system often returns data
    with the same spelling as the gene symbol but a
    different meaning.
  • We can use different computational techniques
    to handle this problem.
  • Example
  • Find the classes of the following named
    entities
  • ZAP-70, LIF, E1A gene, CD4
  • ZAP-70: protein name
  • LIF: protein name
  • E1A gene: gene name
  • CD4: cell type

58
Document Classification
  • Problem Definition
  • Given a list of papers, find those that are of
    interest to the user.
  • Example
  • Paper list (PubMed IDs): 15017560, 14768007,
    14581357, 12734082, 14672953, 14600132, 14581357,
    12810673, 12734082
  • User interest: papers that discuss the gene ANG
    and the disease hepatocellular carcinoma
  • Answer: 14581357, 12734082

59
Computation Techniques in Literature Mining
  • Named Entity Recognition
  • Named Entity Relation Recognition

60
Named Entity Recognition
  • Naming Template
  • Dictionary based Method
  • BLAST match
  • Morphological match
  • Machine Learning
  • Decision Tree
  • Statistical methods
  • Naïve Bayesian
  • Hidden Markov Model (HMM)
  • Support Vector Machine (SVM)
  • Maximum Entropy (ME)
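
The dictionary-based method is the simplest of these to illustrate. The sketch below looks up every dictionary entry in the text and prefers longer names when entries overlap; the dictionary contents and the function name `find_entities` are made up for the example, and real systems add the BLAST-style or morphological matching listed above to tolerate spelling variants.

```python
import re


def find_entities(text, dictionary):
    """Return sorted (start, end, name) spans of dictionary names in `text`.

    Longer names are matched first, so "TCR zeta chain" wins over "TCR".
    """
    hits = []
    taken = [False] * len(text)  # characters already covered by a hit
    for name in sorted(dictionary, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            if not any(taken[m.start():m.end()]):
                hits.append((m.start(), m.end(), name))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(hits)


# Hypothetical toy dictionary, applied to the example sentence from the slides.
PROTEIN_NAMES = {"ZAP-70", "TCR", "TCR zeta chain", "p56(lck)"}
SENTENCE = ("The ZAP-70 mutant studied here could be phosphorylated on "
            "tyrosine when associated to the TCR zeta chain and was able "
            "to bind p56(lck).")
```

Running `find_entities(SENTENCE, PROTEIN_NAMES)` yields spans for ZAP-70, TCR zeta chain and p56(lck), in that order.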

61
Machine Learning: Maximum Entropy
  • Given
  • Training corpus: GENIA 3.01 corpus
  • Features: indicators that distinguish the
    constituent words of named entities from normal
    words
  • Morphological Features
  • POS Features
  • Semantic Trigger Features
  • Head-noun Features
  • NF-kappaB consensus site
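
To give a feel for such indicators, the sketch below computes a few surface features of a token and a naive head noun of a phrase. These are typical examples of morphological and head-noun features, not the feature set of the presentation; the function names and the choice of features are assumptions.

```python
def token_features(token):
    """A few surface indicators that often separate entity words from normal words."""
    return {
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "has_upper": any(ch.isupper() for ch in token),
        "all_caps": token.isupper(),
        "suffix3": token[-3:].lower(),
    }


def head_noun(phrase):
    """Naive head noun: the last whitespace-separated word, lowercased."""
    return phrase.split()[-1].lower()
```

For the phrase "NF-kappaB consensus site", the head noun is "site", which suggests that the phrase names a site rather than a protein, even though it starts with a protein name.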

62
Morphological Features
63
Head-noun Features
64
Orthographic Features
65
Named Entity Relation Recognition
  • Co-occurrences
  • Natural Language Processing (NLP) method
  • Grammar Parser
  • Template based method
  • Statistical Method
  • Bayesian
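
The co-occurrence approach can be sketched in a few lines: count how often two named entities appear in the same sentence and treat frequent pairs as candidate relations. The sketch assumes the entities of each sentence have already been recognized; the function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences):
    """Count entity pairs that appear together in a sentence.

    sentences: list of lists, each holding the entity names found in one
    sentence. Pairs are stored with the two names in sorted order.
    """
    counts = Counter()
    for entities in sentences:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts
```

Co-occurrence only says that two entities are mentioned together, not how they are related, which is why the parser-based and template-based methods are listed alongside it.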

66
Template based method
  • Manually coded rules
  • Writing rules that describe relations between
    named entities
  • Examples (protein-protein interaction)
  • Protein_1 Token_1 interact with Protein_2
  • interaction between Protein_1 and Protein_2
  • Protein_1 involved in activation of Protein_2
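
The three example templates translate naturally into regular expressions. The sketch below assumes the protein names are already known (for instance from a previous NER step) and builds one pattern per template; the function name, the exact patterns and the sample sentences are made up for illustration.

```python
import re


def extract_interactions(text, proteins):
    """Apply three hand-written templates and return (Protein_1, Protein_2) pairs."""
    # Alternation of all known names, longest first so longer names win.
    name = "|".join(re.escape(p) for p in sorted(proteins, key=len, reverse=True))
    templates = [
        # Protein_1 [one optional token] interact(s) with Protein_2
        rf"(?P<p1>{name})(?:\s+\S+)?\s+interacts?\s+with\s+(?P<p2>{name})",
        # interaction between Protein_1 and Protein_2
        rf"interaction\s+between\s+(?P<p1>{name})\s+and\s+(?P<p2>{name})",
        # Protein_1 [is] involved in activation of Protein_2
        rf"(?P<p1>{name})\s+(?:is\s+)?involved\s+in\s+activation\s+of\s+(?P<p2>{name})",
    ]
    found = []
    for pattern in templates:
        for m in re.finditer(pattern, text):
            found.append((m.group("p1"), m.group("p2")))
    return found


# Made-up sentences using protein names from the earlier example.
PROTEINS = ["ZAP-70", "p56(lck)", "TCR zeta chain"]
TEXT = ("ZAP-70 can interact with p56(lck). We examined the interaction "
        "between ZAP-70 and TCR zeta chain.")
```

Templates are precise but brittle: a sentence that expresses the same relation with different wording is missed unless another rule is written for it.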

67
Natural Language Processing (NLP) Method
  • Use NLP techniques to help solve the
    extraction problems.
  • Frequently used techniques
  • POS Tagging
  • Grammar Parser