Title: Some Computer Science Techniques in Bioinformatics
1. Some Computer Science Techniques in Bioinformatics
2. Problems in Biological Computing
- Laboratory data is full of errors.
- Less amenable to rigid rule-based algorithms.
- Multi-objective: difficult to model as a regular single-objective approximation problem.
- Dependency among different objectives.
- Local noise: less suitable for global optimization.
- To tackle these problems, one should focus on real data rather than worst-case analysis.
3. Distribution of Problem Instances with Respect to an Algorithm
[Figure: proportion of instances versus complexity (n, n^2, n^3, exponential)]
5. How to Exploit the Structure of a Problem
- Best case: closed-form solutions
- The solutions of a quadratic equation
- Solve by an algorithm
- Rule-based algorithms
- Rules (if, then, else) reside inside the algorithm
- Most discrete algorithms (e.g. special graph algorithms)
- Corpus-based algorithms (training set, testing set)
- Machine-learning algorithm
- Knowledge-based algorithm
- Represent the structure by a collection of explicit features and their weights
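As a small illustration of the "closed-form solution" end of this spectrum, the quadratic equation can be solved directly from its formula, with no search or training involved. The function name below is only illustrative.

```python
import cmath


def solve_quadratic(a, b, c):
    """Return both roots of a*x**2 + b*x + c = 0 using the closed-form formula."""
    if a == 0:
        raise ValueError("not a quadratic equation: a must be non-zero")
    # cmath.sqrt also handles a negative discriminant (complex roots).
    root_of_disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + root_of_disc) / (2 * a), (-b - root_of_disc) / (2 * a)


r1, r2 = solve_quadratic(1, -3, 2)  # x**2 - 3x + 2 = (x - 1)(x - 2)
```

The roots are returned as complex numbers so that the same formula covers every case.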
6. Some Computer Science Techniques in Bioinformatics
- String matching
- Alignment (BLAST), multiple sequence alignment (PSI-BLAST), motif finding
- Clustering, classification
- Neural net, SVM, maximum entropy
- Graph algorithm
- Natural language processing
- IR, IE, machine translation, HMM
- Image, signal processing
7. Guideline for Bioinformatics Research
- Bioinformatics tools are used as decision support systems.
- The emphasis is on biological fact finding rather than methodological sophistication.
- The more biological knowledge embedded in the tools, the better the result could be.
- Corpus-based algorithms seem to be more appropriate.
- Pure machine-learning algorithms usually suffer from low prediction accuracies.
8. Outline of Examples (in our lab)
- Protein Secondary Structure Prediction
- NMR Backbone Assignment
- Mass Spectrometry-ICAT Technique
- Literature Mining
9. Protein Secondary Structure Prediction
10. Hierarchy of Protein Structures
11. Sequence to 2D Structure Prediction
- Secondary structure elements
- 3 states: α-helix (H), β-strand (E), others (L).
- Secondary structure prediction
- Assign one of the states to each amino acid.
- Useful for 3D prediction
- Can be regarded as a translation problem
- Need to consider evolutionary information
- What are the words, phrases?
12. Previous Works
- Statistical approaches
- Chou-Fasman, GOR
- Chemistry-based approaches
- Lim, Cohen
- Machine learning approaches
- PHD, PSIPRED, Hyprosp
13. Characteristics of Our Method
- Sequence similarity ⇒ structure similarity
- There seems to be a gap for sequence identity between 0% and 25%.
- A knowledge-based approach (PROSP)
- Provides a monotone similarity measure
- Other NN methods such as PSIPRED
- A hybrid method (HYPROSP)
- Knowing when to use PROSP or PSIPRED
- Based on match rate
14. A Comparison of PROSP and PSIPRED
15. Sequence Homology: Global vs. Local
- Protein similarity
- Global: based on sequence or structure alignment
- Local: based on k-peptide similarity (Wu, CSB 2003)
- Global similarity ⇒ local similarity
- but not vice versa
- We adopt local similarity here, based on PSI-BLAST and HSP.
16. Divide the Target Protein into Peptides of 7-mers
[Figure: target protein]
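A minimal sketch of cutting a target protein into 7-mers follows. It assumes overlapping windows that slide one residue at a time; the slide does not say whether the peptides overlap, and the function name and toy sequence are illustrative only.

```python
def split_into_kmers(sequence, k=7):
    """Return every overlapping k-residue peptide of `sequence`, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy amino-acid string, not taken from the source.
peptides = split_into_kmers("MKTAYIAKQR")
```

A sequence of length n yields n - 6 peptides under this assumption, so every residue away from the two ends is covered by seven different windows.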
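17. Knowledge-based Prediction Algorithm: PROSP
[Figure: a residue x of the target is matched against the knowledge base via PSI-BLAST; the matches vote H, H, H, L, E, L, giving voting scores H(x), E(x), L(x), and x is assigned as helix]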
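The voting step in the figure can be sketched as a simple tally. This is only an illustration under the assumption that each match casts one unweighted vote; the source does not spell out how PROSP weights its votes, and `vote_state` is a hypothetical name.

```python
from collections import Counter

STATES = ("H", "E", "L")  # helix, strand, others


def vote_state(votes):
    """Tally the votes for one residue and return (winning_state, scores)."""
    counts = Counter(v for v in votes if v in STATES)
    scores = {s: counts.get(s, 0) for s in STATES}
    # Ties are broken by the order in STATES (an arbitrary choice for this sketch).
    winner = max(STATES, key=lambda s: scores[s])
    return winner, scores


# The six votes shown for residue x in the figure.
winner, scores = vote_state(["H", "H", "H", "L", "E", "L"])
```

With the votes from the figure, helix collects three votes against two for L and one for E, so x is assigned as helix.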
18. Hybrid Method: HYPROSP
- Determine match rate of the target protein
- Hybrid method: HYPROSP
- Match rate > 80%: use PROSP
- We only predict this part
- Match rate < 80%: use PSIPRED
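The switching rule of the hybrid method can be sketched as a single threshold test. The match rate is taken as an already computed number on the slide's scale (its formula is not reproduced here), and sending a rate of exactly 80 to PSIPRED is an assumption, since the slide only covers the cases above and below 80. The function name is hypothetical.

```python
def choose_predictor(match_rate, threshold=80.0):
    """Pick the predictor for a target protein from its match rate."""
    # Above the threshold the knowledge base is trusted; otherwise fall back.
    return "PROSP" if match_rate > threshold else "PSIPRED"


choice_high = choose_predictor(92.5)
choice_low = choose_predictor(41.0)
```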
19. Match Rate
target protein
Match rate / ( )
20. A Comparison of PROSP and PSIPRED
21. New Experimental Result
22. NMR Backbone Assignment
23. NMR 3D Structure Determination
24. Blind Man's Elephant
- We cannot directly see the positions of these atoms (the structure).
- But we can measure a set of parameters (with constraints) on these atoms, which can help us infer their coordinates.
- Each experiment can only determine a subset of parameters (with noise).
- To combine the parameters of different experiments, we need to stitch them together.
25. Chemical Shift Acquisition
26. NMR Experiment
27. Ambiguities
- All 4-point experiments are mixed together.
- All 2-point experiments are mixed together.
- Each spin system can be mapped to several amino acids in the protein sequence.
- False positives, false negatives
28. A Peculiar Parking Lot
Information you have: the make of your car, and the person who parked behind you or in front of you (approximately). Together with others, try to identify as many cars as possible (maximizing the overall satisfaction).
29. Potential errors: some people's memory is wrong; some just forget their neighbors.
- When there is perfect information, this can be formulated as the constrained bipartite matching problem (if you find your car, your neighbor should also find his/her car).
[Figure: legal matching]
30. Problems Caused by False Positives
Two people claim the same car
Many cars are of the same make
Your two neighbors are split
31. Spin System Group Generation
- Three types of spin system group are generated, based on the quality of the CBCANH data:
- Perfect
- Weak false-negative
- Severe false-negative
32. Linking and Mapping
Gradually include more information; the more reliable information has higher priority. Use a maximum independent set algorithm in each iteration.
33. Experimental Results
- Accuracy on two real datasets
- SBD: 91.4%
- LBD: 83.6%
- The average accuracy on perfect BMRB datasets (902 proteins) is 98.28%.
34. Mass Spectrometry-ICAT Technique
35. Mass Spectrometry in Biology
- Protein quantification
- Protein-protein interaction
- Protein structure
- Protein sequencing
- Confirm synthetic chemical component
36. Advances in mass spectrometry and the generation of large quantities of nucleotide sequence information, combined with computational algorithms that could correlate the two, led to the emergence of proteomics as a field.
Patterson and Aebersold (2003). Proteomics: the first decade and beyond. Nat Genet 33 Suppl: 311-23.
37. Goal of MS
Schematic representation of the systems biology
paradigm
38. Mass Spectrometers
Ionization Source
This part ionizes the sample.
Mass Analyzer
This part separates the ions by their mass/charge (m/z) ratio.
Detector
This part records the output of the mass analyzer as a spectrum.
(From Dr. Khoo's PDF file)
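To make the m/z ratio used by the mass analyzer concrete, the sketch below computes the m/z of a molecule of neutral mass M that carries z extra protons. It assumes positive-mode protonation, which the slide does not state; the function name is illustrative.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz(neutral_mass, charge):
    """Return m/z for a molecule of `neutral_mass` (Da) carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# The same 1000 Da molecule appears at different m/z values for different charges.
singly = mz(1000.0, 1)
doubly = mz(1000.0, 2)
```

Because the analyzer separates ions by m/z rather than by mass, one peptide can show up as several peaks, one per charge state.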
39. Tandem Mass Spectrometry (MS/MS)
Sample mixtures → Ionized → MS 1 → Select sample for MS/MS → MS 2
(From Dr. Khoo's PDF file)
41. Mass Spectrometry: Example
42. Workflow of MS
computational algorithms
The raw data (tandem mass spectra) are further
processed by software to produce information
about the identity, quantity and characteristics
of the proteins detected.
43. Bioinformatics in MS
Institute of Systems Biology
44. Protein Identification
Associate with which protein?
46. Quantification (ICAT)
47. Quantification (ICAT)
1. Pair selection
2. MS database
3. LC reconstruction
4. Peptide ratio
5. Protein ratio
6. Mobility shift
7. Expert validation
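Steps 4 and 5 of this pipeline can be sketched as follows. This is an illustration under stated assumptions, not the lab's actual procedure: a peptide ratio is taken as heavy peak area over light peak area, and the protein ratio as the median of its peptide ratios. The function names and the peak areas are made up for the example.

```python
from statistics import median


def peptide_ratio(light_area, heavy_area):
    """Heavy/light ratio for one ICAT peptide pair (assumed definition)."""
    if light_area <= 0:
        raise ValueError("light_area must be positive")
    return heavy_area / light_area


def protein_ratio(peptide_ratios):
    """Summarize a protein by the median of its peptide ratios (assumed choice)."""
    if not peptide_ratios:
        raise ValueError("at least one peptide ratio is required")
    return median(peptide_ratios)


# Made-up (light, heavy) peak areas for three peptides of one protein.
pairs = [(100.0, 210.0), (50.0, 95.0), (80.0, 168.0)]
ratios = [peptide_ratio(light, heavy) for light, heavy in pairs]
overall = protein_ratio(ratios)
```

The median is used here only because it is less sensitive to a single badly quantified peptide than the mean.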
48. Quantification (ICAT): Chromatogram reconstruction
[Figure: RP-LC chromatogram reconstruction from MS and MS/MS scans (1 MS, 3 MS/MS); axes: retention time (min), m/z, intensity]
49. Quantification (ICAT): Chromatogram reconstruction
50. Quantification (ICAT): protein ratio
[Figure: peaks labeled MH, M2H2, M3H3, M4H4]
51. Quantification (ICAT): Expert validation
- Retention time: peak maximum (< 20 sec)
- Peak shape: normal distribution; peak width (30-40 sec); scans (45-50)
- Peak intensity (S/N ratio > 3)
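The numeric validation criteria can be sketched as a simple filter. The thresholds are read from the slide as: peak-maximum retention-time offset under 20 sec, peak width of 30 to 40 sec, 45 to 50 scans, and S/N above 3. That reading is an assumption, as are the field names and the interpretation of the retention-time criterion as an offset between paired peak maxima. The peak-shape (normal distribution) check is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    apex_offset_sec: float   # offset between paired peak maxima (assumed meaning)
    width_sec: float         # chromatographic peak width
    n_scans: int             # number of scans across the peak
    signal_to_noise: float   # S/N ratio of the peak intensity


def passes_validation(peak):
    """Return True if the peak meets all four numeric criteria."""
    return (
        abs(peak.apex_offset_sec) < 20
        and 30 <= peak.width_sec <= 40
        and 45 <= peak.n_scans <= 50
        and peak.signal_to_noise > 3
    )


# Made-up peaks for illustration.
good = Peak(apex_offset_sec=10, width_sec=35, n_scans=47, signal_to_noise=5.2)
noisy = Peak(apex_offset_sec=10, width_sec=35, n_scans=47, signal_to_noise=2.0)
```

A filter like this would only pre-screen candidates; the slide describes the step as expert validation, so a human check is still implied.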
52. Literature Mining
53. Literature Mining Problems
- Named Entity Recognition (NER)
- Named Entity Relation Recognition (NERR)
- Document Classification
Biologists need a better search engine than Google: one that is more semantically oriented and can search and make use of relations.
54. Named Entity Recognition
- The basic problem in information extraction
- Problem definition
- Extract named entities from text
- Example
- Find protein names in the following text:
- The ZAP-70 mutant studied here could be phosphorylated on tyrosine when associated to the TCR zeta chain and was able to bind p56(lck).
55. Named Entity Disambiguation: another example (acronyms)
- Example
- PI can be an abbreviation of the following named entities:
- glutathione transferase
- Permeability Index
- alpha-1-antitrypsin
- Without the full name written in the text, it is hard to disambiguate the meaning of PI.
56. Named Entity Relation Recognition
- Problem definition
- Given a text annotated with named entities, find the relations among these named entities.
- Example
- Find protein interaction relations in the following text:
- The ZAP-70 mutant studied here could be phosphorylated on tyrosine when associated to the TCR zeta chain and was able to bind p56(lck).
Relation(N1, R, N2) = (ZAP-70, bind, p56(lck))
57. Named Entity Disambiguation
- Problem definition
- In database query services, the user submits a gene symbol and the system often returns data with the same spelling as the gene symbol but a different meaning.
- We can utilize different computation techniques to handle this problem.
- Example
- Find the classes of the following named entities: ZAP-70, LIF, E1A gene, CD4
- ZAP-70: protein name
- LIF: protein name
- E1A gene: gene name
- CD4: cell type
58. Document Classification
- Problem definition
- Given a list of papers, find those of interest to a user.
- Example
- Paper list (PubMed IDs): 15017560, 14768007, 14581357, 12734082, 14672953, 14600132, 14581357, 12810673, 12734082
- User interest: find the papers that discuss the gene ANG and the disease hepatocellular carcinoma
- Answer: 14581357, 12734082
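The simplest baseline for this kind of query is a keyword filter: keep a paper only if its text mentions every term of interest. The sketch below is only that baseline, not the classifier used in the lab; the paper identifiers and abstracts are toy placeholders, not real PubMed records.

```python
import re


def mentions_all(text, terms):
    """True if every term occurs in `text` as a whole word or phrase (case-insensitive)."""
    return all(
        re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def select_papers(abstracts, terms):
    """Return the ids of the papers whose abstract mentions all terms."""
    return [pid for pid, text in abstracts.items() if mentions_all(text, terms)]


# Toy placeholder abstracts, invented for this example.
abstracts = {
    "paper-1": "This toy abstract mentions ANG and hepatocellular carcinoma.",
    "paper-2": "This toy abstract mentions angiogenesis and hepatocellular carcinoma.",
    "paper-3": "This toy abstract mentions ANG only.",
}
selected = select_papers(abstracts, ["ANG", "hepatocellular carcinoma"])
```

The word-boundary match keeps "angiogenesis" from counting as a mention of ANG, which hints at why gene symbols call for the disambiguation techniques discussed earlier.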
59. Computation Techniques in Literature Mining
- Named Entity Recognition
- Named Entity Relation Recognition
60. Named Entity Recognition
- Naming Template
- Dictionary based Method
- BLAST match
- Morphological match
- Machine Learning
- Decision Tree
- Statistical methods
- Naïve Bayesian
- Hidden Markov Model (HMM)
- Support Vector Machine (SVM)
- Maximum Entropy (ME)
61. Machine Learning: Maximum Entropy
- Given
- Training corpus: GENIA 3.01 corpus
- Features: indicators that distinguish the constituent words of named entities from normal words
- Morphological features
- POS features
- Semantic trigger features
- Head-noun features
- NF-kappaB consensus site
62. Morphological Features
63. Head-noun Features
64. Orthographic Features
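A rough sketch of how orthographic and morphological features might be extracted per token is given below. The feature names are hypothetical and are not the feature set actually used with the GENIA corpus; they only illustrate the kind of surface cues (capitalization, digits, hyphens, affixes) that such a model can consume.

```python
def token_features(token):
    """Return a dict of simple surface features for one token."""
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": token.isalpha() and token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "has_paren": "(" in token or ")" in token,
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
    }


feats = token_features("ZAP-70")
```

Tokens such as ZAP-70 or p56(lck) stand out from ordinary words on exactly these cues, which is why they are useful indicators for named entity recognition.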
65. Named Entity Relation Recognition
- Co-occurrences
- Natural Language Processing (NLP) method
- Grammar parser
- Template-based method
- Statistical method
- Bayesian
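The co-occurrence approach can be sketched as counting how often two named entities appear in the same sentence. The example reuses the ZAP-70 sentence from the earlier slides and assumes the entity list has already been produced by a recognition step; treating "TCR zeta chain" as an annotated entity is an assumption, and the function name is illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, entities):
    """Count, for each entity pair, the number of sentences mentioning both."""
    counts = Counter()
    for sentence in sentences:
        # Sorting gives each pair a canonical order.
        present = sorted(e for e in entities if e in sentence)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


sentences = [
    "The ZAP-70 mutant studied here could be phosphorylated on tyrosine "
    "when associated to the TCR zeta chain and was able to bind p56(lck).",
]
entities = ["ZAP-70", "TCR zeta chain", "p56(lck)"]
counts = cooccurrence_counts(sentences, entities)
```

Co-occurrence only says that two entities are mentioned together, not what relates them, which is why the template-based and parsing methods are listed alongside it.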
66. Template-based Method
- Manually coded rules
- Writing rules that describe relations between named entities
- Examples (protein-protein interaction)
- Protein_1 Token_1 interact with Protein_2
- interaction between Protein_1 and Protein_2
- Protein_1 involved in activation of Protein_2
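The three example templates can be turned into regular expressions once the protein names are known. The sketch below assumes the names come from a prior recognition step and treats Token_1 as at most one optional word; both are assumptions, and ProtA and ProtB are placeholder names rather than real proteins.

```python
import re


def find_interactions(sentence, proteins):
    """Return (protein_1, protein_2) pairs matched by any of the three templates."""
    # Longer names first, so a name is not cut short by one of its prefixes.
    name = "|".join(re.escape(p) for p in sorted(proteins, key=len, reverse=True))
    templates = [
        rf"(?P<p1>{name})(?: \S+)? interacts? with (?P<p2>{name})",
        rf"interaction between (?P<p1>{name}) and (?P<p2>{name})",
        rf"(?P<p1>{name}) (?:is )?involved in activation of (?P<p2>{name})",
    ]
    pairs = []
    for pattern in templates:
        for match in re.finditer(pattern, sentence):
            pairs.append((match.group("p1"), match.group("p2")))
    return pairs


proteins = ["ProtA", "ProtB"]
hits = find_interactions("ProtA can interact with ProtB", proteins)
```

Hand-written templates like these are precise but brittle: a sentence that expresses the same relation with different wording is simply missed, which motivates the NLP-based methods.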
67. Natural Language Processing (NLP) Method
- Utilize NLP techniques to facilitate the extraction problems.
- Frequently used techniques
- POS tagging
- Grammar parser