Some Computer Science Techniques in Bioinformatics

Provided by: csieNt

1
Some Computer Science Techniques in Bioinformatics

2
Problems in Biological Computing

Laboratory data is full of errors.
Less amenable to rigid rule-based algorithms.
Multi-objective: difficult to model as a regular single-objective approximation problem.
Dependency among different objectives.
Local noise: less suitable for global optimization.
To tackle these problems, one should focus on real data rather than worst-case analysis.

3
Distribution of Problem Instances with Respect to an Algorithm
[Figure: proportion of instances plotted against complexity (n, n^2, n^3, exponential)]
5
How to Exploit the Structure of a Problem

Best case: closed-form solutions
The solutions of a quadratic equation
Solve by an algorithm
Rule-based algorithms
Rules (if ..., then ..., else ...) reside inside the algorithm. Most discrete algorithms (e.g. special graph algorithms)
Corpus-based algorithms (training set, testing set)
Machine-learning algorithms
Knowledge-based algorithms
Represent the structure by a collection of explicit features and their weights

6
Some Computer Science Techniques in Bioinformatics

String matching
Alignment (BLAST), multiple sequence alignment (PSI-BLAST), motif finding
Clustering, classification
Neural networks, SVM, maximum entropy
Graph algorithms
Natural language processing
IR, IE, machine translation, HMM
Image and signal processing

7
Guideline for Bioinformatics Research

Bioinformatics tools are used as decision support systems.
The emphasis is on biological fact finding rather than methodological sophistication.
The more biological knowledge is embedded in the tools, the better the results can be.
Corpus-based algorithms seem to be more appropriate.
Pure machine-learning algorithms usually suffer from low prediction accuracy.

8
Outline of Examples (in our lab)

Protein Secondary Structure Prediction
NMR Backbone Assignment
Mass Spectrometry-ICAT Technique
Literature Mining

9
Protein Secondary Structure Prediction
10
Hierarchy of Protein Structures
11
Sequence to 2D Structure Prediction

Secondary structure elements
3 states: α-helix (H), β-strand (E), others (L).
Secondary structure prediction
Assign one of the states to each amino acid.
Useful for 3D prediction.
Can be regarded as a translation problem.
Need to consider evolutionary information.
What are the words and phrases?

12
Previous Works

Statistical approaches
Chou-Fasman, GOR
Chemistry-based approaches
Lim, Cohen
Machine learning approaches
PHD, PSIPRED, Hyprosp

13
Characteristics of Our Method

Sequence similarity ⇒ structure similarity
There seems to be a gap for sequence identity between 0% and 25%.
A knowledge-based approach (PROSP)
Provides a monotone similarity measure
Other NN methods such as PSIPRED
A hybrid method (HYPROSP)
Knowing when to use PROSP or PSIPRED
Based on match rate

14
A Comparison of PROSP and PSIPRED
15
Sequence Homology: Global vs. Local

Protein similarity
Global: based on sequence or structure alignment
Local: based on k-peptide similarity (Wu, CSB 2003)
Global similarity ⇒ local similarity, but not vice versa
We adopt local similarity here, based on PSI-BLAST and HSP.

16
Divide the Target Protein into Peptides of 7-mers
target protein
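As a rough illustration of the 7-mer decomposition, the sketch below slides a window of length 7 along the sequence. The slide does not say whether the peptides overlap; overlapping windows are an assumption here, and the function name `split_into_kmers` and the toy sequence are made up for illustration.

```python
def split_into_kmers(sequence: str, k: int = 7) -> list[tuple[int, str]]:
    """Return (start_index, peptide) for every overlapping k-mer.

    Assumption: windows overlap and advance one residue at a time.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


# Toy sequence, not taken from the presentation.
peptides = split_into_kmers("MKTAYIAKQR")
for start, peptide in peptides:
    print(start, peptide)
```

A sequence shorter than k yields no peptides under this scheme.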
17
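Knowledge-based Prediction Algorithm: PROSP
[Figure: a residue x of the target is searched against the knowledge base with PSI-BLAST; the matched peptides vote H, E, or L for x, giving the voting scores H(x), E(x), and L(x). In the example shown, x is assigned as helix.]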
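The voting idea can be sketched as follows. This is a minimal sketch, not the PROSP implementation: the input format (start position, state string, weight), the tie-breaking order H, E, L, and the default of L for residues that receive no votes are all assumptions made for illustration.

```python
from collections import Counter

STATES = ("H", "E", "L")


def vote_states(length, matches):
    """Accumulate per-residue voting scores and pick the top state.

    matches: iterable of (start, states, weight), where `states` is the
    secondary-structure string of a matched peptide aligned to target
    positions start, start + 1, ... (hypothetical input format).
    """
    scores = [Counter() for _ in range(length)]
    for start, states, weight in matches:
        for offset, state in enumerate(states):
            pos = start + offset
            if 0 <= pos < length and state in STATES:
                scores[pos][state] += weight

    prediction = []
    for counter in scores:
        if not counter:
            prediction.append("L")  # assumption: no votes means "others"
        else:
            # Ties resolve in the order H, E, L (assumption).
            prediction.append(max(STATES, key=lambda s: counter[s]))
    return "".join(prediction), scores


predicted, per_residue = vote_states(
    3, [(0, "HHL", 1.0), (0, "HEL", 1.0), (1, "HL", 1.0)]
)
print(predicted)
```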
18
Hybrid Method: HYPROSP

Determine the match rate of the target protein
Match rate > 80% ⇒ use PROSP
We only predict this part
Match rate < 80% ⇒ use PSIPRED
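The selection rule on this slide amounts to a simple dispatch on the match rate. The sketch below assumes the two predictors are supplied as callables; `hyprosp_predict`, `prosp`, and `psipred` are placeholder names, not real APIs, and the handling of a match rate of exactly 80 (sent to PSIPRED here) is an assumption, since the slide leaves it unspecified.

```python
def hyprosp_predict(sequence, match_rate, prosp, psipred, threshold=80.0):
    """Choose a predictor from the match rate of the target protein.

    prosp, psipred: callables mapping a sequence to a state string
    (hypothetical stand-ins for the real tools).
    """
    if match_rate > threshold:
        return prosp(sequence)
    return psipred(sequence)


# Stub predictors for illustration only.
def fake_prosp(seq):
    return "H" * len(seq)


def fake_psipred(seq):
    return "L" * len(seq)


high = hyprosp_predict("MKTAY", 92.0, fake_prosp, fake_psipred)
low = hyprosp_predict("MKTAY", 40.0, fake_prosp, fake_psipred)
```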

19
Match Rate
target protein
Match rate / ( )
20
A Comparison of PROSP and PSIPRED
21
New Experimental Result
22
NMR Backbone Assignment
23
NMR 3D Structure Determination
NMR
24
Blind Man's Elephant

We cannot directly see the positions of these
atoms (the structure)
But we can measure a set of parameters (with
constraints) on these atoms,
which can help us infer their coordinates

Each experiment can determine only a subset of
the parameters (with noise)
To combine the parameters of different
experiments we need to stitch them together
25
Chemical Shift Acquirement
26
NMR Experiment
27
Ambiguities

All 4 point experiments are mixed together
All 2 point experiments are mixed together
Each spin system can be mapped to several amino
acids in the protein sequence
False positives, false negatives

28
A Peculiar Parking Lot
Information you have: the make of your car, and
(approximately) who parked behind you or in front of you.
Together with others, try to
identify as many cars as possible (maximizing the
overall satisfaction).
29
Potential errors: some people's memories are wrong,
and some simply forget their neighbors.

When there is perfect information, this can be
formulated as the constrained bipartite matching
problem (if you find your car, your neighbor
should also find his/her car).

Legal matching
30
Problems Caused by False Positives
Two people claim the same car
Many cars are of the same make
Your two neighbors are split
31
Spin System Group Generation

Three types of spin system groups are generated,
based on the quality of the CBCANH data:
Perfect
Weak false-negative
Severe false-negative

32
Linking and Mapping
Gradually include more information, giving higher
priority to the more reliable data. Use a maximum
independent set algorithm in each iteration.
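To make the maximum independent set step concrete, here is a small exhaustive-search sketch. The slides do not say which algorithm is used, so this is not the lab's method; it only shows the problem being solved. The interpretation of nodes as candidate assignments and edges as conflicts between them (for example, two spin systems claiming the same position) is an assumption, and exhaustive search is only feasible for tiny graphs.

```python
from itertools import combinations


def maximum_independent_set(nodes, conflicts):
    """Return a largest subset of nodes with no conflicting pair.

    Exhaustive search from the largest subset size downward, so the
    first conflict-free subset found is a maximum one.
    """
    conflict_set = {frozenset(pair) for pair in conflicts}
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(frozenset(pair) not in conflict_set
                   for pair in combinations(subset, 2)):
                return set(subset)
    return set()


# Toy conflict graph: a path a - b - c - d.
nodes = ["a", "b", "c", "d"]
conflicts = [("a", "b"), ("b", "c"), ("c", "d")]
chosen = maximum_independent_set(nodes, conflicts)
print(sorted(chosen))
```

The sketch is unweighted; giving priority to more reliable information, as the slide describes, would need weights or the staged iteration the slide mentions.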
33
Experimental Results

Accuracy on two real datasets:
SBD: 91.4%
LBD: 83.6%
The average accuracy on perfect BMRB datasets
(902 proteins) is 98.28%.

34
Mass Spectrometry-ICAT Technique
35
Mass Spectrometry in Biology

Protein quantification
Protein-protein interaction
Protein structure
Protein sequencing
Confirmation of synthetic chemical components

36
Advances in mass spectrometry and the generation
of large quantities of nucleotide sequence
information, combined with computational
algorithms that could correlate the two, led to
the emergence of proteomics as a field
Patterson and Aebersold (2003). Proteomics: the first decade and beyond.
Nat Genet 33 Suppl: 311-23.
37
Goal of MS
Schematic representation of the systems biology
paradigm
38
Mass Spectrometers
Ionization Source
Ionizes the sample.
Mass Analyzer
Separates the ions by their
mass/charge (m/z) ratio.
Detector
Records the output of the mass analyzer as a spectrum.
(From Dr. Khoo's PDF file)
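As a small worked example of the m/z ratio, the sketch below computes m/z for a peptide carrying one or more protons. It assumes positive-mode ions of the form [M + zH] with charge z, which the slide does not state explicitly; the function name `mz` and the example mass are illustrative.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of a protonated ion [M + zH]^(z+) (assumed ion type)."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# A hypothetical peptide with a neutral mass of 1000 Da.
for z in (1, 2, 3, 4):
    print(z, round(mz(1000.0, z), 4))
```

The same peptide therefore appears at different m/z values depending on its charge state.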
39
Tandem Mass Spectrometry (MS/MS)
[Figure: sample mixtures are ionized and analyzed in MS 1; a sample is selected for MS/MS and analyzed in MS 2]
(From Dr. Khoo's PDF file)
41
Mass Spectrometry: Example
42
Workflow of MS
computational algorithms
The raw data (tandem mass spectra) are further
processed by software to produce information
about the identity, quantity and characteristics
of the proteins detected.
43
Bioinformatics in MS
Institute of Systems Biology
44
Protein Identification
Associate with which protein?
46
Quantification (ICAT)
47
Quantification (ICAT)
1. Pair selection
2. MS database
3. LC reconstruction
4. Peptide ratio
5. Protein ratio
6. Mobility shift
7. Expert validation
48
Quantification (ICAT): Chromatogram Reconstruction
[Figure: RP-LC run showing intensity against retention time (min) and m/z, with cycles of 1 MS scan followed by 3 MS/MS scans]
49
Quantification (ICAT): Chromatogram Reconstruction
50
Quantification (ICAT): Protein Ratio
[Figure labels: MH, M2H2, M3H3, M4H4]
51
Quantification (ICAT): Expert Validation
Retention time: peak maximum (< 20 sec)
Peak shape: normal distribution; peak width (30-40 sec); scan (45-50)
Peak intensity (S/N ratio > 3)
52
Literature Mining
53
Literature Mining Problems

Named Entity Recognition (NER)
Named Entity Relation Recognition (NERR)
Document Classification

Biologists need a search engine better than
Google: one that is more semantically oriented
and can search and make use of relations.
54
Named Entity Recognition

The basic problem in information extraction.
Problem definition
Extract named entities from text.
Example
Find protein names in the following text:
The ZAP-70 mutant studied here could be
phosphorylated on tyrosine when associated to the
TCR zeta chain and was able to bind p56(lck).
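A dictionary-based recognizer, one of the simplest approaches to this task, can be sketched as below. The three-entry dictionary is built by hand from the example sentence purely for illustration, and `find_entities` is a made-up name; a practical system would draw on a curated protein-name resource and handle spelling variants.

```python
import re


def find_entities(text, dictionary):
    """Return sorted (start, end, name) hits for dictionary names.

    Longer names are matched first so that a short name cannot claim
    a span already covered by a longer one.
    """
    taken = []
    hits = []
    for name in sorted(dictionary, key=len, reverse=True):
        for match in re.finditer(re.escape(name), text):
            start, end = match.start(), match.end()
            if all(end <= s or start >= e for s, e in taken):
                taken.append((start, end))
                hits.append((start, end, name))
    return sorted(hits)


text = ("The ZAP-70 mutant studied here could be phosphorylated on tyrosine "
        "when associated to the TCR zeta chain and was able to bind p56(lck).")
# Hand-made dictionary for this one sentence (assumption).
dictionary = {"ZAP-70", "TCR zeta chain", "p56(lck)"}
hits = find_entities(text, dictionary)
for start, end, name in hits:
    print(start, end, name)
```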

55
Named Entity Disambiguation: Another Example
(Acronyms)

Example
PI can be an abbreviation of the following named
entities:
glutathione transferase
Permeability Index
alpha-1-antitrypsin
Without the full name written in the text, it is hard to
disambiguate the meaning of PI.

56
Named Entity Relation Recognition

Problem definition
Given a text with named entity annotations, find
the relations among these named entities.
Example
Find the protein interaction relation in the
following text:
The ZAP-70 mutant studied here could be
phosphorylated on tyrosine when associated to the
TCR zeta chain and was able to bind p56(lck).

Relation (N1, R, N2): (ZAP-70, bind, p56(lck))
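The triple format can be illustrated with a deliberately crude heuristic: take the first annotated entity in the sentence as the subject and the nearest entity after a trigger verb as the object. This rule is an assumption chosen because it reproduces the slide's example; it is not a method described in the presentation, and real systems rely on parsing or learned models. All names in the code are hypothetical.

```python
import re


def extract_relations(text, entities, triggers):
    """Return (N1, R, N2) triples from a single annotated sentence.

    Heuristic (assumption): N1 is the first entity in the sentence,
    N2 is the nearest entity that follows the trigger word R.
    """
    positions = sorted((text.find(e), e) for e in entities if e in text)
    if not positions:
        return []
    subject = positions[0][1]
    relations = []
    for trigger in triggers:
        pattern = r"\b" + re.escape(trigger) + r"\b"
        for match in re.finditer(pattern, text):
            following = [e for p, e in positions
                         if p >= match.end() and e != subject]
            if following:
                relations.append((subject, trigger, following[0]))
    return relations


sentence = ("The ZAP-70 mutant studied here could be phosphorylated on "
            "tyrosine when associated to the TCR zeta chain and was able "
            "to bind p56(lck).")
annotated = ["ZAP-70", "TCR zeta chain", "p56(lck)"]
relations = extract_relations(sentence, annotated, ["bind"])
print(relations)
```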
57
Named Entity Disambiguation

Problem definition
In database query services, the user submits a
gene symbol and the system often returns
data with the same spelling as the gene
symbol but a different meaning.
We can use different computational techniques
to handle this problem.
Example
Find the classes of the following named
entities:
ZAP-70, LIF, E1A gene, CD4
ZAP-70: protein name
LIF: protein name
E1A gene: gene name
CD4: cell type

58
Document Classification

Problem definition
Given a list of papers, find those that match
a user's interest.
Example
Paper list (PubMed IDs): 15017560, 14768007,
14581357, 12734082, 14672953, 14600132, 14581357,
12810673, 12734082
User interest: papers that discuss
the gene ANG and the disease hepatocellular
carcinoma
Answer: 14581357, 12734082

59
Computation Techniques in Literature Mining

Named Entity Recognition
Named Entity Relation Recognition

60
Named Entity Recognition

Naming Template
Dictionary based Method
BLAST match
Morphological match
Machine Learning
Decision Tree
Statistical methods
Naïve Bayesian
Hidden Markov Model (HMM)
Support Vector Machine (SVM)
Maximum Entropy (ME)

61
Machine Learning: Maximum Entropy

Given
Training corpus: GENIA 3.01 corpus
Features: indicators that distinguish the
constituent words of named entities from normal words
Morphological Features
POS Features
Semantic Trigger Features
Head-noun Features
NF-kappaB consensus site
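Morphological features of the kind listed above are typically simple binary indicators computed per token. The sketch below shows what such indicators might look like; the specific feature names and tests are assumptions for illustration, since the slide names only the feature classes, not their definitions.

```python
import re


def morphological_features(token: str) -> dict[str, bool]:
    """Binary surface-form indicators for one token (illustrative set)."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "has_upper": any(c.isupper() for c in token),
        "init_cap": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_paren": "(" in token or ")" in token,
        "has_greek_name": bool(
            re.search(r"alpha|beta|gamma|kappa", token, re.IGNORECASE)
        ),
    }


for token in "NF-kappaB consensus site".split():
    active = [name for name, on in morphological_features(token).items() if on]
    print(token, active)
```

In a maximum entropy model, each such indicator would be paired with a candidate label to form a feature function whose weight is learned from the training corpus.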