Bioinformatics For MNW 2nd Year - PowerPoint PPT Presentation

About This Presentation
Title:

Bioinformatics For MNW 2nd Year

Description:

Title: Finding Patterns in Protein Sequence and Structure Author: mathbio Last modified by: heringa Created Date: 6/9/2002 12:55:37 AM Document presentation format – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Bioinformatics For MNW 2nd Year


1
Bioinformatics For MNW 2nd Year
  • Jaap Heringa
  • FEW/FALW
  • Integrative Bioinformatics Institute VU (IBIVU)
  • heringa@cs.vu.nl

2
Current Bioinformatics Unit
  • Jens Kleinjung (1/11/02)
  • Victor Simossis PhD (1/12/02)
  • Radek Szklarczyk - PhD (1/01/03)
  • John Romein (1/12/02, Henri Bal)

3
Bioinformatics course 2nd year MNW spring 2003
  • Pattern recognition
  • Supervised/unsupervised learning
  • Types of data, data normalisation, lacking data
  • Search image
  • Similarity tables
  • Clustering
  • Principal component analysis
  • Discriminant analysis

4
Bioinformatics course 2nd year MNW spring 2003
  • Protein
  • Folding
  • Structure and function
  • Protein structure prediction
  • Secondary structure
  • Tertiary structure
  • Function
  • Post-translational modification
  • Prot.-Prot. Interaction -- Docking algorithm
  • Molecular dynamics/Monte Carlo

5
Bioinformatics course 2nd year MNW spring 2003
  • Sequence analysis
  • Pairwise alignment
  • Dynamic programming (NW, SW, shortcuts)
  • Multiple alignment
  • Combining information
  • Database/homology searching (Fasta, Blast,
    Statistical issues-E/P values)

6
Bioinformatics course 2nd year MNW spring 2003
  • Gene structure and gene finding algorithm
  • Omics
  • DNA makes RNA makes protein
  • Expression data, Nucleus to ribosome,
    translation, etc.
  • Metabolomics
  • Physiomics
  • Databases
  • DNA, EST
  • Protein sequence
  • Protein structure

7
Bioinformatics course 2nd year MNW spring 2003
  • Microarray data
  • Protein structure (PDB)
  • Proteomics
  • Mass spectrometry/NMR/X-ray?

8
Bioinformatics course 2nd year MNW spring 2003
  • Bioinformatics method development
  • IPR issues
  • Programming and scripting languages
  • Web solutions
  • Computational issues
  • NP-complete problems
  • CPU, memory, storage problems
  • Parallel computing
  • Bioinformatics method usage/application
  • Molecular viewers (RasMol, MolMol, etc.)

9
Gathering knowledge
  • Anatomy, architecture
  • Dynamics, mechanics
  • Informatics
  • (Cybernetics Wiener, 1948)
  • (Cybernetics has been defined as the science of
    control in machines and animals, and hence it
    applies to technological, animal and
    environmental systems)
  • Genomics, bioinformatics

Rembrandt, 1632
Newton, 1726
10
Bioinformatics
Chemistry
Biology Molecular biology
Mathematics Statistics
Bioinformatics
Computer Science Informatics
Medicine
Physics
11
Bioinformatics
  • Studying informational processes in biological
    systems (Hogeweg, early 1970s)
  • No computers necessary
  • Back of envelope OK

Information technology applied to the management
and analysis of biological data (Attwood and
Parry-Smith)
Applying algorithms with mathematical formalisms
in biology (genomics) -- USA
12
Bioinformatics in the olden days
  • Close to Molecular Biology
  • (Statistical) analysis of protein and nucleotide
    structure
  • Protein folding problem
  • Protein-protein and protein-nucleotide
    interaction
  • Many essential methods were created early on (BG
    era)
  • Protein sequence analysis (pairwise and multiple
    alignment)
  • Protein structure prediction (secondary, tertiary
    structure)

13
Bioinformatics in the olden days (Cont.)
  • Evolution was studied and methods created
  • Phylogenetic reconstruction (clustering, NJ
    method)

14
The Human Genome -- 26 June 2000
15
The Human Genome -- 26 June 2000
Dr. Craig Venter Celera Genomics -- Shotgun method
Sir John Sulston Human Genome Project
16
Human DNA
  • There are about 3bn (3 × 10^9) nucleotides in the
    nucleus of almost all of the trillions (3.5 ×
    10^12) of cells of a human body (an exception is,
    for example, red blood cells, which have no
    nucleus and therefore no DNA): a total of about
    10^22 nucleotides!
  • Many DNA regions code for proteins, and are
    called genes (1 gene codes for 1 protein in
    principle)
  • Human DNA contains 30,000 expressed genes
  • Deoxyribonucleic acid (DNA) comprises 4 different
    types of nucleotides: adenine (A), thymine (T),
    cytosine (C) and guanine (G). These nucleotides
    are sometimes also called bases

17
Human DNA (Cont.)
  • All people are different, but the DNA of
    different people only varies by 0.2% or less.
    So, only 2 letters in 1000 are expected to be
    different. Over the whole genome, this means that
    about 3 million letters would differ between
    individuals.
  • The structure of DNA is the so-called double
    helix, discovered by Watson and Crick in 1953,
    where the two strands are cross-linked by A-T and
    C-G base pairs (nucleotide pairs; so-called
    Watson-Crick base pairing).

18
Up to here 3/2 10.45-12.30
19
DNA compositional biases
  • Base composition of genomes
  • E. coli: 25% A, 25% C, 25% G, 25% T
  • P. falciparum (malaria parasite): 82% A+T
  • Translation initiation
  • ATG is the near universal motif indicating the
    start of translation in DNA coding sequence.

20
Some facts about human genes
  • Comprise about 3% of the genome
  • Average gene length 8,000 bp
  • Average of 5-6 exons/gene
  • Average exon length 200 bp
  • Average intron length 2,000 bp
  • 8% of genes have a single exon
  • Some exons can be as small as 1 or 3 bp.
  • HUMFMR1S is not atypical: 17 exons 40-60 bp long,
    comprising 3% of a 67,000 bp gene

21
Genetic diseases
  • Many diseases run in families and are a result of
    genes which predispose such family members to
    these illnesses
  • Examples are Alzheimer's disease, cystic fibrosis
    (CF), breast or colon cancer, or heart diseases.
  • Some of these diseases can be caused by a problem
    within a single gene, such as with CF.

22
Genetic diseases (Cont.)
  • For other illnesses, like heart disease, at least
    20-30 genes are thought to play a part, and it is
    still unknown which combination of problems
    within which genes are responsible.
  • A problem within a gene means that a single
    nucleotide, or a combination of nucleotides,
    within the gene causes the disease (or prevents
    the body from fighting the disease
    sufficiently).
  • Persons with different combinations of these
    nucleotides could then be unaffected by these
    diseases.

23
Genetic diseases (Cont.): Cystic Fibrosis
  • Known since very early on (Celtic gene)
  • Inherited autosomal recessive condition (Chr. 7)
  • Symptoms
  • Clogging and infection of lungs (early death)
  • Intestinal obstruction
  • Reduced fertility and (male) anatomical anomalies
  • CF gene CFTR has a 3-bp deletion leading to Del508
    (Phe) in the 1480 aa protein (epithelial Cl-
    channel); the protein is degraded in the ER
    instead of inserted into the cell membrane

24
Genomic Data Sources
  • DNA/protein sequence
  • Expression (microarray)
  • Proteome (X-ray, NMR, mass spectrometry)
  • Metabolome
  • Physiome (spatial, temporal)

Integrative bioinformatics
25
Genomic Data Sources Vertical Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion Integrative Bioinformatics
Genomics VU
26
A gene codes for a protein
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
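
The three lines above (DNA, mRNA, protein) can be reproduced with a short sketch. The codon table below is an assumption made for brevity: it holds only the seven codons that occur in this example, whereas the full genetic code has 64 entries. The function names are illustrative.

```python
# Partial codon table: only the codons needed for the slide's example
# (an assumption for brevity; the full genetic code has 64 codons).
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna: str) -> str:
    """Return the mRNA matching the DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon; codons missing from the table become 'X'."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    codons = (mrna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```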
27
Humans have spliced genes
28
DNA makes RNA makes Protein
29
Remark
  • The problem of identifying (annotating) human
    genes is considerably harder than the early
    success story for β-globin might suggest.
  • The human factor VIII gene (whose mutations cause
    hemophilia A) is spread over 186,000 bp. It
    consists of 26 exons ranging in size from 69 to
    3,106 bp, and its 25 introns range in size from
    207 to 32,400 bp. The complete gene is thus 9 kb
    of exon and 177 kb of intron.
  • The biggest human gene yet is for dystrophin. It
    has > 30 exons and is spread over 2.4
    million bp.

30
DNA makes RNA makes Protein: Expression data
  • More copies of mRNA for a gene lead to more
    protein
  • mRNA can now be measured for all the genes in a
    cell at once through microarray technology
  • Can have 60,000 spots (genes) on a single gene
    chip
  • Colour change gives intensity of gene expression
    (over- or under-expression)

31
(No Transcript)
32
Metabolic networks: Glycolysis and
Gluconeogenesis
KEGG database (Japan)
33
High-throughput Biological Data
  • Enormous amounts of biological data are being
    generated by high-throughput capabilities; even
    more are coming
  • genomic sequences
  • gene expression data
  • mass spec. data
  • protein-protein interaction
  • protein structures
  • ......

34
Protein structural data explosion
Protein Data Bank (PDB): 14500 structures (6
March 2001): 10900 X-ray crystallography, 1810
NMR, 278 theoretical models, others...
35
Dickerson's formula (equivalent to Moore's law):
n = e^{0.19(y-1960)}, with y the year.
On 27 March 2001 there were 12,123 3D protein
structures in the PDB; Dickerson's formula
predicts 12,066 (within 0.5%)!
36
Sequence versus structural data
  • Despite structural genomics efforts, growth of
    PDB slowed down in 2001-2002 (i.e. did not keep up
    with Dickerson's formula)
  • More than 100 completely sequenced genomes
  • Increasing gap between structural and sequence
    data

37
Bioinformatics
Bioinformatics
[Figure: fields arranged from large, external
(integrative) to small, internal (individual),
under the headings Science and Human: Planetary
Science, Cultural Anthropology, Population Biology,
Sociology, Sociobiology, Psychology, Systems
Biology, Biology, Medicine, Molecular Biology,
Chemistry, Physics.]
38
Bioinformatics
  • Offers an ever more essential input to
  • Molecular Biology
  • Pharmacology (drug design)
  • Agriculture
  • Biotechnology
  • Clinical medicine
  • Anthropology
  • Forensic science
  • Chemical industries (detergent industries, etc.)

39
High-throughput Biological Data: The data deluge
  • Hidden in these data is information that reflects
    existence, organization, activity, functionality
    of biological machineries at different levels
    in living organisms

Most effectively utilising this information will
prove to be essential for Integrative
Bioinformatics
40
Data Issues
  • Data collection: getting the data
  • Data representation: data standards, data
    normalisation ..
  • Data organisation and storage: database issues
    ..
  • Data analysis and data mining: discovering
    knowledge, patterns/signals, from data,
    establishing associations among data patterns
  • Data utilisation and application: from data
    patterns/signals to models for bio-machineries
  • Data visualization: viewing complex data
  • Data transmission: data collection, retrieval,
    ..

41
Up to here 5/2
42
Bioinformatics
  • Nothing in Biology makes sense except in the
    light of evolution (Theodosius Dobzhansky
    (1900-1975))
  • Nothing in bioinformatics makes sense except in
    the light of Biology

43
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}
2 sequences of 300 a.a.: ~10^88 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
44
Dynamic programming: Scoring alignments
S_{a,b} = \sum s(a_i, b_j) - \sum gp(k)
gp(k) = p_i + k \cdot p_e   (affine gap penalties)
p_i and p_e are the penalties for gap
initialisation and extension, respectively
45
Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20 × 20)
Gap penalties (open, extension): 10, 1
Score = s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px
        + s(L,I) + s(K,K)
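
A minimal sketch of how such an alignment score can be computed, assuming affine gap penalties of the form open + k × extension. The exchange function here is a toy stand-in (identity +5, anything else -1), not a real amino acid exchange matrix, and the function names are illustrative.

```python
def toy_exchange(a: str, b: str) -> int:
    """Toy residue exchange score (assumption): +5 for identity, -1 otherwise."""
    return 5 if a == b else -1

def score_alignment(row_a, row_b, exchange, p_open, p_ext):
    """Sum exchange scores of aligned pairs minus an affine penalty per gap run."""
    score = 0
    gap_len, gap_row = 0, None
    for a, b in zip(row_a, row_b):
        row = "a" if a == "-" else "b" if b == "-" else None
        if row != gap_row and gap_len:
            # A gap run has just ended: charge open + length * extension.
            score -= p_open + gap_len * p_ext
            gap_len = 0
        gap_row = row
        if row is None:
            score += exchange(a, b)
        else:
            gap_len += 1
    if gap_len:  # gap run reaching the end of the alignment
        score -= p_open + gap_len * p_ext
    return score

# The slide's example alignment with open = 10 and extension = 1.
total = score_alignment("TDWVTALK", "TDWL--IK", toy_exchange, p_open=10, p_ext=1)
print(total)  # 4 identities (20) + 2 mismatches (-2) - gap penalty (12) = 6
```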
46
Pairwise sequence alignment Global dynamic
programming
MDAGSTVILCFVG
Evolution
M D A A S T I L C G S
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG-
MDAAST-ILC--GS
47
Global dynamic programming
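S_{i,j} = s_{i,j} + \max \begin{cases}
  \max_{0<x<i-1} [ S_{x,j-1} - P_i - (i-x-1) P_x ] \\
  S_{i-1,j-1} \\
  \max_{0<y<j-1} [ S_{i-1,y} - P_i - (j-y-1) P_x ]
\end{cases}
The maxima are taken over column j-1 and row i-1
of the search matrix.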
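
A minimal sketch of global alignment scoring by dynamic programming. It is deliberately simplified: it uses a linear gap penalty (each gap position costs the same) rather than the affine open/extension scheme of the recurrence, and a toy match/mismatch score instead of an amino acid exchange matrix. All parameter values are assumptions.

```python
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=1) -> int:
    """Needleman-Wunsch style score with a linear gap penalty (simplification)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised in a global alignment.
    for i in range(1, rows):
        S[i][0] = -i * gap
    for j in range(1, cols):
        S[0][j] = -j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s_ij = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s_ij,  # align a[i-1] with b[j-1]
                S[i - 1][j] - gap,       # gap in b
                S[i][j - 1] - gap,       # gap in a
            )
    return S[-1][-1]

# Usage with the two sequences from the global alignment slide.
print(global_alignment_score("MDAGSTVILCFVG", "MDAASTILCGS"))
```

Filling the matrix takes time proportional to the product of the sequence lengths, which is what makes the search tractable despite the combinatorial number of possible alignments.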
48
Global dynamic programming
49
Global dynamic programming
50
Up to here 17/02/03
51
Local dynamic programming (Smith Waterman,
1981)
LCFVMLAGSTVIVGTR
E D A S T I L C G S
Negative numbers
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open, extension)
AGSTVIVG
A-STILCG
52
Local dynamic programming (Smith Waterman,
1981)
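S_{i,j} = \max \begin{cases}
  s_{i,j} + \max_{0<x<i-1} [ S_{x,j-1} - P_i - (i-x-1) P_x ] \\
  s_{i,j} + S_{i-1,j-1} \\
  s_{i,j} + \max_{0<y<j-1} [ S_{i-1,y} - P_i - (j-y-1) P_x ] \\
  0
\end{cases}
The maxima are taken over column j-1 and row i-1
of the search matrix.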
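
A minimal sketch of local alignment scoring in the Smith-Waterman style. As a simplification it uses a linear gap penalty and a toy match/mismatch score (both assumptions) instead of affine penalties and an exchange matrix. The key difference from the global case is the extra 0 in the maximum, which lets an alignment restart anywhere.

```python
def local_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=1) -> int:
    """Best local alignment score with a linear gap penalty (simplification)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s_ij = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + s_ij,   # extend along the diagonal
                prev[j] - gap,        # gap in b
                curr[j - 1] - gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

# Usage with the two sequences from the local alignment slide.
print(local_alignment_score("LCFVMLAGSTVIVGTR", "EDASTILCGS"))
```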
53
Local dynamic programming
54
Sequence database searching Homology searching
  • DP too slow for repeated database searches
  • FASTA
  • BLAST and PSI-BLAST
  • QUEST
  • HMMER
  • SAM-T98

Fast heuristics
Hidden Markov modelling
55
FASTA
  • Compares a given query sequence with a library of
    sequences and calculates for each pair the
    highest scoring local alignment
  • Speed is obtained by delaying application of the
    dynamic programming technique to the moment where
    the most similar segments are already identified
    by faster and less sensitive techniques
  • FASTA routine operates in four steps

56
FASTA
  • Operates in four steps
  • Rapid searches for identical words of a user
    specified length occurring in query and database
    sequence(s) (Wilbur and Lipman, 1983, 1984). For
    each target sequence the 10 regions with the
    highest density of ungapped common words are
    determined.
  • These 10 regions are rescored using Dayhoff
    PAM-250 residue exchange matrix (Dayhoff et al.,
    1983) and the best scoring region of the 10 is
    reported under init1 in the FASTA output.
  • Regions scoring higher than a threshold value and
    being sufficiently near each other in the
    sequence are joined, now allowing gaps. The
    highest score of these new fragments can be found
    under initn in the FASTA output.
  • A full dynamic programming alignment (Chao et
    al., 1992) is performed over the final region,
    which is widened by 32 residues at either side;
    its score is written under opt in the FASTA
    output.

57
FASTA output example
DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM
FACTOR 1). . . .
SCORES Init1: 161 Initn: 161 Opt: 162 z-score: 229.5 E(): 3.4e-06
Smith-Waterman score: 162; 35.1% identity in 57 aa overlap
                                              10        20        30
test.seq                              MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLE
YCFI_YEAST CASILLLEALPKKPLMPHQHIHQTLTRRKPNPYDSANIFSRITFSWMSGLMKTGYEKYLV
               180       190       200       210       220       230

                40        50        60
test.seq   LSDIYQIPSVDSADNLSEKLEREWDRE
YCFI_YEAST EADLYKLPRNFSSEELSQKLEKNWENELKQKSNPSLSWAICRTFGSKMLLAAFFKAIHDV
               240       250       260       270       280       290

58
FASTA
  • (1) Rapid identical word searches
  • Searching for k-tuples of a certain size within a
    specified bandwidth along search matrix
    diagonals.
  • For not-too-distant sequences (> 35% residue
    identity), little sensitivity is lost while speed
    is greatly increased.
  • The technique employed is known as hash coding or
    hashing: a lookup table is constructed for all
    words in the query sequence, which is then used
    to compare all encountered words in each database
    sequence.
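
The hashing step can be sketched as follows. This is an illustration of the idea only, not the FASTA implementation: a lookup table of all k-tuples in the query is built once, each database sequence is scanned against it, and hits are counted per search-matrix diagonal (query position minus target position). The function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_word_table(query: str, k: int) -> dict:
    """Map every k-tuple in the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table

def diagonal_hits(query: str, target: str, k: int) -> Counter:
    """Count identical k-tuples per diagonal (query index minus target index)."""
    table = build_word_table(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits

hits = diagonal_hits("ACGTACGT", "TTACGT", k=3)
print(hits.most_common(1))  # [(2, 3)]: diagonal 2 carries three word hits
```

Diagonals with many hits mark the regions that are worth rescoring, so the expensive dynamic programming step can be restricted to a narrow band around them.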

59
FASTA
  • The k-tuple length is user-defined and is usually
    1 or 2 for protein sequences (i.e. either the
    positions of each of the individual 20 amino
    acids or the positions of each of the 400
    possible dipeptides are located).
  • For nucleic acid sequences, the k-tuple is 5-20,
    and should be longer because short k-tuples are
    much more common due to the 4 letter alphabet of
    nucleic acids. The larger the k-tuple chosen, the
    more rapid but less thorough, a database search.

60
BLAST
  • blastp compares an amino acid query sequence
    against a protein sequence database
  • blastn compares a nucleotide query sequence
    against a nucleotide sequence database
  • blastx compares the six-frame conceptual protein
    translation products of a nucleotide query
    sequence against a protein sequence database
  • tblastn compares a protein query sequence against
    a nucleotide sequence database translated in six
    reading frames
  • tblastx compares the six-frame translations of a
    nucleotide query sequence against the six-frame
    translations of a nucleotide sequence database.

61
BLAST
  • Generates all tripeptides from a query sequence
    and derives for each of those a table of similar
    tripeptides; their number is only a fraction of
    the total number possible.
  • Quickly scans a database of protein sequences for
    ungapped regions showing high similarity, which
    are called high-scoring segment pairs (HSP),
    using the tables of similar peptides. The initial
    search is done for a word of length W that scores
    at least the threshold value T when compared to
    the query using a substitution matrix.
  • Word hits are then extended in either direction
    in an attempt to generate an alignment with a
    score exceeding the threshold of S, and as far as
    the cumulative alignment score can be increased.

62
BLAST
  • Extension of the word hits in each direction is
    halted
  • when the cumulative alignment score falls off by
    the quantity X from its maximum achieved value
  • the cumulative score goes to zero or below due to
    the accumulation of one or more negative-scoring
    residue alignments
  • upon reaching the end of either sequence
  • The T parameter is the most important for the
    speed and sensitivity of the search resulting in
    the high-scoring segment pairs
  • A Maximal-scoring Segment Pair (MSP) is defined
    as the highest scoring of all possible segment
    pairs produced from two sequences.
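
The three halting rules can be sketched as an ungapped extension routine. This is a simplified illustration, not the BLAST source: the scoring function is a toy match/mismatch score rather than a substitution matrix, and the names `extend_word_hit`, `x_drop` and `toy_score` are assumptions.

```python
def toy_score(a: str, b: str) -> int:
    """Toy residue score (assumption): +1 for identity, -1 otherwise."""
    return 1 if a == b else -1

def _extend(query, subject, qi, si, step, start_score, score_fn, x_drop):
    """Walk from (qi, si) in direction step; return (best score, residues kept)."""
    best = running = start_score
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):  # rule: sequence end
        running += score_fn(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        # rules: score at or below zero, or dropped by X from its maximum
        if running <= 0 or best - running >= x_drop:
            break
        qi += step
        si += step
    return best, best_len

def extend_word_hit(query, subject, q_pos, s_pos, w, score_fn, x_drop):
    """Extend a word hit of length w in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed = sum(score_fn(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right_score, right_len = _extend(
        query, subject, q_pos + w, s_pos + w, +1, seed, score_fn, x_drop)
    total, left_len = _extend(
        query, subject, q_pos - 1, s_pos - 1, -1, right_score, score_fn, x_drop)
    return total, q_pos - left_len, q_pos + w + right_len

hsp = extend_word_hit("TTACGTACCA", "GGACGTACGG", 2, 2, 3, toy_score, x_drop=2)
print(hsp)  # (6, 2, 8): the segment ACGTAC
```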

63
PSI-BLAST
  • Query sequences are first scanned for the
    presence of so-called low-complexity regions
    (Wootton and Federhen, 1996), i.e. regions with a
    biased composition likely to lead to spurious
    hits; these are excluded from alignment.
  • The program then initially operates on a single
    query sequence by performing a gapped BLAST
    search
  • Then, the program takes significant local
    alignments found, constructs a multiple alignment
    and abstracts a position specific scoring matrix
    (PSSM) from this alignment.
  • The database is rescanned with the PSSM in a
    subsequent round to find more homologous
    sequences. Iteration continues until the user
    decides to stop or the search has converged
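
How a position-specific scoring matrix can be abstracted from a multiple alignment is sketched below. This is a simplified illustration under stated assumptions: plain column counts with a uniform pseudocount and a uniform background frequency, converted to log-odds scores. PSI-BLAST itself uses sequence weighting and more refined pseudocounts; the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (base 2) per alignment column."""
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in ALPHABET)  # skip gaps
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score_segment(pssm, segment: str) -> float:
    """Score an ungapped segment against the PSSM, position by position."""
    return sum(column[res] for column, res in zip(pssm, segment))

pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score_segment(pssm, "ACD") > score_segment(pssm, "WWW"))  # True
```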

64
PSI-BLAST iteration
[Figure: query sequence Q -> gapped BLAST search ->
database hits -> PSSM (scores per position for A,
C, D, .., Y, with gap penalties Pi, Px) -> gapped
BLAST search with the PSSM -> database hits, and
so on.]
65
PSI-BLAST output example
66
Multiple alignment profiles (Gribskov et al., 1987)
[Figure: profile position i holds one score per
amino acid (A: 0.3, C: 0.1, D: 0, ..., W: 0.3,
Y: 0.3) plus position-dependent gap penalties
(0.5 and 1.0 in the example).]
67
Normalised sequence similarity
The p-value is defined as the probability of
seeing at least one unrelated score S greater
than or equal to a given score x in a database
search over n sequences. This probability
follows the Poisson distribution (Waterman and
Vingron, 1994):
P(x, n) = 1 - e^{-n \cdot P(S \geq x)},
where n is the number of sequences in the
database. It depends on x and n (fixed).
68
Normalised sequence similarity: Statistical
significance
The E-value is defined as the expected number of
non-homologous sequences with score greater than
or equal to a score x in a database of n
sequences:
E(x, n) = n \cdot P(S \geq x)
If the E-value is 0.01, then the expected number
of random hits with score S ≥ x is 0.01, which
means that a hit with this score is expected by
chance only once in 100 independent searches over
the database. If the E-value of a hit is 5, then
five fortuitous hits with S ≥ x are expected
within a single database search, which renders
the hit not significant.
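
The two definitions, E(x, n) = n·P(S ≥ x) and P(x, n) = 1 - e^{-n·P(S ≥ x)}, can be compared numerically with a small sketch. The per-sequence probability and database size below are made-up illustrative values, not taken from any real search.

```python
import math

def e_value(p_single: float, n: int) -> float:
    """Expected number of chance hits: E = n * P(S >= x)."""
    return n * p_single

def p_value(p_single: float, n: int) -> float:
    """Probability of at least one chance hit: P = 1 - exp(-n * P(S >= x))."""
    return -math.expm1(-n * p_single)  # numerically stable form of 1 - exp(-E)

n_sequences = 100_000  # hypothetical database size
for p_single in (1e-7, 5e-5):  # hypothetical per-sequence probabilities
    e = e_value(p_single, n_sequences)
    p = p_value(p_single, n_sequences)
    print(f"E = {e:.3g}, P = {p:.3g}")
```

For small E the two quantities are nearly equal (E = 0.01 gives P of about 0.00995), while for E = 5 the p-value saturates just below 1.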
69
Normalised sequence similarity: Statistical
significance
  • Database searching is commonly performed using an
    E-value in between 0.1 and 0.001.
  • Low E-values decrease the number of false
    positives in a database search, but increase the
    number of false negatives, thereby lowering the
    sensitivity of the search.

70
HMM-based homology searching
  • Most widely used HMM-based profile searching
    tools currently are SAM-T98 (Karplus et al.,
    1998) and HMMER2 (Eddy, 1998)
  • formal probabilistic basis and consistent theory
    behind gap and insertion scores
  • HMMs good for profile searches, bad for alignment
  • HMMs are slow

71
The HMM algorithms
  • Questions
  • What is the most likely die (predicted) sequence?
    Viterbi
  • What is the probability of the observed sequence?
    Forward
  • What is the probability that the 3rd state is B,
    given the observed sequence? Forward and Backward
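
The first two questions can be illustrated with a two-state dice model in which a fair die (F) and a loaded die (L) are swapped occasionally. All probabilities below are made-up assumptions for illustration. The sketch works with raw probabilities, so it is only suitable for short sequences; longer ones would need logarithms or scaling to avoid underflow.

```python
STATES = ("F", "L")  # fair die, loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}

def forward(obs: str) -> float:
    """Probability of the observed rolls, summed over all state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        f = {s: EMIT[s][x] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

def viterbi(obs: str) -> str:
    """Most likely sequence of dice (states) for the observed rolls."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    pointers = []
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            ptr[s] = prev
            nxt[s] = v[prev] * TRANS[prev][s] * EMIT[s][x]
        pointers.append(ptr)
        v = nxt
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(pointers):  # trace back from the last position
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("666666"))  # LLLLLL
print(viterbi("123123"))  # FFFFFF
```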

72
HMM-based homology searching
Transition probabilities and emission
probabilities. Gapped HMMs also have insertion
and deletion states.
73
Profile HMM: m = match state, I = insert state,
d = delete state; go from left to right. I and m
states output amino acids; d states are silent.
74
Homology-derived Secondary Structure of Proteins
(HSSP) (Sander & Schneider, 1991)
75
Up to here 17/02/03
76
Bio-Data Analysis and Data Mining
  • Existing/emerging bio-data analysis and mining
    tools for
  • DNA sequence assembly
  • Genetic map construction
  • Sequence comparison and database searching
  • Gene finding
  • .
  • Gene expression data analysis
  • Phylogenetic tree analysis to infer
    horizontally-transferred genes
  • Mass spec. data analysis for protein complex
    characterization
  • Current mode of work

Often enough developing ad hoc tools for each
individual application
77
Bio-Data Analysis and Data Mining
  • As the amount and types of data and their cross
    connections increase rapidly
  • the number of analysis tools needed will go up
    exponentially
  • blast, blastp, blastx, blastn, from BLAST
    family of tools
  • gene finding tools for human, mouse, fly, rice,
    cyanobacteria, ..
  • tools for finding various signals in genomic
    sequences, protein-binding sites, splice junction
    sites, translation start sites, ..

78
Bio-Data Analysis and Data Mining
Many of these data analysis problems are
fundamentally the same problem(s) and can be
solved using the same set of tools e.g.
clustering or optimal segmentation by Dynamic
Programming
Developing ad hoc tools for each application (by
each group of individual researchers) may soon
become inadequate as bio-data production
capabilities further ramp up
79
Bio-data Analysis, Data Mining and Integrative
Bioinformatics
To have analysis capabilities covering a wide
range of problems, we need to discover the common
fundamental structures of these problems. HOWEVER:
in biology one size does NOT fit all.
The goal is development of a data analysis
infrastructure in support of Genomics and beyond.
80
Algorithms in bioinformatics
  • string algorithms
  • dynamic programming
  • machine learning (NN, k-NN, SVM, GA, ..)
  • Markov chain models
  • hidden Markov models
  • Markov Chain Monte Carlo (MCMC) algorithms
  • stochastic context free grammars
  • EM algorithms
  • Gibbs sampling
  • clustering
  • tree algorithms
  • text analysis
  • hybrid/combinatorial techniques and more

81
Sequence analysis and homology searching
82
Finding genes and regulatory elements
83
Expression data
84
Functional genomics
Monte Carlo
85
Protein translation
86
Example of algorithm reuse: Data clustering
  • Many biological data analysis problems can be
    formulated as clustering problems
  • microarray gene expression data analysis
  • identification of regulatory binding sites
    (similarly, splice junction sites, translation
    start sites, ......)
  • (yeast) two-hybrid data analysis (for inference
    of protein complexes)
  • phylogenetic tree clustering (for inference of
    horizontally transferred genes)
  • protein domain identification
  • identification of structural motifs
  • prediction reliability assessment of protein
    structures
  • NMR peak assignments
  • ......

87
Data Clustering Problems
  • Clustering: partition a data set into clusters so
    that data points of the same cluster are
    similar and points of different clusters are
    dissimilar
  • cluster identification -- identifying clusters
    with significantly different features than the
    background

88
Application Examples
  • Regulatory binding site identification: CRP (CAP)
    binding site
  • Two hybrid data analysis
  • Gene expression data analysis

Are all solvable by the same algorithm!
89
Other Application Examples
  • Phylogenetic tree clustering analysis
  • Protein sidechain packing prediction
  • Assessment of prediction reliability of protein
    structures
  • Protein secondary structures
  • Protein domain prediction
  • NMR peak assignments

90
Integrative bioinformatics @ VU
  • Studying informational processes at biological
    system level
  • From gene sequence to intercellular processes
  • Computers necessary
  • We have biology, statistics, computational
    intelligence (AI), HTC, ..
  • VUMC microarray facility
  • Enabling technology: new glue to integrate
  • New integrative algorithms
  • Goals: understanding cells in terms of genomes,
    fighting disease (VUMC)

91
Bioinformatics @ VU
  • Progression
  • DNA: gene prediction, predicting regulatory
    elements
  • mRNA expression
  • Proteins: docking, domain prediction
  • Metabolic pathways: metabolic control
  • Cell-cell communication

92
(No Transcript)
93
Protein structure and function can be complex
Pyruvate kinase (phosphotransferase)
β barrel regulatory domain; α/β barrel catalytic
substrate-binding domain; α/β nucleotide-binding
domain
1 continuous + 2 discontinuous domains
94
Bioinformatics @ VU
  • Qualitative challenges
  • High quality alignments (alternative splicing)
  • In-silico structural genomics
  • In-silico functional genomics: reliable
    annotation
  • Protein-protein interactions.
  • Metabolic pathways: assign the edges in the
    networks
  • Cell-cell communication: find membrane-associated
    components
  • New algorithms

95
Bioinformatics @ VU
  • Quantitative challenges
  • Understanding mRNA expression levels
  • Understanding resulting protein activity
  • Time dependencies
  • Spatial constraints, compartmentalisation
  • Are classical differential equation models
    adequate or do we need more individual modeling
    (e.g. macromolecular crowding and activity at
    oligomolecular level)?
  • Metabolic pathways: calculate fluxes through time
  • Cell-cell communication: tissues, hormones,
    innervations

Need complete experimental data for good
biological model system to learn to integrate
96
Bioinformatics @ VU
  • VUMC
  • Neuropeptide addiction
  • Oncogenes: disease patterns
  • Rheumatic disease
  • CNCR
  • From synapses to higher order behaviour
  • Addiction
  • FPP
  • Genetic psychology: twin data bank

97
Integrative Genomics
98
Recurrent theme: Integration from molecule to
health
Leiden-VU-TNO (Centre for Medical Systems Biology)
CRCS
VUMC
99
genome
transcriptome
proteome
metabolome
physiome
100
Integrative bioinformatics
  • Calculate from sequence to molecular behaviour
  • Calculate from molecular behaviour and
    interactions to cells
  • Calculate from cellular interactions to tissues
  • Calculate from tissue to organism
  • Calculate from organisms to ecosystem and society
  • Do this in conjunction with data analysis at all
    levels
  • AND CALCULATE BACK (induction)

101
Bioinformatics @ VU
  • Quantitative challenges
  • How much protein produced from single gene?
  • What time dependencies?
  • What spatial constraints (compartmentalisation)?
  • Metabolic pathways: assign the edges in the
    networks
  • Cell-cell communication: find membrane-associated
    components

102
Integrative bioinformatics
  • Integrate data sources
  • Integrate methods
  • Integrate data through method integration
    (biological model)

103
Bioinformatics tool
Algorithm
Data
tool
Biological Interpretation (model)
104
Bioinformatics
  • Nothing in Biology makes sense except in the
    light of evolution (Theodosius Dobzhansky
    (1900-1975))
  • Nothing in Bioinformatics makes sense except in
    the light of Biology

105
Pair-wise sequence alignment (more than just
string matching)
Global dynamic programming
MDAGSTVILCFVG
Evolution
M D A A S T I L C G S
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG-
MDAAST-ILC--GS
106
Pair-wise alignment search explosions
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}
2 sequences of 300 a.a.: ~10^88 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
107
Global dynamic programming
108
This talk: own kitchen
  • Three integrative methods to predict protein
    structural aspects
  • Iterative multiple alignment protein secondary
    structure (Praline)
  • Intermezzo 2½-D structure prediction of
    flavodoxin fold by hand
  • Protein domain delineation based on consistency
    of multiple ab initio model tertiary structures
    (SnapDRAGON)
  • Protein domain delineation based on combining
    homology searching with domain prediction
    (Domaination)

109
Comparing sequences - Similarity Score -
  • Many properties can be used
  • Nucleotide or amino acid composition
  • Isoelectric point
  • Molecular weight
  • Morphological characters

110
Multivariate statistics Cluster analysis
[Figure: raw table (objects 1-5 against characters
C1, C2, C3, C4, C5, C6, ..) -> similarity criterion
-> similarity matrix (5 × 5 scores) -> cluster
criterion -> phylogenetic tree.]
111
Human Evolution
112
Comparing sequences - Similarity Score -
  • Many properties can be used
  • Nucleotide or amino acid composition
  • Isoelectric point
  • Molecular weight
  • Morphological characters
  • But molecular evolution through sequence
    alignment

113
Multivariate statistics Cluster analysis
[Figure: multiple alignment (sequences 1-5) ->
similarity criterion -> similarity matrix (5 × 5
scores) -> phylogenetic tree.]
114
Lactate dehydrogenase multiple alignment
Distance Matrix
                     1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human         0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken       0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish       0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey       0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley        0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maizey        0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei   0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant   0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari   0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido        0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua  0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma    0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000
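
One way to turn such a distance matrix into a tree is average-linkage clustering (UPGMA). The sketch below applies it to the first six species of the matrix; UPGMA is chosen here as a simple illustration and is not necessarily the cluster criterion used in the lecture. The function name and the returned merge list are assumptions.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = {i: (name,) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), d_ij = min(d.items(), key=lambda item: item[1])  # closest pair
        n_i, n_j = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], d_ij))
        merged = clusters[i] + clusters[j]
        del clusters[i], clusters[j]
        # Distance to the merged cluster: size-weighted average of the two parts.
        new_d = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = merged
        next_id += 1
    return merges

names = ["Human", "Chicken", "Dogfish", "Lamprey", "Barley", "Maizey"]
dist = [
    [0.000, 0.112, 0.128, 0.202, 0.378, 0.346],
    [0.112, 0.000, 0.155, 0.214, 0.382, 0.348],
    [0.128, 0.155, 0.000, 0.196, 0.389, 0.337],
    [0.202, 0.214, 0.196, 0.000, 0.426, 0.356],
    [0.378, 0.382, 0.389, 0.426, 0.000, 0.171],
    [0.346, 0.348, 0.337, 0.356, 0.171, 0.000],
]
merges = upgma(names, dist)
for left, right, distance in merges:
    print(left, right, round(distance, 4))
```

On this subset the vertebrates and the two plants end up in separate clusters, with Human and Chicken joined first.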
115
(No Transcript)
116
Multiple sequence alignment: Why?
  • It is the most important means to assess
    relatedness of a set of sequences
  • Gain information about the structure/function of
    a query sequence (conservation patterns)
  • Construct a phylogenetic tree
  • Putting together a set of sequenced fragments
    (Fragment assembly)
  • Comparing a segment sequenced by two different
    labs
  • Many bioinformatics methods depend on it (e.g.
    secondary/tertiary structure prediction)

117
Flavodoxin fold: aligning 13 flavodoxins + cheY
5(βα) fold
118
Flavodoxin-cheY multiple alignment: Praline with pre-processing
  1fx1        -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
  FLAV_DESDE  MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
  FLAV_DESVH  MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
  FLAV_DESSA  MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
  FLAV_DESGI  MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
  2fcr        --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
  FLAV_AZOVI  -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
  FLAV_ENTAG  MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
  FLAV_ANASP  SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
  FLAV_ECOLI  -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
  4fxn        -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
  FLAV_MEGEL  MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
  FLAV_CLOAB  -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
  3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM

  1fx1        GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
  FLAV_DESDE  ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
  FLAV_DESVH  GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

119
Flavodoxin-cheY NJ tree
120
Integrating secondary structure prediction in
multiple alignment (Victor Simossis)
  • Praline multiple alignment method
    (Heringa, Comp. Chem. 23, 341-364, 1999;
    Comp. Chem. 26, 459-477, 2002;
    Kleinjung, Douglas & Heringa, Bioinformatics,
    in press, 2002)
  • Combining sequence data and secondary structure
    prediction (Heringa, Curr. Prot. Pept. Sci. 1(3),
    273-301, 2000)
  • Secondary structure methods: PHD, Predator,
    PSIPred, Jpred, SSPRED, ...

121
Using secondary structure in multiple alignment
  • Structure more conserved than sequence

122
Protein structure hierarchical levels
123
Protein structure hierarchical levels
124
Secondary structure-induced alignment
125
Using secondary structure in multiple alignment
Dynamic programming search matrix
  Sequence 1:           MDAGSTVILCFV
  Secondary structure:  HHHCCCEEEEEE
  Sequence 2:           MDAASTILCGS
  Secondary structure:  HHHHCCEEECC
Amino acid exchange weights matrices: H-H, C-C, E-E, Default
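The idea of filling the dynamic programming search matrix with exchange weights that depend on secondary structure can be sketched as follows. This is a minimal sketch under loud assumptions: the MATRICES values are made-up match/mismatch scores, not real exchange matrices, the gap penalty is linear, and the names exchange_weight and global_alignment_score are hypothetical. Only the selection rule (same state on both sides selects the H, E or C matrix, anything else selects Default) follows the slide.

```python
# Toy exchange weights per secondary structure state (assumed values).
MATRICES = {
    "H": {"match": 3, "mismatch": -1},
    "E": {"match": 4, "mismatch": -2},
    "C": {"match": 2, "mismatch": 0},
    "default": {"match": 2, "mismatch": -1},
}


def exchange_weight(a, b, ss_a, ss_b):
    """Pick the matrix by secondary structure, then score the residue pair."""
    table = MATRICES[ss_a] if ss_a == ss_b else MATRICES["default"]
    return table["match"] if a == b else table["mismatch"]


def global_alignment_score(seq1, ss1, seq2, ss2, gap=-2):
    """Needleman-Wunsch score with secondary-structure-specific weights."""
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = exchange_weight(seq1[i - 1], seq2[j - 1], ss1[i - 1], ss2[j - 1])
            F[i][j] = max(F[i - 1][j - 1] + w, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[n][m]


# The two sequences and state strings shown on the slide.
print(global_alignment_score("MDAGSTVILCFV", "HHHCCCEEEEEE",
                             "MDAASTILCGS", "HHHHCCEEECC"))
```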
126
Flavodoxin-cheY predicted secondary structure (PREDATOR)
1fx1
  -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
  e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee
FLAV_DESVH
  MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
  e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee
FLAV_DESGI
  MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
  e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee
FLAV_DESSA
  MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
  eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee
FLAV_DESDE
  MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
  eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee
2fcr
  --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
  eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee
FLAV_ANASP
  SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
  eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee
FLAV_ECOLI
  -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
  eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee
FLAV_AZOVI
  -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
  eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee
FLAV_ENTAG
  MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
  eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee
4fxn
  ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
  eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee
FLAV_MEGEL
  M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
  hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee
FLAV_CLOAB
  M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
  eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee
3chy
  ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV
  tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx1
  GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
  eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh
FLAV_DESVH
  GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
  eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh
FLAV_DESGI
  GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV--------
  eee hhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_DESSA
  GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI--------
  hhhhhhhhhhhh eeeee e eee
FLAV_DESDE
  ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
  e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh
2fcr
  GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
  eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht
FLAV_ANASP
  GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
  hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh
FLAV_ECOLI
  GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
  hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh
FLAV_AZOVI
  GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
  e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh
FLAV_ENTAG
  GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
  hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh
4fxn
  G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
  e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht
FLAV_MEGEL
  G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA---------
  hhhhhhhhhhh eeeee eeee h hhhhhhhh
FLAV_CLOAB
  STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
  hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h
3chy
  -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------
  ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht

Enough to predict 5(βα) topology
127
Secondary structure-induced alignment
128
Iteration
Convergence
Limit cycle
Divergence
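The three outcomes named on this slide can be told apart mechanically when an iterative procedure is run for a bounded number of rounds. The sketch below is an assumption-laden illustration: classify_iteration and its arguments are hypothetical names, states must be hashable (for example an alignment or prediction stored as a string), and "divergence" here simply means that no state repeated within max_iter rounds.

```python
def classify_iteration(step, start, max_iter=20):
    """Run state = step(state) repeatedly and classify the behaviour.

    Returns ("convergence", 1) if a state maps to itself,
    ("limit cycle", period) if an earlier state recurs with period > 1,
    and ("divergence", None) if nothing repeats within max_iter rounds.
    """
    seen = {start: 0}  # state -> iteration at which it was first produced
    state = start
    for k in range(1, max_iter + 1):
        state = step(state)
        if state in seen:
            period = k - seen[state]
            if period == 1:
                return ("convergence", 1)
            return ("limit cycle", period)
        seen[state] = k
    return ("divergence", None)


# Toy step functions standing in for "realign, then re-predict".
print(classify_iteration(lambda s: min(s + 1, 3), 0))  # settles at 3
print(classify_iteration(lambda s: (s + 1) % 3, 0))    # cycles 0, 1, 2
print(classify_iteration(lambda s: s + 1, 0, 10))      # never repeats
```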
129
Flavodoxin-cheY multiple alignment / secondary structure iteration: cheY SSEs
3chy-AA SEQUENCE AA   ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP
3chy-ITERATION-0 PHD  EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE
3chy-ITERATION-1 PHD  EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE
3chy-ITERATION-2 PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE
3chy-ITERATION-3 PHD  EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-4 PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE
3chy-ITERATION-5 PHD  EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-6 PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE
3chy-ITERATION-7 PHD  EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE
3chy-ITERATION-8 PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE
3chy-ITERATION-9 PHD  EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE

3chy-AA SEQUENCE AA   NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM
3chy-ITERATION-0 PHD  HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH
3chy-ITERATION-1 PHD  HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-2 PHD  HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-3 PHD  HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-4 PHD  HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-5 PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-6 PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH
3chy-ITERATION-7 PHD  HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-8 PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH
3chy-ITERATION-9 PHD  HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH
130
4fxn-AA SEQUENCE AA   MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEV
4fxn-ITERATION-0 PHD  EEEEE HHHHHHHHHHHHHHH EEE EEEEE
4fxn-ITERATION-1 PHD  EEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-2 PHD  EEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-3 PHD  EEEEE HHHHHHHHHHHHHHH E EEEEE
4fxn-ITERATION-4 PHD  EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-5 PHD  EEEEEE HHHHHHHHHHHHHHH EE EEEEE
4fxn-ITERATION-6 PHD  EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE
4fxn-ITERATION-7 PHD  EEEEEE HHHHHHHHHHHHHHH EE EEEEE
4fxn-ITERATION-8 PHD  EEEEEE HHHHHHHHHHHHHHH EEE EEEEE
4fxn-ITERATION-9 PHD  EEEEE HHHHHHHHHHHHHHH EEE EEEEE

4fxn-AA SEQUENCE AA   LEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
4fxn-ITERATION-0 PHD  EEEEE HHHHHHHHHHHHHHHHH EEE EEE
4fxn-ITERATION-1 PHD  HHHH EEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-2 PHD  HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-3 PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHH EEE EE
4fxn-ITERATION-4 PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-5 PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-6 PHD  HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-7 PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-8 PHD  HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E
4fxn-ITERATION-9 PHD  HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E

4fxn-AA SEQUENCE AA   PDEAEQDCIEFGKKIANI
4fxn-ITERATION-0 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-1 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-2 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-3 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-4 PHD  HHHHHHHHHHHH
4fxn-ITERATION-5 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-6 PHD  HHHHHHHHHHHH
4fxn-ITERATION-7 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-8 PHD  HHHHHHHHHHHHH
4fxn-ITERATION-9 PHD  HHHHHHHHHHHH
131
Predicting sec. struct. with PHD, etc.
(Figure: diagram with elements labelled A to D and 1 to 6)
132
Secondary structure prediction using MA (SymSS)
Sequence orderings: 1 2 3 4 | 2 1 3 4 | 3 1 2 4 | 4 1 2 3 | 1 1 1 1
Prediction sets (four predictions per row):
  EEEEE HHHHHH EEEEE HH | EEEE? ?HHHHH EEE H | EEEEE HHHHH? ??EE HH | EEEEEE ?HHHHH EEEE HH
  EEEEE HHHHHH EEE HH | EEEE? ?HHHHH EEE H | EEEEE HHHHH? ??EE HH | EEEEE ?HHHHH EEEE HH
  EEEEE HHHH EEE HH | EEEE? ?HHH EEE H | EEEEE HHH? ??EE HH | EEEEE HHH? EEEE HH
  EEEEE HHHHHH EEE HHHH | EEEE? ?HHHHH EEE ?HHH | EEEEE HHHHH? ??EE HHHH | EEEEE ?HHHHH EEEE HHHH
Final row:
  EEEEE HHHHH EEE H | EEEE HHHH EE HHH | EEEE HHHHH EEE H | EEEE HHH EEE HH
133
Flavodoxin-cheY
3chy      ------------ GYVVKPFTAATLEEKLNKIFEKLGM------
PHD       --------------- hhhhhhhhhhhhhh ------
13 -> 0   ee ??hhhhhhhhhhh?
13 -> 1   ee ??hhhhhhhhhhh??
13 -> 2   ee ??hhhhhhhhhhh?
13 -> 3   eee ?hhhhhhhhhhh?
13 -> 4   eee ?hhhhhhhhhhh?
13 -> 5   eee h?hhhhhhhhhhh
13 -> 6   eee hh hhhhhhhhhhh
13 -> 7   e eeeeeee hhhhhhhhhhhhh??
13 -> 8   eeeeeee hhhhhhhhhhhhh??
13 -> 9   eeeeeee hhhhhhhhhhhhh?? ?????
13 -> 10  eeeeeee hhhhhhhhhhhhh??
13 -> 11  e eeeeeeee hhhhhhhhhhhhh???
13 -> 12  eeeeeee hhhhhhhhhh
13 -> 13  hhhhhhhhhhhhhh h
DSSP      ...............EEEESS HHHHHHHHHHHHHHHT ......
134
Optimal segmentation of predicted secondary
structures
Each sequence within an alignment gives rise to
a library of n secondary structure predictions,
where n is the number of sequences in the
alignment. The predictions are recorded by
secondary structure type and region position in a
single matrix.
Sequences: 1 2 3 4
Prediction library for sequence 1:
  1->1  EEEEE HHHHHH EEEEE HH
  1->2  EEEE? ?HHHHH EEE H
  1->3  EEEEE HHHHH? ??EE HH
  1->4  EEEEEE ?HHHHH EEEE HH
Secondary structure states: C, E, H
Matrix (first five positions):
  H score  0 0 0 0 0 ...
  E score  3 4 4 4 3 ...
  C score  1 0 0 0 0 ...
  ? score  0 0 0 0 1 ...
  Region   0 1 1 1 0 ...
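Recording a library of predictions in a single per-position matrix can be sketched as a simple vote count. The function name tally_predictions is hypothetical, the four toy prediction strings are made up (the column spacing of the slide's strings is not available), and blanks are counted as coil, which is an assumption. The toy strings were chosen so that the counts reproduce the first five columns of the H, E, C and ? score rows on the slide; the Region row is not modelled.

```python
def tally_predictions(predictions):
    """Count, per position, how many predictions vote for H, E, C or ?."""
    length = len(predictions[0])
    counts = {state: [0] * length for state in "HEC?"}
    for prediction in predictions:
        for i, symbol in enumerate(prediction):
            # Assumption: anything that is not H, E or ? (e.g. a blank) is coil.
            state = symbol if symbol in ("H", "E", "?") else "C"
            counts[state][i] += 1
    return counts


# Toy library of four predictions for one sequence (hypothetical strings).
library = ["EEEEE", "EEEE?", "EEEEE", " EEEE"]
matrix = tally_predictions(library)
for state in "HEC?":
    print(state, "score", *matrix[state])
```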
135
Optimal segmentation of predicted secondary
structures by Dynamic Programming
The recorded values are used in a weighted
function according to their secondary structure
type, which gives each position a window-specific
score. The more probable the secondary structure
element, the higher the score.
Restrictions: H only if ws > 4; E only if ws > 2
(ws = window size)
(Figure: matrix rows H score, E score, C score, ? score and Region; dynamic programming over sequence position and window size, where the segmentation score is the total score of each path. Annotations in the figure: Max score, Offset, Label, with the values 2, 6, 5 and H.)
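The segmentation step can be sketched as a dynamic programme over sequence positions, where each segment is a window of one state and the path score is the sum of the window scores. This is a simplified sketch, not the weighted function from the slide: window scores are plain sums of per-position scores, the restrictions are read as a minimum window of 5 for H and 3 for E (ws > 4 and ws > 2), coil may have any length, and the names segment and MIN_LEN and the toy scores are hypothetical.

```python
MIN_LEN = {"H": 5, "E": 3, "C": 1}  # assumed reading of "ws > 4" and "ws > 2"


def segment(scores, min_len=MIN_LEN):
    """Return (label string, total score) of the best-scoring segmentation."""
    n = len(next(iter(scores.values())))
    # Prefix sums give the score of any window in O(1).
    prefix = {}
    for label, values in scores.items():
        sums = [0]
        for v in values:
            sums.append(sums[-1] + v)
        prefix[label] = sums

    best = [0] + [float("-inf")] * n  # best[i]: best score for positions 0..i-1
    back = [None] * (n + 1)           # back[i]: (segment start, label)
    for i in range(1, n + 1):
        for label in scores:
            for w in range(min_len[label], i + 1):
                cand = best[i - w] + prefix[label][i] - prefix[label][i - w]
                if cand > best[i]:
                    best[i] = cand
                    back[i] = (i - w, label)

    # Trace the best path back from the end of the sequence.
    pieces = []
    i = n
    while i > 0:
        start, label = back[i]
        pieces.append(label * (i - start))
        i = start
    return "".join(reversed(pieces)), best[n]


toy_scores = {
    "H": [0, 0, 0, 0, 0, 4, 4, 0, 0, 0],
    "E": [0, 4, 4, 4, 0, 0, 0, 0, 0, 0],
    "C": [4, 0, 0, 0, 4, 1, 1, 4, 4, 4],
}
labels, total = segment(toy_scores)
print(labels, total)
```

In the toy data the two high H scores at positions 5 and 6 are too short to form a helix window, so the best path labels them as coil, while the three-residue strand is kept.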
136
Example of an optimally segmented secondary
structure prediction library for sequence 3chy
3chy                ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
3chy <- 1fx1        ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE  ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH  ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI  ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA  ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn        ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL  ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr        e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP  ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI  eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI  ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG  e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB  eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy        --------------- ----- hhhhhhhhhhhhhh ------
Consensus           ---------------EEEE----- HHHHHHHHHHHHH ------
Consensus-DSSP      ....................xx..... .
PHD                 --------------- ----- HHHHHHHHHHHHHH ------
PHD-DSSP            ...............xxxx..... x......
DSSP                ...............EEEE.....SS HHHHHHHHHHHHHHHT ......
LumpDSSP            ...............EEEE..... HHHHHHHHHHHHHHH ......
137
What to do with a multiple alignment?
  • Use it to eyeball and detect structural/functional
    features
  • Use it to make a profile and search a database
    for homologs
  • Give it to other bioinformatics methods and
    predict secondary structure, functional residues,
    correlated mutations, phylogenetic trees, etc.
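Making a profile from a multiple alignment and using it to scan a sequence can be sketched as below. This is a deliberately simple stand-in, assuming a gap-free sliding window and a score that just sums column frequencies; real profile searches use log-odds scores, pseudocounts and gap handling. The names build_profile, profile_score and best_hit, and the toy alignment, are hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column residue frequencies; gap characters are not counted."""
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        profile.append({res: c / n_seqs for res, c in counts.items()})
    return profile


def profile_score(profile, window):
    """Sum of the profile frequencies of the residues in the window."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, window))


def best_hit(profile, sequence):
    """Slide the profile along a sequence; return (best score, start).

    Assumes the sequence is at least as long as the profile.
    """
    width = len(profile)
    return max(
        (profile_score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


profile = build_profile(["ACD", "ACE", "AGD"])
score, start = best_hit(profile, "KKACDKK")
print(start, round(score, 3))
```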

138
Rules of thumb when looking at a multiple
alignment (MA)
  • Hydrophobic residues are internal
  • Gly (Thr, Ser) in loops
  • MA: hydrophobic block -> internal β-strand
  • MA: alternating (1-1) hydrophobic/hydrophilic ->
    edge β-strand
  • MA: alternating 2-2 (or 3-1) periodicity ->
    α-helix
  • MA: gaps in loops
  • MA: conserved column -> functional? -> active
    site
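Some of these rules of thumb can be made concrete by reducing each alignment column to "mostly hydrophobic" or not and by listing fully conserved columns. The sketch below rests on assumptions: the HYDROPHOBIC set is one simplified grouping among many, the 0.5 threshold is arbitrary, and the function names and toy alignments are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumption: a simplified hydrophobic grouping


def hydrophobic_pattern(alignment, threshold=0.5):
    """One character per column: '1' if mostly hydrophobic, else '0'."""
    pattern = []
    for column in zip(*alignment):
        residues = [r.upper() for r in column if r != "-"]
        if residues:
            fraction = sum(r in HYDROPHOBIC for r in residues) / len(residues)
        else:
            fraction = 0.0  # all-gap column
        pattern.append("1" if fraction >= threshold else "0")
    return "".join(pattern)


def conserved_columns(alignment):
    """Indices of gap-free columns that contain a single residue type."""
    return [
        i
        for i, column in enumerate(zip(*alignment))
        if "-" not in column and len({r.upper() for r in column}) == 1
    ]


block = ["VIVLK", "LIVAK", "VLIVK"]        # hydrophobic block, conserved K
alternating = ["VKVKV", "IRLEI", "LKIDL"]  # 1-1 alternation
print(hydrophobic_pattern(block), conserved_columns(block))
print(hydrophobic_pattern(alternating))
```

A run such as 1111 would then hint at an internal strand and a pattern such as 10101 at an edge strand, in the spirit of the rules above.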

139
Rules of thumb when looking at a multiple
alignment (MA)
  • Active site residues are together in 3D structure
  • Helices often cover up core of strands
  • Helices are less extended than strands -> more
    residues needed to cross the protein
  • β-α-β motif is right-handed in >95% of cases
    (with parallel strands)
  • MA: inconsistent alignment columns and match
    errors!
  • Secondary structures have local anomalies, e.g.
    β-bulges