Title: Functional Analysis of Proteins and Proteomes
1Functional Analysis of Proteins and Proteomes
- CSB2003 Tutorial
- Steve Bennett, Ph.D.
- steve_at_bennett.org
2Introduction
- Although genetic material contains all the information required for cellular function, DNA itself does not carry out much of the work in cells.
- Rather, it is the products of genes, proteins and sometimes regulatory and catalytic RNAs, that carry out the chemical and mechanical work in biological systems.
3Central Dogma in Biology
- DNA
- RNA (sequence, structure)
- Protein (sequence, structure)
4Introduction
- With the completion of numerous genome projects, resources and focus are shifting from genomes to proteomes.
- Once researchers have an accurate collection of gene sequences, the next question is what these genes do.
5Introduction
- Although there are numerous definitions of function with respect to proteins, here we define it as precisely what a particular gene product, or protein, does in the cell.
- Examples:
- Molecular motor (kinesin, myosin)
- Zinc-finger transcription factor
6Introduction
- In this tutorial, I will first give a general protein introduction, followed by historical and current computational approaches for assigning function to a protein.
- Background
- Overview of some selected algorithms and approaches
- Demos and software examples
7Forming a Peptide Bond
Creates the Primary Structure, or protein sequence
8Polypeptide Chain
The chemical nature of the R groups determines the amino acid sequence of the peptide
9Translation
10Planar Peptide Bond
11Alpha Helix Structure
12Alpha Helix End-On
13Alpha Helix Variable Pitch
14Anti-Parallel Beta Sheets
15Corrugated Beta Sheets
16Beta Turn at End of Anti-Parallel Sheet
17Protein Data Bank (PDB) Growth
- 19,225 released atomic coordinate entries
- 17,315 proteins, peptides, and viruses
- 1,892 nucleic acids, protein-nucleic acid
complexes
18Structural Challenges
- Compare all known structures to each other
- Classify and organize all structures in a biological way
- Find common folding patterns and structural motifs
- Compute evolutionary distances between protein structures
- Study interactions between structures and other molecules (Protein Docking)
- Use known structures to predict structure from sequence (Protein Threading)
- Many more ...
19Classification of Protein Structures
- Class
- Similar secondary structure content
- All α, all β, α+β, α/β, etc.
- Fold (Architecture)
- Major structural similarity
- SSEs in similar arrangement
- globin-like fold, TIM barrel fold
- Superfamily (Topology)
- Probable common ancestry
- globins, phycocyanin
- Family
- Clear evolutionary relationship
- Sequence similarity usually > 25%
20Class
Fold / Architecture
Superfamily
21Classes of Protein Structures
- Mainly α
- Mainly β
- α/β
- Parallel β sheets, β-α-β units
- α+β
- Anti-parallel β sheets, segregated α and β regions
- Helices mostly on one side of sheet
22Classes of Protein Structures
- Others
- Multi-domain, membrane and cell surface, small
proteins, peptides and fragments, designed
proteins
23Folds / Architectures
- α/β and α+β
- Closed
- Barrel
- Roll, ...
- Open
- Sandwich
- Clam, ...
- Mainly α
- Bundle
- Non-Bundle
- Mainly β
- Single sheet
- Roll
- Barrel
- Clam
- Sandwich
- Prism
- 4/6/7/8 Propeller
- Solenoid
24eg. The TIM Barrel Fold
25Growth in PDB Folds
Gold: old folds. White: new folds.
26Databases of Folds
- SCOP
- Murzin AG, Brenner SE, Hubbard T, Chothia C
- Structural Classification of Proteins
- Manual assembly by inspection
- All nodes are annotated (e.g., all-alpha, alpha/beta)
- Structural similarity search using 3dSearch (Singh and Brutlag)
- CATH
- Dr. C.A. Orengo, Dr. A.D. Michie, Dr. S. Jones, Dr. M.B. Swindells, Dr. G. Hutchinson, Dr. A. Martin, Dr. D.T. Jones, Prof. J.M. Thornton
- Class - Architecture - Topology - Homologous Superfamily
- Manual classification at Architecture level
- Automated topology classification using the SSAP algorithm
- No structural similarity search
27Databases of Folds
- FSSP
- L. L. Holm and C. Sander
- Fully automated using the DALI algorithm (Holm and Sander)
- No internal node annotations
- Structural similarity search using DALI
- Pclass
- A. Singh, X. Liu, J. Chang, D. Brutlag
- Fully automated using the LOCK and 3dSearch algorithms
- All internal nodes automatically annotated with common terms
- JAVA based classification browser
- Structural similarity search using 3dSearch
28Protein Structure Prediction
Sequence of 984 amino acids
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
G PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAG
LKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNV
LPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQ
HRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVL
PEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPL
TEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQE
PFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPI
QKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYV
DGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGL
EVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHK
GIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKA
LVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNK
RTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFT
IPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVI
YQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLW
MGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQ
LCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIA
EIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKIT
TESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKL
WYQ
HIV reverse transcriptase
3D coordinates of 7404 atoms
29Abstracting the problem
3D coords of C-alpha backbone
3D coords of all atoms
3D coords of secondary structure elements
C-alpha groups
30Defining the secondary structure of a protein
sequence
Alpha helix and anti-parallel beta sheet
31The Secondary Structure Prediction Problem
- Given a protein sequence:
- NWVLSTAADMQGVVTDGMASGLDKD...
- Predict a secondary structure sequence:
- LLEEEELLLLHHHHHHHHHHLHHHL...
- 3-state problem: {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}^n -> {L,H,E}^n
32Amphipathic helix: end view
33Amphipathic helix: backbone and sidechains
34Amphipathic helix: hydrophobic sidechains
35Amphipathic helix: hydrophobic sidechains
36Amphipathic helix: sidechain periodicity
Sequence: NLAKMVVKTAEAILKD
37Structural Correlations in Alpha-Helices
38Structural Correlations in Beta-Strands
39Functional Analysis of Proteins
40Sequence methods
- The earliest general approach for assigning
function to a protein sequence was to compare the
sequence of unknown function to a sequence (or
sequences) of known function.
Figure: a sequence of unknown function (?) compared against sequences of known function
47Sequence methods
- One such method is sequence alignment, in which two sequences are aligned to determine how similar they are to one another.
48Sequence methods
- If an alignment is of sufficient quality, one might assign the function of the known sequence to be that of the unknown sequence as well.
Figure: alignment score > threshold leads to function assignment (function: Zn finger transcription factor)
49Sequence Alignment
- We'll briefly discuss two different alignment approaches, pairwise sequence alignment and multiple sequence alignment, before moving on to other topics.
- Alignments are most often in one of two forms: local or global.
50Amino Acid Similarity
- To discuss alignment methods, we first need to discuss methods for determining if characters in different sequences are similar.
- Identity
- Biochemical properties
- PAM, BLOSUM matrices
51PAM Matrices
- Percent Accepted Mutation Matrices (Dayhoff)
- Examine amino acid changes in groups of related proteins with at least 85% sequence similarity.
- The differing amino acids are assumed to be accepted over evolutionary time.
- Counts are normalized and used to estimate a matrix representing all possible amino acid changes.
52PAM Matrices
53BLOSUM Matrices
54Dot Matrices
- Dot matrices create an n x m matrix from the two sequences to be compared.
- A match is scored in the matrix by strict character identity, chemical similarity of the amino acids, or the use of a symbol comparison matrix such as PAM or BLOSUM.
- Mark the matrix location of each match with a dot. Connected regions of similarity will appear as diagonal lines.
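The construction above can be sketched in a few lines. This is only an illustration that scores matches by strict identity; the function names `dot_matrix` and `render` are hypothetical, and the two sequences are the ones that appear on the example slide that follows.

```python
def dot_matrix(seq_a, seq_b):
    """Return a len(seq_a) x len(seq_b) grid; True marks an identity match."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b, matrix):
    """Draw the grid as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


rows, cols = "AESCVVLV", "ADSCTFGVVLI"
print(render(rows, cols, dot_matrix(rows, cols)))
```

Runs of `*` along a diagonal correspond to the connected regions of similarity described above. Swapping the `a == b` test for a lookup in a substitution matrix with a cutoff would give the chemical-similarity variant.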
55Dot Matrices
- A D S C T F G V V L I
- A
- E º
- S
- C
- V
- V
- L
- V º
56Dot Matrices
57Dot Matrices SLIT vs. itself
58Dot Matrices
59Dot Matrices
- Improving signal-to-noise in dot matrices
- Sliding Window
- Scoring a match at position i, j in the matrix is not independent: downstream positions are considered as well. This helps screen out spurious matches in favor of meaningful local regions of similarity.
- Variable Stringency
- Used with the sliding window method: denotes how many characters in the window must match for a hit to be declared at position i, j.
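A minimal sketch of the sliding-window filter with variable stringency, assuming identity matching and a window that extends downstream from position i, j. The name `windowed_dot_matrix` and the default parameter values are illustrative choices, not taken from the tutorial.

```python
def windowed_dot_matrix(seq_a, seq_b, window=3, stringency=2):
    """Declare a hit at (i, j) when at least `stringency` of the `window`
    downstream positions along the diagonal are identical."""
    n, m = len(seq_a), len(seq_b)
    hits = [[False] * m for _ in range(n)]
    for i in range(n - window + 1):
        for j in range(m - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            hits[i][j] = matches >= stringency
    return hits


strict = windowed_dot_matrix("AESCVVLV", "ADSCTFGVVLI", window=3, stringency=3)
print(sum(sum(row) for row in strict))  # number of windows that match exactly
```

With `window=1` and `stringency=1` this reduces to the plain identity dot matrix; raising either value trades sensitivity for a cleaner plot.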
60Dot Matrices
- Advantages
- Intuitive and straightforward
- Immediate visualization of similar subsequences
- Limitations
- Although related subsequences are easily seen, it is unclear what the best alignment is between them.
- Difficult to assess the quality of different alignments: there is no scoring system
61Dynamic Programming
- Considerable improvement to the basic dot matrix approach: DP generates a provably-optimal alignment between a pair of sequences.
- Produces a score which can be evaluated for statistical significance given the aligned sequences and conditions.
- Allows for the inclusion of gaps without the extremely large number of computations required in any direct computation.
- Can be used for both local and global alignments.
62Dynamic Programming
- As observed for dot matrices, DP uses a scoring system that favors identical and similar amino acids, and penalizes dissimilar amino acids and gaps.
- Values for the scoring system are usually derived from amino acid substitution tables such as PAM or BLOSUM matrices. Each position in a potential alignment is evaluated according to these substitution tables.
- The scores for all positions are then summed to generate an overall log-odds score for the alignment.
63Dynamic Programming
- Similar to the dot matrix algorithm, we construct an n x m matrix from the 2 sequences to be aligned, plus a gap row and column, allowing each sequence to begin with a gap if necessary.
- Instead of marking a dot, we calculate a running best score that depends on the scores of the cells calculated previously. The matrix is built left to right, top to bottom.
64Dynamic Programming
- Specifically, given two sequences, p and q:
- p = p1 p2 ... pi ... pn
- q = q1 q2 ... qj ... qm
- then the score at each position i in sequence p and position j in sequence q (that is, the score Sij in each matrix cell) is given by a recurrence over the previously calculated neighboring cells.
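The formula shown on the original slide is not preserved in this transcript. A common form of the global-alignment recurrence with a linear gap penalty is given here as an assumption, not as the author's exact expression. The symbols s (a substitution score, for example from PAM or BLOSUM) and g (a non-negative gap penalty) are introduced only for illustration.

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(p_i, q_j) & \text{align } p_i \text{ with } q_j \\
S_{i-1,j} - g             & \text{gap in } q \\
S_{i,j-1} - g             & \text{gap in } p
\end{cases}
\qquad
S_{0,0} = 0,\quad S_{i,0} = -i\,g,\quad S_{0,j} = -j\,g
```

For a local alignment, the usual modification is to add 0 as a fourth option inside the max and to initialize the border cells to 0.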
65Example VDFS and VET
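The worked matrices for this example are not preserved in the transcript. As a stand-in, here is a minimal global-alignment sketch applied to the same two sequences. The scoring values (match +1, mismatch -1, gap penalty 1) are assumptions chosen for readability; the slides most likely used a PAM or BLOSUM table, so the numbers will not match the original figures.

```python
def needleman_wunsch(p, q, match=1, mismatch=-1, gap=1):
    """Global alignment with a linear gap penalty.
    Returns (score, aligned_p, aligned_q)."""
    n, m = len(p), len(q)
    # S[i][j] is the best score for aligning p[:i] with q[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align p[i-1] with q[j-1]
                          S[i - 1][j] - gap,      # gap in q
                          S[i][j - 1] - gap)      # gap in p
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_p, out_q = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if p[i - 1] == q[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                out_p.append(p[i - 1])
                out_q.append(q[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - gap:
            out_p.append(p[i - 1])
            out_q.append("-")
            i -= 1
        else:
            out_p.append("-")
            out_q.append(q[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(out_p)), "".join(reversed(out_q))


score, top, bottom = needleman_wunsch("VDFS", "VET")
print(score)
print(top)
print(bottom)
```

Under these assumed scores the optimal score for VDFS against VET is -2 (one match, two mismatches, one gap). Several alignments tie at that score, and the traceback returns just one of them.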
73Dynamic Programming
- Smith-Waterman: more widely used implementation for local alignments.
- Software packages
- BESTFIT (Smith-Waterman)
- GAP (Needleman-Wunsch)
- On the web: http://motif.stanford.edu/alion/
74Dynamic Programming
- Advantages
- Alignments are optimal
- Quantitative score associated with the alignments
- Limitations
- Costly in time and space: hardware is required for database-sized searches (increasingly important for modern bioinformatics applications).
75Rapid Database Searching
- Goal 1: Execute with fewer demands in time and space than dynamic programming.
- Goal 2: Perform reasonably well using heuristics, as compared to the DP optimal solution.
- Drawback: Resulting sequence alignments are not guaranteed to be optimal.
76Rapid Database Searching FASTA
- Dynamic Programming approaches match single characters at a time
- FASTA matches groups of characters, called words or k-tuples, which are managed in a table.
ADCGPH
ADCGPH
ADCGPH
ADCGPH
77Rapid Database Searching FASTA
Database sequences: db1, db2, db3
Assume k = 3: there are 20^3 = 8000 possible 3-character words.
Scan each database sequence, recording the position of each 3-tuple in a lookup table of size 8000 keyed on the 3-tuple.
78Rapid Database Searching FASTA
Database sequences db1, db2, and db3 each contain the 3-tuple AAC.
Assume k = 3: there are 20^3 = 8000 possible 3-character words.
- Assume that the 3-tuple AAC occurs
- at position 12 in db1
- at position 52 in db2
- at position 20 in db3
After scanning the database sequences and building the lookup table, the table element corresponding to AAC would look like
AAC → db1:12, db2:52, db3:20
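A minimal sketch of the lookup-table step, assuming a Python dictionary as a sparse stand-in for the 8000-slot table. The function name `build_ktuple_index` and the three toy sequences are hypothetical; positions are 1-based to match the slide.

```python
from collections import defaultdict


def build_ktuple_index(database, k=3):
    """Map each k-tuple to a list of (sequence name, 1-based position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].append((name, start + 1))
    return index


# Toy database: three short, made-up sequences that all contain AAC.
toy_db = {"db1": "MKAACDE", "db2": "AACGH", "db3": "WYVAAC"}
index = build_ktuple_index(toy_db)
print(index["AAC"])
```

Looking up a query word is then a single dictionary access, which is what makes the scan over a large database fast compared with filling a full DP matrix per sequence.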
79Rapid Database Searching FASTA
Database sequences db1, db2, and db3 each contain the 3-tuple AAC.
AAC → db1:12, db2:52, db3:20
Suppose a query sequence, q, has the 3-tuple AAC at position 60. The table returns the 3 database sequences and the locations where the matching tuple occurs in those sequences. Assuming this is done for another 3-tuple, DFE, we might have
3-tuple  db1  q
AAC      12   60
DFE      56   104
80Rapid Database Searching FASTA
AAC → db1:12, db2:52, db3:20
3-tuple  db1  q
AAC      12   60
DFE      56   104
Next, FASTA computes the offsets between the locations of matched tuples. Here, we see that for AAC and DFE, the offset is identical, equal to 48. This indicates that these 3-tuples are in-phase, or part of a larger locally-aligned region. FASTA then rescores these local alignments using a PAM250 matrix, takes the 10 highest-scoring regions of identity, and performs a joining step in an attempt to join the regions.
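The offset computation can be sketched as follows. This covers only the "same diagonal" counting step, not the PAM250 rescoring or the joining step, and the function name `diagonal_offsets` and the toy sequences are assumptions for illustration.

```python
from collections import Counter, defaultdict


def diagonal_offsets(query, db_seq, k=3):
    """Count shared k-tuples per offset (query position minus database position)."""
    positions = defaultdict(list)
    for j in range(len(db_seq) - k + 1):
        positions[db_seq[j:j + k]].append(j + 1)   # 1-based database positions
    offsets = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], ()):
            offsets[(i + 1) - j] += 1
    return offsets


# Toy example: the database sequence occurs in the query shifted by 3 residues.
print(diagonal_offsets("GGGAACKDFE", "AACKDFE"))
```

An offset that collects many k-tuple hits marks a diagonal worth rescoring. In the slide's numbers, 60 - 12 and 104 - 56 both give 48, so AAC and DFE fall on the same diagonal.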
81Rapid Database Searching BLAST
- BLAST is a much faster algorithm than FASTA, and has been shown to be just as sensitive.
- As such, BLAST is considerably more widely used.
- Similar to FASTA in that it uses words, but the size is fixed at k = 3.
82Rapid Database Searching BLAST
- BLAST first extracts all overlapping 3-tuples from a query sequence.
- Then, the tuples in the query are evaluated against the possible 8000 tuples using a BLOSUM matrix. This determines if inexact matches between query words and potential database words are above a certain threshold score. Those tuples remaining are assembled into a tree for rapid database search.
ADCGPH
ADC
DCG
CGP
GPH
83Rapid Database Searching BLAST
- Suppose we observed the tuple SEI in the query sequence.
- Step 1. Score against all 8000 tuples, keeping only those that are above our predetermined scoring threshold.
- SEI scored against SEI gives a score of 13 (S-S + E-E + I-I in the BLOSUM matrix)
- SEI scored against SDI gives a score of 10
- SEI scored against SDG gives a score of only 2.
- Hence, if our cutoff score were 9, we would keep SEI and SDI, but not SDG when assembling the search tree.
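The three scores above can be reproduced with a handful of substitution values. The sketch below hard-codes only the five BLOSUM62 entries needed for this example; treating BLOSUM62 as the matrix behind the slide's numbers is an assumption, though these entries do give 13, 10, and 2. The names `word_score` and `neighborhood` are hypothetical.

```python
# Excerpt of BLOSUM62: only the pairs needed for the SEI example.
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("E", "E"): 5, ("I", "I"): 4,
    ("E", "D"): 2, ("I", "G"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(query_word, candidate):
    """Sum the per-position substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words scoring at or above the threshold."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


print(neighborhood("SEI", ["SEI", "SDI", "SDG"], threshold=9))
```

A full implementation would score the query word against all 8000 candidate words using the complete matrix; the excerpt keeps the example self-contained.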
84Rapid Database Searching BLAST
- Step 2. Once the possible matching tuples are stored in the tree, database sequences are searched for exact matches to these tuples.
- Matches are examined for regions that are on the same diagonal and within some distance, A, of one another. These regions serve as starting points for a longer ungapped alignment between the words. These joined regions are then extended in each direction as long as the score is increasing.
- PSI-BLAST: an iterative BLAST approach that includes conservation information within a family of proteins, as opposed to just between two proteins.
85Multiple Sequence Alignments
- So far, we have only discussed pairwise alignments, since they are the most commonly used and have optimal solutions.
- Multiple sequence alignments are vitally important to understanding true evolutionary conservation between sequences in a family.
86Multiple Sequence Alignments
- Allows for the extraction of probes for new members of a family (motifs / patterns).
- Helps identify the functionally important amino acids in a protein family. Amino acids not required for function or structural integrity will in general not be highly conserved within a family.
- VTDIAYRCGFSDSNHFSTLFRREFNWSPRDI
- VTEIAYRCGFGDSNHFSTLFRREFNWSPRDI
- VFQISHRCGFGSNAYFCDVFKRKYNMTPSQF
- VFQISHRCGFGSNAYFCDAFKRKYGMTPSQF
87Representations for Similarity and Alignments
- Short, simple representations for conserved sequence information in MSAs make assigning new proteins to the family considerably easier.
- Short representations can suggest functional and biological conclusions regarding why certain amino acids are conserved at certain positions.
- Such representations can identify function in a protein that more global homology methods (such as BLAST) might miss.
88Profiles
- A highly conserved local region in an MSA is identified, then a profile (a type of PSSM) is constructed to describe it.
- 20 x n matrix with each column describing the scores, or probabilities, of different amino acids appearing at a given position.
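A minimal sketch of building such a 20 x n matrix from an ungapped alignment. It uses the four aligned sequences shown on the multiple sequence alignment slide. The pseudocount, the uniform background frequency, and the log-odds form are assumptions made for illustration, not the tutorial's exact procedure.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

MSA = [
    "VTDIAYRCGFSDSNHFSTLFRREFNWSPRDI",
    "VTEIAYRCGFGDSNHFSTLFRREFNWSPRDI",
    "VFQISHRCGFGSNAYFCDVFKRKYNMTPSQF",
    "VFQISHRCGFGSNAYFCDAFKRKYGMTPSQF",
]


def build_profile(alignment, pseudocount=1.0):
    """Return {amino acid: [log-odds score per column]} for an ungapped MSA."""
    n_seqs, length = len(alignment), len(alignment[0])
    background = 1.0 / len(AMINO_ACIDS)            # assumed uniform background
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    profile = {aa: [0.0] * length for aa in AMINO_ACIDS}
    for col in range(length):
        column = [seq[col] for seq in alignment]
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            profile[aa][col] = math.log2(freq / background)
    return profile


def score_segment(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(profile[aa][i] for i, aa in enumerate(segment))


profile = build_profile(MSA)
print(round(score_segment(profile, MSA[0]), 2))
```

Sliding `score_segment` along a new protein and reporting windows above a cutoff is the usual way such a matrix is used to look for additional family members.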
89PROSITE patterns
- Patterns (motifs, signatures, fingerprints) are short regular expression-like text strings that describe a conserved region.
- PROSITE is a manually curated database of profiles and patterns.
- Focuses on particular regions of family MSAs shown in the literature to be biologically important (usually catalytic sites, metal-binding sites, reduced cysteines, or ligand binding sites).
90PROSITE patterns
- Short conserved sequences from the MSA are then extracted as a core and used to search SWISS-PROT. If no additional sequences are found, the core is designated as the actual signature. If numerous false positives are picked up, then the core is increased in size until good discrimination is achieved, or until it is clear that good discrimination won't be possible.
- C-x(15)-A-x(3,4)-G-x(3)-C-x(2)-G-x(8,9)-P-x(7)-C
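Because PROSITE patterns are regular expression-like, a pattern such as the one above can be translated mechanically. The sketch below handles only the common elements (single residues, `x`, `[..]` sets, `{..}` exclusions, and repeat counts); the function name `prosite_to_regex` is hypothetical, and anchors such as `<` and `>` are deliberately not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element like "x(3,4)" into its core "x" and repeat bounds.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return "".join(parts)


regex = prosite_to_regex("C-x(15)-A-x(3,4)-G-x(3)-C-x(2)-G-x(8,9)-P-x(7)-C")
print(regex)
```

The resulting string can be passed to `re.search` to scan a protein sequence for the signature.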
91Blocks
- Blocks are short, ungapped conserved regions in
multiple sequence alignments.
92Blocks
- They are created starting from one of two sources:
- Unaligned sequences from PROSITE families
- An existing MSA
- Since PROSITE's manual curation limits its size, the BLOCKS database currently includes families from PRINTS-S and InterPro in addition to PROSITE.
93BLOCKS Contains Many Protein Families (Henikoff & Henikoff, 1999)
94Properties of eMOTIFs http://emotif.stanford.edu/
- Discrete motifs that represent specific functions
- Highly specific motifs for searching entire proteomes
- Maintain sensitivity with multiple motifs
- Generate motifs automatically from protein alignments
- Resistant to sequence errors, misalignment, and misclassification
- Robust with respect to protein subclasses
- Generates structural motifs, potential drug targets
- Biological generalization from known examples
95eMOTIFs http://emotif.stanford.edu/
fly..h...hst..krpfy.c
96Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQY
TDARQDLYELEVDYANL
TEARENIAVLERDFEEV
TEAESNMNDLVSEYQQY
TEVRANMNDLVAEYQQY
SEAESNMNDLVSEYQQY
TEAREDLAALEKDYEEV
TEAREDLAALERDYIEV
SEAREDLAALEKDYEEV
AEAREDLAALEKDYIEV
SEAREDLAALEKDYEEV
SEAREDLAALERDYEEV
97Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQY
TDARQDLYELEVDYANL
TEARENIAVLERDFEEV
TEAESNMNDLVSEYQQY
TEVRANMNDLVAEYQQY
SEAESNMNDLVSEYQQY
TEAREDLAALEKDYEEV
TEAREDLAALERDYIEV
SEAREDLAALEKDYEEV
AEAREDLAALEKDYIEV
SEAREDLAALEKDYEEV
SEAREDLAALERDYEEV
TEARENIAVLERDFEEV SDVESDNNDPVAEYIQL A A LYE V
ANY Q A S Q K
98Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQY
TDARQDLYELEVDYANL
TEARENIAVLERDFEEV
TEAESNMNDLVSEYQQY
TEVRANMNDLVAEYQQY
SEAESNMNDLVSEYQQY
TEAREDLAALEKDYEEV
TEAREDLAALERDYIEV
SEAREDLAALEKDYEEV
AEAREDLAALEKDYIEV
SEAREDLAALEKDYEEV
SEAREDLAALERDYEEV
TEAREDLAALERDYEEV S K I A
100Amino Acid Substitution Groups Based on Physical Properties
- Only permit groups of amino acids sharing some chemical or physical property
Group
AG
ST
PAGST
QN
QNED
KR
VLI
VLIM
FYW
KRH
DE
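A simplified sketch of how the substitution groups listed above could turn alignment columns into a motif. For each column it picks the smallest listed group covering every observed residue and falls back to a wildcard otherwise. This is an illustration only: the actual eMOTIF method enumerates many candidate motifs and trades off sensitivity against specificity, and the bracketed output format and the names `column_symbol` and `make_motif` are assumptions.

```python
# Substitution groups as listed on the slide.
GROUPS = ["AG", "ST", "PAGST", "QN", "QNED", "KR",
          "VLI", "VLIM", "FYW", "KRH", "DE"]


def column_symbol(residues):
    """Describe one alignment column as a residue, a group, or a wildcard."""
    observed = set(residues)
    if len(observed) == 1:
        return next(iter(observed)).lower()
    covering = [g for g in GROUPS if observed <= set(g)]
    if not covering:
        return "."                                # no allowed group fits
    return "[" + min(covering, key=len).lower() + "]"


def make_motif(alignment):
    """Build one motif string from equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*alignment))


# Five sequences taken from the aligned set on the preceding slides.
subset = [
    "TEAREDLAALEKDYEEV",
    "TEAREDLAALERDYIEV",
    "SEAREDLAALEKDYEEV",
    "AEAREDLAALEKDYIEV",
    "SEAREDLAALERDYEEV",
]
print(make_motif(subset))  # [pagst]earedlaale[kr]dy.ev
```

Restricting the motif to a subset of the alignment, as here, gives a more specific pattern; using all twelve sequences would force more columns to the wildcard.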
101Allowable Amino Acid Substitution Groups
fly..h...hst..krpfy.c
102(No Transcript)
103Discovery of eMOTIFs http://emotif.stanford.edu/
104Discovery of eMOTIFs http://emotif.stanford.edu/
105
- Each red dot is an eMOTIF
- Most specific eMOTIFs lie along the Pareto-optimal curve
- High sensitivity -> low specificity
- High specificity -> low sensitivity
106(No Transcript)
107(No Transcript)
108(No Transcript)
109Protein Function with eMOTIF Search http://emotif.stanford.edu/
110Protein Function with eMOTIF-Search http://emotif.stanford.edu/
111 3MOTIFs and 3MATRICES http://3motif.stanford.edu/
112(No Transcript)
113 Searched for 3est
Visualization Features
- Conservation strength shading
- Relative and overall solvent accessibilities per residue, and for the eMOTIF as a whole
- Accessibility shading
- Multiple display and manipulation options
114Visualization Features
3est - cgg.lilv...wvilmvstaahc
115(No Transcript)
116(No Transcript)
117(No Transcript)
118(No Transcript)
119(No Transcript)
120(No Transcript)
121(No Transcript)
122(No Transcript)
123 3motif Pipeline Construction Query
124(No Transcript)
125eMotifs and SCOP
- eMotifs were observed to correlate strongly with SCOP classification, even when global sequences were not overly similar.
- eMotifs that were found to hit proteins in different SCOP locations were particularly interesting.
126(No Transcript)
127(No Transcript)
128(No Transcript)
129(No Transcript)
130(No Transcript)
131eMATRIX: Position-Specific Scoring Matrices
132An eMATRIX http://ematrix.stanford.edu/
133eMATRIX Scan http://ematrix.stanford.edu/
134eMATRIX Scan Results http://ematrix.stanford.edu/ematrix-scan/
135eMATRIX Search http://ematrix.stanford.edu/
136eMATRIX Search Results http://ematrix.stanford.edu/
137eMATRIX Maker http://ematrix.stanford.edu/
138 3MATRIX http://3matrix.stanford.edu/
139(No Transcript)
140ePROTEOME: A Functional Genomics Database http://eproteome.stanford.edu/
141BLOCKS Is Based On Several Protein Family Databases
142eBLOCKs - Discovering Protein Motifs http://eblocks.stanford.edu/
Figure: groups A, B, and C on an axis running from higher specificity to higher sensitivity
143Building eBLOCKs with PSI-BLAST
- 1) Compare the query to database with BLAST
- 2) Construct profile from significant similarities
- 3) Compare the profile to database
- 4) Repeat steps 2 and 3 until convergence
144Generating Multiple Overlapping eBLOCKs from PSI-BLAST Results
Figure: step 1, clustering and grouping; step 2, aligning and trimming. Resulting blocks are labeled G1B1, G1B2, G2B1, G2B2, G3B1, G3B2, G3B3.
145Clusters Are Organized Into Groups with Varying Specificity and Sensitivity
Figure: groups A, B, and C on an axis running from higher specificity to higher sensitivity
146eBLOCKs Summary
- SWISS-PROT
- 79,449 Sequences
- Filtered Target Set
- Homologous, putative, fragment, hypothetical, probable, possible
- 57,266 Sequences
- PSI-BLAST Searches
- 17,415
- Final Number Of Groups
- 19,889
- Final Number Of Blocks
- 81,413
147eBLOCKs are More Comprehensive
148Properties of 52,671 Novel eBLOCKs
- New eBLOCKs are about the same width as BLOCKS blocks
- Average new eBLOCK: 34 positions; others: 37
- New eBLOCKs have fewer sequences than BLOCKS blocks
- Average new eBLOCK has 18 sequences; BLOCKS: 27
- New eBLOCKs have similar information content
- New eBLOCKs have 2.82 bits/position; BLOCKS: 2.88 bits
- One half of new eBLOCKs (26,254) are in known families
- One half of new eBLOCKs (26,471) are in 6,713 new families
149Example of New eBLOCK in a Known Family
14-3-3 Family of Proteins
68 Sequences, 72 Sequences
BL00796A, P29358G1B1 BL00796B,
P29358G1B2 BL00796C, P29358G1B4
(28, 33)
150Catalytic Site of ATP Synthase
P-[SAP]-[LIV]-[DNH]-x(3)-S-x-S
PROSITE PS00152
eBLOCKs P19483G1B2
BLOCKS BL00152F
151Protein Functional Analysis Using BLOCKS or eBLOCKs
Motifs significant at an expectation of 10^-4
Red: eBLOCKs. Black: BLOCKS.
152Two Human Protein Sets
- Ensembl (http://www.ensembl.org)
- 29,304 proteins (Feb 2002) from the human genome project
- Based on GenScan models
- Shorter, more fragmentary protein sequences
- RefSeq (http://www.ncbi.nlm.nih.gov/)
- 21,724 -- curated (XP)
- 11,407 reviewed sequences (NP)
- Based on full length cDNAs
- Longer, more reliable protein sequences
153eBLOCKs Assignments for RefSeq and ENSEMBL Proteins
154Web Access to eBLOCKs http://eblocks.stanford.edu/
155An Entry From eBLOCKs http://eblocks.stanford.edu/
156Another Entry from eBLOCKs http://eblocks.stanford.edu/
157A Sample Keyword Search http://eblocks.stanford.edu/
158Search A Sequence http://eblocks.stanford.edu/
159Software demos
- Dot Matrices
- http://bioinf.ibun.unal.edu.co/java/dotlet/Dotlet.html
- Software Smith-Waterman alignments
- http://motif.stanford.edu/alion/
- Hardware Smith-Waterman alignments
- http://decypher.stanford.edu
- eMotif
- http://motif.stanford.edu/emotif/
- eMatrix
- http://motif.stanford.edu/ematrix/
- 3motif / 3matrix
- http://motif.stanford.edu/3motif/
- http://motif.stanford.edu/3matrix/
- eBlocks
- http://eblocks.stanford.edu/
- LOCK
- http://dlb3.stanford.edu/lock/
- SCOP / PDB
- http://scop.berkeley.edu
160Conclusion
- Functional analysis is more important than ever given the rate of growth of sequence databases.
- Important for understanding biology: gives researchers a head start on how to experimentally examine proteins.
- Important in pharmaceuticals: allows rapid discovery of targets.