Why Is Sequence Comparison Useful? - PowerPoint PPT Presentation

Slides: 33

Provided by: NCBI6

Learn more at: https://sites.pitt.edu

Transcript and Presenter's Notes

1
Why Is Sequence Comparison Useful?
Lipman, David (NIH/NLM/NCBI)
2
Almost 100 Trillion BLAST comparisons per quarter
(10/01)
3
Rapid similarity searches of nucleic acid and
protein data banks. Wilbur WJ, Lipman DJ. Proc
Natl Acad Sci U S A 1983 Feb;80(3):726-30.
With
the development of large data banks of protein
and nucleic acid sequences, the need for
efficient methods of searching such banks for
sequences similar to a given sequence has become
evident. We present an algorithm for the global
comparison of sequences based on matching
k-tuples of sequence elements for a fixed k. The
method results in substantial reduction in the
time required to search a data bank when compared
with prior techniques of similarity analysis,
with minimal loss in sensitivity. The algorithm
has also been adapted, in a separate
implementation, to produce rigorous sequence
alignments. Currently, using the DEC KL-10
system, we can compare all sequences in the
entire Protein Data Bank of the National
Biomedical Research Foundation with a 350-residue
query sequence in less than 3 min and carry out a
similar analysis with a 500-base query sequence
against all eukaryotic sequences in the Los
Alamos Nucleic Acid Data Base in less than 2 min.
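The k-tuple idea in the abstract above can be sketched in a few lines of Python. This is a simplified illustration, not the published Wilbur-Lipman implementation: it indexes every k-tuple of the query, then counts how many k-tuples each diagonal (subject offset minus query offset) shares with the subject. The function names, the default k=2, and the choice to report only the single best diagonal are assumptions made for this sketch. The two test sequences are the last v-sis and PDGF segments from the alignment on slide 4.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count shared k-tuples per diagonal (subject position - query position)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, subject, k=2):
    """Return (diagonal, hit_count) for the richest diagonal, or None if no hits."""
    hits = diagonal_hits(query, subject, k)
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


# Segments copied from the v-sis / PDGF alignment on slide 4.
V_SIS = "AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA"
PDGF = "AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA"

best = best_diagonal(V_SIS, PDGF, k=2)
print("best diagonal and k-tuple hits:", best)
```

A search over a whole data bank would repeat this scoring for every stored sequence and keep only the high-scoring ones for a rigorous alignment. Because only exact k-tuple matches are counted, larger k runs faster but tolerates fewer substitutions.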
4
Cancer Gene Meets Its Match. NY Times, July 3, 1983
a serendipitous computer search
Waterfield MD et al., Nature 1983 Jul 7;304(5921):35-39
Doolittle RF et al., Science 1983 Jul 15;221(4607):275-277
v-sis   6 QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK 65
          QGDPIPEELY+MLS HSIRSFDDLQRLL GD G+EDGAELDLNMTRSHSGGELESLARG+
PDGF   10 QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR 69

v-sis  66 RSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 125
          RSLGSL++AEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ
PDGF   70 RSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 129

v-sis 126 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCEIVAAARAVTRSPGTSQEQR 185
          CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCE VAAAR VTRSPG SQEQR
PDGF  130 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQR 189

v-sis 186 AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA 226
          AKT Q+RVTIRTVRVRRPPKGKHRK KHTHDKTALKETLGA
PDGF  190 AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA 230
V-sis and Platelet-Derived Growth Factor (PDGF)
5
An earlier, more subtle discovery
Viral src gene products are related to the
catalytic chain of mammalian cAMP-dependent
protein kinase. Barker WC, Dayhoff MO. PNAS 1982
May;79(9):2836-2839
Query 113 YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC 166
          Y    V     LHS      DLKP N LI  Q      DFG         GR
Sbjct 125 YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG 184

Query 167 GTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR 223
          GT    APEI          D    G     M     P     P       V    R
Sbjct 185 GTYTHQAPEILKGEIATPKADIYSFGITLWQMTTREVP-YSGEPQYVQYAVVAYNLR 240

Biology not Algorithms
- compare proteins, not DNA
- must detect similar amino acids, not just identities
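The point about detecting similar amino acids, not just identities, can be illustrated with a small sketch that labels each alignment column as an identity, a similar pair, or a mismatch. The similarity groups below are a rough illustrative assumption, not a published substitution matrix, and `classify_columns` is a hypothetical helper name. The input is the first 20 columns of the src / protein kinase alignment on slide 5.

```python
# Illustrative groups of chemically similar residues. This grouping is an
# assumption made for the sketch; real searches use a scoring matrix.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FYW"),   # aromatic
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("ST"),    # small hydroxyl
    set("NQ"),    # amide
]


def classify_columns(top, bottom):
    """Build a match line: the letter for identities, '+' for similar, ' ' otherwise."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(top, bottom):
        if a == b and a != "-":
            marks.append(a)
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


# First 20 aligned columns of the Query / Sbjct pair on slide 5.
QUERY = "YAAQIVLTFEYLHSLDLIYR"
SBJCT = "YSLDVVNGLLFLHSQSILHL"

line = classify_columns(QUERY, SBJCT)
identities = sum(ch.isalpha() for ch in line)
similar = line.count("+")
print(QUERY)
print(line)
print(SBJCT)
print(f"identities: {identities}, similar pairs: {similar}")
```

Counting similar pairs alongside identities raises the signal in weak alignments like this one, where identities alone are sparse.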

6
How often would one find matches?

How many protein families would there be?

Unexpected similarities should be extremely rare.
7
Estimating number of protein families
8
Earliest Estimates of Number of Protein Families
- 1000

Zuckerkandl, E. (1974) Accomplissement et
perspectives de la paleogenetique chimique. In
Ecole de Roscoff 1974, p. 69. Paris: CNRS.
  The appearance of new structures and
  functions in proteins during evolution, J. Mol.
  Evol. 7, 1-57 (1975).
Dayhoff, M.O. (1974) Federation Proceedings 33,
2314.
  The origin and evolution of protein
  superfamilies, Fed. Proc. 35, 2132-2138 (1976).

9
Margaret Dayhoff
10
Atlas of Protein Sequence and Structure, Vol. 5,
Supplement 3 (1978) pg. 10

It has been estimated that in humans there are
approximately 50,000 proteins of functional or
medical importance. A landmark of molecular
biology will occur when one member of each
superfamily has been elucidated. At the present
rate of 25 per year, this will take less than 15
years.

11
Hubris, the Genome Project, and Protein Families

Chothia, C. (1992). One thousand families for the
molecular biologist. Nature, 357, 543-544.

Green P, Lipman D, Hillier L, Waterston R,
States D, and Claverie JM (1993). Ancient
Conserved Regions in New Gene Sequences and the
Protein Databases. Science, 259, 1711-1716.
ACR (ancient conserved region): similarity
detected between sequences from distantly
related organisms
12
1992: What new families do we get from the genome
projects?

Set          N      Coding Sequences   Seq. with ACRs   ACRs
human ESTs   2644   600-1200           197 (16-33%)     103
worm ESTs    1472   1370               570 (42%)        240
worm genes   234    234                74 (32%)         59
yeast ORFs   182    182                43 (24%)         35

Sets compared            Matching Sequences   ACRs   ACRs in database
worm ESTs, human ESTs    77, 66               34     31 (91%)
worm ESTs, yeast ORFs    23, 13               9      8 (89%)
worm genes, human ESTs   17, 17               12     12 (100%)
worm genes, yeast ORFs   6, 4                 4      3 (75%)
human ESTs, yeast ORFs   14, 13               10     10 (100%)
13
Cumulative growth in number of proteins and number
of conserved domains (from Geer, L., Bryant, S.,
Ostell, J.)
[Chart: series Protein Sequences and Conserved Domain Families; axes Number of Proteins (0.0 to 1.2 x 10^6) and Families Hit (0 to 100); years 1960 to 2000]
14
Why so few families and why do they evolve
slowly?

Structural View
Thermodynamics: Finkelstein AV, Why are the
same protein folds used to perform different
functions? FEBS Lett. 325, 23-28 (1993)

15
Constraints Due To Biological Function May Be
More Important

Compare pairs of sequences from related classes
of proteins

All sequences should at least share structural
similarity

Divergence times for all sequences should be
approximately the same

prokaryotes

Sequences within a class share function but
sequences between classes have differing function

eukaryotes
Degree to which within-class similarity > between-class
similarity indicates importance of constraints
due to biological function.
16
Example from the Aminoacyl-tRNA synthetases
(aaRS) (from E. Koonin & Y. Wolf): essential
enzymes responsible for incorporation of amino
acids into proteins

Two unrelated classes of aaRS, each comprising 10
aaRS that are related to each other

The last universal common ancestor (LUCA) of
modern life forms already had at least 17 aaRS

The duplication leading to aaRS of different
specificities must have occurred during a
relatively short period of early evolution

The post-LUCA evolution of aaRS took much longer
than the early phase when the specificities were
established. However, the changes that occurred
after the aaRS were locked in their specificities
are small compared to the changes traced to the
early phase

17
Orthologs (from S. Bryant)

18
Paralogs (from S. Bryant)

19
Example from the Aminoacyl-tRNA Synthetases
(aaRS) (from E. Koonin & Y. Wolf)
Exceptions - glutamine/glutamate, asparagine/aspartate,
tryptophan/tyrosine
20
How many human genes?

80,000: Antequera F & Bird A, Number of CpG
islands and genes in human and mouse, PNAS 90,
11995-11999 (1993).
120,000: Liang F et al., Gene Index analysis of
the human genome estimates approximately 120,000
genes, Nat. Genet. 25, 239-240 (2000).
35,000: Ewing B & Green P, Analysis of expressed
sequence tags indicates 35,000 human genes,
Nat. Genet. 25, 232-234 (2000).
28,000-34,000: Roest Crollius H et al.,
Estimate of human gene number provided by
genome-wide analysis using Tetraodon nigroviridis
DNA sequence, Nat. Genet. 25, 235-238 (2000).
41,000-45,000: Das M et al., Assessment of the
Total Number of Human Transcription Units,
Genomics 77, 71-78 (2001).
21
How many human genes with ACRs? (from S.
Resenchuk, T. Tatusov, L. Wagner, A. Souvorov)
12,245 characterized mRNAs from RefSeq
78% have ACR, i.e., hit outside vertebrates at E < 10e-6 (9,496/12,245)
90% of these have corresponding GenomeScan (GS) predictions which also have ACR (8,501/9,496)
20,245 GS models for entire human genome have ACR
15,573 GS models after correction for splitting (20,245/1.3)
17,300 estimated human genes with ACRs (15,573/.9)
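The chain of estimates on this slide is plain arithmetic and can be replayed as a short sketch. All counts are taken from the slide. The variable names are hypothetical, and the comments describing what the 1.3 and .9 factors stand for are one reading of the slide, not something it spells out.

```python
# Counts as given on the slide.
refseq_mrnas = 12_245          # characterized RefSeq mRNAs
refseq_with_acr = 9_496        # of those, hit outside vertebrates
gs_confirmed = 8_501           # of those, GenomeScan model also has an ACR
gs_models_with_acr = 20_245    # genome-wide GenomeScan models with an ACR

# Assumed reading: one gene is split into about 1.3 predicted models on average.
models_per_gene = 1.3
# Assumed reading: GenomeScan recovers about 90% of ACR-containing genes.
gs_recovery = 0.9

frac_refseq_acr = refseq_with_acr / refseq_mrnas
frac_gs_confirmed = gs_confirmed / refseq_with_acr
gs_after_split = gs_models_with_acr / models_per_gene
genes_with_acr = gs_after_split / gs_recovery

print(f"RefSeq mRNAs with ACR:        {frac_refseq_acr:.1%}")
print(f"confirmed by GenomeScan:      {frac_gs_confirmed:.1%}")
print(f"GS models after split fix:    {gs_after_split:,.0f}")
print(f"estimated genes with ACRs:    {genes_with_acr:,.0f}")
```

The final figure is sensitive to both correction factors: a 10% change in either shifts the estimate by well over a thousand genes.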
22
How many human genes?
17,303 estimated human genes with ACRs
Now use comparative genomics
             S. cerev.   S. pombe    A. thal.      C. elegans    D. mela.
ACRs/genes   4022/6306   4846/6593   14443/24605   11598/20850   10469/14335
% with ACRs  63%         73%         58%           55%           73%
17,303/.55 = 31,500 Total Human Genes
More complicated than that!
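The comparative-genomics step above can be sketched as follows. The ACR and gene counts are copied from the slide, and the organism keys keep its abbreviations. Treating the .55 divisor as the lowest observed fraction (C. elegans) is an assumption about how the slide chose it.

```python
# (genes with ACRs, total genes) per organism, as listed on the slide.
ACR_TABLE = {
    "S. cerev.": (4022, 6306),
    "S. pombe": (4846, 6593),
    "A. thal.": (14443, 24605),
    "C. elegans": (11598, 20850),
    "D. mela.": (10469, 14335),
}

fractions = {org: acr / genes for org, (acr, genes) in ACR_TABLE.items()}
lowest_org = min(fractions, key=fractions.get)

human_genes_with_acr = 17_303
assumed_human_fraction = 0.55  # divisor used on the slide
total_estimate = human_genes_with_acr / assumed_human_fraction

for org, frac in fractions.items():
    print(f"{org:<12}{frac:.3f}")
print(f"lowest fraction: {lowest_org}")
print(f"estimated total human genes: {total_estimate:,.0f}")
```

Choosing a higher fraction, such as the 73% seen in S. pombe or D. mela., would lower the total, which is one reason the slide calls the picture more complicated.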
23
Conservation, expression level, protein length,
exon number
EST          0           0-20         0-200        >200         All
RefSeq       396         2716         9454         2791         12,245
RS ACR       240 (61%)   1718 (63%)   7049 (75%)   2447 (88%)   9496 (78%)
GS ACR       158 (66%)   1424 (83%)   6256 (89%)   2245 (92%)   8501 (90%)
Prot. Len.   319         419          486          517          493
Avg. exon    3.82        6.25         8.78         10.38        9.15

23,600 revised est. human genes with ACRs (15,573/.66)
43,000 upper bound on est. total human genes (23,600/.55)
35,000 is more reasonable bound with this approach
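The two divisions on this slide can be written out as a worked equation. Reading .66 as the GenomeScan recovery rate in the EST = 0 column of the table (158/240) and .55 as the ACR fraction carried over from the comparative-genomics slide is an interpretation, not something the slide states. The symbols are introduced here only for illustration.

```latex
\begin{align*}
N_{\mathrm{ACR}}   &\approx \frac{15{,}573}{0.66} \approx 23{,}600 \\
N_{\mathrm{total}} &\le \frac{N_{\mathrm{ACR}}}{0.55} \approx \frac{23{,}600}{0.55} \approx 43{,}000
\end{align*}
```

Using the worst-case recovery rate in place of the overall 90% is what turns the earlier 17,300 into 23,600 and makes the result an upper bound.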
24
The relationship of protein conservation and
sequence length

Lipman DJ, Souvorov A, Koonin EV, Panchenko AR,
Tatusova TA
BMC Evol Biol. 2002;2:20

25
4279 proteins
Salmonella Set
26
Archaeoglobus fulgidus
2420 proteins
[Histogram: Number (0 to 100) vs. Length (0 to 1000)]
27
conserved
nonconserved
Structural domains
28
conserved
nonconserved
Structural domains
Length
29
Human
14538 proteins
[Histogram: Number (0 to 300) vs. Length (0 to 1000); legend: conserved, nonconserved, Structural domains]
30
A
conserved
nonconserved
B
31
Archaeoglobus fulgidus Escherichia coli Contact
density
32
Acknowledgements
Steve Bryant Greg Schuler
Lewis Geer Alex Souvorov
Alex Kondrashov Tatiana Tatusov
Eugene Koonin Lukas Wagner
Jim Ostell Yuri Wolf
Sergei Resenchuk Phil Murphy (NIAID)

all my colleagues at NCBI and NIH
