Title: Genome Annotation Continued
1. Genome Annotation Continued
- This week's lab.
- Genome annotation: web-based databases for assigning gene function.
2. Last week's lab
- E-value
- Score
- Blastx
- Taxonomy
3. Lab
- Sequence assembly and analysis
- Assemble individual sequence reads
- Phred 30 - good or bad?
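A Phred quality score Q is related to the estimated base-call error probability P by Q = -10 log10(P), so Phred 30 corresponds to an estimated 1 error in 1,000 base calls. A minimal Python sketch of the conversion follows; the function names are illustrative and not taken from the lab software.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an estimated error probability p."""
    return -10 * math.log10(p)

# Print the error probability and accuracy for a few common thresholds.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Phred {q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```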
4. Linking Protein Sequence, Structure, and Function
- Protein: protein sequences.
- Domains (CDD): conserved functional domains in proteins, each represented by a PSSM.
- 3D Domains.
- Related tools: PSI-BLAST, RPS-BLAST, CDART.
- Source: NCBI Field Guide.
5. Position-Specific Substitution Rates
Active site serine
Weakly conserved serine
6. Position-Specific Score Matrix (PSSM)
Pos    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine is scored differently at these two positions: it scores 4 at position 211 (weakly conserved serine) and 7 at position 216 (active-site nucleophile).
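The lookup behind this comparison can be sketched in Python. The two rows are copied from the matrix on this slide (positions 211 and 216); the dictionary layout and the function name `pssm_score` are illustrative assumptions, not part of any BLAST tool.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order used in the PSSM

# Rows copied from the slide's matrix; keys are the row positions.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}

def pssm_score(position, residue):
    """Score for placing `residue` at `position` of the profile."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]

# Compare serine with two similar small residues at both positions.
for pos in (211, 216):
    print(pos, {aa: pssm_score(pos, aa) for aa in "SAT"})
```

At position 211, alanine and threonine score about as well as serine; at position 216, serine is the only residue with a positive score.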
7. Hidden Markov Models
- A statistical model that can be applied to any system that can be represented as a series of discrete states.
- Applies to protein and nucleotide (nt) sequences.
- Can be thought of much like the PSSMs used in PSI-BLAST after several iterations.
- Are used in gene finding and protein profile analysis.
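To make the idea of discrete states concrete, here is a minimal Python sketch of Viterbi decoding for a two-state HMM over a nucleotide sequence. The state names, every probability, and the function `viterbi` are made-up illustrative values, not parameters from a real gene finder or profile database.

```python
import math

STATES = ("GC_rich", "AT_rich")  # hypothetical hidden states
START = {"GC_rich": 0.5, "AT_rich": 0.5}
TRANS = {  # probability of moving from one state to the next
    "GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
    "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9},
}
EMIT = {  # probability of emitting each nucleotide in each state
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty seq (log space)."""
    # best[s] = log probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][symbol]))
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCATATAT"))
```

Profile HMMs such as those behind TIGRFAMs and Pfam use the same machinery with match, insert, and delete states for each alignment column instead of two composition states.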
8. Uses of HMMs in protein function analysis
- TIGRFAMs: strive to annotate the function of an entire protein.
- Pfams: strive to annotate domains of proteins.
9. Homologs, orthologs, and paralogs
- Homologous genes are genes that share a common evolutionary ancestor.
- Orthologs are genes found in different organisms that arose from a common ancestor by speciation.
- Paralogs are genes found in the same organism that arose from a common ancestor by duplication. The duplication could have occurred in that species or earlier, and paralogs have often diverged in function.
10. Orthologs may differ in function!
11. TIGRFAM
- Curated such that proteins in a TIGRFAM should have the same function if they are equivalogs.
- Proteins have identity over their entire length.
- Equivalog family - all proteins that are conserved with respect to function since their last common ancestor.
- Superfamily - all proteins with homology, but they may have different biological functions.
- Subfamily - an incomplete set of proteins with homology; they may have diverse biological functions.
12. PFAM
- More likely to describe a protein domain than a family.
- Pfams will not overlap.
- Cross-listed on the TIGRFAM page.
- 70% of proteins in Swiss-Prot have a Pfam match.
13. COGs
- Clusters of Orthologous Groups.
- Pairwise comparison of orthologs from many bacterial genomes.
- Suggests function only (book example).
14. Gene Ontology (GO)
- The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
- Biological process, molecular function, cellular component.
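As an illustration of the controlled vocabulary, the GO annotations of a gene product can be grouped under the three ontologies. The sketch below uses a hypothetical secreted serine protease as the example; the data layout and the helper `group_by_aspect` are assumptions for illustration, not the format of any GO tool.

```python
from collections import defaultdict

ASPECTS = ("biological_process", "molecular_function", "cellular_component")

# (GO ID, term name, aspect) for a hypothetical secreted serine protease.
annotations = [
    ("GO:0006508", "proteolysis", "biological_process"),
    ("GO:0004252", "serine-type endopeptidase activity", "molecular_function"),
    ("GO:0005576", "extracellular region", "cellular_component"),
]

def group_by_aspect(records):
    """Group (id, name, aspect) records by ontology, rejecting unknown aspects."""
    grouped = defaultdict(list)
    for go_id, name, aspect in records:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {aspect}")
        grouped[aspect].append((go_id, name))
    return dict(grouped)

for aspect, terms in group_by_aspect(annotations).items():
    print(aspect, terms)
```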
15. Literature Curation
- Saccharomyces Genome Database (SGD), for example.
- Manual curation of the literature for
experimental evidence linking function to
annotation.
16. Additional databases
- SMART - Simple Modular Architecture Research Tool.
- PROSITE - Protein motifs.
- PRODOM - A database based on PSI-BLAST PSSMs.
- InterPro - A database that brings together many of the above databases so that you can search them all at once.
- Others.
17. CDD
- Conserved Domain Database - links all of this information together.
- Consists of SMART, Pfam, and COGs (KOGs).
- Searchable directly, and also searched automatically by BLAST.
- Linked to CDART, which allows the identification of proteins with a similar domain architecture.
18. Bottom line about databases
- Are useful tools in assigning possible functions.
- Be careful about annotations.
- Example: proteins in the same COG can be orthologs that have evolved different functions.
- Many annotations are not backed up by experimental data.
- Some databases are automated and have not been checked for accuracy.
19. Annotation cannot be guaranteed without experimental evidence.