Genome of the week - PowerPoint PPT Presentation

Provided by: RobertB139
Learn more at: https://www.msu.edu
Transcript and Presenter's Notes

Title: Genome of the week


1
Genome of the week
Bacillus subtilis
  • Gram-positive soil bacterium
  • Genetically tractable, well-studied
  • Developmental pathways (sporulation, genetic
    competence)
  • Industrial and agricultural importance
  • 4.2 Mb genome (sequence completed 1997)
  • Close relative of Bacillus anthracis (Anthrax)
2
B. subtilis genome features
  • 4,106 protein coding genes
  • 10 rRNA operons
  • Nearly 50% of the genome consists of paralogous
    genes.
  • 77 ABC transporter binding proteins
  • 10 phage-like regions (horizontal transfer);
    low-GC regions in the genome.
  • 18 sigma factors - initiate transcription.
  • 34 two-component regulatory systems.

3
Annotating genes
  • How to assign preliminary functions to genes.
  • Automated programs.
  • Similarity searches
  • BLAST and PSI-BLAST
  • COGs, Pfam, CDD, other databases
  • Only 50-75% of genes will have a predicted
    function. Some have no known homologs in any
    other genome.
  • Functional characterization (individual genes)
  • Gene knockouts
  • Overexpression

4
  • In many cases computer annotation will only be
    able to predict function - NOT assign function!
  • The biological functions of many genes have not
    been determined, even in model systems.
  • As genomic characterization of gene function
    continues, more and more computer-generated
    annotations will be correct.

5
  • Molecular function - activity of a protein at the
    molecular level.
  • Examples would be ATPase, metal binding,
    converting glucose-6-phosphate to
    fructose-6-phosphate.
  • Biological function - cellular role of the
    protein.
  • Examples would be translation initiation, DNA
    replication, glycolysis.

6
Homologs, orthologs, and paralogs.
  • Homologous genes are genes that share a common
    evolutionary ancestor.
  • Orthologs are genes found in different organisms
    that arose from a common ancestor (separated by
    speciation).
  • Paralogs are genes found in the same organism
    that arose from a common ancestor by gene
    duplication. The duplication could have occurred
    in the species or earlier.

7
Using BLAST to predict gene function.
  • BLAST predicted protein sequence against the
    non-redundant database.
  • Determine best hits
  • Automated annotation programs will often assign
    the best hit function to the gene being searched.
  • Must manually confirm automated annotations.
    (Final project).

8
Basic Local Alignment Search Tool
  • Calculates similarity for biological sequences
  • Finds best local alignments
  • Heuristic approach based on Smith-Waterman
    algorithm
  • Searches for matching words rather than
    individual residues
  • Uses statistical theory to determine if a match
    might have occurred by chance
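For reference, the exhaustive calculation that BLAST approximates can be sketched in a few lines. This is a minimal Smith-Waterman scoring routine, not BLAST itself; the function name and the scoring values (match +1, mismatch -3, linear gap -3) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-3):
    """Return the best local alignment score between strings a and b.

    Illustrative scoring only: linear gap cost, identity-style matrix.
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is filled, which is why the exact method is slow on large databases and why a word-based heuristic is used instead.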

NCBI Field Guide
9
Nucleotide Words
GTACTGGACAT
 TACTGGACATG
  ACTGGACATGG
   CTGGACATGGA
    TGGACATGGAC
     GGACATGGACC
      GACATGGACCC
       ACATGGACCCT
...........
Minimum word size: 7
blastn default: 11
megablast default: 28
Make a lookup table of words
10
Protein Words
GTQ TQI QIT ITV TVE VED
EDL DLF ...
Word Size can be 2 or 3 (default 3)
Make a lookup table of words
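The lookup table of words can be sketched as a dictionary from each overlapping word to the positions where it starts. The function name is hypothetical, and the two input sequences are only the stretches implied by the word lists on the slides (the slides end in "...", so the real queries are longer).

```python
from collections import defaultdict


def build_word_table(sequence, word_size):
    """Map every overlapping word of length word_size to its start positions."""
    table = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        table[sequence[start:start + word_size]].append(start)
    return dict(table)


# Nucleotide words of size 11 (the blastn default quoted on the slide).
nucleotide_table = build_word_table("GTACTGGACATGGACCCT", 11)

# Protein words of size 3 (the default quoted on the slide).
protein_table = build_word_table("GTQITVEDLF", 3)
```

Looking up a database word in such a table is a constant-time operation, which is what makes the word-matching step fast.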
11
Minimum Requirements for a Hit
  • Nucleotide BLAST requires one exact match
  • Protein BLAST requires two neighboring matches
    within 40 aa

Nucleotide example (exact word match, one match):
ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT
Protein example (neighborhood words, two matches):
GTQITVEDLFYNI
neighborhood words: SEI, YYN
12
Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
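The raw score on this slide can be checked column by column. The sketch below assumes +1 for a match and -3 for a mismatch, and it also charges -3 for the gap column; that gap value is an assumption chosen because it reproduces the slide's arithmetic (19 - 9 = 10), whereas real BLAST searches use separate gap opening and extension costs. The function name is hypothetical.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum per-column scores for two equal-length, already aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # assumed: a gap column costs the same as a mismatch
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# The alignment shown on the slide: 19 matches, 2 mismatches, 1 gap.
score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
```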
13
Scoring Systems - Proteins
  • Position Independent Matrices
  • PAM Matrices (Percent Accepted Mutation)
  • Derived from observations of a small dataset
    of alignments
  • Implicit model of evolution
  • All calculated from PAM1
  • PAM250 widely used
  • BLOSUM Matrices (BLOck SUbstitution Matrices)
  • Derived from observations of a large dataset
    of highly conserved blocks
  • Each matrix derived separately from blocks with
    a defined percent identity cutoff
  • BLOSUM62 - default matrix for BLAST
  • Position Specific Score Matrices (PSSMs)
  • PSI- and RPS-BLAST

14
BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
15
Scores
Simply add the scores for each pair of aligned
residues

              V   D   S   -   C   Y
              V   E   T   L   C   F
BLOSUM62     +4  +2  +1 -12  +9  +3  =  7
PAM30        +7  +2   0 -10 +10  +2  = 11
Different matrices produce different scores!
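The two totals on this slide can be reproduced by adding the per-column values. The sketch below uses only the numbers printed on the slide (it does not load full matrices), and it assumes the fourth column is a gap in the first sequence opposite L, which is what the -12 and -10 entries suggest. The dictionary and function names are hypothetical.

```python
# Pair scores and gap costs exactly as printed on the slide.
PAIR_SCORES = {
    "BLOSUM62": {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3},
    "PAM30": {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2},
}
GAP_COST = {"BLOSUM62": -12, "PAM30": -10}


def alignment_score(top, bottom, matrix_name):
    """Add the score of each aligned column under the named matrix."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += GAP_COST[matrix_name]
        else:
            total += PAIR_SCORES[matrix_name][(x, y)]
    return total


blosum_total = alignment_score("VDS-CY", "VETLCF", "BLOSUM62")
pam_total = alignment_score("VDS-CY", "VETLCF", "PAM30")
```

The same alignment scores 7 under one matrix and 11 under the other, so raw scores are only comparable when the scoring system is the same.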
16
Local Alignment Statistics
High scores of local alignments between two
random sequences follow the Extreme Value
Distribution.
Expect Value: E = number of database hits you
expect to find by chance. E depends on the size of
the database and on your score.
At low E values, E approximates a P value.
[Figure: distribution of Alignments against Score;
label: expected number of random hits]
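The expect value is commonly written with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and S is the raw score. The sketch below is a hedged illustration of that formula: the function names are hypothetical, and the values of lambda and K are assumed example parameters, since the real ones depend on the scoring matrix and gap costs.

```python
import math

# Assumed, illustrative Karlin-Altschul parameters (they vary with matrix and gap costs).
LAMBDA = 0.267
K = 0.041


def expect_value(raw, query_len, db_len, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw)


def bit_score(raw, lam=LAMBDA, k=K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_bits(bits, query_len, db_len):
    """Equivalent form: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)
```

E grows linearly with the database size and falls exponentially with the score. For small E, `p_value(e)` is almost equal to `e`, which is the sense in which E approximates a P value.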
17
BLAST Databases for Proteins
  • nr (non-redundant protein sequences): GenBank
    CDS translations; NP_ RefSeqs; PIR, Swiss-Prot,
    PRF; PDB (sequences from structures)
  • swissprot
  • pat - patents
  • pdb - sequences with 3D structures
  • month - sequences updated within 30 days

18
Assessment of BLAST output
  • What is the level of identity and similarity of
    the best hits?
  • The higher the identity, the more likely the
    proteins have similar functions.
  • Does the area of similarity occur over the entire
    protein? Or just part of the protein? (fig.
    2.19)
  • Often you will find hits to only part of your
    protein. A GTP-binding domain for example.
  • Have any of the best hits been characterized
    experimentally?
  • With so many microbial genomes sequenced, chances
    are you will have to search extensively to find a
    hit that has been characterized experimentally.

19
BLAST Formatting Page
20
BLAST Output Graphic Overview
PX
SH3
21
BLAST Output Descriptions
4 × 10^-68
links to Entrez
default E value cutoff: 10
22
TaxBLAST Taxonomy Reports
23
BLAST Output Alignments
>gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9
(SH3 and PX domain-containing protein 1) (SDP1
protein)
Length = 595
Score = 255 bits (652), Expect = 4e-68
Identities = 140/322 (43%), Positives = 185/322 (56%),
Gaps = 7/322 (2%)

Query 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
          SS+++   LN+F  F K G E ++L  A    K  +K+ +++G YGP W      F C +
Sbjct 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

Query 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
           DP K +K  G+KSYI Y+L PT+T   V+ RYKHFDWLY RL  KF   I +P LP+KQ
Sbjct 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

Query 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
           TGRFEE+FI  R + L  WM  M  HPV+++ +VFQ FL   +  DEK WK GKRKAE+
Sbjct 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371
24
Blink Protein BLAST Alignments
  • Lists only 200 hits
  • List is nonredundant

25
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
        aac cgg gtg acg gtg gtg ctc ggt gcg cag tgg ggc gac gaa ggc
Human    N   R   V   T   V   V   L   G   A   Q   W   G   D   E   G
                 V           V   L   G       Q   W   G   D   E   G
A.th.    S   Q   V   S   G   V   L   G   C   Q   W   G   D   E   G
        agt caa gta tct ggt gta ctc ggt tgc caa tgg gga gat gaa ggt
BLASTp finds three matching words
BLASTn finds no match, because the two sequences
share no identical 7 bp words
Protein searches are generally more sensitive
than nucleotide searches.
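The ADSS comparison can be reproduced with a short script: translate both DNA fragments, then look for words shared at the DNA level and at the protein level. This is a sketch only; the function names are hypothetical, it counts exact shared words (it does not model BLAST's neighborhood words or scoring thresholds), and it uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered ttt, ttc, tta, ttg, tct, ...
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate lowercase DNA in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def shared_words(a, b, k):
    """Return the set of length-k words that occur in both sequences."""
    def words(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return words(a) & words(b)


# The two fragments from the slide.
human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
athal_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

human_protein = translate(human_dna)
athal_protein = translate(athal_dna)

dna_words = shared_words(human_dna, athal_dna, 7)              # minimum blastn word size
protein_words = shared_words(human_protein, athal_protein, 3)  # default protein word size
```

Synonymous codon differences break up the DNA identity, so `dna_words` comes back empty while `protein_words` does not.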
26
Translated BLAST
N = nucleotide, P = protein
Particularly useful for nucleotide sequences
without protein annotations, such as ESTs or
genomic DNA

Query   Database   Program
N       P          blastx
P       N          tblastn
N       N          tblastx
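Translated searches compare sequences at the protein level by translating the nucleotide side in all six reading frames (three per strand). The following is a minimal sketch of that translation step only, using the standard genetic code; the function names and frame labels are hypothetical, and no searching is performed.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate uppercase DNA in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the protein translation of each of the six reading frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
```

Each of the six translations can then be treated as an ordinary protein sequence, which is why the approach works for ESTs and unannotated genomic DNA.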
27
Linking Protein Sequence, Structure, and Function
Protein sequences
Protein
CDD Conserved functional domains in proteins
represented by a PSSM
Domains
PSI-BLAST, RPS-BLAST, CDART
3D Domains
28
Position Specific Substitution Rates
Active site serine
Weakly conserved serine
29
Position Specific Score Matrix (PSSM)
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D  0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S  4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C  0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q  0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine is scored differently in these two
positions
Active site nucleophile
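To see how a PSSM scores the same residue differently by position, the sketch below copies a few rows of the matrix on this slide (positions 211 and 214 to 219) and sums the column values for a short segment. The function and variable names are hypothetical; the numbers are the slide's.

```python
# Column order used by the PSSM on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Selected rows copied from the slide, keyed by position number.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}


def pssm_score(segment, start):
    """Score an ungapped segment whose first residue sits at PSSM position start."""
    return sum(
        PSSM_ROWS[start + offset][AMINO_ACIDS.index(residue)]
        for offset, residue in enumerate(segment)
    )


weak_serine = pssm_score("S", 211)      # serine at position 211
active_serine = pssm_score("S", 216)    # serine at position 216
motif = pssm_score("GDSGGP", 214)       # consensus residues at 214-219
mutant = pssm_score("GDAGGP", 214)      # serine at 216 replaced by alanine
```

A position-independent matrix such as BLOSUM62 gives serine the same self-score (4) everywhere, so it cannot make this distinction.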
30
PSI-BLAST
Create your own PSSM
Confirming relationships of purine nucleotide
metabolism proteins
[Diagram labels: query, BLOSUM62, Alignment, PSSM,
Alignment]
31
PSI-BLAST
E value cutoff for PSSM
32
PSI Results: Initial BLAST Run
33
First PSSM Search
Other purine nucleotide metabolizing enzymes not
found by ordinary BLAST
34
Third PSSM Search: Convergence
Just below threshold, another nucleotide
metabolism enzyme
35
Entrez Domains (CDD)
16,482 records
A Database of Position Specific Score Matrices

CDD 2%
  • NCBI Curated Alignments

SMART 4%
  • EMBL
  • HMM based models originally concentrating on
    eukaryotic signaling domains, now expanding

LOAD 0.3%
  • NCBI
  • Library of Ancient Domains

Pfam 35%
  • Sanger Center
  • Pfam-A seeds
  • HMM based models representing a wide variety
    of functional domains derived from SWISS-PROT

KOG 29%
  • NCBI
  • Eukaryotic COGs

COG 30%
  • NCBI
  • BLAST based alignments derived from complete
    proteomes of unicellular organisms
