Title: National Center for Biotechnology Information Web Tools Introduction
1National Center for Biotechnology Information Web
Tools (Introduction)
- Craig R. Street, M. Biot.
- Biomedical Informatics Facility
- 20 April 2004
2Types of Databases
- Primary Databases
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, SNP
- Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, UniGene, NCBI Protein, Structure, Conserved Domain
3A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), 4548-4559 (1998)
  MEDLINE   98279067
   PUBMED   9614231
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
4GenBank Overview
- Comprehensive Public Database of Nucleotide Sequences
- Composed Primarily of Submissions of Sequence Data from Authors and Bulk Submissions of ESTs, GSSs, and Other High-Throughput Data
- Continues to Grow at an Exponential Rate, with 9 Million New Sequences Added in 2003
5GenBank Overview
- Complete Genomes Represent a Growing Portion of the Database
- Over a Dozen Eukaryote Genomes Are Now Available
- Each Entry Contains a Concise Description of the Sequence, the Scientific Name and Taxonomy of the Source Organism, Bibliographic References and a Table of Features
6GenBank Overview
- Record Features Include Source, Gene, Variations (SNPs, etc.), Coding Sequence, etc.
- Each Record (Consisting of a Sequence and its Annotations) is Assigned a Stable and Unique Accession Number
- ESTs Continue to be a Major Source of New Sequence Records
7GenBank Overview
- Sequence Records Are Accessible via NCBI's Retrieval System, Entrez
- Sequence-Similarity Searches are Performed on GenBank Data Using the BLAST Family of Tools
8GenBank Record: Locus
LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), 4548-4559 (1998)
  MEDLINE   98279067
   PUBMED   9614231
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Fields on the LOCUS line:
- Locus name: AF062069
- Length: 3808 bp
- Molecule type: mRNA
- Division: INV
- Modification Date: 23-OCT-2002
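The fields labelled on the LOCUS line can be pulled apart with a few lines of Python. This is a minimal sketch, not an official parser: it assumes the whitespace-separated token order seen in this record (name, length, "bp", molecule type, topology, division, date), and the function name `parse_locus` is made up for illustration.

```python
def parse_locus(line):
    """Split a GenBank LOCUS line into the fields labelled on the slide."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    # Assumed token order: LOCUS, name, length, "bp", molecule type,
    # topology, division, modification date.
    return {
        "locus_name": tokens[1],
        "length": int(tokens[2]),
        "molecule_type": tokens[4],
        "topology": tokens[5],
        "division": tokens[6],
        "modification_date": tokens[7],
    }


record = parse_locus("LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002")
print(record["locus_name"], record["length"], record["division"])
```

Older or unusual records may lay out the LOCUS line differently, so a production script would be better served by an established GenBank parser.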
9Actin Record (FASTA Format)
10RefSeq: NCBI's Derivative Sequence Database
- Curated transcripts and proteins
- reviewed
- human, mouse, rat, fruit fly, zebrafish, Arabidopsis
- Model transcripts and proteins
- Assembled Genomic Regions (contigs)
- human genome
- mouse genome
- Chromosome records
- Human genome
- microbial
- organelle
11RefSeq Benefits
- non-redundancy
- explicitly linked nucleotide and protein sequences
- updates to reflect current sequence data and biology
- data validation
- format consistency
- distinct accession number prefix
- stewardship by NCBI staff and collaborators
12RefSeq Accession Numbers
mRNAs and Proteins
- NM_123456 Curated mRNA
- NP_123456 Curated Protein
- NR_123456 Curated non-coding RNA
- XM_123456 Predicted mRNA
- XP_123456 Predicted Protein
- XR_123456 Predicted non-coding RNA
Gene Records
- NG_123456 Reference Genomic Sequence
Chromosome
- NC_123455 Microbial replicons, organelle genomes, human chromosomes
Assemblies
- NT_123456 Contig
- NW_123456 WGS Supercontig
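Because each RefSeq record type has its own accession prefix, the type can be read straight off the accession. The sketch below simply encodes the prefixes listed on this slide; the dictionary and function names are illustrative, and prefixes not shown on the slide are deliberately left unrecognised.

```python
# Prefix-to-type table copied from the slide above.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Chromosome",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq accession, or None if the
    accession has no underscore or an unlisted prefix."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None
    return REFSEQ_PREFIXES.get(prefix)


print(describe_refseq("NM_123456"))  # a curated mRNA
print(describe_refseq("AF062069"))   # a GenBank accession, not RefSeq
```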
13Text: Entrez
Sequence: BLAST
Structure: VAST
14Entrez Data Retrieval System
- Provides Integrated Access to a Wide Range of Data Domains
- Entrez is Grouped into Nodes. Each Node is a Collection of Data that is Grouped and Indexed Together and is Often Referred to as an Entrez Database
15Entrez Data Retrieval System
- Popular Nodes Include
- Nucleotide: sequence database (GenBank)
- Protein: sequence database
- Genome: whole genome sequences
- SNP: single nucleotide polymorphism
- PopSet: population study data sets
16Entrez Data Retrieval System
17Entrez Database Integration
[Diagram labels: PubMed abstracts, 3-D Structure, Taxonomy, Genomes, Protein sequences, Nucleotide sequences; link labels: Word weight, Phylogeny, VAST, BLAST]
18Global Entrez Search
19Entrez Nucleotides Limits Preview/Index
20Entrez Nucleotides Limits
Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
21BLAST: Basic Local Alignment Search Tool
- Why align sequences?
- Because it is the best way to infer structure-function relationships for unknown biomolecules
- Global vs Local Alignments
- BLAST Basics
- Interpretation of Results
- PSI-BLAST
22BLAST 2.0 (a.k.a. Gapped BLAST)
- Calculates Similarity for Biological Sequences
- Finds Best LOCAL Alignments
- Heuristic Approach Based on Smith-Waterman Algorithm
- Searches for Matching Words and then Extends the Hits
- Uses Statistical Theory to Determine if a Match Might have Occurred by Chance
23BLAST 2.0 (a.k.a. Gapped BLAST)
- blastp: Compares an amino acid query sequence against a protein sequence database.
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
- Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
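The five programs differ only in the type of the query, the type of the database, and whether anything is translated, so the choice can be written as a small lookup. This is an illustrative sketch of the rules on this slide; `choose_program` and its `translate_both` flag are invented names, not part of any BLAST interface.

```python
# (query type, database type) -> program, as described on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query is translated
    ("protein", "nucleotide"): "tblastn",   # database is translated
}


def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program from the query and database types.

    translate_both=True asks for a protein-level comparison of two
    nucleotide inputs, which is what tblastx does.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```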
24BLAST Databases: Nucleic Acid
- nr (nt)
- Traditional GenBank Divisions
- NM_ and XM_ RefSeqs
- dbest
- EST Division
- htgs
- HTG division
- gss
- GSS division
- chromosome
- NC_ RefSeqs
- wgs
- whole genome shotgun
25BLAST Databases: Amino Acid
- nr (non-redundant protein sequences)
- GenBank CDS translations
- NP_ RefSeqs
- Outside Protein
- PIR, Swiss-Prot, PRF
- PDB (sequences from structures)
26How BLAST Works
- Make a lookup table of all words in the query
- Scan the database for matching words
- Initiate extensions from these matches
27Words
GTQ TQI QIT ITV TVE VED EDL DLF LFY
Make a lookup table of words
Word Size = 3
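The word list on this slide is every overlapping 3-letter window of the query. A minimal sketch follows, assuming the query is GTQITVEDLFY (the first eleven residues of the protein used on the next slide) and covering exact words only; the function names are illustrative.

```python
from collections import defaultdict


def query_words(query, word_size=3):
    """Return every overlapping word of length word_size, in order."""
    return [query[i:i + word_size]
            for i in range(len(query) - word_size + 1)]


def build_lookup(query, word_size=3):
    """Map each word to the list of query positions where it starts."""
    table = defaultdict(list)
    for position, word in enumerate(query_words(query, word_size)):
        table[word].append(position)
    return dict(table)


words = query_words("GTQITVEDLFY")
lookup = build_lookup("GTQITVEDLFY")
print(" ".join(words))
```

Storing positions rather than just the words lets a later scan step report where in the query each database match falls.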
28Scan Database, Initiate Extensions
Protein BLAST requires two hits
GTQITVEDLFYNI
<------ TVE FYN ------>
two words (threshold score)
Nucleotide BLAST requires exact matches
ATCGCCATGCTTAATTGGGCTT
<------ CATGCTTAATT ------>
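For the nucleotide case, the scan step amounts to looking up each database word in the query's word table and recording where exact matches occur. The sketch below assumes a word size of 11 (the length of the word shown on the slide) and uses a made-up database sequence; it finds seeds only and does not perform the extension.

```python
from collections import defaultdict


def exact_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word of length
    word_size occurs exactly in both sequences."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)

    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds


query = "ATCGCCATGCTTAATTGGGCTT"   # sequence from the slide
subject = "GGCATGCTTAATTCC"         # hypothetical database sequence
seeds = exact_seeds(query, subject)
print(seeds)
```

Each seed is a starting point from which an extension would be attempted in both directions.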
29An Alignment that BLAST Can't Find
1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
30Local Alignment Statistics
High scores of local alignments between two
random sequences follow the Extreme Value
Distribution
Expect Value: E = number of database hits with an equal or better score that you expect to find by chance
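Under the Karlin-Altschul statistics that BLAST reports, the Expect value follows from the normalized bit score S' as E = m * n * 2^(-S'), where m and n are the effective query and database lengths. The sketch below only evaluates that relation; the lengths used in the example are arbitrary illustrative numbers, not values from any real search.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for a normalized bit score S'.

    query_length and database_length are assumed to be the effective
    lengths (m and n) of the search space.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Each additional bit of score halves the number of chance hits expected.
e_30 = expect_value(30, 500, 1_000_000)
e_31 = expect_value(31, 500, 1_000_000)
print(e_30, e_31)
```

The same relation shows why E depends on database size: searching a database twice as large doubles E for the same score.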
31Scoring Systems - Nucleotides
Identity matrix
    A   G   C   T
A  +1  -3  -3  -3
G  -3  +1  -3  -3
C  -3  -3  +1  -3
T  -3  -3  -3  +1
CAGGTAGCAAGCTTGCATGTCA
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
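The raw score on this slide can be reproduced by walking the two aligned rows column by column. The sketch assumes +1 for a match and -3 for a mismatch, as in the identity matrix, and also charges -3 for the single gap column; that flat gap cost is an assumption chosen to match the slide's arithmetic, whereas BLAST itself uses separate gap opening and extension costs.

```python
def raw_score(row_a, row_b, match=1, mismatch=-3, gap=-3):
    """Sum per-column scores for two aligned rows of equal length.

    A '-' in either row is scored as a gap column (flat cost, assumed).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


score = raw_score("CAGGTAGCAAGCTTGCATGTCA",
                  "CACGTAGCAAGCTTG-GTGTCA")
print(score)  # 19 matches, 2 mismatches, 1 gap column
```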
32Scoring Systems - Proteins
- Position Independent Matrices
- PAM Matrices (Percent Accepted Mutation)
- Derived from observation; small dataset of alignments
- Implicit model of evolution
- All calculated from PAM1 (PAM250 widely used)
- BLOSUM Matrices (BLOck SUbstitution Matrices)
- Derived from observation; large dataset of highly conserved blocks
- Each matrix derived separately from blocks with a defined percent identity cutoff
- BLOSUM62: default matrix for BLAST
- Position Specific Score Matrices (PSSMs)
- PSI-BLAST
33BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
34Options for Advanced Blasting: Nucleotide
Example Entrez Queries
- nucleotide all[Filter] NOT mammalia[Organism]
- green plants[Organism]
- biomol mrna[Properties]
- biomol genomic[Properties]
Other Advanced
- Word size
- Expect value
35Options for Advanced Blasting: Protein
Example Entrez Queries
- proteins all[Filter] NOT mammalia[Organism]
- green plants[Organism]
- srcdb refseq[Properties]
Other Advanced
- Word size
- Expect value
Limit by taxon
- Mus musculus[Organism]
- Mammalia[Organism]
- Viridiplantae[Organism]
- Matrix Selection
- PAM30 -- most stringent
- BLOSUM45 -- least stringent
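The Entrez queries used to limit a BLAST search follow a simple pattern: a term, a field name in square brackets, and Boolean operators such as NOT between tagged terms. A small sketch of composing such strings follows; the helper names `tag` and `exclude` are invented for illustration and only build text, they do not contact Entrez.

```python
def tag(term, field):
    """Attach an Entrez field name to a term, e.g. Mammalia[Organism]."""
    return f"{term}[{field}]"


def exclude(base, excluded):
    """Combine two tagged terms with the Boolean NOT operator."""
    return f"{base} NOT {excluded}"


non_mammal = exclude(tag("nucleotide all", "Filter"),
                     tag("mammalia", "Organism"))
mouse_only = tag("Mus musculus", "Organism")
print(non_mammal)
print(mouse_only)
```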
36BLAST Graphical Overview
37BLAST Alignments
gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans]
gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans]
Length = 507
Score = 290 bits (742), Expect = 1e-78
Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)
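The Identities, Positives and Gaps figures are all counts over the same alignment length, so they are easy to extract and re-derive. The sketch below is a hypothetical helper rather than a full BLAST report parser; it uses floor division for the percentages, which happens to reproduce the figures shown on this slide.

```python
import re


def parse_fraction_stats(line):
    """Extract 'Name count/total' items and compute whole-number percentages.

    Returns {name: (count, total, percent)}; the '=' sign is optional so
    the function tolerates reports where it has been stripped.
    """
    stats = {}
    for name, count, total in re.findall(r"(\w+)\s*=?\s*(\d+)/(\d+)", line):
        count, total = int(count), int(total)
        stats[name] = (count, total, 100 * count // total)
    return stats


stats = parse_fraction_stats(
    "Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)"
)
print(stats)
```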
38PSI-BLAST: Position-Specific Iterated BLAST
- Mining for protein domains
- Confirming relationships among related proteins
39Position Specific Substitution Rates
Weakly conserved serine
Active site serine
40Position Specific Score Matrix (PSSM)
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D  0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 D -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S  4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C  0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q  0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
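The point about serine can be checked by reading the S column of the two serine rows in the matrix. The sketch copies rows 211 and 216 from the PSSM on this slide; the names `PSSM_ROWS` and `pssm_score` are illustrative.

```python
# Column order of the PSSM on the slide.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows 211 (S) and 216 (S), copied from the slide.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3,
          -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6,
          -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for aligning `residue` at the given PSSM position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


print(pssm_score(211, "S"), pssm_score(216, "S"))  # serine at each position
print(pssm_score(211, "T"), pssm_score(216, "T"))  # threonine substitution
```

At position 211 serine scores the same as the position-independent BLOSUM62 value for S-S, and threonine is tolerated; at position 216 serine scores higher and every substitution is penalised.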
41PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
E-value cutoff for PSSM
42LocusLink
- Provides a Single Query Interface to Curated Sequence and Descriptive Information about Genetic Loci
- LocusLink will be Replaced by Entrez Gene
- Model Organisms Include C. elegans, chicken, cow, dog, human, human immunodeficiency virus type 1, mouse, pig, rat, S. purpuratus, X. laevis, X. tropicalis, and zebrafish.
43LocusLink
- Each LocusLink Record Consists of a Collection of Links to More Information About a Locus
- A Unique Integer is Assigned to Each Locus
44LocusLink Query Results Page
Links field: small, color-coded boxes indicate when links are available for PubMed, OMIM, RefSeq, GenBank Nucleotide, Protein, HomoloGene, UniGene, and Variation data
45Genome Resources Pages
46Online Tutorials
http://www.ncbi.nlm.nih.gov/Education/index.html
47Future Courses
- Data Mining
- Protein Structure
- SQL (Structured Query Language)