National Center for Biotechnology Information Web Tools Introduction PowerPoint PPT Presentation


Title: National Center for Biotechnology Information Web Tools Introduction


1
National Center for Biotechnology Information Web
Tools (Introduction)
  • Craig R. Street, M. Biot.
  • Biomedical Informatics Facility
  • 20 April 2004

2
Types of Databases
  • Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, SNP
  • Derivative Databases
  • Built from primary data
  • Content controlled by third party (NCBI)
  • Examples: RefSeq, UniGene, NCBI Protein,
    Structure, Conserved Domain

3
A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), 4548-4559 (1998)
  MEDLINE   98279067
  PUBMED    9614231
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
4
GenBank Overview
  • Comprehensive Public Database of Nucleotide
    Sequences
  • Composed Primarily of Submission of Sequence Data
    from Authors and Bulk Submission of ESTs, GSSs,
    and Other High-Throughput Data
  • Continues to Grow at an Exponential Rate with 9
    Million New Sequences Added in 2003

5
GenBank Overview
  • Complete Genomes Represent a Growing Portion of
    the Database
  • Over a Dozen Eukaryote Genomes Are Now Available
  • Each Entry Contains a Concise Description of the
    Sequence, the Scientific Name and Taxonomy of the
    Source Organism, Bibliographic References and a
    Table of Features

6
GenBank Overview
  • Record Features Include Source, Gene, Variations
    (SNPs, etc.), Coding Sequence, etc.
  • Each Record (Consisting of a Sequence and its
    Annotations) is Assigned a Stable and Unique
    Accession Number
  • ESTs Continue to be a Major Source of New Sequence
    Records

7
GenBank Overview
  • Sequence Records Are Accessible via NCBI's
    Retrieval System, Entrez
  • Sequence-Similarity Searches are Performed on
    GenBank Data Using the BLAST Family of Tools

8
GenBank Record: Locus
LOCUS       AF062069     3808 bp    mRNA    linear   INV 23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), 4548-4559 (1998)
  MEDLINE   98279067
  PUBMED    9614231
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Fields of the LOCUS line
  • Locus name: AF062069
  • Length: 3808 bp
  • Molecule type: mRNA
  • Division: INV
  • Modification Date: 23-OCT-2002
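The labelled fields of the LOCUS line can be pulled apart programmatically. The following is a minimal sketch, not an official parser: it assumes the line has exactly the eight whitespace-separated tokens shown on this slide, and the names `Locus` and `parse_locus` are made up for illustration. Real GenBank records may use fixed columns or omit fields, so a production parser would need to be more careful.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    name: str
    length_bp: int
    molecule_type: str
    topology: str
    division: str
    modification_date: str


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line on whitespace and label each field.

    Assumes the eight-token layout shown on the slide:
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, topology, division, date = tokens
    return Locus(name, int(length), molecule, topology, division, date)


locus = parse_locus("LOCUS AF062069 3808 bp mRNA linear INV 23-OCT-2002")
print(locus)
```

The division code (here INV) and the modification date are the last two tokens, which is why the sketch rejects lines with any other token count instead of guessing.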
9
Actin Record (FASTA Format)
10
RefSeq: NCBI's Derivative Sequence Database
  • Curated transcripts and proteins
  • reviewed
  • human, mouse, rat, fruit fly, zebrafish,
    arabidopsis
  • Model transcripts and proteins
  • Assembled Genomic Regions (contigs)
  • human genome
  • mouse genome
  • Chromosome records
  • Human genome
  • microbial
  • organelle

11
RefSeq Benefits
  • non-redundancy  
  • explicitly linked nucleotide and protein
    sequences
  • updates to reflect current sequence data and
    biology
  • data validation
  • format consistency
  • distinct accession number prefix
  • stewardship by NCBI staff and collaborators

12
RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted mRNA
  XP_123456  Predicted Protein
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference Genomic Sequence
Chromosome
  NC_123455  Microbial replicons, organelle genomes, human chromosomes
Assemblies
  NT_123456  Contig
  NW_123456  WGS Supercontig
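Because the two-letter prefix carries the record type, an accession can be classified without looking the record up. A small sketch, assuming only the prefixes listed on this slide; the function name `describe_refseq` is hypothetical and the table is not a complete list of RefSeq prefixes.

```python
# Prefix meanings copied from the slide above; not an exhaustive list.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome (microbial replicons, organelle genomes, human chromosomes)",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def describe_refseq(accession: str) -> str:
    """Return the record type implied by a RefSeq-style prefix."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore or prefix.upper() not in REFSEQ_PREFIXES:
        return "not a RefSeq accession in this table"
    return REFSEQ_PREFIXES[prefix.upper()]


print(describe_refseq("NM_123456"))
print(describe_refseq("AF062069"))
```

The underscore is what distinguishes these accessions from traditional GenBank accessions such as AF062069, so the sketch treats its absence as "not RefSeq".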
13
Text: Entrez
Sequence: BLAST
Structure: VAST
14
Entrez Data Retrieval System
  • Provides Integrated Access to a Wide Range of
    Data Domains
  • Entrez is Grouped into Nodes. Each Node is a
    Collection of Data that is Grouped and Indexed
    Together and is Often Referred to as an Entrez
    Database

15
Entrez Data Retrieval System
  • Popular Nodes Include:
  • Nucleotide: sequence database (GenBank)
  • Protein: sequence database
  • Genome: whole genome sequences
  • SNP: single nucleotide polymorphism
  • PopSet: population study data sets

16
Entrez Data Retrieval System
17
Entrez Database Integration
[Diagram: Entrez nodes (PubMed abstracts, 3-D Structure, Taxonomy, Genomes, Protein sequences, Nucleotide sequences) connected by links based on word weight, phylogeny, VAST, and BLAST]
18
Global Entrez Search
19
Entrez Nucleotides: Limits, Preview/Index
20
Entrez Nucleotides: Limits
Field Restriction: Accession, All Fields, Author Name, EC/RN Number,
Feature key, Filter, Gene Name, Issue, Journal Name, Keyword,
Modification Date, Organism, Page Number, Primary Accession, Properties,
Protein Name, Publication Date, SeqID String, Sequence Length,
Substance Name, Text Word, Title, Uid, Volume
21
BLAST: Basic Local Alignment Search Tool
  • Why align sequences?
  • Because it is the best way to infer
    structure-function relationships for
    unknown biomolecules
  • Global vs Local Alignments
  • BLAST Basics
  • Interpretation of Results
  • PSI-BLAST

22
BLAST 2.0 (a.k.a. Gapped BLAST)
  • Calculates Similarity for Biological Sequences
  • Finds Best LOCAL Alignments
  • Heuristic Approach Based on Smith-Waterman
    Algorithm
  • Searches for Matching Words and then Extends
    the Hits
  • Uses Statistical Theory to Determine if a Match
    Might have Occurred by Chance

23
BLAST 2.0 (a.k.a. Gapped BLAST)
  • blastp: Compares an amino acid query sequence
    against a protein sequence database.
  • blastn: Compares a nucleotide query sequence
    against a nucleotide sequence database.
  • blastx: Compares a nucleotide query sequence
    translated in all reading frames against a
    protein sequence database.
  • tblastn: Compares a protein query sequence against
    a nucleotide sequence database dynamically
    translated in all reading frames.
  • tblastx: Compares the six-frame translations of a
    nucleotide query sequence against the
    six-frame translations of a nucleotide sequence
    database.
  • Please note that the tblastx program cannot be
    used with the nr database on the BLAST Web page
    because it is computationally intensive.
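The five programs above differ only in the type of the query, the type of the database, and whether either side is translated, so the choice can be written as a lookup. This is an illustrative sketch; `choose_program` and its arguments are invented names, not part of any BLAST interface.

```python
# (query type, database type) -> program, as described in the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Pick a BLAST program from the sequence types on each side.

    translate_both selects tblastx, which compares six-frame translations
    of a nucleotide query against six-frame translations of a nucleotide
    database.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_program("nucleotide", "protein"))
```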

24
BLAST Databases: Nucleic Acid
  • nr (nt)
  • Traditional GenBank Divisions
  • NM_ and XM_ RefSeqs
  • dbest
  • EST Division
  • htgs
  • HTG division
  • gss
  • GSS division
  • chromosome
  • NC_ RefSeqs
  • wgs
  • whole genome shotgun

25
BLAST Databases: Amino Acid
  • nr (non-redundant protein sequences)
  • GenBank CDS translations
  • NP_ RefSeqs
  • Outside Protein
  • PIR, Swiss-Prot, PRF
  • PDB (sequences from structures)

26
How BLAST Works
  • Make a lookup table of all words in the query
  • Scan the database for matching words
  • Initiate extensions from these matches

27
Words
Make a lookup table of words
Word Size: 3
GTQ TQI QIT ITV TVE VED EDL DLF LFY
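The word list on this slide is every overlapping window of length 3 taken from the query. A minimal sketch follows, using the query GTQITVEDLFYNI that appears on the next slide. It builds exact words only; whether and how the real program adds similar "neighbourhood" words is not covered here, and the function names are illustrative.

```python
def query_words(query: str, word_size: int = 3) -> list[str]:
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def lookup_table(query: str, word_size: int = 3) -> dict[str, list[int]]:
    """Map each word to the 0-based query positions where it starts."""
    table: dict[str, list[int]] = {}
    for position, word in enumerate(query_words(query, word_size)):
        table.setdefault(word, []).append(position)
    return table


words = query_words("GTQITVEDLFYNI")
table = lookup_table("GTQITVEDLFYNI")
print(words)
print(table["TVE"])
```

A 13-residue query yields 11 words; the slide lists the first nine.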
28
Scan Database / Initiate Extensions
Protein BLAST requires two hits
GTQITVEDLFYNI
<------ TVE FYN ------>
two words (threshold score)
Nucleotide BLAST requires exact matches
ATCGCCATGCTTAATTGGGCTT
<------ CATGCTTAATT ------>
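For the nucleotide case, the scan step amounts to finding places where a word from one sequence occurs exactly in the other. The sketch below assumes a word size of 11 (the length of the word on this slide) and treats the longer sequence as the one being scanned; it stops at seed finding and does not implement the extension step. `find_seeds` is an illustrative name.

```python
def find_seeds(query: str, subject: str, word_size: int = 11) -> list[tuple[int, int]]:
    """Return (query_start, subject_start) pairs of exact word matches."""
    # Index every word of the query by its starting position.
    index: dict[str, list[int]] = {}
    for q in range(len(query) - word_size + 1):
        index.setdefault(query[q:q + word_size], []).append(q)

    # Slide a window along the subject and look each word up in the index.
    seeds = []
    for s in range(len(subject) - word_size + 1):
        for q in index.get(subject[s:s + word_size], []):
            seeds.append((q, s))
    return seeds


seeds = find_seeds("CATGCTTAATT", "ATCGCCATGCTTAATTGGGCTT")
print(seeds)
```

Each seed is a starting point from which an extension would be attempted in both directions, as the arrows on the slide indicate.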
29
An Alignment that BLAST Can't Find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
30
Local Alignment Statistics
High scores of local alignments between two
random sequences follow the Extreme Value
Distribution
Expect Value (E): the number of database hits you
expect to find by chance
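The Expect value is commonly written using the Karlin-Altschul relation E = K·m·n·e^(−λS), or equivalently E = m·n·2^(−S') once the raw score S is converted to a bit score S' = (λS − ln K) / ln 2. The sketch below applies those two formulas. The values λ = 0.267 and K = 0.041 are an assumption (figures usually quoted for BLOSUM62 with gap costs 11/1) and are not given in this deck; the raw score 742 is the one shown on the BLAST Alignments slide.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a normalised bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_length: int, database_length: int) -> float:
    """Number of hits with at least this bit score expected by chance."""
    return query_length * database_length * 2.0 ** (-bits)


# Assumed statistical parameters; not taken from the slides.
LAMBDA, K = 0.267, 0.041

bits = bit_score(742, LAMBDA, K)
print(round(bits))
```

Under these assumed parameters the raw score 742 converts to about 290 bits, which agrees with the "290 bits (742)" reported on the alignment slide. The E value itself also depends on the query and database lengths, which the slides do not state, so no E value is computed here.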
31
Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A    1  -3  -3  -3
G   -3   1  -3  -3
C   -3  -3   1  -3
T   -3  -3  -3   1
CAGGTAGCAAGCTTGCATGTCA
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
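The raw score on this slide can be reproduced by walking down the alignment columns. The sketch uses +1 for a match and -3 for a mismatch, as in the identity matrix; the gap cost of -3 is an assumption inferred from the slide's arithmetic (19 matches, then 9 subtracted for two mismatches and one gap), not a stated BLAST default.

```python
def raw_score(top: str, bottom: str, match: int = 1,
              mismatch: int = -3, gap: int = -3) -> int:
    """Sum column scores for two aligned sequences of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # assumed flat cost per gap position
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


score = raw_score("CAGGTAGCAAGCTTGCATGTCA",
                  "CACGTAGCAAGCTTG-GTGTCA")
print(score)  # 19 matches, 2 mismatches, 1 gap -> 10
```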
32
Scoring Systems - Proteins
  • Position Independent Matrices
  • PAM Matrices (Percent Accepted Mutation)
  • Derived from observation: small dataset of
    alignments
  • Implicit model of evolution
  • All calculated from PAM1 (PAM250 widely used)
  • BLOSUM Matrices (BLOck SUbstitution Matrices)
  • Derived from observation: large dataset of
    highly conserved blocks
  • Each matrix derived separately from blocks with
    a defined percent identity cutoff
  • BLOSUM62 - default matrix for BLAST
  • Position Specific Score Matrices (PSSMs)
  • PSI-BLAST

33
BLOSUM62
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
34
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced: Word size, Expect value
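Entrez queries of this kind attach a field name in square brackets to each term and combine terms with Boolean operators such as NOT. The sketch below only assembles such strings; it does not contact NCBI, and the helper names `field_term` and `exclude` are invented for illustration.

```python
def field_term(value: str, field: str) -> str:
    """Attach an Entrez field restriction to a search value."""
    return f"{value}[{field}]"


def exclude(base: str, unwanted: str) -> str:
    """Combine two terms so that records matching the second are removed."""
    return f"{base} NOT {unwanted}"


query = exclude(field_term("nucleotide all", "Filter"),
                field_term("mammalia", "Organism"))
print(query)
```

The resulting string can be pasted into the "Limit by Entrez query" style box on an advanced BLAST page to restrict the database searched.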
35
Options for Advanced Blast: Protein
Example Entrez queries
proteins all[Filter] NOT mammalia[Organism]
green plants[Organism]
srcdb refseq[Properties]
Other advanced: Word size, Expect value
Limit by taxon: Mus musculus[Organism], Mammalia[Organism],
Viridiplantae[Organism]
  • Matrix Selection
  • PAM30 -- most stringent
  • BLOSUM45 -- least stringent

36
BLAST Graphical Overview
37
BLAST Alignments
gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding
  sequence F49B2.5) [Caenorhabditis elegans]
gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N
  myristoylation domains, Drosophila suppressor of pole hole homolog
  (57.5 kD) (kin-22) [Caenorhabditis elegans]
Length = 507
Score = 290 bits (742), Expect = 1e-78
Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)
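The percentages in the alignment header follow directly from the counts. A short worked check is sketched below; rounding down to a whole number is an assumption inferred from the figures on this slide (170/440 is about 38.6 but is shown as 38), not from BLAST documentation.

```python
def percent(count: int, total: int) -> int:
    """Whole-number percentage, rounded down."""
    return 100 * count // total


identities = percent(170, 440)
positives = percent(245, 440)
gaps = percent(21, 440)
print(identities, positives, gaps)
```

Identities count exactly matching residues, while Positives also include substitutions that score above zero in the matrix, which is why the second figure is larger.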
38
PSI-BLAST: Position-Specific Iterated BLAST
  • Mining for protein domains
  • Confirming relationships among related proteins

39
Position Specific Substitution Rates
Weakly conserved serine
Active site serine
40
Position Specific Score Matrix (PSSM)
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 D  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine scored differently in these two positions
Active site nucleophile
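Unlike a position-independent matrix, a PSSM is indexed by alignment position first and residue second, so the same residue can score differently in different columns. The sketch below copies the rows for positions 211 and 216 from the matrix above and looks up serine in each. Reading position 216 as the active-site serine is an inference from its place in the G-D-S-G-G-P run (positions 214 to 219) and its higher score; the transcript does not say which label goes with which row.

```python
# Column order used by the PSSM on this slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Two rows copied from the matrix above, keyed by alignment position.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3,
          -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6,
          -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position: int, residue: str) -> int:
    """Score for placing a residue at a given alignment position."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]


print(pssm_score(211, "S"), pssm_score(216, "S"))
print(pssm_score(211, "T"), pssm_score(216, "T"))
```

At position 211 serine scores 4 and threonine is tolerated with a score of 3, whereas at position 216 serine scores 7 and every substitution is penalised. For comparison, a serine-serine match in BLOSUM62 scores 4 regardless of position.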
41
PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
E-value cutoff for PSSM
42
LocusLink
  • Provides a Single Query Interface to Curated
    Sequence and Descriptive Information about
    Genetic Loci
  • LocusLink will be Replaced by Entrez Gene
  • Model Organisms Include C. elegans, chicken, cow,
    dog, human, human immunodeficiency virus type 1,
    mouse, pig, rat, S. purpuratus, X. laevis, X.
    tropicalis, and zebrafish.

43
LocusLink
  • Each LocusLink Record Consists of a Collection of
    Links to More Information About a Locus
  • A Unique Integer is Assigned to Each Locus

44
LocusLink Query Results Page
Links field: small, color-coded boxes indicate
when links are available for PubMed, OMIM,
RefSeq, GenBank Nucleotide, Protein, HomoloGene,
UniGene, and Variation data
45
Genome Resources Pages
46
Online Tutorials
http://www.ncbi.nlm.nih.gov/Education/index.html
47
Future Courses
  • Data Mining
  • Protein Structure
  • SQL (Structured Query Language)