Title: Using Entrez
1Using Entrez
- The Life Sciences Search Engine
2Searching NCBI Databases Efficiently
- Knowing how to retrieve the exact information you need efficiently is the fundamental and most important skill in bioinformatics.
- Every NCBI database is designed and created for a specific purpose.
- A common mistake bioinformatics novices make is searching for information in an inappropriate database.
- Entrez links records among and within databases, making it easier to search for information.
3What is Entrez?
- Entrez is an NCBI retrieval system designed for searching several linked databases.
- Entrez is a search tool for integrated access to the biological literature and sequence data.
- Entrez is extremely powerful, enabling the user to move quickly between the different specialized databases.
4Entrez
- Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search.
- When you conduct a search via Entrez, your query generates this screen, telling you the number of hits to your query.
5The Entrez System
6The Big Picture
Books
UCSC
PubMed
PopSet
e!
GDB
Nucleotide
ProbeSet
MGC
Genome
Protein
Entrez
LocusLink
HGMD
Taxonomy
OMIM
Homologene
Structure
SNP
CDD
UniSTS
MapViewer
3D Domains
7Entrez and LocusLink
- Entrez doesn't link to all the databases that contain sequences, however!
- LocusLink has its own groups of links to specialty databases, since it doesn't cover all the genomes yet.
8Entrez Database Integration
Word weight
Phylogeny
3-D Structure
3 -D Structure
VAST
Protein sequences
BLAST
BLAST
9The (ever) Expanding Entrez System
PubMed
Nucleotide
UniGene
Protein
Journals
Structure
Genome
CDD
PopSet
SNP
OMIM
3D Domains
Taxonomy
UniSTS
Books
ProbeSet
10Entrez Databases
- PubMed: Biomedical literature
- Books: Online textbooks
- Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
- Protein: GenBank, EMBL, DDBJ, RefSeq, SWISS-PROT, PIR, PRF, PDB
- Genome: Complete genomes
- Taxonomy: Organisms in NCBI sequence databases
- Structure: MMDB experimental 3D structures
- Domains: CDD conserved protein domains
- 3D Domains: Compact 3D protein domains in MMDB
- OMIM: Online Mendelian Inheritance in Man
- SNP: Single nucleotide polymorphisms
- UniSTS: Sequence Tagged Site markers
- ProbeSet: Gene expression and microarray datasets
- PopSet: Population study datasets
- UniGene: Gene-based expressed sequence clusters
11Nucleotide Database
- The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite international collaboration of sequence databases.
- EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK.
- DDBJ is the DNA Database of Japan in Mishima, Japan.
- Sequence data are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM.
- Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and, via the collaborating international databases, from other international patent offices.
12Entrez Nucleotides
- Primary
- GenBank / EMBL / DDBJ 35,116,960
- Derivative
- RefSeq 259,219
- Third Party Annotation 3,182
- PDB 4,703
- Total 35,384,248
13Database Searching with Entrez
- Using limits and field restriction to find plant g6pdh
- Linking and neighboring with g6pdh
14Entrez Nucleotides
The G6PD enzyme catalyzes the oxidation of
glucose-6-phosphate to 6-phosphogluconolactone
(which is then hydrolyzed to 6-phosphogluconate),
while reducing nicotinamide adenine dinucleotide
phosphate (NADP+) to NADPH. In terms of electron
transfer, glucose-6-phosphate loses two electrons
and NADP+ gains two electrons to become NADPH.
This is the first step in the pentose phosphate
pathway. This pathway, or shunt, as it is
sometimes called, produces the 5-carbon sugar
ribose, which is an essential component of both
DNA and RNA.
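The electron bookkeeping described above can be written as a single reaction. This is a sketch of the standard textbook form, not taken from the slides. Strictly, the immediate product is the lactone, which is then hydrolyzed to 6-phosphogluconate. The two electrons move from glucose-6-phosphate to NADP+ as a hydride. The `\xrightarrow` command assumes the `amsmath` package.

```latex
\[
\text{glucose-6-phosphate} + \mathrm{NADP^{+}}
\;\xrightarrow{\;\text{G6PD}\;}\;
\text{6-phosphoglucono-}\delta\text{-lactone} + \mathrm{NADPH} + \mathrm{H^{+}}
\]
```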
16Limits Are Helpful
- Limits allow restriction of a search to a defined subset of the database.
- Limits can be set to restrict a search to a particular database field (e.g., the Author field).
- Limits can be set to search everything but a particular type of data (e.g., exclude patent records).
- Alternatively, limits can be set to search only a particular type of data (e.g., Genomic RNA/DNA) or to search only data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible.
- The contents of each Entrez database differ, and therefore the Limits available for each database differ.
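The same kinds of limits can be typed directly into the search box as a field-restricted query. The sketch below only assembles such a query string and contacts no server. The helper `build_query` is hypothetical. The specific values are assumptions chosen for illustration and should be checked against the Limits page of the database you are searching: the `Viridiplantae` organism name, the `1000:3000` length range, and the `gbdiv_pat` property used to exclude patent records.

```python
def build_query(term, fields=None, exclude=None):
    """Combine a search term with field restrictions and exclusions.

    fields  -- {field name: value}; each is added as AND value[Field]
    exclude -- {field name: value}; each is added as NOT value[Field]
    """
    parts = [term]
    for field, value in (fields or {}).items():
        parts.append(f"{value}[{field}]")
    query = " AND ".join(parts)
    for field, value in (exclude or {}).items():
        query += f" NOT {value}[{field}]"
    return query


# Assumed example: plant g6pdh records of moderate length, no patents.
query = build_query(
    "g6pdh",
    fields={"Organism": "Viridiplantae", "Sequence Length": "1000:3000"},
    exclude={"Properties": "gbdiv_pat"},
)
print(query)
```

Keeping the restrictions in dictionaries makes it easy to add or drop one limit at a time and compare hit counts, much as you would with the Limits and Preview pages.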
17Entrez Nucleotides Limits Preview/Index
Try using the Limits and Preview functions to hone
your search to find the plant G6PD genes.
18Entrez Nucleotides Limits
Exclude bulk sequences
19Entrez Nucleotides Limits
20Document Summaries Limits
21Adding Terms Preview/Index
Accession, All Fields, Author Name, EC/RN Number,
Feature key, Filter, Gene Name, Issue, Journal Name,
Keyword, Modification Date, Organism, Page Number,
Primary Accession, Properties, Protein Name,
Publication Date, SeqID String, Sequence Length,
Substance Name, Text Word, Title Word, Uid, Volume
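Field-restricted terms like these can also be sent to Entrez programmatically through NCBI's E-utilities, which the slides do not cover. The sketch below only builds an ESearch URL and does not fetch it. The function name `esearch_url` and the example term are illustrative. `db`, `term`, and `retmax` are standard ESearch parameters, but check the current E-utilities documentation before relying on them.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the ESearch utility (part of NCBI E-utilities).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL for a database and a field-tagged term."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and brackets so the term survives in a URL.
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url(
    "nucleotide",
    "g6pdh[Gene Name] AND Arabidopsis thaliana[Organism]",
)
print(url)

# Decoding the query string recovers the original term unchanged.
decoded = parse_qs(urlparse(url).query)
print(decoded["term"][0])
```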
22Plant cytosolic g6pdh mRNAs
23Database Neighbors and Interlinking
- What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases.
- Links within a database are called neighbors (e.g., Nucleotide neighbors).
24Links Between Databases
- Protein and Nucleotide neighbors are determined by performing similarity searches using the BLAST algorithm to compare the entry's amino acid or DNA sequence to all other amino acid or DNA sequences in the database. We will discuss BLAST in more detail later.
- Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published.
- Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
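One way to picture neighbors and links is as a small graph: each record points to records in the same database (neighbors) and in other databases (links). The sketch below is a toy model with made-up identifiers (`P1`, `N1`, `PM1`, ...). It does not reflect how Entrez stores links internally.

```python
# Hypothetical records: (database, id) -> {target database: [ids]}
LINKS = {
    ("protein", "P1"): {"nucleotide": ["N1"], "protein": ["P2"]},
    ("nucleotide", "N1"): {"pubmed": ["PM1"], "nucleotide": ["N2"]},
}


def follow(db, uid, path):
    """Follow links through the databases named in path, in order.

    Returns the list of (database, id) pairs reached at the end.
    """
    current = [(db, uid)]
    for target in path:
        reached = []
        for record in current:
            for target_id in LINKS.get(record, {}).get(target, []):
                reached.append((target, target_id))
        current = reached
    return current


# Protein -> source nucleotide record -> PubMed citation.
citations = follow("protein", "P1", ["nucleotide", "pubmed"])
# A link back into the same database is a neighbor.
neighbors = follow("protein", "P1", ["protein"])
print(citations, neighbors)
```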
25Plant cytosolic g6pdh mRNAs
26LinkOut
- LinkOut is a feature of Entrez that is designed to provide users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible online resources:
- Full-text publications
- Other biological databases
- Consumer health information
- Research tools
- The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases.
27Protein Database
- The protein database includes proteins from translated regions of DNA in GenBank as well as sequences from PIR.
- The entry includes
- The name of the protein
- How the protein sequence was derived
- An accession and a PID number
- The number of amino acids
28Protein Entry
- The Entry also includes
- Structural information for the protein (if known)
- Helices and β-sheets
- Domains
- Etc
- The sequence of amino acids comprising the protein
29Setting Protein Database search limits
- Choose Protein from the drop-down menu
- Can do a Boolean search
- Or can set LIMITS
- Fields (e.g., Author, Journal, etc.)
- Gene Location (genomic, mitochondrial, etc.)
- Segmented Sequence
- Only from (Database to check)
- Modification date
30Linking Between Databases
- Sometimes you will pull up a record and have no idea what organism the gene you are looking at is from.
- For example, in the following record, what is Medicago sativa?
31Entrez GenBank / GenPept
32Taxonomy to the Rescue
- Entrez lets you click a live link from the record and determine what organism Medicago sativa is.
- It is alfalfa.
- You can also tell what it is related to taxonomically, because sometimes the common name isn't very useful either!
33Taxonomy Link
34Advanced Neighbors BLink
35What is BLink
- BLink = BLAST Link
- Someone has done a BLAST search already, and you can just retrieve it!
- BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database.
36This graphical output includes
- Alignment of up to 200 BLAST hits on the query sequence
- Best hits to each organism
- List of known protein domains in the query sequence
- Filter hits by selecting the BLAST cutoff score
- Distribution of hits by taxonomic grouping
- Display of similar sequences with known 3D structure
- Filter hits by database and/or by taxonomic grouping
- Display a taxonomic tree of all organisms with similar sequences
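Several of the options above (a score cutoff, a taxonomic filter, best hit per organism) are simple operations on a list of hits. The sketch below applies them to a small invented hit list. The `Hit` class, accessions, and scores are hypothetical, and this is not how BLink itself is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    accession: str
    organism: str
    taxon: str   # broad taxonomic grouping
    score: int   # BLAST score


def filter_hits(hits, min_score=0, taxon=None):
    """Keep hits at or above the cutoff, optionally from one taxon only."""
    return [
        h for h in hits
        if h.score >= min_score and (taxon is None or h.taxon == taxon)
    ]


def best_per_organism(hits):
    """Return the highest-scoring hit for each organism."""
    best = {}
    for h in hits:
        if h.organism not in best or h.score > best[h.organism].score:
            best[h.organism] = h
    return best


# Invented data for illustration only.
hits = [
    Hit("A1", "Medicago sativa", "plants", 900),
    Hit("A2", "Medicago sativa", "plants", 750),
    Hit("A3", "Homo sapiens", "metazoa", 400),
    Hit("A4", "Escherichia coli", "bacteria", 150),
]

strong = filter_hits(hits, min_score=200)
plants = filter_hits(hits, taxon="plants")
best = best_per_organism(hits)
print([h.accession for h in strong], best["Medicago sativa"].accession)
```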
37PopSet Links
- The PopSet database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study.
- These alignments describe such events as evolution and population variation.
- The PopSet database contains both nucleotide and protein sequence data.
38Protein Neighbors -> PopSet Links
39Protein Neighbors -> Genome Links
40PopSet search results
- The results of a PopSet search
- The PopSet database includes alignments of genes
from multiple organisms OR different gene
families OR mutational analyses
41PopSet Entry
- The PopSet entry includes
- The title of the paper/study
- The length of the sequence(s) aligned
- The number of aligned sequences
42PopSet Entry without alignment
- The PopSet Entry without an alignment
- Title of the study
- The number of sequences included
- Links to the sequences
43Entrez Structures
44Protein Structures can also be in databases
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
is a useful review tutorial.
45Entrez links to structure databases
- The Structure database, or Molecular Modeling Database (MMDB), contains experimental data from crystallographic and NMR structure determinations.
- The data for MMDB are obtained from the Protein Data Bank (PDB).
- The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
- Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
46Structure Search results
- Protein structures are also stored in a database
- Search as before
- Your search results are similar
47Structure Entry
- The Structure entry has links to the other databases
- It also lets you download a file to open with a structure viewer program
48- Proteins with similar structures and functions
have been identified in the databases
49BLink Advanced Protein Neighbors
50BLink Related Structures
51Viewing Structure in Cn3D
- You can download Cn3D (a structure viewer program) from NCBI
- This will allow you to view the structures from the Structure database
52Cn3D Text Window
- The Text window of Cn3D will align two or more
proteins so you can compare the structure of
multiple proteins
53BLink Human Homologue
54Human RefSeqs Genome Reagents
55MMDB Molecular Modeling Data Base
- Derived from experimentally determined PDB
records - Value added to PDB records including
- Addition of explicit chemical graph information
- Validation
- Inclusion of Taxonomy, Citation,
- and other information
- Conversion to ASN.1 data description language
- Structure neighbors determined by
- Vector Alignment Search Tool (VAST)
56Structure Summary
Cn3D viewer
Structure Neighbors
3D Domain Neighbors
Conserved Domains
57Cn3D 4.1
58Cn3D 4.1 Structural Alignment
Conserved ATP binding site
Src Kinase H. sapiens
Casein kinase S. pombe
59Cn3D Simple Homology Modeling
human
swordtail
60Using Cn3D to model domains
61Other services and databases from the NCBI
- LocusLink connects to all possible information from NCBI and beyond for a few well-characterized model organisms.
- LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms.
- RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635)
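The two-letter prefix of a RefSeq accession tells you what kind of molecule the record describes. The sketch below classifies accessions by prefix. The function name `classify_refseq` is hypothetical. The table covers only four common prefixes: NM_ and NP_ appear on this slide, and XM_ and XP_ are added here as the predicted (model) counterparts. It is not a complete list of RefSeq prefixes.

```python
import re

# Partial prefix table; RefSeq defines more prefixes than these.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
}

# Two letters, an underscore, digits, and an optional .version suffix.
ACCESSION_PATTERN = re.compile(r"([A-Z]{2})_(\d+)(?:\.(\d+))?")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq accession, or None."""
    match = ACCESSION_PATTERN.fullmatch(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ("NM_006744", "NP_007635", "U12345"):
    print(acc, classify_refseq(acc))
```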
62Locus Links
- Results of a LocusLink search include
- Locus ID
- Species
- Locus symbol
- Locus name
- Locus location
- Links
- Protein Database
- OMIM
- Reference Sequence
- Related GenBank Sequences
- Homologene Data
- UniGene
- Variation Data
63LocusLink Selected Higher Genomes
64Protein Database
- The Protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to:
- Protein Information Resource (PIR)
- SWISS-PROT
- Protein Research Foundation (PRF)
- Protein Data Bank (PDB) (sequences from solved structures)
65NCBI Protein Databases
- GenPept: GenBank, EMBL, DDBJ CDS translations
- RefSeq: mRNA based (NP_) and genome based (XP_)
- Swiss-Prot: curated, high-quality protein reviews
- PIR: Protein Information Resource, Georgetown University
- PRF: Protein Research Foundation
- PDB: Protein Data Bank, sequences from structures
66Entrez Protein
- GenPept (GB, EMBL, DDBJ) 3,442,298
- RefSeq 856,191
- Third Party Annotation 3,834
- Swiss-Prot 144,508
- PIR 282,821
- PRF 12,079
- Total 3,442,298
- BLAST nr 1,642,191
67Protein Link
BLAST Link
Conserved Domains
68Related Proteins Redundancy
Redundant Sequences
69Related Proteins Links
70BLink non-redundant relatives
Arabidopsis homolog
Conserved Domain
71MLH1 Domain Structure CDD
72MLH1 ATPase Domain
731BGQ ATPase Domain in Cn3D
Yeast HSP90
ATP Binding site helix
74Variations Human MLH1
75BLink
Finding structural models
76Mapping Variation Onto Structure
Loads sequence alignment and structure in Cn3D
Bacterial DNA mismatch repair proteins
77Mapping Variation Onto Structure
Asn
Ile
Conserved Asn
Ile Val
78NCBI Genome Databases
- The Genome database provides views for a variety
of genomes, complete chromosomes, sequence maps
with contigs, and integrated genetic and physical
maps.
79Microbial Genomes
ZWF
80Genome search results
- Genome Search Results
- The Genome database includes full (and some
partial) genomes from viruses to complex organisms
81Genome Entry
- Genome entries include
- Maps of the genome
- Links to the sequence
- The organism for the genome
82Genes Database All Genomes
Coming soon!
83Genes Database All Genomes
84Genes Database All Genomes
85But wait! There's more!
- There is even more at NCBI than I have covered here.
- This site map is also a guide to NCBI resources. Each link leads to a brief description of the resource on this page, then to the resource itself. http://www.ncbi.nlm.nih.gov/Sitemap/
86There are many bioinformatics servers outside
NCBI.
- Try ExPASy's sequence retrieval system at http://www.expasy.ch/
- (ExPASy = Expert Protein Analysis System)
- Or try ENSEMBL at www.ensembl.org for a premier human genome web browser.