Using Entrez - PowerPoint PPT Presentation

Slides: 87
Provided by: Mur2
Transcript and Presenter's Notes

Title: Using Entrez


1
Using Entrez
  • The Life Sciences Search Engine

2
Searching NCBI Databases Efficiently
  • Knowing how to retrieve the exact information you
    need in an efficient way is the fundamental and
    most important skill in Bioinformatics.
  • Every NCBI database is designed and created for
    a specific purpose.
  • A common mistake Bioinformatics novices make is
    searching for information in an inappropriate
    database.
  • Entrez links among and within databases, making
    it easier to search for information.

3
What is Entrez?
  • Entrez is an NCBI retrieval system designed for
    searching several linked databases.
  • Entrez is a search tool for integrated access to
    the biological literature and sequence data.
  • Entrez is extremely powerful, enabling the user
    to quickly move between the different specialized
    databases.

4
Entrez
  • Entrez is divided into sites for nucleotide,
    protein, structure, genomes, OMIM, and more. You
    can use limits (such as RefSeq) to focus your
    Entrez search.
  • When you conduct a search via Entrez, your query
    generates this screen, telling you the number of
    hits to your query.
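
The per-database hit counts described above can also be requested programmatically. The sketch below is a minimal illustration, assuming NCBI's E-utilities ESearch endpoint (not mentioned in the slides) and the database names `pubmed`, `nuccore`, `protein`, and `structure`. It only builds the request URLs and parses a reply; the sample reply is made up for the demonstration and is not real data.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: the public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(db, term):
    """Build an ESearch URL that asks only for the number of hits."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "rettype": "count"})


def parse_count(xml_text):
    """Extract the hit count from an ESearch XML reply."""
    return int(ET.fromstring(xml_text).findtext("Count"))


# One URL per database, mirroring the "hits per database" screen.
urls = {db: count_url(db, "g6pdh")
        for db in ("pubmed", "nuccore", "protein", "structure")}

# Made-up reply, used only to show the parsing step.
sample_reply = "<eSearchResult><Count>42</Count></eSearchResult>"

if __name__ == "__main__":
    for db, url in urls.items():
        print(db, url)
    print(parse_count(sample_reply))
```

Fetching each URL (for example with `urllib.request.urlopen`) and passing the body to `parse_count` would give one count per database.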

5
The Entrez System
6
The Big Picture
Books
UCSC
PubMed
PopSet
e!
GDB
Nucleotide
ProbeSet
MGC
Genome
Protein
Entrez
LocusLink
HGMD
Taxonomy
OMIM
Homologene
Structure
SNP
CDD
UniSTS
MapViewer
3D Domains
7
Entrez and LocusLink
  • Entrez doesn't link to all the databases that
    contain sequences, however!
  • LocusLink has its own groups of links to
    specialty databases, since it doesn't cover all
    the genomes yet.

8
Entrez Database Integration
[Diagram labels: Word weight, Phylogeny, 3-D Structure, VAST, Protein sequences, BLAST]
9
The (ever) Expanding Entrez System
PubMed
Nucleotide
UniGene
Protein
Journals
Structure
Genome
CDD
PopSet
SNP
OMIM
3D Domains
Taxonomy
UniSTS
Books
ProbeSet
10
Entrez Databases
  • PubMed: Biomedical literature
  • Books: Online textbooks
  • Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
  • Protein: GenBank, EMBL, DDBJ, RefSeq,
    SWISS-PROT, PIR, PRF, PDB
  • Genome: Complete genomes
  • Taxonomy: Organisms in NCBI sequence databases
  • Structure: MMDB experimental 3D structures
  • Domains: CDD conserved protein domains
  • 3D Domains: Compact 3D protein domains in MMDB
  • OMIM: Online Mendelian Inheritance in Man
  • SNP: Single nucleotide polymorphisms
  • UniSTS: Sequence Tagged Site markers
  • ProbeSet: Gene expression and microarray datasets
  • PopSet: Population study datasets
  • UniGene: Gene-based expressed sequence clusters
11
Nucleotide Database
  • The Nucleotide database contains sequence data
    from GenBank, EMBL, and DDBJ, the members of the
    tripartite, international collaboration of
    sequence databases.
  • The EMBL database is maintained by the European
    Bioinformatics Institute, an outstation of the
    European Molecular Biology Laboratory, at
    Hinxton, UK.
  • DDBJ is the DNA Database of Japan in Mishima,
    Japan.
  • Sequence data are also incorporated from the
    Genome Sequence Data Base (GSDB), Santa Fe, NM.
  • Patent sequences are incorporated through
    arrangements with the U.S. Patent and Trademark
    Office (USPTO) and via the collaborating
    international databases from other international
    patent offices.

12
Entrez Nucleotides
  • Primary
  • GenBank / EMBL / DDBJ 35,116,960
  • Derivative
  • RefSeq 259,219
  • Third Party Annotation 3,182
  • PDB 4,703

  • Total 35,384,248

13
Database Searching with Entrez
  • Using limits and field restriction to find plant
    g6pdh
  • Linking and neighboring with g6pdh

14
Entrez Nucleotides
The G6PD enzyme catalyzes the oxidation of
glucose-6-phosphate to 6-phosphogluconolactone
(which is then hydrolyzed to 6-phosphogluconate),
while reducing nicotinamide adenine dinucleotide
phosphate (NADP+ to NADPH). In terms of electron
transfer, glucose-6-phosphate loses two electrons
and NADP+ gains two electrons to become NADPH.
This is the first step in the pentose phosphate
pathway. This pathway, or shunt, as it is
sometimes called, produces the 5-carbon sugar
ribose, which (as ribose or deoxyribose) is an
essential component of both RNA and DNA.
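
Written as a balanced equation, the G6PD step looks roughly like this. The immediate product is the lactone, which a separate hydrolysis step converts to 6-phosphogluconate; the notation is standard biochemistry rather than something taken from the slides.

```latex
\[
\text{glucose-6-phosphate} + \mathrm{NADP^{+}}
\;\longrightarrow\;
\text{6-phosphoglucono-}\delta\text{-lactone} + \mathrm{NADPH} + \mathrm{H^{+}}
\]
```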
15
(No Transcript)
16
Limits Are Helpful
  • Limits allow restriction of a search to a defined
    subset of the database.
  • Limits can be set to restrict a search to a
    particular database field (e.g., the Author
    field).
  • Limits can be set to search everything but a
    particular type of data (e.g., exclude patent
    records).
  • Alternatively, limits can be set to search only a
    particular type of data (e.g., Genomic RNA/DNA)
    or to search only data from a particular source
    database (e.g., EMBL). Date limits and sequence
    length limits are also possible.
  • The contents of each Entrez database differ, and
    therefore the Limits available for each database
    differ.
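
The same kinds of limits can be written directly into a query string using bracketed field names. The sketch below is a minimal illustration, not the Limits page itself. The helper `build_term` is hypothetical, and the property values `biomol_mrna` and `gbdiv_pat` are assumptions that should be checked against the Entrez help pages for the database being searched.

```python
def build_term(text, organism=None, only=(), exclude=(), length=None):
    """Compose an Entrez query term from a few common limits.

    text     -- free-text part of the query
    organism -- restrict to a taxon via the Organism field
    only     -- extra clauses that must match (joined with AND)
    exclude  -- clauses to exclude (appended with NOT)
    length   -- (min, max) range for the Sequence Length field
    """
    parts = [text]
    if organism:
        parts.append(f"{organism}[Organism]")
    parts += only
    if length:
        low, high = length
        parts.append(f"{low}:{high}[Sequence Length]")
    term = " AND ".join(parts)
    for clause in exclude:
        term += f" NOT {clause}"
    return term


# Assumed property values; verify them before relying on the results.
MRNA_ONLY = "biomol_mrna[Properties]"
PATENTS = "gbdiv_pat[Properties]"

plant_g6pdh = build_term(
    "g6pdh",
    organism="Viridiplantae",
    only=[MRNA_ONLY],
    exclude=[PATENTS],
)

if __name__ == "__main__":
    print(plant_g6pdh)
```

Pasting the resulting string into the Entrez search box should have the same effect as ticking the corresponding boxes on the Limits page, provided the property values are valid for that database.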

17
Entrez Nucleotides Limits Preview/Index
Try using the Limits and Preview/Index functions to
hone your search and find the plant G6PD genes.
18
Entrez Nucleotides Limits
Exclude bulk sequences
19
Entrez Nucleotides Limits
20
Document Summaries Limits
21
Adding Terms Preview/Index
Accession, All Fields, Author Name, EC/RN Number,
Feature key, Filter, Gene Name, Issue, Journal Name,
Keyword, Modification Date, Organism, Page Number,
Primary Accession, Properties, Protein Name,
Publication Date, SeqID String, Sequence Length,
Substance Name, Text Word, Title Word, Uid, Volume
22
Plant cytosolic g6pdh mRNAs
23
Database Neighbors and Interlinking
  • What makes Entrez more powerful than many
    services is that most of its records are linked
    to other records, both within a given database
    (such as Nucleotide) and between databases.
  • Links within a database are called neighbors
    (e.g., Nucleotide neighbors).

24
Links Between Databases
  • Protein and Nucleotide neighbors are determined
    by performing similarity searches using the BLAST
    algorithm to compare the entry amino acid or DNA
    sequence to all other amino acid or DNA sequences
    in the database. We will discuss more about
    BLAST later.
  • Nucleotide sequence records in the Nucleotide
    database are linked to the PubMed citation of the
    article in which the sequences were published.
  • Protein sequence records are linked to the
    nucleotide sequence from which the protein was
    translated.
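
Links between databases can also be followed programmatically. The sketch below is a minimal illustration, assuming NCBI's E-utilities ELink endpoint (not part of the slides) and the database names `protein`, `nuccore`, and `pubmed`. The record identifier `123456` is a hypothetical placeholder, not a real record.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ELink endpoint.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def link_url(dbfrom, db, uid):
    """Build an ELink URL: records in `db` linked to `uid` in `dbfrom`."""
    return ELINK + "?" + urlencode({"dbfrom": dbfrom, "db": db, "id": uid})


# Hypothetical UID used purely as a placeholder.
PLACEHOLDER_UID = "123456"

# Protein record -> the nucleotide record it was translated from.
protein_to_nucleotide = link_url("protein", "nuccore", PLACEHOLDER_UID)
# Nucleotide record -> the PubMed citation that published it.
nucleotide_to_pubmed = link_url("nuccore", "pubmed", PLACEHOLDER_UID)
# Same source and target database -> neighbors within that database.
nucleotide_neighbors = link_url("nuccore", "nuccore", PLACEHOLDER_UID)

if __name__ == "__main__":
    print(protein_to_nucleotide)
    print(nucleotide_to_pubmed)
    print(nucleotide_neighbors)
```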

25
Plant cytosolic g6pdh mRNAs
26
LinkOut
  • LinkOut is a feature of Entrez that is designed
    to provide users with links from PubMed and other
    Entrez databases to a wide variety of relevant
    web-accessible online resources
  • Full-text publications
  • Other biological databases
  • Consumer health information
  • Research tools
  • The goal is to facilitate access to relevant
    online resources beyond the Entrez system to
    extend, clarify, or supplement information found
    in the Entrez databases.

27
Protein Database
  • The protein database includes proteins from
    translated regions of DNA in GenBank as well as
    sequences from PIR
  • The entry includes
  • The name of the protein
  • How the protein sequence was derived
  • An accession and a PID number
  • The number of amino acids

28
Protein Entry
  • The Entry also includes
  • Structural information for the protein (if known)
  • Helices and β-sheets
  • Domains
  • Etc
  • The sequence of amino acids comprising the protein

29
Setting Protein Database search limits
  • Choose Protein from the drop-down menu
  • Can do a Boolean search
  • Or can set LIMITS
  • Fields (e.g., Author, Journal, etc.)
  • Gene Location (genomic, mitochondrial, etc.)
  • Segmented Sequence
  • Only from (Database to check)
  • Modification date

30
Linking Between Databases
  • Sometimes you will pull up a record and you have
    no idea what organism the gene you are looking at
    is from.
  • For example, in the following record: what is
    Medicago sativa?

31
Entrez GenBank / GenPept
32
Taxonomy to the Rescue
  • Entrez lets you click a live link from the record
    and determine what organism Medicago sativa is.
  • It is alfalfa.
  • You can also tell what it is related to
    taxonomically, because sometimes the common name
    isn't very useful either!

33
Taxonomy Link
34
Advanced Neighbors BLink
35
What is BLink
  • BLink - BLAST Link
  • Someone has done a BLAST search already, and you
    can just retrieve it!
  • BLink displays the graphical output of
    pre-computed blastp results against the protein
    non-redundant (nr) database.

36
This graphical output includes
  • Alignment of up to 200 BLAST hits on the query
    sequence
  • Best Hits to each organism
  • List of known protein domains in the query
    sequence
  • Filter hits by selecting the BLAST cutoff score
  • Distribution of hits by taxonomic grouping
  • Display of similar sequences with known 3D
    structure
  • Filter hits by database and/or by taxonomic
    grouping
  • Display a taxonomic tree of all organisms with
    similar sequences
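
Two of the BLink operations listed above, filtering by a score cutoff and keeping the best hit for each organism, are simple to express in code. The sketch below is a minimal illustration on invented hits. The accessions and scores are hypothetical placeholders, not BLink output, and the functions are not how BLink itself is implemented.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str
    organism: str
    score: int


def filter_by_score(hits, cutoff):
    """Keep only hits whose score is at or above the cutoff."""
    return [h for h in hits if h.score >= cutoff]


def best_hit_per_organism(hits):
    """Return a dict mapping each organism to its highest-scoring hit."""
    best = {}
    for h in hits:
        if h.organism not in best or h.score > best[h.organism].score:
            best[h.organism] = h
    return best


# Invented example data; accessions and scores are placeholders.
hits = [
    Hit("hit_A", "Medicago sativa", 900),
    Hit("hit_B", "Arabidopsis thaliana", 750),
    Hit("hit_C", "Arabidopsis thaliana", 820),
    Hit("hit_D", "Homo sapiens", 400),
]

if __name__ == "__main__":
    print(filter_by_score(hits, 500))
    print(best_hit_per_organism(hits))
```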

37
PopSet Links
  • The PopSet database contains aligned sequences
    submitted as a set resulting from a population,
    phylogenetic, or mutation study.
  • These alignments describe such events as
    evolution and population variation.
  • The PopSet database contains both nucleotide and
    protein sequence data.

38
Protein Neighbors -> PopSet Links
39
Protein Neighbors -> Genome Links
40
PopSet search results
  • The results of a PopSet search
  • The PopSet database includes alignments of genes
    from multiple organisms OR different gene
    families OR mutational analyses

41
PopSet Entry
  • The PopSet entry includes
  • The title of the paper/study
  • The length of the sequence(s) aligned
  • The number of aligned sequences

42
PopSet Entry without alignment
  • The PopSet Entry without an alignment
  • Title of the study
  • The number of sequences included
  • Links to the sequences

43
Entrez Structures
44
Protein Structures can also be in databases
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html
is a useful review tutorial.
45
Entrez links to structure databases
  • The Structure database or Molecular Modeling
    Database (MMDB) contains experimental data from
    crystallographic and NMR structure
    determinations.
  • The data for MMDB are obtained from the Protein
    Data Bank (PDB).
  • The NCBI has cross-linked structural data to
    bibliographic information, to the sequence
    databases, and to the NCBI taxonomy.
  • Use Cn3D, the NCBI 3D structure viewer, for easy
    interactive visualization of molecular structures
    from Entrez.

46
Structure Search results
  • Protein structures are also stored in a database
  • Search as before
  • Your search results are similar

47
Structure Entry
  • The structure Entry has links to the other
    databases
  • And it will allow you to download a file to open
    with a structure viewer program

48
  • Proteins with similar structures and functions
    have been identified in the databases

49
BLink Advanced Protein Neighbors
50
BLink Related Structures
51
Viewing Structure in Cn3D
  • You can download Cn3D (a structural viewer
    program) from NCBI
  • This will allow you to view the structures from
    the structure database

52
Cn3D Text Window
  • The Text window of Cn3D will align two or more
    proteins so you can compare the structure of
    multiple proteins

53
BLink Human Homologue
54
Human RefSeqs Genome Reagents
55
MMDB Molecular Modeling Data Base
  • Derived from experimentally determined PDB
    records
  • Value added to PDB records including
  • Addition of explicit chemical graph information
  • Validation
  • Inclusion of Taxonomy, Citation, and other
    information
  • Conversion to ASN.1 data description language
  • Structure neighbors determined by
  • Vector Alignment Search Tool (VAST)

56
Structure Summary
Cn3D viewer
Structure Neighbors
3D Domain Neighbors
Conserved Domains
57
Cn3D 4.1
58
Cn3D 4.1 Structural Alignment
Conserved ATP binding site
Src Kinase H. sapiens
Casein kinase S. pombe
59
Cn3D Simple Homology Modeling
human
swordtail
60
Using Cn3D to model domains
61
Other services and databases from the NCBI
  • LocusLink links to all available information
    from NCBI and beyond for a few well-characterized
    model organisms.
  • LocusLink is a great starting point: it collects
    key information on each gene/protein from major
    databases. It now covers 8 organisms.
  • RefSeq provides a curated reference accession
    number for each mRNA (e.g., NM_006744) or protein
    (e.g., NP_007635).
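
RefSeq accession prefixes make it possible to tell the record type from the accession alone. The sketch below is a minimal illustration covering only the prefixes that appear in these slides (NM_, NP_, XP_). RefSeq defines further prefixes that are deliberately left out, and the helper name `refseq_type` is hypothetical.

```python
# Only the prefixes named in the slides; RefSeq has others.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein (mRNA-based)",
    "XP_": "protein (genome-based)",
}


def refseq_type(accession):
    """Classify a RefSeq accession by its three-character prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unrecognized prefix")


if __name__ == "__main__":
    for acc in ("NM_006744", "NP_007635"):
        print(acc, "->", refseq_type(acc))
```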

62
Locus Links
  • Results of a LocusLink search include
  • Locus ID
  • Species
  • Locus symbol
  • Locus name
  • Locus location
  • Links
  • Protein Database
  • OMIM
  • Reference Sequence
  • Related GenBank Sequences
  • Homologene Data
  • UniGene
  • Variation Data

63
LocusLink Selected Higher Genomes
64
Protein Database
  • The Protein database contains sequence data from
    the translated coding regions from DNA sequences
    in GenBank, EMBL, and DDBJ as well as protein
    sequences submitted to
  • Protein Information Resource (PIR)
  • SWISS-PROT
  • Protein Research Foundation (PRF)
  • Protein Data Bank (PDB) (sequences from solved
    structures)

65
NCBI Protein Databases
  • GenPept: GenBank, EMBL, DDBJ CDS translations
  • RefSeq: mRNA-based (NP_) and genome-based (XP_)
  • Swiss-Prot: curated, high-quality protein reviews
  • PIR: Protein Information Resource, Georgetown
    University
  • PRF: Protein Research Foundation
  • PDB: Protein Data Bank, sequences from structures

66
Entrez Protein
  • GenPept (GB,EMBL, DDBJ) 3,442,298
  • RefSeq 856,191
  • Third Party Annotation 3,834
  • Swiss Prot 144,508
  • PIR 282,821
  • PRF 12,079

  • Total 3,442,298
  • BLAST nr 1,642,191

67
Protein Link
BLAST Link
Conserved Domains
68
Related Proteins Redundancy
Redundant Sequences
69
Related Proteins Links
70
BLink non-redundant relatives
Arabidopsis homolog
Conserved Domain
71
MLH1 Domain Structure CDD
72
MLH1 ATPase Domain
73
1BGQ ATPase Domain in Cn3D
Yeast HSP90
ATP Binding site helix
74
Variations Human MLH1
75
BLink
Finding structural models
76
Mapping Variation Onto Structure
Loads sequence alignment and structure in Cn3D
Bacterial DNA mismatch repair proteins
77
Mapping Variation Onto Structure
Asn
Ile
Conserved Asn
Ile Val
78
NCBI Genome Databases
  • The Genome database provides views for a variety
    of genomes, complete chromosomes, sequence maps
    with contigs, and integrated genetic and physical
    maps.

79
Microbial Genomes
ZWF
80
Genome search results
  • Genome Search Results
  • The Genome database includes full (and some
    partial) genomes from viruses to complex organisms

81
Genome Entry
  • Genome entries include
  • Maps of the genome
  • Links to the sequence
  • The organism for the genome

82
Genes Database All Genomes
Coming soon!
83
Genes Database All Genomes
84
Genes Database All Genomes
85
But wait! There's more!
  • There is even more at NCBI than I have covered
    here.
  • This site map is also a guide to NCBI resources.
    Each link leads to a brief description of the
    resource on this page, then to the resource
    itself. http://www.ncbi.nlm.nih.gov/Sitemap/

86
There are many bioinformatics servers outside
NCBI.
  • Try ExPASy's sequence retrieval system at
    http://www.expasy.ch/
  • (ExPASy: Expert Protein Analysis System)
  • Or try Ensembl at www.ensembl.org for a premier
    human genome web browser.