Title: Databases at NCBI
1Databases at NCBI
2Database
- A database is a structured collection of records
or data that is stored in a computer system. The
structure is achieved by organizing the data
according to a database model. The model in most
common use today is the relational model. Other
models such as the hierarchical model and the
network model use a more explicit representation
of relationships.
http://en.wikipedia.org/wiki/Database
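To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (journal, article, journal_id) and the sample rows are hypothetical and chosen only for illustration; they do not reflect any NCBI schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory relational database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE journal (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE article ("
        "id INTEGER PRIMARY KEY, title TEXT, "
        "journal_id INTEGER REFERENCES journal(id))"
    )
    conn.execute("INSERT INTO journal VALUES (1, 'Journal A')")
    conn.executemany(
        "INSERT INTO article VALUES (?, ?, ?)",
        [(1, "First article", 1), (2, "Second article", 1)],
    )
    return conn


def titles_in_journal(conn, journal_name):
    """Rows are related only by matching key values, resolved with a join."""
    rows = conn.execute(
        "SELECT article.title FROM article "
        "JOIN journal ON article.journal_id = journal.id "
        "WHERE journal.name = ? ORDER BY article.id",
        (journal_name,),
    )
    return [title for (title,) in rows]


conn = build_demo_db()
print(titles_in_journal(conn, "Journal A"))
```

In a hierarchical or network model the article records would instead hold explicit pointers to their journal record; in the relational model the link exists only as a shared key value.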
3Database
[Slides 3-11 are images; the only recoverable label is AmiGO]
13About NCBI
- What does NCBI do?
- Established in 1988 as a national resource for
molecular biology information, NCBI creates
public databases, conducts research in
computational biology, develops software tools
for analyzing genome data, and disseminates
biomedical information - all for the better
understanding of molecular processes affecting
human health and disease.
http://www.ncbi.nlm.nih.gov/
14About NCBI
15Databases at NCBI
- Databases at NCBI
- Literature databases
- PubMed, PubMed Central, Books, OMIM
- Molecular databases
- Sequences
- EST, STS, GSS, HTGS, HTC, FLIC, UniGene, RefSeq,
HomoloGene
- Structures
- MMDB, CDD
- Taxonomy
- Other databases
- GEO, SKY/CGH
18Literature Databases
- Literature databases
- PubMed
- PubMed Central
- Books
- OMIM
19PubMed
- PubMed
- The PubMed database was designed to provide access
to citations (with abstracts) from biomedical
journals.
- Subsequently, a linking feature was added to
provide access to full-text journal articles at
web sites of participating publishers, as well as
to other related web resources.
20PubMed
- Data sources
- MEDLINE
- NLM's premier bibliographic database, covering
the fields of medicine, nursing, dentistry,
veterinary medicine, the health care system, and
the preclinical sciences, such as molecular
biology.
- Non-MEDLINE
- Out-of-scope citations from general science and
chemistry journals whose life sciences articles
are indexed for MEDLINE, e.g., plate tectonics or
astrophysics articles from Science magazine.
- Other databases
- HealthSTAR, AIDSLINE, HISTLINE, SPACELINE,
BIOETHICSLINE, and POPLINE.
21PubMed
- All electronic data are supplied via FTP to NCBI
in XML format, in accordance with NLM's
specifications (document type definition, or
DTD).
- XML: extensible markup language
- DTD: document type definition
- Example
- A: 160 cm, 50 kg; B: 170 cm, 60 kg
- <N>A</N><H>160</H><W>50</W><N>B</N><H>170</H><W>60</W>
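The tagged example on this slide can be read back with any standard XML parser. The sketch below uses Python's xml.etree.ElementTree and assumes that the tags N, H, and W stand for name, height in cm, and weight in kg. Because the fragment has no single root element, a hypothetical people wrapper is added to make it well-formed; this toy markup is not NLM's actual DTD.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, with the angle brackets restored.
FRAGMENT = "<N>A</N><H>160</H><W>50</W><N>B</N><H>170</H><W>60</W>"


def parse_people(fragment):
    """Parse repeating N/H/W elements into a list of dictionaries."""
    # Hypothetical <people> root: XML requires exactly one top-level element.
    root = ET.fromstring("<people>" + fragment + "</people>")
    values = [(child.tag, child.text) for child in root]
    people = []
    # Elements arrive in repeating N, H, W order; group them in threes.
    for i in range(0, len(values), 3):
        (_, name), (_, height), (_, weight) = values[i:i + 3]
        people.append(
            {"name": name, "height_cm": int(height), "weight_kg": int(weight)}
        )
    return people


print(parse_people(FRAGMENT))
```

A DTD would add what this sketch only assumes: which elements are allowed, in which order, and how they nest.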
22PubMed
- PubMed citations are indexed by MeSH (Medical
Subject Headings) terms.
NCBI Handbook
23PubMed Central
- PubMed Central (PMC) is the National Library of
Medicine's digital archive of full-text journal
literature.
- Journals deposit material in PMC on a voluntary
basis.
- Articles in PMC may be retrieved either by
browsing a table of contents for a specific
journal or by searching the database.
- Certain journals allow the full text of their
articles to be viewed directly in PMC.
- Other journals require that PMC direct users to
the journal's own web site to see the full text
of an article. In this case, the material will
always be available free to any user no more than
1 year after publication but will usually be
available only to the journal's subscribers for
the first 6 months to 1 year.
25NCBI BookShelf
- The BookShelf is a collection of biomedical books
that can be searched directly in Entrez or found
via keyword links in PubMed abstracts.
- Books have been added to the BookShelf in
collaboration with authors and publishers, and
the complete content (including all figures and
tables) is free to use for anyone with an
Internet connection.
- The online books are displayed one section at a
time, with navigation provided to other parts of
the current chapter or to other chapters within
the book.
- Many of the books on the BookShelf can be browsed
without any restriction at all; others have less
flexibility for navigating the complete content.
- The publisher (or the owner of the content)
defines the rules for access.
- The books are linked to PubMed through research
paper citations within the text.
29OMIM
- Online Mendelian Inheritance in Man (OMIM™) is
a timely, authoritative compendium of
bibliographic material and observations on
inherited disorders and human genes. It is the
continuously updated electronic version of
Mendelian Inheritance in Man (MIM).
- MIM was last published in 1998 and is authored
and edited by Dr. Victor A. McKusick and a team
of science writers, editors, scientists, and
physicians at The Johns Hopkins University and
around the world. Curation of the database and
editorial decisions take place at The Johns
Hopkins University School of Medicine.
36Molecular Databases
- Sequence databases
- HTGS, HTC, FLIC
- EST
- STS
- GSS
- UniGene
- RefSeq
- HomoloGene
- Structure databases
- MMDB
- CDD
- Taxonomy
37HTGS
- High-throughput genomic sequence (HTGS) entries
are submitted in bulk by genome centers,
processed by an automated system, and then
released to GenBank.
- To submit sequences in bulk to the HTG processing
system, a center or group must set up an FTP
account. Submitters frequently use two tools to
create HTG submissions, Sequin or fa2htgs.
38HTGS
- Phase 0 sequences are one-to-few reads of a
single clone and are not usually assembled into
contigs. They are low-quality sequences that are
often used to check whether another center is
already sequencing a particular clone.
- Phase 1 entries are assembled into contigs that
are separated by sequence gaps, the relative
order and orientation of which are not known.
- Phase 2 entries are also unfinished sequences
that may or may not contain sequence gaps. If
there are gaps, then the contigs are in the
correct order and orientation.
- Phase 3 sequences are of finished quality and
have no gaps.
NCBI Handbook
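The four phase definitions amount to a small decision rule. The sketch below restates them as a function; the function name and its four boolean parameters are hypothetical, introduced only to mirror the wording of the slide, and do not correspond to any NCBI submission field.

```python
def htgs_phase(assembled, finished, has_gaps, contigs_ordered):
    """Return the HTGS phase (0-3) implied by the slide's definitions.

    assembled:       reads have been assembled into contigs
    finished:        sequence is of finished quality
    has_gaps:        sequence gaps remain between contigs
    contigs_ordered: contig order and orientation are known
    """
    if not assembled:
        return 0  # one-to-few unassembled reads of a single clone
    if finished and not has_gaps:
        return 3  # finished quality, no gaps
    if has_gaps and not contigs_ordered:
        return 1  # contigs separated by gaps, order unknown
    return 2      # unfinished; any gaps lie between ordered contigs


print(htgs_phase(assembled=True, finished=False, has_gaps=True, contigs_ordered=False))
```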
39Genome Sequencing
- Bacterial artificial chromosome (BAC) Sequencing
http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml
40Genome Sequencing
Nature, Vol. 381, 364-366 (1996)
http://en.wikipedia.org/
41Genome Sequencing
- Whole Genome Shotgun (WGS) Sequencing
42Genome Sequencing
Nature, Vol. 381, 364-366 (1996)
43Genome Sequencing
- BAC sequencing
- High precision
- Slow
- Shotgun sequencing
- High throughput
- Consumes large computational resources
- Fast in the early stages, but complicated in the
later stages
44HTGS
- Submission tools
- fa2htgs: command-line program
- tbl2asn: command-line program
- Sequin: stand-alone bulk submission tool
http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm
45HTC and FLIC
- HTC records are High-Throughput cDNA/mRNA
submissions that are similar to ESTs but often
contain more information.
- FLIC records, Full-Length Insert cDNA, contain
the entire sequence of a cloned cDNA/mRNA.
Therefore, FLICs are generally longer, and
sometimes even full-length, mRNAs. They are
usually annotated with genes and coding regions,
although these may be lab systematic names rather
than functional names.
47What are Expressed Sequence Tags
- ESTs are small pieces of DNA sequence (usually
200 to 500 nucleotides long) that are generated
by sequencing either one or both ends of an
expressed gene. The idea is to sequence bits of
DNA that represent genes expressed in certain
cells, tissues, or organs from different
organisms and use these "tags" to fish a gene out
of a portion of chromosomal DNA by matching base
pairs. The challenge associated with identifying
genes from genomic sequences varies among
organisms and is dependent upon genome size as
well as the presence or absence of introns, the
intervening DNA sequences interrupting the
protein coding sequence of a gene.
http://www.ncbi.nlm.nih.gov/About/primer/est.html
48What are Expressed Sequence Tags
http://www.ncbi.nlm.nih.gov/About/primer/est.html
49What are Expressed Sequence Tags
[Diagram: a cDNA is sequenced from both ends, giving a 5' EST and a 3' EST]
- Usually 200-500 nucleotides long
50What are Expressed Sequence Tags
[Diagram: the 5' EST and 3' EST are mapped back to the chromosome sequence]
51Expressed Sequence Tags (ESTs)
53Sequence clustering
- Because a gene can be expressed as mRNA many,
many times, ESTs ultimately derived from this
mRNA may be redundant. That is, there may be many
identical, or similar, copies of the same EST.
Such redundancy and overlap means that when
someone searches dbEST for a particular EST, they
may retrieve a long list of tags, many of which
may represent the same gene. Searching through
all of these identical ESTs can be very
time-consuming.
- To resolve the redundancy and overlap problem,
NCBI investigators developed the UniGene
database.
- UniGene automatically partitions GenBank
sequences into a non-redundant set of
gene-oriented clusters.
http://www.ncbi.nlm.nih.gov/About/primer/est.html
54Sequence clustering
[Diagram: chromosome, pre-mRNA, mRNA, and cDNA library clones No. 1 to No. 6]
57Sequence clustering
[Diagram: UniGene clusters UG No. 1 to UG No. 4]
58Introduction of UniGene database
- UniGene Build Procedure - Transcriptome Based
- Clustering is the process of finding
subsets of sequences that belong together within
a larger set. This is done by converting discrete
similarity scores to Boolean links between
sequences. That is, two sequences are considered
linked if their similarity exceeds a threshold.
UniGene clustering proceeds in several stages,
with each stage adding less reliable data to the
results of the preceding stage. This staged
clustering affords greater control than a more
egalitarian treatment of all links between
sequences.
http://www.ira.cinvestav.mx:8080/GenBioMolI_05/DOCUMENTOS/HTML/NCBI/UniGene%20Build%20Procedures.htm
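The idea of turning similarity scores into Boolean links and then grouping linked sequences can be sketched as single-linkage clustering with a union-find structure. This is a minimal illustration of one stage only, not the UniGene build itself: the sequence identifiers, scores, and the 0.95 threshold are invented, and the staged addition of less reliable data is not modelled.

```python
def cluster(sequence_ids, scored_pairs, threshold):
    """Group sequences whose pairwise similarity exceeds a threshold.

    scored_pairs: iterable of (id_a, id_b, similarity) tuples.
    Returns the connected components of the graph of Boolean links,
    as a sorted list of sorted lists.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score > threshold:  # similarity score -> Boolean link
            parent[find(a)] = find(b)

    groups = {}
    for s in sequence_ids:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


ids = ["est1", "est2", "est3", "est4", "est5"]
scores = [
    ("est1", "est2", 0.98),
    ("est2", "est3", 0.96),
    ("est3", "est4", 0.40),  # below threshold: no link
    ("est4", "est5", 0.97),
]
print(cluster(ids, scores, threshold=0.95))
# [['est1', 'est2', 'est3'], ['est4', 'est5']]
```

Note that est1 and est3 end up together without being compared directly: links are transitive, which is why a single spurious link can merge two clusters and why adding less reliable data in later stages gives more control.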
66Brief of Cancer Genome Anatomy Project
- The goal of CGAP is to determine the gene
expression profiles of normal, precancer, and
cancer cells
67Brief of Cancer Genome Anatomy Project
68Digital Differential Display
[Diagram: UniGene, dbEST, and CGAP supply per-gene EST counts (Genes A to D) for each of four tissue libraries (Tissues A to D)]
69Digital Differential Display
[Diagram: per-gene EST counts (Genes A to D) from UniGene, dbEST, and CGAP, compared between Tissue A and Tissue B]
70Digital Differential Display
- DDD is a tool for comparing EST-based expression
profiles among the various libraries, or pools of
libraries, represented in UniGene. These
comparisons allow the identification of those
genes that differ among libraries of different
tissues, making it possible to determine which
genes may be contributing to a cell's unique
characteristics, e.g., those that make a muscle
cell different from a skin or liver cell.
- Along similar lines, DDD can be used to try to
identify genes for which the expression levels
differ between normal, premalignant, and
cancerous tissues or different stages of
embryonic development.
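Comparing EST counts between two library pools is a comparison of proportions in a 2x2 table. The slides do not say which statistic DDD applies, so the sketch below uses a two-sided Fisher's exact test as one standard choice for this kind of count data. The function name and the example counts are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table

                      gene ESTs   other ESTs
        pool A            a           b
        pool B            c           d
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x gene ESTs in pool A.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 30 of 1000 ESTs in pool A versus 5 of 1000 in pool B.
p_value = fisher_exact_two_sided(30, 970, 5, 995)
print(f"fraction A = {30 / 1000}, fraction B = {5 / 1000}, p = {p_value:.3g}")
```

A small p-value suggests that the gene's share of ESTs differs between the two pools; with many genes tested at once, a multiple-testing correction would also be needed.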
81STS
- In the National Research Council (NRC)
committee's discussions, two problems were raised
in generating a genome map by PCR:
- The difficulty of merging mapping data gathered
by diverse methods in different laboratories into
a consensus physical map.
- The logistics and expense of managing the huge
collections of cloned segments on which the
mapping data would depend almost absolutely.
82STS
- Sequence tagged sites (STSs) are short genomic
landmark sequences. They are operationally unique
in that they are specifically amplified from the
genome by PCR amplification. In addition, they
define a specific location on the genome and are,
therefore, useful for mapping.
- In most instances, 200 to 500 bp of sequence
define an STS that is operationally unique in the
human genome.
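"Operationally unique" means that a primer pair amplifies exactly one product from the genome. A toy in-silico PCR makes the idea concrete. This is a simplified sketch: it requires exact primer matches, searches only the forward strand, and uses invented sequences and a hypothetical max_len cutoff.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_amplicons(genome, forward, reverse, max_len=1000):
    """Return (start, end) spans that the primer pair would amplify.

    The forward primer must match the genome exactly, and the reverse
    primer must match the opposite strand downstream of it.
    """
    rc_reverse = reverse_complement(reverse)
    hits = []
    start = genome.find(forward)
    while start != -1:
        end = genome.find(rc_reverse, start + len(forward))
        if end != -1 and end + len(rc_reverse) - start <= max_len:
            hits.append((start, end + len(rc_reverse)))
        start = genome.find(forward, start + 1)
    return hits


def is_operationally_unique(genome, forward, reverse):
    """True when the primer pair yields exactly one product."""
    return len(find_amplicons(genome, forward, reverse)) == 1


genome = "TTTT" + "ACGTAC" + "GGGGGG" + "TTAGCC" + "AAAA"
print(find_amplicons(genome, "ACGTAC", "GGCTAA"))
print(is_operationally_unique(genome, "ACGTAC", "GGCTAA"))
```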
86GSS
- The genome survey sequences (GSS) division of
GenBank is similar to the EST division, with the
exception that most of the sequences are genomic
in origin, rather than cDNA (mRNA). It should be
noted that two classes (exon trapped products and
gene trapped products) may be derived via a cDNA
intermediate. Care should be taken when analyzing
sequences from either of these classes, as a
splicing event could have occurred and the
sequence represented in the record may be
interrupted when compared to genomic sequence.
The GSS division contains (but is not limited to)
the following types of data:
- random "single pass read" genome survey
sequences
- cosmid/BAC/YAC end sequences
- exon trapped genomic sequences
- Alu PCR sequences
- transposon-tagged sequences
87GSS
- Many labs have approached GenBank over the last
few months, interested in submitting these types
of sequences. We have been reluctant to introduce
them via the existing GenBank divisions. On the
other hand, such sequences are of value to the
genome community, and require similar processing
and access tools as have been provided for ESTs
and STSs. GSS sequences will be used,
amongst other things, as a framework for the
mapping and sequencing of genome-size pieces
which will be present in the standard GenBank
divisions.
- Sequence data appropriate for the new GSS
division are, to date, generated by genome labs
performing human genome sequencing; we expect
that similar data will be generated for other
model organisms, such as the mouse.
89RefSeq
- RefSeq biological sequences (also known as
RefSeqs) are derived from GenBank records but
differ in that each RefSeq is a synthesis of
information, not an archived unit of primary
research data.
- RefSeq provides a non-redundant framework of
information to facilitate database searches,
whether they are searched via genomic location,
sequence, or text annotation.
90RefSeq
- The RefSeq database is the result of data
extraction from GenBank, curation, and
computation, combined with extensive
collaboration with authoritative groups. Each
molecule is annotated as accurately as possible
with the organism name, strain (or breed,
ecotype, cultivar, or isolate), gene symbol for
that organism, and informative protein name.
- In cases when a molecule is represented by
multiple sequences for an organism in GenBank, an
effort is made by NCBI staff to select the "best"
sequence to be presented as a RefSeq. The goal is
to avoid known mutations, sequencing errors,
cloning artifacts, and erroneous annotation.
99HomoloGene
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
100HomoloGene
101HomoloGene
- HomoloGene Build Procedure
- The input for HomoloGene processing consists of
the proteins from the input organisms. These
sequences are compared to one another (using
blastp) and then are matched up and put into
groups, using a tree built from sequence
similarity to guide the process, where more
closely related organisms are matched up first,
and then more distant organisms are added as the
tree is traversed toward the root. The protein
alignments are mapped back to their corresponding
DNA sequences, where distance metrics can be
calculated (e.g. molecular distance, Ka/Ks
ratio). Sequences are matched using synteny when
applicable. Remaining sequences are matched up by
using an algorithm for maximizing the score
globally, rather than locally, in a bipartite
matching. Cutoffs on bits per position and Ks
values are set to prevent unlikely "orthologs"
from being grouped together. These cutoffs are
calculated based on the respective score
distribution for the given groups of organisms.
Paralogs are identified by finding sequences that
are closer to each other within a species than to
sequences in other species.
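The difference between maximizing the score globally and locally in a bipartite matching can be shown on a tiny score matrix. The sketch below is illustrative only: the slide does not specify HomoloGene's actual algorithm, the scores are invented, and the brute-force search over permutations is practical only for a handful of proteins (a real implementation would use a polynomial-time assignment algorithm).

```python
from itertools import permutations


def best_global_matching(scores):
    """Pairing of rows to columns that maximizes the summed score.

    scores[i][j] is the similarity of protein i in species A to
    protein j in species B (square matrix). Returns (pairs, total).
    """
    n = len(scores)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total


def greedy_matching(scores):
    """Local alternative: repeatedly take the highest remaining score."""
    n = len(scores)
    cells = sorted(
        ((scores[i][j], i, j) for i in range(n) for j in range(n)),
        reverse=True,
    )
    used_rows, used_cols, pairs, total = set(), set(), [], 0
    for score, i, j in cells:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            pairs.append((i, j))
            total += score
    return sorted(pairs), total


scores = [[10, 9],
          [9, 1]]
print(greedy_matching(scores))       # ([(0, 0), (1, 1)], 11)
print(best_global_matching(scores))  # ([(0, 1), (1, 0)], 18)
```

Taking the single best pair first (10) forces a poor second pair (1), whereas the global optimum accepts two slightly weaker pairs (9 + 9).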
105MMDB
- The Molecular Modeling Database (MMDB) is based on
the structures within the Protein Data Bank (PDB) and
can be queried using the Entrez search engine, as
well as via the more direct but less flexible
structure summary search. Once found, any
structure of interest can be viewed using Cn3D, a
piece of software that can be freely downloaded
for Mac, PC, and UNIX platforms.
110MMDB
- VAST Search is a WWW service which allows you to
compare the 3-dimensional structure of an input
protein with other protein structures in NCBI's
MMDB, using the VAST algorithm.
- VAST Search is NCBI's structure-structure
similarity search service. It compares 3D
coordinates of a newly determined protein
structure to those in the MMDB/PDB database. VAST
Search computes a list of structure neighbors
that you may browse interactively, viewing
superpositions and alignments by molecular
graphics.
- The output of the pre-computed VAST searches is a
list of structure records, each representing one
of the non-redundant PDB chain sets (nr-PDB),
which can also be downloaded. There are four
clustered subsets of MMDB that compose nr-PDB,
each consisting of clusters having a preset level
of sequence similarity.
114CDD
- The collections of domain alignments in the
conserved domain database (CDD) are imported
from two databases outside NCBI, Pfam and the
Simple Modular Architecture Research Tool
(SMART); from the NCBI COG database; from another
NCBI collection, the Library of Ancient Domains
(LOAD); and from a database curated by the CDD
staff.
119CDD
- Given a query sequence, CDART shows the
functional domains that make up a protein and
then lists proteins with a similar domain
architecture. The functional domains for a
sequence are found by RPS-BLAST, which defines a
domain by a position-specific scoring matrix
(PSSM), a set of probabilities of amino acids
occurring at each position of the domain.
RPS-BLAST is known as a "profile" search, which
is a sensitive way to look for sequence
homologues.
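A position-specific scoring matrix can be built from a small alignment and slid along a query to find the best-matching window. The sketch below is a bare-bones illustration of profile scoring, not RPS-BLAST: it assumes a uniform background, a fixed pseudocount, and ungapped matching, and the three-residue alignment is invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log-odds scores per position from equal-length aligned sequences."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        denom = len(column) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / denom) / background)
            for aa in alphabet
        })
    return pssm


def best_hit(pssm, sequence):
    """Slide the PSSM along the sequence; return (best_score, start).

    The sequence must be at least as long as the PSSM.
    """
    width = len(pssm)
    scored = [
        (sum(pssm[i][sequence[s + i]] for i in range(width)), s)
        for s in range(len(sequence) - width + 1)
    ]
    return max(scored)


pssm = build_pssm(["GKS", "GKT", "GKS"])
score, start = best_hit(pssm, "AAAGKSAAA")
print(f"best window starts at {start} with score {score:.2f}")
```

Because each position has its own scores, a residue that is tolerated at one position can be penalized at another, which is what makes profile searches more sensitive than scoring with a single substitution matrix.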
123Taxonomy
- The NCBI Taxonomy database is a curated set of
names and classifications for all of the
organisms that are represented in GenBank. When
new sequences are submitted to GenBank, the
submission is checked for new organism names,
which are then classified and added to the
Taxonomy database.
- Of the several different ways to build a
taxonomy, our group maintains a phylogenetic
taxonomy. In a phylogenetic classification
scheme, the structure of the taxonomic tree
approximates the evolutionary relationships among
the organisms included in the classification.
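A taxonomic tree can be stored as a simple child-to-parent mapping, from which lineages and common ancestors follow directly. The miniature tree below is a hypothetical, heavily simplified example with most intermediate ranks omitted; it is not an extract of the NCBI Taxonomy database.

```python
# Hypothetical miniature taxonomy: child name -> parent name.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan troglodytes": "Pan",
    "Pan": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mus musculus": "Mus",
    "Mus": "Rodentia",
    "Rodentia": "Mammalia",
}


def lineage(name, parent=PARENT):
    """Path from a node up to the root of the tree."""
    path = [name]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def common_ancestor(a, b, parent=PARENT):
    """Lowest node shared by the lineages of a and b, or None."""
    ancestors_a = set(lineage(a, parent))
    for node in lineage(b, parent):
        if node in ancestors_a:
            return node
    return None


print(lineage("Homo sapiens"))
print(common_ancestor("Homo sapiens", "Pan troglodytes"))
print(common_ancestor("Homo sapiens", "Mus musculus"))
```

In a phylogenetic taxonomy, the depth of the common ancestor is a rough proxy for evolutionary relatedness: the two primates meet lower in the tree than either does with the mouse.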
129Other Databases
- Other databases
- GEO
- SKY/CGH
130GEO
- The Gene Expression Omnibus (GEO) project was
initiated at NCBI in 1999 in response to the
growing demand for a public repository for data
generated from high-throughput microarray
experiments. GEO has a flexible and open design
that allows the submission, storage, and
retrieval of many types of data sets, such as
those from high-throughput gene expression,
genomic hybridization, and antibody array
experiments.
138SKY/CGH
- Spectral Karyotyping (SKY) and Comparative
Genomic Hybridization (CGH) are complementary
fluorescent molecular cytogenetic techniques that
have revolutionized the detection of chromosomal
abnormalities.
- SKY permits the simultaneous visualization of all
human or mouse chromosomes in a different color,
facilitating the detection of chromosomal
translocations and rearrangements.
- CGH uses the hybridization of differentially
labeled tumor and reference DNA to generate a map
of DNA copy number changes in tumor genomes.
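The CGH readout is usually interpreted as a ratio of tumor to reference signal at each position. The sketch below assumes that a log2 ratio with symmetric cutoffs is used to call gains and losses; the 0.3 threshold and the intensity values are hypothetical and would have to be calibrated for real data.

```python
import math


def copy_number_calls(tumor, reference, threshold=0.3):
    """Classify each position as 'gain', 'loss', or 'normal'.

    tumor, reference: equal-length lists of positive signal intensities.
    threshold: hypothetical cutoff on the log2(tumor / reference) ratio.
    """
    calls = []
    for t, r in zip(tumor, reference):
        ratio = math.log2(t / r)
        if ratio > threshold:
            calls.append("gain")
        elif ratio < -threshold:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


print(copy_number_calls([200, 100, 50], [100, 100, 100]))
# ['gain', 'normal', 'loss']
```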
149Entrez
- Entrez is the text-based search and retrieval
system used at NCBI for all of the major
databases, including PubMed, Nucleotide and
Protein Sequences, Protein Structures, Complete
Genomes, Taxonomy, OMIM, and many others. Entrez
is at once an indexing and retrieval system, a
collection of data from many sources, and an
organizing principle for biomedical information.
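Entrez can also be queried programmatically through NCBI's E-utilities, which the slides do not cover. The sketch below only builds an ESearch URL and does not contact the server; the helper name, the search term, and the retmax value are illustrative choices, and NCBI's current usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database such as 'pubmed'."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


print(esearch_url("pubmed", "expressed sequence tag"))
```

The same pattern applies to other Entrez databases by changing the db value, which reflects the slide's point that one retrieval system sits in front of many data collections.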