Title: Entrez Retrieval System ...
1Part 3
Essentials
2Global Entrez Search Page
All[Filter]
3Overall Goal
An on-line resource providing comprehensive information on the biological
activities of small molecules
4Why Are Small Molecules Important?
- Constituents of all macromolecules (DNA, RNA, protein, carbohydrates, etc.)
- Serve as cofactors and signaling molecules to thousands of proteins
- The chemistry part of biochemistry
- Most drug entities and drug types are small molecules
- Most biomarkers used in clinical chemistry are small molecules
5PubChem Databases and Tools
http://pubchem.ncbi.nlm.nih.gov/
6The Molecular Libraries Roadmap: An Integrated Initiative
Technology Development
Screening
Informatics
Chem-informatics Research Centers
Molecular Libraries Screening Centers Network (MLSCN)
Assay Development
Instrumentation
Compound Repository (MLSMR)
Chemical Diversity
Predictive ADMET
7PubChem
- Repository for small molecules and bioactivity assay data
- Part of Entrez search and linking system
- Links to other NCBI databases, e.g.,
- PubMed, MeSH
- Protein structures (MMDB)
- Protein/Nucleotide sequences (GenPept/GenBank)
- Contains complete chemical structures
- Standardized for uniformity
- Small set of computed properties
- Structure similarity searching
8Other Depositors to PubChem
and more
9PubChem Bird's-Eye View
Depositors
PubChem Substance
PubChem BioAssays
PubChem Compound
Chemical Structure Similarity
10How does data get into PubChem?
11PubChem integration in Entrez
VAST Structure Similarity
Term Frequency Statistics
Literature
3D Structures
Bioactivity Assay Results
Small Molecule Structures
Chemical Structure Similarity
Protein Sequences
Activity Profile Similarity
13Primary Database
14Depositor Data
- No global rules or standards
- Based on organizational needs
- Lots of data overlap
- Often based on individual scientists' preferences
- PubChem accepts data from many organizations
- Previously unseen data representation
- Combinatorial explosion of ways for drawing the
same structure
15Redundancy, mixtures
Mixture
16Derivative Database
17Chemical Structures may be represented in many different ways
18Chemical Structures may be represented in many different ways
19Substance
Compound
20Substance
Compound
Unknown E/Z isomers
Unknown stereo
Known stereochemistry
22PubChem Compound Processing
- Chemical Data Verification
- Atom description (label, element?)
- Functional group clean-up
- Atom valence verification to prevent nonsense
- Normalize and Standardize
- Valence-Bond canonicalize (for Tautomer invariance)
- Aromaticity detection and self-consistency
- Stereochemistry detection
- Explicit hydrogen assignment
- Calculation
- 2-D Coordinate generation
- Image Depictions
- Fingerprints
- IUPAC Name
- SMILES, InChI, Hash Codes
- xLogP, TPSA, HBD, HBA, MW, MF
23Chemical Structure Sanitization
- Chemical Structures that fail Sanitization
- Are not part of the aggregated PubChem Compound Database
- Still searchable via PubChem Substance Database
- Keeps the PubChem Compound Database Clean for Chemical Informatic Analysis
- Collapses structures represented in various ways into a uniform, identical representation
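The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not PubChem's actual pipeline: the SIDs, depositor names, and structure keys are made up, and the key is simply assumed to be whatever string standardization would produce (with None standing for a structure that failed sanitization).

```python
from collections import defaultdict

# Hypothetical depositor records: (SID, depositor, standardized structure key).
# The key stands in for the output of standardization; None marks a structure
# that failed sanitization. All values are invented for illustration.
substances = [
    (101, "DepositorA", "KEY-ASPIRIN"),
    (102, "DepositorB", "KEY-ASPIRIN"),
    (103, "DepositorC", "KEY-CAFFEINE"),
    (104, "DepositorA", None),
]


def aggregate_compounds(records):
    """Group substance IDs by standardized key; keep failures separate."""
    compounds = defaultdict(list)
    substance_only = []
    for sid, _depositor, key in records:
        if key is None:
            # Failed sanitization: stays searchable as a substance only.
            substance_only.append(sid)
        else:
            # Identical keys collapse into one compound entry.
            compounds[key].append(sid)
    return dict(compounds), substance_only


compounds, substance_only = aggregate_compounds(substances)
print(compounds)
print(substance_only)
```

Two depositions of the same structure end up under one compound key, while the record that failed sanitization remains only in the substance list.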
24Compound for mixture
Component compounds
25Components of a mixture
26Substance vs. Compound
Substance summary
Compound summary
27Substance vs. Compound
28Examples of queries
- "pcsubstance structure"[Filter]
- "ca"[Element] AND 300:500[MW] AND "chemidplus"[SourceName]
- "InChI1/Ca.3H2O/h31H2/q 2/p-3/fCa.3HO/h31 h/qm3-1"[InChI]
- "lipinski"[Filter] AND "antineoplastic agents"[PharmAction]
- Lipinski rule of 5 -- a molecule is likely to be bioactive if it has
- not more than 5 hydrogen bond donors (OH and NH groups)
- not more than 10 hydrogen bond acceptors (N or O)
- a molecular weight under 500
- a LogP under 5
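As a rough illustration, the four criteria can be written as a single check. The function name and the property values below are hypothetical; the thresholds follow the list above, taking the acceptor limit as at most 10, which is how the rule is usually stated.

```python
def passes_lipinski(hbd, hba, mol_weight, logp):
    """Return True if all four rule-of-5 criteria are met.

    hbd        -- hydrogen bond donor count
    hba        -- hydrogen bond acceptor count
    mol_weight -- molecular weight
    logp       -- LogP (e.g. a computed XLogP value)
    """
    return hbd <= 5 and hba <= 10 and mol_weight < 500 and logp < 5


# Made-up property sets, for illustration only.
small_polar = {"hbd": 1, "hba": 4, "mol_weight": 180.2, "logp": 1.2}
large_greasy = {"hbd": 2, "hba": 6, "mol_weight": 712.9, "logp": 6.3}

print(passes_lipinski(**small_polar))
print(passes_lipinski(**large_greasy))
```

The second molecule fails on both molecular weight and LogP, so the check returns False for it.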
29Examples of PubChem Index Fields
All [ALL] -- All of the following fields are searched; default search field.
Uid [UID] -- The integer represents the SID for the PCSubstance database. By default, an integer without a field alias is recognized as a UID. Same as SID.
Filter [Filter] -- Limits the records to various indexed filters.
ActiveAid [AA] -- Active BioAssay identifier, integer.
ActiveAidCount [AC, ACNT] -- Number of bioassays where tested active.
AtomChiralCount [ACC, ACCNT] -- Total count of chiral atoms in a given compound.
BioAssayID [BAID, AID] -- BioAssay identifier.
BondChiralCount [BCC, BCCNT] -- Number of chiral bonds.
Comment [CMT] -- Substance or bioassay comment.
CompleteSynonym [CSYN, CSYNO] -- Exactly matching name for a substance/compound.
CompoundID [CID] -- Compound identifier, integer.
DepositDate [DDAT, DEPDAT] -- Deposition timestamp for a substance.
Element [ELMT, EL] -- Chemical element in a substance/compound.
ExactMass [EMAS, EXMASS] -- The calculated mass of an ion or molecule with the most likely isotopic composition for a single random molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT] -- Atom count in a compound, excluding hydrogen, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT] -- Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT] -- Hydrogen bond donors for a compound, integer.
InChI [inchi] -- IUPAC International Chemical Identifier.
30Examples of PubChem Index Fields, contd.
IUPACName [UPAC, IUPAC] -- Standard IUPAC name for a compound.
MeSHDescription [MHD]
MeSHTerm [MSHT, MESHT] -- Medical Subject Heading term.
MeSHTreeNode [MSHN, MESHTN] -- Medical Subject Heading tree node (tree structures).
MolecularWeight [MW, MWT, MOLWT] -- Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., carbon has two natural isotopes, 12 and 13, with relative abundances of 98.9% and 1.1%, yielding an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS] -- Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA] -- MeSH pharmacological actions heading.
RotatableBondCount [RBC, RBCNT] -- Number of rotatable bonds.
SourceCategory [SRCC, SRCCAT, SRCCATG] -- Depositor categories.
SourceID [SRID, SRCID] -- Depositor's external ID.
SourceName [SRC, SRCNAM, SRCNAME] -- Official depositor name.
SubstanceID [SID] -- Substance ID. Same as UID.
Synonym [SYNO] -- Synonyms for a substance.
TautomerCount [TC, TCNT, TTMC] -- Possible tautomer count for each given structure, 200.
TotalFormalCharge [TFC, CHG, CHRG] -- Total formal charge.
TPSA [TPSA] -- Topological Polar Surface Area.
XLogP [XLGP, LOGP]
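The difference between MolecularWeight and MonoisotopicMass can be checked with the carbon example from the field descriptions. The sketch below is illustrative only: the function names are hypothetical, the abundances are the rounded slide values, and the carbon-13 mass of 13.00335 is a standard figure that does not appear on the slide.

```python
def average_mass(isotopes):
    """Abundance-weighted mean of (mass, fractional abundance) pairs."""
    total_abundance = sum(abundance for _, abundance in isotopes)
    weighted = sum(mass * abundance for mass, abundance in isotopes)
    return weighted / total_abundance


def monoisotopic_mass(isotopes):
    """Mass of the single most abundant isotope."""
    return max(isotopes, key=lambda pair: pair[1])[0]


# Carbon: (isotope mass, fractional abundance).
# Abundances are the rounded values from the slide; the 13C mass is assumed.
CARBON = [(12.000, 0.989), (13.00335, 0.011)]

print(round(average_mass(CARBON), 3))
print(monoisotopic_mass(CARBON))
```

The weighted mean reproduces the 12.011 g/mol average quoted for carbon, while the monoisotopic value stays at 12.000 because only the most abundant isotope is counted.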
31Preview/Index Tab
32History Tab
Substances of MW 300-500Da having antineoplastic
properties and obeying Lipinski rule of 5
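A query like this one could also be submitted programmatically through NCBI's E-utilities `esearch` endpoint. The sketch below only builds the request URL and does not contact the server. The helper names are hypothetical, the field tags and the "lipinski" filter name are taken from the slides and written with Entrez's square-bracket field syntax, and whether the current index still accepts them is an assumption.

```python
from urllib.parse import urlencode

# E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(mw_low, mw_high, pharm_action, filter_name):
    """Combine a molecular weight range, a MeSH action, and a filter."""
    return (
        f'{mw_low}:{mw_high}[MW] AND "{pharm_action}"[PharmAction] '
        f'AND "{filter_name}"[Filter]'
    )


def esearch_url(db, term, retmax=20):
    """Return the esearch URL for a database and query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Field and filter names follow the slides; treat them as assumptions.
term = build_term(300, 500, "antineoplastic agents", "lipinski")
url = esearch_url("pcsubstance", term)
print(term)
print(url)
```

Keeping the term construction separate from the URL encoding makes it easy to paste the same term into the Entrez search box for comparison.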
34Property Report
35SDF format
39Medical Subject Headings (MeSH)
- MeSH is the National Library of Medicine's controlled vocabulary thesaurus.
- Consists of sets of terms naming descriptors in a hierarchical and alphabetic structure, e.g.
- "Mental Disorders", "Pharmacological action",
- "Catecholamine hormones", etc.
- Permits searching at various levels of specificity
- MeSH thesaurus is used for indexing articles for the MEDLINE/PubMed database
- MeSH is continually updated
- PubChem assigns MeSH headings to Compound records
40Primary Database
- Contains bioactivity screens of chemical substances described in PubChem Substance
- Provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to a screening protocol
- Depositor decides on data definitions and interpretation
- Data can be plotted as graphs or statistical histograms
- Cross-indexed to other Entrez databases
47Click to view structure
49NCBI FTP >> PubChem Folder
50Entrez PubChem Help and Tabs
51Brief Summary
- PubChem is part of the NIH Molecular Libraries Roadmap for Medicine Initiative
- PubChem consists of 3 databases, Substance, Compound and BioAssay, and a powerful Structure Search engine
- Substance: samples
- Compound: calculated structures, properties
- PubChem is integrated into NCBI's Entrez Search and Linking system of databases
- Records are indexed using a number of terms
- Records are linked to each other and to other databases at NCBI
52For More Information
53For More Information
E-mail addresses
- General Help: info@ncbi.nlm.nih.gov
- BLAST: blast-help@ncbi.nlm.nih.gov
Telephone
- Voice: 1 (301) 496-2475
- Fax: 1 (301) 480-9241
The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html
The NCBI Handbook
Follow the link from the NCBI Home Page
The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html