Structure Databases - PowerPoint PPT Presentation

Slides: 64
Provided by: SAME8

1
Structure Databases
  • Sameer Velankar
  • http://msd.ebi.ac.uk

2
http://www.ebi.ac.uk
3
Information Management is Central to Biological Research
Analysis: Reading. Talking. Thinking.
Hypothesis!
Private Data: Past experiments. Lab notebooks. Group discussions.
Public Data: Journals. Conferences.
Experiment: Design. Execution.
Publish!
4
Bioinformatics is Central to Biological Research
Analysis: Reading. Talking. Thinking. Computational analysis. Software development.
Public Data: Journals, conferences, DNA sequences, protein sequences, genetic maps, transcripts, 3D structures, proteomics results, SNP data, etc.
Private Data: Past experiments, lab notebooks, group discussions, DNA sequences, protein sequences, genetic maps, transcripts, 3D structures, proteomics results, SNP data, etc.
Hypothesis! Computer aided.
Experiment: Design. Execution. Computational experiments. Simulation.
Publish! Database submission. Database management.
6
EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
7
EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
SWISS-PROT, TrEMBL: Protein Sequences
8
EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
SWISS-PROT, TrEMBL: Protein Sequences
Array-Express: Microarray Expression Data
9
EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
SWISS-PROT, TrEMBL: Protein Sequences
Array-Express: Microarray Expression Data
EMSD: Macromolecular Structure Data
10
EMBL-Bank: DNA sequences
EnsEMBL
Array-Express: Microarray Expression Data
SWISS-PROT, TrEMBL: Protein Sequences
IntAct: Protein-Protein Interaction Data
EMSD: Macromolecular Structure Data
11
Integr8
12
Integr8 will encompass information on:
  • genetic regulatory elements
  • gene expression
  • protein expression
  • gene product information
  • protein families and domains
  • structural and ligand information
  • molecular function
  • biological role
  • location of gene products

13

EMBL Outstation: European Bioinformatics Institute
Macromolecular Structure Database Project (EMSD)
http://msd.ebi.ac.uk
To develop an autonomous structural database capability in Europe

14

Global Context of Data on Macromolecular
Structure
For more than 20 years, the Protein Data Bank
(PDB) at Brookhaven National Laboratory, USA,
collected public data on protein structure.
In 1998 the US funding agencies re-competed the
grant, and Brookhaven National Laboratory lost the
contract.
Responsibility for the Protein Data Bank in the
USA has now moved to the RCSB (Research
Collaboratory for Structural Bioinformatics).

The RCSB comprises Rutgers University (NJ), the San Diego
Supercomputer Center (SDSC), and the National Institute of
Standards and Technology (NIST).
15
Two Different Database Systems, One Common Set of Data
Deposition: USA (RCSB), Europe (EBI-MSD)
Agreed Common Data Items and Exchange Mechanism.
Services: USA (RCSB), Europe (EBI-MSD)
Ftp Archive
16
http://msd.ebi.ac.uk
17
Data Base Tasks
  • GET DATA
  • ORGANISE AND STORE DATA
  • GIVE IT BACK

18

Data capture: new data from NMR spectroscopy, X-ray crystallography, and electron microscopy (harvesting or manual entry); legacy data (old PDB), after clean-up
Data storage: relational database (Oracle)
Data delivery: XML, HTML, flat files, CORBA
19
Where does the 3D-structural data come from?
  • Three complex techniques
  • X-ray crystallography
  • NMR spectroscopy
  • cryo-electron microscopy

20
ESRF, Grenoble: synchrotron radiation source; includes beamlines for macromolecular X-ray crystallography
Electron microscope for cryo-EM
High-field NMR spectrometer
21
J.Frank, Wadsworth Centre, New York
Rat-liver mitochondrion
Mitotic spindle
Model of translational apparatus
22
3D-EM
  • Structural information about Macromolecular
    Complexes
  • A vital link between Cell Biology and X-Ray and
    NMR studies
  • Insights into Macromolecular Mechanisms and
    Interrelationships

23
1be3 - transmembrane helix bundle
24
1be3: 11 unique proteins
  • UBIQUINOL CYTOCHROME C OXIDOREDUCTASE, COMPLEX
  • DBREF 1BE3 A 1 446 GB 1730447 1730447 35 480
  • DBREF 1BE3 B 21 439 SWS P23004 UCR2_BOVIN 35 453
  • DBREF 1BE3 C 1 379 SWS P00157 CYB_BOVIN 1 379
  • DBREF 1BE3 D 1 241 GB 223266 223266 1 241
  • DBREF 1BE3 E 1 196 SWS P13272 UCRI_BOVIN 79 274
  • DBREF 1BE3 F 5 110 SWS P00129 UCR6_BOVIN 5 110
  • DBREF 1BE3 G 1 81 SWS P13271 UCRQ_BOVIN 1 81
  • DBREF 1BE3 H 15 78 SWS P00126 UCRH_BOVIN 15 78
  • DBREF 1BE3 I 46 78 SWS P07588 UCRI_BOVIN 46 78
  • DBREF 1BE3 J 1 62 SWS P00130 UCRX_BOVIN 1 62
  • DBREF 1BE3 K 15 36 SWS P07552 UCRY_BOVIN 15 36
  • 4 Haems
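
The DBREF records above map each chain of 1be3 onto a sequence-database entry. As a minimal sketch, they could be read into structured records like this. It assumes whitespace-separated fields as they appear on the slide (real PDB files use fixed columns), and the `DbRef` type and `parse_dbref` function are illustrative names, not part of any EMSD software.

```python
from typing import NamedTuple


class DbRef(NamedTuple):
    """One DBREF record: a chain segment mapped onto a sequence-database entry."""
    pdb_id: str
    chain: str
    seq_begin: int
    seq_end: int
    database: str
    accession: str
    id_code: str
    db_begin: int
    db_end: int


def parse_dbref(line: str) -> DbRef:
    # Assumption: fields are separated by whitespace, as in the slide text.
    fields = line.split()
    if len(fields) != 10 or fields[0] != "DBREF":
        raise ValueError(f"not a DBREF record: {line!r}")
    _, pdb_id, chain, s0, s1, db, acc, code, d0, d1 = fields
    return DbRef(pdb_id, chain, int(s0), int(s1), db, acc, code, int(d0), int(d1))


records = [
    "DBREF 1BE3 A 1 446 GB 1730447 1730447 35 480",
    "DBREF 1BE3 B 21 439 SWS P23004 UCR2_BOVIN 35 453",
]
refs = [parse_dbref(r) for r in records]
for ref in refs:
    # Print chain, source database, accession, and the length of the mapped segment.
    print(ref.chain, ref.database, ref.accession, ref.seq_end - ref.seq_begin + 1)
```

Keeping the database name (GB, SWS) next to the accession is what lets a structure entry be linked back to the sequence databases listed earlier in the talk.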

25
3D-structural data are complex
  • Need database models that can cope with this
    complexity
  • Need ease of maintenance and integration
  • EMSD system stretches current relational database
    technology

26
We would like to be able to...
  • ...ask questions that span from genome to
    ligand.
  • This SNP is in a coding region. Does the native
    protein have a known three dimensional structure?
  • Does the amino acid occur in the active site of
    the protein?
  • Which ligands are known to bind in the active
    site of this protein or a homologue?

27
Data Base Tasks
  • GET DATA
  • ORGANISE AND STORE DATA
  • GIVE IT BACK

28
Harvesting
Data Capture
29
Harvesting Concept
  • Harvesting for macromolecular structure data is SIMPLY
    communicating relevant data from software to the deposition site
  • The idea was first used in XPLOR with the module
    pdbsubmission (1996) (Jian-Sheng Jiang and A. T. Brunger)

30
Harvesting Rationale
  • RICHER DATA BASE
  • MORE ACCURATE DATA
  • SYSTEMATIC DATA
  • CONFIDENCE LEVELS FOR DERIVED DATA
  • SIMPLER DEPOSITION

31
Data Harvesting
Deposition files
Database
Submission procedure
Manually-entered information
32
Web Based Structure Deposition Sites
EBI-MSD (Europe): http://autodep.ebi.ac.uk/
RCSB-Rutgers (USA): http://pdb.rutgers.edu/adit/
Osaka University (Japan): http://pdbdep.protein.osaka-u.ac.jp/adit/
33
DEPOSITION via AUTODEP
34

Data capture: new data from NMR spectroscopy, X-ray crystallography, and electron microscopy (harvesting or manual entry); legacy data (old PDB), after clean-up
Data storage: relational database (Oracle)
Data delivery (Data Warehouse): XML, HTML, flat files, CORBA
35
Database organisation
Biological Unit(s)
Independent units
ASU (observed experimental data)
Chains
Residues
Atoms
  • Each level of the hierarchy can have associated
    properties, e.g.
  • Bound molecules
  • Domains
  • Site residues
  • Derived properties (e.g. asa)
  • Reference information (e.g. standard geometry)
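
The hierarchy on this slide can be pictured as nested containers, each able to carry its own properties. The following is only a sketch of that idea using Python dataclasses (Python 3.9+). The class names, the `properties` dictionaries, the coordinates, and the `asa` value are illustrative assumptions, not the EMSD schema.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str
    x: float
    y: float
    z: float


@dataclass
class Residue:
    name: str
    number: int
    atoms: list[Atom] = field(default_factory=list)
    # Properties attached at residue level, e.g. a derived value such as asa.
    properties: dict[str, float] = field(default_factory=dict)


@dataclass
class Chain:
    chain_id: str
    residues: list[Residue] = field(default_factory=list)
    properties: dict[str, float] = field(default_factory=dict)


@dataclass
class AsymmetricUnit:
    entry_id: str
    chains: list[Chain] = field(default_factory=list)


def count_atoms(asu: AsymmetricUnit) -> int:
    """Walk the hierarchy from the asymmetric unit down to atoms and count them."""
    return sum(len(res.atoms) for ch in asu.chains for res in ch.residues)


# Made-up coordinates and asa value, purely for illustration.
gly = Residue(
    "GLY", 1,
    [Atom("N", 0.0, 0.0, 0.0), Atom("CA", 1.5, 0.0, 0.0)],
    {"asa": 12.5},
)
asu = AsymmetricUnit("1be3", [Chain("A", [gly])])
print(count_atoms(asu))
```

In a relational database each of these levels would become one or more tables linked by keys rather than nested objects.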
36
Classes of data
  • Chemical descriptions (a.k.a Hetgroup or
    chem_comp information)
  • Structure, coordinates and associated information
  • Experimental information
  • Bibliographic data
  • Sequence database cross-references
  • Taxonomy

37
Database Design Goals
  • Robust - thorough analysis and consistency
    checks; expect substantial growth
  • Clean - top down design
  • Maintainable - use industry-standard tools
  • Open interface to independently-maintained
    software
  • Extensible to meet evolving needs

38
What does an RDBMS provide?
  • efficient database management
  • non-procedural
  • maintain data in an organised form
  • reading and writing data to the computer
  • fast data access mechanisms
  • reduce or eliminate need for redundant data
  • ensure integrity and consistency of data

39
Example Tables and Relationships
DEPOSITION
CREATION_DATE
LAST_UPDATE
TITLE
o ACCESSION_CODE
o DATE_ALL_ARRIVED_DATE
o DEP_PASSWORD
o DETAILS_NDB
o HARVEST_PROJECT_NAME
o HOLD_DATE
o HTTP_REFERER
o HTTP_USER_AGENT
o IMMUNE_RELATED
o RELEASE_DATE
o VIRUS_FLAG

REF_CONTACT_NAME
EMAIL
NAME_FAMILY
o BUILDING
o DEPARTMENT_1
o DEPARTMENT_2
o FAX
o NAME_GIVEN
o NAME_INITIALS
o NAME_SUFFIX
o PHONE
o TITLE
o URL
40
Our Schema
  • The whole schema statistics
  • 410 tables
  • 661 relationships
  • 1851 columns
  • Manage with Designer

41
Database Implementation
  • Industry standard relational database management
    system - Oracle.
  • Tables of data and relationships.
  • Constraints in the database prevent
    inconsistencies.
  • Many features come for free - e.g. interfaces to
    Excel and other products.
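
To illustrate the point that constraints in the database prevent inconsistencies, here is a small sketch. It uses SQLite from the Python standard library as a stand-in for Oracle, and the two tables and their columns are hypothetical, chosen only to show a primary key and a foreign key at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE entry (accession_code TEXT PRIMARY KEY, title TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE chain (
           accession_code TEXT NOT NULL REFERENCES entry(accession_code),
           chain_id TEXT NOT NULL,
           PRIMARY KEY (accession_code, chain_id))"""
)
conn.execute("INSERT INTO entry VALUES ('1be3', 'example title')")
conn.execute("INSERT INTO chain VALUES ('1be3', 'A')")


def try_insert_chain(accession_code: str, chain_id: str) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO chain VALUES (?, ?)", (accession_code, chain_id))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert_chain("1be3", "B"))  # new chain of an existing entry
print(try_insert_chain("1be3", "A"))  # duplicate chain for the same entry
print(try_insert_chain("9zzz", "A"))  # chain of an entry that does not exist
```

The checks live in the database itself, so every program that writes to it is held to the same rules.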

42
Use of Designer
  • Why Designer?
  • start to finish software engineering
  • Components of Designer
  • Process Modeller
  • Entity relationship diagrammer
  • Data flow Modeller
  • Function hierarchy diagrammer
  • Transformers, Editors, Generators, Servers

43
Normalised database: 410 tables, 2000 attributes
deposition data: 260 tables, 1300 attributes
reference data: 130 tables, 500 attributes
derived data: 20 tables, 200 attributes
Relationship labels: "is used to calculate", "provides standard values for"
46
Data Delivery: Data Warehouse
Data Capture
Clean-up
Legacy data (old PDB)
47
Search system
  • Design goals
  • usable by novices - useful for experts
  • arbitrary combination of queries
  • easy-to-write own queries
  • run locally or use remotely
  • incremental updates
  • warehouse subsetting; fallback database

48
EMSD Relational Databases
Normalised database: 410 tables, 2000 attributes
deposition data: 260 tables, 1300 attributes
reference data: 130 tables, 500 attributes
derived data: 20 tables, 200 attributes
Relationship labels: "transformed to data warehouse structure", "provides standard values for", "is used to calculate"
49
What Is A Data Warehouse
  • A Data Warehouse is simply a different way of
    thinking about (and therefore organising) data
  • same data, different representation
  • not transactional, but for reporting and analysis
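
"Same data, different representation" can be sketched as follows: two normalised tables are flattened into one summary table that is convenient for reporting. SQLite stands in for the real system here, and all table names, column names, and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalised (operational) tables: one fact stored in one place.
    CREATE TABLE entry (entry_id TEXT PRIMARY KEY, method TEXT);
    CREATE TABLE chain (
        entry_id TEXT REFERENCES entry(entry_id),
        chain_id TEXT,
        n_residues INTEGER
    );
    INSERT INTO entry VALUES ('e1', 'X-ray'), ('e2', 'NMR');
    INSERT INTO chain VALUES ('e1', 'A', 120), ('e1', 'B', 80), ('e2', 'A', 60);

    -- Warehouse-style table: the same data, pre-joined and pre-aggregated.
    CREATE TABLE wh_entry_summary AS
        SELECT e.entry_id,
               e.method,
               COUNT(*) AS n_chains,
               SUM(c.n_residues) AS n_residues
        FROM entry e
        JOIN chain c ON c.entry_id = e.entry_id
        GROUP BY e.entry_id, e.method;
    """
)

rows = conn.execute(
    "SELECT entry_id, method, n_chains, n_residues "
    "FROM wh_entry_summary ORDER BY entry_id"
).fetchall()
for row in rows:
    print(row)
```

A reporting query now reads a single wide table instead of joining the normalised ones, at the cost of rebuilding the summary whenever the source data change.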

50
Using the Warehouse
  • Ad Hoc query
  • not available to external users
  • Simple questions
  • text-based
  • Complex Analytical Questions
  • including true 3D queries
  • Data Mining
  • searching the data for hitherto unknown
    correlations

51
Information Hierarchy
Data Mining
Data Warehouse
Analytical Questions
Ad Hoc Querying and Browsing
Operational Database
Operational Reporting
Data
52
STRUCTURAL GENOMICS
53
STRUCTURAL GENOMICS
  • SELECTION OF TARGET PROTEINS OR DOMAINS
  • CLONING, EXPRESSION, PURIFICATION
  • CRYSTALLIZATION AND STRUCTURE DETERMINATION
  • ARCHIVING AND ANNOTATION OF THE NEW STRUCTURE

54
MSD URLs
  • EBI Macromolecular Structure Database: http://msd.ebi.ac.uk/
  • MacroMolecular PDB Code Search: http://msd.ebi.ac.uk/Services/Quaternary/mm_search/mm_search.html
  • PQS Protein Quaternary Structure Query Form: http://pqs.ebi.ac.uk/
  • 3D sequence server: http://www3.ebi.ac.uk:7654/3Dseq/
  • UN-Published References Assignments: http://msd.ebi.ac.uk/Services/UnPubRef/Server/form_page/form_page.html
  • PDB EBI Home: http://pdb-browsers.ebi.ac.uk/
  • OCA: http://oca.ebi.ac.uk/oca-bin/ocamain
  • 3DB Browser: http://pdb-browsers.ebi.ac.uk/pdb-bin/pdbmain
  • PDB Lite: http://pdb-browsers.ebi.ac.uk/pdb-bin/pdblite
  • AutoDep - PDB Submission: http://autodep.ebi.ac.uk/
  • Oracle Search: http://www.ebi.ac.uk/jji/ecsrch/
  • Biotech Validation: http://biotech.ebi.ac.uk:8400/

56
PDB SEARCHES AT THE EBI
57
Quaternary Structures
58
http://pqs.ebi.ac.uk/
59
3D sequence server: http://www3.ebi.ac.uk:7654/3Dseq/
1. Derive a sequence from the coordinate section of the PDB entry.
2. Align it with the sequence taken from the 'SEQRES' lines in the PDB file.
3. Align the coordinate-derived sequence again, but this time do not allow gaps to be inserted in covalently-linked segments.
4. Derive a probe sequence from this alignment, and use it to try to find the corresponding SWISS-PROT entry, on the basis of sequence similarity and taxonomy.
5. Check for any inconsistencies.
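
The first two steps can be sketched roughly as below. This is not the 3Dseq server's algorithm: `difflib.SequenceMatcher` from the Python standard library stands in for a proper sequence alignment, the three-letter-to-one-letter table is deliberately partial, and the function names and example residues are made up for illustration.

```python
from difflib import SequenceMatcher

# Partial lookup table, enough for the example; unknown residues become "X".
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "LYS": "K", "SER": "S", "THR": "T", "VAL": "V"}


def to_one_letter(residue_names):
    """Turn a list of three-letter residue names into a one-letter sequence."""
    return "".join(THREE_TO_ONE.get(name, "X") for name in residue_names)


def unobserved_segments(seqres, observed):
    """Return (start, end) index ranges of seqres with no match in observed."""
    matcher = SequenceMatcher(None, seqres, observed, autojunk=False)
    gaps = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" mark SEQRES stretches missing from the coordinates.
        if tag in ("delete", "replace"):
            gaps.append((i1, i2))
    return gaps


# Sequence as declared in SEQRES, and residues actually present in the coordinates.
seqres = to_one_letter(["GLY", "SER", "LYS", "THR", "ALA", "VAL", "LYS", "GLY"])
observed = to_one_letter(["LYS", "THR", "ALA", "VAL", "LYS"])
print(seqres, observed, unobserved_segments(seqres, observed))
```

Here the two terminal stretches of the SEQRES sequence have no counterpart in the coordinates, which is the kind of discrepancy the later steps have to take into account.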
60
3D sequence server: http://www3.ebi.ac.uk:7654/3Dseq/
61
http://oca.ebi.ac.uk/oca-bin/ocamain
62
OCA oca_at_bioinfo.weizmann.ac.il
63
Validation Server