Title: Macromolecular Structure Database group
1 http//www.ebi.ac.uk/msd/roadshow.html MSD
Roadshow Co-ordinator . Janet Copeland 2nd
November 2005 Oxford University
http//www.ebi.ac.uk/msd/education/Tutorial.html
2USERNAME cal PASSWORD warthog
3- Introduction to MSD and to Quaternary
Structures/Assemblies as Basis of MSD database - SSM Fold recognition
- PISA Surface and assembly toolkit
- MSDchem Chemistry reference data
- MSDlite/MSDpro generalised search systems
- MSDsite Active sites
- MSDmotif small structural motifs
4- Visualisation and Patterns
- Intergration Projects with Sequence and Domain
data - Validation/Deposition
- Clustering methods used at MSD
- MSDmine generalised data access to the MSD
- PIMS Protein Information System
- Targets Workflow for Target selection tools
- NMR NMR tools and data at MSD
- Data Mining and an example MSDtemplates
- DataBases at MSD including data warehouse
technologies - DataBase Replication
5http//www.ebi.ac.uk/msd
6Bioinformatics
Genomes
Literature
Expression- profiling
Metabolic data
Proteome data
Biochemistry
Bioinformatics
Comparative genomics
Mutant/RNAi data
Hypotheses and in silico models
7Role of Bioinformatics
- To Support Experimental Biology
- To Collect and Archive Data
- To provide Framework and Integration
- To give Easy Access to Data
- To make New Discoveries through Data Analysis
8http//www.wwpdb.org/
9WHAT IS THE PDB?
10Databanks and Databases
- The PDB Archive is a databank
- A series of flat files that have a format
originally designed for Fortran card readers - The MSD provides databases
- Collections of data (1000s attributes) organized
into relational tables and held with a RDMS.
11Linking to Domain data, eFamily
Sequence Mapping, SIFTS
MSDchem ligand data
PQS biological assemblies
Electron Density Visualisation AstexViewer
MSDPro, MSDlite
SSM fold matching
Surface Matching
MSDsite Active sites
12Data information
ATOM 2567 N PHE B 175 7.821 -25.530
-22.848 1.00 8.71 ATOM 2568 CA PHE B 175
8.845 -25.172 -21.877 1.00 9.41 ATOM
2569 C PHE B 175 9.449 -23.798 -22.169
1.00 10.02 ATOM 2570 O PHE B 175
10.664 -23.613 -22.103 1.00 10.37 ATOM 2571
CB PHE B 175 9.928 -26.251 -21.848 1.00
9.53 ATOM 2572 CG PHE B 175 10.969
-26.137 -22.982 1.00 10.03 ATOM 2573 CD1
PHE B 175 12.356 -25.819 -22.988 1.00 10.51
ATOM 2574 CD2 PHE B 175 11.725 -27.211
-23.402 1.00 10.25 ATOM 2575 CE1 PHE B 175
11.821 -27.095 -22.869 1.00 11.17 ATOM
2576 CE2 PHE B 175 12.282 -26.086 -24.008
1.00 10.95 ATOM 2577 CZ PHE B 175
10.953 -26.335 -23.622 1.00 11.38
13MSD service provider
- We provide a service to the scientific community
- 24/7 (almost)
- parallel DB with fail-over, etc.
- Service ping baseline check several times/day
- Data is incremented with new data weekly
- Systems are extensible
14Query capabilities
- Browsing (click and read)
- Simple search
- select records with some constraints
- More elaborate search
- select specific fields of some records with
constraints on some fields - Complex querying
- ability to return an answer that results from a
"live" computation, and was not part of any
record of the database
15What we cannot do well
- Give us sequence, we do rest
16(No Transcript)
17(No Transcript)
18- What is the function of this structure?
What is the function of this sequence?
- What is the function of this motif?
- the fold provides a scaffold, which can be
decorated in different ways by different
sequences to confer different functions - knowing
the fold function allows us to rationalise how
the structure effects its function at the
molecular level
19Complication Multiprotein Complexes
20ATPase
1H8E (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL
THREE CATALYTIC SITES OCCUPIED) MENZ, R.I.,
WALKER, J.E., LESLIE, A.G.W.
21Ground rules for bioinformatics
- Don't always believe what programs tell you
- they're often misleading sometimes wrong!
- Don't always believe what databases tell you
- they're often misleading sometimes wrong!
- Don't always believe what lecturers tell you
- they're often misleading sometimes wrong!
- In short, don't be a naive user
- when computers are applied to biology, it is
vital to understand the difference between
mathematical biological significance - computers dont do biology - they do sums
quickly!
22General Evaluation Criteria Be sceptical and
cynical!
- When you are searching for information you need
to judge its quality and suitability. - Think critically about each piece of information
you find and how you found it. - Relevance
- Does the information you have found adequately
support your research? - Does it answer the question, or support one of
your arguments? - How general or specific is the information
about the topic?
23http//harvester.embl.de/
Harvester collects information from selected
public databases
24- Appreciate how difficult it is to draw a complex
3-D object and appreciate the complexity of the
requirements for storing sequence and structural
information of molecules in a database. - There are a lot of interrelated pieces of
information about a biomolecule, such as - sequence similarities
- genome location
- protein structure
- Expression
- chemistry
25Some of the obstacles of searching databases are
- Data formats are not standard.
- The nomenclature is not standard.
- There is more than one database offering the
same information (data redundancy). - Links between databases may not be easy to
follow. - The number of databases available makes it
confusing to choose from
26Accuracy or Validity
You need to determine whether the information is
reliable or not
27Quality Control Issues
- The quality of archived data is no better than
the data determined in the contributing
laboratories. - Curation of the data can help to identify
errors. - Disagreement between duplicate determinations is
a clear warning of an error in one or the other. - Similarly, results that disagree with
established principles may contain errors. - It is useful, for instance, to flag deviations
from expected stereochemistry in protein
structures, but such outliers'' are not
necessarily wrong.
28Data quality
- Data Consistency
- Data Models
- Reliability
- Evidences ? Level of confidence ?
- Assignation of function by similarity
- recursive process ? propagation of errors
29Data quality
Its hard to judge whether something makes
sense. The lack of labeling on many web pages
makes it hard to know the source. Calculations
based on databases are even harder to deal
with Logical deductions may be worse. tacR gene
regulates the human nervous system tacQ gene is
similar to tacR but is found in E. coli so tacQ
gene regulates the E. coli nervous system
30Who spotted ?
E. coli nervous system
31Significance
- Appreciating that mathematical biological
significance are different is crucial - Important in understanding the limitations of
- database search algorithms
- multiple sequence alignment algorithms
- pattern recognition techniques
- functional site structure prediction tools
- Contrary to popular opinion, there is currently
still - no biologically-reliable automatic multiple
alignment algorithm - no infallible pattern-recognition technique
- no reliable gene, function or structure
prediction algorithm
32New Problems
- As a result, we will have to give up the safe''
idea of a stable databank composed of entries
that are correct when they are first distributed
in mature form and stay fixed thereafter. - Databanks are dynamic in information content and
growing in size, and maturing in quality. - Maintaining local copies largely top up this
is not sufficient. - Proliferation of various copies in various states
with out-of-date linkages