Macromolecular Structure Database group - PowerPoint PPT Presentation

Slides: 33
Provided by: same84
Transcript and Presenter's Notes


1

http://www.ebi.ac.uk/msd/roadshow.html
MSD Roadshow Co-ordinator: Janet Copeland
2nd November 2005, Oxford University
http://www.ebi.ac.uk/msd/education/Tutorial.html
2
USERNAME: cal PASSWORD: warthog
3
  • Introduction to MSD and to Quaternary
    Structures/Assemblies as Basis of MSD database
  • SSM: Fold recognition
  • PISA: Surface and assembly toolkit
  • MSDchem: Chemistry reference data
  • MSDlite/MSDpro: generalised search systems
  • MSDsite: Active sites
  • MSDmotif: small structural motifs

4
  • Visualisation and Patterns
  • Integration Projects with Sequence and Domain
    data
  • Validation/Deposition
  • Clustering methods used at MSD
  • MSDmine: generalised data access to the MSD
  • PIMS: Protein Information System
  • Targets: Workflow for Target selection tools
  • NMR: NMR tools and data at MSD
  • Data Mining and an example: MSDtemplates
  • Databases at MSD, including data warehouse
    technologies
  • Database Replication

5
http://www.ebi.ac.uk/msd
6
Bioinformatics
Genomes
Literature
Expression profiling
Metabolic data
Proteome data
Biochemistry
Bioinformatics
Comparative genomics
Mutant/RNAi data
Hypotheses and in silico models
7
Role of Bioinformatics
  • To Support Experimental Biology
  • To Collect and Archive Data
  • To provide Framework and Integration
  • To give Easy Access to Data
  • To make New Discoveries through Data Analysis

8
http://www.wwpdb.org/
9
WHAT IS THE PDB?
10
Databanks and Databases
  • The PDB Archive is a databank: a series of flat
    files in a format originally designed for
    Fortran card readers
  • The MSD provides databases: collections of data
    (1000s of attributes) organized into relational
    tables and held in an RDBMS.

11
Linking to Domain data, eFamily
Sequence Mapping, SIFTS
MSDchem ligand data
PQS biological assemblies
Electron Density Visualisation, AstexViewer
MSDPro, MSDlite
SSM fold matching
Surface Matching
MSDsite Active sites
12
Data information
ATOM   2567  N   PHE B 175       7.821 -25.530 -22.848  1.00  8.71
ATOM   2568  CA  PHE B 175       8.845 -25.172 -21.877  1.00  9.41
ATOM   2569  C   PHE B 175       9.449 -23.798 -22.169  1.00 10.02
ATOM   2570  O   PHE B 175      10.664 -23.613 -22.103  1.00 10.37
ATOM   2571  CB  PHE B 175       9.928 -26.251 -21.848  1.00  9.53
ATOM   2572  CG  PHE B 175      10.969 -26.137 -22.982  1.00 10.03
ATOM   2573  CD1 PHE B 175      12.356 -25.819 -22.988  1.00 10.51
ATOM   2574  CD2 PHE B 175      11.725 -27.211 -23.402  1.00 10.25
ATOM   2575  CE1 PHE B 175      11.821 -27.095 -22.869  1.00 11.17
ATOM   2576  CE2 PHE B 175      12.282 -26.086 -24.008  1.00 10.95
ATOM   2577  CZ  PHE B 175      10.953 -26.335 -23.622  1.00 11.38
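Turning ATOM records like these into information starts with reading the fields. The following is a minimal sketch, not MSD code. It assumes the whitespace-separated field order visible on this slide (record type, serial, atom name, residue, chain, residue number, x, y, z, occupancy, B-factor). Real PDB files are fixed-column, so a robust parser should slice by column position instead. The names `Atom`, `parse_atom` and `distance` are hypothetical.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line: str) -> Atom:
    """Parse one ATOM record, assuming 11 whitespace-separated fields."""
    f = line.split()
    if len(f) != 11 or f[0] != "ATOM":
        raise ValueError(f"not a simple ATOM record: {line!r}")
    return Atom(int(f[1]), f[2], f[3], f[4], int(f[5]),
                float(f[6]), float(f[7]), float(f[8]),
                float(f[9]), float(f[10]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the file."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# Two records copied from the slide (PHE B 175).
RECORDS = [
    "ATOM 2567 N PHE B 175 7.821 -25.530 -22.848 1.00 8.71",
    "ATOM 2568 CA PHE B 175 8.845 -25.172 -21.877 1.00 9.41",
]

atoms = {a.name: a for a in map(parse_atom, RECORDS)}
n_ca = distance(atoms["N"], atoms["CA"])
print(f"N-CA distance: {n_ca:.2f}")
```

For the two records shown, the computed N-CA separation is about 1.46, a derived value that is not stored in either record.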
13
MSD service provider
  • We provide a service to the scientific community
  • 24/7 (almost)
  • parallel DB with fail-over, etc.
  • Service ping baseline check several times/day
  • New data is added weekly
  • Systems are extensible

14
Query capabilities
  • Browsing (click and read)
  • Simple search: select records with some
    constraints
  • More elaborate search: select specific fields
    of some records with constraints on some fields
  • Complex querying: ability to return an answer
    that results from a "live" computation, and was
    not part of any record of the database
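The search levels listed above can be illustrated against a toy relational table. This is only a sketch using Python's built-in `sqlite3` module. The table `entry`, its columns and the three rows are invented for illustration and do not reflect the actual MSD schema or real entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (entry_id TEXT PRIMARY KEY, method TEXT, resolution REAL)"
)
# Invented toy rows, not real PDB entries.
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [("toy1", "X-ray", 1.8), ("toy2", "X-ray", 2.6), ("toy3", "NMR", None)],
)

# Simple search: whole records that satisfy a constraint.
simple = conn.execute(
    "SELECT * FROM entry WHERE method = 'X-ray' ORDER BY entry_id"
).fetchall()

# More elaborate search: specific fields, constraints on several fields.
elaborate = conn.execute(
    "SELECT entry_id FROM entry WHERE method = 'X-ray' AND resolution < 2.0"
).fetchall()

# Complex query: the answer is computed at query time and is not
# stored in any single record.
computed = conn.execute(
    "SELECT method, COUNT(*), AVG(resolution) FROM entry "
    "GROUP BY method ORDER BY method"
).fetchall()

print(simple)
print(elaborate)
print(computed)
```

The last query returns a per-method count and mean resolution, values that exist nowhere in the table until the query is run.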

15
What we cannot do well
  • Give us a sequence, we do the rest

16
(No Transcript)
17
(No Transcript)
18
  • What is the function of this structure?
  • What is the function of this sequence?
  • What is the function of this motif?
  • the fold provides a scaffold, which can be
    decorated in different ways by different
    sequences to confer different functions - knowing
    the fold and function allows us to rationalise how
    the structure effects its function at the
    molecular level

19
Complication: Multiprotein Complexes
20
ATPase
1H8E (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL
THREE CATALYTIC SITES OCCUPIED) MENZ, R.I.,
WALKER, J.E., LESLIE, A.G.W.
21
Ground rules for bioinformatics
  • Don't always believe what programs tell you:
    they're often misleading and sometimes wrong!
  • Don't always believe what databases tell you:
    they're often misleading and sometimes wrong!
  • Don't always believe what lecturers tell you:
    they're often misleading and sometimes wrong!
  • In short, don't be a naive user
  • when computers are applied to biology, it is
    vital to understand the difference between
    mathematical and biological significance
  • computers don't do biology - they do sums
    quickly!

22
General Evaluation Criteria: Be sceptical and
cynical!
  • When you are searching for information you need
    to judge its quality and suitability.
  • Think critically about each piece of information
    you find and how you found it.
  • Relevance
  • Does the information you have found adequately
    support your research?
  • Does it answer the question, or support one of
    your arguments?
  • How general or specific is the information
    about the topic?

23
http://harvester.embl.de/
Harvester collects information from selected
public databases
24
  • Appreciate how difficult it is to draw a complex
    3-D object and appreciate the complexity of the
    requirements for storing sequence and structural
    information of molecules in a database.
  • There are a lot of interrelated pieces of
    information about a biomolecule, such as
  • sequence similarities
  • genome location
  • protein structure
  • Expression
  • chemistry

25
Some of the obstacles to searching databases are:
  • Data formats are not standard.
  • The nomenclature is not standard.
  • There is more than one database offering the
    same information (data redundancy).
  • Links between databases may not be easy to
    follow.
  • The number of databases available makes it
    confusing to choose among them.

26
Accuracy or Validity
You need to determine whether the information is
reliable or not
27
Quality Control Issues
  • The quality of archived data is no better than
    the data determined in the contributing
    laboratories.
  • Curation of the data can help to identify
    errors.
  • Disagreement between duplicate determinations is
    a clear warning of an error in one or the other.
  • Similarly, results that disagree with
    established principles may contain errors.
  • It is useful, for instance, to flag deviations
    from expected stereochemistry in protein
    structures, but such "outliers" are not
    necessarily wrong.

28
Data quality
  • Data Consistency
  • Data Models
  • Reliability
  • Evidence? Level of confidence?
  • Assignment of function by similarity:
    a recursive process, hence propagation of errors
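The point about recursion and error propagation can be made concrete with a small sketch. It assumes a deliberately naive rule: any unannotated gene inherits the label of a similar, already annotated gene. The gene names, the label and the function `transfer_annotations` are all hypothetical.

```python
def transfer_annotations(seed, similar_pairs):
    """Copy labels along (source, target) similarity pairs until nothing changes.

    Returns the final annotations and, for each gene, how many transfer
    steps separate its label from the original seed annotation.
    """
    annotation = dict(seed)
    steps = {gene: 0 for gene in seed}
    changed = True
    while changed:
        changed = False
        for source, target in similar_pairs:
            if source in annotation and target not in annotation:
                annotation[target] = annotation[source]
                steps[target] = steps[source] + 1
                changed = True
    return annotation, steps


# One seed annotation that happens to be wrong (hypothetical).
seed = {"geneA": "kinase (mis-annotated)"}
pairs = [("geneB", "geneC"), ("geneA", "geneB"), ("geneC", "geneD")]

annotation, steps = transfer_annotations(seed, pairs)
for gene in sorted(annotation):
    print(gene, annotation[gene], "transfer steps:", steps[gene])
```

A single wrong seed ends up on every gene in the chain, and the step count shows how far each label is from direct evidence. Recording that kind of provenance is one way to attach a level of confidence to an annotation.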

29
Data quality
It's hard to judge whether something makes sense.
The lack of labeling on many web pages makes it
hard to know the source.
Calculations based on databases are even harder
to deal with.
Logical deductions may be worse:
tacR gene regulates the human nervous system
tacQ gene is similar to tacR but is found in E. coli
so tacQ gene regulates the E. coli nervous system
30
Who spotted it?
E. coli nervous system
31
Significance
  • Appreciating that mathematical and biological
    significance are different is crucial
  • Important in understanding the limitations of:
  • database search algorithms
  • multiple sequence alignment algorithms
  • pattern recognition techniques
  • functional site and structure prediction tools
  • Contrary to popular opinion, there is currently
    still:
  • no biologically-reliable automatic multiple
    alignment algorithm
  • no infallible pattern-recognition technique
  • no reliable gene, function or structure
    prediction algorithm

32
New Problems
  • As a result, we will have to give up the "safe"
    idea of a stable databank composed of entries
    that are correct when they are first distributed
    in mature form and stay fixed thereafter.
  • Databanks are dynamic in information content,
    growing in size, and maturing in quality.
  • Maintaining local copies largely by topping
    them up is not sufficient.
  • Proliferation of various copies in various states
    with out-of-date linkages.