Title: CVA for NMR data
1(No Transcript)
2From Yeast to Mouse to Humans: From Manchester to Hinxton and beyond
- Steve Oliver
- Professor of Genomics
- Faculty of Life Sciences
- The University of Manchester
- http://www.cogeme.man.ac.uk
- http://www.ispider.ac.uk
3Functional Genomics
4Tie everything back to the genome
5Mind your Ps (& Qs)
PEDRo Pedro Pierre
6PEDRo Model Life History
- Developed in COGEME Project (Consortium for Genomics of Microbial Eukaryotes) within BBSRC Investigating Gene Function (IGF) Initiative.
- Published in Nature Biotech, after feedback from wider community.
- No complete data sets at point of publication.
- Recent activities
- Collecting data from multiple (mostly IGF) sites that conform to the PEDRo model.
- Developing database containing PEDRo data (Pierre)
7The PEDRo UML schema in reduced form
8The nature of proteomics experiment data
- Sample generation
- Origin of sample
- hypothesis, organism, environment, preparation, paper citations
- Sample processing
- Gels (1D/2D) and columns
- images, gel type and ranges, band/spot coordinates
- stationary and mobile phases, flow rate, temperature, fraction details
- Mass Spectrometry
- machine type, ion source, voltages
- In Silico analysis
- peak lists, database name and version, partial sequence, search parameters, search hits, accession numbers
9Implementing the schema
XML Schema (excerpt):
<xs:element name="Column">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="description" type="xs:string"/>
      <xs:element name="manufacturer" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
RDB Schema (excerpt):
CREATE TABLE LCColumn (
  id integer PRIMARY KEY,
  description varchar(200) NOT NULL,
  manufacturer varchar(100) NOT NULL,
  part_number varchar(50) NOT NULL
)
XML Data (excerpt):
<Sample>
  <sample_id>D0117</sample_id>
  <sample_date>2001-07-09</sample_date>
  <experimenter>David Stead, Zhikang Yin</experimenter>
</Sample>
[Diagram labels: UML Model; XML Schema; RDB Schema; XML Data; Pedro; XML Database (Pierre); arrows labelled map, read, type/load]
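As a rough illustration of how one model element can appear both as XML and as a relational row, the following Python sketch loads a single Column element into the LCColumn table from this slide. The element values, the presence of a part_number child element, the helper name load_column, and the use of an in-memory SQLite database are all assumptions made for illustration; this is not the PEDRo or Pierre tooling itself.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical instance document; all values are invented for illustration.
COLUMN_XML = """
<Column>
  <description>Reverse-phase C18 column</description>
  <manufacturer>ExampleCo</manufacturer>
  <part_number>X-100</part_number>
</Column>
"""

# Table definition as shown on the slide.
DDL = """
CREATE TABLE LCColumn (
    id integer PRIMARY KEY,
    description varchar(200) NOT NULL,
    manufacturer varchar(100) NOT NULL,
    part_number varchar(50) NOT NULL
)
"""

def load_column(conn, xml_text):
    """Map the child elements of <Column> onto the LCColumn columns."""
    root = ET.fromstring(xml_text)
    fields = ("description", "manufacturer", "part_number")
    # A missing element yields None, which the NOT NULL constraints reject.
    row = tuple(root.findtext(tag) for tag in fields)
    cur = conn.execute(
        "INSERT INTO LCColumn (description, manufacturer, part_number) "
        "VALUES (?, ?, ?)",
        row,
    )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")  # SQLite stands in for the real RDB
conn.execute(DDL)
new_id = load_column(conn, COLUMN_XML)
rows = conn.execute(
    "SELECT id, description, manufacturer, part_number FROM LCColumn"
).fetchall()
print(rows)
```

Leaving the id out of the INSERT lets the database assign the primary key, so the XML document does not need to carry one.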
10Modelling goals for PEDRo
- Enough detail to
- Allow results of different experiments to be analysed/compared.
- Allow suitability of experiment design and implementation decisions to be assessed.
- Allow protein identifications to be rerun in future with new databases or software.
- Not detailed enough
- To allow experiments to be rerun.
11IGF Datasets currently in Database
12There must be carrots as well as sticks
13Proteomics ain't the only show in town
14INTEGRATION
15Why integrate data?
- These 200 proteins are overexpressed in my mouse model compared to WT. What are the interaction partners of these proteins?
- Data is stored at a variety of sites and formats.
- Databases designed mainly for browsing
- (MIPS, SGD, BIND, SCPD, KEGG).
- Need databases that allow complex queries.
- Need to be easily usable by biologists.
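The kind of question posed on this slide (given a list of proteins, what are their interaction partners?) can be sketched in a few lines of Python. The protein names and interaction pairs below are placeholders, and the function names are hypothetical; a real query would run against an integrated database rather than an in-memory list.

```python
from collections import defaultdict

# Toy interaction list; protein names are placeholders, not real data.
INTERACTIONS = [("P1", "P2"), ("P2", "P3"), ("P4", "P1"), ("P5", "P6")]

def build_partner_index(pairs):
    """Index undirected interactions so partners can be looked up by protein."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index

def partners_of(proteins, index):
    """Return {protein: sorted list of partners} for each protein of interest."""
    # index.get avoids creating empty entries for unknown proteins.
    return {p: sorted(index.get(p, ())) for p in proteins}

index = build_partner_index(INTERACTIONS)
result = partners_of(["P1", "P7"], index)
print(result)
```

Proteins with no recorded interactions come back with an empty list instead of being silently dropped, which keeps the answer aligned with the input list.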
16Genome Information Management System (GIMS)
Paton NW, Khan SA, Hayes A, Moussouni F, Brass A, Eilbeck K, Goble CA, Hubbard SJ, Oliver SG (2000) Conceptual modelling of genomic information. Bioinformatics 16, 548-557.
17Database implementation
- Uses the object database FastObjects.
- All database classes and analysis programs are written in Java.
- Allows close integration of the programming language with the database.
- Allows fast access to database data from application programs.
- Allows data to be stored in a way that reflects the underlying mechanisms in the organism.
- Very flexible and extensible.
18(No Transcript)
19GIMS User Interface
- Java application.
- Can download from http://img.cs.man.ac.uk/gims
- Communicates with database via RMI.
- On start-up, application is sent information about database classes and canned queries.
- Very flexible.
- Allows user to browse database, ask canned queries, and store and combine data sets.
- Can save results as txt, html or xml.
20(No Transcript)
21Cross-validation of high-throughput data is
essential
22Evaluating protein-interaction data
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403.
Cornell M, Paton NW, Oliver SG (2004) A critical and integrated view of the yeast interactome. Comp. Funct. Genom. 5, 382-402.
23Set of confirmed Y2H interactions
Confirmation of an interaction requires
- Identification in more than one Y2H screen, OR
- The reverse interaction must have been identified, OR
- The two proteins must have been identified in the same protein complex (from either classical or high-throughput affinity purification studies).
A total of 451 reliable interactions, involving 581 proteins, have been identified from a combined data set comprising 5214 interactions and 4025 proteins.
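The three OR-ed confirmation criteria above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used to derive the 451 interactions: the data layout (one set of (bait, prey) pairs per screen, one set of protein names per complex), the function name, and the toy data are all hypothetical.

```python
def confirmed_interactions(screens, complexes):
    """Apply the three OR-ed confirmation criteria to Y2H data.

    screens   -- list of sets of (bait, prey) tuples, one set per Y2H screen
    complexes -- list of sets of protein names, one set per purified complex
    """
    all_pairs = set().union(*screens) if screens else set()
    confirmed = set()
    for bait, prey in all_pairs:
        # Criterion 1: the same pair was found in more than one screen.
        n_screens = sum((bait, prey) in s for s in screens)
        # Criterion 2: the reverse interaction was also observed.
        # Assumption: a self-interaction does not count as its own reverse.
        reverse_seen = bait != prey and (prey, bait) in all_pairs
        # Criterion 3: both proteins occur together in at least one complex.
        same_complex = any(bait in c and prey in c for c in complexes)
        if n_screens > 1 or reverse_seen or same_complex:
            # Store as an unordered pair so A-B and B-A count once.
            confirmed.add(frozenset((bait, prey)))
    return confirmed

# Toy data; protein names are placeholders.
screens = [
    {("A", "B"), ("C", "D"), ("E", "F")},
    {("A", "B"), ("D", "C"), ("G", "H")},
]
complexes = [{"G", "H", "I"}]

result = confirmed_interactions(screens, complexes)
summary = sorted(tuple(sorted(pair)) for pair in result)
print(summary)
```

In the toy data, A-B is confirmed by two screens, C-D by its reverse, and G-H by shared complex membership, while E-F appears only once and is left unconfirmed.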
24 25Quantitative comparison of interaction datasets.
26GENOME
TRANSCRIPTOME
You don't get owt for nowt
PROTEOME
METABOLOME
27Yeast ain't the only show in town either!
28GIMS empowers the biologist
29Resources at the centre
Workflows that could be used to generate this data
People who have registered an interest in these
data
Related Data
Provenance record on how the data were produced
Ontologies describing data
30Biologists/Clinicians at the centre
Workflows they wrote or used
People they collaborate with
31myGrid
- EPSRC UK e-Science pilot project.
- Open Source Upper Middleware for Bioinformatics.
- (Web) Service-based architecture -> Grid services.
www.mygrid.org.uk
32iSPIDER A Pilot Grid for Integrative Proteomics
- In Silico Proteome Integrated Data Environment
Resource
33Diversity of proteome data
gels
sequences
>A01562
MAPKATYLIGAADKFHW
>A01567
MAQQPKEMLNILADKFHWFLYC
Other data: species, PTMs, pathways, functional annotation, transcriptome data
Structures/folds
mass spec
34Integration problems
- Lack of specific middleware
- Existing resources not wrapped
- Lack of data standards
- Standards for proteomics, incl. MS and protein identification, are emerging
- Data not modelled
- New challenges from proteomics
- Data not captured/modelled
- Data not captured
- No mature repositories/databases for some proteome data
- But there are lots of data
35Aims
- To develop an integrated platform of proteomic data resources enabled as Grid/Web services
- Integrate existing proteome resources, enabling them as Grid/Web services.
- To develop novel, proteome-specific databases as part of iSPIDER delivered as Grid/Web and browser-based services
- A repository for experimental proteome data
- A proteome protein identification server and database
- A phosphoproteome specific database
- To develop middleware support for distributed querying, workflows and other integrated data analysis tasks
- Demonstrate effectiveness of the resulting infrastructure through studies in proteomics, including
- Visualisation clients for proteomic data e.g. LRF data
- Analyses for fungal species of industrial interest
- Protein structural/functional trends in experimental proteomics e.g. linking domain structural patterns
36Integrated Proteomics Informatics Platform -
Architecture
[Architecture diagram. Labelled components: ISPIDER Proteomics Clients (Vanilla Query Client, PPI Validation Analysis Client, Protein ID Client); Web services; ISPIDER Proteomics Grid Infrastructure; Existing E-Science Infrastructure; Public Proteomic Resources; Existing Resources; iSPIDER Resources. Components are annotated with work packages WP1 to WP6.]
Key: WS = Web services, GS = genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package
37Existing infrastructure and skills
- myGRID
- OGSA-DQP
- AutoMed
- PSI/Pedro infrastructure/standards
- Protein id tools at Manchester
- 3 primary data integration strategies
- Workflows
- DQP using OGSA-DAI
- Heterogeneous schema integration technologies
38Need to share and cross-validate analytical
procedures
39Workflow Components
- Freefluo: workflow engine to run workflows
- Scufl: Simple Conceptual Unified Flow Language
- Taverna: writing, running workflows and examining results
- SOAPLAB: makes applications available
40Security & Ethics