Title: WP6 Part 1: Bioinformatics
1WP6 Part 1 Bioinformatics
Presenters: Xueping Quan, Marco Schorlemmer, Dave Robertson
- First results passed peer review
- Working on more extensive proteomics knowledge sharing
- Library of existing services collated
- Library of LCC experiment protocols underway
2OK From an Experimenter's Viewpoint
- Interaction model = experiment design
- Experimental roles allocated to peers
- Constraints prescribe methods on peers
- Message passing synchronises tasks
- Formal model gives
- Automation, extending experiment repertoire
- Repeatability, because we preserve state
- Scrutiny, for reviewers
3P2P Proteomics
- Proteome is the protein equivalent of the genome
- Proteomics studies the quantitative changes occurring in a proteome and its application for
- disease diagnostics
- therapy
- drug development
4Peer-to-Peer Experimentation in Protein Structure Prediction: an Architecture, Experiment and Initial Results
5Experiment - Consistency Checking
- Taking a non-expert user's perspective
- Applied Bioinformatics - Whom to believe?
- Note
- This scenario needs to allow for passive peers, to incorporate knowledge from the large number of traditional bioinformatics resources (databases etc.)
- Comparing server results for consistency typically increases confidence in the result.
6Experiment Consistency Checking
Step 1: One proxy per service allows data to be retrieved from passive peers. Each query is relayed to the appropriate service.
[Figure: a query (input, keyword, ID, sequence, etc.) passes through proxies (wrappers) and interfaces (WSDL, etc.) to an application, database or web server, which returns data relating to the input.]
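The proxy idea in Step 1 can be sketched in Python as follows. This is only an illustration: the `ServiceProxy` class, the `query_all` helper, the service names and the in-memory dictionaries are all hypothetical stand-ins for real wrappers around WSDL interfaces, databases or web servers.

```python
from typing import Callable, Dict, List


class ServiceProxy:
    """Wraps one passive resource behind a uniform query interface."""

    def __init__(self, name: str, fetch: Callable[[str], List[str]]):
        self.name = name
        self._fetch = fetch  # hypothetical back-end call (WSDL, DB, web form)

    def query(self, key: str) -> List[str]:
        # Return whatever data the wrapped resource holds for this key.
        return self._fetch(key)


def query_all(proxies: List[ServiceProxy], key: str) -> Dict[str, List[str]]:
    """Send the same query to every proxy and collect results per service."""
    return {p.name: p.query(key) for p in proxies}


# Toy back-ends standing in for passive peers (invented data).
db_a = {"YAL001C": ["model_a1"]}
db_b = {"YAL001C": ["model_b1", "model_b2"]}

proxies = [
    ServiceProxy("service_a", lambda k: db_a.get(k, [])),
    ServiceProxy("service_b", lambda k: db_b.get(k, [])),
]
results = query_all(proxies, "YAL001C")
```

Because every resource is reached through the same `query` call, the caller does not need to know how each passive peer is actually accessed.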
7Experiment Consistency Checking
Step 2: Automated harvesting of results for targets, collated to allow easy comparison of answers. The scientist logs a local opinion on the relative quality of the other (passive) peers for each target and caches the most important positive and/or negative results.
[Figure: polling multiple sites feeds a local database of trusted results with provenance.]
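A local store of harvested results with provenance and a logged opinion might look like the sketch below. The `Record` and `LocalStore` names, the +1/-1 opinion encoding and the sample entries are assumptions made for illustration, not part of the original system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    target: str     # protein or sequence the result is about
    source: str     # provenance: which peer supplied the result
    result: str     # the harvested answer
    opinion: int    # scientist's local judgement: +1 trusted, -1 distrusted


@dataclass
class LocalStore:
    records: List[Record] = field(default_factory=list)

    def log(self, target: str, source: str, result: str, opinion: int) -> None:
        self.records.append(Record(target, source, result, opinion))

    def collate(self, target: str) -> Dict[str, Tuple[str, int]]:
        """Side-by-side view of every source's answer for one target."""
        return {r.source: (r.result, r.opinion)
                for r in self.records if r.target == target}

    def trusted(self, target: str) -> List[str]:
        """Sources the scientist marked as trusted for this target."""
        return [r.source for r in self.records
                if r.target == target and r.opinion > 0]


store = LocalStore()
store.log("YAL001C", "site_a", "fold_x", +1)   # invented example entries
store.log("YAL001C", "site_b", "fold_y", -1)
```

Keeping the source next to each cached result is what lets a later comparison say not only what was predicted but who predicted it.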
8Experiment Specific Task
- Extend structural knowledge through modelling
- Find fragments of 3D-models of S. cerevisiae (yeast) proteins that can be trusted
- 6604 yeast protein sequences (some predicted)
- currently 330 known 3D-structures (in PDB)
(Popular strategy, typically accomplished with the help of a meta-WWW-server today.)
9Databases of pre-computed 3D-models
SWISS: restrictive; non-redundant; high-quality models only (SWISSMODEL)
SAM yeast models: complete (at least one model per ID); redundant; raw models (SAM-T06 / UNDERTAKER)
ModBase: permissive; highly redundant; pre-filtered before the task (PSI-BLAST / MODELLER)
10Complications: True and False Redundancy
Example 1: highly redundant set
Example 2: multi-domain proteins, non-redundant sets (< 90% overlap)
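One way to separate true from false redundancy is to compare the residue ranges two models cover. The sketch below rests on several assumptions not stated on the slide: models are given as inclusive (start, end) residue ranges, overlap is measured relative to the shorter model, and the threshold of 90 is read as 90%.

```python
from typing import Tuple

Range = Tuple[int, int]  # inclusive (start, end) residue numbers


def overlap_fraction(a: Range, b: Range) -> float:
    """Fraction of the shorter range that is covered by the intersection."""
    (a_start, a_end), (b_start, b_end) = a, b
    shared = max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    shorter = min(a_end - a_start + 1, b_end - b_start + 1)
    return shared / shorter


def is_redundant(a: Range, b: Range, threshold: float = 0.9) -> bool:
    """Treat two models as redundant if they overlap by at least threshold."""
    return overlap_fraction(a, b) >= threshold


same_domain = is_redundant((1, 100), (5, 100))         # nearly identical coverage
separate_domains = is_redundant((1, 100), (150, 300))  # different domains
```

Under this definition, two models of different domains in a multi-domain protein are kept as distinct, while near-identical models of the same region count as redundant.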
11Databases of pre-computed 3D-models
SWISS: 769 models
SAM yeast models: 2211 models (selected top model if E-value < 10^-3)
ModBase: 2546 models (pre-filtered: sequence identity > 20%, score > 0.7, E-value < 10^-6)
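The selection criteria listed for the SAM and ModBase sets can be expressed as simple predicates. In this sketch the `Model` fields, the reading of sequence identity as a percentage and the candidate values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Model:
    protein_id: str
    e_value: float
    seq_identity: float = 0.0   # percent sequence identity (assumed unit)
    score: float = 0.0          # model quality score


def keep_sam(m: Model) -> bool:
    """SAM criterion: E-value below 10^-3."""
    return m.e_value < 1e-3


def keep_modbase(m: Model) -> bool:
    """ModBase criterion: identity > 20, score > 0.7, E-value below 10^-6."""
    return m.seq_identity > 20 and m.score > 0.7 and m.e_value < 1e-6


# Invented candidates, not real database entries.
candidates = [
    Model("P1", e_value=1e-10, seq_identity=35.0, score=0.9),
    Model("P2", e_value=1e-4, seq_identity=35.0, score=0.9),
    Model("P3", e_value=1e-10, seq_identity=15.0, score=0.9),
]
sam_kept = [m.protein_id for m in candidates if keep_sam(m)]
modbase_kept = [m.protein_id for m in candidates if keep_modbase(m)]
```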
12Implementation using LCC interpreter
- multi-agent interaction coordination through service composition
- LCC interpreter
- loosely based on electronic societies (of peers)
- uses WSDL as standard
- For more information please refer to
- Xueping Quan, Chris Walton, Dietlind L Gerloff, Joanna L Sharman and Dave Robertson, GCCB2006.
- to be superseded by (more flexible) OK-kernel
13Implementation using LCC Interpreter
14LCC Protocol
a(data_collator, X) ::
    data_request(Is) <= a(experimenter, E) then
    a(data_collector(Is,Sp,Sd), X) <- yeast_id(Is) and source(Sp) then
    filter(Is,Sp,Sd) => a(data_filter(Is,Sp,Sd), F) then
    filtered(Is,Sp,S) <= a(data_filter(Is,Sp,Sd), F) then
    filtered(Is,Sp,S) => a(data_comparer, C) then
    data_compared(Is,SF) <= a(data_comparer, C) then
    data_compared(Is,SF) => a(experimenter, E) then
    data_compared(Is,SF) => a(data_publisher, PU)

a(experimenter, E) ::
    data_request(Is) => a(data_collator, X) then
    data_compared(Is,SF) <= a(data_collator, X)

a(data_collector(Is,Sp,Sd), X) ::
    ( null <- Sp = [] and Sd = [] )
    or
    ( a(data_retriever(I,P,D), X) <- (Sp = [P|Rp] and Sd = [D|Rd] and Is = [I|Ri]) then
      a(data_collector(Ri,Rp,Rd), X) )

a(data_retriever(I,P,D), X) ::
    data_request(I) => a(data_source, P) then
    data_report(I,D) <= a(data_source, P)

a(data_filter(I,Sp,Sd), F) ::
    filter(I,Sp,Sd) <= a(data_collator, X) then
    filtered(I,Sp,S) => a(data_collator, X) <- apply_filter(Sd,S)

a(data_source, P) ::
    data_request(I) <= a(data_retriever(I,P,D), X) then
    data_report(I,D) => a(data_retriever(I,P,D), X) <- lookup(I,D)

a(data_comparer, C) ::
    filtered(Is,Sp,S) <= a(data_collator, X) then
    data_compared(Is,SF) => a(data_collator, X) <- consistency_check(S,SF)
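To make the control flow of this protocol easier to follow, the sketch below mimics it with plain Python functions instead of message passing. It is not the LCC interpreter: the function names mirror the roles, the dictionaries stand in for data sources, and the filter and comparison callbacks are hypothetical placeholders for the apply_filter and consistency_check constraints.

```python
from typing import Callable, Dict, List, Optional

Source = Dict[str, str]  # toy data_source: yeast ID -> model label


def retrieve(item: str, source: Source) -> Optional[str]:
    """data_retriever: request one item from one data_source (lookup)."""
    return source.get(item)


def collect(items: List[str], sources: List[Source]) -> List[Optional[str]]:
    """data_collector: recurse over the ID and source lists in step."""
    if not items or not sources:      # the 'null' branch: lists exhausted
        return []
    (i, *ri), (p, *rp) = items, sources
    return [retrieve(i, p)] + collect(ri, rp)


def collate(items: List[str], sources: List[Source],
            apply_filter: Callable, consistency_check: Callable):
    """data_collator: collect, then filter, then compare."""
    sd = collect(items, sources)      # raw data, one entry per (ID, source)
    s = apply_filter(sd)              # stands in for the data_filter role
    return consistency_check(s)       # stands in for the data_comparer role


# Invented toy data: three sources asked about the same yeast ID.
src_a = {"YAL001C": "fold_x"}
src_b = {"YAL001C": "fold_x"}
src_c = {}

drop_missing = lambda sd: [d for d in sd if d is not None]
all_agree = lambda s: len(s) > 0 and len(set(s)) == 1

verdict = collate(["YAL001C"] * 3, [src_a, src_b, src_c], drop_missing, all_agree)
```

The recursion in `collect` corresponds to the data_collector role calling itself on the list tails; in the real protocol each step is a message exchange between peers rather than a local function call.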
15MaxSub - Examples
- pair-wise, sequence-dependent
- finds common substructure (shown in blue)
16Results
- CYSP: Comparison of Yeast 3D Structure Predictions
- 578 three-way supported MaxSub substructures > 45 aa, from 545 proteins
- (Linked from www.openk.org)
Pair-wise MaxSub Comparisons
           SWISS        ModBase        SAM
SWISS      769 (717)    649 (594)      585 (559)
ModBase                 2546 (2280)    620 (594)
SAM                                    2211 (2211)
17Proteomic Analysis
- Expression Proteomics
- proteins are extracted from cells and tissues
- proteins are separated
- two-dimensional gel electrophoresis
- liquid chromatography
- proteins are digested and identified
- various mass spectrometry methods
- Bioinformatic Analysis
- primary, secondary, tertiary structures
- sequence alignment and homology
- motifs and domains
- protein interactions and networks
- Functional Proteomics
18Expression Proteomics
19Expression Proteomics
20Peptide/Protein Identification
- Sequencing information in archives that does not produce clear identifications is rarely accessible to other groups
- most of it will never be reflected in protein DBs
- information is trashed
- Information of high importance for other groups analysing sequence/function of homologous proteins
- contains sequences with post-translational modifications not to be found in current protein DBs
- Spectra and sequence tags generated in one lab could be used by other labs to evaluate confidence of experimental or predicted sequences
21Information Overflow
- Proteomic analysis is currently an inhumane task
- LC-MS analysis produces > 10,000 spectra
- each spectrum yields (after sequencing and DB search) several peptide or peptide tag candidates
- each step produces an identification score whose final evaluation is performed manually (using probability data)
- Many proteomic labs are involved in the characterization of proteomes, protein complexes and networks
- As a result, the speed of information production increases very fast
22Expression Proteomics
23P2P Proteomics with OK
24Sequence Identification Scenario
- An investigator asks an identifier to match a sequence against proteomics labs' repositories.
- The identifier acts as a searcher, querying each known proteomics lab to retrieve hits for the given input sequence, collects the results, and then sends them back to the investigator.
- The queried proteomics lab could store high-scoring queries to increase the reliability of the matching sequences.
- The end-point process of sequence data-mining done by the proteomics lab is performed by Blast engines local to each peer.
- The first prototype only matches input sequences; the next release could also directly accept mass spectra as input. This task will use an OMSSA engine capable of matching spectra against the same sequence database used by the Blast engine.
25Sequence Identification IM in LCC
a(investigator, A) ::
    identify(Seqs,P) => a(identifier,B) <- get_sequences(Seqs,P) then
    visualise(Result_set) <- answer(Result_set) <= a(identifier,B)

a(identifier, B) ::
    identify(Seqs,P) <= a(investigator,A) then
    a(searcher(Seqs,P,Ls,Result_set), B) <- lab_list(Ls) then
    answer(Result_set) => a(investigator,A) then
    a(identifier, B)

a(searcher(Seqs,P,Ls,Result_set), B) ::
    ( query(Seqs,P) => a(proteomics_lab,L) <- Ls = [L|RLs] then
      Result_set = [(Result,L)|RSs] <- answer(Result) <= a(proteomics_lab,L) then
      a(searcher(Seqs,P,RLs,RSs), B) )
    or
    null <- Ls = [] and Result_set = []

a(proteomics_lab, L) ::
    query(Seqs,P) <= a(searcher(_,_,_,_),B) then
    answer(Result) => a(searcher(_,_,_,_),B) <- find_hit(Seqs,P,Result) then
    a(proteomics_lab, L)
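The recursive searcher role is the core of this interaction model. The sketch below reproduces its logic in Python with direct function calls in place of messages. The `ProteomicsLab` class, its exact-match lookup (standing in for a Blast search) and the sample data are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class ProteomicsLab:
    """Toy proteomics_lab peer; exact lookup stands in for a Blast search."""

    def __init__(self, name: str, repository: Dict[str, str]):
        self.name = name
        self.repository = repository   # sequence -> annotation (invented)

    def find_hit(self, seqs: List[str], params: dict) -> List[str]:
        # params is accepted but unused in this toy version
        return [self.repository[s] for s in seqs if s in self.repository]


def searcher(seqs: List[str], params: dict,
             labs: List[ProteomicsLab]) -> List[Tuple[List[str], str]]:
    """Recursive searcher role: query the head lab, recurse on the rest."""
    if not labs:                       # Ls = []: nothing left to ask
        return []
    lab, *rest = labs                  # Ls = [L|RLs]
    result = lab.find_hit(seqs, params)
    return [(result, lab.name)] + searcher(seqs, params, rest)


def identifier(seqs: List[str], params: dict,
               lab_list: Callable[[], List[ProteomicsLab]]):
    """identifier role: obtain the lab list, search, return the result set."""
    return searcher(seqs, params, lab_list())


labs = [
    ProteomicsLab("lab1", {"PEPTIDE": "protein_A"}),
    ProteomicsLab("lab2", {}),
]
result_set = identifier(["PEPTIDE"], {}, lambda: labs)
```

Each entry of `result_set` pairs a lab's result with the lab that produced it, matching the (Result,L) pairs accumulated by the searcher role.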
26Step by Step
Legend: peer, message, constraint
- An investigator uses a GUI to get input sequences Seqs and a set of parameters P
- investigator sends message identify(Seqs, P) to an identifier
- identifier retrieves a list of known proteomics labs
- identifier becomes searcher and sends a query to the first proteomics_lab of the list
- proteomics_lab resolves the find_hit constraint and sends back an answer with the result (i.e. a URL for an XML file)
- searcher loops the queries over the list of proteomics_labs and collects results in a result_set
- searcher returns to the role identifier and sends result_set back to investigator
- investigator receives the result_set and displays it on a GUI
[Figure: message sequence between investigator, identifier, searcher and proteomics_lab: identify(Seqs, P), query(Seqs, P), answer(result), answer(result_set).]
The find_hit() constraint also kicks off a process inside the proteomics_lab peer which stores high-scoring queries.
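The side effect of find_hit(), storing high-scoring queries inside the lab peer, could be sketched as below. The position-wise `identity` score is a deliberately crude stand-in for a real Blast score, and the `Lab` class, the 0.8 threshold and the sequences are assumptions for illustration.

```python
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Toy similarity: matching positions divided by the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


class Lab:
    """Toy proteomics_lab peer that remembers high-scoring queries."""

    def __init__(self, repository: List[str], threshold: float = 0.8):
        self.repository = repository
        self.threshold = threshold        # hypothetical cut-off
        self.stored_queries: List[Tuple[str, str, float]] = []

    def find_hit(self, seq: str) -> Tuple[str, float]:
        # Pick the repository sequence most similar to the query.
        best = max(self.repository, key=lambda s: identity(seq, s))
        score = identity(seq, best)
        if score >= self.threshold:
            # Keep the query: it may support the matched sequence later.
            self.stored_queries.append((seq, best, score))
        return best, score


lab = Lab(["PEPTIDE", "PROTEIN"])
good_hit = lab.find_hit("PEPTIDA")   # close to PEPTIDE, gets stored
poor_hit = lab.find_hit("AAAAAAA")   # no real match, not stored
```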