Title: Knowledge-based Middleware for BioGrid services from the myGrid Project
1 Knowledge-based Middleware for BioGrid services
from the myGrid Project.
- myGrid consortium
- http://www.mygrid.org.uk
- 5th Steering Meeting
- January 16th 2004, Sheffield
2 Agenda
- 9.00-9.30 Welcome and coffee in room G22
- 9.30 Move to Learning Media Unit Studio
- 9.35 Summary of general progress, plans, issues and highlights over past 6 months (Carole Goble)
- 10.15 Demos: Graves Disease (Anil Wipat), Williams Syndrome (Robert Stevens/Hannah)
- 10.45 Coffee
- 11.00 Discussion about progress and demos
- 11.30 Technical status and past 6 months activities, including summaries of different work packages (Nick Sharman)
- 12.30 Discussion
- 13.00 Lunch in room G22, Department of Computer Science
- 13.40 Return to Learning Media Unit Studio
- 13.45 Discussion on detailed planning for the AHM 2004; invited comments from Industrial partners concerning experience of using software; roll-out and closure of project May 2005.
- 15.00 Tea
- 15.20 Wrap up discussion
- 16.00 End
Work package summaries:
- Architecture and Integration - Nick Sharman
- Workbench - Chris Greenhalgh
- Ambit - Rob Gaizauskas
- DQP, mIR - Paul Watson
- Semantics and Find service - Carole Goble
- Registry Views - Luc Moreau
- Event notification - Luc Moreau
- Metadata/information model and provenance - Carole Goble
- Workflow enactment and tools - Matthew Addis
- Bio services, Taverna - Peter Rice
3 Data-intensive bioinformatics
Source: GlaxoSmithKline
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
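A flat-file entry like the one on this slide is organised by two-letter line codes (ID, DE, GN, OS, SQ and so on), with sequence lines starting with blanks. The following is a minimal sketch, not a full parser, of how such an entry could be grouped by line code. It assumes the usual Swiss-Prot layout (code in columns 1-2, content from column 6); the function name `parse_flat_entry` and the abbreviated sample are illustrative only.

```python
from collections import defaultdict

# Abbreviated sample in Swiss-Prot flat-file layout; values taken from the slide.
SAMPLE = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7)
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP
"""

def parse_flat_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "  ":          # sequence lines start with blanks
            sequence.append(content.replace(" ", ""))
        elif code.strip():        # skip empty lines
            fields[code].append(content)
    # Continuation lines with the same code are joined with a space.
    entry = {code: " ".join(parts) for code, parts in fields.items()}
    entry["sequence"] = "".join(sequence)
    return entry

entry = parse_flat_entry(SAMPLE)
print(entry["ID"].split()[0], entry["OS"], len(entry["sequence"]))
```

Keeping the line code as the dictionary key mirrors the file format directly, which makes it easy to see why workflows that pass such records between services need agreed data types.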
4 Graves disease
Application Drivers
- Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism
- Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
5 Experiment life cycle
- Forming experiments: Resource and service discovery, Repository creation, Workflow creation, Database query formation
- Personalisation: Personalised registries, Personalised workflows, Info repository views, Personalised annotations, Personalised metadata, Security
- Discovering and reusing experiments and resources: Workflow discovery and refinement, Resource and service discovery, Repository creation, Provenance
- Executing experiments: Workflow enactment, Distributed Query processing, Job execution, Provenance generation, Single sign-on authentication, Event notification
- Providing services and experiments: Service registration, Workflow deposition, Metadata Annotation, Third party registration
- Managing experiments: Information repository, Metadata management, Provenance management, Workflow evolution, Event notification
6 Bio in silico experiments: service types
- Making in silico experiments
- workflow
- distributed database query processing.
- Managing experimental outcomes
- information management
- managing metadata
- Scientific method
- provenance management
- change notification
- personalisation
- Sharing experiments
- semantic services for discovering services and workflows, and managing metadata
- third party service registries and federated personalised views over those registries,
- ontologies and ontology management.
- Base services and tools that will constitute the experiments
- third party services such as databases, computational analyses, simulations.
- specialised services such as AMBIT text extraction.
7 Investigation / Study: set of experiments and metadata
- Experimental design components
- Workflow specs, queries, notes, data
- Experimental instances, records of enacted experiments
- Parameter settings, result data, workflow runs
- Experimental glue that groups and links design and instance components
- Life Science IDs (LSIDs)
- RDF
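To make the "glue" idea concrete, here is a minimal sketch of LSIDs naming a design component and an instance component, with RDF-style triples linking them. It relies on the standard LSID form `urn:lsid:authority:namespace:object[:revision]`; the authority `example.org`, the namespaces, the `ex:` predicates and the helper `parse_lsid` are all hypothetical, and plain tuples stand in for a real RDF store.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

def parse_lsid(urn: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = urn.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {urn!r}")
    return LSID(*parts[2:])

# Hypothetical identifiers: the authority and namespaces are made up.
workflow_spec = "urn:lsid:example.org:workflow:wbs-annotation:1"
workflow_run = "urn:lsid:example.org:run:42"

# RDF-style (subject, predicate, object) triples linking instance to design.
triples = [
    (workflow_run, "ex:enactmentOf", workflow_spec),
    (workflow_run, "ex:producedResult", "urn:lsid:example.org:data:blast-report-7"),
]

# Find every recorded run of a given workflow specification.
runs_of_spec = [s for (s, p, o) in triples if p == "ex:enactmentOf" and o == workflow_spec]
print(parse_lsid(workflow_spec), runs_of_spec)
```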
8 myGrid Service Stack Confusagram
- Applications: Work bench, Taverna workflow environment, Talisman application, Web Portal
- Gateway
- Personalisation, Service and Workflow Discovery, Registries, Provenance mgt, Event Notification, Ontology Mgt, Ontologies, Metadata Mgt
- Core services: myGrid Information Repository, FreeFluo Workflow enactment engine, OGSA Distributed Query Processor
- Web Service Grid communication fabric (OGSI)
- External services: AMBIT Text Extraction Service, Bio Services, Soaplab, SRS, EMBOSS
9 A work bench for demonstrating services
myView on the mIR
Workflow
Metadata about workflow
note about workflow
NetBeans
10 Notification service
- A new gene with changed expression in Graves Disease is added to the mIR
- User registers interest in notification topics
- Informs the user via a notification client in the workbench that new data has been added to the mIR.
- Notifications presented to the user with a client in the workbench environment.
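The pattern on this slide is topic-based publish/subscribe: a user registers interest in a topic, and a later change to the mIR is pushed to the registered clients. The sketch below illustrates that pattern only; it is not the myGrid notification service, and the class name, topic strings and message text are invented for the example.

```python
from collections import defaultdict

class NotificationService:
    """Toy in-process topic-based publish/subscribe broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to every subscriber of topic; return how many were notified."""
        for callback in self._subscribers[topic]:
            callback(topic, message)
        return len(self._subscribers[topic])

# The list stands in for the notification client in the workbench.
inbox = []
service = NotificationService()
service.subscribe("mIR/graves-disease/new-gene",
                  lambda topic, msg: inbox.append((topic, msg)))

delivered = service.publish("mIR/graves-disease/new-gene",
                            "new gene with changed expression added")
ignored = service.publish("mIR/other-topic", "nobody is listening")
print(delivered, ignored, inbox)
```

Decoupling the publisher (the repository) from the subscribers (workbench clients) is what lets new clients be added without changing the service that stores the data.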
11 Semantic discovery: services and workflows
- Services and workflows described using semantic web technologies and ontologies
- Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform
- DAML+OIL → OWL
- RDF-based UDDI registry
- Multiple 3rd party registries
- Multiple 3rd party metadata
A registry browser
A workflow wizard
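Selection by semantic type can be pictured as a subsumption check: a service is a candidate if the type of the data in hand is the same as, or a specialisation of, the input type the service declares. The sketch below uses a hand-written is-a table in place of a DAML+OIL or OWL reasoner; every concept, service name and task label in it is hypothetical.

```python
# Toy is-a hierarchy: child concept -> parent concept (all names hypothetical).
IS_A = {
    "AffymetrixProbeSetId": "GeneIdentifier",
    "GeneIdentifier": "Identifier",
    "NucleotideSequence": "Sequence",
    "ProteinSequence": "Sequence",
}

def subsumed_by(concept, ancestor):
    """True if concept equals ancestor or lies below it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

# Service descriptions: declared input type, output type and task.
SERVICES = [
    {"name": "probe_to_sequence", "input": "AffymetrixProbeSetId",
     "output": "NucleotideSequence", "task": "retrieval"},
    {"name": "id_lookup", "input": "GeneIdentifier",
     "output": "NucleotideSequence", "task": "retrieval"},
    {"name": "blast_search", "input": "Sequence",
     "output": "AlignmentReport", "task": "similarity_search"},
]

def find_services(data_type, task=None):
    """Services whose declared input type subsumes data_type, optionally filtered by task."""
    return [
        s["name"] for s in SERVICES
        if subsumed_by(data_type, s["input"]) and (task is None or s["task"] == task)
    ]

print(find_services("AffymetrixProbeSetId"))
print(find_services("ProteinSequence", task="similarity_search"))
```

A keyword search on service names would miss `id_lookup` for a probe set identifier; the is-a relation is what finds it.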
12 The mIR holds the experimental components
- We need to discover which workflows have been published that can operate on data of this specific semantic type (an Affymetrix probe set identifier)
- Some might be in mIR, some might be in global registry
- mIR holds all experimental components
- Multiple mIRs
- Built on RDBMS, OGSA-DAI
- Plans: Federated architecture, LSIDs and RDF
13 Create and run a workflow
- If an appropriate workflow does not exist, a new one can be created in the Taverna editor
- Workflow outputs stored in mIR
- Freefluo workflow enactment engine
- WSFL, Scufl
- Joint development with HGMP and EBI
http://www.mygrid.org.uk/myGrid/web/components/Workflow/
14 Provenance logging and reusing
- FreeFluo provides a detailed provenance record stored in the mIR describing what was done, with what services and when
- Can be viewed within the workbench
- XML document
- Every mIR object has (Dublin Core) provenance properties
Provenance is not just workflow
- Derivation paths: workflows, queries
- Annotations: notes
- Evolution paths: workflow → workflow
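A provenance record of "what was done, with what services and when" can be produced by having the enactor log each step as it runs. The following is a minimal sketch of that idea for a linear workflow; it is not FreeFluo's provenance format, and the `enact` function, record fields and stand-in steps are assumptions made for illustration.

```python
from datetime import datetime, timezone

def enact(workflow, data, log):
    """Run each (service_name, function) step in order, logging what ran, on what, and when."""
    for service_name, function in workflow:
        started = datetime.now(timezone.utc)
        result = function(data)
        log.append({
            "service": service_name,
            "input": data,
            "output": result,
            "started": started.isoformat(),
            "ended": datetime.now(timezone.utc).isoformat(),
        })
        data = result   # the output of one step feeds the next
    return data

# Stand-in steps; real services would be remote calls.
workflow = [
    ("to_upper", str.upper),
    ("reverse", lambda seq: seq[::-1]),
]
provenance = []
final = enact(workflow, "acgt", provenance)
print(final, [(r["service"], r["input"], r["output"]) for r in provenance])
```

Because each record carries both input and output, the derivation path of the final result can be walked backwards step by step.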
15 Legacy Bio Services publication
- Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually)
- Soaplab
- A SOAP-based programmatic interface to command-line applications
- 300 different classes of services
- Swiss-Prot, EMBOSS, Medline
- 3rd parties
- JEMBOSS, PathPort, bioMoby
16 An in silico experiment: a web of interconnected investigation holdings
People to notify of the workflow status
Provenance of the workflow template. Related
workflows.
Ontologies describing workflows
17 Semantic Glue
Workflows
Provenance record of workflow runs
Notes
People
Data holdings
Services
18 Status and Highlights
- Reflecting on what we have
- All the components have an implementation in various states of maturity and functionality, some of which are downloadable already: Freefluo, Taverna, Soaplab.
- Field evaluations with Graves Disease geneticists
- Expanding the user base
- Williams-Beuren Syndrome
- GSK
- ISMB 2003 demo
- S-MOBY proposal
19 myGrid PR and interaction with e-Science
- Stacks of talks!
- PharmaGrid
- AMIA 2003
- ODBASE 2003
- ISMB 2003
- Virtual Genome conference
- GlobusWorld
- IBM Almaden
- Japanese Bioinformatics etc
- Stacks of NeSC workshops
- Security
- Workflow
- Provenance
- Data
- Users
- Fundamentals etc
EPSRC progress reports and talks on myGrid Wiki
20 Progress since June 2003
- Software Release
- Consolidation, demo everything together: workbench.
- Concentrated on myGrid distinctiveness.
- provenance and personalisation. Collaboration?
- Continued Application perspective
- Graves Disease: Simon Pearce, Claire Jennings
- Williams-Beuren Syndrome: Hannah Tipney, May Tassabehji, Andy Brass
- Information model, type system, provenance model
- Sorting out the mIR
- Adoption of LSID: preliminary experiments
- Release preparation: documentation, testing, builds
- Publications and Outreach
21 Software release
- First release, November 2003
- Proposal Phase 2
- Currently fixing install-and-use issues
- Next release
- Intermediate Alpha for EPSRC Pilot Projects Meeting, March
- Beta for All Hands Meeting, September
- Proposal Phase 3
- Release for October
- Proposal Phase 4
22 Software release: Phase 2 release
- Contains support for most myGrid themes
- But DQP, AMBIT not fully integrated
- Considerable variations in component scope/completeness
- Hard to build and deploy
- Dependence on third-party components
- Use of proprietary products
- licensing, ease of adoption
- Inconvenient configuration methods
23 Software release: next steps
- Fix immediate install-and-use issues
- Establish conventions for
- Configuration at deployment time
- Limiting inter-module dependencies
24 Short and Medium Term Plans
- Each component has plans
- E.g. Simplification of ontology delivery
- More sophisticated model of provenance and other experimental data holdings, to store much more heavily linked metadata about provenance that will enable us to create views of the mIR along many axes.
- The Information model and myGrid Information Repository significantly revised.
- Review systematisation of type management
- Migration strategy to OGSA
25 Related Projects
- PASOA: provenance project by Luc.
- DynamO: dynamic ontologies for describing services.
- Link-Up: Sisters project with SDSC and ISI USC. Started Jan 2004.
26 Follow-on projects
- BBSRC projects - 4 proposals
- MRC projects - 3 proposals
- NSF proposal
- NIH proposal
- Best Practice proposals
- OMII ingestion
27 Links
- Global Grid Forum
- LSG RG, OGSA-DAIS WG, OGSI WG, SEM-GRD RG
- Other Grid projects
- CLEF, Integrative Biology
- SCECIT, SDSC Link-Up Project
- North Carolina BioGrid, PathPort
- Human Genome Mapping Project
- Taverna
- I3C
- BioSciences Service Registry, Life Sciences ID (LSID)
- BioMOBY
- BioMOBY registry and object typing.
- OMG
- LSID
- AKT IRC
- Semantic Web technologies
28 Staff changes
- Transfers
- Milena Radenkovic becomes CI, Notts
- Moving on
- Darren Marvin, of IT Innovation, Southampton
- Joining
- Stefan Egglestone from 1st January 2004, Notts
- Graduate Students
- Antoon Goderis: Formal languages for, and reasoning over, workflows
- Duncan Hull: Types in bioinformatics workflows
- Jun Zhao: Provenance
- Pinar Alper: Grid service registries and registration lifecycle
- Keith Flanagan: Working on applying myGrid to Graves Disease
- John Dickman: Suspended Jan 2003 (due to ill health)
- Martin Szomszor: Dr. Terry Payne's studentship.
29 Williams-Beuren Syndrome Microdeletion
FKBP6 FZD9 BAZ1B BCL7B TBL2 WBSCR14 STX1A
CLDN4 CLDN3 ELN LIMK1 LAB EIF4H RFC2 CYCLN2 GTF2IRD1
GTF2I NCF1 GTF2IRD2
CTA-315H11
gap
7q11.23
1.4 Mb
CTB-51J22
SVAS WBS
Chr 7 155 Mb
Physical contig
Patient deletions
30 WBS Workflows
Query nucleotide sequence
ncbiBlastWrapper
Pink: Outputs/inputs of a service
Purple: Tailor-made services
Green: Emboss soaplab services
Yellow: Manchester soaplab services
Grey: Unknowns
GenBank Accession No
URL inc GB identifier
Translation/sequence file. Good for records and
publications
prettyseq
GenBank Entry
Amino Acid translation
Sort for appropriate Sequences only
Identifies PEST seq
epestfind
6 ORFs
Seqret
Identifies FingerPRINTS
pscan
MW, length, charge, pI, etc
Nucleotide seq (Fasta)
pepstats
sixpack
ORFs
transeq
Predicts Coiled-coil regions
RepeatMasker
pepcoil
tblastn Vs nr, est, est_mouse, est_human
databases. Blastp Vs nr
GenScan
Coding sequence
ncbiBlastWrapper
Restriction enzyme map
restrict
SignalP TargetP PSORTII
Predicts cellular location
CpG Island locations and
cpgreport
InterPro PFAM Prosite Smart
Identifies functional and structural
domains/motifs
RepeatMasker
Repetitive elements
Hydrophobic regions
Pepwindow? Octanol?
Blastn Vs nr, est databases.
ncbiBlastWrapper
31 Results from WBS Workflows
The results gathered from several iterations of
the workflows significantly extended the
centromeric WBS Critical Region (WBSCR) contig,
primarily through identification of an
overlapping BAC (RP11-622P13), reducing the gap
by 121,004 bp. Of the six putative coding regions
predicted by the workflow, five correspond
exactly to the five known genes in this region
(see table). Of the 34 exons known to reside in
this region, 31 were correctly identified.
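As a quick worked check of the figures quoted above, the two ratios can be computed directly. The labels "precision" and "recall" are an interpretation added here, not terms used on the slide: five of six predicted regions matching known genes is read as gene-level precision, and 31 of 34 known exons found as exon-level recall.

```python
# Counts as stated on the slide.
predicted_regions, regions_matching_known_genes = 6, 5
known_exons, exons_identified = 34, 31

gene_level_precision = regions_matching_known_genes / predicted_regions
exon_level_recall = exons_identified / known_exons

print(f"{gene_level_precision:.1%} of predicted regions match known genes")
print(f"{exon_level_recall:.1%} of known exons identified")
```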
32 Changing Work Style
- It works: runs workflows, gathers and co-ordinates results
- Manually: takes two days including analysis
- Now takes 30 mins to produce results and half a day for analysis
- Manually: do analysis as you perform the experiment
- Workflow: do analysis at end of experiment
- Therefore need good result co-ordination for back-tracking
- We have an enthusiastic user
33 The Downside
- Reliability: can be flaky
- Tom and IT Innovation addressing reliability
- Services
- Need many, many more distributed services
- Need redundancy of services
- Licensing issues
- Access to third party Web Services: licensing agreement means cannot use outside licensed body
- Fiddly-bits
- Made explicit in workflows (done by human on Web)
34 Issues from Williams
- Core development that must be robust
- FreeFluo workflow enactment engine that generates new provenance model
- Taverna environment
- mIR for new information model (although don't use it)
- Metadata storage
- BioServices (and lots more of them)
- Service registry
- Event notification service
- Research development
- Semantic find service
- Text mining
- Provenance browsing
- Workbench? Is Taverna the workbench?
35 The Information Model
- Hypothesis, materials and methods, results, conclusions, acknowledgements, bibliography
- Who, what, where, why, when, (w)how? recorded by provenance records
- The traceability of knowledge as it evolves and as it is derived.
- A web of myGrid holdings
- input data, data results, intermediate data, parameter sets, workflow logs, workflow templates, people, organisations, personal notes, services etc.
- Discovering links between experimental holdings
- Information model document out for comment
- Deployment model draft proposed
36 Information model influences
- Our experience
- Development
- Scenarios
- CLRC Scientific Metadata Model
- Scholarly Ontology
- Dublin Core
- VCARD
37 Information model: status and plans
- Initial version out for review - now
- Review and revise (11 Feb)
- Impact analysis (end Feb)
- Map responsibilities
- Assess existing capabilities vs. model
- Development (end Aug)
- Revise components to reflect model
38 Provenance in Release 1.0
39 Provenance of knowledge
- Declarative semantic execution trail
Are_similar_to
Bacterial artificial chromosome
Pairwise alignments
as stated by
input
output
run_for
process start time, end time
by_service
urn: Hannah Tipney
lsid: BLASTN search
topic
Williams Beuren Syndrome
40 Haystack Provenance Browser
41 COHSE
Lymphocyte and neutrophil are subsumed by the
concept white blood cell
Generated link anchors
42 Past Phases of development
43 Future Phases of development
44 Technical Issues
- Finish pair-wise integrations vs Additional functionality
- Authorization Service
- A central resource to manage the identities, roles and privileges of the members of a myGrid user community within an organizational unit, organization or virtual organization.
- On roadmap but no action yet
- Workbench, Taverna, Haystack
- Need a way of demoing stuff.
- Get on with deploying the information model.
- The impact of LSID: ontology services, provenance browser, mIR.
- Supporting our users.
- OGSI version 1.5
45 Non-technical Issues
- Post-Sept04 follow-on projects.
- End of contracts.
- Out of travel money.
- Software development vs research split.
- Software vs Papers.
- Outreach vs delivery of stuff.
- OMII ingestion
- myGrid book