Title: GGF11 Semantic Grid Applications Workshop,


1
Exploring Williams-Beuren Syndrome
  • Professor Carole Goble
  • http://www.mygrid.org.uk

2
Acknowledgements
myGrid is an EPSRC funded UK eScience Program
Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net
3
Roadmap
  • myGrid in a nutshell
  • Gene characterisation in Williams-Beuren
    Syndrome.
  • Semantic Aspects
  • Information model
  • Service discovery
  • Data Management - LSID
  • Metadata management for provenance - RDF
  • Lessons learnt and opportunities

4
Experiment life cycle
Forming experiments
Personalisation
Discovering and reusing experiments and resources
Executing and monitoring experiments
Managing lifecycle, provenance and results of
experiments
Sharing services and experiments
5
In a nutshell
  • Bioinformatics toolkit
  • Open (Web) Services
  • myGrid components
  • External domain services
  • No control or influence over service providers
  • Open to third party metadata
  • Open extensible architecture
  • Assemble your own components
  • Designed to work together
  • Toolkit
  • Axis/Apache based
  • RDF and DAML+OIL/OWL
  • Jena, OilEd, Instance Store FaCT

Haystack Provenance Browser
Semantic Discovery
Pedro
View UDDI registry
Gateway CHEF Portal
Taverna WfDE
Freefluo WfEE
Event Notification
LSID
Info. Model
mIR
Soaplab Gowlab
6
Williams-Beuren Syndrome
  • Microdeletion of 1.55 Mbases on Chromosome 7
  • Hannah Tipney, May Tassabehji, Andy Brass, St
    Mary's Hospital, Manchester, UK
  • Characterise an unknown gene
  • Annotation pipelines and Gene expression analysis
    Services from USA, Japan, various sites in UK

7
Williams-Beuren Syndrome Microdeletion
C-cen
A-cen
B-cen
C-mid
B-mid
A-mid
B-tel
A-tel
C-tel
WBSCR1/E1f4H
WBSCR5/LAB
GTF2IRD1
WBSCR21
WBSCR18
WBSCR22
WBSCR14
POM121
GTF2IRD2
BCL7B
BAZ1B
NOLR1
GTF2I
FKBP6
CYLN2
CLDN4
CLDN3
STX1A
LIMK1
NCF1
RFC2
TBL2
FZD9
ELN
1.5 Mb
7q11.23
Patient deletions


WBS
SVAS
Chr 7 155 Mb
8
Filling a genomic gap
  • Two major steps
  • Extend into the gap: similarity searches,
    RepeatMasker, BLAST
  • Characterise the new sequence: NIX, InterPro,
    etc.
  • Numerous web-based services (e.g. BLAST,
    RepeatMasker)
  • Cutting and pasting between screens
  • Large number of steps
  • Frequently repeated: info now rapidly added to
    public databases
  • Don't always get results
  • Time consuming
  • Huge amount of interrelated data is produced,
    handled in lab book, and files saved to local hard
    drive
  • Mundane
  • Much knowledge remains undocumented
  • Bioinformatician does the analysis

9
Point, click, cut, paste
ID   MURA_BACSU STANDARD PRT 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA FIRMICUTES BACILLUS/CLOSTRIDIUM GROUP BACILLACEAE
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS CELL WALL TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA 46016 MW 02018C5C CRC32
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
10
WBS Workflows
Query nucleotide sequence
ncbiBlastWrapper
RepeatMasker
Colour key:
  • Pink: outputs/inputs of a service
  • Purple: tailor-made services
  • Green: EMBOSS Soaplab services
  • Yellow: Manchester Soaplab services
  • Grey: unknowns
GenBank Accession No
URL inc GB identifier
Translation/sequence file. Good for records and
publications
prettyseq
GenBank Entry
Amino Acid translation
Sort for appropriate Sequences only
Identifies PEST seq
epestfind
6 ORFs
Seqret
Identifies FingerPRINTS
pscan
MW, length, charge, pI, etc
Nucleotide seq (Fasta)
pepstats
sixpack
ORFs
transeq
Predicts Coiled-coil regions
RepeatMasker
pepcoil
tblastn Vs nr, est, est_mouse, est_human
databases. Blastp Vs nr
GenScan
Coding sequence
ncbiBlastWrapper
Restriction enzyme map
restrict
SignalP TargetP PSORTII
Predicts cellular location
CpG Island locations and
cpgreport
InterPro PFAM Prosite Smart
Identifies functional and structural
domains/motifs
RepeatMasker
Repetitive elements
Hydrophobic regions
Pepwindow? Octanol?
Blastn Vs nr, est databases.
ncbiBlastWrapper
11
Collections of Tasks
Building
Domain Tasks
Workflow
Service Providers
Enactment
Bioinformaticians
Storage
Scientists
Description
Service Discovery
Provenance
Data Management
Finding
Querying
Annotation providers
12
Registry
Bioinformaticians
Taverna WfDE
Querying/sharing/ federating/registering
Query Retrieve
Workflow Execution
Feta
Annotation/description
FreeFluo WfEE
invoking
Annotation providers
Store data/ knowledge
Interface Description
Pedro Annotation tool
mIR
Others
Service Providers
WSDL
Soaplab
Vocabulary
Haystack Provenance Browser
Ontology Store
Data descriptions
Scientists
13
High level architecture
Semantic Discovery Registration
Provenance and Data browser i.e. Haystack
Taverna Workbench
View Service
LSID Authority
UDDI
mIR
Freefluo Workflow Engine
Store Service
Web services, local tools User interaction etc.
Event Notification Service
14
WBS task
  • Wrap services as web services
  • Register them
  • Build a workflow using the services
  • Evolve the workflow
  • Run it over and over again in case data has
    changed
  • Record results and provenance
  • Inspect and compare results and provenance
  • Event notification, portal, 3rd party annotation
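
The task list above can be pictured as a small pipeline: wrapped services are registered by name, composed into an ordered workflow, and each invocation is logged so that runs can be compared later. The following is a minimal Python sketch under stated assumptions; the registry, the step names and the two toy services are hypothetical stand-ins, not myGrid or Taverna APIs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: service name -> callable wrapper around a service.
REGISTRY: Dict[str, Callable[[str], str]] = {}


def register(name: str, service: Callable[[str], str]) -> None:
    """Make a wrapped service discoverable under a name."""
    REGISTRY[name] = service


def run_workflow(steps: List[str], data: str) -> Tuple[str, List[dict]]:
    """Run the named services in order and log every invocation."""
    log = []
    for name in steps:
        result = REGISTRY[name](data)
        log.append({"service": name, "input": data, "output": result})
        data = result
    return data, log


# Toy stand-ins for remote services (assumptions for illustration only).
register("mask_lowercase",
         lambda seq: "".join("N" if c.islower() else c for c in seq))
register("reverse", lambda seq: seq[::-1])

result, provenance = run_workflow(["mask_lowercase", "reverse"], "ACgtTA")
```

Re-running `run_workflow` with the same step list on fresh input gives a new log that can be compared with the earlier one entry by entry.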

16
User Results
  • Benchmark: two iterations of workflows (1 day
    run)
  • Reduced gap by 267 693 bp at its centromeric end
  • Correctly located all seven known genes in this
    region
  • Identified 33 of the 36 known exons residing in
    this location
  • Manually takes two days including analysis
  • Now takes 30 mins to produce results and half a
    day for analysis.
  • Less boring. Less prone to mistakes.
  • Once notification is installed, won't even have to
    initiate it.

17
Where is the semantics?
18
Information Model v2
  • Scientific data and the life-science identifier
  • Types
  • Identifier Types
  • Values and Documents
  • Provenance information
  • Annotation and Argumentation
  • Resources and Identifiers
  • People, teams and organizations
  • Representing the e-science process
  • Experimental methods for e-science

XML messages between services conform to the IMv2
19
Semantic discovery
  • The User does the choosing of services
  • A common ontology is used to annotate and query
    any myGrid object including services.
  • Ontology is built using DAML+OIL and reasoning
  • Deployed as a static RDF graph
  • Discover workflows and services described in the
    registry via Taverna.
  • Look for all workflows that accept an input of
    semantic type nucleotide sequence.
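
The query in the last bullet can be sketched as a lookup over annotated registry entries. The sketch below is illustrative only: the tiny is-a hierarchy, the workflow names and their annotations are invented, and it assumes that a workflow "accepts" a type when its declared input type is the same as, or more general than, the type asked for.

```python
from typing import Dict, List, Optional

# Hypothetical is-a hierarchy: term -> parent term (None for the root).
ONTOLOGY: Dict[str, Optional[str]] = {
    "sequence": None,
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "dna_sequence": "nucleotide_sequence",
}

# Hypothetical registry annotations: workflow name -> declared input type.
WORKFLOWS: Dict[str, str] = {
    "blast_annotation": "nucleotide_sequence",
    "generic_formatter": "sequence",
    "domain_scan": "protein_sequence",
}


def is_a(term: str, ancestor: str) -> bool:
    """True if term equals ancestor or is one of its descendants."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY.get(current)
    return False


def find_workflows(input_type: str) -> List[str]:
    """Workflows whose declared input type subsumes input_type."""
    return sorted(name for name, accepted in WORKFLOWS.items()
                  if is_a(input_type, accepted))


matches = find_workflows("nucleotide_sequence")
```

Here the subsumption check is a simple walk up a static hierarchy, which matches the idea of reasoning ahead of time and deploying the result as a static graph.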

20
Role of Ontologies
Service matching and provisioning
Composing and validating workflows and service
compositions and negotiations
Service and resource registration and discovery
Help
Knowledge-based guidance and recommendation
Schema mediation
21
Observations
22
Services
http://pedro.man.ac.uk
  • Practically all the services are remote and third
    party
  • Services are changeable and unreliable
  • Redundant services are essential
  • WSDL in the wild is poor
  • Automated annotation

23
Can you guess what it is yet?
24
Model of services
operation: name, description, input, output, task,
method, resource, application
service: name, description, author, organisation
parameter: name, description, semantic type, format,
transport type, collection type, collection format
workflow
WSDL operation
WSDL service
Soaplab service
bioMoby service
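
Reading the attribute lists on this slide as three record types (operation, service, parameter), the model can be sketched as plain data classes. This is an illustrative reading, not the myGrid schema; the field names follow the slide, while the nesting of parameters inside operations and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    name: str
    description: str = ""
    semantic_type: str = ""
    format: str = ""
    transport_type: str = ""
    collection_type: str = ""
    collection_format: str = ""


@dataclass
class Operation:
    name: str
    description: str = ""
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)
    task: str = ""
    method: str = ""
    resource: str = ""
    application: str = ""


@dataclass
class Service:
    name: str
    description: str = ""
    author: str = ""
    organisation: str = ""
    operations: List[Operation] = field(default_factory=list)


# Illustrative instance; every value here is made up for the example.
blast = Service(
    name="ncbiBlastWrapper",
    operations=[Operation(
        name="search",
        task="similarity search",
        inputs=[Parameter("query", semantic_type="nucleotide_sequence")],
        outputs=[Parameter("report", semantic_type="blast_report")],
    )],
)
```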
25
SHIM Services
  • Services that enable domain services to fit
    together
  • Outnumber domain services
  • Libraries
  • Candidates for automatic selection, composition
    and substitution

Main Bioinformatics Applications
Main Bioinformatics Services
Main Bioinformatics Application
Main Bioinformatics Application
SHIM Services
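
As a rough illustration of a shim, consider two domain services that disagree on sequence format: one emits FASTA, the other expects a bare sequence string. The two converters below are hypothetical examples of the kind of glue meant here, not services from the myGrid library.

```python
def fasta_to_raw(fasta: str) -> str:
    """Drop FASTA header lines and whitespace, returning the bare sequence."""
    lines = [line.strip() for line in fasta.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def raw_to_fasta(sequence: str, identifier: str, width: int = 60) -> str:
    """Wrap a bare sequence as a single FASTA record with fixed line width."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{identifier}"] + chunks)


raw = fasta_to_raw(">query\nACGT\nTTAA\n")
record = raw_to_fasta(raw, "query", width=4)
```

Because such converters are small and type-directed, they are natural candidates for the automatic selection and substitution mentioned above.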
26
Results management
  • Automated workflows produce lots of heterogeneous
    data
  • These are just some of the results from one
    workflow run for Williams Disease

27
Amplification
One input
Many outputs
28
Dealing with results
  • FreeFluo is agnostic about the data flowing
    through it.
  • Taverna includes a DataThing class, which can be
    tagged with terms from ontologies, free text
    descriptions and MIME types, and which may
    contain arbitrary collection structures.
  • Using the metadata hints we can locate and launch
    pluggable view components.
  • Hybrid typing scheme allows for a best effort
    approach to data typing.
  • Life science types are intractable for reasonable
    effort or completeness.
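
The idea of using metadata hints to pick a viewer can be sketched as follows. `TaggedData` is a hypothetical stand-in written for this example and does not reproduce Taverna's DataThing API; the viewer table and the fallback rule are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TaggedData:
    """A value plus optional hints; any of the hints may be missing."""
    value: Any
    mime_types: List[str] = field(default_factory=list)
    ontology_terms: List[str] = field(default_factory=list)
    description: str = ""


# Hypothetical pluggable viewers, keyed by MIME type.
VIEWERS: Dict[str, Callable[[Any], str]] = {
    "text/plain": lambda v: str(v),
    "text/html": lambda v: f"[html viewer] {v}",
}


def render(item: TaggedData) -> str:
    """Best effort: first MIME hint with a viewer wins, else fall back to repr."""
    for mime in item.mime_types:
        viewer = VIEWERS.get(mime)
        if viewer is not None:
            return viewer(item.value)
    return repr(item.value)


shown = render(TaggedData("ACGT", mime_types=["text/plain"]))
fallback = render(TaggedData([1, 2], mime_types=["application/x-unknown"]))
```

The fallback branch is what makes the scheme permissive: untyped or unknown data is still displayable, just less helpfully.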

29
  • Implicit iteration framework handles type
    mismatches where cardinality changes are required
  • Permissive type scheme, guides rather than
    enforces
  • Graphical view supplemented by tree explorer
    style view
  • High level language wraps low level operations
    into sensible conceptual units.
  • Configurable fault handling.
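
Implicit iteration can be illustrated with a wrapper that applies a single-item service to each element when it is handed a collection. This is a simplified sketch of the general idea, assuming Python lists as the only collection type; it is not the Taverna implementation.

```python
from typing import Any, Callable


def invoke_with_iteration(service: Callable[[Any], Any], value: Any) -> Any:
    """Call service on a single item, or map it over a (nested) list."""
    if isinstance(value, list):
        return [invoke_with_iteration(service, item) for item in value]
    return service(value)


def sequence_length(seq: str) -> int:
    """Toy single-item service used for the example."""
    return len(seq)


single = invoke_with_iteration(sequence_length, "ACGT")
many = invoke_with_iteration(sequence_length, ["ACGT", "AC", ["A"]])
```

The output keeps the shape of the input collection, so a downstream step sees one result per input item.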

30
Intermediate Results
31
Intermediate Results
  • Workflows change the way the bioinformatician
    works
  • Before: analyse results as you go along
  • After: all results in one go
  • So linking intermediate results is important

32
Life Science IDs
http://www.i3c.org/wgr/ta/resources/lsid/docs/
  • LSID provides a uniform naming scheme.
  • LSID Resolver guarantees to resolve to same data
    object.
  • LSID Authority dishes them out.
  • Also returns metadata of object.
  • Used throughout myGrid as an object naming
    device.
  • myGrid Repository acts as an LSID Authority
  • LSID allows universal access to results for
    collaboration, as well as for review.
  • RDF+LSID explains the context of results, and
    provides guidance for further investigations.

I3C / IBM / EBI proposal for a Life Science
Identifier
Pioneered by myGrid
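
An LSID is a URN of the form urn:lsid:authority:namespace:object, optionally followed by a revision. A minimal parser sketch is shown below; the example identifier is the one that appears later in these slides with its colons restored, which is an assumption about the original text.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its components, rejecting anything else."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:taverna.sf.net:datathing:45fg6")
```

The authority component is what a resolver would use to find the service that hands back the data and metadata for the object.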
33
Process Provenance
34
Link v Data Representation
  • Data management questions refer to relationships
    rather than internal content
  • What are the origins of this data?
  • Which service produced this data?
  • Which data is this derived from?
  • Who was this data produced for?
  • What is this data telling me?
  • Data analysis questions delegated to external
    services.

35
Representing links
urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
  • Identify each resource
  • Life science identifier URI with associated data
    and metadata retrieval protocols.
  • Understanding that underlying data will not change

36
Representing links II
http://www.mygrid.org.uk/ontology#derived_from
urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
  • Identify link type
  • Again use URI
  • Allows us to use RDF infrastructure
  • Repositories
  • Ontologies
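
With both the resources and the link type named by URIs, a link is just an RDF triple. The sketch below uses plain tuples and writes N-Triples by hand to stay dependency-free. Several things are assumptions: the colons in the LSIDs, the "#" separator in the link-type URI, and the direction of the link (the slide does not say which item is derived from which).

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)

# Link type URI; the '#' separator is assumed.
DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"

links: List[Triple] = [
    ("urn:lsid:taverna.sf.net:datathing:45fg6",
     DERIVED_FROM,
     "urn:lsid:taverna.sf.net:datathing:23ty3"),
]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def sources_of(resource: str, triples: List[Triple]) -> List[str]:
    """Answer 'which data is this derived from?' by following links."""
    return [o for s, p, o in triples if s == resource and p == DERIVED_FROM]


text = to_ntriples(links)
origins = sources_of("urn:lsid:taverna.sf.net:datathing:45fg6", links)
```

Once links are in this form, any RDF repository or ontology tool can store and query them without knowing what the data items contain.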

37
Organisation level provenance
Process level provenance
Service
Project
runBy, e.g. BLAST @ NCBI
Experiment design
Process
Workflow design
componentProcess, e.g. web service invocation of
BLAST @ NCBI
Event
partOf
instanceOf
componentEvent, e.g. completion of a web service
invocation at 12.04pm
Workflow run
Data/ knowledge level provenance
knowledge statements, e.g. similar protein
sequence to
run for
User can add templates to each workflow process
to determine links between data items.
Data item
Person
Organisation
Data item
Data item
data derivation, e.g. output data derived from
input data
38
Provenance tracking
  • Automated generation of this web of links
    preferable
  • Workflow enactor generates
  • LSIDs
  • Data derivation links
  • Knowledge links
  • Process links
  • Organisation links

Relationship BLAST report has with other items in
the repository
Other classes of information related to BLAST
report
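
Automated link generation can be pictured as an enactor that, for every service call, mints an identifier for the output and records how it relates to the input and to the service. The following is a minimal sketch with invented names; the `example.org` authority, the short predicate names and the in-memory store are assumptions for illustration.

```python
import itertools
from typing import Callable, Dict, List, Tuple

_counter = itertools.count(1)


def new_lsid(namespace: str = "datathing") -> str:
    """Mint a fresh identifier under a hypothetical authority."""
    return f"urn:lsid:example.org:{namespace}:{next(_counter)}"


def enact(service_name: str,
          service: Callable[[str], str],
          input_id: str,
          input_value: str,
          store: Dict[str, str],
          links: List[Tuple[str, str, str]]) -> str:
    """Invoke a service, store its output, and record provenance links."""
    output_value = service(input_value)
    output_id = new_lsid()
    store[output_id] = output_value
    links.append((output_id, "derived_from", input_id))    # data derivation
    links.append((output_id, "produced_by", service_name))  # process link
    return output_id


store: Dict[str, str] = {}
links: List[Tuple[str, str, str]] = []
input_id = new_lsid()
store[input_id] = "ACGT"
output_id = enact("reverse", lambda s: s[::-1], input_id, "ACGT", store, links)
```

The user never writes a link by hand; the web of provenance grows as a side effect of running the workflow.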
39
Storage
  • LSID has no protocol for storage
  • Taverna/ Freefluo implements its own data/
    metadata storage protocol

Publish interface
Taverna/ Freefluo
Metadata Store
data
Data store
metadata
40
Retrieval
  • LSID protocol used to retrieve data and metadata
  • Query handled separately

LSID aware client
RDF aware client
LSID interface
Query
Publish interface
Metadata Store
Taverna/ Freefluo
Metadata Store
data
Data store
Data store
metadata
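
The split described on the Storage and Retrieval slides (a publish interface that writes data and metadata, retrieval by LSID, and a separate query path over metadata) can be sketched with an in-memory stand-in. The class and method names are hypothetical and do not correspond to the LSID protocol or to the Taverna/Freefluo storage protocol.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (predicate, object)


class Repository:
    """In-memory stand-in for separate data and metadata stores."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._metadata: Dict[str, List[Pair]] = {}

    def publish(self, lsid: str, data: str, metadata: List[Pair]) -> None:
        """Publish interface: write once, since LSID data must not change."""
        if lsid in self._data:
            raise ValueError(f"{lsid} is already published")
        self._data[lsid] = data
        self._metadata[lsid] = list(metadata)

    def get_data(self, lsid: str) -> str:
        """Retrieval by identifier: the data object."""
        return self._data[lsid]

    def get_metadata(self, lsid: str) -> List[Pair]:
        """Retrieval by identifier: the metadata about the object."""
        return list(self._metadata[lsid])

    def query(self, predicate: str, obj: str) -> List[str]:
        """Separate query path: find identifiers by a metadata statement."""
        return sorted(lsid for lsid, pairs in self._metadata.items()
                      if (predicate, obj) in pairs)


repo = Repository()
repo.publish("urn:lsid:example.org:datathing:1", "ACGT",
             [("type", "nucleotide_sequence")])
```

Keeping `query` apart from the two `get_` methods mirrors the slide: an LSID-aware client only needs the identifier, while an RDF-aware client asks questions of the metadata store.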
41
IBM's BioHaystack
GenBank record
Portion of the Web of provenance
Managing collection of sequences for review
42
Observations
  • Managed the transition from generic middleware
    development to practical day to day useful
    services
  • Real users (plural) fundamental to that
  • End to end support for an entire scenario
  • Bury the semantics
  • Show stoppers for practical adoption are not
    technical showstoppers
  • Can I incorporate my favourite service?
  • Can I manage the results?
  • By tapping into (de facto) standards and
    communities we can leverage others' results and
    tools: LSID, Haystack, Pedro.

43
Acknowledgements
myGrid is an EPSRC funded UK eScience Program
Pilot Project
Particular thanks to the other members of the
Taverna project, http://taverna.sf.net
44
myGrid People
  • Core
  • Matthew Addis, Nedim Alpdemir, Tim Carver, Rich
    Cawley, Neil Davis, Alvaro Fernandes, Justin
    Ferris, Robert Gaizauskas, Kevin Glover, Carole
    Goble, Chris Greenhalgh, Mark Greenwood, Yikun
    Guo, Ananth Krishna, Peter Li, Phillip Lord,
    Darren Marvin, Simon Miles, Luc Moreau, Arijit
    Mukherjee, Tom Oinn, Juri Papay, Savas
    Parastatidis, Norman Paton, Terry Payne, Matthew
    Pocock, Milena Radenkovic, Stefan
    Rennick-Egglestone, Peter Rice, Martin Senger,
    Nick Sharman, Robert Stevens, Victor Tan, Anil
    Wipat, Paul Watson and Chris Wroe.
  • Users
  • Simon Pearce and Claire Jennings, Institute of
    Human Genetics School of Clinical Medical
    Sciences, University of Newcastle, UK
  • Hannah Tipney, May Tassabehji, Andy Brass, St
    Mary's Hospital, Manchester, UK
  • Postgraduates
  • Martin Szomszor, Duncan Hull, Jun Zhao, Pinar
    Alper, John Dickman, Keith Flanagan, Antoon
    Goderis, Tracy Craddock, Alastair Hampshire
  • Industrial
  • Dennis Quan, Sean Martin, Michael Niemi, Syd
    Chapman (IBM)
  • Robin McEntire (GSK)
  • Collaborators
  • Keith Decker

45
http://www.mygrid.org.uk
46
Semantic Futures
  • Information Management
  • More on Results management
  • Complete Deployment of Information Model
  • Using provenance and event notification e.g.
    for impact analysis
  • Access
  • CHEF-based portal for finding workflows,
    launching and monitoring workflows, launching
    Taverna, browsing results
  • Redeveloping the view registry to be more
    efficient.
  • Deploying publicly accessible semantic registry
  • Workflow enactment
  • Reinstate service discovery during enactment
  • Authorisation and Authentication

47
Summary
  • myGrid offers service based middleware components
  • Open source and freely downloadable
  • Open Grid Service Architecture-compliant
  • Allows the scientist to be at the centre of the
    Grid -- Personalisation
  • Generic middleware that suits the creation of
    bioinformatics applications
  • Inclusion of rich semantics to facilitate the
    scientific process
  • Available from http://www.mygrid.org.uk