Title: Bioinformatics and Grids
1Bioinformatics and Grids
- Professor Carole Goble,
- University of Manchester, UK
- carole_at_cs.man.ac.uk
- Director of myGrid e-Science project
- Co-director ESNW e-Science Regional Centre
2Roadmap
- Post Genome biology
- Challenges for bioinformatics
- Why biology isn't physics
- Information-centric Grids
- An example: myGrid
- Other projects
- Take home
3Take home
- Complexity and Diversity - Size isn't everything.
- Computation is important but information and
knowledge services dominate.
- Integration, curation, annotation, fusion
- Automating support for integration and fusion
means moving from
- human interaction to machine interaction.
- machine readable to machine understandable.
- Metadata using ontologies for finding, managing
and controlling services and content
4Functional Genomics
- An integrated view of how organisms work and
interact in growth, development and pathogenesis
- From single gene to whole genome
- From single biochemical reactions to whole
physiological and developmental systems
- What do genes do?
- How do they interact?
5Genotype to Phenotype
DNA chips
Modelling
Expression
Folding
population
protein sequence
- Proteomics
- Domain analysis
- SNP
- Gene prediction
- HTP Sequencing
- Link the observable behaviour of an organism with
its genotype
6Drug Discovery
7Pharmacogenomics Knowledge/Information Flow
Hypotheses
Design
Integration
Clinical Resources
Individualised Medicine
Data Mining
Case-Based Reasoning
Information Fusion
8Use Cases (I3C)
- Show me all the genes in the glucose metabolism
pathway and get their GenBank accession numbers
- Find all the citations for the HOX gene family
for human and mouse
- Find all the kinase genes from Wormbase and
retrieve the DNA sequence
9Use Cases
- Show me nucleotide binding proteins in mouse
- Answer
- P12345 in Swiss-Prot is an ATPase
- Terri Attwood is an expert on this
- Jackson labs have a database but you need to
register
- A paper has just been published in Proteins by
the Stanford lab on this.
10Which compounds interact with (alpha-adrenergic
receptors) ((over expressed in (bladder
epithelial cells)) but not (smooth muscle
tissue)) of ((patients with urinary flow
dysfunction) and a sensitivity to the
(quinazoline family of compounds))?
Enzyme database
SNPs database
Tissue database
Drug formulary
High throughput screening
Receptor database
Clinical trials database
Chemical database
Expression database
11Large amounts of data
http://www3.ebi.ac.uk/Services/DBStats/
- EMBL July 2001
- 150 Gbytes
- Microarray
- 1 Petabyte per annum
- Sanger Centre
- 20 terabytes of data
- Genome sequences increase 4x per annum
12High throughput experimental methods
- Micro arrays for gene expression
- Robot-based capture
- 10K data points per chip
- 20 x per chip
- Cottage industry to industrial scale
13Heterogeneity
- Data types and forms
- Community
- Autonomy
- Over 500 different databases
- Different formats, structure, schemas, coverage
- Web interfaces, flat file distribution,
14Heterogeneity
Proteome
15Heterogeneity
Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors),
Proteome
16Heterogeneous Data
- Multimedia
- Images and Video
- Text annotations and literature
- Descriptive as well as numeric
- Knowledge-based
Text Extraction
17Swiss-Prot
SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAH
VDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTGI
TSFQWENTKVNIIDTPGHMDFLAEVYRSLSVLDGAILLISAKDGVQAQTR
ILFHALRKMG IPTIFFINKIDQNGIDLSTVYQDIKEKLSAEIVIKQKVE
LYPNVCVTNFTESEQWDTVIE GNDDLLEKYMSGKSLEALELEQEESIRF
QNCSLFPLYHGSAKSNIGIDNLIEVITNKFYS STHRGPSELCGNVFKIE
YTKKRQRLAYIRLYSGVLHLRDSVRVSEKEKIKVTEMYTSING ELCKID
RAYSGEIVILQNEFLKLNSVLGDTKLLPQRKKIENPHPLLQTTVEPSKPE
QREM LLDALLEISDSDPLLRYYVDSTTHEIILSFLGKVQMEVISALLQE
KYHVEIEITEPTVIY MERPLKNAEYTIHIEVPPNPFWASIGLSVSPLPL
GSGMQYESSVSLGYLNQSFQNAVMEG IRYGCEQGLYGWNVTDCKICFKY
GLYYSPVSTPADFRMLAPIVLEQVLKKAGTELLEPYL SFKIYAPQEYLS
RAYNDAPKYCANIVDTQLKNNEVILSGEIPARCIQEYRSDLTFFTNGR S
VCLTELKGYHVTTGEPVCQPRRPNSRIDKVRYMFNKIT
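A Swiss-Prot entry is a flat text record in which each line starts with a two-letter line code (ID, AC, DE, KW, SQ and so on). The sketch below shows one way a program might pull fields out of such a record. It is a minimal illustration, not an official parser: the function names `parse_entry` and `keywords` are made up for this example, and the sample is an abbreviated copy of the TET9_ENTFA entry with the sequence truncated.

```python
from collections import defaultdict

# Abbreviated sample in Swiss-Prot flat-file layout. Values come from the
# TET9_ENTFA entry on this slide; the sequence is truncated for brevity.
SAMPLE = """\
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
     MKIINIGVLA HVDAGKTTLT
//
"""


def parse_entry(text):
    """Group lines by their two-letter code and collect sequence residues.

    Hypothetical helper for illustration only.
    """
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, value = line[:2], line[5:].strip()
        if code == "  ":  # sequence lines carry no line code
            residues.append(value.replace(" ", ""))
        else:
            fields[code].append(value)
    # Join continuation lines (e.g. two KW lines) into one string per code.
    entry = {code: " ".join(values) for code, values in fields.items()}
    entry["sequence"] = "".join(residues)
    return entry


def keywords(entry):
    """Split the KW field on semicolons and drop the trailing full stop."""
    raw = entry.get("KW", "").rstrip(".")
    return [k.strip() for k in raw.split(";") if k.strip()]


if __name__ == "__main__":
    e = parse_entry(SAMPLE)
    print(e["ID"].split()[0], e["AC"], keywords(e), e["sequence"])
```

Note how much of the record (DE, CC, KW) is free text written for human readers: the parser can split it, but it cannot tell what it means.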
18Heterogeneity
- Lymphocyte associated receptor of death
- LARD
- WSL-LR WSL-S1 WSL-S2 proteins
- WSL-1 protein precursor
- Apoptosis-mediating receptor DR3
- Apoptosis-mediating receptor TRAMP
- Death Domain receptor 3
- WSL protein
- apoptosis inducing receptor AIR
- APO-3
19
- Data resources have been built introspectively
for human researchers
- Information is machine readable not machine
understandable
- CONTROLLED VOCABULARIES and ONTOLOGIES
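As a rough illustration of what a controlled vocabulary buys, the sketch below maps the LARD / DR3 synonyms from the Heterogeneity slide onto a single preferred term. The table layout, the function names and the choice of "DR3" as the preferred label are all assumptions made for this example; a real ontology would also carry definitions and relationships between terms.

```python
# Hypothetical controlled vocabulary: one preferred term, many synonyms.
# "DR3" is chosen arbitrarily as the preferred label for this sketch.
SYNONYMS = {
    "DR3": [
        "Lymphocyte associated receptor of death",
        "LARD",
        "WSL-LR WSL-S1 WSL-S2 proteins",
        "WSL-1 protein precursor",
        "Apoptosis-mediating receptor DR3",
        "Apoptosis-mediating receptor TRAMP",
        "Death Domain receptor 3",
        "WSL protein",
        "apoptosis inducing receptor AIR",
        "APO-3",
    ],
}


def normalise(name):
    """Lower-case and collapse whitespace so lookups ignore formatting."""
    return " ".join(name.lower().split())


def build_index(vocab):
    """Invert the vocabulary: normalised name -> preferred term."""
    index = {}
    for preferred, names in vocab.items():
        for name in [preferred, *names]:
            index[normalise(name)] = preferred
    return index


def canonical(name, index):
    """Return the preferred term for a name, or None if it is unknown."""
    return index.get(normalise(name))


INDEX = build_index(SYNONYMS)

if __name__ == "__main__":
    for query in ["lard", "Death  Domain receptor 3", "APO-3", "BRCA1"]:
        print(query, "->", canonical(query, INDEX))
```

With a shared index like this, two databases that use different names for the same protein can be joined by machine rather than by a human reading both entries.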
20Shared data -> shared meaning
Service provider
Service provider
Service provider
Service provider
Service provider
21Complexity
- Multiple views
- Interrelated
- Intra and inter cell interactions and
bio-processes
"Courtesy U.S. Department of Energy Genomes to
Life program (proposed) DOEGenomesToLife.org."
22Instability and Quality
- Exploring the unknown
- At least 5 definitions of a gene
- The sequence is a model
- Other models are work in progress
- Names unstable
- Data unstable
- Models unstable
- The problem in the field is not a lack of good
integrating software, Smith says. The packages
usually end up leading back to public databases.
"The problem is the databases are God-awful," he
told BioMedNet. If the data is still
fundamentally flawed, then better algorithms add
little.
- Temple Smith, director of the Molecular
Engineering Research Center at Boston University
23Curation
SWISS-PROT
MEDLINE papers
annotation
PRINTS
BLOCKS
24Infrastructure Integration
- Technologies
- CORBA and the OMA
- Java and JavaBeans
- Data mining
- Algorithm development
- Knowledge discovery
- Knowledge representation
- Visualisation
- Query tools and services
- Database replication
- OO technology
- OO databases
- Networks and security
- Data cleaning and validation
Structural Genomics
SNPs
Sequence Data
Expression Data
Gene Analysis
Mutation/Variation Pattern Discovery Gene
Prediction Splice Sites Promoters EST
Differential Temporal In situ
Functional Genomics
Gene Identification
Gene Networks
Proteomics
Gene Annotation and Function Regulation of
Metabolism Biochemical Pathways Signal
Transduction
Metabolomics
CORBA / Java / BSA / SRS
25Bioinformatics Analysis
- Different algorithms
- BLAST, FASTA, pSW
- Different implementations
- WU-BLAST, NCBI-BLAST
- Different service providers
- NCBI, EBI, DDBJ
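One way to picture the three axes of choice above (algorithm, implementation, provider) is as a small registry of service descriptions that can be filtered on any axis. This is a sketch only: the `ServiceDescription` class and `find` function are hypothetical, and the pairings of implementations with providers are illustrative rather than a statement of what each site actually hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical description of one analysis service."""
    algorithm: str
    implementation: str
    provider: str


# Illustrative entries only; the pairings are assumptions for this sketch.
REGISTRY = [
    ServiceDescription("BLAST", "NCBI-BLAST", "NCBI"),
    ServiceDescription("BLAST", "WU-BLAST", "EBI"),
    ServiceDescription("FASTA", "FASTA", "EBI"),
    ServiceDescription("FASTA", "FASTA", "DDBJ"),
]


def find(registry, **criteria):
    """Return every service whose fields match all the given criteria."""
    return [
        service
        for service in registry
        if all(getattr(service, key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    for service in find(REGISTRY, algorithm="BLAST"):
        print(service)
```

The same "sequence similarity search" can resolve to several concrete services, which is why an experiment needs to record which one was actually used.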
26In silico experimentation
27In silico experimentation
myProteins
BLAST
Swiss-Prot
BLAST
PIR
Go-Blast visualisation
BLAST
28In silico experimentation
myProteins
BLAST
Interpro
Swiss-Prot
BLAST
PIR
Go-Blast visualisation
BLAST
29In silico experimentation
30In silico experimentation
31In silico experimentation
32In silico experimentation
- Discovery, interoperation, fusion, sharing of
data, knowledge and workflows
- Explicit management of workflows
- information processes and best practice
- Improving quality of experiments and data
- provenance and propagating change
- Scientific discovery is personal and global
- personalisation and collaborative working
- Security, ownership -> valuable assets
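To make "explicit management of workflows" and "provenance" concrete, the sketch below runs a list of named steps in order and logs what went into and came out of each one. It is a minimal illustration under stated assumptions, not myGrid's design: `run_workflow`, `fingerprint` and the two stand-in steps are invented for this example, and `mock_search` merely fabricates a placeholder hit instead of calling a real BLAST service.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(value):
    """Short, stable digest of a value, used to link steps in the log."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def run_workflow(steps, data):
    """Run (name, function) steps in order and record provenance.

    Returns the final result and one provenance record per step.
    """
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
            "at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance


# Stand-in steps (hypothetical): a real workflow would call remote services.
def mock_search(proteins):
    return [(p, "hit_for_" + p) for p in proteins]


def keep_hits(pairs):
    return [hit for _, hit in pairs]


if __name__ == "__main__":
    result, log = run_workflow(
        [("search", mock_search), ("filter", keep_hits)], ["P12345"]
    )
    print(result)
    for record in log:
        print(record)
```

Because each record's output digest matches the next record's input digest, a later change to an upstream result can be traced to everything derived from it.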
33myGrid
- Personalised
- extensible environments for
- data-intensive
- in silico experiments in biology
- http://www.mygrid.org.uk
34myGrid
- UK e-Science Grid programme pilot (EPSRC)
- Generic middleware
- Bioinformatics and Genomics setting
- 1st October 2001 -- 31st March 2005
- (36 months funded within a 42-month execution period)
- 16 full-time researchers/developers
35myGrid Partners
36A Desiderata (cf. Grid)
Applications
- Software development toolkits
- Standard protocols, services and APIs
- A modular bag of technologies
- Enable incremental development of grid-enabled
tools and applications
- Reference implementations
- Learn through deployment and applications
- Open source
Diverse global services
Core services
Local OS
37Approach
myGrid Stack
Personalisation
Metadata
Agent-based Interoperation layer
I.E
38myGrid Outcomes
- e-Scientists
- Environment built on toolkits for service access,
personalisation and community
- Gene function expression analysis using S.
cerevisiae
- Annotation workbench for the PRINTS pattern
database
- Developers
- Protocols and service descriptions
- myGrid-in-a-Box developers kit
- Re-purposing DAS, AppLab and OpenBSA
- Integrating ISYS and GlaxoSmithKline platforms
39myGrid tech outcomes
- Services, service descriptions (ontologies),
message protocols and APIs
- Database access from the Grid
- Process enactment on the Grid
- Personalisation services
- Provenance services
- Metadata services: DAML+OIL, RDF(S)
- Laying the foundations for Agent Services
40Converging technologies
Grid Computing
Globus, Sun Grid Engine, Condor, DS (Jini, CORBA)
Web Service and Semantic Web Technologies
Agents
SOAP, WSDL, UDDI, WSIL, DAML+OIL, OWL, RDF(S),
WSFL
ACL, methodology
41 myGrid Services
42Standards and Activities
Open Source: Open Bio Foundation, BioJava, BioPerl
Consortium Expertise: View propagation,
reasoning, workflow
(De facto) Standards: OMG LSR, I3C, MGED, Gene
Ontology
Semantic Web: RDF, RDFS, DAML+OIL
Bioinformatics integration platforms: DAS,
OpenBSA, ISYS, OpenMMS, Kleisli, Ensembl, AppLab,
SRS, BioNavigator, DiscoveryLink, K1, TAMBIS,
MOBY
Distributed Computing Environments: CORBA, RMI,
JavaOne
Web Services: XML, SOAP, WSDL, UDDI
GRID: Globus/SRB/Condor/Sun Grid Engine
43Other BioGrids
- BioOpera
- North Carolina BioGrid
- Novartis Grid
- Scientific Annotation Middleware project
- Entropia AIDS modelling Grid
- DiscoveryNet
- Proteomics analysis
- Protein structure prediction
- Biodiversity
- CLEF Clinical records
44myGrid Summary
- myGrid aims to develop infrastructure middleware
for an e-Biologist's workbench
- The setting is bioinformatics but the results are
intended to be generally applicable to e-Science
- A mix of standard, vanguard and bleeding edge
technologies, advanced development and (some)
research
- Academic and commercial partnership
- myGrid project is timely and reflects a community
desire to collaborate, or die
45Take home reprise
- Complexity and Diversity - Size isn't everything.
- Computation is important but information and
knowledge services dominate.
- Integration, curation, annotation, fusion
- Automating support for integration and fusion
means moving from
- human interaction to machine interaction.
- machine readable to machine understandable.
- Metadata using ontologies for finding, managing
and controlling services and content
46Acknowledgements
- Colleagues on myGrid
- Robert Stevens
- Norman Paton
- Alan Robinson at EMBL-EBI
- I3C Interoperable Informatics Infrastructure
Consortium http://www.i3c.org
47URLs
- EBI
- http://www.ebi.ac.uk/
- LSR
- http://www.omg.org/homepages/lsr/
- Open-Bio
- http://www.open-bio.org/
- I3C
- http://www.i3c.org/
48
- "Molecular biologists appear to have eyes for
data that are bigger than their stomachs. As
genomes near completion, as DNA arrays on chips
begin to reveal patterns of gene sequences and
expressions, as researchers embark on
characterising all known proteins, the
anticipated flood of data vastly exceeds in scale
anything biologists have been used to."
- (Editorial, Nature, June 10, 1999)
49
- Presented over the AccessGrid to the CSC Finnish
IT Centre for Science Grid Seminar
- Otaniemi, Espoo, Finland
- 6th March 2002
- http://www.csc.fi/suomi/tapahtumat/GridSeminar/