Bioinformatics and Grids - PowerPoint PPT Presentation

Title: Bioinformatics and Grids


1
Bioinformatics and Grids
  • Professor Carole Goble,
  • University of Manchester, UK
  • carole_at_cs.man.ac.uk
  • Director of myGrid e-Science project
  • Co-director ESNW e-Science Regional Centre

2
Roadmap
  • Post Genome biology
  • Challenges for bioinformatics
  • Why biology isn't physics
  • Information-centric Grids
  • An example myGrid
  • Other projects
  • Take home

3
Take home
  • Complexity and Diversity - Size isn't everything.
  • Computation is important but information and
    knowledge services dominate.
  • Integration, curation, annotation, fusion
  • Automating support for integration and fusion
    means moving from
  • human interaction to machine interaction.
  • machine readable to machine understandable.
  • Metadata using ontologies for finding, managing
    and controlling services and content

4
Functional Genomics
  • An integrated view of how organisms work and
    interact in growth, development and pathogenesis
  • From single gene to whole genome
  • From single biochemical reactions to whole
    physiological and developmental systems
  • What do genes do?
  • How do they interact?

5
Genotype to Phenotype
DNA chips
Modelling
Expression
Folding
population
protein sequence
  • Synchrotron
  • Proteomics
  • Domain analysis
  • SNP
  • Gene prediction
  • HTP Sequencing
  • Link the observable behaviour of an organism with
    its genotype

6
Drug Discovery
7
Pharmacogenomics Knowledge/Information Flow
Hypotheses
Design
Integration
Clinical Resources
Individualised Medicine
Data Mining
Case-Based Reasoning
Information Fusion
8
Use Cases (I3C)
  • Show me all the genes in the glucose metabolism
    pathway and get their GenBank accession numbers
  • Find all the citations for the HOX gene family
    for human and mouse
  • Find all the kinase genes from Wormbase and
    retrieve the DNA sequence

9
Use Cases
  • Show me Nucleotide binding proteins in mouse
  • Answer
  • P12345 in Swiss-Prot is an ATPase
  • Terri Attwood is an expert on this
  • Jackson labs have a database but you need to
    register
  • A paper has just been published in Proteins by
    the Stanford lab on this.

10
Which compounds interact with (alpha-adrenergic
receptors) ((over expressed in (bladder
epithelial cells)) but not (smooth muscle
tissue)) of ((patients with urinary flow
dysfunction) and a sensitivity to the
(quinazoline family of compounds))?
Enzyme database
SNPs database
Tissue database
Drug formulary
High-throughput screening
Receptor database
Clinical trials database
Chemical database
Expression database
11
Large amounts of data
http://www3.ebi.ac.uk/Services/DBStats/
  • EMBL July 2001
  • 150 Gbytes
  • Microarray
  • 1 Petabyte per annum
  • Sanger Centre
  • 20 terabytes of data
  • Genome sequences increase 4x per annum
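
As a rough feel for what a fourfold annual increase implies, the sketch below computes how long compound growth takes to span a given size range. Applying the 4x rate to the 150 Gbyte figure is an assumption made only for illustration; the slide does not say that this particular archive grows at that rate, and decimal units (1 Gbyte = 10^9 bytes) are assumed.

```python
import math


def years_to_reach(start_bytes: float, target_bytes: float,
                   factor_per_year: float) -> float:
    """Years of compound growth needed to grow from start to target."""
    return math.log(target_bytes / start_bytes) / math.log(factor_per_year)


GB = 10**9   # decimal gigabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)

# Hypothetical scenario: a 150 Gbyte collection growing 4x per year.
years = years_to_reach(150 * GB, 1 * PB, 4.0)
print(f"about {years:.1f} years to reach 1 Petabyte")
```

The point is the shape of the curve rather than the exact number: at this rate a collection grows by three to four orders of magnitude within a single project lifetime.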

12
High throughput experimental methods
  • Micro arrays for gene expression
  • Robot-based capture
  • 10K data points per chip
  • 20 x per chip
  • Cottage industry to industrial scale

13
Heterogeneity
  • Data types and forms
  • Community
  • Autonomy
  • Over 500 different databases
  • Different formats, structure, schemas, coverage
  • Web interfaces, flat file distribution,

14
Heterogeneity
  • Complexity
  • Diversity

Proteome
15
Heterogeneity
  • Complexity
  • Diversity

Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors),
Proteome
16
Heterogeneous Data
  • Multimedia
  • Images and video
  • Text annotations and literature
  • Descriptive as well as numeric
  • Knowledge-based

Text Extraction
17
Swiss-Prot
SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.;
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAHVDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTGI
TSFQWENTKVNIIDTPGHMDFLAEVYRSLSVLDGAILLISAKDGVQAQTRILFHALRKMG
IPTIFFINKIDQNGIDLSTVYQDIKEKLSAEIVIKQKVELYPNVCVTNFTESEQWDTVIE
GNDDLLEKYMSGKSLEALELEQEESIRFQNCSLFPLYHGSAKSNIGIDNLIEVITNKFYS
STHRGPSELCGNVFKIEYTKKRQRLAYIRLYSGVLHLRDSVRVSEKEKIKVTEMYTSING
ELCKIDRAYSGEIVILQNEFLKLNSVLGDTKLLPQRKKIENPHPLLQTTVEPSKPEQREM
LLDALLEISDSDPLLRYYVDSTTHEIILSFLGKVQMEVISALLQEKYHVEIEITEPTVIY
MERPLKNAEYTIHIEVPPNPFWASIGLSVSPLPLGSGMQYESSVSLGYLNQSFQNAVMEG
IRYGCEQGLYGWNVTDCKICFKYGLYYSPVSTPADFRMLAPIVLEQVLKKAGTELLEPYL
SFKIYAPQEYLSRAYNDAPKYCANIVDTQLKNNEVILSGEIPARCIQEYRSDLTFFTNGR
SVCLTELKGYHVTTGEPVCQPRRPNSRIDKVRYMFNKIT
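
The entry above is machine readable in the narrow sense that each line carries a two-letter code, so a program can at least split it into fields. The sketch below is one way that might look, not the parser of any particular toolkit. It re-types a few lines of the record and assumes the conventional flat-file punctuation (semicolon-separated items, a terminating full stop), which the transcript does not preserve; the function names are hypothetical.

```python
from collections import defaultdict

# A few lines of the TET9_ENTFA record, re-typed with assumed punctuation.
ENTRY = """\
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
"""


def parse_flat_file(text):
    """Group the body of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank lines and sequence data
        fields[line[:2]].append(line[5:].strip())
    return fields


def keywords(fields):
    """Join continuation KW lines and split them on semicolons."""
    joined = " ".join(fields["KW"]).rstrip(".")
    return [k.strip() for k in joined.split(";") if k.strip()]


def cross_references(fields):
    """Return (database, primary identifier) pairs from the DR lines."""
    refs = []
    for body in fields["DR"]:
        parts = [p.strip() for p in body.rstrip(".").split(";")]
        refs.append((parts[0], parts[1]))
    return refs


fields = parse_flat_file(ENTRY)
print(keywords(fields))
print(cross_references(fields))
```

Splitting fields is the easy part. The free-text CC lines (function, similarity) still need a human, or an agreed vocabulary, to be interpreted.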
18
Heterogeneity
  • Lymphocyte associated receptor of death
  • LARD
  • WSL-LR WSL-S1 WSL-S2 proteins
  • WSL-1 protein precursor
  • Apoptosis-mediating receptor DR3
  • Apoptosis-mediating receptor TRAMP
  • Death Domain receptor 3
  • WSL protein
  • apoptosis inducing receptor AIR
  • APO-3

19
  • Data resources have been built introspectively
    for human researchers
  • Information is machine readable not machine
    understandable
  • CONTROLLED VOCABULARIES AND ONTOLOGIES
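
A controlled vocabulary can be pictured as a table that maps every name in use to one agreed term. The sketch below uses the receptor names from the preceding slide; the identifier `EXAMPLE:0001` and the normalisation rule are hypothetical and stand in for whatever a real vocabulary would define.

```python
# Names taken from the preceding slide; all refer to the same receptor.
SYNONYMS = [
    "Lymphocyte associated receptor of death",
    "LARD",
    "WSL-1 protein precursor",
    "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP",
    "Death Domain receptor 3",
    "WSL protein",
    "apoptosis inducing receptor AIR",
    "APO-3",
]

# Hypothetical identifier for the shared concept (not a real accession).
CONCEPT_ID = "EXAMPLE:0001"


def normalise(name: str) -> str:
    """Fold case, hyphens and repeated spaces so near-duplicates compare equal."""
    return " ".join(name.lower().replace("-", " ").split())


INDEX = {normalise(name): CONCEPT_ID for name in SYNONYMS}


def lookup(name: str):
    """Return the concept identifier for a name, or None if it is unknown."""
    return INDEX.get(normalise(name))


print(lookup("lard"), lookup("Apo 3"), lookup("death domain receptor 3"))
```

Once every resource annotates with the identifier rather than a free-text name, a query for one synonym finds records filed under any of the others.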

20
Shared data -> shared meaning
Service provider
Service provider
Service provider
Service provider
Service provider
21
Complexity
  • Multiple views
  • Interrelated
  • Intra and inter cell interactions and
    bio-processes

"Courtesy U.S. Department of Energy Genomes to
Life program (proposed) DOEGenomesToLife.org."
22
Instability and Quality
  • Exploring the unknown
  • At least 5 definitions of a gene
  • The sequence is a model
  • Other models are work in progress
  • Names unstable
  • Data unstable
  • Models unstable
  • the problem in the field is not a lack of good
    integrating software, Smith says. The packages
    usually end up leading back to public databases.
    "The problem is the databases are God-awful," he
    told BioMedNet. If the data is still
    fundamentally flawed, then better algorithms add
    little.
  • Temple Smith, director of the Molecular
    Engineering Research Center at Boston University,

23
Curation
SWISS-PROT
MEDLINE papers
annotation
PRINTS
BLOCKS
24
Infrastructure Integration
  • Technologies
  • CORBA and the OMA
  • Java and JavaBeans
  • Data mining
  • Algorithm development
  • Knowledge discovery
  • Knowledge representation
  • Visualisation
  • Query tools and services
  • Database replication
  • OO technology
  • OO databases
  • Networks and security
  • Data cleaning and validation

Structural Genomics
SNPs
Sequence Data
Expression Data
Gene Analysis
Mutation/Variation Pattern Discovery Gene
Prediction Splice Sites Promoters EST
Differential Temporal In situ
Functional Genomics
Gene Identification
Gene Networks
Proteomics
Gene Annotation and Function Regulation of
Metabolism Biochemical Pathways Signal
Transduction
Metabolomics
CORBA / Java / BSA / SRS
25
Bioinformatics Analysis
  • Different algorithms
  • BLAST, FASTA, pSW
  • Different implementations
  • WU-BLAST, NCBI-BLAST
  • Different service providers
  • NCBI, EBI, DDBJ
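
The three axes above (algorithm, implementation, provider) suggest a simple service description that a tool could search instead of a human browsing web pages. This is a minimal sketch with hypothetical class and function names. The pairings of implementation and provider in the sample registry are illustrative only and are not a statement of what each site hosts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ServiceDescription:
    algorithm: str       # e.g. BLAST, FASTA
    implementation: str  # e.g. WU-BLAST, NCBI-BLAST
    provider: str        # e.g. NCBI, EBI, DDBJ


# Illustrative registry; the pairings are assumptions for the example.
REGISTRY = [
    ServiceDescription("BLAST", "NCBI-BLAST", "NCBI"),
    ServiceDescription("BLAST", "WU-BLAST", "EBI"),
    ServiceDescription("FASTA", "FASTA", "DDBJ"),
]


def find(registry: List[ServiceDescription],
         algorithm: Optional[str] = None,
         provider: Optional[str] = None) -> List[ServiceDescription]:
    """Return the services matching every criterion that was supplied."""
    return [
        s for s in registry
        if (algorithm is None or s.algorithm == algorithm)
        and (provider is None or s.provider == provider)
    ]


for service in find(REGISTRY, algorithm="BLAST"):
    print(service.implementation, "at", service.provider)
```

Recording the implementation as well as the algorithm matters because two BLAST services may give different results for the same query.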

26
In silico experimentation
27
In silico experimentation
myProteins
BLAST
Swiss-Prot
BLAST
PIR
Go-Blast visualisation
BLAST
28
In silico experimentation
myProteins
BLAST
Interpro
Swiss-Prot
BLAST
PIR
Go-Blast visualisation
BLAST
29
In silico experimentation
30
In silico experimentation
31
In silico experimentation
32
In silico experimentation
  • Discovery, interoperation, fusion, sharing of
    data, knowledge and workflows
  • Explicit management of workflows
  • information processes and best practice
  • Improving quality of experiments and data
  • provenance and propagating change
  • Scientific discovery is personal and global
  • personalisation and collaborative working
  • Security, ownership -> valuable assets
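
Explicit workflow management and provenance can be sketched as follows: the experiment is written down as named steps, and the enactor records what each step consumed and produced. This is an illustrative toy under stated assumptions, not the myGrid enactment engine. The search functions are stand-ins that call no real service, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    name: str                 # also the key under which the result is stored
    func: Callable[..., Any]  # stand-in for a remote service call
    inputs: List[str]         # names of workflow inputs or earlier steps


@dataclass
class ProvenanceRecord:
    step: str
    inputs: Dict[str, Any]
    output: Any


def enact(steps: List[Step], initial: Dict[str, Any]):
    """Run steps in order, keeping a provenance record for each one."""
    results = dict(initial)
    log = []
    for step in steps:
        args = {name: results[name] for name in step.inputs}
        output = step.func(**args)
        results[step.name] = output
        log.append(ProvenanceRecord(step.name, args, output))
    return results, log


def fake_search(database: str):
    """Build a stand-in search that just tags each protein with a database."""
    return lambda proteins: [f"{p}@{database}" for p in proteins]


def merge(swissprot_hits, pir_hits):
    return sorted(swissprot_hits + pir_hits)


WORKFLOW = [
    Step("swissprot_hits", fake_search("Swiss-Prot"), ["proteins"]),
    Step("pir_hits", fake_search("PIR"), ["proteins"]),
    Step("merged", merge, ["swissprot_hits", "pir_hits"]),
]

results, log = enact(WORKFLOW, {"proteins": ["P12345"]})
for record in log:
    print(record.step, "<-", sorted(record.inputs))
```

Because the log names the inputs of every result, a change in an upstream database can be traced to the downstream results that depend on it.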

33
myGrid
  • Personalised
  • extensible environments for
  • data-intensive
  • in silico experiments in biology
  • http://www.mygrid.org.uk

34
myGrid
  • UK e-Science Grid programme pilot (EPSRC)
  • Generic middleware
  • Bioinformatics and genomics setting
  • 1st October 2001 -- 31st March 2005
  • (36 months funded in a 42-month execution period)
  • 16 full-time researchers/developers

35
myGrid Partners
36
A Desiderata (cf. Grid)
Applications
  • Software development toolkits
  • Standard protocols, services and APIs
  • A modular bag of technologies
  • Enable incremental development of grid-enabled
    tools and applications
  • Reference implementations
  • Learn through deployment and applications
  • Open source

Diverse global services
Core services
Local OS
37
Approach
myGrid Stack
Personalisation
Metadata
Agent-based Interoperation layer
I.E
38
myGrid Outcomes
  • e-Scientists
  • Environment built on toolkits for service access,
    personalisation and community
  • Gene function expression analysis using S.
    cerevisiae
  • Annotation workbench for the PRINTS pattern
    database
  • Developers
  • Protocols and service descriptions
  • myGrid-in-a-Box developer's kit
  • Re-purposing DAS, AppLab and OpenBSA
  • Integrating ISYS and GlaxoSmithKline platforms

39
myGrid tech outcomes
  • Services, service descriptions (ontologies),
    message protocols and APIs
  • Database access from the Grid
  • Process enactment on the Grid
  • Personalisation services
  • Provenance services
  • Metadata services: DAML+OIL, RDF(S)
  • Laying the foundations for Agent Services
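
The role of ontology-backed metadata in finding services can be illustrated with RDF-style triples and a class hierarchy. The sketch below uses plain Python tuples rather than an RDF library, and every `ex:` name is hypothetical. It shows only the idea that a query for a general class also finds services described with a more specific one.

```python
# (subject, predicate, object) statements in the style of RDF(S).
TRIPLES = {
    ("ex:SimilaritySearch", "rdfs:subClassOf", "ex:SequenceAnalysis"),
    ("ex:PatternMatch", "rdfs:subClassOf", "ex:SequenceAnalysis"),
    ("ex:blastService", "rdf:type", "ex:SimilaritySearch"),
    ("ex:printsScan", "rdf:type", "ex:PatternMatch"),
    ("ex:citationLookup", "rdf:type", "ex:LiteratureSearch"),
}


def subclasses(triples, cls):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def instances_of(triples, cls):
    """Find everything typed as cls or as any of its subclasses."""
    classes = subclasses(triples, cls)
    return sorted(s for s, p, o in triples
                  if p == "rdf:type" and o in classes)


print(instances_of(TRIPLES, "ex:SequenceAnalysis"))
```

A keyword search for "sequence analysis" would miss both services; the subclass statements are what make the description understandable to a machine rather than merely readable.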

40
Converging technologies
Grid Computing: Globus, Sun Grid Engine, Condor, DS (Jini, CORBA)
Web Service and Semantic Web Technologies: SOAP, WSDL, UDDI, WSIL, DAML+OIL, OWL, RDF(S), WSFL
Agents: ACL, methodology
41
myGrid Services
42
Standards and Activities
Open Source: Open Bio Foundation, BioJava, BioPerl
Consortium Expertise: view propagation, reasoning, workflow
(De facto) Standards: OMG LSR, I3C, MGED, Gene Ontology
Semantic Web: RDF, RDFS, DAML+OIL
Bioinformatics integration platforms: DAS, OpenBSA, ISYS, OpenMMS, Kleisli, Ensembl, AppLab, SRS, BioNavigator, DiscoveryLink, K1, TAMBIS, MOBY
Distributed Computing Environments: CORBA, RMI, JavaOne
Web Services: XML, SOAP, WSDL, UDDI
GRID: Globus/SRB/Condor/Sun Grid Engine
43
Other BioGrids
  • BioOpera
  • North Carolina BioGrid
  • Novartis Grid
  • Scientific Annotation Middleware project
  • Entropia AIDS modelling Grid
  • DiscoveryNet
  • Proteomics analysis
  • Protein structure prediction
  • Biodiversity
  • CLEF Clinical records

44
myGrid Summary
  • myGrid aims to develop infrastructure middleware
    for an e-Biologist's workbench
  • The setting is bioinformatics but the results are
    intended to be generally applicable to e-Science
  • A mix of standard, vanguard and bleeding-edge
    technologies, advanced development and (some)
    research
  • Academic and commercial partnership
  • The myGrid project is timely and reflects a community
    desire to collaborate, or die

45
Take home reprise
  • Complexity and Diversity - Size isn't everything.
  • Computation is important but information and
    knowledge services dominate.
  • Integration, curation, annotation, fusion
  • Automating support for integration and fusion
    means moving from
  • human interaction to machine interaction.
  • machine readable to machine understandable.
  • Metadata using ontologies for finding, managing
    and controlling services and content

46
Acknowledgements
  • Colleagues on myGrid
  • Robert Stevens
  • Norman Paton
  • Alan Robinson at EMBL-EBI
  • I3C: Interoperable Informatics Infrastructure
    Consortium http://www.i3c.org

47
URLs
  • EBI
  • http://www.ebi.ac.uk/
  • LSR
  • http://www.omg.org/homepages/lsr/
  • Open-Bio
  • http://www.open-bio.org/
  • I3C
  • http://www.i3c.org/

48
  • "Molecular biologists appear to have eyes for
    data that are bigger than their stomachs. As
    genomes near completion, as DNA arrays on chips
    begin to reveal patterns of gene sequences and
    expressions, as researchers embark on
    characterising all known proteins, the
    anticipated flood of data vastly exceeds in scale
    anything biologists have been used to."
  • (Editorial, Nature, June 10, 1999)

49
  • Presented over the AccessGrid to the CSC Finnish
    IT Centre for Science Grid Seminar
  • Otaniemi, Espoo, Finland
  • 6th March 2002
  • http://www.csc.fi/suomi/tapahtumat/GridSeminar/