Bioinformatics applications on the EGEE Grid
1
Bioinformatics applications on the EGEE Grid
  • Brendan Hamill
  • Edinburgh Centre for Bioinformatics

2
Contents
  • History of the EGEE Grid
  • Overview of the main grid services
  • Training activity in EGEE
  • Bioinformatics applications

3
EGEE international e-infrastructure
  • Objectives of programme
  • Build, deploy and operate a consistent, robust,
    large scale production grid service that links
    with and builds on national, regional and
    international initiatives
  • Improve and maintain the middleware in order to
    deliver a reliable service to users
  • Attract new users from research and industry and
    ensure training and support for them

4
History of EGEE
5
History of EGEE (2)
  • European DataGrid (EDG) Project
  • Ended March 2004
  • EGEE phase 1
  • April 2004-March 2006
  • EGEE-II
  • April 2006-March 2008
  • Part of the EU Sixth Framework Programme (FP6)
  • Budget > €50M
  • >1200 individuals in 91 partner organisations

6
CERN Large Hadron Collider
7
Large Hadron Collider (2)
8
Large Hadron Collider (3)
9
Large Hadron Collider (4)
10
LHC pre-accelerators and detectors
11
(No Transcript)
12
CMS Detector
13
Other subject areas in EGEE
Astrophysics
Bioinformatics
Computational Chemistry
14
Applications on EGEE
  • More than 25 applications from 9 domains
  • Astrophysics
  • MAGIC, Planck
  • Computational Chemistry
  • Earth Sciences
  • Earth Observation, Solid Earth Physics,
    Hydrology, Climate
  • Financial Simulation
  • E-GRID
  • Fusion
  • Geophysics
  • EGEODE
  • High Energy Physics
  • 4 LHC experiments (ALICE, ATLAS, CMS, LHCb)
  • BaBar, CDF, DØ, ZEUS
  • Multimedia
  • Life Sciences
  • Bioinformatics (Drug Discovery, GPS@, Xmipp_MLrefine, etc.)
  • Medical imaging (GATE, CDSS, gPTM3D, SiMRI 3D,
    etc.)

15
Distribution of CPU time by disciplines and dates
16
EGEE-II Expertise Resources
  • More than 90 partners
  • 32 countries
  • 12 federations
  • Major and national Grid projects in Europe, USA, Asia
  • 27 countries through related projects
  • BalticGrid
  • SEE-GRID
  • EUMedGrid
  • EUChinaGrid
  • EELA

17
Collaborating projects
18
Regional distribution
19
EGEE-II Activities
  • Service activities - establishing operations
  • Grid Operations: Geneva
  • Security: Lyon
  • Testing: Geneva
  • Network activities - supporting VOs
  • Project management: Geneva
  • Training: Edinburgh
  • Applications Support: Paris
  • External projects: Athens
  • Joint Research Activities - e.g. hardening middleware
  • Middleware development: Bologna

20
Related projects: infrastructure, education, application
21
Grid services
  • How can EGEE middleware support collaboration and resource sharing within and between many diverse VOs?

22
Grid Middleware
  • When using a Grid you:
  • Login with digital credentials (Authentication)
  • Use rights given to you (Authorisation)
  • Run jobs
  • Manage files: create them, read/write, list directories
  • Services are linked by the Internet
  • Middleware
  • Many admin domains
  • When using a PC or workstation you:
  • Login with a username and password (Authentication)
  • Use rights given to you (Authorisation)
  • Run jobs
  • Manage files: create them, read/write, list directories
  • Components are linked by a bus
  • Operating system
  • One admin domain

23
Typical current grid
  • Grid middleware runs on each shared resource
  • Data storage
  • (Usually) batch queues on pools of processors
  • Users join VOs
  • Virtual organisation negotiates with sites to
    agree access to resources
  • Distributed services (both people and middleware)
    enable the grid, allow single sign-on

24
Authorisation, Authentication (AA)
(Diagram) Users in many locations and organisations connect through the Grid Security Infrastructure to resources in many locations and organisations: system software (operating system, local scheduler, file system) running on hardware (computing clusters, network resources, data storage).
25
Basic job submission
Users
  • Tools that:
  • copy files to and between CEs and data storage
  • submit a job to a CE
  • monitor the job
  • get the output

How do I run a job on a compute element (CE)? (A CE is a batch queue.)
Resources
Compute elements
Data storage
Network resources
26
Information service (IS)
Users
  • Information service
  • Resources send updates to IS
  • Grid services query IS before running jobs

How do I know which CE could run my job? Which is
free?
Resources
Compute elements
Data storage
Network resources
27
File management
Users
Storage, Transfer, Replica management
"We've terabytes of data in files."
"My data are in files, and I've terabytes."
"Our data are in files, and I've terabytes."
  • EGEE data is primarily file-based
  • services for databases are used by some VOs

Resources
Compute elements
Data storage
Network resources
28
Security, Authentication and Authorisation
29
Authentication and Authorisation
  • Authentication - communication of identity
  • Basis for:
  • Message integrity - so tampering is recognised
  • Message confidentiality, if needed - so only sender and receiver can understand the message
  • Non-repudiation - knowing who did what and when; it cannot be denied
  • Authorisation - once identity is known, what can a user do?
  • Delegation - A allows service B to act on behalf of A
  • Based on X.509 certificates

30
http://compchem.unipg.it
31
Current production middleware
(Diagram) Components shown: User Interface, Resource Broker, Information Service, Replica Catalogue, Authorisation & Authentication, Logging & Book-keeping, Computing Element; the arrows carry the input sandbox, broker info, output sandbox and job status between them.
32
User Interface node
  • The user's interface to the Grid
  • Command-line interface to:
  • Create/manage proxy certificates
  • Job operations
  • Submit a job
  • Monitor its status
  • Retrieve output
  • Data operations
  • Upload a file to an SE
  • Create a replica
  • Discover replicas
  • Other grid services

33
User Interface node
  • Also C and Java APIs
  • To run a job, the user creates a JDL (Job Description Language) file (a minimal sketch follows)
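
For illustration, a minimal JDL file might look like the sketch below; the executable, arguments and file names are examples only, not taken from the slides.

    // hello.jdl - minimal job description (illustrative sketch)
    Executable    = "/bin/hostname";
    Arguments     = "-f";
    StdOutput     = "std.out";
    StdError      = "std.err";
    OutputSandbox = {"std.out", "std.err"};

Such a file is handed to the submission command on the UI (glite-wms-job-submit, or edg-job-submit on earlier releases) after a valid proxy has been created with voms-proxy-init; the matching status and output commands then monitor the job and retrieve the output sandbox.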

34
Querying job status
Possible Job States
35
(No Transcript)
36
Live Real Time Monitor Site
  • http://gridportal.hep.ph.ic.ac.uk/rtm/applet.html

37
Overall load
  • 19.6 million jobs run in the 1st year of EGEE-II
  • 56,000 per day sustained average
  • Peak of 98,000
  • Non-LHC: 13,500 per day
  • About the level of the total in EGEE in 2005
  • 8400 CPU-years delivered in 1 year
  • 1/3 of the total available, sustained over the year
  • Peak of 50% of available in Feb 07
  • 1/3 of the total was non-LHC in Dec 06

38
EGEE Training Site
http://www.egee.nesc.ac.uk
39
NA3 Activity Partners in EGEE-II
EGEE Training Team, University of Warsaw
40
Training effort in EGEE-II
  • 30 partners
  • 31 FTEs, 135 individuals
  • 5% of the project budget (€2.4M)
  • e-Learning digital library
  • Training infrastructure
  • dedicated sub-grid of training clusters
    (Catania, Karlsruhe, Edinburgh, Budapest,
    Warsaw, Athens, Prague, Bratislava)

41
EGEE-II Training Events
42
EGEE Training Site
http://www.egee.nesc.ac.uk
43
Digital Library
http://egee.lib.ed.ac.uk/
44
EGEE Digital Library statistics
  • 73 articles
  • 13 courses
  • 316 events
  • 53 modules
  • 3926 presentations
  • 70 tutorials
  • 97 videos
  • 27 ETF Exemplars

The EGEE Digital Library contains over 4000
learning resources derived from EGEE events
45
UIG pages
  • http://www.egee.nesc.ac.uk/uig

46
UIG pages
47-56
(No Transcript)
57
Link to EGEE 07 presentations
  • EGEE 07 Conference Agenda Page
  • http://indico.cern.ch/conferenceDisplay.py?confId=18714

58
Example Applications
  • WISDOM Project
  • BioinfoGRID
  • Health e-Child
  • MATLAB in Grids

59
WISDOM
  • WISDOM stands for World-wide In Silico Docking On Malaria
  • Goal: find new drugs for neglected and emerging diseases
  • Neglected diseases lack R&D
  • Emerging diseases require very rapid response times
  • Method: grid-enabled virtual docking
  • Cheaper than in vitro tests
  • Faster than in vitro tests

60
In-Silico Drug Discovery
  • WISDOM Project (Wide In-Silico Docking On Malaria)
  • About 80 CPU-years to produce 1 TB of data

61
First Target Malaria
  • 300 million people worldwide are affected
  • 1-1.5 million people die every year
  • Widely spread
  • Caused by protozoan parasites of the genus
    Plasmodium

Life cycle
62
Role of Plasmepsins
  • Plasmepsins are involved in hemoglobin degradation during the parasite's life cycle.
  • Present in the 4 species of Plasmodium causing the disease in humans
  • Sequence homology between the plasmepsins is high (65-70%)
  • X-ray crystallography data available

(Diagram: hemoglobin degradation pathway. Hemoglobin is cleaved by plasmepsins (I, II, IV, and HAP) into small peptides and heme; falcipain and plasmepsin yield smaller peptides, which aminopeptidases reduce to amino acids; heme is oxidised to hematin, which polymerises into hemozoin, the malarial pigment.)
63
Second Target Avian Flu
  • Profiling Inhibitors of Influenza H5N1
  • Docking of 300,000 compounds studied
  • 8 different target structures of influenza A neuraminidases
  • 2000 CPUs were used over 4 weeks (>100 CPU-years)
  • >60,000 output files with a data volume of 600 gigabytes

64
Biological objectives
  • Malaria: find active molecules
  • on a known mutated protein (DHFR)
  • on new targets
  • Plasmepsins
  • GST
  • Tubulin
  • Avian flu
  • Study the impact of point mutations of the N1 enzyme
  • Tamiflu active on N1
  • Find new molecules active on N1

N1
H5
65
A first step towards in silico drug discovery: virtual screening
  • In silico virtual screening
  • Starting from millions of compounds, select a handful of compounds for in vitro testing
  • Very computationally intensive, but potentially much cheaper than in vitro testing
  • Where to find the CPUs to make it time-effective?

66
Grid-enabled virtual docking
Millions of potential drugs to test
against interesting proteins!
High Throughput Screening: 1-10/compound, several hours
67
Statistics of deployment
  • First Data Challenge: July 1st - August 15th, 2005
  • Target: malaria
  • 80 CPU-years
  • 1 TB of data produced
  • 1700 CPUs used in parallel
  • 1st large-scale docking deployment worldwide on an e-infrastructure
  • Second Data Challenge: April 15th - June 30th, 2006
  • Target: avian flu
  • 100 CPU-years
  • 800 GB of data produced
  • 1700 CPUs used in parallel
  • Collaboration initiated on March 1st; deployment preparation achieved in 45 days
  • Third Data Challenge: October 1st - December 15th, 2006
  • Target: malaria
  • 400 CPU-years
  • 1.6 TB of data produced
  • Up to 5000 CPUs used in parallel

68
Status of in vitro tests
  • Avian flu
  • Initial number of compounds: 300,000
  • 123 compounds bought and tested out of the 2250 selected
  • 7 out of 123, approximately 6%, are active
  • Usual average success rate for in vitro tests: 0.1%
  • Factor-60 increase, to be confirmed on more compounds
  • Tests under way at Chonnam National University (ROK)
  • Malaria
  • Initial number of compounds: 500,000 (WISDOM-I)
  • Selection of 30 molecules in 2 steps
  • 1000 molecules selected on docking score
  • Selection of 30 molecules through molecular dynamics
  • Tests under way at Chonnam National University (ROK)
  • First results are very encouraging

69
http://www.bioinfogrid.eu
70
Biological Databases Use case
71
Biological Database in GRID
  • The following biological databases are currently available on the Grid:
  • InterPro databases
  • PROSITE Patterns (Hofmann, K. et al. 1999),
  • PROSITE profile (Hofmann, K. et al. 1999),
  • PRINTS (Attwood, T. K. et al. 2000),
  • Pfam (Bateman, A. et al. 2000),
  • PRODOM (Corpet, F. et al. 1999),
  • SMART (Schultz, J. et al. 2000),
  • TIGRFAMs (Haft, D.H. et al. 2001),
  • PIRSF,
  • PANTHER,
  • SUPERFAMILY

72
Biological Databases
  • BLAST databases
  • nr (NCBI),
  • nt (NCBI),
  • pdbaa (NCBI),
  • UCSC_human_chrs (UCSC),
  • human_genomic (NCBI),
  • refseq_protein (NCBI),
  • refseq_rna (NCBI),
  • refseq_genomic (NCBI),
  • ecoli (NCBI),
  • yeast (NCBI),
  • uniprot (UNIPROT),
  • est_human (NCBI),
  • est_mouse (NCBI)

73
Functional Analogous Finder
  • Goal: compare gene products according to their description, and NOT according to their sequence similarity.

As the description we use the standardised terminology of the Gene Ontology (GO).
Data source: the Gene Ontology Database (GODB) is a repository of the GO and of the associations between terms and gene products (GOA). Currently there are 2M gene products described by 21,000 terms, producing 9M associations.
74
Functional Analogous Finder
  • Approach
  • A selection of about 1M well-annotated gene products is involved in the search.
  • A simple chi-square application compares the common and non-common terms between two gene products (a sketch follows this slide).
  • Problem
  • A comparison of one gene product against the whole set of 1M gene products occupies one CPU for 30 minutes on average.
  • An all-against-all search over every gene product would occupy one CPU for more than 50 years.
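
To make the comparison concrete, here is a minimal Python sketch of the per-pair test. The slides do not give the exact statistic or data layout, so the 2x2 contingency table (terms shared, terms unique to each product, terms annotated on neither) and all names below are assumptions.

    def chi_square_2x2(a, b, c, d):
        """Pearson chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
        n = a + b + c + d
        denominator = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0

    def compare_gene_products(terms_a, terms_b, vocabulary_size=21000):
        """Score two gene products from their GO annotations (sets of GO term IDs)."""
        shared = len(terms_a & terms_b)      # terms annotated on both products
        only_a = len(terms_a - terms_b)      # terms annotated only on product A
        only_b = len(terms_b - terms_a)      # terms annotated only on product B
        neither = vocabulary_size - shared - only_a - only_b
        return chi_square_2x2(shared, only_a, only_b, neither)

    # Hypothetical GO annotations for two gene products
    go_a = {"GO:0005524", "GO:0004672", "GO:0006468"}
    go_b = {"GO:0005524", "GO:0004672", "GO:0016301"}
    print(compare_gene_products(go_a, go_b))

Repeating such a test for one gene product against all 1M others is what occupies a CPU for about 30 minutes, which is why the search is split into grid jobs.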

75
Functional Analogous Finder
  • Solution
  • Split the search into a number of small jobs and distribute them, together with the DB, on as many free WNs as possible.
  • The job submission is made by a script running as a daemon (a sketch follows this list)
  • The script submits 80 jobs every 30 minutes
  • It is possible to run more instances of the submission daemon in order to increase the total number of jobs submitted per hour
  • The multi-process submission improves the speed of submission
  • The submission uses 3 RBs in a round-robin algorithm, in order to avoid overloading a single RB and so that the failure of a single RB cannot stop the submission of jobs
  • The OutputSandbox of the jobs is retrieved periodically
  • The status of the production is monitored by simply querying the monitoring DB
  • The user can know the number of processed/running genes
  • The number of running jobs
  • The location of each job
  • Debug possible errors in running jobs
  • The software to submit jobs is installed on 2 different machines, so that a single hardware failure cannot stop the submission
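
As an indication of how such a daemon can be structured, here is a minimal Python sketch. The broker host names, the task-queue query and the exact submission command and flags are placeholders, not taken from the slides.

    #!/usr/bin/env python
    """Sketch of the submission daemon described above (all names illustrative)."""
    import itertools
    import subprocess
    import time

    RESOURCE_BROKERS = ["rb1.example.org", "rb2.example.org", "rb3.example.org"]  # hypothetical RBs
    JOBS_PER_CYCLE = 80      # slide: 80 jobs submitted every 30 minutes
    CYCLE_SECONDS = 30 * 60

    def next_task_batch(n):
        """Placeholder for the query against the central task-queue DB:
        return up to n gene batches that still need processing."""
        return ["batch_%05d" % i for i in range(n)]

    def submit(jdl_file, broker):
        """Placeholder submission; the real deployment used the gLite/LCG CLI, and how
        the chosen RB is passed to it depends on the middleware release."""
        subprocess.run(["glite-wms-job-submit", "-a", jdl_file],  # flags indicative only
                       check=False)

    def main():
        brokers = itertools.cycle(RESOURCE_BROKERS)   # round-robin over the three RBs
        while True:
            for batch in next_task_batch(JOBS_PER_CYCLE):
                submit(batch + ".jdl", next(brokers))
            time.sleep(CYCLE_SECONDS)                 # wait for the next 30-minute cycle

    if __name__ == "__main__":
        main()

Running a second instance of this loop, as the slide notes, simply multiplies the submission rate; the central task-queue DB described on the next slides is what would keep the instances from picking up the same genes.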

76
Functional Analogous Finder: the job submission
(Diagram components: UI, RB, central DB, Farm1 with SE1, Farm2 with SE2)
  • A series of scripts runs periodically on the UI to submit and control the jobs
  • The script submits 80 jobs every 30 minutes
  • The central DB acts as a task queue for automatic job submission
  • A simple monitoring system, based on the central DB, makes it possible to know the status of each job in real time and to make some post-mortem analysis
  • Status of each single operation made by the running script
  • Location of the jobs

77
Functional Analogous Finder: actions performed when the job reaches the WN
(Diagram components: UI, RB, DB, two farms each with its SE)
  • Reads from the DB the n genes to compare (sized for 10-hour jobs), chosen among the genes not yet completed or marked as running for more than 48 hours
  • Downloads the input files (from one of three available SEs)
  • Decompresses them
  • Installs the Perl libraries
  • Starts the Perl script and the comparison (a rough sketch follows)
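
Under the same caveats, here is a rough Python sketch of a worker-node wrapper for these steps; the copy command, VO name, URLs, file names and the Perl script name are assumptions.

    #!/usr/bin/env python
    """Sketch of the wrapper a job might run on the worker node (names illustrative)."""
    import os
    import subprocess
    import tarfile

    STORAGE_ELEMENTS = ["se1.example.org", "se2.example.org", "se3.example.org"]  # hypothetical SEs
    INPUT_BUNDLE = "fun_finder_inputs.tar.gz"                                     # hypothetical file

    def claim_genes(n):
        """Placeholder DB call: pick n genes that are not yet completed, or that
        have been marked as running for more than 48 hours."""
        return ["GENE%06d" % i for i in range(n)]

    def fetch_input_bundle():
        """Try each of the three SEs in turn until one copy succeeds (the real job used
        the gLite data-management tools; the command, VO and URLs are indicative only)."""
        destination = "file:" + os.path.abspath(INPUT_BUNDLE)
        for se in STORAGE_ELEMENTS:
            result = subprocess.run(["lcg-cp", "--vo", "biomed",
                                     "srm://%s/data/%s" % (se, INPUT_BUNDLE), destination],
                                    check=False)
            if result.returncode == 0:
                return
        raise RuntimeError("no storage element could deliver the input bundle")

    def main():
        genes = claim_genes(20)                      # enough work for a roughly 10-hour job
        fetch_input_bundle()
        with tarfile.open(INPUT_BUNDLE) as archive:  # decompress the inputs and Perl libraries
            archive.extractall()
        subprocess.run(["perl", "compare_genes.pl"] + genes, check=False)  # start the comparison

    if __name__ == "__main__":
        main()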

78
Functional Analogous Finder Results
  • All 1M gene products processed in less than one month
  • Different farms used: 64
  • Different hosts used: 2,446
  • Total submitted jobs: 95,041
  • Total started jobs: 66,313
  • Total successful jobs (from the application's point of view): 42,992
  • Total failed jobs (due to input staging problems): 3,209

79
The Health-e-Child Project Platform: A gLite Adoption Case Study
2007-10-01, EGEE Business Track, Budapest, Hungary
David Manset - dmanset@maat-g.com, maat Gknowledge (MAAT), http://www.maat-g.com
80
Project Objectives
  • Establish Horizontal and Vertical integration of
    data, information and knowledge for Paediatrics
  • Develop a grid-based biomedical information
    platform, supported by sophisticated and robust
    search, optimisation, and matching techniques for
    heterogeneous information,
  • Build enabling tools and services that improve
    the quality of care and reduce its cost by
    increasing efficiency
  • Integrated disease models exploiting all
    available information levels
  • Database-guided decision support systems
  • Large-scale, cross-modality information fusion
    and data mining for knowledge discovery
  • A Knowledge Repository for Paediatrics

81
Distributed Computing with MATLAB in Grids
  • Silvina Grad-Freilich
  • Manager, Parallel Computing Technical Marketing
  • sgradfre@mathworks.com

http://indico.cern.ch/materialDisplay.py?contribId=283&sessionId=25&materialId=slides&confId=18714
82
Licensing for Third-Party and Global Use
University A
HPC Center
83
  • License Management within Grid Framework
  • Some of the issues to resolve
  • Third-party licensing
  • Global licensing
  • Commercial vs. academic use
  • Policy on license management within the EGEE
    framework

84
Pilot: EGEE and The MathWorks
Integrate distributed computing tools with EGEE middleware
  • Step 1: Research need and pre-setup
  • Survey EGEE virtual organizations on MATLAB use (EGEE)
  • Identify sites to be used in test (EGEE)
  • Provide trial licenses (MathWorks)
  • Step 2: Technical feasibility study
  • Integrate with local resource manager (EGEE)
  • Integrate with local resource manager through Workload Management System (MathWorks and EGEE)
  • Step 3: Define licensing model
  • Create model for Grid deployment within the EGEE framework (MathWorks, with much appreciated EGEE support!)

85
Further Information
  • EGEE Public Portal
  • http://www.eu-egee.org
  • EGEE Training Site
  • http://www.egee.nesc.ac.uk
  • EGEE Digital Library
  • http://egee.lib.ed.ac.uk
  • EGEE User Information Group
  • http://www.egee.nesc.ac.uk/uig

86
3rd EGEE User Forum
87
EGEE08
  • The EGEE'08 Conference will take place in …