Title: HPC in the Human Genome Project
1. HPC in the Human Genome Project
James Cuff
james@sanger.ac.uk
2.
- The Sanger Centre is a research centre funded primarily by the Wellcome Trust
- Located in 55 acres of parkland
- Also on site are the:
  - European Bioinformatics Institute (EBI)
  - Human Genome Mapping Project Resource Centre (HGMP-RC)
3. The Sanger Centre
- Founded in 1993; >570 staff members now.
- Our purpose is to further the knowledge of the biology of organisms, particularly through large scale sequencing and analysis of their genomes.
- Our lead project is to sequence a third of the human genome as part of the international Human Genome Project.
4. Sanger Centre research programmes
- Pathogen sequencing programme
- Human genetics programme: study genetic variation (SNPs) and find disease genes
- Informatics:
  - support data collection
  - analyse and present results
  - develop methodology, algorithms and data resources
5. The Structure of DNA
The four bases, Guanine, Thymine, Cytosine and Adenine, are represented computationally by the characters G, T, C and A.
[Diagram: the DNA double helix, with guanine-cytosine and adenine-thymine base pairs]
The order (or sequence) of the bases in the DNA chain codes for the genes.
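As an aside not on the original slide, this character representation is trivial to work with in software; a minimal Python sketch (illustrative names) of the base-pairing rule the diagram shows:

```python
# Minimal sketch (illustrative, not from the slides): DNA as a character
# string, with base-pairing (G-C, A-T) as a simple character mapping.

PAIRS = {"G": "C", "C": "G", "A": "T", "T": "A"}

def complement(sequence: str) -> str:
    """Return the base-paired (complementary) strand of a DNA sequence."""
    return "".join(PAIRS[base] for base in sequence)

print(complement("GATTACA"))  # CTAATGT
```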
6. Typical DNA Sequence
A typist typing at 60 w.p.m. for 8 hours a day would take around 50 years to type the book of life: human DNA consists of 3,000,000,000 letters.
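A quick back-of-envelope check of that claim (assuming the conventional 5 characters per word, which the slide does not state):

```python
# Illustrative arithmetic only; 5 characters per word is the standard
# typing convention, assumed here rather than taken from the slide.
GENOME_LETTERS = 3_000_000_000
chars_per_day = 60 * 5 * 60 * 8          # 60 w.p.m. x 5 chars x 8 hours
years = GENOME_LETTERS / (chars_per_day * 365)
print(f"~{years:.0f} years")             # ~57 years, i.e. "around 50 years"
```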
7. The era of genome sequencing

Organism        Type             Size (Mbases)  No. of genes  Gene density  Completion date
H. influenzae   Bacterium        2              1,700         1/1 kb        1995
Yeast           Eukaryotic cell  13             6,000         1/2 kb        1996
Nematode        Animal           100            18,000        1/6 kb        1998
Human           Mammal           3,000          ?40,000       1/60 kb       2000/3

Sequence data production increase of >2000-fold.
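The gene-density column follows directly from the size and gene-count columns; an illustrative check (note the human row comes out nearer 1/75 kb than the quoted 1/60 kb, reflecting the "?40,000" uncertainty in the gene count):

```python
# Illustrative check of the gene-density column: kilobases per gene.
genomes = {  # name: (size in Mbases, number of genes)
    "H. influenzae": (2, 1_700),
    "Yeast": (13, 6_000),
    "Nematode": (100, 18_000),
    "Human": (3_000, 40_000),  # "?40,000" on the slide: still uncertain
}
for name, (mbases, genes) in genomes.items():
    print(f"{name}: one gene per ~{mbases * 1_000 / genes:.0f} kb")
```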
8. The Sequencing Facility
9. Sanger I.T.
- Sanger network: more than 1600 devices
- >350 Compaq Alpha systems (DS10, DS20, ES40, 8400)
- 440-node Sequence Annotation Farm (PC/DS10/DS10L)
- >750 Alpha processors in total
- 300 PCs of various types
- 150 X-terminals/Network Computers (NCD)
- 250 NT/Mac ABI collection devices
- Various other servers, Linux desktop systems, printers, etc.
- Paracel, Compugen and Timelogic systems
10. RAID, 8400, DS20, PC Farm
11. Systems architecture hierarchy
[Diagram: desktop workgroup systems connect via ATM to front-end compute servers; the compute server farm attaches to RAID storage over Fibre Channel (F/C)]
LSF: Load Sharing Facility, by Platform Computing Ltd
12. Computer Systems Architecture
Fibre Channel/Memory Channel Tru64 Clusters
Implementing tightly coupled clustering with Tru64 V5.x gives us:
- Improved disk I/O (Fibre Channel)
- Scalability (multi-CPU, multi-terabyte)
- Improved manageability: single system image, so whole clusters are managed as single entities
13. ES40 Clusters, F/C Storage
14. Annotation Farms
- 8 racks, each with 40 x Tru64 v5.0 Alpha DS10L
- 320 x 466 MHz Alpha EV6.7, 1U high; total of 320 GB memory, 19.2 TB spinning internal storage
- Roughly equivalent to 10 x GS320; performance around 355 Gflops
- 32 x Cabletron 100 Mb/s switches, 16 x RS232 terminal servers, 2 x 155 Mb ATM fibre uplinks back to the v5.0 cluster
- Two network subnets (multicast and backbone)
- 640 x 100 Mb Fast Ethernet ports
- 1,920 UTP cable crimps, 8 cabinets, 100 kW of power
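The headline figures are internally consistent; a quick sanity check (the per-node values of 1 GB memory, 60 GB disk and 2 Ethernet ports are back-derived from the totals, not stated on the slide):

```python
# Illustrative sanity check of the farm totals quoted above.
nodes = 8 * 40                 # 8 racks x 40 DS10L nodes = 320
print(nodes * 1)               # 1 GB/node (derived)   -> 320 GB memory
print(nodes * 60 / 1_000)      # 60 GB/node (derived)  -> 19.2 TB disk
print(nodes * 2)               # 2 ports/node (derived) -> 640 Fast Ethernet ports
```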
16. Network Overview
- Highly available NFS (Tru64 CAA)
- Fast I/O (ATM > switched full-duplex Ethernet)
- Socket data transfer (via rdist, rcp, and MySQL DBI sockets)
- Segmented network architecture via two ELANs (ATM emulated LANs)
[Diagram: the farm subnet (172.27) uplinks to the Sanger subnet (172.25) through the 8-node ES40 M/C F/C cluster]
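All three transfer mechanisms listed above ultimately ride on TCP sockets; a generic sketch (host name and port are placeholders, not Sanger systems) of streaming a file to a remote listener:

```python
# Generic illustration of socket data transfer; not Sanger's actual tooling.
import socket

def send_file(path: str, host: str, port: int) -> None:
    """Stream a file's bytes to a remote TCP listener in 64 KB chunks."""
    with socket.create_connection((host, port)) as conn, open(path, "rb") as f:
        while chunk := f.read(64 * 1024):
            conn.sendall(chunk)

# send_file("traces.dat", "farm-node-01", 9000)  # illustrative usage
```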
17. Compute systems architecture
[Diagram: firewall/DMZ hosting external services; the 400-node sequence annotation (S.A.) farm, pathogen systems and the trace server interconnected over ATM]
18. Enterprise Clustering
- LSF is still key for job scheduling and batch operations
- LSF offers greater granularity of operation and functionality than Tru64 scheduling
- Schedules individual nodes, cluster-wide and cross-cluster
- With LSF we still have the capability to use many of the 750 compute nodes as a single Sanger Compute Engine (a job-submission sketch follows below)
MODULAR SUPERCOMPUTING
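A sketch of what driving LSF from a script looks like: `bsub` is LSF's submission command, while the queue name, job names and the `analyse` program are illustrative assumptions, not Sanger's actual pipeline:

```python
# Hedged sketch: farming jobs out through LSF's bsub from Python.
import subprocess

def submit(command: str, queue: str = "normal", name: str = "annot") -> None:
    """Submit one command to LSF; the scheduler places it on a farm node."""
    subprocess.run(
        ["bsub", "-q", queue, "-J", name, "-o", f"{name}.%J.out", command],
        check=True,
    )

# One job per input chunk; LSF load-balances them across the compute nodes.
# for i in range(400):
#     submit(f"analyse chunk_{i:03d}.fa", name=f"annot_{i:03d}")
```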
19. Projects
- All are computationally expensive
- We need to continue scaling up, and deal with the physical limitations
- Will involve thousands of CPUs:
  - Large numbers of PC farm nodes
  - High-end, large-memory SMP configurations
- Will require >100 terabytes of storage
20. Immediate Future
The Sanger Centre Genome Campus
- Implement a Storage Area Network; install multi-TB storage to enable disk mirroring and controller-to-controller snapshots
- ATM
- LSF Clustering
- Institute-to-institute clustering: closer collaboration between Sanger, the EBI and other organisations brings the need for site-wide shared clusters
21. Longer Term Future
- Wide Area Clusters: needed for large scale collaborations
- GRID technology: global distributed computing
- International cluster collaborations with other scientific institutes
- Sanger is keen to keep abreast of this emerging technology
GLOBAL COMPUTE ENGINES
22. Questions?