HPC in the Human Genome Project - PowerPoint PPT Presentation

1
HPC in the Human Genome Project
James Cuff
james_at_sanger.ac.uk
2
  • The Sanger Centre is a research centre funded
    primarily by the Wellcome Trust
  • Located in 55 acres of parkland
  • Also on site are the:
    - European Bioinformatics Institute (EBI)
    - Human Genome Mapping Project Resource Centre (HGMP-RC)

3
The Sanger Centre
  • Founded in 1993; >570 staff members now.
  • Our purpose is to further the knowledge of the
    biology of organisms, particularly through large
    scale sequencing and analysis of their genomes.
  • Our lead project is to sequence a third of the
    human genome as part of the international Human
    Genome Project.

4
Sanger Centre research programmes
  • Pathogen sequencing programme
  • Human genetic programme - study genetic
    variation (SNPs) and find disease genes
  • Cancer genome project
  • Informatics
    - support data collection
    - analyse and present results
    - develop methodology, algorithms and data resources

5
The Structure of DNA
Four nucleotide bases (guanine, thymine, cytosine and adenine) are
represented computationally by the characters G, T, C and A
[Figure labels: guanine, thymine, cytosine, adenine, base pairs]
The order (or sequence) of the bases in the
DNA chain codes for the genes
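A minimal sketch of this representation, in Python (our choice of language; the function name `reverse_complement` and the example sequence are illustrative and not taken from the slides). A strand is held as a string over A, C, G and T, and the standard base-pairing rules (A with T, C with G) give the opposite strand:

```python
# Standard base-pairing rules: adenine pairs with thymine, cytosine with guanine.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

example = "GATTACA"  # made-up sequence, for illustration only
print(reverse_complement(example))  # prints TGTAATC
```

Because the sequence is just a string, ordinary text-processing tools apply to it directly, which is part of why sequence analysis maps well onto large farms of general-purpose machines.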
6
Typical DNA Sequence
A typist typing at 60 w.p.m. for 8 hours a day
would take around 50 years to type the book of
life. Human DNA consists of 3,000,000,000
letters
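The "around 50 years" figure can be checked with a rough calculation. The sketch below is in Python; the characters-per-word value and the assumption of typing every day of the year are ours, not the presenter's, and the constant and function names are illustrative.

```python
GENOME_LETTERS = 3_000_000_000  # from the slide
WORDS_PER_MINUTE = 60           # from the slide
HOURS_PER_DAY = 8               # from the slide
CHARS_PER_WORD = 6              # assumption: five letters plus a space
DAYS_PER_YEAR = 365             # assumption: no days off

def years_to_type(letters: int = GENOME_LETTERS,
                  chars_per_word: int = CHARS_PER_WORD) -> float:
    """Years needed to type `letters` characters at the slide's typing rate."""
    chars_per_day = WORDS_PER_MINUTE * chars_per_word * 60 * HOURS_PER_DAY
    return letters / chars_per_day / DAYS_PER_YEAR

print(f"{years_to_type():.1f} years")                   # six characters per word
print(f"{years_to_type(chars_per_word=5):.1f} years")   # five characters per word
```

Under these assumptions the answer is a little under 48 years at six characters per word and about 57 years at five, so the slide's "around 50 years" is the right order of magnitude either way.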
7
The era of genome sequencing
Organism        Size (Mbases)   No. of genes   Gene density   Type              Completion date
H. influenzae   2               1,700          1/1kb          Bacterium         1995
Yeast           13              6,000          1/2kb          Eukaryotic cell   1996
Nematode        100             18,000         1/6kb          Animal            1998
Human           3000            ?40,000        1/60kb         Mammal            2000/3
Sequence data production increase of >2000
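The 1/Nkb figures on this slide read as average gene spacing, which can be recomputed from the size and gene-count columns. A short Python sketch (the names `GENOMES` and `kb_per_gene` are ours; the numbers are those given on the slide):

```python
# (organism, genome size in Mbases, number of genes) as listed on the slide
GENOMES = [
    ("H. influenzae", 2, 1_700),
    ("Yeast", 13, 6_000),
    ("Nematode", 100, 18_000),
    ("Human", 3000, 40_000),  # gene count is marked "?" on the slide
]

def kb_per_gene(size_mbases: float, genes: int) -> float:
    """Average number of kilobases of genome per gene."""
    return size_mbases * 1000 / genes

for name, size, genes in GENOMES:
    print(f"{name}: about 1 gene per {kb_per_gene(size, genes):.1f} kb")
```

The first three rows come out close to the slide's rounded values (about 1.2, 2.2 and 5.6 kb per gene). For human, 3000 Mbases and 40,000 genes give 1 gene per 75 kb; the slide's 1/60kb would correspond to roughly 50,000 genes, consistent with the gene count being an uncertain estimate at the time.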
8
The Sequencing Facility
9
Sanger I.T.
Sanger network: more than 1600 devices
  • >350 Compaq Alpha systems (DS10, DS20, ES40, 8400)
  • 440-node Sequence Annotation Farm (PC/DS10/DS10L)
  • >750 Alpha processors in total
  • 300 PCs (various)
  • 150 X-terms/Network Computers (NCD)
  • 250 NT/Mac ABI collection devices
  • Various other servers, Linux desktop systems, printers, etc.
  • Paracel, Compugen and Timelogic systems
10
RAID, 8400, DS20, PC Farm
11
Systems architecture hierarchy
[Diagram labels: Compute Server Farm; RAID Storage (x2); F/C (x2); Front-end Compute Servers (x2); ATM; Desktop workgroup systems (x2)]
LSF - Load Sharing Facility by Platform Computing Ltd
12
Computer Systems Architecture
Fibre Channel/Memory Channel Tru64 Clusters
Implementing tightly coupled clustering with Tru64 V5.x. We get:
  • Improved disk I/O (fibre channel) and scalability (multi-CPU, multi-terabyte)
  • Improved manageability (single system image; whole clusters are managed as single entities)
13
ES40 Clusters, F.C Storage
14
Annotation Farms
  • 8 racks, each with 40 x Alpha DS10L running Tru64 v5.0.
  • 320 x 466 MHz Alpha EV6.7, 1U high. Total of 320 GB memory
    and 19.2 TB of spinning internal storage.
  • Roughly equivalent to 10 x GS320; performance around 355
    Gflops
  • 32 x Cabletron 100 Mb/s switches, 16 x RS232
    terminal servers, 2 x 155 Mb/s ATM fibre uplinks
    back to the v5.0 cluster
  • Two network subnets (multicast and backbone)
  • 640 x 100 Mb/s Fast Ethernet ports
  • 1,920 UTP cable crimps, 8 cabinets, 100 kW of
    power
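The farm totals above can be broken down per node with simple arithmetic. The Python sketch below uses only the figures from this slide; the names are ours, 1 TB is taken as 1000 GB, and the power figure is an even split of the stated 100 kW across nodes, which ignores switches and other equipment in the cabinets.

```python
RACKS = 8
NODES_PER_RACK = 40
TOTAL_MEMORY_GB = 320
TOTAL_DISK_GB = 19_200   # 19.2 TB, assuming 1 TB = 1000 GB
ETHERNET_PORTS = 640
POWER_W = 100_000        # 100 kW for all 8 cabinets

def per_node_summary() -> dict:
    """Divide the slide's farm-wide totals evenly across all nodes."""
    nodes = RACKS * NODES_PER_RACK
    return {
        "nodes": nodes,
        "memory_gb": TOTAL_MEMORY_GB / nodes,
        "disk_gb": TOTAL_DISK_GB / nodes,
        "ethernet_ports": ETHERNET_PORTS / nodes,
        "power_w": POWER_W / nodes,  # upper bound: includes non-node equipment
    }

for key, value in per_node_summary().items():
    print(f"{key}: {value}")
```

This gives 320 nodes with 1 GB of memory, 60 GB of disk and two Fast Ethernet ports each. Two ports per node would fit the two network subnets listed, although the slide does not say so explicitly.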

15
(No Transcript)
16
Network Overview
  • Highly available NFS (Tru64 CAA)
  • Fast I/O (ATM > switched full duplex Ethernet)
  • Socket data transfer (via rdist, rcp, and MySQL
    DBI sockets)
  • Segmented network architecture via two ELANs

[Diagram labels: Farm; Sanger; 8-node ES40 M/C F/C cluster; uplinks; network prefixes 172.25 and 172.27]
17
Compute systems architecture
[Diagram labels: Firewall; DMZ; External Services; 400-node S.A. farm; Pathogen; ATM; Trace server]
18
Enterprise Clustering
  • LSF is still key for job scheduling and batch
    operations
  • LSF offers greater granularity of operation
    and functionality than Tru64 scheduling
  • Schedule individual nodes, cluster-wide and
    cross-cluster scheduling
  • With LSF we still have the capability to
    use many of the 750 compute nodes as
    a single Sanger Compute Engine

MODULAR SUPERCOMPUTING
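The slides do not describe how LSF decides where jobs run, so the following is only a toy illustration of the general load-sharing idea, not LSF's actual policy or API. It is written in Python, and the function name `dispatch`, the job names and the node names are all hypothetical. Each job goes to whichever node currently carries the least work:

```python
import heapq

def dispatch(jobs, nodes):
    """Greedily place each (job_name, cost) on the least-loaded node.

    Toy load sharing for illustration; not how LSF is implemented.
    """
    # Min-heap of (current_load, node_name): least-loaded node is on top.
    heap = [(0, name) for name in nodes]
    heapq.heapify(heap)
    placement = {}
    for job_name, cost in jobs:
        load, node = heapq.heappop(heap)
        placement[job_name] = node
        heapq.heappush(heap, (load + cost, node))
    return placement

# Hypothetical jobs with relative costs, and two hypothetical farm nodes.
jobs = [("job_a", 4), ("job_b", 2), ("job_c", 1), ("job_d", 3)]
nodes = ["farm001", "farm002"]
print(dispatch(jobs, nodes))
```

In this example the expensive first job occupies `farm001` while the three cheaper jobs all land on `farm002`. Treating many nodes as one pool in this way is the sense in which a scheduler lets a farm act as a single compute engine.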
19
Projects
  • All are computationally expensive
  • We need to continue scaling up
    and deal with the physical limitations
  • Will involve thousands of CPUs
    - Large numbers of PC farm nodes
    - High-end, large memory SMP configurations
  • Will require >100 Terabytes of storage

20
Immediate Future
The Sanger Centre Genome Campus
Implement Storage Area Network: install multi-TB
storage to enable disk mirroring and controller/controller
snapshots.
ATM
LSF Clustering
Institute to Institute Clustering: closer
collaboration between Sanger, EBI and other
organisations brings the need for site-wide
shared clusters.
21
Longer Term Future
  • Wide Area Clusters: needed for large scale collaborations.
  • GRID Technology - Global Distributed Computing
  • International cluster collaborations with
    other scientific institutes
  • Sanger is keen to keep abreast of this emerging
    technology

GLOBAL COMPUTE ENGINES
22
Questions?