High Performance Linux Clusters - PowerPoint PPT Presentation

About This Presentation
Title:

High Performance Linux Clusters

Description:

High Performance Linux Clusters Guru Session, Usenix, Boston June 30, 2004 Greg Bruno, SDSC Overview of San Diego Supercomputer Center Founded in 1985 Non-military ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: High Performance Linux Clusters


1
High Performance Linux Clusters
  • Guru Session, Usenix, Boston
  • June 30, 2004
  • Greg Bruno, SDSC

2
Overview of San Diego Supercomputer Center
  • Founded in 1985
  • Non-military access to supercomputers
  • Over 400 employees
  • Mission: Innovate, develop, and deploy technology
    to advance science
  • Recognized as an international leader in:
  • Grid and Cluster Computing
  • Data Management
  • High Performance Computing
  • Networking
  • Visualization
  • Primarily funded by NSF

3
My Background
  • 1984 - 1998 NCR - Helped to build the world's
    largest database computers
  • Saw the transition from proprietary parallel
    systems to clusters
  • 1999 - 2000 HPVM - Helped build Windows clusters
  • 2000 - Now Rocks - Helping to build Linux-based
    clusters

4
Why Clusters?
5
Moore's Law
6
Cluster Pioneers
  • In the mid-1990s, the Network of Workstations
    project (UC Berkeley) and the Beowulf Project
    (NASA) asked the question:

Can You Build a High Performance Machine
From Commodity Components?
7
The Answer is Yes
Source Dave Pierce, SIO
8
The Answer is Yes
9
Types of Clusters
  • High Availability
  • Generally small (fewer than 8 nodes)
  • Visualization
  • High Performance
  • Computational tools for scientific computing
  • Large database machines

10
High Availability Cluster
  • Composed of redundant components and multiple
    communication paths

11
Visualization Cluster
  • Each node in the cluster drives a display

12
High Performance Cluster
  • Constructed with many compute nodes and often a
    high-performance interconnect

13
Cluster Hardware Components
14
Cluster Processors
  • Pentium/Athlon
  • Opteron
  • Itanium

15
Processors x86
  • Most prevalent processor used in commodity
    clustering
  • Fastest integer processor on the planet
  • 3.4 GHz Pentium 4, SPEC2000int 1705

16
Processors x86
  • Capable floating point performance
  • #5 machine on Top500 list built with Pentium 4
    processors

17
Processors Opteron
  • Newest 64-bit processor
  • Excellent integer performance
  • SPEC2000int 1655
  • Good floating point performance
  • SPEC2000fp 1691
  • #10 machine on Top500

18
Processors Itanium
  • First systems released June 2001
  • Decent integer performance
  • SPEC2000int 1404
  • Fastest floating-point performance on the planet
  • SPEC2000fp 2161
  • Impressive Linpack efficiency: 86%

19
Processors Summary
Processor      GHz   SPECint   SPECfp   Price ($)
Pentium 4 EE   3.4   1705      1561     791
Athlon FX-51   2.2   1447      1423     728
Opteron 150    2.4   1655      1644     615
Itanium 2      1.5   1404      2161     4798
Itanium 2      1.3   1162      1891     1700
Power4         1.7   1158      1776     ????
20
But What Do You Really Build?
  • Itanium: Dell PowerEdge 3250
  • Two 1.4 GHz CPUs (1.5 MB cache)
  • 11.2 Gflops peak
  • 2 GB memory
  • 36 GB disk
  • $7,700
  • Two 1.5 GHz CPUs (6 MB cache) make the system
    cost $17,700
  • 1.4 GHz vs. 1.5 GHz
  • 7% slower
  • 1.5 GHz system costs 130% more

21
Opteron
  • IBM eServer 325
  • Two 2.0 GHz Opteron 246
  • 8 Gflops peak
  • 2 GB memory
  • 36 GB disk
  • $4,539
  • Two 2.4 GHz CPUs: $5,691
  • 2.0 GHz vs. 2.4 GHz
  • 17% slower
  • 2.4 GHz system costs 25% more

22
Pentium 4 Xeon
  • HP DL140
  • Two 3.06 GHz CPUs
  • 12 Gflops peak
  • 2 GB memory
  • 80 GB disk
  • $2,815
  • Two 3.2 GHz CPUs: $3,368
  • 3.06 GHz vs. 3.2 GHz
  • 4% slower
  • 3.2 GHz system costs 20% more
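
The percentages on the three preceding hardware slides can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes "slower" is measured relative to the faster clock and that the price gap is measured relative to the cheaper configuration; the function names are illustrative and not from the talk.

```python
def pct_slower(slow_ghz, fast_ghz):
    """Clock deficit of the slower CPU as a percentage of the faster one."""
    return 100.0 * (fast_ghz - slow_ghz) / fast_ghz


def pct_more_expensive(cheap_price, costly_price):
    """Extra cost of the faster system as a percentage of the cheaper one."""
    return 100.0 * (costly_price - cheap_price) / cheap_price


# (label, slow GHz, fast GHz, price with slow CPUs, price with fast CPUs)
SYSTEMS = [
    ("Itanium 2", 1.4, 1.5, 7700, 17700),
    ("Opteron", 2.0, 2.4, 4539, 5691),
    ("Pentium 4 Xeon", 3.06, 3.2, 2815, 3368),
]

for name, slow, fast, cheap, costly in SYSTEMS:
    print(f"{name}: {pct_slower(slow, fast):.0f}% slower, "
          f"faster option costs {pct_more_expensive(cheap, costly):.0f}% more")
```

Under these assumptions the rounded results match the figures quoted on the slides (7/130, 17/25, 4/20).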

23
If You Had $100,000 To Spend On A Compute Farm
System                # of Boxes   Peak GFlops   Aggregate SPEC2000fp   Aggregate SPEC2000int
Pentium 4 3 GHz       35           420           89810                  104370
Opteron 246 2.0 GHz   22           176           56892                  57948
Itanium 1.4 GHz       12           132           46608                  24528
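
The box counts and peak figures in this table appear to follow from the per-system prices and peak Gflops quoted on the three preceding slides. The sketch below assumes the whole budget goes to compute boxes (no network, racks, or power) and that partial boxes are not bought; `farm` is an illustrative name.

```python
BUDGET = 100_000  # dollars, from the slide title

# name: (price per box in dollars, peak Gflops per box), from the earlier slides
SYSTEMS = {
    "Pentium 4 Xeon": (2815, 12.0),
    "Opteron 246": (4539, 8.0),
    "Itanium 2 1.4 GHz": (7700, 11.2),
}


def farm(budget, price, gflops_per_box):
    """Return (number of whole boxes, aggregate peak Gflops) for a budget."""
    boxes = budget // price
    return boxes, boxes * gflops_per_box


for name, (price, gflops) in SYSTEMS.items():
    boxes, peak = farm(BUDGET, price, gflops)
    print(f"{name}: {boxes} boxes, {peak:.1f} peak Gflops")
```

The Pentium 4 and Opteron rows come out as 35 boxes / 420 Gflops and 22 boxes / 176 Gflops. For Itanium this gives 12 boxes but a slightly higher peak than the table's 132, which may reflect rounding on the slide.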
24
What People Are Buying
  • Gartner study
  • Servers shipped in 1Q04
  • Itanium 6,281
  • Opteron 31,184
  • Opteron shipped 5x more servers than Itanium

25
What Are People Buying
  • Gartner study
  • Servers shipped in 1Q04
  • Itanium 6,281
  • Opteron 31,184
  • Pentium 1,000,000
  • Pentium shipped 30x more than Opteron

26
Interconnects
27
Interconnects
  • Ethernet
  • Most prevalent on clusters
  • Low-latency interconnects
  • Myrinet
  • Infiniband
  • Quadrics
  • Ammasso

28
Why Low-Latency Interconnects?
  • Performance
  • Lower latency
  • Higher bandwidth
  • Accomplished through OS-bypass

29
How Low Latency Interconnects Work
  • Decrease latency for a packet by reducing the
    number of memory copies per packet

30
Bisection Bandwidth
  • Definition: If you split the system in half, what
    is the maximum amount of data that can pass
    between the two halves?
  • Assuming 1 Gb/s links
  • Bisection bandwidth: 1 Gb/s

31
Bisection Bandwidth
  • Assuming 1 Gb/s links
  • Bisection bandwidth: 2 Gb/s

32
Bisection Bandwidth
  • Definition: A network has full bisection bandwidth
    when its topology can support N/2 simultaneous
    communication streams.
  • That is, the nodes on one half of the network can
    communicate with the nodes on the other half at
    full speed.
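
As a rough illustration of these definitions, the sketch below sums the capacity of the links that cross a given split and checks it against the N/2-streams criterion. The four-node, two-switch topology is a made-up example, since the slide figures are not in the transcript. The function also evaluates only the one cut it is given rather than searching all possible bisections.

```python
def cut_bandwidth_gbps(links, half_a, link_gbps=1.0):
    """Total capacity of links with exactly one endpoint inside half_a."""
    half_a = set(half_a)
    return sum(link_gbps for u, v in links if (u in half_a) != (v in half_a))


def has_full_bisection(num_nodes, cut_gbps, link_gbps=1.0):
    """True if the cut can carry N/2 simultaneous full-speed streams."""
    return cut_gbps >= (num_nodes // 2) * link_gbps


# Hypothetical topology: nodes n0,n1 on switch s0 and n2,n3 on switch s1.
node_links = [("n0", "s0"), ("n1", "s0"), ("n2", "s1"), ("n3", "s1")]
one_uplink = node_links + [("s0", "s1")]
two_uplinks = node_links + [("s0", "s1"), ("s0", "s1")]
left_half = {"n0", "n1", "s0"}

print(cut_bandwidth_gbps(one_uplink, left_half))   # one inter-switch link
print(cut_bandwidth_gbps(two_uplinks, left_half))  # two inter-switch links
```

With one inter-switch link the cut carries 1 Gb/s; with two it carries 2 Gb/s, which is enough for the two streams that four nodes can generate across the split.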

33
Large Networks
  • When you run out of ports on a single switch,
    you must add another network stage
  • In the example above, assuming 1 Gb/s links, the
    uplinks from stage 1 switches to stage 2 switches
    must carry at least 6 Gb/s

34
Large Networks
  • With low-port-count switches, large systems need
    many switches in order to maintain full
    bisection bandwidth
  • 128-node system with 32-port switches requires 12
    switches and 256 total cables
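
The 12-switch, 256-cable figure is consistent with a two-stage network in which each first-stage switch dedicates half its ports to nodes and half to uplinks. The sketch below assumes that layout and assumes the nodes fill the first-stage switches evenly; it is not taken from the talk, and the function name is illustrative.

```python
import math


def two_stage_full_bisection(num_nodes, ports_per_switch):
    """Return (total switches, total cables) for a two-stage layout.

    Assumption: each first-stage switch uses half of its ports for nodes
    and the other half for uplinks to second-stage switches.
    """
    down_ports = ports_per_switch // 2
    up_ports = ports_per_switch - down_ports
    stage1 = math.ceil(num_nodes / down_ports)
    uplinks = stage1 * up_ports
    stage2 = math.ceil(uplinks / ports_per_switch)
    cables = num_nodes + uplinks  # node cables plus inter-switch cables
    return stage1 + stage2, cables


print(two_stage_full_bisection(128, 32))
```

For 128 nodes and 32-port switches this yields 8 first-stage and 4 second-stage switches, with 128 node cables and 128 uplink cables.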

35
Myrinet
  • Long-time interconnect vendor
  • Delivering products since 1995
  • Delivers a single 128-port full bisection
    bandwidth switch
  • MPI Performance
  • Latency: 6.7 us
  • Bandwidth: 245 MB/s
  • Cost/port (based on 64-port configuration): $1000
  • Switch + NIC + cable
  • http://www.myri.com/myrinet/product_list.html

36
Myrinet
  • Recently announced 256-port switch
  • Available August 2004

37
Myrinet
  • #5 system on Top500 list
  • System sustains 64% of peak performance
  • But smaller Myrinet-connected systems hit 70-75%
    of peak

38
Quadrics
  • QsNetII E-series
  • Released at the end of May 2004
  • Delivers 128-port standalone switches
  • MPI Performance
  • Latency: 3 us
  • Bandwidth: 900 MB/s
  • Cost/port (based on 64-port configuration): $1800
  • Switch + NIC + cable
  • http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/A3EE4AED738B6E2480256DD30057B227

39
Quadrics
  • #2 on Top500 list
  • Sustains 86% of peak
  • Other Quadrics-connected systems on Top500 list
    sustain 70-75% of peak

40
Infiniband
  • Newest cluster interconnect
  • Currently shipping 32-port switches and 192-port
    switches
  • MPI Performance
  • Latency: 6.8 us
  • Bandwidth: 840 MB/s
  • Estimated cost/port (based on 64-port
    configuration): $1700 - $3000
  • Switch + NIC + cable
  • http://www.techonline.com/community/related_content/24364

41
Ethernet
  • Latency: 80 us
  • Bandwidth: 100 MB/s
  • Top500 list has Ethernet-based systems sustaining
    between 35% and 59% of peak

42
Ethernet
  • What we did with 128 nodes and a $13,000 Ethernet
    network
  • $101/port
  • $28/port with our latest Gigabit Ethernet switch
  • Sustained 48% of peak
  • With Myrinet, would have sustained 1 Tflop
  • At a cost of $130,000
  • Roughly 1/3 the cost of the system
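
The per-port and Myrinet figures on this slide follow from simple division and multiplication. The sketch below uses the 128-node count and network price from this slide and the $1000/port Myrinet estimate from the earlier Myrinet slide; the helper names are illustrative.

```python
def cost_per_port(network_cost, ports):
    """Average network cost per connected node, in dollars."""
    return network_cost / ports


def network_cost(ports, per_port):
    """Total network cost for a given per-port price, in dollars."""
    return ports * per_port


NODES = 128
print(f"Ethernet: ${cost_per_port(13_000, NODES):.2f} per port")
print(f"Myrinet:  ${network_cost(NODES, 1_000):,} total")
```

The Ethernet network works out to just over $101 per port, and a Myrinet network at $1000 per port comes to $128,000, in line with the slide's round figure of $130,000.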

43
Rockstar Topology
  • 24-port switches
  • Not a symmetric network
  • Best case - 4:1 bisection bandwidth
  • Worst case - 8:1
  • Average - 5.3:1

44
Low-Latency Ethernet
  • Bring OS-bypass to Ethernet
  • Projected performance
  • Latency: less than 20 us
  • Bandwidth: 100 MB/s
  • Potentially could merge management and
    high-performance networks
  • Vendor: Ammasso

45
Application Benefits
46
Storage
47
Local Storage
  • Exported to compute nodes via NFS

48
Network Attached Storage
  • A NAS box is an embedded NFS appliance

49
Storage Area Network
  • Provides a disk block interface over a network
    (Fibre Channel or Ethernet)
  • Moves the shared disks out of the servers and
    onto the network
  • Still requires a central service to coordinate
    file system operations

50
Parallel Virtual File System
  • PVFS version 1 has no fault tolerance
  • PVFS version 2 (in beta) has fault tolerance
    mechanisms

51
Lustre
  • Open Source
  • Object-based storage
  • Files become objects, not blocks

52
Cluster Software
53
Cluster Software Stack
  • Linux Kernel/Environment
  • RedHat, SuSE, Debian, etc.

54
Cluster Software Stack
  • HPC Device Drivers
  • Interconnect driver (e.g., Myrinet, Infiniband,
    Quadrics)
  • Storage drivers (e.g., PVFS)

55
Cluster Software Stack
  • Job Scheduling and Launching
  • Sun Grid Engine (SGE)
  • Portable Batch System (PBS)
  • Load Sharing Facility (LSF)

56
Cluster Software Stack
  • Cluster Software Management
  • E.g., Rocks, OSCAR, Scyld

57
Cluster Software Stack
  • Cluster State Management and Monitoring
  • Monitoring: Ganglia, Clumon, Nagios, Tripwire,
    Big Brother
  • Management: Node naming and configuration (e.g.,
    DHCP)

58
Cluster Software Stack
  • Message Passing and Communication Layer
  • E.g., Sockets, MPICH, PVM

59
Cluster Software Stack
  • Parallel Code / Web Farm / Grid / Computer Lab
  • Locally developed code

60
Cluster Software Stack
  • Questions
  • How to deploy this stack across every machine in
    the cluster?
  • How to keep this stack consistent across every
    machine?

61
Software Deployment
  • Known methods
  • Manual Approach
  • Add-on method
  • Bring up a frontend, then add cluster packages
  • OpenMosix, OSCAR, Warewulf
  • Integrated
  • Cluster packages are added at frontend
    installation time
  • Rocks, Scyld

62
Rocks
63
Primary Goal
  • Make clusters easy
  • Target audience: Scientists who want a capable
    computational resource in their own lab

64
Philosophy
  • It is not fun to care for and feed a system
  • All compute nodes are 100% automatically
    installed
  • Critical for scaling
  • Essential to track software updates
  • RHEL 3.0 has issued 232 source RPM updates since
    Oct 21
  • Roughly 1 updated SRPM per day
  • Run on heterogeneous standard high volume
    components
  • Use the components that offer the best
    price/performance!
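
The "roughly 1 updated SRPM per day" estimate on this slide can be checked with date arithmetic. The sketch assumes "Oct 21" means October 21, 2003 (the slide gives no year) and counts up to the date of the talk, June 30, 2004.

```python
from datetime import date

UPDATES = 232                    # source RPM updates, from the slide
first_day = date(2003, 10, 21)   # assumption: the slide's "Oct 21" is in 2003
talk_day = date(2004, 6, 30)     # date of the talk

days_elapsed = (talk_day - first_day).days
updates_per_day = UPDATES / days_elapsed

print(f"{days_elapsed} days, {updates_per_day:.2f} updated SRPMs per day")
```

Under that assumption the rate comes out a little below one update per day, which agrees with the slide's rounding.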

65
More Philosophy
  • Use installation as common mechanism to manage a
    cluster
  • Everyone installs a system
  • On initial bring up
  • When replacing a dead node
  • Adding new nodes
  • Rocks also uses installation to keep software
    consistent
  • If you catch yourself wondering if a node's
    software is up-to-date, reinstall!
  • In 10 minutes, all doubt is erased
  • Rocks doesn't attempt to incrementally update
    software

66
Rocks Cluster Distribution
  • Fully-automated cluster-aware distribution
  • Cluster on a CD set
  • Software Packages
  • Full Red Hat Linux distribution
  • Red Hat Enterprise Linux 3.0 rebuilt from source
  • De-facto standard cluster packages
  • Rocks packages
  • Rocks community packages
  • System Configuration
  • Configure the services in packages

67
Rocks Hardware Architecture
68
Minimum Components
Local Hard Drive
Power
Ethernet
OS on all nodes (not SSI)
X86, Opteron, IA64 server
69
Optional Components
  • Myrinet high-performance network
  • Infiniband support in Nov 2004
  • Network-addressable power distribution unit
  • keyboard/video/mouse network not required
  • Non-commodity
  • How do you manage your management network?
  • Crash carts have a lower TCO

70
Storage
  • NFS
  • The frontend exports all home directories
  • Parallel Virtual File System version 1
  • System nodes can be targeted as Compute + PVFS or
    strictly PVFS nodes

71
Minimum Hardware Requirements
  • Frontend
  • 2 ethernet connections
  • 18 GB disk drive
  • 512 MB memory
  • Compute
  • 1 ethernet connection
  • 18 GB disk drive
  • 512 MB memory
  • Power
  • Ethernet switches

72
Cluster Software Stack
73
Rocks Rolls
  • Rolls are containers for software packages and
    the configuration scripts for the packages
  • Rolls dissect a monolithic distribution

74
Rolls User-Customizable Frontends
  • Rolls are added by the Red Hat installer
  • Software is added and configured at initial
    installation time
  • Benefit: apply security patches during initial
    installation
  • This method is more secure than the add-on method

75
Red Hat Installer Modified to Accept Rolls
76
Approach
  • Install a frontend
  • Insert Rocks Base CD
  • Insert Roll CDs (optional components)
  • Answer 7 screens of configuration data
  • Drink coffee (takes about 30 minutes to install)
  • Install compute nodes
  1. Log in to frontend
  2. Execute insert-ethers
  3. Boot compute node with Rocks Base CD (or PXE)
  4. Insert-ethers discovers nodes
  5. Go to step 3
  • Add user accounts
  • Start computing
  • Optional Rolls
  • Condor
  • Grid (based on NMI R4)
  • Intel (compilers)
  • Java
  • SCE (developed in Thailand)
  • Sun Grid Engine
  • PBS (developed in Norway)
  • Area51 (security monitoring tools)

77
Login to Frontend
  • Create ssh public/private key
  • Ask for passphrase
  • These keys are used to log in securely to
    compute nodes without having to enter a password
    each time you log in to a compute node
  • Execute insert-ethers
  • This utility listens for new compute nodes

78
Insert-ethers
  • Used to integrate appliances into the cluster

79
Boot a Compute Node in Installation Mode
  • Instruct the node to network boot
  • Network boot forces the compute node to run the
    PXE protocol (Preboot eXecution Environment)
  • Also can use the Rocks Base CD
  • If no CD and no PXE-enabled NIC, can use a boot
    floppy built from Etherboot
    (http://www.rom-o-matic.net)

80
Insert-ethers Discovers the Node
81
Insert-ethers Status
82
eKV: Ethernet Keyboard and Video
  • Monitor your compute node installation over the
    ethernet network
  • No KVM required!
  • Execute ssh compute-0-0

83
Node Info Stored In A MySQL Database
  • If you know SQL, you can execute some powerful
    commands
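
As an illustration of the kind of query this slide has in mind, the sketch below builds a tiny stand-in node table and lists the compute nodes in one rack. The table layout (`nodes` with `name`, `rack`, `slot`, `appliance` columns) is hypothetical and is not the actual Rocks schema. It also uses Python's built-in SQLite in place of MySQL so that it runs on its own.

```python
import sqlite3

# Hypothetical stand-in for the cluster database; not the real Rocks schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (name TEXT, rack INTEGER, slot INTEGER, appliance TEXT)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        ("frontend-0-0", 0, 0, "frontend"),
        ("compute-0-0", 0, 0, "compute"),
        ("compute-0-1", 0, 1, "compute"),
        ("compute-1-0", 1, 0, "compute"),
    ],
)


def compute_nodes_in_rack(connection, rack):
    """Return the names of compute nodes in one rack, ordered by slot."""
    rows = connection.execute(
        "SELECT name FROM nodes "
        "WHERE appliance = 'compute' AND rack = ? ORDER BY slot",
        (rack,),
    )
    return [row[0] for row in rows]


print(compute_nodes_in_rack(conn, 0))
```

A list like this could then drive a loop that runs a command on each selected node over ssh.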

84
Cluster Database
85
Kickstart
  • Red Hat's Kickstart
  • Monolithic flat ASCII file
  • No macro language
  • Requires forking based on site information and
    node type.
  • Rocks XML Kickstart
  • Decompose a kickstart file into nodes and a graph
  • Graph specifies OO framework
  • Each node specifies a service and its
    configuration
  • Macros and SQL for site configuration
  • Driven from web cgi script

86
Sample Node File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>
  <description>
  Enable SSH
  </description>
  <package>&ssh;</package>
  <package>&ssh;-clients</package>
  <package>&ssh;-server</package>
  <package>&ssh;-askpass</package>
<post>
<file name="/etc/ssh/ssh_config">
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
</file>
chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh
</post>
</kickstart>
87
Sample Graph File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">
<graph>
  <description>
  Default Graph for NPACI Rocks.
  </description>
  <edge from="base" to="scripting"/>
  <edge from="base" to="ssh"/>
  <edge from="base" to="ssl"/>
  <edge from="base" to="lilo" arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
  <edge from="node" to="base" weight="80"/>
  <edge from="node" to="accounting"/>
  <edge from="slave-node" to="node"/>
  <edge from="slave-node" to="nis-client"/>
  <edge from="slave-node" to="autofs-client"/>
  <edge from="slave-node" to="dhcp-client"/>
  <edge from="slave-node" to="snmp-server"/>
  <edge from="slave-node" to="node-certs"/>
  <edge from="compute" to="slave-node"/>
  <edge from="compute" to="usher-server"/>
  <edge from="master-node" to="node"/>
  <edge from="master-node" to="x11"/>
  <edge from="master-node" to="usher-client"/>
</graph>
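
One way to picture how a graph file like this selects the node files for an appliance is a traversal that starts at the appliance name and follows every edge whose `arch` attribute is absent or matches the target architecture. The sketch below is an illustration only, not the Rocks implementation. It uses a trimmed copy of the graph, ignores the `weight` attribute, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample graph (XML declaration and DOCTYPE omitted).
GRAPH_XML = """
<graph>
  <edge from="base" to="ssh"/>
  <edge from="base" to="lilo" arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
  <edge from="node" to="base"/>
  <edge from="slave-node" to="node"/>
  <edge from="slave-node" to="dhcp-client"/>
  <edge from="compute" to="slave-node"/>
</graph>
"""


def reachable_nodes(graph_xml, start, arch):
    """Depth-first walk from start, skipping edges tagged for another arch."""
    root = ET.fromstring(graph_xml.strip())
    edges = {}
    for edge in root.iter("edge"):
        edge_arch = edge.get("arch")
        if edge_arch is None or edge_arch == arch:
            edges.setdefault(edge.get("from"), []).append(edge.get("to"))

    visited, stack = [], [start]
    while stack:
        name = stack.pop()
        if name in visited:
            continue
        visited.append(name)
        # Reverse so children are visited in the order they appear in the file.
        stack.extend(reversed(edges.get(name, [])))
    return visited


print(reachable_nodes(GRAPH_XML, "compute", "i386"))
print(reachable_nodes(GRAPH_XML, "compute", "ia64"))
```

For a compute appliance, the i386 walk picks up `lilo` and the ia64 walk picks up `elilo`, while the rest of the node list is identical.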
88
Kickstart framework
89
Appliances
  • Laptop / Desktop
  • Appliances
  • Final classes
  • Node types
  • Desktop IsA
  • standalone
  • Laptop IsA
  • standalone
  • pcmcia
  • Code re-use is good

90
Architecture Differences
  • Conditional inheritance
  • Annotate edges with target architectures
  • if i386
  • Base IsA grub
  • if ia64
  • Base IsA elilo
  • One Graph, Many CPUs
  • Heterogeneity is easy
  • Not for SSI or Imaging

91
Installation Timeline
92
Status
93
But Are Rocks Clusters High Performance Systems?
  • Rocks Clusters on June 2004 Top500 list

94
(No Transcript)
95
What We Proposed To Sun
  • Let's build a Top500 machine
  • from the ground up
  • in 2 hours
  • in the Sun booth at Supercomputing '03

96
Rockstar Cluster (SC03)
  • Demonstrate
  • We are now in the age of personal
    supercomputing
  • Highlight abilities of
  • Rocks
  • SGE
  • Top500 list
  • #201 - November 2003
  • #413 - June 2004
  • Hardware
  • 129 Intel Xeon servers
  • 1 Frontend Node
  • 128 Compute Nodes
  • Gigabit Ethernet
  • $13,000 (US)
  • 9 24-port switches
  • 8 4-gigabit trunk uplinks