Title: Technologies for the Future: CLUSTERS
1. Technologies for the Future: CLUSTERS
- Anne C. Elster
- Dept. of Computer and Information Science (IDI)
- Norwegian Univ. of Science and Technology (NTNU)
- Trondheim, Norway
NOTUR 2003
2. Clusters (Networks of PCs/Workstations)
- Are they suitable for HPC?
- Advantage: cost-effective hardware, since it uses COTS (Commercial Off-The-Shelf) parts
- BUT: typically much slower processor interconnects than traditional HPC systems
- What about usability?
NTNU IDI's 40-node AMD 1.46 GHz cluster: 2 GB RAM, 40 GB disk, Fast Ethernet
3. Cluster Technologies: NOTUR Emerging Technology project
Collaboration between NTNU and Univ. of Tromsø
- Goal: analyze the suitability of cluster technologies for HPC by looking at some of the most interesting NOTUR applications
- The results will provide a foundation for decisions regarding future HPC programs
4. Main Collaborators include
- Anne C. Elster (IDI, NTNU), project leader
- Otto Anshus and Tore Larsen (CS, U of Tromsø)
- Tor Johansen and staff (CC, U of Tromsø)
- Torbjørn Hallgren (IDI, NTNU)
- Einar Rønquist (IMF, NTNU)
- Master's and Ph.D. students and postdocs at NTNU and Univ. of Tromsø
5. General Issues to Consider
- Why a cluster vs. a powerful desktop vs. large SMPs?
- What are the total costs associated with clusters (hardware, software, support, usability)?
- 32-bit vs. 64-bit architectures
6. Cluster Project ACTIVITIES
- A.1 Profiling and Tuning Selected Applications
- A.1.a/b Physics and Chemistry Codes (Elster and students, Dept. of Computer Science, NTNU)
- A.1.2a Profiling and User-Analysis of Amber, Dalton and Gaussian (Tor Johansen and staff, Comp. Center, U of Tromsø)
- A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/student, Dept. of Comp. Sci., U of Tromsø)
7. Cluster Project ACTIVITIES, continued
- A.2 Execution Monitoring (Anshus, Tore Larsen and students, CS, U of T)
- A.3 Visualization servers, etc. (Hallgren, Elster and students, CS, NTNU)
- A.4 Impact of future numerical algorithms (Rønquist and student, Dept. of Mathematics, NTNU)
- A.5 Interface with NOTUR ET Grid Project (Elster, Harald Simonsen and colleagues, staff and students associated with the NOTUR ET Cluster and Grid projects)
8. A.1.a/b Physics and Chemistry Codes (Elster and students, Dept. of CS, NTNU)
Lessons learned so far: Paul Sacks work on a physics application (report available on the Web)
- FORTRAN problems
- Different FORTRAN implementations have non-standard add-ons (e.g. FORTRAN 90)
- This leads to great difficulty in porting code to a different platform with a different Fortran compiler (e.g. by a different vendor)
9. A.1.a/b Physics and Chemistry Codes, continued
- The performance of individual programs can vary across different machines
- Åsmund Østvold wrote a project report on porting PROTOMOL from an SMP with MPI one-sided communication primitives (MPI put/get) to a cluster (available on the WWW)
- He also did an MS study with SCALI on various MPI broadcast algorithms and benchmarking
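The slide mentions a study of MPI broadcast algorithms. As a rough illustration of why the choice of algorithm matters on a cluster (this is not a reconstruction of the SCALI study), the sketch below counts communication rounds for two textbook strategies: a flat broadcast, where the root sends to every other rank in turn, and a binomial-tree broadcast, where every rank that already holds the data forwards it. The function names are hypothetical, and real MPI libraries choose among more variants than these two.

```python
def flat_broadcast_rounds(num_ranks):
    """Root sends to each other rank one at a time: P - 1 sequential sends."""
    return max(num_ranks - 1, 0)


def binomial_broadcast_schedule(num_ranks):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    In the round with stride `step`, every rank r < step already holds the
    data and sends it to rank r + step (if that rank exists).
    """
    rounds = []
    step = 1
    while step < num_ranks:
        sends = [(r, r + step) for r in range(step) if r + step < num_ranks]
        rounds.append(sends)
        step *= 2
    return rounds


# Compare the number of sequential communication steps for 32 ranks.
schedule = binomial_broadcast_schedule(32)
print("flat rounds:", flat_broadcast_rounds(32))
print("binomial rounds:", len(schedule))
```

With a high-latency interconnect, each round costs at least one message latency, so the logarithmic number of rounds in the tree variant is one reason broadcast algorithms are worth benchmarking.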
10. A.1.a/b Physics and Chemistry Codes, continued (2)
- Ongoing work with Snorre Boasson and Jan Christian Meyer on porting a PIC code from Pthreads (SMP primitives) to MPI
- A preliminary report will be available later this week
- "Recent Trends in Cluster Computing", presented at ParCo 2003 by Elster et al., includes hardware trends and a survey of libraries and performance tools
11. A.1.2a Profiling and User-Analysis of Amber, Dalton and Gaussian (Tor Johansen and staff, Comp. Center, U of Tromsø)
- Coordination work
- Travel to NOTUR 2003
- Porting and testing of Amber and Scali SW
12. A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/students, CS, U of Tromsø)
- Performance measurements done on DALTON
- A Report for the NOTUR Project Emerging Technologies Cluster
- Daniel Stødle, Otto J. Anshus, John Markus Bjørndalen
- Survey of optimizing techniques for parallel programs running on computer clusters
- Espen S. Johnsen, Otto J. Anshus, John Markus Bjørndalen, Lars Ailo Bongo (September 29, 2003)
13. A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/student, IFI, U of Tromsø), CONTINUED
- RESULTS
- Dalton scales pretty well: 25x speedup on 32 nodes
- NOTE: Only without caching of temporary data. With caching, only 3-5x speedup on 32 nodes!
- Even though the 8-way cluster had no local disk (only a network file system), the sequential Dalton code was significantly faster.
- This indicates that network bandwidth may not be a problem if caching is used in the parallel code
- Communication pattern: master-slave "bag-of-tasks" oriented programs with little communication and synchronization, and generally good utilization of the slave nodes.
- The master does relatively little work and is blocked most of the time
- Finally, checked whether the master node could be a bottleneck, but could not detect differences in execution time when the master was put on a slow node vs. a fast node.
- NOTE: Only tested up to 32 nodes; using a larger number of nodes may limit performance by overloading the master node.
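The "bag-of-tasks" pattern described above can be sketched in a few lines. This is a minimal illustration only, assuming threads on one machine as stand-ins for cluster nodes; it is not how Dalton itself is implemented, and `run_bag_of_tasks` and `work_fn` are hypothetical names. It shows why the master does little work: it only fills the bag and then waits.

```python
import queue
import threading


def run_bag_of_tasks(tasks, num_workers, work_fn):
    """Master fills a shared bag; workers pull tasks until they see a sentinel."""
    bag = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = bag.get()
            if item is None:              # sentinel: no more work for this worker
                return
            task_id, payload = item
            value = work_fn(payload)      # the expensive part runs on the worker
            with lock:
                results[task_id] = value

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for task_id, payload in enumerate(tasks):   # master only hands out work
        bag.put((task_id, payload))
    for _ in workers:                     # one sentinel per worker
        bag.put(None)
    for w in workers:                     # master is blocked here most of the time
        w.join()
    return [results[i] for i in range(len(results))]


print(run_bag_of_tasks(range(8), num_workers=4, work_fn=lambda x: x * x))
```

Because idle workers simply take the next task from the bag, the load balances itself without extra synchronization, which matches the good worker utilization reported on the slide. The single shared bag is also the point that could become a bottleneck at larger node counts.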
14. A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/student, IFI, U of Tromsø), CONTINUED 2
- Thanks to
- Kenneth Ruud, Chemistry, UiT
- Roy Dragseth, CC UiT, for support on the Itanium at U of Tromsø.
15. A.2 Execution Monitoring (Anshus, Tore Larsen and students, CS, U of T)
- Survey of execution monitoring tools for computer clusters
- Espen S. Johnsen, Otto J. Anshus, John Markus Bjørndalen, Lars Ailo Bongo, Sept. 2003
- Performance Monitoring
- Lars Ailo Bongo, Otto J. Anshus, John Markus Bjørndalen
16. A.3 Visualization servers, etc. (Hallgren, Elster and students, CS, NTNU)
- Ongoing work with Torbjørn Vik
- Preliminary report on a survey of how clusters are currently used in visualization
- Two types of cluster usage:
- Off-line (non-real-time rendering). Often called "rendering farms", with lots of nodes that each work on one frame of a larger animation.
- Typically used in the film industry and other areas where interactivity and/or real-time rendering is not needed.
- All larger 3D modelling programs such as Lightwave, 3DStudio and Maya have functionality for this.
- On-line (real-time). Most interesting from a technical viewpoint...
17. A.3 Visualization servers, etc., continued
- Clusters are used in interactive visualization software to
- increase performance,
- enable larger data sets,
- avoid limitations of local hardware.
- Most visualization clusters work in principle by having the user sit at a client machine that has little capacity of its own. The cluster takes care of all computation and sends only the finished images to the client. The client machine also receives input from the user and forwards it to the cluster. Data sets for such visualization are often very large and, depending on the situation, both polygon-based and voxel-based rendering are used.
- The main problem in making clusters usable for interactive visualization programs is delay caused by the network. This is addressed by reducing the time spent transferring images between cluster and client, either by
- reducing the amount of data (compression methods), or
- increasing network performance. Or both.
- Parallelism within the cluster itself is based on independence between different data. There can be independence between different parts of the same data set, or between different frames in a 4D data set.
- Load balancing often becomes a problem in such settings and is an important research area.
- Which load-balancing method is used is usually highly context-dependent.
- Cluster software for visualization still lacking??
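To make the network-delay point concrete, here is a back-of-the-envelope sketch of how long one finished image takes to reach the client. The frame size (1024x768, 3 bytes per pixel) and the 10:1 compression ratio are assumptions chosen for illustration; the 100 Mbit/s figure is the nominal rate of Fast Ethernet, the interconnect of the IDI cluster mentioned earlier. Latency and protocol overhead are ignored, so real numbers would be worse.

```python
def frame_transfer_seconds(width, height, bytes_per_pixel, link_mbit_per_s,
                           compression_ratio=1.0):
    """Time to push one image over the link, ignoring latency and overhead."""
    raw_bytes = width * height * bytes_per_pixel
    sent_bits = raw_bytes * 8 / compression_ratio
    return sent_bits / (link_mbit_per_s * 1_000_000)


# Assumed example: uncompressed vs. 10:1 compressed frames over 100 Mbit/s.
raw = frame_transfer_seconds(1024, 768, 3, 100)
compressed = frame_transfer_seconds(1024, 768, 3, 100, compression_ratio=10)
print(f"uncompressed: {raw:.3f} s per frame, about {1 / raw:.1f} frames/s")
print(f"compressed:   {compressed:.3f} s per frame, about {1 / compressed:.1f} frames/s")
```

Under these assumptions an uncompressed stream manages only about five frames per second, which is why the slide lists compression and faster networks as the two remedies.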
18. A.4 Impact of future numerical algorithms (Rønquist and student, Dept. of Mathematics, NTNU)
- Rønquist's student Staff (now at Simulasenteret) wrote a report based on his summer job
- May add in experiences from Elster's group, fall 2003
19. A.5 Interface with NOTUR ET Grid Project (Elster, Harald Simonsen and colleagues, staff and students associated with the NOTUR ET Cluster and Grid projects)
- Test node established at NTNU
- Andreas Botnen (USIT) and
- Robin Holtet (IDI, now ITEA)
- May use IDI's 30-40-node cluster in the test grid
- Meetings
- Between Elster's and Simonsen's groups
- Robin Holtet and Elster's student Thorvald Natvig to the Linköping meeting this month.
- Collaborations re. National GRID and EGEE
- Student from NTNU and UiO at CERN
20. Main cluster issues
- Global operations have a more severe impact on cluster performance than on traditional supercomputers, since communication between processors takes relatively more of the total execution time
- SCALABILITY!!
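The scalability concern can be illustrated with a toy cost model. It assumes perfectly parallel compute work followed by one global operation per step, whose cost grows as log2(P) stages. The function name and all numeric values are illustrative assumptions, not measurements from the project.

```python
import math


def speedup(num_procs, compute_s, stage_cost_s):
    """Speedup for perfectly parallel work plus one global operation.

    The global operation is modelled as log2(P) stages, each costing
    stage_cost_s seconds (an assumed, simplified model).
    """
    if num_procs == 1:
        return 1.0
    parallel_time = compute_s / num_procs + stage_cost_s * math.log2(num_procs)
    return compute_s / parallel_time


# Assumed per-stage costs: 10 microseconds (fast) vs. 1 millisecond (slow).
for label, stage_cost in [("fast interconnect", 1e-5), ("slow interconnect", 1e-3)]:
    values = [round(speedup(p, 1.0, stage_cost), 1) for p in (8, 32, 128)]
    print(label, values)
```

In this model the compute term shrinks with P while the global-operation term grows, so the slower the interconnect, the earlier the speedup curve flattens.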
21. Lessons learned
- Clusters generally have cheap hardware, but may cause increased hidden costs regarding:
- More incompatible compilers, especially Fortran 90 (also C)
- Some applications are non-trivial to port from a shared-memory paradigm to a distributed-memory paradigm
- Some applications require high-bandwidth interconnects, which drive up costs (e.g. SGI Altix)
- Power and cooling costs (ref. Brian Vinter)
- Stability, recovery
- Overall costs and scalability should be further studied
22. The Ideal Cluster -- Hardware
- High-bandwidth network
- Low-latency network
- Low operating system overhead (TCP causes slow start)
- Great floating-point performance
- (64-bit processors or more?)
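The remark about TCP slow start can be made concrete with an idealized count of round trips. The sketch assumes the congestion window starts at `initial_window` segments and doubles every round trip, and it ignores the slow-start threshold, losses, delayed ACKs and the receiver window. The function name and default value are assumptions for illustration.

```python
def slow_start_round_trips(num_segments, initial_window=1):
    """Round trips needed to send num_segments if the window doubles each round trip."""
    sent = 0
    window = initial_window
    round_trips = 0
    while sent < num_segments:
        sent += window        # send a full window during this round trip
        window *= 2           # idealized slow start: the window doubles
        round_trips += 1
    return round_trips


# A 64-segment message on a fresh connection vs. a single-segment message.
print(slow_start_round_trips(64), slow_start_round_trips(1))
```

Under this model a medium-sized message on a fresh connection pays several network latencies before the link is fully used, which is one reason low operating-system and protocol overhead appear on the wish list.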
23. The Ideal Cluster -- Software
- Compiler that is
- Portable
- Optimizing
- Do extra work to save communication
- Self-tuning / load-balanced
- Automatic selection of best algorithm
- One-sided communication support?
- Optimized middleware
24. For more information
- A dozen or more reports associated with this project will be made available on the web at
- http://www.idi.ntnu.no/elster
- Email: elster_at_idi.ntnu.no