Title: Peter Wegner, DESY, CHEP03, 25 March 2003

1 LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
CHEP03, 25 March 2003, Category 6: Lattice Gauge Computing
- Motivation
- PC Cluster @ DESY
- Benchmark architectures
- DESY Cluster
- E7500 systems
- Infiniband blade servers
- Itanium2
- Benchmark programs, Results
- Future
- Conclusions, Acknowledgements
2 PC Cluster Motivation: LQCD, Stream Benchmark, Myrinet Bandwidth
- 32/64-bit Dirac kernel, LQCD (Martin Lüscher, DESY/CERN, 2000)
  - P4, 1.4 GHz, 256 MB Rambus, using SSE1(2) instructions incl. cache prefetch
  - Time per lattice point (see the worked numbers after this slide):
    - 0.926 µs (1503 MFLOPS, 32-bit arithmetic)
    - 1.709 µs (814 MFLOPS, 64-bit arithmetic)
- Stream benchmark, memory bandwidth (triad sketch after this slide):
  - P4 (1.4 GHz, PC800 Rambus): 1.4-2.0 GB/s
  - PIII (800 MHz, PC133 SDRAM): 400 MB/s
  - PIII (400 MHz, PC133 SDRAM): 340 MB/s
- Myrinet, external bandwidth:
  - 2.0 + 2.0 Gb/s optical connection, bidirectional, 240 MB/s sustained
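Combining the quoted times and rates, the benchmark evidently counts about 1392 floating-point operations per lattice point:

$$ \frac{1392\ \mathrm{flops}}{0.926\ \mu\mathrm{s}} \approx 1503\ \mathrm{MFLOPS}, \qquad \frac{1392\ \mathrm{flops}}{1.709\ \mu\mathrm{s}} \approx 814\ \mathrm{MFLOPS}. $$

The memory-bandwidth figures come from the Stream benchmark. The following C fragment is only a minimal triad-style sketch of such a measurement (array size, repetition count and timing are our own choices, not the actual Stream code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N      (8 * 1024 * 1024)   /* 8M doubles per array, far larger than cache */
#define NTIMES 10

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double best = 1.0e30, scalar = 3.0;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    for (int k = 0; k < NTIMES; k++) {
        double t0 = wtime();
        for (long i = 0; i < N; i++)        /* "triad" kernel: a = b + s*c */
            a[i] = b[i] + scalar * c[i];
        double t = wtime() - t0;
        if (t < best) best = t;
    }

    /* three arrays of N doubles cross the memory bus per triad sweep */
    printf("triad bandwidth: %.0f MB/s\n",
           3.0 * N * sizeof(double) / best / 1.0e6);
    return 0;
}
```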
3 Benchmark Architectures - DESY Cluster Hardware
- Nodes: Supermicro P4DC6 mainboard, 2 x XEON P4, 1.7 (2.0) GHz, 256 (512) kByte cache, 1 GByte (4 x 256 MByte) RDRAM, IBM 18.3 GB DDYS-T18350 U160 3.5" SCSI disk, Myrinet 2000 M3F-PCI64B-2 interface
- Network: Fast Ethernet switch Gigaline 2024M, 48 x 100BaseTX ports, GIGAline 2024 1000BaseSX-SC; Myrinet fast interconnect, M3-E32 5-slot chassis, 2 x M3-SW16 line cards
- Installation: Zeuthen 16 dual-CPU nodes, Hamburg 32 dual-CPU nodes
4 Benchmark Architectures - DESY Cluster, i860 chipset problem
[Block diagram of the Intel i860 chipset (Intel Hub Architecture): MCH with 400 MHz system bus and dual-channel RDRAM (3.2 GB/s, up to 4 GB, via MRH memory repeater hubs); two P64H hubs (800 MB/s hub links, >1 GB/s aggregate) serving the 64-bit/66 MHz PCI slots; ICH2 (266 MB/s hub link) serving 32-bit/33 MHz PCI (133 MB/s), ATA-100 dual IDE channels, 4 USB ports, LAN interface / 10/100 Ethernet, 6-channel audio]
- Measured PCI bandwidth: bus_read (send) 227 MBytes/s, bus_write (recv) 315 MBytes/s, of max. 528 MBytes/s
- External Myrinet bandwidth: 160 MBytes/s, 90 MBytes/s bidirectional
5 Benchmark Architectures - Intel E7500 chipset
6 Benchmark Architectures - E7500 system
- Par-Tec (Wuppertal): 4 nodes, Intel Xeon CPU 2.60 GHz, 2 GB ECC PC1600 (DDR-200) SDRAM, Super Micro P4DPE-G2 mainboard, Intel E7500 chipset, PCI 64/66, 2 x Intel PRO/1000 network connection, Myrinet M3F-PCI64B-2
7 Benchmark Architectures
- Leibniz-Rechenzentrum Munich (single-CPU tests): Pentium 4, 3.06 GHz, with ECC Rambus; Pentium 4, 2.53 GHz, with Rambus 1066 memory; Xeon, 2.4 GHz, with PC2100 DDR SDRAM memory (probably FSB 400)
- Megware: 8 nodes, dual XEON 2.4 GHz, E7500 chipset, 2 GB DDR ECC memory, Myrinet2000, Supermicro P4DMS-6GM
- University of Erlangen: Itanium2, 900 MHz, 1.5 MB cache, 10 GB RAM, zx1 chipset (HP)
8 Benchmark Architectures - Infiniband
- Megware: 10 Mellanox ServerBlades, single Xeon 2.2 GHz, 2 GB DDR RAM, ServerWorks GC-LE chipset, InfiniBand 4X HCA
- Software: RedHat 7.3, kernel 2.4.18-3, MPICH-1.2.2.2 and OSU patch for VIA/InfiniBand 0.6.5, Mellanox firmware 1.14, Mellanox SDK (VAPI) 0.0.4, compiler GCC 2.96
9 Dirac Operator Benchmark (SSE), 16x16³ lattice, single P4/XEON CPU
[Chart: MFLOPS of the Dirac operator and linear algebra kernels]
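The SSE kernels measured here (Lüscher's benchmark code uses hand-written inline-assembly macros with explicit cache prefetch) work on packed single-precision data. The fragment below is only an illustrative intrinsics sketch of that style for a simple linear-algebra kernel; the function name and loop are our own, not part of the benchmark:

```c
#include <xmmintrin.h>   /* SSE1 intrinsics */

/* Illustrative r[i] += a * s[i] on 16-byte aligned float arrays
   (n a multiple of 4), with software prefetch of data a few cache
   lines ahead -- the same SIMD + prefetch idea as the SSE Dirac and
   linear-algebra kernels, in greatly simplified form. */
void saxpy_sse(float *r, const float *s, float a, int n)
{
    __m128 va = _mm_set1_ps(a);

    for (int i = 0; i < n; i += 4) {
        _mm_prefetch((const char *)(s + i + 64), _MM_HINT_T0);
        _mm_prefetch((const char *)(r + i + 64), _MM_HINT_T0);

        __m128 vs = _mm_load_ps(s + i);          /* aligned loads */
        __m128 vr = _mm_load_ps(r + i);
        vr = _mm_add_ps(vr, _mm_mul_ps(va, vs)); /* 4 mul + 4 add per step */
        _mm_store_ps(r + i, vr);
    }
}
```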
10 Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³, XEON CPUs, single CPU performance
[Chart; Myrinet2000 bandwidth: i860 chipset 90 MB/s, E7500 chipset 190 MB/s]
11 Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³, XEON CPUs, single CPU performance, 2, 4 nodes
Performance comparisons (MFLOPS; values in parentheses: percent of the SSE2 result):
- Single node: SSE2 446, non-SSE 328 (74%)
- Dual node: SSE2 330, non-SSE 283 (85%)
Parastation3 software, non-blocking I/O support (MFLOPS, non-SSE; in parentheses: percent of the blocking result; see the MPI sketch after this slide):
- blocking 308, non-blocking I/O 367 (119%)
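The gain from non-blocking I/O comes from overlapping the boundary (halo) exchange with the computation on the interior of the local lattice. The following is only a minimal sketch of that communication pattern under our own naming (pack_boundary, apply_dirac_interior and apply_dirac_boundary are hypothetical helpers, not the benchmark's actual routines):

```c
#include <mpi.h>

/* Hypothetical helpers standing in for the real kernel code. */
void pack_boundary(double *sendbuf, const double *in);
void apply_dirac_interior(double *out, const double *in);
void apply_dirac_boundary(double *out, const double *in, const double *halo);

/* 1-dim domain decomposition: post non-blocking transfers of the
   boundary spinors, update the interior while the network is busy,
   then complete the surface sites once the halo has arrived. */
void dirac_apply(double *out, const double *in,
                 double *sendbuf, double *halo, int halo_len,
                 int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    pack_boundary(sendbuf, in);

    MPI_Irecv(halo,               halo_len, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(halo + halo_len,    halo_len, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(sendbuf,            halo_len, MPI_DOUBLE, right, 0, comm, &req[2]);
    MPI_Isend(sendbuf + halo_len, halo_len, MPI_DOUBLE, left,  1, comm, &req[3]);

    apply_dirac_interior(out, in);          /* overlaps with the transfers */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    apply_dirac_boundary(out, in, halo);    /* needs the received halo */
}
```

Whether the transfers actually proceed during the interior update depends on the MPI implementation's ability to make asynchronous progress, which is the point of the ParaStation comparison above.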
12 Maximal Efficiency of External I/O (MFLOPS without / with communication, maximal bandwidth in MB/s, efficiency)
- Myrinet (i860), SSE: 579 / 307, 90 MB/s, 0.53
- Myrinet/GM (E7500), SSE: 631 / 432, 190 MB/s, 0.68
- Myrinet/ParaStation (E7500), SSE: 675 / 446, 181 MB/s, 0.66
- Myrinet/ParaStation (E7500), non-blocking, non-SSE: 406 / 368, bandwidth hidden, 0.91
- Gigabit Ethernet, non-SSE: 390 / 228, 100 MB/s, 0.58
- Infiniband, non-SSE: 370 / 297, 210 MB/s, 0.80
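The efficiency values are consistent, row by row, with the simple ratio of the sustained performance with communication to the performance without communication; for the first row, for example:

$$ \varepsilon = \frac{F_{\mathrm{with\ comm.}}}{F_{\mathrm{w/o\ comm.}}}, \qquad \varepsilon_{\mathrm{Myrinet\,(i860)}} = \frac{307}{579} \approx 0.53 . $$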
13 Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³, XEON/Itanium2 CPUs, single CPU performance, 4 nodes
4 single-CPU nodes, Gbit Ethernet, non-blocking switch, full duplex:
- P4 (2.4 GHz, 0.5 MB cache): SSE 285 MFLOPS (88.92 MB/s), non-SSE 228 MFLOPS (75.87 MB/s)
- Itanium2 (900 MHz, 1.5 MB cache): non-SSE 197 MFLOPS (63.13 MB/s)
14 Infiniband interconnect
[Block diagram of an Infiniband fabric: hosts with HCAs connected through a switch to TCAs and I/O controllers]
- Up to 10 GB/s bidirectional
- Switch: simple, low-cost, multistage network
- Link: high-speed serial; 1x, 4x and 12x widths
- HCA (Host Channel Adapter): protocol engine, moves data via messages queued in memory
- TCA (Target Channel Adapter): interface to I/O controllers (SCSI, FC-AL, GbE, ...)
- Chips: IBM, Mellanox; PCI-X cards: Fujitsu, Mellanox, JNI, IBM
- http://www.infinibandta.org
15 Infiniband interconnect
16 Parallel (2-dim) Dirac Operator Benchmark (Ginsparg-Wilson fermions), XEON CPUs, single CPU performance, 4 nodes
Infiniband vs. Myrinet performance, non-SSE (MFLOPS, 32-bit / 64-bit):
- 8x8³ lattice, 2x2 processor grid: XEON 1.7 GHz, Myrinet, i860 chipset: 370 / 281; XEON 2.2 GHz, Infiniband, E7500 chipset: 697 / 477
- 16x16³ lattice, 2x4 processor grid: XEON 1.7 GHz, Myrinet, i860 chipset: 338 / 299; XEON 2.2 GHz, Infiniband, E7500 chipset: 609 / 480
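For the 2-dim parallelization, each process owns a patch of the lattice in a 2x2 or 2x4 processor grid and exchanges halos in two directions. A minimal sketch of setting up such a grid with MPI (our own illustration; the benchmark's actual decomposition code is not shown in the slides):

```c
#include <mpi.h>
#include <stdio.h>

/* Build the 2x4 processor grid used for the 16x16^3 lattice as a
   2-dim periodic Cartesian communicator and look up the neighbours
   needed for the halo exchange.  Run with exactly 8 MPI processes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {2, 4}, periods[2] = {1, 1};   /* periodic lattice */
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    int rank, coords[2], left, right, down, up;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &left, &right);   /* neighbours in dimension 0 */
    MPI_Cart_shift(grid, 1, 1, &down, &up);      /* neighbours in dimension 1 */

    printf("rank %d at (%d,%d): dim0 neighbours %d/%d, dim1 neighbours %d/%d\n",
           rank, coords[0], coords[1], left, right, down, up);

    MPI_Finalize();
    return 0;
}
```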
17 Future - Low Power Cluster Architectures?
18 Future Cluster Architectures - Blade Servers?
- NEXCOM low-voltage blade server: 200 low-voltage Intel XEON CPUs (1.6 GHz, 30 W) in a 42U rack, integrated Gbit Ethernet network
- Mellanox Infiniband blade server: single-XEON blades connected via a 10 Gbit (4X) Infiniband network (MEGWARE, NCSA, Ohio State University)
19 Conclusions
- PC CPUs deliver an extremely high sustained LQCD performance using SSE/SSE2 (SIMD + prefetch), provided the local lattice is sufficiently large.
- Bottlenecks are the memory throughput and the external I/O bandwidth; both components are improving (chipsets: i860 → E7500 → E7505 → ...; FSB: 400 MHz → 533 MHz → 667 MHz → ...; external I/O: Gbit Ethernet → Myrinet2000 → QsNet → Infiniband → ...).
- Non-blocking MPI communication can improve the performance, given an MPI implementation with adequate support (e.g. ParaStation).
- 32-bit architectures (e.g. IA32) have a much better price/performance ratio than 64-bit architectures (Itanium, Opteron?).
- Large, dense low-voltage blade clusters could play an important role in LQCD computing (low-voltage XEON, CENTRINO?, ...).
20 Acknowledgements
We would like to thank Martin Lüscher (CERN) for the benchmark codes and the fruitful discussions about PCs for LQCD, and Isabel Campos Plasencia (Leibniz-Rechenzentrum Munich), Gerhard Wellein (Uni Erlangen), Holger Müller (Megware), Norbert Eicker (Par-Tec), and Chris Eddington (Mellanox) for the opportunity to run the benchmarks on their clusters.