Title: Peter Wegner, DESY, CHEP03, 25 March 2003

1 LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
CHEP03, 25 March 2003, Category 6: Lattice Gauge Computing
- Motivation
- PC Cluster @ DESY
- Benchmark architectures
- DESY Cluster
- E7500 systems
- Infiniband blade servers
- Itanium2
- Benchmark programs, Results
- Future
- Conclusions, Acknowledgements
2 PC Cluster Motivation: LQCD, Stream Benchmark, Myrinet Bandwidth
- 32/64-bit Dirac kernel, LQCD (Martin Lüscher, DESY/CERN, 2000)
  - P4, 1.4 GHz, 256 MB Rambus, using SSE1(2) instructions incl. cache prefetch
  - Time per lattice point (see the worked numbers after this slide):
    - 0.926 µs (1503 MFLOPS, 32-bit arithmetic)
    - 1.709 µs (814 MFLOPS, 64-bit arithmetic)
- Stream benchmark, memory bandwidth (triad sketch after this slide):
  - P4 (1.4 GHz, PC800 Rambus): 1.4-2.0 GB/s
  - PIII (800 MHz, PC133 SDRAM): 400 MB/s
  - PIII (400 MHz, PC133 SDRAM): 340 MB/s
- Myrinet, external bandwidth:
  - 2.0 + 2.0 Gb/s optical connection, bidirectional, 240 MB/s sustained
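Combining the quoted times and rates, the benchmark evidently counts about 1392 floating-point operations per lattice point:

$$ \frac{1392\ \mathrm{flops}}{0.926\ \mu\mathrm{s}} \approx 1503\ \mathrm{MFLOPS}, \qquad \frac{1392\ \mathrm{flops}}{1.709\ \mu\mathrm{s}} \approx 814\ \mathrm{MFLOPS}. $$

The memory-bandwidth figures come from the Stream benchmark. The following C fragment is only a minimal triad-style sketch of such a measurement (array size, repetition count and timing are our own choices, not the actual Stream code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N      (8 * 1024 * 1024)   /* 8M doubles per array, far larger than cache */
#define NTIMES 10

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double best = 1.0e30, scalar = 3.0;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    for (int k = 0; k < NTIMES; k++) {
        double t0 = wtime();
        for (long i = 0; i < N; i++)        /* "triad" kernel: a = b + s*c */
            a[i] = b[i] + scalar * c[i];
        double t = wtime() - t0;
        if (t < best) best = t;
    }

    /* three arrays of N doubles cross the memory bus per triad sweep */
    printf("triad bandwidth: %.0f MB/s\n",
           3.0 * N * sizeof(double) / best / 1.0e6);
    return 0;
}
```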
3 Benchmark Architectures - DESY Cluster Hardware
- Nodes: Supermicro P4DC6 mainboard, 2 x XEON P4, 1.7 (2.0) GHz, 256 (512) kByte cache, 1 GByte (4 x 256 MByte) RDRAM, IBM 18.3 GB DDYS-T18350 U160 3.5" SCSI disk, Myrinet 2000 M3F-PCI64B-2 interface
- Network: Fast Ethernet switch Gigaline 2024M, 48 x 100BaseTX ports, GIGAline 2024 1000BaseSX-SC; Myrinet fast interconnect, M3-E32 5-slot chassis, 2 x M3-SW16 line cards
- Installation: Zeuthen 16 dual-CPU nodes, Hamburg 32 dual-CPU nodes
4 Benchmark Architectures - DESY Cluster, i860 chipset problem
[Block diagram of the Intel i860 chipset (Intel Hub Architecture): MCH with 400 MHz system bus and dual-channel RDRAM (3.2 GB/s, up to 4 GB, via MRH memory repeater hubs); two P64H hubs (800 MB/s hub links, >1 GB/s aggregate) serving the 64-bit/66 MHz PCI slots; ICH2 (266 MB/s hub link) serving 32-bit/33 MHz PCI (133 MB/s), ATA-100 dual IDE channels, 4 USB ports, LAN interface / 10/100 Ethernet, 6-channel audio]
- Measured PCI bandwidth: bus_read (send) 227 MBytes/s, bus_write (recv) 315 MBytes/s, of max. 528 MBytes/s
- External Myrinet bandwidth: 160 MBytes/s, 90 MBytes/s bidirectional
5 Benchmark Architectures - Intel E7500 chipset
6 Benchmark Architectures - E7500 system
- Par-Tec (Wuppertal): 4 nodes, Intel Xeon CPU 2.60 GHz, 2 GB ECC PC1600 (DDR-200) SDRAM, Super Micro P4DPE-G2 mainboard, Intel E7500 chipset, PCI 64/66, 2 x Intel PRO/1000 network connection, Myrinet M3F-PCI64B-2
7 Benchmark Architectures
- Leibniz-Rechenzentrum Munich (single-CPU tests): Pentium 4, 3.06 GHz, with ECC Rambus; Pentium 4, 2.53 GHz, with Rambus 1066 memory; Xeon, 2.4 GHz, with PC2100 DDR SDRAM memory (probably FSB 400)
- Megware: 8 nodes, dual XEON 2.4 GHz, E7500 chipset, 2 GB DDR ECC memory, Myrinet2000, Supermicro P4DMS-6GM
- University of Erlangen: Itanium2, 900 MHz, 1.5 MB cache, 10 GB RAM, zx1 chipset (HP)
8 Benchmark Architectures - Infiniband
- Megware: 10 Mellanox ServerBlades, single Xeon 2.2 GHz, 2 GB DDR RAM, ServerWorks GC-LE chipset, InfiniBand 4X HCA
- Software: RedHat 7.3, kernel 2.4.18-3, MPICH-1.2.2.2 and OSU patch for VIA/InfiniBand 0.6.5, Mellanox firmware 1.14, Mellanox SDK (VAPI) 0.0.4, compiler GCC 2.96
9 Dirac Operator Benchmark (SSE), 16x16³ lattice, single P4/XEON CPU
[Chart: MFLOPS of the Dirac operator and linear algebra kernels]
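The SSE kernels measured here (Lüscher's benchmark code uses hand-written inline-assembly macros with explicit cache prefetch) work on packed single-precision data. The fragment below is only an illustrative intrinsics sketch of that style for a simple linear-algebra kernel; the function name and loop are our own, not part of the benchmark:

```c
#include <xmmintrin.h>   /* SSE1 intrinsics */

/* Illustrative r[i] += a * s[i] on 16-byte aligned float arrays
   (n a multiple of 4), with software prefetch of data a few cache
   lines ahead -- the same SIMD + prefetch idea as the SSE Dirac and
   linear-algebra kernels, in greatly simplified form. */
void saxpy_sse(float *r, const float *s, float a, int n)
{
    __m128 va = _mm_set1_ps(a);

    for (int i = 0; i < n; i += 4) {
        _mm_prefetch((const char *)(s + i + 64), _MM_HINT_T0);
        _mm_prefetch((const char *)(r + i + 64), _MM_HINT_T0);

        __m128 vs = _mm_load_ps(s + i);          /* aligned loads */
        __m128 vr = _mm_load_ps(r + i);
        vr = _mm_add_ps(vr, _mm_mul_ps(va, vs)); /* 4 mul + 4 add per step */
        _mm_store_ps(r + i, vr);
    }
}
```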
10 Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³, XEON CPUs, single CPU performance
[Chart; Myrinet2000 bandwidth: i860 chipset 90 MB/s, E7500 chipset 190 MB/s]
11 Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³, XEON CPUs, single CPU performance, 2, 4 nodes
Performance comparisons (MFLOPS; values in parentheses: percent of the SSE2 result):
- Single node: SSE2 446, non-SSE 328 (74%)
- Dual node: SSE2 330, non-SSE 283 (85%)
Parastation3 software, non-blocking I/O support (MFLOPS, non-SSE; in parentheses: percent of the blocking result; see the MPI sketch after this slide):
- blocking 308, non-blocking I/O 367 (119%)
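The gain from non-blocking I/O comes from overlapping the boundary (halo) exchange with the computation on the interior of the local lattice. The following is only a minimal sketch of that communication pattern under our own naming (pack_boundary, apply_dirac_interior and apply_dirac_boundary are hypothetical helpers, not the benchmark's actual routines):

```c
#include <mpi.h>

/* Hypothetical helpers standing in for the real kernel code. */
void pack_boundary(double *sendbuf, const double *in);
void apply_dirac_interior(double *out, const double *in);
void apply_dirac_boundary(double *out, const double *in, const double *halo);

/* 1-dim domain decomposition: post non-blocking transfers of the
   boundary spinors, update the interior while the network is busy,
   then complete the surface sites once the halo has arrived. */
void dirac_apply(double *out, const double *in,
                 double *sendbuf, double *halo, int halo_len,
                 int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    pack_boundary(sendbuf, in);

    MPI_Irecv(halo,               halo_len, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(halo + halo_len,    halo_len, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(sendbuf,            halo_len, MPI_DOUBLE, right, 0, comm, &req[2]);
    MPI_Isend(sendbuf + halo_len, halo_len, MPI_DOUBLE, left,  1, comm, &req[3]);

    apply_dirac_interior(out, in);          /* overlaps with the transfers */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    apply_dirac_boundary(out, in, halo);    /* needs the received halo */
}
```

Whether the transfers actually proceed during the interior update depends on the MPI implementation's ability to make asynchronous progress, which is the point of the ParaStation comparison above.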
12 Maximal Efficiency of External I/O (MFLOPS without / with communication, maximal bandwidth in MB/s, efficiency)
- Myrinet (i860), SSE: 579 / 307, 90 MB/s, 0.53
- Myrinet/GM (E7500), SSE: 631 / 432, 190 MB/s, 0.68
- Myrinet/ParaStation (E7500), SSE: 675 / 446, 181 MB/s, 0.66
- Myrinet/ParaStation (E7500), non-blocking, non-SSE: 406 / 368, bandwidth hidden, 0.91
- Gigabit Ethernet, non-SSE: 390 / 228, 100 MB/s, 0.58
- Infiniband, non-SSE: 370 / 297, 210 MB/s, 0.80
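The efficiency values are consistent, row by row, with the simple ratio of the sustained performance with communication to the performance without communication; for the first row, for example:

$$ \varepsilon = \frac{F_{\mathrm{with\ comm.}}}{F_{\mathrm{w/o\ comm.}}}, \qquad \varepsilon_{\mathrm{Myrinet\,(i860)}} = \frac{307}{579} \approx 0.53 . $$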
13 Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³, XEON/Itanium2 CPUs, single CPU performance, 4 nodes
4 single-CPU nodes, Gbit Ethernet, non-blocking switch, full duplex:
- P4 (2.4 GHz, 0.5 MB cache): SSE 285 MFLOPS (88.92 MB/s), non-SSE 228 MFLOPS (75.87 MB/s)
- Itanium2 (900 MHz, 1.5 MB cache): non-SSE 197 MFLOPS (63.13 MB/s)
14 Infiniband interconnect
[Block diagram of an Infiniband fabric: hosts with HCAs connected through a switch to TCAs and I/O controllers]
- Up to 10 GB/s bidirectional
- Switch: simple, low-cost, multistage network
- Link: high-speed serial; 1x, 4x and 12x widths
- HCA (Host Channel Adapter): protocol engine, moves data via messages queued in memory
- TCA (Target Channel Adapter): interface to I/O controllers (SCSI, FC-AL, GbE, ...)
- Chips: IBM, Mellanox; PCI-X cards: Fujitsu, Mellanox, JNI, IBM
- http://www.infinibandta.org
15 Infiniband interconnect
16 Parallel (2-dim) Dirac Operator Benchmark (Ginsparg-Wilson fermions), XEON CPUs, single CPU performance, 4 nodes
Infiniband vs. Myrinet performance, non-SSE (MFLOPS, 32-bit / 64-bit):
- 8x8³ lattice, 2x2 processor grid: XEON 1.7 GHz, Myrinet, i860 chipset: 370 / 281; XEON 2.2 GHz, Infiniband, E7500 chipset: 697 / 477
- 16x16³ lattice, 2x4 processor grid: XEON 1.7 GHz, Myrinet, i860 chipset: 338 / 299; XEON 2.2 GHz, Infiniband, E7500 chipset: 609 / 480
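For the 2-dim parallelization, each process owns a patch of the lattice in a 2x2 or 2x4 processor grid and exchanges halos in two directions. A minimal sketch of setting up such a grid with MPI (our own illustration; the benchmark's actual decomposition code is not shown in the slides):

```c
#include <mpi.h>
#include <stdio.h>

/* Build the 2x4 processor grid used for the 16x16^3 lattice as a
   2-dim periodic Cartesian communicator and look up the neighbours
   needed for the halo exchange.  Run with exactly 8 MPI processes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {2, 4}, periods[2] = {1, 1};   /* periodic lattice */
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    int rank, coords[2], left, right, down, up;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &left, &right);   /* neighbours in dimension 0 */
    MPI_Cart_shift(grid, 1, 1, &down, &up);      /* neighbours in dimension 1 */

    printf("rank %d at (%d,%d): dim0 neighbours %d/%d, dim1 neighbours %d/%d\n",
           rank, coords[0], coords[1], left, right, down, up);

    MPI_Finalize();
    return 0;
}
```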
17 Future - Low Power Cluster Architectures?
18 Future Cluster Architectures - Blade Servers?
- NEXCOM low-voltage blade server: 200 low-voltage Intel XEON CPUs (1.6 GHz, 30 W) in a 42U rack, integrated Gbit Ethernet network
- Mellanox Infiniband blade server: single-XEON blades connected via a 10 Gbit (4X) Infiniband network (MEGWARE, NCSA, Ohio State University)
19 Conclusions
- PC CPUs deliver an extremely high sustained LQCD performance using SSE/SSE2 (SIMD + prefetch), provided the local lattice is sufficiently large.
- Bottlenecks are the memory throughput and the external I/O bandwidth; both components are improving (chipsets: i860 → E7500 → E7505 → ...; FSB: 400 MHz → 533 MHz → 667 MHz → ...; external I/O: Gbit Ethernet → Myrinet2000 → QsNet → Infiniband → ...).
- Non-blocking MPI communication can improve the performance, given an MPI implementation with adequate support (e.g. ParaStation).
- 32-bit architectures (e.g. IA32) have a much better price/performance ratio than 64-bit architectures (Itanium, Opteron?).
- Large, dense low-voltage blade clusters could play an important role in LQCD computing (low-voltage XEON, CENTRINO?, ...).
20 Acknowledgements
We would like to thank Martin Lüscher (CERN) for the benchmark codes and the fruitful discussions about PCs for LQCD, and Isabel Campos Plasencia (Leibniz-Rechenzentrum Munich), Gerhard Wellein (Uni Erlangen), Holger Müller (Megware), Norbert Eicker (Par-Tec), and Chris Eddington (Mellanox) for the opportunity to run the benchmarks on their clusters.