BG/L architecture and high performance QCD

Transcript and Presenter's Notes

1
BG/L architecture and high performance QCD
  • P. Vranas
  • IBM Watson Research Lab

2
BlueGene/L

3
(No Transcript)
4
A Three Dimensional Torus
5
BlueGene/L packaging hierarchy
  • System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
  • Rack: 32 node cards, cabled 8x8x16, 2.8/5.6 TF/s, 512 GB
  • Node card: 32 chips (4x4x2), 16 compute + 0-2 I/O cards, 90/180 GF/s, 16 GB
  • Compute card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
  • Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
6
(No Transcript)
7
BlueGene/L Compute ASIC
8
Dual Node Compute Card
  • 206 mm (8.125") wide, 54 mm (2.125") high, 14 layers, single sided, ground referenced
  • Heatsinks designed for 15 W
  • 9 x 512 Mb DRAM, 16B interface, no external termination
  • Metral 4000 high-speed differential connector (180 pins)
9
32-way (4x4x2) node card
  • Midplane (450 pins): torus, tree, barrier, clock, Ethernet service port
  • 16 compute cards
  • Ethernet-JTAG FPGA
  • Custom dual-voltage dc-dc converters, I2C control
  • 2 optional I/O cards
  • I/O Gb Ethernet connectors through the tailstock
  • Latching and retention
10
(No Transcript)
11
360 TF peak, footprint 8.5 m x 17 m
12
64 racks at LLNL
13
BlueGene/L Compute Rack Power
ASIC 14.4 W + DRAM 5 W per node
MF/W (peak): 250
MF/W (sustained, Linpack): 172
14
BG/L is the fastest computer ever built.
15
(No Transcript)
16
BlueGene/L Link Eye Measurements at 1.6 Gb/s
One signal path includes the module, card wire (86 cm), and card edge connectors; the other includes the module, card wire (2 x 10 cm), cable connectors, and an 8 m cable.
17
Torus top level
(Block diagram: each CPU feeds a processor injection interface and drains a processor reception interface; these connect to the network senders and receivers, which drive the net wires in and out of the node.)
18
Torus network hardware packets
  • The hardware packets come in sizes of S = 32, 64, ..., 256 bytes, laid out as sketched below:
  • Hardware header (routing etc.): 8 bytes
  • Payload: S - 8 bytes
  • Packet tail (CRC etc.): 4 bytes
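A minimal C illustration of this layout. The slide does not spell out the header or tail fields, so they are kept as raw bytes, and all names here are placeholders rather than the actual BG/L system-software definitions.

  #include <stdint.h>

  /* Hardware packet sizes: S = 32, 64, ..., 256 bytes. */
  enum { PKT_MIN = 32, PKT_STEP = 32, PKT_MAX = 256, HDR_BYTES = 8 };

  struct torus_packet {
      uint8_t header[HDR_BYTES];             /* hardware header: routing etc. */
      uint8_t payload[PKT_MAX - HDR_BYTES];  /* up to S - 8 bytes of payload  */
      /* The 4-byte packet tail (CRC etc.) completes the packet and is not
         represented in this sketch. */
  };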
19
Torus interface fifos
  • The CPUs access the torus via the memory-mapped torus fifos.
  • Each fifo has 1Kbyte of SRAM memory.
  • There are 6 normal-priority injection fifos.
  • There are 2 high priority injection fifos.
  • Injection fifos are not associated with network directions. For example, a packet going out in the z direction can be injected into any fifo.
  • There are 2 groups of normal-priority reception fifos. Each group has 6 reception fifos, one for each direction (x+, x-, y+, y-, z+, z-).
  • The packet header has a bit that specifies into
    which group the packet should be received. A
    packet received from the z- direction with header
    group bit 0 will go to the z- fifo of group 0.
  • There are 2 groups of high-priority fifos. Each
    group has 1 fifo. All packets with the header
    high priority bit set will go to the
    corresponding fifo.
  • All fifos have status bits that can be read from specific hardware addresses. The status indicates how full a fifo is (see the polling sketch below).
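A hedged sketch of how code might use these status reads to pick an injection fifo. The status-word addresses and their bit layout are assumptions (here each status word is taken to report free space in bytes), not the real BG/L memory map.

  #include <stdint.h>

  #define N_INJ_FIFOS 6   /* normal-priority injection fifos */

  /* Placeholder: memory-mapped status words, one per injection fifo. */
  extern volatile uint32_t *inj_fifo_status[N_INJ_FIFOS];

  /* Return the injection fifo with the most free space.  Any fifo can carry
     a packet in any direction, so the choice is purely load balancing. */
  int pick_injection_fifo(void)
  {
      int best = 0;
      uint32_t best_free = *inj_fifo_status[0];
      for (int f = 1; f < N_INJ_FIFOS; f++) {
          uint32_t free_bytes = *inj_fifo_status[f];
          if (free_bytes > best_free) {
              best_free = free_bytes;
              best = f;
          }
      }
      return best;
  }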

20
Torus communications code
Injection
  • Prepare a complete packet in memory: an 8-byte hardware header followed by the desired payload.
  • The packet must be aligned on a 16-byte memory boundary (quad aligned).
  • It must have size 32, 64, ..., up to 256 bytes.
  • Pick a torus fifo to inject your packet.
  • Read the status bits of that fifo from the
    corresponding fifo-status hardware address. These
    include the available space in the fifo.
  • Keep polling until the fifo has enough space for
    your packet.
  • Use the double FPU (DFPU) QuadLoad to load the
    first Quad (16 Bytes) into a DFPU register.
  • Use the DFPU QuadStore to store the 16 Bytes
    into the desired torus fifo. Each fifo has a
    specific hardware address.
  • Repeat until all bytes are stored in fifo.
  • Done. The torus hardware will deliver your packet to the destination node specified in the hardware header. (A minimal sketch of this injection sequence follows below.)
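A minimal sketch of this injection sequence, assuming hypothetical fifo and status addresses and a status word that reports free space in bytes; memcpy stands in for the DFPU QuadLoad/QuadStore pair, which this sketch does not reproduce.

  #include <stdint.h>
  #include <string.h>

  #define QUAD 16
  /* Hypothetical memory-mapped addresses, not the real BG/L map. */
  #define TORUS_INJ_FIFO   ((volatile uint8_t *)0xB0000000u)
  #define TORUS_INJ_STATUS ((volatile uint32_t *)0xB0000400u)

  /* One 16-byte move; on BG/L this is a single DFPU QuadLoad + QuadStore. */
  static void copy_quad_to_fifo(volatile uint8_t *fifo, const uint8_t *src)
  {
      memcpy((void *)fifo, src, QUAD);
  }

  /* pkt: quad-aligned packet (8-byte header + payload), size 32..256 bytes. */
  void torus_inject(const uint8_t *pkt, int size)
  {
      /* Poll the fifo status until there is room for the whole packet. */
      while (*TORUS_INJ_STATUS < (uint32_t)size)
          ;
      /* Store the packet into the fifo one quad (16 bytes) at a time. */
      for (int i = 0; i < size; i += QUAD)
          copy_quad_to_fifo(TORUS_INJ_FIFO, pkt + i);
      /* Done: the hardware delivers the packet to the destination node
         named in the hardware header. */
  }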

21
Torus communications code
Reception
  • Read the status bits of the reception fifos.
    These indicate the number of bytes in each
    reception fifo. The status is updated only after
    a full packet is completely in the reception
    fifo.
  • Keep polling until a reception fifo has data to
    be read.
  • Use the double FPU (DFPU) QuadLoad to load the
    first Quad (16 Bytes) from the corresponding fifo
    hardware address into a DFPU register.
  • This is the packet header and has the size of
    the packet.
  • Use the DFPU QuadStore to store the 16 Bytes
    into the desired memory location.
  • Repeat until all bytes of that packet are read from the fifo and stored into memory. (You know how many times to read since the header contains the packet size.)
  • Remember that QuadStores store data at quad-aligned memory addresses.
  • Done. The torus hardware has advanced the fifo to the next received packet, if any. (A matching reception sketch follows below.)
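A matching sketch of the reception sequence, under the same assumptions (placeholder addresses, memcpy standing in for the DFPU quad load/store, and a hypothetical helper that extracts the packet size from the header).

  #include <stdint.h>
  #include <string.h>

  #define QUAD 16
  /* Hypothetical memory-mapped addresses, not the real BG/L map. */
  #define TORUS_RCV_FIFO   ((volatile uint8_t *)0xB0001000u)
  #define TORUS_RCV_STATUS ((volatile uint32_t *)0xB0001400u)

  /* Hypothetical: decode the packet size (32..256 bytes) from the header. */
  extern int packet_size_from_header(const uint8_t *header);

  static void copy_quad_from_fifo(uint8_t *dst, volatile const uint8_t *fifo)
  {
      memcpy(dst, (const void *)fifo, QUAD);
  }

  /* buf: quad-aligned buffer large enough for a maximum-size (256 B) packet. */
  int torus_receive(uint8_t *buf)
  {
      /* Poll until a complete packet sits in the reception fifo. */
      while (*TORUS_RCV_STATUS == 0)
          ;
      /* The first quad contains the header, and with it the packet size. */
      copy_quad_from_fifo(buf, TORUS_RCV_FIFO);
      int size = packet_size_from_header(buf);
      /* Drain the remaining quads of this packet into memory. */
      for (int i = QUAD; i < size; i += QUAD)
          copy_quad_from_fifo(buf + i, TORUS_RCV_FIFO);
      return size;  /* the hardware has advanced the fifo to the next packet */
  }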

22
Routing
(Diagram: dynamic, virtual cut-through routing over several virtual channels (VCs), with a bubble-escape virtual channel (VCB) and a priority virtual channel (VCP).)
23
Routing examples
  • Deterministic and adaptive routing
  • A hardware implementation of multicasting along a
    line

24
All to all performance
25
The double FPU
(Register files: primary FPR P0-P31 and secondary FPR S0-S31.)
  • The BG/L chip has two 440 cores. Each core has a
    double FPU.
  • The DFPU has two register files, primary and secondary; each has 32 64-bit floating-point registers.
  • There are floating-point instructions that allow
    load/store and manipulation of all registers.
  • These instructions are an extension to the
    PowerPC Book E instruction set.
  • The DFPU is ideal for complex arithmetic (see the sketch below).
  • The primary and secondary registers can be loaded independently or simultaneously. For example, R4-primary and R4-secondary can be loaded with a single Quad-Load instruction; in this case the data must come from a quad-aligned address.
  • Similarly with stores.
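To see why the paired register files suit complex arithmetic, here is a portable C sketch of a complex multiply-accumulate. The idea is that the real part of each number sits in a primary register and the imaginary part in the matching secondary register, so each {re, im} pair is one quad-aligned QuadLoad and each iteration maps onto fused DFPU multiply-adds; the actual BG/L kernels do this with compiler built-ins or inline assembly, which this sketch does not reproduce.

  #include <complex.h>

  /* y[i] += a * x[i] for quad-aligned (16-byte) double-complex arrays.
     Each complex multiply-add is 4 real multiplies and 4 real adds, a good
     fit for the primary/secondary fused multiply-add instructions. */
  void caxpy(int n, double complex a,
             const double complex *restrict x, double complex *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] += a * x[i];
  }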

26
BlueGene/L and QCD at night

27
"Physics is what physicists do at night." (R. Feynman)
28
The 1 sustained-Teraflops landmark
1 sustained-Teraflops for 8.5 hours on 1024 nodes (1 rack), June 2004.
29
QCD on BlueGene/L machines (1/25/06)
  • More than 20 racks (112 Teraflops) worldwide, mostly for QCD.
  • LLNL and Watson-IBM will possibly run some QCD.

30
One chip hardware
(Chip block diagram: CPU0 and CPU1, each with a 32 KB L1 and an L2 prefetch unit, share a 4 MB L3 and 1 GB of external DDR per 2 nodes; the 3D-torus fifos, senders and receivers implement virtual cut-through routing; the tree network provides combine/broadcast; round trip about 5 μs.)
31
QCD on the hardware
  • 1) Virtual node mode
  • CPU0 and CPU1 act as independent virtual nodes.
  • Each one does both computations and communications.
  • The 4th direction is along the two CPUs (it can also be spread across the machine via hand-coded cut-through routing or MPI).
  • The two CPUs communicate via common memory buffers.
  • Computations and communications cannot overlap.
  • Peak compute performance is then 5.6 GFlops.
32
QCD on the hardware
  • 2) Co-processor mode
  • CPU0 does all the computations.
  • CPU1 does most of the communications (MPI etc.).
  • The 4th direction is internal to CPU0, or it can be spread across the machine using hand-coded cut-through routing or MPI.
  • Communications can overlap with computations.
  • Peak compute performance is then 5.6/2 = 2.8 GFlops.
33
Optimized Wilson D with even/odd preconditioning
in virtual node mode
  • The innermost kernel code is in C/C++ with inline assembly.
  • The algorithm is similar to the one used on the CM-2 and QCDSP.
  • Spin project in the 4 backward directions.
  • Spin project in the 4 forward directions and multiply with the gauge field.
  • Communicate the backward and forward spinors to the nearest neighbors.
  • Multiply the backward spinors with the gauge field and spin reconstruct.
  • Spin reconstruct the forward spinors (a structural sketch of these steps follows below).
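A structural sketch of the kernel described above. The function and type names are placeholders (the real code is optimized C/C++ with inline assembly and is not reproduced here); the point is only the ordering of the steps and where the nearest-neighbor communication sits.

  typedef struct spinor spinor_t;   /* placeholder 4-spinor field type   */
  typedef struct su3    su3_t;      /* placeholder gauge-link field type */

  /* Placeholders for the optimized routines. */
  extern void spin_project_backward(const spinor_t *in);
  extern void spin_project_forward_mult_gauge(const spinor_t *in, const su3_t *U);
  extern void communicate_halfspinors_to_neighbors(void);  /* torus layer */
  extern void mult_gauge_reconstruct_backward(spinor_t *out, const su3_t *U);
  extern void spin_reconstruct_forward(spinor_t *out);

  void wilson_dslash_eo(spinor_t *out, const su3_t *U, const spinor_t *in)
  {
      spin_project_backward(in);                  /* backward directions       */
      spin_project_forward_mult_gauge(in, U);     /* forward directions, * U   */
      communicate_halfspinors_to_neighbors();     /* nearest-neighbor comms    */
      mult_gauge_reconstruct_backward(out, U);    /* U * backward, reconstruct */
      spin_reconstruct_forward(out);              /* forward reconstruct       */
  }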

34
  • All computations use the double-Hummer multiply/add instructions.
  • All floating-point computations are carefully arranged to avoid pipeline conflicts.
  • Memory storage ordering is chosen for minimal pointer arithmetic.
  • Quad loads/stores are carefully arranged to take advantage of the cache hierarchy and the CPU's ability to issue up to 3 outstanding loads.
  • Computations almost fully overlap with loads/stores. Local performance is bounded by memory access to L3.
  • A very thin and effective nearest-neighbor communication layer interacts directly with the torus network hardware to do the data transfers.
  • Global sums are done via fast torus or tree routines.
  • Communications do not overlap with computations or memory access.
  • Small local size: fast L1 memory access but more communication.
  • Large local size: slower L3 memory access but less communication.

35
Cycle breakdown
For the Wilson Dslash operator with even/odd preconditioning. Processor cycle measurements (pcycles) in virtual node mode. The lattices are the local lattices on each core (the % of peak row is worked out below the table).

                          2^4 (pcycles/site)   16x4^3 (pcycles/site)
cmat_two_spproj           457                  489
comm                      1537                 432
mat_reconstruct           388                  479
reconstruct               154                  193
Dslash                    2564                 1596
Theoretical best          324                  324
Performance (% of peak)   12.6                 20.3
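The Performance row is simply the ratio of the theoretical best to the measured Dslash cycles: 324 / 2564 = 12.6% of peak for the 2^4 local lattice and 324 / 1596 = 20.3% for 16x4^3.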
36
Wilson kernel node performance
Spin projection and even/odd preconditioning (squashed along the x direction).
Numbers are for a single chip with self-wrapped links.
The full inverter includes the torus global sum.

% of peak     2^4    4x2^3   4^4    8x4^3   8^2x4^2   16x4^3
D, no comms   31.5   28.2    25.9   27.1    27.1      27.8
D             12.6   15.4    15.6   19.5    19.7      20.3
Inverter      13.1   15.3    15.4   18.7    18.8      19.0
37
(No Transcript)
38
Weak Scaling (fixed local size)
Spin projection and even/odd preconditioning. Full inverter (with torus global sum). 16x4x4x4 local lattice. CG iterations: 21. (See the note below the table.)

Machine          ½ chip      midplane      1 rack        2 racks
Cores            1           1024          2048          4096
Global lattice   4x4x4x16    32x32x32x32   32x32x64x32   32x64x64x32
% of peak        19.0        18.9          18.8          18.7
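The global lattices follow from keeping the 16x4x4x4 = 1024-site local lattice fixed per core: for example, on 1 rack the 2048 cores give 2048 x 1024 = 2,097,152 sites, which is exactly 32x32x64x32.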
39
(No Transcript)
40
Special OS tricks (not necessarily dirty)
  • It was found that L1 evictions cause delays due to increased L3 traffic. To avoid some of this, the temporary spin-projected two-component spinors are stored to memory with the L1 attribute write-through-swoa (store without allocate).
  • An OS function is called that returns a pointer to memory and a fixed size; that image of memory has the above attributes. This increased performance from 16% to 19%.
  • The on-chip, core-to-core communications are done with a local copy in common memory. It was found that the copy was faster if it was done via the common SRAM.
  • An OS function is called that returns a pointer to memory and a fixed size; that image of memory is in SRAM and is about 1 KB. This increased performance from 19% to 20%.
  • Under construction: an OS function that splits the L1 cache into two pieces (standard and transient). Loads into the transient L1 will not get evicted or cause evictions. Since the gauge fields are not modified during the inversion, this is an ideal place to store them.
  • These functions exist in the IBM Watson software group's experimental kernel, called controlX. They have not migrated to the standard BG/L software release.

41
Full QCD physics system
  • The physics code (besides the Wilson Dslash) is the Columbia C++ Physics System (CPS).
  • The full system ported very easily and worked immediately.
  • The BG/L additions/modifications to the system have been kept isolated.

Acknowledgement
We would like to thank the QCDOC collaboration
for useful discussions and for providing us with
the Columbia physics system software.
42
BlueGene next generations

43
P
44
Q
45
What would you do?
46
if they come to you with 1 Petaflop for a month?
47
QCD, the movie

48
QCD thermal phase transition: a clip from a BG/L lattice simulation.
  • This clip is from a state-of-the-art simulation of QCD on half a rack of a BG/L machine (2.8 Teraflops). It took about 2 days.
  • It shows 2-flavor dynamical QCD on a 16x16x16x4
    lattice with the DWF 5th dimension set to 24
    sites.
  • The pion mass is about 400 MeV.
  • The color of each lattice point is the value of
    the Polyakov loop which can fluctuate between -3
    and 3. Think of it as a spin system.
  • The graph shows the volume average of the Polyakov line. This value is directly related to the single-quark free energy (see the relation below). In the confined phase there are no free quarks and the value is low (not zero because of screening); in the quark-gluon plasma phase quarks can exist alone and the value is large.
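For reference, the standard relation behind the last bullet (not spelled out on the slide) connects the volume-averaged Polyakov line P to the single-quark free energy F_q at temperature T:

    \langle |P| \rangle \sim e^{-F_q / T}

so a small average means a large free energy (confinement) and a large average means a small free energy (quark-gluon plasma).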

G. Bhanot, D. Chen, A. Gara, P. Heidelberger, J.
Sexton, P. Vranas, B. Walkup