Title: Parallel Programming on the SGI Origin2000
1Parallel Programming on the SGI Origin2000
Taub Computer Center Technion
Moshe Goldberg, mgold_at_tx.technion.ac.il
With thanks to Igor Zacharov / Benoit Marchand,
SGI
Mar 2004 (v1.2)
2Parallel Programming on the SGI Origin2000
- Parallelization Concepts
- SGI Computer Design
- Efficient Scalar Design
- Parallel Programming - OpenMP
- Parallel Programming - MPI
32) SGI Computer Design
4Origin2000/3000 architecture features
Important hardware and software components:
- node board: processors, memory
- node interconnect: topology and configurations
- scalability of the architecture
- directory-based cache coherency
- single system image components
5Origin2000 node board
6Origin node board
- HUB crossbar ASIC: a single chip integrates all four functions
  - processor interface: two rxK processors on the same bus
  - memory interface, integrating the memory controller and (directory-based) cache coherency
  - interface to the CrayLink Interconnect to other nodes in the system
  - interface to I/O devices with XIO-to-PCI bridges
- Memory access characteristics
  - read bandwidth, single processor: 460 MB/s sustained
  - average access latency: 315 ns to restart the processor pipeline
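As a rough worked example of what the sustained read bandwidth means in practice, the sketch below estimates how long a single processor needs to stream a block of data from memory. It assumes 1 MB = 10^6 bytes and ignores cache reuse and latency effects; the function names are hypothetical and chosen only for illustration.

```python
# Sustained single-processor read bandwidth quoted on the slide.
# Assumption: 1 MB is taken as 10**6 bytes.
READ_BANDWIDTH_BYTES_PER_S = 460 * 10**6
CACHE_LINE_BYTES = 128  # L2 cache line size on the Origin


def stream_time_seconds(n_bytes):
    """Estimated time to read n_bytes at the sustained bandwidth."""
    return n_bytes / READ_BANDWIDTH_BYTES_PER_S


def cache_lines_touched(n_bytes):
    """Number of 128-byte cache lines needed to hold n_bytes.

    Assumes the data starts on a cache-line boundary.
    """
    return -(-n_bytes // CACHE_LINE_BYTES)  # ceiling division


# Example: an array of 10 million 8-byte floating point values.
n_bytes = 8 * 10_000_000
print(f"{n_bytes} bytes: about {stream_time_seconds(n_bytes):.3f} s, "
      f"{cache_lines_touched(n_bytes)} cache lines")
```

This is only a lower bound for a purely streaming read; real codes that reuse data in cache can do considerably better per useful byte.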
7Origin2000 node components
8Origin router interconnect
- Router chip has 6 CrayLink interfaces: 2 for connections to nodes (HUBs) and 4 for connections to other routers in the network (4-dimensional interconnect)
- The interconnect topology is determined by the size of the computer (number of nodes):
  - direct (back-to-back) connection for 2 nodes (4 cpu)
  - strongly connected cube for up to 32 cpu
  - hypercube for up to 64 cpu
  - hypercube of hypercubes for up to 256 cpu
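The size-to-topology rule above can be sketched as a small lookup. The thresholds come from the slide; the function names and the binary numbering of routers in `hypercube_neighbors` are assumptions made for illustration, not SGI's actual routing tables. With 4 router-to-router links the largest plain hypercube has 2^4 = 16 routers, and 16 routers x 2 nodes x 2 cpu gives the 64 cpu limit quoted above.

```python
CPUS_PER_NODE = 2     # two processors per node board
NODES_PER_ROUTER = 2  # each router has 2 links to nodes (HUBs)
ROUTER_LINKS = 4      # and 4 links to other routers


def topology_for(cpus):
    """Return the topology the slide lists for a machine with `cpus` processors."""
    if cpus < 1:
        raise ValueError("cpu count must be positive")
    if cpus <= 4:
        return "direct (back-to-back) connection"
    if cpus <= 32:
        return "strongly connected cube"
    if cpus <= 64:
        return "hypercube"
    if cpus <= 256:
        return "hypercube of hypercubes"
    raise ValueError("larger configurations are not covered by the slide")


def hypercube_neighbors(router_id, dimensions=ROUTER_LINKS):
    """Routers one hop away in a binary hypercube.

    Assumption: routers are numbered so that neighbors differ in exactly
    one bit of their id (the textbook hypercube numbering).
    """
    return [router_id ^ (1 << d) for d in range(dimensions)]


# Largest plain hypercube reachable with 4 router-to-router links.
max_hypercube_cpus = (2 ** ROUTER_LINKS) * NODES_PER_ROUTER * CPUS_PER_NODE
print(topology_for(16), topology_for(128), max_hypercube_cpus)
```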
9Origin2000 two nodes
10Origin2000 module connections
11Origin2000 interconnect
12Origin2000 interconnect
32 processors
64 processors
13Origin2000 interconnect
14Directory-based uniform cache
Cache line use is recorded in directory, which
resides in memory
15Origin cache coherence
- Memory page is divided into data blocks of 32 words or 128 bytes each (L2 cache line size)
- Each data request transfers one data block (128 bytes)
- Each data block has associated presence and state information
[Figure: each data block (cache line, 128 bytes = 32 words) in memory has a directory entry holding presence (64 bits) and state (3 bits)]
- If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
- HUB runs the cache coherence protocol, updating the state of the data block and notifying nodes for which the presence bit is set
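The bookkeeping described above can be illustrated with a toy directory entry. This is a deliberately simplified sketch, not the Origin protocol itself: the state names ("unowned", "shared", "exclusive") and the class and method names are hypothetical, and the real hardware encodes the state in 3 bits with more cases than are modeled here.

```python
CACHE_LINE_BYTES = 128  # one data block
PRESENCE_BITS = 64      # one presence bit per node (HUB)


def block_index(address):
    """Index of the 128-byte data block that contains `address`."""
    return address // CACHE_LINE_BYTES


class DirectoryEntry:
    """Toy directory entry for one data block: presence bits plus a state."""

    def __init__(self):
        self.presence = 0       # bit i set => node i holds a copy
        self.state = "unowned"  # hypothetical state names

    def _check(self, node):
        if not 0 <= node < PRESENCE_BITS:
            raise ValueError("node id out of range")

    def sharers(self):
        """Nodes whose presence bit is set."""
        return [n for n in range(PRESENCE_BITS) if (self.presence >> n) & 1]

    def read(self, node):
        """Node requests the block for reading; returns the nodes notified."""
        self._check(node)
        others = [n for n in self.sharers() if n != node]
        # An exclusive owner other than the requester must be notified.
        notified = others if self.state == "exclusive" else []
        self.presence |= 1 << node
        if others or self.state != "exclusive":
            self.state = "shared"
        return notified

    def write(self, node):
        """Node requests the block for writing; returns the nodes invalidated."""
        self._check(node)
        invalidated = [n for n in self.sharers() if n != node]
        self.presence = 1 << node  # only the writer keeps a copy
        self.state = "exclusive"
        return invalidated
```

For example, after nodes 3 and 5 read a block, a write by node 5 returns `[3]` as the list of nodes that must be notified, and the entry ends up with only node 5's presence bit set.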
16Origin address space
- Physically the memory is distributed and not contiguous
- Node id is assigned at boot time
- Logically memory is a shared single contiguous address space; the virtual address space is 44 bits (16 TB)
- A program (compiler) uses the virtual address space
- CPU translates from virtual to physical address space
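[Figure: physical address layout: bits 39-32 hold the node id (8 bits), bits 31-0 hold the node offset (32 bits, 4 GB). Virtual pages 0, 1, 2, ... n are mapped through the TLB to physical pages on nodes 0, 1, 2, 3, ...; a node id may correspond to an empty slot or to memory present.]
TLB: Translation Look-aside Buffer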
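The physical address layout (8-bit node id in bits 39-32, 32-bit node offset in bits 31-0) can be made concrete with a few lines of bit arithmetic. This is a minimal sketch; the helper names `split_physical` and `join_physical` are hypothetical.

```python
NODE_ID_BITS = 8           # bits 39-32 of the physical address
NODE_OFFSET_BITS = 32      # bits 31-0, i.e. 4 GB per node
VIRTUAL_ADDRESS_BITS = 44  # 16 TB virtual address space


def split_physical(address):
    """Split a 40-bit physical address into (node id, offset within node)."""
    if not 0 <= address < 1 << (NODE_ID_BITS + NODE_OFFSET_BITS):
        raise ValueError("address does not fit in 40 bits")
    node_id = address >> NODE_OFFSET_BITS
    offset = address & ((1 << NODE_OFFSET_BITS) - 1)
    return node_id, offset


def join_physical(node_id, offset):
    """Build a physical address from a node id and an offset within the node."""
    if not 0 <= node_id < 1 << NODE_ID_BITS:
        raise ValueError("node id does not fit in 8 bits")
    if not 0 <= offset < 1 << NODE_OFFSET_BITS:
        raise ValueError("offset does not fit in 32 bits")
    return (node_id << NODE_OFFSET_BITS) | offset


# Sizes implied by the bit widths.
node_memory_gb = 2 ** NODE_OFFSET_BITS // 2 ** 30        # 4 GB per node
virtual_space_tb = 2 ** VIRTUAL_ADDRESS_BITS // 2 ** 40  # 16 TB
print(split_physical(0x0300001000), node_memory_gb, virtual_space_tb)
```

Because the node id occupies the high bits, consecutive physical addresses stay on one node until its 4 GB offset range is exhausted, which is why physical memory is not contiguous when a slot is empty.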
17Summary: Origin2000 properties
- Single machine image: behaves like a large workstation
  - same compilers
  - time sharing
  - all SGI old code (binaries) will run
  - OS schedules the hardware resources on the machine
- Processor scalability: 2-1024 cpu
- I/O scalability
- All memory and I/O devices are directly addressable
  - no limitations on the size of a single program; it can use all available memory
  - no limitations on the location of the data; all disks can be used in a single file system
- 64-bit operating system and file system
- HPC features: checkpoint/restart, queueing system
- Machine stability
18Origin2000/3000 architecture goal
Hardware design: distributed memory
But to a programmer it looks like shared memory
19Example: Simple Memory Access
23Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum cputime 15 minutes
24Two ways to run a batch job
(1) Parameters in command line
(2) Parameters in script file
25QSUB options
26Output of command qstat -a
27Exercise 1: login and submit a job