1
Parallel Programming on the SGI Origin2000
Taub Computer Center Technion
Moshe Goldberg, mgold_at_tx.technion.ac.il
With thanks to Igor Zacharov / Benoit Marchand,
SGI
Mar 2004 (v1.2)
2
Parallel Programming on the SGI Origin2000

1) Parallelization Concepts
2) SGI Computer Design
3) Efficient Scalar Design
4) Parallel Programming - OpenMP
5) Parallel Programming - MPI

3
2) SGI Computer Design
4
Origin2000/3000 architecture features
Important hardware and software components:
- node board: processors, memory
- node interconnect: topology and configurations
- scalability of the architecture
- directory-based cache coherency
- single system image components
5
Origin2000 node board
6
Origin node board
- HUB crossbar ASIC: a single chip integrates all four functions
  - processor interface: two RxK processors on the same bus
  - memory interface, integrating the memory controller and (directory) cache coherency
  - interface to the CrayLink Interconnect, to other nodes in the system
  - interface to I/O devices with XIO-to-PCI bridges
- Memory access characteristics
  - read bandwidth, single processor: 460 MB/s sustained
  - average access latency: 315 ns to restart the processor pipeline
7
Origin2000 node components
8
Origin router interconnect
- Router chip has 6 CrayLink interfaces: 2 for connections to nodes (HUBs) and 4 for connections to other routers in the network (4-dimensional interconnect)
- The interconnect topology is determined by the size of the computer (number of nodes):
  - direct (back-to-back) connection for 2 nodes (4 cpu)
  - strongly connected cube for up to 32 cpu
  - hypercube for up to 64 cpu
  - hypercube of hypercubes for up to 256 cpu
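The sizes above follow from the port counts: with 2 cpus per node and 2 nodes per router, each router serves 4 cpus, and a d-dimensional hypercube holds 2^d routers. The sketch below works through that arithmetic. It models a plain hypercube only; the function names are illustrative, and the extra links of the "strongly connected cube" and the exact wiring of the "hypercube of hypercubes" are not modeled.

```python
CPUS_PER_NODE = 2      # two processors per node board (from the slides)
NODES_PER_ROUTER = 2   # 2 of the 6 CrayLink ports connect to nodes (HUBs)
ROUTER_LINKS = 4       # the other 4 ports connect to other routers


def routers_needed(cpus):
    """Number of routers needed for `cpus` processors, rounded up."""
    per_router = CPUS_PER_NODE * NODES_PER_ROUTER
    return -(-cpus // per_router)  # ceiling division


def hypercube_dimension(cpus):
    """Smallest d such that 2**d routers can serve `cpus` processors."""
    return (routers_needed(cpus) - 1).bit_length()


def router_hops(a, b):
    """Hops between routers a and b in a plain hypercube.

    Router ids differ in one bit per link, so the hop count is the
    Hamming distance between the two ids.
    """
    return bin(a ^ b).count("1")


for cpus in (32, 64, 256):
    d = hypercube_dimension(cpus)
    fits = "fits" if d <= ROUTER_LINKS else "exceeds"
    print(f"{cpus} cpu: {routers_needed(cpus)} routers, dimension {d} "
          f"({fits} {ROUTER_LINKS} router links)")
```

Under these assumptions 64 cpus need a 4-dimensional hypercube, which uses exactly the 4 router-to-router ports. A plain hypercube for 256 cpus would need 6 links per router, which is consistent with the slides switching to a hypercube of hypercubes at that size.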
9
Origin2000 two nodes
10
Origin2000 module connections
11
Origin2000 interconnect
12
Origin2000 interconnect
32 processors
64 processors
13
Origin2000 interconnect
14
Directory-based uniform cache
Cache line use is recorded in directory, which
resides in memory
15
Origin cache coherence
- A memory page is divided into data blocks of 32 words, or 128 bytes, each (the L2 cache line size)
- Each data request transfers one data block (128 bytes)
- Each data block has associated presence and state information
[Figure: memory and directory side by side. Each data block (cache line) of 128 bytes (32 words) in memory has a directory entry holding presence (64 bits) and state (3 bits).]
- If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
- The HUB runs the cache coherence protocol, updating the state of the data block and notifying the nodes for which the presence bit is set
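A toy model can make the presence-bit bookkeeping concrete. The sketch below keeps one directory entry per data block, sets a presence bit when a node reads, and on a write reports which other nodes would have to be notified. The state names (`UNOWNED`, `SHARED`, `EXCLUSIVE`) and the class itself are assumptions for illustration; the slides only say that 3 state bits exist, and the real Origin protocol is more involved.

```python
MAX_NODES = 64  # the directory entry has 64 presence bits


class DirectoryEntry:
    """Directory information for one 128-byte data block (toy model)."""

    def __init__(self):
        self.presence = 0       # bit i set => node i holds a copy
        self.state = "UNOWNED"  # hypothetical state name

    def sharers(self):
        """Return the ids of all nodes whose presence bit is set."""
        return [n for n in range(MAX_NODES) if (self.presence >> n) & 1]

    def read(self, node):
        """Node requests the block for reading: set its presence bit."""
        self.presence |= 1 << node
        self.state = "SHARED"

    def write(self, node):
        """Node requests the block for writing.

        Returns the nodes that must be notified (invalidated), i.e. every
        other node whose presence bit was set.
        """
        to_notify = [n for n in self.sharers() if n != node]
        self.presence = 1 << node
        self.state = "EXCLUSIVE"
        return to_notify


entry = DirectoryEntry()
entry.read(3)
entry.read(10)
print(entry.sharers())   # nodes holding a copy after two reads
print(entry.write(10))   # nodes notified when node 10 writes
```

The point of the presence bits is that a write only needs to contact the nodes that actually hold a copy, rather than broadcasting to every node in the machine.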
16
Origin address space
- Physically the memory is distributed and not contiguous
- Node id is assigned at boot time
- Logically memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
- A program (compiler) uses the virtual address space
- The CPU translates from virtual to physical address space
Physical address layout:
  bits 39-32: node id, 8 bits
  bits 31-0: node offset, 32 bits (4 GB)
[Figure: virtual pages 0, 1, 2, ... n are mapped through the TLB to physical pages on nodes 0, 1, 2, 3, ...; a node id either has memory present or is an empty slot.]
TLB = Translation Look-aside Buffer
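The physical address layout can be illustrated with a few bit operations. The sketch below splits a 40-bit physical address into the 8-bit node id and the 32-bit node offset, and checks the sizes quoted on the slide. The function names are illustrative and not part of any SGI interface.

```python
NODE_ID_BITS = 8        # bits 39-32
NODE_OFFSET_BITS = 32   # bits 31-0
VIRTUAL_BITS = 44       # size of the virtual address space


def split_physical(addr):
    """Split a 40-bit physical address into (node_id, node_offset)."""
    if not 0 <= addr < 1 << (NODE_ID_BITS + NODE_OFFSET_BITS):
        raise ValueError("address does not fit in 40 bits")
    node_id = addr >> NODE_OFFSET_BITS
    node_offset = addr & ((1 << NODE_OFFSET_BITS) - 1)
    return node_id, node_offset


def join_physical(node_id, node_offset):
    """Build a physical address from a node id and an offset within the node."""
    return (node_id << NODE_OFFSET_BITS) | node_offset


addr = join_physical(5, 0x1000)
print(hex(addr), split_physical(addr))

# Sizes implied by the field widths
print((1 << NODE_OFFSET_BITS) // 2**30, "GB per node offset range")
print((1 << VIRTUAL_BITS) // 2**40, "TB of virtual address space")
```

Because the node id occupies the top bits, each node owns one 4 GB slice of the physical address range, and a node with less memory installed leaves part of its slice unused. That is why physical memory is not contiguous even though the virtual address space is.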
17
Summary origin2000 properties
- Single machine image: behaves like a large workstation
  - same compilers
  - time sharing
  - all old SGI code (binaries) will run
  - OS schedules the hardware resources on the machine
- Processor scalability: 2-1024 cpu
- I/O scalability
- All memory and I/O devices are directly addressable
  - no limitations on the size of a single program; it can use all available memory
  - no limitations on the location of the data; all disks can be used in a single file system
- 64-bit operating system and file system
- HPC features: checkpoint/restart, queueing system
- Machine stability
18
Origin2000/3000 architecture goal
Hardware design: distributed memory
But to a programmer, it looks like shared memory
19
Example: Simple Memory Access
23
Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum cputime 15 minutes
24
Two ways to run a batch job
(1) Parameters in command line
(2) Parameters in script file
25
QSUB options
26
Output of command qstat -a
27
Exercise 1: login and submit a job
