Title: Today
1. Today's topics
- Single processors and the Memory Hierarchy
- Buses and Switched Networks
- Interconnection Network Topologies
- Multiprocessors
- Multicomputers
- Flynn's Taxonomy
- Modern clusters (hybrid systems)
2. Processors and the Memory Hierarchy
- Registers (1 clock cycle, 100s of bytes)
- 1st level cache (3-5 clock cycles, 100s KBytes)
- 2nd level cache (10 clock cycles, MBytes)
- Main memory (100 clock cycles, GBytes)
- Disk (milliseconds, 100 GB to ginormous)
[Diagram: CPU registers feed separate 1st-level instruction and data caches, backed by a unified 2nd-level cache (instructions + data).]
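To make the cost of these levels concrete, here is a minimal C sketch (my own illustration, not from the slides) that sums the same matrix in row-major and then column-major order. On typical hardware the first loop is much faster because consecutive accesses reuse cache lines, while the second strides through memory and misses far more often; the matrix size N is an arbitrary choice.

#include <stdio.h>
#include <stdlib.h>

#define N 2048

int main(void)
{
    /* 2048 x 2048 doubles = 32 MB, larger than the caches above */
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    double sum_row = 0.0, sum_col = 0.0;

    /* cache-friendly: the innermost index walks memory sequentially */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum_row += a[(size_t)i * N + j];

    /* cache-unfriendly: stride of N doubles between consecutive accesses */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum_col += a[(size_t)i * N + j];

    printf("row-major sum = %.0f, column-major sum = %.0f\n", sum_row, sum_col);
    free(a);
    return 0;
}

Timing the two loops separately (e.g. compile with gcc -O2 and wrap each loop in a timer) shows the gap the hierarchy predicts.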
3. IBM Dual Core
From the Intel 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/design/processor/manuals/248966.pdf
4. Interconnection Network Topologies - Bus
- Bus
- A single shared data path
- Pros
- Simplicity
- Cache coherence and synchronization are easy (every processor observes all bus traffic)
- Cons
- Fixed bandwidth, shared by all processors
- Does not scale well
[Diagram: several CPUs attached to a single bus that connects them to a global memory.]
5. Interconnection Network Topologies - Switch Based
- Switch Based
- m × n switches
- Many possible topologies
- Characterized by
- Diameter
- Worst-case number of switches between two processors - impacts latency
- Bisection width
- Minimum number of connections that must be removed to split the network into two - limits communication bandwidth
- Edges per switch
- Best if this is independent of the size of the network
6. Interconnection Network Topologies - Mesh
- 2-D Mesh
- 2-D array of processors
- Torus/Wraparound Mesh
- Processors on the edges of the mesh are connected by wraparound links
- Characteristics (n nodes) - see the sketch below
- Diameter: 2(√n − 1) for the mesh, about √n for the torus
- Bisection width: √n for the mesh, 2√n for the torus
- Switch size: 4
- Number of switches: n
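As a rough sketch of these formulas (my own illustration, assuming n is a perfect square k × k; mesh_metrics is a made-up helper name), the code below prints the diameter and bisection width for a square mesh or torus.

#include <math.h>
#include <stdio.h>

/* Topology metrics for a square 2-D mesh/torus of n = k*k nodes.
 * Mesh:  diameter 2(k-1), bisection width k.
 * Torus: diameter 2*floor(k/2), bisection width 2k
 * (wraparound links halve the diameter and double the cut).
 * Link with -lm for sqrt/round. */
static void mesh_metrics(int n, int torus)
{
    int k = (int)round(sqrt((double)n));   /* side length; assumes n is a perfect square */
    int diameter  = torus ? 2 * (k / 2) : 2 * (k - 1);
    int bisection = torus ? 2 * k : k;
    printf("%s, n=%d: diameter=%d, bisection width=%d, edges/switch=4\n",
           torus ? "torus" : "mesh", n, diameter, bisection);
}

int main(void)
{
    mesh_metrics(16, 0);   /* 4x4 mesh:  diameter 6, bisection 4 */
    mesh_metrics(16, 1);   /* 4x4 torus: diameter 4, bisection 8 */
    return 0;
}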
7. Interconnection Network Topologies - Hypercube
- Hypercube
- A d-dimensional hypercube has n = 2^d processors.
- Each processor is directly connected to d other processors
- The shortest path between a pair of processors is at most d
- Characteristics (n = 2^d nodes) - see the code sketch below
- Diameter: d
- Bisection width: n/2
- Switch size: d
- Number of switches: n
[Diagrams: a 3-D hypercube and a 4-D hypercube.]
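A standard way to see these properties (an illustration, not from the slides) is through the node labels: node i is wired to the d nodes whose labels differ from i in exactly one bit, and the minimum hop count between two nodes is the Hamming distance of their labels, which is at most d.

#include <stdio.h>

/* In a d-dimensional hypercube, node i's neighbors are i with one bit flipped. */
static void print_neighbors(unsigned node, unsigned d)
{
    printf("node %u:", node);
    for (unsigned k = 0; k < d; k++)
        printf(" %u", node ^ (1u << k));   /* flip bit k */
    printf("\n");
}

/* Shortest path length = Hamming distance between labels (at most d). */
static unsigned hops(unsigned a, unsigned b)
{
    unsigned x = a ^ b, count = 0;
    while (x) { count += x & 1u; x >>= 1; }
    return count;
}

int main(void)
{
    unsigned d = 3;                            /* 3-D hypercube: n = 2^3 = 8 nodes */
    for (unsigned i = 0; i < (1u << d); i++)
        print_neighbors(i, d);
    printf("hops(0,7) = %u\n", hops(0, 7));    /* labels differ in 3 bits -> 3 hops */
    return 0;
}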
8. Multistage Networks
- Butterfly
- Omega
- Perfect shuffle
- Characteristics for an Omega network (n = 2^d nodes)
- Diameter: d − 1
- Bisection width: n/2
- Switch size: 2
- Number of switches: d · n/2
[Figure: an 8-input, 8-output Omega network of 2 × 2 switches; routing through such a network is sketched below.]
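As a hedged sketch of how messages travel through such a network (destination-tag routing, a standard scheme not spelled out on the slide), the code below traces a message through a 2^d-input Omega network: each stage applies a perfect shuffle to the port label, then the 2 × 2 switch picks its upper or lower output from the next bit of the destination address.

#include <stdio.h>

/* Destination-tag routing through an Omega network with n = 2^d inputs. */
static void omega_route(unsigned src, unsigned dst, unsigned d)
{
    unsigned pos = src;
    printf("route %u -> %u:", src, dst);
    for (int stage = d - 1; stage >= 0; stage--) {
        /* perfect shuffle: rotate the d-bit label left by one */
        pos = ((pos << 1) | (pos >> (d - 1))) & ((1u << d) - 1);
        /* switch output = next destination bit (0 = upper, 1 = lower) */
        unsigned bit = (dst >> stage) & 1u;
        pos = (pos & ~1u) | bit;
        printf(" %u", pos);
    }
    printf("\n");
}

int main(void)
{
    /* 8-input network (d = 3), as in the figure above */
    omega_route(2, 5, 3);
    omega_route(0, 7, 3);
    return 0;
}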
9. Shared Memory
- One or more memories
- Global address space (all system memory visible to all processors)
- Transfer of data between processors is usually implicit: just read from (or write to) a given address (OpenMP; see the sketch below)
- Cache-coherency protocol to maintain consistency between processors
Uniform Memory Access (UMA) shared-memory system:
[Diagram: several CPUs connected through an interconnection network to multiple memory modules.]
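A minimal OpenMP sketch of this implicit style (my example, not from the slides): all threads read and write the same shared array, and the only explicit coordination is the reduction on the shared sum. Compile with a flag such as gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* every thread writes directly into the shared array */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* every thread reads shared data; the reduction avoids a race on sum */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}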
10. Distributed Shared Memory
- Single address space with implicit communication
- Hardware support for read/write to non-local memories, cache coherency
- Latency for a memory operation is greater when accessing non-local data than when accessing data within a CPU's own memory
11. Distributed Memory
- Each processor has access to its own memory only
- Data transfer between processors is explicit: the user calls message-passing functions (see the MPI sketch below)
- Common libraries for message passing
- MPI, PVM
- User has complete control of, and responsibility for, data placement and management
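A minimal MPI sketch of this explicit style (my example, not from the slides): each rank owns its own variable, and the value moves between address spaces only through MPI_Send / MPI_Recv. Run with, e.g., mpirun -np 2 ./a.out.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int value = rank * 100;   /* lives only in this process's local memory */

    if (rank == 0 && size > 1) {
        /* explicit transfer: nothing is visible to rank 1 until we send it */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int received;
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", received);
    }

    MPI_Finalize();
    return 0;
}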
12. Hybrid Systems
- Distributed-memory system with shared-memory multiprocessor nodes (see the code sketch below)
- Most common architecture for the current generation of parallel machines
[Diagram: nodes, each with several CPUs sharing a memory and a network interface, connected by an interconnection network.]
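A hedged sketch of programming such a machine (a common MPI + OpenMP pattern, not prescribed by the slides): OpenMP threads do shared-memory work inside each node, and MPI moves results between nodes over the interconnection network.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* shared-memory work within the node: threads share `local` via a reduction */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;

    /* explicit message passing across nodes: combine per-node partial sums */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %.0f\n", total);

    MPI_Finalize();
    return 0;
}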
13. Flynn's Taxonomy (Figure 2.20 from Quinn)
Classification by the number of instruction streams and data streams:
- SISD (single instruction, single data): uniprocessor
- SIMD (single instruction, multiple data): processor arrays, pipelined vector processors
- MISD (multiple instruction, single data): systolic array
- MIMD (multiple instruction, multiple data): multiprocessors, multicomputers
14. Top 500 List
- Some highlights from http://www.top500.org/
- On the new list, the IBM BlueGene/L system, installed at DOE's Lawrence Livermore National Laboratory (LLNL), retains the No. 1 spot with a Linpack performance of 280.6 teraflops (trillions of calculations per second, or Tflop/s).
- The new No. 2 system is Sandia National Laboratories' Cray Red Storm supercomputer, only the second system ever recorded to exceed the 100 Tflop/s mark, with 101.4 Tflop/s. The initial Red Storm system was ranked No. 9 in the last listing.
- Slipping to No. 3 from No. 2 last June is the IBM eServer Blue Gene Solution system, installed at IBM's Thomas Watson Research Center, with 91.20 Tflop/s Linpack performance.
- The new No. 5 is the largest system in Europe, an IBM JS21 cluster installed at the Barcelona Supercomputing Center. The system reached 62.63 Tflop/s.
15. Linux/Beowulf cluster basics
- Goal
- Get supercomputing processing power at the cost of a few PCs
- How
- Commodity components: PCs and networks
- Free, open-source software
16. CPU nodes
- A typical configuration
- Dual socket
- Dual core AMD or Intel nodes
- 4 GB memory per node
17. Network Options
From D. K. Panda's Nowlab website at Ohio State, http://nowlab.cse.ohio-state.edu/, Research Overview presentation
18. Challenges
- Cooling
- Power constraints
- Reliability
- System Administration