Title: Reconfigurable interconnects for shared-memory multiprocessor systems
1. Reconfigurable interconnects for shared-memory multiprocessor systems
- Wim Heirman
- UGent/PARIS VUB/Tona
- Couvin, May 17, 2004
2. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
3. Multiprocessing
- Problems (CFD computation, DNA comparison, website hosting) can be too large for a single processor
- Multiple processors each work on a sub-problem, and solve the big problem together
- Communication will take place between the sub-problems
- Two communication paradigms: message passing and shared memory
4. Two different communication paradigms
- Message passing
  - Each processor has its own private memory
  - Programmer inserts calls that send and receive messages
  - Communication is explicit
  - Programmer must know the points of communication, but can handle the latency
  - Typical latency: 1 ms (1 KiB)
- Shared memory
  - All system memory is accessible by all processors
  - Programs read and write memory; accesses to remote memory result in communication
  - Communication is implicit
  - Programmer doesn't know the points of communication, so can't handle the latency
  - Typical latency: 100 ns (64 bytes)
5. Architecture of a shared-memory system
- Components:
  - Nodes
    - Processor(s)
    - Caches
    - A part of main memory
    - Network interface
  - Interconnection network
6. Example: Sun Fire E25K server
- 72 CPUs (1.2 GHz)
- 576 GiB main memory (8 GiB / CPU)
- 172 GB/s network bandwidth
- Power consumption: 6 kW
- Dimensions (H×W×D): 191 × 85 × 166 cm
- Weight: 1122 kg
7. Caching and cache coherence
- At least one memory access per instruction (instruction fetch)
- Instruction execution time: <1 ns @ 1.2 GHz
- Memory access latency: 50 ns (local), >100 ns (remote)
- Caches can hide the latency of most memory accesses (80% @ 2 ns, 15% @ 10 ns, 5% @ >10 ns); a quick average is sketched below
- However, caching means multiple copies of the same data can be present in the system → all copies must have the same value!
- Enter cache coherence protocols
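To make the hit-rate claim concrete, here is a minimal Python sketch of the average memory access time implied by these numbers. The slide gives no exact figure for the >10 ns tail, so the 150 ns used for it is an assumption (roughly a remote access).

```python
# Average memory access time (AMAT) from the hit distribution on this slide.
# The >10 ns tail has no exact figure; 150 ns is an assumed stand-in for a
# remote access (request + remote read + reply, see the later slides).
fractions = {2: 0.80, 10: 0.15, 150: 0.05}  # latency (ns) -> fraction of accesses

amat = sum(latency * frac for latency, frac in fractions.items())
print(f"AMAT = {amat:.1f} ns")  # 0.8*2 + 0.15*10 + 0.05*150 = 10.6 ns
```

Even a 5% tail of slow (possibly remote) accesses dominates the average, which is why network latency matters so much.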
8. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
9. Directory-based coherence
- Memory is divided into cache lines (~64 bytes); each has its home in main memory on one of the nodes
- Caches on different nodes can keep a private copy of a cache line
  - Multiple caches can have a copy read-only (shared)
  - Only one cache can write to a line; it therefore needs exclusive access
- A directory on each node knows which caches have copies of all lines whose home is on this node
- Line state in the directory is one of:
  - Uncached (no cache has the line)
  - Shared (all caches in <sharing-set> have the line read-only)
  - Exclusive (<owner> has the line and may have modified it)
10. The coherence protocol: line states
[State diagram with three states:]
- Uncached: the cache line is present in none of the caches
- Shared: the cache line is present in one or more caches, and may only be read
- Exclusive: the cache line is present in one cache only, and may be modified by it
[Transitions: a read miss takes Uncached to Shared; a write miss takes any state to Exclusive; a read miss on an Exclusive line triggers a write-back and takes it to Shared. A sketch of these transitions follows.]
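A minimal Python sketch of the directory bookkeeping and the state transitions described on these two slides; the names (`DirectoryEntry`, the method signatures) are illustrative, not from the source.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum, auto

class LineState(Enum):
    UNCACHED = auto()   # no cache has the line
    SHARED = auto()     # all caches in the sharing set have it read-only
    EXCLUSIVE = auto()  # one owner has the line and may have modified it

@dataclass
class DirectoryEntry:
    """Per-cache-line bookkeeping at the line's home node (illustrative)."""
    state: LineState = LineState.UNCACHED
    sharers: set[int] = field(default_factory=set)  # node ids holding a copy
    owner: int | None = None                        # meaningful in EXCLUSIVE only

    def read_miss(self, node: int) -> None:
        # Uncached/Shared -> Shared; Exclusive -> write-back first, then Shared.
        if self.state is LineState.EXCLUSIVE:
            self.sharers = {self.owner}  # old owner keeps a read-only copy
            self.owner = None
        self.sharers.add(node)
        self.state = LineState.SHARED

    def write_miss(self, node: int) -> None:
        # Any state -> Exclusive: all other copies are invalidated.
        self.sharers.clear()
        self.owner = node
        self.state = LineState.EXCLUSIVE
```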
11-15. The coherence protocol: network traffic (1)
Read miss for an uncached line (uncached → shared transition):
- Send request over network (100 ns)
- Read data in remote memory (50 ns)
- Send reply over network (100 ns)
- Total: 250 ns
[Animation over five slides: the requesting node's cache line goes from Invalid to Shared, the home node's directory entry goes from Uncached to Shared, and the messages travel between the two network interfaces]
16-21. The coherence protocol: network traffic (2)
Miss for a modified line (modified → modified transition):
- Send request over network (100 ns)
- Send request for write-back (100 ns)
- Read modified data from cache (10 ns)
- Send modified data back (100 ns)
- Total: 310 ns
[Animation over six slides: three nodes are involved; the old owner's copy goes from Modified to Invalid, the home node's directory entry remains Exclusive with a new owner, and the requester's copy becomes Modified]
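The two totals are just the sums of the per-step costs above; a trivial check in Python:

```python
# Latency budget of the two transactions above, using the per-step costs
# from these slides: network hop 100 ns, local memory read 50 ns,
# remote cache read 10 ns.
NET_HOP, MEM_READ, CACHE_READ = 100, 50, 10  # ns

read_uncached = [NET_HOP, MEM_READ, NET_HOP]
miss_modified = [NET_HOP, NET_HOP, CACHE_READ, NET_HOP]

print(sum(read_uncached))  # 250 ns: request, remote memory read, reply
print(sum(miss_modified))  # 310 ns: request, write-back request, cache read, data back
```

Note that the network hops account for 200 of the 250 ns and 300 of the 310 ns.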
22. The coherence protocol: conclusion
- Memory access time is an important factor in total system performance
- Caches can hide the latency of some memory accesses, but not all of them
- One memory access can span several network round-trip times
- Network delay is the largest factor in memory access time
- → Reducing network delay can dramatically improve system performance
23. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
24. Reconfiguration for shared-memory machines (1)
- Why?
  - Memory access latency (MAL) largely determines machine efficiency; decreasing MAL increases performance
  - Network latency is a large part of MAL
- How?
  - A parallel program has access patterns on different timescales, so network load is not uniform over time / space
  - Optimize for the common case
  - Make faster connections between pairs of nodes that communicate often (see the sketch after this list)
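A minimal sketch of how such candidate pairs could be picked, assuming per-pair traffic counters over a measurement window; the helper `hot_pairs` and the numbers are hypothetical.

```python
from collections import Counter

def hot_pairs(traffic: Counter, k: int) -> list[tuple[int, int]]:
    """Pick the k node pairs with the most traffic as candidates for an
    optimized (reconfigurable) link. `traffic` maps (src, dst) pairs,
    normalized so (a, b) == (b, a), to bytes moved in the last window."""
    return [pair for pair, _ in traffic.most_common(k)]

# Hypothetical measurement window on a 4-node machine:
window = Counter({(0, 2): 9_000_000, (1, 3): 4_500_000, (0, 1): 20_000})
print(hot_pairs(window, k=1))  # [(0, 2)] gets the extra link
```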
25. Reconfiguration for shared-memory machines (2)
- When?
  - Try to predict a future communication pattern, and adapt the network to optimize communication that fits this pattern
  - The most (and possibly only) feasible prediction: assume the current pattern continues
- Patterns occur at different timescales:
  - One memory access incurs several messages (100 ns)
  - Memory accesses in a program have locality in time and space, which is also exploited by caches (10 µs)
  - One machine often runs different programs, each with their own access patterns (multitasking, 10 ms)
26. Reconfiguration for shared-memory machines (3)
- Traffic bursts caused by program locality
  - Measure the time that traffic between a pair of nodes stays above a threshold (sketched below)
  - A node pair that communicates in long bursts is a candidate for an optimized link
- Feasibility of this technique?
  - Bursts should be long enough
  - Look at the distribution of bursty traffic durations
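A minimal sketch of that burst measurement, assuming the traffic rate for one node pair is sampled at fixed intervals; the function name and the toy trace are illustrative.

```python
def burst_durations(samples: list[float], threshold: float, dt: float) -> list[float]:
    """Durations (same time unit as dt) of periods during which the traffic
    between one node pair stays above `threshold`. `samples` is the traffic
    rate for that pair, sampled every `dt`."""
    durations, run = [], 0
    for rate in samples:
        if rate > threshold:
            run += 1
        elif run:
            durations.append(run * dt)
            run = 0
    if run:
        durations.append(run * dt)
    return durations

# Toy trace sampled every 10 µs: two bursts, of 40 µs and 20 µs.
trace = [0, 5, 9, 8, 7, 1, 0, 6, 8, 0]
print(burst_durations(trace, threshold=4, dt=10e-6))  # [4e-05, 2e-05]
```

Collecting these durations over a whole benchmark run gives the distributions shown on the next slides.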
27. Traffic characterization (1)
- Radix sort benchmark (SPLASH-2 suite)
- 16 CPUs
- Simulated time: 0.3 s
- Real time: 1 h
- Bursts: wide distribution (power law?), up to the millisecond range
28. Traffic characterization (2)
- Volrend benchmark (SPLASH-2 suite)
- 16 CPUs
- Simulated time: 0.03 s
- Real time: 2 h
- Several bursts of 5 ms
- ... but at the same moment in time
29. Traffic characterization (3)
- Volrend traffic from each node through time
[Figure: per-node traffic plotted over time]
30. Traffic characterization (4)
- Cholesky benchmark (SPLASH-2 suite)
- 16 CPUs
- Much longer bursts (up to 0.5 seconds)
- At regular intervals
31. Reconfiguration for shared-memory machines (4)
- Steps in reconfiguring the network:
  - Detect that there is a pattern (alternatively: programmer or compiler hints)
  - Reconfigure the network
  - Period of profit (until the pattern changes)
- Thus we need t_reconfiguration << t_pattern (a rough profit estimate follows)
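A back-of-the-envelope version of this condition, with assumed numbers for switching time, message rate and per-message gain (none of these figures come from the slides):

```python
# Rough profit condition for one reconfiguration: the pattern must outlive
# the switching time by enough messages to pay for it. All values assumed.
t_pattern = 5e-3    # s, burst length in the ms range seen in the traces
t_reconf  = 1e-4    # s, assumed switching time of the reconfigurable links
gain      = 50e-9   # s saved per message on the optimized link (assumption)
msg_rate  = 1e6     # messages/s between the hot node pair (assumption)

profit = (t_pattern - t_reconf) * msg_rate * gain
print(f"net gain per reconfiguration: {profit * 1e6:.0f} µs")  # positive iff t_reconf << t_pattern
```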
32. Reconfiguration for shared-memory machines (5)
- Other problems:
  - Routing in a variable network
  - The network must stay connected at all times (no loose parts)
  - Optimize for the common case, but the less-common case should not suffer too badly, or total performance will go down again
- Current solution: a fixed base network with extra reconfigurable connections
33. Reconfiguration for shared-memory machines (6)
[Diagram: a number of CPU + MEM nodes connected by a fixed base network, with an extra reconfigurable network adding direct links between selected node pairs; a routing sketch follows]
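One possible way to route over such a two-layer network is a plain shortest-path search on the union of both link sets; a minimal BFS sketch in Python, with an assumed 4-node ring as the base network:

```python
from collections import deque

def shortest_path(n: int, base: set[tuple[int, int]],
                  extra: set[tuple[int, int]], src: int, dst: int) -> list[int]:
    """BFS route over the union of the fixed base network and the currently
    configured extra links. The base edges keep the graph connected no
    matter how `extra` is reconfigured."""
    edges = {tuple(sorted(e)) for e in base | extra}
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    prev, frontier = {src: None}, deque([src])
    while frontier:
        v = frontier.popleft()
        if v == dst:
            break
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                frontier.append(w)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

# 4-node ring as base; one reconfigurable chord between the hot pair (0, 2).
ring = {(0, 1), (1, 2), (2, 3), (3, 0)}
print(shortest_path(4, ring, extra={(0, 2)}, src=0, dst=2))  # [0, 2] instead of [0, 1, 2]
```

Because routing always falls back on the base network, removing an extra link can never disconnect two nodes; it only lengthens their path.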
34. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
35. Conclusion
- Locality of communication in time and space does exist in shared-memory multiprocessor machines
- Locality occurs at different timescales, up to the millisecond range
- Reconfiguration techniques with t_reconfiguration < 1 ms should be effective
36. Future work
- Bursty traffic durations: measure the distribution for different benchmarks to get a better feel for the feasibility of reconfiguration at that timescale
- Better visualization methods for analyzing traffic
- Simulation: extend our simulation platform with models of ROI hardware to enable full-system simulation
- Simulation detail: speed up simulation by removing factors with little influence (out-of-order processor?)
37. Typical latency and bandwidth requirements
[Memory hierarchy diagram, CPU at the top:]
- CPU: 1 GHz
- L1I / L1D: 2× 64 KB, 80% hits; 0.5 ns, 400 Gbps (instructions) / 100 Gbps (data)
- L2: 1 MiB, 90% hits; 5 ns, 200 Gbps
- Network IF: 100 ns, 100 Gbps
- MEM: 8 GB; 50 ns, 30 Gbps
(An average-access-time estimate through this hierarchy follows.)
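Chaining the hit rates above gives the average access time seen by the CPU; the fraction of misses that go remote is not on the slide, so the 30% below is an assumption.

```python
# AMAT through the hierarchy on this slide: 80% of accesses hit in L1,
# 90% of the remainder in L2, the rest goes to memory. Local memory costs
# 50 ns; a remote access adds a 100 ns network hop each way (assumption:
# 30% of memory-level accesses are remote).
l1, l2, mem_local, net_hop = 0.5, 5, 50, 100  # ns
remote_fraction = 0.3  # assumed share of misses that leave the node

miss_cost = (1 - remote_fraction) * mem_local \
          + remote_fraction * (net_hop + mem_local + net_hop)
amat = l1 + 0.20 * (l2 + 0.10 * miss_cost)
print(f"AMAT = {amat:.2f} ns")  # 3.70 ns with these assumptions
```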
38. Quantifying improvements with simulation
- Overall goal: programs should complete faster
- Simulator: a computer program that models processors, caches and the interconnect network
- This virtual computer runs a program (benchmark), similar or identical to the workload of a real shared-memory machine
- We measure its behavior (total execution time, average memory access latency, ...) to grade the network; a toy version is sketched below
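As a caricature of that methodology, the sketch below replays a random access stream against a pluggable network latency model and reports the average memory access latency; every parameter here is an assumption, not a figure from the slides.

```python
import random

def grade_network(latency_ns, n_nodes: int = 16, n_accesses: int = 100_000,
                  remote_fraction: float = 0.3, seed: int = 42) -> float:
    """Toy trace-driven grading: replay random accesses against a network
    model `latency_ns(src, dst)` and report average memory access latency
    in ns. Stands in for the full simulator described on this slide."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_accesses):
        src = rng.randrange(n_nodes)
        if rng.random() < remote_fraction:          # remote access
            dst = rng.randrange(n_nodes)
            total += 2 * latency_ns(src, dst) + 50  # round trip + memory read
        else:                                       # local access
            total += 50
    return total / n_accesses

def uniform(src: int, dst: int) -> float:
    return 100.0  # every network hop costs 100 ns, regardless of the pair

print(f"{grade_network(uniform):.1f} ns on average")
```

Swapping in a latency model with cheap links between hot node pairs would let the two networks be compared on the same access stream.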
39. Drawbacks of simulation
- Detailed simulation takes a lot of time (typical slowdown: a factor of 5000), but many components (like an out-of-order processor) can influence individual instruction timing and thus need to be modeled
- Really? Higher-level properties like total execution time may be (reasonably) independent of low-level details
- We're interested in relative performance (this network is 10% better than that one), so errors in absolute performance are not really a problem
- More work is needed to see what level of simulation detail we need to make valid statements about the networks