Title: Reconfigurable interconnects for shared-memory multiprocessor systems
1. Reconfigurable interconnects for shared-memory multiprocessor systems
- Wim Heirman
- UGent/PARIS VUB/Tona
- Couvin, May 17, 2004
2. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
3. Multiprocessing
- Problems (CFD computation, DNA comparison, website hosting) can be too large for a single processor
- Multiple processors each work on a sub-problem, and solve the big problem together
- Communication will take place between the sub-problems
- Two communication paradigms: message passing and shared memory
4. Two different communication paradigms
- Message passing
  - Each processor has its own private memory
  - Programmer inserts calls that send and receive messages
  - Communication is explicit
  - Programmer must know the points of communication, but can handle the latency
  - Typical latency: 1 ms (1 KiB)
- Shared memory
  - All system memory is accessible by all processors
  - Programs read and write memory; accesses to remote memory result in communication
  - Communication is implicit
  - Programmer doesn't know the points of communication, so can't handle the latency
  - Typical latency: 100 ns (64 bytes)
5. Architecture of a shared-memory system
- Components:
  - Nodes
    - Processor(s)
    - Caches
    - A part of main memory
    - Network interface
  - Interconnection network
6. Example: Sun Fire E25K server
- 72 CPUs (1.2 GHz)
- 576 GiB main memory (8 GiB / CPU)
- 172 GB/s network bandwidth
- Power consumption: 6 kW
- Dimensions (H×W×D): 191 × 85 × 166 cm
- Weight: 1122 kg
7. Caching and cache coherence
- At least one memory access per instruction (instruction fetch)
- Instruction execution time: <1 ns @ 1.2 GHz
- Memory access latency: 50 ns (local), >100 ns (remote)
- Caches can hide the latency of most memory accesses (80% @ 2 ns, 15% @ 10 ns, 5% @ >10 ns); a quick average is sketched below
- However, caching means multiple copies of the same data can be present in the system → all copies must have the same value!
- Enter cache coherence protocols
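To make the hit-rate claim concrete, here is a minimal Python sketch of the average memory access time implied by these numbers. The slide gives no exact figure for the >10 ns tail, so the 150 ns used for it is an assumption (roughly a remote access).

```python
# Average memory access time (AMAT) from the hit distribution on this slide.
# The >10 ns tail has no exact figure; 150 ns is an assumed stand-in for a
# remote access (request + remote read + reply, see the later slides).
fractions = {2: 0.80, 10: 0.15, 150: 0.05}  # latency (ns) -> fraction of accesses

amat = sum(latency * frac for latency, frac in fractions.items())
print(f"AMAT = {amat:.1f} ns")  # 0.8*2 + 0.15*10 + 0.05*150 = 10.6 ns
```

Even a 5% tail of slow (possibly remote) accesses dominates the average, which is why network latency matters so much.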
8. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
9. Directory-based coherence
- Memory is divided into cache lines (~64 bytes); each has its home in main memory on one of the nodes
- Caches on different nodes can keep a private copy of a cache line
  - Multiple caches can have a copy read-only (shared)
  - Only one cache can write to a line; it therefore needs exclusive access
- A directory on each node knows which caches have copies of all lines whose home is on this node
- Line state in the directory is one of:
  - Uncached (no cache has the line)
  - Shared (all caches in <sharing-set> have the line read-only)
  - Exclusive (<owner> has the line and may have modified it)
10. The coherence protocol: line states
[State diagram with three states:]
- Uncached: the cache line is present in none of the caches
- Shared: the cache line is present in one or more caches, and may only be read
- Exclusive: the cache line is present in one cache only, and may be modified by it
[Transitions: a read miss takes Uncached to Shared; a write miss takes any state to Exclusive; a read miss on an Exclusive line triggers a write-back and takes it to Shared. A sketch of these transitions follows.]
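A minimal Python sketch of the directory bookkeeping and the state transitions described on these two slides; the names (`DirectoryEntry`, the method signatures) are illustrative, not from the source.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum, auto

class LineState(Enum):
    UNCACHED = auto()   # no cache has the line
    SHARED = auto()     # all caches in the sharing set have it read-only
    EXCLUSIVE = auto()  # one owner has the line and may have modified it

@dataclass
class DirectoryEntry:
    """Per-cache-line bookkeeping at the line's home node (illustrative)."""
    state: LineState = LineState.UNCACHED
    sharers: set[int] = field(default_factory=set)  # node ids holding a copy
    owner: int | None = None                        # meaningful in EXCLUSIVE only

    def read_miss(self, node: int) -> None:
        # Uncached/Shared -> Shared; Exclusive -> write-back first, then Shared.
        if self.state is LineState.EXCLUSIVE:
            self.sharers = {self.owner}  # old owner keeps a read-only copy
            self.owner = None
        self.sharers.add(node)
        self.state = LineState.SHARED

    def write_miss(self, node: int) -> None:
        # Any state -> Exclusive: all other copies are invalidated.
        self.sharers.clear()
        self.owner = node
        self.state = LineState.EXCLUSIVE
```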
11-15. The coherence protocol: network traffic (1)
Read miss for an uncached line (uncached → shared transition):
- Send request over network (100 ns)
- Read data in remote memory (50 ns)
- Send reply over network (100 ns)
- Total: 250 ns
[Animation over five slides: the requesting node's cache line goes from Invalid to Shared, the home node's directory entry goes from Uncached to Shared, and the messages travel between the two network interfaces]
16-21. The coherence protocol: network traffic (2)
Miss for a modified line (modified → modified transition):
- Send request over network (100 ns)
- Send request for write-back (100 ns)
- Read modified data from cache (10 ns)
- Send modified data back (100 ns)
- Total: 310 ns
[Animation over six slides: three nodes are involved; the old owner's copy goes from Modified to Invalid, the home node's directory entry remains Exclusive with a new owner, and the requester's copy becomes Modified]
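The two totals are just the sums of the per-step costs above; a trivial check in Python:

```python
# Latency budget of the two transactions above, using the per-step costs
# from these slides: network hop 100 ns, local memory read 50 ns,
# remote cache read 10 ns.
NET_HOP, MEM_READ, CACHE_READ = 100, 50, 10  # ns

read_uncached = [NET_HOP, MEM_READ, NET_HOP]
miss_modified = [NET_HOP, NET_HOP, CACHE_READ, NET_HOP]

print(sum(read_uncached))  # 250 ns: request, remote memory read, reply
print(sum(miss_modified))  # 310 ns: request, write-back request, cache read, data back
```

Note that the network hops account for 200 of the 250 ns and 300 of the 310 ns.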
22. The coherence protocol: conclusion
- Memory access time is an important factor in total system performance
- Caches can hide the latency of some memory accesses, but not all of them
- One memory access can span several network round-trip times
- Network delay is the largest factor in memory access time
- → Reducing network delay can dramatically improve system performance
23. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
24. Reconfiguration for shared-memory machines (1)
- Why?
  - Memory access latency (MAL) largely determines machine efficiency; decreasing MAL increases performance
  - Network latency is a large part of MAL
- How?
  - A parallel program has access patterns on different timescales, so network load is not uniform over time / space
  - Optimize for the common case
  - Make faster connections between pairs of nodes that communicate often (see the sketch after this list)
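A minimal sketch of how such candidate pairs could be picked, assuming per-pair traffic counters over a measurement window; the helper `hot_pairs` and the numbers are hypothetical.

```python
from collections import Counter

def hot_pairs(traffic: Counter, k: int) -> list[tuple[int, int]]:
    """Pick the k node pairs with the most traffic as candidates for an
    optimized (reconfigurable) link. `traffic` maps (src, dst) pairs,
    normalized so (a, b) == (b, a), to bytes moved in the last window."""
    return [pair for pair, _ in traffic.most_common(k)]

# Hypothetical measurement window on a 4-node machine:
window = Counter({(0, 2): 9_000_000, (1, 3): 4_500_000, (0, 1): 20_000})
print(hot_pairs(window, k=1))  # [(0, 2)] gets the extra link
```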
25. Reconfiguration for shared-memory machines (2)
- When?
  - Try to predict a future communication pattern, and adapt the network to optimize communication that fits this pattern
  - The most (and possibly only) feasible prediction: assume the current pattern continues
- Patterns occur at different timescales:
  - One memory access incurs several messages (100 ns)
  - Memory accesses in a program have locality in time and space, which is also exploited by caches (10 µs)
  - One machine often runs different programs, each with their own access patterns (multitasking, 10 ms)
26. Reconfiguration for shared-memory machines (3)
- Traffic bursts caused by program locality
  - Measure the time that traffic between a pair of nodes stays above a threshold (sketched below)
  - A node pair that communicates in long bursts is a candidate for an optimized link
- Feasibility of this technique?
  - Bursts should be long enough
  - Look at the distribution of bursty traffic durations
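A minimal sketch of that burst measurement, assuming the traffic rate for one node pair is sampled at fixed intervals; the function name and the toy trace are illustrative.

```python
def burst_durations(samples: list[float], threshold: float, dt: float) -> list[float]:
    """Durations (same time unit as dt) of periods during which the traffic
    between one node pair stays above `threshold`. `samples` is the traffic
    rate for that pair, sampled every `dt`."""
    durations, run = [], 0
    for rate in samples:
        if rate > threshold:
            run += 1
        elif run:
            durations.append(run * dt)
            run = 0
    if run:
        durations.append(run * dt)
    return durations

# Toy trace sampled every 10 µs: two bursts, of 40 µs and 20 µs.
trace = [0, 5, 9, 8, 7, 1, 0, 6, 8, 0]
print(burst_durations(trace, threshold=4, dt=10e-6))  # [4e-05, 2e-05]
```

Collecting these durations over a whole benchmark run gives the distributions shown on the next slides.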
27. Traffic characterization (1)
- Radix sort benchmark (SPLASH-2 suite)
- 16 CPUs
- Simulated time: 0.3 s
- Real time: 1 h
- Bursts: wide distribution (power law?), up to the millisecond range
28. Traffic characterization (2)
- Volrend benchmark (SPLASH-2 suite)
- 16 CPUs
- Simulated time: 0.03 s
- Real time: 2 h
- Several bursts of 5 ms
- ... but at the same moment in time
29. Traffic characterization (3)
- Volrend traffic from each node through time
[Figure: per-node traffic plotted over time]
30. Traffic characterization (4)
- Cholesky benchmark (SPLASH-2 suite)
- 16 CPUs
- Much longer bursts (up to 0.5 seconds)
- At regular intervals
31. Reconfiguration for shared-memory machines (4)
- Steps in reconfiguring the network:
  - Detect that there is a pattern (alternatively: programmer or compiler hints)
  - Reconfigure the network
  - Period of profit (until the pattern changes)
- Thus we need t_reconfiguration << t_pattern (a rough profit estimate follows)
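A back-of-the-envelope version of this condition, with assumed numbers for switching time, message rate and per-message gain (none of these figures come from the slides):

```python
# Rough profit condition for one reconfiguration: the pattern must outlive
# the switching time by enough messages to pay for it. All values assumed.
t_pattern = 5e-3    # s, burst length in the ms range seen in the traces
t_reconf  = 1e-4    # s, assumed switching time of the reconfigurable links
gain      = 50e-9   # s saved per message on the optimized link (assumption)
msg_rate  = 1e6     # messages/s between the hot node pair (assumption)

profit = (t_pattern - t_reconf) * msg_rate * gain
print(f"net gain per reconfiguration: {profit * 1e6:.0f} µs")  # positive iff t_reconf << t_pattern
```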
32. Reconfiguration for shared-memory machines (5)
- Other problems:
  - Routing in a variable network
  - The network must stay connected at all times (no loose parts)
  - Optimize for the common case, but the less-common case should not suffer too badly, or total performance will go down again
- Current solution: a fixed base network with extra reconfigurable connections
33. Reconfiguration for shared-memory machines (6)
[Diagram: a number of CPU + MEM nodes connected by a fixed base network, with an extra reconfigurable network adding direct links between selected node pairs; a routing sketch follows]
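One possible way to route over such a two-layer network is a plain shortest-path search on the union of both link sets; a minimal BFS sketch in Python, with an assumed 4-node ring as the base network:

```python
from collections import deque

def shortest_path(n: int, base: set[tuple[int, int]],
                  extra: set[tuple[int, int]], src: int, dst: int) -> list[int]:
    """BFS route over the union of the fixed base network and the currently
    configured extra links. The base edges keep the graph connected no
    matter how `extra` is reconfigured."""
    edges = {tuple(sorted(e)) for e in base | extra}
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    prev, frontier = {src: None}, deque([src])
    while frontier:
        v = frontier.popleft()
        if v == dst:
            break
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                frontier.append(w)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

# 4-node ring as base; one reconfigurable chord between the hot pair (0, 2).
ring = {(0, 1), (1, 2), (2, 3), (3, 0)}
print(shortest_path(4, ring, extra={(0, 2)}, src=0, dst=2))  # [0, 2] instead of [0, 1, 2]
```

Because routing always falls back on the base network, removing an extra link can never disconnect two nodes; it only lengthens their path.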
34. Overview
- Introduction to shared-memory machines
- The cache coherence mechanism
- How can reconfiguration help?
- Conclusion & future work
35. Conclusion
- Locality of communication in time and space does exist in shared-memory multiprocessor machines
- Locality occurs at different timescales, up to the millisecond range
- Reconfiguration techniques with t_reconfiguration < 1 ms should be effective
36. Future work
- Bursty traffic durations: measure the distribution for different benchmarks to get a better feel for the feasibility of reconfiguration at that timescale
- Better visualization methods for analyzing traffic
- Simulation: extend our simulation platform with models of ROI hardware to enable full-system simulation
- Simulation detail: speed up simulation by removing factors with little influence (out-of-order processor?)
37. Typical latency and bandwidth requirements
[Memory hierarchy diagram, CPU at the top:]
- CPU: 1 GHz
- L1I / L1D: 2× 64 KB, 80% hits; 0.5 ns, 400 Gbps (instructions) / 100 Gbps (data)
- L2: 1 MiB, 90% hits; 5 ns, 200 Gbps
- Network IF: 100 ns, 100 Gbps
- MEM: 8 GB; 50 ns, 30 Gbps
(An average-access-time estimate through this hierarchy follows.)
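Chaining the hit rates above gives the average access time seen by the CPU; the fraction of misses that go remote is not on the slide, so the 30% below is an assumption.

```python
# AMAT through the hierarchy on this slide: 80% of accesses hit in L1,
# 90% of the remainder in L2, the rest goes to memory. Local memory costs
# 50 ns; a remote access adds a 100 ns network hop each way (assumption:
# 30% of memory-level accesses are remote).
l1, l2, mem_local, net_hop = 0.5, 5, 50, 100  # ns
remote_fraction = 0.3  # assumed share of misses that leave the node

miss_cost = (1 - remote_fraction) * mem_local \
          + remote_fraction * (net_hop + mem_local + net_hop)
amat = l1 + 0.20 * (l2 + 0.10 * miss_cost)
print(f"AMAT = {amat:.2f} ns")  # 3.70 ns with these assumptions
```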
38. Quantifying improvements with simulation
- Overall goal: programs should complete faster
- Simulator: a computer program that models processors, caches and the interconnect network
- This virtual computer runs a program (benchmark), similar or identical to the workload of a real shared-memory machine
- We measure its behavior (total execution time, average memory access latency, ...) to grade the network; a toy version is sketched below
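As a caricature of that methodology, the sketch below replays a random access stream against a pluggable network latency model and reports the average memory access latency; every parameter here is an assumption, not a figure from the slides.

```python
import random

def grade_network(latency_ns, n_nodes: int = 16, n_accesses: int = 100_000,
                  remote_fraction: float = 0.3, seed: int = 42) -> float:
    """Toy trace-driven grading: replay random accesses against a network
    model `latency_ns(src, dst)` and report average memory access latency
    in ns. Stands in for the full simulator described on this slide."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_accesses):
        src = rng.randrange(n_nodes)
        if rng.random() < remote_fraction:          # remote access
            dst = rng.randrange(n_nodes)
            total += 2 * latency_ns(src, dst) + 50  # round trip + memory read
        else:                                       # local access
            total += 50
    return total / n_accesses

def uniform(src: int, dst: int) -> float:
    return 100.0  # every network hop costs 100 ns, regardless of the pair

print(f"{grade_network(uniform):.1f} ns on average")
```

Swapping in a latency model with cheap links between hot node pairs would let the two networks be compared on the same access stream.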
39. Drawbacks of simulation
- Detailed simulation takes a lot of time (typical slowdown: a factor of 5000), but many components (like an out-of-order processor) can influence individual instruction timing and thus need to be modeled
- Really? Higher-level properties like total execution time may be (reasonably) independent of low-level details
- We're interested in relative performance (this network is 10% better than that one), so errors in absolute performance are not really a problem
- More work is needed to see what level of simulation detail we need to make valid statements about the networks