Reconfigurable interconnects for shared-memory multiprocessor systems

Transcript and Presenter's Notes



1
Reconfigurable interconnects for shared-memory
multiprocessor systems
  • Wim Heirman
  • UGent/PARIS VUB/Tona
  • Couvin, May 17, 2004

2
Overview
  • Introduction to shared-memory machines
  • The cache coherence mechanism
  • How can reconfiguration help?
  • Conclusion & future work

3
Multiprocessing
  • Problems (CFD computation, DNA comparison,
    Website hosting) can be too large for a single
    processor
  • Multiple processors each work on a sub-problem,
    and together they solve the big problem
  • Communication takes place between the
    sub-problems
  • Two communication paradigms: message passing and
    shared memory

4
Two different communication paradigms
  • Message passing
  • Each processor has its own private memory
  • The programmer inserts calls that send and receive
    messages
  • Communication is explicit
  • The programmer knows the points of communication,
    and so can handle the latency
  • Typical latency: 1 ms (1 KiB)
  • Shared memory
  • All system memory is accessible by all processors
  • The program reads and writes memory; accesses to
    remote memory result in communication
  • Communication is implicit
  • The programmer does not know the points of
    communication, so cannot handle the latency
  • Typical latency: 100 ns (64 bytes)
    (a toy illustration of both paradigms follows
    this list)
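The contrast between the two paradigms can be sketched with a toy example (not from the slides) using Python's multiprocessing module: the Pipe send/recv calls make communication explicit, while the shared Value is simply read and written.

# Minimal sketch: explicit message passing vs. implicit shared memory.
# Illustrative only; real machines use e.g. MPI and hardware coherence.
from multiprocessing import Process, Pipe, Value

def mp_worker(conn):
    # Message passing: communication is explicit (recv/send calls).
    value = conn.recv()
    conn.send(value + 1)

def sm_worker(shared):
    # Shared memory: communication is implicit (a plain write).
    with shared.get_lock():
        shared.value += 1

if __name__ == "__main__":
    # Message passing
    parent, child = Pipe()
    p = Process(target=mp_worker, args=(child,))
    p.start()
    parent.send(41)
    print("message passing result:", parent.recv())  # 42
    p.join()

    # Shared memory
    shared = Value("i", 41)
    q = Process(target=sm_worker, args=(shared,))
    q.start()
    q.join()
    print("shared memory result:", shared.value)     # 42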

5
Architecture of a shared-memory system
  • Components
  • Nodes
  • Processor(s)
  • Caches
  • A part of main memory
  • Network interface
  • Interconnection network

6
Example Sun Fire E25K server
  • 72 CPUs (1.2 GHz)
  • 576 GiB main memory (8 GiB / CPU)
  • 172 GB/s network bandwidth
  • Power consumption: 6 kW
  • Dimensions (H x W x D): 191 x 85 x 166 cm
  • Weight: 1122 kg

7
Caching and cache coherence
  • At least one memory access per instruction
    (the instruction fetch)
  • Instruction execution time: <1 ns @ 1.2 GHz
  • Memory access latency: 50 ns (local), >100 ns
    (remote)
  • Caches can hide the latency of most memory
    accesses (80% @ 2 ns, 15% @ 10 ns, 5% @ >10 ns);
    a back-of-the-envelope calculation follows this
    list
  • However, caching means multiple copies of the
    same data can be present in the system -> all
    copies must have the same value!
  • Enter cache coherence protocols
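As a rough illustration of how well those hit fractions hide latency, the average memory access time can be computed directly. The latency used for the slowest (">10 ns") tier is an assumption made here (100 ns, i.e. a remote access), not a figure from the slides.

# Illustrative average memory access time (AMAT) from the hit fractions
# quoted on this slide; the 100 ns for the ">10 ns" tier is assumed.
fractions_and_latencies = [
    (0.80, 2e-9),    # 80% of accesses take ~2 ns
    (0.15, 10e-9),   # 15% take ~10 ns
    (0.05, 100e-9),  # 5% assumed to cost ~100 ns (remote memory)
]

amat = sum(f * lat for f, lat in fractions_and_latencies)
print(f"AMAT = {amat * 1e9:.1f} ns")  # -> 8.1 ns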

8
Overview
  • Introduction to shared-memory machines
  • The cache coherence mechanism
  • How can reconfiguration help?
  • Conclusion & future work

9
Directory-based coherence
  • Memory is divided into cache lines (typically 64
    bytes); each line has its home in main memory on
    one of the nodes
  • Caches on different nodes can keep a private copy
    of a cache line
  • Multiple caches can have a copy read-only
    (shared)
  • Only one cache can write to a line; it therefore
    needs exclusive access. A directory on each home
    node knows which caches have copies of the lines
    homed on that node
  • The line state in the directory is one of
    (a minimal sketch follows this list):
  • Uncached (no cache has the line)
  • Shared (all caches in <sharing-set> have the line
    read-only)
  • Exclusive (<owner> has the line and may have
    modified it)
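A minimal sketch of such a directory entry, assuming only the three states above; the class and method names are illustrative and not taken from any real protocol implementation.

# Sketch of one directory entry tracking the state of a single cache line.
class DirectoryEntry:
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()   # caches holding a read-only copy
        self.owner = None      # cache holding the exclusive copy

    def read_miss(self, node):
        if self.state == "Exclusive":
            # The owner must write back before the line can be shared.
            self.sharers = {self.owner, node}
            self.owner = None
        else:
            self.sharers.add(node)
        self.state = "Shared"

    def write_miss(self, node):
        # Invalidate all other copies; the requester becomes the owner.
        self.sharers.clear()
        self.owner = node
        self.state = "Exclusive"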

10
The coherence protocol: line states
[State diagram with the three line states and the transitions between them]
  • Uncached: the cache line is present in none of
    the caches
  • Shared: the cache line is present in one or more
    caches, and may only be read
  • Exclusive: the cache line is present in one cache
    only, and may be modified by that cache
  • Read and write misses move a line between these
    states; leaving the Exclusive state requires a
    write-back
11
The coherence protocol: network traffic (1)
Read miss for an uncached line (uncached-to-shared transition)
[Diagram: the requesting node's cache (line Invalid -> Shared) and its
network interface, and the home node's network interface and memory
(directory state Uncached -> Shared)]
- Send request over network (100 ns)
- Read data in remote memory (50 ns)
- Send reply over network (100 ns)
Total: 250 ns
16
The coherence protocol: network traffic (2)
Miss for a modified line (modified-to-modified transition)
[Diagram: the requesting node's cache (line Invalid -> Modified), the home
node's network interface and memory (directory stays Exclusive with a new
owner), and the previous owner's cache (line Modified -> Invalid)]
- Send request over network (100 ns)
- Send request for writeback (100 ns)
- Read modified data from cache (10 ns)
- Send modified data back (100 ns)
Total: 310 ns
(a quick check of both totals follows below)
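A quick back-of-the-envelope check of the two totals, simply summing the per-step latencies quoted on these slides:

# Sum the per-step latencies (ns) of the two transactions shown above.
read_miss_uncached = [100, 50, 100]       # request, memory read, reply
miss_on_modified   = [100, 100, 10, 100]  # request, writeback request,
                                          # cache read, data reply
print(sum(read_miss_uncached), "ns")   # 250 ns
print(sum(miss_on_modified), "ns")     # 310 ns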
22
The coherence protocol: conclusion
  • Memory access time is an important factor in
    total system performance
  • Caches can hide the latency of some memory
    accesses, but not all of them
  • One memory access can span several network
    round-trip times
  • Network delay is the largest factor in memory
    access time
  • -> Reducing network delay can dramatically
    improve system performance

23
Overview
  • Introduction to shared-memory machines
  • The cache coherence mechanism
  • How can reconfiguration help?
  • Conclusion & future work

24
Reconfiguration for shared-memory machines (1)
  • Why?
  • Memory access latency (MAL) largely determines
    machine efficiency; decreasing MAL increases
    performance
  • Network latency is a large part of MAL
  • How?
  • A parallel program has access patterns on
    different timescales, so network load is not
    uniform over time / space
  • Optimize for the common case
  • Make faster connections between pairs of nodes
    that communicate often

25
Reconfiguration for shared-memory machines (2)
  • When?
  • Try to predict a future communication pattern,
    and adapt the network to optimize communication
    that fits this pattern
  • Most (and possibly only) feasible prediction:
    assume that the current pattern continues
  • Patterns occur at different timescales
  • One memory access incurs several messages (100
    ns)
  • Memory accesses in a program have locality in
    time and space, which is also exploited by caches
    (10 µs)
  • One machine often runs different programs, each
    with their own access patterns (multitasking, 10
    ms)

26
Reconfiguration for shared-memory machines (3)
  • Traffic bursts are caused by program locality
  • Measure how long the traffic between a pair of
    nodes stays above a threshold (see the sketch
    after this list)
  • A node pair that communicates in long bursts is a
    candidate for an optimized link
  • Feasibility of this technique?
  • Bursts should be long enough
  • Look at the distribution of bursty traffic
    durations
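A minimal sketch of this threshold-based burst measurement; the traffic samples, the threshold and the sampling interval below are invented purely for illustration.

# Measure how long traffic between one node pair stays above a threshold.
def burst_durations(samples, threshold, interval_ns):
    """Return the length (in ns) of each burst above `threshold`."""
    durations, current = [], 0
    for traffic in samples:
        if traffic > threshold:
            current += interval_ns
        elif current:
            durations.append(current)
            current = 0
    if current:
        durations.append(current)
    return durations

# Example: traffic (bytes) per 100-ns interval between one node pair.
samples = [0, 10, 500, 700, 650, 20, 0, 800, 900, 0]
print(burst_durations(samples, threshold=100, interval_ns=100))
# -> [300, 200]  (one 300-ns burst, one 200-ns burst)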

27
Traffic characterization (1)
  • Radix sort benchmark (SPLASH-2 suite)
  • 16 CPUs
  • Simulated time: 0.3 s
  • Real time: 1 h
  • Bursts: wide distribution (power law?), up to the
    millisecond range

28
Traffic characterization (2)
  • Volrend benchmark (SPLASH-2 suite)
  • 16 CPUs
  • Simulated time: 0.03 s
  • Real time: 2 h
  • Several bursts of 5 ms
  • ... but at the same moment in time

29
Traffic characterization (3)
  • Volrend: traffic from each node through time

30
Traffic characterization (4)
  • Cholesky benchmark (SPLASH-2 suite)
  • 16 CPUs
  • Much longer bursts (up to 0.5 s)
  • At regular intervals

31
Reconfiguration for shared-memory machines (4)
  • Steps in reconfiguring the network:
  • Detect that there is a pattern (alternatively,
    use programmer or compiler hints)
  • Reconfigure the network
  • Period of profit (until the pattern changes)
  • Thus we need t_reconfiguration << t_pattern
    (a sketch of this check follows)
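A sketch of the timing condition behind these steps: reconfiguring only pays off if the pattern outlives the switch time by a comfortable margin. The margin factor of 10 and the example numbers are assumptions, not figures from the slides.

# Decide whether reconfiguration is worthwhile given the two timescales.
def reconfiguration_worthwhile(t_reconfig, t_pattern, margin=10):
    """True if the pattern lasts much longer than the switch time
    (the factor `margin` is an assumed safety factor)."""
    return t_pattern > margin * t_reconfig

# Example: a 1-us switch time against a 5-ms traffic burst.
print(reconfiguration_worthwhile(t_reconfig=1e-6, t_pattern=5e-3))  # True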

32
Reconfiguration for shared-memory machines (5)
  • Other problems
  • Routing in a variable network
  • The network must stay connected at all times (no
    loose parts)
  • Optimize for the common case, but the less-common
    case should not suffer too badly, or total
    performance will go down again
  • Current solution: a fixed base network with extra
    reconfigurable connections

33
Reconfiguration for shared-memory machines (6)
[Diagram: nine nodes (each with CPU + MEM) connected by a fixed base
network, with extra reconfigurable connections overlaid]
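A minimal routing sketch for this fixed-base-plus-extra-links design, under the assumption (for illustration only) that the base network is a ring of nine nodes and that a configured extra link gives a direct one-hop connection.

# Route over an extra reconfigurable link if one exists, else over the
# (assumed) base ring. Hop counts stand in for latency.
def route(src, dst, n_nodes, extra_links):
    if (src, dst) in extra_links or (dst, src) in extra_links:
        return 1                                # direct extra link: one hop
    clockwise = (dst - src) % n_nodes
    return min(clockwise, n_nodes - clockwise)  # hops on the base ring

extra_links = {(0, 5)}               # currently configured extra connection
print(route(0, 5, 9, extra_links))   # 1 hop via the extra link
print(route(1, 5, 9, extra_links))   # 4 hops on the base ring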
34
Overview
  • Introduction to shared-memory machines
  • The cache coherence mechanism
  • How can reconfiguration help?
  • Conclusion & future work

35
Conclusion
  • Locality of communication in time and space does
    exist in shared-memory multiprocessor machines
  • Locality occurs at different timescales, up to
    the millisecond range
  • Reconfiguration techniques with t_reconfiguration
    < 1 ms should be effective

36
Future work
  • Bursty traffic durations: measure the
    distribution for different benchmarks to get a
    better feel for the feasibility of
    reconfiguration at that timescale
  • Better visualization methods for analyzing
    traffic
  • Simulation: extend our simulation platform with
    models of ROI hardware to enable full-system
    simulation
  • Simulation detail: speed up simulation by
    removing factors with little influence
    (out-of-order processor?)

37
Typical latency and bandwidth requirements
[Memory hierarchy diagram; per level: capacity and hit rate, with the
latency and bandwidth of the link to the next level]
  • CPU: 1 GHz
  • CPU-L1 links: 0.5 ns, 400 Gbps and 0.5 ns, 100 Gbps
  • L1I / L1D: 2 x 64 KB, 80% hits
  • L1-L2 link: 5 ns, 200 Gbps
  • L2: 1 MiB, 90% hits
  • L2-Network IF link: 100 ns, 100 Gbps
  • Network IF-MEM link: 50 ns, 30 Gbps
  • MEM: 8 GB
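A rough average-latency estimate through this hierarchy. How the 100 ns network hop and the 50 ns memory access combine on a miss, and what fraction of misses go to remote memory, are assumptions made purely for illustration; the point is that the rare misses dominate the average.

# Rough average access latency through the hierarchy on this slide.
l1_hit, l2_hit = 0.80, 0.90          # hit rates from the slide
l1_lat, l2_lat = 0.5e-9, 5e-9        # latencies from the slide
mem_lat, net_lat = 50e-9, 100e-9
remote_fraction = 0.5                # assumed share of misses going remote

miss_lat = mem_lat + remote_fraction * net_lat
avg = (l1_hit * l1_lat
       + (1 - l1_hit) * l2_hit * l2_lat
       + (1 - l1_hit) * (1 - l2_hit) * miss_lat)
print(f"average access latency ~ {avg * 1e9:.1f} ns")  # ~3.3 ns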
38
Quantifying improvements with simulation
  • Overall goal: programs should complete faster
  • Simulator: a computer program that models the
    processors, caches and interconnection network
  • This virtual computer runs a program (a
    benchmark), similar or identical to the workload
    of a real shared-memory machine
  • We measure its behavior (total execution time,
    average memory access latency, ...) to grade the
    network

39
Drawbacks of simulation
  • Detailed simulation takes a lot of time (typical
    slowdown by a factor of 5000), but many
    components (like an out-of-order processor) can
    influence individual instruction timing and thus
    need to be modeled.
  • Really? Higher-level properties like total
    execution time may be (reasonably) independent of
    low-level details
  • We're interested in relative performance (this
    network is 10% better than that one), so errors
    in absolute performance are not really a problem
  • More work is needed to see what level of
    simulation detail we need to make valid
    statements about the networks