Title: CS 7960-4 Lecture 20
1. CS 7960-4 Lecture 20
The Case for a Single-Chip Multiprocessor. K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K.-Y. Chang. Proceedings of ASPLOS-VII, October 1996.
2. CMP vs. Wide-Issue Superscalar
- What is the best use of on-chip real estate?
  - Wide-issue processor (complex design/clock, diminishing ILP returns)
  - CMP (simple design, high TLP, lower ILP)
- Contributions
  - Takes area and latencies into account
  - Attempts fine-grain parallelization
3. Scalability of Superscalars
- Properties of large-window processors:
  - Require good branch prediction and fetch
  - High rename complexity
  - High issue queue complexity (grows with issue width and window size; sketched below)
  - High bypassing complexity
  - High port requirements in the register file and cache
- → Necessitates partitioned architectures
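The complexity claims above can be made concrete with a small back-of-the-envelope model (mine, not the paper's): issue-queue wakeup must compare every result tag produced per cycle against every waiting source tag, and full bypassing must forward every FU output to every FU input, so the first term grows with window size times issue width and the second roughly with the square of issue width.

```c
#include <stdio.h>

/* First-order complexity sketch for a dynamically scheduled core.
 * Assumptions (not from the paper): two source tags per window entry,
 * and the window is scaled with issue width (16 entries per issue slot). */
static long wakeup_comparators(long window, long issue_width) {
    return 2 * window * issue_width;               /* result tags x waiting source tags */
}

static long bypass_paths(long issue_width, long fwd_stages) {
    return issue_width * issue_width * fwd_stages; /* every output to every input */
}

int main(void) {
    long widths[] = {2, 4, 6};
    for (int i = 0; i < 3; i++) {
        long w = widths[i];
        long window = 16 * w;
        printf("%ld-wide: window=%3ld  wakeup comparators=%4ld  bypass paths=%3ld\n",
               w, window, wakeup_comparators(window, w), bypass_paths(w, 1));
    }
    return 0;
}
```

Tripling the issue width from 2 to 6 triples the window too, so both terms grow roughly ninefold; that is the "quadratic effect" flagged for the decode, queue, and ROB/register areas in the table on slide 7.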
4. Application Requirements
- Low-ILP programs (SPEC-Int) benefit little from wide-issue superscalar machines
  (the 1-wide R5000 is within 30% of the 4-wide R10000)
- High-ILP programs (SPEC-FP) benefit from large windows: typically loop-level
  parallelism that might be easy to extract (example below)
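As an illustration (my example, not one of the paper's benchmarks), the loop-level parallelism the slide refers to looks like the kernel below: every iteration is independent, so a large window can keep many iterations in flight at once.

```c
/* Illustrative SPEC-FP-style kernel (not from the paper): scaled vector add.
 * There are no cross-iteration dependences, so a wide-issue core with a
 * large instruction window can overlap many iterations -- easy ILP. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```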
5. The CMP Argument
- Build many small CPU cores
- The small cores are enough for low-ILP programs (high throughput with
  multiprogramming)
- For high-ILP programs, the compiler parallelizes the application into multiple
  threads; since the cores are on a single die, the cost of communication is
  affordable (sketch below)
- Low communication cost → even integer programs with moderate ILP could be
  parallelized
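A minimal sketch of the transformation the slide describes, reusing the DAXPY-style kernel from slide 4 and using an OpenMP directive only to make the idea concrete (the paper's FP benchmarks are parallelized automatically by SUIF, which emits its own runtime calls):

```c
#include <omp.h>

/* Hand-written sketch of compiler-generated loop-level threading.
 * Iterations are split statically across the CMP's cores; because the
 * cores share a die, the fork/join and data-sharing costs stay small. */
void daxpy_parallel(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```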
6. The CMP Approach
- Wide-issue superscalar → the brute-force method that extracts parallelism by
  blindly increasing the in-flight window size and using more hardware
- CMP → extracts parallelism through static analysis: minimum hardware complexity
  and maximum compiler smarts
- CMP can exploit far-flung ILP and has low hardware cost
- But far-flung ILP and SPEC-Int threads are hard to extract automatically →
  memory disambiguation, control flow (example below)
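To see why memory disambiguation gets in the way, consider a sketch of a SPEC-Int-style loop (my example, not from the paper): unless the compiler can prove the two pointers never alias, it must assume a loop-carried dependence through memory and cannot split the iterations into threads.

```c
/* Illustrative integer loop (not from the paper).  If dst happens to
 * equal src + 1, then iteration i writes the word that iteration i+1
 * reads, so the loop carries a dependence through memory.  A static
 * compiler that cannot disambiguate dst and src must keep the loop
 * sequential; only runtime checks or thread-level speculation would
 * let a CMP run its iterations in parallel. */
void copy_shift(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] + 1;
    }
}
```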
7. Area Extrapolations
Structure                          | 4-wide SS | 6-wide SS | 4x2-way CMP | Comments
32KB DL1                           |    13     |    17     |   4 x 3     | Banking/muxing
32KB IL1                           |    14     |    18     |   4 x 3     | Banking/muxing
TLB                                |     5     |    15     |   4 x 5     |
Bpred                              |     9     |    28     |   4 x 7     |
Decode                             |    11     |    38     |   4 x 5     | Quadratic effect
Queues                             |    14     |    50     |   4 x 4     | Quadratic effect
ROB/Regs                           |     9     |    34     |   4 x 2     | Quadratic effect
Int FUs                            |    10     |    31     |   4 x 10    | More FUs in CMP
FP FUs                             |    12     |    37     |   4 x 12    | More FUs in CMP
Crossbar                           |     -     |     -     |     50      | Multi-L1s ↔ L2
L2, clock, external interface unit |    163    |    163    |    163      | Remains unchanged
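A quick back-of-the-envelope total of the table's per-structure estimates (the sums are mine; the values are the slide's relative area figures, not exact die areas) shows the area argument: the 4x2-way CMP comes in at roughly the same total as the 6-wide superscalar while offering eight issue slots across its four cores.

```c
#include <stdio.h>

int main(void) {
    /* Per-structure estimates copied from the table above
     * (relative area units; the totals are my arithmetic). */
    int four_wide[] = {13, 14, 5, 9, 11, 14, 9, 10, 12};     /* 4-wide SS      */
    int six_wide[]  = {17, 18, 15, 28, 38, 50, 34, 31, 37};  /* 6-wide SS      */
    int cmp_core[]  = {3, 3, 5, 7, 5, 4, 2, 10, 12};         /* one 2-way core */
    int shared = 163;   /* L2, clock, external interface unit (all designs) */
    int crossbar = 50;  /* CMP only: connects the per-core L1s to the L2    */

    int s4 = 0, s6 = 0, sc = 0;
    for (int i = 0; i < 9; i++) { s4 += four_wide[i]; s6 += six_wide[i]; sc += cmp_core[i]; }

    printf("4-wide SS   total = %d\n", s4 + shared);                 /* 260 */
    printf("6-wide SS   total = %d\n", s6 + shared);                 /* 431 */
    printf("4x2-way CMP total = %d\n", 4 * sc + crossbar + shared);  /* 417 */
    return 0;
}
```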
8. Processor Parameters
9. Applications
Benchmark        | Description                                  | Parallelism
Integer          |                                              |
  compress       | Compresses and uncompresses a file in memory | None
  eqntott        | Translates logic equations into truth tables | Manual
  m88ksim        | Motorola 88000 CPU simulator                 | Manual
  MPsim          | Verilog simulation of a multiprocessor       | Manual
FP               |                                              |
  applu          | Solver for partial differential equations    | SUIF
  apsi           | Temperature, wind, and velocity models       | SUIF
  swim           | Shallow water model                          | SUIF
  tomcatv        | Mesh generation with Thompson solver         | SUIF
Multiprogramming |                                              |
  pmake          | Parallel compilation of gnuchess             | Multi-task
10. 2-Wide → 6-Wide
- No change in branch prediction accuracy → is the area penalty for 6-wide worth it?
- More speculation → more cache misses
- IPC improvements of at least 30% for all programs
11. CMP Statistics
Cache miss rates (%):
Application | I-cache | L1D 2-way | L1D 4x2-way | L2 2-way | L2 4x2-way
compress    |   0     |   3.5     |    3.5      |   1.0    |   1.0
eqntott     |   0.6   |   0.8     |    5.4      |   0.7    |   1.2
m88ksim     |   2.3   |   0.4     |    3.3      |   0      |   0
MPsim       |   4.8   |   2.3     |    2.5      |   2.3    |   3.4
applu       |   0     |   2.0     |    2.1      |   1.7    |   1.8
apsi        |   2.7   |   4.1     |    6.9      |   2.1    |   2.0
swim        |   0     |   1.2     |    1.2      |   1.2    |   1.5
tomcatv     |   0     |   7.7     |    7.8      |   2.2    |   2.5
pmake       |   2.4   |   2.1     |    4.6      |   0.4    |   0.7
12. Results
13. Clustered SMT vs. CMP
[Diagram: CMP (four fetch units, four processors, per-core DL1s, interconnect for cache coherence traffic) vs. clustered SMT (four fetch units, four clusters joined by an interconnect for register traffic, per-cluster DL1s joined by an interconnect for cache coherence traffic); annotated for single-thread performance.]
14. Clustered SMT vs. CMP
[Diagram: the same CMP vs. clustered SMT organizations, annotated for multi-program performance.]
15. Clustered SMT vs. CMP
[Diagram: the same CMP vs. clustered SMT organizations, annotated for multi-thread performance.]
16. Clustered SMT vs. CMP
[Diagram: the same organizations, multi-thread performance (continued).]
17. Conclusions
- CMP reduces hardware/power overhead
- Clustered SMT can yield better single-thread and multi-programmed performance
  (at high cost)
- CMP can improve application performance if the compiler can extract thread-level
  parallelism
- What is the most effective use of on-chip real estate?
  - Depends on the workload
  - Depends on compiler technology
18. Next Class Paper
- The Potential for Using Thread-Level Data Speculation to Facilitate Automatic
  Parallelization, J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998