Title: CS 7960-4 Lecture 20
1. CS 7960-4 Lecture 20
The Case for a Single-Chip Multiprocessor. K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K.-Y. Chang. Proceedings of ASPLOS-VII, October 1996.
2. CMP vs. Wide-Issue Superscalar
- What is the best use of on-chip real estate?
  - Wide-issue processor (complex design/clock, diminishing ILP returns)
  - CMP (simple design, high TLP, lower ILP)
- Contributions
  - Takes area and latencies into account
  - Attempts fine-grain parallelization
3. Scalability of Superscalars
- Properties of large-window processors:
  - Require good branch prediction and fetch
  - High rename complexity
  - High issue queue complexity (grows with issue width and window size; sketched below)
  - High bypassing complexity
  - High port requirements in the register file and cache
- → Necessitates partitioned architectures
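The complexity claims above can be made concrete with a small back-of-the-envelope model (mine, not the paper's): issue-queue wakeup must compare every result tag produced per cycle against every waiting source tag, and full bypassing must forward every FU output to every FU input, so the first term grows with window size times issue width and the second roughly with the square of issue width.

```c
#include <stdio.h>

/* First-order complexity sketch for a dynamically scheduled core.
 * Assumptions (not from the paper): two source tags per window entry,
 * and the window is scaled with issue width (16 entries per issue slot). */
static long wakeup_comparators(long window, long issue_width) {
    return 2 * window * issue_width;               /* result tags x waiting source tags */
}

static long bypass_paths(long issue_width, long fwd_stages) {
    return issue_width * issue_width * fwd_stages; /* every output to every input */
}

int main(void) {
    long widths[] = {2, 4, 6};
    for (int i = 0; i < 3; i++) {
        long w = widths[i];
        long window = 16 * w;
        printf("%ld-wide: window=%3ld  wakeup comparators=%4ld  bypass paths=%3ld\n",
               w, window, wakeup_comparators(window, w), bypass_paths(w, 1));
    }
    return 0;
}
```

Tripling the issue width from 2 to 6 triples the window too, so both terms grow roughly ninefold; that is the "quadratic effect" flagged for the decode, queue, and ROB/register areas in the table on slide 7.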
4. Application Requirements
- Low-ILP programs (SPEC-Int) benefit little from wide-issue superscalar machines
  (the 1-wide R5000 is within 30% of the 4-wide R10000)
- High-ILP programs (SPEC-FP) benefit from large windows: typically loop-level
  parallelism that might be easy to extract (example below)
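As an illustration (my example, not one of the paper's benchmarks), the loop-level parallelism the slide refers to looks like the kernel below: every iteration is independent, so a large window can keep many iterations in flight at once.

```c
/* Illustrative SPEC-FP-style kernel (not from the paper): scaled vector add.
 * There are no cross-iteration dependences, so a wide-issue core with a
 * large instruction window can overlap many iterations -- easy ILP. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```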
5. The CMP Argument
- Build many small CPU cores
- The small cores are enough for low-ILP programs (high throughput with
  multiprogramming)
- For high-ILP programs, the compiler parallelizes the application into multiple
  threads; since the cores are on a single die, the cost of communication is
  affordable (sketch below)
- Low communication cost → even integer programs with moderate ILP could be
  parallelized
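A minimal sketch of the transformation the slide describes, reusing the DAXPY-style kernel from slide 4 and using an OpenMP directive only to make the idea concrete (the paper's FP benchmarks are parallelized automatically by SUIF, which emits its own runtime calls):

```c
#include <omp.h>

/* Hand-written sketch of compiler-generated loop-level threading.
 * Iterations are split statically across the CMP's cores; because the
 * cores share a die, the fork/join and data-sharing costs stay small. */
void daxpy_parallel(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```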
6. The CMP Approach
- Wide-issue superscalar → the brute-force method that extracts parallelism by
  blindly increasing the in-flight window size and using more hardware
- CMP → extracts parallelism through static analysis: minimum hardware complexity
  and maximum compiler smarts
- CMP can exploit far-flung ILP and has low hardware cost
- But far-flung ILP and SPEC-Int threads are hard to extract automatically →
  memory disambiguation, control flow (example below)
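To see why memory disambiguation gets in the way, consider a sketch of a SPEC-Int-style loop (my example, not from the paper): unless the compiler can prove the two pointers never alias, it must assume a loop-carried dependence through memory and cannot split the iterations into threads.

```c
/* Illustrative integer loop (not from the paper).  If dst happens to
 * equal src + 1, then iteration i writes the word that iteration i+1
 * reads, so the loop carries a dependence through memory.  A static
 * compiler that cannot disambiguate dst and src must keep the loop
 * sequential; only runtime checks or thread-level speculation would
 * let a CMP run its iterations in parallel. */
void copy_shift(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] + 1;
    }
}
```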
7. Area Extrapolations
Structure                          | 4-wide SS | 6-wide SS | 4x2-way CMP | Comments
32KB DL1                           |    13     |    17     |   4 x 3     | Banking/muxing
32KB IL1                           |    14     |    18     |   4 x 3     | Banking/muxing
TLB                                |     5     |    15     |   4 x 5     |
Bpred                              |     9     |    28     |   4 x 7     |
Decode                             |    11     |    38     |   4 x 5     | Quadratic effect
Queues                             |    14     |    50     |   4 x 4     | Quadratic effect
ROB/Regs                           |     9     |    34     |   4 x 2     | Quadratic effect
Int FUs                            |    10     |    31     |   4 x 10    | More FUs in CMP
FP FUs                             |    12     |    37     |   4 x 12    | More FUs in CMP
Crossbar                           |     -     |     -     |     50      | Multi-L1s ↔ L2
L2, clock, external interface unit |    163    |    163    |    163      | Remains unchanged
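A quick back-of-the-envelope total of the table's per-structure estimates (the sums are mine; the values are the slide's relative area figures, not exact die areas) shows the area argument: the 4x2-way CMP comes in at roughly the same total as the 6-wide superscalar while offering eight issue slots across its four cores.

```c
#include <stdio.h>

int main(void) {
    /* Per-structure estimates copied from the table above
     * (relative area units; the totals are my arithmetic). */
    int four_wide[] = {13, 14, 5, 9, 11, 14, 9, 10, 12};     /* 4-wide SS      */
    int six_wide[]  = {17, 18, 15, 28, 38, 50, 34, 31, 37};  /* 6-wide SS      */
    int cmp_core[]  = {3, 3, 5, 7, 5, 4, 2, 10, 12};         /* one 2-way core */
    int shared = 163;   /* L2, clock, external interface unit (all designs) */
    int crossbar = 50;  /* CMP only: connects the per-core L1s to the L2    */

    int s4 = 0, s6 = 0, sc = 0;
    for (int i = 0; i < 9; i++) { s4 += four_wide[i]; s6 += six_wide[i]; sc += cmp_core[i]; }

    printf("4-wide SS   total = %d\n", s4 + shared);                 /* 260 */
    printf("6-wide SS   total = %d\n", s6 + shared);                 /* 431 */
    printf("4x2-way CMP total = %d\n", 4 * sc + crossbar + shared);  /* 417 */
    return 0;
}
```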
8. Processor Parameters
9. Applications
Benchmark        | Description                                  | Parallelism
Integer          |                                              |
  compress       | Compresses and uncompresses a file in memory | None
  eqntott        | Translates logic equations into truth tables | Manual
  m88ksim        | Motorola 88000 CPU simulator                 | Manual
  MPsim          | Verilog simulation of a multiprocessor       | Manual
FP               |                                              |
  applu          | Solver for partial differential equations    | SUIF
  apsi           | Temperature, wind, and velocity models       | SUIF
  swim           | Shallow water model                          | SUIF
  tomcatv        | Mesh generation with Thompson solver         | SUIF
Multiprogramming |                                              |
  pmake          | Parallel compilation of gnuchess             | Multi-task
10. 2-Wide → 6-Wide
- No change in branch prediction accuracy → is the area penalty for 6-wide worth it?
- More speculation → more cache misses
- IPC improvements of at least 30% for all programs
11. CMP Statistics
Cache miss rates (%):
Application | I-cache | L1D 2-way | L1D 4x2-way | L2 2-way | L2 4x2-way
compress    |   0     |   3.5     |    3.5      |   1.0    |   1.0
eqntott     |   0.6   |   0.8     |    5.4      |   0.7    |   1.2
m88ksim     |   2.3   |   0.4     |    3.3      |   0      |   0
MPsim       |   4.8   |   2.3     |    2.5      |   2.3    |   3.4
applu       |   0     |   2.0     |    2.1      |   1.7    |   1.8
apsi        |   2.7   |   4.1     |    6.9      |   2.1    |   2.0
swim        |   0     |   1.2     |    1.2      |   1.2    |   1.5
tomcatv     |   0     |   7.7     |    7.8      |   2.2    |   2.5
pmake       |   2.4   |   2.1     |    4.6      |   0.4    |   0.7
12. Results
13. Clustered SMT vs. CMP
[Diagram: CMP (four fetch units, four processors, per-core DL1s, interconnect for cache coherence traffic) vs. clustered SMT (four fetch units, four clusters joined by an interconnect for register traffic, per-cluster DL1s joined by an interconnect for cache coherence traffic); annotated for single-thread performance.]
14. Clustered SMT vs. CMP
[Diagram: the same CMP vs. clustered SMT organizations, annotated for multi-program performance.]
15. Clustered SMT vs. CMP
[Diagram: the same CMP vs. clustered SMT organizations, annotated for multi-thread performance.]
16. Clustered SMT vs. CMP
[Diagram: the same organizations, multi-thread performance (continued).]
17. Conclusions
- CMP reduces hardware/power overhead
- Clustered SMT can yield better single-thread and multi-programmed performance
  (at high cost)
- CMP can improve application performance if the compiler can extract thread-level
  parallelism
- What is the most effective use of on-chip real estate?
  - Depends on the workload
  - Depends on compiler technology
18. Next Class Paper
- The Potential for Using Thread-Level Data Speculation to Facilitate Automatic
  Parallelization, J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998