Title: CST305 Performance Evaluation
1 CST305 Performance Evaluation
- L2 Basic Serial Architecture
- Pipelining
- Memory Hierarchy
2 References
- Course web page
- www.wmin.ac.uk/lancasd/CST305/
- K. Dowd and C. Severance, High Performance Computing, chapters 2, 3 (1st ed.)
- J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach
3 Designing to Last through Trends

              Capacity         Speed
  Logic       2x in 3 years    2x in 3 years
  DRAM        4x in 3 years    2x in 10 years
  Disk        4x in 3 years    2x in 10 years
  Processor   (n.a.)           2x in 1.5 years
- Time to run the task
  - Execution time, response time, latency
- Tasks per day, hour, week, sec, ns
  - Throughput, bandwidth
- "X is n times faster than Y" means
  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
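A hedged worked instance of this definition, with times chosen purely for illustration: if Y takes 15 s and X takes 5 s, then

    \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{15\,\text{s}}{5\,\text{s}} = 3,

so X is 3 times faster than Y.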
4 Price vs. Cost
5 Computer Architecture's Changing Definition
- 1950s to 1960s: Computer Architecture Course = Computer Arithmetic
- 1970s to mid 1980s: Computer Architecture Course = Instruction Set Design, especially ISA appropriate for compilers
- 1990s: Computer Architecture Course = Design of CPU, memory system, I/O system, Multiprocessors
6 Instruction Set Architecture (ISA)
  [Figure: the instruction set is the interface between software (above) and hardware (below)]
7
- Performance prediction is hard because architecture and software are complicated
- Consider a (benchmark) program; compile it to get the list of instructions
- Takes 14 cycles sequentially
- Assume that the data is in cache
8 Pipeline
- Like an assembly line
- Overlap the execution of each instruction, so each one starts 1 cycle before the previous one completes
- Takes 11 cycles with the pipeline
- Does scale with clock speed
9 Memory Hierarchy
- If one of the load instructions can't find the data in cache
- Takes 62 cycles with a cache miss
- Does not scale with clock speed -- how do you predict that there will be a cache miss?
10 RISC
- Pipelining and a complicated memory hierarchy are characteristic of modern RISC processors
- They make it much more difficult (than with older CISC processors) to predict performance
- RISC = Reduced Instruction Set Computer
- CISC = Complex Instruction Set Computer
11 RISC vs CISC
- A disagreement in instruction set philosophy
- CISC
  - Powerful primitives, close to high-level languages
  - VAX, Intel
- RISC
  - Low-level primitives - can compute anything, but need more instructions
  - Alpha, PowerPC, Sparc
12
- Before circa 1985, design variables favoured CISC; they now favour RISC
- Memory was precious, and CISC executables are smaller. Now memory is cheap.
- Programmers wrote in assembler and used the complex instructions. Now people rely on compilers, which have difficulty in using complex instructions when optimising.
13 A "Typical" RISC
- Uniform instruction length
  - easier pipeline (with higher clock speed)
- Simple addressing modes
  - to avoid stalls
- Load/store architecture
  - memory references only in these explicit instructions
- Many registers
  - avoid memory references
- Delayed branch
  - a branch delay slot after any branch instruction
14 A "Typical" RISC
- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPRs (R0 contains zero, DP take a pair)
- 3-address, reg-reg arithmetic instructions
- Single address mode for load/store: base + displacement - no indirection
- Simple branch conditions
- Delayed branch
see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
15 Example: MIPS (DLX) instruction formats
  Register-Register:   Op (31-26) | Rs1 (25-21) | Rs2 (20-16) | Rd (15-11) | Opx (10-0)
  Register-Immediate:  Op (31-26) | Rs1 (25-21) | Rd (20-16)  | immediate (15-0)
  Branch:              Op (31-26) | Rs1 (25-21) | Rs2/Opx (20-16) | immediate (15-0)
  Jump / Call:         Op (31-26) | target (25-0)
16 Pipelining: It's Natural!
- Laundry Example
  - Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - Folder takes 20 minutes
17 Sequential Laundry
  [Figure: timeline from 6 PM to midnight; four loads, each taking 30 + 40 + 20 minutes, run one after another]
- Sequential laundry takes 6 hours for 4 loads
18 Pipelined Laundry: Start work ASAP
  [Figure: timeline from 6 PM to midnight; each load starts as soon as the washer is free]
- Pipelined laundry takes 3.5 hours for 4 loads
19 Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
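A minimal C sketch (not from the lecture) that reproduces the laundry arithmetic of the previous slides, assuming a synchronous pipeline in which every stage advances at the pace of the slowest stage:

    #include <stdio.h>

    /* Sequential vs. pipelined completion time for n tasks, each
     * passing through the given stages (times in minutes). */
    int main(void) {
        int stages[] = {30, 40, 20};          /* wash, dry, fold */
        int n_stages = 3, n_tasks = 4;

        int sum = 0, slowest = 0;
        for (int i = 0; i < n_stages; i++) {
            sum += stages[i];
            if (stages[i] > slowest) slowest = stages[i];
        }

        int sequential = n_tasks * sum;                  /* 4 * 90  = 360 min (6 hours)   */
        int pipelined  = sum + (n_tasks - 1) * slowest;  /* 90 + 3*40 = 210 min (3.5 hours) */

        printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
        return 0;
    }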
20 Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- DLX desirable features: all instructions the same length, registers located in the same place in the instruction format, memory operands only in loads or stores
21 5 Steps of DLX Datapath
  Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back
  [Datapath figure; IR holds the fetched instruction, LMD holds the data loaded from memory]
22 Pipelined DLX Datapath
  Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc. -> Memory Access -> Write Back
- Data stationary control
  - local decode for each instruction phase / pipeline stage
23 Visualizing Pipelining
  [Figure: instructions in program order (vertical) vs. time in clock cycles (horizontal), each instruction occupying successive pipeline stages in successive cycles]
24 It's Not That Easy for Computers
- If a complicated memory access occurs in stage 3, stage 4 will be delayed and the rest of the pipe is stalled.
- If there is a branch (if.., jump), then some of the instructions that have already entered the pipeline should not be processed.
- Need optimal conditions in order to keep the pipeline moving
25 Hazards prevent efficient pipelining
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
- Control hazards: pipelining of branches & other instructions that stall the pipeline
26 One Memory Port / Structural Hazards
  [Pipeline diagram over clock cycles: a Load followed by Instr 1-4; with a single memory port, a later instruction fetch conflicts with the Load's memory access]
27 One Memory Port / Structural Hazards
  [Pipeline diagram over clock cycles: a Load followed by Instr 1-3, with a one-cycle stall inserted before Instr 3 to resolve the conflict]
28 Can now build RISC cheaply
- Instruction cache to speed instruction fetch
- Dramatic memory size increases
- Better pipeline design
- Optimising compilers
- CISC made it difficult to build pipelines
resistant to stalls
29 Pipelining Summary
- Just overlap tasks - easy if the tasks are independent
- Speedup ~ Pipeline Depth if the ideal CPI is 1; in general
  Speedup = (Pipeline Depth / (1 + Pipeline stall CPI)) x (Clock Cycle Unpipelined / Clock Cycle Pipelined)
- Hazards limit pipeline performance
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
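A hedged worked instance of the formula above, with illustrative numbers only (a 5-stage pipeline, 0.25 stall cycles per instruction on average, and an unchanged clock cycle):

    \text{Speedup} = \frac{5}{1 + 0.25} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}} = 4 \times 1 = 4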
30 Floating Point pipelines
- RISC aims for 1 instruction/cycle
- Actually with FP can get more!
- Separate and more than one FP pipeline
- Superscalar
  - schedule operations for side-by-side execution at run time
  - IBM RS/6000, Supersparc, DEC AXP
- Speculative execution
31 5 minute Class Break
- 120 minutes straight is too long for me to lecture (1100-1300)
- 5 minutes: review last time, motivate this lecture
- 40 minutes: lecture - pipeline
- 5 minutes: break
- 40 minutes: lecture - memory hierarchy
- 5 minutes: summary of today's important topics
32 Memory Hierarchy
- To cope with modern fast CPUs we need memory that is large and fast (and economic)
  - interleave
  - wide memory bus
  - separate instruction cache
  - hierarchy
    - fast, expensive, volatile cache
    - slower, cheap, non-volatile disk
33 Random Access Memory
- DRAM
  - dynamic, charge based -- must keep refreshing
  - generally best price/performance
  - memory cycle time
- SRAM
  - static, preserved as long as power is on
  - sometimes (e.g. graphics) better for critical fast access
34 Who Cares About the Memory Hierarchy?
  [Figure: Processor-DRAM memory gap (latency), 1980-2000, log performance scale.
   CPU ("Moore's Law"): about 60%/yr (2x / 1.5 yrs). DRAM: about 9%/yr (2x / 10 yrs).
   The processor-memory performance gap grows about 50% per year.]
35 The Principle of Locality
- The Principle of Locality
  - Programs access a relatively small portion of the address space at any instant of time.
- Two Different Types of Locality
  - Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., array access)
- Last 15 years, HW relied on locality for speed
36 A Modern Memory Hierarchy
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at speeds offered by the fastest technology.
  [Figure: processor (control, datapath, registers) -> on-chip cache -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (disk/tape).
   Speed (ns): 1s -> 10s -> 100s -> 10,000,000s (10s ms) -> 10,000,000,000s (10s sec).
   Size (bytes): 100s -> Ks -> Ms -> Gs -> Ts.]
37 Levels of the Memory Hierarchy (upper levels are faster, lower levels are larger)

  Level           Capacity     Access time             Cost                    Staging / Xfer unit           Managed by
  CPU Registers   100s Bytes   <10s ns                 --                      Instr. operands, 1-8 bytes    prog./compiler
  Cache           K Bytes      10-100 ns               1-0.1 cents/bit         Blocks, 8-128 bytes           cache cntl
  Main Memory     M Bytes      200-500 ns              .0001-.00001 cents/bit  Pages, 512-4K bytes           OS
  Disk            G Bytes      10 ms (10,000,000 ns)   10^-6 - 10^-5 cents/bit Files, Mbytes                 user/operator
  Tape            infinite     sec-min                 10^-8 cents/bit         --                            --
38 Memory Hierarchy Terminology
- Hit: data appears in some block (e.g. X) in the upper level
  - Hit Rate: the fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data must be retrieved from a lower-level block (Y)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
  [Figure: block Blk X in upper-level memory, block Blk Y in lower-level memory, data moving to/from the processor]
39 Memory Hierarchy
- Cache - Main memory
- Cache miss
- Cache types
- Main Memory - Disk
- Virtual memory
- Paging
- TLB
40 Cache Measures
- Hit rate: fraction found in that level
  - So high that we usually talk about the miss rate instead
- Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
- Miss penalty: time to replace a block from the lower level, including time to replace in the CPU
  - access time: time to lower level = f(latency to lower level)
  - transfer time: time to transfer block = f(bandwidth between upper & lower levels)
- Hit Time << Miss Penalty (500 instructions on 21264!)
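A minimal sketch of the average memory-access time formula above; the hit time, miss rate and miss penalty values are illustrative assumptions, not course figures:

    #include <stdio.h>

    /* Average memory-access time (AMAT) = hit time + miss rate * miss penalty. */
    int main(void) {
        double hit_time     = 1.0;    /* cycles to access the cache on a hit  */
        double miss_rate    = 0.05;   /* fraction of accesses that miss       */
        double miss_penalty = 100.0;  /* cycles to fetch the block from below */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6 cycles */
        return 0;
    }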
41 Cache lines
- Arranged in lines, each holding a handful of adjacent main memory entries
- Neighbouring lines might contain data far apart in main memory
- Reading: hit/miss rates
- Writing (write policy)
  - line eventually gets replaced into main memory
  - multi-CPU systems must implement cache coherency
- Effective due to locality: can often use the rest of the information on a new line
42 Cache - Main Memory mapping
- Direct mapped
  - each main memory ref -> particular cache line
  - simplest
  - danger of thrashing
- Fully Associative
  - general table, store the main memory address
  - not much used (but sometimes in TLB)
- Set Associative
  - usually two-way or four-way
  - each main memory ref -> 2 (4) cache lines
43 Simplest Cache: Direct Mapped
  [Figure: a 16-location memory (addresses 0-F) mapped onto a 4-byte direct-mapped cache (cache index 0-3)]
- Location 0 can be occupied by data from
  - Memory location 0, 4, 8, ... etc.
  - In general, any memory location whose 2 LSBs of the address are 0s
  - Address<1:0> => cache index
- Which one should we place in the cache?
- How can we tell which one is in the cache?
44 1 KB Direct Mapped Cache, 32B blocks
- For a 2^N byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = 2^M)
  [Figure: a 32-bit address split into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g. 0x01) and Byte Select (bits 4-0, e.g. 0x00); each of the 32 cache lines stores a valid bit, the cache tag and 32 bytes of cache data (Byte 0 ... Byte 1023 across the whole cache)]
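A minimal C sketch of the address split shown above for this 1 KB, 32-byte-block direct-mapped cache; the field widths follow the slide, while the function name and example address are mine (the address is chosen so that it reproduces the slide's example values, tag 0x50, index 0x01, byte 0x00):

    #include <stdint.h>
    #include <stdio.h>

    /* 1 KB direct-mapped cache with 32-byte blocks:
     *   byte select = 5 bits (2^5 = 32 bytes per block)
     *   index       = 5 bits (1 KB / 32 B = 32 lines)
     *   tag         = remaining 22 bits of a 32-bit address */
    #define BLOCK_BITS 5
    #define INDEX_BITS 5

    static void split_address(uint32_t addr) {
        uint32_t byte_sel = addr & ((1u << BLOCK_BITS) - 1);
        uint32_t index    = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag      = addr >> (BLOCK_BITS + INDEX_BITS);
        printf("addr 0x%08x -> tag 0x%x, index %u, byte %u\n",
               addr, tag, index, byte_sel);
    }

    int main(void) {
        split_address(0x00014020);   /* -> tag 0x50, index 1, byte 0 */
        return 0;
    }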
45 Thrashing
- If alternate memory references point to the same cache line

      REAL*4 A(1024), B(1024)
      DO 10 I = 1,1024
        A(I) = A(I) + B(I)
 10   CONTINUE

  Arrays A and B take up exactly 4 kB, and are adjacent in memory. In a 4 kB direct-mapped cache, A(1) and B(1) will be mapped to the same cache line. References to A and B will alternately cause cache misses.
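The same access pattern sketched in C, as an illustration only; the thrashing argument assumes, as the slide does, that the two 4 kB arrays sit adjacent in memory and the direct-mapped cache is exactly 4 kB:

    /* Two adjacent 4 kB arrays: in a 4 kB direct-mapped cache,
     * a[i] and b[i] map to the same cache line, so each access
     * evicts the line the other array just loaded. */
    float a[1024], b[1024];

    void add_arrays(void) {
        for (int i = 0; i < 1024; i++)
            a[i] = a[i] + b[i];   /* a[i] and b[i] conflict on every iteration */
    }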
46 Set Associative Cache
- Two (or four) direct-mapped caches next to each other, making a single cache
- Now there are 2 (4) choices for where a particular main memory reference will appear in the cache
- Which choice to use?
  - Least recently used
  - Random
- Less likely to thrash
47 Two-way Set Associative Cache
- N-way set associative: N entries for each Cache Index
  - N direct-mapped caches operate in parallel (N typically 2 to 4)
- Example: two-way set associative cache
  - Cache Index selects a set from the cache
  - The two tags in the set are compared in parallel
  - Data is selected based on the tag result
  [Figure: each set holds two cache blocks (valid bit, cache tag, cache data); the address tag is compared against both tags, and a mux driven by the compare results selects the hitting cache block (Hit)]
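A minimal C sketch of the lookup the figure describes; the structure layout, number of sets and block size are my own assumptions:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One block (way) of a two-way set-associative cache. */
    struct block {
        bool     valid;
        uint32_t tag;
        uint8_t  data[32];       /* 32-byte blocks assumed */
    };

    #define NUM_SETS 16          /* assumed: 16 sets of 2 blocks each */

    struct block cache[NUM_SETS][2];

    /* Returns a pointer to the hitting block, or NULL on a miss. */
    struct block *lookup(uint32_t addr) {
        uint32_t index = (addr >> 5) % NUM_SETS;   /* bits 8-5 select the set    */
        uint32_t tag   = addr >> 9;                /* 5 offset bits + 4 index bits */

        /* Hardware compares both ways in parallel; here we just check both. */
        for (int way = 0; way < 2; way++) {
            struct block *b = &cache[index][way];
            if (b->valid && b->tag == tag)
                return b;                          /* hit  */
        }
        return NULL;                               /* miss */
    }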
48 Write Policy
- Write Through: write the information to both the block in cache and to the block in lower-level memory.
- Write Back: only write to the block in cache. The modified cache block is written to main memory only when replaced.
- What about cache coherency for multiprocessors?
- To reduce write stalls, use a write buffer
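A minimal sketch contrasting the two policies; the cache and memory helper functions are hypothetical placeholders, not a real API:

    #include <stdint.h>

    /* Hypothetical helpers standing in for the cache and main memory. */
    void cache_write(uint32_t addr, uint32_t value);
    void memory_write(uint32_t addr, uint32_t value);
    void mark_dirty(uint32_t addr);

    /* Write-through: update both levels on every store. */
    void store_write_through(uint32_t addr, uint32_t value) {
        cache_write(addr, value);
        memory_write(addr, value);       /* lower level always up to date */
    }

    /* Write-back: update only the cache; the dirty block is written to
     * main memory when it is eventually replaced. */
    void store_write_back(uint32_t addr, uint32_t value) {
        cache_write(addr, value);
        mark_dirty(addr);                /* flushed to memory on replacement */
    }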
49 The Cache Design Space
- Several interacting dimensions
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs write-back
  - write allocation
- The optimal choice is a compromise
  - depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins
  [Figure: qualitative good/bad curves versus cache size, associativity and block size]
50 Virtual Memory
- All processes think they have all the memory to themselves
- Virtual memory -> physical memory
- Pages (typical size 512 B - 16 kB, say 4 kB)
- Mapping in page table
- Virtual memory is 32-bit/64-bit - bigger than physical, some pages are on disk
- Size of page table
  - 32-bit CPU, 4 kB pages has > a million entries
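A hedged worked version of that last bullet, assuming a flat page table with one entry per virtual page:

    \frac{2^{32}\ \text{bytes of virtual address space}}{2^{12}\ \text{bytes per 4 kB page}} = 2^{20} \approx 10^{6}\ \text{page table entries}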
51 Virtual Memory History
- To run programs bigger than memory - overlays
  - program out-of-core by hand
  - overlay 1 completed, pulled overlay 2 from disk, etc.
- Found a way to automate this - make the OS control it
  - 1961 - virtual memory, paging
- See a book on OS
52 Address Map
  V = {0, 1, ..., n - 1}   virtual address space
  M = {0, 1, ..., m - 1}   physical address space   (n > m)

  MAP: V --> M U {0}   address mapping function
  MAP(a) = a'  if data at virtual address a is present at physical address a', a' in M
  MAP(a) = 0   if data at virtual address a is not present in M  (missing item fault)

  [Figure: the processor issues virtual address a in name space V; the address translation mechanism either returns physical address a' in main memory, or raises a missing-item fault which the fault handler services by having the OS transfer the item from secondary memory]
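A minimal C sketch of the MAP function above; the flat table and the use of 0 as the "not present" value are simplifications taken directly from the slide's notation:

    #include <stdint.h>

    #define NUM_PAGES 1024          /* assumed size of the virtual space, in pages */

    /* page_map[v] holds the physical frame for virtual page v, or 0 if the
     * page is not present in main memory (0 plays the role of the fault
     * value in MAP: V --> M U {0}). */
    static uint32_t page_map[NUM_PAGES];

    uint32_t map(uint32_t vpage) {
        uint32_t frame = page_map[vpage];
        if (frame == 0) {
            /* missing item fault: the OS would fetch the page from
             * secondary memory and update page_map here */
        }
        return frame;
    }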
53 Paging Organization
  [Figure: virtual memory is divided into 1K-byte pages (page 0 at V.A. 0, page 1 at 1024, ..., page 31 at 31744); physical memory is divided into 1K-byte frames (frame 0 at P.A. 0, frame 1 at 1024, ..., frame 7 at 7168). The page is the unit of mapping and also the unit of transfer from virtual to physical memory.]
  Address mapping: the VA is split into a page number and a 10-bit displacement. The page number indexes the page table (located in physical memory, found via the Page Table Base Reg); each entry holds a valid bit (V), access rights and the PA of the frame, which is combined with the displacement (actually, concatenation is more likely) to give the physical memory address.
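A minimal C sketch of the translation just described, using the slide's 1 KB pages and a flat page table; the struct layout and names are mine:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS 10                      /* 1 KB pages, as on the slide */
    #define PAGE_SIZE (1u << PAGE_BITS)

    struct pte {
        bool     valid;                       /* V bit                 */
        uint32_t frame;                       /* physical frame number */
        /* access rights omitted for brevity */
    };

    /* Flat page table indexed by virtual page number. */
    extern struct pte page_table[];

    /* Translate a virtual address to a physical address; a real system
     * would trap to the OS on a page fault instead of returning 0. */
    uint32_t translate(uint32_t va) {
        uint32_t vpn  = va >> PAGE_BITS;          /* page number              */
        uint32_t disp = va & (PAGE_SIZE - 1);     /* displacement within page */

        struct pte *e = &page_table[vpn];
        if (!e->valid)
            return 0;                             /* page fault               */

        return (e->frame << PAGE_BITS) | disp;    /* concatenation            */
    }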
54 VM problems
- Size of page table
- VM with a cache
  - It takes an extra memory access to translate VA to PA, and this makes cache access too expensive
55 Translation Lookaside Buffer (TLB)
- A way to speed up translation is to use a special cache of recently used page table entries: the Translation Lookaside Buffer, or TLB
- Based on locality of page references
- Like a cache on the page table mappings
- TLB access time is comparable to cache access time
56 Translation Lookaside Buffers
  TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
  [Figure: translation with a TLB. The CPU presents a VA to the TLB lookup; on a hit the PA goes directly to the cache, on a miss the full translation is performed first. The cache is then accessed, going to main memory only on a cache miss. Indicative times: TLB lookup 1/2 t, cache t, full translation 20 t.]
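A minimal C sketch of the lookup order in that figure (TLB first, full page-table walk only on a TLB miss); the helper functions are hypothetical placeholders:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers: a small TLB probe and the full page-table walk. */
    bool     tlb_lookup(uint32_t vpn, uint32_t *frame);   /* fast: ~1/2 t */
    uint32_t page_table_walk(uint32_t vpn);               /* slow: ~20 t  */
    void     tlb_insert(uint32_t vpn, uint32_t frame);

    #define PAGE_BITS 12

    uint32_t translate_with_tlb(uint32_t va) {
        uint32_t vpn  = va >> PAGE_BITS;
        uint32_t disp = va & ((1u << PAGE_BITS) - 1);
        uint32_t frame;

        if (!tlb_lookup(vpn, &frame)) {        /* TLB miss               */
            frame = page_table_walk(vpn);      /* consult the page table */
            tlb_insert(vpn, frame);            /* cache the mapping      */
        }
        return (frame << PAGE_BITS) | disp;
    }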
57 Summary: TLB, Virtual Memory
- Page tables map virtual addresses to physical addresses
- TLBs are important for fast translation
- TLB misses are significant in processor performance
58 Summary of Memory Hierarchy
- The Principle of Locality
  - Programs access a relatively small portion of the address space at any instant of time.
  - Temporal Locality, Spatial Locality
- Three Major Categories of Cache Misses
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Capacity Misses: increase cache size
  - Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: thrashing!
59 Summary
- Caches, TLBs and Virtual Memory are all understood by examining how they deal with 4 questions:
  - 1) Where can a block be placed?
  - 2) How is a block found?
  - 3) Which block is replaced on a miss?
  - 4) How are writes handled?