CS252 Graduate Computer Architecture Lecture 16: Cache Optimizations (Cont), Memory Technology


1
CS252 Graduate Computer Architecture
Lecture 16: Cache Optimizations (Cont), Memory Technology
  • John Kubiatowicz
  • Electrical Engineering and Computer Sciences
  • University of California, Berkeley
  • http://www.eecs.berkeley.edu/~kubitron/cs252
  • http://www-inst.eecs.berkeley.edu/~cs252

2
Review: Cache Performance
  • Miss-oriented approach to memory access:
    CPU time = IC × (CPI_Execution + Misses/Instruction × Miss Penalty) × Clock Cycle Time
  • Separating out the memory component entirely:
    CPU time = IC × (ALUOps/Instruction × CPI_ALUOps + MemAccesses/Instruction × AMAT) × Clock Cycle Time
  • AMAT = Average Memory Access Time:
    AMAT = Hit Time + Miss Rate × Miss Penalty
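    For example, with illustrative numbers not from the slide: a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles.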

3
Review: 6 Basic Cache Optimizations
  • Reducing hit time
  • 1. Avoiding Address Translation during Cache Indexing
  • E.g., overlap TLB and cache access, virtually addressed caches
  • Reducing Miss Penalty
  • 2. Giving Reads Priority over Writes
  • E.g., read completes before earlier writes in the write buffer
  • 3. Multilevel Caches
  • Reducing Miss Rate
  • 4. Larger Block size (compulsory misses)
  • 5. Larger Cache size (capacity misses)
  • 6. Higher Associativity (conflict misses)

4
1. Fast Hits by Avoiding Address Translation
  • Send virtual address to cache? Called Virtually Addressed Cache or just Virtual Cache, vs. Physical Cache
  • Every time a process is switched, logically must flush the cache; otherwise get false hits
  • Cost is time to flush + compulsory misses from the empty cache
  • Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
  • I/O must interact with the cache, so it needs the virtual address
  • Solution to aliases
  • HW guarantees that aliases agree in the bits covering the index field; with a direct-mapped cache they then map to a unique frame; called page coloring
  • Solution to cache flush
  • Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if wrong process

5
Two options for avoiding translation
[Figure: cache organizations. Conventional organization: physically addressed (indexed) cache, with the TLB translating VA to PA between CPU and cache. Variation A: cache indexed with the virtual address while the TLB runs in parallel, using PA tags; still physically indexed, and overlapping access with VA translation requires the index to remain invariant across translation. Variation B: virtually addressed L1, with translation applied before a physically addressed L2 and memory.]
6
3. Multi-level Caches
  • L2 equations:
    AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
    Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
    AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
  • Definitions:
  • Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  • Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
  • Global miss rate is what matters
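    For example, with illustrative numbers not from the slide: if L1 misses on 4% of CPU accesses and L2 misses on 50% of the accesses it sees, the local L2 miss rate is 50% but the global L2 miss rate is 0.04 × 0.50 = 2% of all CPU accesses.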

7
Review (Cont): 12 Advanced Cache Optimizations
  • Reducing hit time
  • Small and simple caches
  • Way prediction
  • Trace caches
  • Increasing cache bandwidth
  • Pipelined caches
  • Multibanked caches
  • Nonblocking caches
  • Reducing Miss Penalty
  • Critical word first
  • Merging write buffers
  • Reducing Miss Rate
  • Victim Cache
  • Hardware prefetching
  • Compiler prefetching
  • Compiler Optimizations

8
4. Increasing Cache Bandwidth by Pipelining
  • Pipeline cache access to maintain bandwidth, but with higher latency
  • Instruction cache access pipeline stages:
  • 1: Pentium
  • 2: Pentium Pro through Pentium III
  • 4: Pentium 4
  • ⇒ greater penalty on mispredicted branches
  • ⇒ more clock cycles between the issue of the load and the use of the data

9
5. Increasing Cache Bandwidth: Non-Blocking Caches
  • Non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  • requires F/E bits on registers or out-of-order execution
  • requires multi-bank memories
  • "hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
  • "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  • Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  • Requires multiple memory banks (otherwise cannot support)
  • Pentium Pro allows 4 outstanding memory misses

10
Value of Hit Under Miss for SPEC (old data)
[Chart: AMAT per benchmark for "hit under n misses", n = 0-1, 1-2, 2-64, and Base.]
  • FP programs on average: AMAT = 0.68 → 0.52 → 0.34 → 0.26
  • Int programs on average: AMAT = 0.24 → 0.20 → 0.19 → 0.19
  • 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss penalty, SPEC 92

11
6. Increasing Cache Bandwidth via Multiple Banks
  • Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  • E.g., the T1 (Niagara) L2 has 4 banks
  • Banking works best when accesses naturally spread themselves across banks ⇒ the mapping of addresses to banks affects the behavior of the memory system
  • A simple mapping that works well is sequential interleaving (see the sketch below)
  • Spread block addresses sequentially across banks
  • E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on
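A minimal C sketch of sequential interleaving (the bank count and the block-address type are illustrative assumptions, not taken from the slide):

    #include <stdio.h>

    #define NUM_BANKS 4   /* assumed power-of-two bank count, as in the T1 example */

    /* Sequential interleaving: consecutive block addresses land in consecutive banks. */
    static unsigned bank_of(unsigned block_addr)       { return block_addr % NUM_BANKS; }
    static unsigned index_in_bank(unsigned block_addr) { return block_addr / NUM_BANKS; }

    int main(void) {
        for (unsigned b = 0; b < 8; b++)   /* print the mapping for the first 8 blocks */
            printf("block %u -> bank %u, index %u\n", b, bank_of(b), index_in_bank(b));
        return 0;
    }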

12
7. Reduce Miss Penalty: Early Restart and Critical Word First
  • Don't wait for the full block before restarting the CPU
  • Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  • Spatial locality ⇒ tend to want the next sequential word, so the size of the benefit of just early restart is not clear
  • Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  • Long blocks more popular today ⇒ Critical Word First widely used

13
8. Merging Write Buffer to Reduce Miss Penalty
  • Write buffer allows the processor to continue while waiting for the write to memory
  • If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry
  • If so, the new data are combined with that entry (see the sketch below)
  • Increases the block size of writes for write-through caches of writes to sequential words/bytes, since multiword writes are more efficient to memory
  • The Sun T1 (Niagara) processor, among many others, uses write merging
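A minimal C sketch of the merge check (the entry layout, block width, and names are illustrative assumptions, not a real controller):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ENTRY 4   /* assumed write-buffer block width */

    struct wb_entry {
        bool     valid;
        uint64_t block_addr;                 /* address of the buffered block */
        uint32_t data[WORDS_PER_ENTRY];
        bool     word_valid[WORDS_PER_ENTRY];
    };

    /* Try to merge a one-word write into an existing valid entry;
       returns true if merged, false if a new entry must be allocated. */
    bool try_merge(struct wb_entry *buf, int n, uint64_t block_addr,
                   int word, uint32_t value) {
        for (int i = 0; i < n; i++) {
            if (buf[i].valid && buf[i].block_addr == block_addr) {
                buf[i].data[word] = value;     /* combine with the matching entry */
                buf[i].word_valid[word] = true;
                return true;
            }
        }
        return false;
    }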

14
9. Reducing Misses: a Victim Cache
  • How to combine the fast hit time of direct mapped yet still avoid conflict misses?
  • Add a buffer to hold data discarded from the cache
  • [Jouppi, 1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
  • Used in Alpha, HP machines

[Figure: victim cache organization. The direct-mapped cache's TAGS and DATA arrays sit beside a small fully associative buffer of four entries, each a tag-and-comparator paired with one cache line of data, connected to the next lower level in the hierarchy.]
15
10. Reducing Misses by Hardware Prefetching of Instructions & Data
  • Prefetching relies on having extra memory bandwidth that can be used without penalty
  • Instruction prefetching
  • Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  • The requested block is placed in the instruction cache when it returns, and the prefetched block is placed in the instruction stream buffer
  • Data prefetching
  • Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  • Prefetching is invoked on 2 successive L2 cache misses to a page, if the distance between those cache blocks is < 256 bytes

16
11. Reducing Misses by Software Prefetching of Data
  • Data prefetch:
  • Load data into register (HP PA-RISC loads)
  • Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9); see the sketch below
  • Special prefetching instructions cannot cause faults; a form of speculative execution
  • Issuing prefetch instructions takes time
  • Is the cost of prefetch issues < the savings in reduced misses?
  • Wider superscalar reduces the difficulty of issue bandwidth
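As a concrete illustration, GCC's __builtin_prefetch (a GCC/Clang extension, not the MIPS IV or SPARC V9 instruction itself) expresses a non-faulting cache prefetch; the array, loop, and prefetch distance here are illustrative assumptions:

    #define N 100000
    static double a[N];   /* hypothetical data array */

    double sum_with_prefetch(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++) {
            if (i + 16 < N)                            /* stay within the array */
                __builtin_prefetch(&a[i + 16], 0, 1);  /* 0 = read; low temporal locality */
            s += a[i];
        }
        return s;
    }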

17
12. Reducing Misses by Compiler Optimizations
  • [McFarling, 1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks, in software
  • Instructions
  • Reorder procedures in memory so as to reduce conflict misses
  • Profiling to look at conflicts (using tools they developed)
  • Data
  • Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  • Loop Interchange: change the nesting of loops to access data in the order stored in memory
  • Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
  • Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows

18
Merging Arrays Example
  • /* Before: 2 sequential arrays */
  • int val[SIZE];
  • int key[SIZE];
  • /* After: 1 array of structures */
  • struct merge {
  •   int val;
  •   int key;
  • };
  • struct merge merged_array[SIZE];
  • Reducing conflicts between val & key; improves spatial locality

19
Loop Interchange Example
  • /* Before */
  • for (k = 0; k < 100; k = k+1)
  •   for (j = 0; j < 100; j = j+1)
  •     for (i = 0; i < 5000; i = i+1)
  •       x[i][j] = 2 * x[i][j];
  • /* After */
  • for (k = 0; k < 100; k = k+1)
  •   for (i = 0; i < 5000; i = i+1)
  •     for (j = 0; j < 100; j = j+1)
  •       x[i][j] = 2 * x[i][j];
  • Sequential accesses instead of striding through memory every 100 words; improved spatial locality

20
Loop Fusion Example
  • /* Before */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •     a[i][j] = 1/b[i][j] * c[i][j];
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •     d[i][j] = a[i][j] + c[i][j];
  • /* After */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •   {
  •     a[i][j] = 1/b[i][j] * c[i][j];
  •     d[i][j] = a[i][j] + c[i][j];
  •   }
  • 2 misses per access to a & c vs. one miss per access; improves temporal locality

21
Blocking Example
  • /* Before */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •   {
  •     r = 0;
  •     for (k = 0; k < N; k = k+1)
  •       r = r + y[i][k]*z[k][j];
  •     x[i][j] = r;
  •   }
  • Two inner loops:
  • Read all N×N elements of z
  • Read N elements of 1 row of y repeatedly
  • Write N elements of 1 row of x
  • Capacity misses are a function of N and cache size:
  • 2N³ + N² words accessed (assuming no conflicts; otherwise ...)
  • Idea: compute on a B×B submatrix that fits in the cache

22
Blocking Example
  • /* After */
  • for (jj = 0; jj < N; jj = jj+B)
  •   for (kk = 0; kk < N; kk = kk+B)
  •     for (i = 0; i < N; i = i+1)
  •       for (j = jj; j < min(jj+B, N); j = j+1)
  •       {
  •         r = 0;
  •         for (k = kk; k < min(kk+B, N); k = k+1)
  •           r = r + y[i][k]*z[k][j];
  •         x[i][j] = x[i][j] + r;
  •       }
  • B is called the Blocking Factor
  • Capacity misses drop from 2N³ + N² to 2N³/B + N²
  • Conflict misses, too?

23
Reducing Conflict Misses by Blocking
  • Conflict misses in caches that are not fully associative vs. blocking size
  • [Lam et al., 1991]: a blocking factor of 24 had a fifth the misses vs. 48, despite both fitting in the cache

24
Summary of Compiler Optimizations to Reduce Cache
Misses (by hand)
25
Compiler Optimization vs. Memory Hierarchy Search
  • Compiler tries to figure out memory hierarchy optimizations
  • New approach: Auto-tuners: first run variations of the program on a computer to find the best combinations of optimizations (blocking, padding, ...) and algorithms, then produce C code to be compiled for that computer
  • Auto-tuners targeted to numerical methods
  • E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW

26
Sparse Matrix: Search for Blocking
for finite element problem [Im, Yelick, Vuduc, 2005]
27
Best Sparse Blocking for 8 Computers
[Figure: grid of row block size (r) vs. column block size (c), each axis spanning 1, 2, 4, 8, showing the best block shape chosen per machine.]
  • All possible column block sizes selected for the 8 computers; how could a compiler know?

28
(No Transcript)
29
AMD Opteron Memory Hierarchy
  • 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz and fastest memory of PC3200 DDR SDRAM
  • 48-bit virtual and 40-bit physical addresses
  • I and D cache: 64 KB, 2-way set associative, 64-B blocks, LRU
  • L2 cache: 1 MB, 16-way, 64-B blocks, pseudo-LRU
  • Data and L2 caches use write back, write allocate
  • L1 caches are virtually indexed and physically tagged
  • L1 I TLB and L1 D TLB: fully associative, 40 entries
  • 32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages
  • L2 I TLB and L2 D TLB: 4-way, 512 entries of 4 KB pages
  • Memory controller allows up to 10 cache misses
  • 8 from D cache and 2 from I cache

30
Opteron Memory Hierarchy Performance
  • For SPEC2000:
  • I cache misses per instruction are 0.01% to 0.09%
  • D cache misses per instruction are 1.34% to 1.43%
  • L2 cache misses per instruction are 0.23% to 0.36%
  • Commercial benchmark (TPC-C-like):
  • I cache misses per instruction are 1.83% (100X!)
  • D cache misses per instruction are 1.39% (about the same)
  • L2 cache misses per instruction are 0.62% (2X to 3X)
  • How does this compare to the ideal CPI of 0.33?

31
CPI Breakdown for Integer Programs
  • CPI above base attributable to memory ≈ 50%
  • L2 cache misses ≈ 25% overall (≈ 50% of the memory CPI)
  • Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound

32
CPI Breakdown for Floating Point Programs
  • CPI above base attributable to memory ≈ 60%
  • L2 cache misses ≈ 40% overall (≈ 70% of the memory CPI)
  • Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound

33
Pentium 4 vs. Opteron Memory Hierarchy
Clock rates for this comparison are from 2005; faster versions existed
34
Misses Per Instruction: Pentium 4 vs. Opteron
[Chart: P4 miss ratios relative to Opteron, from 0.5X (Pentium better) up to 2.3X and 3.4X (Opteron better).]
  • D cache misses: P4 is 2.3X to 3.4X vs. Opteron
  • L2 cache misses: P4 is 0.5X to 1.5X vs. Opteron
  • Note: same ISA, but not the same instruction count

35
Fallacies and Pitfalls
  • Not delivering high memory bandwidth in a cache-based system
  • 10 fastest computers at the Stream benchmark [McCalpin, 2005]
  • Only 4/10 computers rely on data caches, and their memory BW per processor is 7X to 25X slower than the NEC SX7's

36
Main Memory Background
  • Performance of main memory:
  • Latency: cache miss penalty
  • Access time: time between request and word arrival
  • Cycle time: time between requests
  • Bandwidth: I/O & large block miss penalty (L2)
  • Main memory is DRAM: Dynamic Random Access Memory
  • Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
  • Addresses divided into 2 halves (memory as a 2D matrix):
  • RAS or Row Address Strobe
  • CAS or Column Address Strobe
  • Cache uses SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor); size: DRAM/SRAM ≈ 4-8; cost/cycle time: SRAM/DRAM ≈ 8-16

37
Main Memory Deep Background
  • "Out-of-Core", "In-Core", "Core Dump"?
  • "Core memory"?
  • Non-volatile, magnetic
  • Lost to 4 Kbit DRAM (today we use 512 Mbit DRAM)
  • Access time 750 ns, cycle time 1500-3000 ns

38
Core Memories (1950s & 60s)
The first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine.
  • Core memory stored data as magnetization in iron rings
  • Iron cores woven into a 2-dimensional mesh of wires
  • Origin of the term "Dump Core"
  • Rumor that IBM consulted the Life Savers company
  • See http://www.columbia.edu/acis/history/core.html

39
DRAM logical organization (4 Mbit)
[Figure: 4 Mbit DRAM organized as a 2,048 × 2,048 memory array, with a row decoder driving the word lines, sense amps and a column decoder on the bit lines for I/O (D in, Q out), and a 1-transistor storage cell at each word line/bit line crossing; the address pins are shared between row and column.]
  • Square root of bits per RAS/CAS

40
Quest for DRAM Performance
  • Fast page mode
  • Add timing signals that allow repeated accesses to the row buffer without another row access time
  • Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access
  • Synchronous DRAM (SDRAM)
  • Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
  • Double Data Rate (DDR SDRAM)
  • Transfer data on both the rising edge and falling edge of the DRAM clock signal ⇒ doubling the peak data rate
  • DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates, up to 400 MHz
  • DDR3 drops to 1.5 volts; higher clock rates, up to 800 MHz
  • Improved bandwidth, not latency

41
DRAM name based on peak chip transfers/sec; DIMM name based on peak DIMM MBytes/sec
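For example (standard JEDEC naming, given here as background rather than taken from the slide): a DDR2 chip clocked at 400 MHz performs 800 M transfers/s and is named DDR2-800; an 8-byte-wide DIMM built from such chips moves 800 M transfers/s × 8 bytes = 6400 MB/s and is named PC2-6400.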
42
Classical DRAM Organization (square)
[Figure: square RAM cell array with bit (data) lines running vertically and a row decoder driving the word (row) select lines; each intersection is a 1-T DRAM cell. A column selector with I/O circuits takes the column address and connects one bit line to the data pin.]
  • Row and column address together select 1 bit at a time
43
Review: 1-T Memory Cell (DRAM)
[Figure: 1-transistor cell; the row select line gates the transistor connecting the storage capacitor to the bit line.]
  • Write:
  • 1. Drive bit line
  • 2. Select row
  • Read:
  • 1. Precharge bit line to Vdd/2
  • 2. Select row
  • 3. Cell and bit line share charge
  • Very small voltage changes on the bit line
  • 4. Sense (fancy sense amp)
  • Can detect changes of ~1 million electrons
  • 5. Write: restore the value
  • Refresh:
  • 1. Just do a dummy read to every cell
44
DRAM Capacitors: more capacitance in a small area
  • Trench capacitors:
  • Logic ABOVE capacitor
  • Gain in surface area of capacitor
  • Better scaling properties
  • Better planarization
  • Stacked capacitors:
  • Logic BELOW capacitor
  • Gain in surface area of capacitor
  • 2-dim cross-section quite small

45
DRAM Read Timing
  • Every DRAM access begins with the assertion of RAS_L
  • 2 ways to read: early or late v. CAS
[Timing diagram: the DRAM read cycle time spans two accesses; the address pins carry row address, then column address (junk in between); WE_L stays deasserted, OE_L enables the data pins, and D goes from high-Z to data out after the read access time plus the output enable delay. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L.]
46
4 Key DRAM Timing Parameters
  • tRAC: minimum time from RAS line falling to valid data output
  • Quoted as the speed of a DRAM on the purchase sheet when you buy it
  • A typical 4 Mbit DRAM tRAC = 60 ns
  • tRC: minimum time from the start of one row access to the start of the next
  • tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
  • tCAC: minimum time from CAS line falling to valid data output
  • 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
  • tPC: minimum time from the start of one column access to the start of the next
  • 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
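For instance, using the parameters above as a back-of-the-envelope illustration: reading four words from the same open row with page mode takes about tRAC + 3 × tPC = 60 + 3 × 35 = 165 ns, whereas four independent row accesses are cycle-time limited at about 4 × tRC = 440 ns.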

47
Main Memory Performance
[Timing diagram: access time vs. cycle time.]
  • DRAM (read/write) cycle time >> DRAM (read/write) access time
  • Ratio is roughly 2:1; why?
  • DRAM (read/write) cycle time:
  • How frequently can you initiate an access?
  • Analogy: a little kid can only ask his father for money on Saturday
  • DRAM (read/write) access time:
  • How quickly will you get what you want once you initiate an access?
  • Analogy: as soon as he asks, his father will give him the money
  • DRAM bandwidth limitation analogy:
  • What happens if he runs out of money on Wednesday?

48
Increasing Bandwidth: Interleaving
[Figure: access pattern without interleaving: the CPU waits on a single memory bank, starting the access for D2 only after D1 is available. With 4-way interleaving across Memory Banks 0-3, the CPU accesses bank 0, then banks 1, 2, 3 in successive cycles, and can access bank 0 again by the time its cycle completes.]
49
Main Memory Performance
  • Wide:
  • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
  • Interleaved:
  • CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
  • Simple:
  • CPU, Cache, Bus, Memory same width (32 bits)

50
Main Memory Performance
  • Timing model:
  • 1 cycle to send the address,
  • 4 for access time, 10 cycle time, 1 to send data
  • Cache block is 4 words
  • Simple M.P. = 4 × (1 + 10 + 1) = 48
  • Wide M.P. = 1 + 10 + 1 = 12
  • Interleaved M.P. = 1 + 10 + 4 × 1 = 15

51
Avoiding Bank Conflicts
  • Lots of banks
  • int x[256][512];
  • for (j = 0; j < 512; j = j+1)
  •   for (i = 0; i < 256; i = i+1)
  •     x[i][j] = 2 * x[i][j];
  • Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
  • SW: loop interchange, or declaring the array not a power of 2 in size ("array padding"); see the sketch below
  • HW: prime number of banks
  • bank number = address mod number of banks
  • address within bank = address / number of words in bank
  • modulo & divide per memory access with a prime number of banks?
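A minimal C sketch of the two software fixes (the padding constant follows the usual not-a-power-of-2 idiom; loop bounds as in the example above):

    int x[256][513];   /* array padding: 513 shares no factor with 128 banks */

    void scale(void) {
        int i, j;
        /* loop interchange: i outer, j inner -> consecutive words, spread across banks */
        for (i = 0; i < 256; i = i+1)
            for (j = 0; j < 512; j = j+1)
                x[i][j] = 2 * x[i][j];
    }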

52
Finding Bank Number and Address within a Bank
  • Problem: We want to determine the number of banks, Nb, to use, and the number of words to store in each bank, Wb, such that:
  • given a word address x, it is easy to find the bank where x will be found, B(x), and the address of x within the bank, A(x);
  • for any address x, B(x) and A(x) are unique;
  • the number of bank conflicts is minimized.

53
Finding Bank Number and Address within a Bank
Solution: We will use the following relation to determine the bank number for x, B(x), and the address of x within the bank, A(x):
  B(x) = x MOD Nb
  A(x) = x MOD Wb
and we will choose Nb and Wb to be co-prime, i.e., there is no prime number that is a factor of both Nb and Wb (this condition is satisfied if we choose Nb to be a prime number that is equal to an integer power of two minus 1). We can then use the Chinese Remainder Theorem to show that B(x) and A(x) are always unique.
54
Fast Bank Number
  • Chinese Remainder Theorem: As long as two sets of integers ai and bi follow these rules:
    x = bi MOD ai,  0 ≤ bi < ai,  0 ≤ x < a0 × a1 × ...
  • and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
  • bank number = b0, number of banks = a0
  • address within bank = b1, number of words in bank = a1
  • N word addresses 0 to N-1, a prime number of banks, words per bank a power of 2
  • 3 banks: Nb = 3, and 8 words per bank: Wb = 8

                       Seq. Interleaved      Modulo Interleaved
Bank Number:            0    1    2           0    1    2
Address within Bank:
  0                     0    1    2           0   16    8
  1                     3    4    5           9    1   17
  2                     6    7    8          18   10    2
  3                     9   10   11           3   19   11
  4                    12   13   14          12    4   20
  5                    15   16   17          21   13    5
  6                    18   19   20           6   22   14
  7                    21   22   23          15    7   23
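A quick C sketch (an illustrative helper, not from the slide) that regenerates the modulo-interleaved columns above for Nb = 3 banks and Wb = 8 words per bank:

    #include <stdio.h>

    int main(void) {
        int table[8][3];                  /* [address within bank][bank] */
        for (int x = 0; x < 24; x++)      /* word addresses 0..23 */
            table[x % 8][x % 3] = x;      /* A(x) = x MOD 8, B(x) = x MOD 3 */
        for (int a = 0; a < 8; a++)       /* each cell is filled exactly once (CRT) */
            printf("%d: %2d %2d %2d\n", a, table[a][0], table[a][1], table[a][2]);
        return 0;
    }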
55
Fast Memory Systems: DRAM specific
  • Multiple CAS accesses: several names (page mode)
  • Extended Data Out (EDO): 30% faster in page mode
  • New DRAMs to address the gap: what will they cost, will they survive?
  • RAMBUS: startup company; reinvented the DRAM interface
  • Each chip a module vs. a slice of memory
  • Short bus between CPU and chips
  • Does own refresh
  • Variable amount of data returned
  • 1 byte / 2 ns (500 MB/s per chip)
  • Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to the system clock (66-150 MHz)
  • Intel claims RAMBUS Direct (16 b wide) is the future of PC memory
  • Niche memory or main memory?
  • e.g., Video RAM for frame buffers: DRAM + fast serial output

56
Fast Page Mode Operation
  • Regular DRAM organization:
  • N rows × N columns × M bits
  • Read & write M bits at a time
  • Each M-bit access requires a RAS/CAS cycle
  • Fast Page Mode DRAM:
  • N × M SRAM to save a row
  • After a row is read into the register:
  • Only CAS is needed to access other M-bit blocks on that row
  • RAS_L remains asserted while CAS_L is toggled
[Figure: DRAM array of N rows with an N × M SRAM row register; the row address selects a row, then successive column addresses pick M-bit blocks from the register for output.]
57
Something new: Structure of a Tunneling Magnetic Junction
  • Tunneling Magnetic Junction RAM (TMJ-RAM)
  • Speed of SRAM, density of DRAM, non-volatile (no refresh)
  • "Spintronics": combination of quantum spin and electronics
  • Same technology used in high-density disk drives

58
MEMS-based Storage
  • Magnetic "sled" floats on an array of read/write heads
  • Approx 250 Gbit/in²
  • Data rates: IBM: 250 MB/s with 1000 heads; CMU: 3.1 MB/s with 400 heads
  • Electrostatic actuators move media around to align it with heads
  • Sweep sled 50 µm in ...
  • Capacity estimated to be 1-10 GB in 10 cm²
See Ganger et al.: http://www.lcs.ece.cmu.edu/research/MEMS
59
Big storage (such as DRAM/disk): Potential for Errors!
  • Motivation:
  • DRAM is dense ⇒ signals are easily disturbed
  • High capacity ⇒ higher probability of failure
  • Approach: Redundancy
  • Add extra information so that we can recover from errors
  • Can we do better than just creating complete copies?
  • Block codes: data coded in blocks
  • k data bits coded into n encoded bits
  • Measure of overhead: rate of code = k/n
  • Often called an (n,k) code
  • Consider data as vectors in GF(2), i.e. vectors of bits
  • Code space is the set of all 2^n vectors; data space is the set of 2^k vectors
  • Encoding function: C = f(d)
  • Decoding function: d = f(C)
  • Not all possible code vectors, C, are valid!

60
Need for Error Correction!
  • Motivation:
  • Failures/time proportional to number of bits!
  • As DRAM cells shrink, they become more vulnerable
  • Went through a period in which the failure rate was low enough without error correction that people didn't do correction
  • DRAM banks too large now
  • Servers always have corrected memory systems
  • Basic idea: add redundancy through parity bits (see the sketch below)
  • Common configuration: random error correction
  • SEC-DED (single error correct, double error detect)
  • One example: 64 data bits + 8 parity bits (11% overhead)
  • Really want to handle failures of physical components as well
  • Organization is multiple DRAMs/DIMM, multiple DIMMs
  • Want to recover from failed DRAM and failed DIMM!
  • "Chip kill": handle failures of the width of a single DRAM chip
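A tiny C illustration of the redundancy idea: a single even-parity bit over a byte. SEC-DED itself spreads several such parity checks over overlapping bit groups; this is just the building block, not the full (72,64) code:

    #include <stdint.h>

    /* Even parity: XOR of all bits of d, folded down to one bit. */
    uint8_t parity8(uint8_t d) {
        d ^= d >> 4;
        d ^= d >> 2;
        d ^= d >> 1;
        return d & 1;
    }

    /* A single-bit error flips the parity, so a recomputed parity that
       disagrees with the stored bit detects (but cannot locate) it. */
    int detect_error(uint8_t data, uint8_t stored_parity) {
        return parity8(data) != stored_parity;
    }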

61
General Idea: Code Vector Space
  • Not every vector in the code space is valid
  • Hamming Distance (d): (see the sketch below)
  • Minimum number of bit flips to turn one code word into another
  • Number of errors that we can detect: (d - 1)
  • Number of errors that we can fix: ½(d - 1)
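A minimal C sketch of Hamming distance between two code words (the 32-bit word width is an illustrative assumption):

    #include <stdint.h>

    /* Hamming distance: the number of bit positions in which a and b differ. */
    int hamming_distance(uint32_t a, uint32_t b) {
        uint32_t diff = a ^ b;   /* set bits mark differing positions */
        int d = 0;
        while (diff) {           /* count the set bits */
            d += diff & 1;
            diff >>= 1;
        }
        return d;
    }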

62
Conclusion
  • Memory wall inspires optimizations since so much performance is lost there
  • Reducing hit time: small and simple caches, way prediction, trace caches
  • Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
  • Reducing miss penalty: critical word first, merging write buffers
  • Reducing miss rate: compiler optimizations
  • Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
  • Auto-tuners: search replacing static compilation to explore the optimization space?