Title: CS252 Graduate Computer Architecture, Lecture 16: Cache Optimizations (Cont), Memory Technology
1. CS252 Graduate Computer Architecture
Lecture 16: Cache Optimizations (Cont), Memory Technology
- John Kubiatowicz
- Electrical Engineering and Computer Sciences
- University of California, Berkeley
- http://www.eecs.berkeley.edu/~kubitron/cs252
- http://www-inst.eecs.berkeley.edu/~cs252
2. Review: Cache Performance
- Miss-oriented approach to memory access:
  CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
- Separating out the memory component entirely:
  CPU time = IC x (ALU ops per instruction x CPI_ALUops + Memory accesses per instruction x AMAT) x Clock cycle time
- AMAT = Average Memory Access Time = Hit time + Miss rate x Miss penalty
3. Review: 6 Basic Cache Optimizations
- Reducing hit time
  - 1. Avoiding Address Translation during Cache Indexing
    - E.g., overlap TLB and cache access, Virtually Addressed Caches
- Reducing Miss Penalty
  - 2. Giving Reads Priority over Writes
    - E.g., read completes before earlier writes still sitting in the write buffer
  - 3. Multilevel Caches
- Reducing Miss Rate
  - 4. Larger Block size (Compulsory misses)
  - 5. Larger Cache size (Capacity misses)
  - 6. Higher Associativity (Conflict misses)
4. 1: Fast Hits by Avoiding Address Translation
- Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. a Physical Cache
- Every time a process is switched, the cache logically must be flushed; otherwise you get false hits
  - Cost is the time to flush plus compulsory misses from the empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O must interact with the cache, so it needs the virtual address
- Solution to aliases
  - HW guarantee: if the cache index is covered by the page offset and the cache is direct mapped, aliases map to the same block and so must be unique; the software version is called page coloring
- Solution to cache flush
  - Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the wrong process
5. Two Options for Avoiding Translation
- [Figure: block diagrams (CPU, TLB "TB", cache, L2, MEM, with VA/PA marked at each interface) comparing the conventional physically addressed (indexed) organization against two variations. One variation overlaps cache access with VA translation by using PA tags while remaining physically indexed, which requires the index to stay invariant across translation.]
6. 3: Multi-Level Caches
- L2 equations:
  - AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  - Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  - AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
- Definitions:
  - Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  - Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
  - Global miss rate is what matters (a numeric sketch follows below)
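Since the slide leaves the arithmetic implicit, here is a minimal C sketch that plugs hypothetical L1/L2 parameters (illustrative values, not from the slide) into the equations above and also reports the global L2 miss rate:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical parameters, chosen only for illustration. */
        double hit_time_L1  = 1.0;    /* cycles                      */
        double miss_rate_L1 = 0.05;   /* local = global for L1       */
        double hit_time_L2  = 10.0;   /* cycles                      */
        double miss_rate_L2 = 0.40;   /* local miss rate of L2       */
        double miss_pen_L2  = 100.0;  /* cycles to main memory       */

        /* Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2) */
        double miss_pen_L1 = hit_time_L2 + miss_rate_L2 * miss_pen_L2;

        /* AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1) */
        double amat = hit_time_L1 + miss_rate_L1 * miss_pen_L1;

        /* Global L2 miss rate = Miss Rate(L1) x Miss Rate(L2) */
        double global_L2 = miss_rate_L1 * miss_rate_L2;

        printf("Miss Penalty(L1)    = %.1f cycles\n", miss_pen_L1);
        printf("AMAT                = %.1f cycles\n", amat);
        printf("Global L2 miss rate = %.3f\n", global_L2);
        return 0;
    }

With these numbers, AMAT = 1 + 0.05 x (10 + 0.4 x 100) = 3.5 cycles, and the global L2 miss rate is 2% even though the local L2 miss rate is 40%.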
7. Review (Cont): 12 Advanced Cache Optimizations
- Reducing hit time
- Small and simple caches
- Way prediction
- Trace caches
- Increasing cache bandwidth
- Pipelined caches
- Multibanked caches
- Nonblocking caches
- Reducing Miss Penalty
- Critical word first
- Merging write buffers
- Reducing Miss Rate
- Victim Cache
- Hardware prefetching
- Compiler prefetching
- Compiler Optimizations
8. 4: Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to maintain bandwidth, but at higher latency
- Instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- => greater penalty on mispredicted branches
- => more clock cycles between the issue of the load and the use of the data
9. 5: Increasing Cache Bandwidth: Non-Blocking Caches
- A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - requires F/E bits on registers or out-of-order execution
  - requires multi-bank memories
- "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (see the MSHR sketch below)
  - Requires multiple memory banks (otherwise this cannot be supported)
  - Pentium Pro allows 4 outstanding memory misses
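The slide does not say how the controller tracks outstanding misses; one common structure (not named on the slide) is a set of miss status holding registers (MSHRs). A simplified C sketch follows; the field names are invented, and secondary misses to an already-pending block are not handled:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHR 4   /* e.g., Pentium Pro allowed 4 outstanding misses */

    /* One miss status holding register: one outstanding miss to a block. */
    typedef struct {
        bool     valid;       /* entry in use                              */
        uint64_t block_addr;  /* block being fetched from memory           */
        int      dest_reg;    /* register waiting on the data (simplified) */
    } mshr_t;

    static mshr_t mshr[NUM_MSHR];

    /* On a cache miss: allocate an MSHR if one is free, else stall.
     * Returns the index of the allocated entry, or -1 if all are busy. */
    int allocate_mshr(uint64_t block_addr, int dest_reg) {
        for (int i = 0; i < NUM_MSHR; i++) {
            if (!mshr[i].valid) {
                mshr[i].valid = true;
                mshr[i].block_addr = block_addr;
                mshr[i].dest_reg = dest_reg;
                return i;          /* cache can keep servicing hits */
            }
        }
        return -1;                 /* no free MSHR: structural stall */
    }

    /* When memory returns a block, free the matching entry. */
    void complete_miss(uint64_t block_addr) {
        for (int i = 0; i < NUM_MSHR; i++) {
            if (mshr[i].valid && mshr[i].block_addr == block_addr)
                mshr[i].valid = false;
        }
    }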
10. Value of Hit Under Miss for SPEC (old data)
- [Figure: AMAT for "hit under n misses" with n = 0->1, 1->2, 2->64, and the base case]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss, SPEC 92
11. 6: Increasing Cache Bandwidth via Multiple Banks
- Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  - E.g., the T1 (Niagara) L2 has 4 banks
- Banking works best when accesses naturally spread themselves across banks => the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving:
  - Spread block addresses sequentially across banks
  - E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on (a small sketch of this mapping follows below)
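A minimal C sketch of the sequential-interleaving mapping just described; the 4-bank figure and the block-address granularity are assumptions taken from the example above, not a description of any specific machine:

    #include <stdint.h>

    #define NUM_BANKS 4   /* e.g., the T1 (Niagara) L2 has 4 banks */

    /* Sequential interleaving: spread block addresses across banks. */
    static inline unsigned bank_of(uint64_t block_addr) {
        return (unsigned)(block_addr % NUM_BANKS);   /* bank = address mod 4 */
    }

    static inline uint64_t index_within_bank(uint64_t block_addr) {
        return block_addr / NUM_BANKS;               /* position inside the bank */
    }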
12. 7: Reduce Miss Penalty: Early Restart and Critical Word First
- Don't wait for the full block before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Spatial locality => the CPU tends to want the next sequential word, so the benefit of early restart alone is not clear
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
  - Long blocks are more popular today => Critical Word First is widely used
13. 8: Merging Write Buffer to Reduce Miss Penalty
- The write buffer allows the processor to continue while waiting for the write to memory
- If the buffer contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry
- If so, the new data are combined with that entry (a sketch of this check follows below)
- Increases the effective block size of writes for a write-through cache when writes are to sequential words or bytes, since multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging
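As a rough illustration of the address check and merge described above, here is a hedged C sketch; the entry count, block size, and field names are invented for the example and do not describe any particular processor's buffer:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define ENTRIES     4   /* write-buffer entries                       */
    #define WORDS_PER_E 4   /* 4 words (16 B) per entry, illustrative     */

    typedef struct {
        bool     valid;                   /* entry holds a pending write   */
        uint64_t block_addr;              /* aligned address of the block  */
        bool     word_valid[WORDS_PER_E]; /* which words have been written */
        uint32_t data[WORDS_PER_E];
    } wb_entry_t;

    static wb_entry_t wb[ENTRIES];

    /* On a store: if an existing valid entry covers the same block,
     * merge the new word into it instead of taking a fresh entry.     */
    bool write_buffer_insert(uint64_t addr, uint32_t value) {
        uint64_t block = addr / (WORDS_PER_E * 4);   /* 4-byte words */
        unsigned word  = (addr / 4) % WORDS_PER_E;

        for (int i = 0; i < ENTRIES; i++) {          /* try to merge */
            if (wb[i].valid && wb[i].block_addr == block) {
                wb[i].data[word] = value;
                wb[i].word_valid[word] = true;
                return true;
            }
        }
        for (int i = 0; i < ENTRIES; i++) {          /* else allocate */
            if (!wb[i].valid) {
                memset(&wb[i], 0, sizeof wb[i]);
                wb[i].valid = true;
                wb[i].block_addr = block;
                wb[i].data[word] = value;
                wb[i].word_valid[word] = true;
                return true;
            }
        }
        return false;   /* buffer full: the processor must stall */
    }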
14. 9: Reducing Misses: a Victim Cache
- How can we combine the fast hit time of a direct-mapped cache yet still avoid conflict misses?
- Add a small buffer that holds data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
- Used in Alpha, HP machines
- [Figure: a direct-mapped cache (TAGS, DATA) backed by a small fully associative victim cache of four entries, each holding a tag-and-comparator plus one cache line of data, connected to the next lower level in the hierarchy]
15. 10: Reducing Misses by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction prefetching
  - Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
- Data prefetching
  - The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is small
16. 11: Reducing Misses by Software Prefetching of Data
- Data prefetch
  - Load data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of issuing the prefetches less than the savings in reduced misses?
  - Wider superscalar issue reduces the difficulty of finding issue bandwidth (a prefetch sketch follows below)
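As one concrete example of a non-faulting cache prefetch inserted by the compiler or programmer, the sketch below uses the GCC/Clang builtin __builtin_prefetch; the loop, arrays, and prefetch distance of 16 iterations are illustrative assumptions, not from the slide, and the distance would normally be tuned to the miss latency:

    /* Compile with GCC or Clang; __builtin_prefetch is a compiler builtin,
     * not an instruction defined by the slide's ISAs.                      */
    void scale(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n) {
                __builtin_prefetch(&b[i + 16], 0, 1);  /* will be read     */
                __builtin_prefetch(&a[i + 16], 1, 1);  /* will be written  */
            }
            a[i] = 2.0 * b[i];
        }
    }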
17. 12: Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data
  - Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 separate arrays
  - Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop Fusion: combine 2 independent loops that have the same looping and some variables in common
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
18. Merging Arrays Example
- /* Before: 2 sequential arrays */
- int val[SIZE];
- int key[SIZE];
- /* After: 1 array of structures */
- struct merge {
-   int val;
-   int key;
- };
- struct merge merged_array[SIZE];
- Reduces conflicts between val & key; improves spatial locality
19. Loop Interchange Example
- /* Before */
- for (k = 0; k < 100; k = k+1)
-   for (j = 0; j < 100; j = j+1)
-     for (i = 0; i < 5000; i = i+1)
-       x[i][j] = 2 * x[i][j];
- /* After */
- for (k = 0; k < 100; k = k+1)
-   for (i = 0; i < 5000; i = i+1)
-     for (j = 0; j < 100; j = j+1)
-       x[i][j] = 2 * x[i][j];
- Sequential accesses instead of striding through memory every 100 words; improved spatial locality
20. Loop Fusion Example
- /* Before */
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-     a[i][j] = 1/b[i][j] * c[i][j];
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-     d[i][j] = a[i][j] + c[i][j];
- /* After */
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-   {  a[i][j] = 1/b[i][j] * c[i][j];
-      d[i][j] = a[i][j] + c[i][j]; }
- 2 misses per access to a & c vs. one miss per access; improves locality
21. Blocking Example
- /* Before */
- for (i = 0; i < N; i = i+1)
-   for (j = 0; j < N; j = j+1)
-   {  r = 0;
-      for (k = 0; k < N; k = k+1)
-        r = r + y[i][k] * z[k][j];
-      x[i][j] = r;
-   };
- Two inner loops:
  - Read all N x N elements of z
  - Read N elements of 1 row of y repeatedly
  - Write N elements of 1 row of x
- Capacity misses are a function of N & cache size:
  - 2N^3 + N^2 words accessed (assuming no conflicts; otherwise more)
- Idea: compute on a B x B submatrix that fits in the cache
22. Blocking Example
- /* After */
- for (jj = 0; jj < N; jj = jj+B)
-   for (kk = 0; kk < N; kk = kk+B)
-     for (i = 0; i < N; i = i+1)
-       for (j = jj; j < min(jj+B, N); j = j+1)
-       {  r = 0;
-          for (k = kk; k < min(kk+B, N); k = k+1)
-            r = r + y[i][k] * z[k][j];
-          x[i][j] = x[i][j] + r;
-       };
- B is called the Blocking Factor
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
- Conflict misses, too?
23. Reducing Conflict Misses by Blocking
- Conflict misses occur in caches that are not fully associative and depend on the blocking size
- Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, even though both fit in the cache
24. Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
25. Compiler Optimization vs. Memory Hierarchy Search
- The compiler tries to figure out memory hierarchy optimizations
- New approach: "auto-tuners" first run variations of the program on a computer to find the best combinations of optimizations (blocking, padding, ...) and algorithms, then produce C code to be compiled for that computer
- Auto-tuners targeted to numerical methods
  - E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW
26. Sparse Matrix: Search for Blocking (for a finite element problem) [Im, Yelick, Vuduc, 2005]
27. Best Sparse Blocking for 8 Computers
- [Figure: grid of best register block sizes, row block size (r) in {1, 2, 4, 8} vs. column block size (c) in {1, 2, 4, 8}, one point per machine]
- All possible column block sizes were selected across the 8 computers. How could a compiler know?
29. AMD Opteron Memory Hierarchy
- 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz; fastest memory is PC3200 DDR SDRAM
- 48-bit virtual and 40-bit physical addresses
- I and D caches: 64 KB each, 2-way set associative, 64-B blocks, LRU
- L2 cache: 1 MB, 16-way, 64-B blocks, pseudo-LRU
- Data and L2 caches use write back, write allocate
- L1 caches are virtually indexed and physically tagged
- L1 I TLB and L1 D TLB: fully associative, 40 entries
  - 32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages
- L2 I TLB and L2 D TLB: 4-way, 512 entries for 4 KB pages
- Memory controller allows up to 10 cache misses
  - 8 from the D cache and 2 from the I cache
30. Opteron Memory Hierarchy Performance
- For SPEC2000:
  - I cache misses per instruction: 0.01% to 0.09%
  - D cache misses per instruction: 1.34% to 1.43%
  - L2 cache misses per instruction: 0.23% to 0.36%
- Commercial benchmark ("TPC-C-like"):
  - I cache misses per instruction: 1.83% (100X!)
  - D cache misses per instruction: 1.39% (about the same)
  - L2 cache misses per instruction: 0.62% (2X to 3X)
- How does this compare to an ideal CPI of 0.33?
31. CPI Breakdown for Integer Programs
- CPI above base attributable to memory: about 50%
- L2 cache misses: about 25% overall (about 50% of the memory CPI)
- Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound
32. CPI Breakdown for Floating Point Programs
- CPI above base attributable to memory: about 60%
- L2 cache misses: about 40% overall (about 70% of the memory CPI)
- Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound
33. Pentium 4 vs. Opteron Memory Hierarchy
- Clock rates are for this 2005 comparison; faster versions existed
34. Misses Per Instruction: Pentium 4 vs. Opteron
- [Figure: ratio of Pentium 4 to Opteron misses per instruction per benchmark; ratios above 1 (2.3X to 3.4X) mean the Opteron is better, ratios below 1 (0.5X) mean the Pentium is better]
- D cache misses: P4 is 2.3X to 3.4X vs. Opteron
- L2 cache misses: P4 is 0.5X to 1.5X vs. Opteron
- Note: same ISA, but not the same instruction count!
35. Fallacies and Pitfalls
- Not delivering high memory bandwidth in a cache-based system
  - Of the 10 fastest computers on the Stream benchmark [McCalpin 2005], only 4 rely on data caches, and their memory BW per processor is 7X to 25X slower than the NEC SX-7
36. Main Memory Background
- Performance of main memory:
  - Latency: cache miss penalty
    - Access time: time between the request and the word arriving
    - Cycle time: time between requests
  - Bandwidth: I/O & large-block miss penalty (L2)
- Main memory is DRAM: Dynamic Random Access Memory
  - Dynamic since it needs to be refreshed periodically (every 8 ms, roughly 1% of the time)
  - Addresses divided into 2 halves (memory as a 2D matrix):
    - RAS, or Row Address Strobe
    - CAS, or Column Address Strobe
- Caches use SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM is about 4-8; cost and cycle time: SRAM/DRAM is about 8-16
37. Main Memory Deep Background
- "Out-of-Core", "In-Core", "Core Dump"?
- "Core memory"?
- Non-volatile, magnetic
- Lost to 4 Kbit DRAM (today we use 512 Mbit DRAM)
- Access time 750 ns, cycle time 1500-3000 ns
38. Core Memories (1950s & 60s)
- The first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine
- Core memory stored data as magnetization in iron rings
  - Iron "cores" woven into a 2-dimensional mesh of wires
  - Origin of the term "Dump Core"
  - Rumor that IBM consulted the Life Savers company
- See http://www.columbia.edu/acis/history/core.html
39. DRAM Logical Organization (4 Mbit)
- [Figure: address lines feed a 2,048 x 2,048 memory array of storage cells; a word line selects a row, the sense amps / I/O read it out, and the column decoder picks the bit for the data pins D and Q]
- Square root of the bits per RAS/CAS
40. Quest for DRAM Performance
- Fast Page Mode
  - Add timing signals that allow repeated accesses to the row buffer without another row access time
  - Such a buffer comes naturally, as each array buffers 1024 to 2048 bits on each access
- Synchronous DRAM (SDRAM)
  - Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
- Double Data Rate (DDR SDRAM)
  - Transfer data on both the rising edge and the falling edge of the DRAM clock signal => doubling the peak data rate
  - DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates, up to 400 MHz
  - DDR3 drops to 1.5 volts, with higher clock rates up to 800 MHz
- Improved bandwidth, not latency
41. DRAM name is based on peak chip transfers per second; DIMM name is based on peak DIMM MBytes per second
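- Worked example of the naming rule (using the PC3200 DIMMs mentioned on the Opteron slide): a DDR-400 chip peaks at 400 M transfers/sec, and a 64-bit (8-byte) DIMM built from such chips peaks at 400 M x 8 B = 3200 MB/sec, hence the DIMM name PC3200.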
42. Classical DRAM Organization (square)
- [Figure: a square RAM cell array; a row decoder driven by the row address asserts a word (row) select line, bit (data) lines run through the array, and a column selector / I/O circuit driven by the column address picks the data bit; each intersection of a word line and a bit line is a 1-T DRAM cell]
- Row and column address together select 1 bit at a time
43. Review: 1-T Memory Cell (DRAM)
- [Figure: a single transistor, gated by the row select line, connects the storage capacitor to the bit line]
- Write:
  - 1. Drive the bit line
  - 2. Select the row
- Read:
  - 1. Precharge the bit line to Vdd/2
  - 2. Select the row
  - 3. The cell and bit line share charge
    - Very small voltage change on the bit line
  - 4. Sense (fancy sense amp)
    - Can detect changes of about 1 million electrons
  - 5. Write: restore the value
- Refresh:
  - 1. Just do a dummy read of every cell
44. DRAM Capacitors: more capacitance in a small area
- Trench capacitors:
  - Logic ABOVE the capacitor
  - Gain in surface area of the capacitor
  - Better scaling properties
  - Better planarization
- Stacked capacitors:
  - Logic BELOW the capacitor
  - Gain in surface area of the capacitor
  - 2-dim cross-section quite small
45. DRAM Read Timing
- Every DRAM access begins with the assertion of RAS_L
- 2 ways to read: early or late vs. CAS
- [Timing diagram: a DRAM read cycle showing RAS_L and CAS_L, the address bus multiplexing the row address then the column address, WE_L, OE_L, and the data bus going from high-Z to Data Out; annotated with the read access time, the output-enable delay, and the DRAM read cycle time]
- Early read cycle: OE_L asserted before CAS_L
- Late read cycle: OE_L asserted after CAS_L
46. Four Key DRAM Timing Parameters
- tRAC: minimum time from the RAS line falling to valid data output
  - Quoted as the speed of a DRAM when you buy it, since it is on the purchase sheet
  - A typical 4 Mbit DRAM has tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from the CAS line falling to valid data output
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
47. Main Memory Performance
- [Figure: timeline showing that the access time (request to data) is shorter than the cycle time (request to next request)]
- DRAM (read/write) cycle time is much longer than DRAM (read/write) access time
  - Roughly 2:1; why?
- DRAM (read/write) cycle time:
  - How frequently can you initiate an access?
  - Analogy: a little kid can only ask his father for money on Saturday
- DRAM (read/write) access time:
  - How quickly will you get what you want once you initiate an access?
  - Analogy: as soon as he asks, his father gives him the money
- DRAM bandwidth limitation analogy:
  - What happens if he runs out of money on Wednesday?
48. Increasing Bandwidth: Interleaving
- [Figure: access pattern without interleaving: the CPU must wait for D1 to become available from the single memory before starting the access for D2. Access pattern with 4-way interleaving: accesses to banks 0, 1, 2, 3 are started back-to-back, and by the time bank 3 has been accessed, bank 0 can be accessed again]
49. Main Memory Performance
- Wide: CPU/Mux is 1 word; Mux/cache, bus, and memory are N words (Alpha: 64 bits & 256 bits)
- Interleaved: CPU, cache, and bus are 1 word; memory is N modules (4 modules in the example); the example is word interleaved
- Simple: CPU, cache, bus, and memory are all the same width (32 bits)
50. Main Memory Performance
- Timing model:
  - 1 cycle to send the address
  - 4 cycles access time, 10 cycles cycle time, 1 cycle to send data
  - Cache block is 4 words
- Simple M.P. = 4 x (1 + 10 + 1) = 48
- Wide M.P. = 1 + 10 + 1 = 12
- Interleaved M.P. = 1 + 10 + 4 x 1 = 15
51. Avoiding Bank Conflicts
- Lots of banks
- int x[256][512];
- for (j = 0; j < 512; j = j+1)
-   for (i = 0; i < 256; i = i+1)
-     x[i][j] = 2 * x[i][j];
- Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict on the same bank
- SW: loop interchange, or declaring the array dimension to not be a power of 2 ("array padding")
- HW: prime number of banks
  - bank number = address mod number of banks
  - address within bank = address / number of words in bank
  - a modulo & divide per memory access with a prime number of banks?
52. Finding the Bank Number and Address within a Bank
- Problem: we want to determine the number of banks, Nb, to use and the number of words to store in each bank, Wb, such that:
  - given a word address x, it is easy to find the bank where x will be found, B(x), and the address of x within the bank, A(x)
  - for any address x, B(x) and A(x) are unique
  - the number of bank conflicts is minimized
53. Finding the Bank Number and Address within a Bank
- Solution: use the following relations to determine the bank number for x, B(x), and the address of x within the bank, A(x):
  - B(x) = x MOD Nb
  - A(x) = x MOD Wb
- Choose Nb and Wb to be co-prime, i.e., no prime number is a factor of both Nb and Wb (this condition is satisfied if we choose Nb to be a prime number that is equal to an integer power of two minus 1)
- We can then use the Chinese Remainder Theorem to show that the pair B(x), A(x) is always unique
54. Fast Bank Number
- Chinese Remainder Theorem: as long as two sets of integers ai and bi follow these rules:
  - x = bi mod ai, with 0 <= bi < ai and 0 <= x < a0 x a1 x ...,
  - and ai and aj are co-prime whenever i != j,
  then the integer x has only one solution (an unambiguous mapping)
- bank number = b0, number of banks = a0
- address within bank = b1, number of words in bank = a1
- N-word address 0 to N-1, a prime number of banks, words per bank a power of 2
- Example: 3 banks (Nb = 3) and 8 words per bank (Wb = 8)
                     Seq. Interleaved       Modulo Interleaved
  Bank Number:        0    1    2            0    1    2
  Address
  within Bank   0     0    1    2            0   16    8
                1     3    4    5            9    1   17
                2     6    7    8           18   10    2
                3     9   10   11            3   19   11
                4    12   13   14           12    4   20
                5    15   16   17           21   13    5
                6    18   19   20            6   22   14
                7    21   22   23           15    7   23
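A small C sketch that reproduces the table above for Nb = 3 banks and Wb = 8 words per bank. Note that both schemes compute the same bank number; modulo interleaving gets the address within the bank from x mod 8 (the low 3 bits) instead of a divide:

    #include <stdio.h>

    /* Nb = 3 (prime, = 2^2 - 1) and Wb = 8, as on the slide above.
     * Because 3 and 8 are co-prime, the (bank, offset) pair is unique
     * for every word address 0..23 (Chinese Remainder Theorem).        */
    #define NB 3
    #define WB 8

    int main(void) {
        printf("addr  seq(bank,off)  modulo(bank,off)\n");
        for (int x = 0; x < NB * WB; x++) {
            int seq_bank = x % NB, seq_off = x / NB;  /* sequential interleaving */
            int mod_bank = x % NB, mod_off = x % WB;  /* modulo interleaving     */
            printf("%4d     (%d,%d)           (%d,%d)\n",
                   x, seq_bank, seq_off, mod_bank, mod_off);
        }
        return 0;
    }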
55. Fast Memory Systems: DRAM-specific
- Multiple CAS accesses: several names ("page mode")
  - Extended Data Out (EDO): 30% faster in page mode
- Newer DRAMs to address the gap; what will they cost, and will they survive?
  - RAMBUS: startup company; reinvented the DRAM interface
    - Each chip is a module vs. a slice of memory
    - Short bus between the CPU and the chips
    - Does its own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
  - Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66 - 150 MHz)
  - Intel claims Direct RAMBUS (16 bits wide) is the future of PC memory
- Niche memory or main memory?
  - e.g., Video RAM for frame buffers: DRAM + fast serial output
56. Fast Page Mode Operation
- Regular DRAM organization:
  - N rows x N columns x M bits
  - Read & write M bits at a time
  - Each M-bit access requires a RAS/CAS cycle
- Fast Page Mode DRAM
  - An N x M "SRAM" to save a row
- After a row is read into the register:
  - Only CAS is needed to access other M-bit blocks on that row
  - RAS_L remains asserted while CAS_L is toggled
- [Figure: DRAM array of N rows with an N x M SRAM row buffer; the row address selects a row, and successive column addresses read M-bit outputs from the buffer]
57. Something New: Structure of a Tunneling Magnetic Junction
- Tunneling Magnetic Junction RAM (TMJ-RAM)
  - Speed of SRAM, density of DRAM, non-volatile (no refresh)
  - "Spintronics": combination of quantum spin and electronics
  - Same technology used in high-density disk drives
58. MEMS-based Storage
- A magnetic "sled" floats on an array of read/write heads
  - Approx 250 Gbit/in^2
  - Data rates: IBM: 250 MB/s with 1000 heads; CMU: 3.1 MB/s with 400 heads
- Electrostatic actuators move the media around to align it with the heads
  - Sweep the sled 50 µm
- Capacity estimated to be in the 1-10 GB range in 10 cm^2
- See Ganger et al., http://www.lcs.ece.cmu.edu/research/MEMS
59. Big Storage (such as DRAM/Disk): Potential for Errors!
- Motivation:
  - DRAM is dense => signals are easily disturbed
  - High capacity => higher probability of failure
- Approach: redundancy
  - Add extra information so that we can recover from errors
  - Can we do better than just creating complete copies?
- Block codes: data coded in blocks
  - k data bits coded into n encoded bits
  - Measure of overhead: rate of the code, k/n
  - Often called an (n, k) code
  - Consider data as vectors in GF(2), i.e., vectors of bits
  - The code space is the set of all 2^n vectors; the data space is the set of 2^k vectors
  - Encoding function: C = f(d)
  - Decoding function: d = f(C)
  - Not all possible code vectors, C, are valid!
60. Need for Error Correction!
- Motivation:
  - Failures per unit time are proportional to the number of bits!
  - As DRAM cells shrink, they become more vulnerable
- Went through a period in which the failure rate was low enough without error correction that people didn't do correction
  - DRAM banks are too large now
  - Servers have always had corrected memory systems
- Basic idea: add redundancy through parity bits
  - Common configuration: random error correction
    - SEC-DED (single error correct, double error detect)
    - One example: 64 data bits + 8 parity bits (11% overhead)
  - Really want to handle failures of physical components as well
    - Organization is multiple DRAMs per DIMM, multiple DIMMs
    - Want to recover from a failed DRAM and a failed DIMM!
    - "Chip kill": handle failures of the width of a single DRAM chip
61. General Idea: Code Vector Space
- Not every vector in the code space is valid
- Hamming Distance (d):
  - Minimum number of bit flips to turn one code word into another
- Number of errors that we can detect: (d - 1)
- Number of errors that we can fix: floor((d - 1)/2)
- (A small sketch of this arithmetic follows below)
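A tiny C sketch of the Hamming-distance arithmetic above. The two code words are made-up values at distance 4, chosen only to illustrate the detect/correct bounds; a real code's d is the minimum distance over all pairs of valid code words:

    #include <stdint.h>
    #include <stdio.h>

    /* Hamming distance between two code words = number of differing bits. */
    static int hamming_distance(uint64_t a, uint64_t b) {
        uint64_t diff = a ^ b;
        int d = 0;
        while (diff) { d += diff & 1; diff >>= 1; }
        return d;
    }

    int main(void) {
        /* Two hypothetical valid code words at distance 4. */
        uint64_t c1 = 0x0F, c2 = 0x33;
        int d = hamming_distance(c1, c2);          /* d = 4 here */
        printf("distance d = %d\n", d);
        printf("can detect up to %d bit errors\n", d - 1);
        printf("can correct up to %d bit errors\n", (d - 1) / 2);
        return 0;
    }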
62. Conclusion
- The memory wall inspires optimizations, since so much performance is lost there
  - Reducing hit time: small and simple caches, way prediction, trace caches
  - Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
  - Reducing miss penalty: critical word first, merging write buffers
  - Reducing miss rate: compiler optimizations
  - Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
- "Auto-tuners": search replacing static compilation to explore the optimization space?