1. Memory System Performance (Chapter 3)
- Dagoberto A. R. Justo
- PPGMAp
- UFRGS
2. Introduction
- The clock rate of your CPU alone does not determine its performance
- Nowadays, memory speed is becoming the limitation
- Hardware designers are creating architectures that try to overcome memory speed limitations
- However, the hardware is designed to be efficient only under certain models of how the programs running on it are written
- Thus, careful program design is essential to obtain high performance
- We introduce these issues by looking at simple matrix operations and modeling their performance, given certain characteristics of the CPU architecture
3. Memory Issues -- Definitions
- Consider an architecture with several caches
- How long does it take to receive a word in a particular cache, once requested?
- How many words can be retrieved per unit of time, once the first one is received?
- Assume we request a word of memory and receive a block of data of size b words containing the desired word
- Latency
- The time l to receive the first word after the request (usually in nanoseconds)
- Bandwidth
- The number of words received per unit of time with one request (usually in millions of words per second); a small worked example follows this list
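As a rough illustration (not from the slides), the time to receive a block of b words can be modeled as the latency plus the remaining words delivered at the bandwidth rate. The Fortran sketch below uses assumed numbers (100 ns latency, one word per 1 ns cycle once the transfer has started) that match the pipelined bus discussed later.

  program transfer_time
    implicit none
    real :: l, bw, t
    integer :: b
    l  = 100.0   ! assumed latency in ns
    bw = 1.0     ! assumed bandwidth in words per ns once the transfer has started
    b  = 4       ! block (cache line) size in words
    ! simple model: first word arrives after l ns, each remaining word at the bandwidth rate
    t = l + real(b - 1) / bw
    print '(a,i0,a,f6.1,a)', 'block of ', b, ' words received after ', t, ' ns'
  end program transfer_time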
4. Hypothetical Machine 1 (no Cache)
- Clock rate 1 GHz (1 cycle = 1 nanosecond, 1 ns)
- Two multiply-add units
- Can perform 2 multiply and 2 add operations per cycle
- Fast CPU -- 4 flops per cycle, so the peak performance is 4 Gflops
- Latency 100 ns
- It takes 100 cycles x 1 ns to obtain a word once requested
- Block size 1 word (8 bytes = 64 bits)
- Thus, the bandwidth is 10 megawords per second (Mwords/s)
- However, it is very slow in practice
- Consider a dot product operation (a code sketch follows the diagram below)
- Each step is a multiply and an add, accessing 2 elements of 2 vectors and accumulating the result in a register, say s
- 2 elements arrive from memory every 200 ns and the 2 operations are performed in 1 cycle
- Thus, the machine takes about 100.5 ns per flop
- That is about 10 Mflops -- a factor of 400 slower than the peak
(Timing diagram: Fetch 1 word, Fetch 1 word, then 2 flops; each fetch takes 100 cycles)
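For reference, a minimal Fortran sketch of the dot product analyzed above (an illustration, not code from the course): each iteration loads a(i) and b(i) and performs one multiply and one add, which is why the loop runs at memory speed on this machine.

  ! dot product: 2 memory accesses and 2 flops (one multiply-add) per iteration
  function dot(n, a, b) result(s)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: a(n), b(n)
    double precision :: s
    integer :: i
    s = 0.0d0
    do i = 1, n
       ! on Machine 1 each of a(i) and b(i) pays the full 100 ns latency,
       ! so the two flops below wait about 200 ns for their operands
       s = s + a(i) * b(i)
    end do
  end function dot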
5. Hypothetical Machine 2 (Cache, BUT The Problem is Different)
- Consider the matrix multiplication problem
- Cache size 32 Kbytes
- Block size 1 word (8 bytes)
- Two things are different here
- The problem is different
- The dot product performs, on average, 1 operation per data item: 2n operands, 2n operations
- Matrix multiplication has data reuse: O(n^3) operations for O(n^2) data
- This machine has a cache; its line (block) size here is 1 word (it is increased to 4 for Machine 3)
- A cache is an intermediate storage area which the CPU accesses in 1 cycle (low latency) but which stores sufficient data to take advantage of data reuse -- the important aspect of matrix multiplication
6. Hypothetical Machine 2 (continued)
- Same processor, clock rate 1 GHz (1 ns)
- Latency 100 ns (from memory to cache)
- Block size 1 word
- Let n = 32 and let A, B, C be 32 x 32 matrices. Consider multiplying C = AB (a code sketch follows the diagram below)
- Each matrix needs 1024 words; 3 matrices x 1024 words x 8 bytes per word = 24 KB, which fits entirely in the 32 KB cache
- Time to fetch the 2 input matrices into the cache: 2048 words x 100 ns = 204.8 µs
- Perform 2n^3 operations for the matrix multiply (2 x 32^3 = 64K ops)
- Time: 2 x 32^3 / 4 ns, about 16.4 µs (4 flops per cycle)
- Thus, the flop rate is 64K ops / (204.8 + 16.4) µs, or about 296 Mflops
- A lot better than 10 Mflops, but nowhere near the peak of 4 Gflops
- The notion of reusing data in this way is called temporal locality (many operations on the same data occur close in time)
(Timing diagram: Fetch 1 word, Fetch 1 word, Fetch 1 word, then groups of 4 flops; each fetch takes 100 cycles)
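A minimal Fortran sketch of the multiplication analyzed above (an illustration, not code from the slides): once A and B are resident in cache, every element is reused n times, which is the temporal locality the slide refers to. The loop order is chosen for column-major access.

  ! C = A*B for n = 32: 2*n**3 flops on 3*n**2 words of data (temporal locality)
  subroutine matmul32(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), b(n,n)
    double precision, intent(out) :: c(n,n)
    integer :: i, j, k
    c = 0.0d0
    do j = 1, n           ! one column of C and B at a time
       do k = 1, n
          do i = 1, n     ! inner loop walks down columns of A and C (stride 1)
             c(i,j) = c(i,j) + a(i,k) * b(k,j)   ! each element reused n times
          end do
       end do
    end do
  end subroutine matmul32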
7. Hypothetical Machine 3 (Increase the Memory/Cache Bandwidth)
- Increase the block size b from 1 word to 4 words
- As stated, this implies that the data path from memory to cache is 128 bits wide (4 x 32 bits/word)
- For the dot product algorithm,
- A request for a(1) brings in a(1:4)
- The a(1) takes 100 ns, but a(2), a(3), and a(4) arrive at the same time
- Similarly, a request for b(1) brings in b(1:4)
- The request for b(1) is issued one cycle after that for a(1), but the bus is busy bringing a(1:4) into the cache
- Thus, after 201 ns the dot product computation starts and proceeds 1 cycle at a time, completing at a(4) and b(4)
- Next, the request for a(5) brings in a(5:8), and so on
- Thus, the CPU performs approximately 8 flops every 204 ns or so, i.e. about 1 operation per 25 ns, or 40 Mflops
8. Hypothetical Machine 3 -- Analyzed In Terms of Cache-Hit Ratio
- Cache hit ratio: the number of memory accesses found in cache / the total number of memory accesses
- In this case, the first of every 4 accesses is a miss and the remaining 3 are hits (successes)
- Thus, the cache hit ratio is 75%
- Assume the dominant overhead is the misses
- Then 25% of the memory latency is the average overhead per access, i.e. 25 ns (25% of the 100 ns memory latency)
- Because the dot product performs one operation per word accessed, this also works out to 40 Mflops
- A more accurate estimate is (0.75 x 1 + 0.25 x 100) ns/word (reproduced in the sketch below)
- That is 25.75 ns per word, or about 38.8 Mflops
9. Actual Implementations
- 128-bit wide buses are expensive
- The usual implementation is to pipeline a 32-bit bus so that the words in the line (block) arrive on successive clock cycles after the first item is received
- That is, instead of 4 words received after a 100 ns latency, the 4 items arrive after 100 + 3 ns
- However, a multiply-add operation can start after each item arrives, so the result is the same -- that is, 40 or so Mflops
(Timing diagrams: one 128-bit fetch followed by four groups of 4 flops, versus four pipelined 32-bit fetches each followed by 4 flops; the first fetch takes 100 cycles)
10. Spatial Locality Issue
- It is clear that the dot product takes advantage of the consecutiveness of the elements of the vectors
- This is called spatial locality -- the elements are close together in memory
- Consider a different problem: the matrix-vector multiplication y = Ax (see the sketch after this list)
- In C (row-major storage), the elements of a column are not consecutive in memory
- That is, they are separated by the length of a row (the number of columns)
- In this case, the accesses are not spatially local, and essentially all accesses to every element of every column are cache misses
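As an illustration (assuming Fortran's column-major storage, described on the next slide), the sketch below computes y = Ax with the inner loop running down a column of A, so consecutive accesses are adjacent in memory; this is illustrative code, not from the slides.

  ! y = A*x with the inner loop running down a column of A (stride 1 in Fortran)
  subroutine matvec(n, a, x, y)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), x(n)
    double precision, intent(out) :: y(n)
    integer :: i, j
    y = 0.0d0
    do j = 1, n
       do i = 1, n
          ! a(i,j) and a(i+1,j) are adjacent in memory; with the loops swapped
          ! (dot products of rows), consecutive accesses would be n words apart
          y(i) = y(i) + a(i,j) * x(j)
       end do
    end do
  end subroutine matvec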
11. In Fortran
- The matrix A is stored by columns in memory (word addresses in hexadecimal; the sketch after this list reproduces them)
- 8000, A(1,1)
- 8001, A(2,1)
- 8002, A(3,1)
- 8003, A(4,1)
- 8004, A(1,2)
- 8005, A(2,2)
- 8006, A(3,2)
- 8007, A(4,2)
- 8008, A(1,3)
- 8009, A(2,3)
- 800A, A(3,3)
- 800B, A(4,3)
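The addresses above follow the usual column-major formula. A small Fortran sketch (not from the slides) that reproduces them for a 4 x 3 matrix with an assumed base word address of hexadecimal 8000:

  program colmajor_address
    implicit none
    integer, parameter :: nrow = 4, ncol = 3
    integer, parameter :: base = 32768       ! hexadecimal 8000, assumed base address
    integer :: i, j, addr
    do j = 1, ncol
       do i = 1, nrow
          ! column-major: A(i,j) lives at base + (i-1) + (j-1)*nrow (in words)
          addr = base + (i - 1) + (j - 1) * nrow
          print '(z4,2x,a,i0,a,i0,a)', addr, 'A(', i, ',', j, ')'
       end do
    end do
  end program colmajor_address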
12. Sum All Elements of a Matrix
- Consider the problem of computing the sum of all the elements of a 1000 x 1000 matrix B

  S = 0.D0
  do i = 1, 1000
     do j = 1, 1000
        S = S + B(i,j)
     end do
  end do

- This code performs very poorly
- s fits in cache
- Since the inner loop varies the column index j, consecutive accesses B(i,j) and B(i,j+1) are far apart in memory
- They are unlikely to be in the same cache line
- Every access experiences the maximum latency delay (100 ns)
13. Sum All Elements of a Matrix
- Changing the order of the loops:

  S = 0.D0
  do j = 1, 1000
     do i = 1, 1000
        S = S + B(i,j)
     end do
  end do

- Now the inner loop accesses B by columns
- The elements in a column are consecutive in memory, so one memory access brings in multiple elements at a time (4 for our model machine), and the performance is reasonable for this machine (a timing sketch comparing the two orders follows)
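A sketch (assumed, not from the course materials) that times the two loop orders from the last two slides with system_clock; on a cache-based machine the column-order loop is typically several times faster.

  program sum_orders
    implicit none
    integer, parameter :: n = 1000
    double precision, allocatable :: b(:,:)
    double precision :: s
    integer :: i, j, t0, t1, rate
    allocate(b(n,n))
    call random_number(b)
    call system_clock(t0, rate)
    s = 0.0d0
    do i = 1, n              ! inner loop over j: stride of n words through memory
       do j = 1, n
          s = s + b(i,j)
       end do
    end do
    call system_clock(t1)
    print *, 'i-outer (strided) :', real(t1 - t0) / real(rate), ' s, sum =', s
    call system_clock(t0)
    s = 0.0d0
    do j = 1, n              ! inner loop over i: stride 1 (Fortran column storage)
       do i = 1, n
          s = s + b(i,j)
       end do
    end do
    call system_clock(t1)
    print *, 'j-outer (stride-1):', real(t1 - t0) / real(rate), ' s, sum =', s
  end program sum_orders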
14. Peak Floating Point Performance vs Peak Memory Bandwidth
- The performance issue: the peak floating point performance is bounded by the peak memory bandwidth
- For fast microprocessors, it is about 100 Mflops per MByte/s of bandwidth
- For large scale vector processors, it is about 1 Mflop per MByte/s of bandwidth
- Solve the problem by modifying the algorithms to hide the large memory latencies
- Some compilers can transform simple codes to obtain better performance
- These modifications are typically unnecessary, but they don't hurt the computation and sometimes help
15. Hiding Memory Latency
- Consider the example of getting information from the Internet using a browser
- What can you do to reduce the wait time?
- While reading one page, we anticipate the next pages we will read and begin fetches for them in advance
- This corresponds to pre-fetching pages in anticipation of their being read
- We open multiple browser windows and begin accesses in each window in parallel
- This corresponds to multiple threads running in parallel
- We request many pages in order
- This corresponds to pipelining with spatial locality
16. Multi-threading To Hide Memory Latency
- Consider the matrix-vector multiplication c = Ab
- Each row-by-vector inner product is an independent computation -- thus, create a different thread for each computation, as in the following pseudocode (a concrete sketch follows this list):

  do k = 1, n
     c(k) = create_thread( dot_product, A(k,:), b )
  end do

- As separate threads:
- On the first cycle, the first thread accesses the first pair of data elements for the first row and waits for the data to arrive
- On the second cycle, the second thread accesses the first pair of elements for the second row and waits for the data to arrive
- And so on, until l units of time (the latency) have passed
- Then the first thread performs a computation and next requests more data
- Then the second thread performs a computation and requests more data
- And so on, so that after the first latency of l cycles, a computation is performed on every cycle
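The create_thread call above is pseudocode; one concrete (assumed) way to express the same idea in Fortran is an OpenMP parallel loop, where each row-by-vector product can run in its own thread and outstanding memory requests from different threads overlap.

  ! c = A*b with one independent row-times-vector product per loop iteration
  ! (sketch using an OpenMP thread team in place of the create_thread pseudocode)
  subroutine matvec_threads(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), b(n)
    double precision, intent(out) :: c(n)
    integer :: k
    !$omp parallel do
    do k = 1, n
       ! while one thread waits on its cache misses, other threads can compute
       c(k) = dot_product(a(k,:), b)
    end do
    !$omp end parallel do
  end subroutine matvec_threads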
17. Multithreading, block size = 1
(Timing diagram: successive threads issue their fetches, e.g. A(1,1) and A(1,2), then A(2,1) and A(2,2), and so on; after the initial latency a flop is completed on every cycle while the other threads' fetches are in flight)
18. Pre-fetching To Hide Memory Latency
- Advance the loads ahead of when the data is needed
- The problem is that the cache space may be needed for computation between the pre-fetch and the use of the pre-fetched data
- This is no worse than not performing the pre-fetch, because the pre-fetch memory unit is typically an independent functional unit
- The dot product (or a vector sum) again provides an example (a hand-coded sketch of the idea follows this list)
- a(1) and b(1) are requested in a loop
- The processor sees that a(2) and b(2) are needed for the next iteration, requests them on the next cycle, and so on
- Assume the first item takes 100 ns to obtain and the requests for the others are issued on every consecutive cycle
- The processor waits 101 cycles for the first pair, performs the computation, and the next pair is there on the next cycle, ready for computation, and so on
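Hardware or compiler prefetching is automatic, but the effect can be sketched by hand: load the operands for the next iteration into temporaries before using the current ones, so the loads are issued one iteration ahead of their use. This is an illustration only; real prefetching is done by the hardware or through compiler directives.

  ! dot product with the next iteration's operands loaded one step ahead (sketch)
  function dot_prefetch(n, a, b) result(s)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: a(n), b(n)
    double precision :: s, acur, bcur, anext, bnext
    integer :: i
    s = 0.0d0
    if (n == 0) return
    anext = a(1)
    bnext = b(1)
    do i = 1, n
       acur = anext              ! operands requested during the previous iteration
       bcur = bnext
       if (i < n) then
          anext = a(i+1)         ! issue the "prefetch" for the next iteration
          bnext = b(i+1)
       end if
       s = s + acur * bcur       ! compute while the next loads are in flight
    end do
  end function dot_prefetch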
19. Impact On Memory Bandwidth
- Pre-fetching or multithreading increase the bandwidth requirements to memory. Compare:
- A single-thread computation experiencing a 90% cache hit ratio
- Its memory bandwidth requirement is estimated to be 400 MB/s
- A 32-thread computation experiencing a cache hit ratio of 25% (because all threads share the same cache and memory access)
- Its memory bandwidth requirement is estimated to be 3 GB/s (a worked estimate follows)
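These figures are consistent with a simple model, assuming one 4-byte word is accessed per 1 ns cycle and that only misses go to memory; the access rate and word size are not stated on the slide, so the sketch below is an assumed reconstruction.

  program miss_bandwidth
    implicit none
    double precision :: rate, word_bytes, bw1, bw32
    rate       = 1.0d9    ! assumed: one data access per cycle at 1 GHz
    word_bytes = 4.0d0    ! assumed word size in bytes
    bw1  = (1.0d0 - 0.90d0) * rate * word_bytes   ! 90% hits: 10% of accesses go to memory
    bw32 = (1.0d0 - 0.25d0) * rate * word_bytes   ! 25% hits: 75% of accesses go to memory
    print '(a,f6.1,a)', 'single thread (10% misses): ', bw1  / 1.0d6, ' MB/s'   ! about 400 MB/s
    print '(a,f6.1,a)', '32 threads    (75% misses): ', bw32 / 1.0d9, ' GB/s'   ! about 3 GB/s
  end program miss_bandwidth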