1. Memory System Performance (Chapter 3)
- Dagoberto A. R. Justo
- PPGMAp
- UFRGS
2. Introduction
- The clock rate of your CPU alone does not determine its performance
- Nowadays, memory speed is becoming the limitation
- Hardware designers are creating architectures that try to overcome memory speed limitations
- However, the hardware is designed to be efficient only under certain models of how the programs running on it are written
- Thus, careful program design is essential to obtain high performance
- We introduce these issues by looking at simple matrix operations and modeling their performance, given certain characteristics of the CPU architecture
3. Memory Issues -- Definitions
- Consider an architecture with several caches
- How long does it take to receive a word in a particular cache, once requested?
- How many words can be retrieved per unit of time, once the first one is received?
- Assume we request a word of memory and receive a block of data of size b words containing the desired word
- Latency
- The time l to receive the first word after the request (usually in nanoseconds)
- Bandwidth
- The number of words received per unit of time with one request (usually in millions of words per second); a small worked example follows this list
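As a rough illustration (not from the slides), the time to receive a block of b words can be modeled as the latency plus the remaining words delivered at the bandwidth rate. The Fortran sketch below uses assumed numbers (100 ns latency, one word per 1 ns cycle once the transfer has started) that match the pipelined bus discussed later.

  program transfer_time
    implicit none
    real :: l, bw, t
    integer :: b
    l  = 100.0   ! assumed latency in ns
    bw = 1.0     ! assumed bandwidth in words per ns once the transfer has started
    b  = 4       ! block (cache line) size in words
    ! simple model: first word arrives after l ns, each remaining word at the bandwidth rate
    t = l + real(b - 1) / bw
    print '(a,i0,a,f6.1,a)', 'block of ', b, ' words received after ', t, ' ns'
  end program transfer_time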
4. Hypothetical Machine 1 (no Cache)
- Clock rate 1 GHz (1 cycle = 1 nanosecond, 1 ns)
- Two multiply-add units
- Can perform 2 multiply and 2 add operations per cycle
- Fast CPU -- 4 flops per cycle, so the peak performance is 4 Gflops
- Latency 100 ns
- It takes 100 cycles x 1 ns to obtain a word once requested
- Block size 1 word (8 bytes = 64 bits)
- Thus, the bandwidth is 10 megawords per second (Mwords/s)
- However, it is very slow in practice
- Consider a dot product operation (a code sketch follows the diagram below)
- Each step is a multiply and an add, accessing 2 elements of 2 vectors and accumulating the result in a register, say s
- 2 elements arrive from memory every 200 ns and the 2 operations are performed in 1 cycle
- Thus, the machine takes about 100.5 ns per flop
- That is about 10 Mflops -- a factor of 400 slower than the peak
(Timing diagram: Fetch 1 word, Fetch 1 word, then 2 flops; each fetch takes 100 cycles)
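For reference, a minimal Fortran sketch of the dot product analyzed above (an illustration, not code from the course): each iteration loads a(i) and b(i) and performs one multiply and one add, which is why the loop runs at memory speed on this machine.

  ! dot product: 2 memory accesses and 2 flops (one multiply-add) per iteration
  function dot(n, a, b) result(s)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: a(n), b(n)
    double precision :: s
    integer :: i
    s = 0.0d0
    do i = 1, n
       ! on Machine 1 each of a(i) and b(i) pays the full 100 ns latency,
       ! so the two flops below wait about 200 ns for their operands
       s = s + a(i) * b(i)
    end do
  end function dot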
5. Hypothetical Machine 2 (Cache, BUT The Problem is Different)
- Consider the matrix multiplication problem
- Cache size 32 Kbytes
- Block size 1 word (8 bytes)
- Two things are different here
- The problem is different
- The dot product performs, on average, 1 operation per data item: 2n operands, 2n operations
- Matrix multiplication has data reuse: O(n^3) operations for O(n^2) data
- This machine has a cache; its line (block) size here is 1 word (it is increased to 4 for Machine 3)
- A cache is an intermediate storage area which the CPU accesses in 1 cycle (low latency) but which stores sufficient data to take advantage of data reuse -- the important aspect of matrix multiplication
6. Hypothetical Machine 2 (continued)
- Same processor, clock rate 1 GHz (1 ns)
- Latency 100 ns (from memory to cache)
- Block size 1 word
- Let n = 32 and let A, B, C be 32 x 32 matrices. Consider multiplying C = AB (a code sketch follows the diagram below)
- Each matrix needs 1024 words; 3 matrices x 1024 words x 8 bytes per word = 24 KB, which fits entirely in the 32 KB cache
- Time to fetch the 2 input matrices into the cache: 2048 words x 100 ns = 204.8 µs
- Perform 2n^3 operations for the matrix multiply (2 x 32^3 = 64K ops)
- Time: 2 x 32^3 / 4 ns, about 16.4 µs (4 flops per cycle)
- Thus, the flop rate is 64K ops / (204.8 + 16.4) µs, or about 296 Mflops
- A lot better than 10 Mflops, but nowhere near the peak of 4 Gflops
- The notion of reusing data in this way is called temporal locality (many operations on the same data occur close in time)
(Timing diagram: Fetch 1 word, Fetch 1 word, Fetch 1 word, then groups of 4 flops; each fetch takes 100 cycles)
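A minimal Fortran sketch of the multiplication analyzed above (an illustration, not code from the slides): once A and B are resident in cache, every element is reused n times, which is the temporal locality the slide refers to. The loop order is chosen for column-major access.

  ! C = A*B for n = 32: 2*n**3 flops on 3*n**2 words of data (temporal locality)
  subroutine matmul32(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), b(n,n)
    double precision, intent(out) :: c(n,n)
    integer :: i, j, k
    c = 0.0d0
    do j = 1, n           ! one column of C and B at a time
       do k = 1, n
          do i = 1, n     ! inner loop walks down columns of A and C (stride 1)
             c(i,j) = c(i,j) + a(i,k) * b(k,j)   ! each element reused n times
          end do
       end do
    end do
  end subroutine matmul32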
7. Hypothetical Machine 3 (Increase the Memory/Cache Bandwidth)
- Increase the block size b from 1 word to 4 words
- As stated, this implies that the data path from memory to cache is 128 bits wide (4 x 32 bits/word)
- For the dot product algorithm,
- A request for a(1) brings in a(1:4)
- The a(1) takes 100 ns, but a(2), a(3), and a(4) arrive at the same time
- Similarly, a request for b(1) brings in b(1:4)
- The request for b(1) is issued one cycle after that for a(1), but the bus is busy bringing a(1:4) into the cache
- Thus, after 201 ns the dot product computation starts and proceeds 1 cycle at a time, completing at a(4) and b(4)
- Next, the request for a(5) brings in a(5:8), and so on
- Thus, the CPU performs approximately 8 flops every 204 ns or so, i.e. about 1 operation per 25 ns, or 40 Mflops
8. Hypothetical Machine 3 -- Analyzed In Terms of Cache-Hit Ratio
- Cache hit ratio: the number of memory accesses found in cache / the total number of memory accesses
- In this case, the first of every 4 accesses is a miss and the remaining 3 are hits (successes)
- Thus, the cache hit ratio is 75%
- Assume the dominant overhead is the misses
- Then 25% of the memory latency is the average overhead per access, i.e. 25 ns (25% of the 100 ns memory latency)
- Because the dot product performs one operation per word accessed, this also works out to 40 Mflops
- A more accurate estimate is (0.75 x 1 + 0.25 x 100) ns/word (reproduced in the sketch below)
- That is 25.75 ns per word, or about 38.8 Mflops
9. Actual Implementations
- 128-bit wide buses are expensive
- The usual implementation is to pipeline a 32-bit bus so that the words in the line (block) arrive on successive clock cycles after the first item is received
- That is, instead of 4 words received after a 100 ns latency, the 4 items arrive after 100 + 3 ns
- However, a multiply-add operation can start after each item arrives, so the result is the same -- that is, 40 or so Mflops
(Timing diagrams: one 128-bit fetch followed by four groups of 4 flops, versus four pipelined 32-bit fetches each followed by 4 flops; the first fetch takes 100 cycles)
10. Spatial Locality Issue
- It is clear that the dot product takes advantage of the consecutiveness of the elements of the vectors
- This is called spatial locality -- the elements are close together in memory
- Consider a different problem: the matrix-vector multiplication y = Ax (see the sketch after this list)
- In C (row-major storage), the elements of a column are not consecutive in memory
- That is, they are separated by the length of a row (the number of columns)
- In this case, the accesses are not spatially local, and essentially all accesses to every element of every column are cache misses
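As an illustration (assuming Fortran's column-major storage, described on the next slide), the sketch below computes y = Ax with the inner loop running down a column of A, so consecutive accesses are adjacent in memory; this is illustrative code, not from the slides.

  ! y = A*x with the inner loop running down a column of A (stride 1 in Fortran)
  subroutine matvec(n, a, x, y)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), x(n)
    double precision, intent(out) :: y(n)
    integer :: i, j
    y = 0.0d0
    do j = 1, n
       do i = 1, n
          ! a(i,j) and a(i+1,j) are adjacent in memory; with the loops swapped
          ! (dot products of rows), consecutive accesses would be n words apart
          y(i) = y(i) + a(i,j) * x(j)
       end do
    end do
  end subroutine matvec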
11. In Fortran
- The matrix A is stored by columns in memory (word addresses in hexadecimal; the sketch after this list reproduces them)
- 8000, A(1,1)
- 8001, A(2,1)
- 8002, A(3,1)
- 8003, A(4,1)
- 8004, A(1,2)
- 8005, A(2,2)
- 8006, A(3,2)
- 8007, A(4,2)
- 8008, A(1,3)
- 8009, A(2,3)
- 800A, A(3,3)
- 800B, A(4,3)
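The addresses above follow the usual column-major formula. A small Fortran sketch (not from the slides) that reproduces them for a 4 x 3 matrix with an assumed base word address of hexadecimal 8000:

  program colmajor_address
    implicit none
    integer, parameter :: nrow = 4, ncol = 3
    integer, parameter :: base = 32768       ! hexadecimal 8000, assumed base address
    integer :: i, j, addr
    do j = 1, ncol
       do i = 1, nrow
          ! column-major: A(i,j) lives at base + (i-1) + (j-1)*nrow (in words)
          addr = base + (i - 1) + (j - 1) * nrow
          print '(z4,2x,a,i0,a,i0,a)', addr, 'A(', i, ',', j, ')'
       end do
    end do
  end program colmajor_address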
12. Sum All Elements of a Matrix
- Consider the problem of computing the sum of all the elements of a 1000 x 1000 matrix B

  S = 0.D0
  do i = 1, 1000
     do j = 1, 1000
        S = S + B(i,j)
     end do
  end do

- This code performs very poorly
- s fits in cache
- Since the inner loop varies the column index j, consecutive accesses B(i,j) and B(i,j+1) are far apart in memory
- They are unlikely to be in the same cache line
- Every access experiences the maximum latency delay (100 ns)
13. Sum All Elements of a Matrix
- Changing the order of the loops:

  S = 0.D0
  do j = 1, 1000
     do i = 1, 1000
        S = S + B(i,j)
     end do
  end do

- Now the inner loop accesses B by columns
- The elements in a column are consecutive in memory, so one memory access brings in multiple elements at a time (4 for our model machine), and the performance is reasonable for this machine (a timing sketch comparing the two orders follows)
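A sketch (assumed, not from the course materials) that times the two loop orders from the last two slides with system_clock; on a cache-based machine the column-order loop is typically several times faster.

  program sum_orders
    implicit none
    integer, parameter :: n = 1000
    double precision, allocatable :: b(:,:)
    double precision :: s
    integer :: i, j, t0, t1, rate
    allocate(b(n,n))
    call random_number(b)
    call system_clock(t0, rate)
    s = 0.0d0
    do i = 1, n              ! inner loop over j: stride of n words through memory
       do j = 1, n
          s = s + b(i,j)
       end do
    end do
    call system_clock(t1)
    print *, 'i-outer (strided) :', real(t1 - t0) / real(rate), ' s, sum =', s
    call system_clock(t0)
    s = 0.0d0
    do j = 1, n              ! inner loop over i: stride 1 (Fortran column storage)
       do i = 1, n
          s = s + b(i,j)
       end do
    end do
    call system_clock(t1)
    print *, 'j-outer (stride-1):', real(t1 - t0) / real(rate), ' s, sum =', s
  end program sum_orders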
14. Peak Floating Point Performance vs Peak Memory Bandwidth
- The performance issue: the peak floating point performance is bounded by the peak memory bandwidth
- For fast microprocessors, it is about 100 Mflops per MByte/s of bandwidth
- For large scale vector processors, it is about 1 Mflop per MByte/s of bandwidth
- Solve the problem by modifying the algorithms to hide the large memory latencies
- Some compilers can transform simple codes to obtain better performance
- These modifications are typically unnecessary, but they don't hurt the computation and sometimes help
15. Hiding Memory Latency
- Consider the example of getting information from the Internet using a browser
- What can you do to reduce the wait time?
- While reading one page, we anticipate the next pages we will read and begin fetches for them in advance
- This corresponds to pre-fetching pages in anticipation of their being read
- We open multiple browser windows and begin accesses in each window in parallel
- This corresponds to multiple threads running in parallel
- We request many pages in order
- This corresponds to pipelining with spatial locality
16. Multi-threading To Hide Memory Latency
- Consider the matrix-vector multiplication c = Ab
- Each row-by-vector inner product is an independent computation -- thus, create a different thread for each computation, as in the following pseudocode (a concrete sketch follows this list):

  do k = 1, n
     c(k) = create_thread( dot_product, A(k,:), b )
  end do

- As separate threads:
- On the first cycle, the first thread accesses the first pair of data elements for the first row and waits for the data to arrive
- On the second cycle, the second thread accesses the first pair of elements for the second row and waits for the data to arrive
- And so on, until l units of time (the latency) have passed
- Then the first thread performs a computation and next requests more data
- Then the second thread performs a computation and requests more data
- And so on, so that after the first latency of l cycles, a computation is performed on every cycle
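The create_thread call above is pseudocode; one concrete (assumed) way to express the same idea in Fortran is an OpenMP parallel loop, where each row-by-vector product can run in its own thread and outstanding memory requests from different threads overlap.

  ! c = A*b with one independent row-times-vector product per loop iteration
  ! (sketch using an OpenMP thread team in place of the create_thread pseudocode)
  subroutine matvec_threads(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: a(n,n), b(n)
    double precision, intent(out) :: c(n)
    integer :: k
    !$omp parallel do
    do k = 1, n
       ! while one thread waits on its cache misses, other threads can compute
       c(k) = dot_product(a(k,:), b)
    end do
    !$omp end parallel do
  end subroutine matvec_threads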
17. Multithreading, block size = 1
(Timing diagram: successive threads issue their fetches, e.g. A(1,1) and A(1,2), then A(2,1) and A(2,2), and so on; after the initial latency a flop is completed on every cycle while the other threads' fetches are in flight)
18. Pre-fetching To Hide Memory Latency
- Advance the loads ahead of when the data is needed
- The problem is that the cache space may be needed for computation between the pre-fetch and the use of the pre-fetched data
- This is no worse than not performing the pre-fetch, because the pre-fetch memory unit is typically an independent functional unit
- The dot product (or a vector sum) again provides an example (a hand-coded sketch of the idea follows this list)
- a(1) and b(1) are requested in a loop
- The processor sees that a(2) and b(2) are needed for the next iteration, requests them on the next cycle, and so on
- Assume the first item takes 100 ns to obtain and the requests for the others are issued on every consecutive cycle
- The processor waits 101 cycles for the first pair, performs the computation, and the next pair is there on the next cycle, ready for computation, and so on
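Hardware or compiler prefetching is automatic, but the effect can be sketched by hand: load the operands for the next iteration into temporaries before using the current ones, so the loads are issued one iteration ahead of their use. This is an illustration only; real prefetching is done by the hardware or through compiler directives.

  ! dot product with the next iteration's operands loaded one step ahead (sketch)
  function dot_prefetch(n, a, b) result(s)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in) :: a(n), b(n)
    double precision :: s, acur, bcur, anext, bnext
    integer :: i
    s = 0.0d0
    if (n == 0) return
    anext = a(1)
    bnext = b(1)
    do i = 1, n
       acur = anext              ! operands requested during the previous iteration
       bcur = bnext
       if (i < n) then
          anext = a(i+1)         ! issue the "prefetch" for the next iteration
          bnext = b(i+1)
       end if
       s = s + acur * bcur       ! compute while the next loads are in flight
    end do
  end function dot_prefetch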
19. Impact On Memory Bandwidth
- Pre-fetching or multithreading increase the bandwidth requirements to memory. Compare:
- A single-thread computation experiencing a 90% cache hit ratio
- Its memory bandwidth requirement is estimated to be 400 MB/s
- A 32-thread computation experiencing a cache hit ratio of 25% (because all threads share the same cache and memory access)
- Its memory bandwidth requirement is estimated to be 3 GB/s (a worked estimate follows)
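These figures are consistent with a simple model, assuming one 4-byte word is accessed per 1 ns cycle and that only misses go to memory; the access rate and word size are not stated on the slide, so the sketch below is an assumed reconstruction.

  program miss_bandwidth
    implicit none
    double precision :: rate, word_bytes, bw1, bw32
    rate       = 1.0d9    ! assumed: one data access per cycle at 1 GHz
    word_bytes = 4.0d0    ! assumed word size in bytes
    bw1  = (1.0d0 - 0.90d0) * rate * word_bytes   ! 90% hits: 10% of accesses go to memory
    bw32 = (1.0d0 - 0.25d0) * rate * word_bytes   ! 25% hits: 75% of accesses go to memory
    print '(a,f6.1,a)', 'single thread (10% misses): ', bw1  / 1.0d6, ' MB/s'   ! about 400 MB/s
    print '(a,f6.1,a)', '32 threads    (75% misses): ', bw32 / 1.0d9, ' GB/s'   ! about 3 GB/s
  end program miss_bandwidth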