Title: Friday, September 22, 2006
1 Friday, September 22, 2006
- "If one ox could not do the job they did not try to grow a bigger ox, but used two oxen." Grace Murray Hopper (1906-1992)
2 Today
- Block matrix operations
- Network topologies
3 Strided access
- Stride
  - A sequence of memory reads and writes to addresses, each of which is separated from the last by a constant interval called the "stride length"
- Unit stride
  - A stride length of one element, so consecutive addresses are accessed
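The definition above can be made concrete with a small C sketch. The function names `sum_strided` and `elements_touched` are hypothetical, chosen only for illustration; the point is that the index advances by a constant interval, and a stride of 1 gives unit-stride access.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every stride-th element of a[0..n-1], starting at index 0.
   stride == 1 is unit stride: consecutive elements are read, so every
   element of each cache line that is fetched gets used. With a larger
   stride, elements are skipped and fewer elements per fetched line
   are actually used. */
double sum_strided(const double *a, size_t n, size_t stride)
{
    assert(stride > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += a[i];
    return sum;
}

/* Number of elements the loop above visits: ceil(n / stride). */
size_t elements_touched(size_t n, size_t stride)
{
    assert(stride > 0);
    return (n + stride - 1) / stride;
}
```

For an 8-element array holding 1 through 8, a stride of 1 visits all eight elements, while a stride of 4 visits only the elements at indices 0 and 4.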
4
    do i = 1, N
      do j = 1, N
        A(i) = A(i) + B(j)
      enddo
    enddo
N is large, so B(j) cannot remain in cache until it is used again in the next iteration of the outer loop. Little reuse between touches.
How many cache misses for A and B?
5 Blocking
Original loop:
    do i = 1, N
      do j = 1, N
        A(i) = A(i) + B(j)
      enddo
    enddo
With the j loop split into blocks of S elements:
    do i = 1, N
      do j = 1, N, S
        do jj = j, MIN(j+S-1, N)
          A(i) = A(i) + B(jj)
        enddo
      enddo
    enddo
6 Blocking
Original loop:
    do i = 1, N
      do j = 1, N
        A(i) = A(i) + B(j)
      enddo
    enddo
With the block loop moved outermost:
    do j = 1, N, S
      do i = 1, N
        do jj = j, MIN(j+S-1, N)
          A(i) = A(i) + B(jj)
        enddo
      enddo
    enddo
S is the maximum number of elements of B that can remain in cache between two iterations of the i loop.
Block or strip mine.
How many cache misses for A and B?
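The blocked loop nest can be sketched in C as follows. This is a minimal illustration, not the lecture's code: it assumes the loop body is the accumulation A(i) = A(i) + B(j), uses 0-based indices with an exclusive upper bound, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop nest: every a[i] accumulates all of b[0..n-1]. */
void accumulate_plain(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i] += b[j];
}

/* Blocked nest: the loop over blocks of b is outermost, so the s
   elements b[j..jend-1] are reused for every i before the next block
   of b is touched. */
void accumulate_blocked(double *a, const double *b, size_t n, size_t s)
{
    assert(s > 0);
    for (size_t j = 0; j < n; j += s) {
        size_t jend = (j + s < n) ? j + s : n;  /* exclusive bound */
        for (size_t i = 0; i < n; i++)
            for (size_t jj = j; jj < jend; jj++)
                a[i] += b[jj];
    }
}
```

Both functions add the elements of b to each a[i] in the same order, so they produce identical results. Under a simple model (a cache line holds L elements and N is larger than the cache), a rough estimate is about N/L misses for A and N²/L for B in the original nest, against about N²/(S·L) for A and N/L for B in the blocked nest. These counts are an estimate under that assumed model, not figures from the slides.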
7 Operation Count vs. Memory Operations
- Example: matrix multiplication
- Previous example?
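As a rough illustration of the comparison this slide sets up, the ratio of arithmetic to data can be worked out for both cases. The counting convention is an assumption, not taken from the slides: one multiply and one add per inner iteration of n-by-n matrix multiplication, one add per inner iteration of the earlier A(i), B(j) loop, and each array element counted once as data.

```latex
\begin{align*}
\text{matrix multiplication:}\quad
  & \frac{2n^3 \text{ operations}}{3n^2 \text{ elements}} = \frac{2n}{3} \\
\text{previous example:}\quad
  & \frac{N^2 \text{ operations}}{2N \text{ elements}} = \frac{N}{2}
\end{align*}
```

Both ratios grow with the problem size, so in principle each element can be reused many times once it is in cache. That reuse is the opportunity blocking tries to exploit.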
9 Matrix multiplication
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
Remember to initialize c[i][j] to zero.
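10 Matrix multiplication with blocking
    /* min(x, y) is assumed to be a user-supplied macro or function */
    int i, j, k, ii, jj, kk;
    for (ii = 0; ii < n; ii += S) {
        for (jj = 0; jj < n; jj += S) {
            for (kk = 0; kk < n; kk += S) {
                for (i = ii; i < min(ii + S, n); i++) {
                    for (j = jj; j < min(jj + S, n); j++) {
                        for (k = kk; k < min(kk + S, n); k++) {
                            c[i][j] = c[i][j] + a[i][k] * b[k][j];
                        }
                    }
                }
            }
        }
    }
Remember to initialize c[i][j] to zero.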
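To check that blocking changes only the order of memory accesses and not the result, the two loop nests can be wrapped as functions and compared. This is a sketch under assumptions not in the slides: matrices are stored as flat row-major arrays, and `matmul_naive`, `matmul_blocked`, and `min_size` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b for n x n row-major matrices. The caller must zero c. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Same product computed block by block with block size s.
   The caller must zero c. */
void matmul_blocked(size_t n, size_t s,
                    const double *a, const double *b, double *c)
{
    assert(s > 0);
    for (size_t ii = 0; ii < n; ii += s)
        for (size_t jj = 0; jj < n; jj += s)
            for (size_t kk = 0; kk < n; kk += s)
                for (size_t i = ii; i < min_size(ii + s, n); i++)
                    for (size_t j = jj; j < min_size(jj + s, n); j++)
                        for (size_t k = kk; k < min_size(kk + s, n); k++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

The block size does not have to divide n; the `min_size` bounds handle the partial blocks at the edges. Choosing s so that three s-by-s blocks fit in cache together is a common rule of thumb, but the best value depends on the machine.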
11 Exercise
- Matrix-vector multiplication
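A possible starting point for the exercise is the plain, unblocked product y = A x. The sketch below assumes a flat row-major matrix, and the name `matvec` is hypothetical. Working out how the accesses to A, x, and y behave in cache, and whether blocking helps, is left as the exercise.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for an n x n row-major matrix A (unblocked baseline). */
void matvec(size_t n, const double *a, const double *x, double *y)
{
    assert(n == 0 || (a != NULL && x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```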
13 Cache coherence in multiprocessor systems
- Suppose two processors on a shared bus have loaded the same variable.
- If one processor changes the value of that variable, then either:
  - Invalidate other copies
  - Update other copies
15 Cache coherence in multiprocessor systems
- What if a processor reads a data item only once initially?
- The invalidate protocol is more commonly used.
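The read-once scenario can be illustrated with a toy count of bus messages. This is a deliberately simplified model, not an implementation of any real coherence protocol: processor P1 reads a variable once, processor P0 then writes it repeatedly, and P1 never reads it again. The names `coherence_messages`, `INVALIDATE`, and `UPDATE` are hypothetical.

```c
#include <assert.h>

enum protocol { INVALIDATE, UPDATE };

/* Toy model: P1 read x once at the start and holds a copy. P0 then
   writes x `writes` times. Returns how many coherence messages P0
   has to send to P1 over the bus. */
int coherence_messages(enum protocol p, int writes)
{
    assert(writes >= 0);
    int p1_has_valid_copy = 1;      /* P1 loaded x once */
    int messages = 0;
    for (int w = 0; w < writes; w++) {
        if (!p1_has_valid_copy)
            continue;               /* no other copy: the write stays local */
        messages++;                 /* notify P1 about this write */
        if (p == INVALIDATE)
            p1_has_valid_copy = 0;  /* P1's copy is dropped */
        /* UPDATE: P1's copy is refreshed and remains valid */
    }
    return messages;
}
```

In this model, ten writes cost one message under the invalidate policy and ten under the update policy, even though P1 never looks at the variable again. That wasted traffic is one reason to prefer invalidation.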
17 False Sharing (multiprocessor)
- Two processors are accessing different data items in the same cache block.
- What happens if they both attempt to write to it?
- Padding in data structures (tradeoff: space vs. time)
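The padding idea can be sketched with two C structs. The 64-byte block size is an assumption (it varies by processor), and the struct and function names are hypothetical. The sketch only inspects the memory layout; it does not start any threads.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BLOCK 64  /* assumed cache block size in bytes */

static_assert(sizeof(long) < CACHE_BLOCK, "pad array needs a positive size");

/* Two counters, each written by a different processor. They sit next
   to each other, so they almost certainly share one cache block. */
struct counters_packed {
    long a;
    long b;
};

/* Padding places b a full block after a, so the two counters can
   never share a block. The cost is the unused pad bytes. */
struct counters_padded {
    long a;
    char pad[CACHE_BLOCK - sizeof(long)];
    long b;
};

/* Cache block index of a byte offset, counted from the struct start
   (assumes the struct begins on a block boundary). */
size_t block_of(size_t offset)
{
    return offset / CACHE_BLOCK;
}
```

With the packed layout, a write to `a` by one processor invalidates or updates the block that also holds `b`, even though the other processor never touches `a`. With the padded layout, each write affects only its own block, at the price of a larger struct.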
18 Network Topologies
- Bus-based, crossbar, and multistage networks
- Earth Simulator: crossbar
- IBM SP-2: multistage network
19 Network Topologies
Large number of links in a completely connected network.
Bottleneck at the central node in a star topology.
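The difference in link count can be made concrete with a short C sketch. The function names are hypothetical, and the formulas are the standard ones for p nodes: one link per pair of nodes in a completely connected network, and one link from each non-central node to the hub in a star.

```c
#include <assert.h>

/* Links in a completely connected network of p nodes: p(p-1)/2. */
long links_complete(long p)
{
    assert(p >= 1);
    return p * (p - 1) / 2;
}

/* Links in a star of p nodes: p - 1, all ending at the central node. */
long links_star(long p)
{
    assert(p >= 1);
    return p - 1;
}
```

For 1024 nodes this gives 523776 links when completely connected against 1023 in a star. The star is far cheaper to build, but every message has to pass through the central node.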
20 Network Topologies
- 1-D torus
- Intel Paragon: 2-D mesh
- BlueGene/L: 3-D torus
- Cray T3E: 3-D cube
21
- 2-D and 3-D meshes are common in parallel computers.
- Regularly structured computation maps naturally to a 2-D mesh.
- 3-D network topologies: weather modeling, structure modeling
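To illustrate how a regular grid computation maps onto these networks, the sketch below finds the neighbor of a node in a 2-D mesh and in a 2-D torus. Row-major node numbering and all function names are assumptions made for this example.

```c
#include <assert.h>

/* Rank of node (row, col) in a grid with `cols` columns, numbered
   row by row. */
int grid_rank(int row, int col, int cols)
{
    return row * cols + col;
}

/* Neighbor of (row, col) displaced by (drow, dcol) in a 2-D torus:
   coordinates wrap around at the edges. */
int torus_neighbor(int row, int col, int drow, int dcol, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    int r = ((row + drow) % rows + rows) % rows;
    int c = ((col + dcol) % cols + cols) % cols;
    return grid_rank(r, c, cols);
}

/* Same displacement in a 2-D mesh, which has no wraparound links.
   Returns -1 when the neighbor would lie beyond an edge. */
int mesh_neighbor(int row, int col, int drow, int dcol, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    int r = row + drow;
    int c = col + dcol;
    if (r < 0 || r >= rows || c < 0 || c >= cols)
        return -1;
    return grid_rank(r, c, cols);
}
```

In a 4 x 4 grid, the node at (0, 0) has no upper neighbor in the mesh, while in the torus its upper neighbor is the node at (3, 0). A stencil computation on a 2-D array can assign one sub-block per node and exchange boundary data with exactly these neighbors.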