Title: Mobile Memory: Improving Memory Locality in Very Large Reconfigurable Fabrics
Slide 1: Mobile Memory: Improving Memory Locality in Very Large Reconfigurable Fabrics
Rong Yan, Seth C. Goldstein
- Carnegie Mellon University
- {yanrong, seth}@cs.cmu.edu
- 3/22/2002
Slide 2: Outline
- Motivation
- Mobile Memory vs. Cache-Only Memory Architecture
- Design Issues
- Implementation Cost
- Conclusion
Slide 3: Outline (next up: Motivation)
Slide 4: Increasing FPGA Density
Source: http://www.xilinx.com/support/techxclusives/evolution-techX20.htm
Slide 5: Increasing FPGA Density
- Configurable Logic Blocks and Embedded Memory
- Expect entire applications to be mapped onto a very large reconfigurable fabric (VLRF)
Source: http://www.xilinx.com/support/techxclusives/evolution-techX20.htm
Slide 6: Abstraction for VLRF
[Figure: the very large reconfigurable fabric abstracted as computation cores plus embedded memory]
Slide 7: Problem in VLRF
- Long idle time for some benchmarks
- One of the main reasons: large memory latency
- [Chart: idle time for all MediaBench and SpecInt95 benchmarks that run on our simulator]
8Possible Solutions
- Possible solutions exploit the reference
locality - introduce cache
- move memory data at run time
-
- Our Choice Mobile memory
- Move the memory data closer to accessor at run
time, inspired by cache-only memory
architecture(COMA) - Investigate whether it is enough or we need more
complex solution, e.g. replication
Slide 9: Outline (next up: Mobile Memory vs. Cache-Only Memory Architecture)
Slide 10: Quick Review: COMA
- Key points:
  - Shared-memory multiprocessor, connected by a network
  - Main memory acts as a cache
  - Data is automatically replicated/migrated to the accessing processor at run time
Slide 11: Mobile Memory vs. COMA
- Similar idea in different contexts
- Analogy (VLRF : Multiprocessors):
  - Code : Processor
Slide 12: Mobile Memory vs. COMA
[Table: comparison of mobile memory and COMA]
Slide 13: Limit Study
- Purpose: examine whether mobile memory is beneficial in the context of a VLRF
- Definitions in our computational model (sketched below):
  - Unit: the area of a 32-bit memory or a 32-bit adder (assumed equal in size)
  - Cluster: a number of units grouped together
- Assumptions:
  - Huge, effectively infinite resources are available
  - Memory data can move to any position at run time, even overlapping the code region
  - No additional cost for memory movement
  - No replacement policy
  - Only one memory word is moved at a time
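To make the limit-study model concrete, here is a minimal Python sketch. The 2D integer-grid layout and the Manhattan-distance cost function are illustrative assumptions for this summary, not the exact cost model from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    """A group of units at a fixed position on the fabric."""
    x: int
    y: int

def access_cycles(accessor: Cluster, data: Cluster) -> int:
    # Assumed cost model: access cycles grow with the Manhattan
    # distance between the accessing cluster and the data's cluster.
    return abs(accessor.x - data.x) + abs(accessor.y - data.y)

# Limit-study assumptions carried into the sketches that follow:
# moving a word is free, there is no replacement policy, and only
# one memory word moves at a time.
```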
Slide 14: Outline (next up: Design Issues)
- Motivation
- Mobile Memory vs. Cache-Only Memory Architecture
- Design Issues
  - Analytical Model
- Implementation Cost
- Conclusion
Slide 15: Mobile Memory
- Goal: reduce memory latency by exploiting memory locality
- Approach: move memory data at run time, without replication
  - Mobile memory policies
Slide 16: Mobile Memory Policies
- Three main design axes:
  - When to move
  - Where to move (our focus)
  - How much to move
- Proposed policies: Greedy, N-Best, Centroid
Slide 17: Greedy Policy
- Always move memory data to the most recent accessor (sketch below)
- Example: [Figure: after each access, memory word M moves to accessor A]
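A minimal sketch of the greedy policy under the limit-study assumptions; `Cluster` and `access_cycles` come from the model sketch above, and the move-after-every-access granularity is taken from the stated assumptions.

```python
def run_greedy(trace, start):
    """Replay an accessor trace, moving the word to the most
    recent accessor after every access (movement is free here)."""
    location, total = start, 0
    for accessor in trace:
        total += access_cycles(accessor, location)
        location = accessor  # greedy: chase the last accessor
    return total
```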
Slide 19: Bad Case for Greedy Policy
- Example: ping-pong access, where two accessors alternate accesses to a memory location; each access finds the data at the other accessor, so every access pays the full distance (demonstrated below)
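A quick worked instance of the ping-pong pattern using the sketch above, with two hypothetical accessors eight units apart.

```python
a1, a2 = Cluster(0, 0), Cluster(8, 0)
# Accessors alternate; after each access the word chases the most
# recent accessor, so the next access always pays the full distance.
print(run_greedy([a1, a2] * 4, start=a1))  # 7 of 8 accesses pay 8 -> 56
```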
Slide 20: N-Best Policy
- Keep a history of the last N accessors, shared by the whole cluster
- Assume the access pattern repeats
- Move to the accessor in the history that minimizes memory access cycles (sketch below)
- Example (N = 3): [Figure: M moves to whichever of accessors A, A1, A2 minimizes total access cycles]
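A sketch of the N-Best policy in the same model. It scores each recorded accessor as if the last-N pattern repeated and moves the word to the best one; keeping the history in a shared deque and re-deciding after every access are implementation guesses, since the talk treats "when to move" as a separate axis.

```python
from collections import deque

def n_best_target(history):
    """Pick, among the last N accessors, the location that would
    minimize total access cycles if the recorded pattern repeated."""
    return min(history,
               key=lambda loc: sum(access_cycles(a, loc) for a in history))

def run_n_best(trace, start, n=3):
    history = deque(maxlen=n)  # shared for the whole cluster
    location, total = start, 0
    for accessor in trace:
        total += access_cycles(accessor, location)
        history.append(accessor)
        location = n_best_target(history)
    return total
```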
Slide 22: Centroid Policy
- Keep a history of the last N accessors, shared by the whole cluster
- Move to the centroid of the N accessors (sketch below)
- Example (N = 3): [Figure: M moves to the centroid of accessors A, A1, A2]
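A sketch of the centroid policy in the same assumed grid model; rounding the centroid to the nearest grid position is an implementation guess.

```python
def centroid_target(history):
    """Move the word to the (rounded) centroid of the last N accessors."""
    cx = round(sum(a.x for a in history) / len(history))
    cy = round(sum(a.y for a in history) / len(history))
    return Cluster(cx, cy)
```

On the ping-pong trace, the centroid hovers midway between the two accessors, so each access pays roughly half the distance instead of the full distance greedy pays.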
Slide 24: Comparison
- An offline algorithm is used to estimate the optimal performance
- Please refer to our paper for more details
Slide 25: Memory Access Cycles
[Chart: memory access cycles for the different policies, normalized to baseline cycles]
Slide 26: Total Cycles
[Chart: total cycles for the different policies, normalized to baseline cycles]
Slide 27: Outline (next up: Implementation Cost)
Slide 28: Implementation Cost
- Directories
  - The cost of locating moved memory data
  - Assume each cluster is coupled with two directories
- [Figure: a cluster with its Local DIR and Home DIR]
Slide 29: Implementation Cost
- Local directory misses (sketch below)
- [Figure: on a local DIR miss, accessor A asks the home cluster's Home DIR for the current location of M]
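A sketch of how a lookup through the two directories might proceed: the local DIR acts as a cache of recent locations, and on a miss the request goes to the word's home cluster, whose home DIR always tracks the current location. The protocol details here are illustrative assumptions, not the paper's design.

```python
def locate(addr, local_dir, home_dir_for):
    """Return the cluster currently holding `addr`.

    local_dir    -- this cluster's cache: {addr: current Cluster}
    home_dir_for -- function mapping addr to its home cluster's
                    directory, updated whenever the word moves
    """
    if addr in local_dir:
        return local_dir[addr]        # local DIR hit: fast path
    home_dir = home_dir_for(addr)     # local DIR miss: extra hop
    location = home_dir[addr]         # home DIR knows the location
    local_dir[addr] = location        # cache it for later accesses
    return location
```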
Slide 30: Sensitivity to Directory Size
[Chart: effect of local directory size on memory access cycles]
Slide 31: Implementation Cost
- Cost of making room
  - Must ensure enough room to accommodate incoming data
  - Reserve a portion of memory in each cluster, which expands (dilates) the placement graph
  - Increases both control transfer cycles and memory access cycles (rough estimate below)
- [Figure: four clusters spread out at dilation factor 2]
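One plausible back-of-the-envelope reading of the dilation cost, assuming a square fabric where reserving room multiplies the occupied area by the dilation factor, so linear distances (and distance-proportional cycles) grow by its square root. This scaling argument is our assumption, not a number from the talk.

```python
import math

def dilated_cycles(base_cycles: float, dilation_factor: float) -> float:
    # Assumption: occupied area scales by the dilation factor, so
    # linear distances, and hence distance-proportional cycles,
    # scale by its square root on a 2D fabric.
    return base_cycles * math.sqrt(dilation_factor)

print(dilated_cycles(100.0, 2.0))  # dilation factor 2 -> ~141 cycles
```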
Slide 32: Effect of Implementation Cost
[Chart: total execution cycles for the centroid policy at different dilation factors, normalized to baseline cycles]
Slide 33: Conclusion
- Mobile memory aims to relieve the memory bottleneck in VLRFs, inspired by COMA
- Simple heuristics are enough
- Keeping the implementation cost low is a key issue
- Mobile memory alone may not be sufficient
  - Replication is probably required