Title: Prefetching Challenges in Distributed Memories for CMPs
1Prefetching Challenges in Distributed Memories for CMPs
- Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC BarcelonaTech
2Outline
- Introduction
- Naming the challenges
- Challenge evaluation methodology
- Experimental framework
- Challenge Quantification
- Facing the Challenges
- Conclusions
4Prefetching
- Reduces memory latency
- Brings the data the CPU will need next into a nearer cache
- Increases the hit ratio
- Implemented in most commercial processors
- Erroneous prefetching may cause:
- Cache pollution
- Resource consumption (queues, bandwidth, etc.)
- Power consumption
5Motivation
- The number of cores on a single chip grows every year
- Intel Polaris: 80 cores
- Nehalem: 4-6 cores
- Tilera: 64-100 cores
- Nvidia GeForce: up to 256 cores
6Prefetch in CMPs
- Useful prefetching implies more performance
- Avoids network latency
- Reduces memory access latency
- Useless prefetching implies less performance
- More power consumption
- More NoC congestion
- Interference with other cores' requests
7Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. Network Aware Performance Evaluation of Prefetching Techniques in CMPs. Simulation Modeling Practice and Theory (SIMPAT), 2014.
8Distributed memories
- Distribution of the memory access pattern
[Figure: accesses to addresses @, @+2, @+4, @+6, @+8, @+10]
9Distributed memories
- Distribution of the memory access pattern
[Figure: accesses to addresses @, @+2, @+4, @+6, @+8, @+10, @+12, @+14]
11Prefetch Distributed Memory Systems
- Distributed patterns
[Figure: distributed L2 memory; L1 miss for address @]
12Pattern Detection Challenge
- The memory stream is distributed among the tiles
- Each prefetcher is aware of only a part of the stream
- Access patterns and correlations are harder to detect
- Not all prefetchers are affected
- Correlation prefetchers are affected (e.g., GHB)
- One Block Lookahead prefetchers are not affected (e.g., Tagged)
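The effect can be illustrated with a minimal sketch. It assumes a static block-interleaved mapping of addresses to L2 tiles, 64-byte blocks, and 4 tiles; `home_tile`, `split_stream`, and `deltas` are hypothetical helper names, not part of any simulator. A single core walks memory with a stride of two blocks, and each tile sees only its share of that stream.

```python
from collections import defaultdict

BLOCK_SIZE = 64  # bytes per cache block (assumed value)
NUM_TILES = 4    # number of L2 tiles (assumed value)


def home_tile(addr):
    """Map an address to its home L2 tile (assumed block-interleaved policy)."""
    return (addr // BLOCK_SIZE) % NUM_TILES


def split_stream(addresses):
    """Return, per tile, the block numbers that tile's prefetcher observes."""
    per_tile = defaultdict(list)
    for addr in addresses:
        per_tile[home_tile(addr)].append(addr // BLOCK_SIZE)
    return dict(per_tile)


def deltas(blocks):
    """Differences between consecutive block numbers in a stream."""
    return [b - a for a, b in zip(blocks, blocks[1:])]


# One core accesses blocks 64, 66, 68, ..., 78 (stride of two blocks).
stream = [(64 + 2 * i) * BLOCK_SIZE for i in range(8)]
global_deltas = deltas([a // BLOCK_SIZE for a in stream])
per_tile = split_stream(stream)

print("global deltas:", global_deltas)
for tile, blocks in sorted(per_tile.items()):
    print(f"tile {tile} sees blocks {blocks}, deltas {deltas(blocks)}")
```

Under these assumptions, only tiles 0 and 2 observe any accesses, each sees half of the stream, and the stride they measure (4) differs from the stride of the original stream (2). A prefetcher that correlates longer histories loses proportionally more context.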
13Prefetch Distributed Memory Systems
- Queue filtering
[Figure: distributed L2 memory with addresses @, @+2 and @+4]
14Prefetch Queue Filtering Challenge
- Prefetch requests queued in distributed queues
- Independent engines generating requests
- Repeated requests can be queued
- In a centralized queue those would be merged
- Adverse effects
- Power consumption
- Network contention
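A small sketch of the difference, assuming each queue merges duplicates only among its own entries. `PrefetchQueue`, `issue_distributed`, and `issue_centralized` are hypothetical names chosen for illustration, and the request list is made up.

```python
class PrefetchQueue:
    """Prefetch queue that merges repeated requests for the same block."""

    def __init__(self):
        self.pending = []
        self._seen = set()

    def enqueue(self, block):
        """Queue a block; return False if it was already queued (merged)."""
        if block in self._seen:
            return False
        self._seen.add(block)
        self.pending.append(block)
        return True


def issue_distributed(requests):
    """One queue per tile: duplicates are only filtered within a tile."""
    queues = {}
    for tile, block in requests:
        queues.setdefault(tile, PrefetchQueue()).enqueue(block)
    return sum(len(q.pending) for q in queues.values())


def issue_centralized(requests):
    """A single queue: duplicates from any tile are merged."""
    queue = PrefetchQueue()
    for _tile, block in requests:
        queue.enqueue(block)
    return len(queue.pending)


# (tile, block) pairs produced by independent prefetch engines.
requests = [(0, 100), (1, 100), (0, 101), (2, 100), (1, 102), (0, 100)]
distributed = issue_distributed(requests)  # 5 requests reach the network
centralized = issue_centralized(requests)  # 3 requests reach the network
print(distributed, centralized)
```

In this example block 100 is requested by three different tiles. The distributed queues issue it three times, while a centralized queue would issue it once.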
15Prefetch Distributed Memory Systems
- Dynamic profiling
[Figure: distributed L2 memory with addresses @, @+2 and @+4; L1 miss for @+2]
16Dynamic Profiling Challenge
- Prefetch requests are generated in one tile
- The dynamic profiling information is collected in another tile
- The profiling seen by the generating tile is therefore erroneous
- Techniques that rely on this information may work erroneously:
- Filtering
- Throttling
- Specific prefetching engines
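A minimal sketch of how the profiling can go wrong, assuming that a prefetch is counted as issued in the tile that generates it and as useful in the home tile of the block, with a simple modulo mapping of blocks to tiles. `Profile`, `home_tile`, and `simulate` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass

NUM_TILES = 4  # number of tiles (assumed value)


def home_tile(block):
    """Home tile of a block (assumed modulo mapping)."""
    return block % NUM_TILES


@dataclass
class Profile:
    issued: int = 0  # prefetches generated by this tile
    useful: int = 0  # prefetched blocks later hit in this tile

    def accuracy(self):
        return self.useful / self.issued if self.issued else 0.0


def simulate(prefetches, demand_blocks):
    """Count issues at the generating tile and useful hits at the home tile."""
    profiles = [Profile() for _ in range(NUM_TILES)]
    prefetched = set()
    for tile, block in prefetches:
        profiles[tile].issued += 1
        prefetched.add(block)
    for block in demand_blocks:
        if block in prefetched:
            profiles[home_tile(block)].useful += 1
            prefetched.discard(block)
    return profiles


# Tile 0 generates four prefetches, and every one of them is later used.
profiles = simulate([(0, 1), (0, 2), (0, 3), (0, 5)], [1, 2, 3, 5])
local_accuracy = profiles[0].accuracy()
global_accuracy = sum(p.useful for p in profiles) / sum(p.issued for p in profiles)
print(local_accuracy, global_accuracy)
```

Here every prefetch is useful, yet tile 0 measures an accuracy of 0.0 because the hits are recorded in tiles 1, 2, and 3. A throttling mechanism driven by the local counters would slow down a prefetcher that is in fact working perfectly.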
18Challenge evaluation methodology
- Three environments to test the challenges
- Pattern Detection Challenge: ideal prefetcher
- Prefetcher that is aware of the whole memory stream
- No extra network contention added to the system
- No extra power consumed
- Requests classified by their core identifier
- To preserve the original stream of each core
- Prefetcher used for the test: Global History Buffer
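The classification step can be sketched as follows. This is only an illustration of why the requests are grouped by core identifier; the per-core delta computation is a simple stand-in for a pattern detector and is not the Global History Buffer itself. `classify_by_core` and `deltas` are hypothetical names, and the request stream is made up.

```python
from collections import defaultdict


def classify_by_core(requests):
    """Group a global stream of (core_id, block) requests by core identifier."""
    streams = defaultdict(list)
    for core_id, block in requests:
        streams[core_id].append(block)
    return dict(streams)


def deltas(blocks):
    """Differences between consecutive block numbers in a stream."""
    return [b - a for a, b in zip(blocks, blocks[1:])]


# Core 0 strides by 2 blocks, core 1 by 3 blocks; their requests interleave.
requests = [(0, 10), (1, 100), (0, 12), (1, 103), (0, 14), (1, 106)]

mixed = deltas([block for _core, block in requests])
per_core = {core: deltas(blocks)
            for core, blocks in classify_by_core(requests).items()}

print("mixed stream deltas:", mixed)
print("per-core deltas:", per_core)
```

Seen as one mixed stream, the deltas jump back and forth between the two cores' address ranges and show no regular stride. Once the requests are classified by core, each stream shows its original constant stride again.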
19Pattern Detection Challenge
20Challenge evaluation methodology
- Three environments to test the challenges
- Prefetch Queue Filtering Challenge: centralized queue
- All the requests are sent to a centralized queue
- Repeated requests are merged
- No extra network contention added to the system
- No extra power consumed
- Repeated requests are not issued
- Prefetcher used for the test: Tagged prefetcher
21Prefetch Queue Filtering Challenge
22Challenge evaluation methodology
23Dynamic Profiling Challenge
25Experimental framework
- gem5
- 64 x86 CPUs
- Ruby memory system
- L2 prefetchers
- MOESI coherence protocol
- Garnet network simulator
- PARSEC 2.1
26Simulation environment
28Pattern Detection Challenge
29Prefetch Queue Filtering Challenge
30Dynamic Profiling Challenge
32Facing the challenges
- There are two main options
- Redesign the entire prefetch philosophy
- Adapt the current techniques to work with DSMs
- Moreover, there are two main directions
- Centralize the information
- Drawback: increased communication
- Distribute the prefetcher
- Drawback: the prefetcher must be distributed smartly
34Conclusions
- Three challenges when prefetching in DSMs
- Pattern Detection Challenge
- Prefetch Queue Filtering Challenge
- Dynamic Profiling Challenge
- Challenge evaluation methodology
- Directions for future research
- There are no evident solutions for the challenges
- Not solving them -> limited prefetch performance
35Q&A