Title: Stall-Time Fair Memory Access Scheduling
Slide 1: Stall-Time Fair Memory Access Scheduling
- Onur Mutlu and Thomas Moscibroda
- Computer Architecture Group
- Microsoft Research
Slide 2: Multi-Core Systems
[Figure: a multi-core chip with CORE 0 through CORE 3, each with a private L2 cache, all sharing a DRAM memory controller and DRAM Banks 0 through 7; the shared DRAM memory system is where unfairness arises]
Slide 3: DRAM Bank Operation
[Figure: a DRAM bank with rows, columns, a row decoder, a column decoder, and a row buffer. Accesses to (Row 0, Column 0), (Row 0, Column 1), and (Row 0, Column 9) hit in the row buffer once Row 0 is open; the access to (Row 1, Column 0) is a row conflict, since Row 0 must be closed and Row 1 opened before data can be returned]
4DRAM Controllers
- A row-conflict memory access takes significantly
longer than a row-hit access - Current controllers take advantage of the row
buffer - Commonly used scheduling policy (FR-FCFS)
Rixner, ISCA00 - (1) Row-hit (column) first Service row-hit
memory accesses first - (2) Oldest-first Then service older accesses
first - This scheduling policy aims to maximize DRAM
throughput - But, it is unfair when multiple threads share the
DRAM system
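FR-FCFS can be viewed as a two-level priority comparison over the queued requests. The following is a minimal sketch of that comparison, not the controller's actual hardware; the `Request` fields and the per-bank `open_row` state are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative request descriptor (fields are assumptions, not the paper's structures).
struct Request {
    int           thread_id;
    int           bank;
    std::uint64_t row;
    std::uint64_t arrival_time;   // used for the oldest-first tie-break
};

// FR-FCFS: among queued requests, prefer (1) row-buffer hits, then (2) the oldest request.
std::optional<std::size_t> fr_fcfs_pick(const std::vector<Request>& queue,
                                        const std::vector<std::uint64_t>& open_row /* per bank */) {
    auto better = [&](const Request& a, const Request& b) {
        bool a_hit = (open_row[a.bank] == a.row);
        bool b_hit = (open_row[b.bank] == b.row);
        if (a_hit != b_hit) return a_hit;          // rule 1: row-hit first
        return a.arrival_time < b.arrival_time;    // rule 2: oldest first
    };
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < queue.size(); ++i)
        if (!best || better(queue[i], queue[*best])) best = i;
    return best;  // empty if the queue is empty
}
```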
Slide 5: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
6The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM
throughput - DRAM scheduling policies are thread-unaware and
unfair - Row-hit first unfairly prioritizes threads with
high row buffer locality - Streaming threads
- Threads that keep on accessing the same row
- Oldest-first unfairly prioritizes
memory-intensive threads
7The Problem
Row decoder
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T1 Row 5
T0 Row 0
T1 Row 111
T0 Row 0
T1 Row 16
Request Buffer
Row Buffer
Row 0
Row 0
Column decoder
Row size 8KB, cache block size 64B 128 requests
of T0 serviced before T1
T0 streaming thread
Data
T1 non-streaming thread
Slide 8: Consequences of Unfairness in DRAM
[Figure: memory slowdowns when DRAM is the only shared resource: 7.74, 4.72, 1.85, and 1.05]
- Vulnerability to denial of service [Moscibroda & Mutlu, USENIX Security'07]
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability
Slide 9: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
10Fairness in Shared DRAM Systems
- A threads DRAM performance dependent on its
inherent - Row-buffer locality
- Bank parallelism
- Interference between threads can destroy either
or both - A fair DRAM scheduler should take into account
all factors affecting each threads DRAM
performance - Not solely bandwidth or solely request latency
- Observation A threads performance degradation
due to interference in DRAM mainly characterized
by the extra memory-related stall-time due to
contention with other threads
11Stall-Time Fairness in Shared DRAM Systems
- A DRAM system is fair if it slows down
equal-priority threads equally - Compared to when each thread is run alone on the
same system - Fairness notion similar to SMT Cazorla, IEEE
Micro04Luo, ISPASS01, SoEMT Gabor,
Micro06, and shared caches Kim, PACT04 - Tshared DRAM-related stall-time when the thread
is running with other threads - Talone DRAM-related stall-time when the thread
is running alone - Memory-slowdown Tshared/Talone
- The goal of the Stall-Time Fair Memory scheduler
(STFM) is to equalize Memory-slowdown for all
threads, without sacrificing performance - Considers inherent DRAM performance of each
thread
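A minimal sketch of the two quantities defined on this slide; the struct and function names are illustrative assumptions, and T_alone is assumed to be supplied by the hardware estimation described later in the deck.

```cpp
#include <algorithm>
#include <vector>

// Per-thread stall times, as defined on this slide (names are illustrative).
struct ThreadStallTimes {
    double t_shared;  // DRAM-related stall-time when running with other threads
    double t_alone;   // DRAM-related stall-time when running alone (estimated in hardware)
};

// Memory-slowdown of one thread: T_shared / T_alone.
double memory_slowdown(const ThreadStallTimes& t) {
    return t.t_alone > 0.0 ? t.t_shared / t.t_alone : 1.0;
}

// Unfairness across equal-priority threads: MAX slowdown / MIN slowdown.
// STFM tries to keep this ratio bounded by equalizing the slowdowns.
double unfairness(const std::vector<ThreadStallTimes>& threads) {
    double lo = 1e300, hi = 0.0;
    for (const auto& t : threads) {
        double s = memory_slowdown(t);
        lo = std::min(lo, s);
        hi = std::max(hi, s);
    }
    return (lo > 0.0 && hi >= lo) ? hi / lo : 1.0;
}
```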
Slide 12: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
13STFM Scheduling Algorithm (1)
- During each time interval, for each thread, DRAM
controller - Tracks Tshared
- Estimates Talone
- At the beginning of a scheduling cycle, DRAM
controller - Computes Slowdown Tshared/Talone for each
thread with an outstanding legal request - Computes unfairness MAX Slowdown / MIN Slowdown
- If unfairness lt ?
- Use DRAM throughput oriented baseline scheduling
policy - (1) row-hit first
- (2) oldest-first
14STFM Scheduling Algorithm (2)
- If unfairness ?
- Use fairness-oriented scheduling policy
- (1) requests from thread with MAX Slowdown first
- (2) row-hit first
- (3) oldest-first
- Maximizes DRAM throughput if it cannot improve
fairness - Does NOT waste useful bandwidth to improve
fairness - If a request does not interfere with any other,
it is scheduled
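A minimal sketch of the two-mode decision on these two slides: below the threshold α the scheduler behaves like FR-FCFS, and at or above it the most-slowed-down thread is prioritized first, with row-hit-first and oldest-first as tie-breakers. The `Request` descriptor and the per-thread `slowdown` vector are illustrative assumptions, not the controller's actual structures.

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>
#include <vector>

struct Request {
    int           thread_id;
    int           bank;
    std::uint64_t row;
    std::uint64_t arrival_time;
};

std::optional<std::size_t> stfm_pick(const std::vector<Request>& queue,
                                     const std::vector<std::uint64_t>& open_row,  // per bank
                                     const std::vector<double>& slowdown,         // per thread
                                     double unfairness, double alpha) {
    // In fairness mode, find the thread with the maximum slowdown.
    int victim = -1;
    if (unfairness >= alpha) {
        for (std::size_t t = 0; t < slowdown.size(); ++t)
            if (victim < 0 || slowdown[t] > slowdown[victim]) victim = static_cast<int>(t);
    }
    auto better = [&](const Request& a, const Request& b) {
        if (victim >= 0) {                                   // (1) most-slowed-down thread first
            bool a_v = (a.thread_id == victim), b_v = (b.thread_id == victim);
            if (a_v != b_v) return a_v;
        }
        bool a_hit = (open_row[a.bank] == a.row);
        bool b_hit = (open_row[b.bank] == b.row);
        if (a_hit != b_hit) return a_hit;                    // (2) row-hit first
        return a.arrival_time < b.arrival_time;              // (3) oldest first
    };
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < queue.size(); ++i)
        if (!best || better(queue[i], queue[*best])) best = i;
    return best;
}
```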
15How Does STFM Prevent Unfairness?
T0 Row 0
T1 Row 5
T0 Row 0
T0 Row 0
T1 Row 111
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T1 Row 16
Row Buffer
Row 16
Row 111
Row 0
Row 0
T0 Slowdown
1.00
1.03
1.04
1.04
1.07
1.10
T1 Slowdown
1.03
1.06
1.06
1.08
1.11
1.14
1.00
Unfairness
1.03
1.06
1.03
1.04
1.06
1.04
1.03
1.00
Data
?
1.05
Slide 16: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
17Implementation
- Tracking Tshared
- Relatively easy
- The processor increases a counter if the thread
cannot commit instructions because the oldest
instruction requires DRAM access - Estimating Talone
- More involved because thread is not running alone
- Difficult to estimate directly
- Observation
- Talone Tshared - Tinterference
- Estimate Tinterference Extra stall-time due to
interference
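A minimal sketch of the per-thread bookkeeping this slide describes; the struct, its fields, and the per-cycle hook are illustrative assumptions rather than the actual hardware design.

```cpp
#include <cstdint>

// Per-thread stall-time counters, sketching the idea on this slide.
struct StallTimeCounters {
    std::uint64_t t_shared = 0;        // cycles stalled on DRAM while sharing
    std::uint64_t t_interference = 0;  // extra stall cycles attributed to other threads

    // Called once per processor cycle: the core reports whether it could not
    // commit because its oldest instruction is waiting on a DRAM access.
    void on_cycle(bool stalled_on_dram) {
        if (stalled_on_dram) ++t_shared;
    }

    // T_alone is not measured directly; it is estimated as T_shared minus the
    // interference the controller has charged to this thread.
    std::uint64_t t_alone_estimate() const {
        return t_shared > t_interference ? t_shared - t_interference : 0;
    }
};
```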
18Estimating Tinterference(1)
- When a DRAM request from thread C is scheduled
- Thread C can incur extra stall time
- The requests row buffer hit status might be
affected by interference - Estimate the row that would have been in the row
buffer if the thread were running alone - Estimate the extra bank access latency the
request incurs
Extra latency amortized across outstanding
accesses of thread C (memory level parallelism)
19Estimating Tinterference(2)
- When a DRAM request from thread C is scheduled
- Any other thread C with outstanding requests
incurs extra stall time - Interference in the DRAM data bus
- Interference in the DRAM bank (see paper)
Bank Access Latency of Scheduled Request
Tinterference(C)
K
Banks Needed by C Requests
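A minimal sketch of that per-schedule update, treating K as the scaling constant from the slide; the per-thread state, the field names, and the saturation at a parallelism of one are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative per-thread state for the interference update (assumed fields).
struct ThreadState {
    std::uint64_t t_interference  = 0;   // accumulated extra stall cycles
    int           banks_needed    = 1;   // # banks its outstanding requests target
    bool          has_outstanding = false;
};

// When a request from thread `owner` is scheduled with the given bank access
// latency, every *other* thread with outstanding requests is charged extra
// stall-time, scaled down by K and by that thread's bank-level parallelism.
void charge_interference(std::vector<ThreadState>& threads, int owner,
                         std::uint64_t bank_access_latency, std::uint64_t K) {
    for (std::size_t t = 0; t < threads.size(); ++t) {
        if (static_cast<int>(t) == owner || !threads[t].has_outstanding) continue;
        std::uint64_t parallelism = static_cast<std::uint64_t>(std::max(1, threads[t].banks_needed));
        threads[t].t_interference += bank_access_latency / (K * parallelism);
    }
}
```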
20Hardware Cost
- lt2KB storage cost for
- 8-core system with 128-entry memory request
buffer - Arithmetic operations approximated
- Fixed point arithmetic
- Divisions using lookup tables
- Not on the critical path
- Scheduler makes a decision only every DRAM cycle
- More details in paper
Slide 21: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
22Support for System Software
- Supporting system-level thread weights/priorities
- Thread weights communicated to the memory
controller - Larger-weight threads should be slowed down less
- Each threads slowdown is scaled by its weight
- Weighted slowdown used for scheduling
- Favors threads with larger weights
- OS can choose thread weights to satisfy QoS
requirements - ? Maximum tolerable unfairness set by system
software - Dont need fairness? Set ? large.
- Need strict fairness? Set ? close to 1.
- Other values of ? trade-off fairness and
throughput
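A minimal sketch of the weight mechanism. The exact scaling form used by the controller is not given on this slide, so the simple multiplication below is an assumption consistent with "larger-weight threads should be slowed down less"; the names are illustrative.

```cpp
#include <algorithm>
#include <vector>

struct WeightedThread {
    double slowdown;  // T_shared / T_alone, as on the earlier slides
    double weight;    // assigned by the OS; larger means higher priority
};

// Assumed scaling: a larger weight inflates the thread's apparent slowdown,
// so the fairness-oriented policy prioritizes it sooner.
double weighted_slowdown(const WeightedThread& t) {
    return t.slowdown * t.weight;
}

// Unfairness is then computed over weighted slowdowns and compared against alpha.
bool use_fairness_mode(const std::vector<WeightedThread>& threads, double alpha) {
    double lo = 1e300, hi = 0.0;
    for (const auto& t : threads) {
        double w = weighted_slowdown(t);
        lo = std::min(lo, w);
        hi = std::max(hi, w);
    }
    return lo > 0.0 && (hi / lo) >= alpha;
}
```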
Slide 23: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
24Evaluation Methodology
- 2-, 4-, 8-, 16-core systems
- x86 processor model based on Intel Pentium M
- 4 GHz processor, 128-entry instruction window
- 512 Kbyte per core private L2 caches
- Detailed DRAM model based on Micron DDR2-800
- 128-entry memory request buffer
- 8 banks, 2Kbyte row buffer
- Row-hit round-trip latency 35ns (140 cycles)
- Row-conflict latency 70ns (280 cycles)
- Benchmarks
- SPEC CPU2006 and some Windows Desktop
applications - 256, 32, 3 benchmark combinations for 4-, 8-,
16-core experiments
25Comparison with Related Work
- Baseline FR-FCFS Rixner et al., ISCA00
- Unfairly penalizes non-intensive threads with
low-row-buffer locality - FCFS
- Low DRAM throughput
- Unfairly penalizes non-intensive threads
- FR-FCFSCap
- Static cap on how many younger row-hits can
bypass older accesses - Unfairly penalizes non-intensive threads
- Network Fair Queueing (NFQ) Nesbit et al.,
Micro06 - Per-thread virtual-time based scheduling
- A threads private virtual-time increases when
its request is scheduled - Prioritizes requests from thread with the
earliest virtual-time - Equalizes bandwidth across equal-priority threads
- Does not consider inherent performance of each
thread - Unfairly prioritizes threads with bursty access
patterns (idleness problem) - Unfairly penalizes threads with unbalanced bank
usage (in paper)
26Idleness/Burstiness Problem in Fair Queueing
Thread 1s virtual time increases even though no
other thread needs DRAM
Only Thread 2 serviced in interval t1,t2 since
its virtual time is smaller than Thread 1s
Only Thread 3 serviced in interval t2,t3 since
its virtual time is smaller than Thread 1s
Only Thread 4 serviced in interval t3,t4 since
its virtual time is smaller than Thread 1s
Non-bursty thread suffers large performance loss
even though it fairly utilized DRAM when no
other thread needed it
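A minimal sketch of per-thread virtual-time accounting as described on this slide, to make the idleness problem concrete; it is not NFQ's actual implementation, and the fixed service quantum is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Per-thread virtual time: it advances only when that thread's request is
// scheduled, so a thread that runs alone early (while others are idle) builds
// up a large virtual time and is then deprioritized once the others arrive.
struct VirtualTimeScheduler {
    std::vector<double> vtime;   // one entry per thread
    double quantum = 1.0;        // assumed fixed cost charged per serviced request

    explicit VirtualTimeScheduler(std::size_t n) : vtime(n, 0.0) {}

    // Pick the thread with the earliest virtual time among those with requests.
    int pick(const std::vector<bool>& has_request) const {
        int best = -1;
        for (std::size_t t = 0; t < vtime.size(); ++t)
            if (has_request[t] && (best < 0 || vtime[t] < vtime[best]))
                best = static_cast<int>(t);
        return best;
    }

    // Charge the serviced thread: its virtual time keeps growing even if the
    // DRAM was otherwise idle, which is the idleness/burstiness problem.
    void service(int thread) { vtime[thread] += quantum; }
};
```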
Slide 27: Unfairness on 4-, 8-, 16-core Systems
- Unfairness = MAX Memory Slowdown / MIN Memory Slowdown
[Figure: unfairness of the compared schedulers on 4-, 8-, and 16-core systems, with annotated factors of 1.26X, 1.27X, and 1.81X]
Slide 28: System Performance
Slide 29: Hmean-Speedup (Throughput-Fairness Balance)
[Figure: harmonic-mean speedup of the compared schedulers, with annotated values of 10.8, 9.5, and 11.2]
Slide 30: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
31Conclusions
- A new definition of DRAM fairness stall-time
fairness - Equal-priority threads should experience equal
memory-related slowdowns - Takes into account inherent memory performance of
threads - New DRAM scheduling algorithm enforces this
definition - Flexible and configurable fairness substrate
- Supports system-level thread priorities/weights ?
QoS policies - Results across a wide range of workloads and
systems show - Improving DRAM fairness also improves system
throughput - STFM provides better fairness and system
performance than previously-proposed DRAM
schedulers
Slide 32: Thank you. Questions?
Slide 33: Stall-Time Fair Memory Access Scheduling
- Onur Mutlu and Thomas Moscibroda
- Computer Architecture Group
- Microsoft Research
Slide 34: Backup
Slide 35: Structure of the STFM Controller
36Comparison using NFQ QoS Metrics
- Nesbit et al. MICRO06 proposed the following
target for quality of service - A thread that is allocated 1/Nth of the memory
system bandwidth will run no slower than the same
thread on a private memory system running at
1/Nth of the frequency of the shared physical
memory system - Baseline with memory bandwidth scaled down by N
- We compared different DRAM schedulers
effectiveness using this metric - Number of violations of the above QoS target
- Harmonic mean of IPC normalized to the above
baseline
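A minimal sketch of counting violations of the QoS target stated on this slide; the per-thread IPC inputs are assumed to come from simulation of the shared system and of the scaled-down private baseline, respectively.

```cpp
#include <cstddef>
#include <vector>

// A thread allocated 1/N of the memory bandwidth should run no slower than it
// would on a private memory system clocked at 1/N of the shared frequency.
// A violation is any thread whose shared-system IPC falls below that baseline.
std::size_t count_qos_violations(const std::vector<double>& ipc_shared,
                                 const std::vector<double>& ipc_private_1_over_N) {
    std::size_t violations = 0;
    for (std::size_t i = 0; i < ipc_shared.size(); ++i)
        if (ipc_shared[i] < ipc_private_1_over_N[i])  // slower than the QoS baseline
            ++violations;
    return violations;
}
```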
Slide 37: Violations of the NFQ QoS Target
Slide 38: Hmean Normalized IPC using the NFQ Baseline
[Figure: harmonic mean of IPC normalized to the NFQ baseline, with annotated values of 7.3, 5.9, 5.1, 10.3, 9.1, and 7.8]
39Shortcomings of the NFQ QoS Target
- Low baseline (easily achievable target) for
equal-priority threads - N equal-priority threads ? a thread should do
better than on a system with 1/Nth of the memory
bandwidth - This target is usually very easy to achieve
- Especially when N is large
- Unachievable target in some cases
- Consider two threads always accessing the same
bank in an interleaved fashion ? too much
interference - Baseline performance very difficult to determine
in a real system - Cannot scale memory frequency arbitrarily
- Not knowing baseline performance makes it
difficult to set thread priorities (how
much bandwidth to assign to each thread)
Slide 40: A Case Study
[Figure: per-thread memory slowdowns under the compared schedulers, with annotated unfairness values of 7.28, 2.07, 2.08, 1.87, and 1.27]
Slide 41: Windows Desktop Workloads
Slide 42: Enforcing Thread Weights
Slide 43: Effect of α
Slide 44: Effect of Banks and Row Buffer Size