Title: Stall-Time Fair Memory Access Scheduling
Slide 1: Stall-Time Fair Memory Access Scheduling
- Onur Mutlu and Thomas Moscibroda
- Computer Architecture Group
- Microsoft Research
Slide 2: Multi-Core Systems
[Figure: a multi-core chip with CORE 0 through CORE 3, each with a private L2 cache, all sharing a DRAM memory controller and DRAM Banks 0 through 7; the shared DRAM memory system is where unfairness arises]
Slide 3: DRAM Bank Operation
[Figure: a DRAM bank with rows, columns, a row decoder, a column decoder, and a row buffer. Accesses to (Row 0, Column 0), (Row 0, Column 1), and (Row 0, Column 9) hit in the row buffer once Row 0 is open; the access to (Row 1, Column 0) is a row conflict, since Row 0 must be closed and Row 1 opened before data can be returned]
4DRAM Controllers
- A row-conflict memory access takes significantly
longer than a row-hit access - Current controllers take advantage of the row
buffer - Commonly used scheduling policy (FR-FCFS)
Rixner, ISCA00 - (1) Row-hit (column) first Service row-hit
memory accesses first - (2) Oldest-first Then service older accesses
first - This scheduling policy aims to maximize DRAM
throughput - But, it is unfair when multiple threads share the
DRAM system
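FR-FCFS can be viewed as a two-level priority comparison over the queued requests. The following is a minimal sketch of that comparison, not the controller's actual hardware; the `Request` fields and the per-bank `open_row` state are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>
#include <vector>

// Illustrative request descriptor (fields are assumptions, not the paper's structures).
struct Request {
    int           thread_id;
    int           bank;
    std::uint64_t row;
    std::uint64_t arrival_time;   // used for the oldest-first tie-break
};

// FR-FCFS: among queued requests, prefer (1) row-buffer hits, then (2) the oldest request.
std::optional<std::size_t> fr_fcfs_pick(const std::vector<Request>& queue,
                                        const std::vector<std::uint64_t>& open_row /* per bank */) {
    auto better = [&](const Request& a, const Request& b) {
        bool a_hit = (open_row[a.bank] == a.row);
        bool b_hit = (open_row[b.bank] == b.row);
        if (a_hit != b_hit) return a_hit;          // rule 1: row-hit first
        return a.arrival_time < b.arrival_time;    // rule 2: oldest first
    };
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < queue.size(); ++i)
        if (!best || better(queue[i], queue[*best])) best = i;
    return best;  // empty if the queue is empty
}
```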
Slide 5: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
6The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM
throughput - DRAM scheduling policies are thread-unaware and
unfair - Row-hit first unfairly prioritizes threads with
high row buffer locality - Streaming threads
- Threads that keep on accessing the same row
- Oldest-first unfairly prioritizes
memory-intensive threads
7The Problem
Row decoder
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T1 Row 5
T0 Row 0
T1 Row 111
T0 Row 0
T1 Row 16
Request Buffer
Row Buffer
Row 0
Row 0
Column decoder
Row size 8KB, cache block size 64B 128 requests
of T0 serviced before T1
T0 streaming thread
Data
T1 non-streaming thread
Slide 8: Consequences of Unfairness in DRAM
[Figure: memory slowdowns when DRAM is the only shared resource: 7.74, 4.72, 1.85, and 1.05]
- Vulnerability to denial of service [Moscibroda & Mutlu, USENIX Security'07]
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability
Slide 9: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
10Fairness in Shared DRAM Systems
- A threads DRAM performance dependent on its
inherent - Row-buffer locality
- Bank parallelism
- Interference between threads can destroy either
or both - A fair DRAM scheduler should take into account
all factors affecting each threads DRAM
performance - Not solely bandwidth or solely request latency
- Observation A threads performance degradation
due to interference in DRAM mainly characterized
by the extra memory-related stall-time due to
contention with other threads
11Stall-Time Fairness in Shared DRAM Systems
- A DRAM system is fair if it slows down
equal-priority threads equally - Compared to when each thread is run alone on the
same system - Fairness notion similar to SMT Cazorla, IEEE
Micro04Luo, ISPASS01, SoEMT Gabor,
Micro06, and shared caches Kim, PACT04 - Tshared DRAM-related stall-time when the thread
is running with other threads - Talone DRAM-related stall-time when the thread
is running alone - Memory-slowdown Tshared/Talone
- The goal of the Stall-Time Fair Memory scheduler
(STFM) is to equalize Memory-slowdown for all
threads, without sacrificing performance - Considers inherent DRAM performance of each
thread
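A minimal sketch of the two quantities defined on this slide; the struct and function names are illustrative assumptions, and T_alone is assumed to be supplied by the hardware estimation described later in the deck.

```cpp
#include <algorithm>
#include <vector>

// Per-thread stall times, as defined on this slide (names are illustrative).
struct ThreadStallTimes {
    double t_shared;  // DRAM-related stall-time when running with other threads
    double t_alone;   // DRAM-related stall-time when running alone (estimated in hardware)
};

// Memory-slowdown of one thread: T_shared / T_alone.
double memory_slowdown(const ThreadStallTimes& t) {
    return t.t_alone > 0.0 ? t.t_shared / t.t_alone : 1.0;
}

// Unfairness across equal-priority threads: MAX slowdown / MIN slowdown.
// STFM tries to keep this ratio bounded by equalizing the slowdowns.
double unfairness(const std::vector<ThreadStallTimes>& threads) {
    double lo = 1e300, hi = 0.0;
    for (const auto& t : threads) {
        double s = memory_slowdown(t);
        lo = std::min(lo, s);
        hi = std::max(hi, s);
    }
    return (lo > 0.0 && hi >= lo) ? hi / lo : 1.0;
}
```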
Slide 12: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
13STFM Scheduling Algorithm (1)
- During each time interval, for each thread, DRAM
controller - Tracks Tshared
- Estimates Talone
- At the beginning of a scheduling cycle, DRAM
controller - Computes Slowdown Tshared/Talone for each
thread with an outstanding legal request - Computes unfairness MAX Slowdown / MIN Slowdown
- If unfairness lt ?
- Use DRAM throughput oriented baseline scheduling
policy - (1) row-hit first
- (2) oldest-first
14STFM Scheduling Algorithm (2)
- If unfairness ?
- Use fairness-oriented scheduling policy
- (1) requests from thread with MAX Slowdown first
- (2) row-hit first
- (3) oldest-first
- Maximizes DRAM throughput if it cannot improve
fairness - Does NOT waste useful bandwidth to improve
fairness - If a request does not interfere with any other,
it is scheduled
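A minimal sketch of the two-mode decision on these two slides: below the threshold α the scheduler behaves like FR-FCFS, and at or above it the most-slowed-down thread is prioritized first, with row-hit-first and oldest-first as tie-breakers. The `Request` descriptor and the per-thread `slowdown` vector are illustrative assumptions, not the controller's actual structures.

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>
#include <vector>

struct Request {
    int           thread_id;
    int           bank;
    std::uint64_t row;
    std::uint64_t arrival_time;
};

std::optional<std::size_t> stfm_pick(const std::vector<Request>& queue,
                                     const std::vector<std::uint64_t>& open_row,  // per bank
                                     const std::vector<double>& slowdown,         // per thread
                                     double unfairness, double alpha) {
    // In fairness mode, find the thread with the maximum slowdown.
    int victim = -1;
    if (unfairness >= alpha) {
        for (std::size_t t = 0; t < slowdown.size(); ++t)
            if (victim < 0 || slowdown[t] > slowdown[victim]) victim = static_cast<int>(t);
    }
    auto better = [&](const Request& a, const Request& b) {
        if (victim >= 0) {                                   // (1) most-slowed-down thread first
            bool a_v = (a.thread_id == victim), b_v = (b.thread_id == victim);
            if (a_v != b_v) return a_v;
        }
        bool a_hit = (open_row[a.bank] == a.row);
        bool b_hit = (open_row[b.bank] == b.row);
        if (a_hit != b_hit) return a_hit;                    // (2) row-hit first
        return a.arrival_time < b.arrival_time;              // (3) oldest first
    };
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < queue.size(); ++i)
        if (!best || better(queue[i], queue[*best])) best = i;
    return best;
}
```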
15How Does STFM Prevent Unfairness?
T0 Row 0
T1 Row 5
T0 Row 0
T0 Row 0
T1 Row 111
T0 Row 0
T0 Row 0
T0 Row 0
T0 Row 0
T1 Row 16
Row Buffer
Row 16
Row 111
Row 0
Row 0
T0 Slowdown
1.00
1.03
1.04
1.04
1.07
1.10
T1 Slowdown
1.03
1.06
1.06
1.08
1.11
1.14
1.00
Unfairness
1.03
1.06
1.03
1.04
1.06
1.04
1.03
1.00
Data
?
1.05
Slide 16: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
17Implementation
- Tracking Tshared
- Relatively easy
- The processor increases a counter if the thread
cannot commit instructions because the oldest
instruction requires DRAM access - Estimating Talone
- More involved because thread is not running alone
- Difficult to estimate directly
- Observation
- Talone Tshared - Tinterference
- Estimate Tinterference Extra stall-time due to
interference
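A minimal sketch of the per-thread bookkeeping this slide describes; the struct, its fields, and the per-cycle hook are illustrative assumptions rather than the actual hardware design.

```cpp
#include <cstdint>

// Per-thread stall-time counters, sketching the idea on this slide.
struct StallTimeCounters {
    std::uint64_t t_shared = 0;        // cycles stalled on DRAM while sharing
    std::uint64_t t_interference = 0;  // extra stall cycles attributed to other threads

    // Called once per processor cycle: the core reports whether it could not
    // commit because its oldest instruction is waiting on a DRAM access.
    void on_cycle(bool stalled_on_dram) {
        if (stalled_on_dram) ++t_shared;
    }

    // T_alone is not measured directly; it is estimated as T_shared minus the
    // interference the controller has charged to this thread.
    std::uint64_t t_alone_estimate() const {
        return t_shared > t_interference ? t_shared - t_interference : 0;
    }
};
```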
18Estimating Tinterference(1)
- When a DRAM request from thread C is scheduled
- Thread C can incur extra stall time
- The requests row buffer hit status might be
affected by interference - Estimate the row that would have been in the row
buffer if the thread were running alone - Estimate the extra bank access latency the
request incurs
Extra latency amortized across outstanding
accesses of thread C (memory level parallelism)
19Estimating Tinterference(2)
- When a DRAM request from thread C is scheduled
- Any other thread C with outstanding requests
incurs extra stall time - Interference in the DRAM data bus
- Interference in the DRAM bank (see paper)
Bank Access Latency of Scheduled Request
Tinterference(C)
K
Banks Needed by C Requests
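A minimal sketch of that per-schedule update, treating K as the scaling constant from the slide; the per-thread state, the field names, and the saturation at a parallelism of one are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative per-thread state for the interference update (assumed fields).
struct ThreadState {
    std::uint64_t t_interference  = 0;   // accumulated extra stall cycles
    int           banks_needed    = 1;   // # banks its outstanding requests target
    bool          has_outstanding = false;
};

// When a request from thread `owner` is scheduled with the given bank access
// latency, every *other* thread with outstanding requests is charged extra
// stall-time, scaled down by K and by that thread's bank-level parallelism.
void charge_interference(std::vector<ThreadState>& threads, int owner,
                         std::uint64_t bank_access_latency, std::uint64_t K) {
    for (std::size_t t = 0; t < threads.size(); ++t) {
        if (static_cast<int>(t) == owner || !threads[t].has_outstanding) continue;
        std::uint64_t parallelism = static_cast<std::uint64_t>(std::max(1, threads[t].banks_needed));
        threads[t].t_interference += bank_access_latency / (K * parallelism);
    }
}
```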
20Hardware Cost
- lt2KB storage cost for
- 8-core system with 128-entry memory request
buffer - Arithmetic operations approximated
- Fixed point arithmetic
- Divisions using lookup tables
- Not on the critical path
- Scheduler makes a decision only every DRAM cycle
- More details in paper
Slide 21: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
22Support for System Software
- Supporting system-level thread weights/priorities
- Thread weights communicated to the memory
controller - Larger-weight threads should be slowed down less
- Each threads slowdown is scaled by its weight
- Weighted slowdown used for scheduling
- Favors threads with larger weights
- OS can choose thread weights to satisfy QoS
requirements - ? Maximum tolerable unfairness set by system
software - Dont need fairness? Set ? large.
- Need strict fairness? Set ? close to 1.
- Other values of ? trade-off fairness and
throughput
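A minimal sketch of the weight mechanism. The exact scaling form used by the controller is not given on this slide, so the simple multiplication below is an assumption consistent with "larger-weight threads should be slowed down less"; the names are illustrative.

```cpp
#include <algorithm>
#include <vector>

struct WeightedThread {
    double slowdown;  // T_shared / T_alone, as on the earlier slides
    double weight;    // assigned by the OS; larger means higher priority
};

// Assumed scaling: a larger weight inflates the thread's apparent slowdown,
// so the fairness-oriented policy prioritizes it sooner.
double weighted_slowdown(const WeightedThread& t) {
    return t.slowdown * t.weight;
}

// Unfairness is then computed over weighted slowdowns and compared against alpha.
bool use_fairness_mode(const std::vector<WeightedThread>& threads, double alpha) {
    double lo = 1e300, hi = 0.0;
    for (const auto& t : threads) {
        double w = weighted_slowdown(t);
        lo = std::min(lo, w);
        hi = std::max(hi, w);
    }
    return lo > 0.0 && (hi / lo) >= alpha;
}
```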
Slide 23: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
24Evaluation Methodology
- 2-, 4-, 8-, 16-core systems
- x86 processor model based on Intel Pentium M
- 4 GHz processor, 128-entry instruction window
- 512 Kbyte per core private L2 caches
- Detailed DRAM model based on Micron DDR2-800
- 128-entry memory request buffer
- 8 banks, 2Kbyte row buffer
- Row-hit round-trip latency 35ns (140 cycles)
- Row-conflict latency 70ns (280 cycles)
- Benchmarks
- SPEC CPU2006 and some Windows Desktop
applications - 256, 32, 3 benchmark combinations for 4-, 8-,
16-core experiments
25Comparison with Related Work
- Baseline FR-FCFS Rixner et al., ISCA00
- Unfairly penalizes non-intensive threads with
low-row-buffer locality - FCFS
- Low DRAM throughput
- Unfairly penalizes non-intensive threads
- FR-FCFSCap
- Static cap on how many younger row-hits can
bypass older accesses - Unfairly penalizes non-intensive threads
- Network Fair Queueing (NFQ) Nesbit et al.,
Micro06 - Per-thread virtual-time based scheduling
- A threads private virtual-time increases when
its request is scheduled - Prioritizes requests from thread with the
earliest virtual-time - Equalizes bandwidth across equal-priority threads
- Does not consider inherent performance of each
thread - Unfairly prioritizes threads with bursty access
patterns (idleness problem) - Unfairly penalizes threads with unbalanced bank
usage (in paper)
26Idleness/Burstiness Problem in Fair Queueing
Thread 1s virtual time increases even though no
other thread needs DRAM
Only Thread 2 serviced in interval t1,t2 since
its virtual time is smaller than Thread 1s
Only Thread 3 serviced in interval t2,t3 since
its virtual time is smaller than Thread 1s
Only Thread 4 serviced in interval t3,t4 since
its virtual time is smaller than Thread 1s
Non-bursty thread suffers large performance loss
even though it fairly utilized DRAM when no
other thread needed it
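A minimal sketch of per-thread virtual-time accounting as described on this slide, to make the idleness problem concrete; it is not NFQ's actual implementation, and the fixed service quantum is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Per-thread virtual time: it advances only when that thread's request is
// scheduled, so a thread that runs alone early (while others are idle) builds
// up a large virtual time and is then deprioritized once the others arrive.
struct VirtualTimeScheduler {
    std::vector<double> vtime;   // one entry per thread
    double quantum = 1.0;        // assumed fixed cost charged per serviced request

    explicit VirtualTimeScheduler(std::size_t n) : vtime(n, 0.0) {}

    // Pick the thread with the earliest virtual time among those with requests.
    int pick(const std::vector<bool>& has_request) const {
        int best = -1;
        for (std::size_t t = 0; t < vtime.size(); ++t)
            if (has_request[t] && (best < 0 || vtime[t] < vtime[best]))
                best = static_cast<int>(t);
        return best;
    }

    // Charge the serviced thread: its virtual time keeps growing even if the
    // DRAM was otherwise idle, which is the idleness/burstiness problem.
    void service(int thread) { vtime[thread] += quantum; }
};
```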
Slide 27: Unfairness on 4-, 8-, 16-core Systems
- Unfairness = MAX Memory Slowdown / MIN Memory Slowdown
[Figure: unfairness of the compared schedulers on 4-, 8-, and 16-core systems, with annotated factors of 1.26X, 1.27X, and 1.81X]
Slide 28: System Performance
Slide 29: Hmean-Speedup (Throughput-Fairness Balance)
[Figure: harmonic-mean speedup of the compared schedulers, with annotated values of 10.8, 9.5, and 11.2]
Slide 30: Outline
- The Problem
- Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
- Fairness definition
- Algorithm
- Implementation
- System software support
- Experimental Evaluation
- Conclusions
31Conclusions
- A new definition of DRAM fairness stall-time
fairness - Equal-priority threads should experience equal
memory-related slowdowns - Takes into account inherent memory performance of
threads - New DRAM scheduling algorithm enforces this
definition - Flexible and configurable fairness substrate
- Supports system-level thread priorities/weights ?
QoS policies - Results across a wide range of workloads and
systems show - Improving DRAM fairness also improves system
throughput - STFM provides better fairness and system
performance than previously-proposed DRAM
schedulers
Slide 32: Thank you. Questions?
Slide 33: Stall-Time Fair Memory Access Scheduling
- Onur Mutlu and Thomas Moscibroda
- Computer Architecture Group
- Microsoft Research
Slide 34: Backup
Slide 35: Structure of the STFM Controller
36Comparison using NFQ QoS Metrics
- Nesbit et al. MICRO06 proposed the following
target for quality of service - A thread that is allocated 1/Nth of the memory
system bandwidth will run no slower than the same
thread on a private memory system running at
1/Nth of the frequency of the shared physical
memory system - Baseline with memory bandwidth scaled down by N
- We compared different DRAM schedulers
effectiveness using this metric - Number of violations of the above QoS target
- Harmonic mean of IPC normalized to the above
baseline
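A minimal sketch of counting violations of the QoS target stated on this slide; the per-thread IPC inputs are assumed to come from simulation of the shared system and of the scaled-down private baseline, respectively.

```cpp
#include <cstddef>
#include <vector>

// A thread allocated 1/N of the memory bandwidth should run no slower than it
// would on a private memory system clocked at 1/N of the shared frequency.
// A violation is any thread whose shared-system IPC falls below that baseline.
std::size_t count_qos_violations(const std::vector<double>& ipc_shared,
                                 const std::vector<double>& ipc_private_1_over_N) {
    std::size_t violations = 0;
    for (std::size_t i = 0; i < ipc_shared.size(); ++i)
        if (ipc_shared[i] < ipc_private_1_over_N[i])  // slower than the QoS baseline
            ++violations;
    return violations;
}
```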
Slide 37: Violations of the NFQ QoS Target
Slide 38: Hmean Normalized IPC using the NFQ Baseline
[Figure: harmonic mean of IPC normalized to the NFQ baseline, with annotated values of 7.3, 5.9, 5.1, 10.3, 9.1, and 7.8]
39Shortcomings of the NFQ QoS Target
- Low baseline (easily achievable target) for
equal-priority threads - N equal-priority threads ? a thread should do
better than on a system with 1/Nth of the memory
bandwidth - This target is usually very easy to achieve
- Especially when N is large
- Unachievable target in some cases
- Consider two threads always accessing the same
bank in an interleaved fashion ? too much
interference - Baseline performance very difficult to determine
in a real system - Cannot scale memory frequency arbitrarily
- Not knowing baseline performance makes it
difficult to set thread priorities (how
much bandwidth to assign to each thread)
Slide 40: A Case Study
[Figure: per-thread memory slowdowns under the compared schedulers, with annotated unfairness values of 7.28, 2.07, 2.08, 1.87, and 1.27]
Slide 41: Windows Desktop Workloads
Slide 42: Enforcing Thread Weights
Slide 43: Effect of α
Slide 44: Effect of Banks and Row Buffer Size