Stall-Time Fair Memory Access Scheduling
Transcript and Presenter's Notes
1
Stall-Time Fair Memory Access Scheduling
  • Onur Mutlu and Thomas Moscibroda
  • Computer Architecture Group
  • Microsoft Research

2
Multi-Core Systems
[Figure: A multi-core chip with four cores (CORE 0 through CORE 3), each with a private L2 cache, sharing a DRAM memory system: a DRAM memory controller in front of DRAM Banks 0 through 7. Unfairness arises in this shared DRAM system.]
3
DRAM Bank Operation
[Figure: DRAM bank operation. An access to (Row 0, Column 0) loads Row 0 into the initially empty row buffer; subsequent accesses to (Row 0, Column 1) and (Row 0, Column 9) are row-buffer HITs served directly from the buffer; a later access to (Row 1, Column 0) is a row CONFLICT that must fetch Row 1 into the row buffer before its column can be read.]
4
DRAM Controllers
  • A row-conflict memory access takes significantly
    longer than a row-hit access
  • Current controllers take advantage of the row
    buffer
  • Commonly used scheduling policy: FR-FCFS
    [Rixner et al., ISCA'00]
  • (1) Row-hit (column) first: service row-hit
    memory accesses first
  • (2) Oldest-first: then service older accesses
    first
  • This scheduling policy aims to maximize DRAM
    throughput
  • But it is unfair when multiple threads share the
    DRAM system (a sketch of the policy follows)

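To make the policy concrete, here is a minimal sketch in Python (not from the slides; the Request record and field names are hypothetical) of how an FR-FCFS scheduler picks the next request for a bank:

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    row: int           # DRAM row this request targets
    arrival_time: int  # when the request entered the buffer

def fr_fcfs_pick(requests, open_row):
    """FR-FCFS: prefer row hits, then the oldest request.

    open_row is the row currently held in the bank's row buffer.
    """
    # (1) Row-hit first: a request to the open row avoids a
    #     precharge/activate and completes much faster.
    # (2) Oldest-first: break ties by arrival time.
    return min(requests,
               key=lambda r: (r.row != open_row, r.arrival_time))
```

Note that the comparator never looks at thread_id: this thread-unawareness is exactly what makes the policy unfair under sharing.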
5
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

6
The Problem
  • Multiple threads share the DRAM controller
  • DRAM controllers are designed to maximize DRAM
    throughput
  • DRAM scheduling policies are thread-unaware and
    unfair
  • Row-hit first unfairly prioritizes threads with
    high row-buffer locality
  • Streaming threads
  • Threads that keep on accessing the same row
  • Oldest-first unfairly prioritizes
    memory-intensive threads

7
The Problem
[Figure: Request buffer and row buffer contents. T0, a streaming thread, fills the request buffer with row-hit requests to Row 0; T1, a non-streaming thread, has requests to Rows 5, 111, and 16. With an 8 KB row and 64 B cache blocks, up to 128 of T0's requests can be serviced (row-hit first) before any request of T1.]
8
Consequences of Unfairness in DRAM
[Chart: per-thread memory slowdowns on a system where DRAM is the only shared resource; annotated values: 7.74, 4.72, 1.85, 1.05.]
  • Vulnerability to denial of service [Moscibroda &
    Mutlu, USENIX Security'07]
  • System throughput loss
  • Priority inversion at the system/OS level
  • Poor performance predictability

9
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

10
Fairness in Shared DRAM Systems
  • A thread's DRAM performance depends on its
    inherent
  • Row-buffer locality
  • Bank parallelism
  • Interference between threads can destroy either
    or both
  • A fair DRAM scheduler should take into account
    all factors affecting each thread's DRAM
    performance
  • Not solely bandwidth or solely request latency
  • Observation: a thread's performance degradation
    due to interference in DRAM is mainly characterized
    by the extra memory-related stall-time caused by
    contention with other threads

11
Stall-Time Fairness in Shared DRAM Systems
  • A DRAM system is fair if it slows down
    equal-priority threads equally
  • Compared to when each thread is run alone on the
    same system
  • Fairness notion similar to SMT [Cazorla, IEEE
    Micro'04; Luo, ISPASS'01], SoEMT [Gabor,
    MICRO'06], and shared caches [Kim, PACT'04]
  • T_shared: DRAM-related stall-time when the thread
    is running with other threads
  • T_alone: DRAM-related stall-time when the thread
    is running alone
  • Memory Slowdown = T_shared / T_alone
  • The goal of the Stall-Time Fair Memory scheduler
    (STFM) is to equalize Memory Slowdown across all
    threads, without sacrificing performance
  • Considers the inherent DRAM performance of each
    thread (the two metrics are sketched below)

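In code form, the two quantities the scheduler reasons about are straightforward (a sketch; the function names are mine, not the paper's):

```python
def memory_slowdown(t_shared, t_alone):
    # Memory Slowdown = T_shared / T_alone: how much longer the thread
    # stalls on DRAM when sharing the system than when running alone.
    return t_shared / t_alone

def unfairness(slowdowns):
    # Unfairness = MAX Slowdown / MIN Slowdown across threads;
    # 1.0 means perfectly stall-time fair.
    return max(slowdowns) / min(slowdowns)
```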
12
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

13
STFM Scheduling Algorithm (1)
  • During each time interval, for each thread, the DRAM
    controller
  • Tracks T_shared
  • Estimates T_alone
  • At the beginning of a scheduling cycle, the DRAM
    controller
  • Computes Slowdown = T_shared / T_alone for each
    thread with an outstanding legal request
  • Computes unfairness = MAX Slowdown / MIN Slowdown
  • If unfairness < α
  • Use the throughput-oriented baseline scheduling
    policy
  • (1) row-hit first
  • (2) oldest-first

14
STFM Scheduling Algorithm (2)
  • If unfairness ≥ α
  • Use the fairness-oriented scheduling policy
  • (1) requests from the thread with MAX Slowdown first
  • (2) row-hit first
  • (3) oldest-first
  • Maximizes DRAM throughput if it cannot improve
    fairness
  • Does NOT waste useful bandwidth to improve
    fairness
  • If a request does not interfere with any other,
    it is scheduled (both cases are sketched below)

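Putting slides 13 and 14 together, a minimal sketch of the STFM scheduling decision (reusing the hypothetical Request record from the FR-FCFS sketch; alpha is the unfairness threshold α):

```python
def stfm_pick(requests, open_row, slowdown, alpha):
    """Pick the next request under STFM (sketch).

    slowdown maps thread_id -> estimated T_shared / T_alone for
    each thread with an outstanding request.
    """
    u = max(slowdown.values()) / min(slowdown.values())
    if u < alpha:
        # Fairness is acceptable: use the throughput-oriented
        # baseline, (1) row-hit first, (2) oldest-first.
        return min(requests,
                   key=lambda r: (r.row != open_row, r.arrival_time))
    # Unfairness reached alpha: prioritize (1) the most slowed-down
    # thread, then (2) row hits, then (3) age.
    victim = max(slowdown, key=slowdown.get)
    return min(requests,
               key=lambda r: (r.thread_id != victim,
                              r.row != open_row,
                              r.arrival_time))
```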
15
How Does STFM Prevent Unfairness?
[Animation: STFM servicing the request buffer from slide 7 with α = 1.05. T0's row hits to Row 0 are serviced while unfairness stays below α; whenever servicing another T0 request would push unfairness past α, one of T1's row-conflict requests (Rows 5, 111, 16) is prioritized instead. T0's slowdown grows 1.00 → 1.03 → 1.04 → 1.04 → 1.07 → 1.10, T1's 1.03 → 1.06 → 1.06 → 1.08 → 1.11 → 1.14, and unfairness stays in the 1.00-1.06 range throughout.]
16
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

17
Implementation
  • Tracking T_shared
  • Relatively easy
  • The processor increments a counter whenever the
    thread cannot commit instructions because the oldest
    instruction requires DRAM access
  • Estimating T_alone
  • More involved because the thread is not running alone
  • Difficult to measure directly
  • Observation
  • T_alone = T_shared - T_interference
  • Estimate T_interference: the extra stall-time due to
    interference (a bookkeeping sketch follows)

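A sketch of the per-thread bookkeeping this implies (class and field names are mine):

```python
class ThreadStats:
    """Per-thread counters kept by the STFM controller (sketch)."""
    def __init__(self):
        self.t_shared = 0        # cycles stalled on DRAM while sharing
        self.t_interference = 0  # estimated extra stall caused by others

    def tick(self, stalled_on_dram):
        # Increment T_shared on every cycle in which the oldest
        # instruction cannot commit because it is waiting on DRAM.
        if stalled_on_dram:
            self.t_shared += 1

    def t_alone(self):
        # T_alone cannot be measured while sharing, so it is estimated
        # as T_alone = T_shared - T_interference.
        return self.t_shared - self.t_interference
```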
18
Estimating T_interference (1)
  • When a DRAM request from thread C is scheduled
  • Thread C itself can incur extra stall time
  • The request's row-buffer hit status might be
    affected by interference
  • Estimate the row that would have been in the row
    buffer if the thread were running alone
  • Estimate the extra bank access latency the
    request incurs

The extra latency is amortized across the outstanding
accesses of thread C (memory-level parallelism), as
sketched below.
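A sketch of this estimate (helper and parameter names are mine; t_hit and t_conflict stand for the row-hit and row-conflict latencies):

```python
def own_request_interference(req_row, actual_latency, shadow_row,
                             t_hit, t_conflict, outstanding):
    """Extra stall charged to thread C when its own request is
    scheduled (sketch). shadow_row is the row that would be open
    in this bank had C run alone."""
    # Would the request have been a row hit if C ran alone?
    alone_latency = t_hit if req_row == shadow_row else t_conflict
    extra = max(0, actual_latency - alone_latency)
    # Amortize across C's outstanding accesses: memory-level
    # parallelism hides part of the extra latency behind the
    # thread's other in-flight requests.
    return extra / max(1, outstanding)
```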
19
Estimating T_interference (2)
  • When a DRAM request from thread C is scheduled
  • Any other thread C' with outstanding requests
    incurs extra stall time
  • Interference in the DRAM data bus
  • Interference in the DRAM bank (see paper)

T_interference(C') += (Bank Access Latency of Scheduled Request)
                      / (K × # Banks Needed by C''s Requests)
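In code form, the update might look as follows (a sketch of the formula above, assuming the ThreadStats sketch from slide 17 plus a hypothetical banks_needed field; K is a scaling constant):

```python
def charge_waiting_threads(scheduled_latency, waiting_threads, K):
    # When one thread's request occupies a bank (and the data bus),
    # every other thread C' with outstanding requests accumulates
    # interference: the more banks C' is waiting on in parallel,
    # the smaller the share of the delay it actually feels.
    for t in waiting_threads:
        t.t_interference += scheduled_latency / (K * max(1, t.banks_needed))
```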
20
Hardware Cost
  • <2 KB storage cost for an
  • 8-core system with a 128-entry memory request
    buffer
  • Arithmetic operations approximated
  • Fixed-point arithmetic
  • Divisions via lookup tables (sketched below)
  • Not on the critical path
  • The scheduler makes a decision only once per DRAM cycle
  • More details in the paper

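For instance, the slowdown division T_shared / T_alone can be approximated in fixed point with a small reciprocal lookup table instead of a hardware divider. A minimal sketch of the idea (table size and precision are my assumptions, not the paper's):

```python
FRAC_BITS = 10  # fixed-point fraction bits (assumed precision)
# RECIP[x] ~= (1/x) << FRAC_BITS for x in 1..255.
RECIP = [0] + [(1 << FRAC_BITS) // x for x in range(1, 256)]

def approx_divide(num, den):
    # Shift the denominator into the table's 8-bit index range,
    # then multiply by the stored reciprocal instead of dividing.
    shift = max(0, den.bit_length() - 8)
    idx = den >> shift
    return (num * RECIP[idx]) >> (FRAC_BITS + shift)
```

For example, approx_divide(1000, 3) returns 333 instead of the exact 333.3; the small error is acceptable because the slowdowns only steer a scheduling heuristic.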
21
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

22
Support for System Software
  • Supporting system-level thread weights/priorities
  • Thread weights are communicated to the memory
    controller
  • Larger-weight threads should be slowed down less
  • Each thread's slowdown is scaled by its weight
  • The weighted slowdown is used for scheduling
  • Favors threads with larger weights (see the
    sketch below)
  • The OS can choose thread weights to satisfy QoS
    requirements
  • α: maximum tolerable unfairness, set by system
    software
  • Don't need fairness? Set α large.
  • Need strict fairness? Set α close to 1.
  • Other values of α trade off fairness and
    throughput

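One plausible reading of "scaled by its weight", as a sketch (the exact scaling used in the paper may differ):

```python
def weighted_slowdown(t_shared, t_alone, weight):
    # Multiplying the measured slowdown by the thread's weight makes
    # a larger-weight thread reach the unfairness threshold sooner,
    # so the scheduler prioritizes it earlier and it ends up slowed
    # down less, i.e., larger weights are favored.
    return (t_shared / t_alone) * weight
```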
23
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

24
Evaluation Methodology
  • 2-, 4-, 8-, 16-core systems
  • x86 processor model based on the Intel Pentium M
  • 4 GHz processor, 128-entry instruction window
  • 512 KB per-core private L2 caches
  • Detailed DRAM model based on Micron DDR2-800
  • 128-entry memory request buffer
  • 8 banks, 2 KB row buffer
  • Row-hit round-trip latency: 35 ns (140 cycles)
  • Row-conflict latency: 70 ns (280 cycles)
  • Benchmarks
  • SPEC CPU2006 and some Windows desktop
    applications
  • 256, 32, and 3 benchmark combinations for the
    4-, 8-, and 16-core experiments

25
Comparison with Related Work
  • Baseline FR-FCFS [Rixner et al., ISCA'00]
  • Unfairly penalizes non-intensive threads with
    low row-buffer locality
  • FCFS
  • Low DRAM throughput
  • Unfairly penalizes non-intensive threads
  • FR-FCFS+Cap
  • Static cap on how many younger row hits can
    bypass older accesses
  • Unfairly penalizes non-intensive threads
  • Network Fair Queueing (NFQ) [Nesbit et al.,
    MICRO'06]
  • Per-thread virtual-time-based scheduling
  • A thread's private virtual time increases when
    its request is scheduled
  • Prioritizes requests from the thread with the
    earliest virtual time
  • Equalizes bandwidth across equal-priority threads
  • Does not consider the inherent performance of each
    thread
  • Unfairly prioritizes threads with bursty access
    patterns (idleness problem)
  • Unfairly penalizes threads with unbalanced bank
    usage (in paper)

26
Idleness/Burstiness Problem in Fair Queueing
Thread 1's virtual time increases even though no
other thread needs DRAM.
Only Thread 2 is serviced in interval [t1, t2], since
its virtual time is smaller than Thread 1's.
Only Thread 3 is serviced in interval [t2, t3], since
its virtual time is smaller than Thread 1's.
Only Thread 4 is serviced in interval [t3, t4], since
its virtual time is smaller than Thread 1's.
The non-bursty thread suffers a large performance loss
even though it fairly utilized DRAM when no
other thread needed it.
27
Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown
[Chart: unfairness of the evaluated schedulers; STFM's
fairness improvements are annotated as 1.26X, 1.27X,
and 1.81X.]
28
System Performance
29
Hmean-speedup (Throughput-Fairness Balance)
[Chart: Hmean-speedup of the evaluated schedulers on the
4-, 8-, and 16-core systems; annotated values: 10.8,
9.5, 11.2.]
30
Outline
  • The Problem
  • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
  • Fairness definition
  • Algorithm
  • Implementation
  • System software support
  • Experimental Evaluation
  • Conclusions

31
Conclusions
  • A new definition of DRAM fairness: stall-time
    fairness
  • Equal-priority threads should experience equal
    memory-related slowdowns
  • Takes into account the inherent memory performance
    of threads
  • A new DRAM scheduling algorithm enforces this
    definition
  • Flexible and configurable fairness substrate
  • Supports system-level thread priorities/weights →
    QoS policies
  • Results across a wide range of workloads and
    systems show
  • Improving DRAM fairness also improves system
    throughput
  • STFM provides better fairness and system
    performance than previously proposed DRAM
    schedulers

32
Thank you. Questions?
33
Stall-Time Fair Memory Access Scheduling
  • Onur Mutlu and Thomas Moscibroda
  • Computer Architecture Group
  • Microsoft Research

34
Backup
35
Structure of the STFM Controller
36
Comparison using NFQ QoS Metrics
  • Nesbit et al. [MICRO'06] proposed the following
    target for quality of service:
  • "A thread that is allocated 1/Nth of the memory
    system bandwidth will run no slower than the same
    thread on a private memory system running at
    1/Nth of the frequency of the shared physical
    memory system"
  • Baseline: the same system with memory bandwidth
    scaled down by N
  • We compared the different DRAM schedulers'
    effectiveness using this metric
  • Number of violations of the above QoS target
  • Harmonic mean of IPC normalized to the above
    baseline

37
Violations of the NFQ QoS Target
38
Hmean Normalized IPC using NFQ Baseline
[Chart: harmonic mean of normalized IPC using the NFQ
baseline; annotated values: 7.3, 5.9, 5.1 and 10.3,
9.1, 7.8.]
39
Shortcomings of the NFQ QoS Target
  • Low baseline (easily achievable target) for
    equal-priority threads
  • N equal-priority threads → a thread should do
    better than on a system with 1/Nth of the memory
    bandwidth
  • This target is usually very easy to achieve
  • Especially when N is large
  • Unachievable target in some cases
  • Consider two threads always accessing the same
    bank in an interleaved fashion → too much
    interference
  • Baseline performance is very difficult to determine
    on a real system
  • Cannot scale memory frequency arbitrarily
  • Not knowing the baseline performance makes it
    difficult to set thread priorities (how
    much bandwidth to assign to each thread)

40
A Case Study
[Chart: per-thread memory slowdowns under the evaluated
schedulers; unfairness values annotated: 7.28, 2.07,
2.08, 1.87, 1.27.]
41
Windows Desktop Workloads
42
Enforcing Thread Weights
43
Effect of α
44
Effect of Banks and Row Buffer Size