Title: Architectural Support for Enhanced SMT Job Scheduling
1. Architectural Support for Enhanced SMT Job Scheduling
- Alex Settle
- Joshua Kihm
- Andy Janiszewski
- Daniel A. Connors
- University of Colorado at Boulder
2. Introduction
- Shared memory systems of SMT processors limit performance
  - Threads continuously compete for shared cache resources
  - Interference between threads causes workload slowdown
- Detecting thread interference is a challenge for real systems
  - Low-level cache monitoring
  - Difficult to exploit run-time data
- Goal
  - Design the performance monitoring hardware required to capture thread interference information that can be exposed to the operating system scheduler to improve workload performance
3. Simultaneous Multithreading (SMT)
- Concurrently executes instructions from different contexts
  - Thread-level parallelism (TLP)
  - Improves instruction-level parallelism (ILP)
  - Improves utilization of base processor
- Intel Pentium 4 Xeon
  - 2-level cache hierarchy
  - Instruction trace cache
  - 8 KB data cache, 4-way associative, 64 bytes per line
  - 512 KB unified L2 cache, 8-way associative, 64 bytes per line
  - 2-way SMT
4. Inter-thread Interference
- Competition for shared resources
  - Memory system
  - Buses
  - Physical cache storage
  - Fetch and issue queues
  - Functional units
- Threads evict cache data belonging to other threads
  - Increase in cache misses
  - Diminishes processor utilization
- Inter-thread kick-outs (ITKO)
  - Measured in simulator
  - Thread ID of the evicted cache line compared to that of the new cache line
  - Increased ITKO leads to decreased IPC
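The ITKO count described above can be sketched with a small set-associative cache model: on every miss that forces an eviction, the victim line's thread ID is compared with the incoming line's. This is an illustrative sketch, not the authors' simulator code; the `Cache` class and its field names are assumptions.

```python
# Illustrative model of counting inter-thread kick-outs (ITKOs) in a
# set-associative cache shared by SMT threads. Names are hypothetical.

class Cache:
    def __init__(self, num_sets, ways):
        # Each set holds up to `ways` (tag, thread_id) entries, MRU-first.
        self.sets = [[] for _ in range(num_sets)]
        self.ways = ways
        self.itko_count = 0  # evictions where the victim belonged to another thread

    def access(self, set_index, tag, thread_id):
        """Returns True on a hit, False on a miss (allocating the line)."""
        lines = self.sets[set_index]
        for i, (t, tid) in enumerate(lines):
            if t == tag:
                lines.insert(0, lines.pop(i))  # hit: move to MRU position
                return True
        # Miss: evict the LRU line if the set is full.
        if len(lines) == self.ways:
            _, victim_tid = lines.pop()
            if victim_tid != thread_id:
                # Victim belonged to a different thread: inter-thread kick-out.
                self.itko_count += 1
        lines.insert(0, (tag, thread_id))
        return False
```

Tracking `itko_count` per phase interval is what exposes the correlation with IPC shown on the next slide: phases with many cross-thread evictions are the ones where co-scheduled jobs slow each other down.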
5. ITKO to IPC Correlation, Level 3 Cache
- IPC recorded for each phase interval
- High ITKO rate leads to a significant drop in IPC
- Large variability in IPC over workload lifetime due to cache interference
6. Related Work
- Interference problem addressed at different levels
- Compiler
  - Kumar, Tullsen (MICRO '02): procedure placement optimization; workload fixed at compile time
  - J. Lo (MICRO '97): tailoring compiler optimizations for SMT; effects of traditional optimizations on SMT performance; static optimizations
- Operating system
  - Tullsen, Snavely (ASPLOS '00): symbiotic job scheduling; profile based, with simulated OS and architecture
  - J. Lo (ISCA '98): data cache address remapping; workload dependent, database applications
- Microarchitecture
  - Brown (MICRO '01): issue policy feedback from the memory system; improved fetch and issue resource allocation; does not tackle inter-thread interference
7. Motivation
- Improve performance by reducing inter-thread interference
- Multi-faceted problem
  - Dependent on thread pairings
  - Occurs at low-level cache-line granularity
  - Difficult to detect at runtime
- OS scheduling decisions affect microarchitecture performance
  - Observed on both simulator and real system
- Observation
  - Cache access footprints vary over program lifetimes
  - Accesses are concentrated in small cache regions
8. Concentration of L2-Cache Access
- Cache access and miss footprints vary across program phases
- Intervals with high access and miss rates are concentrated in small physical regions of the cache (green, red)
- Current performance counters cannot detect that activity is concentrated in small regions
9. Cache Use Map Runtime Monitoring
- Spatial locality (vertical axis)
- Temporal locality (horizontal axis)
10. Benchmark Pairings: ITKO
- Yellow represents very high interference
- Interference is dependent on job mix
11. Performance Guided Scheduling Theory
- (Figure: total ITKOs per workload, best static vs. dynamic schedule: 2.91M vs. 2.55M; 2.91M vs. 2.55M; 2.91M vs. 2.90M; 7.30M vs. 6.70M)
- Each phase, the scheduler selects the jobs with the least interference
12. Solution to Inter-thread Interference
- Predict future interference
- Capture inter-thread interference behavior
  - Introduce cache line activity counters
  - Expose them to the operating system
- Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Activity-based job scheduler
  - Schedule for minimal inter-thread interference
13. Activity Vectors
- Interface between OS and microarchitecture
- Divide cache into super sets
  - Access counters assigned to each super set
  - One vector bit corresponds to each counter
  - Bit is set when a threshold is exceeded
- Job scheduler
  - Compares the active vector with jobs in the run queue
  - Selects the job with the fewest common set bits
- Thresholds established through static analysis (global median across all benchmarks)
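A minimal sketch of the mechanism described above: per-super-set counters are thresholded into a bit vector, and the scheduler picks the run-queue job whose vector shares the fewest set bits with the running job's. The function names and the AND-then-popcount comparison are my reading of the slide, not the authors' implementation.

```python
# Sketch: activity vector construction and least-interference job selection.
# Structure and names are illustrative assumptions based on the slide text.

def make_vector(counters, threshold):
    """One bit per super set; bit i is set when counter i exceeds the threshold."""
    vec = 0
    for i, count in enumerate(counters):
        if count > threshold:
            vec |= 1 << i
    return vec

def common_bits(v1, v2):
    """Number of super sets active in both vectors (expected conflict regions)."""
    return bin(v1 & v2).count("1")

def select_job(active_vector, run_queue):
    """run_queue: list of (job_name, predicted_vector) pairs. Returns the job
    whose vector overlaps least with the currently running job's vector."""
    return min(run_queue, key=lambda job: common_bits(active_vector, job[1]))
```

For example, `make_vector([120, 3, 90, 0], threshold=50)` yields `0b101`: only super sets 0 and 2 saw enough accesses to set their bits, so a candidate concentrated in super sets 1 and 3 would be a good co-schedule partner.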
14. Vector Prediction (Simulator)
- Use last vector to approximate next vector
- Average accuracy: 91%
- Simple and effective

Activity Vector   Use Predictability (%)   Miss Predictability (%)
D-Cache           82.3                     93.6
I-Cache           94.9                     90.3
L2-Cache          93.8                     94.6
Average           90.3                     92.8
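The last-value scheme above amounts to predicting that phase t's vector equals phase t-1's. Scoring that prediction as the fraction of matching bits could look like the sketch below; this per-bit methodology is an assumption for illustration, not necessarily how the paper's predictability numbers were computed.

```python
# Sketch: last-value activity vector prediction and its per-bit accuracy.
# The accuracy metric here is an illustrative assumption.

def last_value_accuracy(vector_trace, width=8):
    """vector_trace: list of integer activity vectors, one per phase.
    The prediction for each phase is the previous phase's vector; returns
    the fraction of vector bits predicted correctly across the trace."""
    matched = total = 0
    for prev, cur in zip(vector_trace, vector_trace[1:]):
        # XOR isolates the bits where prediction and outcome disagree.
        mispredicted = bin(prev ^ cur).count("1")
        matched += width - mispredicted
        total += width
    return matched / total if total else 1.0
```

Because activity footprints are stable within a program phase and only shift at phase boundaries, this trivial predictor is accurate enough (around 91% on average per the slide) to drive scheduling decisions.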
15. OS Scheduling Algorithm
- (Figure: run queues 0 and 1 hold OS tasks and jobs such as perlbmk, gzip, mesa, mcf, ammp, and parser, each with its vectors; twolf's vector is compared against the jobs running on CPU 0 and CPU 1 of the physical processor)
- Weighted sum of vectors at each cache level
- Vectors from L2 given the highest weight
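The weighted combination of per-level vectors could look like the sketch below. The slide only says that L2 vectors get the highest weight, so the specific weight values here are made-up placeholders.

```python
# Sketch: score a candidate job against the running job using a weighted sum
# of per-cache-level activity vector overlaps. Weight values are placeholders;
# the slide specifies only that L2 is weighted most heavily.

WEIGHTS = {"dcache": 1, "icache": 1, "l2": 4}

def overlap(v1, v2):
    """Count of super sets active in both vectors."""
    return bin(v1 & v2).count("1")

def interference_score(running_vectors, candidate_vectors):
    """Each argument maps a cache level to one job's activity vector."""
    return sum(WEIGHTS[level] * overlap(running_vectors[level],
                                        candidate_vectors[level])
               for level in WEIGHTS)

def pick_next(running_vectors, run_queue):
    """run_queue: list of (job_name, {level: vector}). Lowest score wins."""
    return min(run_queue,
               key=lambda job: interference_score(running_vectors, job[1]))
```

Weighting L2 most heavily reflects the cost asymmetry: an L2 conflict miss costs a memory access, while an L1 conflict usually costs only an L2 hit.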
16. Activity Vector Procedure
- Real system
  - Modified Linux kernel 2.6.0
  - Tested on an Intel P4 Xeon with Hyper-Threading
  - Emulated activity counter registers
    - Generate vectors off-line with the Valgrind memory simulator
    - Text file output
    - Copy vectors to kernel memory space
  - Activate vector scheduler
  - Time and run workloads
- Simulator
  - Vector hardware
  - Simulated OS

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100
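The off-line step (memory trace in, per-phase bit strings out, as in the table above) might look roughly like this. The trace format, phase length, threshold, and cache geometry here are all assumptions for illustration; the authors used Valgrind's memory simulation to produce the real traces.

```python
# Sketch: turn a memory-reference trace into per-phase activity vectors,
# mimicking the off-line vector generation flow. Geometry, phase length,
# and threshold are illustrative assumptions.

NUM_SUPER_SETS = 8
LINE_SIZE = 64             # bytes per cache line (matches the Xeon's caches)
NUM_SETS = 1024            # assumed total cache sets
SETS_PER_SUPER = NUM_SETS // NUM_SUPER_SETS
PHASE_LEN = 100_000        # assumed references per phase interval
THRESHOLD = 1_000          # assumed accesses needed to set a vector bit

def vectors_from_trace(addresses):
    """addresses: iterable of byte addresses from a memory trace. Yields one
    bit string per phase, in the text-file format copied into kernel memory."""
    counters = [0] * NUM_SUPER_SETS
    for n, addr in enumerate(addresses, 1):
        cache_set = (addr // LINE_SIZE) % NUM_SETS
        counters[cache_set // SETS_PER_SUPER] += 1
        if n % PHASE_LEN == 0:
            yield "".join("1" if c > THRESHOLD else "0" for c in counters)
            counters = [0] * NUM_SUPER_SETS   # reset for the next phase
```

Each emitted line corresponds to one row of the phase/vector table above; the kernel scheduler then indexes these rows by the job's current phase instead of reading hardware counters.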
17. Workloads (Xeon)
- 8 SPEC CPU2000 jobs per workload
- Combination of integer and floating-point applications
- Run to completion in parallel with OS-level jobs

WL1: gzip.vpr.gcc.mesa.art.mcf.equake.crafty
WL2: parser.gap.vortex.bzip2.vpr.mesa.crafty.mcf
WL3: mesa.twolf.vortex.gzip.gcc.art.crafty.vpr
WL4: gzip.twolf.vpr.bzip2.gcc.gap.mesa.parser
WL5: equake.crafty.mcf.parser.art.gap.mesa.vortex
WL6: twolf.bzip2.vortex.gap.parser.crafty.equake.mcf
18. Comparison of Scheduling Algorithms
- Default Linux vs. activity-based
- More than 30% of the default scheduler's decisions could have been improved by the activity-based scheduler
19. Activity Vector Performance (Xeon)
20. Comparing Activity Vectors to Existing Performance Counters (Simulation)

Benchmark Mix                                   Diff. (%)
164.gzip, 164.gzip, 181.mcf, 183.equake         0.0
164.gzip, 164.gzip, 188.ammp, 300.twolf         12.0
164.gzip, 177.mesa, 181.mcf, 183.equake         0.0
164.gzip, 177.mesa, 183.equake, 183.equake      0.0
164.gzip, 197.parser, 253.perlbmk, 300.twolf    44.4
177.mesa, 177.mesa, 197.parser, 300.twolf       11.1
177.mesa, 181.mcf, 253.perlbmk, 256.bzip2       0.0
177.mesa, 188.ammp, 253.perlbmk, 300.twolf      59.5
177.mesa, 197.parser, 197.parser, 256.bzip2     96.2
181.mcf, 181.mcf, 256.bzip2, 256.bzip2          0.0
181.mcf, 183.equake, 253.perlbmk, 300.twolf     4.0
181.mcf, 253.perlbmk, 253.perlbmk, 256.bzip2    0.0
183.equake, 188.ammp, 188.ammp, 256.bzip2       11.1
188.ammp, 188.ammp, 197.parser, 197.parser      96.2
188.ammp, 300.twolf, 300.twolf, 300.twolf       8.0
197.parser, 197.parser, 253.perlbmk, 256.bzip2  0.0
Average                                         22.5

- On average, the activity-based schedule makes different decisions than the performance counter based schedule 23% of the time
21. ITKO Reduction (Simulation)

Benchmarks                   ITKO Reduction (%)   IPC Gain (%)
gzip.gzip.mcf.equake         54.0                 3.6
gzip.gzip.ammp.twolf         10.5                 4.5
gzip.mesa.mcf.equake         39.5                 3.0
gzip.mesa.equake.equake      47.0                 2.4
mesa.mesa.parser.twolf       10.3                 4.8
mcf.equake.perlbmk.twolf     1.7                  3.0
mcf.perlbmk.perlbmk.bzip2    13.0                 12.1
ammp.twolf.twolf.twolf       1.9                  6.1
Average                      22                   5
22. Contributions
- Interference analysis of cache accesses
- Introduce fine grained performance counters
- General purpose adaptable optimization
- Expose microarchitecture to OS
- Workload independent
- Tested on a real SMT machine
- Implemented on Linux kernel
- 2 way SMT core
23. Activity Based Scheduling Summary
- Prevents inter-thread interference
- Monitors cache access behavior
- Co-schedules jobs with expected low interference
- Adapts to phased workload behavior
- Performance improvements
  - Greater than 30% opportunity to improve the default Linux scheduling decisions
  - 22% reduction in inter-thread interference
  - 5% improvement in execution time
24. Thank You
25. Super Set Size
- What happens when we change the number of super sets used? Can we include a graph here?
- Slide 17 once we have the data
- May want to include the tree chart
26. Performance Challenges
- Difficult to detect interference
- Inter-thread interference is a multi-faceted problem
  - Occurs at low-level cache-line granularity
- Temporal variability in benchmark memory requests
- Dependent on thread pairings
- OS scheduling decisions affect performance
- Current systems
- Increased cache associativity
- Could use PMU register feedback
27. Activity Vectors
- Interface between OS and microarchitecture
- Divide cache into super sets
  - Access counters assigned to each super set
  - One vector bit corresponds to each counter
  - Bit is set when a threshold is exceeded
- Job scheduler
  - Compares the active vector with jobs in the run queue
  - Selects the job with the fewest common set bits
- (Figure: example vector pairs labeled "Expect no interference" and "Expect interference")
28. OS Scheduling
- OS scheduling is important when there are more jobs than contexts
- Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Proposed work
  - For each time interval, co-schedule jobs whose cache accesses are in different regions
29.
- Prevent jobs from running together during program phases where they exhibit high degrees of cache interference

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100