Title: Architectural Support for Enhanced SMT Job Scheduling
1. Architectural Support for Enhanced SMT Job Scheduling
- Alex Settle
- Joshua Kihm
- Andy Janiszewski
- Daniel A. Connors
- University of Colorado at Boulder
2. Introduction
- Shared memory systems of SMT processors limit performance
  - Threads continuously compete for shared cache resources
  - Interference between threads causes workload slowdown
- Detecting thread interference is a challenge for real systems
  - Low-level cache monitoring
  - Difficult to exploit run-time data
- Goal
  - Design the performance monitoring hardware required to capture thread interference information that can be exposed to the operating system scheduler to improve workload performance
3. Simultaneous Multithreading (SMT)
- Concurrently executes instructions from different contexts
  - Thread-level parallelism (TLP)
  - Improves instruction-level parallelism (ILP)
  - Improves utilization of base processor
- Intel Pentium 4 Xeon
  - 2-level cache hierarchy
  - Instruction trace cache
  - 8 KB data cache, 4-way associative, 64 bytes per line
  - 512 KB unified L2 cache, 8-way associative, 64 bytes per line
  - 2-way SMT
4. Inter-thread Interference
- Competition for shared resources
  - Memory system
  - Buses
  - Physical cache storage
  - Fetch and issue queues
  - Functional units
- Threads evict cache data belonging to other threads
  - Increase in cache misses
  - Diminishes processor utilization
- Inter-thread kick-outs (ITKO)
  - Measured in simulator
  - Thread ID of the evicted cache line compared to that of the new cache line
  - Increased ITKO leads to decreased IPC
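The ITKO count described above can be sketched with a small set-associative cache model: on every miss that forces an eviction, the victim line's thread ID is compared with the incoming line's. This is an illustrative sketch, not the authors' simulator code; the `Cache` class and its field names are assumptions.

```python
# Illustrative model of counting inter-thread kick-outs (ITKOs) in a
# set-associative cache shared by SMT threads. Names are hypothetical.

class Cache:
    def __init__(self, num_sets, ways):
        # Each set holds up to `ways` (tag, thread_id) entries, MRU-first.
        self.sets = [[] for _ in range(num_sets)]
        self.ways = ways
        self.itko_count = 0  # evictions where the victim belonged to another thread

    def access(self, set_index, tag, thread_id):
        """Returns True on a hit, False on a miss (allocating the line)."""
        lines = self.sets[set_index]
        for i, (t, tid) in enumerate(lines):
            if t == tag:
                lines.insert(0, lines.pop(i))  # hit: move to MRU position
                return True
        # Miss: evict the LRU line if the set is full.
        if len(lines) == self.ways:
            _, victim_tid = lines.pop()
            if victim_tid != thread_id:
                # Victim belonged to a different thread: inter-thread kick-out.
                self.itko_count += 1
        lines.insert(0, (tag, thread_id))
        return False
```

Tracking `itko_count` per phase interval is what exposes the correlation with IPC shown on the next slide: phases with many cross-thread evictions are the ones where co-scheduled jobs slow each other down.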
5. ITKO to IPC Correlation, Level 3 Cache
- IPC recorded for each phase interval
- High ITKO rate leads to a significant drop in IPC
- Large variability in IPC over workload lifetime due to cache interference
6. Related Work
- Interference problem addressed at different levels
- Compiler
  - Kumar, Tullsen (MICRO '02): procedure placement optimization; workload fixed at compile time
  - J. Lo (MICRO '97): tailoring compiler optimizations for SMT; effects of traditional optimizations on SMT performance; static optimizations
- Operating system
  - Tullsen, Snavely (ASPLOS '00): symbiotic job scheduling; profile based, with simulated OS and architecture
  - J. Lo (ISCA '98): data cache address remapping; workload dependent, database applications
- Microarchitecture
  - Brown (MICRO '01): issue policy feedback from the memory system; improved fetch and issue resource allocation; does not tackle inter-thread interference
7. Motivation
- Improve performance by reducing inter-thread interference
- Multi-faceted problem
  - Dependent on thread pairings
  - Occurs at low-level cache-line granularity
  - Difficult to detect at runtime
- OS scheduling decisions affect microarchitecture performance
  - Observed on both simulator and real system
- Observation
  - Cache access footprints vary over program lifetimes
  - Accesses are concentrated in small cache regions
8. Concentration of L2-Cache Access
- Cache access and miss footprints vary across program phases
- Intervals with high access and miss rates are concentrated in small physical regions of the cache (green, red)
- Current performance counters cannot detect that activity is concentrated in small regions
9. Cache Use Map Runtime Monitoring
- Spatial locality (vertical axis)
- Temporal locality (horizontal axis)
10. Benchmark Pairings: ITKO
- Yellow represents very high interference
- Interference is dependent on job mix
11. Performance Guided Scheduling Theory
- (Figure: total ITKOs per workload, best static vs. dynamic schedule: 2.91M vs. 2.55M; 2.91M vs. 2.55M; 2.91M vs. 2.90M; 7.30M vs. 6.70M)
- Each phase, the scheduler selects the jobs with the least interference
12. Solution to Inter-thread Interference
- Predict future interference
- Capture inter-thread interference behavior
  - Introduce cache line activity counters
  - Expose them to the operating system
- Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Activity-based job scheduler
  - Schedule for minimal inter-thread interference
13. Activity Vectors
- Interface between OS and microarchitecture
- Divide cache into super sets
  - Access counters assigned to each super set
  - One vector bit corresponds to each counter
  - Bit is set when a threshold is exceeded
- Job scheduler
  - Compares the active vector with jobs in the run queue
  - Selects the job with the fewest common set bits
- Thresholds established through static analysis (global median across all benchmarks)
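A minimal sketch of the mechanism described above: per-super-set counters are thresholded into a bit vector, and the scheduler picks the run-queue job whose vector shares the fewest set bits with the running job's. The function names and the AND-then-popcount comparison are my reading of the slide, not the authors' implementation.

```python
# Sketch: activity vector construction and least-interference job selection.
# Structure and names are illustrative assumptions based on the slide text.

def make_vector(counters, threshold):
    """One bit per super set; bit i is set when counter i exceeds the threshold."""
    vec = 0
    for i, count in enumerate(counters):
        if count > threshold:
            vec |= 1 << i
    return vec

def common_bits(v1, v2):
    """Number of super sets active in both vectors (expected conflict regions)."""
    return bin(v1 & v2).count("1")

def select_job(active_vector, run_queue):
    """run_queue: list of (job_name, predicted_vector) pairs. Returns the job
    whose vector overlaps least with the currently running job's vector."""
    return min(run_queue, key=lambda job: common_bits(active_vector, job[1]))
```

For example, `make_vector([120, 3, 90, 0], threshold=50)` yields `0b101`: only super sets 0 and 2 saw enough accesses to set their bits, so a candidate concentrated in super sets 1 and 3 would be a good co-schedule partner.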
14. Vector Prediction (Simulator)
- Use last vector to approximate next vector
- Average accuracy: 91%
- Simple and effective

Activity Vector   Use Predictability (%)   Miss Predictability (%)
D-Cache           82.3                     93.6
I-Cache           94.9                     90.3
L2-Cache          93.8                     94.6
Average           90.3                     92.8
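The last-value scheme above amounts to predicting that phase t's vector equals phase t-1's. Scoring that prediction as the fraction of matching bits could look like the sketch below; this per-bit methodology is an assumption for illustration, not necessarily how the paper's predictability numbers were computed.

```python
# Sketch: last-value activity vector prediction and its per-bit accuracy.
# The accuracy metric here is an illustrative assumption.

def last_value_accuracy(vector_trace, width=8):
    """vector_trace: list of integer activity vectors, one per phase.
    The prediction for each phase is the previous phase's vector; returns
    the fraction of vector bits predicted correctly across the trace."""
    matched = total = 0
    for prev, cur in zip(vector_trace, vector_trace[1:]):
        # XOR isolates the bits where prediction and outcome disagree.
        mispredicted = bin(prev ^ cur).count("1")
        matched += width - mispredicted
        total += width
    return matched / total if total else 1.0
```

Because activity footprints are stable within a program phase and only shift at phase boundaries, this trivial predictor is accurate enough (around 91% on average per the slide) to drive scheduling decisions.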
15. OS Scheduling Algorithm
- (Figure: run queues 0 and 1 hold OS tasks and jobs such as perlbmk, gzip, mesa, mcf, ammp, and parser, each with its vectors; twolf's vector is compared against the jobs running on CPU 0 and CPU 1 of the physical processor)
- Weighted sum of vectors at each cache level
- Vectors from L2 given the highest weight
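The weighted combination of per-level vectors could look like the sketch below. The slide only says that L2 vectors get the highest weight, so the specific weight values here are made-up placeholders.

```python
# Sketch: score a candidate job against the running job using a weighted sum
# of per-cache-level activity vector overlaps. Weight values are placeholders;
# the slide specifies only that L2 is weighted most heavily.

WEIGHTS = {"dcache": 1, "icache": 1, "l2": 4}

def overlap(v1, v2):
    """Count of super sets active in both vectors."""
    return bin(v1 & v2).count("1")

def interference_score(running_vectors, candidate_vectors):
    """Each argument maps a cache level to one job's activity vector."""
    return sum(WEIGHTS[level] * overlap(running_vectors[level],
                                        candidate_vectors[level])
               for level in WEIGHTS)

def pick_next(running_vectors, run_queue):
    """run_queue: list of (job_name, {level: vector}). Lowest score wins."""
    return min(run_queue,
               key=lambda job: interference_score(running_vectors, job[1]))
```

Weighting L2 most heavily reflects the cost asymmetry: an L2 conflict miss costs a memory access, while an L1 conflict usually costs only an L2 hit.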
16. Activity Vector Procedure
- Real system
  - Modified Linux kernel 2.6.0
  - Tested on an Intel P4 Xeon with Hyper-Threading
  - Emulated activity counter registers
    - Generate vectors off-line with the Valgrind memory simulator
    - Text file output
    - Copy vectors to kernel memory space
  - Activate vector scheduler
  - Time and run workloads
- Simulator
  - Vector hardware
  - Simulated OS

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100
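The off-line step (memory trace in, per-phase bit strings out, as in the table above) might look roughly like this. The trace format, phase length, threshold, and cache geometry here are all assumptions for illustration; the authors used Valgrind's memory simulation to produce the real traces.

```python
# Sketch: turn a memory-reference trace into per-phase activity vectors,
# mimicking the off-line vector generation flow. Geometry, phase length,
# and threshold are illustrative assumptions.

NUM_SUPER_SETS = 8
LINE_SIZE = 64             # bytes per cache line (matches the Xeon's caches)
NUM_SETS = 1024            # assumed total cache sets
SETS_PER_SUPER = NUM_SETS // NUM_SUPER_SETS
PHASE_LEN = 100_000        # assumed references per phase interval
THRESHOLD = 1_000          # assumed accesses needed to set a vector bit

def vectors_from_trace(addresses):
    """addresses: iterable of byte addresses from a memory trace. Yields one
    bit string per phase, in the text-file format copied into kernel memory."""
    counters = [0] * NUM_SUPER_SETS
    for n, addr in enumerate(addresses, 1):
        cache_set = (addr // LINE_SIZE) % NUM_SETS
        counters[cache_set // SETS_PER_SUPER] += 1
        if n % PHASE_LEN == 0:
            yield "".join("1" if c > THRESHOLD else "0" for c in counters)
            counters = [0] * NUM_SUPER_SETS   # reset for the next phase
```

Each emitted line corresponds to one row of the phase/vector table above; the kernel scheduler then indexes these rows by the job's current phase instead of reading hardware counters.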
17. Workloads (Xeon)
- 8 SPEC CPU2000 jobs per workload
- Combination of integer and floating-point applications
- Run to completion in parallel with OS-level jobs

WL1: gzip.vpr.gcc.mesa.art.mcf.equake.crafty
WL2: parser.gap.vortex.bzip2.vpr.mesa.crafty.mcf
WL3: mesa.twolf.vortex.gzip.gcc.art.crafty.vpr
WL4: gzip.twolf.vpr.bzip2.gcc.gap.mesa.parser
WL5: equake.crafty.mcf.parser.art.gap.mesa.vortex
WL6: twolf.bzip2.vortex.gap.parser.crafty.equake.mcf
18. Comparison of Scheduling Algorithms
- Default Linux vs. activity-based
- More than 30% of the default scheduler's decisions could have been improved by the activity-based scheduler
19. Activity Vector Performance (Xeon)
20. Comparing Activity Vectors to Existing Performance Counters (Simulation)

Benchmark Mix                                   Diff. (%)
164.gzip, 164.gzip, 181.mcf, 183.equake         0.0
164.gzip, 164.gzip, 188.ammp, 300.twolf         12.0
164.gzip, 177.mesa, 181.mcf, 183.equake         0.0
164.gzip, 177.mesa, 183.equake, 183.equake      0.0
164.gzip, 197.parser, 253.perlbmk, 300.twolf    44.4
177.mesa, 177.mesa, 197.parser, 300.twolf       11.1
177.mesa, 181.mcf, 253.perlbmk, 256.bzip2       0.0
177.mesa, 188.ammp, 253.perlbmk, 300.twolf      59.5
177.mesa, 197.parser, 197.parser, 256.bzip2     96.2
181.mcf, 181.mcf, 256.bzip2, 256.bzip2          0.0
181.mcf, 183.equake, 253.perlbmk, 300.twolf     4.0
181.mcf, 253.perlbmk, 253.perlbmk, 256.bzip2    0.0
183.equake, 188.ammp, 188.ammp, 256.bzip2       11.1
188.ammp, 188.ammp, 197.parser, 197.parser      96.2
188.ammp, 300.twolf, 300.twolf, 300.twolf       8.0
197.parser, 197.parser, 253.perlbmk, 256.bzip2  0.0
Average                                         22.5

- On average, the activity-based schedule makes different decisions than the performance counter based schedule 23% of the time
21. ITKO Reduction (Simulation)

Benchmarks                   ITKO Reduction (%)   IPC Gain (%)
gzip.gzip.mcf.equake         54.0                 3.6
gzip.gzip.ammp.twolf         10.5                 4.5
gzip.mesa.mcf.equake         39.5                 3.0
gzip.mesa.equake.equake      47.0                 2.4
mesa.mesa.parser.twolf       10.3                 4.8
mcf.equake.perlbmk.twolf     1.7                  3.0
mcf.perlbmk.perlbmk.bzip2    13.0                 12.1
ammp.twolf.twolf.twolf       1.9                  6.1
Average                      22                   5
22. Contributions
- Interference analysis of cache accesses
- Introduce fine grained performance counters
- General purpose adaptable optimization
- Expose microarchitecture to OS
- Workload independent
- Tested on a real SMT machine
- Implemented on Linux kernel
- 2 way SMT core
23. Activity Based Scheduling Summary
- Prevents inter-thread interference
- Monitors cache access behavior
- Co-schedules jobs with expected low interference
- Adapts to phased workload behavior
- Performance improvements
  - Greater than 30% opportunity to improve the default Linux scheduling decisions
  - 22% reduction in inter-thread interference
  - 5% improvement in execution time
24. Thank You
25. Super Set Size
- What happens when we change the number of super sets used? Can we include a graph here?
- Slide 17 once we have the data
- May want to include the tree chart
26. Performance Challenges
- Difficult to detect interference
- Inter-thread interference is a multi-faceted problem
  - Occurs at low-level cache-line granularity
- Temporal variability in benchmark memory requests
- Dependent on thread pairings
- OS scheduling decisions affect performance
- Current systems
- Increased cache associativity
- Could use PMU register feedback
27. Activity Vectors
- Interface between OS and microarchitecture
- Divide cache into super sets
  - Access counters assigned to each super set
  - One vector bit corresponds to each counter
  - Bit is set when a threshold is exceeded
- Job scheduler
  - Compares the active vector with jobs in the run queue
  - Selects the job with the fewest common set bits
- (Figure: example vector pairs labeled "Expect no interference" and "Expect interference")
28. OS Scheduling
- OS scheduling is important when there are more jobs than contexts
- Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Proposed work
  - For each time interval, co-schedule jobs whose cache accesses are in different regions
29.
- Prevent jobs from running together during program phases where they exhibit high degrees of cache interference

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100