Architectural Support for Enhanced SMT Job Scheduling

1
Architectural Support for Enhanced SMT Job
Scheduling
  • Alex Settle
  • Joshua Kihm
  • Andy Janiszewski
  • Daniel A. Connors
  • University of Colorado at Boulder

2
Introduction
  • The shared memory system of SMT processors
    limits performance
  • Threads continuously compete for shared cache
    resources
  • Interference between threads causes workload
    slowdown
  • Detecting thread interference is a challenge for
    real systems
  • Low level cache monitoring
  • Difficult to exploit run-time data
  • Goal
  • Design the performance monitoring hardware
    required to capture thread interference
    information that can be exposed to the operating
    system scheduler to improve workload performance

3
Simultaneous Multithreading (SMT)
  • Concurrently executes instructions from different
    contexts
  • Thread level parallelism (TLP)
  • Improves instruction level parallelism (ILP)
  • Improves utilization of base processor
  • Intel Pentium 4 Xeon
  • 2-level cache hierarchy
  • Instruction trace cache
  • 8K data cache, 4-way associative, 64 bytes per
    line
  • 512K unified L2 cache, 8-way associative, 64
    bytes per line
  • 2-way SMT

4
Inter-thread Interference
  • Competition for shared resources
  • Memory system
  • Buses
  • Physical cache storage
  • Fetch and issue queues
  • Functional units
  • Threads evict cache data belonging to other
    threads
  • Increase in cache misses
  • Diminishes processor utilization
  • Inter-thread kick outs (ITKO)
  • Measured in simulator
  • Thread id of evicted cache line compared to new
    cache line
  • Increased ITKO leads to decrease in IPC
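The ITKO metric described above can be sketched in a few lines. This is an illustrative model, not the authors' simulator: a direct-mapped cache is assumed for brevity (the real caches are set-associative), each set records the thread id of its current line, and an eviction counts as an ITKO when the incoming thread differs from the owner.

```python
# Hypothetical sketch of ITKO counting (names are illustrative).
class CacheModel:
    def __init__(self, num_sets):
        self.owner = [None] * num_sets  # thread id owning each line
        self.itko_count = 0

    def access(self, thread_id, address):
        index = address % len(self.owner)  # direct-mapped index
        prev = self.owner[index]
        if prev is not None and prev != thread_id:
            self.itko_count += 1  # evicted another thread's line
        self.owner[index] = thread_id

cache = CacheModel(num_sets=8)
cache.access(0, 3)   # thread 0 fills set 3
cache.access(1, 11)  # thread 1 maps to the same set: one ITKO
print(cache.itko_count)  # 1
```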

5
ITKO-to-IPC Correlation: Level 3 Cache
  • IPC recorded for each phase interval
  • High ITKO rate leads to significant drop in IPC
  • Large variability in IPC over the workload
    lifetime due to cache interference

6
Related Work
  • Different levels of addressing the interference
    problem
  • Compiler
  • Kumar, Tullsen [MICRO '02]: procedure placement
    optimization
  • Workload fixed at compile time
  • J. Lo [MICRO '97]: tailoring compiler
    optimizations for SMT
  • Effects of traditional optimizations on SMT
    performance
  • Static optimizations
  • Operating system
  • Tullsen, Snavely [ASPLOS '00]: symbiotic job
    scheduling
  • Profile based, simulated OS and architecture
  • J. Lo [ISCA '98]: data cache address remapping
  • Workload dependent, database applications
  • Microarchitecture
  • Brown [MICRO '01]: issue policy feedback from
    the memory system
  • Improved fetch and issue resource allocation
  • Does not tackle inter-thread interference

7
Motivation
  • Improve performance by reducing inter-thread
    interference
  • Multi-faceted problem
  • Dependent on thread pairings
  • Occurs at low-level cache line granularity
  • Difficult to detect at runtime
  • OS scheduling decisions affect microarchitecture
    performance
  • Observed on both simulator and real system
  • Observation
  • Cache access footprints vary over program
    lifetimes
  • Accesses are concentrated in small cache regions

8
Concentration of L2-Cache Access
  • Cache access and miss footprints vary across
    program phases
  • Intervals with high access and miss rates are
    concentrated in small physical regions of the
    cache (green, red)
  • Current performance counters cannot detect that
    activity is concentrated in small regions

9
Cache Use Map Runtime Monitoring
  • Spatial locality (vertical axis)
  • Temporal locality (horizontal axis)

10
Benchmark Pairings: ITKO
  • gzip/mesa
  • mesa/equake
  • mesa/perl
  • gzip/equake
  • equake/perl
  • gzip/perl
  • Yellow represents very high interference
  • Interference is dependent on job mix

11
Performance-Guided Scheduling Theory
  • gzip
  • equake
  • perl
  • mesa

Total ITKOs (best static vs. dynamic):
2.91M vs. 2.55M; 2.91M vs. 2.90M; 7.30M vs. 6.70M
  • Each phase scheduler selects jobs with least
    interference

12
Solution to Inter-thread Interference
  • Predict future interference
  • Capture inter-thread interference behavior
  • Introduce cache line activity counters
  • Expose to operating system
  • Current schedulers use symmetric multiprocessing
    (SMP) algorithms for SMT processors
  • Activity based job scheduler
  • Schedule for minimal inter-thread interference

13
Activity Vectors
  • Interface between OS and microarchitecture
  • Divide cache into Super Sets
  • Access counters assigned to each super set
  • One vector bit corresponds to each counter
  • Bit is set when threshold is exceeded
  • Job scheduler
  • Compare active vector with jobs in run queue
  • Selects job with fewest common set bits

Thresholds established through static analysis:
global median across all benchmarks
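The vector construction and job selection steps above can be sketched as follows. The names, thresholds, and example vectors are illustrative; the fewest-common-set-bits selection rule is from the slide.

```python
# Hypothetical sketch of activity-vector scheduling (not the kernel
# implementation). Each job's cache activity is an integer bitmask:
# one bit per cache super set, set when that super set's access
# counter exceeded the threshold.

def make_vector(counters, threshold):
    """Collapse per-super-set access counters into an activity vector."""
    vec = 0
    for i, count in enumerate(counters):
        if count > threshold:
            vec |= 1 << i
    return vec

def select_job(running_vector, run_queue):
    """Pick the job whose vector shares the fewest set bits with the
    currently running thread's vector."""
    return min(run_queue,
               key=lambda job: bin(running_vector & job["vector"]).count("1"))

jobs = [
    {"name": "gzip",  "vector": 0b11100110},
    {"name": "mesa",  "vector": 0b00011001},
    {"name": "twolf", "vector": 0b11100111},
]
print(select_job(0b11100110, jobs)["name"])  # mesa: zero common bits
```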
14
Vector Prediction - Simulator
  • Use last vector to approximate next vector
  • Average accuracy 91%
  • Simple and effective

Activity Vector   Use Predictability (%)   Miss Predictability (%)
D-Cache           82.3                     93.6
I-Cache           94.9                     90.3
L2-Cache          93.8                     94.6
Average           90.3                     92.8
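A minimal sketch of how the last-vector predictor's accuracy could be measured, assuming accuracy is the fraction of intervals whose vector equals the previous interval's vector; the trace below is made up.

```python
# Hypothetical sketch: last-value prediction accuracy over a trace of
# per-interval activity vectors.

def last_value_accuracy(vector_trace):
    """Fraction of intervals correctly predicted by the prior vector."""
    if len(vector_trace) < 2:
        return 1.0  # nothing to predict
    hits = sum(1 for prev, cur in zip(vector_trace, vector_trace[1:])
               if prev == cur)
    return hits / (len(vector_trace) - 1)

trace = [0b1110, 0b1110, 0b1110, 0b0011, 0b0011]
print(last_value_accuracy(trace))  # 3 of 4 predictions correct: 0.75
```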
15
OS Scheduling Algorithm
[Diagram: two run queues (perlbmk, gzip, mesa, OS task; OS task, mcf,
ammp, parser, twolf), each job tagged with its activity vector,
feeding the two contexts (CPU 0, CPU 1) of the physical processor]
  • Weighted sum of vectors at each level
  • Vectors from L2 given highest weight
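The weighted-sum comparison can be sketched as follows. The slides only state that L2 vectors get the highest weight, so the specific weights here are illustrative.

```python
# Hypothetical sketch of the weighted interference score across cache
# levels (weights are assumptions, not the authors' values).

def popcount(x):
    return bin(x).count("1")

def interference_score(running, candidate, weights=(1, 1, 4)):
    """Weighted sum of common set bits over (D-cache, I-cache, L2)
    vectors; L2 carries the highest weight."""
    return sum(w * popcount(r & c)
               for w, r, c in zip(weights, running, candidate))

running   = (0b1100, 0b0011, 0b1111)  # (D-cache, I-cache, L2) vectors
candidate = (0b1000, 0b0000, 0b0001)
print(interference_score(running, candidate))  # 1*1 + 1*0 + 4*1 = 5
```

The scheduler would evaluate this score for every runnable job and pick the minimum, as on the previous slide.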

16
Activity Vector Procedure
  • Real system
  • Modified Linux kernel 2.6.0
  • Tested on Intel P4 Xeon Hyper-threading
  • Emulated activity counter registers
  • Generate vectors off-line
  • Valgrind memory simulator
  • Text file output
  • Copy vectors to kernel memory space
  • Activate vector scheduler
  • Time and run workloads

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100
  • Simulator
  • Vector hardware
  • Simulated OS

17
Workloads - Xeon
  • 8 Spec 2000 jobs per workload
  • Combination of integer and floating point
    applications
  • Run to completion in parallel with OS level jobs

WL1 gzip.vpr.gcc.mesa.art.mcf.equake.crafty
WL2 parser.gap.vortex.bzip2.vpr.mesa.crafty.mcf
WL3 mesa.twolf.vortex.gzip.gcc.art.crafty.vpr
WL4 gzip.twolf.vpr.bzip2.gcc.gap.mesa.parser
WL5 equake.crafty.mcf.parser.art.gap.mesa.vortex
WL6 twolf.bzip2.vortex.gap.parser.crafty.equake.mcf
18
Comparison of Scheduling Algorithms
  • Default Linux vs. Activity based
  • More than 30% of default scheduler decisions
    could have been improved by the activity-based
    scheduler

19
Activity Vector Performance - Xeon
20
Comparing Activity Vectors to Existing
Performance Counters - Simulation
Benchmark Mix Diff. (%)
164.gzip, 164.gzip, 181.mcf, 183.equake 0.0
164.gzip, 164.gzip, 188.ammp, 300.twolf 12.0
164.gzip, 177.mesa, 181.mcf, 183.equake 0.0
164.gzip, 177.mesa, 183.equake, 183.equake 0.0
164.gzip, 197.parser, 253.perlbmk, 300.twolf 44.4
177.mesa, 177.mesa, 197.parser, 300.twolf 11.1
177.mesa, 181.mcf, 253.perlbmk, 256.bzip2 0.0
177.mesa, 188.ammp, 253.perlbmk, 300.twolf 59.5
177.mesa, 197.parser, 197.parser, 256.bzip2 96.2
181.mcf, 181.mcf, 256.bzip2, 256.bzip2 0.0
181.mcf, 183.equake, 253.perlbmk, 300.twolf 4.0
181.mcf, 253.perlbmk, 253.perlbmk, 256.bzip2 0.0
183.equake, 188.ammp, 188.ammp, 256.bzip2 11.1
188.ammp, 188.ammp, 197.parser, 197.parser 96.2
188.ammp, 300.twolf, 300.twolf, 300.twolf 8.0
197.parser, 197.parser, 253.perlbmk, 256.bzip2 0.0
Average 22.5
On average, the activity-based schedule makes
different decisions than the performance counter
based schedule 23% of the time
21
ITKO Reduction - Simulation
Benchmarks ITKO Reduction (%) IPC Gain (%)
gzip.gzip.mcf.equake 54.0 3.6
gzip.gzip.ammp.twolf 10.5 4.5
gzip.mesa.mcf.equake 39.5 3.0
gzip.mesa.equake.equake 47.0 2.4
mesa.mesa.parser.twolf 10.3 4.8
mcf.equake.perlbmk.twolf 1.7 3.0
mcf.perlbmk.perlbmk.bzip2 13.0 12.1
ammp.twolf.twolf.twolf 1.9 6.1
Average 22.2 4.9
22
Contributions
  • Interference analysis of cache accesses
  • Introduce fine grained performance counters
  • General purpose adaptable optimization
  • Expose microarchitecture to OS
  • Workload independent
  • Tested on a real SMT machine
  • Implemented on Linux kernel
  • 2 way SMT core

23
Activity Based Scheduling Summary
  • Prevents inter-thread interference
  • Monitors cache access behavior
  • Co-schedules jobs with expected low interference
  • Adapts to phased workload behavior
  • Performance improvements
  • Greater than 30% opportunity to improve the
    default Linux scheduling decisions
  • 22% reduction in inter-thread interference
  • 5% improvement in execution time

24
Thank You
25
Super Set Size
  • What happens when we change the number of super
    sets used? Can we include a graph here?
  • Slide 17 once we have the data
  • May want to include the tree chart

26
Performance Challenges
  • Difficult to detect interference
  • Inter-thread interference is a multi-faceted
    problem
  • Occurs at low-level cache line granularity
  • Temporal variability in benchmark memory requests
  • Dependent on thread pairings
  • OS scheduling decisions affect performance
  • Current systems
  • Increased cache associativity
  • Could use PMU register feedback

27
Activity Vectors
  • Interface between OS and microarchitecture
  • Divide cache into Super Sets
  • Access counters assigned to each super set
  • One vector bit corresponds to each counter
  • Bit is set when threshold is exceeded
  • Job scheduler
  • Compare active vector with jobs in run queue
  • Selects job with fewest common set bits

[Figure: example vector pairs. Disjoint set bits: expect no
interference; overlapping set bits: expect interference]
28
OS Scheduling
  • OS scheduling important when more jobs than
    contexts
  • Current schedulers use symmetric multiprocessing
    (SMP) algorithms for SMT processors
  • Proposed work
  • For each time interval co-schedule jobs whose
    cache accesses are in different regions

29
  • Prevent jobs from running together during program
    phases where they exhibit high degrees of cache
    interference

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100