Transcript and Presenter's Notes

Title: A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems


1
A Bandwidth-aware Memory-subsystem Resource
Management using Non-invasive Resource Profilers
for Large CMP Systems
  • Dimitris Kaseridis1, Jeff Stuecheli1,2, Jian
    Chen1, Lizy K. John1
  • 1The University of Texas at Austin
  • 2IBM Austin

2
Motivation
  • Datacenters
  • Widely distributed
  • Multiple cores/sockets available
  • Hierarchical cost of communication
  • Core-to-core, socket-to-socket and board-to-board
  • Datacenter-like CMP multi-chip systems

3
Motivation
  • Virtualized systems are the norm
  • Multiple single-thread workloads per system
  • Decisions based on high-level scheduling
    algorithms
  • CMPs rely heavily on shared resources
  • Destructive interference
  • Unfairness
  • Lack of QoS
  • Limiting optimization to a single chip →
    suboptimal solutions
  • Explore opportunities within and beyond a single
    chip
  • Most important shared resources in CMPs
  • Last-level cache → capacity limits
  • Memory bandwidth → bandwidth limits
  • Capacity and Bandwidth partitioning as promising
    means of resource management

4
Motivation
  • Previous work focuses on a single chip
  • Trial-and-error
  •  + lower complexity
  •  − less efficient
  •  − slow to react
  • Artificial intelligence
  •  + better performance
  •  − black box → difficult to tune
  •  − high cost for accurate schemes
  • Predictive, evaluating multiple solutions
  •  + more accurate
  •  − higher complexity
  •  − high cost of wrong decisions (drastic changes
     to configurations)
  • Need for low-overhead, non-invasive monitoring
    that efficiently drives resource management
    algorithms

5
Outline
  • Applications Profiling Mechanisms
  • Cache Capacity
  • Memory Bandwidth
  • Bandwidth-aware Resource Management Scheme
  • Intra chip allocation algorithm
  • Inter chip resource management
  • Evaluation

6
Applications Profiling Mechanisms
7
Overview: Resource Requirements Profiling
  • Based on Mattson's Stack-distance Algorithm (MSA)
  • Non-invasive, predictive
  • Parallel monitoring on each core, assuming each
    core is assigned the whole LLC
  • Predicts cache misses for every possible partition
    assignment
  • Monitor/predict cache misses
  • Helps estimate ideal cache partition sizes
  • Memory bandwidth
  • Two components
  • Memory read traffic → cache fills
  • Memory write traffic → dirty write-back traffic
    from cache to main memory

8
LLC Miss Profiling
  • Mattson's stack algorithm (MSA)
  • Originally proposed to concurrently simulate many
    cache sizes
  • Based on the LRU inclusion property
  • Structure is a true LRU cache
  • The stack distance from MRU of each reference is
    recorded
  • Misses can be calculated for any fraction of the
    ways (see the sketch below)
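A minimal software sketch of the idea; Python stands in for the hardware structure, and all names are illustrative rather than taken from the paper:

    # Mattson's stack-distance algorithm: one true-LRU stack simulates
    # every cache size from 1 to max_ways ways in a single pass.
    class MSAProfiler:
        def __init__(self, max_ways):
            self.max_ways = max_ways
            self.stack = []                    # index 0 = MRU
            self.hits = [0] * (max_ways + 1)   # hits[d]: hits at stack distance d
            self.cold = 0                      # references deeper than max_ways

        def access(self, tag):
            if tag in self.stack:
                d = self.stack.index(tag) + 1  # stack distance from MRU (1-based)
                self.hits[d] += 1
                self.stack.remove(tag)
            else:
                self.cold += 1
                if len(self.stack) == self.max_ways:
                    self.stack.pop()           # evict the LRU line
            self.stack.insert(0, tag)          # referenced line becomes MRU

        def misses_for(self, w):
            # LRU inclusion property: a w-way cache hits exactly the
            # references whose stack distance is <= w.
            return self.cold + sum(self.hits[w + 1:])

One pass over an address trace then yields misses_for(w) for every candidate partition size at once, which is what the allocation algorithms later consume.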

9
MSA-based Bandwidth Profiling
  • Read traffic
  • Proportional to misses
  • Derived from the LLC miss profiling
  • Write traffic
  • Cache evictions of dirty lines sent back to
    memory
  • On write-back caches, traffic depends on the
    assigned cache partition
  • Hit to a dirty line
  • If the stack distance of the hit is bigger than
    the assigned capacity, the line was already sent
    to main memory → traffic
  • Otherwise it is a hit → no traffic
  • Only one write-back per store should be counted

(Figure: monitoring mechanism)
10
MSA-based Bandwidth Profiling
  • Additions to the profiler
  • Dirty bit: marks a dirty line
  • Dirty stack distance (register): largest stack
    distance at which a dirty line was accessed
  • Dirty_Counter: dirty accesses for every LRU
    distance
  • Rules
  • Track traffic for all cache allocations
  • The dirty bit is reset when a line is evicted from
    the whole monitored cache
  • Track the greatest stack distance at which each
    store is referenced before being evicted
  • Keep a counter (Dirty_Counter) of these maxima at
    eviction
  • Traffic estimation
  • For a cache-size projection that uses W ways:
    Traffic(W) = Σ Dirty_Counter_i,
    for i = W+1 … max_ways+1 (see the sketch below)
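A simplified software model of these additions, extending the MSAProfiler sketch from the LLC miss profiling slide (assumptions: one Dirty_Counter bucket per stack distance plus one for evictions from the whole monitored cache; repeated dirty periods of a line are handled only approximately):

    class BandwidthProfiler(MSAProfiler):
        def __init__(self, max_ways):
            super().__init__(max_ways)
            self.dirty_max = {}  # tag -> largest stack distance seen while dirty
            # dirty_counter[d]: dirty periods whose greatest reuse distance was
            # d; bucket max_ways + 1 counts dirty lines evicted from the
            # monitored cache (a write-back at every projected size).
            self.dirty_counter = [0] * (max_ways + 2)

        def access(self, tag, is_store=False):
            if tag in self.stack:
                if tag in self.dirty_max:
                    d = self.stack.index(tag) + 1
                    if is_store and self.dirty_max[tag] > 0:
                        # New store after a far reuse: caches smaller than that
                        # distance already wrote the line back, so close the
                        # old dirty period (one write-back per store).
                        self.dirty_counter[self.dirty_max[tag]] += 1
                        self.dirty_max[tag] = 0
                    else:
                        self.dirty_max[tag] = max(self.dirty_max[tag], d)
            elif len(self.stack) == self.max_ways and self.stack[-1] in self.dirty_max:
                # A dirty line leaves the whole monitored cache: its dirty bit
                # resets and the write-back counts for every projected size.
                self.dirty_counter[self.max_ways + 1] += 1
                del self.dirty_max[self.stack[-1]]
            super().access(tag)
            if is_store:
                self.dirty_max.setdefault(tag, 0)

        def write_traffic_for(self, w):
            # Traffic(W) = sum of Dirty_Counter_i for i = W+1 ... max_ways+1
            return sum(self.dirty_counter[w + 1:])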

11
MSA-based Bandwidth Example
12
Profiling Examples
(Figure: miss and write-traffic profiles for milc, calculix and gcc)
  • Different behavior in write traffic
  • milc: no cache fit; updates complex matrix
    structures
  • calculix: cache blocking of matrix and dot-product
    operations; data contained in cache →
    read-only traffic beyond the blocking size
  • gcc: code generation → small caches are
    read-dominated due to data tables → bigger caches
    are write-dominated due to code output
  • Accurate monitoring of memory bandwidth use is
    important

13
Hardware MSA implementation
(Figure: MSA monitoring structure, sets × ways)
  • A naïve implementation is prohibitive
  • Fully associative
  • Complete cache directory of the maximum cache size
    for every core on the CMP (total size)
  • H/W overhead reduction (sketched below)
  • Set sampling
  • Partial hashed tags: XOR tree of tag bits
  • Max capacity assignable per core
  • Sensitivity analysis (details in paper)
  • 1-in-32 set sampling
  • 11-bit partial hashed tags
  • 9/16 maximal capacity per core
  • LRU and dirty-stack registers → 6 bits
  • Hit and dirty counters → 32 bits
  • Overall 117 Kbits → 1.4% of an 8MB LLC
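The two main reduction techniques are straightforward to sketch; the bit widths match the numbers above, while the function names and address layout are illustrative:

    # Partial hashed tag: XOR-fold the full tag into 11 bits (an XOR tree in
    # hardware), shrinking tag storage and comparators at a small aliasing risk.
    def partial_hashed_tag(tag, bits=11):
        h = 0
        while tag:
            h ^= tag & ((1 << bits) - 1)  # XOR in the next 11-bit slice
            tag >>= bits
        return h

    # 1-in-32 set sampling: only references that map to a sampled set reach
    # the profiler (assumes 64-byte lines and a hypothetical 12-bit set index).
    def is_sampled(addr, line_bits=6, set_bits=12, sample=32):
        set_index = (addr >> line_bits) & ((1 << set_bits) - 1)
        return set_index % sample == 0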

14
Resource Management Scheme
15
Overall Scheme
  • Two-level approach
  • Intra-chip partitioning algorithm: assign LLC
    capacity on a single chip to minimize misses
  • Inter-chip partitioning algorithm: use the LLC
    assignments and memory bandwidth to find a better
    workload assignment over the whole system
  • Epochs of 100M instructions to re-evaluate and
    initiate migrations

16
Intra-chip Partitioning Algorithm
  • Based on marginal utility
  • Miss rate relative to capacity is non-linear and
    heavily workload-dependent
  • Dramatic miss-rate reduction once data structures
    become cache-contained
  • In practice
  • Iteratively assign cache to the cores that produce
    the most hits per unit of capacity (sketch below)
  • O(n²) complexity
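A minimal sketch of the greedy loop, assuming misses[c][w] is core c's profiled miss count when given w ways (taken straight from the MSA profilers; all names are illustrative):

    def intra_chip_partition(misses, total_ways, min_ways=1):
        # misses[c] must have total_ways + 1 entries (0 .. total_ways ways).
        n = len(misses)
        alloc = [min_ways] * n
        for _ in range(total_ways - n * min_ways):
            # Marginal utility of one more way for core c: the extra hits
            # (miss reduction) it would buy. Give the way to the best core.
            best = max(range(n),
                       key=lambda c: misses[c][alloc[c]] - misses[c][alloc[c] + 1])
            alloc[best] += 1
        return alloc

For example, four cores sharing 16 ways would call intra_chip_partition(misses, 16) and receive a per-core way allocation.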

17
Inter-chip Partitioning Algorithm
  • Workload-to-chip assignments become suboptimal as
    the execution phase of each workload changes
  • Two greedy algorithms looking across multiple
    chips
  • Cache capacity
  • Memory bandwidth
  • Cache capacity (sketch below)
  • Estimate the ideal capacity assignment assuming
    the whole cache belongs to the core
  • Find the worst-assigned core per chip
  • Find the chips with the largest surplus of ways
    (ways not significantly contributing to miss
    reduction)
  • Greedily swap workloads between chips
  • Bound swaps with a threshold to keep migrations
    down
  • Perform the finally selected migrations
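A toy, self-contained version of the capacity step; the data layout (per-core assigned vs. stand-alone ideal way counts) and the threshold value are assumptions for illustration:

    def capacity_rebalance(chips, threshold=2):
        # chips: list of chips; each chip is a list of cores, a core being
        # {"assigned": ways it received, "ideal": ways it wants stand-alone}
        migrations = 0
        for chip in chips:
            # Worst-served core on this chip: largest shortfall vs. its ideal.
            victim = max(chip, key=lambda c: c["ideal"] - c["assigned"])
            shortfall = victim["ideal"] - victim["assigned"]
            # Donor: the chip holding the most surplus ways overall.
            donor = max(chips, key=lambda ch: sum(max(0, c["assigned"] - c["ideal"])
                                                  for c in ch))
            if donor is chip or shortfall <= threshold:
                continue  # the threshold bounds swaps to limit migrations
            # Swap the victim with the donor's most over-provisioned core.
            spare = max(donor, key=lambda c: c["assigned"] - c["ideal"])
            vi, di = chip.index(victim), donor.index(spare)
            chip[vi], donor[di] = spare, victim
            migrations += 1
        return migrations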

18
Bandwidth Algorithm Example
  • Memory bandwidth
  • The algorithm finds combinations of low- and
    high-bandwidth-demand cores
  • Migrate high-demand jobs to low-bandwidth chips
  • Migrated jobs should have similar partition sizes
    (within 10% bounds)
  • Repeat until no chip is over-committed or no
    additional reduction is possible (sketch after the
    example below)

(Figure: example migrations C → D, A → B, C → B)
  • A = lbm, B = calculix
  • C = bwaves, D = zeusmp
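A self-contained sketch of the bandwidth loop; jobs are modeled as (bandwidth, partition_ways) pairs, and the per-chip budget and the swap choice are illustrative assumptions:

    def chip_demand(chip):
        return sum(bw for bw, _ in chip)      # chip = list of (bw, ways) jobs

    def bandwidth_rebalance(chips, budget, bound=0.10):
        swaps = 0
        while True:
            hi = max(chips, key=chip_demand)  # most over-committed chip
            lo = min(chips, key=chip_demand)  # chip with the most headroom
            if hi is lo or chip_demand(hi) <= budget:
                break                         # no over-committed chip left
            i = max(range(len(hi)), key=lambda k: hi[k][0])  # hottest job
            j = min(range(len(lo)), key=lambda k: lo[k][0])  # coolest job
            # Swapped jobs must have similar partitions (the 10% bound) and
            # the swap must actually reduce the worst chip's demand.
            similar = abs(hi[i][1] - lo[j][1]) <= bound * max(hi[i][1], lo[j][1])
            helps = max(chip_demand(hi) - hi[i][0] + lo[j][0],
                        chip_demand(lo) - lo[j][0] + hi[i][0]) < chip_demand(hi)
            if not (similar and helps):
                break                         # no additional reduction possible
            hi[i], lo[j] = lo[j], hi[i]       # migrate the pair of jobs
            swaps += 1
        return swaps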

19
Evaluation
20
Methodology
  • Workloads
  • 64 cores → 8 chips of 8-core CMPs running mixes
    of the 29 SPEC CPU2006 workloads
  • Each mix: 8 benchmarks, 30 million instructions
  • High level: Monte Carlo
  • Compare the intra- and inter-chip algorithms to an
    equal-partitions assignment
  • Show the algorithm works for many cases /
    configurations
  • 1000 experiments
  • Detailed simulation
  • Cycle-accurate / full-system
  • Simics + GEMS CMP-DNUCA with the profiling
    mechanisms and cache partitions
  • Comparison
  • Utility-based Cache Partitioning (UCP), modified
    for our DNUCA CMP
  • Only last-level cache misses
  • Uses marginal utility on a single chip to assign
    capacity

21
High level: LLC misses
(Figure: relative miss rate; relative reduction of BW-aware over UCP)
  • 25.7% reduction over simple equal partitions
  • Average 7.9% reduction over UCP
  • Significant reductions with only the 1.4% overhead
    of monitoring mechanisms that UCP already requires
  • As LLC size increases → more surplus ways → more
    opportunities for inter-chip migrations

22
High level: Memory Bandwidth
(Figure: relative bandwidth reduction of BW-aware over UCP)
  • UCP's reductions are due to miss-rate reduction:
    ~19% over equal partitions
  • Average 18% reduction over UCP and 36% over equal
    partitions
  • Larger wins with smaller caches due to contention
  • As the number of chips increases → more
    opportunities for inter-chip migration

23
Full system case studies
  • Case 1
  • 8.6% IPC improvement and 15.3% MPKI reduction
  • Chip 4: bwaves, mcf ↔ Chip 7: povray, calculix
  • Case 2
  • 8.5% IPC improvement and 11% MPKI reduction
  • Chip 7 over-committed in memory bandwidth
  • bwaves (Chip 7) ↔ zeusmp (Chip 2)
  • gcc (Chip 7) ↔ gamess (Chip 6)

24
Conclusions
  • As the number of cores in a system increases →
    resource contention becomes the dominating factor
  • Memory bandwidth is a significant factor in system
    performance and should always be considered in
    memory resource management
  • The bandwidth-aware scheme achieved an 18%
    reduction in memory bandwidth and 8% in miss rate
    over existing partitioning techniques, and more
    than 25% over no partitioning
  • The overall improvement justifies the only 1.4%
    overhead of the proposed monitoring mechanisms,
    much of which already exists in predictive
    single-chip schemes

25
Thank You. Questions?
Laboratory for Computer Architecture
The University of Texas at Austin
26
Backup Slides
27
  • Misses: absolute and effective error

28
  • Bandwidth: absolute and effective error

29
  • Overhead analysis