Title: A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems
Slide 1: Title
- Dimitris Kaseridis (1), Jeff Stuecheli (1,2), Jian Chen (1), Lizy K. John (1)
- (1) The University of Texas at Austin
- (2) IBM Austin
Slide 2: Motivation
- Datacenters
- Widely spread
- Multiple cores/sockets available
- Hierarchical cost of communication
- Core-to-core, socket-to-socket, and board-to-board
- Datacenter-like CMP multi-chip systems
Slide 3: Motivation
- Virtualization is the norm
- Multiple single-thread workloads per system
- Decisions based on high-level scheduling algorithms
- CMPs rely heavily on shared resources
- Destructive interference
- Unfairness
- Lack of QoS
- Limiting optimization to a single chip → suboptimal solutions
- Explore opportunities both within and beyond a single chip
- Most important shared resources in CMPs:
- Last-level cache → capacity limits
- Memory bandwidth → bandwidth limits
- Capacity and bandwidth partitioning as promising means of resource management
Slide 4: Motivation
[Figure: equal partitions baseline]
- Previous work focuses on a single chip
- Trial-and-error: (+) lower complexity, (-) less efficient, (-) slow to react
- Artificial intelligence: (+) better performance, (-) black box → difficult to tune, (-) high cost for accurate schemes
- Predictive, evaluating multiple solutions: (+) more accurate, (-) higher complexity, (-) high cost of wrong decisions (drastic changes to configurations)
- Need for low-overhead, non-invasive monitoring that efficiently drives resource-management algorithms
Slide 5: Outline
- Application Profiling Mechanisms
- Cache Capacity
- Memory Bandwidth
- Bandwidth-aware Resource Management Scheme
- Intra-chip allocation algorithm
- Inter-chip resource management
- Evaluation
Slide 6: Application Profiling Mechanisms
Slide 7: Overview of Resource-Requirements Profiling
- Based on Mattson's Stack-distance Algorithm (MSA)
- Non-invasive, predictive
- Parallel monitoring on each core, assuming each core is assigned the whole LLC
- Cache misses for all partition assignments
- Monitor/predict cache misses
- Helps estimate ideal cache-partition sizes
- Memory bandwidth has two components:
- Memory read traffic → cache fills
- Memory write traffic → dirty write-back traffic from cache to main memory
Slide 8: LLC Misses Profiling
- Mattson stack algorithm (MSA)
- Originally proposed to concurrently simulate many cache sizes
- Based on the LRU inclusion property
- Structure is a true LRU cache
- The stack distance from the MRU position of each reference is recorded
- Misses can be calculated for any fraction of the ways
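The stack-distance bookkeeping on this slide can be sketched in software. Below is a minimal single-set model (real monitors use set sampling and hardware LRU state; the function names are illustrative):

```python
from collections import defaultdict

def msa_profile(trace, max_ways):
    """Single-set Mattson stack-distance (MSA) profiler using a true
    LRU stack. Returns a histogram of hit stack distances plus the
    number of misses in the monitored (max-size) cache."""
    stack = []                     # index 0 = MRU
    hist = defaultdict(int)        # hist[d] = hits at stack distance d
    misses = 0
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)  # distance from MRU (0-based)
            hist[d] += 1
            stack.remove(addr)
        else:
            misses += 1
        stack.insert(0, addr)
        if len(stack) > max_ways:
            stack.pop()            # evict the true LRU line
    return hist, misses

def misses_for(hist, total_refs, ways):
    """LRU inclusion property: with only `ways` ways assigned, every
    reference whose stack distance is >= `ways` becomes a miss."""
    return total_refs - sum(hist[d] for d in range(ways))
```

One pass over the trace yields miss counts for every candidate way count, which is what makes the monitor non-invasive and predictive.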
Slide 9: MSA-based Bandwidth Profiling
- Read traffic
- proportional to misses
- derived from the LLC-miss profiling
- Write traffic
- cache evictions of dirty lines are sent back to memory
- on write-back caches, the traffic depends on the assigned cache partition
- hit to a dirty line: if the stack distance of the hit is larger than the assigned capacity, the line was sent to main memory → traffic; otherwise it is a hit → no traffic
- only one write-back per store should be counted
[Figure: monitoring mechanism]
Slide 10: MSA-based Bandwidth Profiling
- Additions to the profiler:
- Dirty bit: marks a dirty line
- Dirty stack distance (register): largest distance at which a dirty line was accessed
- Dirty_Counter: dirty accesses for every LRU distance
- Rules:
- Track traffic for all cache allocations
- The dirty bit is reset when the line is evicted from the whole monitored cache
- Track the greatest stack distance at which each store is referenced before eviction
- Keep a counter (Dirty_Counter) of these maxima at eviction
- Traffic estimation:
- For a cache-size projection that uses W ways: write traffic = sum of Dirty_Counter_i for i = W + 1 ... max_ways + 1
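For reference, the write-back traffic that the Dirty_Counter mechanism estimates in a single pass can also be computed directly by simulating each candidate partition size. The sketch below (names are illustrative, not the paper's hardware) counts one memory write per dirty eviction, and a line reloaded after eviction comes back clean:

```python
def writeback_traffic(trace, ways_options):
    """Direct model of write-back traffic in a W-way true-LRU
    write-back cache, for each candidate W.
    trace: list of (addr, is_store) pairs."""
    results = {}
    for W in ways_options:
        stack, dirty, wb = [], set(), 0
        for addr, is_store in trace:
            if addr in stack:
                stack.remove(addr)
            stack.insert(0, addr)          # make MRU
            if is_store:
                dirty.add(addr)            # the store dirties the line
            if len(stack) > W:
                victim = stack.pop()       # evict true LRU line
                if victim in dirty:
                    dirty.discard(victim)
                    wb += 1                # dirty write-back to memory
        results[W] = wb
    return results
```

This is O(|ways| x |trace|); the slide's counter scheme approximates the same per-partition traffic in one monitored pass.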
Slide 11: MSA-based Bandwidth Example
Slide 12: Profiling Examples
[Figures: bandwidth profiles for milc, calculix, gcc]
- Different behavior in write traffic:
- milc: no fit; updates complex matrix structures
- calculix: cache blocking of matrix and dot-product operations; data is contained in the cache → read-only traffic beyond the blocking size
- gcc: code generation → small caches are read-dominated due to data tables; bigger caches are write-dominated due to code output
- Accurate monitoring of memory-bandwidth use is important
Slide 13: Hardware MSA Implementation
[Figure: MSA monitor structure (sets x ways)]
- The naïve algorithm is prohibitive:
- fully associative
- a complete cache directory of the maximum cache size for every core on the CMP (total size)
- H/W overhead reduction:
- set sampling
- partial hashed tags (XOR tree of tag bits)
- maximum capacity assignable per core
- Sensitivity analysis (details in paper):
- 1-in-32 set sampling
- 11-bit partial hashed tags
- 9/16 maximal capacity
- LRU and dirty-stack registers → 6 bits
- hit and dirty counters → 32 bits
- overall 117 Kbits → 1.4% of an 8MB LLC
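A partial hashed tag can be modeled as XOR-folding the full tag into a short hash, which is an XOR tree in hardware. The 11-bit width matches the slide's design point; the function name is illustrative:

```python
def partial_hashed_tag(tag, out_bits=11):
    """Fold a full cache tag into a short partial tag by XOR-ing
    out_bits-wide slices (an XOR tree in hardware). Distinct tags may
    alias, so the monitor trades a small sampling error for far less
    tag storage."""
    mask = (1 << out_bits) - 1
    h = 0
    while tag:
        h ^= tag & mask   # fold in the next out_bits-wide slice
        tag >>= out_bits
    return h
```

Aliasing only perturbs the stack-distance statistics slightly, which is acceptable for a profiler that guides allocation rather than correctness.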
Slide 14: Resource Management Scheme
Slide 15: Overall Scheme
- Two-level approach:
- Intra-chip partitioning algorithm: assign LLC capacity on a single chip to minimize misses
- Inter-chip partitioning algorithm: use the LLC assignments and memory bandwidth to find a better workload assignment over the whole system
- Epochs of 100M instructions to re-evaluate and initiate migrations
Slide 16: Intra-chip Partitioning Algorithm
- Based on marginal utility
- Miss rate relative to capacity is non-linear and heavily workload-dependent
- Dramatic miss-rate reduction once data structures become cache-contained
- In practice:
- iteratively assign cache to the cores that produce the most hits per unit of capacity
- O(n^2) complexity
[Figure: equal partitions baseline]
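The greedy marginal-utility loop can be sketched as follows, taking the per-core miss curves from the MSA profilers (a lookahead-1 sketch under assumed data layout; variable names are not the paper's):

```python
def marginal_utility_partition(miss_curves, total_ways, min_ways=1):
    """Greedy marginal-utility allocation of LLC ways.
    miss_curves[c][w] = predicted misses for core c with w ways
    (from the MSA profilers). Each round, the next way goes to the
    core whose misses drop the most for one added way."""
    n = len(miss_curves)
    alloc = [min_ways] * n
    for _ in range(total_ways - n * min_ways):
        best_core, best_gain = None, -1
        for c in range(n):
            if alloc[c] + 1 >= len(miss_curves[c]):
                continue              # curve exhausted for this core
            gain = miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1]
            if gain > best_gain:
                best_core, best_gain = c, gain
        if best_core is None:
            break
        alloc[best_core] += 1
    return alloc
```

Each of the O(n) rounds scans all n cores, giving the O(n^2) complexity noted on the slide.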
Slide 17: Inter-chip Partitioning Algorithm
- Workload-to-chip assignments become suboptimal as the execution phase of each workload changes
- Two greedy algorithms looking over multiple chips:
- cache capacity
- memory bandwidth
- Cache capacity:
- estimate the ideal capacity assignment assuming the whole cache belongs to the core
- find the worst-assigned core on each chip
- find the chips with the largest surplus of ways (ways not significantly contributing to miss reduction)
- greedily swap workloads between chips
- bound swaps with a threshold to keep migrations down
- perform the finally selected migrations
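One round of the capacity algorithm's candidate selection might look like this. This is an illustrative sketch, not the paper's exact pseudocode; the data layout and names are assumptions:

```python
def capacity_swap_candidates(chips, threshold):
    """Find the most capacity-starved (chip, core) and the donor chip
    with the largest surplus of ways. chips: list of dicts mapping
    core -> (ideal_ways, assigned_ways). Returns None when the best
    shortfall is below `threshold` (bounding migrations) or when the
    donor is the starved core's own chip."""
    chip_id, core, shortfall = max(
        ((cid, c, ideal - assigned)
         for cid, chip in enumerate(chips)
         for c, (ideal, assigned) in chip.items()),
        key=lambda t: t[2])
    surplus = [sum(max(assigned - ideal, 0)
                   for ideal, assigned in chip.values())
               for chip in chips]
    donor = max(range(len(chips)), key=surplus.__getitem__)
    if shortfall < threshold or donor == chip_id:
        return None
    return (chip_id, core), donor
```

The threshold check implements the slide's bound that keeps the number of migrations down.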
Slide 18: Bandwidth Algorithm Example
- Memory bandwidth:
- the algorithm finds combinations of cores with low/high bandwidth demands
- migrate high-bandwidth jobs to low-bandwidth chips
- migrated jobs should have similar partition sizes (within 10% bounds)
- repeat until no chip is over-committed or no additional reduction is possible
- Example swaps: C ↔ D, A ↔ B, C ↔ B
- A = lbm, B = calculix, C = bwaves, D = zeusmp
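The migration loop on this slide can be sketched as a greedy balancer. This is an illustrative sketch; the data layout, names, and the exact stopping rule are assumptions:

```python
def plan_bw_migrations(chips, bw_limit, bound=0.10):
    """Greedy bandwidth balancing. chips: list of dicts mapping
    job -> (bandwidth, ways). Swap the hungriest job on the most
    over-committed chip with a low-bandwidth job of similar partition
    size (within `bound`, the slide's 10% limit) on the least-loaded
    chip; stop when no chip exceeds bw_limit or no swap helps."""
    swaps, moved = [], set()
    while True:
        loads = [sum(bw for bw, _ in c.values()) for c in chips]
        hot = max(range(len(chips)), key=loads.__getitem__)
        cold = min(range(len(chips)), key=loads.__getitem__)
        if loads[hot] <= bw_limit or hot == cold:
            break                      # nothing over-committed
        hot_jobs = [j for j in chips[hot] if j not in moved]
        if not hot_jobs:
            break                      # no additional reduction possible
        job_h = max(hot_jobs, key=lambda j: chips[hot][j][0])
        bw_h, ways_h = chips[hot][job_h]
        cands = [j for j, (bw, w) in chips[cold].items()
                 if j not in moved and bw < bw_h
                 and abs(w - ways_h) <= bound * max(ways_h, 1)]
        if not cands:
            break                      # no similar-sized partner
        job_c = min(cands, key=lambda j: chips[cold][j][0])
        chips[hot][job_c] = chips[cold].pop(job_c)
        chips[cold][job_h] = chips[hot].pop(job_h)
        moved |= {job_h, job_c}
        swaps.append((job_h, hot, cold))
    return swaps
```

The `moved` set prevents the greedy loop from oscillating by swapping the same job back and forth.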
Slide 19: Evaluation
Slide 20: Methodology
- Workloads:
- 64 cores → 8 chips of 8-core CMPs running mixes of the 29 SPEC CPU2006 workloads
- benchmark mixes: 30M-instruction mixes of 8 benchmarks
- High level: Monte Carlo
- compare the intra- and inter-chip algorithms to an equal-partitions assignment
- show the algorithm works for many cases/configurations
- 1000 experiments
- Detailed simulation:
- cycle-accurate, full system
- Simics + GEMS CMP-DNUCA, profiling mechanisms, cache partitions
- Comparison:
- Utility-based Cache Partitioning (UCP), modified for our DNUCA CMP
- considers only last-level cache misses
- uses marginal utility on a single chip to assign capacity
Slide 21: High Level: LLC Misses
[Figure: relative miss rate; relative reduction of BW-aware over UCP]
- 25.7% reduction over simple equal partitions
- Average 7.9% reduction over UCP
- Significant reductions with only the 1.4% overhead of monitoring mechanisms that UCP already requires
- As the LLC size increases → more surplus ways → more opportunities for inter-chip migrations
Slide 22: High Level: Memory Bandwidth
[Figure: relative bandwidth reduction of BW-aware over UCP]
- UCP's reductions come from its miss-rate reduction → 19% over equal partitions
- Average 18% reduction over UCP and 36% over equal partitions
- Gains are larger with smaller caches due to contention
- As the number of chips increases → more opportunities for inter-chip
Slide 23: Full-System Case Studies
- Case 1:
- 8.6% IPC improvement and 15.3% MPKI reduction
- Chip 4 (bwaves, mcf) ↔ Chip 7 (povray, calculix)
- Case 2:
- 8.5% IPC improvement and 11% MPKI reduction
- Chip 7 over-committed in memory bandwidth
- bwaves (Chip 7) ↔ zeusmp (Chip 2)
- gcc (Chip 7) ↔ gamess (Chip 6)
Slide 24: Conclusions
- As the number of cores in a system increases → resource contention becomes the dominating factor
- Memory bandwidth is a significant factor in system performance and should always be considered in memory-resource management
- The bandwidth-aware scheme achieved an 18% reduction in memory bandwidth and an 8% reduction in miss rate over existing partitioning techniques, and more than 25% over schemes with no partitioning
- The overall improvement can justify the cost of the proposed monitoring mechanisms: only 1.4% overhead, which could already exist in predictive single-chip schemes
Slide 25: Thank You. Questions?
Laboratory for Computer Architecture, The University of Texas at Austin
Slide 26: Backup Slides
Slide 27: Misses, absolute and effective error
Slide 28: Bandwidth, absolute and effective error