Scaling and Packing on a Chip Multiprocessor


1
Scaling and Packing on a Chip Multiprocessor
  • Vincent W. Freeh
  • Tyler K. Bletsch
  • Freeman L. Rawson, III

2
Introduction
  • Want to save power without a performance hit
  • Dynamic Frequency and Voltage Scaling
    • Slow down the CPU
    • Linear speed loss, quadratic CPU power drop
    • Efficient, but limited range
    • A number of fixed P-states
  • CPU Packing
    • Run a workload on fewer CPU cores
    • Linear speed loss, linear CPU power drop
    • Less efficient, greater range
    • A number of fixed configurations
  • Using both?
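
The trade-off on this slide can be put into numbers with a minimal sketch. It assumes an idealized model that counts CPU power only (no memory, disk, or static power) and takes the slide's "linear" and "quadratic" literally. The function names are hypothetical and chosen for illustration.

```python
def scaling(freq_fraction):
    """Relative (speed, CPU power) when the clock runs at freq_fraction of maximum."""
    speed = freq_fraction        # linear speed loss
    power = freq_fraction ** 2   # quadratic CPU power drop
    return speed, power


def packing(active_cores, total_cores):
    """Relative (speed, CPU power) when only active_cores of total_cores run."""
    fraction = active_cores / total_cores
    return fraction, fraction    # linear speed loss, linear CPU power drop


def energy_per_unit_work(speed, power):
    """Energy = power x time, and time per unit of work is 1 / speed."""
    return power / speed


if __name__ == "__main__":
    for label, (speed, power) in [
        ("scaling to half clock", scaling(0.5)),
        ("packing onto 2 of 4 cores", packing(2, 4)),
    ]:
        print(label, "speed", speed, "power", power,
              "energy per unit work", energy_per_unit_work(speed, power))
```

Under these assumptions both settings run at half speed, but scaling also halves the CPU energy per unit of work while packing leaves it unchanged. That is the sense in which scaling is "efficient" and packing "less efficient".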

3
Hardware architecture
  • 16 nodes (complete systems), each with
    • 2 CPU sockets per node (physical dies)
    • 2 cores per socket (4 total cores per node)
  • 4-level memory hierarchy
    • L1 and L2 cache
      • Per-core
    • Local memory
      • Per-socket
    • Remote memory
      • Accessible via HyperTransport bus

[Figure: node block diagram. Socket 0 (AMD64 cores 0 and 1) and Socket 1 (AMD64 cores 2 and 3) are linked by HyperTransport. Each core has its own L1 instruction, L1 data, and L2 caches; each socket has 1GB of local memory.]

4
P-states and configurations
  • Scaling
    • Entire socket must scale together
    • 5 P-states: every 200 MHz from 1.8 GHz down to 1.0 GHz
  • Packing
    • 5 configurations
      • All four cores: 4
      • Three cores: 3
      • Cores 0 and 1 (same socket): 2
      • Cores 0 and 2 (one core per socket): 2
      • One core: 1
    • For multi-node tests, prepend number of nodes
      • Example: 4 2 means 4 nodes, cores 0 and 1 active, 8 total cores
    • Packing results "simulate" full socket shutdown
      (subtract 20W)
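
The settings on this slide can be enumerated in a short sketch. The configuration labels, the core IDs chosen for the three-core and one-core cases, and the reading of "subtract 20W" as 20 W per fully idle socket per node are all assumptions made for illustration. The slide itself only states the counts, the two named core pairs, and the 20 W figure.

```python
# P-states in MHz: every 200 MHz from 1.8 GHz down to 1.0 GHz.
P_STATES_MHZ = list(range(1800, 999, -200))

CORES_PER_SOCKET = 2  # cores 0-1 on socket 0, cores 2-3 on socket 1

# Active core IDs per packing configuration (labels are hypothetical).
CONFIGS = {
    "four cores": (0, 1, 2, 3),
    "three cores": (0, 1, 2),   # which three cores is an assumption
    "cores 0 and 1": (0, 1),    # both on socket 0
    "cores 0 and 2": (0, 2),    # one core on each socket
    "one core": (0,),           # which core is an assumption
}

IDLE_SOCKET_WATTS = 20.0  # assumed to apply once per fully idle socket


def total_cores(nodes, config):
    """Active cores across a multi-node run."""
    return nodes * len(CONFIGS[config])


def idle_sockets(config, sockets_per_node=2):
    """Number of sockets on a node with no active core."""
    used = {core // CORES_PER_SOCKET for core in CONFIGS[config]}
    return sockets_per_node - len(used)


def adjusted_power(measured_watts, config):
    """Measured node power with idle sockets treated as shut down."""
    return measured_watts - IDLE_SOCKET_WATTS * idle_sockets(config)


if __name__ == "__main__":
    print(P_STATES_MHZ)
    print(total_cores(4, "cores 0 and 1"))
    print(adjusted_power(100.0, "cores 0 and 1"))
```

The 100 W input in the usage lines is a made-up value, not a measurement from the talk.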

5
Three application classes
  • CPU-bound
    • No communication, fits in cache
    • 100% CPU utilization
    • Similar to while(1)
  • High-Performance Computing (HPC)
    • Inter-node communication
    • Significant memory usage
    • Performance = execution time
  • Commercial
    • Constant servicing of remote requests
    • Possibly significant memory usage
    • Performance = throughput

6
(1) CPU-bound workloads
  • Workload
    • DAXPY: a small linear algebra kernel
    • Representative of entire class
  • Scaling
    • Linear slowdown
    • Quadratic power cut
  • Packing
    • 4 is most efficient
    • 2 is no good here
    • 3 is right out
    • Single-socket configs 1 and 2 save power, but
      kill performance
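
For readers unfamiliar with the kernel named above, DAXPY is the BLAS operation y <- a*x + y on double-precision vectors. The following is a plain-Python sketch of that operation only, not the benchmark code used in the talk, which is not shown in the slides.

```python
def daxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i], then return y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


if __name__ == "__main__":
    print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

With small vectors the loop touches only data that stays in cache and does no communication, which matches the description of the CPU-bound class.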

7
(2) HPC workloads
  • Packing with fixed nodes

[Figure: four panels (Power, Time, Energy, EDP). Annotations: "LU 2 speedup", "CG slowdown", "CG CPU utilization falls", "2 has no effect".]

8
(2) HPC workloads
  • Packing with fixed cores

[Figure: four panels (Power, Time, Energy, EDP).]

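
The four metrics plotted on the HPC slides are related as follows: energy is average power times execution time, and EDP (energy-delay product) is energy times execution time. A minimal sketch, with made-up numbers that are not results from the talk:

```python
def energy_joules(avg_power_watts, time_seconds):
    """Energy consumed by a run: average power x execution time."""
    return avg_power_watts * time_seconds


def edp(avg_power_watts, time_seconds):
    """Energy-delay product: energy x execution time."""
    return energy_joules(avg_power_watts, time_seconds) * time_seconds


if __name__ == "__main__":
    # Hypothetical runs: a baseline, and a packed run that draws less
    # power but takes longer.
    for label, watts, seconds in [("baseline", 200.0, 100.0),
                                  ("packed", 150.0, 120.0)]:
        print(label, energy_joules(watts, seconds), edp(watts, seconds))
```

In this invented example the packed run uses less energy but has a higher EDP, because EDP penalizes the longer execution time a second time. That is why the slides report both.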
9
(3) Commercial workloads
  • Scale first, then pack

[Figure: plot of Power (W) and Throughput (replies/second).]

10
Conclusions
  • Packing is less efficient than scaling
    • Therefore: scale first, then pack
  • Nothing can help CPU-bound apps
  • Memory/IO-bound workloads are scalable
  • Resource utilization affects (predicts?)
    effectiveness of scaling and packing
  • Business workloads can benefit from
    scaling/packing
    • Especially at low utilization levels
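
One way to read "scale first, then pack" is as an ordering of power-saving settings: step down through the P-states with all cores active, and only then start removing cores at the lowest frequency. The sketch below follows that reading. The capacity proxy (active cores times clock), the function names, and the selection rule are assumptions for illustration, not the authors' mechanism. It also ignores the distinction between the two 2-core configurations.

```python
P_STATES_MHZ = [1800, 1600, 1400, 1200, 1000]  # from the P-states slide
CORE_COUNTS = [4, 3, 2, 1]                     # packing levels per node


def scale_then_pack_order():
    """Settings as (cores, MHz), from highest to lowest capacity."""
    order = [(CORE_COUNTS[0], mhz) for mhz in P_STATES_MHZ]  # scale first
    lowest = P_STATES_MHZ[-1]
    order += [(cores, lowest) for cores in CORE_COUNTS[1:]]  # then pack
    return order


def capacity(cores, mhz):
    """Assumed capacity proxy: active cores x clock frequency."""
    return cores * mhz


def choose_setting(required_capacity):
    """Deepest setting in the order that still meets the requirement.

    Falls back to the full-speed setting if nothing meets it.
    """
    order = scale_then_pack_order()
    chosen = order[0]
    for setting in order:
        if capacity(*setting) >= required_capacity:
            chosen = setting
        else:
            break
    return chosen


if __name__ == "__main__":
    print(scale_then_pack_order())
    print(choose_setting(4500))
    print(choose_setting(2500))
```

A lightly loaded commercial workload would land far down this list, which is consistent with the observation that the benefit is largest at low utilization levels.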

11
Future work
  • How does resource utilization influence the
    effectiveness of scaling/packing?
    • A predictive model based on resource usage?
    • A power management engine based on resource
      usage?
  • Dynamic packing
    • Virtualization allows live migration
    • Can this be used to do packing on the fly?