Title: Challenges for High Performance Processors
1 Challenges for High Performance Processors
- Hiroshi NAKAMURA
- Research Center for Advanced Science and Technology, The University of Tokyo
2 What's the challenge?
- Our primary goal: performance
- How?
  - increase the number and/or operating frequency of functional units, AND
  - supply the functional units with sufficient data (bandwidth)
- Problems
  - Memory Wall: system performance is limited by poor memory performance
  - Power Wall: power consumption is approaching the cooling limit
3 Memory Wall Problem
- Performance improvement
  - CPU: 55% / year
  - DRAM: 7% / year
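To make the gap concrete, the two growth rates above can be compounded over several years. This is only a sketch: it assumes both figures are annual percentage improvements starting from the same baseline, and the function name and the chosen horizons are illustrative.

```python
def relative_gap(years, cpu_rate=0.55, dram_rate=0.07):
    """Ratio of CPU to DRAM performance after `years`, both normalized to 1.0 at year 0."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


# Print how far the processor pulls ahead of memory over time.
for y in (1, 5, 10):
    print(f"after {y:2d} years the CPU/DRAM gap is {relative_gap(y):.1f}x")
```

Under these assumptions the gap widens by roughly 45% every year, which is why the mismatch compounds so quickly.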
4 Example of Memory Wall: performance of a 2 GHz Pentium 4 on a kernel over arrays a[i], b[i], c[i]
- non-blocking cache, out-of-order issue
- → lack of effective memory throughput
5 Recap: Memory Wall Problem
- growing gap between processor and memory speed
- performance is limited by memory ability in High Performance Computing (HPC)
  - long access latency of main memory
  - lack of throughput of main memory
- → making full use of wide-bandwidth local (on-chip) memory is indispensable
- on-chip memory space is a valuable resource
  - not enough for HPC
  - should exploit data locality
6 Does cache work well in HPC?
- works well in many cases, but is not the best for HPC
- data location and replacement are decided by hardware
  - unfortunate line conflicts occur although most data accesses are regular
  - e.g. data used only once flushes out other useful data
- transfer size between cache and off-chip memory is fixed
  - for consecutive data, a larger transfer size is preferable
  - for non-consecutive data, a large line transfer incurs unnecessary data transfer → waste of bandwidth
- Most HPC applications exhibit regularity in data access, which the cache often fails to exploit.
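The bandwidth waste from a fixed line size can be estimated with a small model. The sketch below assumes an aligned strided sweep with no reuse; the 64-byte line and 8-byte element are illustrative values, not figures from the slides.

```python
def useful_fraction(stride_bytes, elem_bytes=8, line_bytes=64):
    """Fraction of the fetched bytes that a strided sweep actually uses.

    Assumes aligned accesses, no reuse, and stride_bytes >= elem_bytes.
    """
    if stride_bytes < elem_bytes:
        raise ValueError("stride must be at least one element")
    # Small strides share a line among several elements; once the stride
    # reaches the line size, each fetched line carries one useful element.
    return elem_bytes / min(stride_bytes, line_bytes)


for stride in (8, 16, 64, 1024):
    print(f"stride {stride:4d} B: {useful_fraction(stride):.1%} of fetched bytes are used")
```

With these numbers, any stride of one line or more uses only one eighth of the off-chip bandwidth it consumes.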
7 SCIMA (Software Controlled Integrated Memory Architecture) [kondo-ICCD2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)
- addressable SCM (Software Controllable Memory) in addition to ordinary cache
  - a part of the logical address space
  - no inclusion relation with the cache
- SCM and cache are reconfigurable at the granularity of a way
[Figure: overview of SCIMA and its address space]
8 Data Transfer Instruction
- load/store
  - register ↔ SCM/Cache
- page-load/page-store
  - SCM ↔ off-chip memory
  - large-granularity transfer
    - wider effective bandwidth by reducing latency stalls
  - block-stride transfer
    - avoids unnecessary data transfer
    - more effective utilization of on-chip memory
[Figure: data paths among Register, Cache, SCM, and Off-Chip Memory, with the new instructions marked]
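A block-stride transfer can be pictured as a gather from off-chip memory into a contiguous on-chip buffer. The following is a software model only: the function name mirrors page-load, but its parameters (`base`, `block_len`, `stride`, `n_blocks`) are hypothetical, since the slides do not give the instruction's operands.

```python
def page_load(off_chip, base, block_len, stride, n_blocks):
    """Copy n_blocks blocks of block_len elements, spaced `stride` elements apart,
    from `off_chip` into a new contiguous list standing in for the SCM."""
    if stride < block_len:
        raise ValueError("blocks would overlap")
    end = base + (n_blocks - 1) * stride + block_len
    if base < 0 or end > len(off_chip):
        raise IndexError("transfer runs outside off-chip memory")
    scm = []
    for k in range(n_blocks):
        start = base + k * stride
        scm.extend(off_chip[start:start + block_len])
    return scm


# A 4x4 row-major matrix: fetch column 1 without fetching the rest of each row.
matrix = list(range(16))
column = page_load(matrix, base=1, block_len=1, stride=4, n_blocks=4)
print(column)
```

Only the requested elements occupy on-chip space, which is the point of the block-stride form: nothing that will go unused is moved or stored.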
9 Strategy of Software Control
- SCM must be controlled by software
- arrays are classified into 6 groups
[Table: array classification by Consecutiveness and Reusability, including an irregular class]
- prototype of a semi-automatic compiler: users specify hints on the reusability of data arrays
10 Results of Memory Traffic
- unnecessary memory traffic is suppressed
- memory traffic decreases by 1% to 61% in SCIMA
11 Results of Performance
[Figure: normalized execution time, broken down as follows]
- CPU busy time
- latency stall: elapsed time due to memory latency
- throughput stall: elapsed time due to lack of throughput
- 1.3 to 2.5 times faster than cache
  - latency stall reduction by large granularity of data transfer
  - throughput stall reduction by suppressing unnecessary data transfer
12 Power Wall
- Next focus: power consumption of processors
- Is there any room for power reduction?
- If yes, then how to reduce it?
[Figure: Trends of Heat Density]
13 Observation (1): Moore's Law
- Number of transistors doubles every 18 months
14 Observation (2): frequency
- Frequency doubles every 3 years.
- Number of transistors doubles every 18 months
- → Number of switching events on a chip grows 8 times every 3 years
15 Observation (3): performance
- number of switching events on a chip: 8 times every 3 years
- effective performance: 4 times every 3 years
  - "microprocessor performance improved 55% per year", from Computer Architecture: A Quantitative Approach by J. Hennessy and D. Patterson, Morgan Kaufmann
- → unnecessary switching (a chance for power reduction) doubles every 3 years
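The arithmetic behind observations (1) to (3) can be checked in a few lines. This sketch assumes that switching activity scales with transistor count times frequency, and reads "unnecessary switching" as the ratio of switching growth to performance growth; both readings are interpretations of the slides, and the variable names are illustrative.

```python
PERIOD_YEARS = 3

# Transistor count doubles every 18 months; frequency doubles every 3 years.
transistor_growth = 2 ** (PERIOD_YEARS * 12 / 18)
frequency_growth = 2 ** (PERIOD_YEARS / 3)

# Assumption: switching events scale with transistors x frequency.
switching_growth = transistor_growth * frequency_growth

# 55% per year compounded over the same period (the slides round this to 4x).
performance_growth = 1.55 ** PERIOD_YEARS

# Switching that did not turn into performance.
unnecessary_growth = switching_growth / performance_growth

print(f"switching:   x{switching_growth:.1f} per {PERIOD_YEARS} years")
print(f"performance: x{performance_growth:.2f} per {PERIOD_YEARS} years")
print(f"unnecessary: x{unnecessary_growth:.2f} per {PERIOD_YEARS} years")
```

The ratio comes out slightly above 2, consistent with the rounded "doubles every 3 years" claim.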
16 An Evidence of the Observation: unnecessary switching x2 / 3 years
Zyuban00 @ ISLPED'00
[Figure: access energy per instruction (nJ) versus issue width, for committed and flushed instructions, broken down into rename map table, bypass mechanism, load/store window, issue window, register file, and functional units]
- energy per instruction increases to exploit ILP for higher performance
  - at functional units: no increase
  - at issue window, register file: increase
  - flushed instructions due to incorrect prediction: increase
- → waste of power
17 Registers
- Registers consume a lot of power
  - roughly speaking, power ∝ (num. of registers) × (num. of ports)
  - high-performance wide-issue superscalar processors → more registers, more read/write ports
- Open Question
  - in HPC, what is the best way to use many functional units (or accelerators) from the perspective of register file design?
    - scalar registers with SIMD operations
    - vector registers with vector operations
- Personal Impression
  - vector registers are accessed in a well-organized fashion, so it is easy to reduce the number of ports by sub-banking
  - can vector operations make good use of local on-chip memory? (at least, traditional vector processors cannot!)
18 Dual Core helps
Rule of thumb, in the same process technology:
              Voltage  Freq   Area  Power  Perf
Single core   1        1      1     1      1
Dual core     -15%     -15%   2     1      1.8
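A first-order model shows why lowering voltage and frequency together pays for a second core. The sketch below assumes dynamic power proportional to V² × f per core and performance proportional to cores × frequency (perfect parallel scaling, no leakage); these are textbook simplifications rather than the model behind the slide, so the results land near the rule-of-thumb values without matching them exactly.

```python
def scaled_multicore(voltage_scale=0.85, freq_scale=0.85, cores=2):
    """Return (relative power, relative performance) versus one core at scale 1.0.

    Assumes dynamic power ~ V^2 * f per core and performance ~ cores * f.
    """
    power = cores * voltage_scale ** 2 * freq_scale
    performance = cores * freq_scale
    return power, performance


power, perf = scaled_multicore()
print(f"relative power: {power:.2f}, relative performance: {perf:.2f}")
```

Under this model a 15% cut in both voltage and frequency reduces per-core power by almost 40%, so two such cores cost roughly the power of one full-speed core while delivering well over 1.5 times the throughput.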
19 Multi-Core helps more
[Figure: Power and Performance charts comparing core configurations; labeled values include Power 1/4 and Performance 1/2; axis ticks run from 1 to 4]
- no need for wider instruction issue?
- Multi-Core: power efficient, better power and thermal management
20 Leakage problem
[Figure: from IEEE Computer Magazine]
- How to attack the leakage problem?
21 Introduction of our research
- "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs"
  - 5-year project, started October 2006
  - supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program
- Objective: drastic power reduction of high-performance system LSIs by innovative power control through tight cooperation of various design levels, including circuit, architecture, and system software.
- Members
  - Prof. H. Nakamura (U. Tokyo): architecture & compiler, leader
  - Prof. M. Namiki (Tokyo Univ. of Agri. & Tech.): OS
  - Prof. H. Amano (Keio Univ.): architecture & F/E design
  - Prof. K. Usami (Shibaura I.T.): circuit & B/E design
22 How to reduce leakage: Power Gating
- Focusing on Power Gating for reducing leakage
  - Inserting a Power Switch (PS) between VDD and GND
  - Turning off the PS during sleep
[Figure: logic gates between VDD and GND, shown without and with a Power Switch and Virtual GND]
23 Run-time Power Gating (RTPG)
- control the power switch at run time
- Coarse grain: mobile processor by Renesas
  - (independent power domains for BB module, MPEG module, ...)
- Fine grain (our target): power gating within a module
24 Fine-grain Run-time Power Gating
- Longer sleep time is preferable
  - Leakage savings
  - Overheads: power penalties for wakeup
- Evaluation through a real chip: not reported
- Test vehicle: 32b x 32b multiplier
  - Either or both operands (input data) are likely to be narrower than 16 bits
  - Circuit portions that compute the upper bits of the product need not operate → wasted leakage power
- By detecting 0s in the upper 16 bits of the operands, power-gate the internal multiplier array
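The operand check can be illustrated by splitting a 32b x 32b multiply into four 16b x 16b partial products and skipping those whose input half is all zeros. This decomposition is an assumption chosen for illustration; the slides do not describe how the real multiplier array is partitioned, and the function names are hypothetical.

```python
def split_halves(x):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit operand."""
    return (x >> 16) & 0xFFFF, x & 0xFFFF


def multiply_with_gating(a, b):
    """Multiply two 32-bit unsigned values; return (product, active sub-blocks).

    A sub-block is counted as active only if neither of its inputs is a
    zero upper half, modelling the blocks that could stay powered off.
    """
    ah, al = split_halves(a)
    bh, bl = split_halves(b)
    gate_a = ah == 0  # upper half of operand a is all zeros
    gate_b = bh == 0  # upper half of operand b is all zeros

    product = al * bl  # the low x low block is always needed
    active = 1
    if not gate_a:
        product += (ah * bl) << 16
        active += 1
    if not gate_b:
        product += (al * bh) << 16
        active += 1
    if not gate_a and not gate_b:
        product += (ah * bh) << 32
        active += 1
    return product, active


print(multiply_with_gating(0x1234, 0x00FF))
print(multiply_with_gating(0x12345678, 0x9ABC))
```

The product is identical to a full multiply in every case; only the count of sub-blocks that must be awake changes, which is where the leakage saving would come from.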
25 Test chip "Pinnacle"
[Figure: real measurement, FG-RTPG not applied versus FG-RTPG applied]
- Exhibits good power reduction
- Current Status
  - Designing a pipelined microprocessor with FG-RTPG
  - Compiler (instruction scheduler) to increase sleep time
26 Low Power Linux Scheduler based on statistical modeling
- Co-optimization of System Software and Architecture
- Objective
  - a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint
- How to find the lowest frequency that satisfies the performance constraint?
  - it depends on hardware and program characteristics
  - the performance ratio is different from the frequency ratio
  - hard to find the answer directly
  - → modeling by statistical analysis of hardware events
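The frequency search can be sketched once some performance model is available. The model below is a stand-in, not the statistical model from the project: it assumes execution time splits into a part that scales with frequency and a memory-bound part that does not, with `mem_fraction` being a value a real scheduler would estimate from hardware event counters. The frequency list is hypothetical.

```python
def predicted_perf_ratio(freq, f_max, mem_fraction):
    """Predicted performance at `freq` relative to running at `f_max`.

    Assumed model: the CPU-bound share of time stretches by f_max/freq,
    while the memory-bound share (mem_fraction) is unaffected.
    """
    relative_time = (1.0 - mem_fraction) * (f_max / freq) + mem_fraction
    return 1.0 / relative_time


def lowest_frequency(freqs, mem_fraction, min_perf_ratio):
    """Lowest available frequency whose predicted ratio meets the constraint."""
    f_max = max(freqs)
    for f in sorted(freqs):
        if predicted_perf_ratio(f, f_max, mem_fraction) >= min_perf_ratio:
            return f
    return f_max  # fall back to full speed if nothing else qualifies


FREQS_GHZ = [0.8, 1.2, 1.6, 2.0]  # hypothetical operating points
for m in (0.0, 0.8, 0.95):
    f = lowest_frequency(FREQS_GHZ, m, min_perf_ratio=0.9)
    print(f"memory-bound fraction {m:.2f}: run at {f} GHz")
```

A compute-bound process has to stay at full speed to keep 90% of its performance, while a heavily memory-bound one can drop to the lowest setting, which is why the performance ratio cannot be read off the frequency ratio.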
27 Evaluation result
Pentium M 760 (Max 2.00 GHz, FSB 533 MHz)
- Specified threshold
  - black dotted line
- Performance is within the threshold in all cases except for mgrid
- 3% to 7% below the threshold
- Accurate model is obtained
- Linux scheduler using this model is developed
May 8, 2007
28 Summary
- Challenge for high performance processors
  - Memory Wall and Power Wall
- One solution to the memory wall
  - make good use of on-chip memory with software controllability
- Solutions to the power wall
  - many cores will relax the problem, but
  - leakage current is becoming a big problem
  - new research/approaches are required
  - our project "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs" is introduced