Title: Profile-Based Dynamic Optimization Research for Future Computer Systems

1. Profile-Based Dynamic Optimization Research for Future Computer Systems
- Takanobu Baba
- Department of Information Science
- Utsunomiya University, Japan
- http://aquila.is.utsunomiya-u.ac.jp
- November 12, 2004
2. Brief history of my research
- 1970s: The MPG System, a Machine-Independent Efficient Microprogram Generator
- 1980s: MUNAP, a Two-Level Microprogrammed Multiprocessor Computer
- 1990s: A-NET, a Language-Architecture Integrated Approach for Parallel Object-Oriented Computation
3. A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
- A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
(figure: the MUNAP machine)
4. A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
- Massively parallel computation
- Each node consists of a PE and a router.
- The PE has a language-oriented, typical CISC architecture.
- The programmable router is topology-independent.
(figure: the A-NET Multicomputer)
5. Current dynamic optimization projects
- Computation-oriented
  - YAWARA: a meta-level optimizing computer system
  - HAGANE: binary-level multithreading
- Communication-oriented
  - Spec-All: an aggressive read/write access speculation method for DSM systems
  - A cross-line adaptive router using dynamic information
6. YAWARA: A Meta-Level Optimizing Computer System
7. Background
- Moore's Law will be maintained by semiconductor technology.
- How can we utilize the huge number of transistors to speed up program execution?
- Our idea is to utilize some chip area for dynamically and autonomously tuning the configuration of an on-chip multiprocessor.
8. Meta-level
(figure: the meta-level processor receives a profile of control and data from the base-level processor and returns results of optimization; the base-level processor fetches instructions and data from memory and produces the results of computation)
9. Design considerations
- HW vs. SW reconfiguration -> SW reconfiguration
- Static vs. dynamic reconfiguration -> both a static and a dynamic reconfiguration capability
- Homogeneous vs. heterogeneous architecture -> unified homogeneous structure
10. Basic concepts of thread-level reconfiguration
(figure: at the meta-level, a Management Thread supervises Profiling Threads during the profiling phase and Optimizing Threads during the optimization phase; at the base-level, Computing Threads execute the application over memory)
- MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
11. Execution model
- Profiling-centric: the Management Thread (MT) activates a Profiling Thread (PT) alongside the Computing Thread (CT); the PT alternately sleeps, collects the profile, and wakes up, and when the optimization-initiation condition is satisfied the MT activates an Optimizing Thread (OT).
- Computing-centric: the Computing Thread (CT) itself collects the profile while the Profiling Thread (PT) sleeps, until the optimization-initiation condition is satisfied.
12. Change of configurations by meta-level optimization
(figure: successive snapshots of the meta-level and base-level; the mix of MT, PT, OT, and CT threads assigned to the engines shifts as meta-level optimization proceeds, with profiling and optimizing threads gradually giving way to computing threads)
13. The YAWARA System
- An implementation of the computation model
- The SW system consists of static and dynamic optimization systems.
- The HW system includes uniformly structured thread engines (TEs); each TE can execute base- and meta-level threads.
- The spirit of YAWARA: "A flexible method prevails where a rigid one fails."
14. Software System
(figure: source code (C/C++, Java, Fortran, ...) and an execution profile feed the SOS (Static Optimization System), which produces an executable image and code-analysis info for the DOS (Dynamic Optimization System); the DOS dispatches work to an array of TEs (Thread Engines) and receives a run-time profile and execution results, closing both a static feedback loop and a dynamic feedback loop)
15. Hardware System
(figure: a grid of TEs under feedback-directed resource control, connected to/from a network; each Thread Engine (TE) contains a thread-code cache and a thread-data cache, per-thread register files (thread-0 ... thread-N), network IN/OUT ports, execution control over 4 integer units and 1 FP unit, and a profiling buffer managed by a profiling controller)
16. Example application: compress
- Speculative multithreading using a path prediction mechanism
(figure: the base level exploits a hot path and a hot loop with phased behavior, speculating on hot path 0 and recovering on a miss; at the meta-level, the management thread (MT) coordinates speculative-multithreading profiling (PT), hot-loop/hot-path detection (PT, OT), and the OT's speculative-multithreading code generation, helper-thread generation, and path-predictor generation)
17. Conclusion (YAWARA)
- We proposed an autonomous reconfiguration mechanism based on dynamic behavior.
- We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
- We are now developing the software system and the simulator.
18. Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
(YAWARA at PDCS 2004)
19. Occurrence ratios of the top-two paths

  function             1st path   2nd path   other paths
  compress/compress      54.5%      22.4%      23.1%
  ijpeg/forward_DCT      48.2%      42.1%       9.7%
  m88ksim/killtime       97.0%       3.0%       0.0%
  li/sweep               80.7%      19.3%       0.0%

- The top two paths occupy 80-100% of execution.
20. Two-level path prediction
- Introducing two-level branch prediction:
  - The history register keeps the sequence of 1st-path executions (1 = 1st path, 0 = the other paths).
  - The counter table counts 1st-path executions per history pattern.
- Single Path Predictor (SPP): the history register (e.g. 1101) indexes the counter table (v0 ... v15); if the selected counter (e.g. v13) > threshold X, predict the 1st path; otherwise predict the 2nd path.
21. Another path predictor
- Dual Path Predictor (DPP): a 1st-path history register (e.g. 1101) indexes a 1st-path counter table (v0 ... v15), and a 2nd-path history register (e.g. 0010) indexes a 2nd-path counter table; if the 1st-path counter (e.g. v13) > the 2nd-path counter (e.g. v2), predict the 1st path; otherwise predict the 2nd path.
22. Single Speculation (SS)
(figure: speculative threads follow the 1st path; when a thread fails, the succeeding threads are aborted, a recovery process runs, the failed iteration is re-executed as a non-speculative thread, and speculative execution then continues)
- Speculation failure degrades performance.
23. Double Speculation (DS)
- Even when the 1st speculation fails, the secondary choice has a high possibility of success, because the top-two paths are dominant.
- Expected 2nd-path hit ratios (2nd-path ratio divided by the 1st-path miss ratio): 49.2% (compress), 81.3% (forward_DCT), 100% (killtime), 100% (sweep).
24. Double Speculation (DS)
(figure: when the primary 1st-path speculation fails, a secondary speculation on the 2nd path is issued; only if that also fails does the recovery process run, after which speculative execution continues)
- If the secondary speculation succeeds, the performance loss is not so large.
25. Evaluation flow
- Hot-path detection (SIMCA)
- Thread-code generation, producing thread codes:
  - 1st-path speculative thread
  - 2nd-path speculative thread
  - non-speculative thread
- Path history acquisition (SIMCA) -> path execution history
- Performance estimator -> speculation hit ratio, speed-up ratio
26. Prediction success ratio
(figure: success ratio (%) vs. history length (1-16) for compress and forward_DCT)
27. Prediction success ratio
(figure: success ratio (%) vs. history length (1-16) for killtime and sweep)
28. Speed-up ratio
(figure: speed-up ratio vs. history length (1-16) for compress and forward_DCT; legend: "S 100" and "P1 only")
29. Speed-up ratio
(figure: speed-up ratio vs. history length (1-16) for killtime and sweep; legend: "S 100" and "P1 only")
30. Conclusions (Two-Path-Limited Speculative Multithreading)
- We proposed a path prediction method with predictors, and speculation methods, for path-based speculative multithreading.
- Preliminary performance estimation results were shown.
31. Current and future works
- Accurate and detailed evaluation for various applications -> SPEC 2000, MediaBench, ...
- Integration into our dynamic optimization framework YAWARA
32. Current dynamic optimization projects
- Computation-oriented
  - YAWARA: a meta-level optimizing computer system
  - HAGANE: binary-level multithreading
- Communication-oriented
  - Spec-All: an aggressive read/write access speculation method for DSM systems
  - A cross-line adaptive router using dynamic information
33. HAGANE: Binary-Level Multithreading
34. Background
- Multithread programming is not so easy. -> An automatic multithreading system
- However, source codes are not always available. -> Multithreading at the binary level
35. Binary Translator / Optimizer System
(figure: the source binary code, an execution profile, and analysis info feed the STO (Static Translator/Optimizer), which emits statically translated multithreaded binary code; the DTO (Dynamic Translator/Optimizer) rewrites the process memory image into dynamically translated multithreaded binary code; the multithread processor runs the result and feeds execution profile info back)
36. Thread Pipelining Model
- Loop iterations are mapped onto threads (thread i, thread i+1, thread i+2, ...).
- TSAG: Target Store Address Generation
37. Example translation

Source binary code (register names reconstructed from the numeric encodings, e.g. v13 = $v1/$3):

    mtc1  $zero, $f4
    addu  $v1, $zero, $zero
BB1:
    l.s   $f0, 0($a0)
    l.s   $f2, 0($a1)
    mul.s $f0, $f0, $f2
    addiu $v1, $v1, 1
    add.s $f4, $f4, $f0
    slti  $v0, $v1, 5000
    addiu $a1, $a1, 4
    addiu $a0, $a0, 4
    bne   $v0, $zero, BB1
BB2:
    mov.s $f0, $f4
    jr    $ra

Translated code, in the stage order Cont., TSAG, Comp., W.B.; bstr, lfrk, wtsagd, altsw, tsagd, sttsw, and estr are thread management instructions, i.e. the overhead code for multithreading:

    mtc1  $zero, $f4
    addu  $v1, $zero, $zero
    bstr
    slti  $v0, $v1, 5000
    beq   $v0, $zero, ST_LL0
    addu  $t0, $a0, $zero
    addu  $t1, $a1, $zero
    addi  $v1, $v1, 1
    addi  $a0, $a0, 4
    addi  $a1, $a1, 4
    lfrk
    wtsagd
    addu  $t2, $sp, $zero
    altsw $t2
    tsagd
    l.s   $f0, 0($t0)
    l.s   $f2, 0($t1)
    l.s   $f4, 0($t2)
    mul.s $f0, $f0, $f2
    add.s $f4, $f4, $f0
    sttsw $t2, $f4
ST_LL0:
    estr
    mov.s $f0, $f4
    jr    $ra
38. Superthreaded Architecture
(figure: an L1 instruction cache feeds multiple Thread Processing Units; each unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit, and all units share an L1 data cache)
39. m88ksim (SPECint95)
- Poor speedup ratios
- Loop unrolling does not affect the performance.
- The number of iterations is quite small.
40. ijpeg (SPECint95)
- The thread code size is too small to hide the thread management overhead.
- Loop unrolling is effective to achieve good speedup ratios.
- Excessive loop unrolling causes performance degradation.
- The number of iterations is not so large.
41. swim (SPECfp95)
- Good speedup ratios
- Loop unrolling is effective to achieve linear speedup.
- The number of iterations is large.
42. Conclusion (HAGANE)
- We have evaluated binary-level multithreading using some SPEC95 benchmark programs.
- The performance evaluation results indicate:
  - The thread code size should be large enough to improve the performance.
  - Loop unrolling is effective for small loop bodies.
  - Excessive loop unrolling degrades performance.
43. A Methodology of Binary-Level Variable Analysis for Multithreading
(HAGANE at PDCS 2004)
44. Background and Objective
- Usually, loop iterations are interrelated through memory variables, such as induction variables. However, it is difficult to analyze this kind of dependency at the binary level.
- A binary-level variable analysis method is strongly required for binary-level multithreading.
45. Example Binary Code

Source C loop:

    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }

Corresponding binary code (register names reconstructed from the numeric encodings, e.g. a15 = $a1/$5; note the memory references -4($v1) for a[i-1] and 0($a0) for a[i], and the induction variable i kept in memory at 16($s8)):

    lw    $a1, 16($s8)
    lw    $v1, 16($s8)
    lw    $a0, 16($s8)
    sll   $v1, $v1, 0x2
    addu  $v1, $v1, $a2
    lw    $v0, 16($s8)
    lw    $v1, -4($v1)
    addiu $v0, $v0, 1
    sw    $v0, 16($s8)
    lw    $v0, 16($s8)
    sll   $a1, $a1, 0x1
    sll   $a0, $a0, 0x2
    sll   $v0, $v1, 0x1
    addu  $v0, $v0, $v1
    lw    $v1, 16($s8)
    addu  $a0, $a0, $a2
    addu  $a1, $a1, $v0
    sw    $a1, 0($a0)
    slt   $v1, $v1, $a3
46. Binary-Level Variable Analysis
- (1) Register values are analyzed using dataflow trees.
- (2) When register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
- (3) Using the virtual registers, steps (1) and (2) are repeated.
47Construction of Dataflow Tree
- addiu 291, 290, -8
- sw 0, 0(291)
- addu 51, 0, 0
- lw 21, 0(291)
- addu 31, 51, 40
- addiu 52, 51, 1
- addu 22, 21, 31
- sw 22, 0(291)
- slti 23, 52, 100
- bne 23, 0, L1
48. Example Normalization
(figure only)
49. Detection of Loop Induction Variables
- A loop induction variable is a register which:
  - has an inter-iteration dependency, and
  - increases by a fixed value between iterations.
- The concept of the virtual register makes it possible to detect induction variables on memory.
50. Application
- 101.tomcatv of the SPECfp95 benchmark
- Fortran-to-C translator ver. 19940927
- GCC cross compiler ver. 2.7.2.3 for SIMCA
- Data set: test
- The six innermost loops (1-6) are selected; they have induction variables on memory.
51. Speedup Ratios
(figure only)
52. Conclusion (Binary-Level Variable Analysis)
- We proposed a binary-level variable analysis method.
- This method makes it possible to detect induction variables and their increment/decrement values.
- The detected information allows us to multithread binary codes that could not be multithreaded without our algorithm.
- We attained up to 9.8 speedup by the multithreading.
53. Summary
- Dynamic optimization projects at our laboratory
- The results show the performance improvement quantitatively in each project.
54. What's the next step of computer architecture research?
- From performance to reliability? Or low power?
  - e.g. dependable computing
- Architecture for new device technologies?
  - e.g. quantum computing
- However, if we stick to conventional high-performance computing research, what's the promising way?