Title: Profile-Based Dynamic Optimization Research for Future Computer Systems

1. Profile-Based Dynamic Optimization Research for Future Computer Systems
- Takanobu Baba
- Department of Information Science
- Utsunomiya University, Japan
- http://aquila.is.utsunomiya-u.ac.jp
- November 12, 2004
2. Brief history of my research
- 1970s: The MPG System, a Machine-Independent Efficient Microprogram Generator
- 1980s: MUNAP, a Two-Level Microprogrammed Multiprocessor Computer
- 1990s: A-NET, a Language-Architecture Integrated Approach for Parallel Object-Oriented Computation
3. A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
- A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
(figure: the MUNAP machine)
4. A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
- Massively parallel computation
- Each node consists of a PE and a router.
- The PE has a language-oriented, typical CISC architecture.
- The programmable router is topology-independent.
(figure: the A-NET Multicomputer)
5. Current dynamic optimization projects
- Computation-oriented
  - YAWARA: a meta-level optimizing computer system
  - HAGANE: binary-level multithreading
- Communication-oriented
  - Spec-All: an aggressive read/write access speculation method for DSM systems
  - A cross-line adaptive router using dynamic information
6. YAWARA: A Meta-Level Optimizing Computer System
7. Background
- Moore's Law will be maintained by semiconductor technology.
- How can we utilize the huge number of transistors to speed up program execution?
- Our idea is to utilize some chip area for dynamically and autonomously tuning the configuration of an on-chip multiprocessor.
8. Meta-level
(figure: the meta-level processor receives a profile of control and data from the base-level processor and returns results of optimization; the base-level processor fetches instructions and data from memory and produces the results of computation)
9. Design considerations
- HW vs. SW reconfiguration -> SW reconfiguration
- Static vs. dynamic reconfiguration -> both a static and a dynamic reconfiguration capability
- Homogeneous vs. heterogeneous architecture -> unified homogeneous structure
10. Basic concepts of thread-level reconfiguration
(figure: at the meta-level, a Management Thread supervises Profiling Threads during the profiling phase and Optimizing Threads during the optimization phase; at the base-level, Computing Threads execute the application over memory)
- MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
11. Execution model
- Profiling-centric: the Management Thread (MT) activates a Profiling Thread (PT) alongside the Computing Thread (CT); the PT alternately sleeps, collects the profile, and wakes up, and when the optimization-initiation condition is satisfied the MT activates an Optimizing Thread (OT).
- Computing-centric: the Computing Thread (CT) itself collects the profile while the Profiling Thread (PT) sleeps, until the optimization-initiation condition is satisfied.
12. Change of configurations by meta-level optimization
(figure: successive snapshots of the meta-level and base-level; the mix of MT, PT, OT, and CT threads assigned to the engines shifts as meta-level optimization proceeds, with profiling and optimizing threads gradually giving way to computing threads)
13. The YAWARA System
- An implementation of the computation model
- The SW system consists of static and dynamic optimization systems.
- The HW system includes uniformly structured thread engines (TEs); each TE can execute base- and meta-level threads.
- The spirit of YAWARA: "A flexible method prevails where a rigid one fails."
14. Software System
(figure: source code (C/C++, Java, Fortran, ...) and an execution profile feed the SOS (Static Optimization System), which produces an executable image and code-analysis info for the DOS (Dynamic Optimization System); the DOS dispatches work to an array of TEs (Thread Engines) and receives a run-time profile and execution results, closing both a static feedback loop and a dynamic feedback loop)
15. Hardware System
(figure: a grid of TEs under feedback-directed resource control, connected to/from a network; each Thread Engine (TE) contains a thread-code cache and a thread-data cache, per-thread register files (thread-0 ... thread-N), network IN/OUT ports, execution control over 4 integer units and 1 FP unit, and a profiling buffer managed by a profiling controller)
16. Example application: compress
- Speculative multithreading using a path prediction mechanism
(figure: the base level exploits a hot path and a hot loop with phased behavior, speculating on hot path 0 and recovering on a miss; at the meta-level, the management thread (MT) coordinates speculative-multithreading profiling (PT), hot-loop/hot-path detection (PT, OT), and the OT's speculative-multithreading code generation, helper-thread generation, and path-predictor generation)
17. Conclusion (YAWARA)
- We proposed an autonomous reconfiguration mechanism based on dynamic behavior.
- We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
- We are now developing the software system and the simulator.
18. Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
(YAWARA at PDCS 2004)
19. Occurrence ratios of the top-two paths

  function             1st path   2nd path   other paths
  compress/compress      54.5%      22.4%      23.1%
  ijpeg/forward_DCT      48.2%      42.1%       9.7%
  m88ksim/killtime       97.0%       3.0%       0.0%
  li/sweep               80.7%      19.3%       0.0%

- The top two paths occupy 80-100% of execution.
20. Two-level path prediction
- Introducing two-level branch prediction:
  - The history register keeps the sequence of 1st-path executions (1 = 1st path, 0 = the other paths).
  - The counter table counts 1st-path executions per history pattern.
- Single Path Predictor (SPP): the history register (e.g. 1101) indexes the counter table (v0 ... v15); if the selected counter (e.g. v13) > threshold X, predict the 1st path; otherwise predict the 2nd path.
21. Another path predictor
- Dual Path Predictor (DPP): a 1st-path history register (e.g. 1101) indexes a 1st-path counter table (v0 ... v15), and a 2nd-path history register (e.g. 0010) indexes a 2nd-path counter table; if the 1st-path counter (e.g. v13) > the 2nd-path counter (e.g. v2), predict the 1st path; otherwise predict the 2nd path.
22. Single Speculation (SS)
(figure: speculative threads follow the 1st path; when a thread fails, the succeeding threads are aborted, a recovery process runs, the failed iteration is re-executed as a non-speculative thread, and speculative execution then continues)
- Speculation failure degrades performance.
23. Double Speculation (DS)
- Even when the 1st speculation fails, the secondary choice has a high possibility of success, because the top-two paths are dominant.
- Expected 2nd-path hit ratios (2nd-path ratio divided by the 1st-path miss ratio): 49.2% (compress), 81.3% (forward_DCT), 100% (killtime), 100% (sweep).
24. Double Speculation (DS)
(figure: when the primary 1st-path speculation fails, a secondary speculation on the 2nd path is issued; only if that also fails does the recovery process run, after which speculative execution continues)
- If the secondary speculation succeeds, the performance loss is not so large.
25. Evaluation flow
- Hot-path detection (SIMCA)
- Thread-code generation, producing thread codes:
  - 1st-path speculative thread
  - 2nd-path speculative thread
  - non-speculative thread
- Path history acquisition (SIMCA) -> path execution history
- Performance estimator -> speculation hit ratio, speed-up ratio
26. Prediction success ratio
(figure: success ratio (%) vs. history length (1-16) for compress and forward_DCT)
27. Prediction success ratio
(figure: success ratio (%) vs. history length (1-16) for killtime and sweep)
28. Speed-up ratio
(figure: speed-up ratio vs. history length (1-16) for compress and forward_DCT; legend: "S 100" and "P1 only")
29. Speed-up ratio
(figure: speed-up ratio vs. history length (1-16) for killtime and sweep; legend: "S 100" and "P1 only")
30. Conclusions (Two-Path-Limited Speculative Multithreading)
- We proposed a path prediction method with predictors, and speculation methods, for path-based speculative multithreading.
- Preliminary performance estimation results were shown.
31. Current and future works
- Accurate and detailed evaluation for various applications -> SPEC 2000, MediaBench, ...
- Integration into our dynamic optimization framework YAWARA
32. Current dynamic optimization projects
- Computation-oriented
  - YAWARA: a meta-level optimizing computer system
  - HAGANE: binary-level multithreading
- Communication-oriented
  - Spec-All: an aggressive read/write access speculation method for DSM systems
  - A cross-line adaptive router using dynamic information
33. HAGANE: Binary-Level Multithreading
34. Background
- Multithread programming is not so easy. -> An automatic multithreading system
- However, source codes are not always available. -> Multithreading at the binary level
35. Binary Translator / Optimizer System
(figure: the source binary code, an execution profile, and analysis info feed the STO (Static Translator/Optimizer), which emits statically translated multithreaded binary code; the DTO (Dynamic Translator/Optimizer) rewrites the process memory image into dynamically translated multithreaded binary code; the multithread processor runs the result and feeds execution profile info back)
36. Thread Pipelining Model
- Loop iterations are mapped onto threads (thread i, thread i+1, thread i+2, ...).
- TSAG: Target Store Address Generation
37. Example translation

Source binary code (register names reconstructed from the numeric encodings, e.g. v13 = $v1/$3):

    mtc1  $zero, $f4
    addu  $v1, $zero, $zero
BB1:
    l.s   $f0, 0($a0)
    l.s   $f2, 0($a1)
    mul.s $f0, $f0, $f2
    addiu $v1, $v1, 1
    add.s $f4, $f4, $f0
    slti  $v0, $v1, 5000
    addiu $a1, $a1, 4
    addiu $a0, $a0, 4
    bne   $v0, $zero, BB1
BB2:
    mov.s $f0, $f4
    jr    $ra

Translated code, in the stage order Cont., TSAG, Comp., W.B.; bstr, lfrk, wtsagd, altsw, tsagd, sttsw, and estr are thread management instructions, i.e. the overhead code for multithreading:

    mtc1  $zero, $f4
    addu  $v1, $zero, $zero
    bstr
    slti  $v0, $v1, 5000
    beq   $v0, $zero, ST_LL0
    addu  $t0, $a0, $zero
    addu  $t1, $a1, $zero
    addi  $v1, $v1, 1
    addi  $a0, $a0, 4
    addi  $a1, $a1, 4
    lfrk
    wtsagd
    addu  $t2, $sp, $zero
    altsw $t2
    tsagd
    l.s   $f0, 0($t0)
    l.s   $f2, 0($t1)
    l.s   $f4, 0($t2)
    mul.s $f0, $f0, $f2
    add.s $f4, $f4, $f0
    sttsw $t2, $f4
ST_LL0:
    estr
    mov.s $f0, $f4
    jr    $ra
38. Superthreaded Architecture
(figure: an L1 instruction cache feeds multiple Thread Processing Units; each unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit, and all units share an L1 data cache)
39. m88ksim (SPECint95)
- Poor speedup ratios
- Loop unrolling does not affect the performance.
- The number of iterations is quite small.
40. ijpeg (SPECint95)
- The thread code size is too small to hide the thread management overhead.
- Loop unrolling is effective to achieve good speedup ratios.
- Excessive loop unrolling causes performance degradation.
- The number of iterations is not so large.
41. swim (SPECfp95)
- Good speedup ratios
- Loop unrolling is effective to achieve linear speedup.
- The number of iterations is large.
42. Conclusion (HAGANE)
- We have evaluated binary-level multithreading using some SPEC95 benchmark programs.
- The performance evaluation results indicate:
  - The thread code size should be large enough to improve the performance.
  - Loop unrolling is effective for small loop bodies.
  - Excessive loop unrolling degrades performance.
43. A Methodology of Binary-Level Variable Analysis for Multithreading
(HAGANE at PDCS 2004)
44. Background and Objective
- Usually, loop iterations are interrelated through memory variables, such as induction variables. However, it is difficult to analyze this kind of dependency at the binary level.
- A binary-level variable analysis method is strongly required for binary-level multithreading.
45. Example Binary Code

Source C loop:

    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }

Corresponding binary code (register names reconstructed from the numeric encodings, e.g. a15 = $a1/$5; note the memory references -4($v1) for a[i-1] and 0($a0) for a[i], and the induction variable i kept in memory at 16($s8)):

    lw    $a1, 16($s8)
    lw    $v1, 16($s8)
    lw    $a0, 16($s8)
    sll   $v1, $v1, 0x2
    addu  $v1, $v1, $a2
    lw    $v0, 16($s8)
    lw    $v1, -4($v1)
    addiu $v0, $v0, 1
    sw    $v0, 16($s8)
    lw    $v0, 16($s8)
    sll   $a1, $a1, 0x1
    sll   $a0, $a0, 0x2
    sll   $v0, $v1, 0x1
    addu  $v0, $v0, $v1
    lw    $v1, 16($s8)
    addu  $a0, $a0, $a2
    addu  $a1, $a1, $v0
    sw    $a1, 0($a0)
    slt   $v1, $v1, $a3
46. Binary-Level Variable Analysis
- (1) Register values are analyzed using dataflow trees.
- (2) When register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
- (3) Using the virtual registers, steps (1) and (2) are repeated.
47Construction of Dataflow Tree
- addiu 291, 290, -8
- sw 0, 0(291)
- addu 51, 0, 0
- lw 21, 0(291)
- addu 31, 51, 40
- addiu 52, 51, 1
- addu 22, 21, 31
- sw 22, 0(291)
- slti 23, 52, 100
- bne 23, 0, L1
48. Example Normalization
(figure only)
49. Detection of Loop Induction Variables
- A loop induction variable is a register which:
  - has an inter-iteration dependency, and
  - increases by a fixed value between iterations.
- The concept of the virtual register makes it possible to detect induction variables on memory.
50. Application
- 101.tomcatv of the SPECfp95 benchmark
- Fortran-to-C translator ver. 19940927
- GCC cross compiler ver. 2.7.2.3 for SIMCA
- Data set: test
- The six innermost loops (1-6) are selected; they have induction variables on memory.
51. Speedup Ratios
(figure only)
52. Conclusion (Binary-Level Variable Analysis)
- We proposed a binary-level variable analysis method.
- This method makes it possible to detect induction variables and their increment/decrement values.
- The detected information allows us to multithread binary codes that could not be multithreaded without our algorithm.
- We attained up to 9.8 speedup by the multithreading.
53. Summary
- Dynamic optimization projects at our laboratory
- The results show the performance improvement quantitatively in each project.
54. What's the next step of computer architecture research?
- From performance to reliability? Or low power?
  - e.g. dependable computing
- Architecture for new device technologies?
  - e.g. quantum computing
- However, if we stick to conventional high-performance computing research, what's the promising way?