Title: Instruction and Data Address Trace Compression
1. Instruction and Data Address Trace Compression
- Aleksandar Milenkovic
- (collaborative work with Milena Milenkovic and Martin Burtscher)
- Electrical and Computer Engineering Department
- The University of Alabama in Huntsville
- Email: milenka_at_ece.uah.edu
- Web: http://www.ece.uah.edu/milenka
- http://www.ece.uah.edu/lacasa
2. Outline
- Program Execution Traces
- Trace Compression
- Trace Compression in Hardware
- Stream caches and predictors for instruction address trace compression
- Data address stride caches for data address trace compression
- Results
- Conclusions
3. Program Execution Traces
- Streams of recorded events
- Basic block traces
- Address traces
- Instruction words
- Operands
- Trace uses
- Computer architects for evaluation of new architectures
- Computer analysts for workload characterization
- Software developers for program tuning, optimization, and debugging
4. Instruction and Data Address Traces: An Example
Source loop:
    for (i = 0; i < 100; i++) {
        c[i] = s*a[i] + b[i];
        sum = sum + c[i];
    }
Dinero execution trace (columns: Type, Instruction Address, Data Address):
    2 0x020001f4
    0 0x020001f8 0xbfffbe24
    0 0x020001fc 0xbfffbc94
    2 0x02000200
    2 0x02000204
    2 0x02000208
    2 0x0200020c
    1 0x02000210 0xbfffbb04
    2 0x02000214
Disassembly of the loop body:
    @ 0x020001f4  mov r1,r12, lsl #2
    @ 0x020001f8  ldr r2,[r4, r1]
    @ 0x020001fc  ldr r3,[r14, r1]
    @ 0x02000200  mla r0,r2,r8,r3
    @ 0x02000204  add r12,r12,#1 (1 >>> 0)
    @ 0x02000208  cmp r12,#99 (99 >>> 0)
    @ 0x0200020c  add r6,r6,r0
    @ 0x02000210  str r0,[r5, r1]
    @ 0x02000214  ble 0x20001f4
5. Trace Issues
- Trace issues
- Capture
- Compression
- Processing
- Traces tend to be very large
- In terabytes for a minute of program execution
- Expensive to store, transfer, and use
- Effective reduction techniques
- Lossless
- High compression ratio
- Fast decompression
7. Trace Compression
- General purpose compression algorithms
- Ziv-Lempel (gzip)
- Burrows-Wheeler transform (bzip2)
- Sequitur
- Trace specific compression techniques
- Tuned to exploit redundancy in traces
- Better compression, faster, can be further combined with general-purpose compression algorithms
8. Trace-Specific Compression Techniques
[Figure: taxonomy of lossless trace compression techniques, split into instruction traces and instruction + data traces. The grouping lines of the diagram are lost; its labels and citations follow in extraction order.]
- Link data addresses to dynamic basic block
- Offset
- Mache (Samples 1989), LBTC (Luo and John 2004)
- Replacing an execution sequence with its identifier
- Pleszkun 1994, SBC (Milenkovic and Milenkovic 2003)
- Offset + repetitions
- Acyclic path: WPP (Larus 1999), Time Stamped WPP (Zhang and Gupta 2001)
- N-tuple (Milenkovic, Milenkovic and Kulick 2003)
- Instruction: PDI (Johnson, Ha and Zaidi 2001)
- Control flow graph + trace of transitions
- PDATS (Johnson, Ha and Zaidi 2001)
- Link data addresses to loop
- QPT (Larus 1993)
- Elnozahy 1999, SIGMA (DeRose et al. 2002)
- Regenerate addresses
- Abstract execution
- Value predictor
- Graph with number of repetitions in nodes
- VPC (Burtscher and Jeeradit 2003), TCGEN (Burtscher and Sam 2005)
- Eggers et al. 1990, Larus 1993
- Hamou-Lhadj and Lethbridge 2002
10. Why Trace Compression in Hardware?
- Problem 1: capturing program traces
- In software: trap after each instruction or taken branch
- E.g., IBM's Performance Inspector
- Slowdown > 100 times
- Multiple cores on a single chip: more detailed information needed (e.g., time stamps of events)
- Problem 2: debugging is far from fun
- Stop execution on breakpoints, examine the state
- Time-consuming, difficult, may miss a critical state leading to erroneous behavior
- Stopping the CPU may perturb the sequence of events, making your bugs disappear
- => Need an unobtrusive real-time tracing mechanism
11. Trace Compression in Hardware
- Goals
- Small on-chip area and small number of pins
- Real-time compression (never stall the processor)
- Achieve a good compression ratio
- Solution
- A set of compression algorithms targeting on-the-fly compression of instruction and data address traces
12. Exploiting Streams and Strides
- Instruction address trace compression
- Limited number and strong temporal locality of instruction streams
- => Replace an instruction stream with its identifier
- Data address trace compression
- Spatial and temporal locality of data addresses
- => Recognize regular strides
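The instruction-side idea rests on first cutting the program counter sequence into streams. A minimal sketch of such a stream detector follows, assuming a stream is a run of sequentially executed instructions described by its start address and length (the deck's example descriptor is (0x020001f4, 0x09)). The function name `detect_streams`, the fixed 4-byte instruction size, and the 255-instruction cap are illustrative assumptions, not details taken from the slides.

```python
def detect_streams(pcs, insn_size=4, max_len=255):
    """Split a sequence of instruction addresses into (start, length) streams.

    Assumption: a stream ends when the next PC is not sequential
    (e.g., after a taken branch) or when the length would exceed max_len.
    """
    streams = []
    start, length = None, 0
    for pc in pcs:
        sequential = start is not None and pc == start + length * insn_size
        if sequential and length < max_len:
            length += 1
        else:
            if start is not None:
                streams.append((start, length))
            start, length = pc, 1
    if start is not None:
        streams.append((start, length))
    return streams


# Three iterations of the 9-instruction loop body from the example slide.
loop_body = [0x020001F4 + 4 * k for k in range(9)]
streams = detect_streams(loop_body * 3)
print([(hex(sa), sl) for sa, sl in streams])
```

Every iteration of the loop collapses to the same descriptor, which is the repetition the stream cache is meant to exploit.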
13. Trace Compressor System Overview
[Figure: block diagram. The system under test (processor core and memory) supplies the program counter, data address, and task switch signals to the trace compressor. The trace compressor contains the Stream Cache (SC), the Data Address Buffer, the Data Address Stride Cache (DASC), a predictor, byte repetition FSMs, and the Trace Output Controller. Internal traces are labeled SCIT, SCMT, DT, and DMT. The Trace Output Controller drives the trace port to an external trace unit for storing/processing (PC or intelligent drive).]
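15. Stream Detector + Stream Cache
[Figure: stream cache (SC) lookup. A stream record (S.SA, S.L) from the instruction stream buffer is mapped by F(S.SA, S.SL) to a set iSet in 0..NSET-1 and compared with the entries (ways 0..NWAY-1) of that set. On a hit the index (iSet, iWay) goes to the Stream Cache Index Trace (SCIT). On a miss the reserved index 0 goes to SCIT and the descriptor (SA, SL) goes to the Stream Cache Miss Trace (SCMT).]
Example from the figure, for stream (0x020001f4, 0x09):
- SCIT: 0x00 (it. 0), 0x0E (it. 1), ..., 0x0E (it. 99)
- SCMT: (0x020001f4, 0x09)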
16. SC Itrace Compression
Compress instruction stream:
- Get the next instruction stream record (S.SA, S.SL) from the instruction stream buffer
- Look up the stream cache with iSet = F(S.SA, S.SL)
- if (hit)
  - Emit (iSet, iWay) to SCIT
- else
  - Emit reserved value 0 to SCIT
  - Emit stream descriptor (S.SA, S.SL) to SCMT
  - Select an entry (iWay) in set iSet to be replaced
  - Update stream cache entry: SC[iSet][iWay].Valid = 1, SC[iSet][iWay].SA = S.SA, SC[iSet][iWay].SL = S.SL
- Update stream cache replacement indicators
Design Decisions
- Stream cache
- Size
- Associativity
- Replacement policy
- Mapping function
- Instruction Stream Buffer size
- Large enough not to stall the processor (e.g., on consecutive very short instruction streams)
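The steps above can be sketched in software as follows. This is a minimal model under stated assumptions, not the authors' hardware: the class name `StreamCache`, the 64 x 4 geometry (taken from the complexity table later in the deck), FIFO replacement, and the choice to keep way 0 of set 0 unused so that index 0 can mean "miss" are all illustrative. The mapping function is the one listed on the findings slide, S.SA<5+n:6> xor S.L<n-1:0> with n = log2(NSET).

```python
class StreamCache:
    """Software model of stream-cache compression (assumed geometry and policy)."""

    def __init__(self, nset=64, nway=4):
        assert nway >= 2 and nset & (nset - 1) == 0
        self.nset, self.nway = nset, nway
        self.entries = [[None] * nway for _ in range(nset)]
        self.next_victim = [0] * nset   # FIFO replacement pointer per set
        self.next_victim[0] = 1         # assumption: (set 0, way 0) unused, index 0 = miss
        self.scit = []                  # stream cache index trace
        self.scmt = []                  # stream cache miss trace

    def set_index(self, sa, sl):
        n = self.nset.bit_length() - 1  # n = log2(NSET)
        return ((sa >> 6) ^ sl) & ((1 << n) - 1)

    def compress(self, sa, sl):
        """Process one stream descriptor; return True on a hit."""
        iset = self.set_index(sa, sl)
        ways = self.entries[iset]
        if (sa, sl) in ways:
            self.scit.append(iset * self.nway + ways.index((sa, sl)))
            return True
        self.scit.append(0)             # reserved value signals a miss
        self.scmt.append((sa, sl))      # full descriptor goes to the miss trace
        iway = self.next_victim[iset]
        ways[iway] = (sa, sl)
        nxt = (iway + 1) % self.nway
        if iset == 0 and nxt == 0:      # skip the reserved entry
            nxt = 1
        self.next_victim[iset] = nxt
        return False


sc = StreamCache()
for _ in range(100):                    # 100 iterations of the example loop
    sc.compress(0x020001F4, 0x09)
print(hex(sc.set_index(0x020001F4, 0x09)), len(sc.scit), sc.scmt)
```

For the example stream the mapping function yields set 0x0E, the value that appears in the stream cache figure. One miss is followed by 99 hits, so the 100 streams cost 100 one-byte indices plus a single 5-byte descriptor.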
17. SC Itrace Compression: An Analytical Model
- Legend
- CR(SC.I): compression ratio
- N: number of instructions
- SL.Dyn: average stream length (dynamic)
- SC.Hit(Nset, Nway): SC hit rate
- Assumptions
- Stream length < 256 (1 byte for SL)
- 4 bytes for stream starting address
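The formula itself did not survive in this transcript. One reconstruction that is consistent with the legend and assumptions is sketched below; it additionally assumes that the uncompressed trace stores a 4-byte address per instruction and that each SCIT entry takes log2(Nset x Nway) bits. It may differ in form from the expression on the original slide.

```latex
\begin{aligned}
\mathrm{Size}(\text{uncompressed}) &= 4N \\
\mathrm{Size}(\mathrm{SCIT}) &= \frac{N}{SL.Dyn}\cdot\frac{\log_2(N_{set}\,N_{way})}{8} \\
\mathrm{Size}(\mathrm{SCMT}) &= \frac{N}{SL.Dyn}\cdot\bigl(1 - SC.Hit\bigr)\cdot(4 + 1) \\
CR(SC.I) &= \frac{4N}{\mathrm{Size}(\mathrm{SCIT}) + \mathrm{Size}(\mathrm{SCMT})}
          = \frac{4\,SL.Dyn}{\dfrac{\log_2(N_{set}\,N_{way})}{8} + 5\,\bigl(1 - SC.Hit\bigr)}
\end{aligned}
```

As a sanity check, an 8-bit index and a 98% hit rate give 1 byte of SCIT and 0.02 x 5 = 0.1 byte of SCMT per stream, which is the factor of 10 quoted on the 2nd-level compression slide.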
18. 2nd Level Itrace Compression
- Size(SCIT) >> Size(SCMT)
- HitRate = 98%, 8-bit index => Size(SCIT) = 10 x Size(SCMT)
- Redundancy in SCIT
- Temporal and spatial locality of instruction streams
- Reduce SCIT trace
- Global Predictor
- N-tuple compression using Tuple History Table
- N-tuple compression using SCIT History Buffer
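19. Global Predictor Structure
[Figure: global predictor. The incoming index next.sid from the SCIT trace is shifted into the History Buffer; a function F of the history produces pindex, which selects one of the predictor entries 0..MaxP-1. The selected entry is compared with next.sid; the hit/miss bit (1/0) goes to the SCIT PRED Trace, and on a miss next.sid goes to the SCIT PRED Miss Trace.]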
20. SCIT Compression
Predict SCIT index:
- Get the incoming index, next.sid, from the SCIT trace
- Calculate the SCIT predictor index, pindex, using the indices in the History Buffer: pindex = F(indices in the History Buffer)
- Perform a lookup in the SCIT Predictor with pindex
- if (SCIT.Predictor[pindex] == next.sid)
  - Emit 1 to SCIT PRED trace
- else
  - Emit 0 to SCIT PRED trace
  - Emit next.sid to SCIT PRED Miss trace
  - SCIT.Predictor[pindex] = next.sid
- Shift next.sid into the History Buffer
Design Decisions
- Global predictor
- Size
- Mapping function
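A minimal sketch of this predictor, paired with a matching decompressor to show that the scheme is lossless. It assumes a history length of 1 (the predictor is indexed by the previous SCIT index, as the findings slide recommends), 256 entries, and an all-zero initial state; the function names are illustrative.

```python
def gpred_compress(scit, entries=256):
    """Return (pred_bits, miss_trace) for a sequence of SCIT indices."""
    predictor = [0] * entries
    history = 0                          # history buffer of length 1
    pred_bits, miss_trace = [], []
    for sid in scit:
        pindex = history % entries       # assumed F: previous index, modulo table size
        if predictor[pindex] == sid:
            pred_bits.append(1)          # hit: one bit is enough
        else:
            pred_bits.append(0)
            miss_trace.append(sid)       # miss: emit the index itself
            predictor[pindex] = sid
        history = sid
    return pred_bits, miss_trace


def gpred_decompress(pred_bits, miss_trace, entries=256):
    """Rebuild the SCIT indices by replaying the same predictor updates."""
    predictor = [0] * entries
    history = 0
    misses = iter(miss_trace)
    out = []
    for bit in pred_bits:
        pindex = history % entries
        if bit:
            sid = predictor[pindex]
        else:
            sid = next(misses)
            predictor[pindex] = sid
        out.append(sid)
        history = sid
    return out


scit = [0] + [0x38] * 99                 # one miss marker followed by a repeated index
bits, misses = gpred_compress(scit)
print(sum(bits), misses)
```

The compressor and decompressor stay in step because both update the table only on a miss and both shift the same index into the history.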
21. Redundancy in SCIT Pred Trace
- High predictor hit rates and long runs of 0xFF bytes are expected in the Predictor Hit Trace
- Use a simple FSM to exploit byte repetitions
// Detect byte repetitions in SCIT pred
Get next SCIT Pred byte, Next.BYTE
if (Next.BYTE == Prev.BYTE) CNT++
else
    if (CNT == 0)
        Emit Prev.BYTE to SCIT.REP.Trace
        Emit 0 to SCIT Header
    else
        Emit (Prev.BYTE, CNT) pair to SCIT.REP.Trace
        Emit 1 to SCIT Header
    Prev.BYTE = Next.BYTE
    CNT = 0    // reset the run counter for the new byte
[Figure: FSM datapath. PRED Hit Trace bytes are compared with Prev.BYTE; CNT counts repetitions; the outputs are the SCIT PRED Repetition Trace and the SCIT PRED Header.]
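The byte repetition FSM can be sketched as an encoder/decoder pair. The sketch assumes that CNT counts the extra repetitions of a byte, that CNT is stored in one byte (so runs are split at 255 repetitions), and that the counter is reset whenever a new byte starts; none of these details is spelled out on the slide, and the function names are illustrative.

```python
def _emit(prev, cnt, rep_trace, header):
    """Emit one record: a single byte (header 0) or a (byte, count) pair (header 1)."""
    if cnt == 0:
        rep_trace.append(prev)
        header.append(0)
    else:
        rep_trace += bytes([prev, cnt])
        header.append(1)


def brep_encode(data):
    """Run-length encode a byte string into (repetition trace, header bits)."""
    rep_trace, header = bytearray(), []
    prev, cnt = None, 0
    for byte in data:
        if byte == prev and cnt < 255:   # assumption: CNT fits in one byte
            cnt += 1
        else:
            if prev is not None:
                _emit(prev, cnt, rep_trace, header)
            prev, cnt = byte, 0
    if prev is not None:                 # flush the last run at end of trace
        _emit(prev, cnt, rep_trace, header)
    return bytes(rep_trace), header


def brep_decode(rep_trace, header):
    """Inverse of brep_encode."""
    out, pos = bytearray(), 0
    for flag in header:
        if flag:
            out += bytes([rep_trace[pos]]) * (rep_trace[pos + 1] + 1)
            pos += 2
        else:
            out.append(rep_trace[pos])
            pos += 1
    return bytes(out)


pred_trace = b"\xff" * 300 + b"\x7f" + b"\xff\xff"
rep, hdr = brep_encode(pred_trace)
print(len(pred_trace), len(rep), hdr)
```

A long run of 0xFF bytes, which is what a high predictor hit rate produces, shrinks to a handful of (byte, count) pairs plus one header bit per record.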
23. Data Address Trace Compression
- More challenging task
- Data addresses rarely stay constant during program execution
- However, they often have a regular stride
- => Use Data Address Stride Cache (DASC) to exploit locality of memory referencing instructions and regularity in data address strides
24. Data Address Stride Cache
- DASC
- Tagless structure
- Indexed by PC of the corresponding instruction
- Entry fields
- LDA: Last Data Address
- Stride
[Figure: DASC lookup. G(PC) maps the PC (0x020001f8 in the example) to an entry index in 0..N-1. The entry's LDA is subtracted from the incoming data address DA, and DA - LDA is compared with the stored Stride to produce Stride.Hit. The hit bit (1 or 0) goes to DT (Data Trace); on a miss DA also goes to DMT (Data Miss Trace). Example address values shown: 0xbfffbe24, 0xbfffbe20, 0xbfffbe1c.]
25. DASC Compression
// Compress data address stream
- Get the next pair (PC, DA) from the data buffers
- Look up the data address stride cache with iSet = G(PC)
- cStride = DA - DASC[iSet].LDA
- if (cStride == DASC[iSet].Stride)
  - Emit 1 to DT // 1-bit info
- else
  - Emit 0 to DT
  - Emit DA to DMT
  - DASC[iSet].Stride = lsb(cStride)
- DASC[iSet].LDA = DA
Design Decisions
- Number of entries
- Index function G
- Stride length
- Data address buffer depth
26. DASC Dtrace Compression: An Analytical Model
- Legend
- CR(SC.D): compression ratio
- Nmemref: number of memory referencing instructions
- DASC.Hit: DASC hit rate
- Assumptions
- 4 bytes for stream starting address
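The formula for this slide is also missing from the transcript. A reconstruction in the same spirit as the instruction-side model, assuming 4-byte data addresses in both the uncompressed trace and the DMT and one bit per memory reference in DT, would be the following; it is a sketch under those assumptions rather than the slide's own expression.

```latex
\begin{aligned}
\mathrm{Size}(\text{uncompressed}) &= 4\,N_{memref} \\
\mathrm{Size}(\mathrm{DT}) &= \frac{N_{memref}}{8} \\
\mathrm{Size}(\mathrm{DMT}) &= 4\,N_{memref}\,\bigl(1 - DASC.Hit\bigr) \\
CR(SC.D) &= \frac{4\,N_{memref}}{\mathrm{Size}(\mathrm{DT}) + \mathrm{Size}(\mathrm{DMT})}
          = \frac{4}{\dfrac{1}{8} + 4\,\bigl(1 - DASC.Hit\bigr)}
\end{aligned}
```

Under these assumptions the ratio is bounded by 32 (a perfect hit rate leaves only the 1-bit DT entries), and each percentage point of misses adds 0.04 byte per reference.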
27. Redundancy in DT Trace
- High predictor hit rates and long runs of 0xFF bytes are expected in the DT Trace
- Use a simple FSM to exploit byte repetitions
// Detect data repetitions
Get next DT byte
if (DT == Prev.DT) CNT++
else
    if (CNT == 0)
        Emit Prev.DT to DRT
        Emit 0 to DH
    else
        Emit (Prev.DT, CNT) pair to DRT
        Emit 1 to DH
    Prev.DT = DT
    CNT = 0    // reset the run counter for the new byte
[Figure: FSM datapath. DT bytes are compared with Prev.DT; CNT counts repetitions; the outputs are the Data Header (DH) and the Data Repetition Trace (DRT).]
29. Experimental Evaluation
- Goals
- Assess the effectiveness of the proposed algorithms
- Explore the feasibility of the proposed hardware implementations
- Determine optimal size and organization of HW structures
- Workload
- 16 MiBench benchmarks
- ARM architecture
- Legend
- IC: Instruction count
- NUS: Number of unique instruction streams
- maxSL: Maximum stream length
- SL.Dyn: Average stream length (dynamic)
30. Findings about SC Size/Organization
- Good compression ratio
- Outperforms fast GZIP
- High stream cache hit rates for all applications (> 98%)
- Smaller SCs work well too
- Replacement policy
- Pseudo-LRU vs. FIFO
- Associativity
- 4-way is a reasonable choice
- 8-way and 16-way desirable
- Mapping function
- S.SA<5+n:6> xor S.L<n-1:0>, n = log2(NSET)
31. Findings about Global Predictor
- Number of entries should not exceed the number of entries in SC
- Longer histories and larger predictors give only marginal improvements for all applications except ghostscript, blowfish, and stringsearch
- History length = 1
- Index GPRED using the previous SCIT index
32. Putting It All Together (SC+GPRED+BREP): Itrace Compression
33. Findings about DASC
- Stride size
- 1 byte is optimal
- 2 byte stride improves compression for ? 10
- DASC with 1K entries is an optimal choice
- Tagged (multi-way) DASC further improves overall compression ratio
- Increased complexity
34. DASC Compression Ratio
35. Hardware Complexity Estimation
Component                        Entries   Complexity    Bytes
Instruction stream buffer        2         2x5           10
Stream detector                  2         2x4           8
Stream cache                     64x4      256x5         1280
Global predictor                 256       256 + 1(h)    257
Data address buffer              8         8x8           64
Data address stride cache        1024      1024x5        5120
Byte repetition state machines   -         4             4
- CPU model
- In-order, XScale-like
- Vary SC and DASC parameters
- SC and DASC timings
- SC: hit latency 1 clock, miss latency 2 clocks
- DASC: hit latency 2 clocks, miss latency 2 clocks
- To avoid any stalls
- Instruction stream input buffer: MIN 2 entries
- Data address input buffer: MIN 8 entries
- Results are relatively independent of SC and DASC organization
36. Trace Port Bandwidth Analysis
38. Conclusions
- A set of algorithms and hardware structures for instruction and data address trace compression
- Stream caches + global predictor + byte repetition FSM for instruction traces
- Data address stride cache + byte repetition FSM for data traces
- Benefits
- Enabling real-time trace compression with high compression ratio
- Low complexity (small structures, small number of external pins)
- Analytical and simulation analysis focusing on compression ratio and optimal sizing/organization of the structures, as well as real-time trace port bandwidth requirements
39. Laboratory for Advanced Computer Architectures and Systems at Alabama: Research Overview
- Aleksandar Milenkovic
- The LaCASA Laboratory
- Electrical and Computer Engineering Department
- The University of Alabama in Huntsville
- Email: milenka_at_ece.uah.edu
- Web: http://www.ece.uah.edu/milenka
- http://www.ece.uah.edu/lacasa
40. Secure Processors
- Computer security is critical: software and physical attacks
- Sign & Verify for guaranteed integrity and confidentiality of code
- Improvements
- PMAC (Parallel MACs) for reduced cryptographic latency
- A variation of the one-time pad for code encryption
- Instruction Verification Buffer for conditional execution before verification
http://www.ece.uah.edu/lacasa/research.htm#secure_processors
41. Microbenchmarks for Architectural Analysis
- Small programs for uncovering architectural parameters (usually not publicly disclosed) of modern processors
- Relatively simple, so their behavior can be understood
- Benefits
- Architecture-aware compiler optimization
- Processor design evaluation and verification
- Testing
- Competitive analysis
- Results
- Microbenchmarks for BTB analysis
- Experimental flow for outcome predictor
- Tested on P6 and NetBurst (Northwood core)
- Challenge
- Dothan (Pentium M) predictor
[Figure: microbenchmarks read branch-related events from performance counters to uncover BTB size, organization, and indexing, and the outcome predictor's local and global history.]
http://www.ece.uah.edu/lacasa/bp_mbs/bp_microbench.htm
42. TinyHMS
[Figure panels: Concept, Prototype, Software]
http://www.ece.uah.edu/lacasa/research.htm#tinyHMS
43. TinyHMS