Project: Phase 1 Grading

1
Project Phase 1 Grading

Default Statistics (40 points)
Values and Charts (30 points)
Analyses (10 points)
Branch Predictor Statistics (30 points)
Values and Charts (25 points)
Analyses (5 points)
L2 cache Replacement Statistics (30 points)
Values and Charts (30 points)

2
Default Statistics Analyses

CPI is affected by
Percentage of branches and predictability of branches
Cache hit rates
Parallelism inherent in programs
CPI of cc and go is higher than the others
Larger percentage of tough-to-predict branches
cc: 17% branches, about 12% of which are mispredicted
go: 13% branches, about 20% of which are mispredicted
CPI of cc is higher than that of go
L1 miss rate of cc (2.6%) is higher than that of go (0.6%)

3
Default Statistics Analyses

Compress has high miss rates
Short execution run, so compulsory misses dominate
L2 miss rate of anagram is high
Very few L2 accesses, so compulsory misses dominate
Program-based analyses
Gcc has a lot of branches
Go program has a small memory footprint
Anagram is a simple program
Compress input file is only 20 bytes
Note: all are integer programs
CPI < 1: multiple issue, out of order

4
Branch Predictor Statistics

Perfect > Bimodal > taken/not-taken
Variation across benchmarks (2 points)
Go and cc show the greatest variation
They have a significant number of tough-to-predict branches.

5
L2 replacement policies

No great change in miss-rate or CPI
30 points for the values and plots
L1 cache was big so very few L2 accesses
Associativity of L2 cache was small
LRU > FIFO > Random

6
Distribution

[Chart: distribution of grades; surviving axis labels: 90, 100]

7
Phase 2: Profile-guided optimization

Profiling Run
Run unoptimized code with sample inputs
Instrument code to collect information about the run
Call graph frequencies
Basic block frequencies
Recompile
Use collected information to produce better code
Inlining
Put hot code together to improve I

8
Phase 2: Compiler branch hints

if (error) // not-taken
Compiler provides hints about branches (taken/not-taken) using profile information
In this question
Learn to use the simulator as a profiler
Learn to estimate the benefits of optimizations.

9
Example

Simple loop
1000
1004
// mostly not taken
1008 jz 1020
1012 jmp 1000
For each branch, mark taken or not-taken
Taken > 50% of the time: mark taken
Not-taken > 50% of the time: mark not-taken
In the above example
1008 not-taken
1032 not-taken
1064 taken

10
Profiling Run

For each static branch instruction
Collect execution frequency
Percentage taken/not-taken
Modify the bpred_update function in bpred.c
Maintain a data structure for each branch instruction, indexed by instruction address
Maintain frequency and taken information
Dump this information at the end.
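The bookkeeping described on this slide could look roughly like the sketch below. It is a minimal, standalone hash table keyed by branch address, not SimpleScalar code: the names prof_table, prof_lookup, prof_record and prof_dump, the table size, the 32-bit address type and the shift used for hashing are all assumptions made for illustration. The idea would be to call prof_record() with the branch address and taken flag that bpred_update sees, and to call prof_dump() once when the simulation finishes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PROF_BUCKETS 4096  /* hypothetical size; must be a power of two */

/* One record per static branch instruction. */
struct branch_prof {
    uint32_t addr;             /* branch instruction address (the key) */
    uint64_t executed;         /* how many times the branch ran */
    uint64_t taken;            /* how many of those runs were taken */
    struct branch_prof *next;  /* chain for colliding addresses */
};

struct branch_prof *prof_table[PROF_BUCKETS];

/* Find the record for addr, creating it on first use. */
struct branch_prof *prof_lookup(uint32_t addr)
{
    /* Assumption: instructions are 8-byte aligned, so drop 3 low bits. */
    unsigned idx = (addr >> 3) & (PROF_BUCKETS - 1);
    struct branch_prof *p;

    for (p = prof_table[idx]; p != NULL; p = p->next)
        if (p->addr == addr)
            return p;

    p = calloc(1, sizeof *p);
    assert(p != NULL);
    p->addr = addr;
    p->next = prof_table[idx];
    prof_table[idx] = p;
    return p;
}

/* Call once per dynamic branch with its address and outcome. */
void prof_record(uint32_t addr, int taken)
{
    struct branch_prof *p = prof_lookup(addr);
    p->executed++;
    if (taken)
        p->taken++;
}

/* Dump "address executed taken" lines at the end of the run. */
void prof_dump(FILE *out)
{
    for (unsigned i = 0; i < PROF_BUCKETS; i++)
        for (struct branch_prof *p = prof_table[i]; p != NULL; p = p->next)
            fprintf(out, "0x%08x %llu %llu\n", (unsigned)p->addr,
                    (unsigned long long)p->executed,
                    (unsigned long long)p->taken);
}
```

A chained hash table keeps lookups cheap without needing to know the number of static branches in advance; a sorted array or a direct-mapped table would work as well.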

11
Analysis

From the information collected
If a branch is taken > 50% of the time, mark taken
Otherwise mark not-taken
Remember the instruction addresses and the hint.
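The 50% rule can be written as a single comparison. The sketch below assumes the profile supplies two counters per branch (times executed and times taken); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum branch_hint { HINT_NOT_TAKEN = 0, HINT_TAKEN = 1 };

/* Mark a branch taken only if it was taken more than 50% of the time. */
enum branch_hint hint_from_profile(uint64_t executed, uint64_t taken)
{
    assert(taken <= executed);
    /* taken / executed > 1/2, written without division so that a
       never-executed branch (executed == 0) falls back to not-taken. */
    return (2 * taken > executed) ? HINT_TAKEN : HINT_NOT_TAKEN;
}
```

A branch taken exactly half the time is marked not-taken here, matching the "otherwise not-taken" rule.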

12
Performance Estimation

For all branches
Predict taken/not-taken according to the hint
You may want to load all the hints into a data structure at the start.
Data structure similar to the one used for profiling.
Indexed by branch instruction address.
Estimate new CPI
Notes
Sufficient to do this for cc and anagram.
After modifying SimpleScalar, you need to run make!
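One way to get a rough number before rerunning the simulator is to count how often each hint would be wrong and convert the change in mispredictions into cycles. The sketch below is a first-order model and an assumption, not the method the slides prescribe: it charges a fixed penalty per misprediction, and the function names and parameters are hypothetical. Rerunning the simulator with a predictor that follows the hints gives the measured CPI.

```c
#include <assert.h>
#include <stdint.h>

/* Mispredictions a static hint causes for one branch:
   a "taken" hint misses every not-taken execution, and vice versa. */
uint64_t hint_mispredicts(uint64_t executed, uint64_t taken, int hint_taken)
{
    assert(taken <= executed);
    return hint_taken ? executed - taken : taken;
}

/* First-order CPI estimate (an assumption, not the simulator's model):
   each misprediction gained or saved costs penalty_cycles cycles. */
double estimate_cpi(double base_cpi, uint64_t insts,
                    uint64_t base_mispredicts, uint64_t hint_mispredicts_total,
                    double penalty_cycles)
{
    assert(insts > 0);
    double delta = (double)hint_mispredicts_total - (double)base_mispredicts;
    return base_cpi + delta * penalty_cycles / (double)insts;
}
```

Summing hint_mispredicts() over all profiled branches gives hint_mispredicts_total; the baseline misprediction count and CPI come from the original run.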

13
Phase 2: L2 replacement policy

LRU policy
Works well
HW complexity is high
Number of status bits to track when each block in a set was last accessed
This number increases with associativity.
PLRU
Pseudo-LRU policies
Simpler replacement policies that attempt to mimic LRU.

14
Tree based PLRU policy

For an n-way cache, there are n - 1 binary decision bits
Let us consider a 4-way set-associative cache
L0, L1, L2 and L3 are the blocks in the set
B0, B1 and B2 are decision bits
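A minimal sketch of the 4-way case follows. The bit convention is an assumption, since several equivalent ones exist: here B0 chooses between the pairs {L0, L1} and {L2, L3}, B1 chooses within {L0, L1}, B2 chooses within {L2, L3}, and each bit points toward the side to replace next. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Decision bits for one 4-way set. Assumed convention:
   0 means "the victim is on the left", 1 means "on the right". */
struct plru4 {
    unsigned b0 : 1;  /* {L0,L1} versus {L2,L3} */
    unsigned b1 : 1;  /* L0 versus L1 */
    unsigned b2 : 1;  /* L2 versus L3 */
};

/* On a hit or fill of `way`, point the bits on its path away from it. */
void plru4_touch(struct plru4 *s, unsigned way)
{
    assert(way < 4);
    if (way < 2) {
        s->b0 = 1;            /* victim now lies in the right pair */
        s->b1 = (way == 0);   /* within the left pair, point at the other block */
    } else {
        s->b0 = 0;            /* victim now lies in the left pair */
        s->b2 = (way == 2);   /* within the right pair, point at the other block */
    }
}

/* Follow the bits from the root to pick the block to replace. */
unsigned plru4_victim(const struct plru4 *s)
{
    if (s->b0 == 0)
        return s->b1 ? 1 : 0;
    return s->b2 ? 3 : 2;
}
```

With this convention, touching L0, L1, L2, L3 and then L0 again makes the tree choose L2 as the victim, whereas true LRU would choose L1. That gap is the price of keeping 3 bits per set instead of full ordering information.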

15
Tree based LRU for 4 way
16
Notes

Use a 4 KB direct-mapped L1 cache
Hopefully this should lead to L2 accesses!
Use a 16-way 256 KB L2 cache
Hopefully enough ways to make a difference!
Compare PLRU with LRU, FIFO and Random
Sufficient to do this experiment for cc and anagram!

17
Perfect Mem Disambiguation

Memory Disambiguation
Techniques employed by the processor to execute loads/stores out of order
Use a HW structure called the Load/Store Queue (LSQ)
Tracks addresses / values of loads and stores
A load can be issued from the LSQ
If there are no prior stores writing to the same address.
If a prior store's address is unknown, then the load can't be issued
Perfect Disambiguation
All addresses are known
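The issue rule above can be sketched as a scan over the stores that precede a load in program order. This is an illustrative model only: struct lsq_entry and load_can_issue are hypothetical names, not SimpleScalar's data structures, and the sketch compares whole addresses and ignores store-to-load forwarding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LSQ entry; not SimpleScalar's actual structure. */
struct lsq_entry {
    bool is_store;
    bool addr_known;   /* has the effective address been computed yet? */
    uint32_t addr;
};

/* Entries are in program order; lsq[load_idx] is the load under test.
   The load may issue only if every earlier store has a known address
   that differs from the load's address. */
bool load_can_issue(const struct lsq_entry *lsq, size_t load_idx)
{
    assert(!lsq[load_idx].is_store && lsq[load_idx].addr_known);
    for (size_t i = 0; i < load_idx; i++) {
        if (!lsq[i].is_store)
            continue;
        if (!lsq[i].addr_known)
            return false;   /* unknown address: might conflict */
        if (lsq[i].addr == lsq[load_idx].addr)
            return false;   /* prior store writes this address */
    }
    return true;
}
```

Under perfect disambiguation addr_known is true for every entry, so only real address conflicts hold a load back.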

18
How are addresses known

Two ways to do this
Trace based: run once, then collect and remember all the addresses
All register values are actually known to the simulator through functional simulation
Even though a register is yet to be computed, the simulator knows the value
Look at the lsq_refresh() function in sim-outorder.c
To give you the flexibility to do it either way
Simulate only a million instructions
Fast forward 100 million instructions

19
Mem Disambiguation

Compare CPI with and without perfect disambiguation
Sufficient to do this for cc and go
Fast forward (-fastfwd) 100 million instructions
Simulate for an additional 1 million instructions
