CS252 Graduate Computer Architecture, Lecture 12: Branch Prediction / Possible Projects

1
CS252 Graduate Computer Architecture
Lecture 12: Branch Prediction / Possible Projects

October 8th, 2003
Prof. John Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/courses/cs252-F03

2
CS252 Projects

Two People from this class
Projects can overlap with other classes
Exceptions to the two-person requirement need to
be OK'd
Amount of work: 3 solid weeks of work
Spread over the remainder of the term
Should be a miniature research project
State of the art (can't redo something that
others have done)
Should be publishable work
Must have solid methodology!
Elements:
Base architecture to measure against
Simulation or other analysis against some
application set
Several variations on a theme

3
CS252 Projects

DynaCOMP related (or Introspective Computing)
OceanStore related
Smart Dust/NEST
ROC Related Projects
BRASS project related
Benchmarking Related (Yelick)

4
DynaCOMP: Introspective Computing

Biological analogs for computer systems:
Continuous adaptation
Insensitivity to design flaws
Both hardware and software
Necessary if can never be sure that all
components are working properly
Examples:
ISTORE -- applies introspective computing to disk
storage
DynaComp -- applies introspective computing at
chip level
Compiler always running and part of execution!

Monitor
Compute
Adapt
5
DynaCOMP Vision Statement

Modern microprocessors gather profile information
in hardware in order to generate predictions:
branches, dependencies, and values.
Processors such as the Pentium-II employ a
primitive form of compilation to translate x86
operations into internal RISC-like micro-ops.
So, why not do all of this in software? Make use
of a combination of explicit monitoring, dynamic
compilation technology, and genetic algorithms
to:
Simplify hardware, possibly using large on-chip
multiprocessors built from simple processors.
Improve performance through feedback-driven
optimization: continuous execution, monitoring,
analysis, recompilation.
Generate design complexity automatically so that
designers are not required to. Use explicit
proof verification techniques to verify that code
generation is correct.
This is aptly called Introspective Computing.
Related idea: use of continuous observation to
reduce power on buses!

6
The Thermodynamic Analogy

Large Systems have a variety of latent order
Connections between elements
Mathematical structure (erasure coding, etc)
Distributions peaked about some desired behavior
Permits Stability through Statistics
Exploit the behavior of aggregates (redundancy)
Subject to Entropy
Servers/components fail, attacks happen, system
changes
Requires continuous repair
Apply energy (i.e. through servers) to reduce
entropy
Introspection restores distributions

7
ThermoSpective
Comp
Adapt
Monitor

Many Redundant Components (Fault Tolerance)
Continuous Repair (Entropy Reduction)
What about NanoComputing Domain?
How will you build reliable systems from
unreliable components?

8
OceanStore Vision
9
Ubiquitous Devices ⇒ Ubiquitous Storage

Consumers of data move, change from one device to
another, work in cafes, cars, airplanes, the
office, etc.
Properties REQUIRED for Endeavour storage
substrate:
Strong security: data must be encrypted whenever
in the infrastructure; resistance to monitoring
Coherence: too much data for naïve users to keep
coherent by hand
Automatic replica management and optimization:
huge quantities of data cannot be managed
manually
Simple and automatic recovery from disasters:
probability of failure increases with size of
system
Utility model: world-scale system requires
cooperation across administrative boundaries

10
Utility-based Infrastructure
Canadian OceanStore
Sprint
AT&T
IBM
Pac Bell
IBM

Service provided by confederation of companies
Monthly fee paid to one service provider
Companies buy and sell capacity from each other

11
Preliminary Smart Dust Mote
Brett Warneke, Bryan Atwood, Kristofer Pister
Berkeley Sensor and Actuator Center
Dept. of Electrical Engineering and Computer Sciences
University of California, Berkeley
12
Smart Dust
1-2mm
13
COTS Dust

GOAL
Get our feet wet
RESULT
Cheap, easy, off-the-shelf RF systems
Fantastic interest in cheap, easy, RF
Industry
Berkeley Wireless Research Center
Center for the Built Environment (IUCRC)
PC Enabled Toys (Intel)
Endeavor Project (UCB)
Optical proof of concept

14
Smart Dust/Micro Server Projects

David Culler and Kris Pister collaborating
What is the proper operating system for devices
of this nature?
Linux or Windows is not appropriate!
State machine execution model is much simpler!
Assume that little device is backed by servers in
net.
Questions of hardware/software tradeoffs
What is the high-level organization of zillions
of dust motes in the infrastructure???
What type of computational/communication ability
provides the right tradeoff between functionality
and power consumption???

15
A glimpse into the future?

System-on-a-chip enables computer, memory,
redundant network interfaces without
significantly increasing size of disk
ISTORE HW in 5-7 years

2006 brick: System-on-a-chip integrated with
MicroDrive
9GB disk, 50 MB/sec from disk
connected via crossbar switch
From brick to domino
If low power, 10,000 nodes fit into one rack!
O(10,000) scale is our ultimate design point

16
ROC vision: Storage System of the Future

Availability, Maintainability, and Evolutionary
growth: key challenges for storage systems
Maintenance cost > 10X purchase cost per year;
even 2X purchase cost for 1/2 maintenance cost
wins
AME improvement enables even larger systems
ISTORE has cost-performance advantages
Better space, power/cooling costs (@ colocation
site)
More MIPS, cheaper MIPS, no bus bottlenecks
Compression reduces network $, encryption
protects
Single interconnect, supports evolution of
technology
Match to future software storage services
Future storage service software target clusters

17
Is Maintenance the Key?

Rule of Thumb: Maintenance 10X to 100X HW
so over 5 year product life, 95% of cost is
maintenance
VAX crashes '85, '93 [Murp95]; extrap. to '01
Sys. Man.: N crashes/problem, SysAdmin action
Actions: set params bad, bad config, bad app
install
HW/OS 70% in '85 to 28% in '93. In '01, 10%?

18
Availability benchmark methodology

Goal: quantify variation in QoS metrics as events
occur that affect system availability
Leverage existing performance benchmarks
to generate fair workloads
to measure and trace quality of service metrics
Use fault injection to compromise system
hardware faults (disk, memory, network, power)
software faults (corrupt input, driver error
returns)
maintenance events (repairs, SW/HW upgrades)
Examine single-fault and multi-fault workloads
the availability analogues of performance micro-
and macro-benchmarks

19
Quantum Architecture: Use of Spin for QuBits

Quantum effect gives 1 and 0:
Either spin is UP or DOWN, nothing in between
Superposition: mix of 1 and 0
Written as: $|\Psi\rangle = C_0|0\rangle + C_1|1\rangle$
An n-bit register can have $2^n$ values
simultaneously!
$|\Psi\rangle = C_{000}|000\rangle + C_{001}|001\rangle + C_{010}|010\rangle + C_{011}|011\rangle + C_{100}|100\rangle + C_{101}|101\rangle + C_{110}|110\rangle + C_{111}|111\rangle$
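
The claim that an n-bit register holds 2^n values at once can be made concrete by listing the amplitudes of a small register. The following is only a minimal sketch: the equal-amplitude state and the function name `uniform_superposition` are assumptions chosen for illustration, not something taken from the slides.

```python
import math

def uniform_superposition(n):
    """Return {basis label: amplitude} for an equal superposition of n qubits."""
    amp = 1.0 / math.sqrt(2 ** n)  # equal amplitudes, normalized
    return {format(i, "0{}b".format(n)): amp for i in range(2 ** n)}

state = uniform_superposition(3)
# Squared magnitudes are measurement probabilities and must sum to 1.
total_probability = sum(a * a for a in state.values())
```

For n = 3 the dictionary has one entry per basis state, 000 through 111, matching the eight terms written out above.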

20
Skinner-Kane Si based computer

Silicon substrate
Phosphorus ion spin + donor electron spin = qubit
A-gate
Hyperfine interaction
Electron-ion spin swap
S-gate
Electron shuttling
Global magnetic field
0 <-> 1 qubit flip
Single-electron transistors
Qubit readout

21
Interesting Ubiquitous Component: The Entropy
Exchange Unit

Possibilities for cooling:
Spin-polarized photons ⇒ spin-polarized electrons
⇒ spin-polarized nucleons
Simple thermal cooling of some sort
Two material domains
One material in contact with environment
Analysis of properties of such a system

22
Swap cell
e1-
e2-
P ion
P ion

A lot of steps for two qubits!

23
Swap Cell Control Complexity
Time
Control signals

What a mess! Long pulse sequence

24
Single-electron transistors (SETs)
Y. Takahashi et al.

Electrons move one-by-one through tunnel junction
onto quantum dot and out other side
Work well at low temperatures
Low drive current (5nA) and voltage swing (40mV)

25
Swap control circuit: ACK!
S-gate pulse cascade
On-off A-gate pulse ratio (2254)
A-gate pulse repeats 24 times

Can this even be built with SETs?

26
In SIMD we trust?

Large control circuit/small swap cell ratio
SIMD
Like clock distribution network
Clock skew at 11.3GHz?
Error correction?

27
Brass Vision Statement

The emergence of high capacity reconfigurable
devices is igniting a revolution in
general-purpose processing. It is now becoming
possible to tailor and dedicate functional units
and interconnect to take advantage of
application-dependent dataflow. Early research in this area
of reconfigurable computing has shown encouraging
results in a number of spot areas including
cryptography, signal processing, and searching
--- achieving 10-100x computational density and
reduced latency over more conventional processor
solutions.
BRASS: Microprocessor & FPGA on a single chip
use some of the millions of transistors to customize
HW dynamically to application

28
Architecture Target

Integrated RISC core + memory system +
reconfigurable array.
Combined RAM/Logic structure.
Rapid reconfiguration with many contexts.
Large local data memories and buffers.
These capabilities enable
hardware virtualization
on-the-fly specialization

128 LUTs
2Mbit
29
SCORE: Stream-oriented computation model
Goal: Provide view of reconfigurable hardware
which exposes strengths while abstracting
physical resources.

Computations are expressed as data-flow graphs.
Graphs are broken up into compute pages.
Compute pages are linked together in a data-flow
manner with streams.
A run-time manager allocates and schedules pages
for computations and memory.
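
The model above (compute pages linked by streams, fired by a run-time manager) can be sketched in a few lines. This is a toy illustration only: the class names `Stream` and `ComputePage` and the `run` scheduler are hypothetical, and the sketch ignores the limited physical resources that the real run-time manager has to virtualize.

```python
from collections import deque

class Stream:
    """FIFO link between two compute pages."""
    def __init__(self):
        self.queue = deque()

class ComputePage:
    """One node of the data-flow graph: applies fn to each input token."""
    def __init__(self, fn, inp, out):
        self.fn, self.inp, self.out = fn, inp, out

    def ready(self):
        return len(self.inp.queue) > 0

    def fire(self):
        self.out.queue.append(self.fn(self.inp.queue.popleft()))

def run(pages):
    """Toy run-time manager: keep firing any page that has input."""
    progress = True
    while progress:
        progress = False
        for page in pages:
            if page.ready():
                page.fire()
                progress = True

source, middle, sink = Stream(), Stream(), Stream()
pages = [ComputePage(lambda x: x * 2, source, middle),
         ComputePage(lambda x: x + 1, middle, sink)]
source.queue.extend([1, 2, 3])
run(pages)
result = list(sink.queue)  # [3, 5, 7]
```

Because pages only communicate through streams, the scheduler is free to fire them in any order that respects token availability, which is what lets the hardware mapping stay hidden.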

30
Ok. Back to Branch Prediction
31
Review: Fetch unit problem

Instruction fetch decoupled from execution
Often issue logic (+ rename) included with Fetch

32
Branches must be resolved quickly for loop
overlap!

In our loop-unrolling example, we relied on the
fact that branches were under control of the fast
integer unit in order to get overlap!
Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    F4, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
What happens if the branch depends on the result of
MULTD?
We completely lose all of our advantages!
Need to be able to predict branch outcome.
If we were to predict that the branch was taken, this
would be right most of the time.
Problem is much worse for superscalar machines!
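
To see why "predict taken" is right most of the time for a loop-closing branch such as the BNEZ above, it helps to count outcomes. A minimal sketch, assuming a loop that runs a fixed number of iterations; the function name and the iteration count of 100 are illustrative choices, not from the lecture.

```python
def predict_taken_accuracy(iterations):
    """Fraction of correct guesses when a loop-closing branch is always predicted taken."""
    # The branch is taken on every iteration except the last (loop exit).
    outcomes = [True] * (iterations - 1) + [False]
    correct = sum(1 for taken in outcomes if taken)
    return correct / len(outcomes)

accuracy = predict_taken_accuracy(100)  # 99 of 100 guesses are right
```

The only miss is the final, exiting iteration, so accuracy approaches 1 as the trip count grows.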

33
Review: Predicated Execution

Avoid branch prediction by turning branches into
conditionally executed instructions:
if (x) then A = B op C else NOP
If false, then neither store result nor cause
exception
Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have
conditional move; PA-RISC can annul any following
instr.
IA-64: 64 1-bit condition fields selected, so
conditional execution of any instruction
This transformation is called "if-conversion"
Drawbacks to conditional instructions:
Still takes a clock even if "annulled"
Stall if condition evaluated late
Complex conditions reduce effectiveness;
condition becomes known late in pipeline

x
A = B op C
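
If-conversion can be illustrated by writing the same computation twice, once with a branch and once as "compute always, commit conditionally". This is a rough sketch in Python rather than machine code: the function names are hypothetical, `op` is assumed to be addition, and the conditional expression stands in for a hardware conditional move.

```python
def branchy(x, a, b, c):
    """Original form: the assignment is control-dependent on x."""
    if x:
        a = b + c
    return a

def if_converted(x, a, b, c):
    """If-converted form: always compute, then conditionally commit."""
    result = b + c             # executes whether or not x holds (still costs a slot)
    return result if x else a  # models a conditional move: a is unchanged when x is false
```

Both forms return the same value for every input; the second trades a possibly mispredicted branch for an instruction that always occupies an issue slot.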
34
Dynamic Branch Prediction Problem
History Information
Branch Predictor
Incoming Branches: Address
Prediction: Address, Value
Corrections: Address, Value

Incoming stream of addresses
Fast outgoing stream of predictions
Correction information returned from pipeline
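
The three streams above (addresses in, predictions out, corrections back) suggest a simple two-method interface. The sketch below is an assumption-laden illustration: the `predict`/`update` method names, the last-outcome predictor, and the trace are all hypothetical, and corrections are applied immediately instead of arriving some cycles later as they would from a real pipeline.

```python
class LastOutcomePredictor:
    """1-bit scheme: predict whatever this branch did last time."""
    def __init__(self):
        self.history = {}  # branch address -> last outcome

    def predict(self, address):
        return self.history.get(address, True)  # assume taken when unseen

    def update(self, address, taken):
        self.history[address] = taken

def misprediction_rate(predictor, trace):
    """Feed (address, taken) pairs; a correction follows each prediction."""
    wrong = 0
    for address, taken in trace:
        if predictor.predict(address) != taken:
            wrong += 1
        predictor.update(address, taken)  # correction from the pipeline
    return wrong / len(trace)

# Hypothetical trace: one loop branch at 0x40, run twice for 4 iterations each.
trace = [(0x40, t) for t in [True, True, True, False] * 2]
rate = misprediction_rate(LastOutcomePredictor(), trace)
```

Any predictor exposing the same two methods can be dropped into `misprediction_rate`, which makes it easy to compare schemes on one trace.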

35
Review: Branch Target Buffer

Branch Target Buffer (BTB): address of branch
is the index to get prediction AND branch target
address (if taken)
Note: must check for branch match now, since
can't use wrong branch address (Figure 4.22, p.
273)
Return instruction addresses predicted with stack
Remember branch folding (Crisp processor)?

PC of instruction FETCH
?
Predict taken or untaken
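
The branch-match check described on this slide can be sketched as a small tagged table. This is a minimal illustration under stated assumptions: a direct-mapped organization, 4-byte instructions, 16 entries, and a policy of storing only taken branches are all choices made here, not details from the lecture.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (tag, target) for a taken branch."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index_and_tag(self, pc):
        word = pc >> 2  # assume 4-byte instructions
        return word % self.entries, word // self.entries

    def lookup(self, pc):
        """Return predicted target, or None if pc is not a known taken branch."""
        index, tag = self._index_and_tag(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag:  # the branch-match check
            return entry[1]
        return None

    def update(self, pc, taken, target):
        index, tag = self._index_and_tag(pc)
        if taken:
            self.table[index] = (tag, target)
        elif self.table[index] is not None and self.table[index][0] == tag:
            self.table[index] = None  # drop the entry once the branch falls through

btb = BranchTargetBuffer()
btb.update(0x100, True, 0x80)
hit = btb.lookup(0x100)         # 0x80
alias = btb.lookup(0x100 + 64)  # same index, different tag -> None
```

Without the tag comparison, the second lookup would redirect fetch to a target belonging to a different instruction, which is exactly the error the slide warns about.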
36
Branch (Pattern?) History Table
Predictor 0
Predictor 1
Branch PC
Predictor 7

BHT is a table of "Predictors"
Usually 2-bit, saturating counters
Indexed by PC address of branch, without tags
In Fetch stage of branch:
BTB identifies branch
Predictor from BHT used to make prediction
When branch completes:
Update corresponding Predictor
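
A table of untagged 2-bit counters is short enough to write out. The following is a sketch, not the lecture's implementation: the table size of 8 (matching Predictor 0 through Predictor 7), the word-address indexing, and the initial counter value are assumptions made for illustration.

```python
class BranchHistoryTable:
    """Untagged table of 2-bit saturating counters (0..3; >= 2 means taken)."""
    def __init__(self, entries=8):
        self.entries = entries
        self.counters = [1] * entries  # start weakly not-taken (arbitrary choice)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # low-order word-address bits, no tag

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
for _ in range(3):
    bht.update(0x20, True)           # train one branch as taken
trained = bht.predict(0x20)          # True
aliased = bht.predict(0x20 + 8 * 4)  # True: a different branch shares the entry
```

The last line shows the cost of omitting tags: two branches that map to the same entry share, and can disturb, each other's counter.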

37
Review: Dynamic Branch Prediction (Jim Smith,
1981)

Predictor: 2-bit scheme where prediction changes
only after two mispredictions
Red: stop, not taken
Green: go, taken
Adds hysteresis to decision-making process

T
Predict Taken
Predict Taken
T
NT
Predict Not Taken
Predict Not Taken
NT
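
The benefit of hysteresis shows up when a 1-bit and a 2-bit predictor are run on the same loop branch. This sketch assumes a plain saturating counter, which is one common encoding; the state diagram on the slide may use slightly different transitions. The function name and the 9-taken-then-exit pattern are illustrative assumptions.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter with the given width."""
    top = (1 << bits) - 1
    threshold = 1 << (bits - 1)  # counter >= threshold predicts taken
    counter = top                # start in the strongest taken state
    wrong = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong

# A loop branch taken 9 times then not taken once, executed 10 times over.
pattern = ([True] * 9 + [False]) * 10
one_bit = mispredictions(pattern, 1)
two_bit = mispredictions(pattern, 2)
```

The 1-bit scheme misses twice per loop execution after the first (once at the exit and once on re-entry), while the 2-bit scheme misses only at the exit.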
38
Correlating Branches

Hypothesis: recent branches are correlated; that
is, behavior of recently executed branches
affects prediction of current branch
Two possibilities; current branch depends on:
Last m most recently executed branches anywhere
in program. Produces a "GA" (for "global
adaptive") in the Yeh and Patt classification
(e.g. GAg)
Last m most recent outcomes of the same
branch. Produces a "PA" (for "per-address
adaptive") in the same classification (e.g. PAg)
Idea: record m most recently executed branches as
taken or not taken, and use that pattern to
select the proper branch history table entry
A single history table shared by all branches
(appends a "g" at end), indexed by history value.
Address is used along with history to select
table entry (appends a "p" at end of
classification)
If only a portion of the address is used, often appends an
"s" to indicate "set-indexed" tables (e.g. GAs)
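
The two-level idea (an m-bit history selecting a counter) can be sketched with one class that switches between a global and a per-address history register. Everything named here is hypothetical: the class, its `per_address` flag, and the alternating test branch are illustrations under the assumption of a single shared table of 2-bit saturating counters, i.e. the GAg and PAg cases only.

```python
class TwoLevelPredictor:
    """m-bit history selects a 2-bit counter in one shared table (the 'g' suffix)."""
    def __init__(self, m, per_address):
        self.m = m
        self.per_address = per_address  # True -> PAg, False -> GAg
        self.histories = {}             # key -> m-bit shift register
        self.counters = [1] * (1 << m)  # shared pattern history table

    def _key(self, pc):
        return pc if self.per_address else "global"

    def predict(self, pc):
        history = self.histories.get(self._key(pc), 0)
        return self.counters[history] >= 2

    def update(self, pc, taken):
        key = self._key(pc)
        history = self.histories.get(key, 0)
        delta = 1 if taken else -1
        self.counters[history] = max(0, min(3, self.counters[history] + delta))
        mask = (1 << self.m) - 1
        self.histories[key] = ((history << 1) | int(taken)) & mask

def count_wrong(predictor, trace):
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong

# Hypothetical branch at 0x10 that strictly alternates taken / not taken.
trace = [(0x10, i % 2 == 0) for i in range(40)]
pag_wrong = count_wrong(TwoLevelPredictor(m=2, per_address=True), trace)
gag_wrong = count_wrong(TwoLevelPredictor(m=2, per_address=False), trace)
```

With only one branch in the trace the two variants behave identically, and both learn the alternating pattern after a short warm-up; they diverge once several branches interleave, because the global register then mixes outcomes from different branches.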

39
Discussion of Yeh and Patt classification
PAg
PAp
GAg

GAg: Global History Register, Global History
Table
PAg: Per-Address History Register, Global History
Table
PAp: Per-Address History Register, Per-Address
History Table

40
Other Global Variants: Try to Avoid Aliasing
GAs
GShare

GAs: Global History Register, Per-Address (Set
Associative) History Table
Gshare: Global History Register, Global History
Table with simple attempt at anti-aliasing
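
The difference between the two variants comes down to how the table index is formed. The sketch below assumes the usual formulations (concatenating address and history bits for GAs, XORing them for gshare); the function names, bit widths, and example addresses are illustrative assumptions.

```python
def gas_index(pc, history, addr_bits, hist_bits):
    """GAs: concatenate some address bits with the global history."""
    addr = (pc >> 2) & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    return (addr << hist_bits) | hist

def gshare_index(pc, history, index_bits):
    """gshare: XOR address bits with global history to spread branches out."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ history) & mask

# Same 6-bit table budget for both schemes (64 counters).
a = gshare_index(0x104, 0b101010, 6)
b = gshare_index(0x108, 0b101010, 6)
c = gas_index(0x104, 0b101010, 2, 4)
```

With a fixed index width, gshare gets to use all six history bits and six address bits, whereas GAs must split the same six bits between the two sources.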

41
What are Important Metrics?

Clearly, hit rate matters
Even 1% can be important when above 90% hit rate
Speed: does this affect cycle time?
Space: clearly, total space matters!
Papers which do not try to normalize across
different options are playing fast and loose with
data
Try to get best performance for the cost
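
Normalizing for space means comparing predictors at equal storage. A minimal sketch of the bookkeeping, assuming the (m, n) notation in which each entry holds 2^m counters of n bits; the function names are hypothetical and the small m-bit history register itself is left out of the count.

```python
def bht_bits(entries, counter_bits=2):
    """Storage of a plain BHT: one counter per entry."""
    return entries * counter_bits

def correlating_bits(entries, m, n):
    """Storage of an (m, n) predictor: 2**m n-bit counters per entry."""
    return entries * (2 ** m) * n

plain = bht_bits(4096)                     # 8192 bits
correlated = correlating_bits(1024, 2, 2)  # 8192 bits
```

A 4096-entry 2-bit BHT and a 1024-entry (2,2) predictor therefore cost the same number of counter bits, which is what makes a head-to-head accuracy comparison between them fair.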

42
Accuracy of Different Schemes (Figure 4.21, p.
272)
[Figure: frequency of mispredictions (axis from 0% to 18%) for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT]
43
Discussion of Papers

"A Comparative Analysis of Schemes for Correlated
Branch Prediction"
Cliff Young, Nicolas Gloy, and Michael D. Smith
"An Analysis of Correlation and Predictability:
What Makes Two-Level Branch Predictors Work?"
Marius Evers, Sanjay J. Patel, Robert S. Chappell,
and Yale N. Patt

44
Summary 1: Dynamic Branch Prediction

Prediction is becoming an important part of scalar
execution.
Prediction is exploiting "information
compressibility" in execution
Branch History Table: 2 bits for loop accuracy
Correlation: recently executed branches are
correlated with next branch.
Either different branches (GA)
Or different executions of same branches (PA).
Branch Target Buffer: include branch address and
prediction
Predicated execution can reduce number of
branches, number of mispredicted branches

45
Summary 2

Prediction, prediction, prediction!
Over next couple of lectures, we will explore
prediction of everything! Branches,
Dependencies, Data
The high prediction accuracies will cause us to
ask:
Is the deterministic Von Neumann model the right
one???
