Title: CS252 Graduate Computer Architecture Lecture 12 Branch Prediction Possible Projects
1CS252 Graduate Computer Architecture, Lecture 12: Branch Prediction, Possible Projects
- October 8th, 2003
- Prof. John Kubiatowicz
- http://www.cs.berkeley.edu/~kubitron/courses/cs252-F03
2CS252 Projects
- Two People from this class
- Projects can overlap with other classes
- Exceptions to the two-person requirement need to be OK'd
- Amount of work: 3 solid weeks of work
- Spread over the remainder of the term
- Should be a miniature research project
- State of the art (can't redo something that others have done)
- Should be publishable work
- Must have solid methodology!
- Elements
- Base architecture to measure against
- Simulation or other analysis against some application set
- Several variations on a theme
3CS252 Projects
- DynaCOMP related (or Introspective Computing)
- OceanStore related
- Smart Dust/NEST
- ROC Related Projects
- BRASS project related
- Benchmarking Related (Yelick)
4DynaCOMP: Introspective Computing
- Biological Analogs for computer systems
- Continuous adaptation
- Insensitivity to design flaws
- Both hardware and software
- Necessary if can never be sure that all components are working properly
- Examples
- ISTORE -- applies introspective computing to disk storage
- DynaComp -- applies introspective computing at chip level
- Compiler always running and part of execution!
(Figure: cycle of Monitor, Compute, Adapt)
5DynaCOMP Vision Statement
- Modern microprocessors gather profile information in hardware in order to generate predictions: branches, dependencies, and values.
- Processors such as the Pentium-II employ a primitive form of compilation to translate x86 operations into internal RISC-like micro-ops.
- So, why not do all of this in software? Make use of a combination of explicit monitoring, dynamic compilation technology, and genetic algorithms to:
- Simplify hardware, possibly using large on-chip multiprocessors built from simple processors.
- Improve performance through feedback-driven optimization: continuous execution, monitoring, analysis, recompilation.
- Generate design complexity automatically so that designers are not required to. Use explicit proof verification techniques to verify that code generation is correct.
- This is aptly called Introspective Computing
- Related idea: use of continuous observation to reduce power on buses!
6The Thermodynamic Analogy
- Large Systems have a variety of latent order
- Connections between elements
- Mathematical structure (erasure coding, etc)
- Distributions peaked about some desired behavior
- Permits Stability through Statistics
- Exploit the behavior of aggregates (redundancy)
- Subject to Entropy
- Servers/components fail, attacks happen, system changes
- Requires continuous repair
- Apply energy (i.e. through servers) to reduce entropy
- Introspection restores distributions
7ThermoSpective
(Figure: cycle of Compute, Adapt, Monitor)
- Many Redundant Components (Fault Tolerance)
- Continuous Repair (Entropy Reduction)
- What about NanoComputing Domain?
- How will you build reliable systems from
unreliable components?
8OceanStore Vision
9Ubiquitous Devices => Ubiquitous Storage
- Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
- Properties REQUIRED for Endeavour storage substrate
- Strong Security: data must be encrypted whenever in the infrastructure; resistance to monitoring
- Coherence: too much data for naïve users to keep coherent by hand
- Automatic replica management and optimization: huge quantities of data cannot be managed manually
- Simple and automatic recovery from disasters: probability of failure increases with size of system
- Utility model: world-scale system requires cooperation across administrative boundaries
10Utility-based Infrastructure
(Figure: confederation of providers: Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell)
- Service provided by confederation of companies
- Monthly fee paid to one service provider
- Companies buy and sell capacity from each other
11Preliminary Smart Dust Mote
Brett Warneke, Bryan Atwood, Kristofer Pister
Berkeley Sensor and Actuator Center
Dept. of Electrical Engineering and Computer Sciences
University of California, Berkeley
12Smart Dust
1-2mm
13COTS Dust
- GOAL
- Get our feet wet
- RESULT
- Cheap, easy, off-the-shelf RF systems
- Fantastic interest in cheap, easy, RF
- Industry
- Berkeley Wireless Research Center
- Center for the Built Environment (IUCRC)
- PC Enabled Toys (Intel)
- Endeavour Project (UCB)
- Optical proof of concept
14Smart Dust/Micro Server Projects
- David Culler and Kris Pister collaborating
- What is the proper operating system for devices of this nature?
- Linux or Windows is not appropriate!
- State machine execution model is much simpler!
- Assume that little device is backed by servers in net.
- Questions of hardware/software tradeoffs
- What is the high-level organization of zillions of dust motes in the infrastructure???
- What type of computational/communication ability provides the right tradeoff between functionality and power consumption???
15A glimpse into the future?
- System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
- ISTORE HW in 5-7 years
- 2006 brick: System On a Chip integrated with MicroDrive
- 9GB disk, 50 MB/sec from disk
- connected via crossbar switch
- From brick to domino
- If low power, 10,000 nodes fit into one rack!
- O(10,000) scale is our ultimate design point
16ROC vision: Storage System of the Future
- Availability, Maintainability, and Evolutionary growth (AME) are key challenges for storage systems
- Maintenance Cost >10X Purchase Cost per year
- Even 2X purchase cost for 1/2 maintenance cost wins
- AME improvement enables even larger systems
- ISTORE has cost-performance advantages
- Better space, power/cooling costs (@ colocation site)
- More MIPS, cheaper MIPS, no bus bottlenecks
- Compression reduces network $, encryption protects
- Single interconnect, supports evolution of technology
- Match to future software storage services
- Future storage service software target clusters
17Is Maintenance the Key?
- Rule of Thumb: Maintenance 10X to 100X HW
- so over 5 year product life, 95% of cost is maintenance
- VAX crashes '85, '93 [Murp95]; extrap. to '01
- Sys. Man.: N crashes/problem, SysAdmin action
- Actions: set params bad, bad config, bad app install
- HW/OS 70% in '85 to 28% in '93. In '01, 10%?
18Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks
- to generate fair workloads
- to measure & trace quality of service metrics
- Use fault injection to compromise system
- hardware faults (disk, memory, network, power)
- software faults (corrupt input, driver error returns)
- maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
- the availability analogues of performance micro- and macro-benchmarks
19Quantum Architecture: Use of Spin for QuBits
- Quantum effect gives 1 and 0
- Either spin is UP or DOWN, nothing in between
- Superposition: Mix of 1 and 0
- Written as |ψ> = C0|0> + C1|1>
- An n-bit register can have 2^n values simultaneously!
- |ψ> = C000|000> + C001|001> + C010|010> + C011|011> + C100|100> + C101|101> + C110|110> + C111|111>
20Skinner-Kane Si based computer
- Silicon substrate
- Phosphorus ion spin + donor electron spin = qubit
- A-gate
- Hyperfine interaction
- Electron-ion spin swap
- S-gate
- Electron shuttling
- Global magnetic field
- 0 <-> 1 qubit flip
- Single-electron transistors
- Qubit readout
21Interesting Ubiquitous Component: The Entropy Exchange Unit
- Possibilities for cooling
- Spin-polarized photons -> spin-polarized electrons -> spin-polarized nucleons
- Simple thermal cooling of some sort
- Two material domains
- One material in contact with environment
- Analysis of properties of such a system
22Swap cell
(Figure: swap cell with two electrons, e1- and e2-, and two P ions)
- A lot of steps for two qubits!
23Swap Cell Control Complexity
(Figure: control signals plotted against time)
- What a mess! Long pulse sequence
24Single-electron transistors (SETs)
Y. Takahashi et al.
- Electrons move one-by-one through tunnel junction onto quantum dot and out other side
- Work well at low temperatures
- Low drive current (5nA) and voltage swing (40mV)
25Swap control circuit: ACK!
S-gate pulse cascade
On-off A-gate pulse ratio (2254)
A-gate pulse repeats 24 times
- Can this even be built with SETs?
26In SIMD we trust?
- Large control circuit/small swap cell ratio => SIMD
- Like clock distribution network
- Clock skew at 11.3GHz?
- Error correction?
27BRASS Vision Statement
- The emergence of high capacity reconfigurable devices is igniting a revolution in general-purpose processing. It is now becoming possible to tailor and dedicate functional units and interconnect to take advantage of application dependent dataflow. Early research in this area of reconfigurable computing has shown encouraging results in a number of spot areas including cryptography, signal processing, and searching, achieving 10-100x computational density and reduced latency over more conventional processor solutions.
- BRASS: Microprocessor and FPGA on a single chip
- use some of the millions of transistors to customize HW dynamically to application
28Architecture Target
- Integrated RISC core + memory system + reconfigurable array.
- Combined RAM/Logic structure.
- Rapid reconfiguration with many contexts.
- Large local data memories and buffers.
- These capabilities enable
- hardware virtualization
- on-the-fly specialization
128 LUTs
2Mbit
29SCORE: Stream-oriented computation model
Goal: Provide view of reconfigurable hardware which exposes strengths while abstracting physical resources.
- Computations are expressed as data-flow graphs.
- Graphs are broken up into compute pages.
- Compute pages are linked together in a data-flow manner with streams.
- A run-time manager allocates and schedules pages for computations and memory.
30Ok. Back to Branch Prediction
31Review: Problem: Fetch unit
- Instruction fetch decoupled from execution
- Often issue logic (+ rename) included with Fetch
32Branches must be resolved quickly for loop
overlap!
- In our loop-unrolling example, we relied on the fact that branches were under control of the fast integer unit in order to get overlap!
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, 8
          BNEZ  R1, Loop
- What happens if branch depends on result of multd??
- We completely lose all of our advantages!
- Need to be able to predict branch outcome.
- If we were to predict that branch was taken, this would be right most of the time.
- Problem much worse for superscalar machines!
33Review: Predicated Execution
- Avoid branch prediction by turning branches into conditionally executed instructions
- if (x) then A = B op C else NOP
- If false, then neither store result nor cause exception
- Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
- IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
- This transformation is called if-conversion
- Drawbacks to conditional instructions
- Still takes a clock even if annulled
- Stall if condition evaluated late
- Complex conditions reduce effectiveness; condition becomes known late in pipeline
(Figure: condition x guarding A = B op C)
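A minimal Python sketch of the predicated form "if (x) then A = B op C else NOP". The tuple encoding of an instruction, the register-file dictionary, and the name `run_predicated` are all illustrative assumptions, not anything from the lecture. It shows two points from the slide: a false predicate suppresses both the result and any exception, yet the annulled instruction still costs a slot.

```python
import operator


def run_predicated(program, regs):
    """Execute (predicate, dest, op, src1, src2) tuples; return slots used."""
    cycles = 0
    for pred, dest, op, a, b in program:
        cycles += 1                      # an annulled instruction still takes a slot
        if pred is None or regs[pred]:   # false predicate: op never runs, no write
            regs[dest] = op(regs[a], regs[b])
    return cycles


# if (x) then A = B / C else NOP, with C == 0 to show the suppressed exception
program = [("x", "A", operator.truediv, "B", "C")]
regs = {"x": 0, "A": 7, "B": 2, "C": 0}
print(run_predicated(program, regs), regs["A"])
```

Because the operation is only evaluated when the predicate is true, the divide by zero above never raises; a real pipeline would instead compute and then discard, which this sketch does not model.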
34Dynamic Branch Prediction Problem
(Figure: Branch Predictor holding History Information; input: Incoming Branches {Address}; output: Prediction {Address, Value}; feedback: Corrections {Address, Value})
- Incoming stream of addresses
- Fast outgoing stream of predictions
- Correction information returned from pipeline
35Review: Branch Target Buffer
- Branch Target Buffer (BTB): Address of branch used as index to get prediction AND branch address (if taken)
- Note: must check for branch match now, since can't use wrong branch address (Figure 4.22, p. 273)
- Return instruction addresses predicted with stack
- Remember branch folding (Crisp processor)?
(Figure: PC of instruction in FETCH is compared (=?) against BTB entries; predict taken or untaken)
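The lookup and tag check described above can be sketched in Python as follows. This is a hedged illustration only: the direct-mapped organization, the 16-entry size, the 4-byte instruction assumption, and the names `BranchTargetBuffer`, `lookup`, and `update` are assumptions made for the example, not details from the lecture.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds (tag, target, predict_taken)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        # Assume 4-byte instructions, so drop the two low-order bits.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the predicted next PC, or None when no entry matches."""
        entry = self.table[self._index(pc)]
        if entry is None or entry[0] != pc:
            return None            # tag mismatch: must not use a wrong branch address
        _tag, target, taken = entry
        return target if taken else pc + 4

    def update(self, pc, target, taken):
        """Record the resolved outcome of the branch at pc."""
        self.table[self._index(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
btb.update(0x400, 0x100, True)
print(btb.lookup(0x400))   # known branch, predicted taken
print(btb.lookup(0x440))   # maps to the same entry, but the tag differs
```

Storing the full PC as the tag is the simplest way to make the "branch match" check explicit; a real design would keep only the bits not already used for indexing.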
36Branch (Pattern?) History Table
(Figure: Branch PC indexes a table of predictors, Predictor 0 through Predictor 7)
- BHT is a table of Predictors
- Usually 2-bit, saturating counters
- Indexed by PC address of Branch without tags
- In Fetch stage of branch
- BTB identifies branch
- Predictor from BHT used to make prediction
- When branch completes
- Update corresponding Predictor
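A small Python sketch of such a table, assuming 8 untagged entries of 2-bit saturating counters that start in the weakly-not-taken state; the class name, the helper `run_loop_branch`, and the loop workload are illustrative assumptions. It counts mispredictions for a loop-closing branch that is taken on every iteration except the last.

```python
class BranchHistoryTable:
    """Untagged table of 2-bit saturating counters, indexed by branch PC bits."""

    def __init__(self, entries=8):
        self.entries = entries
        self.counters = [1] * entries    # assumed start: weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # no tag, so different branches can alias

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def run_loop_branch(bht, pc, iterations, passes):
    """Count mispredictions for a loop-closing branch executed `passes` times."""
    misses = 0
    for _ in range(passes):
        for i in range(iterations):
            taken = i < iterations - 1   # taken until the final loop exit
            if bht.predict(pc) != taken:
                misses += 1
            bht.update(pc, taken)        # update when the branch completes
    return misses


print(run_loop_branch(BranchHistoryTable(), 0x40, iterations=10, passes=5))
```

After the first pass warms the counter up, only the loop exit is mispredicted on each later pass, which is the behavior the 2-bit scheme is meant to give.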
37Review: Dynamic Branch Prediction (Jim Smith, 1981)
- Predictor: 2-bit scheme where change prediction only if get misprediction twice
- Red: stop, not taken
- Green: go, taken
- Adds hysteresis to decision making process
(Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; transitions labeled T and NT)
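To make the hysteresis concrete, here is a hedged Python sketch comparing a 1-bit "same as last time" predictor with a 2-bit scheme. The 2-bit scheme is modeled as a saturating counter, which is one common realization and may differ in its weak-state transitions from the exact state diagram on the slide; the function names and the starting states are assumptions.

```python
def next_state(state, taken):
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
    return min(3, state + 1) if taken else max(0, state - 1)


def mispredictions_2bit(outcomes, state=3):
    """Count misses, starting (by assumption) in the strongly-taken state."""
    misses = 0
    for taken in outcomes:
        misses += (state >= 2) != taken
        state = next_state(state, taken)
    return misses


def mispredictions_1bit(outcomes, last=True):
    """Count misses for a predictor that repeats the previous outcome."""
    misses = 0
    for taken in outcomes:
        misses += last != taken
        last = taken
    return misses


# A loop branch with outcomes T T T NT, executed three times over.
outcomes = [True, True, True, False] * 3
print(mispredictions_1bit(outcomes), mispredictions_2bit(outcomes))
```

The 1-bit predictor misses twice per loop execution after the first (once at the exit, once on re-entry), while the 2-bit counter absorbs the single not-taken outcome and misses only at the exit.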
38Correlating Branches
- Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch
- Two possibilities; current branch depends on:
- Last m most recently executed branches anywhere in program. Produces a "GA" (for "global adaptive") in the Yeh and Patt classification (e.g. GAg)
- Last m most recent outcomes of same branch. Produces a "PA" (for "per-address adaptive") in same classification (e.g. PAg)
- Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
- A single history table shared by all branches (appends a "g" at end), indexed by history value.
- Address is used along with history to select table entry (appends a "p" at end of classification)
- If only portion of address used, often appends an "s" to indicate "set-indexed" tables (i.e. GAs)
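The two "g" variants can be sketched in Python as below, assuming m = 4 history bits and 2-bit saturating counters in the shared table. The class `TwoLevelPredictor`, its `per_address` switch (False gives a GAg-style predictor, True a PAg-style one), and the helper `count_misses` are hypothetical names chosen for the example.

```python
class TwoLevelPredictor:
    """Two-level sketch: an m-bit history selects a counter in one shared table."""

    def __init__(self, history_bits=4, per_address=False):
        self.m = history_bits
        self.per_address = per_address
        self.global_history = 0                # single register (GAg)
        self.local_history = {}                # branch PC -> history (PAg)
        self.pht = [1] * (1 << history_bits)   # shared table of 2-bit counters

    def _history(self, pc):
        if self.per_address:
            return self.local_history.get(pc, 0)
        return self.global_history

    def predict(self, pc):
        return self.pht[self._history(pc)] >= 2

    def update(self, pc, taken):
        h = self._history(pc)
        self.pht[h] = min(3, self.pht[h] + 1) if taken else max(0, self.pht[h] - 1)
        new_h = ((h << 1) | int(taken)) & ((1 << self.m) - 1)   # shift in outcome
        if self.per_address:
            self.local_history[pc] = new_h
        else:
            self.global_history = new_h


def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


gag = TwoLevelPredictor()
count_misses(gag, 0x40, [True, False] * 10)         # warm-up on T, NT, T, NT, ...
print(count_misses(gag, 0x40, [True, False] * 25))  # misses after warm-up
```

A strictly alternating branch defeats a plain 2-bit counter, but once the history pattern is learned the two-level predictor selects a different counter for each phase of the alternation.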
39Discussion of Yeh and Patt classification
(Figure panels: GAg, PAg, PAp)
- GAg: Global History Register, Global History Table
- PAg: Per-Address History Register, Global History Table
- PAp: Per-Address History Register, Per-Address History Table
40Other Global Variants: Try to Avoid Aliasing
(Figure panels: GAs, GShare)
- GAs: Global History Register, Per-Address (Set Associative) History Table
- Gshare: Global History Register, Global History Table with simple attempt at anti-aliasing
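A hedged Python sketch of a gshare-style index. The slide does not spell out the hash; XOR of branch address bits with the global history is the usual definition of gshare and is assumed here, along with the 10-bit index, the 2-bit counters, and the class and method names.

```python
class GsharePredictor:
    """Gshare sketch: index = (branch PC bits) XOR (global history)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.pht = [1] * (1 << index_bits)   # 2-bit saturating counters

    def index(self, pc):
        # Assume 4-byte instructions; fold address and history together.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


g = GsharePredictor()
# With identical global history, two different branches pick different entries,
# whereas a GAg table indexed by history alone would make them share one.
print(g.index(0x40), g.index(0x44))
```

The XOR costs no extra table space compared with GAg, which is why it counts as a "simple" anti-aliasing attempt; branches can still collide when address and history bits happen to cancel.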
41What are Important Metrics?
- Clearly, Hit Rate matters
- Even 1% can be important when above 90% hit rate
- Speed: Does this affect cycle time?
- Space: Clearly Total Space matters!
- Papers which do not try to normalize across different options are playing fast and loose with data
- Try to get best performance for the cost
42Accuracy of Different Schemes (Figure 4.21, p. 272)
(Figure: Frequency of Mispredictions, axis from 0% to 18%, comparing 4096 Entries 2-bit BHT, Unlimited Entries 2-bit BHT, and 1024 Entries (2,2) BHT)
43Discussion of Papers
- "A Comparative Analysis of Schemes for Correlated Branch Prediction"
- Cliff Young, Nicolas Gloy and Michael D. Smith
- "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work?"
- Marius Evers, Sanjay J. Patel, Robert S. Chappell, and Yale N. Patt
44Summary 1: Dynamic Branch Prediction
- Prediction becoming important part of scalar execution.
- Prediction is exploiting information compressibility in execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: Recently executed branches correlated with next branch.
- Either different branches (GA)
- Or different executions of same branches (PA).
- Branch Target Buffer: include branch address and prediction
- Predicated Execution can reduce number of branches, number of mispredicted branches
45Summary 2
- Prediction, prediction, prediction!
- Over next couple of lectures, we will explore prediction of everything! Branches, Dependencies, Data
- The high prediction accuracies will cause us to ask:
- Is the deterministic Von Neumann model the right one???