1
The Stanford Hydra Chip Multiprocessor
  • Kunle Olukotun
  • The Hydra Team
  • Computer Systems Laboratory
  • Stanford University

2
Technology → Architecture
  • Transistors are cheap, plentiful and fast
  • Moore's law
  • 100 million transistors by 2000
  • Wires are cheap, plentiful and slow
  • Wires get slower relative to transistors
  • Long cross-chip wires are especially slow
  • Architectural implications
  • Plenty of room for innovation
  • Single cycle communication requires localized
    blocks of logic
  • High communication bandwidth across the chip
    easier to achieve than low latency

3
Exploiting Program Parallelism
4
Hydra Approach
  • A single-chip multiprocessor architecture
    composed of simple fast processors
  • Multiple threads of control
  • Exploits parallelism at all levels
  • Memory renaming and thread-level speculation
  • Makes it easy to develop parallel programs
  • Keep the design simple by taking advantage of the
    single-chip implementation

5
Outline
  • Base Hydra Architecture
  • Performance of base architecture
  • Speculative thread support
  • Speculative thread performance
  • Improving speculative thread performance
  • Hydra prototype design
  • Conclusions

6
The Base Hydra Design
  • Shared 2nd-level cache
  • Low latency interprocessor communication (10
    cycles)
  • Separate read and write buses
  • Single-chip multiprocessor
  • Four processors
  • Separate primary caches
  • Write-through data caches to maintain coherence

7
Hydra vs. Superscalar
  • ILP only
  • → SS 30-50% better than a single Hydra processor
  • ILP and fine-grained threads
  • → SS and Hydra comparable
  • ILP and coarse-grained threads
  • → Hydra 1.5-2× better
  • "The Case for a Single-Chip Multiprocessor,"
    ASPLOS '96


[Chart: speedup (0-4) of Hydra (4 x 2-way issue) vs. a 6-way issue superscalar on swim, applu, OLTP, eqntott, MPEG2, tomcatv, compress, and m88ksim]
8
Problem: Parallel Software
  • Parallel software is limited
  • Hand-parallelized applications
  • Auto-parallelized dense matrix FORTRAN
    applications
  • Traditional auto-parallelization of C programs is
    very difficult
  • Threads have data dependencies → synchronization
  • Pointer disambiguation is difficult and expensive
  • Compile time analysis is too conservative
  • How can hardware help?
  • Remove need for pointer disambiguation
  • Allow the compiler to be aggressive

9
Solution: Data Speculation
  • Data speculation enables parallelization without
    regard for data dependencies
  • Loads and stores follow original sequential
    semantics
  • Speculation hardware ensures correctness
  • Add synchronization only for performance
  • Loop parallelization is now easily automated (see
    the sketch below)
  • Other ways to parallelize code
  • Break code into arbitrary threads (e.g.,
    speculative subroutines)
  • Parallel execution with sequential commits
  • Data speculation support
  • Wisconsin Multiscalar
  • Hydra provides low-overhead support for CMP
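To make the idea concrete, here is a minimal C sketch (the function, pointer names, and loop body are hypothetical, not from the slides) of the kind of loop that defeats static pointer disambiguation but that thread-level speculation can still run in parallel: each iteration becomes a speculative thread, and the hardware intervenes only if dst and src actually overlap at run time.

    /* Statically, the compiler cannot prove that dst and src never
     * alias, so it must assume a loop-carried dependence and keep the
     * loop sequential.  With data speculation, each iteration runs as
     * a speculative thread; the hardware forwards data between threads
     * and squashes/restarts an iteration only if a true dependence
     * really occurs. */
    void scale_and_copy(int *dst, const int *src, int n, int k)
    {
        for (int i = 0; i < n; i++) {
            /* one speculative thread per iteration */
            dst[i] = src[i] * k;   /* may or may not overlap dst/src */
        }
    }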

10
Data Speculation Requirements I
  • Forward data between parallel threads
  • Detect violations when reads occur too early

11
Data Speculation Requirements II
  • Safely discard bad state after violation
  • Correctly retire speculative state

12
Data Speculation Requirements III
  • Maintain multiple views of memory

13
Hydra Speculation Support
  • Write bus and L2 buffers provide forwarding
  • Read L1 tag bits detect violations (see the sketch
    below)
  • Dirty L1 tag bits and write buffers provide
    backup
  • Write buffers reorder and retire speculative
    state
  • Separate L1 caches with pre-invalidation and smart
    L2 forwarding maintain multiple views of memory
  • Speculation coprocessors to control threads
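As a rough software model only (the struct and field names below are assumptions, not the real Hydra tag layout), the read and dirty bits named above can be pictured as per-line state, with a violation raised exactly when a less speculative CPU writes a line this CPU has already speculatively read:

    #include <stdbool.h>

    /* Hypothetical per-L1-line speculation state (names assumed). */
    typedef struct {
        bool valid;
        bool spec_read;    /* set by a speculative load from the line     */
        bool spec_dirty;   /* set by a speculative store; marks lines to
                              discard if this thread's speculation fails  */
    } l1_spec_tags;

    /* A write from an earlier (less speculative) CPU was seen on the
     * write bus: invalidate our copy and report a RAW violation if we
     * already consumed the old value. */
    bool earlier_cpu_wrote(l1_spec_tags *t)
    {
        bool violation = t->valid && t->spec_read;   /* read too early? */
        t->valid = false;                            /* drop stale copy */
        return violation;                            /* true => restart */
    }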

14
Speculative Reads
  • L1 hit
  • The read bits are set
  • L1 miss
  • L2 and write buffers are checked in parallel
  • The newest bytes written to a line are pulled in
    by priority encoders on each byte (priority A-D);
    a simplified sketch follows
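The fragment below is a simplified software model of that merge (the data structures and names are assumptions): for each byte of the requested line, the most recent write among this CPU's own and all earlier CPUs' buffers wins, and the L2 supplies any byte none of them wrote.

    #include <stdint.h>

    #define LINE_BYTES 32   /* Hydra uses 32-byte cache lines */
    #define NUM_CPUS   4

    /* Hypothetical speculative write-buffer entry for one cache line. */
    typedef struct {
        uint8_t  data[LINE_BYTES];
        uint32_t written;            /* bit b set => byte b was written */
    } wbuf_line;

    /* Build the line image seen by CPU `reader` (buffers indexed in
     * speculative thread order, 0 = head).  Later writers in that order
     * override earlier ones, mimicking the per-byte priority encode. */
    void merge_line(uint8_t out[LINE_BYTES], const uint8_t l2[LINE_BYTES],
                    const wbuf_line bufs[NUM_CPUS], int reader)
    {
        for (int b = 0; b < LINE_BYTES; b++) {
            out[b] = l2[b];                          /* default: L2 data  */
            for (int cpu = 0; cpu <= reader; cpu++)  /* own + earlier CPUs */
                if (bufs[cpu].written & (1u << b))
                    out[b] = bufs[cpu].data[b];
        }
    }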

15
Speculative Writes
  • A CPU writes to its L1 cache and write buffer
  • Writes from earlier CPUs invalidate our L1 and
    trigger RAW hazard checks
  • Writes from later CPUs just pre-invalidate our L1
  • Non-speculative write buffer drains out into the
    L2

16
Speculation Runtime System
  • Software Handlers
  • Control speculative threads through CP2 interface
  • Track order of all speculative threads
  • Exception routines recover from data dependency
    violations
  • Adds more overhead to speculation than a pure
    hardware approach, but is more flexible and
    simpler to implement (see the sketch below)
  • Complete description in "Data Speculation Support
    for a Chip Multiprocessor" (ASPLOS '98) and
    "Improving the Performance of Speculatively
    Parallel Applications on the Hydra CMP" (ICS '99)
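As a hedged illustration of the division of labor (handler names are made up; the real handlers are reached through the CP2 interface mentioned above), the runtime boils down to a few routines wrapped around each speculative region, traced here with stub bodies:

    #include <stdio.h>

    /* Hypothetical runtime handlers; bodies just trace the lifecycle. */
    void spec_start_loop(void)     { puts("assign thread order, enter speculation"); }
    void spec_end_iteration(int i) { printf("commit iteration %d in order\n", i); }
    void spec_violation(int i)     { printf("discard state, restart iteration %d\n", i); }
    void spec_end_loop(void)       { puts("drain write buffers, leave speculation"); }

    int main(void)
    {
        spec_start_loop();
        for (int i = 0; i < 4; i++) {
            /* The iteration body would run here on one of the four CPUs;
             * on a detected dependency violation the exception path
             * calls spec_violation(i) before the iteration can commit. */
            spec_end_iteration(i);
        }
        spec_end_loop();
        return 0;
    }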

17
Creating Speculative Threads
  • Speculative loops
  • for and while loop iterations
  • Typically one speculative thread per iteration
  • Speculative procedures
  • Execute code after procedure speculatively
  • Procedure calls generate a speculative thread
  • Compiler support
  • C source-to-source translator
  • pfor, pwhile
  • Analyze loop body and globalize any local
    variables that could cause loop-carried
    dependencies (see the sketch below)
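A hedged sketch of that transformation (the code and the rewritten form are illustrative assumptions; pfor is spelled as on the slide and shown as a comment so the file stays plain C): a register-allocated local that carries a value between iterations is globalized, so its accesses become loads and stores the speculation hardware can forward and check.

    #define N 64

    /* Before: the accumulator lives in a register, so the loop-carried
     * dependence through it is invisible to memory speculation. */
    int sum_col_before(int a[][N], int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += a[i][0];
        return total;
    }

    /* After (illustrative translator output): the accumulator is
     * globalized, so every access goes through memory and can be
     * forwarded between speculative threads and checked for violations. */
    int total;                           /* globalized local variable */

    int sum_col_after(int a[][N], int n)
    {
        total = 0;
        /* pfor (int i = 0; i < n; i++)  -- one speculative thread per iteration */
        for (int i = 0; i < n; i++)
            total += a[i][0];
        return total;
    }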

18
Base Speculative Thread Performance

  • Entire applications
  • GCC 2.7.2 -O2
  • 4 single-issue processors
  • Accurate modeling of all aspects of Hydra
    architecture and real runtime system

[Chart: speedup (0-4) with the base speculative runtime on wc, ear, ijpeg, grep, alvin, eqntott, mpeg2, simplex, m88ksim, cholesky, compress, and sparse1.3]
19
Improving Speculative Runtime System
  • Procedure support adds overhead to loops
  • Threads are not created sequentially
  • Dynamic thread scheduling necessary
  • Start and end of loop: 75 cycles
  • End of iteration: 80 cycles
  • Performance
  • Best performing speculative applications use
    loops
  • Procedure speculation often lowers performance
  • Need to optimize RTS for common case
  • Lower speculative overheads
  • Start and end of loop: 25 cycles
  • End of iteration: 12 cycles (almost a factor-of-7
    reduction)
  • Limit procedure speculation to specific
    procedures

20
Improved Speculative Performance

  • Improves performance of all applications
  • Most improvement for applications with
    fine-grained threads
  • Eqntott uses procedure speculation

[Chart: speedup (0-4) with the optimized RTS vs. the base RTS on wc, ear, ijpeg, grep, alvin, eqntott, mpeg2, simplex, cholesky, m88ksim, sparse1.3, and compress]
21
Optimizing Parallel Performance
  • Cache coherent shared memory
  • No explicit data movement
  • 100 cycle communication latency
  • Need to optimize for data locality
  • Look at cache misses (MemSpy, Flashpoint)
  • Speculative threads
  • No explicit data independence
  • Frequent dependence violations limit performance
  • Need to optimize to reduce frequency and impact
    of data violations
  • Dependence prediction can help
  • Look at violation statistics (requires some
    hardware support)

22
Feedback and Code Transformations
  • Feedback tool
  • Collects violation statistics (PCs, frequency,
    work lost)
  • Correlates read and write PC values with source
    code
  • Synchronization
  • Synchronize frequently occurring violations (see
    the sketch after this list)
  • Use non-violating loads
  • Code Motion
  • Find dependent load-stores
  • Move loads down in thread
  • Move stores up in thread
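A minimal sketch of the synchronization transformation, assuming a hand-inserted flag (the flag itself, and the note that the wait would use Hydra's non-violating loads, are illustrative): the consuming iteration waits for the producer to signal instead of reading early and being violated almost every time.

    #include <stdbool.h>

    /* Shared between consecutive speculative iterations.  `volatile` is
     * a stand-in here; in Hydra the wait loop would use non-violating
     * loads so the polling itself cannot trigger violations. */
    volatile int  shared_value;
    volatile bool value_ready;

    void producer_iteration(int x)
    {
        shared_value = x;        /* the store that used to cause violations */
        value_ready  = true;     /* signal the next iteration */
    }

    int consumer_iteration(void)
    {
        while (!value_ready)     /* wait instead of reading too early */
            ;                    /* spin */
        return shared_value;     /* now a violation-free read */
    }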

23
Code Motion
  • Rearrange reads and writes to increase
    parallelism
  • Delay reads and advance writes
  • Create local copies to allow earlier data
    forwarding

[Diagram: iterations i and i+1 before and after code motion - the write of x moves earlier in iteration i and the reads of x move later in iteration i+1, shortening the window in which x can be read too early; a code sketch follows]
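A before/after sketch of these transformations on a hypothetical loop body (variable names and the shape of the computation are assumptions): the new value of x is computed into a local copy and stored as early as possible, while remaining uses of x are pushed as late as possible, so iteration i+1 receives the forwarded value sooner.

    /* x is shared across speculative loop iterations (illustrative). */
    int x;

    /* Before: x is read early and written late, so the next iteration
     * almost always reads a stale value and gets squashed. */
    void iteration_before(int in)
    {
        int t = x + in;              /* early read of x  */
        /* ... long computation using t ... */
        x = t;                       /* late write of x  */
    }

    /* After code motion: advance the write by computing the new value
     * into a local copy first, and delay any other read of x to the end
     * of the thread. */
    void iteration_after(int in)
    {
        int local = x + in;          /* local copy of the new value  */
        x = local;                   /* advanced write: forwarded to
                                        iteration i+1 right away     */
        /* ... long computation uses `local`, not x ...              */
    }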
24
Optimized Speculative Performance



  • Base performance
  • Optimized RTS with no manual intervention
  • Violation statistics used to manually transform
    code

[Chart: speedup (0-4) for the three configurations above on wc, ear, grep, alvin, ijpeg, eqntott, mpeg2, simplex, cholesky, m88ksim, compress, and sparse1.3]
25
Size of Speculative Write State
  • Max size determines the write buffer size needed
    for maximum performance
  • A non-head processor stalls when its write buffer
    fills up
  • Small write buffers (< 64 lines) will achieve
    good performance

[Chart: maximum number of 32-byte cache lines of speculative write state per application]
26
Hydra Prototype
  • Design based on Integrated Device Technology
    (IDT) RC32364
  • 88 mm² in a 0.25 µm process with 8 KB I- and
    D-caches and a 128 KB L2 cache

27
Conclusions
  • Hydra offers a new way to design microprocessors
  • Single-chip MP exploits parallelism at all levels
  • Low overhead support for speculative parallelism
  • Provides high performance on applications with
    medium to large-grain parallelism
  • Provides a performance-optimization migration
    path for hard-to-parallelize fine-grained
    applications
  • Prototype Implementation
  • Work out implementation details
  • Provide platform for application and compiler
    development
  • Realistic performance evaluation

28
Hydra Team
  • Team
  • Monica Lam, Lance Hammond, Mike Chen, Ben
    Hubbert, Manohar Prahbu, Mike Siu, Melvyn Lim
    and Maciek Kozyrczak (IDT)
  • URL
  • http://www-hydra.stanford.edu