Revisiting "Multiprocessors Should Support Simple Memory Consistency Models", Mark D. Hill, Multifacet Project (www.cs.wisc.edu/multifacet), Computer Sciences Department

1
Revisiting "Multiprocessors Should Support Simple
Memory Consistency Models"
  • Mark D. Hill
  • Multifacet Project (www.cs.wisc.edu/multifacet)
  • Computer Sciences Department
  • University of Wisconsin-Madison
  • October 2003

2
High- vs. Low-Level Memory Model Interface
Most of This Workshop
This Talk
3
Outline
  • Subroutine Call
  • Value Prediction & Memory Model Subtleties
  • Review Original Paper [Computer, Dec. 1998]
  • Commercial Memory Model Classes
  • Performance Similarities & Differences
  • Predictions & Recommendation
  • Revisit in 2003
  • Revisiting 1998 Predictions
  • "SC + ILP = RC?" Paper
  • Revisiting Commercial Memory Model Classes
  • Analysis, Predictions, & Recommendation

4
Correctly Implementing Value Prediction in
Microprocessors that Support Multithreading or
Multiprocessing
  • Milo M.K. Martin, Daniel J. Sorin, Harold W. Cain,
    Mark D. Hill, and Mikko H. Lipasti
  • Computer Sciences Department
  • Department of Electrical and Computer Engineering
  • University of Wisconsin-Madison

5
Big Picture
  • Naïve value prediction can break concurrent
    systems
  • Microprocessors incorporate concurrency
  • Multithreading (SMT)
  • Multiprocessing (SMP, CMP)
  • Coherent I/O
  • Correctness defined by memory consistency model
  • Comparing predicted value to actual value not
    always OK
  • Different issues for different models
  • Violations can occur in practice
  • Solutions exist for detecting violations

6
Value Prediction
  • Predict the value of an instruction
  • Speculatively execute with this value
  • Later verify that prediction was correct
  • Example: value-predict a load that misses in
    the cache
  • Execute instructions dependent on value-predicted
    load
  • Verify the predicted value when the load data
    arrives
  • Without concurrency, simple verification is OK
  • Compare actual value to predicted
  • Value prediction literature has ignored
    concurrency
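
The predict / speculate / verify steps above can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not a mechanism taken from the talk: the last-value predictor and the names `memory`, `last_seen`, `predict`, and `load_with_value_prediction` are assumptions made for the example.

```python
# Minimal single-threaded sketch of value prediction (all names hypothetical).
memory = {"x": 7}
last_seen = {}   # last-value predictor table: address -> last loaded value


def predict(addr):
    """Predict that a load returns what it returned last time (default 0)."""
    return last_seen.get(addr, 0)


def load_with_value_prediction(addr, dependent_op):
    """Speculate on a predicted value, then verify once the real data arrives."""
    predicted = predict(addr)
    speculative_result = dependent_op(predicted)   # run dependents early
    actual = memory[addr]                          # the load data "arrives"
    last_seen[addr] = actual                       # train the predictor
    if actual == predicted:                        # simple verification: compare
        return speculative_result, True            # prediction held, keep result
    return dependent_op(actual), False             # misprediction: re-execute


first = load_with_value_prediction("x", lambda v: v + 1)
second = load_with_value_prediction("x", lambda v: v + 1)
```

With no other thread writing `memory`, comparing the predicted and actual values is sufficient, which is the case the bullet "Without concurrency" describes.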

7
Informal Example of Problem, part 1
  • Student 2 predicts grades are on bulletin board
    B
  • Based on prediction, assumes score is 60

Bulletin Board B
8
Informal Example of Problem, part 2
  • Professor now posts actual grades for this class
  • Student 2 actually got a score of 80
  • Announces to students that grades are on board B

9
Informal Example of Problem, part 3
  • Student 2 sees prof's announcement and says,
  • "I made the right prediction (bulletin board
    B), and my score is 60!"
  • Actually, Student 2's score is 80
  • What went wrong here?
  • Intuition: predicted value from the future
  • Problem is concurrency
  • Interaction between student and professor
  • Just like multiple threads, processors, or
    devices
  • E.g., SMT, SMP, CMP

10
Linked List Example of Problem (initial state)
  • Linked list with single writer and single reader
  • No synchronization (e.g., locks) needed

[Figure: initial state of list. head points to node A (A.data, A.next); node B (B.data, B.next) is uninitialized.]
11
Linked List Example of Problem (Writer)
  • Writer sets up node B and inserts it into list

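Code For Writer Thread
  W1: store mem[B.data] <- 80       (set up node)
  W2: load  reg0 <- mem[Head]       (set up node)
  W3: store mem[B.next] <- reg0     (set up node)
  W4: store mem[Head] <- B          (insert)

[Figure: list after insertion, with nodes B (B.data, B.next) and A (A.data, A.next); head points to B.]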
12
Linked List Example of Problem (Reader)
  • Reader cache misses on head and value-predicts
    head = B.
  • Cache hits on B.data and reads 60.
  • Later verifies prediction of B. Is this
    execution legal?

Code For Reader Thread (predict head = B)
  R1: load reg1 <- mem[Head]        (returns B)
  R2: load reg2 <- mem[reg1]        (returns 60)

[Figure: list nodes B (B.data, B.next) and A (A.data, A.next); head is shown as "?".]
13
Why This Execution Violates SC
  • Sequential Consistency
  • Simplest memory consistency model
  • Must exist total order of all operations
  • Total order must respect program order at each
    processor
  • Our example execution has a cycle
  • No total order exists

14
Trying to Find a Total Order
  • What orderings are enforced in this example?

Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R2: load reg2 <- mem[reg1]

Code For Writer Thread
  W1: store mem[B.data] <- 80       (set up node)
  W2: load  reg0 <- mem[Head]       (set up node)
  W3: store mem[B.next] <- reg0     (set up node)
  W4: store mem[Head] <- B          (insert)

15
Program Order
  • Must enforce program order

Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R2: load reg2 <- mem[reg1]

Code For Writer Thread
  W1: store mem[B.data] <- 80       (set up node)
  W2: load  reg0 <- mem[Head]       (set up node)
  W3: store mem[B.next] <- reg0     (set up node)
  W4: store mem[Head] <- B          (insert)

16
Data Order
  • If we predict that R1 returns the value B, we
    can violate SC

Code For Reader Thread
  R1: load reg1 <- mem[Head]        (returns B)
  R2: load reg2 <- mem[reg1]        (returns 60)

Code For Writer Thread
  W1: store mem[B.data] <- 80       (set up node)
  W2: load  reg0 <- mem[Head]       (set up node)
  W3: store mem[B.next] <- reg0     (set up node)
  W4: store mem[Head] <- B          (insert)

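
One way to see the violation is to write the program-order and data-order constraints as a directed graph and look for a cycle. The sketch below is illustrative only; the edge list is my reading of the slides (R1 returning B places it after W4, and R2 returning the old 60 places it before W1), and `find_cycle` is a plain depth-first search, not a tool from the talk.

```python
# Edge u -> v means "u must come before v in any total order".
edges = {
    "W1": ["W2"], "W2": ["W3"], "W3": ["W4"],   # writer program order
    "R1": ["R2"],                               # reader program order
    "W4": ["R1"],   # R1 returned B, so it follows the store of B to Head
    "R2": ["W1"],   # R2 returned the old 60, so it precedes the store of 80
}


def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:                      # back edge closes a cycle
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for start in graph:
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


cycle = find_cycle(edges)
no_prediction = find_cycle({**edges, "W4": []})   # drop the W4 -> R1 data edge
```

With the predicted value in place the search finds the cycle W1, W2, W3, W4, R1, R2, W1, so no total order can satisfy every constraint. Dropping the W4 -> R1 edge leaves the graph acyclic.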
17
Value Prediction and Sequential Consistency
  • Key: value prediction reorders dependent
    operations
  • Specifically, read-to-read data dependence order
  • Execute dependent operations out of program order
  • Applies to almost all consistency models
  • Models that enforce data dependence order
  • Must detect when this happens and recover
  • Similar to other optimizations that complicate SC

18
How to Fix SC Implementations
  • Address-based detection of violations
  • Student watches board B between prediction and
    verification
  • Like existing techniques for out-of-order SC
    processors
  • Track stores from other threads
  • If address matches speculative load, possible
    violation
  • Value-based detection of violations
  • Student checks grade again at verification
  • Also an existing idea
  • Replay all speculative instructions at commit
  • Can be done with dynamic verification (e.g., DIVA)
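
The two detection styles can be contrasted on the linked-list example with a toy model. This is a sketch under stated assumptions, not the hardware mechanisms themselves: `SpeculativeReader` and its methods are hypothetical names, and the writer is reduced to the two stores that matter (W1 and W4).

```python
class SpeculativeReader:
    """Toy reader core that remembers loads performed under a value prediction."""

    def __init__(self, mem):
        self.mem = mem
        self.spec_loads = {}     # address -> value observed speculatively
        self.conflict = False    # set by address-based detection

    def speculative_load(self, addr):
        value = self.mem[addr]
        self.spec_loads[addr] = value
        return value

    def observe_remote_store(self, addr):
        """Address-based: another thread's store matches a speculative load."""
        if addr in self.spec_loads:
            self.conflict = True     # possible violation, squash conservatively

    def replay_ok(self):
        """Value-based: re-read every speculative load at commit and compare."""
        return all(self.mem[a] == v for a, v in self.spec_loads.items())


mem = {"Head": "A", "B.data": 60}
reader = SpeculativeReader(mem)

predicted_head = "B"                                      # R1 misses; predict B
reg2 = reader.speculative_load(predicted_head + ".data")  # R2 reads stale 60

for addr, value in [("B.data", 80), ("Head", "B")]:       # writer's W1, then W4
    mem[addr] = value
    reader.observe_remote_store(addr)

prediction_matches = mem["Head"] == predicted_head   # the naive check passes
address_based_violation = reader.conflict
value_based_violation = not reader.replay_ok()
```

In this run the naive comparison accepts the prediction, while both detection styles flag the stale read of B.data. The address-based check is conservative: it would also fire if the remote store happened to write the same value.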

19
Relaxed Consistency Models
  • Relax some orderings between reads and writes
  • Allows HW/SW optimizations
  • Software must add memory barriers to get ordering
  • Intuition: should make value prediction easier
  • Our intuition is wrong

20
Weakly Ordered Consistency Models
  • Relax orderings unless memory barrier between
  • Examples
  • SPARC RMO
  • IA-64
  • PowerPC
  • Alpha
  • Subtle point that affects value prediction
  • Does model enforce data dependence order?

21
Relaxed Models that Enforce Data Dependence
  • Examples: SPARC RMO, PowerPC, and IA-64

Code For Writer Thread
  W1:  store mem[B.data] <- 80      (set up node)
  W2:  load  reg0 <- mem[Head]      (set up node)
  W3:  store mem[B.next] <- reg0    (set up node)
  W3b: Memory Barrier               (orders W4 after W1, W2, W3)
  W4:  store mem[Head] <- B         (insert)

Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R2: load reg2 <- mem[reg1]

22
Violating Consistency Model
  • Simple value prediction can break RMO, PPC, IA-64
  • How? By relaxing dependence order between reads
  • Same issues as for SC and PC

23
Solutions to Problem
  • Don't enforce dependence order (add memory
    barriers)
  • Changes architecture
  • Breaks backward compatibility
  • Not practical
  • Enforce SC or PC
  • Potential performance loss
  • More efficient solutions possible

24
Models that Don't Enforce Data Dependence
  • Example: Alpha
  • Requires extra memory barrier (between R1 and R2)

Code For Writer Thread
  W1:  store mem[B.data] <- 80      (set up node)
  W2:  load  reg0 <- mem[Head]      (set up node)
  W3:  store mem[B.next] <- reg0    (set up node)
  W3b: Memory Barrier
  W4:  store mem[Head] <- B         (insert)

Code For Reader Thread
  R1:  load reg1 <- mem[Head]
  R1b: Memory Barrier
  R2:  load reg2 <- mem[reg1]

25
Issues in Not Enforcing Data Dependence
  • Works correctly with value prediction
  • No detection mechanism necessary
  • Do not need to add any more memory barriers for
    VP
  • Additional memory barriers
  • Non-intuitive locations
  • Added burden on programmer

26
Summary of Memory Model Issues
SC
Relaxed Models
    PC: IA-32, SPARC TSO
    Weakly Ordered Models
        Enforce Data Dependence: IA-64, SPARC RMO
        NOT Enforce Data Dependence: Alpha
27
Conclusions
  • Naïve value prediction can violate consistency
  • Subtle issues for each class of memory model
  • Solutions for SC & PC require detection mechanism
  • Use existing mechanisms for enhancing SC
    performance
  • Solutions for more relaxed memory models
  • Enforce stronger model

28
Outline
  • Subroutine Call
  • Value Prediction & Memory Model Subtleties
  • Review Original Paper [Computer, Dec. 1998]
  • Commercial Memory Model Classes
  • Performance Similarities & Differences
  • Predictions & Recommendation
  • Revisit in 2003
  • Revisiting 1998 Predictions
  • "SC + ILP = RC?" Paper
  • Revisiting Commercial Memory Model Classes
  • Analysis, Predictions, & Recommendation

29
1998 Commercial Memory Model Classes
  • Sequential Consistency (SC)
  • MIPS/SGI
  • HP PA-RISC
  • Processor Consistency (PC)
  • Relax write→read dependencies
  • Intel x86 (a.k.a., IA-32)
  • Sun TSO
  • Relaxed Consistency (RC)
  • Relax all dependencies, but add fences
  • DEC Alpha
  • IBM PowerPC
  • Sun RMO (no implementations)

30
With All Models, Hardware Can
  • Use
  • Coherent Caches
  • Non-binding prefetches
  • Simultaneous & vertical multithreading
  • With Speculative Execution
  • Allow expected misses to prefetch
  • Speculatively perform all reads & writes
  • What's different?

31
Performance Difference
  • RC/PC/SC can do same optimizations
  • But RC/PC can sometimes commit early
  • While SC can lose performance
  • Undoing execution on (suspected) model violation
  • Stalls due to full instruction windows, etc.
  • Performance over SC [Ranganathan et al. 1997]
  • 11% for PC
  • 20% for RC
  • Closer if SC uses their Speculative Retirement

32
1998 Predictions & Recommendation
  • My Performance Gap Predictions
  • Longer (relative) memory latency
  • Larger caches, bigger windows, etc.
  • New inventions
  • My Recommendation
  • Implement SC (or PC)
  • Keep interface simple
  • Innovate in implementation

33
Outline
  • Subroutine Call
  • Value Prediction & Memory Model Subtleties
  • Review Original Paper [Computer, Dec. 1998]
  • Commercial Memory Model Classes
  • Performance Similarities & Differences
  • Predictions & Recommendation
  • Revisit in 2003
  • Revisiting 1998 Predictions
  • "SC + ILP = RC?" Paper
  • Revisiting Commercial Memory Model Classes
  • Analysis, Predictions, & Recommendation

34
Revisiting Predictions
  • Evolutionary Predictions (happened, but on balance made gap bigger)
  • Longer (relative) memory latency
  • Larger caches
  • Larger instruction windows
  • New Inventions
  • Run-ahead & helper threads: wonderful prefetching
  • SMT commercialized: many threads per processor
  • Chip Multiprocessors (CMPs): many threads per chip
  • "SC + ILP = RC?": can close gap

Relaxed HW memory model offers little more performance
35
"SC + ILP = RC?", 1999
  • Challenge
  • "Hill, however, argues that with current trends
    toward larger levels of on-chip integration,
    sophisticated microarchitectural innovation, and
    larger caches, the performance gap between memory
    models should eventually vanish."
  • Response
  • "This paper confirms Hill's conjecture by showing,
    for the first time, that an SC implementation can
    perform as well as an RC implementation if
    hardware provides enough support for speculation."
  • Deep history buffer write speculative stores
    into cache
  • Filter table to detect conflicts on snoops

36
2003 Commercial Memory Model Classes
  • Sequential Consistency (SC)
  • MIPS/SGI
  • HP PA-RISC
  • Processor Consistency (PC)
  • Relax write→read dependencies
  • Intel x86 (a.k.a., IA-32)
  • Sun TSO
  • Relaxed Consistency (RC)
  • Relax all dependencies, but add fences
  • DEC Alpha
  • IBM PowerPC
  • Sun RMO (no implementations)
  • Intel IPF (IA-64)

37
Current Analysis
  • Architectures changed mostly for business reasons
  • No one substantially changed model class
  • Clearly, all three classes work
  • E.g., generating fences not too bad

38
Current Options
  • Assume Relaxed HLL model → Three HW Model Options
  • Expose SC/PC & Implement SC/PC
  • Add SC/PC mechanisms & speculate! (somewhat
    complex)
  • HW implementers & verifiers know what correct is
  • Expose Relaxed & Implement Relaxed
  • Many HW implementers & verifiers don't understand
    relaxed
  • More performance?
  • Deep speculation requires HW to pass fences
  • Run-ahead: throw all away?
  • Speculative execution with SC/PC-like mechanisms?
  • Expose Relaxed & Implement SC/PC
  • Implement fences as no-ops
  • Use SC/PC mechanisms, speculate!
  • HW implementers & verifiers know what correct is

39
Predictions & Recommendation
  • Predictions
  • Longer (relative) memory latency
  • Only partially compensated by caches, etc.
  • Will speculate further without larger windows
    (run-ahead)
  • Will need to speculate past synchronization
    fences
  • Use CMPs to get many outstanding misses per chip
  • Recommendations (unrepentant)
  • Implement SC (or PC)
  • Keep interface simple
  • Innovate in implementation

40
Outline
  • Subroutine Call
  • Value Prediction & Memory Model Subtleties
  • Review Original Paper [Computer, Dec. 1998]
  • High- vs. Low-Level Memory Models
  • Commercial Memory Model Classes
  • Performance Similarities & Differences
  • Predictions & Recommendation
  • Revisit in 2003
  • Revisiting 1998 Predictions
  • "SC + ILP = RC?" Paper
  • Revisiting Commercial Memory Model Classes
  • Analysis, Predictions, & Recommendation