Computer architecture II - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Computer architecture II


1
Computer architecture II
  • Lecture 9

2
Today
  • Synchronization for SMM (shared-memory multiprocessors)
  • Test&set, LL/SC, array-based locks
  • Barriers
  • Scalable multiprocessors
  • What is a scalable machine?

3
Synchronization
  • Types of Synchronization
  • Mutual Exclusion
  • Event synchronization
  • point-to-point
  • group
  • global (barriers)
  • All solutions rely on hardware support for an
    atomic read-modify-write operation
  • We look today at synchronization for
    cache-coherent, bus-based multiprocessors

4
Components of a Synchronization Event
  • Acquire method
  • Acquire right to the synch (e.g. enter critical
    section)
  • Waiting algorithm
    Wait for synch to become available when it isn't
  • busy-waiting, blocking, or hybrid
  • Release method
  • Enable other processors to acquire

5
Performance Criteria for Synch. Ops
  • Latency (time per op)
  • especially under light contention
  • Bandwidth (ops per sec)
  • especially under high contention
  • Traffic
  • load on critical resources
  • especially on failures under contention
  • Storage
  • Fairness

6
Strawman Lock
Busy-Waiting
    lock:    ld  register, location    /* copy location to register */
             cmp location, 0           /* compare with 0 */
             bnz lock                  /* if not 0, try again */
             st  location, 1           /* store 1 to mark it locked */
             ret                       /* return control to caller */

    unlock:  st  location, 0           /* write 0 to location */
             ret                       /* return control to caller */

Location is initially 0. Why doesn't the acquire
method work?
7
Atomic Instructions
  • Specifies a location, register, atomic
    operation
  • Value in location read into a register
  • Another value (function of value read or not)
    stored into location
  • Many variants
  • Varying degrees of flexibility in second part
  • Simple example: test&set
  • Value in location read into a specified register
  • Constant 1 stored into location
  • Successful if value loaded into register is 0
  • Other constants could be used instead of 1 and 0

8
Simple Test&Set Lock

    lock:    t&s register, location    /* copy location to register, set location to 1 */
             bnz lock                  /* if not 0, try again */
             ret                       /* return control to caller */

    unlock:  st  location, 0           /* write 0 to location */
             ret                       /* return control to caller */

  • The same lock code in pseudocode:

    while (not acquired)               /* lock is held by another processor */
        test&set(location);            /* try to acquire the lock */

  • Condition: architecture supports atomic test&set
  • copy location to register and set location to 1
  • Problem
  • t&s modifies the variable location in its cache
    each time it tries to acquire the lock => cache
    block invalidations => bus traffic (especially
    under high contention); see the C sketch below
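
The following is a minimal C sketch (not from the original slides) of a
test&set spin lock, using the GCC/Clang __atomic_test_and_set builtin as the
atomic read-modify-write; the type and function names are illustrative.

    /* Minimal test&set spin lock, assuming GCC/Clang __atomic builtins. */
    typedef struct { volatile unsigned char locked; } ts_lock;  /* 0 = free, 1 = held */

    static void ts_acquire(ts_lock *l) {
        /* Atomically write 1 and return the old value; retry while it was already 1.
           Every retry writes the lock's cache block, hence the bus traffic noted above. */
        while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
            ;
    }

    static void ts_release(ts_lock *l) {
        __atomic_clear(&l->locked, __ATOMIC_RELEASE);            /* write 0 to mark it free */
    }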

9
T&S Lock Microbenchmark: SGI Challenge

[Figure: time per lock acquisition (μs) versus number of processors for the
microbenchmark "lock; delay(c); unlock;". Curves: Test&set (c = 0), Test&set
with exponential backoff (c = 3.64 μs), Test&set with exponential backoff
(c = 0), and Ideal.]

  • Why does performance degrade?
  • Bus transactions on t&s

10
Other read-modify-write primitives
  • Fetch&op
  • Atomically read, modify (by the op operation), and
    write a memory location
  • e.g. fetch&add, fetch&incr
  • Compare&swap
  • Three operands: location, register to compare
    with, register to swap with

11
Enhancements to Simple Lock
  • Problem of t&s: lots of invalidations if the lock
    cannot be taken
  • Reduce frequency of issuing test&sets while
    waiting
  • Test&set lock with exponential backoff
    (a C sketch follows below):

    i = 0;
    while (!acquired) {            /* lock is held by another processor */
        test&set(location);        /* try to acquire the lock */
        if (!acquired) {           /* test&set didn't succeed */
            wait(t_i);             /* sleep for a delay that grows exponentially with i */
            i++;
        }
    }

  • Fewer invalidations
  • May wait longer than necessary
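
The backoff loop above can be sketched in C as follows (illustrative,
assuming the ts_lock type and __atomic_test_and_set builtin from the earlier
sketch; the base delay and cap are arbitrary choices):

    #include <time.h>

    /* Test&set with exponential backoff: sleep for a geometrically growing
       delay after each failed attempt instead of hammering the bus. */
    static void ts_acquire_backoff(ts_lock *l) {
        long delay_ns = 100;                             /* illustrative base delay */
        while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE)) {
            struct timespec d = { 0, delay_ns };
            nanosleep(&d, NULL);                         /* back off before retrying */
            if (delay_ns < 1000000)                      /* cap the delay at ~1 ms */
                delay_ns *= 2;
        }
    }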

12
T&S Lock Microbenchmark: SGI Challenge

[Figure (same graph as slide 9): time per lock acquisition (μs) versus number
of processors for the microbenchmark "lock; delay(c); unlock;". Curves:
Test&set (c = 0), Test&set with exponential backoff (c = 3.64 μs), Test&set
with exponential backoff (c = 0), and Ideal.]

  • Why does performance degrade?
  • Bus transactions on t&s

13
Enhancements to Simple Lock
  • Reduce frequency of issuing test&sets while
    waiting
  • Test-and-test&set lock (a C sketch follows below):

    while (!acquired) {            /* lock is held by another processor */
        if (location == 1)         /* test with ordinary load */
            continue;
        else {
            test&set(location);
            if (acquired)          /* succeeded */
                break;
        }
    }

  • Keep testing with ordinary load
  • Just a hint: the cached lock variable will be
    invalidated when the release occurs
  • If location becomes 0, use t&s to modify the
    variable atomically
  • If that fails, start over
  • Further reduces the bus transactions
  • load produces bus traffic only when the lock is
    released
  • t&s produces bus traffic each time it is executed
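
A hedged C sketch of test-and-test&set, again assuming the ts_lock type and
__atomic_test_and_set builtin introduced earlier:

    /* Test-and-test&set: spin with ordinary loads on the (cached) lock word
       and attempt the atomic test&set only when the lock appears free. */
    static void tts_acquire(ts_lock *l) {
        for (;;) {
            while (l->locked)          /* ordinary load; hits in cache while lock is held */
                ;
            if (!__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
                return;                /* t&s succeeded: lock acquired */
            /* otherwise another processor won the race; resume spinning on the load */
        }
    }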

14
Lock performance
15
Improved Hardware Primitives LL-SC
  • Goals
  • Problem of test&set: it generates a lot of bus traffic
  • Failed read-modify-write attempts don't generate
    invalidations
  • Nice if a single primitive can implement a range of
    r-m-w operations
  • Load-Locked (or -Linked), Store-Conditional
  • LL reads variable into register
  • Work on the value in the register
  • SC tries to store back to location
  • succeeds if and only if no other write to the
    variable since this processor's LL
  • indicated by a condition flag
  • If SC succeeds, all three steps happened
    atomically
  • If it fails, it doesn't write or generate invalidations
  • must retry the acquire

16
Simple Lock with LL-SC
    lock:    ll   reg1, location   /* LL location to reg1 */
             sc   location, reg2   /* SC reg2 into location */
             beqz reg2, lock       /* if failed, start again */
             ret

    unlock:  st   location, 0      /* write 0 to location */
             ret

  • Can simulate the atomic ops t&s, fetch&op,
    compare&swap by changing what's between LL & SC
    (exercise; see the sketch below)
  • Only a couple of instructions, so SC is likely to
    succeed
  • Don't include instructions that would need to be
    undone (e.g. stores)
  • SC can fail (without putting a transaction on the bus)
    if it
  • detects an intervening write even before trying to
    get the bus
  • tries to get the bus but another processor's SC gets
    the bus first
  • LL, SC are not lock, unlock respectively
  • Only guarantee no conflicting write to the lock
    variable between them
  • But can be used directly to implement simple
    operations on shared variables
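
As a follow-up to the "simulate the atomic ops" exercise, here is a hedged
C11 sketch of fetch&add built from an optimistic read / compare-and-swap
retry loop; it mirrors the LL/SC structure (read, modify in a register,
conditional store), but uses atomic_compare_exchange_weak as a software
analogue rather than the LL/SC instructions themselves:

    #include <stdatomic.h>

    /* fetch&add via an optimistic retry loop: read ("LL"), compute the new
       value in a register, store it only if the location is unchanged ("SC"). */
    static int fetch_and_add(atomic_int *location, int increment) {
        int old = atomic_load(location);
        while (!atomic_compare_exchange_weak(location, &old, old + increment))
            ;   /* on failure, old is reloaded with the current value; retry */
        return old;
    }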

17
Advanced lock algorithms
  • Problems with the presented approaches
  • Unfair: the order of arrival does not count
  • All processors try to acquire the lock when it is
    released
  • Many processors may incur a read miss when the
    lock is released
  • Desirable: only one miss

18
Ticket Lock
  • Draw a ticket with a number, wait until that
    number is shown
  • Two counters per lock (next_ticket, now_serving)
  • Acquire: fetch&inc next_ticket; wait for
    now_serving == next_ticket (see the C sketch below)
  • atomic op when arriving at the lock, not when it is
    freed (so less contention)
  • Release: increment now_serving
  • Performance
  • low latency under low contention
  • O(p) read misses at release, since all spin on the
    same variable
  • FIFO order
  • like the simple LL-SC lock, but no invalidation when
    SC succeeds, and fair
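
A minimal C11 sketch of the ticket lock (illustrative names, sequentially
consistent atomics used for brevity):

    #include <stdatomic.h>

    /* Ticket lock: fetch&inc draws a ticket; spin until now_serving reaches it. */
    typedef struct {
        atomic_uint next_ticket;    /* next ticket to hand out */
        atomic_uint now_serving;    /* ticket currently allowed to proceed */
    } ticket_lock;                  /* initialise both counters to 0 */

    static void ticket_acquire(ticket_lock *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);  /* one atomic op per acquire */
        while (atomic_load(&l->now_serving) != my)
            ;                       /* spin with ordinary reads on the shared counter */
    }

    static void ticket_release(ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);                /* admit the next ticket holder */
    }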

19
Array-based Queuing Locks
  • Waiting processes poll on different locations in
    an array of size p (see the C sketch below)
  • Acquire
  • fetch&inc to obtain the address on which to spin
    (next array element)
  • ensure that these addresses are in different
    cache lines or memories
  • Release
  • set the next location in the array, thus waking up the
    process spinning on it
  • O(1) traffic per acquire with coherent caches
  • FIFO ordering, as in ticket lock, but O(p) space
    per lock
  • Not so great for non-cache-coherent machines with
    distributed memory
  • the array location I spin on is not necessarily in my
    local memory
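
An illustrative C11 sketch of an array-based queuing lock; the processor
count, padding, and names are assumptions made for the example:

    #include <stdatomic.h>
    #define NPROCS 64                 /* illustrative maximum number of waiters */

    /* Each waiter spins on its own flag, padded to its own cache line.
       Initialise slots[0].must_wait = 0 and all other slots to 1. */
    typedef struct {
        struct { atomic_int must_wait; char pad[60]; } slots[NPROCS];
        atomic_uint next_slot;        /* fetch&inc hands each arrival its own slot */
    } array_lock;

    static unsigned array_acquire(array_lock *l) {
        unsigned slot = atomic_fetch_add(&l->next_slot, 1) % NPROCS;
        while (atomic_load(&l->slots[slot].must_wait))
            ;                         /* spin only on this processor's own location */
        return slot;                  /* caller passes the slot back to release */
    }

    static void array_release(array_lock *l, unsigned slot) {
        atomic_store(&l->slots[slot].must_wait, 1);                 /* re-arm my slot */
        atomic_store(&l->slots[(slot + 1) % NPROCS].must_wait, 0);  /* wake the successor */
    }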

20
Lock performance
21
Point to Point Event Synchronization
  • Software methods
  • Busy-waiting: use ordinary variables as flags
    (see the C sketch below)
  • Blocking: semaphores
  • Interrupts
  • Full hardware support: a full-empty bit with each
    word in memory
  • Set when the word is full with newly produced data
    (i.e. when written)
  • Unset when the word is empty due to being consumed
    (i.e. when read)
  • Natural for word-level producer-consumer
    synchronization
  • producer: write if empty, set to full
  • consumer: read if full, set to empty
  • Hardware preserves read or write atomicity
  • Problem: flexibility
  • multiple consumers
  • multiple updates by a producer
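
A minimal sketch of the software flag method (busy-waiting on an ordinary
shared variable), written with C11 atomics; the names are illustrative:

    #include <stdatomic.h>

    int data;                     /* value passed from producer to consumer */
    atomic_int ready;             /* flag: 0 = empty, 1 = full; initially 0 */

    void produce(int value) {
        data = value;                                              /* produce the data */
        atomic_store_explicit(&ready, 1, memory_order_release);    /* then raise the flag */
    }

    int consume(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                      /* busy-wait on the flag */
        return data;                                               /* data is now visible */
    }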

22
Barriers
  • Hardware barriers
  • Wired-AND line separate from address/data bus
  • Set your input to 1 when you arrive, wait for the
    output to be 1 before leaving
  • Useful when barriers are global and very frequent
  • Difficult to support arbitrary subset of
    processors
  • even harder with multiple processes per processor
  • Difficult to dynamically change number and
    identity of participants
  • e.g. the latter due to process migration
  • Not common today on bus-based machines
  • Software algorithms implemented using locks,
    flags, counters

23
A Simple Centralized Barrier
  • Shared counter maintains number of processes that
    have arrived
  • increment when arrive (lock), check until reaches
    numprocs
  • Problem?

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;             /* reset flag if first to reach */
    mycount = ++bar_name.counter;      /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = 1;             /* release waiters */
    }
    else
        while (bar_name.flag == 0) {}  /* busy wait for release */
}
24
A Working Centralized Barrier
  • Consecutively entering the same barrier doesn't
    work
  • Must prevent a process from entering until all have
    left the previous instance
  • Could use another counter, but that increases latency
    and contention
  • Sense reversal: wait for the flag to take a different
    value in consecutive instances
  • Toggle this value only when all processes have arrived

BARRIER (bar_name, p) {
    local_sense = !(local_sense);          /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;          /* mycount is private */
    if (bar_name.counter == p) {           /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;
        bar_name.flag = local_sense;       /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {}  /* busy wait for release */
    }
}

25
Centralized Barrier Performance
  • Latency
  • critical path length at least proportional to p
    (the accesses to the critical region are
    serialized by the lock)
  • Traffic
  • p bus transactions to obtain the lock
  • p bus transactions to modify the counter
  • 2 bus transactions for the last processor to reset
    the counter and release the waiting processes
  • p-1 bus transactions for the first p-1 processors
    to read the flag
  • Storage cost
  • very low: centralized counter and flag
  • Fairness
  • the same processor should not always be last to exit
    the barrier
  • Key problems for a centralized barrier are latency
    and traffic
  • especially with distributed memory, traffic goes
    to the same node

26
Improved Barrier Algorithms for a Bus
  • Software combining tree
  • Only k processors access the same location, where
    k is the degree of the tree (k = 2 in the example
    figure)
  • Separate arrival and exit trees, and use sense
    reversal
  • Valuable in a distributed network: communicate
    along different paths
  • On a bus, all traffic goes on the same bus, and there is
    no less total traffic
  • Higher latency (log p steps of work, and O(p)
    serialized bus transactions)
  • Advantage on a bus is the use of ordinary reads/writes
    instead of locks

27
Scalable Multiprocessors
28
Scalable Machines
  • Scalability: the capability of a system to grow by
    adding processors, memory, and I/O devices
  • 4 important aspects of scalability
  • bandwidth increases with the number of processors
  • latency does not increase, or increases slowly
  • cost increases slowly with the number of processors
  • physical placement of resources

29
Limited Scaling of a Bus
Characteristic              Bus
--------------------------  --------------------------
Physical length             1 ft
Number of connections       fixed
Maximum bandwidth           fixed
Interface to comm. medium   extended memory interface
Global order                arbitration
Protection                  virtual -> physical
Trust                       total
OS                          single
Comm. abstraction           HW
  • Small configurations are cost-effective

30
Workstations in a LAN?
Characteristic              Bus                   LAN
--------------------------  --------------------  -----------
Physical length             1 ft                  KM
Number of connections       fixed                 many
Maximum bandwidth           fixed                 ???
Interface to comm. medium   memory interface      peripheral
Global order                arbitration           ???
Protection                  virtual -> physical   OS
Trust                       total                 none
OS                          single                independent
Comm. abstraction           HW                    SW
  • No clear limit to physical scaling, little trust,
    no global order
  • Independent failure and restart

31
Bandwidth Scalability
  • Bandwidth limitation: a single set of wires
  • Must have many independent wires (remember
    bisection width?) => switches

32
Dancehall MP Organization
  • Network bandwidth demand scales linearly with the
    number of processors
  • Latency increases with the number of stages of
    switches (remember the butterfly?)
  • Adding local memory would offer fixed latency

33
Generic Distributed Memory Multiprocessor
  • Most common structure

34
Bandwidth scaling requirements
  • Large number of independent communication paths
    between nodes => large number of concurrent
    transactions using different wires
  • Independent transactions
  • No global arbitration
  • Effect of a transaction only visible to the nodes
    involved
  • Broadcast is difficult (it was easy on a bus):
    additional transactions needed

35
Latency Scaling
  • T(n) = Overhead + Channel Time (channel occupancy)
    + Routing Delay + Contention Time (worked example
    below)
  • Overhead: processing time in initiating and
    completing a transfer
  • Channel Time(n) = n/B (message of n bytes over
    channels of bandwidth B)
  • Routing Delay(h, n): a function of the number of
    hops h and the message size n
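
As a worked example with purely illustrative numbers (not from the slides):
for an n = 1000-byte message over B = 100 MB/s channels, with 10 μs of
send/receive overhead, 5 hops of 0.2 μs each, and no contention,
T = 10 μs + 1000/100 μs + 5 x 0.2 μs = 21 μs. The channel and routing terms
grow with message size and machine size, while the overhead term does not.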

36
Cost Scaling
  • Cost(p, m) = fixed cost + incremental cost(p, m)
  • Bus-based SMP
  • add more processors and memory
  • Scalable machines
  • processors, memory, network
  • Parallel efficiency(p) = Speedup(p) / p
  • Costup(p) = Cost(p) / Cost(1)
  • Cost-effective: Speedup(p) > Costup(p)

37
Cost Effective?
  • 2048 processors: 475-fold speedup at 206x the cost
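  • In terms of the previous slide's definitions:
    Costup(2048) = 206 and Speedup(2048) = 475, so
    Speedup > Costup and the configuration is
    cost-effective, even though parallel efficiency is
    only 475/2048, roughly 0.23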

38
Physical Scaling
  • Chip-level integration
  • Board-level
  • System level

39
Chip-level integration nCUBE/2
1024 Nodes
  • Network integrated onto the chip: 14 bidirectional
    links => up to 8192 nodes
  • Entire machine synchronous at 40 MHz

40
Board level integration CM-5
  • Use standard microprocessor components
  • Scalable network interconnect

41
System Level Integration
  • Loose packaging
  • IBM SP2
  • Cluster blades