Title: Computer Architecture
1. Computer Architecture
[Title slide photo: Iolanthe II racing in Waitemata Harbour]
2. Classification of Parallel Processors
- Flynn's Taxonomy
- Classifies according to instruction and data stream
- Single Instruction Single Data
- Sequential processors
- Single Instruction Multiple Data
- CM-2 multiple small processors
- Vector processors
- Parts of commercial processors - MMX, Altivec
- Multiple Instruction Single Data
- ?
- Multiple Instruction Multiple Data
- General Parallel Processors
3. MIMD Systems
- Recipe
- Buy a few high performance commercial PEs
- DEC Alpha
- MIPS R10000
- UltraSPARC
- Pentium?
- Put them together with some memory and peripherals on a common bus
- Instant parallel processor!
- How to program it?
4. Programming Model
- Problem not unique to MIMD
- Even sequential machines need one
- von Neumann (stored program) model
- Parallel: splitting the workload
- Data
- Distribute data to PEs
- Instructions
- Distribute tasks to PEs
- Synchronization
- Having divided the data and tasks, how do we synchronize the tasks?
5. Programming Model - Shared Memory Model
- Shared Memory Model
- Flavour of the year
- Generally thought to be simplest to manage
- All PEs see a common (virtual) address space
- PEs communicate by writing into the common address space
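As a rough illustration of this model (not from the slides), the sketch below uses POSIX threads as the "PEs": both threads share one address space and communicate simply by writing to a common variable. The names shared_x and producer are invented for the example.

```c
#include <pthread.h>
#include <stdio.h>

static int shared_x;                      /* lives in the common address space */

static void *producer(void *arg) {
    shared_x = 200;                       /* one "PE" writes into shared memory */
    return NULL;
}

int main(void) {
    pthread_t pe0;
    pthread_create(&pe0, NULL, producer, NULL);
    pthread_join(pe0, NULL);              /* crude synchronisation, just for the sketch */
    printf("other PE reads %d\n", shared_x);  /* the other "PE" sees the same location */
    return 0;
}
```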
6. Data Distribution
- Trivial
- All the data sits in the common address space
- Any PE can access it!
- Uniform Memory Access (UMA) systems
- All PEs access all data with the same access time (t_acc)
- Non-UMA (NUMA) systems
- Memory is physically distributed
- Some PEs are closer to some addresses
- More later!
7. Synchronisation
- Read static shared data
- No problem!
- Update problem
- PE0 writes x
- PE1 reads x
- How to ensure that PE1 reads the last value written by PE0?
- Semaphores
- Lock resources (memory areas or ...) while being updated by one PE
8. Synchronisation
- Semaphore
- Data structure in memory
- Count of waiters
- -1: resource free
- ≥ 0: resource in use (count = number of waiters)
- Pointer to list of waiters
- Two operations
- Wait
- Proceed immediately if the resource is free (waiter count -1)
- Notify
- Advise the semaphore that you have finished with the resource
- Decrement the waiter count
- First waiter will be given control
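A minimal sketch of the data structure just described, assuming placeholder queue and scheduler helpers (enqueue, dequeue, block, wake and the tcb_t type are hypothetical, not a real kernel API). The read-modify-write of count is what must be made atomic on the following slides.

```c
typedef struct tcb { struct tcb *next; /* ... task state ... */ } tcb_t;

typedef struct {
    int    count;      /* -1: resource free; >= 0: in use, value = number of waiters */
    tcb_t *waiters;    /* pointer to list of waiting tasks */
} semaphore_t;

/* Hypothetical scheduler / queue helpers. */
void enqueue(tcb_t **list, tcb_t *t);
tcb_t *dequeue(tcb_t **list);
void block(tcb_t *t);
void wake(tcb_t *t);

void sem_wait(semaphore_t *s, tcb_t *self) {
    s->count++;                      /* -1 -> 0: we now own the resource      */
    if (s->count > 0) {              /* already in use: join the waiter list  */
        enqueue(&s->waiters, self);
        block(self);                 /* sleep until a notify hands us control */
    }
}

void sem_notify(semaphore_t *s) {
    s->count--;                      /* 0 -> -1: resource becomes free        */
    if (s->count >= 0) {             /* someone is waiting                    */
        wake(dequeue(&s->waiters));  /* first waiter is given control         */
    }
}
```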
9. Semaphores - Implementation
- Scenario
- Semaphore free (-1)
- PE0 wait ..
- Resource free, so PE0 uses it (sets 0)
- PE1 wait ..
- Reads count (0)
- Starts to increment it ..
- PE0 notify ..
- Gets bus and writes -1
- PE1 (finishing wait)
- Adds 1 to 0, writes 1 to the count, adds PE1's TCB to the list
- Stalemate!
- Who issues notify to free the resource?
10. Atomic Operations
- Problem
- PE0 wrote a new value (-1) after PE1 had read the counter
- PE1 increments the value it read (0) and writes it back
- Solution
- PE1's read and update must be atomic
- No other PE must gain access to the counter while PE1 is updating it
- Usually an architecture will provide
- Test and set instruction
- Read a memory location, test it; if it's 0, write a new value, else do nothing
- Atomic or indivisible .. no other PE can access the value until the operation is complete
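A sketch of how such an instruction is used, expressed with the C11 atomic_flag primitive (my choice, not from the slides): atomic_flag_test_and_set reads, tests and writes the flag as one indivisible operation, so the counter update below cannot be interleaved between PEs.

```c
#include <stdatomic.h>

static atomic_flag guard = ATOMIC_FLAG_INIT;   /* clear = free, set = locked */
static int counter;                            /* e.g. the semaphore's waiter count */

void guarded_increment(void) {
    /* Test-and-set: returns the previous value and sets the flag atomically. */
    while (atomic_flag_test_and_set(&guard))
        ;                                      /* another PE holds the guard: retry */
    counter++;                                 /* the read-modify-write is now safe */
    atomic_flag_clear(&guard);                 /* release the guard */
}
```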
11. Atomic Operations
- Test and Set
- Read a memory location, test it; if it's 0, write a new value, else do nothing
- Can be used to guard a resource
- When the location contains 0, access to the resource is allowed
- A non-zero value means the resource is locked
- Semaphore
- Simple semaphore (no wait list)
- Implement directly
- Waiter backs off and tries again (rather than being queued) - see the sketch after this list
- Complex semaphore (with wait list)
- Test and set guards the wait counter
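The "simple semaphore" above can be sketched directly as a spin lock in which the waiter backs off and retries; sched_yield is used as an arbitrary back-off and the function names are illustrative.

```c
#include <sched.h>
#include <stdatomic.h>

static atomic_flag resource_lock = ATOMIC_FLAG_INIT;    /* clear: resource free */

void acquire_resource(void) {
    /* Test-and-set on the guard location; set means locked. */
    while (atomic_flag_test_and_set(&resource_lock))
        sched_yield();                                   /* back off and try again */
}

void release_resource(void) {
    atomic_flag_clear(&resource_lock);                   /* unlock */
}
```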
12. Atomic Operations
- The processor must provide an atomic operation for
- Multi-tasking or multi-threading on a single PE
- Multiple processes
- Interrupts occur at arbitrary points in time
- including timer interrupts signalling the end of a time-slice
- Any process can be interrupted in the middle of a read-modify-write sequence
- Shared memory multi-processors
- One PE can lose control of the bus after the read of a read-modify-write
- Cache?
- Later!
13. Atomic Operations
- Variations
- Provide equivalent capability
- Sometimes appear in strange guises!
- Read-modify-write bus transactions
- Memory location is read, modified and written back as a single, indivisible operation
- Test and exchange
- Check a register's value; if 0, exchange it with memory
- Reservation Register (PowerPC)
- lwarx - load word and reserve indexed
- stwcx. - store word conditional indexed
- The reservation register stores the address of the reserved word
- The reservation and its use can be separated by a sequence of instructions
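The reservation idea can be sketched in portable C as a load / compare-and-swap retry loop; on PowerPC a compiler implements atomics of this kind with a lwarx / stwcx. pair, retrying when the reservation is lost. This illustrates the pattern rather than showing actual PowerPC code.

```c
#include <stdatomic.h>

void atomic_increment(atomic_int *counter) {
    int old = atomic_load(counter);     /* like lwarx: read the word (and reserve it) */
    /* Like stwcx.: the store succeeds only if no other PE wrote the word
     * in between; on failure 'old' is reloaded and the loop retries. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;
}
```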
14. Synchronization - High Level
15. Barriers
- In a shared memory environment
- PEs must know when another PE has produced a result
- Simplest case: a barrier for all PEs
- Must be inserted by the programmer
- Potentially expensive
- All PEs stall and waste time in the barrier
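A minimal sketch of a programmer-inserted barrier using the POSIX pthread_barrier_t primitive (my choice of API; the worker body is a placeholder).

```c
#include <pthread.h>

#define NUM_PES 4
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    /* ... produce this PE's partial result ... */
    pthread_barrier_wait(&barrier);     /* every PE stalls here until all arrive */
    /* ... now safe to read results produced by the other PEs ... */
    return NULL;
}

int main(void) {
    pthread_t t[NUM_PES];
    pthread_barrier_init(&barrier, NULL, NUM_PES);
    for (long i = 0; i < NUM_PES; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (long i = 0; i < NUM_PES; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```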
16. PE-PE Synchronization
- Barriers are global and potentially wasteful
- A small group of PEs (a subset of the total) may be working on a sub-task
- Need to synchronize within the group
- Steps
- Allocate a semaphore (it's just a block of memory)
- PEs within the group access a shared location guarded by this semaphore
- e.g.
- the shared location is a count of the PEs which have completed their tasks
- each PE increments the count when it completes
- the master monitors the count until all PEs have finished
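A sketch of the recipe above, assuming a pthreads mutex stands in for the allocated semaphore; done_count, pe_task_finished and group_finished are names invented for the example.

```c
#include <pthread.h>

#define GROUP_SIZE 3                      /* PEs in the sub-task group */

static pthread_mutex_t group_lock = PTHREAD_MUTEX_INITIALIZER;
static int done_count;                    /* shared location guarded by the lock */

void pe_task_finished(void) {             /* each PE calls this when it completes */
    pthread_mutex_lock(&group_lock);
    done_count++;
    pthread_mutex_unlock(&group_lock);
}

int group_finished(void) {                /* the master polls this */
    pthread_mutex_lock(&group_lock);
    int done = (done_count == GROUP_SIZE);
    pthread_mutex_unlock(&group_lock);
    return done;
}
```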
17. Cache
- Performance of a modern PE depends on the cache(s)!
18. Cache?
- What happens to cached locations?
19. Multiple Caches
- Coherence
- PEA reads location x from memory
- Copy in cache A
- PEB reads location x from memory
- Copy in cache B
- PEA adds 1
20. Multiple Caches - Inconsistent States
- Coherence
- PEA reads location x from memory
- Copy in cache A
- PEB reads location x from memory
- Copy in cache B
- PEA adds 1
- A's copy is now 201
- PEB reads location x
- Reads 200 from cache B!!
21. Multiple Caches - Inconsistent States
- Coherence
- PEA reads location x from memory
- Copy in cache A
- PEB reads location x from memory
- Copy in cache B
- PEA adds 1
- A's copy is now 201
- PEB reads location x
- Reads 200 from cache B
- Caches and memory are now inconsistent, or not coherent
22. Cache - Maintaining Coherence
- Invalidate on write
- PEA reads location x from memory
- Copy in cache A
- PEB reads location x from memory
- Copy in cache B
- PEA adds 1
- A's copy is now 201
- PEA issues "invalidate x"
- Cache B marks x invalid
- The invalidate is an address-only transaction
23. Cache - Maintaining Coherence
- Reading the new value
- PEB reads location x
- Main memory is wrong also
- PEA snoops the read
- Realises it has the valid copy
- PEA issues a retry
24. Cache - Maintaining Coherence
- Reading the new value
- PEB reads location x
- Main memory is wrong also!
- PEA snoops the read
- Realises it has the valid copy
- PEA issues a retry
- PEA writes x back
- Memory is now correct
- PEB reads location x again
- Reads the latest version
25. Coherent Cache - Snooping
- The SIU snoops the bus for transactions
- Addresses are compared with the local cache
- On matches
- Hits in the local cache
- Initiate retries
- when the local copy is modified
- the local copy is written to the bus
- Invalidate local copies
- when another PE is writing
- Mark local copies shared
- when a second PE is reading the same value
26. Coherent Cache - MESI Protocol
- Cache line has 4 states
- Invalid
- Modified
- Only valid copy
- Memory copy is invalid
- Exclusive
- Only cached copy
- Memory copy is valid
- Shared
- Multiple cached copies
- Memory copy is valid
27. MESI State Diagram
- Note the number of bus transactions needed!
Legend: WH = Write Hit, WM = Write Miss, RH = Read Hit, RMS = Read Miss (Shared), RME = Read Miss (Exclusive), SHW = Snoop Hit on Write
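A much-simplified sketch of the transitions behind the diagram, just to make the event abbreviations concrete; a real controller also generates the bus transactions (invalidates, push-outs) noted on the next slide.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

/* WH / WM: a local write leaves the line Modified
 * (a write hit on a Shared line must broadcast an invalidate first). */
mesi_state_t on_local_write(mesi_state_t s) {
    return MODIFIED;
}

/* RMS / RME: a read miss loads the line Shared if another cache holds it,
 * Exclusive if this is the only cached copy. RH leaves the state unchanged. */
mesi_state_t on_local_read_miss(int another_cache_has_copy) {
    return another_cache_has_copy ? SHARED : EXCLUSIVE;
}

/* SHW: a snooped write by another PE invalidates our copy
 * (a Modified copy is written back to memory first). */
mesi_state_t on_snoop_hit_write(mesi_state_t s) {
    return INVALID;
}
```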
28. Coherent Cache - The Cost
- Cache coherency transactions
- Additional transactions needed
- Shared
- Write Hit
- Other caches must be notified
- Modified
- Other PE read
- Push-out needed
- Other PE write
- Push-out needed, even if the other PE writes only one word of the n-word line
- Invalid - modified in other cache
- Read or write
- Wait for push-out
29. Clusters
- A bus which is too long becomes slow!
- e.g. PCI is limited to 10 TTL loads
- Lots of processors?
- On the same bus
- Bus speed must be limited
- Low communication rate
- Better to use a single PE!
- Clusters
- 8 processors on a bus
30. Clusters
[Diagram: clusters of 8 cache coherent (CC) processors, each cluster on its own bus, joined by an interconnect network - perhaps 100 clusters in total]
31. Clusters
[Diagram: the Network Interface Unit (NIU) detects requests for remote memory]
32. Clusters
[Diagram: a memory request message is despatched to the remote cluster's NIU]
33. Clusters - Shared Memory
- Non Uniform Memory Access
- Access time to memory depends on location!
[Diagram: from the PEs in one cluster, the cluster's local memory is much closer than memory in a remote cluster]
34. Clusters - Shared Memory
- Non Uniform Memory Access
- Access time to memory depends on location!
- Worse! The NIU needs to maintain cache coherence across the entire machine
35. Clusters - Maintaining Cache Coherence
- NIU (or equivalent) maintains directory
- Directory Entries
- All lines from local memory cached elsewhere
- NIU software (firmware)
- Checks memory requests against directory
- Update directory
- Send invalidate messages to other clusters
- Fetch modified (dirty) lines from other clusters
- Remote memory access cost
- 100s of cycles!
Directory (Cluster 2):
Address | Status | Clusters
4340    | S      | 1, 3, 8
5260    | E      | 9
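A sketch of what one directory entry might look like in the NIU's memory, matching the table above; the field widths and the bitmap representation of the sharing clusters are assumptions.

```c
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    uint32_t    line_address;   /* address of the local line, e.g. 4340 */
    dir_state_t state;          /* S: several clusters hold copies, E: one owner */
    uint64_t    sharer_bitmap;  /* bit i set => cluster i has a cached copy */
} directory_entry_t;

/* On a remote write request the NIU would send invalidate messages to every
 * cluster whose bit is set in sharer_bitmap, then record the new owner. */
```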
36. Clusters - Off the Shelf
- Commercial clusters
- Provide page migration
- Make a copy of a remote page on the local PE
- Programmer remains responsible for coherence
- Don't provide hardware support for cache coherence (across the network)
- Fully CC machines may never be available!
- Software Systems
- ....
37. Shared Memory Systems
- Software Systems
- e.g. TreadMarks
- Provide shared memory on a page basis
- Software
- detects references to remote pages (see the sketch after this list)
- moves a copy to local memory
- Reduces shared memory overhead
- Provides some of the shared memory model convenience
- Without swamping the interconnection network with messages
- Message overhead is too high for a single word!
- A word basis is too expensive!!
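A rough sketch of how a page-based software system can detect references to remote pages, in the spirit of (but not the API of) TreadMarks: the region starts with no access rights, the first touch of a page faults, and the handler "fetches" the page and enables access. fetch_page_from_owner is a placeholder for the real network request, and doing this work in a SIGSEGV handler is simplified here.

```c
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGES 16

static char  *shared_region;
static size_t page_size;

/* Placeholder for "request the page contents from the remote owner". */
static void fetch_page_from_owner(void *page) { (void)page; }

static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    /* Round the faulting address down to its page. */
    char *page = (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* allow access          */
    fetch_page_from_owner(page);                        /* pull in remote copy   */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);

    /* "Shared" region: no access rights, so the first touch of each page traps. */
    shared_region = mmap(NULL, PAGES * page_size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    shared_region[100] = 1;          /* faults once: page is "fetched" and mapped in */
    printf("%d\n", shared_region[100]);
    return 0;
}
```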
38. Granularity in Parallel Systems
39. Shared Memory Systems - Granularity
- Granularity
- Keeping data coherent on a word basis is too expensive!!
- Sharing data at low granularity
- Fine grain sharing
- Access / sharing for individual words
- Overheads too high
- Number of messages
- Message overhead is high for one word
- Compare: burst access to memory
- Don't fetch a single word
- Overhead (bus protocol) is too high
- Amortize the cost of access over multiple words
40. Shared Memory Systems - Granularity
- Coarse Grain Systems
- Transferring data from cluster to cluster
- Overhead
- Messages
- Updating the directory
- Amortise the overhead over a whole page
- Lower relative overhead
- Applies to thread size also
- Split the program into small threads of control
- Parallel Overhead
- Cost of setting up and starting each thread
- Cost of synchronising at the end of a set of threads
- Can be more efficient to run a single sequential thread!
41. Coarse Grain Systems
- So far ...
- Most experiments suggest that fine grain systems are impractical
- Larger, coarser grain
- Blocks of data
- Threads of computation
- needed to reduce overall computation time by using multiple processors
- Too fine grain parallel systems
- can run slower than a single processor!
42. Parallel Overhead
- Ideal
- T(n) = time to solve the problem with n PEs
- Sequential time T(1)
- We'd like
- T(n) = T(1) / n
- Add overhead
- Time > optimal
- No point in using more than 4 PEs!!
[Plot: ideal T(1)/n curve versus the actual T(n) curve, which reaches a minimum and then rises as PEs are added]
43. Parallel Overhead
- Ideal
- Time ∝ 1/n
- Add overhead
- Time > optimal
- No point in using more than 4 PEs!!
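The "no point beyond 4 PEs" shape can be reproduced with made-up numbers: the sketch below prints the ideal time T(1)/n next to an assumed actual time T(1)/n + c·n, where the per-PE overhead c is pure illustration.

```c
#include <stdio.h>

int main(void) {
    double t1 = 100.0;              /* sequential time T(1)                    */
    double c  = 6.0;                /* assumed per-PE synchronisation overhead */
    for (int n = 1; n <= 8; n++) {
        double ideal  = t1 / n;
        double actual = t1 / n + c * n;
        printf("n=%d  ideal=%6.1f  actual=%6.1f\n", n, ideal, actual);
    }
    return 0;                       /* actual time is lowest around n = 4 here */
}
```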
44. Parallel Overhead
- Shared memory systems
- Best results if you
- Share on a large block basis
- e.g. a page
- Split the program into coarse grain (long running) threads
- Give away some parallelism to achieve any parallel speedup!
- Coarse grain
- Data
- Computation
There's parallelism at the instruction level too! The instruction issue unit in a sequential processor is trying to exploit it!
45. Clusters - Improving Multiple PE Performance
- Bandwidth to memory
- Cache reduces dependency on the memory-CPU interface
- 95% cache hits
- only 5% of memory accesses cross the interface
- but add
- a few PEs and
- a few CC transactions
- even if the interface was coping before, it won't in a multiprocessor system!
- The memory-CPU interface becomes a major bottleneck!
46. Clusters - Improving Multiple PE Performance
- Bus protocols add to access time
- Request / Grant / Release phases needed
- Point-to-point is faster!
- Cross-bar switch interface to memory
- No PE contends with any other for the common bus
Cross-bar? The name is taken from old telephone exchanges!
47. Clusters - Memory Bandwidth
- Modern Clusters
- Use point-to-point X-bar interfaces to memory to get bandwidth!
- Cache coherence?
- Now really hard!!
- How does each cache snoop all transactions?
48. Programming Model - Distributed Memory
- Distributed Memory
- also known as Message Passing
- Alternative to shared memory
- Each PE has its own address space
- PEs communicate with messages
- Messages provide synchronisation
- A PE can block, or wait, for a message
49. Programming Model - Distributed Memory
- Distributed Memory Systems
- Hardware is simple!
- The network can be as simple as Ethernet
- Networks of Workstations model
- Commodity (cheap!) PEs
- Commodity Network
- Standard
- Ethernet
- ATM
- Proprietary
- Myrinet
- Achilles (UWA!)
50. Programming Model - Distributed Memory
- Distributed Memory Systems
- Software is considered harder
- Programmer is responsible for
- Distributing data to individual PEs
- Explicit thread control
- Starting, stopping, synchronising
- At least two commonly available systems
- Parallel Virtual Machine (PVM)
- Message Passing Interface (MPI)
- Built on two operations
- Send( data, destPE, block / don't block )
- Receive( data, srcPE, block / don't block )
- Blocking ensures synchronisation
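A minimal MPI sketch of these two operations: rank 0 sends one integer to rank 1 with a blocking send, and the blocking receive on rank 1 doubles as synchronisation. The tag value and message contents are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* blocking send    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* blocking receive */
        printf("PE1 received %d from PE0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```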
51. Programming Model - Distributed Memory
- Distributed Memory Systems
- Performance is generally better (versus shared memory)
- Shared memory has hidden overheads
- Grain size poorly chosen
- e.g. data doesn't fit into pages
- Unnecessary coherence transactions
- Updating a shared region (each page) before the end of the computation
- A message passing system waits and updates the page when the computation is complete
52. Programming Model - Distributed Memory
- Distributed Memory Systems
- Performance is generally better (versus shared memory)
- False sharing
- Severely degrades performance
- May not be apparent on superficial analysis
[Diagram: one memory page; PEa accesses data at one end and PEb at the other, so the whole page ping-pongs between PEa and PEb]
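An illustrative sketch of false sharing (structure and loop counts invented for the example): the two counters below share one cache line / sharing unit, so every update by one thread invalidates the other's copy even though they never touch the same variable.

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 50000000L

/* a and b are logically independent but live in the same cache line,
 * so the line ping-pongs between the two PEs' caches. */
static struct { long a; long b; } counters;

static void *bump_a(void *arg) {
    for (long i = 0; i < ITERS; i++) counters.a++;
    return NULL;
}
static void *bump_b(void *arg) {
    for (long i = 0; i < ITERS; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    /* Padding the struct so a and b fall in different cache lines
     * removes the ping-pong and typically runs much faster. */
    return 0;
}
```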
53. Distributed Memory - Summary
- Simpler (almost trivial) hardware
- Software
- More programmer effort
- Explicit data distribution
- Explicit synchronisation
- Performance generally better
- Programmer knows more about the problem
- Communicates only when necessary
- Communication grain size can be optimum
- Lower overheads