Title: Multiprocessor/Multicore Systems: Scheduling, Synchronization
1. Multiprocessor/Multicore Systems: Scheduling, Synchronization
2. Multiprocessors
- Definition: A computer system in which two or more CPUs share full access to a common RAM
3. Multiprocessor/Multicore Hardware (ex. 1)
- Bus-based multiprocessors
4. Multiprocessor/Multicore Hardware (ex. 2): UMA (uniform memory access)
- Not/hardly scalable
- Bus-based architectures -> saturation
- Crossbars too expensive (wiring constraints)
- Possible solutions
- Reduce network traffic by caching
- Clustering -> non-uniform memory latency behaviour (NUMA)
5. Multiprocessor/Multicore Hardware (ex. 3): NUMA (non-uniform memory access)
- Single address space visible to all CPUs
- Access to remote memory is slower than to local memory
- The cache-controller/MMU determines whether a reference is local or remote
- When caching is involved, it is called CC-NUMA (cache-coherent NUMA)
- Typically read replication (write invalidation)
6. Cache Coherence
7. Cache Coherence (cont.)
- Cache-coherency protocols are based on a set of (cache-block) states and state transitions; two types of protocols:
- write-update
- write-invalidate (suffers from false sharing)
- Some invalidations are not necessary for correct program execution:

Processor 1:            Processor 2:
while (true) do         while (true) do
  A := A + 1              B := B + 1

- If A and B are located in the same cache block, a cache miss occurs in each loop iteration due to a ping-pong of invalidations
8. On Multicores
- Reason for multicores: physical limitations can cause significant heat dissipation and data-synchronization problems
- In addition to operating system (OS) support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors
- The virtual-machine approach is again in focus
- Example: Intel Core 2 dual-core processor, with CPU-local Level 1 caches and a shared, on-die Level 2 cache
9. On Multicores (cont.)
- Also possible (figure from www.microsoft.com/licensing/highlights/multicore.mspx)
10. OS Design Issues (1): Who executes the OS/scheduler(s)?
- Master/slave architecture: key kernel functions always run on a particular processor
- The master is responsible for scheduling; a slave sends service requests to the master
- Disadvantages:
- Failure of the master brings down the whole system
- The master can become a performance bottleneck
- Peer architecture: the operating system can execute on any processor
- Each processor does self-scheduling
- New issues for the operating system, e.g., making sure two processors do not choose the same process
11. Master-Slave Multiprocessor OS
(figure: CPUs connected by a bus)
12. Non-symmetric Peer Multiprocessor OS
(figure: CPUs connected by a bus)
- Each CPU has its own operating system
13. Symmetric Peer Multiprocessor OS
(figure: CPUs connected by a bus)
- Symmetric multiprocessors
- SMP multiprocessor model
14. Scheduling in Multiprocessors
- Recall: tightly coupled multiprocessing (SMPs)
- Processors share main memory
- Controlled by the operating system
- Different degrees of parallelism:
- Independent and coarse-grained parallelism: no or very limited synchronization; can be supported on a multiprocessor with little change (with a grain of salt)
- Medium-grained parallelism: a collection of threads that usually interact frequently
- Fine-grained parallelism: highly parallel applications; a specialized and fragmented area
15. Design Issues (2): Assignment of Processes to Processors
- Per-processor ready queues vs. a global ready queue
- Permanently assign each process to a processor
- Less overhead
- But a processor could be idle while another processor has a backlog
- Have a global ready queue and schedule to any available processor
- The queue can become a bottleneck
- Task migration is not cheap
16. Multiprocessor Scheduling: per-partition ready queues
- Space sharing
- Multiple threads run at the same time across multiple CPUs
17. Multiprocessor Scheduling: Load sharing / Global ready queue
- Timesharing
- Note the use of a single data structure for scheduling
18. Multiprocessor Scheduling: Load Sharing, a problem
- Problem with communication between two threads
- Both belong to process A
- Both running out of phase
19. Design Issues (3): Multiprogramming on processors?
- Experience shows:
- Threads running on separate processors (to the extent of dedicating a processor to a thread) yield dramatic gains in performance
- Allocating processors to threads resembles allocating pages to processes (can use the working-set model?)
- The specific scheduling discipline is less important with more than one processor; the decision of how to distribute tasks is more important
20. Gang Scheduling
- Approach to address the previous problem
- Groups of related threads are scheduled as a unit (a gang)
- All members of a gang run simultaneously, on different timeshared CPUs
- All gang members start and end time slices together
21. Gang Scheduling: another option
22. Multiprocessor Thread Scheduling: Dynamic Scheduling
- The number of threads in a process is altered dynamically by the application
- Programs (through thread libraries) give info to the OS to manage parallelism; the OS adjusts the load to improve use
- Or the OS gives info to the run-time system about available processors, so it can adjust its number of threads
- I.e., a dynamic version of partitioning
23. Summary: Multiprocessor Thread Scheduling
- Load sharing: processes/threads are not assigned to particular processors
- Load is distributed evenly across the processors
- Needs a central queue, which may be a bottleneck
- Preempted threads are unlikely to resume execution on the same processor, so cache use is less efficient
- Gang scheduling: assigns threads to particular processors (simultaneous scheduling of the threads that make up a process)
- Useful where performance severely degrades when any part of the application is not running (due to synchronization)
- Extreme version: dedicated processor assignment (no multiprogramming of processors)
24. Multiprocessor Scheduling and Synchronization
- Priorities plus blocking synchronization may result in:
- Priority inversion: a low-priority process P holds a lock, a high-priority process waits, and medium-priority processes prevent P from completing and releasing the lock quickly (scheduling becomes less efficient). To cope with/avoid this:
- use priority inheritance, or
- use non-blocking synchronization (wait-free, lock-free, optimistic synchronization)
- Convoy effect: processes need a resource for a short time, yet the process holding it may block them for a long time (hence, poor utilization)
- Non-blocking synchronization is good here, too
25. Readers-Writers: non-blocking synchronization
- (some slides are adapted from J. Anderson's slides on the same topic)
26. The Mutual Exclusion Problem: Locking Synchronization

while true do
  Noncritical Section
  Entry Section
  Critical Section
  Exit Section
od

- N processes, each with this structure
- Basic requirements:
- Exclusion: Invariant(#processes in CS ≤ 1)
- Starvation-freedom: (process i in Entry) leads-to (process i in CS)
- Can be implemented by busy waiting (spin locks) or using kernel calls
27. Non-blocking Synchronization
- The problem: implement a shared object without mutual exclusion
- Shared object: a data structure (e.g., a queue) shared by concurrent processes
- Why?
- To avoid the performance problems that result when a lock-holding task is delayed
- To avoid priority inversions (more on this later)
28. Non-blocking Synchronization
- Two variants:
- Lock-free:
- Only system-wide progress is guaranteed
- Usually implemented using retry loops
- Wait-free:
- Individual progress is guaranteed
- Code for object invocations is purely sequential
29. Readers/Writers Problem
- Courtois et al., 1971
- Similar to mutual exclusion, but several readers can execute the critical section at once
- If a writer is in its critical section, then no other process can be in its critical section
- No starvation, fairness
30. Solution 1: Readers have priority

Reader:
  P(mutex)
  rc := rc + 1
  if rc = 1 then P(w) fi
  V(mutex)
  CS
  P(mutex)
  rc := rc - 1
  if rc = 0 then V(w) fi
  V(mutex)

Writer:
  P(w)
  CS
  V(w)

- The first reader executes P(w); the last one executes V(w).
31. Solution 2: Writers have priority
- Readers should not build a long queue on r, so that writers can overtake -> mutex3

Reader:
  P(mutex3)
  P(r)
  P(mutex1)
  rc := rc + 1
  if rc = 1 then P(w) fi
  V(mutex1)
  V(r)
  V(mutex3)
  CS
  P(mutex1)
  rc := rc - 1
  if rc = 0 then V(w) fi
  V(mutex1)

Writer:
  P(mutex2)
  wc := wc + 1
  if wc = 1 then P(r) fi
  V(mutex2)
  P(w)
  CS
  V(w)
  P(mutex2)
  wc := wc - 1
  if wc = 0 then V(r) fi
  V(mutex2)
32. Properties
- If several writers try to enter their critical sections, one will execute P(r), blocking readers
- Works assuming V(r) has the effect of picking a process waiting to execute P(r) to proceed
- Due to mutex3, if a reader executes V(r) and a writer is at P(r), then the writer is picked to proceed
33. Concurrent Reading and Writing [Lamport '77]
- Previous solutions to the readers/writers problem use some form of mutual exclusion
- Lamport considers solutions in which readers and writers access a shared object concurrently
- Motivation:
- Don't want writers to wait for readers
- A readers/writers solution may be needed to implement mutual exclusion (circularity problem)
34. Interesting Factoids
- This is the first-ever lock-free algorithm: it guarantees consistency without locks
- An algorithm very similar to this is implemented within an embedded controller in Mercedes automobiles!
35. The Problem
- Let v be a data item consisting of one or more digits
- For example, v = 256 consists of three digits: 2, 5, and 6
- Underlying model: digits can be read and written atomically
- Objective: simulate atomic reads and writes of the data item v
36. Preliminaries
- Definition: v^i, where i ≥ 0, denotes the i-th value written to v (v^0 is v's initial value)
- Note: no concurrent writing of v
- Partitioning of v: v = v_1 ... v_m
- Each v_i may consist of multiple digits
- To read v: read each v_i (in some order)
- To write v: write each v_i (in some order)
37. More Preliminaries
- When the digits obtained by a read r come from versions v^k, ..., v^l, we say r reads v^[k,l]. The value is consistent if k = l.
38. Theorem 1
If v is always written from right to left, then a read from left to right obtains a value
  v_1^[k1,l1] v_2^[k2,l2] ... v_m^[km,lm]
where k1 ≤ l1 ≤ k2 ≤ l2 ≤ ... ≤ km ≤ lm.

Example: v = v_1 v_2 v_3 = d_1 d_2 d_3; a read obtains v_1^[0,0] v_2^[1,1] v_3^[2,2].
39. Another Example
- v = v_1 v_2, with v_1 = d_1 d_2 and v_2 = d_3 d_4
- (figure: a timeline in which a read of d_1, d_2, d_4, d_3 overlaps two right-to-left writes, write0 and write1)
- The read obtains v_1^[0,1] v_2^[1,2].
40. Theorem 2
Assume that i ≤ j implies v^i ≤ v^j, where v = d_1 ... d_m.
(a) If v is always written from right to left, then a read from left to right obtains a value v^[k,l] ≤ v^l.
(b) If v is always written from left to right, then a read from right to left obtains a value v^[k,l] ≥ v^k.
41. Example of (a)
v = d_1 d_2 d_3. The read obtains v^[0,2] = 390 < 400 = v^2.
42. Example of (b)
v = d_1 d_2 d_3. The read obtains v^[0,2] = 498 > 398 = v^0.
43. Readers/Writers Solution
- :> means "assign a larger value"
- -> V1 means V1 is read/written left to right
- <- V2 means V2 is read/written right to left
- (figure: the reader and writer programs built from the version counters V1, V2 and the data D)
44. Proof Obligation
- Assume the reader reads V2^[k1,l1] D^[k2,l2] V1^[k3,l3].
- Proof obligation: V2^[k1,l1] = V1^[k3,l3] ⇒ k2 = l2.
45. Proof
By Theorem 2,
  V2^[k1,l1] ≤ V2^l1 and V1^k3 ≤ V1^[k3,l3].   (1)
Applying Theorem 1 to V2 D V1,
  k1 ≤ l1 ≤ k2 ≤ l2 ≤ k3 ≤ l3.   (2)
By the writer program,
  l1 ≤ k3 ⇒ V2^l1 ≤ V1^k3.   (3)
(1), (2), and (3) imply
  V2^[k1,l1] ≤ V2^l1 ≤ V1^k3 ≤ V1^[k3,l3].
Hence, V2^[k1,l1] = V1^[k3,l3]
  ⇒ V2^l1 = V1^k3
  ⇒ l1 = k3, by the writer's program
  ⇒ k2 = l2, by (2).
46. Supplemental Reading
- Check:
- G.L. Peterson, "Concurrent Reading While Writing", ACM TOPLAS, Vol. 5, No. 1, 1983, pp. 46-55.
- Solves the same problem in a wait-free manner:
- guarantees consistency without locks, and
- the unbounded reader loop is eliminated
- First paper on wait-free synchronization
- There is now a very rich literature on the topic. Check also:
- PhD thesis, A. Gidenstam, 2006, CTH
- PhD thesis, H. Sundell, 2005, CTH
47. Useful Synchronization Primitives (usually necessary in non-blocking algorithms)

CAS(var, old, new) ≡
  if var ≠ old then return false fi
  var := new
  return true

(CAS2 extends this to two variables.)

LL(var) ≡
  establish link to var
  return var

SC(var, val) ≡
  if link to var still exists then
    break all current links of all processes
    var := val
    return true
  else return false fi
48. Another Lock-free Example: Shared Queue

type Qtype = record v: valtype; next: pointer to Qtype end
shared var Tail: pointer to Qtype
local var old, new: pointer to Qtype

procedure Enqueue(input: valtype)
  new := (input, NIL)
  repeat                    -- retry loop
    old := Tail
  until CAS2(Tail, old->next, old, NIL, new, new)

(figure: the new node is linked after the old tail and Tail is advanced to it in one atomic CAS2 step)
49. Using Locks in Real-time Systems: The Priority Inversion Problem
- Uncontrolled use of locks in RT systems can result in unbounded blocking due to priority inversions.
- (figure: timeline with High-, Med- and Low-priority tasks over instants t0, t1, t2, distinguishing shared-object access, the priority-inversion interval, and computation not involving object accesses)
50. Dealing with Priority Inversions
- Common approach: use lock-based schemes that bound the inversion's duration (as shown)
- Examples: priority-inheritance and priority-ceiling protocols
- Disadvantages: kernel support needed; very inefficient on multiprocessors
- Alternative: use non-blocking objects
- No priority inversions, no kernel support
- Wait-free algorithms are clearly applicable here
- What about lock-free algorithms?
- Advantage: usually simpler than wait-free algorithms
- Disadvantage: access times are potentially unbounded
- But for periodic task sets, access times are also predictable! (check further-reading pointers)