Title: Parallel Programming with PThreads
1Parallel Programming with PThreads
2Threads
- Sometimes called a lightweight process
- smaller execution unit than a process
- Consists of
- program counter
- register set
- stack space
- Threads share
- memory space
- code section
- OS resources (open files, signals, etc.)
3Threads
4POSIX Threads
- Thread API available on many OSs
- #include <pthread.h>
- cc myprog.c -o myprog -lpthread
- Thread creation
  - int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg)
- Thread termination
  - void pthread_exit(void *retval)
- Waiting for threads
  - int pthread_join(pthread_t th, void **thread_return)
- (A minimal create/join example follows this list.)
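A minimal sketch (not from the slides) of how these calls fit together: main creates one thread, which prints a message and exits, and main joins it. The names hello and id are illustrative; compile with the cc line shown above.

#include <pthread.h>
#include <stdio.h>

/* Thread function: receives its argument through the void * parameter. */
void *hello(void *arg)
{
    int id = *(int *) arg;
    printf("hello from thread %d\n", id);
    pthread_exit(NULL);            /* same effect as returning from the function */
}

int main(void)
{
    pthread_t tid;
    int id = 1;

    /* NULL attributes give the default stack size, scheduling, etc. */
    if (pthread_create(&tid, NULL, hello, &id) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_join(tid, NULL);       /* wait for the thread to finish */
    return 0;
}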
6Thread Issues
- Timing
- False Sharing
  - Variables are not shared but are so close together that they fall within the same cache line.
  - Writes to a shared cache line invalidate the other processors' caches.
- Locking Overhead
7Thread Timing
- Scenario 1
  - Thread T1 creates thread T2
  - T2 requires data from T1 that will be placed in global memory
  - T1 places the data in global memory after it creates T2
  - Assumes T1 will be able to place the data before T2 starts or before T2 needs the data
- Scenario 2
  - T1 creates T2 and needs to pass data to T2
  - T1 gives T2 a pointer to a variable on its stack
  - What happens if / when T1 finishes?
8Thread Timing
- Set up all requirements before creating the thread
  - It is possible that the created thread will start and run to completion before the creating thread gets scheduled again
- Producer-consumer relationships
  - Make sure the data is placed before it is needed (don't rely on timing)
  - Make sure the data is there before consuming it (don't rely on timing)
  - Make sure the data lasts until all potential consumers have consumed it (don't rely on timing)
- Use synchronization primitives to enforce order (one way to pass data safely to a new thread is sketched below)
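A hedged illustration of Scenario 2 above (not from the slides): passing the address of a stack variable to the new thread leaves a dangling pointer if the creator returns first. One common fix is to heap-allocate the argument and let the new thread free it. The names spawn_worker and worker are made up for this sketch.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *worker(void *arg)
{
    int value = *(int *) arg;   /* copy the value out */
    free(arg);                  /* the worker owns the heap-allocated argument */
    printf("worker got %d\n", value);
    return NULL;
}

void spawn_worker(pthread_t *tid, int value)
{
    /* Passing &value (a stack variable) instead would be unsafe: once this
       function returns, the worker would hold a dangling pointer. */
    int *arg = malloc(sizeof *arg);
    *arg = value;
    pthread_create(tid, NULL, worker, arg);
}

int main(void)
{
    pthread_t tid;
    spawn_worker(&tid, 42);
    pthread_join(tid, NULL);
    return 0;
}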
9False Sharing
Variables are not shared but are so close together that they fall within the same cache line.
(Diagram: processors P1 and P2 each have their own cache; memory locations A and B sit in the same cache line. P1 reads and writes A while P2 reads and writes B, and each write invalidates the line in the other processor's cache even though no variable is actually shared. A remedy is sketched below.)
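A small sketch (not from the slides) of the usual remedy: pad per-thread data so each thread's counter occupies its own cache line. The 64-byte line size, thread count, and names are assumptions for illustration; real code should query the hardware line size.

#include <pthread.h>

#define CACHE_LINE 64            /* assumed line size */
#define NUM_THREADS 4
#define ITERS 10000000

/* Each counter gets its own cache line, so writes by one thread do not
   invalidate the line holding another thread's counter. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NUM_THREADS];

void *bump(void *arg)
{
    int id = *(int *) arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;    /* private data, no lock needed */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, bump, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}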
10Effects of False Sharing
11Synchronization Primitives
- int pthread_mutex_init(pthread_mutex_t *mutex_lock, const pthread_mutexattr_t *lock_attr)
- int pthread_mutex_lock(pthread_mutex_t *mutex_lock)
- int pthread_mutex_unlock(pthread_mutex_t *mutex_lock)
- int pthread_mutex_trylock(pthread_mutex_t *mutex_lock)
12
#include <pthread.h>

void *find_min(void *list_ptr);

pthread_mutex_t minimum_value_lock;
int minimum_value, partial_list_size;

main() {
    minimum_value = MIN_INT;
    pthread_init();
    pthread_mutex_init(&minimum_value_lock, NULL);
    /* initialize lists, etc.; create and join threads */
}

void *find_min(void *list_ptr) {
    int *partial_list_ptr, my_min = MIN_INT, i;
    partial_list_ptr = (int *) list_ptr;
    for (i = 0; i < partial_list_size; i++)
        if (partial_list_ptr[i] < my_min)
            my_min = partial_list_ptr[i];
    pthread_mutex_lock(&minimum_value_lock);
    if (my_min < minimum_value)
        minimum_value = my_min;
    pthread_mutex_unlock(&minimum_value_lock);
    pthread_exit(0);
}
13Locking Overhead
- Serialization points
- Minimize the size of critical sections
- Be careful
- Rather than wait, check if lock is available
- pthread_mutex_trylock
- If already locked, will return EBUSY
- Will require restructuring of code
14
/* Finding k matches in a list */
void *find_entries(void *start_pointer) {
    /* This is the thread function */
    struct database_record *next_record;
    int count;
    current_pointer = start_pointer;
    do {
        next_record = find_next_entry(current_pointer);
        count = output_record(next_record);
    } while (count < requested_number_of_records);
}

int output_record(struct database_record *record_ptr) {
    int count;
    pthread_mutex_lock(&output_count_lock);
    output_count++;
    count = output_count;
    pthread_mutex_unlock(&output_count_lock);
    if (count <= requested_number_of_records)
        print_record(record_ptr);
    return (count);
}
15
/* rewritten output_record function */
int output_record(struct database_record *record_ptr) {
    int count;
    int lock_status;
    lock_status = pthread_mutex_trylock(&output_count_lock);
    if (lock_status == EBUSY) {
        insert_into_local_list(record_ptr);
        return(0);
    }
    else {
        count = output_count;
        output_count += number_on_local_list + 1;
        pthread_mutex_unlock(&output_count_lock);
        print_records(record_ptr, local_list,
                      requested_number_of_records - count);
        return(count + number_on_local_list + 1);
    }
}
16Mutex features/issues
- Limited to just mutual exclusion
  - Use POSIX semaphores for more control (a short sketch follows this list)
- Can only have one thread in the mutex at a time
  - What if you get in and then realize things aren't quite ready yet?
  - Must exit the mutex and start over
  - Can we avoid going to the back of the line?
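Since the slide points to POSIX semaphores for more control, here is a hedged sketch (not from the slides) of the sem_init/sem_wait/sem_post interface, which is distinct from the System V semaphores covered later. The initial value 2 and the names slots and worker are illustrative.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t slots;                     /* counting semaphore */

void *worker(void *arg)
{
    sem_wait(&slots);            /* P: block until a slot is free */
    printf("thread %ld using a slot\n", (long) arg);
    sem_post(&slots);            /* V: release the slot */
    return NULL;
}

int main(void)
{
    pthread_t t[4];

    sem_init(&slots, 0, 2);      /* 0 = shared among threads; initial value 2 */
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *) i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&slots);
    return 0;
}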
17Condition Variables for Synchronization
- A condition variable allows a thread to block itself until specified data reaches a predefined state.
- A condition variable is associated with a predicate.
- When the predicate becomes true, the condition variable is used to signal one or more threads waiting on the condition.
- A single condition variable may be associated with more than one predicate.
- A condition variable always has a mutex associated with it.
- A thread locks this mutex and tests the predicate defined on the shared variable.
- If the predicate is not true, the thread waits on the condition variable associated with the predicate using the function pthread_cond_wait.
18Using Condition Variables
Main Thread
  - Declare and initialize global data/variables which require synchronization (such as "count")
  - Declare and initialize a condition variable object
  - Declare and initialize an associated mutex
  - Create threads A and B to do work

Thread A
  - Do work up to the point where a certain condition must occur (such as "count" reaching a specified value)
  - Lock the associated mutex and check the value of the global variable
  - Call pthread_cond_wait() to perform a blocking wait for a signal from Thread B. Note that a call to pthread_cond_wait() automatically and atomically unlocks the associated mutex variable so that it can be used by Thread B.
  - When signalled, wake up. The mutex is automatically and atomically locked.
  - Explicitly unlock the mutex
  - Continue

Thread B
  - Do work
  - Lock the associated mutex
  - Change the value of the global variable that Thread A is waiting upon
  - Check the value of the global Thread-A wait variable. If it fulfills the desired condition, signal Thread A.
  - Unlock the mutex
  - Continue
19Condition Variables for Synchronization
- Pthreads provides the following functions for condition variables
  - int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
  - int pthread_cond_signal(pthread_cond_t *cond)
  - int pthread_cond_broadcast(pthread_cond_t *cond)
  - int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr)
  - int pthread_cond_destroy(pthread_cond_t *cond)
20Producer-Consumer Using Locks
pthread_mutex_t task_queue_lock;
int task_available;
...
main() {
    ...
    task_available = 0;
    pthread_mutex_init(&task_queue_lock, NULL);
    ...
}

void *producer(void *producer_thread_data) {
    ...
    while (!done()) {
        inserted = 0;
        create_task(&my_task);
        while (inserted == 0) {
            pthread_mutex_lock(&task_queue_lock);
            if (task_available == 0) {
                insert_into_queue(my_task);
                task_available = 1;
                inserted = 1;
            }
            pthread_mutex_unlock(&task_queue_lock);
        }
    }
}
21Producer-Consumer Using Locks
void *consumer(void *consumer_thread_data) {
    int extracted;
    struct task my_task;
    /* local data structure declarations */
    while (!done()) {
        extracted = 0;
        while (extracted == 0) {
            pthread_mutex_lock(&task_queue_lock);
            if (task_available == 1) {
                extract_from_queue(&my_task);
                task_available = 0;
                extracted = 1;
            }
            pthread_mutex_unlock(&task_queue_lock);
        }
        process_task(my_task);
    }
}
22Producer-Consumer Using Condition Variables
pthread_cond_t cond_queue_empty, cond_queue_full;
pthread_mutex_t task_queue_cond_lock;
int task_available;
/* other data structures here */

main() {
    /* declarations and initializations */
    task_available = 0;
    pthread_init();
    pthread_cond_init(&cond_queue_empty, NULL);
    pthread_cond_init(&cond_queue_full, NULL);
    pthread_mutex_init(&task_queue_cond_lock, NULL);
    /* create and join producer and consumer threads */
}
23Producer-Consumer Using Condition Variables
void *producer(void *producer_thread_data) {
    int inserted;
    while (!done()) {
        create_task();
        pthread_mutex_lock(&task_queue_cond_lock);
        while (task_available == 1)
            pthread_cond_wait(&cond_queue_empty,
                              &task_queue_cond_lock);
        insert_into_queue();
        task_available = 1;
        pthread_cond_signal(&cond_queue_full);
        pthread_mutex_unlock(&task_queue_cond_lock);
    }
}
24Producer-Consumer Using Condition Variables
void *consumer(void *consumer_thread_data) {
    while (!done()) {
        pthread_mutex_lock(&task_queue_cond_lock);
        while (task_available == 0)
            pthread_cond_wait(&cond_queue_full,
                              &task_queue_cond_lock);
        my_task = extract_from_queue();
        task_available = 0;
        pthread_cond_signal(&cond_queue_empty);
        pthread_mutex_unlock(&task_queue_cond_lock);
        process_task(my_task);
    }
}
25Condition Variables
- Rather than just signaling one blocked thread, we can signal all of them
  - int pthread_cond_broadcast(pthread_cond_t *cond)
- Can also have a timeout (a small sketch follows this list)
  - int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *abstime)
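A small sketch (not from the slides) of pthread_cond_timedwait: the third argument is an absolute deadline, so it is usually built from the current CLOCK_REALTIME time. The predicate ready and the helper wait_ready are made up for this example.

#include <pthread.h>
#include <errno.h>
#include <time.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int ready = 0;                         /* the predicate */

/* Wait up to 'seconds' for ready to become nonzero.
   Returns 1 on success, 0 on timeout. */
int wait_ready(int seconds)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);   /* now */
    deadline.tv_sec += seconds;                 /* absolute deadline */

    pthread_mutex_lock(&lock);
    while (!ready) {
        int rc = pthread_cond_timedwait(&cond, &lock, &deadline);
        if (rc == ETIMEDOUT) {
            pthread_mutex_unlock(&lock);
            return 0;
        }
    }
    pthread_mutex_unlock(&lock);
    return 1;
}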
26
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 3
#define TCOUNT 10
#define COUNT_LIMIT 12

int count = 0;
int thread_ids[5] = {0, 1, 2, 3, 4};
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;

void *inc_count(void *idp);
void *watch_count(void *idp);

int main(int argc, char *argv[])
{
    int i, rc;
    pthread_t threads[5];
    pthread_attr_t attr;

    /* Initialize mutex and condition variable objects */
    pthread_mutex_init(&count_mutex, NULL);
    pthread_cond_init(&count_threshold_cv, NULL);

    /* For portability, explicitly create threads in a joinable state */
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_create(&threads[4], &attr, watch_count, (void *)&thread_ids[4]);
    pthread_create(&threads[3], &attr, watch_count, (void *)&thread_ids[3]);
    pthread_create(&threads[2], &attr, watch_count, (void *)&thread_ids[2]);
    pthread_create(&threads[1], &attr, inc_count, (void *)&thread_ids[1]);
    pthread_create(&threads[0], &attr, inc_count, (void *)&thread_ids[0]);

    /* Wait for all threads to complete */
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("Main(): Waited on %d threads. Done.\n", NUM_THREADS);

    /* Clean up and exit */
    pthread_attr_destroy(&attr);
    pthread_mutex_destroy(&count_mutex);
    pthread_cond_destroy(&count_threshold_cv);
    pthread_exit(NULL);
}
27
int count = 0;
int thread_ids[5] = {0, 1, 2, 3, 4};
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;

void *inc_count(void *idp)
{
    int j, i;
    double result = 0.0;
    int my_id = *(int *) idp;

    for (i = 0; i < TCOUNT; i++) {
        pthread_mutex_lock(&count_mutex);
        count++;

        /* Check the value of count and signal waiting thread when condition
           is reached.  Note that this occurs while mutex is locked. */
        if (count == COUNT_LIMIT) {
            pthread_cond_broadcast(&count_threshold_cv);
            printf("inc_count(): thread %d, count = %d  Threshold reached.\n",
                   my_id, count);
        }
        printf("inc_count(): thread %d, count = %d, unlocking mutex\n",
               my_id, count);
        pthread_mutex_unlock(&count_mutex);

        /* Do some work so threads can alternate on mutex lock */
        for (j = 0; j < 1000; j++)
            result = result + (double) random();
    }
    pthread_exit(NULL);
}

void *watch_count(void *idp)
{
    int my_id = *(int *) idp;

    printf("Starting watch_count(): thread %d\n", my_id);

    /* Lock mutex and wait for signal.  Note that the pthread_cond_wait
       routine will automatically and atomically unlock mutex while it waits.
       Also, note that if COUNT_LIMIT is reached before this routine is run
       by the waiting thread, the loop will be skipped to prevent
       pthread_cond_wait from never returning. */
    pthread_mutex_lock(&count_mutex);
    if (count < COUNT_LIMIT) {
        pthread_cond_wait(&count_threshold_cv, &count_mutex);
        printf("watch_count(): thread %d  Condition signal received.\n", my_id);
    }
    pthread_mutex_unlock(&count_mutex);
    pthread_exit(NULL);
}
28Composite Synchronization Constructs
- By design, Pthreads provides support for a basic set of operations.
- Higher-level constructs can be built using the basic synchronization constructs.
- Consider read-write locks and barriers.
29Barriers
- A barrier holds a thread until all threads participating in the barrier have reached it.
- Some versions of the Pthreads library support barriers (not required)
  - pthread_barrier_t barr
  - attributes = NULL
  - pthread_barrier_init(&barr, attributes, nthreads)
  - pthread_barrier_wait(&barr)
  - pthread_barrier_destroy(&barr)
- Pthread barriers are available on most Linux implementations (a usage sketch follows this list).
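A hedged sketch of the barrier calls listed above, assuming the implementation provides them: no thread starts phase 2 until every thread has finished phase 1. The thread count and the name phase_worker are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

pthread_barrier_t barr;

void *phase_worker(void *arg)
{
    int id = *(int *) arg;
    printf("thread %d: phase 1\n", id);
    pthread_barrier_wait(&barr);     /* wait until all threads finish phase 1 */
    printf("thread %d: phase 2\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int ids[NTHREADS];

    pthread_barrier_init(&barr, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, phase_worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barr);
    return 0;
}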
30Barrier Implementation
- Barriers can be implemented using a counter, a mutex, and a condition variable.
- A single integer keeps track of the number of threads that have reached the barrier.
- If the count is less than the total number of threads, the threads execute a condition wait.
- The last thread entering (and setting the count to the number of threads) wakes up all the threads using a condition broadcast.
31Barriers
typedef struct {
    pthread_mutex_t count_lock;
    pthread_cond_t ok_to_proceed;
    int count;
} mylib_barrier_t;

void mylib_init_barrier(mylib_barrier_t *b)
{
    b->count = 0;
    pthread_mutex_init(&(b->count_lock), NULL);
    pthread_cond_init(&(b->ok_to_proceed), NULL);
}
32Barriers
void mylib_barrier(mylib_barrier_t *b, int num_threads)
{
    pthread_mutex_lock(&(b->count_lock));
    b->count++;
    if (b->count == num_threads) {
        b->count = 0;
        pthread_cond_broadcast(&(b->ok_to_proceed));
    }
    else
        while (pthread_cond_wait(&(b->ok_to_proceed),
                                 &(b->count_lock)) != 0)
            ;
    pthread_mutex_unlock(&(b->count_lock));
}
33Barriers
- Linear barrier: the trivial lower bound on the execution time of this function is O(n) for n threads.
- This implementation of a barrier can be sped up using multiple barrier variables organized in a tree.
- Use n/2 condition variable-mutex pairs for implementing a barrier for n threads.
- At the lowest level, threads are paired up and each pair of threads shares a single condition variable-mutex pair.
- Once both threads arrive, one of the two moves on; the other one waits.
- This process repeats up the tree.
- This is called a log barrier and its runtime grows as O(log p).
34Barrier
- Execution time of 1000 sequential and logarithmic barriers as a function of the number of threads on a 32-processor SGI Origin 2000.
35Barriers
- Use pthread condition variables and mutexes.
- Is this the best way?
  - Forces a thread to sleep and give up the processor
- Rather than give up the processor, just wait on a variable (busy wait)
- Will it be faster?
36QbusyBarrier
/* barrier type (fields as used below) */
typedef struct {
    pthread_mutex_t count_lock;
    int count;
    int *wait;
} qbusy_barrier_t;

void qbusy_init_barrier(qbusy_barrier_t *b, int nthreads)
{
    b->wait = (int *) malloc(nthreads * sizeof(int));
    b->count = 0;
    pthread_mutex_init(&(b->count_lock), NULL);
}

void qbusy_barrier(qbusy_barrier_t *b, int iproc, int num_threads)
{
    int i;

    if (num_threads == 1) return;

    b->wait[iproc] = 1;
    pthread_mutex_lock(&(b->count_lock));
    b->count++;
    if (b->count == num_threads) {
        pthread_mutex_unlock(&(b->count_lock));
        /* Now release the hounds */
        for (i = 0; i < num_threads; i++)
            b->wait[i] = 0;
    }
    else {
        pthread_mutex_unlock(&(b->count_lock));
        while (b->wait[iproc])
            ;                       /* busy wait on the per-thread flag */
    }
}
37Read-Write Locks
- In many applications, a data structure is read frequently but written infrequently.
- For such applications, we should use read-write locks.
- A read lock is granted when there are other threads that may already have read locks.
- If there is a write lock on the data (or if there are queued write locks), the thread performs a condition wait.
- If there are multiple threads requesting a write lock, they must perform a condition wait.
38Read-Write Locks
- The lock data type mylib_rwlock_t holds the following
  - a count of the number of readers,
  - the writer (a 0/1 integer specifying whether a writer is present),
  - a condition variable readers_proceed that is signaled when readers can proceed,
  - a condition variable writer_proceed that is signaled when one of the writers can proceed,
  - a count pending_writers of pending writers, and
  - a mutex read_write_lock associated with the shared data structure
39Read-Write Locks
typedef struct {
    int readers;
    int writer;
    pthread_cond_t readers_proceed;
    pthread_cond_t writer_proceed;
    int pending_writers;
    pthread_mutex_t read_write_lock;
} mylib_rwlock_t;

void mylib_rwlock_init(mylib_rwlock_t *l)
{
    l->readers = l->writer = l->pending_writers = 0;
    pthread_mutex_init(&(l->read_write_lock), NULL);
    pthread_cond_init(&(l->readers_proceed), NULL);
    pthread_cond_init(&(l->writer_proceed), NULL);
}
40Read-Write Locks
void mylib_rwlock_rlock(mylib_rwlock_t *l)
{
    /* if there is a write lock or pending writers, perform condition wait;
       else increment count of readers and grant read lock */
    pthread_mutex_lock(&(l->read_write_lock));
    while ((l->pending_writers > 0) || (l->writer > 0))
        pthread_cond_wait(&(l->readers_proceed),
                          &(l->read_write_lock));
    l->readers++;
    pthread_mutex_unlock(&(l->read_write_lock));
}
41Read-Write Locks
void mylib_rwlock_wlock(mylib_rwlock_t *l)
{
    /* if there are readers or writers, increment pending writers count and
       wait.  On being woken, decrement pending writers count and increment
       writer count */
    pthread_mutex_lock(&(l->read_write_lock));
    while ((l->writer > 0) || (l->readers > 0)) {
        l->pending_writers++;
        pthread_cond_wait(&(l->writer_proceed),
                          &(l->read_write_lock));
    }
    l->pending_writers--;
    l->writer++;
    pthread_mutex_unlock(&(l->read_write_lock));
}
42Read-Write Locks
void mylib_rwlock_unlock(mylib_rwlock_t *l)
{
    /* if there is a write lock then unlock; else if there are read locks,
       decrement the count of read locks.  If the count is 0 and there is a
       pending writer, let it through; else if there are pending readers,
       let them all go through */
    pthread_mutex_lock(&(l->read_write_lock));
    if (l->writer > 0)
        l->writer = 0;
    else if (l->readers > 0)
        l->readers--;
    pthread_mutex_unlock(&(l->read_write_lock));
    if ((l->readers == 0) && (l->pending_writers > 0))
        pthread_cond_signal(&(l->writer_proceed));
    else if (l->readers > 0)
        pthread_cond_broadcast(&(l->readers_proceed));
}
43Semaphores
- Synchronization tool provided by the OS
- Integer variable and 2 operations
- Wait(s): while (s <= 0) do no-op /* sleep */; s = s - 1
- Signal(s): s = s + 1
- All modifications to s are atomic
44The critical section problem
Shared: semaphore mutex = 1

repeat
    wait(mutex)
        critical section
    signal(mutex)
        remainder section
until false
45Using semaphores
- Two processes P1 and P2
- Statements S1 and S2
- S2 must execute only after S1 (a sketch of the standard solution follows)
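The standard solution is a semaphore initialized to 0 that P1 signals after S1 and that P2 waits on before S2. The sketch below (not from the slides) realizes this with threads and a POSIX semaphore; the names s1_done, p1, and p2 are illustrative.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t s1_done;                    /* initialized to 0, so S2 must wait */

void *p1(void *arg)
{
    printf("S1\n");               /* statement S1 */
    sem_post(&s1_done);           /* signal: S1 has executed */
    return NULL;
}

void *p2(void *arg)
{
    sem_wait(&s1_done);           /* wait until S1 has executed */
    printf("S2\n");               /* statement S2 */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&s1_done, 0, 0);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&s1_done);
    return 0;
}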
46Bounded Buffer Solution
Shared: semaphore empty = n, full = 0, mutex = 1

Producer:
repeat
    produce an item in nextp
    wait(empty)
    wait(mutex)
    add nextp to the buffer
    signal(mutex)
    signal(full)
until false

Consumer:
repeat
    wait(full)
    wait(mutex)
    remove an item from the buffer and place it in nextc
    signal(mutex)
    signal(empty)
    consume the item in nextc
until false

(A C realization with POSIX semaphores follows.)
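A possible realization of the pseudocode above with pthreads (not from the slides): POSIX counting semaphores play the roles of empty and full, and a pthread mutex stands in for the binary semaphore mutex. Buffer size, item counts, and names are illustrative.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8                                  /* buffer capacity */

int buffer[N];
int in = 0, out = 0;

sem_t empty_slots;                           /* counts free slots, starts at N  */
sem_t full_slots;                            /* counts filled slots, starts at 0 */
pthread_mutex_t buf_mutex = PTHREAD_MUTEX_INITIALIZER;

void *producer(void *arg)
{
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty_slots);              /* wait(empty) */
        pthread_mutex_lock(&buf_mutex);      /* wait(mutex) */
        buffer[in] = item;
        in = (in + 1) % N;
        pthread_mutex_unlock(&buf_mutex);    /* signal(mutex) */
        sem_post(&full_slots);               /* signal(full) */
    }
    return NULL;
}

void *consumer(void *arg)
{
    for (int i = 0; i < 100; i++) {
        sem_wait(&full_slots);               /* wait(full) */
        pthread_mutex_lock(&buf_mutex);      /* wait(mutex) */
        int item = buffer[out];
        out = (out + 1) % N;
        pthread_mutex_unlock(&buf_mutex);    /* signal(mutex) */
        sem_post(&empty_slots);              /* signal(empty) */
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&empty_slots, 0, N);
    sem_init(&full_slots, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}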
47Readers - Writers (priority?)
Shared: semaphore mutex = 1, wrt = 1
Shared: integer readcount = 0

Reader:
    wait(mutex)
    readcount = readcount + 1
    if (readcount == 1) wait(wrt)
    signal(mutex)
    ... read the data ...
    wait(mutex)
    readcount = readcount - 1
    if (readcount == 0) signal(wrt)
    signal(mutex)

Writer:
    wait(wrt)
    ... write to the data object ...
    signal(wrt)
48Readers Writers (priority?)
Shared: semaphore outerQ = 1, rsem = 1, rmutex = 1, wmutex = 1, wsem = 1

Reader:
    wait(outerQ)
        wait(rsem)
            wait(rmutex)
                readcnt++
                if (readcnt == 1) wait(wsem)
            signal(rmutex)
        signal(rsem)
    signal(outerQ)
    READ
    wait(rmutex)
        readcnt--
        if (readcnt == 0) signal(wsem)
    signal(rmutex)

Writer:
    wait(wsem)
        writecnt++
        if (writecnt == 1) wait(rsem)
    signal(wsem)
    wait(wmutex)
        WRITE
    signal(wmutex)
    wait(wsem)
        writecnt--
        if (writecnt == 0) signal(rsem)
    signal(wsem)
49Unix Semaphores
- Are a generalization of counting semaphores (more operations are permitted).
- A semaphore includes
  - the current value S of the semaphore
  - the number of processes waiting for S to increase
  - the number of processes waiting for S to be 0
- System calls
  - semget: creates an array of semaphores
  - semctl: allows for the initialization of semaphores
  - semop: performs a list of operations, one on each semaphore (atomically); a sketch follows this list
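A hedged sketch (not from the slides) of these three calls: create a small semaphore set, initialize one member with semctl(SETVAL), and use semop for wait and signal operations. The union semun definition is required from the caller on most systems; the key, permissions, and set size here are illustrative.

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* The caller must define this union for semctl() on most systems. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
    /* semget: create an array of 2 semaphores, owner read/write */
    int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
    if (semid < 0) { perror("semget"); return 1; }

    /* semctl: initialize semaphore 0 to 1 (usable as a mutex) */
    union semun arg;
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) < 0) { perror("semctl"); return 1; }

    /* semop: apply a list of operations atomically; here one wait
       (sem_op = -1) and later one signal (sem_op = +1) on semaphore 0 */
    struct sembuf wait_op   = { 0, -1, 0 };   /* sem_num, sem_op, sem_flg */
    struct sembuf signal_op = { 0, +1, 0 };

    semop(semid, &wait_op, 1);      /* enter critical section */
    /* ... critical section ... */
    semop(semid, &signal_op, 1);    /* leave critical section */

    semctl(semid, 0, IPC_RMID);     /* remove the semaphore set */
    return 0;
}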
50Unix Semaphores
- Each operation to be done is specified by a value sop.
- Let S be the semaphore value
  - if sop > 0 (signal operation)
    - S is incremented and processes waiting for S to increase are awakened
  - if sop == 0
    - if S == 0, do nothing
    - if S != 0, block the current process on the event that S becomes 0
  - if sop < 0 (wait operation)
    - if S >= |sop|, then S = S - |sop|; otherwise the process waits for S to increase