Title: Programming Shared Address Space Platforms
1Programming Shared Address Space Platforms
- Carl Tropper
- Department of Computer Science
2Threads
- On a shared memory architecture, processes are
expensive - To maintain-they require that all data associated
with a process is private - To manipulate-overhead for scheduling processes
and maintaining security is significant - Enter threads, which assume that all memory is
globalf - Except for stacks associated with the thread
- Threads are invoked via function calls and are
created by a C function create_thread
3Threads-an example
4Logical memory model-a picture
- Logical model treats thread stack as local data
5Why threads?
- Portable
- Can develop application on a serial machine and
then run it on a parallel machine. - Can migrate between different shared memory
platforms. - Latency
- While one thread waits for communications,
another can use the CPU - I/O, memory access latencies can also be hidden
- Support for load balancing and scheduling in
APIs - Result-
- easier to get good performance then message
passing, - development tools for POSIX threads are
widespread and stable
6POSIX Thread API
- IEEE Standard, supported by most vendors
- Pthreads for short
- Other thread packages-NT threads, Solaris
threads, Java threads - Example-compute p by
- Generating random numbers
- Counting how many fall within the largest circle
which can be inscribed in a unit square
(areap/4) - Idea of program-a bunch of threads, each has a
fixed number of random points, each counts how
many are in the circle
7Threads-pthread_create
- Pthread_create creates threads
- Invokes thread function
- include ltpthread.hgt
- int pthread_create (
- pthread_t thread_handle, const pthread_attr_t
attribute, - void (thread_function)(void ),
- void arg)
- Thread_handle points to thread ID
- Thread has attributes specified by attribute
argument (later). If null, then get a default
thread - Arg passes important stuff to thread, e.g.
integer which serves as the random seed - Pthread_create returns 0 after successful
creation of thread
8Threads-Pthread_join
- Pthread_join provides mechanism to communicate
results produced by individual threads - int pthread_join (
- pthread_t thread,
- void ptr)
- Suspends execution of the calling program until
thread with ID given by thread terminates - Upon completion of pthread_join, the value passed
to pthread.exit is returned in the location
pointed to by ptr - Pthread_join returns 0 upon successful completion
9Compute Pi 1
- include ltpthread.hgt
- include ltstdlib.hgt
- define MAX_THREADS 512
- void compute_pi (void )
- ....
- main()
- ...
- pthread_t p_threadsMAX_THREADS
- pthread_attr_t attr
- pthread_attr_init (attr)
- for (i0 ilt num_threads i)
- hitsi i
- pthread_create(p_threadsi, attr, compute_pi,
- (void ) hitsi)
-
- for (i0 ilt num_threads i)
- pthread_join(p_threadsi, NULL)
- total_hits hitsi
-
10Compute Pi 2
- void compute_pi (void s)
- int seed, i, hit_pointer
- double rand_no_x, rand_no_y
- int local_hits
- hit_pointer (int ) s
- seed hit_pointer
- local_hits 0
- for (i 0 i lt sample_points_per_thread i)
- rand_no_x (double)(rand_r(seed))/(double)((2ltlt14
)-1) - rand_no_y (double)(rand_r(seed))/(double)((2ltlt14
)-1) - if (((rand_no_x - 0.5) (rand_no_x - 0.5)
- (rand_no_y - 0.5) (rand_no_y - 0.5)) lt 0.25)
- local_hits
- seed i
-
- hit_pointer local_hits
- pthread_exit(0)
-
11Threads are so sensitive
- Could increase local_hits outside (instead of
inside) the main loop in compute_pi by - changing localHits to (hit_pointer) and
- deleting hit_pointerlocal_hits outside the loop
- But performance gets whacked because adjacent
items in a shared cache line of the hit array are
getting written to, causing invalidates to be
issued (False sharing) - Solution- change hits to a 2-D array. Arrays are
stored row major, so cache lines wont be shared
because we force distance between the entries
12False sharing and the solution
Spaced_1 is the original change Spaced_1616
integer 2nd column Spaced_32 correspond to 32
integer 2nd column
13Synchronization Primitives
- Mutual exclusion for shared variables
- Race conditions-for example
- If (my_costltbest cost) best_costmy_cost
- If best cost is initially 100,
- there are two threads with my_cost 50,75,
- we have a non-deterministic outcome !
- A race determines the value of best_cost
- So threaded APIs provide support for
- critical sections and
- atomic operations
14Mutex_locks
- Mutex_locks (mutual exclusion locks)
- Atomic operation associated with code or a
critical section - Thread tries to get lock-if it is locked, the
thread is blocked - Thread leaving the cs has to unlock the
mutex_lock - P_threads provides functions for dealing with
locks
15Mutex_locks-the 3 maidens
- Int
- Pthread_mutex_lock
- pthread_mutex_t mutex_lock
- attempts a lock on mutex_lock, blocks if
mutex_lock is blocked success returns 0 - Int
- Pthread_mutex_unlock
- pthread_mutex_t mutex_lock
- Mutex_lock is released and a blocked thread is
allowed in - Int
- Pthread_mutex_init
- pthread_mutex_t mutex_lock,
- const pthread_mutexattr_t lock_attr
- Unlocks mutex_lock.
- Atributes of mutex_lock are specified in in
lock_attr. NULL supplies default values.
16Compute the minimum of a list of integers
- We can now write our previously incorrect code
segment as - pthread_mutex_t minimum_value_lock
- ...
- main()
- ....
- pthread_mutex_init(minimum_value_lock, NULL)
- ....
-
- void find_min(void list_ptr)
- ....
- pthread_mutex_lock(minimum_value_lock)
- if (my_min lt minimum_value)
- minimum_value my_min
- / and unlock the mutex /
- pthread_mutex_unlock(minimum_value_lock)
-
17Producer Consumer Work Queues
- Scenario-producer thread creates tasks and puts
them in work queues. Consumer threads grab tasks
and executes them - Can use shared data structure for the tasks,
but - Producer thread must not overwrite buffer before
consumer thread picks up task - Consumer threads must not pick up tasks which
dont exist - Pick up one task at a time
18Producer Consumer
- Task_available
- 1 producer task has to wait to produce
- 0 consumer thread can pick up task
- Need to protect operations on task available
variable
19Producer Code
- pthread_mutex_t task_queue_lock
- int task_available
- ...
- main()
- ....
- task_available 0
- pthread_mutex_init(task_queue_lock, NULL)
- ....
-
- void producer(void producer_thread_data)
- ....
- while (!done())
- inserted 0
- create_task(my_task)
- while (inserted 0)
- pthread_mutex_lock(task_queue_lock)
- if (task_available 0)
- insert_into_queue(my_task)
- task_available 1
20Consumer Code
- void consumer(void consumer_thread_data)
- int extracted
- struct task my_task
- / local data structure declarations /
- while (!done())
- extracted 0
- while (extracted 0)
- pthread_mutex_lock(task_queue_lock)
- if (task_available 1)
- extract_from_queue(my_task)
- task_available 0
- extracted 1
-
- pthread_mutex_unlock(task_queue_lock)
-
- process_task(my_task)
-
-
21Overheads of locking
- Size of critical section matters-only one thread
at a time is allowed in the section, i.e. the
program becomes serial - What to do? Instead of blocking a thread when
mutex is locked, allow thread to do other work
and poll mutex to see if it is unlocked - Int
- Pthread_mutex_trylock
- pthread_mutex_t mutex_lock
22K matches in a list
- / Finding k matches in a list /
- void find_entries(void start_pointer)
- / This is the thread function /
- struct database_record next_record
- int count
- current_pointer start_pointer
- do
- next_record find_next_entry(current_pointer)
- count output_record(next_record)
- while (count lt requested_number_of_records)
-
- int output_record(struct database_record
record_ptr) - int count
- pthread_mutex_lock(output_count_lock)
- output_count
- count output_count
- pthread_mutex_unlock(output_count_lock)
- if (count lt requested_number_of_records)
- print_record(record_ptr)
23K matches (continued)
- / rewritten output_record function /
- int output_record(struct database_record
record_ptr) - int count
- int lock_status
- lock_statuspthread_mutex_trylock(output_count_lo
ck) - if (lock_status EBUSY)
- insert_into_local_list(record_ptr)
- return(0)
-
- else
- count output_count
- output_count number_on_local_list 1
- pthread_mutex_unlock(output_count_lock)
- print_records(record_ptr, local_list,
- requested_number_of_records - count)
- return(count number_on_local_list 1)
-
-
24Condition variables
- Polling takes time. In the producer-consumer
example, the producer and consumer threads have
to poll for a lock. - An easier idea would be for the consumer thread
to signal the producer thread when buffer space
is available. - Enter condition variables-a thread can block
itself if a predicate becomes true - Task_available1 in producer-consumer
- Thread locks mutex on condition variable, and
tests the predicate-if the predicate is false,
the thread waits on the condition variable
25Condition variables
- Int pthread_cond_wait(pthread_cond_t
cond,pthread_mutex_t mutex) - is the function used to wait on the condition
variable - The function blocks the thread until it receives
a signal from the OS or from another thread - Releases the lock on mutex-have to do this in
order for other threads to be able to change the
predicate - Thread enters a queue
- When a condition becomes true, it is signaled
using pthread_cond_signal, unblocking one of the
threads in the queue
26Producer Consumer
- Producer produces task, inserts task on queue,
sets task-available1 - Sends signal to consumer thread via
- Int pthread_cond_signal(pthread_cond_t cond)
- Releases lock on mutex via pthread_mutex_unlock
- Initialize condition variable via
- Int pthread_cond_init(pthread_cond_t cond, const
pthread_condattr_t attr) - Destroy condition variable via
- Int pthread_cond_destroy(pthread_cond_t cond)
27Producer Consumer Using Condition Variables
- pthread_cond_t cond_queue_empty, cond_queue_full
- pthread_mutex_t task_queue_cond_lock
- int task_available
- / other data structures here /
- main()
- / declarations and initializations /
- task_available 0
- pthread_init()
- pthread_cond_init(cond_queue_empty, NULL)
- pthread_cond_init(cond_queue_full, NULL)
- pthread_mutex_init(task_queue_cond_lock, NULL)
- / create and join producer and consumer threads
/ -
28Producer Consumer Using Condition Variables II
- /create and join producer and consumer threads/
- void producer(void producer_thread_data)
- int inserted
- while (!done())
- create_task()
- pthread_mutex_lock(task_queue_cond_lock)
- while (task_available 1)
- pthread_cond_wait(cond_queue_empty,
- task_queue_cond_lock)
- insert_into_queue()
- task_available 1
- pthread_cond_signal(cond_queue_full)
- pthread_mutex_unlock(task_queue_cond_lock)
-
-
29Producer Consumer Using Condition Variables II
- void consumer(void consumer_thread_data)
- while (!done())
- pthread_mutex_lock(task_queue_cond_lock)
- while (task_available 0)
- pthread_cond_wait(cond_queue_full,
- task_queue_cond_lock)
- my_task extract_from_queue()
- task_available 0
- pthread_cond_signal(cond_queue_empty)
- pthread_mutex_unlock(task_queue_cond_lock)
- process_task(my_task)
-
-
30Controlling Thread and Synchronization Attributes
- Threads have different attributes (scheduling
algorithms, stack sizes.) - Attributes object has default attributes-programme
r can modify them - Advantage-system implements the attributes
- easier to read
- easier to change attributes
- easier to port to another OS
31Attribute Objects for threads
- Use pthread_attr_init to create an attributes
object (pointed to by attr)with default values - pthread_attr_destroy takes out attributes object
- Individual properties associated with the
attributes object can be changed using the
following functions - pthread_attr_setdetachstate,
- pthread_attr_setguardsize_np,
- pthread_attr_setstacksize,
- pthread_attr_setinheritsched,
- pthread_attr_setschedpolicy,
- pthread_attr_setschedparam
32Attribute objects for mutexes
- Pthreads supports 3 types of locks
- Normal-the default brand. Deadlocks if a thread
with a lock attempts to lock it again. Recursive
code can cause this to happen. - Recursive mutex-single thread can lock a mutex
multiple times - Each time it locks the mutex a counter is
incremented - each time it unlocks the mutex, a counter is
decremented. - Any other thread can lock only if the counter0
- Error-check mutex-like a normal mutex, only it
returns an error if a thread tries to lock a
mutex twice. Good for debugging.
33Attribute Objects for mutexes
- Initialize the attributes object using function
pthread_mutexattr_init. - The function pthread_mutexattr_settype_np can be
used for setting the type of mutex specified by
the mutex attributes object. - pthread_mutexattr_settype_np (
- pthread_mutexattr_t attr,
- int type)
- Here, type specifies the type of the mutex and
can take one of - PTHREAD_MUTEX_NORMAL_NP
- PTHREAD_MUTEX_RECURSIVE_NP
- PTHREAD_MUTEX_ERRORCHECK_NP
34Thread cancellation
- Sometimes it is necessary to cancel
thread(s)-threads which evaluate different
strategies in parallel need to go after best is
chosen, e.g. chess - Use pthread_cancel (pthread_t thread)
- Idea-a thread can cancel itself or other threads
35Composite Synchronization Structures
- Look at Useful higher-level constructs
- Useful because they meet commonly occurring needs
- Constructs built by combining basic synch
structures - Look at
- Read-write locks
- Barriers
36Read-write locks
- Think of readers and writers in a critical
section - Model-many readers, few writers
- Idea-let the readers read concurrently, but
serialize the writers - Read lock is granted if there are only readers
- If there is a write lock (or queued writes),
thread does a condition_wait
37Read write locks
- Read write locks based on a data structure
- mylib_rwlock_t
- containing the following
- a count of the number of readers,
- the writer (a 0/1 integer specifying whether a
writer is present), - a condition variable readers_proceed that is
signaled when readers can proceed, - a condition variable writer_proceed that is
signaled when one of the writers can proceed, - a count pending_writers of pending writers, and
- a mutex read_write_lock associated with the
shared data structure
38Read-write locks II
- Mylib_rwlock_rlock first checks for a write lock
or if there are writers queueing. If so, then it
does a condition_wait on readers_proceed, else it
grants read lock and increments the count of
readers - Mylib_rwlock_wlock checks for readers or
writers-if so, it increments the count of writers
and does a condition_wait on writer_proceed, else
it grants the lock - Mylib_rwlock_unlock
39Read-write locks
- typedef struct
- int readers
- int writer
- pthread_cond_t readers_proceed
- pthread_cond_t writer_proceed
- int pending_writers
- pthread_mutex_t read_write_lock
- mylib_rwlock_t
- void mylib_rwlock_init (mylib_rwlock_t l)
- l -gt readers l -gt writer l -gt pending_writers
0 - pthread_mutex_init((l -gt read_write_lock),
NULL) - pthread_cond_init((l -gt readers_proceed), NULL)
- pthread_cond_init((l -gt writer_proceed), NULL)
-
40Read-write locks code II
- void mylib_rwlock_rlock(mylib_rwlock_t l)
- / if there is a write lock or pending writers,
perform condition wait.. else increment count of
readers and grant read lock / - pthread_mutex_lock((l -gt read_write_lock))
- while ((l -gt pending_writers gt 0) (l -gt writer
gt 0)) - pthread_cond_wait((l -gt readers_proceed),
- (l -gt read_write_lock))
- l -gt readers
- pthread_mutex_unlock((l -gt read_write_lock))
-
41Read-write locks code III
- void mylib_rwlock_wlock(mylib_rwlock_t l)
- / if there are readers or writers, increment
pending writers count and wait. On being woken,
decrement pending writers count and increment
writer count / - pthread_mutex_lock((l -gt read_write_lock))
- while ((l -gt writer gt 0) (l -gt readers gt 0))
- l -gt pending_writers
- pthread_cond_wait((l -gt writer_proceed),
- (l -gt read_write_lock))
-
- l -gt pending_writers --
- l -gt writer
- pthread_mutex_unlock((l -gt read_write_lock))
-
42Barriers
- Can be implemented using counter, mutex,condition
variable. - Use an integer to keep track of the number of
threads which reach the barrier - If countltnumber of threads, threads execute
condition wait - If countnumber of threads, last thread wakes up
all the threads using a condition broadcast
43Barrier code
- typedef struct
- pthread_mutex_t count_lock
- pthread_cond_t ok_to_proceed
- int count
- mylib_barrier_t
- void mylib_init_barrier(mylib_barrier_t b)
- b -gt count 0
- pthread_mutex_init((b -gt count_lock), NULL)
- pthread_cond_init((b -gt ok_to_proceed), NULL)
-
44Barrier code II
- void mylib_barrier (mylib_barrier_t b, int
num_threads) - pthread_mutex_lock((b -gt count_lock))
- b -gt count
- if (b -gt count num_threads)
- b -gt count 0
- pthread_cond_broadcast((b -gt ok_to_proceed))
-
- else
- while (pthread_cond_wait((b -gt ok_to_proceed),
- (b -gt count_lock)) ! 0)
- pthread_mutex_unlock((b -gt count_lock))
-
45Barrier thoughts
- Lower bound on execution time is O(n), since
broadcast message follows queue - Speed things up by using a tree to implement the
barriers - Pair threads up so that each pair shares a single
condition variable mutex pair - A designated member of the pair waits for both
threads to arrive at the barrier - Continue until there is one thread
- Releasing threads requires signaling n/2
condition variables
46Performance on SGI Origin 2000
47Designing asynchronous programs-things that go
wrong
- Thread T1 creates thread T2 .It wants to put data
to send to T2 in global memory T1 gets switched
after creating T2 but before placing data.T2
might look for data and get garbage. - T1 creates T2, which is on T1s stack. T1 passes
data to T1, but finishes before T2 is scheduled.
Stack is destroyed and T2 never gets its data.
48(No Transcript)
49(No Transcript)
50(No Transcript)