
Page 1:

PThreads (POSIX Threads)

All material not from online sources/textbook copyright © Travis Desell, 2012

A good tutorial/overview can be found here as well: https://computing.llnl.gov/tutorials/pthreads/

Page 2:

Overview

1. Forking and Joining Threads

2. Busy Waiting

3. Mutexes

4. Semaphores

5. Condition Variables

6. Read-Write Locks

7. Conclusions

Page 3:

Forking and Joining Threads

Page 4:

A process can have multiple threads, each of which accesses the same memory as the process. Many cores support hyper-threading, which lets them run multiple hardware threads at the same time, and threads can run across multiple cores and processors on the same motherboard. Now for a quick recap.

Threads are Shared Memory

Page 5:

Threads are Shared Memory

[Diagram: several cores, each with its own ALU, control unit, and registers, connected through an interconnect to a shared main memory (addresses and their contents).]

Page 6:

Threads are more lightweight than processes: they are contained within the same process and share its memory and resources, which allows them to be swapped in and out faster than processes. Threads still need their own program counter and call stack, however. The time it takes to swap a thread or process is called the context-switching time.

Threads are Lightweight

Page 7:

The initial process often acts as the master thread. It will fork off child threads, which later join back to the master process when they have completed their subtask.

[Diagram: a process forking Thread 1 and Thread 2, which later join back to it.]

Page 8:

As the process and its threads share the same memory, it is important to make sure that they don’t try to modify the same memory at the same time, as it could lead to inconsistencies in the program execution.

[Diagram: the same process with Thread 1 and Thread 2, all sharing the process's memory.]
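To make the danger concrete, here is a minimal sketch (not from the original slides; the counter and loop count are made up) in which two threads increment the same global without any synchronization. The final value is usually wrong because the read-modify-write is not atomic:

#include <pthread.h>
#include <cstdio>

long counter = 0;                    // shared by every thread in the process

void* increment(void*) {
    for (int i = 0; i < 1000000; i++) {
        counter++;                   // read-modify-write: not atomic, updates can be lost
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // Typically prints something less than 2000000.
    printf("counter = %ld\n", counter);
    return 0;
}

The mutexes, semaphores, and condition variables later in these slides exist precisely to serialize this kind of access.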

Page 9:

Threads can also fork off other threads, which can later join back to them (or to another thread).

[Diagram: Thread 1 forking Thread 3, which later joins back to Thread 1 (or another thread).]

Page 10:

#include <pthread.h>
#include <stdio.h>    /* printf */
#include <stdlib.h>   /* atol  */
…

// A global variable accessible to all threads.
int thread_count;

void* Hello(void* rank);  /* The function for each thread to run */

int main(int argc, char** argv) {
    pthread_t* thread_handles;

    thread_count = atol(argv[1]);

    thread_handles = new pthread_t[thread_count];

    long thread;
    for (thread = 0; thread < thread_count; thread++) {
        pthread_create(&thread_handles[thread], NULL, Hello, (void*)thread);
    }

    printf("Hello from the main thread!\n");

    for (thread = 0; thread < thread_count; thread++) {
        pthread_join(thread_handles[thread], NULL);
    }

    delete [] thread_handles;
    return 0;
}

void* Hello(void* rank) {
    long my_rank = (long)rank;
    printf("Hello from thread %ld of %d\n", my_rank, thread_count);
    return NULL;
}

You can start using pthreads simply by including the pthread.h header (it is available on almost every Unix-like system) and linking against the pthread library, e.g. compiling with g++ pthread_hello.cxx -lpthread (or with the -pthread flag).

pthread_hello.cxx

Page 11:

[pthread_hello.cxx listing repeated from Page 10]

Threads are referred to through variables of the pthread_t type, and you can make arrays of them just like you would with any other type.

pthread_hello.cxx

Page 12:

[pthread_hello.cxx listing repeated from Page 10]

pthread_create is the function you use to create threads. Calling this will actually create the thread and start it running.

pthread_hello.cxx

Page 13:

pthread_create takes a set of arguments:

int pthread_create(
    pthread_t*             thread_p,                  /* out */
    const pthread_attr_t*  attr_p,                    /* in  */
    void*                  (*start_routine)(void*),   /* in  */
    void*                  arg_p                      /* in  */
);

The first argument, thread_p, is an out parameter: the call initializes it with a handle to the newly created thread. attr_p we can ignore for the time being; it lets us set thread attributes.

pthread_create

Page 14:

[pthread_create signature repeated from Page 13]

The third argument is the function that the thread will run (that function can call other functions and so on). It needs to return a void pointer and take a void pointer as its argument. The last argument is a void pointer containing the arguments that will be passed to that function when the thread starts.

pthread_create
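The rank cast used in pthread_hello.cxx only works for a single integer-sized value. When a thread needs more than one argument, the usual pattern is to pass a pointer to a struct; a minimal sketch (the thread_args struct and worker function are illustrative, not part of the slides):

#include <pthread.h>
#include <cstdio>

struct thread_args {
    long   rank;
    double start, end;   // e.g., the sub-range this thread should process
};

void* worker(void* arg) {
    thread_args* args = (thread_args*)arg;   // cast the void* back to the real type
    printf("thread %ld works on [%f, %f)\n", args->rank, args->start, args->end);
    return NULL;
}

int main() {
    const int thread_count = 4;
    pthread_t   handles[thread_count];
    thread_args args[thread_count];          // one struct per thread; must outlive the threads
    for (long t = 0; t < thread_count; t++) {
        args[t].rank  = t;
        args[t].start = t * 0.25;
        args[t].end   = (t + 1) * 0.25;
        pthread_create(&handles[t], NULL, worker, &args[t]);
    }
    for (long t = 0; t < thread_count; t++) pthread_join(handles[t], NULL);
    return 0;
}

Note that the argument structs live in an array owned by main, so they remain valid while the threads run.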

Page 15:

[pthread_hello.cxx listing repeated from Page 10]

So in this case, each thread will start in its own call to the Hello function, each with a different rank argument.

pthread_hello.cxx

Page 16:

[pthread_hello.cxx listing repeated from Page 10]

The pthread_create function returns immediately; it doesn't wait for the function passed to it, or the thread it creates, to finish. This allows all the threads to be created and run in parallel while the main thread does other things.

pthread_hello.cxx

Page 17:

[pthread_hello.cxx listing repeated from Page 10]

After you've started threads, you'll often want to wait for them to complete what they're working on before you proceed to do something else. This can be accomplished with the pthread_join function.

pthread_hello.cxx

Page 18:

pthread_join takes a set of arguments:

int pthread_join(
    pthread_t  thread_p,   /* in  */
    void**     ret_val_p   /* out */
);

pthread_join waits for the thread identified by thread_p to complete (exit its function). ret_val_p, if non-NULL, receives the return value from that function.

pthread_join
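As a sketch of the second argument in action (the compute_square function is made up for illustration), a thread can hand back a heap-allocated result through its void* return value, which pthread_join stores into *ret_val_p:

#include <pthread.h>
#include <cstdio>

void* compute_square(void* arg) {
    long x = (long)arg;
    long* result = new long(x * x);   // heap-allocate so the value outlives the thread
    return result;                    // this pointer is what pthread_join reports
}

int main() {
    pthread_t handle;
    pthread_create(&handle, NULL, compute_square, (void*)7);

    void* ret = NULL;
    pthread_join(handle, &ret);               // ret now holds the thread's return value
    printf("result = %ld\n", *(long*)ret);    // prints 49
    delete (long*)ret;
    return 0;
}

Passing NULL instead of &ret, as pthread_hello.cxx does, simply discards the return value.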

Page 19:

[pthread_hello.cxx listing repeated from Page 10]

Finally, the memory allocated for the thread handles needs to be released.

pthread_hello.cxx

Page 20:

Busy Waiting (don’t do this)

Page 21:

Often when dealing with threads, they will all need to access the same piece of memory, but only one thread can be allowed to access it at a given time, to prevent inconsistencies.

Busy Waiting

unordered_map<string, int> my_map;

void my_thread_function(void* arguments) {
    …
    // only one thread can put things in the
    // unordered map at a time
    if (my_map["key"] > 0) {
        my_map["key"]++;
    } else {
        my_map.insert(make_pair("key", 1));
    }
    …
}

Page 22:

unordered_map<string, int> my_map;

int flag = 0;

void my_thread_function(void* arguments) {
    …
    // only one thread can put things in the
    // unordered map at a time
    my_rank = get_rank(arguments);

    while (flag != my_rank);   // busy wait until it is this thread's turn

    if (my_map["key"] > 0) {
        my_map["key"]++;
    } else {
        my_map.insert(make_pair("key", 1));
    }

    flag++;
    …
}

A simple solution might be to put a busy-waiting while loop in front of this critical section of the code, and increment the flag after it's done. This way (assuming the compiler doesn't optimize the loop away) all the threads will go through the critical section in order.

Busy Waiting

Page 23:

[busy-waiting listing repeated from Page 22]

However, this approach has a lot of problems. Depending on the compiler and its optimizations, the while loop might get removed, or some of the critical section might get reordered to before the while loop.

Busy Waiting

Page 24:

[busy-waiting listing repeated from Page 22]

Second, this approach forces threads to go through the critical section in rank order. If higher-ranked threads get to this section before the others, they will have to wait for them.

Busy Waiting

Page 25:

[busy-waiting listing repeated from Page 22]

Lastly, all the waiting threads are spinning busily, constantly re-checking that while loop. This is not a good use of system resources (and on mobile devices it would put an unnecessary drain on the battery).

Busy Waiting

Page 26:

Mutexes

Page 27:

Mutexes provide a much easier way to control access to the critical sections of your program. Mutex is an abbreviation of mutual exclusion, which is what mutexes provide: the thread holding the mutex excludes all other threads from obtaining it and entering the critical section until it releases the mutex.

Mutexes

Page 28:

Mutexes are used with 5 different functions:

int pthread_mutex_init(pthread_mutex_t* mutex_p,
                       const pthread_mutexattr_t* attr_p);

int pthread_mutex_lock(pthread_mutex_t* mutex_p);
int pthread_mutex_unlock(pthread_mutex_t* mutex_p);
int pthread_mutex_trylock(pthread_mutex_t* mutex_p);
int pthread_mutex_destroy(pthread_mutex_t* mutex_p);

Mutexes

Page 29:

pthread_mutex_init initializes the mutex. Just like with pthread_create, we can pass NULL for the attributes if we aren't using them. This returns 0 on success, and an error value otherwise.

pthread_mutex_init

Page 30:

pthread_mutex_lock will try to take the mutex. It will block until it gains control of the mutex. This returns 0 on success, and an error value otherwise.

pthread_mutex_lock

Page 31:

pthread_mutex_unlock will release the mutex, allowing one (and only one) of the other threads waiting for the given mutex to acquire the lock. The rest will continue blocking. This returns 0 on success, and an error value otherwise.

pthread_mutex_unlock

Page 32:

pthread_mutex_trylock is a non-blocking version of lock. For example, you might want to see if you can get the lock, and if you can't, do something else (and then check back on the lock later). This way you can compute something else while waiting for a lock. It returns 0 when it gets the lock, and a nonzero error value (EBUSY) otherwise.

pthread_mutex_trylock
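A sketch of that pattern (the "other work" here is just a local counter standing in for any useful computation that doesn't need the lock): spin on trylock and do something productive after each failed attempt, instead of blocking in pthread_mutex_lock:

#include <pthread.h>
#include <cstdio>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0;

void* worker(void*) {
    long other_work = 0;
    // pthread_mutex_trylock returns nonzero (EBUSY) while another thread holds the mutex.
    while (pthread_mutex_trylock(&mutex) != 0) {
        other_work++;                 // stand-in for "compute something else while waiting"
    }
    shared_counter++;                 // critical section
    pthread_mutex_unlock(&mutex);
    printf("did %ld units of other work while waiting\n", other_work);
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}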

Page 33:

unordered_map<string, int> my_map;

pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;

void my_thread_function(void* arguments) {
    …
    // only one thread can put things in the
    // unordered map at a time
    pthread_mutex_lock(&mymutex);

    if (my_map["key"] > 0) {
        my_map["key"]++;
    } else {
        my_map.insert(make_pair("key", 1));
    }

    pthread_mutex_unlock(&mymutex);
    …
}

We can take the previous example code and rewrite it using mutexes. It actually ends up a bit simpler.

Mutexes

Page 34:

[mutex listing repeated from Page 33]

Mutexes can also be initialized statically (as here, with PTHREAD_MUTEX_INITIALIZER) instead of calling pthread_mutex_init.

Mutexes
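For comparison, the same mutex could be set up dynamically; a minimal sketch of the non-static route (assuming default attributes, as the slides do elsewhere):

#include <pthread.h>

pthread_mutex_t mymutex;

int setup() {
    // NULL attributes, just like passing NULL to pthread_create.
    return pthread_mutex_init(&mymutex, NULL);   // 0 on success
}

void teardown() {
    pthread_mutex_destroy(&mymutex);             // release any resources held by the mutex
}

The static initializer is convenient for globals; pthread_mutex_init is needed when mutexes are created at run time (e.g., one per dynamically allocated object).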

Page 35:

[mutex listing repeated from Page 33]

Instead of having a busy wait loop, we simply have a lock call.

Mutexes

Page 36:

[mutex listing repeated from Page 33]

When we leave the critical section, we unlock the mutex.

Mutexes

Page 37:

[mutex listing repeated from Page 33]

Note that mutexes do not guarantee any order in which the threads will pass through the critical section. This makes them more efficient than busy waiting, but if order is important a mutex by itself is not enough (order generally is not important, however).

Mutexes

Page 38:

Semaphores

Page 39:

What if you want to guarantee some order for threads passing through a critical section? Semaphores allow this (and a lot more). Semaphores are commonly used in implementing mailboxes for concurrent and distributed message passing. They were named (by Edsger Dijkstra) after the mechanical railroad signaling device. In essence, a semaphore is a specialized unsigned int, taking values 0, 1, 2, 3, etc. It is in a locked state when its value is 0, and unlocked otherwise.

Semaphores

Page 40:

int sem_init(sem_t* semaphore_p,    /* out */
             int shared,            /* in  */
             unsigned initial_val); /* in  */

int sem_destroy(sem_t* semaphore_p);
int sem_post(sem_t* semaphore_p);
int sem_wait(sem_t* semaphore_p);

The use of semaphores is similar to mutexes. Instead of lock and unlock there are wait and post. Remember, semaphores take values 0..N. sem_post increments the value of the semaphore. sem_wait decrements it by 1, unless it is 0; if it is 0, it blocks until the value becomes positive, then decrements it and unblocks.

Semaphores
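A small sketch of wait/post in the "mailbox" style mentioned above (a single producer and a single consumer; the fixed-size array and item count are made up for illustration, and unnamed semaphores like this are deprecated on macOS):

#include <pthread.h>
#include <semaphore.h>
#include <cstdio>

sem_t items;                 // counts how many items are ready; starts at 0
int queue_buf[16];
int head = 0, tail = 0;

void* producer(void*) {
    for (int i = 0; i < 5; i++) {
        queue_buf[tail++] = i;   // produce an item
        sem_post(&items);        // value goes up: one more item is available
    }
    return NULL;
}

void* consumer(void*) {
    for (int i = 0; i < 5; i++) {
        sem_wait(&items);        // blocks while the value is 0, then decrements it
        printf("consumed %d\n", queue_buf[head++]);
    }
    return NULL;
}

int main() {
    sem_init(&items, 0, 0);      // not shared between processes, initial value 0
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&items);
    return 0;
}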

Page 41:

Implementing a Barrier

A barrier makes all threads reach the same point before any of them can proceed. It is possible to implement a barrier with semaphores that doesn't require busy waiting (an implementation using only a mutex would).

int counter = 0;
sem_t count_sem;    /* initialize to 1 */
sem_t barrier_sem;  /* initialize to 0 */
…
void* thread_task(…) {
    …
    /* start barrier */
    sem_wait(&count_sem);
    if (counter == thread_count - 1) {
        counter = 0;
        sem_post(&count_sem);
        for (j = 0; j < thread_count; j++) {
            sem_post(&barrier_sem);
        }
    } else {
        counter++;
        sem_post(&count_sem);
        sem_wait(&barrier_sem);
    }
    /* end barrier */
}

Page 42:

Implementing a Barrier

With count_sem initialized to 1, one thread at a time will be able to pass this sem_wait call; all the others will wait.

[semaphore barrier listing repeated from Page 41]

Page 43:

[semaphore barrier listing repeated from Page 41]

Implementing a Barrier

counter won't be equal to thread_count - 1 until the other thread_count - 1 threads have made it through the sem_wait call and incremented it.

Page 44:

[semaphore barrier listing repeated from Page 41]

Implementing a Barrier

The counter is incremented by every thread that gets through the first sem_wait. The count semaphore ensures only one thread accesses the counter at a time. After incrementing the counter, each thread waits on the barrier semaphore.

Page 45:

int counter = 0;
sem_t count_sem;    /* initialize to 1 */
sem_t barrier_sem;  /* initialize to 0 */
…
void* thread_task(…) {
    …
    /* start barrier */
    sem_wait(&count_sem);
    if (counter == thread_count - 1) {
        counter = 0;
        sem_post(&count_sem);
        for (j = 0; j < thread_count - 1; j++) {
            sem_post(&barrier_sem);
        }
    } else {
        counter++;
        sem_post(&count_sem);
        sem_wait(&barrier_sem);
    }
    /* end barrier */
}

Implementing a Barrier

When the last thread makes it through the sem_wait on the count semaphore, it resets the counter, posts to the count semaphore (so it can be reused for the next barrier), and then posts enough times to the barrier semaphore to allow all the other threads to pass through it.

Page 46:

[semaphore barrier listing repeated from Page 45]

Race conditions in this Barrier

What if we try to reuse this barrier? A potential problem: while the last thread through the barrier is still making its posts, a fast thread could reach the beginning of the next barrier and snag one of those posts to the barrier semaphore, preventing one of the threads still in the first barrier from ever being released (it would wait indefinitely).

Page 47:

Condition Variables

Page 48:

Condition Variables

There is an even better way to implement a barrier. Pthreads provides condition variables, which allow threads to wait until a certain event (or condition) happens. A condition variable is always associated with a mutex. Condition variables are used similarly to mutexes, but there is a third option: with a condition variable it is possible to wait, to signal a single waiting thread to unblock, or to broadcast, making all waiting threads unblock.

Page 49:

int pthread_cond_init(pthread_cond_t* cond_p,
                      const pthread_condattr_t* cond_attr_p);

int pthread_cond_destroy(pthread_cond_t* cond_p);

int pthread_cond_signal(pthread_cond_t* cond_var_p);
int pthread_cond_broadcast(pthread_cond_t* cond_var_p);
int pthread_cond_wait(pthread_cond_t* cond_var_p,
                      pthread_mutex_t* mutex_p);

Just like mutexes, condition variables need to be initialized and destroyed. pthread_cond_wait releases the mutex and causes the calling thread to wait for a signal on the condition variable. pthread_cond_signal will cause one (and only one) waiting thread to unblock from pthread_cond_wait. pthread_cond_broadcast will cause all waiting threads to unblock from pthread_cond_wait.

Condition Variables
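Outside of barriers, the most common condition-variable pattern is to guard some shared state with the mutex and wait in a loop that re-checks the condition. A minimal sketch (the ready flag and function names are illustrative):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  data_ready = PTHREAD_COND_INITIALIZER;
int ready = 0;                 // the condition the waiting thread cares about

void* waiter(void*) {
    pthread_mutex_lock(&mutex);
    while (!ready) {                             // re-check the condition after every wakeup
        pthread_cond_wait(&data_ready, &mutex);  // releases the mutex while waiting,
    }                                            // re-acquires it before returning
    /* ... use the data protected by the mutex ... */
    pthread_mutex_unlock(&mutex);
    return NULL;
}

void* notifier(void*) {
    pthread_mutex_lock(&mutex);
    ready = 1;
    pthread_cond_signal(&data_ready);   // wake one waiter; broadcast would wake them all
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main() {
    pthread_t w, n;
    pthread_create(&w, NULL, waiter, NULL);
    pthread_create(&n, NULL, notifier, NULL);
    pthread_join(w, NULL);
    pthread_join(n, NULL);
    return 0;
}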

Page 50:

int pthread_cond_wait(pthread_cond_t* cond_var_p,
                      pthread_mutex_t* mutex_p);

// pthread_cond_wait is essentially:

pthread_mutex_unlock(mutex_p);
wait_on_signal(cond_var_p);
pthread_mutex_lock(mutex_p);

In detail, pthread_cond_wait unlocks the mutex it is given and causes the executing thread to block until it is unblocked by another thread's pthread_cond_signal or pthread_cond_broadcast (after which it re-acquires the mutex).

Condition Variables

Page 51:

/* shared variables */
int counter = 0;
pthread_mutex_t mutex;
pthread_cond_t cond_var;
…
void* thread_task(…) {
    …
    /* start barrier */
    pthread_mutex_lock(&mutex);
    counter++;
    if (counter == thread_count) {
        counter = 0;
        pthread_cond_broadcast(&cond_var);
    } else {
        while (pthread_cond_wait(&cond_var, &mutex) != 0);
    }
    pthread_mutex_unlock(&mutex);
    /* end barrier */
}

A Safer Barrier

Using condition variables we can make a reusable barrier without race conditions.

Page 52:

[condition-variable barrier listing repeated from Page 51]

A Safer Barrier

This will let one thread through at a time.

Page 53:

[condition-variable barrier listing repeated from Page 51]

A Safer Barrier

The first thread_count - 1 threads will enter the while loop and call pthread_cond_wait, which unlocks the mutex so another thread can enter, and so on. The wait needs to be in a while loop in case the pthread_cond_wait call exits on an error (or potentially from some other signal).

Page 54:

[condition-variable barrier listing repeated from Page 51]

A Safer Barrier

When the last thread gets through the mutex, it can broadcast to all other threads, causing them to exit the pthread_cond_wait statement, and then exit the barrier.

Page 55:

[condition-variable barrier listing repeated from Page 51]

A Safer Barrier

Finally, the mutex is unlocked so that all the threads can get out of pthread_cond_wait (remember, the last step of pthread_cond_wait is to re-lock the mutex).

Page 56:

Read-Write Locks

Page 57:

Read-Write Locks

Say we want a data structure which is thread safe, i.e., multiple threads can access it simultaneously without data becoming corrupted and without segfaults. One of the simplest well-performing strategies for this is to use read-write locks. The general idea is that multiple threads can read from the data structure simultaneously without any problem, because they aren't changing anything. On the other hand, only one thread can write to the data structure at a time, because writing is when problems occur.

Page 58:

int pthread_rwlock_init(pthread_rwlock_t* rwlock_p,
                        const pthread_rwlockattr_t* attr_p);

int pthread_rwlock_destroy(pthread_rwlock_t* rwlock_p);

int pthread_rwlock_rdlock(pthread_rwlock_t* rwlock_p);
int pthread_rwlock_wrlock(pthread_rwlock_t* rwlock_p);
int pthread_rwlock_unlock(pthread_rwlock_t* rwlock_p);

Similar to the other locks, rwlocks need to be initialized and destroyed. pthread_rwlock_rdlock locks the rwlock for reading; multiple threads can hold the read lock at the same time. pthread_rwlock_wrlock locks the rwlock for writing; only one thread can hold the write lock (and no threads can hold it for reading while the write lock is held). pthread_rwlock_unlock unlocks the rwlock.

Read-Write Locks
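A sketch of typical read-write lock usage (the lookup table and function names are invented for illustration): any number of readers proceed concurrently, while a writer gets exclusive access:

#include <pthread.h>
#include <map>
#include <string>

pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
std::map<std::string, int> table;

int lookup(const std::string& key) {
    pthread_rwlock_rdlock(&table_lock);      // many threads may hold the read lock at once
    std::map<std::string, int>::const_iterator it = table.find(key);
    int value = (it != table.end()) ? it->second : -1;
    pthread_rwlock_unlock(&table_lock);
    return value;
}

void update(const std::string& key, int value) {
    pthread_rwlock_wrlock(&table_lock);      // exclusive: waits until no readers or writers remain
    table[key] = value;
    pthread_rwlock_unlock(&table_lock);
}

int main() {
    update("answer", 42);
    return lookup("answer") == 42 ? 0 : 1;
}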

Page 59:

Concurrent/Lock Free Data Structures

For more info: http://en.wikipedia.org/wiki/Non-blocking_algorithm

Page 60:

Concurrent/Lock-Free Data Structures

When developing your data structures for concurrent use, you want to make sure that concurrent access does not cause any data inconsistencies, race conditions, deadlocks, etc.

Page 61:

Non-Blocking Algorithms

In current terminology, non-blocking algorithms ensure that when multiple threads compete for a resource, no thread is postponed indefinitely by mutual exclusion (i.e., by a mutex lock). Non-blocking algorithms can be lock-free, where the system as a whole is guaranteed to make progress if it runs long enough (although individual threads may starve or block indefinitely). They can also be wait-free, which is stronger: every thread is guaranteed to make progress (i.e., no thread starves).

Page 62:

Lock-Free Data Structures

You should be aware that in recent years implementations of different data structures (such as queues) have been made lock-free, which can provide some big performance benefits. See:

1. Michael, Maged; Scott, Michael (1996). "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms". Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing (PODC 1996). Philadelphia, Pennsylvania, USA: ACM Press. pp. 267–275. ISBN 0-89791-800-2.

2. Kogan, Alex; Petrank, Erez (2012). "A Methodology for Creating Fast Wait-Free Data Structures". Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2012). New Orleans, LA, USA: ACM Press. pp. 141–150. ISBN 978-1-4503-1160-1.

Page 63:

Lock-Free Data Structures

Many of these lock-free data structures are built using a hardware-provided Compare-And-Swap (CAS) operation [1]. Compare-and-swap compares the contents of a memory location to a given value and, if they are the same, modifies the value at that memory location to a new given value. It returns true if the swap occurred, and false otherwise:

int cas(void* pointer, int compare_to, int new_value);

For this to be useful without memory issues, the whole operation must be done atomically (no other thread can access or modify the memory in between).

[1] http://en.wikipedia.org/wiki/Compare-and-swap
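As a sketch of the idea (using the GCC/Clang __sync_bool_compare_and_swap builtin rather than any particular library; the counter is just an illustration), a lock-free increment retries until its CAS succeeds:

#include <cstdio>

long counter = 0;

void lock_free_increment() {
    while (true) {
        long old_value = counter;             // take a snapshot
        long new_value = old_value + 1;
        // Atomically: if counter still equals old_value, set it to new_value.
        // If another thread changed counter in the meantime, the CAS fails and we retry.
        if (__sync_bool_compare_and_swap(&counter, old_value, new_value)) {
            return;
        }
    }
}

int main() {
    lock_free_increment();
    printf("counter = %ld\n", counter);       // prints 1
    return 0;
}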

Page 64:

A two-lock queue

struct node_t  {value: data type, next: pointer to node_t}
struct queue_t {Head: pointer to node_t, Tail: pointer to node_t,
                H_lock: lock type, T_lock: lock type}

initialize(Q: pointer to queue_t)
    node = new_node()              # allocate a dummy node
    node->next = NULL
    Q->Head = Q->Tail = node       # both Head and Tail point to the dummy node
    Q->H_lock = Q->T_lock = FREE   # locks are initially unlocked (free)

Michael & Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms: http://dl.acm.org/citation.cfm?doid=248052.248106

Page 65:

A two-lock queue

enqueue(Q: pointer to queue_t, value: data type)
    node = new_node()
    node->value = value
    node->next = NULL
    lock(&Q->T_lock)        # acquire T_lock in order to access Tail
    Q->Tail->next = node    # link node at the end of the linked list
    Q->Tail = node          # swing Tail to the node
    unlock(&Q->T_lock)      # release T_lock

Michael & Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms: http://dl.acm.org/citation.cfm?doid=248052.248106

Page 66:

A two-lock queue

dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
    lock(&Q->H_lock)           # acquire H_lock in order to access Head
    node = Q->Head             # read Head
    new_head = node->next      # read next pointer
    if new_head == NULL        # is the queue empty?
        unlock(&Q->H_lock)     # if so, release the lock and return false
        return FALSE
    endif
    *pvalue = new_head->value  # queue was not empty: read value before releasing the lock
    Q->Head = new_head         # swing Head to the next node
    unlock(&Q->H_lock)         # release H_lock
    free(node)                 # free node
    return TRUE                # queue was not empty, dequeue succeeded

Michael & Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms: http://dl.acm.org/citation.cfm?doid=248052.248106

Page 67:

A lock free queue

struct pointer_t {ptr: pointer to node_t, count: unsigned int}
struct node_t    {value: data type, next: pointer_t}
struct queue_t   {Head: pointer_t, Tail: pointer_t}

initialize(Q: pointer to queue_t)
    node = new_node()           # allocate a free node
    node->next.ptr = NULL       # make it the only node in the linked list
    Q->Head = Q->Tail = node    # both Head and Tail point to it

Michael & Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms: http://dl.acm.org/citation.cfm?doid=248052.248106

Page 68:

A lock free queue

enqueue(Q: pointer to queue_t, value: data type)
E1:  node = new_node()              # create a new node
E2:  node->value = value
E3:  node->next.ptr = NULL
E4:  loop                           # keep trying until enqueue is finished
E5:      tail = Q->Tail             # read Tail.ptr and Tail.count together
E6:      next = tail.ptr->next      # read next ptr and count fields together
E7:      if tail == Q->Tail         # are tail and next consistent?
E8:          if next.ptr == NULL    # was Tail pointing to the last node?
                 # try to link node at the end of the linked list
E9:              if CAS(&tail.ptr->next, next, <node, next.count+1>)
E10:                 break          # enqueue is done, exit loop
E11:             endif
E12:         else                   # Tail was not pointing to the last node
                 # try to swing Tail to the next node
E13:             CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
E14:         endif
E15:     endif
E16: endloop
     # enqueue is done, try to swing Tail to the inserted node
E17: CAS(&Q->Tail, tail, <node, tail.count+1>)

Michael & Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms: http://dl.acm.org/citation.cfm?doid=248052.248106

Page 69:

A lock free queue

dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1:  loop                             # keep trying until dequeue is finished
D2:      head = Q->Head               # read Head
D3:      tail = Q->Tail               # read Tail
D4:      next = head.ptr->next        # read Head.ptr->next
D5:      if head == Q->Head           # are head, tail and next consistent?
D6:          if head.ptr == tail.ptr  # is the queue empty or Tail falling behind?
D7:              if next.ptr == NULL  # is the queue empty?
D8:                  return FALSE     # queue is empty, could not dequeue
D9:              endif
                 # Tail is falling behind, try to advance it
D10:             CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11:         else                     # no need to deal with Tail
                 # read value before CAS, otherwise another dequeue
                 # might free the next node
D12:             *pvalue = next.ptr->value
                 # try to swing Head to the next node
D13:             if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14:                 break            # dequeue is done, exit loop
D15:             endif
D16:         endif
D17:     endif
D18: endloop
D19: free(head.ptr)                   # it is now safe to free the old dummy node
D20: return TRUE                      # queue was not empty, dequeue succeeded

Michael & Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms: http://dl.acm.org/citation.cfm?doid=248052.248106

Page 70:

Conclusions

Page 71:

Conclusions

Just because a program has the right output does not mean it is correct! (See the barrier implementation(s).) Minimizing the use of locks makes your programs faster! If you use locks poorly, you end up with effectively serial programs. Researchers are actively working on better and faster concurrent data structures, trying to eliminate as many locks as possible. Keep yourself up to date!