Evaluating Threading Building Blocks
Pipelines
Sunu Antony Joseph
Master of Science
Computer Science
School of Informatics
University of Edinburgh
2010
Abstract
Parallel programming is a necessity in the multi-core era. Many parallel programming
languages and libraries have been developed with the aim of providing higher levels of
abstraction, allowing programmers to focus on algorithms and data structures rather
than the complexity of the machines they work on; this abundance makes it difficult for
programmers to choose the programming environment best suited to their application
development. Unlike for serial programming languages, very few evaluations have been
done for parallel languages or libraries that can help programmers make the right
choice. In this report we evaluate the Intel Threading Building Blocks library, a
C++ library that supports scalable parallel programming. The evaluation focuses
specifically on pipeline applications implemented using the filter and pipeline
classes provided by the library. Various features of the library that help during
pipeline application development are evaluated. Different applications are developed
using the library and evaluated in terms of their usability and expressibility. All
these evaluations are done in comparison to POSIX thread implementations of the same
applications. Performance evaluation of these applications is also done to understand
the benefits Threading Building Blocks has in comparison to the POSIX thread imple-
mentations. In the end we provide a guide for future programmers that will help them
decide on the programming library best suited to their pipeline application development,
depending on their needs.
Acknowledgements
First, I would like to thank my supervisor, Murray Cole, for his guidance and help
throughout this project and mostly for the invaluable support in difficult times of the
project period. I would also like to thank my family and friends for always standing
by me in every choice I make.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Sunu Antony Joseph)
To my parents and grandparents...
Table of Contents
1 Introduction 1
1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 8
2.1 Parallel Programming Languages . . . . . . . . . . . . . . . . . . . . 8
2.2 Task and Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Parallel Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Pipeline Pattern . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Pipeline using both data and task parallelism . . . . . . . . . 11
2.4 Intel Threading Building Blocks . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Threading Building Blocks Pipeline . . . . . . . . . . . . . . 13
2.5 POSIX Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Issues and Methodology 14
3.1 Execution modes of the filters/stages . . . . . . . . . . . . . . . . . . 14
3.2 Setting the number of threads to run the application . . . . . . . . . . 15
3.3 Setting an upper limit on the number of tokens in flight . . . . . . . . 15
3.4 Nested parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Usability, Performance and Expressibility . . . . . . . . . . . . . . . 16
4 Design and Implementation 18
4.1 Selection of Pipeline Applications . . . . . . . . . . . . . . . . . . . 18
4.1.1 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 Filter bank for multi-rate signal processing . . . . . . . . . . 19
4.1.3 Bitonic sorting network . . . . . . . . . . . . . . . . . . . . . 20
4.2 Fast Fourier Transform Kernel . . . . . . . . . . . . . . . . . . . . . 20
4.2.1 Application Design . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Filter bank for multi-rate signal processing . . . . . . . . . . . . . . . 26
4.3.1 Application Design . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Bitonic sorting network . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.1 Application Design . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 29
5 Evaluation and Results 32
5.1 Bitonic Sorting Network . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1.2 Expressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Filter bank for multi-rate signal processing . . . . . . . . . . . . . . . 37
5.2.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Expressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Fast Fourier Transform Kernel . . . . . . . . . . . . . . . . . . . . . 41
5.3.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3.2 Expressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Feature 1: Execution modes of the filters/stages . . . . . . . . . . . . 44
5.4.1 serial_out_of_order and serial_in_order Filters . . . . . . . . . 44
5.4.2 Parallel Filters . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.5 Feature 2: Setting the number of threads to run the application . . . . 47
5.6 Feature 3: Setting an upper limit on the number of tokens in flight . . 50
5.7 Feature 4: Nested parallelism . . . . . . . . . . . . . . . . . . . . . . 53
6 Guide to the Programmers 54
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Experience of the programmer . . . . . . . . . . . . . . . . . . . . . 54
6.3 Design of the application . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.6 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.7 Application Development Time . . . . . . . . . . . . . . . . . . . . . 57
7 Future Work and Conclusions 58
7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Bibliography 61
List of Figures
1.1 Parallel Programming Environments [2] . . . . . . . . . . . . . . . . 2
2.1 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Linear Pipeline Pattern . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Non-Linear Pipeline Pattern . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Flow of tokens through the stages in a pipeline along the timeline. . . 11
2.6 Using the Hybrid approach. Multiple workers working on multiple
data in stage 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Structure of the Fast Fourier Transform Kernel Pipeline. . . . . . . . 20
4.2 Structure of the Fast Fourier Transform Kernel Pipeline implemented
in TBB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 Performance of the Bitonic Sorting application(TBB) on machines with
different number of cores. . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Performance of the Bitonic Sorting application(pthread) on machines
with different number of cores. . . . . . . . . . . . . . . . . . . . . . 36
5.3 Performance of the Bitonic Sorting application(pthread) with and with-
out Locks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.4 Performance of the Filter Bank application(TBB) on machines with
different number of cores. . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Performance of the Filter Bank application(pthread) on machines with
different number of cores. . . . . . . . . . . . . . . . . . . . . . . . . 40
5.6 Performance of the Fast Fourier Transform Kernel application(TBB)
on machines with different number of cores. . . . . . . . . . . . . . . 43
5.7 Performance of the Fast Fourier Transform Kernel application(pthread)
on machines with different number of cores. . . . . . . . . . . . . . . 44
5.8 Latency difference for linear and non-linear implementation assuming
equal execution times of the pipelines stages. . . . . . . . . . . . . . 45
5.9 Performance of Filter bank application(TBB) for different modes of
operation of the filters. . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.10 Performance of Filter bank application with stages running with data
parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.11 Performance of Bitonic Sorting Network varying the number of threads
in execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.12 Performance of Fast Fourier Transform Kernel varying the number of
threads in execution. . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.13 Performance of Filter Bank varying the number of threads in execution. 49
5.14 Performance of the FFT Kernel for different input sizes and number of
cores of the machine. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.15 Performance of the Filter Bank application for different input sizes and
number of cores of the machine. . . . . . . . . . . . . . . . . . . . . 50
5.16 Performance of Bitonic Sorting Network varying the limit on the num-
ber of tokens in flight. . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.17 Performance of Fast Fourier Transform varying the limit on the number
of tokens in flight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.18 Performance of Filter Bank varying the limit on the number of tokens
in flight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.19 Performance of pthread applications varying the limit on the number
of tokens in flight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
List of Tables
4.1 Applications selected for the evaluation. . . . . . . . . . . . . . . . . 19
Chapter 1
Introduction
Programmers have grown used to serial programming and expect the programs they
develop to run faster on the next-generation processors that have been coming to
market over the past years. But those days are over: the next-generation chips will
have more processors, with each individual processor no faster than the previous
year's model [8]. The clock frequencies of chips are no longer increasing, making
parallelism the only way to improve the speed of the computer. Parallel computer systems
are useless without parallel software to utilise their full potential. The idea
of converting serial programs to run in parallel can be limiting, since the resulting
performance may be much lower than that of the best parallel algorithms. In sequential
programming languages like conventional C/C++, it is assumed that the set of instructions is executed
sequentially by a single processor whereas a parallel programming language assumes
that simultaneous streams of instructions are fed to multiple processors. Multithreading
aims at exploiting the full potential of multi-core processors. The transition to
multithreaded applications is inevitable, and leveraging existing multithreading
libraries can help developers thread their applications in many ways.
Parallel programming languages or libraries should help software developers write
parallel programs that work reliably while giving good performance. Parallel programs
execute non-deterministically and asynchronously, without a total order of events,
so they are hard to test and debug. The parallel programs developed should also
scale with the addition of more processors to the hardware system. The ultimate
challenge for parallel programming languages is to overcome these issues and help
developers develop software with the same ease as with serial programming languages.
Several parallel programming languages, libraries and environments have been de-
veloped to ease the task of writing programs for multiprocessors. Proponents of each
approach often point out various language features that are designed to provide the
programmer with a simple programming interface. But then why are programmers
hesitant to go parallel? For large codes, the cost of parallel software development can
easily surpass that of the hardware on which the code is intended to run. As a result,
users will often choose a particular multiprocessor platform based not only on
absolute performance but also on the ease with which the multiprocessor may be
programmed [18]. Many parallel programming environments have been developed in the
past two decades. So the next question is: why have these parallel programming
languages not been more productive? Why does only a small fraction of programmers
write parallel code? This could be because of the hassle of designing, developing,
debugging, testing and maintaining large parallel codes.
Figure 1.1: Parallel Programming Environments [2]
One of the most promising techniques for making parallel programming accessible to
general users is the use of parallel programming patterns. Parallel programming
patterns give software developers a language to describe the architecture of parallel
software. The use of design patterns promotes faster development of structurally
correct parallel programs. Design patterns are structured descriptions of high-quality
solutions to recurring problems.
An interesting parallel programming framework that provides templates for the
common patterns in parallel object-oriented design is Threading Building Blocks (TBB).
Intel Threading Building Blocks¹ is a C++ library built on the notion of separating
logical task patterns from physical threads and delegating task scheduling to the
multi-core system [2]. Threading Building Blocks uses templates for common parallel
patterns. A common parallel programming pattern is the pipeline pattern. Functional
pipeline parallelism is a pattern that is very well suited to, and used for, many
emerging applications, such as streaming and Recognition, Mining and Synthesis
(RMS) workloads [12].
In this report we evaluate the pipeline template provided by the Intel Threading
Building Blocks library. We evaluate the pipeline class in terms of its usability,
expressibility and performance. The evaluation is done in comparison to the
conventional POSIX² thread library. This comparative evaluation is based on many
experiments that we conducted on both parallel programming libraries. We implemented
in pthreads various features that the Threading Building Blocks library provides, to
understand how much easier the library makes the programmer's job during pipeline
application development.
The main intention of the project is to provide a guide to future programmers about
Intel Threading Building Blocks pipelines and also to provide a comparative analysis
of the TBB pipelines with the corresponding pipeline implementations using POSIX
thread.
1.1 Related Work
The growth in commercial and academic interest in the parallel systems has seen an
increase in the number of parallel programming languages and libraries. Development
of usable and efficient parallel programming systems have received much attention
from across the computing and research community. Szafron and Schaeffer, in their
paper [17], evaluate many different parallel programming systems, focusing on the
three primary issues of performance, applicability and usability. A controlled
experiment was conducted in which half of the graduate students
in a parallel/distributed computing class solved a problem using the Enterprise parallel
programming system [16] while the rest used a parallel programming system con-
sisting of a PVM[5]-like library of message passing calls (NMP [9]). They perform
¹ www.threadingbuildingblocks.org/
² http://standards.ieee.org/regauth/posix/
an objective evaluation during system development that gave them valuable feedback
on the programming model, completeness of the programming environment, ease of
use and learning curves. They perform two experiments: in the first, they measure
the ease with which novices can learn the parallel programming systems and produce
correct, but not necessarily efficient, programs; in the second, they measure the
productivity of the system in the hands of an expert. With the novices, the primary
aim was to measure how quickly they could learn the system and produce correct
programs. With the experts, the primary aim was to find the time it takes to produce
a correct program that achieves a specified level of performance on a given machine.
They collected statistics such as number of lines of code, number of edits, compiles,
program runs, development time, login hours etc. to measure the usability of the
parallel programming systems.
Another analysis of the usability of the Enterprise parallel programming system
was done by Wilson, Schaeffer and Szafron [22]. Before this work, there was little
comparable work on parallel programming systems other than that by Rao, Segall and
Vrsalovic [15], who developed a level of abstraction called the implementation
machine. The implementation machine layer provides the user with a set of commonly
used parallel programming paradigms. They analyse how the implementation machine
template helps the user implement the chosen implementation machine efficiently on
the underlying physical machine.
Another notable work is by VanderWiel, Nathanson and Lilja [18] where they
evaluate the relative ease of use of parallel programming languages. They borrow
techniques from the software engineering field to quantify the complexity of three
prominent programming models: shared memory, message passing and High Performance
Fortran. They use McCabe's cyclomatic program complexity [10] and non-commented
source code lines [6] to quantify the relative complexity of several
parallel programming languages. They conclude by saying that traditional software
complexity metrics are effective indicators of the relative complexity of parallel pro-
gramming languages or libraries.
1.2 Goals
The primary goal of the project is to evaluate Intel Threading Building Blocks
pipelines and to understand whether the library is well suited to pipeline
application development in terms of performance, scalability, expressibility and
usability. We need to compare parallel programming languages objectively for
pipeline applications. This can be done by understanding the needs of pipeline
application developers during their application development, and using those needs
as the criteria for evaluating Intel Threading Building Blocks pipelines.
The evaluation of the Intel Threading Building Blocks pipeline class is done in
comparison to conventional POSIX thread implementations. Many features provided by
the library are put to the test to understand how helpful they are to a pipeline
application developer. The evaluation includes a study of the usability of Intel
Threading Building Blocks, one of the factors that attract developers to the library
in the complex world of parallel programming. The usability tests are done from the
point of view of a novice parallel programmer, so that at this time when software
developers are making the transition to parallel programming, this information will
help them make the right choice for their pipeline application development. Further,
we look into the expressibility of the library by examining how well it suits
different pipeline applications, which will help future developers understand how
the library adapts to the variety of pipeline patterns commonly used in pipeline
application development. Finally, we evaluate the library in terms of the scalability
and performance of the pipeline applications developed with it. This will help
pipeline application developers understand the performance drawbacks or benefits of
using Intel TBB for their pipeline application development.
1.3 Motivation
There have been many usability experiments conducted for serial programming lan-
guages but very few for the parallel languages or libraries. Such experiments are neces-
sary to help narrow the gap between what parallel programmers want and what current
parallel programming systems provide [17]. Many parallel programming languages are
developed with the aim to simplify the task of writing parallel programs that run on
multi-core machines and there is very little data that tells us about the complexity of
the different parallel programming languages.
Many parallel programming languages/libraries have been developed, and each may
have its own pros and cons for the development of different applications. It is
necessary to understand these pros and cons and highlight them to developers, so
that they can make the right decision in selecting the apt language/library for
their application development. Since design patterns are solutions to
commonly seen problems, categorising the pros and cons of different programming
languages/libraries by design pattern is a good way to provide this information
to developers.
Parallel programs are hard to develop, test and debug. Hence, the usability of a
parallel programming language plays a very important role in its success. Usability
is certainly a factor programmers look for when developing applications under
fast-approaching deadlines, or when a novice programmer is trying to enter the
parallel programming world without much grasp of multi-threading concepts.
Parallel programming languages that are developed to make the job of parallel
programming easier may at times have drawbacks in terms of the expressibility,
performance or scalability of the programs developed. Understanding languages in
terms of these factors is essential for a proper evaluation of a parallel
language/library.
1.4 Thesis Outline
The thesis is divided into seven chapters, including this introductory chapter.
The remaining chapters are laid out as follows:
* Chapter 2 gives an overview of the basic concepts and terminology that
help in understanding the report.
* Chapter 3 discusses the various features of threading building blocks that help
during pipeline application development and the methodology used for the eval-
uation. It also discusses how we evaluate threading building blocks in terms of
usability, performance and expressibility, in comparison to the pthread library.
* Chapter 4 focuses on the design and implementation of the various applications
that are developed in threading building blocks and pthread library.
* Chapter 5 discusses the evaluations and the results obtained.
* Chapter 6 is a guide for future programmers that will help them choose between
threading building blocks and the pthread library for their pipeline application
development, depending on their needs.
* Chapter 7 draws an overall conclusion and suggests future work.
Chapter 2
Background
In this chapter we discuss the basic concepts that help in better understanding
the discussions and explanations given in the report. We discuss parallel
programming languages, different forms of parallelism, the pipeline design pattern,
Intel threading building blocks and POSIX threads.
2.1 Parallel Programming Languages
Parallel programming languages and libraries have been developed for programming
parallel computers. These can generally be divided into classes based on the assump-
tions they make about the underlying memory architecture - shared memory, dis-
tributed memory, or shared distributed memory. Shared memory programming lan-
guages communicate by manipulating shared memory variables. Distributed memory
uses message passing. OpenMP and POSIX Threads are two widely used shared mem-
ory APIs, whereas Message Passing Interface (MPI) is the most widely used message-
passing system API [21]. Intel Threading Building Blocks is an example of
shared-memory programming.
2.2 Task and Data Parallelism
Data parallelism is used when you have a large amount of data and want the same
operation performed on all of it. This item-by-item processing can be done in
parallel if the items have no dependencies on each other. A simple example of this
can be seen in Figure 2.1. Here the tasks can be done concurrently because there
are no dependencies between them. Another point to note is that the degree of data
parallelism is limited by the number of data items to be processed.
Figure 2.1: Data Parallelism
Task parallelism is used when you have multiple operations to perform on the same
data. In task parallelism, multiple tasks work on the same data concurrently. A
simple example of this can be seen in Figure 2.2. Here the tasks perform different,
independent operations on the same set of data concurrently.
Figure 2.2: Task Parallelism
2.3 Parallel Design Patterns
Parallel software usually does not fully utilise the underlying parallel hardware.
It is difficult for programmers to implement patterns that exploit the maximum
potential of the hardware, and most parallel programming environments do not focus
on design issues. So programmers need a guide during application development that
enables them to get the maximum parallel performance. Parallel design patterns are
expert solutions to commonly occurring problems that achieve this maximum parallel
performance. These patterns enable quick development of reliable parallel
applications.
2.3.1 Pipeline Pattern
The pipeline pattern is a common design pattern used when computations must be
performed over many sets of data. The computation to be performed on each set of
data can be viewed as several stages of processing performed in a particular order,
so the whole computation can be seen as data flowing through a sequence of stages.
A good analogy for this parallel design pattern is a factory assembly line, where
each worker is assigned a component of work and all the workers work simultaneously
on their assigned tasks. A simple example of a pipeline can be seen in Figure 2.3.
Figure 2.3: Linear Pipeline Pattern
In the example in Figure 2.3, stage 1, stage 2, stage 3 and stage 4 together form
the total computation to be performed on each set of data. The input data are fed
to stage 1 of the pipeline, where they are processed one after the other and then
passed on to the next stage. When the arrangement is a single straight chain
(Figure 2.3), it is called a linear pipeline pattern.
Figure 2.4: Non-Linear Pipeline Pattern
Figure 2.4 is an example of a non-linear pipeline pattern. Here you can see stages
with multiple operations happening concurrently. Non-linear pipelines allow feedback
and feed-forward connections, and also allow more than one output stage, which need
not be the last stage in the pipeline.
Figure 2.5: Flow of tokens through the stages in a pipeline along the timeline.
In Figure 2.5 you can see how the sets of data move through the stages as time
passes. Assuming that the workload is balanced equally among all the stages and
that each stage takes one time step to finish its processing on a set of data,
after the first four time steps a fully processed data set emerges at every
subsequent time step. So a data set that serially takes four time steps to process,
when processed using the pipeline, comes out processed at every time step after the
first four time steps. This gives the system good throughput. The initial four-step
delay arises because the four stages are not all working together at the start, and
some resources remain idle until every stage is occupied with useful work. This is
referred to as filling the pipeline, or the latency of the pipeline.
The amount of concurrency in the pipeline depends on the number of stages: the more
stages, the more concurrency. The data tokens need to be transferred between the
stages, which introduces an overhead on the work done by a stage. Thus, when the
computation on a set of data is divided into stages, the work done by each stage
should be substantial relative to the communication overhead between stages. The
pipeline design pattern works best when all of its stages are equally
computationally intensive. If the stages vary widely in the amount of work they do,
the slowest stage becomes a bottleneck in the performance of the pipeline.
2.3.2 Pipeline using both data and task parallelism
A pipeline is a form of task parallelism in which many tasks or computations are
applied to a stream of data. It is an example of fine-grained parallelism, because
the design involves frequent interaction between the stages of the pipeline. The
data-parallel way of doing the same job is to let a single thread perform the work
of the entire pipeline, while different threads work on different data concurrently.
This is an example of coarse-grained parallelism, since interactions between the
threads are infrequent.
In most cases the pipeline stages will not be doing work of the same computational
complexity; some stages may take much longer to perform their task than others.
Mixing data parallelism and task parallelism gives a good solution to this problem.
A computationally intensive stage can be run by multiple threads doing the same
work concurrently over different sets of data. As seen in Figure 2.6, this
introduces parallelism within a stage of the pipeline and improves the throughput
of the computationally intensive stages.
Figure 2.6: Using the Hybrid approach. Multiple workers working on multiple data in
stage 2.
2.4 Intel Threading Building Blocks
Threading Building Blocks is a library that supports scalable parallel programming
in standard C++ code. The library needs no additional language or compiler support
and works on any processor or operating system that has a C++ compiler [2]. Intel
Threading Building Blocks implements most of the common iteration patterns using
templates, so the user does not have to be a threading expert who knows the details
of synchronisation, cache optimisation or load balancing. The most important
feature of the library is that you specify only the tasks to be performed, and
nothing about the threads; the library itself maps tasks onto threads. Threading
Building Blocks supports nested parallelism, allowing larger parallel components
to incorporate smaller ones. Threading Building Blocks also allows scalable
data-parallel programming.
Threading Building Blocks provide a task scheduler which is the main engine that
drives all the templates. The scheduler maps the tasks that you have created onto the
physical threads. The scheduler follows a work stealing scheduling policy. When the
scheduler maps task to physical threads they are made non-preemptive. The physical
thread works on the task to which it is mapped until it is finished and it may perform
other task only when it is waiting on any child tasks or when there are not child task it
would perform the tasks created by other physical threads.
2.4.1 Threading Building Blocks Pipeline
In Threading Building Blocks the pipeline pattern is implemented using the pipeline
and filter classes. A series of filters represent the pipeline structure in threading build-
ing blocks. These filters can be configured to execute concurrently on distinct data
packets or to only process a single packet at a time.
Pipelines in Threading Building Blocks are organized around the notion that the
pipeline data represents a greater weight of memory movement costs than the code
needed to process it. Rather than “do the work, toss it to the next guy,” it’s more like
the workers are changing places while the work stays in place[7].
2.5 POSIX Threads
POSIX threads (pthreads) extend the already existing process model to include the
concept of concurrently running threads. The idea was to take some process resources
and make multiple instances of them so that they can run concurrently within a single
process; the instances made are the bare minimum needed to execute concurrently [11].
Pthreads give programmers access to low-level details: programmers see this as a
powerful option, since they can tailor low-level behaviour to the needs of the
application they are developing, but it also means the programmer has to handle many
design issues while developing the application. The Native POSIX Thread Library is
the software that allows the Linux kernel to execute POSIX thread programs
efficiently. Thread scheduling implementations differ in how threads are scheduled to
run; the pthread API provides routines to explicitly set thread scheduling policies
and priorities, which may override the default mechanisms.
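For example, a scheduling policy can be requested explicitly through a thread attributes object (a minimal sketch; the priority must lie in the policy's valid range, and actually creating a thread with a real-time policy such as SCHED_FIFO typically requires elevated privileges):

```cpp
#include <pthread.h>
#include <sched.h>

// Initialise a thread attributes object that requests an explicit
// scheduling policy instead of inheriting the creator's policy.
// Returns 0 on success, an errno value otherwise.
int init_fifo_attr(pthread_attr_t *attr, int priority) {
    int rc = pthread_attr_init(attr);
    if (rc != 0) return rc;
    // Without this, the policy/priority set below would be ignored.
    pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    sched_param sp;
    sp.sched_priority = priority;   // must be valid for the chosen policy
    return pthread_attr_setschedparam(attr, &sp);
}
```

The attribute object is then passed to pthread_create(); the create call itself may still fail with EPERM if the caller lacks the privilege to use the requested policy.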
Chapter 3
Issues and Methodology
In this chapter we discuss the different features of the Intel Threading Building
Blocks library that we intend to evaluate, and the methodology we use to evaluate
them in terms of usability, performance and expressibility. The initial phase of the
project is to understand pipeline application development in TBB and to learn what
features TBB has to offer, so that we can test those features in our evaluation and
analyse whether they are really useful during pipeline application development.
3.1 Execution modes of the filters/stages
Several key features of the Threading Building Blocks library are worth noting and
testing. As already discussed in the earlier chapter, Threading Building Blocks
implements pipelines using the pipeline and filter classes. A filter can be
made to run parallel, serially in order or serially out of order. When you
set the filter to run serially in order, the stage runs serially and processes
each input in the same order as it came into the input filter. Setting the filter to
serial out of order still processes one item at a time, but the order of the input
data may not be maintained. The filters can also be set to parallel, by which each stage
can work concurrently in a data parallel way, performing the same operation on
different data in parallel. These three filter modes are implemented and examined to
see whether they provide any favourable results. Pthread applications having parallel
stages are designed so that a comparative analysis can be done. A performance analysis
is done by calculating the speedup of the application.
3.2 Setting the number of threads to run the application
Another feature of Threading Building Blocks is the ability to run the pipeline with a
manually set number of threads. We need to test whether this facility of manually
deciding the number of threads working on the implementation is a good option provided
by the library. If the user decides not to set the number of threads, the library sets
the value itself, usually to the number of physical threads/processors in the system.
This TBB philosophy [2], of having one thread for each available concurrent execution
unit/processor, is put under test and checked for efficiency on the different pipeline
applications developed. Each application is run with a different number of threads,
which helps us understand how the thread count affects the performance of the
application. We also test whether setting the number of threads manually is more
beneficial than letting the library set it automatically: the results obtained from
automatic initialisation of the number of threads are compared with those obtained
with manual initialisation.
A pthread application run with a varying number of threads is checked to see if it can
give better performance results than the Threading Building Blocks counterpart.
Performance is measured in terms of speedup of the applications.
3.3 Setting an upper limit on the number of tokens in
flight
Threading Building Blocks gives you the ability to set an upper limit on the number
of tokens in flight in the pipeline. The number of tokens in flight is the number of
data items running through the pipeline, in other words the number of data sets being
processed in the pipeline at a particular instant of time. This controls the amount
of parallelism in the pipeline. In serial in order filter stages this will not
have an effect, as the tokens coming in are executed serially in order. But in the
case of parallel filter stages, multiple tokens can be processed by a stage in
parallel, so if the number of tokens in the pipeline is not kept in check there can be
excessive resource utilisation by the stage. There might also be a case where
subsequent serial stages cannot keep up with the fast parallel stages before them. The
pipeline's input filter will stop pushing in tokens once the number of tokens in
flight reaches this limit and will continue only when the output filter has finished
processing elements. Each application is run with different token limits, which tells
us how the limit helped in increasing the performance of the application. Performance
here is measured in terms of the speedup of the application. A similar feature is
implemented in the pthread applications and analysed.
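The mechanism can be sketched with a counter guarded by a mutex and a condition variable (a hand-rolled illustration, not TBB's internal implementation): the input filter calls acquire() before injecting a token and blocks once the limit is reached, and the output filter calls release() when a token is destroyed.

```cpp
#include <condition_variable>
#include <mutex>

// Bounds the number of tokens in flight in a pipeline.
class token_limit {
    std::mutex m;
    std::condition_variable freed;
    int in_flight = 0;
    const int max_tokens;
public:
    explicit token_limit(int max) : max_tokens(max) {}
    void acquire() {                 // input filter: before injecting a token
        std::unique_lock<std::mutex> lk(m);
        freed.wait(lk, [&] { return in_flight < max_tokens; });
        ++in_flight;
    }
    void release() {                 // output filter: when a token dies
        std::lock_guard<std::mutex> lk(m);
        --in_flight;
        freed.notify_one();
    }
    int active() {                   // current number of tokens in flight
        std::lock_guard<std::mutex> lk(m);
        return in_flight;
    }
};
```

With max_tokens set low the pipeline degenerates towards serial execution; set high, fast parallel stages can run far ahead of slow serial ones and buffered tokens accumulate, which is exactly the trade-off the experiments above measure.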
3.4 Nested parallelism
Threading Building Blocks supports nested parallelism, by which you can nest small
parallel executing components inside large parallel executing components. It is
different from running the stages with data parallelism: with nested parallelism it is
possible to run in parallel the different processing for a single token within a
stage. So in our pipeline implementations we incorporate nested parallelism by using
parallel constructs like parallel_for within the stages of the pipeline, and see if it
is useful to the overall implementation. This will help us understand how it differs
from the option of running a stage in parallel by setting the filter to run in
parallel. A series of performance tests are done to compare concurrently running
filters with nested parallelism inside the stages. The results tell us which is more
efficient for pipeline application development.
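The difference can be sketched as a stage that splits the processing of a single token across several threads (plain C++ threads for illustration; the stage, token type and thread count are hypothetical, and a real parallel_for would additionally balance the chunks automatically):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Inside one stage, process a single token (here an array) by splitting
// its elements across several threads - nested parallelism, as opposed
// to running copies of the whole stage on different tokens.
void stage_scale_nested(std::vector<double>& token, double factor, int threads) {
    std::vector<std::thread> pool;
    std::size_t chunk = (token.size() + threads - 1) / threads;
    for (int t = 0; t < threads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(token.size(), lo + chunk);
        pool.emplace_back([&token, factor, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) token[i] *= factor;
        });
    }
    for (auto& th : pool) th.join();
}
```

Here parallelism is applied to one token's work, so even a pipeline processing a single token at a time can benefit; a parallel filter, by contrast, gains only when several tokens are in flight.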
3.5 Usability, Performance and Expressibility
The next step was to decide on apt pipeline applications with which the various
features provided by the Threading Building Blocks library can be evaluated.
Applications of various types are taken, with varying input sizes and varying
complexity of the computation they have to perform. Applications with different
pipeline patterns are taken into consideration to understand the expressibility of the
library in comparison to the pthread version. Implementations of linear and non-linear
pipelines need to be compared with the pthread versions in terms of the performance of
the applications and how easy it is for the user to make enhancements or changes to
the program without actually changing much in the design of the program. Scalability
is another factor that needs to be considered: it should be understood whether the
same program gives proportional performance when run with more or fewer processors.
Usability of the languages is analysed all through the software development life cycle
by putting ourselves in the shoes of a parallel software programmer, who may be new to
parallel programming or may be a threading expert.
Pthread applications are developed so that they perform the same as the Threading
Building Blocks applications, following the same pipeline structure and computational
complexity of the algorithm. This is done so that there is a fair comparison of the
applications developed in the two libraries.
Chapter 4
Design and Implementation
In this chapter we discuss in detail the design and implementation issues encountered
during the development of the various applications in both parallel programming
libraries. Designing an application is a very important phase in the software
development life cycle, and an easy design phase really helps a programmer build
applications faster. The design phase in this project helps us understand the pros and
cons of the abstraction that the Threading Building Blocks library provides: by
carefully analysing the design effort a programmer must put in, given the flexibility
and the constraints the programming library provides, we can evaluate the Threading
Building Blocks library. Implementing these designs is a challenging task. During the
implementation phase the expressibility of the two parallel libraries is assessed,
along with how the various features provided by each library help in implementing the
intended design of an application. During this phase we primarily examine the
usability and expressibility of the Threading Building Blocks library in comparison to
the pthread library. As neophytes to TBB we found it very easy to understand the
pipeline and filter classes in the library and were quickly able to implement
applications with them.
4.1 Selection of Pipeline Applications
For evaluating TBB we need apt applications that would bring out the pros and cons
of the library. For this we used the StreamIt benchmark suite.1 The StreamIt [19]
benchmarks are a collection of streaming applications. These applications are
developed using the StreamIt language, so they cannot be used directly for our
purpose; they need to be recoded in Threading Building Blocks and pthread.
1http://groups.csail.mit.edu/cag/streamit/
Application Computational Complexity No. of stages Pattern
Bitonic Sorting Network Low 4 Linear
Filter Bank Average 6 Linear
Fast Fourier Transform High 5 Non-Linear
Table 4.1: Applications selected for the evaluation.
The selection of the applications is done such that they vary in their computational
complexity, pattern of the pipeline, number of stages in the pipeline and input size.
Because in Intel Threading Building Blocks we cannot determine the number of stages in
the pipeline at runtime, we could not include applications like the Sieve of
Eratosthenes [20], where the number of stages is determined during the run, whereas
this is possible in pthread [4]. This was one of the drawbacks found with Intel
Threading Building Blocks during this phase of the project. The final set of
applications selected was the Fast Fourier Transform kernel, Filter bank for
multi-rate signal processing and Bitonic sorting network, as seen in Table 4.1.
4.1.1 Fast Fourier Transform
The coarse-grained version of Fast Fourier Transform kernel was selected. The im-
plementation was a non-linear pipeline pattern with 5 stages. Fast Fourier Transform
is done on a set of n points which is one of the inputs given to the program. An-
other input given to the program is the permuted roots-of-unity look-up table which
is an array of first n/2 nth roots of unity stored in a permuted bit reversal order. The
Fourier Fourier implementation done here is a Decimation In Time Fast Fourier
Transform with input array in correct order and output array in bit-reversed order.
Details about Decimation In Time Fast Fourier Transform can be seen at [3].
The only requirement in the implementation is that n should be a power of 2 for it to
work properly.
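For illustration, the core of such a kernel can be sketched as a textbook iterative radix-2 transform (a simplified stand-in for the benchmark code: it applies the bit-reversal permutation to the input and leaves the output in natural order, whereas the kernel above leaves its output bit-reversed):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Iterative radix-2 decimation-in-time FFT, in place.
// Assumption (as in the kernel): n must be a power of two.
void fft(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    // Bit-reversal permutation so the butterflies can run in place.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * pi / static_cast<double>(len);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = a[i + k + len / 2] * w;
                a[i + k] = u + v;              // butterfly
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

The power-of-two restriction comes from the repeated halving in the butterfly passes, which is exactly why the kernel imposes the same requirement on n.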
4.1.2 Filter bank for multi-rate signal processing
An application that creates a filter bank to perform multi-rate signal processing was
selected. The coefficients for the sets of filters are created in the top-level initialisation
function, and passed down through the initialisation functions to filter objects. On each
branch, a delay, filter, and down-sample is performed, followed by an up-sample, delay
and filter [1].
4.1.3 Bitonic sorting network
An application that performs bitonic sort was selected from the StreamIt benchmark
suite. The program implements a high performance sorting network (by definition of a
sorting network, the comparison sequence is not data-dependent). It sorts in
O(n · log(n)²) comparisons [1].
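The fixed, data-independent comparison sequence of such a network can be sketched serially as follows (assuming, as in the benchmark, that the array length is a power of two):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic sorting network: the sequence of compare-exchange operations
// depends only on n, never on the data, so it parallelises naturally.
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();              // must be a power of two
    for (std::size_t k = 2; k <= n; k <<= 1)         // merge size
        for (std::size_t j = k >> 1; j > 0; j >>= 1) // comparison distance
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t l = i ^ j;           // partner of element i
                if (l > i) {
                    bool ascending = ((i & k) == 0);
                    if ((a[i] > a[l]) == ascending)
                        std::swap(a[i], a[l]);   // compare-exchange
                }
            }
}
```

Because the innermost loop's compare-exchanges touch disjoint pairs, each (k, j) round can be executed by many threads at once, which is what the pipeline stages exploit.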
4.2 Fast Fourier Transform Kernel
4.2.1 Application Design
The Fast Fourier Transform kernel implementation was a 5-stage pipeline with the
structure shown in Figure 4.1.
Figure 4.1: Structure of the Fast Fourier Transform Kernel Pipeline.
The intended design of the pipeline is as in Figure 4.1. Stage 1 is the input signal
generator which generates the set of n points and stores it in two arrays. Stage 2
generates the set of Bit-reversal permuted roots-of-unity look-up table which is an
array of the first n/2 nth roots of unity stored in a permuted bit-reversal order.
Both of these stages generate the arrays on the fly rather than reading from a file,
so as to avoid I/O overhead which may overshadow the performance of the pipeline.
Stage 3 is the Fast Fourier transform stage where the Decimation In Time Fast
Fourier transform with input array in correct order is computed. Stage 4 is where
the output array in bit-reversed order is created and passed on to the last stage in the
pipeline where the output is shown to the user.
4.2.2 Implementation
4.2.2.1 Threading Building Blocks
In the Threading Building Blocks implementation the pipeline was implemented as in
Figure 4.2.
Figure 4.2: Structure of the Fast Fourier Transform Kernel Pipeline implemented in TBB.
The implementation had a class that represented the data structure that is passed
along the different stages of the pipeline. The implementation of the class is shown in
Listing 4.1.
Listing 4.1: Data structure representing tokens in the Fast Fourier Transform Kernel
Application

    class data_obj
    {
    public:
        double *A_re;
        double *A_im;
        double *W_re;
        double *W_im;
        static int n;
        static data_obj *allocate(int _n)
        {
            n = _n;
            data_obj *t = (data_obj *) tbb_allocator<char>().allocate(sizeof(data_obj));
            t->A_re = (double *) tbb_allocator<double>().allocate(sizeof(double) * n);
            t->A_im = (double *) tbb_allocator<double>().allocate(sizeof(double) * n);
            t->W_re = (double *) tbb_allocator<double>().allocate(sizeof(double) * n / 2);
            t->W_im = (double *) tbb_allocator<double>().allocate(sizeof(double) * n / 2);
            return t;
        }
        void free()
        {
            tbb_allocator<double>().deallocate(this->A_re, sizeof(double) * n);
            tbb_allocator<double>().deallocate(this->A_im, sizeof(double) * n);
            tbb_allocator<double>().deallocate(this->W_re, sizeof(double) * n / 2);
            tbb_allocator<double>().deallocate(this->W_im, sizeof(double) * n / 2);
            tbb_allocator<char>().deallocate((char *) this, sizeof(data_obj));
        }
    };
The static function allocate creates an instance of the class, allocating the memory
required to perform the Fast Fourier Transform on the n input points, and returns a
pointer to the object created. The function free frees the allocated memory when the
computation is done at the end of the pipeline and the token is destroyed.
Stage 1 generates the input points at run time, following the algorithm described in
the benchmark suite, stores the values in the data structure data_obj as arrays A_re
and A_im, and passes it on to stage 2. Stage 2 creates the roots-of-unity look-up
table, which is stored in the W_re and W_im arrays. From there the data structure with
arrays A and W is passed to the Fast Fourier Transform stage, where the values are
computed, stored in array A itself, and passed on to the next stage. Stage 4 finds the
bit-reversed order of array A and passes it to the next stage to output the values.
The pipeline made was a linear pipeline which implements the same logic as the
original algorithm. The computation of each stage was written in the overloaded
operator()(void*) function of the class representing that stage of the pipeline. Each
of these classes inherits from the filter class.
The pointer that the overloaded operator()(void*) function returns is the pointer to
the token to be passed on to the next stage in the pipeline. This imposed the
restriction that all the components of a single token must be represented as a single
data structure, so that the token can be passed along the stages of the pipeline in
Threading Building Blocks.
4.2.2.2 Pthread
The pthread implementation has the same pipeline structure as the original pipeline
taken from the benchmark suite. The implementation has two input stages in the pipeline
joining at stage 3 and then having stage 4 and 5 following linearly. The overall pipeline
structure is defined as shown in Listing 4.2.
Listing 4.2: Data structure representing a pipe in the Fast Fourier Transform Kernel
Pthread Application

    struct pipe_type {
        pthread_mutex_t mutex; /* Mutex to protect pipe data */
        stage_t *head1;        /* First head */
        stage_t *head2;        /* Second head */
        stage_t *tail;         /* Final stage */
        int stages;            /* Number of stages */
        int active;            /* Active data elements */
    };

Here the mutex variable is used to obtain the lock over the pipeline information
variables (stages and active) and protect them during concurrent access. The variables
head1 and head2 are pointers to the two heads of the pipeline respectively. The vari-
able tail is the pointer to the last stage in the pipeline, stages being the count of
the number of stages, and active being the count of the number of tokens active in the
pipeline.
In the present pipeline structure since we have two kinds of stages, the stages are
represented by two kinds of structures. The structure that represent the stages that
receive input from a single stage and pass the token to a single stage is as shown in
Listing 4.3.
Listing 4.3: Data structure representing a stage type-1 in the Fast Fourier Transform
Kernel Application

    struct stage_type {
        pthread_mutex_t mutex;   /* Protect data */
        pthread_cond_t avail;    /* Data available */
        pthread_cond_t ready;    /* Ready for data */
        int data_ready;          /* Data present */
        double *A_re;            /* Data to process */
        double *A_im;            /* Data to process */
        double *W_re;            /* Data to process */
        double *W_im;            /* Data to process */
        pthread_t thread;        /* Thread for stage */
        struct stage_type *next; /* Next stage */
    };

The mutex variable is used to protect the data in the pipeline stage. The variables
avail and ready are condition variables: avail indicates to the pipeline stage that
there is data ready for it to consume/process, while ready indicates to the earlier
stage that this stage has finished processing its data and is ready to receive new
data. data_ready is an integer variable indicating to the sending stage that its data
has not yet been consumed by the receiving stage. The structure also includes the data
items to be processed by the stage; here these are pointers to the memory locations
that contain the data. There is also a pthread_t variable, the thread that processes
the stage. The last variable, next, points to the next stage in the pipeline
structure. In the implementation this structure is used to represent stages 1, 2, 4
and 5 in Figure 4.1. The structure that represents the stage that receives tokens from
two stages and sends tokens to a single stage is shown in Listing 4.4.
Listing 4.4: Data structure representing a stage type-2 in the Fast Fourier Transform
Kernel Application

    struct stage_type {
        pthread_mutex_t mutex;   /* Protect array A */
        pthread_cond_t avail;    /* Array A available */
        pthread_cond_t ready;    /* Ready for array A */
        pthread_mutex_t mutex1;  /* Protect array W */
        pthread_cond_t avail1;   /* Array W available */
        pthread_cond_t ready1;   /* Ready for array W */
        int data_ready;          /* Array A present */
        int data_ready1;         /* Array W present */
        double *A_re;            /* Data to process */
        double *A_im;            /* Data to process */
        double *W_re;            /* Data to process */
        double *W_im;            /* Data to process */
        pthread_t thread;        /* Thread for stage */
        struct stage_type *next; /* Next stage */
    };

The structure has two sets of mutex, avail, data_ready and ready variables, one for
each of the two stages that send data to this stage. By doing so, the granularity of
locking is lowered, allowing the two stages to work independently of each other.
This implementation works because both the stages write into different locations in the
data structure. This structure is used to implement stage 3 in the pipeline in Figure 4.1.
Stage 1 generates the input points at run time, following the algorithm described in
the benchmark suite, and sends pointers to arrays A_re and A_im to stage 3. Stage 2
creates the roots-of-unity look-up table in parallel with stage 1, stores it in the
W_re and W_im arrays and passes the pointers to stage 3. Stage 3, on receiving these
values, performs the Fast Fourier Transform and sends the result to stage 4, where the
bit-reversed order of the array is created and passed on to the last stage for output.
The passing of tokens in the pthread application was done using the function in
Listing 4.5. Here the thread first tries to acquire the lock to write into the buffer
of the next stage, then waits on the condition variable ready, which tells the thread
when the next stage's thread is ready to accept new tokens. After copying the values
of the token to the buffer of the next stage, the thread signals the avail condition
variable, telling the next stage's thread that a new token is ready to be processed.
Variations of this function were used to send tokens with different contents.
Listing 4.5: Function to pass a token to the specified pipe stage.

    int pipe_send(stage_t *stage, double *A_re, double *A_im,
                  double *W_re, double *W_im)
    {
        int status;

        status = pthread_mutex_lock(&stage->mutex);
        if (status != 0)
            return status;
        /*
         * If the pipeline stage is processing data, wait for it
         * to be consumed.
         */
        while (stage->data_ready) {
            status = pthread_cond_wait(&stage->ready, &stage->mutex);
            if (status != 0) {
                pthread_mutex_unlock(&stage->mutex);
                return status;
            }
        }

        /*
         * Copying the data to the buffer of the next stage.
         */
        stage->A_re = A_re;
        stage->A_im = A_im;
        stage->W_re = W_re;
        stage->W_im = W_im;
        stage->data_ready = 1;
        status = pthread_cond_signal(&stage->avail);
        if (status != 0) {
            pthread_mutex_unlock(&stage->mutex);
            return status;
        }
        status = pthread_mutex_unlock(&stage->mutex);
        return status;
    }

4.3 Filter bank for multi-rate signal processing
4.3.1 Application Design
The application design is a 6-stage linear pipeline. Stage 1 is the input generation
stage, which creates an array of signal values. The input signal is then convolved
with the first filter's coefficient matrix in stage 2. The signal is down-sampled in
stage 3 and then up-sampled in stage 4. The signal is then passed on to the next
stage, where it is convolved with the second filter's coefficient matrix. In the final
stage the values are added up into an output array until an algorithmically determined
number of tokens arrive, and then the values are output.
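The per-branch resampling steps can be sketched as follows (a simplified scalar version with a hypothetical rate factor k; the benchmark's filter stages additionally convolve the signal with the coefficient matrices):

```cpp
#include <cstddef>
#include <vector>

// Down-sample: keep every k-th sample of the signal.
std::vector<float> down_sample(const std::vector<float>& x, std::size_t k) {
    std::vector<float> y;
    for (std::size_t i = 0; i < x.size(); i += k)
        y.push_back(x[i]);
    return y;
}

// Up-sample: insert k-1 zeros after each sample, restoring the rate.
std::vector<float> up_sample(const std::vector<float>& x, std::size_t k) {
    std::vector<float> y;
    for (float v : x) {
        y.push_back(v);
        for (std::size_t z = 1; z < k; ++z) y.push_back(0.0f);
    }
    return y;
}
```

In the pipeline each of these steps is a separate stage, with the filter (convolution) stages before and after them smoothing the signal at the two rates.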
4.3.2 Implementation
4.3.2.1 Threading Building Blocks
The implementation of the pipeline in Threading Building Blocks has the same structure
as the original intended pipeline. The tokens passed between the stages are arrays
which are dynamically allocated using the tbb_allocator at each stage in the pipeline.
The input signal generated in stage 1 is put in an array and passed to stage 2 in the
pipeline. Stage 2 does the convolution of the signal with the filter coefficient
matrix; the convolution matrix is created during the initialisation phase of the
program. The convolved values are stored in an array and passed on to stage 3, where
the signal is down-sampled, and then up-sampled in stage 4, each time allocating new
arrays to hold the newly processed values before passing them on to the next stage.
Stage 5 does the convolution of the signal with the second filter coefficient matrix,
also created during the initialisation phase of the program. Finally, in stage 6 the
signal values are added up into an array until a predetermined number of tokens
arrive, after which the values are output. The sending of tokens was done in the same
way as in the Fast Fourier Transform kernel.
4.3.2.2 Pthread
The pipeline implemented in pthread is the same as the intended pipeline structure,
having 6 stages and performing the same functions as discussed for the Threading
Building Blocks version. Since the pattern is a linear pipeline and the stages have
the same structure, that is, each has a single source of tokens and a single recipient
for them, all the stages are represented by the structure shown in Listing 4.6.
Listing 4.6: Data structure representing a stage in the Filter bank for multi-rate
signal processing application

    struct stage_type {
        pthread_mutex_t mutex;   /* Protect data */
        pthread_cond_t avail;    /* Data available */
        pthread_cond_t ready;    /* Ready for data */
        int data_ready;          /* Data present */
        float *data;             /* Data to process */
        pthread_t thread;        /* Thread for stage */
        struct stage_type *next; /* Next stage */
    };

Here the mutex variable protects the data in the stage, while the avail and ready
condition variables indicate the availability of data for processing and the readiness
of the stage to accept new data. The structure also has the pointer to the data item,
the thread that processes the stage and the pointer to the next stage in the pipeline
structure.
The overall pipeline is defined with the structure shown in Listing 4.7, having a
mutex variable which is used to obtain a lock over the pipeline information variables
(stages and active). The head and tail pointers point to the first and last stages in
the pipeline. The variables stages and active maintain the count of the number of
stages and the number of tokens in the pipeline. The sending of tokens to the next
stage was done using the function in Listing 4.5, except for the difference in the
data copied to the buffer of the next stage.
Listing 4.7: Data structure representing the pipeline in the Filter bank for
multi-rate signal processing application

    struct pipe_type {
        pthread_mutex_t mutex; /* Mutex to protect pipe */
        stage_type *head;      /* First stage */
        stage_type *tail;      /* Last stage */
        int stages;            /* Number of stages */
        int active;            /* Active data elements */
    };

The Filter Bank application was redesigned to implement the stages with data
parallelism. This included the addition of a shared memory data structure where all
the threads working in a stage can access the tokens to be processed. The shared
memory data structure is shown in Listing 4.8. One instance of this data structure is
shared between all the threads working in a particular stage. The functionality of the
components is the same as that discussed for Listing 4.6.
Listing 4.8: The shared memory data structure for the threads working in the same
stage

    typedef struct shared_data {
        pthread_mutex_t mutex; /* Protect data */
        pthread_cond_t avail;  /* Data available */
        pthread_cond_t ready;  /* Ready for data */
        int data_ready;        /* Data present */
        float *data;           /* Data to process */
    } shared_mem;

4.4 Bitonic sorting network
4.4.1 Application Design
The bitonic sorting network application taken was originally a 3-stage pipeline, which
was extended to a 4-stage pipeline: stage 1 for the input generation, stage 2 for the
creation of the bitonic sequence from the input values, stage 3 for the sorting of the
bitonic sequence, and the final stage to output the sorted values. The application has
been altered from the original benchmark suite design to incorporate inputs of varied
size. The application is designed to sort many fixed-size arrays of numbers. Initially
the application was designed to read values from a file, sort them, and write them
into an output file; this was later changed for reasons we discuss in a later part of
the report. So stage 1 and stage 4 were initially designed to read values from an
input file and to write values into an output file respectively.
4.4.2 Implementation
4.4.2.1 Threading Building Blocks
The implementation is the same as the intended pipeline design, having 4 pipeline
stages in a linear pipeline structure. The stage 1 input filter class generates a set of
randomly generated numbers and passes them on to the next stages. The class also maintains
a count of the number of input tokens generated and stops when a required limit is
reached. The stage 2 class implements the logic for the bitonic sequence generation
from the array received from the input generation class. Here the computation involves
only comparing and swapping of values. The stage 3 implements the merge phase of
the bitonic sequence and does almost the same amount of computation as stage 2 in
terms of the number of comparisons and swaps. The final stage is where the sorted
values are output. The sending of data in all the above cases involves only passing
pointers to the array of numbers, done by returning the token pointer from the
overloaded operator()(void*) function in the classes representing the stages.
In the initial implementation of the application the input was read from a file
rather than generating the input during the execution. The initial implementation was
changed for reasons we discuss in the evaluation section. The tokens were initially
represented using a circular buffer of fixed size. Each array in the buffer was filled
with input values and passed on to the next stage. This can be seen in Listing 4.9.
Here buff is the circular buffer of arrays. It has to be ensured in the implementation
that the number of tokens in flight is not more than SIZE, which is easily done in
threading building blocks.
Listing 4.9: The Circular buffer to hold tokens

    class InputFilter : public tbb::filter {
        Buffer buff[SIZE];
        size_t nextBuffer;

    public:
        void *operator()(void *);
        InputFilter() : filter(serial_in_order), nextBuffer(0) {
        }
        ~InputFilter() {
        }
        void Tokenize(const string &str, vector<string> &tokens,
                      const string &delimiters = " ");
    };

    void InputFilter::Tokenize(const string &str, vector<string> &tokens,
                               const string &delimiters)
    {
        string::size_type lastPos = str.find_first_not_of(delimiters, 0);
        string::size_type pos = str.find_first_of(delimiters, lastPos);
        while (string::npos != pos || string::npos != lastPos) {
            tokens.push_back(str.substr(lastPos, pos - lastPos));
            lastPos = str.find_first_not_of(delimiters, pos);
            pos = str.find_first_of(delimiters, lastPos);
        }
    }

    void *InputFilter::operator()(void *) {
        string line;
        static fstream input_file(InputFileName);
        vector<string> toks;
        if (getline(input_file, line)) {
            Buffer &buffer = buff[nextBuffer];
            nextBuffer = (nextBuffer + 1) % SIZE;
            Tokenize(line, toks);
            for (int y = 0; y < toks.size(); y++) {
                buffer.array[y] = atoi(toks[y].c_str());
            }
            return &buffer;
        }
        else {
            return NULL;
        }
    }

4.4.2.2 Pthread
The pthread implementation has the same intended linear pipeline design with 4 stages.
The structure of the pipeline stage is similar to Listing 4.6. It contains two condition
variables used to synchronise the sending and receiving of data and the mutex variable
to protect the data in the pipeline. It contains the pointer data that points to the array
and another pointer next pointing to the next stage in the pipeline, which is used
during sending of data.
The main structure of the pipeline is similar to Listing 4.7. The structure contains
the pointers to the head and tail stages of the pipeline and also a mutex variable to
protect the pipe information data. The variables stages and active contain the
number of stages in the pipeline and the number of tokens in flight respectively. The
sending of tokens to the next stage was done using the function shown in Listing 4.5,
except for the difference in the data that is copied to the buffer of the next stage.
Chapter 5
Evaluation and Results
In this chapter we discuss the evaluation done on the threading building blocks library.
We evaluate the usability and expressibility of the two parallel programming libraries
and also the performance of the applications developed using those libraries. The
different features in the threading building blocks library are evaluated and a comparative
analysis is done by implementing these features in the pthread applications developed.
We start with the discussion about the evaluation and the results obtained for the
three pipeline applications implemented. Here we discuss the usability and express-
ibility issues faced during each application development and also do a performance
analysis of the applications. We then discuss the results and the evaluation of the
different features provided by threading building blocks.
We ran various applications on multi-core machines and measured the speedup of
the different applications developed. The experiments were run on 16-core and 2-
core machines. The 2-core machine used had Intel(R) Xeon(TM) 3.20GHz processors
with 2 GB RAM and the 16-core machine had Intel(R) Xeon(R) 2.13GHz processors
with 63 GB RAM. Most of the experiments are carried out on the 16-core machine.
Executions done on the 2-core machines have been explicitly mentioned.
5.1 Bitonic Sorting Network
The evaluation of threading building blocks started with the bitonic sorting network
application. This was challenging in many ways because it was the first pipeline appli-
cation we were developing using the library. As newcomers to the threading building
blocks library we had the usual initial difficulties until we became used to the
constructs and the features that the library provided.
5.1.1 Usability
5.1.1.1 Threading Building Blocks
During the initial programming phase the challenging part was to create the right data
structure that we would use to represent tokens and pass them efficiently across stages.
We had tried many data structures before we actually decided on one. One of the
data structures tried had a large array which was divided into n buffers of data that
represented a token in the pipeline which had to be sorted. The size of the large array
was fixed and could incorporate only a fixed number of tokens. The input filter would
fill these buffers one after the other and then pass it on to the next stage for processing.
After all the n buffers were filled, it would start again by filling in the first buffer. This
implementation worked because we were able to limit the number of tokens in flight
using the Threading Building Blocks library and also for the fact that we could program
the stages to run sequentially and process the tokens in a fixed order. By limiting the
number of tokens in flight to n it was ensured that when the input filter comes to the
next round of filling up of the large array starting from the first buffer, the first buffer
in the previous round had already finished processing at the last stage of the pipeline.
Making the stages run serially and to process the tokens in the same order as created
by the input filter also ensured that the buffers that had not finished processing, will
not be overwritten with new values. This implementation was later changed since we
intended to experiment with the number of tokens in flight in the pipeline, for which
this was not the best suited design. Still, it was notable that the features the
threading building blocks library provides, such as limiting the number of tokens in
flight, running stages sequentially and processing tokens in order, easily allowed us
to implement a design like this.
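As a sketch of how this worked, using the classic 2010-era tbb::pipeline interface assumed throughout this report (the filter class names are illustrative; this fragment is not compilable on its own):

```cpp
// SIZE circular-buffer slots, and at most SIZE tokens in flight:
tbb::pipeline pipeline;
MyInputFilter input;         // fills buff[nextBuffer], advances nextBuffer
MyMiddleFilter middle;
MyOutputFilter output;
pipeline.add_filter(input);
pipeline.add_filter(middle);
pipeline.add_filter(output);
pipeline.run(SIZE);   // token limit == buffer count, so by the time the
                      // input filter wraps around, the slot it is about to
                      // reuse has already left the last stage
```

Setting the run() argument equal to the number of buffer slots is exactly the invariant that made the circular-buffer design safe.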
As far as the computational part of performing the bitonic sort was concerned, we
just had to implement the C++ serial code for each stage and place it in the
overloaded operator()(void*) function in the respective classes that represented
the stages. As soon as we had the right data structure for the tokens passed along
the pipeline and had implemented the computational task done by each of the stages,
we had a correctly working pipeline without much hassle. As parallel programmers we
did not have to bother about low level threading concepts like synchronisation, load
balancing or cache efficiency.
5.1.1.2 POSIX Thread
Bitonic Sorting Network being our first pipeline application in pthread, we had to
understand many concepts thoroughly before we actually started writing the program.
We had to understand in detail the use of thread spawning, mutexes and condition
variables for synchronising threads, inter-thread communication etc. The most
important challenge during the application development in pthread was to get the right
design for the pipeline. It was a difficult task because of the amount of
implementation detail we needed to handle. We had to fix the structure of the pipeline
so that the threads handling each stage correctly send and receive data, which
included the use of mutexes to protect the critical sections and condition variables
for the synchronisation of sending and receiving data. Getting the right design was the most
difficult task but then there were many difficulties we faced during the later develop-
ment stages. To implement a feature like limiting the count of tokens in flight we had
to include a counter in the design that kept the count and had to include mutexes for
the access of the counter variable.
The work that had to be done by each stage was written in a separate function and
assigned to a thread during thread initialisation. During the first exe-
cution of the program we had discovered a few errors and debugging a multi-threaded
application was not an easy task. It was really hard to determine if the errors were due
to the improper synchronisation of threads or due to the wrong computational logic
implemented. There were many cases in which the program could go wrong and iden-
tifying them was really hard.
5.1.2 Expressibility
5.1.2.1 Threading Building Blocks
As for the expressibility of the threading building blocks library we had no issue im-
plementing the desired design of the pipeline. We could easily set the input and output
stages to work sequentially when the data was read and written into files. We could
easily write data into the output file in the same order they were read from the input file
by just passing the serial in order keyword to the inherited filter class constructor
during the constructor call of the classes that represented the input and output stages.
We were able to run the middle stages of the pipeline in parallel (data parallelism)
just by passing the keyword parallel to the filter class constructor. When we imple-
mented the design where we had to restrict the number of tokens in flight, we were
easily able to do so because the library gave us the feature to set the maximum limit by
passing it to the run() function in the pipeline class.
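Concretely, with the classic tbb::filter interface used in this report, the mode of each stage is fixed in its constructor. The class names below are illustrative, and the fragment is a sketch rather than a standalone program:

```cpp
class InputStage : public tbb::filter {
public:
    InputStage() : tbb::filter(serial_in_order) {}    // read input in order
    void *operator()(void *);
};

class SortStage : public tbb::filter {
public:
    SortStage() : tbb::filter(parallel) {}            // data-parallel middle stage
    void *operator()(void *item);
};

class OutputStage : public tbb::filter {
public:
    OutputStage() : tbb::filter(serial_in_order) {}   // write output in order
    void *operator()(void *item);
};
```

Switching a stage between serial and parallel execution is therefore a one-word change, which is what made the experimentation described above so cheap.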
5.1.2.2 POSIX Thread
In the pthread implementation we were able to implement the intended design for the
application. We could limit the number of tokens in flight by including a counter in
the design that maintained that count, though this was not as easy as in the case of
threading building blocks.
5.1.3 Performance
The Bitonic sorting network application was initially developed with a design to read
the values to be sorted from files and then write the sorted values into another file. The
implementation was working fine except for a problem that when the performance of
the application was calculated it was noted that the run-time of the application varied
greatly for each execution. Initially the assumption was that the varied execution
times were because the amount of computation in each stage was very small and was
overshadowed by the overhead of synchronisation of the threads done internally by the
threading building blocks library. On detailed reading about Intel threading building
blocks it was found that the library is not ideal for I/O bound applications as its
task scheduler is unfair and non-preemptive [13]. Thus the
design of the application had to be changed for further tests.
By replacing I/O stages with the stage that generated input during the execution, the
application was ready to be evaluated. The speedup of the application was calculated in
its best configuration and it was found that the speedup of the application was less
than 1. This could be because the computational complexity of the application was low
and was overshadowed by the synchronisation mechanism implemented internally in the
threading building blocks library. The threading building blocks application was
easily scaled to machines with different numbers of cores without the need to change
anything in the code, and the results obtained are shown in Figure 5.1.
With the initial design of the pthread application, with stages that read and write
data to files, the application gave steady execution times unlike the threading
building blocks version. Though the results were stable, the execution time of the
application was very high. Later, the pthread version of the application was designed with
Figure 5.1: Performance of the Bitonic Sorting application(TBB) on machines with dif-
ferent number of cores.
input generation stages that removed the overhead due to I/O operations in the pipeline.
On performance analysis it was noted that the pthread application took a very large
amount of time to execute compared to the threading building blocks version, as can
be seen in Figure 5.2. The pthread application was also evaluated on machines with
different numbers of cores without any change in the code; by contrast, this shows how
a newcomer programmer can achieve good machine-dependent performance without much
effort in threading building blocks. The results shown in Figure 5.2 were obtained
on evaluation.
Figure 5.2: Performance of the Bitonic Sorting application(pthread) on machines with
different number of cores.
The bad performance was assumed to be caused by the synchronisation mechanism
implemented in the program and by the stages not being computationally intensive.
To confirm this, a test was done by removing all the thread synchronisation mecha-
nisms in the application and calculating its run-time. Though the application then
gave incorrect output values, it could be understood from the run-time whether most
of the time was being spent on the synchronisation of threads in the application. The
results obtained are shown in Figure 5.3.
Figure 5.3: Performance of the Bitonic Sorting application(pthread) with and without
Locks.
It can be seen that there is a drastic reduction in execution time between the
versions with and without locks. So it was understood that, the application not being
very computationally intensive, most of the threads were idle most of the time
waiting on locks to be released.
5.2 Filter bank for multi-rate signal processing
The Filter Bank application was the second application that was developed. With the
experience we attained with Bitonic sorting network application we could immediately
start the work on the second application because we had familiarised ourselves with
both the parallel programming libraries and had a basic idea of how we would go
about designing and implementing a pipeline application in both pthread and threading
building blocks. The Filter bank application is more computationally intensive than the
bitonic sort application having to work on large signal arrays and large filter co-efficient
matrices. The pipeline had a longer chain with 6 stages in a linear pipeline structure.
5.2.1 Usability
5.2.1.1 Threading Building Blocks
The development of bitonic sorting network made us familiar with threading building
blocks, due to which the Filter Bank application was developed much faster than the
bitonic sorting network had been. We just had to paste in the computation for
each stage into the operator() function of the appropriate classes. To create the right
data structure for the token movement in the pipeline was the only challenge in the
implementation.
As a parallel programmer we wanted to make our pipeline application run faster
and we were easily able to identify the bottleneck stages by using the serial in order,
serial out of order and parallel options in the filter classes and finding their speedup.
By using these options we were very easily able to tweak the application for the best
performance.
5.2.1.2 POSIX Thread
Similar to the case of threading building blocks, the bitonic sort application imple-
mented in pthread gave us a quick start because we had already figured out a generic
structure for the pipeline. With a few application dependent changes in the design we
were immediately ready to start with the implementation. Getting the right design for
the application was the toughest part in bitonic sorting network, which we were able
to get done with moderate ease for the filter bank application. The reuse of the design
made development easy for us but then it was not the same in the case of threading
building blocks because many of the design issues were abstracted by the library and
the only notable challenge in threading building blocks was to get the right data struc-
ture for the tokens.
With the bottleneck stages easily identified in the threading building block applica-
tion we were easily able to tweak our application for performance but then in the case
of the pthread application we had to find the single token execution time in each stage
to understand which were the bottleneck stages in the pipeline. This was comparatively
a tougher task than what we had to do for the threading building block application.
Since we had found out the bottleneck stages in the pipeline the next step was to run
the stage with data parallelism. Implementing the stages to run in parallel needed many
changes in the already implemented design of the pthread application. This redesign,
though built on the existing design, had many challenges because of the cases where
data had to be sent to many recipients and received from many senders. Many issues
like synchronisation of threads and efficiency had to be considered for the right
design. A lot of time had to be spent on the redesign, testing and debugging of the
application, which was even harder than in the case of sequential stages, whereas in
threading building blocks there was no need to redesign the application as we just
had to pass the argument ‘parallel’ to the filter class constructor to make the
stages run in parallel. Collapsing stages where needed was also very easy in the
case of threading building blocks: we just had to paste the computation of the
collapsed stages into one single class, and that too without much change in the
design of the application.
5.2.2 Expressibility
5.2.2.1 Threading Building Blocks
In terms of expressibility the TBB library provided us with all the features needed
to implement the intended design of the application. It provided features with which
we could find the bottlenecks in the application and also run those stages in
parallel with great ease, thereby expressing both task and data parallelism. Changes
like the collapsing of stages were also possible.
5.2.2.2 POSIX Thread
Pthread library provided the required flexibility to express the intended design for the
pipeline applications. The bottleneck stages were identified and we were also able
to run these stages in data parallel to make the implementation efficient. It was also
possible to collapse the stages for better load balance between different stages.
This was possible in both threading building blocks and pthread without much hassle.
5.2.3 Performance
Filter Bank application being a computationally intensive application and having no
I/O operations in it, showed no problems during the performance evaluation of the
application. The long chained pipeline application worked perfectly with threading
building blocks giving good speedup. The threading building blocks application was
easily scalable and gave good speedup results even when tested on machines with
different numbers of cores, as shown in Figure 5.4.
Figure 5.4: Performance of the Filter Bank application(TBB) on machines with different
number of cores.
The pthread application was also able to give good speedup. The speedup obtained
can be seen in Figure 5.5. A pthread application does not scale on its own as in the
case of threading building blocks. So to understand how pthread would behave without
any change in the code, the application was run on machines with different numbers of
cores, which gave the results shown in Figure 5.5.
Figure 5.5: Performance of the Filter Bank application(pthread) on machines with dif-
ferent number of cores.
It can be seen that the speedup obtained in the threading building blocks version is
much better than in the pthread version, which can be attributed to the scheduler
that threading building blocks uses and also to the thread abstractions that the
library provides.
5.3 Fast Fourier Transform Kernel
Fast Fourier Transform kernel was the third application that was developed to evaluate
threading building blocks pipeline. This application was particularly taken because of
the non-linear pipeline pattern that was required in the implementation. It is a 5
stage pipeline performing a reasonably large amount of computation at each stage.
5.3.1 Usability
5.3.1.1 Threading Building Blocks
After implementing two applications in threading building blocks, the Fast Fourier
Transform kernel took only a few hours for us to implement. This is because of the
abstractions threading building blocks provided. Since we already had the required
algorithm from the benchmark suite, we just had to put the code in the appropriate
place. We had the application up and working within just a few execution tries and found the
pipeline application development extremely fast and trouble free after implementing
this application.
Just like the earlier application implementations, the only phase that took some
time was to decide the correct and efficient data structure for the tokens. Even though
the pipeline application was a non-linear pipeline, designing the application was not
any different because the non-linear pipeline is implemented as a linear pattern in
threading building blocks. We just had to decide on the correct order of the stages
in the linear pipeline as the non-linear pattern was converted to a linear pattern and
put in the computation in the filter classes to implement the pipeline. From the pro-
gramming point of view it was not any different from implementing a linear pipeline.
So there were no extra usability issues implementing a non-linear pipeline in threading
building blocks in comparison to implementing a linear pipeline.
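For example, if stages A and B of the intended design both feed a combining stage C, the linearised threading building blocks version simply runs them in sequence, with the token accumulating both partial results. The names below are illustrative, not the thesis data structures:

```cpp
static const int N = 128;   // illustrative block size

struct Token {              // carries the partial results of both branches
    float a_part[N];        // filled by stage A
    float b_part[N];        // filled by stage B
    float result[N];        // filled by the combining stage C
};
// Linear filter order: A -> B -> C -> output. A and B no longer run
// concurrently on the same token, but successive tokens still overlap in
// the pipeline, so the steady-state throughput is unchanged; only the
// per-token latency grows.
```

Deciding this linear order of the stages was essentially the whole of the extra design work.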
5.3.1.2 POSIX Thread
The experience gained developing the previous two pthread applications helped in
developing the pthread version of the Fast Fourier Transform kernel. But the Fast
Fourier Transform kernel, having a non-linear pipeline pattern, demanded extra
attention to the design of the application. Because of the non-linear structure of the
pipeline the stages were not all the same, so the stages were represented using
different structures incorporating extra measures for thread synchronisation and access to shared
resources. Designing the right structure was not as simple a task as in threading
building blocks, and a lot of time had to be spent implementing and testing the
correct design. Appropriate checks had to be done at the combining stages to ensure
that data was combined together in the correct order. A lot of issues like these had
to be handled, which made the application development a tougher task compared to
threading building blocks, where these issues did not come up. In pthread every phase
was tougher than in threading building blocks because pthread programming required
attention to a lot of low level details to implement an application.
Implementing a non-linear pipeline had its difficulties in pthread, but if the
application had been implemented as a linear pattern, just as was done for threading
building blocks, we could easily have avoided the troubles we ran into implementing
the non-linear pattern.
5.3.2 Expressibility
5.3.2.1 Threading Building Blocks
The intended design for the application was a non-linear pipeline but then it was not
possible to implement it in threading building blocks because the library does not
support non-linear pipeline patterns. The work around for this is that the non-linear
pipeline has to be converted to a linear pipeline and then implemented with the library.
The expressibility of threading building blocks is flawed if the need is to implement a
non-linear pipeline.
5.3.2.2 POSIX Thread
Pthread library gives you the flexibility of implementing non-linear pipelines. The
Fast Fourier Transform application was developed with the intended design using the
pthread library. One of the good things about the pthread library, precisely because
it makes the programmer work at such a low level of detail, is the flexibility it
gives the programmer to implement the application the way he needs it, with fewer
library related restrictions.
5.3.3 Performance
The Fast Fourier Transform kernel application’s intended design was a non-linear
pipeline implementation and it was important to understand the performance of the
threading building blocks application in comparison to the pthread application because
threading building blocks does not support non-linear pipelines whereas pthread gives
the flexibility to implement them. Even though threading building blocks works around
the problem and implements the non-linear pipeline in a linear pattern, it is
necessary to understand whether threading building blocks still gave the performance
benefits seen in the other applications implemented. The linear pipeline version of
the Fast Fourier Transform application was evaluated for performance, which gave the
results shown in Figure 5.6.
Figure 5.6: Performance of the Fast Fourier Transform Kernel application(TBB) on ma-
chines with different number of cores.
It can be seen that despite the inability of the threading building blocks library to
express non-linear pipelines, it gave really good speedup for the application, which
also scaled well on machines with different numbers of cores. The non-linear pipeline
implemented using the pthread version of the application gave good speedup when run
on machines with different cores but not as much as in threading building blocks. The
results are as shown in Figure 5.7.
It is observed that despite the flexibility that pthread offers to implement
non-linear pipelines, the performance results of the pthread application are not
better than the threading building blocks results. In the 5 stage non-linear pipeline
the first two stages concurrently generate the data required to process a single
token, which results in a single output set. In the threading building blocks version
this was implemented linearly, with stage 2 after stage 1, thereby not concurrently
generating the data needed to process a token. But the pipeline still outputs results
at the last stage at every time interval equal to the execution time of the slowest stage
Figure 5.7: Performance of the Fast Fourier Transform Kernel application(pthread) on
machines with different number of cores.
in the pipeline. And this time interval is the same for both the linear and non-linear
implementation. The only difference that arises is in the latency of the pipeline, that is
the initial start up time for the pipeline before it starts to output data.
Figure 5.8 explains how the latency varies for the linear and non-linear implementa-
tions of the pipeline, assuming that all stages take an equal amount of time to execute.
This small difference in the latency of the pipelines does not make a huge difference
in the performance of the pipeline most of the time, because the pattern is used to
process large amounts of data, which takes a long time relative to the latency
advantage the non-linear pipeline provides. But the priorities can change depending
on the application needs and there are many cases where the latency of the pipeline
is a crucial factor.
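To make this concrete, here is a hedged back-of-the-envelope model consistent with Figure 5.8: assume every stage takes time $t$ and $k$ tokens are processed, so total time is latency plus one interval per remaining token.

```latex
T_{\text{linear}} = \underbrace{5t}_{\text{latency}} + (k-1)\,t,
\qquad
T_{\text{non-linear}} = \underbrace{4t}_{\text{latency}} + (k-1)\,t,
\qquad
T_{\text{linear}} - T_{\text{non-linear}} = t .
```

The gap is a fixed $t$ independent of $k$, which is why it becomes negligible for large inputs.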
5.4 Feature 1: Execution modes of the filters/stages
The selection of the mode of operation of the filters is one of the most powerful
features in the pipeline implementation. The ease with which a programmer can set the
way the stages should work definitely facilitates faster programming.
5.4.1 serial out of order and serial in order Filters
Figure 5.8: Latency difference for linear and non-linear implementation assuming
equal execution times of the pipeline stages.

serial in order stages were used when there was a need for certain operations to be
done only by a single thread and also when the order in which the tokens are
processed is to be maintained. serial out of order stages were used in cases when
certain operations had to be performed serially but the order of processing of the
tokens was not important. However, there were not many cases observed in any of our
implementations using the serial out of order stages where the output tokens came
out of order, even though there were parallel stages in between the
serial out of order stages. This might be because the processing time in the
parallel stages for each and every token was the same.
Running the stages serial out of order rarely gave performance benefits compared to
the serial in order stages, even though serial in order stages introduce some delay
between the processing of tokens to ensure the tokens are processed in the right
order. This might again be because the processing time in the middle parallel stages
for each and every token was the same, so most of the tokens reached the
serial in order stage in order and there was rarely any delay introduced by the
filter to order the token processing.
The Filter Bank pthread application was redesigned for a comparison test against the serial_in_order and serial_out_of_order filters provided by the threading building blocks library. Implementing serial_in_order stages required each such stage to be run by a single thread, and required additional information to be attached to the tokens in flight recording the order in which they were to be processed. This needed considerably more work than the equivalent in the threading building blocks library. Simply ignoring the order of the tokens made a serial stage process them out of order.
5.4.2 Parallel Filters
The overall performance of a pipeline application is determined by the slowest sequential stage in the pipeline. If all the stages in the pipeline are parallel, then performance instead depends on other factors. Making the slow bottleneck stages run in parallel (data parallel) increases the throughput of the pipeline, so introducing both task and data parallelism into the pipeline design helps balance the load and achieve good performance. These performance benefits were easily obtained wherever stages could be run with data parallelism and the order in which tokens were processed was not important, as can be seen in Figure 5.9. Significant performance benefits were obtained with little effort when the application was developed in threading building blocks.
Figure 5.9: Performance of the Filter Bank application (TBB) for different modes of operation of the filters.
A pthread version of the Filter Bank application with data-parallel stages was implemented to see how much the performance of the pthread application would improve by running the bottleneck stages in data-parallel mode, and whether it would give better results than the threading building blocks library. After the bottleneck stages were identified, they were run in parallel with different numbers of threads. The results obtained are shown in Figure 5.10.
These results are significant: the pthread version of the application outperforms the threading building blocks version.
Figure 5.10: Performance of the Filter Bank application with stages running with data parallelism.
The pthread version gives good performance with fewer threads. It should also be noted that only two bottleneck stages were made data parallel in the pthread application, and it still obtained better results than the threading building blocks version. This performance was not matched by the threading building blocks library at the same thread count, even when all of its stages were run in parallel.
5.5 Feature 2: Setting the number of threads to run the
application
The threading building blocks scheduler gives the programmer the option of initialising it with the number of threads he/she wants to run the application with. If the programmer does not have a thread count in mind, the library can be left to decide the number of threads needed to run the application. This feature was tested across all three applications implemented. The initial test varied the number of threads the scheduler was initialised with and observed how the application performed; the experiment was carried out for different values of the maximum number of tokens in flight. The results obtained are shown in Figure 5.11.
The Bitonic sorting network degraded in performance as the number of threads was increased, irrespective of the limit on the number of tokens in flight. This to some extent confirms the assumption made earlier that the stages of the Bitonic sorting network pipeline perform tasks that are not computationally intensive.
Figure 5.11: Performance of the Bitonic Sorting Network varying the number of threads in execution.
Figure 5.12: Performance of Fast Fourier Transform Kernel varying the number of
threads in execution.
In the Fast Fourier Transform kernel it can be noted from Figure 5.12 that increasing the number of threads gave better speedup for the application. Given more threads to work with and more tokens to work on, the performance kept increasing, and it stabilised after a particular thread count was reached.
In the Filter Bank application the speedup increased in proportion to the number of threads, as can be seen in Figure 5.13, and the speedup values stabilised after a particular thread count was reached.
From all three examples it was easy to identify the number of threads that gave the best performance for each application.
Figure 5.13: Performance of Filter Bank varying the number of threads in execution.
This feature gives the programmer an easy and powerful way to tune his/her application for good performance.
Threading building blocks supports automatic initialisation of the scheduler with a suitable number of threads, which makes an application scalable across machines with different numbers of cores. This is very powerful, because the programmer does not need to change the code for each machine the application runs on. Experiments showed that the number of threads initialised in the scheduler was equal to the number of processors in the machine. This supports the claim by the developers of threading building blocks that good performance is achieved with one thread per processor. It can be seen from the graphs, where the tests were run on a 16-core shared-memory machine, that except for the Bitonic sorting network the applications gave good performance when the number of threads was 16. The experiment was also run with different input sizes and on machines with different numbers of cores; the results are shown in Figure 5.14 and Figure 5.15.
It can be noted that the automatic initialisation done by the threading building blocks library works well in most cases, but there are cases where other thread counts gave better results. From this evaluation we conclude that if a threading building blocks application targets a particular machine with a fixed number of cores, it is best to trial-run the application with different thread counts and choose the most suitable value. When designing an application for heterogeneous machines with different numbers of cores, it is better to leave the initialisation of the number of threads to the threading building blocks library, which chooses it based on the machine the application is running on.
Figure 5.14: Performance of the FFT Kernel for different input sizes and numbers of cores of the machine.
Figure 5.15: Performance of the Filter Bank application for different input sizes and numbers of cores of the machine.
5.6 Feature 3: Setting an upper limit on the number of
tokens in flight
Setting the number of tokens in flight defines the amount of parallelism in the pipeline. If the maximum number of tokens in flight is limited to N, there will never be more than N operations in progress in the pipeline at any instant. Setting the right value for the number of tokens in flight is therefore crucial to the performance of the pipeline. Too low a value reduces the amount of parallelism and fails to utilise the full computing power the hardware provides: many threads are spawned to perform the operations in the pipeline, but with so few tokens most of the threads are idle most of the time. Too large a value can also be a problem, because it may cause excessive resource utilisation (for example, of memory).
This feature was useful in many ways during application development. The best example was in the bitonic sorting network, discussed earlier, where one of the designs for a data structure needed a check on the number of tokens in flight to ensure that the application worked correctly. Moreover, any pipeline application needs such a check over the number of tokens in flight for the pipeline to function properly. This threading building blocks feature is therefore advantageous, because the programmer does not have to go to the extra trouble of handling issues like these, unlike in pthread programming.
To further evaluate this feature, it was necessary to understand the performance benefits it easily provides to the programmer. The three applications were run with different token limits for a given thread count; the speedups of the applications were calculated and plotted as in Figure 5.16.
Figure 5.16: Performance of Bitonic Sorting Network varying the limit on the number of tokens in flight.
The Bitonic sorting network is computationally less intensive, so increasing the limit on the number of tokens in flight made little difference to the speedup values.
In the Fast Fourier Transform kernel application it can be seen from Figure 5.17 that the performance of the application increased with the limit on the number of tokens in flight.
Figure 5.17: Performance of Fast Fourier Transform varying the limit on the number of tokens in flight.
For large numbers of threads the speedup values stabilised once the limit reached 10, which we take to be the optimum value for the number of tokens in flight.
Figure 5.18: Performance of Filter Bank varying the limit on the number of tokens in
flight.
In the Filter Bank application something similar to the Fast Fourier Transform kernel can be observed in Figure 5.18: the performance of the application increased in proportion to the number of tokens and stabilised after a particular limit was reached.
This feature was implemented in the corresponding pthread application to see if it gave performance results similar to those obtained with threading building blocks.
Figure 5.19: Performance of pthread applications varying the limit on the number of tokens in flight.
The results are shown in Figure 5.19. The graph levels out after a steep rise in speedup. This might be because the stages here run serially, so the number of threads executing the application is constant; the number of threads processing tokens would need to increase in proportion to the number of tokens in flight to ensure there are enough workers to process them.
5.7 Feature 4: Nested parallelism
The threading building blocks library supports nested parallelism, so this feature was evaluated to understand how it could help pipeline application development. Aiming for the best possible performance, the for loops performing computations on the large arrays and filter-coefficient matrices in the Filter Bank application were made to run in parallel using the parallel_for construct provided by the library. This gave a small performance improvement over the result obtained earlier by the pthread application with all the filter stages running in parallel mode (a speedup of 9.3).
Chapter 6
Guide to the Programmers
6.1 Overview
There are many parallel programming environments now available for developing parallel applications. Choosing the language or library best suited to a particular parallel application is a crucial decision for any programmer, and with so many varieties available it is not an easy task.
This guide offers a helping hand to future programmers developing shared-memory pipeline applications who must choose between conventional POSIX thread programming and Intel Threading Building Blocks. It will help programmers recognise their priorities and make the right choice between the two parallel programming libraries, depending on their experience in parallel programming, the design requirements of their application, its development time, its performance requirements and its scalability.
6.2 Experience of the programmer
Novice programmer: Threading Building Blocks is definitely the better choice. It abstracts all the low-level threading details and guides an inexperienced programmer towards developing efficient, scalable and reliable applications.
Expert programmer: Threading Building Blocks still guides the programmer towards better applications, but it is not a perfect solution to all application needs: it does not provide the flexibility to alter low-level details as pthread does.
6.3 Design of the application
Number of stages in the pipeline
TBB: If the application needs to decide the number of stages in the pipeline dynamically at runtime, it cannot be implemented in threading building blocks, as the library does not support this.
POSIX Thread: Runtime determination of the number of stages in the pipeline is possible in pthread programming.
I/O bound operations
TBB: If the application to be developed has I/O-bound operations, Threading Building Blocks is not the right choice, because of its unfair and non-pre-emptive task scheduler.
POSIX Thread: The library is well suited to I/O-bound operations because of its deterministic scheduling.
Real-Time applications
TBB: Real-time applications cannot be implemented in the Threading Building Blocks library because of its unfair and non-pre-emptive task scheduler.
POSIX Thread: The library is well suited to real-time operations because of its deterministic scheduling policy.
Non-linear Pipeline pattern
TBB: If the application design strictly requires a non-linear pipeline, threading building blocks is not a good choice, as it does not support non-linear pipelines. If, however, the design allows the pipeline to be rearranged into a linear pattern at the cost of a small increase in latency, threading building blocks can still be used for the pipeline application development.
POSIX Thread: The programmer has the flexibility to implement a pipeline of any shape the design requires.
Stages with data parallelism
TBB: It is very easy to make stages run with data parallelism by setting the mode of operation of the filters accordingly.
POSIX Thread: Data parallelism in the stages is not easily achieved; the programmer has to handle all the thread-synchronisation issues and ensure exclusive access to shared resources.
Number of Tokens in Flight
TBB: If the application design requires a check on the number of tokens in flight, threading building blocks is a good help, as it provides a feature for setting an upper limit on the number of tokens in flight.
POSIX Thread: The programmer has to implement this feature if it is needed.
Serial and ordered processing of tokens
TBB: If the design requires it, it is very easy to make all the stages run serially and process tokens in order by setting the mode of operation of the filters accordingly.
POSIX Thread: Serial and ordered processing of tokens in all the stages is not easily achieved; it is left to the programmer to implement the feature if needed.
Nested Parallelism
TBB: The operations performed on a token within a stage can be broken down and run in parallel using parallel constructs such as parallel_for and parallel_while that the library provides.
POSIX Thread: Nested parallelism is difficult to incorporate.
6.4 Performance
TBB: Good performance is easily obtained with the threading building blocks library, because it efficiently abstracts all the threading details needed to achieve it. Data parallelism and nested parallelism, both of which improve performance, are easily added to the pipeline using threading building blocks.
POSIX Thread: Good performance is not easily achieved; the programmer must know all the optimisation strategies needed to improve the performance of the application.
6.5 Scalability
TBB: Thanks to the automatic scheduler-initialisation functionality, the number of threads used to run the application is decided by the TBB library according to the number of processors in the machine on which the application runs. This makes the application scalable without any changes to the code.
POSIX Thread: Automatic scaling to the machine is not provided by the library; the programmer has to write code to implement this, which is a difficult task.
6.6 Load Balancing
TBB: A load-balanced pipeline is easily achieved by collapsing fast stages and running slow bottleneck stages with data parallelism; the work-stealing scheduling policy of the scheduler also contributes.
POSIX Thread: Load balancing is not easily achieved.
6.7 Application Development Time
TBB: If the need is fast development of efficient pipeline applications, threading building blocks will help achieve this through the abstractions it provides.
POSIX Thread: Development time is higher than with threading building blocks, with plenty of time spent testing and debugging the application.
Chapter 7
Future Work and Conclusions
7.1 Overview
In this project we evaluated Intel Threading Building Blocks pipelines. The commercial and academic communities look forward to evaluations of parallel programming languages and libraries, because there are many options to choose from and deciding on the best one is genuinely difficult. Little work has been done on evaluating parallel programming languages and libraries, but now that the world is moving towards parallel programming, such evaluations can greatly help parallel programmers in their decision making.
The evaluation of the threading building blocks library was done in comparison with POSIX threads. The various features the library provides were put under test and evaluated in terms of usability, expressibility and performance. Implementing various pipeline applications also helped evaluate the library properly, giving a much deeper analysis of how it can be useful during pipeline application development. The features threading building blocks provides were re-implemented with the pthread library to understand how the abstractions provided by threading building blocks make the programmer's job easier.
7.2 Conclusions
In general, the project was a success. We were able to evaluate threading building blocks pipelines to a large extent, and a comparative study with the POSIX thread library brought out the pros and cons of each. We went through the entire software development life cycle for different pipeline applications in both programming libraries. In the design phase we found the threading building blocks library unable to express non-linear pipelines, but on the other hand it provided many features, such as limiting the number of tokens in flight, setting the number of threads to run the application and setting the mode of operation of the stages, that reduced the amount of design detail the programmer had to handle. The design phase was much harder for the pthread applications because of the amount of detail the programmer had to understand in order to produce a correct design incorporating all the low-level details.
During the implementation phase of the applications it was found that threading building blocks was much more usable, because the abstractions the library provides left much less scope for programmer error, which made debugging and testing the applications much easier. Pthread, on the other hand, because of the details the programmer had to handle, was prone to many more errors and took much longer to develop than the threading building blocks applications.
The performance analysis of the applications developed in the two programming libraries taught us a great deal about both. From the performance results it was found that threading building blocks is not ideal for I/O-bound tasks or real-time applications. The speedups obtained for the applications developed in threading building blocks were much higher than for the pthread versions, but on further optimising the pthread applications, for example by introducing data-parallel stages, which was not an easy task, we were able to obtain speedups better than those of threading building blocks. With the nested-parallelism feature that threading building blocks supports through parallel constructs such as parallel_for and parallel_while, the performance of threading building blocks then overtook that of the pthread applications again. So although both reached good performance, it was achieved far more easily with threading building blocks than with the pthread applications.
It is important to understand that pthread, though difficult to program, gives the programmer the flexibility to manipulate the lowest-level details. With such deep access, a programmer can optimise programs in application-specific ways and may achieve optimisations far better than what the threading building blocks abstractions provide. Threading Building Blocks, by contrast, offers only a few options for optimising programs, which may not give the best possible results.
In the end we cannot conclude that one library is better than the other for pipeline application development, because each has its pros and cons depending on the experience of the programmer, the design of the application, the required development time and so on. It is up to the programmer to decide, from the evaluation we have presented here and from his/her own needs, whether threading building blocks is best suited to their pipeline application development.
7.3 Future Work
We have evaluated only the pipelines in the threading building blocks library; there are many other features and parallel constructs in the library that could be put under test and analysed, and the combination of all these parallel constructs may be very helpful in developing efficient parallel applications. In this project we have evaluated only the high-level behaviour of the threading building blocks library; further work could evaluate it in much greater depth, taking into account how the library handles low-level details and how the scheduler works. The inability of threading building blocks to handle I/O-bound tasks is a serious issue. Since threading building blocks can work in combination with the pthread library, a detailed study could analyse whether combining the two, letting pthread handle the I/O operations, gives better results. Intel TBB's pipeline can now perform DirectX, OpenGL and I/O parallelisation through the thread_bound_filter feature: in many cases certain types of operation must always be invoked from the same thread, and by using a filter bound to a thread one can guarantee, for example, that the final stage of the pipeline will always use the same thread [14]. This is definitely an area where further detailed low-level evaluation is possible.
Bibliography
[1] StreamIt benchmark suite. http://groups.csail.mit.edu/cag/streamit/shtml/benchmarks.shtml, November 2006.
[2] Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2007.
[3] UC Berkeley. Decimation in time fast Fourier transform. www.cs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html, November 2006.
[4] Erik Barry Erhardt and Daniel Roland Llamocca Obregon. Sieve of Eratosthenes – pthreads implementation. http://statacumen.com/pub/proj/homework/parallel/primes_pthread.c.
[5] G. A. Geist and V. S. Sunderam. Network-based concurrent computing on the PVM system. Concurrency: Pract. Exper., 4(4):293–311, 1992.
[6] Robert B. Grady. Successfully applying software metrics. Computer, 27(9):18–25, 1994.
[7] Robert Reed (Intel). Overlapping IO and processing in a pipeline. http://software.intel.com/en-us/blogs/2007/08/23/overlapping-io-and-processing-in-a-pipeline/.
[8] Simon Peyton Jones. Beautiful concurrency. Technical report, Microsoft Research Cambridge, 2007.
[9] T. A. Marsland, T. Breitkreutz, and S. Sutphen. A network multi-processor for experiments in parallelism. Concurrency: Pract. Exper., 3(3):203–219, 1991.
[10] Thomas J. McCabe. A complexity measure. In ICSE '76: Proceedings of the 2nd International Conference on Software Engineering, page 407, Los Alamitos, CA, USA, 1976. IEEE Computer Society Press.
[11] Dave McCracken. POSIX threads and the Linux kernel. Technical report, IBM Linux Technology Centre, 2002.
[12] Angeles Navarro, Rafael Asenjo, Siham Tabik, and Calin Cascaval. Analytical modeling of pipeline parallelism. In PACT '09: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, pages 281–290, Washington, DC, USA, 2009. IEEE Computer Society.
[13] Intel Software Network. Intel threading building blocks, OpenMP, or native threads? http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/, July 2009.
[14] Intel Software Network. Version 2.2, Intel threading building blocks, worth a look. http://software.intel.com/en-us/blogs/2009/08/04/version-22-intel-threading-building-blocks-worth-a-look/, August 2009.
[15] Manohar Rao, Zary Segall, and Dalibor Vrsalovic. Implementation machine paradigm for parallel programming. In Supercomputing '90: Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 594–603, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[16] Jonathan Schaeffer, Duane Szafron, Greg Lobe, and Ian Parsons. The Enterprise model for developing distributed applications, 1993.
[17] Duane Szafron and Jonathan Schaeffer. An experiment to measure the usability of parallel programming systems. Technical report, Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada T6G 2H1, 1994.
[18] Steven P. VanderWiel, Daphna Nathanson, and David J. Lilja. Complexity and performance in parallel programming languages. In HIPS '97: Proceedings of the 1997 Workshop on High-Level Programming Models and Supportive Environments (HIPS '97), page 3, Washington, DC, USA, 1997. IEEE Computer Society.
[19] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Proc. Int'l Conf. on Compiler Construction (CC), pages 179–196, Grenoble, France, 2002.
[20] Wikipedia. Sieve of Eratosthenes. http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.
[21] Wikipedia. Parallel programming languages. http://en.wikipedia.org/wiki/Parallel_computing#Parallel_programming_languages, November 2006.
[22] Gregory V. Wilson, Jonathan Schaeffer, and Duane Szafron. Enterprise in context: assessing the usability of parallel programming environments. In CASCON '93: Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research, pages 999–1010. IBM Press, 1993.
[22] Gregory V. Wilson, Jonathan Schaeffer, and Duane Szafron. Enterprise in con-text: assessing the usability of parallel programming environments. In CASCON’93: Proceedings of the 1993 conference of the Centre for Advanced Studies onCollaborative research, pages 999–1010. IBM Press, 1993.