
7/29/2019 Evaluating Threading Building Blocks (1/73)

    Evaluating Threading Building Blocks

    Pipelines

    Sunu Antony Joseph

    Master of Science

    Computer Science

    School of Informatics

    University of Edinburgh

    2010


Abstract

Parallel programming is a necessity in the multi-core era. With many parallel programming languages and libraries developed to provide higher levels of abstraction, allowing programmers to focus on algorithms and data structures rather than the complexity of the machines they are working on, it becomes difficult for programmers to choose the programming environment best suited to their application development. Unlike for serial programming languages, there are very few evaluations of parallel languages or libraries that can help programmers make the right choice. In this report we evaluate the Intel Threading Building Blocks library, a C++ library that supports scalable parallel programming. The evaluation focuses on pipeline applications implemented using the filter and pipeline classes provided by the library. Various features of the library that help during pipeline application development are evaluated. Different applications are developed using the library and evaluated in terms of their usability and expressibility. All these evaluations are done in comparison to POSIX thread implementations of the same applications. Performance evaluation of these applications is also done, to understand the benefits Threading Building Blocks has in comparison to the POSIX thread implementations. In the end we provide a guide that will help future programmers decide which programming library is best suited for their pipeline application development, depending on their needs.


Acknowledgements

First, I would like to thank my supervisor, Murray Cole, for his guidance and help throughout this project, and most of all for his invaluable support in the difficult times of the project period. I would also like to thank my family and friends for always standing by me in every choice I make.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Sunu Antony Joseph)


    To my parents and grandparents...


    Table of Contents

    1 Introduction 1

    1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    2 Background 8

    2.1 Parallel Programming Languages . . . . . . . . . . . . . . . . . . . . 8

    2.2 Task and Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.3 Parallel Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.3.1 Pipeline Pattern . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.3.2 Pipeline using both data and task parallelism . . . . . . . . . 11

    2.4 Intel Threading Building Blocks . . . . . . . . . . . . . . . . . . . . 12

    2.4.1 Threading Building Blocks Pipeline . . . . . . . . . . . . . . 13

    2.5 POSIX Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    3 Issues and Methodology 14

    3.1 Execution modes of the filters/stages . . . . . . . . . . . . . . . . . . 14

    3.2 Setting the number of threads to run the application . . . . . . . . . . 15

    3.3 Setting an upper limit on the number of tokens in flight . . . . . . . . 15

    3.4 Nested parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    3.5 Usability, Performance and Expressibility . . . . . . . . . . . . . . . 16

    4 Design and Implementation 18

    4.1 Selection of Pipeline Applications . . . . . . . . . . . . . . . . . . . 18

    4.1.1 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . 19

    4.1.2 Filter bank for multi-rate signal processing . . . . . . . . . . 19

4.1.3 Bitonic sorting network . . . . . . . . . . . . . . . . . . . . . 20


    4.2 Fast Fourier Transform Kernel . . . . . . . . . . . . . . . . . . . . . 20

    4.2.1 Application Design . . . . . . . . . . . . . . . . . . . . . . . 20

    4.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 21

    4.3 Filter bank for multi-rate signal processing . . . . . . . . . . . . . . . 26

    4.3.1 Application Design . . . . . . . . . . . . . . . . . . . . . . . 26

    4.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 26

    4.4 Bitonic sorting network . . . . . . . . . . . . . . . . . . . . . . . . . 28

    4.4.1 Application Design . . . . . . . . . . . . . . . . . . . . . . . 28

    4.4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 29

    5 Evaluation and Results 32

    5.1 Bitonic Sorting Network . . . . . . . . . . . . . . . . . . . . . . . . 32

    5.1.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    5.1.2 Expressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    5.1.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    5.2 Filter bank for multi-rate signal processing . . . . . . . . . . . . . . . 37

    5.2.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    5.2.2 Expressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    5.2.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    5.3 Fast Fourier Transform Kernel . . . . . . . . . . . . . . . . . . . . . 41

5.3.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.3.2 Expressibility . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    5.3.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    5.4 Feature 1: Execution modes of the filters/stages . . . . . . . . . . . . 44

5.4.1 serial_out_of_order and serial_in_order Filters . . . . . . . . . 44

    5.4.2 Parallel Filters . . . . . . . . . . . . . . . . . . . . . . . . . 46

    5.5 Feature 2: Setting the number of threads to run the application . . . . 47

    5.6 Feature 3: Setting an upper limit on the number of tokens in flight . . 50

    5.7 Feature 4: Nested parallelism . . . . . . . . . . . . . . . . . . . . . . 53

    6 Guide to the Programmers 54

    6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    6.2 Experience of the programmer . . . . . . . . . . . . . . . . . . . . . 54

    6.3 Design of the application . . . . . . . . . . . . . . . . . . . . . . . . 55

    6.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    6.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


    6.6 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    6.7 Application Development Time . . . . . . . . . . . . . . . . . . . . . 57

    7 Future Work and Conclusions 58

    7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    Bibliography 61


List of Figures

1.1 Parallel Programming Environments [2] . . . 2

2.1 Data Parallelism . . . 9

2.2 Task Parallelism . . . 9

2.3 Linear Pipeline Pattern . . . 10

2.4 Non-Linear Pipeline Pattern . . . 10

2.5 Flow of tokens through the stages in a pipeline along the timeline. . . . 11

2.6 Using the hybrid approach. Multiple workers working on multiple data in stage 2. . . . 12

4.1 Structure of the Fast Fourier Transform Kernel pipeline. . . . 20

4.2 Structure of the Fast Fourier Transform Kernel pipeline implemented in TBB. . . . 21

5.1 Performance of the Bitonic Sorting application (TBB) on machines with different numbers of cores. . . . 36

5.2 Performance of the Bitonic Sorting application (pthread) on machines with different numbers of cores. . . . 36

5.3 Performance of the Bitonic Sorting application (pthread) with and without locks. . . . 37

5.4 Performance of the Filter Bank application (TBB) on machines with different numbers of cores. . . . 40

5.5 Performance of the Filter Bank application (pthread) on machines with different numbers of cores. . . . 40

5.6 Performance of the Fast Fourier Transform Kernel application (TBB) on machines with different numbers of cores. . . . 43

5.7 Performance of the Fast Fourier Transform Kernel application (pthread) on machines with different numbers of cores. . . . 44


5.8 Latency difference for the linear and non-linear implementations assuming equal execution times of the pipeline stages. . . . 45

5.9 Performance of the Filter Bank application (TBB) for different modes of operation of the filters. . . . 46

5.10 Performance of the Filter Bank application with stages running with data parallelism. . . . 47

5.11 Performance of the Bitonic Sorting Network varying the number of threads in execution. . . . 48

5.12 Performance of the Fast Fourier Transform Kernel varying the number of threads in execution. . . . 48

5.13 Performance of the Filter Bank varying the number of threads in execution. . . . 49

5.14 Performance of the FFT Kernel for different input sizes and numbers of cores. . . . 50

5.15 Performance of the Filter Bank application for different input sizes and numbers of cores. . . . 50

5.16 Performance of the Bitonic Sorting Network varying the limit on the number of tokens in flight. . . . 51

5.17 Performance of the Fast Fourier Transform varying the limit on the number of tokens in flight. . . . 52

5.18 Performance of the Filter Bank varying the limit on the number of tokens in flight. . . . 52

5.19 Performance of the pthread applications varying the limit on the number of tokens in flight. . . . 53


    List of Tables

    4.1 Applications selected for the evaluation. . . . . . . . . . . . . . . . . 19


    Chapter 1

    Introduction

Programmers have become used to serial programming and expect the programs they develop to run faster with each next-generation processor that comes out in the market. But those days are over: the next-generation chips will have more processors, with each individual processor no faster than the previous year's model [8]. The clock frequencies of chips are no longer increasing, making parallelism the only way to improve the speed of the computer. Parallel computer systems are useless without parallel software to utilise their full potential. The idea of converting serial programs to run in parallel can be limiting, since their performance may be much lower than that of the best parallel algorithms. In sequential programming languages like conventional C/C++, it is assumed that the set of instructions is executed sequentially by a single processor, whereas a parallel programming language assumes that simultaneous streams of instructions are fed to multiple processors. Multithreading aims at exploiting the full potential of multi-core processors. The transition to multithreaded applications is inevitable, and leveraging the present multithreading libraries can help developers in threading their applications in many ways.

Parallel programming languages or libraries should help software developers write parallel programs that work reliably and give good performance. Parallel programs execute non-deterministically and asynchronously, with no total order of events, so they are hard to test and debug. The parallel programs developed should also scale as more processors are added to the hardware system. The ultimate challenge for parallel programming languages is to overcome these issues and help developers build software with the same ease as with serial programming languages.

Several parallel programming languages, libraries and environments have been developed to ease the task of writing programs for multiprocessors. Proponents of each approach often point out various language features that are designed to provide the programmer with a simple programming interface. But then why are programmers hesitant to go parallel? For large codes, the cost of parallel software development can easily surpass that of the hardware on which the code is intended to run. As a result, users will often choose a particular multiprocessor platform based not only on absolute performance but also on the ease with which the multiprocessor may be programmed [18]. Many parallel programming environments have been developed in the past two decades. So the next question is: why have these parallel programming languages not been more productive? Why do only a small fraction of programmers write parallel code? This could be because of the hassle of designing, developing, debugging, testing and maintaining large parallel codes.

Figure 1.1: Parallel Programming Environments [2]

One of the most promising techniques for making parallel programming available to general users is the use of parallel programming patterns. Parallel programming patterns give software developers a language to describe the architecture of parallel software. The use of design patterns promotes faster development of structurally correct parallel programs. Design patterns are structured descriptions of high-quality solutions to recurring problems.

An interesting parallel programming framework that provides templates for the common patterns in parallel object-oriented design is Intel Threading Building Blocks (TBB)¹, a C++ library built on the notion of separating logical task patterns from physical threads and delegating task scheduling to the multi-core system [2]. Threading Building Blocks uses templates for common parallel patterns. One such common pattern is the pipeline pattern. Functional pipeline parallelism is a pattern that is very well suited to, and used by, many emerging applications, such as streaming and Recognition, Mining and Synthesis (RMS) workloads [12].

In this report we evaluate the pipeline template provided by the Intel Threading Building Blocks library. We evaluate the pipeline class in terms of its usability, expressibility and performance. The evaluation is done in comparison to the conventional POSIX² thread library. This comparative evaluation is based on many experiments that we conducted with both parallel programming libraries. We implemented various features that the Threading Building Blocks library provides in pthreads, to understand how much easier the Threading Building Blocks library makes the programmer's job during pipeline application development.

The main intention of the project is to provide a guide for future programmers about Intel Threading Building Blocks pipelines, and also to provide a comparative analysis of TBB pipelines with the corresponding pipeline implementations using POSIX threads.

    1.1 Related Work

The growth in commercial and academic interest in parallel systems has seen an increase in the number of parallel programming languages and libraries. The development of usable and efficient parallel programming systems has received much attention from across the computing and research community. Szafron and Schaeffer in their paper [17] evaluate parallel programming systems focusing on three primary issues: performance, applicability and usability. A controlled experiment was conducted in which half of the graduate students in a parallel/distributed computing class solved a problem using the Enterprise parallel programming system [16], while the rest used a parallel programming system consisting of a PVM [5]-like library of message-passing calls (NMP [9]). They perform an objective evaluation during system development that gave them valuable feedback on the programming model, completeness of the programming environment, ease of use and learning curves. They perform two experiments: in the first they measure the ease with which novices can learn the parallel programming systems and produce correct, but not necessarily efficient, programs; in the second they measure the productivity of the system in the hands of an expert. With novices, the primary aim was to measure how quickly they can learn the system and produce correct programs. With experts, the primary aim was to know the time it takes to produce a correct program that achieves a specified level of performance on a given machine. They collected statistics such as the number of lines of code, the number of edits, compiles and program runs, development time and login hours to measure the usability of the parallel programming systems.

¹ www.threadingbuildingblocks.org
² http://standards.ieee.org/regauth/posix/

Another analysis of the usability of the Enterprise parallel programming system was done by Wilson, Schaeffer and Szafron [22]. Before this there was little comparable work on parallel programming systems, other than that of Rao, Segall and Vrsalovic [15], who developed a level of abstraction called the implementation machine. The implementation machine layer provides the user with a set of commonly used parallel programming paradigms. They analyse how the implementation machine template helps the user implement the chosen implementation machine efficiently on the underlying physical machine.

Another notable work is by VanderWiel, Nathanson and Lilja [18], who evaluate the relative ease of use of parallel programming languages. They borrow techniques from the software engineering field to quantify the complexity of three prominent programming models: shared memory, message passing and High Performance Fortran. They use McCabe's cyclomatic program complexity [10] and non-commented source code lines [6] to quantify the relative complexity of several parallel programming languages. They conclude that traditional software complexity metrics are effective indicators of the relative complexity of parallel programming languages or libraries.

    1.2 Goals

The primary goal of the project is to evaluate Intel Threading Building Blocks pipelines and to understand whether the library is best for pipeline application development in terms of its performance, scalability, expressibility and usability. We need to objectively compare parallel programming languages for pipeline applications. This can be done by understanding the needs of pipeline application developers during their application development, and using those needs as the criteria to evaluate Intel Threading Building Blocks pipelines.

The evaluation of the Intel Threading Building Blocks pipeline class is done in comparison to conventional POSIX thread implementations. Many features provided by the library are put to the test to understand how helpful they are to a pipeline application developer. The evaluation includes understanding the usability of Intel Threading Building Blocks, which is one of the factors that attract developers to the library in the complex world of parallel programming. The usability tests have to be done considering a novice parallel programmer, so that at this time, when software developers are making the transition to parallel programming, this information will help them make the right choice during their pipeline application development. Further, we look into the expressibility of the library by understanding how suitable it is for different pipeline applications, which will help future developers understand how the library adapts to the variety of pipeline patterns commonly used in pipeline application development. Finally, we evaluate the library in terms of the scalability and performance of the pipeline applications developed with it. This will help pipeline application developers understand the performance drawbacks or benefits of using Intel TBB for their pipeline application development.

    1.3 Motivation

There have been many usability experiments conducted for serial programming languages but very few for parallel languages or libraries. Such experiments are necessary to help narrow the gap between what parallel programmers want and what current parallel programming systems provide [17]. Many parallel programming languages are developed with the aim of simplifying the task of writing parallel programs that run on multi-core machines, yet there is very little data on the complexity of the different parallel programming languages.

There are many parallel programming languages and libraries, and each may have its own pros and cons for the development of different applications. It is necessary to understand these pros and cons and highlight them to developers, so that they can make the right decision and select the apt language or library for their application development. Since design patterns are solutions to commonly seen problems, categorising the pros and cons of the different programming languages and libraries by design pattern is a good way to provide this information to developers.

Parallel programs are hard to develop, test and debug. Hence, the usability of a parallel programming language plays a very important role in its success. Usability is definitely a factor programmers will look for when developing applications under fast-approaching deadlines, or when a novice programmer is trying to enter the parallel programming world without much grounding in multi-threading concepts. Parallel programming languages that are developed to make the job of parallel programming easier may at times have drawbacks in terms of the expressibility, performance or scalability of the programs developed. Understanding the languages in terms of these factors is really important for the proper evaluation of a parallel language or library.

    1.4 Thesis Outline

The thesis is divided into seven chapters, including this introduction. The remaining chapters are laid out as follows:

* Chapter 2 gives an overview of the basic concepts and terminology that help in understanding the report.

* Chapter 3 discusses the various features of Threading Building Blocks that help during pipeline application development and the methodology used for the evaluation. It also discusses how we evaluate Threading Building Blocks in terms of usability, performance and expressibility in comparison to the pthread library.

* Chapter 4 focuses on the design and implementation of the various applications developed with Threading Building Blocks and the pthread library.

* Chapter 5 discusses the evaluations and the results obtained.

* Chapter 6 is a guide for future programmers that will help them choose between Threading Building Blocks and the pthread library for their pipeline application development, depending on their needs.


    * Chapter 7 draws an overall conclusion and suggests future work.


    Chapter 2

    Background

In this chapter we discuss the basic concepts that help in better understanding the discussions and explanations given in the report. We discuss parallel programming languages, different forms of parallelism, the pipeline design pattern, Intel Threading Building Blocks and POSIX threads.

    2.1 Parallel Programming Languages

Parallel programming languages and libraries have been developed for programming parallel computers. These can generally be divided into classes based on the assumptions they make about the underlying memory architecture: shared memory, distributed memory, or distributed shared memory. Shared memory programming languages communicate by manipulating shared memory variables, whereas distributed memory programming uses message passing. OpenMP and POSIX Threads are two widely used shared memory APIs, and the Message Passing Interface (MPI) is the most widely used message-passing API [21]. Intel Threading Building Blocks is an example of shared memory programming.

    2.2 Task and Data Parallelism

Data parallelism is used when you have a large amount of data and want the same operation performed on every item. This item-by-item processing can be done in parallel if the items have no dependencies on each other. A simple example can be seen in Figure 2.1, where the tasks can be done concurrently because there are no dependencies between them. Another point to note is that data parallelism is limited by the number of data items to be processed.

    Figure 2.1: Data Parallelism

Task parallelism is used when you have multiple operations to perform on the same data. In task parallelism, multiple tasks work on the same data concurrently. A simple example can be seen in Figure 2.2, where multiple tasks perform different, independent operations on the same set of data concurrently.

    Figure 2.2: Task Parallelism

    2.3 Parallel Design Patterns

Parallel software usually does not make full use of the underlying parallel hardware. It is difficult for programmers to write code that exploits the maximum potential of the hardware, and most parallel programming environments do not focus on design issues. So programmers need a guide to help them during application development that would actually enable them to get the maximum parallel performance. Parallel design patterns are expert solutions to commonly occurring problems that achieve this maximum parallel performance. These patterns enable the quick development of reliable parallel applications.

    2.3.1 Pipeline Pattern

The pipeline pattern is a common design pattern used when computations must be performed over many sets of data. The computation to be performed on each set of data can be viewed as several stages of processing carried out in a particular order, so the whole computation can be seen as data flowing through a sequence of stages. A good analogy for this parallel design pattern is a factory assembly line, where each worker is assigned one component of the work and all workers work simultaneously on their assigned tasks. A simple example of a pipeline can be seen in Figure 2.3.

    Figure 2.3: Linear Pipeline Pattern

In the example in Figure 2.3, stage 1, stage 2, stage 3 and stage 4 together form the total computation that has to be performed on each set of data. The input data are fed to stage 1 of the pipeline, where they are processed one after the other and then passed on to the next stage. When the arrangement is a single straight chain (Figure 2.3), it is called a linear pipeline pattern.

    Figure 2.4: Non-Linear Pipeline Pattern

Figure 2.4 is an example of a non-linear pipeline pattern. Here you can see stages with multiple operations happening concurrently. Non-linear pipelines allow feedback


    has frequent interaction between the stages of the pipeline. The data-parallel way of
    doing this is to let a single thread perform the work of the entire pipeline, but to
    let different threads work on different data concurrently. This is an example of
    coarse-grained parallelism, since the interactions between the stages are infrequent.

    In most cases the pipeline stages will not be doing work of the same computational
    complexity: some stages may take much longer to perform their task than others.
    Mixing data parallelism and task parallelism gives a good solution to this problem.
    The computationally intensive stages can be run by multiple threads doing the same
    work concurrently over different sets of data. As seen in Figure 2.6, this introduces
    parallelism within each stage of the pipeline and improves the throughput of the
    computationally intensive stages.

    Figure 2.6: Using the Hybrid approach. Multiple workers working on multiple data in

    stage 2.

    2.4 Intel Threading Building Blocks

    Threading Building Blocks is a library that supports scalable parallel programming
    in standard C++. The library does not need any additional language or compiler
    support and works on any processor or operating system that has a C++ compiler [2].
    Intel Threading Building Blocks implements most of the common iteration patterns
    using templates, so the user does not have to be a threading expert who knows the
    details of synchronisation, cache optimisation or load balancing. The most important
    feature of the library is that you specify only the tasks to be performed and
    nothing about the threads; the library itself maps the tasks onto threads. Threading
    Building Blocks supports nested parallelism, which allows larger parallel components
    to incorporate smaller parallel components within them. Threading Building Blocks
    also allows scalable data parallel programming.

    Threading Building Blocks provides a task scheduler, which is the main engine that
    drives all the templates. The scheduler maps the tasks that you have created onto
    physical threads, following a work-stealing scheduling policy. Tasks mapped onto
    physical threads are non-preemptive: a physical thread works on the task to which
    it is mapped until the task finishes, and it takes up other tasks only while waiting
    on child tasks or, when it has no child tasks, by stealing tasks created by other
    physical threads.

    2.4.1 Threading Building Blocks Pipeline

    In Threading Building Blocks the pipeline pattern is implemented using the pipeline
    and filter classes. A series of filters represents the pipeline structure in
    Threading Building Blocks. These filters can be configured to execute concurrently
    on distinct data packets or to process only a single packet at a time.

    Pipelines in Threading Building Blocks are organised around the notion that the
    pipeline data represents a greater weight of memory-movement cost than the code
    needed to process it. Rather than "do the work, toss it to the next guy", it is more
    as if the workers change places while the work stays in place [7].

    2.5 POSIX Threads

    POSIX threads (pthreads) are an extension of the existing process model to include
    the concept of concurrently running threads. The idea was to take some process
    resources and make multiple instances of them so that they can run concurrently
    within a single process; the instances contain the bare minimum of resources needed
    to execute concurrently [11]. Pthreads give programmers access to low-level details.
    Programmers see this as a powerful option, since they can manipulate low-level
    details to suit the needs of the application they are developing, but the programmer
    then has to handle many design issues while developing the application. The Native
    POSIX Thread Library is the software that allows the Linux kernel to execute POSIX
    thread programs efficiently. Thread-scheduling implementations differ in how threads
    are scheduled to run; the pthread API provides routines to explicitly set
    thread-scheduling policies and priorities, which may override the default mechanisms.


    Chapter 3

    Issues and Methodology

    In this chapter we discuss the different features of the Intel Threading Building
    Blocks library that we intend to evaluate and the methodology we use to evaluate
    them. We also discuss the methodology used to evaluate the library in terms of
    usability, performance and expressibility. The initial phase of the project is to
    understand pipeline application development in TBB and to learn what features TBB
    has to offer, so that we can test those features in our evaluation and analyse
    whether they are really useful during pipeline application development.

    3.1 Execution modes of the filters/stages

    Many key features of the Threading Building Blocks library are worth noting and
    testing. As we discussed in the earlier chapter, Threading Building Blocks
    implements pipelines using the pipeline and the filter class. A filter can be made
    to run parallel, serial in order or serial out of order. When you set a filter
    to run serial in order, the stage runs serially and processes each input in the
    same order as it came into the input filter. Setting a filter to serial out of
    order still processes one token at a time, but not necessarily in the order in
    which the tokens entered the pipeline. A filter can also be set to parallel, by
    which the stage works concurrently in a data-parallel way, performing the same
    operation on different data in parallel. These three filter modes are implemented
    and examined to see whether they provide any favourable results. Pthread
    applications having parallel stages are designed so that a comparative analysis can
    be done. A performance analysis is done by calculating the speedup of the
    application.



    3.2 Setting the number of threads to run the application

    Another feature of Threading Building Blocks is the ability to run the pipeline
    while manually setting the number of threads that run it. We need to test whether
    this facility of manually deciding the number of threads working on the
    implementation is a good option provided by the library. If the user decides not
    to set the number of threads, the library sets the value itself, usually to the
    number of physical threads/processors in the system. This TBB philosophy [2] of
    having one thread for each available concurrent execution unit/processor is put
    under test and checked for efficiency across the different pipeline applications
    developed. Each application is run with different numbers of threads, which helps
    us understand how the thread count affects the performance of the application. We
    also test whether setting the number of threads manually is more beneficial than
    letting the library set it automatically: the results obtained with automatic
    initialisation of the number of threads are compared with the results obtained with
    manual initialisation.

    The pthread applications, run with varying numbers of threads, are checked to see
    whether they can give better performance results than their Threading Building
    Blocks counterparts. Performance is measured in terms of the speedup of the
    applications.

    3.3 Setting an upper limit on the number of tokens in flight

    Threading Building Blocks gives you the ability to set an upper limit on the number
    of tokens in flight in the pipeline. The number of tokens in flight is the number
    of data items running through the pipeline, in other words the number of data sets
    being processed in the pipeline at a particular instant of time. This controls the
    amount of parallelism in the pipeline. In serial in order filter stages this has
    no effect, as the incoming tokens are executed serially in order. But in the case
    of parallel filter stages, multiple tokens can be processed by a stage in parallel,
    so if the number of tokens in the pipeline is not kept in check, a stage may consume
    excessive resources. There may also be cases where the following serial stages
    cannot keep up with the fast parallel stages before them. The pipeline's input
    filter stops pushing in tokens once the number of tokens in flight reaches this
    limit and continues only when the output filter has finished processing elements.
    Each application is run with different token limits, which tells us how the limit
    affects the performance of the application. Performance here is measured in terms
    of the speedup of the application. A similar feature is implemented in the pthread
    applications and analysed.

    3.4 Nested parallelism

    Threading Building Blocks supports nested parallelism, by which you can nest small
    parallel components inside large parallel components. It is different from running
    the stages with data parallelism: with nested parallelism it is possible to run in
    parallel the different processing steps for a single token within a stage. So in
    our pipeline implementations we incorporate nested parallelism by adding parallel
    constructs like parallel_for within the stages of the pipeline and see whether it
    is useful to the overall implementation. This will help us understand how it
    differs from the option of running a stage in parallel by setting the filter to
    run in parallel. A series of performance tests is done to compare concurrently
    running filters with nested parallelism in the stages. The results tell us which
    is more efficient for pipeline application development.

    3.5 Usability, Performance and Expressibility

    The next step was to decide on apt pipeline applications with which the various
    features provided by the Threading Building Blocks library can be evaluated.
    Applications of various types are taken, with varying input sizes and varying
    complexity of the computation to be performed. Applications with different pipeline
    patterns are taken into consideration to understand the expressibility of the
    library in comparison to the pthread version. Implementations of the Linear and
    Non-Linear pipelines need to be compared with the pthread versions in terms of the
    performance of the applications and how easy it is for the user to make
    enhancements or changes to the program without actually changing much in the design
    of the program. Scalability is another factor that needs to be considered: it
    should be understood whether the same program gives proportional performance when
    run with more or fewer processors. Usability of the languages is analysed
    throughout the software development life cycle by putting ourselves in the shoes of
    a parallel software programmer who can be


    Chapter 4

    Design and Implementation

    In this chapter we discuss in detail the design and implementation issues
    encountered during the development of the various applications in both parallel
    programming libraries. Designing an application is a very important phase in the
    software development life cycle, and an easy design phase really helps a programmer
    build applications faster. The design phase in this project helps us understand the
    pros and cons of the abstraction that the Threading Building Blocks library
    provides: by carefully analysing the design effort required of a programmer, given
    the flexibility and the constraints the programming library provides, we can
    evaluate the Threading Building Blocks library. Implementing these designs is a
    challenging task. During the implementation phase the expressibility of the two
    parallel libraries is assessed, along with how the various features provided by the
    library help in implementing the intended design of the application. During this
    phase we primarily understand the usability and expressibility of the Threading
    Building Blocks library in comparison to the pthread library. As neophytes to TBB
    we found it very easy to understand the pipeline and filter classes in the library
    and were quickly able to implement applications with them.

    4.1 Selection of Pipeline Applications

    For evaluating TBB we need apt applications that bring out the pros and cons of
    the library. For this we used the StreamIt1 benchmark suite. The StreamIt [19]
    benchmarks are a collection of streaming applications. These applications are
    developed in the StreamIt language, so they cannot be used directly for our
    purpose; they need to be coded in Threading Building Blocks and pthread.

    1http://groups.csail.mit.edu/cag/streamit/

    Application               Computational Complexity   No. of stages   Pattern
    Bitonic Sorting Network   Low                        4               Linear
    Filter Bank               Average                    6               Linear
    Fast Fourier Transform    High                       5               Non-Linear

    Table 4.1: Applications selected for the evaluation.

    The applications are selected such that they vary in their computational
    complexity, pipeline pattern, number of stages in the pipeline and input size.
    Because in Intel Threading Building Blocks we cannot determine the number of
    stages in the pipeline at runtime, we could not include applications like the
    Sieve of Eratosthenes [20], where the number of stages is determined at runtime;
    this was possible in pthread [4]. This was one of the drawbacks found with Intel
    Threading Building Blocks during this phase of the project. The final set of
    applications selected was the Fast Fourier Transform kernel, the Filter bank for
    multi-rate signal processing and the Bitonic sorting network, as seen in Table 4.1.

    4.1.1 Fast Fourier Transform

    The coarse-grained version of the Fast Fourier Transform kernel was selected. The
    implementation is a non-linear pipeline pattern with 5 stages. The Fast Fourier
    Transform is performed on a set of n points, which is one of the inputs given to
    the program. Another input given to the program is the permuted roots-of-unity
    look-up table, an array of the first n/2 nth roots of unity stored in permuted
    bit-reversal order. The Fast Fourier Transform implementation done here is a
    Decimation In Time Fast Fourier Transform with the input array in correct order
    and the output array in bit-reversed order. Details about the Decimation In Time
    Fast Fourier Transform can be seen at [3]. The only requirement in the
    implementation is that n should be a power of 2 for it to work properly.

    4.1.2 Filter bank for multi-rate signal processing

    An application that creates a filter bank to perform multi-rate signal processing
    was selected. The coefficients for the sets of filters are created in the top-level
    initialisation function and passed down through the initialisation functions to the
    filter objects. On each branch, a delay, filter and down-sample are performed,
    followed by an up-sample, delay and filter [1].

    4.1.3 Bitonic sorting network

    An application that performs bitonic sort was selected from the StreamIt benchmark
    suite. The program implements a high-performance sorting network (by definition of
    a sorting network, the comparison sequence is not data-dependent). It sorts in
    O(n log^2 n) comparisons [1].

    4.2 Fast Fourier Transform Kernel

    4.2.1 Application Design

    The Fast Fourier Transform kernel implementation is a 5-stage pipeline with the
    structure shown in Figure 4.1.

    Figure 4.1: Structure of the Fast Fourier Transform Kernel Pipeline.

    The intended design of the pipeline is as in Figure 4.1. Stage 1 is the input
    signal generator, which generates the set of n points and stores it in two arrays.
    Stage 2 generates the bit-reversal permuted roots-of-unity look-up table, an array
    of the first n/2 nth roots of unity stored in permuted bit-reversal order. Both of
    these stages generate the arrays on the fly rather than reading from a file, so as
    to avoid I/O overhead that might overshadow the performance of the pipeline.
    Stage 3 is the Fast Fourier Transform stage, where the Decimation In Time Fast
    Fourier Transform with the input array in correct order is computed. Stage 4 is
    where the output array in bit-reversed order is created and passed on to the last
    stage of the pipeline, where the output is shown to the user.


        t->W_im = (double *) tbb_allocator<char>().allocate(sizeof(double) * n / 2);
        return t;
      }
      void free()
      {
        tbb_allocator<char>().deallocate((char *) this->A_re, sizeof(double) * n);
        tbb_allocator<char>().deallocate((char *) this->A_im, sizeof(double) * n);
        tbb_allocator<char>().deallocate((char *) this->W_re, sizeof(double) * n / 2);
        tbb_allocator<char>().deallocate((char *) this->W_im, sizeof(double) * n / 2);
        tbb_allocator<char>().deallocate((char *) this, sizeof(dataobj));
      }
    };

    The static function allocate creates an instance of the class, allocating the
    memory required to perform the Fast Fourier Transform on the n input points, and
    returns a pointer to the object created. The function free frees the allocated
    memory when the computation is done at the end of the pipeline and the token is
    destroyed.

    Stage 1 generates the input points at run time, in the same way as the algorithm
    described in the benchmark suite, stores the values in the data structure dataobj
    as arrays A_re and A_im, and passes them on to stage 2. Stage 2 creates the
    roots-of-unity look-up table, which is stored in the W_re and W_im arrays. The
    data structure with arrays A and W is then passed to the Fast Fourier Transform
    stage, where the values are computed, stored in array A itself and passed on to
    the next stage. Stage 4 finds the bit-reversed order of array A and passes it to
    the next stage, which outputs the values. The pipeline made was a linear pipeline
    which implements the same logic as the original algorithm. The computation of each
    stage is written in the overloaded operator()(void*) function of the class
    representing that stage of the pipeline. Each of these classes inherits from the
    filter class.

    The pointer that the overloaded operator()(void*) function returns is the pointer
    to the token to be passed on to the next stage in the pipeline. This imposed the
    restriction that all the components of a single token must be represented as a
    single data structure, so that the token can be passed along the stages of the
    pipeline in Threading Building Blocks.


    4.2.2.2 Pthread

    The pthread implementation has the same pipeline structure as the original
    pipeline taken from the benchmark suite. The implementation has two input stages
    in the pipeline joining at stage 3, with stages 4 and 5 following linearly. The
    overall pipeline structure is defined as shown in Listing 4.2.

    Listing 4.2: Data structure representing a pipe in the Fast Fourier Transform
    Kernel pthread application

    struct pipe_type {
        pthread_mutex_t mutex;  /* Mutex to protect pipe data */
        stage_t *head1;         /* First head */
        stage_t *head2;         /* Second head */
        stage_t *tail;          /* Final stage */
        int stages;             /* Number of stages */
        int active;             /* Active data elements */
    };

    Here the mutex variable is used to obtain a lock over the pipeline information
    variables (stages and active) and protect them during concurrent access. The
    variables head1 and head2 are pointers to the two heads of the pipeline. The
    variable tail is a pointer to the last stage in the pipeline, stages is the count
    of the number of stages and active is the count of the number of tokens active in
    the pipeline.

    In the present pipeline structure, since we have two kinds of stages, the stages
    are represented by two kinds of structures. The structure representing the stages
    that receive input from a single stage and pass the token to a single stage is
    shown in Listing 4.3.

    Listing 4.3: Data structure representing a stage of type 1 in the Fast Fourier
    Transform Kernel application

    struct stage_type {
        pthread_mutex_t mutex;  /* Protect data */
        pthread_cond_t avail;   /* Data available */
        pthread_cond_t ready;   /* Ready for data */
        int data_ready;         /* Data present */
        double *A_re;           /* Data to process */
        double *A_im;           /* Data to process */
        double *W_re;           /* Data to process */
        double *W_im;           /* Data to process */

    This implementation works because both the stages write into different locations in the

    data structure. This structure is used to implement stage 3 in the pipeline in Figure 4.1.

    Stage 1 generates the input points at run time, in the same way as the algorithm
    described in the benchmark suite, and sends pointers to arrays A_re and A_im to
    stage 3. Stage 2 creates the roots-of-unity look-up table in parallel with stage 1,
    stores it in the W_re and W_im arrays and passes the pointers to stage 3. Stage 3,
    on receiving these values, performs the Fast Fourier Transform and sends the result
    to stage 4, where the bit-reversed order of the array is created and passed on to
    the last stage for output.

    The passing of tokens in the pthread application is done using the function in
    Listing 4.5. The thread first acquires the lock needed to write into the buffer of
    the next stage, then waits on the condition variable ready, which tells the thread
    when the next-stage thread is ready to accept new tokens. After copying the values
    of the token into the buffer of the next stage, the thread signals the avail
    condition variable, telling the next-stage thread that a new token is ready to be
    processed. Variations of this function are used to send tokens with different
    contents.

    Listing 4.5: Function to pass a token to the specified pipe stage.

    int pipe_send(stage_t *stage, double *A_re, double *A_im,
                  double *W_re, double *W_im)
    {
        int status;

        status = pthread_mutex_lock(&stage->mutex);
        if (status != 0)
            return status;
        /*
         * If the pipeline stage is processing data, wait for it
         * to be consumed.
         */
        while (stage->data_ready) {
            status = pthread_cond_wait(&stage->ready, &stage->mutex);
            if (status != 0) {
                pthread_mutex_unlock(&stage->mutex);
                return status;
            }
        }

        /*
         * Copying the data to the buffer of the next stage.
         */
        stage->A_re = A_re;
        stage->A_im = A_im;
        stage->W_re = W_re;
        stage->W_im = W_im;
        stage->data_ready = 1;
        status = pthread_cond_signal(&stage->avail);
        if (status != 0) {
            pthread_mutex_unlock(&stage->mutex);
            return status;
        }
        status = pthread_mutex_unlock(&stage->mutex);
        return status;
    }

    4.3 Filter bank for multi-rate signal processing

    4.3.1 Application Design

    The application design is a 6-stage linear pipeline. Stage 1 is the input
    generation stage, which creates an array of signal values. The input signal is
    then convolved with the first filter's coefficient matrix in stage 2. The signal
    is then down-sampled in stage 3 and up-sampled in stage 4. The signal is then
    passed on to the next stage, where it is convolved with the second filter's
    coefficient matrix. In the final stage the values are added into an output array
    until an algorithmically determined number of tokens arrive, and then the values
    are output.

    4.3.2 Implementation

    4.3.2.1 Threading Building Blocks

    The implementation of the pipeline in Threading Building Blocks has the same
    structure as the original intended pipeline. The tokens passed between the stages
    are arrays, which are dynamically allocated using the tbb_allocator at each stage
    in the pipeline. The input signal generated in stage 1 is put in an array and
    passed to stage 2 of the pipeline. Stage 2 performs the convolution of the signal
    with the filter coefficient matrix; the convolution matrix is created during the
    initialisation phase of the program. The convolved values are stored in an array
    and passed on to stage 3, where the signal is down-sampled, and then up-sampled in
    stage 4, each time allocating new arrays to hold the newly processed values before
    passing them on to the next stage. Stage 5 performs the convolution of the signal
    with the second filter coefficient matrix, which is created during the
    initialisation phase of the program. Finally, in stage 6, the signal values are
    added into an array until a predetermined number of tokens arrive, after which the
    values are output. The sending of tokens is done in the same way as in the Fast
    Fourier Transform Kernel.

    4.3.2.2 Pthread

    The pipeline implemented in pthread is the same as the intended pipeline
    structure, having 6 stages and performing the same functions as discussed for the
    Threading Building Blocks version. Since the pattern is a linear pipeline and the
    stages have the same structure, that is, each has a single source of tokens and a
    single recipient of tokens, the structure of the stages is the same and is
    represented by the structure shown in Listing 4.6.

    Listing 4.6: Data structure representing a stage in the Filter bank for multi-rate
    signal processing application

    struct stage_type {
        pthread_mutex_t mutex;    /* Protect data */
        pthread_cond_t avail;     /* Data available */
        pthread_cond_t ready;     /* Ready for data */
        int data_ready;           /* Data present */
        float *data;              /* Data to process */
        pthread_t thread;         /* Thread for stage */
        struct stage_type *next;  /* Next stage */
    };

    Here you have the mutex variable to protect the data in the stage, and the avail
    and ready condition variables to indicate the availability of data for processing
    and the readiness of the stage to accept new data. The structure also has a
    pointer to the data item, a variable for the thread that processes the stage and a
    pointer to the next stage in the pipeline structure.

    The overall pipeline is defined by the structure shown in Listing 4.7, having a
    mutex variable which is used to obtain a lock over the pipeline information
    variables (stages and active). The head and tail pointers point to the first and
    last stages of the pipeline. The variables stages and active maintain the count of
    the number of stages and the number of tokens in the pipeline. The sending of
    tokens to the next stage is done using the function in Listing 4.5, except for the
    difference in the data copied to the buffer of the next stage.

    Listing 4.7: Data structure representing the pipeline in the Filter bank for
    multi-rate signal processing application

    struct pipe_type {
        pthread_mutex_t mutex;  /* Mutex to protect pipe */
        stage_type *head;       /* First stage */
        stage_type *tail;       /* Last stage */
        int stages;             /* Number of stages */
        int active;             /* Active data elements */
    };

    The Filter Bank application was redesigned to implement the stages with data
    parallelism. This included the addition of a shared-memory data structure through
    which all the threads working in a stage can access the tokens to be processed.
    The shared-memory data structure is shown in Listing 4.8. One instance of this
    data structure is shared between all the threads working in a particular stage.
    The functionality of the components is the same as discussed for Listing 4.6.

    Listing 4.8: The shared-memory data structure for the threads working in the same
    stage

    typedef struct shared_data {
        pthread_mutex_t mutex;  /* Protect data */
        pthread_cond_t avail;   /* Data available */
        pthread_cond_t ready;   /* Ready for data */
        int data_ready;         /* Data present */
        float *data;            /* Data to process */
    } shared_mem;

    4.4 Bitonic sorting network

    4.4.1 Application Design

    The bitonic sorting network application taken was a 4-stage pipeline: stage 1 for
    the input generation, stage 2 for the creation of the bitonic sequence from the
    input values, and then stage 3 for the sorting of the bitonic sequence


    Chapter 5. Evaluation and Results 33

    5.1.1 Usability

    5.1.1.1 Threading Building Blocks

    During the initial programming phase the challenging part was to create the right data

    structure that we would use to represent tokens and pass them efficiently across stages.

    We tried many data structures before we settled on one. One of the data structures tried had a large array divided into n buffers of data, each representing a token in the pipeline that had to be sorted. The size of the large array was fixed and could accommodate only a fixed number of tokens. The input filter would

    fill these buffers one after the other and then pass it on to the next stage for processing.

    After all the n buffers were filled, it would start again by filling in the first buffer. This

    implementation worked because we were able to limit the number of tokens in flight using the Threading Building Blocks library and because we could program the stages to run sequentially and process the tokens in a fixed order. By limiting the number of tokens in flight to n, it was ensured that by the time the input filter came to the

    next round of filling up of the large array starting from the first buffer, the first buffer

    in the previous round had already finished processing at the last stage of the pipeline.

    Making the stages run serially and processing the tokens in the same order in which the input filter created them also ensured that buffers that had not finished processing would not be overwritten with new values. This implementation was later changed because we intended to experiment with different numbers of tokens in flight, for which this design was not well suited. Nevertheless, it was worth noting that the features the threading building blocks library provides, such as limiting the number of tokens in flight, running stages sequentially and processing tokens in order, allowed us to implement a design like this with ease.

    As far as the implementation of the computational part of the bitonic sort was concerned, we just had to write the serial C++ code for each stage and place it in the overloaded operator()(void*) function of the class representing that stage. As soon as we had the right data structure for the tokens passed along the pipeline and had implemented the computational task done by each of the stages, we had a correctly working pipeline without much hassle. As parallel programmers we did not have to worry about low level threading concepts like synchronisation, load balancing or cache efficiency.


    To confirm this, a test was done by removing all the thread synchronisation mechanisms from the application and measuring its run-time. Although the application then produced incorrect output values, the run-time revealed whether most of the time was being spent on thread synchronisation. The results obtained are shown in Figure 5.3.

    Figure 5.3: Performance of the Bitonic Sorting application(pthread) with and without

    Locks.

    It can be seen that there is a drastic reduction in the execution time of the application once the locks are removed. It was therefore understood that, the application being only lightly computationally intensive, most of the threads were idle most of the time waiting for locks to be released.

    5.2 Filter bank for multi-rate signal processing

    The Filter Bank application was the second application that was developed. With the experience gained from the Bitonic sorting network application we could start work on the second application immediately, because we had familiarised ourselves with both parallel programming libraries and had a basic idea of how to design and implement a pipeline application in both pthread and threading building blocks. The Filter bank application is more computationally intensive than the bitonic sort application, having to work on large signal arrays and large filter co-efficient matrices. The pipeline had a longer chain, with 6 stages in a linear pipeline structure.


    5.2.1 Usability

    5.2.1.1 Threading Building Blocks

    The development of the bitonic sorting network had made us familiar with threading building blocks, so the Filter Bank application was developed much faster than the bitonic sorting network. We just had to paste the computation for each stage into the operator() function of the appropriate classes. Creating the right data structure for the token movement in the pipeline was the only challenge in the implementation.

    As parallel programmers we wanted to make our pipeline application run faster, and we were easily able to identify the bottleneck stages by using the serial in order, serial out of order and parallel options in the filter classes and measuring their speedup. Using these options we were very easily able to tweak the application for the best performance.

    5.2.1.2 POSIX Thread

    As in the case of threading building blocks, the bitonic sort application implemented in pthread gave us a quick start, because we had already figured out a generic structure for the pipeline. With a few application-dependent changes in the design we were immediately ready to start the implementation. Getting the right design for the application was the toughest part of the bitonic sorting network, and we were able to get it done with moderate ease for the filter bank application. The reuse of the design made development easy for us; this was not the case for threading building blocks, where many of the design issues were abstracted by the library and the only notable challenge was getting the right data structure for the tokens.

    With the bottleneck stages easily identified in the threading building blocks application, we could readily tweak it for performance; in the pthread application, however, we had to measure the single-token execution time in each stage to find the bottleneck stages in the pipeline. This was comparatively a tougher task than what we had to do for the threading building blocks application.

    Having found the bottleneck stages in the pipeline, the next step was to run those stages with data parallelism. Implementing the stages to run in parallel required many changes to the already implemented design of the pthread application. This redesign, though built on the existing design, had many challenges because of the cases where data had to be sent to many recipients and received from many senders. Issues like thread synchronisation and efficiency had to be considered to get the design right. A lot of time had to be spent on redesigning, testing and debugging the application, which was even harder than in the case of sequential stages. In threading building blocks, by contrast, there was no need to redesign the application: we just had to pass the argument parallel to the filter class constructor to make a stage run in parallel. Collapsing stages where needed was also very easy in threading building blocks; we just had to paste the computation of the collapsed stages into one single class, without much change to the design of the application.

    5.2.2 Expressibility

    5.2.2.1 Threading Building Blocks

    In terms of expressibility the TBB library provided all the features needed for the implementation of the intended design of the application. It provided features with which we could find the bottlenecks in the application and run those stages in parallel with great ease, thereby expressing both task and data parallelism. Changes like collapsing stages were also possible.

    5.2.2.2 POSIX Thread

    The pthread library provided the required flexibility to express the intended design for the pipeline applications. The bottleneck stages were identified and we were able to run these stages data parallel to make the implementation efficient. It was also possible to collapse stages for better load balance between the different stages. This was possible in both threading building blocks and pthread without much hassle.

    5.2.3 Performance

    The Filter Bank application, being computationally intensive and having no I/O operations, posed no problems during its performance evaluation. The long-chained pipeline worked perfectly with threading building blocks, giving good speedup. The threading building blocks application was easily scalable and gave good speedup results even when tested on machines with different numbers of cores, as shown in Figure 5.4.


    Figure 5.4: Performance of the Filter Bank application(TBB) on machines with different

    number of cores.

    The pthread application was also able to give good speedup, as can be seen in Figure 5.5. A pthread application does not scale on its own as in the case of threading building blocks, so to understand how pthread would behave without any change in the code, the application was run on machines with different numbers of cores, which gave the results shown in Figure 5.5.

    Figure 5.5: Performance of the Filter Bank application(pthread) on machines with dif-

    ferent number of cores.

    It can be seen that the speedup obtained in the threading building blocks version is much better than in the pthread version, which can be attributed to the scheduler that threading building blocks uses and also to the thread abstractions that the library provides.


    5.3 Fast Fourier Transform Kernel

    The Fast Fourier Transform kernel was the third application developed to evaluate the threading building blocks pipeline. This application was chosen particularly because of the non-linear pipeline pattern required in its implementation. It is a 5 stage pipeline performing a reasonably large amount of computation at each stage.

    5.3.1 Usability

    5.3.1.1 Threading Building Blocks

    After implementing two applications in threading building blocks, the Fast Fourier Transform kernel took us only a few hours to implement. This is because of the abstractions threading building blocks provided. Since we already had the required algorithm from the benchmark suite, we just had to put the code in the appropriate place. We had the application up and working after just a few trial runs, and after implementing this application we found pipeline application development extremely fast and trouble free.

    Just like the earlier implementations, the only phase that took some time was deciding on a correct and efficient data structure for the tokens. Even though the application was a non-linear pipeline, designing it was not any different, because a non-linear pipeline is implemented as a linear pattern in threading building blocks. We just had to decide on the correct order of the stages in the linear pipeline, as the non-linear pattern was converted to a linear pattern, and put the computation in the filter classes to implement the pipeline. From the programming point of view it was no different from implementing a linear pipeline, so there were no extra usability issues in implementing a non-linear pipeline in threading building blocks compared to a linear one.

    5.3.1.2 POSIX Thread

    The experience gained developing the previous two pthread applications helped in developing the pthread version of the Fast Fourier Transform kernel. But the Fast Fourier Transform kernel, having a non-linear pipeline pattern, demanded extra attention to the design of the application. Because of the non-linear structure of the pipeline the stages were not all the same, so the stages were represented using different structures incorporating extra measures for thread synchronisation and access to shared


    resources. Designing the right structure was not as simple a task as in threading building blocks: a lot of time had to be spent implementing and testing the correct design. Appropriate checks had to be done at the combining stages to ensure that data was combined together in the correct order. Many issues like these had to be handled, which made application development a tougher task compared to threading building blocks, where these issues did not come up. In pthread every phase was tougher than in threading building blocks, because pthread programming requires attention to a lot of low level details to implement an application.

    Implementing a non-linear pipeline had its difficulties in pthread, but if the application had been implemented as a linear pattern, just as was done for threading building blocks, we could easily have avoided the troubles we ran into implementing the non-linear pattern.

    5.3.2 Expressibility

    5.3.2.1 Threading Building Blocks

    The intended design for the application was a non-linear pipeline, but it was not possible to implement it in threading building blocks because the library does not support non-linear pipeline patterns. The work-around is to convert the non-linear pipeline into a linear pipeline and then implement that with the library. The expressibility of threading building blocks is therefore limited when the need is to implement a non-linear pipeline.

    5.3.2.2 POSIX Thread

    The pthread library gives you the flexibility to implement non-linear pipelines, and the Fast Fourier Transform application was developed with the intended design using it. One of the good things about the pthread library, precisely because it lets the programmer work at such a low level, is the flexibility it gives the programmer to implement the application the way he needs it, with fewer library-related restrictions.

    5.3.3 Performance

    The Fast Fourier Transform kernel application's intended design was a non-linear pipeline implementation, and it was important to understand the performance of the


    Figure 5.7: Performance of the Fast Fourier Transform Kernel application(pthread) on

    machines with different number of cores.

    in the pipeline. This time interval is the same for both the linear and the non-linear implementation. The only difference that arises is in the latency of the pipeline, that is, the initial start-up time before the pipeline starts to output data.

    Figure 5.8 explains how the latency varies for the linear and non-linear implementations of the pipeline, assuming that all stages take an equal amount of time to execute. This small difference in the latency of the pipelines does not make a significant difference to performance most of the time, because the pattern is used to process large amounts of data, which take a long time to process; this time is very large compared to the latency advantage the non-linear pipeline provides. But the priorities can change depending on the application's needs, and there are many cases where the latency of the pipeline is a crucial factor.

    5.4 Feature 1: Execution modes of the filters/stages

    The selection of the mode of operation of the filters is one of the most powerful features of the pipeline implementation. The ease with which a programmer can set the way the stages work definitely facilitates faster programming.

    5.4.1 serial out of order and serial in order Filters

    serial in order stages were used when a certain operation needed to be done by only a single thread and the order in which the tokens are processed had to be maintained. serial out of order stages were used in cases where a certain


    Figure 5.10: Performance of Filter bank application with stages running with data par-

    allelism.

    application. The pthread version of the application gives good performance with fewer threads. It should also be noted that only the two bottleneck stages were designed to run data parallel in the pthread application, which still obtained better results than the threading building blocks version. This performance was not matched by the threading building blocks library with the same thread count, even when all the stages were run in parallel.

    5.5 Feature 2: Setting the number of threads to run the

    application

    The Threading Building Blocks scheduler gives the programmer the option to initialise it with the number of threads he/she wants to run the application with. If the programmer does not have a thread count in mind, the threading building blocks library can be left to decide the number of threads needed to run the application. This feature was tested across all three applications implemented. The initial test was to vary the number of threads initialised in the scheduler and observe how the application performed. The experiment was carried out for different values of the maximum number of tokens in flight. The results obtained are shown in Figure 5.11.

    The Bitonic sorting network degraded in performance as the number of threads was increased, irrespective of the limit on the number of tokens in flight. This to a certain extent confirms the assumption we made about the stages of the pipeline performing


    Figure 5.11: Performance of Bitonic Sorting Network varying the number of threads in

    execution.

    less computationally intensive tasks in the case of the Bitonic sorting network.

    Figure 5.12: Performance of Fast Fourier Transform Kernel varying the number of

    threads in execution.

    In the Fast Fourier Transform kernel it can be noted from Figure 5.12 that increasing the number of threads gave better speedup for the application. As the application was given more threads to work with and more tokens to work on, the performance kept increasing, and it stabilised after a particular thread count was reached.

    In the Filter Bank application the speedup increased proportionally to the number of threads, as can be seen in Figure 5.13. The speedup values stabilised after reaching a particular thread count.

    From all three examples it was seen that it was easy to identify the value of


    Figure 5.13: Performance of Filter Bank varying the number of threads in execution.

    the number of threads for the best performance of the application. This feature gave

    the programmer a very easy and powerful way to tweak his/her application to get good

    performance.

    Threading building blocks supports automatic initialisation of the scheduler with the number of threads, and this made the application scalable to machines with different numbers of cores. This is very powerful because there was no need for the programmer to make any changes in the code depending on the machine he/she is trying to run the application on. On experimenting it was found that the number of threads initialised in the scheduler was equal to the number of processors in the machine. This was in support of the claim by the developers of threading building