
Limitations on Parallel Processing
COS30003 – Advanced .NET

Swinburne University of Technology

Alex Cummaudo 1744070

Semester 1, 2014

Faculty of Information and Communication Technologies


Abstract

Parallel processing takes advantage of concurrent processing by delegating different portions of work to different threads, with the ideal aim that the task will reach completion faster. However, where does the balance lie between lowering the workload per thread and increasing the number of processing threads? Do limitations arise where an excessive number of threads are used for parallel processing? What range exists for delegating an ideal amount of work to an ideal number of threads for parallel processing? By developing a simple image processing program that makes use of concurrency utilities, parallel computation limitations were discovered: the time reductions gained by using more parallel processing threads are eventually undone by having too many threads processing smaller and smaller pieces of data. A discussion of heuristics for parallel tasks suggests ways in which this issue can be mitigated.


Contents

1 Introduction
  1.1 Background
  1.2 Aims and Goals
  1.3 Research Methodology
    1.3.1 Problem Decomposition Methods
    1.3.2 Metric Selection

2 Method
  2.1 Design Procedure
    2.1.1 Manipulation Execution
    2.1.2 Reassembly Execution
  2.2 Required Materials
    2.2.1 Specifications of Test Machines
    2.2.2 Data Set Selection
  2.3 Metrics Gathering Procedure

3 Results
  3.1 Aggregated Results
  3.2 Speedup Results
  3.3 Efficiency Results

4 Discussion
  4.1 Statistics Evaluation
    4.1.1 Speedup Range
    4.1.2 Thread Count with Superfluous Speedup
    4.1.3 Efficiency
  4.2 Limitations

5 Conclusion

A Source Code

B Raw Results


1 Introduction

1.1 Background

Programs solve problems or complete tasks in a given amount of time. With traditional, non-parallel computation, a large task with a large subset of data is computed on one thread on one CPU. The task will always be subjected to CPU context switches, constantly being interrupted from completing its goal (decreasing liveness) and then resuming when the system scheduler sees fit.

This is not ideal for large portions of data that need to be processed in a timely manner, which is why parallel computing is key to solving the liveness issues posed by using a single thread. As Willmore (2012) suggests, by separating the workload of a task into multiple threads, tasks can now be enormous and need not fit on a single CPU. Where previously the task would take a nonsensical amount of time to complete, parallel computation speeds up this process. Hence, parallel processing reduces the time needed to complete the task, and is also more flexible in how much data can be processed.

How is this achieved? Simply put, “parallel computing is the simultaneous use of multiple compute resources to solve a computational problem” (Barney et al., 2010): splitting the larger problem down into smaller subproblems and processing those subproblems concurrently on different processors. The overall problem is thereby solved faster by applying this 'divide and conquer' approach to a larger task (the smaller tasks should be small enough to process quickly). Whilst there are other approaches to modelling a task for parallel processing, the divide and conquer approach has several advantages, which Silva and Buyya summarise:

“Because the subproblems are independent no communication is necessary between processes working on different subproblems... The application is organized in a sort of virtual tree some of the processes create subtasks and have to combine the results of those to produce an aggregate result” (Silva and Buyya, 1999, p.22).

Such splitting is visualised in Figure 1.1, where it is clear that these advantages are not achievable in a serial environment. However, there are performance differences between hardware concurrency and task switching. Whilst concurrent operations can still be described within a single-core system that utilises task switching (i.e., providing an 'illusion of concurrency'), hardware concurrency genuinely processes more than one task in parallel.


Figure 1.1: The different computation methods of a problem, from Barney et al. (2010): (a) serial computation of a problem; (b) parallel computation of a problem. Note how parallel computation divides the problem into subproblems that are processed concurrently.

Consider Figure 1.2. The problem solved in Figure 1.1b can be done on different threads, either executing in 'true' parallel on different cores (i.e., at the same time) or executing on a single core, where task switching is used to switch between the two threads.


Figure 1.2: Two approaches to concurrency: parallel execution on a dual-core machine versus task switching on a single-core machine (Williams, 2009, p.3).

Nonetheless, motivation for parallel processing is apparent in either a concurrent-hardware environment or a concurrent environment achieved via task switching (or both). As Buyya (2002) has discussed, the demand for concurrent processing has never been higher: serial architectures are reaching their physical limits and cannot fulfil the demanding requirements of process-intensive systems such as computer visualisers, simulators, scientific prediction, distributed databases and so on.

Ultimately, both environments were taken into consideration within this research report, as task switching and multiple-core processing are both commonplace, even in hardware that supports parallel processing. Standard practice in concurrent programming is to write the application so that, when programmed correctly, it is “not affected by whether the concurrency is achieved through task switching or by genuine hardware concurrency” (Williams, 2009, p.4). As such, the same application can be tested on both a multi-core and a single-core machine, another factor which has also been considered.
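As an aside, and not part of the original manipulator, the degree of hardware concurrency available to a .NET program can be queried at runtime; a minimal sketch:

using System;

class HardwareConcurrencySketch
{
    static void Main()
    {
        // Logical processors visible to the runtime: 1 on a single-core
        // machine, more where genuine hardware concurrency is available.
        Console.WriteLine("Hardware threads: " + Environment.ProcessorCount);
    }
}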


1.2 Aims and Goals

In this report, the primary benefit of parallel computation researched was the reduction in time needed for a parallel application to reach its goal (i.e., finish its task) when compared to a non-parallel application. This allowed the different liveness issues posed by altering parallel computation methods to be explored.

Section 1 first defined what parallel computation entails and how it can be utilised in a problem-solving environment. This prompted questions about the disadvantages and limitations of parallel processing, which this report sought to answer:

1. Where does the balance lie between lowering the workload per thread and increasing the number of processing threads?

(a) Does increasing the number of threads and reducing the workload shared between the threads mean that the parallel task is quicker to complete?

(b) What is an ideal range for delegating (partitioning) work to an ideal number of threads for parallel processing?

2. Do limitations arise where an excessive number of threads are used for parallel processing?

(a) At what number of threads do these limitations arise, if any?

(b) Is this consistent across both single- and multi-core hardware environments?

These aims guided the development of a simple image processing application, whereby multiple threads are used to process a number of 'chunks' of the image (i.e., divide and conquer) in parallel. After processing is complete, the processed image is then reassembled.

The time taken for both processing and reassembly could then be recorded as the number of threads varied, and the results could be analysed across various machines with differing hardware execution environments. From this, variance in the performance of the concurrent operations with differing numbers of threads could be correlated, and the ideal ranges determined for both single- and multi-core environments. Limitations between the amount of data to be processed and the number of threads used to process each 'chunk' of the image could be found, which were then assessed against the formulae addressed in Section 1.3.2.


1.3 Research Methodology

1.3.1 Problem Decomposition Methods

As discussed in Section 1, dividing the overall work of a problem into smaller subproblems is imperative to parallel program design. By splitting the larger problem into smaller, discrete subproblems (or chunks), each thread can separately work on a suitable portion of the overall problem. Two major methods exist to systematically delegate portions of a problem in parallel: Domain Decomposition and Functional Decomposition, described by Barney et al. (2010) and Buyya (2002) as follows.

Domain Decomposition With domain decomposition, the focus is on decomposing the data to be computed using the same task (that is, distributing the same work over multiple subsets of the same set of data). Once the relevant portions of the data have been segregated, they can then be processed in parallel.

Segregating data can be done in one of two ways:

1. blocks, where a single thread or processor will work on one portion of the data set once and once only. There is inherently a one-to-one relationship between the processor and the chunk it will process.

2. cycles, where a single thread or processor will work on one portion of the data set but, once completed, will repeat over other portions of the data set. There is inherently a one-to-many relationship between the processor and the chunks it processes. This may be implemented by placing each chunk in a pool of chunks upon which threads execute on a first-come, first-served basis, as Squyres et al. (1996) implemented in their similar research.

For an illustrative perspective, refer to Figure 1.3. Cyclic domain decomposition will reuse threads to continue processing more chunks of the image, whilst block domain decomposition will use each thread once and then dispose of it. Block domain decomposition therefore delegates one thread per chunk, unlike cyclic.

From a 2D perspective, as shown in the figure, each axis can use one of the two methods or neither (as indicated with an asterisk for the respective axis), though cyclic and block methods cannot be mixed across axes (i.e., a BLOCK, CYCLIC relationship between axes, or its inverse, cannot be used on the 2D image). A sketch of the two distributions is given below.
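To make the two distributions concrete, the following minimal sketch (with a hypothetical per-row ProcessRow placeholder; it is not taken from the manipulator's source) contrasts a block assignment of contiguous rows with a cyclic, strided assignment:

using System;
using System.Threading;

class DecompositionSketch
{
    const int Height = 12;   // image height in rows (assumed)
    const int Threads = 3;   // processing threads (assumed)

    static void ProcessRow(int row) { /* per-row work would go here */ }

    static void Main()
    {
        // Block: thread t owns one contiguous band of Height/Threads rows.
        int band = Height / Threads;
        Run(t => { for (int y = t * band; y < (t + 1) * band; y++) ProcessRow(y); });

        // Cyclic: thread t processes rows t, t+Threads, t+2*Threads, ...
        Run(t => { for (int y = t; y < Height; y += Threads) ProcessRow(y); });
    }

    static void Run(Action<int> work)
    {
        var threads = new Thread[Threads];
        for (int t = 0; t < Threads; t++)
        {
            int id = t; // capture loop variable per thread
            threads[t] = new Thread(() => work(id));
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
    }
}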


Figure 1.3: Cyclic versus Block Domain Decomposition on a 2D image perspective (Barney et al., 2010)

Functional Decomposition In the functional decomposition approach, the computational tasks themselves are the focus of parallel processing, not the data manipulated. The problem is decomposed relative to the various tasks while the set of data itself remains intact; the overall work is split into individual tasks, each of which runs on a separate thread.

This approach is quite reasonable where a problem can systematically be split into separate tasks: Barney et al. (2010) reveal applications of its use in modelling and signal processing, where a single source of data can have multiple computations applied to it (e.g., a voice recording passed through various sound filters).

1.3.2 Metric Selection

In the literature of parallel processing, a number of formulae have been proposed to quantify parallel processing metrics. The metrics below have been sourced from Willmore (2012).

Speedup Based on Amdahl's Law (Amdahl, 1967), speedup (Sp) defines how much faster code can run when it is parallelised, compared to its non-parallel equivalent.


Speedup is modelled with the following relation:

    Sp = T1 / Tp    (1)

where:

    p  is the number of processors,
    Tp is the execution time of the problem with p processors, and
    T1 is the execution time with one (single-threaded) processor.

Ideal Speedup Ideal speedup is achieved where Sp = p, that is, where using p processors multiplies the speed by p. This entails a linear relationship where, ideally, every introduced processor increases speedup by one; for example, four processors would ideally run the task four times faster than one.

Efficiency Efficiency (Ep) estimates the relative effective utilisation of each processor in solving the problem, compared to the effort lost in synchronisation and communication between processors. It is modelled by the following relation:

    Ep = Sp / p = T1 / (Tp · p)    (2)

By using these relations, a quantitative metric to measure the effectiveness of increasing the number of processing threads was applied. Both measures were used to deduce a suitable balance between decreasing partition sizes per thread and increasing thread counts.
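As a minimal sketch of how Equations (1) and (2) translate into code (the timing values below are hypothetical placeholders, not recorded results):

using System;

class MetricsSketch
{
    // Speedup: Sp = T1 / Tp
    static double Speedup(double t1, double tp) => t1 / tp;

    // Efficiency: Ep = Sp / p = T1 / (Tp * p)
    static double Efficiency(double t1, double tp, int p) => t1 / (tp * p);

    static void Main()
    {
        double t1 = 40.0;  // hypothetical single-threaded time (s)
        double tp = 9.8;   // hypothetical time with p processors (s)
        int p = 16;

        Console.WriteLine($"Sp = {Speedup(t1, tp):F4}");       // ~4.08: about 4x faster
        Console.WriteLine($"Ep = {Efficiency(t1, tp, p):F4}"); // ~0.26: 26% utilisation
    }
}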


2 Method

2.1 Design Procedure

For the purposes of this report, a simple image manipulator program was developed that utilised both domain and functional decomposition as mechanisms to delegate work to different threads. This program was created in the .NET Framework, and was modelled on research devised by Squyres et al. (1996) with their PIPT library.

After the main thread loaded in the image to be manipulated, a specified number of chunks could be used on that image to delegate processing and reassembly to threads. Block domain decomposition was used for the manipulator, using a nominal number of vertical blocks where the block height divides the image height evenly (i.e., the image height is an integral multiple of the block height), that is:

    p : (1 ≤ y ≤ h) ∧ (h mod y = 0)    (3)

where:

    h is the height of the image in pixels, and
    y is the y'th pixel from the top of the image.

This set includes all suitable chunk heights, which determined a nominal range of threads for processing; it will henceforth be referred to as the nominal thread range. For example, the image used in Figure 2.1 has a height of 8745px. The relation above yields the nominal chunk counts for this image such that:

p ∈ {1, 3, 5, 11, 15, 33, 53, 55, 159, 165, 265, 583, 795, 1749, 2915}.

Thus, this many tests under p are available for this image using the manipulator. Figure 2.1 is a visual indication of the segmented chunks where p = 5; this means that 5 processors were used to process each chunk within that image, visually displayed by the various changes in colour between chunks.

To allow for functional decomposition, two sets of p threads were run: one for image processing and the other for image reassembly (that is, once processing of the image was done, the image was reassembled into a final bitmap concurrently). Both processes were timed as manipulation time and reassembly time. The aggregation of both times defined the total time, which was used as the measurement for Tp.
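A minimal sketch of this two-phase timing and aggregation follows; the phase methods are hypothetical placeholders, and Stopwatch is used here as a higher-resolution alternative to the DateTime.Now calls in Appendix A:

using System;
using System.Diagnostics;

class TimingSketch
{
    static void RunManipulation() { /* p manipulation threads start and join here */ }
    static void RunReassembly()   { /* p reassembly threads start and join here */ }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        RunManipulation();
        TimeSpan manipulationTime = sw.Elapsed;

        sw.Restart();
        RunReassembly();
        TimeSpan reassemblyTime = sw.Elapsed;

        // Tp is the aggregation of both phases.
        TimeSpan totalTime = manipulationTime + reassemblyTime;
        Console.WriteLine($"Tp = {totalTime.TotalMilliseconds} ms");
    }
}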


Figure 2.1: An example of domain decomposition when running the image manipulator. On this image, p = 5.

2.1.1 Manipulation Execution

To process the image manipulation itself, each chunk had a series of operations performed on its red, blue and green pixel values, as well as having the overall brightness of the image normalised to the image's average brightness. This is executed under each thread's main method, NormaliseRange. Refer to the Appendix for the full source code of this method.

The first step in this process was to determine the brightness range within each chunk. This was achieved by scanning through each pixel (after chunks were loaded into their respective arrays) and determining the brightness of each using .NET's GetBrightness method, available in the System.Drawing namespace. Listing 1 shows the code developed for this portion of the manipulation process.

Listing 1: Brightness range determination was the first step in the manipulation execution.

private static BrightnessRange DetermineBrightnessRange(Color[,] pixels)
{
    BrightnessRange retVal;

    double bTotal = 0;
    long pxlCount = 0;

    // Default max and min based on 0,0'th pixel
    retVal.max = pixels[0, 0].GetBrightness();
    retVal.min = pixels[0, 0].GetBrightness();

    // Work out for each pixel
    foreach (Color c in pixels)
    {
        float b = c.GetBrightness();
        bTotal += b;
        pxlCount++;

        // New max?
        if (b > retVal.max)
            retVal.max = b;
        // New min?
        else if (b < retVal.min)
            retVal.min = b;
    }

    // Work out average
    retVal.avg = bTotal / pxlCount;

    return retVal;
}

Figure 2.2: Execution path for the NormaliseRange method.

Once brightness ranges for each chunk were determined, all threads were synchronised at a barrier. The captain thread of this barrier, that is, the last thread to finish determining the brightness range for its designated chunk, uses a similar process to determine the brightness range over all chunks. The other, non-captain threads remain synchronised at the same barrier whilst the captain executes.

Once the captain thread finishes, it arrives at the same barrier where the other threads are waiting. From here, the PerformEffect method is invoked in parallel on each thread, which swaps red, green and blue pixel values and applies the average brightness calculated by the captain thread to each pixel. Red, blue and green values were swapped in a particular order determined by the current chunk number, thereby giving a visual indication on the output image as to where chunks began and ended (see Figure 2.1). This is done by each thread for its respective chunk range.

This process is illustrated in Figure 2.2, where each thread executes the same execution path. Black bars indicate arrivals (synchronisation) at barriers. Timing for manipulation time began before each thread started this method, and ended after all the threads joined back onto the main thread.
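The ConcurrencyUtils.Barrier used here is a custom utility whose source is not reproduced in this report; as a rough illustration of the same captain idea, the standard System.Threading.Barrier can run a post-phase action on exactly one thread once all participants arrive. A minimal sketch (the chunk count and messages are hypothetical):

using System;
using System.Threading;

class BarrierSketch
{
    const int Chunks = 5; // assumed chunk/thread count

    static void Main()
    {
        // The post-phase action plays the 'captain' role: it runs once,
        // after all participants have arrived at the barrier.
        var barrier = new Barrier(Chunks, b =>
            Console.WriteLine("Captain: combining per-chunk brightness ranges"));

        var threads = new Thread[Chunks];
        for (int i = 0; i < Chunks; i++)
        {
            int chunk = i; // capture loop variable per thread
            threads[i] = new Thread(() =>
            {
                Console.WriteLine($"Chunk {chunk}: determining brightness range");
                barrier.SignalAndWait(); // wait for all; captain combines ranges
                Console.WriteLine($"Chunk {chunk}: applying effect");
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
    }
}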


2.1.2 Reassembly Execution

Reassembly execution time began as soon as manipulation execution time stopped. Using the same number of threads (though newly created ones), the reassembly threads simply reassemble the manipulated bitmap by setting each pixel on the final bitmap. Once all pixels have been set, the threads are joined back to the main thread. Timing for reassembly stops at this point.

2.2 Required Materials

2.2.1 Specifications of Test Machines

To observe changes in processing time on varying hardware architectures (chiefly, between single- and multi-core processors), a selection of different hardware was used to run the Image Manipulator. The machines used for testing are listed in Table 2.1.

Table 2.1: Specifications of Test Machines

Machine ID   1                       2                           3
Processor    3.4 GHz Intel Core i7   2.53 GHz Intel Core 2 Duo   1 GHz PowerPC G4
No. Cores    4                       2                           1
Memory       8GB 1600 MHz DDR3       4GB 1067 MHz DDR3           640MB SDRAM
OS           OS X 10.9.3             OS X 10.9.2                 Mac OS X 10.5.1

2.2.2 Data Set Selection

To observe the changes in processing time with different data sets, three images of varying dimensions were used to test the manipulator. They are listed in Table 2.2.

Table 2.2: Data Set Selection Summary

Image 1: 8,989,774 bytes, 5616px × 3744px.
Nominal thread range: p ∈ {1, 2, 3, 4, 6, 8, 9, 12, 13, 16, 18, 24, 26, 32, 36, 39, 48, 52, 72, 78, 96, 104, 117, 144, 156, 208, 234, 288, 312, 416, 468, 624, 936, 1248, 1872}

Image 2: 856,365 bytes, 3000px × 2286px.
Nominal thread range: p ∈ {1, 2, 3, 6, 9, 18, 127, 254, 381, 762, 1143}

Image 3: 145,038 bytes, 640px × 480px.
Nominal thread range: p ∈ {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, 20, 24, 30, 32, 40, 48, 60, 80, 96, 120, 160, 240}

2.3 Metrics Gathering Procedure

The following outlines the steps taken to obtain the statistics required, as listed in Appendix B.

1. Before running, terminate all other processes on the test machine using killall -u <user>. This helps ensure optimal performance when running the Image Normaliser program by preventing other multitasked processes from interrupting Image Manipulator threads while the manipulator is processing.




2. Use an SSH or Telnet connection from a remote terminal to run the Image Manipulator.

3. Determine the nominal values of p for the image.

4. Run the Image Manipulator program for each value in p on the image.

5. Record the output of the manipulation Tp, reassembly Tp and overall total Tp.

6. Calculate Sp and Ep according to the formulae given in Section 1.3.2 for each value of p.
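As a purely hypothetical worked example of step 6: if a run recorded T1 = 120 s and T8 = 20 s, then S8 = 120/20 = 6 and E8 = 6/8 = 0.75; that is, eight threads completed the task six times faster, with each processor effectively utilised 75% of the time.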


3 Results

The following data has been aggregated from the results obtained in Appendix B.

3.1 Aggregated Results

[Plot of speedup Sn against thread count n for Machines 1, 2 and 3.]

Figure 3.1: Large Sized Image Speedup for each test machine

[Plot of efficiency En against thread count n for Machines 1, 2 and 3.]

Figure 3.2: Large Sized Image Efficiency for each test machine


[Plot of speedup Sn against thread count n for Machines 1, 2 and 3.]

Figure 3.3: Medium Sized Image Speedup for each test machine

[Plot of efficiency En against thread count n for Machines 1, 2 and 3.]

Figure 3.4: Medium Sized Image Efficiency for each test machine


[Plot of speedup Sn against thread count n for Machines 1, 2 and 3.]

Figure 3.5: Small Sized Image Speedup for each test machine

[Plot of efficiency En against thread count n for Machines 1, 2 and 3.]

Figure 3.6: Small Sized Image Efficiency for each test machine


Table 3.1: Boundary Speedup Results for Machine 1 (Quad-Core)

Data Set max(Sp) min(Sp)

Large Image S16 = 4.0694 S1872 = 0.7103

Medium Image S9 = 3.3802 S1143 = 0.3895

Small Image S3 = 1.7518 S240 = 0.0934

Table 3.2: Boundary Speedup Results for Machine 2 (Dual-Core)

Data Set max(Sp) min(Sp)

Large Image S26 = 2.0227 S1872 = 0.5072

Medium Image S6 = 1.899 S1143 = 0.3325

Small Image S4 = 1.2765 S240 = 0.1125

Table 3.3: Boundary Speedup Results for Machine 3 (Single-Core)

Data Set max(Sp) min(Sp)

Large Image S52 = 1.129 S1872 = 0.8133

Medium Image S18 = 1.587 S1 = 1

Small Image S6 = 1.2905 S240 = 0.7138


Table 3.4: Boundary Efficiency Results for Machine 1 (Quad-Core)

Data Set max(Ep) min(Ep)

Large Image E1 = 1 E1872 = 0.0003

Medium Image E1 = 1 E1143 = 0.0003

Small Image E1 = 1 E240 = 0.0003

Table 3.5: Boundary Efficiency Results for Machine 2 (Dual-Core)

Data Set max(Ep) min(Ep)

Large Image E1 = 1 E1872 = 0.0002

Medium Image E1 = 1 E1143 = 0.0002

Small Image E1 = 1 E240 = 0.0004

Table 3.6: Boundary Efficiency Results for Machine 3 (Single-Core)

Data Set max(Ep) min(Ep)

Large Image E1 = 1 E1872 = 0.0004

Medium Image E1 = 1 E1143 = 0.0009

Small Image E1 = 1 E240 = 0.0029


3.2 Speedup Results

Speedup for most data sets tended to show a similar pattern on all three test machines; an initial spike in speedup over the first 1–2% of introduced processing threads is clearly displayed throughout Figures 3.1 and 3.3. Figure 3.5 shows some outlying recordings at p ≈ 5 for both the quad- and dual-core test machines, though these outliers level out for p > 6. These outliers for the small image data set also do not appear for the single-core test machine.

There is a consistent commonality after maximum speedup is reached: speedup begins to slowly decline once the first 1–2% of threads have been introduced. Maximum performance is clearly indicated at the ranges outlined in Tables 3.1, 3.2 and 3.3 for the quad-, dual- and single-core test machines, respectively.

For the following data sets, with respect to the quad-, dual- and single-core test machines:

1. Large Data Set Maximum speedup occurs for 16, 26 and 52 threads.

2. Medium Data Set Maximum speedup occurs for 9, 6 and 18 threads.

3. Small Data Set Maximum speedup occurs for 3, 4 and 6 threads.

For all tests, minimum speedup was found to be less than or equal to one, usually occurring at the highest possible value within the nominal thread range (i.e., at the most threads that can be used on the image). There is one exception to this trend: the minimum speedup for the medium image data set on the single-core machine, where minimum speedup occurs when using a single thread.

Speedup seems to double with every additional core amongst the three test machines only for some data sets: while this trend is noticeable within the large data set (doubling from 1.129 to 2.022 to 4.069), it is not consistent for the other data sets. For the medium data set, maximum speedup only increases by about 75% from one test machine to the next. For the small data set, the trend is non-linear, decreasing from a maximum speedup of 1.2905 on the single-core machine to 1.2765 on the dual-core machine; however, small image processing on the quad-core machine does increase to 1.7518 from the dual-core maximum of 1.2765.


3.3 Efficiency Results

All efficiency results follow the same trend: maximum efficiency is reached only for a single thread. Efficiency then worsens as more threads are introduced, and continues to decline towards the maximum nominal thread count, levelling out to near-zero efficiency. This approach to zero is clearly visible within Figures 3.2, 3.4 and 3.6, where efficiency rapidly declines as more threads are introduced for each data set, regardless of the test machine (i.e., number of cores).

4 Discussion

4.1 Statistics Evaluation

4.1.1 Speedup Range

From Tables 3.1 to 3.3, each data set had a maximum speedup at far fewer than the highest number of threads in the nominal thread range. Whilst increasing the number of threads did give rise to a faster processing time, this trend only applied for approximately the first 1–2% of introduced threads; speedup then declined as more threads were introduced. Therefore, while it may be beneficial to use more threads to process data, it is not necessarily the case that introducing more threads and reducing the parallel workload on each of those threads will lead to increased speedup; limitations arise where introducing too many threads makes speedup adverse (see Section 4.1.2).

The minimum speedup was typically found to be at a value less than one, usually at the highest possible thread count in the nominal thread range, indicating that processing eventually slowed to a point where it was faster to process on a single thread. This trend was consistent amongst all three test machines; therefore, increasing the number of threads and thinly spreading the workload is not always ideal for achieving an increase in speedup, and introducing too many threads will in fact cause a decline in speed regardless of the number of cores. Indeed, none of the results conform to ideal speedup, and while there is some improvement, a linear pattern of introducing more threads to gain more speedup cannot be deduced from the results. For instance, the large data set on the quad-core machine reached an overall maximum speedup of 4.0694 at 16 threads, indicating that when 16 threads are used, the program runs only about 4 times as fast as if it were run on a single thread. Under ideal speedup, introducing 16 threads should (theoretically) increase running speed by a factor of 16, whereas results for the quad-core machine show only a quarter of that. Likewise, and perhaps worse, large image processing on the single-core machine achieved a maximum speedup of only 1.129 when using 52 threads. This meant that the program required an extra 51 threads to process the image in parallel, gaining only around 13% in speed. Whether the introduced threads were actually worth that increase in processing speed is questionable for the single-core machine; comparatively speaking, though, the roughly four-fold speedup on the quad-core machine was the best result obtained.

Hence, for each data set, the following heuristics have been deduced to help developers decide the nominal number of threads to use for parallel processing a set of data:

1. 8 MB: 52 threads for a single core; divide by two for each core thereafter.

2. 800 KB: 18 threads for a single core; divide by three for each core thereafter, but limitations arise at 4 or more cores.

3. 145 KB: 6 threads for a single core; divide by one and a half for each core thereafter, but limitations arise at 4 or more cores.

As shown above, limitations begin to arise for smaller data sets, which is consistent with the graphs shown in Section 3. These proposed heuristics are only indications as to what may reach optimal speedup, and may not be applicable to all scenarios. Nonetheless, the data obtained in this report follows the pattern posed by these heuristics, indicating that optimal speedup will occur for larger data sets, where four cores can be used without any limitations at all. On smaller data sets, however, speedup will decline rapidly on a quad-core machine after just three threads. This clearly has adverse side effects on program liveness: the extra threads make the program slower (thereby taking longer to reach completion) rather than faster, its intended purpose. A sketch of how these heuristics might be applied follows.
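One way these heuristics might be encoded is sketched below; the thresholds and division factors are the empirical values above, while the function itself and its rounding behaviour are illustrative assumptions only:

using System;

class HeuristicSketch
{
    // Suggested thread count: single-core baseline divided by a per-core
    // factor for each additional core (values from Section 4.1.1).
    static int SuggestThreads(long dataBytes, int cores)
    {
        int baseline; double factor;
        if (dataBytes >= 8_000_000)    { baseline = 52; factor = 2.0; } // ~8 MB data set
        else if (dataBytes >= 800_000) { baseline = 18; factor = 3.0; } // ~800 KB data set
        else                           { baseline = 6;  factor = 1.5; } // ~145 KB data set

        double threads = baseline;
        for (int c = 1; c < cores; c++)
            threads /= factor;
        return Math.Max(1, (int)Math.Round(threads));
    }

    static void Main() =>
        Console.WriteLine(SuggestThreads(8_989_774, 4)); // 52 halved 3 times = 6.5, prints 6
}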

4.1.2 Thread Count with Superfluous Speedup

On analysis of the data obtained in Section 3 and Appendix B, the data aggregated in Table 4.1 was compiled. This table indicates a clear number of threads at which speedup becomes superfluous: after the rise to max(Sp), the ensuing decline eventually falls back below a value of 1 (i.e., Sp < 1 given p ≠ 1). It is at this number of threads where, if more threads are used, the time taken for processing to finish will actually worsen; that is, using more threads beyond the superfluity thread count will hinder performance rather than improve it by making processing time longer. As discussed in Section 1.3.2, speedup is relative to its base value of 1; any value below this indicates adverse speedup, whereby one thread performs better (i.e., processes faster) than the number of threads that caused the adverse speedup. Therefore, the results indicate that processing time suffers when the number of threads is increased beyond a certain extent. A sketch of extracting this threshold from recorded results follows the table.

Table 4.1: Summary of Limitations to Thread Count for Single-, Dual- and Quad-Core Test Machines where introducing new threads is superfluous (i.e., where Sp < 1 ∧ p ≠ 1).

Data Set       Quad-Core   Dual-Core   Single-Core
Large Image    1400        650         650
Medium Image   375         230         N/A
Small Image    17          15          110
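A minimal sketch of extracting this superfluity threshold from a set of recordings; the (p, Sp) pairs below are hypothetical, not the raw results of Appendix B:

using System;
using System.Collections.Generic;
using System.Linq;

class SuperfluitySketch
{
    // Smallest p > 1 at which speedup has fallen below 1 (Sp < 1, p != 1),
    // or null if the data set never becomes superfluous.
    static int? SuperfluityThreadCount(IEnumerable<(int p, double sp)> results) =>
        results.Where(r => r.p != 1 && r.sp < 1.0)
               .OrderBy(r => r.p)
               .Select(r => (int?)r.p)
               .FirstOrDefault();

    static void Main()
    {
        var results = new[] { (1, 1.0), (16, 4.07), (416, 1.8), (1400, 0.98), (1872, 0.71) };
        Console.WriteLine(SuperfluityThreadCount(results)); // 1400
    }
}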

The small data set suffered the most, even after only a small number of threads were introduced to process the image. On the quad- and dual-core machines especially, the image manipulator ran faster with fewer threads; beyond a thread count of approximately 16, introducing more threads had a superfluous effect. However, the single-core machine made beneficial use of more threads up until 110 threads, a far greater tolerance than the quad- and dual-core machines.

The medium image set shows a significant improvement over the small data set; limitations only arise at 375 and 230 threads (far greater than 17 and 15), thereby indicating that work is processed faster when introducing multiple threads for slightly larger sets of data. In fact, this trend continues for the large data set (as discussed below). For the single-core test machine, there are no limitations for the medium image; while Figure 3.3 shows a gradual decline in speedup approaching 1, the decline never reaches 1, signalling that processing the medium data set on the single-core machine has no limitations (for the nominal thread range tested) at all when using more threads.

The large data set is only limited at a large number of threads; usually there is a general improvement up until an unreasonable number of threads is introduced. For the quad-core machine this is at 1400 threads, whilst the single- and dual-core machines share a limiting thread count of 650 threads, both of which are suitably unreasonable thread counts to use for an image of this size.

Hence, it can be deduced that there is inconsistency between the number of cores the machine has and the amount of data to process. For small sets of data being processed on a processor with more than one core, introducing an excessive number of threads will quickly have an adverse effect, while a single-core processor will utilise these extra threads more (compare 17 and 15 with 110 in Table 4.1).

This discrepancy begins to even out with the medium data set: a single-core machine will not be limited with a medium-sized data set at all, which is in line with the single-core processor's small data set, where most introduced threads are fully utilised. Quad- and dual-core machines will also utilise more of their threads before adverse speedup is met. A similar outcome occurs for the large data set.

Thus, it seems that more cores will utilise more threads when there is a larger data set to work with, and only at a large number of threads will adverse speedup be reached. This contrasts with a smaller data set, where a processor with more cores will quickly reach adverse speedup when only a few threads are being used to process the data. The results therefore indicate that, while there is such an inconsistency between single- and multi-core environments, limitations arise only when using an excessive number of threads for larger data sets in a multi-core environment and, inversely, with only a few threads for smaller data sets in a multi-core environment. Within single-core environments, however, higher thread usage with small and large data sets will lead to adverse speedup instead, while for medium data sets there are no limitations at all.

4.1.3 Efficiency

As indicated in Section 4.1.1, there were substantial shortfalls in the efficiency of each thread. Maximum efficiency was always found at a single-thread count, while minimum efficiency was always found at the maximum possible thread count in the nominal thread range. This indicates that the image manipulator conformed poorly to linear, ideal speedup. Indeed, the figures in Section 3 clearly show a non-linear, negative-exponential decline in efficiency; each introduced thread was not fully utilised and, thus, best efficiency was met only where a single thread was used.


4.2 Limitations

While speedup according to Amdahl's Law is ideal, it is almost never achieved in reality. The literature readily confirms this; as Willmore (2012) states, “the situation is even worse than [when] predicted by Amdahl's Law”. The results of this report seemed to confirm this, most likely caused by a number of adverse side effects, particularly task scheduling issues whereby threads are blocked by the OS scheduler, leading to sub-optimal liveness. Whilst the method factored this into consideration by terminating all non-essential processes so as to not interfere with the results, it is still possible that an unavoidable overhead in context switching caused these adverse results. Future extensions of the study may benefit from running the software on a custom scheduler, whereby it can be ensured that the manipulator has priority over all other concurrent operations occurring on the test machine while the test is running.
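Short of a custom scheduler, one partial mitigation available within .NET itself (not used in the reported experiments) is raising worker-thread priority, which asks the OS scheduler to preempt the manipulator's threads less often; a minimal sketch:

using System;
using System.Threading;

class PrioritySketch
{
    static void Main()
    {
        var worker = new Thread(() => Console.WriteLine("processing chunk"))
        {
            // Ask the OS scheduler to favour this thread over normal-priority
            // work; this reduces, but cannot eliminate, preemption.
            Priority = ThreadPriority.Highest
        };
        worker.Start();
        worker.Join();
    }
}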

A second confounding factor in the results is the choice of operating system. While OS X 10.9 was used on the first two machines, an older version of OS X (10.5) was used on the third. OS X 10.9 comes with enhanced features specifically targeted at improving task scheduling for better energy efficiency (Timer Coalescing), which “groups low-level operations together” and “can dramatically increase the amount of time that the processor spends idling” (Apple, Inc., 2013, pp.8-9). These improvements in OS X 10.9 may have skewed the results relative to the test machine running 10.5, especially in regard to efficiency. These limitations could be reduced in future extensions by running the manipulator on a range of different operating systems, both UNIX and non-UNIX based.

Whilst the C# language and .NET framework implement a robust multi-threading environment, it would be ideal if a range of suitable languages and frameworks were used to eliminate any biases in multi-threaded frameworks. The report would therefore benefit from extending the manipulator program by writing it in other multi-threaded languages (factoring in appropriate safety requirements and concurrency utilities, such as the barrier utility used) to see if a change in language has any effect on the results found and the heuristics proposed by this report. An efficient language such as C would be an ideal test environment for extending the findings; a non-object-oriented language may also improve performance by reducing object overhead. This would help deduce whether a change of language affects the speedup and efficiency observed for the C# implementation of the manipulator.


5 Conclusion

Parallel processing entails partitioning either problem data or tasks onto different threads. This allows each partition on each thread to be processed simultaneously, in order to reduce processing time and improve program liveness. However, the sizes of the partitions and the workload for each partition may vary, and the extent to which limitations arise for these differently sized partitions was investigated.

By using images of different sizes, a nominal number of threads could be deduced from the height of each image via the formula provided in Section 2. This gave a systematic approach to the different numbers and sizes of partitions, as well as the workload to be distributed amongst each partition.

After development of an image manipulator program that processed each image, recordings of the time taken for processing under variously sized partitions were used to deduce speedup and efficiency for the partition sizes used. Testing across multiple hardware environments to factor in different numbers of cores, it was found that larger data sets achieve maximum speedup when using 52 threads on a single core, divided by two thereafter for each core. While the results were consistent for larger data sets, limitations began to arise for smaller data sets, where increasing thread numbers only improved speedup on single-core machines, but not on multi-core machines. The tests also indicated that, in reality, optimal efficiency is hard to reach, and all results indicated that efficiency only worsened as more threads were used.

Lastly, the limitations posed by using excessive numbers of threads were also investigated: small data sets seem to process more slowly when a high number of threads is used, while the inverse occurs for larger data sets. Inconsistencies also arose amongst quad-, dual- and single-core processors. These inconsistencies may warrant further investigation in future studies, which could also benefit from using a wider selection of data and implementation languages. While the heuristics proposed by this report may prove useful, they may not necessarily be applicable to all situations.

Ultimately, there is no 'right' answer that can be applied to all scenarios, since the hardware environment has a significant impact and the appropriate number of threads will vary with the tasks and data used. Fundamentally, simply increasing the number of threads for a parallel task is not sustainable, as limitations certainly arise.


References

G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483–485. ACM, 1967.

Apple, Inc. OS X Mavericks Core Technologies Overview. Online, October 2013. URL https://www.apple.com/media/us/osx/2013/docs/OSX_Mavericks_Core_Technology_Overview.pdf. Cited 23 May 2014.

B. Barney et al. Introduction to parallel computing. Lawrence Livermore National Laboratory, 6(13):10, 2010.

R. Buyya. Introduction to Parallel Computing. Presentation, The University of Melbourne: Grid Computing and Distributed Systems (GRIDS) Lab, August 2002.

A. Heardman. Avenue washlands nature reserve. Online, 2007. URL http://commons.wikimedia.org/wiki/File:477785_85d9ce48-by-Alan-Heardman.jpg. Cited 23 May 2014.

Nemo. Lapageria rosea. Online, 2010. URL http://commons.wikimedia.org/wiki/File:002_Lapageria_rosea_04_ies.jpg. Cited 23 May 2014.

L. M. Silva and R. Buyya. Parallel programming models and paradigms. High Performance Cluster Computing: Architectures and Systems, 2:4–27, 1999.

J. M. Squyres, A. Lumsdaine, B. C. McCandless, and R. L. Stevenson. Parallel and distributed algorithms for high speed image processing. Technical Report TR-96-12, Dept. of Computer Science and Engineering, University of Notre Dame, 1996.

Unknown. Grizzly sitting at tide line. Online, 2012. URL http://www.raincoast.org/?attachment_id=13264. Cited 23 May 2014.

U.S. National Archives and Records Administration. Always fasten safety belt! Online, 1941. URL http://commons.wikimedia.org/wiki/File:%22Always_fasten_safety_belt%22_-_NARA_-_513785.tif. Cited 23 May 2014.


A. Williams. C++ Concurrency In Action: Practical Multithreading. Manning, 2009.

F. Willmore. Introduction to Parallel Computing. Presentation, The University of Texas at Austin: Texas Advanced Computing Center, February 2012.


A Source Code

Listing 2: Source code developed for the Image Manipulator tool.

using System;
using System.IO;
using System.Drawing;
using System.Threading;
using System.Collections.Generic;
using ConcurrencyUtils;
using AlexIO;

namespace ImageNormaliser
{
    class MainClass
    {
        /// <summary>
        /// The barrier for processing
        /// </summary>
        private static ConcurrencyUtils.Barrier _reached;

        /// <summary>
        /// The ranges for each chunk
        /// </summary>
        private static BrightnessRange[] _rangesForChunk;

        /// <summary>
        /// Whether to print CSV
        /// </summary>
        public static bool ToCSV;

        /// <summary>
        /// Exception for one-thread processing
        /// </summary>
        private static bool _isSingleThreaded;

        /// <summary>
        /// The final normalised range
        /// </summary>
        private static BrightnessRange _normalisedRange;

        /// The struct for containing max and min ranges for brightness
        private struct BrightnessRange
        {
            public double min;
            public double max;
            public double avg;
        }

        /// <summary>
        /// The entry point of the program, where the program control starts and ends.
        /// </summary>
        /// <param name="args">The command-line arguments.</param>
        public static void Main (string[] args)


        {
            if (args.Length == 0)
            {
                Console.WriteLine ("Missing filename");
                return;
            }

            try
            {
                string fname = args [0];
                int chunks = args.Length >= 2 ? Convert.ToInt32(args[1]) : 0;
                ToCSV = args.Length == 3 && args [2] == "-f";
                Run (fname, chunks);
            }
            catch (Exception e)
            {
                Console.WriteLine (e.Message);
            }
        }

        /// <summary>
        /// Run execution on specified file.
        /// </summary>
        /// <param name="fname">Fname.</param>
        /// <param name="chunks">Number of chunks to use</param>
        private static void Run(string fname, int chunks)
        {
            bool nomOnly = chunks == -1;

            // Load in bitmap given it exists!
            if (!File.Exists (fname))
                throw new FileLoadException ("Image could not be loaded from file", fname);
            Bitmap img = new Bitmap (fname);

            // Image was loaded successfully!
            UserIO.Log (fname + " loaded");
            Helper.FlushLog();
            UserIO.Log ("Nominal chunk sizes (fits in with image height):");

            string sizeNotif = "";

            // Determine correct chunks to use...
            for (int i = 1; i < img.Height; i++)
                // Divisible by i?
                if (img.Height % i == 0)
                    sizeNotif += i + ", ";

            UserIO.Log(sizeNotif);

            if (!nomOnly)
            {
                // Prompt for a chunk count if none was given on the command line
                if (chunks == 0)
                    chunks = Convert.ToInt32 (UserIO.Prompt ("Enter in chunks to use"));
                Console.WriteLine(RunNormalisation (img, chunks, fname));
            }
        }

/// <summary>/// Runs the normalisation./// </summary>/// <param name="img">Bitmap to run normalisation on.</param>/// <param name="chunks">Chunks to run normalisation with.</param>/// <param name="outFile">Output file name</param>private static string RunNormalisation(Bitmap img, int chunks, string outFile){

if ( chunks < 1 ){

return "Need at least two chunks to work with";}

// Reinitialise barrier and brightness ranges based on chunk count_reached = new ConcurrencyUtils.Barrier (chunks);_rangesForChunk = new BrightnessRange[chunks];_isSingleThreaded = chunks == 1;

// Load in pixels (wrap in try/catch block for overchunk size exception)Color[][,] pixels;try{

pixels = LoadPixels (img, chunks);}catch (ArgumentOutOfRangeException e){

// Cannot process with more chunks than the image height!return e.ParamName;

}

// Loading done!UserIO.Log (string.Format ("Using {0} threads for image normalisation...", chunks));

// Start timing threads to finish...DateTime startProcessing = DateTime.Now;

// Create bunch of new threads that process their thread number'th chunkThread[] processors = Helper.BunchOfNewThreads(chunks,

()=> NormalizeRange(

ref pixels[Helper.CurrentThreadInteger],Helper.CurrentThreadInteger

));

// Start each processorforeach (Thread t in processors)

t.Start ();

// Join each processor before reassemblingforeach (Thread t in processors)

t.Join ();

31

Page 32: HIT1301/HIT2080 Programming 1 on... · 2018-04-04 · Limitations on Parallel Processing COS30003–Advanced .NET Swinburne University of Technology Alex Cummaudo 1744070 Semester

A SOURCE CODECOS30003–Advanced .NET

Alex Cummaudo, 1744070Semester 1, 2014

// Processer threads stopped now... reassembler threads start...DateTime stopProcessing = DateTime.Now, startReassembling = DateTime.Now;

UserIO.Log ("Reassembling output image using " + chunks + " threads...");

// New image based off original image's width and heightBitmap finalImg = new Bitmap (img.Width, img.Height);

Thread[] reassemblers = Helper.BunchOfNewThreads(chunks,()=> ReassembleImage(

pixels,Helper.CurrentThreadInteger,ref finalImg

));

// Start each reassemblerforeach (Thread t in reassemblers)

t.Start ();

// Join each reassemblers to finish upforeach (Thread t in reassemblers)

t.Join ();

DateTime stopReassembling = DateTime.Now;

// Reassembly complete, save the image:finalImg.Save (String.Format("{0}_{1}threads", outFile, chunks)+".png");

TimeSpan processTime = stopProcessing - startProcessing;TimeSpan reassembleTime = stopReassembling - startReassembling;TimeSpan totalTime = processTime + reassembleTime;

string[] statReport = new string[4];statReport [0] = String.Format ("Finished With {0} Threads", chunks);statReport [1] = String.Format ("Processing Time: {0}s", processTime.TotalMilliseconds);statReport [2] = String.Format ("Reassembly Time: {0}s", reassembleTime.TotalMilliseconds);statReport [3] = String.Format ("Total Time: {0}s", totalTime.TotalMinutes);

const int LEN = 40;string retVal = "";if (ToCSV){

retVal += String.Format ("{0},{1},{2},{3}", chunks, processTime.TotalMilliseconds,reassembleTime.TotalMilliseconds, totalTime.TotalMilliseconds);

}else{

retVal += String.Format("+{0}+{1}", UserIO.StringBuff("", LEN, '-'),System.Environment.NewLine);

retVal += String.Format("|{0}{1}|{2}", statReport[0],UserIO.StringBuff(statReport[0].ToUpper(), LEN), System.Environment.NewLine);

32

Page 33: HIT1301/HIT2080 Programming 1 on... · 2018-04-04 · Limitations on Parallel Processing COS30003–Advanced .NET Swinburne University of Technology Alex Cummaudo 1744070 Semester

Alex Cummaudo, 1744070Semester 1, 2014

A SOURCE CODECOS30003–Advanced .NET

retVal += String.Format("+{0}+{1}", UserIO.StringBuff("", LEN, '-'),System.Environment.NewLine);

for (int i = 0; i < statReport.Length; i++)retVal += String.Format("|{0}{1}|{2}", statReport[i], UserIO.StringBuff(statReport[i],

LEN), System.Environment.NewLine);retVal += String.Format("+{0}+{1}", UserIO.StringBuff("", LEN, '-'),

System.Environment.NewLine);}return retVal;

}

/// <summary>/// Reassembles the original image with its normalised pixels./// </summary>/// <returns>The normalised image.</returns>/// <param name="orgImg">Orginal image.</param>/// <param name="pixels">Pixels after normalisation.</param>/// <param name="chunkNo">The chunk number</param>/// <param name="newImg">The new image reassembled output</param>private static void ReassembleImage(Color[][,] pixels, int chunkNo, ref Bitmap newImg){

int numberOfChunks = pixels.Length;int chunkHSz = newImg.Height / numberOfChunks;

// Each thread will set pixels for their chunkfor (int x = 0; x < newImg.Width; x++){

for (int y = 0; y < chunkHSz; y++){

// The chunk pixel is the current x and y for this chunk...Color chunkPx = pixels [chunkNo] [x, y];

// Actual y is the y factored in for the relative chunk// for the final imageint actY = y + (chunkHSz * chunkNo);newImg.SetPixel (x, actY, chunkPx);

}}

}

/// <summary>/// Loads in the pixels from the orginal image/// </summary>/// <returns>The pixels.</returns>/// <param name="img">Unnormalised Image.</param>/// <param name="numberOfChunks">Number of chunks used to process the image.</param>private static Color[][,] LoadPixels(Bitmap img, int numberOfChunks){

if (img.Height < numberOfChunks)throw new ArgumentOutOfRangeException ("Number of chunks exceed the number of vertical

pixels for this image!");

// Work out the number of chunksint chunkHSz = img.Height / numberOfChunks;

33

Page 34: HIT1301/HIT2080 Programming 1 on... · 2018-04-04 · Limitations on Parallel Processing COS30003–Advanced .NET Swinburne University of Technology Alex Cummaudo 1744070 Semester

A SOURCE CODECOS30003–Advanced .NET

Alex Cummaudo, 1744070Semester 1, 2014

// Return value is the number of chunks and the image width:// [ ---> ] chunk1// [ ---> ] chunk2 etc.Color[][,] retVal = new Color[numberOfChunks][,];

// For each chunk of the imagefor (int chunk = 0; chunk < numberOfChunks; chunk++){

// Intialise this chunk rangeretVal [chunk] = new Color[img.Width, chunkHSz];// Load in the width into the chunkfor (int x = 0; x < img.Width; x++){

// Load in the y going downfor (int y = 0; y < chunkHSz; y++){

// Actual y is the y factored in for the relative chunkint actY = y + (chunkHSz * chunk);Color chunkPx = img.GetPixel (x, actY);// Load into this x/chunk'sY the pixel at x, yretVal [chunk] [x, y] = chunkPx;

}}

}return retVal;

}

/// <summary>/// Normalizes the range of pixels provided/// </summary>/// <param name="pixels">Pixels.</param>/// <param name="chunkNo">Chunk no.</param>private static void NormalizeRange(ref Color[,] pixels, int chunkNo){

// Work out range for this chunk_rangesForChunk[chunkNo] = DetermineBrightnessRange (pixels);

Helper.LogThread ("****");// Arrive at barrierif ( _reached.Arrive() ){

Helper.LogThread ("CAPT");

_normalisedRange.max = _rangesForChunk [0].max;_normalisedRange.min = _rangesForChunk [0].min;

double finalAvg = 0;

// Last out? Work out range for all rangesforeach (BrightnessRange range in _rangesForChunk){

// New best max?if (range.max > _normalisedRange.max)

_normalisedRange.max = range.max;// New best min?

34

Page 35: HIT1301/HIT2080 Programming 1 on... · 2018-04-04 · Limitations on Parallel Processing COS30003–Advanced .NET Swinburne University of Technology Alex Cummaudo 1744070 Semester

Alex Cummaudo, 1744070Semester 1, 2014

A SOURCE CODECOS30003–Advanced .NET

if (range.min < _normalisedRange.min)_normalisedRange.min = range.min;

finalAvg += range.avg;}

finalAvg = finalAvg / _rangesForChunk.Length;

_normalisedRange.avg = finalAvg;

// Wait to arrive once more... (ignore for single threaded test)if (!_isSingleThreaded){

Helper.LogThread ("****");_reached.Arrive ();Helper.LogThread ("GO");

}}else{

Helper.LogThread("****");// Not last out? Wait again at barrier..._reached.Arrive ();Helper.LogThread ("GO");

}

// Once both have arrived, then we can perform normalisation on each pixel// in this chunkfor (int x = 0; x < pixels.GetLength (0); x++)

for (int y = 0; y < pixels.GetLength (1); y++)pixels [x, y] = PerformEffect (pixels [x, y],chunkNo, _normalisedRange);

}

/// <summary>/// Performs a swap of RBG pixels to have an effect on the image/// </summary>/// <returns>The new pixel color.</returns>/// <param name="pixel">Pixel.</param>/// <param name="chunkNo">Chunk no.</param>/// <param name="ran">The range of brigheness calculated</param>private static Color PerformEffect(Color pixel, int chunkNo, BrightnessRange ran){

int newA, newR, newG, newB;

newA = (int)(ran.avg * 255);// Different rgb based on divisble chunk noif (chunkNo % 3 == 0){

newR = pixel.R;newG = pixel.B;newB = pixel.G;

}else if (chunkNo % 2 == 0){

newR = pixel.B;

35

Page 36: HIT1301/HIT2080 Programming 1 on... · 2018-04-04 · Limitations on Parallel Processing COS30003–Advanced .NET Swinburne University of Technology Alex Cummaudo 1744070 Semester

A SOURCE CODECOS30003–Advanced .NET

Alex Cummaudo, 1744070Semester 1, 2014

newG = pixel.G;newB = pixel.R;

}else{

newR = pixel.G;newG = pixel.R;newB = pixel.B;

}

// Return a new color based on the new rgbreturn Color.FromArgb (newA, newR, newG, newB);

}

/// <summary>/// Determines the brightness range for the given pixels./// </summary>/// <returns>The brightness range for these pixels.</returns>/// <param name="pixels">Pixels to determine range from.</param>private static BrightnessRange DetermineBrightnessRange(Color[,] pixels){

BrightnessRange retVal;

double bTotal = 0;long pxlCount = 0;

// Default max and min based on 0,0'th pixelretVal.max = pixels [0, 0].GetBrightness ();retVal.min = pixels [0, 0].GetBrightness ();

// Work out for each pixelforeach (Color c in pixels){

float b = c.GetBrightness ();bTotal += b;pxlCount++;// New max?if (b > retVal.max)

retVal.max = b;// New min?else if (b < retVal.min)

retVal.min = b;}

// Work out averageretVal.avg = (double) bTotal / pxlCount;

return retVal;}

}}� �
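For reference, the tool is driven entirely from the command line. A typical invocation is sketched below; the mono runtime and the binary name are assumptions about the build environment, and image.png is a placeholder file name:

    mono ImageNormaliser.exe image.png 8 -f

The first argument names the image file. The optional second argument gives the chunk (thread) count, where 0 prompts interactively and -1 only lists the nominal chunk sizes (the divisors of the image height). The optional -f flag prints the timing results as CSV, presumably the form from which the raw results in Appendix B were gathered.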


B Raw Results

Below are the tables of raw results gathered from each machine after the tests were run on each image. All manipulation, reassembly and total times are given in milliseconds.
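The speedup and efficiency columns follow the standard definitions (stated here for reference; they are consistent with the tabulated values): for $n$ threads with total time $T_n$,

$$S_n = \frac{T_1}{T_n}, \qquad E_n = \frac{S_n}{n}.$$

For example, in Table B.1, $S_2 = 3758.95 / 2122.9 \approx 1.77$ and $E_2 = 1.77 / 2 \approx 0.89$.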

Table B.1: Large Image Processing for Machine 1

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  2,704.52  1,054.43  3,758.95  1  1
2  1,442.59  680.31  2,122.9  1.77  0.89
3  1,098.92  375.12  1,474.04  2.55  0.85
4  1,088.95  392.57  1,481.52  2.54  0.63
6  783.57  307.08  1,090.65  3.45  0.57
8  703.22  271.44  974.66  3.86  0.48
9  711.33  251.3  962.63  3.9  0.43
12  684.7  247.02  931.72  4.03  0.34
13  679  246.93  925.92  4.06  0.31
16  679.52  244.18  923.7  4.07  0.25
18  683.88  253.21  937.09  4.01  0.22
24  688.01  243.43  931.44  4.04  0.17
26  684.26  242.76  927.02  4.05  0.16
32  697.17  239.18  936.34  4.01  0.13
36  695.17  251.7  946.87  3.97  0.11
39  711.04  243.72  954.76  3.94  0.1
48  789.33  253.67  1,043  3.6  7.51 · 10^-2
52  720.77  248.44  969.2  3.88  7.46 · 10^-2
72  734.99  251.31  986.3  3.81  5.29 · 10^-2
78  773.68  248.78  1,022.46  3.68  4.71 · 10^-2
96  885.12  263.98  1,149.1  3.27  3.41 · 10^-2
104  910.22  258.86  1,169.08  3.22  3.09 · 10^-2
117  907.98  252.86  1,160.84  3.24  2.77 · 10^-2
144  978.31  254.85  1,233.15  3.05  2.12 · 10^-2
156  1,086.41  262.77  1,349.18  2.79  1.79 · 10^-2
208  1,188.23  268.17  1,456.4  2.58  1.24 · 10^-2
234  1,257.92  266.33  1,524.25  2.47  1.05 · 10^-2
288  1,354.15  268.62  1,622.77  2.32  8.04 · 10^-3
312  1,469.53  266.92  1,736.45  2.16  6.94 · 10^-3
416  1,630.41  268.31  1,898.72  1.98  4.76 · 10^-3
468  1,757.68  273.76  2,031.43  1.85  3.95 · 10^-3
624  2,138.36  281.19  2,419.55  1.55  2.49 · 10^-3
936  2,620.86  293.38  2,914.24  1.29  1.38 · 10^-3
1,248  3,194.77  318.75  3,513.53  1.07  8.57 · 10^-4
1,872  4,922.5  369.06  5,291.55  0.71  3.79 · 10^-4

Table B.2: Medium Image Processing for Machine 1

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  896.97  337.24  1,234.21  1  1
2  484.16  176.1  660.26  1.87  0.93
3  361.33  149.18  510.5  2.42  0.81
6  278.97  99.26  378.23  3.26  0.54
9  279.54  85.59  365.13  3.38  0.38
18  281.73  84.65  366.38  3.37  0.19
127  515.17  97.33  612.5  2.02  1.59 · 10^-2
254  902  99.12  1,001.12  1.23  4.85 · 10^-3
381  1,172.11  102.39  1,274.5  0.97  2.54 · 10^-3
762  2,065.76  113.81  2,179.57  0.57  7.43 · 10^-4
1,143  3,030.33  137.96  3,168.29  0.39  3.41 · 10^-4

Table B.3: Small Image Processing for Machine 1

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  57.14  15.58  72.72  1  1
2  39.52  8.77  48.3  1.51  0.75
3  34.77  6.74  41.51  1.75  0.58
4  36.73  7.38  44.11  1.65  0.41
5  138.68  6.8  145.48  0.5  1 · 10^-1
6  137.26  6.21  143.47  0.51  8.45 · 10^-2
8  44.48  7.97  52.45  1.39  0.17
10  45.56  8.29  53.85  1.35  0.14
12  54.68  7.25  61.93  1.17  9.78 · 10^-2
15  58.34  8.65  66.99  1.09  7.24 · 10^-2
16  61.61  9.19  70.79  1.03  6.42 · 10^-2
20  73.82  8.47  82.29  0.88  4.42 · 10^-2
24  75.57  7.6  83.17  0.87  3.64 · 10^-2
30  82.93  9.02  91.95  0.79  2.64 · 10^-2
32  93.12  12.12  105.24  0.69  2.16 · 10^-2
40  112.25  10.32  122.57  0.59  1.48 · 10^-2
48  124.04  10.08  134.12  0.54  1.13 · 10^-2
60  165.42  11.14  176.55  0.41  6.86 · 10^-3
80  236.58  11.05  247.63  0.29  3.67 · 10^-3
96  240.4  11.94  252.33  0.29  3 · 10^-3
120  346.14  13.1  359.24  0.2  1.69 · 10^-3
160  508.58  14.47  523.06  0.14  8.69 · 10^-4
240  753.58  24.75  778.34  9.34 · 10^-2  3.89 · 10^-4

Table B.4: Large Image Processing for Machine 2

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  3,290.4  1,288.94  4,579.34  1  1
2  1,823.44  722.63  2,546.07  1.8  0.9
3  1,749.03  746.89  2,495.92  1.83  0.61
4  1,590.46  701.15  2,291.6  2  0.5
6  1,603.74  736.96  2,340.7  1.96  0.33
8  1,615.81  709.84  2,325.65  1.97  0.25
9  1,607.12  668.84  2,275.96  2.01  0.22
12  1,610.27  710.24  2,320.5  1.97  0.16
13  1,619.7  717.03  2,336.72  1.96  0.15
16  1,592.73  710.16  2,302.89  1.99  0.12
18  1,619.06  692.85  2,311.91  1.98  0.11
24  1,662.57  723.73  2,386.31  1.92  8 · 10^-2
26  1,607.22  656.66  2,263.89  2.02  7.78 · 10^-2
32  1,720.65  694.05  2,414.71  1.9  5.93 · 10^-2
36  1,698.79  713.6  2,412.39  1.9  5.27 · 10^-2
39  1,656.6  674.02  2,330.62  1.96  5.04 · 10^-2
48  1,838.53  713.17  2,551.7  1.79  3.74 · 10^-2
52  1,719.72  692.94  2,412.66  1.9  3.65 · 10^-2
72  1,823  725.51  2,548.51  1.8  2.5 · 10^-2
78  1,638.51  670.1  2,308.61  1.98  2.54 · 10^-2
96  1,638.03  711.82  2,349.85  1.95  2.03 · 10^-2
104  1,894.47  735.8  2,630.27  1.74  1.67 · 10^-2
117  1,946.12  736.11  2,682.23  1.71  1.46 · 10^-2
144  2,109.56  756.36  2,865.92  1.6  1.11 · 10^-2
156  1,859.75  1,158.07  3,017.82  1.52  9.73 · 10^-3
208  2,321.64  696.61  3,018.24  1.52  7.29 · 10^-3
234  2,351.36  713.67  3,065.03  1.49  6.38 · 10^-3
288  2,513.11  733.85  3,246.95  1.41  4.9 · 10^-3
312  2,631.24  776.69  3,407.93  1.34  4.31 · 10^-3
416  2,528.87  741.61  3,270.48  1.4  3.37 · 10^-3
468  3,071.59  769  3,840.59  1.19  2.55 · 10^-3
624  3,691.5  772.07  4,463.57  1.03  1.64 · 10^-3
936  4,757.14  842.87  5,600.01  0.82  8.74 · 10^-4
1,248  5,812.44  894.84  6,707.28  0.68  5.47 · 10^-4
1,872  7,945.03  1,081.89  9,026.92  0.51  2.71 · 10^-4

Table B.5: Medium Image Processing for Machine 2

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  1,076.1  398.99  1,475.08  1  1
2  599.65  218.38  818.03  1.8  0.9
3  798.07  217.47  1,015.55  1.45  0.48
6  562.7  214.06  776.77  1.9  0.32
9  555.59  223.8  779.39  1.89  0.21
18  635.08  226.44  861.52  1.71  9.51 · 10^-2
127  920.82  230.76  1,151.58  1.28  1.01 · 10^-2
254  1,349.47  247.2  1,596.67  0.92  3.64 · 10^-3
381  1,908.15  256.09  2,164.24  0.68  1.79 · 10^-3
762  2,782.88  327.22  3,110.1  0.47  6.22 · 10^-4
1,143  4,086.25  348.83  4,435.08  0.33  2.91 · 10^-4

Table B.6: Small Image Processing for Machine 2

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  76.17  18.68  94.84  1  1
2  157.26  11.23  168.49  0.56  0.28
3  161.75  13.98  175.73  0.54  0.18
4  61.58  12.71  74.3  1.28  0.32
5  65.42  13.91  79.33  1.2  0.24
6  68.21  12.19  80.4  1.18  0.2
8  75.69  12.77  88.46  1.07  0.13
10  81.1  12.99  94.09  1.01  0.1
12  80.84  12.51  93.35  1.02  8.47 · 10^-2
15  99.96  14  113.96  0.83  5.55 · 10^-2
16  96.85  13.92  110.77  0.86  5.35 · 10^-2
20  130.21  14.16  144.37  0.66  3.28 · 10^-2
24  114.37  13.99  128.36  0.74  3.08 · 10^-2
30  152.32  21.78  174.1  0.54  1.82 · 10^-2
32  173.22  23.1  196.32  0.48  1.51 · 10^-2
40  143.42  17.5  160.92  0.59  1.47 · 10^-2
48  189.73  15.02  204.75  0.46  9.65 · 10^-3
60  194.72  15.22  209.94  0.45  7.53 · 10^-3
80  284.04  19.96  304  0.31  3.9 · 10^-3
96  297.15  19.65  316.8  0.3  3.12 · 10^-3
120  385.78  27.14  412.92  0.23  1.91 · 10^-3
160  541.64  31.88  573.52  0.17  1.03 · 10^-3
240  765.32  77.51  842.84  0.11  4.69 · 10^-4

Table B.7: Large Image Processing for Machine 3

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  42,607.84  18,715.88  61,323.71  1  1
2  42,311.51  18,232.29  60,543.81  1.01  0.51
3  42,054.46  16,085.16  58,139.62  1.05  0.35
4  42,023.58  15,701.67  57,725.25  1.06  0.27
6  42,275.53  14,964.97  57,240.5  1.07  0.18
8  42,160.51  13,993.19  56,153.69  1.09  0.14
9  42,134.95  13,868.32  56,003.27  1.1  0.12
12  42,138.17  13,570.2  55,708.37  1.1  9.17 · 10^-2
13  42,303.26  13,402.7  55,705.96  1.1  8.47 · 10^-2
16  41,866.57  13,248.97  55,115.54  1.11  6.95 · 10^-2
18  42,252.43  12,877.21  55,129.65  1.11  6.18 · 10^-2
24  42,459.54  12,715.3  55,174.84  1.11  4.63 · 10^-2
26  42,163.37  12,482.02  54,645.39  1.12  4.32 · 10^-2
32  42,178.35  12,386.26  54,564.62  1.12  3.51 · 10^-2
36  41,993.98  12,547.72  54,541.69  1.12  3.12 · 10^-2
39  41,973.75  12,369.94  54,343.69  1.13  2.89 · 10^-2
48  42,148.92  12,179.68  54,328.6  1.13  2.35 · 10^-2
52  42,188.2  12,124.8  54,313  1.13  2.17 · 10^-2
72  42,444.15  12,592.3  55,036.45  1.11  1.55 · 10^-2
78  42,727.21  12,589.01  55,316.22  1.11  1.42 · 10^-2
96  42,867.55  12,311.99  55,179.53  1.11  1.16 · 10^-2
104  42,626.58  12,122.1  54,748.68  1.12  1.08 · 10^-2
117  43,005.86  12,652.94  55,658.79  1.1  9.42 · 10^-3
144  44,333.4  13,829.79  58,163.19  1.05  7.32 · 10^-3
156  43,114.4  12,816.44  55,930.83  1.1  7.03 · 10^-3
208  43,237.68  13,030.68  56,268.36  1.09  5.24 · 10^-3
234  43,380.22  13,284.5  56,664.71  1.08  4.62 · 10^-3
288  43,042.09  13,201.38  56,243.47  1.09  3.79 · 10^-3
312  43,595.87  13,336.19  56,932.06  1.08  3.45 · 10^-3
416  44,197  13,636.69  57,833.69  1.06  2.55 · 10^-3
468  46,790.79  16,383.03  63,173.82  0.97  2.07 · 10^-3
624  45,236.4  14,456.95  59,693.35  1.03  1.65 · 10^-3
936  47,237.28  15,688.78  62,926.06  0.97  1.04 · 10^-3
1,248  51,414.75  16,916.95  68,331.7  0.9  7.19 · 10^-4
1,872  56,314.5  19,081.55  75,396.06  0.81  4.34 · 10^-4

Table B.8: Medium Image Processing for Machine 3

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  23,335.91  4,952.12  28,288.03  1  1
2  13,772.54  4,456.25  18,228.79  1.55  0.78
3  14,312.06  4,254.86  18,566.91  1.52  0.51
6  13,809.74  4,177.48  17,987.22  1.57  0.26
9  14,107.85  4,100.17  18,208.02  1.55  0.17
18  13,867.8  3,956.69  17,824.49  1.59  8.82 · 10^-2
127  14,287.05  4,092.56  18,379.61  1.54  1.21 · 10^-2
254  14,534.1  4,349.48  18,883.58  1.5  5.9 · 10^-3
381  15,181.09  4,569.2  19,750.29  1.43  3.76 · 10^-3
762  16,278.98  5,576.7  21,855.68  1.29  1.7 · 10^-3
1,143  19,168.56  5,974.48  25,143.04  1.13  9.84 · 10^-4

Table B.9: Small Image Processing for Machine 3

threads  manipulation time  reassembly time  total time  speedup  efficiency
1  1,218.77  236.89  1,455.66  1  1
2  943.71  193.35  1,137.06  1.28  0.64
3  976.96  204.32  1,181.28  1.23  0.41
4  997.4  192.35  1,189.76  1.22  0.31
5  986.38  195.29  1,181.67  1.23  0.25
6  913.49  214.44  1,127.92  1.29  0.22
8  982.53  195.48  1,178.02  1.24  0.15
10  950.87  200.74  1,151.61  1.26  0.13
12  1,191.52  200.62  1,392.14  1.05  8.71 · 10^-2
15  981.51  203.93  1,185.43  1.23  8.19 · 10^-2
16  1,007.33  215.7  1,223.03  1.19  7.44 · 10^-2
20  1,006.46  211.84  1,218.3  1.19  5.97 · 10^-2
24  940.86  222.96  1,163.82  1.25  5.21 · 10^-2
30  1,009.95  230.31  1,240.25  1.17  3.91 · 10^-2
32  1,015.72  221.2  1,236.92  1.18  3.68 · 10^-2
40  1,059.72  230.74  1,290.46  1.13  2.82 · 10^-2
48  1,041.83  246.87  1,288.71  1.13  2.35 · 10^-2
60  1,055.22  249.13  1,304.35  1.12  1.86 · 10^-2
80  1,131.15  302.63  1,433.78  1.02  1.27 · 10^-2
96  994.39  278.5  1,272.89  1.14  1.19 · 10^-2
120  1,292.29  515.39  1,807.67  0.81  6.71 · 10^-3
160  1,520.99  393.99  1,914.98  0.76  4.75 · 10^-3
240  1,622.21  416.95  2,039.16  0.71  2.97 · 10^-3
