
CHAPTER 2

The Promise and Challenges of Concurrency

Bryon Moyer
Technology Writer and Editor, EE Journal

Chapter Outline
Concurrency fundamentals
Two kinds of concurrency
    Data parallelism
    Functional parallelism
Dependencies
    Producers and consumers of data
    Loops and dependencies
Shared resources
Summary

The opportunities and challenges that arise from multicore technology, or any kind of multiple-processor arrangement, are rooted in the

concept of concurrency. You can loosely conceive of this as “more than

one thing happening at a time”. But when things happen simultaneously,

it’s very easy for chaos to ensue. If you create an “assembly line” to

make burgers quickly in a fast food joint, with one guy putting the patty

on the bun and the next guy adding a dab of mustard, things will get

messy if the mustard guy doesn’t wait for a burger to be in place before

applying the mustard. Coordination is key, and yet, as obvious as this

may sound, it can be extremely challenging in a complex piece of

software.

The purpose of this chapter is to address concurrency and its associated

challenges at a high level. Specific solutions to the problems will be

covered in later chapters.

Real World Multicore Embedded Systems.

DOI: http://dx.doi.org/10.1016/B978-0-12-416018-7.00002-X

© 2013 Elsevier Inc. All rights reserved.


Concurrency fundamentals

It is first important to separate the notion of inherent concurrency and

implemented parallelization. A given algorithm or process may be full of

opportunities for things to run independently from each other. An actual

implementation will typically select from these opportunities a specific

parallel implementation and go forward with that.

For example, in our burger-making example, you could make burgers

more quickly if you had multiple assembly lines going at the same time.

In theory, given an infinite supply of materials, you could make

infinitely many burgers concurrently. However, in reality, you only have

a limited number of employees and countertops on which to do the work.

So you may actually implement, say, two lines even though the process

inherently could allow more. In a similar fashion, the number of

processors and other resources drives the decision on how much

parallelism to implement.

It’s critical to note, however, that a chosen implementation relies on the

inherent opportunities afforded by the algorithm itself. No amount of

parallelization will help an algorithm that has little inherent concurrency,

as we’ll explore later in this chapter.

So what you end up with is a series of program sections that can be run

independently, punctuated by places where they need to “check in” with

each other to exchange data, an event referred to as “synchronization.”

For example, one fast food employee can lay a patty on a bun

completely independently from someone else squirting mustard on a

different burger. During the laying and squirting processes, the two can

be completely independent. However, after they’re done, each has to

pass his or her burger to the next guy, and neither can restart with a new

burger until a new one is in place. So if the mustard guy is a lot faster

than the patty-laying guy, he’ll have to wait idly until the new burger

shows up. That is a synchronization point (as shown in Figure 2.1).

A key characteristic here is the fact that the two independent processes

may operate at completely different speeds, and that speed may not be

predictable. Different employees on different shifts, for example, may go


at different speeds. This is a fundamental issue for parallel execution of

programs. While there are steps that can be taken to make the relative

speeds more predictable, in the abstract, they need to be considered

unpredictable. This concept of a program spawning a set of independent

processes with occasional check-in points is shown in Figure 2.2.

Depending on the specific implementation, the independent portions of

the program might be threads or processes (Figure 2.3). At this stage,

we’re really not interested in those specifics, so to avoid getting caught

up in that detail, they are often generically referred to as “tasks”. In this

chapter, we will focus on tasks; how those tasks are realized, including

the definitions of SMP and AMP shown in the figure, will be discussed

in later chapters.

Figure 2.1: Where the two independent processes interact is a synchronization point.


Two kinds of concurrency

There are fundamentally two different ways to do more than one thing at

a time: bulk up so that you have multiple processors doing the same

thing, or use division of labor, where different processors do different

things at the same time.

Figure 2.3: Tasks can be different threads within a process or different processes.

Figure 2.2: A series of tasks run mutually asynchronously with occasional synchronization points.


Data parallelism

The first of those is the easiest to explain. Let’s say you’ve got a four-entry

vector that you want to operate on. Let’s make it really simple for the sake

of example and say that you need to increment the value of every entry in

the vector. In a standard program, you would do this with a loop:
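
(The book’s original listing appears as a figure; the following is a minimal C sketch of the kind of loop being described, with arbitrary vector contents.)

    /* Minimal sketch of the serial loop described above; the vector
     * contents are arbitrary placeholders. */
    #include <stdio.h>

    int main(void)
    {
        int v[4] = {10, 20, 30, 40};

        for (int i = 0; i < 4; i++) {   /* i is the iterator            */
            v[i] = v[i] + 1;            /* increment each entry in turn */
        }

        for (int i = 0; i < 4; i++)
            printf("%d\n", v[i]);
        return 0;
    }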

This problem is exceedingly easy to parallelize. In fact, it belongs to a

general category of problems whimsically called “embarrassingly

parallel” (Figure 2.4). Each vector entry is completely independent and

can be incremented completely independently. Given four processors,

you could easily have each processor work on one of the entries and do

the entire vector in 1/4 the time it takes to do it on a single processor.

In fact, in this case, it would probably be even less than 1/4 because you

no longer have the need for an iterator (the i in the pseudocode above);

you no longer have to increment i each time and compare it to 4 to see

if you’re done (Figure 2.5).

This is referred to as data parallelism; multiple instances of data can be

operated on at the same time. The inherent concurrency allows a four-

fold speed-up, although a given implementation might choose less if

fewer than four processors are available.
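
As one illustration only (the chapter itself does not prescribe any particular mechanism), data parallelism like this can be expressed in C with an OpenMP parallel-for, letting the runtime hand each available core a share of the independent iterations:

    /* Sketch only: a data-parallel version of the increment loop using
     * OpenMP; this is not the book's code. */
    #include <omp.h>
    #include <stdio.h>

    #define N 4

    int main(void)
    {
        int v[N] = {10, 20, 30, 40};

        /* Each iteration is independent, so the runtime may run them on
         * separate cores; with four cores, each core can take one entry. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            v[i] = v[i] + 1;

        for (int i = 0; i < N; i++)
            printf("%d\n", v[i]);
        return 0;
    }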

Figure 2.4: Embarrassingly parallel computation.


Two key attributes of this problem make it so easy to parallelize:

• the operation being performed on one entry doesn’t depend on any other entry

• the number of entries is known and fixed.

That second one is important. If you’re trying to figure out how to

exploit concurrency in a way that’s static (in other words, you know

exactly how the problem will be parallelized at compile time), then the

number of loop iterations must be known at compile time. A “while”

loop or a “for” loop where the endpoint is calculated instead of constant

cannot be so neatly parallelized because, for any given run, you don’t

know how many parallel instances there might be.

Functional parallelism

The other way of splitting things up involves giving different processors

different things to do. Let’s take a simple example where we have a

number of text files and we want to cycle through them to count the

number of characters in each one. We could do this with the following

pseudo-program:
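
(The original pseudo-program is shown as a figure; the sketch below, with placeholder file names, captures the serial loop being described.)

    /* Minimal sketch of the serial file-processing loop; the file names
     * are placeholders, not from the original text. */
    #include <stdio.h>

    int main(void)
    {
        const char *files[] = {"a.txt", "b.txt", "c.txt"};
        const int nfiles = 3;

        for (int i = 0; i < nfiles; i++) {
            FILE *f = fopen(files[i], "r");   /* task 1: open the file    */
            if (f == NULL)
                continue;

            long count = 0;
            while (fgetc(f) != EOF)           /* task 2: count characters */
                count++;

            fclose(f);                        /* task 3: close the file   */
            printf("%s: %ld characters\n", files[i], count);
        }
        return 0;
    }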

Figure 2.5: Looping in a single core takes more cycles than multicore.


We can take three processors and give each of them a different task. The

first processor opens files; the second counts characters; and the third

closes files (Figure 2.6).

There is a fundamental difference between this and the prior example of

data parallelism. In the vector-increment example, we took a problem

that had been solved by a loop and completely eliminated the loop. In

this new example, because of the serial nature of the three tasks, if you

only had one loop iteration, then there would be no savings at all. It only

works if you have a workload involving repeated iterations of this loop.

As illustrated in Figure 2.7, when the first file is opened, the second and

third processors sit idle. After one file is open, then the second processor

can count the characters, while the third processor is still idle. Only

when the third file is opened do all processors finally kick in as the third

processor closes the first file. This leads to the descriptive term

“pipeline” for this kind of arrangement, and, when executing, it doesn’t

really hit its stride until the pipeline fills up. This is also referred to as

“loop distribution” because the duties of one loop are distributed into

multiple loops, one on each processor.

This figure also illustrates the fact that using this algorithm on only one

file provides no benefit whatsoever.

Real-world programs and algorithms typically have both inherent data

and functional concurrency. In some situations, you can use both. For

Figure 2.6: Different cores performing different operations.


example, if you had six processors, you could double the three-processor

pipeline to work through the files twice as fast. In other situations, you

may have to decide whether to exploit one or the other in your

implementation.

One of the challenges of a pipeline lies in what’s called balancing the

pipeline. Execution can only go as fast as the slowest stage. In

Figure 2.7, opening files is shown as taking longer than counting the

characters. In that situation, counting faster will not improve

performance; it will simply increase the idle time between files.

The ideal situation is to balance the tasks so that every pipeline stage

takes the same amount of time; in practice, this is so difficult as to be

more or less impossible. It becomes even harder when different iterations

take more or less time. For instance, it will presumably take longer to

count the characters in a bigger file, so really the times for counting

characters above should vary from file to file. Now it’s completely

impossible to balance the pipeline perfectly.

Dependencies

One of the keys to the simple examples we’ve shown is the

independence of operations. Things get more complicated when one

Figure 2.7: The pipeline isn’t full until all cores are busy.


calculation depends on the results of another. And there are a number of

ways in which these dependencies crop up. We’ll describe some basic

cases here, but a complete theory of dependencies can be quite intricate.

It bears noting here that this discussion is intended to motivate some of

the key challenges in parallelizing software for multicore. In general, one

should not be expected to manually analyze all of the dependencies in a

program in order to parallelize it; tools become important for this. For

this reason, the discussion won’t be exhaustive, and will show concept

examples rather than focusing on practical ways of dealing with

dependencies, which will be covered in the chapter on parallelizing

software.

Producers and consumers of data

Dependencies are easier to understand if you think of a program as

consisting of producers and consumers of data (Figure 2.8). Some part of

the program does a calculation that some other part will use: the first

part is the producer and the second is the consumer. This happens at

very fine-grained instruction levels and at higher levels, especially if you

are taking an object-oriented approach: objects are also producers and

consumers of data.

At its most basic, a dependency means that a consumer of data must wait

to consume its data until the producer has produced the data (Figure 2.9).

The concept is straightforward, but the implications vary depending on

the language and approach taken. At the instruction level, many

compilers have been designed to exploit low-level concurrency, doing

things like instruction reordering to make execution more efficient while

making sure that no dependencies are violated.

It gets more complicated with languages like C that allow pointers. The

concept is the same, but compilers have no way of understanding how

various pointers relate, and so cannot safely perform such optimizations. There are two

reasons why this is so: pointer aliasing and pointer arithmetic.

Pointer aliasing is an extremely common occurrence in a C program. If

you have a function that takes a pointer to, say, an image as a parameter,

that function may name the pointer imagePtr. If a program needs to call


that function on behalf of two different images (say, leftImage and

rightImage), then when the function is called with leftImage as the

parameter, leftImage and imagePtr will refer to the same data.

When called for rightImage, then rightImage and imagePtr will point to

the same data (Figure 2.10).
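
A minimal sketch of that aliasing scenario is shown below; the Image type and the processImage() function are illustrative names, not taken from the original.

    /* Sketch of pointer aliasing: imagePtr aliases whichever image the
     * caller passes in. Image and processImage() are hypothetical. */
    #include <stdint.h>

    typedef struct {
        uint8_t pixels[64];
    } Image;

    static void processImage(Image *imagePtr)
    {
        for (int i = 0; i < 64; i++)
            imagePtr->pixels[i] /= 2;    /* operate through the alias */
    }

    int main(void)
    {
        Image leftImage  = {{0}};
        Image rightImage = {{0}};

        processImage(&leftImage);   /* imagePtr now aliases leftImage  */
        processImage(&rightImage);  /* imagePtr now aliases rightImage */
        return 0;
    }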

Figure 2.8: Producers and consumers at the fine- and coarse-grained level. Entities are often both producers and consumers.

Figure 2.9: A consumer cannot proceed until it gets its data from the producer.


This is referred to as aliasing because a given piece of data may be

accessed by variables of different names at different times. There’s no

way to know statically what the dependencies are, not only because the

names look completely different, but also because they may change as

the program progresses. Thorough dynamic analysis is required to

understand the relationships between pointers.

Pointer arithmetic can also be an obvious problem because, even if you

know where a pointer starts out, manipulating the actual address being

pointed to can result in the pointer pointing pretty much anywhere

(including address 0, which any C programmer has done at least once in

his or her life). Where it ends up pointing may or may not correlate to a

memory location associated with some other pointer (Figure 2.11).

For example, when scanning through an array with one pointer to

make changes, it may be very hard to understand that some

subsequent operation, where a different pointer scans through the

same array (possibly using different pointer arithmetic), will read that

data (Figure 2.12). If the second scan consumes data that the first

scan was supposed to put into place, then parallelizing those as

independent will cause the program to function incorrectly. In many

cases, this dependency cannot be identified by static inspection; the

only way to tell is to notice at run time that the pointers address the

same space.
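
A small sketch of that situation, with illustrative names, might look like this: two differently named pointers end up traversing the same buffer, and only at run time is it apparent that the second scan reads what the first scan writes.

    /* Sketch: a hidden dependency between two pointer scans of the same
     * array (names and data are illustrative). */
    #include <stdio.h>

    #define N 8
    static int samples[N];

    int main(void)
    {
        int *writer = samples;              /* first scan: produces data */
        for (int i = 0; i < N; i++)
            *writer++ = i * i;

        const int *reader = samples;        /* second scan: a different    */
        int sum = 0;                        /* pointer over the same array */
        while (reader < samples + N)
            sum += *reader++;

        /* Treating the two scans as independent and running them in
         * parallel would break the program: the reader consumes what the
         * writer is supposed to have produced. */
        printf("sum = %d\n", sum);
        return 0;
    }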

These dependencies are based on a consumer needing to wait until the

producer has created the data: writing before reading. The opposite

situation also exists: if a producer is about to rewrite a memory location,

you want to be sure that all consumers of the old data are done before

you overwrite the old data with new data (Figure 2.13). This is called an

Figure 2.10: Different pointers may point to the same locations at different times.


Figure 2.11: Pointer arithmetic can cause a pointer to refer to some location in memory that may or may not be pointed to by some other pointer.

Figure 2.12: Two pointers operating on the same array create a dependency that isn’t evident by static inspection.

Figure 2.13: The second pointer must wait before overwriting data until the first pointer has completed its read, creating an anti-dependency.


“anti-dependency”. Everything we’ve discussed about dependencies also

holds for anti-dependencies except that this is about waiting to write

until all the reads are done: reading before writing.

This has been an overview of dependencies; they will be developed in

more detail in the Partitioning chapter.

Loops and dependencies

Dependencies become more complex when loops are involved, and

in programs being targeted for parallelization, loops are almost

always involved. We saw above how an embarrassingly parallel loop

can be parallelized so that each processor gets one iteration of the

loop. Let’s look at an example that’s slightly different from that

one.

Note that in this and all examples like this, I’m ignoring what

happens for the first iteration, since that detail isn’t critical for the

discussion.
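
(The original listing is a figure; a minimal C sketch of the kind of loop in question, with arbitrary data, is shown below. Each iteration reads the value written by the one before it.)

    /* Sketch of a loop with a loop-carried dependency; as noted above,
     * the first iteration is glossed over. */
    #include <stdio.h>
    #define N 4

    int main(void)
    {
        int a[N] = {1, 0, 0, 0};
        int b[N] = {0, 2, 3, 4};

        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];   /* consumes the result produced by
                                         the previous iteration */

        printf("%d\n", a[N - 1]);
        return 0;
    }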

This creates a subtle change because each loop iteration produces a

result that will be consumed in the next loop iteration. So the second

loop iteration can’t start until the first iteration has produced its data.

This means that the loop iterations can no longer run exactly in

parallel: each of these parallel iterations is offset from its predecessor

(Figure 2.14). While the total computation time is still less than

required to execute the loop on a single processor, it’s not as fast as

if there were no dependencies between the loop iterations. Such

dependencies are referred to as “loop-carry” (or “loop-carried”)

dependencies.

It gets even more complicated when you have nested loops iterating across

multiple iterators. Let’s say you’re traversing a two-dimensional matrix

using i to scan along a row and using j to scan down the rows (Figure 2.15).


And let’s assume further that a given cell depends on the new value of

the cell directly above it (Figure 2.16):
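
(Again, the original listing is a figure; the sketch below, with arbitrary arithmetic, shows the shape of such a nested loop: the i loop carries no dependency, while each row j reads the newly written row j-1.)

    /* Sketch of the nested loop: each cell depends on the new value of
     * the cell directly above it (same column, prior row). */
    #include <stdio.h>
    #define ROWS 4
    #define COLS 4

    int main(void)
    {
        int a[ROWS][COLS] = {{1, 2, 3, 4}};   /* first row given       */

        for (int j = 1; j < ROWS; j++)        /* j scans down the rows */
            for (int i = 0; i < COLS; i++)    /* i scans along a row   */
                a[j][i] = a[j - 1][i] + 1;    /* depends on row j-1    */

        printf("%d\n", a[ROWS - 1][COLS - 1]);
        return 0;
    }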

First of all, there are lots of ways to parallelize this code, depending on

how many cores we have. If we were to go as far as possible, we would

Figure 2.15: 4 × 4 array with i iterating along a row (inner loop) and j iterating down the rows (outer loop).

Figure 2.14: Even though iterations are parallelized, each must wait until its needed data is produced by the prior iteration, causing offsets that increase overall computation time above what would be required for independent iterations.


need 16 cores since there are 16 cells. Or, with four cores, we could

assign one row to each core.

If we did the latter, then we couldn’t start the second row until the first

cell of the first row was calculated (Figure 2.17).

If we completely parallelized it, then we could start all of the first-row

entries at the same time, but the second-row entries would have to wait

until their respective first-row entries were done (Figure 2.18).

Note that using so many cores really doesn’t speed anything up: using

only four cores would do just as well since only four cores would be

executing at any given time (Figure 2.19). This implementation assigns

Figure 2.16: Each cell gets a new value that depends on the new value in the cell in the prior row.

Figure 2.17: If each row gets its own core, then each row must wait until the first cell in the prior row is done before starting.


one column to each core, instead of one row, as is done in Figure 2.17.

As a result, the loop can be processed faster because no core has to wait

for any other core. There is no way to parallelize this set of nested loops

any further because of the dependencies.

Figure 2.18: An implementation that assigns each cell to its own core.

Figure 2.19: Four cores can implement this loop in the same time as 16.


Such nested loops give rise to the concept of “loop distance”. Each

iterator gets a loop distance. So in the above example, in particular as

shown in Figure 2.16, where the arrows show the dependency, the loop

distance for i is 0 since there is no dependency; the loop distance for j

is 1, since the data consumed in one cell depends on the cell directly

above it, which is the prior j iteration. As a “vector”, the loop distance

for i and j is [0,1].

If we changed the code slightly to make the dependency on j-2 instead

of j-1:
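
(Continuing the same sketch, the dependency now reaches two rows back.)

    /* Same sketch, with the dependency moved from row j-1 to row j-2;
     * the first two rows are taken as given. */
    #include <stdio.h>
    #define ROWS 4
    #define COLS 4

    int main(void)
    {
        int a[ROWS][COLS] = {{1, 2, 3, 4}, {5, 6, 7, 8}};

        for (int j = 2; j < ROWS; j++)
            for (int i = 0; i < COLS; i++)
                a[j][i] = a[j - 2][i] + 1;   /* depends on row j-2 */

        printf("%d\n", a[ROWS - 1][COLS - 1]);
        return 0;
    }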

then the loop distance for j is 2, as shown in Figure 2.20.

This means that the second row doesn’t have to wait for the first row,

since it no longer depends on the first row. The third row, however, does

have to wait for the first row (Figure 2.21). Thus we can parallelize

further with more cores, if we wish, completing the task in half the time

required for the prior example.

While it may seem obscure, the loop distance is an important measure

for synchronizing data. It’s not a matter of one core producing data and

Figure 2.20: Example showing j loop distance of 2.


the other immediately consuming it; the consuming core may have to

wait a number of iterations before consuming the data, depending on

how things are parallelized. While it’s waiting, the producer continues

with its iterations, writing more data. Such data can be, for example,

written into some kind of first-in/first-out (FIFO) memory, and the loop

distance determines how long that FIFO has to be. This will be discussed

more fully in the Partitioning chapter.

Figure 2.21: With loop distance of 2, two rows can be started in parallel.

Figure 2.22: A four-core implementation with loop distance [0,2].


Let’s take the prior example and implement it with only four cores

instead of eight, as shown in Figure 2.22.

Let’s look at Core 1. When it’s done with cell [1,1], it must move on to

cell [1,2]. But cell [1,3] needs the result from [1,1]. Strictly speaking,

this is an anti-dependency: the [1,1] result must be kept around until

[1,3] reads it. Depending on how we implement things, cell [1,2] might

destroy the result.

Now, as shown above, we can really just implement this as an array in

each core, keeping all the results separate. But in some multicore

systems, the operating system will determine which cores get which

threads, and if each cell is spawned as a thread, then things could be

assigned differently. For example, the first two cores might exchange the

last two cells (Figure 2.23).

Now Core 1 has to hand its results to Core 2 (and vice versa, not

illustrated to avoid clutter). The solution is for Core 1 to put the result

of [1,1] somewhere for safekeeping until [1,3] is ready for it. Then [1,2]

can proceed, and Core 2 can pick up the result it needs when it’s ready.

But the [1,2] result will also be ready before Core 2 is ready for the [1,1]

Figure 2.23: The first two of four cores, with a different assignment of cells.

Figure 2.24: FIFO used to communicate results between cores. The minimum FIFO size is related to the loop distance.


result. So the [1,2] result can’t just be put in the same place as the [1,1]

result or it will overwrite it.

One solution, at the risk of getting into implementation details, is to use

some kind of FIFO structure between Core 1 and Core 2 (Figure 2.24).

Because the loop distance for j is 2, the FIFO needs to be at least 2 deep

to avoid stalling things. Additionally, by using a FIFO instead of trying

to hard-code an array implementation, the solution is robust against any

arbitrary thread assignments that the operating system may make.
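
To make the relationship concrete without dwelling on implementation, here is a minimal, single-threaded sketch of the idea (hypothetical, not from the text, and not a real inter-core FIFO): because the producer can run up to the loop distance ahead of the consumer, the FIFO must be able to hold at least that many results.

    /* Sketch: a tiny ring-buffer FIFO whose minimum depth equals the
     * loop distance (2 in this example). */
    #include <assert.h>
    #include <stdio.h>

    #define LOOP_DISTANCE 2
    #define FIFO_DEPTH    LOOP_DISTANCE   /* minimum depth to avoid stalls */

    static int fifo[FIFO_DEPTH];
    static int head, tail, count;

    static void fifo_put(int v)
    {
        assert(count < FIFO_DEPTH);       /* producer would stall here */
        fifo[head] = v;
        head = (head + 1) % FIFO_DEPTH;
        count++;
    }

    static int fifo_get(void)
    {
        assert(count > 0);                /* consumer would stall here */
        int v = fifo[tail];
        tail = (tail + 1) % FIFO_DEPTH;
        count--;
        return v;
    }

    int main(void)
    {
        fifo_put(10);                     /* producer runs two iterations */
        fifo_put(20);                     /* ahead before the consumer    */
        printf("%d %d\n", fifo_get(), fifo_get());
        return 0;
    }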

FIFOs are sometimes thought to be expensive, depending on how they

are implemented. The intent here isn’t to focus on the details of the

FIFO, but rather to illustrate its relationship to the loop distance. Specific

synchronization mechanisms will be discussed in future chapters. More

concrete examples of dependencies and synchronization are presented in

the Partitioning chapter.

Manual determination of loop distance can, frankly, be quite confusing.

In fact, the body of a loop may have numerous variables, each with

different loop distances. Branches further complicate things. The

existence of tools to handle this will be covered in a subsequent chapter.

Because of these tools, we will not delve further into the intricacies, but

rather leave the discussion here as a motivation of the concept of loop

distance as it shows up in tools.

Shared resources

The second major challenge that concurrent tasks present is the fact that

different tasks may need to access the same resources at the same time.

For the most part, the challenges are exactly the same as those presented

by a multi-threaded program on a single-core system. The use of critical

sections and locks and their ilk proceeds exactly as before.

However, the implementations of solutions that work for single-core

systems may not work for multicore systems. For example, one simple

brute-force way to block any other thread from interrupting a critical

section of code is to suspend interrupts while within that critical section.

While that might work for a single core, it doesn’t work if there is


another core that could be accessing (or corrupting) a shared memory

location.
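
As an illustration only (not from the original text), the sketch below uses a POSIX mutex so that the critical section excludes other threads even when they are running on another core; an RTOS would provide an equivalent primitive.

    /* Sketch: protecting a shared counter with a mutex rather than by
     * masking interrupts, so exclusion also holds across cores. */
    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* enter critical section */
            shared_counter++;              /* the shared resource    */
            pthread_mutex_unlock(&lock);   /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld\n", shared_counter);   /* 200000 with the lock in place */
        return 0;
    }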

The other new concept that multicore adds is the fact that each core has

its own cache, and global data replicated in the cache may at times be

out of sync with the latest version. And this gets complex because cache

coherency strategies can themselves be complex, and different platforms

will have different schemes.

So, while a programmer can ignore the cache on a single-core system,

that’s no longer possible for multicore, and, as we’ll see in subsequent

chapters, the handling of synchronization may depend on the caching

strategy of the platform.

Summary

All of the challenges of multicore computing arise from concurrency, the

fact that different things may happen at the same time. If we’re used to

events occurring in a prescribed order, then it can require a bit of mental

gymnastics to get used to the idea that two operations in two different

parallel threads may happen in any order with respect to each other.

Concurrency is a good thing: it lets us do things in parallel so that we

achieve a goal more quickly. But it can also make things go haywire, so

most of this book is dedicated to managing the challenges of

concurrency in order to realize the promise of concurrency.
