CHAPTER 2
The Promise and Challenges of Concurrency
Bryon Moyer
Technology Writer and Editor, EE Journal
Chapter Outline
Concurrency fundamentals
Two kinds of concurrency
  Data parallelism
  Functional parallelism
Dependencies
  Producers and consumers of data
  Loops and dependencies
Shared resources
Summary
The opportunities and challenges that arise from multicore technology, or any kind of multiple processor arrangement, are rooted in the
concept of concurrency. You can loosely conceive of this as “more than
one thing happening at a time”. But when things happen simultaneously,
it’s very easy for chaos to ensue. If you create an “assembly line” to
make burgers quickly in a fast food joint, with one guy putting the patty
on the bun and the next guy adding a dab of mustard, things will get
messy if the mustard guy doesn’t wait for a burger to be in place before
applying the mustard. Coordination is key, and yet, as obvious as this
may sound, it can be extremely challenging in a complex piece of
software.
The purpose of this chapter is to address concurrency and its associated
challenges at a high level. Specific solutions to the problems will be
covered in later chapters.
Concurrency fundamentals
It is first important to separate the notion of inherent concurrency from
that of implemented parallelization. A given algorithm or process may be full of
opportunities for things to run independently from each other. An actual
implementation will typically select a specific subset of those
opportunities to parallelize and go forward with that.
For example, in our burger-making example, you could make burgers
more quickly if you had multiple assembly lines going at the same time.
In theory, given an infinite supply of materials, you could make
infinitely many burgers concurrently. However, in reality, you only have
a limited number of employees and countertops on which to do the work.
So you may actually implement, say, two lines even though the process
inherently could allow more. In a similar fashion, the number of
processors and other resources drives the decision on how much
parallelism to implement.
It’s critical to note, however, that a chosen implementation relies on the
inherent opportunities afforded by the algorithm itself. No amount of
parallelization will help an algorithm that has little inherent concurrency,
as we’ll explore later in this chapter.
So what you end up with is a series of program sections that can be run
independently punctuated by places where they need to “check in” with
each other to exchange data, an event referred to as "synchronization."
For example, one fast food employee can lay a patty on a bun
completely independently from someone else squirting mustard on a
different burger. During the laying and squirting processes, the two can
be completely independent. However, after they’re done, each has to
pass his or her burger to the next guy, and neither can restart with a new
burger until a new one is in place. So if the mustard guy is a lot faster
than the patty-laying guy, he’ll have to wait idly until the new burger
shows up. That is a synchronization point (as shown in Figure 2.1).
A key characteristic here is the fact that the two independent processes
may operate at completely different speeds, and that speed may not be
predictable. Different employees on different shifts, for example, may go
at different speeds. This is a fundamental issue for parallel execution of
programs. While there are steps that can be taken to make the relative
speeds more predictable, in the abstract, they need to be considered
unpredictable. This concept of a program spawning a set of independent
processes with occasional check-in points is shown in Figure 2.2.
Depending on the specific implementation, the independent portions of
the program might be threads or processes (Figure 2.3). At this stage,
we’re really not interested in those specifics, so to avoid getting caught
up in that detail, they are often generically referred to as “tasks”. In this
chapter, we will focus on tasks; how those tasks are realized, including
the definitions of SMP and AMP shown in the figure, will be discussed
in later chapters.
Figure 2.1: Where the two independent processes interact is a synchronization point.
Two kinds of concurrency
There are fundamentally two different ways to do more than one thing at
a time: bulk up so that you have multiple processors doing the same
thing, or use division of labor, where different processors do different
things at the same time.
Figure 2.3: Tasks can be different threads within a process or different processes.
Figure 2.2: A series of tasks run mutually asynchronously with occasional synchronization points.
Data parallelism
The first of those is the easiest to explain. Let's say you've got a four-entry
vector that you want to operate on. Let's make it really simple for the sake
of example and say that you need to increment the value of every entry in
the vector. In a standard program, you would do this with a loop:
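A minimal C sketch of such a loop (the array name is illustrative) might look like this:

    int vector[4];                       /* four-entry vector                  */

    void incrementAll(void) {
        for (int i = 0; i < 4; i++) {    /* i is the iterator discussed below  */
            vector[i] = vector[i] + 1;   /* increment each entry               */
        }
    }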
This problem is exceedingly easy to parallelize. In fact, it belongs to a
general category of problems whimsically called “embarrassingly
parallel” (Figure 2.4). Each vector entry is completely independent and
can be incremented completely independently. Given four processors,
you could easily have each processor work on one of the entries and do
the entire vector in 1/4 the time it takes to do it on a single processor.
In fact, in this case, it would probably be even less than 1/4 because you
no longer have the need for an iterator (the i in the pseudocode above);
you no longer have to increment i each time and compare it to 4 to see
if you’re done (Figure 2.5).
This is referred to as data parallelism; multiple instances of data can be
operated on at the same time. The inherent concurrency allows a four-fold
speed-up, although a given implementation might choose less if
fewer than four processors are available.
Figure 2.4: Embarrassingly parallel computation.
Two key attributes of this problem make it so easy to parallelize:
• the operation being performed on one entry doesn't depend on any other entry
• the number of entries is known and fixed.
That second one is important. If you’re trying to figure out how to
exploit concurrency in a way that's static (in other words, you know
exactly how the problem will be parallelized at compile time), then the
number of loop iterations must be known at compile time. A “while”
loop or a “for” loop where the endpoint is calculated instead of constant
cannot be so neatly parallelized because, for any given run, you don’t
know how many parallel instances there might be.
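For example, a loop like the following hypothetical sketch, whose endpoint is computed at run time, can't be carved into a fixed number of parallel pieces at compile time (countEntries and process are illustrative placeholders):

    int  countEntries(const int *input, int rawLength);   /* illustrative */
    void process(int entry);                               /* illustrative */

    void processAll(const int *input, int rawLength) {
        int n = countEntries(input, rawLength);   /* trip count known only at run time */
        for (int i = 0; i < n; i++) {
            process(input[i]);
        }
    }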
Functional parallelism
The other way of splitting things up involves giving different processors
different things to do. Let’s take a simple example where we have a
number of text files and we want to cycle through them to count the
number of characters in each one. We could do this with the following
pseudo-program:
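A minimal C-style sketch of such a loop (the file list and the counting helper are illustrative):

    #include <stdio.h>

    long countChars(FILE *fp);                       /* illustrative helper      */

    void countAllFiles(char *fileNames[], int numFiles) {
        for (int i = 0; i < numFiles; i++) {
            FILE *fp = fopen(fileNames[i], "r");     /* task 1: open the file    */
            if (fp == NULL)
                continue;
            long count = countChars(fp);             /* task 2: count characters */
            fclose(fp);                              /* task 3: close the file   */
            printf("%s: %ld characters\n", fileNames[i], count);
        }
    }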
Figure 2.5: Looping in a single core takes more cycles than multicore.
We can take three processors and give each of them a different task. The
first processor opens files; the second counts characters; and the third
closes files (Figure 2.6).
There is a fundamental difference between this and the prior example of
data parallelism. In the vector-increment example, we took a problem
that had been solved by a loop and completely eliminated the loop. In
this new example, because of the serial nature of the three tasks, if you
only had one loop iteration, then there would be no savings at all. It only
works if you have a workload involving repeated iterations of this loop.
As illustrated in Figure 2.7, when the first file is opened, the second and
third processors sit idle. After one file is open, then the second processor
can count the characters, while the third processor is still idle. Only
when the third file is opened do all processors finally kick in as the third
processor closes the first file. This leads to the descriptive term
“pipeline” for this kind of arrangement, and, when executing, it doesn’t
really hit its stride until the pipeline fills up. This is also referred to as
“loop distribution” because the duties of one loop are distributed into
multiple loops, one on each processor.
This figure also illustrates the fact that using this algorithm on only one
file provides no benefit whatsoever.
Real-world programs and algorithms typically have both inherent data
and functional concurrency. In some situations, you can use both. For
Figure 2.6: Different cores performing different operations.
example, if you had six processors, you could double the three-processor
pipeline to work through the files twice as fast. In other situations, you
may have to decide whether to exploit one or the other in your
implementation.
One of the challenges of a pipeline lies in what’s called balancing the
pipeline. Execution can only go as fast as the slowest stage. In
Figure 2.7, opening files is shown as taking longer than counting the
characters. In that situation, counting faster will not improve
performance; it will simply increase the idle time between files.
The ideal situation is to balance the tasks so that every pipeline stage
takes the same amount of time; in practice, this is so difficult as to be
more or less impossible. It becomes even harder when different iterations
take more or less time. For instance, it will presumably take longer to
count the characters in a bigger file, so really the times for counting
characters above should vary from file to file. Now it’s completely
impossible to balance the pipeline perfectly.
Dependencies
One of the keys to the simple examples we’ve shown is the
independence of operations. Things get more complicated when one
Figure 2.7: The pipeline isn't full until all cores are busy.
calculation depends on the results of another. And there are a number of
ways in which these dependencies crop up. We’ll describe some basic
cases here, but a complete theory of dependencies can be quite intricate.
It bears noting here that this discussion is intended to motivate some of
the key challenges in parallelizing software for multicore. In general, one
should not be expected to manually analyze all of the dependencies in a
program in order to parallelize it; tools become important for this. For
this reason, the discussion won’t be exhaustive, and will show concept
examples rather than focusing on practical ways of dealing with
dependencies, which will be covered in the chapter on parallelizing
software.
Producers and consumers of data
Dependencies are easier to understand if you think of a program as
consisting of producers and consumers of data (Figure 2.8). Some part of
the program does a calculation that some other part will use: the first
part is the producer and the second is the consumer. This happens at
very fine-grained instruction levels and at higher levels, especially if you
are taking an object-oriented approach: objects are also producers and
consumers of data.
At its most basic, a dependency means that a consumer of data must wait
to consume its data until the producer has produced the data (Figure 2.9).
The concept is straightforward, but the implications vary depending on
the language and approach taken. At the instruction level, many
compilers have been designed to exploit low-level concurrency, doing
things like instruction reordering to make execution more efficient while
making sure that no dependencies are violated.
It gets more complicated with languages like C that allow pointers. The
concept is the same, but compilers have no way of understanding how
various pointers relate, and so can’t do any optimization. There are two
reasons why this is so: pointer aliasing and pointer arithmetic.
Pointer aliasing is an extremely common occurrence in a C program. If
you have a function that takes a pointer to, say, an image as a parameter,
that function may name the pointer imagePtr. If a program needs to call
that function on behalf of two different images (say, leftImage and
rightImage), then when the function is called with leftImage as the
parameter, then leftImage and imagePtr will refer to the same data.
When called for rightImage, then rightImage and imagePtr will point to
the same data (Figure 2.10).
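A minimal sketch of the situation (the Image type and helper functions are illustrative; the pointer names follow the ones used above):

    typedef struct { unsigned char *pixels; int width, height; } Image;   /* illustrative */

    /* imagePtr aliases whichever image the caller passes in. */
    void adjustBrightness(Image *imagePtr) {
        for (int i = 0; i < imagePtr->width * imagePtr->height; i++) {
            imagePtr->pixels[i] += 1;
        }
    }

    void processStereoPair(Image *leftImage, Image *rightImage) {
        adjustBrightness(leftImage);    /* here imagePtr and leftImage name the same data  */
        adjustBrightness(rightImage);   /* here imagePtr and rightImage name the same data */
    }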
Figure 2.8: Producers and consumers at the fine- and coarse-grained level. Entities are often both producers and consumers.
Figure 2.9: A consumer cannot proceed until it gets its data from the producer.
This is referred to as aliasing because a given piece of data may be
accessed by variables of different names at different times. There’s no
way to know statically what the dependencies are, not only because the
names look completely different, but also because they may change as
the program progresses. Thorough dynamic analysis is required to
understand the relationships between pointers.
Pointer arithmetic can also be an obvious problem because, even if you
know where a pointer starts out, manipulating the actual address being
pointed to can result in the pointer pointing pretty much anywhere
(including address 0, which any C programmer has done at least once in
his or her life). Where it ends up pointing may or may not correlate to a
memory location associated with some other pointer (Figure 2.11).
For example, when scanning through an array with one pointer to
make changes, it may be very hard to understand that some
subsequent operation, where a different pointer scans through the
same array (possibly using different pointer arithmetic), will read that
data (Figure 2.12). If the second scan consumes data that the first
scan was supposed to put into place, then parallelizing those as
independent will cause the program to function incorrectly. In many
cases, this dependency cannot be identified by static inspection; the
only way to tell is to notice at run time that the pointers address the
same space.
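As a rough sketch (names are illustrative), the two loops below touch the same memory through different pointers and different arithmetic; treating them as independent would let the reads outrun the writes:

    #define N 64

    long scanTwice(int data[N]) {
        int *writePtr = data;                 /* first scan: producer writes forward  */
        for (int i = 0; i < N; i++) {
            *writePtr++ = i * i;
        }

        int *readPtr = data + (N - 1);        /* second scan: consumer reads backward */
        long sum = 0;
        for (int i = 0; i < N; i++) {
            sum += *readPtr--;                /* consumes the values written above    */
        }
        return sum;
    }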
These dependencies are based on a consumer needing to wait until the
producer has created the data: writing before reading. The opposite
situation also exists: if a producer is about to rewrite a memory location,
you want to be sure that all consumers of the old data are done before
you overwrite the old data with new data (Figure 2.13). This is called an
Figure 2.10: Different pointers may point to the same locations at different times.
Figure 2.11: Pointer arithmetic can cause a pointer to refer to some location in memory that may or may not be pointed to by some other pointer.
Figure 2.12: Two pointers operating on the same array create a dependency that isn't evident by static inspection.
Figure 2.13: The second pointer must wait before overwriting data until the first pointer has completed its read, creating an anti-dependency.
“anti-dependency”. Everything we’ve discussed about dependencies also
holds for anti-dependencies except that this is about waiting to write
until all the reads are done: reading before writing.
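A minimal sketch of that ordering constraint (names are illustrative):

    int slot = 41;                    /* shared location                        */

    int consumeThenProduce(void) {
        int oldValue = slot;          /* consumer reads the old data first      */
        slot = oldValue + 1;          /* producer may overwrite only after the  */
                                      /* read above has completed               */
        return oldValue;
    }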
This has been an overview of dependencies; they will be developed in
more detail in the Partitioning chapter.
Loops and dependencies
Dependencies become more complex when loops are involved, and
in programs being targeted for parallelization, loops are almost
always involved. We saw above how an embarrassingly parallel loop
can be parallelized so that each processor gets one iteration of the
loop. Let’s look at an example that’s slightly different from that
example.
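A minimal sketch of such a loop (the array names are illustrative):

    void carriedDependence(int a[4], const int b[4]) {
        /* Each iteration consumes the value produced by the previous one:
           a[i] cannot be computed until a[i - 1] has been written.        */
        for (int i = 1; i < 4; i++) {
            a[i] = a[i - 1] + b[i];
        }
    }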
Note that in this and all examples like this, I’m ignoring what
happens for the first iteration, since that detail isn’t critical for the
discussion.
This creates a subtle change because each loop iteration produces a
result that will be consumed in the next loop iteration. So the second
loop iteration can’t start until the first iteration has produced its data.
This means that the loop iterations can no longer run exactly in
parallel: each of these parallel iterations is offset from its predecessor
(Figure 2.14). While the total computation time is still less than
required to execute the loop on a single processor, it’s not as fast as
if there were no dependencies between the loop iterations. Such
dependencies are referred to as “loop-carry” (or “loop-carried”)
dependencies.
It gets even more complicated when you have nested loops iterating across
multiple iterators. Let’s say you’re traversing a two-dimensional matrix
using i to scan along a row and using j to scan down the rows (Figure 2.15).
And let’s assume further that a given cell depends on the new value of
the cell directly above it (Figure 2.16):
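A minimal sketch of such nested loops (the matrix name and the combining operation are illustrative):

    int m[4][4];    /* 4 x 4 matrix: j selects the row, i selects the column within it */

    void rowDependence(void) {
        for (int j = 1; j < 4; j++) {         /* outer loop: down the rows  */
            for (int i = 0; i < 4; i++) {     /* inner loop: along a row    */
                /* Each cell depends on the newly computed value in the
                   cell directly above it (same i, previous j).            */
                m[j][i] = m[j - 1][i] + 1;
            }
        }
    }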
First of all, there are lots of ways to parallelize this code, depending on
how many cores we have. If we were to go as far as possible, we would
Figure 2.15: 4 × 4 array with i iterating along a row (inner loop) and j iterating down the rows (outer loop).
Figure 2.14: Even though iterations are parallelized, each must wait until its needed data is produced by the prior iteration, causing offsets that increase overall computation time above what would be required for independent iterations.
need 16 cores since there are 16 cells. Or, with four cores, we could
assign one row to each core.
If we did the latter, then we couldn’t start the second row until the first
cell of the first row was calculated (Figure 2.17).
If we completely parallelized it, then we could start all of the first-row
entries at the same time, but the second-row entries would have to wait
until their respective first-row entries were done (Figure 2.18).
Note that using so many cores really doesn’t speed anything up: using
only four cores would do just as well since only four cores would be
executing at any given time (Figure 2.19). This implementation assigns
Figure 2.16: Each cell gets a new value that depends on the new value in the cell in the prior row.
Figure 2.17: If each row gets its own core, then each row must wait until the first cell in the prior row is done before starting.
one column to each core, instead of one row, as is done in Figure 2.17.
As a result, the loop can be processed faster because no core has to wait
for any other core. There is no way to parallelize this set of nested loops
any further because of the dependencies.
Figure 2.18: An implementation that assigns each cell to its own core.
Figure 2.19: Four cores can implement this loop in the same time as 16.
Such nested loops give rise to the concept of “loop distance”. Each
iterator gets a loop distance. So in the above example, in particular as
shown in Figure 2.16, where the arrows show the dependency, the loop
distance for i is 0 since there is no dependency; the loop distance for j
is 1, since the data consumed in one cell depends on the cell directly
above it, which is the prior j iteration. As a “vector”, the loop distance
for i and j is [0,1].
If we changed the code slightly to make the dependency on j-2 instead
of j-1:
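Continuing the illustrative sketch from above, with the dependency moved two rows up:

    void rowDependenceDistanceTwo(void) {
        for (int j = 2; j < 4; j++) {         /* outer loop: down the rows  */
            for (int i = 0; i < 4; i++) {     /* inner loop: along a row    */
                /* Each cell now depends on the cell two rows above it. */
                m[j][i] = m[j - 2][i] + 1;
            }
        }
    }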
then the loop distance for j is 2, as shown in Figure 2.20.
This means that the second row doesn’t have to wait for the first row,
since it no longer depends on the first row. The third row, however, does
have to wait for the first row (Figure 2.21). Thus we can parallelize
further with more cores, if we wish, completing the task in half the time
required for the prior example.
While it may seem obscure, the loop distance is an important measure
for synchronizing data. It’s not a matter of one core producing data and
Figure 2.20: Example showing j loop distance of 2.
the other immediately consuming it; the consuming core may have to
wait a number of iterations before consuming the data, depending on
how things are parallelized. While it’s waiting, the producer continues
with its iterations, writing more data. Such data can be, for example,
written into some kind of first-in/first-out (FIFO) memory, and the loop
distance determines how long that FIFO has to be. This will be discussed
more fully in the Partitioning chapter.
Figure 2.21: With loop distance of 2, two rows can be started in parallel.
Figure 2.22: A four-core implementation with loop distance [0,2].
Let’s take the prior example and implement it with only four cores
instead of eight, as shown in Figure 2.22.
Let’s look at Core 1. When it’s done with cell [1,1], it must move on to
cell [1,2]. But cell [1,3] needs the result from [1,1]. Strictly speaking,
this is an anti-dependency: the [1,1] result must be kept around until
[1,3] reads it. Depending on how we implement things, cell [1,2] might
destroy the result.
Now, as shown above, we can really just implement this as an array in
each core, keeping all the results separate. But in some multicore
systems, the operating system will determine which cores get which
threads, and if each cell is spawned as a thread, then things could be
assigned differently. For example, the first two cores might exchange the
last two cells (Figure 2.23).
Now Core 1 has to hand its results to Core 2 (and vice versa, not
illustrated to avoid clutter). The solution is for Core 1 to put the result
of [1,1] somewhere for safekeeping until [1,3] is ready for it. Then [1,2]
can proceed, and Core 2 can pick up the result it needs when it’s ready.
But the [1,2] result will also be ready before Core 2 is ready for the [1,1]
Figure 2.23: The first two of four cores, with a different assignment of cells.
Figure 2.24: FIFO used to communicate results between cores. The minimum FIFO size is related to the loop distance.
result. So the [1,2] result can’t just be put in the same place as the [1,1]
result or it will overwrite it.
One solution, at the risk of getting into implementation details, is to use
some kind of FIFO structure between Core 1 and Core 2 (Figure 2.24).
Because the loop distance for j is 2, the FIFO needs to be at least 2 deep
to avoid stalling things. Additionally, by using a FIFO instead of trying
to hard-code an array implementation, the solution is robust against any
arbitrary thread assignments that the operating system may make.
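As a rough sketch, ignoring the cross-core synchronization of the indices themselves (which later chapters address), such a FIFO might look like this; the depth of 2 comes from the loop distance, and all names are illustrative:

    #define FIFO_DEPTH 2        /* at least the loop distance, to avoid stalling */

    typedef struct {
        int slots[FIFO_DEPTH];
        int head, tail, count;  /* count tracks how many slots are full          */
    } Fifo;

    /* Producer side (e.g., Core 1): returns 0 if the FIFO is full. */
    int fifoPush(Fifo *f, int value) {
        if (f->count == FIFO_DEPTH) return 0;
        f->slots[f->tail] = value;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        return 1;
    }

    /* Consumer side (e.g., Core 2): returns 0 if the FIFO is empty. */
    int fifoPop(Fifo *f, int *value) {
        if (f->count == 0) return 0;
        *value = f->slots[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return 1;
    }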
FIFOs are sometimes thought to be expensive, depending on how they
are implemented. The intent here isn’t to focus on the details of the
FIFO, but rather to illustrate its relationship to the loop distance. Specific
synchronization mechanisms will be discussed in future chapters. More
concrete examples of dependencies and synchronization are presented in
the Partitioning chapter.
Manual determination of loop distance can, frankly, be quite confusing.
In fact, the body of a loop may have numerous variables, each with
different loop distances. Branches further complicate things. The
existence of tools to handle this will be covered in a subsequent chapter.
Because of these tools, we will not delve further into the intricacies, but
rather leave the discussion here as a motivation of the concept of loop
distance as it shows up in tools.
Shared resources
The second major challenge that concurrent tasks present is the fact that
different tasks may need to access the same resources at the same time.
For the most part, the challenges are exactly the same as those presented
by a multi-threaded program on a single-core system. The use of critical
sections and locks and their ilk proceeds exactly as before.
However, the implementations of solutions that work for single-core
systems may not work for multicore systems. For example, one simple
brute-force way to block any other thread from interrupting a critical
section of code is to suspend interrupts while within that critical section.
While that might work for a single core, it doesn’t work if there is
another core that could be accessing (or corrupting) a shared memory
location.
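As a rough sketch using POSIX threads as one possible API, a lock protects the shared location no matter which core runs the thread, whereas disabling interrupts would only have stopped preemption on the local core:

    #include <pthread.h>

    static long sharedCounter = 0;
    static pthread_mutex_t counterLock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&counterLock);    /* critical section begins        */
            sharedCounter++;                     /* safe even if another core      */
                                                 /* executes this at the same time */
            pthread_mutex_unlock(&counterLock);  /* critical section ends          */
        }
        return NULL;
    }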
The other new concept that multicore adds is the fact that each core has
its own cache, and global data replicated in the cache may at times be
out of sync with the latest version. And this gets complex because cache
coherency strategies can themselves be complex, and different platforms
will have different schemes.
So, while a programmer can ignore the cache on a single-core system,
that’s no longer possible for multicore, and, as we’ll see in subsequent
chapters, the handling of synchronization may depend on the caching
strategy of the platform.
Summary
All of the challenges of multicore computing arise from concurrency, the
fact that different things may happen at the same time. If we’re used to
events occurring in a prescribed order, then it can require a bit of mental
gymnastics to get used to the idea that two operations in two different
parallel threads may happen in any order with respect to each other.
Concurrency is a good thing: it lets us do things in parallel so that we
achieve a goal more quickly. But it can also make things go haywire, so
most of this book is dedicated to managing the challenges of
concurrency in order to realize the promise of concurrency.