An Introduction to Parallel Programming
Ing. Andrea Marongiu ([email protected])

The Multicore Revolution is Here!

More instruction-level parallelism is hard to find: very complex designs are needed for small gains, while thread-level parallelism appears alive and well.

Clock frequency scaling is slowing drastically: too much power and heat when pushing the envelope.

Cannot communicate across the chip fast enough: better to design small local units with short paths.

Effective use of billions of transistors: easier to reuse a basic unit many times.

Potential for very easy scaling: just keep adding processors/cores for higher (peak) performance.

Vocabulary in the Multi Era

AMP, Asymmetric MP: Each processor has local memory; tasks are statically allocated to one processor

SMP, Shared-Memory MP: Processors share memory, tasks dynamically scheduled to any processor

Vocabulary in the Multi Era

Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design.

Homogeneous: All processors have the same instruction set and can run any task. Usually SMP design.

Future Embedded Systems

The First Software Crisis

60’s and 70’s:

PROBLEM: Assembly language programming. Need to get abstraction and portability without losing performance.

SOLUTION: High-level languages (Fortran and C). Provided a “common machine language” for uniprocessors.

The Second Software Crisis

80’s and 90’s:

PROBLEM: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers. Need for composability, malleability and maintainability.

SOLUTION: Object-oriented programming (C++ and Java). Better tools and software engineering methodology (design patterns, specification, testing).

The Third Software Crisis

Today:

PROBLEM: Solid boundary between hardware and software. High-level languages abstract away the hardware. Sequential performance is left behind by Moore’s Law.

SOLUTION: What’s under the hood? Language features for architectural awareness.

The Software becomes the Problem, AGAIN

Parallelism is required to gain performance: parallel hardware is “easy” to design, but parallel software is (very) hard to write.

True concurrency is fundamentally hard to grasp, especially in complex software environments.

Existing software assumes a single processor: it might break in new and interesting ways. Running correctly under multitasking is no guarantee that it will run on a multiprocessor.

Parallel Programming Principles

Coverage (Amdahl’s Law)
Communication/Synchronization
Granularity
Load Balance
Locality

Coverage

More cores, each less powerful (and less power-hungry), to achieve the same performance?

Coverage

Amdahl's Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67

Amdahl’s Law

Speedup = 1 / ((1 - p) + p / n)

p = fraction of work that can be parallelized
n = the number of processors

Implications of Amdahl’s Law

Speedup tends to 1/(1-p) as number of processors tends to infinity

Parallel programming is worthwhile when programs have a lot of work that is parallel in nature
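The limit of 1/(1-p) can be checked numerically. What follows is a minimal sketch in C, assuming the usual form of Amdahl's Law, Speedup = 1 / ((1 - p) + p / n). The names amdahl_speedup and print_speedup_table are illustrative and do not come from the slides.

```c
#include <stdio.h>

/* Speedup predicted by Amdahl's Law.
 * p: fraction of the work that can be parallelized (0.0 to 1.0)
 * n: number of processors (n >= 1)
 * The name amdahl_speedup is hypothetical, chosen for this sketch. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Print the predicted speedup for a few processor counts. */
void print_speedup_table(double p)
{
    int counts[] = {1, 2, 4, 8, 16, 1024};
    size_t i;

    for (i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("p = %.2f, n = %4d -> speedup = %.2f\n",
               p, counts[i], amdahl_speedup(p, counts[i]));
}
```

Calling print_speedup_table(0.9) shows the speedup staying below 1 / (1 - 0.9) = 10 even with 1024 processors: the 10% of sequential work dominates.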


Overhead of Parallelism

Given enough parallel work, this is the biggest barrier to getting desired speedup

Parallelism overheads include:
cost of starting a thread or process
cost of communicating shared data
cost of synchronizing
extra (redundant) computation

Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
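The tradeoff can be made concrete with a toy cost model. The sketch below is an assumption of this example and not a formula from the slides: work is split into equal tasks, every task pays a fixed overhead, and identical cores process the tasks in rounds. The names estimated_time and print_grain_sweep are hypothetical.

```c
#include <stdio.h>

/* Estimated parallel running time under a deliberately simple model
 * (an assumption of this sketch, not a formula from the slides):
 *   - total_work units of work are split into tasks of `grain` units each;
 *   - every task pays a fixed `overhead` (thread start, synchronization, ...);
 *   - `cores` identical cores process the tasks in rounds.
 * The last task is charged as a full one even if it is smaller. */
long estimated_time(long total_work, long grain, long overhead, long cores)
{
    long tasks  = (total_work + grain - 1) / grain;   /* ceil(total_work / grain) */
    long rounds = (tasks + cores - 1) / cores;        /* ceil(tasks / cores)      */
    return rounds * (grain + overhead);
}

/* Print the estimate for several grain sizes to expose the tradeoff. */
void print_grain_sweep(long total_work, long overhead, long cores)
{
    long grain;

    for (grain = 1; grain <= total_work; grain *= 10)
        printf("grain = %5ld -> estimated time = %ld\n",
               grain, estimated_time(total_work, grain, overhead, cores));
}
```

With 1000 units of work, an overhead of 10 per task and 4 cores, the estimate falls from 2750 at grain 1 to 330 at grain 100, then rises to 1010 at grain 1000, where only one core has any work.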

Communication/Synchronization

Only a few programs are “embarrassingly” parallel.

Programs have sequential parts and parallel parts.

Need to orchestrate parallel execution among processors: synchronize threads to make sure dependencies in the program are preserved, and communicate results among threads to ensure a consistent view of the data being processed.

Communication/Synchronization

Shared memory: Communication is implicit. One copy of data is shared among many threads. Atomicity, locking and synchronization are essential for correctness. Synchronization is typically in the form of a global barrier.

Distributed memory: Communication is explicit through messages. Cores access local memory. Data distribution and communication orchestration are essential for performance. Synchronization is implicit in messages.
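For the shared-memory case, the role of locking can be sketched with POSIX threads. The slides do not name a threading API, so the choice of pthreads is an assumption, and count_in_parallel, worker and MAX_THREADS are hypothetical names. Several threads update one shared counter, and a mutex makes each update atomic.

```c
#include <pthread.h>
#include <stdio.h>

#define MAX_THREADS 16

static long counter;                                      /* one copy of the data, shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

/* Each worker adds 1 to the shared counter `increments` times. */
static void *worker(void *arg)
{
    long increments = *(long *)arg;
    long i;

    for (i = 0; i < increments; i++) {
        pthread_mutex_lock(&lock);     /* make the read-modify-write atomic */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Start `nthreads` workers, wait for all of them, return the final count.
 * Returns -1 if nthreads is out of range or a thread cannot be created. */
long count_in_parallel(int nthreads, long increments)
{
    pthread_t tid[MAX_THREADS];
    int i, started = 0;
    long result;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    counter = 0;
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &increments) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);    /* wait until every started worker is done */

    result = counter;
    return started == nthreads ? result : -1;
}
```

With the mutex in place, count_in_parallel(4, 10000) returns 40000. Without the lock and unlock calls, concurrent increments can be lost and the result may be lower. With GCC or Clang, compile with the -pthread option.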


Granularity

Granularity is a qualitative measure of the ratio of computation to communication

Computation stages are typically separated from periods of communication by synchronization events

Granularity

Fine-grain Parallelism: Low computation to communication ratio. Small amounts of computational work between communication stages. Less opportunity for performance enhancement. High communication overhead.

Coarse-grain Parallelism: High computation to communication ratio. Large amounts of computational work between communication events. More opportunity for performance increase. Harder to load balance efficiently.

The Load Balancing Problem

Processors that finish early have to wait for the processor with the largest amount of work to complete. This leads to idle time and lowers utilization.

Particularly urgent with barrier synchronization.

[Figure: execution timelines of Core 1 to Core 4 with BALANCED and UNBALANCED workloads]

Slowest core dictates overall execution time.

Static Load Balancing

The programmer makes decisions and assigns a fixed amount of work to each processing core a priori.

Works well for homogeneous multicores: all cores are the same, and each core has an equal amount of work.

Not so well for heterogeneous multicores: some cores may be faster than others, so the work distribution is uneven.
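A common way to fix the assignment a priori is block partitioning of loop iterations. The sketch below is one possible scheme and is not taken from the slides; static_range and static_makespan are hypothetical names. The per-iteration cycle counts model how fast each core is.

```c
#include <stdio.h>

/* Static block partitioning: compute the half-open range [*begin, *end)
 * of iterations assigned a priori to core `id` out of `cores`.
 * The first (n % cores) cores receive one extra iteration. */
void static_range(long n, int cores, int id, long *begin, long *end)
{
    long base  = n / cores;
    long extra = n % cores;

    *begin = id * base + (id < extra ? id : extra);
    *end   = *begin + base + (id < extra ? 1 : 0);
}

/* Time at which the slowest core finishes, given the cost in cycles of one
 * iteration on each core (equal costs model a homogeneous multicore). */
long static_makespan(long n, int cores, const long cycles_per_iter[])
{
    long worst = 0;
    int id;

    for (id = 0; id < cores; id++) {
        long begin, end, t;

        static_range(n, cores, id, &begin, &end);
        t = (end - begin) * cycles_per_iter[id];
        if (t > worst)
            worst = t;
    }
    return worst;
}
```

With 100 iterations on 4 cores, equal per-iteration costs of 1 cycle give a completion time of 25 cycles. If the last core needs 4 cycles per iteration, the same split takes 100 cycles, and the other three cores sit idle after cycle 25.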

Dynamic Load Balancing

The workload is partitioned into small tasks. Tasks available for processing are pushed into a work queue.

When one core finishes its allocated task, it takes on further work from the queue. The process continues until all tasks are assigned to some core for processing.

Ideal for codes where work is uneven, and for heterogeneous multicores.
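The work-queue policy can be simulated sequentially without real threads. In the sketch below, each task from the queue goes to whichever core becomes free first. This is an illustrative model under stated assumptions and not code from the slides; dynamic_makespan, MAX_CORES and the slowdown factors are hypothetical.

```c
#include <stdio.h>

#define MAX_CORES 64

/* Simulate a shared work queue: tasks are handed out in order, and each task
 * goes to the core that becomes free first (ties go to the lowest index).
 * cost[i] is the running time of task i on a core of speed 1;
 * slowdown[c] scales it for core c.
 * Returns the time at which the last core finishes, or -1 on bad input. */
long dynamic_makespan(const long cost[], int ntasks,
                      const long slowdown[], int cores)
{
    long free_at[MAX_CORES] = {0};   /* time at which each core is next idle */
    long finish = 0;
    int t, c;

    if (cores < 1 || cores > MAX_CORES)
        return -1;

    for (t = 0; t < ntasks; t++) {
        int next = 0;                /* core that is free first */

        for (c = 1; c < cores; c++)
            if (free_at[c] < free_at[next])
                next = c;
        free_at[next] += cost[t] * slowdown[next];
    }
    for (c = 0; c < cores; c++)
        if (free_at[c] > finish)
            finish = free_at[c];
    return finish;
}
```

For 100 unit-cost tasks on 4 cores where the last core is 4 times slower (slowdown factors 1, 1, 1, 4), the simulation finishes at time 32. An equal a priori split of 25 tasks per core would finish at time 100, because the slow core alone needs 25 x 4 time units.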


Memory Access Latency

Uniform Memory Access (UMA), shared memory: Centrally located shared memory. All processors are equidistant (same access times).

Non-Uniform Memory Access (NUMA), shared memory: Processors have the same address space. Data is directly accessible by all, but the cost depends on the distance. Placement of data affects performance.

Non-Uniform Memory Access (NUMA), distributed memory: Processors have private address spaces. Data access is local, but the cost of messages depends on the distance. Communication must be efficiently architected.

Locality of Memory Accesses (UMA Shared Memory)

Parallel computation is serialized due to memory contention and lack of bandwidth.

Distribute data to relieve contention and increase effective bandwidth.

Locality of Memory Accesses (NUMA Shared Memory)

int main()
{
    int i, j;   /* A, B, n, foo() and goo() are assumed to be defined elsewhere */

    /* Task 1 */
    for (i = 0; i < n; i++)
        A[i][rand() % n] = foo();   /* random column, kept in bounds assuming A has n columns */

    /* Task 2 */
    for (j = 0; j < n; j++)
        B[j] = goo();

    return 0;
}

[Figure: four CPUs, each with a local memory (SPM), connected through an INTERCONNECT to a SHARED MEMORY; arrays A and B are placed in one of these memories]

Once parallel tasks have been assigned to different processors, the physical placement of data can have a great impact on performance!

Memory reference cost, depending on where the data is placed:

Off-chip memory: Bus latency + Off-chip memory latency (100 cycles)

On-chip memory: Bus latency + On-chip memory latency (2-20 cycles)

Local memory: Local memory latency (1 cycle)
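The three memory reference costs can be compared with a little arithmetic. In the sketch below the memory latencies are the ones quoted on the slides. The bus latency is not given there, so BUS_LATENCY is an assumed placeholder, and reference_cost and total_cost are hypothetical names.

```c
#include <stdio.h>

/* Latencies in cycles. The memory latencies come from the slides; the bus
 * latency is NOT given there, so BUS_LATENCY is an assumed placeholder. */
enum {
    BUS_LATENCY      = 10,    /* assumption, for illustration only */
    OFF_CHIP_LATENCY = 100,
    ON_CHIP_LATENCY  = 20,    /* upper end of the 2-20 cycle range */
    LOCAL_LATENCY    = 1
};

enum placement { OFF_CHIP, ON_CHIP, LOCAL };

/* Cost in cycles of one memory reference for a given data placement. */
long reference_cost(enum placement where)
{
    switch (where) {
    case OFF_CHIP: return BUS_LATENCY + OFF_CHIP_LATENCY;
    case ON_CHIP:  return BUS_LATENCY + ON_CHIP_LATENCY;
    default:       return LOCAL_LATENCY;   /* local memory: no bus traversal */
    }
}

/* Total cost in cycles of n references to data in the same placement. */
long total_cost(enum placement where, long n)
{
    return n * reference_cost(where);
}
```

Under these assumed numbers, 1000 references cost 110000 cycles off-chip, 30000 cycles on-chip and 1000 cycles in local memory, which is why moving an array close to the core that uses it pays off.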

Locality in Communication (Message Passing)
