Posted on 31-Mar-2015
© 2009 Rock Solid Knowledge
Rock Solid Knowledge
http://www.rocksolidknowledge.com
Architecting for Multiple Cores
Andrew Clymer, andy@rocksolidknowledge.com
Objectives
Why should I care about parallelism?
Types of parallelism
Introduction to Parallel Patterns
“The Free Lunch Is Over”, Herb Sutter, Dr. Dobb’s Journal, 30th March 2005
Multi-core architectures
Symmetric multiprocessors (SMP): shared memory. Easy to program with, but has scalability issues with memory bus contention.
Distributed memory: hard to program against (message-passing model). Scales well when latency is very low or little message exchange is needed.
Grid: heterogeneous, message passing, no central control.
Hybrid: many SMPs networked together.
The continual need for speed
For a given time frame X:
Data volume increases (more detail in games, IntelliSense)
Better answers (speech recognition, route planning, Word gets better at grammar checking)
Expectations
Amdahl’s Law: if only n% of the compute can be parallelised, at best you can get an n% reduction in the time taken.
Example: 10 hours of compute, of which 1 hour cannot be parallelised. The minimum time to run is 1 hour plus the time taken for the parallel part, so you can only ever achieve a maximum of a 10x speed-up.
Not even the most perfect parallel algorithms scale linearly:
Scheduling overhead
Overhead of combining results
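Amdahl’s limit can be written as a formula and checked numerically. A minimal sketch — the formula is the standard one; the `amdahl_speedup` helper name is mine, not from the slides:

```python
def amdahl_speedup(serial_fraction, cores):
    # Amdahl's Law: only the parallel portion (1 - serial_fraction)
    # benefits from extra cores; the serial portion is a hard floor.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# The slide's example: 10 hours of work, 1 hour of which is serial (10%).
print(amdahl_speedup(0.1, 8))          # well short of 8x on 8 cores
print(amdahl_speedup(0.1, 1_000_000))  # approaches, but never exceeds, 10x
```

Even with unlimited cores the speed-up is capped at 1 / serial_fraction — the 10x ceiling in the slide’s example.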
Types of parallelism
Coarse grain parallelism: large tasks, little scheduling or communication overhead.
Each service request executed in parallel; parallelism realised by many independent requests.
DoX, DoY, DoZ and combine results, where X, Y and Z are large independent operations.
Applications: web page rendering, servicing multiple web requests, calculating the payroll.
Coarse grain
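The coarse-grain shape — many large, independent requests — can be sketched with a thread pool. The slides target .NET; this illustration uses Python's concurrent.futures as a stand-in, and `handle_request` is a placeholder, not a real API:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    # Stand-in for one large, independent unit of work (e.g. one web request).
    return f"response-{request_id}"

# Parallelism falls out of many independent requests: the tasks never
# coordinate, so scheduling and communication overhead stays small.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(handle_request, range(4)))

print(responses)
```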
Types of parallelism
Fine grain parallelism: a single activity broken down into a series of small co-operating parallel tasks.
f(n = 0...X) can possibly be performed by many tasks in parallel for given values of n.
Input => Step 1 => Step 2 => Step 3 => Output
Applications: sensor processing, route planning, graphical layout, fractal generation (trade show demo)
Fine grain
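The f(n = 0...X) shape above — one activity split into many small co-operating tasks whose results are combined — can be sketched as follows. Python's concurrent.futures stands in for the .NET frameworks the deck covers, and `f`/`fine_grain_sum` are illustrative names of mine:

```python
from concurrent.futures import ThreadPoolExecutor

def f(n):
    # Stand-in for one small unit of work in a fine-grain algorithm.
    return n * n

def fine_grain_sum(x, workers=4):
    # Many small tasks co-operate on a single activity;
    # the results are combined at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(f, range(x + 1)))

print(fine_grain_sum(100))  # → 338350, same answer as the sequential loop
```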
Moving to a parallel mindset
Identify areas of code that could benefit from concurrency
Keep in mind Amdahl’s law
Adapt algorithms and data structures to assist concurrency
Lock free data structures
Utilise a parallel framework: Microsoft Parallel Extensions (.NET 4), OpenMP, Intel Threading Building Blocks
Thinking Parallel
Balancing safety and concurrency
Reduce shared data, maximising “free running” code, to produce scalability
Optimise shared data access: select the cheapest form of synchronisation
Consider lock-free data structures
If a larger proportion of time is spent reading than writing, consider optimising for reads
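Reducing shared data is often the cheapest synchronisation of all. A minimal sketch, using Python threads as a stand-in: each worker writes only to its own private slot, so no lock is needed until the single combine step (`count_evens` and `parallel_count` are my names, not from the slides):

```python
import threading

def count_evens(chunk):
    # Private work: each task counts within its own slice of the data.
    return sum(1 for n in chunk if n % 2 == 0)

def parallel_count(data, workers=4):
    results = [0] * workers  # one slot per worker: no shared counter, no lock

    def worker(i):
        # "Free running": reads its own chunk, writes only its own slot.
        results[i] = count_evens(data[i::workers])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)  # the only combine step, done once at the end

print(parallel_count(list(range(1000))))  # → 500
```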
Parallel patterns
Fork/Join
Pipeline
Geometric Decomposition
Parallel loops
Monte Carlo
Fork/Join
Book Hotel, Book Car, Book Flight
Sum X + Sum Y + Sum Z
Task<int> sumX = Task.Factory.StartNew(SumX);
Task<int> sumY = Task.Factory.StartNew(SumY);
Task<int> sumZ = Task.Factory.StartNew(SumZ);
// .Result blocks, so total is not computed until all three tasks finish
int total = sumX.Result + sumY.Result + sumZ.Result;

Parallel.Invoke( BookHotel, BookCar, BookFlight );
// Won’t get here until all three tasks are complete
Pipeline
When an algorithm has:
Strict ordering, first in first out
Large sequential steps that are hard to parallelise
Each stage of the pipeline represents a sequential step, and runs in parallel
Queues connect tasks to form the pipeline
Example: signal processing
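The stages-connected-by-queues structure can be sketched as follows — a minimal illustration using Python's threading and queue modules as a stand-in for the .NET constructs the deck covers. Each stage runs in parallel with the others, yet items flow through in strict FIFO order; the stage functions are placeholders of mine:

```python
import threading
import queue

SENTINEL = object()  # marks the end of the stream

def stage(fn, inbox, outbox):
    # One sequential pipeline stage: consume, transform, pass on.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))

# Three illustrative stages; queues connect them into a pipeline.
q1, q2, q3, out = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)),
    threading.Thread(target=stage, args=(lambda x: x * 2, q2, q3)),
    threading.Thread(target=stage, args=(lambda x: x - 3, q3, out)),
]
for t in threads:
    t.start()

for n in range(5):
    q1.put(n)
q1.put(SENTINEL)

results = []
while (item := out.get()) is not SENTINEL:
    results.append(item)
print(results)  # FIFO order is preserved: each item is ((n + 1) * 2) - 3
```

Because each stage is a single thread draining a FIFO queue, ordering is preserved without any explicit locking.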
Pipeline
Advantages: easy to implement.
Disadvantages: in order to scale you need as many stages as cores.
Hybrid approach: parallelise a given stage further to utilise more cores.
Geometric Decomposition
Task parallelism is one approach; data parallelism is another.
Decompose the data into smaller subsets, and have identical tasks work on the smaller data sets.
Combine the results.
Geometric Decomposition
Each task runs on its own sub-region; often tasks need to share edge data.
Updating/reading edge data needs to be synchronised with neighbouring tasks.
Edge complications
[Diagram: the data set split into six sub-regions, one per task, Task 1 ... Task 6]
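Geometric decomposition can be sketched with a one-dimensional smoothing pass: identical tasks each own a sub-region, but every task must also read one "edge" element from each neighbouring region. In this sketch the input is read-only, so no synchronisation is needed; in an iterative stencil the edge reads/updates would need the synchronisation the slide describes. Function names (`smooth_region`, `smooth`) are mine:

```python
from concurrent.futures import ThreadPoolExecutor

def smooth_region(data, start, end):
    # Each task owns data[start:end] but also reads the edge elements
    # of its neighbours (data[start - 1] and data[end]).
    out = []
    for i in range(start, end):
        left = data[i - 1] if i > 0 else data[i]
        right = data[i + 1] if i < len(data) - 1 else data[i]
        out.append((left + data[i] + right) / 3)
    return out

def smooth(data, regions=3):
    # Geometric decomposition: identical tasks, each on its own sub-region.
    bounds = [(len(data) * r // regions, len(data) * (r + 1) // regions)
              for r in range(regions)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda b: smooth_region(data, *b), bounds)
    result = []
    for part in parts:  # combine the sub-results, in order
        result += part
    return result

print(smooth([1, 2, 3, 4, 5, 6]))
```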
Parallel loops
Loops can be successfully parallelised IF:
The order of execution is not important.
Each iteration is independent of the others.
Each iteration has sufficient work to compensate for the scheduling overhead.
If there is insufficient work per loop iteration, consider creating an inner loop.
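The "inner loop" advice amounts to chunking: batch iterations so each task carries enough work to outweigh its scheduling cost. A minimal sketch using Python's concurrent.futures as a stand-in (`parallel_for` and `square` are my names, not a real framework API):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

def parallel_for(n_iterations, body, chunk=1000, workers=4):
    # Batch iterations into chunks: the per-chunk inner loop is the
    # extra work that compensates for the per-task scheduling overhead.
    def run_chunk(start):
        return [body(i) for i in range(start, min(start + chunk, n_iterations))]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(run_chunk, range(0, n_iterations, chunk))
    return [x for part in parts for x in part]

results = parallel_for(10_000, square)
print(results[:5])  # → [0, 1, 4, 9, 16]; order preserved, iterations independent
```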
Cost of Loop Parallelisation
Task Scheduling Cost = 50
Iteration Cost = 1
Total Iterations = 100,000
Sequential time would be 100,000
Assuming a fixed cost per iteration:
Time = Number of Tasks * ( Task Scheduling Cost + ( Iteration Cost * Iterations Per Task ) ) / Number of Cores

Number of Cores | Iterations Per Task | Number of Tasks | Time
2               | 100                 | 1000            | 75000
2               | 50000               | 2               | 50050
4               | 100                 | 1000            | 37500
4               | 25000               | 4               | 25050
256             | 100                 | 1000            | 586
256             | 391                 | 256             | 440
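The table rows follow directly from the cost formula. A small helper to reproduce them — the `parallel_loop_time` name is mine; the constants are the slide's:

```python
def parallel_loop_time(cores, tasks, sched_cost=50, iter_cost=1, total_iters=100_000):
    # Time = NumTasks * (SchedulingCost + IterCost * ItersPerTask) / NumCores,
    # with the iterations spread evenly across the tasks.
    iters_per_task = total_iters / tasks
    return tasks * (sched_cost + iter_cost * iters_per_task) / cores

# Reproducing rows of the slide's table:
print(int(parallel_loop_time(cores=2, tasks=1000)))  # → 75000 (many small tasks)
print(int(parallel_loop_time(cores=2, tasks=2)))     # → 50050 (one big task per core)
```

Note the trade-off the table illustrates: at low core counts fewer, bigger tasks win (less scheduling overhead), while at 256 cores the work must be split into enough tasks to keep every core busy.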
Travelling Salesman problem
Calculate the shortest path between a series of towns, starting and finishing at the same location.
Possible combinations: n factorial.
A 15-city route produces 1,307,674,368,000 combinations.
Took 8 cores approx. 18 days to calculate the exact shortest route.
Estimate: 256 cores would take approx. 0.5 days.
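The factorial blow-up, and the exact (brute-force) search it makes impractical, can be sketched on a tiny instance. A minimal illustration with assumed helper names (`tour_length`, `brute_force_tsp`); the slide's 15-city figure is checked in the first line:

```python
from itertools import permutations
from math import dist, factorial

# The state space grows factorially; the slide's 15-city figure:
assert factorial(15) == 1_307_674_368_000

def tour_length(order, coords):
    # Total distance of a round trip visiting the towns in the given order.
    return sum(dist(coords[order[i]], coords[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def brute_force_tsp(coords):
    # Fix the start town; try every ordering of the rest.
    rest = range(1, len(coords))
    return min(((0,) + p for p in permutations(rest)),
               key=lambda order: tour_length(order, coords))

towns = [(0, 0), (0, 1), (1, 1), (1, 0)]  # a unit square: optimal tour length 4
best = brute_force_tsp(towns)
print(tour_length(best, towns))
```

Only the town count changes between this toy and the 15-city run: 4 towns mean 3! orderings, 15 towns mean the trillion-plus combinations above.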
Monte Carlo
Some algorithms, even after heavy parallelisation, are too slow.
Often some answer in time T is better than a perfect answer in time T+.
Real-time speech processing, route planning
Machines can make educated guesses too...
Thinking out of the box
Monte Carlo: Single Core
[Chart: % error for a 15-city route, allowing 2 seconds for calculation, using a single core. X-axis: runs 1–100; y-axis: % error, 0–50.]
Monte Carlo: 8 Cores
[Chart: % error for a 15-city route, allowing 2 seconds for calculation, using eight cores. X-axis: runs 1–100; y-axis: % error, 0–4.]
Monte Carlo: Average % Error over 1–8 Cores
[Chart: average % error for a 15-city route, allowing 2 seconds for calculation. X-axis: number of cores, 1–8; y-axis: % error, 0–9.]
Monte Carlo
Each core runs independently, making guesses.
Scales: the more cores, the greater the chance of a perfect guess. As we approach 256 cores it looks very attractive.
Strategies: pure random; hybrid (random + logic).
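The "pure random" strategy can be sketched as a guess-and-keep-best loop; each core would run this loop independently, and more cores simply means more guesses (hence a better best-so-far) in the same time budget. A minimal single-core sketch with assumed names (`tour_length`, `monte_carlo_tsp`); the fixed seed is only for reproducibility:

```python
import random
from math import dist

def tour_length(order, coords):
    # Total distance of a round trip visiting the towns in the given order.
    return sum(dist(coords[order[i]], coords[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def monte_carlo_tsp(coords, guesses=10_000, seed=42):
    # Pure random strategy: guess whole tours, keep the best seen so far.
    rng = random.Random(seed)
    order = list(range(len(coords)))
    best, best_len = None, float("inf")
    for _ in range(guesses):
        rng.shuffle(order)
        length = tour_length(order, coords)
        if length < best_len:
            best, best_len = list(order), length
    return best, best_len

towns = [(0, 0), (0, 3), (4, 3), (4, 0), (2, 1)]
best, best_len = monte_carlo_tsp(towns)
print(best_len)  # best random guess within the budget; not guaranteed optimal
```

A hybrid strategy would replace the blind shuffle with guesses seeded by some logic (e.g. nearest-neighbour tours randomly perturbed).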
Summary
Free Lunch is over
Think concurrent
Test on a variety of hardware platforms
Utilise a framework and patterns to produce scalable solutions, now and for the future.
Andrew Clymer, andy@rocksolidknowledge.com
http://www.rocksolidknowledge.com (Blogs / Screencasts)