Post on 29-Nov-2014

More info on http://www.techdays.be

Blazing Fast Windows 8 Apps using Visual C++

Tarek Madkour
Group Program Manager – Visual C++, Microsoft Corp.

Agenda

Windows 8 Apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Windows 8 Apps

New user experience

Touch-friendly

Trust

Battery-power

Fast and fluid

Windows 8 C++ App Options

• XAML-based applications: XAML user interface, C++ code

• DirectX-based applications and games: DirectX user interface (D2D or D3D), C++ code

• Hybrid XAML and DirectX applications: XAML controls mixed with DirectX surfaces, C++ code

• HTML5 + JavaScript applications: HTML5 user interface, JS code calling into C++ code

Demo: Fresh Paint

Agenda Checkpoint

Windows 8 apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Recap of “free” performance

Compilation Unit Optimizations

• /O2 and friends

Whole Program Optimizations

• /GL and /LTCG

Profile Guided Optimization

• /LTCG:PGI and /LTCG:PGO

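[Figure: three build diagrams, one per technique, each showing .cpp files compiled to .obj files and linked into an .exe. The profile-guided diagram adds a "Run training scenarios" step between the instrumented .exe and the final .exe.]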

More “free” boosts

Automatic vectorization

• Always on in VS2012

• Uses “vector” instructions where possible in loops

• Can run this loop in only 250 iterations, down from 1,000, when each vector instruction handles 4 elements

[Figure: SCALAR (1 operation): add r3, r1, r2 adds registers r1 and r2 into r3. VECTOR (N operations): vadd v3, v1, v2 adds vectors v1 and v2 element by element into v3, where N is the vector length.]

for (i = 0; i < 1000; i++) { A[i] = B[i] + C[i]; }
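The "250 iterations" figure is just 1,000 divided by a vector width of 4. The sketch below hand-writes that arithmetic in plain C++. It is not what the compiler emits (the auto-vectorizer generates real SIMD instructions), and the names `add_scalar` and `add_four_wide`, as well as the lane width of 4, are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar version: one addition per loop iteration.
// Returns the number of loop iterations executed.
std::size_t add_scalar(std::vector<int>& a, const std::vector<int>& b,
                       const std::vector<int>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
        ++iterations;
    }
    return iterations;
}

// Hand-written stand-in for a 4-wide vector loop: each iteration handles
// four adjacent elements, the way one vector add would.
std::size_t add_four_wide(std::vector<int>& a, const std::vector<int>& b,
                          const std::vector<int>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t lanes = 4;  // assumed vector width
    std::size_t iterations = 0;
    std::size_t i = 0;
    for (; i + lanes <= a.size(); i += lanes) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
        ++iterations;
    }
    // Scalar remainder loop for sizes that are not a multiple of 4.
    for (; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
        ++iterations;
    }
    return iterations;
}
```

For 1,000-element inputs, `add_scalar` reports 1,000 iterations and `add_four_wide` reports 250, with identical results in `a`.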

More “free” boosts

Automatic parallelization

• Uses multiple CPU cores

• /Qpar compiler switch

• Can run this loop “vectorized” and on 4 CPU cores in parallel

#pragma loop(hint_parallel(4))
for (i = 0; i < 1000; i++) { A[i] = B[i] + C[i]; }

Agenda Checkpoint

Windows 8 apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Parallel Patterns Library (PPL)

Part of the C++ Runtime
• No new libraries to link in
• Task parallelism
• Parallel algorithms
• Concurrency-safe containers
• Asynchronous agents

Abstracts away the notion of threads
• Tasks are computations that may be run in parallel

Used to express your potential concurrency
• Let the runtime map it to the available concurrency
• Scale from 1 to 256 cores

parallel_for

parallel_for iterates over a range in parallel

#include <ppl.h>

using namespace concurrency;

parallel_for( 0, 1000, [] (int i) { work(i); });

parallel_for

• Order of iteration is indeterminate.

• Cores may come and go.

• Ranges may be stolen by newly idle cores.

parallel_for(0, 1000, [] (int i) { work(i);});

[Figure: the range split across four cores. Core 1: work(0…249); Core 2: work(250…499); Core 3: work(500…749); Core 4: work(750…999)]
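The even four-way split in the picture can be sketched as simple range arithmetic. `split_range` below is a hypothetical helper written for this illustration, not a PPL function. How the PPL runtime actually partitions a range and lets idle cores steal work is internal to the runtime and may differ from this static split.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Splits the half-open range [first, last) into 'parts' contiguous chunks
// whose sizes differ by at most one. Each chunk is a [begin, end) pair.
std::vector<std::pair<int, int>> split_range(int first, int last, int parts)
{
    assert(parts > 0 && first <= last);
    const int total = last - first;
    const int base  = total / parts;   // minimum chunk size
    const int extra = total % parts;   // the first 'extra' chunks get one more
    std::vector<std::pair<int, int>> chunks;
    int begin = first;
    for (int p = 0; p < parts; ++p) {
        const int size = base + (p < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}
```

Calling `split_range(0, 1000, 4)` yields [0, 250), [250, 500), [500, 750) and [750, 1000), which matches the work(0…249) through work(750…999) labels.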

parallel_for

parallel_for considerations:
• Designed for unbalanced loop bodies
• An idle core can steal a portion of another core’s range of work
• Supports cancellation
• Early exit in search scenarios

For fixed-sized loop bodies that don’t need cancellation, use parallel_for_fixed.

parallel_for_each

parallel_for_each iterates over an STL container in parallel

#include <ppl.h>

using namespace concurrency;

vector<int> v = …;

parallel_for_each(v.begin(), v.end(), [] (int i) { work(i); });

parallel_for_each

Works best with containers that support random-access iterators: std::vector, std::array, std::deque, concurrency::concurrent_vector, …

Works okay, but with higher overhead, on containers that support only forward (or bidirectional) iterators: std::list, std::map, …

parallel_invoke

• Executes function objects in parallel and waits for them to finish

#include <ppl.h>
#include <string>
#include <iostream>
using namespace concurrency;
using namespace std;

template <typename T>
T twice(const T& t) { return t + t; }

int main()
{
    int n = 54;
    double d = 5.6;
    string s = "Hello";
    parallel_invoke(
        [&n] { n = twice(n); },
        [&d] { d = twice(d); },
        [&s] { s = twice(s); }
    );
    cout << n << ' ' << d << ' ' << s << endl;
    return 0;
}

task<>

• Used to write asynchronous code
• task::then lets you create continuations that get executed when the task finishes
• You need to manage the lifetime of the variables going into a task

#include <ppltasks.h>
#include <iostream>
using namespace concurrency;
using namespace std;

int main()
{
    auto t = create_task([]() -> int { return 42; });

    t.then([](int result) { cout << result << endl; }).wait();
}
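The lifetime point is easy to get wrong, so here is a minimal sketch of one way to handle it. It uses the standard library's `std::async` and `std::future` as a portable stand-in for `create_task` (an assumption of this example, since PPL is not used here), and `start_sum` and `sum_one_to` are hypothetical names. The idea is to capture a `shared_ptr` by value so the data outlives the scope that launched the work.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <numeric>
#include <vector>

// Starts an asynchronous sum. The data is held by shared_ptr and captured
// by value, so it stays alive until the task finishes even if the caller's
// own pointer goes out of scope first.
std::future<int> start_sum(std::shared_ptr<const std::vector<int>> data)
{
    assert(data != nullptr);
    return std::async(std::launch::async, [data]() {
        return std::accumulate(data->begin(), data->end(), 0);
    });
}

// Sums 1..n asynchronously. The caller's pointer dies at the end of the
// inner scope, before the result is collected.
int sum_one_to(int n)
{
    assert(n >= 0);
    std::future<int> result;
    {
        auto values = std::make_shared<std::vector<int>>(n);
        std::iota(values->begin(), values->end(), 1);  // fill with 1, 2, ..., n
        result = start_sum(values);
    }  // 'values' is destroyed here; the lambda's copy keeps the vector alive
    return result.get();
}
```

Capturing a local vector by reference instead would leave the task reading a destroyed object once the inner scope ends.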

Concurrent Containers

• Thread-safe, lock-free containers provided:
    • concurrent_vector<>
    • concurrent_queue<>
    • concurrent_unordered_map<>
    • concurrent_unordered_multimap<>
    • concurrent_unordered_set<>
    • concurrent_unordered_multiset<>

• Functionality resembles equivalent containers provided by the STL

• Behavior is more limited to allow concurrency. For example:
    • concurrent_vector can push_back but not insert
    • concurrent_vector can clear but not pop_back or erase
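To make the restricted interface concrete, here is a minimal lock-based sketch of an append-only container. It is not how `concurrent_vector` is implemented; `append_only_vector` and `fill_from_threads` are hypothetical names, and the mutex is a simplification chosen for this example. It only shows the shape of an interface that offers `push_back` while leaving out `insert`, `erase` and `pop_back`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Append-only wrapper: many threads may call push_back at the same time.
// Operations that remove or shift elements are deliberately not offered.
template <typename T>
class append_only_vector {
public:
    void push_back(const T& value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<T> items_;
};

// Launches 'thread_count' threads that each append 'per_thread' values,
// then returns the final element count.
std::size_t fill_from_threads(unsigned thread_count, int per_thread)
{
    assert(per_thread >= 0);
    append_only_vector<int> shared;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&shared, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                shared.push_back(i);
        });
    }
    for (auto& worker : workers)
        worker.join();
    return shared.size();
}
```

With 4 threads appending 1,000 values each, `fill_from_threads(4, 1000)` returns 4,000 because no append is lost.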

concurrent_vector<T>

#include <ppl.h>
#include <concurrent_vector.h>

using namespace concurrency;

concurrent_vector<int> carmVec;

parallel_for(2, 5000000, [&carmVec](int i) {
    if (is_carmichael(i))
        carmVec.push_back(i);
});
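The slide does not show `is_carmichael`. One possible implementation, offered here as an assumption rather than the speaker's code, uses Korselt's criterion: n is a Carmichael number if it is composite, square-free, and p − 1 divides n − 1 for every prime p dividing n. The serial helper `carmichael_below` is also hypothetical. It is included only as a reference for checking a parallel run.

```cpp
#include <cassert>
#include <vector>

// Hypothetical is_carmichael based on Korselt's criterion (trial division).
bool is_carmichael(int n)
{
    if (n < 3 || n % 2 == 0) return false;  // Carmichael numbers are odd
    int remaining = n;
    int prime_factors = 0;
    for (int p = 3; p <= remaining / p; p += 2) {
        if (remaining % p != 0) continue;
        remaining /= p;
        if (remaining % p == 0) return false;       // p*p divides n: not square-free
        if ((n - 1) % (p - 1) != 0) return false;   // Korselt condition fails
        ++prime_factors;
    }
    if (remaining > 1) {                            // one prime factor is left over
        if ((n - 1) % (remaining - 1) != 0) return false;
        ++prime_factors;
    }
    return prime_factors >= 2;                      // rules out primes
}

// Serial reference: Carmichael numbers in [2, limit), in increasing order.
std::vector<int> carmichael_below(int limit)
{
    assert(limit >= 2);
    std::vector<int> found;
    for (int i = 2; i < limit; ++i)
        if (is_carmichael(i))
            found.push_back(i);
    return found;
}
```

Note that the parallel loop appends in no particular order, so its result has to be sorted before it is compared with the serial one.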

Agenda Checkpoint

Windows 8 apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

CPU / GPU Comparison

What is C++ AMP?

Performance & Productivity

C++ AMP -> C++ Accelerated Massive Parallelism

C++ AMP is
• A programming model for expressing data parallel algorithms
• A way to exploit heterogeneous systems using mainstream tools
• C++ language extensions and a library

C++ AMP delivers performance without compromising productivity

What is C++ AMP?

C++ AMP gives you…

Productivity
• Simple programming model

Portability
• Run on hardware from NVIDIA, AMD, Intel and ARM*
• Open Specification

Performance
• Power of heterogeneous computing at your hands

Use it to speed up data parallel algorithms

#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

amp.h: header for C++ AMP library

concurrency: namespace for library

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

array_view: wraps the data to operate on the accelerator. array_view variables are captured, and the associated data is copied to the accelerator (on demand)

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

parallel_for_each: executes the lambda on the accelerator once per thread

extent: the parallel loop bounds or computation “shape”

index: the thread ID that is running the lambda, used to index into data
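The "one lambda call per index" model can be tried without an accelerator. The sketch below is a serial, CPU-only stand-in and not C++ AMP: `for_each_index` and `decode_message` are hypothetical names, the extent is a plain `int`, and the calls run one after another. What it has in common with the AMP version is that each call touches only its own element, which is the property that lets an accelerator run the calls as independent threads.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Serial stand-in for the data-parallel pattern: calls 'body' once for
// each index in [0, extent). A real accelerator would run these calls
// as separate threads in no guaranteed order.
template <typename Body>
void for_each_index(int extent, Body body)
{
    assert(extent >= 0);
    for (int idx = 0; idx < extent; ++idx)
        body(idx);
}

// Applies the slide's "+= 1" kernel to the slide's data and returns the
// decoded text.
std::string decode_message()
{
    std::vector<int> v = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    for_each_index(static_cast<int>(v.size()), [&v](int idx) {
        v[idx] += 1;  // each call updates only its own element
    });
    std::string text;
    for (int value : v)
        text += static_cast<char>(value);
    return text;
}
```

`decode_message()` returns "Hello world", the same text the slide's program prints.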

restrict(amp): tells the compiler to check that the code conforms to the C++ subset, and tells the compiler to target the GPU

array_view: automatically copied to the accelerator if required

array_view: automatically copied back to the host when and if required

C++ AMP Parallel Debugger

Well known Visual Studio debugging features
• Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
• Tool windows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch

New features (for both CPU and GPU)
• Parallel Stacks window, Parallel Watch window

New GPU-specific
• Emulator, GPU Threads window, race detection

concurrency::direct3d_printf, direct3d_errorf, direct3d_abort

Demo: Cartoonizer (Linear vs. Parallel vs. AMP)

Summary

• C++ is a great way to create fast and fluid apps for Windows 8
• Get the most out of the compiler’s free optimizations
• Use PPL for concurrent programming
• Use C++ AMP for data parallel algorithms

Thank you!

tarekm@microsoft.com
