Blazing Fast Windows 8 Apps using Visual C++

DESCRIPTION
More info on http://www.techdays.be

TRANSCRIPT
Blazing Fast Windows 8 Apps using Visual C++
Tarek Madkour, Group Program Manager – Visual C++, Microsoft Corp.
Agenda
Windows 8 Apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Windows 8 Apps
New user experience
Touch-friendly
Trust
Battery-aware
Fast and fluid
Windows 8 C++ App Options
• XAML-based applications: XAML user interface, C++ code
• DirectX-based applications and games: DirectX user interface (D2D or D3D), C++ code
• Hybrid XAML and DirectX applications: XAML controls mixed with DirectX surfaces, C++ code
• HTML5 + JavaScript applications: HTML5 user interface, JS code calling into C++ code
Demo: Fresh Paint
Agenda Checkpoint
Windows 8 apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Recap of “free” performance
Compilation Unit Optimizations
• /O2 and friends
Whole Program Optimizations
• /GL and /LTCG
Profile Guided Optimization
• /LTCG:PGI and /LTCG:PGO
[Diagram: the build pipeline. .cpp files compile to .obj files, which link into an .exe. With /GL and /LTCG, optimization happens at link time across the whole program. For PGO, the instrumented .exe is run through training scenarios, and the collected profile drives the final optimized build.]
More “free” boosts
Automatic vectorization
• Always on in VS2012
• Uses “vector” instructions where possible in loops
• Can run this loop in only 250 iterations down from 1,000!

[Diagram: a scalar add (add r3, r1, r2) performs 1 operation; a vector add (vadd v3, v1, v2) performs N operations in a single instruction, where N is the vector length.]

for (i = 0; i < 1000; i++) { A[i] = B[i] + C[i]; }
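The loop above vectorizes because its iterations are independent. A minimal portable sketch (function names are illustrative, not from the talk) contrasts a vectorizable loop with one that a loop-carried dependence keeps scalar:

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: the compiler can use SIMD "vector" instructions,
// e.g. processing 4 ints per instruction, so 1,000 iterations become ~250.
std::vector<int> add_arrays(const std::vector<int>& b, const std::vector<int>& c) {
    std::vector<int> a(b.size());
    for (std::size_t i = 0; i < b.size(); i++) {
        a[i] = b[i] + c[i];          // no dependence between iterations
    }
    return a;
}

// Loop-carried dependence: a[i] needs the value from iteration i-1,
// so this form does not auto-vectorize.
std::vector<int> prefix_sum(const std::vector<int>& b) {
    std::vector<int> a(b.size());
    int running = 0;
    for (std::size_t i = 0; i < b.size(); i++) {
        running += b[i];
        a[i] = running;              // depends on the previous iteration
    }
    return a;
}
```

The first loop maps directly onto SIMD adds; the second cannot, because each element consumes the previous one.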
More “free” boosts
Automatic parallelization
• Uses multiple CPU cores
• Enabled with the /Qpar compiler switch
• Can run this loop “vectorized” and on 4 CPU cores in parallel

#pragma loop(hint_parallel(4))
for (i = 0; i < 1000; i++) { A[i] = B[i] + C[i]; }
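What /Qpar arranges can be sketched with standard threads: split the index range into chunks and give each chunk to a core. This is a simplified model, not the compiler's actual runtime, and `parallel_add` is an illustrative name:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Roughly what automatic parallelization does for the loop above: the
// 0..999 range is split into contiguous chunks and each chunk runs on
// its own thread (one per core).
void parallel_add(const std::vector<int>& b, const std::vector<int>& c,
                  std::vector<int>& a, unsigned num_threads = 4) {
    const std::size_t n = a.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(n, lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; i++)
                a[i] = b[i] + c[i];   // each thread's chunk can also vectorize
        });
    }
    for (auto& w : workers) w.join();
}
```

Each thread writes a disjoint slice of `a`, so no synchronization is needed inside the loop.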
Agenda Checkpoint
Windows 8 apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Parallel Patterns Library (PPL)
Part of the C++ Runtime, so there are no new libraries to link in:
• Task parallelism
• Parallel algorithms
• Concurrency-safe containers
• Asynchronous agents

PPL abstracts away the notion of threads: tasks are computations that may be run in parallel. Use tasks to express your potential concurrency, and let the runtime map it to the available concurrency, scaling from 1 to 256 cores.
parallel_for
parallel_for iterates over a range in parallel
#include <ppl.h>
using namespace concurrency;
parallel_for( 0, 1000, [] (int i) { work(i); });
parallel_for
• Order of iteration is indeterminate.
• Cores may come and go.
• Ranges may be stolen by newly idle cores.

parallel_for(0, 1000, [] (int i) { work(i); });

[Diagram: the 0…999 range split across cores, e.g. Core 1: work(0…249), Core 2: work(250…499), Core 3: work(500…749), Core 4: work(750…999).]
parallel_for
parallel_for considerations:
• Designed for unbalanced loop bodies
• An idle core can steal a portion of another core’s range of work
• Supports cancellation
• Early exit in search scenarios
For fixed-sized loop bodies that don’t need cancellation, use parallel_for_fixed.
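The early-exit idea can be sketched in portable C++: workers poll a shared flag and stop as soon as any of them succeeds. This is a simplified model of PPL cancellation, and `parallel_find` is an illustrative name:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Parallel search with early exit: each thread scans its own chunk, and all
// threads stop once any of them records a match in the shared atomic flag.
int parallel_find(const std::vector<int>& data, int target, unsigned num_threads = 4) {
    std::atomic<int> found_index(-1);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(data.size(), lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; i++) {
                if (found_index.load() != -1) return;  // another thread found it
                if (data[i] == target) {
                    found_index.store(static_cast<int>(i));
                    return;
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return found_index.load();  // -1 if no match
}
```

If several elements match, any one of their indices may be returned, mirroring parallel_for's indeterminate iteration order.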
parallel_for_each
parallel_for_each iterates over an STL container in parallel
#include <ppl.h>
using namespace concurrency;
vector<int> v = …;
parallel_for_each(v.begin(), v.end(), [] (int i) { work(i); });
parallel_for_each
Works best with containers that support random-access iterators: std::vector, std::array, std::deque, concurrency::concurrent_vector, …
Works okay, but with higher overhead, on containers that support only forward (or bidirectional) iterators: std::list, std::map, …
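The overhead difference comes from how cheaply a range can be split. In the sketch below (`midpoint_iter` is an illustrative helper, not a PPL API), std::distance and std::advance are O(1) for random-access iterators but walk element by element for forward iterators, and parallel_for_each must do this kind of splitting to hand ranges out to cores:

```cpp
#include <iterator>
#include <list>
#include <vector>

// Finding the middle of a range: O(1) pointer arithmetic for std::vector,
// but a node-by-node walk for std::list. That walk is the extra overhead
// a parallel algorithm pays when partitioning a forward-iterator container.
template <typename It>
It midpoint_iter(It first, It last) {
    auto half = std::distance(first, last) / 2;  // O(1) random-access, O(n) forward
    std::advance(first, half);                   // same story here
    return first;
}
```

Both calls below return the same element, but the std::list version costs a linear traversal.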
parallel_invoke
• Executes function objects in parallel and waits for them to finish

#include <ppl.h>
#include <string>
#include <iostream>
using namespace concurrency;
using namespace std;

template <typename T>
T twice(const T& t) { return t + t; }

int main() {
    int n = 54;
    double d = 5.6;
    string s = "Hello";
    parallel_invoke(
        [&n] { n = twice(n); },
        [&d] { d = twice(d); },
        [&s] { s = twice(s); }
    );
    cout << n << ' ' << d << ' ' << s << endl;
    return 0;
}
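For readers without PPL at hand, the same pattern can be sketched with std::async from the standard library: launch each function object, then wait for all of them. This is an approximation only; parallel_invoke also integrates with the ConcRT scheduler. `invoke_all` is an illustrative name:

```cpp
#include <future>
#include <string>

template <typename T>
T twice(const T& t) { return t + t; }

// Launch three independent function objects and wait for all of them,
// the same "fork all, then join all" shape as parallel_invoke.
void invoke_all(int& n, double& d, std::string& s) {
    auto f1 = std::async(std::launch::async, [&n] { n = twice(n); });
    auto f2 = std::async(std::launch::async, [&d] { d = twice(d); });
    auto f3 = std::async(std::launch::async, [&s] { s = twice(s); });
    f1.wait();  // like parallel_invoke, return only when all are done
    f2.wait();
    f3.wait();
}
```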
task<>
• Used to write asynchronous code
• task::then lets you create continuations that get executed when the task finishes
• You need to manage the lifetime of the variables going into a task

#include <ppltasks.h>
#include <iostream>
using namespace concurrency;
using namespace std;

int main() {
    auto t = create_task([]() -> int {
        return 42;
    });
    t.then([](int result) {
        cout << result << endl;
    }).wait();
}
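Standard C++ futures have no .then, so what task::then provides can be approximated by a second task that blocks on the first's result. This is only a sketch: the `then` helper below is hypothetical, not part of any library, and unlike PPL it occupies a thread while waiting instead of scheduling the continuation when the task finishes:

```cpp
#include <future>
#include <utility>

// Hypothetical continuation helper: run `continuation` on the result of
// `fut` in a new async task, and return a future for the continuation's
// own result (a crude stand-in for PPL's task::then).
template <typename T, typename F>
auto then(std::future<T> fut, F continuation)
    -> std::future<decltype(continuation(std::declval<T>()))> {
    return std::async(std::launch::async,
        [](std::future<T> f, F cont) { return cont(f.get()); },
        std::move(fut), std::move(continuation));
}
```

Usage mirrors the slide: start a task that yields 42, then chain a continuation that consumes the result.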
Concurrent Containers
• Thread-safe, lock-free containers provided:
  concurrent_vector<>
  concurrent_queue<>
  concurrent_unordered_map<>
  concurrent_unordered_multimap<>
  concurrent_unordered_set<>
  concurrent_unordered_multiset<>
• Functionality resembles the equivalent containers provided by the STL
• Behavior is more limited to allow concurrency. For example:
  • concurrent_vector can push_back but not insert
  • concurrent_vector can clear but not pop_back or erase
concurrent_vector<T>
#include <ppl.h>
#include <concurrent_vector.h>
using namespace concurrency;

concurrent_vector<int> carmVec;
parallel_for(2, 5000000, [&carmVec](int i) {
    if (is_carmichael(i))
        carmVec.push_back(i);
});
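Outside PPL, the same many-writers pattern needs explicit synchronization. A minimal portable sketch follows; the `locked_vector` class is illustrative, and unlike concurrent_vector it serializes writers through a mutex rather than using a lock-free design:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// A plain std::vector guarded by a mutex, so that many threads can
// push_back concurrently, the capability concurrent_vector provides
// without the lock.
class locked_vector {
public:
    void push_back(int value) {
        std::lock_guard<std::mutex> guard(m_);
        data_.push_back(value);
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(m_);
        return data_.size();
    }
private:
    mutable std::mutex m_;
    std::vector<int> data_;
};
```

Under contention the mutex becomes a bottleneck, which is exactly the cost concurrent_vector is designed to avoid.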
Agenda Checkpoint
Windows 8 apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
CPU / GPU Comparison
What is C++ AMP?
Performance & Productivity
C++ AMP = C++ Accelerated Massive Parallelism
C++ AMP is:
• A programming model for expressing data parallel algorithms
• A way to exploit heterogeneous systems using mainstream tools
• C++ language extensions and a library

C++ AMP delivers performance without compromising productivity
What is C++ AMP?
C++ AMP gives you…
Productivity
• Simple programming model
Portability
• Runs on hardware from NVIDIA, AMD, Intel and ARM*
• Open specification
Performance
• The power of heterogeneous computing in your hands

Use it to speed up data parallel algorithms
#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

amp.h: header for the C++ AMP library
concurrency: namespace for the library
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

array_view: wraps the data to operate on the accelerator. array_view variables are captured, and the associated data is copied to the accelerator (on demand).
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

array_view: wraps the data to operate on the accelerator. array_view variables are captured, and the associated data is copied to the accelerator (on demand).
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

parallel_for_each: executes the lambda on the accelerator once per thread
extent: the parallel loop bounds, or computation “shape”
index: the thread ID that is running the lambda, used to index into the data
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

restrict(amp): tells the compiler to check that the code conforms to the C++ subset, and tells the compiler to target the GPU
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

array_view: automatically copied to the accelerator if required
array_view: automatically copied back to the host when and if required
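The kernel's effect is easy to check on the host with plain C++: adding 1 to each element decodes the array. The `decode` helper below is illustrative, not from the talk:

```cpp
#include <string>

// The same computation the AMP kernel performs, done on the host:
// add 1 to each encoded element to recover readable text.
std::string decode() {
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    std::string out;
    for (int i = 0; i < 11; i++) {
        out += static_cast<char>(v[i] + 1);  // 'G'->'H', 'd'->'e', 31->' ', ...
    }
    return out;  // "Hello world"
}
```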
C++ AMP Parallel Debugger
Well-known Visual Studio debugging features:
• Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
• Tool windows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
New features (for both CPU and GPU):
• Parallel Stacks window, Parallel Watch window
New GPU-specific features:
• Emulator, GPU Threads window, race detection
• concurrency::direct3d_printf, _errorf, _abort
Demo: Cartoonizer (Linear vs. Parallel vs. AMP)
Summary
C++ is a great way to create fast and fluid apps for Windows 8.
• Get the most out of the compiler’s free optimizations
• Use PPL for concurrent programming
• Use C++ AMP for data parallel algorithms