what’s new in visual c++

What’s new in Visual C++ 11

Jim Hogg

Program Manager, Visual C++, Microsoft

Agenda

• Why C++?
• Performance: CPUs and GPUs
• Baseline: Single-CPU / Multi-CPU Demo
• Vector CPU Demo
• GPU: C++ AMP Demo
• ISO C++ 11
• ALM (Application Lifetime Management)

Why C++? : Power & Performance

“The going word at Facebook is that ‘reasonably written C++ code just runs fast,’ which underscores the enormous effort spent at optimizing PHP and Java code. Paradoxically, C++ code is more difficult to write than in other languages, but efficient code is a lot easier.” – Andrei Alexandrescu

Perf/W (power): driver at all scales: on-die, mobile, desktop, datacenter

Perf/T (size): limits on processor resources: desktop, mobile

Perf/C (experiences): bigger experiences on smaller hardware; pushing the envelope means every cycle matters


CPU vs. GPU today

CPU

• Low memory bandwidth
• Higher power consumption
• Medium level of parallelism
• Deep execution pipelines
• Random accesses
• Supports general code
• Mainstream programming

GPU

• High memory bandwidth
• Lower power consumption
• High level of parallelism
• Shallow execution pipelines
• Sequential accesses
• Supports data-parallel code
• Niche programming

images source: AMD

NBody Simulation, CPU (novec)

Vector Processors (CPU)

[Diagram: a CPU core containing a scalar unit (SCLR UNIT) and a vector unit (VECT UNIT)]

Vector Processors – How they work

SCALAR

ADD RAX, RBX
RAX = 1.10
RBX = 1.20
result in RAX = 2.30

for (int i = 0; i < 1000; ++i) a[i] += b[i];

VECTOR

ADDPS XMM1, XMM2
XMM1 = 1.10 2.10 3.10 4.10
XMM2 = 1.20 2.20 3.20 4.20
result in XMM1 = 2.30 4.30 6.30 8.30

for (int i = 0; i < 1000; i += 4) a[i : i+3] += b[i : i+3];

Vector Processors (CPU)

[Diagram: four scalar units (SCLR UNIT) and four vector units (VECT UNIT)]

Compiler Enhancements

• Auto-vectorizer
  • Automatically vectorize loops.
  • SIMD instructions.
  • ON by default

for (i = 0; i < 1024; i++) a[i] = b[i] * c[i];

for (i = 0; i < 1024; i += 4) a[i:i+3] = b[i:i+3] * c[i:i+3];

• Auto-parallelization
  – Reorganizes the loop to run on multiple threads
  – /Qpar
  – Optional #pragma loop

#pragma loop(hint_parallel(N))
for (i = 0; i < 1024; i++) a[i] = b[i] * c[i];
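The loop shape the auto-vectorizer looks for can be sketched as below. This is a minimal illustration, not compiler documentation: the function name multiply_arrays is hypothetical, and whether SIMD instructions are actually emitted depends on the compiler, its flags and the target CPU. The property that matters is that no iteration depends on the result of another.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise product of two arrays.
// Each iteration reads b[i] and c[i] and writes a[i] only, so there is no
// loop-carried dependence and the iterations can be processed in groups.
std::vector<float> multiply_arrays(const std::vector<float>& b,
                                   const std::vector<float>& c) {
    const std::size_t n = b.size() < c.size() ? b.size() : c.size();
    std::vector<float> a(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];  // candidate for SIMD: four floats fit in one 128-bit register
    }
    return a;
}
```

With Visual C++ 11, building with /O2 and /Qvec-report:1 should list the loops that were vectorized, which is a quick way to check a loop like this one.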

Multi-Core Machines (w/ Vectorization)

[Diagram: four cores, each pairing a scalar unit (SCLR UNIT) with a vector unit (VECT UNIT)]

NBody Simulation, CPU (Auto Vectorize + Parallelize)

The Big Picture – Vectorization

Source Code

int A[20000];
int B[20000];
int C[20000];

for (i=0; i<20000; i++) {
  A[i] = B[i] + C[i];
}

Assembly of Body

$LL3@foo:
  mov ecx, DWORD PTR ?C@@3PAHA[eax*4]
  mov edx, DWORD PTR ?B@@3PAHA[eax*4]
  add ecx, edx
  mov DWORD PTR ?A@@3PAHA[eax*4], ecx
  inc eax
  cmp eax, esi
  jl SHORT $LL3@foo

Transformation

int A[20000];
int B[20000];
int C[20000];

for (i=0; i<20000; i+=4) {
  A[i:i+3] = B[i:i+3] + C[i:i+3];
}

Assembly of Body

$LL3@foo:
  movdqu xmm1, XMMWORD PTR ?C@@3PAHA[eax*4]
  movdqu xmm0, XMMWORD PTR ?B@@3PAHA[eax*4]
  paddd xmm1, xmm0
  movdqu XMMWORD PTR ?A@@3PAHA[eax*4], xmm1
  add eax, 4
  cmp eax, ecx
  jl SHORT $LL3@foo

Dev11 /O2 400% Speedup!!!

Not Your Grandfather’s Vectorizer

// original loop
for (k = 1; k <= M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;

  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;

  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

// the dc[k] statements split into their own loop
for (k = 1; k <= M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;
}

// the ic[k] statements split into their own loop
for (k = 1; k <= M; k++) {
  if (k < M) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
  }
}

// the k < M test folded into the loop bound
for (k = 1; k < M; k++) {
  ic[k] = mpp[k] + tpmi[k];
  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
  ic[k] += is[k];
  if (ic[k] < -INFTY) ic[k] = -INFTY;
}


N-Body Simulation (GPU)

The Power of Heterogeneous Computing

• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ionic placement for molecular dynamics simulation on GPU
• 19X: Transcoding HD video stream to H.264
• 17X: Simulation in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: An M-script API for linear Algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

source

C++ AMP

• Part of Visual C++
• Visual Studio integration
• STL-like library for multidimensional data
• Builds on Direct3D

performance

portability

productivity

Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pC)
{
  for (int i=0; i<n; i++)
  {
    pC[i] = pA[i] + pB[i];
  }
}

#include <amp.h>
using namespace concurrency;

void AddArrays(int* a, int* b, int* c, int N)
{
  array_view<int,1> va(N, a);
  array_view<int,1> vb(N, b);
  array_view<int,1> vc(N, c);
  parallel_for_each(
    va.grid,
    [=](index<1> i) restrict(direct3d)
    {
      va[i] = vb[i] + vc[i];
    }
  );
}

void AddArrays(int* a, int* b, int* c, int N)
{
  for (int i = 0; i < N; ++i)
  {
    a[i] = b[i] + c[i];
  }
}

Basic Elements of C++ AMP coding

void AddArrays(int* a, int* b, int* c, int N)
{
  array_view<int,1> va(N, a);
  array_view<int,1> vb(N, b);
  array_view<int,1> vc(N, c);
  parallel_for_each(
    va.grid,
    [=](index<1> i) restrict(direct3d)
    {
      va[i] = vb[i] + vc[i];
    }
  );
}

array_view variables captured and associated data copied to accelerator (on demand)

restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)

parallel_for_each: execute the lambda on the accelerator once per thread

grid: the number and shape of threads to execute the lambda

index: the thread ID that is running the lambda, used to index into data

array_view: wraps the data to operate on the accelerator

Achieving maximum performance gains

• Schedule threads in tiles
• Avoid thread index remapping
• Gain ability to use tile static memory

[Diagram: an 8 x 6 grid of threads (rows 0 to 7, columns 0 to 5), shown partitioned by g.tile<2,2>() and by g.tile<4,3>()]

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
  data.grid.tile<2,2>(),
  [=] (tiled_index<2,2> t_idx)… { … });
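How a thread's position in the 8 x 6 grid relates to its tile can be sketched in plain C++. This is an illustration of the arithmetic only and does not use the C++ AMP API; the names TilePosition and locate are hypothetical. In C++ AMP itself, tiled_index carries the equivalent information, so no manual remapping is needed.

```cpp
// Hypothetical illustration only: plain C++, not the C++ AMP API.
struct TilePosition {
    int tile_row, tile_col;    // which tile the thread belongs to
    int local_row, local_col;  // position of the thread inside its tile
};

// Split a global (row, col) thread index into tile and local coordinates
// for tiles of TileRows x TileCols threads.
template <int TileRows, int TileCols>
TilePosition locate(int row, int col) {
    return TilePosition{row / TileRows, col / TileCols,
                        row % TileRows, col % TileCols};
}
```

For example, with 2 x 2 tiles the thread at row 5, column 3 lands in tile (2, 1) at local position (1, 1); with 4 x 3 tiles the same thread lands in tile (1, 1) at local position (1, 0).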

C++ AMP at a Glance

• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
• tile_static storage class
• class tiled_grid< , , >
• class tiled_index< , , >
• class tile_barrier

Visual Studio/C++ AMP

• Organize
• Edit
• Design
• Build
• Browse
• Debug
• Profile

C++ AMP Parallel Debugger

• Well known Visual Studio debugging features
  • Launch, Attach, Break, Stepping, Breakpoints, DataTips
  • Toolwindows
    • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
• New features (for both CPU and GPU)
  • Parallel Stacks window, Parallel Watch window, Barrier
• New GPU-specific
  • Emulator, GPU Threads window, race detection

Summary

• Democratization of parallel hardware programmability
• Performance for the mainstream
• High-level abstractions in C++ (not C)
• State-of-the-art Visual Studio IDE
• Hardware abstraction platform
• C++ AMP now published as open specification
  • http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf


Modern C++: Clean, Safe and Fast

Then

circle* p = new circle( 42 );

vector<shape*> v = load_shapes();

for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
  if( *i && **i == *p )
    cout << **i << " is a match\n";
}

for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
  delete *i;
}

delete p;

Now

auto p = make_shared<circle>( 42 );

vector<shared_ptr<shape>> vw = load_shapes();

for_each( begin(vw), end(vw), [&]( shared_ptr<shape>& s ) {
  if( s && *s == *p )
    cout << *s << " is a match\n";
} );

Then                                  Now
T*                                    shared_ptr<T>
new                                   make_shared
delete                                no need for “delete”: automatic lifetime management
not exception-safe                    exception-safe
  (missing try/catch, __try/__finally)
for/while/do                          std:: algorithms, [&] lambda functions
                                      auto type deduction
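The "Now" style can be put together in a small self-contained sketch. The shape and circle types and the body of load_shapes() below are hypothetical stand-ins, since the slide does not define them; the point is only that ownership lives in shared_ptr, so no delete appears and the containers clean up even if an exception is thrown.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the slide's shape/circle types.
struct shape {
    virtual ~shape() {}
    virtual int size() const = 0;
};

struct circle : shape {
    explicit circle(int r) : radius(r) {}
    int size() const override { return radius; }
    int radius;
};

// Hypothetical replacement for load_shapes(): returns owning smart pointers.
std::vector<std::shared_ptr<shape>> load_shapes() {
    std::vector<std::shared_ptr<shape>> v;
    v.push_back(std::make_shared<circle>(7));
    v.push_back(std::make_shared<circle>(42));
    v.push_back(std::make_shared<circle>(42));
    return v;
}

// Count the shapes whose size matches p; no delete is needed anywhere.
int count_matches(const std::vector<std::shared_ptr<shape>>& v, const shape& p) {
    int matches = 0;
    std::for_each(v.begin(), v.end(), [&](const std::shared_ptr<shape>& s) {
        if (s && s->size() == p.size()) ++matches;
    });
    return matches;
}
```

Calling count_matches(load_shapes(), *make_shared<circle>(42)) finds the two circles of size 42, and every object is released automatically when the last shared_ptr to it goes away.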

C++ 11 Language Features in Visual Studio

C++11 Core Language Features          VC10      VC11
rvalue references                     v2.0      v2.1*
auto                                  v1.0      v1.0
decltype                              v1.0      v1.1**
static_assert                         Yes       Yes
trailing return types                 Yes       Yes
lambdas                               v1.0      v1.1
nullptr                               Yes       Yes
strongly typed enums                  Partial   Yes
forward declared enums                No        Yes
standard-layout and trivial types     No        Yes
atomics                               No        Yes
strong compare and exchange           No        Yes
bidirectional fences                  No        Yes
data-dependency ordering              No        Yes

rvalue refs

struct Car {
  string make;  // eg "Volvo"
  int when;     // last-serviced, eg 201103 => March 2011
};

void workOnClone(Car c);     // work on a clone of my car, not returned

void inspect(const Car& c);  // inspect, but don't alter, my car

void fix(Car& c);            // fix and return my car

void replace(Car&& c);       // take my car and cannibalize it; I won't be using it again
// note that && is not a ref-to-ref (unlike **)
// enables "move semantics" and "perfect forwarding"
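A minimal sketch of what the Car&& overload buys. The Garage type is hypothetical and only exists to give replace() something to do; the slide declares the functions without bodies.

```cpp
#include <string>
#include <utility>
#include <vector>

struct Car {
    std::string make;  // eg "Volvo"
    int when;          // last-serviced, eg 201103 => March 2011
};

// Hypothetical garage that takes ownership of cars handed to it.
struct Garage {
    std::vector<Car> parts;

    // Binds only to rvalues: the caller promises not to use the car again,
    // so its contents can be moved instead of copied.
    void replace(Car&& c) { parts.push_back(std::move(c)); }
};
```

A named car has to be handed over explicitly, as in g.replace(std::move(mine)), while a temporary such as Car{"Saab", 201201} binds directly. After the move, mine is still a valid object but its contents are unspecified.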

auto

for (std::map<string, vector<double>>::const_iterator iter = m.cbegin(); iter != m.cend(); ++iter)

for (auto iter = m.cbegin(); iter != m.cend(); ++iter)

const auto * p = new MyClass;  // "add back" qualifiers to auto's inferred type
const auto & r = s;            // "add back" qualifiers to auto's inferred type

auto a1 = new auto(42);    // infers int*
auto * a2 = new auto(42);  // beware: also infers int*

int n = 42;
double pi = 3.14159;
auto x = n * pi;  // will infer type of x is double

Notes:
• static type inference!
• like C# “var”
• may break old code: old auto specifies allocation within current stack frame

decltype

decltype(new C) c = new C;  // c is a C*
// Note: first "new C" is not executed

std::vector<int>::const_iterator iter1; // a long type name

decltype(iter1) iter2; // iter2 has same type as iter1
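The deductions made by auto and decltype can be checked by the compiler itself. The sketch below is illustrative; the names Table, total and deduction_checks are hypothetical.

```cpp
#include <map>
#include <string>
#include <type_traits>
#include <vector>

typedef std::map<std::string, std::vector<double>> Table;

// Sum every value in the table, letting auto name the iterator types.
double total(const Table& m) {
    double sum = 0.0;
    for (auto iter = m.cbegin(); iter != m.cend(); ++iter) {
        for (auto v = iter->second.cbegin(); v != iter->second.cend(); ++v) {
            sum += *v;
        }
    }
    return sum;
}

// Compile-time checks of what auto and decltype deduce.
inline void deduction_checks() {
    int n = 42;
    double pi = 3.14159;
    auto x = n * pi;     // int * double => double
    decltype(n) m = n;   // same type as n: int
    static_assert(std::is_same<decltype(x), double>::value, "x is double");
    static_assert(std::is_same<decltype(m), int>::value, "m is int");
    (void)x;
    (void)m;
}
```

Because the checks are static_assert declarations, a wrong guess about a deduced type fails the build rather than the program.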

static_assert

pre-processor-time

#if VERSION < 8
#error "Need version 8 or higher"
#endif

compile-time

static_assert (FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

template<class T> struct S {
  static_assert(sizeof(T) < sizeof(int), "T is too big");
  static_assert(std::is_unsigned<T>::value, "S needs an unsigned type");
  . . .

run-time

bool done(float g1, float g2, float tol) {
  assert (tol < 1.0e-3);
  . . .
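A compilable sketch of the compile-time checks. The value given to FeetPerMile and the Flags template are hypothetical, chosen only to give the assertions something to guard.

```cpp
#include <type_traits>

// Hypothetical constant for illustration.
const int FeetPerMile = 5280;
static_assert(FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

// Hypothetical bit-flag holder: rejects signed or oversized types at compile time.
template <class T>
struct Flags {
    static_assert(std::is_unsigned<T>::value, "Flags needs an unsigned type");
    static_assert(sizeof(T) <= sizeof(unsigned int), "T is too big");

    Flags() : bits(0) {}
    void set(unsigned n) { bits = static_cast<T>(bits | (T(1) << n)); }
    bool test(unsigned n) const { return ((bits >> n) & 1u) != 0; }

    T bits;
};
```

Flags<unsigned char> compiles, while Flags<int> or Flags<unsigned long long> would stop the build with the corresponding message on platforms where unsigned long long is wider than unsigned int.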

Trailing-Return-Type

template<class A, class B> ??? adder(A &a, B &b) { return a + b; } // no!

template<class A, class B> decltype(a + b) adder(A &a, B &b) { return a + b; }// no!

template<class A, class B> auto adder(A &a, B &b) -> decltype(a + b) { return a + b; } // yes!
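The third form works because the parameter names are already in scope by the time the compiler reaches the trailing return type. A minimal sketch, using const references (an assumption, so that temporaries and literals can be passed):

```cpp
#include <string>

// The parameter names a and b are in scope after "->", so decltype can use them.
template <class A, class B>
auto adder(const A& a, const B& b) -> decltype(a + b) {
    return a + b;
}
```

adder(1, 2.5) returns a double, and adder(std::string("ab"), "cd") returns a std::string, each deduced from the type of a + b.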

lambdas – functions with no name

[ ] ( ) -> int { return 42; } ;                // no arguments
[ ] (int n) -> int { return n * n; } ;         // one argument
[ ] (int a, int b) -> int { return a + b; } ;  // two arguments

for_each(v.begin(), v.end(), [ ] (int n) { cout << n << " "; });  // one-liner

float f1 = integrate ( golden, 0.0, 1.0 );
float f2 = integrate ( [ ] (float x ) { return x * x + x - 1; }, 0.0, 1.0 );

[ ] { cout << "hi"; }  // can omit ( ) if no parameters
                       // can omit -> return-type if inferable

[ capture-clause ] ( parameter-list ) -> return-type { body }  // grammar
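A runnable sketch of the two uses shown above: passing a lambda where a function is expected, and capturing a local variable. The integrate function here is a hypothetical midpoint-rule implementation; the slide does not show how its integrate is defined.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical midpoint-rule integrator; F is any callable taking a double.
template <class F>
double integrate(F f, double lo, double hi, int steps = 1000) {
    const double h = (hi - lo) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) {
        sum += f(lo + (i + 0.5) * h) * h;
    }
    return sum;
}

// Sum of squares using a capturing lambda: [&] captures total by reference.
int sum_of_squares(const std::vector<int>& v) {
    int total = 0;
    std::for_each(v.begin(), v.end(), [&](int n) { total += n * n; });
    return total;
}
```

integrate([](double x) { return x * x + x - 1; }, 0.0, 1.0) approximates the exact value of -1/6, and sum_of_squares on {1, 2, 3} yields 14.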

Strongly-Typed Enums

enum Heights {SHORT, TALL};            // ok
enum Widths {BYTE, SHORT, INT, LONG};  // clash

Illegal: members must be globally unique

Use enum class

enum class Heights {SHORT, TALL};
enum class Widths {BYTE, SHORT, INT, LONG};  // eg: Widths::SHORT

enum members are just integers

enum Colors {RED, GREEN, BLUE};
if (GREEN == 1) cout << "GREEN == 1";          // yes!
enum Parts {ENGINE, BRAKE, CLUTCH};
if (GREEN == BRAKE) cout << "GREEN == BRAKE";  // yes!

Forward-Declared Enum Classes

enum class Colors : unsigned char; // forward declaration (underlying type must match the definition)

void fun(Colors c); // use

. . .

enum class Colors : unsigned char {RED = 3, GREEN, BLUE = 7};
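A compilable sketch of the forward declaration in use. The function name code_of is hypothetical. Note that the forward declaration and the definition have to agree on the underlying type.

```cpp
enum class Colors : unsigned char;  // forward declaration with a fixed underlying type

// Declared before the enumerators are known.
int code_of(Colors c);

enum class Colors : unsigned char { RED = 3, GREEN, BLUE = 7 };

int code_of(Colors c) {
    // No implicit conversion to int: the cast must be spelled out.
    return static_cast<int>(c);
}
```

code_of(Colors::GREEN) yields 4, since GREEN follows RED = 3.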

nullptr

// the NULL hack:

int* p1 = 0;   // value of 0 is 'special'
int* p2 = 42;  // illegal

void f (int* p) { cout << p; }
f(0);  // works

void f (int n) { cout << n; }
f(0);  // works

void f (int n) { cout << n; }
void f (int* p) { cout << p; }
f(0);  // which one?

f(nullptr);  // calls f(int*)

decltype(nullptr) == nullptr_t
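The overload question can be answered by running it. In this sketch the overloads return a label instead of printing, purely so the outcome is easy to inspect; the function g is a hypothetical addition to show std::nullptr_t as a parameter type.

```cpp
#include <cstddef>
#include <string>

// Two overloads that the literal 0 resolves in a way callers may not expect.
std::string f(int)  { return "f(int)"; }
std::string f(int*) { return "f(int*)"; }

// nullptr has its own type, std::nullptr_t, which converts to any pointer type.
std::string g(std::nullptr_t) { return "g(nullptr_t)"; }
```

f(0) selects f(int) because 0 is an int first and a null pointer constant second, while f(nullptr) selects f(int*).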

Memory Model – Scary Terminology

• Dekker’s algorithm
• Double-checked locking
• Weak memory consistency
• Atomics
• Memory fences/barriers
• Volatile
• Sequential consistency
• Acquire/Release semantics
• Axiomatic definition & litmus tests

Dekker’s Algorithm

// thread 0
flag[0] := true
while flag[1] = true {
  if turn ≠ 0 {
    flag[0] := false
    while turn ≠ 0 { }
    flag[0] := true
  }
}
// critical section
turn := 1
flag[0] := false

// thread 1
flag[1] := true
while flag[0] = true {
  if turn ≠ 1 {
    flag[1] := false
    while turn ≠ 1 { }
    flag[1] := true
  }
}
// critical section
turn := 0
flag[1] := false
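The pseudocode above only works if the stores and loads are not reordered, which is exactly what plain variables do not guarantee. A minimal sketch using C++11 atomics with their default sequentially consistent ordering follows; the helper names (dekker_lock, dekker_unlock, worker, run_two_workers) are hypothetical, and the scheme assumes exactly two threads with ids 0 and 1.

```cpp
#include <atomic>
#include <thread>

// Assumption: two threads only, with ids 0 and 1.
std::atomic<bool> flag[2];
std::atomic<int> turn(0);
int shared_counter = 0;  // protected by the Dekker lock below

void dekker_lock(int me) {
    const int other = 1 - me;
    flag[me].store(true);  // sequentially consistent by default
    while (flag[other].load()) {
        if (turn.load() != me) {
            flag[me].store(false);  // back off while it is the other thread's turn
            while (turn.load() != me) { std::this_thread::yield(); }
            flag[me].store(true);
        }
    }
}

void dekker_unlock(int me) {
    turn.store(1 - me);
    flag[me].store(false);
}

void worker(int me, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        dekker_lock(me);
        ++shared_counter;  // critical section
        dekker_unlock(me);
    }
}

// Runs both threads to completion and returns the final counter value.
int run_two_workers(int iterations) {
    flag[0].store(false);
    flag[1].store(false);
    turn.store(0);
    shared_counter = 0;
    std::thread t0(worker, 0, iterations);
    std::thread t1(worker, 1, iterations);
    t0.join();
    t1.join();
    return shared_counter;
}
```

If the atomics were replaced by ordinary bool and int variables, the store to flag[me] could still be sitting in a store buffer when the other thread's flag is read, and both threads could enter the critical section at once. In practice std::mutex is the tool to reach for; the sketch is only meant to show what the memory model guarantees.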

[Diagram: two processors (Proc), each with its own store buffer, connected to shared Memory guarded by a Lock]

http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf

• Each proc has FIFO store buffer
• Reads read from local SB
• Read bypassing
• MFENCE flushes SB
• LOCK’d instruction acquires Lock (eg: XCHG)
• Write to SB may reach memory at any time Lock is not held



C++ Libraries (VS)

• STL
  • C++ 11 conformant
  • Support for new headers in VS vNext
    • <atomic>, <filesystem>, <thread> (others)
• PPL
  • Parallel Algorithms
  • Task-based programming model
  • Agents and Messaging: express dataflow pipelines
  • Concurrency-safe containers


ALM (Application Lifetime Management)

• 2010 features Updated
• Architecture Tools
• Dependency Diagrams
• Architecture Explorer
• Unit Testing
• Native Unit Test Framework
• Manage and Run tests in VS and Test Manager
• Lightweight Requirements
• Agile Planning Tools
• Stakeholder Feedback
• Context Switching
• Code Review
• Exploratory Testing
• Additional new C++ features
• New ALM features in vNext

Code Understanding

demo

PARTICIPATE IN C++ DEVELOPMENT USER RESEARCH

MICROSOFT DEVELOPER DIVISION DESIGN RESEARCH

SIGN UP ONLINE AT http://bit.ly/cppdeveloper

MICROSOFT C++ 2012

Every week, the DevCamps: ALM, Azure, Windows Phone, HTML5, OpenData
http://msdn.microsoft.com/fr-fr/devcamp

Downloads, resources and toolkits: see MSDN
http://msdn.microsoft.com/fr-fr/

Offers worth knowing

• 90-day free trial of Windows Azure: www.windowsazure.fr
• Up to 35% off Visual Studio Pro with an MSDN subscription: www.visualstudio.fr

Going further

Upcoming Dev Camp sessions (all via Live Meeting)

• 10 February 2012: Open Data: building rich applications with the Open Data protocol
• 16 February 2012: Azure series: building social applications on the Windows Azure platform
• 17 February 2012: Understanding canvas with Galactic and the three.js library
• 21 February 2012: Automated code generation with CodeFluent Entities
• 2 March 2012: Understanding and using the Azure toolkit for Windows Phone 7, iOS and Android
• 6 March 2012: Nuget and ALM
• 9 March 2012: Kinect: managing your sensor's lifecycle
• 13 March 2012: Sharepoint series: test automation
• 14 March 2012: TFS Health Check: checking the health of your development platform
• 15 March 2012: Azure series: developing for phones, tablets and the cloud with Visual Studio 2010
• 16 March 2012: METRO design applications: a thorough teardown of a METRO JavaScript template
• 20 March 2012: LightSwitch lessons learned, data access optimization, Silverlight integration
• 23 March 2012: OAuth: the key to using social networks in your application
