Source: Uppsala University, user.it.uu.se/~carln/hpc2015_carln1.pdf

C/C++

Carl Nettelblad 2015-11-24

Outline

• Languages

• Cases:

– Printing lists

– Sorting lists

• The discussion will include:

– Templates vs. inheritance

Why is C a good language?

• Fast

• Nothing is hidden

• “Lingua franca”

– Runs everywhere

• For any type of program

• Any kind of distributed/parallel computing

– Can interact with anything

• Compiled and static typing

Why is C a bad language?

• Tedious

– Easily getting stuck in “how”, not “what”

• Long iteration times

– Rebuild after a simple bug

• Unsafe

– Bugs can be devastating

– For scientific codes:

• Complex bugs can be hidden a long time

Why is Python a good language?

• Flexible

– Different abstractions

• Concise

• Good libraries for scientific and non-scientific purposes

• Easy to use for interactive and quick prototyping

Why is Python a bad language?

• Slow

– Default version is interpreted

– Not scaling well with threading

• Flexibility can promote bad habits

– Hard to guarantee that all parts are consistently

used when changes are made

• (Indentation carrying semantic meaning)

What do we want?

• Flexible abstractions

• Good and predictable libraries

• High performance

• Easy interactivity

• Type safety

• This language could be C++!

– Or Python with a mix of C++

Python from Matlab

• Matlab R2014b (8.4) and later support immediate

Python integration

• py.module.function

• E.g. access the highly accurate summation function

fsum using py.math.fsum

• Can work straight away, flat vectors (not matrices)

automatically translated back and forth

• Any Python module built in this course might then be

accessed in Matlab

Python from the web

• IPython platform for interactive Python

• IPython Notebook, web-based interface to IPython

– Combine code, text, and figures

• Kind of like Mathematica

– Easily edit different code snippets

– Press Shift+Enter to (re)compute

C++ from the web

• Jupyter project

– IPython is separated into interactivity engine and

actual Python

cling and clang

• At CERN, the ROOT framework has existed for a long

time

– Special classes and an interpreter of a language

similar to C++

– With several oddities

– Interpreted language truly slow

• Effort to rebuild this into using “real” C++

– cling real-time compiler based on clang

– clang is the C++ compiler currently used by Apple

clang

• g++/gcc has been the de-facto standard for (open-source)

C/C++ compilers for a long time

• gcc has an archaic codebase

– Historically not easy to tie into some services

– E.g. get a parse tree

– Or add new code on the fly to an ongoing compilation

process

• Other compilers are closed source

– And also tend to lack flexible APIs

• clang is modularized (the front-end to the separate LLVM

backend) and open-sourced

Other users of clang

• In addition to cling and the Apple compilers, clang is

found in e.g.

– The Nvidia CUDA device compiler is built on LLVM, the

backend that clang targets, no matter what host compiler you have

– The IDE Ceemple, which bundles a lot of C++

libraries and offers a separate compilation mode with very

short latency, is based on clang

• Keep the compiler loaded with all headers

between reruns

Working on an array in C

void printIntArray(int* data, int size)
{
    for (int i = 0; i < size; i++)
    {
        printf("%d\n", data[i]);
    }
}

Why is this bad?

• Adapted to one specific type of data (int)

• Size is an explicit parameter

– If size is specified incorrectly, we will read invalid

data

• The function can easily change the data

• Data pointer can be invalid

What would this look like in

Python?

def printArray(array):
    for i in array:
        print(i)

What would this look like in C++?

void printIntVector(IntVector* vector)
{
    for (int i = 0; i < vector->size(); i++)
    {
        printf("%d\n", vector->get(i));
    }
}

What would this look like in C++?

void printIntVector(vector<int>::iterator begin, vector<int>::iterator end)
{
    for (vector<int>::iterator i = begin; i != end; i++)
    {
        cout << *i << "\n";
    }
}

The inheritance abstraction

• An iterator would be a common interface or base class

• This is the case in e.g. Java

• Subclasses inherit from this base class

– Performing iteration in a specific data structure

– Virtual methods for getting next element, current

element etc.

• (Runtime) polymorphism

How is the method call made?

• Each object has a table of method implementations

• The slot numbers are fixed at compilation

– Any call to an Iterator method will be “call the

method pointed to in the right slot in the vtable”

– This is an indirect jump

[Figure: vtables of Iterator and IntVectorIterator, each with the slots next() and get()]
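The inheritance-based design can be sketched as follows. This is only an illustration: the `Iterator` and `IntVectorIterator` names come from the slide, but the `done()` method, the member layout, and the `sumAll` helper are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical abstract interface: every call goes through the vtable.
struct Iterator {
    virtual ~Iterator() {}
    virtual bool done() const = 0;  // true when no current element is left
    virtual int get() const = 0;    // read the current element
    virtual void next() = 0;        // advance to the following element
};

// Concrete iterator over a std::vector<int>.
struct IntVectorIterator : Iterator {
    const std::vector<int>& data;
    std::size_t pos;
    explicit IntVectorIterator(const std::vector<int>& d) : data(d), pos(0) {}
    bool done() const override { return pos >= data.size(); }
    int get() const override { return data[pos]; }
    void next() override { ++pos; }
};

// Works with any Iterator subclass. Each call below is an indirect jump
// unless the compiler can prove which implementation is used.
int sumAll(Iterator& it) {
    int total = 0;
    for (; !it.done(); it.next()) {
        total += it.get();
    }
    return total;
}
```

Calling `sumAll` with an `IntVectorIterator` over `{1, 2, 3, 4}` yields 10, but the loop body cannot be inlined as easily as a direct array walk.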

Indirect jumps

• A modern fast CPU is pipelined and out of order

– Multiple instructions “in flight” at once

– If instructions depend on each other, an out of order core starts

executing a later one

– Pipeline depth 20

• Out of order window of 224 in recent Intel CPU

– Hides waiting on memory

– Latency is the difference between real and theoretical

performance

ADD MOV CMP JNZ MOV

MOV CMP JNZ MOV …

CMP JNZ MOV … …

Branch prediction

• Out of order works fine if the instruction stream is

known

• If you have a loop or an if statement, the CPU has to

guess

– Can actually get pretty good

• A virtual method call is another branch

– In the very worst case, that instruction is not even

cached

Virtual methods in the compiler

• When you call a function directly in C, the compiler can

see everything that happens

– It can inline the function

– Move instructions around

– Do all the optimizations that make a modern

compiler fast, across the function call

• The virtual method call breaks this

– Sometimes the compiler can identify that the same

implementation is always used

The duck-typing abstraction

• Python uses the concept of duck typing

– “If it walks like a duck, swims like a duck, quacks like

a duck, it is a duck”

– “If an object has all the methods of an iterator, it is

an iterator”

• Convenient, flexible

– You can use inheritance, but you don’t rely on it to

define the contract

• Functions are looked up by name in a data structure

when they are called

– C++ vtables suddenly seem superfast

C++ templates

• Create functions and classes that can work on arbitrary

classes

• Simple motivation

– Type-safe container classes

• vector<int>

• map<int, double>

• These are done at compile-time

• Compiler error messages can be hard to track

– Templates within templates within templates

– Compare this to sudden error at runtime
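A small sketch of what "done at compile time" means in practice. The helper names `largest` and `lookupOrZero` are made up for this illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// One source definition; the compiler generates a separate version
// for every element type it is used with.
template <typename T>
T largest(const std::vector<T>& values) {
    T best = values.front();  // assumes a non-empty vector
    for (const T& v : values) {
        if (best < v) best = v;
    }
    return best;
}

// Type-safe lookup table: keys must be int, values must be double.
double lookupOrZero(const std::map<int, double>& table, int key) {
    auto it = table.find(key);
    return it == table.end() ? 0.0 : it->second;
}
```

Passing a `vector` of a type without a `<` operator to `largest` is rejected when compiling, not when the program runs.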

Printing a list

template<typename T>
void printList(T begin, T end)
{
    for (T i = begin; i != end; i++)
    {
        printf("%d\n", static_cast<int>(*i));
    }
}

What happened here

• We are doing duck-typing in C++

• We don’t know what T is

– But begin and end are of the same type

– We can get a value with the dereference (*) operator

– That value can be cast to an int

– We can iterate to the next value with ++

• All of this is done at compile time

– Performance

– Correctness

Abstraction costs

• For a simple array, this is just as fast as the C version

– That code could only handle pointer-based int arrays

– But it can be binary trees (set), or a network stream

• For performance, you want to keep runtime costs of the

generalizations and abstractions you make at a

minimum
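To make the point concrete, here is a sketch of one iterator-based template applied to three different data structures. It writes to a stream instead of using `printf` so the result can be inspected; the extra `out` parameter and the `demoPrintList` helper are assumptions for this example only.

```cpp
#include <ostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Same shape as the slide's printList, but writing to a caller-supplied stream.
template <typename T>
void printList(T begin, T end, std::ostream& out) {
    for (T i = begin; i != end; ++i) {
        out << *i << "\n";
    }
}

// Hypothetical helper: runs printList on three different data structures.
std::string demoPrintList() {
    int raw[] = {3, 1, 2};             // plain C array, iterated with pointers
    std::vector<int> vec = {3, 1, 2};  // contiguous container
    std::set<int> tree = {3, 1, 2};    // balanced tree, iterates in sorted order

    std::ostringstream out;
    printList(raw, raw + 3, out);
    printList(vec.begin(), vec.end(), out);
    printList(tree.begin(), tree.end(), out);
    return out.str();
}
```

For the raw array, `T` is simply `int*`, so the generated loop is the same pointer walk one would write by hand in C.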

Printing a list

template<typename T>
void printList(T begin, T end)
{
    for (auto i = begin; i != end; i++)
    {
        printf("%d\n", static_cast<int>(*i));
    }
}

Printing a list

template<typename T>
void printList(const T& list)
{
    for (auto i : list)
    {
        printf("%d\n", static_cast<int>(i));
    }
}

Printing a list

template<typename T>
void printList(const T& list)
{
    for (auto i : list)
    {
        cout << i << "\n";
    }
}

Consequences

• auto keyword

– For local variables, you frequently don’t really care about the type, no

“contract”

– Full typename could change if you change data structures later on

– Just let the compiler figure it out

• const &

– C and C++ send all parameters by value by default

– If you would send a full vector to a function, that could imply copying

the vector

– const means “I don’t want to be able to change this object by

accident”

– & means “I want to work on the original object, not a copy”

– These are semantic differences
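The difference between passing by value and by const reference can be sketched like this. The function names are invented for the illustration.

```cpp
#include <vector>

// By value: the caller's vector is copied; changes here are invisible outside.
int sumByValue(std::vector<int> values) {
    values.push_back(100);  // modifies only the local copy
    int total = 0;
    for (auto v : values) total += v;
    return total;
}

// By const reference: no copy, and the compiler rejects any modification.
int sumByConstRef(const std::vector<int>& values) {
    // values.push_back(100);  // would not compile: values is const
    int total = 0;
    for (auto v : values) total += v;
    return total;
}
```

After calling `sumByValue(v)`, the caller's `v` still has its original size, because only the copy was changed.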

Consequences

• for (auto i : list)

– Simple “for each” notation

– Under the hood relying on iterators

– But you can do stuff like

for (auto x : map<int,int>{{1,2}, {3,5}}) {
    printf("%d %d\n", x.first, x.second);
}

• You simply can’t accidentally go outside the range with this syntax

Give your code a Boost

• The C++ standard library is rather thin

– It’s become larger in the last few standards

– You want to interact with the underlying tech (the

OS), not a library faking the OS

– OS libraries are rarely nice C++…

• Also lack of general algorithms and abstractions

• The Boost library (or library of libraries) changes this

Boost

• Independent project

– Started out at the end of the last millennium

– Libraries added after peer review process, focusing on

generality and “nice interface”

– Varying quality

• Far fewer, but far more stable than arbitrary Perl, Python, or

R libraries

• Great influence on the C++ standards process

– The TR1 document between C++03 and C++11 based several

new libraries on their boost counterparts

– C++11 continued this

– Added language features in C++11 based on “things Boost

could not achieve”

What do we have in Boost?

• Accumulators, Algorithm, Align, Any, Array, Asio, Assert, Assign, Atomic, Bimap,

Bind, Call Traits, Chrono, Circular Buffer, Compatibility, Compressed Pair,

Concept Check, Config, Container, Context, Conversion, Convert, Core, Coroutine,

Coroutine2, CRC, Date Time, Dynamic Bitset, Enable If, Endian, Exception,

Filesystem, Flyweight, Foreach, Format, Function, Function Types, Functional,

Fusion, Geometry, GIL, Graph, Heap, ICL, Identity Type, In Place Factory,

Integer, Interprocess, Interval, Intrusive, IO State Savers, Iostreams, Iterator,

Lambda, Lexical Cast, Local Function, Locale, Lockfree, Log, Math, Member

Function, Meta State Machine, Min-Max, MPI, MPL, Multi-Array, Multi-Index,

Multiprecision, Numeric Conversion, Odeint, Operators, Optional, Parameter,

Phoenix, Pointer Container, Polygon, Pool, Predef, Preprocessor, Program

Options, Property Map, Property Tree, Random, Range, Ratio, Rational, Ref,

Regex, Result Of, Scope Exit, Serialization, Signals, Signals2, Smart Ptr, Sort,

Spirit, Statechart, Static Assert, String Algo, Swap, System, Test, Thread,

ThrowException, Timer, Tokenizer, TR1, Tribool, TTI, Tuple, Type Index, Type

Traits, Typeof, uBLAS, Units, Unordered, Utility, Uuid, Value Initialized, Variant,

Wave, Xpressive

Python and C++

• When you integrate languages with each other, you

need to define:

– Who are you?

– Who are your users?

– Which language is extending the bridge into the

other?

– What features of the two languages need to be

maintained in the bridge?

– Do you have performance concerns?

Cython

• There are many ways to create bindings between

Python and other languages

• Cython generates C++ code from Python code

– Can call into C++ with some work

– The Python parser needs to understand C++

declarations

– The generated C++ code also needs to compile

correctly

• Do not confuse Cython with CPython (normal Python

implementation)

Performance of Cython

• Code can be annotated with exact types

– Allows more optimizations

– Tight loops can be quick

• Still plagued by some of the indirection problems of

Python

– Just as fast as C code interacting closely with

Python

– Not as fast as code in C/C++ with full control over

data structures

– Transition between C and Cython code is very quick

Cython C++ wrapping

class Rectangle {
public:
    int x0, y0, x1, y1;
    Rectangle(int x0, int y0, int x1, int y1);
    ~Rectangle();
    int getLength();
    int getHeight();
    int getArea();
    void move(int dx, int dy);
};

Wrapping to Cython

cdef extern from "Rectangle.h":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int) except +
        int x0, y0, x1, y1
        int getLength()
        int getHeight()
        int getArea()
        void move(int, int)

Wrapping to Python

cdef class PyRectangle:
    cdef Rectangle *thisptr  # hold a C++ instance which we're wrapping
    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.thisptr = new Rectangle(x0, y0, x1, y1)
    def __dealloc__(self):
        del self.thisptr
    def getLength(self):
        return self.thisptr.getLength()
    def getHeight(self):
        return self.thisptr.getHeight()
    def getArea(self):
        return self.thisptr.getArea()
    def move(self, dx, dy):
        self.thisptr.move(dx, dy)

Conclusion

• Interface stated three times

• One time in C++, two times in semi-Python

• Makes perfect sense if you are a Python coder

wrapping an existing C++ library

• Performance nice overall

• Wrapping is imperative in style

Boost.Python

• Far older interface (dating back to 2002!)

• Write C++ classes

• Define in C++ how these classes are mapped

Rectangle example again

BOOST_PYTHON_MODULE(shapes)
{
    class_<Rectangle>("PyRectangle", init<int,int,int,int>())
        .def("getLength", &Rectangle::getLength)
        .def("getHeight", &Rectangle::getHeight)
        .def("getArea", &Rectangle::getArea)
        .def("move", &Rectangle::move)
    ;
}

Exposing data members

• .def_readonly("x0", &Rectangle::x0)

• More relevant, exposing existing getter with property

syntax of Python

.add_property("area", &Rectangle::getArea)

• Add a third parameter to have a setter as well

More complex stories

• Define customized rules for how to map Python types

to C++ types

• Define declarative ownership rules for objects created

in C++

– That’s what the PyRectangle Cython wrapper did in

code

Sorting data

• Common task

• Quicker to keep data unsorted and sort it later vs.

maintaining a properly sorted data structure

– E.g. keep a vector, then sort it, rather than keeping a

C++ set (which is a sorted self-balancing tree)

What is needed to sort

• Sorting itself is O(n log n)

– Using a proper algorithm, you can always sort a list

in a number of operations that is proportional to

n log n basic operations

– Since proportional, it doesn’t matter what log base

we are using

– If sorting 1,000 elements would use T operations, we

would expect 1,000,000 elements to use 2000T (not

1000T)
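The 2000T figure follows directly from the n log n model: going from 1,000 to 1,000,000 elements multiplies n by 1000 and log n by 2. A short sketch that checks the arithmetic, using an idealized cost function (the names `sortCost` and `costRatio` are made up here):

```cpp
#include <cmath>

// Idealized cost model: exactly n * log(n) basic operations.
// The log base cancels out when taking ratios.
double sortCost(double n) {
    return n * std::log(n);
}

// How many times more expensive sorting `large` elements is than `small`.
double costRatio(double small, double large) {
    return sortCost(large) / sortCost(small);
}
```

`costRatio(1e3, 1e6)` comes out at 2000 up to floating-point rounding, matching the slide.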

or

Which data layout would you use for sorting? Which data

layout would you use for accessing the data later?

Indirection

El1 El2 El3 El4

Ref1 Ref2 Ref3 Ref4

El1 El3 El4 El2

Indirection

• Just keeping references could seem making sorting

faster

– You don’t need to move the full elements

– This is kind of true

– Depends a bit on how much of the data you need to

access to do the sorting comparisons

• Overall size larger in indirect case

– Frequently overhead for each allocated element

• Remember: Current CPUs are very fast

– When they can predict what to do beforehand

– Moving a chunk of data is predictable

Indirection

• If sorting for speedy access later, “sorting” an indirected data

structure will keep actual data stored all over the place

• When is indirection used?

– Python lists, Python dictionaries

– Java ArrayList, HashMap etc

– Java arrays of non-primitive types

– General pointer-based data structures in C and C++

– Cell arrays in Matlab

• When is it not used?

– array module in Python

– numpy matrices, Matlab matrices

– C/C++ arrays, and some C++ STL containers (vector, array)
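The two layouts can be contrasted in C++ as follows. This is a sketch with invented function names; `std::unique_ptr<int>` stands in for any separately allocated element.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Direct layout: the values themselves are stored contiguously and moved by sort.
void sortDirect(std::vector<int>& values) {
    std::sort(values.begin(), values.end());
}

// Indirect layout: only the pointers are rearranged; each comparison has to
// follow two pointers to reach the actual values.
void sortIndirect(std::vector<std::unique_ptr<int>>& refs) {
    std::sort(refs.begin(), refs.end(),
              [](const std::unique_ptr<int>& a, const std::unique_ptr<int>& b) {
                  return *a < *b;
              });
}
```

After `sortIndirect`, the pointers are in order, but the integers they point to remain wherever the allocator originally placed them.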

Case in point

• Python list of integers

– Each value is really just 4 bytes

– An entry in a list will use 8 bytes on a 64-bit machine

– The minimum size of the allocated list element is 24

bytes

• Sorting will require walking over the 8-byte entries,

tracing each to the correct element, and then moving

the entries around

Caches

• All memory is not equally fast

• Different levels of caches

• If data does not fit into cache, things become slow

• CPU does prefetching

– If memory accesses follow simple patterns

– Random indirection does NOT

• Cache-friendly code

– Good locality

• Keep using the same part of memory before moving on

– Small working set

• Keep memory usage low

– Good predictability

• Helps prefetcher

So, how do we sort?

• Do not implement a general sorting algorithm yourself

• In C:

– void qsort(void *base, size_t nitems, size_t size, int (*compar)(const void *, const void*))

• You have bare pointers to data, you need to state the size of each

element, you need a pointer to a function that can do comparisons

• We learned in the printing example that we can do better…

• template <class RandomAccessIterator> void sort (RandomAccessIterator first, RandomAccessIterator last);

• template <class RandomAccessIterator, class Compare> void sort (RandomAccessIterator first, RandomAccessIterator last, Compare comp);
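Side by side, the two interfaces look roughly like this. The wrapper function names are assumptions for the sketch; `std::qsort` and `std::sort` are the standard library calls.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Comparison callback for qsort: receives untyped pointers to two elements.
int compareInts(const void* a, const void* b) {
    int left = *static_cast<const int*>(a);
    int right = *static_cast<const int*>(b);
    return (left > right) - (left < right);  // negative, zero, or positive
}

// C style: base pointer, element count, element size, function pointer.
void sortWithQsort(std::vector<int>& data) {
    std::qsort(data.data(), data.size(), sizeof(int), compareInts);
}

// C++ style: the iterators carry both the element type and the range.
void sortWithStdSort(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}
```

With `qsort`, a wrong element size or a callback for the wrong type compiles without complaint; with `std::sort`, the compiler knows the element type and can inline the comparison.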

Simple and complex sorting

• Elements with valid < operator can be sorted in increasing order by simply

– sort(vector.begin(), vector.end());

• Nice trick

– pair class, make tuples of your data where the desired order is

represented by the first element

• Harder case, implement a function which takes const (references) to the

elements

– Returning true if the first object comes before the latter

– false if it comes after or if they are equivalent (strict weak ordering)

– The bugs you can get in any language for invalid comparison code are

nasty

• The pair suggestion might not be too bad
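Both approaches can be sketched on a small record type. `Person` and all function names below are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type used only for this illustration.
struct Person {
    std::string name;
    int age;
};

// Pair trick: put the sort key first and rely on pair's built-in operator<.
std::vector<std::string> namesByAgeUsingPairs(const std::vector<Person>& people) {
    std::vector<std::pair<int, std::string>> keyed;
    for (const auto& p : people) keyed.push_back(std::make_pair(p.age, p.name));
    std::sort(keyed.begin(), keyed.end());
    std::vector<std::string> names;
    for (const auto& k : keyed) names.push_back(k.second);
    return names;
}

// Explicit comparison function: true only if left comes strictly before right.
bool youngerThan(const Person& left, const Person& right) {
    return left.age < right.age;  // using <= here would break strict weak ordering
}

void sortByAge(std::vector<Person>& people) {
    std::sort(people.begin(), people.end(), youngerThan);
}
```

Note that the pair version breaks ties on the second element (the name), while `youngerThan` leaves the relative order of equal ages unspecified.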

Functors

• A function is a very real thing in original C

– It’s a single specific piece of code (location in

memory) that can be called by the CPU

• In modern C++

– We have templates etc

• Same piece of source can result in multiple sets

of machine code

– Inlining might mean that there is no function call, not

even a block of machine instructions for the function

• A function is just code, no data

Functors

• C++ supports operator overloading

• () is another operator

• So, comparing elements can be as simple as

struct intComparator
{
    bool operator() (const int left, const int right)
    {
        return left < right;
    }
};

sort(data.begin(), data.end(), intComparator());

Why would you want a functor?

• Auxiliary data

• Settings

• Caching/precalculation to speed up additional function calls

• Keeping statistics

• Any case where you want to inject a piece of code inside an

algorithm or library
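As a sketch of the "settings" case, a functor can carry a parameter that a plain function could not. `CloserTo` and `sortByDistance` are invented names for this example.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Functor carrying a setting: orders values by their distance to a target.
struct CloserTo {
    int target;
    explicit CloserTo(int t) : target(t) {}
    bool operator()(int left, int right) const {
        return std::abs(left - target) < std::abs(right - target);
    }
};

void sortByDistance(std::vector<int>& data, int target) {
    std::sort(data.begin(), data.end(), CloserTo(target));
}
```

The same comparison code then serves any target value, chosen at run time when the functor object is constructed.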

Sorting again

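struct intComparator
{
    int comps;
    intComparator() : comps(0) {}
    bool operator() (const int left, const int right)
    {
        comps++;  // count every comparison
        return left < right;
    }
};

intComparator comparer;
// sort copies its comparator, so pass a reference wrapper (std::ref, from
// <functional>) to make the count end up in comparer
sort(data.begin(), data.end(), std::ref(comparer));
printf("%d\n", comparer.comps);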

Functors

• This was nice

• But it moves the logic for the sorting away from the

place where we sort

– Logical if it is a general ordering

– But if it’s a general ordering, it should probably just

be in the < operator for the elements we sort

Lambda expressions

• You would really like to put the instructions right where

they logically belong

• Like…

This?

int comps = 0;
sort(data.begin(), data.end(), [&] (int left, int right) {
    comps++;
    return left < right;
} );
printf("%d\n", comps);

• And in the C++14 standard, you can even put “auto” for left and right there

What was that?

• Lambda expressions

– Common in functional programming

– Common in Python

• Create a functor object in place anywhere

– Put all relevant code in one place

• Local variables can be made accessible within the

functor

– “Capturing”

Lambda syntax

• [capture-list] (params) mutable -> return-type { body }

• Specifying mutable is optional

• Return type and arrow do not need to be specified, if

it can be deduced correctly

– Single return statement with evident type
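The optional parts of the syntax can be illustrated with two small lambdas. The surrounding function names `halfOrZero` and `tickDemo` are made up for this sketch.

```cpp
#include <vector>

// Explicit return type: the two return statements have different types
// (int and double), so the result type is spelled out with ->.
double halfOrZero(int value) {
    auto f = [](int x) -> double {
        if (x < 0) return 0;  // int literal, converted to double
        return x / 2.0;
    };
    return f(value);
}

// mutable: the lambda may modify its own by-value copy of `counter`.
std::vector<int> tickDemo() {
    int counter = 0;
    auto tick = [counter]() mutable { return ++counter; };
    std::vector<int> result;
    result.push_back(tick());
    result.push_back(tick());
    result.push_back(tick());
    // The outer `counter` is still 0 here; only the lambda's copy changed.
    result.push_back(counter);
    return result;
}
```

Without `mutable`, the `++counter` inside the second lambda would not compile, since by-value captures are const by default.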

Capture lists

• If you just specify [], the lambda won’t have access to

local variables

• You can also specify a list of variable names [a,b,c]

– These are then captured by value, a copy is made

– That makes them safe to access even when the

function that created the lambda has returned

• You can also specify variables with & - [&a,&b,&c]

• Shorthand, capture all variables by value [=], all

variables by reference [&]

• If you have a method in a class, you can also expose

instance variables using [this]
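A minimal sketch contrasting the two main capture modes. `makeAdder` and `countCalls` are hypothetical names used only here.

```cpp
#include <functional>

// Capture by value: the lambda owns a copy of `amount`, so it stays valid
// after makeAdder has returned.
std::function<int(int)> makeAdder(int amount) {
    return [amount](int x) { return x + amount; };
}

// Capture by reference: the lambda updates the caller's variable directly.
int countCalls(int times) {
    int calls = 0;
    auto bump = [&calls]() { ++calls; };
    for (int i = 0; i < times; ++i) bump();
    return calls;
}
```

Returning a lambda that captured `amount` by reference instead would leave it referring to a local variable that no longer exists.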

Performance comparison

• Sorted the same numbers, the same way

– Generate 10,000,000 random numbers

– Sort them

– With functor

• With and without counting comparisons

• With and without inlining

– With lambda

• With and without counting comparisons

• Amounting to 282306119 comparisons

• On GCC 5.2, Tintin

• Repeated timings a bit, not fully accurate

Performance comparison

Type      Inline   Counting   Time (s)
Functor   X        -          1.114
Lambda    X        -          1.119
Functor   -        -          1.522
Functor   X        X          1.102
Lambda    X        X          1.205
Functor   -        X          1.523

Performance comparison

• Functor turns out to be faster than lambda for counting case

– Probably variable capture by reference leads to two levels of

indirection (get to the lambda data, get the reference to the

comps variable, update it)

– Lambda still less optimized than normal objects

• Some uncertainty in the numbers

• Relatively huge overhead in not inlining

– Even a non-indirect function call is expensive if the operation is

simple

• Benefits of inlining can sometimes be even stronger for slightly

longer methods

– More data interactions between caller and callee that the

optimizer can work on

Summary

• Lambda, auto, range for, and templates are some

examples of things that make modern C++ a much

more pleasant language to use

• The way these technologies are implemented allows

them to give the same or better performance than

equivalent C code

– Much faster than Python code

– While actual code can be similar in style

• Powerful libraries in Boost

• Interactive modern C++ is there, but not as mature as

IPython
