Mind the cache

Joaquín M López Muñoz <[email protected]>
using std::cpp 2015 · Madrid, November 2015
Telefónica Digital – Video Products Definition & Strategy

Page 1

Telefónica Digital – Video Products Definition & Strategy

using std::cpp 2015

Joaquín M López Muñoz <[email protected]> Madrid, November 2015

Mind the cache

Page 2

Our mental model of the CPU is 70 years old

Von Neumann in front of EDVAC

Page 3

Current reality looks quite different

Intel Core i7-3770K (Ivy Bridge)

Page 4

Processor-memory gap

(chart: performance vs. year, showing the widening processor-memory gap)

More cores

More cache

Page 5

Cache hierarchy

L1: 32 KB (per core)

L2: 256 KB (per core)

L3: 8 MB (shared)

Page 6

Cache line

64 B: the unit of transfer between L1, L2, and memory.

Page 7

Access times

L1: 1 ns / 4 cycles

L2: 3.1 ns / 12 cycles

L3: 7.7 ns / 30 cycles

DRAM: 60 ns / 237 cycles

Page 8

Measure measure measure

Page 9

Our little measuring tool

// requires <algorithm>, <array>, <chrono>, <numeric>
template<typename F>
double measure(F f)
{
  using namespace std::chrono;
  static const int          num_trials=10;
  static const milliseconds min_time_per_trial(200);
  std::array<double,num_trials> trials;
  volatile decltype(f())        res; // to avoid optimizing f() away

  for(int i=0;i<num_trials;++i){
    int runs=0;
    high_resolution_clock::time_point t2;
    auto t1=high_resolution_clock::now();
    do{
      res=f();
      ++runs;
      t2=high_resolution_clock::now();
    }while(t2-t1<min_time_per_trial);
    trials[i]=duration_cast<duration<double>>(t2-t1).count()/runs;
  }
  (void)res; // suppress unused-variable warning

  // trimmed mean of the middle trials, reported in microseconds
  std::sort(trials.begin(),trials.end());
  return std::accumulate(trials.begin()+2,trials.end()-2,0.0)
         /(trials.size()-4)*1E6;
}

Page 10

Profiling environment

Intel Core i5-2520M @ 2.5 GHz (Sandy Bridge)

L1: 2×32 KB 8-way 3 cycles

L2: 2×256 KB 8-way 9 cycles

L3: shared 3 MB 12-way 22 cycles

4.0 GB DRAM

Windows 7 Enterprise

Visual Studio Express 2015

x86 mode (32 bit), default release settings

Page 11

Container traversal

Page 12

Linear traversal

// vector
std::vector<int> v=...;
return std::accumulate(v.begin(),v.end(),0);

// list
std::list<int> l=...;
return std::accumulate(l.begin(),l.end(),0);

// shuffled list
std::list<int> l=...; // nodes shuffled in memory
return std::accumulate(l.begin(),l.end(),0);

Page 13

Linear traversal: results

20.2×

6.6×

Page 14

Linear traversal: results

111×

179×

225×

Page 15

Linear traversal: analysis

Two cache aspects to watch out for

Locality

Prefetching

std::vector excels at both

std::list: sequentially allocated nodes happen to provide some locality, but none is guaranteed

Shuffled nodes is the worst scenario

(diagram: current cache line vs. prefetched cache line)
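The "shuffled list" worst case above can be reproduced portably: insert values in random order (so the nodes are allocated at scattered addresses), then call std::list::sort, which relinks the nodes without moving them. Traversal order becomes sequential while memory order stays shuffled. The helper name make_shuffled_list is illustrative, a sketch rather than the talk's actual benchmark code:

```cpp
#include <algorithm>
#include <list>
#include <numeric>
#include <random>
#include <vector>

// Build a list whose traversal order is 0..n-1 but whose nodes sit at
// shuffled heap addresses: insert in random order, then relink via sort.
std::list<int> make_shuffled_list(std::size_t n)
{
  std::vector<int> values(n);
  std::iota(values.begin(),values.end(),0);
  std::mt19937 gen(12345); // fixed seed for repeatability
  std::shuffle(values.begin(),values.end(),gen);
  std::list<int> l(values.begin(),values.end()); // nodes allocated in shuffled order
  l.sort();                                      // relinks nodes, does not move them
  return l;
}
```

Accumulating over the result then exercises pure pointer chasing: every step of the traversal is likely to land on a different cache line, defeating the prefetcher.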

Page 16

Matrix sum

boost::multi_array<int,2> a(boost::extents[m][m]);

// row_col
long int res=0;
for(std::size_t i=0;i<m;++i){
  for(std::size_t j=0;j<m;++j){
    res+=a[i][j];
  }
}
return res;

// col_row
long int res=0;
for(std::size_t j=0;j<m;++j){
  for(std::size_t i=0;i<m;++i){
    res+=a[i][j];
  }
}
return res;

Page 17

Matrix sum: results

1.7× 1.9×

Page 18

Matrix sum: analysis

Max # of cache lines in L1: 32 KB / 64 B = 512

L2 saturation @ 4097² = 16,785,409

Limited cache associativity makes things actually worse

(diagram: m×m matrix layouts illustrating the m = 513 case)

Page 19

Layout tricks

Page 20

AOS vs SOA

// aos
struct particle
{
  int x,y,z;
  int dx,dy,dz;
};
using particle_aos=std::vector<particle>;

auto ps=create_particle_aos(n);
long int res=0;
for(std::size_t i=0;i<n;++i)res+=ps[i].x+ps[i].y+ps[i].z;
return res;

// soa
struct particle_soa
{
  std::vector<int> x,y,z;
  std::vector<int> dx,dy,dz;
};

auto ps=create_particle_soa(n);
long int res=0;
for(std::size_t i=0;i<n;++i)res+=ps.x[i]+ps.y[i]+ps.z[i];
return res;

Page 21

AOS vs SOA: results

2.2×

Page 22

AOS vs SOA: analysis

AOS cache load throughput: 8 useful values per fetch (average)

SOA throughput: 16 useful values per fetch

16/8 = 2 times faster

Yet results are even better than this!

SOA layout (each 64 B line holds 16 useful values):

  x x x x x x x x x x x x x x x x
  y y y y y y y y y y y y y y y y
  z z z z z z z z z z z z z z z z

AOS layout (x, y, z interleaved with the unused dx, dy, dz):

  x y z dx dy dz x y z dx dy dz x y z dx
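The per-fetch arithmetic can be checked mechanically: a particle is six 4-byte ints (24 B), so a 64 B line holds 64/24 ≈ 2.67 particles, of which only x, y, z are read, giving 2.67 × 3 = 8 useful values per fetch on average; in SOA every int on a fetched line is useful. A sketch with static_asserts (the struct mirrors the slide's; 4-byte int and absence of padding are stated assumptions):

```cpp
#include <cstddef>

struct particle { int x,y,z; int dx,dy,dz; };

static_assert(sizeof(int)==4,"slide arithmetic assumes a 4-byte int");
static_assert(sizeof(particle)==24,"six ints, no padding assumed");

constexpr std::size_t line=64; // cache line size in bytes

// AOS: 64/24 particles per line x 3 useful ints each = 192/24 = 8
constexpr std::size_t aos_useful=line*3/sizeof(particle);
// SOA: all 16 ints on the line are useful
constexpr std::size_t soa_useful=line/sizeof(int);

static_assert(aos_useful==8,"8 useful values per AOS fetch on average");
static_assert(soa_useful==16,"16 useful values per SOA fetch");
```

The 16/8 = 2× prediction matches the order of magnitude of the measured 2.2×; the extra margin comes from prefetching, discussed on the next slides.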

Page 23

Compact AOS vs SOA

// aos
struct particle
{
  int x,y,z;
};
using particle_aos=std::vector<particle>;
// ...same as before

// soa
struct particle_soa
{
  std::vector<int> x,y,z;
};
// ...same as before

Page 24

Compact AOS vs SOA: results

2.9× 2×

1.1×

Page 25

Compact AOS vs SOA: analysis

Both AOS and SOA fetch only useful values

Yet SOA is faster (though the advantage shrinks as n grows)

The CPU is prefetching x, y, z in parallel

Working from L1/L2 is better than L1/L2/L3, which is better than reaching all the way out to DRAM

Page 26

Random access in AOS vs SOA

// common setup: random index source
std::mt19937 gen;
std::uniform_int_distribution<std::size_t> rnd(0,n-1);

// aos
auto ps=create_particle_aos(n);
long int res=0;
for(std::size_t i=0;i<n;++i){
  auto idx=rnd(gen);
  res+=ps[idx].x+ps[idx].y+ps[idx].z;
}
return res;

// soa
auto ps=create_particle_soa(n);
long int res=0;
for(std::size_t i=0;i<n;++i){
  auto idx=rnd(gen);
  res+=ps.x[idx]+ps.y[idx]+ps.z[idx];
}
return res;

Page 27

Random access in AOS vs SOA: results

1.3×

Page 28

Random access in AOS vs SOA: analysis

You’re now in a position to make an educated guess

Moral: don’t trust your intuition

Page 29

Parallel count

std::vector<int> v=...;

auto f=[&](int* px,int* py,int* pz,int* pw){
  auto th=[](int* p,int* first,int* last){
    *p=0;
    while(first!=last){
      int x=*first++;
      *p+=x%2;
    }
  };
  std::thread t1(th,px,v.data(),v.data()+n/4);
  std::thread t2(th,py,v.data()+n/4,v.data()+n/2);
  std::thread t3(th,pz,v.data()+n/2,v.data()+n*3/4);
  std::thread t4(th,pw,v.data()+n*3/4,v.data()+n);
  t1.join();t2.join();t3.join();t4.join();
  return *px+*py+*pz+*pw;
};

int res[49];

// near result addresses
return f(&res[0],&res[1],&res[2],&res[3]);

// far result addresses (64 B apart)
return f(&res[0],&res[16],&res[32],&res[48]);

Page 30

Parallel count: results

2.2×

Page 31

False sharing

(diagram: threads T1 and T2 on separate cores, each with private L1 and L2, sharing L3)

Writing to res[0] invalidates cached res[1]

Coherence management requires full write to DRAM

Rule of thumb: use shared write memory only to communicate thread results

Page 32

Branch prediction

Page 33

Filtered sum

std::vector<int> v=...;

// unsorted
long int res=0;
for(int x:v)if(x>128)res+=x;
return res;

// sorted
std::sort(v.begin(),v.end());
long int res=0;
for(int x:v)if(x>128)res+=x;
return res;

Page 34

Filtered sum: results

6.1× 5.9×

Page 35

Instruction pipeline

Increases execution throughput by a factor of n, the number of stages (per-instruction latency stays the same)

Sandy Bridge: 14-17 stages

Ivy Bridge: 14-19 stages

A branch misprediction flushes the partially executed pipeline

(diagram: n-stage pipeline with instructions 1 through n in flight, one per stage)

Page 36

A particularly obnoxious case: bool-based processing

struct particle
{
  bool active;
  int  x,y,z;
  ...
};
std::vector<particle> particles;

void render_particles()
{
  for(const particle& p:particles)if(p.active)render(p);
}

Remove the active flag by laying out objects smartly: keep the active particles at the front

struct particle
{
  int x,y,z;
  ...
};
std::vector<particle> particles;
std::size_t num_active;

void render_particles()
{
  for(std::size_t i=0;i<num_active;++i)render(particles[i]);
}

Better cache density, no branching, fewer elements processed
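The flag-free layout relies on keeping active particles packed at the front. A hedged sketch of the bookkeeping (the function names are illustrative, not from the talk): deactivate by swapping with the last active element, activate by swapping into the active prefix; both are O(1) but do not preserve order among the actives.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct particle { int x,y,z; };

struct particle_system
{
  std::vector<particle> particles; // actives first, inactives after
  std::size_t           num_active=0;

  // Move particles[i] out of the active prefix by swapping it with
  // the last active particle and shrinking the prefix.
  void deactivate(std::size_t i)
  {
    std::swap(particles[i],particles[--num_active]);
  }

  // Bring particles[i] into the active prefix by swapping it with the
  // first inactive particle and growing the prefix.
  void activate(std::size_t i)
  {
    std::swap(particles[i],particles[num_active++]);
  }
};
```

render_particles then simply loops over the first num_active elements, with no branch and no wasted cache lines.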

Page 37

Polymorphic containers

struct base
{
  virtual int f()const=0;
  virtual ~base(){}
};
struct derived1:base{virtual int f()const{return 1;}};
struct derived2:base{virtual int f()const{return 2;}};
struct derived3:base{virtual int f()const{return 3;}};
using pointer=std::shared_ptr<base>;

std::vector<pointer> v=...;

// unsorted
long int res=0;
for(const auto& p:v)res+=p->f();
return res;

// sorted (group objects by dynamic type)
std::sort(v.begin(),v.end(),[](const pointer& p,const pointer& q){
  return std::type_index(typeid(*p))<std::type_index(typeid(*q));
});
long int res=0;
for(const auto& p:v)res+=p->f();
return res;

// poly_collection: more on this later
poly_collection<base> v=...;
long int res=0;
v.for_each([&](const base& x){res+=x.f();});
return res;

Page 38

Polymorphic containers: results

4.3×

1.2×

10.1×

1.6× 18.6×

Page 39

Polymorphic containers: analysis

Sorting improves branch prediction even without data locality

What is this poly_collection thing?

Branch prediction-friendly + optimum data locality

More info on http://tinyurl.com/poly-collection

Cutting-edge compilers: look for devirtualization

vector<shared_ptr<base>>: d1 d3 d1 d2 d3 d3 d1 d2 d1 (types interleaved)

poly_collection<base>: d1 d1 d1 d1 d2 d2 d3 d3 d3 (objects of each type stored contiguously)

Page 40

Concluding remarks

Page 41

Concluding remarks

You can’t just ignore reality if you’re serious about performance

Think about the planet: don’t waste energy

Measure measure measure

(Almost) nothing beats traversing a vector in its natural order

One indirection ≈ one cache miss

If multithreading, beware of false sharing

SOA can get you huge speedups (but not always)

Can you remove flags?

Look for data-oriented design if you want to know more

Be regular: sort your objects

Page 42

Mind the cache

Thank you

github.com/joaquintides/usingstdcpp2015

using std::cpp 2015

Joaquín M López Muñoz <[email protected]> Madrid, November 2015