talk about performance

Talk About Performance @YaroslavBunyak Senior Software Engineer, SoftServe Inc.

Upload: yaroslav-bunyak

Posted on 22-Nov-2014

DESCRIPTION

A talk I gave at the IT Weekend Rivne event two years ago.

TRANSCRIPT

Page 1: Talk About Performance

Talk About Performance

@YaroslavBunyak

Senior Software Engineer, SoftServe Inc.

Page 2: Talk About Performance

What is Performance?

Page 3: Talk About Performance

What is a Program?

data -> xform -> data

Page 5: Talk About Performance

What is a Program?

data -> xform -> data

⬆ THIS!!

Page 8: Talk About Performance

How to Create a Program?

Page 14: Talk About Performance

Simple

Write code
Your favorite programming language: C, C++, Objective-C, Java etc.

Compile
Compiler will transform your code into machine code

Run on target hardware
Hardware is a black box <- Right?

Wrong!

Page 19: Talk About Performance

Bad Programs

Sloppy

Using the program is like trying to swim in jelly

Use memory inefficiently

Battery is dead already

Page 25: Talk About Performance

Good Programs

Run fast
Use little memory
Save battery

I write them!

It was a joke :)

Page 27: Talk About Performance

How to Create a Good Program?

Page 28: Talk About Performance

What is a Program?

data -> xform -> data

Page 31: Talk About Performance

What is a Program?

code

hardware

Page 35: Talk About Performance

Code Sample

int a = ...
int b = ...
// more code...

int c = a + b;

Q: How fast is this code?

A: Depends...

Page 40: Talk About Performance

Code Sample

int a = ...
int b = ...
// more code...

int c = a + b;

... on how fast the CPU adds two integers?

NO

Any modern CPU can add integers very fast!
~1 cycle

Page 45: Talk About Performance

Code Sample

int a = ...
int b = ...
// more code...

int c = a + b;

... on whether 'a' and 'b' are ready for processing,
i.e. loaded into CPU registers

Load data from memory into a register:
~600 cycles

Page 49: Talk About Performance

Code Sample

int a = ...
int b = ...
// more code...

int c = a + b;

Q: What is the CPU doing in the meantime?

A: Nothing! It’s waiting for data

Page 53: Talk About Performance

You Ask

Can we do better?

Yes. And your hardware will help you

Page 54: Talk About Performance

CPU

Page 59: Talk About Performance

CPU Operation

Load & decode instruction(s)

Load data
memory -> registers

Execute instruction(s)

Store results
registers -> memory

Page 67: Talk About Performance

(Not) Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       -          instr. 1   -          -          -
3       -          -          instr. 1   -          -
4       -          -          -          instr. 1   -
5       -          -          -          -          instr. 1
6       instr. 2   -          -          -          -
7       -          instr. 2   -          -          -

Page 76: Talk About Performance

Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       instr. 2   instr. 1   -          -          -
3       instr. 3   instr. 2   instr. 1   -          -
4       instr. 4   instr. 3   instr. 2   instr. 1   -
5       -          instr. 4   instr. 3   instr. 2   instr. 1
6       -          -          instr. 4   instr. 3   instr. 2
7       -          -          -          instr. 4   instr. 3
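
The difference between the (Not) Pipeline and Pipeline slides comes down to a little arithmetic. A minimal sketch, assuming an idealized pipeline where every stage takes exactly one cycle and nothing stalls or gets flushed; the function names are illustrative and not from the talk:

```cpp
#include <cassert>
#include <cstdint>

// Without pipelining, an instruction must leave the last stage
// before the next one may enter the first stage.
std::uint64_t cycles_without_pipeline(std::uint64_t instructions, std::uint64_t stages) {
    return instructions * stages;
}

// With an ideal pipeline, a new instruction enters the first stage every
// cycle, so after the first instruction fills the pipeline, one instruction
// completes per cycle.
std::uint64_t cycles_with_pipeline(std::uint64_t instructions, std::uint64_t stages) {
    if (instructions == 0) return 0;
    return stages + (instructions - 1);
}

// Worked example: 5 stages (IL, ID, DL, EX, DS) and 4 instructions.
void pipeline_example() {
    assert(cycles_without_pipeline(4, 5) == 20);
    assert(cycles_with_pipeline(4, 5) == 8);
}
```

Real CPUs fall short of the ideal number whenever the pipeline has to wait for data or is flushed.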

Page 80: Talk About Performance

Branch Prediction

if (day == Monday)          // 1  <- What instruction to load & decode next?
    dose = kDouble;         // 2  <- two or
else
    dose = kStandard;       // 3  <- three?

make_coffee(dose);          // 4

Page 84: Talk About Performance

Branch Prediction

if (day == Monday)          // 1
    dose = kDouble;         // 2
else
    dose = kStandard;       // 3

make_coffee(dose);          // 4

CPU will try to predict and start load & decode

If it was wrong: discard results, flush pipeline

Page 93: Talk About Performance

Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       instr. 2   instr. 1   -          -          -
3       instr. 4   instr. 2   instr. 1   -          -
4       -          instr. 4   instr. 2   instr. 1   -          <- instr. 1 executed, prediction was correct
5       -          -          instr. 4   instr. 2   instr. 1
6       -          -          -          instr. 4   instr. 2
7       -          -          -          -          instr. 4

Page 101: Talk About Performance

Pipeline

        pipeline stage
cycle   IL         ID         DL         EX         DS
1       instr. 1   -          -          -          -
2       instr. 2   instr. 1   -          -          -
3       instr. 4   instr. 2   instr. 1   -          -
4       -          instr. 4   instr. 2   instr. 1   -          <- instr. 1 executed, wrong prediction detected
5       instr. 3   -          -          -          instr. 1
6       instr. 4   instr. 3   -          -          -
7       -          instr. 4   instr. 3   -          -

Page 105: Talk About Performance

Takeaways

Branches are bad for the pipeline
Avoid if possible
Help branch predictor to help you
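
One way to read "avoid if possible" is to replace a data-dependent jump with plain arithmetic. A minimal sketch, with hypothetical function names that are not from the talk; both functions return the same count:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branchy version: one conditional jump per element. If the data is
// random, the branch predictor has little to work with.
std::size_t count_above_branchy(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int x : values) {
        if (x > threshold) {
            ++count;
        }
    }
    return count;
}

// Branch-free formulation: the comparison yields false/true, which converts
// to 0/1 and is added directly, so the source has no data-dependent jump.
std::size_t count_above_branchless(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int x : values) {
        count += static_cast<std::size_t>(x > threshold);
    }
    return count;
}

void branch_example() {
    const std::vector<int> values{1, 5, 3, 9, 7};
    assert(count_above_branchy(values, 4) == 3);
    assert(count_above_branchless(values, 4) == 3);
}
```

Whether the second form is actually faster depends on the compiler and the data. An optimizing compiler may emit the same machine code for both, so measure on the target hardware. Sorting or grouping the data so that the branch goes the same way many times in a row is another way to help the predictor.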

Page 106: Talk About Performance

Memory

Page 110: Talk About Performance

Workflow

Program data is stored in memory
CPU requests data for processing
Typical cycle: load, process, store

Page 111: Talk About Performance

Architecture

[Diagram: CPU, Memory Controller, Memory Banks]

Page 119: Talk About Performance

Parameters

There are two main parameters of the memory subsystem:

latency
bandwidth

Page 122: Talk About Performance

Latency

Shows how much time passes between a data request and its delivery
Very important concept (see further)

Page 125: Talk About Performance

Bandwidth

Shows how much data can be accessed per second
Also important

Page 126: Talk About Performance

History Lesson

                          VAX-11 (1980)   Modern Desktop   Improvement
Clock Speed, MHz          6               3000             +500x
Memory Size, MB           2               2000             +1000x
Memory Bandwidth, MB/s    13              7000             +540x
Memory Latency, ns        225             70               +3x
Memory Latency, cycles    1.4             210              -150x

Data from “Machine Architecture” talk by Herb Sutter
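
The "Memory Latency, cycles" row follows from the clock speed and the latency in nanoseconds. A small sketch of that conversion; the function name is illustrative and not from the talk:

```cpp
#include <cassert>

// cycles = ns * (MHz * 1e6 cycles per second) / (1e9 ns per second)
//        = ns * MHz / 1000
double latency_in_cycles(double latency_ns, double clock_mhz) {
    return latency_ns * clock_mhz / 1000.0;
}

void latency_example() {
    // VAX-11: 225 ns at 6 MHz is 1.35 cycles, which the table rounds to 1.4.
    const double vax = latency_in_cycles(225.0, 6.0);
    assert(vax > 1.34 && vax < 1.36);

    // Modern desktop: 70 ns at 3000 MHz is 210 cycles.
    const double modern = latency_in_cycles(70.0, 3000.0);
    assert(modern > 209.9 && modern < 210.1);
}
```

Latency in nanoseconds improved about 3x, but the clock got about 500x faster, so the same wait now costs roughly 150x more cycles.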

Page 130: Talk About Performance

History Lesson

For the past 30+ years we have seen huge improvements in CPU processing power and data sizes
... but memory speeds couldn’t keep up with the progress

Page 133: Talk About Performance

Takeaways

Latency is the king!
You can trade CPU time for memory accesses, i.e. calculate more, load/store less

Page 138: Talk About Performance

Memory types

There are two main memory types:

Static RAM - fast, but very expensive
Dynamic RAM - slow, but cheaper

Which one to use?

Page 141: Talk About Performance

Solution

Build a memory hierarchy that uses large amounts of cheap DRAM storage and small amounts of fast SRAM cache

Page 144: Talk About Performance

Memory Hierarchy

Memory
L2 Cache
L1i/L1d

iPhone 4s:
32 KB L1i
32 KB L1d
1 MB L2
512 MB DRAM

Access:
registers - 1 cycle
L1 - 5 cycles
L2 - 40 cycles
DRAM - 610 cycles

Page 147: Talk About Performance

Cache Miss

If data requested by the CPU is not in the cache, it has to be loaded from the (slow) main memory

Page 150: Talk About Performance

Cache Line

Minimum amount of data that can be read from and written to memory
Usually 64-128 bytes

Page 155: Talk About Performance

Cache Line

What does it mean?

Consider you have an array of 16 floats and you want the first float for calculations
If it’s not in the cache already, you will pay the “full price” to load the entire cache line
Then you access the remaining 15 floats “for free”
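
The 16-float example works out exactly when a float is 4 bytes and a cache line is 64 bytes, since 16 * 4 = 64. A minimal sketch under those assumptions, with a hypothetical helper that counts how many cache lines a run of consecutive elements touches:

```cpp
#include <cassert>
#include <cstddef>

// Number of cache lines touched when reading `count` consecutive elements of
// `elem_size` bytes. `start_offset` is the byte offset of the first element
// from a cache-line boundary (0 means the array is line-aligned).
std::size_t cache_lines_touched(std::size_t start_offset, std::size_t count,
                                std::size_t elem_size, std::size_t line_size) {
    if (count == 0) return 0;
    const std::size_t first_line = start_offset / line_size;
    const std::size_t last_line = (start_offset + count * elem_size - 1) / line_size;
    return last_line - first_line + 1;
}

void cache_line_example() {
    // Assumed sizes: 4-byte float, 64-byte cache line.
    // Reading only the first float still brings in one full line...
    assert(cache_lines_touched(0, 1, 4, 64) == 1);
    // ...and all 16 floats live in that same line when the array is aligned.
    assert(cache_lines_touched(0, 16, 4, 64) == 1);
    // The same array starting in the middle of a line straddles two lines.
    assert(cache_lines_touched(32, 16, 4, 64) == 2);
}
```

The last case shows why the "for free" part depends on alignment: a misaligned array costs a second miss.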

Page 161: Talk About Performance

Prefetch

Modern CPUs and compilers are able to detect memory access patterns and preload data into caches speculatively
So, data will be ready when you need it
But your data access patterns must be very simple - linear is a good one

BTW, the C++ operator-> is sometimes referred to as the “cache miss” operator

Can you guess why?
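
One common reading of the operator-> remark, sketched with hypothetical names that are not from the talk: a linear walk over contiguous memory has predictable addresses, while following pointers does not.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Linear access: element addresses grow by a fixed stride, which is the
// kind of simple pattern a hardware prefetcher can follow.
long long sum_contiguous(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

struct Node {
    int value;
    const Node* next;
};

// Pointer chasing: the address of the next node is known only after the
// current node has been loaded, and nodes may live anywhere in memory.
// Every `n->next` is a load that can miss the cache.
long long sum_linked(const Node* head) {
    long long sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        sum += n->value;
    }
    return sum;
}

void access_pattern_example() {
    const std::vector<int> values{1, 2, 3, 4};
    assert(sum_contiguous(values) == 10);

    const Node c{3, nullptr};
    const Node b{2, &c};
    const Node a{1, &b};
    assert(sum_linked(&a) == 6);
}
```

Both functions do the same amount of arithmetic. Any speed difference on real data comes from how the memory is laid out, so it is worth measuring on the target hardware.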

Page 163: Talk About Performance

How to Create a Good Program?

Page 167: Talk About Performance

Simple

Know your target hardware
Know your data
Use your brain

Page 169: Talk About Performance

One More Thing...

Data-Oriented Design

Page 170: Talk About Performance

Thank You!

Page 171: Talk About Performance

Questions?

Page 172: Talk About Performance

References

Ulrich Drepper, “What Every Programmer Should Know About Memory”
Kris Kaspersky, “Техника оптимизации программ. Эффективное использование памяти” (Program Optimization Techniques: Efficient Memory Usage)
@mike_acton