talk about performance
DESCRIPTION
The talk I gave at the IT Weekend Rivne event 2 years ago.
TRANSCRIPT
Talk About Performance
@YaroslavBunyak
Senior Software Engineer, SoftServe Inc.
What is Performance?
What is a Program?
data -> xform -> data
⬆ THIS!!
How to Create a Program?
Simple
Write code: your favorite programming language (C, C++, Objective-C, Java, etc.)
Compile: the compiler will transform your code into machine code
Run on target hardware: hardware is a black box <- Right?
Wrong!
Bad Programs
Sloppy
Using the program is like trying to swim in jelly
Use memory inefficiently
Battery is dead already
Good Programs
Run fast
Use little memory
Save battery
I write them!
It was a joke :)
How to Create a Good Program?
What is a Program?
data -> xform -> data
code
hardware
Code Sample
int a = ...
int b = ...
// more code...
int c = a + b;
Q: How fast is this code?
A: Depends...
... on how fast the CPU adds two integers? NO
Any modern CPU can add integers very fast! ~1 cycle
... on whether 'a' and 'b' are ready for processing, i.e. loaded into CPU registers
Load data from memory into a register: ~600 cycles
Q: What is the CPU doing in the meantime?
A: Nothing! It's waiting for data
You Ask
Can we do better?
Yes. And your hardware will help you
CPU
CPU Operation
Load & decode instruction(s)
Load data: memory -> registers
Execute instruction(s)
Store results: registers -> memory
(Not) Pipeline
Rows are cycles, columns are pipeline stages (IL = instruction load, ID = instruction decode, DL = data load, EX = execute, DS = data store)
cycle  IL        ID        DL        EX        DS
1      instr. 1  -         -         -         -
2      -         instr. 1  -         -         -
3      -         -         instr. 1  -         -
4      -         -         -         instr. 1  -
5      -         -         -         -         instr. 1
6      instr. 2  -         -         -         -
7      -         instr. 2  -         -         -
Pipeline
Rows are cycles, columns are pipeline stages
cycle  IL        ID        DL        EX        DS
1      instr. 1  -         -         -         -
2      instr. 2  instr. 1  -         -         -
3      instr. 3  instr. 2  instr. 1  -         -
4      instr. 4  instr. 3  instr. 2  instr. 1  -
5      -         instr. 4  instr. 3  instr. 2  instr. 1
6      -         -         instr. 4  instr. 3  instr. 2
7      -         -         -         instr. 4  instr. 3
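The difference between the two tables comes down to simple arithmetic. Here is a minimal sketch, assuming an idealized 5-stage pipeline with no stalls or flushes; the function names are made up for illustration and are not from the talk.

```cpp
#include <cstdint>

// Number of pipeline stages in the tables: IL, ID, DL, EX, DS.
constexpr std::uint64_t kStages = 5;

// Without pipelining, an instruction passes through all stages
// before the next instruction may start.
constexpr std::uint64_t cycles_not_pipelined(std::uint64_t instructions) {
    return instructions * kStages;
}

// With an ideal pipeline, the first instruction needs kStages cycles
// and every following instruction completes one cycle later.
constexpr std::uint64_t cycles_pipelined(std::uint64_t instructions) {
    return instructions == 0 ? 0 : kStages + (instructions - 1);
}
```

For the four instructions in the table this gives 20 cycles without pipelining and 8 cycles with it.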
Branch Prediction
if (day == Monday)        // 1
    dose = kDouble;       // 2
else
    dose = kStandard;     // 3

make_coffee(dose);        // 4
After instruction 1, which instruction should be loaded & decoded next: two or three?
The CPU will try to predict and start load & decode
If it was wrong: discard results, flush pipeline
Pipeline
Rows are cycles, columns are pipeline stages
cycle  IL        ID        DL        EX        DS
1      instr. 1  -         -         -         -
2      instr. 2  instr. 1  -         -         -
3      instr. 4  instr. 2  instr. 1  -         -
4      -         instr. 4  instr. 2  instr. 1  -         <- instr. 1 executed, prediction was correct
5      -         -         instr. 4  instr. 2  instr. 1
6      -         -         -         instr. 4  instr. 2
7      -         -         -         -         instr. 4
Pipeline
Rows are cycles, columns are pipeline stages
cycle  IL        ID        DL        EX        DS
1      instr. 1  -         -         -         -
2      instr. 2  instr. 1  -         -         -
3      instr. 4  instr. 2  instr. 1  -         -
4      -         instr. 4  instr. 2  instr. 1  -         <- instr. 1 executed, wrong prediction detected
5      instr. 3  -         -         -         instr. 1
6      instr. 4  instr. 3  -         -         -
7      -         instr. 4  instr. 3  -         -
Takeaways
Branches are bad for the pipeline
Avoid if possible
Help branch predictor to help you
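One way to "avoid if possible" is to replace a data-dependent branch with plain arithmetic. The sketch below is only an illustration, not code from the talk; both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Branchy version: the CPU has to guess the outcome of the `if`
// for every element before the comparison has been executed.
std::size_t count_above_branchy(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        if (v > threshold) {
            ++count;
        }
    }
    return count;
}

// Branch-free version: the comparison result converts to 0 or 1
// and is added unconditionally, so the loop body has no branch to predict.
std::size_t count_above_branchless(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        count += static_cast<std::size_t>(v > threshold);
    }
    return count;
}
```

Whether the second version is actually faster depends on the compiler, the CPU and the data. An optimizing compiler may emit the same machine code for both, so measure before committing to either.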
Memory
Workflow
Program data is stored in memory
CPU requests data for processing
Typical cycle: load, process, store
Architecture
CPU <-> Memory Controller <-> Memory Banks
Parameters
There are two main parameters of the memory subsystem:
latency
bandwidth
Latency
Shows how much time passes between a data request and its delivery
Very important concept (see further)
Bandwidth
Shows how much data can be accessed per second
Also important
History Lesson
                         VAX-11 (1980)   Modern Desktop   Improvement
Clock Speed, MHz         6               3000             +500x
Memory Size, MB          2               2000             +1000x
Memory Bandwidth, MB/s   13              7000             +540x
Memory Latency, ns       225             70               +3x
Memory Latency, cycles   1.4             210              -150x
Data from "Machine Architecture" talk by Herb Sutter
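The "Memory Latency, cycles" row follows from the clock speed and the latency in nanoseconds. A small sketch of that arithmetic; the function name is hypothetical.

```cpp
// Converts a memory latency in nanoseconds into CPU clock cycles.
// At clock_mhz megahertz one cycle lasts 1000 / clock_mhz nanoseconds,
// so the latency spans latency_ns * clock_mhz / 1000 cycles.
constexpr double latency_in_cycles(double latency_ns, double clock_mhz) {
    return latency_ns * clock_mhz / 1000.0;
}
```

With the table's numbers, 225 ns at 6 MHz is 1.35 cycles (shown as 1.4 in the table) and 70 ns at 3000 MHz is 210 cycles.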
For the past 30+ years we saw huge improvements in CPU processing power and data sizes ... but
Memory speeds couldn't keep up with the progress
Takeaways
Latency is the king!
You can trade CPU time for memory, i.e. calculate more - load/store less
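A tiny illustration of "calculate more - load/store less": rather than storing a derived value next to the data, recompute it when needed. The types below are invented for this sketch and are not from the talk.

```cpp
#include <cstdint>

// Variant A: keeps a cached copy of a derived value,
// so every rectangle occupies 12 bytes of memory.
struct RectCached {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t area;  // stored copy of width * height
};

// Variant B: 8 bytes per rectangle; the area is recomputed on demand.
struct Rect {
    std::uint32_t width;
    std::uint32_t height;
};

// One multiplication replaces one extra field that would have to be
// loaded from (and stored to) memory.
constexpr std::uint64_t area(const Rect& r) {
    return static_cast<std::uint64_t>(r.width) * r.height;
}

static_assert(sizeof(Rect) < sizeof(RectCached),
              "the recomputing variant should be the smaller one");
```

The smaller struct means more rectangles fit into each chunk of memory the CPU loads, and the extra multiplication costs far less than a trip to main memory. Whether this trade pays off for a particular program is something to measure.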
Memory types
There are two main memory types:
Static RAM - fast, but very expensive
Dynamic RAM - slow, but cheaper
Which one to use?
Solution
Build a memory hierarchy which utilizes large amounts of cheap DRAM storage and small amounts of fast SRAM cache
Memory Hierarchy
Memory
L2 Cache
L1i/L1d
iPhone 4s:
32 KB L1i, 32 KB L1d
1 MB L2
512 MB DRAM
Access:
registers - 1 cycle
L1 - 5 cycles
L2 - 40 cycles
DRAM - 610 cycles
Cache Miss
If data requested by the CPU is not in the cache, it has to be loaded from the main (slow) memory
Cache Line
Minimum amount of data that can be read from and written to memory
Usually 64-128 bytes
What does it mean?
Consider you have an array of 16 floats and you want the first float for calculations
If it's not in cache already, you will pay the "full price" to load the entire cache line
Access the remaining 15 floats "for free"
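To make the "16 floats" example concrete, the sketch below counts how many distinct cache lines an access pattern touches. It assumes 64-byte cache lines and models memory as plain byte offsets; the helper is hypothetical and only does arithmetic, it does not measure real hardware.

```cpp
#include <cstddef>
#include <set>

// Assumption: 64-byte cache lines (the slide gives 64-128 bytes as typical).
constexpr std::size_t kCacheLineBytes = 64;

// Counts the distinct cache lines touched when reading `count` elements of
// `elem_bytes` bytes each, starting at byte offset `first_byte` and advancing
// `stride_elems` elements between reads.
std::size_t cache_lines_touched(std::size_t first_byte, std::size_t count,
                                std::size_t elem_bytes, std::size_t stride_elems) {
    if (elem_bytes == 0) {
        return 0;  // nothing is read
    }
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t begin = first_byte + i * stride_elems * elem_bytes;
        const std::size_t end = begin + elem_bytes - 1;  // last byte of the element
        // An element may straddle a line boundary, so record every line it covers.
        for (std::size_t line = begin / kCacheLineBytes;
             line <= end / kCacheLineBytes; ++line) {
            lines.insert(line);
        }
    }
    return lines.size();
}
```

Sixteen consecutive 4-byte floats starting at a line boundary touch a single line: `cache_lines_touched(0, 16, 4, 1)` is 1. Reading sixteen floats that are each 16 elements apart touches sixteen lines: `cache_lines_touched(0, 16, 4, 16)` is 16, so the "full price" is paid every time.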
Prefetch
Modern CPUs and compilers are able to detect memory access patterns and preload data in caches speculatively
So, data will be ready when you need it
But your data access patterns must be very simple - linear is a good one
BTW, C++ operator-> is sometimes referred to as the "cache miss" operator
Can you guess why?
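A sketch of the two access patterns in question, with hypothetical names. The first walks contiguous memory in order, which is the simple linear pattern a prefetcher can follow. The second follows pointers with operator->, so the address of the next element is only known once the current node has been loaded.

```cpp
#include <numeric>
#include <vector>

// Linear access: the elements sit next to each other in memory.
long long sum_contiguous(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Pointer chasing: nodes may live anywhere in memory.
struct Node {
    int value;
    const Node* next;
};

long long sum_linked(const Node* head) {
    long long total = 0;
    for (const Node* node = head; node != nullptr; node = node->next) {
        total += node->value;  // each `->` may land on a line that is not cached
    }
    return total;
}
```

Both functions compute the same kind of sum; the difference is only in how predictable the memory accesses are. How large the gap is in practice depends on the allocator, the data size and the hardware, so treat this as an illustration rather than a benchmark.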
How to Create a Good Program?
Simple
Know your target hardware
Know your data
Use your brain
One More Thing...
Data-Oriented Design
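The slide only names the idea. One common illustration of data-oriented thinking, not taken from the talk, is to lay data out according to how it is processed. The sketch below contrasts an array of structures with a structure of arrays; all types and fields are invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: updating positions also pulls `health` and
// `name_id` through the cache, although the loop never uses them.
struct Particle {
    float x;
    float velocity;
    int health;
    int name_id;
};

void move_aos(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        p.x += p.velocity * dt;
    }
}

// Structure of arrays: the loop reads only the two arrays it needs,
// and it reads them linearly.
struct Particles {
    std::vector<float> x;
    std::vector<float> velocity;
    std::vector<int> health;
    std::vector<int> name_id;
};

// Assumes `x` and `velocity` have the same length.
void move_soa(Particles& particles, float dt) {
    for (std::size_t i = 0; i < particles.x.size(); ++i) {
        particles.x[i] += particles.velocity[i] * dt;
    }
}
```

Both versions compute the same positions. Which layout is faster for a given program depends on which fields are used together, which is the "know your data" point from the previous slide.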
Thank You!
Questions?
References
Ulrich Drepper, "What Every Programmer Should Know About Memory"
Kris Kaspersky, "Техника оптимизации программ. Эффективное использование памяти" (in Russian; "Program Optimization Techniques. Efficient Memory Usage")
@mike_acton