pragmatic optimization in modern programming - modern computer architecture concepts

29
1 PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING MODERN COMPUTER ARCHITECTURE CONCEPTS Created by for / 2015-2016 Marina (geek) K olpakova UNN

Upload: marina-kolpakova

Post on 05-Apr-2017

23 views

Category:

Education


1 download

TRANSCRIPT

1

PRAGMATICOPTIMIZATION

IN MODERN PROGRAMMINGMODERN COMPUTER ARCHITECTURE CONCEPTS

Created by for / 2015-2016Marina (geek) K olpakova UNN

2

COURSE TOPICSOrdering optimization approachesDemystifying a compilerMastering compiler optimizationsModern computer architectures concepts

3

OUTLINEThree aspects of the computer architectureLatency vs Throughput architecturesArchitecture families

CISCRISCVLIWVector

Why is it doing to be load/store?Latest trendssummary

4 . 1

1-ST ASPECT OF COMPUTER ARCHITECTUREInstruction Set Architecture or ISA (interface)

is a contract between HW and SW, which speci�es right, possibilities & limitations.

Class of ISA (load-store, register-memory)Memory addressing modes & rules (base-immediate,alignment requirements)Types & sizes of operands (size of byte, short)Operations (general arithmetic, control, logical)Control �ow instructions (branches, jumps, calls, returns)Encoding an ISA (�xed or variable length)All the conceptual aspects of the architecture

4 . 2

2-ND ASPECT OF COMPUTER ARCHITECTUREMicroarchitecture (organization) is a concrete

implementation of the ISA, the high-level aspects of aprocessor design (memory system, memory interconnect,

design of the processor internals).

Pipeline widthInstruction latenciesIssue wight and schedulingSpeculation capabilitiesAll the concrete aspects of the architecture

4 . 3

3-ND ASPECT OF COMPUTER ARCHITECTUREHardware or chip (design) is the speci�cs of a computer,

including the logic design and packaging. This is a concreteimplementation of the microarchitecture.

Tech-processClock ratesOn die placementAll the properties of the chip

5 . 1

ARCHITECTURE: ARMV8-Aμarch IP Hardware

Cortex-A53 ARM Octa Exynos 7(7580) 1.6GHz28nmHKMG

Cortex-A57 ARM Octa Exynos 7(7420) big.LITTLE 2.1/1.514FF ( LPE) (Samsung)

Cortex-A72 ARM Deca MediaTek Helio X20 big.LITTLE

2.5/2.0/1.4 20nmHKMG (TSMC)

Cortex-A35 ARM -

5 . 2

ARCHITECTURE: ARMV8-Aμarch IP Hardware

Denver NVIDIA Dual Tegra K1 2.3GHz28nmHPM

Kryo Qualcomm Tetra S820 big.LITTLE 2.2/1.6GHz14FF ( LPP) (Samsung)

Exynos M1 Samsung Quad Exynos 8890 big.LITTLE

2.6/2.29GHz 14FF ( LPP)(Samsung)

5 . 3

ARCHITECTURE: ARMV8-Aμarch IP Hardware

Cyclone Apple Dual A7 (APL0698) 1.4GHz28nmHKMG (Samsung)

Typhoon Apple Dual A8 (APL1012) 1.3GHz20nmHKMG (TSMC)

Twister Apple Dual A9 (APL0898) 1.85GHz16nmFF+ (TSMC)

Dual A9 (APL1022) 1.85GHz14nmFF( LPP) (Samsung)

6 . 1

LATENCY VS THROUGHPUT ARCHITECTURESLatency oriented architecture

addresses latency hiding issues;features sophisticated pipelining;out-of-order;employs advanced cache hierarchies;widely uses speculation.Compute cores occupy only a small part of a die.

6 . 2

LATENCY VS THROUGHPUT ARCHITECTURESThroughput oriented architecture

performs a bunch of operations in �y;features many simple compute units/cores;employs simple pipelines and large register �le toprovide a low-cost thread scheduling;uses wide basses, tiling, programmable local memory.Compute cores occupy most part of a die.

7

KEY ARCHITECTURE FAMILIESRISC

Reduced Instruction Set Computer

CISCComplex Instruction Set Computer

VLIWVery Long Instruction Word

Vector architecture

8 . 1

CISCComplex Instruction Set Computer

Designed in the 1970s which was a time where transistorswere expensive while compilers were naive. Additionally,instruction packaging was the main concern due toshortage of memory. The latency of the memory was just abit higher then registers.The goal was to de�ne an instruction set that allows highlevel language constructs be translated into as fewassembly language instructions as possible, improvingperformance as well as code density.Examples are VAX, x86, AMD64.Latency-oriented architecture.

8 . 2

CISCHeritage

instructions access memory, a plenty of addressing modes,many instruction families and a very rich variable lengthISA (alignment counts!),consequently, complicated instruction decoding logic.Moreover, a few registers are available for programmers.

Nowadays1. Instructions are broken down into μcode which are

much easy to pipeline and process power ef�ciently.2. Transistors are spent to cache hierarchies, out-of-order

execution, large RB and speculation to eliminate stalls.3. Symmetric multi-processing.

9 . 1

RISCReduced Instruction Set Computer

Designed in the 1980s which was a time there IPL was thegreat concern. The memory-processor gap already beganto come out.The goal was to decrease the number of clocks perinstruction (CPI) while pipeline instructions as much aspossible employing hardware to help with it. Uniform ISA,pipelining and large register �le is a must-have.Examples are MIPS, ARM, PowerPC.Latency-oriented architecture.

9 . 2

RISCHeritage

Relatively few instructions, all are the same length.Only load and store instructions access memory.Large resister �le than typical CISC processors have.No μcode

Nowadays Most architectures that comes from RISC arecalled Load-Store architectures, while may employ μops.They combine concepts of a classic RISC with usage ofmodern hardware enhancements: 1. deep pipelines, multi-cycle instructions,2. out-of-order execution,3. speculation.

10

THE HARDWARE/SOFTWARE GAPCompiler

analyzes control �ow, analyzes dependencyschedules instructionsmaps variables to limited register set

Hardware

analyzes control �ow, analyzes dependencyschedules instructionsremaps ISA register to large internal register set

11

A WORD TOWARDS REGISTERSIn deed, registers are temporary storage locations inside the

processor that hold data and addresses.

Local variables are not the same as registers in ISA, sincecompiler uses IR internally and does register allocationclose to the end of optimization process.Registers provided by ISA is not the same as actualregisters on the processor. Internal reorder buffers whichhold decoded instruction parameters and intermediateresults are closer to classic de�nition of a register �le.

12 . 1

VLIWVery Long Instruction Word

Designed in the 1980s which was a time there IPL was thegreat concern.The goal was to pipeline instructions as much as possibleemploying software to help with it reducing complexity ofthe hardware and mitigate the the Hardware/Softwaregap. Boost processor clock simplifying work per cycle.Example is Intel HP Itanium.Throughput-oriented architecture

12 . 2

VLIWHeritage

Compiler determines which instructions can beperformed in parallel,bundles this information and the instructions,and passes the bundle(word) to the hardware.No data dependencies between instructions in a word.Each operation in a word assigned to speci�c issue slot(dedicated FU).

12 . 3

VLIWNowadays

hardly any generic processor implements VLIWbrunchy nature of production codes (in contrast toHPC or scienti�c codes),need to follow binary compatibility across theμarchitecture families.

Whereas architecture is widely adopted forprogrammable co-processors where shrink in powerconsumption without lose of performance is crucial(DSP, GPU).

13 . 1

VECTOR PROCESSORSFirst introduced in 1976 and dominated for HPC in the1980s because of high instruction throughput.The goal was to perform operations on vectors of dataexposing data level parallelism (DLP) to increaseinstruction throughput. Vector pipelining is also calledchaining.Example is CrayThroughput-oriented architectures.

13 . 2

VECTOR PROCESSORSHeritage

Process the data in vectors, each element in a vector(lane) is independent on any other.Deep pipelines, wide execution units, not necessary thesame width (batch length) as size of vector in elements(vector length).Most ef�cient for simple memory patterns, butgetter/scatter is usually possible too.Wide memory interfaces to saturate execution units.Large vector register �le, cache is not a strictrequirement and absent for classical vector processors.

13 . 3

VECTOR PROCESSORSNowadays

They aren't used in generic processors design, but usedas a co-processors for a speci�c workloads: HPC,multimedia.Precursors of most designs of modern GPUs.Vector pipes with short vector length (8-16 bytes) calledSIMD units are widely integrated in modern generalpurpose processors to accelerate most demandingloops.

14 . 1

WHY IS IT DOING TO BE RISC LOAD/STORE?1. Simple �xed-width instructions & few addressing modes

Cache-ef�cient instruction fetch, branches are aligned.Simple hardware logic → power ef�cient chips.Drive a higher clock rate.

2. Concise ISA with orthogonal functionalityComplex instructions are ignored by compilers due tosemantic gap → simple instructions simplify scheduling.Complex addressing lead either to variable lengthinstructions or big instruction size → inef�cientdecoding and scheduling as well as alignment issues.

3. Large register setExpose possible instruction parallelism to the compiler.

15

LATEST TRENDSArchitecture is seen as Load-store RISC-inheritedInternally instructions are broken down into single-pipeμopsμops are reordered and optionally organized into wordsμops or words are scheduled for execution, caching in thehighest level is usually performed on this preprocessedview.Latest generations of Intel processors, NVIDIA Denverarchitecture and 64-bit ARM Cortex-A processors alreadyemploy this approach.

16 . 1

SUMMARYThere are three key aspects of computer architecture:Instruction Set Architecture, μarchitecture and design.Some architectures aim to hide latency while others aim tomaximize instruction throughput.CISC is created for compact code size and exact instructionencoding and used only on ISA level nowadays.RISC leads to less complicated decoding and pipeline stagesallow boosting clock in affordable power budget.VLIW targets power ef�cient high performance devices forspeci�c tasks or used internally on μarchitecture level.Vector processors transformed into SIMD-extensions andSIMT-like GPU designs.

16 . 2

SUMMARYLoads-Store architectres with its simple �xed-widthinstructions, few addressing modes, concise ISA and optimal size register size is a winner solution.Architecture can expose different properties for it's different levels (ISA, μarchitecture).

17

THE END

/ 2015-2016MARINA KOLPAKOVA