how it's made: c++ compilers (gcc)

How it's made C++ compilers Created by Sławomir Zborowski

Upload: szborows

Post on 24-Dec-2014


DESCRIPTION

Presentation slides about internals of GCC C++ compiler. It covers transformation from source code to output binary, compiler optimizations, register transfer language, etc.

TRANSCRIPT

How it's made: C++ compilers
Created by Sławomir Zborowski

Average journalist documents what has happened. Good journalist explains why that happened.
- Mariusz Max Kolonko

Agenda
GCC
Preprocessor
Compiler
    Front-end, AST
    Middle-end, optimization passes
    Back-end, RTL
Linker
Tests

GCC - compilation controller
Why GCC?
    Because we use it
    Multiple languages: C, C++, Fortran, Java, Mercury, …
    Multiple architectures: ARM, MN10300, PDP-10, AVR32, …

Before we go
What happens when developers design a logo?

"Do what you do best and outsource the rest"

GCC - compilation controller
cc1 - preprocessor and compiler
    Output → AT&T/Intel assembler file (*.s)
    Use -E flag to preprocess only
    Use -S flag to preprocess and compile
as - assembler (from binutils)
    Output → object file (*.o)
    Use -c flag to skip the linker
collect2 - linker
    Output → shared object/ELF (*.so, *)

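The preprocessor
Entry-point
Almost no safety
The standard defines interesting minimum limits
    Min. #include nesting levels - 15
    Min. number of macros in one translation unit - 4095
    Min. number of characters in a logical source line - 4095
    (these are the C99 figures; Annex B of the C++ standard recommends larger ones, e.g. 256 nesting levels)
GCC preprocessor is limited by memory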

Preprocessor on steroids
People use the preprocessor to do a variety of things
Usually, it is just a bad habit
Some people use more than one preprocessor :-)

@Gynvael Coldwind
    float fast_sin(int deg) {
        static const float sin_table[] = { <?php
            // PHP's sin() takes radians, so convert each degree first
            for($i = 0; $i < 359; $i++)
                echo(sin(deg2rad($i)) . ", ");
            echo(sin(deg2rad($i)));
        ?> };
        return sin_table[deg % 360];
    }

    php my.c | gcc -x c -
Hmm... good idea, but kind of naïve. Surely we can do better!

Let's replace the preprocessor
Example motivation: diab & #pragma once

Time to hack
    #!/usr/bin/env python
    import random, re, subprocess, sys; x = sys.argv

    try:
        # input file follows -D_GNU_SOURCE, output file follows -o
        i,o = x[x.index('-D_GNU_SOURCE')+1], x[x.index('-o')+1] + '_'
        if not re.search(r'\.hp?p?$', i): raise RuntimeError
        # random include-guard name derived from the file name
        g = '_{0}_{1}'.format(random.randrange(2**32), i.replace('.', '_'))
        with open(i) as h, open(o, 'w') as f:
            f.write('#ifndef {0}\n#define {0}\n{1}\n#endif'.format(
                g, h.read().replace('#pragma once', '')))
        n = [[e,o][e==i] for e in x[1:]]
    except (ValueError, RuntimeError): n = x[1:][:]
    # hand over to the real cc1plus
    p = subprocess.Popen(['/usr/lib/gcc/x86_64-linux-gnu/4.8/cc1plus'] + n)
    p.communicate(); sys.exit(p.returncode)

Let's use it!
    g++ -no-integrated-cpp -std=c++11 -B/path/to/script example.cpp

    #ifndef _3121294961_example_cpp
    #define _3121294961_example_cpp

    template <typename T>
    T add(T a, T b) { return a + b; }

    #endif

    int main(void) { return add(1, 2); }

Okay, get back to the topic

cc1 - From input to output

IN → Front-end → Middle-end → Back-end → OUT

Front-end overview
C/C++ → AST → Generic

It all starts with the lexer & parser
Intermediate representation - AST
At the end - language-independent

Parsing
Simple example:
    x = a + b * 3;
Basic lexers are based on regular expressions
Statements are tokenized
    x can be mapped to <id, 1>, where 1 is an index in the symbol table
    a, b → <id, 2>, <id, 3>
    +, * can be mapped to the token table
    3 can be mapped to the constant table
The lexer does not define any order
It's just tokenization
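The tokenization step can be sketched in a few lines of C++. This is only an illustration: it uses a hand-written scanner rather than regular expressions, and the names `Kind`, `Token` and `tokenize` are made up here; GCC's real lexer is far more involved.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds, for illustration only.
enum class Kind { Identifier, Number, Operator };

struct Token {
    Kind kind;
    std::string text;
};

// Splits a statement into tokens. No ordering or precedence is decided here.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        const std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            out.push_back({Kind::Identifier, src.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
            out.push_back({Kind::Number, src.substr(start, i - start)});
        } else {
            ++i;  // every other character is treated as a one-character operator
            out.push_back({Kind::Operator, src.substr(start, 1)});
        }
        assert(i > start);  // each iteration consumes at least one character
    }
    return out;
}
```

Calling `tokenize("x = a + b * 3;")` yields a flat sequence of eight tokens; nothing in it says that `*` binds tighter than `+`. That is the parser's job.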

AST
Eventually the parser emits an AST
AST stands for Abstract Syntax Tree
Example expression: a + (b * 3)
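As a rough sketch, the tree for a + (b * 3) could be represented like this. The `Node` type and the helper functions are hypothetical and chosen for brevity; GCC's own tree nodes look different.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical node type, for illustration only.
struct Node {
    char op = 'c';                  // '+', '*', 'v' (variable) or 'c' (constant)
    std::string name;               // used when op == 'v'
    int value = 0;                  // used when op == 'c'
    std::unique_ptr<Node> lhs, rhs; // used by binary operators
};

std::unique_ptr<Node> var(const std::string& name) {
    auto n = std::make_unique<Node>();
    n->op = 'v';
    n->name = name;
    return n;
}

std::unique_ptr<Node> constant(int value) {
    auto n = std::make_unique<Node>();
    n->op = 'c';
    n->value = value;
    return n;
}

std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->op = op;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Walks the tree bottom-up; the tree shape, not the token order, fixes precedence.
int evaluate(const Node& n, const std::map<std::string, int>& env) {
    switch (n.op) {
    case 'v': return env.at(n.name);
    case 'c': return n.value;
    case '+': return evaluate(*n.lhs, env) + evaluate(*n.rhs, env);
    case '*': return evaluate(*n.lhs, env) * evaluate(*n.rhs, env);
    }
    assert(false && "unknown node kind");
    return 0;
}

// Builds the tree for a + (b * 3): '+' is the root, '*' is its right child.
std::unique_ptr<Node> example_tree() {
    return binary('+', var("a"), binary('*', var("b"), constant(3)));
}
```

With a = 2 and b = 5, evaluating the tree gives 17, because the multiplication sits deeper in the tree and is therefore computed first.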

AST


Semantic analysis
Compiler needs to check the syntax tree against the language definition
This analysis saves type information in the symbol table
Type checking is also performed (e.g. array[1.f] is ill-formed)
Implicit conversions are likely to happen

Symbol table
GCC must record variables in a so-called symbol table
It contains information about type, storage, scope, etc.
It is built incrementally by the analysis phases
Scopes are very important
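To show why scopes matter, here is a minimal sketch of a scoped symbol table. The class and its member names are hypothetical, and only a type name is stored per symbol; a real compiler records much more.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct Symbol {
    std::string type;
    int scope_depth;  // 0 = global scope
};

// Hypothetical scoped symbol table: a stack of per-scope maps.
class SymbolTable {
public:
    SymbolTable() { enter_scope(); }  // the global scope always exists

    void enter_scope() { scopes_.emplace_back(); }

    void leave_scope() {
        assert(scopes_.size() > 1 && "cannot leave the global scope");
        scopes_.pop_back();
    }

    // Returns false when the name is already declared in the current scope.
    bool declare(const std::string& name, const std::string& type) {
        const int depth = static_cast<int>(scopes_.size()) - 1;
        return scopes_.back().emplace(name, Symbol{type, depth}).second;
    }

    // Searches from the innermost scope outwards, so inner names shadow outer ones.
    const Symbol* lookup(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return &found->second;
        }
        return nullptr;
    }

private:
    std::vector<std::unordered_map<std::string, Symbol>> scopes_;
};
```

Declaring `x` as `int` globally and as `float` in a nested scope makes `lookup("x")` return the `float` entry until the nested scope is left.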

Generic
The code is correct with regard to syntax & language semantics
It is also stored as an AST
Although the AST is abstract, it is not generic enough
Language-specific AST nodes are replaced
From now on, the middle-end kicks in

Middle-end overview
→ GIMPLE → SSA → Optimize → RTL →

Generic → GIMPLE
SSA transformation
Optimization passes
Un-SSA transformation
RTL, suitable for back-end

GIMPLE
Modified GENERIC form
At most 3 operands per expression
    Why 3? Three-address instructions
    Function calls are an exception
No nested function calls
Some control structures are represented with ifs and gotos

GIMPLE
Expressions that are too complex are broken down using expression temporaries
Example:
    a = b + c + d
becomes
    T1 = b + c
    a = T1 + d

GIMPLE
Another example:
    a = b ? c : d
becomes
    if (b != 0)
        T1 = c
    else
        T1 = d
    a = T1
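Both lowerings can be written by hand in C++ to check that they preserve behaviour. This is only a sketch of the idea with made-up function names, not GIMPLE itself.

```cpp
#include <cassert>

// Source form: one expression with three operands on the right-hand side.
int sum_source(int b, int c, int d) {
    int a = b + c + d;
    return a;
}

// Lowered form: at most two operands per statement, with a temporary.
int sum_lowered(int b, int c, int d) {
    int t1 = b + c;
    int a = t1 + d;
    return a;
}

// Source form: conditional expression.
int select_source(int b, int c, int d) {
    int a = b ? c : d;
    return a;
}

// Lowered form: explicit control flow. Any non-zero b selects c.
int select_lowered(int b, int c, int d) {
    int t1;
    if (b != 0)
        t1 = c;
    else
        t1 = d;
    int a = t1;
    return a;
}

// Compares source and lowered forms on a few inputs.
void check_lowering() {
    assert(sum_source(1, 2, 3) == sum_lowered(1, 2, 3));
    for (int b = -2; b <= 2; ++b)
        assert(select_source(b, 10, 20) == select_lowered(b, 10, 20));
}
```

The condition is tested against zero rather than against 1, since a value such as b = 2 must still select c.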

GIMPLE instruction set
GIMPLE_ASM GIMPLE_ASSIGN GIMPLE_BIND

GIMPLE_CALL GIMPLE_CATCH GIMPLE_COND

GIMPLE_DEBUG GIMPLE_EH_FILTER GIMPLE_GOTO

GIMPLE_LABEL GIMPLE_NOP GIMPLE_PHI

GIMPLE_RESX GIMPLE_RETURN GIMPLE_SWITCH

GIMPLE_TRY GIMPLE_OMP_* …

Static Single Assignment (SSA)
Every variable is assigned only once
Can be used as a read-only value multiple times
In if statements merging takes place
    PHI function
GCC performs over 20 optimizations on the SSA tree

GIMPLE vs SSA
    a = 3;
    b = 9;
    c = a + b;
    a = b + 1;
    d = a + c;
    return d;

    a_1 = 3;
    b_2 = 9;
    c_3 = a_1 + b_2;
    a_4 = b_2 + 1;
    d_5 = a_4 + c_3;
    _6 = d_5;
    return _6;
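The renaming shown above can be sketched for straight-line code as follows. The `Stmt` type and `to_ssa` are hypothetical helpers; the sketch assumes there are no branches, so no PHI functions are needed, and it numbers names with one global counter to mimic the a_1, b_2, … style of the listing.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// dest = lhs op rhs; op and rhs are empty for a plain copy such as "a = 3".
struct Stmt {
    std::string dest, lhs, op, rhs;
};

// Renames every assignment target to a fresh versioned name and rewrites
// later uses to refer to the most recent version.
std::vector<Stmt> to_ssa(const std::vector<Stmt>& in) {
    std::map<std::string, std::string> current;  // variable -> latest SSA name
    int counter = 0;

    // Constants and unknown names pass through unchanged.
    auto use = [&current](const std::string& operand) -> std::string {
        auto it = current.find(operand);
        return it == current.end() ? operand : it->second;
    };

    std::vector<Stmt> out;
    for (const Stmt& s : in) {
        assert(!s.dest.empty());
        // Operands are resolved before the destination gets its new version.
        Stmt renamed{"", use(s.lhs), s.op, use(s.rhs)};
        current[s.dest] = s.dest + "_" + std::to_string(++counter);
        renamed.dest = current[s.dest];
        out.push_back(renamed);
    }
    return out;
}
```

Feeding it the five assignments from the left-hand listing reproduces the versioned names on the right: the second assignment to `a` becomes `a_4`, and the final sum reads `a_4` and `c_3`.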

Optimizations
Why optimize?
Why in this phase?
Requirements
    Optimization must not change program behaviour
    It must improve overall program performance
    Compilation time must be kept reasonable
    Engineering effort has to be feasible

Optimizations & middle-end
Dead code elimination
Constant propagation
Strength reduction
Tail recursion elimination
Inlining
Vectorization

Dead code elimination
The task is simple: remove unreachable code
Simplify if statements with constant conditions
Remove exception handling constructs surrounding non-throwing code
…

Constant propagation
    a_1 = 3;
    b_2 = 9;
    c_3 = a_1 + b_2;
    a_4 = b_2 + 1;
    d_5 = a_4 + c_3;
    _6 = d_5;
    return _6;

    a_1 = 3;
    b_2 = 9;
    c_3 = 12;
    a_4 = b_2 + 1;
    d_5 = a_4 + c_3;
    _6 = d_5;
    return _6;

    a_1 = 3;
    b_2 = 9;
    c_3 = 12;
    a_4 = 10;
    d_5 = a_4 + c_3;
    _6 = d_5;
    return _6;

    a_1 = 3;
    b_2 = 9;
    c_3 = 12;
    a_4 = 10;
    d_5 = 22;
    _6 = d_5;
    return _6;

SSA helps here a lot
It could just be
    return 22;
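The folding steps above can be sketched as a single forward pass. The sketch assumes straight-line SSA code, supports only `+` and `*`, and uses the hypothetical names `Assign`, `value_of` and `propagate`; GCC's real pass is far more general.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// dest = lhs op rhs; op and rhs are empty for a plain copy.
struct Assign {
    std::string dest, lhs, op, rhs;
};

// Resolves an operand to a constant if it is a literal or an already-known name.
bool value_of(const std::string& operand,
              const std::map<std::string, int>& known, int& out) {
    if (operand.empty()) return false;
    if (std::isdigit(static_cast<unsigned char>(operand[0]))) {
        out = std::stoi(operand);
        return true;
    }
    auto it = known.find(operand);
    if (it == known.end()) return false;
    out = it->second;
    return true;
}

// One pass is enough here: in SSA each name has exactly one definition,
// so a value recorded for a name can never be invalidated later.
std::map<std::string, int> propagate(std::vector<Assign>& code) {
    std::map<std::string, int> known;
    for (Assign& s : code) {
        assert(!s.dest.empty());
        int l = 0, r = 0;
        if (!value_of(s.lhs, known, l)) continue;
        int result = l;
        if (!s.op.empty()) {
            if (!value_of(s.rhs, known, r)) continue;
            if (s.op == "+") result = l + r;
            else if (s.op == "*") result = l * r;
            else continue;  // unsupported operator: leave the statement alone
        }
        known[s.dest] = result;
        s.lhs = std::to_string(result);  // rewrite the statement as "dest = constant"
        s.op.clear();
        s.rhs.clear();
    }
    return known;
}
```

Run on the SSA listing, the pass records 12 for `c_3`, 10 for `a_4` and 22 for `d_5` and `_6`, which is what lets the whole function collapse to a constant return.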

Strength reduction
Goal: reduce the strength of an expression
Example:
    unsigned foo(unsigned a) {
        return a / 4;
    }

    shrl $2, %edi

… and a less intuitive one:
    unsigned bar(unsigned a) {
        return a * 9 + 17;
    }

    leal 17(%rdi,%rdi,8), %eax
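The two rewrites can be spelled out in C++ to see why they are valid for unsigned operands. The `_reduced` functions are hand-written equivalents of what the instructions above compute, not compiler output.

```cpp
#include <cassert>
#include <limits>

unsigned div4(unsigned a) { return a / 4; }

// shrl $2: for unsigned values, dividing by 4 equals shifting right by 2.
unsigned div4_reduced(unsigned a) { return a >> 2; }

unsigned bar(unsigned a) { return a * 9 + 17; }

// leal 17(%rdi,%rdi,8): base + index * 8 + 17, i.e. a + a * 8 + 17.
unsigned bar_reduced(unsigned a) { return a + (a << 3) + 17; }

// Compares the original and reduced forms, including the wrap-around case.
void check_strength_reduction() {
    const unsigned samples[] = {0u, 1u, 3u, 4u, 1000u,
                                std::numeric_limits<unsigned>::max()};
    for (unsigned a : samples) {
        assert(div4(a) == div4_reduced(a));
        assert(bar(a) == bar_reduced(a));
    }
}
```

Unsigned arithmetic wraps modulo 2^N, so the equivalence also holds when the multiplication overflows. For signed division the shift alone would not be enough, because division rounds towards zero.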

Tail recursion elimination
    int factorial(int x) {
        return (x > 1)
            ? x * factorial(x - 1)
            : 1;
    }

    int factorial(int x) {
        int result = 1;
        while (x > 1) {
            result *= x--;
        }
        return result;
    }

Why? Recursion running in constant space.
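Strictly speaking, the recursive factorial is not a tail call, because the multiplication happens after the call returns. One way to picture the transformation, as a sketch with a hypothetical intermediate function, is to introduce an accumulator first and only then turn the call into a loop.

```cpp
#include <cassert>

// Recursive form: the multiplication is still pending when the call is made.
int factorial(int x) {
    return (x > 1) ? x * factorial(x - 1) : 1;
}

// Hypothetical intermediate step: the pending product travels in an
// accumulator, so the recursive call is the very last thing that happens.
int factorial_acc(int x, int acc = 1) {
    return (x > 1) ? factorial_acc(x - 1, acc * x) : acc;
}

// A genuine tail call can be replaced by a jump to the top of the function.
int factorial_loop(int x) {
    int acc = 1;
    while (x > 1) {
        acc *= x;
        --x;
    }
    return acc;
}

// All three forms agree on small inputs (10! still fits in a 32-bit int).
void check_factorials() {
    for (int x = 0; x <= 10; ++x) {
        assert(factorial(x) == factorial_acc(x));
        assert(factorial(x) == factorial_loop(x));
    }
}
```

The loop version needs no stack frame per step, which is the constant-space property mentioned above.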

Inlining
Based on mem-space/time costs
Not possible when:
    -fno-inline switch is used
    conflicting __attribute__s
Forbidden when:
    call to alloca, setjmp, or longjmp
    non-local goto instruction
    recursion
    variadic argument list

Vectorization
One of GCC's concurrency models
Compiler uses SSE, SSE2, SSE3, … to make the program faster
Enabled by -O3 or -ftree-vectorize
There are more than 25 cases where vectorization can be done
    e.g. backward access, multidimensional arrays, conditions, nested loops, …
With the -ftree-vectorizer-verbose=N switch, vectorization can be debugged

Vectorization
    int a[256], b[256], c[256];
    void foo () {
        for (int i = 0; i < 256; i++) {
            a[i] = b[i] + c[i];
        }
    }

Scalar:
    .L3:
        movl -4(%rbp), %eax
        cltq
        movl b(,%rax,4), %edx
        movl -4(%rbp), %eax
        cltq
        movl c(,%rax,4), %eax
        addl %eax, %edx
        movl -4(%rbp), %eax
        cltq
        movl %edx, a(,%rax,4)
        addl $1, -4(%rbp)

Vectorized:
    .L3:
        movdqa b(%rax), %xmm0
        addq $16, %rax
        paddd c-16(%rax), %xmm0
        movdqa %xmm0, a-16(%rax)
        cmpq $1024, %rax
        jne .L3

Outsmarting GCC
    unsigned int foo(unsigned char i) {
        return i | (i<<8) | (i<<16) | (i<<24);
    } // 3 * SHL, 3 * OR

Human
    unsigned int bar(unsigned char i) {
        unsigned int j=i | (i << 8);
        return j | (j<<16);
    } // 2 * SHL, 2 * OR

GCC
    unsigned int baz(unsigned char i) {
        return i * 0x01010101;
    } // 1 * IMUL
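That the three variants really compute the same value can be checked exhaustively, since there are only 256 possible inputs. The sketch below uses renamed copies of the functions, assumes a 32-bit `unsigned int`, and widens the byte to unsigned before shifting to avoid shifting into the sign bit of `int`.

```cpp
#include <cassert>

// Three shifts, three ORs.
unsigned splat_naive(unsigned char i) {
    const unsigned u = i;
    return u | (u << 8) | (u << 16) | (u << 24);
}

// Two shifts, two ORs: build the low half first, then duplicate it.
unsigned splat_human(unsigned char i) {
    const unsigned u = i;
    const unsigned j = u | (u << 8);
    return j | (j << 16);
}

// One multiplication: 0x01010101 has a 1 in each byte position.
unsigned splat_mul(unsigned char i) {
    return static_cast<unsigned>(i) * 0x01010101u;
}

// Exhaustive comparison over every possible byte value.
bool all_agree() {
    for (unsigned v = 0; v < 256; ++v) {
        const unsigned char c = static_cast<unsigned char>(v);
        if (splat_naive(c) != splat_human(c)) return false;
        if (splat_naive(c) != splat_mul(c)) return false;
    }
    return true;
}
```

The multiplication works because the four partial products land in separate bytes and never carry into each other.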

Outsmarting GCC
    int fsincos_(double arg) {
        return sin(arg) + cos(arg);
    }

    leaq 8(%rsp), %rdi
    movq %rsp, %rsi
    call sincos
    movsd 8(%rsp), %xmm0
    addsd (%rsp), %xmm0
    addq $24, %rsp
    cvttsd2si %xmm0, %eax

Only on architectures with FPU
Actually, this is FPU + SSE

Outsmarting GCC
Which way is the best to reset the accumulator?
    mov $0, %eax      # b8 00 00 00 00
    and $0, %eax      # 83 e0 00
    sub %eax, %eax    # 29 c0
    xor %eax, %eax    # 31 c0

Answer: sub. Did you know it? GCC did.

Outsmarting GCC
Compilers are good at optimization
Let them optimize
Programmers should focus on writing readable code

Back-end

Register Transfer Language (RTL)

Inspired by Lisp
It describes instructions to be output

GIMPLE → RTL
GIMPLE:
    unsigned int baz(unsigned char) (unsigned char i)
    {
      unsigned int D.2202;
      int D.2203;
      int D.2204;

      D.2203 = (int) i;
      D.2204 = D.2203 * 16843009;
      D.2202 = (unsigned int) D.2204;
      return D.2202;
    }

RTL:
    (insn# 0 0 2 (parallel [
                (set (reg:SI 0 ax [orig:60 D.2207 ] [60])
                    (mult:SI (reg:SI 0 ax [orig:59 D.2207 ] [59])
                        (const_int 16843009 [0x1010101])))
                (clobber (reg:CC 17 flags))
            ]) rtl.cpp:2# *mulsi3_1
         (expr_list:REG_DEAD (reg:SI 0 ax [orig:59 D.2207 ] [59])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil))))

RTL Objects
There are multiple types of RTL objects:
    Expressions
    Integers, wide integers
    Strings
    Vectors

RTL Classes
There are a few categories of RTL expressions
    RTX_UNARY: NOT, SQRT, ABS
    RTX_OBJ: MEM, REG, VALUE
    RTX_COMPARE: GE, LT
    RTX_COMM_COMPARE: EQ, NE
    RTX_COMM_ARITH: PLUS, MULT
    …

Register allocation
The task: ensure that machine resources (registers) are used optimally.
There are two types of register allocators:
    Local Register Allocator
    Global Register Allocator
Since GCC 4.8 the messy reload.c was replaced with LRA

Register allocation
The problem: interference graph coloring
    Colors == registers
    Assign registers (colors) to temporaries
Finding a k-coloring of a graph is NP-complete, so GCC uses a heuristic method
In case of failure some of the variables are stored in memory
Two variables can share a register only when at most one of them is live at any point of the program

Register allocation - example
Instructions        Live variables
                    a
b = a + 2
                    b, a
c = b * b
                    a, c
b = c + 1
                    a, b
return a * b
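The live sets from this example are enough to build an interference graph and colour it. The sketch below uses a simple greedy colouring with hypothetical helper names; it illustrates the idea only and is not the heuristic GCC implements.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using LiveSet = std::set<std::string>;
using Graph = std::map<std::string, std::set<std::string>>;

// Two variables interfere when they appear in the same live set.
Graph build_interference(const std::vector<LiveSet>& live_sets) {
    Graph g;
    for (const LiveSet& live : live_sets) {
        for (const std::string& u : live) {
            g[u];  // make sure isolated variables still get a node
            for (const std::string& v : live)
                if (u != v) g[u].insert(v);
        }
    }
    return g;
}

// Greedy colouring: each variable takes the lowest register number
// not already used by one of its neighbours.
std::map<std::string, int> assign_registers(const Graph& g) {
    std::map<std::string, int> reg;
    for (const auto& node : g) {
        std::set<int> taken;
        for (const std::string& neighbour : node.second) {
            auto it = reg.find(neighbour);
            if (it != reg.end()) taken.insert(it->second);
        }
        int r = 0;
        while (taken.count(r) != 0) ++r;
        reg[node.first] = r;
    }
    return reg;
}

// The four live sets from the example: {a}, {a, b}, {a, c}, {a, b}.
std::vector<LiveSet> example_live_sets() {
    return {LiveSet{"a"}, LiveSet{"a", "b"}, LiveSet{"a", "c"}, LiveSet{"a", "b"}};
}
```

Here `a` interferes with both `b` and `c`, while `b` and `c` are never live at the same time, so two registers suffice: `b` and `c` end up sharing one.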

We can mess with the compiler
    register int variable asm("rbx");

However… this is not a good idea (unless you have a very good reason)
    Variable can be optimized
    Register still can be used by other variables

Instruction scheduling
Goal: minimize length of the critical path
Goal: maximize parallelism opportunities
How does it work?
    1. Build the data dependence graph
    2. Calculate priorities for each instruction
    3. Iteratively schedule ready instructions
Used before and after register allocation

Instruction scheduling
Works well in case of unrelated expressions
    a = x + 1;
    b = y + 2;
    c = z + 3;

Software pipelining
    IF RF EX ME WB
       IF RF EX ME WB
          IF RF EX ME WB

Instruction selection
GCC picks instructions from the set available for the given target
Each instruction has its cost
Addressing mode is also selected

RTL → ASM
Registers - allocated
Expressions - ordered
Instructions - selected

RTL Optimizations
Optimizations performed on RTL form

Rematerialization
Re-compute the value of a particular variable multiple times
Smaller register pressure, more CPU work
Should happen only when the computation takes less time than a load
Expression must not have side effects
Experimental results show 1-6% execution performance improvement

Common Subexpression Elimination
Finds subexpressions that occur in multiple places
Decides whether an additional temporary would make the program faster
Example:
    k = i + j + 10;
    r = i + j + 30;
Becomes:
    movl 8(%rsp), %esi
    addl 12(%rsp), %esi
    xorl %eax, %eax
    leal 30(%rsi), %edx
    addl $10, %esi
CSE works also with functions
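At source level the same transformation amounts to computing i + j once and reusing it. The sketch below, with hypothetical function names, mirrors what the assembly does: one addition into a temporary, then two cheap offsets.

```cpp
#include <cassert>

struct Result {
    int k;
    int r;
};

// As written in the source: i + j appears twice.
Result without_cse(int i, int j) {
    Result out;
    out.k = i + j + 10;
    out.r = i + j + 30;
    return out;
}

// After CSE: the shared subexpression is computed once into a temporary.
Result with_cse(int i, int j) {
    const int t = i + j;  // corresponds to %esi in the listing above
    Result out;
    out.k = t + 10;
    out.r = t + 30;
    return out;
}

// Compares both forms on a small grid of inputs.
void check_cse() {
    for (int i = -3; i <= 3; ++i) {
        for (int j = -3; j <= 3; ++j) {
            assert(without_cse(i, j).k == with_cse(i, j).k);
            assert(without_cse(i, j).r == with_cse(i, j).r);
        }
    }
}
```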

Loop-invariant code motion
Move computations that do not depend on the loop outside its body
Benefits: fewer calculations & constants in registers
Example:
    for (int i = 0; i < n; i++) {
        x = y + z;
        a[i] = 6 * i + x * x;
    }
Becomes:
    x = y + z;
    t1 = x * x;
    for (int i = 0; i < n; i++) {
        a[i] = 6 * i + t1;
    }
Can introduce high register pressure → rematerialization

More RTL optimizations
Jump bypassing
Control flow graph cleanup
Loop optimizations
Instruction combination
…

Linker (collect2)
collect2 really uses ld
Performs consolidation of multiple object files
gold - better linker, but only for ELF

Link time optimizations
GCC optimizations are constrained to a single translation unit
When LTO is enabled, object files include GIMPLE trees
Local optimizations are applied globally:
    Dead code elimination
    Constant propagation
    …

GCC test suites
GCC is tested by over 19k tests
Test suites employ DejaGnu, Tcl, and expect tools
Each test is a C file with special comments
Test results are
    PASS: the test passed as expected
    XPASS: the test unexpectedly passed
    FAIL: the test unexpectedly failed
    XFAIL: the test failed as expected
    ERROR: the testsuite detected an error
    WARNING: the testsuite detected a possible problem
    UNSUPPORTED: the test is not supported on this platform

string-1.C
    // Test location of diagnostics for interpreting strings. Bug 17964.
    // Origin: Joseph Myers <[email protected]>
    // { dg-do compile }

    const char *s = "\q"; // { dg-error "unknown escape sequence" }

    const char *t = "\ "; // { dg-error "unknown escape sequence" }

    const char *u = "";

ambig2.C
    // PR c++/57948

    struct Base {  };
    struct Derived : Base
    {
      struct Derived2 : Base
      {
        struct ConvertibleToBothDerivedRef
        {
          operator Derived&();
          operator Derived2&();
          void bind_lvalue_to_conv_lvalue_ambig(ConvertibleToBothDerivedRef both)
          {
            Base &br1 = both; // { dg-error "ambiguous" }
          }
        };
      };
    };

dependent-name3.C
    // { dg-do compile }

    // Dependent arrays of invalid size generate appropriate error messages

    template<int I> struct A
    {
      static const int zero = 0;
      static const int minus_one = -1;
    };

    template<int N> struct B
    {
      int x[A<N>::zero]; // { dg-error "zero" }
      int y[A<N>::minus_one]; // { dg-error "negative" }
    };

    B<0> b;

DG commands
dg-do
    preprocess, compile, assemble, link, run
dg-options
dg-error
dg-warning
dg-bogus
…

Auxiliary tools
Tools every developer should be aware of…
    nm - helps examine symbols in object files
    objdump - displays information from object files
    c++filt - demangles C++ symbols
    addr2line - converts offsets to lines and filenames
    …, see binutils

Bonus slide
Which came first, the chicken or the egg?

First compilers were written in… assembly
It was challenging because of poor hardware resources
It is believed that the first compiler was created by Grace Hopper, for A-0
First complete compiler - FORTRAN, IBM, 1957
First multi-architecture compiler - COBOL, 1960

areThereAnyQuestions()? pleaseAsk(): thankYouForYourAttention();