
Parallel GCC: Status and Expectations

Giuliano Belinassi

giulianob

September 13, 2019

GNU Tools Cauldron 2019
FLUSP

Computer Science Department

Institute of Mathematics and Statistics

University of São Paulo (USP)

1/40

Introduction

• FLUSP – FLOSS at USP
  - Student extension group
  - Aims to contribute to FLOSS
  - Projects we currently contribute to: Linux kernel, GCC, Git, Debian
  - Promotes contribution events, such as the KernelDevDay

2/40

Overview

1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References

3/40


Introduction

• How many of you have seen a compiler running in parallel?

5/40

Introduction

• Parallelization of GCC internals
  - Exploring parallelism in a single file
• Main objectives
  - Reduction of compilation time
  - Highlight GCC global states

6/40

Motivation

• Compilers are really complex programs
  - GCC dates from 1987
  - Still sequential
• Automatic generation of big files
  - gimple-match.c: 100358 lines of C++ (GCC 10.0.0)
• Exponential growth in the number of CPU cores

7/40

Motivation

[Figure: "42 Years of Microprocessor Trend Data" (Rupp, 2018). Log-scale plot over the years 1970 to 2020 of transistors (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), typical power (Watts), and number of logical cores. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010 to 2017 collected by K. Rupp.]

8/40

Motivation

• Where can we use these parallel processors in a compiler?
• How large is the improvement?
• Are there projects that can benefit from this?

Introduction

• Experiment 1
  - GCC compilation on a machine with 4× AMD Opteron 6376
    » 4 × 16 = 64 cores
  - No bootstrap (--disable-bootstrap)
  - $ make -j64
  - Collected the compilation time of each file
  - Collected the energy consumed by all CPUs

10/40

[Figure: Flamegraph of GCC compilation. Makefile jobs (0 to 60) against time (0 to 175 s). Longest-running files: [A] insn-emit.o, [B] insn-recog.o, [C] gimple-match.o, [D] generic-match.o.]

11/40

[Figure: Electric energy consumed by all CPUs (amd fam15h_power), plotted as power consumption (W) against time (s). Sequential build: consumed energy = 1.841616 ± 0.006624 Wh.]

12/40

Introduction

• How about LTO?
• Experiment 2
  - GCC compilation on a machine with 4× AMD Opteron 6376
    » 4 × 16 = 64 cores
  - No bootstrap (--disable-bootstrap)
  - Enabled LTO (CFLAGS=-flto=64 CXXFLAGS=-flto=64)
  - $ make -j64
  - Collected the compilation time of each file
  - Collected the energy consumed by all CPUs

13/40

[Figure: Flamegraph of GCC compilation with LTO. Makefile jobs (0 to 60) against time (0 to 300 s). Longest-running processes: [A] lto1, [B] cc1, [C] cc1plus.]

14/40

[Figure: Electric energy consumed by all CPUs (amd fam15h_power), plotted as power consumption (W) against time (s). Sequential LTO build: consumed energy = 2.695716 ± 0.009216 Wh.]

15/40

Introduction

• There is a parallelism bottleneck in the GCC project
  - And the compilation consumes more electrical energy as a consequence
• Could we improve it by parallelizing GCC?

16/40

Overview

1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References

17/40

Where to Start?

GCC's structure is divided into three parts:

• Front End
  - Parsing
• Middle End
  - Hardware-independent optimization
• Back End
  - Hardware-dependent optimization
  - Code generation

18/40

Where to Start?

• Selected part: Middle End
  - Applies hardware-independent optimizations to the code
  - Why? Because it looked easier to start with than RTL
• Optimizations can be broken into two disjoint sets (see the example below)
  - Intra-procedural
    » Can be applied to a function without observing its interactions with other functions
  - Inter-procedural
    » Interactions with other functions must be considered
    » GCC calls this Interprocedural Analysis (IPA)

19/40

GCC Profile

• We used gimple-match.c to profile the compiler
  - Intel Core i5-8250U (4 cores)
  - 5400 RPM 1 TB hard drive

[Chart: GCC profile when compiling gimple-match.c]

  Pass group  | Share of time
  ----------- | -------------
  Parser      |  5.8%
  IPA         | 10.7%
  GIMPLE IPA  |  0.6%
  GIMPLE      | 15.4%
  RTL         | 48.1%
  Others      | 19.4%

20/40

GCC Profile

• We used gimple-match.c to profile the compiler
  - 48 s is spent compiling this file
  - 63% of that is spent in intra-procedural optimization
  - This can be parallelized
  - Maximum speedup: 2.7× according to Amdahl's law (worked out below)

21/40

Overview

1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References

22/40

Parallel Architecture

• Parallelize intra-procedural optimizations
  - Per function
  - Similarly to Wortman and Junkin (1992)
  - With an arbitrary number of threads

23/40

Parallel Architecture

•Architecture of Parallel GCCI The Inter Process Analysis inserts functions into a Producer Consumer �eue

I Each thread is responsible to remove their work from the �eue

I When �eue is empty, threads are blocked until work is inserted into it

I The threads die when the EMPTY token is inserted

I Measured overhead: 0.1s for 2000 functions

[Diagram: IPA inserts functions f1 to f7 into a queue; several threads run the GIMPLE passes on functions taken from the queue; a barrier precedes the sequential RTL passes and code generation.]

24/40

Parallel Architecture

• Advantages
  - Best candidate for linear speedup
  - Worst case: a single big function
• Disadvantage
  - Requires mapping per-pass global states in the compiler

25/40

Implementation

• Split cgraph_node::expand into three methods (sketched below):
  1. expand_ipa_gimple
  2. expand_gimple ← parallelization effort in GSoC 2019
  3. expand_rtl
• Serialized the garbage collector
• Serialized memory-related structures

26/40

Implementation

Memory pools:

• Problems:
  - Linked lists of several object instances
  - Objects can be allocated/released upon request
  - Most race conditions were there
• Solution (sketched below):
  - Create an instance for every thread
  - Merge the memory pools when all threads join
  - Implementing the merge feature: just append the two linked lists

27/40

Implementation

Garbage collection:

• Problem:
  - GCC implements a garbage collector
  - Variables can be marked to be watched by it
  - Collection can also be done upon request
• Partial solution (sketched below):
  - Serialize the entire garbage collector
  - Disable collection between passes
• Research is needed to:
  - Map the race conditions in the garbage collector
  - Make it thread-safe

28/40

Implementation

rtl_data structure:

• Problem:
  - This class represents the current function being compiled to RTL
  - GCC has only a single instance of this class
  - GIMPLE uses it to decide optimizations related to instruction costs
• Solution (sketched below):
  - Have one copy of this structure for each thread
  - Make GIMPLE not rely on it?

29/40

Implementation

tree-ssa-address.c: mem_addr_template_list

• Problem:
  - Race condition in this structure
• Partial solution:
  - Serialize with a mutex (sketched below)
• Solution:
  - Replicate it for each thread?

30/40

Implementation

Integer-to-tree-node hash:

• Problem:
  - Race condition in this structure
• Partial solution:
  - Serialize with a mutex (same pattern as above)
• Solution:
  - Replicate it for each thread?

31/40

Overview

1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References

32/40

Results

• Results from parallelizing GIMPLE
• Mean of 30 samples

[Charts: Time in parallel GIMPLE intra-procedural analysis: speedups of ×1.72, ×2.51, and ×2.52 with 2, 4, and 8 threads. Whole-compilation time and speedups (parallel GIMPLE, RTL, Interprocedural Analysis, parser, others): ×1.06, ×1.09, and ×1.09 with 2, 4, and 8 threads.]

33/40

Results

• The same approach can be used to parallelize RTL
• Estimate based on the GIMPLE results

[Chart: GCC time and speedups (estimate), combining parallel GIMPLE with an estimated parallel RTL plus Interprocedural Analysis, parser, and others: estimated speedups of ×1.35, ×1.61, and ×1.61 with 2, 4, and 8 threads.]

34/40

Motivation

• Where can we use these parallel processors in a compiler?
  - Intra-procedural analysis is an easy answer
• How large is the improvement?
  - Without much optimization, 1.6× using 4 threads
  - Better results will require more effort
• Are there projects that can benefit from this?
  - GCC
  - ...What else?

35/40

Overview

1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References

36/40

TODOs

• What is left to do:
  - Fix all race conditions in GIMPLE
  - Fix all C11 __thread occurrences
    » Initialize inter-pass objects at ::execute time
    » Initialize per-thread attributes at spawn time
  - Make this build pass the testsuite
  - Parallelize RTL
  - Parallelize IPA (useful for LTO?)
  - Communicate with Make for automatic threading

37/40

Overview

1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References

38/40

References

• Karl Rupp (2018). 42 Years of Microprocessor Trend Data. URL: https://github.com/karlrupp/microprocessor-trend-data
• D.B. Wortman and M.D. Junkin (Jan. 1992). "A Concurrent Compiler for Modula-2+". In: ACM SIGPLAN Notices 27, pp. 68–81. DOI: 10.1145/143103.120025

39/40

Repositories


https://github.com/giulianobelinassi/gcc-timer-analysis
https://gitlab.com/flusp/gcc/tree/giulianob_parallel
https://gcc.gnu.org/wiki/ParallelGcc

40/40

Thank you!

