Parallel GCC: Status and Expectations
Giuliano Belinassi
giulianob
September 13, 2019
GNU Cauldron 2019
FLUSP
Computer Science Department
Institute of Mathematics and Statistics
University of São Paulo (USP)
1/40
Introduction
• FLUSP – FLOSS at USP
  ◦ Student extension group
  ◦ Aims to contribute to FLOSS
  ◦ Projects we currently contribute to: Linux Kernel, GCC, Git, Debian
  ◦ Promotes contribution events, such as the KernelDevDay
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
Introduction
• How many of you have seen a compiler running in parallel?
Introduction
• Parallelization of GCC internals
  ◦ Exploring parallelism within a single file
• Main objectives
  ◦ Reduce compilation time
  ◦ Highlight GCC's global state
Motivation
• Compilers are really complex programs
  ◦ GCC dates from 1987
  ◦ Still sequential
• Automatic generation of big files
  ◦ gimple-match.c: 100,358 lines of C++ (GCC 10.0.0)
• Exponential growth in the number of CPU cores
Motivation
Growth of Computational Power (Rupp, 2018)

[Figure: "42 Years of Microprocessor Trend Data". Log-scale plot (1970 to 2020) of transistors (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), typical power (W), and number of logical cores. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010 to 2017 collected by K. Rupp.]
Motivation
• Where can we use these parallel processors in a compiler?
• How big is the improvement?
• Are there projects that could benefit from this?
Introduction
• Experiment 1
  ◦ GCC compilation on a machine with 4× AMD Opteron 6376
    » 4 × 16 = 64 cores
  ◦ No bootstrap (--disable-bootstrap)
  ◦ $ make -j64
  ◦ Collected the compilation time of each file
  ◦ Collected the energy consumed by all CPUs
[Figure: flamegraph of the GCC compilation, Makefile jobs across the 64 cores over time (0 to 175 s). Highlighted files: [A] insn-emit.o, [B] insn-recog.o, [C] gimple-match.o, [D] generic-match.o.]
[Figure: electric energy consumed by all CPUs (amd fam15h_power), power (W) over time (0 to 175 s). Sequential build: consumed energy = 1.841616 ± 0.006624 Wh.]
Introduction
• How about LTO?
• Experiment 2
  ◦ GCC compilation on a machine with 4× AMD Opteron 6376
    » 4 × 16 = 64 cores
  ◦ No bootstrap (--disable-bootstrap)
  ◦ Enabled LTO (CFLAGS=-flto=64 CXXFLAGS=-flto=64)
  ◦ $ make -j64
  ◦ Collected the compilation time of each file
  ◦ Collected the energy consumed by all CPUs
[Figure: flamegraph of the GCC compilation with LTO, Makefile jobs over time (0 to 300 s). Highlighted jobs: [A] lto1, [B] cc1, [C] cc1plus.]
[Figure: electric energy consumed by all CPUs (amd fam15h_power), power (W) over time (0 to 300 s). Sequential build with LTO: consumed energy = 2.695716 ± 0.009216 Wh.]
Introduction
• There is a parallelism bottleneck in the GCC project
  ◦ And the compilation consumes more electrical energy as a consequence
• Could we improve it by parallelizing GCC?
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
Where to Start?
GCC's structure is divided into three parts:
• Front end
  ◦ Parsing
• Middle end
  ◦ Hardware-independent optimizations
• Back end
  ◦ Hardware-dependent optimizations
  ◦ Code generation
Where to Start?
• Selected part: the middle end
  ◦ Applies hardware-independent optimizations to the code
  ◦ Why? Because it looked easier to start with than RTL
• Optimizations can be broken into two disjoint sets
  ◦ Intra-procedural
    » Can be applied to a function without observing its interactions with other functions
  ◦ Inter-procedural
    » Interactions with other functions must be considered
    » GCC calls this Inter-Procedural Analysis (IPA)
GCC Profile
• We used gimple-match.c to profile the compiler
  ◦ Intel Core i5-8250U (4 cores)
  ◦ 5400 RPM 1 TB hard drive

[Figure: GCC profile (pie chart): Parser 5.8%, IPA 10.7%, GIMPLE IPA 0.6%, GIMPLE 15.4%, RTL 48.1%, Others 19.4%.]
GCC Profile
• We used gimple-match.c to profile the compiler
  ◦ 48 s is spent compiling this file
  ◦ 63% is spent in intra-procedural optimizations
  ◦ This can be parallelized
  ◦ Maximum speedup: 2.7× according to Amdahl's law
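The 2.7× figure follows directly from Amdahl's law with a parallel fraction of p = 0.63 (the intra-procedural share measured above):

```latex
S(n) = \frac{1}{(1 - p) + p/n},
\qquad
S_{\max} = \lim_{n \to \infty} S(n) = \frac{1}{1 - p} = \frac{1}{0.37} \approx 2.70
```

Even with only 4 threads, S(4) = 1/(0.37 + 0.63/4) ≈ 1.90, so most of that bound is already reachable on a small desktop CPU.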
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
Parallel Architecture
• Parallelize the intra-procedural optimizations
  ◦ On a per-function basis
  ◦ Similarly to Wortman and Junkin (1992)
  ◦ With an arbitrary number of threads
Parallel Architecture
• Architecture of the parallel GCC
  ◦ Inter-Procedural Analysis inserts functions into a producer-consumer queue
  ◦ Each thread removes its work from the queue
  ◦ When the queue is empty, threads block until work is inserted into it
  ◦ The threads die when the EMPTY token is inserted
  ◦ Measured overhead: 0.1 s for 2000 functions
[Diagram: IPA inserts functions f1 through f7 into the queue; several threads run the GIMPLE passes in parallel; a barrier then precedes the sequential RTL passes and code generation.]
Parallel Architecture
• Advantages
  ◦ Best candidate for linear speedup
  ◦ Worst case: a single big function
• Disadvantage
  ◦ Must map per-pass global state in the compiler
Implementation
• Split cgraph_node::expand into three methods:
  1. expand_ipa_gimple
  2. expand_gimple ← parallelization effort in GSoC 2019
  3. expand_rtl
• Serialized the garbage collector
• Serialized memory-related structures
Implementation
Memory pools:
• Problem:
  ◦ Linked lists of several object instances
  ◦ Can be allocated/released upon request
  ◦ Most race conditions were there
• Solution:
  ◦ Create an instance for every thread
  ◦ Merge the memory pools when all threads join
  ◦ Implementing the merge feature: just append the two linked lists
Implementation
Garbage collection:
• Problem:
  ◦ GCC implements a garbage collector
  ◦ Variables can be marked to be watched by it
  ◦ Collection can also be triggered upon request
• Partial solution:
  ◦ Serialize the entire garbage collector
  ◦ Disable collection between passes
• Research is needed to:
  ◦ Map the race conditions in the garbage collector
  ◦ Make it thread-safe
Implementation
rtl_data structure:
• Problem:
  ◦ This class represents the current function being compiled to RTL
  ◦ GCC has only a single instance of this class
  ◦ GIMPLE uses it to decide optimizations related to instruction costs
• Solution:
  ◦ Keep one copy of this structure per thread
  ◦ Make GIMPLE not rely on it?
Implementation
tree-ssa-address.c: mem_addr_template_list
• Problem:
  ◦ Race condition in this structure
• Partial solution:
  ◦ Serialize with a mutex
• Solution:
  ◦ Replicate for each thread?
Implementation
Integer-to-tree-node hash
• Problem:
  ◦ Race condition in this structure
• Partial solution:
  ◦ Serialize with a mutex
• Solution:
  ◦ Replicate for each thread?
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
Results
• Results from parallelizing GIMPLE
• Mean of 30 samples

[Figure, left: time (s) in parallel GIMPLE intra-procedural analysis for 1, 2, 4, and 8 threads; speedups of 1.72×, 2.51×, and 2.52×.]
[Figure, right: total GCC time (s) and speedups for 1, 2, 4, and 8 threads, broken down into parallel GIMPLE, RTL, Inter-Procedural Analysis, parser, and others; overall speedups of 1.06×, 1.09×, and 1.09×.]
Results
• The same approach can be used to parallelize RTL
• Estimated using the GIMPLE results

[Figure: estimated total GCC time (s) and speedups for 1, 2, 4, and 8 threads, broken down into parallel GIMPLE, parallel RTL (estimate), Inter-Procedural Analysis, parser, and others; estimated overall speedups of 1.35×, 1.61×, and 1.61×.]
Motivation
• Where can we use these parallel processors in a compiler?
  ◦ Intra-procedural optimization is an easy answer
• How big is the improvement?
  ◦ Without much optimization, 1.6× using 4 threads
  ◦ Better results will require more effort
• Are there projects that could benefit from this?
  ◦ GCC
  ◦ ...What else?
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
TODOs
• What is left to do:
  ◦ Fix all race conditions in GIMPLE
  ◦ Fix all C11 __thread occurrences
    » Initialize inter-pass objects at ::execute time
    » Per-thread attributes initialized at spawn time
  ◦ Make this build pass the testsuite
  ◦ Parallelize RTL
  ◦ Parallelize IPA (useful for LTO?)
  ◦ Communicate with Make for automatic threading
Overview
1. Introduction
2. Where to Start?
3. Parallel Architecture
4. Results
5. TODOs
6. References
References
• Karl Rupp (2018). 42 Years of Microprocessor Trend Data. URL: https://github.com/karlrupp/microprocessor-trend-data.
• D. B. Wortman and M. D. Junkin (Jan. 1992). "A Concurrent Compiler for Modula-2+". In: ACM SIGPLAN Notices 27, pp. 68–81. DOI: 10.1145/143103.120025.
Repositories
https://github.com/giulianobelinassi/gcc-timer-analysis
https://gitlab.com/flusp/gcc/tree/giulianob_parallel
https://gcc.gnu.org/wiki/ParallelGcc
Thank you!