gcc link time optimization and the linux kernel andi kleen intel otc
TRANSCRIPT
gcc link time optimization and the Linux kernel
Andi KleenIntel OTC
Acknowledgments
● Lots of people helped/contributed
● Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others
Why LTO?
● Optimize over the whole binary– Not just function (3.0) or file (<4.5)
● Avoid inline dependency hell in header files
● Without changing Makefiles significantly
gcc 4.7+ LTO WHOPR crash course
● Compiler parses files, writes GIMPLE to object files (LGEN)– Function optimization summaries are computed
● Linker calls lto1 with all files for sequential whole program analysis (WPA)– Merges types, Generates global callgraph, IPA optimization
summaries, writes partitions
● Generate code per partition in parallel (LTRANS)– Inline inside partition, run IPA optimizations for real, run per
function optimizations, generate object code
LTO IPA optimizations (4.8)
● Inlining between files● Function cloning for
specific arguments● Remove unused code
(fwhole-program)● Increase alignment● Discover pure/const● Keep globals/statics
alive over calls
● De-virtualization● Change ABI (SSE)● Constant propagation● Scalar replacement of
aggregates● Constructor /
destructor merging
green: does not benefit kernel today
Build time
LTO is slow and not parallel enough
Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto0.00
20.00
40.00
60.00
Faults small configF
au
lts (
M)
Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto0
200
400
600
800
Build time small
Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto0
2000
4000
User time
small config
us
er
time
(s
)
Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto0.00
5.00
10.00
15.00
Parallelism
Us
er
time
/ re
al t
ime
Parallelism small build 4.7
1 41 8111
2131 51
6171 91
101111
121131
141151
161171
181191
201211
221231
241251
261271
281291
301311
321331
341351
361371
381391
401411
421431
441451
461471
481491
501511
521531
541551
0
10
20
30
40
50
60
Runnable processes
LTO kernel build small config
r
time (s)
run
na
ble
(p
)
Modules without job server
WPA + real linkerType merging
4 times
Object file generationParsing / LGEN
LTRANScode generation
gcc 4.8Gcc 4.7
Gcc 4.8 no ltoGcc 4.7 no lto
0
5
10
15
Small config parallelism
User time / Runtime
time
(s
)
WHOPR still has poor parallelism
Multilink
● vmlinux links 2-4x: runs LTO that often● Generates integrated symbol table (kallsyms)● So far not fixed
– KALLSYMS can be disabled
● One of those unexpected quirks of real build systems
Gcc 4.8 no ltoGcc 4.8 w/o kallsyms
Gcc 4.7 with kallsyms
0.00
2000.00
4000.00
6000.00
Memory usage 4.7 small build
125
4973
97917 33
41 5765 81
89 105113
121129
137145
153161
169177
185193
201209
217225
233241
249257
265273
281289
297305
313321
329337
345353
361369
377385
393401
409417
425433
441449
457465
473481
489497
505513
521529
537545
5530
1
2
3
4
5
6
7
8
Active memory
Kernel LTO build small config
time (s)
act
ive
me
m (
GB
)
Even small build swaps in 4GB systemMemory peaks all in WPA
Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto0.00
20.00
40.00
60.00
Faults small config
Fa
ults
(M
)
Memory consumption
● Temporary data can be a problem, together with WPA– Early many swap storms with /tmp = tmpfs
– Partitioning algorithm was improved
– Use TMPDIR=objdir
– With modules need to avoid too large -j* for parallel WPA
– Jobserver has to be disabled, makes it worse
Large build 4.8 (allyes)
~ 42mins~1h 11min with kallsyms~15GB peak
Kallsyms disabled(disables various features)
3 93981
159237
315393
471549
627705
783861 1017
10951173
12511329
14071485
15631641
17191797
18751953
20312109
21872265
23432421
2499
0
10
20
30
40
50
60
Linux allyes nosym runnable 4.8 -j16
time (s)
run
na
ble
(p
)
3109587
171255
339423
507591
675759
843927
101111791263
13471431
15151599
16831767
18511935
20192103
21872271
23552439
2523
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
Linux allyes nosym gcc 4.8 memory
time (s)
act
ive
me
m (
G)
WPA cycle profile 4.8Samples: 257K of event 'cycles', Event count (approx.): 183142453583
● 9.71% lto1wpa lto1 [.] htab_find_slot_with_hash ◆● htab_find_slot_with_hash ▒● + 61.79% gimple_type_hash(void const*) ▒
● + 26.54% gimple_register_type_1(tree_node*, bool) ▒● + 4.79% visit(tree_node*, sccs*, unsigned int, vec<tree_node*, va_heap, v▒● + 1.53% iterative_hash_canonical_type(tree_node*, unsigned int) ▒● + 0.98% iterative_hash_gimple_type(tree_node*, unsigned int, vec<tree_nod▒● + 0.92% symtab_get_node(tree_node const*) ▒
● + 0.77% gimple_type_eq(void const*, void const*) ▒● + 0.53% build_int_cst_wide(tree_node*, unsigned long, long) ▒● + 0.53% streamer_string_index(output_block*, char const*, unsigned int, b▒● + 9.00% lto1wpa libc2.17.so [.] _int_malloc ▒● + 7.34% lto1wpa lto1 [.] iterative_hash_hashval_t(unsigned i▒
● + 7.33% lto1wpa lto1 [.] gimple_type_eq(void const*, void co▒● + 6.66% lto1wpa libc2.17.so [.] __memset_sse2 ▒● + 5.35% lto1wpa libc2.17.so [.] malloc_consolidate ▒● + 4.11% lto1wpa libc2.17.so [.] _int_free ▒● + 4.10% lto1wpa lto1 [.] tree_map_base_eq(void const*, void ▒
● + 2.85% lto1wpa lto1 [.] pointer_map_insert(pointer_map_t*, ▒● + 2.76% lto1wpa lto1 [.] inflate_fast ▒● + 2.17% lto1wpa lto1 [.] lto_output_tree(output_block*, tree▒
Most time spent in hashing types for type merging
Type merging
struct foo **
pointer_type
pointer_type
Struct foo(record_type) Members
Complicated iterative hash consisting of 4 hash tables(including two “hash cache hashes”)
Each LGEN has its own types, need to be unified
Hashing improvements
● Use tcmalloc instead of glibc malloc– build time -9%; peak mem +35%
● Start with large type hash tables / caches● Use murmur for hashing / fix iterative / fix
pointer hash to handle all bits – (collisions 90%->70%->64%)
● Splay trees for type merging (bad idea)
tcmalloc murmur fullhash0.00%
2.00%
4.00%
6.00%
8.00%
10.00%
12.00%Build time Improvement hash changes
4.8, build time small config
time
(s
) d
elta
Possible future hash improvements
● Precompute hash in LGEN– And stream types out in hash order
– Requires even larger hash tables to avoid collisions (or a good tree)
● Only merge per partition?– And do that in parallel
● Parallelize WPA further?
Incremental linking
● Kernel build system uses ld -r / ar extensively● Linker merges all sections with the same name
– Added random postfixes to LTO sections
● Still problem with pure assembler files– .S assembler code dropped during LTO– Originally attempted to fix with slim LTO
● Then fixed in Linux binutils– Unfortunately changes not merged into mainline binutils
– Use Linux binutils for kernel
Kernel modifications (excerpt)
● Section attributes– const __initconst char * vs
__initconst char * const
● Toplevel inline assembler– Need globals and visible for any
symbols
● Lots of externally_visible● Various changes for dot
NUMBER symbols
● Disable LTO for some files● Workarounds for 4.7
partitioning bugs (visible)● Gcc-ld● A few fixes for optimizations
(rare)
Kernel changes: 108 files changed, 740 insertions(+), 313 deletions(-) (v3.8; without most section attributes)
Debugging
● Kernel debugging: more difficult with more inlining– Especially with initialization functions
– May need inline aware backtracer (“dwarf kallsyms”)
● Compiler– No simple test cases. Lots of regressions; Complicated delta.
– Complex multi process
– Hard to find right process to attach/profile
– -dH (core dumps) are most useful
LTO has challenges for debugging
Global Optimizations (small config)
Inlining21k new inlines between files 13k unique
Scalar replacement No change
Partial Inline +16%
Constant Propagation +40%
Cloning for constant propagation(CONFIG_LTO_CLONE)
266->1052 const clones
+2% text (~200k)
Static instruction changes
calls functions fsize avg push pop mov0.00%
2.00%
4.00%
6.00%
8.00%
10.00%
12.00%
Generated code 4.8 small LTO
Changes in static instructions (vmlinux) More = Better
instruction
de
lta
Generated code
● Runtime– Mixed results
● Likely need some more optimizations – Open: Atom testing (more sensitive to bad code)
Runtime
LKP dbench tbench AIM7 OLTP
Sandy Bridge 0% 0% +3% 0%
Nehalem 0% 0% +4% +6%
Text size
● Text size– Slight increase for large configs, but lower for
small configs● small config +3%● allno shrinks -50k
– The smaller the kernel the less infrastructure is used. But EXPORT_SYMBOL defeats it again.
ARM embedded improvements (Tim Bird)
vmlinux.non-lto => vmlinux.lto
● baseline other change percent
● -------- ----- ------ -------
● text: 5242000 4869340 -372660 -7%
● data: 490184 476200 -13984 -2%
● bss: 121824 122112 288 0%
● total: 5854008 5467652 -386356 -6%
●
● Kernel: non-lto lto
● --------------------------- ------- --------
● time for full build: 1m5.87s 3m22s
● time for incremental build: 0m5.24s 2m12s
● kernel image size: 5854008 5467652
● compressed uImage size: 2849920 2694224
● system meminfo MemTotal: 17804 kB 18188 kB
● system meminfo MemFree: 11020 kB 11260 kB
● boot time: 2.466369 2.414428
gcc LTO improvements wish list
● Per file options (partial patch)● Fix passing through of options● Standard gcc-ld● Don't error for section mismatches● Better procedure for LTO error reporting● Better messages for type mismatches● No .XXXX postfixes● -fno-toplevel-reorder by symbol?● Automatic procedure for externally_visible discovery?● Better syntax for “symbol visible to toplevel assembler”
Status
● Tested on x86, ARM, MIPS● Uptodate with Linux 3.8● Some options disabled (function-trace, gcov,
modversions)
References
● LTO kernel tree
http://github.com/andikleen/linux-misc– Branch lto-3.x (x increasing)
● Linux binutilshttp://www.kernel.org/pub/linux/devel/binutils/
● Tcmalloc http://code.google.com/p/gperftools/
● [email protected] http://halobates.de
Backup
Future kernel improvements
● Avoid multi-link● Use constructors to allow disabling -fno-
toplevel-reorder● Support new option to disable
instrument_function
Quick HOWTO for LTO kernel
● Get git://github.com/andikleen/linux-misc lto-3.x tree● Need gcc 4.7+ set up for LTO / Linux binutils● Enable CONFIG_LTO
– Make sure FUNCTION_PROFILER et.al./MODVERSIONS are disabled
● make CC=compiler LD=linux-ld AR=gcc-ld -jX– Check LTO does not get disabled
● See Documentation/lto-build for more details