the hiphop virtual machine
TRANSCRIPT
THE HIPHOP VIRTUAL MACHINE Keith Adams Facebook 2/6/13
Why PHP?
Facebook’s PHP Engines • 2004-2009: Zend
• Switch-on-bytecode C interpreter
• 2009-2012: HipHop • Ahead-of-time compiler • Intermediate: a C++ program • Output: big fat ELF binary • Devs use interpreter
• 2013-: HipHop VM • Just-in-time compiler • Dev/prod unified
HipHop Production Throughput
simpl
e
simpl
ecall
simpl
euca
ll
simpl
eudc
all
man
del
man
del2
acker
man
n(11
)
ary(
50000
000)
ary2(
5000
0000
)
ary3(
2000
0)
fibo(
40)
hash
1(100
0000
0)
hash
2(100
00)
heap
sort(
20000
00)
matrix
(2000
0)
neste
dloop
(36)
sieve
(3000
0)
strca
t(300
00)
binar
y!tree
s
fannk
uch!r
edux
fast
a
k!nuc
leot
ide1
man
delbr
ot
n!bod
y
regex
!dna
rever
se!c
ompl
emen
t
spec
tral!n
orm
Geo
Mea
n
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Zend HHVMi HipHop
Exec
uti
on
tim
e re
lati
ve
to Z
end
Figure 4. Performance comparison of Zend, HHVMi, and HipHop using standard PHP and computer language benchmarks
and updated with the minimal changes necessary to keepit running the site. These updates included support for newlanguage extensions, but no performance enhancements.
In August 2010, we last compared the performance ofHipHop and Zend running Facebook production traffic. Atthat time, the version of Zend used was 5.1.3. For this com-parison, two clusters of servers were used, one with HipHopand another with Zend, and load balancers ensured that thetwo sets received equal traffic. The average CPU time serv-ing requests in each cluster was computed across a large pe-riod of time. On average, HipHop completed web requests2.5⇥ faster than Zend. Since HipHop was much faster, theservers in its cluster were underutilized. We also comparedthe throughput of the clusters when the servers were fullyloaded. To do so, we configured the load balancers to pro-vide enough traffic to keep the CPUs in both clusters 100%utilized. The results of this experiment, measured in numberof requests-per-second (RPS), demonstrated that the HipHopcluster achieve throughput 2.5⇥ higher than Zend. On theleft side of Figure 5, this throughput data is plotted relativeto HipHop.
Figure 5 also plots the throughput improvements made toHipHop after August 2010. This comparison was performedusing the same methodology described above, except thatthe Zend cluster was changed to use the baseline HipHopbranch instead. The load balancers were again set to keepthe servers in each cluster fully utilized, so that we couldmeasure the maximum throughput achieved by each server.Overall, HipHop’s throughput over this period improved by
Aug
’201
0
Dec
’2010
Mar
’201
1
Jun’20
11
Sep’20
11
Dec
’2011
0
0.5
1
1.5
2
2.5
Baseline HipHop HipHop Zend
Rel
ativ
e T
hro
ug
hp
ut
Figure 5. HipHop throughput over time, running Face-book production traffic. The baseline is the throughput ofHipHop’s version from August 2010, which is when Zendlast ran Facebook.
another 2.3⇥. Although we cannot directly compare withZend running the current Facebook web site, these data sug-gest that HipHop’s throughput is now 5.8 (2.5 ⇥ 2.3) timeshigher than what Zend 5.1.3 would achieve. These improve-ments resulted from additional tuning, static analyses, andoptimizations described in Section 3 that were added in thisperiod.
Given the nature of the Facebook application, which pro-cesses huge data sets, one could expect Facebook’s webservers to be I/O bound. However, given the great amount
From “The HipHop Compiler for PHP,” Zhao et al., OOPSLA 2012
HipHop Learnings • Compiler taught us that type inference is the most
important optimization available.
• Many types cannot be inferred at compile-time!
Dynamic Types
goldbach_conjecture() ? 3.14159 : “string” !
mysql_fetch_row($result)[0] !
123.2 / $divisor !
Claim: Types are homogenous in practice • $a / $b could produce boolean but almost never does
• Database row types rarely change
• Goldbach conjecture can only be true or false
• Etc.
HHVM Vision • Observe types at runtime before generating code • Discover types that cannot be inferred
Aside: HipHop Bytecode • Originally developed to support a fast interpreter • Executable serialization of PHP
=
lval:$c +
rval:$a rval:$b
PushL 1 !PushL 0 !Add!SetL 2 !
Example program <? function mymax($a, $b) { return $a > $b ? $a : $b; }
...
Type polymorphism <? function mymax($a, $b) { return $a > $b ? $a : $b; }
mymax(1, 2) => 2 mymax(1.3, 3.2) => 3.2 => 3.2 mymax(true, false) => true mymax(array(1,2), array(4)) => array(1,2) mymax(array(1,2), array(2,1)) => array(2,1) mymax("Felix", "Bob”) => "Felix"
In HHBC
function mymax($a, $b) { ! return $a > $b ? $a : $b; !} !
PushL 1 ! PushL 0 ! Gt! JmpZ 1f ! PushL 0 ! Jmp 7 2f !1: PushL 1 !2: RetC !
HHVM’s Strategy • Basic-block at a time • Type specialization • Goal: incremental type discovery
Tracelets • Our compilation unit: the tracelet
• Sequence of control-flow-free bytecode instructions, annotated with type information
• We construct tracelets the moment before running them, translate them to machine code, and chain them together in the code cache
Tracelet construction • mymax(10, 333);
PushL 1 ! PushL 0 ! Gt ! JmpZ 1f ! PushL 0 ! Jmp 7 2f !1: PushL 1 !2: RetC !
Tracelet construction: trimming • mymax(10, 333);
PushL 1 ! PushL 0 ! Gt ! JmpZ 1f – Break at branch! PushL 0 ! Jmp 7 2f !1: PushL 1 !2: RetC !
Tracelet construction • mymax(10, 333);
PushL 1 ! PushL 0 ! Gt ! JmpZ X !
Tracelet construction • mymax(10, 333);
PushL 1 – Relies on locals 0 and 1! PushL 0 ! Gt ! JmpZ X !
Tracelet construction: guards • mymax(10, 333);
Local0 :: Int! Local1 :: Int! PushL 1 ! PushL 0 ! Gt ! JmpZ X !
Tracelet construction: data flow • mymax(10, 333);
Local0 :: Int! Local1 :: Int! PushL 1 ! PushL 0 ! Gt (Int, Int) -> Bool! JmpZ X (Bool) -> () !
Tracelet construction: machine code • mymax(10, 333);
Local0 :: Int! Local1 :: Int! PushL 1 ! PushL 0 ! Gt ! JmpZ X !
cmpl $0x3,-0x4(%rbp) !jne <retranslate> !cmpl $0x3,-0x14(%rbp) !jne <retranslate> !
mov -0x20(%rbp),%rax !mov -0x10(%rbp),%r13 !mov %r13,%rcx !cmp %rax,%rcx !jle <translateSuccessor0> !jmpq <translateSuccessor1!
Tracelet construction • mymax(10, 333);
PushL 1 ! PushL 0 ! Gt ! JmpZ 1f – Taken! ! PushL 0 ! Jmp 7 2f – Skipped!
1: PushL 1 !2: RetC !
Logical view of code cache
$a :: Int, $b :: Int
$a > $b ?
$a :: Int, $b :: Int
return $b
Retranslate B
Retranslate C
Program Flow Guard Flow
Retranslate A A
C
A: PushL 1 ! PushL 0 ! Gt ! JmpZ 1f !B: PushL 0 ! Jmp 7 2f !C: 1: PushL 1 !2: RetC
Call mymax(“a”, “z”) $a :: Int, $b :: Int
$a > $b ?
$a :: Int, $b :: Int
return $b
Retranslate B
Retranslate C
Program Flow Guard Flow
Retranslate A A
C
$a :: String, $b :: String $a > $b ?
$a :: String, $b :: String return $b
Call mymax(“z”, “a”) $a :: Int, $b :: Int
$a > $b ?
$a :: Int, $b :: Int
return $b
Retranslate B
Retranslate C
Program Flow Guard Flow
Retranslate A A
C
$a :: String, $b :: String $a > $b ?
$a :: String, $b :: String return $b
$a :: String, $b :: String return $a
Risk: Code explosion • N inputs, each takes on t types
• will yield tN separate translations!
• Solution: truncate tracelet chain at 12 items • Fall back to interpreter. • Applies to 0.0066% of chains
$a :: Null, $b :: Null, $c :: Null
...
Interp A
$a :: Int, $b :: Null, $c :: Null
...
...
$a :: Int, $b :: Bool, $c :: String
...
Function types • PHP methods and functions return untyped data • Late-bound • Early approach: using return values breaks tracelets
function hotFunc() { … foreach ($obj in $objs) { $s = $a[$obj-‐>getName()]; // what type? }
Return type prediction class C { … function getName() { return $this-‐>name; } }
class D { … function getName() { return “D”; } }
function hotFunc() { … foreach ($obj in $objs) { $s = $a[$obj-‐>getName()]; // what type? }
Type profiling • Special case of value profiling[1] • Typical value-profiling records values encountered per-call
site • Problems
• Too much information • Requires long calibrations
• [1] Calder, et al. "Value profiling." Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM International Symposium on. IEEE, 1997.
Fun trick: profile names • Run first N requests in interpreter • Remember method name -> return type info • System learns facts like “methods named getName() tend
to return strings” • Works for callsites, classes, methods, etc. we haven’t seen in
warmup • Exploits the ways that humans naturally write code
• Our codebase has millions of call sites, but only 13K unique method names
• Accuracy ~99%
Example name predictions name type reset Null getTimer Int get_loggedin_user Int is_fb_employee Bool array_pull Array IPAddress Object ends_with Bool vqsprintf String html2txt String
Perf
HHVM vs. Zend
Perf: Wordpress
0
200
400
600
800
1000
1200
1400
Zend HHVM Interp
HPHPc HHVM Jit HHVM Repo Jit
Wordpress RPS
RPS
Facebook CPU idle (1/27/2013)
0
10
20
30
40
50
60
70
80
90
1 18
35
52
69
86
103
120
137
154
171
188
205
222
239
256
273
290
307
324
341
358
375
392
409
426
443
460
477
494
511
528
545
562
579
596
613
630
647
664
h3
v3
Road ahead • Low-hanging fruit • Memory management • Broadened compilation scope • Improved compilation quality • Science fiction: compilation as big-data
Conclusions • HHVM can run your PHP
• very fast • in both prod and dev
• Lots of cool work left
Team HHVM
Thanks! • https://github.com/facebook/hiphop-php/ • Our group blog: http://www.hiphop-php.com/wp/
HHBC is a platform, not an IR • Bytecode can be serialized • Runs without source • Natively supports PHP datatypes • Other front-ends, languages could produce HHBC
PHP AST
Op)mize
Parse HHBC
Bytecode Gen
Ahead: Compilation at scale • Facebook: many thousands of machines running the
same app
• Almost any profiling/compilation overhead can be amortized
• Challenge: publish local compilation results
Ahead: HackIR • SSA-based IR
• Supports classic compiler optimization suite, multiple machine targets
• Recently checked in
Ahead: Larger compilation scope • Stitch together a method’s tracelets
• HHIR gets larger scope for optimization
• Tracelet body counts • Provide both control-flow and type profiling
HipHop Pipeline
$c = $a + $b; // php!
=
lval:$c +
rval:$a rval:$b
_c.assign(_a.add(_b)); // C++ !
parse
optimize
emit
Problems with AoT • PHP development environment suffers
• Slow • New class of bug • Perf work different from “normal” work • Operational headaches deploying ELF binary
• Missed perf opportunities • Runtime has more info than compile-time • Type inference • Method dispatch
Important related work • Other dynamic language VMs
• Smalltalk, Self • ECMAscript: v8, JSC, Tamarin, *Monkey, … • Other: LuaJIT, PyPy
• Dynamic binary translators • Shade • Embra • VMware’s BT monitor • DynamoRIO
HipHop GeoMean of 5.6x faster than Zend
From “The HipHop Compiler for PHP,” Zhao et al., to appear in OOPSLA 2012
simpl
e
simpl
ecall
simpl
euca
ll
simpl
eudcall
mandel
mandel2
acker
mann(1
1)
ary(5
0000000)
ary2(
50000000)
ary3(
20000)
fibo(
40)
hash1(1
0000000)
hash2(1
0000)
heapsor
t(2000000)
matr
ix(2
0000)
nestedl
oop(36)
sieve(
30000)
strcat(
30000)
binar
y!trees
fannkuch!re
dux fa
sta
k!nucle
otid
e1
mandelb
rot
n!body
regex
!dna
rever
se!c
ompl
emen
t
spectr
al!nor
m
GeoMean
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Zend HHVMi HipHop
Exe
cutio
n tim
e re
lativ
e to
Zen
d
Figure 4. Performance comparison of Zend, HHVMi, and HipHop using standard PHP and computer language benchmarks
and updated with the minimal changes necessary to keepit running the site. These updates included support for newlanguage extensions, but no performance enhancements.
In August 2010, we last compared the performance ofHipHop and Zend running Facebook production traffic. Atthat time, the version of Zend used was 5.1.3. For this com-parison, two clusters of servers were used, one with HipHopand another with Zend, and load balancers ensured that thetwo sets received equal traffic. The average CPU time serv-ing requests in each cluster was computed across a large pe-riod of time. On average, HipHop completed web requests2.5⇥ faster than Zend. Since HipHop was much faster, theservers in its cluster were underutilized. We also comparedthe throughput of the clusters when the servers were fullyloaded. To do so, we configured the load balancers to pro-vide enough traffic to keep the CPUs in both clusters 100%utilized. The results of this experiment, measured in numberof requests-per-second (RPS), demonstrated that the HipHopcluster achieve throughput 2.5⇥ higher than Zend. On theleft side of Figure 5, this throughput data is plotted relativeto HipHop.
Figure 5 also plots the throughput improvements made toHipHop after August 2010. This comparison was performedusing the same methodology described above, except thatthe Zend cluster was changed to use the baseline HipHopbranch instead. The load balancers were again set to keepthe servers in each cluster fully utilized, so that we couldmeasure the maximum throughput achieved by each server.Overall, HipHop’s throughput over this period improved by
Aug’2010
Dec’20
10
Mar’201
1
Jun’2011
Sep’2011
Dec’20
110
0.5
1
1.5
2
2.5
Baseline HipHop HipHop Zend
Rel
ativ
e T
hrou
ghpu
t
Figure 5. HipHop throughput over time, running Face-book production traffic. The baseline is the throughput ofHipHop’s version from August 2010, which is when Zendlast ran Facebook.
another 2.3⇥. Although we cannot directly compare withZend running the current Facebook web site, these data sug-gest that HipHop’s throughput is now 5.8 (2.5 ⇥ 2.3) timeshigher than what Zend 5.1.3 would achieve. These improve-ments resulted from additional tuning, static analyses, andoptimizations described in Section 3 that were added in thisperiod.
Given the nature of the Facebook application, which pro-cesses huge data sets, one could expect Facebook’s webservers to be I/O bound. However, given the great amount
PHP Perf Challenges • What method/function is being called here? • What types do these data have? • Automatic memory management • High language “surface area”
Prod: Tracelet Chain length
Risk: Warmup • Possible weak point of JIT vs. AoT: warmup latency • We start with an empty code cache • Goal: reach steady state quickly
Warmup: Production requests/second
Code size over time
Jit output / time
The HipHop VM (HHVM) • New execution engine for PHP • ~5x faster than standard interpreter • No compromises
• Interactive use same as interpreter • Real warts-and-all PHP language
Why PHP? • Large base of software • Mediawiki, Drupal, Wordpress • Large opportunities for performance improvement
Background: PHP • Server-side • Single-threaded • Stateless
• Web requests run in isolation • Request starts with empty heap, runs to completion • No thread-to-thread memory sharing
• Untyped • Highly dynamic • Facebook
• > 18 MLOC • Significant open source PHP • Site pushed twice daily
Dynamic binding: functions <?
switch(date('l')) { case ’Wednesday': function foo($a, $b) { return "Wednesday!!!" . (32 * $a - $b); } break; default: function foo($a, $b) { return date('l') . ($a + $b) . "\n"; } } echo foo(1, 2);