brace yourselves, leap second is coming
TRANSCRIPT
Nati CohenFewbytes
BRACE YOURSELVES
LEAP SECONDIS COMING
Intro: Assumptions
Installing servers
1. Unbox2. Mount3. Connect to power4. Connect to network5. Power up6. Network boot7. …8. Profit
(Not) Installing servers
1. Unbox2. Mount3. Connect to power4. Connect to network5. Power up6. …7. …...8. ……...
Solving problems 101
1. Blame it on the network
2. DHCP issue?
3. PXE issue?
4. Problematic server?
We checked everything
at least anything that seemed plausible
But after 5 days...
Same MAC address
MAC addressA media access control address (MAC address) is a unique identifier assigned to network interfaces for communications on the physical network segment.
Once in a lifetime thing?
● HP servers & switches, Dell PCs, D-Link Access Points, Lenovo tablets, Android custom ROMs
Also:● 2008- Hyper-V● 2010- Xen 5.3● 2012- libvirt 0.10.2● 2013- VMWare ESXi 5.x● 2014- lxc-clone
We all make assumptions
● Development● Debugging● Marketing● Support● …
Being informed, helps avoiding them!
Agenda
1. Intro: Assumptions
2. Missiles and Rounding Errors
3. Breaking the Internet in one second
4. Aviation Safety
Patriot missile defense system
1. Search
2. Validate
3. Track
Next position =Velocity (real) * Time (int -> real)
GAO/IMTEC-92-26 - Software Problem Led to System Failure at Dhahran, Saudi Arabia
to_seconds(ttos)
ttos = 28800 // 8*60*60*10ttos * 0.1 = ?
1. 2880.02. 28799.97253. 28799.9999998613974. ???
It depends on the representation...
Rational Numbers
Recall, integers:1337 = 00000101 00111001
Option 1: Rational Numbers● (numerator, denominator)● PROs: exact representation● CONs: Pi / sqrt(2), space, speed
Fixed Point
Option 2: Fixed-point● variable * (base ^ scaling factor)● base and scaling factor are fixed
○ binary vs decimal○ +/- exponent
● PROs: space, easy to compute● CONs: limited range of valueseg.variable = 30, base = 2, scaling factor = -3
00011.1102 = 21 + 20 + 2-1 + 2-2 = 3.75
Rounding errors
0.125 = 2-3
0.1 = 2-4 + 2-5 + 2-8 + 2-9 + 2-12 + 2-13 + ...
with 8 bit variable0.1 = 0.09375
with 24 bit variable0.1 = 0.09999990463256836
Floating Point
Option 3: Floating-point (IEEE 754)● (mantissa * 2^exponent)● eg. float- 1 sign, 23 mantissa, 8 exponent● PROs: wide range, fast w/ FPU● CONs: accuracyCaveats- NaNs, signed zero/infinity, denorm
rounding, ... IEEE Standards Association. "Standard for Floating-Point Arithmetic." IEEE 754-2008 (2008).
Goldberg, David. "What every computer scientist should know about floating-point arithmetic." ACM Computing Surveys (CSUR) 23.1 (1991): 5-48.
Picking the right tool
Rational Numbers - fractions.Fraction
Decimal Fixed-point - decimalBinary Fixed-point - spfpm module
Floating-point issues and limitation
Recall: Patriot system
1. Search
2. Validate
3. Track
Next position =Velocity (real) * Time (int -> real)
GAO/IMTEC-92-26 - Software Problem Led to System Failure at Dhahran, Saudi Arabia
Patriot system cont’d
● After 8 hours 0.0275 seconds error● After 100 hours 0.3433 seconds error
Scud velocity is ~ 1,676 meters per-second
687 meters error
Patriot system cont’d
On February 25, 1991, a Patriot missile defense system operating at Dhahran, Saudi Arabia, during Operation Desert Storm failed to track and intercept an incoming Scud.
This Scud subsequently hit an Army barracks, killing 28 Americans.
GAO/IMTEC-92-26 - Software Problem Led to System Failure at Dhahran, Saudi Arabia
Let’s talk about time
Time is simple, right?
1 Year = 365 days
Leap year?
Time is (not) simple
1 Year = 365 or 366 days1 Month = 28/29/30/31 days
Mostly true, except:in Britain 1752, September had 19 daysin Russia 1918, February had 15 daysin Greece 1923, February had 15 days
Time is (not) simple
1 Year = 365 or 366 days*1 Month = 28/29/30/31 days*1 Day = 24 hours
Don’t forget DST!it can also be 23/25or 23.5/24.5
Lord Howe Island, Australia
Time is (not) simple
1 Year = 365 or 366 days*1 Month = 28/29/30/31 days*1 Day = 24 hours**1 Minute = 60 Seconds
NO- Leap Second might cause a minute to have 61 seconds, up to twice a year...
June 30, 2015 at 23:59
What could possibly go wrong?
● 2005/8- most NTPs failed to get it right● 2012- Bugs in Linux
○ Reddit, LinkedIn, Yelp, Meetup, Foursquare○ 400 Qantas flights delayed○ Leaping seconds and looping servers○ Linux's leap-second deadlocks
● s += 3600○ "one hour from now" ?○ "same time, next hour" ?
● 1 second in Flight Control = 300 meter
Possible solutions
● Simulate leap second○ eg. using adjtimex(8)
● Slewing time (AWS, Google)○ ntptime -s 0 # reset kernel state○ ntpdate -B <some-ntp-server>
● Servers from the future (Google, Facebook)● Have good monitoring
○ NTP offset○ leap second status○ @statscraft
● Falsehoods programmers believe about time● More falsehoods programmers believe about
time
● The One-Second War
Further reading
Let’s talk about planes
787 Boeing Dreamliner
● 30/4/2015- FAA requires operators to do electrical power deactivation at intervals which will not exceed 120 days
FAA- Airworthiness Directives; The Boeing Company Airplanes
But, why?
“Boeing ... identified during laboratory testing the software counter internal to the generator control units (GCUs) will overflow after 248 days of continuous power, ... resulting in a loss of all AC electrical power regardless of flight phase”
FAA- Airworthiness Directives; The Boeing Company Airplanes
Wait, 248 days?
● 231 / (100 * 60 * 60 * 24) = 248.551
● ie. 231 deciseonds are roughly 248 days
● signed integer?
Recall 2-complement
00000000 000000001 100000010 200000011 3…01111111 12710000000 -12810000001 -127…
Arithmetic operations are easy, but might overflow:
00000001 1+
01111111 127=
10000000 -128
F**k overflows, I’m using Python
>>> import sys
>>> print sys.maxint, type(sys.maxint)
9223372036854775807 <type 'int'>
>>> print sys.maxint + 1, type(sys.maxint + 1)
9223372036854775808 <type 'long'>
Arbitrary Precision
struct _longobject {PyObject_VAR_HEADdigit ob_digit[1];
};
● PROs: unlimited*● CONs: slow, harder to implement
○ think about multiplication● What about builtins? C modules?
○ eg. formatting, unicode, itertools, sqlite
PEP 237 -- Unifying Long Integers and IntegersPython 2.7.10 source code, “Include/longintrepr.h”
Know your language (ruby)
[1] pry(main)> a = 1337
=> 1337
[2] pry(main)> a.class
=> Fixnum
[3] pry(main)> a = 2**100
=> 1267650600228229401496703205376
[4] pry(main)> a.class
=> Bignum
Know your language (Scala)
scala> 2147483647 + 1
res1: Int = -2147483648
scala> Math.addExact(2147483647, 1)
java.lang.ArithmeticException: integer overflow
scala> Math.pow(2, 1024)
res4: Double = Infinity
Know your language (JavaScript)
> Number.MAX_VALUE
1.7976931348623157e+308
> Number.MAX_VALUE*2
Infinity
> Number.MAX_VALUE + 1 - Number.MAX_VALUE
0
> Number.MAX_VALUE - Number.MAX_VALUE + 1
1
From the news*
● January 2014:○ >67m players per month○ >27m players per day○ >7.5m concurrently during peak hours○ 946m dollar yearly revenue
● 12/6/2015- The EU West Spectator mode
crashed○ game counter just surpassed 2147483647 (2B)○ ie. 2**31 games...
League of Legends tops MMO revenue list, Hearthstone No. 10LEAGUE PLAYERS REACH NEW HEIGHTS IN 2014EUW spectator mode fell over recently – here’s why
Take home message
● MAC addresses aren’t always unique● Real numbers can be deadly● Time is not simple● Beware of the overflow
We all make assumptions, let’s make less
Thank You!Nati Cohen
Fewbytes
@nocoot