Atmosphere 2014: Helping the Internet to Scale Since 1998 - Paweł Kuśmierski
DESCRIPTION
Akamai runs a network of 150,000 servers distributed among 2,000 locations in 92 countries. It constantly outputs terabits per second, accounting for between 15% and 30% of the Internet’s WWW traffic. The talk will cover the principles of operation of Akamai’s Intelligent Platform and aspects of monitoring and managing consistent configuration at such scale. The speaker will share interesting technical details and general ideas behind the scalability and performance of the Akamai network.
Paweł Kuśmierski is a Senior Engineer and Lead of Akamai’s System Operations in Krakow, Poland. He is responsible for operational oversight of Internet Mapping and Distributed Storage systems. In the past he interned at Google’s Mountain View office as a Software Engineer. He lives with his wife and three-year-old son in Krakow. Occasionally he finds time to fly sailplanes and build electronic devices.
TRANSCRIPT
Helping the Internet to scale since 1998
Paweł Kuśmierski, Senior Engineer, Lead
System Operations, Akamai Krakow
©2013 AKAMAI | FASTER FORWARD™
What’s Akamai?
Founded at MIT in 1998 by Prof. Tom Leighton and Danny Lewin
Akamai has the world’s most distributed Internet platform (over 150,000 servers, deployed in 2,000 locations in 92 countries)
The Akamai Intelligent Platform is a leading cloud platform, delivering between 15% and 30% of worldwide web traffic.
Accelerating Daily Traffic of:
10+ Tbps
20+ million hits per second
2+ trillion deliveries per day
30+ petabytes/day
10+ million concurrent streams
Who do we serve?
The top 30 media & entertainment companies
All 20 top global eCommerce sites
7 of the top 10 world banks
9 of the top 10 largest newspapers
9 out of 10 top social media sites
6 of the top 7 computer manufacturers
All of the top anti-virus companies
What’s the idea?
• Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
• ACMS: Akamai Configuration Management System
• Query (various publications, Scaling a Monitoring Infrastructure for the Akamai Network)
http://www.akamai.com/html/perspectives/techpubs.html
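The first paper listed above introduces consistent hashing. As a rough illustration only (this is not Akamai's implementation), a minimal hash ring could look like the sketch below. The class name `HashRing`, the server names, the use of SHA-256, and the number of virtual nodes are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes per server."""

    def __init__(self, servers, replicas=100):
        self._replicas = replicas
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        """Place `replicas` virtual nodes for the server on the ring."""
        for i in range(self._replicas):
            point = _hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        """Take all of the server's virtual nodes off the ring."""
        for i in range(self._replicas):
            point = _hash(f"{server}#{i}")
            self._points.remove(point)
            del self._owner[point]

    def lookup(self, key):
        """Return the server owning the first ring point at or after the key."""
        idx = bisect.bisect_left(self._points, _hash(key))
        if idx == len(self._points):
            idx = 0  # wrap around to the start of the ring
        return self._owner[self._points[idx]]


ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.lookup("/images/logo.png"))
```

The property that matters for caching is that removing a server only remaps the keys that server owned; keys assigned to the remaining servers stay where they were.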
Why and how is Akamai helping the Internet to scale?
The Internet wasn’t designed for the ways in which we use it today.
• No single network dominates Internet traffic; the largest controls less than 5% of the access traffic.
Trouble:
• Outages (cable cuts, de-peering)
• Congestion (packet loss)
• Lack of scalability
• Slow adaptability (IPv6 first proposed in 1998)
• Lack of security
10,000-foot view of Akamai
Akamai Cloud Optimization
The User Always Connects to a Nearby Akamai Server
Challenges with Cloud Adoption
Cloud servers reside in big data centers, farther away from the end user, resulting in decreased performance and security
[Diagram: End User, Akamai Edge Servers, Cloud Datacenter]
Cloud Optimization: Route Selection
Problem 1: Route to datacenter may perform poorly
[Diagram: End User, Cloud Datacenter; failing routes marked X]
Cloud Optimization: Route Selection
Problem 1: Route to datacenter may perform poorly
Solution: Akamai SureRoute to optimize route
[Diagram: End User, Akamai Edge Servers, Cloud Datacenter; failing route marked X]
Akamai SureRoute Makes a Big Difference
Packet loss into India after MidEast cable cut
[Chart: packet loss (0% to 50%), Jan 25 to Feb 18, Generic Internet vs. Akamai]
Cloud Optimization: Communication Protocol
Problem 2: Many round trips for initial large download
Solution: Akamai Communication Protocol
[Diagram: End User, Akamai Edge Servers, Cloud Datacenter]
Attacks on Akamai Customers
• Typical Attack Size: 3-10 Gbps
• Large Attack Size: 100-200 Gbps
• Attacks are originating from all geographies and are moving between geographies during the attack
[Chart: number of attacks per year, 2009 to 2011; y-axis 0 to 600]
Attack Methods
Denial of Service (DoS): 32%
SQL Injection (SQLi): 21%
Cross-Site Scripting (XSS): 9%
Brute Force: 4%
Cross-Site Request Forgery (CSRF): 4%
Process Automation: 4%
Known Vulnerability: 4%
Misconfiguration: 3%
Stolen Credentials: 1%
Banking Trojan: 1%
Predictable Resource Location: 1%
Content Spoofing: 1%
Abuse of Functionality: 1%
DNS Hijacking: 1%
Malware: 1%
Insufficient Authentication: 1%
OS Commanding: 1%
Unknown: 10%
Source: TrustWave - 2010 - Web Hacking Incident Database
The Threat is Varied & Easier to Launch
74% of companies experienced one or more DDoS attacks in the past year.
31% of these attacks resulted in service disruption.
New attack tools such as Low Orbit Ion Cannon
Users download the tool, insert the target URL or IP and press GO!
Web Application With a Perimeter Defense
[Diagram labels: End User, (Cloud) Datacenters, Akamai Traffic, Origin Traffic, COVERED; traffic scales from 1 to 10000]
July 4th – 7th 2009 DDoS Attack
400,000 Korean Bots Attack Key U.S. Government Web Sites
Customer – PROTECTED         Peak Traffic   Times Above Normal Traffic
--------------------------   ------------   --------------------------
U.S. Government Customer 1   124 Gbps       598x
U.S. Government Customer 2   32 Gbps        369x
U.S. Government Customer 3   9 Gbps         39x
U.S. Government Customer 4   9 Gbps         19x
U.S. Government Customer 5   2 Gbps         9x
U.S. Government Customer 6   1.9 Gbps       6x
July 4th – 7th 2009 DDoS Attack
400,000 Korean Bots Attack Key U.S. Government Web Sites
July 5, 2009
16:00 Customer notified
20:00 Attack grows rapidly
21:00 Akamai identifies sources
23:00 Mitigation measures engaged
23:50 Peak pageviews
[Chart: attack size in Gbps (0 to 125) and unique IPs over time, with Spike 1, Spike 2, and Spike 3 marked]
Under the hood
Configuration change deployments
• Syntax check
• File liveness checks
• Check number of objects changing
• Deploy to a subset
• Check for machine liveness (do we have a representative sample?)
• Check for relative change in machine liveness
• Check for service health
• Check relative changes in response code percentages
• Check for self-suspension
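The post-deployment checks on this slide (representative sample, relative change in liveness, relative change in response codes, self-suspension) could be combined into a single gate that decides whether a rollout may continue past the initial subset. The sketch below is illustrative only: the `Snapshot` fields, the function name, and every threshold are assumptions and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical health metrics for the subset that received the change."""
    live_machines: int
    total_machines: int
    error_responses: int
    total_responses: int
    self_suspended: int

    @property
    def liveness(self) -> float:
        return self.live_machines / self.total_machines if self.total_machines else 0.0

    @property
    def error_rate(self) -> float:
        return self.error_responses / self.total_responses if self.total_responses else 0.0


def safe_to_continue(before: Snapshot, after: Snapshot,
                     min_sample: int = 50,
                     max_liveness_drop: float = 0.02,
                     max_error_rise: float = 0.01) -> tuple[bool, str]:
    """Compare metrics before and after deploying to a subset.

    Returns (ok, reason). All thresholds are made-up defaults.
    """
    # Do we have a representative sample of live machines?
    if after.live_machines < min_sample:
        return False, "not a representative sample of live machines"
    # Relative change in machine liveness.
    if before.liveness - after.liveness > max_liveness_drop:
        return False, "machine liveness dropped"
    # Relative change in the share of error response codes.
    if after.error_rate - before.error_rate > max_error_rise:
        return False, "error response share rose"
    # Did any machine suspend itself after the change?
    if after.self_suspended > before.self_suspended:
        return False, "machines self-suspended"
    return True, "ok"


baseline = Snapshot(100, 100, 10, 10_000, 0)
print(safe_to_continue(baseline, Snapshot(90, 100, 10, 10_000, 0)))
```

Comparing against a baseline rather than against fixed absolute values follows the slide's emphasis on relative change: some machines are always down in a large network, so only a shift caused by the change should block the rollout.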
OK, but how?
• Various web infrastructure services
• Over 150,000 machines
• Over 1 million distributed components
• Over 1000 autonomous systems
• 24/7/365 operation
• Failures, usage changes
• Massive, real-time monitoring
Query
• Distributed data collection
• Aggregation at several hundred points
• SQL-style interface
A Sample Query
SELECT
c.continent_name,
SUM(l.hits) hits
FROM
load_info l,
region_data r,
continent_data c
WHERE
l.georegion=r.id AND
r.continent=c.continent
GROUP BY
c.continent_name
ORDER BY
hits DESC;
c.continent_name       hits
----------------  ---------
North America     4,620,551
Europe            3,392,102
South America       655,175
Asia                552,258
Africa              106,781
Oceania              39,905
Antarctica              135
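To try a query of this shape locally, the sketch below loads a few made-up rows into an in-memory SQLite database using Python's standard `sqlite3` module. The table and column names follow the slide. The column types and all data values are invented for the example, and SQLite merely stands in for Query's SQL-style interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema guessed from the columns the sample query references.
conn.executescript("""
    CREATE TABLE continent_data (continent INTEGER, continent_name TEXT);
    CREATE TABLE region_data (id INTEGER, continent INTEGER);
    CREATE TABLE load_info (georegion INTEGER, hits INTEGER);
""")

# Invented sample data: two continents, three regions, four load reports.
conn.executemany("INSERT INTO continent_data VALUES (?, ?)",
                 [(1, "Europe"), (2, "Asia")])
conn.executemany("INSERT INTO region_data VALUES (?, ?)",
                 [(10, 1), (11, 1), (20, 2)])
conn.executemany("INSERT INTO load_info VALUES (?, ?)",
                 [(10, 300), (11, 200), (20, 150), (20, 50)])

# The query from the slide, unchanged apart from whitespace.
rows = conn.execute("""
    SELECT c.continent_name, SUM(l.hits) hits
    FROM load_info l, region_data r, continent_data c
    WHERE l.georegion = r.id AND r.continent = c.continent
    GROUP BY c.continent_name
    ORDER BY hits DESC
""").fetchall()

for name, hits in rows:
    print(f"{name:<16} {hits:>9,}")
```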
Query at the Edge
• Each machine collects its own data
• Many processes may publish
• Snapshots every two minutes
Cluster proxies
• Collect data for the whole cluster
• Include themselves
Top-Level Aggregators
• Collect data for the whole network
• Snapshots every two minutes
• Static tables for data that doesn’t change much
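The three tiers described so far (edge machines, cluster proxies, top-level aggregators) can be pictured as repeated merging of table snapshots. Here is a minimal sketch, assuming each snapshot is simply a dict mapping a table name to a list of rows. The function names and sample rows are hypothetical, and the real system is certainly more involved.

```python
from collections import defaultdict


def merge_snapshots(snapshots):
    """Concatenate the rows of same-named tables from several snapshots."""
    merged = defaultdict(list)
    for snapshot in snapshots:
        for table, rows in snapshot.items():
            merged[table].extend(rows)
    return dict(merged)


def cluster_proxy(own_snapshot, peer_snapshots):
    """A cluster proxy collects its peers' data and includes its own."""
    return merge_snapshots([own_snapshot, *peer_snapshots])


def top_level_aggregator(cluster_snapshots):
    """A top-level aggregator merges the data of every cluster it covers."""
    return merge_snapshots(cluster_snapshots)


# Each edge machine publishes its own rows (invented values).
edge_a = {"load_info": [{"georegion": 10, "hits": 300}]}
edge_b = {"load_info": [{"georegion": 10, "hits": 200}]}
edge_c = {"load_info": [{"georegion": 20, "hits": 150}]}

cluster_1 = cluster_proxy(edge_a, [edge_b])
cluster_2 = cluster_proxy(edge_c, [])
network = top_level_aggregator([cluster_1, cluster_2])

print(sum(row["hits"] for row in network["load_info"]))
```

In this picture a network-wide table is just the union of per-machine rows, which is why a query such as the per-continent sum can be answered from one aggregator's copy.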
SQL parsers
• Get tables from 1 TLA
• Only get the ones we need
• Answer queries based on them
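The "only get the ones we need" step amounts to working out which tables a query references before fetching anything. Below is a rough sketch, assuming the simple comma-separated FROM clause used in the sample query. The regular expression and the helpers `tables_needed` and `fetch_tables` are illustrative and are not part of Query.

```python
import re

# Matches the text between FROM and the next clause keyword (or the end).
_FROM_CLAUSE = re.compile(
    r"\bFROM\b(.*?)(?:\bWHERE\b|\bGROUP\s+BY\b|\bORDER\s+BY\b|;|$)",
    flags=re.IGNORECASE | re.DOTALL,
)


def tables_needed(sql: str) -> list[str]:
    """Return table names from a simple 'FROM a x, b y' clause.

    Handles only comma-separated tables with optional aliases; explicit
    joins, subqueries, and quoted names are out of scope for this sketch.
    """
    match = _FROM_CLAUSE.search(sql)
    if not match:
        return []
    names = []
    for item in match.group(1).split(","):
        parts = item.split()
        if parts:
            names.append(parts[0])  # first token is the table, second the alias
    return names


def fetch_tables(tla: dict, sql: str) -> dict:
    """Copy only the referenced tables from a TLA snapshot (a plain dict here)."""
    return {name: tla[name] for name in tables_needed(sql) if name in tla}


sql = """SELECT c.continent_name, SUM(l.hits) hits
         FROM load_info l, region_data r, continent_data c
         WHERE l.georegion=r.id AND r.continent=c.continent
         GROUP BY c.continent_name ORDER BY hits DESC;"""
print(tables_needed(sql))
```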
Aggregator Sets
• Span different parts of the network
• Designated for different purposes
• Several replicated TLAs & SQLs
• Combined TLA/SQLs
• Shared hostnames
• Help meet reliability guarantees
• Help tolerate faults & keep localized
Scale
• Several hundred TLAs, SQLs, TLA/SQLs
• Thousands of queries per minute
• Tens of GB in the system
• Up to 16 GB per TLA (and growing fast)
• Internet usage
• Network growth
• Customer growth
• Data/customer
• More queries
• Age of data typically a few minutes
Result: 2-100X compression
Download the Akamai Internet Visualization app in the Apple store