Atmosphere 2014: Helping the Internet to scale since 1998 - Paweł Kuśmierski

Helping the Internet to scale since 1998 - Paweł Kuśmierski, Senior Engineer, Lead System Operations, Akamai Krakow

Uploaded by proidea on 09-May-2015


DESCRIPTION

Akamai runs a network of 150,000 servers distributed among 2,000 locations in 92 countries. It constantly outputs terabits per second, accounting for between 15% and 30% of the world's web traffic. The talk will cover the principles of operation of Akamai's Intelligent Platform, along with monitoring and consistent configuration management at such scale. The speaker will share interesting technical details and the general ideas behind the scalability and performance of the Akamai network.

Paweł Kuśmierski is a Senior Engineer and Lead of Akamai's System Operations in Krakow, Poland. He is responsible for operational oversight of the Internet Mapping and Distributed Storage systems. In the past he interned at Google's Mountain View office as a Software Engineer. He lives in Krakow with his wife and three-year-old son. Occasionally he finds time to fly sailplanes and build electronic devices.

TRANSCRIPT

Page 1: Atmosphere 2014: Helping the Internet to scale since 1998 - Paweł Kuśmierski

Helping the Internet to scale since 1998

Paweł Kuśmierski, Senior Engineer, Lead System Operations, Akamai Krakow

Page 2

©2013 AKAMAI | FASTER FORWARD™

What’s Akamai?

Founded at MIT in 1998 by Prof. Tom Leighton and Danny Lewin

Akamai has the world's most distributed Internet platform (over 150,000 servers, deployed in 2,000 locations in 92 countries)

The Akamai Intelligent Platform is a leading cloud platform, delivering between 15% and 30% of worldwide web traffic.

Accelerating Daily Traffic of:

10+ Tbps

20+ million hits per second

2+ trillion deliveries per day

30+ petabytes/day

10+ million concurrent streams

Page 3

Who do we serve?

The top 30 media & entertainment companies

All 20 top global eCommerce sites

7 of the top 10 world banks

9 of the top 10 largest newspapers

9 of the top 10 social media sites

6 of the top 7 computer manufacturers

All of the top anti-virus companies

Page 4

What’s the idea?

• Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

• ACMS: Akamai Configuration Management System

• Query (various publications, e.g. Scaling a Monitoring Infrastructure for the Akamai Network)

http://www.akamai.com/html/perspectives/techpubs.html
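The first publication listed introduces consistent hashing: servers and objects are hashed onto the same ring, and each object belongs to the first server found clockwise from it, so removing a server remaps only the objects that server held. A minimal sketch of the textbook technique, assuming MD5 as the ring hash, 100 virtual nodes per server and hypothetical server names; it is not Akamai's production mapping code.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a point on a 2**128 ring (MD5 is an assumed choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Textbook consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted positions of all virtual nodes
        self._owner = {}    # position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Scatter `vnodes` copies of the server around the ring to even out load.
        for i in range(self.vnodes):
            point = ring_hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        for i in range(self.vnodes):
            point = ring_hash(f"{server}#{i}")
            self._points.remove(point)
            del self._owner[point]

    def lookup(self, key):
        # The owner is the first virtual node clockwise from the key's position;
        # the modulo wraps past the top of the ring back to the first point.
        idx = bisect.bisect_left(self._points, ring_hash(key))
        return self._owner[self._points[idx % len(self._points)]]


# Usage: removing one of ten hypothetical servers only remaps that server's keys.
ring = ConsistentHashRing([f"edge-{n}" for n in range(10)])
objects = [f"/img/{i}.jpg" for i in range(10_000)]
before = {obj: ring.lookup(obj) for obj in objects}
ring.remove("edge-3")
moved = [obj for obj in objects if ring.lookup(obj) != before[obj]]
print(f"{len(moved)} of {len(objects)} objects moved")
```

Every moved object was previously owned by edge-3; objects on the other nine servers keep their assignment, which is what keeps caches warm when machines come and go.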

Page 5

Why and how is Akamai helping the Internet to scale?

The Internet wasn't designed for the ways in which we use it today.

• No single network dominates Internet traffic: the largest controls less than 5% of access traffic.

Trouble:

• Outages (cable cuts, de-peering)

• Congestion (packet loss)

• Lack of scalability

• Slow adaptability (IPv6 first proposed in 1998)

• Lack of security

Page 6

10,000-foot view of Akamai

Page 7

Akamai Cloud Optimization

The User Always Connects to a Nearby Akamai Server

Challenges with Cloud Adoption

Cloud servers reside in big data centers, farther away from the end user, resulting in decreased performance and security.

[Diagram: End User, Akamai Edge Servers, Cloud Datacenter]

Page 8

Cloud Optimization: Route Selection

Problem 1: The route to the datacenter may perform poorly.

[Diagram: End User, Cloud Datacenter, with two routes marked X]

Page 9

Cloud Optimization: Route Selection

Problem 1: The route to the datacenter may perform poorly.

Solution: Akamai SureRoute to optimize the route.

[Diagram: End User, Akamai Edge Servers, Cloud Datacenter, with one route marked X]
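The slides do not describe SureRoute's internals. As a rough illustration of overlay route selection in general, the sketch below compares the direct path with paths relayed through intermediate edge servers and picks the one with the lowest expected latency. The cost model (with independent losses, the expected number of attempts is 1 / (1 - loss), so the expected time is rtt / (1 - loss)), the path names and all numbers are assumptions, not Akamai's algorithm or measurements.

```python
from dataclasses import dataclass


@dataclass
class PathMeasurement:
    name: str       # hypothetical label, e.g. "direct" or "via edge-A"
    rtt_ms: float   # measured round-trip time
    loss: float     # measured packet-loss fraction, 0.0 <= loss <= 1.0


def expected_cost_ms(path: PathMeasurement) -> float:
    """Assumed model: each lost attempt costs one more RTT, so the expected
    time per successful exchange is rtt / (1 - loss)."""
    if path.loss >= 1.0:
        return float("inf")
    return path.rtt_ms / (1.0 - path.loss)


def choose_route(paths):
    """Return the candidate path with the lowest expected cost."""
    return min(paths, key=expected_cost_ms)


# Illustrative numbers only: a lossy direct path, e.g. after a cable cut.
candidates = [
    PathMeasurement("direct", rtt_ms=180.0, loss=0.30),
    PathMeasurement("via edge-A", rtt_ms=210.0, loss=0.01),
    PathMeasurement("via edge-B", rtt_ms=240.0, loss=0.00),
]
best = choose_route(candidates)
print(best.name)
```

Here the relayed path via edge-A wins (about 212 ms expected) even though its raw RTT is longer than the direct path's, because the direct path's 30% loss inflates its expected cost to about 257 ms.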

Page 10

Akamai SureRoute Makes a Big Difference

Packet loss into India after MidEast cable cut

[Chart: packet loss (0% to 50%) from Jan 25 to Feb 18, Generic Internet vs. Akamai]

Page 11

Cloud Optimization: Communication Protocol

Problem 2: Many round trips for the initial large download.

Solution: Akamai Communication Protocol.

[Diagram: End User, Akamai Edge Servers, Cloud Datacenter]
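One common source of those round trips is TCP slow start: a fresh connection begins with a small congestion window and roughly doubles it every round trip. A back-of-the-envelope sketch, assuming an initial window of 10 segments (RFC 6928), a 1460-byte MSS, pure doubling, no loss and no receive-window cap; the slides do not say how Akamai's protocol reduces the count.

```python
def slow_start_rtts(size_bytes: int, init_cwnd_segments: int = 10, mss: int = 1460) -> int:
    """Round trips needed to deliver `size_bytes` under idealized slow start:
    the window doubles every RTT, with no loss and no receive-window limit.
    Connection setup (TCP/TLS handshakes) is not counted."""
    if size_bytes <= 0:
        return 0
    segments_left = -(-size_bytes // mss)  # ceiling division
    cwnd = init_cwnd_segments
    rtts = 0
    while segments_left > 0:
        segments_left -= cwnd   # one window's worth of segments per RTT
        cwnd *= 2               # slow start doubles the window
        rtts += 1
    return rtts


# Same round-trip count, very different wall-clock time at different distances.
for rtt_ms in (20, 150):   # e.g. a nearby edge server vs. a distant datacenter
    rtts = slow_start_rtts(1_000_000)
    print(f"RTT {rtt_ms} ms: {rtts} round trips, ~{rtts * rtt_ms} ms")
```

A 1 MB transfer needs 7 round trips in this model, so shortening each round trip (a nearby server) matters as much as reducing their number.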

Page 12

Attacks on Akamai Customers

• Typical Attack Size: 3-10 Gbps

• Large Attack Size: 100-200 Gbps

• Attacks originate from all geographies and move between geographies during the attack

[Chart: Number of Attacks (0 to 600) per year, 2009 to 2011]

Page 13

Attack Methods (Source: TrustWave - 2010 - Web Hacking Incident Database)

• Denial of Service (DoS): 32%
• SQL Injection (SQLi): 21%
• Cross-Site Scripting (XSS): 9%
• Brute Force: 4%
• Cross-Site Request Forgery (CSRF): 4%
• Process Automation: 4%
• Known Vulnerability: 4%
• Misconfiguration: 3%
• Stolen Credentials: 1%
• Banking Trojan: 1%
• Predictable Resource Location: 1%
• Content Spoofing: 1%
• Abuse of Functionality: 1%
• DNS Hijacking: 1%
• Malware: 1%
• Insufficient Authentication: 1%
• OS Commanding: 1%
• Unknown: 10%

The Threat is Varied & Easier to Launch

74% of companies experienced one or more DDoS attacks in the past year.

31% of these attacks resulted in service disruption.

New attack tools such as Low Orbit Ion Cannon: users download the tool, insert the target URL or IP, and press GO!

Page 14

Web Application With a Perimeter Defense

[Diagram: End User, (Cloud) Datacenters; Origin Traffic vs. Akamai Traffic on scales from 1 to 10,000; label "COVERED"]

Page 15

July 4th – 7th 2009 DDoS Attack: 400,000 Korean Bots Attack Key U.S. Government Web Sites

Customer – PROTECTED        Times Above Normal Traffic  Peak Traffic
U.S. Government Customer 1  598x                        124 Gbps
U.S. Government Customer 2  369x                        32 Gbps
U.S. Government Customer 3  39x                         9 Gbps
U.S. Government Customer 4  19x                         9 Gbps
U.S. Government Customer 5  9x                          2 Gbps
U.S. Government Customer 6  6x                          1.9 Gbps
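The slide gives each site's peak and how many times above normal it was, but not the normal load itself. Assuming "times above normal" means peak divided by normal traffic (the slide does not define it), and pairing the two columns in the order they appear on the slide, the implied everyday load can be backed out:

```python
# (customer, peak traffic in Gbps, times above normal), paired in slide order.
ATTACK_PEAKS = [
    ("U.S. Government Customer 1", 124.0, 598),
    ("U.S. Government Customer 2", 32.0, 369),
    ("U.S. Government Customer 3", 9.0, 39),
    ("U.S. Government Customer 4", 9.0, 19),
    ("U.S. Government Customer 5", 2.0, 9),
    ("U.S. Government Customer 6", 1.9, 6),
]


def implied_normal_mbps(peak_gbps: float, times_above_normal: float) -> float:
    """Normal load in Mbps, assuming peak = times_above_normal * normal."""
    return peak_gbps * 1000.0 / times_above_normal


for customer, peak_gbps, multiplier in ATTACK_PEAKS:
    normal = implied_normal_mbps(peak_gbps, multiplier)
    print(f"{customer}: peak {peak_gbps} Gbps, normal ~{normal:.0f} Mbps")
```

For Customer 1 this gives roughly 207 Mbps of normal traffic against a 124 Gbps peak.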

Page 16

July 4th – 7th 2009 DDoS Attack: 400,000 Korean Bots Attack Key U.S. Government Web Sites

July 5, 2009:
• 16:00 Customer notified
• 20:00 Attack grows rapidly
• 21:00 Akamai identifies sources
• 23:00 Mitigation measures engaged
• 23:50 Peak pageviews

[Chart: attack size in Gbps (0 to 125) and unique IPs over time, with Spikes 1 to 3 marked]

Page 17

Under the hood

Page 18

Configuration change deployments

• Syntax check

• File liveness checks

• Check the number of objects changing

• Deploy to a subset

• Check for machine liveness (do we have a representative sample?)

• Check for relative change in machine liveness

• Check for service health

• Check for relative changes in response-code percentages

• Check for self-suspension
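The checklist reads like a staged (canary) rollout: static checks first, then deployment to a subset, then comparisons of the subset's health against a baseline before going further. A minimal sketch of those gates, assuming hypothetical metric fields, a "parses" test modeled as "is a dict", and made-up thresholds (10% of objects changing, 50-machine minimum sample, 2-point liveness drop, 1-point error-rate rise); file liveness checks are not modeled, and this is not the actual ACMS pipeline.

```python
from dataclasses import dataclass


@dataclass
class FleetStats:
    """Health metrics for a set of machines (hypothetical fields)."""
    machines_reporting: int
    machines_alive: int
    service_healthy: int
    error_rate: float      # fraction of responses with error status codes
    self_suspended: int    # machines that took themselves out of service


def pre_deploy_checks(old: dict, new, max_changed_fraction: float = 0.1):
    """Static checks before shipping: the new config must parse (modeled here
    as 'is a dict') and must not change too many objects at once."""
    if not isinstance(new, dict):
        return False, "syntax check failed"
    keys = set(old) | set(new)
    changed = sum(1 for k in keys if old.get(k) != new.get(k))
    if keys and changed / len(keys) > max_changed_fraction:
        return False, f"{changed} of {len(keys)} objects changing"
    return True, "ok"


def canary_gate(before: FleetStats, after: FleetStats, min_sample: int = 50,
                max_liveness_drop: float = 0.02, max_error_increase: float = 0.01):
    """Decide whether a change deployed to a subset may go further."""
    if after.machines_reporting < min_sample:
        return False, "canary sample too small to be representative"
    live_before = before.machines_alive / max(1, before.machines_reporting)
    live_after = after.machines_alive / after.machines_reporting
    if live_before - live_after > max_liveness_drop:
        return False, "relative drop in machine liveness"
    if after.service_healthy < after.machines_alive:
        return False, "service health check failing"
    if after.error_rate - before.error_rate > max_error_increase:
        return False, "response-code mix shifted"
    if after.self_suspended > before.self_suspended:
        return False, "machines self-suspended"
    return True, "ok"


# Usage with made-up numbers: liveness on the subset fell from 99% to 90%.
baseline = FleetStats(100, 99, 99, 0.002, 0)
canary = FleetStats(100, 90, 90, 0.003, 0)
print(canary_gate(baseline, canary))
```

Comparing relative changes against a baseline, rather than absolute values, keeps the gate from tripping on machines that were already down before the change.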

Page 19

OK, but how?

• Various web infrastructure services

• Over 150,000 machines

• Over 1 million distributed components

• Over 1000 autonomous systems

• 24/7/365 operation

• Failures, usage changes

• Massive, real-time monitoring

Page 20

Query

• Distributed data collection

• Aggregation at several hundred points

• SQL-style interface

Page 21

A Sample Query

SELECT
    c.continent_name,
    SUM(l.hits) hits
FROM
    load_info l,
    region_data r,
    continent_data c
WHERE
    l.georegion=r.id AND
    r.continent=c.continent
GROUP BY
    c.continent_name
ORDER BY
    hits DESC;

c.continent_name       hits
----------------  ---------
North America     4,620,551
Europe            3,392,102
South America       655,175
Asia                552,258
Africa              106,781
Oceania              39,905
Antarctica              135
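Because Query's interface is SQL-style, the sample query above can be tried against a stand-in database. A sketch using Python's built-in sqlite3 module: the table and column names come from the query, but the schemas beyond those columns and all rows are invented, and real Query tables live in distributed aggregators, not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE continent_data (continent INTEGER, continent_name TEXT);
    CREATE TABLE region_data (id INTEGER, continent INTEGER);
    CREATE TABLE load_info (georegion INTEGER, hits INTEGER);

    INSERT INTO continent_data VALUES (1, 'North America'), (2, 'Europe');
    INSERT INTO region_data VALUES (10, 1), (11, 1), (20, 2);
    -- Invented rows: each reporting machine contributes hits for a georegion.
    INSERT INTO load_info VALUES (10, 500), (11, 250), (20, 600), (10, 100);
""")

# The sample query from the slide, unchanged apart from whitespace.
SAMPLE_QUERY = """
SELECT c.continent_name, SUM(l.hits) hits
FROM load_info l, region_data r, continent_data c
WHERE l.georegion = r.id AND r.continent = c.continent
GROUP BY c.continent_name
ORDER BY hits DESC;
"""

for name, hits in conn.execute(SAMPLE_QUERY):
    print(f"{name:<16} {hits:>9,}")
```

With these made-up rows, North America sums to 850 hits and Europe to 600, printed in descending order like the slide's result table.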

Page 22

Query at the Edge

• Each machine collects its own data

• Many processes may publish

• Snapshots every two minutes

Page 23

Cluster proxies

• Collect data for the whole cluster

• Include themselves

Page 24

Top-Level Aggregators

• Collect data for the whole network

• Snapshots every two minutes

• Static tables for data that doesn’t change much
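The Query at the Edge, Cluster proxies and Top-Level Aggregators slides describe a three-level pipeline: each machine snapshots rows about itself, a cluster proxy gathers its cluster's rows (including its own), and a TLA merges the clusters into a network-wide view. A sketch under assumptions: tables are lists of dict rows with a hypothetical schema, the TLA pre-summing hits per georegion is an illustrative choice, and snapshot timing, static tables and fault handling are left out.

```python
from collections import defaultdict


def machine_snapshot(machine_id, hits_by_region):
    """Each machine publishes rows describing only itself (hypothetical schema)."""
    return [{"machine": machine_id, "georegion": region, "hits": hits}
            for region, hits in hits_by_region.items()]


def cluster_proxy(snapshots):
    """Concatenate the snapshots of every machine in the cluster, the proxy's
    own included, into one cluster-wide table."""
    table = []
    for rows in snapshots:
        table.extend(rows)
    return table


def top_level_aggregate(cluster_tables):
    """Merge cluster tables into a network-wide view; here it also sums hits
    per georegion to show how aggregation shrinks the data."""
    totals = defaultdict(int)
    for table in cluster_tables:
        for row in table:
            totals[row["georegion"]] += row["hits"]
    return dict(totals)


cluster_a = cluster_proxy([
    machine_snapshot("a1", {10: 500, 20: 100}),   # the proxy machine itself
    machine_snapshot("a2", {10: 250}),
])
cluster_b = cluster_proxy([machine_snapshot("b1", {20: 600, 30: 5})])
network = top_level_aggregate([cluster_a, cluster_b])
print(network)
```

Five machine-level rows collapse into three network-wide totals; at real scale, the same fan-in is what lets a few hundred aggregators serve data from many thousands of machines.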

Page 25

SQL parsers

• Get tables from 1 TLA

• Only get the ones we need

• Answer queries based on them

Page 26

Aggregator Sets

• Span different parts of the network

• Designated for different purposes

• Several replicated TLAs & SQLs

• Combined TLA/SQLs

• Shared hostnames

• Help meet reliability guarantees

• Help tolerate faults & keep localized
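Replicated TLAs and SQL parsers behind shared hostnames mean a query can still be answered when individual aggregators fail. A minimal client-side sketch of that idea, assuming a hypothetical list of replica names and a query function that raises ConnectionError on failure; it does not model how shared hostnames actually resolve to replicas.

```python
import random


class AllReplicasFailed(Exception):
    """Raised when no replica in the set could answer."""


def query_with_failover(replicas, run_query, sql, rng=random):
    """Try replicas in random order (spreading load) until one answers."""
    order = list(replicas)
    rng.shuffle(order)
    errors = {}
    for replica in order:
        try:
            return replica, run_query(replica, sql)
        except ConnectionError as exc:
            errors[replica] = exc   # remember why, then try the next replica
    raise AllReplicasFailed(errors)


# Usage with a fake backend in which one hypothetical replica is down.
down = {"sql-1"}


def fake_run(replica, sql):
    if replica in down:
        raise ConnectionError(f"{replica} unreachable")
    return [("North America", 850)]


replica, rows = query_with_failover(["sql-1", "sql-2", "sql-3"], fake_run, "SELECT 1")
print(replica, rows)
```

Randomizing the order spreads queries across healthy replicas instead of always hammering the first one in the list.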

Page 27

Scale

• Several hundred TLAs, SQLs, TLA/SQLs
• Thousands of queries per minute
• Tens of GB in the system
• Up to 16 GB per TLA (and growing fast)
    - Internet usage
    - Network growth
    - Customer growth
    - Data/customer
    - More queries
• Age of data typically a few minutes

Page 28

Result: 2-100X compression

Download the Akamai Internet Visualization app in the Apple App Store

Page 29

Thanks!

Paweł Kuśmierski, [email protected]