"scalability for big data" | dr. steve hanks, principal data scientist

Scalability Sucks (working title) June 6, 2013 Steve Hanks Principal Data Scientist WhitePages.com

Upload: whitepagespro

Post on 18-Dec-2014


DESCRIPTION

The world of Scalability is glamorous and magical: we get to use cool technologies with sexy names, and magical things happen. We have a tough problem, we throw a Scalability Technology at it, and results flow quickly and easily... or so it seems from the outside. The experience of the Whitepages Data Group has been quite different. We have significant, challenging problems of scale, both in the size of our data artifacts and in the number of updates we have to process to keep them accurate and up to date. But our day-to-day lives are much more mundane than the glamorous world where Hadoop meets NoSQL meets Scala, results flow smoothly, and bigger problems are solved by bolting on a node or two. Although we are heavy users of Hadoop, typically run EC2 instances in the hundreds, and are exploring three or four alternatives to Postgres, we have found no "silver bullet" technology, cutting-edge or otherwise. Rather, for us, scaling problems tend to be insidious and move around a lot, and we feel like firefighters (or worse, like mole-whackers) at least as often as we feel like Data Scientists. I will provide some case studies, examples and rants, most of which don't involve Hadoop or other magic bullets.

TRANSCRIPT

Page 1: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist

Scalability Sucks (working title)

June 6, 2013

Steve Hanks, Principal Data Scientist, WhitePages.com

Page 2: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Scalability Perception Versus (Our) Reality

• Perception

– Scalability is about technology, and adopting the right technology gives you scalability

o You want to believe it (the technology is fun)

o Sales people want you to believe it

• Reality

– Problems are complex and solutions are inter-related

– Scalability problems are rarely isolated to one facet of the solution

o A solution to one symptom tends to push the problem somewhere else (one thing leads to another)

– Scaling problems are rarely known at inception

o Tipping (over) points

WhitePages Confidential

Page 3: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Brooks: “No Silver Bullet – Essence and Accidents of Software Engineering” (1986)

• Separating “essential difficulties” from “accidental difficulties”

– Technologies address the latter, but at best free us to work on the problem features that are inherently difficult

• The mistake we make: thinking that technologies that address accidental difficulties in any sense solve the harder problems

– Then, Compilers, IDEs

– Now, Distributed Databases, NoSQL, MapReduce

• The message is the same: a technology can (help) solve your problem if

– It’s a simple problem and the technology is exactly the right tool, or

– Applying the technology can effectively solve one piece of a complex problem

Page 4: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist

Case Study: Scale, and More Scaling


Page 5: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


The WhitePages Data Ecosystem

[Diagram: Search/API, Data Build(s), Purchased Data, Internally Sourced Data, Core Data]

Data Size (approximate)

• 300M Persons

• 150M Addresses

• 135M Businesses

• 120M Telephones

• 1.4B Address Links

• 400M Telephone Links

Volume (monthly)

• 50M Unique Users (website)

• 35M Mobile Downloads (total)

• 1B API calls

• 600M Mobile-Related events

• 1.2B Data Inputs (purchased + internal)
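To get a feel for the monthly volumes above, it can help to turn them into average per-second rates. The sketch below does only that arithmetic. It assumes a 30-day month and perfectly even traffic, neither of which the slides state, so real peak rates would be higher.

```python
# Assumption: a 30-day month with evenly spread traffic (not stated in the slides).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Monthly volumes taken from the slide.
monthly_volumes = {
    "API calls": 1_000_000_000,
    "Mobile-related events": 600_000_000,
    "Data inputs (purchased + internal)": 1_200_000_000,
}


def average_rate_per_second(monthly_count: int) -> float:
    """Average events per second for a given monthly count."""
    return monthly_count / SECONDS_PER_MONTH


for name, count in monthly_volumes.items():
    rate = average_rate_per_second(count)
    print(f"{name}: about {rate:,.0f} per second on average")
```

Under these assumptions, 1B API calls a month works out to a few hundred calls per second on average.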

Page 6: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Scalability Challenges (Ours. Actually. Recently.)

[Diagram: Search/API, Data Build(s), Purchased Data, Internally Sourced Data, Core Data]

Page 7: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


The Q4/Q1 Scalability Storm

• Address History

• Close the Loop With our Internal Applications

– Phone Meta-Data

o Track more phones, especially mobile phones (e.g. carrier, call or search velocity)

» It’s very interesting to know whether a phone number is active, and when a number gets ported, for example

– Person/Phone links

o We get information from internal clients about phone numbers, calls, and names

o Sometimes we can infer a new link between a Person and a Phone

» Subject to some very strict privacy constraints

• Challenges Perceived at the Time

– Address History

o Size of the Core Database grows

o Time to do the build increases (by 25 to 50%?)

– Internal Data

o No big scaling problem anticipated, but need a new component to process the event queue and stage the data for the build (Data Normalization)

Page 8: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Data Normalization – Quick Way to Catch Internal Events

[Diagram labels: Normalization; CNAM; Phone Attributes; Current App (Searches, Phone Calls); PPMatcher; Data Build; etc.]

(Delivered from internal apps using ActiveMQ)

End of Q4

• Stable in production. Processing ~ 100K records per hour

• Clearing the backlog in about 13 minutes

• Nice general solution for tasks that need certain libraries or read/write to DBs (Hadoop doesn’t work well for these)

• Well integrated into our build process

Page 9: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Scaling Tipping Points Q4

• Historical address information is partially on board

– Increases size of CoreDB by 25%, reliability suffers due to stress

• IT finally looks carefully at our Q4 equipment asks, and is appalled

• Normalization is getting strained

– Greater adoption of Current app results in 5x increase in call records

– New data providers and other applications increase load on Normalization by 10x

– DB size (phones) increases beyond 20M, local Postgres is getting cranky

– Bottom line: finishing 1 hour’s worth of inputs takes 12 hours. Not encouraging.

• General mood is good. Multiple scaling issues, but it’s OK because we have a silver bullet: AWS
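The "1 hour of inputs in 12 hours" figure above means the backlog grows without bound. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the worker estimate assumes perfectly linear scaling, which is exactly the assumption the rest of this talk argues against.

```python
import math


def backlog_after(hours_elapsed: float, hours_per_input_hour: float) -> float:
    """Hours' worth of input still unprocessed after `hours_elapsed`.

    `hours_per_input_hour` is the wall-clock time needed to process one
    hour's worth of input (12 in the slide above).
    """
    processed = hours_elapsed / hours_per_input_hour
    return max(0.0, hours_elapsed - processed)


def workers_to_keep_up(hours_per_input_hour: float) -> int:
    """Naive lower bound on parallel workers needed just to stop falling behind.

    Assumes linear scaling with no shared-resource contention (an idealization).
    """
    return math.ceil(hours_per_input_hour)


print(backlog_after(24, 12))   # backlog, in hours of input, after one day
print(workers_to_keep_up(12))  # idealized worker count
```

The idealized worker count is only a floor: shared databases and filesystems tend to become the next bottleneck long before linear scaling holds.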

Page 10: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Q1/Q2 – Biting the Silver Bullet (or vice versa)

• Transition our whole build to EMR

– Looks easy, it’s all just Pig!

– Will save a ton of money

– No need for cluster administrator

– Scaling problems are solved, forever!

• Move our core database (contact graph) to RDS

– Looks easy, it’s just an RDB!

– Will save a ton of money

– No need for a DBA

– Scaling problems are solved, forever!

• Move normalization to the cloud by using hundreds of EC2 instances

– Looks easy, they’re just Linux boxes

– Will save a ton of money

– Deployment problems are solved

– Scaling problems are solved, forever!

Page 11: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Q2/Q3 – One Thing Leads to Another

• Expected and unexpected overall adoption problems with AWS

– Credentials, regions, etc.

– Carefully estimate time/cost to get data in and out

• EMR transition fairly smooth

• Normalization transition less smooth (to 200+ workers)

– Need new debugging paradigms (e.g. image changes, viewing local state)

– Shared resources (shared filesystems, databases)

– Dealing with race conditions and bad actors

– Downstream implications of 200+ workers

– Upstream implications of 200+ workers

• Unexpected new requirements

– For accounting purposes need to log all calls to external data providers

• Database transition less smooth

– Too many conflicting use cases for a clear-choice technology

– Uneasy peace with Redshift
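The accounting requirement above (log all calls to external data providers) is the kind of thing a small wrapper can handle. The sketch below is one possible shape, not what the WhitePages team built: the decorator, the logger name, the provider name, and `lookup_phone` are all hypothetical.

```python
import functools
import logging
import time

# Hypothetical logger name; a real deployment with 200+ workers would also
# need to ship these records to a central place.
logger = logging.getLogger("provider_calls")


def log_provider_call(provider_name):
    """Decorator that records every call to an external provider."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.time()
            status = "error"
            try:
                result = func(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # Logged on success and on failure, so accounting sees both.
                logger.info(
                    "provider=%s call=%s status=%s seconds=%.3f",
                    provider_name, func.__name__, status,
                    time.time() - started,
                )
        return wrapper
    return decorator


@log_provider_call("example_provider")  # hypothetical provider name
def lookup_phone(number):
    """Stand-in for a real provider request."""
    return {"number": number, "carrier": "unknown"}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(lookup_phone("2065550100"))
```

Wrapping at the call site keeps the logging requirement out of the business logic, which matters when the requirement arrives after the system is already built.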

Page 12: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Q4 – Stabilizing

• As We Speak

– Transition to EMR is complete, and the full build works smoothly, and takes about 50% of the time

o Some combination of the technology and our re-engineering

– Transition to Redshift is complete, though the story isn’t over

o We’re about to make our peace with the fact that we need multiple data representations, and all that implies

– Normalization is still in flux, still under stress

o Our failure to anticipate the scaling implications for our database of phones continues to bedevil us

» Exploring NoSQL solutions

– Reliably stabilized and at full scale by end of year

Page 13: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Wrap-Up

• We did a lot of scaling. It was messy, and took longer than we thought

– Because we were dumb?

– Because we didn’t plan well enough?

– Because it’s inherently messy, and Brooks was necessarily right about incremental development?

• We accomplished really awesome things, and I still don’t have a single cool or insightful thing to say about specific technologies!

• Things we did well

– Maintained focus on problems, not technologies

– Designed well enough that there were options for solving problems when they arose

• Things we could have done better

– Don’t combine a lot of scaling with a lot of re-architecting (?)

o Don’t change too many things at once

– Anticipate second-order problems (120M phone numbers on a single Postgres installation was never going to work)

– Don’t believe that throwing 200 x at any problem will work, for any x

– Pay more attention from the beginning to standard problems in distributed processing (hung remote processes, long locks, race conditions)

– Know more about Postgres
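One of the standard distributed-processing problems named above, the hung process, has a standard first defense: never wait forever. The sketch below shows the idea with a local child process standing in for a remote worker. The function name and the timeout values are illustrative assumptions.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return its exit code, or None if it hung past the timeout.

    subprocess.run kills the child process when the timeout expires, so a
    hung worker does not linger and hold locks or shared resources.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # A well-behaved task finishes and reports exit code 0.
    print(run_with_timeout([sys.executable, "-c", "print('done')"], 30))
    # A task that sleeps far longer than the timeout is treated as hung.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

A caller that gets None back can retry, reassign the work, or flag the worker as a bad actor instead of blocking the whole pipeline.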

Page 14: "Scalability for Big Data"   |    Dr. Steve Hanks, Principal Data Scientist


Conclusion

• Scaling really does suck

• It really isn’t primarily about the technology

• You need

– Understanding of and focus on the problem, not on solution technologies

– At the same time, knowledge of the possible technology tools

– To be diligent about standard best practices in design and engineering