migrating big data - o'reilly mediaassets.en.oreilly.com/1/event/61/moving day_...

Moving Day: Migrating your Big Data from A to B Justin Dow [email protected] Corey Shields [email protected] Laura Thomson [email protected]


Post on 28-Jul-2018



Moving Day: Migrating your Big Data from A to B

Justin Dow

Corey Shields

Laura Thomson


Overview

• What is Socorro?

• Rationale for migration

• Build out

• Automation

• Smoke testing

• Troubleshooting

• Moving Day

• Aftermath


What is Socorro?


“Socorro has a lot of moving parts”

...

“I prefer to think of them as dancing parts”


A different type of scaling:

• Typical webapp: scale to millions of users without degradation of response time

• Socorro: less than a hundred users, terabytes of data.


Basic law of scale still applies:

The bigger you get, the more spectacularly you fail


Some numbers

• At peak we receive 3000 crashes per minute

• 3 million per day

• Median crash size 150k

• ~40TB stored in HBase and growing every day
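As a rough back-of-the-envelope check on these figures, the sketch below multiplies them out. It treats the median crash size as if it were the average and reads "150k" as 150 kilobytes; both are assumptions, so the results are order-of-magnitude estimates only.

```python
# Rough sizing from the numbers on this slide.
# Assumptions: "150k" means 150 kilobytes (decimal), and the median
# crash size stands in for the mean.
PEAK_CRASHES_PER_MINUTE = 3000
CRASHES_PER_DAY = 3_000_000
CRASH_SIZE_BYTES = 150 * 1000

peak_crashes_per_second = PEAK_CRASHES_PER_MINUTE / 60
peak_bytes_per_second = peak_crashes_per_second * CRASH_SIZE_BYTES
bytes_per_day = CRASHES_PER_DAY * CRASH_SIZE_BYTES

print(f"peak rate:    {peak_crashes_per_second:.0f} crashes/s")
print(f"peak ingest:  {peak_bytes_per_second / 1e6:.1f} MB/s")
print(f"daily ingest: {bytes_per_day / 1e9:.0f} GB/day")
```

Under those assumptions, incoming raw crash data is on the order of hundreds of gigabytes per day, which is why moving the data is the hard part of the migration.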


What can we do?

• Does betaN have more (null signature) crashes than other betas?

• Analyze differences between Flash versions x and y crashes

• Detect duplicate crashes

• Detect explosive crashes

• Find “frankeninstalls”

• Email victims of a malware-related crash


Rationale and planning


Why?

• Initial reason for move was stability: lots of downtime, partly due to approaching capacity limits

• Needed more infrastructure, which meant new data center

• Decided if we were going to do it over...we’d do it right


Fragility

• Ongoing stability problems with HBase, and when it went down, everything went with it

• Releases were nightmares, requiring manual upgrades of multiple boxes, editing of config files, and manual QA

• Troubleshooting done via remote (awful)


Moving data

• Setting up new architecture and new machines is one thing

• Moving such a large amount of data...with no downtime...is harder

• The no-downtime requirement turned out to apply to collection, so we changed the code to a pluggable storage architecture: spool to disk when HBase is unavailable

• PostgreSQL: data was easy to move

• HBase: originally intended to use distcp, but ran into trouble and ended up using a dirty copy tool authored in-house
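The pluggable storage idea with a disk spool can be sketched roughly as follows. This is a minimal illustration, not Socorro's actual code: every class and method name here (StorageUnavailable, DiskSpoolStorage, FallbackStorage, DownPrimary) is hypothetical, and a real collector would also need a job that replays spooled crashes once the primary store is back.

```python
import json
import os
import tempfile


class StorageUnavailable(Exception):
    """Raised by a backend that cannot accept writes right now."""


class DiskSpoolStorage:
    """Writes each crash to its own file in a local spool directory."""

    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        os.makedirs(spool_dir, exist_ok=True)

    def save(self, crash_id, crash):
        path = os.path.join(self.spool_dir, crash_id + ".json")
        with open(path, "w") as handle:
            json.dump(crash, handle)


class FallbackStorage:
    """Tries the primary backend first and spools to disk if it is down."""

    def __init__(self, primary, spool):
        self.primary = primary
        self.spool = spool

    def save(self, crash_id, crash):
        try:
            self.primary.save(crash_id, crash)
            return "primary"
        except StorageUnavailable:
            self.spool.save(crash_id, crash)
            return "spooled"


class DownPrimary:
    """Stand-in for an unreachable HBase; it always refuses writes."""

    def save(self, crash_id, crash):
        raise StorageUnavailable("primary store is down")


spool_dir = tempfile.mkdtemp()
storage = FallbackStorage(DownPrimary(), DiskSpoolStorage(spool_dir))
result = storage.save("crash-0001", {"signature": "example"})
print(result, os.listdir(spool_dir))
```

Because the collector only talks to the `save` interface, the primary store can be taken offline for the move while collection keeps accepting crashes.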


Planning tools

• Bugzilla for tasks

• Pre-flight checklist and in-flight checklist to track tasks

• Read Atul Gawande’s The Checklist Manifesto

• Rollback plan

• Failure scenarios

• Rehearsals


Build out


What was wrong?

• legacy hardware

• improperly managed code

• each server was different

• no configuration management

• shared resources with other webapps

• vital daemons were started with “nohup ./startDaemon &”

• insufficient monitoring

• one sysadmin. rest of team and developers had no insight into production

• no automated testing


Automation


Configuration Management

• new rule: if it wasn’t checked in and managed by Puppet, it wasn’t going on the new servers

• no local configuration/installation of anything

• daemons got init scripts and proper Nagios plugins

• application configuration is done centrally in one place

• staging application matches production
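To make the "proper Nagios plugins" point concrete, here is a minimal sketch of a plugin-style check for a daemon. It is an illustration only: the pid-file approach and the function name are assumptions, not the checks the team wrote. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention. The `os.kill(pid, 0)` probe is POSIX-only and must not be used on Windows.

```python
import os
import tempfile

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_daemon(pid_file):
    """Return (exit_code, message) for the daemon described by pid_file."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return CRITICAL, f"CRITICAL - pid file {pid_file} is missing"
    except ValueError:
        return UNKNOWN, f"UNKNOWN - pid file {pid_file} does not hold a number"

    try:
        os.kill(pid, 0)  # POSIX: signal 0 only tests whether the pid exists
    except ProcessLookupError:
        return CRITICAL, f"CRITICAL - no process with pid {pid}"
    except PermissionError:
        pass  # the process exists but belongs to another user
    return OK, f"OK - daemon running with pid {pid}"


# Demonstration: write this process's own pid to a temporary pid file.
with tempfile.NamedTemporaryFile("w", suffix=".pid", delete=False) as handle:
    handle.write(str(os.getpid()))
code, message = check_daemon(handle.name)
print(message)
```

A real plugin would print the message and finish with `sys.exit(code)` so that Nagios can read the status from the exit code.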


Packages for production

• 3rd party libraries and packages are pulled in upstream

• IT doesn’t need to know/care how a developer develops. What goes into production is a tested, polished package.

• packages for production are built and tested by Jenkins the same way every time.

• local patches aren’t allowed. A patch to production means a patch to the source upstream, a patch to stage and a proper rollout to production

• Every package is fully tested in a staging environment


Smoke testing


Load Testing

• Used a small portion (40 nodes) of a 512-node Seamicro cluster

• Simulating real traffic by submitting crashes from a test farm

• Testing the entire system as a whole, under real production load
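Submitting crashes at a production-like rate can be sketched as a paced replay loop. This is a simplified, single-threaded illustration, not the test farm's actual tooling: `replay_crashes`, its parameters, and the `FakeClock` helper are all hypothetical, and `submit` stands in for whatever call posts a crash to the collector.

```python
import time


def replay_crashes(crashes, submit, per_minute,
                   clock=time.monotonic, sleep=time.sleep):
    """Submit crashes one at a time, paced to roughly per_minute."""
    interval = 60.0 / per_minute
    start = clock()
    sent = 0
    for index, crash in enumerate(crashes):
        # Schedule against the start time so small delays do not accumulate.
        delay = start + index * interval - clock()
        if delay > 0:
            sleep(delay)
        submit(crash)
        sent += 1
    return sent


class FakeClock:
    """Simulated clock so the demonstration finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
received = []
sent = replay_crashes(
    [{"id": n} for n in range(3000)],
    submit=received.append,  # stand-in for posting to the collector
    per_minute=3000,         # the peak rate quoted earlier in the talk
    clock=fake.time,
    sleep=fake.sleep,
)
print(sent, "crashes submitted in", round(fake.now, 2), "simulated seconds")
```

With the default `clock` and `sleep` arguments, the same function paces real submissions; a test farm would run many such senders in parallel to reach full production load.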


Troubleshooting


The Bleeding Edge... bleeds!

• Moving HBase data over, issues with distcp

• New data center, new load balancers, new challenges

• New hardware = new challenges with HBase

• Discovering a misconfigured kickstart script, from which all servers were built.

• Using Puppet to correct the error

• Testing various configurations


Moving Day


• Flew the team in

• Migration day checklist: http://tinyurl.com/migrationday

• Went remarkably smoothly due largely to good co-operation between teams


Aftermath


• Before moving day, every possible failure scenario was discussed and planned for

• Most failure scenarios actually happened. The new cluster failed gracefully with 0 data loss.

• New data that came in during the migration had to be dealt with.

• Problems with the new HBase cluster, plus various other tweaks.

• Postmortem to learn what we did right and wrong


It’s all open

Get the code (moving to github in 3-4 weeks):

http://code.google.com/p/socorro

Read/file/fix bugs:

https://bugzilla.mozilla.org/

Call in for the weekly meetings:

https://wiki.mozilla.org/Breakpad/Status_Meetings

Join us in IRC:

irc.mozilla.org #breakpad


Questions?


We’re Hiring!!

• Help us work to improve and protect the open web

• Offices worldwide, work from home possible for many jobs

• careers.mozilla.com

• System Administrators (Web Ops, VCS, LDAP, etc)

• Site Reliability Engineers, QA Engineers

• Developers, developers, developers, developers!!