Become Efficient or Die: The Story of BackType

Nathan Marz (@nathanmarz)



BackType

BackType helps businesses understand social media and make use of it

BackType

Data Services (APIs)

Social Media Analytics Dashboard

APIs

• Conversational graph for URL

• Comment search

• #Tweets / URL

• Influence scores

• Top sites

• Trending links stream

• etc.

URL Profiles

Site comparisons

Influencer Profiles

Twitter Account Analytics

Topic Analysis


BackType stats

• >30 TB of data

• 100 to 200 machine cluster

• Process 100M messages per day

• Serve 300 req/sec

BackType stats

• 3 full time employees

• 2 interns

• $1.4M in funding

How?

Avoid waste

Invest in efficiency

Development philosophy

• Waterfall

• Agile

• Scrum

• Kanban

Development philosophy

Suffering-oriented programming

Suffering-oriented Programming

Don’t add process until you feel the pain of not having it

Example

• Growing from 2 people to 3 people

Example

• Founders were essentially “one brain”

• Cowboy coding led to communication mishaps

• Added biweekly meeting to sync up

Example

• Growing from 3 people to 5 people

Example

• Moving through tasks a lot faster with 5 people

• Needed more frequent prioritization of tasks

• Changed biweekly meeting to weekly meeting

• Added “chat room standups” to facilitate mid-week adjustments

Suffering-oriented Programming

Don’t build new technology until you feel the pain of not having it

Suffering-oriented Programming

First, make it possible. Then, make it beautiful.

Then, make it fast.

Example

• Batch processing

Make it possible

• Hack things out using MapReduce/Cascading

• Learn the ins and outs of batch processing

Make it beautiful

• Wrote (and open-sourced) Cascalog

• The “perfect interface” to our data

Make it fast

• Use it in production

• Profile and identify bottlenecks

• Optimize

Overengineering

Attempting to create beautiful software without a thorough understanding of the problem domain

Premature optimization

Optimizing before creating a “beautiful” design, introducing unnecessary complexity

Knowledge debt

(Chart: your productivity vs. your potential over time; the widening gap between them is labeled “knowledge debt”)

Knowledge debt

Use small, independent projects to experiment with new technology

Example

• Needed to write a small server to collect records into a Distributed Filesystem

• Wrote it using Clojure programming language

• Huge win: now we use Clojure for most of our systems

Example

• Needed to implement social search

• Wrote it using Neo4j

• Ran into a lot of problems with Neo4j and rewrote it later using Sphinx

Example

• Needed an automated deploy for a distributed stream processing system

• Wrote it using Pallet

• Massive win: anticipate a dramatic reduction in complexity in administering infrastructure

Example

Knowledge debt

(Crappy job ad)

Knowledge debt

Instead of hiring people who share your skill set, hire people with completely different skill sets

(food for thought)

Technical debt

Technical debt builds up in a codebase

Technical debt

• W needs to be refactored

• X deploy should be faster

• Y needs more unit tests

• Z needs more documentation

Technical debt

Never high enough priority to work on, but these issues build up and slow you down

BackSweep

• Issues are recorded on a wiki page

• We spend one day a month removing items from that wiki page

BackSweep

• Keeps our codebase lean

• Gives us a way to defer technical debt issues when we don’t have time to deal with them

• “Garbage collection for the codebase”

What is a startup?

A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty.

- Eric Ries

How do you decide what to work on?

Don’t want to waste three months building a feature no one cares about

This could be fatal!

Product development

(Loop: Form hypothesis → Test hypothesis → Learn. Valid? Keep. Invalid? Discard.)


Example

Pro product didn’t actually exist yet

Example

• We tested different feature combinations and measured click-through rate

• Clicking on “sign up” went to a survey page


Hypothesis #1

Customers want analytics on topics being discussed on Twitter

Testing hypothesis #1

• Fake feature -> clicking on topic goes to survey page

Testing hypothesis #1

• Do people click on those links?

• If not, need to reconsider hypothesis

Hypothesis #2

Customers want to know how often topics are mentioned over time

Testing hypothesis #2

• Build topic mentions over time graph for “big topics” our private beta customers are interested in (e.g. “nike”, “microsoft”, “apple”, “kodak”)

• Talk to customers

Hypothesis #3

• Customers want to see who’s talking about a topic on a variety of dimensions: recency, influence, num followers, or num retweets

Testing hypothesis #3

• Create search index on last 24 hours of data that can sort on all dimensions
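As a sketch of what such an index makes possible (illustrative data and field names, not BackType’s actual schema), the same set of mentions can be sorted on any of the dimensions named above:

```python
# Hypothetical sketch (not BackType's actual schema): an in-memory index
# over the last 24 hours of mentions, sortable on any dimension.
mentions = [
    {"user": "a", "ts": 100, "influence": 0.9, "followers": 50,  "retweets": 3},
    {"user": "b", "ts": 300, "influence": 0.2, "followers": 900, "retweets": 7},
    {"user": "c", "ts": 200, "influence": 0.7, "followers": 10,  "retweets": 1},
]

def top_mentions(dimension, limit=10):
    """Return mentions sorted descending on the chosen dimension."""
    return sorted(mentions, key=lambda m: m[dimension], reverse=True)[:limit]

assert [m["user"] for m in top_mentions("ts")] == ["b", "c", "a"]         # recency
assert [m["user"] for m in top_mentions("influence")] == ["a", "c", "b"]  # influence
```

A real index (e.g. Sphinx, as in the earlier example) would do the same thing over far more data, but the query shape is the same.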

Lean Startup

Questions?

Twitter: @nathanmarz

Email: [email protected]

Web: http://nathanmarz.com

The Secrets of Building Realtime Big Data Systems

Nathan Marz (@nathanmarz)

Who am I?


(Upcoming book)

BackType

• >30 TB of data

• Process 100M messages / day

• Serve 300 requests / sec

• 100 to 200 machine cluster

• 3 full-time employees, 2 interns

Built on open-source

Thrift

Cascading

Scribe

ZeroMQ

Zookeeper

Pallet

What is a data system?

Raw data → View 1, View 2, View 3

What is a data system?

Tweets → # Tweets / URL, Influence scores, Trending topics

Everything else (schemas, databases, indexing, etc.) is implementation

Essential properties of a data system

1. Robust to machine failure and human error

2. Low latency reads and updates

3. Scalable

4. General

5. Extensible

6. Allows ad-hoc analysis

7. Minimal maintenance

8. Debuggable

Batch Layer

Speed Layer

Layered Architecture

Let’s pretend temporarily that update latency doesn’t matter

Let’s pretend it’s OK for a view to lag by a few hours

Batch layer

• Arbitrary computation

• Horizontally scalable

• High latency

Batch layer

Not the end-all-be-all of batch computation, but the most general

Hadoop

Input files (Distributed Filesystem) → MapReduce → Output files (Distributed Filesystem)

Hadoop

• Express your computation in terms of MapReduce

• Get parallelism and scalability “for free”
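A toy illustration of what “express your computation in terms of MapReduce” means. This simulates the two phases in plain Python on one machine; Hadoop runs the same shape of computation distributed across the cluster (the tweet data here is made up):

```python
from collections import defaultdict

# Minimal single-machine simulation (not Hadoop itself) of a MapReduce
# computation: count tweets per URL, one of the batch views named earlier.
def map_phase(records, mapper):
    """Apply the mapper to every record and group emitted values by key."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Apply the reducer to each key's list of values."""
    return {key: reducer(values) for key, values in grouped.items()}

tweets = [{"url": "a.com"}, {"url": "b.com"}, {"url": "a.com"}]
counts = reduce_phase(map_phase(tweets, lambda t: [(t["url"], 1)]), sum)
assert counts == {"a.com": 2, "b.com": 1}
```

Once a computation fits this mapper/reducer shape, Hadoop can shard the map phase across input splits and the reduce phase across keys, which is where the “free” parallelism comes from.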

Batch layer

• Store master copy of dataset

• Master dataset is append-only

Batch layer

view = fn(master dataset)
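A minimal sketch of this equation, using an illustrative “reactions received” view (not BackType code). The point is that the view is a pure function of the append-only master dataset, so rerunning it over the grown (or corrected) dataset always regenerates it:

```python
# Sketch of "view = fn(master dataset)": a view is a pure function
# recomputed over the entire append-only master dataset (illustrative
# facts and view, not BackType's actual schema).
def influence_view(master_dataset):
    """Illustrative view: number of reactions received per user."""
    scores = {}
    for fact in master_dataset:
        scores[fact["to"]] = scores.get(fact["to"], 0) + 1
    return scores

master = [{"from": "bob", "to": "alice"}, {"from": "gary", "to": "alice"}]
assert influence_view(master) == {"alice": 2}

master.append({"from": "alice", "to": "bob"})            # append-only update
assert influence_view(master) == {"alice": 2, "bob": 1}  # full recompute
```

Because nothing is ever mutated in place, recovering from human error (e.g. bad records) is just deleting the bad facts and recomputing the view.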

Batch layer

Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3

Batch layer

• In practice, it’s too expensive to fully recompute each view on every update

• A production batch workflow adds the minimum amount of incrementalization necessary for performance
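One way to picture minimal incrementalization (an illustrative sketch, not the actual workflow): fold only the new-data partition into the existing view, and require that the result agree with a full recompute over all data:

```python
# Sketch of minimal incrementalization: instead of recomputing the view
# over all data, fold only the new-data partition into the existing view.
# (Illustrative count-per-URL view; field names are made up.)
def full_recompute(all_records):
    view = {}
    for r in all_records:
        view[r["url"]] = view.get(r["url"], 0) + 1
    return view

def incremental_update(view, new_records):
    view = dict(view)  # leave the previous view untouched
    for r in new_records:
        view[r["url"]] = view.get(r["url"], 0) + 1
    return view

old = [{"url": "a.com"}] * 3
new = [{"url": "a.com"}, {"url": "b.com"}]
# Incrementally updated view must equal the fully recomputed one.
assert incremental_update(full_recompute(old), new) == full_recompute(old + new)
```

The full recompute stays available as the source of truth; the incremental path is only an optimization layered on top of it.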

Incremental batch layer

New data → Append → All data → Batch workflow (view maintenance) → Batch View 1, Batch View 2, Batch View 3 → Query

Batch layer

Robust and fault-tolerant to both machine and human error.

Low latency reads.

Low latency updates.

Scalable to increases in data or traffic.

Extensible to support new features or related services.

Generalizes to diverse types of data and requests.

Allows ad hoc queries.

Minimal maintenance.

Debuggable: can trace how any value in the system came to be.

Speed layer

Compensate for high latency of updates to batch layer

Speed layer

Key point: Only needs to compensate for data not yet absorbed in batch layer

Hours of data instead of years of data

Application-level Queries

Query batch layer + Query speed layer → Merge
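A sketch of the merge, with made-up counts: the batch view answers for everything up to the last batch run, and the realtime view covers only the hours of data since:

```python
# Sketch of the merge step in an application-level query: combine the
# batch view (hours old) with the speed layer's realtime view (recent
# data only). Counts are illustrative.
batch_view = {"a.com": 1000, "b.com": 40}  # as of the last batch run
realtime_view = {"a.com": 7, "c.com": 2}   # data not yet absorbed by batch

def query(url):
    """Answer a query by merging the two layers' views."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

assert query("a.com") == 1007  # both layers contribute
assert query("c.com") == 2     # so far only seen by the speed layer
```

For additive views like counts the merge is a sum; other views need their own merge function, but the batch/realtime split stays the same.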

Speed layer

Once data is absorbed into batch layer, can discard speed layer results

Speed layer

• Message passing

• Incremental algorithms

• Read/Write databases

• Riak

• Cassandra

• HBase

• etc.

Speed layer

Significantly more complex than the batch layer

Speed layer

But the batch layer eventually overrides the speed layer

Speed layer

So that complexity is transient

Flexibility in layered architecture

• Do slow and accurate algorithm in batch layer

• Do fast but approximate algorithm in speed layer

• “Eventual accuracy”
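A deliberately tiny illustration of eventual accuracy (real speed-layer approximations would be things like sketch-based cardinality estimates): the batch layer deduplicates for an exact unique count, while the speed layer just counts increments and may overcount until the batch layer overrides it:

```python
# Illustrative "eventual accuracy" sketch with made-up visit data:
# the batch layer runs the slow, exact algorithm; the speed layer runs
# a cheap approximation that is corrected on the next batch run.
visits = ["alice", "bob", "alice", "gary", "bob"]

exact_uniques = len(set(visits))  # batch layer: deduplicated, exact
approx_uniques = len(visits)      # speed layer: raw increments, no dedup

assert exact_uniques == 3
assert approx_uniques >= exact_uniques  # fast answer is an upper bound
```

The approximation's error never accumulates: once the batch layer absorbs these visits, its exact result replaces the speed layer's estimate.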

Data model

Every record is a single, discrete fact at a moment in time

Data model

• Alice lives in San Francisco as of time 12345

• Bob and Gary are friends as of time 13723

• Alice lives in New York as of time 19827

Data model

• Remember: master dataset is append-only

• A person can have multiple location records

• “Current location” is a view on this data: pick location with most recent timestamp
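The location example above can be sketched directly (`current_location` is a hypothetical helper, not BackType code):

```python
# Sketch of the slides' location example: every fact stays in the
# append-only master dataset; "current location" is a view that picks
# the location record with the most recent timestamp.
facts = [
    {"person": "alice", "location": "San Francisco", "ts": 12345},
    {"person": "bob", "friend": "gary", "ts": 13723},
    {"person": "alice", "location": "New York", "ts": 19827},
]

def current_location(person):
    """View over the facts: latest location, or None if none recorded."""
    locs = [f for f in facts if f.get("person") == person and "location" in f]
    return max(locs, key=lambda f: f["ts"])["location"] if locs else None

assert current_location("alice") == "New York"
```

Note that Alice's San Francisco record is never deleted; the view simply ignores it, which is what keeps the full history available for analytics and error recovery.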

Data model

• Extremely useful having the full history for each entity

• Doing analytics

• Recovering from mistakes (like writing bad data)

Data model

(Graph: Alice [Gender: female] and Bob each connect to their tweet via a Reactor edge. Tweet 123 [Content: “RT @bob Data is fun!”, Reshare: true] connects via a Reaction edge to Tweet 456 [Content: “Data is fun!”].)

Questions?

Twitter: @nathanmarz

Email: [email protected]

Web: http://nathanmarz.com