backtype-efficiency-and-big-data-systems
DESCRIPTION
http://posscon.org/assets/Uploads/backtype-efficiency-and-big-data-systems.pdfTRANSCRIPT
APIs• Conversational graph for url
• Comment search
• #Tweets / URL
• Influence scores
• Top sites
• Trending links stream
• etc.
BackType stats
• >30 TB of data
• 100 to 200 machine cluster
• Process 100M messages per day
• Serve 300 req/sec
Example
• Founders were essentially “one brain”
• Cowboy coding led to communication mishaps
• Added biweekly meeting to sync up
Example• Moving through tasks a lot faster with 5
people
• Needed more frequent prioritization of tasks
• Changed biweekly meeting to weekly meeting
• Added “chat room standups” to facilitate mid-week adjustments
Make it possible
• Hack things out using MapReduce/Cascading
• Learn the ins and outs of batch processing
Overengineering
Attempting to create beautiful software without a thorough understanding of problem domain
Premature optimization
Optimizing before creating “beautiful” design, creating unnecessary complexity
• Needed to write a small server to collect records into a Distributed Filesystem
• Wrote it using Clojure programming language
• Huge win: now we use Clojure for most of our systems
Example
• Needed to implement social search
• Wrote it using Neo4j
• Ran into lot of problems with Neo4j and rewrote it later using Sphinx
Example
• Needed an automated deploy for a distributed stream processing system
• Wrote it using Pallet
• Massive win: anticipate dramatic reduction in complexity in administering infrastructure
Example
Knowledge debt
Instead of hiring people who share your skill set, hire people with completely different skill sets
(food for thought)
Technical debt
• W needs to be refactored
• X deploy should be faster
• Y needs more unit tests
• Z needs more documentation
BackSweep
• Issues are recorded on a wiki page
• We spend one day a month removing items from that wiki page
BackSweep
• Keeps our codebase lean
• Gives us a way to defer technical debt issues when don’t have time to deal with them
• “Garbage collection for the codebase”
What is a startup?
A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty.
- Eric Ries
Example
• We tested different feature combinations and measured click through rate
• Clicking on “sign up” went to a survey page
Testing hypothesis #2
• Build topic mentions over time graph for “big topics” our private beta customers are interested in (e.g. “nike”, “microsoft”, “apple”, “kodak”)
• Talk to customers
Hypothesis #3
• Customers want to see who’s talking about a topic on a variety of dimensions: recency, influence, num followers, or num retweets
Testing hypothesis #3
• Create search index on last 24 hours of data that can sort on all dimensions
BackType
• >30 TB of data
• Process 100M messages / day
• Serve 300 requests / sec
• 100 to 200 machine cluster
• 3 full-time employees, 2 interns
Hadoop
Input files
Input files
Input files
Distributed Filesystem
MapReduce
Output files
Output files
Output files
Distributed Filesystem
Hadoop
• Express your computation in terms of MapReduce
• Get parallelism and scalability “for free”
Batch layer
• In practice, too expensive to fully recompute each view to get updates
• A production batch workflow adds minimum amount of incrementalization necessary for performance
Incremental batch layer
All data
New data
Batch View 1
Batch View 2
Batch View 3
Batch workflow
Append
View maintenance
Query
Batch layerRobust and fault-tolerant to both machine and human error. Low latency reads.
Low latency updates.
Scalable to increases in data or traffic.
Extensible to support new features or related services.
Generalizes to diverse types of data and requests.
Allows ad hoc queries.
Minimal maintenance.Debuggable: can trace how any value in the system came to be.
Speed layer
Key point: Only needs to compensate for data not yet absorbed in batch layer
Hours of data instead of years of data
Speed layer
• Message passing
• Incremental algorithms
• Read/Write databases
• Riak
• Cassandra
• HBase
• etc.
Flexibility in layered architecture
• Do slow and accurate algorithm in batch layer
• Do fast but approximate algorithm in speed layer
• “Eventual accuracy”
Data model
• Alice lives in San Francisco as of time 12345
• Bob and Gary are friends as of time 13723
• Alice lives in New York as of time 19827
Data model
• Remember: master dataset is append-only
• A person can have multiple location records
• “Current location” is a view on this data: pick location with most recent timestamp
Data model
• Extremely useful having the full history for each entity
• Doing analytics
• Recovering from mistakes (like writing bad data)
Data model
AliceBob
Tweet: 123
Tweet: 456
Reactor ReactorReaction
Content: RT @bob Data is fun!
Content: Data is fun!
Reshare: true
Property
PropertyProperty
Gender: female
Property