Post on 11-May-2015

26.711 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter

Hadoop, Pig, HBase at Twitter

Dmitriy Ryaboy, Twitter Analytics, @squarecog

Who is this guy, anyway

‣ LBNL: Genome alignment & analysis

‣ Ask.com: Click log data warehousing

‣ CMU: MS in “Very Large Information Systems”

‣ Cloudera: graduate student intern

‣ Twitter: Hadoop, Pig, Big Data, ...

‣ Pig committer.

In This Talk

‣ Focus on Hadoop parts of data pipeline

‣ Data movement

‣ HBase

‣ Pig

‣ A few tips

Not In This Talk

‣ Cassandra

‣ FlockDB

‣ Gizzard

‣ Memcached

‣ Rest of Twitter’s NoSQL Bingo card

Daily workload

‣ 1000s of Front End machines

‣ 3 Billion API requests

‣ 7 TB of ingested data

‣ 20,000 Hadoop jobs

‣ 55 Million tweets

‣ Tweets only 0.5% of data

Twitter data pipeline (simplified)

‣ Front Ends update DB cluster. Scheduled DB exports to HDFS

‣ Front Ends, Middleware, Backend services write logs

‣ Scribe pipes logs straight into HDFS

‣ Various other data source exports into HDFS

‣ Daemons populate work queues as new data shows up

‣ Daemons (and cron) pull work off queues, schedule MR and Pig jobs

‣ Pig wrapper pushes results into MySQL for reports and dashboards

Logs

‣ Apache HTTP, W3C, JSON and Protocol Buffers

‣ Each category goes into its own directory on HDFS

‣ Everything is LZO compressed.

‣ You need to index LZO files to make them splittable.

‣ We use a patched version of Hadoop LZO libraries

‣ See http://github.com/kevinweil/hadoop-lzo

Tables

‣ Users, tweets, geotags, trends, registered devices, etc.

‣ Automatic generation of protocol buffer definitions from SQL tables

‣ Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers

‣ See Elephant-Bird: http://github.com/kevinweil/elephant-bird

ETL

‣ "Crane", config driven, protocol buffer powered.

‣ Sources/Sinks: HDFS, HBase, MySQL tables, web services

‣ Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
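
The chaining idea can be sketched roughly as follows. Crane's real configuration format and classes are not shown in the slides, so everything here is hypothetical: plain dataclasses stand in for protocol buffer messages, and `run_chain` is an illustrative name. The sketch only shows how a list of (input type, output type, transformation) triples can be applied in order, with a type check at each step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UserRow:  # hypothetical stand-in for an input protocol buffer
    id: int
    screen_name: str


@dataclass
class UserSummary:  # hypothetical stand-in for an output protocol buffer
    id: int
    name_length: int


# One step: (input type, output type, transformation callable)
Step = Tuple[type, type, Callable]


def run_chain(steps: List[Step], record):
    """Apply each transformation in order, checking types on the way in and out."""
    for input_type, output_type, transform in steps:
        if not isinstance(record, input_type):
            raise TypeError(f"expected {input_type.__name__}, got {type(record).__name__}")
        record = transform(record)
        if not isinstance(record, output_type):
            raise TypeError(f"expected {output_type.__name__}, got {type(record).__name__}")
    return record


steps = [
    (UserRow, UserSummary, lambda u: UserSummary(u.id, len(u.screen_name))),
    (UserSummary, dict, lambda s: {"id": s.id, "name_length": s.name_length}),
]
result = run_chain(steps, UserRow(7, "squarecog"))
```

Because each step declares its input and output types, a misconfigured chain fails at the first mismatched step rather than producing bad output downstream.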

HBase

Mutability

‣ Logs are immutable; HDFS is great.

‣ Tables have mutable data.

‣ Ignore updates? Bad data.

‣ Pull updates, resolve at read time? Pain, time.

‣ Pull updates, resolve in batches? Pain, time.

‣ Let someone else do the resolving? Helloooo, HBase!

‣ Bonus: various NoSQL bonuses, "not just scans". Lookups, indexes.

‣ Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.

‣ That being said, several services rely on HBase already.

Aren't you guys Cassandra poster boys?

‣ YES, but

‣ Rough analogy: Cassandra is OLTP and HBase is OLAP

‣ Cassandra is used when we need low-latency, single-key reads and writes

‣ HBase scans are much more powerful

‣ HBase co-locates data on the Hadoop cluster.

HBase schema for MySQL exports, v1.

‣ Want to query by created_at range, by updated_at range, and / or by user_id.

‣ Key: [created_at, id]

‣ CF: "columns"

‣ Configs specify which columns to pull out and store explicitly.

‣ Useful for indexing, cheap (HBase-side) filtering

‣ CF: "protobuf"

‣ A single column, contains serialized protocol buffer.
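
A minimal sketch of a composite [created_at, id] row key. The slides do not give the byte encoding, so this assumes both parts are packed as fixed-width big-endian unsigned 64-bit integers, which makes byte-wise (lexicographic) key order match numeric order. The function names are hypothetical.

```python
import struct


def make_row_key(created_at: int, record_id: int) -> bytes:
    """Pack [created_at, id] as two big-endian unsigned 64-bit integers."""
    return struct.pack(">QQ", created_at, record_id)


def created_at_range(start_ts: int, stop_ts: int) -> tuple:
    """Return (start_row, stop_row) covering start_ts <= created_at < stop_ts."""
    return make_row_key(start_ts, 0), make_row_key(stop_ts, 0)


# Keys sort by created_at first, then by id.
keys = sorted(make_row_key(ts, i) for ts, i in [(200, 1), (100, 9), (100, 2)])

# A created_at range query becomes a contiguous slice of the key space.
start, stop = created_at_range(100, 200)
in_range = [k for k in keys if start <= k < stop]
```

With this layout a time range maps to one contiguous scan, and keys for new records always sort after all existing keys.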

HBase schema v1, cont.

‣ Pro: easy to query by created_at range

‣ Con: hard to pull out specific users (requires a full scan)

‣ Con: hot spot at the last region for writes

‣ Idea: put created_at into 'columns' CF, make user_id key

‣ BUT ids mostly sequential; still a hot spot at the end of the table

‣ Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem.

HBase schema, v2.

‣ Key: inverted Id. Bottom bits are random. Ahh, finally, distribution.

‣ Date range queries: new CF, 'time'

‣ keep all versions of this CF

‣ When specific time range needed, use index on the time column

‣ Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record
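
A rough sketch of why an inverted id spreads writes. The slides do not say exactly how the id is inverted; this assumes a 64-bit id whose bit order is reversed, so the fast-changing low bits become the leading bits of the row key. `invert_id` and `row_key` are illustrative names.

```python
import struct


def invert_id(record_id: int, bits: int = 64) -> int:
    """Reverse the bit order so the low (fast-changing) bits come first."""
    if not 0 <= record_id < (1 << bits):
        raise ValueError("record_id does not fit in the given width")
    return int(format(record_id, f"0{bits}b")[::-1], 2)


def row_key(record_id: int) -> bytes:
    """Encode the inverted id as a big-endian 8-byte row key."""
    return struct.pack(">Q", invert_id(record_id))


# Consecutive ids now start with very different leading bytes,
# so they fall into different parts of the key space.
leading_bytes = [row_key(i)[0] for i in range(8)]
```

The inversion is its own inverse, so the original id can be recovered from the key by applying `invert_id` again.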

Pig

Why Pig?

‣ Much faster to write than vanilla MR

‣ Step-by-step iterative expression of data flows intuitive to programmers

‣ SQL support coming for those who prefer SQL (PIG-824)

‣ Trivial to write UDFs

‣ Easy to write Loaders (Even better with 0.7!)

‣ For example, we can write Protobuf and HBase loaders...

‣ Both in Elephant-Bird

HBase Loader enhancements

‣ Data expected to be binary, not String representations

‣ Push down key range filters

‣ Specify row caching (memory / speed tradeoff)

‣ Optionally load the key

‣ Optionally limit rows per region

‣ Report progress

‣ Haven't observed significant overhead vs. HBase scanning

HBase Loader TODOs

‣ Expose better control of filters

‣ Expose timestamp controls

‣ Expose Index hints (IHBase)

‣ Automated filter and projection push-down (once on 0.7)

‣ HBase Storage

Elephant Bird

‣ Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers

‣ Starting to work on same for Thrift

‣ HBase Loader

‣ assorted UDFs

‣ http://www.github.com/kevinweil/elephant-bird

Assorted Tips

Bad records kill jobs

‣ Big data is messy.

‣ Catch exceptions, increment counter, return null

‣ Deal with potential nulls

‣ Far preferable to a single bad record bringing down the whole job
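
The catch, count, return-null pattern might look like the sketch below. It is written in plain Python rather than as a real Pig UDF or Hadoop mapper: a `Counter` stands in for job counters, JSON stands in for the record format, and `safe_parse` is a hypothetical name.

```python
import json
from collections import Counter

counters = Counter()  # stand-in for Hadoop job counters


def safe_parse(line: str):
    """Parse one JSON log line; return None instead of raising on bad input."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        counters["unparsable"] += 1
        return None
    counters["parsed"] += 1
    return record


lines = ['{"user_id": 1}', "not json", '{"user_id": 2}']

# Downstream code has to deal with the nulls, here by filtering them out.
records = [r for r in (safe_parse(line) for line in lines) if r is not None]
```

The job keeps running past the bad line, and the counter still records that something was dropped.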

Runaway UDFs kill jobs

‣ Regex over a few billion tweets, most return in milliseconds.

‣ 8 cause the regex to take more than 5 minutes, task gets reaped.

‣ You clever twitterers, you.

‣ MonitoredUDF wrapper kicks off a monitoring thread, kills a UDF and returns a default value if it doesn't return something in time.

‣ Plan to contribute to Pig, add to ElephantBird. May build into Pig internals.
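
The monitoring-thread idea can be approximated as below. This is not the MonitoredUDF code, which the slides do not show; `monitored` is a hypothetical decorator. One deliberate difference: Python cannot forcibly kill a thread, so this sketch abandons the overrunning worker (a daemon thread) and returns the default, whereas the slide describes killing the UDF.

```python
import functools
import threading
import time


def monitored(timeout_seconds, default):
    """Return `default` if the wrapped function does not finish in time."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                try:
                    outcome["value"] = func(*args, **kwargs)
                except Exception as exc:  # re-raised in the caller below
                    outcome["error"] = exc

            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(timeout_seconds)
            if worker.is_alive():
                return default  # timed out; the worker is abandoned
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]
        return wrapper
    return decorate


@monitored(timeout_seconds=0.1, default=False)
def slow_match(text):
    time.sleep(0.5)  # simulates a pathological regex
    return True


@monitored(timeout_seconds=1.0, default=False)
def fast_match(text):
    return "hadoop" in text
```

Here `slow_match` gives up after 0.1 seconds and yields the default, while `fast_match` returns its real result.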

Use Counters

‣ Use counters. Count everything.

‣ UDF invocations, parsed records, unparsable records, timed-out UDFs...

‣ Hook into cleanup phases and store counters to disk, next to data, for future analysis

‣ Don't have it for Pig yet, but 0.8 adds metadata to job confs to make this possible.
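
Storing counters next to the data could look something like this sketch. It uses local files rather than HDFS and a plain `Counter` rather than Hadoop counters; the `.counters.json` suffix and the function name are assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def write_with_counters(records, output_path):
    """Write records as JSON lines, then save the counters beside the output."""
    counters = Counter()
    with open(output_path, "w") as out:
        for record in records:
            counters["records_seen"] += 1
            if record is None:
                counters["records_skipped"] += 1
                continue
            out.write(json.dumps(record) + "\n")
            counters["records_written"] += 1

    # "Cleanup phase": persist the counters next to the data they describe.
    counters_path = output_path + ".counters.json"
    with open(counters_path, "w") as out:
        json.dump(counters, out, sort_keys=True)
    return counters_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_with_counters(
        [{"id": 1}, None, {"id": 2}], os.path.join(tmp, "part-00000")
    )
    with open(path) as handle:
        saved = json.load(handle)
```

Keeping the counters file beside the output means later analysis can check how many records were dropped without rerunning the job.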

Lazy deserialization FTW

‣ At first: converted Protocol Buffers into Pig tuples at read time.

‣ Moved to a Tuple wrapper that deserializes fields upon request.

‣ Huge performance boost for wide tables with only a few used columns
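
A minimal sketch of a tuple wrapper that decodes fields only on request. It is not the Elephant-Bird implementation: the per-field byte encoding, the `LazyTuple` class, and the `decoded` bookkeeping attribute are all assumptions made to keep the example self-contained.

```python
import json


class LazyTuple:
    """Holds serialized fields and decodes each one the first time it is read."""

    def __init__(self, schema, raw_fields):
        self._schema = schema      # list of (field name, decoder function)
        self._raw = raw_fields     # list of serialized field values (bytes)
        self._cache = {}           # index -> decoded value
        self.decoded = 0           # how many fields were actually decoded

    def get(self, index):
        if index not in self._cache:
            _, decoder = self._schema[index]
            self._cache[index] = decoder(self._raw[index])
            self.decoded += 1
        return self._cache[index]


schema = [
    ("user_id", lambda b: int.from_bytes(b, "big")),
    ("text", lambda b: b.decode("utf-8")),
    ("geo", json.loads),
]
raw = [(42).to_bytes(8, "big"), b"hello", b'{"lat": 1.5}']

row = LazyTuple(schema, raw)
user_id = row.get(0)  # only this field is decoded
user_id_again = row.get(0)  # served from the cache
```

A script that touches one column of a wide row pays for decoding one field, which is the effect the slide describes.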

Questions?

Follow me at twitter.com/squarecog

Photo Credits

‣ Bingo: http://www.flickr.com/photos/hownowdesign/2393662713/

‣ Sandhill Crane: http://www.flickr.com/photos/dianeham/123491289/

‣ Oakland Cranes: http://www.flickr.com/photos/clankennedy/2654213672/