facebook hadoop data & applications

18
Hadoop and Hive at Facebook Data and Applications Your Company Logo Here Wednesday, June 10, 2009 Santa Clara Marriott Dhruba Borthakur, Ding Zhou

Upload: dzhou

Post on 10-May-2015

3.328 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Facebook Hadoop Data & Applications

Hadoop and Hive at Facebook Data and Applications

Your Company Logo Here

Wednesday, June 10, 2009  Santa Clara Marriott 

Dhruba Borthakur, Ding Zhou

Page 2: Facebook Hadoop Data & Applications

Who generates this data?

Lots of data is generated on Facebook »  200 million active users »  20 million users update their statuses at least

once each day »  More than 850 million photos uploaded to the site

each month »  More than 8 million videos uploaded each month »  More than 1 billion pieces of content (web links,

news stories, blog posts, notes, photos, etc.) shared each week

http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols

Page 3: Facebook Hadoop Data & Applications

»  Hadoop/Hive Warehouse ›  4800 cores, 2 PetaBytes

total size

»  Other Hadoop Clusters •  HDFS-Scribe cluster: 320

cores, 160 TB total size •  Hadoop Archival Cluster :

80 cores, 200TB total size •  Test cluster : 800 cores,

150 TB total size

Where do we store parts of this data?

Page 4: Facebook Hadoop Data & Applications

Data Collection using Scribe

Web Servers  Scribe MidTier 

Network Storage and Servers 

Hadoop Hive Warehouse Oracle RAC  MySQL 

Page 5: Facebook Hadoop Data & Applications

Data Collection using Scribe and HDFS

Web Servers 

Scribe MidTier 

RealBme Hadoop Cluster 

Hadoop Hive Warehouse Oracle RAC 

MySQL  Hadoop Scribe Integration

Page 6: Facebook Hadoop Data & Applications

Data Archive: Move old data to cheap storage

Cheap NAS 

Hadoop Archival Cluster 20TB per node 

Hadoop Archive Node 

NFS 

Hive Query 

distcp 

Hadoop Warehouse 

HADOOP‐5048 

Page 7: Facebook Hadoop Data & Applications

Hive User Interfaces

Hive shell access

Hive Web UI

Page 8: Facebook Hadoop Data & Applications

Data Analysis at Facebook

»  Business Intelligence ›  Growth and monetization strategies ›  Product insights & decisions ›  Philosophy: build meta tools and provide easy access to data

»  Artificial Intelligence ›  Recommendation & ranking products ›  Advertising optimization ›  Text analytics ›  Philosophy: model inference; data preparation; model building;

Page 9: Facebook Hadoop Data & Applications

»  Top-level site metrics

BI: Build centralized reporting tools

Bird-view of user growth by countries

Comparing certain metrics between user groups

Page 10: Facebook Hadoop Data & Applications

»  Example: “Find the number of status updates mentioning ‘swine flu’ per day last month”

»  SELECT a.date, count(1) »  FROM status_updates a »  WHERE a.status LIKE “%swine flu%” »  AND a.date >= ‘2009-05-01’ AND a.date <= ‘2009-05-31’ »  GROUP BY a.date

BI: Make AdHoc reporting easy

Page 11: Facebook Hadoop Data & Applications

Build site metric dashboard in a day

»  Data collection: ›  Define metrics and log format (Hive schema) ›  Add logging to the site (Scribe logging) ›  Create a Hive table partitioned by date ›  Set up metric ETL cron job (Hive -> mysql/oracle)

»  Data visualization (using mysql) »  Data access (adhoc query using Hive)

Page 12: Facebook Hadoop Data & Applications

Build Machine Learning Products on Hadoop/Hive

•  Recommendation & ranking •  Advertising optimization •  Text analytics

Page 13: Facebook Hadoop Data & Applications

What applications the user may like

»  Recommend apps based on social and demographic popularity

»  User-app log is huge »  Joining user-app log with

user demographics is difficult

»  Hive for data aggregation

Page 14: Facebook Hadoop Data & Applications

Who the user wants to connect

»  Take existing edges and user feedbacks as labels

»  Build regression models based on user profile and local graph features

»  Too many friends of friends »  Model trained by sampling

»  Hive for model inference »  Hive for feature selection

Page 15: Facebook Hadoop Data & Applications

»  Market research & ad tool

»  Extract popular words from user content

»  Slice by age, gender, region »  Sentiment analysis »  Keyword association

»  Hadoop used for text analytics

What users are talking about (Lexicon)

laid-off

Words associated with vodka

Page 16: Facebook Hadoop Data & Applications

What ads the user might click on

»  Predict user-ad click-through

»  Ads click data is sparse so sampling can miss info

»  Many ML algorithms are iterative thus not easy for hadoop

»  Hadoop for model training

Page 17: Facebook Hadoop Data & Applications

Build ensemble ML models on Hadoop

»  Each mapper trains a number of models

»  Each model output as a intermediate feature

»  Model selection at reducer »  A regression model is built

on selected features

ds1 ds2 ds3 ds4

ensembles

Train models locally Cross-Test models locally

Models assembled by ensemble methods Model inference in a second Hadoop job

Page 18: Facebook Hadoop Data & Applications

In summary

»  So Zuckerberg’s urgent questions are answered; »  So celebrities know where their fans are from; »  So we know one can like vodka and lemonade at the same time; »  It’s fun playing with the data;

Dhruba Borthakur, Ding Zhou

»  Hadoop and Hive at Facebook »  Support product strategy and decision; »  Recommendation & ranking products; »  Advertising optimization; »  Text analytics tools;

dhruba@, dzhou@