text analytics summit 2009 - roddy lindsay - "social media, happiness, petabytes and lols"

Social Media, Happiness, Petabytes and LOLs

Roddy Lindsay, Data Scientist, Facebook

June 1, 2009

Lots of data is generated on Facebook

▪ 200 million active users

▪ More than 20 million users update their statuses at least once each day

▪ More than 850 million photos uploaded to the site each month

▪ More than 8 million videos uploaded each month

▪ More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

▪ More than 2.5 million events created each month

▪ More than 25 million active user groups exist on the site

Lots of data is generated on Facebook

▪ Undoubtedly a very rich data set (and large...we’re talking petabytes)

▪ Many different groups clamoring for data:

▪ Internal analysts

▪ FB Engineers

▪ Advertisers

▪ Page owners

▪ Platform/Connect developers

▪ Marketers

▪ Academics

Challenges

▪ How can Facebook satisfy all the different consumers of data?

▪ What are the challenges?

▪ 1. Infrastructure

▪ 2. Infrastructure

▪ 3. Infrastructure

Facebook’s Data Infrastructure

▪ Attempt 1: Oracle Data Warehouse (2005)

▪ Business analysts already familiar with tools, SQL

▪ Fast JOINs for data slicing, ideal for dashboards (home-rolled in PHP)

▪ e.g. growth by country and demographic

▪ When growth took off (2007), ETL processes to load and roll-up data started taking a very long time

▪ A single machine (or even several machines) was not going to cut it much longer for data volumes at that scale...

▪ Attempt 2: Hadoop (2007)

▪ Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage

▪ Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster...if your data doubles in size, just buy twice as many computers

▪ Hadoop initially developed by Doug Cutting, now an Apache project led by the Grid Computing team at Yahoo!

▪ Much faster ETL when transform and load is distributed across a cluster

▪ Engineers able to write jobs in Java and Python

▪ Not a viable solution for analysts who can write SQL but not code
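To make the Map-Reduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made for illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one text record."""
    return [(word, 1) for word in record.lower().split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each record is mapped independently and each key is reduced independently, both steps can be split across machines, which is where the roughly linear scaling comes from.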

▪ Attempt 3: Hive (2008)

▪ SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop

▪ Developed at Facebook, now an Apache subproject

▪ Also includes:

▪ Web interface for constructing queries on the fly without using a shell

▪ Live support for query problems from the data team

▪ Easy integration with charts and dashboards

▪ One-click scheduling

▪ CSV/Excel export

▪ Example: “Find the number of status updates mentioning ‘swine flu’ per day last month”

    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date

▪ Easily extendable to new operators

▪ Hypothetical example: “Find the sentiment of the ‘Terminator’ movie”

    FROM (
      FROM status_updates b
      SELECT SENTIMENT(b.status, 'terminator') AS sentiment
      WHERE b.status LIKE '%terminator%'
        AND b.date >= '2009-05-01' AND b.date <= '2009-05-31'
    ) a
    SELECT a.sentiment, count(1)
    GROUP BY a.sentiment

▪ Successfully decentralized the querying and consumption of data across the company

▪ Instead of 10 dedicated data analysts, we trained a few hundred

▪ Everyone is able to answer 95% of his or her data questions with minimal training

▪ Dedicated data scientists, instead of working on an endless queue of ad-hoc requests, can spend their time performing complex analyses and building scalable systems on top of Hadoop/Hive

▪ Machine Learning systems

▪ Rich reporting for clients + Page owners

▪ Text analytics

Facebook text analytics

▪ Lexicon (Spring 2008)

▪ Started as an intern project to test Hadoop

▪ First external deployment of a Hadoop-powered system at Facebook (and one of the first anywhere)

▪ Simple idea: count the number of occurrences of words and bigrams on Facebook Walls per day, plot them on a line graph
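The counting idea behind Lexicon can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production system: the input format (a list of `(date, text)` pairs), the whitespace tokenizer, and the names `terms`, `daily_counts` and `series` are all hypothetical.

```python
from collections import Counter, defaultdict


def terms(text):
    """Return every word and every bigram (adjacent word pair) in a post."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def daily_counts(posts):
    """Count term occurrences per day; posts is an iterable of (date, text)."""
    counts = defaultdict(Counter)
    for day, text in posts:
        counts[day].update(terms(text))
    return counts


def series(counts, term):
    """Return (day, count) points for one term, ready to plot as a line graph."""
    return [(day, counts[day][term]) for day in sorted(counts)]
```

For example, `series(daily_counts(posts), "american idol")` yields one point per day, with a count of zero on days where the bigram never appears.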

“american idol”

▪ “New” Lexicon (Fall 2008), beta preview

▪ Leveraged Hive’s structured metadata and the raw computational power of a 600-node Hadoop cluster

▪ Slices by age, gender, region

▪ Sentiment analysis

▪ Common user interests

▪ Associations graph of similar keywords, with age and gender axes

Dashboard: “economy”

Demographics: “economy”

Map: “laid off”

Sentiment: “iron man” (blue) vs. “indiana jones” (yellow)

Associations: “marriage”

Associations: “vodka”

▪ Hadoop and Hive make this all possible

▪ Consider “Associations” (similar words and phrases)

▪ Need to measure the co-occurrence of each term with every other word and bigram, relative to its baseline probability of occurrence (TF-IDF)...and keep demographic metadata around for fun

▪ Typical job generates several TB of data along the way

▪ Absolutely need a cluster of machines

▪ Distributed computation opens up the possibilities for text analytics algorithms!

▪ And...the software is free!
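The "Associations" computation can be illustrated at toy scale. The sketch below uses a simple lift ratio (how much more often a word appears alongside the query term than it does overall) as a stand-in for the TF-IDF weighting mentioned in the slides. It handles single words only, ignores demographics, and the function name and `min_count` parameter are assumptions for illustration.

```python
from collections import Counter


def associations(posts, term, min_count=2):
    """Rank words by lift: P(word | post contains term) / P(word)."""
    docs = [set(p.lower().split()) for p in posts]
    baseline = Counter(w for d in docs for w in d)      # posts containing each word
    with_term = [d for d in docs if term in d]          # posts containing the term
    co = Counter(w for d in with_term for w in d if w != term)
    n_all, n_term = len(docs), len(with_term)
    scores = {}
    for word, c in co.items():
        if c < min_count:            # skip rare co-occurrences, which are noisy
            continue
        p_given_term = c / n_term
        p_baseline = baseline[word] / n_all
        scores[word] = p_given_term / p_baseline
    # Highest lift first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Even this toy version compares the term against every other word in the corpus, which hints at why the full job over words, bigrams and demographic slices needs a cluster.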

Text Analytics

▪ Text analytics is clearly useful in the “macro”:

▪ Big data sets

▪ Big compute clusters

▪ Big consumers (corporations)

▪ What about in the micro?

▪ Small data sets

▪ B, not PB

▪ Small consumers

▪ Individual people analyzing their own data

HappyFactor

▪ Facebook Application (personal project, not associated with Facebook)

▪ Idea: ask people privately how happy they are and what they are doing

▪ Uses random text messages to ensure a good sample and to collect data easily

▪ Provide users with trends on their happiness (by day, week, month, etc.)

▪ When are you happiest?

▪ Sift through the unstructured text to find patterns in behavior that correlate with happiness and unhappiness

▪ Which activities make you happiest?

▪ Which people in your life make you happiest?
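One simple way to look for such patterns is to compare the average score of replies that mention a word against the overall average. The sketch below is a guess at the general approach, not HappyFactor's actual method: the reply format (a score followed by free text), the function names and the `min_mentions` threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def parse_reply(reply):
    """Split a reply such as '8 having dinner with sam' into (8, 'having dinner with sam')."""
    score, _, description = reply.strip().partition(" ")
    return int(score), description.lower()


def word_effects(replies, min_mentions=2):
    """Map each word to (mean score when mentioned) minus (overall mean score)."""
    parsed = [parse_reply(r) for r in replies]
    overall = mean(score for score, _ in parsed)
    by_word = defaultdict(list)
    for score, description in parsed:
        for word in set(description.split()):   # count each word once per reply
            by_word[word].append(score)
    return {
        word: mean(scores) - overall
        for word, scores in by_word.items()
        if len(scores) >= min_mentions           # ignore words seen too rarely
    }
```

Positive values mark words that tend to show up in happier replies, negative values the opposite. With only a handful of replies per person these differences are suggestive at best, and a reply that does not start with a number would raise a `ValueError` in this sketch.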

HappyFactor

▪ Just as corporations can learn about (and improve) themselves through text analytics...

▪ Why not humans?

On a scale from 1 to 10, how happy are you right now? Reply with your score and an optional description of what you are doing.

In sum...

▪ Analyzing large data sets is a challenging problem that requires significant investment (both human and financial) in infrastructure

▪ We’re only now learning what we can do with Facebook data, because we have developed the infrastructure to support it

▪ Distributed computation and structured metadata allow for a powerful new class of text analytics algorithms

▪ Text analytics has applications well beyond enterprise data-mining...

▪ ...could it potentially make the world a happier place?
