Hadoop Summit 2011 – Using a Hadoop Data Pipeline to Build a Graph of Users and Content


DESCRIPTION

At CBSi we’re developing a scalable, flexible platform to aggregate large volumes of data, mine it for meaningful relationships, and produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.

TRANSCRIPT

Page 1: Hadoop Summit 2011 - Using a Hadoop Data Pipeline to Build a Graph of Users and Content


Using a Hadoop Data Pipeline to Build a Graph of Users and Content Hadoop Summit - June 29, 2011

Bill Graham

[email protected]

Page 2: About me

• Principal Software Engineer

• Technology, Business & News BU (TBN)

• TBN Platform Infrastructure Team

• Background in SW Systems Engineering and Integration Architecture

• Contributor: Pig, Hive, HBase

• Committer: Chukwa

Page 3: About CBSi – who are we?

[Brand-category graphic] GAMES & MOVIES · TECH, BIZ & NEWS · SPORTS · ENTERTAINMENT · MUSIC

Page 4: About CBSi – scale

• Top 10 global web property

• 235M worldwide monthly uniques¹

• Hadoop ecosystem – CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading

• Cluster size:
  – Current workers: 35 DW + 6 TBN (150 TB)
  – Next quarter: 100 nodes (500 TB)

• DW peak processing: 400M events/day globally

1 - Source: comScore, March 2011

Page 5: Abstract

At CBSi we’re developing a scalable, flexible platform to aggregate large volumes of data, mine it for meaningful relationships, and produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.

Page 6: The Problem

• Users are always voting on what they find interesting:
  – Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc.

• Users have multiple identities:
  – Anonymous
  – Registered (logged in)
  – Social
  – Multiple devices

• Connections between entities sit in siloed sub-graphs

• A wealth of valuable user connectedness goes unrealized

Page 7: The Goal

• Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to:
  – Content
  – Authors
  – Each other
  – Themselves

• Better understand how our users connect to our content

• Improved content recommendations

• Improved user segmentation and content/ad targeting

Page 8: Requirements

• Integrate with existing DW/BI Hadoop Infrastructure

• Aggregate data from across CBSi and beyond

• Connect disjointed user identities

• Flexible data model

• Assemble graph of relationships

• Enable rapid experimentation, data mining and hypothesis testing

• Power new site features and advertising optimizations

Page 9: The Approach

• Mirror data into HBase

• Use MapReduce to process data

• Export RDF data into a triple store

Page 10: Data Flow

[Data-flow diagram] Source systems – CMS publishing/CMS systems, social/UGC systems, content tagging systems, DW systems, and the Site Activity Stream, a.k.a. the Firehose (JMS) – feed the pipeline. Batch data is bulk-loaded into HDFS, then transformed and loaded into HBase with MapReduce (Pig, ImportTsv); Firehose events are applied to HBase as atomic writes. RDF is exported from HBase into a triple store, which the site queries via SPARQL.

Page 11: NOSQL Data Models

[Chart: data size vs. data complexity] Key-value stores, ColumnFamily stores, document databases, and graph databases, in order of increasing data complexity and decreasing data size at scale. Credit: Emil Eifrem, Neo Technology

Page 12: Conceptual Graph

[Graph diagram] Users appear as anonId, regId, and SessionId nodes stitched together with "is also" and "had session" edges; sessions "contain" PageEvents (DW, daily). Users "like" and "follow" Assets via the activity firehose (real-time). Assets are also Products, Stories, Brands, or Authors ("is also"), and Stories are "authored by" Authors (CMS, batch + incremental). Assets are "tagged with" tags (Tags, batch).

Page 13: HBase Schema

user_info table

Row Key       Column Family   Column Name   Value
ANON-<id1>    ALIAS:          URS-<id1>     <ts>
ANON-<id1>    EVENT:          LIKE-<ts>     <json>
ANON-<id1>    EVENT:          SHARE-<ts>    <json>
URS-<id1>     ALIAS:          ANON-<id1>    <ts>
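A minimal sketch of how cells in this schema might be written with the CDH3-era HBase Java client. The table name and column families come from the slide; the ids, timestamp handling, and JSON payload are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UserInfoWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_info");  // table name from the slide

        long ts = System.currentTimeMillis();
        String anonId = "ANON-abc123";  // hypothetical ids for illustration
        String regId = "URS-42";

        // Alias edge: anonymous id -> registered id, stored in both directions
        Put anonRow = new Put(Bytes.toBytes(anonId));
        anonRow.add(Bytes.toBytes("ALIAS"), Bytes.toBytes(regId),
                    Bytes.toBytes(Long.toString(ts)));
        // Social event cell: qualifier is <event>-<ts>, value is the JSON payload
        anonRow.add(Bytes.toBytes("EVENT"), Bytes.toBytes("LIKE-" + ts),
                    Bytes.toBytes("{\"levt.event\":\"LIKE\"}"));
        table.put(anonRow);

        Put regRow = new Put(Bytes.toBytes(regId));
        regRow.add(Bytes.toBytes("ALIAS"), Bytes.toBytes(anonId),
                   Bytes.toBytes(Long.toString(ts)));
        table.put(regRow);

        table.close();
    }
}

Storing the alias in both row directions is what makes the "is also" edge cheap to traverse from either identity.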

Page 14: HBase Loading

• Incremental
  – Consuming from a JMS queue == real-time (see the sketch below)

• Batch
  – Pig’s HBaseStorage == quick to develop & iterate
  – HBase’s ImportTsv == more efficient
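A rough sketch of the incremental path, assuming a JMS listener receiving JSON events from the firehose; the class name is hypothetical, payload parsing is site-specific and stubbed out here, and production code would need real error handling:

import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Consumes site-activity events and applies them to HBase as per-row writes.
public class FirehoseConsumer implements MessageListener {
    private final HTable table;  // HTable is not thread-safe; one per consumer

    public FirehoseConsumer(HTable table) {
        this.table = table;
    }

    public void onMessage(Message message) {
        try {
            String json = ((TextMessage) message).getText();
            // Extracting the user id and event type from the payload is
            // site-specific; hardcoded here purely for illustration.
            String userId = "ANON-abc123";
            String event = "LIKE";
            long ts = System.currentTimeMillis();

            Put put = new Put(Bytes.toBytes(userId));
            put.add(Bytes.toBytes("EVENT"), Bytes.toBytes(event + "-" + ts),
                    Bytes.toBytes(json));
            table.put(put);  // single-row writes in HBase are atomic
        } catch (Exception e) {
            // in production: dead-letter the message rather than drop it
            e.printStackTrace();
        }
    }
}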

Page 15: Generating RDF with Pig

• RDF¹ is a W3C standard for representing subject-predicate-object relationships, commonly serialized as XML

• Philosophy: Store large amounts of data in Hadoop, be selective of what goes into the triple store

• For example:
  – “First class” graph citizens we plan to query on
  – Implicit-to-explicit (i.e., derived) connections:
    − Content recommendations
    − User segments
    − Related users
    − Content tags

• Easily join data to create new triples with Pig

• Run SPARQL² queries, examine, refine, reload

1 - http://www.w3.org/RDF
2 - http://www.w3.org/TR/rdf-sparql-query

Page 16: Example Pig RDF Script

Create RDF triples of users to social events:

-- Load user events from HBase; '-loadKey true' also emits the row key
-- (mapToBag, jsonToMap and GenerateRDFTriple are site-specific UDFs, not Pig built-ins)
RAW = LOAD 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
      AS (id:bytearray, event_map:map[]);

-- Convert our maps to bags so we can flatten them out
A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);

-- Convert the JSON events into maps
B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];

-- Pull values from the map
C = FOREACH B GENERATE id,
    social_map#'levt.asid' AS asid,
    social_map#'levt.xastid' AS astid,
    social_map#'levt.event' AS event,
    social_map#'levt.eventt' AS eventt,
    social_map#'levt.ssite' AS ssite,
    social_map#'levt.ts' AS eventtimestamp;

EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
    'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp);

STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();

Page 17: Example SPARQL query

Recommend content based on Facebook “liked” items:

SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2
WHERE {
  # anon-user who Like'd a content asset (news item, blog post) on Facebook
  <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
  ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
  ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
  ?x <urn:com.cbs.trident:tasset> ?asset1 .
  ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .

  # a tag associated with the content asset
  ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
  ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .

  # other content assets with the same tag and their title
  ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 .
  FILTER (?asset2 != ?asset1)
  ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
  ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
  ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 .
  FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}
ORDER BY DESC(?pubdt2)
LIMIT 10

Page 18: Conclusions I – Power and Flexibility

• Architecture is flexible with respect to:
  – Data modeling
  – Integration patterns
  – Data processing and querying techniques

• Multiple approaches for graph traversal (see the sketch below):
  – SPARQL
  – Traverse HBase
  – MapReduce
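For instance, the "traverse HBase" option can follow the ALIAS edges from Page 13 directly with Get calls. A minimal sketch assuming that schema; the ids are hypothetical:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AliasWalk {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_info");

        // Start from an anonymous id and hop across "is also" alias edges
        Result row = table.get(new Get(Bytes.toBytes("ANON-abc123")));
        Map<byte[], byte[]> aliases = row.getFamilyMap(Bytes.toBytes("ALIAS"));
        if (aliases != null) {
            for (byte[] qualifier : aliases.keySet()) {
                String linkedId = Bytes.toString(qualifier);  // e.g. URS-42
                // Fetch the linked identity's events: one more row-level hop
                Result linked = table.get(new Get(Bytes.toBytes(linkedId)));
                Map<byte[], byte[]> events = linked.getFamilyMap(Bytes.toBytes("EVENT"));
                if (events == null) continue;
                for (byte[] eventCol : events.keySet()) {
                    System.out.println(linkedId + " -> " + Bytes.toString(eventCol));
                }
            }
        }
        table.close();
    }
}

Row-key point lookups like this suit shallow, targeted hops; deep or whole-graph traversals are better left to SPARQL or MapReduce.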

Page 19: Conclusions II – Match the Tool with the Job

• Hadoop – scale and computing horsepower

• HBase – atomic r/w access, speed, flexibility

• RDF Triple Store – complex graph querying

• Pig – rapid MR prototyping and ad-hoc analysis

• Future:
  – HCatalog – schema & table management
  – Oozie or Azkaban – workflow engine
  – Mahout – machine learning
  – Hama – graph processing

Page 20: Conclusions III – OSS, woot!

If it doesn’t do what you want, submit a patch.

Page 21: Questions?