Agile Analytics Applications on Hadoop


DESCRIPTION

Slides to accompany the book Agile Big Data.

TRANSCRIPT

Page 1: Agile analytics applications on hadoop

Agile Analytics Applications

Russell Jurney


Page 2: Agile analytics applications on hadoop

About me… Bearding.

• I’m going to beat this guy

• Seriously

• Bearding is my #1 natural talent

• Salty Sea Beard

• Fortified with Pacific Ocean Minerals


Page 3: Agile analytics applications on hadoop

Agile Data: The Book (August, 2013)


Read @ Safari Rough Cuts

A philosophy, not the only way

But still, it’s good! Really!


Page 4: Agile analytics applications on hadoop

We go fast... but don’t worry!

• Download the slides - click the links - read examples!

• If it’s not on the blog, it’s in the book!

• Order now: http://shop.oreilly.com/product/0636920025054.do

• Read the book on Safari Rough Cuts


Page 5: Agile analytics applications on hadoop

Agile Application Development: Check

• LAMP stack mature

• Post-Rails frameworks to choose from

• Enable rapid feedback and agility


+ NoSQL


Page 6: Agile analytics applications on hadoop

Data Warehousing


Page 7: Agile analytics applications on hadoop

Scientific Computing / HPC

• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop


Tubes and Mercury (old school) vs. Cores and Spindles (new school)

UNIVAC and Deep Blue both fill a warehouse. We’re back...


Page 8: Agile analytics applications on hadoop

Data Science?

(Pie chart: Data Science = one-third Application Development, one-third Data Warehousing, one-third Scientific Computing / HPC)


Page 9: Agile analytics applications on hadoop

Data Center as Computer

• Warehouse Scale Computers and applications


“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’


Page 10: Agile analytics applications on hadoop

Hadoop to the Rescue!

• Easy to use! (Pig, Hive, Cascading)

• CHEAP: 1% the cost of SAN/NAS

• A department can afford its own Hadoop cluster!

• Dump all your data in one place: Hadoop DFS

• Silos come CRASHING DOWN!

• JOIN like crazy!

• ETL like whoah!

• An army of mappers and reducers at your command

• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!


Page 11: Agile analytics applications on hadoop

NOW WHAT?


Page 12: Agile analytics applications on hadoop

Analytics Apps: It takes a Team


• Broad skill-set

• Nobody has them all

• Inherently collaborative


Page 13: Agile analytics applications on hadoop

Data Science Team

• 3-4 team members with broad, diverse skill-sets that overlap

• Transactional overhead dominates at 5+ people

• Expert researchers: lend 25-50% of their time to teams

• Creative workers. Run like a studio, not an assembly line

• Total freedom... with goals and deliverables.

• Work environment matters most


Page 14: Agile analytics applications on hadoop

How to get insight into product?

• Back-end has gotten THICKER

• Generating $$$ insight can take 10-100x the effort of app dev

• Timeline disjoint: analytics vs agile app-dev/design

• How do you ship insights efficiently?

• How do you collaborate on research vs. developer timelines?


Page 15: Agile analytics applications on hadoop

The Wrong Way - Part One


“We made a great design. Your job is to predict the future for it.”


Page 16: Agile analytics applications on hadoop

The Wrong Way - Part Two


“What is taking you so long to reliably predict the future?”


Page 17: Agile analytics applications on hadoop

The Wrong Way - Part Three


“The users don’t understand what 86% true means.”


Page 18: Agile analytics applications on hadoop

The Wrong Way - Part Four


GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!


Page 19: Agile analytics applications on hadoop

The Wrong Way - Inevitable Conclusion


(Image: a plane flying into a mountainside)


Page 20: Agile analytics applications on hadoop

Reminds me of... the waterfall model


:(


Page 21: Agile analytics applications on hadoop

Chief Problem


You can’t design insight in analytics applications.

You discover it.

You discover by exploring.


Page 22: Agile analytics applications on hadoop

-> Strategy


So make an app for exploring your data.

Which becomes a palette for what you ship.

Iterate and publish intermediate results.


Page 23: Agile analytics applications on hadoop

Data Design

• It’s not the 1st query that yields insight - it’s the 15th, or the 150th

• Capturing “Ah ha!” moments

• Slow to do those in batch…

• Faster, better context in an interactive web application.

• Pre-designed charts wind up terrible. So bad.

• Easy to invest man-years in the wrong statistical models

• Semantics of presenting predictions are complex, delicate

• Opportunity lies at intersection of data & design


Page 24: Agile analytics applications on hadoop

How do we get back to Agile?


Page 25: Agile analytics applications on hadoop

Statement of Principles


(then tricks, with code)


Page 26: Agile analytics applications on hadoop

Set up an environment where...

• Insights repeatedly produced

• Iterative work shared with entire team

• Interactive from day zero

• Data model is consistent end-to-end

• Minimal impedance between layers

• Scope and depth of insights grow

• Insights form the palette for what you ship

• Until the application pays for itself and more


Page 27: Agile analytics applications on hadoop

Snowballing Audience


Page 28: Agile analytics applications on hadoop

Value document > relation


Most data is dirty. Most data is semi-structured or unstructured. Rejoice!


Page 29: Agile analytics applications on hadoop

Value document > relation


Note: Hive/ArrayQL/NewSQL support for document/array types blurs this distinction.


Page 30: Agile analytics applications on hadoop

Relational Data = Legacy Format

• Why JOIN? Storage is fundamentally cheap!

• Duplicate that JOIN data in one big record type! (see the sketch after this list)

• ETL once to document format on import, NOT every job

• Not zero JOINs, but far fewer JOINs

• Semi-structured documents preserve data’s actual structure

• Column compressed document formats beat JOINs!
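
For example, here is the denormalization these bullets describe, sketched in Python. The layout is hypothetical, but it mirrors the Enron email schema used later in this deck:

# Relational: two tables, JOINed on message_id at query time, every time.
message    = ("msg_001", "Re: frop futures", "bob@enron")
recipients = [("msg_001", "to", "connie@enron"),
              ("msg_001", "cc", "kate@enron")]

# Document: ETL once on import; every job then reads one self-contained record.
email = {
    "message_id": "msg_001",
    "subject": "Re: frop futures",
    "from": {"address": "bob@enron"},
    "tos": [{"address": "connie@enron"}],
    "ccs": [{"address": "kate@enron"}],
}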


Page 31: Agile analytics applications on hadoop

Value imperative > declarative

• We don’t know what we want to SELECT.

• Data is dirty - check each step, clean iteratively.

• 85% of a data scientist’s time is spent munging. See: ETL.

• Imperative is optimized for our process.

• Process = iterative, snowballing insight

• Efficiency matters; self-optimize


Page 32: Agile analytics applications on hadoop

Value dataflow > SELECT


Page 33: Agile analytics applications on hadoop

Ex. dataflow: ETL + email sent count

(I can’t read this either. Get a big version here.)

Page 34: Agile analytics applications on hadoop

Value Pig > Hive (for app-dev)

• Pigs eat ANYTHING

• Pig is optimized for refining data, as opposed to consuming it

• Pig is imperative, iterative

• Pig is dataflows, and SQLish (but not SQL)

• Code modularization/re-use: Pig Macros

• ILLUSTRATE speeds dev time (even UDFs)

• Easy UDFs in Java, JRuby, Jython, Javascript

• Pig Streaming = use any tool, period.

• Easily prepare our data as it will appear in our app.

• If you prefer Hive, use Hive.


But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive… See: HCatalog for Pig/Hive integration.


Page 35: Agile analytics applications on hadoop

Localhost vs Petabyte scale: same tools

• Simplicity essential to scalability: use the highest-level tools we can

• Prepare a good sample - tricky with joins, easy with documents (sketch after this list)

• Local mode: pig -l /tmp -x local -v -w

• Frequent use of ILLUSTRATE

• 1st: Iterate, debug & publish locally

• 2nd: Run on cluster, publish to team/customer

• Consider skipping Object-Relational-Mapping (ORM)

• We do not trust ‘databases,’ only HDFS @ n=3.

• Everything we serve in our app is re-creatable via Hadoop.
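
Because documents are self-contained, a uniform sample of records is a valid sample of the dataset - no cross-table consistency to maintain. A minimal sketch in Python, assuming newline-delimited JSON documents (file names hypothetical):

import json, random

def reservoir_sample(path, k=1000):
    # One pass, uniform sample of k documents (reservoir sampling).
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(json.loads(line))
            else:
                j = random.randint(0, i)
                if j < k:
                    sample[j] = json.loads(line)
    return sample

# Write a local sample to iterate on with 'pig -x local'.
with open('/tmp/emails_sample.json', 'w') as out:
    for doc in reservoir_sample('emails.json'):
        out.write(json.dumps(doc) + '\n')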


Page 36: Agile analytics applications on hadoop

Data-Value Pyramid


Climb it. Do not skip steps. See here.


Page 37: Agile analytics applications on hadoop

0/1) Display atomic records on the web


Page 38: Agile analytics applications on hadoop

0.0) Document-serialize events

• Protobuf

• Thrift

• JSON

• Avro - I use Avro because the schema is onboard.
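
A minimal sketch of writing one document-serialized event with Avro in Python. It assumes the avro package (this call is spelled Parse in some versions) and a hypothetical email.avsc schema file:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# The schema rides along in the file header - the 'onboard' property above.
schema = avro.schema.parse(open('email.avsc').read())

f = open('emails.avro', 'wb')
writer = DataFileWriter(f, DatumWriter(), schema)
writer.append({'message_id': 'msg_001', 'subject': 'Re: frop futures'})
writer.close()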


Page 39: Agile analytics applications on hadoop

0.1) Documents via Relation ETL


enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray
);

enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray
);

split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';

headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;

emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name)
        as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage();

Example here.


Page 40: Agile analytics applications on hadoop

0.2) Serialize events from streams


class GmailSlurper(object):
    ...
    def init_imap(self, username, password):
        self.username = username
        self.password = password
        try:
            self.imap.shutdown()
        except:
            pass
        self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
        self.imap.login(username, password)
        self.imap.is_readonly = True
    ...
    def write(self, record):
        self.avro_writer.append(record)
    ...
    def slurp(self):
        if(self.imap and self.imap_folder):
            for email_id in self.id_list:
                (status, email_hash, charset) = self.fetch_email(email_id)
                if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                    print email_id, charset, email_hash['thread_id']
                    self.write(email_hash)

Scrape your own gmail in Python and Ruby.

Page 41: Agile analytics applications on hadoop

0.3) ETL Logs


log_data = LOAD 'access_log'
    USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
    AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);


Page 42: Agile analytics applications on hadoop

1) Plumb atomic events -> browser


(Example stack that enables high productivity)


Page 43: Agile analytics applications on hadoop

1.1) cat our Avro serialized events


me$ cat_avro ~/Data/enron.avro
{
  u'bccs': [],
  u'body': u'scamming people, blah blah',
  u'ccs': [],
  u'date': u'2000-08-28T01:50:00.000Z',
  u'from': {u'address': u'[email protected]', u'name': None},
  u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
  u'subject': u'Re: Enron trade for frop futures',
  u'tos': [ {u'address': u'[email protected]', u'name': None} ]
}

Get cat_avro in python, ruby

Page 44: Agile analytics applications on hadoop

1.2) Load our events in Pig


me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address: chararray, name: chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray, name: chararray)},
  ccs: {cc: (address: chararray, name: chararray)},
  bccs: {bcc: (address: chararray, name: chararray)}
}

 


Page 45: Agile analytics applications on hadoop

1.3) ILLUSTRATE our events in Pig


grunt> illustrate enron_emails 

--------------------------------------------------------------------------
| emails |
| message_id:chararray    <1731.10095812390082.JavaMail.evans@thyme>
| datetime:chararray      2001-01-09T06:38:00.000Z
| from:tuple(address:chararray,name:chararray)    ([email protected], J.R. Bob Dobbs)
| subject:chararray       Re: Enron trade for frop futures
| body:chararray          scamming people, blah blah
| tos:bag{to:tuple(address:chararray,name:chararray)}     {([email protected],)}
| ccs:bag{cc:tuple(address:chararray,name:chararray)}     {}
| bccs:bag{bcc:tuple(address:chararray,name:chararray)}   {}
--------------------------------------------------------------------------
(schema and one sample row, transposed here for readability)

Upgrade to Pig 0.10+


Page 46: Agile analytics applications on hadoop

1.4) Publish our events to a ‘database’


From Avro to MongoDB in one command:

pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

Which does this:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, lets have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

Full instructions here.


Page 47: Agile analytics applications on hadoop

1.5) Check events in our ‘database’


$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron

> show collections
emails
system.indexes

> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
    "_id" : ObjectId("502b4ae703643a6a49c8d180"),
    "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
    "date" : "2001-01-09T06:38:00.000Z",
    "from" : { "address" : "[email protected]", "name" : "J.R. Bob Dobbs" },
    "subject" : "Re: Enron trade for frop futures",
    "body" : "Scamming more people...",
    "tos" : [ { "address" : "connie@enron", "name" : null } ],
    "ccs" : [ ],
    "bccs" : [ ]
}


Page 48: Agile analytics applications on hadoop

1.6) Publish events on the web


require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database   = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end


Page 49: Agile analytics applications on hadoop

1.6) Publish events on the web


Page 50: Agile analytics applications on hadoop

One-Liner to Transition Stack


Page 51: Agile analytics applications on hadoop

What’s the point?

• A designer can work against real data.

• An application developer can work against real data.

• A product manager can think in terms of real data.

• Entire team is grounded in reality!

• You’ll see how ugly your data really is.

• You’ll see how much work you have yet to do.

• Ship early and often!

• Feels agile, don’t it? Keep it up!


Page 52: Agile analytics applications on hadoop

1.7) Wrap events with Bootstrap


<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">

</head>

<body>

<div class="container" style="margin-top: 100px;">

<table class="table table-striped table-bordered table-condensed">

<thead>

{% for key in data['keys'] %}

<th>{{ key }}</th>

{% endfor %}

</thead>

<tbody>

<tr>

{% for value in data['values'] %}

<td>{{ value }}</td>

{% endfor %}

</tr>

</tbody>

</table>

</div>

</body>

Complete example here with code here.


Page 53: Agile analytics applications on hadoop

1.7) Wrap events with Bootstrap


Page 54: Agile analytics applications on hadoop

Refine. Add links between documents.

Not the Mona Lisa, but coming along... See: here

Page 55: Agile analytics applications on hadoop

1.8) List links to sorted events


Use your ‘database’, if it can sort:

mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
    "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
    "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
    "from" : [
...

Use Pig, serve/cache a bag/array of email documents:

pig -l /tmp -x local -v -w

emails_per_user = foreach (group emails by from.address) {
    sorted = order emails by date;
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails;
};

store emails_per_user into '$mongourl' using MongoStorage();


Page 56: Agile analytics applications on hadoop

1.8) List links to sorted documents


Page 57: Agile analytics applications on hadoop

1.9) Make it searchable...


If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch(
    '/me/elasticsearch-0.18.6/config/elasticsearch.yml',
    '/me/elasticsearch-0.18.6/plugins');

Test it with curl:

curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

ElasticSearch has no security features. Take note. Isolate.


Page 58: Agile analytics applications on hadoop

2) Create Simple Charts


Page 59: Agile analytics applications on hadoop

2) Create Simple Tables and Charts


Page 60: Agile analytics applications on hadoop

2) Create Simple Charts

• Start with an HTML table on general principle.

• Then use nvd3.js - reusable charts for d3.js

• Aggregating by properties & displaying them is the first step in entity resolution

• Start extracting entities. Ex: people, places, topics, time series

• Group documents by entities, rank and count.

• Publish top N, time series, etc.

• Fill a page with charts.

• Add a chart to your event page.


Page 61: Agile analytics applications on hadoop

2.1) Top N (of anything) in Pig


pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};

store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json.

This would make a good Pig Macro.


Page 62: Agile analytics applications on hadoop

2.2) Time Series (of anything) in Pig


pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month), COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.


Page 63: Agile analytics applications on hadoop

Data processing in our stack


A new feature in our application might begin at any layer... great!

Any team member can add new features, no problemo!

I’m creative! I know Pig!

I’m creative too! I <3 JavaScript!

omghi2u! where r my legs?

send halp


Page 64: Agile analytics applications on hadoop

Data processing in our stack


... but we shift the data-processing towards batch, as we are able.

Ex: Overall total emails calculated in each layer

See real example here.


Page 65: Agile analytics applications on hadoop

3) Exploring with Reports


Page 66: Agile analytics applications on hadoop

3) Exploring with Reports


Page 67: Agile analytics applications on hadoop

3.0) From charts to reports...

• Extract entities from properties we aggregated by in charts (Step 2)

• Each entity gets its own type of web page

• Each unique entity gets its own web page (see the sketch after this list)

• Link to entities as they appear in atomic event documents (Step 1)

• Link the most related entities together, within and between types.

• More visualizations!

• Parameterize results via forms.
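
For the entity-page step, a sketch in Python with Flask and pymongo, paralleling the earlier Sinatra example; the collection and field names are hypothetical:

from bson import json_util
from flask import Flask
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient()['agile_data']

@app.route('/email_address/<address>')
def email_address(address):
    # One page per unique entity: whatever we precomputed for it in Pig.
    record = db['addresses'].find_one({'address': address})
    return json_util.dumps(record)

if __name__ == '__main__':
    app.run(debug=True)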


Page 68: Agile analytics applications on hadoop

3.1) Looks like this...


Page 69: Agile analytics applications on hadoop

3.2) Cultivate common keyspaces


Page 70: Agile analytics applications on hadoop

3.3) Get people clicking. Learn.

• Explore this web of generated pages, charts and links!

• Everyone on the team gets to know your data.

• Keep trying out different charts, metrics, entities, links.

• See what’s interesting.

• Figure out what data needs cleaning and clean it.

• Start thinking about predictions & recommendations.


‘People’ could be just your team, if data is sensitive.


Page 71: Agile analytics applications on hadoop

4) Predictions and Recommendations


Page 72: Agile analytics applications on hadoop

4.0) Preparation

• We’ve already extracted entities, their properties and relationships

• Our charts show where our signal is rich

• We’ve cleaned our data to make it presentable

• The entire team has an intuitive understanding of the data

• They got that understanding by exploring the data

• We are all on the same page!


Page 73: Agile analytics applications on hadoop

4.2) Think in different perspectives

• Networks

• Time Series / Distributions

• Natural Language Processing

• Conditional Probabilities / Bayesian Inference

• Check out Chapter 2 of the book


Page 74: Agile analytics applications on hadoop

4.3) Networks


Page 75: Agile analytics applications on hadoop

4.3.1) Weighted Email Networks in Pig


DEFINE header_pairs(email, col1, col2) RETURNS pairs {
    filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
    flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
    $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
}

/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;

/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;


Page 76: Agile analytics applications on hadoop

4.3.2) Networks Viz with Gephi


Page 77: Agile analytics applications on hadoop

4.3.3) Gephi = Easy


Page 78: Agile analytics applications on hadoop

4.3.4) Social Network Analysis


Page 79: Agile analytics applications on hadoop

4.4) Time Series & Distributions


pig -l /tmp -x local -v -w

/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
    generate flatten(group) as (key, day),
    COUNT_STAR(things) as total;

/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
    timeseries = order things_per_day by day;
    generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();


Page 81: Agile analytics applications on hadoop

4.4.2) Regress to find Trends


JRuby Linear Regression UDF
Pig to use the UDF
Trend Line in your Application
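
The UDF itself is behind the links above; as a sketch of the same idea in plain Python with numpy (the data is hypothetical):

import numpy as np

# Daily email totals: x is the day index, y is the count.
days   = np.array([1, 2, 3, 4, 5, 6, 7])
totals = np.array([10, 12, 9, 15, 18, 17, 21])

slope, intercept = np.polyfit(days, totals, 1)  # least-squares fit
trend = slope * days + intercept                # points for the app's trend line
print('trend: %+.2f emails/day' % slope)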


Page 82: Agile analytics applications on hadoop

4.5.1) Natural Language Processing


Example with code here and macro here.

import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 TF*IDF scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
    sorted = order my_tf_idf_scores by value desc;
    top_10_topics = limit sorted 10;
    generate group, top_10_topics.(score, value);
}


Page 83: Agile analytics applications on hadoop

4.5.2) NLP: Extract Topics!


Page 85: Agile analytics applications on hadoop

4.6) Probability & Bayesian Inference


Page 86: Agile analytics applications on hadoop

4.6.1) Gmail Suggested Recipients


Page 87: Agile analytics applications on hadoop

4.6.1) Reproducing it with Pig...


Page 88: Agile analytics applications on hadoop

4.6.2) Step 1: COUNT(From -> To)


Page 89: Agile analytics applications on hadoop

4.6.2) Step 2: COUNT(From, To, Cc)/Total


P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
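
Written out as an equation (my reading of steps 1 and 2; the notation is mine, not the deck’s):

P(\mathrm{cc} \mid \mathrm{to}) = \frac{\mathrm{COUNT}(from, to, cc)}{\mathrm{COUNT}(from, to)}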


Page 90: Agile analytics applications on hadoop

4.6.3) Wait - Stop Here! It works!


They match…


Page 91: Agile analytics applications on hadoop

4.4) Add predictions to reports


Page 92: Agile analytics applications on hadoop

5) Enable new actions


Page 93: Agile analytics applications on hadoop

Why doesn’t Kate reply to my emails?

• What time is best to catch her? (see the sketch after this list)

• Are they too long?

• Are they meant to be replied to (contain original content)?

• Are they nice? (sentiment analysis)

• Do I reply to her emails (reciprocity)?

• Do I cc the wrong people (my mom)?
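
Two of these features, sketched in Python over the email documents from earlier. Field names follow the deck’s schema; emails is assumed to be a loaded list of dicts, and python-dateutil is assumed available:

from collections import Counter
from dateutil import parser

def kate_features(emails, me, kate):
    # Reciprocity: of the emails I sent her, how many did she answer?
    sent = [e for e in emails
            if e['from']['address'] == me
            and any(t['address'] == kate for t in e['tos'])]
    replies = [e for e in emails
               if e['from']['address'] == kate
               and e['subject'].lower().startswith('re:')]
    # Best time to catch her: when do her replies tend to arrive?
    hours = Counter(parser.parse(e['date']).hour for e in replies)
    return {
        'reply_rate': len(replies) / float(len(sent) or 1),
        'best_hour': hours.most_common(1)[0][0] if hours else None,
    }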


Page 94: Agile analytics applications on hadoop

Example: Packetpig and PacketLoop


snort_alerts = LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
    GENERATE
        com.packetloop.packetpig.udf.geoip.Country(src) as country,
        priority;

countries = GROUP countries BY country;

countries = FOREACH countries
    GENERATE
        group,
        AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here.


Page 96: Agile analytics applications on hadoop

Thank You!

Questions & Answers

Follow: @rjurney
Read the Blog: datasyndrome.com
