(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS


TRANSCRIPT

Page 1: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS

Roy Ben-Alta, Business Development Manager, AWS
Rick McFarland, VP of Data Services, Hearst
October 2015
BDT306

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Page 2: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

What to Expect from the Session

• Common patterns for clickstream analytics
• Tips on using Amazon Kinesis and Amazon EMR for clickstream processing
• Hearst's big data journey in building the Hearst analytics stack for clickstream
• Lessons learned
• Q&A

Page 3: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Clickstream Analytics = Business Value

Verticals / Use Cases: Accelerated ingest-transform-load to final destination | Continual metrics / KPI extraction | Actionable insights

Ad Tech / Marketing Analytics: Advertising data aggregation | Advertising metrics like coverage, yield, conversion, scoring | User activity/webpage engagement analytics, optimized bid/buy engines

Consumer Online / Gaming: Online customer engagement data aggregation | Consumer/app engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines

Financial Services: Digital assets; improve customer experience on bank websites | Financial market data metrics | Fraud monitoring, value-at-risk assessment, auditing of market order data

IoT / Sensor Data: Fitness device, vehicle sensor, and telemetry data ingestion | Wearable sensor operational metrics and dashboards | Devices / sensor operational intelligence

Page 4: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

DataXu Records

Apache access log:

68.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-"
192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-"
192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET

JSON clickstream record:

{"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST-INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load":"5.096","ref":"http://www.seattlepi.com/comics-and-games","bu":"HNP","brand":"SEATTLE POST-INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"desktop:chrome"}

Clickstream records:
• Number of fields is not fixed
• Tag names change
• Multiple pages/sites
• Format can be defined as we store the data: Avro, CSV, TSV, JSON

Page 5: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Clickstream Analytics Is the New “Hello World”

Hello World → Word Count → Clickstream

Page 6: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Clickstream Analytics – Common Patterns

• Flume → HDFS → Hive: batch, high latency on retrieval
• Flume → HDFS → Hive & Pig → SQL: batch, low latency on retrieval
• Flume / Sqoop → HDFS → Impala, Spark SQL, Presto, other: more options, batch with lower latency on retrieval

Page 7: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

[Architecture diagram: users → web servers → Amazon Kinesis → Kinesis-enabled app → Amazon S3 → Amazon EMR → Amazon S3 → Amazon Redshift]

Page 8: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

It’s All About the Pace, About the Pace…

Big data
• Hourly server logs: were your systems misbehaving 1 hour ago?
• Weekly / monthly bill: what you spent this billing cycle
• Daily customer-preferences report from your web site’s clickstream: what deal or ad to try next time
• Daily fraud reports: was there fraud yesterday?

Real-time big data
• Amazon CloudWatch metrics: what went wrong now
• Real-time spending alerts/caps: prevent overspending now
• Real-time analysis: what to offer the current customer now
• Real-time detection: block fraudulent use now

Page 9: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Clickstream Storage and Processing with Amazon Kinesis

[Architecture diagram: an AWS endpoint feeds an Amazon Kinesis stream whose shards (Shard 1…Shard N) are replicated across Availability Zones. Consumer applications read the stream: App 1 aggregates and ingests data to Amazon S3 (data lake), App 2 aggregates and ingests data to Amazon Redshift, App 3 performs ETL/ELT and machine learning on Amazon EMR, and App N drives a live dashboard. DynamoDB tracks consumer application state.]

Page 10: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Amazon EMR

• Managed, elastic Hadoop (1.x & 2.x) cluster
• Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift
• Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically
• Support for Spot instances
• Integrated HBase NoSQL database

Amazon EMR with Apache Spark: Spark SQL, Spark Streaming, MLlib, GraphX

Page 11: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Spot Integration with Amazon EMR

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
  InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
  InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
  InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
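A rough boto3 equivalent of the CLI call above, for readers scripting cluster creation from Python. This is a sketch: the region, group names, and log handling are illustrative assumptions, not values from the talk.

# Hypothetical boto3 counterpart of the CLI above; region and group names are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="Spot cluster",
    AmiVersion="3.3",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "Market": "SPOT",
             "BidPrice": "0.03", "InstanceType": "m3.xlarge", "InstanceCount": 2},
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": "0.10", "InstanceType": "m3.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)
print(response["JobFlowId"])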

Page 12: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Spot Integration with Amazon EMR

10 node cluster running for 14 hours

Cost = 1.0 * 10 * 14 = $140 (all On-Demand, at $1.00 per instance-hour)

Page 13: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Resize Nodes with Spot Instances

Add 10 more nodes on Spot

Page 14: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Resize Nodes with Spot Instances

20 node cluster running for 7 hours

Cost = 1.0 * 10 * 7 = $70 (On-Demand nodes)
     + 0.5 * 10 * 7 = $35 (Spot nodes)
Total = $105
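The same blended-cost arithmetic as a tiny Python sketch, using the slide's illustrative rates of $1.00 per On-Demand instance-hour and $0.50 per Spot instance-hour.

# Blended cost of the resized cluster, using the slide's illustrative rates.
on_demand_rate, spot_rate = 1.00, 0.50   # $ per instance-hour
hours = 7
on_demand_nodes, spot_nodes = 10, 10

cost = on_demand_rate * on_demand_nodes * hours + spot_rate * spot_nodes * hours
print(cost)  # 105.0 -> 25% cheaper than the $140 all-On-Demand, 14-hour run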

Page 15: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Resize Nodes with Spot Instances

50% less runtime (14 → 7 hours)
25% less cost ($140 → $105)

Page 16: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Amazon EMR and Amazon Kinesis for Batch and Interactive Processing

• Streaming log analysis
• Interactive ETL

[Architecture diagram: Amazon Kinesis feeds Amazon EMR, which writes to Amazon Redshift (for BI) and Amazon S3; data scientists use a separate Amazon EMR cluster on Spot instances.]

Page 17: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

AWS Elastic Beanstalk – app to push data into Amazon Kinesis

Page 18: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Amazon Kinesis Applications – Tips

• Amazon Software License linking – add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10)
• Shards – include headroom for catching up with data in the stream
• Tracking Amazon Kinesis application state (DynamoDB):
  • Kinesis application : DynamoDB table is 1:1
  • Created automatically
  • Make sure the application name doesn’t conflict with existing DynamoDB tables
  • Adjust DynamoDB provisioned throughput if necessary (default is 10 reads and 10 writes per second) – see the sketch below
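A minimal boto3 sketch for the last tip. The table name here is hypothetical: the KCL names its lease table after your application, so substitute your own application name and capacity values.

# Raise the KCL lease table's provisioned throughput (boto3 sketch).
# "my-kinesis-app" is hypothetical; the KCL names the table after your application.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
dynamodb.update_table(
    TableName="my-kinesis-app",
    ProvisionedThroughput={"ReadCapacityUnits": 50, "WriteCapacityUnits": 50},
)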

Page 19: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Spark on Amazon EMR – Tips

• Amazon EMR includes Spark as an application after version 3.8.0 (no need to run bootstrap actions)
• Use Spot instances for time and cost savings, especially when using Spark
• Run in YARN cluster mode (--master yarn-cluster) for production jobs – the Spark driver runs in the application master (high availability)
• Data serialization – use Kryo if possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer); see the sketch below
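A short PySpark sketch of the Kryo tip above. The application name is an illustrative assumption; yarn-cluster mode is still chosen at spark-submit time rather than in code.

# Enable Kryo serialization from application code (PySpark sketch).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("clickstream-etl")   # illustrative name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)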

Page 20: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

The Life of a Click at Hearst

• Hearst’s journey with their big data analytics platform on AWS
• Demo
• Clickstream analysis patterns
• Lessons learned

Page 21: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS
Page 22: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Have you heard of Hearst?

Page 23: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

BUSINESS MEDIA operates more than 20 business-to-business companies with significant holdings in the automotive, electronic, medical, and finance industries.

MAGAZINES publishes 20 U.S. titles and close to 300 international editions.

BROADCASTING comprises 31 television and two radio stations.

NEWSPAPERS owns 15 daily and 34 weekly newspapers.

Hearst includes over 200 businesses in over 100 countries around the world.

Page 24: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Data Services at Hearst – Our Mission

• Ensure that Hearst leverages its combined data assets
• Unify Hearst’s data streams
• Develop a big data analytics platform using AWS services
• Promote enterprise-wide product development
  • Example: product initiative led by all of Hearst’s editors – Buzzing@Hearst

Page 25: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS


Page 26: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Business Value of Buzzing

• Instant feedback on articles from our audiences
• Incremental re-syndication of popular articles across properties (e.g., trending newspaper articles can be adopted by magazines)
• Informs editors which articles are most relevant to our audiences and which channels our audiences use to read them
• Ultimately, drives incremental value
  • 25% more page views and 15% more visitors, which lead to incremental revenue

Page 27: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Engineering Requirements of Buzzing…

• Throughput goal: transport data from all 250+ Hearst properties worldwide
• Latency goal: click-to-tool in under 5 minutes
• Agile: easily add new data fields into the clickstream
• Unique metrics requirements defined by the Data Science team (e.g., standard deviations, regressions, etc.)
• Data reporting windows ranging from 1 hour to 1 week
• Front end developed “from scratch,” so data exposed through the API must support the development team’s unique requirements

Most importantly, operation of existing sites cannot be affected!

Page 28: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

What we had to work with…

…a “static” clickstream collection process on many Hearst sites: users visit Hearst properties, and the clickstream flows once per day into a Netezza data warehouse in the corporate data center – roughly 30 GB per day of basic web log data (e.g., referrer, URL, user agent, cookie, etc.), used for ad hoc SQL-based reporting and analytics.

…now how do we get there?

Page 29: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

…and we own Hearst’s tag management system

This not only gave us access to the clickstream but also to the JavaScript code that lives on our websites (users → Hearst properties → clickstream → JavaScript on web pages).

Page 30: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 1 – Ingest Clickstream Data Using AWS

Pipeline: users on Hearst properties → clickstream (“raw JSON”) → Node.js proxy app → Amazon Kinesis → Kinesis-to-S3 app (KCL libraries) → raw data in Amazon S3.

• Use the tag manager to easily deploy JavaScript to all sites
• The JavaScript on each site calls an exposed endpoint and passes in query parameters
• Elastic Beanstalk with Node.js exposes an HTTP endpoint, which asynchronously takes the data and feeds it to Amazon Kinesis
• Kinesis Client Libraries and Kinesis connectors persist the data to Amazon S3 for durability

Page 31: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Node.js – Push clickstream to Amazon Kinesis

function pushToKinesis(data) {
  var params = {
    Data: data,              /* required */
    PartitionKey: guid(),
    StreamName: streamName   /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
    }
  });
}

app.get('/hearstkin.gif', function(req, res) {
  async.series([function(callback) {
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function() {
  console.log('Express server listening on port ' + app.get('port'));
});

• Asynchronous calls – ensure no interruption to the user experience
• Server timestamp (proxyts) – creates a unified timestamp; Amazon Kinesis now offers this out of the box!
• JSON format – this helps us downstream
• Kinesis partition key – guid() is a good partition key to ensure even distribution across the shards

Page 32: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Ingest Monitoring – AWS

• Amazon Kinesis monitoring
• AWS Elastic Beanstalk monitoring
• Auto Scaling is triggered when network in exceeds 20 MB, scaling up to 40 instances

Page 33: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 1 – Summary

• Use JSON formatting for payloads so more fields can be easily added without impacting downstream processing
• The HTTP call requires minimal code introduced to the actual site implementations
• Flexible to meet rollout and growing demand
  • Elastic Beanstalk can be scaled
  • The Amazon Kinesis stream can be re-sharded (see the sketch below)
  • Amazon S3 provides high-durability storage for raw data
• With a reliable, scalable onboarding platform in place, we can now focus on ETL!
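One way to re-shard, sketched with boto3; this is not the talk's code, and the stream name (taken from the later configuration example) and the choice to split the first shard at its hash-range midpoint are illustrative assumptions.

# Split one shard into two to add capacity (boto3 sketch; names are illustrative).
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
shards = kinesis.describe_stream(StreamName="hearststream")["StreamDescription"]["Shards"]
shard = shards[0]

# Split at the midpoint of the shard's hash key range.
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="hearststream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)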

Page 34: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 2a – Data Processing, First Version (EMR)

Pipeline: “raw JSON” raw data → ETL on Amazon EMR → clean aggregate data.

• Amazon EMR was chosen initially for processing due to the ease of Amazon EMR cluster creation… and Pig because we knew how to code in Pig Latin
• 50+ UDFs were written using Python… also because we knew Python

Page 35: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Processing Clickstream Data with Pig

Unfortunately, Pig was not performing well – 15 min latency.

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

REGISTER '/home/hadoop/PROD/parser.py' USING jython AS pyudf;
REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs;

AA0 = LOAD 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' USING TextLoader AS (line:chararray);

A0 = FILTER AA0 BY pyudf.get_obj(line,'url') MATCHES '.*(a\\d+|g\\d+).*';

A1 = FOREACH A0 GENERATE
  pyudf.urlclean(pyudf.get_obj(line,'url')) AS url:chararray,
  pyudf.get_obj(line,'hash') AS hash:chararray,
  pyudf.get_obj(line,'icxid') AS icxid:chararray,
  pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) AS pubdt:chararray,
  pyudf.get_obj(line,'icctm_ht_cnocl') AS cnocl:chararray,
  pyudf.get_obj(line,'icctm_ht_athr') AS author:chararray,
  pyudf.get_obj(line,'icctm_ht_attl') AS title:chararray,
  pyudf.get_obj(line,'icctm_ht_aid') AS cms_id:chararray,
  pyudf.num1(pyudf.get_obj(line,'mxdpth')) AS mxdpth:double,
  pyudf.num2(pyudf.get_obj(line,'load')) AS topsecs:double,
  refs.classy(pyudf.get_obj(line,'url'),1) AS bu:chararray,
  pyudf.get_obj(line,'ip_address') AS ip:chararray,
  pyudf.get_obj(line,'img') AS img:chararray;

• Gzip your output
• Regex in Pig!
• Python imports are limited to what is allowed by Jython
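The parser.py UDFs themselves are not shown in the deck. Below is a hypothetical sketch of what a get_obj-style Jython UDF might look like, deliberately using only plain string operations because of the Jython import limitation noted above.

# Hypothetical sketch of a Jython UDF in the spirit of parser.py's get_obj
# (the real Hearst UDFs are not shown in the deck). Pig exposes the
# outputSchema decorator to scripts registered USING jython.
@outputSchema('value:chararray')
def get_obj(line, key):
    # Naive extraction of "key":"value" from a JSON-ish line, using only
    # plain string ops because Jython limits which Python modules can be imported.
    marker = '"%s":"' % key
    start = line.find(marker)
    if start < 0:
        return ''
    start += len(marker)
    end = line.find('"', start)
    return line[start:end] if end >= 0 else ''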

Page 36: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 2b – Data Processing (Spark Streaming)

Pipeline: users on Hearst properties → clickstream → Node.js proxy app → Amazon Kinesis → ETL on EMR → clean aggregate data.

• Welcome Apache Spark – one framework for batch and real time
• Benefits – using the same code for batch and real-time ETL
• Use Spot instances – cost savings
• Drawbacks – Scala!

Page 37: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Using SQL with Scala (Spark SQL)

Since we knew SQL, we wrote Scala with embedded SQL queries.

Configuration:

endpointUrl = kinesis.us-west-2.amazonaws.com
streamName = hearststream
outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson
window.length = 300
sliding.interval = 300
outputLimit = 5000000
query1Table = hearst1
query1 = SELECT \
  simplestartq(proxyts, 5) as startq,\
  urlclean(url) as url,\
  hash,\
  icxid,\
  pubclean(icctm_ht_dtpub) as pubdt,\
  classy(url,1) as bu,\
  ip_address as ip,\
  artcheck(classy(url,1),url) as artcheck,\
  ref_type(ref,url) as ref_type,\
  img, \
  wc, \
  contentSource \
  FROM hearst1

Scala driver:

val jsonRDD = sqlContext.jsonRDD(rdd1)
jsonRDD.registerTempTable(query1Table.trim)
val query1Result = sqlContext.sql(query1) //.limit(outputLimit.toInt)
query1Result.registerTempTable(query2Table.trim)
val query2Result = sqlContext.sql(query2)
query2Result.registerTempTable(query3Table.trim)
val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt)
val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt)
query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON, outPartitionFolder),
  classOf[org.apache.hadoop.io.compress.GzipCodec])
logger.info("New JSON file written to " + outputLoc + "/" + outPartitionFolder)
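For readers heading toward the PySpark version mentioned later in the talk, here is a hedged sketch of what the same Spark SQL step might look like in Python with Spark 1.x-era APIs. The input path, the simplified query (the custom UDFs such as urlclean and artcheck are omitted), and the output path are illustrative assumptions.

# Hypothetical PySpark counterpart of the Scala/Spark SQL step above
# (Spark 1.x APIs; paths and names are illustrative, custom UDFs omitted).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="clickstream-sql")
sqlContext = SQLContext(sc)

rdd1 = sc.textFile("s3://hearstkinesisdata/rawjson/")   # one JSON record per line
jsonDF = sqlContext.jsonRDD(rdd1)
jsonDF.registerTempTable("hearst1")

query1Result = sqlContext.sql("""
    SELECT url, hash, icxid, ip_address AS ip
    FROM hearst1
""")
query1Result.toJSON().saveAsTextFile(
    "s3://hearstkinesisdata/processedsparkjson/run1",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")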

Page 38: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Python UDF versus Scala

Python:

def artcheck(bu, url):
    try:
        if url and bu:
            cleanurl = url[0:url.find("?")].strip('/')
            tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/')
            revurl = cleanurl[::-1]
            root = revurl[0:revurl.find('/')][::-1]
            if (bu == 'HMI' or bu == 'HMG') and re.compile('a\d+|g\d+').search(tailurl) != None: return 'T'
            elif bu == 'HTV' and root.isdigit() == True and re.compile('/search/').search(cleanurl) == None: return 'T'
            elif bu == 'HNP' and re.compile('blog|fuelfix').search(url) != None and re.compile(r'\S*[0-9]{4,4}\/[0-9]{2,2}\/[0-9]{2,2}\S*').search(tailurl) != None: return 'T'
            elif bu == 'HNP' and re.compile('businessinsider').search(url) != None and re.compile(r'\S*[0-9]{4,4}\-[0-9]{2,2}').search(root) != None: return 'T'
            elif bu == 'HNP' and re.compile('blog|fuelfix|businessinsider').search(url) == None and re.compile('.php').search(url) != None: return 'T'
            else: return 'F'
        else: return 'F'
    except:
        return 'F'

Scala:

def artcheck(bu: String, url: String) = {
  try {
    val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/")
    val pathClean = UDFUtils.pathURI(cleanurl)
    val lastContext = pathClean.split("/").last
    var resp = "F"
    if (("HMI" == bu || "HMG" == bu) && Pattern.compile("/a\\d+|/g\\d+").matcher(pathClean).find()) resp = "T"
    else if ("HTV" == bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp = "T"
    else if ("HNP" == bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("\\d{4}/\\d{2}/\\d{2}").matcher(pathClean).find()) resp = "T"
    else if ("HNP" == bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("\\d{4}-\\d{2}").matcher(lastContext).find()) resp = "T"
    else if ("HNP" == bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find() && Pattern.compile(".php").matcher(url).find()) resp = "T"
    resp
  } catch {
    case e: Exception => "F"
  }
}

Don’t be intimidated by Scala… if you know Python, the syntax can be similar:
• re.compile('a\d+|g\d+') ↔ Pattern.compile("a\\d+|g\\d+")
• try: / except: ↔ try {} catch {}

Page 39: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 3a – Data Science!

Pipeline: Amazon Kinesis → ETL on EMR → clean aggregate data → data science on Amazon EC2 → API-ready data.

• We decided to perform our data science using SAS on Amazon EC2 initially because of the ability to perform both data manipulation and complex data science techniques (e.g., regressions) easily
• Great for exploration and initial development
• Performing data science using this method took 3–5 minutes to complete

Page 40: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

SAS Code Example

Use a pipe to read in S3 data and keep it compressed:

data _null_;
  call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz");
run;

FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767;

data temp1;
  FORMAT startq DATETIME19.;
  infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1;
  input
    startq    :YMDDTTM.
    url       :$1000.
    pageviews :best32.
    visits    :best32.
    author    :$100.
    cms_id    :$100.
    img       :$1000.
    title     :$1000.;
run;

Use PROC SQL when possible for easier translation to Amazon Redshift for production later on:

proc sql;
  CREATE TABLE metrics AS
  SELECT
    url FORMAT=$1000.,
    SUM(pageviews) as pageviews,
    SUM(visits) as visits,
    SUM(fvisits) as fvisits,
    SUM(evisits) as evisits,
    MIN(ttct) as rec,
    COUNT(distinct startq) as frq,
    AVG(visits) as avg_visits_pp,
    SUM(visits1) as visits_soc,
    SUM(visits2) as visits_dir,
    SUM(visits3) as visits_int,
    SUM(visits4) as visits_sea,
    SUM(visits5) as visits_web,
    SUM(visits6) as visits_nws,
    SUM(visits7) as visits_pd,
    SUM(visits8) as visits_soc_fb,
    SUM(visits9) as visits_soc_tw,
    SUM(visits10) as visits_soc_pi,
    SUM(visits11) as visits_soc_re,
    SUM(visits12) as visits_soc_yt,
    SUM(visits13) as visits_soc_su,
    SUM(visits14) as visits_soc_gp,
    SUM(visits15) as visits_soc_li,
    SUM(visits16) as visits_soc_tb,
    SUM(visits17) as visits_soc_ot,
    CASE WHEN (SUM(v1) - SUM(v3)) > 20 THEN (SUM(v1) - SUM(v3)) / 2 ELSE 0 END as trending
  FROM temp1
  GROUP BY 1;

Page 41: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 3b – Split Data Science into Development and Production

Pipeline: Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science “Production” on Amazon Redshift → API-ready data, with Data Science “Development” on EC2 running statistical models once per day.

• Once the data science models were established, we split modeling and production
• Production was moved to Amazon Redshift, which provided a much faster ability to read Amazon S3 data and process it
• Data science processing time went down to 100 seconds!
• Use S3 to store the data science models and apply them using Amazon Redshift

Page 42: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Amazon Redshift Code Example

A cool trick to find the most recent value of a character field in one pass through the data:

select
  clean_url as url,
  trim(substring(max(proxyts||domain) from 20 for 1000)) as domain,
  trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl,
  trim(substring(max(proxyts||img) from 20 for 1000)) as img,
  trim(substring(max(proxyts||title) from 20 for 1000)) as title,
  trim(substring(max(proxyts||section) from 20 for 1000)) as section,
  approximate count(distinct ic_fpc) as visits,
  count(1) as hits
from kinesis_hits
where bu='HMG' and (article_id is not null or author is not null or title is not null)
group by 1;
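The talk later mentions that the Amazon Redshift code ran inside a Python wrapper. A hedged sketch of how such a wrapper might execute the query above with psycopg2 follows; the cluster endpoint, database, credentials, and SQL file name are all placeholders.

# Hypothetical sketch of the "Python wrapper" around Amazon Redshift:
# run the query above and fetch the rows. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="clickstream", user="etl_user", password="...")

query = open("hmg_article_rollup.sql").read()   # the SELECT shown above
with conn, conn.cursor() as cur:
    cur.execute(query)
    rows = cur.fetchall()
print(len(rows), "articles aggregated")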

Page 43: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 4a – Elasticsearch Integration

Pipeline: ETL on EMR → S3 storage (aggregate data) → data science on Amazon Redshift (models) → S3 storage (API-ready data) → Amazon EMR push → Buzzing API.

Since we had the Amazon EMR cluster running already, we used a handy Pig jar that made it easy to push data to Elasticsearch.

Page 44: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Pig Code – Push to ES Example

Use the handy Pig jar (elasticsearch-hadoop) to push data to Elasticsearch:

REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;

DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage
  ('es.nodes = es-dev.hearst.io',
   'es.port = 9200',
   'es.http.timeout = 5m',
   'es.index.auto.create = true');

SECTIONS = LOAD 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') AS
  (sectionid:chararray, cnt:long, visits:long, sectionname:chararray);

STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStorageDEV;

The “Amazon EMR overhead” required to read small files added 2 minutes to latency.

Page 45: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Phase 4b – Elasticsearch Integration Sped Up

Pipeline: ETL on EMR → S3 storage (aggregate data) → data science on Amazon Redshift (models) → API-ready data pushed directly to Elasticsearch → Buzzing API.

Since the Amazon Redshift code was run in a Python wrapper, the solution was to push data directly into Elasticsearch.

Page 46: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Script to Push to Elasticsearch Directly

# Convert the file into a bulk-insert compatible format
$bin/convert_json.php big.json create rowbyrow.json

# Get the mapping file
${aws} s3 cp s3://hearst/es_mapping es_mapping

# Create the new ES index
curl -XPUT http://es.hearst.io/content-web180-sync --data-binary @es_mapping -s

# Perform the bulk API call
curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary @rowbyrow.json -s

• Converting one big input JSON file to row-by-row JSON is a key step for making the data bulk-API compatible
• Use a mapping file to manage the formatting in your index… very important for dates and numeric values that look like strings
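The same two steps, sketched in Python with requests instead of curl; the endpoint and file names follow the shell script above, but this is an illustrative alternative rather than the talk's code.

# Hypothetical Python version of the shell script above: create the index
# from the mapping file, then bulk-load the row-by-row JSON.
import requests

es_index = "http://es.hearst.io/content-web180-sync"

with open("es_mapping") as f:
    requests.put(es_index, data=f.read())       # create index with mapping

with open("rowbyrow.json") as f:
    resp = requests.post(es_index + "/_bulk", data=f.read())   # bulk insert
print(resp.status_code)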

Page 47: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Final Data Pipeline

[Pipeline diagram: users on Hearst properties → clickstream → Node.js proxy app → Amazon Kinesis → S3 storage → ETL on EMR → data science application on Amazon Redshift (models, aggregate data) → API-ready data → Buzzing API. Latency/throughput callouts across the stages: milliseconds at 100 GB/day (ingest), 5 seconds at 1 GB/day, 30 seconds at 5 GB/day, and 100 seconds at 1 GB/day (data science).]

Page 48: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

A more “visual” representation of our pipeline!

[Diagram: clickstream data → Amazon Kinesis → ETL → data science → Amazon Redshift → results API]

Page 49: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Lessons Learned

“No duh’s”: removing “stoppage” points, speeding up processing, and combining processes improve latency.

Version  | Transport      | Storage | ETL              | Storage | Analysis        | Storage | Exposure             | Latency
V1       | Amazon Kinesis | S3      | EMR (Pig)        | S3      | EC2 (SAS)       | S3      | EMR to Elasticsearch | 1 hour
Today    | Amazon Kinesis | –       | Spark (Scala)    | S3      | Amazon Redshift | –       | Elasticsearch        | <5 min
Tomorrow | Amazon Kinesis | –       | PySpark + SparkR | –       | –               | –       | Elasticsearch        | <2 min

Page 50: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Data Science Tool Box

[Same pipeline diagram as before, with a Data Science Toolbox layer sitting over the data, the models, and Amazon Redshift.]

• IPython Notebook
• On Spark and Amazon Redshift
• Code sharing (and insights)
• User-friendly development environment for data scientists
• Auto-convert .ipynb → .py (see the sketch below)
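One way the .ipynb → .py auto-convert step could be done is with nbconvert; a hedged sketch follows, with an illustrative notebook name and no claim that this is the tool Hearst used.

# Convert a notebook to a plain .py script (sketch; notebook name is illustrative).
# Requires Jupyter's nbconvert to be installed.
import subprocess

subprocess.check_call(
    ["jupyter", "nbconvert", "--to", "script", "trending_model.ipynb"])
# Produces trending_model.py alongside the notebook.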

Page 51: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Data Science at Hearst – Notebook

Page 52: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Next Steps

• Amazon EMR 4.1.0 with Spark 1.5 has been released, so we can do more with PySpark; look at Apache Zeppelin on Amazon EMR
• Amazon Kinesis just released a new feature to retain data for up to 7 days – we could do more ETL “in the stream” (see the sketch below)
• Amazon Kinesis Firehose and AWS Lambda – zero touch (no Amazon EC2 maintenance)
• More complex data science that requires…
  • Amazon Redshift UDFs
  • A Python shell that calls Amazon Redshift but also allows for complex statistical methods (e.g., using R or machine learning)

Page 53: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Conclusion

• Clickstreams are the new “data currency” of business
• AWS provides great technology to process data
  • High speed
  • Lower costs – using Spot…
  • Very agile
• Do more with less: this can all be done with a team of 2 FTEs!
  • 1 developer (well versed in AWS) + 1 data scientist

Page 54: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Call To Action

Click → Insight (over time): Ingest → Store → Process → Analyze

Services across the pipeline: Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL apps, Amazon EMR, Amazon Redshift.

Page 55: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Call To Action

Use Amazon Kinesis, Amazon EMR, and Amazon Redshift for clickstream.

• Open source connectors: http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
• AWS Big Data Blog: http://blogs.aws.amazon.com/bigdata/
• AWS re:Invent Big Data booth
• AWS Big Data Marketplace and partner ecosystem
• Hearst booth – Hall C1156: learn more about the interesting things we are doing with data!

Page 56: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Remember to complete your evaluations!

Page 57: (BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS

Thank you!