Clickstream & Social Media Analysis using Apache Spark

Clickstream & Social Media Analysis: Use cases and examples using Apache Spark. Michael Cutler @ TUMRA – November 2014

DESCRIPTION

Use cases and examples using Apache Spark, presented at the Hadoop User Group (UK) November 2014 Hadoop Meetup http://www.meetup.com/hadoop-users-group-uk/events/217791892/

TRANSCRIPT

Page 1: Clickstream & Social Media Analysis using Apache Spark

Clickstream & Social Media Analysis

Use cases and examples using Apache Spark

Michael Cutler @ TUMRA – November 2014

Page 2: Clickstream & Social Media Analysis using Apache Spark

Hello

About Me

•  Early adopter of Hadoop

•  Spoke at Hadoop World on machine learning

•  Twitter: @cotdp

TUMRA

We use Data Science and Big Data technology to help ecommerce companies understand their customers and increase sales.

This Talk

•  Slides are on Slideshare

•  Code example on Github

•  Twitter: @tumra

Page 3: Clickstream & Social Media Analysis using Apache Spark

Background 1

Introducing Apache Spark 2

Examples 3

Page 4: Clickstream & Social Media Analysis using Apache Spark

Background 1

Page 5: Clickstream & Social Media Analysis using Apache Spark

Clickstream & Social Media Analysis

Generalised Approach

[Diagram: People (You) interact through a Web Site, a Mobile/Tablet App, and a Social Network. Stages: Data Collection, Data Processing, Reporting & Analysis. Data forms: Events, Files, Tables.]

Page 6: Clickstream & Social Media Analysis using Apache Spark

How has this approach evolved?

Rapidly reducing the ‘time to insight’

pre-Historic Hadoop

•  Proprietary & Expensive

•  Slow & Constrained

Time to Insight: 48+ hours

2008 - Hadoop

•  Open-source & Inexpensive

•  Flexible but complex to use

Time to Insight: hours

2014 - Spark

•  Batch, Streaming & Interactive

•  Fast & Easy to use

Time to Insight: minutes

Page 7: Clickstream & Social Media Analysis using Apache Spark

Weaving a story from a string of activities

Understanding the shopper's journey

Timeline: Day #0, Day #7, Day #10, Day #13, Day #17

Activities, in order:

•  PPC long-tail keyword

•  PPC brand keyword & signed up email

•  Opened Email Newsletter on iPad

•  PPC brand keyword

•  Add To Cart

•  Order Placed

Page 8: Clickstream & Social Media Analysis using Apache Spark

Shopper Journey

Understanding the shopper's journey

[Diagram: axis: Time. Roles: Consumer, Shopper. Stages: Need, Research, Consideration, Purchase.]

Page 9: Clickstream & Social Media Analysis using Apache Spark

It’s all about People & Products

Not just boring log files!

Activity & Interactions

Turn low-level events like “Page Views” into something meaningful

e.g. <Person1234> <viewed-a> <Product:Camera>

Bought a …

Gauging Interest

Measuring the degree of interest a Person has in a Product

e.g. are 10 views for a certain Product a good or bad thing?

Affinities

Either inferred from other People's activities, or Product similarity

Properties

Both people and products have properties,

e.g. <Person1234> <is:gender> <Female>
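
The <Person> <verb> <Product> statements on this slide map naturally onto a small record type. The following is a minimal sketch in plain Scala, not code from the talk: the `Interaction` class, its field names, and the `countByPair` helper are all hypothetical, and counting views per person and product is only one possible starting point for gauging interest.

```scala
// Hypothetical model: names are illustrative, not taken from the talk.
final case class Interaction(person: String, verb: String, product: String)

object InterestExample {
  // Count how often each (person, product) pair occurs for a given verb.
  def countByPair(events: Seq[Interaction], verb: String): Map[(String, String), Int] =
    events
      .filter(_.verb == verb)
      .groupBy(e => (e.person, e.product))
      .map { case (pair, matching) => pair -> matching.size }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Interaction("Person1234", "viewed-a", "Product:Camera"),
      Interaction("Person1234", "viewed-a", "Product:Camera"),
      Interaction("Person1234", "bought-a", "Product:Camera"),
      Interaction("Person5678", "viewed-a", "Product:Tripod")
    )
    countByPair(events, "viewed-a").foreach { case ((person, product), n) =>
      println(s"$person viewed $product $n time(s)")
    }
  }
}
```

Whether a raw count such as 10 views signals strong interest depends on the baseline for that product, which is the question the slide raises; the sketch only produces the counts.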

Page 10: Clickstream & Social Media Analysis using Apache Spark

People & Product Interactions

e.g. “Michael” “bought a” “Americano” “Starbucks, Shoreditch”

Source: Snowplow Analytics

Page 11: Clickstream & Social Media Analysis using Apache Spark

That sounds like a Graph …

Use graphs to understand user intent

Interest Graph Visualisation

•  Collect user activity data in real-time, not just clicks but mouse-overs, images, video, social.

•  Algorithms identify products, categories and brands a particular person is interested in.

•  Cluster users into ‘neighborhoods’ to infer what to show to existing and future visitors.

This visualization illustrates just 1% of 6 weeks of visitor activity data. Blue data points are People, Orange data points are Products.

Page 12: Clickstream & Social Media Analysis using Apache Spark

Introducing Apache Spark 2

Page 13: Clickstream & Social Media Analysis using Apache Spark

Three reasons Apache Spark is awesome!

Apart from “no more Java Map/Reduce code!!!”

Fast

•  In-memory Caching

•  DAG execution optimisation

•  Easy to use in Scala, Java, Python

Smart

•  Machine Learning baked in

•  Graph algorithms

•  Interactive Shell

Flexible

•  Query from Spark SQL

•  Streaming

•  Batch (file based)

Page 14: Clickstream & Social Media Analysis using Apache Spark

Apache Spark Architecture Overview

Apache ZooKeeper Hadoop Filesystem (HDFS)

Yarn / Mesos (optional)

Page 15: Clickstream & Social Media Analysis using Apache Spark

Apache Spark Coexists with your existing Hadoop Infrastructure

Apache ZooKeeper

Hadoop Filesystem (HDFS)

Map / Reduce

Apache Hive etc.

Yarn / Mesos

Page 16: Clickstream & Social Media Analysis using Apache Spark

Apache Spark can …

Simple example of Spark SQL used from Scala

Source: Databricks

Go from a SQL query…

… to a trained machine learning model in three lines of code.
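
The Databricks code shown on this slide is not preserved in the transcript. As a rough illustration of the same idea, the sketch below queries a table with Spark SQL, turns the rows into feature vectors, and trains a k-means model with MLlib. The `user_activity` table, its columns, the sample values, and the choice of k-means are assumptions made for this example, and it uses the current `SparkSession` API rather than the 2014-era one.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SqlToModel {
  // Query a (hypothetical) table and cluster the result into two groups.
  def train(spark: SparkSession): KMeansModel = {
    val rows = spark.sql("SELECT views, purchases FROM user_activity")
    val features = rows.rdd.map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
    KMeans.train(features, 2, 20) // k = 2, at most 20 iterations
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-to-model")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for a real table.
    Seq((1.0, 0.0), (2.0, 1.0), (40.0, 5.0), (38.0, 7.0))
      .toDF("views", "purchases")
      .createOrReplaceTempView("user_activity")

    train(spark).clusterCenters.foreach(println)
    spark.stop()
  }
}
```

The three lines inside `train` are the part that corresponds to the slide's claim; everything else is setup so the sketch can run locally.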

Page 17: Clickstream & Social Media Analysis using Apache Spark

Examples 3

Page 18: Clickstream & Social Media Analysis using Apache Spark

Example Architecture Coexists with your existing Hadoop Infrastructure

Apache ZooKeeper

Hadoop Filesystem (HDFS)

NoSQL Store (Cassandra)

Reporting Dashboard

Apache Kafka

Analytics Jobs

Page 19: Clickstream & Social Media Analysis using Apache Spark

Social Media Analysis Converting a low-level event into a meaningful high-level interaction

•  A user-interaction from the Facebook firehose, received as a real-time stream of JSON

•  Streamed into Apache Kafka, also stored in SequenceFiles

•  Modeled into Scala Case Class:
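
The case class from the original slide is not preserved in this transcript. The sketch below shows what such a model could look like; every field name is an assumption for illustration and does not reflect the actual Facebook payload or the speaker's class.

```scala
// Hypothetical model: field names are assumptions, not the talk's actual class.
final case class SocialEvent(
  eventId: String,
  userId: String,
  verb: String,       // e.g. "liked-a", "commented-on"
  objectId: String,
  timestampMs: Long
) {
  // Render the event as a <person> <verb> <object> statement.
  def toStatement: String = s"<$userId> <$verb> <$objectId>"
}

object SocialEventExample {
  def main(args: Array[String]): Unit = {
    val event = SocialEvent("evt-1", "Person1234", "liked-a", "Page:Starbucks", 1415000000000L)
    println(event.toStatement) // prints "<Person1234> <liked-a> <Page:Starbucks>"
  }
}
```

A flat, immutable record like this is easy to serialise, to key on, and to register as a table later on.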

Page 20: Clickstream & Social Media Analysis using Apache Spark

Example - Spark (Scala)

Using the Spark (Scala) interface to analyze the data

•  Parse JSON

•  Extract interesting attributes

•  ‘Reduce by Key’ to sum the result

•  Print results
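
The four steps above can be sketched as follows. This is not the code from the slide: the JSON field name `verb` and the sample records are assumptions, and parsing uses json4s, which ships as a Spark dependency. The input is an in-memory collection so the sketch runs locally; a real job would read from HDFS or Kafka instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import scala.util.Try

object VerbCount {
  // Parse one JSON line and extract the (hypothetical) "verb" attribute.
  // Returns None for malformed JSON or records without a string "verb".
  def verbOf(line: String): Option[String] =
    Try(parse(line) \ "verb").toOption.collect { case JString(v) => v }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("verb-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample events standing in for the real stream.
      val lines = sc.parallelize(Seq(
        """{"userId":"u1","verb":"like","objectId":"post-1"}""",
        """{"userId":"u2","verb":"comment","objectId":"post-1"}""",
        """{"userId":"u3","verb":"like","objectId":"post-2"}""",
        "not json"
      ))

      val counts = lines
        .flatMap(verbOf(_).toSeq)  // parse JSON, extract the attribute, drop bad records
        .map(verb => (verb, 1))    // key each event by its verb
        .reduceByKey(_ + _)        // 'Reduce by Key' to sum the result

      counts.collect().foreach { case (verb, n) => println(s"$verb\t$n") }
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the parser in a top-level object means the closure passed to `flatMap` carries no state that Spark would have to serialise.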

Page 21: Clickstream & Social Media Analysis using Apache Spark

Example - Spark SQL Using the Spark SQL interface to analyze the data

•  Parse JSON

•  Extract interesting attributes, transform into Case Classes

•  ‘Register as table’

•  Execute SQL, print results
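
The same analysis expressed through Spark SQL might look like the sketch below. It is written against the current `SparkSession` API, where `createOrReplaceTempView` plays the role of the 2014-era 'register as table' step (`registerAsTable`, later `registerTempTable`). The `Interaction` case class, the field names, and the sample records are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; must be top-level so Spark can derive an encoder.
final case class Interaction(userId: String, verb: String, objectId: String)

object SparkSqlExample {
  // Parse JSON lines, register them as a table, and count events per verb.
  def verbCounts(spark: SparkSession, jsonLines: Seq[String]): Map[String, Long] = {
    import spark.implicits._
    val interactions = spark.read
      .json(jsonLines.toDS())                // parse JSON, schema is inferred
      .select("userId", "verb", "objectId")  // extract interesting attributes
      .as[Interaction]                       // transform into case classes
    interactions.createOrReplaceTempView("interactions") // 'register as table'

    spark.sql("SELECT verb, COUNT(*) AS total FROM interactions GROUP BY verb")
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .master("local[*]")
      .getOrCreate()
    // Made-up sample events standing in for the real data.
    val sample = Seq(
      """{"userId":"u1","verb":"like","objectId":"post-1"}""",
      """{"userId":"u2","verb":"comment","objectId":"post-1"}""",
      """{"userId":"u3","verb":"like","objectId":"post-2"}"""
    )
    verbCounts(spark, sample).foreach { case (verb, n) => println(s"$verb\t$n") }
    spark.stop()
  }
}
```

Once the table is registered, analysts can query it with SQL without touching the parsing code.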

Page 22: Clickstream & Social Media Analysis using Apache Spark

Want to play with awesome tech and data?

We’re hiring! [email protected]

Data Engineer

Scala, functional programming, Hadoop, NoSQL

Sales & Marketing

Experience with SaaS and ecommerce sales