clickstream data with spark

Upload: marissa-saunders

Post on 21-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Clickstream data with spark

MAKING BIG DATA COME ALIVE

Clustering click-stream data using Spark Marissa Saunders

Slides available at: http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark

Page 2: Clickstream data with spark

Agenda

• Why?
  – Why clustering?
  – Why Spark?
  – Why click-stream?
• What?
  – What is the raw data?
• How?
  – Parsing user agent data on Spark
  – Distributed K-modes on Spark
• So what?
  – Details of applying the method to this use case
  – Resulting clusters
  – Time access patterns
  – Preferred websites
• Questions

Page 3: Clickstream data with spark

Objectives

Understand:
• k-means and k-modes clustering
• why Spark is a good choice
• different data structures in Spark
  – RDD, dataframe and dataset
• clickstream data and how user-agent parsing works

Demonstrate:
• mapping a function over an RDD
• defining a custom UDF and mapping it over a dataframe
• mapping a Python function over a partition
• how identifying different user types can drive insight into user behavior

Page 4: Clickstream data with spark

Why Clustering

Page 5: Clickstream data with spark

Why clustering?

We have a plot like this …
• 2 groups of data
• Clustering can find them
• This can lead to insight …
  – There are two different groups of unladen swallows
  – The heavy species flies more slowly
  – When asking for airspeed, we should specify if we mean African or European swallows

… with apologies to Monty Python

[Figure: Flight velocities vs. bird mass; legend: Bird Type]

Page 11: Clickstream data with spark

How does it work?

For 2 clusters:
1. Pick 2 points at random as centroids
2. Assign each data point to the cluster of its closest centroid
3. Recalculate each centroid as the mean of its cluster
4. Repeat 2 and 3 to convergence

Converged

This is called K-means clustering … and there is a Spark function for this
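The four steps above can be sketched in plain Python. This is only an illustration of the algorithm on a single machine, not the Spark function the slide refers to; the function name `kmeans` and the swallow numbers are made up for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k points at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Made-up (mass, airspeed) values for two kinds of swallow.
swallows = [(20.0, 11.0), (21.0, 10.5), (19.5, 11.2),
            (40.0, 8.0), (41.0, 7.5), (39.0, 8.3)]
centroids, labels = kmeans(swallows, k=2)
```

With these values the three light birds end up in one cluster and the three heavy birds in the other, whichever two points are drawn as the starting centroids.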

Page 12: Clickstream data with spark

What about categorical data?

• Use modes instead of means
  – Most frequently occurring value
• Use binary distance metric for each dimension
  – 0 = the same
  – 1 = not the same
• Use the same iterative cluster assignment algorithm

This is called K-modes clustering … and we’ve open-sourced a Spark function for this

Color        Mass   Speed  Type
Green/Grey   Heavy  Slow   African
Green/Grey   Heavy  Fast   African
Green/Black  Heavy  Slow   African
Green/Grey   Light  Slow   African
Blue/White   Heavy  Fast   European
Blue/White   Light  Fast   European
Blue/Grey    Light  Slow   European
Blue/White   Light  Fast   European
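As a small illustration of the two building blocks named above, the sketch below computes the per-column mode and the binary mismatch distance for the swallow table. The helper names `mismatch_distance` and `column_modes` are invented for this example and are not taken from the open-sourced package.

```python
from collections import Counter

def mismatch_distance(a, b):
    """Binary distance: add 1 for every dimension where the values differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Per-column mode (most frequently occurring value) of a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# (color, mass, speed) rows from the table above, grouped by type.
african = [("Green/Grey", "Heavy", "Slow"), ("Green/Grey", "Heavy", "Fast"),
           ("Green/Black", "Heavy", "Slow"), ("Green/Grey", "Light", "Slow")]
european = [("Blue/White", "Heavy", "Fast"), ("Blue/White", "Light", "Fast"),
            ("Blue/Grey", "Light", "Slow"), ("Blue/White", "Light", "Fast")]

african_mode = column_modes(african)    # ("Green/Grey", "Heavy", "Slow")
european_mode = column_modes(european)  # ("Blue/White", "Light", "Fast")

# A light, slow, green/grey bird is 1 mismatch from the African mode
# and 2 mismatches from the European mode, so it is assigned to "African".
bird = ("Green/Grey", "Light", "Slow")
d_african = mismatch_distance(bird, african_mode)
d_european = mismatch_distance(bird, european_mode)
```

K-modes then runs the same assign-and-update loop as k-means, with `column_modes` in place of the mean and `mismatch_distance` in place of the Euclidean distance.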

Page 13: Clickstream data with spark

Why Spark?

Page 14: Clickstream data with spark

What is Spark?

"Apache Spark™ is a fast and general engine for large-scale data processing." - spark.apache.org

• Distributed computing
• Reads data from HDFS (or another distributed file system)
• In-memory
• Optimized execution
• High level functionality

Page 15: Clickstream data with spark

Why Spark?

It is FAST

• Take the computation to the data
• Spark works faster on partitioned data than map-reduce
  – In-memory operation avoids I/O costs
  – DAG optimization reduces computational costs
• Fast to develop
  – Data transformation and machine learning libraries are part of Spark

[Figure: data blocks 1-8 distributed across the cluster; see http://spark.apache.org/docs/latest/cluster-overview.html]

Page 16: Clickstream data with spark

Basic data structures in Spark

• Resilient Distributed Dataset (RDD)
• Dataframe = RDD with a schema
  – SQL-style syntax
  – Refer to column by name
  – Optimized queries
• Dataset = best of both worlds?!?

What makes it resilient?
• Multiple copies
• Stores lineage

[Figure: a full data set of blocks 1-8, with each block stored on more than one node]

Page 17: Clickstream data with spark

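A little terminology …

• node: one machine in the cluster
• partition: one block of the data set held on a node
• record: a single entry within a partition

[Figure: the full data set split into blocks across nodes, with a node, a partition and a record labeled]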

Page 18: Clickstream data with spark

Why Clickstream?

Page 19: Clickstream data with spark

What is clickstream data?

• Information trail left behind by each user
• Semi-structured website log files
• Includes:
  – User agent information
    - Device
    - OS
    - Browser
  – Geo information
    - Timezone
    - Latitude/Longitude
    - City
    - Country
  – Time of access
  – Referring website
  – Website accessed

Photo credit: Tim Franklin Photography via Foter.com

Page 20: Clickstream data with spark

What is this good for?

• Web analytics can answer questions like:
  – How long do users take from first visit to purchase?
  – When do users visit the website?
  – What marketing channels are effective in attracting users?
  – Where are users located?
  – What are the paths that users take through the website?
  – How long do users stay on a specific page?
  – Which pages draw the most users?
  – etc…

Page 21: Clickstream data with spark

The sample use case

Clickstream data from 1usagov
– 1.usa.gov short links are created whenever anyone shortens a .gov or .mil URL with bitly; the data records clicks on those links
– Feed at http://developer.usa.gov/1usagov
– Archive for 2011-2013: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D

Why this is a great dataset:
– Large volume
– Realistic format
  - Streaming
  - Not cleaned
– Interesting questions
  - What subtypes of users are there?
  - How do the activity patterns of these subtypes differ?
– Publicly available archive

Page 22: Clickstream data with spark

What is the raw data?

Page 27: Clickstream data with spark

What is the raw data?

{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL"}

• json format
• Fields include:
  – Website clicked/long url ("u")
  – Referring url ("r")
  – User agent – what machine is this? ("a")
  – Time accessed ("t")
  – etc…

Page 28: Clickstream data with spark

Parsing click stream data on Spark

Page 29: Clickstream data with spark

High level picture

• Need to extract:
  – Time in date, hours
  – Information about the user:
    - Device type
    - OS
    - Timezone
  – Main domain of the url
  – Referring url
• Do this for one record in Python
• Map this function over all records using Spark

Input record:
{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL","tz":"America/New_York"}

Extracted fields:
Day: Friday
Local_hour: 16
Device_type: pc
Browser: IE
OS: Windows 7
Is_bot: false
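A minimal sketch of the "do this for one record in Python" step, covering only the time and URL fields. The function names, the output keys and the `fallback_offset_hours` parameter are assumptions made for this example; the fallback is there only so the sketch still runs on systems without a timezone database.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def resolve_tz(name, fallback_offset_hours=0):
    """Look up an IANA timezone; fall back to a fixed UTC offset if unavailable."""
    try:
        from zoneinfo import ZoneInfo  # Python 3.9+, needs a tz database
        return ZoneInfo(name)
    except Exception:
        return timezone(timedelta(hours=fallback_offset_hours))

def parse_record(line, fallback_offset_hours=0):
    """Extract day, local hour, main domain and referrer from one JSON record."""
    rec = json.loads(line)
    tz = resolve_tz(rec.get("tz", "UTC"), fallback_offset_hours)
    local = datetime.fromtimestamp(rec["t"], tz)
    return {
        "day": DAYS[local.weekday()],
        "local_hour": local.hour,
        "domain": urlparse(rec["u"]).netloc,
        "referrer": rec["r"],
    }

# A trimmed copy of the record above (only the fields used here).
line = ('{"u": "http://www.cdc.gov/cdcgrandrounds/index.htm", "r": "direct", '
        '"t": 1460753233, "tz": "America/New_York"}')
parsed = parse_record(line, fallback_offset_hours=-4)
# parsed == {"day": "Friday", "local_hour": 16,
#            "domain": "www.cdc.gov", "referrer": "direct"}
```

On an RDD of raw JSON lines, the same function could be applied to every record with `rdd.map(parse_record)`.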

Page 30: Clickstream data with spark

Actual transformation

• Define parsing function
• Map parsing function over RDD

[Code screenshot; annotations:]
• Leverage user-agents library
• Apply user_agent library for every record
• RDD containing parsed json data
• Keep every entry as item in list
• Apply custom function to user agent string

Page 33: Clickstream data with spark

Distributed K-modes

Page 34: Clickstream data with spark

How does clustering have to change to be distributed?

K-means example:
Clustering is a collective operation. How can we distribute it?

Page 35: Clickstream data with spark

How does clustering have to change to be distributed?

K-means example:
1. Do k-means on each partition
2. Cluster the collected centroids
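The two-stage idea can be tried out without a cluster. In the sketch below plain Python lists stand in for Spark partitions, and `kmeans_1d` is a deliberately tiny one-dimensional k-means with a deterministic start, written for this example only. Stage 2 treats every local centroid equally, ignoring how many points it represents, which is a simplification.

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means; initial centroids are spread over the sorted values."""
    ordered = sorted(values)
    # Deterministic start: k evenly spaced picks from the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Mean of each group; an empty group keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

# Three "partitions" of one data set, as Spark might hold them on different nodes.
partitions = [
    [1.0, 1.2, 0.8, 9.0, 9.3],
    [1.1, 0.9, 8.8, 9.1, 9.2],
    [1.3, 0.7, 9.4, 8.9, 1.0],
]

# Stage 1: cluster each partition independently (this part can run in parallel).
local_centroids = [c for part in partitions for c in kmeans_1d(part, k=2)]

# Stage 2: cluster the collected centroids to get the global centroids.
global_centroids = sorted(kmeans_1d(local_centroids, k=2))
```

Only the six local centroids travel between the stages, not the fifteen raw values, which is what keeps the collective step cheap.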

Page 36: Clickstream data with spark

Mapping over data in Spark

• Map over a record:

    def f(record):
        return transform(record)

    rdd2 = rdd1.map(f)

Page 37: Clickstream data with spark

Mapping over data in Spark

[Figure: map applied to a single block (Block1) of the full data set, which is spread across nodes]

What is the equivalent here?

Spark has two possibilities:
1. mapPartitions:
   • get each record in turn and do something; return after all records are done
2. mapPartitionsWithIndex:
   • Keep track of which partition returned which result

Page 38: Clickstream data with spark

Mapping over data in Spark

• Map over a record:

    def f(record):
        return transform(record)

    rdd2 = rdd1.map(f)

• Map over a partition:

    def f(iterator):
        yield cluster(iterator)

    rdd2 = rdd1.mapPartitions(f)

• Map over a partition with a partition key:

    def f(splitIndex, iterator):
        # splitIndex is the partition number supplied by Spark
        yield (splitIndex, cluster(iterator))

    rdd2 = rdd1.mapPartitionsWithIndex(f)

Iterator = cycle once through each record

For K-modes, we have open-sourced an implementation of distributed clustering: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
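To see what the partition-level functions receive, the calling convention can be imitated locally. The two helpers below are stand-ins written for this example (they are not Spark APIs); they call `f` once per partition with an iterator, the way `mapPartitions` and `mapPartitionsWithIndex` do, and flatten the yielded results.

```python
def map_partitions(partitions, f):
    """Local stand-in for rdd.mapPartitions(f): call f once per partition."""
    return [result for part in partitions for result in f(iter(part))]

def map_partitions_with_index(partitions, f):
    """Local stand-in for rdd.mapPartitionsWithIndex(f)."""
    return [result for i, part in enumerate(partitions) for result in f(i, iter(part))]

def count_records(iterator):
    # The iterator can be walked only once; yield a single result per partition.
    yield sum(1 for _ in iterator)

def count_records_with_index(split_index, iterator):
    # Same as above, but tag the result with the partition it came from.
    yield (split_index, sum(1 for _ in iterator))

partitions = [["a", "b", "c"], ["d"], ["e", "f"]]
counts = map_partitions(partitions, count_records)
# counts == [3, 1, 2]
indexed = map_partitions_with_index(partitions, count_records_with_index)
# indexed == [(0, 3), (1, 1), (2, 2)]
```

Replacing the counting with a clustering call gives one set of centroids per partition, which is the shape of result the distributed clustering step needs.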

Page 39: Clickstream data with spark

Applying to 1USAGOV data

Page 40: Clickstream data with spark

Getting 1usagov clickstream data

• Scrape data from archive site:
  – http://1usagov.measuredvoice.com/
  – json format
• Concatenate into files by month
• Store in HDFS
• Load into Spark

Page 41: Clickstream data with spark

Loading json data
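The loading code from this slide is not part of the transcript. Because the archive is not cleaned, one plausible approach is a tolerant line parser that drops anything that is not a JSON object. The function name `parse_line` and the sample lines are assumptions for this sketch.

```python
import json

def parse_line(line):
    """Return [record] for a line holding a JSON object, [] for anything else."""
    try:
        rec = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    return [rec] if isinstance(rec, dict) else []

raw_lines = [
    '{"u": "http://www.nasa.gov/", "r": "direct", "t": 1325376000}',
    '',                                         # blank line
    '{"u": "http://www.nasa.gov/", "t": 13253',  # truncated record
]
records = [rec for line in raw_lines for rec in parse_line(line)]
# records holds only the first, well-formed entry
```

Returning a list makes the function usable with a flat-map: on a cluster, something like `sc.textFile(path).flatMap(parse_line)` would skip the bad lines in the same way, where `sc` is the SparkContext and `path` is a placeholder for the monthly files in HDFS.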

Page 42: Clickstream data with spark

Parse to extract user agent information

• Python package user_agents
  – Input string -> output information
• Add some custom parsing to extract features
  – os family, os_version, device
• Use Spark to map this over each clickstream entry
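The slides rely on the user_agents package for this step. So that it runs with only the standard library, the sketch below swaps in a much cruder keyword-based classifier. It is a stand-in to show the "input string -> output information" shape, not the package's logic, and the category names are chosen for this example.

```python
import re

def parse_user_agent(ua):
    """Very rough user-agent classification; a stand-in, not the user_agents package."""
    lower = ua.lower()
    is_bot = any(word in lower for word in ("bot", "crawler", "spider"))
    # Check iOS before Mac OS X: iPhone/iPad strings contain "like Mac OS X".
    if "iphone" in lower or "ipad" in lower:
        os_name, device = "iOS", "mobile"
    elif "android" in lower:
        os_name, device = "Android", "mobile"
    elif "windows nt 6.1" in lower:  # NT 6.1 is the version string of Windows 7
        os_name, device = "Windows 7", "pc"
    elif "windows" in lower:
        os_name, device = "Windows other", "pc"
    elif "mac os x" in lower:
        os_name, device = "Mac OS X", "pc"
    else:
        os_name, device = "Other", "other"
    # Check IE first, and Chrome before Safari (Chrome strings also mention Safari).
    if re.search(r"msie|trident", lower):
        browser = "IE"
    elif "firefox" in lower:
        browser = "Firefox"
    elif "chrome" in lower:
        browser = "Chrome"
    elif "safari" in lower:
        browser = "Safari"
    else:
        browser = "Other"
    return {"device_type": device, "os": os_name, "browser": browser, "is_bot": is_bot}

ua = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
info = parse_user_agent(ua)
# info == {"device_type": "pc", "os": "Windows 7", "browser": "IE", "is_bot": False}
```

Because the function takes one string and returns one dictionary, it can be mapped over every clickstream entry in the same way as any other per-record function.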

Page 43: Clickstream data with spark

Prepare for K-modes clustering

The CURSE of dimensionality …

To reduce dimensionality:
• Decide which variables to use for clustering
• Keep only the top few categories for each variable

Prasad Patil, as referenced on http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/

Page 44: Clickstream data with spark

Prepare for K-modes clustering

• Decide which variables to use for clustering
  – Country
  – Timezone
  – Device Type
  – OS
  – Browser
• Keep only the top few categories for each variable
  – Custom UDF for Spark dataframes
  – Apply a series of UDFs
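The "keep only the top few categories" step can be sketched as two small functions: one finds the most frequent values of a variable, the other replaces everything else with a catch-all label. The function names, the "Other" label and the browser list are assumptions for this example; the UDF code shown on the slide is not in the transcript.

```python
from collections import Counter

def top_categories(values, n):
    """Return the set of the n most frequent values."""
    return {value for value, _ in Counter(values).most_common(n)}

def collapse(value, keep, other="Other"):
    """Keep a value if it is a top category, otherwise replace it with `other`."""
    return value if value in keep else other

# Made-up browser column.
browsers = ["Firefox", "Chrome", "Firefox", "IE", "Opera",
            "Chrome", "Firefox", "Konqueror"]
keep = top_categories(browsers, n=2)
# keep == {"Firefox", "Chrome"}
reduced = [collapse(b, keep) for b in browsers]
# reduced == ["Firefox", "Chrome", "Firefox", "Other", "Other",
#             "Chrome", "Firefox", "Other"]
```

On a Spark dataframe, a function like `collapse` with a fixed `keep` set is the kind of logic that could be wrapped with `pyspark.sql.functions.udf` and applied once per clustering variable.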

Page 45: Clickstream data with spark

Perform distributed k-modes clustering

• Uses open-source package https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
• Parameters: # of modes, max. iterations

[Figure: Create RDD from the full log; the log is split into partitions; each partition yields centroids, which are combined into a final set of centroids. Steps are labeled "Distributed clustering" and "Local clustering".]

Page 46: Clickstream data with spark

Clustering results: 10 clusters

Page 47: Clickstream data with spark

What do the clusters look like?

#   Size    Country         Timezone        Device Type  OS                     Browser
1   617820  US: 93%         US/NY: 53%      PC: 97%      Win 7: 75%             Firefox: 57%
2   226035  NotUS: 68%      Other: 57%      Mobile: 75%  iOS: 84%               MobileSafari: 78%
3   152053  NoGeoInfo: 86%  NoGeoInfo: 86%  PC: 99%      Windows: 81%           Chrome/IE: 72%
4   161947  US: 96%         US/NY: 60%      PC: 99%      Windows not 7: 99%     IE: 81%
5   105090  NoGeoInfo: 76%  NoGeoInfo: 76%  Mobile: 70%  Other: 70%             Other: 99%
6   235719  NotUS: 99%      Other: 89%      PC: 99%      Win7: 68%              Chrome: 51%
7   121464  US: 100%        US/LA: 59%      PC: 95%      MacOSX: 72%            Chrome: 54%
8   121115  US: 48%         NoGeoInfo: 40%  Mobile: 93%  Android: 100%          Android: 99%
9   101052  NotUS: 98%      Other: 90%      PC: 100%     Win other than 7: 84%  Firefox: 57%
10  173424  US: 100%        US/NY: 48%      Mobile: 68%  iOS: 100%              MobileSafari: 74%

Page 48: Clickstream data with spark

Access patterns

Page 49: Clickstream data with spark

Access patterns

Page 50: Clickstream data with spark

Top sites visited: January 2012

Description             Top 3 domains
US, pc, Win7            www.nysdot.gov 212K   www.nasa.gov 59K                www.fda.gov 18K
US, pc, Win_not7, IE    www.nasa.gov 15K      www.shrewsbury-ma.gov 9K        www.fda.gov 5K
US, pc, Mac OS X        www.nysdot.gov 29K    www.nasa.gov 16K                www.whitehouse.gov 6K
notUS, pc, Win7         www.nasa.gov 87K      earthobservatory.nasa.gov 15K   www.nysdot.gov 14K
notUS, pc, Win_not7     www.nasa.gov 30K      www.navy.mil 8K                 globalhealth.gov 7K
noGeo, pc, Win, Chrome  www.nasa.gov 34K      www.nysdot.gov 17K              earthobservatory.nasa.gov 6K
US, mobile, iOS         www.nasa.gov 33K      earthobservatory.nasa.gov 11K   forecast.weather.gov 9K
notUS, mobile, iOS      www.nasa.gov 82K      earthobservatory.nasa.gov 24K   www.navy.mil 13K
Mobile, Android         www.nasa.gov 29K      earthobservatory.nasa.gov 9K    www.navy.mil 6K
noGeo, mobile, OtherOS  www.nasa.gov 24K      www.nysdot.gov 8K               www.army.mil 5K
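A table like the one above boils down to counting domains per cluster and keeping the three largest counts. The sketch below shows that aggregation on a handful of made-up clicks; the function name `top_domains` and the toy counts are assumptions, and the real figures came from the full January 2012 data.

```python
from collections import Counter, defaultdict

def top_domains(clicks, n=3):
    """clicks: iterable of (cluster_description, domain) pairs."""
    per_cluster = defaultdict(Counter)
    for label, domain in clicks:
        per_cluster[label][domain] += 1
    return {label: counts.most_common(n) for label, counts in per_cluster.items()}

# Toy data, not the real click counts.
clicks = (
    [("US, pc, Win7", "www.nysdot.gov")] * 3
    + [("US, pc, Win7", "www.nasa.gov")] * 2
    + [("US, pc, Win7", "www.fda.gov")]
    + [("US, mobile, iOS", "www.nasa.gov")] * 2
    + [("US, mobile, iOS", "forecast.weather.gov")]
)
result = top_domains(clicks)
# result["US, pc, Win7"] == [("www.nysdot.gov", 3), ("www.nasa.gov", 2), ("www.fda.gov", 1)]
```

The same function works for the referrer view by passing (cluster, referring domain) pairs instead.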

Page 51: Clickstream data with spark

Where do users come from: January 2012

Description             Top 3 domains
US, pc, Win7            direct 342K        t.co 135K              www.facebook.com 67K
US, pc, Win_not7, IE    direct 69K         t.co 33K               www.facebook.com 19K
US, pc, Mac OS X        t.co 49K           direct 41K             www.facebook.com 15K
notUS, pc, Win7         t.co 125K          www.facebook.com 45K   direct 38K
notUS, pc, Win_not7     t.co 41K           direct 29K             www.facebook.com 14K
noGeo, pc, Win, Chrome  t.co 56K           direct 47K             www.facebook.com 24K
US, mobile, iOS         twitter.com 83K    direct 59K             m.facebook.com 17K
notUS, mobile, iOS      twitter.com 119K   direct 69K             t.co 21K
Mobile, Android         t.co 62K           direct 34K             m.facebook.com 17K
noGeo, mobile, OtherOS  direct 63K         t.co 20K               m.facebook.com 13K

Page 52: Clickstream data with spark

What happened in space that had the twitter-sphere abuzz in January 2012?

Solar Flares!

• Especially non-US users
• to: nasa.gov, earthobservatory.nasa.gov
• from: Twitter

http://earthobservatory.nasa.gov/NaturalHazards/view.php?id=76998

Page 53: Clickstream data with spark

Summary

• Data processing operations, like parsing user-agent strings, can be distributed using Spark
• Clustering of large data sets can be distributed using Spark
• Clustering finds groups of related users/records
• These user types show distinct behaviors
• Segmenting users can drive insight and facilitate appropriate messaging
  – When are they visiting?
  – Where are they looking?
  – Where are they coming from?

[Figure labels: Web log data, User information, User groups, Targeted message]

Page 54: Clickstream data with spark

Questions?

Slides available at: http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark

Distributed K-modes clustering for pyspark: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes