application architectures with hadoop and sessionization in mr


Application Architectures with Apache Hadoop Mark Grover | @mark_grover East Bay JUG hadooparchitecturebook.com June 25th, 2014

©2014 Cloudera, Inc. All Rights Reserved.

About Me
•  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
•  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
•  Software developer at Cloudera
•  @mark_grover


Co-authoring O’Reilly book

•  @hadooparchbook
•  hadooparchitecturebook.com


What is Apache Hadoop?


Apache Hadoop is an open source platform for data storage and processing that is…

✓  Scalable
✓  Fault tolerant
✓  Distributed

CORE HADOOP SYSTEM COMPONENTS

Hadoop Distributed File System (HDFS): Self-Healing, High Bandwidth Clustered Storage

MapReduce: Distributed Computing Framework

Has the Flexibility to Store and Mine Any Type of Data

§  Ask questions across structured and unstructured data that were previously impossible to ask or solve
§  Not bound by a single schema

Excels at Processing Complex Data

§  Scale-out architecture divides workloads across multiple nodes
§  Flexible file system eliminates ETL bottlenecks

Scales Economically

§  Can be deployed on commodity hardware
§  Open source platform guards against vendor lock



Click Stream Analysis

Case Study


Analytics


Web Logs


244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

Clickstream Analytics

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Click Stream Analysis (Before Hadoop)

[Diagram: Web logs, ODS, Data Warehouse, Business Intelligence; steps labeled Extract, Transform, Load, Transform, Query]

The Problems (Before Hadoop)

[Diagram: Web logs, ODS, Data Warehouse, Business Intelligence; steps labeled Transform and Load, annotated with problem markers 1 and 2]

1.  Slow Data Transformations = Missed ETL SLAs.
2.  Slow Queries = Frustrated Business Users.

Click Stream Analysis (After Hadoop)

[Diagram: the Before Hadoop pipeline (Web logs, ODS, Data Warehouse, Load, Transform, Business Intelligence) with two X marks]

[Diagram: Web logs, ODS, Business Intelligence; steps labeled Transform and Query]

Flume or Sqoop?


Hive/Impala/MR?

Hive/Impala/Spark?

Challenges of Hadoop Implementation

Other challenges - Architectural Considerations

•  Storage managers?
    •  HDFS? HBase?
•  Data storage and modeling:
    •  File formats? Compression? Schema design?
•  Data movement
    •  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
    •  How do we manage data about the data?
•  Data access and processing
    •  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
•  Orchestration
    •  How do we manage the workflow for all of this?



High level Design


Clickstream Analytics

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

[Architecture diagram: a Hadoop Cluster with four numbered stages]

1.  Ingestion – Website users, Web servers, Operational Data Store, CRM System (via Sqoop)
2.  Processing
3.  Accessing – BI/Visualization tool (e.g. MicroStrategy) for BI Analysts; Spark for machine learning and graph processing; R/Python statistical analysis; Custom Apps
4.  Orchestration via Oozie


Since that’s of most interest to this audience

2. Processing


Processing

•  De-duplication
•  Filtering
•  Sessionization

Deduplication – remove duplicate records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Filtering – filter out invalid records


244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
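The de-duplication and filtering steps can be sketched in plain Java, outside of Hadoop. This is a minimal illustration, not the deck's implementation: the class name `ClickCleaner` and the regular expression are assumptions made for this example, and the pattern only checks the leading fields of the log lines shown above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the talk's code.
final class ClickCleaner {
    // Assumed shape of a valid record: IP, two placeholder fields,
    // a bracketed timestamp, a quoted request, a 3-digit status and a size.
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\d{1,3}(?:\\.\\d{1,3}){3}) \\S+ \\S+ \\[([^\\]]+)\\] \"[^\"]*\" \\d{3} \\S+.*$");

    /** Drops lines that do not match the pattern, then removes exact duplicates. */
    static List<String> clean(List<String> rawLines) {
        // LinkedHashSet keeps the first occurrence and preserves input order.
        Set<String> unique = new LinkedHashSet<>();
        for (String line : rawLines) {
            if (LOG_LINE.matcher(line).matches()) {
                unique.add(line);
            }
        }
        return new ArrayList<>(unique);
    }
}
```

On a real cluster the same idea would run as a distributed job; a single in-memory set only works for data that fits on one machine.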

Sessionization


Website visit

Visitor 1 Session 1

Visitor 1 Session 2

Visitor 2 Session 1

> 30 minutes

Why sessionize?

Helps answer questions like:
•  What is my website’s bounce rate?
    •  i.e. what percentage of visitors don’t go past the landing page?
•  Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions?
•  Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.)?
•  Do attribution analysis – which channels are responsible for most conversions?


Sessionization

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166

How to Sessionize?

1.  Given a list of clicks, determine which clicks came from the same user

2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session


#1 – Which clicks are from the same user?

•  We can use:
    •  IP address (244.157.45.12)
    •  Cookies (A9A3BECE0563982D)
    •  IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")


#2 – Which clicks are part of the same session?



> 30 mins apart = different sessions
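The 30-minute rule can be sketched as a small function over one user's clicks. This is an illustrative sketch only: the class and method names are made up for this example, and it assumes the timestamps are already sorted in ascending order and expressed in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the talk's code.
final class SessionSplitter {
    // 30 minutes, matching the threshold on the slide.
    static final long SESSION_TIMEOUT_MS = 30L * 60 * 1000;

    /**
     * Takes one user's click times (ascending, in milliseconds) and returns
     * a session number for each click, starting at 1.
     */
    static List<Integer> assignSessions(List<Long> clickTimesMs) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        Long previous = null;
        for (long clickTime : clickTimesMs) {
            // First click, or a gap strictly longer than the timeout: new session.
            if (previous == null || clickTime - previous > SESSION_TIMEOUT_MS) {
                session++;
            }
            previous = clickTime;
            sessions.add(session);
        }
        return sessions;
    }
}
```

For the two sample log lines above (21:08:30 and 21:59:59), the gap is a little over 51 minutes, so the second click starts a new session.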


Intro to MapReduce


MapReduce

•  Map – Apply a function to each input record
•  Shuffle & Sort – Partition the map output and sort each partition
•  Reduce – Apply aggregation function to all values in each partition
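The three phases can be imitated in a single process to make the data flow concrete. The sketch below is not Hadoop code: `MiniMapReduce` and its methods are hypothetical names, and it assumes each log line begins with the client IP followed by a space. It counts clicks per IP.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the MapReduce phases; for illustration only.
final class MiniMapReduce {
    // Map: turn one input record into one (key, value) pair, here (IP, 1).
    static Map.Entry<String, Integer> map(String logLine) {
        String ip = logLine.split(" ", 2)[0];
        return new AbstractMap.SimpleEntry<>(ip, 1);
    }

    // Shuffle & Sort: group values by key; TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: aggregate all values that share a key, here by summing.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> logLines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : logLines) {
            mapped.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffleAndSort(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }
}
```

In a real job the map and reduce calls run on different nodes, and the shuffle moves data across the network between them.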


Sessionization in MapReduce

github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch06/MRSessionize





[Diagram: Log lines → Map → (IP1, log lines) and (IP2, log lines) → Reduce → Log line, session ID]

Mapper for Sessionization

public static class SessionizeMapper
    extends Mapper<Object, Text, IpTimestampKey, Text> {
  private Matcher logRecordMatcher;

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    logRecordMatcher = logRecordPattern.matcher(value.toString());
    // We only emit something out if the record matches with our regex.
    // Otherwise, we assume the record is busted and simply ignore it
    if (logRecordMatcher.matches()) {
      String ip = logRecordMatcher.group(1);
      DateTime timestamp = DateTime.parse(logRecordMatcher.group(2), TIMESTAMP_FORMATTER);
      Long unixTimestamp = timestamp.getMillis();
      IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
      context.write(outputKey, value);
    }
  }
}

Reducer for Sessionization

public static class SessionizeReducer
    extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
  private Text result = new Text();
  // Shared counter: session IDs are unique only within one reducer JVM.
  private static int sessionId = 0;

  public void reduce(IpTimestampKey key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Reset on every call: each reduce() invocation handles one IP, so the
    // previous user's last click must not carry over to this user.
    Long lastTimeStamp = null;
    for (Text value : values) {
      String logRecord = value.toString();
      // If this is the first record for this user or it's been more than the timeout
      // since the last click from this user, let's increment the session ID.
      if (lastTimeStamp == null
          || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
        sessionId++;
      }
      lastTimeStamp = key.getUnixTimestamp();
      result.set(logRecord + " " + sessionId);
      // Since we only care about printing out the entire record in the result, with
      // session ID appended at the end, we just emit out "null" for the key
      context.write(null, result);
    }
  }
}

Secondary sorting – by timestamp

•  Records sent to the reducer need to be grouped by IP address and sorted by timestamp, a concept called secondary sorting
•  Instead of using just the IP address as map output key and reduce input key
•  We use a composite key (IP, timestamp) as map output key and reduce input key

Secondary sorting – vocabulary

•  Composite key – IP address, timestamp
•  Natural key – IP address
•  Secondary sort key – timestamp

Secondary sorting

•  Custom Grouping Comparator – on Natural Key (IP)
•  Custom Sort Comparator – on Composite Key (IP, timestamp)
•  Custom Partitioner – on Natural Key (IP)

job.setGroupingComparatorClass(NaturalKeyComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);
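The roles of the three custom classes can be illustrated with ordinary Java comparators. This is a sketch under assumptions, not the code behind `NaturalKeyComparator`, `CompositeKeyComparator`, or `NaturalKeyPartitioner`: the class `SecondarySortSketch`, its nested key type, and the hash-based partition function are all made up for this example.

```java
import java.util.Comparator;

// Plain-Java illustration of secondary sorting; no Hadoop classes involved.
final class SecondarySortSketch {
    // Composite key: natural key (IP) plus secondary sort key (timestamp).
    static final class IpTimestamp {
        final String ip;
        final long timestampMs;

        IpTimestamp(String ip, long timestampMs) {
            this.ip = ip;
            this.timestampMs = timestampMs;
        }
    }

    // Sort comparator: orders by the full composite key (IP, then timestamp).
    static final Comparator<IpTimestamp> COMPOSITE_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip)
                  .thenComparingLong(k -> k.timestampMs);

    // Grouping comparator: looks at the natural key only, so all records of
    // one IP compare as equal and land in the same reduce call.
    static final Comparator<IpTimestamp> NATURAL_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip);

    // Partitioner: uses the natural key only, so one IP always goes to one reducer.
    static int partition(IpTimestamp key, int numReducers) {
        return Math.floorMod(key.ip.hashCode(), numReducers);
    }
}
```

Sorting on the composite key while grouping and partitioning on the natural key is what lets the reducer see each user's clicks in time order without buffering them in memory.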



Data Storage and Modeling


Data Storage – Storage Manager considerations

•  Popular storage managers for Hadoop
    •  Hadoop Distributed File System (HDFS)
    •  HBase

Data Storage – HDFS vs HBase

HDFS
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes

HBase
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes

Data Storage – Storage Manager considerations

•  We choose HDFS
    •  Analytical needs in this case served better by fast scans.



Data Storage Format


Data Storage – Format Considerations

•  Store as plain text?
    •  Sure, well supported by Hadoop.
    •  Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
•  But…
    •  Will begin to consume lots of space in HDFS.
    •  May not be optimal for processing by tools in the Hadoop ecosystem.

Data Storage – Format Considerations

•  But, we can compress the text files…
    •  Gzip – supported by Hadoop, but not splittable.
    •  Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
    •  LZO – splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop.
    •  Snappy – provides a good tradeoff between size and speed.


Data Storage – More About Snappy

•  Designed at Google to provide high compression speeds with reasonable compression.

•  Not the highest compression, but provides very good performance for processing on Hadoop.

•  Snappy is not splittable though, which brings us to…


SequenceFile

• Stores records as binary key/value pairs.

• SequenceFile “blocks” can be compressed.

• This enables splittability with non-splittable compression.


Avro

• Kinda SequenceFile on Steroids.

• Self-documenting – stores schema in header.

• Provides very efficient storage.

• Supports splittable compression.


Our Format Choices…

•  Avro with Snappy
    •  Snappy provides optimized compression.
    •  Avro provides compact storage, self-documenting files, and supports schema evolution.
    •  Avro also provides better failure handling than other choices.
•  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
    •  But they only support Java.



HDFS Schema Design


Recommended HDFS Schema Design

• How to lay out data on HDFS?


Recommended HDFS Schema Design

/user/<username> - User specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows



Advanced HDFS Schema Design


What is Partitioning?

Un-partitioned HDFS directory structure

dataset
    file1.txt
    file2.txt
    . . .
    filen.txt

Partitioned HDFS directory structure

dataset
    col=val1/file.txt
    col=val2/file.txt
    . . .
    col=valn/file.txt

What is Partitioning?

Un-partitioned HDFS directory structure

clicks
    clicks-2014-01-01.txt
    clicks-2014-01-02.txt
    . . .
    clicks-2014-03-31.txt

Partitioned HDFS directory structure

clicks
    dt=2014-01-01/clicks.txt
    dt=2014-01-02/clicks.txt
    . . .
    dt=2014-03-31/clicks.txt


Partitioning

•  Split the dataset into smaller consumable chunks
•  Rudimentary form of “indexing”
•  <data set name>/<partition_column_name=partition_column_value>/{files}

Partitioning considerations

•  What column to partition by?
    •  HDFS is append only.
    •  Don’t have too many partitions (<10,000)
    •  Don’t have too many small files in the partitions (files should generally be larger than the block size)
•  We decided to partition by timestamp
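As a rough illustration of partitioning by timestamp, the function below maps a click time to a partition directory in the `dt=YYYY-MM-DD` layout shown earlier. The class name, the use of UTC, and the daily granularity are assumptions for this sketch; the deck does not state which time zone or granularity was used.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of the talk's code.
final class PartitionPaths {
    // Assumption: one partition per UTC calendar day.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    /** Builds "<dataset>/dt=<day>" for a click time given in epoch milliseconds. */
    static String partitionDir(String dataset, long clickTimeMs) {
        return dataset + "/dt=" + DAY.format(Instant.ofEpochMilli(clickTimeMs));
    }
}
```

A query restricted to a date range then only needs to read the matching `dt=` directories, which is the "rudimentary indexing" effect described above.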


What is bucketing?

Un-bucketed HDFS directory structure

clicks
    dt=2014-01-01/clicks.txt
    dt=2014-01-02/clicks.txt

Bucketed HDFS directory structure

clicks
    dt=2014-01-01/file0.txt
    dt=2014-01-01/file1.txt
    dt=2014-01-01/file2.txt
    dt=2014-01-01/file3.txt
    dt=2014-01-02/file0.txt
    dt=2014-01-02/file1.txt
    dt=2014-01-02/file2.txt
    dt=2014-01-02/file3.txt


Bucketing

•  Hash-bucketed files within each partition based on a particular column
•  Useful when sampling
•  Useful in some joins, with these prerequisites:
    •  Datasets bucketed on the same key as the join key
    •  Number of buckets are the same or one is a multiple of the other

Bucketing considerations?

•  Which column to bucket on?
•  How many buckets?
•  We decided to bucket based on cookie
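Hash bucketing on the cookie can be sketched as follows. This is only an illustration: `Bucketing` is a hypothetical class, and `String.hashCode()` stands in for whatever hash function the actual tool applies (Hive, for example, uses its own hashing, so bucket numbers would differ).

```java
// Hypothetical helper for illustration; not part of the talk's code.
final class Bucketing {
    /** Maps a cookie to a bucket number in the range [0, numBuckets). */
    static int bucketFor(String cookie, int numBuckets) {
        // floorMod avoids negative results when hashCode() is negative.
        return Math.floorMod(cookie.hashCode(), numBuckets);
    }

    /** Builds a file path such as "clicks/dt=2014-01-01/file2.txt". */
    static String bucketFile(String partitionDir, String cookie, int numBuckets) {
        return partitionDir + "/file" + bucketFor(cookie, numBuckets) + ".txt";
    }
}
```

With this scheme, a dataset split into 4 buckets lines up with one split into 2: every cookie in bucket `b` of the first falls into bucket `b % 2` of the second, which is why the join prerequisite allows one bucket count to be a multiple of the other.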


De-normalizing considerations

•  In general, big data joins are expensive
•  When to de-normalize?
•  Decided to join the smaller dimension tables
•  Big fact tables are still joined


Data Ingestion


File Transfers

• “hadoop fs -put <file>”
• Reliable, but not resilient to failure.

• Other options are mountable HDFS, for example NFSv3.


Streaming Ingestion

•  Flume
    •  Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
•  Kafka
    •  Reliable and distributed publish-subscribe messaging system.

Flume vs. Kafka

Flume
•  Purpose built for Hadoop data ingest.
•  Pre-built sinks for HDFS, HBase, etc.
•  Supports transformation of data in-flight.

Kafka
•  General pub-sub messaging framework.
•  Hadoop not supported, requires 3rd-party component (Camus).
•  Just a message transport (a very fast one).

Flume vs. Kafka

•  Bottom line:
    •  Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files.
    •  Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.


A Quick Introduction to Flume

[Diagram: External Source → Flume Agent (Source → Channel → Sink) → Destination]

•  External Source – Web Server, Twitter, JMS, System logs, …
•  Source – Consumes events and forwards to channels
•  Channel – Stores events until consumed by sinks – file, memory, JDBC
•  Sink – Removes event from channel and puts into external destination
•  Flume Agent – JVM process hosting components

•  Reliable – events are stored in channel until delivered to next stage.
•  Recoverable – events can be persisted to disk and recovered in the event of failure.

[Diagram: Flume Agent (Source → Channel → Sink) → Destination]

•  Declarative
    •  No coding required.
    •  Configuration specifies how components are wired together.


A Brief Discussion of Flume Patterns – Fan-in

•  Flume agent runs on each of our servers.
•  These agents send data to multiple agents to provide reliability.
•  Flume provides support for load balancing.

A Brief Discussion of Flume Patterns – Splitting

•  Common need is to split data on ingest.
•  For example:
    •  Sending data to multiple clusters for DR.
    •  To multiple destinations.
•  Flume also supports partitioning, which is key to our implementation.


Sqoop Overview

•  Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
•  Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event based data.

Ingestion Decisions

•  Historical Data
    •  Smaller files: file transfer
    •  Larger files: Flume with spooling directory source.
•  Incoming Data
    •  Flume with the spooling directory source.



Data Processing and Access


Data flow

[Diagram: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]

Data processing tools

•  Hive
•  Impala
•  Pig, etc.


Hive

•  Open source data warehouse system for Hadoop
•  Converts SQL-like queries to MapReduce jobs
    •  Work is being done to move this away from MR
•  Stores metadata in Hive metastore
•  Can create tables over HDFS or HBase data
•  Access available via JDBC/ODBC

Impala

•  Real-time open source SQL query engine for Hadoop
•  Doesn’t build on MapReduce
•  Written in C++, uses LLVM for run-time code generation
•  Can create tables over HDFS or HBase data
•  Accesses Hive metastore for metadata
•  Access available via JDBC/ODBC

Pig

•  Higher level abstraction over MapReduce (like Hive)
•  Write transformations in scripting language – Pig Latin
•  Can access Hive metastore via HCatalog for metadata

Data Processing considerations

•  We chose Hive for ETL and Impala for interactive BI.



Metadata Management


What is Metadata?

•  Metadata is data about the data
    •  Format in which data is stored
    •  Compression codec
    •  Location of the data
    •  Is the data partitioned/bucketed/sorted?

Metadata in Hive

Hive Metastore

Metadata

•  Hive metastore has become the de-facto metadata repository
•  HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)


Hive + HCatalog


Orchestration

•  Once the data is in Hadoop, we need a way to manage workflows in our architecture.
    •  Scheduling and tracking MapReduce jobs, Hive jobs, etc.
•  Several options here:
    •  Cron
    •  Oozie, Azkaban
    •  3rd-party tools, Talend, Pentaho, Informatica, enterprise schedulers.


Oozie

• Supports defining and executing a sequence of jobs.

• Can trigger jobs based on external dependencies or schedules.



Final Architecture


Final Architecture – High Level Overview

[Diagram: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]

Final Architecture – Ingestion

[Diagram: two web servers (Web App with Avro Agent), four Flume Agents, HDFS]

•  Fan-in Pattern
•  Multi Agents for Failover and rolling restarts

Final Architecture – Storage and Processing

[Diagram: /etl/weblogs/20140331/, /etl/weblogs/20140401/, …; Data Processing; /data/marketing/clickstream/bouncerate/, /data/marketing/clickstream/attribution/, …]

Final Architecture – Data Access

[Diagram: Hive/Impala; JDBC/ODBC; BI/Analytics Tools; Sqoop; DWH; DB import tool; Local Disk; R, etc.]


Contact info
•  Mark Grover
•  @mark_grover
•  www.linkedin.com/in/grovermark
•  Slides at slideshare.net/markgrover

