Application Architectures with Hadoop
Big Data TechCon, San Francisco 2014
slideshare.com/hadooparchbook
Mark Grover | @mark_grover
Jonathan Seidman | @jseidman


DESCRIPTION

Presentation by Jonathan Seidman and Mark Grover on Application Architectures with Hadoop at Big Data TechCon in SF in October 2014.

TRANSCRIPT

Page 1

Application Architectures with Hadoop
Big Data TechCon, San Francisco 2014
slideshare.com/hadooparchbook
Mark Grover | @mark_grover
Jonathan Seidman | @jseidman

Page 2

About the book
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook

Page 3

About Us
•  Mark
   –  Software Engineer
   –  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
   –  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
   –  @mark_grover
•  Jonathan
   –  Senior Solutions Architect, Partner Enablement.
   –  Co-founder of Chicago Hadoop User Group and Chicago Big Data.
   –  [email protected]
   –  @jseidman

Page 4

Case Study: Clickstream Analysis

Page 5

Analytics


Page 7

Web Logs – Combined Log Format

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
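Every field in these lines can be pulled out with one regular expression. A minimal plain-Java sketch (class and field names are ours, not from the deck):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedLogParser {
    // ip identd user [timestamp] "request" status bytes "referrer" "user agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static void main(String[] args) {
        String line = "244.157.45.12 - - [17/Oct/2014:21:08:30 ] \"GET /seatposts HTTP/1.0\" 200 4463 "
            + "\"http://bestcyclingreviews.com/top_online_shops\" \"Mozilla/5.0\"";
        Matcher m = COMBINED.matcher(line);
        if (m.matches()) {
            System.out.println("ip=" + m.group(1) + ", time=" + m.group(4)
                + ", request=" + m.group(5) + ", status=" + m.group(6)
                + ", referrer=" + m.group(8) + ", userAgent=" + m.group(9));
        }
    }
}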

Page 8

Clickstream Analytics

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Page 9

Challenges of Hadoop Implementation


Page 11

Hadoop Architectural Considerations
•  Storage managers?
   –  HDFS? HBase?
•  Data storage and modeling:
   –  File formats? Compression? Schema design?
•  Data movement
   –  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
   –  How do we manage data about the data?
•  Data access and processing
   –  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
•  Orchestration
   –  How do we manage the workflow for all of this?

Page 12

Architectural Considerations: Data Storage and Modeling

Page 13

Data Modeling Considerations
•  We need to consider the following in our architecture:
   –  Storage layer – HDFS? HBase? Etc.
   –  File system schemas – how will we lay out the data?
   –  File formats – what storage formats to use for our data, both raw and processed?
   –  Data compression formats?

Page 14

Architectural Considerations: Data Modeling – Storage Layer

Page 15

Data Storage Layer Choices
•  Two likely choices for raw data: HDFS or HBase.

Page 16

Data Storage Layer Choices

HDFS:
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes

HBase:
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes

Page 17

Data Storage – Storage Manager Considerations
•  Incoming raw data:
   –  Processing requirements call for batch transformations across multiple records – for example, sessionization.
•  Processed data:
   –  Access to processed data will be via things like analytical queries – again requiring access to multiple records.
•  We choose HDFS
   –  Processing needs in this case are served better by fast scans.

Page 18

Architectural Considerations: Data Modeling – Raw Data Storage

Page 19

Storage Formats – Raw Data and Processed Data

[Figure: raw data and processed data as separate storage areas]

Page 20

Data Storage – Format Considerations

[Figure: many plain-text log files arriving from multiple web servers]

Page 21

Data Storage – Compression

[Figure: weighing codec choices – Snappy: "well, maybe, but not splittable"; the splittable alternatives each come with caveats]

Page 22

Data Storage – More About Snappy
–  Designed at Google to provide high compression speeds with reasonable compression.
–  Not the highest compression, but provides very good performance for processing on Hadoop.
–  Snappy is not splittable though, which brings us to…

Page 23

Hadoop File Types
•  Formats designed specifically to store and process data on Hadoop:
   –  File based – SequenceFile
   –  Serialization formats – Thrift, Protocol Buffers, Avro
   –  Columnar formats – RCFile, ORC, Parquet

Page 24

SequenceFile
•  Stores records as binary key/value pairs.
•  SequenceFile "blocks" can be compressed.
•  This enables splittability with non-splittable compression.
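As a rough illustration, a minimal sketch with the Hadoop Java API (the path and key/value types are ours): block compression compresses batches of records, so even a non-splittable codec like Snappy leaves the file splittable.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SeqFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/clicks.seq")),      // illustrative path
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // BLOCK compression = compress many records together.
                SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))) {
            writer.append(new LongWritable(1L), new Text("244.157.45.12 - - ..."));
        }
    }
}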

Page 25

Avro
•  Kinda SequenceFile on steroids.
•  Self-documenting – stores schema in header.
•  Provides very efficient storage.
•  Supports splittable compression.

Page 26

Our Format Choices…
•  Avro with Snappy
   –  Snappy provides optimized compression.
   –  Avro provides compact storage, self-documenting files, and supports schema evolution.
   –  Avro also provides better failure handling than other choices.
•  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
   –  But they only support Java.
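A minimal sketch of writing Snappy-compressed Avro with the Avro Java API; the two-field schema is a hypothetical stand-in for a full log record:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSnappyWrite {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
            + "{\"name\":\"ip\",\"type\":\"string\"},"
            + "{\"name\":\"url\",\"type\":\"string\"}]}");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // Avro compresses per data block, so Snappy stays splittable here.
            writer.setCodec(CodecFactory.snappyCodec());
            writer.create(schema, new File("clicks.avro"));
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("ip", "244.157.45.12");
            rec.put("url", "/seatposts");
            writer.append(rec);
        }
    }
}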

Page 27

Architectural Considerations: Data Modeling – Processed Data Storage

Page 28

Storage Formats – Raw Data and Processed Data

[Figure: raw data and processed data as separate storage areas]

Page 29

Access to Processed Data

[Figure: analytical queries select a subset of columns (Column A–Column D) from a table of values]

Page 30

Columnar Formats
•  Eliminates I/O for columns that are not part of a query.
•  Works well for queries that access a subset of columns.
•  Often provide better compression.
•  These add up to dramatically improved performance for many queries.

Row layout:      1, 2014-10-13, abc | 2, 2014-10-14, def | 3, 2014-10-15, ghi
Columnar layout: 1, 2, 3 | 2014-10-13, 2014-10-14, 2014-10-15 | abc, def, ghi

Page 31

Columnar Choices – RCFile
•  Provides efficient processing for MapReduce applications, but generally only used with Hive.
•  Only supports Java.
•  No Avro support.
•  Limited compression support.
•  Sub-optimal performance compared to newer columnar formats.

Page 32

Columnar Choices – ORC
•  A better RCFile.
•  Supports Hive, but currently limited support for other interfaces.
•  Only supports Java.

Page 33

Columnar Choices – Parquet
•  Supports multiple programming interfaces – MapReduce, Hive, Impala, Pig.
•  Multiple language support.
•  Broad vendor support.
•  These features make Parquet a good choice for our processed data.

Page 34

Architectural Considerations: Data Modeling – HDFS Schema Design

Page 35

Recommended HDFS Schema Design
•  How to lay out data on HDFS?

Page 36

Recommended HDFS Schema Design
/user/<username> – user-specific data, jars, conf files
/etl – data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – everything but data: UDF jars, HQL files, Oozie workflows

Page 37

Architectural Considerations: Data Modeling – Advanced HDFS Schema Design

Page 38

What is Partitioning?

Un-partitioned HDFS directory structure:
dataset/file1.txt
dataset/file2.txt
. . .
dataset/filen.txt

Partitioned HDFS directory structure:
dataset/col=val1/file.txt
dataset/col=val2/file.txt
. . .
dataset/col=valn/file.txt

Page 39

What is Partitioning?

Un-partitioned HDFS directory structure:
clicks/clicks-2014-01-01.txt
clicks/clicks-2014-01-02.txt
. . .
clicks/clicks-2014-03-31.txt

Partitioned HDFS directory structure:
clicks/dt=2014-01-01/clicks.txt
clicks/dt=2014-01-02/clicks.txt
. . .
clicks/dt=2014-03-31/clicks.txt

Page 40

Partitioning
•  Split the dataset into smaller consumable chunks.
•  Rudimentary form of "indexing".
•  <data set name>/<partition_column_name=partition_column_value>/{files}

Page 41

Partitioning Considerations
•  What column to partition by?
   –  HDFS is append only.
   –  Don't have too many partitions (< 10,000).
   –  Don't have too many small files in the partitions (files should generally be larger than the HDFS block size).
•  We decided to partition by timestamp.

Page 42

Partitioning For Our Case Study
•  Raw dataset:
   –  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10
•  Processed dataset:
   –  /data/bikeshop/clickstream/year=2014/month=10/day=10
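A hypothetical Hive DDL over the processed dataset shows how partition columns mirror this directory layout (the column names are ours, not from the deck):

-- External table over the Parquet output; partitions map to
-- year=/month=/day= subdirectories under the table location.
CREATE EXTERNAL TABLE clickstream (
  ip STRING,
  ts STRING,
  url STRING,
  referrer STRING,
  user_agent STRING
)
PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
STORED AS PARQUET
LOCATION '/data/bikeshop/clickstream';

-- Register one day's partition:
ALTER TABLE clickstream ADD PARTITION (year = 2014, month = 10, day = 10);

Queries that filter on year/month/day then read only the matching directories.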

Page 43

A Word About Partitioning Alternatives

year=2014/month=10/day=10  vs.  dt=2014-10-10

Page 44

Architectural Considerations: Data Ingestion

Page 45

File Transfers
•  "hadoop fs -put <file>"
•  Reliable, but not resilient to failure.
•  Other options are mountable HDFS, for example NFSv3.

Page 46

Streaming Ingestion
•  Flume
   –  Reliable, distributed, and available system for efficient collection, aggregation, and movement of streaming data, e.g. logs.
•  Kafka
   –  Reliable and distributed publish-subscribe messaging system.

Page 47

Flume vs. Kafka

Flume:
•  Purpose built for Hadoop data ingest.
•  Pre-built sinks for HDFS, HBase, etc.
•  Supports transformation of data in-flight.

Kafka:
•  General pub-sub messaging framework.
•  Hadoop not supported; requires a 3rd-party component (Camus).
•  Just a message transport (a very fast one).

Page 48

Flume vs. Kafka
•  Bottom line:
   –  Flume is very well integrated with the Hadoop ecosystem, and well suited to ingestion of sources such as log files.
   –  Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.

Page 49

Short Intro to Flume

Components of a Flume Agent, with examples:
•  Sources – Twitter, logs, JMS, webserver
•  Interceptors – mask, re-format, validate…
•  Selectors – DR, critical
•  Channels – memory, file
•  Sinks – HDFS, HBase, Solr

Page 50

A Quick Introduction to Flume
•  Reliable – events are stored in the channel until delivered to the next stage.
•  Recoverable – events can be persisted to disk and recovered in the event of failure.

Flume Agent: Source → Channel → Sink → Destination

Page 51

A Quick Introduction to Flume
•  Declarative
   –  No coding required.
   –  Configuration specifies how components are wired together (sketched below).
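To make "declarative" concrete, here is a sketch of a single-agent configuration (component names and local paths are ours) that spools log files into the raw dataset layout from the case study:

# Hypothetical pipeline: spooling-directory source -> file channel -> HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1

agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/log/weblogs/incoming
agent1.sources.src1.channels = ch1

# File channel persists events to disk for recoverability.
agent1.channels.ch1.type = file

agent1.sinks.snk1.type      = hdfs
agent1.sinks.snk1.channel   = ch1
agent1.sinks.snk1.hdfs.path = /etl/BI/casualcyclist/clicks/rawlogs/year=%Y/month=%m/day=%d
# Lets the sink resolve %Y/%m/%d without a timestamp interceptor.
agent1.sinks.snk1.hdfs.useLocalTimeStamp = true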

Page 52

A Brief Discussion of Flume Patterns – Fan-in
•  Flume agent runs on each of our servers.
•  These agents send data to multiple agents to provide reliability.
•  Flume provides support for load balancing.

Page 53

Sqoop Overview
•  Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
•  Great for doing bulk imports and exports of data between HDFS, Hive, and HBase and an external data store. Not suited for ingesting event-based data.
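A hypothetical import (host, database, and table are made up) to show the shape of a Sqoop bulk transfer:

sqoop import \
  --connect jdbc:mysql://ods.example.com/crm \
  --username etl_user \
  --password-file /user/etl/.sqoop-password \
  --table customers \
  --target-dir /data/bikeshop/crm/customers \
  --num-mappers 4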

Page 54

Ingestion Decisions
•  Historical data
   –  Smaller files: file transfer
   –  Larger files: Flume with the spooling directory source
•  Incoming data
   –  Flume with the spooling directory source
•  Relational data sources – ODS, CRM, etc.
   –  Sqoop

Page 55

Architectural Considerations: Data Processing – Engines

Page 56

Data Flow

[Figure: raw data becomes partitioned clickstream data, which is combined with other data (financial, CRM, etc.) to produce aggregated datasets #1 and #2]

Page 57

Processing Engines
•  MapReduce
•  Abstractions – Pig, Hive, Cascading, Crunch
•  Spark
•  Impala

Page 58

MapReduce
•  Oldie but goody
•  Restrictive framework / innovative workarounds
•  Extreme batch

Page 59

MapReduce Basic High Level

[Figure: a Mapper reads a block of data from HDFS (replicated), spilling temporary data to the native file system as partitioned, sorted data; a Reducer copies that data locally and writes the output file]

Page 60

Abstractions
•  SQL
   –  Hive
•  Script/Code
   –  Pig: Pig Latin
   –  Crunch: Java/Scala
   –  Cascading: Java/Scala

Page 61

Spark
•  The new kid that isn't that new anymore
•  Easily 10x less code
•  Extremely easy and powerful API
•  Very good for machine learning
•  Scala, Java, and Python
•  RDDs
•  DAG engine

Page 62

Spark – DAG

Page 63

Impala
•  Real-time open source MPP-style engine for Hadoop
•  Doesn't build on MapReduce
•  Written in C++, uses LLVM for run-time code generation
•  Can create tables over HDFS or HBase data
•  Accesses the Hive metastore for metadata
•  Access available via JDBC/ODBC

Page 64

What processing needs to happen?
•  Sessionization
•  Filtering
•  Deduplication
•  BI / Discovery

Page 65

Sessionization

[Figure: a timeline of website visits – Visitor 1's clicks are split into Session 1 and Session 2 by a gap of more than 30 minutes; Visitor 2 has a single session]

Page 66

Why sessionize?

Helps answer questions like:
•  What is my website's bounce rate?
   –  i.e. what percentage of visitors don't go past the landing page? (see the query sketch after this list)
•  Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
   –  Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
•  Do attribution analysis – which channels are responsible for the most conversions?
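For example, assuming a hypothetical sessionized table with one row per hit (table and column names are ours, not from the deck), bounce rate reduces to a single query runnable in Hive or Impala:

-- Share of sessions that saw exactly one page view.
SELECT 100.0 * SUM(CASE WHEN hits = 1 THEN 1 ELSE 0 END) / COUNT(*) AS bounce_rate_pct
FROM (
  SELECT session_id, COUNT(*) AS hits
  FROM clickstream_sessions
  GROUP BY session_id
) per_session;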

Page 67

How to Sessionize?
1.  Given a list of clicks, determine which clicks came from the same user.
2.  Given a particular user's clicks, determine if a given click is part of a new session or a continuation of the previous session (sketched below).
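Step 2 is the heart of it: once a single user's clicks are sorted by time, a gap of more than 30 minutes starts a new session. A minimal plain-Java sketch of just that rule (independent of any particular engine; names are ours):

import java.util.ArrayList;
import java.util.List;

public class Sessionizer {
    private static final long SESSION_GAP_MS = 30 * 60 * 1000L;

    // Input: one user's click timestamps (epoch millis), sorted ascending.
    // Output: a session number for each click.
    public static List<Integer> assignSessions(List<Long> sortedTimestamps) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        long previous = Long.MIN_VALUE;
        for (long ts : sortedTimestamps) {
            // A gap of more than 30 minutes starts a new session.
            if (previous != Long.MIN_VALUE && ts - previous > SESSION_GAP_MS) {
                session++;
            }
            sessions.add(session);
            previous = ts;
        }
        return sessions;
    }
}

In MapReduce or Spark, this logic would run per user after grouping clicks by the identifier chosen in step 1.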

Page 68

#1 – Which clicks are from the same user?
•  We can use:
   –  IP address (244.157.45.12)
   –  Cookies (A9A3BECE0563982D)
   –  IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")


Page 70

#1 – Which clicks are from the same user?

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

Page 71

#2 – Which clicks are part of the same session?

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

> 30 mins apart = different sessions

Page 72

Sessionization engine recommendation
•  We have sessionization code in MR and Spark on GitHub. The complexity of the code varies and depends on the expertise in the organization.
•  We choose MR, since it's fairly simple and maintainable code.

Page 73

Filtering – filter out incomplete records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…

Page 74

Filtering – filter out records from bots/spiders

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

(209.85.238.11 is the Google spider IP address)

Page 75

Filtering recommendation
•  Bot/spider filtering can be done easily in any of the engines.
•  Incomplete records are harder to filter in schema-based systems like Hive, Impala, Pig, etc.
•  Pretty close choice between MR, Hive, and Spark.
•  Can be done in Flume interceptors as well.
•  We can simply embed this in our sessionization job.

Page 76

Deduplication – remove duplicate records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Page 77

Deduplication recommendation
•  Can be done in all engines.
•  We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication.
•  We use Pig (see the sketch below).
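A sketch of what that Pig job might look like (paths follow the case-study layout; TextLoader treats each log line as one record):

clicks  = LOAD '/etl/BI/casualcyclist/clicks/rawlogs' USING TextLoader() AS (line:chararray);
-- DISTINCT compares whole records, so exact duplicate lines collapse to one.
deduped = DISTINCT clicks;
STORE deduped INTO '/data/bikeshop/clickstream/deduped';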

Page 78

BI/Discovery engine recommendation
•  Main requirements for this are:
   –  Low latency
   –  SQL interface (e.g. JDBC/ODBC)
   –  Users don't know how to code
•  We chose Impala
   –  It's a SQL engine
   –  Much faster than other engines
   –  Provides standard JDBC/ODBC interfaces

Page 79

Architectural Considerations: Orchestration

Page 80

Orchestrating Clickstream
•  Data arrives through Flume.
•  Triggers a processing event:
   –  Sessionize
   –  Enrich – location, marketing channel…
   –  Store as Parquet
•  Each day we process events from the previous day (see the coordinator sketch after this list).
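One way this could look as an Oozie coordinator (names and paths are ours, not from the deck): it materializes one run per day and waits for the previous day's raw partition before launching the workflow.

<!-- Hypothetical daily coordinator for the clickstream workflow. -->
<coordinator-app name="clickstream-coord" frequency="${coord:days(1)}"
                 start="2014-10-01T00:00Z" end="2015-10-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="rawlogs" frequency="${coord:days(1)}"
             initial-instance="2014-10-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/etl/BI/casualcyclist/clicks/rawlogs/year=${YEAR}/month=${MONTH}/day=${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- Wait for the previous day's partition before starting. -->
    <data-in name="input" dataset="rawlogs">
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/app/clickstream/workflow</app-path>
    </workflow>
  </action>
</coordinator-app>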

Page 81

Choosing Right
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting

Page 82

Oozie or Azkaban?

Page 83

Choosing…
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting

Easier in Oozie

Page 84

Choosing the right Orchestration Tool
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting

Better in Azkaban

Page 85

Important Decision Consideration!

The best orchestration tool is the one you are an expert on.

Page 86

Putting It All Together: Final Architecture

Page 87

Final Architecture – High Level Overview

Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis


Page 89

Final Architecture – Ingestion

[Figure: fan-in pattern – web apps, each with an Avro agent, send events to a tier of Flume agents (multiple agents for failover and rolling restarts), which write to HDFS]

Page 90

Final Architecture – High Level Overview

Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis

Page 91

Final Architecture – Storage and Processing

Raw data:
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…

↓ Data Processing ↓

Processed data:
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…

Page 92

Final Architecture – High Level Overview

Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis

Page 93

Final Architecture – Data Access

[Figure: Hive/Impala serve BI/analytics tools over JDBC/ODBC; Sqoop exports to the DWH; a DB import tool pulls data to local disk for R, etc.]

Page 94

Thank you

Page 95

Contact info
•  Mark Grover
   –  @mark_grover
   –  www.linkedin.com/in/grovermark
•  Jonathan Seidman
   –  [email protected]
   –  @jseidman
   –  www.linkedin.com/in/jaseidman
   –  www.slideshare.net/jseidman
•  Slides at slideshare.net/hadooparchbook