High Volume Streaming Analytics with CDAP: 08/19 Big Data Application Meetup, Talk #2


TRANSCRIPT

Page 1: High Volume Streaming Analytics with CDAP

Jialong Wu
August 19, 2015

Page 2: LOTAME DMP

COLLECT
• Web data
• Mobile data
• Set-top data
• CRM data

ORGANIZE
• Data into default and custom hierarchies
• Easily managed in a folder structure

ACTIVATE
• Deliver targeted ad campaigns and marketing promotions
• Dynamically serve content based on audience segment
• Generate advanced audience analytics

Page 3: (image-only slide; no transcript text)

Page 4: Client Dashboard

(image-only slide)

Page 5: Counting Uniques

• Number of unique site visitors (profiles) that exhibit certain behavior(s) during some time period
• Useful for estimating audience reach, tracking behavior trends, etc.
• Stats reported in daily, month-to-date, and 30-day intervals
• Rolled up by client-network and behavior-category hierarchies

Uniques do not simply add up: a visitor who is active on several days is counted once in each daily figure but only once in the 30-day figure, so

sum(U_day1, …, U_day30) != U_30day
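A quick illustration of that inequality in plain Java, with made-up visitor IDs (any overlap across days makes the summed daily counts overcount):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class UniquesDoNotAdd {
        public static void main(String[] args) {
            // Hypothetical daily visitor sets; "alice" is active on both days.
            Set<String> day1 = new HashSet<>(List.of("alice", "bob"));
            Set<String> day2 = new HashSet<>(List.of("alice", "carol"));

            int summedDailyUniques = day1.size() + day2.size(); // 2 + 2 = 4
            Set<String> periodUniques = new HashSet<>(day1);
            periodUniques.addAll(day2);                         // {alice, bob, carol}

            System.out.println(summedDailyUniques);   // 4
            System.out.println(periodUniques.size()); // 3
        }
    }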

Page 6: Counting Uniques

Old Approach
• MapReduce-based batch processing
• Re-scans historical profile data (expensive)
• Half-day wait for the previous day's stats
• Fixed time buckets for counting

New Approach
• Real-time event stream processing with CDAP
• Uses HyperLogLog to estimate unique counts
• No re-scanning of historical profile data
• Flexible, on-demand aggregation and deduplication
• Allows frequent stats updates

Page 7: HyperLogLog

• Estimates the number of distinct values (DV) in a set from the length of the longest run of trailing zeros among the hashed values in the set
• Improves the estimate by averaging the results from multiple estimators:
  • Splits the input into 2^p bins, where p is the number of register index bits
  • Takes the harmonic mean of the estimates to filter extreme outliers
• Set unions are lossless, which allows distributed computation of the DV count
• Estimating set intersections: accuracy is low for sets with a high cardinality ratio or low overlap
• Space savings: 1% error rate for a 1B count using 1.5 KB of memory
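To make the mechanics above concrete, here is a minimal, self-contained HyperLogLog sketch in Java: the low p bits of the hash pick a register, the trailing-zero run of the remaining bits updates it, the estimate is a bias-corrected harmonic mean, and union is a register-wise max. This is an illustrative toy (splitmix64 stands in for a real hash, and there is no sparse mode or full bias-correction table), not the implementation used in the talk:

    public class HyperLogLog {
        private final int p;             // number of register index bits
        private final int m;             // number of registers = 2^p
        private final byte[] registers;

        public HyperLogLog(int p) {
            this.p = p;
            this.m = 1 << p;
            this.registers = new byte[m];
        }

        // splitmix64 finalizer, standing in for a real 64-bit hash function
        private static long hash64(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }

        public void add(long item) {
            long h = hash64(item);
            int idx = (int) (h & (m - 1));                // low p bits pick the register
            long w = h >>> p;                             // remaining 64 - p bits
            int rank = Long.numberOfTrailingZeros(w) + 1; // trailing-zero run + 1
            if (rank > 64 - p + 1) rank = 64 - p + 1;     // guard for w == 0
            if (rank > registers[idx]) registers[idx] = (byte) rank;
        }

        public double estimate() {
            double alpha = 0.7213 / (1 + 1.079 / m);      // bias correction, valid for m >= 128
            double sum = 0.0;
            int zeroRegisters = 0;
            for (byte r : registers) {
                sum += 1.0 / (1L << r);
                if (r == 0) zeroRegisters++;
            }
            double raw = alpha * m * m / sum;             // harmonic mean of register estimates
            if (raw <= 2.5 * m && zeroRegisters > 0) {    // small-range (linear counting) correction
                return m * Math.log((double) m / zeroRegisters);
            }
            return raw;
        }

        // Union is lossless: take the register-wise maximum.
        public void merge(HyperLogLog other) {
            for (int i = 0; i < m; i++) {
                if (other.registers[i] > registers[i]) {
                    registers[i] = other.registers[i];
                }
            }
        }

        public static void main(String[] args) {
            HyperLogLog hll = new HyperLogLog(14);        // p = 14, ~16K registers, as on the next slide
            for (long i = 0; i < 1_000_000; i++) {
                hll.add(i);
            }
            System.out.printf("estimate = %.0f (true count 1000000)%n", hll.estimate());
        }
    }

Because merge only takes a per-register max, daily HLLs can be built independently and unioned on demand into month-to-date or 30-day counts, without rescanning history. That is exactly the property the new approach relies on.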

Page 8: HLL Accuracy

• Uses 14 register index bits (~16K bins)
• Unbiased error (mean error is 0%)
• Standard deviation of 0.8%
• Accurate 99.9% of the time with < 2.5% error
• Exact counts for low cardinality: stores raw hash values in Trove sets
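The low-cardinality fallback could look like the following hedged sketch: keep exact hashed IDs in a set until a size threshold, then spill into an HLL. The threshold value and class names are assumptions; the talk uses Trove's primitive long sets where this sketch uses java.util.HashSet to stay dependency-free, and it reuses the HyperLogLog class from the previous sketch:

    import java.util.HashSet;
    import java.util.Set;

    public class HybridUniqueCounter {
        private static final int THRESHOLD = 10_000; // assumed cutoff, not stated in the talk
        private Set<Long> exact = new HashSet<>();   // a Trove TLongHashSet in the real system
        private HyperLogLog hll;                     // from the previous sketch

        public void add(long profileIdHash) {
            if (hll != null) {
                hll.add(profileIdHash);
                return;
            }
            exact.add(profileIdHash);
            if (exact.size() > THRESHOLD) {          // convert to HLL once the set grows too large
                hll = new HyperLogLog(14);
                for (long h : exact) {
                    hll.add(h);
                }
                exact = null;                        // free the exact set
            }
        }

        public double estimate() {
            return hll == null ? exact.size() : hll.estimate();
        }
    }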

Page 9: Why CDAP?

• Good API abstraction layer for building data processing applications
  • Faster time to market
  • Lower entry barrier for developing big data applications
  • Better reusability of data and processing patterns
• Supports both the stream and batch processing paradigms
  • Shares data across programs in different paradigms
• "Exactly once" transactional processing
• Good support for the application development cycle
  • No distributed components required for development
  • CDAP modes: In-Memory, Standalone, Distributed

Page 10: Architecture

(diagram-only slide)

Page 11: CDAP Flow

• Extends the CDAP Kafka Flowlet library
• Each Flowlet instance pulls from one topic-partition
• Configurable starting point for reading data
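The talk builds on the CDAP Kafka Flowlet library; as a library-neutral illustration of the same pattern (one consumer instance pinned to one topic-partition, with a configurable start offset), here is a sketch using the plain kafka-clients consumer API. The topic name, partition, and broker address are made up:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class PartitionReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Pin this instance to exactly one topic-partition.
                TopicPartition tp = new TopicPartition("events", 0);
                consumer.assign(List.of(tp));
                consumer.seek(tp, 0L);  // configurable starting point for reading data
                while (true) {
                    for (ConsumerRecord<byte[], byte[]> rec
                            : consumer.poll(Duration.ofMillis(500))) {
                        process(rec.value());  // hand the payload to the next stage
                    }
                }
            }
        }

        static void process(byte[] payload) {
            // parse the event and emit attribute tuples downstream
        }
    }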

Page 12: CDAP Flow

• Encapsulates the attribute-counting logic in pluggable processors
• Attribute tuple = attribute values + profile ID hash
• Each processor emits to its own Flowlet queue
• Emits processed attribute tuples with the hash of the attribute value as the partition key
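A hypothetical sketch of what such an attribute tuple and its hash-based partition key could look like; all names here are assumptions, not the talk's actual classes:

    public class AttributeTuple {
        final String clientId;
        final String attributeType;
        final String attributeValue;
        final long profileIdHash;   // hashed profile ID, never the raw ID

        AttributeTuple(String clientId, String attributeType,
                       String attributeValue, long profileIdHash) {
            this.clientId = clientId;
            this.attributeType = attributeType;
            this.attributeValue = attributeValue;
            this.profileIdHash = profileIdHash;
        }

        // Partitioning on the attribute value's hash sends all tuples for the
        // same attribute to the same downstream Flowlet instance, so each
        // instance owns a disjoint slice of the counting state.
        int partition(int numPartitions) {
            return Math.floorMod(attributeValue.hashCode(), numPartitions);
        }
    }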

Page 13: CDAP Flow

• Updates in-memory HLL objects for each incoming attribute tuple
• Flushes the in-memory HLL objects to Datasets every minute
• Uses a hash-based partition strategy for better scalability
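A sketch of that aggregate-and-flush step under stated assumptions: HLLs are buffered per attribute key in memory and written out once a minute. The store interface is a stand-in for a CDAP Dataset, the names are hypothetical, and it reuses the HyperLogLog and AttributeTuple classes from the earlier sketches:

    import java.util.HashMap;
    import java.util.Map;

    public class HllAggregator {
        /** Stand-in for the CDAP Dataset the HLLs are persisted to. */
        interface HllStore {
            void write(String key, HyperLogLog hll);
        }

        private final Map<String, HyperLogLog> buffer = new HashMap<>();
        private final HllStore store;
        private long lastFlushMillis = System.currentTimeMillis();

        HllAggregator(HllStore store) {
            this.store = store;
        }

        void process(AttributeTuple tuple) {
            String key = tuple.clientId + "|" + tuple.attributeType + "|" + tuple.attributeValue;
            buffer.computeIfAbsent(key, k -> new HyperLogLog(14)).add(tuple.profileIdHash);
            if (System.currentTimeMillis() - lastFlushMillis >= 60_000) {
                flush();  // persist once a minute, as on the slide
            }
        }

        void flush() {
            buffer.forEach(store::write);
            buffer.clear();
            lastFlushMillis = System.currentTimeMillis();
        }
    }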

Page 14: Dataset Row Key

Encodes the attribute tuple in a byte array. Row key components:
• Prefix byte
• Hour ID (ddHH)
• Client ID
• Attribute type
• Attribute
• Event timestamp (yyyyMMddHHmm)

Prefix byte = hash(client_id + attribute_type + attribute) mod (number of buckets)

Optimized for writes and for specific scan patterns.
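As an illustration only, here is one way to assemble that row key in Java. The UTF-8 encoding and the raw concatenation are my assumptions; a real implementation would need fixed widths, length prefixes, or separators so the variable-width fields stay unambiguous:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class RowKeys {
        static byte[] rowKey(String clientId, String attributeType, String attribute,
                             String hourId /* ddHH */, String eventTs /* yyyyMMddHHmm */,
                             int numBuckets) {
            // The prefix byte spreads writes across pre-split regions to avoid hotspotting.
            byte prefix = (byte) Math.floorMod(
                (clientId + attributeType + attribute).hashCode(), numBuckets);

            ByteArrayOutputStream key = new ByteArrayOutputStream();
            key.write(prefix);
            key.writeBytes(hourId.getBytes(StandardCharsets.UTF_8));
            key.writeBytes(clientId.getBytes(StandardCharsets.UTF_8));
            key.writeBytes(attributeType.getBytes(StandardCharsets.UTF_8));
            key.writeBytes(attribute.getBytes(StandardCharsets.UTF_8));
            key.writeBytes(eventTs.getBytes(StandardCharsets.UTF_8));
            return key.toByteArray();
        }
    }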

Page 15: Best Practices

Datasets
• Pre-split regions to distribute the write load
• Use a proper key format to avoid hotspots
• Keep key lengths short
• Use the lowest conflict detection level possible

Flow Queues
• Use a hash-based partitioning strategy for queues
• Minimize the payload size between Flowlets
• Pick an appropriate batch size for processing (between 100 and 500)
• Balance transaction duration against the amount of work done in Flowlets
• Monitor the Flowlet "pending events" metric closely

Page 16: Gains

• Uniques stats are available hours earlier
• Consumes fewer computing resources
• Can (re-)process uniques stats retrospectively in an efficient manner
  • Bug fixes
  • New features
• Enables new products that use real-time feedback