high volume streaming analytics with cdap: 08/19 big data application meetup, talk #2
TRANSCRIPT
High Volume Streaming Analytics with CDAP
Jialong Wu August 19, 2015
COLLECT• Web Data• Mobile Data• Set-Top Data• CRM Data
ORGANIZE• Data into default and custom hierarchies• Easily managed in folder structure
ACTIVATE• Deliver targeted ad campaigns and marketing promotions• Dynamically serve content based on audience segment• Generate advanced audience analytics
LOTAME DMP
2
3
Client Dashboard
4
Number of unique site visitors (profiles) that exhibit certain behavior(s) during some time periodUseful for estimating audience reach, tracking behavior trend, etc.Report stats in Daily, Month-to-date, 30 day intervalsRoll up by client-network and behavior-category hierarchies
sum(Uday1,…,Uday30) != U30day
Counting Uniques
5
Counting Uniques
Old Approach New Approach
• MapReduce-based batch processing
• Re-scans historical profile data Expensive
• Half day wait for the previous day’s stats
• Fixed time buckets for counting
• Real-time event stream processing with CDAP
• Uses HyperLogLog for estimating uniques count
• No re-scanning of historical profile data• Flexible, on demand aggregation and
deduplication• Allows frequent stats updates
Estimates distinct values (DV) in a set by the length of longest run of trailing zeros for all the hash values in the setImproves the estimate by averaging results from multiple estimators
Splits input into 2p bins, p is number of register index bits
Takes harmonic mean of estimates to filter extreme outliersSet unions are lossless
Allows distributed computing of DV count
Estimating set intersections: Accuracy is low for sets that have high cardinality ratio or low overlap
Space saving: 1% error rate for 1B count using 1.5KB in memory
HyperLogLog
7
HLL Accuracy
• Uses register index bits of 14 (~16K bins)• Unbiased error (mean is 0% error)• Standard deviation 0.8%• Accurate 99.9% of the time with < 2.5% error
• Exact count for low cardinality• Stores hash values in Trove sets
8
Good API abstraction layer for building data processing applicationsFaster Time-To-MarketLower entry barrier for developing big data applicationBetter reusability for data and processing patterns
Support for both stream and batch processing paradigmShares data across programs in different paradigms
“Exactly once” transactional processingGood support for application development cycle
No distributed components required for developmentCDAP Modes: In-Memory, Standalone, Distributed
Why CDAP ?
9
Architecture
10
CDAP Flow
11
• Extends CDAP Kafka Flowlet Library• Each Flowlet instance pulls from one topic-partition• Configurable starting point for reading data
CDAP Flow
12
• Encapsulates attribute counting logics in pluggable processors• Attribute tuple = Attribute values + Profile ID hash• Each processor emits to its own Flowlet queue• Emits processed attribute tuples with the hash of attribute value as
partition key
CDAP Flow
13
• Updates in-memory HLL objects for the incoming attribute tuple • Flushes in-memory HLL objects to Datasets every minute• Uses Hash-based Partition Strategy for better scalability
Encodes attribute tuple in a byte arrayRow key components:
Prefix byteHour ID (ddHH)Client IDAttribute TypeAttributeEvent timestamp (yyyyMMddHHmm)
Prefix byte: hash(client_id +attribute type + attribute) mod (# of buckets)Optimized for writes and specific scan patterns
Dataset Row Key
14
DatasetsPre-split regions to distribute write loadProper key format to avoid hotspotKeep key length shortUse the lowest conflict detection level possible
Flow QueuesUse hash-based partitioning strategy for queuesMinimize payload size between FlowletsPick appropriate batch size for processing (between 100 and 500)Balance transaction duration and size of work in FlowletsMonitor Flowlet pending events metric closely
Best Practices
15
Uniques stats are available hours earlierConsumes less computing resourcesAbility to (re-)process uniques stats retrospectively in an efficient manner
Bug fixesNew features
Enables new products that utilizes real-time feedbacks
Gains
16