Deep Dive on Deduplication using Apache Apex and RTS
TRANSCRIPT
Deep Dive on Deduplication
Apache Apex + RTS
Bhupesh Chawda
[email protected]
Software Engineer @ DataTorrentCommitter @ Apache Apex
Agenda
Brief introduction to Apache Apex
De-duplication and related use cases
Demo!
Conclusion with Q&A
Apache Apex - Stream Processing
YARN Native - Uses the Hadoop YARN framework for resource negotiation
Highly Scalable - Scales statically as well as dynamically
Highly Performant - Can reach single digit millisecond end-to-end latency
Fault Tolerant - Automatically recovers from failures - without manual intervention
Stateful - Guarantees that no state will be lost
Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications
Project History
Project development started in 2012 at DataTorrent
Open-sourced in July 2015
Apache Apex started incubation in August 2015
50+ committers from Apple, GE, Capital One, DirecTV, Silver Spring Networks, Barclays, Ampool and DataTorrent
Now a top level Apache project since April 2016
Apex Platform Overview
An Apex Application is a DAG(Directed Acyclic Graph)
A DAG is composed of vertices (Operators) and edges (Streams).
A Stream is a sequence of data tuples which connects operators at end-points called Ports.
An Operator takes one or more input streams, performs computations, and emits one or more output streams.
● Each operator is either the user's business logic or a built-in operator from the open-source library
● An operator may have multiple instances that run in parallel
Apex as a YARN Application
● YARN (Hadoop 2.0) replaces MapReduce with a more generic Resource Management Framework.
● Apex uses YARN for resource management and HDFS for persistent storage
De-duplication (Dedup)
Duplicates are very common in most data sets
Very common in almost all data ingestion and cleansing use cases
Basic functionality is to separate out the duplicates in the data set
Dedup - Considerations
Deduplication seems simple, and does not look like a complicated operation.
Just need to store the set of keys that we have seen so far - ALL the unique keys
Situation becomes complicated when the source data is huge or when the incoming data is never ending!
Need to store ALL the incoming unique keys, so that new duplicates can be detected.
Any new key needs to be searched in the existing set of keys
Any purely in-memory technique would fail in this scenario; more sophisticated techniques are needed
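The naive approach described above can be sketched as follows (a minimal illustration, not the Apex implementation): keeping every key ever seen in an in-memory set works for small, bounded inputs, but on a never-ending stream the set grows without bound.

```java
import java.util.HashSet;
import java.util.Set;

// Naive in-memory dedup: remember every key ever seen.
// Memory grows without bound on an unbounded stream -- the
// exact problem the considerations above call out.
public class NaiveDedup {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the key is new (unique), false if it is a duplicate. */
    public boolean isUnique(String key) {
        return seen.add(key); // Set.add() returns false when the key already exists
    }
}
```

This is the baseline that managed state and expiry, discussed below, are designed to replace.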
To duplicate is nature; to dedup is costly - Anonymous
De-duplication
Managed State
A fault-tolerant, large-scale, persistent bucketing mechanism. Uses HDFS by default.
Uses the concept of Buckets to hash and store the keys on the storage.
A Bucket is an abstraction for a collection of tuples all of which share a common hash value.
Example: a bucket of data for 5 contiguous minutes.
A Bucket has a span property called Bucket Span.
Num Buckets = Expiry Period / Bucket Span
IO from HDFS is slow and hence asynchronous calls are also supported.
Supports an in-memory cache so that repeated accesses are faster.
Periodically, also purges data that is no longer needed.
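The bucket arithmetic above can be sketched as follows (an illustrative model, not the actual Malhar managed-state API): each event time maps to a bucket of Bucket Span width, and the number of live buckets follows Num Buckets = Expiry Period / Bucket Span.

```java
// Illustrative time-based bucketing, assuming times in milliseconds.
public class TimeBucketer {
    private final long bucketSpanMillis;
    private final long expiryPeriodMillis;

    public TimeBucketer(long bucketSpanMillis, long expiryPeriodMillis) {
        this.bucketSpanMillis = bucketSpanMillis;
        this.expiryPeriodMillis = expiryPeriodMillis;
    }

    /** Num Buckets = Expiry Period / Bucket Span */
    public long numBuckets() {
        return expiryPeriodMillis / bucketSpanMillis;
    }

    /** All tuples in the same span share a bucket; expired buckets are reused. */
    public long bucketIndex(long eventTimeMillis) {
        return (eventTimeMillis / bucketSpanMillis) % numBuckets();
    }
}
```

For example, with a 5-minute span and a 30-minute expiry period there are 6 buckets, matching the "bucket of data for 5 contiguous minutes" example above.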
Spillable Data Structures
Persistent data structures based on managed state
Example: SpillableMap, SpillableArrayList, etc.
Managed State and Spillable Data Structures
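Conceptually, a spillable map pairs a fast in-memory cache with a slower persistent store so that state can exceed available heap. The sketch below is a simplified model with hypothetical names, not the actual Malhar Spillable API:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual model of a spillable map: an in-memory cache backed by a
// slower persistent store (standing in here for HDFS-backed managed state).
public class SpillableMapSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>(); // fast path; bounded in practice
    private final Map<K, V> backingStore;            // stands in for persistent storage

    public SpillableMapSketch(Map<K, V> backingStore) {
        this.backingStore = backingStore;
    }

    public V get(K key) {
        V v = cache.get(key);
        if (v == null) {                  // cache miss: fall back to the slow store
            v = backingStore.get(key);
            if (v != null) cache.put(key, v);
        }
        return v;
    }

    public void put(K key, V value) {
        cache.put(key, value);
        backingStore.put(key, value);     // real implementations write asynchronously
    }
}
```

The in-memory cache is what makes repeated accesses fast, as noted above; the asynchronous HDFS writes hide the slow IO path.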
Dedup in Streaming Scenarios
As discussed, in-memory techniques would fail to process a never-ending stream of incoming data.
To address the problems mentioned before:
Memory can fill up too fast
Solution is to store the data on to some scalable, fault tolerant, persistent storage.
HDFS is the default in Apex
Search becomes slow
Use some kind of hashing mechanism for storing the keys on HDFS; plain storage would not work
Use Managed State / Spillable Data Structures in Apex!
Search becomes slow, eventually
Use Expiry!
Expiry in Dedup - Streaming Scenarios
In a streaming scenario, the search is bound to become slow eventually, as we keep storing all the incoming keys.
But this is not needed for most practical scenarios.
In most of the cases, the duplicates for a tuple (record) are usually encountered within a small time span of generation of the original tuple.
This allows us to use the concept of expiry in streaming scenarios to purge state that is no longer needed, thereby reducing our search space.
Currently, expiry based on time is supported, as it is a natural expiry key.
Example schema: {Name, Phone, Email, Date, State, Zip, Country}
Tuple 1:
{ Austin U. Saunders, +91-319-340-59385, [email protected], 2015-11-09 13:38:38, Texas, 73301, United States}
Dedup Key / Expiry Key
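As an illustration of the example schema above (assuming Email as the dedup key and Date as the expiry field; the actual choice of fields is configurable):

```java
// Tuple matching the example schema; which fields serve as the dedup
// and expiry keys is a configuration choice, assumed here for illustration.
public class CustomerTuple {
    String name, phone, email, state, zip, country;
    long dateMillis; // parsed from the "Date" field

    String dedupKey()  { return email; }      // assumed dedup key
    long   expiryKey() { return dateMillis; } // assumed expiry (time) field
}
```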
Details on Expiry
We maintain the following points for expiry:
1. Latest Point
2. Expiry Point
Expiry Point = Latest Point - Expiry Period
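The formula above can be captured in a small sketch (illustrative names, not part of the library): the Latest Point advances with incoming event times, and anything older than Expiry Point = Latest Point - Expiry Period is treated as expired and never searched for.

```java
// Time-based expiry bookkeeping, following the formula above.
public class ExpiryTracker {
    private final long expiryPeriodMillis;
    private long latestPointMillis = Long.MIN_VALUE;

    public ExpiryTracker(long expiryPeriodMillis) {
        this.expiryPeriodMillis = expiryPeriodMillis;
    }

    /** Advance the Latest Point as new tuples arrive (never moves backwards). */
    public void observe(long eventTimeMillis) {
        latestPointMillis = Math.max(latestPointMillis, eventTimeMillis);
    }

    /** Expiry Point = Latest Point - Expiry Period */
    public long expiryPoint() {
        return latestPointMillis - expiryPeriodMillis;
    }

    /** Expired tuples fall outside the retained state and skip the search. */
    public boolean isExpired(long eventTimeMillis) {
        return eventTimeMillis < expiryPoint();
    }
}
```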
Architecture of De-duplication
More on Bucketing
Number of buckets is proportional to the Expiry Period
Based on Time field
Size of a bucket = the bucket span
Dedup in Streaming - Use cases
Deduplication for Bounded data - Batch
Parameters required:
Key field for de-duplication
Deduplication for Unbounded data - Streaming
Parameters required:
Key field for de-duplication
Time field for expiry
Expiry duration
Reference time
Bucket span
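The streaming parameters listed above could be collected in a configuration object like the following (field names are hypothetical and illustrative, not the actual Malhar operator properties):

```java
// Hypothetical holder for the streaming-dedup parameters listed above.
public class DedupConfig {
    String keyField;          // e.g. "Email" -- field used for de-duplication
    String timeField;         // e.g. "Date"  -- field used for expiry
    long expiryMillis;        // how long a key is remembered (expiry duration)
    long referenceTimeMillis; // reference time from which buckets are numbered
    long bucketSpanMillis;    // width of each time bucket

    /** Num Buckets = Expiry Period / Bucket Span, as defined earlier. */
    long numBuckets() {
        return expiryMillis / bucketSpanMillis;
    }
}
```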
Demo time!
Conclusion
De-duplication is an important and complex functionality, provided out of the box in DataTorrent RTS and Apex Malhar
Uses Managed State for state management, and asynchronous processing to maintain low latency
Uses expiry based semantics for streaming de-duplication scenarios
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
ᵒ Apex - http://apex.apache.org/community.html
ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
ᵒ @ApacheApex - https://twitter.com/apacheapex
ᵒ @DataTorrent - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product - https://datatorrent.com/product/startup-accelerator/
• Big Data Application Templates Hub - https://datatorrent.com/apphub
We’re [email protected]
Developers/Architects
QA Automation Developers
Information Developers
Build and Release
Community Leaders
Thank you!