Deep Dive on Deduplication using Apache Apex and RTS
TRANSCRIPT
Deep Dive on Deduplication
Apache Apex + RTS
Bhupesh Chawda
[email protected]
Software Engineer @ DataTorrentCommitter @ Apache Apex
Agenda
Brief introduction to Apache Apex
De-duplication and related use cases
Demo!
Conclusion with Q&A
Apache Apex - Stream Processing
YARN Native - Uses the Hadoop YARN framework for resource negotiation
Highly Scalable - Scales statically as well as dynamically
Highly Performant - Can reach single digit millisecond end-to-end latency
Fault Tolerant - Automatically recovers from failures - without manual intervention
Stateful - Guarantees that no state will be lost
Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications
Project History
Project development started in 2012 at DataTorrent
Open-sourced in July 2015
Apache Apex started incubation in August 2015
50+ committers from Apple, GE, Capital One, DirecTV, Silver Spring Networks, Barclays, Ampool and DataTorrent
Now a top level Apache project since April 2016
Apex Platform Overview
An Apex Application is a DAG(Directed Acyclic Graph)
A DAG is composed of vertices (Operators) and edges (Streams).
A Stream is a sequence of data tuples which connects operators at end-points called Ports.
An Operator takes one or more input streams, performs computations, and emits one or more output streams.
● Each operator is either the user's business logic or a built-in operator from the open-source library
● An operator may have multiple instances that run in parallel
Apex as a YARN Application
● YARN (Hadoop 2.0) replaces MapReduce with a more generic Resource Management Framework.
● Apex uses YARN for resource management and HDFS for persistent storage
De-duplication (Dedup)
Duplicates are very common in most data sets
Very common in almost all data ingestion and cleansing use cases
Basic functionality is to separate out the duplicates in the data set
Dedup - Considerations
Deduplication seems simple, and does not look like a complicated operation.
Just need to store the set of keys that we have seen so far - ALL the unique keys
Situation becomes complicated when the source data is huge or when the incoming data is never ending!
Need to store ALL the incoming unique keys, so that new duplicates can be detected.
Any new key needs to be searched in the existing set of keys
Any purely in-memory technique would fail in this scenario; more sophisticated techniques are needed
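The naive approach described above can be sketched as follows (a minimal illustration, not the Apex implementation): keeping every key ever seen in an in-memory set works for small, bounded inputs, but on a never-ending stream the set grows without bound.

```java
import java.util.HashSet;
import java.util.Set;

// Naive in-memory dedup: remember every key ever seen.
// Memory grows without bound on an unbounded stream -- the
// exact problem the considerations above call out.
public class NaiveDedup {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the key is new (unique), false if it is a duplicate. */
    public boolean isUnique(String key) {
        return seen.add(key); // Set.add() returns false when the key already exists
    }
}
```

This is the baseline that managed state and expiry, discussed below, are designed to replace.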
To duplicate is nature; to dedup is costly - Anonymous
De-duplication
Managed State
A fault-tolerant, large-scale, persistent bucketing mechanism. Uses HDFS by default.
Uses the concept of Buckets to hash and store the keys on the storage.
A Bucket is an abstraction for a collection of tuples all of which share a common hash value.
Example: a bucket of data for 5 contiguous minutes.
A Bucket has a span property called Bucket Span.
Num Buckets = Expiry Period / Bucket Span
IO from HDFS is slow and hence asynchronous calls are also supported.
Supports an in-memory cache so that repeated accesses are faster.
Periodically, also purges data that is no longer needed.
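The bucket arithmetic above can be sketched as follows (an illustrative model, not the actual Malhar managed-state API): each event time maps to a bucket of Bucket Span width, and the number of live buckets follows Num Buckets = Expiry Period / Bucket Span.

```java
// Illustrative time-based bucketing, assuming times in milliseconds.
public class TimeBucketer {
    private final long bucketSpanMillis;
    private final long expiryPeriodMillis;

    public TimeBucketer(long bucketSpanMillis, long expiryPeriodMillis) {
        this.bucketSpanMillis = bucketSpanMillis;
        this.expiryPeriodMillis = expiryPeriodMillis;
    }

    /** Num Buckets = Expiry Period / Bucket Span */
    public long numBuckets() {
        return expiryPeriodMillis / bucketSpanMillis;
    }

    /** All tuples in the same span share a bucket; expired buckets are reused. */
    public long bucketIndex(long eventTimeMillis) {
        return (eventTimeMillis / bucketSpanMillis) % numBuckets();
    }
}
```

For example, with a 5-minute span and a 30-minute expiry period there are 6 buckets, matching the "bucket of data for 5 contiguous minutes" example above.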
Spillable Data Structures
Persistent data structures based on managed state
Example: SpillableMap, SpillableArrayList, etc.
Managed State and Spillable Data Structures
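Conceptually, a spillable map pairs a fast in-memory cache with a slower persistent store so that state can exceed available heap. The sketch below is a simplified model with hypothetical names, not the actual Malhar Spillable API:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual model of a spillable map: an in-memory cache backed by a
// slower persistent store (standing in here for HDFS-backed managed state).
public class SpillableMapSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>(); // fast path; bounded in practice
    private final Map<K, V> backingStore;            // stands in for persistent storage

    public SpillableMapSketch(Map<K, V> backingStore) {
        this.backingStore = backingStore;
    }

    public V get(K key) {
        V v = cache.get(key);
        if (v == null) {                  // cache miss: fall back to the slow store
            v = backingStore.get(key);
            if (v != null) cache.put(key, v);
        }
        return v;
    }

    public void put(K key, V value) {
        cache.put(key, value);
        backingStore.put(key, value);     // real implementations write asynchronously
    }
}
```

The in-memory cache is what makes repeated accesses fast, as noted above; the asynchronous HDFS writes hide the slow IO path.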
Dedup in Streaming Scenarios
As discussed, in-memory techniques would fail to process a never-ending stream of incoming data.
To address the problems mentioned before:
Memory can fill up too fast
Solution is to store the data on to some scalable, fault tolerant, persistent storage.
HDFS is the default in Apex
Search becomes slow
Use some kind of hashing mechanism for storing the keys on HDFS; plain storage would not work
Use Managed State / Spillable Data Structures in Apex!
Search becomes slow, eventually
Use Expiry!
Expiry in Dedup - Streaming Scenarios
In a streaming scenario, the search is bound to become slow eventually, as we keep storing all the incoming keys.
But this is not needed for most practical scenarios.
In most of the cases, the duplicates for a tuple (record) are usually encountered within a small time span of generation of the original tuple.
This allows us to use the concept of expiry in streaming scenarios to purge state that is no longer needed, thereby reducing our search space.
Currently, expiry based on time is supported, as it is a natural expiry key.
Example schema: {Name, Phone, Email, Date, State, Zip, Country}
Tuple 1:
{ Austin U. Saunders, +91-319-340-59385, [email protected], 2015-11-09 13:38:38, Texas, 73301, United States}
Dedup Key / Expiry Key
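As an illustration of the example schema above (assuming Email as the dedup key and Date as the expiry field; the actual choice of fields is configurable):

```java
// Tuple matching the example schema; which fields serve as the dedup
// and expiry keys is a configuration choice, assumed here for illustration.
public class CustomerTuple {
    String name, phone, email, state, zip, country;
    long dateMillis; // parsed from the "Date" field

    String dedupKey()  { return email; }      // assumed dedup key
    long   expiryKey() { return dateMillis; } // assumed expiry (time) field
}
```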
Details on Expiry
We maintain the following points for expiry:
1. Latest Point
2. Expiry Point
Expiry Point = Latest Point - Expiry Period
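The formula above can be captured in a small sketch (illustrative names, not part of the library): the Latest Point advances with incoming event times, and anything older than Expiry Point = Latest Point - Expiry Period is treated as expired and never searched for.

```java
// Time-based expiry bookkeeping, following the formula above.
public class ExpiryTracker {
    private final long expiryPeriodMillis;
    private long latestPointMillis = Long.MIN_VALUE;

    public ExpiryTracker(long expiryPeriodMillis) {
        this.expiryPeriodMillis = expiryPeriodMillis;
    }

    /** Advance the Latest Point as new tuples arrive (never moves backwards). */
    public void observe(long eventTimeMillis) {
        latestPointMillis = Math.max(latestPointMillis, eventTimeMillis);
    }

    /** Expiry Point = Latest Point - Expiry Period */
    public long expiryPoint() {
        return latestPointMillis - expiryPeriodMillis;
    }

    /** Expired tuples fall outside the retained state and skip the search. */
    public boolean isExpired(long eventTimeMillis) {
        return eventTimeMillis < expiryPoint();
    }
}
```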
Architecture of De-duplication
More on Bucketing
Number of buckets is proportional to the Expiry Period
Based on Time field
Size of a bucket = the bucket span
Dedup in Streaming - Use cases
Deduplication for Bounded data - Batch
Parameters required:
Key field for de-duplication
Deduplication for Unbounded data - Streaming
Parameters required:
Key field for de-duplication
Time field for expiry
Expiry duration
Reference time
Bucket span
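The streaming parameters listed above could be collected in a configuration object like the following (field names are hypothetical and illustrative, not the actual Malhar operator properties):

```java
// Hypothetical holder for the streaming-dedup parameters listed above.
public class DedupConfig {
    String keyField;          // e.g. "Email" -- field used for de-duplication
    String timeField;         // e.g. "Date"  -- field used for expiry
    long expiryMillis;        // how long a key is remembered (expiry duration)
    long referenceTimeMillis; // reference time from which buckets are numbered
    long bucketSpanMillis;    // width of each time bucket

    /** Num Buckets = Expiry Period / Bucket Span, as defined earlier. */
    long numBuckets() {
        return expiryMillis / bucketSpanMillis;
    }
}
```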
Demo time!
Conclusion
De-duplication is an important and complex functionality, provided out of the box in DataTorrent RTS and Apex Malhar
Uses Managed State for state management, and asynchronous processing to maintain low latency
Uses expiry based semantics for streaming de-duplication scenarios
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
ᵒ Apex - http://apex.apache.org/community.html
ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
ᵒ @ApacheApex - https://twitter.com/apacheapex
ᵒ @DataTorrent - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product - https://datatorrent.com/product/startup-accelerator/
• Big Data Application Templates Hub - https://datatorrent.com/apphub
We’re [email protected]
Developers/Architects
QA Automation Developers
Information Developers
Build and Release
Community Leaders
Thank you!