big data hadoop streaming etl template for kafka-filter-hdfs


Upload: datatorrent

Post on 14-Apr-2017

Big Data Hadoop Streaming ETL template for Kafka-Filter-HDFS

Deepak Narkhede, Mohit Jotwani

[email protected]

[email protected]

Agenda

• DataTorrent - Vision
• About Apache Apex
• App templates
• Kafka to HDFS App Template
• Live demo
• Roadmap

DataTorrent Vision - Productize Big Data

• Big Data is neither Productized nor Operationalized
• Total Cost of Ownership (TCO) includes
    • Time to Develop + Time to Launch + Cost of ongoing Operations
• Provide a Product to ...
    • Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability
    • Support Dev, Test, Prod cycle to Launch Apps quickly
    • Manage and Visualize Applications for Operability

Next Gen Big Data Applications

[Figure: example pipeline diagram with the stages Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka]

• Variety of sources - IoT, Kafka, files, social media etc.
• Variety of sinks - Kafka, files, databases etc.
    * Supports low latency real time visualizations as well
• Unbounded and continuous data streams
• Batch support, batch processed as stream
• In-memory processing with temporal window boundaries
• Stateful operations: Aggregation, Rules etc. → Analytics
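
To make "temporal window boundaries" and "stateful aggregation" concrete, here is a minimal sketch in plain Java. It is not Apex code and does not use any Apex or Malhar API; the `Event` type, the `countPerWindow` helper, and the fixed one-second window are all assumptions made for illustration. Each event is assigned to the window that contains its timestamp, and a per-window count is kept as state.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WindowedCount {
    /** Hypothetical log event: a timestamp in milliseconds plus a key to aggregate on. */
    static final class Event {
        final long timestampMillis;
        final String key;

        Event(long timestampMillis, String key) {
            this.timestampMillis = timestampMillis;
            this.key = key;
        }
    }

    /** Groups events into fixed-size windows and counts occurrences of each key per window. */
    static Map<Long, Map<String, Long>> countPerWindow(List<Event> events, long windowMillis) {
        Map<Long, Map<String, Long>> windows = new TreeMap<>();
        for (Event e : events) {
            // The window boundary is the timestamp rounded down to a multiple of the window size.
            long windowStart = Math.floorDiv(e.timestampMillis, windowMillis) * windowMillis;
            windows.computeIfAbsent(windowStart, w -> new TreeMap<>())
                   .merge(e.key, 1L, Long::sum);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(100, "GET"), new Event(900, "POST"),
            new Event(1_200, "GET"), new Event(1_700, "GET"));
        // prints {0={GET=1, POST=1}, 1000={GET=2}}
        System.out.println(countPerWindow(events, 1_000));
    }
}
```

A real streaming engine would emit and discard each window's state as the window closes rather than holding every window in one map; the sketch keeps them all only so the result is easy to inspect.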

Big Data Ecosystem: Where DataTorrent fits

[Figure: data sources (Sensor Data, Social Media, Web Servers, App Servers, Click Streams) feed a chain of operators (Oper1, Oper2, Oper3) running on Hadoop (YARN + HDFS), with real-time analytics and real-time data visualization as outputs]

DataTorrent Architecture

[Figure: layered architecture diagram. Recoverable labels:]

• Solutions for Business Problems: Ingestion & Data Prep, ETL Pipelines
• Ease of Use Tools: Real-Time Data Visualization, Management & Monitoring, GUI Application Assembly
• Application Templates (connector labels: Kafka, HDFS, JDBC)
• Apex-Malhar Operator Library: High-level API, Transformation, ML & Score, SQL, Analytics, FileSync
• Core: Dev Framework, Batch Support, Apache Apex Core
• Big Data Infrastructure: Hadoop 2.x - YARN + HDFS - On Prem & Cloud

App Templates – Recurring patterns

• Building apps, such as ingestion and transform apps, for patterns common in customer use cases

Use Case Pattern | Sources | Processors | Sinks
Data Synchronization, Staging Data for Analytics | HDFS, Kafka, JDBC, S3 | → | HDFS, S3
Enriching Data before Staging | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra
Merge & Transform Data Streams | Kafka, JDBC, FileStream | Merge → Transform → Filter → Enricher | HDFS
Machine Scoring | Kafka | H2O or Custom | HDFS
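
As an illustration of the "Parser → Deduper → Enricher → Formatter" processor chain, the following plain-Java sketch runs the four stages over a batch of lines. It is a toy under stated assumptions, not the template's implementation: the `id,user,amount` record layout, the `countryByUser` lookup table, and the `|`-separated output format are all hypothetical, and no Apex operator classes are used.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EnrichPipeline {
    /** Runs parse, dedupe, enrich, and format over input lines of the assumed form "id,user,amount". */
    static List<String> run(List<String> lines, Map<String, String> countryByUser) {
        Set<String> seenIds = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            // Parser: split into fields and skip malformed records.
            String[] fields = line.split(",", -1);
            if (fields.length != 3) {
                continue;
            }
            // Deduper: drop a record whose id has already been seen.
            if (!seenIds.add(fields[0])) {
                continue;
            }
            // Enricher: attach a value looked up from reference data.
            String country = countryByUser.getOrDefault(fields[1], "unknown");
            // Formatter: render the enriched record in the sink's format.
            out.add(String.join("|", fields[0], fields[1], fields[2], country));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("1,alice,10", "1,alice,10", "bad", "2,bob,5");
        // prints [1|alice|10|IN, 2|bob|5|unknown]
        System.out.println(run(input, Map.of("alice", "IN")));
    }
}
```

In a streaming engine each stage would be a separate operator connected by streams, so that stages can be scaled and replaced independently; the single loop here only shows the order of the steps.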

AppHub – App Template Repository

• Central repository for big data application templates
• Tested and published by DataTorrent
• Accessible via dtManage on DataTorrent RTS and direct app download from website
• Provides version updates via dtManage

App Templates advantages

Ease of use
✓ Quickly import and launch app template applications
✓ Easily add business logic by adding custom operators

Time to market and TCO
✓ Reduces time to production drastically
✓ Reduces cost of operations in production

Real-time Visualizations
✓ Real-time visualizations of operational metrics such as throughput, latency etc.
✓ Real-time visualizations of application data such as number of files processed, amount of data transferred etc.

Brief about Kafka

● Distributed Messaging System.
● Data Partitioning Capability.
● Fast Reads and Writes.
● Basic Terminology
    ○ Topic
    ○ Producer
    ○ Consumer
    ○ Broker
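
The terminology above can be pictured with a toy in-memory model. This is only a sketch of the concepts, not the Kafka client API: `ToyTopic`, `send`, and `poll` are hypothetical names, and the use of `String.hashCode` to pick a partition is an assumption made for the example. The object plays the role of a broker hosting one topic, `send` is the producer side, and `poll` is the consumer side reading from an offset.

```java
import java.util.ArrayList;
import java.util.List;

class ToyTopic {
    // A topic is split into partitions; each partition is an append-only log.
    private final List<List<String>> partitions = new ArrayList<>();

    ToyTopic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Producer side: appends a value to the partition chosen from the key and returns that partition. */
    int send(String key, String value) {
        int partition = Math.floorMod(key.hashCode(), partitions.size());
        partitions.get(partition).add(value);
        return partition;
    }

    /** Consumer side: returns every value in the partition from the given offset onward. */
    List<String> poll(int partition, int offset) {
        List<String> log = partitions.get(partition);
        int from = Math.min(Math.max(offset, 0), log.size());
        return new ArrayList<>(log.subList(from, log.size()));
    }

    public static void main(String[] args) {
        ToyTopic topic = new ToyTopic(3);
        int p = topic.send("user-1", "login");
        topic.send("user-1", "click");
        // Records with the same key land in the same partition, in send order.
        System.out.println(topic.poll(p, 0)); // prints [login, click]
        System.out.println(topic.poll(p, 1)); // prints [click]
    }
}
```

Because the consumer tracks its own offset, it can re-read a partition from an earlier position, which is what lets a downstream application replay data after a failure.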

App Template Demo

• Look at: https://www.datatorrent.com/apphub/
• Ready to use, customizable applications for big data ingestion use-cases.
• Source: https://github.com/DataTorrent/app-templates (Apache 2.0)

Kafka to HDFS Filter app-template
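
The shape of a Kafka → Filter → HDFS flow can be sketched in a few lines of plain Java. This is not the template's source code and uses no Kafka, Apex, or HDFS API: the `Iterable<String>` stands in for messages read from a Kafka topic, the `Writer` stands in for an HDFS output file, and `copyFiltered` and the `ERROR` predicate are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.function.Predicate;

class FilterToSink {
    /** Writes each message that passes the predicate to the sink, one per line, and returns the count written. */
    static int copyFiltered(Iterable<String> messages, Predicate<String> keep, Writer sink) {
        int written = 0;
        try {
            for (String message : messages) {
                if (keep.test(message)) {
                    sink.write(message);
                    sink.write('\n');
                    written++;
                }
            }
            sink.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        List<String> messages = List.of("INFO start", "ERROR disk full", "INFO stop");
        int n = copyFiltered(messages, m -> m.startsWith("ERROR"), sink);
        System.out.print(n + " record(s) written:\n" + sink);
    }
}
```

Keeping the filter condition as a separate predicate mirrors the idea of a template whose source and sink stay fixed while the business rule is swapped out.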

Roadmap

• Visualizations – widgets on app data
    • Metrics such as size of data moved, lines per file, number of errors etc.
    • Custom user defined metrics using apex auto-metrics
• Schema enablement
• Cloud Integrations
    • Amazon EMR, Microsoft Azure
• Upcoming app templates
    • FTP → HDFS
    • SFTP → HDFS
    • Kinesis → S3
    • Kinesis → Redshift
    • Kafka → JSON parse → filter → transform → HDFS
    • Kafka → CSV parse → filter → transform → HDFS

Questions

• Send feedback to: https://groups.google.com/forum/#!forum/dt-users
• Email to: [email protected]

Resources

• Apache Apex - http://apex.apache.org/
• Subscribe to forums
    ᵒ Apex - http://apex.apache.org/community.html
    ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
    ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
    ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator – Free full featured enterprise product
    ᵒ https://datatorrent.com/product/startup-accelerator/
• Big Data Application Templates Hub – https://datatorrent.com/apphub

We are hiring!

[email protected]

• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders
