Deep dive of HDFS to HDFS Sync - Disaster Recovery App Template
Yogi [email protected]
Agenda
● About Apache Apex
● Ingestion
● App templates
● HDFS to HDFS sync app template
● Live demo
● Scalability, fault tolerance
What is Apache Apex
● Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
● Hadoop native
● Unified engine to process streaming or batch big data
● High throughput and low latency
● Library of commonly needed business logic
● Write any custom business logic in your application
Apex rationale
● Migrate a lot more use cases to Hadoop
● Productization of big data projects on Hadoop
● Enable users to extract value from big data
● Significant reduction of time to market for big data applications migrating to Hadoop
Reference: https://wiki.apache.org/incubator/ApexProposal
Guiding Principles
● Operability
● Highly scalable and performant
● Fault tolerant
● Hadoop native
● Easy to integrate
● Easy to develop
Use cases
● Advertising
● IoT
● Finance
● Telecoms and Networks
● Ingestion
● ...
Architecture overview
Ingestion
Data ingestion
● Obtaining, importing, and processing data for later use or storage in a database
Big Data Ingestion
● Discovering the data sources
● Importing the data
● Processing data to produce intermediate data
● Sending data out to durable data stores
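The four ingestion steps above can be pictured as a chain of stages. The following is a minimal sketch only, not code from Apex or the app templates: the function names (discover, import_records, process, store, run_pipeline) are hypothetical, and plain in-memory dicts stand in for real data sources and for a durable data store.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (source name, line of data)


def discover(sources: Dict[str, List[str]]) -> Iterator[str]:
    """Step 1: discover the data sources (here: keys of an in-memory dict)."""
    yield from sorted(sources)


def import_records(sources: Dict[str, List[str]],
                   names: Iterable[str]) -> Iterator[Record]:
    """Step 2: import the raw data from every discovered source."""
    for name in names:
        for line in sources[name]:
            yield name, line


def process(records: Iterable[Record]) -> Iterator[Record]:
    """Step 3: produce intermediate data (here: trim lines, drop blanks)."""
    for name, line in records:
        cleaned = line.strip()
        if cleaned:
            yield name, cleaned


def store(records: Iterable[Record], sink: Dict[str, List[str]]) -> int:
    """Step 4: send data out (a dict stands in for a durable data store)."""
    count = 0
    for name, line in records:
        sink.setdefault(name, []).append(line)
        count += 1
    return count


def run_pipeline(sources: Dict[str, List[str]]) -> Tuple[Dict[str, List[str]], int]:
    """Wire the four stages together and return the sink and the record count."""
    sink: Dict[str, List[str]] = {}
    written = store(process(import_records(sources, discover(sources))), sink)
    return sink, written
```

Because every stage is a generator, records flow through one at a time instead of being materialized between steps, which is the property that matters when the input is large.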
Challenges in Ingestion @ scale
● Failure recovery
● Copying a large number of files
● Copying big files in parallel
● Bandwidth limit
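Three of these challenges lend themselves to a small illustration: splitting a big file into blocks that can be copied in parallel, skipping blocks that were already copied when recovering from a failure, and pacing writes to respect a bandwidth limit. The sketch below is one possible approach under stated assumptions, not how the app template is implemented; plan_blocks, pending_blocks, and throttle_delay are hypothetical helper names.

```python
from typing import Iterable, List, Set, Tuple

Block = Tuple[int, int]  # (byte offset, length in bytes)


def plan_blocks(file_size: int, block_size: int) -> List[Block]:
    """Split a file into fixed-size ranges that workers can copy independently."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def pending_blocks(blocks: Iterable[Block], done_offsets: Set[int]) -> List[Block]:
    """Failure recovery: keep only blocks whose offset is not yet recorded as copied."""
    return [block for block in blocks if block[0] not in done_offsets]


def throttle_delay(bytes_sent: int, elapsed_s: float, limit_bps: float) -> float:
    """Seconds to wait so the average rate stays at or below limit_bps bytes/second."""
    if limit_bps <= 0:
        raise ValueError("limit_bps must be > 0")
    # Earliest time at which bytes_sent is allowed to have been transferred.
    earliest_allowed_s = bytes_sent / limit_bps
    return max(0.0, earliest_allowed_s - elapsed_s)
```

A copier built on these helpers would record each block's offset after it is written, so a restart only needs pending_blocks to know what is left, and would sleep for throttle_delay seconds after each write.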
App templates
● Look at: https://www.datatorrent.com/apphub/
● Ready to use, customizable applications for big data ingestion use-cases.
● Source: https://github.com/DataTorrent/app-templates (Apache 2.0)
HDFS to HDFS sync
● Use cases
○ Disaster recovery
○ Archival
● Features
○ Dynamic scaling
○ Fault tolerance
○ Easy to customize
Application DAG
● A DAG is composed of vertices (Operators) and edges (Streams).
● A Stream is a sequence of data tuples that connects operators at end-points called Ports.
● An Operator takes one or more input streams, performs computations, and emits one or more output streams.
○ Each operator is the user's business logic or a built-in operator from our open source library
○ An operator may have multiple instances that run in parallel
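To make the vocabulary concrete, here is a toy in-memory model of operators connected by streams. It is a conceptual sketch only and deliberately does not use the Apex API: the Operator and Dag classes and their methods are hypothetical, ports are reduced to a single implicit input and output per operator, and parallel operator instances are not modeled.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


class Operator:
    """A vertex: wraps business logic that maps one tuple to zero or more tuples."""

    def __init__(self, name: str, logic: Callable[[Any], Iterable[Any]]) -> None:
        self.name = name
        self.logic = logic


class Dag:
    """Operators (vertices) connected by streams (edges)."""

    def __init__(self) -> None:
        self.operators: Dict[str, Operator] = {}
        # Upstream operator name -> names of operators fed by its output.
        self.streams: Dict[str, List[str]] = defaultdict(list)

    def add_operator(self, name: str, logic: Callable[[Any], Iterable[Any]]) -> Operator:
        operator = Operator(name, logic)
        self.operators[name] = operator
        return operator

    def add_stream(self, source: Operator, sink: Operator) -> None:
        self.streams[source.name].append(sink.name)

    def _emit(self, name: str, item: Any, results: Dict[str, List[Any]]) -> None:
        outputs = list(self.operators[name].logic(item))
        downstream = self.streams.get(name, [])
        if not downstream:
            # No outgoing stream: collect what this operator emitted.
            results[name].extend(outputs)
            return
        for output in outputs:
            for next_name in downstream:
                self._emit(next_name, output, results)

    def run(self, entry: Operator, items: Iterable[Any]) -> Dict[str, List[Any]]:
        """Push each input tuple through the DAG, starting at the entry operator."""
        results: Dict[str, List[Any]] = defaultdict(list)
        for item in items:
            self._emit(entry.name, item, results)
        return dict(results)
```

For instance, an operator that splits lines into words can be connected by a stream to an operator that upper-cases each word; running the DAG on a few lines then collects the upper-cased words at the terminal operator.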
Questions
Image ref [2]
Resources
● Apache Apex - http://apex.apache.org/
● Subscribe to forums
○ Apex - http://apex.apache.org/community.html
○ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
● Download - https://datatorrent.com/download/
● Twitter
○ @ApacheApex; Follow - https://twitter.com/apacheapex
○ @DataTorrent; Follow - https://twitter.com/datatorrent
● Meetups - http://meetup.com/topics/apache-apex
● Webinars - https://datatorrent.com/webinars/
● Videos - https://youtube.com/user/DataTorrent
● Slides - http://slideshare.net/DataTorrent/presentations
● Startup Accelerator - Free full featured enterprise product
○ https://datatorrent.com/product/startup-accelerator/
● Big Data Application Templates Hub - https://datatorrent.com/apphub
We Are Hiring
• [email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders