
Cask Data Application Platform (CDAP)
customers.cask.co/rs/882-OYR-915/images/CDAP_101.pdf

Cask Data Application Platform (CDAP)

CDAP is an integrated application development framework for Hadoop. It integrates and abstracts the underlying Hadoop technologies to provide simple, easy-to-use APIs to build, deploy and manage complex data analytics applications in the cloud or on-premises.

You can build:

Data Ingestion Applications -- Batch or Realtime

Data Processing Workflows

Real-time Applications

Data Services

Predictive Analytics Applications

Business Analytics Applications

Social Applications, and many more

It's built for:

Developers

Operations

Data Engineers

Data Scientists

Try it:

Use Cloudera Manager CSD to install on a cluster

CDAP Standalone Docker through Kitematic

Download CDAP Standalone to build your application

License: Apache 2.0

1. Accelerated ROI -- faster time to market, faster time to value

2. Maximize developer productivity and minimize TCO

3. Simple, easy and standard APIs for developers & operations

4. Enables reusability and self-service

5. Future proof -- distribution and deployment agnostic

6. Supports different workloads -- transactional and non-transactional

Cask Data Application Platform - CDAP 101. Copyright © 2016 Cask Data Inc. Proprietary and Confidential.


Architecture

This section describes the functional and physical architecture of CDAP.

Functional Architecture


API

Application

An Application is a standardized container framework for defining all services. It simplifies the painful integration process across heterogeneous infrastructure technologies running on Hadoop. It's responsible for managing the lifecycle of Programs and Datasets within an application. E.g. Wikipedia Analysis, Twitter Sentiment Analysis, Fraud Detection, etc.

Application Template

An Application Template is a user-defined, reusable, reconfigurable pattern of an Application. It is parameterized by a configuration that allows reconfigurability upon deployment. It simplifies development by providing one generic version of an application that can be repurposed, instead of the ongoing creation of specialized applications. It exposes the reconfigurability and modularization of an Application through Plugins. E.g. a user-defined Template, or Cask-provided templates like CDAP ETL Batch, CDAP ETL Real-time, CDAP Data Pipeline Batch, CDAP Data Pipeline Real-time, CDAP Spark Streaming Pipelines, Data Quality, etc.

Dataset

A Dataset is a standardized container framework for organizing, storing and accessing data from various storage engines. It simplifies integration with different storage engines, allowing one to build complex data patterns across multiple storage types on Hadoop. It's responsible for exposing transactionally consistent data patterns, integration with query engines, schema evolution and data lifecycle management. E.g. Indexed Dataset, Time Partitioned Fileset, Partitioned Fileset, OLAP Cube Dataset, Indexed Object Store, Object Store, and Timeseries Dataset are different types of datasets that can be defined in CDAP.

Extension

An Extension is an Application Template with a domain-specific UI integrated into the CDAP UI. E.g. Cask Hydrator and Cask Tracker.

Program

A Program is a container of well-defined tasks for processing or servicing Datasets to generate zero or more Datasets. It is responsible for managing the lifecycle, and for integration with transactions, metrics, logging and the metadata system. E.g. Spark Program, MapReduce Program, Workflow Program, Worker Program, Service Program, Flow Program, etc.

Plugin

A Plugin is a customizable module exposed and used by an Application or an Application Template. It simplifies adding new features or extending the capability of an Application. Plugin implementations are based on interfaces exposed by the Application. E.g. the CDAP ETL Batch Application template exposes three plugins, namely Source, Transform & Sink; the CDAP Data Quality Application template exposes an Aggregation plugin.

Artifact

An Artifact is a versioned packaging format used to aggregate one or more Applications, Datasets, Plugins, Resources and associated metadata. It's a JAR (Java Archive) containing the Java classes and resources required to create and run the Application.

1. Extensions are a new concept within CDAP and are not ready for general use.


Tools

Command Line Interface (CLI)

The CDAP CLI allows developers and operations teams to script and automate interactions with local or remote CDAP entities from the shell. The CLI uses the CDAP REST APIs to provide this functionality. Using the CLI, one can manage the lifecycle of Applications, Artifacts, Programs and Datasets. More information can be found here

Testing Framework

An end-to-end JUnit scaffolding over CDAP that allows developers to test their Applications, Plugins and Programs during development. It's built as a modular framework, allowing developers to also test individual components. The tests can be integrated with Continuous Integration (CI) tools like Bamboo, Jenkins and TeamCity.

JDBC / ODBC Driver

The CDAP JDBC and ODBC drivers enable users to access Datasets (HDFS, HBase or a Composite Dataset) on Hadoop through Business Intelligence (BI) applications with JDBC or ODBC support. The driver achieves this integration by translating JDBC/ODBC calls from the application into SQL and passing the SQL queries to the underlying Dataset management and query engine (Hive is the default).

Monitoring Integrations

These integrations allow external systems to monitor CDAP and the Applications running within it. Integrations with Nagios, Sensu, Cacti and Splunk are supported, and they are achieved by plugins that access status, logs and metrics through the REST APIs.

Performance Framework

The CDAP performance framework lets you load-test your Application and capture performance metrics to diagnose bottlenecks within it.

User Interface (Console)

The Console provides a user-friendly graphical interface with well-designed user workflows for deploying and managing the lifecycle of Applications, Programs, Datasets and Artifacts. Operations management capabilities allow deeper and faster insights when diagnosing issues with different entities. It also exposes administrative capabilities for managing CDAP.

2. Framework only available in Java.

3. In-memory CDAP -- abstracted to in-memory structures for easy debugging (shorter stack traces).

4. Isn't publicly available yet, but access will be provided on demand. We are still evolving it.


Router

Service Discovery

Service discovery allows users to register data Services running in containers on a cluster. It achieves this by registering one or more service endpoints announced via ZooKeeper and actively maintaining the live state of the running services.

Dataset and Service

Data within Datasets can be exposed to external clients through a Service Program. Developers can implement custom Services that expose data from a Dataset or write to it. A Service exposes user-defined REST APIs over a Dataset. Services execute as YARN containers on the cluster, eliminating the additional step of migrating data into a traditional database before it can be exposed to applications. In the future, Datasets will be able to expose REST APIs automatically: developers will use annotations to map REST endpoints to methods within Datasets, simplifying the Data-as-a-Service concept.

CDAP System Services

In order to simplify the deployment model of CDAP on a Hadoop cluster, CDAP uses a small portion of the cluster to run mission-critical CDAP system services in YARN containers. It also does this to support elastic scaling of the services without having to stop them. So, CDAP runs partly on edge nodes and partly within the cluster. The system ensures that the system services are distributed evenly across the nodes of the cluster and don't interfere with normal operations and jobs running on the cluster. More information about deployment and services can be found here

Additional Information

Service Dispatch

A request for accessing a service method is appropriately routed to the right container running a service on the cluster. In case of multiple instances of a service, a routing strategy is engaged automatically to distribute the load across multiple instances.
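The routing idea can be sketched with a simple round-robin strategy. This is an illustrative model only, not CDAP's actual dispatcher; the class name and endpoint strings are invented for the example:

```python
from itertools import cycle

class RoundRobinRouter:
    """Distributes requests across live instances of a service.

    Illustrative sketch only -- CDAP's real dispatch logic is internal
    to the platform and may use other routing strategies.
    """

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        # cycle() yields the endpoints in order, wrapping around forever.
        self._endpoints = cycle(endpoints)

    def route(self):
        # Pick the next service instance in round-robin order.
        return next(self._endpoints)

router = RoundRobinRouter(["host-a:10000", "host-b:10000"])
print([router.route() for _ in range(4)])
# -> ['host-a:10000', 'host-b:10000', 'host-a:10000', 'host-b:10000']
```

With more than one instance registered, successive requests alternate between them, spreading load evenly.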

REST APIs

REST APIs are HTTP interfaces exposed by CDAP for a multitude of purposes: everything from deploying and managing Applications, Artifacts, Plugins and Datasets, to ingesting data events, to querying data from datasets, to checking the status of various system and user services. More information about the different REST APIs exposed can be found here
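As a sketch of how a client might address these interfaces, the helper below builds URLs following the v3 REST path convention (`/v3/namespaces/<namespace>/apps/<app>`) used by CDAP. The host and port are placeholders, and exact paths should be verified against the REST API reference for your CDAP version:

```python
# Hypothetical router address -- replace with your cluster's endpoint.
BASE = "http://cdap-router.example.com:10000"

def app_url(base, namespace, app):
    """Build the URL for an Application resource.

    Follows the v3 REST path pattern /v3/namespaces/<ns>/apps/<app>;
    check the CDAP REST API reference for your version.
    """
    return f"{base}/v3/namespaces/{namespace}/apps/{app}"

print(app_url(BASE, "default", "WikipediaAnalysis"))
# -> http://cdap-router.example.com:10000/v3/namespaces/default/apps/WikipediaAnalysis
```

An HTTP client (curl, or any language's HTTP library) would issue GET/PUT/DELETE requests against such URLs to inspect, deploy or remove entities.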

5. Edge nodes are the interfaces between the Hadoop cluster and the outside network. They are also referred to as gateway nodes. They run client applications and cluster administration tools.


Namespace

CDAP provides isolation of applications and data through Namespaces. A Namespace can conceptually be thought of as a partition of a CDAP instance. Applications and Datasets in one namespace are not accessible from another namespace. It's a first step towards introducing multi-tenancy in CDAP. This feature can be used for partitioning a single Hadoop cluster into multiple namespaces:

To support different environments, such as development, QA and staging;

To support multiple customers; and

To support multiple sub-organizations within an organization.

More information on Namespace can be found here
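The isolation model can be illustrated with a toy in-memory sketch (the class and method names are invented for the example and are not a CDAP API):

```python
class CdapInstance:
    """Toy model of namespace isolation: entities registered in one
    namespace are invisible from every other namespace.

    Illustrative only -- not how CDAP stores entities internally.
    """

    def __init__(self):
        # Map of namespace name -> set of deployed application names.
        self._spaces = {}

    def deploy(self, namespace, app):
        # Deploying scopes the app to exactly one namespace.
        self._spaces.setdefault(namespace, set()).add(app)

    def apps(self, namespace):
        # Only apps deployed in this namespace are visible.
        return sorted(self._spaces.get(namespace, set()))

cdap = CdapInstance()
cdap.deploy("dev", "FraudDetection")
cdap.deploy("qa", "FraudDetection")
print(cdap.apps("dev"))   # ['FraudDetection']
print(cdap.apps("prod"))  # []
```

The same application name can exist independently in "dev" and "qa", which is exactly the dev/QA/staging partitioning described above.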

Security

CDAP supports Kerberos-enabled clusters and also supports perimeter-level security for authentication through LDAP, JASPI or Basic mechanisms. Authorization is currently being worked on for the 3.4 release. More information on CDAP security can be found here

Supported Programming Languages

Java is the only programming language currently supported by CDAP. CDAP has plans to support other languages like Python, R and JavaScript. The near-term plan is to support Python through Py4J integration. More information can be found here

Supported Hadoop Distributions

CDAP and all its Applications are agnostic to the distribution they run on. On a nightly basis, the platform as well as test Applications are tested on various flavors of Hadoop distributions. Information about these tests can be found here


Deployment Architecture

This section describes the different components of the CDAP runtime system and how they are deployed on a Hadoop cluster. For more information please see here

The CDAP runtime system is made up of two major components:

CDAP Server

The CDAP Server is a collection of services (Figure 2) essential for successfully running CDAP. It can be installed on one or more edge nodes of a cluster and is responsible for managing only the cluster to which it's configured. The services are installed for a few reasons:

Fixed IP/Hostname for accessing REST APIs

Impersonating a user in secure mode to run in a cluster

Managing (Start/Stop/Monitor) system services running within the cluster

CDAP Services

These are mission-critical CDAP system services running in YARN containers on a Hadoop cluster. The lifecycle of these services is managed by the CDAP Server. Below are the services running on the cluster:

Dataset Executor: Responsible for managing dataset lifecycle

Metadata: Responsible for managing metadata for applications and datasets

Log and Metrics Aggregator: Responsible for aggregating and indexing logs and metrics across all applications and datasets

Transactions: Responsible for providing consistency guarantees across applications and datasets

Explore: Responsible for exposing the querying (SQL) interface for datasets

Stream: Responsible for data ingestion either in realtime or batch


Use-cases

Following are a few use-cases that CDAP is well-suited for and that are being used within customer environments.

Data Lake

High Volume Streaming Analytics

Information Security Reporting

Real-time brand and marketing campaign monitoring


Data Lake

Building an enterprise data lake requires building a reliable, repeatable and fully operational data management system, which includes the ingestion, transformation and distribution of data. It must support varied data types and formats, and must be able to capture data flow in various ways. The system must support the following:

Transform, normalize, harmonize, partition, filter and join data

Interface with anonymization and encryption services external to the cluster

Generate metadata for all data feeds, snapshots and datasets ingested, and make it accessible through APIs and webservices

Perform policy enforcement for all ingested and processed data feeds

Track and isolate errors during processing

Perform incremental processing of data being ingested

Reprocess data in case of failures and errors

Apply retention policies on ingested and processed datasets

Set up a common location format (CLF) for storing staging, compressed, encrypted and processed data

Provide filtered views over processed datasets

Monitor, report, and alert based on thresholds for transport and data quality issues experienced during ingestion. This helps provide the highest quality of data for analytics needs.

Annotate Datasets with business/user metadata

Search Datasets using metadata

Search Datasets based on schema field names and types

Manage data provenance (lineage) as data is processed/transformed in the data lake

Outcome

A team of 10 Java (non-Hadoop) developers was able to build an end-to-end ingestion system with the capabilities described above using CDAP. Lower barrier to entry.

These developers provided a self-service platform to the rest of the organization(s) to ingest, process and catalog data. Abstractions helped them build at a much faster pace and get it to their customers faster. Time to market.

The ingestion platform standardized and created conventions for how data is ingested, transformed and stored on the cluster, allowing platform users to on-board at a much faster rate. Time to value.

CDAP's native support for incremental processing, reprocessing, metadata tracking, workflows, retention, snapshotting, monitoring and reporting expedited the effort to get the system to their customers. Time to market.

CDAP is installed in 8 clusters with 100s of nodes.

Data Lake users were able to locate Datasets faster and had faster access to metadata, data lineage and data provenance. This allowed them to efficiently utilize their clusters and also aided them in data governance, auditability, and improving the data quality of Datasets. CDAP Tracker provided this set of capabilities.


High Volume Streaming Analytics

Building a high-speed, high-volume streaming analytics solution with exactly-once semantics is complex, resource intensive and hard to maintain and enhance. This use-case required data collection from web logs, mobile activity logs and CRM data, in real-time and batch. The collected data was then organized into customer hierarchies and modeled to deliver targeted ad campaigns and marketing promotion campaigns. It also had to provide advanced analytics for tracking the campaigns in real-time. The application had to support the following:

Support processing of ~38 billion transactions per day in real-time

Categorize customer activity into buckets and hierarchies

Generate unique counts in real-time to understand audience reach, track behavior trends, and the like

Generate hourly, daily, monthly and yearly reports on multiple dimensions

Provide unique stat counts on an hourly basis rather than weekly

Reprocess data without side effects after bug fixes and new features

Provide exactly-once processing semantics for reliable processing

Process data both in real-time and batch

Outcome

CDAP's abstractions and its real-time programs simplified building this application and getting it to market faster. Time to market.

The team replaced a MapReduce-based batch system with a real-time system, delivering insights every minute instead of every few days.

CDAP's exactly-once and transactional semantics provided a high degree of data consistency during failure scenarios, making it easy to debug and reason about the state of the data.

CDAP's Standalone and Testing frameworks allowed the developers to build this application efficiently. No distributed components were required to run functional tests.


Information Security Reporting

In a large enterprise environment there are traditional sources that house a great deal of data. There is a constant need to load data into Hadoop clusters to perform complex joins, �ltering, transformations and report generation. Moving data to Hadoop is cost-effective as there is the need to run many complex, ad-hoc queries that would otherwise require expensive execution on traditional data storage and querying technologies.

The customer had been attempting to build a reliable, repeatable data pipeline for generating reports across all network devices which access resources. Data is currently aggregated into five different Microsoft SQL Servers. Aggregated data is then periodically (once a day) staged into a secured (Kerberos) Hadoop cluster. Upon loading the data into the staging area, transformations (rename fields, change field types, project fields) are performed to create new datasets. The data is registered within Hive so that Hive SQL queries can be run for any ad-hoc investigation. Once all the data is in final, independent datasets, the next job is kicked off -- it joins the data from across all five tables to create a new uber table that provides a 360-degree view of all network devices. This table is then used to generate a report as part of another job. Following are the challenges the customer faced:

Ensuring that the reports aligned to day boundaries

Restarting the failed jobs from the last point where they had failed (had to reconfigure pipelines to restart failed jobs)

Adding new sources required a lot of setup and development time

Inability to test the pipeline before it was deployed -- this led to inefficient utilization of the cluster, as all testing was performed on the cluster

They had to cobble together a set of loosely federated technologies -- Sqoop, Oozie, MR, Spark, Hive and Bash Scripts

Outcome

The in-house Java developers, with limited knowledge of Hadoop, built and ran the complex pipelines at scale within two weeks, after four (4) hours of training

The visual interface enabled the team to build, test, debug, deploy, run and view pipelines during operations

The new process reduced system complexity dramatically, which simplified pipeline management

The development experience was improved by reducing inappropriate cluster utilization

Transforms were performed in-flight with error record handling

Tracking tools made it easy to rerun the process from any point of failure


Real-time brand and marketing campaign monitoring

Enterprises use Twitter to know when people are talking about their brand and to understand sentiment toward their new marketing campaigns. Real-time monitoring capabilities on Twitter allow them to keep a close eye on the results of marketing efforts.

Developing a real-time pipeline that ingests the full Twitter stream, then cleanses, transforms, and performs sentiment and multi-dimensional analysis of the Tweets related to a campaign delivers a valuable real-time decision-making platform. The aggregated data is exposed through REST APIs to an internal tool for visualization, making consumption of the output easier.

The original pipeline was built using Storm, HBase, MySQL and JBoss. Storm is used to ingest and process the stream of Tweets. The Tweets are analyzed using NLP algorithms to determine sentiment. They are aggregated on multiple dimensions, like number of re-tweets and attitude (positive, negative or neutral). The aggregations are stored in HBase. Periodically (twice a day) the data from HBase is moved into MySQL. JBoss exposes REST APIs for accessing the data in MySQL.

The goal of this use-case was to reduce the overall complexity of the pipeline: move away from maintaining a separate cluster for processing the real-time Twitter stream, integrate NLP scoring algorithms for sentiment analysis, and expose the aggregated data from HBase with lower latency, reducing the delay between data being available in HBase and its delivery via REST API. The result: an easy-to-build, easy-to-deploy and easy-to-manage real-time pipeline with better operational insights.

Outcome

A Cask Hydrator pipeline for processing the full Twitter stream was built in 2 weeks.

Cleansing, transforming, analyzing and aggregating tweets at about 6K/sec in-flight.

Consolidated infrastructure into a single Hadoop cluster.

Java developers were able to build the pipeline and plugins with a minimal learning curve.

A CDAP Service on the OLAP Cube Dataset eliminated expensive data movement and reduced the latency between an aggregation being generated and the results being exposed through REST APIs, allowing them to make better decisions faster.

CDAP and Cask Hydrator transparently provided easy operational insights through custom dashboards and aggregated logs for debugging.


A Day in the Life with CDAP

This section describes how a day goes for a developer and an operations team member building and deploying an application or a solution on Hadoop in production using CDAP. To demonstrate this, we will take a common use-case within an organization using Hadoop.

The users are provided a very constrained environment for developing and testing the pipeline; in total, three (3) clusters are made available to them.

Use-case

Joltie, a Java developer, and Root, an operations team member, work in an enterprise. They have been tasked with building and operationalizing a data pipeline for processing data that is ingested and available on HDFS. The data is delivered to a standard directory on a daily basis, with an approximate size of 600GB per day. The processing pipeline has to be operationalized to process daily data within an SLA of 3 hours. Developer and Operations only have access to CDAP, not Cask Hydrator.

The data pipeline must include the following:

1. A dataset integrity job that takes a pass over all the data from a day to check whether the data on HDFS is reliable enough to be processed.

2. A transformation job that processes a day's worth of data -- applies transformations, filtering and field-level validation. The output of the job is a new transient dataset.

3. The output from step 2 is then picked up by another job that interacts with the encryption system to encrypt certain fields in the feed. This job generates two outputs -- one that is encrypted and one that is not. The job writes both output sets to datasets partitioned by day. These datasets are explorable using Hive.

4. The unencrypted partition output is further processed to build and update a data model in HBase. The data in HBase is also explorable through Hive.

Conditions

The jobs in step 1 and step 2 can run in parallel.

The job in step 3 is executed only if the job in step 1 clears that the data is reliable and the step 2 job is successful.

The job in step 4 is dependent on the successful completion of step 3.
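The control flow implied by these conditions can be sketched in plain Python. This is an illustrative model only; in CDAP this logic would be expressed with the Workflow API, and the job names below are invented for the example:

```python
def run_pipeline(run_job):
    """Toy orchestration of the four pipeline steps.

    run_job(name) is a caller-supplied function returning True on
    success. Returns the list of jobs that were attempted, in order.
    Steps 1 and 2 may run in parallel in the real system; they are
    run sequentially here for simplicity.
    """
    executed = []

    def step(name):
        executed.append(name)
        return run_job(name)

    integrity_ok = step("integrity_check")  # step 1
    transform_ok = step("transform")        # step 2
    # Step 3 runs only if the data is reliable AND step 2 succeeded.
    if integrity_ok and transform_ok:
        if step("encrypt"):                 # step 3
            step("update_model")            # step 4 depends on step 3
    return executed

print(run_pipeline(lambda name: True))
# -> ['integrity_check', 'transform', 'encrypt', 'update_model']
```

If the integrity check reports the data as unreliable, steps 3 and 4 are simply skipped, which is the gating behavior the conditions above describe.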

Environment

A Dev Cluster - 2% capacity of production cluster

A QA Cluster - 5% capacity of production cluster

A Production Cluster


For Joltie - The Developer

Joltie, a mid-level Java developer with experience building web applications, is very excited about this opportunity to build a Big Data application. She has taken the initiative to learn the basics of Hadoop. She is intimidated by the complexity of Hadoop and has been looking at different examples of how to process data on Hadoop. Following are additional requirements that she has been tasked with:

Make the application code modular and testable

Commit changes on regular basis

Write unit test cases for each piece of functionality that is added to the application

Set up a job in Jenkins that runs unit tests on every check-in

Set up an end-to-end test for the application on the QA cluster

Instrument the application correctly to provide more insights into business processing

The application should handle some critical edge cases

Joltie has to work with Root (operations team member) to successfully deploy the application to QA and production

Joltie has to use CDAP to build this application

In case of an error in the pipeline, the user should be able to fix the issue and easily reprocess

The pipeline must support incremental processing

The pipeline must handle the case where data arrives late

Tools Needed

Laptop

Java

Maven

IDE

Nodejs

Material Available

CDAP Apps & Examples

CDAP SDK

CDAP Documentation

Design & Research

Joltie starts by looking at the pipeline requirements and, in parallel, spends time reading CDAP documentation and running examples. She is delighted that she doesn't need to install Hadoop.


Development

DAY 1: First Cut of the Application

Joltie starts by creating a boilerplate CDAP Application using the CDAP Maven Archetype.

All the necessary dependencies are included, so she is ready to get started.

She then modifies the Application to build a Directed Acyclic Graph (DAG) for processing input data, as described in the use-case, using the Workflow API provided by CDAP.

She also uses the CDAP JUnit scaffolding to build a unit test for the Workflow. She finds it very useful to be able to test workflows before they are deployed on a cluster or anywhere else.

Joltie builds the project using Maven to generate the application artifact.

She then starts the CDAP Standalone version on her laptop.

She deploys the artifact into CDAP Standalone and plays around with it. She cannot test it at scale, but is able to experience working with the Application as she would on a cluster.

She iterates through the development cycle without having to touch a cluster.

She goes home feeling like she has accomplished a great deal, having developed the first version of the Application by herself.

It’s a beautiful day. Joltie is ready to include a few of the edge cases that were provided as requirements and take the application to completion.

She reads about Partitioned Fileset Datasets and con�gures her Application to include this in orderto solve the edge cases. The Partitioned Fileset Datasets are transactional and provide consistency required to handle new partitions and also handle errors ef�ciently.

She iterates on it a few times to make sure she includes test scenarios for edge cases and also tries the same within CDAP Standalone. she is able to simulate a few of the scenarios, but she has to try this on a real cluster to make sure it works as expected under error conditions.

Next, she is ready to setup a Jenkins job to periodically trigger tests on check-in. she speci�es the con�gurations and commands to run the tests the same way she would in a shell.

She now wants to setup an end-to-end test of her application -- she is not sure how she can accomplish this, so, she opens an issue within Cask’s support portal. She gets a response back within a few hours with information around how she can set-up the end-to-end test.

She follows the instructions speci�ed by a Cask representative to set-up an end-to-end job on Jenkins.

She has accomplished a lot on the second day!

DAY 2: Application Enhancement and CI Setup

Cask Data Application Platform - CDAP 101 Copyright © 2016 to Cask Data Inc. Proprietary and Confidential


DAY 3: Operations Handoff

Joltie reaches the office early, with a spring in her step, to get her Application into the QA environment.

Before starting to work with Root, Joltie adds instrumentation for the many business metrics that will be useful for debugging issues and providing insight into the Application itself. She uses the CDAP-provided Metrics APIs.

She meets Root and explains the Application and what it does. Root, having worked on Hadoop previously, is not very excited about a new Application being dropped into his lap for operations.

Joltie starts off by deploying the Application into a CDAP Standalone and starting it. She proceeds to explain to Root the different aspects of the Application and Workflow and how the data will be processed. Root quickly starts feeling comfortable and asks a lot of questions about how he can monitor this Application.

Joltie then easily builds a dashboard using the CDAP Console and highlights the metrics that Root can use to monitor the application.



For Root - The Operations Guy

Root, the operations guy, is charged with deploying and managing the data pipeline in the QA and production environments. In order to be successful with this project, Root needs to accomplish the following:

Install and configure CDAP on a cluster using the package manager provided by the distribution he is using

Secure the CDAP instance with authentication, so that only authenticated users are allowed to gain access to the instance

Ensure that the authentication can be integrated into company’s existing LDAP system

Ensure that CDAP can be upgraded using the provided upgrade tool

Integrate the deployment and management of the data pipeline

Specify runtime arguments for the data pipeline

Manage the data pipeline lifecycle via a simple-to-use, well-defined set of APIs

Ensure the data pipeline has guaranteed resources in the cluster, so that ad-hoc data processing does not impact this high priority production data pipeline.

Assess CDAP and data pipeline health so that he can effectively manage and run the pipeline

Gain visibility into the running data pipeline by monitoring application and system metrics via simple-to-use APIs

Access the application and system logs in case he needs to troubleshoot any incident

Get a historical view of SLAs for the data pipeline

Receive proactive notification when there are failures in the pipeline

Restart failed pipelines using APIs

Upgrade CDAP and the applications within a reasonable downtime
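Several of the items above (lifecycle management, status checks, log access, restarting failed pipelines) map onto CDAP's program-level REST API. A minimal sketch, assuming CDAP 3.x-style v3 paths, the default router port, and invented host and program names:

```python
# Hedged sketch: endpoints Root might script against for status, logs,
# and restarts. All names are hypothetical; verify the paths against the
# CDAP version you run.

def _program_base(host, ns, app, ptype, program):
    # ptype is the program category in the URL, e.g. "workflows"
    return (f"http://{host}:10000/v3/namespaces/{ns}"
            f"/apps/{app}/{ptype}/{program}")

def status_url(host, ns, app, ptype, program):
    """GET: returns the program status, e.g. RUNNING or STOPPED."""
    return _program_base(host, ns, app, ptype, program) + "/status"

def logs_url(host, ns, app, ptype, program, start, stop):
    """GET: returns log events between two unix timestamps."""
    return (_program_base(host, ns, app, ptype, program)
            + f"/logs?start={start}&stop={stop}")

def restart_urls(host, ns, app, ptype, program):
    """POST stop then start to restart a failed pipeline."""
    base = _program_base(host, ns, app, ptype, program)
    return [base + "/stop", base + "/start"]
```

Scripting against URL builders like these is one way to fold CDAP into existing cron or alerting tooling without a client library.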

Material Available

System Requirements

Deployment Architecture

Administration Manual

Installation Guide

Security Guide

Operations Guide

Download Artifacts



Operations

Root starts his day by reading through the documentation provided by Cask to understand the components and deployment architecture

He proceeds to review the security aspects, learning more about the LDAP integration

In the afternoon he downloads and installs CDAP on a Kerberos-enabled cluster

He then configures perimeter security and integrates it with the existing LDAP system in his company

DAY 1: CDAP Installation and Configuration

Root starts his second day by researching the management and monitoring APIs

He lays out a plan for monitoring CDAP and the data pipeline that Joltie is building

In the afternoon, he syncs up with Joltie and submits requirements for certain key application metrics he needs to assess the data pipeline's health. He spends the rest of the afternoon using the REST APIs to integrate CDAP and application health checks, and sets up alerts on failures
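This health-check-plus-alert wiring can be sketched in a few lines. The query endpoint below follows the CDAP 3.x metrics REST API shape (repeated tag parameters plus a metric parameter on /v3/metrics/query), but the host, metric name, and alert threshold are invented for illustration:

```python
# Hedged sketch: build a CDAP-style metrics query URL and apply a
# trivial alert rule. Metric name and threshold are illustrative.
from urllib.parse import urlencode

def metrics_query_url(host, ns, app, metric):
    """POST target for an aggregated query of one application metric."""
    params = [("tag", f"namespace:{ns}"),
              ("tag", f"app:{app}"),
              ("metric", metric),
              ("aggregate", "true")]
    return f"http://{host}:10000/v3/metrics/query?" + urlencode(params)

def should_alert(failed_count, threshold=1):
    """Trivial alert rule: fire as soon as failures cross the threshold."""
    return failed_count >= threshold

print(should_alert(0))  # → False
```

In practice the polled metric value would come from the JSON response of the query above, and the alert would feed whatever paging or email system the operations team already runs.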

DAY 2: Application Management and Monitoring Integration

Root syncs up with Joltie in the morning, making sure he gets all the information he needs. They share a donut.

Root and Joltie do a dry run of how to run the data pipeline in CDAP standalone

Root then sets up the data pipeline in a QA environment and is ready to push the data pipeline to production whenever he gets the go-ahead from management

DAY 3: Operations Handoff from Dev



FAQ

Is CDAP just a library ?

No, CDAP is an integrated platform: in addition to developer APIs, it includes a runtime system with services such as service discovery. Service discovery allows users to register data services running in containers on a cluster. It achieves this by announcing one or more service endpoints with ZooKeeper and actively maintaining the live state of the running services.

it’s not a IDE plugin. Building of CDAP Applications are IDE friendly. Developers use Eclipse and Intellij to build. Every Application is a Maven compatible project which is generally supported by most of the IDEs.

Is CDAP an IDE plugin ?

Is CDAP Open Source ?

Yes, CDAP is Open Source and it's licensed under the Apache 2.0 license.

Where can I �nd more information about CDAP ?

More information on CDAP can be found at http://docs.cask.co/cdap

Who is currently using CDAP in production ?

CDAP is currently running in production at multiple customer sites, including telco, financial services, advertising, e-commerce, and cloud companies.

What is the biggest installation of CDAP ?

CDAP has been tested and run on clusters as small as 10 nodes and as large as 600 nodes.

Do you support log integration with Logstash ?

We currently don't have support for integration with Logstash for logs. We will be able to provide a Logstash input plugin that ingests application and system logs into Logstash.

Do I need to be a Hadoop expert to learn and use CDAP ?

You need to have basic knowledge of Hadoop; focusing on how you can build things with Hadoop is sufficient. CDAP provides all the glue required for you to build a solution.

What are the ways one can install CDAP on a cluster ?

CDAP can be installed on a Hadoop cluster through Cloudera Manager (CDH), Ambari (HDP), or an RPM-based install. More information can be found here



Does CDAP provide any visualization capabilities ?

CDAP doesn't provide any visualization capabilities. However, through the JDBC/ODBC driver for Datasets, CDAP allows integration with existing visualization tools.

When should I use CDAP Standalone ?

CDAP Standalone should be used only during development. It's not meant for performance or large-scale tests, and it's not built to be run in production. CDAP Standalone can also be used for demos and for showing colleagues and partners how the Application you have built would look in distributed mode.

What public clouds does CDAP work on ?

Customers have CDAP running on AWS and GCE, and we are currently working with the Microsoft team to support running on Azure.

If I have found an issue with CDAP, how can I report it ?

You can use the open source JIRA system for CDAP and other projects. If you have a CDAP subscription, the Cask support portal is available for customers to report issues. Issues reported through the Cask support portal have an agreed SLA for response and resolution.

Are there other companies that contribute to CDAP ?

A few companies contribute to core CDAP, and a larger group of companies contributes to sub-components of CDAP such as Apache Twill (previously known as Weave), Apache Tephra, Hydrator plugins, and others.

I am interested in contributing to CDAP, how can I get started ?

CDAP is an open source project licensed under the Apache 2.0 license. It's not part of the Apache Foundation yet. In order to contribute to CDAP, you need to provide Cask with a signed ICLA or CCLA. The terms are open and very similar to those of the Apache Foundation.

I have a feature request for CDAP, how can I work with the team to get it added to the product ?

You can file a JIRA ticket for the feature request, or, if you have a CDAP subscription, use the support portal.

If I have a question about CDAP or the Application I am building, how can I get help from the experts ?

CDAP has an open source support system: a Google group where developers and users can ask questions. The email address for the group is [email protected], and the response SLA is 1 day. If you have a CDAP subscription, response SLAs are based on the contract established with your company.



How many lines of code make up CDAP ?

CDAP, including Cask Hydrator, is made up of 4,413 files, 362,940 LOC, and 128,601 CLOC.

How many engineers work on CDAP ?

There are more than 40 engineers working on CDAP.

What extensions are you looking to build in the future ?

Customer 360 and Log Analytics are next on our list.

Is CDAP going to integrate with Apache Flink ?

The CDAP team is closely observing the evolution of Apache Flink in the community and across the industry. We believe the technology is architected, built, and maintained really well. We are currently waiting for adoption by the major Hadoop distributions before we include it in CDAP.

Does CDAP provide any mechanism for adding custom documentation on how to troubleshoot a Workflow or any other Program ?

Currently this capability doesn't exist within CDAP, but it's a very useful feature request. We have opened CDAP-5209 to track it.

How do I integrate CDAP with R ?

CDAP currently doesn't support R natively. Customers have integrated R at scale through RServe.

Does CDAP support rolling upgrades ?

This is tracked in CDAP-5226.

Does CDAP have multi-site support ?

This is tracked in CDAP-5227.

Does CDAP integrate with Apache Sentry for authorization ?

The current stable version of CDAP is not integrated with Sentry for authorization. Active work is being done to integrate CDAP with Apache Sentry; more information about the design can be found here. The first phase of Sentry integration will be available in CDAP 3.4 and includes authorization for runtime components. The next release after 3.4 will include authorization for Datasets.
