Cask Data Application Platform (CDAP)
CDAP is an integrated application development framework for Hadoop. It integrates and abstracts the underlying Hadoop technologies to provide simple, easy-to-use APIs to build, deploy and manage complex data analytics applications in the cloud or on-premises.
You can build:
Data Ingestion Applications -- Batch or Realtime
Data Processing Workflows
Real-time Applications
Data Services
Predictive Analytics Applications
Business Analytics Applications
Social Applications, and many more

It's built for:
Developers
Operations
Data Engineers
Data Scientists
Try it
Use the Cloudera Manager CSD to install on a cluster
Run CDAP Standalone in Docker through Kitematic
Download CDAP Standalone to build your application

License
Apache 2.0
Accelerated ROI - faster time to market, faster time to value
Maximize developer productivity and minimize TCO
Simple, easy and standard APIs for developers and operations
Enables reusability and self-service
Future proof - distribution and deployment agnostic
Supports different workloads - transactional and non-transactional
Cask Data Application Platform - CDAP 101. Copyright © 2016 Cask Data Inc. Proprietary and Confidential.
Architecture
This section describes the functional and physical architecture of CDAP.
Functional Architecture
API
Application
An Application is a standardized container framework for defining all services. It simplifies the painful integration process across heterogeneous infrastructure technologies running on Hadoop. It's responsible for managing the lifecycle of Programs and Datasets within an application. E.g. Wikipedia Analysis, Twitter Sentiment Analysis, Fraud Detection, etc.
Application Template
An Application Template is a user-defined, reusable, reconfigurable pattern of an Application. It is parameterized by a configuration that allows reconfigurability upon deployment. It simplifies development by providing one generic version of an application which can be repurposed, instead of the ongoing creation of specialized applications. It exposes the reconfigurability and modularization of an Application through Plugins. E.g. a user-defined template, or Cask-provided templates like CDAP ETL Batch, CDAP ETL Real-time, CDAP Data Pipeline Batch, CDAP Data Pipeline Real-time, CDAP Spark Streaming Pipelines, Data Quality, etc.
Dataset
A Dataset is a standardized container framework for organizing, storing and accessing data from various storage engines. It simplifies integration with different storage engines, allowing one to build complex data patterns across multiple storage types on Hadoop. It's responsible for exposing transactionally consistent data patterns, integration with query engines, schema evolution and data lifecycle management. E.g. Indexed Dataset, Time Partitioned Fileset, Partitioned Fileset, OLAP Cube Dataset, Indexed Object Store, Object Store, Timeseries Dataset, etc. are different types of datasets that can be defined in CDAP.
Extension
An Extension is an Application Template with a domain-specific UI integrated into the CDAP UI. E.g. Cask Hydrator and Cask Tracker.
Program
A Program is a container of well-defined tasks for processing or servicing Datasets to generate zero or more Datasets. It is responsible for managing the lifecycle, and for integration with transactions, metrics, logging and the metadata system. E.g. Spark Program, MapReduce Program, Workflow Program, Worker Program, Service Program, Flow Program, etc.
Plugin
A Plugin is a customizable module exposed and used by an Application or an Application Template. It simplifies adding new features or extending the capability of an Application. Plugin implementations are based on interfaces exposed by the Application. E.g. the CDAP ETL Batch Application template exposes three plugin types, namely Source, Transform and Sink; the CDAP Data Quality Application template exposes an Aggregation plugin.
Artifact
An Artifact is a versioned packaging format used to aggregate one or more Applications, Datasets, Plugins, Resources and their associated metadata. It's a JAR (Java Archive) containing the Java classes and resources required to create and run the Application.
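To make the Artifact idea concrete, here is a minimal sketch of what such a packaging physically boils down to: a versioned JAR with a manifest and bundled resources. This is illustrative only -- it uses the standard JAR manifest attributes, not any CDAP-specific keys, and the artifact name is made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Sketch: an "artifact" as a versioned JAR carrying classes/resources
// plus version metadata in its manifest.
public class ArtifactSketch {
    public static Path writeArtifact(Path dir, String name, String version) throws IOException {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.IMPLEMENTATION_VERSION, version);
        Path jar = dir.resolve(name + "-" + version + ".jar");
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar), manifest)) {
            // A bundled resource, standing in for classes and config files.
            out.putNextEntry(new JarEntry("app.properties"));
            out.write("name=demo".getBytes());
        }
        return jar;
    }

    public static String readVersion(Path jar) throws IOException {
        try (JarFile jf = new JarFile(jar.toFile())) {
            return jf.getManifest().getMainAttributes()
                     .getValue(Attributes.Name.IMPLEMENTATION_VERSION);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("artifacts");
        Path jar = writeArtifact(dir, "wikipedia-analysis", "1.0.2");
        System.out.println(jar.getFileName() + " -> version " + readVersion(jar));
    }
}
```

The version in the file name and manifest is what lets a platform keep multiple versions of the same artifact side by side.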
Extensions are a new concept within CDAP and are not ready for general use.
Tools
Command Line Interface (CLI)
The CDAP CLI allows developers and operations teams to script and automate interactions with local or remote CDAP entities from the shell. The CLI uses the CDAP REST APIs to provide this functionality. Using the CLI, one can manage the lifecycle of Applications, Artifacts, Programs and Datasets. More information can be found here.
Testing Framework
An end-to-end JUnit scaffolding over CDAP that allows developers to test their Applications, Plugins and Programs during development. It's built as a modular framework, allowing developers to also test individual components. The tests can be integrated with Continuous Integration (CI) tools like Bamboo, Jenkins and TeamCity.
JDBC / ODBC Driver
The CDAP JDBC and ODBC drivers enable users to access Datasets (HDFS, HBase or a Composite Dataset) on Hadoop through Business Intelligence (BI) applications with JDBC or ODBC support. The driver achieves this integration by translating JDBC/ODBC calls from the application into SQL and passing the SQL queries to the underlying Dataset management and query engine (Hive is the default).
Monitoring Integrations
These integrations allow external systems to monitor CDAP and the Applications running within it. Integrations with Nagios, Sensu, Cacti and Splunk are supported; they are achieved by plugins that access status, logs and metrics through the REST APIs.
Performance Framework
The CDAP performance framework provides the ability to load test and capture performance metrics to diagnose bottlenecks within your Application.
User Interface (Console)
The Console provides a user-friendly graphical user interface with well-designed user workflows for deploying and managing the lifecycle of Applications, Programs, Datasets and Artifacts. Operations management capabilities allow deeper and faster insights when diagnosing issues with different entities. It also exposes administrative capabilities for managing CDAP.
2. Framework only available in Java.
3. In-memory CDAP - abstracted to in-memory structures for easy debugging (shorter stack traces).
4. Isn't publicly available yet, but access will be provided on demand. We are still evolving it.
Router
Service Discovery
Service discovery allows users to register data Services running in containers on a cluster. It achieves this by registering one or more service endpoints announced with ZooKeeper and actively maintaining the live state of the running services.
Dataset and Service
Data within Datasets can be exposed to external clients through a Service Program. Developers can implement custom Services that expose data from a Dataset, or that write to a Dataset. A Service exposes user-defined REST APIs over a Dataset. Services execute as YARN containers on the cluster, eliminating the additional step of migrating data into a traditional database before exposing it to applications. In the future, Datasets will be able to expose REST APIs automatically: developers will use annotations to map REST endpoints to methods within Datasets, simplifying the Data-as-a-Service concept.
CDAP System Services
In order to simplify the deployment model of CDAP on a Hadoop cluster, CDAP uses a small portion of the cluster to run mission-critical CDAP system services in YARN containers. It also does this to support elastic scaling of the services without having to stop them. So, CDAP runs partly on edge nodes and partly within the cluster. The system ensures that the system services are distributed evenly across the nodes of the cluster and don't interfere with normal operations and jobs running on the cluster. More information about deployment and services can be found here.
Additional Information
Service Dispatch
A request to access a service method is routed to the right container running the service on the cluster. In the case of multiple instances of a service, a routing strategy is engaged automatically to distribute the load across the instances.
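The discovery-plus-dispatch pattern can be sketched in a few lines. This is a hypothetical in-memory model for illustration only -- CDAP registers endpoints in ZooKeeper and has its own routing strategies; the service name and endpoints below are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: instances announce endpoints under a service
// name; the dispatcher picks the next instance round-robin.
public class ServiceRegistry {
    private final Map<String, List<String>> endpoints = new HashMap<>();
    private final Map<String, AtomicInteger> counters = new HashMap<>();

    // A service instance announces its endpoint under a service name.
    public synchronized void announce(String service, String endpoint) {
        endpoints.computeIfAbsent(service, s -> new ArrayList<>()).add(endpoint);
        counters.computeIfAbsent(service, s -> new AtomicInteger());
    }

    // Dispatch: pick the next registered instance round-robin.
    public synchronized String pick(String service) {
        List<String> live = endpoints.get(service);
        if (live == null || live.isEmpty()) {
            throw new NoSuchElementException("no instances for " + service);
        }
        int i = counters.get(service).getAndIncrement();
        return live.get(i % live.size());
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.announce("user-service", "host1:10001");
        registry.announce("user-service", "host2:10002");
        // Successive requests rotate across the two instances.
        System.out.println(registry.pick("user-service"));
        System.out.println(registry.pick("user-service"));
        System.out.println(registry.pick("user-service"));
    }
}
```

A production registry would additionally watch for instance failures (which ZooKeeper ephemeral nodes provide) so that dead endpoints drop out of the rotation.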
REST APIs
REST APIs are HTTP interfaces exposed by CDAP for a multitude of purposes: everything from deploying and managing Applications, Artifacts, Plugins and Datasets, to ingesting data events, to querying data from Datasets, to checking the status of various system and user services. More information about the different REST APIs exposed can be found here.
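A minimal sketch of a client calling such an HTTP endpoint follows. To keep it self-contained it starts a local stand-in server with the JDK's built-in HttpServer; the path shape mirrors CDAP's v3 namespace convention, but the server, app name and JSON payload here are invented for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of a REST client against a CDAP-style endpoint, using a local
// mock server so the example runs without a real cluster.
public class RestClientSketch {
    // Builds a v3-style path for listing apps in a namespace.
    public static String appsPath(String namespace) {
        return "/v3/namespaces/" + namespace + "/apps";
    }

    public static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the platform's REST endpoint (payload is made up).
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext(appsPath("default"), exchange -> {
            byte[] body = "[{\"name\":\"WikipediaPipelineApp\"}]".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        String url = "http://localhost:" + server.getAddress().getPort() + appsPath("default");
        System.out.println(get(url));
        server.stop(0);
    }
}
```

The same GET/PUT/POST pattern covers deployment, lifecycle and status calls; only the paths and payloads differ.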
Edge nodes are the interfaces between the Hadoop cluster and the outside network. They are also referred to as gateway nodes. They run client applications and cluster administration tools.
Namespace
CDAP provides isolation of applications and data through Namespaces. A Namespace can conceptually be thought of as a partition of a CDAP instance. Applications and Datasets in one namespace are not accessible in another namespace. It's a first step towards introducing multi-tenancy in CDAP. This feature can be used for partitioning a single Hadoop cluster into multiple namespaces:

To support different environments, such as development, QA and staging;
To support multiple customers; and
To support multiple sub-organizations within an organization.

More information on Namespace can be found here.
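The isolation rule can be modeled in a toy example: each namespace holds its own entities, and lookups never cross namespaces. This is a conceptual sketch, not the CDAP API; the namespace and app names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model of namespace isolation: an app deployed in one namespace
// is invisible from every other namespace.
public class NamespaceStore {
    private final Map<String, Map<String, String>> appsByNamespace = new HashMap<>();

    public void deploy(String namespace, String app, String artifact) {
        appsByNamespace.computeIfAbsent(namespace, n -> new HashMap<>()).put(app, artifact);
    }

    // Lookup is always scoped to a single namespace.
    public Optional<String> lookup(String namespace, String app) {
        return Optional.ofNullable(
            appsByNamespace.getOrDefault(namespace, Map.of()).get(app));
    }

    public static void main(String[] args) {
        NamespaceStore store = new NamespaceStore();
        store.deploy("dev", "PurchaseApp", "purchase-1.0.0.jar");
        // Visible in "dev", invisible in "prod".
        System.out.println(store.lookup("dev", "PurchaseApp").isPresent());
        System.out.println(store.lookup("prod", "PurchaseApp").isPresent());
    }
}
```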
Security
CDAP supports Kerberos-enabled clusters and also supports perimeter-level security, with authentication through LDAP, JASPI or Basic mechanisms. Authorization is being worked on for the 3.4 release. More information on CDAP security can be found here.
Supported Programming Languages
Java is the only programming language currently supported by CDAP. CDAP has plans to support other languages like Python, R and JavaScript. The near-term plan is to support Python through Py4J integration. More information can be found here.
Supported Hadoop Distributions
CDAP and all its Applications are agnostic to the distribution they run on. On a nightly basis, the platform as well as test Applications are tested on various flavors of Hadoop distributions. Information about these tests can be found here.
Deployment Architecture
This section describes the different components of the CDAP runtime system and how they are deployed on a Hadoop cluster. For more information please see here.

The CDAP runtime system is made up of two major components:

CDAP Server
The CDAP Server is a collection of services (Figure 2) essential for successfully running CDAP. It can be installed on one or more edge nodes of a cluster and is responsible for managing only the cluster to which it's configured. The services are installed for a few reasons:

Fixed IP/hostname for accessing REST APIs
Impersonating a user in secure mode to run in a cluster
Managing (start/stop/monitor) system services running within the cluster
CDAP Services
These are mission-critical CDAP system services running in YARN containers on a Hadoop cluster. The lifecycle of these services is managed by the CDAP Server. Below are the services running on the cluster:
Dataset Executor: Responsible for managing dataset lifecycle
Metadata: Responsible for managing metadata for applications and datasets
Log and Metrics Aggregator: Responsible for aggregating and indexing logs and metrics across all applications and datasets
Transactions: Responsible for providing consistency guarantees across applications and datasets
Explore: Responsible for exposing the querying (SQL) interface for datasets
Stream: Responsible for data ingestion either in realtime or batch
Use-case(s)
Following are a few use-cases that CDAP is well-suited for and is being used for within customer environments:

Data Lake
High Volume Streaming Analytics
Information Security Reporting
Real-time brand and marketing campaign monitoring
Data Lake
Building an enterprise data lake requires building a reliable, repeatable and fully operational data management system, which includes ingestion, transformation and distribution of data. It must support varied data types and formats, and must be able to capture incoming data in various ways. The system must support the following:
Transform, normalize, harmonize, partition, filter and join data
Interface with anonymization and encryption services external to the cluster
Generate metadata for all data feeds, snapshots and datasets ingested, and make it accessible through APIs and web services
Perform policy enforcement for all ingested and processed data feeds
Track and isolate errors during processing
Perform incremental processing of data being ingested
Reprocess data in case of failures and errors
Apply retention policies on ingested and processed datasets
Set up a common location format (CLF) for storing staging, compressed, encrypted and processed data
Provide filtered views over processed datasets
Monitor, report, and alert based on thresholds for transport and data quality issues experienced during ingestion. This helps provide the highest quality of data for analytics needs.
Annotate Datasets with business/user metadata
Search Datasets using metadata
Search Datasets based on schema field names and types
Manage data provenance (lineage) as data is processed/transformed in the data lake
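Two of the conventions above -- day-partitioned storage locations and retention policies -- can be sketched concretely. The root path, feed name and layout below are hypothetical, chosen only to illustrate the idea of a common location format.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch: a common location format that partitions a feed by day, and a
// retention policy that drops partitions older than a window.
public class DayPartitions {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // e.g. /data/processed/clickstream/2016/04/10 (layout is made up)
    public static String partitionPath(String root, String feed, LocalDate day) {
        return root + "/" + feed + "/" + DAY.format(day);
    }

    // Keep only partitions that fall inside the retention window.
    public static List<LocalDate> applyRetention(List<LocalDate> partitions,
                                                 LocalDate today, int retentionDays) {
        List<LocalDate> kept = new ArrayList<>();
        for (LocalDate p : partitions) {
            if (!p.isBefore(today.minusDays(retentionDays))) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2016, 4, 10);
        System.out.println(partitionPath("/data/processed", "clickstream", today));
        List<LocalDate> parts = List.of(today.minusDays(40), today.minusDays(5), today);
        // With 30-day retention the 40-day-old partition is dropped.
        System.out.println(applyRetention(parts, today, 30));
    }
}
```

Incremental processing and reprocessing both fall out of this layout: a job can target exactly the day partitions that are new or need to be redone.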
Outcome
A team of 10 Java (non-Hadoop) developers was able to build an end-to-end ingestion system with the capabilities described above using CDAP. Lower barrier to entry.
These developers provided a self-service platform to the rest of the organization(s) to ingest, process and catalog data. Abstractions helped them build at a much faster pace and get it to their customers faster. Time to market.
The ingestion platform standardized and created conventions for how data is ingested, transformed and stored on the cluster, allowing platform users to on-board at a much faster rate. Time to value.
CDAP's native support for incremental processing, reprocessing, metadata tracking, workflows, retention, snapshotting, monitoring and reporting expedited the effort to get the system to their customers. Time to market.
CDAP is installed in 8 clusters with 100s of nodes.
Data Lake users were able to locate Datasets faster and had faster access to metadata, data lineage and data provenance. This allowed them to efficiently utilize their clusters and also aided them in data governance, auditability, and improving the data quality of Datasets. CDAP Tracker provided this set of capabilities.
High Volume Streaming Analytics
Building a high-speed, high-volume streaming analytics solution with exactly-once semantics is complex, resource intensive and hard to maintain and enhance. This use-case required data collection from web logs, mobile activity logs and CRM data, in real-time and batch. The collected data was then organized into customer hierarchies and modeled to deliver targeted ad campaigns and marketing promotion campaigns. It also had to provide advanced analytics for tracking the campaigns in real-time. The application had to support the following:

Process ~38 billion transactions per day in real-time
Categorize customer activity into buckets and hierarchies
Generate unique counts in real-time, to understand audience reach, track behavior trends, and the like
Generate hourly, daily, monthly and yearly reports on multiple dimensions
Provide unique stat counts on an hourly basis rather than weekly
Reprocess data without side effects from bug fixes and new features
Provide exactly-once processing semantics for reliable processing
Process data both in real-time and batch
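The hourly unique-count requirement above can be sketched as bucketing events by hour and tracking distinct users per bucket. At ~38 billion events a day a real system would likely use a probabilistic sketch such as HyperLogLog instead of exact sets; exact sets keep this illustration small and verifiable. User IDs and timestamps are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: distinct-user counts per hour bucket.
public class HourlyUniques {
    private final Map<Long, Set<String>> usersByHour = new HashMap<>();

    public void record(long epochMillis, String userId) {
        long hourBucket = epochMillis / 3_600_000L; // 1 hour in millis
        usersByHour.computeIfAbsent(hourBucket, h -> new HashSet<>()).add(userId);
    }

    public int uniques(long epochMillis) {
        return usersByHour.getOrDefault(epochMillis / 3_600_000L, Set.of()).size();
    }

    public static void main(String[] args) {
        HourlyUniques stats = new HourlyUniques();
        stats.record(1000L, "alice");
        stats.record(2000L, "alice");   // same user, same hour: not double-counted
        stats.record(2000L, "bob");
        System.out.println(stats.uniques(1000L)); // 2
    }
}
```

Exactly-once semantics matter here precisely because a replayed event must not inflate these counts; a set (or an idempotent sketch) absorbs duplicates, but sums and averages would not.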
Outcome
CDAP's abstractions and its real-time programs simplified building this application and getting it to market faster. Time to market.
The team replaced a MapReduce-based batch system with a realtime system, delivering insights every minute instead of every few days.
CDAP's exactly-once and transactional semantics provided a high degree of data consistency during failure scenarios, making it easy to debug and reason about the state of data.
CDAP's Standalone and Testing frameworks allowed the developers to build this application efficiently. No distributed components were required to run functional tests.
Information Security Reporting
In a large enterprise environment there are traditional sources that house a great deal of data. There is a constant need to load data into Hadoop clusters to perform complex joins, filtering, transformations and report generation. Moving data to Hadoop is cost-effective, as there is a need to run many complex, ad-hoc queries that would otherwise require expensive execution on traditional data storage and querying technologies.
The customer has been attempting to build a reliable, repeatable data pipeline for generating reports across all network devices which access resources. Data is currently aggregated into five different Microsoft SQL Servers. The aggregated data is then periodically (once-a-day) staged into a secured (Kerberos) Hadoop cluster. Upon loading the data into the staging area, transformations (rename fields, change the type of a field, project fields) are performed to create new datasets. The data is registered within Hive to run Hive SQL queries for any ad-hoc investigation. Once all the data is in final independent datasets, the next job is kicked off -- one that joins the data from across all five tables to create a new uber table providing a 360-degree view of all network devices. This table is then used to generate a report as part of another job. Following are the challenges the customer faced:
Ensuring that the reports aligned to day boundaries
Restarting failed jobs from the point where they had failed (they had to reconfigure pipelines to restart failed jobs)
Adding new sources required a lot of setup and development time
Inability to test the pipeline before it was deployed -- this led to inefficient utilization of the cluster, as all testing was performed on the cluster
They had to cobble together a set of loosely federated technologies -- Sqoop, Oozie, MR, Spark, Hive and Bash scripts
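The three staging transformations mentioned above -- rename a field, change a field's type, project a subset of fields -- can be sketched over plain map records. The field names are invented; this only illustrates the shape of the transforms, not the customer's schema or any CDAP API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: three record-level transforms used when staging source data.
public class RecordTransforms {
    // Rename a field, keeping all other fields.
    public static Map<String, Object> rename(Map<String, Object> rec, String from, String to) {
        Map<String, Object> out = new LinkedHashMap<>(rec);
        out.put(to, out.remove(from));
        return out;
    }

    // Change a field's type (here: string to int).
    public static Map<String, Object> toInt(Map<String, Object> rec, String field) {
        Map<String, Object> out = new LinkedHashMap<>(rec);
        out.put(field, Integer.parseInt(String.valueOf(out.get(field))));
        return out;
    }

    // Project: keep only the named fields.
    public static Map<String, Object> project(Map<String, Object> rec, String... fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String f : fields) out.put(f, rec.get(f));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("dev_name", "fw-01");
        rec.put("port", "443");
        rec.put("raw", "…");
        rec = rename(rec, "dev_name", "device");
        rec = toInt(rec, "port");
        rec = project(rec, "device", "port");
        System.out.println(rec);
    }
}
```

Chaining small transforms like these is what makes a pipeline testable field by field before it ever touches the cluster.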
Outcome
The in-house Java developers, with limited knowledge of Hadoop, built and ran the complex pipelines at scale within two weeks, after four (4) hours of training
The visual interface enabled the team to build, test, debug, deploy, run and view pipelines during operations
The new process reduced system complexity dramatically, which simplified pipeline management
The development experience was improved by reducing inappropriate cluster utilization
Transforms were performed in-flight with error record handling
Tracking tools made it easy to rerun the process from any point of failure
Real-time brand and marketing campaign monitoring
Enterprises use Twitter to know when people are talking about their brand and to understand sentiment toward their new marketing campaigns. Real-time monitoring of Twitter allows them to keep a close eye on the results of marketing efforts.
Developing a real-time pipeline that ingests the full Twitter stream, then cleanses, transforms, and performs sentiment and multi-dimensional analysis of the Tweets related to a campaign delivers a valuable real-time decision-making platform. The aggregated data is exposed through REST APIs to an internal tool for visualization, making consumption of the output easier.
The existing pipeline was built using Storm, HBase, MySQL and JBoss. Storm was used to ingest and process the stream of Tweets. The Tweets were analyzed using NLP algorithms to determine sentiment, and aggregated on multiple dimensions like number of re-tweets and attitude (positive, negative or neutral). The aggregations were stored in HBase. Periodically (twice-a-day) the data from HBase was moved into MySQL. JBoss exposed REST APIs for accessing the data in MySQL.
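The multi-dimensional aggregation step can be sketched as counting tweets per (dimension, value) pair. The dimension names below ("attitude", retweet buckets) are illustrative, not the customer's actual schema, and the in-memory map stands in for the HBase-backed store.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: count tweets per (dimension, value), e.g. attitude=positive.
public class SentimentAggregator {
    private final Map<String, Long> counts = new HashMap<>();

    public void add(String dimension, String value) {
        counts.merge(dimension + ":" + value, 1L, Long::sum);
    }

    public long count(String dimension, String value) {
        return counts.getOrDefault(dimension + ":" + value, 0L);
    }

    public static void main(String[] args) {
        SentimentAggregator agg = new SentimentAggregator();
        agg.add("attitude", "positive");
        agg.add("attitude", "positive");
        agg.add("attitude", "negative");
        agg.add("retweets", "10-100");
        System.out.println(agg.count("attitude", "positive")); // 2
    }
}
```

Serving these counters directly from the cluster (as the CDAP rework below does) is what removes the periodic HBase-to-MySQL copy step.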
The goal of this use-case was to reduce the overall complexity of the pipeline: move away from maintaining a separate cluster for processing the real-time Twitter stream, integrate NLP scoring algorithms for sentiment analysis, and expose the aggregated data from HBase with lower latency, reducing the time between data being available in HBase and delivery via REST API. The result: an easy-to-build, deploy and manage real-time pipeline with better operational insights.
Outcome
A Cask Hydrator pipeline for processing the full Twitter stream was built in 2 weeks.
Tweets are cleansed, transformed, analyzed and aggregated at about 6K/sec in-flight.
Infrastructure was consolidated into a single Hadoop cluster.
Java developers were able to build the pipeline and plugins with less of a learning curve.
A CDAP Service on the OLAP Cube dataset removed expensive data movement and reduced the latency between an aggregation being generated and the results being exposed through REST APIs, allowing them to make better decisions faster.
CDAP and Cask Hydrator transparently provided easy operational insights through custom dashboards and aggregated logs for debugging.
A day in the life with CDAP
This section describes how a day goes for a developer and an operations team member building and deploying an application or a solution on Hadoop in production using CDAP. To demonstrate this, we will take a common use-case within an organization using Hadoop.
Use-case
Joltie, a Java developer, and Root, an operations team member, work in an enterprise. They have been tasked with building and operationalizing a data pipeline for processing data that is ingested and available on HDFS. The data is delivered to a standard directory on a daily basis, with an approximate size of 600 GB per day. The processing pipeline has to be operationalized to process daily data within an SLA of 3 hrs. Developer and Operations only have access to CDAP, not Cask Hydrator.

The data pipeline must include the following:

1. A dataset integrity job that takes a pass over all the data from a day to check if the data on HDFS is reliable enough to be processed.
2. A transformation job that processes a day's worth of data -- applying transformations, filtering and field-level validation. The output of the job is a new transient dataset.
3. The output from step 2 is then picked up by another job that interacts with the encryption system to encrypt certain fields in the feed. This job generates two outputs -- one that is encrypted and one that is not. The job writes both output sets to datasets partitioned by day. These datasets are explorable using Hive.
4. The unencrypted partition output is further processed to build and update a data model in HBase. The data in HBase is also explorable through Hive.

Conditions
The jobs in step 1 and step 2 can run in parallel.
The job in step 3 is executed only if the job in step 1 confirms that the data is reliable and the step 2 job is successful.
The job in step 4 is dependent on successful completion of step 3.

Environment
The users are provided a very constrained environment for developing and testing the pipeline. In total there are three (3) clusters made available to them:

A Dev Cluster - 2% capacity of the production cluster
A QA Cluster - 5% capacity of the production cluster
A Production Cluster
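The dependency rules of the four-step pipeline can be sketched as a tiny DAG check: steps 1 and 2 have no prerequisites, step 3 needs both, and step 4 needs step 3. This models only the ordering logic; in CDAP the real jobs would be Workflow nodes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: dependency check for the four pipeline steps described above.
public class PipelineDag {
    // step -> prerequisite steps
    private static final Map<Integer, List<Integer>> DEPS = Map.of(
        1, List.of(),
        2, List.of(),
        3, List.of(1, 2),
        4, List.of(3));

    // A step may run once all of its prerequisites have succeeded.
    public static boolean canRun(int step, Set<Integer> succeeded) {
        return succeeded.containsAll(DEPS.get(step));
    }

    public static void main(String[] args) {
        Set<Integer> done = new HashSet<>();
        System.out.println(canRun(1, done)); // steps 1 and 2 can start immediately
        System.out.println(canRun(3, done)); // false: needs 1 and 2
        done.add(1);
        done.add(2);
        System.out.println(canRun(3, done)); // true
        System.out.println(canRun(4, done)); // false: needs 3
    }
}
```

Encoding the dependencies explicitly is also what makes "restart from the point of failure" cheap: rerun only the steps whose prerequisites already succeeded.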
For Joltie - The Developer
Joltie, a mid-level Java developer with experience building web applications, is very excited about this opportunity to build a Big Data application. She has taken the initiative to learn the basics of Hadoop. She is intimidated by the complexity of Hadoop and has been looking at different examples of how to process data on Hadoop. Following are additional requirements that she has been tasked with:

Make the application code modular and testable
Commit changes on a regular basis
Write unit test cases for each piece of functionality that is added to the application
Set up a job in Jenkins that runs unit tests on every check-in
Set up an end-to-end test for the application on the QA cluster
Instrument the application correctly to provide more insights into business processing
The application should handle some critical edge cases:
In case of an error in the pipeline, the user should be able to fix the issue and easily reprocess
The pipeline must support incremental processing
The pipeline must handle the case where data arrives late
Joltie has to work with Root (operations team member) to successfully deploy the application to QA and production
Joltie has to use CDAP to build this application

Tools Needed
Maven
IDE
Laptop
Node.js
Java

Material Available
CDAP Apps & Examples
CDAP SDK
CDAP Documentation

Design & Research
Joltie starts by looking at the pipeline requirements and in parallel spends time reading the CDAP documentation and running examples. She is delighted that she doesn't need to install Hadoop.
Development

DAY 1: First Cut of the Application
Joltie starts by creating a boilerplate CDAP Application using the CDAP Maven Archetype.
All the necessary dependencies are included, so she is ready to get started.
She then modifies the Application to build a Directed Acyclic Graph (DAG) for processing input data, as described in the use-case, using the Workflow API provided by CDAP.
She also uses the CDAP JUnit scaffolding to build a unit test for the Workflow. She finds it very useful to be able to test workflows before they are deployed on a cluster or anywhere else.
Joltie builds the project using Maven to generate the application artifact.
She then starts a CDAP Standalone instance on her laptop.
She deploys the artifact into CDAP Standalone and plays around with it. She cannot test it at scale, but is able to experience working with the Application as it would be in the cluster.
She iterates through the development cycle without having to touch a cluster.
She goes home feeling like she has accomplished a great deal and has developed the first version of the Application by herself.
DAY 2: Application Enhancement and CI Setup
It's a beautiful day. Joltie is ready to cover a few of the edge cases that were provided as requirements and take the application to completion.
She reads about Partitioned Fileset Datasets and configures her Application to include them in order to solve the edge cases. Partitioned Fileset Datasets are transactional and provide the consistency required to handle new partitions and to handle errors efficiently.
She iterates a few times to make sure she includes test scenarios for the edge cases, and also tries the same within CDAP Standalone. She is able to simulate a few of the scenarios, but she has to try this on a real cluster to make sure it works as expected under error conditions.
Next, she is ready to set up a Jenkins job to trigger tests on every check-in. She specifies the configurations and commands to run the tests the same way she would in a shell.
She now wants to set up an end-to-end test of her application. She is not sure how to accomplish this, so she opens an issue within Cask's support portal. She gets a response back within a few hours with information on how to set up the end-to-end test.
She follows the instructions specified by the Cask representative to set up an end-to-end job on Jenkins.
She has accomplished a lot on the second day!
DAY 3: Operations Handoff
Joltie reaches the office early, with a spring in her step, to get her Application into the QA environment.
Before starting to work with Root, Joltie adds instrumentation to track the many business metrics that would be useful to debug issues and provide insights into the Application itself. She uses the CDAP-provided Metrics APIs.
She meets Root and explains the Application and what it does. Root, having worked on Hadoop previously, is not very excited about a new Application being dropped in his lap to operate.
Joltie starts off by deploying the Application into a CDAP Standalone instance and starting it. She proceeds to explain to Root the different aspects of the Application/Workflow and how the data will be processed. Root starts feeling comfortable and asks a lot of questions about how he can monitor this Application.
Joltie then easily builds a dashboard using the CDAP Console and highlights the metrics that Root can use to monitor the application.
For Root - The Operations Guy
Root, the operations guy, is charged with deploying and managing the data pipeline in the QA and production environments. In order to be successful with this project, Root needs to accomplish the following:
Install and con�gure CDAP on a cluster using the package manager provided by the distribution he is using
Secure the CDAP instance with authentication, so that only authenticated users are allowed to gain access to the instance
Ensure that the authentication can be integrated into company’s existing LDAP system
Ensure that CDAP can be upgraded using the tool
Integrate the deployment and management of the data pipeline
Specify runtime arguments for the data pipeline
Manage the data pipeline lifecycle via a simple-to-use, well-de�ned set of APIs
Ensure the data pipeline has guaranteed resources in the cluster, so that ad-hoc data processing does not impact this high priority production data pipeline.
Assess CDAP and data pipeline health so that he can effectively manage and run the pipeline
Gain visibility into the running data pipeline by monitoring application and system metrics via simple-to-use APIs
Access the application and system logs in case he needs to troubleshoot any incident
Get a historical view of SLAs for the data pipeline
Receive proactive notification when there are failures in the pipeline
Restart failed pipelines using APIs
Upgrade CDAP and the applications within a reasonable downtime
Material Available
System Requirements
Deployment Architecture
Administration Manual
Installation Guide
Security Guide
Operations Guide
Download Artifacts
Operations
Root starts his day by reading through the documentation provided by Cask to understand the components and deployment architecture
He proceeds to review the security aspects, learning more about the LDAP integration
In the afternoon he downloads CDAP and install CDAP on a Kerberos enabled cluster
He then configures perimeter security and integrates it with the existing LDAP system in his company
DAY1 CDAP Installation and Configuration
Root starts his second day by researching the management and monitoring APIs
He lays out a plan for monitoring CDAP and the data pipeline that Joltie is building
In the afternoon, he syncs up with Joltie and submits requirements for certain key application metrics he needs to assess the data pipeline’s health. He spends the rest of the afternoon using the REST APIs to integrate CDAP and application health checks, and sets up alerts on failures
DAY2 Application management and monitoring integration
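Root's alert-on-failure setup can be sketched as a simple triage step over the program statuses that CDAP's status APIs return. The pipeline names and the exact status vocabulary below are illustrative assumptions.

```python
# Sketch: decide which pipelines to alert on and restart, given a map of
# program statuses as a status endpoint might report them (e.g. "RUNNING",
# "STOPPED", "FAILED"). Pipeline names and states here are hypothetical.

FAILED_STATES = {"FAILED", "KILLED"}

def triage(statuses):
    """Return (alerts, restarts): pipelines to notify about and to restart."""
    alerts = [name for name, state in statuses.items() if state in FAILED_STATES]
    # Per Root's requirements, failed pipelines are restarted via the same
    # lifecycle APIs used to start them, so the restart list mirrors alerts.
    restarts = list(alerts)
    return alerts, restarts

alerts, restarts = triage({
    "LogPipeline": "RUNNING",
    "FraudDetection": "FAILED",
})
print(alerts)    # ['FraudDetection']
```

In practice this triage would run on a schedule, with the alert step wired into whatever notification channel the operations team already uses.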
Root syncs up with Joltie in the morning, making sure he gets all the information he needs. They share a donut.
Root and Joltie do a dry run of how to run the data pipeline in CDAP standalone
Root then sets up the data pipeline in a QA environment and is ready to push the data pipeline to production whenever he gets the go-ahead from management
DAY3 Operations Handoff from Dev
FAQ
Is CDAP just a library ?
No, CDAP is a full platform that provides runtime services such as service discovery, which allows users to register data services running in containers on a cluster. It achieves this by registering one or more service endpoints with ZooKeeper and actively maintaining the live state of the running services.
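The registration step behind service discovery can be sketched as follows. The znode path layout and JSON payload here are illustrative assumptions, not CDAP's actual wire format.

```python
import json

# Sketch of a ZooKeeper-style service announcement. The znode path layout
# and JSON payload shape are illustrative assumptions, not CDAP's format.

def announce(service, host, port):
    """Return the znode path and payload registering one service endpoint."""
    path = f"/discovery/{service}/{host}:{port}"
    payload = json.dumps({"service": service, "host": host, "port": port})
    # With a real ZooKeeper client (e.g. kazoo) one would create an
    # *ephemeral* node here, so the entry vanishes if the container dies
    # and the live state stays accurate:
    #   zk.create(path, payload.encode(), ephemeral=True, makepath=True)
    return path, payload

path, payload = announce("user-profile-service", "10.0.0.7", 57231)
print(path)
```

Ephemeral nodes are the key design choice: liveness is maintained by the ZooKeeper session itself rather than by heartbeat bookkeeping in the application.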
Is CDAP an IDE plugin ?
No, CDAP is not an IDE plugin, but building CDAP applications is IDE friendly. Developers use Eclipse and IntelliJ to build; every application is a Maven-compatible project, which is supported by most IDEs.
Is CDAP Open Source ?
Yes, CDAP is open source and is licensed under the Apache 2.0 license
Where can I find more information about CDAP ?
More information on CDAP can be found at http://docs.cask.co/cdap
Who is currently using CDAP in production ?
CDAP is currently running in production at multiple customer sites, including telco, financial services, advertising, e-commerce and cloud companies.
What is the biggest installation of CDAP ?
CDAP has been tested and run on clusters as small as 10 nodes and as large as 600 nodes.
Do you support log integration with Logstash ?
We currently don’t support integration with Logstash for logs. We plan to provide a Logstash input plugin that would ingest application and system logs into Logstash.
Do I need to be a Hadoop expert to learn and use CDAP ?
You need only basic knowledge of Hadoop; focusing on what you can build with Hadoop is sufficient. CDAP provides all the glue required for you to build a solution.
What are the ways one can install CDAP on a cluster ?
CDAP can be installed on a Hadoop cluster through Cloudera Manager (CDH), Ambari (HDP) or an RPM-based install. More information can be found here
Does CDAP provide any visualization capabilities ?
CDAP doesn’t provide any visualization capabilities. However, through its JDBC/ODBC driver for Datasets, CDAP allows integration with existing visualization tools.
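As a sketch of that JDBC/ODBC integration path: the helper below assumes the convention of a dataset being exposed as a Hive table named dataset_<name> (an assumption to verify against your CDAP version's docs), and the DSN name is hypothetical.

```python
# Sketch: querying a CDAP dataset over ODBC from a BI or scripting tool.
# The table-naming convention and the DSN name below are assumptions.

def dataset_query(dataset, limit=10):
    """Build a SQL query against the table a dataset is exposed as."""
    # Assumed convention: dataset 'x' appears as Hive table 'dataset_x'.
    return f"SELECT * FROM dataset_{dataset} LIMIT {limit}"

sql = dataset_query("purchases")
print(sql)
# With an ODBC driver installed, one would then run it, e.g. via pyodbc:
#   conn = pyodbc.connect("DSN=CDAP")          # hypothetical DSN
#   rows = conn.cursor().execute(sql).fetchall()
```

Any tool that speaks JDBC/ODBC (Tableau, Excel, a SQL client) can issue the same query, which is how the "integration with existing tools" above works in practice.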
When should I use CDAP Standalone ?
CDAP Standalone should be used only during development. It is not meant for performance or large-scale tests, and it is not built to run in production. CDAP Standalone can also be used for demos and for showing colleagues and partners how the application you have built will behave in distributed mode.
What public clouds does CDAP work on ?
Customers have CDAP running in AWS and GCE, and we are currently working with the Microsoft team to support running on Azure.
If I have found an issue with CDAP, how can I report it ?
You can use the open source JIRA system for CDAP and other projects. If you have a CDAP subscription, the Cask support portal is available for customers to report issues. Issues reported through the Cask support portal have agreed SLAs for response and resolution.
Are there other companies that contribute to CDAP ?
A few companies contribute to core CDAP, but a larger group of companies contributes to sub-components of CDAP such as Apache Twill (p.k.a. Weave), Apache Tephra, Hydrator plugins and others.
I am interested in contributing to CDAP, how can I get started ?
CDAP is an open source project licensed under the Apache 2.0 license. It is not part of the Apache Foundation yet. In order to contribute to CDAP, you need to provide Cask with a signed ICLA or CCLA. The terms are open and very similar to those of the Apache Foundation.
I have a feature request for CDAP, how can I work with the team to get it added to the product ?
You can file a JIRA ticket for the feature request or, if you have a CDAP subscription, use the support portal.
If I have a question about CDAP or an application I am building, how can I get help from the experts ?
CDAP has an open source support system: a Google Group where developers and users can ask questions. The email address for the group is [email protected] and the response SLA is 1 day. If you have a CDAP subscription, response SLAs are based on the contract established with your company.
How many lines of code make up CDAP ?
CDAP, including Cask Hydrator, is made of 4,413 files, 362,940 LOC and 128,601 CLOC.
How many engineers work on CDAP ?
There are more than 40 engineers working on CDAP.
What extensions are you looking to build in the future ?
Customer 360 and Log Analytics are next on our list.
Is CDAP going to integrate with Apache Flink ?
The CDAP team is closely observing the evolution of Apache Flink in the community and across the industry. We believe the technology is architected, built and maintained really well. We are currently waiting for adoption by a major Hadoop distribution before including it in CDAP.
Does CDAP provide any mechanism for adding custom documentation on how to troubleshoot a Workflow or any other Programs ?
Currently, this capability doesn’t exist within CDAP, but it’s a very useful feature request. We have opened CDAP-5209 to track this issue.
How do I integrate CDAP with R ?
CDAP currently doesn’t support R natively. Customers have integrated R at scale through RServe.
Does CDAP support rolling upgrades ?
This is tracked in CDAP-5226.
Does CDAP have multi-site support ?
This is tracked in CDAP-5227.
Does CDAP integrate with Apache Sentry for authorization ?
The current stable version of CDAP is not integrated with Sentry for authorization. Active work is being done to integrate CDAP with Apache Sentry; more information about the design can be found here. The first phase of Sentry integration will be available in CDAP 3.4 and covers authorization for runtime components. The next release after 3.4 will include authorization for Datasets.