
Upload: phamthuan

Post on 12-Jun-2018


Partner’s Guide to Integrating with Cloudera

Overview: Cloudera provides an enterprise data hub built on the foundation of Apache Hadoop. An enterprise data hub provides:
• one place to store all data, for as long as desired or required, in its original fidelity;
• a framework to integrate with existing infrastructure and tools;
• flexibility to run a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search, and advanced analytics;
• the robust security, governance, data protection, and management that enterprises require.
Given the breadth of Cloudera’s data management platform, it is easiest to break it down into four key integration areas: Ingest, Process, Interact, and Administer. The information below is meant to act as a guide for partner integrations.

Integration Points: CDH includes numerous components that can serve as integration points for partner products. It makes sense to group the components by broad, high-level function with respect to the data lifecycle:
• Ingest: Movement of data into the Cloudera cluster
• Process: Large-scale data processing
• Interact: Interactive query and analysis
• Administer: Management of the cluster and services

Below is a quick list of components organized by function, with links to more information and relevant training courses.

Ingest

Apache Sqoop: Bidirectional data movement between HDFS and relational databases (RDBMS) using MapReduce.
• Sqoop documentation: http://archive-primary.cloudera.com/cdh5/cdh/5/sqoop/
• Blog: Understanding connectors and drivers in the world of Sqoop; Sqooping Data with Hue
• Relevant training: Hadoop Developer Training and Hadoop Admin Training each include hands-on exercises using Sqoop.

Apache Flume: Event-based streaming of data into HDFS, HBase, and other destinations.
• Flume documentation: http://archive-primary.cloudera.com/cdh5/cdh/5/flume-ng/
• Blog: http://blog.cloudera.com/blog/category/flume/
• Relevant training: Cloudera’s Hadoop Admin Training includes a Flume exercise.

HDFS API: Partners often use Hadoop clients and the Hadoop API to move data into Cloudera clusters.
• HDFS shell: http://archive-primary.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/FileSystemShell.html
• HDFS API Javadoc: http://archive-primary.cloudera.com/cdh5/cdh/5/hadoop/api/org/apache/hadoop/fs/FileSystem.html
• Relevant training: Hadoop Admin Training, Hadoop Developer Training

Hue: Hue includes an HDFS client for transfer of smaller amounts of data.
• Cloudera Hue documentation for the Hue file browser: http://archive-primary.cloudera.com/cdh5/cdh/5/hue/user-guide/filebrowser.html
• File operations made easy with Hue: http://blog.cloudera.com/blog/2013/04/demo-hdfs-file-operations-made-easy-with-hue/

Process

Hadoop MapReduce: The core batch processing engine of Hadoop.
• Relevant training: Hadoop Developer Training
• Video resources: Introduction to MapReduce
• MapReduce Tutorial: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html

Apache Hive: Batch processing using a SQL-like interface.
• Hive documentation, including language reference
• Relevant training: Analyzing Data using Hive, Pig and Impala (preview at: https://www.youtube.com/watch?v=KC3dbj1oYZ4)
• Introduction to Hive
• Demo: Analyzing data with Hue and Hive
• An old Cloudera video on Apache Hive

Apache Pig: Data flow processing for Hadoop.
• Pig documentation
• Relevant training: Analyzing Data using Hive, Pig and Impala (preview at: https://www.youtube.com/watch?v=KC3dbj1oYZ4)

Apache Oozie: Workflow orchestration for Hadoop.
• Oozie documentation
• Oozie tutorials: http://blog.cloudera.com/blog/category/oozie/
• Relevant training: Hadoop Developer Training has a component on Oozie.

Apache Spark: Fast, memory-intensive processing.
• Spark documentation
• Relevant training: Cloudera Spark Training
• Running a simple Spark app in CDH5
• Fast in-memory computing for Big Data applications
• AMPLab mini course on Spark

Interact

Apache HBase: BigTable-based NoSQL for Hadoop.
• Google BigTable paper
• The Apache HBase “Book”
• HBase: The Definitive Guide by Lars George
• HBase In Action by Nick Dimiduk, Amandeep Khurana
• HBaseCon presentations
• Relevant training: Cloudera HBase Training
• HBase introduction from Todd Lipcon
• HBase schema design from Lars George
• HBase and Lewis Carroll from Jeff Bean

Cloudera Impala: Fast interactive SQL query.
• Cloudera Impala overview: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_concepts.html
• Connecting to Impala via JDBC:
  • JDBC in a secure environment: http://blog.cloudera.com/blog/2014/05/how-to-configure-jdbc-connections-in-secure-apache-hadoop-environments/
  • Configuring Impala: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_jdbc.html
• Connecting to Impala via ODBC:
  • Configuring Impala: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_odbc.html
• Relevant training: Analyzing Data using Hive, Pig and Impala (preview at: https://www.youtube.com/watch?v=KC3dbj1oYZ4), Designing and Building Big Data Applications
• Impala connectors

Cloudera Search: Indexing and retrieval of complex data in HDFS.
• Cloudera Search overview video
• Cloudera developer blog on Search
• Cloudera Hue Demo with Search
• Search documentation
• Relevant training: Designing and Building Big Data Applications

Administer: Cluster and service management and extensibility

Cloudera Manager
• Cloudera Manager introduction
• Get started with Hadoop in less than 30 minutes
• Cloudera Manager automation via the Java API
• Cloudera Manager API GitHub repo: http://cloudera.github.io/cm_api/
• Cloudera Manager API Introduction
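As a rough illustration of the Cloudera Manager API resources listed above, the sketch below prepares an HTTP request against the REST interface that the API clients wrap, using only the Python standard library. The host name, credentials, port 7180, and API version "v6" are assumptions for illustration and should be checked against your own Cloudera Manager deployment. The helper names are hypothetical and not part of any Cloudera library.

```python
import base64
import json
import urllib.request


def build_clusters_request(host, username, password, port=7180, api_version="v6"):
    """Build (but do not send) a GET request for the clusters endpoint.

    port and api_version are assumed defaults; verify them for your deployment.
    """
    url = f"http://{host}:{port}/api/{api_version}/clusters"
    # HTTP Basic authentication: base64-encode "user:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def list_cluster_names(request, timeout=10):
    """Send the request and return cluster names.

    Assumes the response is a JSON object with an "items" list of
    objects that each carry a "name" field.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [item["name"] for item in payload.get("items", [])]


if __name__ == "__main__":
    # Hypothetical host and credentials; replace with real values before sending.
    req = build_clusters_request("cm.example.com", "admin", "admin")
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the sketch testable without a live cluster. In a secured deployment you would likely use HTTPS and a dedicated service account rather than the values assumed here.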

Cloudera Manager parcels: Our GitHub repository has some early documentation and tools for creating parcels.
• Repository: https://github.com/cloudera/cm_ext
• Docs: https://github.com/cloudera/cm_ext/wiki

Cloudera Manager CSDs (Custom Service Descriptors): The CSD-specific documentation is not ready yet, but in the meantime we have open sourced some of our internal CSDs, for Spark and Accumulo.
• CSD examples: https://github.com/cloudera/cm_csds
• How to install a CSD: http://cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Managing-Clusters/cm5mc_addon_services.html

You can use the CSD and parcel mechanisms to deploy your agents.

Security: Cloudera encourages all of our partners to certify using secure clusters whenever possible.
• See Configuring Hadoop Security with Cloudera Manager for more information.
• Eric Sammer’s KRB Bootstrap helps quickly and easily set up Kerberos KDCs for Cloudera clusters. This doesn’t secure a cluster for production but is adequate for certification.
• The Cloudera security webinar series gives a good overview of Cloudera’s thinking on security considerations.
• Configuring JDBC connections in secure Hadoop environments: This details setting up and configuring a secure Hadoop cluster and connecting to it via username/password authentication over JDBC.

Kite: The Kite SDK project is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem. Kite removes complexity by encoding best practices. It enables adoption with loosely coupled modules, such as a data module for reading and writing data, and morphlines for ETL. If you’re building an application that integrates with Cloudera at multiple points, Kite is a useful place to start.
• Kite Howto
• Kite Documentation

Documentation: Cloudera documentation includes Cloudera-specific details such as cluster installation, configuration, management, and requirements.
• Cloudera Documentation: Top-level documentation entry point.
• CDH Documentation: Cloudera Distribution documentation, not including Cloudera Manager. Much of the installation and configuration functionality documented here is also automated by Cloudera Manager. However, it makes sense to understand what’s happening under the covers even when using Cloudera Manager.
• Cloudera Manager Documentation: http://www.cloudera.com/content/support/en/documentation/manager/cloudera-manager-v5-latest.html

Project documentation: Cloudera supports and packages open source software. In some cases the documentation from the open source community is the best resource. There is often documentation available on the respective project’s home page, but Cloudera also hosts version-specific documentation linked from our documentation. For example, Cloudera’s hosted version of the Apache Hadoop documentation for CDH5 is at http://archive-primary.cloudera.com/cdh5/cdh/5/hadoop/ We serve documentation for every project included in Cloudera’s distribution under the URL http://archive-primary.cloudera.com/cdh<version>/cdh/<version>/project-name

Cloudera Web Resources: As a member of Cloudera Connect, you have access to our partner portal, which includes links to training videos, current release videos, FAQs, and other resources. Cloudera also provides free video resources to the community at large via its website. This is a good place to find introductions to key topics as well as technical discussions. You can also find updated information for developers and admins on our resources page.

Cloudera blogs: We post running updates in both our developer blog and our vision blog. If you’d like more information about use cases, individual projects, or interesting work being done in our community, this is a good place to start. Also, please contact us if you’d like to contribute your own interesting stories to our blog.
• Cloudera developer blog: in-depth technical details about specific topics. Feel free to browse or search for a specific topic.
• Cloudera vision blog: broader strokes on use cases, big data, technology direction, and the market.

Cloudera Training: Cloudera University offers a great set of courses for learning Hadoop and related technologies in detail. Additionally, there is a robust library of e-learning courses available.

Books: Most Hadoop topics are covered in depth in books published by O’Reilly. We particularly like:
• Hadoop: The Definitive Guide by Clouderan Tom White
• Hadoop Operations by Cloudera alumnus Eric Sammer
• HBase: The Definitive Guide by Clouderan Lars George
• Hadoop Application Architectures by a whole team of Clouderans

User lists:
• Mailing lists such as cdh-[email protected]
• Cloudera forums for our user community: http://community.cloudera.com/
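
The hosted project documentation URL pattern described under Project documentation can be sketched as a small helper. This is a minimal illustration and not a Cloudera tool: the function name `project_docs_url` is hypothetical, and it assumes that `<version>` is the CDH major version number and that project names match the path segments seen in this guide (for example hadoop, sqoop, flume-ng). The resulting URLs may not resolve for every version or project.

```python
from urllib.parse import quote

# Host taken from the documentation URLs listed in this guide.
DOCS_HOST = "http://archive-primary.cloudera.com"


def project_docs_url(major_version: int, project: str) -> str:
    """Return the hosted documentation URL for a project in a CDH release.

    Follows the pattern cdh<version>/cdh/<version>/project-name.
    Hypothetical helper for illustration only.
    """
    if major_version < 1:
        raise ValueError("major_version must be a positive integer")
    name = project.strip().strip("/")
    if not name:
        raise ValueError("project must be a non-empty name")
    # quote() percent-encodes characters that are unsafe in a URL path segment.
    return f"{DOCS_HOST}/cdh{major_version}/cdh/{major_version}/{quote(name)}/"


if __name__ == "__main__":
    for project in ("hadoop", "sqoop", "flume-ng"):
        print(project_docs_url(5, project))
```

For CDH5 this reproduces the Hadoop, Sqoop, and Flume documentation links given earlier in this guide.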