A Beginner's Guide to Cloudera Hadoop


Cloudera Hadoop as your Data Lake

Introduction to BigData and Hadoop for beginners

David Yahalom, CTO, NAYA Technologies

[email protected] www.naya-tech.com

2015, All Rights Reserved

NAYA Technologies | 1250 Oakmead Pkwy suite 210, Sunnyvale, CA 94085-4037 | +1.408.501.8812 All Rights Reserved. Do not Distribute.


About NAYA Technologies

Global leader in Data Platform consulting and managed services.

Established in 2009, NAYA is a leading provider of Data Platform managed services with emphasis on planning, deploying, and managing business critical database systems for large enterprises and leading startups.

Our company provides everything from data platform architecture design through implementation to 24/7/365 support for mission-critical systems.

NAYA is one of the fastest-growing consultancies in the market, with teams that provide clients with the peace of mind they need when it comes to their critical data and database systems.

NAYA delivers the most professional, respected and experienced consultants in the industry. The company uses multi-national consulting teams that can manage projects consistently across time zones.


BigData as a “Game Changer”

• What is BigData?

• What makes BigData different?

As data becomes increasingly complex and difficult to manage and analyze, organizations are looking for new solutions that go beyond the scope of the traditional RDBMS.

It used to be very simple! A decade ago, everything ran on top of relational databases - realtime, analytics, BI, OLTP, OLAP, batch... Back then, data sets were much smaller and usually well structured (native to a relational database), so a single database paradigm - the relational database - was a great match for all data requirements, supporting all major use cases.

Things aren't so simple anymore. In the past few years, the nature of our data has changed: datasets have become larger and more complex, with tremendous increases in the rate of data flowing both into and out of our databases. This change has brought about a new way of thinking about database platforms.

The traditional role of the relational database as the single and unified platform for all types of data is no more. The market has evolved to embrace a more specialized approach where different database technologies are used to store and process different sets of data.


The Challenge of “BigData”

• How do I know if I have a "BigData problem"?

The changing nature of data can be discussed in terms of Volume, Velocity, and Variety. These are the differentiating factors that help separate classic data use cases from next-generation ones. These "three Vs" are the business challenges that force organizations to look beyond the traditional RDBMS as the sole data platform.

Volume

• Collecting and analyzing more data helps make more educated business decisions. We want to store all data that has, or might have, business value. Don't throw anything away - you never know when a piece of data will become valuable to your organization.

• Flexibility in the ability to store data is also extremely important. Organizations require solutions that can scale easily. You might have only 1 Terabyte of data today, but that may increase to 10 Terabytes in a few years, and your data architecture must support scalability seamlessly and easily, without "throw-away" architectures.

Velocity

• The rate of data collection - data flowing into our applications and systems - is increasing dramatically. Thousands, tens of thousands, or even hundreds of thousands of critical business events are generated by our applications and systems every second. These business events are meaningful to us and have to be stored, cataloged and analyzed.

• Rapid data ingestion isn't the only challenge; users are demanding realtime access to analytics based on up-to-date data. No longer can we provide users with reports based on yesterday's data. No longer can we rely on periodic nightly ETL jobs. Data needs to be fresh and immediately available to users for analytics as it is being generated.

Variety

• Traditional data sets used to be strictly structured - either natively or after an ETL process created structure. ETLs are slow, non-scalable, difficult to change, and prone to errors and failures. Nowadays, applications need to store different types of data, some structured and some unstructured: data generated from social networks, sensors, application logs, user interactions, geo-spatial data, etc. This data is much more complex and has to be made accessible for processing and analysis alongside more traditional data models.

• In addition, different applications with different data structures and use cases can benefit from different processing frameworks/paradigms. Some datasets require batch processing (such as recommendation engines) while others rely on realtime analytics (such as fraud detection). Flexibility in data access APIs - a "best of breed" approach - can benefit users by making complex data easily accessible to everyone in the organization.

Enter the world of NoSQL databases

• What are NoSQL databases and how do they relate to BigData?

• How are NoSQL databases different compared to traditional SQL-based databases?

The solution to the challenges we described? The next generation of NoSQL databases - databases that try to address the "Volume, Velocity, Variety" challenges by thinking outside the box.

Remember, relational databases are optimized for storing structured data, are difficult to scale, and rely on SQL for data retrieval. They are optimized for some use cases, but not all.

NoSQL databases, on the other hand, are designed to store and process large amounts of fast-arriving data (Velocity, Volume), to scale easily (Volume), to handle complex data (Variety), and to provide immediate access (Velocity) to fresh data.

Relational databases:

• Structured: data is stored in tables. Tables have data types, primary keys and constraints.

• Transactional: data can be inserted and manipulated in “grouped units” = transactions. We can commit and rollback.


• Versatile but limited: traditional relational databases can do OLTP, OLAP, DWH and batch, but are generally not specialized.

• Examples: Oracle, SQL Server, DB2, MySQL, PostgreSQL.

• Do not easily scale out: traditional relational databases usually rely on a single database instance; scale-out requires manual sharding, complex application-level Data Access Layers (DALs), or expensive and specialized hardware.

• Well-known and easy to work with: everyone knows the RDBMS and SQL.

NoSQL databases:

• Non-structured or semi-structured data model: NoSQL databases usually provide a flexible, schema-less data model that supports unstructured/semi-structured data and rapid data model changes. Some NoSQL databases provide native JSON support; others provide a BigTable-style data model.

• Extremely scalable: designed to be scalable from the ground up. Usually deployed in a cluster architecture to achieve easy and rapid scalability.

• Usually specialized: Specific NoSQL database technologies are designed for specific use cases. High-volume operational processing? HBase. Advanced analytics? Hadoop.

• Examples: Hadoop, HBase, MongoDB, Couchbase, Cassandra, etc.

• Variety of data retrieval and development APIs: each NoSQL database has its own unique query API and query language. Some even support SQL; some do not.

BigData as a "One Liner": generating value from large datasets that cannot be analyzed using traditional technologies.


Hadoop as your Data Lake

• How does Hadoop fit the BigData picture?

Apache Hadoop is an open source data platform facilitating a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process different types of data, Hadoop allows for centralized, distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers.

Hadoop can become your organization's master centralized location for all raw data, structured or unstructured, and thus become a central "Data Lake" to which all other databases, data silos and applications can connect to retrieve data.

Hadoop doesn't just store your data; all data in Hadoop can be easily accessed using multiple frameworks and APIs.

Data can be ingested into Hadoop without pre-processing or the need for complex ETL. You can just load the data as-is, in near realtime. This minimizes processing overhead when storing raw data and does not require changing the way your data looks so that it fits a particular target schema. Changes to the raw data do not mandate changing the data model during data ingestion. The data model is usually created during queries (reads) and not during data load (writes).

Hadoop provides a “store everything now and decide how to access later” approach.
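For example, loading a day's worth of raw application logs into the data lake can be as simple as copying the files into HDFS. A minimal sketch using the "hadoop" command-line client (covered in more detail later); the directory and file names here are hypothetical:

# Create a landing directory in HDFS and load the raw log files as-is - no ETL, no schema
hadoop fs -mkdir -p /data/raw/app_logs/2015-06-01
hadoop fs -put /var/log/myapp/*.log /data/raw/app_logs/2015-06-01/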


The “store everything now and decide how to process later“ architecture

- All required raw data is ingested in near realtime into a Hadoop cluster from both unstructured and structured sources.

- Once loaded into Hadoop, all of your data is immediately accessible for all the different use cases in your organization.

With Hadoop, no data is “too big” or “too complex”.

[Diagram: all valuable data, both raw and processed - unstructured as well as relational - is loaded into Hadoop with NO ETL; once stored in Hadoop, the data can be accessed anytime using multiple data access frameworks for batch and realtime processing.]

Cloudera Hadoop

• What is Cloudera Hadoop and how does it differ from plain "Hadoop"?

• What is the difference between Cloudera Express and Enterprise?

Hadoop is an open-source platform; Cloudera provides a pre-packaged, tested and enhanced open-source distribution of the Hadoop platform. The relationship between Cloudera Hadoop and "vanilla Hadoop" can be thought of as similar to the relationship between Red Hat Linux and "vanilla Linux".

Cloudera is one of the leading innovators in the Hadoop space and one of the largest contributors to the open-source Apache Hadoop ecosystem.

Cloudera packages the Hadoop source code into a special distribution which includes enhanced Cloudera-developed Hadoop capabilities (such as Impala for interactive SQL-based analytics), graphical web-based cluster management and development user interfaces (Cloudera Manager / HUE), as well as important Hadoop bug fixes and 24x7 support.

Cloudera Hadoop comes in both Express and Enterprise editions.

• Cloudera Express is the free-to-use version of Cloudera Hadoop; it supports unlimited cluster size and runs all the Apache Hadoop features without any limitations. Cloudera Express includes the Cloudera Manager web UI.

• Cloudera Enterprise includes support directly from Cloudera and some cluster management enhancements such as rolling upgrades, SNMP alerts, etc.


In addition to the core Hadoop components - HDFS and YARN, which we will discuss later - Cloudera Hadoop (both Express and Enterprise) also includes multiple supplementary open-source Hadoop ecosystem components which come bundled as part of the Cloudera Hadoop installation. The Hadoop ecosystem components complement each other and allow Hadoop to reach its full potential.

These include components such as HBase (online near-realtime key/value access for "operational" database use cases), Impala (interactive SQL-based analytics on top of Hadoop data), Spark (in-memory analytics and stream data processing) and more.


The Hadoop Architecture

• What does a Hadoop cluster look like?

At a high level, Hadoop uses a Master/Slave architecture where the master nodes are responsible for providing cluster-wide services (such as resource scheduling and coordination, or storing metadata for the data which resides in Hadoop) and the slave nodes are responsible for the actual data storage and processing. Both master and slave nodes are highly available: more than one master node can be brought online for failover purposes, and multiple slave nodes will always be online due to the distributed nature of Hadoop.

The core of Hadoop is made of two components which provide scalable, highly available data storage and fast, flexible data retrieval.

• HDFS – Hadoop's distributed filesystem. The core Hadoop component that is responsible for storing data in a highly available way.

• YARN – Hadoop’s job scheduling and data access resource management framework allowing fast, parallel processing of data stored in HDFS.

Both HDFS and YARN are deployed on Hadoop in a Master/Slave architecture: the HDFS master node is responsible for handling filesystem metadata while the slave nodes store the actual business data. The YARN master node is responsible for cluster-wide resource scheduling and job execution while the slave nodes are responsible for actually executing user queries and jobs.
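You can see this split reflected on a live cluster. As a minimal sketch (assuming a configured Hadoop client with administrative rights), the HDFS admin report prints the master's (NameNode's) view of every slave (DataNode) in the cluster:

# Print cluster-wide capacity plus a status block for each slave (DataNode)
hdfs dfsadmin -report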


These two core components work together seamlessly to provide:

• High Availability of your data - Hadoop provides an internal distributed storage architecture that allows for protection against multiple kinds of data loss from single block corruption to complete server or rack failures. Automatic re-balancing of the Hadoop cluster is done in the background to ensure constant availability for your data and sustained workloads.

• Scalability - Hadoop clusters can scale virtually without limits. Adding new Hadoop nodes to an existing cluster can be done online, without any downtime or interruption of existing workloads. Because each Hadoop "worker node" in the cluster is a server equipped with its own processor cores and hard drives, adding new nodes to your Hadoop cluster adds both storage capacity and computation capacity. When scaling Hadoop, you are not just expanding your data storage capability but also increasing your data processing power. This method of scaling can be considered a paradigm shift compared to the traditional database model, where scaling the storage does not also increase data retrieval performance - so you end up with the capacity to store more data but without the capacity to quickly query it.

• Data Model Flexibility - Hadoop can handle any and all types of data. The underlying Hadoop HDFS filesystem allows for storing any type of structured or unstructured data. During data load, Hadoop is agnostic to the data model and can store JSON documents, CSVs, tab-delimited files, unstructured text files, XML files, binary files - you name it! No need for expensive ETL or data pre-processing during data load.

With Hadoop you can load your data first and decide later how to query it and what the data model is. This is also known as "schema on read". This approach decouples the application data model (schema, data types, access patterns) from data storage and is considered an essential requirement for a scalable and flexible next-generation database: "store everything and decide how to query it later". In addition to flexible data models, Hadoop also provides flexible data access with a "pluggable" architecture allowing for multiple query APIs and data processing frameworks on top of the same dataset.

Hadoop HDFS

• How is data stored in Hadoop? Tables? Files?

The first component of the Core Hadoop architecture is a fault tolerant and self-healing distributed file system designed to turn a cluster of industry standard servers into a massively scalable pool of storage.

Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100PB and beyond.

• Scale-out architecture - add servers to increase storage capacity.
• High availability - serve mission-critical workflows and applications.
• Fault tolerance - automatically and seamlessly recover from failures without affecting data availability.
• Load balancing - place data intelligently across cluster nodes for maximum efficiency and utilization.
• Tunable replication - multiple copies of each piece of data provide failure protection and computational performance.
• Security - optional LDAP integration.

As the name suggests, HDFS is the Hadoop Distributed File System. As such, HDFS behaves in a similar way to traditional Linux/Unix filesystems. At its lowest level, Hadoop stores data as files which are made of individual blocks.

During data ingestion into Hadoop, HDFS stripes your loaded files across all nodes in the cluster, with replication for fault tolerance. A file loaded onto Hadoop will be split into multiple individual blocks which are spread across the entire cluster. Each block will be stored more than once, on more than one server. The replication factor (number of block copies) and block size are configurable on a per-file basis.
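As a minimal sketch of that per-file control (the paths are hypothetical, and property names can vary between Hadoop versions), you can set the replication factor at load time or change it for an existing file:

# Load a file with a non-default replication factor of 2
hadoop fs -D dfs.replication=2 -put local_file.txt /user/hadoop/hdfs_file.txt

# Raise the replication factor of the existing file to 3 and wait for re-replication
hadoop fs -setrep -w 3 /user/hadoop/hdfs_file.txt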


Working with HDFS, at its lowest level, is simple. You can either access the HDFS file browser using the HUE web interface:

[Screenshot: using the Hadoop HUE WebUI to browse HDFS]

Or use the Hadoop command-line tool ("hadoop") that is part of the Hadoop client. Specifying the "fs" argument allows for interacting with the filesystem of a remote Hadoop cluster.


Some more very basic examples:

# Create a directory in HDFS
hadoop fs -mkdir /user/hadoop/dir1

# List the contents of an HDFS directory
hadoop fs -ls /user/hadoop/dir1

# Recursively remove an HDFS directory
hadoop fs -rm -r /user/hadoop/dir1

# Copy a file from the local filesystem into HDFS
hadoop fs -put /path_to_local_dir/local_file.txt /user/hadoop/hdfs_dir/hdfs_file.txt

Note that the paths shown in the examples above are HDFS paths, not paths on the local filesystem of the machine where the "hadoop fs" command line is executed.
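The exceptions are commands that bridge the two filesystems: "-put" reads a local path and writes to an HDFS path, and "-get" does the reverse. A minimal sketch (the local path is hypothetical):

# First argument is an HDFS path, second is a path on the local filesystem
hadoop fs -get /user/hadoop/hdfs_dir/hdfs_file.txt /tmp/local_copy.txt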

It's important to note that while writing/reading files on HDFS is the lowest-level access to Hadoop data, end users (developers, data analysts) working with Hadoop rely on several other Hadoop data access frameworks which allow queries and data processing on top of HDFS-stored data without having to interact directly with the Hadoop filesystem. Frameworks such as Cloudera Impala or HIVE allow end users to write SQL queries on top of data stored in HDFS.

[Screenshot: a SQL query used to directly access and visualize data from HDFS using the Hadoop HUE web UI]

Bottom line - HDFS is the Hadoop filesystem, the low-level data storage layer. Users can interact with HDFS using both the HUE WebUI and the Hadoop command line. Using these tools you can treat HDFS as if it were a regular (but distributed) filesystem - create directories, write files, read files, delete files, etc.


All data in Hadoop, at its lowest level, is just files on HDFS. "Structure" (semantics, tables, records, fields) is created when accessing data, not when writing it.

Hadoop YARN

• Once data is stored in Hadoop, how can we coordinate access to it?

The second component of the core Hadoop architecture is the data processing, resource management and scheduling framework called YARN. Different workloads (realtime and batch) can co-exist on your Hadoop cluster. YARN facilitates scheduling, resource management, and application/query-level execution failure protection for all types of Hadoop workloads.

If Hadoop HDFS takes care of data storage, YARN takes care of managing data retrieval.

With YARN, data processing workloads are executed at the same location where the data is stored, rather than relying on moving data from a dedicated storage tier to a database tier.

Data storage and computation coexist on the same physical nodes in the cluster. Workloads running in Hadoop under YARN can process exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity.

• Scale-out architecture - adding servers to your Hadoop cluster increases both processing power and storage capacity.
• Security and authentication - YARN works with HDFS security to make sure that only approved users can operate against the data in Hadoop.
• Resource management and job scheduling - YARN employs data locality and manages cluster resources intelligently to determine optimal locations (nodes) across the cluster for data processing, while allowing both long-running (batch) and short-running (realtime) applications to co-exist and access the same datasets.
• Flexibility - YARN allows various data processing APIs and query frameworks to work on the same data at the same time. Some of the Hadoop data processing frameworks running under YARN are optimized for batch analytics while others provide near-realtime in-memory event processing, thus providing a "best of breed" approach for accessing your data based on your use cases.
• Resiliency and high availability - YARN runs as a distributed architecture across your Hadoop cluster, ensuring that if a submitted job or query fails, it can independently and automatically restart and resume processing. No user intervention is required.

When data stored in Hadoop is accessed, data processing is distributed across all nodes in the cluster. Distributed data sets are pieced together automatically, providing parallel reads and processing to construct the final output.

Bottom line: YARN, by itself, isn't a "query engine" or a data processing framework in Hadoop. It's a cluster resource manager and coordinator that allows the various data processing and query engines (discussed later) to access data stored in Hadoop and "play nice" with one another - that is, share cluster resources (CPU cores, memory, etc.).
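You can watch this coordination from the command line. A minimal sketch (assuming a configured Hadoop client) using the "yarn" CLI to see what is currently sharing the cluster:

# List applications currently submitted to the YARN resource manager
yarn application -list

# List the slave (worker) nodes registered with YARN
yarn node -list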

Hadoop Query APIs and Data Processing Frameworks

• What do end users (data analysts, developers, etc.) actually use to query and process data in Hadoop?

Unlike most traditional relational databases which only offer SQL-based access to data, Hadoop provides a variety of APIs each optimized for specific use cases.

While SQL has its benefits in simplicity and very short development cycles, it is limited when taxed with more complex computation or analytical workloads.

Continuing Hadoop's "one size does not fit all" approach, multiple different "pluggable" APIs and processing languages are available, each specifically designed to address an individual use case with custom-tailored performance and flexibility.


Identify your data processing use case and then select the best optimized framework for the job.

Some of these modern frameworks for retrieving and processing data stored in Hadoop are:

Cloudera Impala (Interactive SQL) – high-performance interactive access to data via SQL. Impala provides latency measured in seconds for SQL-based data retrieval in Hadoop. Impala is a fully integrated, state-of-the-art analytic Hadoop database engine specifically designed to leverage the flexibility and scalability strengths of Hadoop, combining the familiar SQL language and multi-user performance of a traditional analytic database with the performance and scalability of Hadoop. Impala workloads are not converted to Map/Reduce when executed; they access Hadoop data directly. An example Impala CREATE TABLE statement and query:

CREATE EXTERNAL TABLE tab2 (
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/some_hdfs_folder/';  -- note that this location is a folder on HDFS

-- The Impala query runs on top of the previously created table, accessing HDFS data
SELECT tab2.*
FROM tab2,
     (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
      FROM tab2, tab1
      WHERE tab1.id = tab2.id
      GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;

HIVE (Batch SQL) – a batch-optimized SQL interface. HIVE allows non-developers to access data directly from Hadoop using the SQL language, while providing batch processing optimizations.

HIVE automatically converts SQL code to Map/Reduce programs and pushes them onto the Hadoop cluster. Because HIVE leverages Map/Reduce, it is suited for batch processing and provides the same performance, reliability and scalability which are the core strengths of Map/Reduce on Hadoop.
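As a hedged sketch (assuming the tab2 table from the Impala example above is registered in the shared Hive metastore, which Impala and HIVE both use), the same SQL-style access works from the HIVE command line; behind the scenes HIVE compiles the statement into one or more Map/Reduce jobs:

# Run a batch aggregation through HIVE; the console reports the launched Map/Reduce jobs
hive -e "SELECT col_1, COUNT(*) FROM tab2 GROUP BY col_1;"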


Spark (In-memory "fast" processing) – a next-generation, memory-optimized data processing engine. Spark is an extremely fast, memory-optimized, general-purpose processing engine and is considered the next-generation data processing framework for Hadoop. Spark offers a functional data processing model and supports development in Python, Java and Scala. Spark is designed for batch processing workloads as well as streaming workloads (using Spark Streaming), interactive queries, and machine learning.

An example Spark word count application in Python:

# Open a CSV file on Hadoop HDFS ('sc' is the SparkContext, predefined in the pyspark shell)
text_file = sc.textFile("hdfs://raw_data/my_raw_datafile.csv")

# Count the words in the file
counts = (text_file.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
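A minimal sketch of trying it out (assuming Spark is installed on the cluster node; "counts" is the variable from the snippet above):

# Start an interactive PySpark shell - it predefines the SparkContext as 'sc'
pyspark

Inside the shell, paste the snippet above; the result can then be materialized, for example with counts.saveAsTextFile("hdfs://raw_data/word_counts") to write the word counts back to HDFS.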

Map/Reduce (BATCH) – a distributed batch processing framework. MapReduce is the original core of processing in Hadoop: a programming paradigm which allows data processing to scale massively across hundreds or thousands of servers in a Hadoop cluster.

With MapReduce and Hadoop, compute workloads are executed in the same location as the data; data storage and computation coexist on the same physical nodes in the Hadoop cluster. MapReduce processes exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth, by taking advantage of this data proximity.
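A hedged way to see Map/Reduce in action without writing any code is to run the bundled word-count example (the jar path shown is typical for a Cloudera installation and may differ on yours; the HDFS paths are hypothetical, and the output directory must not already exist):

# Run the bundled MapReduce word-count example over files already in HDFS
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
    /data/raw/app_logs/2015-06-01 /data/out/wordcount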

Apache Mahout (Machine Learning) – a scalable machine learning framework. Mahout provides core algorithms for clustering, classification and collaborative filtering, implemented on top of the scalable, distributed Hadoop platform, such as:

- Recommendation mining: takes user behavior and from that tries to find items users might like.

- Clustering: takes data (e.g., documents) and organizes it into groups of topically related items.


- Classification: learns from existing categorized data what data in a specific category looks like, and is able to assign unlabeled documents to the correct category.

Note that the frameworks detailed above are just some of the most popular Hadoop frameworks which run under YARN. Many more exist and are being developed.
