hadoop ecosystem overview cmsc 491 hadoop-based distributed computing spring 2015 adam shook

Hadoop Ecosystem Overview

CMSC 491 Hadoop-Based Distributed Computing

Spring 2015

Adam Shook

Agenda

• Introduce Hadoop projects to prepare you for your group work
  – Intimate detail will be provided in future lectures

• Discuss potential use cases for each project

Topics

• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban

HDFS

• Hadoop Distributed File System
  – High-performance file system for storing data

• We’ve talked about this enough

Hadoop MapReduce

• High-performance, fault-tolerant data processing system

• We’ve also talked about this enough

YARN

• Abstract framework for distributed application development

• Split functionality of JobTracker into two components
  – ResourceManager
  – ApplicationMaster

• TaskTracker becomes NodeManager
  – Containers instead of map and reduce slots

• Configurable amount of memory per NodeManager
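To make the move from fixed slots to containers a little more concrete, here is a minimal sketch, assuming a hypothetical NodeManager with 8192 MB available (in a real cluster this limit comes from the `yarn.nodemanager.resource.memory-mb` setting). The function name, the request sizes, and the greedy policy are illustrative assumptions, not how YARN's schedulers actually decide.

```python
def allocate_containers(node_memory_mb, requests_mb):
    """Greedily grant container requests until the node's memory runs out.

    Hypothetical helper for illustration only; real YARN scheduling is
    done by the ResourceManager and is considerably more involved.
    """
    granted = []
    free = node_memory_mb
    for request in requests_mb:
        if request <= free:
            # Any mix of container sizes may share the node,
            # unlike fixed map and reduce slots.
            granted.append(request)
            free -= request
    return granted, free


# Four containers of different sizes compete for one 8 GB node.
granted, free = allocate_containers(8192, [1024, 2048, 4096, 2048])
```

The point of the sketch is only that memory is the unit being divided, so a node can host whatever combination of containers fits.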

MapReduce 2.x on YARN

• MapReduce API has not changed
  – Binary-level backwards compatible (no recompile)

• Application Master launches and monitors job via YARN

• MapReduce History Server to store… history

• Enabled Yahoo! to scale beyond 4,000 nodes

Hadoop Ecosystem

• Core Technologies
  – Hadoop Distributed File System
  – Hadoop MapReduce

• Many other tools…
  – Which we will be discussing… now

Apache Sqoop

• Apache project designed for efficient transfer between Apache Hadoop and structured data stores

• Used through a CLI, and extensible

• Use cases?

Apache Flume

• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data

• Agents are configured using simple files, and are extensible

• Use cases?

Apache NiFi

• A service to reliably move and manipulate files between clusters using a web front-end

• Uses a GUI to drop processors and connect them to build workflows

• Use cases?

Apache Pig

• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Infrastructure compiles language to a sequence of MapReduce programs

• Use cases?

Apache Hive

• Data warehouse facilitating querying and managing large datasets

• Compiles SQL-like queries into MapReduce programs

• Use cases?

Hadoop Streaming

• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer

• Just a jar file, not a real project

• Use cases?
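As a rough illustration of the "any executable or script" idea, the sketch below writes a word-count mapper and reducer in Python that speak the tab-separated `key<TAB>value` line convention Streaming uses by default. The function names and the `run_local` helper are assumptions made for this example; on a cluster the mapper and reducer would be two separate scripts reading standard input, handed to the streaming jar through its `-mapper` and `-reducer` options.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (stand-in for the framework)."""
    return list(reducer(sorted(mapper(lines))))


counts = run_local(["a b a", "b c"])
```

Because the contract is just lines on standard input and output, the same logic could be written in any language that can read and print text.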

Which high-level API is for you?

• What are you comfortable with?

• What are you being told to use?

Apache HBase

• Distributed, scalable, big data store

• Data stored as sorted key/value pairs, with the key consisting of a row and column

• Use cases?
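The "sorted key/value pairs" bullet is easier to picture with a toy model. The sketch below is a minimal in-memory illustration, assuming keys are plain `(row, column)` tuples; the class and method names are hypothetical and this is not the HBase client API. Real HBase keys carry more parts (for example a timestamp), which are left out here.

```python
import bisect


class SortedKV:
    """Toy store that keeps cells ordered by (row, column)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        i = bisect.bisect_left(self._keys, (start_row, ""))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


store = SortedKV()
store.put("user2", "info:name", "Bo")
store.put("user1", "info:name", "Al")
store.put("user1", "info:email", "al@example.com")
store.put("user3", "info:name", "Cy")
cells = list(store.scan("user1", "user3"))
```

Keeping the keys sorted is what makes a scan over a contiguous range of rows cheap, regardless of the order in which cells were written.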

Apache Accumulo

• Robust, scalable, high-performance data storage and retrieval key/value store

• Cell-based access controls
  – i.e. cell-level security

• Use cases?
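Cell-level security can be sketched as a label attached to every cell plus a filter applied at read time. The example below is a deliberately simplified model: it assumes a label is just a list of tokens joined by `&` that must all be held by the reader, whereas Accumulo's actual visibility expressions are richer. The function names, labels, and sample cells are all hypothetical.

```python
def visible(label, authorizations):
    """Return True if the reader holds every '&'-joined token in the label.

    An empty label means the cell is readable by anyone.
    """
    required = {token for token in label.split("&") if token}
    return required <= set(authorizations)


def scan(cells, authorizations):
    """Return only the cells whose label the reader's authorizations satisfy."""
    return [
        (row, column, value)
        for row, column, label, value in cells
        if visible(label, authorizations)
    ]


# Each cell: (row, column, visibility label, value). Sample data is made up.
cells = [
    ("patient1", "name", "", "Alice"),
    ("patient1", "diagnosis", "medical", "flu"),
    ("patient1", "billing", "medical&finance", "120.00"),
]

public_view = scan(cells, set())
nurse_view = scan(cells, {"medical"})
```

Two readers scanning the same row can therefore see different columns, without the application having to split the data into separate tables.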

Apache Avro

• Data serialization system for the Hadoop ecosystem

• Use cases?

Apache Parquet

• Columnar storage format for Hadoop

• Use cases?
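To show what "columnar" means in practice, here is a minimal sketch that pivots row-oriented records into one list per column. It assumes every record has the same fields, and it only illustrates the layout idea; it is not Parquet's actual file format, which adds row groups, encodings, and compression. The record contents and the `to_columns` name are invented for the example.

```python
def to_columns(rows):
    """Pivot a list of same-shaped dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [
    {"user": "a", "age": 30, "city": "NYC"},
    {"user": "b", "age": 25, "city": "LA"},
    {"user": "c", "age": 41, "city": "NYC"},
]

columns = to_columns(rows)

# An aggregate over one field only needs to touch that column's values.
average_age = sum(columns["age"]) / len(columns["age"])
```

A query that reads a few columns out of many can skip the rest entirely, and values of the same type stored together tend to compress well.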

Apache Mahout

• Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce

• Use cases?

Apache Oozie

• Workflow scheduler system to manage Apache Hadoop jobs

• Use cases?

Apache Storm

• Distributed real-time computation system

• Didn’t have a logo until June 2014

• How is this different than MapReduce?

• Use cases?

Apache ZooKeeper

• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination

• Use cases?

Apache Spark

• Fast and general engine for large-scale data processing

• Write applications in Java, Scala, or Python

• Use cases?

SQL on Hadoop

• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.

• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store

• Use cases? Non use cases?

Sample Architecture

[Figure: sample architecture diagram. Labels: Website, Webserver, Sales, Call Center, SQL, Flume Agent (x3), HDFS, MapReduce, Pig, HBase, Storm, Oozie]

OTHER HADOOP PROJECTS

We [maybe] won’t be covering these in detail later on

Redis, Memcached, etc.

• Open-source in-memory key/value stores

• Use cases?

Apache Cassandra

• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data

• Support for clusters spanning multiple datacenters

• Unlike HBase and Accumulo, data is not stored on HDFS

• Use cases? Non use cases?

Apache Crunch

• Java framework for writing, testing, and running MapReduce pipelines with a simple API

• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job

• Use cases? *

*Not the real logo, but truly fantastic

Apache Kafka

• High-throughput distributed publish-subscribe message service

• Use cases?
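Publish-subscribe over a retained log can be pictured with a very small model. The sketch below assumes a single in-memory topic where each subscriber has its own read position; the class and method names are hypothetical and this is not the Kafka client API. It leaves out everything that makes the real service distributed, such as partitions, replication, and persistence.

```python
class Topic:
    """Toy append-only log with an independent read offset per subscriber."""

    def __init__(self):
        self._log = []      # messages in publish order
        self._offsets = {}  # subscriber name -> next index to read

    def publish(self, message):
        self._log.append(message)

    def poll(self, subscriber):
        """Return messages this subscriber has not seen yet, then advance it."""
        start = self._offsets.get(subscriber, 0)
        self._offsets[subscriber] = len(self._log)
        return self._log[start:]


topic = Topic()
topic.publish("click:home")
topic.publish("click:cart")
first_batch = topic.poll("analytics")
topic.publish("click:checkout")
second_batch = topic.poll("analytics")
late_joiner = topic.poll("audit")
```

Because messages stay in the log after being read, a subscriber that joins late can still start from the beginning, and subscribers never interfere with one another.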

Azkaban

• Batch workflow job scheduler to run Hadoop jobs

• Use cases?

Review

• A lot of projects are available to you for your group project

• Think of a problem you are interested in, then choose the appropriate projects to solve it

• Keep in mind data ingest, storage, processing, and egress

• Feel free to explore and use projects other than the ones I have listed here
  – Get permission if you plan on using it as part of your project quota

References

• All those logos are the property of their owners

• *.apache.org

• redis.io
