hadoop ecosystem overview cmsc 491 hadoop-based distributed computing spring 2015 adam shook

Hadoop Ecosystem Overview

CMSC 491 Hadoop-Based Distributed Computing

Spring 2015, Adam Shook

Agenda

• Introduce Hadoop projects to prepare you for your group work
  – Intimate detail will be provided in future lectures

• Discuss potential use cases for each project

Topics

• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban

HDFS

• Hadoop Distributed File System
  – High-performance file system for storing data

• We’ve talked about this enough

Hadoop MapReduce

• High-performance, fault-tolerant data processing system

• We’ve also talked about this enough

YARN

• Abstract framework for distributed application development

• Split functionality of JobTracker into two components
  – ResourceManager
  – ApplicationMaster

• TaskTracker becomes NodeManager
  – Containers instead of map and reduce slots

• Configurable amount of memory per NodeManager
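
To make the container idea concrete, here is a minimal arithmetic sketch of how many equally sized containers fit on one NodeManager. The memory figures and the function name `containers_per_node` are assumptions for illustration only; real values come from the cluster's configuration, and real schedulers also account for CPU and other limits.

```python
# Hypothetical numbers for illustration; not taken from any real cluster.
node_memory_mb = 8192        # memory one NodeManager offers to YARN (assumed)
container_memory_mb = 1024   # memory requested per container (assumed)


def containers_per_node(node_mb: int, container_mb: int) -> int:
    """Return how many equally sized containers fit in a node's memory."""
    if container_mb <= 0:
        raise ValueError("container size must be positive")
    # Integer division: leftover memory too small for a container is unused.
    return node_mb // container_mb


print(containers_per_node(node_memory_mb, container_memory_mb))
```

Unlike fixed map and reduce slots, the same memory pool can be carved into containers of whatever size each application requests.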

MapReduce 2.x on YARN

• MapReduce API has not changed
  – Binary-level backwards compatible (no recompile)

• Application Master launches and monitors job via YARN

• MapReduce History Server to store… history

• Enabled Yahoo! to scale beyond 4,000 nodes

Hadoop Ecosystem

• Core Technologies
  – Hadoop Distributed File System
  – Hadoop MapReduce

• Many other tools…
  – Which we will be discussing… now

Apache Sqoop

• Apache project designed for efficient transfer between Apache Hadoop and structured data stores

• Used through a command-line interface (CLI); extensible

• Use cases?

Apache Flume

• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data

• Agents are configured with simple files; extensible

• Use cases?

Apache NiFi

• A service to reliably move and manipulate files between clusters using a web front-end

• Uses a GUI to drop processors and connect them to build workflows

• Use cases?

Apache Pig

• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Infrastructure compiles language to a sequence of MapReduce programs

• Use cases?

Apache Hive

• Data warehouse facilitating querying and managing large datasets

• Compiles SQL-like queries into MapReduce programs

• Use cases?
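
As a rough illustration of what "compiling a query into MapReduce" means, the sketch below hand-translates a query of the shape `SELECT page, COUNT(*) FROM visits GROUP BY page` into map, shuffle, and reduce steps in plain Python. The table name, column names, and sample rows are made up, and this is not Hive's actual planner, only the general pattern.

```python
from collections import defaultdict

# Hypothetical "visits" table held in memory for the example.
rows = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/home"},
    {"user": "a", "page": "/cart"},
]


def map_phase(records):
    """Emit (group-by key, 1) for every input row."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Group all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to get COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```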

Hadoop Streaming

• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer

• Just a jar file, not a real project

• Use cases?
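
A minimal sketch of a word count written in the Streaming style, where the mapper and reducer exchange tab-separated `key<TAB>value` text lines. The function names and sample input are assumptions; the two phases are written as functions over lines so the example can be simulated locally with a sort in between.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on made-up input.
sample = ["to be or not", "to be"]
for output_line in reducer(sorted(mapper(sample))):
    print(output_line)
```

In an actual job, each function would live in its own script that reads standard input and writes standard output, and the scripts would be passed to the streaming jar as the mapper and reducer.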

Which high-level API is for you?

• What are you comfortable with?
• What are you being told to use?

Apache HBase

• Distributed, scalable, big data store
• Data stored as sorted key/value pairs, with the key consisting of a row and column

• Use cases?
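
To illustrate the "sorted key/value pairs keyed by row and column" model, here is a small in-memory sketch. The class `SortedKV` and its methods are hypothetical and are not the HBase client API; the point is only that keeping keys sorted makes row-range scans cheap.

```python
import bisect


class SortedKV:
    """Toy store whose keys are (row, column) tuples kept in sorted order."""

    def __init__(self):
        self._keys = []     # sorted list of (row, column)
        self._values = {}   # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        # (start_row,) sorts before every (start_row, column) tuple.
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


store = SortedKV()
store.put("row2", "info:name", "b")
store.put("row1", "info:name", "a")
store.put("row1", "info:age", "30")
store.put("row3", "info:name", "c")
print(list(store.scan("row1", "row3")))
```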

Apache Accumulo

• Robust, scalable, high-performance data storage and retrieval key/value store

• Cell-based access controls
  – i.e. cell-level security

• Use cases?
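
A simplified sketch of cell-level security: every cell carries the labels a reader must hold, and a scan returns only the cells the reader is authorized to see. The data, the label names, and the function `visible_cells` are assumptions; real Accumulo visibility expressions also allow boolean AND/OR combinations, while this sketch handles only "all labels required".

```python
# Each cell: (row, column, required labels, value). Sample data is made up.
cells = [
    ("patient1", "name", frozenset(), "Alice"),
    ("patient1", "diagnosis", frozenset({"medical"}), "flu"),
    ("patient1", "billing", frozenset({"finance", "admin"}), "$120"),
]


def visible_cells(all_cells, authorizations):
    """Return (row, column, value) for cells whose labels the reader holds."""
    auths = set(authorizations)
    return [
        (row, column, value)
        for row, column, required, value in all_cells
        if required <= auths  # every required label must be present
    ]


print(visible_cells(cells, {"medical"}))
```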

Apache Avro

• Data serialization system for the Hadoop ecosystem

• Use cases?
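
To give a feel for schema-based serialization, the sketch below hand-encodes a record in the style of Avro's binary encoding: longs are written as zig-zag variable-length integers, strings as a length followed by UTF-8 bytes, and record fields are written in schema order with no field names in the payload. The record layout (`name` string, `age` long) and the function names are assumptions; in practice you would use an Avro library rather than encode by hand.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its byte length followed by UTF-8 data."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_user(name: str, age: int) -> bytes:
    """Encode a hypothetical record with fields (name: string, age: long)."""
    # Fields are concatenated in schema order; the schema supplies the names.
    return encode_string(name) + zigzag_varint(age)


print(encode_user("foo", 1).hex())
```

Because the schema travels separately from the data, the payload stays compact: the example record is five bytes.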

Apache Parquet

• Columnar storage format for Hadoop

• Use cases?
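
A small sketch of what "columnar" means: the same records are regrouped so that all values of one field sit together, which lets a query touch only the columns it needs. The sample rows and the helper `to_columns` are made up, and the sketch assumes every row has the same fields; it says nothing about Parquet's actual file layout or encodings.

```python
# Row-oriented data: one dict per record (made-up sample).
rows = [
    {"id": 1, "city": "Baltimore", "temp": 71},
    {"id": 2, "city": "Boston", "temp": 65},
    {"id": 3, "city": "Austin", "temp": 80},
]


def to_columns(records):
    """Regroup row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads only that column's values.
average_temp = sum(columns["temp"]) / len(columns["temp"])
print(average_temp)
```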

Apache Mahout

• Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce

• Use cases?

Apache Oozie

• Workflow scheduler system to manage Apache Hadoop jobs

• Use cases?
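
A workflow scheduler has to run jobs in dependency order. The sketch below models a workflow as a directed acyclic graph and derives a valid run order with Python's standard `graphlib` module (Python 3.9+). The job names are hypothetical, and this is not Oozie's workflow definition format, only the ordering idea behind it.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "export": {"aggregate"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Jobs with no dependency between them, such as `report` and `export` here, may appear in either order and could run in parallel.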

Apache Storm

• Distributed real-time computation system
• Didn’t have a logo until June 2014

• How is this different from MapReduce?
• Use cases?
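
One way to frame the comparison with MapReduce, as a plain-Python sketch: a batch job sees the whole input and produces one final answer, while a stream processor updates its result as each event arrives. The function names and events are made up, and neither function uses Storm or Hadoop APIs.

```python
from collections import Counter


def batch_count(events):
    """Batch style: consume the full data set, then return the final counts."""
    return Counter(events)


def streaming_count(events):
    """Streaming style: emit an updated count after every single event."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


clicks = ["a", "b", "a"]
print(batch_count(clicks))
print(list(streaming_count(clicks)))
```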

Apache ZooKeeper

• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination

• Use cases?

Apache Spark

• Fast and general engine for large-scale data processing

• Write applications in Java, Scala, or Python

• Use cases?

SQL on Hadoop

• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.

• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store

• Use cases? Non use cases?

Sample Architecture

[Figure: sample architecture diagram. Labels: Website, Webserver, Call Center, SQL, Flume Agent, MapReduce, Pig, HBase, Storm]

OTHER HADOOP PROJECTS

We [maybe] won’t be covering these in detail later on

Redis, Memcached, etc.

• Open-source in-memory key/value stores

• Use cases?

Apache Cassandra

• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data

• Support for clusters spanning multiple datacenters
• Unlike HBase and Accumulo, data is not stored on HDFS

• Use cases? Non use cases?

Apache Crunch

• Java framework for writing, testing, and running MapReduce pipelines with a simple API

• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job

• Use cases? *

*Not the real logo, but truly fantastic

Apache Kafka

• High-throughput distributed publish-subscribe message service

• Use cases?
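
A toy, single-process sketch of the publish-subscribe idea: producers append messages to a topic's log, and each consumer reads from its own position, so consumers progress independently. The `Topic` class and its methods are hypothetical and are not the Kafka client API; partitioning, persistence, and distribution are left out.

```python
class Topic:
    """Toy topic: an append-only log plus one read offset per consumer."""

    def __init__(self):
        self._log = []       # messages in publish order
        self._offsets = {}   # consumer name -> next index to read

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer):
        """Return messages this consumer has not yet seen, then advance it."""
        start = self._offsets.get(consumer, 0)
        messages = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return messages


topic = Topic()
topic.publish("click:/home")
topic.publish("click:/cart")
print(topic.poll("analytics"))   # first poll sees both messages
topic.publish("click:/pay")
print(topic.poll("analytics"))   # later poll sees only the new message
print(topic.poll("audit"))       # a new consumer starts from the beginning
```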

Azkaban

• Batch workflow job scheduler to run Hadoop jobs

• Use cases?

Review

• A lot of projects are available to you for your group project

• Think of a problem you are interested in, then choose the appropriate projects to solve it

• Keep in mind data ingest, storage, processing, and egress

• Feel free to explore and use other projects than the ones I have listed here
  – Get permission if you plan on using it as part of your project quota

References

• All those logos are the property of their owners

• *.apache.org
• redis.io
