Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Post on 23-Dec-2015

Hadoop Ecosystem Overview

CMSC 491
Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

Agenda

• Introduce Hadoop projects to prepare you for your group work
  – Intimate detail will be provided in future lectures

• Discuss potential use cases for each project

Topics

• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban

HDFS

• Hadoop Distributed File System
  – High-performance file system for storing data

• We’ve talked about this enough

Hadoop MapReduce

• High-performance, fault-tolerant data processing system

• We’ve also talked about this enough

YARN

• Abstract framework for distributed application development

• Split functionality of JobTracker into two components
  – ResourceManager
  – ApplicationMaster

• TaskTracker becomes NodeManager
  – Containers instead of map and reduce slots

• Configurable amount of memory per NodeManager
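
The container idea above can be sketched roughly as follows: instead of a fixed number of map and reduce slots, a node advertises a memory budget and containers of different sizes are granted until that budget runs out. The class and method names here are hypothetical and are not the YARN API; real YARN scheduling also takes CPU, queues, and data locality into account.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManagerSketch:
    """Toy model of one node's memory budget (hypothetical, not the YARN API)."""
    total_mb: int
    allocated_mb: list = field(default_factory=list)

    @property
    def free_mb(self) -> int:
        # Memory not yet handed out to any container.
        return self.total_mb - sum(self.allocated_mb)

    def allocate(self, request_mb: int) -> bool:
        """Grant a container only if the request fits in the remaining memory."""
        if request_mb <= 0 or request_mb > self.free_mb:
            return False
        self.allocated_mb.append(request_mb)
        return True


node = NodeManagerSketch(total_mb=8192)
# Containers can have different sizes, unlike fixed map/reduce slots.
granted = [node.allocate(mb) for mb in (2048, 4096, 1024, 2048)]
print(granted, node.free_mb)
```

The last request is refused because only 1024 MB remain, which is the kind of decision a slot-based TaskTracker could not express.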

MapReduce 2.x on YARN

• MapReduce API has not changed
  – Binary-level backwards compatible (no recompile)

• Application Master launches and monitors job via YARN

• MapReduce History Server to store… history

• Enabled Yahoo! to scale beyond 4,000 nodes

Hadoop Ecosystem

• Core Technologies
  – Hadoop Distributed File System
  – Hadoop MapReduce

• Many other tools…
  – Which we will be discussing… now

Apache Sqoop

• Apache project designed for efficient transfer between Apache Hadoop and structured data stores

• Used through a CLI, and extensible

• Use cases?

Apache Flume

• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data

• Agents are configured using simple files and are extensible

• Use cases?

Apache NiFi

• A service to reliably move and manipulate files between clusters using a web front-end

• Uses a GUI to drop processors and connect them to build workflows

• Use cases?

Apache Pig

• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Infrastructure compiles language to a sequence of MapReduce programs

• Use cases?

Apache Hive

• Data warehouse facilitating querying and managing large datasets

• Compiles SQL-like queries into MapReduce programs

• Use cases?
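
To make "compiles SQL-like queries into MapReduce programs" more concrete, here is a minimal sketch of how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept breaks down into map, shuffle, and reduce phases. The table, the rows, and the function names are made up for illustration, and this is plain Python, not the plan Hive (or Pig) actually generates.

```python
from collections import defaultdict

# Hypothetical input table.
rows = [
    {"name": "ann", "dept": "sales"},
    {"name": "bob", "dept": "support"},
    {"name": "cat", "dept": "sales"},
]


def map_phase(row):
    # Emit the GROUP BY key with a count of 1 for each input row.
    yield row["dept"], 1


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # COUNT(*) becomes a sum over the emitted 1s.
    return key, sum(values)


pairs = [pair for row in rows for pair in map_phase(row)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```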

Hadoop Streaming

• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer

• Just a jar file, not a real project

• Use cases?
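
As a sketch of what "any executable or script" means in practice, the word count below follows the usual Streaming contract: the mapper writes tab-separated key/value lines, and the reducer receives those lines sorted by key. In a real job the two functions would live in separate scripts that read standard input and are passed to the streaming jar with its -mapper and -reducer options; here they take plain lists so the example runs on its own, and sorted() stands in for the framework's shuffle and sort.

```python
import itertools


def mapper(lines):
    # Emit "word<TAB>1" for every word, one record per line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    # Input is sorted by key, so equal words are adjacent and can be grouped.
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


text = ["the quick fox", "the lazy dog"]
mapped = sorted(mapper(text))  # stand-in for the shuffle/sort step
counts = list(reducer(mapped))
print("\n".join(counts))
```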

Which high-level API is for you?

• What are you comfortable with?
• What are you being told to use?

Apache HBase

• Distributed, scalable, big data store
• Data stored as sorted key/value pairs, with the key consisting of a row and column

• Use cases?
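
The "sorted key/value pairs" model can be sketched with a small in-memory store whose keys are (row, column) tuples kept in sorted order, which is what makes range scans over rows cheap. The class below is hypothetical and is not the HBase client API; real HBase keys also carry a column family and a timestamp, and the data lives in files on HDFS.

```python
import bisect


class SortedCellStore:
    """Toy sorted key/value store keyed by (row, column). Not the HBase API."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        # A 1-tuple (row,) sorts before every (row, column) with the same row.
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedCellStore()
store.put("user2", "info:name", "bob")
store.put("user1", "info:name", "ann")
store.put("user1", "info:email", "ann@example.com")
store.put("user3", "info:name", "cat")

cells = list(store.scan("user1", "user3"))
for (row, column), value in cells:
    print(row, column, value)
```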

Apache Accumulo

• Robust, scalable, high-performance data storage and retrieval key/value store

• Cell-based access controls
  – i.e. cell-level security

• Use cases?
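
One way to picture cell-level security: every cell carries its own visibility requirement, and a scan returns only the cells that the reader's authorizations satisfy. The sketch below simplifies the requirement to "the reader must hold all of these labels"; Accumulo's real visibility labels are boolean expressions that can combine labels with & and |. The data and function name are hypothetical.

```python
# Each cell: (row, column, required labels, value). All data is made up.
cells = [
    ("patient1", "name", frozenset(), "Ann"),
    ("patient1", "diagnosis", frozenset({"doctor"}), "flu"),
    ("patient1", "billing", frozenset({"finance", "admin"}), "$120"),
]


def visible_cells(cells, authorizations):
    """Return the cells whose required labels are all held by the reader."""
    auths = set(authorizations)
    return [
        (row, column, value)
        for row, column, required, value in cells
        if required <= auths  # subset test: every required label is present
    ]


doctor_view = visible_cells(cells, {"doctor"})
public_view = visible_cells(cells, set())
print(doctor_view)
print(public_view)
```

Two readers scanning the same row see different cells, which is the difference from table-level or row-level permissions.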

Apache Avro

• Data serialization system for the Hadoop ecosystem

• Use cases?

Apache Parquet

• Columnar storage format for Hadoop

• Use cases?
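
A rough illustration of what "columnar" buys you: if values are laid out column by column instead of row by row, a query that needs one column reads only that column's values. This sketch only pivots a list of records in memory and assumes every record has the same fields; the actual Parquet format adds row groups, encodings, compression, and schema metadata on top of this idea.

```python
# Hypothetical row-oriented records.
rows = [
    {"id": 1, "city": "Baltimore", "sales": 10.0},
    {"id": 2, "city": "Columbia", "sales": 20.5},
    {"id": 3, "city": "Baltimore", "sales": 5.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)
# An aggregate over one column touches only that column's values.
total_sales = sum(columns["sales"])
print(columns["city"], total_sales)
```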

Apache Mahout

• Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce

• Use cases?

Apache Oozie

• Workflow scheduler system to manage Apache Hadoop jobs

• Use cases?
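
At its core, a workflow scheduler has to run jobs in an order that respects their dependencies. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to order a small, hypothetical set of jobs; the job names are invented, and Oozie itself describes workflows in XML and also handles triggering, retries, and monitoring.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "sessionize": {"clean_logs"},
    "daily_counts": {"clean_logs"},
    "export_report": {"sessionize", "daily_counts"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
```

The two middle jobs do not depend on each other, so a scheduler is free to run them in either order or in parallel.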

Apache Storm

• Distributed real-time computation system
• Didn’t have a logo until June 2014

• How is this different than MapReduce?
• Use cases?

Apache ZooKeeper

• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination

• Use cases?

Apache Spark

• Fast and general engine for large-scale data processing

• Write applications in Java, Scala, or Python

• Use cases?

SQL on Hadoop

• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.

• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store

• Use cases? Non use cases?

Sample Architecture

[Figure: sample architecture diagram. Labeled components: Webserver, Sales, Call Center, SQL (two instances), three Flume Agents, HDFS, MapReduce, Pig, HBase, Storm, Oozie, Website.]

OTHER HADOOP PROJECTS

We [maybe] won’t be covering these in detail later on

Redis, Memcached, etc.

• Open-source in-memory key/value stores

• Use cases?

Apache Cassandra

• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data

• Support for clusters spanning multiple datacenters
• Unlike HBase and Accumulo, data is not stored on HDFS

• Use cases? Non use cases?

Apache Crunch

• Java framework for writing, testing, and running MapReduce pipelines with a simple API

• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job

• Use cases? *

*Not the real logo, but truly fantastic

Apache Kafka

• High-throughput distributed publish-subscribe message service

• Use cases?
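
The publish-subscribe model can be sketched as an append-only log per topic, with each consumer remembering its own read position (offset). That is why several consumers can read the same messages independently. The classes below are hypothetical, in-memory stand-ins and not the Kafka client API; real Kafka adds partitions, replication, and persistence to disk.

```python
class TopicSketch:
    """Append-only message log (toy model, not the Kafka API)."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        # Append the message and return the offset it was stored at.
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset, max_messages=10):
        # Reading never removes messages; it only slices the log.
        return self._log[offset:offset + max_messages]


class ConsumerSketch:
    """Consumer that tracks its own offset into a topic."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        batch = self.topic.read(self.offset)
        self.offset += len(batch)  # advance past what was just read
        return batch


topic = TopicSketch()
early, late = ConsumerSketch(topic), ConsumerSketch(topic)

topic.publish("click:home")
topic.publish("click:cart")
first = early.poll()   # sees both messages published so far
topic.publish("click:checkout")
second = early.poll()  # sees only the new message
replay = late.poll()   # a consumer starting later still sees everything
print(first, second, replay)
```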

Azkaban

• Batch workflow job scheduler to run Hadoop jobs

• Use cases?

Review

• A lot of projects are available to you for your group project

• Think of a problem you are interested in, then choose the appropriate projects to solve it

• Keep in mind data ingest, storage, processing, and egress

• Feel free to explore and use other projects than the ones I have listed here
  – Get permission if you plan on using it as part of your project quota

References

• All those logos are the property of their owners

• *.apache.org
• redis.io