hadoop ecosystem overview cmsc 491 hadoop-based distributed computing spring 2015 adam shook
Post on 23-Dec-2015
Hadoop Ecosystem Overview
CMSC 491
Hadoop-Based Distributed Computing
Spring 2015
Adam Shook
Agenda
• Introduce Hadoop projects to prepare you for your group work
  – Intimate detail will be provided in future lectures
• Discuss potential use cases for each project
Topics
• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban
HDFS
• Hadoop Distributed File System
  – High-performance file system for storing data
• We’ve talked about this enough
Hadoop MapReduce
• High-performance, fault-tolerant data processing system
• We’ve also talked about this enough
YARN
• Abstract framework for distributed application development
• Split functionality of JobTracker into two components
  – ResourceManager
  – ApplicationMaster
• TaskTracker becomes NodeManager
  – Containers instead of map and reduce slots
• Configurable amount of memory per NodeManager
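To make the move from fixed slots to containers concrete, here is a minimal sketch of the arithmetic, assuming a NodeManager advertises a fixed amount of memory (in a real cluster this is set by the `yarn.nodemanager.resource.memory-mb` property) and every container requests the same amount. The function name and the equal-size assumption are illustrative only; a real scheduler also considers CPU, queues, and rounding of requests.

```python
def containers_per_node(node_memory_mb, container_memory_mb):
    """Estimate how many equal-sized containers fit on one NodeManager.

    Hypothetical helper: assumes memory is the only constrained resource
    and that every container asks for the same amount.
    """
    if container_memory_mb <= 0:
        raise ValueError("container size must be positive")
    return node_memory_mb // container_memory_mb


# The same node can host many small containers or a few large ones,
# which is the flexibility that fixed map/reduce slots lacked.
small = containers_per_node(8192, 1024)
large = containers_per_node(8192, 1536)
```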
MapReduce 2.x on YARN
• MapReduce API has not changed
  – Binary-level backwards compatible (no recompile)
• Application Master launches and monitors job via YARN
• MapReduce History Server to store… history
• Enabled Yahoo! to scale beyond 4,000 nodes
Hadoop Ecosystem
• Core Technologies
  – Hadoop Distributed File System
  – Hadoop MapReduce
• Many other tools…
  – Which we will be discussing… now
Apache Sqoop
• Apache project designed for efficient transfer between Apache Hadoop and structured data stores
• Used through a CLI, and extensible
• Use cases?
Apache Flume
• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data
• Configure agents using simple files; extensible
• Use cases?
Apache NiFi
• A service to reliably move and manipulate files between clusters using a web front-end
• Uses a GUI to drop processors and connect them to build workflows
• Use cases?
Apache Pig
• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Its infrastructure compiles the language into a sequence of MapReduce programs
• Use cases?
Apache Hive
• Data warehouse facilitating querying and managing large datasets
• Compiles SQL-like queries into MapReduce programs
• Use cases?
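As a rough picture of what compiling a SQL-like query into MapReduce means, the sketch below hand-translates one GROUP BY query into map, shuffle, and reduce steps in plain Python. The table name `visits`, the column `page`, and the function names are all hypothetical, and this is not Hive's actual query plan, only the general shape of one.

```python
from collections import defaultdict

# Query being illustrated (hypothetical table and column):
#   SELECT page, COUNT(*) FROM visits GROUP BY page;


def map_phase(rows):
    """Map: emit (group-by key, 1) for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: apply the aggregate (COUNT) to each key's values."""
    return {key: sum(values) for key, values in groups.items()}


visits = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(visits)))
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reducer.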
Hadoop Streaming
• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer
• Just a jar file, not a real project
• Use cases?
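A minimal sketch of the mapper/reducer contract that Streaming relies on: each side reads lines and writes lines, with key and value separated by a tab, and the reducer sees its input sorted by key. The function names below are hypothetical, and the local `sorted` call only stands in for the framework's shuffle-and-sort; in a real job each function would live in its own script reading standard input.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word.

    Assumes the input is already sorted by key, as it would be
    after the shuffle-and-sort phase.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Local stand-in for a job: map, sort by key, then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


result = run_local(["a b a", "b c"])
```

Because the contract is just lines on standard input and output, the same logic could be written in any language that can read and print text.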
Which high-level API is for you?
• What are you comfortable with?
• What are you being told to use?
Apache HBase
• Distributed, scalable, big data store
• Data stored as sorted key/value pairs, with the key consisting of a row and column
• Use cases?
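To show why the sorted (row, column) key matters, here is a small in-memory sketch. The class and method names are hypothetical and are not the HBase client API; the point is only that keeping keys sorted makes a range scan over adjacent rows a cheap slice.

```python
import bisect


class SortedCellStore:
    """Toy key/value store whose keys are (row, column) tuples kept sorted."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column)
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        return self._values.get((row, column))

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedCellStore()
store.put("user2", "info:name", "Bo")
store.put("user1", "info:name", "Al")
store.put("user1", "info:email", "al@example.com")
store.put("user3", "info:name", "Cy")
cells = list(store.scan("user1", "user3"))
```

Choosing row keys so that related data sorts together is what makes scans like this useful.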
Apache Accumulo
• Robust, scalable, high-performance key/value store for data storage and retrieval
• Cell-based access controls
  – i.e. cell-level security
• Use cases?
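A simplified sketch of the idea behind cell-level security: every cell carries its own visibility label, and a reader only sees cells whose label is satisfied by the reader's authorizations. Here a label is assumed to be just a set of tokens that must all be held, which is a deliberate simplification; the function name and the sample tokens are hypothetical, and this is not the Accumulo API.

```python
def visible_cells(cells, authorizations):
    """Return the (key, value) pairs a reader is allowed to see.

    Each cell is (key, required_tokens, value). A cell is visible only
    if the reader holds every token in required_tokens; an empty label
    means the cell is visible to everyone.
    """
    auths = set(authorizations)
    return [(key, value) for key, required, value in cells
            if set(required) <= auths]


cells = [
    ("patient1:name", {"staff"}, "Ann"),
    ("patient1:diagnosis", {"staff", "doctor"}, "flu"),
    ("clinic:hours", set(), "9-5"),
]
nurse_view = visible_cells(cells, {"staff"})
doctor_view = visible_cells(cells, {"staff", "doctor"})
```

Two readers scanning the same row get different results, without the application filtering anything itself.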
Apache Avro
• Data serialization system for the Hadoop ecosystem
• Use cases?
Apache Parquet
• Columnar storage format for Hadoop
• Use cases?
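To illustrate what columnar means, the sketch below pivots row-oriented records into one list per column, so a query that needs a single field touches only that list. The function name and sample fields are hypothetical, every row is assumed to have the same fields, and nothing here reflects Parquet's actual file layout or encoding.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same fields.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [
    {"id": 1, "price": 2.5, "note": "first"},
    {"id": 2, "price": 4.0, "note": "second"},
]
columns = to_columns(rows)

# An aggregate over one field reads only that column's values.
total_price = sum(columns["price"])
```

Values of the same type sitting together is also what lets columnar formats compress well.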
Apache Mahout
• Library for building scalable machine learning algorithms, implemented on top of Hadoop MapReduce
• Use cases?
Apache Oozie
• Workflow scheduler system to manage Apache Hadoop jobs
• Use cases?
Apache Storm
• Distributed real-time computation system
• Didn’t have a logo until June 2014
• How is this different from MapReduce?
• Use cases?
Apache ZooKeeper
• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination
• Use cases?
Apache Spark
• Fast and general engine for large-scale data processing
• Write applications in Java, Scala, or Python
• Use cases?
SQL on Hadoop
• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.
• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store
• Use cases? Non use cases?
Sample Architecture
[Architecture diagram. Labels: Website, Webserver, Sales, Call Center, SQL (twice), three Flume Agents, HDFS, MapReduce, Pig, HBase, Storm, Oozie]
OTHER HADOOP PROJECTS
We [maybe] won’t be covering these in detail later on
Redis, Memcached, etc.
• Open-source in-memory key/value stores
• Use cases?
Apache Cassandra
• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data
• Support for clusters spanning multiple datacenters
• Unlike HBase and Accumulo, data is not stored on HDFS
• Use cases? Non use cases?
Apache Crunch
• Java framework for writing, testing, and running MapReduce pipelines with a simple API
• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job
• Use cases? *
*Not the real logo, but truly fantastic
Apache Kafka
• High-throughput distributed publish-subscribe message service
• Use cases?
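A minimal in-memory sketch of the publish-subscribe pattern: producers append messages to a topic's log, and each subscriber reads from its own position, so several subscribers can consume the same messages independently. The class and method names are hypothetical and this is not the Kafka client API; there is no networking, partitioning, or persistence here.

```python
class Topic:
    """Toy append-only log with an independent read offset per subscriber."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # subscriber name -> next index to read

    def publish(self, message):
        """Append a message and return its position in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, subscriber):
        """Return all messages this subscriber has not yet seen."""
        start = self._offsets.get(subscriber, 0)
        self._offsets[subscriber] = len(self._log)
        return self._log[start:]


clicks = Topic()
clicks.publish("page=/home")
clicks.publish("page=/cart")
first_batch = clicks.poll("analytics")
clicks.publish("page=/checkout")
second_batch = clicks.poll("analytics")
audit_batch = clicks.poll("audit")
```

Because the log is not consumed destructively, a late subscriber such as `audit` still receives every message.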
Azkaban
• Batch workflow job scheduler to run Hadoop jobs
• Use cases?
Review
• Many projects are available to you for your group project
• Think of a problem you are interested in, then choose the appropriate projects to solve it
• Keep in mind data ingest, storage, processing, and egress
• Feel free to explore and use other projects than the ones I have listed here
  – Get permission if you plan on using it as part of your project quota
References
• All those logos are the property of their owners
• *.apache.org
• redis.io