big data hadoop stack -...

89
Big Data Hadoop Stack

Upload: others

Post on 25-Jun-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Big Data Hadoop Stack

Page 2: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Lecture #1

Hadoop Beginnings

Page 3: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

What is Hadoop?

Apache Hadoop is an open source software framework for storage and

large scale processing of data-sets on clusters of commodity hardware

Page 4: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Hadoop was created by Doug Cutting and Mike Cafarella in 2005

Named the project after son's toy elephant

Page 5: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Moving Computation to Data

Computation

Data

Page 6: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Scalability at Hadoop’s core!

Page 7: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Reliability! Reliability! Reliability!

Page 8: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Reliability! Reliability! Reliability!

Page 9: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Reliability! Reliability! Reliability!

Google File System

Page 10: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Once a year

365 Computers Once a day

Page 11: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Hourly

Page 12: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

New Approach to Data

Keep all data

Page 13: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

New Kinds of Analysis

Schema-on read style

Page 14: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

New Kinds of Analysis

Schema-on read style New analysis

Page 15: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Small Data & Complex Algorithm

Large Data & Simple Algorithm

Vs.

Page 16: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Lecture #2

Apache Framework Hadoop Modules

Page 17: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Framework Basic Modules Hadoop Common Hadoop Distributed File System (HDFS) Hadoop YARN Hadoop MapReduce

Page 18: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Framework Basic Modules

Hadoop Common Hadoop Distributed File System (HDFS) Hadoop YARN Hadoop MapReduce

Page 19: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Framework Basic Modules

Hadoop Common Hadoop Distributed File System (HDFS) Hadoop YARN Hadoop MapReduce

Page 20: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Framework Basic Modules

Hadoop Common Hadoop Distributed File System (HDFS) Hadoop YARN Hadoop MapReduce

Page 21: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 22: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 23: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 24: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 25: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 26: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Lecture #3

Hadoop Distributed File System (HDFS)

Page 27: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

HDFS Hadoop Distributed File System

Distributed, scalable, and portable file-system written in Java for the Hadoop framework

Page 28: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

HDFS

Page 29: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 30: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

MapReduce Engine Job Tracker

Page 31: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

MapReduce Engine

Task Tracker

Job Tracker

Page 32: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Hadoop NextGen MapReduce (YARN)

Page 33: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 34: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

What is Yarn?

• YARN enhances the power of a Hadoop compute cluster

Scalability

Page 35: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

What is Yarn?

• YARN enhances the power of a Hadoop compute cluster

Scalability

Improved cluster utilization

Page 36: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

What is Yarn?

• YARN enhances the power of a Hadoop compute cluster

Scalability

MapReduce Compatibility Improved cluster utilization

Page 37: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

What is Yarn?

• YARN enhances the power of a Hadoop compute cluster

Scalability

MapReduce Compatibility

Improved cluster utilization

Supports Other Workloads

Page 38: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Lecture #4

The Hadoop “Zoo”

Page 39: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 40: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

How to figure out the Zoo??

Page 41: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack

Page 42: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack

Page 43: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack

Page 44: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack Data Integration

Page 45: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack

Page 46: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack

Page 47: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Original Google Stack

Page 48: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Facebook’s Version of the Stack Data Integration

Coordination

Languages Compilers Data Store

Page 49: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Yahoo’s Version of the Stack Data Integration

Coordination

Languages Compilers Data Store

Page 50: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

LinkedIn’s Version of the Stack Data Integration

Coordination

Languages Compilers Data Store

Page 51: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Cloudera’s Version of the Stack

Coordination

Languages Compilers

Data Integration

Data Store

Page 52: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Hadoop Ecosystem Major Components

Lecture #5

Page 53: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 54: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Sqoop • Tool designed for

efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases

Page 55: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 56: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

HBASE • Column-oriented database

management system • Key-value store • Based on Google Big Table • Can hold extremely large data • Dynamic data model • Not a Relational DBMS

Page 57: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 58: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

PIG

High level programming on top of Hadoop MapReduce

Page 59: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

PIG

The language: Pig Latin

Page 60: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

PIG

Data analysis problems as data flows

Page 61: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

PIG

Originally developed at Yahoo 2006

Page 62: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Pig for ETL

Page 63: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Pig for ETL

Page 64: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Pig for ETL

Page 65: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 66: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Hive

• Data warehouse software facilitates querying and managing large datasets residing in distributed storage

Page 67: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Hive

SQL-like language!

Page 68: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Hive

Facilitates querying and managing large datasets in HDFS

Page 69: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Apache Hive

Mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL

Page 70: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 71: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Oozie

Workflow scheduler system to manage Apache Hadoop jobs

Page 72: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Oozie

Oozie Coordinator jobs!

Page 73: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Oozie Supports MapReduce, Pig, Apache Hive, and Sqoop, etc.

Page 74: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 75: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Zookeeper

Provides operational services for a Hadoop cluster group services

Page 76: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Zookeeper Centralized service for: maintaining configuration information naming services providing distributed synchronization and providing group services

Page 77: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Zookeeper Centralized service for: maintaining configuration information

Page 78: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Zookeeper Centralized service for: maintaining configuration information naming services

Page 79: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Zookeeper Centralized service for: maintaining configuration information naming services providing distributed synchronization and providing group services

Page 80: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Flume Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data

Page 81: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Additional Cloudera Hadoop Components

Impala

Page 82: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 83: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Impala

• Cloudera's open source massively parallel processing (MPP) SQL query engine Apache Hadoop

Page 84: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Additional Cloudera Hadoop Components

Spark The New Paradigm

Page 85: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named
Page 86: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Spark

Apache Spark™ is a fast and general engine for large-scale data processing

Page 87: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Spark Benefits Multi-stage in-memory primitives provides performance up to 100 times faster for certain applications

Page 88: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Spark Benefits Allows user programs to load data into a cluster's memory and query it repeatedly

Well-suited to machine learning!!!

Page 89: Big Data Hadoop Stack - support.pdnsoft.comsupport.pdnsoft.com/docs/docs-pics/Big-Data-Meeting/PDN-Tech-Big… · Hadoop was created by Doug Cutting and Mike Cafarella in 2005 . Named

Up Next

Tour of the Cloudera’s Quick Start VM