hadoop ecosystem · – flume – impala. yarn ... spark vs. m/r spark tries to ... append-only...

Hadoop Ecosystem

Ryan BrownRed Hat, Inc.

Agenda● MapReduce

● Storage

● The Ecosystem (Spark, Hive, etc)

● The Real World

“Big Data”● Well it depends

● Petabyte and beyond

● Capture, curation, storage, search, and analysis

● Unstructured

● Too big for one machine

Brief History● Hadoop is based on Google papers

from the early 2000s

– Google File System (GFS)

– MapReduce

● Distribute data at write-time, and reads are distributed by default.

MapReduce● Built for parallelism

● Split input

● Process small pieces

● Merge results from each

● Final output

Infrastructure● Commodity servers, commodity prices

● … lots and lots of them

● Plan for failure

● … and growth

M/R Concepts● “Shared-nothing” computation

● Minimum of communication

● Master node restarts failed tasks

● Splits tasks between nodes

HDFS Concepts● Files are split into big (128MB+) blocks

● Blocks are replicated (3x by default)

● Spindles, platters, and performance

● Best for big sequential reads

● Write only once

Reading Data● Client: NameNode, who has foo.txt?

● NN: Nodes 7, 11, and 187

● Client: 7, give me foo.txt

● 7: Blocks 1-10 of foo.txt

SQL-likes● Hive

● Pig

● Impala

Pig Latin● Scripting-like language

● Mature (err...kind of old)

● Built for analysts, can be used interactively

● Every line compiles to a MapReduce job

Hive● SQL-ish language called HiveQL

● Metadata is stored separately from datahive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Impala● Newer than Hive and Pig

● Faster to run queries

● Built on HBase (BigTable clone)

● SQL-92 compliant, making it a great for BI and easy to integrate

Under the Hood● Accepts queries on any DataNode

● Query planner

● Coordinates all other DataNodes with relevant data

● Gathers results to go back over SQL connection

Impala

Hadoop Today● YARN, Mesos, and other schedulers

● Federated NameNodes

● Huge tool ecosystem

– Sqoop

– Flume

– Impala

YARN● MapReduce: Reloaded

YARN● Splits up JobTracker from Hadoop v1

● Monitoring, resource management, and scheduling are now separate daemons

● Application Masters negotiate with Resource Manager for capacity

Spark● Began ~2009

● Targeted for Scala and Python

● Application Masters negotiate with Resource Manager for capacity

● Jobs are set up as a DAG

● “Micro-batching”

Spark vs. M/R● Spark tries to work in memory

● MapReduce is slower for multi-pass work

● Spark jobs work great as a team

● Spark is better for streaming

● MapReduce looks more like “traditional” ETL

Kafka Streaming● Append-only logs

● Shard by stream

● Replay when consumers fail

● Connects real-time systems to batch systems

Kafka Streaming● Does not process data

● Not coupled to the Hadoop ecosystem

● Can act as a key-value store

● Performance in catch-up cases

Flink Combo● Streaming OR batch

● Streams in Scala, Python, or Java

● Tables for SQL-ish access

● Graph processing with Gelly

Flink Combo● One-stop framework

● Caches intermediate data

● Jobs can be chained like Spark

Hadoop-ables● Risk calculations

● Recommendations

● Transaction analysis

● Failure prediction

● Customer churn analysis

● Fraud/threat detection

● Word count – haha

Questions?

hadoop ecosystem · – flume – impala. yarn ... spark vs. m/r spark tries to ... append-only...

Documents