hadoop ecosystem · – flume – impala. yarn ... spark vs. m/r spark tries to ... append-only...

26
Hadoop Ecosystem Ryan Brown Red Hat, Inc.

Upload: others

Post on 09-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Hadoop Ecosystem

Ryan BrownRed Hat, Inc.

Page 2: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Agenda● MapReduce

● Storage

● The Ecosystem (Spark, Hive, etc)

● The Real World

Page 3: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

“Big Data”● Well it depends

● Petabyte and beyond

● Capture, curation, storage, search, and analysis

● Unstructured

● Too big for one machine

Page 4: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Brief History● Hadoop is based on Google papers

from the early 2000s

– Google File System (GFS)

– MapReduce

● Distribute data at write-time, and reads are distributed by default.

Page 5: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

MapReduce● Built for parallelism

● Split input

● Process small pieces

● Merge results from each

● Final output

Page 6: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Infrastructure● Commodity servers, commodity prices

● … lots and lots of them

● Plan for failure

● … and growth

Page 7: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

M/R Concepts● “Shared-nothing” computation

● Minimum of communication

● Master node restarts failed tasks

● Splits tasks between nodes

Page 8: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

HDFS Concepts● Files are split into big (128MB+) blocks

● Blocks are replicated (3x by default)

● Spindles, platters, and performance

● Best for big sequential reads

● Write only once

Page 9: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Reading Data● Client: NameNode, who has foo.txt?

● NN: Nodes 7, 11, and 187

● Client: 7, give me foo.txt

● 7: Blocks 1-10 of foo.txt

Page 10: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

SQL-likes● Hive

● Pig

● Impala

Page 11: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Pig Latin● Scripting-like language

● Mature (err...kind of old)

● Built for analysts, can be used interactively

● Every line compiles to a MapReduce job

Page 12: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Hive● SQL-ish language called HiveQL

● Metadata is stored separately from datahive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Page 13: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Impala● Newer than Hive and Pig

● Faster to run queries

● Built on HBase (BigTable clone)

● SQL-92 compliant, making it a great for BI and easy to integrate

Page 14: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Under the Hood● Accepts queries on any DataNode

● Query planner

● Coordinates all other DataNodes with relevant data

● Gathers results to go back over SQL connection

Page 15: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Impala

Page 16: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Hadoop Today● YARN, Mesos, and other schedulers

● Federated NameNodes

● Huge tool ecosystem

– Sqoop

– Flume

– Impala

Page 17: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

YARN● MapReduce: Reloaded

Page 18: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

YARN● Splits up JobTracker from Hadoop v1

● Monitoring, resource management, and scheduling are now separate daemons

● Application Masters negotiate with Resource Manager for capacity

Page 19: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Spark● Began ~2009

● Targeted for Scala and Python

● Application Masters negotiate with Resource Manager for capacity

● Jobs are set up as a DAG

● “Micro-batching”

Page 20: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Spark vs. M/R● Spark tries to work in memory

● MapReduce is slower for multi-pass work

● Spark jobs work great as a team

● Spark is better for streaming

● MapReduce looks more like “traditional” ETL

Page 21: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Kafka Streaming● Append-only logs

● Shard by stream

● Replay when consumers fail

● Connects real-time systems to batch systems

Page 22: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Kafka Streaming● Does not process data

● Not coupled to the Hadoop ecosystem

● Can act as a key-value store

● Performance in catch-up cases

Page 23: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Flink Combo● Streaming OR batch

● Streams in Scala, Python, or Java

● Tables for SQL-ish access

● Graph processing with Gelly

Page 24: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Flink Combo● One-stop framework

● Caches intermediate data

● Jobs can be chained like Spark

Page 25: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Hadoop-ables● Risk calculations

● Recommendations

● Transaction analysis

● Failure prediction

● Customer churn analysis

● Fraud/threat detection

● Word count – haha

Page 26: Hadoop Ecosystem · – Flume – Impala. YARN ... Spark vs. M/R Spark tries to ... Append-only logs Shard by stream Replay when consumers fail Connects real-time systems to batch

Questions?