taming operations in the apache hadoop ecosystem · 2019-12-18 · taming operations in the apache...

77
Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, [email protected] Kate Ting, [email protected] USENIX LISA ’14 Nov 14, 2014

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, [email protected]

Kate Ting, [email protected]

USENIX LISA ’14

Nov 14, 2014

Page 2: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

2

$ whoami

•  Jon Hsieh, Cloudera – Software engineer – HBase Tech Lead – Apache HBase committer/

PMC

•  Kate Ting, Cloudera – Technical account manager – Apache Sqoop Cookbook co-

author – Former customer operations

engineer

©2014 Cloudera, Inc. All rights reserved.

Page 3: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

3

Cloudera and Apache Hadoop

•  Apache Hadoop is an open source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.

•  Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. –  Distributes CDH, a distribution Hadoop Distribution. –  Teaches, consults, and supports customers building applications on the Hadoop stack. –  The world-wide Cloudera Customer Operations Engineering team has closed tens of

thousands of support incidents over six years.

©2014 Cloudera, Inc. All rights reserved.

Page 4: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

4

Outline

•  Anatomy of a Hadoop System

•  Managing Hadoop Clusters

•  Troubleshooting Hadoop Applications

©2014 Cloudera, Inc. All rights reserved.

Page 5: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

5

Anatomy of a Hadoop System

Page 6: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

6 ©2014 Cloudera, Inc. All rights reserved.

Page 7: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

7

Distributed systems

•  A Distributed system is a set of many processes trying to acting like one a single process.

•  Ideally you only deal with the system as a whole

•  But in practice you deal with the parts

•  And assemble a set of tools to bring it together

©2014 Cloudera, Inc. All rights reserved.

Page 8: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

8

A Hadoop Machine

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

©2014 Cloudera, Inc. All rights reserved.

Page 9: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

9

A Hadoop Rack

Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch

©2014 Cloudera, Inc. All rights reserved.

Page 10: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

10

A Hadoop Cluster Cluster

Backbone Switch

Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch

©2014 Cloudera, Inc. All rights reserved.

Page 11: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

11

Cluster

Rack

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

Top of Rack Switch Rack

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

Top of Rack Switch Rack

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

Top of Rack Switch

Backbone Switch WAN Cluster

Backbone Switch

Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch

©2014 Cloudera, Inc. All rights reserved.

Page 12: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

12

Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

©2014 Cloudera, Inc. All rights reserved.

Page 13: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

13

Hadoop Stack - Storage

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Coordination: zookeeper;

Security: Sentry

Random Access Storage: HBase

File Storage: HDFS

©2014 Cloudera, Inc. All rights reserved.

Page 14: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

14

Hadoop Stack - Processing

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Random Access Storage: HBase

File Storage: HDFS

©2014 Cloudera, Inc. All rights reserved.

Page 15: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

15

Hadoop Stack - Integration

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

©2014 Cloudera, Inc. All rights reserved.

Page 16: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

16

Hadoop Stack – User Interface

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

©2014 Cloudera, Inc. All rights reserved.

Page 17: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

17

Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

©2014 Cloudera, Inc. All rights reserved.

Page 18: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

18

Managing a Hadoop Cluster

Page 19: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

19

Understanding Normal

•  Establish normal –  Boring logs are good

•  So that you can detect abnormal –  Find anomalies and outliers –  Compare performance

•  And then isolate root causes –  Who are the suspects –  Pull the thread by interrogating –  Correlate across different subsystems

©2014 Cloudera, Inc. All rights reserved.

Page 20: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

20

Basic tools

•  What happened? –  Logs

•  What state are we in now? –  Metrics –  Thread stacks

•  What happens when I do this? –  Tracing –  Dumps

•  Is it alive? –  listings –  Canaries

•  Is it OK? –  fsck

©2014 Cloudera, Inc. All rights reserved.

Page 21: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

21

Diagnostics for single machines

Machine

Linux

Disk, CPU, Mem

Analysis and Tools via @brendangregg ©2014 Cloudera, Inc. All rights reserved.

Page 22: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

22

Diagnosis Tools

©2014 Cloudera, Inc. All rights reserved.

HW/Kernel JVM Hadoop Stack

Logs /log, /var/log dmesg

Enable gc logging *:/var/log/hadoo|hbase|…

Metrics /proc, top, iotop, sar, vmstat, netstat

jstat *:/stacks, *:/jmx,

Tracing Strace, tcpdump Debugger, jdb, jstack htrace

Dumps Core dumps jmap Enable OOME dump

*:/dump

Liveness ping Jps Web UI

Corrupt fsck Hdfs fsck, hbase hbck

Page 23: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

23

Diagnostics for the JVM •  Most Hadoop services run in the Java VM, intros new java subsystems –  Just in time compiler –  Threads –  Garbage collection

•  Dumping Current threads –  jstacks (any threads stuck or blocked?)

•  JVM Settings: –  Enabling GC Logs: -XX:+PrintGCDateStamps -XX:+PrintGCDetails –  Enabling OOME Heap dumps: -XX:+HeapDumpOnOutOfMemoryError

•  Metrics on GC –  Get gc counts and info with jstat

•  Memory Usage dumps –  Create heap dump: jmap –dump:live,format=b,file=heap.bin <pid> –  View heap dump from crash or jmap with jhat

©2014 Cloudera, Inc. All rights reserved.

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Page 24: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

24

Diagnosis Tools

©2014 Cloudera, Inc. All rights reserved.

HW/Kernel JVM Hadoop Stack

Logs /log, /var/log dmesg

Enable gc logging *:/var/log/hadoo|hbase|…

Metrics /proc, top, iotop, sar, vmstat, netstat

jstat *:/stacks, *:/jmx,

Tracing Strace, tcpdump Debugger, jdb, jstack htrace

Dumps Core dumps jmap Enable OOME dump

*:/dump

Liveness ping Jps Web UI

Corrupt fsck Hdfs fsck, hbase hbck

Page 25: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

25

Other systems: httpd, sas,

custom apps etc

Tools for lots of machines

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

©2014 Cloudera, Inc. All rights reserved.

Page 26: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

26

Other systems: httpd, sas,

custom apps etc

Tools for lots of machines

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem Ganglia, Nagios

Ganglia*, Nagios*

©2014 Cloudera, Inc. All rights reserved.

Page 27: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

27

Ganglia

•  Collects and aggregates metrics from machines

•  Good for what’s going on right now and describing normal perf

•  Organized by physical resource (CPU, Mem, Disk) –  Good defaults –  Good for pin pointing machines –  Good for seeing overall utilization –  Uses RRDTool under the covers

•  Some data scalability limitations, lossy over time.

•  Dynamic for new machines, requires config for new metrics

©2014 Cloudera, Inc. All rights reserved.

Page 28: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

28

Graphite

•  Popular alternative to ganglia

•  Can handle the scale of metrics coming in

•  Similar to ganglia, but uses its own RRD database.

•  More aimed at dynamic metrics (as opposed to statically defined metrics)

©2014 Cloudera, Inc. All rights reserved.

Page 29: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

29

Nagios •  Provides alerts from canaries and basic health checks for services on

machines

•  Organized by Service (httpd, dns, etc)

•  Defacto standard for service monitoring

•  Lacks distributed system know how –  Requires bespoke setup for service slaves and masters –  Lacks details with multitenant services or short-lived jobs

©2014 Cloudera, Inc. All rights reserved.

Page 30: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

30

Other systems: httpd, sas,

custom apps etc

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem Ganglia, Nagios

Ganglia*, Nagios*

©2014 Cloudera, Inc. All rights reserved.

Page 31: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

31

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Ganglia, Nagios

Ganglia*, Nagios*

©2014 Cloudera, Inc. All rights reserved.

Page 32: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

32

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Ganglia, Nagios

Ganglia*, Nagios*

©2014 Cloudera, Inc. All rights reserved.

Page 33: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

33

Diagnostics for the Hadoop Stack •  Single client call can trigger many RPCs spanning many machines •  Systems are evolving quickly •  A failure on one daemon, by design, does not cause failure of the entire service

•  Logs: –  Each service’s master and slaves have their own logs: /var/log/ –  There are lot of logs and they change frequently

•  Metrics: –  Each daemon offers metrics, often aggregated at masters

•  Tracing: –  Htrace (integrated into Hadoop, HBase, recently in Apache Incubator)

•  Liveness: –  Canaries, service/daemon web uis

©2014 Cloudera, Inc. All rights reserved.

Page 34: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

34

Diagnosis Tools HW/Kernel JVM Hadoop Stack

Logs /log, /var/log dmesg

Enable gc logging *:/var/log/hadoo|hbase|…

Metrics /proc, top, iotop, sar, vmstat, netstat

jstat *:/stacks, *:/jmx,

Tracing Strace, tcpdump Debugger, jdb, jstack htrace

Dumps Core dumps jmap Enable OOME dump

*:/dump

Liveness ping Jps Web UI

Corrupt fsck HDFS fsck, HBase hbck

©2014 Cloudera, Inc. All rights reserved.

Page 35: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

35

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Ganglia, Nagios

Ganglia*, Nagios*

©2014 Cloudera, Inc. All rights reserved.

Page 36: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

36

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Logs and Metrics

Logs and Metrics Logs and Metrics

Logs and Metrics Logs and Metrics

Logs and Metrics

Logs

Logs and Metrics

Logs

and

Met

rics

Logs

Logs and Metrics

Logs and Metrics

HTrace

HTrace

Custom app logs

/proc, dmesg, /var/logs/

jmx, jstack, jstat, GC logs

©2014 Cloudera, Inc. All rights reserved.

Page 37: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

37

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Ganglia, Nagios

Ganglia*, Nagios*

Ganglia*, Nagios* ?

©2014 Cloudera, Inc. All rights reserved.

Page 38: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

38

Ganglia and Nagios are not enough

•  Fault tolerant distributed system masks many problems (by design!) –  Some failures are not critical – failure condition more complicated

•  Lacks distributed system know how –  Requires bespoke setup for service slaves and masters –  Lacks details with multitenant services or short-lived jobs

•  Hadoop services are logically dependent on each other –  Need to correlate metrics across different service and machines –  Need to correlate logs from different services and machines. –  Young systems where Logs are changing frequently –  What about all the logs? –  What of all these metrics do we really need?

•  Some data scalability limitations, lossy over time –  What about full fidelity?

©2014 Cloudera, Inc. All rights reserved.

Page 39: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

39

OpenTSDB

•  OpenTSDB (Time Series Database) –  Efficiently stores metric data into HBase –  Keeps data at full fidelity –  Keep as much data as your HBase instance can

handle.

•  Free and Open Source

©2014 Cloudera, Inc. All rights reserved.

Page 40: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

40

Cloudera Manager •  Extracts hardware, OS, and Hadoop service metrics

specifically relevant to Hadoop and its related services. –  Operation Latencies –  # of disk seeks –  HDFS data written –  Java GC time –  Network IO

•  Provides –  Cluster preflight checks –  Basic host checks –  Regular health checks

•  Uses RRD

•  Provides distributed log search

•  Monitors logs for known issues

•  Point and click for useful utils (lsof, jstack, jmap)

•  Free

©2014 Cloudera, Inc. All rights reserved.

Page 41: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

41

Tools for the Hadoop Stack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

In Mem processing: Spark

Interactive SQL: Impala Batch processing

MapReduce

Resource Management: YARN

Coordination: zookeeper;

Security: Sentry

Search: Solr

Even

t ing

est:

Flum

e,

Kafk

a

DB

Impo

rt E

xpor

t: Sq

oop

User interface: HUE

Other systems: httpd, sas,

custom apps etc

Random Access Storage: HBase

File Storage: HDFS

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Ganglia, Nagios

Ganglia*, Nagios*

Full Fidelity metrics: Open TSDB*

©2014 Cloudera, Inc. All rights reserved.

Metrics + Logs: Cloudera Manager

Logs: Search a la Splunk

Page 42: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

42

Troubleshooting a Hadoop System

Page 43: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

43

The Law of Cluster Inertia

A cluster in a good state stays in a good state, and

a cluster in a bad state stays in a bad state, unless

acted upon by an external force.

©2014 Cloudera, Inc. All rights reserved.

Page 44: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

44 ©2014 Cloudera, Inc. All rights reserved.

Page 45: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

45

External Forces

•  Failures

•  Acts of God

•  Users

•  Admins

©2014 Cloudera, Inc. All rights reserved.

Page 46: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

46

Batch processing MapReduce

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Failures as an External Force

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Coordination: zookeeper;

Security: Sentry

Even

t ing

est:

Flum

e,

Kafk

a

Random Access Storage: HBase

Resource Management: YARN DB

Impo

rt E

xpor

t: Sq

oop

File Storage: HDFS Other systems: httpd, sas,

custom apps etc Machine

JVM

HW

Slave

Job

©2014 Cloudera, Inc. All rights reserved.

In Mem processing: Spark

Interactive SQL: Impala

Search: Solr

User interface: HUE

Page 47: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

47

Acts of God as an External Force WAN Cluster

Backbone Switch

Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Rack

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Top of Rack Switch Power Failure

©2014 Cloudera, Inc. All rights reserved.

Page 48: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

48

Users as an external force

•  Use Security to protect systems from users –  Prevent and track

•  Authentication – proving who you are –  LDAP, Kerberos

•  Authorization – deciding what you are allowed to do –  Apache Sentry (incubating), Hadoop security, HBase security

•  Audit – who and when was something done? –  Cloudera Navigator

©2014 Cloudera, Inc. All rights reserved.

Page 49: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

49

Admins as an external force

Upgrades

•  Linux

•  Hadoop

•  Java

Misconfiguration

•  Memory Mismanagement –  TT OOME –  JT OOME –  Native Threads

•  Thread Mismanagement –  Fetch Failures –  Replicas

•  Disk Mismanagement –  No File –  Too Many Files

©2014 Cloudera, Inc. All rights reserved.

Page 50: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

50

Debugging Hadoop Applications

Page 51: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

51

Example Application Pipeline

ingest process export

ingest process export

ingest process export

time

©2014 Cloudera, Inc. All rights reserved.

Page 52: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

52

Example Application Pipeline - Ingest

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Coordination: zookeeper;

Security: Sentry File Storage: HDFS

Custom app feeds event data in to HBase

Random Access Storage: HBase

Even

t ing

est:

Flum

e,

Kafk

a

Other systems: httpd, sas,

custom apps etc

©2014 Cloudera, Inc. All rights reserved.

Page 53: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

53

Example Application Pipeline - processing

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Coordination: zookeeper;

Security: Sentry

Even

t ing

est:

Flum

e,

Kafk

a

Other systems: httpd, sas,

custom apps etc

Batch processing MapReduce

Resource Management: YARN

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Random Access Storage: HBase

Map Reduce Job generates new artifact from HBase data

and writes to hdfs

File Storage: HDFS

©2014 Cloudera, Inc. All rights reserved.

Page 54: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

54

Batch processing MapReduce

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Example Application Pipeline - export

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Coordination: zookeeper;

Security: Sentry

Even

t ing

est:

Flum

e,

Kafk

a

Random Access Storage: HBase 9

Resource Management: YARN DB

Impo

rt E

xpor

t: Sq

oop

File Storage: HDFS Other systems: httpd, sas,

custom apps etc

Sqoop result into relational database to serve application

©2014 Cloudera, Inc. All rights reserved.

Page 55: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

55

Case study 1: slow jobs after Hadoop upgrade

Symptom: After an upgrade, activity on the cluster eventually began to slow down and the job queue overflowed.

©2014 Cloudera, Inc. All rights reserved.

Page 56: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

56

Finding the right part of the stack E-SPORE (from Eric Sammer’s Hadoop Operations) •  Environment –  What is different about the environment now from the last time everything worked?

•  Stack –  The entire cluster also has shared dependency on data center infrastructure such as the network, DNS, and other services.

•  Patterns –  Are the tasks from the same job? Are they all assigned to the same tasktracker? Do they all use a shared library that was changed

recently?

•  Output –  Always check log output for exceptions but don’t assume the symptom correlates to the root cause.

•  Resources –  Do local disks have enough? Is the machine swapping? Does the network utilization look normal? Does the CPU utilization look

normal?

•  Event correlation –  It’s important to know the order in which the events led to the failure.

©2014 Cloudera, Inc. All rights reserved.

Page 57: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

57

Example Application Pipeline - processing

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

Coordination: zookeeper;

Security: Sentry

Even

t ing

est:

Flum

e,

Kafk

a

Other systems: httpd, sas,

custom apps etc

Languages/APIs: Hive, Pig, Crunch, Kite, Mahout

Random Access Storage: HBase

File Storage: HDFS

©2014 Cloudera, Inc. All rights reserved.

Batch processing MapReduce

Resource Management: YARN

Map Reduce Job generates new artifact from HBase data

and writes to hdfs

Page 58: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

58

Case study 1: slow jobs after Hadoop upgrade

Evidence:

Isolated to Processing phase (MR).

In TT Logs, found an innocuous but anomalous log entry about “fetch failures.”

Many users had run in to this MR problem using different versions of MR.

Workaround provided: remove the problem node from the cluster.

©2014 Cloudera, Inc. All rights reserved.

Page 59: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

©2014 Cloudera, Inc. All rights reserved.

Page 60: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

60

Case study 1: slow jobs after Hadoop upgrade

Root cause: All MR versions had a common dependency on a particular version of Jetty (Jetty 6.1.26) . Dev was able to reproduce and fix the bug in Jetty.

©2014 Cloudera, Inc. All rights reserved.

Page 61: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

61

Case study 2: slow jobs after Linux upgrade

Symptom: After an upgrade, system CPU usage peaked at 30% or more of the total CPU usage. ©2014 Cloudera, Inc. All rights reserved.

Page 62: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

62

Case study 2: slow jobs after Linux upgrade

Evidence: Used tracing tools to isolate a majority of time was inexplicably spent in virtual memory calls. http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

©2014 Cloudera, Inc. All rights reserved.

Page 63: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

©2014 Cloudera, Inc. All rights reserved.

Page 64: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

64

Case study 2: slow jobs after Linux upgrade

Root cause: Running RHEL or CentOS versions 6.2, 6.3, and 6.4 or SLES 11 SP2 has a feature called "Transparent Huge Page (THP)" compaction which interacts poorly with Hadoop workloads.

©2014 Cloudera, Inc. All rights reserved.

Page 65: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

65

Case study 3: slow jobs at a precise moment

Symptom: High CPU usage and responsive but sluggish cluster. 30 customers all hit this at the exact same time: 6/30/12 at 5pm PDT.

©2014 Cloudera, Inc. All rights reserved.

Page 66: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

66

Case study 3: slow jobs at a precise moment

Evidence: Checked the kernel message buffer (run dmesg) and look for output confirming the leap second injection. Other systems had same problem. ©2014 Cloudera, Inc. All rights reserved.

Page 67: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

Machine

JVM

Linux

Hadoop Daemons

Disk, CPU, Mem

©2014 Cloudera, Inc. All rights reserved.

Page 68: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

68

Case study 3: slow jobs at a precise moment

Root cause: Linux OS kernel mishandled a leap second added.

©2014 Cloudera, Inc. All rights reserved.

Page 69: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

69

Similar symptoms, different problem

©2014 Cloudera, Inc. All rights reserved.

Page 70: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

70

Case study 1: slow jobs after Hadoop upgrade

After an upgrade, activity on the cluster eventually began to slow down and the job queue overflowed.

In TT Logs, found an innocuous but anomalous log entry about “fetch failures.”

Many users had run in to this MR problem using different versions of MR.

Workaround provided: remove the problem node from the cluster.

All MR versions had a common dependency on a particular version of Jetty (Jetty 6.1.26) .

Dev was able to reproduce and fix the bug in Jetty.

Symptom Evidence Root Cause

©2014 Cloudera, Inc. All rights reserved.

Page 71: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

71

Case study 2: slow jobs after Linux upgrade

After an upgrade, system CPU usage peaked at 30% or more of the total CPU usage.

HT to Greg Rahn ,Brendan Gregg’s flame graph tool, and perf tool which proved that majority of time was inexplicably spent in virtual memory calls.

Running RHEL or CentOS versions 6.2, 6.3, and 6.4 or SLES 11 SP2 has a feature called "Transparent Huge Page (THP)" compaction which interacts poorly with Hadoop workloads.

Symptom Evidence Root Cause

©2014 Cloudera, Inc. All rights reserved.

Page 72: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

72

Case study 3: slow jobs at a precise moment

High CPU usage and responsive but sluggish cluster.

30 customers all hit this at the exact same time.

Checked the kernel message buffer (run dmesg) and look for output confirming the leap second injection.

Linux OS kernel mishandled a leap second added on 6/30/12 at 5pm PDT.

Symptom Evidence Root Cause

©2014 Cloudera, Inc. All rights reserved.

Page 73: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

73

Lessons learned

More crucial than the specific troubleshooting methodology used is to use one.

More crucial than the specific tool used is the type of data analyzed and how it’s analyzed.

Capture for posterity in a knowledge base article, blog post, or conference presentation.

Methodology Tools Learn from failure

©2014 Cloudera, Inc. All rights reserved.

Page 74: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

74

Conclusions

Page 75: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

75

Takeaway

A cluster in a good state stays in a good state, and a cluster in a bad state stays in a bad state, unless acted upon by an external force..

Cloudera has seen a lot of diverse clusters and used that experience to build tools to help diagnose and understand how Hadoop operates.

Similar symptoms can lead to different root causes. Use tools to assist with event correlation and pattern determination.

Anatomy of a Hadoop System Managing Hadoop Clusters Troubleshooting Hadoop Applications

©2014 Cloudera, Inc. All rights reserved.

Page 76: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

76

Shameless plug

A full demo cluster + tutorial and sample data/queries in the cloud – free access for two weeks (plus, clusters in Tableau and Zoomdata flavors!)

cloudera.com/live

Ask technical questions/get technical answers from Cloudera Software Engineers and other users

cloudera.com/community

Account Executives, Solutions Architects, Training

cloudera.com/careers

Cloudera Live Cloudera Community In Europe!

©2014 Cloudera, Inc. All rights reserved.

Page 77: Taming Operations in the Apache Hadoop Ecosystem · 2019-12-18 · Taming Operations in the Apache Hadoop Ecosystem Jon Hsieh, jon@cloudera.com Kate Ting, kate@cloudera.com USENIX

Thank you! Office hours @2pm in Willow A @jmhsieh @kate_ting