Hadoop for Sysadmins
TRANSCRIPT
A whirlwind tour of Hadoop
By Eric Marshall for LOPSA-NJ
An all-too-brief introduction to the world of big data
Eric Marshall
I work for Airisdata; we’re hiring!
Smallest computer I lost sleep over: Sinclair Timex ZX81 – 1 KB of memory
Largest computer I lost sleep over: SGI Altix 4700 – 1 TB of memory
Vocabulary disclaimer
Just like your favorite swear word, which can act like many parts of speech and refer to many a thing, Hadoop vocabulary has the same problem.
Casually, people refer to Hadoop as storage, processing, a programming model (or models), or the clustered machines themselves. The same problem exists for other terms in the lexicon, so ask me when I make less sense than usual.
My plan of attack
An intro: the good, the bad and the ugly at 50,000 ft.
2¢ tour of Hadoop’s processing – MapReduce
2¢ tour of Hadoop’s storage – HDFS
A blitz tour of the rest of the Hadoop ecosystem
Why did this happen?
Old school –> scale up == a larger, costlier monolithic system (or a small cluster thereof), i.e. vertical scaling
Different approach – all roads lead to scale out:
Assume failures
Smart software, cheap hardware
Don’t move data; bring processing to the data
The Good
Simple development (when compared to Message Passing Interface programming)
Scale – no shared state, programmers don’t need to know the topology, easy to add hardware
Automatic parallelization and distribution of tasks
Fault tolerance
Works with commodity hardware
Open source!
The Bad
Not a silver bullet :(
MapReduce is batch data processing – the time scale is minutes to hours
MapReduce is overly simplified/abstracted – you are stuck with the M/R model and it is hard to work smarter
MapReduce is low level compared to high-level languages like SQL
Not all work decomposes well into parallelized M/R
Open source :)
The Ugly?
Welcome to the rest of our talk!
First stop, MapReduce
Hadoop’s MapReduce: Lisp’s map and reduce, plus the associative property, applied to clusters.
Map()
Imagine a number of servers with lists of first names – what is the most popular name?
Box 1-isabella William ava mia Emma Alexander
Box 2-Noah NOAH Isabella Isabella emma Emma
Box 3-emma Emma Liam liam mason Isabella
Map() would apply a function to each element, independent of order. For example, capitalize each word.
(MapReduce is covered in greater detail in Chapter 2 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Map()
So we would have:
Box 1-Isabella William Ava Mia Emma Alexander
Box 2-Noah Noah Isabella Isabella Emma Emma
Box 3-Emma Emma Liam Liam Mason Isabella
Map() could also apply a function to make pairs. For example, Isabella becomes (Isabella, 1).
Map()
So we would have:
Box 1-(Isabella,1) (William,1) (Ava,1) (Mia,1) (Emma,1) (Alexander,1)
Box 2-(Noah,1) (Noah,1) (Isabella,1) (Isabella,1) (Emma,1) (Emma,1)
Box 3-(Emma,1) (Emma,1) (Liam,1) (Liam,1) (Mason,1) (Isabella,1)
Now we are almost ready for the reduce, but first the sort and shuffle
Shuffle/Sort
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
Now for the reduce: our function would sum all of the 1s and return each name and its count.
Reduce
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
Now for the reduce: our function would sum all of the 1s and return each name and its count:
Box 1-(Alexander,1) (Ava,1) (Emma,5)
Box 2-(Isabella,4)
Box 3-(Liam,2) (Mason,1) (Mia,1) (Noah,2) (William,1)
(See https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for something similar coded in Java)
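The whole name-count walk-through can be sketched in a few lines of Python – a toy simulation of the three phases on a single machine, not how Hadoop itself runs, using the boxes of names from the slides:

```python
from collections import defaultdict

# The three boxes of names from the slides.
boxes = [
    ["isabella", "William", "ava", "mia", "Emma", "Alexander"],
    ["Noah", "NOAH", "Isabella", "Isabella", "emma", "Emma"],
    ["emma", "Emma", "Liam", "liam", "mason", "Isabella"],
]

# Map: normalize each name and emit (name, 1) pairs.
pairs = [(name.capitalize(), 1) for box in boxes for name in box]

# Shuffle/sort: sort the pairs and group them by key.
groups = defaultdict(list)
for name, one in sorted(pairs):
    groups[name].append(one)

# Reduce: sum the 1s for each name.
counts = {name: sum(ones) for name, ones in groups.items()}
print(counts["Emma"], counts["Isabella"])  # 5 4
```

In a real cluster, each map runs where its box of data lives, and the shuffle moves each key’s pairs to one reducer; only then does the per-key sum happen.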
(This architecture is covered in greater detail in Chapter 4 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Map/reduce failures
Check the job if:
- The job throws an uncaught exception.
- The job exits with a nonzero exit code.
- The job fails to report progress to the tasktracker for a configurable amount of time (i.e. hung, stuck, or slow).
Check the node if: the same node keeps killing jobs.
Check the JobTracker/RM if: jobs are lost or stuck and then they all fail.
Instant MR test
Um, is the system working?
yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
(your jar will most likely be somewhere else)
(HDFS is covered in greater detail in Chapter 3 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
There is an HDFS CLI
You already know some of the commands:
hdfs dfs -ls /
hdfs dfs -du /
hdfs dfs -rm /
hdfs dfs -cat /
There are modes other than dfs: dfsadmin, namenode, datanode, fsck, zkfc, balancer, etc.
HDFS failures
Jobs fail: due to missing blocks
Jobs fail: due to moving data (down datanodes or huge ingest)
Without NN HA – a single point of failure for everything
Regular file system mayhem that you already know and love, plus the usual perms issues
The rest of the garden
Distributed Filesystems
- Apache HDFS
outliers:
- Tachyon
- Apache GridGain
- Ignite
- XtreemFS
- Ceph Filesystem
- Red Hat GlusterFS
- Quantcast File System QFS
- Lustre
Security
outliers:
- Apache Sentry
- Apache Knox Gateway
- Apache Ranger
Distributed Programming
- Apache MapReduce also MRv2/YARN
- Apache Pig
outliers:
- JAQL
- Apache Spark
- Apache Flink (formerly Stratosphere)
- Netflix PigPen
- AMPLab SIMR
- Facebook Corona
- Apache Twill
- Damballa Parkour
- Apache Hama
- Datasalt Pangool
- Apache Tez
- Apache Llama
- Apache DataFu
- Pydoop
- Kangaroo
- TinkerPop
- Pachyderm MapReduce
NewSQL Databases
outliers:
- TokuDB
- HandlerSocket
- Akiban Server
- Drizzle
- Haeinsa
- SenseiDB
- Sky
- BayesDB
- InfluxDB
NoSQL Databases
:Columnar Data Model
- Apache HBase
outliers:
- Apache Accumulo
- Hypertable
- HP Vertica
:Key Value Data Model
- Apache Cassandra
- Riak
- Redis
- Linkedin Voldemort
:Document Data Model
outliers:
- MongoDB
- RethinkDB
- ArangoDB
- CouchDB
:Stream Data Model
outliers:
- EventStore
:Key-Value Data Model
outliers:
- Redis DataBase
- Linkedin Voldemort
- RocksDB
- OpenTSDB
:Graph Data Model
outliers:
- Neo4j
- ArangoDB
- TitanDB
- OrientDB
- Intel GraphBuilder
- Giraph
- Pegasus
- Apache Spark
Scheduling
- Apache Oozie
outliers:
- Linkedin Azkaban
- Spotify Luigi
- Apache Falcon
10 in 10 minutes!
Easier Programming: Pig, Spark
SQL-like tools: Hive, Impala, HBase
Data pipefitting: Sqoop, Flume, Kafka
Book keeping: Oozie, Zookeeper
Easier Programming
Pig
What is it: a high-level programming language for data manipulation that abstracts away M/R, from Yahoo
Why: a few lines of code to munge data
Example (the first two lines, loading and tokenizing the input, are assumed context for the slide’s snippet):
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
(Pig is covered in greater detail in Alan Gates’s Programming Pig by O’Reilly, and in Chapter 16 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Spark
What is it: a computing framework from AMPLab, UC Berkeley
Why: high-level abstractions and better use of memory
Neat trick: in-memory RDDs
Example:
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Or, in Python:
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
(Spark is covered in greater detail by Matei Zaharia et al. in Learning Spark by O’Reilly. Also of note is Advanced Analytics with Spark – it shows Spark’s capabilities well but moves way too quickly to be truly useful. Spark is also covered in Chapter 19 of Tom White’s Hadoop – The Definitive Guide by O’Reilly – latest edition only)
SQL-ish
Hive/HQL
What is it: a data infrastructure and query language from Facebook
Why: batched SQL queries against HDFS
Neat trick: stores metadata so you don’t have to
Example:
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
(Hive is covered in greater detail by Jason Rutherglen et al. in Programming Hive by O’Reilly. Instant Apache Hive Essentials How-To by Darren Lee, from Packt, was useful to me as a tutorial. Hive is also covered in Chapter 17 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
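The GROUP BY/COUNT pattern above is plain SQL; a minimal sketch with Python’s sqlite3 (the table and rows here are made up for illustration) shows the same semantics that HiveQL applies at HDFS scale:

```python
import sqlite3

# In-memory stand-in for the books table; names mirror the slide's query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bxdataset (booktitle TEXT, yearofpublication INTEGER)")
conn.executemany(
    "INSERT INTO bxdataset VALUES (?, ?)",
    [("A", 1999), ("B", 1999), ("C", 2001)],
)

# Same shape as the HiveQL query: count titles per publication year.
rows = conn.execute(
    "SELECT yearofpublication, COUNT(booktitle) FROM bxdataset "
    "GROUP BY yearofpublication ORDER BY yearofpublication"
).fetchall()
print(rows)  # [(1999, 2), (2001, 1)]
```

The difference is where it runs: Hive compiles this into batch jobs over files in HDFS, so the same one-liner works when the table is terabytes.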
Impala
What is it: a SQL query engine from Cloudera
Why: fast ad hoc queries on subsets of data stored in Hadoop
Example:
[impala-host:21000] > select count(*) from customer_address;
(nada; let me know if you hit pay dirt)
HBase
What is it: a non-relational database from Powerset
Why: fast access to large sparse data sets
Example:
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4170 seconds
=> Hbase::Table - test
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0850 seconds
hbase(main):006:0> scan 'test'
ROW            COLUMN+CELL
 row1          column=cf:a, timestamp=1421762485768, value=value1
(HBase is covered in Chapter 20 of Tom White’s Hadoop – The Definitive Guide by O’Reilly, and in greater detail in Lars George’s HBase – The Definitive Guide by O’Reilly)
Data pipefitting
Sqoop
What is it: a glue tool for moving data between relational databases and Hadoop
Why: make the cumbersome easier
Example:
sqoop list-databases --connect jdbc:mysql://mysql/employees --username joe --password myPassword
(Sqoop is covered in greater detail in Chapter 15 of Tom White’s Hadoop – The Definitive Guide by O’Reilly. There is also a cookbook that covers a few worthy gotchas: Apache Sqoop Cookbook by Kathleen Ting et al., from O’Reilly)
Flume
What is it: a service for collecting and aggregating logs
Why: because log ingestion is tougher than it seems
Example:
# Define a memory channel on agent called memory-channel.
agent.channels.memory-channel.type = memory
# Define a source on agent and connect to channel memory-channel.
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel
# Define a sink that outputs to logger.
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
# Define a sink that outputs to hdfs.
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Finally, activate.
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
(I haven’t read much on Flume; if you find something clever let me know!)
Kafka
What is it: a message broker from LinkedIn
Why: fast handling of data feeds
Neat trick: no need to worry about missing data or double-processing data
Example:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
(I disliked the one book I read, but I found the online docs very readable! http://kafka.apache.org/
Also check out the design docs: http://kafka.apache.org/documentation.html#design )
Book keeping
Oozie
What is it: a workflow scheduler from Yahoo Bangalore
Why: because cron isn’t perfect
Example:
oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
(Oozie is covered in greater detail in Islam & Srinivasan’s Apache Oozie: The Workflow Scheduler by O’Reilly)
Zookeeper
What is it: a coordination service from Yahoo
Why: sync info for distributed systems (a similar idea to DNS or LDAP)
Example:
[zkshell: 14] set /zk_test junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
[zkshell: 15] get /zk_test
junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
(Zookeeper is covered in greater detail in ZooKeeper: Distributed Process Coordination by O’Reilly, and in Chapter 21 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
SQL on Hadoop
- Apache Hive
- Apache HCatalog
outliers:
- Cloudera Kudu
- Trafodion
- Apache Drill
- Cloudera Impala
- Facebook Presto
- Datasalt Splout SQL
- Apache Spark
- Apache Tajo
- Apache Phoenix
- Apache MRQL
- Kylin
Data Ingestion
- Apache Flume
- Apache Sqoop
outliers:
- Facebook Scribe
- Apache Chukwa
- Apache Storm
- Apache Kafka
- Netflix Suro
- Apache Samza
- Cloudera Morphline
- HIHO
- Apache NiFi
Etc.
Service Programming and Frameworks
- Apache Zookeeper
- Apache Avro
- Apache Parquet
outliers:
- Apache Thrift
- Apache Curator
- Apache Karaf
- Twitter Elephant Bird
- Linkedin Norbert
Scheduling
- Apache Oozie
outliers:
- Linkedin Azkaban
- Spotify Luigi
- Apache Falcon
- Schedoscope
System Deployment and Management
outliers:
- Apache Ambari
- Cloudera Manager
- Cloudera HUE
- Apache Whirr
- Apache Mesos
- Myriad
- Marathon
- Brooklyn
- Hortonworks HOYA
- Apache Helix
- Apache Bigtop
- Buildoop
- Deploop
And now a bit of common sense for sysadmin-ing Hadoop clusters
Avoid
The usual –
Don’t let HDFS fill up
Don’t use all the memory
Don’t use up all the CPUs
Don’t drop the network
<insert fav disaster>
Resource exhaustion by users
Hardware failure (drives are the king of this domain)
Um, backups?
The usual suspects, plus the Namenode’s metadata!! (fsimage)
HDFS itself? Well, it would be nice but unlikely (if so, distcp)
Snapshots
Hadoop Management
Apache Ambari
Cloudera Manager
Monitoring
The usual suspects plus…
JMX support
JVM via jstat, jmap, etc.
hdfs
mapred
conf/hadoop-metrics.properties
http://namenode:50070/
http://namenode:50070/jmx
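The /jmx endpoint returns plain JSON, so a monitoring script needs nothing but the standard library. A minimal sketch, assuming a NameNode-style document (the bean and metric names below are typical but should be verified against your own /jmx output):

```python
import json
from urllib.request import urlopen  # used against a live cluster

def pick_metric(jmx_json, bean_suffix, metric):
    """Return one metric from a NameNode-style /jmx JSON document."""
    doc = json.loads(jmx_json)
    for bean in doc.get("beans", []):
        if bean.get("name", "").endswith(bean_suffix):
            return bean.get(metric)
    return None

# Against a live cluster you would fetch it, e.g.:
#   jmx = urlopen("http://namenode:50070/jmx").read().decode()
# Here, a tiny hand-made sample of the shape the endpoint returns:
sample = json.dumps({
    "beans": [
        {"name": "Hadoop:service=NameNode,name=FSNamesystem",
         "CapacityRemaining": 123456789}
    ]
})
print(pick_metric(sample, "name=FSNamesystem", "CapacityRemaining"))
```

Feed the returned number into whatever alerting you already run; no JMX client libraries required.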
User management
HDFS quotas
Access controls – internal and external
MR schedulers: FIFO, Fair, Capacity
Kerberos can be used as well
Configuration
/etc/hadoop/conf
Lots of knobs! ¡Ojo! –
Lots of overrides
Get the basic system solid before security and performance
Watch the units – some are in megabytes but some are in bytes!
Have canary jobs
Ensure the same configs are everywhere (including uniform dns/host entries)
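A canary job can be as simple as a cron-driven wrapper that runs a known-good job and alerts on a nonzero exit code or a hang – the same failure signals listed earlier. A minimal sketch; the actual yarn command line is cluster-specific and left as an assumption:

```python
import subprocess
import sys

def run_canary(cmd, timeout_sec=600):
    """Run a known-good job; True only if it exits 0 within the timeout."""
    try:
        result = subprocess.run(cmd, timeout=timeout_sec,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except (subprocess.TimeoutExpired, OSError):
        return False  # hung, stuck, or missing binary counts as a failure too
    return result.returncode == 0

# In real use, something like (jar path is cluster-specific):
#   run_canary(["yarn", "jar", ".../hadoop-mapreduce-examples.jar", "pi", "10", "100"])
# Demonstrated here with trivial stand-in commands:
print(run_canary([sys.executable, "-c", "pass"]))                 # True
print(run_canary([sys.executable, "-c", "raise SystemExit(1)"]))  # False
```

Run it from cron on a schedule and page when it returns False; a canary that also measures wall-clock time will catch slow clusters, not just broken ones.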
Want more?
(Disclaimer: I receive nothing from O’Reilly. Not even a Christmas card…)
Fin
Thanks for listening
Slides:
http://www.slideshare.net/ericwilliammarshall/hadoop-for-sysadmins
Any questions?
What’s in a name?
Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
What to do?
Combinations of the usual stuff:
Numerical summarizations
Filtering
Altering
Data organization
Joining data
I/O
Federation
(Image from Chapter 2 of Eric Sammer’s Hadoop Operations by O’Reilly)