
Page 1: Hadoop for sysadmins

A whirlwind tour of hadoop

By Eric Marshall For LOPSA-NJ

an all too brief introduction to the world of big data

Page 2: Hadoop for sysadmins

Eric Marshall

I work for Airisdata; we’re hiring!

Smallest computer I lost sleep over: Timex Sinclair ZX81 – 1 KB of memory
Largest computer I lost sleep over: SGI Altix 4700 – 1 TB of memory

Page 3: Hadoop for sysadmins

Vocabulary disclaimer
Just like your favorite swear word, which can act as many parts of speech and refer to many a thing, Hadoop vocabulary has the same problem.
Casually, people use “hadoop” to mean the storage, the processing, the programming model(s), or the clustered machines themselves. The same problem exists for other terms in the lexicon, so ask me when I make less sense than usual.

Page 4: Hadoop for sysadmins

My plan of attack
An intro: the good, the bad and the ugly at 50,000 ft.
A 2¢ tour of Hadoop’s processing – MapReduce
A 2¢ tour of Hadoop’s storage – HDFS
A blitz tour of the rest of the Hadoop ecosystem

Page 5: Hadoop for sysadmins

Why did this happen?
Old school –> scale up == a larger, costlier monolithic system (or a small cluster thereof), i.e. vertical scaling
Different approach – all roads lead to scale out:
Assume failures
Smart software, cheap hardware
Don’t move data; bring the processing to the data

Page 6: Hadoop for sysadmins

The Good
Simple development (when compared to Message Passing Interface programming)
Scale – no shared state, programmers don’t need to know the topology, easy to add hardware
Automatic parallelization and distribution of tasks
Fault tolerance
Works with commodity hardware
Open source!

Page 7: Hadoop for sysadmins

The Bad
Not a silver bullet :(
MapReduce is batch data processing; the time scale is minutes to hours
MapReduce is overly simplified/abstracted – you are stuck with the M/R model and it is hard to work smarter
MapReduce is low level compared to high-level languages like SQL
Not all work decomposes well into parallelized M/R
Open source :)

Page 8: Hadoop for sysadmins

The Ugly?
Welcome to the rest of our talk!

First stop, MapReduce

Page 9: Hadoop for sysadmins

Hadoop’s MapReduce: Lisp’s map and reduce, plus the associative property, applied to clusters.

Page 10: Hadoop for sysadmins

Map()
Imagine a number of servers with lists of first names – what is the most popular name?
Box 1-isabella William ava mia Emma Alexander
Box 2-Noah NOAH Isabella Isabella emma Emma
Box 3-emma Emma Liam liam mason Isabella
Map() would apply a function to each element, independent of order. For example, capitalize each word.

(MapReduce is covered in greater detail in Chapter 2 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)

Page 11: Hadoop for sysadmins

Map()
So we would have:
Box 1-Isabella William Ava Mia Emma Alexander
Box 2-Noah Noah Isabella Isabella Emma Emma
Box 3-Emma Emma Liam Liam Mason Isabella
Map() could also apply a function to make pairs. For example, Isabella becomes (Isabella, 1).

Page 12: Hadoop for sysadmins

Map()
So we would have:

Box 1-(Isabella,1) (William,1) (Ava,1) (Mia,1) (Emma,1) (Alexander,1)

Box 2-(Noah,1) (Noah,1) (Isabella,1) (Isabella,1) (Emma,1) (Emma,1)

Box 3-(Emma,1) (Emma,1) (Liam,1) (Liam,1) (Mason,1) (Isabella,1)

Now we are almost ready for the reduce, but first the sort and shuffle

Page 13: Hadoop for sysadmins

Shuffle/Sort
So we would have:

Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)

Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)

Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)

Now for the reduce: our function would sum all of the 1s and return each name with its count.

Page 14: Hadoop for sysadmins

Reduce
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
Now for the reduce: our function sums all of the 1s and returns each name with its count:
Box 1-(Alexander,1) (Ava,1) (Emma,5)
Box 2-(Isabella,4)
Box 3-(Liam,2) (Mason,1) (Mia,1) (Noah,2) (William,1)
(See https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for something similar coded in Java)
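(As an aside, the whole pipeline has a rough local shell analogue – purely an illustration of the idea, not how Hadoop runs it; box1.txt, box2.txt and box3.txt are hypothetical files with one name per line:)

# "map": normalize the case of each name
# "shuffle/sort": bring identical names together
# "reduce": count each group
cat box1.txt box2.txt box3.txt | awk '{ print toupper(substr($0,1,1)) tolower(substr($0,2)) }' | sort | uniq -c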

Page 15: Hadoop for sysadmins

(This architecture is covered in greater detail in Chapter 4 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)

Page 16: Hadoop for sysadmins
Page 17: Hadoop for sysadmins

Map/reduce failures
Check the job if:
- The job throws an uncaught exception.
- The job exits with a nonzero exit code.
- The job fails to report progress to the tasktracker for a configurable amount of time (i.e. hung, stuck, slow).
Check the node if: the same node keeps killing jobs.
Check the JobTracker/RM if: jobs are lost or stuck and then they all fail.
(A few commands for poking at jobs and nodes are sketched below.)
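(A minimal sketch of that poking on a YARN-era cluster – stock commands; the application ID is a placeholder:)

yarn application -list                                      # what is running, pending, or stuck?
yarn logs -applicationId application_1453990054668_0001     # aggregated logs for a finished or failed job
yarn node -list -all                                        # any NodeManagers LOST or UNHEALTHY?
mapred job -list                                            # the MapReduce-level view of running jobs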

Page 18: Hadoop for sysadmins

Instant MR test
Um, is the system working?
yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

(your jar most likely will be somewhere else)
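(Another quick smoke test using the same examples jar – teragen writes data into HDFS and terasort exercises the shuffle harder than pi does; the output paths are placeholders:)

yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100000 /tmp/teragen
yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /tmp/teragen /tmp/terasort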

Page 19: Hadoop for sysadmins

(HDFS is covered in greater detail in Chapter 3 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)

Page 20: Hadoop for sysadmins

There is an HDFS CLI
You already know some of the commands:
hdfs dfs -ls /
hdfs dfs -du /
hdfs dfs -rm /
hdfs dfs -cat /
There are other modes than dfs: dfsadmin, namenode, datanode, fsck, zkfc, balancer, etc. (a few are sketched below)
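(A quick sketch of a few of those other modes – all standard commands, paths illustrative:)

hdfs dfsadmin -report          # capacity, plus live and dead datanodes
hdfs fsck / -files -blocks     # walk the namespace and report block health
hdfs balancer -threshold 10    # spread blocks until datanodes are within 10% of average utilization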

Page 21: Hadoop for sysadmins

HDFS failures
Jobs fail: due to missing blocks
Jobs fail: due to data being moved around (down datanodes or a huge ingest)
Without NN HA – a single point of failure for everything
Regular file system mayhem that you already know and love, plus the usual perms issues
(see the fsck sketch below)
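(A minimal sketch for chasing missing blocks – stock fsck options; the file path is a placeholder:)

hdfs fsck / -list-corruptfileblocks                 # which files have lost blocks?
hdfs fsck /path/to/file -files -blocks -locations   # where do this file's blocks (still) live?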


Page 23: Hadoop for sysadmins

The rest of the gardenDistributed Filesystems

- Apache HDFS

outliers:

- Tachyon

- Apache GridGain

- Ignite

- XtreemFS

- Ceph Filesystem

- Red Hat GlusterFS

- Quantcast File System QFS

- Lustre

Security

outliers:

- Apache Sentry

- Apache Knox Gateway

- Apache Ranger

Distributed Programming

- Apache MapReduce also MRv2/YARN

- Apache Pig

outliers:

- JAQL

- Apache Spark

- Apache Flink (formerly Stratosphere)

- Netflix PigPen

- AMPLab SIMR

- Facebook Corona

- Apache Twill

- Damballa Parkour

- Apache Hama

- Datasalt Pangool

- Apache Tez

- Apache Llama

- Apache DataFu

- Pydoop

- Kangaroo

- TinkerPop

- Pachyderm MapReduce

NewSQL Databases

outliers:

- TokuDB

- HandlerSocket

- Akiban Server

- Drizzle

- Haeinsa

- SenseiDB

- Sky

- BayesDB

- InfluxDB

NoSQL Databases

:Columnar Data Model

- Apache HBase

outliers:

- Apache Accumulo

- Hypertable

- HP Vertica

:Key Value Data Model

- Apache Cassandra

- Riak

- Redis

- Linkedin Voldemort

:Document Data Model

outliers:

- MongoDB

- RethinkDB

- ArangoDB

- CouchDB

:Stream Data Model

outliers:

- EventStore

:Key-Value Data Model

outliers:

- Redis DataBase

- Linkedin Voldemort

- RocksDB

- OpenTSDB

:Graph Data Model

outliers:

- Neo4j

- ArangoDB

- TitanDB

- OrientDB

- Intel GraphBuilder

- Giraph

- Pegasus

- Apache Spark

Scheduling

- Apache Oozie

outliers:

- Linkedin Azkaban

- Spotify Luigi

- Apache Falcon

Page 24: Hadoop for sysadmins

10 in 10 minutes!
Easier programming: Pig, Spark
SQL-like tools: Hive, Impala, HBase
Data pipefitting: Sqoop, Flume, Kafka
Book keeping: Oozie, Zookeeper

Page 25: Hadoop for sysadmins

Easier Programming

Page 26: Hadoop for sysadmins

Pig
What is it: a high-level language for data manipulation that abstracts away M/R, from Yahoo
Why: a few lines of code to munge data
Example:
filtered_words = FILTER words BY word MATCHES '\\w+';

word_groups = GROUP filtered_words BY word;

word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

(Pig is covered in greater detail in Alan Gates’s Programming Pig by O’Reilly, and in Chapter 16 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)

Page 27: Hadoop for sysadmins

Spark
What is it: a computing framework from AMPLab, UC Berkeley
Why: high-level abstractions and better use of memory
Neat trick: in-memory RDDs
Example:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

Or, in python:

>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

(Spark is covered in greater detail by Matei Zaharia et al. in Learning Spark by O’Reilly. Also of note is Advanced Analytics with Spark – it shows Spark’s capabilities well but moves too quickly to be truly useful. Spark is also covered in Chapter 19 of Tom White’s Hadoop – The Definitive Guide by O’Reilly, latest edition only.)

Page 28: Hadoop for sysadmins

SQL-ish

Page 29: Hadoop for sysadmins

Hive/HQL
What is it: a data warehouse infrastructure and query language from Facebook
Why: batched SQL queries against HDFS
Neat trick: stores metadata so you don’t have to
Example:
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;

hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

(Hive is covered in greater detail by Jason Rutherglen et al. in Programming Hive by O’Reilly. Instant Apache Hive Essentials How-To by Darren Lee, from Packt, was useful to me as a tutorial. Hive is also covered in Chapter 17 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)

Page 30: Hadoop for sysadmins

Impala
What is it: an SQL query engine from Cloudera
Why: fast ad hoc queries on subsets of data stored in Hadoop
Example:

[impala-host:21000] > select count(*) from customer_address;

(nada, let me know if you hit pay dirt)

Page 31: Hadoop for sysadmins

HBase
What is it: a non-relational database from Powerset
Why: fast access to large, sparse data sets
Example:

hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4170 seconds
=> Hbase::Table - test
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0850 seconds
hbase(main):006:0> scan 'test'
ROW                  COLUMN+CELL
 row1                column=cf:a, timestamp=1421762485768, value=value1

(HBase is covered in Chapter 20 of Tom White’s Hadoop – The Definitive Guide by O’Reilly, and in greater detail in Lars George’s HBase – The Definitive Guide by O’Reilly)

Page 32: Hadoop for sysadmins

Data pipefitting

Page 33: Hadoop for sysadmins

Sqoop
What is it: a glue tool for moving data between relational databases and Hadoop
Why: make the cumbersome easier
Example:

sqoop list-databases --connect jdbc:mysql://mysql/employees --username joe --password myPassword
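(A follow-on sketch – actually pulling a table into HDFS; the table name and target directory are placeholders:)

sqoop import --connect jdbc:mysql://mysql/employees --username joe --password myPassword \
  --table employees --target-dir /user/joe/employees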

(Sqoop is covered in greater detail in Tom White’s Hadoop – The Definitive Guide by O’Reilly. There is also a cookbook that covers a few worthy gotchas: Apache Sqoop Cookbook by Kathleen Ting, from O’Reilly)

Page 34: Hadoop for sysadmins

Flume
What is it: a service for collecting and aggregating logs
Why: because log ingestion is tougher than it seems
Example:

# Define a memory channel on agent called memory-channel.
agent.channels.memory-channel.type = memory

# Define a source on agent and connect to channel memory-channel.
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel

# Define a sink that outputs to logger.
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger

# Define a sink that outputs to hdfs.
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
agent.sinks.hdfs-sink.hdfs.fileType = DataStream

# Finally, activate.
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
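(To actually run the agent – a minimal sketch assuming the config above is saved as example.conf and flume-ng is on the PATH; the --name value must match the "agent" prefix used in the config:)

flume-ng agent --conf conf --conf-file example.conf --name agent -Dflume.root.logger=INFO,console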

(I haven’t read much on Flume; if you find something clever let me know!)

Page 35: Hadoop for sysadmins

Kafka
What is it: a message broker from LinkedIn
Why: fast handling of data feeds
Neat trick: no need to worry about missing data or double processing data
Example:
> bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
(I disliked the one book I read, but I found the online docs very readable: http://kafka.apache.org/
Also check out the design docs: http://kafka.apache.org/documentation.html#design )
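(One assumed prerequisite: the topic has to exist before the producer and consumer above will do anything useful – a sketch of the stock topic-creation command from the same era:)

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test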

Page 36: Hadoop for sysadmins

Book keeping

Page 37: Hadoop for sysadmins

Oozie
What is it: a workflow scheduler from Yahoo Bangalore
Why: because cron isn’t perfect
Example:

oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run

(Oozie is covered in greater detail in Islam & Srinivasan’s Apache Oozie: The Workflow Scheduler by O’Reilly)

Page 38: Hadoop for sysadmins

Zookeeper
What is it: a coordination service from Yahoo
Why: sync info for distributed systems (a similar idea to DNS or LDAP)
Example:

[zkshell: 14] set /zk_test junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5

[zkshell: 15] get /zk_test
junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
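(For context, a minimal sketch of getting into that shell and creating the znode in the first place – the server address is illustrative:)

zkCli.sh -server 127.0.0.1:2181
[zkshell: 9] create /zk_test my_data
Created /zk_test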

(Zookeeper is covered in greater detail in ZooKeeper: Distributed Process Coordination by O’Reilly, and in Chapter 21 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)

Page 39: Hadoop for sysadmins

Distributed Programming
- Apache MapReduce, also MRv2/YARN

- Apache Pig

outliers:

- JAQL

- Apache Spark

- Apache Flink (formerly Stratosphere)

- Netflix PigPen

- AMPLab SIMR

- Facebook Corona

- Apache Twill

- Damballa Parkour

- Apache Hama

- Datasalt Pangool

- Apache Tez

- Apache Llama

- Apache DataFu

- Pydoop

- Kangaroo

- TinkerPop

- Pachyderm MapReduce

Page 40: Hadoop for sysadmins

Distributed Filesystems
- Apache HDFS

outliers:

- Tachyon

- Apache GridGain

- Ignite

- XtreemFS

- Ceph Filesystem

- Red Hat GlusterFS

- Quantcast File System QFS

- Lustre

Page 41: Hadoop for sysadmins

NoSQL Databases
:Columnar Data Model

- Apache HBase

outliers:

- Apache Accumulo

- Hypertable

- HP Vertica

:Key Value Data Model

- Apache Cassandra

- Riak

- Redis

- Linkedin Voldemort

:Document Data Model

outliers:

- MongoDB

- RethinkDB

- ArangoDB

- CouchDB

:Stream Data Model

outliers:

- EventStore

:Key-Value Data Model

outliers:

- Redis DataBase

- Linkedin Voldemort

- RocksDB

- OpenTSDB

:Graph Data Model

outliers:

- Neo4j

- ArangoDB

- TitanDB

- OrientDB

- Intel GraphBuilder

- Giraph

- Pegasus

- Apache Spark

NewSQL Databases

outliers:

- TokuDB

- HandlerSocket

- Akiban Server

- Drizzle

- Haeinsa

- SenseiDB

- Sky

- BayesDB

- InfluxDB

Page 42: Hadoop for sysadmins

SQL on Hadoop

- Apache Hive

- Apache HCatalog

outliers:

- Cloudera Kudu

- Trafodion

- Apache Drill

- Cloudera Impala

- Facebook Presto

- Datasalt Splout SQL

- Apache Spark

- Apache Tajo

- Apache Phoenix

- Apache MRQL

- Kylin

Data Ingestion

- Apache Flume

- Apache Sqoop

outliers:

- Facebook Scribe

- Apache Chukwa

- Apache Storm

- Apache Kafka

- Netflix Suro

- Apache Samza

- Cloudera Morphline

- HIHO

- Apache NiFi

Page 43: Hadoop for sysadmins

Etc.

Service Programming and Frameworks

- Apache Zookeeper

- Apache Avro

- Apache Parquet

outliers:

- Apache Thrift

- Apache Curator

- Apache Karaf

- Twitter Elephant Bird

- Linkedin Norbert

Scheduling

- Apache Oozie

outliers:

- Linkedin Azkaban

- Spotify Luigi

- Apache Falcon

- Schedoscope

Security

outliers:

- Apache Sentry

- Apache Knox Gateway

- Apache Ranger

System Deployment

and Management

outliers:

- Apache Ambari

- Cloudera Manager

- Cloudera HUE

- Apache Whirr

- Apache Mesos

- Myriad

- Marathon

- Brooklyn

- Hortonworks HOYA

- Apache Helix

- Apache Bigtop

- Buildoop

- Deploop

Page 44: Hadoop for sysadmins

And now a bit of common sense for sys-admin-ing Hadoop clusters

Page 45: Hadoop for sysadmins

Avoid
The usual:
- Don’t let HDFS fill up
- Don’t use all the memory
- Don’t use up all the CPUs
- Don’t drop the network
- <insert fav disaster>
Resource exhaustion by users
Hardware failure (drives are the king of this domain)
(a few quick checks are sketched below)
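(A minimal sketch of the quick checks implied above – standard commands; the alerting thresholds are up to you:)

hdfs dfs -df -h /          # how full is HDFS overall?
hdfs dfsadmin -report      # per-datanode capacity and dead-node count
yarn node -list -all       # which NodeManagers are RUNNING, LOST, or UNHEALTHY?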

Page 46: Hadoop for sysadmins

Um, backups?
The usual suspects, plus the Namenode’s metadata!! (fsimage)
HDFS itself? Well, it would be nice, but unlikely (if so, distcp)
Snapshots
(see the sketch below)
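(A hedged sketch of all three – placeholder paths, and a second cluster reachable as nn2 is assumed for distcp:)

hdfs dfsadmin -fetchImage /backup/namenode/        # pull the current fsimage from the active NN
hadoop distcp hdfs://nn1:8020/data hdfs://nn2:8020/backup/data   # bulk copy to another cluster
hdfs dfsadmin -allowSnapshot /data                 # enable snapshots on a directory (once)
hdfs dfs -createSnapshot /data nightly             # then take one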

Page 47: Hadoop for sysadmins

Hadoop Management
Apache Ambari

Cloudera Manager

Page 48: Hadoop for sysadmins

Monitoring
The usual suspects, plus:
- JMX support
- JVM via jstat, jmap, etc.
- HDFS and MapReduce metrics via conf/hadoop-metrics.properties
- http://namenode:50070/
- http://namenode:50070/jmx
(see the sketch below)
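(A minimal sketch – the NameNode PID, port, and JMX bean name are illustrative; adjust for your distro:)

jstat -gcutil <namenode-pid> 5000       # GC pressure on the NameNode JVM, sampled every 5 seconds
jmap -heap <namenode-pid>               # heap layout and current usage
curl 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'   # scrape a single JMX bean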

Page 49: Hadoop for sysadmins

User management
- HDFS quotas
- Access controls, internal and external
- MR schedulers: FIFO, Fair, Capacity
- Kerberos can be used as well
(quota examples are sketched below)
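(A minimal quota sketch – the user directory and the limits are placeholders:)

hdfs dfsadmin -setQuota 1000000 /user/joe     # cap the number of names (files + directories)
hdfs dfsadmin -setSpaceQuota 10t /user/joe    # cap raw space, which counts replication
hdfs dfs -count -q -h /user/joe               # show quotas and current usage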

Page 50: Hadoop for sysadmins

Configuration
/etc/hadoop/conf
Lots of knobs! ¡Ojo! – lots of overrides
- Get the basic system solid before security and performance
- Watch the units – some are in megabytes but some are in bytes!
- Have canary jobs
- Ensure the same configs are everywhere (including uniform dns/host naming)
(see the sketch below)
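(A minimal sketch for checking what a node actually thinks a value is – dfs.blocksize is one of the byte-valued knobs:)

hdfs getconf -confKey dfs.blocksize    # effective value on this node, in bytes
hdfs getconf -namenodes                # which namenode(s) this node believes in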

Page 51: Hadoop for sysadmins

Want more?

(Disclaimer: I receive nothing from O’Reilly. Not even a Christmas card…)

Page 52: Hadoop for sysadmins

Fin
Thanks for listening!
Slides:

http://www.slideshare.net/ericwilliammarshall/hadoop-for-sysadmins

Any questions?

Page 53: Hadoop for sysadmins

What’s in a name?
Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used “Nutch” as the all-purpose word for a meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”

Page 54: Hadoop for sysadmins

What to do?
Combinations of the usual stuff:
- Numerical summarizations
- Filtering
- Altering
- Data organization
- Joining data
- I/O

Page 55: Hadoop for sysadmins

federation

(Image from Chapter 2 of Eric Sammer’s Hadoop Operations by O’Reilly)

Page 56: Hadoop for sysadmins