Deploying Hadoop-Based Bigdata Environments: “[Tall] Tales From The Frontier”

Roman Shaposhnik, [email protected], Cloudera Inc.

Upload: puppet-labs

Post on 27-Jan-2015

Category: Technology


DESCRIPTION

Roman Shaposhnik of Cloudera and the Apache Software Foundation talks on "Deploying Hadoop-Based Bigdata Environments: [Tall] Tales from the Frontier" at Puppet Camp Silicon Valley 2012.

TRANSCRIPT

Page 1: Deploying Hadoop-Based Bigdata Environments

Roman Shaposhnik, [email protected], Cloudera Inc.

Deploying Hadoop-Based Bigdata Environments

“[Tall] Tales From The Frontier”

Page 2: Deploying Hadoop-Based Bigdata Environments

$ whoami

An open source software developer: Linux kernel, C/C++ compilers, FFmpeg, Plan9

A Hadoop and all-around UNIX guy: root@cloudera

Member of the “Kitchen” team

Apache Software Foundation: Incubator PMC [Bigtop], Hadoop Development Tools, Celix, Helix

VP of Apache Bigtop

Page 3: Deploying Hadoop-Based Bigdata Environments

ZooKeeper (coordination)

HUE (web based UI)

HBase, YARN/MR1

HDFS (filesystem)

Pig (DQL) Hive (SQL) Impala (SQL)

Oozie

Page 5: Deploying Hadoop-Based Bigdata Environments

It is a jungle out there

ZooKeeper

Hadoop

HDFS

YARN

MR1

HTTPFS

HBase

Pig

Hive

Impala

Sqoop

Oozie

Whirr

Mahout

Flume

Giraph

Hama

Hue

Solr

Crunch

JDK/JRE

Kerberos

Ganglia

Nagios

JSVC

Tomcat

Utils

PostgreSQL

HTTPD

Page 6: Deploying Hadoop-Based Bigdata Environments

And the answer is:

Puppet[forge]

Page 7: Deploying Hadoop-Based Bigdata Environments

One way of using Apache software

$ wget http://apache.org/httpd.tar.gz

$ tar xzvf httpd.tar.gz

$ cd httpd

$ ./configure ; make

$ make install

ERROR: can't write to /usr/local/bin

$ sudo make install

Page 8: Deploying Hadoop-Based Bigdata Environments

A different way

$ sudo apt-get install httpd

Would you like to also upgrade your conf?

Page 9: Deploying Hadoop-Based Bigdata Environments

Is there an apt-get install hadoop?

Hadoop is still under very active development

Hadoop is Java based

Hadoop is a distributed application

Hadoop is way more than HDFS + MR

Page 10: Deploying Hadoop-Based Bigdata Environments

Project-by-project approach

“Passively” maintained code: packaging, OS-level (init.d)

Developer-centric view: edit-compile-debug cycle vs. deployment; lack of integration testing

Differences in distributions/packaging: where is this valid: /usr/libexec?

Combinatoric explosion of dependencies
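The /usr/libexec question on this slide can be made concrete with a small probe. This is only a sketch: the function name `first_existing_dir` and the fallback list are assumptions for illustration, not something taken from Bigtop or from any distribution's packaging policy.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate directory that exists,
# or return non-zero if none of them does.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Probe for a helper-binary directory instead of hard-coding /usr/libexec.
# The candidate list below is an assumption made for this example.
helper_dir=$(first_existing_dir /usr/libexec /usr/lib) || helper_dir=""
echo "helper directory: ${helper_dir:-none found}"
```

A package or init script that probes like this stays portable across layouts, at the cost of pushing the layout decision to install time or run time.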

Page 11: Deploying Hadoop-Based Bigdata Environments

Dependencies Inferno:

Hive 0.8.1

Hadoop (1.0, 0.22, 0.23)

HBase (0.92, 0.90)

A million dollar question:

$ tar xzvf hive-0.8.1.tar.gz

$ ls hive-0.8.1/lib

Page 12: Deploying Hadoop-Based Bigdata Environments

Dependencies Inferno:

Hive 0.8.1

Hadoop (1.0, 0.22, 0.23)

HBase (0.92, 0.90)

A million dollar question:

$ tar xzvf hive-0.8.1.tar.gz

$ ls hive-0.8.1/lib

hbase-0.89.jar log4j-1.2.15.jar log4j-1.2.16.jar
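The listing above shows the problem in miniature: the bundled HBase jar matches neither HBase version named on the slide, and log4j ships twice. A minimal sketch of how such duplicates could be flagged follows. The function name `duplicate_jars` is hypothetical, and the sketch assumes jars are named `<artifact>-<version>.jar` with a version that starts with a digit.

```shell
#!/bin/sh
# Hypothetical helper: print artifact names that occur with more than one
# version in a directory of jars named <artifact>-<version>.jar.
duplicate_jars() {
    for jar in "$1"/*.jar; do
        basename "$jar"
    done \
        | sed -n -E 's/^(.+)-[0-9][0-9A-Za-z.]*\.jar$/\1/p' \
        | sort \
        | uniq -d
}

# Recreate the listing from the slide in a scratch directory.
libdir=$(mktemp -d)
touch "$libdir/hbase-0.89.jar" \
      "$libdir/log4j-1.2.15.jar" \
      "$libdir/log4j-1.2.16.jar"

duplicate_jars "$libdir"   # prints: log4j
```

Jars whose names do not follow the assumed pattern are silently skipped, so this is a smoke test rather than a real dependency audit.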

Page 13: Deploying Hadoop-Based Bigdata Environments

Remember what Debian did to Linux?

GNU Software

Linux kernel

Page 14: Deploying Hadoop-Based Bigdata Environments

Bigtop is trying to do it with Hadoop

Hadoop Ecosystem (Pig, Hive, Mahout)

Hadoop (HDFS + MR)

Linux kernel

CDH4 beta 1

Page 15: Deploying Hadoop-Based Bigdata Environments

What's there in Bigtop

Build/Packaging infrastructure: RPM, DEB, (tarballs, homebrew/MacPorts); VirtualBox, VMware and KVM VMs; Fedora, OpenSUSE, Mageia, CentOS, Ubuntu

Puppet deployment infrastructure

Integration test infrastructure (iTest)

Bigtop Jenkins: http://bigtop01.cloudera.org:8080

Page 16: Deploying Hadoop-Based Bigdata Environments

And the answer is:

Puppet[Bigtop]

Page 17: Deploying Hadoop-Based Bigdata Environments

System software deployment

Packages vs. Puppet code: package/file/service

What is packaging? Dependency tracking, build encapsulation, Java packaging, file layout, user creation, service registration

Page 18: Deploying Hadoop-Based Bigdata Environments

Does it really work?

Java packaging: Maven/Ivy integration

File layout: side-by-side installations of the same package

User creation: LDAP/AD provisioning

Service registration: start on install vs. start on reboot

Page 19: Deploying Hadoop-Based Bigdata Environments

Petascale distributed systems

Scale: Yahoo! ~5000 nodes

Deployment orchestration:

Kerberos::Host_keytab <| title == "hdfs" |> -> Service["hadoop-hdfs-datanode"]

Highly coordinated distributed system: it ain't HTTPD/loadbalancer; rolling upgrades/asynchronous rollbacks

Page 20: Deploying Hadoop-Based Bigdata Environments

Back to tarballs and shell?

What's better for Puppet: fpm or rpm?

What is the role of Puppet? Coordinating the entire system: lack of DSL. Converging an isolated node: will it ever work? A building block for an agent-based system.

One agent to rule them all? There's no spoon^H^H^H^H^Hagent: Whirr, MCollective, Cloudera Manager, Ambari

Page 21: Deploying Hadoop-Based Bigdata Environments

Evolution, not perfection!

Minimalistic, highly consistent packages: /usr/lib/hadoop, /etc/hadoop/conf (alternative); fail gracefully ( .... || : ); Java packaging is not solved [yet]: symlinks

Minimalistic Puppet code: package/file/service; masterless (most of the time); integration with Whirr

BoxGrinder
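The `.... || :` fragment on this slide is the shell idiom for failing gracefully: `:` is the no-op builtin and always succeeds, so a failing step cannot abort the script around it. The sketch below assumes the surrounding script runs under `set -e`, as Debian maintainer scripts commonly do. `false` stands in for whatever the slide elides with `....`, which is not reconstructed here.

```shell
#!/bin/sh
# "false" is a stand-in for a scriptlet step that may legitimately fail,
# for example stopping a service that is not running.

# Without the guard, "set -e" aborts the script at the failing step,
# so "completed" is never printed.
unguarded=$(sh -c 'set -e; false; echo completed') || :

# With "|| :" the failure is swallowed and the rest of the script runs.
guarded=$(sh -c 'set -e; false || :; echo completed')

echo "unguarded: '$unguarded'"
echo "guarded:   '$guarded'"
```

The trade-off is that real errors get swallowed too, so the guard belongs only on steps whose failure is genuinely harmless.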

Page 22: Deploying Hadoop-Based Bigdata Environments

The road ahead

New kind of configuration management: /etc/hadoop vs. ZooKeeper

New kinds of system packaging: Parcels (tarballs + metadata), HPS (Hadoop Packaging System)

Orchestration: to puppet or not to puppet? Cloudera Manager, Apache Ambari (incubating), Reactor 8: http://reactor8.com

Page 23: Deploying Hadoop-Based Bigdata Environments

Java Packaging

Fate of Java: OpenJDK

OSGi: Hadoop's view: MAPREDUCE-1700, https://issues.apache.org/jira/browse/MAPREDUCE-1700

Project Jigsaw: language tie-ins? Really?

Linux vendors getting their act together

Page 24: Deploying Hadoop-Based Bigdata Environments

Integration testing

Clean room provisioning: those ain't unit tests, they trash the system

Cluster topology and cluster state discovery: how can Puppet help us?

Cluster state manipulation: test-driven orchestration, Chaos Monkey

How to be successful in OS co-opetition: make everything pluggable (and subvert ;-))

Page 25: Deploying Hadoop-Based Bigdata Environments

Anatomy of iTest

Versioned, JVM-based test/data artifacts

Dependency between test artifacts

Matching stack of integration tests

Implementation: Maven artifacts, pom files; JUnit test-execution entry point; Groovy for scripting

Page 26: Deploying Hadoop-Based Bigdata Environments

Who's the target audience

End users: YOU!

ASF projects/Bigdata developers: from Avro to ZooKeeper

Bigdata solutions vendors: Cloudera, EMC, Hortonworks, Karmasphere

DevOps: eBay, Yahoo, Facebook, LinkedIn

Page 27: Deploying Hadoop-Based Bigdata Environments

Who's on-board?

Cloudera: CDH4 is 100% based on Bigtop (Hadoop v2); available @cloudera.com

Canonical: Ubuntu Server: Hadoop and Bigdata blueprint, https://blueprints.launchpad.net/ubuntu/+spec/servercloud-p-hdp-hadoop

TrendMicro

Hortonworks (partially)

EMC, eBay (early stages of prototyping)

Page 28: Deploying Hadoop-Based Bigdata Environments

What's happening?

A special release: Bigtop 0.3.0-incubating (Hadoop 1.0.1)

Last stable release: Bigtop 0.5.0 (Hadoop 2.0.2-alpha)

Next stable release: Bigtop 0.6.0; end of Mar 2013 release; Hadoop 2.0.3-beta; major focus on developers

Page 29: Deploying Hadoop-Based Bigdata Environments

What does Bigtop need from you?

More of you! Meetup: “Silicon Valley Hands-on Programming”, http://www.meetup.com/HandsOnProgrammingEvents/

More infrastructure for build/test: EC2, Supercell, EMC magic cluster, CloudStack

More integration tests

Convince your bosses to commit to Bigtop

Validate upstream release using Bigtop

Page 30: Deploying Hadoop-Based Bigdata Environments

Contact

Bigtop home @Apache: http://incubator.apache.org/bigtop/

Hangout places: {dev,user}@bigtop.apache.org, #bigtop on Freenode

Roman Shaposhnik: [email protected], [email protected]