Big Data in Container; Hadoop Spark in Docker and Mesos

Big Data in Container. Heiko Loewe, @loeweh. Meetup Big Data Hadoop & Spark NRW, 08/24/2016


TRANSCRIPT

Page 1: Big Data in Container; Hadoop Spark in Docker and Mesos

1

Big Data in Container
Heiko Loewe @loeweh
Meetup Big Data Hadoop & Spark NRW 08/24/2016

Page 2: Big Data in Container; Hadoop Spark in Docker and Mesos

2

Why
• Fast Deployment
• Test/Dev Cluster
• Better Utilize Hardware
• Learn to manage Hadoop
• Test new Versions
• An appliance for continuous integration/API testing

Page 3: Big Data in Container; Hadoop Spark in Docker and Mesos

3

Design

Master Container
- Name Node
- Secondary Name Node
- Yarn

Slave Container (x4)
- Node Manager
- Data Node

Page 4: Big Data in Container; Hadoop Spark in Docker and Mesos

4

More than 1 host needs an overlay network; the docker0 interface is not routed.

Overlay Network

A 1-host config is (almost) no problem. For 2 hosts and more we need an overlay network.

Page 5: Big Data in Container; Hadoop Spark in Docker and Mesos

5

Choice of the Overlay Network Implementation

Docker Multi-Host Network
• Backend: VXLAN.
• Fallback: none.
• Control plane: built-in, uses Zookeeper, Consul or Etcd for shared state.

CoreOS Flanneld
• Backend: VXLAN, AWS, GCE.
• Fallback: custom UDP-based tunneling.
• Control plane: built-in, uses Etcd for shared state.

Weave Net
• Backend: VXLAN via OVS.
• Fallback: custom UDP-based tunneling called "sleeve".
• Control plane: built-in.

Page 6: Big Data in Container; Hadoop Spark in Docker and Mesos

6

WEAVE NET

Normal mode of operation is called FDP (fast data path), which works via OVS's data path kernel module (mainline since 3.12). It's just another VXLAN implementation.

Has a sleeve fallback mode, which works in userspace via pcap. Sleeve supports full encryption.

Weaveworks also has Weave DNS, Weave Scope and Weave Flux, providing introspection, service discovery & routing capabilities on top of Weave Net.

Page 7: Big Data in Container; Hadoop Spark in Docker and Mesos

7

Docker Adaption (Fedora/CentOS/RHEL)

/etc/sudoers # at the end:
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path: append /usr/local/bin for weave
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin

sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"

Page 8: Big Data in Container; Hadoop Spark in Docker and Mesos

8

Weave Problems on Fedora/CentOS/RHEL

WARNING: existing iptables rule
'-A FORWARD -j REJECT --reject-with icmp-host-prohibited'
will block name resolution via weaveDNS - please reconfigure your firewall.

sudo systemctl stop firewalld
sudo systemctl disable firewalld

/sbin/iptables -D FORWARD -j REJECT --reject-with icmp-host-prohibited
/sbin/iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
iptables-save
reboot

Page 9: Big Data in Container; Hadoop Spark in Docker and Mesos

9

Weave Run

[vuser@linux ~]$ ifconfig | grep -v "^ "
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[vuser@linux ~]$ sudo weave launch
[vuser@linux ~]$ eval $(sudo weave env)
[vuser@linux ~]$ sudo weave --local expose
10.32.0.6

WEAVE Container:
[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0fd6ab928d96 weaveworks/plugin:1.6.1 "/home/weave/plugin" 11 seconds ago Up 8 seconds weaveplugin
4b24e5802fcc weaveworks/weaveexec:1.6.1 "/home/weave/weavepro" 13 seconds ago Up 10 seconds weaveproxy
c4882326398a weaveworks/weave:1.6.1 "/home/weave/weaver -" 18 seconds ago Up 15 seconds weave

WEAVE Interfaces:
[vuser@linux ~]$ ifconfig | grep -v "^ "
datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
vethwe-bridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
vethwe-datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
vxlan-6784: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65485
weave: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410

Page 10: Big Data in Container; Hadoop Spark in Docker and Mesos

10

Hadoop Container Docker File
https://github.com/kiwenlau/hadoop-cluster-docker/blob/master/Dockerfile

FROM ubuntu:14.04
# install openssh-server, openjdk and wget
# install hadoop 2.7.2
# set environment variables
# ssh without key
# set up Hadoop directories
# copy config files from local
# make Hadoop start files executable
# format namenode
# standard run command
CMD [ "sh", "-c", "service ssh start; bash"]
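A minimal sketch of what those commented steps expand to. This is not the exact file from the linked repository; package names, the archive URL, and the config-file layout are assumptions for illustration.

```dockerfile
FROM ubuntu:14.04

# install openssh-server, openjdk and wget
RUN apt-get update && \
    apt-get install -y openssh-server openjdk-7-jdk wget

# install hadoop 2.7.2
RUN wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.2/hadoop-2.7.2.tar.gz && \
    tar -xzf hadoop-2.7.2.tar.gz -C /usr/local/ && \
    mv /usr/local/hadoop-2.7.2 /usr/local/hadoop && \
    rm hadoop-2.7.2.tar.gz

# set environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 \
    HADOOP_HOME=/usr/local/hadoop \
    PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

# ssh without key
RUN ssh-keygen -t rsa -f /root/.ssh/id_rsa -P '' && \
    cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys

# copy config files from local (core-site.xml, hdfs-site.xml, ... assumed to exist)
COPY config/ /usr/local/hadoop/etc/hadoop/

# format namenode
RUN /usr/local/hadoop/bin/hdfs namenode -format

# standard run command
CMD [ "sh", "-c", "service ssh start; bash"]
```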

$ docker build -t loewe/hadoop:latest .

Page 11: Big Data in Container; Hadoop Spark in Docker and Mesos

11

Start Hadoop Container

Host 1
• Master
$ sudo weave run -itd -p 8088:8088 -p 50070:50070 --name hadoop-master loewe/hadoop:latest
• Slaves 1,2
$ sudo weave run -itd --name hadoop-slave1 loewe/hadoop:latest
$ sudo weave run -itd --name hadoop-slave2 loewe/hadoop:latest

Host 2
• Slaves 3,4
$ sudo weave run -itd --name hadoop-slave3 loewe/hadoop:latest
$ sudo weave run -itd --name hadoop-slave4 loewe/hadoop:latest

root@boot2docker:~# weave status dns
hadoop-master 10.32.0.1 6a4db5f52340 92:64:f5:c5:57:a7
hadoop-slave1 10.32.0.2 34e0a7de1105 92:64:f5:c5:57:a7
hadoop-slave2 10.32.0.3 d879f077cf4e 92:64:f5:c5:57:a7
hadoop-slave3 10.44.0.0 6ca7ddb9daf8 92:56:f4:98:36:b0
hadoop-slave4 10.44.0.1 c1ed48630b1c 92:56:f4:98:36:b0
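The per-slave commands follow a simple pattern, so a small POSIX-shell helper can generate them for any cluster size. The container names, ports and image tag follow the slides; the helper itself is only an illustration:

```shell
#!/bin/sh
# Print the weave run commands for one master and N slave containers.
gen_cluster_cmds() {
    n=$1
    echo "sudo weave run -itd -p 8088:8088 -p 50070:50070 --name hadoop-master loewe/hadoop:latest"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "sudo weave run -itd --name hadoop-slave$i loewe/hadoop:latest"
        i=$((i + 1))
    done
}

gen_cluster_cmds 4
```

Running the script prints one master line followed by four slave lines, matching the 5-node layout on the next slide.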

Page 12: Big Data in Container; Hadoop Spark in Docker and Mesos

12

Hadoop Cluster / 2 Hosts / 5 Nodes

Page 13: Big Data in Container; Hadoop Spark in Docker and Mesos

13


Persistent Volumes for HDFS

Page 14: Big Data in Container; Hadoop Spark in Docker and Mesos

14

The Problem

• Containers (like Docker) are the foundation for agile software development
• The initial container design was stateless (12-factor app)
• Use cases have grown in the last few months (NoSQL, stateful apps)
• Persistence for containers is not easy

Page 15: Big Data in Container; Hadoop Spark in Docker and Mesos

15

DOCKER Volume Manager API

• Enables persistence of Docker volumes
• Enables the implementation of
  - Fast bytes (performance)
  - Data services (protection / snapshots)
  - Data mobility
  - Availability
• Operations:
  - Create, Remove, Mount, Path, Unmount
  - Additional options can be passed to the volume driver
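Those operations map directly onto the docker CLI. A minimal sketch, using the built-in `local` driver so the commands work without extra software; a storage volume driver (REX-Ray is one plugin covering the EMC and cloud backends listed on the next slide, named here only as an illustration) would be substituted via `-d`, with `-o` passing driver-specific options:

```shell
# Create: a named volume via the chosen volume driver (here: local)
docker volume create -d local hdfs-data
# Path: inspect shows the mountpoint the driver reports
docker volume inspect hdfs-data
# Mount: -v attaches the volume into a container at run time
docker run -d --name datanode1 -v hdfs-data:/hadoop/dfs/data ubuntu:14.04 sleep 3600
# Unmount happens when the container stops; Remove deletes the volume
docker rm -f datanode1
docker volume rm hdfs-data
```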

Page 16: Big Data in Container; Hadoop Spark in Docker and Mesos

16

Persistent Volumes for CONTAINER

(Diagram: a Docker host; the container OS mounts storage at /mnt/PersistentData, and containers bind-mount it via -v /mnt/PersistentData:/mnt/ContainerData. Automation??)

Page 17: Big Data in Container; Hadoop Spark in Docker and Mesos

17

Persistent Volumes for CONTAINER

(Diagram: the same Docker host setup; storage at /mnt/PersistentData, containers bind-mount it via -v /mnt/PersistentData:/mnt/ContainerData.)

Page 18: Big Data in Container; Hadoop Spark in Docker and Mesos

18

Persistent Volumes for CONTAINER

Storage platforms: AWS EC2 (EBS), OpenStack (Cinder), EMC Isilon, EMC ScaleIO, EMC VMAX, EMC XtremIO, Google Compute Engine (GCE), VirtualBox
Operating systems: Ubuntu, Debian, RedHat, CentOS, CoreOS, OSX, TinyLinux (boot2docker)
Integration: Docker Volume API, Mesos Isolator, ...

Page 19: Big Data in Container; Hadoop Spark in Docker and Mesos

19

Hadoop + persistent Volumes

Host A

Making the Hadoop container ephemeral

Page 20: Big Data in Container; Hadoop Spark in Docker and Mesos

20

Stretch Hadoop w/ persistent Volumes

Overlay Network spanning Host A and Host B

Easily stretch and shrink a cluster without losing the data

Page 21: Big Data in Container; Hadoop Spark in Docker and Mesos

21

Other similar Projects

• Big Top Provisioner / Apache Foundation
  https://github.com/apache/bigtop/tree/master/provisioner/docker
• Building Hortonworks HDP on Docker
  http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
  https://hub.docker.com/r/hortonworks/ambari-server/
  https://hub.docker.com/r/hortonworks/ambari-agent/
• Building Cloudera CDH on Docker
  http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/
  https://hub.docker.com/r/cloudera/quickstart/
  Watch out for overlay network topics

Page 22: Big Data in Container; Hadoop Spark in Docker and Mesos

22

Apache Myriad

Page 23: Big Data in Container; Hadoop Spark in Docker and Mesos

23

Myriad Overview
• Mesos framework for Apache YARN
• Mesos manages the DC, YARN manages Hadoop
• Coarse- and fine-grained resource sharing

Page 24: Big Data in Container; Hadoop Spark in Docker and Mesos

24

Situation without Integration

Page 25: Big Data in Container; Hadoop Spark in Docker and Mesos

25

Yarn/Mesos Integration

Page 26: Big Data in Container; Hadoop Spark in Docker and Mesos

26

How it works (simplified)

Myriad = Control Plane

Page 27: Big Data in Container; Hadoop Spark in Docker and Mesos

27

Myriad Container

Page 28: Big Data in Container; Hadoop Spark in Docker and Mesos

28

Page 29: Big Data in Container; Hadoop Spark in Docker and Mesos

29

Page 30: Big Data in Container; Hadoop Spark in Docker and Mesos

30

Page 31: Big Data in Container; Hadoop Spark in Docker and Mesos

31

What about the Data: Myriad only cares for the Compute

Master Container
- Name Node
- Secondary Name Node
- Yarn

Slave Container (x4)
- Node Manager
- Data Node

Myriad/Mesos cares about the compute side (Yarn and the Node Managers); the Name Node and Data Nodes have to be provided outside of Myriad/Mesos.

Page 32: Big Data in Container; Hadoop Spark in Docker and Mesos

32

What about the Data

• Myriad only cares for Compute / Map Reduce
• HDFS has to be provided in other ways

Big Data New Realities

Big Data Traditional Assumptions:
• Bare-metal
• Data locality
• Data on local disks

Big Data New Realities:
• Containers and VMs
• Compute and storage separation
• In-place access on remote data stores

New Benefits and Value:
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights

Page 33: Big Data in Container; Hadoop Spark in Docker and Mesos

33

Options for HDFS Data Layer

• Pure HDFS cluster (only Data Node running)
  - Bare metal
  - Containerized
  - Mesos based
• Enterprise HDFS array
  - EMC Isilon

Page 34: Big Data in Container; Hadoop Spark in Docker and Mesos

34

Myriad, Mesos, EMC Isilon for HDFS

Page 35: Big Data in Container; Hadoop Spark in Docker and Mesos

35

EMC Isilon Advantages over classic Hadoop HDFS

• Multi-tenancy
• Multiple HDFS environments sharing the same storage
• Quotas possible on HDFS environments
• Snapshots of HDFS environments possible
• Remote replication
• WORM option for HDFS
• Highly available HDFS infrastructure (distributed Name and Data Nodes)
• Storage efficient (usable/raw 0.8 compared to 0.33 with Hadoop)
• Shared access via HDFS / CIFS / NFS / SFTP possible
• Maintenance equals enterprise array standard
• All major distributions supported
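The storage-efficiency bullet translates directly into usable capacity. A quick back-of-the-envelope check, where the 300 TB raw figure is just an example:

```shell
#!/bin/sh
# Usable capacity from 300 TB raw, per the usable/raw ratios above.
raw=300
hdfs_usable=$((raw / 3))          # classic HDFS: 3-way replication, ~0.33 usable/raw
isilon_usable=$((raw * 8 / 10))   # Isilon erasure coding, ~0.8 usable/raw
echo "HDFS: ${hdfs_usable} TB usable, Isilon: ${isilon_usable} TB usable"
```

So the same 300 TB of raw disk yields 100 TB usable under 3x replication but 240 TB on the array.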

Page 36: Big Data in Container; Hadoop Spark in Docker and Mesos

36


Spark on Mesos

Page 37: Big Data in Container; Hadoop Spark in Docker and Mesos

37

Common Deployment Patterns

Most Common Spark Deployment Environments (Cluster Managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%

Source: Spark Survey Report, 2015 (Databricks)

Page 38: Big Data in Container; Hadoop Spark in Docker and Mesos

38

Spark Cluster – Standalone Mode

(Diagram: a Spark Client talks to the Spark Master; three Spark Slaves, each a virtual machine on bare metal, run the tasks. Data is provided outside.)

Page 39: Big Data in Container; Hadoop Spark in Docker and Mesos

39

Spark Cluster – Hadoop YARN

(Diagram: a Spark Client talks to the Spark Master / Resource Manager; Spark Executors run tasks inside the Node Managers. Data is provided by the Hadoop cluster.)

Page 40: Big Data in Container; Hadoop Spark in Docker and Mesos

40

Spark Cluster – Mesos

(Diagram: a Spark Client talks to the Spark Scheduler and the Mesos Master; Spark Executors run tasks inside the Mesos Slaves. Data is provided outside.)
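From the client's point of view, the three cluster managers above differ mainly in the --master URL handed to spark-submit. A sketch using the SparkPi example that ships with every Spark distribution; the host names and the examples-jar path are assumptions (7077 and 5050 are the default standalone and Mesos ports):

```shell
# Path to the bundled examples jar varies by Spark version.
EXAMPLES_JAR="$SPARK_HOME/lib/spark-examples.jar"

# Standalone mode
spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://spark-master:7077 "$EXAMPLES_JAR" 100

# Hadoop YARN (resource manager address is read from HADOOP_CONF_DIR)
spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster "$EXAMPLES_JAR" 100

# Mesos
spark-submit --class org.apache.spark.examples.SparkPi \
    --master mesos://mesos-master:5050 "$EXAMPLES_JAR" 100
```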

Page 41: Big Data in Container; Hadoop Spark in Docker and Mesos

41

Spark + Mesos + EMC Isilon
To solve the HDFS Data Layer

Page 42: Big Data in Container; Hadoop Spark in Docker and Mesos

42

Thank You
Follow me on Twitter: @loeweh