Big Data in Containers: Hadoop and Spark in Docker and Mesos


Hadoop in Containers

Big Data in Containers
Heiko Loewe, @loeweh
Meetup Big Data Hadoop & Spark NRW, 08/24/2016

#Why
- Fast deployment
- Test/dev clusters
- Better hardware utilization
- Learn to manage Hadoop
- Test new versions
- An appliance for continuous integration / API testing

#Design

#More Than One Host Needs an Overlay Network

The docker0 interface is not routed.

Overlay Network

A one-host config is (almost) no problem. For two hosts and more, we need an overlay network.

#Choice of the Overlay Network Implementation

Docker Multi-Host Network
Backend: VXLAN. Fallback: none. Control plane: built-in, uses Zookeeper, Consul or Etcd for shared state.

Weave Net
Backend: VXLAN via OVS. Fallback: custom UDP-based tunneling called sleeve. Control plane: built-in.

CoreOS Flanneld
Backend: VXLAN, AWS, GCE. Fallback: custom UDP-based tunneling. Control plane: built-in, uses Etcd for shared state.
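With Weave Net, peering two Docker hosts is one command per host; a minimal sketch (host names and the 192.168.1.10 address are placeholders):

# on host1 (192.168.1.10): start the weave router
host1$ sudo weave launch

# on host2: start the router and peer it with host1
host2$ sudo weave launch 192.168.1.10

# verify the mesh from either side
host1$ sudo weave status peers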

#

The normal mode of operation is called FDP (fast data path), which works via OVS's datapath kernel module (mainline since 3.12). It is just another VXLAN implementation. There is a sleeve fallback mode that works in user space via pcap.

Sleeve supports full encryption. Weaveworks also offers Weave DNS, Weave Scope and Weave Flux, which provide introspection, service discovery and routing capabilities on top of Weave Net.

WEAVE NET
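Which mode a connection actually uses can be checked per peer; a sketch (the peer address and output line are illustrative):

$ sudo weave status connections
-> 192.168.1.11:6783   established fastdp
# "sleeve" instead of "fastdp" means the UDP/pcap fallback is active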

#Docker Adaptation (Fedora/CentOS/RHEL)

/etc/sudoers (at the end):
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path: append /usr/local/bin for weave
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin

sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"

#Weave Problems on Fedora/CentOS/RHEL

WARNING: existing iptables rule

'-A FORWARD -j REJECT --reject-with icmp-host-prohibited'

will block name resolution via weaveDNS - please reconfigure your firewall.

sudo systemctl stop firewalld
sudo systemctl disable firewalld
/sbin/iptables -D FORWARD -j REJECT --reject-with icmp-host-prohibited
/sbin/iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
iptables-save
reboot

#Weave Run: WEAVE Containers and WEAVE Interfaces

[vuser@linux ~]$ ifconfig | grep -v "^ "
docker0: flags=4099 mtu 1500
enp3s0: flags=4163 mtu 1500
lo: flags=73 mtu 65536

[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[vuser@linux ~]$ sudo weave launch
[vuser@linux ~]$ eval $(sudo weave env)
[vuser@linux ~]$ sudo weave --local expose
10.32.0.6
[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0fd6ab928d96 weaveworks/plugin:1.6.1 "/home/weave/plugin" 11 seconds ago Up 8 seconds weaveplugin
4b24e5802fcc weaveworks/weaveexec:1.6.1 "/home/weave/weavepro" 13 seconds ago Up 10 seconds weaveproxy
c4882326398a weaveworks/weave:1.6.1 "/home/weave/weaver -" 18 seconds ago Up 15 seconds weave
[vuser@linux ~]$ ifconfig | grep -v "^ "
datapath: flags=4163 mtu 1410
docker0: flags=4099 mtu 1500
enp3s0: flags=4163 mtu 1500
lo: flags=73 mtu 65536
vethwe-bridge: flags=4163 mtu 1410
vethwe-datapath: flags=4163 mtu 1410
vxlan-6784: flags=4163 mtu 65485
weave: flags=4163 mtu 1410

#Hadoop Container Dockerfile
https://github.com/kiwenlau/hadoop-cluster-docker/blob/master/Dockerfile

FROM ubuntu:14.04
# install openssh-server, openjdk and wget
# install hadoop 2.7.2
# set environment variables
# ssh without key
# set up Hadoop directories
# copy config files from local
# make Hadoop start files executable
# format namenode
# standard run command
CMD [ "sh", "-c", "service ssh start; bash"]

$ docker build -t loewe/hadoop:latest .
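For reference, the comment outline above expands to roughly the following (a minimal sketch in the spirit of the referenced kiwenlau Dockerfile, not a verbatim copy; the local config/ directory with the Hadoop XML files is an assumption):

FROM ubuntu:14.04
# ssh daemon for Hadoop's start scripts, JDK and wget
RUN apt-get update && apt-get install -y openssh-server openjdk-7-jdk wget
# install hadoop 2.7.2 under /usr/local/hadoop
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz && \
    tar -xzf hadoop-2.7.2.tar.gz -C /usr/local && \
    mv /usr/local/hadoop-2.7.2 /usr/local/hadoop
# environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# passwordless ssh between the cluster members
RUN mkdir -p /root/.ssh && ssh-keygen -t rsa -f /root/.ssh/id_rsa -P '' && \
    cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
# copy config files from local, then format the namenode once at build time
COPY config/ /usr/local/hadoop/etc/hadoop/
RUN /usr/local/hadoop/bin/hdfs namenode -format
CMD [ "sh", "-c", "service ssh start; bash" ]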

#Start the Hadoop Containers

Host 1, master:
$ sudo weave run -itd -p 8088:8088 -p 50070:50070 --name hadoop-master loewe/hadoop

Host 1, slaves 1 and 2:
$ sudo weave run -itd --name hadoop-slave1 loewe/hadoop
$ sudo weave run -itd --name hadoop-slave2 loewe/hadoop

Host 2, slaves 3 and 4:
$ sudo weave run -itd --name hadoop-slave3 loewe/hadoop
$ sudo weave run -itd --name hadoop-slave4 loewe/hadoop

root@boot2docker:~# weave status dns
hadoop-master   10.32.0.1   6a4db5f52340   92:64:f5:c5:57:a7
hadoop-slave1   10.32.0.2   34e0a7de1105   92:64:f5:c5:57:a7
hadoop-slave2   10.32.0.3   d879f077cf4e   92:64:f5:c5:57:a7
hadoop-slave3   10.44.0.0   6ca7ddb9daf8   92:56:f4:98:36:b0
hadoop-slave4   10.44.0.1   c1ed48630b1c   92:56:f4:98:36:b0

#Hadoop Cluster / 2 Hosts / 5 Nodes

#Persistent Volumes for HDFS

#The Problem

- Containers (like Docker) are the foundation for agile software development
- The initial container design was stateless (12-factor apps)
- Use cases have grown in the last few months (NoSQL, stateful apps)
- Persistence for containers is not easy

#Docker Volume Manager API

- Enables persistence of Docker volumes
- Enables the implementation of fast bytes (performance), data services (protection / snapshots), data mobility and availability
- Operations: Create, Remove, Mount, Path, Unmount
- Additional options can be passed to the volume driver

#https://docs.docker.com/engine/extend/plugins_volume/

/VolumeDriver.Create
/VolumeDriver.Remove
/VolumeDriver.Mount
/VolumeDriver.Path
/VolumeDriver.Unmount
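Docker drives these endpoints by POSTing JSON to the plugin's Unix socket. A minimal sketch of the exchange, assuming a hypothetical plugin named "demo" listening on /run/docker/plugins/demo.sock:

# create a volume via the plugin
$ curl -s --unix-socket /run/docker/plugins/demo.sock \
    -H "Content-Type: application/json" \
    -d '{"Name": "hdfs-vol1", "Opts": {}}' \
    http://localhost/VolumeDriver.Create
{"Err": ""}

# mount it; the plugin answers with the host path Docker should bind-mount
$ curl -s --unix-socket /run/docker/plugins/demo.sock \
    -d '{"Name": "hdfs-vol1"}' \
    http://localhost/VolumeDriver.Mount
{"Mountpoint": "/var/lib/demo/volumes/hdfs-vol1", "Err": ""}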

Persistent Volumes for Containers

[Diagram: a Docker host running containers; persistent storage is bind-mounted from the container OS into each container via
-v /mnt/PersistentData:/mnt/ContainerData
Automation??]

#OK, so there really is a way to do this, but it means tons of work. These dev guys want everything instantly, and I am just one person. How should I be able to deliver this?

#

Persistent Volumes for Containers

Storage platforms: AWS EC2 (EBS), OpenStack (Cinder), EMC Isilon, EMC ScaleIO, EMC VMAX, EMC XtremIO, Google Compute Engine (GCE), VirtualBox
Operating systems: Ubuntu, Debian, RedHat, CentOS, CoreOS, OSX, TinyLinux (boot2docker)
Integration: Docker Volume API, Mesos Isolator, ...
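This support matrix corresponds to what EMC's REX-Ray volume driver offered at the time (an assumption here). With such a driver installed, provisioning external storage per container is two commands; volume name, size and image are illustrative:

# create a 16 GB volume through the driver
$ docker volume create --driver=rexray --name=hdfs-dn1 --opt=size=16

# start a container with the external volume mounted
$ docker run -d -v hdfs-dn1:/hadoop/dfs/data loewe/hadoop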

#

Hadoop + Persistent Volumes

Host A: making the Hadoop containers ephemeral

#

Stretch Hadoop with Persistent Volumes (Overlay Network)

Host A / Host B: easily stretch and shrink a cluster without losing the data

#Other Similar Projects

Big Top Provisioner / Apache Foundation
https://github.com/apache/bigtop/tree/master/provisioner/docker

Building Hortonworks HDP on Docker
http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
https://hub.docker.com/r/hortonworks/ambari-server/
https://hub.docker.com/r/hortonworks/ambari-agent/

Building Cloudera CDH on Docker
http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/
https://hub.docker.com/r/cloudera/quickstart/

Watch out for the overlay network topics.

#Apache Myriad

#Myriad Overview

- Mesos framework for Apache YARN
- Mesos manages the DC, YARN manages Hadoop
- Coarse- and fine-grained resource sharing

#

Situation without Integration

#

YARN/Mesos Integration

#How it works (simplified)

Myriad = Control Plane
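Myriad's control plane is driven through a small REST API for scaling NodeManager capacity up and down. A sketch based on the incubator-myriad cluster API; the host, port 8192, the "medium" profile name and the exact payload fields are assumptions:

# ask Myriad to launch one more NodeManager with the "medium" profile
$ curl -X PUT http://resourcemanager:8192/api/cluster/flexup \
    -H "Content-Type: application/json" \
    -d '{"instances": 1, "profile": "medium"}'

# release it again
$ curl -X PUT http://resourcemanager:8192/api/cluster/flexdown \
    -H "Content-Type: application/json" \
    -d '{"instances": 1, "profile": "medium"}'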

#

Myriad Container

#

#What About the Data

Myriad only cares about the compute.

[Diagram labels: "Myriad/Mesos cares about" vs. "has to be provided outside of Myriad/Mesos"]

#What About the Data

Myriad only cares about compute / MapReduce. HDFS has to be provided in other ways.

Big Data New Realities

Traditional assumptions:
- Bare metal
- Data locality
- Data on local disks

New realities:
- Containers and VMs
- Compute and storage separation
- In-place access on remote data stores

New benefits and value:
- Big-Data-as-a-Service
- Agility and cost savings
- Faster time-to-insights

#Options for the HDFS Data Layer

Pure HDFS cluster (only data nodes running):
- Bare metal
- Containerized
- Mesos-based

Enterprise HDFS array:
- EMC Isilon (see the sketch below)
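With an enterprise HDFS array, the data layer is just an HDFS endpoint the compute side points at; a sketch, the SmartConnect hostname being a placeholder:

# ad-hoc access to the remote HDFS endpoint
$ hdfs dfs -ls hdfs://isilon.example.com:8020/user

# or make it the cluster default in core-site.xml:
# <property>
#   <name>fs.defaultFS</name>
#   <value>hdfs://isilon.example.com:8020</value>
# </property>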

#Myriad, Mesos, EMC Isilon for HDFS

#EMC Isilon Advantages over Classic Hadoop HDFS

- Multi-tenancy: multiple HDFS environments sharing the same storage
- Quotas possible on HDFS environments
- Snapshots of HDFS environments possible
- Remote replication
- WORM option for HDFS
- Highly available HDFS infrastructure (distributed name and data nodes)
- Storage efficient (usable/raw 0.8, compared to 0.33 with Hadoop)
- Shared access via HDFS / CIFS / NFS / SFTP possible
- Maintenance equals enterprise array standard
- All major distributions supported

#Spark on Mesos

#

Common Deployment Patterns

Most common Spark deployment environments (cluster managers):
- Standalone mode: 48%
- YARN: 40%
- Mesos: 11%

Source: Spark Survey Report, 2015 (Databricks)

#Spark Cluster: Standalone Mode

[Diagram: a Spark client submits to the Spark master; Spark slaves on bare metal or virtual machines run the tasks. Data is provided outside.]
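Submitting a job against a standalone master would look like this (a sketch; the master host and the examples jar path are placeholders):

$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://spark-master:7077 \
    /path/to/spark-examples.jar 100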

#

Spark Cluster: Hadoop YARN

[Diagram: the Spark client submits to the YARN resource manager; node managers run the Spark executors and their tasks. Data is provided by the Hadoop cluster.]
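Against YARN there is no master URL to memorize; the client picks up the cluster location from the Hadoop config (a sketch; the config path is a placeholder):

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    /path/to/spark-examples.jar 100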

#

Spark Cluster: Mesos

[Diagram: the Spark client and the Spark scheduler talk to the Mesos master; Mesos slaves run the Spark executors and their tasks. Data is provided outside.]
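Against Mesos, the master URL points at the Mesos master or its ZooKeeper ensemble, and since Spark 1.4 executors can be pulled as a Docker image (a sketch; hosts and image name are placeholders):

$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos \
    --conf spark.mesos.executor.docker.image=loewe/spark:latest \
    /path/to/spark-examples.jar 100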

#Sp