Big Data in Container; Hadoop Spark in Docker and Mesos

Post on 13-Apr-2017


TRANSCRIPT

Hadoop in Container / Big Data in Container
Heiko Loewe, @loeweh
Meetup Big Data Hadoop & Spark NRW, 08/24/2016

Why
- Fast deployment
- Test/dev clusters
- Better hardware utilization
- Learning to manage Hadoop
- Testing new versions
- An appliance for continuous integration / API testing

Design

More than one host needs an overlay network
- The docker0 interface is not routed.
- A single-host configuration is (almost) no problem.
- For two hosts and more we need an overlay network.

Choice of the overlay network implementation
- Docker Multi-Host Network - Backend: VXLAN. Fallback: none. Control plane: built-in, uses ZooKeeper, Consul or Etcd for shared state.
- CoreOS Flanneld - Backend: VXLAN, AWS, GCE. Fallback: custom UDP-based tunneling. Control plane: built-in, uses Etcd for shared state.
- Weave Net - Backend: VXLAN via OVS. Fallback: custom UDP-based tunneling called "sleeve". Control plane: built-in.

(Speaker note: Weave Net's normal mode of operation is called FDP, fast data path, which works via OVS's datapath kernel module (mainline since 3.12). It is just another VXLAN implementation. It has a "sleeve" fallback mode that works in userspace via pcap; sleeve supports full encryption.)
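Since both FDP and sleeve are VXLAN-style encapsulations, the overlay interfaces end up with a smaller MTU than the physical NIC. The arithmetic behind that is worth spelling out (standard IPv4 VXLAN header sizes; the exact value Weave picks also includes extra headroom):

```python
# Why overlay interfaces report a smaller MTU than the physical NIC:
# each inner frame is wrapped in outer Ethernet + IP + UDP + VXLAN headers.

ETHERNET_HDR = 14   # outer Ethernet header (bytes)
IPV4_HDR = 20       # outer IPv4 header
UDP_HDR = 8         # outer UDP header
VXLAN_HDR = 8       # VXLAN header

overhead = ETHERNET_HDR + IPV4_HDR + UDP_HDR + VXLAN_HDR
inner_mtu = 1500 - overhead

print(f"VXLAN encapsulation overhead: {overhead} bytes")      # 50
print(f"Max inner MTU on a 1500-byte link: {inner_mtu}")      # 1450
```

Weave actually configures its interfaces with MTU 1410 (visible in the `ifconfig` output further down), i.e. it reserves additional headroom below the theoretical 1450 maximum.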
(Speaker note, cont.: Weaveworks also has WeaveDNS, Weave Scope and Weave Flux, providing introspection, service discovery and routing capabilities on top of Weave Net.)

Weave Net

Docker adaptation (Fedora/CentOS/RHEL)

    # /etc/sudoers, at the end:
    vuser ALL=(ALL) NOPASSWD: ALL
    # secure_path: append /usr/local/bin for weave
    Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin

    sudo groupadd docker
    sudo gpasswd -a ${USER} docker
    sudo chgrp docker /var/run/docker.sock
    alias docker="sudo /usr/bin/docker"

Weave problems on Fedora/CentOS/RHEL

    WARNING: existing iptables rule
    '-A FORWARD -j REJECT --reject-with icmp-host-prohibited'
    will block name resolution via weaveDNS - please reconfigure your firewall.

    sudo systemctl stop firewalld
    sudo systemctl disable firewalld
    /sbin/iptables -D FORWARD -j REJECT --reject-with icmp-host-prohibited
    /sbin/iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
    iptables-save
    reboot

Weave run (Weave container and Weave interfaces)

    [vuser@linux ~]$ ifconfig | grep -v "^ "
    docker0: flags=4099  mtu 1500
    enp3s0:  flags=4163  mtu 1500
    lo:      flags=73    mtu 65536
    [vuser@linux ~]$ docker ps
    CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
    [vuser@linux ~]$ sudo weave launch
    [vuser@linux ~]$ eval $(sudo weave env)
    [vuser@linux ~]$ sudo weave expose
    10.32.0.6
    [vuser@linux ~]$ docker ps
    CONTAINER ID  IMAGE                       COMMAND                 CREATED         STATUS         NAMES
    0fd6ab928d96  weaveworks/plugin:1.6.1     "/home/weave/plugin"    11 seconds ago  Up 8 seconds   weaveplugin
    4b24e5802fcc  weaveworks/weaveexec:1.6.1  "/home/weave/weavepro"  13 seconds ago  Up 10 seconds  weaveproxy
    c4882326398a  weaveworks/weave:1.6.1      "/home/weave/weaver -"  18 seconds ago  Up 15 seconds  weave
    [vuser@linux ~]$ ifconfig | grep -v "^ "
    datapath:        flags=4163  mtu 1410
    docker0:         flags=4099  mtu 1500
    enp3s0:          flags=4163  mtu 1500
    lo:              flags=73    mtu 65536
    vethwe-bridge:   flags=4163  mtu 1410
    vethwe-datapath: flags=4163  mtu 1410
    vxlan-6784:      flags=4163  mtu 65485
    weave:           flags=4163  mtu 1410

Hadoop container Dockerfile
https://github.com/kiwenlau/hadoop-cluster-docker/blob/master/Dockerfile
    FROM ubuntu:14.04
    # install openssh-server, openjdk and wget
    # install hadoop 2.7.2
    # set environment variables
    # ssh without key
    # set up Hadoop directories
    # copy config files from local
    # make Hadoop start files executable
    # format namenode
    # standard run command
    CMD ["sh", "-c", "service ssh start; bash"]

    $ docker build -t loewe/hadoop:latest .

Start Hadoop containers

    Host 1 - master:
    $ sudo weave run -itd -p 8088:8088 -p 50070:50070 --name hadoop-master
    Host 1 - slaves 1, 2:
    $ sudo weave run -itd --name hadoop-slave1
    $ sudo weave run -itd --name hadoop-slave2
    Host 2 - slaves 3, 4:
    $ sudo weave run -itd --name hadoop-slave3
    $ sudo weave run -itd --name hadoop-slave4

    root@boot2docker:~# weave status dns
    hadoop-master  10.32.0.1  6a4db5f52340  92:64:f5:c5:57:a7
    hadoop-slave1  10.32.0.2  34e0a7de1105  92:64:f5:c5:57:a7
    hadoop-slave2  10.32.0.3  d879f077cf4e  92:64:f5:c5:57:a7
    hadoop-slave3  10.44.0.0  6ca7ddb9daf8  92:56:f4:98:36:b0
    hadoop-slave4  10.44.0.1  c1ed48630b1c  92:56:f4:98:36:b0

Hadoop cluster / 2 hosts / 5 nodes

Persistent volumes for HDFS

The problem
- Containers (like Docker) are the foundation for agile software development.
- The initial container design was stateless (12-factor app).
- Use cases have grown in the last few months (NoSQL, stateful apps).
- Persistence for containers is not easy.

Docker Volume Manager API
- Enables persistence of Docker volumes.
- Enables the implementation of fast bytes (performance), data services (protection/snapshots), data mobility, and availability.
- Operations: Create, Remove, Mount, Path, Unmount.
- Additional options can be passed to the volume driver.

https://docs.docker.com/engine/extend/plugins_volume/
    /VolumeDriver.Create
    /VolumeDriver.Remove
    /VolumeDriver.Mount
    /VolumeDriver.Path
    /VolumeDriver.Unmount

Persistent volumes for containers
(Diagram: a Docker host running containers on a container OS, with storage mounted at /mnt/PersistentData and passed into containers via "-v /mnt/PersistentData:/mnt/ContainerData". Automation??)

(Speaker note: Ok, so there really is a way to do this, but this means tons of work. These dev guys want everything instant, and I am just one person. How should I be able to deliver this?)
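The five /VolumeDriver.* operations above are simply JSON-over-HTTP POSTs that Docker sends to the plugin. A minimal in-memory sketch of the handler logic, purely illustrative (a real plugin listens on a Unix socket under /run/docker/plugins and provisions real storage; the volume name used here is made up):

```python
import os
import tempfile

# Toy volume driver state: volume name -> host mountpoint.
# A real driver would provision backend storage (EBS, ScaleIO, NFS, ...) here.
BASE = tempfile.mkdtemp(prefix="toyvols-")
volumes = {}

def create(req):    # POST /VolumeDriver.Create  {"Name": ..., "Opts": {...}}
    path = os.path.join(BASE, req["Name"])
    os.makedirs(path, exist_ok=True)
    volumes[req["Name"]] = path
    return {"Err": ""}

def mount(req):     # POST /VolumeDriver.Mount   {"Name": ...}
    if req["Name"] not in volumes:
        return {"Mountpoint": "", "Err": "no such volume"}
    return {"Mountpoint": volumes[req["Name"]], "Err": ""}

def path(req):      # POST /VolumeDriver.Path
    return {"Mountpoint": volumes.get(req["Name"], ""), "Err": ""}

def unmount(req):   # POST /VolumeDriver.Unmount - nothing to release in-memory
    return {"Err": ""}

def remove(req):    # POST /VolumeDriver.Remove
    volumes.pop(req["Name"], None)
    return {"Err": ""}

# The lifecycle Docker drives for "docker run -v hdfs-data:/mnt/ContainerData":
create({"Name": "hdfs-data", "Opts": {}})
resp = mount({"Name": "hdfs-data"})
print(resp["Mountpoint"])   # host path handed back to Docker for the bind mount
unmount({"Name": "hdfs-data"})
remove({"Name": "hdfs-data"})
```

The key point for the HDFS use case: because the plugin, not the container, owns the mountpoint, the data outlives any individual container.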
Persistent volumes for containers - volume driver ecosystem
- Storage platforms: AWS EC2 (EBS), OpenStack (Cinder), EMC Isilon, EMC ScaleIO, EMC VMAX, EMC XtremIO, Google Compute Engine (GCE), VirtualBox, ...
- Operating systems: Ubuntu, Debian, RedHat, CentOS, CoreOS, OSX, TinyLinux (boot2docker)
- Consumed via the Docker Volume API or a Mesos isolator.

Hadoop + persistent volumes
(Diagram, Host A: persistent volumes make the Hadoop containers themselves ephemeral.)

Stretch Hadoop with persistent volumes
(Diagram, Host A and Host B joined by the overlay network: easily stretch and shrink a cluster without losing the data.)

Other similar projects
- Big Top Provisioner / Apache Foundation: https://github.com/apache/bigtop/tree/master/provisioner/docker
- Building Hortonworks HDP on Docker: http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/ , https://hub.docker.com/r/hortonworks/ambari-server/ , https://hub.docker.com/r/hortonworks/ambari-agent/
- Building Cloudera CDH on Docker: http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/ , https://hub.docker.com/r/cloudera/quickstart/
Watch out for the overlay-network topics.

Apache Myriad

Myriad overview
- A Mesos framework for Apache YARN.
- Mesos manages the datacenter, YARN manages Hadoop.
- Coarse- and fine-grained resource sharing.

Situation without integration

YARN/Mesos integration

How it works (simplified): Myriad = control plane

Myriad container

What about the data?
- Myriad only cares about the compute side.
- HDFS has to be provided outside of Myriad/Mesos.
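The "coarse- and fine-grained resource sharing" bullet above can be made concrete with a toy model (purely illustrative numbers, not Myriad's actual scheduler code): in coarse-grained mode Myriad launches fixed-size NodeManagers, while in fine-grained mode NodeManager capacity expands to whatever the Mesos offers contain.

```python
# Toy model: cluster CPU capacity visible to YARN under the two sharing modes.

offers = [8, 4, 16]   # CPUs in incoming Mesos resource offers, one per agent

def coarse_grained(offers, nm_size=4):
    """Launch a fixed-size NodeManager on every agent whose offer fits it."""
    return sum(nm_size for cpus in offers if cpus >= nm_size)

def fine_grained(offers):
    """Launch flexible NodeManagers and expand them to the full offer."""
    return sum(offers)

print(coarse_grained(offers))  # 12: three fixed 4-CPU NodeManagers
print(fine_grained(offers))    # 28: all offered CPUs become YARN capacity
```

The trade-off the model hints at: coarse-grained is predictable but strands the leftover CPUs in each offer, while fine-grained tracks datacenter load at the cost of constantly resizing YARN's view of the cluster.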
What about the data? (cont.)
- Myriad only cares about compute / MapReduce.
- HDFS has to be provided in other ways.

Big data: traditional assumptions vs. new realities
- Traditional assumptions: bare metal; data locality; data on local disks.
- New realities: containers and VMs; compute and storage separation; in-place access on remote data stores.
- New benefits and value: Big-Data-as-a-Service; agility and cost savings; faster time-to-insights.

Options for the HDFS data layer
- Pure HDFS cluster (only DataNodes running): bare metal, containerized, or Mesos-based.
- Enterprise HDFS array: EMC Isilon.

Myriad, Mesos, EMC Isilon for HDFS

EMC Isilon advantages over classic Hadoop HDFS
- Multi-tenancy: multiple HDFS environments sharing the same storage.
- Quotas possible per HDFS environment.
- Snapshots of HDFS environments possible.
- Remote replication.
- WORM option for HDFS.
- Highly available HDFS infrastructure (distributed NameNodes and DataNodes).
- Storage efficient (usable/raw 0.8, compared to 0.33 with Hadoop).
- Shared access via HDFS / CIFS / NFS / SFTP possible.
- Maintenance equals enterprise array standard.
- All major distributions supported.

Spark on Mesos

Most common Spark deployment environments (cluster managers)
- Standalone mode: 48%
- YARN: 40%
- Mesos: 11%
Source: Spark Survey Report, 2015 (Databricks)

Common deployment patterns
- Spark cluster, standalone mode (diagram): a Spark client talks to the Spark master; Spark slaves on bare metal or VMs run the tasks; data is provided outside.
- Spark cluster on Hadoop YARN (diagram): the Spark client and master work with the YARN ResourceManager; NodeManagers host the Spark executors and their tasks; data is provided by the Hadoop cluster.
- Spark cluster on Mesos (diagram): the Spark client and Spark scheduler work with the Mesos master; Mesos slaves host the Spark executors and their tasks; data is provided outside.

Spark + Mesos + EMC Isilon
To solve the HDFS data layer.

Thank you!
Follow me on Twitter: @loeweh
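Appendix: the "usable/raw 0.8 compared to 0.33" efficiency claim in the Isilon slide above is straightforward arithmetic. HDFS triple replication stores every block three times, while an erasure-coded array stores N data stripes plus M parity stripes; the 16+4 layout below is an assumed example that yields 0.8, not a configuration quoted in the deck.

```python
# Usable/raw storage efficiency for two protection schemes.

def replication_efficiency(copies):
    # Every block is stored `copies` times, so 1/copies of raw space is usable.
    return 1 / copies

def erasure_efficiency(data_stripes, parity_stripes):
    # N data stripes protected by M parity stripes (an N+M layout).
    return data_stripes / (data_stripes + parity_stripes)

hdfs = replication_efficiency(3)    # classic HDFS: 3x replication
array = erasure_efficiency(16, 4)   # e.g. an assumed 16+4 erasure-coded layout

print(f"HDFS 3x replication: {hdfs:.2f} usable/raw")   # 0.33
print(f"16+4 erasure coding: {array:.2f} usable/raw")  # 0.80
```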