Big Data Solutions in Practice

This project has received funding from the European Union's Horizon 2020 Research and Innovation programme under grant agreement No 809965.

Cloud Computing

Who has worked with cloud computing?

• Hadoop
• Apache Spark
• Apache Flink
• DropBox
• Google Docs
• Email

Cloud Computing

01 Infrastructure
● Hardware
● Memory
● Computing

02 Platform
● Develop
● Run
● Manage

03 Application
● Use existing applications

Infrastructure (Hosting)

Cost + Energy + CO2 Emissions

Infrastructure (Cloud Computing)

What is in the Cloud?

• Data center servers
• Software networks
• Enables
  – Dynamic allocation of resources
  – Running applications for remote end users
• Virtualization
  – Servers can run multiple VMs on demand

Enabling Technologies

• Virtualization
• Web 2.0
• Fault-Tolerant Systems
  – Distributed Storage
  – Distributed Computing
• Network (Bandwidth and Latency)

Why Cloud Computing?

• Large-Scale Data-Intensive Applications
  – Volume
  – Velocity
  – Variety
• Flexibility
  – Scalability
  – Tools
  – Security
• Customized to adaptive needs
  – Hardware
  – Software
  – Access

Why Cloud Computing?

• Efficiency
  a) Easy access
  b) Speed to market
  c) Lean management
  d) Smaller CO2 footprint
• Reliability
  a) Fault resilience
  b) Security
• Affordability
  a) Hardware costs
  b) Hiring costs
  c) Maintenance costs

Cloud Computing - Definitions

“The delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet)” (Wikipedia)

“Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically re-configured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized SLAs (service level agreements)” [1]

Service Models of Clouds (*aaS)

Software as a Service: use of services (Email, DropBox, GDocs)

Platform as a Service: develop/deploy services (Website, Google Apps)

Infrastructure as a Service: host services (AWS, IBM Cloud, ...)

Cloud Stack | Target Customer
SaaS | End Users
PaaS | Developers
IaaS | Operators/IT

Deployment Models of Cloud

• Public Cloud (Standard Model)
  – Users use the services (SaaS)
  – Service providers develop services using PaaS
  – Service providers deploy services on an IaaS provider

Deployment Models of Cloud

Private (Internal Cloud)

• For enterprises with large-scale IT (e.g. Google, Facebook, DHL, etc.)

• Enterprises with sensitive data

Hybrid

• Extends the private cloud(s) by connecting them to external cloud vendors, making use of the cloud services those vendors offer

Cloud Burst

• Use the local cloud; when more resources are needed, burst into the public cloud
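
The cloud-burst rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Placement` and `place_instances`, and the idea of counting capacity in "instances", are hypothetical and do not come from the slides or from any particular cloud API.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    local: int   # instances placed on the private (local) cloud
    public: int  # instances that "burst" into the public cloud


def place_instances(requested: int, local_capacity: int, local_in_use: int) -> Placement:
    """Fill the local cloud first; send only the overflow to the public cloud."""
    if requested < 0 or local_capacity < 0:
        raise ValueError("requested and local_capacity must be non-negative")
    if not 0 <= local_in_use <= local_capacity:
        raise ValueError("local_in_use must be between 0 and local_capacity")
    free_local = local_capacity - local_in_use
    local = min(requested, free_local)
    return Placement(local=local, public=requested - local)


if __name__ == "__main__":
    # 10 local slots, 4 already used: a request for 8 instances overflows by 2.
    print(place_instances(requested=8, local_capacity=10, local_in_use=4))
```

A real deployment would also weigh cost, data locality and the sensitivity of the data before bursting; the sketch only captures the "local first, public on overflow" rule.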


Applications of Cloud and Limitations

Open Discussion

Limitations

• Security
• Privacy
• Vendor Lock-in (Interoperability)
• Network-dependent
• Migration
• Less control

Design of a Cloud Platform (Big Data Integrator)


Big Data Integrator Platform Goals

• Software that can be installed
  – on premises (Private Cloud), or
  – on the cloud (Public)
• Low total cost of ownership
  – Easy to use, deploy and develop services
• Caters for widely varying use cases
  – Tools and distribution
• Embraces emerging Big Data technologies
• Simple integration with custom components
• Flexible, Resilient, Reliable

Design of the Platform: Phase 1

Resource Manager

• Mesos
  – Used across multiple in-production pipelines

Virtualization and Packaging Tool

• Docker
  – Easy to install
  – Compatible with most platforms

Distributed File System

• Hadoop Distributed File System

Docker

• Lightweight virtualization tool
  – Image contains the necessary libraries
  – Runs on the host system
• Compatible with most platforms
• Dockerfile
  – Definition of the Docker image
• Docker Compose
  – Allows creation of multiple containers at once

Tool Support

• Distributed resource manager
  – Docker Swarm
• Distributed In-Memory Data Flow Processing
  – Apache Spark
  – Apache Flink
• Search/indexing
  – Elasticsearch

Tool Support

• Message passing
  – Apache Kafka
• Data storage
  – PostGIS
  – OpenLink Virtuoso
  – Cassandra
  – MongoDB
• Visualization
  – Kibana

Platform Architecture I

Advancements in Design

• Docker Swarm matured over time
• Every application can be dockerized

Platform Architecture II


Challenges

1) Several different web interfaces
2) No workflow in Docker pipelines

Solutions

• Unified Integrator Interface
• Workflow builder
• Init-Daemon microservice
• Platform Administration
  – Workflow (application) monitor
  – Swarm User Interface

Platform Architecture III


Platform Features

• BDE Development Environment
  – Stack builder
  – Workflow builder
  – Possibility to add custom components to the BDE stack
• Administrator Interface
  – SwarmUI
• BDE Application Environment
  – Workflow monitor
  – Integrated web interface

BDE Integrator Workflow

(Diagram; recoverable labels)

• Git-clone
• BDE-IDE
• Stack builder: select components => (push Create-Flow)
• New Stack
• Workflow builder: arrange components => (push Monitor)
• Workflow monitor: deployment status of components => (push OK)
• SwarmUI: see the scaling and scale up/down
• BDE Logger
• Integrator UI: navigate the component UI and deploy jobs

Demo

https://www.youtube.com/channel/UCLSpcbH3OZPWXOcDuOXqqPg

BDE User-roles


Scalability

• Docker Swarm
  – 1,000 nodes, 30,000 containers
• BDE setups
  – InfAI cluster: 3 nodes, up to 50 containers
  – NCSR-D cluster: 5 nodes, up to 25 containers
• Hadoop scalability
  – Facebook cluster: 1,100 machines with 8,800 cores and about 12 PB raw storage
• Spark scalability
  – eBay cluster: 2,000 nodes, 100 TB of RAM, and 20,000 cores

User Interfaces

◎ BDE Development Environment
  o Stack builder
  o Workflow builder
◎ Administrative Interface
  o SwarmUI
  o Logger Interface
◎ UI Integrator
  o Workflow monitor
  o Integrated web interface

Big Data Integrator UI-BDI


Stack Builder


Stack Editor

(Screenshot labels: Component; Services/dockers)

BDE Workflow Builder

(Screenshot labels: Component 1, Component 2, Component 3)

BDE Workflow Monitor

• Component 1: Finished
• Component 2: Finished
• Component 3: In progress
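
The status view above can be mimicked with a small Python sketch. It is not the BDE implementation: the class `WorkflowMonitor`, the `Status` values other than "Finished" and "In progress", and the rule that a component may only start once all earlier ones have finished are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    NOT_STARTED = "Not started"   # assumed state; not shown on the slide
    IN_PROGRESS = "In progress"
    FINISHED = "Finished"


class WorkflowMonitor:
    """Tracks an ordered list of components and their status (hypothetical)."""

    def __init__(self, components: List[str]) -> None:
        self._order: List[str] = list(components)
        self._status: Dict[str, Status] = {c: Status.NOT_STARTED for c in self._order}

    def start(self, name: str) -> None:
        # Assumed rule: every earlier component must have finished first.
        idx = self._order.index(name)
        if any(self._status[p] is not Status.FINISHED for p in self._order[:idx]):
            raise RuntimeError(f"{name} cannot start before earlier components finish")
        self._status[name] = Status.IN_PROGRESS

    def finish(self, name: str) -> None:
        if self._status[name] is not Status.IN_PROGRESS:
            raise RuntimeError(f"{name} is not in progress")
        self._status[name] = Status.FINISHED

    def report(self) -> List[str]:
        return [f"{c}: {self._status[c].value}" for c in self._order]


def demo() -> List[str]:
    monitor = WorkflowMonitor(["Component 1", "Component 2", "Component 3"])
    for name in ("Component 1", "Component 2"):
        monitor.start(name)
        monitor.finish(name)
    monitor.start("Component 3")
    return monitor.report()


if __name__ == "__main__":
    print("\n".join(demo()))
```

Keeping the ordering rule inside `start` means a monitor UI only has to render `report()`; whether the real platform enforces ordering this way is not stated in the slides.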


Swarm UI-Stacks


Swarm UI-Pipeline

(Screenshot label: Increase number of instances)

Logging-Monitor


Integrator UI

(Screenshot labels: Component 1, Component 2)

BDE vs Hadoop distributions

Feature | Hortonworks | Cloudera | MapR | Bigtop | BDE
File system | HDFS | HDFS | NFS | HDFS | HDFS
Installation | Native | Native | Native | Native | Lightweight virtualization
Flexible modular architecture | no | no | no | no | yes
High availability | Single failure recovery (YARN) | Single failure recovery (YARN) | Self-healing, multiple failure recovery | Single failure recovery (YARN) | Failure recovery
Cost | Commercial | Commercial | Commercial | Free | Free
Scaling | Freemium | Freemium | Freemium | Free | Free
Addition of custom components | Not easy | No | No | No | Yes
Integration testing | yes | yes | yes | yes | --
Operating systems | Linux | Linux | Linux | Linux | Windows/Mac/Linux
Management tool | Ambari | Cloudera Manager | MapR Control System | - | Docker Swarm UI + custom interfaces

BDE vs Hadoop distributions

• BDE is not built on top of existing distributions
• Aims to facilitate
  – Communities
  – Research institutions
• Bridges scientists and open data
• Multi-tier research efforts towards Smart Data

Green Transport for Smart Cities


Transport FCD Data

Streaming sensor network & geo-spatial data integration

Floating Car Data Provider

• CERTH-HIT monitors the traffic flow in Thessaloniki, Greece
• It receives floating car data (FCD):
  – 500 to 2,500 speed measurements per minute
  – Location, speed, orientation, status
  – Hundreds of GB (historical dataset)

FCD Data

• Device ID
• GPS position (X, Y, Z)
• Orientation (degrees)
• Speed (km/h)
• Timestamp
• Zone
• Status
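
As an illustration, the fields listed above could be modelled as a small Python record type. The slides name the fields but not their types or encodings, so the types below, the class name `FcdRecord`, the helper `is_plausible` and its 250 km/h ceiling are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FcdRecord:
    """One floating-car-data measurement (field types are assumed)."""
    device_id: str
    x: float                 # GPS position X
    y: float                 # GPS position Y
    z: float                 # GPS position Z
    orientation_deg: float   # heading in degrees
    speed_kmh: float         # speed in km/h
    timestamp: datetime
    zone: str
    status: str


def is_plausible(record: FcdRecord, max_speed_kmh: float = 250.0) -> bool:
    """Basic sanity check; the speed ceiling is an arbitrary illustrative value."""
    return (0.0 <= record.orientation_deg < 360.0
            and 0.0 <= record.speed_kmh <= max_speed_kmh)


if __name__ == "__main__":
    rec = FcdRecord("dev-1", 22.95, 40.63, 12.0, 90.0, 42.5,
                    datetime(2020, 1, 1, 8, 0), "zone-1", "ok")
    print(rec, is_plausible(rec))
```

A check like `is_plausible` would typically run before map matching so that obviously broken measurements do not distort the per-segment averages.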


Task

1) Match the GPS coordinates to the map (find the approximate location)
2) Aggregate in time windows
   a) Compute the average flow (number of vehicles)
   b) Compute the average speed
3) The result of the aggregation:
   a) Road segment identifier
   b) Traffic flow (number of vehicles in the time window)
   c) Average speed and the timestamp

The resulting data is stored in the distributed file system.
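
Step 2 and the output of step 3 can be sketched in plain Python. This is a single-machine illustration, not the streaming job used in the pilot: it assumes the map matching of step 1 has already produced a road segment identifier for each point, counts distinct device IDs as the flow, and uses a 300-second tumbling window. The names `MatchedPoint`, `SegmentStats` and `aggregate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class MatchedPoint(NamedTuple):
    segment_id: str   # road segment from the map-matching step (assumed given)
    device_id: str
    speed_kmh: float
    timestamp: int    # seconds since epoch (assumed unit)


class SegmentStats(NamedTuple):
    segment_id: str
    window_start: int     # timestamp of the start of the time window
    flow: int             # number of distinct vehicles in the window
    avg_speed_kmh: float


def aggregate(points: Iterable[MatchedPoint], window_s: int = 300) -> List[SegmentStats]:
    """Group points by (segment, tumbling window) and compute flow and mean speed."""
    speeds: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    vehicles: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for p in points:
        window_start = p.timestamp - p.timestamp % window_s
        key = (p.segment_id, window_start)
        speeds[key].append(p.speed_kmh)
        vehicles[key].add(p.device_id)
    return [
        SegmentStats(seg, start, len(vehicles[(seg, start)]), sum(v) / len(v))
        for (seg, start), v in sorted(speeds.items())
    ]


if __name__ == "__main__":
    sample = [
        MatchedPoint("s1", "a", 40.0, 0),
        MatchedPoint("s1", "b", 60.0, 120),
        MatchedPoint("s1", "a", 50.0, 299),
        MatchedPoint("s1", "a", 30.0, 300),
    ]
    for row in aggregate(sample):
        print(row)
```

In a distributed setting the same grouping key (segment, window start) would be the natural partitioning key, and the rows returned here correspond to the records written to the distributed file system.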


Visualization

We match the vehicles to a cell within a grid that covers the area of Thessaloniki.
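
A minimal sketch of such a grid lookup is shown below. The bounding box, the grid resolution and the function name `grid_cell` are illustrative assumptions; the slides do not give the actual extent or cell size, and the sketch uses a simple regular latitude/longitude grid.

```python
from typing import Optional, Tuple

# Hypothetical bounding box roughly around Thessaloniki (illustrative values only).
MIN_LAT, MAX_LAT = 40.50, 40.70
MIN_LON, MAX_LON = 22.80, 23.10
ROWS, COLS = 20, 30  # assumed grid resolution


def grid_cell(lat: float, lon: float) -> Optional[Tuple[int, int]]:
    """Return the (row, col) of the cell containing the point, or None if outside."""
    if not (MIN_LAT <= lat < MAX_LAT and MIN_LON <= lon < MAX_LON):
        return None
    row = int((lat - MIN_LAT) / (MAX_LAT - MIN_LAT) * ROWS)
    col = int((lon - MIN_LON) / (MAX_LON - MIN_LON) * COLS)
    # Clamp to guard against floating-point rounding at the upper edges.
    return min(row, ROWS - 1), min(col, COLS - 1)


if __name__ == "__main__":
    print(grid_cell(40.615, 22.955))
    print(grid_cell(41.0, 22.9))
```

Counting vehicles per returned cell then gives the values a heat-map style visualization would colour.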


Use case Architecture


Pipeline and Flow


Execution of the Traffic Use Case

https://www.youtube.com/watch?v=feBKLYjldvI

Summary

• Cloud Computing
• Examined the design and the decisions behind the development of a cloud platform
• Designed and executed a Big Data pipeline on the platform


THANK YOU !

Dr. Hajira Jabeen [email protected]