© 2015 ibm corporation - files.meetup.com · 3 © 2015 ibm corporation parallel data processing is...

33
© 2015 IBM Corporation

Upload: others

Post on 27-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation

Page 2: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation2

Hardware improvements through the years...� CPU Speeds:

– 1990 - 44 MIPS at 40 MHz

– 2000 - 3,561 MIPS at 1.2 GHz

– 2010 - 147,600 MIPS at 3.3 GHz

� RAM Memory– 1990 – 640K conventional memory (256K extended memory recommended)

– 2000 – 64MB memory

– 2010 - 8-32GB (and more)

� Disk Capacity– 1990 – 20MB 2000 - 1GB 2010 – 1TB

� Disk Latency (speed of reads and writes) – not much improvement in last 7-10 years, currently around 70 – 80MB / sec

� How long it will take to read 1TB of data? (at 80Mb / sec):

– 1 disk - 3.4 hours

– 10 disks - 20 min

– 100 disks - 2 min

– 1000 disks - 12 sec

Page 3: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation3

Parallel Data Processing is the answer!

� It was with us for a while:–GRID computing - spreads processing load

–Distributed workload - hard to manage applications, overhead on developer

–Parallel databases – DB2 DPF, Teradata, Netezza, etc (distribute the data)

Page 4: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation4

About the IBM Open Platform for Apache Hadoop

� Flexible, enterprise-class support for processing large volumes of data – Supports wide variety of data (structured, unstructured, semi-structured)

– Supports variety of popular APIs (industry-standard SQL, MapReduce, …)

� Enables applications to work with thousands of nodes and petabytesof data in a highly parallel, cost effective manner– CPU + local disks = “node”

– Nodes can be combined into clusters

– New nodes can be added as needed without changing • Data formats

• How data is loaded

• How jobs are written

Page 5: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation5

Design principles of Hadoop

� New way of storing and processing the data:– Let system handle most of the issues automatically:

• Failures• Scalability• Reduce communications • Distribute data and processing power to where the data is• Make parallelism part of operating system• Meant for heterogeneous commodity hardware

� Bring processing to Data!

� Hadoop = HDFS + MapReduce infrastructure

� Optimized to handle– Massive amounts of data through parallelism

� Reliability provided through replication

Page 6: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation6

What is the Hadoop Distributed File System (HDFS)?

� Driving principals

– Data is stored across the entire cluster (multiple nodes)

– Programs are brought to the data, not the data to the program

– Follows the Divide and Conquer paradigm.

� Data is stored across the entire cluster (the DFS)

– The entire cluster participates in the file system

– Blocks of a single file are distributed across the cluster

– A given block is typically replicated as well for resiliency

101101001010010011100111111001010011101001010010110010010101001100010100101110101110101111011011010101101001010100101010101011100100110101110100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44

Page 7: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation7

Introduction to MapReduce

MapReduce Application

1. Map Phase(break job into small parts)

2. Shuffle(transfer interim outputfor final processing)

3. Reduce Phase(boil all output down toa single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper

extends Mapper<Object,Text,Text,IntWritable> {

private final static IntWritable

one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text val, Context

StringTokenizer itr =

new StringTokenizer(val.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWrita

private IntWritable result = new IntWritable();

public void reduce(Text key,

Iterable<IntWritable> val, Context context){

int sum = 0;

for (IntWritable v : val) {

sum += v.get();

. . .

public static class TokenizerMapper

extends Mapper<Object,Text,Text,IntWritable> {

private final static IntWritable

one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text val, Context

StringTokenizer itr =

new StringTokenizer(val.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWrita

private IntWritable result = new IntWritable();

public void reduce(Text key,

Iterable<IntWritable> val, Context context){

int sum = 0;

for (IntWritable v : val) {

sum += v.get();

. . .

Distribute maptasks to cluster

Hadoop Data Nodes

� Scalable to thousands of nodes and petabytes of data

Page 8: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation8

• Reliability

• Resiliency

• Security

• Multiple data sources

• Multiple applications

• Multiple users

Benefits

• Files

• Semi-structured

• Databases

Unlimited Scale

Enterprise Platform

Wide Range of Data Formats

Page 9: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation9

Hadoop MapReduce Challenges

• Need deep Java skills

• Few abstractions available for

analysts

• No in-memory framework

• Application tasks write to disk

with each cycle

• Only suitable for batch

workloads

• Rigid processing model

In-Memory Performance

Ease of Development

Combine Workflows

CONS

Page 10: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation10

An Apache Foundation open source project. Not a product.

What is Spark

1

2

5

33

4

Enables highly iterative analysis on large volumes of data at scale

An in-memory compute engine that works with data. Not a data store.

Radically simplifies process of developing intelligent apps fueled by data

Unified environment for data scientists, developers and data engineers

Page 11: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation11

Open Source Project

� 2002 – MapReduce @ Google

� 2004 – MapReduce paper

� 2006 – Hadoop @ Yahoo

� 2008 – Hadoop Summit

� 2010 – Spark paper

� 2014 – Apache Spark top-level

� 2014 – 1.2.0 release in December

� 2015 – 1.3.0 release in March

� 2015 – 1.4.0 release in June

� Spark is HOT!!!

� Most active project in Hadoop

ecosystem

� One of top 3 most active Apache projects

� Databricks founded by the creators of Spark from UC Berkeley’s AMPLab

Activity for 6 months in 2014(from Matei Zaharia – 2014 Spark Summit)

1

Page 12: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation12

Resilient Distributed Dataset: definition

Slave node 1

c3 d2

a2 b1

partition3

partition1

partition2

Slave node 2

c2 d1

a1 b2

partition1

partition3

Slave node 3

c1 d2

a3 b3

partition2

partition2

partition1

RDD1

RDD2

RDD3

Spark RDDIn-memory distribution

HDFSOn-disk distribution

� An RDD is a distributed collection of Scala/Python/Java objects of the same type:

– RDD of strings, integers …

– RDD of (key, value) pairs

– RDD of class Java/Python/Scala objects

2

Page 13: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation13

Spark Programming Model

• Operations on RDDs (datasets)– Transformation

– Action

• Transformations use lazy evaluation– Executed only if an action requires it

• An application consist of a directed acyclic graph (DAG)– Each action results in a separate batch job

– Parallelism is determined by the number of RDD partitions

RDD1 RDD2 RDD3

Act1

Act2

Job-1

Job-2

Resilient Distributed Dataset: Operations

Page 14: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation14

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

The data is partitioned into

different blocks

Page 15: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation15

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

Driver sends the code to be

executed on each block

Page 16: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation16

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

Read HDFS block

Page 17: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation17

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

Process + cache data

Cache Cache Cache

Page 18: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation18

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

Cache Cache Cache

Send the data back

to the driver

Page 19: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation19

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

Cache Cache Cache

Process from cache

Page 20: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation20

What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

Cache Cache Cache

Send the data back

to the driver

Page 21: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation21

Spark Libraries

Apache Spark

Spark SQLSpark

StreamingGraphX MLlib SparkR

• Extensions to the core Spark API (Python, Scala, Java)

• Improvements made to the core are passed to these libraries

• Little overhead to use with the Spark core

Simplifies process of developing 3

Page 22: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation22

Data Scientist

Data Engineer App Developer

“the convincer”

“the builder”“the thinker”

Spark SQL

MLlib

Scala API

Unified environment4

Page 23: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation23

Data Engineer

Data ScientistApp Developer

Spark Empowers More to Accelerate The Insight

Business

Use CaseUnderstanding

attributes Data

Cleaning

Machine

Learning

Analysis of

Accuracy

With the

other

business

applications

5

Page 24: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation24

In-Memory Performance

Ease of Development

• Easier APIs

• Python, Scala, Java

• Resilient Distributed Datasets

• Unify processing

Spark Advantages

• Batch

• Interactive

• Iterative algorithms

• Micro-batch

Combine Workflows

Page 25: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation25

Example of Hadoop Ecosystem

Zo

oke

ep

er

(Coord

ination)

Flu

me

(Data

Colle

ction)

HDFS (or GPFS)(Distributed File System)

YARN(Resource Manager)

Pig

(ET

L)

Oo

zie

(Work

flow

)

Hiv

e(D

ata

Ware

house S

QL)

Syste

mT

(Text A

naly

tics)

Big

R &

ML

(Sta

tistica

l An

aly

sis

)

Blue Boxes components only available with IBM BigInsights product.

Am

ba

ri(M

onitoring)

HB

ase

(Inte

ractive S

tora

ge)

Map Reduce v2(Processing

Framework)

Ma

p R

ed

uce

v1

(Pro

cessin

g F

ram

ew

ork

)

Spark(Processing

Framework)

Page 26: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation26

IBM Announces Major

Commitment to Advance

Apache® Spark™

⦁ …the Most Significant Open Source Project of the Next Decade…

Page 27: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation27

Open Source SystemML

Educate One Million Data Professionals

Establish Spark Technology Center

Founding Member of AMPLab

Contributing to the Core

Announcing

Our commitment to Spark

Page 28: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation28

SystemML unifies the fractured machine learning environments

Gives the core Spark ecosystem a complete set of DML

Allows a data scientist to focus on the algorithm, not the implementation

Improves time to value for data science teams

Establish a de facto standard for reusable machine learning routines

We are Contributing SystemML

Our largest contribution to open source since Linux

Page 29: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation29

Page 30: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation30

Educate 1 Million Data Scientists and Data Engineers

Big Data University MOOC

Spark Fundamentals I and II

Advanced Spark Development series

Foundational Methodology for Data Science

Partnerships with Databricks, AMPLab, DataCamp and MetiStream

Our investment to grow skills

Page 31: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation31

Inspire the use of Spark to solve business problems

Encourage adoption through open and free educational assets

Demonstrate real world solutions to identify opportunities

Use the learning to improve Spark and its application

Spark Technology Center

Our goal is to be the #1 Spark contributor and adopter

Page 32: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation32

Our Partner Ecosystem

Page 33: © 2015 IBM Corporation - files.meetup.com · 3 © 2015 IBM Corporation Parallel Data Processing is the answer! It was with us for a while: –GRID computing - spreads processing

© 2015 IBM Corporation33

Now

IBM Open Platform with Apache Hadoop

IBM InfoSphere Streams

IBM Platform Computing

Our Use of Spark at IBM

⦁ More than 30 IBM Research initiatives

⦁ 100 incubated applications in 10 days

⦁ 3,500 Researchers and Developers to Spark

Targeted for later in year

Apache Spark as a Service on IBM Bluemix (in beta)

IBM Watson Analytics

SPSS Modeler & Analytics Server

IBM DataWorks

IBM PureData Systems with Fluid Query

IBM Commerce