© 2015 ibm corporation - files.meetup.com · 3 © 2015 ibm corporation parallel data processing is...

© 2015 IBM Corporation2

Hardware improvements through the years...� CPU Speeds:

– 1990 - 44 MIPS at 40 MHz

– 2000 - 3,561 MIPS at 1.2 GHz

– 2010 - 147,600 MIPS at 3.3 GHz

� RAM Memory– 1990 – 640K conventional memory (256K extended memory recommended)

– 2000 – 64MB memory

– 2010 - 8-32GB (and more)

� Disk Capacity– 1990 – 20MB 2000 - 1GB 2010 – 1TB

� Disk Latency (speed of reads and writes) – not much improvement in last 7-10 years, currently around 70 – 80MB / sec

� How long it will take to read 1TB of data? (at 80Mb / sec):

– 1 disk - 3.4 hours

– 10 disks - 20 min

– 100 disks - 2 min

– 1000 disks - 12 sec


Parallel Data Processing is the answer!

� It was with us for a while:–GRID computing - spreads processing load

–Distributed workload - hard to manage applications, overhead on developer

–Parallel databases – DB2 DPF, Teradata, Netezza, etc (distribute the data)


About the IBM Open Platform for Apache Hadoop

� Flexible, enterprise-class support for processing large volumes of data – Supports wide variety of data (structured, unstructured, semi-structured)

– Supports variety of popular APIs (industry-standard SQL, MapReduce, …)

� Enables applications to work with thousands of nodes and petabytesof data in a highly parallel, cost effective manner– CPU + local disks = “node”

– Nodes can be combined into clusters

– New nodes can be added as needed without changing • Data formats

• How data is loaded

• How jobs are written


Design principles of Hadoop

� New way of storing and processing the data:– Let system handle most of the issues automatically:

• Failures• Scalability• Reduce communications • Distribute data and processing power to where the data is• Make parallelism part of operating system• Meant for heterogeneous commodity hardware

� Bring processing to Data!

� Hadoop = HDFS + MapReduce infrastructure

� Optimized to handle– Massive amounts of data through parallelism

� Reliability provided through replication


What is the Hadoop Distributed File System (HDFS)?

� Driving principals

– Data is stored across the entire cluster (multiple nodes)

– Programs are brought to the data, not the data to the program

– Follows the Divide and Conquer paradigm.

� Data is stored across the entire cluster (the DFS)

– The entire cluster participates in the file system

– Blocks of a single file are distributed across the cluster

– A given block is typically replicated as well for resiliency

101101001010010011100111111001010011101001010010110010010101001100010100101110101110101111011011010101101001010100101010101011100100110101110100

Logical File

1

2

3

4

Blocks

1

Cluster

1

1

2

22

3

3

34

44


Introduction to MapReduce

MapReduce Application

1. Map Phase(break job into small parts)

2. Shuffle(transfer interim outputfor final processing)

3. Reduce Phase(boil all output down toa single result set)

Return a single result setResult Set

Shuffle

public static class TokenizerMapper

extends Mapper<Object,Text,Text,IntWritable> {

private final static IntWritable

one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text val, Context

StringTokenizer itr =

new StringTokenizer(val.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWrita

private IntWritable result = new IntWritable();

public void reduce(Text key,

Iterable<IntWritable> val, Context context){

int sum = 0;

for (IntWritable v : val) {

sum += v.get();

. . .

public static class TokenizerMapper

extends Mapper<Object,Text,Text,IntWritable> {

private final static IntWritable

one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text val, Context

StringTokenizer itr =

new StringTokenizer(val.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWrita

private IntWritable result = new IntWritable();

public void reduce(Text key,

Iterable<IntWritable> val, Context context){

int sum = 0;

for (IntWritable v : val) {

sum += v.get();

. . .

Distribute maptasks to cluster

Hadoop Data Nodes

� Scalable to thousands of nodes and petabytes of data


• Reliability

• Resiliency

• Security

• Multiple data sources

• Multiple applications

• Multiple users

Benefits

• Files

• Semi-structured

• Databases

Unlimited Scale

Enterprise Platform

Wide Range of Data Formats


Hadoop MapReduce Challenges

• Need deep Java skills

• Few abstractions available for

analysts

• No in-memory framework

• Application tasks write to disk

with each cycle

• Only suitable for batch

workloads

• Rigid processing model

In-Memory Performance

Ease of Development

Combine Workflows

CONS


An Apache Foundation open source project. Not a product.

What is Spark

1

2

5

33

4

Enables highly iterative analysis on large volumes of data at scale

An in-memory compute engine that works with data. Not a data store.

Radically simplifies process of developing intelligent apps fueled by data

Unified environment for data scientists, developers and data engineers


Open Source Project

� 2002 – MapReduce @ Google

� 2004 – MapReduce paper

� 2006 – Hadoop @ Yahoo

� 2008 – Hadoop Summit

� 2010 – Spark paper

� 2014 – Apache Spark top-level

� 2014 – 1.2.0 release in December

� 2015 – 1.3.0 release in March

� 2015 – 1.4.0 release in June

� Spark is HOT!!!

� Most active project in Hadoop

ecosystem

� One of top 3 most active Apache projects

� Databricks founded by the creators of Spark from UC Berkeley’s AMPLab

Activity for 6 months in 2014(from Matei Zaharia – 2014 Spark Summit)

1


Resilient Distributed Dataset: definition

Slave node 1

c3 d2

a2 b1

partition3

partition1

partition2

Slave node 2

c2 d1

a1 b2

partition1

partition3

Slave node 3

c1 d2

a3 b3

partition2

partition2

partition1

RDD1

RDD2

RDD3

Spark RDDIn-memory distribution

HDFSOn-disk distribution

� An RDD is a distributed collection of Scala/Python/Java objects of the same type:

– RDD of strings, integers …

– RDD of (key, value) pairs

– RDD of class Java/Python/Scala objects

2


Spark Programming Model

• Operations on RDDs (datasets)– Transformation

– Action

• Transformations use lazy evaluation– Executed only if an action requires it

• An application consist of a directed acyclic graph (DAG)– Each action results in a separate batch job

– Parallelism is determined by the number of RDD partitions

RDD1 RDD2 RDD3

Act1

Act2

Job-1

Job-2

Resilient Distributed Dataset: Operations


What happens when an action is executed?

// Creating the RDD

val logFile = sc.textFile(“hdfs://…”)

// Transformations

val errors = logFile.filter(_.startsWith(“ERROR”))

val messages = errors.map(_.split(“\t”)).map(r => r(1))

//Caching

messages.cache()

// Actions

messages.filter(_.contains(“mysql”)).count()

messages.filter(_.contains(“php”)).count()

Driver

Worker Worker WorkerBlock 1 Block 3Block 2

The data is partitioned into

different blocks



// Creating the RDD


// Transformations



//Caching

messages.cache()

// Actions



Driver


Driver sends the code to be

executed on each block



// Creating the RDD


// Transformations



//Caching

messages.cache()

// Actions



Driver


Read HDFS block



// Creating the RDD


// Transformations



//Caching

messages.cache()

// Actions



Driver


Process + cache data

Cache Cache Cache



// Creating the RDD


// Transformations



//Caching

messages.cache()

// Actions



Driver


Cache Cache Cache

Send the data back

to the driver



// Creating the RDD


// Transformations



//Caching

messages.cache()

// Actions



Driver


Cache Cache Cache

Process from cache



// Creating the RDD


// Transformations



//Caching

messages.cache()

// Actions



Driver


Cache Cache Cache

Send the data back

to the driver


Spark Libraries

Apache Spark

Spark SQLSpark

StreamingGraphX MLlib SparkR

• Extensions to the core Spark API (Python, Scala, Java)

• Improvements made to the core are passed to these libraries

• Little overhead to use with the Spark core

Simplifies process of developing 3


Data Scientist

Data Engineer App Developer

“the convincer”

“the builder”“the thinker”

Spark SQL

MLlib

Scala API

Unified environment4


Data Engineer

Data ScientistApp Developer

Spark Empowers More to Accelerate The Insight

Business

Use CaseUnderstanding

attributes Data

Cleaning

Machine

Learning

Analysis of

Accuracy

With the

other

business

applications

5


In-Memory Performance

Ease of Development

• Easier APIs

• Python, Scala, Java

• Resilient Distributed Datasets

• Unify processing

Spark Advantages

• Batch

• Interactive

• Iterative algorithms

• Micro-batch

Combine Workflows


Example of Hadoop Ecosystem

Zo

oke

ep

er

(Coord

ination)

Flu

me

(Data

Colle

ction)

HDFS (or GPFS)(Distributed File System)

YARN(Resource Manager)

Pig

(ET

L)

Oo

zie

(Work

flow

)

Hiv

e(D

ata

Ware

house S

QL)

Syste

mT

(Text A

naly

tics)

Big

R &

ML

(Sta

tistica

l An

aly

sis

)

Blue Boxes components only available with IBM BigInsights product.

Am

ba

ri(M

onitoring)

HB

ase

(Inte

ractive S

tora

ge)

Map Reduce v2(Processing

Framework)

Ma

p R

ed

uce

v1

(Pro

cessin

g F

ram

ew

ork

)

Spark(Processing

Framework)


IBM Announces Major

Commitment to Advance

Apache® Spark™

⦁ …the Most Significant Open Source Project of the Next Decade…


Open Source SystemML

Educate One Million Data Professionals

Establish Spark Technology Center

Founding Member of AMPLab

Contributing to the Core

Announcing

Our commitment to Spark


SystemML unifies the fractured machine learning environments

Gives the core Spark ecosystem a complete set of DML

Allows a data scientist to focus on the algorithm, not the implementation

Improves time to value for data science teams

Establish a de facto standard for reusable machine learning routines

We are Contributing SystemML

Our largest contribution to open source since Linux


Educate 1 Million Data Scientists and Data Engineers

Big Data University MOOC

Spark Fundamentals I and II

Advanced Spark Development series

Foundational Methodology for Data Science

Partnerships with Databricks, AMPLab, DataCamp and MetiStream

Our investment to grow skills


Inspire the use of Spark to solve business problems

Encourage adoption through open and free educational assets

Demonstrate real world solutions to identify opportunities

Use the learning to improve Spark and its application

Spark Technology Center

Our goal is to be the #1 Spark contributor and adopter


Our Partner Ecosystem


Now

IBM Open Platform with Apache Hadoop

IBM InfoSphere Streams

IBM Platform Computing

Our Use of Spark at IBM

⦁ More than 30 IBM Research initiatives

⦁ 100 incubated applications in 10 days

⦁ 3,500 Researchers and Developers to Spark

Targeted for later in year

Apache Spark as a Service on IBM Bluemix (in beta)

IBM Watson Analytics

SPSS Modeler & Analytics Server

IBM DataWorks

IBM PureData Systems with Fluid Query

IBM Commerce

© 2015 ibm corporation - files.meetup.com · 3 © 2015 ibm corporation parallel data processing is...

Documents