
Hardware improvements through the years

• CPU speed:
 – 1990 – 44 MIPS at 40 MHz
 – 2000 – 3,561 MIPS at 1.2 GHz
 – 2010 – 147,600 MIPS at 3.3 GHz

• RAM:
 – 1990 – 640KB conventional memory (256KB extended memory recommended)
 – 2000 – 64MB
 – 2010 – 8-32GB (and more)

• Disk capacity:
 – 1990 – 20MB
 – 2000 – 1GB
 – 2010 – 1TB

• Disk latency (speed of reads and writes) – not much improvement in the last 7-10 years, currently around 70-80MB/sec

• How long will it take to read 1TB of data (at 80MB/sec)?
 – 1 disk – 3.4 hours
 – 10 disks – 20 min
 – 100 disks – 2 min
 – 1000 disks – 12 sec
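The arithmetic behind those numbers is worth internalizing; a quick back-of-the-envelope sketch in Scala (rounding 1TB to 1,000,000MB and assuming the read is spread evenly across the disks):

// Scan time for 1TB at 80MB/sec per disk, read in parallel
val totalMB = 1000000.0
val mbPerSecPerDisk = 80.0
for (disks <- Seq(1, 10, 100, 1000)) {
  val secs = totalMB / (mbPerSecPerDisk * disks)
  println(f"$disks%4d disk(s): $secs%8.0f sec (~${secs / 3600}%.2f h)")
}
// 1 disk -> ~12,500 sec (~3.5 h); 1000 disks -> ~12.5 sec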

Parallel Data Processing is the answer!

• It has been with us for a while:
 – Grid computing – spreads the processing load
 – Distributed workloads – hard-to-manage applications, overhead on the developer
 – Parallel databases – DB2 DPF, Teradata, Netezza, etc. (distribute the data)

About the IBM Open Platform for Apache Hadoop

• Flexible, enterprise-class support for processing large volumes of data
 – Supports a wide variety of data (structured, unstructured, semi-structured)
 – Supports a variety of popular APIs (industry-standard SQL, MapReduce, …)

• Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
 – CPU + local disks = "node"
 – Nodes can be combined into clusters
 – New nodes can be added as needed without changing:
  • data formats
  • how data is loaded
  • how jobs are written

Design principles of Hadoop

• A new way of storing and processing data – let the system handle most of the issues automatically:
 – Failures
 – Scalability
 – Reduced communication
 – Distribute data and processing power to where the data is
 – Make parallelism part of the operating system
 – Designed for heterogeneous commodity hardware

• Bring the processing to the data!

• Hadoop = HDFS + MapReduce infrastructure

• Optimized to handle massive amounts of data through parallelism

• Reliability provided through replication

What is the Hadoop Distributed File System (HDFS)?

• Driving principles:
 – Data is stored across the entire cluster (multiple nodes)
 – Programs are brought to the data, not the data to the programs
 – Follows the divide-and-conquer paradigm

• Data is stored across the entire cluster (the DFS):
 – The entire cluster participates in the file system
 – Blocks of a single file are distributed across the cluster
 – A given block is typically replicated as well, for resiliency

[Diagram: a logical file is split into blocks 1-4; the blocks are distributed across the cluster, and each block is replicated on multiple nodes for resiliency.]
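You can observe this block placement directly through the Hadoop FileSystem API; a minimal Scala sketch (the path /data/big.log is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/data/big.log"))
// One BlockLocation per block, listing the hosts holding its replicas
for (blk <- fs.getFileBlockLocations(status, 0, status.getLen)) {
  println(s"offset=${blk.getOffset} len=${blk.getLength} hosts=${blk.getHosts.mkString(",")}")
}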

Introduction to MapReduce

A MapReduce application runs in three phases:
1. Map phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)

The framework then returns a single result set.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Map: emit (word, 1) for every token in the input line
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Reduce: sum the counts emitted for each word
  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}


The framework distributes the map tasks across the Hadoop data nodes; the model is scalable to thousands of nodes and petabytes of data.

Benefits

• Unlimited scale
• Enterprise platform – reliability, resiliency, security; multiple data sources, multiple applications, multiple users
• Wide range of data formats – files, semi-structured data, databases

Hadoop MapReduce Challenges

Cons:
• Need deep Java skills; few abstractions available for analysts
• No in-memory framework; application tasks write to disk with each cycle
• Only suitable for batch workloads; rigid processing model

Spark addresses these with in-memory performance, ease of development, and combined workflows.
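To make the ease-of-development point concrete, here is the same word count from the MapReduce example above expressed against the Spark RDD API (a sketch, assuming a SparkContext sc and a placeholder input path):

// The whole MapReduce WordCount collapses to a few lines in Spark
val counts = sc.textFile("hdfs://namenode:8020/input")
  .flatMap(_.split("\\s+"))  // map phase: split lines into words
  .map(word => (word, 1))    // emit (word, 1) pairs
  .reduceByKey(_ + _)        // shuffle + reduce: sum counts per word
counts.take(10).foreach(println)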

What is Spark?

1. An Apache Foundation open source project – not a product
2. An in-memory compute engine that works with data – not a data store
3. Radically simplifies the process of developing intelligent apps fueled by data
4. A unified environment for data scientists, developers and data engineers
5. Enables highly iterative analysis on large volumes of data at scale

Open Source Project

• 2002 – MapReduce @ Google
• 2004 – MapReduce paper
• 2006 – Hadoop @ Yahoo
• 2008 – Hadoop Summit
• 2010 – Spark paper
• 2014 – Apache Spark becomes a top-level project
• 2014 – 1.2.0 release in December
• 2015 – 1.3.0 release in March
• 2015 – 1.4.0 release in June

• Spark is HOT!!!
 – The most active project in the Hadoop ecosystem
 – One of the top 3 most active Apache projects
 – Databricks was founded by the creators of Spark from UC Berkeley's AMPLab

[Chart: contribution activity over 6 months in 2014, from Matei Zaharia's 2014 Spark Summit keynote.]

Resilient Distributed Dataset: definition

[Diagram: three RDDs (RDD1, RDD2, RDD3), each split into partitions that are distributed in memory across three slave nodes; the Spark RDD's in-memory distribution is contrasted with HDFS's on-disk distribution.]

• An RDD is a distributed collection of Scala/Python/Java objects of the same type:
 – an RDD of strings, integers, …
 – an RDD of (key, value) pairs
 – an RDD of Java/Python/Scala class objects
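A few of these shapes in Scala (a sketch, assuming a SparkContext sc, e.g. from spark-shell):

// RDD of integers, split into 4 partitions
val nums = sc.parallelize(1 to 100, numSlices = 4)

// RDD of (key, value) pairs
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// RDD of case-class objects
case class Event(id: Int, msg: String)
val events = sc.parallelize(Seq(Event(1, "start"), Event(2, "stop")))

println(nums.partitions.size) // parallelism follows the partition count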


Resilient Distributed Dataset: operations – the Spark Programming Model

• Operations on RDDs (datasets):
 – Transformations
 – Actions

• Transformations use lazy evaluation – they are executed only when an action requires their output

• An application consists of a directed acyclic graph (DAG):
 – Each action results in a separate batch job
 – Parallelism is determined by the number of RDD partitions

[Diagram: a chain RDD1 → RDD2 → RDD3 with two actions, Act1 and Act2, each triggering its own job (Job-1, Job-2).]
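A small sketch of the transformation/action split (Scala, assuming a SparkContext sc):

// Transformations build the DAG; nothing executes here
val nums = sc.parallelize(1L to 1000000L, numSlices = 8)
val squares = nums.map(n => n * n)

// Each action triggers its own job over the DAG
val total = squares.reduce(_ + _) // Job-1
val count = squares.count()       // Job-2

println(squares.partitions.size)  // 8 – parallelism per stage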

What happens when an action is executed?

// Creating the RDD
val logFile = sc.textFile("hdfs://…")

// Transformations
val errors = logFile.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))

// Caching
messages.cache()

// Actions
messages.filter(_.contains("mysql")).count()
messages.filter(_.contains("php")).count()

[Diagram: a driver and three workers, each worker holding one HDFS block (Block 1-3) of the log file.]

For the first action (the "mysql" count):
1. The data is partitioned into different blocks.
2. The driver sends the code to be executed on each block.
3. Each worker reads its HDFS block.
4. Each worker processes its data and caches the intermediate result.
5. The workers send the results back to the driver.

For the second action (the "php" count):
6. Each worker processes its data from the cache – no HDFS read is needed.
7. The workers send the results back to the driver.
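A quick way to see the effect of the cache is to time both actions (a sketch; actual timings depend on the cluster and the data):

def time[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val r = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.0f ms")
  r
}

time("first action")  { messages.filter(_.contains("mysql")).count() } // reads HDFS, fills cache
time("second action") { messages.filter(_.contains("php")).count() }   // served from cache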

Spark Libraries

Apache Spark ships with a set of libraries on top of the core: Spark SQL, Spark Streaming, GraphX, MLlib, and SparkR.

• Extensions to the core Spark API (Python, Scala, Java)
• Improvements made to the core are passed on to these libraries
• Little overhead to use them with the Spark core
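For example, Spark SQL runs against the same SparkContext as core jobs, so SQL and RDD code mix freely (a sketch, assuming Spark 1.4-era APIs and a placeholder input path):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// DataFrames sit on top of RDDs
val df = sqlContext.read.json("hdfs://namenode:8020/data/events.json")
df.filter(df("level") === "ERROR").count()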

Unified environment

[Diagram: three personas – Data Scientist, Data Engineer, and App Developer (labeled "the thinker", "the convincer", and "the builder") – share one unified environment spanning Spark SQL, MLlib, and the Scala API.]

Spark Empowers More People to Accelerate Insight

[Diagram: the Data Scientist, Data Engineer, and App Developer collaborate across one pipeline: business use-case understanding → data attributes → data cleaning → machine learning → analysis of accuracy → integration with other business applications.]

Spark Advantages

• In-memory performance
• Ease of development:
 – Easier APIs
 – Python, Scala, Java
 – Resilient Distributed Datasets
• Combined workflows – unify processing across:
 – Batch
 – Interactive
 – Iterative algorithms
 – Micro-batch

Example of Hadoop Ecosystem

• HDFS (or GPFS) – distributed file system
• YARN – resource manager
• MapReduce v1 and v2 – processing frameworks
• Spark – processing framework
• ZooKeeper – coordination
• Flume – data collection
• Pig – ETL
• Oozie – workflow
• Hive – data warehouse SQL
• HBase – interactive storage
• Ambari – monitoring
• SystemT – text analytics
• Big R & ML – statistical analysis

Note: in the original diagram, the components shown in blue boxes are only available with the IBM BigInsights product.

IBM Announces Major Commitment to Advance Apache® Spark™

⦁ "…the Most Significant Open Source Project of the Next Decade…"

Announcing our commitment to Spark:

• Open source SystemML
• Educate one million data professionals
• Establish a Spark Technology Center
• Founding member of AMPLab
• Contributing to the core

We are Contributing SystemML

• SystemML unifies the fractured machine-learning environments
• Gives the core Spark ecosystem a complete set of DML (declarative machine learning) capabilities
• Allows a data scientist to focus on the algorithm, not the implementation
• Improves time to value for data science teams
• Establishes a de facto standard for reusable machine-learning routines

Our largest contribution to open source since Linux.


Our investment to grow skills: educate 1 million data scientists and data engineers

• Big Data University MOOC
• Spark Fundamentals I and II
• Advanced Spark Development series
• Foundational Methodology for Data Science
• Partnerships with Databricks, AMPLab, DataCamp and MetiStream

Spark Technology Center

• Inspire the use of Spark to solve business problems
• Encourage adoption through open and free educational assets
• Demonstrate real-world solutions to identify opportunities
• Use the learning to improve Spark and its application

Our goal is to be the #1 Spark contributor and adopter.


Our Partner Ecosystem

Our Use of Spark at IBM

⦁ More than 30 IBM Research initiatives
⦁ 100 incubated applications in 10 days
⦁ 3,500 researchers and developers committed to Spark

Now:
• IBM Open Platform with Apache Hadoop
• IBM InfoSphere Streams
• IBM Platform Computing

Targeted for later in the year:
• Apache Spark as a Service on IBM Bluemix (in beta)
• IBM Watson Analytics
• SPSS Modeler & Analytics Server
• IBM DataWorks
• IBM PureData Systems with Fluid Query
• IBM Commerce
