IS-4011: Accelerating Analytics on HADOOP using OpenCL, by Zubin Dowlaty and Krishnaraj Gharpure
DESCRIPTION
Presentation IS-4011, Accelerating Analytics on HADOOP using OpenCL, by Zubin Dowlaty and Krishnaraj Gharpure at the AMD Developer Summit (APU13), November 11-13, 2013.
TRANSCRIPT
ACCELERATING ANALYTICS ON HADOOP USING OPENCL
ZUBIN DOWLATY
KRISHNARAJ GHARPURE
ROHIT SAREWAR
PRASANNA LALINGKAR
AUSTIN CV
MU SIGMA INC.
November 2013
| Accelerating Analytics on HADOOP using Open CL | NOVEMBER 19, 2013 | CONFIDENTIAL
AGENDA
High Level Trends in the Applied Analytics Arena
‒ Computational Enablement
‒ Usability & Visualization
‒ Intelligent Systems
Computational Enablement
‒ Big Data & Hadoop
‒ Reference Diagram
‒ OpenCL and Hadoop Research
THE INFORMATION EXPLOSION IS LEADING TO AN UNPRECEDENTED ABILITY TO STORE, MANAGE AND ANALYZE DATA; WE NEED TO RUN AHEAD..
Data Source Explosion
Expanding Applications
Technology Evolution
Advanced Analytical Techniques
Digital Data in 2011: ~ 1750 Exabytes
Facebook: 100+ million users per day
Mobile Phone Subscription: 4.6 billion
15 Million Bing recommendation records
RFID Tags: 40 billion
INFORMATION
CLOUD
Click-stream data
Purchase intent data
Social Media, Blogs
Email Archives
RFID & Geospatial Information
Events
Opinion Mining
Social Graph Targeting
Influencer Marketing
Offer Optimization
Clinical Data Analysis
Fraud Detection
SKU Rationalization
Massively large databases
Search Technologies
Complex Event Processing
Cloud Computing
Software as a Service
Interactive Visualization
In Memory Analytics
CART
Adaptive Regression Splines
Machine Learning
Radial Basis Functions
Support Vector Machines
Neural Networks Geospatial Predictive Modeling
Intra / Inter Firewall
THREE KEY MACRO ANALYTICS TRENDS TARGETING CONSUMPTION
Macro Trends: Agenda in All Three
Analytics Trends
Computational Enablement
– Scalable Analytics, Big Data & NoSQL technologies, GP-GPU, In-Memory & High Performance Computing
– Master Data Management, Cloud, and Federated Data – the end of the central EDW?
– Analytics as a Service and the Cloud
– Open Source Maturity
– Rise of Agile Self Service
– Data Model Disruption: ETL vs. ELT
Usability and Visualization
– Dashboards will Evolve into Cockpits
– Improved Interactive Data Visualization & Aesthetics
– Augmented / Virtual Reality / Natural User Interfaces
– Graph Analytics sparked by Social Data
– Discovery with Computational Geometry
– Mobile BI and the Integration of Consumer & Enterprise Devices
– Gamification: Do you want to play a game?
Intelligent Systems (“Anticipation Denotes Intelligence”)
‒ Operational Smart Systems are Back propagating – Pervasive Analytics
‒ Real Time Analytics and Event Streams – Beyond Batch..
‒ Artificial Intelligence (AI) is not what we expected
‒ Don’t forget the 4th tier, it’s Juicy
‒ Metaheuristics and the Ants
‒ Operational Analytics, You are the Exception
‒ Experiments in the Enterprise and the Quasi
‒ Just Say No to OLS
‒ Energy Informatics is Radiating
‒ Intelligent Search
“BIG DATA”, THE TERM, IS ABOUT APPLYING NEW TECHNOLOGIES TO MEET UNFULFILLED NEEDS: ANALYTICS AT SCALE…
What is Big Data?
Big Computation Map Reduce
BIG Computation & Big Data Scaling Economically
Intensive Computational Analysis not Supported by the Traditional Data Warehouse Stack and Architecture
BIG DATA’S RECENT ENTERPRISE POPULARITY CAN BE ATTRIBUTED TO TWO KEY FACTORS
Big Data Technology Evolution
2003 2004 2005 2006 2007 2008 2009 2010 2011
Apache: Hadoop project
Yahoo: 10K core cluster
IBM Acquires
Early Research Open source dev momentum
Initial success stories
Commercialization & Adoption
Cloudera named most promising startup funded in 2009
elastic map-reduce
Teradata buys Aster (map-reduce DB)
Google Trigger: Releases “Map Reduce” & “GFS” paper
Google: Bigtable paper
Lucene subproject
HP Acquires Vertica
EMC Acquires Greenplum
2012
Oracle, Microsoft
SAS
Technology Trigger
‒ Robust Open Source Technology for Parallel Computation on Commodity Hardware (Hadoop / NoSQL)
Frustration Trigger..
‒ Tyranny of the data model; cost and complexity of data integration; agility, impact, bureaucracy; difficulty and cost of scaling existing EDW technology
A MINDSET FOR ENABLING DECISION SCIENCES AT SCALE
DATASET -> TOOLSET -> SKILLSET -> MINDSET
Big Data – What is it?
Skillset
‒ Computer Scientist
‒ Map-Reduce, HDFS, HIVE, R, PIG, NoSQL, Unstructured Data
‒ Parallel computation & Stream Processing
‒ Data & Visual Explorer
‒ Math / Stats / Data Mining / Machine Learning
‒ Math & Technology & Business
‒ Data “Artisans” / Design Thinking
‒ Systems Theory, Generalizations
Mindset
‒ Automation, Scale, Agility
‒ Higher utilization of machines for decisions
‒ Learning vs Knowing
‒ Consumption Focused
‒ Algorithmic and Heuristic (Man & Machine)
‒ Intrapreneur, Curious, Persuasive, Soft Skills
THE BIG DATA ECO-SYSTEM IS A CHALLENGE TO KEEP UP WITH..
MINDSET: KNOWING VS. LEARNING..
The Big Data Ecosystem
”Formulating a right question is always hard, but with big data, it is an order of magnitude harder, because you are blazing the trail (not grazing on the green field).” source: Gartner
Source: The Big Data Group llc
TREND: DASHBOARDS WILL EVOLVE TO BE MORE ACTIONABLE AND FORWARD-LOOKING
Dashboredom?
http://www.information-management.com/issues/20_7/dashboards_data_management_quality_business_intelligence-10019101-1.html
“As we move from the information age into the intelligence age”..
Usability and Visualization: Evolution
OPEN SOURCE: D3 (DATA-DRIVEN DOCUMENTS) HTTP://D3JS.ORG/
1. Helps to bring data to life using HTML5, SVG and CSS.
2. It is an extension of “Protovis”, a very popular JavaScript library.
3. With minimal overhead, D3 is extremely fast, supporting large datasets and dynamic behavior for interaction and animation.
4. It provides many built-in reusable functions and function factories, such as graphical primitives for area, line and pie charts.
http://d3js.org/
Usability and Visualization
OPEN SOURCE:
R is the premier language for data scientists
‒ Strength in data visualizations..
‒ http://www.r-project.org/
Usability and Visualization
GRAPH NETWORK VISUALIZATION APPLIED TO METRIC DATA: CORRELATION NETWORKS
Usability and Visualization
Yippee ! 5 % discount ..
Yeah.. I think I need these headphones too
THE CURRENT BUSINESS SCENARIO DEMANDS AUTOMATING AND INJECTING INTELLIGENCE INTO VARIOUS ENTERPRISE TOUCH POINTS
Webpage
Add to Cart
Checkout
Yearly Bonus Let me buy a new laptop
Taxonomy, Associations, Nearest Neighbor Popup
Optimization, Need State Personalization
Collaborative Filtering
Recommender Systems
Offer Optimization
Behind the Scenes
Marketing
Mix
CRM
Revenue
Management
Hey ! I have some good deals being provided here
Wow ! Great offers.. Since I have some more money, let me have a look
Customer Segmentation
Propensity Score
Attribution Methods to Optimize Media Spend
Pricing Elasticity
Add Intelligent Analytics Everywhere You Can to Unlock The Value In Information
Search Buy laptop
WE ARE IN THE ‘INFORMATION AGE’ AND IT’S TIME THAT DATA AND HUMAN ANALYTICS CAPABILITIES ARE AUGMENTED WITH MACHINE INTELLIGENCE
Intelligent Systems: Next Wave of Innovation and Change (Iron Man Suit)
Data Mining Finally Meets the Operation.. Powered by Big Data Scale & Value
Convergence of the global industrial system with the power of advanced computing, analytics, low-cost sensing and new levels of connectivity permitted by the Internet
* Source: GE (Industrial Internet – Pushing the Boundaries of Minds and Machines)
Intelligent Machines
Advanced Analytics
People At Work
CURRENT NEWS ON INTELLIGENT SYSTEMS
CEP Vendors Acquired June 2013: Streambase (Tibco) and Progress Apama (Software AG)
USA Today, September 16, 2012
– http://www.usatoday.com/money/business/story/2012/09/16/review-automate-this-probes-math-in-global-trading/57782600/1
Two Second Advantage: Published 2011
– Wayne Gretzky’s edge: he predicted, a little better than anyone else, where the hockey puck would be
– Enterprise 2.0: Every transaction became a bit of digital data
– Enterprise 3.0: Every EVENT can become a bit of digital data
– There will always be value in making projections (days, months, years out) with predictive analytics, data mining, and analytics systems built on historical data: batch systems
– But in today’s world, enterprises need something more: an instantaneous, proactive, predictive capability, i.e. real-time systems
GE Minds and Machines: November 2012
» Industrial Internet, Intelligent Systems, and Intelligent Machines
» http://www.ge.com/docs/chapters/Industrial_Internet.pdf
» http://www.youtube.com/watch?v=SvI3Pmv-DhE
AGENDA
High Level Trends in the Applied Analytics Arena
‒ Computational Enablement
‒ Usability & Visualization
‒ Intelligent Systems
Computational Enablement
‒ Big Data & Hadoop
‒ OpenCL and Hadoop Research
THE DIVERSE WORLD OF HADOOP
Self healing, distributed storage on commodity hardware: inexpensive storage
+
Fault tolerant distributed processing: an abstraction for parallel computing
Manage complex data – relational and non relational – in a single repository
“Data Reservoir/Lake”
Store massive data sets, scale horizontally with inexpensive commodity nodes
Process at source – eliminate data movement
Complements your existing relational database technologies that excel at structured data manipulation
**NEWS FLASH**
Hadoop as the data “OS”: Hadoop 2.0 and YARN supporting other parallel frameworks such as MPI (SAS HPC announcement)
High Speed SQL (STRATA 2013): Impala, Dremel, SHARK, Hadapt
ETL & EDW Disruption. GigaOM: Think Bigger.. Schema on Read, ELT
Maintenance: Zookeeper, Puppet
Data management and analytics: HBASE, Pig, Hive, Rhipe, Rhadoop, Mahout
Monitoring: Nagios, Ganglia, Splunk
Computational Enablement
THE HPC LANDSCAPE IS VAST AND CONTINUOUS RESEARCH TO UNDERSTAND THE STRENGTHS AND WEAKNESSES OF DIFFERENT COMPONENTS IS THE KEY
Techniques for writing parallel programs
Already in use by Mu Sigma
Should be explored
Legend
Not necessary in current context
Techniques supporting parallelization on the infrastructure
• Size indicates our comfort level with the technique/tool
Computation Driven
Data Driven
Shared Memory
HAMA MPI
GPGPU
Map-Reduce
STORM
Distributed Memory
JumboMem
SPARK
OpenMP
Agent, Actor Model
Intel TBBs
HPCC
HADOOP 2.0: “THE OS FOR DATA”
Source: HortonWorks
MU SIGMA IMPLEMENTED A BIG DATA SOLUTION THAT REDUCED SCORING TIME FROM 18 HOURS TO 2 HOURS..
Marketer: Process to score a list was taking 18 hours…
Case study: pro CRM client
Data sourcing -> Data management -> Data processing
Load scored customer data in the specified format back to EDW
INGEST
Big Data Infrastructure
Score all customers based on Prioritization logic & produce output in specified format
Data Consumption
Data Scientists adding propensity scores,segmentation codes, LTV, etc..
GLM on Map-Reduce, Attribution modeling on clickstream data
IP Creative 1 Creative 2 Creative 3 Purchase (1/0)
E-Commerce sites are constantly collecting clickstream data, of the order of terabytes, on various attributes such as
‒ users,
‒ the creatives they’ve viewed,
‒ their transactions, etc.
Frequently, when an online customer makes a purchase, the sale is attributed to the most recent creative
However, this inferred causality is not necessarily always correct
A previous creative or sequence of creatives might actually have triggered the customer’s decision to make a purchase
Background
Statistical techniques such as logistic regression can help
‒ better understand the contributions of different creatives towards purchases and
‒ identify key purchase-driving creatives
Hadoop can store and process large volumes of data and R is a powerful statistical programming language
Leveraging R to perform statistical analysis on large clickstream data stored in Hadoop using map-reduce is the natural next step
Least-squares regression, a crucial component of iterative methods used to solve logistic regression problems, has been implemented here in R, using map-reduce
Solution
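Since the deck describes least-squares as a crucial component of the iterative methods used for logistic regression, a minimal single-machine sketch of one such method, IRLS (iteratively reweighted least squares), may help. The class name, data, and iteration count below are illustrative and not from the deck; in the deck's setting, each weighted least-squares step would run as a map-reduce job in R.

```java
public class IrlsLogistic {
    // One predictor plus intercept; each IRLS step solves a 2x2 weighted
    // least-squares system: beta += (X^T W X)^{-1} X^T (y - p).
    public static double[] fit(double[] x, double[] y, int iters) {
        double b0 = 0, b1 = 0;
        for (int it = 0; it < iters; it++) {
            double a00 = 0, a01 = 0, a11 = 0, g0 = 0, g1 = 0;
            for (int i = 0; i < x.length; i++) {
                double p = 1.0 / (1.0 + Math.exp(-(b0 + b1 * x[i])));
                double w = p * (1 - p);          // IRLS weight p(1-p)
                a00 += w; a01 += w * x[i]; a11 += w * x[i] * x[i];
                g0 += y[i] - p; g1 += (y[i] - p) * x[i];
            }
            double det = a00 * a11 - a01 * a01;  // invert the 2x2 system
            b0 += ( a11 * g0 - a01 * g1) / det;
            b1 += (-a01 * g0 + a00 * g1) / det;
        }
        return new double[]{b0, b1};
    }

    public static double predict(double[] beta, double x) {
        return 1.0 / (1.0 + Math.exp(-(beta[0] + beta[1] * x)));
    }
}
```

On a small non-separable sample the iteration converges in a handful of steps; the XᵀWX and Xᵀ(y − p) accumulations are exactly the block-wise sums that map well onto mappers and a summing reducer.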
w.x.y.z   3:15 AM   -         3:11 AM   1
a.b.c.d   -         9:54 PM   5:12 PM   0
e.f.g.h   5:49 AM   6:16 AM   -         1
Application: Machine Logs
SPEEDING UP HADOOP WITH THE GPU
Hadoop jobs are designed to run in batch, with data being read off disks and streamed into ‘mappers’ and ‘reducers’
‒ Execution times of Hadoop jobs can be viewed in terms of a fixed overhead, a variable overhead (dependent mainly on data size) and the processing time, which depends on the complexity of computations and the size of data
‒ For a given size of data, the most computationally trivial job in Hadoop (an identity job) is usually measurable in tens of seconds
The testing objective was to measure the speedups achievable by using GPUs within MapReduce for relatively complex data processing
DOES THE GPU HELP?
[Diagram: job execution time decomposed into a fixed overhead, a variable overhead that grows with data size, and a processing component that grows with processing complexity]
SETTING IT UP AND MAKING IT WORK
Server used, with the details below
Since mappers/reducers run in parallel, an ideal Hadoop cluster with a GPU per node was simulated by running mappers/reducers on one machine with one GPU
It was found that
‒ Hadoop doesn’t recognize the GPU card
‒ Needs to access GUI services provided by the X-server to be able to do so*
‒ A few modifications to the user’s shell configurations were required to get around this
CONFIG AND TEETHING TROUBLES
* AMD is currently working on getting OpenCL working without the X-server
Details
CPU: 4-core Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz
RAM: 16 GB
GPU: AMD/ATI Radeon HD 7970 with 3GB GDDR5 Memory
Base software: Ubuntu 12.04.3 LTS, Java 1.6.0_31
Hadoop + ecosystem: Hadoop 2.0.0-cdh4.4.0, MapReduce 0.20
OpenCL + ecosystem: OpenCL 1.2, JOCL
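The deck does not spell out the shell modifications it mentions; on AMD's Linux driver stack of that era the usual workaround was to point the OpenCL runtime at the running X server, along these lines (an assumption, environment-specific):

```shell
# Hypothetical ~/.bashrc additions for the Hadoop user, so that OpenCL can
# reach the X-server-backed GPU when jobs run without a desktop session.
export DISPLAY=:0      # point at the X server that owns the GPU
export COMPUTE=:0      # fglrx-era hint used by AMD's OpenCL runtime
```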
EXPERIMENTAL DETAILS: THE USE-CASE

Linear regression was written in Java MapReduce with and without a GPU
‒ Uses N rows and (M + 2)* columns of data to fit the expected value of the dependent variable as a linear combination of a set of M predictor variables, such that for the i-th row
  E[y_i] = β0 + Σ_{j=1}^{M} β_j x_{i,j}
‒ Regression coefficients are computed by solving the Normal Equations
  β = (XᵀX)⁻¹(Xᵀy), where X is a matrix of dimension N × (M + 1) and y is a column vector of length N

Data in Hadoop is distributed across multiple nodes into B blocks that are mutually exclusive and collectively exhaustive (notwithstanding Hadoop’s replication factor)
‒ The X_bkᵀX_bk and X_bkᵀy_bk (**) terms are computed and appended in mappers for each input block
‒ The augmented terms output from each of the mappers are summed, and the inversion and multiplication of terms are performed in the reducer

Using GPUs, each 2D input*** to mappers is flattened by row and passed as a 1D array to a GPU kernel function
‒ Mappers compute dot products of all sub-arrays formed by elements e_h such that h mod (M + 2) = l, for l from 0 to M + 1

* The dependent variable, the M predictor variables and a column of 1s used to compute the intercept coefficient, β0
** b_k for k from 1 to B, indicating the blocks of data that constitute the input data
*** 1D in case data is processed 1 row at a time
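The block-wise normal-equations scheme above can be sketched in plain Java (no Hadoop dependency; class and method names are illustrative): each simulated mapper emits its block's XᵀX and Xᵀy, and the reducer sums them and solves for β.

```java
// Illustrative sketch: block-wise normal equations for OLS, mirroring the
// mapper/reducer split described above. Each "mapper" computes X_bk^T X_bk
// and X_bk^T y_bk for its block; the "reducer" sums them and solves for beta.
public class BlockwiseOls {

    // One mapper's contribution: partial X^T X of size (M+1)x(M+1).
    static double[][] partialXtX(double[][] block, int m) {
        double[][] xtx = new double[m + 1][m + 1];
        for (double[] row : block)                      // row = [1, x1..xM, y]
            for (int i = 0; i <= m; i++)
                for (int j = 0; j <= m; j++)
                    xtx[i][j] += row[i] * row[j];
        return xtx;
    }

    static double[] partialXty(double[][] block, int m) {
        double[] xty = new double[m + 1];
        for (double[] row : block)
            for (int i = 0; i <= m; i++)
                xty[i] += row[i] * row[m + 1];          // last column is y
        return xty;
    }

    // "Reducer": solve (X^T X) beta = X^T y by Gaussian elimination.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int p = 0; p < n; p++) {
            int max = p;                                // partial pivoting
            for (int r = p + 1; r < n; r++)
                if (Math.abs(a[r][p]) > Math.abs(a[max][p])) max = r;
            double[] t = a[p]; a[p] = a[max]; a[max] = t;
            double tb = b[p]; b[p] = b[max]; b[max] = tb;
            for (int r = p + 1; r < n; r++) {
                double f = a[r][p] / a[p][p];
                b[r] -= f * b[p];
                for (int c = p; c < n; c++) a[r][c] -= f * a[p][c];
            }
        }
        double[] beta = new double[n];
        for (int r = n - 1; r >= 0; r--) {              // back substitution
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * beta[c];
            beta[r] = s / a[r][r];
        }
        return beta;
    }

    public static double[] fit(double[][][] blocks, int m) {
        double[][] xtx = new double[m + 1][m + 1];
        double[] xty = new double[m + 1];
        for (double[][] block : blocks) {               // sum mapper outputs
            double[][] p = partialXtX(block, m);
            double[] q = partialXty(block, m);
            for (int i = 0; i <= m; i++) {
                xty[i] += q[i];
                for (int j = 0; j <= m; j++) xtx[i][j] += p[i][j];
            }
        }
        return solve(xtx, xty);
    }
}
```

Fitting noise-free data generated from y = 1 + 2·x1 − 0.5·x2, split across two blocks, recovers β exactly regardless of how the rows are partitioned, which is why the block sums can be computed independently in mappers.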
USING THE GPU FROM MAPPERS
DATA MOVEMENT AND COMPUTATIONS
Computing the XᵀX term, for X with rows [1 1 2] and [1 3 4]:

Xᵀ          X            XᵀX
1 1         1 1 2        2  4  6
1 3    ×    1 3 4   =    4 10 14
2 4                      6 14 20
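The flattened-array indexing described above can be sketched as follows (plain Java, illustrative names): the 2D block is flattened row by row, and the (l1, l2) entry of XᵀX is the dot product of the sub-arrays of elements e_h with h mod numCols = l1 and h mod numCols = l2. Running it on X = [[1, 1, 2], [1, 3, 4]] reproduces the XᵀX shown above.

```java
// Illustrative sketch of the flattened 1D layout used by the GPU kernel:
// a 2D block is flattened row-by-row, and X^T X entries are accumulated as
// per-column dot products identified by h mod numCols.
public class FlattenedXtX {
    public static double[][] xtx(double[] flat, int numCols) {
        double[][] out = new double[numCols][numCols];
        for (int h = 0; h < flat.length; h++) {
            int col = h % numCols;              // column index of element e_h
            int rowStart = h - col;             // first element of e_h's row
            for (int c = 0; c < numCols; c++)
                out[col][c] += flat[h] * flat[rowStart + c];
        }
        return out;
    }
}
```

The double loop realizes the per-column dot products without ever reshaping the flattened buffer, which is what makes this layout convenient for a kernel working on a 1D array.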
MAPREDUCE ON CPU
Mapper class uses methods
‒ setup(): Called once at the beginning of the map task
‒ map(): Called for each (K,V) pair; creates a list of all the values in the input split
‒ cleanup(): Called once at the end of the map task; the OLS code is implemented in this method
Reducer class uses methods
‒ setup(): Called once at the beginning of the reduce task
‒ reduce(): Called for each key; the sum of intermediate matrices is computed in this method
‒ cleanup(): Called once at the end of the reduce task; matrix transpose and β values are computed here
Mapper Reducer
CPU OLS CODE LOGIC
Calculate XᵀX, XᵀY for input blocks
Mapper class uses methods
‒ setup(): Called once at the beginning of the map task
‒ map(): Called for each (K,V) pair; creates a list of all the values in the input split
‒ cleanup(): Called once at the end of the map task; the OLS code is implemented in this method using the JOCL library
Reducer class uses methods
‒ setup(): Called once at the beginning of the reduce task
‒ reduce(): Called for each key; the sum of intermediate matrices is computed in this method
‒ cleanup(): Called once at the end of the reduce task; matrix transpose and β values are computed here
MAPREDUCE ON GPU + CPU
Mapper Reducer
GPU OLS CODE LOGIC
EXPERIMENTAL OUTCOMES
Processing data one row at a time is not efficient; runtimes were therefore measured for ‘chunks’ of data of increasing size
Performing linear regression using GPUs within MapReduce consistently outperformed the same exercise using ‘vanilla’ MapReduce code
KEY RESULTS
[Chart: Multiple data vs row-by-row processing* — job execution time (s), GPU vs CPU, for 1 row at a time vs 1000 rows at a time]
[Chart: Increasing the size of data chunks** — job execution time (s) vs number of rows (1,000 to 100,000), GPU and CPU]
[Chart: Increasing the size of data chunks** (GPU only) — job execution time (s) vs number of rows (1,000 to 100,000)]
* Processing times for 1000 rows and 256 columns, 1 row at a time and all together
** Processing times for 100,000 rows and 256 columns, processed in chunks of ‘Number of rows’
EXPERIMENTAL OUTCOMES
CPU execution times show a sharp rise on data beyond 16 columns and 100,000 rows
GPU execution times do not show a significant rise for rows from 1,000 to 1,000,000 and columns from 4 to 256
Beyond 1,000,000 rows, GPU execution times grow, though far more slowly than CPU execution times
* Processing time for data sets where number of rows range from 1K to 10 M and columns range from 4 to 256
[Surface chart: GPU execution time* — job execution time (s) vs number of rows (1K to 10000K) and number of columns (4 to 256)]
[Surface chart: CPU execution time* # — job execution time (s) vs number of rows (1K to 10000K) and number of columns (4 to 256)]
#1 CPU execution times for a few data sets exceed 600 seconds and are not clearly visible in the graph above:
Rows Columns Time (s)
1M 256 804
10M 128 1863
10M 256 7701
CPU AND GPU EXECUTION TIME
#2 The 2 bounds of the sharp increase in CPU execution times are 256 columns * 100 K rows (= 51.2 MB) and 16 columns * 10 M rows (= 320 MB)
EXPERIMENTAL OUTCOMES
* For data sets where number of rows range from 1K to 10 M and columns range from 4 to 256
[Surface chart: Ratio of CPU to GPU execution time* vs number of rows (1K to 10000K) and number of columns (4 to 256); the ratio rises to 17-19 at the largest data sizes]
COMPARING CPU AND GPU
For very small data sizes, CPU and GPU have comparable runtimes
As the number of columns increases, CPU execution time increases gradually compared to GPU
As the number of rows increases, CPU execution time increases drastically compared to GPU
For larger data sizes, the ratio of CPU to GPU execution time shows a steep increase
TAKEAWAYS
RECOMMENDATIONS AND NEXT STEPS
GPUs can be effectively used within MapReduce to speed up relatively computationally heavy tasks like matrix operations
‒ Since copying data to/from a GPU is expensive, a large amount of data should be copied and processed ‘in one go’
‒ Computation time asymptotically decreases with increasing data chunk size
Multiple mappers/reducers running on a node can share the GPU
The size of data chunks and memory allocation must be chosen carefully, when one GPU is shared across mappers/reducers, to ensure all processes run successfully and faster than they otherwise would
‒ Using a GPU is not necessarily more ‘efficient’ for data with a small number of columns
‒ Smaller blocks increases the degree of MapReduce parallelism but also leads to more mappers using the GPU simultaneously
‒ Improper memory allocation in Hadoop and on the GPU can lead to either unused memory, or more memory required than available
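The trade-off between chunk size and per-transfer overhead can be illustrated with a toy cost model (our assumption, with made-up constants, not measurements from the deck): every transfer to the GPU pays a fixed overhead, so fewer, larger transfers amortize it.

```java
// Illustrative cost model, not measured in the deck: processing N rows in
// chunks of size c costs roughly ceil(N / c) * transferOverhead + N * perRow.
public class ChunkCostModel {
    public static double jobSeconds(long rows, long chunk,
                                    double transferOverheadSec, double perRowSec) {
        long transfers = (rows + chunk - 1) / chunk;   // ceil(rows / chunk)
        return transfers * transferOverheadSec + rows * perRowSec;
    }
}
```

With, say, 50 ms per transfer and 10 µs per row, single-row transfers on 100,000 rows cost thousands of seconds of pure overhead, while one 100,000-row transfer costs about a second, mirroring the measured benefit of larger chunks.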
Future exploration includes
‒ Measuring changes in job execution time, with the GPU also being used in the reducer, for OLS
‒ Empirically determining optimal values for data size and memory allocation when one or more jobs run multiple mappers/reducers on a node sharing a GPU
APPENDIX
EXPERIMENTAL RESULTS
RUNTIMES OF CPU AND GPU
Data size for different numbers of rows and columns:

Cols \ Rows    1K        10K        100K       1M        10M
4              8 KB      80 KB      800 KB     8 MB      80 MB
8              16 KB     160 KB     1.6 MB     16 MB     160 MB
16             32 KB     320 KB     3.2 MB     32 MB     320 MB
32             64 KB     640 KB     6.4 MB     64 MB     640 MB
64             128 KB    1.28 MB    12.8 MB    128 MB    1.28 GB
128            256 KB    2.56 MB    25.6 MB    256 MB    2.56 GB
256            512 KB    5.12 MB    51.2 MB    512 MB    5.12 GB

Execution time for CPU OLS code in seconds:

Cols \ Rows    1K     10K    100K    1M     10M
4              10     9      10      14     33
8              11     10     11      14     54
16             10     10     12      20     62
32             10     14     13      44     466
64             12     12     20      120    1073
128            12     12     46      232    1863
256            14     17     157     804    7701

Execution time for GPU OLS code in seconds:

Cols \ Rows    1K     10K    100K    1M     10M
4              12     13     14      12     41
8              11     13     12      16     49
16             12     13     12      18     51
32             12     11     13      22     73
64             12     13     13      36     119
128            11     13     16      39     202
256            13     12     18      57     423

Ratio of CPU to GPU execution time:

Cols \ Rows    1K      10K     100K    1M      10M
4              0.83    0.69    0.71    1.17    0.80
8              1.00    0.77    0.92    0.88    1.10
16             0.83    0.77    1.00    1.11    1.22
32             0.83    1.27    1.00    2.00    6.38
64             1.00    0.92    1.54    3.33    9.02
128            1.09    0.92    2.88    5.95    9.22
256            1.08    1.42    8.72    14.11   18.21
[email protected] @crunchdata