Page 1: Scalable Machine Learning using Big Data SQL, Hadoop and Spark

Advanced Analytics at Scale
BIWA Summit 2016
Marcos Arancibia, Product Manager, Oracle Data Science

Copyright © 2016 Oracle and/or its affiliates. All rights reserved.

Page 2: Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Page 3: Use Cases

1. Root Cause Analysis of semiconductor manufacturing that uses custom-built R algorithms running against data in the Database and against Hadoop to verify the potential of Big Data SQL.

2. Predicting airline flight cancellations using Logistic Regression directly against a Big Data Cluster.

Page 4: Data Analytics Challenge

Separate data access interfaces…

Page 5: Data Analytics Challenge

…which require separate Predictive Analytics interfaces.

Diagram labels: NoSQL, OAA, ORAAH, R

Page 6: Oracle Big Data SQL

Expanding the reach of OAA with Predictive Analytics via SQL

Diagram labels: NoSQL, OAA, ORAAH, R

Page 7: 1. Root Cause Analysis - Requirements (Semiconductor Industry)

• BISTel is an Oracle Partner in the USA and in Korea, and it is the leading provider of equipment engineering systems and services for the fabrication of semiconductor chips and flat panel displays.

• BISTel offers solutions and services that enable the customers to achieve yield improvement and increase productivity.

• The requirement was to be capable of using their original 8,000+ lines of R code that express some of their custom algorithms on large scale data using both Oracle Database data as well as data stored in Hadoop.

Page 8: Traditional Manufacturing Yield Mgt Process

Traditionally …

• Identify yield loss patterns manually or through pre-defined pattern libraries
• Extract items/lots with similar patterns
• Use multiple 3rd party tools to do different analysis to find cause down to suspected tool/chamber level
• Ask equipment engineers to look at their tools for any possible cause
• Report on possible causes
• Engineer searches through various systems to see if there is any correlation

Result: Hours/Days to Find Root Cause

Diagram labels: YMS, In-House, 3rd party YMS, SAS, SPSS, JMP, 3rd party BI tools, EES, Tool Logs, FDC, SPC, EPT, RMS, Report

Page 9: New Analytical Approach for Manufacturing

New Paradigm: Changing the way you analyze

• Problem with yield
• Run pattern detection and classification, root cause analysis together for enhanced accuracy and TTRCI
• Run in manual or auto mode
• Engineer views the result and makes a decision/report

ROI from Customer:

Customer | Time to Root Cause Identification (TTRCI) for Traditional Method | TTRCI using BISTel MA + IM
A | 7 days | Within 4 hours
B | 3 weeks | Within 4 hours
B | Could not find within 4 weeks | Found within 3 hours

Page 10: E.g.: Auto Pattern Recognition and Classification

Data
• Coordinate (e.g. x, y)
• Quality Measurement (e.g. thickness, defect)
• Yield data (e.g. good/bad, grade)

Steps
• Manipulate data and convert to map (e.g. circle, rectangle, etc.)
• Extrapolate and interpolate data to full map
• Dynamic pattern recognition and classification
• Root cause analysis of pattern

Page 11: E.g.: Root Cause Analysis with Machine Learning

Root Cause Analysis using Advanced Data Mining

Data
• Effect data
• Cause candidate data
• Continuous/categorical

Steps
• Select Effect
• Select Cause Candidates
• Ranked Result
• Engineer Decision

Page 12: Solution Architecture Overview

Overview of Analytics using Oracle ORE

• Clients: R Console, RStudio, eDataLyzer, SQL Developer or PL/SQL
• Connectivity: JDBC
• EXADATA: Oracle R Enterprise, Oracle R Distribution
• BIG DATA APPLIANCE (BDA): Oracle R Distribution
• ORE Packages with BISTel’s algorithms: MA, Intellimine, TA

Page 13: Interfaces of the usage in Production: Database

How Oracle ORE is used in Production Environment

Diagram labels: ETL, User Request, Auto Request, Notify Analysis Result, .Net Client, Java Analysis Server, JDBC, EXADATA, Various Database for OLTP, multiple ExtProc processes, multiple Intellimine instances

Page 14: Features of Oracle Advanced Analytics

From either an R Client or a SQL Client, OAA in-Database algorithms, R Engines and Open-Source R Packages can be accessed.

Oracle Database Server with Advanced Analytics Option:

• R Client (R Analytics, Oracle R Enterprise): ORE parallel algorithms (MLP Neural, Stepwise, LM, GLM, PCA) and access to open-source R packages
• SQL Client (SQL Developer, other SQL apps): SQL basic statistics and joins; Data Mining predictive analytics with 15 PL/SQL in-Database algorithms

Page 15: Features of Oracle Advanced Analytics

Oracle R Distribution x64 running on the Intel Platform can make use of Intel’s MKL for additional performance, even with open-source R packages.

Benchmark | ORD with internal BLAS/LAPACK, 1 thread | ORD + MKL, 1 thread | ORD + MKL, 2 threads | ORD + MKL, 4 threads | ORD + MKL, 8 threads | Performance gain, ORD + MKL 4 threads | Performance gain, ORD + MKL 8 threads
Matrix Calculations | 11.2 | 1.9 | 1.3 | 1.1 | 0.9 | 9.2x | 11.4x
Matrix Functions | 7.2 | 1.1 | 0.6 | 0.4 | 0.4 | 17.0x | 17.0x
Matrix Multiply | 517.6 | 21.2 | 10.9 | 5.8 | 3.1 | 88.2x | 166.0x
Cholesky Factorization | 25 | 3.9 | 2.1 | 1.3 | 0.8 | 18.2x | 29.4x
Singular Value Decomposition | 103.5 | 15.1 | 7.8 | 4.9 | 3.4 | 20.1x | 40.9x
Principal Component Analysis | 490.1 | 42.7 | 24.9 | 15.9 | 11.7 | 29.8x | 40.9x
Linear Discriminant Analysis | 419.8 | 120.9 | 110.8 | 94.1 | 88.0 | 3.5x | 3.8x

This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5. More details at https://blogs.oracle.com/R/entry/oracle_r_distribution_3_0
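Most of the "performance gain" figures in the benchmark are consistent with (baseline time ÷ MKL time) − 1 rather than a plain ratio. That reading is an inference from the numbers, not something the slide states, and not every row reproduces exactly under it. A minimal base R sketch that recomputes the gains for four of the rows (the row selection, the `gain` helper and the column names are illustrative):

```r
# Timings copied from four rows of the benchmark (units as on the slide).
bench <- data.frame(
  test     = c("Matrix Calculations", "Matrix Multiply",
               "Principal Component Analysis", "Linear Discriminant Analysis"),
  baseline = c(11.2, 517.6, 490.1, 419.8),  # ORD with internal BLAS/LAPACK, 1 thread
  mkl4     = c(1.1, 5.8, 15.9, 94.1),       # ORD + MKL, 4 threads
  mkl8     = c(0.9, 3.1, 11.7, 88.0)        # ORD + MKL, 8 threads
)

# Assumed definition of "performance gain": time saved relative to the new time.
gain <- function(old, new) round(old / new - 1, 1)

bench$gain4 <- gain(bench$baseline, bench$mkl4)
bench$gain8 <- gain(bench$baseline, bench$mkl8)
print(bench[, c("test", "gain4", "gain8")])
```

Under this definition a gain of 88.2x for Matrix Multiply means the 4-thread MKL run took roughly 1/89 of the baseline time.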

Page 16: Features of Oracle Advanced Analytics

R: Transparency through function overloading. E.g. in-database aggregation function

Oracle Distribution of R version 3.2.0 (--) -- "Full of Ingredients"

    > aggdata <- aggregate(ONTIME_S$DEST,
    +                      by = list(ONTIME_S$DEST),
    +                      FUN = length)

Through the Transparency Layer of the Oracle Advanced Analytics ORE client packages, the call runs as in-database statistics on ONTIME_S. Oracle SQL:

    select DEST, count(*)
    from ONTIME_S
    group by DEST

Result:

    > class(aggdata)
    [1] "ore.frame"
    attr(,"package")
    [1] "OREbase"
    > head(aggdata)
      Group.1    x
    1     ABE  237
    2     ABI   34
    3     ABQ 1357
    4     ABY   10
    5     ACK    3
    6     ACT   33
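To see the shape of the result without a database, the same `aggregate()` call can be tried on an ordinary data.frame. This is only a local analogue in base R: `ontime_local` is a made-up stand-in for ONTIME_S, and nothing here is pushed to Oracle or returns an ore.frame.

```r
# Hypothetical stand-in for ONTIME_S: a small local data.frame (not an ore.frame).
ontime_local <- data.frame(
  DEST = c("ABE", "ABQ", "ABQ", "BOS", "BOS", "BOS"),
  stringsAsFactors = FALSE
)

# Same call shape as on the slide; on a data.frame it runs in local R memory.
aggdata <- aggregate(ontime_local$DEST,
                     by = list(ontime_local$DEST),
                     FUN = length)

# Row counts per destination, i.e. what "count(*) ... group by DEST" computes.
print(aggdata)
```

The output has the same two columns as on the slide, `Group.1` for the grouping value and `x` for the count.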

Page 17: Features of Oracle Advanced Analytics

R: Scalable Machine Learning Models. E.g. custom parallel distributed lm model

    > options(ore.parallel=4)
    > lm_mod <- ore.lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                       data=ONTIME_S)
    > summary(lm_mod)

    Call:
    ore.lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1462.45    -6.97    -1.36     5.07   925.08

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
    DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
    DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 14.73 on 215144 degrees of freedom
      (4785 observations deleted due to missingness)
    Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
    F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16

Diagram labels: Oracle Advanced Analytics ORE Client Packages, Transparency Layer, Parallel ORE Framework, four extproc processes (Oracle R Distribution, Parallel ore.lm Compute, R Packages), Oracle R Distribution Parallel Compute Engine, ONTIME_S

Page 18: Features of Oracle Advanced Analytics

Server execution of open-source R package: ore.tableApply()

    > mod_biglm <- ore.tableApply(dat = ONTIME_S,   # Database table
        function(dat) {
          library(biglm)   # Load open-source package
          biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
        });
    > library(biglm)       # Load open-source package locally to interpret results
    > summary(mod_biglm)   # Summary of the resulting Model
    Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
    Sample size = 392805
                   Coef    (95%     CI)     SE      p
    (Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
    DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
    DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

Diagram labels: Oracle Advanced Analytics ORE Client Packages, Embedded R, Embedded R ORE Framework, extproc (Oracle R Distribution, Open-source R Packages), ONTIME_S

Page 19: Features of Oracle Advanced Analytics

Server execution of open-source R package: ore.groupApply() for data parallelism

    > options(ore.parallel=4)
    > modList <- ore.groupApply(dat = ONTIME_S,         # Database table
                                INDEX = ONTIME_S$DEST,  # group-by column
        function(dat) {
          library(biglm)   # Load open-source package
          biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
        });
    > library(biglm)       # Load open-source package locally to interpret results
    > summary(modList)     # Checks how many models we have in the model list
    Length    Class Mode
       325 ore.list   S4
    > summary(modList$BOS) # Request the resulting Model for Boston Logan Airport
    Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
    Sample size = 3928
                   Coef    (95%     CI)     SE      p
    (Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
    DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
    DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

Diagram labels: Oracle Advanced Analytics ORE Client Packages, Transparency Layer, Parallel ORE Framework, four extproc processes (Oracle R Distribution, Open-source R Packages), ONTIME_S

Page 20: Features of Oracle Advanced Analytics

SQL interface rqEval for R scripts: can also generate XML string for graphic output

• R script output is often dynamic, not conforming to a pre-defined structure
• R apps generate stats, new data, graphics
• Example
  – Plot 100 random numbers
  – Return a vector with values 1 to 10
  – Return the results as XML

Oracle PL/SQL (with the R Language function as the script body):

    begin
      sys.rqScriptCreate('Example6',
        'function(){
           res <- 1:10
           plot( 1:100, rnorm(100), pch = 21,
                 bg = "red", cex = 2 )
           res
         }');
    end;
    /

Oracle SQL:

    select value
    from table(rqEval(NULL,'XML','Example6'));

Page 21: Performance on Auto Pattern Classification: EXADATA only

Performance of ORE for Auto Pattern Classification

Chart: minutes (log scale) by database size (records). Tested on 2-node EXADATA X3.

Configuration | Database size (records) | Approx. wafers (semiconductor) | Relative size | Minutes
BISTEL | 2GB (57 million) | 30,000 | 1x | 17
OAA | 2GB (57 million) | 30,000 | 1x | 8.8
OAA | 20GB (570 million) | 300,000 | 10x | 37.4
OAA | 200GB (5.7 billion) | 3 million | 100x | 278
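The chart's point is easier to see as ratios. The sketch below assumes the four chart values map to the four bars in order (BISTEL 2GB = 17, OAA 2GB = 8.8, OAA 20GB = 37.4, OAA 200GB = 278 minutes); that mapping is read off the extracted text and is not confirmed by the slide. It compares how run time grows against how data size grows for the OAA runs, in base R:

```r
# Chart values read in order (assumption: one value per bar, left to right).
runs <- data.frame(
  config  = c("OAA 2GB", "OAA 20GB", "OAA 200GB"),
  size_gb = c(2, 20, 200),
  minutes = c(8.8, 37.4, 278)
)

# Growth in data size and in run time relative to the 2 GB OAA run.
runs$size_factor <- runs$size_gb / runs$size_gb[1]
runs$time_factor <- round(runs$minutes / runs$minutes[1], 2)
print(runs)
```

Under that reading, 10x the data costs about 4.25x the time and 100x the data about 31.6x, so run time grows more slowly than data volume.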

Page 22: OAA with Big Data SQL: EXADATA + BDA

Using the in-Database algorithms, plus R Engine and Open-Source R Packages if desired

Diagram labels: Oracle EXADATA with Advanced Analytics Option; R Client (R Analytics, Oracle R Enterprise); SQL Client (SQL Developer, other SQL apps); Big Data SQL; Oracle BIG DATA APPLIANCE

Page 23: Performance on Micro Level Data Analytics: EXADATA + BDA

Performance of ORE for Micro Level Data Analytics

Chart: algorithm execution time in seconds (log scale) by data size, for the series Exa and BDS (Exa+BDA). Tested on EXADATA X5 Full Rack, DoP is 288. Relative-size labels on the chart: 1x, 5x, 50x, 250x.

Data size | Seconds
4G | 42.0
20G | 111.5
200G | 855.0
10T | 41,652

With data either as Database tables or as HDFS files on the BDA, the performance was the same, highlighting the throughput of Big Data SQL, and OAA’s agility on running open-source R code against group-by problems.

EXADATA + Big Data SQL + OAA on a full rack EXADATA X5-2, via InfiniBand to a 9-node BDA X5-2. Degree of Parallelism set to 288.

Page 24: Conclusion: ROI using BISTel Analytics with OAA

ROI using BISTel Analytics on Oracle ORE

ROI for Customer

• Shorten TTRCI (Time to Root Cause Identification): within 1 day, down from 2 weeks to 2 months
• Reduce Investment Cost: use of proven EXADATA and ORE with minimal data migration from existing Oracle DB

ROI for Solution Provider

• Shorten Time to Market
• Faster POC (Proof of Concept) at customer site with new ideas/apps: reduced by more than 30 days
• Dramatic reduction of development and test time: at least by 50%

Page 25: 2. Flight Cancellation prediction in USA airports (Airline Industry)

• On-time arrival data for non-stop domestic flights by major air carriers can be found at the Bureau of Transportation Statistics website, and is free to download: http://www.transtats.bts.gov/Fields.asp?Table_ID=236

• Several benchmarks have been executed against this dataset, known as ONTIME. The original combined dataset had 123 million records and contained data from October 1987 to April 2008. It is available at many websites, including the American Statistical Association: http://stat-computing.org/dataexpo/2009/

• We augmented the original data with information until September 2014 to get 159 million unique records. For scalability testing, we appended the file onto itself to get 1 billion total records.

Page 26: Oracle R Advanced Analytics for Hadoop

Advanced Analytics algorithms in a Hadoop Cluster: Map-Reduce and Spark based

• Regression and Classification: Generalized Linear Model, Logistic Regression, Linear Regression, Multi-Layer Neural Networks
• Attribute Importance: Principal Components Analysis
• Clustering: Hierarchical k-Means
• Feature Extraction: Nonnegative Matrix Fact (NMF)
• Collaborative Filtering (LMF)
• Statistical Functions: Correlation, Covariance, Cross Tabulation, Summary statistics

Page 27: Oracle R Advanced Analytics for Hadoop vs. RHadoop (RMR)

Best platform available to run Map-Reduce R jobs vs. Revolution Analytics’ RHadoop

Performance on a 6-node BDA X3-2, 16 cores and 47 GB of total RAM assigned. Covariance computation on 100 GB HDFS / 200 columns input dataset.

More info at https://blogs.oracle.com/R/entry/oraah_enabling_high_performance_r

Page 28: Oracle R Advanced Analytics for Hadoop: Integration

Using ORAAH’s Hadoop and HIVE Integration, plus R Engine and Open-Source R Packages

Hadoop Cluster with Oracle R Advanced Analytics for Hadoop:

• R Client (R Analytics, Oracle R Advanced Analytics for Hadoop): ORAAH distributed algorithms: MLP Neural Nets*, Logistic Reg*, GLM, LM, PCA, k-Means, NMF, LMF. Open-source R packages via Map-Reduce. (* Spark-Caching enabled)
• SQL Client (SQL Developer, other SQL apps): HQL basic statistics, data prep, joins and view creation
• Oracle Database Server with Advanced Analytics option

Page 29: ORAAH: Transparency functions against HIVE tables

Invoke ORAAH transparent functions for HIVE: JOIN

Join the two tables by one common variable:

    > joined <- merge(tab_input, tab_input2, by="value")

The new table is a temporary HIVE table not seen in here:

    > ore.ls()
    [1] "tab_input"  "tab_input2"

But it’s part of the local R objects:

    > ls()
    [1] "joined"
    > names(joined)
     [1] "value" "v1.x"  "v2.x"  "v3.x"  "v4.x"  "v5.x"  "v6.x"  "v7.x"  "v8.x"  "v9.x"
    [11] "v10.x" "v11.x" "v12.x" "v13.x" "v14.x" "v15.x" "v16.x" "v17.x" "v18.x" "v19.x"
    [21] "v20.x" "v21.x" "v1.y"  "v2.y"  "v3.y"  "v4.y"  "v5.y"  "v6.y"  "v7.y"  "v8.y"
    [31] "v9.y"  "v10.y" "v11.y" "v12.y" "v13.y" "v14.y" "v15.y" "v16.y" "v17.y" "v18.y"
    [41] "v19.y" "v20.y" "v21.y"

Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages, HIVE Transparency Engine, HQL, HIVE Thrift Server, Metastore, HDFS Storage (/user/oracle/tab_input)

Page 30: ORAAH: Transparency functions against HIVE

Invoke ORAAH transparent functions for HIVE: Create new Variables and Summary Statistics

We can create a new variable that is a combination of other columns in the joined table, using just R’s plain syntax. The new variable can be used in any type of computation.

    > joined$NEW_VARIABLE <- (joined$v1.x + joined$v1.y)/2

Checking the content of the new variable:

    > summary(joined$NEW_VARIABLE)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     -14890  -12040  -10190  -10230   -8621   -2609
    > head(joined$NEW_VARIABLE)
    [1]  -9862.83  -9705.47  -9643.92  -8655.63 -11572.07 -11702.89

Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages, HIVE Transparency Engine, HQL, HIVE Thrift Server, Metastore, HDFS Storage (/user/oracle/tab_input)
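The `.x` and `.y` suffixes and the derived column behave the same way in plain R, which is what makes the HIVE transparency layer feel familiar. The base R sketch below uses two small made-up data frames (`tab_a` and `tab_b` are hypothetical stand-ins for the HIVE tables); it runs entirely in local memory and generates no HQL.

```r
# Hypothetical local stand-ins for tab_input and tab_input2.
tab_a <- data.frame(value = c(1, 2, 3), v1 = c(10, 20, 30), v2 = c(1.5, 2.5, 3.5))
tab_b <- data.frame(value = c(2, 3, 4), v1 = c(200, 300, 400), v2 = c(0, 0, 0))

# Inner join on the shared key; clashing column names get .x / .y suffixes.
joined <- merge(tab_a, tab_b, by = "value")

# Derived column written with ordinary R arithmetic, as on the slide.
joined$NEW_VARIABLE <- (joined$v1.x + joined$v1.y) / 2

print(names(joined))
print(joined$NEW_VARIABLE)
```

Only the keys present in both inputs (2 and 3) survive the join, and columns from the first table carry `.x` while those from the second carry `.y`.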

Page 31: ORAAH: Machine Learning models against HDFS data

Invoke open-source lm model and collect graphical results in Map-Reduce using just R

    dfs <- hdfs.put(iris, key='Species')
    res <- NULL
    dfs.res <- hadoop.run(dfs,
      mapper = function(key, vals) {
        keyval(key, vals)
      },
      reducer = function(key, vals) {
        dat <- do.call(rbind.data.frame, vals)
        orch.dlogv(colnames(dat))
        mod = lm(Petal.Length ~ Sepal.Length+Petal.Width, data=dat)
        fname <- paste("fit-",key,".png",sep="")
        png(fname)
        par(mfrow=c(2, 2), cex=0.6, mar=c(6, 6, 6, 4), mex=0.8)
        plot(mod,id.n=1, cex.caption=0.8, which=1:4)
        dev.off()
        hdfs.fdir <- "/user/pngfiles"
        hdfs.fname <- paste(hdfs.fdir,"/",fname, sep="")
        system(paste("hadoop fs -copyFromLocal", fname, hdfs.fdir))
        pred <- predict(mod, dat)
        keyval(NULL, orch.pack(pred, hdfs.fname))
      })

    res <- hdfs.get(dfs.res)
    finalres = list()
    for(i in 1:nrow(res)) {
      finalres[[i]] <- orch.unpack(res[i,])
    }

Diagram labels: Oracle RAAH Client Packages, Map/Reduce Call, YARN: Hadoop Map Reduce Job, Mapper(s), Reducer(s), /user/oracle/iris, R Result Object Stored in HDFS
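What the mapper and reducer above compute can be checked locally before submitting a job. The following base R sketch is a single-process analogue, not ORAAH code: `split()` plays the role of keying by Species and `lapply()` the role of the reducers, and the `fit_group` helper is illustrative. It skips the PNG plots and the HDFS copy.

```r
# Local stand-in for the map step: group iris rows by the key, Species.
groups <- split(iris, iris$Species)

# Stand-in for the reduce step: fit the slide's model per group, keep predictions.
fit_group <- function(dat) {
  mod <- lm(Petal.Length ~ Sepal.Length + Petal.Width, data = dat)
  list(coefficients = coef(mod), pred = predict(mod, dat))
}
finalres <- lapply(groups, fit_group)

# One result per species, each with one prediction per input row.
print(names(finalres))
print(sapply(finalres, function(r) length(r$pred)))
```

Running this first is a cheap way to confirm that the model formula and the per-group data shape are right before paying the start-up cost of a Map-Reduce job.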

Page 32: ORAAH: Machine Learning models against HDFS data

Invoke ORAAH custom parallel distributed model (Linear Regression)

Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"

    > ontime <- hdfs.attach("/user/oracle/ontime_s")
    > lm_mod <- orch.lm(ARRDELAY ~ DISTANCE + DEPDELAY,
                        dfs.dat=ontime, nMappers = 4, nReducers = 2)
    > summary(lm_mod)

    Call:
    ore.lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1462.45    -6.97    -1.36     5.07   925.08

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
    DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
    DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 14.73 on 215144 degrees of freedom
      (4785 observations deleted due to missingness)
    Multiple R-squared: 0.7647, Adjusted R-squared: 0.7647
    F-statistic: 3.497e+05 on 2 and 215144 DF, p-value: < 2.2e-16

Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages, Machine Learning algorithms module, YARN: Hadoop Map Reduce Job, Mappers, Custom Java Algorithm Reducers, /user/oracle/ontime_s

Page 33: ORAAH: Machine Learning in Spark against HDFS data

Invoke ORAAH custom parallel distributed GLM Model using Spark Caching

    > # Connects to Spark and reserves a dedicated Context
    > spark.connect("yarn-client", memory="24g")
    > # Creates a pointer to the HDFS file for use within R
    > ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
    > # Checks the size of dataset (rows and columns)
    > formatC(hdfs.dim(ont1bi), digits=4, format='fg', big.mark = ",")
    "1,000,000,000" " 30"
    > # Formula definition: Cancelled flights (0 or 1) based on other attributes
    > form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
    +   F(DAYOFMONTH) + F(DAYOFWEEK)
    > system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
    ORCH GLM: processed 6 factor variables, 25.806 sec
    ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
    ORCH GLM: iter 1, deviance 1.38433414089348300E+09, elapsed time 9.582 sec
    ORCH GLM: iter 2, deviance 3.39315388583931150E+08, elapsed time 9.213 sec
    ORCH GLM: iter 3, deviance 2.06855738812683250E+08, elapsed time 9.218 sec
    ORCH GLM: iter 4, deviance 1.75868100359263200E+08, elapsed time 9.104 sec
    ORCH GLM: iter 5, deviance 1.70023181759611580E+08, elapsed time 9.132 sec
    ORCH GLM: iter 6, deviance 1.69476890425481350E+08, elapsed time 9.124 sec
    ORCH GLM: iter 7, deviance 1.69467586045954760E+08, elapsed time 9.077 sec
    ORCH GLM: iter 8, deviance 1.69467574351380850E+08, elapsed time 9.164 sec
       user  system elapsed
     84.107   5.606 143.591

Diagram labels: Oracle R Advanced Analytics for Hadoop Client Packages, Spark-Based Machine Learning algorithms module, /user/oracle/ontime_1bi

YARN steps:

1. Spark Context Creation: reserve memory in a dedicated Context
2. Spark Job Execution: loads data from HDFS and runs ORAAH’s in-Memory Custom Machine Learning algorithm
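The iteration log can be read as a convergence trace. The base R sketch below copies the deviances and timings from the log and computes the relative drop in deviance per iteration plus the time covered by the logged phases. The stopping rule that `orch.glm2` actually applies is not stated on the slide, so the relative-change measure here is only an illustrative way to look at the numbers; `dev_trace` and the other names are made up for this example.

```r
# Deviance and per-iteration times copied from the ORCH GLM log on the slide.
dev_trace <- c(1.38433414089348300e+09, 3.39315388583931150e+08,
               2.06855738812683250e+08, 1.75868100359263200e+08,
               1.70023181759611580e+08, 1.69476890425481350e+08,
               1.69467586045954760e+08, 1.69467574351380850e+08)
iter_sec <- c(9.582, 9.213, 9.218, 9.104, 9.132, 9.124, 9.077, 9.164)

# Relative decrease in deviance from one iteration to the next.
rel_change <- -diff(dev_trace) / dev_trace[-length(dev_trace)]
print(signif(rel_change, 3))

# Time in the logged phases: factor processing, model matrix, iterations.
logged_sec <- 25.806 + 32.871 + sum(iter_sec)
print(logged_sec)
```

The deviance falls at every step and the last step changes it by less than one part in a million, while each pass over the 1 billion cached rows takes a steady 9 seconds or so. The logged phases account for about 132 of the 143.591 elapsed seconds.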

Page 34: ORAAH’s Spark-based GLM against HDFS data

Performance against ORAAH’s Map-Reduce GLM

Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per Node, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node

More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

Page 35: Oracle R Advanced Analytics for Hadoop: Spark vs. Map-Reduce

Neural Networks Performance: Linear Model, 845 weights

Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per Node, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node

More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

Page 36: ORAAH on Spark: 1 billion records and large number of coefficients

Scalable GLM-Logistic Regression and Complex Nonlinear Deep Neural Networks

Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per Node, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node

More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

Page 37: ORAAH’s Spark-based GLM vs. Spark MLlib GLM

Performance measured on same Hardware and same HDFS input Dataset

Chart labels: 14.5x, 13.9x, 5x

Page 38: Oracle R Advanced Analytics for Hadoop

Efficient use of Spark Caching memory, even at minimum levels

Performance on 159 million records on an X4-2 Server, 40 threads, 128 GB of RAM, CDH 5.3.0. Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node.

• GLM: Logistic Regression model with 845 Coefficients
• Neural Networks: Model using 1 Layer of Neurons, linear activation function, 838 Coefficients

Page 39: Roadmap Architecture

Diagram labels: Hadoop, Relational, Cloud; Algorithms: common core, parallel, distributed; interfaces: SQL, R, GUI

Page 40: Oracle Advanced Analytics: On Premise or Cloud

100% Compatibility Enables Easy Coexistence and Migration

• Same Architecture
• Same Analytics
• Same Standards

Transparently move workloads and analytical methodologies between On-premise and public cloud

Diagram labels: Oracle Cloud, CoExistence and Migration, On-Premise

Page 41: Cloud-Based Advanced Analytics

Data Science already available on the Oracle Cloud

Oracle Advanced Analytics option including the Oracle Data Mining and Oracle R Enterprise on:

• Oracle EXADATA Cloud Service
• Oracle Database as a Service: Included in High Performance and Extreme Performance services

Oracle R Advanced Analytics for Hadoop

• Included in the Big Data Cloud Service
