
Copyright © 2016 Oracle and/or its affiliates. All rights reserved.

Scalable Machine Learning using Big Data SQL, Hadoop and Spark
Advanced Analytics at Scale

BIWA Summit 2016
Marcos Arancibia, Product Manager, Oracle Data Science

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.



Use Cases

1. Root Cause Analysis of semiconductor manufacturing that uses custom-built R algorithms running against data in the Database and against Hadoop to verify the potential of Big Data SQL.

2. Predicting Airline flight cancellations using Logistic Regression directly against a Big Data Cluster


Data Analytics Challenge
Separate data access interfaces…

Data Analytics Challenge
…which require separate Predictive Analytics interfaces.

[Diagram labels: NoSQL; OAA; ORAAH; R]


Oracle Big Data SQL
Expanding the reach of OAA with Predictive Analytics via SQL

[Diagram labels: NoSQL; OAA; ORAAH; R]


1. Root Cause Analysis - Requirements
Semiconductor Industry

• BISTel is an Oracle Partner in the USA and in Korea, and it is the leading provider of equipment engineering systems and services for the fabrication of semiconductor chips and flat panel displays.

• BISTel offers solutions and services that enable customers to achieve yield improvement and increase productivity.

• The requirement was to be capable of using their original 8,000+ lines of R code that express some of their custom algorithms on large scale data, using both Oracle Database data as well as data stored in Hadoop.


Traditional Manufacturing Yield Mgt Process

Traditionally …

Identify yield loss patterns manually or through pre-defined pattern libraries

Extract items/lots with similar patterns

YMS

In-House

3rd party YMS

SAS

SPSS

JMP

3rd party BI tools

Use multiple 3rd party tools to do different analysis to find cause down to suspected tool/chamber level

Ask equipment engineers to look at their tools for any possible cause

Report on possible causes

Hours/Days to Find Root Cause

EES

Tool Logs

FDC

SPC

EPT

RMS

Report

Engineer searches through various systems to see if there is any correlation


New Analytical Approach for Manufacturing

New Paradigm: Changing the way you analyze

Run in manual or auto mode

Engineer views result and makes decision/report

ROI from Customer:

Customer   TTRCI for Traditional Method     TTRCI using BISTel MA + IM
A          7 days                           Within 4 hours
B          3 weeks                          Within 4 hours
B          Could not find within 4 weeks    Found within 3 hours

(TTRCI: Time to Root Cause Identification)

Run pattern detection and classification, root cause analysis together for enhanced accuracy and TTRCI

Problem with yield


E.g.: Auto Pattern Recognition and Classification

Data
• Coordinate (i.e. x, y)
• Quality Measurement (i.e. thickness, defect)
• Yield data (i.e. good/bad, grade)

Manipulate data and convert to map (i.e. circle, rectangle, etc…)

Extrapolate and interpolate data to full map

Dynamic pattern recognition and classification

Root cause analysis of pattern


E.g.: Root Cause Analysis with Machine Learning

Root Cause Analysis using Advanced Data Mining

Data
• Effect data
• Cause candidate data
• Continuous/categorical

Select Effect

Select Cause Candidates

Ranked Result

Engineer Decision
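The select-effect, select-candidates, ranked-result flow can be illustrated with a very small base R sketch. This is not BISTel's algorithm, which the slides do not describe: the data are simulated, the column names (`chamber_temp`, `pressure`, `tool_age`) are hypothetical, and absolute correlation is only one simple way to score continuous candidates.

```r
# Illustrative only: simulated data and hypothetical candidate names.
set.seed(42)
n <- 200
candidates <- data.frame(
  chamber_temp = rnorm(n),
  pressure     = rnorm(n),
  tool_age     = rnorm(n)
)

# Simulated "effect" (e.g. a yield metric) driven mostly by chamber_temp.
effect <- 2 * candidates$chamber_temp + 0.3 * candidates$pressure + rnorm(n)

# Score every cause candidate against the effect, then rank the scores.
score  <- sapply(candidates, function(x) abs(cor(x, effect)))
ranked <- sort(score, decreasing = TRUE)
print(round(ranked, 3))
```

The engineer would then review the top-ranked candidates; categorical candidates would need a different score (for example an ANOVA-style statistic).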


Solution Architecture Overview

Overview of Analytics using Oracle ORE

[Architecture diagram. Client tools: R Console, R Studio, eDataLyzer, SQL Developer or PL/SQL, connected via JDBC. Servers: EXADATA with Oracle R Enterprise and Oracle R Distribution; BIG DATA APPLIANCE (BDA) with Oracle R Distribution. BISTel’s algorithms (MA, Intellimine, TA) are shown as ORE packages running on the R Engine or Oracle R Distribution.]


Interfaces of the usage in Production: Database

How Oracle ORE is used in Production Environment

[Diagram labels: ETL; User Request; Auto Request; Notify Analysis Result; .Net Client; Java Analysis Server; JDBC; EXADATA; Various Database for OLTP; many parallel ExtProc processes and Intellimine instances inside the database server.]


Features of Oracle Advanced Analytics
From either an R Client or a SQL Client, OAA in-Database algorithms, R Engines and Open-Source R Packages can be accessed.

Oracle Database Server with Advanced Analytics Option

R Client
• R Analytics: Oracle R Enterprise
• ORE parallel algorithms: MLP Neural, Stepwise, LM, GLM, PCA
• Access to open-source R packages

SQL Client
• SQL Developer, other SQL apps
• SQL: basic statistics and joins
• Data Mining, Predictive Analytics: 15 PL/SQL in-Database algorithms


Features of Oracle Advanced Analytics
Oracle R Distribution x64 running on the Intel Platform can make use of Intel’s MKL for additional performance, even with open-source R packages

Benchmark                      ORD internal   ORD+MKL   ORD+MKL    ORD+MKL    ORD+MKL    Gain,       Gain,
                               BLAS/LAPACK    1 thread  2 threads  4 threads  8 threads  ORD+MKL     ORD+MKL
                               1 thread                                                  4 threads   8 threads
Matrix Calculations                 11.2         1.9       1.3        1.1        0.9        9.2x       11.4x
Matrix Functions                     7.2         1.1       0.6        0.4        0.4       17.0x       17.0x
Matrix Multiply                    517.6        21.2      10.9        5.8        3.1       88.2x      166.0x
Cholesky Factorization              25           3.9       2.1        1.3        0.8       18.2x       29.4x
Singular Value Decomposition       103.5        15.1       7.8        4.9        3.4       20.1x       40.9x
Principal Component Analysis       490.1        42.7      24.9       15.9       11.7       29.8x       40.9x
Linear Discriminant Analysis       419.8       120.9     110.8       94.1       88.0        3.5x        3.8x

This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5. More details at https://blogs.oracle.com/R/entry/oracle_r_distribution_3_0
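The slide does not define "performance gain". As a hedged reading, most of the reported figures are reproduced by (baseline time / MKL time) - 1, that is, the extra time of the internal BLAS/LAPACK run relative to the MKL run. The sketch below recomputes three rows under that assumption; the timings are copied from the benchmark table and the formula is an inference, not something stated in the source.

```r
# Timings copied from the benchmark table (units as in the source).
bench <- data.frame(
  test     = c("Matrix Multiply", "Principal Component Analysis",
               "Linear Discriminant Analysis"),
  internal = c(517.6, 490.1, 419.8),  # ORD with internal BLAS/LAPACK, 1 thread
  mkl4     = c(5.8, 15.9, 94.1),      # ORD + MKL, 4 threads
  mkl8     = c(3.1, 11.7, 88.0)       # ORD + MKL, 8 threads
)

# Assumed definition of the gain: (baseline - MKL) / MKL, rounded to 1 decimal.
gain <- function(base, fast) round(base / fast - 1, 1)

bench$gain4 <- gain(bench$internal, bench$mkl4)
bench$gain8 <- gain(bench$internal, bench$mkl8)
print(bench)
```

Because the table values are themselves rounded, recomputed gains for rows with very small timings can differ slightly from the printed ones.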


Features of Oracle Advanced Analytics
R: Transparency through function overloading. E.g. in-database aggregation function

Oracle Distribution of R version 3.2.0 (--) -- "Full of Ingredients"

> aggdata <- aggregate(ONTIME_S$DEST,
+                      by = list(ONTIME_S$DEST),
+                      FUN = length)
> class(aggdata)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(aggdata)
  Group.1    x
1     ABE  237
2     ABI   34
3     ABQ 1357
4     ABY   10
5     ACK    3
6     ACT    3

Oracle SQL generated by the Transparency Layer:

select DEST, count(*)
from ONTIME_S
group by DEST

[Diagram labels: Oracle Advanced Analytics ORE Client Packages; Transparency Layer; In-db Stats; ONTIME_S]
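For readers without a database connection, the same `aggregate()` call can be tried on an ordinary local data frame. The sketch below uses a small made-up `flights` data frame as a stand-in for `ONTIME_S` (an assumption for illustration); with ORE the identical syntax is translated into the `group by` query instead of being computed in local memory.

```r
# Hypothetical local stand-in for the ONTIME_S table.
flights <- data.frame(DEST = c("ABE", "ABQ", "ABE", "BOS", "ABQ", "ABE"),
                      stringsAsFactors = FALSE)

# Same call shape as on the slide: count rows per destination.
aggdata <- aggregate(flights$DEST,
                     by  = list(flights$DEST),
                     FUN = length)
print(aggdata)

# Locally the result is a plain data.frame, not an ore.frame.
class(aggdata)
```

The result has the same `Group.1` and `x` columns as the slide's output, one row per distinct `DEST`.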


Features of Oracle Advanced Analytics
R: Scalable Machine Learning Models. E.g. custom parallel distributed lm model

> options(ore.parallel=4)
> lm_mod <- ore.lm(ARRDELAY ~ DISTANCE + DEPDELAY,
+                  data=ONTIME_S)
> summary(lm_mod)

Call:
ore.lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared:  0.7647, Adjusted R-squared:  0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

[Diagram, steps 1 to 4: the R client call goes through the Transparency Layer to the ORE Framework, which reads ONTIME_S and runs four parallel extproc processes, each with Oracle R Distribution, R Packages and the "Parallel ore.lm Compute" engine.]
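One common way to make a linear model data-parallel is to let each worker compute the cross-products X'X and X'y on its own slice of rows and then add them up before solving the normal equations. The base R sketch below illustrates that idea on the built-in `mtcars` data with four chunks; it is an assumption-laden illustration of the general technique, not a description of how `ore.lm` is implemented internally.

```r
# Illustrative only: chunk-wise least squares on a local data frame.
# Split the rows into 4 chunks, mimicking 4 parallel workers.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

# Each "worker" returns only small summary matrices, never the raw rows.
partial <- lapply(chunks, function(d) {
  X <- model.matrix(mpg ~ wt + hp, data = d)
  list(xtx = crossprod(X), xty = crossprod(X, d$mpg))
})

# Combine the partial results and solve the normal equations once.
xtx  <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
xty  <- Reduce(`+`, lapply(partial, `[[`, "xty"))
beta <- drop(solve(xtx, xty))
print(beta)

# Reference fit on the full data for comparison.
print(coef(lm(mpg ~ wt + hp, data = mtcars)))
```

The combined coefficients match the single-pass `lm()` fit up to numerical precision, which is why this kind of model scales with the degree of parallelism.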




Server execution of open-source R package: ore.tableApply()

> mod_biglm <- ore.tableApply(dat = ONTIME_S,   # Database table
+     function(dat) {
+       library(biglm)   # Load open-source package
+       biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
+     });
> library(biglm)       # Load open-source package locally to interpret results
> summary(mod_biglm)   # Summary of the resulting Model
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size =  392805
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

[Diagram, steps 1 to 4: the R client call goes to the ORE Framework (Embedded R), which reads ONTIME_S and runs a single extproc process with Oracle R Distribution and open-source R Packages.]




Features of Oracle Advanced Analytics
Server execution of open-source R package: ore.groupApply() for data parallelism

> options(ore.parallel=4)
> modList <- ore.groupApply(dat = ONTIME_S,     # Database table
+     INDEX = ONTIME_S$DEST,                    # groupby col
+     function(dat) {
+       library(biglm)   # Load open-source package
+       biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
+     });
> library(biglm)       # Load open-source package locally to interpret results
> summary(modList)     # Checks how many models we have in the model list
  Length    Class     Mode
     325 ore.list       S4
> summary(modList$BOS) # Request the resulting Model for Boston Logan Airport
Large data regression model: biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
Sample size =  3928
               Coef    (95%     CI)     SE      p
(Intercept)  0.0638 -0.7418  0.8693 0.4028 0.8742
DISTANCE    -0.0014 -0.0021 -0.0006 0.0004 0.0002
DEPDELAY     1.0552  1.0373  1.0731 0.0090 0.0000

[Diagram, steps 1 to 4: the R client call goes through the Transparency Layer to the ORE Framework, which reads ONTIME_S and runs four parallel extproc processes, each with Oracle R Distribution and open-source R Packages.]
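The group-apply pattern, one model per value of a grouping column, can be mimicked locally with `split()` and `lapply()`. The sketch below uses the built-in `iris` data and base `lm()` as stand-ins (assumptions for illustration); with `ore.groupApply()` the per-group function would instead run inside database-side R engines, in parallel.

```r
# Illustrative local analogue of a group-apply: one model per Species.
models <- lapply(split(iris, iris$Species), function(dat) {
  lm(Petal.Length ~ Sepal.Length + Petal.Width, data = dat)
})

# How many models do we have, and for which groups?
length(models)
names(models)

# Inspect the model of a single group, similar to modList$BOS on the slide.
print(coef(models$setosa))
```

Each group is fitted independently of the others, which is what makes the problem straightforward to parallelize.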


SQL interface rqEval for R scripts – can also generate XML string for graphic output

• R script output is often dynamic – not conforming to pre-defined structure
• R apps generate stats, new data, graphics
• Example
  – Plot 100 random numbers
  – Return a vector with values 1 to 10
  – Return the results as XML

Oracle PL/SQL (registers the R Language script):

begin
  sys.rqScriptCreate('Example6',
 'function(){
    res <- 1:10
    plot( 1:100, rnorm(100), pch = 21,
          bg = "red", cex = 2 )
    res
  }');
end;
/

Oracle SQL (runs the script and returns XML):

select value
from table(rqEval(NULL,'XML','Example6'));



Performance on Auto Pattern Classification: EXADATA only

Performance of ORE for Auto Pattern Classification

[Bar chart. Y axis: minutes (log scale). X axis: database size (records). Bars, in chart order: BISTEL 2GB (57mi), OAA 2GB (57mi), OAA 20GB (570mi), OAA 200GB (5.7Bi). Bar values, in extraction order: 17, 8.8, 37.4, 278. Data volumes are labeled approx. 30,000 wafers, approx. 300,000 wafers and approx. 3 million wafers (semiconductor), marked 1x, 10x and 100x.]

Tested on 2-node EXADATA X3


OAA with Big Data SQL: EXADATA + BDA
Using the in-Database algorithms, plus R Engine and Open-Source R Packages if desired

[Diagram labels: R Client (R Analytics, Oracle R Enterprise); SQL Client (SQL Developer, other SQL apps); Oracle EXADATA with Advanced Analytics Option; Big Data SQL; Oracle BIG DATA APPLIANCE]


Performance on Micro Level Data Analytics: EXADATA + BDA

Performance of ORE for Micro Level Data Analytics

[Chart. Y axis: seconds (log scale). X axis: data size (4G, 20G, 200G, 10T). Series: Exa; BDS (Exa+BDA). Plotted values, in extraction order: 42.0, 111.5, 855.0, 41,652. Scale annotations: 1x, 5x, 50x, 250x.]

Tested on EXADATA X5 Full Rack
• Algorithm Execution Time
• DoP is 288

With data either as Database tables or as HDFS files on the BDA, the performance was the same, highlighting the throughput of Big Data SQL and OAA’s agility in running open-source R code against group-by problems.

EXADATA + Big Data SQL + OAA on a full rack EXADATA X5-2, via InfiniBand to a 9-node BDA X5-2. Degree of Parallelism set to 288.


Conclusion: ROI using BISTel Analytics with OAA

ROI for Customer

Shorten TTRCI (Time to Root Cause Identification)
  Within 1 day, down from 2 weeks to 2 months

Reduce Investment Cost
  Use of proven EXADATA and ORE with minimal data migration from existing Oracle DB

ROI for Solution Provider

Shorten Time to Market
  Faster POC (Proof of Concept) at customer site with new ideas/apps – Reduced by more than 30 days
  Dramatic reduction of development and test time – At least by 50%


2. Flight Cancellation prediction in USA airports
Airline Industry

• On-time arrival data for non-stop domestic flights by major air carriers can be found at the Bureau of Transportation Statistics website, and is free to download: http://www.transtats.bts.gov/Fields.asp?Table_ID=236

• Several benchmarks have been executed against this dataset, known as ONTIME. The original combined dataset had 123 million records and contained data from October 1987 to April 2008. It is available at many websites, including the American Statistical Association: http://stat-computing.org/dataexpo/2009/

• We augmented the original data with information until September 2014 to get 159 million unique records. For scalability testing, we appended the file onto itself to get 1 billion total records.


Oracle R Advanced Analytics for Hadoop
Advanced Analytics algorithms in a Hadoop Cluster: Map-Reduce and Spark based

Labels shown on the slide:

Logistic Regression

Generalized Linear Model

Regression

Classification

Attribute Importance

Principal Components Analysis

Clustering

Hierarchical k-Means

Feature Extraction

Nonnegative Matrix Fact (NMF)

Statistical Functions

Correlation

Covariance

Cross Tabulation

Summary statistics

Multi-Layer Neural Networks

Linear Regression

Collaborative Filtering (LMF)


Oracle R Advanced Analytics for Hadoop – vs. RHadoop (RMR)
Best platform available to run Map-Reduce R jobs vs. Revolution Analytics’ RHadoop

Performance on a 6-node BDA X3-2, 16 cores and 47 GB of Total RAM assigned
Covariance computation on 100 GB HDFS / 200 columns input dataset

More info at https://blogs.oracle.com/R/entry/oraah_enabling_high_performance_r


Oracle R Advanced Analytics for Hadoop: Integration
Using ORAAH’s Hadoop and HIVE Integration, plus R Engine and Open-Source R Packages

Hadoop Cluster with Oracle R Advanced Analytics for Hadoop

R Client
• R Analytics: Oracle R Advanced Analytics for Hadoop
• ORAAH distributed algorithms: MLP Neural Nets*, Logistic Reg*, GLM, LM, PCA, k-Means, NMF, LMF
• Open-source R packages via Map-Reduce
* Spark-Caching enabled

SQL Client
• SQL Developer, other SQL apps
• HQL: basic statistics, data prep, joins and view creation

Oracle Database Server with Advanced Analytics option


ORAAH: Transparency functions against HIVE tables
Invoke ORAAH transparent functions for HIVE: JOIN

Join the two tables by one common variable
> joined <- merge(tab_input, tab_input2, by="value")

The new table is a temporary HIVE table not seen in here
> ore.ls()
[1] "tab_input"  "tab_input2"

But, it’s part of the local R objects
> ls()
[1] "joined"
> names(joined)
 [1] "value" "v1.x"  "v2.x"  "v3.x"  "v4.x"  "v5.x"  "v6.x"  "v7.x"  "v8.x"  "v9.x"
[11] "v10.x" "v11.x" "v12.x" "v13.x" "v14.x" "v15.x" "v16.x" "v17.x" "v18.x" "v19.x"
[21] "v20.x" "v21.x" "v1.y"  "v2.y"  "v3.y"  "v4.y"  "v5.y"  "v6.y"  "v7.y"  "v8.y"
[31] "v9.y"  "v10.y" "v11.y" "v12.y" "v13.y" "v14.y" "v15.y" "v16.y" "v17.y" "v18.y"
[41] "v19.y" "v20.y" "v21.y"

[Diagram, steps 1 to 4. Labels: Oracle R Advanced Analytics for Hadoop Client Packages; HIVE Transparency Engine; HQL; HIVE Thrift Server; Metastore; HDFS Storage; /user/oracle/tab_input]



ORAAH: Transparency functions against HIVE
Invoke ORAAH transparent functions for HIVE: Create new Variables and Summary Statistics

We can create a new variable that is a combination of other columns in the joined table, using just R’s plain syntax.
> joined$NEW_VARIABLE <- (joined$v1.x + joined$v1.y)/2

The new variable can be used in any type of computation.

Checking the content of the new Variable:
> summary(joined$NEW_VARIABLE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -14890  -12040  -10190  -10230   -8621   -2609

> head(joined$NEW_VARIABLE)
[1]  -9862.83  -9705.47  -9643.92  -8655.63 -11572.07 -11702.89

[Diagram, steps 1 to 4. Labels: HIVE Transparency Engine; HQL; HIVE Thrift Server; Metastore; HDFS Storage; /user/oracle/tab_input]
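The `.x` / `.y` column suffixes and the derived column on the two HIVE slides follow ordinary base R semantics, which can be checked on small local data frames. In the sketch below `tab_a` and `tab_b` are made-up stand-ins for `tab_input` and `tab_input2` (assumptions for illustration); with ORAAH the same `merge()` and arithmetic are pushed down to HIVE instead of running in local memory.

```r
# Hypothetical local stand-ins for the two HIVE tables.
tab_a <- data.frame(value = c(1, 2, 3), v1 = c(10, 20, 30), v2 = c(1, 2, 3))
tab_b <- data.frame(value = c(2, 3, 4), v1 = c(200, 300, 400), v2 = c(5, 6, 7))

# Inner join on the common column; clashing names get .x / .y suffixes.
joined <- merge(tab_a, tab_b, by = "value")
names(joined)

# Derived column built with plain R arithmetic on the joined columns.
joined$NEW_VARIABLE <- (joined$v1.x + joined$v1.y) / 2
print(joined)
summary(joined$NEW_VARIABLE)
```

Only the keys present in both inputs survive the join, so the result here has two rows.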



ORAAH: Machine Learning models against HDFS data
Invoke open-source lm model and collect graphical results in Map-Reduce using just R

Oracle Distribution of R version 3.2.0 (--) -- "Full of Ingredients"

dfs <- hdfs.put(iris, key='Species')
res <- NULL
dfs.res <- hadoop.run(dfs,
  mapper = function(key, vals) {
    keyval(key, vals)
  },
  reducer = function(key, vals) {
    dat <- do.call(rbind.data.frame, vals)
    orch.dlogv(colnames(dat))
    mod = lm(Petal.Length ~ Sepal.Length+Petal.Width, data=dat)
    fname <- paste("fit-",key,".png",sep="")
    png(fname)
    par(mfrow=c(2, 2), cex=0.6, mar=c(6, 6, 6, 4), mex=0.8)
    plot(mod,id.n=1, cex.caption=0.8, which=1:4)
    dev.off()
    hdfs.fdir <- "/user/pngfiles"
    hdfs.fname <- paste(hdfs.fdir,"/",fname, sep="")
    system(paste("hadoop fs -copyFromLocal", fname, hdfs.fdir))
    pred <- predict(mod, dat)
    keyval(NULL, orch.pack(pred, hdfs.fname))
  })

res <- hdfs.get(dfs.res)
finalres = list()
for(i in 1:nrow(res)) {
  finalres[[i]] <- orch.unpack(res[i,])
}

[Diagram, steps 1 to 5. Labels: Oracle RAAH Client Packages; Map/Reduce Call; YARN: Hadoop Map Reduce Job; /user/oracle/iris; Mapper(s); Reducer(s); R Result Object Stored in HDFS]


ORAAH: Machine Learning models against HDFS data
Invoke ORAAH custom parallel distributed model (Linear Regression)

Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"

> ontime <- hdfs.attach("/user/oracle/ontime_s")
> lm_mod <- orch.lm(ARRDELAY ~ DISTANCE + DEPDELAY,
+                   dfs.dat=ontime, nMappers = 4, nReducers = 2)
> summary(lm_mod)

Call:
ore.lm(formula = ARRDELAY ~ DISTANCE + DEPDELAY, data = ONTIME_S)

Residuals:
     Min       1Q   Median       3Q      Max
-1462.45    -6.97    -1.36     5.07   925.08

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.254e-01  5.197e-02   4.336 1.45e-05 ***
DISTANCE    -1.218e-03  5.803e-05 -20.979  < 2e-16 ***
DEPDELAY     9.625e-01  1.151e-03 836.289  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.73 on 215144 degrees of freedom
  (4785 observations deleted due to missingness)
Multiple R-squared:  0.7647, Adjusted R-squared:  0.7647
F-statistic: 3.497e+05 on 2 and 215144 DF,  p-value: < 2.2e-16

[Diagram, steps 1 to 4. Labels: Machine Learning algorithms module; YARN: Hadoop Map Reduce Job; /user/oracle/ontime_s; Mappers; Custom Java Algorithm Reducers]


ORAAH: Machine Learning in Spark against HDFS data
Invoke ORAAH custom parallel distributed GLM Model using Spark Caching

> # Connects to Spark and reserves a dedicated Context
> spark.connect("yarn-client", memory="24g")
> # Creates a pointer to the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")
> # Checks the size of dataset (rows and columns)
> formatC(hdfs.dim(ont1bi), digits=4, format='fg', big.mark = ",")
"1,000,000,000" " 30"
> # Formula definition: Cancelled flights (0 or 1) based on other attributes
> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+     F(DAYOFMONTH) + F(DAYOFWEEK)
> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
ORCH GLM: processed 6 factor variables, 25.806 sec
ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
ORCH GLM: iter  1, deviance 1.38433414089348300E+09, elapsed time 9.582 sec
ORCH GLM: iter  2, deviance 3.39315388583931150E+08, elapsed time 9.213 sec
ORCH GLM: iter  3, deviance 2.06855738812683250E+08, elapsed time 9.218 sec
ORCH GLM: iter  4, deviance 1.75868100359263200E+08, elapsed time 9.104 sec
ORCH GLM: iter  5, deviance 1.70023181759611580E+08, elapsed time 9.132 sec
ORCH GLM: iter  6, deviance 1.69476890425481350E+08, elapsed time 9.124 sec
ORCH GLM: iter  7, deviance 1.69467586045954760E+08, elapsed time 9.077 sec
ORCH GLM: iter  8, deviance 1.69467574351380850E+08, elapsed time 9.164 sec
   user  system elapsed
 84.107   5.606 143.591

[Diagram labels: /user/oracle/ontime_1bi; Spark-Based Machine Learning algorithms module; YARN]
1. Spark Context Creation: Reserve Memory in a dedicated Context
2. Spark Job Execution: Loads data from HDFS and runs ORAAH’s in-Memory Custom Machine Learning algorithm
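The iteration log above, with the deviance dropping quickly and then flattening, is the typical signature of an iteratively reweighted least squares (IRLS) fit of a logistic regression. The base R sketch below runs eight IRLS steps on a small simulated dataset and prints the deviance after each one. It is an illustration of the general technique under simulated data, not ORAAH's Spark implementation, and all names in it are made up for the example.

```r
# Illustrative IRLS for logistic regression on simulated data.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # binary outcome, like CANCELLED
X <- cbind(1, x)                            # design matrix with intercept

beta <- c(0, 0)
for (iter in 1:8) {
  eta <- drop(X %*% beta)        # linear predictor
  mu  <- plogis(eta)             # fitted probabilities
  w   <- mu * (1 - mu)           # IRLS weights
  z   <- eta + (y - mu) / w      # working response
  # Weighted least squares step: solve (X'WX) beta = X'Wz
  beta <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  mu   <- plogis(drop(X %*% beta))
  dev  <- -2 * sum(y * log(mu) + (1 - y) * log(1 - mu))
  cat(sprintf("iter %d, deviance %.6f\n", iter, dev))
}

# Reference fit with R's built-in glm() for comparison.
print(coef(glm(y ~ x, family = binomial)))
print(beta)
```

Each step only needs the small matrices X'WX and X'Wz, which can be accumulated per data partition; that property is what makes this family of algorithms a natural fit for cached, partitioned data.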


ORAAH’s Spark-based GLM against HDFS data
Performance against ORAAH’s Map-Reduce GLM

Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per Node, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node

More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for


Oracle R Advanced Analytics for Hadoop – Spark vs. Map-Reduce
Neural Networks Performance – Linear Model, 845 weights, linear model

More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for

Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per Node, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node


ORAAH on Spark: 1 Bi records and Large number of coefficients
Scalable GLM - Logistic Regression and Complex Nonlinear Deep Neural Networks

Performance on a 6-node BDA X3-2 with CDH 5.3.0, 24 cores and 96 GB of RAM per Node, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node

More details at https://blogs.oracle.com/R/entry/oracle_r_advanced_analytics_for


ORAAH’s Spark-based GLM vs. Spark MLlib GLM
Performance measured on same Hardware and same HDFS input Dataset

[Chart annotations: 14.5x; 13.9x; 5x]


Oracle R Advanced Analytics for Hadoop
Efficient use of Spark Caching memory, even at minimum levels

Performance on 159 mi records on an X4-2 Server, 40 threads, 128 GB of RAM, CDH 5.3.0, Spark 1.2.0 configuration with 24 cores and 24 GB of RAM per Node
GLM – Logistic Regression model with 845 Coefficients
Neural Networks - Model using 1 Layer of Neurons, linear activation function, 838 Coefficients


Roadmap
Architecture

[Diagram labels: Hadoop; Relational; Cloud; Algorithms: common core, parallel, distributed; interfaces: SQL, R, GUI]


Oracle Advanced Analytics: On Premise or Cloud
100% Compatibility Enables Easy Coexistence and Migration

Oracle Cloud

Coexistence and Migration

Same Architecture

Same Analytics

Same Standards

Transparently move workloads and analytical methodologies between On-premise and public cloud

On-Premise


Data Science already available on the Oracle Cloud
Cloud-Based Advanced Analytics

Oracle Advanced Analytics option, including Oracle Data Mining and Oracle R Enterprise, on:

• Oracle EXADATA Cloud Service

• Oracle Database as a Service: Included in High Performance and Extreme Performance services

Oracle R Advanced Analytics for Hadoop

• Included in the Big Data Cloud Service
