HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark Brooks - Principal System Engineer @ Kinetica May 09, 2017

Page 1:

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS

Mark Brooks - Principal System Engineer @ Kinetica

May 09, 2017

Page 2:

The Challenge:

How to maintain analytic performance while dealing with:

• Larger data volumes
• Streaming data with minimal end-to-end latency
• Ad-hoc drilldown (you can’t pre-aggregate everything)

Page 3:

Architectural and Design Approaches

1. One database to rule them all
2. SQL on Hadoop (or directly on the Data Lake)
3. Data Lake + NoSQL + Spark + Search + Cache + …
4. Lambda Architecture
5. Kappa Architecture
6. Next generation hardware acceleration

Page 4:

One Database To Rule Them All

Page 5:

SQL on a Data Lake

Credit: https://www.slideshare.net/Bigdatapump/sql-on-hadoop-49494494

Page 6:

Hadoop + NoSQL + Search + Memory Cache + …

Credit: Matt Turck - https://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014

Page 7:

Lambda Architecture

Credit: Nathan Marz http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

James Kinley http://jameskinley.tumblr.com/tagged/Lambda

Page 8:

Lambda Architecture

Credit: James Kinley http://jameskinley.tumblr.com/tagged/Lambda

Page 9:

Kappa Architecture

Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Page 10:

Kappa Architecture

Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast?
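The replay idea in the quote can be sketched in a few lines of Python. This is only an illustration, not the API of any particular stream processor: `log`, `count_by_key`, and `replay` are hypothetical names, and the in-memory list stands in for a retained, ordered event log. Round-robin partitioning is used here because the aggregation is a commutative sum; a real system would typically partition by key.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical retained event log: (key, value) pairs in arrival order.
log = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]

def count_by_key(events):
    """The single processing job: sum values per key."""
    totals = Counter()
    for key, value in events:
        totals[key] += value
    return totals

def replay(events, parallelism):
    """Reprocess history by splitting the log into partitions and
    running the same job on every partition concurrently."""
    partitions = [events[i::parallelism] for i in range(parallelism)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for partial in pool.map(count_by_key, partitions):
            merged.update(partial)  # Counter.update adds the partial sums
    return merged
```

Raising `parallelism` changes how quickly history is replayed, not the result: `replay(log, 1)` and `replay(log, 3)` return the same totals.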

Page 11:

Next Generation Hardware Acceleration

Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Consider a system with these characteristics:

• Horizontally Scalable
• Low end-to-end latency
• Powerful enough to not require pre-aggregation

This is now possible…

Page 12:

GPU Accelerated Compute

1990 - 2000’s: DATA WAREHOUSE

RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.

2005…: DISTRIBUTED STORAGE

Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.

2010…: AFFORDABLE MEMORY

Affordable memory allows for faster data read and write. HANA, MemSQL, & Exadata provide faster analytics.

AT SCALE PROCESSING BECOMES THE BOTTLENECK

2017…: GPU ACCELERATED COMPUTE

GPU cores bulk process tasks in parallel - far more efficient for many data-intensive tasks than CPUs, which process those tasks linearly.
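The contrast between linear and bulk-parallel processing can be sketched as a chunked reduction. This is not GPU code and not Kinetica's implementation: `sequential_sum` and `chunked_parallel_sum` are hypothetical names, and Python threads will not deliver a real speedup here. The sketch only shows the shape of the computation: split the column, reduce every chunk independently, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(values):
    """Linear processing: one worker visits every value in turn."""
    total = 0
    for v in values:
        total += v
    return total

def chunked_parallel_sum(values, workers=4):
    """Bulk processing: each worker reduces its own chunk, and the
    partial sums are combined in a final step."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)
```

On a GPU the same pattern runs across thousands of cores at once, which is why aggregations over large columns benefit.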

Page 13:

Kinetica: Core

ANALYTICS DATABASE ACCELERATED BY GPUs

[Diagram: KINETICA on commodity hardware w/ GPUs. An HTTP head node sits on top of a GPU accelerated columnar in-memory database (columns A, B, C; rows 1-4), backed by disk.]

Columnar in-memory database

Data available much like a traditional RDBMS… rows, columns

Data held in-memory; persisted to disk

Interact with Kinetica through its native REST API, Java, Python, JavaScript, NodeJS, C++, SQL, etc… as well as with various connectors

Native GIS & IP address object support

VERY FAST: Ideal for OLAP workloads

Typical hardware setup: 256GB - 1TB memory with 2-4 GPUs per node.
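A toy version of the columnar layout shows why it suits OLAP-style queries. The class below is a hypothetical sketch, not Kinetica's storage engine or API: each column is a plain Python list, rows are reassembled on request, and an aggregate touches only the columns it names.

```python
class ColumnarTable:
    """Toy column-oriented table: one Python list per column."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}

    def insert(self, row):
        # A row (dict) is split so each value lands in its column's list.
        for name, values in self.columns.items():
            values.append(row[name])

    def row(self, index):
        # Rows are rebuilt on demand, so callers still see rows and columns.
        return {name: values[index] for name, values in self.columns.items()}

    def sum_where(self, sum_column, filter_column, predicate):
        # Only the two referenced columns are scanned; others are untouched.
        target = self.columns[sum_column]
        flags = self.columns[filter_column]
        return sum(t for t, f in zip(target, flags) if predicate(f))
```

For example, summing `amount` for one store reads the `store` and `amount` lists and never looks at any other column.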

Page 14:

Multi-Head Ingest and Scale-Out Architecture

[Diagram: ON-DEMAND SCALE OUT. Three identical nodes, each on commodity hardware w/ GPUs with an HTTP head node, a columnar in-memory store (columns A, B, C; rows 1-4), and disk. A "+" marks where further nodes are added. MULTI-HEAD INGEST spans the nodes.]

Page 15:

Real-Time Data Handlers for Structured & Unstructured Data

VISUALIZATION via ODBC / JDBC

APIs

• Java API
• JavaScript API
• REST API
• C++ API
• Node.js API
• Python API

OPEN SOURCE INTEGRATION

• Apache NiFi
• Apache Kafka
• Apache Spark
• Apache Storm

GEOSPATIAL CAPABILITIES

• Geometric Objects
• Tracks
• Geospatial Endpoints
• WMS
• WKT

OTHER INTEGRATION

• Message Queues
• ETL Tools
• Streaming Tools

[Diagram: KINETICA CLUSTER, On-Demand Scale. Four identical nodes, each on commodity hardware w/ GPUs with an HTTP head node, a columnar in-memory store (columns A, B, C; rows 1-4), and disk.]

Page 16:

Parallel Ingest Provides High Performance Streaming

[Diagram: PARALLEL INGEST into three nodes, each labeled 1 NODE (1 TB / 2 GPU).]

Each node of the system can share the task of data ingest, providing more and faster throughput. It can be made faster simply by adding more nodes.

No compute is used on ingest!
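The multi-head ingest idea can be sketched as client-side sharding followed by concurrent loads. Everything here is an assumption for illustration: `shard_for` and `parallel_ingest` are hypothetical names, the CRC32 routing rule is a stand-in for whatever sharding scheme the database really uses, and in-memory lists stand in for nodes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def shard_for(key, node_count):
    # Stable hash so a given key always routes to the same node.
    return zlib.crc32(key.encode("utf-8")) % node_count

def parallel_ingest(records, node_count):
    """Route records into per-node batches, then load all batches concurrently."""
    batches = [[] for _ in range(node_count)]
    for record in records:
        batches[shard_for(record["id"], node_count)].append(record)

    nodes = [[] for _ in range(node_count)]  # stand-ins for node storage

    def load(pair):
        node, batch = pair
        node.extend(batch)  # plain append; no computation on the write path
        return len(batch)

    with ThreadPoolExecutor(max_workers=node_count) as pool:
        loaded = list(pool.map(load, zip(nodes, batches)))
    return nodes, loaded
```

Because every node accepts its own batch directly, adding a node adds another ingest path rather than more load on a single head node.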

Page 17:

Speed Layer for the Data Lake

Parallel ingestion of events

Kinetica is the speed layer with real-time analytic capabilities

HDFS for archival store

Much looser coupling than traditional lambda architecture

Batch mode Spark or MR jobs can push data to Kinetica as needed for fast query on data loaded from the data lake

[Diagram labels: EVENTS; MESSAGE BROKERS (Amazon Kinesis); STREAM PROCESSING; Kinetica Connectors; Parallel Ingestion; Put, get, scan; Execute complex analytics on the fly; ANALYSTS; MOBILE USERS; DASHBOARDS & APPLICATIONS; ALERTING SYSTEMS; HDFS / AWS S3 / GCS / Azure Data Lake]
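The loose coupling between the archival store and the speed layer can be sketched with two independent paths. This is a minimal illustration under stated assumptions: `Archive`, `SpeedLayer`, `on_event`, and `batch_load` are hypothetical names, and no HDFS, Spark, or Kinetica API is used.

```python
from collections import defaultdict

class Archive:
    """Stand-in for the archival store: append-only, keyed by day."""
    def __init__(self):
        self.by_day = defaultdict(list)

    def append(self, event):
        self.by_day[event["day"]].append(event)

class SpeedLayer:
    """Stand-in for the fast query layer: holds only what was loaded into it."""
    def __init__(self):
        self.rows = []

    def ingest(self, events):
        self.rows.extend(events)

    def total(self, field):
        return sum(row[field] for row in self.rows)

def on_event(event, archive, speed):
    # Streaming path: every event is archived and is queryable immediately.
    archive.append(event)
    speed.ingest([event])

def batch_load(day, archive, speed):
    # Batch path: push one archived day into a speed layer only when needed.
    speed.ingest(archive.by_day[day])
```

The batch path runs only when someone wants fast queries over older data, so neither path has to be kept in lockstep with the other.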

Page 18:

Real-Time, Advanced Analytics, Speed Layer for Teradata or Oracle

Parallel ingestion of events

Lambda-type architecture for Teradata or Oracle

Kinetica is the speed layer with near-real-time analytic capabilities

Converge Machine Learning, streaming and location analytics and fast Query and Analytics with Kinetica and RDBMS

[Diagram labels: DATA IN MOTION AND REST; DATA WAREHOUSE / TRANSACTIONAL; Amazon Kinesis; STREAM / ETL PROCESSING; Kinetica Connectors; Fast GPU accelerated, in-Memory Database; Converge ML, AI, Streaming; ANALYSTS; MOBILE USERS; DASHBOARDS & APPLICATIONS; ALERTING SYSTEMS]

Page 19:

Advanced In-Database Analytics

1. User-defined functions (UDFs) can receive table data, do arbitrary computations, and save output to a separate table in a distributed manner.
2. UDFs have direct access to CUDA APIs - enables compute-to-grid analytics for logic deployed within Kinetica.
3. Works with custom code, or packaged code. Opens the way for machine learning / artificial intelligence libraries such as TensorFlow, BIDMach, Caffe and Torch to work on data directly within Kinetica.
4. Available now with C++ & Java bindings.

[Diagram: ORCHESTRATION LAYER WITH USER-DEFINED FUNCTIONS (UDFs). Labels: PHYSICAL / VIRTUAL SERVER; Table A, Table B, Table C, Table n; GPU; CUDA Libraries; Proc Server; UDF_A, UDF_B, UDF_n; UDFs exposed from RESTful endpoint (/exec/proc/UDF_A/); Data returned to output table for further analysis; n number of Kinetica servers]
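The UDF model in item 1 (receive table data, compute, write to a separate table in a distributed manner) can be sketched as follows. This is not the Kinetica UDF API, whose bindings the slide lists as C++ and Java; `run_udf` and `add_tax_udf` are hypothetical names, and lists of dicts stand in for table shards held on different servers.

```python
from concurrent.futures import ThreadPoolExecutor

def run_udf(udf, input_shards):
    """Apply `udf` to every shard independently and return one output
    shard per input shard, mimicking a distributed output table."""
    workers = max(1, len(input_shards))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(udf, input_shards))

def add_tax_udf(shard):
    """Example UDF body: derive a new column from the rows it is handed."""
    return [
        {"id": row["id"],
         "total_cents": row["amount_cents"] + row["amount_cents"] * 8 // 100}
        for row in shard
    ]
```

Each shard is processed where it lives and its results stay in the matching output shard, so no step gathers the whole table in one place.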

Page 20:

Kinetica Architecture

[Diagram labels: STREAMING DATA; ERP / CRM / TRANSACTIONAL DATA; ETL / STREAM PROCESSING; PARALLEL INGEST; ON DEMAND SCALE OUT +; 1TB MEM / 2 GPU CARDS; In-Database Processing (UDFs, CUSTOM LOGIC, BIDMach, ML Libs); SQL; Native APIs; Geospatial WMS; Custom Connectors; BI DASHBOARDS; BI / GIS / APPS; CUSTOM APPS & GEOSPATIAL; KINETICA ‘REVEAL’]

Page 21:

AI & BI on One GPU-Accelerated Database

[Diagram: HIGH PERFORMANCE ANALYTICS DATABASE with UDFs, accessed via ODBC / JDBC, Native REST API, WMS, and SQL. Labels: BUSINESS INTELLIGENCE; CUSTOM APPLICATIONS; HIGH FIDELITY GEOSPATIAL PIPELINE; MACHINE LEARNING & DEEP LEARNING; GPU-ACCELERATED DATA SCIENCE; PREDICTIVE MODELS e.g. Risk Management, Sales Volume, Fraud; BIDMach; DATA SCIENTISTS / DEVELOPERS; BUSINESS USERS]

Page 22:

50-100x Faster on Queries with Large Datasets

WHEN COMPARED TO LEADING IN-MEMORY ALTERNATIVES

• Large retailer tested complex SQL queries on 3 years of retail data (150bn rows)
• 10 node Kinetica cluster against 30TB+ cluster from next best alternative
• GPU is able to perform many instructions in parallel. Huge performance gains on aggregations, group bys, joins, etc.
• Kinetica sustained ingest of 1.3bn objects/minute with 70 attributes per row

[Chart: SUM (Q1), GROUP BY (Q5), and SELECT (Q10), Kinetica vs. Leading In-Memory DB; axis from 0 to 50]

Page 23:

Distributed Geospatial Pipeline

• NATIVE VISUALIZATION IS DESIGNED FOR FAST MOVING, LOCATION-BASED DATA

Native Geospatial Object Types

• Points, Shapes, Tracks, Labels

Native Geospatial Functions

• Filters (by area, by series, by geometry, etc.)
• Aggregation (histograms)
• Geofencing - triggers
• Video generation (based on dates/times)

Generate Map Overlay Imagery (via WMS)

• Rasterize points
• Style based on attributes (class-break)
• Heatmaps
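Three of the functions listed above (filter by area, histogram aggregation, and geofence triggers) can be sketched in plain Python. The function names, the dict-based point format, and the rectangular fence are assumptions made for illustration; they are not Kinetica endpoints.

```python
from collections import Counter

def filter_by_area(points, min_x, min_y, max_x, max_y):
    """Keep points whose coordinates fall inside an axis-aligned box."""
    return [p for p in points
            if min_x <= p["x"] <= max_x and min_y <= p["y"] <= max_y]

def histogram(points, field, bin_width):
    """Count points per bin of `field`; keys are the bins' lower edges."""
    return Counter((p[field] // bin_width) * bin_width for p in points)

def geofence_events(track, fence):
    """Yield (index, 'enter' or 'exit') each time a track crosses the fence box."""
    inside = False
    for i, p in enumerate(track):
        now_inside = bool(filter_by_area([p], *fence))
        if now_inside != inside:
            yield i, "enter" if now_inside else "exit"
        inside = now_inside
```

A track that starts outside the box, spends two samples inside it, and then leaves produces one `enter` and one `exit` event, which is the trigger behaviour a geofence needs.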

Page 24:

Full-Text Search

Kinetica includes powerful text search functionality, including:

• Exact Phrases
• Boolean - AND/OR
• Wildcards
• Grouping
• Fuzzy Search (Damerau-Levenshtein optimal string alignment algorithm)
• N-Gram Term Proximity Search
• Term Boosting Relevance Prioritization

“Rain Tire” ~5

"Union Tranquility"~10

[100 TO 200]
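The fuzzy search bullet names the optimal string alignment variant of Damerau-Levenshtein distance. The sketch below implements that distance with the standard dynamic-programming table; `osa_distance` and `fuzzy_match` are hypothetical helper names, and how Kinetica applies the distance internally is not shown in the slides.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def fuzzy_match(term, vocabulary, max_distance=1):
    """Return vocabulary words within `max_distance` edits of `term`."""
    return [w for w in vocabulary if osa_distance(term, w) <= max_distance]
```

With this definition a swapped pair of letters costs one edit, so "tier" is one edit from "tire" rather than two.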


Page 25:

INTELLIGENCE: US Army - INSCOM

CASE STUDY: LOCATION BASED ANALYTICS

US Army’s in-memory computational engine for any data with a geospatial or temporal attribute for a major joint cloud initiative within the Intelligence Community (ICITE).

Intel analysts are able to conduct near real-time analytics and fuse SIGINT, ISR, and GEOINT streaming big data feeds and visualize in a web browser.

First time in history military analysts are able to query and visualize billions to trillions of near real-time objects in a production environment.

Major executive military and congressional visibility.

U.S. Army INSCOM Shift from Oracle to GPUdb

Oracle Spatial (92 Minutes)

GPUdb (20 ms)

1 GPUdb server vs 42 servers with Oracle 10gR2 (2011)

42x Lower Space, 28x Lower Cost, 38x Lower Power Cost

Page 26:

LOGISTICS: Workforce optimization

DISTRIBUTED ANALYSIS

USPS’ parallel cluster is able to serve up to 15,000 simultaneous sessions, providing the service’s managers and analysts with the capability to instantly analyze their areas of responsibility via dashboards.

AT SCALE

With 200,000 USPS devices emitting location once every minute, that amounts to more than a quarter billion events captured and analyzed daily… tracked on 10 nodes.
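The "more than a quarter billion" figure follows directly from the numbers on the slide. The short calculation below assumes every device reports once per minute around the clock; the per-node and per-second averages are derived values, not figures from the presentation.

```python
devices = 200_000
reports_per_device_per_day = 24 * 60   # one location report per minute
events_per_day = devices * reports_per_device_per_day
nodes = 10

print(events_per_day)            # 288000000
print(events_per_day // nodes)   # 28800000 events per node per day
print(events_per_day / 86_400)   # average events per second, cluster-wide
```

288 million events per day is above the quarter-billion mark quoted on the slide.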

USPS is the single largest logistic entity in the country, moving more individual items in four hours than the combination of UPS, FedEx, and DHL move all year.

CASE STUDY: LOCATION BASED ANALYTICS

Page 27:

LOGISTICS & FLEET MANAGEMENT

LARGE RETAILER

CASE STUDY: LOCATION BASED ANALYTICS

Kinetica enables agile tracking of shipments to help store managers track inventory and arrival times.

• Visibility and tracking of deliveries & trucks for store managers
• ETA & Notifications - Provide estimated time of delivery, notifications and custom location based alerting
• Route Optimization based on truck size, and if cargo is perishable or contains hazardous materials.

Page 28:

RISK MANAGEMENT

MULTINATIONAL BANK

CASE STUDY: ADVANCED IN-DATABASE ANALYTICS

Large financial institution moves counterparty risk analysis from overnight to real-time.

• Data collected by xVA library which computes risk metrics for each trade
• Risk computations are becoming more complex and computationally heavy. xVA analysis needs to project years into the future.
• Kinetica enables banks to move from batch/overnight analysis to a streaming/real-time system for flexible real-time monitoring by traders, auditors and management.

Page 29:

Scale Out on Industry Standard Hardware

Kinetica typically results in 1/10 the hardware costs of standard in-memory databases.

Runs on industry standard servers, 512GB memory with GPUs (ex. NVIDIA K80)

[Logo groups: IN THE CLOUD WITH; CERTIFIED ON PREMISE WITH; COMING SOON]

Page 30:

Stop by Booth #431 and Get Your Free T-shirt

www.kinetica.com