TRANSCRIPT
HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS
Mark Brooks - Principal System Engineer @ Kinetica
May 09, 2017
The Challenge:
How to maintain analytic performance while dealing with:
• Larger data volumes
• Streaming data with minimal end-to-end latency
• Ad-hoc drilldown (you can’t pre-aggregate everything)
Architectural and Design Approaches
1. One database to rule them all
2. SQL on Hadoop (or directly on the Data Lake)
3. Data Lake + NoSQL + Spark + Search + Cache + …
4. Lambda Architecture
5. Kappa Architecture
6. Next generation hardware acceleration
One Database To Rule Them All
SQL on a Data Lake
Credit: https://www.slideshare.net/Bigdatapump/sql-on-hadoop-49494494
Hadoop + NoSQL + Search + Memory Cache + …
Credit: Matt Turck - https://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014
Lambda Architecture
Credit: Nathan Marz http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html and James Kinley http://jameskinley.tumblr.com/tagged/Lambda
Kappa Architecture
Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture
Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast?
Next Generation Hardware Acceleration
Consider a system with these characteristics:
• Horizontally scalable
• Low end-to-end latency
• Powerful enough to not require pre-aggregation
This is now possible…
GPU Accelerated Compute
DATA WAREHOUSE
RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.
DISTRIBUTED STORAGE
Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.
AFFORDABLE MEMORY
Affordable memory allows for faster data read and write. HANA, MemSQL, & Exadata provide faster analytics.
Timeline: 1990 - 2000’s, 2005…, 2010…, 2017…
AT SCALE, PROCESSING BECOMES THE BOTTLENECK
GPU ACCELERATED COMPUTE
GPU cores bulk process tasks in parallel - far more efficient for many data-intensive tasks than CPUs, which process those tasks linearly.
Kinetica: Core
ANALYTICS DATABASE ACCELERATED BY GPUs
[Diagram: KINETICA node on commodity hardware w/ GPUs: HTTP head node, GPU accelerated columnar in-memory database (columns A, B, C; rows 1-4), disk]
• Columnar in-memory database
• Data available much like a traditional RDBMS… rows, columns
• Data held in-memory; persisted to disk
• Interact with Kinetica through its native REST API, Java, Python, JavaScript, NodeJS, C++, SQL, etc… as well as with various connectors
• Native GIS & IP address object support
• VERY FAST: Ideal for OLAP workloads
Typical hardware setup: 256GB - 1TB memory with 2-4 GPUs per node.
Multi-Head Ingest and Scale-Out Architecture
[Diagram: ON-DEMAND SCALE OUT and MULTI-HEAD INGEST: three identical nodes on commodity hardware w/ GPUs, each with its own HTTP head node, columnar in-memory store and disk; "+" marks where nodes are added]
Real-Time Data Handlers for Structured & Unstructured Data
VISUALIZATION via ODBC/JDBC
APIs: Java API, JavaScript API, REST API, C++ API, Node.js API, Python API
OPEN SOURCE INTEGRATION: Apache NiFi, Apache Kafka, Apache Spark, Apache Storm
GEOSPATIAL CAPABILITIES: Geometric Objects, Tracks, Geospatial Endpoints, WMS, WKT
OTHER INTEGRATION: Message Queues, ETL Tools, Streaming Tools
KINETICA CLUSTER: On-Demand Scale
[Diagram: four identical nodes on commodity hardware w/ GPUs, each with its own HTTP head node, columnar in-memory store and disk]
Parallel Ingest Provides High Performance Streaming
[Diagram: PARALLEL INGEST into three nodes, each labelled 1 NODE (1TB / 2 GPU)]
Each node of the system can share the task of data ingest, providing more and faster throughput. It can be made faster simply by adding more nodes.
No compute is used on ingest!
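The idea of every node sharing the ingest work can be sketched as client-side sharding: records are routed by a stable hash of a key and each batch is sent straight to the node that owns it. This is a minimal illustration only, not Kinetica's API; the Node class, shard_for and the "id" shard key are hypothetical names, and Python threads stand in for network connections.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical stand-in for one cluster node; it only stores rows."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def ingest(self, batch):
        self.rows.extend(batch)
        return len(batch)


def shard_for(key, node_count):
    # Stable hash, so every client routes a given key to the same node.
    return zlib.crc32(key.encode("utf-8")) % node_count


def parallel_ingest(records, nodes):
    """Split records by shard key, then send each batch directly to its node."""
    batches = [[] for _ in nodes]
    for record in records:
        batches[shard_for(record["id"], len(nodes))].append(record)
    # One worker per node: no single head node sits in the data path.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(Node.ingest, nodes, batches))


nodes = [Node("node-%d" % i) for i in range(3)]
records = [{"id": "device-%d" % i, "lat": 0.0, "lon": 0.0} for i in range(900)]
counts = parallel_ingest(records, nodes)
```

Adding a node in this sketch means one more batch and one more worker, which is the sense in which ingest scales by adding nodes.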
Speed Layer for the Data Lake
• Parallel ingestion of events
• Kinetica is the speed layer, with real-time analytic capabilities
• HDFS for archival store
• Much looser coupling than a traditional lambda architecture
• Batch mode Spark or MR jobs can push data to Kinetica as needed for fast query on data loaded from the data lake
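The division of labour in the bullets above can be sketched with two plain Python lists: one standing in for the archival store and one for the in-memory speed layer. Everything here is an assumption made for illustration (the event fields, the eviction rule, the function names); it shows the pattern of landing raw events, aggregating at query time, and pushing history back on demand, not how Kinetica or HDFS are actually driven.

```python
from collections import defaultdict

archive = []      # stand-in for the archival store (e.g. HDFS)
speed_table = []  # stand-in for the in-memory speed layer


def on_event(event):
    """Land every event in both stores; nothing is pre-aggregated."""
    archive.append(event)
    speed_table.append(event)


def evict_older_than(cutoff_ts):
    """Assumed retention rule: the speed layer keeps only recent rows."""
    speed_table[:] = [e for e in speed_table if e["ts"] >= cutoff_ts]


def load_from_archive(predicate):
    """Batch-style job: push a slice of history back into the speed layer."""
    present = {e["id"] for e in speed_table}
    speed_table.extend(
        e for e in archive if predicate(e) and e["id"] not in present
    )


def total_by(field):
    """Ad-hoc aggregate computed at query time over the raw rows."""
    totals = defaultdict(float)
    for e in speed_table:
        totals[e[field]] += e["amount"]
    return dict(totals)


for i in range(10):
    on_event({"id": i, "ts": i, "region": "east" if i % 2 else "west", "amount": 1.0})

evict_older_than(5)                       # speed layer now holds ts 5..9
recent = total_by("region")
load_from_archive(lambda e: e["ts"] < 5)  # pull history back only when needed
full = total_by("region")
```

Because the batch side only pushes data when asked, the two layers do not have to maintain matching views, which is one reading of the "looser coupling" point.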
[Diagram labels: EVENTS; MESSAGE BROKERS; Amazon Kinesis; STREAM PROCESSING; Kinetica Connectors; "Put, get, scan"; "Execute complex analytics on the fly"; ANALYSTS; MOBILE USERS; DASHBOARDS & APPLICATIONS; ALERTING SYSTEMS; HDFS / AWS S3 / GCS / Azure Data Lake]
Real-Time, Advanced Analytics, Speed Layer for Teradata or Oracle
• Parallel ingestion of events
• Lambda-type architecture for Teradata or Oracle
• Kinetica is the speed layer, with near-real-time analytic capabilities
• Converge machine learning, streaming and location analytics, and fast query and analytics with Kinetica and an RDBMS
[Diagram labels: DATA IN MOTION AND REST; DATA WAREHOUSE / TRANSACTIONAL; Amazon Kinesis; STREAM / ETL PROCESSING; Kinetica Connectors; "Fast GPU accelerated, in-memory database"; "Converge ML, AI, Streaming"; ANALYSTS; MOBILE USERS; DASHBOARDS & APPLICATIONS; ALERTING SYSTEMS]
Advanced In-Database Analytics
1. User-defined functions (UDFs) can receive table data, do arbitrary computations, and save output to a separate table in a distributed manner.
2. UDFs have direct access to CUDA APIs – enables compute-to-grid analytics for logic deployed within Kinetica.
3. Works with custom code, or packaged code. Opens the way for machine learning / artificial intelligence libraries such as TensorFlow, BIDMach, Caffe and Torch to work on data directly within Kinetica.
4. Available now with C++ & Java bindings.
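The shape of point 1 (table data in, arbitrary computation, output to a separate table, run in a distributed manner) can be sketched as a function applied to each shard of a table in parallel. This is not the Kinetica UDF API, and the slide lists C++ and Java bindings rather than Python; table_a, udf_half_notional and exec_proc are hypothetical names, and threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input table, already split into one shard per server.
table_a = [
    [{"trade": 1, "notional": 100.0}, {"trade": 2, "notional": 250.0}],
    [{"trade": 3, "notional": 75.0}],
    [{"trade": 4, "notional": 400.0}, {"trade": 5, "notional": 20.0}],
]


def udf_half_notional(rows):
    """Arbitrary user logic: sees one shard, returns that shard's output rows."""
    return [{"trade": r["trade"], "weighted": r["notional"] * 0.5} for r in rows]


def exec_proc(udf, input_shards):
    """Run the UDF next to every shard; the output stays sharded the same way."""
    with ThreadPoolExecutor(max_workers=len(input_shards)) as pool:
        return list(pool.map(udf, input_shards))


# The result is a separate table; table_a itself is left untouched.
output_table = exec_proc(udf_half_notional, table_a)
```

Each shard is processed where it lives and no rows cross between shards, which is what lets the same function scale with the number of servers.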
ORCHESTRATION LAYER WITH USER-DEFINED FUNCTIONS (UDFs)
[Diagram: n number of Kinetica servers (physical / virtual), each holding tables (Table A, Table B, Table C … Table n), a GPU with CUDA libraries, and a Proc Server running UDF_A, UDF_B … UDF_n]
• UDFs exposed from RESTful endpoint: /exec/proc/UDF_A/
• Data returned to output table for further analysis
Kinetica Architecture
[Diagram labels: STREAMING DATA; ERP / CRM / TRANSACTIONAL DATA; ETL / STREAM PROCESSING; PARALLEL INGEST; ON DEMAND SCALE OUT +; 1TB MEM / 2 GPU CARDS; In-Database Processing (UDFs, CUSTOM LOGIC, BIDMach, MLLibs); SQL; Native APIs; Geospatial WMS; Custom Connectors; BI DASHBOARDS; BI / GIS / APPS; CUSTOM APPS & GEOSPATIAL; KINETICA ‘REVEAL’]
AI & BI on One GPU-Accelerated Database
[Diagram labels: HIGH PERFORMANCE ANALYTICS DATABASE; UDF, UDF, UDF; ODBC/JDBC; Native REST API; WMS; SQL; BUSINESS INTELLIGENCE; CUSTOM APPLICATIONS; HIGH FIDELITY GEOSPATIAL PIPELINE; MACHINE LEARNING & DEEP LEARNING; GPU-ACCELERATED DATA SCIENCE; PREDICTIVE MODELS e.g. Risk Management, Sales Volume, Fraud; BIDMach; DATA SCIENTISTS / DEVELOPERS; BUSINESS USERS]
50-100x Faster on Queries with Large Datasets
• Large retailer tested complex SQL queries on 3 years of retail data (150bn rows)
• 10 node Kinetica cluster against 30TB+ cluster from next best alternative
• GPU is able to perform many instructions in parallel. Huge performance gains on aggregations, group bys, joins, etc.
• Kinetica sustained ingest of 1.3bn objects/minute with 70 attributes per row
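Why aggregations and group bys parallelize well can be sketched on a columnar layout: each worker sums its own slice of the columns, and the small partial results are merged at the end. This is a CPU-side illustration under assumed data (store_id and sales are made-up columns); Python threads only show the structure and give no real speedup, whereas a GPU would run the same pattern across thousands of lanes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical columnar table: one Python list per column.
store_id = [i % 5 for i in range(10_000)]
sales = [1 + (i % 7) for i in range(10_000)]


def partial_group_sum(bounds):
    """Aggregate one contiguous slice of the two columns."""
    start, stop = bounds
    acc = Counter()
    for key, value in zip(store_id[start:stop], sales[start:stop]):
        acc[key] += value
    return acc


def group_sum(chunks):
    """GROUP BY store_id, SUM(sales): independent partials, then one merge."""
    step = -(-len(store_id) // chunks)  # ceiling division
    slices = [
        (i, min(i + step, len(store_id))) for i in range(0, len(store_id), step)
    ]
    total = Counter()
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        for part in pool.map(partial_group_sum, slices):
            total.update(part)  # Counter.update adds the partial sums
    return dict(total)
```

The result does not depend on how many slices are used, so `group_sum(1)` and `group_sum(8)` return the same totals; only the amount of work done side by side changes.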
WHEN COMPARED TO LEADING IN-MEMORY ALTERNATIVES
[Bar chart: SUM (Q1), GROUP BY (Q5) and SELECT (Q10), Kinetica vs. Leading In-Memory DB; axis from 0 to 50]
Distributed Geospatial Pipeline
NATIVE VISUALIZATION IS DESIGNED FOR FAST MOVING, LOCATION-BASED DATA
Native Geospatial Object Types
• Points, Shapes, Tracks, Labels
Native Geospatial Functions
• Filters (by area, by series, by geometry, etc.)
• Aggregation (histograms)
• Geofencing - triggers
• Video generation (based on dates/times)
Generate Map Overlay Imagery (via WMS)
• Rasterize points
• Style based on attributes (class-break)
• Heatmaps
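Two of the functions listed above, filtering by area and histogram aggregation, can be sketched in a few lines. This is a generic illustration (ray-casting point-in-polygon plus fixed-size grid binning), not Kinetica's implementation; the function names, the square fence and the sample points are all made up.

```python
def in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Count the edges crossed by a horizontal ray cast from the point.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def filter_by_area(points, polygon):
    """Keep only the points that fall inside the fence."""
    return [p for p in points if in_polygon(p[0], p[1], polygon)]


def grid_histogram(points, min_lon, min_lat, cell, cols, rows):
    """Count points per grid cell; the raw material for a heatmap."""
    grid = [[0] * cols for _ in range(rows)]
    for lon, lat in points:
        c = int((lon - min_lon) // cell)
        r = int((lat - min_lat) // cell)
        if 0 <= c < cols and 0 <= r < rows:
            grid[r][c] += 1
    return grid


fence = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
points = [(1.0, 1.0), (3.5, 0.5), (5.0, 2.0), (2.0, 3.0)]
inside = filter_by_area(points, fence)
heat = grid_histogram(inside, 0.0, 0.0, 2.0, 2, 2)
```

A geofence trigger is the same membership test applied to each new position as it arrives, firing when the answer changes.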
Full-Text Search
Kinetica includes powerful text search functionality, including:
• Exact Phrases
• Boolean – AND / OR
• Wildcards
• Grouping
• Fuzzy Search (Damerau-Levenshtein optimal string alignment algorithm)
• N-Gram Term Proximity Search
• Term Boosting Relevance Prioritization
Example queries: "Rain Tire"~5, "Union Tranquility"~10, [100 TO 200]
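The fuzzy search bullet names the optimal string alignment (OSA) variant of Damerau-Levenshtein distance. A textbook version of that distance is sketched below to show what it measures: insertions, deletions, substitutions and swaps of adjacent characters. The fuzzy_match helper and its max_edits threshold are hypothetical; how Kinetica applies the distance internally is not described in the slides.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "er" <-> "re".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(term, vocabulary, max_edits=1):
    """Return vocabulary words within max_edits of term, ignoring case."""
    return [w for w in vocabulary if osa_distance(term.lower(), w.lower()) <= max_edits]
```

With this definition a swapped pair such as "tier" for "tire" counts as a single edit, where plain Levenshtein distance would count two.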
INTELLIGENCE: US Army - INSCOM
US Army’s in-memory computational engine for any data with a geospatial or temporal attribute for a major joint cloud initiative within the Intelligence Community (ICITE).
Intel analysts are able to conduct near real-time analytics and fuse SIGINT, ISR, and GEOINT streaming big data feeds and visualize in a web browser.
First time in history military analysts are able to query and visualize billions to trillions of near real-time objects in a production environment.
Major executive military and congressional visibility.
U.S. Army INSCOM Shift from Oracle to GPUdb
• Oracle Spatial (92 Minutes) vs. GPUdb (20 ms)
• 42x Lower Space, 28x Lower Cost, 38x Lower Power Cost
• 1 GPUdb server vs 42 servers with Oracle 10gR2 (2011)
CASE STUDY: LOCATION BASED ANALYTICS
LOGISTICS: Workforce optimization
DISTRIBUTED ANALYSIS
USPS’ parallel cluster is able to serve up to 15,000 simultaneous sessions, providing the service’s managers and analysts with the capability to instantly analyze their areas of responsibility via dashboards.
AT SCALE
With 200,000 USPS devices emitting location once every minute, that amounts to more than a quarter billion events captured and analyzed daily… tracked on 10 nodes.
USPS is the single largest logistic entity in the country, moving more individual items in four hours than the combination of UPS, FedEx, and DHL move all year.
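The "quarter billion" figure follows directly from the numbers on the slide. The short calculation below works it through; the per-second and per-node rates are derived here for scale and assume events are spread evenly over the day and over the 10 nodes, which the slide does not state.

```python
devices = 200_000                    # from the slide
reports_per_device_per_day = 24 * 60  # one location report per minute
nodes = 10                            # from the slide

events_per_day = devices * reports_per_device_per_day
events_per_second = events_per_day / (24 * 60 * 60)
events_per_node_per_second = events_per_second / nodes  # assumes even spread

print(events_per_day)                     # 288000000
print(round(events_per_second))           # 3333
print(round(events_per_node_per_second))  # 333
```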
CASE STUDY: LOCATION BASED ANALYTICS
LOGISTICS & FLEET MANAGEMENT
Kinetica enables agile tracking of shipments, helping store managers track inventory and arrival times.
• Visibility and tracking of deliveries & trucks for store managers
• ETA & Notifications – Provide estimated time of delivery, notifications and custom location based alerting
• Route Optimization based on truck size, and if cargo is perishable or contains hazardous materials.
LARGE RETAILER
CASE STUDY: LOCATION BASED ANALYTICS
RISK MANAGEMENT
Large financial institution moves counterparty risk analysis from overnight to real-time.
• Data collected by XVA library which computes risk metrics for each trade
• Risk computations are becoming more complex and computationally heavy. xVA analysis needs to project years into the future.
• Kinetica enables banks to move from batch/overnight analysis to a streaming/real-time system for flexible real-time monitoring by traders, auditors and management.
MULTINATIONAL BANK
CASE STUDY: ADVANCED IN-DATABASE ANALYTICS
Scale Out on Industry Standard Hardware
Kinetica typically results in 1/10 the hardware costs of standard in-memory databases.
IN THE CLOUD WITH:
CERTIFIED ON PREMISE WITH:
Runs on industry standard servers, 512GB memory with GPUs (ex. NVIDIA K80)
COMING SOON:
Stop by Booth #431 and Get Your Free T-shirt
www.kinetica.com