Implementing data preparation in a distributed multimedia system

IS DATA PREPARATION THE NEXT BIG DATA DISRUPTION? The 22nd International Conference on Distributed Multimedia Systems (DMS 2016), Grand Hotel Salerno, Salerno, Italy, November 25-26, 2016

Upload: gianluigi-riccio

Post on 15-Feb-2017

TRANSCRIPT


Agenda

• SCENARIO

• BIG DATA IN THE DATA DRIVEN ENTERPRISE

• WHAT DATA PREPARATION SHOULD COVER

• CREATING READY DATA USING FRACTALS

• CASE STUDY

Source: Forrester 2016

1. DOES THE BUSINESS ANALYST UNDERSTAND THE DATA SCIENTIST?
2. WHY ARE DATA-DRIVEN COMPANIES HIRING DATA JOURNALISTS?
3. WHY DOES DARK DATA EXTERNAL TO DATA LAKES CONTINUE TO GROW?
4. WHY DOES MAKING DATA TAKE SO LONG?
5. DATA PLAY AND NARRATIVES?

HOW MUCH TIME IS AVAILABLE TO EXPLOIT DATA PROCESSING OUTPUT?

77% Data Processing

23% Data Analysis

Source: Bloor 2016

90% IS DARK

12% AVAILABLE FOR BUSINESS INSIGHTS

88% IS JUST STORED

80% RECORDINGS, PDFs AND TEXTS

Source: IDC 2016

+4300% ANNUAL DATA GENERATION

Data preparation is an iterative process for exploring and transforming raw data into forms suitable for data science, data discovery, and analytics. Self-service data preparation (SSDP) tools are user-oriented tools that enable data preparation capabilities such as data cataloging and inventorying, data discovery, data exploration, data transformation, data structuring, surfacing of sensitive attributes and anomaly detection. These tools are aimed at reducing the time and complexity of preparing data and improving analyst productivity.
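Two of the capabilities listed above, cataloging and the surfacing of sensitive attributes, can be illustrated with a minimal Python sketch. It is not taken from any SSDP product: the `profile` function, the sample rows, the deliberately simple e-mail pattern and the 0.8 cutoff are all assumptions made for illustration.

```python
import re

# Assumed, deliberately simple pattern; real tools use far richer detectors.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile(rows):
    """Build a tiny catalog entry per column: fill rate, distinct count, sensitivity flag."""
    catalog = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [str(v) for v in values if v not in (None, "")]
        email_hits = sum(1 for v in present if EMAIL_RE.match(v))
        catalog[col] = {
            "fill_rate": len(present) / len(values),
            "distinct": len(set(present)),
            # Flag the column when most non-empty values look like e-mail addresses.
            "sensitive": bool(present) and email_hits / len(present) >= 0.8,
        }
    return catalog

# Hypothetical sample rows.
rows = [
    {"name": "Aldo", "contact": "aldo@example.com", "city": "Miami"},
    {"name": "Sara", "contact": "sara@example.com", "city": "NYC"},
    {"name": "Anna", "contact": "", "city": "Rome"},
]
catalog = profile(rows)
```

Here `catalog["contact"]` is flagged as sensitive and shows a fill rate of two thirds, which is the kind of hint an analyst would want before blending the data.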

[Figure: data preparation pipeline. Stages: Preprocess, Prepare, Discover, Exploit. Data states: Raw, Technically correct, Ready Data, Patterns, Formatted. Annotations: multimedia domain, missing multimedia.]

Depending on how you count them, there are anywhere from 20 to 50 providers of self-service data preparation tools. However, they're not all equal, and users should carefully examine each offering to make sure they're getting what they expect. Many BI and advanced analytics vendors (Tableau, Qlik, SAS, etc.) have jumped onto SSDP, even if their capabilities aren't separate from their core offerings and show limitations in terms of performance, neutrality and custom processing. The key reason why self-service data prep will survive as its own category is the growing realization that data preparation needs to be kept separate from analysis and discovery. The volumes and the number of data sources will not be decreasing, and neither will the number of BI tools. To that end, it's likely that self-service data prep will remain a product category unto itself for the foreseeable future.

Source: Bloor 2016

Where we are: BIG DATA IN THE DATA DRIVEN ENTERPRISE

WE ALL ARE AWARE

THE I.T. DIVISION IS GOING TO BUILD PLANETS OF DATA

WHICH ARE WORLDS MADE OF DATABASES, DATA LAKES, DATA WAREHOUSES, STRUCTURES, AND SCHEMAS

IT SEEMS THAT THESE WORLDS ARE CALLED “BIG DATA”

BUT, WE'RE AFRAID, TO CREATE THEM, LORDS ARE TAKING LONGER THAN 7 DAYS

AND, UNFORTUNATELY, WORSE… IT SEEMS THAT HUMANS HAVE NO ACCESS TO THOSE WORLDS

Bottom line: Is data preparation the bridge between planets of data and the user?

Big Data is not just technology; responsibility should be allocated on the basis of the following critical factors:

1. Will raw data be transferred to the preparation unit (push), or

2. does the preparation unit have to read data from the data lake (pull)?

3. Has the data lake been designed to stage or to store raw data?

4. What about the variability of the context and data?

[Figure: responsibility matrix. Axes: data communication mode (PUSH, PULL) and data lake purpose (STORE, STAGE). Cells assign responsibility to IT or END USER, by low or high variability.]
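The difference between the push and pull communication modes can be sketched in Python as follows. This is an illustrative toy, not a description of any product: the `DataLake` and `PreparationUnit` classes, their method names and the trivial `_prepare` step are hypothetical, and the lake only models the "store" purpose (records are kept after delivery).

```python
class DataLake:
    """Hypothetical in-memory stand-in for a data lake that stores raw records."""

    def __init__(self):
        self._records = []
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def ingest(self, record):
        self._records.append(record)
        for callback in self._subscribers:
            callback(record)  # push: the lake hands data to the preparation unit

    def read(self, offset=0):
        return self._records[offset:]  # pull: the consumer asks for data


class PreparationUnit:
    def __init__(self):
        self.prepared = []
        self._offset = 0  # position of the next unread record in pull mode

    def on_push(self, record):
        self.prepared.append(self._prepare(record))

    def pull_from(self, lake):
        batch = lake.read(self._offset)
        self._offset += len(batch)
        self.prepared.extend(self._prepare(r) for r in batch)
        return len(batch)

    @staticmethod
    def _prepare(record):
        return record.strip().lower()  # placeholder for real preparation logic


lake = DataLake()
pushed, pulled = PreparationUnit(), PreparationUnit()
lake.subscribe(pushed.on_push)
lake.ingest(" Aldo ")
lake.ingest("SARA")
first_pull = pulled.pull_from(lake)
second_pull = pulled.pull_from(lake)
```

In push mode the owner of the lake decides when data moves; in pull mode the preparation unit does, which is one reason the allocation of responsibility between IT and end user depends on the mode.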

Backgrounds: WHAT DATA PREPARATION SHOULD COVER

Raw data are cold, analytics hot

reality

1993, Understanding Comics

How to connect analytics and details?

A database is required to contextualize languages and realities

Bottom line: usage of data should be faster and cost less, with minimum data movement requirements

• materialize reality and language in a consistent database

• couple language and reality using keyback features

• bind external algorithms using open (standard?) user exits

• foster holistic views of data through Grid Data Unification

blending context, languages and facts

CREATING READY DATA USING FRACTAL ADC

Data source

name  city   DateBirth
Aldo  Miami  11/1/90
Sara  NYC    12/2/89
Anna  Rome   1.1.68
Sara  NYC    31-1-61

Steps: Fractal conversion; Transform DateBirth; Add Geo classification

Data complex (storage group):

Map

rowId  Nname  Ncity
1      1      1
2      2      2
3      3      3
4      2      2

Dictionary

Key   Value  NValue
Name  Aldo   1
Name  Sara   2
Name  Anna   3
City  Miami  1
…     …      …

Luggage

DateBirth  UDateB   Age
11/1/90    1/11/90  26
12/2/89    2/12/89  26
1.1.68     1/1/68   48
31-1-61    1/31/61  56

hierarchy

Ncity  city   state
1      Miami  Fl
2      NYC    NY
3      Rome   Italy
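How the sample data source turns into the Map and Dictionary tables, and how DateBirth becomes UDateB, can be sketched in Python. This is only an illustration of dictionary encoding and date normalization, not the ADC engine itself: the `encode` and `normalize_date` functions are hypothetical, codes are assumed to be assigned in first-seen order, and input dates are assumed to be day-first with two-digit years in the 1900s.

```python
import re
from datetime import date

SOURCE = [
    {"name": "Aldo", "city": "Miami", "DateBirth": "11/1/90"},
    {"name": "Sara", "city": "NYC", "DateBirth": "12/2/89"},
    {"name": "Anna", "city": "Rome", "DateBirth": "1.1.68"},
    {"name": "Sara", "city": "NYC", "DateBirth": "31-1-61"},
]

def encode(rows, columns):
    """Dictionary-encode the given columns; codes follow first-seen order (assumption)."""
    dictionary = {col: {} for col in columns}
    mapping = []
    for row_id, row in enumerate(rows, start=1):
        entry = {"rowId": row_id}
        for col in columns:
            codes = dictionary[col]
            # A new value gets the next free code; a known value keeps its code.
            entry["N" + col] = codes.setdefault(row[col], len(codes) + 1)
        mapping.append(entry)
    return dictionary, mapping

def normalize_date(text):
    """Parse day-month-year text with '/', '.' or '-' separators; return month/day/year."""
    day, month, year = (int(part) for part in re.split(r"[/.\-]", text))
    date(1900 + year, month, day)  # raises ValueError on impossible dates (1900s assumed)
    return f"{month}/{day}/{year:02d}"

dictionary, mapping = encode(SOURCE, ["name", "city"])
unified = [normalize_date(row["DateBirth"]) for row in SOURCE]
```

With the sample rows, `mapping` reproduces the four Map rows and `unified` reproduces the UDateB column. Repeated values such as "Sara" and "NYC" are stored once in the dictionary and referenced by code.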

ADC is a fractal-like algorithm that converts input raw data and the related data processing into a set of chained binary blocks, formulas and long pointers.

We show that ADC represents an important set of computations… The advantages of ADC are that:

• it is described by a small number of parameters and has a priori known sizes of the views,

• the views can be generated independently,

• the overhead of combining the generated views is predictable,

• the data set can be partitioned into a number of independently generated subsets,

• the elements of the data set are pseudorandom.

These properties make ADC a strong candidate for a data intensive grid benchmark <M. Frumkin, NASA NAS Division>
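The properties quoted above can be illustrated with a toy data cube in Python. This is neither the benchmark's nor the vendor's implementation. It is a sketch under stated assumptions: the function names, the per-partition seeding scheme and the parameter values are hypothetical, and `max_view_size` gives only an upper bound on a view's size.

```python
import random
from collections import Counter
from itertools import combinations

def generate_partition(seed, part, size, cardinalities):
    """Pseudorandom tuples for one partition; depends only on a few parameters."""
    rng = random.Random(seed * 1_000_003 + part)  # assumed per-partition seeding scheme
    return [tuple(rng.randrange(c) for c in cardinalities) for _ in range(size)]

def view(dataset, dims):
    """Group-by count over the selected dimensions (one view of the cube)."""
    return Counter(tuple(row[d] for d in dims) for row in dataset)

def max_view_size(dims, cardinalities, n_tuples):
    """Upper bound on the number of groups, known before any data is generated."""
    size = 1
    for d in dims:
        size *= cardinalities[d]
    return min(size, n_tuples)

CARDINALITIES = (4, 5, 3)
partitions = [generate_partition(7, p, 250, CARDINALITIES) for p in range(4)]
dataset = [row for part in partitions for row in part]

# Every view is computed on its own, with no dependency on the other views.
all_dims = [dims for k in range(4) for dims in combinations(range(3), k)]
views = {dims: view(dataset, dims) for dims in all_dims}

# Views built per partition combine by simple addition of counts.
merged = sum((view(part, (0, 1)) for part in partitions), Counter())
```

Here `merged` equals `views[(0, 1)]`: the partitions were generated and aggregated independently, and combining them costs one counter addition per partition.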

Using the fractal engine, performance is extreme

Use case

MATERIAL TESTING

• Complex JSON, Oracle, CSV, WMV data

• Manual data processing executed using MATLAB

• Hours of scientist work to detect outliers

• Impossibility to replicate tests with the same results

• Scarce know-how capitalization

• Blend of data happens at narrative writing time
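Two of the pain points above, manual outlier detection and non-replicable results, are the kind of work a scripted and deterministic check can take over. The Python sketch below uses the modified z-score based on the median absolute deviation. It is a generic technique, not the method used in the case study, and the `mad_outliers` function, the threshold and the sample readings are assumptions.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median: nothing can be ranked as an outlier
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]

# Hypothetical test readings; the sixth one is clearly off.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0, 10.0]
outliers = mad_outliers(readings)
```

Because the function has no hidden state and no randomness, running it again on the same readings returns the same indexes, so the test can be replicated.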

Terabyte-level staging

Rigid batch processing

No history

Digital reality, Language

Fractal Database

Bottom line: Every day we hear from entrepreneurs doing their best to turn their big ideas into a consistent and successful online business. Here IT is the enabler but, unfortunately, sometimes the T part has a negative influence on the development of the core idea.

The ideal toolkit is made for those who wish to exploit the I part of IT, so that entrepreneurs having great ideas can craft their business themselves. And they should!

© 2016 datonix Spa

Thank you