the vada architecture for cost-effec>ve data wrangling · 2017-05-23 · vada: value added data...

Post on 16-Mar-2020

0 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

VADA:ValueAddedDataSystems–PrinciplesandArchitecture

TheVADAArchitectureforCost-Effec>veDataWrangling

1. Automa*c Bootstrapping: Theuseriden>fiesacollec>onofsourcesand defines a target schema. Thesystem automa>cally orchestrates acollec>on of transducers thattogether generate an ini>al resultdataset.2.Datacontext:Theuserassociatesthe target schema with relatedextents(e.g.masterdata).Suchdataallows various of the steps frombootstrapping to be revisited,including matching and mappingvalida>on. Furthermore, it is nowalso possible to learn quality rules,and thereby to carry out repairs tothemappingresults.TheresultdatashouldnowbeofbeLerquality.

Demonstra>on

The VADA architecture: (i) combines wrangling components thatbetween themcover the completedatawrangling lifecycle; (ii)buildson automa>on wherever possible, using whatever informa>on isavailable;(iii)refinestheresultsofautomatedprocessesinthelightofuserfeedback;and(iv)takesintoaccounttheuser’spriori>es.UserInterface:Thedatascien>stprovidesinforma>onaboutthedatatheyneed,theirpriori>esandfeedback.Transducers:ComponentswithinputandoutputdependenciesdefinedasDatalog±rulesovertheknowledgebase.UserContext:Informa>onabouttherequirementsoftheuser.DataContext:Informa>onabouttheapplica>ondomain.VadalogReasoner:SupportsreasoningovertheknowledgebaseusingVadalog,amemberoftheDatalog±familyoflanguages.

End-To-EndDataWrangling

3.Feedback:Theuserprovidesfeedbacktoindicatethatsomeoftheresultsarecorrectorincorrect.Dependingonthefeedbackprovided,thiswillenablesomeofthepreviousstepsinthewranglingprocesstoberevisited,givingrisetoarevisedresult.4. User context: The result is now hopefully of reasonable quality, but thedata included in the resultmay not be especiallywell-suited to the task athand.Theuserspecifiestheusercontextbyindica>ngtherela>ve(pairwise)importanceofdifferentfeatures intheresult.Thepairwisecomparisonsareusedtoderiveweightsthatinformtheselec>onofmappingsbasedonmul>-dimensionalop>miza>on.

ARealEstateScenarioThe demonstra>on acts on realestatedata,bringingtogether:•  Dataaboutproper>esforsale

fromwebdataextrac>onoverdeepwebsources.

•  Open government data thatprovides informa>on aboutthe areas in which theproper>esarelocated.

1

43

2

www.vada.org.uk

top related