![Page 1: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/1.jpg)
A Higher-Order Data Flow Model for Heterogeneous Big Data
Simon Price and Peter Flach
![Page 2: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/2.jpg)
Outline of this presentation
1. Introduction
2. JSONMatch
3. Data Flow Model
4. Example
5. Summary
![Page 3: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/3.jpg)
2. JSONMatch
![Page 4: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/4.jpg)
JSONMatch
JSON is the de facto data format for Web 2.0 and mobile apps.
JSON is the 'record' in many NoSQL databases.
JSONMatch compares the similarity of JSON documents.
Use case: interactive web applications for profiling and matching Big (Variety) Data.
http://jsonmatch.com
![Page 5: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/5.jpg)
JSONMatch
• A web service for analyzing and integrating data from heterogeneous sources in these formats:
• JSON (default)
• CSV
• HTML
• RDF
• XML
• YAML
• Plain text
• Prolog terms
• Weka ARFF machine learning datasets
![Page 6: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/6.jpg)
JSONMatch
• Stores and retrieves structured data (e.g. JSON documents) like a NoSQL database.
• Processes data using data flows defined dynamically in JSON using the REST API.
• Aims to produce results:
o quickly for small datasets
o eventually for larger datasets.
![Page 7: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/7.jpg)
3. Data Flow Model
![Page 8: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/8.jpg)
Data Model
• Each dataset is a relation. E.g. S
• Each relation is a set of key-value pairs. E.g. S1, S2, ..., Sn
• Values can be 'unstructured', semi-structured or structured data.
• In JSONMatch: value = JSON document
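To make the data model concrete, here is a minimal sketch that assumes a relation can be represented as a Python dict from keys to JSON documents. The keys, the sample documents and the helper `is_json_value` are hypothetical and are not taken from JSONMatch itself.

```python
import json

# Assumption: a relation is modelled as a dict mapping each key to a JSON document.
S = {
    "s1": {"person": {"title": "Dr", "name": "A. Example"}},  # structured
    "s2": ["free text can sit inside an array", 42],          # semi-structured
    "s3": "even a bare string is a valid JSON value",         # 'unstructured'
}

def is_json_value(value):
    """Return True if value survives a round trip through the json module."""
    try:
        return json.loads(json.dumps(value)) == value
    except (TypeError, ValueError):
        return False

# Every value in the relation should be representable as a JSON document.
all_valid = all(is_json_value(v) for v in S.values())
```

A dict enforces unique keys, which is one plausible way to read "a set of key-value pairs".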
![Page 9: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/9.jpg)
Example Data Flow
w = Φ3(Φ1(s), Φ2(t))
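The composition above can be sketched with ordinary functions over relations. The three transformations below (`phi1`, `phi2`, `phi3`) are hypothetical stand-ins chosen only to show how the output of Φ1 and Φ2 feeds Φ3; the slides do not say what the real transformations do.

```python
def phi1(s):
    # Hypothetical transformation: upper-case every value of relation s.
    return {k: v.upper() for k, v in s.items()}

def phi2(t):
    # Hypothetical transformation: keep only values longer than 3 characters.
    return {k: v for k, v in t.items() if len(v) > 3}

def phi3(a, b):
    # Hypothetical transformation: merge two relations, prefixing the keys
    # so that items from a and b cannot collide.
    out = {f"a:{k}": v for k, v in a.items()}
    out.update({f"b:{k}": v for k, v in b.items()})
    return out

s = {"1": "json", "2": "xml"}
t = {"1": "csv", "2": "yaml"}

# w = Φ3(Φ1(s), Φ2(t))
w = phi3(phi1(s), phi2(t))
```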
![Page 10: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/10.jpg)
Another Example Data Flow
![Page 11: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/11.jpg)
Higher-Order Transformation
v = Φ(g)(h)(s, t, u, ...)
Function Φ transforms relations s, t, u, ... into relation v.
Functions g and h are the higher-order parameters.
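One way to read the curried form Φ(g)(h)(s, t, u, ...) is as nested closures. The sketch below assumes that g yields pairs of an output key and the "current" input items, and that h turns those items into an output value. That protocol, along with the names `each_item` and `wrap`, is an assumption made for illustration and not the JSONMatch implementation.

```python
def phi(g):
    """Curried transformation: phi(g)(h)(s, t, ...) returns a new relation."""
    def with_generator(h):
        def transform(*relations):
            v = {}
            # g yields (output_key, current_items); h builds the output value.
            for key, items in g(*relations):
                v[key] = h(items)
            return v
        return transform
    return with_generator

def each_item(*relations):
    # Hypothetical generator: visit each item of the first relation on its own.
    for key, value in relations[0].items():
        yield key, (value,)

def wrap(items):
    # Hypothetical template: wrap the current item in a new document.
    return {"original": items[0]}

s = {"k1": [1, 2], "k2": [3]}
v = phi(each_item)(wrap)(s)
```

Because g and h are supplied separately, the same Φ can be reused with a different generator or a different template without changing the other.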
![Page 12: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/12.jpg)
Generator Function (g)
• Choose one of three:
o Map
o Product
o Lambda
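The slides that define the three generators are figures and did not survive in this transcript, so the following is only a guess at their behaviour: "map" visits each item of a single relation, "product" visits every combination of items across relations, and "lambda" defers the choice to a caller-supplied function. All function names and the key-joining scheme are hypothetical.

```python
from itertools import product

def g_map(*relations):
    # Assumed reading of "map": one output per item of the first relation.
    for key, value in relations[0].items():
        yield key, (value,)

def g_product(*relations):
    # Assumed reading of "product": one output per combination of items,
    # taking one item from each input relation.
    for combo in product(*(r.items() for r in relations)):
        keys = tuple(k for k, _ in combo)
        values = tuple(v for _, v in combo)
        yield "|".join(keys), values

def g_lambda(select):
    # Assumed reading of "lambda": the caller supplies the selection logic.
    def generator(*relations):
        yield from select(*relations)
    return generator

s = {"s1": "a", "s2": "b"}
t = {"t1": "x"}

map_out = list(g_map(s))
product_out = list(g_product(s, t))

# A custom generator that pairs items sharing the same key.
same_key = g_lambda(lambda a, b: ((k, (a[k], b[k])) for k in a if k in b))
lambda_out = list(same_key({"k": 1, "j": 2}, {"k": 10}))
```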
![Page 13: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/13.jpg)
Generator Function (g=map)
![Page 14: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/14.jpg)
Generator Function (g=product)
![Page 15: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/15.jpg)
Generator Function (g=lambda)
![Page 16: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/16.jpg)
Template Function (h)
• Template data item with embedded functions that are expanded by Φ to produce an output item.
• The embedded functions have access to the "current" items from the input relations, i.e. the items selected by g.
• The embedded functions use JSONPath expressions (i.e. simplified XPath for JSON) to access sub-parts of the input items, for example:
o $.person.title
o $.person.paper[*].author[0].name
o $[0][3][1].foo
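The three JSONPath expressions above use only the root `$`, child names, numeric indices and the `[*]` wildcard. The sketch below evaluates just that subset; it is a hand-rolled illustration rather than a full JSONPath implementation, and the function name `json_path` and the sample documents are hypothetical.

```python
import re

# One path step: .name, [n] or [*]
TOKEN = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]|\[(\*)\]")

def json_path(path, doc):
    """Evaluate a tiny JSONPath subset and return the list of matches."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    nodes = [doc]
    pos = 1
    while pos < len(path):
        m = TOKEN.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        name, index, star = m.groups()
        if name is not None:
            # Child access on objects; non-matching nodes are dropped.
            nodes = [n[name] for n in nodes if isinstance(n, dict) and name in n]
        elif index is not None:
            i = int(index)
            nodes = [n[i] for n in nodes if isinstance(n, list) and i < len(n)]
        else:
            # Wildcard: fan out over every element of each array.
            nodes = [c for n in nodes if isinstance(n, list) for c in n]
        pos = m.end()
    return nodes

doc = {"person": {"title": "Dr", "paper": [
    {"author": [{"name": "A"}, {"name": "B"}]},
    {"author": [{"name": "C"}]},
]}}

title = json_path("$.person.title", doc)
first_authors = json_path("$.person.paper[*].author[0].name", doc)
nested = json_path("$[0][3][1].foo", [[0, 1, 2, [0, {"foo": "bar"}]]])
```

Returning a list of matches keeps the wildcard case and the single-match case uniform.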
![Page 17: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/17.jpg)
Example JSONMatch template data item (h)
• One input relation S. Each item si is an array like this.
[ "Ad Feelders", "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.", "Rankings_and_Partial_Orders", "Active_Learning; Bioinformatics; ..." ]
• g=map and h is:
{ "name": "$.items[0][0]", "url": "$.items[0][1]", "text": ["jm:http_get", "$.items[0][1]"], "primary": "$.items[0][2]", "keywords": ["jm:split", ";", "$.items[0][3]"] }
• Output relation V has items vi like this.
{ "name": "Ad Feelders", "url": "http://dblp.uni-trier.de/...", "text": "<html><title>A. J. Feeld...</html>", "primary": "Rankings_and_Partial_Orders", "keywords": [ "Active_Learning", "Bioinformatics", ... ] }
![Page 18: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/18.jpg)
4. Example
![Page 19: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/19.jpg)
SubSift
SubSift is a prototype application to support academic peer review.
SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.
Website: http://subsift.ilrt.bris.ac.uk
![Page 20: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/20.jpg)
Recreating SubSift in JSONMatch
• All the nice features of SubSift are preserved.
• JSONMatch implementation adds other advantages:
• Functionality defined by application as data flow at runtime.
• REST API much smaller and simpler because functionality defined in item template h.
• Does not require a separate web harvester robot.
• External web services can be embedded in data flow.
• Handles much larger numbers of reviewers and papers.
![Page 21: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/21.jpg)
5. Summary
![Page 22: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/22.jpg)
Higher-Order Data Flow Model
Concise formalism for Big Variety data flows specified dynamically from interactive web applications.
JSONMatch proof-of-concept implementation:
• For analyzing and integrating data from heterogeneous sources
• http://jsonmatch.com
Nice properties for analyzing data serially over extended periods of time without Big Data infrastructure.
![Page 23: A Higher-Order Data Flow Model for Heterogeneous Big Data](https://reader033.vdocuments.us/reader033/viewer/2022051706/58ecd8b81a28ab38208b46d1/html5/thumbnails/23.jpg)
Get in touch: http://simonprice.info