A Higher-Order Data Flow Model for Heterogeneous Big Data

Simon Price and Peter Flach

Outline of this presentation

1. Introduction

2. JSONMatch

3. Data Flow Model

4. Example

5. Summary

2. JSONMatch

JSONMatch

JSON is the de facto data format for Web 2.0 and mobile apps.

JSON is the 'record' in many NoSQL databases.

JSONMatch compares the similarity of JSON documents.

Use case: interactive web applications for profiling and matching Big (Variety) Data.

http://jsonmatch.com

JSONMatch

• A web service for analyzing and integrating data from heterogeneous sources in these formats:
  • JSON (default)
  • CSV
  • HTML
  • RDF
  • XML
  • YAML
  • Plain text
  • Prolog terms
  • Weka ARFF machine learning datasets

JSONMatch

• Stores and retrieves structured data (e.g. JSON documents) like a NoSQL database.

• Processes data using data flows defined dynamically in JSON using the REST API.

• Aims to produce results:
  o quickly for small datasets
  o eventually for larger datasets.

3. Data Flow Model

Data Model

• Each dataset is a relation. E.g. S

• Each relation is a set of key-value pairs. E.g.

S1, S2, ..., Sn

• Values can be 'unstructured', semi-structured or structured data.

• In JSONMatch: value = JSON document
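
As a rough illustration of this data model, a relation can be pictured as a mapping from keys to JSON documents. This is only a sketch, not JSONMatch's actual storage layer; the relation name, keys, field names and the `is_relation` helper are all invented for the example.

```python
import json

# A relation modelled as a dict: key -> value, where each value is a
# JSON-serialisable document. Keys and contents are hypothetical.
reviewers = {
    "r1": {"name": "Reviewer One", "keywords": ["Active_Learning"]},
    "r2": {"name": "Reviewer Two", "keywords": ["Bioinformatics"]},
}


def is_relation(candidate):
    """Return True if candidate maps string keys to JSON-serialisable values."""
    if not isinstance(candidate, dict):
        return False
    for key, value in candidate.items():
        if not isinstance(key, str):
            return False
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            return False
    return True
```

Values may be plain strings (unstructured), partially nested documents (semi-structured) or fully nested documents (structured); the check above accepts all three.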

Example Data Flow

w = Φ3(Φ1(s), Φ2(t))
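
The nesting in this expression can be sketched with ordinary function composition. The slides do not say what Φ1, Φ2 and Φ3 compute, so the three transformations below (`phi1`, `phi2`, `phi3`) are placeholders chosen only to make the flow runnable; relations are modelled as Python dicts.

```python
# Relations are dicts (key -> JSON-like value). All three transformations
# are hypothetical stand-ins for the unnamed functions in the slide.

def phi1(s):
    """Upper-case every value of s."""
    return {k: v.upper() for k, v in s.items()}


def phi2(t):
    """Reverse every value of t."""
    return {k: v[::-1] for k, v in t.items()}


def phi3(a, b):
    """Pair up the items of a and b that share a key."""
    return {k: [a[k], b[k]] for k in a if k in b}


s = {"1": "ab", "2": "cd"}
t = {"1": "xy", "3": "zz"}

# Same shape as the slide: the outputs of two flows feed a third.
w = phi3(phi1(s), phi2(t))
```

Because every transformation takes relations in and returns a relation, the output of one step can always be used as the input of another.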

Another Example Data Flow

Higher-Order Transformation

v = Φ(g)(h)(s, t, u, ...)

Function Φ transforms relations s, t, u, ... into relation v.

Functions g and h are the higher-order parameters.
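
The curried form Φ(g)(h)(s, t, u, ...) can be sketched as nested closures. This is a minimal illustration under stated assumptions, not the JSONMatch implementation: the generator `pair_by_key` and the template `total` are hypothetical, and relations are modelled as dicts.

```python
def phi(g):
    """Take generator g, then template h, then the input relations."""
    def with_template(h):
        def transform(*relations):
            # g picks the "current" items; h builds one output value from them.
            return {key: h(*items) for key, items in g(*relations)}
        return transform
    return with_template


def pair_by_key(*relations):
    """Hypothetical generator: yield (key, items) for keys found in every relation."""
    common = set(relations[0]).intersection(*relations[1:])
    for key in sorted(common):
        yield key, tuple(r[key] for r in relations)


def total(*items):
    """Hypothetical template: sum the current items."""
    return sum(items)


s = {"a": 1, "b": 2}
t = {"a": 10, "b": 20, "c": 30}

v = phi(pair_by_key)(total)(s, t)
```

Fixing g and h first yields a reusable transformation that can then be applied to any relations of the right shape.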

Generator Function (g)

• Choose one of three:
  o Map
  o Product
  o Lambda

Generator Function (g=map)

Generator Function (g=product)

Generator Function (g=lambda)
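
The three generator slides carry no text in this transcript, so the following is only one plausible reading of each choice: map visits the items of a single relation one at a time, product pairs every item of one relation with every item of the others, and lambda lets the caller supply the selection. All function names are hypothetical.

```python
from itertools import product


def g_map(*relations):
    """Assumed reading of 'map': one item at a time from a single relation."""
    (s,) = relations
    for key in sorted(s):
        yield (s[key],)


def g_product(*relations):
    """Assumed reading of 'product': every combination of one item per relation."""
    ordered = [[r[k] for k in sorted(r)] for r in relations]
    yield from product(*ordered)


def g_lambda(select):
    """Assumed reading of 'lambda': a caller-supplied function chooses the items."""
    def generator(*relations):
        yield from select(*relations)
    return generator


s = {"s1": "paper A", "s2": "paper B"}
t = {"t1": "reviewer X", "t2": "reviewer Y"}

mapped = list(g_map(s))
pairs = list(g_product(s, t))
same_rank = list(g_lambda(lambda a, b: zip(a.values(), b.values()))(s, t))
```

Under this reading, product is the natural choice for all-pairs matching, such as comparing every paper with every reviewer.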

Template Function (h)

• Template data item with embedded functions that are expanded by Φ to produce an output item.

• The embedded functions have access to the "current" items from the input relations, i.e. the items selected by g.

• The embedded functions use JSONPath expressions (i.e. a simplified XPath for JSON) to access sub-parts of the input items:
  • $.person.title
  • $.person.paper[*].author[0].name
  • $[0][3][1].foo
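
To make the three path expressions concrete, here is a small evaluator for a subset of JSONPath: dotted names, numeric indexes and the `[*]` wildcard over arrays. It is a sketch for illustration only; it does not cover full JSONPath and is not the evaluator JSONMatch uses. The function name `json_path` and the sample document are assumptions.

```python
import re

# One step of a path: .name, [n] or [*]
_TOKEN = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]|\[(\*)\]")


def json_path(path, document):
    """Evaluate .name, [n] and [*] steps; return the list of matching values."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    matches = [document]
    pos = 1
    while pos < len(path):
        token = _TOKEN.match(path, pos)
        if token is None:
            raise ValueError(f"unsupported syntax at position {pos}: {path!r}")
        name, index, wildcard = token.groups()
        next_matches = []
        for node in matches:
            if name is not None and isinstance(node, dict) and name in node:
                next_matches.append(node[name])
            elif index is not None and isinstance(node, list) and int(index) < len(node):
                next_matches.append(node[int(index)])
            elif wildcard is not None and isinstance(node, list):
                next_matches.extend(node)
        matches = next_matches
        pos = token.end()
    return matches


# Hypothetical document shaped to fit the first two example paths.
doc = {
    "person": {
        "title": "Dr",
        "paper": [
            {"author": [{"name": "A"}, {"name": "B"}]},
            {"author": [{"name": "C"}]},
        ],
    }
}

titles = json_path("$.person.title", doc)
first_authors = json_path("$.person.paper[*].author[0].name", doc)
nested = json_path("$[0][3][1].foo", [[0, 1, 2, [9, {"foo": "bar"}]]])
```

A path always returns a list, because a wildcard step can match several sub-parts of the same item.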

Example JSONMatch template data item (h)

• One input relation S. Each item si is an array like this.

    [ "Ad Feelders",
      "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
      "Rankings_and_Partial_Orders",
      "Active_Learning; Bioinformatics; ..." ]

• g=map and h is:

    { "name": "$.items[0][0]",
      "url": "$.items[0][1]",
      "text": ["jm:http_get", "$.items[0][1]"],
      "primary": "$.items[0][2]",
      "keywords": ["jm:split", ";", "$.items[0][3]"] }

• Output relation V has items vi like this.

    { "name": "Ad Feelders",
      "url": "http://dblp.uni-trier.de/...",
      "text": "<html><title>A. J. Feeld...</html>",
      "primary": "Rankings_and_Partial_Orders",
      "keywords": [ "Active_Learning", "Bioinformatics", ... ] }
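
The expansion of a template item can be sketched as a recursive walk over h. The code below is an assumption-laden illustration rather than JSONMatch's implementation: `expand` and `jm_split` are hypothetical names, only paths of the form `$.items[i][j]` are resolved, whitespace stripping in the split is a guess based on the output shown, and `jm:http_get` is left out so the sketch runs without network access. The keyword string is shortened to the two keywords visible in the slide.

```python
import re

# Only the path shape used in the slide is supported: $.items[i][j]
_ITEM_PATH = re.compile(r"^\$\.items\[(\d+)\]\[(\d+)\]$")


def jm_split(separator, text):
    """Stand-in for jm:split; stripping whitespace is an assumption."""
    return [part.strip() for part in text.split(separator)]


# jm:http_get is deliberately omitted from this sketch.
FUNCTIONS = {"jm:split": jm_split}


def expand(template, items):
    """Expand template h against the current items chosen by g."""
    if isinstance(template, str):
        found = _ITEM_PATH.match(template)
        if found:
            relation, field = (int(n) for n in found.groups())
            return items[relation][field]
        return template
    if isinstance(template, list):
        head = template[0] if template else None
        if isinstance(head, str) and head in FUNCTIONS:
            # An embedded function call: expand its arguments, then apply it.
            args = [expand(arg, items) for arg in template[1:]]
            return FUNCTIONS[head](*args)
        return [expand(part, items) for part in template]
    if isinstance(template, dict):
        return {key: expand(value, items) for key, value in template.items()}
    return template


s_i = [
    "Ad Feelders",
    "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
    "Rankings_and_Partial_Orders",
    "Active_Learning; Bioinformatics",
]

h = {
    "name": "$.items[0][0]",
    "url": "$.items[0][1]",
    "primary": "$.items[0][2]",
    "keywords": ["jm:split", ";", "$.items[0][3]"],
}

v_i = expand(h, [s_i])
```

Here `items[0]` is the current item from the first (and only) input relation, which matches the `$.items[0][...]` paths in the template.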

4. Example

SubSift

SubSift is a prototype application to support academic peer review.

SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.

Website: http://subsift.ilrt.bris.ac.uk

Recreating SubSift in JSONMatch

• All the nice features of SubSift are preserved.

• JSONMatch implementation adds other advantages:
  • Functionality defined by application as data flow at runtime.
  • REST API much smaller and simpler because functionality defined in item template h.
  • Does not require a separate web harvester robot.
  • External web services can be embedded in data flow.
  • Handles much larger numbers of reviewers and papers.

5. Summary

Higher-Order Data Flow Model

Concise formalism for Big Variety data flows specified dynamically from interactive web applications.

JSONMatch proof-of-concept implementation:
• For analyzing and integrating data from heterogeneous sources
• http://jsonmatch.com

Nice properties for analyzing data serially over extended periods of time without Big Data infrastructure.


Get in touch: http://simonprice.info
