A Higher-Order Data Flow Model for Heterogeneous Big Data Simon Price and Peter Flach

Post on 11-Apr-2017

Page 1: A Higher-Order Data Flow Model for Heterogeneous Big Data

A Higher-Order Data Flow Model for Heterogeneous Big Data

Simon Price and Peter Flach

Page 2: A Higher-Order Data Flow Model for Heterogeneous Big Data

2

Outline of this presentation

1. Introduction

2. JSONMatch

3. Data Flow Model

4. Example

5. Summary

Page 3: A Higher-Order Data Flow Model for Heterogeneous Big Data

3

2. JSONMatch

Page 4: A Higher-Order Data Flow Model for Heterogeneous Big Data

4

JSONMatch

JSON is the de facto data format for Web 2.0 and mobile apps. JSON is the 'record' in many NoSQL databases.

JSONMatch compares the similarity of JSON documents.

Use case: interactive web applications for profiling and matching Big (Variety) Data.

http://jsonmatch.com

Page 5: A Higher-Order Data Flow Model for Heterogeneous Big Data

5

JSONMatch

• A web service for analyzing and integrating data from heterogeneous sources in these formats:
  o JSON (default)
  o CSV
  o HTML
  o RDF
  o XML
  o YAML
  o Plain text
  o Prolog terms
  o Weka ARFF machine learning datasets

Page 6: A Higher-Order Data Flow Model for Heterogeneous Big Data

6

JSONMatch

• Stores and retrieves structured data (e.g. JSON documents) like a NoSQL database.

• Processes data using data flows defined dynamically in JSON using the REST API.

• Aims to produce results:
  o quickly for small datasets
  o eventually for larger datasets.

Page 7: A Higher-Order Data Flow Model for Heterogeneous Big Data

7

3. Data Flow Model

Page 8: A Higher-Order Data Flow Model for Heterogeneous Big Data

8

Data Model

• Each dataset is a relation. E.g. S

• Each relation is a set of key-value pairs. E.g. S1, S2, ..., Sn

• Values can be 'unstructured', semi-structured or structured data.

• In JSONMatch: value = JSON document
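
As a minimal sketch of this data model, a relation can be pictured as a mapping from keys to JSON documents. The variable name `S`, the keys, and the use of a Python `dict` are illustrative assumptions, not JSONMatch's actual storage format.

```python
import json

# Assumption: a relation is modelled as a dict of key -> value,
# where each value is any JSON document (object, array, or scalar).
S = {
    "k1": {"person": {"title": "Dr"}},          # structured
    "k2": ["a", "semi-structured", "array"],    # semi-structured
    "k3": "free text is a valid JSON value",    # 'unstructured'
}

# Every value must survive a JSON round trip to count as a JSON document.
for key, value in S.items():
    assert json.loads(json.dumps(value)) == value
```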

Page 9: A Higher-Order Data Flow Model for Heterogeneous Big Data

9

Example Data Flow

w = Φ3(Φ1(s), Φ2(t))

Page 10: A Higher-Order Data Flow Model for Heterogeneous Big Data

10

Another Example Data Flow

Page 11: A Higher-Order Data Flow Model for Heterogeneous Big Data

11

Higher-Order Transformation

v = Φ(g)(h)(s, t, u, ...)

Function Φ transforms relations s, t, u, ... into relation v.

Functions g and h are the higher-order parameters.
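
The curried form v = Φ(g)(h)(s, t, u, ...) can be sketched in Python as nested closures. This is only an illustration under stated assumptions: relations are dicts, g yields tuples of "current" (key, value) items, and h returns one output (key, value) pair. The names `phi`, `one_at_a_time`, and `upper` are hypothetical and do not come from JSONMatch.

```python
def phi(g):
    """Return a function awaiting h, which returns a function awaiting relations."""
    def with_template(h):
        def apply(*relations):
            out = {}
            # g selects the current items; h turns them into one output item.
            for current in g(*relations):
                key, value = h(current)
                out[key] = value
            return out
        return apply
    return with_template


def one_at_a_time(s):
    # Hypothetical generator: visit each item of a single relation in turn.
    for item in s.items():
        yield (item,)


def upper(current):
    # Hypothetical template function: upper-case the value, keep the key.
    key, value = current[0]
    return key, value.upper()


s = {"a": "x", "b": "y"}
v = phi(one_at_a_time)(upper)(s)
print(v)
```

Because each stage returns a function, Φ(g)(h) can be built once and reused over different input relations, and the output of one flow can be fed to another, as in w = Φ3(Φ1(s), Φ2(t)).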

Page 12: A Higher-Order Data Flow Model for Heterogeneous Big Data

12

Generator Function (g)

• Choose one of three:
  o Map
  o Product
  o Lambda
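
The slides illustrating the three generators survive only as titles, so the semantics below are assumptions rather than JSONMatch's definitions: map is taken to walk the input relations item by item, product to enumerate every combination of items across relations, and lambda to delegate the selection to a caller-supplied function. The names `g_map`, `g_product`, and `g_lambda` are hypothetical.

```python
from itertools import product


def g_map(*relations):
    # Assumed: step through the relations in parallel, one item from each.
    yield from zip(*(r.items() for r in relations))


def g_product(*relations):
    # Assumed: Cartesian product of the items of all input relations.
    yield from product(*(r.items() for r in relations))


def g_lambda(select):
    # Assumed: the caller supplies the selection logic as a function.
    def g(*relations):
        yield from select(*relations)
    return g


s = {"s1": 1, "s2": 2}
t = {"t1": 10, "t2": 20}

mapped = list(g_map(s))
pairs = list(g_product(s, t))
big_only = list(g_lambda(lambda r: (((k, v),) for k, v in r.items() if v > 1))(s))
```

Each generator yields tuples of (key, value) items, one entry per input relation, which is the shape a template function would then consume.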

Page 13: A Higher-Order Data Flow Model for Heterogeneous Big Data

13

Generator Function (g=map)

Page 14: A Higher-Order Data Flow Model for Heterogeneous Big Data

14

Generator Function (g=product)

Page 15: A Higher-Order Data Flow Model for Heterogeneous Big Data

15

Generator Function (g=lambda)

Page 16: A Higher-Order Data Flow Model for Heterogeneous Big Data

16

Template Function (h)

• Template data item with embedded functions that are expanded by Φ to produce an output item.

• The embedded functions have access to the "current" items from the input relations, i.e. the items selected by g.

• The embedded functions use JSONPath expressions (i.e. simplified XPath for JSON) to access sub-parts of the input items:
  o $.person.title
  o $.person.paper[*].author[0].name
  o $[0][3][1].foo
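
To show how the three example expressions select sub-parts of a document, here is a minimal sketch of a JSONPath evaluator. It is an assumption-laden illustration that supports only `.name`, `[n]`, and `[*]` steps; it is not JSONMatch's evaluator, and the function name `jsonpath` and the sample documents are hypothetical.

```python
import re

# One path step: .name, [index], or [*]
_STEP = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]|\[(\*)\]")


def jsonpath(doc, path):
    """Return the list of values in doc matched by a simple JSONPath."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    nodes = [doc]
    pos = 1
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at {path[pos:]!r}")
        name, index, star = m.groups()
        if name is not None:      # .name : descend into objects
            nodes = [n[name] for n in nodes if isinstance(n, dict) and name in n]
        elif index is not None:   # [n] : pick one array element
            i = int(index)
            nodes = [n[i] for n in nodes if isinstance(n, list) and i < len(n)]
        else:                     # [*] : fan out over all array elements
            nodes = [c for n in nodes if isinstance(n, list) for c in n]
        pos = m.end()
    return nodes


doc = {"person": {"title": "Dr", "paper": [
    {"author": [{"name": "A"}, {"name": "B"}]},
    {"author": [{"name": "C"}]},
]}}
nested = [[0, 1, 2, [9, {"foo": "bar"}]]]

titles = jsonpath(doc, "$.person.title")
first_authors = jsonpath(doc, "$.person.paper[*].author[0].name")
foo = jsonpath(nested, "$[0][3][1].foo")
```

The evaluator returns a list because a wildcard step can match several values; a path without wildcards yields a list of at most one element.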

Page 17: A Higher-Order Data Flow Model for Heterogeneous Big Data

17

Example JSONMatch template data item (h)

• One input relation S. Each item s_i is an array like this:

[
  "Ad Feelders",
  "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
  "Rankings_and_Partial_Orders",
  "Active_Learning; Bioinformatics; ..."
]

• g=map and h is:

{
  "name": "$.items[0][0]",
  "url": "$.items[0][1]",
  "text": ["jm:http_get", "$.items[0][1]"],
  "primary": "$.items[0][2]",
  "keywords": ["jm:split", ";", "$.items[0][3]"]
}

• Output relation V has items v_i like this:

{
  "name": "Ad Feelders",
  "url": "http://dblp.uni-trier.de/...",
  "text": "<html><title>A. J. Feeld...</html>",
  "primary": "Rankings_and_Partial_Orders",
  "keywords": [ "Active_Learning", "Bioinformatics", ... ]
}
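
The expansion of template h over a current item can be sketched as a small recursive function. Everything here is a hedged illustration: `expand`, `resolve`, and the `FUNCTIONS` registry are hypothetical names, `jm:http_get` is stubbed so that no network request is made, the whitespace trimming in `jm:split` is an assumption inferred from the example output, and the keyword string omits the items elided on the slide.

```python
import re

# Matches only paths of the form $.items[i][j]... used in the example template.
_ITEMS_PATH = re.compile(r"^\$\.items((?:\[\d+\])+)$")


def resolve(value, items):
    """Replace a '$.items[...]' path by the value it points at; pass others through."""
    if isinstance(value, str):
        m = _ITEMS_PATH.match(value)
        if m:
            node = items
            for idx in re.findall(r"\[(\d+)\]", m.group(1)):
                node = node[int(idx)]
            return node
    return value


# Hypothetical stand-ins for the embedded functions named in the template.
FUNCTIONS = {
    "jm:split": lambda sep, text: [part.strip() for part in text.split(sep)],
    "jm:http_get": lambda url: f"<stub body for {url}>",  # stub: no HTTP call
}


def expand(template, items):
    """Recursively expand paths and embedded function calls in a template."""
    if isinstance(template, dict):
        return {k: expand(v, items) for k, v in template.items()}
    if isinstance(template, list):
        if template and isinstance(template[0], str) and template[0] in FUNCTIONS:
            args = [expand(a, items) for a in template[1:]]
            return FUNCTIONS[template[0]](*args)
        return [expand(v, items) for v in template]
    return resolve(template, items)


s_i = [
    "Ad Feelders",
    "http://dblp.uni-trier.de/pers/hd/f/Feelders:Ad.html.",
    "Rankings_and_Partial_Orders",
    "Active_Learning; Bioinformatics",
]
h = {
    "name": "$.items[0][0]",
    "url": "$.items[0][1]",
    "text": ["jm:http_get", "$.items[0][1]"],
    "primary": "$.items[0][2]",
    "keywords": ["jm:split", ";", "$.items[0][3]"],
}

v_i = expand(h, [s_i])  # items[0] is the current item of the first relation
```

Writing embedded functions as arrays whose first element is the function name keeps the template itself a valid JSON document, which is what allows it to be sent over a REST API as data.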

Page 18: A Higher-Order Data Flow Model for Heterogeneous Big Data

18

4. Example

Page 19: A Higher-Order Data Flow Model for Heterogeneous Big Data

19

SubSift

SubSift is a prototype application to support academic peer review.

SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.

Website: http://subsift.ilrt.bris.ac.uk

Page 20: A Higher-Order Data Flow Model for Heterogeneous Big Data

20

Recreating SubSift in JSONMatch

• All the nice features of SubSift are preserved.

• JSONMatch implementation adds other advantages:
  o Functionality defined by application as data flow at runtime.
  o REST API much smaller and simpler because functionality defined in item template h.
  o Does not require a separate web harvester robot.
  o External web services can be embedded in data flow.
  o Handles much larger numbers of reviewers and papers.

Page 21: A Higher-Order Data Flow Model for Heterogeneous Big Data

21

5. Summary

Page 22: A Higher-Order Data Flow Model for Heterogeneous Big Data

22

Higher-Order Data Flow Model

Concise formalism for Big Variety data flows specified dynamically from interactive web applications.

JSONMatch proof-of-concept implementation:
• For analyzing and integrating data from heterogeneous sources
• http://jsonmatch.com

Nice properties for analyzing data serially over extended periods of time without Big Data infrastructure.

Page 23: A Higher-Order Data Flow Model for Heterogeneous Big Data

Get in touch: http://simonprice.info