Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Artem Chebotko

Joint work with

John Abraham and Pearl Brazier, University of Texas – Pan American

Anthony Piazza, Piazza Consulting

Andrey Kashlev and Shiyong Lu, Wayne State University

7th IEEE International Workshop on Scientific Workflows, July 2, 2013


Provenance in eScience

Metadata that captures the history of an experiment

  Problem diagnosis
  Result interpretation
  Experiment reproducibility

Scientific Workflow Community Provenance Challenges

  2006: understanding and sharing information about provenance representations and capabilities
  2006: interoperability of different provenance systems
  2009: evaluating various aspects of OPM
  2010: showcasing OPM in the context of novel applications

Open Provenance Model (2007 - 2010)

PROV-DM: The PROV Data Model (W3C Recommendation 30 April 2013)


SWFMS (scientific workflow management systems) and Provenance

Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.

Support provenance collection

Use proprietary or third-party systems to manage provenance

Differ in provenance models, provenance vocabularies, inference support, and query languages.

May eventually converge to W3C PROV specifications


Sample OPM Provenance Graph

Nodes: artifacts, processes, agents

Edges: used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, wasDerivedFrom

[Figure: sample provenance graph with nodes labeled Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Create Database Schema, Schema, Load Data, Dataset, Instance]


Sample Graph Serialization: OPMV and Terse RDF Triple Language

utpb:schema rdf:type opmv:Artifact .
utpb:instance rdf:type opmv:Artifact .
utpb:dataset rdf:type opmv:Artifact .
utpb:loadData rdf:type opmv:Process .
utpb:loadData opmv:used utpb:schema, utpb:dataset .
utpb:instance opmv:wasGeneratedBy utpb:loadData .
utpb:instance opmv:wasDerivedFrom utpb:schema, utpb:dataset .

[Figure: the sample OPM provenance graph with nodes Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Create Database Schema, Schema, Load Data, Dataset, Instance]


Provenance Serialization and Querying

Both OPM and PROV-DM can be serialized in RDF

Queried in SPARQL

Find all artifacts and their values, if any, in a provenance graph with identifier http://cs.panam.edu/utpb#opmGraph


This Work - Motivation

Single provenance graph as an RDF graph

  In general, readily manageable in main memory of a single machine

Hundreds of thousands or even millions of provenance graphs as a provenance (RDF) dataset

  Challenging to manage

Our Focus/Problem: Efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs (in an Apache HBase database)


This Work - Contributions

Novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets

Novel and efficient querying algorithms to evaluate SPARQL queries in HBase that are optimized to make use of bitmap indices and numeric values instead of triples

Empirical evaluation of our approach using provenance graphs and test queries of the University of Texas Provenance Benchmark (UTPB)


Talk Outline

RDF Data and Queries

Indexing Scheme

Storage Scheme

Query Processing

Performance Study

Related Work

Summary and Future work


RDF Data and Queries


Indexing Scheme

Selection Indices: Is, Ip, Io

Find a triple with known s, p and o:
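As a rough illustration of how selection indices could answer this lookup, the sketch below models each of Is, Ip and Io as a map from a term to a bitmap over triple positions within one provenance graph, so a triple with known s, p and o is found by intersecting three bitmaps. Encoding bitmaps as Python integers and the helper names (build_selection_indices, positions, find_triple) are assumptions made for this example, not the implementation presented in the talk.

```python
from collections import defaultdict

def build_selection_indices(triples):
    """Build Is, Ip, Io as dicts: term -> bitmap (int) of triple positions."""
    i_s, i_p, i_o = defaultdict(int), defaultdict(int), defaultdict(int)
    for pos, (s, p, o) in enumerate(triples):
        bit = 1 << pos          # bit number `pos` marks the triple at that position
        i_s[s] |= bit
        i_p[p] |= bit
        i_o[o] |= bit
    return i_s, i_p, i_o

def positions(bitmap):
    """Decode a bitmap into a sorted list of triple positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

def find_triple(indices, s, p, o):
    """Positions of triples matching known s, p and o (bitwise AND of three bitmaps)."""
    i_s, i_p, i_o = indices
    return positions(i_s.get(s, 0) & i_p.get(p, 0) & i_o.get(o, 0))

# The sample graph from the serialization slide, one triple per position.
TRIPLES = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:dataset"),
]

INDICES = build_selection_indices(TRIPLES)
```

With this layout, a pattern that fixes only some of s, p and o can be answered the same way by intersecting only the bitmaps for the known terms.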


Indexing Scheme

Join Indices: Iss, Iso, Ios, Ioo

Find triples with the same object as subject in triple at position i:

Iso(i)
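One possible reading of the join indices, sketched below: for the triple at position i, each index stores a bitmap of the other positions j whose subject or object equals the subject or object of triple i. Under this reading Iso(i) marks triples whose object equals the subject of triple i, as stated above; the meaning given to Iss, Ios and Ioo, the exclusion of position i itself, and the function names are assumptions for illustration only.

```python
def build_join_indices(triples):
    """Return {"ss", "so", "os", "oo"} -> list of bitmaps, one per triple position i.

    The first letter names the component of triple i, the second the component
    of the matching triples j (assumed naming; j == i is skipped).
    """
    n = len(triples)
    idx = {name: [0] * n for name in ("ss", "so", "os", "oo")}
    for i, (s_i, _, o_i) in enumerate(triples):
        for j, (s_j, _, o_j) in enumerate(triples):
            if i == j:
                continue
            bit = 1 << j
            if s_j == s_i:
                idx["ss"][i] |= bit   # same subject as triple i
            if o_j == s_i:
                idx["so"][i] |= bit   # object equals subject of triple i
            if s_j == o_i:
                idx["os"][i] |= bit   # subject equals object of triple i
            if o_j == o_i:
                idx["oo"][i] |= bit   # same object as triple i
    return idx

def bitmap_positions(bitmap):
    """Decode a bitmap into a sorted list of triple positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

GRAPH = [
    ("utpb:loadData", "rdf:type", "opmv:Process"),               # position 0
    ("utpb:loadData", "opmv:used", "utpb:schema"),               # position 1
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),   # position 2
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),     # position 3
]

JOIN = build_join_indices(GRAPH)
```

The nested loop is quadratic in the number of triples, which is tolerable here only because a single provenance graph is small; precomputing the bitmaps lets a join between two triple patterns be resolved by bitmap lookups rather than by comparing terms at query time.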


Storage Scheme

One table with two column families for data and indices

Each row stores one complete provenance graph
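To make the row-per-graph layout concrete, the sketch below maps one provenance graph to a single row with two column families. Only the two-family, one-graph-per-row structure comes from the slide; the family names (data, index), the column qualifiers, the use of the graph identifier as row key, and the byte encodings are all hypothetical choices for this example, and no HBase client library is involved.

```python
def graph_to_row(graph_id, triples):
    """Map one provenance graph to (row_key, cells).

    cells maps (column_family, qualifier) -> value, all as bytes.
    Family "data" holds one triple per position; family "index" holds one
    selection bitmap per (component, term) pair. Names are hypothetical.
    """
    cells = {}
    bitmaps = {}
    for pos, (s, p, o) in enumerate(triples):
        cells[(b"data", str(pos).encode())] = " ".join((s, p, o)).encode()
        for kind, term in (("s", s), ("p", p), ("o", o)):
            key = (b"index", f"{kind}|{term}".encode())
            bitmaps[key] = bitmaps.get(key, 0) | (1 << pos)
    # Fixed-width little-endian bitmaps, one bit per triple position.
    n_bytes = max(1, (len(triples) + 7) // 8)
    for key, bitmap in bitmaps.items():
        cells[key] = bitmap.to_bytes(n_bytes, "little")
    return graph_id.encode(), cells

ROW_KEY, CELLS = graph_to_row(
    "http://cs.panam.edu/utpb#opmGraph",
    [
        ("utpb:schema", "rdf:type", "opmv:Artifact"),
        ("utpb:dataset", "rdf:type", "opmv:Artifact"),
        ("utpb:loadData", "rdf:type", "opmv:Process"),
    ],
)
```

Keeping a graph and its indices in the same row means a query over one graph needs to read a single row, which fits the later observation about minimal data transfers.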


Query Processing

Four efficient algorithms/functions:

  application of selection indices
  application of join indices
  handling of special cases not supported by the indices
  basic graph pattern evaluation


Performance Study

Implementation

  Java, Hadoop 1.0.0, HBase 0.94

Cluster setup

  One HBase Master
  Eight HBase Region Servers
  All commodity machines

Benchmark: UTPB (5 datasets, 11 queries)


Performance Study

Q1: simplest, yet most expensive query due to a large result set

Q1. Find all provenance graph identifiers.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT *
WHERE { ?graph rdf:type owl:Thing . }


Performance Study

Q2 - Q11: different complexity, yet similar performance

Example: Q8. Find all artifacts and their values, if any, in a particular provenance graph.

PREFIX opmv: <http://purl.org/net/opmv/ns#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX opmo: <http://openprovenance.org/model/opmo#>
PREFIX utpb: <http://cs.panam.edu/utpb#>
SELECT ?artifact ?value
FROM NAMED <http://cs.panam.edu/utpb#opmGraph>
WHERE {
  GRAPH utpb:opmGraph {
    ?artifact rdf:type opmv:Artifact .
    OPTIONAL {
      ?artifact opmo:annotation ?annotation .
      ?annotation opmo:property ?property .
      ?property opmo:value ?value .
    } .
    OPTIONAL {
      ?artifact opmo:avalue ?artifactValue .
      ?artifactValue opmo:content ?value .
    } .
  }
}


Performance Study

Please see other queries in the paper – very efficient and scalable (nearly constant scalability due to minimal data transfers and fast index-based join processing)


Related Work

HBase, BigTable, Cassandra

Hadoop, Hive, Pig, CouchDB, MongoDB, etc.

NoSQL solutions to RDF data management

Provenance management systems

RDF data indexing


Summary and Future Work

Designed novel storage and indexing schemes for RDF data in HBase that are suitable for provenance datasets

Empirical evaluation results are promising

Future work

Compare, compare, compare

More experiments with multi-user workloads

More optimizations

PROV-DM benchmark anyone?


THANK YOU! Questions? My contact information:

Artem Chebotko, Department of Computer Science, University of Texas – Pan American

http://www.cs.panam.edu/~artem
