Challenges in the Design of a Graph Database Benchmark

© Prof. Dr.-Ing. Wolfgang Lehner | Challenges in the Design of a Graph Database Benchmark FOSDEM‘12 – Graph Processing DevRoom Marcus Paradies

Upload: marcus-paradies

Post on 14-Dec-2014

> Outline

Motivation

Challenges

Thoughts on Graph Data Generation

Thoughts on Query Workload

Summary and Outlook

Discussion

FOSDEM 2012

> Motivation

Graph databases are gaining momentum

Enterprise corporations are getting interested

How to compare the available graph database vendors?

Main issue: Results from benchmarks are not comparable

Lack of standardization in the data model and query language

What are “typical“ graph operations?

> Challenges

> Challenge #1: Application Domain

Graph data is not homogeneous

Graph data from different domains follows different patterns

Examples:

Social Network Analysis (SNA)

Protein Interaction Analysis

Recommendation Systems

Supply Chain Management (Vehicle Routing, CRM)

Fraud Detection in Financial Systems

Challenge: Find an application domain which represents a graph data pattern common in many different scenarios.

> Challenge #2: Graph Data Model

What flavours of graph data models are commonly used?

Directed Graph

Undirected Graph

Mixed Graph

Multi Graph

(Plain) Property Graph

(Structured Property Graph)

Hyper Graph

Challenge: Find a graph data model suited for the majority of use cases from various domains.
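To make the "(plain) property graph" flavour concrete, here is a minimal Python sketch. The names (PropertyGraph, Edge, add_node, add_edge, out_neighbors) are illustrative assumptions, not taken from the talk or from any product. It assumes directed edges, free-form key/value attributes on nodes and edges, and allows parallel edges.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Edge:
    source: str
    target: str
    label: str
    props: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    # node id -> attribute map
    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    # a list (not a set), so parallel edges between the same nodes are allowed
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: str, **props: Any) -> None:
        self.nodes[node_id] = dict(props)

    def add_edge(self, source: str, target: str, label: str, **props: Any) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, label, dict(props)))

    def out_neighbors(self, node_id: str, label: Optional[str] = None) -> List[str]:
        # Linear scan over all edges: fine for a sketch, too slow for real data.
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (label is None or e.label == label)
        ]
```

Dropping the attribute maps gives a plain directed multigraph; storing each edge in both directions would approximate an undirected graph.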

> Challenge #3: Querying Graph Data

Large variety in graph processing and manipulation languages

Each graph database vendor implements its own query languages/APIs

Reason: No standardized graph query language available

Challenge: Find a way to abstract from the zoo of available query languages.
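One possible way to abstract from the query languages is a small driver interface that the workload is written against, with one adapter per system under test. The sketch below is an assumption for illustration only: GraphBenchmarkDriver, InMemoryDriver and run_workload are hypothetical names, and the in-memory adapter merely stands in for a real database binding.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class GraphBenchmarkDriver(ABC):
    """Hypothetical adapter interface; each system under test would implement
    these methods in its own query language or API."""

    @abstractmethod
    def add_node(self, node_id: str, props: Dict[str, Any]) -> None:
        ...

    @abstractmethod
    def add_edge(self, source: str, target: str, label: str) -> None:
        ...

    @abstractmethod
    def neighbors(self, node_id: str, label: str) -> List[str]:
        ...


class InMemoryDriver(GraphBenchmarkDriver):
    """Reference adapter backed by plain dictionaries."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Dict[str, Any]] = {}
        self._out: Dict[str, List[Tuple[str, str]]] = {}

    def add_node(self, node_id: str, props: Dict[str, Any]) -> None:
        self._nodes[node_id] = dict(props)
        self._out.setdefault(node_id, [])

    def add_edge(self, source: str, target: str, label: str) -> None:
        self._out[source].append((label, target))

    def neighbors(self, node_id: str, label: str) -> List[str]:
        return [target for lbl, target in self._out[node_id] if lbl == label]


def run_workload(driver: GraphBenchmarkDriver) -> List[str]:
    """The workload is written once, against the interface only."""
    for name in ("a", "b", "c"):
        driver.add_node(name, {})
    driver.add_edge("a", "b", "knows")
    driver.add_edge("a", "c", "knows")
    driver.add_edge("a", "c", "follows")
    return sorted(driver.neighbors("a", "knows"))
```

The trade-off of this design is that the benchmark then also measures the quality of each adapter, not only the database behind it.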

> Challenge #4: Defining the Workload

The workload to be defined depends on the underlying query/manipulation language

Should complex (algorithmic) operations be part of a database benchmark?

Which algorithms to pick?

Social Network Analysis → Find communities

Supply Chain Management → Find maximal flow

Web of Data → Find pattern matches

How are concurrent users represented?

What about transactionality?

> Thoughts on Graph Data Generation

> Graph Data Generation - Patterns

Understanding graph patterns (characteristics) is crucial for a good graph data generator

What are distinguishing characteristics of graphs?

How can we identify graph patterns on large graphs?

Three main patterns [1]:

Power law distributed

Small diameters

Community Effects

> Pattern 1 – Power law distributed

In most real-world graph data sets, node degrees follow a power law distribution

Examples:

Internet router graph

Subsets of the WWW

Citation Graphs

source: [2]
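A data generator has to reproduce this skew. The talk does not name a generator; as one well-known option, the sketch below uses preferential attachment (new nodes link to existing nodes with probability proportional to their degree), which yields a heavy-tailed degree distribution. The function names and parameters (preferential_attachment, degrees, n, m, seed) are assumptions for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple


def preferential_attachment(n: int, m: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Grow an undirected graph with n nodes: each new node links to m distinct
    existing nodes, chosen with probability proportional to their degree."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges: List[Tuple[int, int]] = []
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw.
    endpoints: List[int] = []
    # Seed graph: a clique on m + 1 nodes, so every node starts with degree m.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            endpoints += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):  # sorted only to keep the output reproducible
            edges.append((new, t))
            endpoints += [new, t]
    return edges


def degrees(edges: List[Tuple[int, int]]) -> Counter:
    """Degree of every node that appears in the edge list."""
    counts: Counter = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts
```

Comparing max(degrees(edges).values()) with the mean degree gives a quick check that a few hub nodes dominate, which a uniformly random graph of the same size would not show.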

> Pattern 2 – Small Diameters

Effective Diameter (eccentricity): the minimum number of hops within which a given fraction (e.g. 90%) of all connected pairs of nodes can reach each other

Other measures exist as well, but are not applicable to disconnected graphs

In most use cases, diameter is much smaller than the size of the graph

Examples:

97% eccentricity of around 16 for path lengths in the WWW

Average path length of around 6 for the Epinions social network

source: [1]
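The definition of the effective diameter can be sketched directly with one breadth-first search per node. This is an exact but quadratic computation, so it only suits small graphs; the function name and the adjacency-dict input format are assumptions made for this example. Unreachable pairs are ignored, which is why the measure also works on disconnected graphs.

```python
from collections import deque
from typing import Dict, Hashable, List


def effective_diameter(adj: Dict[Hashable, List[Hashable]], fraction: float = 0.9) -> int:
    """Smallest hop count h such that at least `fraction` of all connected
    (ordered) node pairs are within h hops of each other."""
    hop_counts: Dict[int, int] = {}  # distance -> number of ordered pairs
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # plain BFS from `start`
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != start:
                hop_counts[d] = hop_counts.get(d, 0) + 1
    total = sum(hop_counts.values())
    if total == 0:
        return 0  # no connected pairs at all
    reached = 0
    for h in sorted(hop_counts):
        reached += hop_counts[h]
        if reached >= fraction * total:
            return h
    return max(hop_counts)
```

With fraction=1.0 the result is the ordinary diameter of the largest distance that occurs between connected pairs.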

> Pattern 3 – Community Effects

Community: A set of nodes, where each node in the set is closer to all other nodes in the community than to nodes outside the community.

Communities can be found in many real-world graphs, especially social networks and collaboration networks

Clustering Coefficient C: A measure that quantifies the "clumpiness" of a graph
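Several definitions of the clustering coefficient exist and the slide does not say which one is meant. As an assumption, the sketch below uses the average local clustering coefficient on an undirected graph: for each node, the fraction of pairs of its neighbours that are themselves connected. Function names and the adjacency-set input format are illustrative.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def local_clustering(adj: Dict[Hashable, Set[Hashable]], node: Hashable) -> float:
    """Fraction of pairs of neighbours of `node` that are directly connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: no neighbour pairs means coefficient 0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj: Dict[Hashable, Set[Hashable]]) -> float:
    """Mean of the local coefficients over all nodes (0.0 for an empty graph)."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, n) for n in adj) / len(adj)
```

A triangle yields 1.0 and a star yields 0.0, so the value can serve as a target when checking whether generated data is as "clumpy" as the real data it imitates.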

> Thoughts on Query Workload

> Query Workload - Operations

Graph Manipulation Operations

Add/Update/Remove Nodes from the Graph

Add/Update/Remove Edges from the Graph

Add/Update/Remove Edge attributes

Add/Update/Remove Node attributes

Graph Query Operations

Retrieve a selection of nodes from a given filter expression

Get the neighbors of a set of nodes (possibly with edge filter constraints)

Graph Traversals

Based on basic query operations

Exploration of neighborhood from a given set of start nodes

Terminated by the number of steps and/or edge/node filter constraints

Graph Analytical Operations

Aggregation operations such as sum, avg, min, max

Aggregations on node-level and on edge-level
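The traversal and aggregation classes above can be sketched on a plain adjacency structure. Everything here is an illustrative assumption (the Adjacency layout, the function names traverse and aggregate_edge_attribute, and the attribute key "w" in the usage note); it shows a traversal bounded by a step count and an edge filter, followed by an edge-level aggregation.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# node -> list of (neighbour, edge attributes)
Adjacency = Dict[str, List[Tuple[str, Dict[str, float]]]]


def traverse(
    adj: Adjacency,
    start_nodes: Iterable[str],
    max_steps: int,
    edge_filter: Callable[[Dict[str, float]], bool] = lambda attrs: True,
) -> Set[str]:
    """Explore the neighbourhood of `start_nodes` for at most `max_steps` hops,
    following only edges accepted by `edge_filter`."""
    visited = set(start_nodes)
    frontier = set(visited)
    for _ in range(max_steps):
        next_frontier = set()
        for u in frontier:
            for v, attrs in adj.get(u, []):
                if v not in visited and edge_filter(attrs):
                    next_frontier.add(v)
        if not next_frontier:
            break  # nothing new reachable, stop early
        visited |= next_frontier
        frontier = next_frontier
    return visited


def aggregate_edge_attribute(adj: Adjacency, nodes: Iterable[str], key: str) -> Dict[str, float]:
    """sum/avg/min/max of edge attribute `key` over the out-edges of `nodes`."""
    values = [attrs[key] for u in nodes for _, attrs in adj.get(u, []) if key in attrs]
    if not values:
        return {}
    return {
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
```

For example, traverse(adj, ["a"], 2, lambda e: e["w"] < 3) explores two hops from "a" over edges whose hypothetical weight "w" is below 3.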

> Query Workload - Measures

Closely related to benchmark capabilities

Measures from relational benchmarks apply, such as:

Average query response time

Transactions per second (throughput)

Additional measures for graph traversals

Traversals per second

What about distributed scenarios?

What about concurrent users?
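A single-client sketch of how these measures could be collected is shown below. The slide does not define "traversals per second"; this example assumes it means traversed edges per second of wall-clock time, and the name measure and the keys of the returned dictionary are hypothetical. It does not address the open questions about distributed scenarios or concurrent users.

```python
import time
from typing import Callable, Dict, List


def measure(operation: Callable[[], int], repetitions: int) -> Dict[str, float]:
    """Run `operation` repeatedly and derive simple benchmark measures.

    `operation` must return the number of edges it traversed
    (0 for operations that are not traversals)."""
    if repetitions < 1:
        raise ValueError("repetitions must be >= 1")
    latencies: List[float] = []
    traversed = 0
    started = time.perf_counter()
    for _ in range(repetitions):
        t0 = time.perf_counter()
        traversed += operation()
        latencies.append(time.perf_counter() - t0)
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - started, 1e-9)
    return {
        "avg_response_time_s": sum(latencies) / len(latencies),
        "operations_per_s": repetitions / elapsed,
        "traversals_per_s": traversed / elapsed,
    }
```

Averages hide outliers, so a fuller harness would likely also report percentiles of the recorded latencies.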

> Summary and Outlook

Graph data distribution is highly important for a graph database benchmark

Application domains do have very specific graph characteristics

A graph database benchmark has to provide abstract and high-level graph operation descriptions

Feel free to contact me if you want to contribute:

[email protected]

> Discussion

> Theses

A benchmark based on social network data is nice, but might not be that representative of large enterprise applications

Algorithms should NOT be part of a graph database benchmark

Only support basic operations such as simple lookups and path traversals

The underlying graph data model should be a simple property graph

A graph database has to scale in terms of data size as well as number of concurrent users

....

> References

[1] Graph Mining: Laws, Generators, and Algorithms (2006)

[2] http://konect.uni-koblenz.de/

[3] A Discussion on the Design of Graph Database Benchmarks (2010)
