Challenges in the Design of a Graph Database Benchmark

Post on 14-Dec-2014


© Prof. Dr.-Ing. Wolfgang Lehner |

Challenges in the Design of a Graph Database Benchmark

FOSDEM'12 – Graph Processing DevRoom

Marcus Paradies


> Outline

Motivation

Challenges

Thoughts on Graph Data Generation

Thoughts on Query Workload

Summary and Outlook

Discussion

FOSDEM 2012

> Motivation

Graph databases are gaining momentum

Enterprise corporations are getting interested

How to compare the available graph database vendors?

Main issue: Results from benchmarks are not comparable

Lack of standardization in the data model and query language

What are “typical” graph operations?

> Challenges

> Challenge #1: Application Domain

Graph data is not homogeneous

Graph data from different domains follows different patterns

Examples:

Social Network Analysis (SNA)

Protein Interaction Analysis

Recommendation Systems

Supply Chain Management (Vehicle Routing, CRM)

Fraud Detection in Financial Systems

Challenge: Find an application domain which represents a graph data pattern common in many different scenarios.

> Challenge #2: Graph Data Model

What flavours of graph data models are commonly used?


Directed Graph

Undirected Graph

Mixed Graph

Multi Graph

(Plain) Property Graph

(Structured Property Graph)

Hyper Graph

Challenge: Find a graph data model suited for the majority of use cases from various domains.
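To make the property graph entry in the list above concrete, here is a minimal sketch, assuming a directed multigraph whose nodes and edges carry key-value properties. The names `PropertyGraph`, `Edge`, `add_node`, `add_edge` and `out_edges` are illustrative and not taken from any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Edge:
    # A directed, labeled edge with arbitrary key-value properties.
    source: int
    target: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    # Hypothetical in-memory model: node id -> properties, plus an edge list.
    nodes: Dict[int, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: int, **properties: Any) -> None:
        self.nodes[node_id] = dict(properties)

    def add_edge(self, source: int, target: int, label: str, **properties: Any) -> Edge:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, label, dict(properties))
        # Parallel edges between the same pair of nodes are allowed (multigraph).
        self.edges.append(edge)
        return edge

    def out_edges(self, node_id: int, label: Optional[str] = None) -> List[Edge]:
        # Linear scan over all edges; fine for a sketch, too slow for real data.
        return [
            e for e in self.edges
            if e.source == node_id and (label is None or e.label == label)
        ]
```

The simpler models in the list can be seen as restrictions of this one: dropping the properties gives a plain directed multigraph, and storing each edge in both directions approximates an undirected graph.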

> Challenge #3: Querying Graph Data

Large variety in graph processing and manipulation languages

Each graph database vendor implements its own query languages/APIs

Reason: No standardized graph query language available


Challenge: Find a way to abstract from the zoo of available query languages.

> Challenge #4: Defining the Workload

The workload to be defined depends on the underlying query/manipulation language

Should complex (algorithmic) operations be part of a database benchmark?

Which algorithms to pick?

Social Network Analysis → Find communities

Supply Chain Management → Find maximal flow

Web of Data → Find pattern matches

How are concurrent users represented?

What about transactionality?

> Thoughts on Graph Data Generation

> Graph Data Generation - Patterns


Understanding graph patterns (characteristics) is crucial for a good graph data generator

What are distinguishing characteristics of graphs?

How can we identify graph patterns on large graphs?

Three main patterns [1]:

Power law distributed

Small diameters

Community effects

> Pattern 1 – Power law distributed

In most real-world graph data sets, the node degrees follow a power law distribution

Examples:

Internet router graph

Subsets of the WWW

Citation graphs

source: [2]
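A data generator has to reproduce this skewed degree distribution. The slides do not name a generator, so the following is only a sketch of one well-known option, preferential attachment, in which each new node links to existing nodes with probability proportional to their current degree. The function names and parameters (`preferential_attachment`, `edges_per_node`, `seed`) are assumptions made for this example.

```python
import random
from collections import Counter


def preferential_attachment(num_nodes, edges_per_node, seed=0):
    """Return an undirected edge list grown by preferential attachment."""
    rng = random.Random(seed)
    m = edges_per_node
    # Seed graph: a small clique of m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Every node appears once per incident edge, so a uniform pick from
    # this list selects a node with probability proportional to its degree.
    endpoints = [node for edge in edges for node in edge]
    for new_node in range(m + 1, num_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges


def degree_histogram(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter(node for edge in edges for node in edge)
    return Counter(degree.values())
```

Plotting the histogram of a generated graph on log-log axes and comparing it with the plot of a real data set is a quick sanity check for such a generator.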

> Pattern 2 – Small Diameters

Effective Diameter (eccentricity): Minimum number of hops within which a given fraction (e.g. 90%) of all connected pairs of nodes can reach each other

Other measures exist as well, but they are not applicable to disconnected graphs

In most use cases, diameter is much smaller than the size of the graph

Examples:

97% eccentricity of around 16 for path lengths in the WWW

Average path length of around 6 for the Epinions social network

source: [1]
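The definition of the effective diameter above can be sketched as follows, assuming an unweighted graph given as a dictionary from each node to its neighbors. It runs one breadth-first search per node, so it is only suitable for small graphs; the function names are illustrative.

```python
import math
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def effective_diameter(adjacency, fraction=0.9):
    """Smallest hop count h such that at least `fraction` of all
    connected (ordered) node pairs are within h hops of each other."""
    lengths = []
    for source in adjacency:
        for target, d in hop_distances(adjacency, source).items():
            if target != source:
                lengths.append(d)
    if not lengths:
        return 0
    lengths.sort()
    index = math.ceil(fraction * len(lengths)) - 1
    return lengths[max(index, 0)]
```

Unreachable pairs never enter `lengths`, which is why this measure remains defined on disconnected graphs.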

> Pattern 3 – Community Effects

Community: A set of nodes, where each node in the set is closer to all other nodes in the community than to nodes outside the community.

Communities can be found in many real-world graphs, especially social networks and collaboration networks

Clustering Coefficient C: A measure which quantifies the “clumpiness” of a graph
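The slide does not say which variant of the clustering coefficient is meant. As a sketch, the code below uses one common definition: the local coefficient of a node is the fraction of pairs of its neighbors that are themselves connected, and C is the average over all nodes. It assumes an undirected graph stored as a dictionary from each node to the set of its neighbors.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adjacency[node] - {node}
    k = len(neighbors)
    if k < 2:
        # Fewer than two neighbors: no pair to test, coefficient taken as 0.
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Average of the local clustering coefficients over all nodes."""
    if not adjacency:
        return 0.0
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)
```

A triangle yields C = 1 and a star yields C = 0, which matches the intuition of a "clumpy" versus a hub-and-spoke graph.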

> Thoughts on Query Workload

> Query Workload - Operations


Graph Manipulation Operations

Add/Update/Remove Nodes from the Graph

Add/Update/Remove Edges from the Graph

Add/Update/Remove Edge attributes

Add/Update/Remove Node attributes

Graph Query Operations

Retrieve selection of nodes from given filter expression

Getting the neighbors of a set of nodes (possibly with edge filter constraints)

Graph Traversals

Based on basic query operations

Exploration of neighborhood from a given set of start nodes

Terminated by the number of steps and/or edge/node filter constraints

Graph Analytical Operations

Aggregation operations such as sum, avg, min, max

Aggregations on node-level and on edge-level
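The traversal operation described above (a set of start nodes, a step limit, an optional edge filter) can be written independently of any vendor API. The following is a minimal sketch, assuming edges are stored as `(source, target, properties)` tuples in a dictionary keyed by source node; the function name `traverse` and its parameters are hypothetical.

```python
from collections import deque


def traverse(out_edges, start_nodes, max_steps, edge_filter=lambda edge: True):
    """Breadth-first exploration from start_nodes.

    Returns a dict mapping each reached node to its hop distance.
    Exploration stops after max_steps hops and only follows edges
    accepted by edge_filter.
    """
    depth = {node: 0 for node in start_nodes}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_steps:
            continue  # step limit reached, do not expand further
        for edge in out_edges.get(node, ()):
            _source, target, _props = edge
            if target not in depth and edge_filter(edge):
                depth[target] = depth[node] + 1
                queue.append(target)
    return depth
```

A benchmark could specify a query such as "all nodes within two `knows` hops of a start node" in this abstract form and leave the translation into a concrete query language to each system under test.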

> Query Workload - Measures

Closely related to benchmark capabilities

Measures from relational benchmarks apply, such as

Average query response time

Transactions per second (throughput)

Additional measures for graph traversals

Traversals per second

What about distributed scenarios?

What about concurrent users?
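As a rough illustration of the first two measures, the sketch below times a sequence of operations issued by a single client and reports the average response time and the throughput. The helper name `measure` and the result keys are made up for this example; concurrent users and distributed setups, the open questions above, are not covered.

```python
import time


def measure(operation, inputs):
    """Run operation(item) for every item and collect simple timing measures."""
    latencies = []
    started = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        operation(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    count = len(latencies)
    return {
        "operations": count,
        # Mean wall-clock time per operation, in seconds.
        "avg_response_time_s": sum(latencies) / count if count else 0.0,
        # Operations completed per second of total wall-clock time.
        "throughput_per_s": count / elapsed if elapsed > 0 else float("inf"),
    }
```

If `operation` executes one traversal per call, `throughput_per_s` corresponds to the traversals-per-second measure.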


> Summary and Outlook

The graph data distribution is highly important for a graph database benchmark

Application domains do have very specific graph characteristics

A graph database benchmark has to provide abstract and high-level graph operation descriptions

Feel free to contact me if you want to contribute:

marcus.paradies@gmail.com

> Discussion

> Theses

A benchmark based on social network data is nice, but might not be that representative for large enterprise applications

Algorithms should NOT be part of a graph database benchmark

Only support basic operations such as simple lookups and path traversals

The underlying graph data model should be a simple property graph

A graph database has to scale in terms of data size as well as number of concurrent users

....


> References

[1] Graph Mining: Laws, Generators, and Algorithms (2006)

[2] http://konect.uni-koblenz.de/

[3] A Discussion on the Design of Graph Database Benchmarks (2010)
