introduction to graph database - cse.hcmut.edu.vn

Introduction to Graph Database
Course: Data Engineering
Teacher: Assoc. Prof. Dr. Dang Tran Khanh

Group 13

● 2070682 - Phạm Nguyễn Nhật Minh
● 2170088 - Đặng Ngô Nhật Trường
● 2170409 - Lê Dương Khoa
● 2070401 - Nguyễn Việt Anh
● 1870387 - Lê Văn Duẫn


Agenda

1. Graph database introduction

2. Graph database study case

3. Research results

4. Conclusion & Summary


Graph database introduction


Why are graphs important?
● Modeling chemical and biological data
● Social networks
● The web
● Hierarchical data


What is a graph database?

● A database built on top of a graph data structure.
● A collection of vertices (nodes) and edges.
● Properties:
❏ Each node/edge is uniquely identified
❏ Each node has a set of incoming and outgoing edges
❏ Each node and edge has a collection of properties
❏ Each edge has a label that defines the relationship between its two nodes
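
The four properties above can be sketched as a small in-memory data structure. This is a minimal illustration in Python, not how any particular graph database stores its data; the class, method and field names (`Node`, `Edge`, `PropertyGraph`, `add_node`, `add_edge`) and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: str                                            # unique identifier
    properties: Dict[str, Any] = field(default_factory=dict)
    outgoing: List[str] = field(default_factory=list)       # ids of edges leaving this node
    incoming: List[str] = field(default_factory=list)       # ids of edges entering this node


@dataclass
class Edge:
    edge_id: str                                            # unique identifier
    label: str                                              # names the relationship
    source: str                                             # id of the start node
    target: str                                             # id of the end node
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical container that keeps nodes and edges addressable by id."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: Dict[str, Edge] = {}

    def add_node(self, node_id: str, **properties: Any) -> Node:
        if node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        return node

    def add_edge(self, edge_id: str, label: str, source: str, target: str,
                 **properties: Any) -> Edge:
        if edge_id in self.edges:
            raise ValueError(f"duplicate edge id: {edge_id}")
        edge = Edge(edge_id, label, source, target, dict(properties))
        # Register the edge on both endpoints so each node knows its
        # incoming and outgoing edges.
        self.nodes[source].outgoing.append(edge_id)
        self.nodes[target].incoming.append(edge_id)
        self.edges[edge_id] = edge
        return edge


graph = PropertyGraph()
graph.add_node("alice", name="Alice")
graph.add_node("bob", name="Bob")
graph.add_edge("e1", "FRIEND_OF", "alice", "bob", since=2020)
```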

Graph queries

● List nodes/edges that have this property
● List matching subgraphs
● Can these two nodes reach each other?
● How many hops does it take for two nodes to connect?
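
Some of these query types can be illustrated on a toy directed graph. The sketch below uses plain Python dictionaries and a breadth-first search; the data and the function names (`nodes_with_property`, `hops_between`) are made up for illustration, and a real graph database would expose such queries through its own query language. Subgraph matching is left out to keep the sketch short.

```python
from collections import deque
from typing import Any, List, Optional

# Hypothetical sample data: node properties and directed adjacency lists.
node_props = {
    "alice": {"city": "Hanoi"},
    "bob": {"city": "Hue"},
    "carol": {"city": "Hanoi"},
    "dave": {"city": "Hue"},
}
adjacency = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": [],
    "dave": [],
}


def nodes_with_property(key: str, value: Any) -> List[str]:
    """List the nodes whose property `key` equals `value`."""
    return [n for n, props in node_props.items() if props.get(key) == value]


def hops_between(start: str, goal: str) -> Optional[int]:
    """Return the minimum number of edges from start to goal,
    or None if goal cannot be reached from start."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency[node]:
            if nxt == goal:
                return dist + 1
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(nodes_with_property("city", "Hanoi"))  # ['alice', 'carol']
print(hops_between("alice", "carol"))        # 2
print(hops_between("alice", "dave"))         # None (not reachable)
```

Reachability and hop count come from the same search: a `None` result answers "can these two nodes reach each other?" with no, and any integer answers it with yes.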


Graph database versus Relational database


● Pros:
○ Schema flexibility
○ More intuitive querying
○ Avoids “join bombs”
○ The cost of a local hop is not a function of the total number of nodes

● Cons:
○ Not always advantageous
○ Query languages are not unified
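
The point about local hops listed under Pros can be made concrete with a small sketch. It contrasts finding a node's neighbours by scanning a whole edge table (roughly what a join without an index has to do) with following adjacency stored on the node itself. The ring-shaped data, the function names and the step counters are illustrative assumptions; real relational engines use indexes, and real graph engines differ in their storage details.

```python
from collections import defaultdict


def build(num_nodes):
    """Build a ring graph where node i points to node (i + 1) % num_nodes.
    Returns the same edges as a flat edge table and as adjacency lists."""
    edge_table = [(i, (i + 1) % num_nodes) for i in range(num_nodes)]
    adjacency = defaultdict(list)
    for source, target in edge_table:
        adjacency[source].append(target)
    return edge_table, adjacency


def neighbours_by_scan(edge_table, node):
    """Scan every row of the edge table; work grows with the total edge count."""
    steps = 0
    found = []
    for source, target in edge_table:
        steps += 1
        if source == node:
            found.append(target)
    return found, steps


def neighbours_by_adjacency(adjacency, node):
    """Follow the node's own adjacency list; work grows only with its degree."""
    found = list(adjacency[node])
    return found, len(found)


for n in (10, 1000):
    edge_table, adjacency = build(n)
    _, scan_steps = neighbours_by_scan(edge_table, 0)
    _, hop_steps = neighbours_by_adjacency(adjacency, 0)
    print(n, scan_steps, hop_steps)
```

In this toy setup the scan touches every edge, so its step count grows with the graph, while the adjacency lookup touches only the one neighbour of node 0 regardless of graph size.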


When do we need a graph database?

● To solve many-to-many relationship problems (e.g. friends on Facebook)

● When the relationships between data are important


● Who are the friends of Alice's friends?

● Who are Alice and Bob’s mutual friends?

● Assuming Alice doesn’t know Bob, whom should they go through to get to know each other as quickly as possible?

Graph databases are able to solve these problems directly and quickly.
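
The three questions map onto simple graph operations: a two-step neighbourhood, a set intersection, and a shortest path. A minimal sketch on an undirected toy friendship graph follows; every name other than Alice and Bob, the friendships themselves, and the helper functions are invented for illustration.

```python
from collections import deque

# Hypothetical undirected friendship graph (each friendship is listed on both sides).
friends = {
    "Alice": {"Carol", "Dave"},
    "Bob": {"Dave", "Erin"},
    "Carol": {"Alice", "Erin"},
    "Dave": {"Alice", "Bob"},
    "Erin": {"Bob", "Carol"},
}


def friends_of_friends(person):
    """People two hops away, excluding the person and their direct friends."""
    result = set()
    for friend in friends[person]:
        result |= friends[friend]
    return result - friends[person] - {person}


def mutual_friends(a, b):
    """People who are direct friends of both a and b."""
    return friends[a] & friends[b]


def introduction_path(start, goal):
    """Shortest chain of acquaintances from start to goal (breadth-first search),
    or None if the two people are not connected."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            path = []
            while person is not None:
                path.append(person)
                person = parents[person]
            return path[::-1]
        for friend in sorted(friends[person]):  # sorted for a deterministic order
            if friend not in parents:
                parents[friend] = person
                queue.append(friend)
    return None


print(sorted(friends_of_friends("Alice")))  # ['Bob', 'Erin']
print(mutual_friends("Alice", "Bob"))       # {'Dave'}
print(introduction_path("Alice", "Bob"))    # ['Alice', 'Dave', 'Bob']
```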


Examples in Neo4j using the Cypher language


Graph database study case

Martin Macak, Matus Stovcik and Barbora Buhnova

“The Suitability of Graph Databases for Big Data Analysis: A Benchmark”, IoTBDS 2020


Problem statement

● In non-extreme cases, the boundaries of the situations in which each strategy performs better are not known

● Such comparisons become rarer when the tests run on a cluster with big data


Purpose

● Find out which strategy is better when running tests under different queries

● Discuss the threats to validity of this work


Setup

Database technologies for comparison
● PostgreSQL
● Neo4j

Dataset
● Microsoft Academic Graph [Ref]
● 1.7 billion rows, 14 tables, 13 relationships between tables
● A number of relationships, distributed across a series of distinct queries on the data set

[Ref] https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

Microsoft Academic Graph (https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)

Queries
● 10 queries, each needing a specific number of joins or traversals
● 3 types: joins across tables, WHERE conditions, string-containment functions
● Find papers/journals under specific conditions

Cluster
● Three-node cluster
● High availability
● 1 master node, 2 read-only nodes


Research results


Simple joins across multiple tables
- J1: Counts the number of papers presented at a conference.
- J2: Counts the number of papers presented at conference instances.
- J3: Counts the number of journals presented at conference instances.

Target: Determining the threshold of data complexity
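
To illustrate what a query like J1 looks like under each strategy, the sketch below counts papers that were presented at a conference, first as a SQL join and then as a traversal of pre-made relationships. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL and a plain dictionary as a stand-in for Neo4j. The table names, column names and rows are hypothetical; they are not the schema, data or queries used in the paper.

```python
import sqlite3

# Hypothetical toy data: one paper (id 13) has no conference.
conferences = [(1, "ConfA"), (2, "ConfB")]
papers = [(10, "P1", 1), (11, "P2", 1), (12, "P3", 2), (13, "P4", None)]

# Relational strategy: two tables, matched at query time by a join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conferences (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, conference_id INTEGER)"
)
conn.executemany("INSERT INTO conferences VALUES (?, ?)", conferences)
conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", papers)
join_count = conn.execute(
    "SELECT COUNT(*) FROM papers p JOIN conferences c ON p.conference_id = c.id"
).fetchone()[0]
conn.close()

# Graph strategy: the paper-to-conference relationships are materialised when
# the data is loaded, so counting only walks the stored relationships.
presented_at = {conf_id: [] for conf_id, _ in conferences}
for paper_id, _title, conf_id in papers:
    if conf_id is not None:
        presented_at[conf_id].append(paper_id)
traversal_count = sum(len(paper_ids) for paper_ids in presented_at.values())

print(join_count, traversal_count)  # 3 3
```

Both strategies return the same count; what the benchmark measures is how the cost of reaching that answer changes with data size and query shape.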

[Figure: relational database versus graph database]

Simple joins across multiple tables
- In J1, Neo4j counted the values of the join between the enormous Papers and Conferences tables faster, thanks to its pre-made relationships.
- In J2 and J3, PostgreSQL performed better than Neo4j thanks to join optimizations: it did not have to use every row of both joined tables, only a subset of rows.
- PostgreSQL achieved better results in the more complex joins.

Join queries with condition across multiple tables

- W1: Counts the number of papers presented at conferences with a specified short name.

- W2: Counts the number of papers, linked through a conference with a specified short name, that were presented at conference instances.

- W3: Counts the number of journals, linked through a conference with a specified short name, that were presented at conference instances.

- W4: Counts the number of papers with a specified original paper title presented at conferences.


[Figure: relational database versus graph database]

Join queries with simple condition across tables
- The W4 query is similar to W1 but differs in the direction of traversal. The direction of a relationship, node A -> node B or vice versa, therefore has a huge impact on performance.

- A graph database can store a relationship in both directions for better performance, at the cost of extra disk space.
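
The space-for-speed trade-off in the last point can be sketched as follows. With forward adjacency only, answering "which nodes point to B?" means scanning every node's outgoing list; keeping a reverse index answers it directly but stores each relationship twice. This is an illustrative in-memory model with hypothetical names (`DirectedGraph`, `store_reverse`, `predecessors`), not a description of Neo4j's storage format.

```python
from collections import defaultdict


class DirectedGraph:
    def __init__(self, store_reverse: bool) -> None:
        self.store_reverse = store_reverse
        self.forward = defaultdict(list)   # source -> list of targets
        self.reverse = defaultdict(list)   # target -> list of sources (optional)
        self.stored_entries = 0            # rough proxy for space used

    def add_edge(self, source: str, target: str) -> None:
        self.forward[source].append(target)
        self.stored_entries += 1
        if self.store_reverse:
            self.reverse[target].append(source)
            self.stored_entries += 1

    def predecessors(self, target: str) -> list:
        """Nodes with an edge pointing at `target` (traversal against the edge direction)."""
        if self.store_reverse:
            # Direct lookup in the reverse index.
            return list(self.reverse.get(target, []))
        # No reverse index: scan the outgoing list of every node.
        return [s for s, targets in self.forward.items() if target in targets]


edges = [("p1", "conf"), ("p2", "conf"), ("p3", "other")]
one_way = DirectedGraph(store_reverse=False)
two_way = DirectedGraph(store_reverse=True)
for source, target in edges:
    one_way.add_edge(source, target)
    two_way.add_edge(source, target)

print(one_way.predecessors("conf"), one_way.stored_entries)  # ['p1', 'p2'] 3
print(two_way.predecessors("conf"), two_way.stored_entries)  # ['p1', 'p2'] 6
```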


Summary & Conclusion


Summary

● Introduced graph databases

● Compared graph databases with relational databases

● Introduced some use cases

● Reviewed a paper benchmarking Neo4j against PostgreSQL


Conclusion

● Graph theory is the foundation

● GraphDB

● Neo4j

● Cloud infra: Microsoft Azure Cosmos DB, Amazon Neptune


References

● Microsoft Academic Graph. https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

● Martin Macak, Matus Stovcik and Barbora Buhnova, “The Suitability of Graph Databases for Big Data Analysis: A Benchmark”, IoTBDS 2020. https://www.researchgate.net/publication/341469950_The_Suitability_of_Graph_Databases_for_Big_Data_Analysis_A_Benchmark

● Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-june (Paul) Hsu and Kuansan Wang, “An Overview of Microsoft Academic Service (MAS) and Applications”. https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/An_Overview_of_Microsoft_Academic_Service_MAS_and_Applications-2.pdf

THANK YOU SO MUCH
