Introduction to Graph Database - cse.hcmut.edu.vn
Course: Data Engineering
Teacher: Assoc. Prof. Dr. Dang Tran Khanh
Group 13
● 2070682 - Phạm Nguyễn Nhật Minh
● 2170088 - Đặng Ngô Nhật Trường
● 2170409 - Lê Dương Khoa
● 2070401 - Nguyễn Việt Anh
● 1870387 - Lê Văn Duẫn
Agenda
1. Graph database introduction
2. Graph database study case
3. Research results
4. Conclusion & Summary
Graph database introduction
Why are graphs important?
● Modeling chemical and biological data
● Social networks
● The web
● Hierarchical data
What is a graph database?
● A database built on top of a graph data structure.
● A collection of vertices (nodes) and edges.
● Properties:
❏ Each node and edge is uniquely identified
❏ Each node has a set of incoming and outgoing edges
❏ Each node and edge has a collection of properties
❏ Each edge has a label that defines the relationship between its two nodes
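The model above can be sketched in a few lines of Python. This is only an illustrative in-memory structure, not how any particular graph database stores its data; the class names (`Node`, `Edge`, `PropertyGraph`) and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                                    # unique identifier
    properties: dict = field(default_factory=dict)  # arbitrary key/value pairs
    outgoing: list = field(default_factory=list)    # ids of edges leaving this node
    incoming: list = field(default_factory=list)    # ids of edges entering this node


@dataclass
class Edge:
    edge_id: str                                    # unique identifier
    label: str                                      # names the relationship
    source: str                                     # id of the start node
    target: str                                     # id of the end node
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy container for nodes and labeled, directed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, node_id, **properties):
        if node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        self.nodes[node_id] = Node(node_id, dict(properties))
        return self.nodes[node_id]

    def add_edge(self, edge_id, label, source, target, **properties):
        if edge_id in self.edges:
            raise ValueError(f"duplicate edge id: {edge_id}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(edge_id, label, source, target, dict(properties))
        self.edges[edge_id] = edge
        # Keep each node's incoming and outgoing edge sets up to date.
        self.nodes[source].outgoing.append(edge_id)
        self.nodes[target].incoming.append(edge_id)
        return edge


# Illustrative data only.
g = PropertyGraph()
g.add_node("alice", name="Alice")
g.add_node("bob", name="Bob")
g.add_edge("e1", "FRIEND_OF", "alice", "bob", since=2020)
```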
Graph queries
● List nodes/edges that have this property
● List matching subgraphs
● Can these two nodes reach each other?
● How many hops does it take for two nodes to connect?
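As a rough illustration of three of these query types (property lookup, reachability and hop count), here is a minimal Python sketch using breadth-first search over an adjacency list. The graph, the property names and the function names are invented for the example; subgraph matching is left out because it needs considerably more machinery.

```python
from collections import deque

# Hypothetical toy graph: node -> list of nodes it points to.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
    "F": [],
}

# Hypothetical node properties.
node_properties = {
    "A": {"kind": "person"},
    "B": {"kind": "person"},
    "C": {"kind": "page"},
    "D": {"kind": "page"},
    "E": {"kind": "person"},
    "F": {"kind": "page"},
}


def nodes_with_property(props, key, value):
    """List the nodes whose property `key` equals `value`."""
    return [node for node, p in props.items() if p.get(key) == value]


def hop_count(adjacency, start, goal):
    """Minimum number of edges from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


def reachable(adjacency, start, goal):
    """True if goal can be reached from start by following edges."""
    return hop_count(adjacency, start, goal) is not None
```

Because breadth-first search visits nodes in order of distance, the first time it meets the goal it has found the smallest hop count.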
Graph database versus Relational database
● Pros:
○ Schema flexibility
○ More intuitive querying
○ Avoids “join bombs”
○ The cost of a local hop is not a function of the total number of nodes
● Cons:
○ Not always advantageous
○ Query languages are not unified
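The point about local hops can be made concrete with a small Python sketch. It compares an unindexed scan over an edge table with a lookup in a per-node adjacency list, counting how many entries each approach examines. The data and function names are assumptions for illustration; real relational databases use indexes and query planners, so this shows only the underlying idea, not actual PostgreSQL behaviour.

```python
# Hypothetical friendship table: one (person, friend) row per relationship.
rows = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "carol"),
]


def friends_by_scan(rows, person):
    """Unindexed scan: the work grows with the total number of rows."""
    examined = 0
    result = []
    for owner, friend in rows:
        examined += 1
        if owner == person:
            result.append(friend)
    return result, examined


def build_adjacency(rows):
    """Group the rows by person so each node carries its own edge list."""
    adjacency = {}
    for owner, friend in rows:
        adjacency.setdefault(owner, []).append(friend)
    return adjacency


def friends_by_adjacency(adjacency, person):
    """Local hop: only this node's own edges are touched."""
    neighbours = adjacency.get(person, [])
    return list(neighbours), len(neighbours)


adjacency = build_adjacency(rows)
```

With this data, the scan examines all four rows to find Alice's two friends, while the adjacency lookup touches only her two edges, however many other rows exist.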
When do we need a graph database?
● To solve many-to-many relationship problems (e.g. friends on Facebook)
● When relationships between data are important.
● Who are the friends of Alice's friends?
● Who are Alice and Bob’s mutual friends?
● Assuming Alice doesn’t know Bob, who should they go through to get to know each other as quickly as possible?
Graph databases are able to solve these problems directly and quickly.
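The three questions above map onto simple graph operations. The Python sketch below answers them on a small, made-up friendship graph; the names other than Alice and Bob, and all function names, are assumptions for illustration. A graph database would express the same questions in its own query language.

```python
from collections import deque

# Hypothetical undirected friendship graph (each friendship listed both ways).
friends = {
    "Alice": {"Carol", "Dave"},
    "Bob": {"Dave", "Eve"},
    "Carol": {"Alice", "Eve"},
    "Dave": {"Alice", "Bob"},
    "Eve": {"Bob", "Carol"},
}


def friends_of_friends(friends, person):
    """People two hops away who are not already direct friends."""
    direct = friends.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= friends.get(friend, set())
    return two_hops - direct - {person}


def mutual_friends(friends, a, b):
    """People who are friends with both a and b."""
    return friends.get(a, set()) & friends.get(b, set())


def introduction_path(friends, start, goal):
    """Shortest chain of acquaintances from start to goal (BFS), or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            path = []
            while person is not None:
                path.append(person)
                person = parents[person]
            return path[::-1]
        for friend in friends.get(person, ()):
            if friend not in parents:
                parents[friend] = person
                queue.append(friend)
    return None
```

In this toy graph Alice and Bob are not direct friends, and the shortest introduction goes through their one mutual friend, Dave.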
Examples in Neo4j using the Cypher language
Graph database study case
Martin Macak, Matus Stovcik and Barbora Buhnova, “The Suitability of Graph Databases for Big Data Analysis: A Benchmark”, IoTBDS 2020
Problem statement
● In non-extreme cases, it is not known where the border lies between the situations in which each strategy performs better
● Such comparisons become rarer when the tests run in a cluster with big data
Purpose
● Find out which strategy is better when running tests under different queries
● Discuss the threats to validity of this work
Setup
Database technologies for comparison
● PostgreSQL
● Neo4j
Dataset
● Microsoft Academic Graph [Ref]
● 1.7 billion rows, 14 tables, 13 relationships between tables
● A number of relationships, distributed across a series of distinct queries on the dataset
Microsoft Academic Graph (https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)
Queries
● 10 queries, each needing a specific number of joins or traversals
● 3 types: joins across tables, WHERE statements, string-containment functions
● Find papers/journals under specific conditions
Cluster
● Three-node cluster
● High availability
● 1 master node and 2 read-only nodes
Research results
Simple joins across multiple tables
- J1: Counts the number of papers presented at a conference.
- J2: Counts the number of papers presented at conference instances.
- J3: Counts the number of journals presented at conference instances.
Target: determine the threshold of data complexity
[Figure: relational database vs. graph database]
Simple joins across multiple tables
- In J1, Neo4j handled counting over the join between the very large Papers and Conferences tables better, thanks to pre-made relationships.
- In J2 and J3, PostgreSQL performed better than Neo4j because its join optimizations meant it did not have to use every row in both joined tables, only a subset of rows.
- PostgreSQL achieved better results in more complex joins.
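To illustrate what "pre-made relationships" means for a J1-style count, the Python sketch below computes the same number two ways: by matching paper rows against a conference table at query time, and by walking relationships that were stored when the data was loaded. The toy data, the table layout and the function names are assumptions; they are not the benchmark's actual schema or queries, and the sketch says nothing about real timings.

```python
# Hypothetical relational layout: each paper row carries a conference id (or None).
papers = [
    (1, "c1"),
    (2, "c1"),
    (3, "c2"),
    (4, None),   # a paper that was not presented at any conference
]
conferences = [("c1", "ConfA"), ("c2", "ConfB"), ("c3", "ConfC")]


def count_by_join(papers, conferences):
    """Join at query time: match every paper row against the conference ids."""
    conference_ids = {conference_id for conference_id, _ in conferences}
    return sum(1 for _, conference_id in papers if conference_id in conference_ids)


# Hypothetical graph layout: each conference node already holds its paper edges.
presented_at = {"c1": [1, 2], "c2": [3], "c3": []}


def count_by_traversal(presented_at):
    """Follow stored relationships: no matching step is needed at query time."""
    return sum(len(paper_ids) for paper_ids in presented_at.values())
```

Both functions return the same count; the difference is whether the pairing of papers and conferences is computed during the query or was materialized beforehand.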
Join queries with condition across multiple tables
- W1: Counts the number of papers presented at conferences with a specified short name.
- W2: Counts the number of papers, linked through a conference with a specified short name, that were presented at conference instances.
- W3: Counts the number of journals, linked through a conference with a specified short name, that were presented at conference instances.
- W4: Counts the number of papers with a specified original paper title presented at conferences.
[Figure: relational database vs. graph database]
Join queries with simple condition across tables
- The W4 query is similar to W1 but differs in the direction of traversal, so the direction of a relationship (node A -> node B or vice versa) has a huge impact on performance.
- A graph database can store a relationship in both directions for better performance, at the cost of extra disk space.
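The trade-off between traversal direction and storage can be sketched in Python. Here only the paper -> conference direction is stored at first, so a query that starts from a conference has to inspect every edge; building a reverse index makes that query a direct lookup but doubles the number of stored entries. The data and function names are assumptions for illustration, and the sketch does not describe how Neo4j stores relationships internally.

```python
# Hypothetical directed edges: paper -> conference it was presented at.
presented_at = {"p1": "c1", "p2": "c1", "p3": "c2"}


def papers_for_conference_scan(presented_at, conference):
    """Only the forward direction is stored, so every edge is inspected."""
    return sorted(p for p, c in presented_at.items() if c == conference)


def build_reverse_index(presented_at):
    """Store the opposite direction too: conference -> list of papers."""
    reverse = {}
    for paper, conference in presented_at.items():
        reverse.setdefault(conference, []).append(paper)
    return reverse


def papers_for_conference_lookup(reverse, conference):
    """With the reverse direction stored, the query is a direct lookup."""
    return sorted(reverse.get(conference, []))


reverse = build_reverse_index(presented_at)

# Space cost: one stored entry per edge and per direction.
stored_entries = len(presented_at) + sum(len(v) for v in reverse.values())
```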
Summary & Conclusion
Summary
● Introduction to graph databases
● Comparison of graph databases and relational databases
● Some use cases
● Review of a paper benchmarking Neo4j against PostgreSQL
Conclusion
● Graph theory is the foundation
● GraphDB
● Neo4j
● Cloud infra: Microsoft Azure Cosmos DB, Amazon Neptune
References
● Microsoft Academic Graph [Link1] [Link2]
● Martin Macak, Matus Stovcik and Barbora Buhnova, “The Suitability of Graph Databases for Big Data Analysis: A Benchmark”, IoTBDS 2020 [Link]
● Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-june (Paul) Hsu, Kuansan Wang, “An Overview of Microsoft Academic Service (MAS) and Applications”
THANK YOU SO MUCH