tips and tricks for graph data modeling
TRANSCRIPT
Graph ModelingTips and Tricks
Mark Needham @markhneedham
Introducing our data set...
meetup.com’s recommendations
Recommendation queries
‣ Several different types• groups to join• topics to follow• events to attend
‣ As a user of meetup.com trying to find groups to join and events to attend
How will this talk be structured?
Find similar groups to Neo4j
As a member of the Neo4j London group
I want to find other similar meetup groups
So that I can join those groups
What makes groups similar?
As a member of the Neo4j London group
I want to find other similar meetup groups
So that I can join those groups
Find similar groups to Neo4j
As a member of the Neo4j London group
I want to find other similar meetup groups
So that I can join those groups
Nodes
As a member of the Neo4j London group
I want to find other similar meetup groups
So that I can join those groups
Relationships
As a member of the Neo4j London group
I want to find other similar meetup groups
So that I can join those groups
Labels
As a member of the Neo4j London group
I want to find other similar meetup groups
So that I can join those groups
Properties
Find similar groups to Neo4j
MATCH (group:Group {name: "Neo4j - London User Group"}) -[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(otherGroup)RETURN otherGroup.name, COUNT(topic) AS topicsInCommon, COLLECT(topic.name) AS topicsORDER BY topicsInCommon DESC, otherGroup.nameLIMIT 10
Find similar groups to Neo4j
Tip: Model incrementally
‣ Build the model for the question we need to answer now then move onto the next question
I’m already a member of these!
What other data can we get?
Exclude groups I’m a member of
As a member of the Neo4j London group
I want to find other similar meetup groupsthat I’m not already a member of
So that I can join those groups
Exclude groups I’m a member of
As a member of the Neo4j London group
I want to find other similar meetup groupsthat I’m not already a member of
So that I can join those groups
Exclude groups I’m a member of
MATCH (group:Group {name: "Neo4j - London User Group"}) -[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(otherGroup:Group)RETURN otherGroup.name, COUNT(topic) AS topicsInCommon, EXISTS((:Member {name: "Mark Needham"}) -[:MEMBER_OF]->(otherGroup)) AS alreadyMember, COLLECT(topic.name) AS topicsORDER BY topicsInCommon DESCLIMIT 10
Exclude groups I’m a member of
Exclude groups I’m a member of
MATCH (group:Group {name: "Neo4j - London User Group"}) -[:HAS_TOPIC]->(topic)<-[:HAS_TOPIC]-(otherGroup:Group)WHERE NOT( (:Member {name: "Mark Needham"}) -[:MEMBER_OF]->(otherGroup) )RETURN otherGroup.name, COUNT(topic) AS topicsInCommon, COLLECT(topic.name) AS topicsORDER BY topicsInCommon DESCLIMIT 10
Exclude groups I’m a member of
Find my similar groups
As a member of several meetup groups
I want to find other similar meetup groups
that I’m not already a member of
So that I can join those groups
Find my similar groups
As a member of several meetup groups
I want to find other similar meetup groups
that I’m not already a member of
So that I can join those groups
Find my similar groups
MATCH (member:Member {name: "Mark Needham"}) -[:INTERESTED_IN]->(topic), (member)-[:MEMBER_OF]->(group)-[:HAS_TOPIC]->(topic)
WITH member, topic, COUNT(*) AS scoreMATCH (topic)<-[:HAS_TOPIC]-(otherGroup) WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as scoreORDER BY score DESC
Find my similar groups
Find Jonny’s similar groups
Oops...Jonny has no interests!
What is Jonny interested in?
As a member of several meetup groups
I want to find other similar meetup groups
that I’m not already a member of
So that I can join those groups
INTERESTED_IN?
What is Jonny interested in?
There’s an implicit INTERESTED_IN relationship between the topics of groups I belong to but don’t express an interest in. Let’s make it explicit
P
G
T
MEMBER_OF
HAS_TOPIC
P
G
T
MEMBER_OF
HAS_TOPIC
INTERESTED_IN
What is Jonny interested in?
MATCH (m:Member)-[:MEMBER_OF]->(group)-[:HAS_TOPIC]->(topic)
WITH m, topic, COUNT(*) AS times WHERE times > 3
MERGE (m)-[:INTERESTED_IN]->(topic)
What is Jonny interested in?
Tip: Make the implicit explicit
‣ Fill in the missing links in the graph‣ You could run this type of query once a day
during a quiet period‣ On bigger graphs we’d run it in
batches to avoid loading thewhole database into memory
Find next group people join
As a member of a meetup group
I want to find out which meetup groups other people join after this one
So that I can join those groups
Find next group people join
MATCH (group:Group {name: "Neo4j - London User Group"}) <-[membership:MEMBER_OF]-(member), (member)-[otherMembership:MEMBER_OF]->(otherGroup)WHERE membership.joined < otherMembership.joinedWITH member, otherGroup ORDER BY otherMembership.joinedWITH member, COLLECT(otherGroup)[0] AS nextGroupRETURN nextGroup.name, COUNT(*) AS timesORDER BY times DESC
Find next group people join
It feels a bit clunky...
MATCH (group:Group {name: "Neo4j - London User Group"}) <-[membership:MEMBER_OF]-(member), (member)-[otherMembership:MEMBER_OF]->(otherGroup)WHERE membership.joined < otherMembership.joinedWITH member, otherGroup ORDER BY otherMembership.joinedWITH member, COLLECT(otherGroup)[0] AS nextGroupRETURN nextGroup.name, COUNT(*) AS timesORDER BY times DESC
‣ We have to scan through all the MEMBER_OF relationships to find the one we want
‣ It might make our lives easier if we made membership a first class citizen of the domain
Facts can become nodes
Facts can become nodes
Refactor to facts
MATCH (member:Member)-[rel:MEMBER_OF]->(group)
MERGE (membership:Membership {id: member.id + "_" + group.id})SET membership.joined = rel.joined
MERGE (member)-[:HAS_MEMBERSHIP]->(membership)MERGE (membership)-[:OF_GROUP]->(group)
Refactor to facts
MATCH (member:Member)-[:HAS_MEMBERSHIP]->(membership)
WITH member, membership ORDER BY member.id, membership.joined
WITH member, COLLECT(membership) AS membershipsUNWIND RANGE(0,SIZE(memberships) - 2) as idx
WITH memberships[idx] AS m1, memberships[idx+1] AS m2MERGE (m1)-[:NEXT]->(m2)
Find next group people join
MATCH (group:Group {name: "Neo4j - London User Group"}) <-[:OF_GROUP]-(membership)-[:NEXT]->(nextMembership), (membership)<-[:HAS_MEMBERSHIP]-(member:Member) -[:HAS_MEMBERSHIP]->(nextMembership), (nextMembership)-[:OF_GROUP]->(nextGroup)RETURN nextGroup.name, COUNT(*) AS timesORDER BY times DESC
Comparing the approaches
vs
MATCH (group:Group {name: "Neo4j - London User Group"}) <-[membership:MEMBER_OF]-(member), (member)-[otherMembership:MEMBER_OF]->(otherGroup)WITH member, membership, otherMembership, otherGroup ORDER BY member.id, otherMembership.joinedWHERE membership.joined < otherMembership.joinedWITH member, membership, COLLECT(otherGroup)[0] AS nextGroupRETURN nextGroup.name, COUNT(*) AS timesORDER BY times DESC
MATCH (group:Group {name: "Neo4j - London User Group"}) <-[:OF_GROUP]-(membership)-[:NEXT]->(nextMembership), (membership)<-[:HAS_MEMBERSHIP]-(member:Member) -[:HAS_MEMBERSHIP]->(nextMembership), (nextMembership)-[:OF_GROUP]->(nextGroup)RETURN nextGroup.name, COUNT(*) AS timesORDER BY times DESC
How do I profile a query?
‣ EXPLAIN• shows the execution plan without actually
executing it or returning any results.
‣ PROFILE• executes the statement and returns the results
along with profiling information.
45
Neo4j’s longest plan (so far…)
46
Neo4j’s longest plan (so far…)
47
Neo4j’s longest plan (so far…)
48
What is our goal?
At a high level, the goal is simple: get the number of db hits down.
49
an abstract unit of storage engine work.
What is a database hit?
“”
50
Comparing the approaches
Cypher version: CYPHER 2.3, planner: COST. 111656 total db hits in 330 ms.vsCypher version: CYPHER 2.3, planner: COST.23650 total db hits in 60 ms.
Tip: Profile your queries
‣ Spike the different models and see which one performs the best
Should we keep both models?
We could but when we add, edit or remove a membership we’d have to keep both graph structures in sync.
Adding a group membershipWITH "Mark Needham" AS memberName, "Neo4j - London User Group" AS groupName, timestamp() AS now
MATCH (group:Group {name: groupName})MATCH (member:Member {name: memberName})
MERGE (member)-[memberOfRel:MEMBER_OF]->(group)ON CREATE SET memberOfRel.time = now
MERGE (membership:Membership {id: member.id + "_" + group.id})ON CREATE SET membership.joined = nowMERGE (member)-[:HAS_MEMBERSHIP]->(membership)MERGE (membership)-[:OF_GROUP]->(group)
Removing a group membershipWITH "Mark Needham" AS memberName, "Neo4j - London User Group" AS groupName, timestamp() AS now
MATCH (group:Group {name: groupName})MATCH (member:Member {name: memberName})
MATCH (member)-[memberOfRel:MEMBER_OF]->(group)
MATCH (membership:Membership {id: member.id + "_" + group.id})MATCH (member)-[hasMembershipRel:HAS_MEMBERSHIP]->(membership)MATCH (membership)-[ofGroupRel:OF_GROUP]->(group)
DELETE memberOfRel, hasMembershipRel, ofGroupRel, membership
Let’s delete MEMBER_OF then...
...not so fast!
As a member of several meetup groups
I want to find other similar meetup groups
that I’m not already a member of
So that I can join those groups
Why not delete MEMBER_OF?MATCH (member:Member {name: "Mark Needham"}) -[:MEMBER_OF]->(group)-[:HAS_TOPIC]->(topic)WITH member, topic, COUNT(*) AS scoreMATCH (topic)<-[:HAS_TOPIC]-(otherGroup) WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as scoreORDER BY score DESC
MATCH (member:Member {name: "Mark Needham"}) -[:HAS_MEMBERSHIP]->()-[:OF_GROUP]->(group:Group)-[:HAS_TOPIC]->(topic)WITH member, topic, COUNT(*) AS scoreMATCH (topic)<-[:HAS_TOPIC]-(otherGroup) WHERE NOT (member)-[:HAS_MEMBERSHIP]->(:Membership)-[:OF_GROUP]->(otherGroup:Group)RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as scoreORDER BY score DESC
Why not delete MEMBER_OF?MATCH (member:Member {name: "Mark Needham"}) -[:MEMBER_OF]->(group)-[:HAS_TOPIC]->(topic)WITH member, topic, COUNT(*) AS scoreMATCH (topic)<-[:HAS_TOPIC]-(otherGroup) WHERE NOT (member)-[:MEMBER_OF]->(otherGroup)RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as scoreORDER BY score DESC
MATCH (member:Member {name: "Mark Needham"}) -[:HAS_MEMBERSHIP]->()-[:OF_GROUP]->(group:Group)-[:HAS_TOPIC]->(topic)WITH member, topic, COUNT(*) AS scoreMATCH (topic)<-[:HAS_TOPIC]-(otherGroup) WHERE NOT (member)-[:HAS_MEMBERSHIP]->(:Membership)-[:OF_GROUP]->(otherGroup:Group)RETURN otherGroup.name, COLLECT(topic.name), SUM(score) as scoreORDER BY score DESC
433318 total db hits in 485 ms.
83268 total db hits in 117 ms.
Tip: Maintaining multiple models
‣ Different models perform better for different queries but worse for others
‣ Optimising for reads may mean we pay a write and maintenance penalty
What about events?
Events in my groups
As a member of several meetup groups who has previously attended events
I want to find other events hosted by those groups
So that I can attend those events
Events in my groups
As a member of several meetup groups who has previously attended events
I want to find other events
hosted by those groups
So that I can attend those events
WITH 24.0*60*60*1000 AS oneDay
MATCH (member:Member {name: "Mark Needham"}),
(member)-[:MEMBER_OF]->(group),
(group)-[:HOSTED_EVENT]->(futureEvent)
WHERE futureEvent.time >= timestamp()
AND NOT (member)-[:RSVPD]->(futureEvent)
RETURN group.name, futureEvent.name,
round((futureEvent.time - timestamp()) / oneDay) AS days
ORDER BY days
LIMIT 10
Events in my groups
Events in my groups
+ previous events attendedWITH 24.0*60*60*1000 AS oneDay
MATCH (member:Member {name: "Mark Needham"})
MATCH (futureEvent:Event)
WHERE futureEvent.time >= timestamp() AND NOT (member)-[:RSVPD]->(futureEvent)
MATCH (futureEvent)<-[:HOSTED_EVENT]-(group)
WITH oneDay, group, futureEvent, member, EXISTS((group)<-[:MEMBER_OF]-(member)) AS isMember
OPTIONAL MATCH (member)-[rsvp:RSVPD {response: "yes"}]->(pastEvent)<-[:HOSTED_EVENT]-(group)
WHERE pastEvent.time < timestamp()
RETURN group.name,
futureEvent.name,
isMember,
COUNT(rsvp) AS previousEvents,
round((futureEvent.time - timestamp()) / oneDay) AS days
ORDER BY days, previousEvents DESC
+ previous events attended
RSVPD_YES vs RSVPD
I was curious whether refactoring RSVPD {response: "yes"} to RSVPD_YES would have any impact as Neo4j is optimised for querying by unique relationship types.
Refactor to specific relationships
MATCH (m:Member)-[rsvp:RSVPD {response:"yes"}]->(event)
MERGE (m)-[rsvpYes:RSVPD_YES {id: rsvp.id}]->(event)
ON CREATE SET rsvpYes.created = rsvp.created,
rsvpYes.lastModified = rsvp.lastModified;
MATCH (m:Member)-[rsvp:RSVPD {response:"no"}]->(event)
MERGE (m)-[rsvpYes:RSVPD_NO {id: rsvp.id}]->(event)
ON CREATE SET rsvpYes.created = rsvp.created,
rsvpYes.lastModified = rsvp.lastModified;
RSVPD_YES vs RSVPD
RSVPD {response: "yes"}
vs
RSVPD_YES
Cypher version: CYPHER 2.3, planner: COST. 688635 total db hits in 232 ms.
Cypher version: CYPHER 2.3, planner: COST. 559866 total db hits in 207 ms.
Why would we keep RSVPD?
MATCH (m:Member)-[r:RSVPD]->(event)<-[:HOSTED_EVENT]-(group)
WHERE m.name = "Mark Needham"
RETURN event, group, r
MATCH (m:Member)-[r:RSVPD_YES|:RSVPD_NO|:RSVPD_WAITLIST]->(event),
(event)<-[:HOSTED_EVENT]-(group)
WHERE m.name = "Mark Needham"
RETURN event, group, r
Tip: Specific relationships
‣ Neo4j is optimised for querying by unique relationship types…
‣ ...but sometimes we pay a query maintenance cost to achieve this
+ events my friends are attending
There’s an implicit FRIENDS relationship between people who attended the same events.Let’s make it explicit.
M
E
M
RSVPD
RSVPD
FRIENDS
M
E
M
RSVPD
RSVPD
+ events my friends are attending
MATCH (m1:Member)WHERE NOT m1:Processed
WITH m1 LIMIT {limit}MATCH (m1)-[:RSVP_YES]->(event:Event)<-[:RSVP_YES]-(m2:Member)
WITH m1, m2, COLLECT(event) AS events, COUNT(*) AS timesWHERE times >= 5
WITH m1, m2, times, [event IN events | SIZE((event)<-[:RSVP_YES]-())] AS attendances
WITH m1, m2, REDUCE(score = 0.0, a IN attendances | score + (1.0 / a)) AS score
MERGE (m1)-[friendsRel:FRIENDS]-(m2)SET friendsRel.score = row.score
Bidirectional relationships
‣ You may have noticed that we didn’t specify a direction when creating the relationshipMERGE (m1)-[:FRIENDS]-(m2)
‣ FRIENDS is a bidirectional relationship. We only need to create it once between two people.
‣ We ignore the direction when querying
+ events my friends are attendingWITH 24.0*60*60*1000 AS oneDayMATCH (member:Member {name: "Mark Needham"})MATCH (futureEvent:Event) WHERE futureEvent.time >= timestamp() AND NOT (member)-[:RSVPD]->(futureEvent)MATCH (futureEvent)<-[:HOSTED_EVENT]-(group)
WITH oneDay, group, futureEvent, member, EXISTS((group)<-[:MEMBER_OF]-(member)) AS isMemberOPTIONAL MATCH (member)-[rsvp:RSVPD {response: "yes"}]->(pastEvent)<-[:HOSTED_EVENT]-(group)WHERE pastEvent.time < timestamp()
WITH oneDay, group, futureEvent, member, isMember, COUNT(rsvp) AS previousEventsOPTIONAL MATCH (futureEvent)<-[:HOSTED_EVENT]-()-[:HAS_TOPIC]->(topic)<-[:INTERESTED_IN]-(member)
WITH oneDay, group, futureEvent, member, isMember, previousEvents, COUNT(topic) AS topicsOPTIONAL MATCH (member)-[:FRIENDS]-(:Member)-[rsvpYes:RSVP_YES]->(futureEvent)
RETURN group.name, futureEvent.name, isMember, round((futureEvent.time - timestamp()) / oneDay) AS days, previousEvents, topics, COUNT(rsvpYes) AS friendsGoingORDER BY days, friendsGoing DESC, previousEvents DESCLIMIT 15
+ events my friends are attending
Tip: Bidirectional relationships
‣ Some relationships are bidirectional in nature‣ Neo4j always stores relationships with a
direction but we can choose to ignore that when we query
tl;dr
‣ Model incrementally‣ Always profile your queries‣ Consider making the implicit explicit…
• ...but beware the maintenance cost
‣ Be specific with relationship types‣ Ignore direction for bidirectional relationships
That’s all for today!Questions? :-)
Mark Needham @markhneedhamhttps://github.com/neo4j-meetups/modeling-worked-example