![Page 1: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/1.jpg)
Building a Graph Database in Neo4j with Spark & Spark SQL to Gain New Insights from Log Data
Robert Hryniewicz – Data Scientist/Evangelist, Hortonworks
Rachel Poulsen – Data Science Director, TiVo
![Page 2: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/2.jpg)
Overview
• Overview of business problem
  • Introduction to TiVo and TiVo data
  • Challenges with using data to optimize UI navigation
• Solution
• Demo
• Next Steps
• Questions
![Page 3: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/3.jpg)
TiVo
Background and context
• TiVo is a discovery platform for integrated entertainment
  • Multiple ways to find content
• TiVo Roamio Demo Video
• TiVo collects ~2 million logs a day from its boxes
  • User action events, TiVo action events, inventory events
  • These events are “memory-less”: an event carries no record of what happened before it
![Page 4: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/4.jpg)
![Page 5: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/5.jpg)
![Page 6: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/6.jpg)
![Page 7: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/7.jpg)
![Page 8: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/8.jpg)
Motivation
TiVo Business Initiatives
• Help users get to content they want faster
• Help users discover new content more easily
• Is feature X important to discovering and getting to content? (e.g., Is the guide still used to find content?)
Challenge
• Measuring a KPI for initiatives in a log stream that is “memory-less”
• Identifying events or a pattern of events that impacts the KPIs
Data Objective – “Path Analysis”
• Analysis that answers questions around the navigational paths users take to get to or from defined start and/or end points
![Page 9: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/9.jpg)
Architecture Challenges
• Traditional data platform that was sample-based and SQL-based
• Relational databases
![Page 10: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/10.jpg)
Technical Challenges
Relational Database Challenges
• Little flexibility in “click path” definition
• Decisions about defining “paths” are made during the processing step
• Many business assumptions have to be made with little insight
![Page 11: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/11.jpg)
Solution
![Page 12: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/12.jpg)
Solution
Graph Database (Neo4j)
• Relationships are first-class citizens
• Simple abstractions
• Enable sophisticated models
“Path Analysis”
![Page 13: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/13.jpg)
Prototype
One Day Graph Info
• Edges: 57K relationships
• Nodes:
  • 135 UI or “screen” nodes
  • 12 “watch content” nodes
![Page 14: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/14.jpg)
Size reduction (2K times)
• 70 GB log data
• 35 MB Neo DB (all nodes & edges)
• 1 Day: Oct 1, 2015
[Figure: 70+ GB of logs reduced to a 35 MB graph of UI nodes, watch nodes, and edges]
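The "2K times" figure follows directly from the two sizes on the slide. A quick check, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the variable names are only illustrative:

```python
# Sizes quoted on the slide, converted with decimal units (an assumption).
log_bytes = 70 * 10**9     # 70 GB of raw log data for one day
graph_bytes = 35 * 10**6   # 35 MB Neo4j database (all nodes and edges)

# Ratio of raw log size to graph size.
reduction_factor = log_bytes / graph_bytes
print(f"Size reduction: {reduction_factor:.0f}x")
```

With binary units the ratio would come out slightly higher, which is still consistent with the rounded "2K" on the slide.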
![Page 15: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/15.jpg)
Screens, Transitions, and Content
• Screen events
• Remote button press events
• Watch content events
![Page 16: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/16.jpg)
Sample Graph
[Figure: nodes “Live Movie”, “My Shows”, and “TiVo Central (Home)” connected by “Switched to” edges]
![Page 17: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/17.jpg)
Architecture Overview
![Page 18: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/18.jpg)
What’s captured in the graph?
• Node (UI)
  • Name
  • Timestamp
• Node (Watch)
  • Type, e.g. Recorded
  • Genre, e.g. TV Show
  • Timestamp
• Edge
  • Average Time
  • Total number of keys pressed
  • Key sequence, e.g. Home Up/Down Select/Play
  • Total number of times path taken
  • Unique number of users taking this path
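The node and edge properties listed above can be sketched as plain Python records. This is only an illustration of the data model, not the Neo4j schema used in the talk; the class and field names (`Node`, `Edge`, `kind`, `label`, and so on) are assumptions, and the direction of the example edge is assumed as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    kind: str                    # "UI" or "Watch"
    label: str                   # screen name (UI) or watch type, e.g. "Recorded"
    timestamp: str               # day the node belongs to, e.g. "1/1/2016"
    genre: Optional[str] = None  # only used for watch nodes, e.g. "TV Show"


@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    avg_time_s: float   # average transition time in seconds
    keys_pressed: int   # total number of keys pressed
    key_sequence: str   # e.g. "Home-Down-Select"
    times_taken: int    # total number of times the path was taken
    unique_users: int   # number of distinct users who took the path


# Values taken from the example slide; the edge direction is an assumption.
tivo_central = Node(kind="UI", label="TiVo Central", timestamp="1/1/2016")
live_movie = Node(kind="Watch", label="Live", timestamp="1/1/2016", genre="Movie")
example_edge = Edge(tivo_central, live_movie, 0.4, 3, "Home-Down-Select", 50, 27)
print(example_edge.key_sequence, example_edge.times_taken)
```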
![Page 19: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/19.jpg)
What’s captured in the graph?
[Figure: example edge between the nodes “TiVo Central” (1/1/2016) and “Live Movie” (1/1/2016)]
• 0.4s average time
• 3 keys pressed
• Home-Down-Select
• 50 times path used
• 27 unique users
![Page 20: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/20.jpg)
Raw Log File example
…
1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...
1444809715812909|Key|HOME
1444880816123454|UI|TivoCentralScreen
1444809716234553|Key|DOWN
1444809716354363|Key|SELECT
1444809716518701|Trick3|PLAY|116|1|100|-1
1444809719888072|Key|PLAY
1444809719889072|Trick3|PLAY|119|1|100|-1
1444809726966880|Watch|rec|WFXTDT|SH|508|...
…
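A minimal sketch of how lines like these might be parsed and filtered, written in plain Python rather than Spark so it stays self-contained. It assumes the first pipe-delimited field is a numeric timestamp and the second is the event type; the names `LogEvent`, `parse_line`, and `filter_events` are hypothetical, and the meaning of the remaining fields is not documented in the slides, so they are kept as an opaque list.

```python
from typing import List, NamedTuple, Optional


class LogEvent(NamedTuple):
    timestamp: int      # leading numeric field (units not stated in the slides)
    event_type: str     # e.g. "Watch", "Key", "UI", "Trick3"
    fields: List[str]   # remaining fields, left uninterpreted


def parse_line(line: str) -> Optional[LogEvent]:
    """Split one pipe-delimited log line; return None for non-event lines."""
    parts = line.strip().split("|")
    if len(parts) < 2 or not parts[0].isdigit():
        return None
    return LogEvent(int(parts[0]), parts[1], parts[2:])


# Event types kept for path analysis: screens, watch events, key presses.
KEEP_TYPES = {"UI", "Watch", "Key"}


def filter_events(lines: List[str]) -> List[LogEvent]:
    """Parse lines and keep only Screen (UI), Watch, and Key events."""
    parsed = (parse_line(line) for line in lines)
    return [ev for ev in parsed if ev is not None and ev.event_type in KEEP_TYPES]


sample_lines = [
    "…",
    "1444809715812909|Key|HOME",
    "1444880816123454|UI|TivoCentralScreen",
    "1444809716234553|Key|DOWN",
    "1444809716518701|Trick3|PLAY|116|1|100|-1",
    "1444809726966880|Watch|rec|WFXTDT|SH|508|...",
]
kept = filter_events(sample_lines)
print([ev.event_type for ev in kept])
```

The `Trick3` line and the elision marker are dropped; truncated fields such as the trailing `...` are passed through unchanged rather than guessed.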
![Page 21: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/21.jpg)
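Filtered Log
...
Watch: LIVE MOVIE
Key: HOME
UI: TIVO HOME
Key: DOWN
Key: SELECT
Key: PLAY
Watch: REC SHOW
…
[Figure annotations: the Watch and UI lines are labeled “Node”, the Key lines between them are labeled “Edge”, and all events fall on the same day]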
![Page 22: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/22.jpg)
Algorithm Overview (1 of 2)
1. Filter for desired events
  • Remove non-Screen, non-Watch, non-Key events
2. Session-ize and order logs to reflect Screen/Watch/Edge events
3. Define display for Key Press events - two formats
  • Normal: SELECT & UP x 2 & GUIDE & SELECT
  • Compact: TIVO & 9 KEYS & SELECT
4. Generate an Edge if transition < max time set by stakeholders (e.g. 5 min)
  • For all logs find the following sequence:
    Node X - timestamp x (start time)
    key A
    key B
    key C
    Node Y - timestamp y (end time)
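Steps 2 through 4 can be sketched in plain Python as follows. This is an illustration under several assumptions, not the Spark job from the talk: each box's time-ordered stream is treated as one session, timestamps are taken to be microseconds, `box_id` is a hypothetical device identifier that does not appear in the sample log, and the two key formats are inferred from the single examples on the slide.

```python
from itertools import groupby
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class Event(NamedTuple):
    box_id: str   # hypothetical device identifier
    ts_us: int    # timestamp, assumed to be in microseconds
    kind: str     # "UI", "Watch", or "Key"
    value: str    # screen name, watch label, or key name


class Transition(NamedTuple):
    box_id: str
    source: str
    target: str
    keys: Tuple[str, ...]
    seconds: float


def normal_format(keys: Sequence[str]) -> str:
    """Collapse consecutive repeats, e.g. UP, UP -> 'UP x 2'."""
    parts = []
    for key, run in groupby(keys):
        count = len(list(run))
        parts.append(key if count == 1 else f"{key} x {count}")
    return " & ".join(parts)


def compact_format(keys: Sequence[str]) -> str:
    """Keep the first and last key and count the keys in between."""
    if len(keys) <= 2:
        return " & ".join(keys)
    return f"{keys[0]} & {len(keys) - 2} KEYS & {keys[-1]}"


def find_transitions(events: Iterable[Event], max_gap_s: float = 300.0) -> List[Transition]:
    """Emit Node X -> keys -> Node Y when the gap is under max_gap_s."""
    transitions = []
    ordered = sorted(events, key=lambda e: (e.box_id, e.ts_us))
    for box_id, session in groupby(ordered, key=lambda e: e.box_id):
        start = None   # most recent node (UI or Watch) event for this box
        keys = []      # key presses seen since that node
        for ev in session:
            if ev.kind == "Key":
                keys.append(ev.value)
                continue
            if start is not None:
                gap_s = (ev.ts_us - start.ts_us) / 1_000_000
                if gap_s < max_gap_s:
                    transitions.append(
                        Transition(box_id, start.value, ev.value, tuple(keys), gap_s)
                    )
            start, keys = ev, []
    return transitions


# Illustrative events modeled on the filtered log; timestamps are made up.
events = [
    Event("A", 0, "Watch", "LIVE MOVIE"),
    Event("A", 100_000, "Key", "HOME"),
    Event("A", 400_000, "UI", "TIVO HOME"),
    Event("A", 500_000, "Key", "DOWN"),
    Event("A", 600_000, "Key", "SELECT"),
    Event("A", 900_000, "Key", "PLAY"),
    Event("A", 11_000_000, "Watch", "REC SHOW"),
    Event("B", 0, "UI", "GUIDE"),
    Event("B", 400_000_000, "Watch", "LIVE SHOW"),  # 400 s later: no edge
]
edges = find_transitions(events)
for edge in edges:
    print(edge.source, "->", edge.target, "|", normal_format(edge.keys))
```

Box A yields two transitions, while box B yields none because its only pair of nodes is more than 5 minutes apart.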
![Page 23: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/23.jpg)
Algorithm Overview (2 of 2)
5. For each unique node-edge-node calculate:
  1. Average transition time
  2. Number of transitions
  3. Number of unique transitions (unique users taking the path)
  4. Number of keys pressed
  5. Key sequence (normal or compact)
6. Export results to CSV files
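A rough sketch of steps 5 and 6 in plain Python. It assumes each transition has already been reduced to a tuple of (box id, source node, target node, key sequence string, number of keys, seconds), and that a "unique node-edge-node" is identified by source, key sequence, and target. The function names and CSV column names are hypothetical; the talk does not show the actual export format.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# (box_id, source, target, key_sequence, keys_pressed, seconds)
TransitionRow = Tuple[str, str, str, str, int, float]

COLUMNS = ["source", "target", "key_sequence", "keys_pressed",
           "avg_time_s", "transitions", "unique_users"]


def aggregate(transitions: List[TransitionRow]) -> List[Dict[str, object]]:
    """Group by (source, key sequence, target) and compute edge metrics."""
    groups = defaultdict(list)
    for box_id, source, target, key_seq, n_keys, seconds in transitions:
        groups[(source, key_seq, target)].append((box_id, n_keys, seconds))

    rows = []
    for (source, key_seq, target), items in sorted(groups.items()):
        rows.append({
            "source": source,
            "target": target,
            "key_sequence": key_seq,
            "keys_pressed": items[0][1],  # same sequence, so same key count
            "avg_time_s": round(sum(s for _, _, s in items) / len(items), 3),
            "transitions": len(items),
            "unique_users": len({box for box, _, _ in items}),
        })
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize aggregated edges to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative transitions; box ids and timings are made up.
sample = [
    ("A", "LIVE MOVIE", "TIVO HOME", "HOME", 1, 0.4),
    ("B", "LIVE MOVIE", "TIVO HOME", "HOME", 1, 0.6),
    ("A", "LIVE MOVIE", "TIVO HOME", "HOME", 1, 0.5),
    ("A", "TIVO HOME", "REC SHOW", "DOWN & SELECT & PLAY", 3, 10.6),
]
edge_rows = aggregate(sample)
csv_text = to_csv(edge_rows)
print(csv_text)
```

CSV output like this could then be bulk-loaded into a graph database, although the loading step itself is outside this sketch.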
![Page 24: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/24.jpg)
DEMO
• What is the most popular path people take to get to content?
  • live vs. recorded
• What percent of total paths are most popular?
• What path is most popular? Overall? Unique?
• What app is most popular?
• What percent of total paths involve the Guide screen?
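The transcript does not include the demo queries themselves. As a rough illustration of how two of these questions could be answered once edges carry transition and user counts, here is a plain-Python sketch over an in-memory list; the edge values are made up for the example and the function names are hypothetical.

```python
from typing import Dict, List

# Illustrative aggregated edges; the counts are invented for this example.
edge_list: List[Dict[str, object]] = [
    {"source": "TiVo Central", "target": "My Shows", "transitions": 50, "unique_users": 27},
    {"source": "TiVo Central", "target": "Guide", "transitions": 30, "unique_users": 20},
    {"source": "Guide", "target": "Live Movie", "transitions": 20, "unique_users": 15},
]


def most_popular(edges: List[Dict[str, object]], by: str = "transitions") -> Dict[str, object]:
    """Return the edge with the highest count ('transitions' or 'unique_users')."""
    return max(edges, key=lambda e: e[by])


def percent_involving(edges: List[Dict[str, object]], screen: str) -> float:
    """Share of all transitions that start or end at the given screen."""
    total = sum(e["transitions"] for e in edges)
    hits = sum(e["transitions"] for e in edges
               if screen in (e["source"], e["target"]))
    return 100.0 * hits / total if total else 0.0


top_edge = most_popular(edge_list)
guide_share = percent_involving(edge_list, "Guide")
print(top_edge["source"], "->", top_edge["target"])
print(f"{guide_share:.1f}% of transitions involve the Guide screen")
```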
![Page 25: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/25.jpg)
Business Advantages
• Measure KPIs for time to content and content discovery
• Optimize KPIs (understanding user behavior that impacts the KPIs)
• Enhance A/B Testing by helping to answer “why?”
• Simplify user experience across products
• Increase engagement with new content
• Understand how features are used together, not only as mutually exclusive experiences
![Page 26: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/26.jpg)
Future Work
• Deploy to production: multi-day queries
• Add relationships and nodes for feature usage
• Classify paths (“discovery” or “known destination”)
• Exploratory analysis
![Page 27: Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data](https://reader035.vdocuments.us/reader035/viewer/2022062905/586e8cfe1a28aba0038b86fd/html5/thumbnails/27.jpg)
Thanks! @RobHryniewicz @Bayesbabe