Machine Learning and GraphX
TRANSCRIPT
![Page 1: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/1.jpg)
Massive Graph Mining
Apache Spark’s GraphX and Data Mining
![Page 2: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/2.jpg)
Who we are
Andy
@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Rand
@randhindi
@snips
Entrepreneur
PhD bioinformatics, etc.
Love data & ML
![Page 3: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/3.jpg)
Graph 101
A graph is a mathematical representation of linked data.
It’s defined in terms of its Vertices and Edges, G(V,E).
A vertex is an entity that can carry a bag of data (generally small).
An edge connects vertices, and can also carry its own bag of data.
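As a minimal sketch of this definition, the plain-Python classes below model vertices and edges that each carry a small bag of data. The class names, field names and toy values are illustrative assumptions, not GraphX types.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    data: dict = field(default_factory=dict)  # the vertex's "bag of data"


@dataclass
class Edge:
    src: int  # id of the source vertex
    dst: int  # id of the destination vertex
    data: dict = field(default_factory=dict)  # the edge's own "bag of data"


# A toy graph G(V, E) with two vertices and one edge (hypothetical values).
vertices = [Vertex(1, {"label": "A"}), Vertex(2, {"label": "B"})]
edges = [Edge(1, 2, {"weight": 0.5})]
```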
![Page 4: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/4.jpg)
Graph 101
A graph represents data in a way that is less convenient for classical processing frameworks.
This is because the burden is not on the observations themselves (the rows) but on their linkage, and specifically on its density.
Thus, the problem is often translated into a self-join.
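To make the self-join idea concrete, here is a small sketch that treats the graph as a plain edge table and joins it with itself (destination of one row equals source of another) to list two-hop paths. The function name and the toy edge list are assumptions made for illustration.

```python
from collections import defaultdict

# Edge table: one row per link, as (source, destination).
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]


def two_hop_paths(edge_table):
    """Self-join the edge table on left.dst == right.src."""
    by_src = defaultdict(list)
    for src, dst in edge_table:
        by_src[src].append(dst)
    # Each left row (src, mid) is matched with every right row starting at mid.
    return [(src, mid, dst) for src, mid in edge_table for dst in by_src[mid]]


paths = two_hop_paths(edges)
# paths == [("a", "b", "c"), ("a", "b", "d"), ("b", "c", "d")]
```

Each extra hop needs one more join over the whole edge table, which is why dense linkage is costly for row-oriented frameworks.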
![Page 5: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/5.jpg)
Graph 101
A graph, G(V,E), has a reverse representation, its Dual.
A Dual is nothing other than the graph, G’(V’,E’), where
● a vertex is an edge in G, and
● an edge is a vertex in G that is shared by at least two edges.
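The construction described here is what graph theory usually calls the line graph. A minimal sketch, assuming an undirected edge list and identifying each dual vertex by the index of the original edge (both choices are assumptions for illustration):

```python
from collections import defaultdict
from itertools import combinations


def dual_graph(edges):
    """Return the dual edges as (edge_index_a, edge_index_b, shared_vertex)."""
    incident = defaultdict(list)  # vertex of G -> indices of edges touching it
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        if v != u:
            incident[v].append(idx)
    dual_edges = []
    for vertex, edge_ids in incident.items():
        # Every pair of edges meeting at this vertex is linked in the dual.
        for a, b in combinations(edge_ids, 2):
            dual_edges.append((a, b, vertex))
    return dual_edges


# A small star: vertex 2 is shared by all three edges.
g_edges = [(1, 2), (2, 3), (2, 4)]
g_dual = dual_graph(g_edges)
# g_dual == [(0, 1, 2), (0, 2, 2), (1, 2, 2)]
```

Vertices of G that touch a single edge produce no edge in the dual.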
![Page 6: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/6.jpg)
Graph 101
The classical way to store or share the connectivity of a graph is using its tabular version, that is, its Adjacency Matrix.
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
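A short sketch of the tabular form, assuming vertices are numbered from 0 to n-1 and the graph is unweighted (the function name and the `directed` flag are illustrative):

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix where cell [u][v] is 1 if u links to v."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        if not directed:
            matrix[v][u] = 1  # undirected graphs give a symmetric matrix
    return matrix


path_matrix = adjacency_matrix(3, [(0, 1), (1, 2)])
# path_matrix == [[0, 1, 0],
#                 [1, 0, 1],
#                 [0, 1, 0]]
```

The matrix has n² cells no matter how many edges exist, so large sparse graphs are usually stored as edge lists instead.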
![Page 7: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/7.jpg)
GraphX (Apache Spark)
Spark 101
![Page 8: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/8.jpg)
GraphX (Apache Spark)
Offers a Graph API on top of Spark, enabling cross-world manipulations
![Page 9: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/9.jpg)
GraphX (Apache Spark)
How it differs from other classical systems...
![Page 10: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/10.jpg)
GraphX (Apache Spark)
![Page 11: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/11.jpg)
GraphX (Apache Spark)
![Page 12: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/12.jpg)
GraphX (Apache Spark)
Plenty of operators on both RDDs, but
![Page 13: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/13.jpg)
GraphX (Apache Spark)
Plenty of operators on both RDDs, but
![Page 14: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/14.jpg)
GraphX (Apache Spark)
1. Sends messages to neighbors
2. Returns an RDD of aggregated messages
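The two steps can be sketched in plain Python: a `send` function emits messages for each edge, and a `merge` function combines the messages that reach the same vertex. This is only an analogue of the pattern, not the GraphX API; the function and parameter names are assumptions, and the result is a dict rather than an RDD.

```python
def aggregate_messages(edges, send, merge):
    """Step 1: send messages along edges. Step 2: merge them per target vertex."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send(src, dst):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox


toy_edges = [(1, 2), (1, 3), (2, 3)]

# In-degree: each edge sends the message 1 to its destination, merged by sum.
in_degrees = aggregate_messages(
    toy_edges,
    send=lambda src, dst: [(dst, 1)],
    merge=lambda a, b: a + b,
)
# in_degrees == {2: 1, 3: 2}
```

Vertices that receive no message (vertex 1 here) are absent from the result.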
![Page 15: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/15.jpg)
GraphX (Apache Spark)
Offers higher-level operators and algorithms, like
![Page 16: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/16.jpg)
GraphX (Apache Spark)
This one rules them all (and more)
More later...
![Page 17: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/17.jpg)
PageRank and Pregel
Everybody knows PageRank, right?
If not: it’s our oil, our friend, our preferred black box…
It’s why Google Search works so well!
![Page 18: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/18.jpg)
PageRank and Pregel
Essentially, PageRank is all about the importance of a node in a graph → Link Analysis.
The bottom line is:
● In-links are votes
● In-links from important nodes are more important → recursion
![Page 19: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/19.jpg)
PageRank and Pregel
https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
![Page 20: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/20.jpg)
PageRank and Pregel
TL;DR
The importance of a node is the probability that a random (drunk) walker lands on that node.
So, it depends on:
1. the probability that he lands on one of its neighbors
2. the probability that he crosses a link from the neighbor to it
3. an arbitrary probability of teleportation
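The three ingredients can be written as one equation. The symbols are assumptions introduced here, not taken from the slides: r_j is the rank of node j, d_i is the out-degree of node i, N is the number of nodes, and β is the probability of following a link rather than teleporting.

```latex
r_j \;=\; \sum_{i \to j} \beta \, \frac{r_i}{d_i} \;+\; (1 - \beta) \, \frac{1}{N}
```

Here r_i is the probability of being on neighbor i, 1/d_i is the probability of crossing the link from i to j, and (1 - β)/N is the teleportation term.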
![Page 21: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/21.jpg)
PageRank and Pregel
Solution: Power Method/Iteration (recursive)
r_new = A × r_old
Matrix algebra is a pain in a distributed environment…
But wait, the process is rather graph oriented!
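For reference, a single-machine sketch of the power iteration, written over out-link lists instead of an explicit matrix. The damping value 0.85, the tolerance, and the even redistribution of rank from nodes without out-links are common choices assumed here, not details from the slides.

```python
def pagerank_power_iteration(out_links, beta=0.85, tol=1e-10, max_iter=100):
    """out_links maps every node to the list of nodes it links to."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleportation share received by every node.
        new_rank = {v: (1.0 - beta) / n for v in nodes}
        for src in nodes:
            # Dangling nodes spread their rank evenly over all nodes.
            targets = out_links[src] or nodes
            share = beta * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:  # r_new is close enough to r_old
            break
    return rank


toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank_power_iteration(toy_web)
# "c" collects votes from both "a" and "b", so it ends up with the highest rank.
```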
![Page 22: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/22.jpg)
PageRank and Pregel
Pregel (Google again)
Based on BSP, Bulk Synchronous Parallel
BSP works in a message-passing style
![Page 23: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/23.jpg)
PageRank and Pregel
During Superstep i, a vertex can:
● use messages received from Superstep i-1
● execute a function
● send messages
● vote to halt
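A toy, single-machine simulation of these supersteps, using the textbook example of propagating the maximum value through a graph. The function name, the dict-based inbox and the halting rule are assumptions chosen for illustration; a real BSP system runs the vertices in parallel.

```python
def pregel_max(values, neighbors, max_supersteps=50):
    """Every vertex ends up holding the largest value reachable from it."""
    values = dict(values)
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in values}
        for v in values:
            if superstep > 0:
                messages = inbox[v]  # messages sent during superstep i-1
                if not messages:
                    continue  # halted and not woken up by any message
                best = max(messages)  # execute a function
                if best <= values[v]:
                    continue  # nothing new learned: vote to halt
                values[v] = best
            for neighbor in neighbors[v]:
                outbox[neighbor].append(values[v])  # send messages
        if not any(outbox.values()):
            break  # every vertex has halted and no message is in flight
        inbox = outbox
    return values


chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = pregel_max({1: 3, 2: 6, 3: 2, 4: 1}, chain)
# result == {1: 6, 2: 6, 3: 6, 4: 6}
```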
![Page 24: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/24.jpg)
PageRank and Pregel
![Page 25: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/25.jpg)
PageRank and Pregel
In GraphX, as usual with Spark, it’s simple:
mapReduceTriplets
![Page 26: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/26.jpg)
PageRank and Pregel
PageRank with Pregel:
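The slide's code is not part of this transcript. As a rough stand-in, the sketch below expresses PageRank in a vertex-centric style in plain Python. The three inner functions loosely mirror the vertex program, send-message and merge-message roles of a Pregel-style API, but the names and signatures here are assumptions, not GraphX calls. It also assumes every vertex has at least one out-link and runs a fixed number of supersteps.

```python
def pregel_pagerank(out_links, beta=0.85, supersteps=30):
    n = len(out_links)
    ranks = {v: 1.0 / n for v in out_links}

    def send_msg(src, src_rank):
        # A vertex splits its rank evenly among its out-neighbors.
        targets = out_links[src]
        return [(dst, src_rank / len(targets)) for dst in targets]

    def merge_msg(a, b):
        # Messages arriving at the same vertex are summed.
        return a + b

    def vprog(msg_sum):
        # New rank = teleportation share + damped sum of incoming votes.
        return (1.0 - beta) / n + beta * msg_sum

    for _ in range(supersteps):
        inbox = {}
        for src in out_links:
            for dst, msg in send_msg(src, ranks[src]):
                inbox[dst] = merge_msg(inbox[dst], msg) if dst in inbox else msg
        ranks = {v: vprog(inbox.get(v, 0.0)) for v in out_links}
    return ranks


vc_ranks = pregel_pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Each superstep only needs a vertex's own rank and its incoming messages, which is what makes the computation easy to distribute.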
![Page 27: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/27.jpg)
PageRank and Pregel
Applying it to our USA.csv file:
![Page 28: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/28.jpg)
OpenStreetMap
Founded by Steve Coast (UK, 2004)
Aims to take geodata out of governments’ hands and give it to the crowd
Actually, the crowd has to create it...
![Page 29: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/29.jpg)
OSM
![Page 30: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/30.jpg)
OSM
![Page 31: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/31.jpg)
OSM
So it’s a Graph!
Node = Vertex
A single point in space defined by its latitude, longitude and node id
Way = Edge
A way can have between 2 and 2,000 nodes
![Page 32: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/32.jpg)
OSM
The network is more complex than we need, so we simplify it by:
● reducing cyclic ways, like roundabouts, to a single one
● transforming the nodes into sections, i.e. pieces of streets between 2 intersections
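One possible way to cut ways into sections, sketched under two assumptions that the slides do not state: a way is an ordered list of node ids, and an intersection is any node that occurs more than once across all ways. Roundabout handling is left out.

```python
from collections import Counter


def split_into_sections(ways):
    """Cut each way at intersections; return the resulting node sequences."""
    occurrences = Counter(node for way in ways for node in way)
    sections = []
    for way in ways:
        current = [way[0]]
        for node in way[1:]:
            current.append(node)
            if occurrences[node] > 1:  # intersection: close the section here
                sections.append(current)
                current = [node]
        if len(current) > 1:  # trailing piece after the last intersection
            sections.append(current)
    return sections


# Two hypothetical streets crossing at node 3.
toy_ways = [[1, 2, 3, 4], [5, 3, 6]]
toy_sections = split_into_sections(toy_ways)
# toy_sections == [[1, 2, 3], [3, 4], [5, 3], [3, 6]]
```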
![Page 33: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/33.jpg)
OSM
Hence, OSM ~ G(Node, Way)
Even if the match isn’t exact, we can still manipulate it as a graph
In our case, we don’t need the connectivity of an intersection, but the connectivity of a section.
This is given by G’ (dual of G)
![Page 34: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/34.jpg)
Dataset
● 80 cities
● 3M edges in total
● smallest city 200 edges (Tempe)
● largest city 200,000 edges (Los Angeles)
![Page 35: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/35.jpg)
Comparing Cities
● Hypothesis: Cities with similar connectivity have similar PageRank distributions
NYC Chicago
![Page 36: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/36.jpg)
Fort Worth = Philadelphia?
Looks the same!
![Page 37: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/37.jpg)
Smells like Spurious Correlation
![Page 38: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/38.jpg)
Normalizing PageRank distributions
● Problem: PageRank is correlated with the size of the city
● size of city = number of sections (edges) in the graph
● Normalized PageRank = PageRank / size_of_city
● Now we can compare cities of different sizes!
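The normalization is a single division per value. A small sketch, assuming one PageRank value per section so that the size of a city equals the number of values (the function name, city labels and numbers are made up):

```python
def normalize_pagerank(ranks_by_city):
    """Divide every PageRank value by the number of sections in its city."""
    normalized = {}
    for city, ranks in ranks_by_city.items():
        size_of_city = len(ranks)  # assumption: one PageRank value per section
        normalized[city] = [r / size_of_city for r in ranks]
    return normalized


toy_ranks_by_city = {"small": [2.0, 4.0], "large": [1.0, 2.0, 3.0, 6.0]}
toy_normalized = normalize_pagerank(toy_ranks_by_city)
# toy_normalized["small"] == [1.0, 2.0]
```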
![Page 39: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/39.jpg)
Fort Worth != Philadelphia!
Totally different!
![Page 40: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/40.jpg)
Fort Worth before and after
Note that the range of PageRank is preserved
![Page 41: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/41.jpg)
Distance between PageRank Distributions
● How to compare PageRank distributions?
● It’s not always a normal distribution!
● Can use the Kullback–Leibler divergence from information theory
● the Kullback–Leibler divergence of Q from P, denoted DKL(P||Q), is a measure of the information lost when Q is used to approximate P
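For discrete distributions P and Q defined over the same bins i, the standard definition is:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{i} P(i) \, \ln \frac{P(i)}{Q(i)}
```

It is zero when P and Q are identical, and it is not symmetric: D_KL(P||Q) generally differs from D_KL(Q||P), so it is not a distance in the strict sense.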
![Page 42: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/42.jpg)
KL Divergence
● Easy to compute
● Units are nats (or bits if using log2 instead of ln)
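A minimal sketch of the computation for two discrete distributions given as lists of probabilities over the same bins. How the PageRank values were binned for the talk is not stated, so the inputs below are toy numbers, and the function name and `base` parameter are assumptions.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions over the same bins."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a bin with P(i) = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # Q cannot represent a bin that P uses
        total += p_i * math.log(p_i / q_i, base)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
nats = kl_divergence(p, q)          # natural log: result in nats
bits = kl_divergence(p, q, base=2)  # log2: same quantity in bits
```

Swapping the arguments gives a different value, so the order of the two cities matters when comparing them this way.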
![Page 43: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/43.jpg)
Very different cities: Dallas & Seattle
● KL divergence = 18.407
● Dallas is irregular, Seattle is a perfect grid
![Page 44: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/44.jpg)
Very similar cities: Atlanta & Boston
● KL divergence = 0.36
● Both are very irregular
![Page 45: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/45.jpg)
Next steps
● Using multiple street topology indicators to measure the risk of car accidents
![Page 46: Machine Learning and GraphX](https://reader034.vdocuments.us/reader034/viewer/2022042702/55d57975bb61eba42f8b461a/html5/thumbnails/46.jpg)
Q.E.D
Thanks for keeping up!
Question => Future[(Option[Response], Future[Question])]