Congressional PageRank: Graph Analytics of US Congress with Neo4j

Post on 12-Jan-2017


Congressional PageRank: Graph Analytics of US Congress

William Lyon

Graph Day - Austin, TX

January 2016

About me

Software Developer @Neo4j

william.lyon@neo4j.com

@lyonwj

lyonwj.com

William Lyon

Agenda

• Brief intro to Neo4j graph database
• Modeling US Congress as a graph
• Exploring the 114th Congress
• Finding influential legislators

Neo4j – Key Features

• Native Graph Storage: Ensures data consistency and performance
• Native Graph Processing: Millions of hops per second, in real time
• “Whiteboard Friendly” Data Modeling: Model data as it naturally occurs
• High Data Integrity: Fully ACID transactions
• Powerful, Expressive Query Language: Requires 10x to 100x less code than SQL
• Scalability and High Availability: Vertical and horizontal scaling optimized for graphs
• Built-in ETL: Seamless import from other databases
• Integration: Drivers and APIs for popular languages

MATCH(A)

Property Graph Model

The Whiteboard Model Is the Physical Model

Relational Versus Graph Models

[Figure: the same friendship data shown two ways. Relational Model: tables labeled Person, Friend, and Person-Friend. Graph Model: ANDREAS, TOBIAS, MICA, and DELIA connected by KNOWS relationships.]

Property Graph Model Components

Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled

Relationships
• Relate nodes by type and direction
• Can have name-value properties

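[Figure: example property graph. Two PERSON nodes and one CAR node, connected by LOVES, LOVES, LIVES WITH, DRIVES, and OWNS relationships. Properties shown: name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975; brand: “Volvo”, model: “V70”; since: Jan 10, 2011.]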

Cypher Query Language

Cypher: Powerful and Expressive Query Language

CREATE (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Ann"})

[Figure: the CREATE statement annotated. Each NODE (Dan, Ann) carries a LABEL and a PROPERTY; the two nodes are joined by a LOVES relationship.]

MATCH (boss)-[:MANAGES*0..3]->(sub),
      (sub)-[:MANAGES*1..3]->(report)
WHERE boss.name = "John Doe"
RETURN sub.name AS Subordinate, count(report) AS Total

Express Complex Queries Easily with Cypher

Find all direct reports and how many people they manage, up to 3 levels down

Cypher Query

SQL Query

http://www.opencypher.org/

Getting Data into Neo4j

Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships

Command-Line Bulk Loader: neo4j-import
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second

4.58 million things and their relationships…

Loads in 100 seconds!

Neo4j

Graph Database

• Property graph data model
• Nodes and relationships
• Native graph processing
• Cypher query language

Graphing US Congress

https://github.com/legis-graph/legis-graph


LOAD CSV WITH HEADERS FROM "file:///legislators.csv" AS line
MERGE (l:Legislator {thomasID: line.thomasID})
SET l = line
MERGE (s:State {code: line.state})<-[:REPRESENTS]-(l)
…

US Congress


Which legislators represent Texas?

MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator) RETURN l,s;

…include congressional body and party

MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
MATCH (p:Party)<-[:IS_MEMBER_OF]-(l)-[:ELECTED_TO]->(b:Body)
RETURN b,l,s,p;

How to find influential legislators?

Bill Sponsorship

Bill Cosponsorship

Degree centrality
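As a rough illustration of degree centrality applied to bill sponsorship, the sketch below counts how many bills each legislator sponsors. The legislator names and bill IDs are made up for the example and do not come from the legis-graph data.

```python
from collections import Counter

# Hypothetical (sponsor, bill) pairs; names and bill IDs are illustrative only.
sponsorships = [
    ("Legislator A", "hr1"),
    ("Legislator A", "hr2"),
    ("Legislator B", "s10"),
    ("Legislator A", "hr3"),
    ("Legislator C", "s11"),
]


def degree_centrality(edges):
    """Count the number of edges leaving each source node."""
    return Counter(source for source, _ in edges)


# Legislators ordered from most to fewest sponsored bills.
ranking = degree_centrality(sponsorships).most_common()
print(ranking)
```

Degree centrality only looks at a node's immediate edges, which is why it is cheap to compute but blind to who those neighbors are.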

Bill Cosponsorship

• Cosponsors are “influenced by” bill sponsors

• Add INFLUENCED_BY relationships
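One way to picture the step above is as a small transformation from sponsorship data to directed edges. This is a minimal sketch under assumed inputs: the `sponsors` mapping, the `cosponsors` list, and the function name are hypothetical, and whether the original graph keeps one edge per bill or one per pair of legislators is not stated in the slides (this sketch keeps one per pair).

```python
def influenced_by_edges(sponsors, cosponsors):
    """Derive (cosponsor, sponsor) pairs, read as 'cosponsor INFLUENCED_BY sponsor'.

    sponsors:   dict mapping bill ID -> sponsoring legislator
    cosponsors: iterable of (legislator, bill ID) pairs
    """
    edges = set()
    for legislator, bill in cosponsors:
        sponsor = sponsors.get(bill)
        # Skip bills with no known sponsor and avoid self-loops.
        if sponsor is not None and sponsor != legislator:
            edges.add((legislator, sponsor))
    return edges


# Illustrative data only.
sponsors = {"hr1": "A", "s10": "B"}
cosponsors = [("B", "hr1"), ("C", "hr1"), ("C", "s10"), ("A", "hr1")]
print(sorted(influenced_by_edges(sponsors, cosponsors)))
```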

Betweenness centrality

The number of times a node acts as a bridge along the shortest path between two other nodes.

https://en.wikipedia.org/wiki/Betweenness_centrality
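The definition above can be made concrete with a short sketch. The version below uses Brandes' algorithm on an unweighted, directed graph and returns unnormalized scores; the adjacency-dict input format and the function name are assumptions for illustration, not something taken from the talk.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an unweighted directed graph.

    graph: dict mapping node -> list of successor nodes.
    """
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    score = {n: 0.0 for n in nodes}
    for source in nodes:
        # Breadth-first search from source, counting shortest paths (sigma).
        order = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {n: 0.0 for n in nodes}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# "b" is the only bridge between "a" and "c".
print(betweenness_centrality({"a": ["b"], "b": ["c"]}))
```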

image credit: https://en.wikipedia.org/wiki/PageRank


?

PageRank: Cypher approximation

UNWIND range(1,10) AS round
MATCH (l:Legislator) WHERE rand() < 0.1
MATCH (l:Legislator)-[:INFLUENCED_BY]->(o:Legislator)
SET o.rank = coalesce(o.rank,0) + 1;

http://neo4j.com/blog/using-neo4j-hr-analytics/
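For comparison with the sampled Cypher approximation, PageRank is more commonly computed by power iteration over the whole graph. The sketch below is a minimal, self-contained version under stated assumptions: the damping factor of 0.85, the iteration count, the even redistribution of rank from nodes with no outgoing edges, and the edge data are all illustrative choices, not values from the talk.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) edge pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for target in out_links[n]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Edges read as "source INFLUENCED_BY target"; data is illustrative only.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get))
```

Every iteration touches every edge, which is the property that makes PageRank a global, iterative computation rather than a local traversal.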

Neo4j server extensions with Java


curl http://localhost:7474/service/v1/pagerank/Person/KNOWS

PageRank: Graph processing server extension

https://github.com/maxdemarzi/graph_processing


PageRank

neo4j-noderank

https://github.com/graphaware/neo4j-noderank

Two issues

• Local vs global
• Iterative algorithms and graph complexity

Local vs global

• Local: OLTP / realtime
• Global: Offline / batch

For iterative algorithms like PageRank, it’s all about the complexity of the graph. Lots of paths. Lots of iterations.

Graph complexity

PageRank

Graph global!

Iterative!

• Efficient in-memory data processing and machine learning platform
• Graph analytics with GraphX
• In-memory message passing algorithm

Apache Spark is a fast and general engine for large-scale data processing.

http://spark.apache.org/

PageRank: Spark with Neo4j - Scala

https://github.com/AnormCypher/AnormCypher

import org.anormcypher._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

val total = 100000000
val batch = total/1000000

// Read INFLUENCED_BY pairs from Neo4j in pages of 1,000,000 rows
val links = sc.range(0,batch).repartition(batch).mapPartitionsWithIndex( (i,p) => {
   val dbConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
   val q = "MATCH (l1:Legislator)-[:INFLUENCED_BY]->(l2:Legislator) RETURN id(l1) as from, id(l2) as to skip {skip} limit 1000000"
   p.flatMap( skip => {
      Cypher(q).on("skip"->skip*1000000).apply()(dbConn).map(row =>
         (row[Int]("from").toLong,row[Int]("to").toLong)
      )
   })
})

links.cache
links.count

// Build a GraphX graph from the node-id pairs and run 5 PageRank iterations
val edges = links.map( l => Edge(l._1,l._2, None))
val g = Graph.fromEdges(edges,"none")
val v = PageRank.run(g, 5).vertices

Extract subgraph. Run PageRank using Spark GraphX.

val res = v.repartition(total/100000).mapPartitions( part => {
   val localConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
   // One batched UNWIND statement per partition sets the pagerank property by node id
   val updateStmt = Cypher("UNWIND {updates} as update MATCH (p) where id(p) = update.id SET p.pagerank = update.rank")
   val updates = part.map( v => Map("id"->v._1.toLong, "rank" -> v._2.toDouble))
   val count = updateStmt.on("updates"->updates).execute()(localConn)
   Iterator(part.size)
})

Write back to graph

PageRank

Mazerunner

http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

• Enables two-way ETL between Spark and Neo4j

• Run GraphX jobs from data in Neo4j

• Write results back to Neo4j


• Support for:
  • PageRank
  • Closeness Centrality
  • Betweenness Centrality
  • Triangle Counting
  • Connected Components
  • Strongly Connected Components

https://github.com/neo4j-contrib/neo4j-mazerunner

curl http://localhost:7474/service/mazerunner/analysis/pagerank/INFLUENCED_BY


Who are the influential legislators?


Influential legislators by topic


graphdatabases.com

http://graphgist.neo4j.com/

http://portal.graphgist.org/challenge/index.html

Links

• http://www.lyonwj.com/2015/09/20/legis-graph-congressional-data-using-neo4j/
• http://www.lyonwj.com/2015/10/11/congressional-pagerank/
• https://github.com/legis-graph/legis-graph
• https://github.com/neo4j-contrib/neo4j-mazerunner
• http://www.kennybastani.com/2014/11/graph-analytics-docker-spark-neo4j.html
• http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
