![Page 1: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/1.jpg)
Easier than Excel: Social Network Analysis of DocGraph with Gephi
Janos G. Hajagos
Stony Brook School of Medicine
Fred Trotter
fredtrotter.com
![Page 2: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/2.jpg)
DocGraph
Based on FOIA request to CMS by Fred Trotter
Pre-released at Strata RX 2012
Medicare providers (more than doctors)
CY 2011 dates of service
Share 11 or more patients in a 30 day forward window
Initial access restricted to MedStartr funders
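The "11 or more patients in a 30 day forward window" rule can be illustrated with a small sketch. This is one reading of the rule, not the CMS procedure: it assumes an edge A to B is counted when a patient sees provider B within 30 days after seeing provider A, and that the edge is kept only when at least 11 distinct patients follow that pattern. The function name, the tuple layout, and the sample visits are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

def shared_patient_edges(visits, window_days=30, min_patients=11):
    """visits: list of (provider, patient, visit_date) tuples (hypothetical layout).

    Returns {(provider_a, provider_b): distinct_patient_count} for pairs that
    meet the minimum patient threshold.
    """
    by_patient = defaultdict(list)
    for provider, patient, day in visits:
        by_patient[patient].append((day, provider))

    window = timedelta(days=window_days)
    pair_patients = defaultdict(set)
    for patient, events in by_patient.items():
        # Sort by date; same-day visits end up ordered by provider id,
        # which is an arbitrary choice made for this sketch.
        events.sort()
        for i, (day_a, a) in enumerate(events):
            for day_b, b in events[i + 1:]:
                if day_b - day_a > window:
                    break  # later visits are even further away
                if a != b:
                    pair_patients[(a, b)].add(patient)

    return {pair: len(patients)
            for pair, patients in pair_patients.items()
            if len(patients) >= min_patients}

# Made-up visits: 11 patients see A then B nine days later (inside the window),
# and 11 other patients see A then C two months later (outside the window).
visits = []
for p in range(11):
    visits.append(("A", f"patient{p}", date(2011, 1, 1)))
    visits.append(("B", f"patient{p}", date(2011, 1, 10)))
for p in range(11, 22):
    visits.append(("A", f"patient{p}", date(2011, 1, 1)))
    visits.append(("C", f"patient{p}", date(2011, 3, 1)))

edges = shared_patient_edges(visits)
print(edges)
```

Only the A to B pair survives, because the A to C visits fall outside the window.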
![Page 3: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/3.jpg)
DocGraph by the numbers
Directed graph
Average total degree 52.8
940,492 providers (graph nodes/vertices)
49,685,810 shared edges
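The average degree can be checked from the two counts on this slide. The calculation below is a sanity check, not part of the original deck, and it assumes the figure is edges divided by nodes.

```python
# Counts quoted on the slide.
providers = 940_492        # graph nodes
shared_edges = 49_685_810  # directed edges

# Edges per node: the mean out-degree (and mean in-degree) of a directed graph.
edges_per_node = shared_edges / providers
print(round(edges_per_node, 1))  # prints 52.8
```

Counting both the incoming and outgoing edges of every provider would give twice this value, so the 52.8 on the slide appears to be edges per node.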
![Page 4: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/4.jpg)
Geographic visualization
http://isurfsoftware.com/blog/2012/12/13/visualizing-geographic-connections-between-us-doctors/
![Page 5: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/5.jpg)
DocGraph data
![Page 6: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/6.jpg)
![Page 7: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/7.jpg)
NPPES
National Plan and Provider Enumeration System
Source of NPI (National Provider Identifier)
No cost download
Information is entered and updated by provider
- Data quality is good to poor
CSV file with 314 columns
A custom MySQL load script is used to normalize the database
Bloom.api open source project to make data easier to access
- http://www.bloomapi.com/
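Normalizing a very wide CSV usually means turning repeated, numbered column groups into rows. The sketch below shows that idea in Python rather than MySQL. It is not the authors' load script, which is not shown in the slides. The column names, NPI values, and codes in the sample are simplified placeholders.

```python
import csv
import io

# Hypothetical miniature of a wide file; the real file has 314 columns.
SAMPLE = """NPI,Provider Name,Taxonomy Code_1,Taxonomy Code_2,Taxonomy Code_3
1000000001,Example Clinic,CODE_A,CODE_B,
1000000002,Example Lab,CODE_C,,
"""

def unpivot(rows, prefix):
    """Yield (npi, position, value) for every non-empty numbered column."""
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix) and value:
                yield row["NPI"], int(column[len(prefix):]), value

reader = csv.DictReader(io.StringIO(SAMPLE))
taxonomy_rows = list(unpivot(reader, "Taxonomy Code_"))
for record in taxonomy_rows:
    print(record)
```

Each output tuple would become one row in a child table keyed by NPI, so empty repeated columns take up no space.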
![Page 8: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/8.jpg)
Tabular data
![Page 9: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/9.jpg)
Things we can do with tabular data
![Page 10: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/10.jpg)
Graph data
Relation between authors and MeSH terms from PubMed
http://dx.doi.org/10.6084/m9.figshare.94595
![Page 11: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/11.jpg)
Graph types
Undirected graph
- Facebook friendships
Directed graph
- Twitter: follow and be followed
Bipartite graph
Multipartite
- RDF graph model
- Property graph model
Allow parallel edges
- RDF graph model
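The difference between these graph types shows up in how edges are stored. The sketch below uses plain Python containers and made-up names. It is an illustration only, not how Gephi or any graph library represents graphs internally.

```python
from collections import Counter

# Undirected: each pair is stored once and the order does not matter.
friendships = {frozenset(("ann", "bob")), frozenset(("bob", "ann"))}

# Directed: (source, target) order matters, so both directions are distinct.
follows = {("ann", "bob"), ("bob", "ann")}

# Parallel edges: a multiset keeps repeated (source, target) pairs.
parallel = Counter([("ann", "bob"), ("ann", "bob")])

print(len(friendships), len(follows), parallel[("ann", "bob")])
```

The two friendship entries collapse into one undirected edge, the two follow relations stay separate, and the multiset records two parallel edges between the same pair.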
![Page 12: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/12.jpg)
Components of a network/graph
![Page 13: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/13.jpg)
Graphs in healthcare
Prescriber and patient (bipartite)
- NCPDP data with NPI
Referral data sets
Shared patients
- DocGraph
Social networks
- Tweeting about a disease
Limited by imagination
![Page 14: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/14.jpg)
Generating GraphML
XML based file format for graphs
Readable by a large number of tools
- Gephi
- Mathematica
- igraph (R)
NetworkX is a Python library for graphs that can export to GraphML
GraphML is not a file format for really large graphs
GraphML is not readable by d3.js
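To show what a GraphML file contains, here is a minimal sketch that writes one with only the Python standard library. It is not the DocGraph export script. The function name, the attribute names "label" and "weight", and the sample providers are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges):
    """nodes: {node_id: label}; edges: list of (source, target, weight)."""
    root = ET.Element("graphml", xmlns=NS)
    # <key> elements declare the attributes carried by nodes and edges.
    ET.SubElement(root, "key", {"id": "d0", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "d1", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node_id, label in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="d0").text = label
    for source, target, weight in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="d1").text = str(weight)
    return ET.tostring(root, encoding="unicode")

xml_text = to_graphml(
    {"n1": "Provider A", "n2": "Provider B"},  # made-up nodes
    [("n1", "n2", 25)],                        # made-up shared-patient count
)
print(xml_text)
```

Every node and edge becomes its own XML element with nested data elements, which makes the verbosity of the format visible.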
![Page 15: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/15.jpg)
GraphML can be loaded into Mathematica
![Page 16: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/16.jpg)
Gephi
![Page 17: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/17.jpg)
Gephi
Java based open source tool
Focused on interactivity
- Fast graphics
- Multi-threaded
- Visual updates
Strong graph analytics
Graphs stored in memory
- Upper limit is about 100,000 nodes
Netbeans plugin architecture
- Integration with Neo4J
- Additional layout algorithms
![Page 18: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/18.jpg)
Downloading Gephi
http://gephi.org/users/download/
![Page 19: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/19.jpg)
Downloading sample files
https://dl.dropboxusercontent.com/u/21690634/DocGraph/docgraph_tutorial_examples.zip
![Page 20: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/20.jpg)
Subsets are generated using a Python script
python extract_providers_to_graphml.py "npi='1750499653'" sterrence Leaf-edges
Opening connection referral
Configuration
Selection criteria for subset graph: npi='1750499653'
Referral table _name: referral.referral2011
NPI detail table name: referral.npi_summary_primary_taxonomy
Nodes will be labeled by: provider_name
Leaf-to-leaf edges will be exported? False
…
Imported 1 nodes
…
Imported 986 nodes
…
Imported 1724 edges
Edge types imported
{'core-to-leaf': 866, 'leaf-to-core': 856, None: 2}
Leaf-to-leaf edges were not selected for export
Writing GraphML file
![Page 21: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/21.jpg)
Generating a subset: some concepts
Core nodes
Adding leaf nodes
Connecting core nodes
Connecting to leaf nodes
Connecting leaf nodes
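The core and leaf concepts can be sketched as a small function. This is an illustration of the idea, not the logic of extract_providers_to_graphml.py. The labels "core-to-leaf" and "leaf-to-core" appear in the script output, while "core-to-core", "leaf-to-leaf", the function name, and the sample edges are assumptions.

```python
def extract_subset(edges, core, include_leaf_to_leaf=False):
    """edges: list of (source, target) pairs; core: set of selected node ids.

    Returns (leaves, labeled_edges) where each labeled edge is
    (source, target, kind).
    """
    # Leaf nodes: nodes outside the core that share an edge with a core node.
    leaves = set()
    for source, target in edges:
        if source in core and target not in core:
            leaves.add(target)
        elif target in core and source not in core:
            leaves.add(source)

    labeled = []
    for source, target in edges:
        if source in core and target in core:
            kind = "core-to-core"
        elif source in core and target in leaves:
            kind = "core-to-leaf"
        elif source in leaves and target in core:
            kind = "leaf-to-core"
        elif source in leaves and target in leaves:
            kind = "leaf-to-leaf"
        else:
            continue  # edge touches a node outside the subset
        if kind == "leaf-to-leaf" and not include_leaf_to_leaf:
            continue
        labeled.append((source, target, kind))
    return leaves, labeled

# Made-up example: c1 and c2 are core; x and y are leaves; z is outside.
sample_edges = [("c1", "c2"), ("c1", "x"), ("y", "c2"), ("x", "y"), ("x", "z")]
leaves, labeled = extract_subset(sample_edges, {"c1", "c2"})
print(sorted(leaves), labeled)
```

Leaving out leaf-to-leaf edges keeps the subset small, since leaf nodes can have many connections among themselves that are unrelated to the core.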
![Page 22: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/22.jpg)
Sample files
jamestown_core_provider_graph.graphml
- Providers selected with practice addresses in Jamestown, NY
- Small city in far western New York (approximately 30,000 residents)
- 179 nodes with 5,560 edges
jamestown_core_and_leaf_provider_graph.graphml
- Includes providers above and those who are linked to them
- 1,322 nodes with 12,457 edges
albany_core_provider_graph.graphml
- Providers selected with practice addresses in Albany, NY
- A small city in New York (approximately 100,000 residents)
- 1,368 nodes with 44,711 edges
![Page 23: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/23.jpg)
Sample files (continued)
bronx_core_provider_graph.graphml
- Providers selected with practice addresses in Bronx, NY
- Urban community (1.4 million residents)
- 3,268 nodes and 53,828 edges
![Page 24: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/24.jpg)
Opening a graph file
![Page 25: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/25.jpg)
Import report
![Page 26: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/26.jpg)
Force directed layout of the graph
![Page 27: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/27.jpg)
Results of the layout
![Page 28: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/28.jpg)
ForceAtlas 2 works well for larger graphs
![Page 29: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/29.jpg)
Navigating the graph
Best experience with a three button mouse with a scroll wheel
- Right click and hold to pan
- Scroll wheel to zoom in and out
- Left click to select
- Right click for context menus
MacBook users
- Hold the Command key, then click and hold on the trackpad to pan
- Two fingers to zoom on trackpad
- Click on trackpad to select
- Control click for context menus
![Page 30: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/30.jpg)
Coloring the graph (partitioning)
![Page 31: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/31.jpg)
Coloring the graph (partitioning)
![Page 32: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/32.jpg)
Varying node size based on importance
Step 1: Need to select a measure for node importance
- Degree
- PageRank
- Eigenvector centrality
Step 2: Run the measure against the graph
Step 3: Ranking tab and “Size/Weight”
Step 4: Set size range
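The four steps can be sketched outside Gephi: pick a measure, compute it, and map it onto a size range. The example below uses total degree and a linear mapping. Both are assumptions made for illustration, and Gephi's own ranking may interpolate differently. The function name and default sizes are hypothetical.

```python
from collections import Counter

def size_by_degree(edges, min_size=10.0, max_size=50.0):
    """Map each node's total degree linearly onto [min_size, max_size].

    edges: non-empty list of (source, target) pairs.
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    low, high = min(degree.values()), max(degree.values())
    span = (high - low) or 1  # avoid dividing by zero when all degrees match
    return {node: min_size + (d - low) / span * (max_size - min_size)
            for node, d in degree.items()}

# Made-up star graph: "a" is connected to three other nodes.
sizes = size_by_degree([("a", "b"), ("a", "c"), ("a", "d")])
print(sizes)
```

The hub node receives the maximum size and the three peripheral nodes receive the minimum.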
![Page 33: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/33.jpg)
Graph measures
Degree
- In-degree
- Out-degree
Graph structure measures
- Clustering (global and local)
- Network diameter
Centrality Measures
- Eigenvector centrality
- PageRank (Google search)
Community measures
And more . . . . .
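To make one of the centrality measures concrete, here is a minimal PageRank sketch using power iteration. It is a textbook-style version written for illustration. Gephi's implementation and default parameters may differ, and the damping factor, iteration count, and sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (source, target) pairs. Returns {node: rank}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Every node passes a damped, equal share of its rank to each target.
        for source, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Made-up graph: both "a" and "b" point to "c", and "c" points back to "a".
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")], iterations=100)
print(max(ranks, key=ranks.get))
```

In this small graph "c" ends up with the highest rank because it receives links from both other nodes.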
![Page 34: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/34.jpg)
Interactively viewing node attributes
Click the “T” icon on the bottom to turn on node labeling
![Page 35: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/35.jpg)
Data Laboratory
![Page 36: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/36.jpg)
Selecting visible fields
![Page 37: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/37.jpg)
Viewing edge attributes
![Page 38: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/38.jpg)
Saving your graph
Save your graph in .gephi format
- XML based format
- preserves layout, size, and color
Save in GraphML format for use with outside programs
![Page 39: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/39.jpg)
Filtering nodes by attributes
![Page 40: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/40.jpg)
Hints for filtering nodes
Drag field filter “is_physician” from the top pane to the lower pane
Set the value to filter on
- Value should equal 1
- 1 is equivalent to true
Click “Filter” to apply
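The same attribute filter can be expressed in a few lines of code: keep the nodes whose attribute equals the chosen value, then keep only edges between surviving nodes. This is a sketch of the concept, not of Gephi's filter engine. The function name and the sample nodes are hypothetical, and "is_physician" is the field named on the slide.

```python
def filter_nodes(nodes, edges, attribute, value):
    """nodes: {node_id: {attribute: value}}; edges: list of (source, target).

    Returns the matching nodes and the edges that connect two matching nodes.
    """
    kept = {n: attrs for n, attrs in nodes.items()
            if attrs.get(attribute) == value}
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges

# Made-up graph: two physicians and one organization.
sample_nodes = {
    "n1": {"is_physician": 1},
    "n2": {"is_physician": 1},
    "n3": {"is_physician": 0},
}
sample_edges = [("n1", "n2"), ("n1", "n3"), ("n3", "n2")]

physicians, physician_edges = filter_nodes(sample_nodes, sample_edges, "is_physician", 1)
print(sorted(physicians), physician_edges)
```

Edges that touch a filtered-out node disappear along with it, which is why a filtered graph can look much sparser than the original.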
![Page 41: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/41.jpg)
Producing a final graph
We need to rescale the edge weights in the graph
![Page 42: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/42.jpg)
Producing a final graph after scaling
![Page 43: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/43.jpg)
Bronx core provider graph
![Page 44: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/44.jpg)
Challenge questions
Which institution is the most “important” provider for the Bronx?
- Hint: try a centrality measure
Can you determine if geography plays a role in patient sharing in the Bronx?
- Which parameter could be used to partition the graph?
Can you filter the graph to show only radiologists?
Which radiologist has the highest “authority” in the graph?
![Page 45: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/45.jpg)
Other tools for graph analysis
NetworkX
- Python
- Lots of algorithms
igraph
- R and Python
Gremlin – graph traversal and manipulation
- Groovy shell
- Gremlin interface is implemented for Neo4J
And more . . .
![Page 46: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/46.jpg)
Scaling the analysis to the entire DocGraph
Most healthcare graphs will be big (millions of nodes)
What we learn at the local level can be applied at the global level
- Importance of geography
- Supernodes (radiologist, ER docs, pathologist, transportation, …)
Many graph measures don’t scale well
- Maximal cliques
Currently exploring how to use Faunus to scale the analysis with Hadoop
![Page 47: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/47.jpg)
Links
http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)
https://github.com/jhajagos/DocGraph (code)
http://notonlydev.com/docgraph-data/ (open source $1 covers bandwidth fees)
https://groups.google.com/forum/#!forum/docgraph (mailing list)
![Page 48: Easier than Excel: Social Network Analysis of DocGraph with Gephi Janos G. Hajagos Stony Brook School of Medicine Fred Trotter fredtrotter.com](https://reader038.vdocuments.us/reader038/viewer/2022110100/56649dd05503460f94ac4a4e/html5/thumbnails/48.jpg)
Questions
Try to publish your own healthcare dataset as a graph!