Social Network Analysis with Spark

Social Network Analysis (SNA) Ghulam Imaduddin

Upload: ghulam-imaduddin

Post on 11-Apr-2017


Page 1: Social Network Analysis with Spark

Social Network Analysis (SNA)

Ghulam Imaduddin


Definition

From the point of view of data mining, a social network is a heterogeneous and multirelational data set represented by a graph. The graph is typically very large, with nodes (or vertices) corresponding to objects and edges corresponding to links representing relationships or interactions between objects. Both nodes and links have attributes (Han & Kamber, 2006).

[Figure: two subscriber nodes joined by edges. Edge examples: call, SMS, IM, balance transfer, …; mention, follow, like, …]


Benefit of SNA

Identify the role of a subscriber in a community:
• Community leader
• Bridge
• Passive
• Follower

Identify high-value/prospect communities by looking at:
• Community size
• Closeness
• Member’s profile (device, usage, ARPU, location)
• On-net/off-net share in the community

Suspected same subscriber: comparing two social networks to identify a single subscriber identity.

Further Utilization

• New product campaigns targeting community leaders, bridges, and high-value communities
• Retention program prioritization for community leaders, bridges, and high-value communities
• Product adoption campaigns for followers in communities that have already adopted the product
• Identifying rotational churners to be excluded from retention campaigns, or to evaluate dealers
• SN variables can be used to enhance another predictive model. For example, social network variables can increase the lift of a churn model for high-value customers (Imaduddin, 2014)


Social Network Graph Mining

By mining the graph of a social network, we can extract valuable information such as:

• Degree (in-degree, out-degree, max-degree). Degree is the number of edges attached to a vertex/node. A vertex with a high in-degree receives a lot of information from others; a vertex with a high out-degree sends a lot of information to others.

• PageRank. PageRank measures the importance of each vertex in a graph. If a Twitter user is followed by many others, the user will be ranked highly. For a CDR-based social network, reverse the graph direction before using the PageRank function to identify the important vertices.

• Local clustering coefficient (LCC). LCC represents how tightly knit a customer’s network is. The higher the LCC, the closer the network. The LCC is derived from the triangle count of each vertex.

$$LCC = \frac{\#\text{triangle}}{\binom{n}{2}}, \qquad n = \#\text{neighbour}$$
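As a quick check of the LCC formula, here is a minimal Python sketch that evaluates it for a single vertex. The function name is hypothetical, and returning 0.0 for a vertex with fewer than two neighbours is an assumption made here (the formula is undefined in that case), not something stated on the slide.

```python
from math import comb

def local_clustering_coefficient(triangles, neighbours):
    """LCC = #triangle / C(n, 2), where n is the number of neighbours."""
    # Assumption: a vertex with fewer than two neighbours cannot close
    # a triangle, so its LCC is reported as 0.0 instead of dividing by zero.
    if neighbours < 2:
        return 0.0
    return triangles / comb(neighbours, 2)

# A vertex with 4 neighbours has C(4, 2) = 6 possible neighbour pairs;
# if 3 of those pairs are connected, half of the possible triangles exist.
print(local_clustering_coefficient(3, 4))
```

An LCC of 1.0 means every pair of a subscriber's neighbours is also connected to each other.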


How To Build

Tools

Language

Platform


Let’s get our hands dirty!


Graph Example

[Figure: the example graph shown in its graph representation and its data representation]
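One way to picture the data representation is as two tables: one for vertex properties and one for the connections. The sketch below uses plain Python and made-up data. The ids, names, and the "calls" property are all hypothetical and are not the values from the slide.

```python
# Hypothetical example data; the graph drawn on the slide is not reproduced here.
# Vertex table: id -> properties (kept separate from the connections).
vertices = {
    1: {"name": "Ani"},
    2: {"name": "Budi"},
    3: {"name": "Citra"},
    4: {"name": "Dewi"},
}

# Edge table: one (source id, destination id, properties) row per directed link.
edges = [
    (1, 2, {"calls": 12}),
    (1, 3, {"calls": 3}),
    (2, 3, {"calls": 7}),
    (3, 1, {"calls": 1}),
    (4, 3, {"calls": 5}),
]

def to_adjacency(vertices, edges):
    """Turn the two tables into out-neighbour lists keyed by vertex id."""
    adjacency = {vid: [] for vid in vertices}
    for src, dst, _props in edges:
        adjacency[src].append(dst)
    return adjacency

adjacency = to_adjacency(vertices, edges)
for vid, targets in adjacency.items():
    print(vertices[vid]["name"], "->", [vertices[t]["name"] for t in targets])
```

Keeping vertex properties and connections in separate tables mirrors the advice later in the deck to join them only when needed.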


Script Example – Degree Information
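A minimal sketch of the degree computation in plain Python, assuming a small hypothetical directed edge list. It is not the GraphX script shown on the slide; it only illustrates how total, in-, and out-degree relate.

```python
from collections import Counter

# Hypothetical directed edge list (source id, destination id); not the slide's data.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

def degree_table(edges):
    """Return {id: (total_degree, in_degree, out_degree)} for a directed edge list."""
    in_deg = Counter(dst for _src, dst in edges)
    out_deg = Counter(src for src, _dst in edges)
    ids = sorted(set(in_deg) | set(out_deg))
    # A Counter returns 0 for ids it has never seen, so no special casing is needed.
    return {v: (in_deg[v] + out_deg[v], in_deg[v], out_deg[v]) for v in ids}

table = degree_table(edges)
for vid, (total, in_degree, out_degree) in table.items():
    print(vid, total, in_degree, out_degree)

# The vertex with the largest total degree.
max_degree_vertex = max(table, key=lambda v: table[v][0])
```

Each printed row has the same shape as the result described in this deck: id, total degree, in-degree, out-degree.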


Degree Information Result

[Figure: graph representation]

Result: (id, total-degree, in-degree, out-degree)


Script Example – PageRank
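The sketch below illustrates PageRank and the effect of reversing the graph, using a textbook power iteration in plain Python. It is not the GraphX script from the slide. The damping factor, iteration count, and edge list are assumptions, and because the scores here are normalized to sum to 1, the absolute values are not expected to match GraphX output.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed edge list; scores sum to 1."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

def reverse(edges):
    """Flip the direction of every edge."""
    return [(dst, src) for src, dst in edges]

# Hypothetical directed edge list (caller id, callee id); not the slide's data.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
forward = pagerank(edges)
backward = pagerank(reverse(edges))
for vid in sorted(forward):
    print(vid, round(forward[vid], 4), round(backward[vid], 4))
```

In this toy graph, vertex 4 only makes calls, so it scores lowest on the original graph and gains rank once the edges are reversed.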


PageRank Result

[Figure: graph representation]

Result: (id, PageRank) and (id, reverse PageRank)


Script Example – Triangle
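A minimal sketch of per-vertex triangle counting in plain Python, assuming a hypothetical edge list and treating edges as undirected. It is not the GraphX script from the slide; it only shows what the (id, #triangle) result means.

```python
from itertools import combinations

def triangle_counts(edges):
    """Count, for each vertex, the triangles it belongs to (edge direction ignored)."""
    neighbours = {}
    for src, dst in edges:
        if src != dst:  # self-loops cannot be part of a triangle
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    counts = {}
    for vertex, nbrs in neighbours.items():
        # A triangle through `vertex` is a pair of its neighbours that are
        # themselves connected to each other.
        counts[vertex] = sum(
            1 for a, b in combinations(sorted(nbrs), 2) if b in neighbours[a]
        )
    return counts

# Hypothetical edge list; (1, 3) and (3, 1) collapse into one undirected edge.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
triangles = triangle_counts(edges)
for vid in sorted(triangles):
    print(vid, triangles[vid])
```

Dividing each count by the number of possible neighbour pairs gives the local clustering coefficient described earlier.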


Triangle Counting Result

[Figure: graph representation]

Result: (id, #triangle)


Solving Real World Problem

• Define the vertices. Are they subscribers, web pages, Twitter accounts?

• Define the edges, i.e. how the vertices are connected. E.g. total call minutes in a month > 5 minutes, SMS > 10, etc.

• Identify the mega hubs. A mega hub is a vertex connected to a massive number of vertices (something like a call center or a spammer). Mega hubs can be removed or processed separately, depending on the problem.

• Identify the measures needed (PageRank, degree, LCC, triangle count, etc.).

• Build the data source (keep the vertex property data and the connection data separate, and join them later), and store it distributed on Hadoop.

• Build the code, run it, and feed the result back to the data warehouse or Hadoop for further utilization.
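The edge-definition and mega-hub steps above can be sketched in plain Python as follows. The call records, the subscriber labels, and the mega-hub cut-off of 4 neighbours are hypothetical; only the "more than 5 call minutes in a month" rule comes from the text. A real cut-off would depend on the data and the problem.

```python
from collections import defaultdict

# Hypothetical monthly call records: (caller, callee, minutes). Not real data.
call_records = [
    ("A", "B", 4.0), ("A", "B", 3.5), ("B", "C", 2.0), ("A", "C", 6.0),
    ("HUB", "A", 9.0), ("HUB", "B", 7.0), ("HUB", "C", 8.0), ("HUB", "D", 6.0),
]
MIN_MINUTES = 5.0    # edge rule from the text: total call minutes in a month > 5
MEGA_HUB_DEGREE = 4  # hypothetical cut-off for "connected to a massive number"

def build_edges(records, min_minutes):
    """Keep a (caller, callee) edge only if its total minutes exceed the threshold."""
    totals = defaultdict(float)
    for caller, callee, minutes in records:
        totals[(caller, callee)] += minutes
    return {pair for pair, total in totals.items() if total > min_minutes}

def find_mega_hubs(edges, min_neighbours):
    """Return the vertices that have at least `min_neighbours` distinct neighbours."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    return {v for v, nbrs in neighbours.items() if len(nbrs) >= min_neighbours}

edges = build_edges(call_records, MIN_MINUTES)
mega_hubs = find_mega_hubs(edges, MEGA_HUB_DEGREE)
# Here the mega hubs are simply dropped; they could also be processed separately.
clean_edges = {(s, d) for s, d in edges if s not in mega_hubs and d not in mega_hubs}
print(sorted(mega_hubs), sorted(clean_edges))
```

The remaining edges are what the chosen measures (degree, PageRank, triangle count, LCC) would then be computed on.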


References & Resources

References
• Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.
• Imaduddin, G. (2014). Evaluation and Improvement of Churn Model Using Customer Value and Social Network. Jakarta: Universitas Indonesia.

Resources
• Apache Spark Overview. https://spark.apache.org/docs/latest/
• Databricks Training Resources. https://databricks.com/spark-training-resources
• GraphX Programming Guide. https://spark.apache.org/docs/latest/graphx-programming-guide.html
• Social Network Analysis. http://en.wikipedia.org/wiki/Social_network_analysis
• Spark Scala API Doc. https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.package
• The Scala Programming Language. http://www.scala-lang.org/


Appendix


List of Graph Operations in GraphX
