scalable network analysis - university of texas at austin · scalable network analysis inderjit s....

Post on 20-May-2020

8 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Scalable Network Analysis

Inderjit S. DhillonUniversity of Texas at Austin

Dec 20, 2013COMAD, Ahmedabad, India

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

OutlineUnstructured Data - Scale & Diversity

Evolving NetworksMachine Learning Problems arising in Networks

Recommender SystemsLink PredictionSign Prediction

Formulation as Missing Value EstimationScalable Algorithms

NOMAD: Distributed matrix completion algorithmResults on ApplicationsConclusions

2

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Structured Data

• Data organized into fields: Relational databases, spreadsheets, XML

• Highly optimized for storage & retrieval (e.g. using SQL)

3

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Structured Data

• Focus is on data format, efficient storage & search

• Less or no uncertainty in semantics: e.g. businesses know the fields of the data

4

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Unstructured Data

5

Modern data is unstructured and diverse

Networks, Graphs Text

Images, Videos

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Unstructured Data

6

Much greater growth rate

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Unstructured Data

• Dynamic aspects of unstructured data:

• Constantly evolving

• Uncertainties abound: What should I ask of the data?

• Seek insights

• Heterogeneity renders traditional database models inadequate

7

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Unstructured Data

• Buzzwords - “Big Data” & “Data Science”

• Machine Learning: Predictive models for data

• Engineering perspective: Scale matters

8

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Network Graphs

9

APQ6

MIP

AQP2

AQP1 AQP5

GK

Social networks(Friendship) Bipartite networks

(Membership, Ratings, etc.)

Gene networks(Functional interaction)

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Graph Evolution

• Social networks are highly dynamic

• Constantly grow, change quickly over time

• Users arrive/leave, relationships form/dissolve

• Understanding graph evolution is important

10

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Graphs meet Machine Learning

• Network analysis: Understanding structure & evolution of networks

• Formulate predictive problems on the adjacency matrix of the graph

• Confluence of graph theory & machine learning

11

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Facebook growth

Scalable Network Analysis

12

Link Prediction

Recommender Systems

Netflix problem: 100M ratings, 0.5M users, 20K movies

A Toy Problem In Comparison!

13

Use

rs

Movies

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Recommender Systems

13

U

sers

Movies

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Link Prediction in Social Networks

• Problem: Infer missing relationships from a given snapshot of the network

14

Network at time T Network at time T + 1

?

?

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Predicting gene-disease links

15

?

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Signed Social Networks

• Sign Prediction Problem: Given a snapshot of the signed social network, predict the signs of missing edges

16

Like

Disli

ke

Like / Dislike?

Formulation as Missing Value Estimation

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Low-rank Matrix Completion

18

Use

rs

Movies

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Low-rank Matrix Completion

19

Use

rs

Movies

?

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Low-rank Matrix Completion

20

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Low-rank Matrix Completion

21

minW2Rm⇥k,H2Rn⇥k

X

(i,j)2⌦

(Aij �WiHTj )

2 + �(kWk2F + kHk2F )

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Low-rank Matrix Completion

22

?

minW2Rm⇥k,H2Rn⇥k

X

(i,j)2⌦

(Aij �WiHTj )

2 + �(kWk2F + kHk2F )

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Low-rank Matrix Completion

23

3.44

minW2Rm⇥k,H2Rn⇥k

X

(i,j)2⌦

(Aij �WiHTj )

2 + �(kWk2F + kHk2F )

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Link Prediction

• Can be posed as matrix completion problem

• Issue: Only positive relationships are observed

24

A(t) =

2

66664

. 1 1 . .1 . 1 1 11 1 . . 1. 1 . . .. 1 1 . .

3

77775⇡

2

66664

11111

3

77775

⇥1 1 1 1 1

Test Link Score

(4,5) 1

(1,4) 1

(3,4) 1

(1,5) 1

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Link Prediction

• Formulate Biased Matrix Completion Problem:

25

A(t) =

2

66664

. 1 1 . .1 . 1 1 11 1 . . 1. 1 . . .. 1 1 . .

3

77775⇡

2

66664

0.851.090.970.690.85

3

77775

⇥0.85 1.09 0.97 0.69 0.85

Test Link Score

(4,5) 0.59

(1,4) 0.59

(3,4) 0.67

(1,5) 0.72

minW2Rn⇥k,H2Rn⇥k

X

(i,j)2⌦

(Aij �WiHTj )

2 + ↵X

(i,j)/2⌦

(Aij �WiHTj )

2 + �(kWk2F + kHk2F )

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Signed Social Networks

• Social Balance [Harary,1953]: • In real-world signed networks, triangles tend to be

balanced

26

Friend of a friendis a friend

Enemy of an enemyis a friend

Balanced Not balanced

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Signed Social Networks

27

Theorem: All triangles in a network are balanced if and only if there exist two antagonistic groups.

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Signed Social Networks

• Relaxation: Weak balance

• Allow triangles with all negative edges

28

Weakly Balanced Not balanced

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Signed Social Networks

29

Theorem: All triangles in a network are weakly balanced if and only if there exist k antagonistic groups.

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Sign Prediction

• Sign inference can be posed as low-rank matrix completion

30

Theorem: A k-weakly balanced signed network has rank at most k.

Theorem: If there are no “small” groups, the underlying network can be exactly recovered, under certain

conditions.

K. Chiang et al. Prediction and Clustering in Signed Networks: A Local to Global Perspective. To appear in JMLR.

Scalable Algorithms

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Stochastic Gradient

• Time per update

• Effective for very large-scale problems

32

wi wi � ⌘((Aij �wTi hj)hj + �wi)

hj hj � ⌘((Aij �wTi hj)wi + �hj)

• Sample random index (i,j) and update corresponding factors:

O(k)

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Distributed Stochastic Gradient Descent (DSGD) [Gemulla et al. KDD 2011]

• Decoupled updates

• Easy to parallelize

• But communication & computation are interleaved

33

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

DSGD

34

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

DSGD

35

Curse of the last reducer

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

DSGD

36

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

• Goal: Keep CPU & network simultaneously busy.

• Asynchronous distributed solution.

37

Non-locking stOchastic Multi-machine algorithm for Asynchronous & Decentralized matrix factorization

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

38

w1

w2

w3

w4

h1

h2

h4h5 h3

Nomadic variables queue

Native variables

Workers

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

39

w1

w2

w3

w4

h1

h2 h4

h5 h3

h4

Nomadic variables queue

Native variables

Push

Workers

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

40

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

41

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

42

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

43

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

44

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

45

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

46

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

47

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

48

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

49

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD

50

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD Algorithm

1. Initialize: Randomly assign columns to worker queues2. Parallel Foreach q in {1,2,...,p}3. If queue[q] not empty then4. queue[q].pop()5. for ( ) in do6. Do SGD updates7. end for8. Sample q’ uniformly from {1,2,...,p}9. queue[q’].push( )10. end if11. Parallel End

51

(j,hj)

(j,hj)

i, j ⌦(q)j

hj

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

NOMAD Algorithm

1. Initialize: Randomly assign columns to worker queues2. Parallel Foreach q in {1,2,...,p}3. If queue[q] not empty then4. queue[q].pop()5. for ( ) in do6. Do SGD updates7. end for8. Sample q’ uniformly from {1,2,...,p}9. queue[q’].push( )10. end if11. Parallel End

52

(j,hj)

(j,hj)

i, j ⌦(q)j

hj

Concurrent object

Distributed setting:Write over the network

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Algorithm Complexity

• Average space required per worker:

• Average time for one sweep:

53

O(mk/p+ nk/p+ |⌦|/p)

O(|⌦|k/p)

Results on Applications

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Recommender Systems

55

Multicore Distributed

Netflix dataset: 2,649,429 users, 17,770 movies, ~100M ratings

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Recommender Systems

56

Synthetic dataset: ~85M users, 17,770 items, ~8.5B observations

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Link Prediction• Flickr dataset: 1.9M users & 42M links.

• Test set: sampled 5K users.

57

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Predicting Gene-Disease Links

58

0 10 20 30 40 50 60 70 80 90 1000

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

Rank

Pr(R

ank

<= x

)

Singleton gene rank

CATAPULT(Blom et al.,2013)KatzBiased MF

0 10 20 30 40 50 60 70 80 90 1000

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Rank

Pr(R

ank

<= x

)

Test pair rank

CATAPULT(Blom et al.,2013)KatzBiased MF

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Sign Prediction• Epinions dataset (+ve & -ve reviews):

• 131K nodes, 840K edges, 15% edges negative.

• MF-ALS is faster and achieves higher accuracy.

59

MF-ALS takes 455 secs on network with

1.1M nodes & 120M edges

Inderjit S. Dhillon University of Texas at Austin Scalable Network Analysis

Conclusions

• Rapid growth of unstructured data demands scalable machine learning solutions for analysis

• Machine learning problems arising in network analysis can be cast in the matrix completion framework

• Our proposed asynchronous distributed algorithm NOMAD outperforms state-of-the-art matrix completion solvers

• Beyond Matrix Completion: Asynchronous distributed framework for solving machine learning problems

60

top related