mining large graphscmu scs suspicious patterns in event data ucsd'16 (c) 2016, c. faloutsos 55...

103
CMU SCS Mining Large Graphs: Patterns, Anomalies, and Fraud Detection Christos Faloutsos CMU

Upload: others

Post on 20-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Mining Large Graphs: Patterns, Anomalies, and Fraud

Detection

Christos Faloutsos CMU

Page 2: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Thank you!

•  Prof. Julian McAuley

•  Nicholas Urioste

UCSD'16 (c) 2016, C. Faloutsos 2

Page 3: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 3

Roadmap

•  Introduction – Motivation – Why study (big) graphs?

•  Part#1: Patterns in graphs •  Part#2: time-evolving graphs; tensors •  Conclusions

UCSD'16

Page 4: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 4

Graphs - why should we care?

UCSD'16

~1B nodes (web sites) ~6B edges (http links) ‘YahooWeb graph’

Page 5: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 5

Graphs - why should we care?

>$10B; ~1B users

UCSD'16

Page 6: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 6

Graphs - why should we care?

Internet Map [lumeta.com]

Food Web [Martinez ’91]

UCSD'16

Page 7: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 7

Graphs - why should we care? •  web-log (‘blog’) news propagation •  computer network security: email/IP traffic and

anomaly detection •  Recommendation systems •  ....

•  Many-to-many db relationship -> graph

UCSD'16

Page 8: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Motivating problems •  P1: patterns? Fraud detection?

•  P2: patterns in time-evolving graphs / tensors

UCSD'16 (c) 2016, C. Faloutsos 8

time

destination

Page 9: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Motivating problems •  P1: patterns? Fraud detection?

•  P2: patterns in time-evolving graphs / tensors

UCSD'16 (c) 2016, C. Faloutsos 9

time

destination

Patterns anomalies

Page 10: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Motivating problems •  P1: patterns? Fraud detection?

•  P2: patterns in time-evolving graphs / tensors

UCSD'16 (c) 2016, C. Faloutsos 10

time

destination

Patterns anomalies*

* Robust Random Cut Forest Based Anomaly Detection on Streams Sudipto Guha, Nina Mishra , Gourav Roy, Okke Schrijvers, ICML’16

Page 11: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 11

Roadmap

•  Introduction – Motivation – Why study (big) graphs?

•  Part#1: Patterns & fraud detection •  Part#2: time-evolving graphs; tensors •  Conclusions

UCSD'16

Page 12: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UCSD'16 (c) 2016, C. Faloutsos 12

Part 1: Patterns, &

fraud detection

Page 13: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 13

Laws and patterns •  Q1: Are real graphs random?

UCSD'16

Page 14: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 14

Laws and patterns •  Q1: Are real graphs random? •  A1: NO!!

– Diameter (‘6 degrees’; ‘Kevin Bacon’) –  in- and out- degree distributions –  other (surprising) patterns

•  So, let’s look at the data

UCSD'16

Page 15: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 15

Solution# S.1 •  Power law in the degree distribution [Faloutsos x 3

SIGCOMM99]

log(rank)

log(degree)

internet domains

att.com

ibm.com

UCSD'16

Page 16: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 16

Solution# S.1 •  Power law in the degree distribution [Faloutsos x 3

SIGCOMM99]

log(rank)

log(degree)

-0.82

internet domains

att.com

ibm.com

UCSD'16

Page 17: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

17

S2: connected component sizes •  Connected Components – 4 observations:

Size

Count

(c) 2016, C. Faloutsos UCSD'16

1.4B nodes 6B edges

Page 18: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

18

S2: connected component sizes •  Connected Components

Size

Count

(c) 2016, C. Faloutsos UCSD'16

1) 10K x larger than next

Page 19: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

19

S2: connected component sizes •  Connected Components

Size

Count

(c) 2016, C. Faloutsos UCSD'16

2) ~0.7B singleton nodes

Page 20: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

20

S2: connected component sizes •  Connected Components

Size

Count

(c) 2016, C. Faloutsos UCSD'16

3) SLOPE!

Page 21: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

21

S2: connected component sizes •  Connected Components

Size

Count 300-size

cmpt X 500. Why? 1100-size cmpt

X 65. Why?

(c) 2016, C. Faloutsos UCSD'16

4) Spikes!

Page 22: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

22

S2: connected component sizes •  Connected Components

Size

Count

suspicious financial-advice sites

(not existing now)

(c) 2016, C. Faloutsos UCSD'16

Page 23: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 23

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs

– Patterns: Degree; Triangles – Anomaly/fraud detection – Graph understanding

•  Part#2: time-evolving graphs; tensors •  Conclusions

UCSD'16

Page 24: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 24

Solution# S.3: Triangle ‘Laws’

•  Real social networks have a lot of triangles

UCSD'16

Page 25: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 25

Solution# S.3: Triangle ‘Laws’

•  Real social networks have a lot of triangles –  Friends of friends are friends

•  Any patterns? –  2x the friends, 2x the triangles ?

UCSD'16

Page 26: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 26

Triangle Law: #S.3 [Tsourakakis ICDM 2008]

SN Reuters

Epinions X-axis: degree Y-axis: mean # triangles n friends -> ~n1.6 triangles

UCSD'16

Page 27: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Triangle counting for large graphs?

Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

27 UCSD'16 27 (c) 2016, C. Faloutsos

? ?

?

Page 28: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Triangle counting for large graphs?

Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

28 UCSD'16 28 (c) 2016, C. Faloutsos

Page 29: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Triangle counting for large graphs?

Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

29 UCSD'16 29 (c) 2016, C. Faloutsos

Page 30: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Triangle counting for large graphs?

Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

30 UCSD'16 30 (c) 2016, C. Faloutsos

Page 31: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Triangle counting for large graphs?

Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

31 UCSD'16 31 (c) 2016, C. Faloutsos

Page 32: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

MORE Graph Patterns

UCSD'16 (c) 2016, C. Faloutsos 32

✔ ✔ ✔

RTG: A Recursive Realistic Graph Generator using Random Typing Leman Akoglu and Christos Faloutsos. PKDD’09.

Page 33: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

MORE Graph Patterns

UCSD'16 (c) 2016, C. Faloutsos 33

•  Mary McGlohon, Leman Akoglu, Christos Faloutsos. Statistical Properties of Social Networks. in "Social Network Data Analytics” (Ed.: Charu Aggarwal)

•  Deepayan Chakrabarti and Christos Faloutsos, Graph Mining: Laws, Tools, and Case Studies Oct. 2012, Morgan Claypool.

Page 34: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 34

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs

– Patterns – Anomaly / fraud detection

•  No labels – spectral •  With labels: Belief Propagation

•  Part#2: time-evolving graphs; tensors •  Conclusions

UCSD'16

Patterns anomalies

Page 35: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

How to find ‘suspicious’ groups? •  ‘blocks’ are normal, right?

UCSD'16 (c) 2016, C. Faloutsos 35

fans

idols

Page 36: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Except that: •  ‘blocks’ are normal, right? •  ‘hyperbolic’ communities are more realistic

[Araujo+, PKDD’14]

UCSD'16 (c) 2016, C. Faloutsos 36

Page 37: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Except that: •  ‘blocks’ are usually suspicious •  ‘hyperbolic’ communities are more realistic

[Araujo+, PKDD’14]

UCSD'16 (c) 2016, C. Faloutsos 37

Q: Can we spot blocks, easily?

Page 38: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Except that: •  ‘blocks’ are usually suspicious •  ‘hyperbolic’ communities are more realistic

[Araujo+, PKDD’14]

UCSD'16 (c) 2016, C. Faloutsos 38

Q: Can we spot blocks, easily? A: Silver bullet: SVD!

Page 39: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Crush intro to SVD •  Recall: (SVD) matrix factorization: finds

blocks

UCSD'16 (c) 2016, C. Faloutsos 39

N fans

M idols

‘music lovers’ ‘singers’

‘sports lovers’ ‘athletes’

‘citizens’ ‘politicians’

~ + +

Page 40: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Crush intro to SVD •  Recall: (SVD) matrix factorization: finds

blocks

UCSD'16 (c) 2016, C. Faloutsos 40

N fans

M idols

‘music lovers’ ‘singers’

‘sports lovers’ ‘athletes’

‘citizens’ ‘politicians’

~ + +

Page 41: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Inferring Strange Behavior from Connectivity Pattern in Social Networks

Meng Jiang, Peng Cui, Shiqiang Yang (Tsinghua, Beijing)

Alex Beutel, Christos Faloutsos (CMU)

Page 42: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Lockstep and Spectral Subspace Plot •  Case #0: No lockstep behavior in random

power law graph of 1M nodes, 3M edges •  Random “Scatter”

Adjacency Matrix Spectral Subspace Plot

UCSD'16 42 (c) 2016, C. Faloutsos

Page 43: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Lockstep and Spectral Subspace Plot •  Case #1: non-overlapping lockstep •  “Blocks” “Rays”

Adjacency Matrix Spectral Subspace Plot

UCSD'16 43 (c) 2016, C. Faloutsos

Page 44: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Lockstep and Spectral Subspace Plot •  Case #2: non-overlapping lockstep •  “Blocks; low density” Elongation

Adjacency Matrix Spectral Subspace Plot

UCSD'16 44 (c) 2016, C. Faloutsos

Page 45: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Lockstep and Spectral Subspace Plot •  Case #3: non-overlapping lockstep •  “Camouflage” (or “Fame”) Tilting

“Rays” Adjacency Matrix Spectral Subspace Plot

UCSD'16 45 (c) 2016, C. Faloutsos

Page 46: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Lockstep and Spectral Subspace Plot •  Case #3: non-overlapping lockstep •  “Camouflage” (or “Fame”) Tilting

“Rays” Adjacency Matrix Spectral Subspace Plot

UCSD'16 46 (c) 2016, C. Faloutsos

Page 47: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Lockstep and Spectral Subspace Plot •  Case #4: ? lockstep •  “?” “Pearls”

Adjacency Matrix Spectral Subspace Plot

?

UCSD'16 47 (c) 2016, C. Faloutsos

Page 48: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Dataset

•  Tencent Weibo •  117 million nodes (with profile and UGC

data) •  3.33 billion directed edges

UCSD'16 49 (c) 2016, C. Faloutsos

Page 49: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Real Data

“Pearls” “Staircase”

“Rays” “Block”

UCSD'16 50 (c) 2016, C. Faloutsos

Page 50: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Real Data •  Spikes on the out-degree distribution

×

×UCSD'16 51 (c) 2016, C. Faloutsos

Page 51: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 52

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs

– Patterns – Anomaly / fraud detection

•  No labels – spectral methods –  Suspiciousness

•  With labels: Belief Propagation

•  Part#2: time-evolving graphs; tensors •  Conclusions UCSD'16

Page 52: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Suspicious Patterns in Event Data

UCSD'16 (c) 2016, C. Faloutsos 53

A General Suspiciousness Metric for Dense Blocks in Multimodal Data, Meng Jiang, Alex Beutel, Peng Cui, Bryan Hooi, Shiqiang Yang, and Christos Faloutsos, ICDM, 2015.

2-modes

n-modes

?

? ?

Page 53: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Suspicious Patterns in Event Data

UCSD'16 (c) 2016, C. Faloutsos 54

Which is more suspicious? 20,000 Users Retweeting same 20 tweets 6 times each All in 10 hours

225 Users Retweeting same 1 tweet 15 times each All in 3 hours All from 2 IP addresses

vs.

ICDM 2015

Answer: volume * DKL(p|| pbackground)

Page 54: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Suspicious Patterns in Event Data

UCSD'16 (c) 2016, C. Faloutsos 55

A General Suspiciousness Metric for Dense Blocksin Multimodal Data

AnonymousAnonymousAnonymous

AnonymousAnonymousAnonymous

AnonymousAnonymousAnonymous

AnonymousAnonymousAnonymous

Abstract—Which seems more suspicious: 5,000 tweets from200 users on 5 IP addresses, or 10,000 tweets from 500 userson 500 IP addresses but all with the same hashtag and all in10 minutes? The literature has many methods that try to finddense blocks in matrices, and, recently, tensors, but no methodgives a principled way to score the suspiciouness of dense blockswith different numbers of modes and rank them to draw humanattention accordingly. Dense blocks are worth inspecting, typicallyindicating fraud, emerging trends, or some other noteworthydeviation from the usual. Our main contribution is that we showhow to unify these methods and how to give a principled answer toquestions like the above. Specifically, (a) we give a list of axiomsthat any metric of suspicousness should satisfy; (b) we propose anintuitive, principled metric that satisfies the axioms, and is fast tocompute; (c) we propose CROSSSPOT, an algorithm to spot denseregions, and sort them in importance (“suspiciousness”) order.Finally, we apply CROSSSPOT to real data, where it improvesthe F1 score over previous techniques by 68% and finds retweet-boosting and hashtag-hijacking in social datasets spanning 0.3

billion posts.

I. INTRODUCTION

Imagine your job at Twitter is to detect when fraudstersare trying to manipulate the most popular tweets for a giventrending topic. Given time pressure, which is more worthy ofyour investigation: 2,000 Twitter users, all retweeting the same20 tweets, 4 to 6 times each; or 225 Twitter users, retweetingthe same 1 tweet, 10 to 15 times each? Now, what if the latterbatch of activity happened within 3 hours, while the formerspanned 10 hours? What if all 225 users of the latter groupused the same 2 IP addresses?

Figure 1 shows an example of these patterns from TencentWeibo, one of the largest microblogging platforms in China;our method CROSSSPOT detected a block of 225 users, using2 IP addresses (“blue circle” and “red cross”), retweeting thesame tweet 27K times, within 200 minutes. Further, manualinspection shows that several of these users get activated every5 minutes. This type of lockstep behavior is suspicious (say,due to automated scripts), and it leads to dense blocks, asin Figure 1. These blocks may span several modes (user-id,timestamp, hashtag, etc.). Although our main motivation isfraud detection in a Twitter-like setting, our proposed approachis suitable for numerous other settings, like distributed-denial-of-service (DDoS) attacks, link fraud, click fraud, even health-insurance fraud, as we discuss next.

Thus, the core question we ask in this paper is: what isthe right way to compare the severity/suspiciousness/surpriseof two dense blocks, that span 2 or more modes? Informally,the problem is:

200 Minutes

225Users

27,313Retweets

...

· · ·

10:09

10:14

10:19

10:24

10:32

10:37

11:04

15:37

21:13

. . .“John123” l : : l l l“Joe456” l : l : : : :“Job789” l : l : l l l“Jack135” : l : : : l : l“Jill246” : l l l :

“Jane314” : : l l l l l“Jay357” l l l l : l

.

.

.

Fig. 1. Density in multiple modes is suspicious. Left: A dense block of 225users on Tencent Weibo (Chinese Twitter) retweeting one tweet 27,313 timesfrom 2 IP addresses over 200 minutes. Right: magnification of a subset of thisblock. l and : indicate the two IP addresses used. Notice how synchronizedthe behavior is, across several modes (IP-address, user-id, timestamp).

Informal Problem 1 (Suspiciousness score) Given a K-mode dataset (tensor) X , with counts of events (that arenon-negative integer values), and two subtensors Y

1

and Y2

,which is more suspicious and worthy of further investigation?

Why multimodal data (tensor): Graphs and social networkshave attracted huge interest - and they are perfectly modeled asK=2 mode datasets, that is, matrices. With K=2 modes we canmodel Twitter’s “who-follows-whom” network [1][2], Face-book’s “who-friends-whom” and “who-Likes-what” graphs[3], eBay’s “who-buys-from-whom” graph [4], financial ac-tivities of “who-trades-what-stocks”, and scientific relationsof “who-cites-whom.” Several high-impact datasets make useof higher mode relations. With K=3 modes, we can considerhow all of the above graphs change over time or what wordsare used in product reviews on eBay or Amazon. With K=4modes, we can analyze network traces for intrusion detectionand distributed denial of service (DDoS) attacks by lookingfor patterns in the source IP, destination IP, destination port,and timestamp [5][6]. Health-insurance fraud detection isanother example, with (patient-id, doctor-id, prescription-id,timestamp) where corrupt doctors prescribe fake, expensivemedicine to senile or corrupt patients [7][8].

Why are dense regions worth inspecting: Dense regionsare surprising in all of the examples above. Past work hasrepeatedly found that dense regions in these tensors correspondto suspicious, lockstep behavior: Purchased Page Likes onFacebook result in a few users “Liking” the same “Pages”always at the same time (when the order for the Page Likes isplaced) [3]. Spammers paid to write deceptively high (or low)reviews for restaurants or hotels will reuse the same accountsand often even the same text [9][10]. Zombie followers, botnetswho are set up to build social links, will inflate the numberof followers to make their customers seem more popularthan they actually are [1][11]. This high-density outcome hasa reason: Spammers have constrained resources (users, IP

Retweeting: “Galaxy Note Dream Project: Happy Happy Life Traveling the World”

Recall0 0.5 1

Prec

isio

n

0

0.2

0.4

0.6

0.8

1

CrossSpotHOSVD (r = 20)HOSVD (r = 10)HOSVD (r = 5)MAF

CrossSpot

Mass of injected 30*30*30 blocks512 256 128 64 32 16

Rec

all

0

0.2

0.4

0.6

0.8

1

HOSVDCrossSpot (HOSVD seed)

CrossSpot

HOSVD

(a) Performance (b) Recall with HOSVD seed

Fig. 3. Finding dense blocks: CROSSSPOT outperforms baselines in finding3-mode blocks, and directly method improves the recall on top of HOSVD.

• SVD has small recall but high precision. However, theSVD can hardy catch small, sparse injected blocks suchas 30⇥30 submatrices of mass 16 and 32, even thoughthey are denser than the background. Higher decomposi-tion rank brings higher classification accuracy.

Finding dense high-order blocks in multimodal data: Wegenerate random tensor data with parameters as (1) the numberof modes k=3, (2) the size of data N

1

=1,000, N2

=1,000and N

3

=1,000 and (3) the mass of data C=10,000. Weinject b=6 blocks of k0=3 modes into the random data, so,I = {1, 2, 3}. Each block has size 30⇥30⇥30 and massc2{16, 32, 64, 128, 256, 512}. The task is again to classify thetensor entries into suspicious and normal classes. Figure 3(a)reports the performances of CROSSSPOT and baselines. Weobserve that in order to find all the six 3-mode injected blocks,our proposed CROSSSPOT has better performance in precisionand recall than baselines. The best F1 score CROSSSPOT givesis 0.891, which is 46.0% higher than the F1 score given bythe best of HOSVD (0.610). If we use the results of HOSVDas seeds to CROSSSPOT, the best F1 score of CROSSSPOTreaches 0.979. Figure 3(b) gives the recall value of everyinjected block. We observe that CROSSSPOT improves therecall over HOSVD, especially on slightly sparser blocks.

Finding dense low-order blocks in multimodal data: Wegenerate random tensor data with parameters as (1) the numberof modes k=3, (2) the size of data N

1

=1,000, N2

=1,000 andN

3

=1,000 and (3) the mass of data C=10,000. We inject b=4blocks into the random data:

• Block #1: The number of modes is k01

=3 and I1

={1,2,3}.The size is 30⇥30⇥30 and the block’s mass is c

1

=512.• Block #2: The number of modes is k0

2

=2 and I2

={1,2}.The size is 30⇥30⇥1,000 and the block’s mass is c

2

=512.• Block #3: k0

3

=2, I3

={1,3}; 30⇥1,000⇥30 of c3

=512.• Block #4: k0

4

=2, I4

={2,3}; 1,000⇥30⇥30 of c4

=512.

Note, blocks 2-4 are dense in only 2 modes and random in thethird mode. Table V report the classification performances ofCROSSSPOT and baselines. We show the overall evaluations(precision, recall and F1 score) and recall value of every block.We observe that CROSSSPOT has 100% recall in catching the3-mode block #1, while the baselines have 85-95% recall.More impressively, CROSSSPOT successfully catches the 2-mode blocks, where HOSVD has difficulty and low recall.The F1 score of overall evaluation is as large as 0.972 with68.8% improvement.

Testing robustness of the random seed number: We test howthe performance of our CROSSSPOT improves when we usemore seed blocks in the low-order block detection experiments.

(a) Robustness (b) Convergence

Fig. 4. CROSSSPOT is robust to the number of random seeds. In detectingthe 4 low-order blocks, when we use 41 seeds, the best F1 score has reportedthe final result of as many as 1,000 seeds. CROSSSPOT converges very fast:the average number of iterations is 2.87.

TABLE VI. BIG DENSE BLOCKS WITH TOP METRIC VALUESDISCOVERED IN THE RETWEETING DATASET.

# User⇥tweet⇥IP⇥minute Mass c Suspiciousness

CROSSSPOT1 14⇥1⇥2⇥1,114 41,396 1,239,8652 225⇥1⇥2⇥200 27,313 777,7813 8⇥2⇥4⇥1,872 17,701 491,323

HOSVD1 24⇥6⇥11⇥439 3,582 131,1132 18⇥4⇥5⇥223 1,942 74,0873 14⇥2⇥1⇥265 9,061 381,211

Figure 4(a) shows the best F1 score for different numbers ofrandom seeds. We find that when we use 41 random seeds, thebest F1 score is close to the results when we use as many as1,000 random seeds. Thus, once we exceed a moderate numberof random seeds, the performance is fairly robust to the numberof random seeds.

Efficiency analysis: CROSSSPOT can be parallelized intomultiple machines to search dense blocks with different setsof random seeds. The time cost of every iteration is linearin the number of non-zero entries in the multimodal dataas we have discussed in Section V. Figure 4(b) reports thecounts of iterations in the procedure of 1000 random seeds.We observe that usually CROSSSPOT takes 2 or 3 iterations tofinish the local search. Each iteration takes only 5.6 seconds.Tensor decompositions such as HOSVD and PARAFAC usedin MAF often take more time. On the same machine, HOSVDmethods of rank r=5, 10 and 20 take 280, 1750, 34,510seconds respectively. From Table V and Figure 4(a), evenwithout parallelization, we know that CROSSSPOT takes only230 seconds to have the best F1 score 0.972, while HOSVDneeds more time (280 seconds if r=5) to have a much smallerF1 score 0.324.

D. Retweeting Boosting

Table VI shows big, dense block patterns of TencentWeibo retweeting dataset. CROSSSPOT reports blocks of highmass and high density. For example, we spot that 14 usersretweet the same content for 41,396 times on 2 IP addressesin 19 hours. Their coordinated, suspicious behaviors resultin a few tweets that seem extremely popular. We observethat CROSSSPOT catches more suspicious (bigger and denser)blocks than HOSVD does: HOSVD evaluates the number ofretweets per user, item, IP, or minute, but does not considerthe block’s density, mass nor the background.

Table VII shows an example of retweeting boosting fromthe big, dense 225⇥1⇥2⇥200 block reported by our proposedCROSSSPOT. A group of users (e.g., A, B, C) retweet thesame message “Galaxy note dream project: Happy happy life

ICDM 2015

Page 55: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 56

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs

– Patterns – Anomaly / fraud detection

•  No labels – spectral methods •  With labels: Belief Propagation

•  Part#2: time-evolving graphs; tensors •  Conclusions

UCSD'16

Page 56: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UCSD'16 (c) 2016, C. Faloutsos 57

E-bay Fraud detection

w/ Polo Chau & Shashank Pandit, CMU [www’07]

Page 57: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UCSD'16 (c) 2016, C. Faloutsos 58

E-bay Fraud detection

Page 58: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UCSD'16 (c) 2016, C. Faloutsos 59

E-bay Fraud detection

Page 59: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UCSD'16 (c) 2016, C. Faloutsos 60

E-bay Fraud detection - NetProbe

Page 60: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Popular press

And less desirable attention: •  E-mail from ‘Belgium police’ (‘copy of

your code?’) UCSD'16 (c) 2016, C. Faloutsos 61

Page 61: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 62

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs

– Patterns – Anomaly / fraud detection

•  No labels - Spectral methods •  w/ labels: Belief Propagation – closed formulas

•  Part#2: time-evolving graphs; tensors •  Conclusions

UCSD'16

Page 62: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UnifyingGuilt-by-AssociationApproaches:TheoremsandFastAlgorithms

Danai Koutra* U Kang

Hsing-Kuo Kenneth Pao

Tai-You Ke Duen Horng (Polo) Chau

Christos Faloutsos

ECML PKDD, 5-9 September 2011, Athens, Greece

(*KDD dissertation award, 2016)

Page 63: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Problem Definition:

GBA techniques

(c) 2016, C. Faloutsos 64

Given: Graph; & few labeled nodes Find: labels of rest (assuming network effects)

UCSD'16

Page 64: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Are they related? •  RWR (Random Walk with Restarts)

–  google’s pageRank (‘if my friends are important, I’m important, too’)

•  SSL (Semi-supervised learning) – minimize the differences among neighbors

•  BP (Belief propagation) –  send messages to neighbors, on what you

believe about them

UCSD'16 (c) 2016, C. Faloutsos 65

Page 65: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Are they related? •  RWR (Random Walk with Restarts)

–  google’s pageRank (‘if my friends are important, I’m important, too’)

•  SSL (Semi-supervised learning) – minimize the differences among neighbors

•  BP (Belief propagation) –  send messages to neighbors, on what you

believe about them

UCSD'16 (c) 2016, C. Faloutsos 66

YES!

Page 66: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Correspondence of Methods

(c) 2016, C. Faloutsos 67

Method Matrix Unknown known RWR [I – c AD-1] × x = (1-c)ySSL [I + a(D - A)] × x = y

FABP [I + a D - c’A] × bh = φh

0 1 0 1 0 1 0 1 0

? 0 1 1

1 1 1

d1 d2

d3 final

labels/ beliefs

prior labels/ beliefs

adjacency matrix

UCSD'16

Page 67: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Problem: e-commerce ratings fraud •  Given a heterogeneous

graph on users, products, sellers and positive/negative ratings with “seed labels”

•  Find the top k most fraudulent users, products and sellers

UCSD'16 (c) 2016, C. Faloutsos 68

Dhivya Eswaran, Stephan Günnemann, Christos Faloutsos, “ZooBP: Belief Propagation for Heterogeneous Networks”, In submission to VLDB 2017

Page 68: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

ZooBP: Salient features

(c) 2016, C. Faloutsos 69

Fast approximate heterogeneous belief propagation with precise convergence guarantees!

Near-perfect accuracy Scalable: linear in graph size

UCSD'16

Dhivya Eswaran, Stephan Günnemann, Christos Faloutsos, “ZooBP: Belief Propagation for Heterogeneous Networks”, In submission to VLDB 2017

Page 69: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Summary of Part#1 •  *many* patterns in real graphs

– Power-laws everywhere – Long (and growing) list of tools for anomaly/

fraud detection

UCSD'16 (c) 2016, C. Faloutsos 70

Patterns anomalies

Page 70: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 71

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs •  Part#2: time-evolving graphs; tensors

– P2.1: time-evolving graphs – P2.2: inter-arrival time patterns

•  Conclusions

UCSD'16

Page 71: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

UCSD'16 (c) 2016, C. Faloutsos 72

Part 2: Time evolving

graphs; tensors

Page 72: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1:

– Given who calls whom, and when – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 73

smith

Page 73: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1:

– Given who calls whom, and when – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 74

Page 74: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1:

– Given who calls whom, and when – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 75

Mon Tue

Page 75: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1:

– Given who calls whom, and when – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 76 callee

caller

Page 76: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1’:

– Given author-keyword-date – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 77 keyword

author

MANY more settings, with >2 ‘modes’

Page 77: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1’’:

– Given subject – verb – object facts – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 78 object

subject

MANY more settings, with >2 ‘modes’

Page 78: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Graphs over time -> tensors! •  Problem #2.1’’’:

– Given <triplets> – Find patterns / anomalies

UCSD'16 (c) 2016, C. Faloutsos 79 mode2

mode1

MANY more settings, with >2 ‘modes’ (and 4, 5, etc modes)

Page 79: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 82

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs •  Part#2: time-evolving graphs; tensors

– P2.1: time-evolving graphs – P2.2: inter-arrival time patterns

•  Conclusions

UCSD'16

Page 80: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Crush intro to SVD •  Recall: (SVD) matrix factorization: finds

blocks

UCSD'16 (c) 2016, C. Faloutsos 83

N fans

M idols

‘music lovers’ ‘singers’

‘sports lovers’ ‘athletes’

‘citizens’ ‘politicians’

~ + +

Page 81: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Answer to both: tensor

factorization •  PARAFAC decomposition

UCSD'16 (c) 2016, C. Faloutsos 84

= + + subject

object

politicians artists athletes

Page 82: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Answer: tensor factorization •  PARAFAC decomposition •  Results for who-calls-whom-when

–  4M x 15 days

UCSD'16 (c) 2016, C. Faloutsos 85

= + + caller

callee

?? ?? ??

Page 83: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Anomaly detection in time-

evolving graphs

•  Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks

~200 calls to EACH receiver on EACH day!

1 caller 5 receivers 4 days of activity

UCSD'16 86 (c) 2016, C. Faloutsos

=

Page 84: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Anomaly detection in time-

evolving graphs

•  Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks

~200 calls to EACH receiver on EACH day!

1 caller 5 receivers 4 days of activity

UCSD'16 87 (c) 2016, C. Faloutsos

=

Page 85: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Anomaly detection in time-

evolving graphs

•  Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks

~200 calls to EACH receiver on EACH day! UCSD'16 88 (c) 2016, C. Faloutsos

=

Miguel Araujo, Spiros Papadimitriou, Stephan Günnemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos Papalexakis, Danai Koutra. Com2: Fast Automatic Discovery of Temporal (Comet) Communities. PAKDD 2014, Tainan, Taiwan.

Page 86: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 89

Roadmap

•  Introduction – Motivation – Why study (big) graphs?

•  Part#1: Patterns in graphs •  Part#2: time-evolving graphs;

– P2.1: tensors – P2.2: Inter-arrival time patterns

•  Acknowledgements and Conclusions

UCSD'16

Page 87: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

RSC: Mining and Modeling Temporal Activity in Social Media

Alceu F. Costa* Yuto Yamaguchi Agma J. M. Traina

Caetano Traina Jr. Christos Faloutsos

Universidade de São Paulo

KDD 2015 – Sydney, Australia

*[email protected]

Page 88: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Reddit Dataset Time-stamp from comments 21,198 users 20 Million time-stamps

Twitter Dataset Time-stamp from tweets 6,790 users 16 Million time-stamps

Pattern Mining: Datasets

For each user we have: Sequence of postings time-stamps: T = (t1, t2, t3, …) Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …)

91 t1 t2 t3 t4

∆1 ∆2 ∆3

time UCSD'16 (c) 2016, C. Faloutsos

Page 89: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Pattern Mining Pattern 1: Distribution of IAT is heavy-tailed

Users can be inactive for long periods of time before making new postings

IAT Complementary Cumulative Distribution Function (CCDF) (log-log axis)

92

105 106 10710−5

100

6, IAT (seconds)

CC

DF,

P(x

* 6

)

1 day 10 days105 106 107

10−5

100

6, IAT (seconds)

CC

DF,

P(x

* 6

)

1 day10 days

Reddit Users Twitter Users UCSD'16 (c) 2016, C. Faloutsos

Page 90: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Pattern Mining Pattern 1: Distribution of IAT is heavy-tailed

Users can be inactive for long periods of time before making new postings

IAT Complementary Cumulative Distribution Function (CCDF) (log-log axis)

93

105 106 10710−5

100

6, IAT (seconds)

CC

DF,

P(x

* 6

)

1 day 10 days105 106 107

10−5

100

6, IAT (seconds)

CC

DF,

P(x

* 6

)

1 day10 days

Reddit Users Twitter Users UCSD'16 (c) 2016, C. Faloutsos

No surprises – Should we give up?

Page 91: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Human? Robots?

log

linear

UCSD'16 (c) 2016, C. Faloutsos 95

Page 92: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Human? Robots?

log

linear 2’ 3h 1day

UCSD'16 (c) 2016, C. Faloutsos 96

Page 93: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Experiments: Can RSC-Spotter Detect

Bots? Precision vs. Sensitivity Curves

Good performance: curve close to the top

97

0 0.2 0.4 0.6 0.8 10

0.20.40.60.8

1

Sensitivity (Recall)

Prec

isio

n

0 0.2 0.4 0.6 0.8 10

0.20.40.60.8

1

Sensitivity (Recall)

Prec

isio

n

RSC−Spotter

IAT Hist.Entropy [6]Weekday Hist.

All Features

Precision > 94% Sensitivity > 70%

With strongly imbalanced datasets # humans >> # bots

Twitter

UCSD'16 (c) 2016, C. Faloutsos

Page 94: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS Experiments: Can RSC-Spotter Detect

Bots? Precision vs. Sensitivity Curves

Good performance: curve close to the top

98

0 0.2 0.4 0.6 0.8 10

0.20.40.60.8

1

Sensitivity (Recall)

Prec

isio

n

RSC−Spotter

IAT Hist.Entropy [6]Weekday Hist.

All Features

Precision > 96% Sensitivity > 47% With strongly imbalanced datasets # humans >> # bots

Reddit

UCSD'16 (c) 2016, C. Faloutsos

Page 95: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Part 2: Conclusions

•  Time-evolving / heterogeneous graphs -> tensors

•  PARAFAC finds patterns •  (GigaTensor/HaTen2 -> fast & scalable)

UCSD'16 99 (c) 2016, C. Faloutsos

=

Page 96: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 100

Roadmap

•  Introduction – Motivation – Why study (big) graphs?

•  Part#1: Patterns in graphs •  Part#2: time-evolving graphs; tensors •  Acknowledgements and Conclusions

UCSD'16

Page 97: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 101

Thanks

UCSD'16

Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Disclaimer: All opinions are mine; not necessarily reflecting the opinions of the funding agencies

Page 98: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 102

Cast

Akoglu, Leman

Chau, Polo

Kang, U

Hooi, Bryan

UCSD'16

Koutra, Danai

Beutel, Alex

Papalexakis, Vagelis

Shah, Neil

Araujo, Miguel

Song, Hyun Ah

Eswaran, Dhivya

Shin, Kijung

Page 99: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 103

CONCLUSION#1 – Big data •  Patterns Anomalies

•  Large datasets reveal patterns/outliers that are invisible otherwise

UCSD'16

Page 100: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 104

CONCLUSION#2 – tensors

•  powerful tool

UCSD'16

=

1 caller 5 receivers 4 days of activity

Page 101: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

(c) 2016, C. Faloutsos 105

References •  D. Chakrabarti, C. Faloutsos: Graph Mining – Laws,

Tools and Case Studies, Morgan Claypool 2012 •  http://www.morganclaypool.com/doi/abs/10.2200/

S00449ED1V01Y201209DMK006

UCSD'16

Page 102: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

TAKE HOME MESSAGE: Cross-disciplinarity

UCSD'16 (c) 2016, C. Faloutsos 106

=

Page 103: Mining Large GraphsCMU SCS Suspicious Patterns in Event Data UCSD'16 (c) 2016, C. Faloutsos 55 A General Suspiciousness Metric for Dense Blocks in Multimodal Data Anonymous Anonymous

CMU SCS

Cross-disciplinarity

UCSD'16 (c) 2016, C. Faloutsos 107

=

Thank you!

http://www.cs.cmu.edu/~christos/TALKS/16-10-UCSD/