rm world 2014: analytic network package for rapidminer

A Basic Analytic Network Package for RapidMiner

Balázs Kósa, Márton Balassi, Péter Englert, Gábor Rácz, Zoltán Pusztai, Attila Kiss

Why networks?

Hundreds of networks from entirely distinct areas of life share some basic structural properties, along with the implications of these structural features.

Networks

neural network of the worm Caenorhabditis elegans

the power grid of the US

link structure of the web etc.

Properties

degree distribution

robustness

presence of denser subgraphs (communities)

the average distance is very low

shortest paths can be found efficiently etc.

When Luther had been captured on 4th May 1521,

Erasmus was already informed the day after.

The distance is around 483 km ~ 282 mi.

The Package I.

The algorithms of our package can be categorized into

two groups

methods related to nodes and edges

algorithms providing an overall description of the graph.

The first group can be further divided into

algorithms indicating the importance of a node, edge

methods measuring the similarity between nodes.

The Package II.

Importance

degree

betweenness

Linerank

influence maximization.

Similarity

cosine similarity

a regular similarity.

The Package III.

Overall descriptors

number of triangles

clustering coefficient

strongly connected components.

Outline

Explanation of measures, models and their application

fields

Some technical details

How the package should be used in RapidMiner

Experimental running times

Two test cases

Degree

The degree of a node may predict its influence.

Several networks have been reported to have a roughly

power-law degree distribution: $P(k) \sim k^{-\beta}$ ($2 \le \beta \le 3$).

P(k): the fraction of nodes in the network having k edges to other

nodes.

Only the tail of the degree distribution obeys the power law.

This means that most of the nodes have few connections.

Some privileged nodes possess a large fraction of

connections.
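The definition of P(k) above can be illustrated with a minimal Python sketch. The adjacency-dictionary representation and the toy graph are assumptions made for illustration; they are not taken from the package.

```python
from collections import Counter

def degree_distribution(adjacency):
    """Return {k: P(k)}, the fraction of nodes that have exactly k neighbours."""
    degrees = [len(neighbours) for neighbours in adjacency.values()]
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: node 0 acts as a hub, most other nodes have one edge.
toy = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0, 5}, 5: {4}}
dist = degree_distribution(toy)
for k, p in dist.items():
    print(k, round(p, 3))
```

Even in this tiny example most nodes have a single connection while one node holds a large share of the edges.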

Degree Distribution - Example

US cities and the

main roads between

them.

US airports and the

flights between them.

Source: Albert-László Barabási: Linked. 2003.

Betweenness

Informally, betweenness measures the amount of

information that may pass through a node.

Formally: $v_{betw} = \sum_{u,w} \frac{b_{u,v,w}}{b_{u,w}}$

$b_{u,w}$: the number of shortest paths between nodes u and w

$b_{u,v,w}$: the number of those shortest paths between nodes u and w that pass through v.

The definition on edges is similar.
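The formula can be evaluated directly on small graphs. The following Python sketch is a brute-force illustration of the definition, assuming an unweighted graph stored as an adjacency dictionary in which every node is a key; it is not the O(nm) algorithm and not the package's implementation. It sums over ordered pairs (u, w), so on an undirected graph every pair is counted twice.

```python
from collections import deque

def bfs_counts(adj, s):
    """Return (dist, sigma): BFS distances from s and shortest-path counts."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:            # first time y is reached
                dist[y] = dist[x] + 1
                sigma[y] = 0
                queue.append(y)
            if dist[y] == dist[x] + 1:   # x is a predecessor of y on a shortest path
                sigma[y] += sigma[x]
    return dist, sigma

def betweenness(adj):
    """Sum b(u,v,w) / b(u,w) over all ordered pairs (u, w) for every node v."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for u in adj:
        dist_u, sigma_u = info[u]
        for w in dist_u:
            if w == u:
                continue
            for v in adj:
                if v == u or v == w or v not in dist_u:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest u-w path iff the distances add up
                if w in dist_v and dist_u[v] + dist_v[w] == dist_u[w]:
                    score[v] += sigma_u[v] * sigma_v[w] / sigma_u[w]
    return score

path = {1: {2}, 2: {1, 3}, 3: {2}}   # hypothetical path graph 1 - 2 - 3
scores = betweenness(path)
```

In the path graph only the middle node lies on a shortest path between two other nodes, so it is the only node with a non-zero score.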

Why is betweenness useful?

• By means of betweenness one may

detect bridges.

• Bridges are edges connecting

communities.

• They are the most "vulnerable" parts

of a network.

Linerank I.

The fastest algorithm calculating betweenness runs in

O(nm) time (n: # of nodes, m: # of edges).

For big data, this is prohibitively expensive.

As a substitute the researchers of IBM and Google

developed Linerank.

U Kang, Spiros Papadimitriou, Jimeng Sun, Hanghang Tong:

Centralities in Large Networks: Algorithms and

Observations. 2011.

Linerank II.

[Figure: two small example graphs with nodes labelled 1 to 5]

• In the calculation of Linerank

Pagerank is computed on the line graph

of the original graph.

• Line graph…

• Roughly, the Pagerank value of a

node is the probability that a random

walker is at that node after a long

period of time.

• In the last step for each node of the

original graph the previous Pagerank

values of its incident edges are

aggregated.
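The three steps above (line graph, Pagerank, aggregation) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of the cited paper and not the package's code: every undirected edge is treated as two directed arcs, the edge list contains no duplicates or self-loops, Pagerank is computed by plain power iteration, and the damping factor 0.85 and the function name are hypothetical choices.

```python
def linerank_sketch(edges, damping=0.85, iterations=50):
    """Pagerank on the (directed) line graph, aggregated back to the nodes."""
    # Each undirected edge becomes two arcs; the arcs are the line-graph nodes.
    arcs = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    leaving = {}
    for u, v in arcs:
        leaving.setdefault(u, []).append((u, v))

    rank = {arc: 1.0 / len(arcs) for arc in arcs}
    for _ in range(iterations):
        new_rank = {arc: (1.0 - damping) / len(arcs) for arc in arcs}
        for (u, v) in arcs:
            # In the line graph, arc (u, v) points to every arc that leaves v.
            successors = leaving[v]      # never empty, because (v, u) exists
            share = damping * rank[(u, v)] / len(successors)
            for nxt in successors:
                new_rank[nxt] += share
        rank = new_rank

    # Last step: each node collects the Pagerank values of its incident arcs.
    score = {}
    for (u, v), value in rank.items():
        score[u] = score.get(u, 0.0) + value
        score[v] = score.get(v, 0.0) + value
    return score

star = [(0, 1), (0, 2), (0, 3)]   # hypothetical star graph with centre 0
scores = linerank_sketch(star)
```

In the star graph every arc touches the centre, so the centre collects the whole Pagerank mass while each leaf collects one third of it.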

Influence Maximization I.

The goal is to find the most influential subset of nodes

under a given information diffusion model and size

restriction.

A possible application field is viral marketing.

As a diffusion model most of the research papers

focused on the Independent Cascade Model.

Independent Cascade Model

• Each edge is labelled with a

probability.

• We start with an active set of

nodes (0th step).

• In the ith step those nodes that became

active in the (i -1)th step may activate

their neighbours.

• Node u activates its neighbour v

with the probability assigned to their

linking edge.

[Figure: example graph whose edges are labelled with the activation probabilities 0.2, 0.5, 0.1, 0.7, 0.3, 0.8, 0.2, 0.1 and 0.2]
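One run of the Independent Cascade Model can be sketched in a few lines of Python. The nested dictionary `prob[u][v]` holding the edge probabilities, the function name and the toy graph are assumptions made for illustration.

```python
import random

def simulate_cascade(prob, seeds, rng):
    """One run of the Independent Cascade Model; returns the set of active nodes."""
    active = set(seeds)
    frontier = list(seeds)            # nodes that became active in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in prob.get(u, {}).items():
                # u gets a single chance to activate each inactive neighbour v
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Hypothetical toy graph with probabilities on directed edges.
prob = {"a": {"b": 0.5, "c": 0.2}, "b": {"c": 0.7}, "c": {}}
result = simulate_cascade(prob, ["a"], random.Random(42))
```

Because the process is random, the influence of a seed set is usually estimated by averaging the size of the result over many runs.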

Influence Maximization II.

Denote by σ(A) the expected number of activated nodes if the elements of A were active in the beginning. This is the influence of A.

For a given k, the problem is to find the most influential subset with k elements.

The problem is NP-hard.

However: $\sigma(A_{greedy}) \ge \left(1 - \frac{1}{e}\right) \sigma(A_{opt})$, i.e., the influence of the result of the greedy algorithm is never worse than about 63% of the influence of the optimal solution.

We have implemented two optimized versions of the greedy algorithm (Wei Chen, Yajun Wang, Siyu Yang. Efficient Influence Maximization in Social Networks. 2009).
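For orientation, the basic greedy algorithm can be sketched as follows. This is the plain, unoptimized greedy selection with a Monte Carlo estimate of σ(A), not one of the optimized variants from the cited paper; the function names, the number of runs and the toy graph are assumptions.

```python
import random

def spread(prob, seeds, runs, rng):
    """Monte Carlo estimate of sigma(seeds) under the Independent Cascade Model."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v, p in prob.get(u, {}).items():
                    if v not in active and rng.random() < p:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / runs

def greedy_seeds(prob, nodes, k, runs=200, seed=0):
    """Pick k seeds, each time adding the node with the largest estimated spread."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        candidates = [n for n in nodes if n not in chosen]
        best = max(candidates, key=lambda n: spread(prob, chosen + [n], runs, rng))
        chosen.append(best)
    return chosen

# Hypothetical toy graph in which every edge fires with certainty.
prob = {"a": {"b": 1.0, "c": 1.0}, "d": {"e": 1.0}}
seeds = greedy_seeds(prob, ["a", "b", "c", "d", "e"], k=2, runs=5)
```

Every greedy step re-estimates the spread for all remaining candidates, which is what makes the plain version expensive on large graphs.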

Similarity

There are two fundamental approaches to defining

measures of similarities between nodes

structural similarity

regular similarity.

Two nodes are structurally similar, if they share many

common neighbours.

Two nodes are regularly similar, if they have respective

neighbours that are also regularly similar.

They do not have to have any common neighbour.

Cosine Similarity

[Figure: two vectors u and v enclosing the angle β]

• In geometry the cosine of the angle between

two vectors is calculated as: $\cos(\beta) = \frac{u \cdot v}{|u|\,|v|}$.

• u and v are close to each other, if cos(β) ~ 1.

• In networks, for node i its characteristic vector

$u_i$ is considered.

• The jth element of $u_i$ is 1 if nodes i and j are

neighbours. Otherwise, it is 0.

• The length of u is: $|u| = \sqrt{\sum_i u_i^2}$.

• Note: if two nodes have exactly the same

neighbours, then their cosine similarity is 1.

[Figure: example graph with nodes 1 to 4, in which $u_2 = (1, 0, 1, 0)$]
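For 0/1 characteristic vectors the dot product equals the number of common neighbours and the squared length equals the degree, so the cosine similarity can be computed from neighbour sets alone. A minimal Python sketch, assuming an adjacency dictionary of sets and a hypothetical path graph 1 - 2 - 3 - 4 (chosen so that node 2 has the characteristic vector (1, 0, 1, 0)):

```python
import math

def cosine_similarity(adj, i, j):
    """Cosine of the angle between the characteristic vectors of nodes i and j."""
    if not adj[i] or not adj[j]:
        return 0.0                      # convention chosen here for isolated nodes
    common = len(adj[i] & adj[j])       # dot product of two 0/1 vectors
    return common / math.sqrt(len(adj[i]) * len(adj[j]))

path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
similarity_1_3 = cosine_similarity(path, 1, 3)   # nodes 1 and 3 share neighbour 2
similarity_1_2 = cosine_similarity(path, 1, 2)   # no common neighbour
```

Nodes 1 and 2 are adjacent but share no neighbour, so their cosine similarity is 0; adjacency itself does not count towards structural similarity.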

Regular Similarity

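Recall: two nodes are regularly similar if they have respective neighbours that are also regularly similar.

A slightly different formulation: nodes i and j are similar if i has a neighbour that is similar to j.

This definition results in the following formula:

$\sigma_{ij} = \alpha \sum_k A_{ik} \sigma_{kj}$, where A is the adjacency matrix and α is a constant.

Nodes are considered to be self-similar, thus the final formula is: $\sigma_{ij} = \alpha \sum_k A_{ik} \sigma_{kj} + \delta_{ij}$, where $\delta_{ij}$ is 1 if i = j and 0 otherwise.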

Further details: E. A. Leicht, Petter Holme, and M. E. J. Newman. Vertex Similarity in Networks. 2005.
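The final formula can be solved by repeated substitution, starting from the identity matrix. The sketch below is a plain-Python illustration, assuming a dense adjacency matrix given as nested lists and a value of α small enough for the iteration to converge (below the reciprocal of the largest eigenvalue of A); the function name and the iteration count are hypothetical.

```python
def regular_similarity(adj_matrix, alpha, iterations=100):
    """Iterate sigma = alpha * A * sigma + I, starting from sigma = I."""
    n = len(adj_matrix)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    sigma = [row[:] for row in identity]
    for _ in range(iterations):
        sigma = [
            [
                alpha * sum(adj_matrix[i][k] * sigma[k][j] for k in range(n))
                + identity[i][j]
                for j in range(n)
            ]
            for i in range(n)
        ]
    return sigma

# Two nodes joined by one edge; the exact solution is (I - alpha * A)^(-1).
A = [[0, 1], [1, 0]]
sigma = regular_similarity(A, alpha=0.5)
```

For this two-node graph the exact solution is 4/3 on the diagonal and 2/3 off the diagonal, which the iteration approaches.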

Number of Triangles

In a social network a triangle represents that two friends

of user u are also friends.

A high number of triangles indicates the presence of

communities.
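A straightforward way to count triangles is to intersect the neighbour sets of the two endpoints of every edge. The Python sketch below assumes an undirected graph stored as a dictionary of neighbour sets with comparable node labels; it is a simple illustration, not the algorithm used in the package.

```python
def count_triangles(adj):
    """Count each triangle {u, v, w} exactly once by enforcing u < v < w."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                for w in adj[u] & adj[v]:   # common neighbours close a triangle
                    if v < w:
                        count += 1
    return count

# Hypothetical toy graph: the triangle 1-2-3 with a pendant node 4.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangles = count_triangles(toy)
```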

Clustering Coefficient

• The clustering coefficient of node i, $C_i$, is

the number of connected pairs of its neighbours

divided by the number of all pairs of its

neighbours.

• The clustering coefficient is the average

of these local clustering coefficients:

$C = \frac{1}{n} \sum_i C_i$

[Figure: example node i with $C_i = \frac{1}{3}$]

• Generally, in social networks this number is high.

• Interestingly, in the link graph of the web, the clustering coefficient is

much lower than expected.
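The local and the average clustering coefficient can be sketched in Python as follows, assuming an undirected graph stored as a dictionary of neighbour sets. Setting the coefficient of nodes with fewer than two neighbours to 0 is a convention chosen here, not something stated in the slides.

```python
def local_clustering(adj, i):
    """Connected pairs of neighbours of i divided by all pairs of neighbours of i."""
    neighbours = list(adj[i])
    k = len(neighbours)
    if k < 2:
        return 0.0                      # assumed convention: no pairs, coefficient 0
    linked = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbours[b] in adj[neighbours[a]]
    )
    return linked / (k * (k - 1) / 2)

def average_clustering(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Hypothetical toy graph: the triangle 1-2-3 with a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
c3 = local_clustering(toy, 3)
c_avg = average_clustering(toy)
```

Node 3 has three neighbours and therefore three neighbour pairs, of which only the pair (1, 2) is connected, giving a local coefficient of 1/3.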

Strongly Connected Components

A strongly connected component in a

directed graph is a maximal subgraph in which

there is a directed path between any

two nodes, in both directions.

Our implementation works in linear time,

in fact, it traverses the graph only once.

Further details: Harold N. Gabow. Path-based

Depth-first Search for Strong and

Biconnected Components. 2000.
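The idea of a path-based, single-traversal algorithm can be sketched in Python with two stacks. This is a recursive illustration in the spirit of the cited paper, assuming a dictionary that maps each node to a list of successors; it is not the package's code, and on very deep graphs the recursion would have to be replaced by an explicit stack.

```python
def strongly_connected_components(adj):
    """Path-based depth-first search that visits every node and edge once."""
    index = {}                    # preorder number of every visited node
    assigned = set()              # nodes already placed into a component
    path, boundaries = [], []     # the two stacks of the path-based method
    components = []

    def visit(v):
        index[v] = len(index)
        path.append(v)
        boundaries.append(v)
        for w in adj.get(v, ()):
            if w not in index:
                visit(w)
            elif w not in assigned:
                # w is still on the path stack: everything above it
                # belongs to the same component, so merge the boundaries
                while index[boundaries[-1]] > index[w]:
                    boundaries.pop()
        if boundaries[-1] == v:   # v is the root of a finished component
            boundaries.pop()
            component = set()
            while True:
                w = path.pop()
                assigned.add(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in adj:
        if v not in index:
            visit(v)
    return components

# Hypothetical toy graph: the cycle 1 -> 2 -> 3 -> 1 and the cycle 4 <-> 5.
toy = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
components = strongly_connected_components(toy)
```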

Some Technical Details

We derived a class called Graph from the ResultObjectAdapter class.

Basically, in the Graph class each node is stored with the list of its neighbours.

Each of our methods uses this class.

For the conversion of ExampleSet to Graph we implemented the Convert2Graph operator.

Its input consists of two columns, which store connected node pairs.

If the input represents a directed graph, the user has to select which column contains the source nodes.
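The conversion described above can be mimicked in a few lines of Python. This is a hypothetical analogue for illustration only: the package itself is a RapidMiner extension, and the function name and the parameters `directed` and `source_first` are assumptions, not the operator's actual parameters.

```python
def rows_to_adjacency(rows, directed=False, source_first=True):
    """Build a node -> set-of-neighbours mapping from two-column rows."""
    adjacency = {}
    for a, b in rows:
        # For directed input the caller states which column holds the source.
        src, dst = (a, b) if source_first else (b, a)
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())     # make sure every node is a key
        if not directed:
            adjacency[dst].add(src)
    return adjacency

rows = [("a", "b"), ("b", "c")]              # hypothetical two-column input
undirected = rows_to_adjacency(rows)
reversed_directed = rows_to_adjacency(rows, directed=True, source_first=False)
```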

Running Times

                 Cit-HepPh   ego-Twitter   web-BerkStan   soc-Pokec
Nodes            34.546      81.306        685.230        1.632.803
Edges            421.578     2.420.766     7.600.595      30.622.564
Betweenness      00:01:31    00:49:05      17:54:40       --
Linerank         00:00:00    00:00:01      00:00:05       00:00:20
SCC              00:00:00    00:00:00      00:00:00       00:00:06
# of triangles   00:00:00    00:00:28      00:00:42       00:07:17
CCF              00:00:01    00:00:01      00:01:57       00:19:48

The datasets can be downloaded from http://snap.stanford.edu/data/ .

• The time is given in format HH:MM:SS.

• In the experiment we used a single machine with a 12-core

2.67 GHz Intel Xeon CPU and 24 GB of RAM.

A Note on the Running Times

The algorithm calculating the number of triangles has

been changed to another method.

This new implementation runs 20 times faster on average.

Further details: Matthieu Latapy. Main-memory Triangle

Computations for Very Large (Sparse (Power-Law))

Graphs. 2008.

Test cases

Correlation between Linerank and betweenness.

We examined whether the most influential users of a

segment of the Twitter follower graph tend to tweet

emotionally positive or negative texts.

Test Case 1 – Diagram

Test Case 1 – Input, Output

• Input*: who-trust-

whom online

social network

(Epinions.com)

with 75.789 nodes

and 508.837 edges.

• The Pearson correlation

coefficient between Linerank

and betweenness is 0.561.

* http://snap.stanford.edu/data/soc-Epinions1.html
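A correlation of this kind is computed from two score lists given in the same node order. A minimal Python sketch of the Pearson correlation coefficient follows; the function name and the sample values are hypothetical and are not the scores from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical per-node scores of two centrality measures.
measure_a = [1, 2, 3, 4]
measure_b = [1, 3, 2, 4]
r = pearson(measure_a, measure_b)
```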

Test Case 2 – Diagram

Test Case 2 – Training I.

• Dataset*: 1.578.627

tweets classified according

to their positive or negative

sentiment.

• Sample: 1500 tweets.

* http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

Test Case 2 – Training II.

• Both the precision and the recall values are rather low.

• This does not improve even for samples with 100.000 tweets.

Test Case 2 - Output

The Twitter follower graph* contained 81.306 nodes and

1.768.149 edges.

The 10 most influential users were selected by the

NewGreedy algorithm.

84 tweets were found that were posted by these users.

Among them 38 were classified as positive.

* http://snap.stanford.edu/data/egonets-Twitter.html

Further work, final remarks

Some important measures are missing from the package

e.g. Pagerank, closeness, diameter etc.

After their implementation the package should be

made available in the RapidMiner Marketplace.

Some algorithms, e.g. SCC and triangle counting, are also

implemented in Hadoop, thus cooperation with

Radoop might be advantageous.

Currently, the package can be downloaded from:

http://inf.elte.hu/balhal/RapidMiner/ .
