rm world 2014: analytic network package for rapidminer
A Basic Analytic Network Package for RapidMiner
Balázs Kósa, Márton Balassi, Péter Englert, Gábor Rácz, Zoltán Pusztai, Attila Kiss
Why networks?
Hundreds of networks from entirely distinct areas of life share some basic structural properties, along with the implications of these structural features.
Networks
neural network of the worm Caenorhabditis elegans
the power grid of the US
link structure of the web etc.
Properties
degree distribution
robustness
presence of denser subgraphs (communities)
the average distance is very low
shortest paths can be found efficiently etc.
When Luther was captured on 4th May 1521,
Erasmus had already been informed by the following day.
The distance is around 483 km ~ 282 mi.
The Package I.
The algorithms of our package can be categorized into
two groups
methods related to nodes and edges
algorithms providing an overall description of the graph.
The first group can be further divided into
algorithms indicating the importance of a node or an edge
methods measuring the similarity between nodes.
The Package II.
Importance
degree
betweenness
Linerank
influence maximization.
Similarity
cosine similarity
a regular similarity.
The Package III.
Overall descriptors
number of triangles
clustering coefficient
strongly connected components.
Outline
Explanation of measures, models and their application
fields
Some technical details
How the package should be used in RapidMiner
Experimental running times
Two test cases
Degree
The degree of a node may predict its influence.
Several networks have been reported to have a roughly
power-law degree distribution: $P(k) \sim k^{-\beta}$ $(2 \le \beta \le 3)$.
P(k): the fraction of nodes in the network having k edges to other
nodes.
Only the tail of the degree distribution obeys the power law.
This means that most of the nodes have few connections.
Some privileged nodes possess a large fraction of
connections.
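As an illustration of P(k), the following minimal Python sketch computes the empirical degree distribution from an edge list. The function name and the toy edge list are assumptions made for this example and are not part of the package; nodes that appear in no edge are not counted.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)} for an undirected graph given as (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())   # how many nodes have degree k
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: a star with centre 0 plus one extra edge.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
dist = degree_distribution(edges)
```

In this toy graph a single node holds degree 4 while the others have degree 1 or 2, a small-scale picture of a few privileged nodes holding many of the connections.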
Degree Distribution - Example
US cities and the
main roads between
them.
US airports and the
flights between them.
Source: Albert-László Barabási: Linked. 2003.
Betweenness
Informally, betweenness measures the amount of
information that may pass through a node.
Formally: $v_{betw} = \sum_{u,w} \frac{b_{u,v,w}}{b_{u,w}}$
$b_{u,w}$: the number of shortest paths between nodes u and w
$b_{u,v,w}$: the number of those shortest paths between nodes u and w that pass through v.
The definition on edges is similar.
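The node formula can be evaluated with one breadth-first search per node. The sketch below follows the accumulation scheme of Brandes for unweighted graphs; it is an illustration under stated assumptions, not the package's implementation. It sums over ordered pairs (u, w), so in an undirected graph every pair contributes twice. The function name and the toy graph are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Node betweenness for an unweighted graph.

    adj maps each node to an iterable of its neighbours. Every ordered
    pair (u, w) is counted, so an undirected pair contributes twice.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) to every node.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path fractions, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical path graph 1 - 2 - 3: only node 2 lies between other nodes.
path = {1: [2], 2: [1, 3], 3: [2]}
scores = betweenness(path)
```

Each source node costs one traversal of all edges, which is where the O(nm) bound for unweighted graphs comes from.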
Why is betweenness useful?
• By means of betweenness one may
detect bridges.
• Bridges are edges connecting
communities.
• They are the most "vulnerable" parts
of a network.
Linerank I.
The fastest algorithm calculating betweenness runs in
O(nm) time (n: # of nodes, m: # of edges).
For big data, this is prohibitively expensive.
As a substitute the researchers of IBM and Google
developed Linerank.
U Kang, Spiros Papadimitriou, Jimeng Sun, Hanghang Tong:
Centralities in Large Networks: Algorithms and
Observations. 2011.
Linerank II.
[Figure: example graphs with nodes labelled 1 to 5]
• In the calculation of Linerank
Pagerank is computed on the line graph
of the original graph.
• Line graph…
• Roughly, the Pagerank value of a
node is the probability that a random
walker stays at that node after a long
period of time.
• In the last step for each node of the
original graph the previous Pagerank
values of its incident edges are
aggregated.
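The steps above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of the cited paper as published and not the package's code: the line graph is built explicitly, an undirected link is stored as two directed edges, the damping factor 0.85 is an assumed default, and the aggregation is a plain sum. All names are hypothetical, and the edge list must not contain duplicates.

```python
def pagerank(succ, damping=0.85, iterations=100):
    """Power-iteration Pagerank; succ maps a node to its successor list."""
    nodes = list(succ)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dead ends is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not succ[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in succ[v]:
                new[w] += damping * rank[v] / len(succ[v])
        rank = new
    return rank

def linerank(edges, damping=0.85, iterations=100):
    """Aggregate line-graph Pagerank values per node of the original graph."""
    # Line graph: one node per directed edge; (u, v) is linked to (v, w).
    out_edges = {}
    for u, v in edges:
        out_edges.setdefault(u, []).append((u, v))
    line = {e: out_edges.get(e[1], []) for e in edges}
    edge_rank = pagerank(line, damping, iterations)
    score = {}
    for (u, v), r in edge_rank.items():
        score[u] = score.get(u, 0.0) + r   # the edge leaves u
        score[v] = score.get(v, 0.0) + r   # the edge enters v
    return score

# Hypothetical undirected star, stored as two directed edges per link.
links = [(0, 1), (0, 2), (0, 3)]
edges = links + [(v, u) for u, v in links]
scores = linerank(edges)
```

In the star every edge touches the centre, so node 0 collects the whole probability mass while each leaf collects only the share of its own two directed edges.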
Influence Maximization I.
The goal is to find the most influential subset of nodes
under a given information diffusion model and size
restriction.
A possible application field is viral marketing.
As a diffusion model most of the research papers
focused on the Independent Cascade Model.
Independent Cascade Model
• Each edge is labelled with a
probability.
• We start with an active set of
nodes (0th step).
• In the ith step those nodes that became active in the (i-1)th step may activate their neighbours.
• Node u activates its neighbour v with the probability assigned to their linking edge.
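A single random run of this model can be simulated as in the sketch below. The data layout (a dictionary from directed edges to probabilities), the function name and the toy graph are assumptions for illustration only.

```python
import random

def independent_cascade(prob, seeds, rng):
    """Run one random cascade; prob maps (u, v) edges to probabilities."""
    neighbours = {}
    for (u, v) in prob:
        neighbours.setdefault(u, []).append(v)
    active = set(seeds)
    frontier = list(seeds)          # nodes activated in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in neighbours.get(u, []):
                # u gets a single chance to activate each inactive neighbour.
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Hypothetical directed graph with edge probabilities.
prob = {("a", "b"): 1.0, ("b", "c"): 0.0, ("a", "d"): 0.5}
result = independent_cascade(prob, ["a"], random.Random(42))
```

Node b is always activated and node c never, while d is reached in roughly half of the runs; averaging the size of the active set over many runs gives an estimate of the influence of the start set.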
[Figure: example graph whose edges are labelled with activation probabilities between 0.1 and 0.8]
Influence Maximization II.
Let σ(A) denote the (expected) number of activated nodes if, in the beginning, the elements of A are active. This is the influence of A.
For a given k, the problem is to find the most influential subset with k elements.
The problem is NP-hard.
However: $\sigma(A_{greedy}) \ge \left(1 - \frac{1}{e}\right)\sigma(A_{opt})$, i.e., the influence of the result of the greedy algorithm is never worse than about 63% of the influence of the optimal solution.
We have implemented two optimized versions of the greedy algorithm (Wei Chen, Yajun Wang, Siyu Yang. Efficient Influence Maximization in Social Networks. 2009).
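For orientation, the plain greedy strategy can be sketched as below. This is the basic, unoptimized greedy with Monte Carlo estimation of σ, not one of the optimized variants from the cited paper; the number of runs, the random seed and all names are assumptions for this example.

```python
import random

def cascade_size(prob, seeds, rng):
    """Size of one random Independent Cascade run started from seeds."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for (a, v), p in prob.items():
                if a == u and v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_sigma(prob, seeds, runs, rng):
    """Monte Carlo estimate of the influence sigma(A)."""
    return sum(cascade_size(prob, seeds, rng) for _ in range(runs)) / runs

def greedy_seeds(prob, nodes, k, runs=200, seed=0):
    """Add, k times, the node that maximizes the estimated influence."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best = max(
            (v for v in nodes if v not in chosen),
            key=lambda v: estimate_sigma(prob, chosen + [v], runs, rng),
        )
        chosen.append(best)
    return chosen

# Hypothetical graph: "hub" reaches three nodes with certainty.
prob = {("hub", "x"): 1.0, ("hub", "y"): 1.0, ("hub", "z"): 1.0, ("x", "y"): 0.1}
nodes = ["hub", "x", "y", "z"]
seeds = greedy_seeds(prob, nodes, k=1)
```

Every greedy step re-estimates σ for each candidate node, which is the cost that optimized variants aim to reduce.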
Similarity
There are two fundamental approaches to defining
measures of similarities between nodes
structural similarity
regular similarity.
Two nodes are structurally similar, if they share many
common neighbours.
Two nodes are regularly similar, if they have respective
neighbours that are also regularly similar.
They do not have to have any common neighbour.
Cosine Similarity
[Figure: two vectors u and v enclosing the angle β]
• In geometry the cosine of the angle between two vectors is calculated as: $\cos\beta = \frac{u \cdot v}{|u|\,|v|}$.
• u and v are close to each other, if cos(β) ~ 1.
• In networks for node i its characteristic vector
ui is considered.
• The jth element of ui is 1, if nodes i and j are
neighbours. Otherwise, it is 0.
• The length of u is: $|u| = \sqrt{\sum_i u_i^2}$.
• Note: if two nodes have exactly the same
neighbours, then their cosine similarity is 1.
[Figure: example graph with nodes 1 to 4]
Example: $u_2 = (1, 0, 1, 0)$, i.e. node 2 is a neighbour of nodes 1 and 3.
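For 0/1 characteristic vectors the dot product is simply the number of common neighbours and the length is the square root of the degree, so the measure can be computed directly from neighbour sets. A minimal sketch, assuming an adjacency-list dictionary and a hypothetical 4-cycle as input:

```python
import math

def cosine_similarity(adj, i, j):
    """Cosine of the angle between the 0/1 neighbourhood vectors of i and j."""
    ni, nj = set(adj[i]), set(adj[j])
    if not ni or not nj:
        return 0.0                     # convention chosen for isolated nodes
    # Dot product = number of common neighbours; length = sqrt(degree).
    return len(ni & nj) / math.sqrt(len(ni) * len(nj))

# Hypothetical 4-cycle 1 - 2 - 3 - 4 - 1.
adj = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
same_neighbours = cosine_similarity(adj, 1, 3)
no_common_neighbour = cosine_similarity(adj, 1, 2)
```

Nodes 1 and 3 have exactly the same neighbours and therefore reach the maximal value, whereas the adjacent nodes 1 and 2 share no neighbour at all.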
Regular Similarity
Recall, two nodes are regularly similar, if they have respective neighbours that are also regularly similar.
A slightly different formulation: nodes i and j are similar, if i has a neighbour that is similar to j.
This definition results in the following formula, where A is the adjacency matrix of the graph:
$\sigma_{ij} = \alpha \sum_k A_{ik}\sigma_{kj}$.
Nodes are considered to be self-similar, thus the final formula is: $\sigma_{ij} = \alpha \sum_k A_{ik}\sigma_{kj} + \delta_{ij}$.
Further details: E. A. Leicht, Petter Holme, and M. E. J. Newman. Vertex Similarity in Networks. 2005.
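In matrix form the final formula reads σ = αAσ + I, which can be solved by repeated substitution as long as α is smaller than the reciprocal of the largest eigenvalue of A. The sketch below illustrates this iteration in plain Python; the function name, the iteration count and the toy graph are assumptions, and the package may compute the measure differently.

```python
def regular_similarity(A, alpha, iterations=100):
    """Iterate sigma <- alpha * A * sigma + I for an adjacency matrix A."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    sigma = [row[:] for row in identity]
    for _ in range(iterations):
        sigma = [
            [
                alpha * sum(A[i][k] * sigma[k][j] for k in range(n)) + identity[i][j]
                for j in range(n)
            ]
            for i in range(n)
        ]
    return sigma

# Hypothetical path graph 0 - 1 - 2. Its largest eigenvalue is sqrt(2),
# so alpha = 0.5 < 1 / sqrt(2) and the iteration converges.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
sigma = regular_similarity(A, alpha=0.5)
```

For this graph the iteration converges to σ = [[1.5, 1, 0.5], [1, 2, 1], [0.5, 1, 1.5]], so the two end nodes receive a positive similarity although they are not adjacent.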
Number of Triangles
In a social network a triangle means that two friends
of user u are also friends with each other.
A high number of triangles indicates the presence of
communities.
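A simple way to count triangles is to intersect the neighbour sets of the two endpoints of every edge. The following sketch does this for a small undirected graph; it is a basic illustration and not the method used in the package. Node labels are assumed to be mutually comparable, and the names and toy graph are hypothetical.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u != v:                     # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Only common neighbours w > v are counted, so each
                # triangle u < v < w is found exactly once.
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

# Hypothetical friendship graph: one triangle (1, 2, 3) plus a pendant edge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = count_triangles(edges)
```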
Clustering Coefficient
• The clustering coefficient of node i, $C_i$, is the number of connected pairs of its neighbours divided by the number of all pairs of its neighbours.
• The clustering coefficient is the average of these local clustering coefficients: $C = \frac{1}{n}\sum_i C_i$
[Figure: example node i with $C_i = \frac{1}{3}$]
• Generally, in social networks this number is high.
• Interestingly, in the link graph of the web, the clustering coefficient is
much lower than expected.
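The local and the averaged coefficient can be computed directly from the definition, as in this minimal sketch. The adjacency-set layout, the convention of returning 0 for nodes with fewer than two neighbours, and the toy graph are assumptions made for the example.

```python
from itertools import combinations

def local_clustering(adj, i):
    """Connected neighbour pairs of i divided by all neighbour pairs of i."""
    neighbours = list(adj[i])
    pairs = len(neighbours) * (len(neighbours) - 1) // 2
    if pairs == 0:
        return 0.0                     # convention for nodes of degree < 2
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return linked / pairs

def clustering_coefficient(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Hypothetical graph: node "i" has three neighbours, one pair of them is linked.
adj = {
    "i": {"a", "b", "c"},
    "a": {"i", "b"},
    "b": {"i", "a"},
    "c": {"i"},
}
c_i = local_clustering(adj, "i")
c = clustering_coefficient(adj)
```

Here one of the three neighbour pairs of node i is connected, so its local coefficient is 1/3, and the average over the four nodes is 7/12.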
Strongly Connected Components
A strongly connected component in a
directed graph is a maximal subgraph in which
there is a directed path from every node to
every other node.
Our implementation works in linear time,
in fact, it traverses the graph only once.
Further details: Harold N. Gabow. Path-based Depth-first Search for Strong and
Biconnected Components. 2000.
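A path-based search of this kind keeps two stacks during one depth-first traversal: one with all nodes not yet assigned to a component, and one with the candidate roots on the current path. The sketch below illustrates the idea in Python; it is recursive for brevity, so it is only suitable for small graphs, and it is not the package's code. All names and the toy graph are hypothetical.

```python
def strongly_connected_components(adj):
    """Path-based SCC search with two stacks and a single DFS."""
    preorder = {}          # visit number of each node
    assigned = set()       # nodes already placed in a component
    stack_s, stack_p = [], []
    components = []

    def visit(v):
        preorder[v] = len(preorder)
        stack_s.append(v)
        stack_p.append(v)
        for w in adj.get(v, []):
            if w not in preorder:
                visit(w)
            elif w not in assigned:
                # w is still on stack_s: everything visited after w
                # belongs to the same component, so drop those roots.
                while preorder[stack_p[-1]] > preorder[w]:
                    stack_p.pop()
        if stack_p[-1] == v:
            # v is the root of a component: pop the component off stack_s.
            stack_p.pop()
            component = []
            while True:
                w = stack_s.pop()
                assigned.add(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in adj:
        if v not in preorder:
            visit(v)
    return components

# Hypothetical directed graph: cycle 1 -> 2 -> 3 -> 1 and a tail 3 -> 4.
adj = {1: [2], 2: [3], 3: [1, 4], 4: []}
components = strongly_connected_components(adj)
```

Every node is pushed and popped at most once on each stack and every edge is examined once, which is consistent with a linear running time.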
Some Technical Details
We derived a class called Graph from the ResultObjectAdapter class.
Basically, in the Graph class each node is stored with the list of its neighbours.
Each of our methods uses this class.
For the conversion of ExampleSet to Graph we implemented the Convert2Graph operator.
Its input consists of two columns, which store connected node pairs.
If the input represents a directed graph, the user has to select which column contains the source nodes.
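The conversion from a two-column table to neighbour lists can be pictured with the following Python sketch. It only mirrors the idea described above; the actual Graph class and Convert2Graph operator are part of the package, and the function name, parameters and sample rows here are hypothetical.

```python
def build_graph(rows, directed=False, source_column=0):
    """Build neighbour lists from two-column rows of connected node pairs."""
    target_column = 1 - source_column
    adj = {}
    for row in rows:
        u, v = row[source_column], row[target_column]
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])          # make sure v is present as a node
        if not directed:
            adj[v].append(u)
    return adj

# Hypothetical input table with two columns.
rows = [("a", "b"), ("b", "c")]
undirected = build_graph(rows)
directed = build_graph(rows, directed=True, source_column=1)
```

With source_column=1 the second column is read as the source, so the row ("a", "b") becomes the directed edge b -> a.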
Running Times

                 Cit-HepPh   ego-Twitter   web-BerkStan   soc-Pokec
Nodes            34.546      81.306        685.230        1.632.803
Edges            421.578     2.420.766     7.600.595      30.622.564
Betweenness      00:01:31    00:49:05      17:54:40       --
Linerank         00:00:00    00:00:01      00:00:05       00:00:20
SCC              00:00:00    00:00:00      00:00:00       00:00:06
# of triangles   00:00:00    00:00:28      00:00:42       00:07:17
CCF              00:00:01    00:00:01      00:01:57       00:19:48
The datasets can be downloaded from http://snap.stanford.edu/data/ .
• Times are given in the format HH:MM:SS.
• In the experiment we used a single machine with a 12-core 2.67 GHz Intel Xeon CPU and 24 GB of RAM.
A Note on the Running Times
The algorithm calculating the number of triangles has
been changed to another method.
This new implementation runs 20 times faster on average.
Further details: Matthieu Latapy. Main-memory Triangle
Computations for Very Large (Sparse (Power-Law))
Graphs. 2008.
Test cases
Correlation between Linerank and betweenness.
We examined whether the most influential users of a
segment of the Twitter follower graph tend to tweet
emotionally positive or negative texts.
Test Case 1 – Diagram
Test Case 1 – Input, Output
• Input*: who-trust-whom online social network (Epinions.com) with 75.789 nodes and 508.837 edges.
• The Pearson correlation coefficient between Linerank and betweenness is 0.561.
* http://snap.stanford.edu/data/soc-Epinions1.html
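For reference, the Pearson correlation coefficient of two score lists can be computed as in the sketch below. The function name and the sample values are invented for illustration and are not taken from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical per-node scores of two centrality measures.
measure_a = [1.0, 2.0, 3.0, 4.0]
measure_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(measure_a, measure_b)
```

A value of 1 means a perfect linear relationship and 0 means no linear relationship; the function is undefined when one of the lists is constant.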
Test Case 2 – Diagram
Test Case 2 – Training I.
• Dataset*: 1.578.627 tweets classified according to their positive or negative sentiment.
• Sample: 1500 tweets.
* http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Test Case 2 – Training II.
• Both the precision and the recall values are rather low.
• This does not improve even for samples with 100.000 tweets.
Test Case 2 - Output
The Twitter follower graph* contained 81.306 nodes and
1.768.149 edges.
The 10 most influential users were selected by the
NewGreedy algorithm.
84 tweets were found that were posted by these users.
Among them 38 were classified as positive.
* http://snap.stanford.edu/data/egonets-Twitter.html
Further work, final remarks
Some important measures are still missing from the package,
e.g. Pagerank, closeness, diameter, etc.
After their implementation the package should be
made available in the RapidMiner Marketplace.
Some algorithms, e.g. SCC and triangle counting, are also
implemented in Hadoop, thus cooperation with
Radoop might be advantageous.
Currently, the package can be downloaded from:
http://inf.elte.hu/balhal/RapidMiner/ .