one day in twitter: topic detection via joint complexity

One Day in Twitter: Topic Detection Via Joint Complexity

Gérard Burnside (1), Dimitris Milioris (1,2) and Philippe Jacquet (1)

(1) Bell Labs, Alcatel-Lucent, France; (2) École Polytechnique ParisTech

Snow Challenge @ WWW 2014

Overview

Motivation & Challenges

I-Complexity

Joint Complexity

Theoretical Background

Snow Challenge Dataset

Topic Detection: Headlines, Keywords, Media URLs

Benefits

Conclusions

Future work

Motivation

Online social media services have seen a huge expansion:

The value of information has increased dramatically

Interactions and communication between users help predict the evolution of information

The ability to study social networks can provide relevant information in real time

Challenges

The study of social networks poses several research challenges

Searching in social media is still an open problem: posts are short and arrive in tremendous quantity in real time

Information on the correlation between groups of users: predict media consumption, network resources and traffic; improve QoS

Analyze the relationship between members of a group/community: reveal important teams

Spam and advertisement detection: a continuously growing amount of irrelevant information

I – Complexity

X is a sequence and I(X) is the set of its factors (distinct substrings)

Example: X = apple, then:

I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple, apple, ν}

where ν denotes the empty string

|I(X)| is the complexity of the sequence; here |I(X)| = 15
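The definition above can be checked with a short brute-force sketch. It is only an illustration: the function names `factors` and `complexity` are hypothetical, and enumerating every substring costs quadratic time and space, which is fine for a five-letter word but is not how a real system would do it.

```python
def factors(x: str) -> set[str]:
    """Return I(x): every distinct substring of x, plus the empty string."""
    result = {""}  # the empty string counts as a factor
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            result.add(x[start:end])
    return result


def complexity(x: str) -> int:
    """Return |I(x)|, the number of distinct factors of x."""
    return len(factors(x))


print(complexity("apple"))  # 15, matching the example above
```

The repeated letter "p" is why the count is 15 rather than 16: the two occurrences of "p" contribute a single factor.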

Joint Complexity [1]

The information contained in a string may be revealed by comparing with a reference string

The Joint Complexity is the number of common distinct factors in two sequences

J(X, Y) = |I(X) ∩ I(Y)|

An efficient way to estimate the degree of similarity between two sequences

The decomposition of a sequence into its factors is done with suffix trees: a simple, fast, low-complexity method to store the factors and recall them from memory
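As a minimal sketch of the definition J(X, Y) = |I(X) ∩ I(Y)|, the snippet below intersects the two factor sets directly. This is a brute-force illustration with hypothetical function names, not the suffix-tree method the slides describe; it is meant only to make the quantity concrete.

```python
def factors(x: str) -> set[str]:
    """Return I(x): every distinct substring of x, plus the empty string."""
    result = {""}
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            result.add(x[start:end])
    return result


def joint_complexity(x: str, y: str) -> int:
    """Return J(x, y) = |I(x) ∩ I(y)|, the number of common distinct factors."""
    return len(factors(x) & factors(y))


print(sorted(factors("apple") & factors("maple"), key=len))
print(joint_complexity("apple", "maple"))  # 9
```

The nine common factors are the empty string, a, p, l, e, ap, pl, le and ple. Comparing a sequence with itself gives back its own complexity, so J(X, X) = |I(X)|.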

[1] P. Jacquet, D. Milioris and W. Szpankowski, “Classification of Markov Sources Through Joint String Complexity: Theory and Experiments”, in IEEE International Symposium on Information Theory (ISIT’13), Istanbul, Turkey, July 2013.

Suffix Trees Superposition

Suffix tree superposition of X = apple and Y = maple

It reveals the common factors of X and Y and gives a similarity metric

Time to build a suffix tree = O(n log n)

Space in memory = O(n), where n is the length of the tweet

JC(apple, maple) = 9
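The superposition idea can be sketched with an uncompressed suffix trie, where each node stands for one distinct factor, so the nodes shared by two tries are exactly the common factors. This is a simplification under stated assumptions: a nested-dict trie takes quadratic space, unlike the O(n) suffix tree quoted above, and the function names are hypothetical.

```python
def build_suffix_trie(s: str) -> dict:
    """Insert every suffix of s into a nested-dict trie (one node per factor)."""
    root: dict = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def count_common_nodes(a: dict, b: dict) -> int:
    """Count nodes present in both tries; the root stands for the empty string."""
    total = 1
    for ch, child in a.items():
        if ch in b:
            total += count_common_nodes(child, b[ch])
    return total


def joint_complexity(x: str, y: str) -> int:
    """Superpose the two tries and count the nodes they share."""
    return count_common_nodes(build_suffix_trie(x), build_suffix_trie(y))


print(joint_complexity("apple", "maple"))  # 9
```

Walking both tries in lockstep only descends along edges present in both, so the traversal never visits factors that belong to a single string.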

Theoretical Background [2]

JC is expected to be in $O(n^{\kappa})$, $\kappa < 2$

In the presence of quasi-duplicates, JC is in $O(n^2)$

When topics are the same, JC $= \frac{2 \log 2}{h}\, n$, where $h$ is the entropy of the source

Used to verify the thresholds Thrlow and Thrmax

[2] D. Milioris and P. Jacquet, “Joint Sequence Complexity Analysis: Application to Social Networks Information Flow”, in Bell Laboratories Technical Journal, Issue on Data Analytics, Vol. 18, No. 4, 2014. (DOI: 10.1002/bltj.21647).

Snow Data Challenge

Tweets were collected for 24 hours, between Tue Feb. 25, 18:00 and Wed Feb. 26, 18:00 (GMT), by following 556,295 users and also looking for specific keywords (Syria, terror, Ukraine, bitcoin)

Total tweets: 1,041,062

N = 96 timeslots (new timeslot = every 15 minutes)

Challenge: Provide one or more (max 10) different topics per timeslot (headline, set of keywords, Media URLs)
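The split into N = 96 timeslots can be illustrated with a small sketch that maps a tweet timestamp to its slot index. The year 2014 is an assumption taken from the challenge name, and `START`, `SLOT`, `N_SLOTS` and `timeslot_index` are hypothetical names introduced for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumed collection start: Tue Feb. 25, 18:00 GMT, year 2014 (assumption).
START = datetime(2014, 2, 25, 18, 0, tzinfo=timezone.utc)
SLOT = timedelta(minutes=15)
N_SLOTS = 96  # 24 hours x 4 slots per hour


def timeslot_index(ts: datetime) -> int:
    """Return the 0-based index of the 15-minute slot containing ts."""
    index = (ts - START) // SLOT  # timedelta // timedelta yields an int
    if not 0 <= index < N_SLOTS:
        raise ValueError("timestamp falls outside the 24-hour collection window")
    return index


print(timeslot_index(START + timedelta(hours=1, minutes=20)))  # 5
```

Each slot is half-open: a tweet stamped exactly on a quarter-hour boundary belongs to the slot that starts at that instant.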

Topic Detection

Timeslot representation via connected weighted graphs

Each tweet is a node in the graph and an adjacency matrix (triangular) holds the weight (JC) of every edge
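A minimal sketch of this graph representation follows, assuming a brute-force JC computation. The slides do not say how tweets are ranked from the matrix; summing the weights of all edges incident to a node, as `node_scores` does, is an assumption made for illustration, and all function names are hypothetical.

```python
def factors(x: str) -> set[str]:
    """Every distinct substring of x, plus the empty string."""
    result = {""}
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            result.add(x[start:end])
    return result


def jc_matrix(tweets: list[str]) -> list[list[int]]:
    """Upper-triangular matrix: weights[i][j] holds JC(tweets[i], tweets[j]) for i < j."""
    n = len(tweets)
    facs = [factors(t) for t in tweets]  # compute each factor set once
    weights = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = len(facs[i] & facs[j])
    return weights


def node_scores(weights: list[list[int]]) -> list[int]:
    """Assumed ranking score: sum of the weights of all edges touching each node."""
    n = len(weights)
    return [
        sum(weights[min(i, j)][max(i, j)] for j in range(n) if j != i)
        for i in range(n)
    ]


w = jc_matrix(["apple", "maple", "xyz"])
print(w[0][1], node_scores(w))
```

Only the cells above the diagonal are filled, since JC is symmetric and a tweet is never compared with itself.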

Topic Detection

Algorithms

Most Representative and Central Tweets

The best-ranked tweet is chosen unconditionally

The second one is picked only if its JC score with the first one is below a chosen threshold Thrlow; otherwise it is added to the list of related tweets of the first tweet

Similarly, the third one is picked only if its JC scores with the first two are below Thrlow, etc.

This ensures that the topics are dissimilar enough, and at the same time it classifies the best-ranked tweets into topics
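The selection procedure above can be sketched as a greedy pass over the ranked tweets. Two details are assumptions not stated in the slides: a tweet that is too similar to several central tweets is attached to the one with the highest JC score, and selection stops creating topics at `max_topics` (set here to the challenge limit of 10). Function names are hypothetical and JC is computed by brute force.

```python
def factors(x: str) -> set[str]:
    """Every distinct substring of x, plus the empty string."""
    return {""} | {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}


def jc(x: str, y: str) -> int:
    """Joint complexity: number of common distinct factors."""
    return len(factors(x) & factors(y))


def select_topics(ranked: list[str], thr_low: int, max_topics: int = 10):
    """Greedy pass over tweets sorted from best to worst rank.

    Returns a list of (central_tweet, related_tweets) pairs.
    """
    topics: list[tuple[str, list[str]]] = []
    for tweet in ranked:
        scores = [jc(tweet, central) for central, _ in topics]
        if all(score < thr_low for score in scores):
            # Dissimilar to every central tweet so far: start a new topic.
            # The first tweet always lands here because scores is empty.
            if len(topics) < max_topics:
                topics.append((tweet, []))
        else:
            # Assumption: attach to the most similar existing topic.
            best = max(range(len(scores)), key=scores.__getitem__)
            topics[best][1].append(tweet)
    return topics


print(select_topics(["apple pie", "apple pies", "zzz"], thr_low=10))
```

With this toy input, "apple pies" becomes a related tweet of "apple pie", while "zzz" starts its own topic.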

Headlines

By removing punctuation, special characters, etc., from each central tweet, we construct the headline of each topic, and we run through the list of related tweets to keep only tweets that are different enough from the central one (no duplicates)

We do so by keeping only the tweets whose JC score with the central tweet and with all previously kept related tweets is below a chosen threshold Thrmax.

We first chose the values 400 and 600 for Thrlow and Thrmax respectively, but many topics had only one related tweet (all the others were retweets), so we decided to lower that threshold to 240
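The two steps of this slide can be sketched as follows. The cleaning rule (replace anything that is not a letter, digit, underscore or whitespace) is an assumption about what "punctuation, special characters, etc." covers. The duplicate filter reads the Thrmax test as discarding a related tweet once its JC score with the central tweet, or with any tweet already kept, reaches the threshold, since that is the reading consistent with "no duplicates"; treat this as an interpretation. All function names are hypothetical.

```python
import re


def factors(x: str) -> set[str]:
    """Every distinct substring of x, plus the empty string."""
    return {""} | {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}


def jc(x: str, y: str) -> int:
    """Joint complexity: number of common distinct factors."""
    return len(factors(x) & factors(y))


def make_headline(central_tweet: str) -> str:
    """Strip punctuation and special characters, then collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", central_tweet)
    return " ".join(cleaned.split())


def filter_related(central: str, related: list[str], thr_max: int) -> list[str]:
    """Keep related tweets that are not near-duplicates of the central or kept ones."""
    kept: list[str] = []
    for tweet in related:
        if all(jc(tweet, other) < thr_max for other in [central, *kept]):
            kept.append(tweet)
    return kept


print(make_headline("#Ukraine: breaking!"))  # Ukraine breaking
print(filter_related("apple pie", ["RT apple pie", "maple syrup"], thr_max=30))
```

In the toy call, the retweet shares every factor of the central tweet and is dropped, while "maple syrup" shares only a handful and is kept. The threshold of 30 is scaled to these short strings and is unrelated to the values quoted above.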

Keywords

In the bag of words constructed from the list of related tweets, we remove articles (stop-words), punctuation, special characters, etc.

We get a list of words, and we order them by decreasing frequency of occurrence.

Finally we report the k most frequent words, in a list of keywords
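A minimal sketch of this keyword step, using Python's standard `collections.Counter`. The stop-word set is a tiny illustrative placeholder rather than the list used by the authors, and `top_keywords` is a hypothetical name.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the actual list used is not given in the slides.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "for"}


def top_keywords(related_tweets: list[str], k: int) -> list[str]:
    """Return the k most frequent non-stop-words across the related tweets."""
    counts: Counter[str] = Counter()
    for tweet in related_tweets:
        # \w+ drops punctuation and special characters while tokenizing.
        for word in re.findall(r"\w+", tweet.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(k)]


tweets = ["The protest in Kiev", "Kiev protest grows", "Kiev"]
print(top_keywords(tweets, k=2))  # ['kiev', 'protest']
```

Lowercasing before counting merges "Kiev" and "kiev"; whether the original method is case-sensitive is not stated.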

Media URLs

The body of a tweet (in the JSON file format) contains URL information for links to media files, under entities → media → media_url.

We scan the original JSON in order to retrieve such a URL, from the most representative tweet or any of its related tweets, pointing to a valid picture in jpg, png or gif format, and then we report these pictures along with the headlines and the set of keywords

Almost half of the headlines (47%) produced by our method had an image retrieved from the original tweet.
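The lookup along entities → media can be sketched with the standard `json` module. The sample tweet below is made up for illustration and contains only the fields the sketch needs; `media_urls` is a hypothetical helper, and checking the file extension is a simplification of "valid photos or pictures".

```python
import json

IMAGE_EXTENSIONS = (".jpg", ".png", ".gif")


def media_urls(raw_tweet: str) -> list[str]:
    """Return image URLs found under entities -> media -> media_url."""
    tweet = json.loads(raw_tweet)
    media = tweet.get("entities", {}).get("media", [])
    urls = [item.get("media_url", "") for item in media]
    # Keep only links that end in one of the accepted picture extensions.
    return [url for url in urls if url.lower().endswith(IMAGE_EXTENSIONS)]


# Made-up sample tweet for illustration only.
sample = json.dumps({
    "text": "example tweet",
    "entities": {"media": [
        {"media_url": "http://example.com/photo.jpg"},
        {"media_url": "http://example.com/clip.mp4"},
    ]},
})
print(media_urls(sample))  # ['http://example.com/photo.jpg']
```

Tweets without a media entry simply yield an empty list, so the same function can be applied to the central tweet and then to each related tweet until a picture is found.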

Benefits

Both message classification and identification of the growing trends in real time (trend sensing); submitted to KDD’14

Track the information and timeline within a social network

Deal with languages other than English without specific pre-processing or dictionaries, because the method is simple and context-free, and uses neither grammar nor semantics

Conclusions

Implementation of a topic detection method applied to a dataset of tweets emitted during a 24-hour period

It relies heavily on the concept of Joint String Complexity, which has the benefit of being language-agnostic, does not require humans to deal with lists of keywords, and has high algorithmic efficiency

The results obtained are satisfactory and promising on the SNOW dataset and on other, non-Latin languages (e.g. Greek)

Future Work, Improvements

Use the theoretical background to fix the threshold values automatically, rather than using the empirical ones chosen in this work

Fix a discarding threshold to remove topics that are not significant enough, thus allowing some less active timeslots to contain fewer or more than a fixed number of topics

Handle topics that were cut in half between two timeslots (since they were arbitrarily divided into 15-minute slots)

Extend the JC metric to make a topological classification of tweets and perform clustering based on this distance

Publications related to JC

D. Milioris and P. Jacquet, “Joint Sequence Complexity Analysis: Application to Social Networks Information Flow”, in Bell Laboratories Technical Journal, Issue on Data Analytics, Vol. 18, No. 4, 2014

P. Jacquet, D. Milioris and W. Szpankowski, “Classification of Markov Sources Through Joint String Complexity: Theory and Experiments”, Proc. IEEE Internat. Symp. Inform. Theory (ISIT ’13)

P. Jacquet and W. Szpankowski, “Joint String Complexity for Markov Sources”, Proc. 23rd Internat. Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Anal. of Algorithms (AofA ’12)

P. Jacquet, “Common Words Between Two Random Strings”, Proc. IEEE Internat. Symp. on Inform. Theory (ISIT ’07)

-----

P. Jacquet and W. Szpankowski, “Analytical Depoissonization and Its Applications”, Theoret. Comput. Sci., 201:1-2 (1998), 1–62.

P. Jacquet and W. Szpankowski, “Autocorrelation on Words and Its Applications: Analysis of Suffix Trees by String-Ruler Approach”, J. Combin. Theory Ser. A, 66:2 (1994), 237–269.

Questions
