A TOPIC ANALYSIS APPROACH TO REVEALING DISCUSSIONS ON THE AUSTRALIAN TWITTERSPHERE Brenda Moon Queensland University of Technology

Post on 15-Jan-2017

Introduction

This paper investigates techniques for identifying the topics discussed in one week of tweets from the Australian Twittersphere. The tweets were extracted from a comprehensive dataset that captures all tweets by 2.8m Australian accounts: the Tracking Infrastructure for Social Media Analysis (TrISMA) (Bruns, Burgess & Banks et al., 2016).

Selected week: Sunday 2 August to Saturday 8 August 2015

• Thursday 6th August 2015 was used for One Day in the Life of a National Twittersphere (Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016)

• Same day used for initial development of topic modelling approach

• Then extended to full week

Latent Dirichlet Allocation

Blei, D. M. (2011)

Data cleaning

• Remove
  – retweets & multitweets (“rt”, “mt” or “via”)
  – URLs
  – dates, times, distances & weights
  – words of fewer than 3 characters
  – ellipses ('...')

• NLTK tokenisation using Twitter Tokenizer
  – Remove all @users and URLs
  – Lowercase

• Convert
  – HTML entities to text
  – Hashtags to words (trim ‘#’ off hashtags)

• NLTK lemmatisation

• NLTK stopwords
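The cleaning steps above can be sketched roughly as follows. This is an illustration only, not the code used for the study. It uses only the Python standard library, so NLTK's tokeniser and stopword list are replaced by stand-ins (a regular expression and a tiny hypothetical `STOPWORDS` set), and lemmatisation is omitted. It also assumes that retweets and multitweets are dropped whole, and it approximates the removal of dates, times, distances and weights by keeping only alphabetic tokens.

```python
import html
import re

# Hypothetical, deliberately tiny stopword set; the slides use NLTK's list.
STOPWORDS = {"the", "and", "for", "this", "that", "with", "are", "was"}

# Assumption: a tweet counts as a retweet/multitweet if it contains
# "rt", "mt" or "via" directly followed by an @mention.
RETWEET_RE = re.compile(r"(?:^|\s)(?:rt|mt|via)\s+@\w+", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Stand-in tokeniser: alphabetic runs only, so digits and "..." disappear.
TOKEN_RE = re.compile(r"[a-z][a-z']*")


def clean_tweet(text):
    """Return a list of cleaned tokens, or None if the tweet is dropped."""
    if RETWEET_RE.search(text):
        return None
    text = html.unescape(text)            # HTML entities to text
    text = URL_RE.sub(" ", text)          # remove URLs
    text = MENTION_RE.sub(" ", text)      # remove @users
    text = text.replace("#", " ").lower() # hashtags to words, lowercase
    tokens = TOKEN_RE.findall(text)
    # Drop short words and stopwords (no lemmatisation in this sketch).
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
```

For example, `clean_tweet("RT @abc: hello")` returns `None`, while an original tweet is reduced to its content words.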

Hashtag pooling

• Mehrotra, Sanner, Buntine & Xie (2013) compared different ways of ‘pooling’ tweets into documents before LDA analysis to see whether pooling could increase accuracy. They found that hashtag pooling was effective (hashtag pooling with clustering performed best, but is more complex to apply)

• Group all the tweets with hashtags into documents for each hashtag (some tweets will be added into more than one document)

• Tweets without hashtags stay as individual documents
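A minimal sketch of the pooling rule described above. The function name `pool_by_hashtag` is hypothetical, and the sketch assumes pooling is done on text that still contains the '#' markers (in the pipeline on the previous slide, the hashtags would need to be recorded before they are trimmed).

```python
import re
from collections import defaultdict

HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Group tweets into one document per hashtag.

    A tweet with several hashtags is added to several documents;
    tweets without hashtags stay as individual documents.
    """
    pools = defaultdict(list)
    singles = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            for tag in tags:
                pools[tag].append(text)
        else:
            singles.append(text)
    # One pooled document per hashtag (sorted for a stable order),
    # followed by the unpooled tweets.
    pooled = [" ".join(texts) for _, texts in sorted(pools.items())]
    return pooled + singles
```

Because pooled documents replace the individual hashtagged tweets, the number of documents falls, which is consistent with the drop in document count reported on the next slide.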

Corpus filtering (Thursday 6 August 2015)

• Raw tweets: 963,064
• After data cleaning: 583,528
• After hashtag pooling: 516,263
  – 23% of tweets had hashtags
• Dictionary pruning – remove most frequent and least frequent terms
  – no_above=0.5 (fraction of documents, i.e. 50%), no_below=5 (documents)
  – 223,157 unique tokens reduced to 49,964 unique tokens
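The parameter names `no_above` and `no_below` match gensim's `Dictionary.filter_extremes`. The pure-Python stand-in below sketches the document-frequency rule those parameters describe; `prune_vocabulary` is a hypothetical helper, not the gensim implementation, and it ignores gensim's additional `keep_n` cap.

```python
from collections import Counter


def prune_vocabulary(docs, no_below=5, no_above=0.5):
    """Return the tokens kept after document-frequency pruning.

    docs is a list of token lists. A token is kept if it appears in at
    least `no_below` documents and in no more than `no_above` (a
    fraction) of all documents.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}
```

With the values on the slide, a term must occur in at least 5 documents and in at most half of all documents to remain in the dictionary.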

Latent Dirichlet Allocation (LDA)

• Gensim LDA (Lau & Baldwin, 2014)
• LdaMulticore
• Identify 30 topics
• 100 passes
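In gensim, the settings above would correspond to a call along the lines of `LdaMulticore(corpus, id2word=dictionary, num_topics=30, passes=100)`, where `corpus` and `dictionary` are assumed names for the bag-of-words corpus and pruned dictionary. To show what such a model estimates without depending on an external library, here is a toy LDA fitted by collapsed Gibbs sampling. Note that this is a different inference method from the one gensim uses, that `alpha`, `beta` and `seed` are illustrative values, and that `passes` here means Gibbs sweeps rather than gensim passes.

```python
import random


def lda_gibbs(docs, num_topics, passes=100, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling over a list of token lists.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(passes):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1
    return doc_topic, topic_word
```

The highest-count words in each `topic_word[k]` play the role of the top terms that are inspected and labelled on the following slides.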

Thursday 6th August 2015 – overall terms

https://github.com/bmabey/pyLDAvis

Thursday 6th August 2015
Topic 2: Politics / coal / China / Queensland

Thursday 6th August 2015
Topic 5: Cricket – The Ashes

Thursday 6th August 2015
30 topics, with hashtag pooling.

MH370

Thursday 6th August 2015
30 topics, with hashtag pooling.
Comparison to other study

Pop?

Teen culture?

MH370

1.1m tweets from 147k, to 224k accounts
294k nodes total, including non-Australians
535k edges from 856k @mentions / RTs

Visualisation: Gephi, Force Atlas 2
Colours: Gephi, modularity resolution 1.0

Labels assigned through qualitative evaluation

Politics

Cricket

Teen Culture

Pop

From “One Day in the Life of a National Twittersphere” by Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016.
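The network figures above imply that repeated @mentions between the same pair of accounts are merged into a single edge (856k @mentions / RTs become 535k edges). A minimal sketch of that aggregation, assuming weighted directed edges from author to mentioned account; `mention_edges` is a hypothetical helper, not the code used for the original study.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Aggregate @mentions into weighted directed edges.

    tweets is an iterable of (author, text) pairs. Returns a Counter
    mapping (source, target) to the number of mentions on that edge.
    """
    edges = Counter()
    for author, text in tweets:
        for target in MENTION_RE.findall(text):
            edges[(author.lower(), target.lower())] += 1
    return edges
```

The resulting edge list, with weights, is the kind of input a tool such as Gephi can lay out and cluster.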

Further Outlook

• Confirm initial topic labelling by looking at top tweets for each topic
• Check whether the hashtag pooling has allowed non-hashtag tweet topics to still be visible
• Use statistical coherence of model (U_Mass coherence, C_V coherence) to tune LDA parameters
• Model different numbers of topics (coarse/fine grain)
• Relate topics per user back to our mention network graphs
• Extend to the full week (or longer)
• Compare to alternative approaches
  – Doc2Vec / TensorFlow / dynamic LDA etc.
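As an illustration of the coherence measures mentioned above, the sketch below computes a UMass-style coherence score for one topic from document co-occurrence counts. It uses the summed (not averaged) form, assumes `top_words` is ordered from most to least probable, and skips pairs whose conditioning word never occurs; `umass_coherence` is a hypothetical helper. Library implementations may aggregate differently, so scores are comparable only within one implementation.

```python
import math


def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic.

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs where w_l is
    ranked above w_m, with D counting documents containing the word(s).
    Higher (closer to zero or positive) means more coherent.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_count(top_words[l])
            if d_l == 0:
                continue  # conditioning word absent from the corpus
            d_ml = doc_count(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score
```

Comparing such scores across models with different numbers of topics is one way to approach the tuning step listed above.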

References

• Blei, D. M. (2011). Introduction to probabilistic topic models. Communications of the ACM, 1–16. Retrieved from http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

• Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892. http://doi.org/10.1145/2484028.2484166

• Lau, J. H., & Baldwin, T. (2014). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation.

• Puschmann, C., & Scheffler, T. (2016). Topic modeling for media and communication research: A short primer (HIIG Discussion Paper Series No. 2016–5). Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2836478

• Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. Retrieved from http://www.aclweb.org/anthology/W/W14/W14-3110