Discovering the multifaceted information hidden within large user-generated text streams

Daniel Preotiuc-Pietro

23.04.2014

Context

• vast increase in user-generated content

• Online Social Networks: most time-consuming activity on the Internet

• multiple modalities: text, time, location, user info, images, etc.

• social network structure

• Challenges:

  • Engineering: data volume

  • Algorithmic: restricted information, grounded in context, streaming, noise

Motivation

Assumption: Text has different use conditioned on factors such as time, location, etc.

Aim: Build models which incorporate these factors

Tasks:

• Supervised prediction applications

  • internal, external

• Study the effect of these factors in text use

• Improve performance of downstream applications

Outline

i. Introduction

ii. Data processing

iii. Temporal patterns

iv. Text forecasting real-world outcomes

v. Spatio-temporal clustering

vi. User level properties

TrendMiner project

• ‘Large scale, cross-lingual trend mining and summarization of real time media streams’

• 6+4 organisations; we work with University of Southampton and SORA on machine learning

• application to predicting political polls and aiding political analysts to make sense of social media data

www.trendminer-project.eu

Text Processing

RT @MediaScotland greeeat!!!lvly speech by cameron on scott's indy :) #indyref

• unorthodox capitalisation

• OOV words

• creative spellings

• shortenings

• new conventions

• lack of context
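
The phenomena in the example tweet (retweet marker, @-mention, hashtag, elongated spelling, emoticon) can be illustrated with a small regex tokeniser. This is only a minimal sketch: the token classes and the function names `tokenise` and `squeeze` are assumptions made for illustration, not the tokeniser used in the Trendminer pipeline.

```python
import re

# Ordered alternatives: earlier patterns win at each position.
# These token classes are an illustrative assumption.
TOKEN_RE = re.compile(
    r"""
    (?P<mention>@\w+)                       # @-mentions
  | (?P<hashtag>\#\w+)                      # hashtags
  | (?P<emoticon>[:;=][-']?[)(DPp])         # a few common emoticons
  | (?P<word>[A-Za-z]+(?:'[A-Za-z]+)?)      # words, optionally with an apostrophe part
  | (?P<punct>[!?.,]+)                      # runs of punctuation
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return (token_class, token) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def squeeze(word):
    """Collapse runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


tweet = "RT @MediaScotland greeeat!!!lvly speech by cameron on scott's indy :) #indyref"
tokens = tokenise(tweet)
print(tokens)
print(squeeze("greeeat"))
```

Squeezing repeated letters only reduces the vocabulary of creative spellings; mapping the result to a dictionary word would need a separate normalisation step.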

Processing Architecture

• Fast: real time processing, Hadoop MapReduce (I/O bound), online and batch processing

• Scalable: adding more machines

• Modular: easy to add new modules

• Pipeline: the user specifies his needs

• Extensible: different sources of data (USMF format)

• Data consistency: JSON format, append to ‘analysis’

• Reusable: open-source

(ICWSM 2012)
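
The modular, JSON-based design listed above can be sketched as a chain of functions that each read one item and append their output under the `analysis` key. Only the `analysis` field name comes from the slide; the `id` and `text` fields and the two modules are hypothetical, and this is not the Trendminer code.

```python
import json


def tokenise_module(item):
    # Hypothetical module: whitespace tokenisation of the item's text.
    return {"tokens": item["text"].split()}


def length_module(item):
    # Hypothetical module that reuses the output of an earlier module.
    return {"n_tokens": len(item["analysis"]["tokens"])}


def run_pipeline(line, modules):
    """Apply the modules in order to one JSON-encoded item.

    Each module's result is appended under 'analysis'; the original
    fields of the item are left untouched.
    """
    item = json.loads(line)
    analysis = item.setdefault("analysis", {})
    for module in modules:
        analysis.update(module(item))
    return json.dumps(item, sort_keys=True)


line = json.dumps({"id": 1, "text": "lvly speech by cameron #indyref"})
result = run_pipeline(line, [tokenise_module, length_module])
print(result)
```

Because every stage consumes and emits one JSON line, the same function could serve as the body of a map step in a batch job or be applied to a stream item by item.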

Components

Gaussian Processes

(EMNLP 2013)

Task: Forecast hashtag frequency in Social Media - identify and categorise complex temporal patterns

Non-parametric Bayesian framework

• kernelised

• probabilistic formulation

• propagation of uncertainty

• exact posterior inference for regression

• non-parametric extension of Bayesian regression

• very good results, but hardly used in NLP

Gaussian Processes

Define prior over functions

Compute posterior

(ACL 2014 Tutorial)
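
The two steps (prior over functions, then posterior) can be made concrete for regression with a zero-mean prior and a squared-exponential (SE) kernel. The following is a minimal pure-Python sketch, assuming fixed hyperparameters and Gaussian observation noise; the function names and the toy data are illustrative, and a real application would use a GP library and optimise the hyperparameters.

```python
import math


def se_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, x_star, noise=1e-2):
    """Posterior mean and variance of the latent function at x_star."""
    n = len(xs)
    K = [[se_kernel(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [se_kernel(x, x_star) for x in xs]
    alpha = solve(K, ys)       # (K + noise*I)^{-1} y
    v = solve(K, k_star)       # (K + noise*I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = se_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


xs, ys = [0.0, 1.0, 2.0], [0.0, 0.8, 0.9]
print(gp_posterior(xs, ys, 1.0))    # near an observation: low variance
print(gp_posterior(xs, ys, 10.0))   # far from the data: reverts to the prior
```

Far from the observations the SE kernel returns to the prior mean and variance, which is why kernel choice matters for extrapolation.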

Extrapolation

Examples of time series

[Figure: example time series for the hashtags #FAIL, #RAW, #SNOW and #FYI]

SE

Experimental results

Compared to Mean prediction

Text classification

Task: Assign the hashtag to a given tweet

• Most frequent (MF)

• Naive Bayes model (NB-E)

• Naive Bayes with GP forecast as prior (NB-P)

Metric     MF       NB-E     NB-P
Match@1    7.28%    16.04%   17.39%
Match@5    19.90%   29.51%   31.91%
Match@50   44.92%   59.17%   60.85%
MRR        0.144    0.237    0.252
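
The difference between NB-E and NB-P lies only in the class prior: the likelihood term is shared, and the prior is either estimated from training counts or supplied by a forecast. A minimal sketch follows, assuming a multinomial Naive Bayes with Laplace smoothing; the toy tweets and the forecast values are made up for illustration and do not come from the experiments.

```python
import math
from collections import Counter, defaultdict


def train_nb(tweets):
    """tweets: list of (tokens, hashtag). Returns per-tag word counts and tag counts."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    for tokens, tag in tweets:
        tag_counts[tag] += 1
        word_counts[tag].update(tokens)
    return word_counts, tag_counts


def rank_tags(tokens, word_counts, prior, vocab_size, smoothing=1.0):
    """Rank hashtags by log prior plus Laplace-smoothed log likelihood."""
    scores = {}
    for tag, p in prior.items():
        total = sum(word_counts[tag].values())
        score = math.log(p)
        for tok in tokens:
            score += math.log((word_counts[tag][tok] + smoothing)
                              / (total + smoothing * vocab_size))
        scores[tag] = score
    return sorted(scores, key=scores.get, reverse=True)


train = [(["snow", "cold"], "#snow"),
         (["snow", "london"], "#snow"),
         (["fail", "train"], "#fail")]
word_counts, tag_counts = train_nb(train)
vocab_size = len({tok for toks, _ in train for tok in toks})

n = sum(tag_counts.values())
empirical_prior = {tag: c / n for tag, c in tag_counts.items()}   # NB-E style
forecast_prior = {"#snow": 0.2, "#fail": 0.8}   # stand-in for a GP forecast (made up)

query = ["delayed"]   # uninformative token, so the prior decides the ranking
print(rank_tags(query, word_counts, empirical_prior, vocab_size))
print(rank_tags(query, word_counts, forecast_prior, vocab_size))
```

With an uninformative tweet the ranking follows the prior, which is where a time-aware forecast can help most.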

User behaviour

Task: Predict venue check-in frequencies

• Modelled using GPs

• Compared to Mean

[Figure: results for Linear, SE, PER, PS and Select; axis ticks from -150 to 100]

Individual user behaviour

Task: Predict venue type of user check-in

• highly periodic

• compared to standard Markov predictors

Method           Accuracy
Random           11.11%
M.Freq Categ.    35.21%
Markov-1         36.13%
Markov-2         34.21%
Daily period     38.92%
Weekly period    40.65%

(WebScience 2013)
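
The Markov baselines in the table predict the next venue type from the previous one or two check-ins. A minimal sketch of such an order-k predictor is given below; the venue labels and the toy sequence are hypothetical, and falling back to the most frequent category for unseen contexts is an assumption.

```python
from collections import Counter, defaultdict


def fit_markov(sequence, order=1):
    """Count which venue type follows each context of `order` previous check-ins."""
    table = defaultdict(Counter)
    for i in range(order, len(sequence)):
        table[tuple(sequence[i - order:i])][sequence[i]] += 1
    return table


def predict_next(table, history, fallback, order=1):
    """Most frequent successor of the last `order` check-ins, else the fallback."""
    context = tuple(history[-order:])
    if context in table:
        return table[context].most_common(1)[0][0]
    return fallback


# Hypothetical check-in history of one user.
seq = ["Home", "Work", "Food", "Work", "Food", "Home", "Work", "Food", "Work"]
most_frequent = Counter(seq).most_common(1)[0][0]

markov1 = fit_markov(seq, order=1)
markov2 = fit_markov(seq, order=2)
print(predict_next(markov1, ["Work"], most_frequent))
print(predict_next(markov2, ["Home", "Work"], most_frequent, order=2))
```

Higher orders split the same history over more contexts, so each context has fewer observations to count from.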

(ACL 2013)

Text based forecasting

Task: predicting real world outcomes

Aim: replace expensive polls with streaming text

• predict political voting intention (not elections!)

• based on social media (Twitter) text

• strong baselines (last day, mean)

• 2 different use cases (UK and Austria)

• UK: 42k users, 60m tweets, 3 parties, 2 years

Linear regression

$$ w x_t + \beta = y_t $$

Linear regression

$$ w, \beta = \arg\min \sum_{i=1}^{n} (w x_i + \beta - y_i)^2 $$

Linear regression

$$ w, \beta = \arg\min \sum_{i=1}^{n} (w x_i + \beta - y_i)^2 + \psi_{el}(w, \rho) $$

LEN – Elastic Net

Bilinear regression

• main issue is noise: many non-informative users

• we look for a model of sparse words & sparse users

• bi-convex optimisation problem

• solved by alternately fixing each set of weights and iterating until convergence
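
The alternating scheme can be sketched as two ordinary elastic-net fits that take turns: with the user weights u fixed, the model is linear in the word weights w, and vice versa. The code below is a minimal pure-Python sketch, not the authors' implementation. The penalty form `lam * (alpha * |w|_1 + 0.5 * (1 - alpha) * |w|_2^2)`, the 0.5 factor on the squared loss, the initialisation and all parameter values are assumptions.

```python
import math
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t (proximal step for an L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(F, y, init, lam, alpha=0.5, sweeps=30):
    """Coordinate descent on 0.5*sum((F w + b - y)^2)
    + lam*(alpha*|w|_1 + 0.5*(1 - alpha)*|w|_2^2), warm-started at `init`."""
    n, d = len(F), len(init)
    w = list(init)
    b = sum(y[i] - sum(F[i][k] * w[k] for k in range(d)) for i in range(n)) / n
    for _ in range(sweeps):
        for j in range(d):
            rho = sum(
                F[i][j] * (y[i] - b - sum(F[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            )
            z = sum(F[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
        b = sum(y[i] - sum(F[i][k] * w[k] for k in range(d)) for i in range(n)) / n
    return w, b


def predict(u, X, w, b):
    """Bilinear prediction u X w^T + b for one users-by-words matrix X."""
    return sum(u[a] * X[a][c] * w[c] for a in range(len(u)) for c in range(len(w))) + b


def bilinear_fit(Xs, y, rounds=10, lam_u=0.01, lam_w=0.01):
    p, m = len(Xs[0]), len(Xs[0][0])
    u, w, b = [1.0 / p] * p, [0.0] * m, 0.0
    for _ in range(rounds):
        # Fix u: the features for each time step are the row vector u X_i.
        F = [[sum(u[a] * X[a][c] for a in range(p)) for c in range(m)] for X in Xs]
        w, b = elastic_net_cd(F, y, w, lam_w)
        # Fix w: the features for each time step are the column vector X_i w^T.
        G = [[sum(X[a][c] * w[c] for c in range(m)) for a in range(p)] for X in Xs]
        u, b = elastic_net_cd(G, y, u, lam_u)
    return u, w, b


def rmse(preds, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


# Synthetic data: 30 time steps, 3 users, 4 words (all values made up).
rng = random.Random(0)
true_u, true_w = [1.0, 0.0, 0.5], [2.0, 0.0, -1.0, 0.0]
Xs = [[[rng.random() for _ in range(4)] for _ in range(3)] for _ in range(30)]
ys = [predict(true_u, X, true_w, 0.3) for X in Xs]

u, w, b = bilinear_fit(Xs, ys)
mean_y = sum(ys) / len(ys)
rmse_fit = rmse([predict(u, X, w, b) for X in Xs], ys)
rmse_mean = rmse([mean_y] * len(ys), ys)
print(rmse_fit, rmse_mean)
```

Each half-step is a convex problem, so the joint objective never increases, but the result can depend on the initialisation because the overall problem is only bi-convex.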

Bilinear regression

$$ u X_t w^T + \beta = y_t $$

Bilinear regression

$$ w, u, \beta = \arg\min \sum_{i=1}^{n} (u X_i w^T + \beta - y_i)^2 $$

Bilinear regression

$$ w, u, \beta = \arg\min \sum_{i=1}^{n} (u X_i w^T + \beta - y_i)^2 + \psi_{el}(w, \rho_1) + \psi_{el}(u, \rho_2) $$

BEN – Bilinear Elastic Net

Bilinear regression

$$ w_t, u_t, \beta = \arg\min \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{el}(w_t, \rho_1) + \psi_{el}(u_t, \rho_2) $$

Bilinear regression

$$ w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{\ell_1\ell_2}(w, \rho_1) + \psi_{\ell_1\ell_2}(u, \rho_2) $$

BGL – Bilinear Group LASSO

Quantitative results

Root Mean Squared Error (RMSE) forecasting results over 50 testing polls (in VI %)

[Figure panels: BGL, BEN, Polls]

Quantitative results

Party  Score   Author               Tweet
CON    1.334   Journalist           PM in friendly chat with top EU mate, Sweden’s Fredrik Reinfeldt, before family photo
CON    -0.991  Journalist           Have Liberal Democrats broken electoral rules? Blog on Labour complaint to cabinet secretary
LAB    1.954   Art Fanzine          Blog Post Liverpool: City of Radicals Website now Live <link> #liverpool #art
LAB    -0.552  Politician (Labour)  I am so pleased to head Paul Savage who worked for the Labour group has been Appointed the Marketing manager for the baths hall GREAT NEWS
LBD    0.874   LibDem MP            RT @user: Must be awful for TV bosses to keep getting knocked back by all the women they ask to host election night (via @user)
LBD    -0.521  Art Fanzine          Blog Post Liverpool: City of Radicals 2011 – More Details Announced #liverpool #art

• The real-world outcome and users share:

i. region info: London (L), South England (S), Midlands & Wales (MW), North (N), Scotland (Sc) - observed

ii. gender: Male (M), Female (F) - inferred using statistical text-based classifier

iii. age: 18-24, 25-39, 40-59, 60+ - unknown

User features

Recap: Bilinear regression

$$ w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{i=1}^{n} (u_t X_i w_t^T + \beta - y_{ti})^2 + \psi_{\ell_1\ell_2}(w, \rho_1) + \psi_{\ell_1\ell_2}(u, \rho_2) $$

BGL – Bilinear Group LASSO

Region & Demographics

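$$ w, u, \beta = \arg\min \sum_{t=1}^{\tau} \sum_{r=1}^{\partial} \sum_{i=1}^{n} (u_{tr} X_{ir} w_{tr}^T + \beta_{tr} - y_{tir})^2 + \sum_{r=1}^{\partial} \psi_{\ell_1\ell_2}(w_r, \rho_1) + \psi_{\ell_1\ell_2}(w_t, \rho_1) + \psi_{\ell_1\ell_2}(u_r, \rho_2) $$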

BGGR


Regional model

          S     L     MW    N     Sc    μ
B_μ       2.9   3.9   3.2   3.2   3.8   3.4
B_last    3.0   4.9   4.3   4.0   5.3   4.3
BGGR      2.6   3.9   3.2   3.0   3.7   3.3

Gender model

          M     F     μ
B_μ       2.6   2.1   2.4
B_last    2.6   2.4   2.5
BGGR      2.1   2.1   2.1


London Predictions

Female Predictions


Conservatives, Positive London

Task: Predict socioeconomic EU indicators

Dataset:

• News summaries from Open Europe think tank

• Daily summaries of news related to the EU and its member states, together with their news source

• Feb 2006 – Nov 2013; 1,913 days; 94 months

• 296 news outlets (with >10 summaries)

• Features: unigrams + bigrams

NewsSummaries dataset

(LACSS 2014)

Predictions

Model   ESI (Economic Sentiment Indicator)   Unemployment
LEN     9.253 (9.89%)                        0.9275 (8.75%)
BEN     8.209 (8.77%)                        0.9047 (8.52%)

[Figure panels: Economic Sentiment Indicator; Unemployment]

Deep linguistic features

• Unigrams (8,912) (cameron)

• Bigrams (33,206) (david__cameron)

• POS (10,277): Unigrams together with their part-of-speech (cameron/NNP)

• NE (1,013): Entities - Location, Person or Organisation (Person:David_Cameron)

• Annotations (3,392): Link entities to DBpedia, e.g. political party (Org:Conservative_Party), office held (Office:Prime_minister)

Deep linguistic features

Features                                ESI     Unempl.
Unigrams                                8.21    1.27
Bigrams                                 9.66    1.61
Unigrams + Bigrams                      8.91    1.47
POS                                     7.87    1.14
Entities                                9.59    1.45
POS + NE                                8.09    1.12
NE + Annotations                        12.67   1.62
POS + NE + Annotations                  10.50   1.31
Unigrams + NE + Annotations             10.92   1.31
Unigrams + Bigrams + NE + Annotations   10.81   1.53

Dimensionality reduction is used to aid browsing large data collections

Topic models:

• find ‘topics’ in a collection of documents

• ‘topic’ = a set of semantically coherent words

• each document is assigned to a few ‘topics’

• each word is assigned to each ‘topic’ with a probability (soft clustering)

• extra factors can be accommodated, e.g. spatio-temporal dependencies and evolution

Clustering

Temporal topic models

Latent Dirichlet Allocation (LDA)

Dirichlet Multinomial Regression (DMR)

• LDA: Documents analysed over time, no temporal conditioning

• Temporal DMR (MId): Documents authored in the same interval share similar topics

• Temporal DMR (TimeRBF): Neighbouring time intervals influence each other

• Regional DMR (OutletId): Documents with similar news source share similar topics

• Regional DMR (DomainId): Documents with similar domain name share similar topics

Temporal & Regional models

Spatio-temporal experiments

Method                           Perplexity
LDA                              4,597
DMR MId                          4,575
DMR TimeRBF                      4,262
DMR TimeRBF+OutletId             4,086
DMR TimeRBF+OutletId+DomainId    4,036
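
For reading the table, perplexity is commonly defined as the exponential of the negative mean per-token log-likelihood on held-out text, so lower is better. The sketch below assumes that standard definition (the slides do not state the exact evaluation protocol) and also computes the relative reduction between the first and last rows of the table.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean per-token log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Sanity check: a model that is uniform over 4 words has perplexity 4.
uniform = perplexity([math.log(0.25)] * 10)
print(uniform)

# Relative reduction from LDA (4,597) to DMR TimeRBF+OutletId+DomainId (4,036).
reduction = (4597 - 4036) / 4597
print(f"{reduction:.1%}")
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k words.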

Experiments: temporal & regional

Top domains: .it 3.44, .fr 0.09, .tv 0.08, .ee 0.06, .ir 0.05

Top outlets: ft.com 0.79, corriere.it 0.68, repubblica.it 0.49, elpais.com 0.45

Top domains: .fr 0.27, .org 0.10, .es 0.08, .ca 0.06, .ch 0.03

Top outlets: guardian.co.uk 0.61, diplomatie.gouv.fr 0.60, bluesstatedigital.com 0.55, dw-world.de 0.49

User-level properties

• User-level properties: age, gender, location, social grade, impact

• Aim: understand text use in context of these features - ‘profile’ users

• Task:

• build a model with good predictive value on held-out users

• interpret the features of this model

User impact

Impact score:

$$ \ln\left(\frac{\text{listings} \cdot \text{followers}^2}{\text{followees}}\right) $$
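
As a worked example, the impact score can be computed directly from the three account counts. This is a minimal sketch: the additive smoothing constant `theta` is an assumption added here so that zero counts do not break the logarithm or the division; it is not shown on the slide.

```python
import math


def impact_score(followers, followees, listings, theta=1.0):
    """ln((listings + theta) * (followers + theta)^2 / (followees + theta)).

    `theta` is an assumed smoothing constant, not part of the slide's formula.
    """
    return math.log((listings + theta) * (followers + theta) ** 2 / (followees + theta))


# Made-up accounts for illustration.
print(impact_score(followers=0, followees=0, listings=0))
print(impact_score(followers=99, followees=9, listings=0))
print(impact_score(followers=999, followees=9, listings=4))
```

Followers enter squared, so a tenfold increase in followers adds about 2 ln 10 ≈ 4.6 to the score, while a tenfold increase in followees subtracts about 2.3.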

Data: 38k UK users, 48m deduplicated messages, all tweets from 1 year

Features: profile info and text under the user’s control

(EACL 2014)

User impact

• Models:

  • Linear Regression (LIN)

  • Gaussian processes (GP) with ARD kernel

• Features:

  • User account (18)

  • Topics from user text (100): derived using spectral clustering on word co-occurrence matrix

Pearson correlation
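
An ARD (automatic relevance determination) kernel gives every input feature its own lengthscale, so a feature with a very large lengthscale has almost no effect on the covariance. The following is a minimal sketch of a squared-exponential ARD kernel with hand-set lengthscales; in practice they would be learned, and the slides do not spell out how relevance values are derived from them.

```python
import math


def ard_se_kernel(x, y, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per input dimension."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * math.exp(-0.5 * sq)


lengthscales = [1.0, 1000.0]   # feature 0 relevant, feature 1 effectively ignored

# Points that differ only in feature 0: covariance drops noticeably.
k_dim0 = ard_se_kernel([0.0, 0.0], [1.0, 0.0], lengthscales)
# Points that differ only in feature 1: covariance stays close to 1.
k_dim1 = ard_se_kernel([0.0, 0.0], [0.0, 1.0], lengthscales)
print(k_dim0, k_dim1)
```

Inspecting the learned lengthscales is one common way to interpret which features a GP model relies on.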

User impact

Feature Importance

Importance  Feature
0.73        Using default profile image
1.32        Total number of tweets (entire history)
2.31        Number of unique @-mentions in tweets
3.47        Number of tweets (in dataset)
3.57        Links ratio in tweets
3.73        T1 (Weather): mph, humidity, barometer, gust, winds
5.44        T2 (Healthcare, Housing): nursing, nurse, rn, registered, bedroom, clinical, #news, estate, #hospital
6.07        T3 (Politics): senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections
6.96        Proportion of days with non-zero tweets
7.10        Proportion of tweets with @-replies

User impact

Impact distribution for users with high (H) values of this feature as opposed to low (L). Red line is the mean impact score.

[Figure panels: Number of tweets; Number of unique @-mentions]

User impact

[Figure panel 1, topic words: damon, potter, #tvd, harry, elena, kate, portman, pattinson, hermione, jennifer]

[Figure panel 2, topic words: senate, republican, gop, police, arrested, voters, robbery, democrats, presidential, elections]

Impact distribution for users with high (H) values of this feature. Red line is the mean impact score.

User impact

User scenario:

1. high number of tweets

2. talk about T3 (showbiz)

3. talk about T4 (politics)

4. use links (L)

5. do not use links (NL)

Collaborators

• Vasileios Lampos, UCL, www.lampos.net

• Trevor Cohn, Melbourne, http://dcs.shef.ac.uk/~tcohn/

• Sina Samangooei, Southampton, www.sinjax.net

• Nikos Aletras, Sheffield, http://dcs.shef.ac.uk/~nikos/

References

(ICWSM 2012) Trendminer: An Architecture for Real Time Analysis of Social Media Text.
D. Preotiuc-Pietro, S. Samangooei, T. Cohn, N. Gibbins, M. Niranjan

(HT 2013) Where’s @wally: A classification approach to Geolocating users based on their social ties.
D. Rout, D. Preotiuc-Pietro, K. Bontcheva, T. Cohn (‘Ted Nelson’ award)

(WebScience 2013) Mining User Behaviours: A study of check-in patterns in Location Based Social Networks.
D. Preotiuc-Pietro, T. Cohn

(ACL 2013) A user-centric model of voting intention from Social Media.
V. Lampos, D. Preotiuc-Pietro, T. Cohn

(EMNLP 2013) A temporal model of text periodicities using Gaussian Processes.
D. Preotiuc-Pietro, T. Cohn

(EACL 2014) Predicting and Characterising User Impact on Twitter.
V. Lampos, N. Aletras, D. Preotiuc-Pietro, T. Cohn

(LACSS 2014) Extracting Socioeconomic Patterns from the News: Modelling Text and Outlet Importance Jointly.
V. Lampos, D. Preotiuc-Pietro, S. Samangooei, D. Gelling, T. Cohn

Thank you!