
An analysis of the user occupational class through Twitter content

Daniel Preoţiuc-Pietro (1), Vasileios Lampos (2), Nikolaos Aletras (2)

(1) Computer and Information Science, University of Pennsylvania
(2) Department of Computer Science, University College London

29 July 2015


Motivation

User attribute prediction from text is successful:

- Age (Rao et al. 2010 ACL)
- Gender (Burger et al. 2011 EMNLP)
- Location (Eisenstein et al. 2011 EMNLP)
- Personality (Schwartz et al. 2013 PLoS One)
- Impact (Lampos et al. 2014 EACL)
- Political orientation (Volkova et al. 2014 ACL)
- Mental illness (Coppersmith et al. 2014 ACL)

Downstream applications are benefiting from this:

- Sentiment analysis (Volkova et al. 2013 EMNLP)
- Text classification (Hovy 2015 ACL)


However...

Socio-economic factors (occupation, social class, education, income) play a vital role in language use (Bernstein 1960, Labov 1972/2006)

No large-scale user-level dataset to date

Applications:

- sociological analysis of language use
- embedding into downstream tasks (e.g. controlling for socio-economic status)


At a Glance

Our contributions:

- Predicting a new user attribute: occupation
- New dataset: user ↔ occupation
- Gaussian Process classification for NLP tasks
- Feature ranking and analysis using non-linear methods


Standard Occupational Classification

Standardised job classification taxonomy

Developed and used by the UK Office for National Statistics (ONS)

Hierarchical:

- 1-digit (major) groups: 9
- 2-digit (sub-major) groups: 25
- 3-digit (minor) groups: 90
- 4-digit (unit) groups: 369

Jobs grouped by skill requirements


Standard Occupational Classification

C1 Managers, Directors and Senior Officials

- 11 Corporate Managers and Directors
  - 111 Chief Executives and Senior Officials
    - 1115 Chief Executives and Senior Officials
      Job: chief executive, bank manager
    - 1116 Elected Officers and Representatives
  - 112 Production Managers and Directors
  - 113 Functional Managers and Directors
  - 115 Financial Institution Managers and Directors
  - 116 Managers and Directors in Transport and Logistics
  - 117 Senior Officers in Protective Services
  - 118 Health and Social Services Managers and Directors
  - 119 Managers and Directors in Retail and Wholesale
- 12 Other Managers and Proprietors
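Because the classification is hierarchical, each coarser group is simply a prefix of the 4-digit unit code. The sketch below illustrates this, assuming codes are stored as 4-character strings; the function name `soc_levels` is hypothetical and not part of any ONS tool.

```python
def soc_levels(unit_code: str) -> dict:
    """Split a 4-digit SOC unit code into its four hierarchy levels."""
    if len(unit_code) != 4 or not unit_code.isdigit():
        raise ValueError(f"expected a 4-digit SOC unit code, got {unit_code!r}")
    return {
        "major": unit_code[:1],      # 1-digit group, e.g. C1
        "sub_major": unit_code[:2],  # 2-digit group
        "minor": unit_code[:3],      # 3-digit group
        "unit": unit_code,           # 4-digit group
    }


# 1115 is "Chief Executives and Senior Officials" in the tree above.
levels = soc_levels("1115")
print(levels)
```

Truncating to the first digit gives the major group, which is how a 4-digit or 3-digit label can be mapped to one of the 9 top-level classes.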


Standard Occupational Classification

C2 Professional Occupations
Job: mechanical engineer, pediatrist, postdoctoral researcher

C3 Associate Professional and Technical Occupations
Job: system administrator, dispensing optician

C4 Administrative and Secretarial Occupations
Job: legal clerk, company secretary

C5 Skilled Trades Occupations
Job: electrical fitter, tailor

C6 Caring, Leisure, Other Service Occupations
Job: school assistant, hairdresser

C7 Sales and Customer Service Occupations
Job: sales assistant, telephonist

C8 Process, Plant and Machine Operatives
Job: factory worker, van driver

C9 Elementary Occupations
Job: shelf stacker, bartender


Data

5,191 users ↔ 3-digit job group

Users collected by self-disclosure of job title in profile

Manually filtered by the authors

10M tweets, average 94.4 users per 3-digit group


Data

Here we classify only at the 1-digit top-level group (9 classes)

Feature representation and labels available online

Raw data available for research purposes on request (per Twitter TOS)


Features

User-level features (18), such as:

- number of:
  - followers
  - friends
  - listings
  - tweets
- proportion of:
  - retweets
  - hashtags
  - @-replies
  - links
- average:
  - tweets/day
  - retweets/tweet
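A few of these features can be sketched directly from raw tweet text. The slides do not say how each proportion is computed, so the version below makes loud assumptions: a retweet is a tweet starting with "RT @", a reply is a tweet starting with "@", and the hashtag and link proportions are the fraction of tweets containing at least one. The function name and the toy tweets are made up.

```python
def user_level_features(tweets, n_days):
    """Compute a handful of shallow per-user features from raw tweet text."""
    n = len(tweets)
    if n == 0 or n_days <= 0:
        raise ValueError("need at least one tweet and a positive number of days")
    return {
        # Assumption: old-style retweets are prefixed with "RT @".
        "prop_retweets": sum(t.startswith("RT @") for t in tweets) / n,
        # Assumption: a tweet counts once if it contains any hashtag.
        "prop_hashtags": sum("#" in t for t in tweets) / n,
        # Assumption: an @-reply starts with a mention.
        "prop_replies": sum(t.startswith("@") for t in tweets) / n,
        "prop_links": sum("http://" in t or "https://" in t for t in tweets) / n,
        "tweets_per_day": n / n_days,
    }


toy_tweets = ["RT @a: hello", "@b thanks", "see http://x.co #nlp", "plain text"]
features = user_level_features(toy_tweets, n_days=2)
print(features)
```

Counts such as followers, friends and listings would come from profile metadata rather than from the tweet text, so they are left out of this sketch.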


Features

Focus on interpretable features for analysis

Computed over a reference corpus of 400M tweets:

- SVD embeddings and clusters
- Word2Vec (W2V) embeddings and clusters


SVD Features

Compute a word × word similarity matrix

Similarity metric is Normalized PMI (Bouma 2009), using the entire tweet as context

SVD with different numbers of dimensions (30, 50, 100, 200)

A user is represented by summing the representations of the words they use

The low-dimensional features offer no interpretability
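The pipeline on this slide can be sketched end to end on toy data. This is an illustration under stated assumptions, not the authors' implementation: the four tweets are made up, NPMI is set to -1 for pairs that never co-occur (the limiting value in Bouma's definition) and to 1 on the diagonal, word vectors are taken as the left singular vectors scaled by the singular values, and `n_dims = 2` stands in for the 30 to 200 dimensions used in the slides.

```python
import math
from itertools import combinations

import numpy as np

# Toy tokenised tweets (made up for illustration).
tweets = [
    ["patients", "treatment", "surgery"],
    ["patients", "doctor", "treatment"],
    ["design", "poster", "print"],
    ["design", "print", "logo"],
]
vocab = sorted({w for t in tweets for w in t})
index = {w: i for i, w in enumerate(vocab)}
n_words, n_tweets = len(vocab), len(tweets)

# Tweet-level counts: a word (or word pair) counts once per tweet.
word_df = np.zeros(n_words)
pair_df = np.zeros((n_words, n_words))
for tweet in tweets:
    ids = sorted({index[w] for w in tweet})
    word_df[ids] += 1
    for i, j in combinations(ids, 2):
        pair_df[i, j] += 1
        pair_df[j, i] += 1


def npmi(i, j):
    """NPMI(a, b) = ln(p(a,b) / (p(a) p(b))) / -ln p(a,b)."""
    if i == j:
        return 1.0
    p_ab = pair_df[i, j] / n_tweets
    if p_ab == 0.0:
        return -1.0  # never co-occur
    if p_ab == 1.0:
        return 1.0   # always co-occur; avoids dividing by ln(1) = 0
    p_a, p_b = word_df[i] / n_tweets, word_df[j] / n_tweets
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


sim = np.array([[npmi(i, j) for j in range(n_words)] for i in range(n_words)])

# Truncated SVD: keep the top n_dims components as word vectors.
n_dims = 2
U, S, _ = np.linalg.svd(sim)
word_vecs = U[:, :n_dims] * S[:n_dims]


def user_embedding(tokens):
    """Sum the vectors of all in-vocabulary tokens a user has written."""
    rows = np.array([index[t] for t in tokens if t in index], dtype=int)
    return word_vecs[rows].sum(axis=0)


user_vec = user_embedding(["patients", "doctor", "unseen_word"])
print(user_vec)
```

The individual coordinates of `user_vec` have no obvious meaning, which is the interpretability problem the slide points out.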


SVD Features

Spectral clustering to get hard clusters of words (30, 50, 100, 200 clusters)

Each cluster consists of distributionally similar words ↔ topic

A user is represented by the number of times they use a word from each cluster.
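Once every word has a hard cluster assignment, the user representation is a plain count vector. A minimal sketch, assuming the clustering has already been computed and stored as a word-to-cluster dictionary; the tiny mapping and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical hard clustering: cluster 0 ~ "arts", cluster 1 ~ "health".
word_to_cluster = {"art": 0, "design": 0, "patients": 1, "doctor": 1}


def cluster_counts(tokens, word_to_cluster, n_clusters):
    """Count how often a user's tokens fall into each word cluster."""
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(c, 0) for c in range(n_clusters)]


user_tokens = ["design", "art", "doctor", "design", "lol"]
features = cluster_counts(user_tokens, word_to_cluster, n_clusters=2)
print(features)  # [3, 1]
```

Unlike an SVD dimension, each entry can be read directly as "how much this user talks about topic c".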


Word2Vec Features

Trained Word2Vec (layer size 50) on our Twitter reference corpus

Spectral clustering on the word × word similarity matrix (30, 50, 100, 200 clusters)

Similarity is cosine similarity of words in the embedding space
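The similarity matrix that feeds the clustering can be sketched as follows. The 2-dimensional vectors are made up for illustration (the slides use 50-dimensional Word2Vec embeddings), and returning 0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up embeddings standing in for trained Word2Vec vectors.
embeddings = {"doctor": [1.0, 0.0], "nurse": [1.0, 1.0], "goal": [0.0, 2.0]}
words = sorted(embeddings)  # ['doctor', 'goal', 'nurse']

# Dense word x word similarity matrix, the input to the clustering step.
similarity = [
    [cosine_similarity(embeddings[a], embeddings[b]) for b in words]
    for a in words
]
for word, row in zip(words, similarity):
    print(word, [round(s, 3) for s in row])
```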


Gaussian Processes

Brings together several key ideas in one framework:

- Bayesian
- kernelised
- non-parametric
- non-linear
- modelling uncertainty

Elegant and powerful framework, with growing popularity in machine learning and application domains


Gaussian Process Graphical Model View

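$f \sim \mathcal{GP}(m, k)$

$y \sim \mathcal{N}(f(x), \sigma^2)$

- $f : \mathbb{R}^D \to \mathbb{R}$ is a latent function
- $y$ is a noisy realisation of $f(x)$
- $k$ is the covariance function or kernel
- $m$ and $\sigma^2$ are learnt from data

[Figure: graphical model with nodes k, f, x, y and σ, with a plate over the N observations]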
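For this regression setting with Gaussian noise, the latent function can be integrated out in closed form. The standard results are sketched below for reference; $X$ (the stacked training inputs), $K$, $\mathbf{k}_*$ and the test input $x_*$ are notation introduced here and do not appear on the slide.

```latex
\begin{align}
  % Marginal over the N training targets, with K_{ij} = k(x_i, x_j)
  \mathbf{y} &\sim \mathcal{N}\!\left(m(X),\; K + \sigma^2 I\right) \\
  % Predictive mean of f at x_*, with (k_*)_i = k(x_i, x_*)
  \mathbb{E}[f_*] &= m(x_*) + \mathbf{k}_*^\top \left(K + \sigma^2 I\right)^{-1}
                     \left(\mathbf{y} - m(X)\right) \\
  % Predictive variance of f at x_*
  \operatorname{var}(f_*) &= k(x_*, x_*) - \mathbf{k}_*^\top
                     \left(K + \sigma^2 I\right)^{-1} \mathbf{k}_*
\end{align}
```

The predictive variance is what "modelling uncertainty" refers to: it shrinks near the training inputs and returns to the prior variance far from them.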


Gaussian Process Classification

Pass the latent function through a logistic function to squash its value from $(-\infty, \infty)$ into $(0, 1)$ and obtain a probability, $\pi(x_i) = p(y_i = 1 \mid f_i)$ (similar to logistic regression)

The likelihood is non-Gaussian, so the posterior has no analytical solution

Inference using Expectation Propagation (EP)

FITC approximation for large data
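The squashing step on its own is small enough to show directly. The sketch below covers only the link function that maps a latent value to a class probability; it does not implement EP inference or the FITC approximation, and the numerically stable two-branch form is a choice made here.

```python
import math


def logistic(f):
    """Map a latent value f in (-inf, inf) to a probability in [0, 1]."""
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    # For negative f, rewrite to avoid overflow in exp(-f).
    e = math.exp(f)
    return e / (1.0 + e)


for latent in (-4.0, 0.0, 4.0):
    print(latent, round(logistic(latent), 4))
```

In a full GP classifier the latent value is itself uncertain, so the predicted probability averages this function over the approximate posterior of the latent value rather than evaluating it at a single point.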


Gaussian Process Classification

ARD kernel learns feature importance → features most discriminative between classes

We learn 9 one-vs-all binary classifiers

This way, we find the most predictive features that are consistent across all classes
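How an ARD kernel yields a feature ranking can be sketched as follows. The squared-exponential form with one lengthscale per feature is the usual ARD kernel, but the slides do not give the exact parameterisation, so treat it as an assumption; the lengthscale values are made up rather than learnt, and how rankings are combined across the 9 classifiers is not shown.

```python
import math


def ard_rbf(x, z, lengthscales, variance=1.0):
    """Squared-exponential kernel with a separate lengthscale per feature."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, z, lengthscales))
    return variance * math.exp(-0.5 * sq)


def rank_features(lengthscales):
    """Most relevant first: a short lengthscale means the function changes
    quickly along that feature, so the feature matters more."""
    return sorted(range(len(lengthscales)), key=lambda d: lengthscales[d])


def one_vs_all_labels(labels, positive_class):
    """Binary targets for one of the one-vs-all classifiers."""
    return [1 if y == positive_class else 0 for y in labels]


# Made-up lengthscales, as if learnt for one binary classifier.
lengthscales = [0.5, 100.0, 2.0]
origin = [0.0, 0.0, 0.0]

print(ard_rbf(origin, [1.0, 0.0, 0.0], lengthscales))  # drops sharply
print(ard_rbf(origin, [0.0, 5.0, 0.0], lengthscales))  # stays close to 1
print(rank_features(lengthscales))                     # [0, 2, 1]
print(one_vs_all_labels([1, 2, 9, 2], positive_class=2))
```

Moving along feature 1 barely changes the kernel value because its lengthscale is very large, which is why it ends up last in the ranking.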


Gaussian Process Resources

Free book: http://www.gaussianprocess.org/gpml/chapters/


Gaussian Process Resources

- GPs for Natural Language Processing tutorial (ACL 2014)
  http://www.preotiuc.ro
- GP Schools in Sheffield and roadshows in Kampala, Pereira, Nyeri, Melbourne
  http://ml.dcs.shef.ac.uk/gpss/
- Annotated bibliography and other materials
  http://www.gaussianprocess.org
- GPy Toolkit (Python)
  https://github.com/SheffieldML/GPy


Prediction

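[Figure: bar chart of accuracy (%) per feature set, y-axis from 25 to 55. Legend: LR, SVM-RBF, GP, Baseline. Bar labels are reproduced in the table below.]

Features       LR      SVM-RBF   GP
User Level     34      31.5      34.2
SVD-E (200)    40      43.1      43.8
SVD-C (200)    44.2    47.9      48.2
W2V-E (50)     42.5    49        48.4
W2V-C (200)    46.9    51.7      52.7

Stratified 10-fold cross-validation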
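Stratification means that every fold keeps roughly the same class proportions as the full dataset. A minimal sketch of one way to build such folds, assuming integer class labels; the round-robin dealing scheme, the seed and the toy labels are choices made here, not details from the slides.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign example indices to folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # carry on where the previous class stopped, to balance sizes
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(members)
    return folds


# Toy labels: 20 users of class 1 and 10 users of class 2.
labels = [1] * 20 + [2] * 10
folds = stratified_folds(labels, n_folds=10)

for test_fold in folds:
    test_ids = set(test_fold)
    train_ids = [i for i in range(len(labels)) if i not in test_ids]
    # Fit on train_ids and evaluate on test_fold here.
```

Each fold serves once as the test set while the other nine are used for training, and the reported accuracy is the average over the ten test folds.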


Prediction Analysis

User level features have no predictive value

Clusters outperform embeddings

Word2Vec features are better than SVD/NPMI for prediction

Non-linear methods (SVM-RBF and GP) significantly outperform linear methods

52.7% accuracy for 9-class classification is decent


Class Comparison

Jensen-Shannon Divergence between topic distributions across occupational classes

Some clusters of occupations are observable

[Figure: 9 × 9 heatmap of pairwise Jensen-Shannon divergence between occupational classes 1 to 9; colour scale from 0.00 to 0.03]
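The divergence behind the heatmap can be sketched as below. The slides do not state the logarithm base, so natural logarithms are assumed here (giving a maximum of ln 2), and the two topic distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up topic distributions for two occupational classes.
class_a = [0.5, 0.3, 0.2]
class_b = [0.4, 0.3, 0.3]
print(js_divergence(class_a, class_b))
```

Because the midpoint distribution is non-zero wherever either input is, the divergence stays finite and symmetric, which makes it suitable for a class-by-class matrix.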


Feature Analysis

Rank  Manual Label          Topic (most frequent words)
1     Arts                  art, design, print, collection, poster, painting, custom, logo, printing, drawing
2     Health                risk, cancer, mental, stress, patients, treatment, surgery, disease, drugs, doctor
3     Beauty Care           beauty, natural, dry, skin, massage, plastic, spray, facial, treatments, soap
4     Higher Education      students, research, board, student, college, education, library, schools, teaching, teachers
5     Software Engineering  service, data, system, services, access, security, development, software, testing, standard

Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking


Feature Analysis

Rank  Manual Label     Topic (most frequent words)
7     Football         van, foster, cole, winger, terry, reckons, youngster, rooney, fielding, kenny
8     Corporate        patent, industry, reports, global, survey, leading, firm, 2015, innovation, financial
9     Cooking          recipe, meat, salad, egg, soup, sauce, beef, served, pork, rice
12    Elongated Words  wait, till, til, yay, ahhh, hoo, woo, woot, whoop, woohoo
16    Politics         human, culture, justice, religion, democracy, religious, humanity, tradition, ancient, racism

Most predictive Word2Vec 200 clusters as given by Gaussian Process ARD ranking


Feature Analysis - Cumulative distribution functions

[Figure: CDF of topic usage for "Higher Education (#21)", one line per occupational class C1 to C9. x-axis: Topic proportion (ticks at 0.001, 0.01, 0.05); y-axis: User probability (0 to 1).]

Topic more prevalent → CDF line closer to bottom-right corner


Feature Analysis - Cumulative distribution functions

[Figure: CDF of topic usage for "Arts (#116)", one line per occupational class C1 to C9. x-axis: Topic proportion (ticks at 0.001, 0.01, 0.05); y-axis: User probability (0 to 1).]

Topic more prevalent → CDF line closer to bottom-right corner


Feature Analysis - Cumulative distribution functions

[Figure: CDF of topic usage for "Elongated Words (#164)", one line per occupational class C1 to C9. x-axis: Topic proportion (ticks at 0.001, 0.01, 0.05); y-axis: User probability (0 to 1).]

Topic more prevalent → CDF line closer to bottom-right corner
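The reading rule for these plots follows from the definition of an empirical CDF: at a given topic proportion, the curve shows the fraction of users in a class who use the topic that much or less. A small sketch with made-up topic proportions for two hypothetical classes:

```python
def empirical_cdf(values, threshold):
    """Fraction of users whose topic proportion is <= threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Made-up per-user topic proportions for two hypothetical classes.
heavy_users = [0.02, 0.03, 0.04, 0.05]
light_users = [0.001, 0.002, 0.003, 0.02]

for threshold in (0.001, 0.01, 0.05):
    print(
        threshold,
        empirical_cdf(heavy_users, threshold),
        empirical_cdf(light_users, threshold),
    )
```

At a proportion of 0.01 the heavy-use class is still at 0 while the light-use class has already reached 0.75, so the heavy-use curve rises later and sits closer to the bottom-right corner.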


Feature Analysis

Comparison of mean topic usage between supersets of occupational classes (1-2 vs. 6-9)


Take Aways

User occupation influences language use in social media

Non-linear methods (Gaussian Processes) obtain significant gains over linear methods

Topic (cluster) features are both predictive and interpretable

New dataset available for research


Questions

http://sites.sas.upenn.edu/danielpr/twitter-occupation