
Inferring the location of authors from words in their texts

Max Berggren & Jussi Karlgren
Gavagai, Stockholm
{max, jussi}@gavagai.se

Robert Östling & Mikael Parkvall
Dept of Linguistics, Stockholm University
{robert, parkvall}@ling.su.se

Abstract

For the purposes of computational dialectology or other geographically bound text analysis tasks, texts must be annotated with their or their authors' location. Many texts are locatable but most have no explicit annotation of place. This paper describes a series of experiments to determine how positionally annotated microblog posts can be used to learn location-indicating words which then can be used to locate blog texts and their authors. A Gaussian distribution is used to model the locational qualities of words. We introduce the notion of placeness to describe how locational words are.

We find that modelling word distributions to account for several locations, and thus several Gaussian distributions per word, defining a filter which picks out words with high placeness based on their local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.

1 Text and Geographical Position

Authors write texts in a location, about something in a location (or about the location itself), reside and conduct their business in various locations, and have a background in some location. Some texts are personal, anchored in the here and now, where others are general and not necessarily bound to any context. Texts written by authors reflect the above facts explicitly or implicitly, through explicit author intention or incidentally. When a text is locational, it may be so because the author mentions some location or because the author is contextually bound to some location. In both cases, the text may or may not have explicit mentions of the context of the author or mention other locations in the text.

For some applications, inferring the location of a text or its author automatically is of interest. We present in this paper how establishing the location of a text can be done by the locational qualities of the terminology used by its author. Here, we investigate the utility of doing so for two distinct use cases.

Firstly, for detecting regional language usage for the purposes of real-time dialectology. The issue here is to find differences in term usage across locations and to investigate whether terminological variation differs across regions. In this case, the ultimate objective is to collect sizeable text collections from various regions of a linguistic area to establish if a certain term or turn of phrase is used more or less frequently in some specific region. The task is then to establish where the author of a text originally is from. This has hitherto been investigated by manual inspection of text collections (e.g. Parkvall 2012).

Secondly, for monitoring public opinion of e.g. brands, political issues, or other topics of interest. In this case the ultimate objective is to find whether there is regional variation in the occurrence of opinionated mentions of the topic or topical target under consideration. The task is then to establish the location where a given text is written, or, alternatively, what location the text refers to.

In both cases, the system is presented with a body of text with the task of assigning a likely location to it. In the former task, the body of text is typically larger and noisier (since authors may refer to other locations than their immediate context); in the latter task, the text may be short and offer little evidence to work from. Both tasks, that of identifying the location of an author and that of a text, have been addressed by recent experiments with various points of departure: knowledge-based approaches making use of recorded points of interest in a location, modelling the geographic distribution of topics, or using social network analysis to find additional information about the author.

This set of experiments focuses on the text itself and on using distributional semantics to refine the set of terms used for locating a text.

2 Location and words as evidence of locations

Most words contribute little or not at all to positioning a text. Some words are dead giveaways: an author may mention a specific location in the text. Frequently, but not always, this is reasonable evidence of position. Some words are less patently locational, but contribute incidentally, such as the name of some establishment or some characteristic feature of a location.


Some locational terms are polysemous; some are unspecific; some are vague. As indicated in Figure 1, the term Falköping unambiguously indicates a town in Southern Sweden, where Southern Sweden in turn is a vague term without a clear and well-defined border to other parts of Sweden. The term Södermalm is polysemous and refers to a section of town in several Swedish towns; the term spårvagn ("tram") is indicative of one of several Swedish towns with tram lines. We call both of these latter types of term polylocational and allow them to contribute to numerous places simultaneously.

Other words contribute variously to the location of a text. Some words are less patently locational than named places, but contribute incidentally, such as the name of some establishment, some characteristic feature of a location, some event which takes place in some location, or some other topic the discussion of which is more typical in one location than in another. We will estimate the placeness of words in these experiments.

Figure 1: Some terms are polylocational

3 Mapping from a continuous to a discrete representation

As has been done in previous experiments, we collect the geographic distribution of word usage by collecting microblog posts, some of which have longitude and latitude, from Twitter. Posts with location information are distributed over a map in what amounts to a continuous representation. The words from posts can be collected and associated with the positions they have been observed in.

First experiments which use training data similar to ours have typically assigned the posts, and thus the words they occur in, directly to some representation of locations: a word which occurs in tweets at [N59.35, E18.11] and [N59.31, E18.05] will have both observations recorded to be in the same city (Cheng et al. 2010, Mahmud et al. 2012). An alternative and later approach, by e.g. Priedhorsky et al. (2014), is to aggregate all observations of a word over a map and assign a named location to the distribution, rather than to each observation, deferring the labelling to a point in the analysis where the term distribution is better understood.

Another approach is to model topics as inferred from vocabulary usage in text across their geographical distribution, and then, for each text, to assess the topic and thus its attendant location vis-à-vis the topic model most likely to have generated the text in question (Eisenstein et al. 2010, Yin et al. 2011, Kinsella et al. 2011, Hong et al. 2012). We have found that topic models as implemented are computationally demanding, do not add accuracy to prediction, and have little explanatory value to aid the understanding of localised language use.

In these experiments we compare using a list of known places with a model where we aggregate the locational information provided by words (and potentially other linguistic items such as constructions) trained on longitude and latitude, either by letting the words vote for a place or by averaging the information on a word-by-word basis. The latter model defers the mapping to place until some analysis has been performed; the former assigns place to the words earlier in the process.

4 Test Data

These experiments have focused on Swedish-language material and on Swedish locations. Most Swedish speakers live in Sweden; Swedish is mainly written and spoken in Sweden and in Finland. Sweden is a roughly rectangular country of about 450 000 km² as shown in Figure 2. Sweden has since 1634 been organised into 22 counties or län of between 3 000 km² and 100 000 km². The median size of a county is 10 545 km², which would, assuming square counties, give a side of 100 km for a typical county.

We measure the accuracy of textual location using the Haversine distance, the great-circle distance between two points on a sphere. We report averages, both mean and median, as well as the percentage of texts we have located within 100 km of their known position.
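For concreteness, a minimal Python sketch of the Haversine distance and the 100 km success criterion used in the evaluation (function names are ours):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_100km(predicted, gold):
    """Success criterion: predicted position within 100 km of the known position."""
    return haversine_km(*predicted, *gold) <= 100.0
```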

Our test data set is composed of social media texts. Firstly, 18 GB of blog text from major Swedish blog and forum sites, with location self-reported by the author: variously, home town, municipality, village, or county. The texts are mainly personal texts with authors of all ages, but with a preponderance of pre-teens to young adults. The data are from 2001 and onward, with more data from the latest years. The data are concatenated into one document per blog, totalling 154 062 documents from unique sources. Somewhat more than a third, 35%, have more than 10k characters.

Secondly, 37 GB of blog text without any explicit indication of location. A target task for these experiments is to enrich these 37 GB of non-located data with predicted locations, in order to address data sparsity for unusual dialectal linguistic items.

Figure 2: Map of Sweden

5 Baseline: the GAZETTEER model

For a list of known places we used a list[1] of 1 956 Swedish cities and 2 920 towns and villages as defined by Statistics Sweden[2] in 2010.

As the most obvious baseline, we identify all tokens found in the gazetteer. Each such token is converted to a position through the Geocoding API offered by Google[3]. The position with the largest observed frequency of occurrence in the text is assumed to be the position of the text. Other approaches have taken this as a useful approach for identifying features such as Places of Interest mentioned in texts (Li et al. 2014). We call this approach the GAZETTEER approach.
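A minimal sketch of this baseline; the two gazetteer entries and their coordinates are illustrative stand-ins for the geocoded list described above:

```python
from collections import Counter

# Illustrative gazetteer: lowercased place name -> (lat, lon). In the experiments
# the coordinates come from a geocoding service; these entries are examples only.
GAZETTEER = {
    "falköping": (58.17, 13.55),
    "stockholm": (59.33, 18.07),
}

def gazetteer_location(text):
    """Assign the coordinate of the most frequently mentioned gazetteer place."""
    hits = Counter(t for t in text.lower().split() if t in GAZETTEER)
    if not hits:
        return None
    place, _ = hits.most_common(1)[0]
    return GAZETTEER[place]
```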

6 Training Data

As a basis for learning how words were used we used geotagged microblog data from Twitter. About 2% of Swedish Twitter posts have latitude and longitude explicitly given,[4] typically those that have been posted from a mobile phone. We gathered data from Twitter's streaming API[5] during the months of May to August of 2014, saving posts with latitude and longitude and with Sweden explicitly given as the point of origin. This gave us 4 429 516 posts of about 630 MB.
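A minimal sketch of how such a geotagged sample might be read back from stored streaming output, assuming one JSON tweet object per line with the GeoJSON coordinates field; file handling details are our assumptions:

```python
import json

def geotagged_posts(path):
    """Yield (text, lat, lon) for stored tweets that carry explicit coordinates.

    Assumes one JSON tweet object per line, with a GeoJSON 'coordinates' field.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            point = tweet.get("coordinates")
            if point and point.get("type") == "Point":
                lon, lat = point["coordinates"]  # GeoJSON order is [lon, lat]
                yield tweet.get("text", ""), lat, lon
```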

7 Polylocational Gaussian Mixture Models

Given a set of geographically located texts, we record for each linguistic item (meaning word, in these experiments) the locations from the metadata of every text it occurs in. This gives each word a mapped geographic distribution of latitude-longitude pairs. We model these observed distributions using Gaussian 2-D functions, as defined by Priedhorsky et al. (2014). A 2-D Gaussian function will assume a peak at some position and allow for a graceful inclusion of hits at nearby positions into the model in a bell-like distribution.

[1] http://en.wikipedia.org/wiki/List_of_urban_areas_in_Sweden  One named location ("När") was removed from the list since it is homographic with the adverbials corresponding to the English near and when, causing a disproportionate amount of noise.

[2] A locality consists of a group of buildings normally not more than 200 metres apart from each other, and must fulfil a minimum criterion of having at least 200 inhabitants. Delimitation of localities is made by Statistics Sweden every five years. [http://www.scb.se]

[3] https://developers.google.com/.../geocoding/

[4] Determined by listening to Twitter's streaming API for about a day.

[5] The "garden hose": https://dev.twitter.com/streaming/public

In contrast to the original definition and other similar subsequent approaches, we want to be able to handle polylocational words. After testing various models on a subset of our data we find that fitting more than one Gaussian function (in effect, assuming that locationally interesting words refer to several locations) yields better results than fitting all locational data into one distribution. After some initial parameter exploration, as shown in Figure 3, we settle on three Gaussian functions as a reasonable model: words with more than three distributional peaks are likely to be of less utility for locating texts. We consequently fit each word with three Gaussian functions, to allow a word to contribute to many locations for the texts it is observed in.
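A minimal sketch of this fitting step with scikit-learn; details not stated above, such as the covariance type, are our assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_word_models(observations, n_components=3):
    """Fit one Gaussian mixture over the (lat, lon) observations of each word.

    observations: dict mapping word -> list of (lat, lon) pairs.
    """
    models = {}
    for word, coords in observations.items():
        X = np.asarray(coords, dtype=float)      # shape (n_observations, 2)
        if len(X) < n_components:                # too few observations to fit
            continue
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        models[word] = gmm.fit(X)
    return models
```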

Figure 3: Effect of allowing polylocational representations. (Plot of error in km against the number of Gaussians, showing the mean and the 25th, 50th, and 75th percentiles.)

8 The notion of placeness

In keeping with previous research on geolocational terms such as Han et al. (2014), we rank candidate words for their locational specificity. From the Gaussian mixture model representation, we take the log probability $\rho$ at the mean of the Gaussian and transform it into a placeness score by $p = e^{100-\rho}$. This is done for every word, for all three Gaussians. The score is then used to rank words for locational utility.
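Under this reading, the placeness computation can be sketched as follows, taking rho to be the log density of the fitted mixture at each component mean (a sketch of ours, building on the fitted models above):

```python
def log_placeness(gmm):
    """Log placeness, 100 - rho, per mixture component, where rho is the log
    density of the fitted mixture evaluated at that component's mean."""
    rho = gmm.score_samples(gmm.means_)   # log p(x) at each of the three means
    return 100.0 - rho
```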

                      Gaussian
                   1st    2nd    3rd
Falköping           58      9      9
Stockholm           37     10     10
spårvagn "tram"     36     18     15
och "and"           16     15      9

Table 1: Example words and their log placeness

Table 1 shows the placeness of the three Gaussians for some sample words. The two sample named locations have high placeness for their first Gaussians, indicating that they have locational utility. "Stockholm", the capital city, which is frequently mentioned in conversations elsewhere, has less placeness than "Falköping", a smaller city. The word "tram" has lower placeness than the two cities, and the word "and", with a log placeness score of 16, cannot be considered locational at all. Inspecting the resulting list as given in Table 2, which shows some examples from the top of the list, we find that words with high placeness frequently are non-gazetteer locations ("Slottsskogen"), user names, hash tags (frequently referring to events, e.g. "#lundakarneval"), and other local terms, most typically street names ("Holgersgatan"), spelling variants ("Stackhalm"), or public establishments.

The performance of the predictive models introduced below can be improved by excluding words with low placeness from the centroid. This exclusion threshold is referred to as T below.

known places     hash tags          other
hogstorp         #lundakarneval     holgersgatan
nyhammar         #bishopsarms       margretegardeparken
sjuntorp         #gothenburg        uddevallahus
tyringe          #westpride14       kampenhof
slottsskogen     #swedenlove1dday   stackhalm
storvik          #sverigemotet      gullmarsplan
charlottenberg   #sthlmtech         tvarbanan

Table 2: Example words with high placeness

9 Experimental settings: the TOTAL and FILTERED models

We run one experimental setting with all words of a set, only filtered for placeness. We call this approach the TOTAL approach.

As a more informed model, we filter the words in the feature set to find the most locationally appropriate terms, in order to reduce noise and computational effort, but above all in keeping with our hypothesis that the locational signal is present in only part of the texts. Backstrom et al. (2008) and, following them, Cheng et al. (2010), using data similar to ours, also limit their analyses to "local" rather than "non-local" words in the text matter they process, modelling word locality through observed occurrences, modulated with some geographical smoothing. To find the most appropriate localised linguistic items, we bootstrap from the gazetteer and collect the most distinctive distributional contexts of gazetteer terms. For this, we used context windows of six words before (6+0), around (3+3), and after (0+6) each target word. These context windows were tabulated and the most frequently occurring constructions[6] are then ranked by their ability to return words with high placeness. For each construction, the percentage of words returned with log T > 20 is used as a ranking criterion.

[6] In these experiments, the 900 most frequent constructions are used.

(a) All words of a text contribute to the predicted location.

(b) Only words filtered through the distributional model contribute votes, yielding a prediction very close to the correct position.

Figure 4: Comparison illustrating the grid and showing how the grammar transforms the result.

Using this ranking, the top 150 constructions are retained as a paradigmatic filter to generate usefully locational words. Constructions such as lives in <location> will be at the top of the list. Examples are given in Figure 8.
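A minimal sketch of the context-window tabulation that the construction ranking starts from (the symmetric 3+3 window is shown; helper names are ours):

```python
from collections import Counter

def context_windows(tokens, gazetteer_terms, before=3, after=3):
    """Tabulate context patterns around gazetteer terms, with the term itself
    replaced by a <location> slot."""
    patterns = Counter()
    for i, tok in enumerate(tokens):
        if tok in gazetteer_terms:
            left = tokens[max(0, i - before):i]
            right = tokens[i + 1:i + 1 + after]
            patterns[" ".join(left + ["<location>"] + right)] += 1
    return patterns
```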

Words found in the <location> slot of the constructions are frequency filtered with respect to N, the length of the text under analysis, with thresholds set by experimentation to $0.00008 \times N \leq f_{wd} \leq N/300$. This reduces the number of Gaussian models to evaluate drastically. Each text under consideration was then filtered to only include words found through the above procedure, reducing the size of the texts to about 6% of the original.
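A sketch of that frequency filter, with N the token length of the text under analysis (function and variable names are ours):

```python
from collections import Counter

def frequency_filtered(tokens, candidate_words):
    """Keep tokens that are candidate (construction-derived) words and whose
    frequency f in the text satisfies 0.00008 * N <= f <= N / 300."""
    N = len(tokens)
    counts = Counter(tokens)
    lower, upper = 0.00008 * N, N / 300
    return [t for t in tokens
            if t in candidate_words and lower <= counts[t] <= upper]
```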

10 Aggregating the locational information for filtered texts

The filtered texts are now processed in two different ways.

(a) Using labeled data set

(b) Using enriched data set increases the data

Figure 5: Regional terminology for “second cousin”

(a) In Swedish:
<location> mellan
varit i <location>
bor i <location>
var i <location>
vi till <location>
in till <location>
ska till <location>
<location> centrum
av till <location>
det av till <location>
hemma i <location>
till <location>
upp till <location>

(b) Translated to English:
<location> between
been in <location>
live(s) in <location>
was in <location>
we to <location>
in to <location>
going to <location>
<location> centre
off to <location>
go to <location>
home in <location>
to <location>
up to <location>

Figure 8: Examples of locational constructions

Every unique word token i in the Twitter dataset has a Gaussian mixture model based on its observed occurrences, as shown in Section 8. This is represented by the three mean coordinates $\mu_i$ and their corresponding placenesses $p_i$:

$$\mu_i = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}_i \qquad p_i = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix}_i$$

We compute a centroid for these coordinates as an average best guess for the geographic signal of a text. We do this with an arithmetic weighted mean. Given n words:

$$M = \frac{\sum_{i=1}^{n} \mu_i \cdot p_i}{\sum_{i=1}^{n} \sum_{j=1}^{3} p_{ij}}$$

where $\mu_i \cdot p_i$ is the dot product.[7] We call this model FILTERED CENTROID.
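A sketch of the FILTERED CENTROID aggregation under the definitions above; the models mapping and the handling of missing words are our assumptions:

```python
import numpy as np

def filtered_centroid(words, models):
    """Placeness-weighted mean of all component means of the words in a text."""
    weighted_sum = np.zeros(2)
    total_weight = 0.0
    for w in words:
        gmm = models.get(w)
        if gmm is None:
            continue
        p = np.exp(100.0 - gmm.score_samples(gmm.means_))  # placeness per component
        weighted_sum += p @ gmm.means_     # sum_j p_j * mu_j, the dot product above
        total_weight += p.sum()
    return weighted_sum / total_weight if total_weight > 0 else None
```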

Alternatively, we do not average the coordinates, but select by weighted majority vote. We divide Sweden into a grid of roughly 50x50 km cells. The placeness score of every locational word in a text is added to its cell. The centre point of the cell with the highest score is assigned to the text as its location. We call this model FILTERED VOTE.
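A sketch of the FILTERED VOTE alternative; the degree-to-kilometre conversion and the cell bookkeeping are crude assumptions of ours:

```python
from collections import defaultdict

CELL_KM = 50.0
KM_PER_DEG_LAT = 111.0   # rough value; longitude degrees are narrower in the north
KM_PER_DEG_LON = 55.0    # roughly cos(60 degrees) * 111, a crude value for Sweden

def filtered_vote(component_means, placenesses):
    """Weighted majority vote over ~50x50 km grid cells.

    component_means: iterable of (lat, lon) component means of the filtered words;
    placenesses: the matching placeness weights.
    """
    votes = defaultdict(float)
    for (lat, lon), p in zip(component_means, placenesses):
        cell = (int(lat * KM_PER_DEG_LAT // CELL_KM),
                int(lon * KM_PER_DEG_LON // CELL_KM))
        votes[cell] += p
    if not votes:
        return None
    i, j = max(votes, key=votes.get)
    # Return the centre point of the winning cell, converted back to degrees.
    return ((i + 0.5) * CELL_KM / KM_PER_DEG_LAT,
            (j + 0.5) * CELL_KM / KM_PER_DEG_LON)
```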

Figure 4 shows how filtering improves results, here illustrated by the FILTERED VOTE model. The top map shows how every word of a text contributes votes, weighted by its placeness, to give a prediction (marked in the figure). The bottom map shows how, when only words filtered through the distributional model are used, the voting yields a correct result in comparison with the gold standard position given by the metadata.

[7] $\mu_i \cdot p_i = \mu_{i1} p_{i1} + \mu_{i2} p_{i2} + \mu_{i3} p_{i3}$ for this specific case.

Figure 6: Comparing placeness thresholds for the FILTERED CENTROID model. (Plot of the fraction of tests against error in km, for log(T) = 60, 50, 40, 20, 10, and T = 0.)

Model               log T   Error (km)        Percentile (km)          e < 100 km
                            Median   Mean     25%     50%     75%      Precision   Recall
FILTERED CENTROID     —      204      365      45      204     464       0.38        0.38
FILTERED CENTROID     10     204      365      45      204     464       0.38        0.38
FILTERED CENTROID     20     200      365      44      200     460       0.38        0.38
FILTERED CENTROID     40     145      333      32      145     396       0.44        0.32
FILTERED CENTROID     50      90      286      22       90     321       0.52        0.23
FILTERED CENTROID     60      70      271      13       70     330       0.53        0.04

Table 3: Comparing placeness thresholds for the FILTERED CENTROID model.

Figure 7: Comparing models with placeness threshold at log T = 20. (Plot of the fraction of tests against error in km for the FILTERED CENTROID, FILTERED VOTE, TOTAL, and GAZETTEER models.)

Model               log T   Error (km)        Percentile (km)          e < 100 km
                            Median   Mean     25%     50%     75%      Precision   Recall
GAZETTEER             20     450      626      62      450     964       0.31        0.31
TOTAL                 20     256      380      51      256     516       0.34        0.34
FILTERED CENTROID     20     200      365      44      200     460       0.38        0.38
FILTERED VOTE         20     208      377      58      208     467       0.37        0.36

Table 4: Comparing models; the median and mean errors are given in km.

11 Results

As shown in Table 4 and Figure 7, the Gaussian models FILTERED CENTROID and FILTERED VOTE outperform the GAZETTEER model handily. Filtering words distributionally, in addition to reducing processing, improves results further. The FILTERED CENTROID model is slightly better than the FILTERED VOTE model, providing support for late discretization of locational information. A closer look at the effect of feature selection with the placeness threshold, shown in Table 3 and in Figure 6, reveals the precision-recall trade-off contingent on reducing the number of accepted locational words.

These results compare well with the results reported by others: while direct comparison across linguistic and geographic areas is difficult, Cheng et al. (2010) set a 100-mile (≈ 160 km) success criterion for a similar task of geolocating microblog authors (not single posts). They find that about 10% of microblog users can be localised within their 100-mile radius. Eisenstein et al. (2010) found they could on average achieve 900 km accuracy for texts, or 24% accuracy on a US state level.

12 Regional variation

Returning to our use case, we now use the FILTERED CENTROID model to position, and thus enrich, a further 38% of our unlabeled blog collection with a location tag (setting the placeness threshold log T = 20). This gives a noticeably better resolution for studying regional word usage, as shown in Figure 5: the term for "second cousin" varies across dialects, and given the enriched data set we are able to obtain better frequencies and a more distinct image of usage.

13 Conclusions

We find that

• modelling geographical distribution of linguistic items with multiple (in this case, three) peaks proved useful;

• filtering locationally indicative linguistic items using distributional constructions proved useful;

• modelling the placeness of locational linguistic items for thresholding proved useful;

• training a locational model on positionally annotated microblog posts was a useful bootstrap for assigning location to texts of an entirely different genre;

• we are able to detect and explore regional variation in terminological usage.

Acknowledgments

This work was in part supported by the grant SINUS (Spridning av innovationer i nutida svenska) from Vetenskapsrådet, the Swedish Research Council.

References

Lars Backstrom, Jon Kleinberg, Ravi Kumar, and Jasmine Novak. Spatial variation in search engine queries. In Proceedings of the 17th International Conference on the WWW. ACM, 2008.

Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM). ACM, 2010.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. A latent variable model for geographic lexical variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2010.

Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research (JAIR), 49. 2014.

Liangjie Hong, Amr Ahmed, Siva Gurumurthy, Alexander J. Smola, and Kostas Tsioutsiouliklis. Discovering geographical topics in the Twitter stream. In Proceedings of the 21st International Conference on the WWW. ACM, 2012.

Sheila Kinsella, Vanessa Murdock, and Neil O'Hare. I'm eating a sandwich in Glasgow: modeling locations with tweets. In Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents. ACM, 2011.

Guoliang Li, Jun Hu, Jianhua Feng, and Kian-Lee Tan. Effective location identification from microblogs. In 30th IEEE International Conference on Data Engineering (ICDE). IEEE, 2014.

Jalal Mahmud, Jeffrey Nichols, and Clemens Drews. Where is this tweet from? Inferring home locations of Twitter users. In Proceedings of the 6th International AAAI Conference on Web and Social Media, 2012.

Mikael Parkvall. Här går gränsen. Språktidningen, October 2012. ISSN 1654-5028.

Reid Priedhorsky, Aron Culotta, and Sara Y. Del Valle. Inferring the origin locations of tweets with quantitative confidence. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 2014.

Zhijun Yin, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas Huang. Geographical topic discovery and comparison. In Proceedings of the 20th International Conference on the WWW. ACM, 2011.