

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 98–106, Gothenburg, Sweden, April 26-30 2014. ©2014 Association for Computational Linguistics

Mapping dialectal variation by querying social media

Gabriel Doyle
Department of Linguistics
University of California, San Diego
La Jolla, CA, USA 92093-0108

[email protected]

Abstract

We propose a Bayesian method of estimating a conditional distribution of data given metadata (e.g., the usage of a dialectal variant given a location) based on queries from a big data/social media source, such as Twitter. This distribution is structurally equivalent to those built from traditional experimental methods, despite lacking negative examples. Tests using Twitter to investigate the geographic distribution of dialectal forms show that this method can provide distributions that are tightly correlated with existing gold-standard studies at a fraction of the time, cost, and effort.

1 Introduction

Social media provides a linguist with a new data source of unprecedented scale, opening novel avenues for research in empirically-driven areas, such as corpus and sociolinguistics. Extracting the right information from social media, though, is not as straightforward as in traditional data sources, as the size and format of big data makes it too unwieldy to observe as a whole. Researchers often must interact with big data through queries, which produce only positive results, those matching the search term. At best, this can be augmented with a set of “absences” covering results that do not match the search term, but explicit negative data (e.g., confirmation that a datapoint could never match the search term) does not exist. In addition to the lack of explicit negative data, query-derived data has a conditional distribution that reverses the dependent and independent variables compared to traditional data sources, such as sociolinguistic interviews.

This paper proposes a Bayesian method for overcoming these two difficulties, allowing query-derived data to be applied to traditional problems without requiring explicit negative data or the ability to view the entire dataset at once. The test case in this paper is dialect geography, where the positive data is the presence of a dialectal word or phrase in a tweet, and the metadata is the location of the person tweeting it. However, the method is general and applies to any queryable big data source that includes metadata about the user or setting that generated the data.

The key to this method lies in using an independent query to estimate the overall distribution of the metadata. This estimated distribution corrects for non-uniformity in the data source, enabling the reversal of the conditionality on the query-derived distribution to convert it to the distribution of interest.

Section 2 explains the mathematical core of the Bayesian analysis. Section 3 implements this analysis for Twitter and introduces an open-source program for determining the geographic distribution of tweets. Section 4 tests the method on problems in linguistic geography and shows that its results are well-correlated with those of traditional sociolinguistic research. Section 5 addresses potential concerns about noise or biases in the queries.

2 Reversing the conditionality of query data

2.1 Corpora and positive-only data

In traditional linguistic studies, the experimenter has control over the participants’ metadata, but not over their data. For instance, a sociolinguist may select speakers with known ages or locations, but will not know their usages in advance. Corpus queries reverse the direction of investigation; the experimenter selects a linguistic form to search for, but then lacks control over the metadata of the participants who use the query. The direction of conditionality must be reversed to get comparable information from query-derived and traditional data.

Queries also complicate the problem by providing only positive examples. This lack of explicit negative data is common in language acquisition, as children encounter mostly grammatical statements during learning, and receive few explicitly ungrammatical examples, yet still develop a consistent grammaticality classification system as they mature. Similar positive-only problems abound in cognitive science and artificial intelligence, and a variety of proposals have been offered to overcome them in different tasks. These include biases like the Size Principle (Tenenbaum and Griffiths, 2001), heuristics like generating pseudo-negatives from unobserved data (Okanohara and Tsujii, 2007; Poon et al., 2009), or innate prespecifications like Universal Grammar in the Principles and Parameters framework.

For query-derived data, Bayesian reasoning can address both problems by inverting the conditionality of the distribution and implying negative data. The key insight is that a lack of positive examples where positive examples are otherwise expected is implicit negative evidence. This method allows a researcher to produce an estimated distribution that approximates the true conditional distribution up to a normalizing factor. This conditional distribution is that of data (e.g., a dialectal form) conditioned on metadata (e.g., a location).

This distribution can be written as p(D|M), where D and M are random variables representing the data and metadata. A query for a data value d returns metadata values m distributed according to p(M|D = d). All of the returned results will have the searched-for data value, but the metadata can take any value.

For most research, p(M|D = d) is not the distribution of interest, as it is conflated with the overall distribution of the metadata. For instance, if the query results indicate that 60% of users of the linguistic form d live in urban areas, this seems to suggest that the linguistic form is more likely in urban areas. But if 80% of people live in urban areas, the linguistic form is actually underrepresented in these areas, and positively associated with rural areas. An example of the effect of such misanalysis is shown in Sect. 4.2.

2.2 Reversing the conditionality

Bayesian reasoning allows a researcher to move from the sampled p(M|D) distribution to the desired p(D|M). We invoke Bayes’ Rule:

p(D|M) = p(M|D) p(D) / p(M)

In some situations, these underlying distributions will be easily obtainable. For small corpora, p(D) and p(M) can be calculated by enumeration. For data with explicit negative examples available, p(D) can be estimated as the ratio of positive examples to the sum of positive and negative examples.1 But for queries in general, neither of these approximations is possible. Instead, we estimate p(M) through the querying mechanism itself.

This is done by choosing a “baseline” query term q whose distribution is approximately independent of the metadata – that is, a query q such that p(q|m) is approximately constant for all metadata values m ∈ M. If p(q|m) is constant, then by Bayes’ Rule:

p(m|q) = p(q|m) p(m) / p(q) ≈ p(m), for all m ∈ M

Thus we can treat results from a baseline query as though they are draws directly from p(M), and estimate the denominator from this distribution. The remaining unknown distribution p(d) is constant for a given data value d, so combining the above equations yields the unnormalized probability p̃(d|M):

p(d|M) ∝ p̃(d|M) = p(M|d) / p(M|q)    (1)

This switch to the unnormalized distribution can improve interpretability as well. If p̃(d|m) = 1, then p(m|d) = p(m|q), which means that the metadata m is observed for the linguistic form d just as often as it is for the baseline query. When p̃(d|m) > 1, the linguistic form is more common for metadata m than average, and when p̃(d|m) < 1, the form is less common for that metadata.2

1This can be extended to multi-class outcomes; if D has more than two outcomes, each possible outcome is an implicit negative example for the other possible outcomes.

2If a normalized distribution is needed, p(d) may be estimable, depending on the data source. In the Twitter data presented here, tweets are sequentially numbered, so p(d) could be estimated using these index numbers. This paper only uses unnormalized distributions.
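As a minimal illustration of the inversion in Eq. 1, the following Python sketch estimates p̃(d|m) from raw counts over discrete metadata values (e.g., region labels). The function and variable names are illustrative and not part of the SeeTweet implementation, which works with continuous locations and kernel density estimates (Sect. 3.4).

```python
from collections import Counter

def unnormalized_conditional(target_metadata, baseline_metadata):
    """Estimate p~(d|m) = p(m|d) / p(m|q) for discrete metadata values.

    target_metadata: metadata values attached to hits for the target query d.
    baseline_metadata: metadata values attached to hits for the baseline
        query q, assumed approximately independent of the metadata.
    """
    target_counts = Counter(target_metadata)
    baseline_counts = Counter(baseline_metadata)
    n_target = sum(target_counts.values())
    n_baseline = sum(baseline_counts.values())

    ratios = {}
    for m, count in target_counts.items():
        base = baseline_counts.get(m, 0)
        if base == 0:
            continue  # no baseline mass here; the ratio is undefined
        ratios[m] = (count / n_target) / (base / n_baseline)
    return ratios
```

A value above 1 marks metadata values where the form is overrepresented relative to the baseline; a value below 1 marks underrepresentation, matching the interpretation given above.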


2.3 Coverage and confidence

Due to the potentially non-uniform distribution of metadata, the amount of error in the estimate in Eq. 1 can vary with m. Intuitively, the confidence in the conditional probability estimates depends on the amount of data observed for each metadata value. Because queries estimate p(M|d) by repeated draws from that distribution, the error in the estimate decreases as the number of draws increases. The overall error in the estimate of p̃(d|m) decreases as the number of datapoints observed at m increases. This suggests estimating confidence as the square root of the count of observations of the metadata m, as the standard error of the mean decreases in proportion to the square root of the number of observations. More complex Bayesian inference can be used to improve error estimates in the future.

3 Sample Implementation: SeeTweet

This section implements the method described in the previous section on a case study of the geographic distributions of linguistic forms, calculated from recent tweets. It is implemented as a suite of novel open-source Python/R programs called SeeTweet, which queries Twitter, obtains tweet locations, performs the mathematical analysis, and maps the results. The suite is available at http://github.com/gabedoyle/seetweet.

3.1 SeeTweet goals

Traditionally, sociolinguistic studies are highly time-intensive, and broad coverage is difficult to obtain at reasonable costs. Two data sources that we compare SeeTweet to are the Atlas of North American English (Labov et al., 2008, ANAE) and the Harvard Dialect Survey (Vaux and Golder, 2003, HDS), both of which obtained high-quality data, but over the course of years. Such studies remain the gold standard for most purposes, but SeeTweet presents a rapid, cheap, and surprisingly effective alternative for broad coverage on some problems in dialect geography.

3.2 Querying Twitter

SeeTweet queries Twitter through its API, using Mike Verdone’s Python Twitter Tools.3 The API returns the 1000 most recent query-matching tweets or all query-matching tweets within the last week, whichever is smaller, and can be geographically limited to tweets within a certain radius of a center point. In theory, the contiguous United States are covered by a 2500km radius (Twitter’s maximum) around the geographic center, approximately 39.8°N, 98.6°W, near the Kansas-Nebraska border. In practice, though, such a query only returns tweets from a non-circular region within the Great Plains.

3http://mike.verdone.ca/twitter/

Through trial-and-error, four search centers were found that span the contiguous U.S. with minimal overlap and nearly complete coverage,4 located near Austin, Kansas City, San Diego, and San Francisco. All results presented here are based on these four search centers. Tweets located outside the U.S. or with unmappable locations are discarded.

The need for multiple queries and the API’s tweet limit complicate the analysis. The four searches must be balanced against each other to avoid overrepresenting certain areas, especially in constructing the baseline p(M). If any searches reach the 1000-tweet limit, only the search with the most recent 1000th tweet has all of its tweets used. All tweets before that tweet are removed, balancing the searches by having them all span the same timeframe. Due to the seven-day limit for recent tweets, many searches do not return 1000 hits; if none of the searches max out, all returned tweets are accepted.
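A minimal Python sketch of this balancing step, assuming each search returns a list of tweets ordered newest-first, each carrying a timestamp; the function name and data layout are illustrative rather than the released SeeTweet code.

```python
def balance_searches(searches, cap=1000):
    """Trim per-center search results so they all span the same timeframe.

    searches: one list of tweets per search center, each ordered
        newest-first, where each tweet is a dict with a 'timestamp' key.
    If any search hit the tweet cap, the cutoff is the timestamp of the
    most recent cap-th tweet among the capped searches; tweets older than
    that cutoff are dropped from every search. Otherwise all tweets are kept.
    """
    capped = [s for s in searches if len(s) >= cap]
    if not capped:
        return [list(s) for s in searches]  # no search maxed out

    cutoff = max(s[cap - 1]['timestamp'] for s in capped)
    return [[t for t in s if t['timestamp'] >= cutoff] for s in searches]
```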

3.3 Establishing the baseline

For the baseline query (used to estimate p(M)), SeeTweet needs a query with approximately uniform usage across the country. Function or stop words are reasonable candidates for this task. We use the word I here, which was chosen as it is common in all American English dialects but not other major languages of the U.S., and it has few obvious alternative forms. Other stop words were tested, but the specific baseline query had little impact on the learned distribution; correlations between maps with I, of, the or a baselines were all above .97 on both baseline distributions and estimated conditional distributions.

Each tweet from the target query requires its own baseline estimate, as the true distribution of metadata varies over time. For instance, there will be relatively more tweets on the East Coast in the early morning (when much of the West Coast is still asleep). Thus, SeeTweet builds the baseline distribution by querying the baseline term I, and using the first 50 tweets preceding each target tweet. This query is performed for each search center for each tweet, with the centers balanced as discussed in the previous section.5

4Northern New England has limited coverage, and the Mountain West returns little data outside the major cities.
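Under the same assumptions as the previous sketch, the per-tweet baseline could be collected as follows; search_baseline is a hypothetical callable (not the actual SeeTweet interface) that returns baseline-term tweets for one search center, newest-first, restricted to tweets no newer than a given tweet ID.

```python
def baseline_for_tweet(target_tweet_id, centers, search_baseline, per_center=50):
    """Collect the baseline sample for a single target tweet.

    For each search center, take the baseline-term tweets immediately
    preceding the target tweet, then balance the centers (see
    balance_searches above) so they all span the same timeframe.
    """
    per_center_hits = [search_baseline(center=c, max_id=target_tweet_id)[:per_center]
                       for c in centers]
    balanced = balance_searches(per_center_hits, cap=per_center)
    return [t for hits in balanced for t in hits]
```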

3.4 Determining coordinates and mapping

A tweet’s geographic information can be specified in many ways. These include coordinates specified by a GPS system (“geotags”), user-specified coordinates, or user specification of a home location whose coordinates can be geocoded. Some tweets may include more than one of these, and SeeTweet uses this hierarchy: geotags are accepted first, followed by user-specified coordinates, followed by user-specified cities. This hierarchy moves from sources with the least noise to the most.

Obtaining coordinates from user-specified locations is done in two steps. First, if the user’s location follows a “city, state” format, it is searched for in the US Board on Geographic Names’s Geographic Names Information System6, which matches city names to coordinates. Locations that do not fit the “city, state” format are checked against a manually compiled list of coordinates for 100 major American cities. This second step catches many cities that are sufficiently well-known that a nickname is used for the city (e.g., Philly) and/or the state is omitted.

Tweets whose coordinates cannot be determined by these methods are discarded; this is approximately half of the returned tweets in the experiments discussed here.
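The location hierarchy can be sketched as follows; the tweet field names, the geocode_city_state callable, and the NICKNAME_COORDS table are illustrative stand-ins for the GNIS lookup and the hand-compiled city list described above, not the actual SeeTweet data structures.

```python
# Hypothetical nickname table standing in for the hand-compiled 100-city list.
NICKNAME_COORDS = {'philly': (39.95, -75.17), 'nyc': (40.71, -74.01)}

def tweet_coordinates(tweet, geocode_city_state):
    """Return (lat, lon) for a tweet, preferring the least noisy source.

    geocode_city_state: callable mapping a 'City, ST' string to coordinates
    via a gazetteer such as GNIS, returning None if the lookup fails.
    """
    if tweet.get('geotag'):                       # 1. GPS geotag
        return tweet['geotag']
    if tweet.get('user_coordinates'):             # 2. user-specified coordinates
        return tweet['user_coordinates']
    location = (tweet.get('user_location') or '').strip()
    if ',' in location:                           # 3. 'city, state' gazetteer lookup
        coords = geocode_city_state(location)
        if coords is not None:
            return coords
    return NICKNAME_COORDS.get(location.lower())  # 4. well-known city nicknames
```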

This process yields a database of tweet coordinates for each query. To build the probability distributions, SeeTweet uses a two-dimensional Gaussian kernel density estimator. Gaussian distributions account for local geographic dependency and uncertainty in the exact location of a tweeter as well as smoothing the distributions. The standard deviation (“bandwidth”) of the kernels is a free parameter, and can be scaled to supply appropriate coverage/granularity of the map. We use 3 degrees (approximately 200 miles) of bandwidth for all maps in this paper, but found consistently high correlation (at least .79 by Hosmer-Lemeshow) to the ANAE data in Sect. 4.1 with bandwidths between 0.5 and 10 degrees.

5An alternative baseline, perhaps even more intuitive, would be to use some number of sequential tweets preceding the target tweet. However, the Twitter API query mechanism subsamples from the overall set of tweets, so sequential tweets may not follow the same distribution as the queries and would provide an inappropriate baseline.

6http://geonames.usgs.gov/domestic/download_data.htm

The KDE estimates probabilities on a grid overlaid on the map; we make each grid box a square one-tenth of a degree on each side and calculate p̃(d|m) for each box m. SeeTweet maps plot the value of p̃(d|M) on a color gradient with approximately constant luminosity. Orange indicates high probability of the search term, and blue low probability. Constant luminosity is used so that confidence in the estimate can be represented by opacity; regions with higher confidence in the estimated probability appear more opaque.7 Unfortunately, this means that the maps will not be informative if printed in black and white.
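A sketch of the gridded estimate under these definitions, using scikit-learn’s KernelDensity in place of SeeTweet’s own Python/R KDE code; coordinates are assumed to be (lat, lon) pairs, the bandwidth, grid spacing, and confidence definition follow the values given above and in footnote 7, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def log_unnormalized_map(target_coords, baseline_coords,
                         lat_range=(24, 50), lon_range=(-125, -66),
                         bandwidth=3.0, step=0.1):
    """Estimate log p~(d|m) = log p(m|d) - log p(m|q) on a lat/lon grid.

    target_coords, baseline_coords: arrays of shape (n, 2) holding (lat, lon)
    pairs for the target-query and baseline-query tweets, respectively.
    """
    kde_d = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(target_coords)
    kde_q = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(baseline_coords)

    lats = np.arange(*lat_range, step)
    lons = np.arange(*lon_range, step)
    grid = np.array([(lat, lon) for lat in lats for lon in lons])

    # score_samples returns log densities, so the log ratio is a difference.
    log_ratio = kde_d.score_samples(grid) - kde_q.score_samples(grid)

    # Confidence (used for opacity): square root of the smoothed number of
    # target tweets in each grid box, p(m|d) * C(d).
    box_prob = np.exp(kde_d.score_samples(grid)) * step ** 2
    confidence = np.sqrt(box_prob * len(target_coords))

    return grid, log_ratio, confidence
```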

4 Experiments in dialect geography

Our first goal is to test the SeeTweet results against an existing gold standard in dialect geography; for this, we compare SeeTweet distributions of the needs done construction to those found by long-term sociolinguistic studies and show that the quick-and-dirty unsupervised SeeTweet distributions are accurate reflections of the slow-and-clean results. Our second goal is to show the importance of using the correct conditional distribution, by comparing it to the unadjusted distribution. With these points established, we then use SeeTweet to create maps of previously uninvestigated problems.

4.1 Method verification on need + past participle

The Atlas of North American English (Labov et al., 2008) is the most complete linguistic atlas of American English dialect geography. It focuses on phonological variation, but also includes a small set of lexical/syntactic alternations. One is the needs + past participle construction, as in The car needs (to be) washed. This construction has a limited geographic distribution, and ANAE provides the first nationwide survey of its usage.

We compare SeeTweet’s conditional probabilities for this construction to the ANAE responses to see how the relatively uncontrolled Twitter source compares to the tightly controlled telephone survey data that ANAE reports. We create a SeeTweet map and visually compare this to the ANAE map, along with a Hosmer-Lemeshow-style analysis. The SeeTweet map is not calibrated to the ANAE map; they are each built independently.

7Confidence is given by the square root of the smoothed number of tweets in a grid box m, p(m|d) ∗ C(d).

[Figure 1: Comparing the SeeTweet distribution and ANAE responses for needs done usage. Orange indicates higher local usage, purple moderate, and blue lower. Increased opacity indicates more confidence (i.e., more tweets) in a region. (a) ANAE/Telsur survey responses for need + past participle, with each respondent marked as using the construction, familiar with it, or unfamiliar with it. (b) SeeTweet search for “needs done”.]

The ANAE map (Fig. 1a) shows the responses of 577 survey participants who were asked about needs done. Three possible responses were considered: they used the construction themselves, they did not use it but thought it was used in their area, or they neither used it nor believed it to be used in their area.

The SeeTweet map (Fig. 1b) is built from five searches for the phrase “needs done”, yielding 480 positive tweets and 32275 baseline tweets.8 The component distributions p(M|d) and p(M) are estimated by Gaussian kernels with bandwidth 3. The log of p̃(d|M), calculated as in Eq. 1, determines the color of a region; orange indicates a higher value, purple a middle (approx. 1) value, and blue a low value. Confidence in the estimate is reflected by opacity; higher opacity indicates higher confidence in the estimate. Confidence values above 3 (corresponding to 9 tweets per bin) are fully opaque. This description holds for all other maps in this paper.

8The verb do was used as it was found to be the most common verb in corpus work on needs to be [verbed] constructions (Doyle and Levy, 2008), appearing almost three times as often as the second-most common verb (replace).

We start with a qualitative comparison of the maps. Both maps show the construction to be most prominent in the area between the Plains states and central Pennsylvania (the North Midland dialect region), with minimal use in New England and Northern California and limited use elsewhere. SeeTweet lacks data in the Mountain West and Great Plains, and ANAE lacks data for Minnesota and surrounding states.9 The most notable deviation between the maps is that SeeTweet finds the construction more common in the Southeast than ANAE does.

Quantitative comparison is possible by comparing SeeTweet’s estimates of the unnormalized conditional probability of needs done in a location with the ANAE informants’ judgments there. Two such comparisons are shown in Fig. 2.

The first comparison (Fig. 2a) is a violin plot with the ANAE divided into the three response categories. The vertical axis represents the SeeTweet estimates, and the width of a violin is proportional to the likelihood of that ANAE response coming from a region of the given SeeTweet estimate. The violins’ mass shifts toward regions with lower SeeTweet estimates (down in the graph) as the respondents report decreasing use/familiarity with the construction (moving left to right).

Users of the construction are most likely to come from regions with above-average conditional probability of needs done, as seen in the leftmost violin. Non-users, whether familiar with the construction or not, are more likely to come from regions with below-average conditional probability. Non-users who are unfamiliar with it tend to live in regions with the lowest conditional probabilities of the three groups. This shows the expected correspondence trend between the ANAE responses and the estimated prevalence of the construction in an area; the mean SeeTweet estimates for the three groups are 0.45, −0.34, and −0.61, respectively.

The second comparison (Fig. 2b) is a Hosmer-Lemeshow plot. The respondents are first divided into deciles based on the SeeTweet estimate at their location. Two mean values are calculated for each decile: the mean SeeTweet log-probability estimate (increasing with each decile) and the log-proportion of respondents in that decile who use the construction.10 If SeeTweet estimates of the conditional distribution are an adequate reflection of the ANAE survey data, we should see a tight correlation between the SeeTweet and ANAE values in each decile. The correlation between the two is R² = 0.90. This is an improvement over the inappropriate conditional distribution p(M|d) that is obtained by smoothing the tweet map without dividing by the overall tweet distribution p(M); its Hosmer-Lemeshow correlation is R² = 0.79.

9Murray et al. (1996)’s data suggest that these untested areas would not use the construction; the SeeTweet data suggests this as well.

[Figure 2: Quantifying the relationship between the SeeTweet distribution and ANAE reports for needs done. (a) Violin plot of SeeTweet estimated conditional probability (log unnormalized p(needs+ppt|location)) against ANAE response type (used by, familiar to, or unfamiliar to the respondent). (b) Hosmer-Lemeshow plot of SeeTweet distribution deciles against average probability of ANAE respondent usage (log proportion accepting needs+ppt).]
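A sketch of this decile analysis in Python, assuming parallel arrays of per-respondent SeeTweet log-estimates and binary usage indicators; the helper is illustrative and is not the original Python/R analysis code.

```python
import numpy as np

def hosmer_lemeshow_deciles(log_estimates, uses_construction, n_bins=10):
    """Compare decile means of the SeeTweet estimate with the log-proportion
    of respondents in each decile who use the construction."""
    log_estimates = np.asarray(log_estimates, dtype=float)
    uses_construction = np.asarray(uses_construction, dtype=float)

    order = np.argsort(log_estimates)
    bins = np.array_split(order, n_bins)  # deciles by SeeTweet estimate

    mean_estimates = np.array([log_estimates[b].mean() for b in bins])
    # Assumes every decile contains at least one user; otherwise the log is -inf.
    log_props = np.log(np.array([uses_construction[b].mean() for b in bins]))

    r = np.corrcoef(mean_estimates, log_props)[0, 1]
    return mean_estimates, log_props, r ** 2
```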

These experiments verify two important points: the SeeTweet method can generate data that is tightly correlated with gold-standard data from controlled surveys, and conditionality inversion establishes a more appropriate distribution to correct for different baseline frequencies in tweeting. This second point will be examined further with double modals in the next section.

4.2 Double modals and the importance of the baseline

The double modal construction provides a second test case. While ungrammatical in Standard American English, forms like I might could use your help are grammatical and common in Southern American dialects. This construction is interesting both for its theoretical syntax implications on the nature of modals and for the relationship between its sociolinguistic distribution and its pragmatics (Hasty, 2011).

The ANAE does not have data on double modals’ distribution, but another large-scale sociolinguistic experiment does: the Harvard Dialect Survey (Vaux and Golder, 2003). This online survey obtained 30788 responses to 122 dialect questions, including the use of double modals. Katz (2013) used a nearest-neighbor model to create a p(d|M) distribution over the contiguous U.S. for double modal usage, mapped in Fig. 3a.11 Lighter colors indicate higher rates of double modal acceptance.

SeeTweet generates a similar map (Fig. 3b), based on three searches with 928 positive and 66272 baseline tweets. As with the ANAE test, the SeeTweet map is built independently of the HDS data and is not calibrated to it.

10We remove all respondents who do not use the construction but report it in their area. Such respondents are fairly rare (slightly over 10% of the population), and removing this response converts the data to a binary classification problem appropriate to Hosmer-Lemeshow analysis.

11http://spark.rstudio.com/jkatz/Data/comp-53.png

[Figure 3: Maps of the double modal’s distribution. (a) Katz’s nearest-neighbor estimates of the double modal’s distribution in the Harvard Dialect Survey. (b) SeeTweet distribution for might could. (c) Inappropriate p(M|d) distribution directly estimated from Twitter hits.]

The notable difference between the maps is that SeeTweet does not localize double modals as sharply to the Southeast, with pockets in cities throughout the country. This may reflect the difference in the meaning of locations on Twitter and in the HDS; Twitter locations will be a user’s current home, whereas the HDS explicitly asks for a respondent’s location during their formative years. SeeTweet may partly capture the spread of dialectal features due to migration.

Double modals also provide an illustration of the importance of the Bayesian inversion in Eqn. 1, as shown in Fig. 3c. This map, based on the inappropriate distribution p(M|d), which does not account for the overall distribution p(M), disagrees with general knowledge of the double modal’s geography and the HDS map. Although both maps find double modals to be prominent around Atlanta, the inappropriate distribution finds New York City, Chicago, and Los Angeles to be the next most prominent double modal regions, with only moderate probability in the rest of the Southeast. This is not incorrect, per se, as these are the sources of many double modal tweets; but these peaks are incidental, as major cities produce more tweets than the rest of the country. This is confirmed by their absence in the HDS map as well as the appropriate SeeTweet map.

4.3 Extending SeeTweet to new problems

Given SeeTweet’s success in mapping needs done and double modals, it can also be used to test new questions. An understudied issue in past work on the need + past participle construction is its relationship with alternative forms need to be + past participle and need + present participle. Murray et al. (1996) suggest that their need + past participle users reject both alternatives, although it is worth noting that their informants are more accepting of the to be alternative, calling it merely “too formal”, as opposed to an “odd” or “ungrammatical” opinion about the present participle form. Their analysis of the opinions on alternative forms does not go beyond this anecdotal evidence.

SeeTweet provides the opportunity to examine this issue, and finds that the to be form is persistent across the country (Fig. 4c), both in areas with and without the need + past participle form, whereas the present participle alternant (Fig. 4b) is strongest in areas where need + past participle is not used. Although further analysis is necessary to see if the same people use both the past participle forms, the current data suggests that the bare past participle and bare present participle forms are in complementary distribution, while the to be form is acceptable in most locations.

We also compare the alternative constructions to the ANAE data. Using Hosmer-Lemeshow analysis, we find negative correlations: R² = −.65 for needs doing and R² = −.25 for needs to be done. In addition, mean SeeTweet estimates of needs doing usage were lower for regions where respondents use needs done than for regions where they do not: −.93 versus −.49.12 Thus, SeeTweet provides evidence that needs done and needs doing are in a geographically distinct distribution, while needs done and needs to be done are at most weakly distinct.

12SeeTweet estimates of needs to be done usage were comparable in both regions, −.018 against .019.


[Figure 4: SeeTweet distributions for needs done, needs to be done, and needs doing. (a) “Needs done” distribution. (b) “Needs doing” distribution. (c) “Needs to be done” distribution.]

5 The appropriateness of Twitter as a data source

A possible concern with this analysis is that Twitter could be a biased and noisy dataset, inappropriate for sociolinguistic investigation. Twitter skews toward the young and slightly toward urbanites (Duggan and Brenner, 2013). However, as young urbanites tend to drive language change (Labov et al., 2008), any such bias would make the results more useful for examining sociolinguistic changes and emergent forms. The informality of the medium also provides unedited writing data that is more reflective of non-standard usage than most corpora, and its large amounts of data in short timescales offer new abilities to track emerging linguistic change.

As for noise in the tweet data and locations, the strong correlations between the gold-standard and SeeTweet results show that, at least for these features, the noise is mitigated by the size of the dataset. We examined the impact of noise on the needs done dataset by manually inspecting the data for false positives and re-mapping the clean data. Although the false positive rate was 12%, the conditional distribution learned with and without the false positives removed remained tightly correlated, at R² = .94. The SeeTweet method appears to be robust to false positives, although noisier queries may require manual inspection.

A final point to note is that while the datasets used in constructing these maps are relatively small, they are crucially derived from big data. Because the needs done and double modal constructions are quite rare, there would be very few examples in a standard-sized corpus. Only because there are so many tweets are we able to get the hundreds of examples we used in this study.

6 Conclusion

We have shown that Bayesian inversion can be used to build conditional probability distributions over data given metadata from the results of queries on social media, connecting query-derived data to traditional data sources. Tests on Twitter show that such calculations can provide dialect geographies that are well correlated with existing gold-standard sources at a fraction of the time, cost, and effort.

Acknowledgments

We wish to thank Roger Levy, Dan Michel, Emily Morgan, Mark Myslín, Bill Presant, Agatha Ventura, and the reviewers for their advice, suggestions, and testing. This work was supported in part by NSF award 0830535.

References

Gabriel Doyle and Roger Levy. 2008. Environment prototypicality in syntactic alternation. In Proceedings of the 34th Annual Meeting of the Berkeley Linguistics Society.

Maeve Duggan and Joanna Brenner. 2013. The demographics of social media users – 2012. Pew Internet and American Life Project.

J. Daniel Hasty. 2011. I might would not say that: A sociolinguistic study of double modal acceptance. In University of Pennsylvania Working Papers in Linguistics, volume 17.

Joshua Katz. 2013. Beyond “soda, pop, or coke”: Regional dialect variation in the continental US. Retrieved from http://www4.ncsu.edu/~jakatz2/project-dialect.html.

William Labov, Sharon Ash, and Charles Boberg. 2008. The Atlas of North American English: Phonetics, Phonology, and Sound Change. de Gruyter Mouton.


Thomas Murray, Timothy Frazer, and Beth Lee Simon. 1996. Need + past participle in American English. American Speech, 71:255–271.

Daisuke Okanohara and Jun’ichi Tsujii. 2007. A discriminative language model with pseudo-negative samples. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.

Hoifung Poon, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Joshua Tenenbaum and Thomas Griffiths. 2001. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–640.

Bert Vaux and Scott Golder. 2003. Harvard dialect survey. Available at http://www4.uwm.edu/FLL/linguistics/dialect/index.html.
