
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 30–42, Minneapolis, USA, June 7, 2019. ©2019 Association for Computational Linguistics


Surprisal and Interference Effects of Case Markers in Hindi Word Order

Sidharth Ranjan, IIT Delhi ([email protected])

Rajakrishnan Rajkumar, IISER Bhopal ([email protected])

Sumeet Agarwal, IIT Delhi ([email protected])

Abstract

Based on the Production-Distribution-Comprehension (PDC) account of language processing, we formulate two distinct hypotheses about case marking, word order choices and processing in Hindi. Our first hypothesis is that Hindi tends to optimize for processing efficiency at both lexical and syntactic levels. We quantify the role of case markers in this process. For the task of predicting the reference sentence occurring in a corpus (amidst meaning-equivalent grammatical variants) using a machine learning model, surprisal estimates from an artificial version of the language (i.e., Hindi without any case markers) result in lower prediction accuracy compared to natural Hindi. Our second hypothesis is that Hindi tends to minimize interference due to case markers while ordering preverbal constituents. We show that Hindi tends to avoid placing next to each other constituents whose heads are marked by identical case inflections. Our findings adhere to PDC assumptions and we discuss their implications for language production, learning and universals.

1 Introduction

Language universals encode distributional regularities across languages of the world. This study is motivated by the well-known correlation between case marking and increased word order flexibility (Sapir, 1921; Blake, 2001), often expressed as an implicational universal¹. The origin of such universals has been the topic of a long-standing debate in linguistics and cognitive science (Fedzechkina et al., 2012). As the cited work expounds, one view is that language universals emerged due to constraints specific to the system of language and not related to the other cognitive faculties (Chomsky, 1965; Fodor, 2001). Another view is that languages evolved over time as a consequence of cognitive mechanisms and pressures linked with language use. Thus, cognitive biases related to processing (Hawkins, 2004), learnability (Christiansen and Chater, 2008) and communicative efficiency (Jaeger and Tily, 2011) have been proposed as underlying the systematic similarities and divergences between natural languages.

¹ “Given X in a particular language, we always find Y”, where X and Y are characteristics of the language (Greenberg, 1963).

The Production-Distribution-Comprehension (PDC) account of language processing proposed by MacDonald (2013) is an integrated theory of language production and comprehension that seeks to connect language production with typology and comprehension. It is broadly in the spirit of the second view regarding linguistic universals described above and posits production difficulty as the sole factor influencing linguistic form. Hence it is an unconventional approach, contrasting radically with alternative accounts of language use in which language forms are shaped by constraints on language acquisition processes or considerations of facilitating language comprehension for the listeners. Based on PDC assumptions, we formulated two hypotheses linking processing efficiency, case marking, and word order choices at the level of individual speakers (as opposed to the population level) in Hindi, a language having predominantly SOV word order. Hindi has a rich system of case markers along with a relatively flexible word order (Agnihotri, 2007; Kachru, 2006) and thus adheres to the implicational universal stated at the outset.

The PDC principle Easy First stipulates that more accessible words are ordered before less accessible words. Accessibility of a word is influenced by its ease of retrievability from memory. Inspired by the stated PDC principle, our first hypothesis is that Hindi tends to optimize for processing efficiency at both lexical and syntactic levels. We investigate the role of case markers in this process by comparing the processing efficiency of natural Hindi and an artificial version of Hindi without case markers. Based on the PDC principle of Reduce Interference, our second hypothesis is that Hindi orders constituents such that phonological interference caused by case marker repetition is minimized. Interference is the idea that entities with similar properties (like form, meaning, animacy, concreteness and so forth) cause processing difficulties when they occur in proximity. A long line of research attests the role of interference in both production (Bock, 1987; Jaeger et al., 2012) and comprehension (Van Dyke and McElree, 2006; Van Dyke, 2007).

In order to test the stated hypotheses, we deploy a machine learning model to predict the reference sentence occurring in the Hindi-Urdu TreeBank (HUTB) corpus (Bhatt et al., 2009) of written text² (Example 1a below), amidst a set of artificially created grammatical variants expressing the same proposition (Examples 1b-1c). Case markers are shown in bold for illustration purposes.

(1) a. isse pehle jila upabhokta adalat ne 28 April 1998 ko apne faisle me company ko nirdesh diyaa thaa ki...
       this before district consumer court ERG 28 April 1998 ON own decision LOC company DAT direction gave COMPL
       'Earlier, the District Consumer Court had directed the company in its decision on April 28, 1998 that ...'

b. isse pehle jila upabhokta adalat ne apne faisle me 28 April 1998 ko company ko nirdesh diyaa thaa ki...

c. jila upabhokta adalat ne isse pehle apne faisle me 28 April 1998 ko company ko nirdesh diyaa thaa ki...

The variants above have two adjacent ko-marked constituents, potentially causing interference during production. So the PDC account would not prefer these sentences on account of production difficulty and would instead prefer the reference sentence above. The possibility that speakers chose the reference sentence above so that it would facilitate comprehension for listeners (compared to variant sentences which might be harder to interpret) is not considered by the PDC account.

² We concede that the use of written data (due to the lack of a publicly available Hindi speech corpus) is a major limitation of our study.

We quantified processing efficiency using surprisal, originally proposed as a measure of language comprehension difficulty by Surprisal Theory (Hale, 2001; Levy, 2008). Consequently, we introduced surprisal estimated from n-gram and dependency parsing models into a logistic regression model for the task of predicting the reference sentence. Our choice of surprisal is inspired by Levy and Gibson (2013), who point out that a desideratum for PDC to become a theory of powerful empirical import is that it should make quantitative and localized predictions about incremental processing difficulty at each word. They highlight the fact that such a theory already exists, viz. the Surprisal Theory of language comprehension mentioned above. A perusal of the literature on information density in language production suggests that surprisal is a reasonable choice to model production difficulty as well.

Information density and surprisal are mathematically equivalent and both quantify the contextual predictability of a linguistic unit, but surprisal is based on different theoretical assumptions about resource allocation in comprehension. Recent research has demonstrated that reduction phenomena at both lexical (Frank and Jaeger, 2008, verb contraction) and syntactic (Jaeger, 2010, that-complementizer choice) levels exhibit the drive to minimize variation in information density across the linguistic signal. Moreover, instances of the same word which have greater predictability tend to be spoken faster and with less emphasis on acoustic details (Bell et al., 2009; Pluymaekers et al., 2005). The work cited above uses lexical frequencies or n-gram models over words to estimate contextual predictability. More recently, Demberg et al. (2012) showed that syntactic surprisal estimated from a top-down incremental parser is positively correlated with the duration of words in spontaneous speech, even in the presence of controls including word frequencies and trigram lexical surprisal estimates. Crucial to our study, words which are predictable in context have been interpreted to be more accessible in recent research (Arnold, 2011).

The results of our experiments show that reference sentences tend to minimize both trigram and dependency parser surprisal in comparison to their variants. Further, we show that the prediction accuracies of surprisal estimates derived from an artificially created version of Hindi without any case markers are significantly worse than the corresponding surprisal estimates based on natural Hindi. This experiment demonstrates the crucial contribution of case markers towards the predictive ability of surprisal and confirms our first hypothesis. Subsequently, we demonstrate that Hindi tends to avoid placing together constituents whose heads are marked by the same case marker. Moreover, incorporating predictors based on adjacent case marker sequences in a statistical model significantly improves model prediction accuracy over an extremely competitive baseline provided by n-gram and dependency parser surprisal. Phonological interference is a plausible explanation for this phenomenon and lends credence to our second hypothesis. The Hindi sentence comprehension literature provides only limited support for interference involving case marker sequences (Vasishth, 2003). Hence, it is plausible that this effect is a factor confined to the production system and not related to considerations of language comprehension. Further research using spoken corpora and spontaneous production experiments needs to be performed in order to validate the psychological reality of our findings. Given that symbols used in the Hindi orthography have a direct correspondence with the sounds of the language (Vaid and Gupta, 2002), we expect speech to behave similarly.

Our main contribution is that we broaden the typological base of the PDC account of language processing, leveraging its connection with the well-established surprisal theory of language comprehension. Levy and Gibson (2013) state that surprisal would enable PDC to be implemented computationally, thus facilitating hypothesis testing on a wide range of linguistic phenomena cross-linguistically. To this end, we set up a computational framework consisting of standard tools and techniques from the field of Natural Language Generation (NLG). Methodologically, the task of reference sentence prediction is a relatively novel way of studying word order and is inspired by the surface realization component of NLG. Recently, using a similar setup, Rajkumar et al. (2016) showed the impact of dependency length on English word order choices.

In this paper, Section 2 provides necessary background and Section 3 provides details of our data sets and models. Section 4 presents our experiments and their results. Finally, Section 5 summarizes the conclusions of our study and discusses the implications of our results for language production and learning.

Marker      Case (Gloss)         Grammatical function
φ           nominative (NOM)     subject/object
ne          ergative (ERG)       subject
ko          accusative (ACC)     object
            dative (DAT)         subject/indirect object
se          instrumental (INS)   subject/oblique/adjunct
ka/ki/ke    genitive (GEN)       subject (infinitives), specifier
me/par/tak  locative (LOC)       oblique/adjunct

Table 1: Hindi case markers (Butt and King, 1996).


2 Background

This section offers a brief background on Hindi word order and case marking, surprisal, and core assumptions of the PDC account.

2.1 Hindi Word Order and Case Marking

A long line of work (Butt and King, 1996; Kidwai, 2000) has shown that scrambling in Hindi is influenced by factors like discourse considerations (topic, focus, background, and completive information), semantics (definiteness and animacy), and prosody (Patil et al., 2008). Hindi follows the head-marking strategy where case markers are postpositions which attach to noun phrases and encode a range of grammatical functions like subject and object (see Table 1 and the case markers in bold in Examples 1a and 2a).

2.2 Surprisal Theory

The Surprisal Theory of language comprehension posits that fine-grained probabilistic knowledge (attained from prior linguistic experience) helps comprehenders form expectations about interpretations of the previously encountered structure as well as upcoming material (Hale, 2001; Levy, 2008). The theory defines surprisal as a measure of comprehension difficulty. In this work, we used the following definitions of surprisal:

1. n-gram surprisal: Mathematically, the n-gram surprisal of the (i+1)th word, $w_{i+1}$, under a traditional n-gram model is given by $S_{i+1} = -\log P(w_{i+1} \mid w_{i-n+2}, \ldots, w_{i-1}, w_i)$, as defined by Hale (2001). We estimated n-gram surprisal via trigram models (n = 3) over words trained on 1 million sentences from the EMILLE corpus (Baker et al., 2002) using the SRILM toolkit (Stolcke, 2002) with Good-Turing discounting.

2. Dependency parser surprisal was computed using the probabilistic incremental dependency parser developed by Agrawal et al. (2017), based on the parallel-processing variant of the arc-eager parsing strategy (Nivre, 2008) proposed by Boston et al. (2011). This parser maintains a set of the k most probable parses at each word as it proceeds through the sentence. The probability of a parser state is taken to be the product of the probabilities of all transitions made to reach that state. This parser can thus be used to define a measure of dependency parser surprisal: for the ith word in a sentence, we first define the prefix probability $\alpha_i$ as the sum of probabilities of the k maintained parser states at word i:

\[ \alpha_i = \sum_{\text{top } k \text{ derivations } d \text{ leading to word } i} \mathrm{Prob}(d) \qquad (1) \]

The dependency parser surprisal at word i+1 is then computed as:

\[ S^{\mathrm{syn}}_{i+1} = -\log(\alpha_{i+1} / \alpha_i) \qquad (2) \]

The dependency parser surprisal of the (i+1)th word is thus the negative log-ratio of the sum of probabilities of maintained parser states at word i+1 to the same sum at word i. We estimated it using a corpus of 12,000 HUTB projective trees. A minimal computational sketch of both measures is given below.
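To make the two definitions above concrete, the sketch below computes sentence-level trigram surprisal from a table of conditional probabilities and per-word dependency parser surprisal from a list of prefix probabilities α. It is a minimal illustration, not the SRILM/parser pipeline used here: the probability table, the flat floor value standing in for Good-Turing backoff, and the toy α values are all invented for the example.

```python
import math

def trigram_surprisal(tokens, trigram_prob, floor=1e-7):
    """Sentence-level trigram surprisal: sum of -log2 P(w_i | w_{i-2}, w_{i-1}).
    `trigram_prob` maps (w_{i-2}, w_{i-1}, w_i) tuples to conditional
    probabilities; unseen trigrams fall back to a small floor value, a crude
    stand-in for the Good-Turing-discounted backoff of a real language model."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return sum(-math.log2(trigram_prob.get(tuple(padded[i - 2:i + 1]), floor))
               for i in range(2, len(padded)))

def parser_surprisal(prefix_probs):
    """Dependency parser surprisal per word: S_{i+1} = -log2(alpha_{i+1}/alpha_i),
    where each alpha is the summed probability of the k parser states
    maintained at that word (Equations 1 and 2)."""
    return [-math.log2(prefix_probs[i + 1] / prefix_probs[i])
            for i in range(len(prefix_probs) - 1)]

# Toy usage with invented numbers (illustrative only).
probs = {("<s>", "<s>", "adalat"): 0.2, ("<s>", "adalat", "ne"): 0.5}
print(trigram_surprisal(["adalat", "ne"], probs))
print(parser_surprisal([1.0, 0.4, 0.25, 0.05]))
```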

2.3 Production-Distribution-Comprehension (PDC) Account

The Production component of the PDC account posits three factors of production ease. 1. Easy First: Relatively more accessible (ease of memory retrieval and conceptual salience) or available elements are produced earlier in the structure. 2. Plan Reuse: Speakers tend to repeat previously used or mentioned structures due to syntactic priming. 3. Reduce Interference: Speakers tend to choose words which do not interfere with other words in the utterance plan. These factors compete with each other during the production process to mould language forms.

The Distribution component states that the distribution of structures in natural languages reflects a bias towards having a greater number of structures which are easier to produce. Thus PDC attributes the greater frequency of subject relative clauses compared to object relatives across languages to production ease. Finally, the Comprehension part of the PDC approach proposes that language comprehension reflects the statistics of the input (i.e., production patterns) perceived by language users. Thus, according to PDC, the greater difficulty involved in comprehending object relative clauses compared to subject relatives (Gibson, 2000) is because of the lower exposure to object relatives by virtue of their lower frequency in the linguistic input to comprehenders. Levy and Gibson (2013) put forth the idea that surprisal (estimated from corpora) is naturally very compatible with the PDC assumptions described above. Maryellen MacDonald and colleagues validate PDC predictions using a series of experiments related to relative clause production and comprehension in many languages (Gennari and MacDonald, 2008, 2009; Gennari et al., 2012).

3 Data and Models

Our data set consists of 8736 reference sentences corresponding to labeled, projective dependency trees in the Hindi-Urdu TreeBank (HUTB) corpus of written Hindi (Bhatt et al., 2009). We generated variants for each reference sentence by randomly permuting the preverbal constituents of the root node of its dependency tree. We selected trees whose roots were verbs. For example, in the tree depicted in Figure 1 (corresponding to Example 1a), we reordered the preverbal constituents immediately dominated by the verb diyaa and obtained the variants shown in Examples 1b and 1c. In order to eliminate ungrammatical variants, we excluded variants containing dependency relation sequences of the root word not present in the corpus of HUTB gold-standard trees. Dependency relation sequences like k7t-k1, k1-k7t, k7t-k7 and k7-k4 in Figure 1 simulate grammar rules used in grammar-based surface realization systems. We obtained 175801 variants after filtering.

In order to mitigate the imbalance between the number of reference and variant sentences, we transformed the data set using a technique described in Joachims (2002). As per this technique, a binary classification problem can be converted into a pairwise ranking problem by training a classifier on the difference between the feature vectors of a reference sentence and its syntactic choice variants:

\[ w \cdot \phi(\mathrm{Reference}) > w \cdot \phi(\mathrm{Variant}) \qquad (3) \]

\[ w \cdot (\phi(\mathrm{Reference}) - \phi(\mathrm{Variant})) > 0 \qquad (4) \]

In Equation 3 above, the Reference data point is predicted to outrank the Variant data point when the dot product of the feature vector of the reference with w (learned feature weights) is greater than the corresponding product of the variant. The same can be written (Equation 4) as the dot product of w with the feature vector difference being positive. We created ordered pairs consisting of the feature vectors of reference and variant sentences. Every reference sentence in the data set was paired with each of its variants (Examples 1a-1b and Examples 1c-1a constitute two such pairs). Then the feature values of the first member of the pair were subtracted from the corresponding values of the second member. Pairs alternate between reference-variant (coded as “1”) and variant-reference (coded as “0”), resulting in a data set consisting of an equal number of classification labels of each kind (see Appendix for a detailed illustration).
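A sketch of this pairwise transformation is given below; it is an illustrative reconstruction rather than the authors' code. The feature values loosely follow Table 6 in the Appendix, and the exact alternation rule deciding which pairs are reference-minus-variant is an assumption made for the example.

```python
import numpy as np

def joachims_pairs(ref_vec, variant_vecs):
    """Convert one reference sentence and its variants into difference
    vectors with alternating labels (Joachims, 2002): reference-minus-variant
    differences are labelled 1, variant-minus-reference differences 0."""
    X, y = [], []
    for j, var_vec in enumerate(variant_vecs):
        if j % 2 == 0:                       # variant - reference -> label 0
            X.append(var_vec - ref_vec)
            y.append(0)
        else:                                # reference - variant -> label 1
            X.append(ref_vec - var_vec)
            y.append(1)
    return np.array(X), np.array(y)

# Feature vectors are (lexical surprisal, syntactic surprisal), loosely
# following Table 6 in the Appendix.
ref = np.array([97.45, 156.64])
variants = [np.array([97.69, 160.77]), np.array([98.25, 156.91]),
            np.array([97.97, 159.50]), np.array([98.16, 161.94])]
X, y = joachims_pairs(ref, variants)
print(X)   # difference vectors, cf. Table 7 in the Appendix
print(y)   # [0, 1, 0, 1]
```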

Figure 1: Example HUTB dependency tree (Table 8 in the Appendix provides a glossary of dependency relations)


4 Experiments

In this section, we describe three experiments to test our hypotheses on the transformed version of our data set consisting of 175801 data points using a logistic regression model. The goal is to predict “1” and “0” labels (as described in the previous section) using a set of cognitively motivated features. We calculated lexical and dependency parser surprisal feature values over entire sentences by summing the surprisal values (negative log probabilities) of individual words. We carried out 27-fold cross-validation; for each run, a model trained on 26 folds (including 1 fold for hyperparameter tuning) was used to generate predictions about the remaining fold (100 training iterations using the lbfgs solver in the Python scikit-learn toolkit, v0.16.1).
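A skeletal version of this classification setup is shown below, assuming scikit-learn's LogisticRegression with the lbfgs solver. The hyperparameter-tuning fold and the actual feature pipeline are omitted, and the toy data are random, so the number it prints is meaningless; the sketch only illustrates the cross-validation loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_validated_accuracy(X, y, n_folds=27):
    """Mean test accuracy of logistic regression over k folds (the paper
    additionally reserves one training fold for hyperparameter tuning,
    which is omitted here)."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
        clf = LogisticRegression(solver="lbfgs", max_iter=100)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Toy data: two features (delta trigram surprisal, delta parser surprisal).
rng = np.random.default_rng(0)
X = rng.normal(size=(540, 2))
y = (X[:, 0] + 0.5 * X[:, 1] < 0).astype(int)   # lower surprisal -> reference
print(cross_validated_accuracy(X, y))
```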

4.1 Processing Efficiency Experiments

Here, we test the hypothesis that word order choices in language are optimized for processing efficiency by incorporating trigram and dependency parser surprisal as predictors in a logistic regression model.

Figure 2: Mean trigram surprisal per sentence of reference (43.81) and variant (47.03) sentences (95% confidence intervals indicated)

A negative regression coefficient for these predictors would imply that corpus sentences have lower surprisal than variants. For the entire corpus, Figure 2 indicates this trend, where the mean trigram surprisal per sentence of the corpus of reference sentences is lower than the corresponding value of all their variants (Figure 4 in the Appendix depicts the same trend for syntactic surprisal). For the task of predicting HUTB reference sentences, both our surprisal measures have a negative regression coefficient, individually as well as in combination (first three rows of Table 2). This confirms our hypothesis that word order choices optimize for processing efficiency. Given our interpretation of low surprisal as denoting ease of accessibility, our first experiment shows that Hindi speakers tend to produce sentences by ordering preverbal constituents such that more accessible elements are realized first compared to other competing grammatical variants. This is in line with the PDC Easy First principle.


Predictor(s)                            Accuracy%   Weight(s)
Hindi
  Parser surprisal                      62.10       -0.43
  Trigram surprisal                     89.96       -0.81
  Trigram + parser surprisal            90.14       -0.98, -0.43
Caseless Hindi
  Caseless parser surprisal             55.03       -0.29
  Caseless trigram surprisal            87.73       -0.83
  Caseless trigram + parser surprisal   87.81       -0.93, -0.27

Table 2: Classification accuracies of surprisal for natural and caseless Hindi (175801 data points)

Further, the classification accuracies indicate that trigram surprisal estimated from the EMILLE corpus is very effective in modelling syntactic choice (89.96% accuracy). For the same task, Ranjan (2015) reported that trigram surprisal estimated from the HUTB itself (a smaller quantity of in-domain data) resulted in a lower accuracy of around 85%. In this context, our results show that a bigger n-gram training set can overcome the limitation of being from a different domain. A qualitative exploration revealed that n-gram model surprisal was particularly effective in reference-variant pairs as shown below (case markers shown in bold):

(2) a. Paakistan ne brihaspativaar ko kathit taur par apne yahaan nirmit paramanu hathiyaar dhone me saksam krooj missile ka pareekshan kiyaa hai.
       Pakistan ERG Thursday at allegedly indigenous nuclear weapons carrying LOC capable cruise missile GEN tested
       'Pakistan has allegedly tested an indigenous cruise missile capable of carrying nuclear weapons on Thursday.'

b. brihaspativaar ko Pakistan ne kathit taur par apne ...

The reference sentence (Example 2a, with a trigram surprisal of 45.60 hartleys) has the ergative-accusative (ne-ko) ordering of case-marked nouns compared to the variant (Example 2b, with a higher trigram surprisal of 47.12), which has the opposite ordering of nouns. Overall, 6% of a total of 175 HUTB sentences having ergative and accusative case markers exhibit a non-canonical accusative-ergative order (Agrawal et al., 2017). In both sentences above, the case markers in question are separated by a single word and hence form part of a single trigram. Thus trigram surprisal is able to model the dominant order successfully while dispreferring the opposite order seen in Example 2b. Moreover, dependency parser surprisal has much lower classification accuracy compared to trigram surprisal and has a very negligible impact on performance on top of trigram surprisal. Thus, surprisal estimates from an incremental dependency parser are not effective in modelling constituent order choices. This is slightly unexpected, as Agrawal et al. (2017) showed that surprisal estimates derived via the dependency parser deployed in our work account for per-word reading times in the Potsdam-Allahabad corpus over and above bigram frequencies. Using a similar setup, Rajkumar et al. (2016) showed that for the task of predicting English syntactic choice alternations, PCFG surprisal performed significantly better than n-gram model surprisal and that the impact of dependency length is over and above both the aforementioned surprisal predictors. We are in the process of creating a constituency structure treebank for Hindi and plan to experiment with surprisal derived from a constituent-structure parser very soon. In recently completed work, Ranjan et al. (In Preparation) show that for Hindi, dependency length exhibits a weak effect over and above surprisal for predicting corpus sentences amidst artificial variants. Finally, we examined 1022 reference-variant pairs in our dataset where none of our features was able to predict the reference sentence correctly. We isolated cases involving other factors like given-new orders (30% of cases), focus or topic considerations (marked by hi or to markers, constituting 10% of cases) and null subjects (7.5%). Such discourse considerations are not encoded in our surprisal estimates (confined to single sentences), and further research can incorporate information about sentences from the preceding context into surprisal estimates.

Note that when considering the relationship between communicative efficiency and word order choices, there is a potential ‘levels’ problem (Levy, 2018). At the level of evolutionary timescales and entire populations, one might expect the grammar or distributional properties of the language to be adapted for efficiency. But at the level of an individual speaker's production choices, certain measures of efficiency will in turn depend on the extant distribution of linguistic forms. So there is a potential circularity in trying to assess the validity of such measures. Here we seek to model only the lower of these levels, i.e., individual choices over a human lifetime. Hence, all the non-corpus variants we consider are grammatical. We assume that the grammar of the language is held fixed, and within the set of possible word order variants of a sentence licensed by that extant grammar, seek to model why speakers may have a greater propensity to produce some variants over others.

4.2 Case Markers and Processing Efficiency

In order to quantify the exact contribution of Hindi case markers towards the predictive accuracy of syntactic and trigram surprisal, we performed similar experiments using an artificial version of the language (i.e., Hindi without case markers). The sentence comprehension literature demonstrates the vital role of case markers in predicting the final verb in verb-final constructions of languages like German (Levy and Keller, 2013) and Japanese (Grissom II et al., 2016). Moreover, in recent years, deploying artificial languages to test hypotheses about language processing and learning has been in vogue in both connectionist modelling (Lupyan and Christiansen, 2002; Everbroeck, 2003) as well as behavioural experiments (Kurumada and Jaeger, 2015; Fedzechkina et al., 2017). Inspired by the cited works, we created a caseless version of Hindi by removing case markers (those listed in Table 1) from both reference and variant sentences. The caseless equivalents of Examples 2a and 2b discussed in the previous section are given below:

(3) a. Pakistan brihaspativaar kathit taur apne yahaan nirmit paramanu hathiyaar dhone mein saksam krooj missile pareekshan kiyaa hai.

b. brihaspativaar Pakistan kathit taur apne ...

Then we estimated surprisal by stripping off case markers from the EMILLE corpus (trigram surprisal) as well as the HUTB trees (dependency parser surprisal) so that our surprisal estimates mirrored the patterns in the caseless version of the language faithfully. Both surprisal measures derived from the caseless version of Hindi perform significantly worse than natural Hindi (last three rows of Table 2). Caseless trigram surprisal does 2% worse, while there is a 7% dip in the performance of caseless dependency parser surprisal (McNemar's two-tailed significance p < 0.001 for both measures). Thus the caseless language model is not able to predict the reference sentence shown in Example 3a, as it awards it a higher trigram surprisal (45.21) in comparison to the variant sentence in Example 3b, which has a lower surprisal value (43.74). Figure 3 in the Appendix depicts the lexical surprisal profiles for the examples discussed above (both regular Hindi and caseless equivalents). Dependency parser surprisal also exhibited the same predictions.

Removing any kind of words (especially function words) will result in a decrease in prediction accuracy. So we compared the prediction accuracy of caseless surprisal with another baseline obtained by removing case markers and all other postpositions (e.g., ke liye, ke dwara) from both training and test data. Surprisal estimates derived from the case-marker- and postposition-stripped version of Hindi resulted in an extra dip of 0.3% in the accuracy of trigram surprisal and 2.5% for dependency parser surprisal, compared to surprisal obtained by stripping just the case markers. Thus, even within the set of postpositions, case markers play a significant role in lexical and syntactic predictability and hence processing efficiency. Lack of case markers reduces the overall information content of a sentence for both speaker and hearer. Spontaneous production experiments showed that Japanese speakers tend to omit the optional marker -o when the meaning of the sentence is probable in a given context (Kurumada and Jaeger, 2015). However, in cases where the meaning is not plausible, speakers tend to mention the case marker, in spite of entailing greater production effort.

The work of Lupyan and Christiansen (2002) showed that for artificial SOV languages with no case marking, a sequential learning device (a Simple Recurrent Network) failed to achieve high accuracy on the task of mapping words to grammatical roles. Their simulations suggest that verb-final languages need a case system for optimal learning, as word order is not a reliable cue for grammatical function assignment. Using the miniature artificial language learning paradigm, Fedzechkina et al. (2017) conducted a study where two groups of adult learners were exposed to artificial languages with optional case marking (one fixed order and one flexible order). Learners of the flexible constituent order language produced more case markers than learners of the fixed order language, mirroring typological patterns.

4.2.1 Interference Experiments

In the light of the PDC principle of Reduce Interference, we investigate whether interference between NPs whose heads are marked by the same case marker influences preverbal constituent ordering choices in Hindi. Since PDC seeks to link production and comprehension, our experiments are also motivated by prior work on case marker interference in sentence comprehension in SOV languages like Japanese (Lewis and Nakayama, 2001), Korean (Lee et al., 2005) and Hindi (Vasishth, 2003). Our work is directly related to the experiments on identical case marking described in Chapter 3 of Vasishth (2003). In the case of Hindi center-embeddings, this work examined whether NPs having nominal heads marked by identical case markers induce similarity-based interference effects at the subsequent verb, as predicted by the Retrieval Interference Theory (Lewis, 1998; Lewis and Nakayama, 2001). The study shows limited support for interference emanating from phonologically similar case markers.

Predictor name   Sequence   Distance
φ-ne             1          3
ne-ko            1          3
ko-me            1          2
me-ko            1          1
same-seq         0
diff-seq         4

Table 3: Values of case features extracted from the tree in Figure 1.

Case marker sequence   Weight
φ-φ                    -0.002
ke-ke                  -0.025
ko-ko                  -0.291
me-me                  -0.061
tak-tak                 0.008
par-par                 0.231
se-se                   0.055
same-seq               -0.009
diff-seq                0.009

Table 4: Learned weights of some case-sequence predictors.


In order to investigate interference caused by case markers in syntactic choice, we designed features based on case markers and incorporated them into our logistic regression model. For each dependency tree, we introduced two types of features associated with the preverbal constituents of the root verb. 1. Case-sequence features: counts of case marker sequences associated with the heads of a pair of adjacent constituents. We also introduced generic case-sequence features same-seq and diff-seq to model the overall trend; for each tree, these features denote the total number of identical and different case marker sequences associated with pairs of adjacent constituents. 2. Case-distance features: the number of intervening words between the heads of the constituents of root verbs. Here, the feature name is obtained by combining the case markers associated with the constituent heads in question. Constituents which do not have case-marked heads are marked as φ in order to model the fact that languages often use adverbial elements or other non-case-marked arguments to separate case-marked constituents. Table 3 illustrates our case features based on the dependency tree in Figure 1, corresponding to Example 1a.
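The sketch below illustrates one way the two feature types could be computed. The input representation (one case marker and one head-word position per preverbal constituent) and the exact distance definition (words intervening between adjacent heads) are assumptions made for the example, with head positions invented so that the output mirrors Table 3.

```python
from collections import Counter

def case_features(constituents):
    """`constituents`: the preverbal constituents of the root verb, each given
    as (case_marker, head_position); case_marker is None (mapped to 'phi')
    when the head is not case marked, head_position is the 0-based index of
    the head word in the sentence. Returns counts of adjacent case-marker
    sequences (plus generic same-seq/diff-seq counts) and, per adjacent pair,
    the number of words intervening between the two heads."""
    seq_counts, distances = Counter(), {}
    for (m1, h1), (m2, h2) in zip(constituents, constituents[1:]):
        m1, m2 = m1 or "phi", m2 or "phi"
        pair = f"{m1}-{m2}"
        seq_counts[pair] += 1
        seq_counts["same-seq" if m1 == m2 else "diff-seq"] += 1
        distances[pair] = h2 - h1 - 1          # intervening words between heads
    return seq_counts, distances

# Invented head positions chosen to reproduce the Table 3 values for Example 1a.
print(case_features([(None, 0), ("ne", 4), ("ko", 8), ("me", 11), ("ko", 13)]))
```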

Predictor(s)                                   Classification accuracy%   Ranking accuracy%
Case distance features                         70.79
Case sequence features                         74.94
Random Classifier                                                         21.25
Baseline (trigram + parser surprisal)          90.16                      55.04
Baseline + Case distance features              90.85***                   55.68***
Baseline + Case sequence features              91.13***                   56.03***
Baseline + Case distance + sequence features   91.60***                   56.16***

Table 5: Pairwise classification and ranking accuracy (*** denotes McNemar's two-tailed significance p < 0.001 over the baseline model).


In isolation, the case-sequence and case-distance features exhibit accuracies around 70% (second column of Table 5). The case sequence and distance features together induce a significant accuracy increase of 1.5% (McNemar's two-tailed significance p < 0.001) over a baseline model consisting of lexical and dependency parser surprisal as features. Though this might be a small increase when considered in isolation, we would like to note that our baseline model is extremely competitive (90.16% accuracy). Even dependency parser surprisal did not confer considerable performance gains over and above trigram surprisal, as discussed earlier. So in this context, the contribution of case features is noteworthy.
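The significance comparisons reported here and in Table 5 rely on McNemar's two-tailed test over paired model predictions. A minimal exact-binomial version is sketched below; it is a generic illustration rather than the paper's evaluation script (a chi-square approximation is also common for large discordant counts).

```python
from math import comb

def mcnemar_exact_p(correct_a, correct_b):
    """Exact two-tailed McNemar test. `correct_a` and `correct_b` are
    equal-length 0/1 lists marking whether models A and B predicted each
    data point correctly; only discordant pairs enter the test."""
    b = sum(1 for a, c in zip(correct_a, correct_b) if a and not c)
    c = sum(1 for a, c in zip(correct_a, correct_b) if c and not a)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    # two-tailed exact binomial probability under the null (p = 0.5)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Toy usage: baseline vs. baseline + case features on ten items.
baseline = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
extended = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
print(mcnemar_exact_p(baseline, extended))   # 0.5
```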

Subsequently, we examined the learned weights of our case sequence features (Table 4) in our best model containing surprisal and all the case marker features. A negative weight is associated with four of the seven identical case marker sequences as well as with the same-seq feature encoding the overall pattern across all case markers. These negative weights lend support to our hypothesis that Hindi shows a dispreference for placing together constituents whose heads are marked using the same case inflection. Interference due to repetition of phonologically identical case markers may be a plausible explanation for this phenomenon. However, three other case marker sequences have a positive weight and hence indicate a tendency towards adjacency. These three case markers are much lower in frequency in the HUTB compared to the other four and might not represent the dominant tendency. However, future inquiries need to explore the role of case-based facilitation (Logacev and Vasishth, 2012). Since our features are not sensitive to clause boundaries, conclusive evidence for phonological interference will emerge only after controlling for clause boundaries.

The best model (baseline + case marker features) picked the reference sentence (Example 1a) while the baseline model erroneously selected the artificially generated variants (Examples 1b and 1c). The reference sentence has two ko-marked constituents separated by intervening constituents. In contrast, the variant sentences have two adjacent ko-marked constituents, potentially causing interference. These examples also highlight the ambiguous nature of the ko marker in denoting several functions in Hindi. As noted by Ahmed (2006), ko marks both accusative and dative case on objects (company in the cited examples) as well as dative subjects. In addition, it also occurs on spatial and temporal adjuncts (as in 28 April 1998). In these examples, since ko marks both dative case and temporality, interference might be purely phonological in nature and not related to the actual grammatical function being marked. Further, we calculated the ranking accuracy of our main models, i.e., the percentage of times a model ranked the reference sentence above all its variants. Table 5 (column 3) indicates that introducing case marker features into the baseline model induced significant ranking accuracy gains (McNemar's two-tailed significance p < 0.001). So our best model ranked Example 1a as the best sentence among all the other variants. Our classification and ranking results suggest that the PDC Reduce Interference principle of production ease is a valid constraint in constituent ordering.
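Ranking accuracy as used in Table 5 can be computed in a few lines; the sketch below assumes that higher model scores mean a stronger preference and that ties count against the reference, which is an assumption made for illustration.

```python
def ranking_accuracy(items):
    """`items`: list of (reference_score, [variant_scores]) pairs. A reference
    counts as correctly ranked only if it outscores every one of its variants."""
    hits = sum(1 for ref, variants in items if all(ref > v for v in variants))
    return hits / len(items)

print(ranking_accuracy([(0.9, [0.4, 0.7]), (0.5, [0.6, 0.2])]))   # 0.5
```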

In Hindi sentence comprehension, Vasishth (2003) explored the idea of positional similarity (Lewis and Nakayama, 2001), whereby the position of otherwise syntactically indiscriminable NPs in the structure contributes to interference at the subsequent verb. So he compared reading times at the innermost verb in sequences of constituents with heads marked by ne-se-ko-ko and ne-ko-se-ko inflections. However, there was no significant difference in reading times between these sequences, thus offering no support for positional similarity during comprehension. This is the experimental condition which is most closely linked to our work. Interpreted in conjunction with our findings, case marker interference in Hindi appears to be a constraint on production rather than comprehension.

5 Discussion

Our main findings are broadly in line with two of the production ease principles of the PDC account. Our first experiment shows that the Hindi language orders words to optimize production ease (quantified using surprisal) at both lexical and syntactic levels, consistent with the PDC Easy First principle. Our second experiment shows that case markers make a significant contribution towards the predictive accuracy of both syntactic and trigram surprisal in choosing the reference sentence amongst grammatical variants denoting the same meaning. The role of surprisal and case markers in conferring accessibility needs to be investigated more thoroughly in future work. Finally, our third experiment shows that Hindi tends to disprefer constituent sequences with heads marked by identical case markers, as predicted by the PDC principle of Reduce Interference. However, the lack of case marker interference in Hindi comprehension necessitates further inquiries into the PDC account, which conceives the lexico-syntactic statistics of production data (the result of biases in utterance planning) as guiding comprehension processes. Thus, overall, we would like to conclude that certain aspects of PDC are validated by our experimental results. Further computational inquiries will be facilitated by formulating an algorithmic sketch of a process model outlining the causes of mismatches between production and comprehension. Finally, the PDC account conceives word order variation in languages of the world as emerging from an interplay of the three PDC production principles. Crucially, PDC conceives learning biases to be production biases, i.e., speakers learn forms which are easier to produce (MacDonald, 2013). Future inquiries can explore whether learning outcomes are indeed consistent with typological patterns described by language universals.

6 Acknowledgements

We are grateful to the anonymous reviewers of this workshop, NAACL-2019 and Sigmorphon-2018 for their feedback. The second author acknowledges support from IISER Bhopal's Faculty Initiation Grant (IISERB/R&D/2018-19/77).


References

Rama Kant Agnihotri. 2007. Hindi: An Essential Grammar. Essential Grammars. Routledge.

Arpit Agrawal, Sumeet Agarwal, and Samar Husain. 2017. Role of expectation and working memory constraints in Hindi comprehension: An eyetracking corpus analysis. Journal of Eye Movement Research, 10(2).

Tafseer Ahmed. 2006. Spatial, temporal and structural uses of Urdu ko. In Miriam Butt and Tracy Holloway King, editors, Proceedings of the 11th International LFG Conference. CSLI Publications, Stanford.

Jennifer E. Arnold. 2011. Ordering choices in production: For the speaker or for the listener? In Emily M. Bender and Jennifer E. Arnold, editors, Language From a Cognitive Perspective: Grammar, Usage, and Processing, pages 199–222. CSLI Publishers.

Paul Baker, Andrew Hardie, Tony McEnery, Hamish Cunningham, and Robert Gaizauskas. 2002. EMILLE: a 67-million word corpus of Indic languages: data collection, mark-up and harmonization. In Proceedings of LREC 2002, pages 819–827. Lancaster University.

Alan Bell, Jason M. Brenier, Michelle Gregory, Cynthia Girand, and Dan Jurafsky. 2009. Predictability effects on durations of content and function words in conversational English. Journal of Memory and Language, 60(1):92–111.

Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, and Fei Xia. 2009. A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP '09, pages 186–189, Stroudsburg, PA, USA. Association for Computational Linguistics.

Barry J. Blake. 2001. Case, 2nd edition. Cambridge Textbooks in Linguistics. Cambridge University Press.

Kathryn Bock. 1987. An effect of the accessibility of word forms on sentence structures. Journal of Memory and Language, 26(2):119–137.

Marisa Ferrara Boston, John T. Hale, Shravan Vasishth, and Reinhold Kliegl. 2011. Parallel processing and sentence comprehension difficulty. Language and Cognitive Processes, 26(3):301–349.

Miriam Butt and Tracy Holloway King. 1996. Structural topic and focus without movement. In Miriam Butt and Tracy Holloway King, editors, Proceedings of the First LFG Conference. CSLI Publications, Stanford.

Noam Chomsky. 1965. Aspects of the Theory of Syntax, volume 11. The MIT Press.

Morten H. Christiansen and Nick Chater. 2008. Language as shaped by the brain. Behavioral and Brain Sciences, 31(5):489–509.

Vera Demberg, Asad B. Sayeed, Philip J. Gorinski, and Nikolaos Engonopoulos. 2012. Syntactic surprisal affects spoken word duration in conversational contexts. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 356–367, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ezra Van Everbroeck. 2003. Language type frequency and learnability from a connectionist perspective. Linguistic Typology, 7(1):1–50.

Maryia Fedzechkina, T. Florian Jaeger, and Elissa L. Newport. 2012. Language learners restructure their input to facilitate efficient communication. Proceedings of the National Academy of Sciences, 109(44):17897–17902.

Maryia Fedzechkina, Elissa L. Newport, and T. Florian Jaeger. 2017. Balancing effort and information transmission during language acquisition: Evidence from word order and case marking. Cognitive Science, 41(2):416–446.

Janet Dean Fodor. 2001. Setting syntactic parameters. The Handbook of Contemporary Syntactic Theory, pages 730–767.

A. Frank and T. F. Jaeger. 2008. Speaking rationally: Uniform information density as an optimal strategy for language production. In Proceedings of the Annual Meeting of the Cognitive Science Society, Washington, DC.

Silvia P. Gennari and Maryellen C. MacDonald. 2008. Semantic indeterminacy in object relative clauses. Journal of Memory and Language, 58(2):161–187.

Silvia P. Gennari and Maryellen C. MacDonald. 2009. Linking production and comprehension processes: The case of relative clauses. Cognition, 111(1):1–23.

Silvia P. Gennari, Jelena Mirković, and Maryellen C. MacDonald. 2012. Animacy and competition in relative clause production: A cross-linguistic investigation. Cognitive Psychology, 65(2):141–176.

Edward Gibson. 2000. Dependency locality theory: A distance-based theory of linguistic complexity. In Alec Marantz, Yasushi Miyashita, and Wayne O'Neil, editors, Image, Language, Brain: Papers from the First Mind Articulation Project Symposium. MIT Press, Cambridge, MA.

Joseph H. Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Human Language, pages 73–113. MIT Press, Cambridge, Mass.

Alvin Grissom II, Naho Orita, and Jordan Boyd-Graber. 2016. Incremental prediction of sentence-final verbs: Humans versus machines. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 95–104. Association for Computational Linguistics.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL '01, pages 1–8, Pittsburgh, Pennsylvania. Association for Computational Linguistics.

John A. Hawkins. 2004. Efficiency and Complexity in Grammars. Oxford University Press.

Florian Jaeger, Katrina Furth, and Caitlin Hilliard. 2012. Incremental phonological encoding during unscripted sentence production. Frontiers in Psychology, 3:481.

T. Florian Jaeger. 2010. Redundancy and reduction: Speakers manage information density. Cognitive Psychology, 61(1):23–62.

T. Florian Jaeger and Harold Tily. 2011. Language processing complexity and communicative efficiency. WIREs Cognitive Science, 2(3):323–335.

Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 133–142, New York, NY, USA. ACM.

Y. Kachru. 2006. Hindi. London Oriental and African Language Library. John Benjamins Publishing Company.

Ayesha Kidwai. 2000. XP-Adjunction in Universal Grammar: Scrambling and Binding in Hindi-Urdu. Oxford Studies in Comparative Syntax. Oxford University Press.

Chigusa Kurumada and T. Florian Jaeger. 2015. Communicative efficiency in language production: Optional case-marking in Japanese. Journal of Memory and Language, 83:152–178.

Sun-Hee Lee, Mineharu Nakayama, and Richard L. Lewis. 2005. Difficulty of processing Japanese and Korean center-embedding constructions. In M. Minami, H. Kobayashi, M. Nakayama, and H. Sirai, editors, Studies in Language Science, volume 4, pages 99–118. Kurosio Publishers, Tokyo.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Roger Levy and Edward Gibson. 2013. Surprisal, the PDC, and the primary locus of processing difficulty in relative clauses. Frontiers in Psychology, 4(229).

Roger P. Levy. 2018. Communicative efficiency, uniform information density, and the rational speech act theory.

Roger P. Levy and Frank Keller. 2013. Expectation and locality effects in German verb-final structures. Journal of Memory and Language, 68(2):199–222.

Richard L. Lewis. 1998. Interference in working memory: Retroactive and proactive interference in parsing. CUNY Sentence Processing Conference.

Richard L. Lewis and Mineharu Nakayama. 2001. Syntactic and positional similarity effects in the processing of Japanese embeddings. In Sentence Processing in East Asian Languages, pages 85–113, Stanford, CA. CSLI.

Pavel Logacev and Shravan Vasishth. 2012. Case matching and conflicting bindings interference. In Monique Lamers and Peter de Swart, editors, Case, Word Order and Prominence: Interacting Cues in Language Production and Comprehension, pages 187–216. Springer Netherlands, Dordrecht.

Gary Lupyan and Morten H. Christiansen. 2002. Case, word order, and language learnability: Insights from connectionist modeling. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, pages 596–601. Erlbaum.

Maryellen C. MacDonald. 2013. How language production shapes language form and comprehension. Frontiers in Psychology, 4(226):1–16. Published with commentaries in Frontiers.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Umesh Patil, Gerrit Kentner, Anja Gollrad, Frank Kügler, Caroline Féry, and Shravan Vasishth. 2008. Focus, word order and intonation in Hindi. Journal of South Asian Linguistics, 1(1):55–72.

Mark Pluymaekers, Mirjam Ernestus, and R. Harald Baayen. 2005. Lexical frequency and acoustic reduction in spoken Dutch. The Journal of the Acoustical Society of America, 118(4):2561–2569.

Rajakrishnan Rajkumar, Marten van Schijndel, Michael White, and William Schuler. 2016. Investigating locality effects and surprisal in written English syntactic choice phenomena. Cognition, 155:204–232.

Sidharth Ranjan. 2015. Investigation of locality effects in Hindi language production. Master's thesis, Indian Institute of Technology (IIT) Delhi. Unpublished thesis.

Sidharth Ranjan, Rajakrishnan Rajkumar, and Sumeet Agarwal. In preparation. Locality and surprisal effects in Hindi preverbal constituent ordering.

Edward Sapir. 1921. Language: An Introduction to the Study of Speech. Harcourt, Brace, New York.

Andreas Stolcke. 2002. SRILM — an extensible language modeling toolkit. In Proceedings of ICSLP 2002.

J. Vaid and Anshum Gupta. 2002. Exploring word recognition in a semi-alphabetic script: The case of Devanagari. Brain and Language, 81:679–690.

Julie A. Van Dyke. 2007. Interference effects from grammatically unavailable constituents during sentence processing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(2):407–430.

Julie A. Van Dyke and Brian McElree. 2006. Retrieval interference in sentence comprehension. Journal of Memory and Language, 55(2):157–166.

S. Vasishth. 2003. Working Memory in Sentence Comprehension: Processing Hindi Center Embeddings. Outstanding Dissertations in Linguistics. Taylor & Francis.

Figure 3: Lexical surprisal profiles (in bits) of normal and the caseless artificial version of Hindi (referent vs. variant, Examples 2 and 3)

Sentence type   Label   Lexical surprisal   Syntactic surprisal
Reference       1       97.45               156.64
Variant1        0       97.69               160.77
Variant2        0       98.25               156.91
Variant3        0       97.97               159.50
Variant4        0       98.16               161.94

Table 6: Original dataset

A Appendix

A.1 Joachims Transformation

Consider the first of the following Hindi sentences as the reference, corresponding to 'Jayalalitha has written a letter to the prime minister on this issue', and the remaining ones as grammatical variants expressing the same idea. Treating this as a toy dataset, Table 6 gives their lexical and syntactic surprisal feature values, whereas Table 7 represents its Joachims transformation.

Reference [jayalalitha-ne]1 [is mazle par]2 [pradhanmantri-ko]3 [ek patr]4 V ...

Variant1 [is mazle par]2 [jayalalitha-ne]1 [pradhanmantri-ko]3 [ek patr]4 V ...

Variant2 [jayalalitha-ne]1 [pradhanmantri-ko]3 [is mazle par]2 [ek patr]4 V ...

Variant3 [pradhanmantri-ko]3 [is mazle par]2 [jayalalitha-ne]1 [ek patr]4 V ...

Variant4 [is mazle par]2 [pradhanmantri-ko]3 [jayalalitha-ne]1 [ek patr]4 V ...

Figure 4: Mean syntactic surprisal per sentence of reference (3.87) and variant (3.99) sentences (95% confidence intervals indicated)

Pair                 Label   δ Lexical surprisal   δ Syntactic surprisal
Variant1-Reference   0        0.25                  4.13
Reference-Variant2   1       -0.81                 -0.28
Variant3-Reference   0        0.53                  2.87
Reference-Variant4   1       -0.72                 -5.30

Table 7: Transformed dataset

Here are the steps to transform the data set using the Joachims transformation technique.

1. An equal number of ordered pairs of type (Reference, Variant) and (Variant, Reference) were created.

2. Differences between the feature values of the elements of these ordered pairs were taken (see Table 7).

3. <Reference-Variant> pairs were labelled as 1 and <Variant-Reference> pairs were labelled as 0. Here, 1 stands for the correct choice and 0 denotes the incorrect choice.

Label        Dependency relation

Invariant syntactic relations
k1           subject/agent
k2           object/patient
k4           recipient
k7           location (elsewhere)
k7t          location (in time)
r6           genitive/possessive

Local word group (lwg)
lwg__psp     postposition
lwg__vaux    auxiliary verb

Symbols
rsym         symbol relation

Indirect dependency relations
ccof         co-ordination and sub-ordination
pof          part of units such as conjunct verbs
pof__cn      part of units such as compound nouns

Table 8: Glossary of dependency relations