
Proceedings of NAACL-HLT 2019, pages 4129–4138, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


A Structural Probe for Finding Syntax in Word Representations

John Hewitt
Stanford University

[email protected]

Christopher D. Manning
Stanford University

[email protected]

Abstract

Recent work has improved our ability to detect linguistic knowledge in word representations. However, current methods for detecting syntactic knowledge do not test whether syntax trees are represented in their entirety. In this work, we propose a structural probe, which evaluates whether syntax trees are embedded in a linear transformation of a neural network's word representation space. The probe identifies a linear transformation under which squared L2 distance encodes the distance between words in the parse tree, and one in which squared L2 norm encodes depth in the parse tree. Using our probe, we show that such transformations exist for both ELMo and BERT but not in baselines, providing evidence that entire syntax trees are embedded implicitly in deep models' vector geometry.

1 Introduction

As pretrained deep models that build contextualized representations of language continue to provide gains on NLP benchmarks, understanding what they learn is increasingly important. To this end, probing methods are designed to evaluate the extent to which representations of language encode particular knowledge of interest, like part-of-speech (Belinkov et al., 2017), morphology (Peters et al., 2018a), or sentence length (Adi et al., 2017). Such methods work by specifying a probe (Conneau et al., 2018; Hupkes et al., 2018), a supervised model for finding information in a representation.

Of particular interest, both for linguistics and for building better models, is whether deep models' representations encode syntax (Linzen, 2018). Despite recent work (Kuncoro et al., 2018; Peters et al., 2018b; Tenney et al., 2019), open questions remain as to whether deep contextual models encode entire parse trees in their word representations.

In this work, we propose a structural probe, a simple model which tests whether syntax trees are consistently embedded in a linear transformation of a neural network's word representation space. Tree structure is embedded if the transformed space has the property that squared L2 distance between two words' vectors corresponds to the number of edges between the words in the parse tree. To reconstruct edge directions, we hypothesize a linear transformation under which the squared L2 norm corresponds to the depth of the word in the parse tree. Our probe uses supervision to find the transformations under which these properties are best approximated for each model. If such transformations exist, they define inner products on the original space under which squared distances and norms encode syntax trees – even though the models being probed were never given trees as input or supervised to reconstruct them. This is a structural property of the word representation space, akin to vector offsets encoding word analogies (Mikolov et al., 2013). Using our probe, we conduct a targeted case study, showing that ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) representations embed parse trees with high consistency in contrast to baselines, and in a low-rank space.[1]

In summary, we contribute a simple structural probe for finding syntax in word representations (§2), and experiments providing insights into and examples of how a low-rank transformation recovers parse trees from ELMo and BERT representations (§3, §4). Finally, we discuss our probe and limitations in the context of recent work (§5).

[1] We release our code at https://github.com/john-hewitt/structural-probes.

2 Methods

Our goal is to design a simple method for testing whether a neural network embeds each sentence's dependency parse tree in its contextual word representations – a structural hypothesis. Under a reasonable definition, to embed a graph is to learn a vector representation of each node such that geometry in the vector space—distances and norms—approximates geometry in the graph (Hamilton et al., 2017). Intuitively, why do parse tree distances and depths matter to syntax? The distance metric—the path length between each pair of words—recovers the tree T simply by identifying that nodes u, v with distance d_T(u, v) = 1 are neighbors. The node with greater norm—depth in the tree—is the child. Beyond this identity, the distance metric explains hierarchical behavior. For example, the ability to perform the classic hierarchy test of subject-verb number agreement (Linzen et al., 2016) in the presence of "attractors" can be explained as the verb (V) being closer in the tree to its subject (S) than to any of the attractor nouns:

S ... A1 ... A2 ... V ...
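
The sketch below makes these two tree quantities concrete: it computes parse depths and pairwise parse tree distances from a dependency parse given as a list of head indices. The head-list encoding and function names are our own illustrative conventions, not taken from the paper's released code.

```python
def parse_depths(heads):
    """Parse depth ||w_i||: number of edges from word i to the root.
    `heads` lists each word's head as a 1-indexed position, with 0 = root,
    e.g. heads = [2, 3, 0] for "The chef ate"."""
    depths = []
    for i in range(1, len(heads) + 1):
        d, node = 0, i
        while heads[node - 1] != 0:
            node = heads[node - 1]
            d += 1
        depths.append(d)
    return depths

def parse_distances(heads):
    """Tree distance d_T(w_i, w_j): path length between every pair of words."""
    n = len(heads)
    def path_to_root(i):
        chain, node = [i], i
        while heads[node - 1] != 0:
            node = heads[node - 1]
            chain.append(node)
        return chain
    dist = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            pi, pj = path_to_root(i), path_to_root(j)
            lca = next(a for a in pi if a in pj)      # lowest common ancestor
            dist[i - 1][j - 1] = pi.index(lca) + pj.index(lca)
    return dist

print(parse_depths([2, 3, 0]))      # [2, 1, 0]
print(parse_distances([2, 3, 0]))   # pairs at distance 1 are tree neighbours
```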

Intuitively, if a neural network embeds parse trees, it likely will not use its entire representation space to do so, since it needs to encode many kinds of information. Our probe learns a linear transformation of a word representation space such that the transformed space embeds parse trees across all sentences. This can be interpreted as finding the part of the representation space that is used to encode syntax; equivalently, it is finding the distance on the original space that best fits the tree metrics.

2.1 The structural probe

In this section we provide a description of our proposed structural probe, first discussing the distance formulation. Let M be a model that takes in a sequence of n words w^ℓ_{1:n} and produces a sequence of vector representations h^ℓ_{1:n}, where ℓ identifies the sentence. Starting with the dot product, recall that we can define a family of inner products, h^T A h, parameterized by any positive semi-definite, symmetric matrix A ∈ S^{m×m}_+. Equivalently, we can view this as specifying a linear transformation B ∈ R^{k×m}, such that A = B^T B. The inner product is then (Bh)^T (Bh), the squared norm of h once transformed by B. Every inner product corresponds to a distance metric. Thus, our family of squared distances is defined as:

    d_B(h^ℓ_i, h^ℓ_j)^2 = (B(h^ℓ_i − h^ℓ_j))^T (B(h^ℓ_i − h^ℓ_j))    (1)

where i, j index the words in the sentence.[2] The parameters of our probe are exactly the matrix B, which we train to recreate the tree distance between all pairs of words (w^ℓ_i, w^ℓ_j) in all sentences T^ℓ in the training set of a parsed corpus. Specifically, we approximate through gradient descent:

    min_B  Σ_ℓ  (1 / |s^ℓ|^2)  Σ_{i,j}  | d_{T^ℓ}(w^ℓ_i, w^ℓ_j) − d_B(h^ℓ_i, h^ℓ_j)^2 |

where |s^ℓ| is the length of the sentence; we normalize by the square since each sentence has |s^ℓ|^2 word pairs.

[2] As noted in Eqn 1, in practice, we find that approximating the parse tree distances and norms with the squared vector distances and norms consistently performs better. Because a distance metric and its square encode exactly the same parse trees, we use the squared distance throughout this paper. Also, strictly, since A is not positive definite, the inner product is indefinite, and the distance a pseudometric. Further discussion can be found in our appendix.
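
A minimal PyTorch sketch of this distance probe and its objective follows; the class and function names are ours for illustration (the released code linked above is the authoritative implementation), and training is reduced to a single step on a dummy sentence.

```python
import torch

class DistanceProbe(torch.nn.Module):
    """Sketch of Eq. 1: squared L2 distance between word vectors after a
    learned linear map B. Illustrative names, not the released code."""
    def __init__(self, model_dim, probe_rank):
        super().__init__()
        self.B = torch.nn.Parameter(torch.randn(probe_rank, model_dim) * 0.01)

    def forward(self, h):                         # h: (seq_len, model_dim)
        th = h @ self.B.T                         # transformed vectors Bh_i
        diff = th.unsqueeze(1) - th.unsqueeze(0)  # (seq_len, seq_len, rank)
        return (diff ** 2).sum(-1)                # d_B(h_i, h_j)^2 for all pairs

def distance_probe_loss(pred, gold_tree_distances):
    """L1 loss between predicted squared distances and gold tree distances,
    normalized by the squared sentence length, as in the objective above."""
    n = gold_tree_distances.shape[0]
    return torch.abs(pred - gold_tree_distances).sum() / (n ** 2)

# One gradient step on a dummy sentence (random vectors, random "tree" distances):
h = torch.randn(12, 1024)                          # e.g. one ELMo layer
gold = torch.randint(0, 6, (12, 12)).float()
probe = DistanceProbe(model_dim=1024, probe_rank=1024)
opt = torch.optim.Adam(probe.parameters(), lr=0.001)
opt.zero_grad()
distance_probe_loss(probe(h), gold).backward()
opt.step()
```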

2.2 Properties of the structural probe

Because our structural probe defines a valid distance metric, we get a few nice properties for free. The simplest is that distances are guaranteed non-negative and symmetric, which fits our probing task. Perhaps most importantly, the probe tests the concrete claim that there exists an inner product on the representation space whose squared distance—a global property of the space—encodes syntax tree distance. This means that the model not only encodes which word is governed by which other word, but each word's proximity to every other word in the syntax tree.[3] This is a claim about the structure of the representation space, akin to the claim that analogies are encoded as vector-offsets in uncontextualized word embeddings (Mikolov et al., 2013). One benefit of this is the ability to query the nature of this structure: for example, the dimensionality of the transformed space (§4.1).

[3] Probing for distance instead of headedness also helps avoid somewhat arbitrary decisions regarding PP headedness, the DP hypothesis, and auxiliaries, letting the representation "disagree" on these while still encoding roughly the same global structure. See Section 5 for more discussion.

                 Distance          Depth
  Method         UUAS    DSpr.     Root%   NSpr.
  LINEAR         48.9    0.58       2.9    0.27
  ELMO0          26.8    0.44      54.3    0.56
  DECAY0         51.7    0.61      54.3    0.56
  PROJ0          59.8    0.73      64.4    0.75
  ELMO1          77.0    0.83      86.5    0.87
  BERTBASE7      79.8    0.85      88.0    0.87
  BERTLARGE15    82.5    0.86      89.4    0.88
  BERTLARGE16    81.7    0.87      90.1    0.89

Table 1: Results of structural probes on the PTB WSJ test set; baselines in the top half, models hypothesized to encode syntax in the bottom half. For the distance probes, we show the Undirected Unlabeled Attachment Score (UUAS) as well as the average Spearman correlation of true to predicted distances, DSpr. For the norm probes, we show the root prediction accuracy and the average Spearman correlation of true to predicted norms, NSpr.

Figure 1: Parse distance UUAS and distance Spearman correlation across the BERT and ELMo model layers.

2.3 Tree depth structural probes

The second tree property we consider is the parse depth ‖w_i‖ of a word w_i, defined as the number of edges in the parse tree between w_i and the root of the tree. This property is naturally represented as a norm – it imposes a total order on the words in the sentence. We wish to probe to see if there exists a squared norm on the word representation space that encodes this tree norm. We replace the vector distance function d_B(h_i, h_j) with the squared vector norm ‖h_i‖^2_B, replacing Equation 1 with ‖h_i‖^2_B = (Bh_i)^T (Bh_i) and training B to recreate ‖w_i‖. Like the distance probe, this norm formulation makes a concrete claim about the structure of the vector space.
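
A corresponding sketch of the depth probe, again with illustrative names and under the same assumptions as the distance-probe sketch above:

```python
import torch

class DepthProbe(torch.nn.Module):
    """Sketch of the norm probe: squared L2 norm after the linear map B,
    trained to match each word's parse depth. Illustrative names only."""
    def __init__(self, model_dim, probe_rank):
        super().__init__()
        self.B = torch.nn.Parameter(torch.randn(probe_rank, model_dim) * 0.01)

    def forward(self, h):               # h: (seq_len, model_dim)
        th = h @ self.B.T               # Bh_i
        return (th ** 2).sum(-1)        # ||Bh_i||^2, one predicted depth per word

def depth_probe_loss(pred, gold_depths):
    # L1 loss, normalized by sentence length (per the training details in A.2)
    return torch.abs(pred - gold_depths).sum() / gold_depths.shape[0]
```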

3 Experiments

Using our probe, we evaluate whether representations from ELMo and BERT, two popular English models pre-trained on language modeling-like objectives, embed parse trees according to our structural hypothesis. Unless otherwise specified, we permit the linear transformation B to be potentially full-rank (i.e., B is square). Later, we explore what rank of transformation is actually necessary for encoding syntax (§4.1).

Representation models We use the 5.5B-word pre-trained ELMo weights for all ELMo representations, and both BERT-base (cased) and BERT-large (cased). The representations we evaluate are denoted ELMOK, BERTBASEK, BERTLARGEK, where K indexes the hidden layer of the corresponding model. All ELMo and BERT-large layers are dimensionality 1024; BERT-base layers are dimensionality 768.

Data We probe models for their ability to capture the Stanford Dependencies formalism (de Marneffe et al., 2006), claiming that capturing most aspects of the formalism implies an understanding of English syntactic structure. To this end, we obtain fixed word representations for sentences of the parsing train/dev/test splits of the Penn Treebank (Marcus et al., 1993), with no pre-processing.[4]

[4] Since BERT constructs subword representations, we align subword vectors with gold Penn Treebank tokens, and assign each token the average of its subword representations. This thus represents a lower bound on BERT's performance.
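
The footnote above describes averaging BERT subword vectors into one vector per gold Penn Treebank token. A minimal sketch of that averaging step, assuming a precomputed token-to-subword index mapping; the mapping format and names are our own illustration:

```python
import numpy as np

def align_subwords_to_tokens(subword_vectors, token_to_subword_indices):
    """Assign each gold token the average of the vectors of the subwords it
    was split into. `subword_vectors` is (num_subwords, dim); the index
    mapping is assumed to come from the tokenizer."""
    return np.stack([subword_vectors[idxs].mean(axis=0)
                     for idxs in token_to_subword_indices])

# e.g. token_to_subword_indices = [[0], [1, 2], [3]] for a 3-token sentence
# whose second token was split into two wordpieces.
```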

Baselines Our baselines should encode features useful for training a parser, but not be capable of parsing themselves, to provide points of comparison against ELMo and BERT. They are as follows:

LINEAR: The tree resulting from the assumption that English parse trees form a left-to-right chain. A model that encodes the positions of words should be able to meet this baseline.

ELMO0: Strong character-level word embeddings with no contextual information. As these representations lack even position information, we should be completely unable to find syntax trees embedded.

DECAY0: Assigns each word a weighted average of all ELMO0 embeddings in the sentence. The weight assigned to each word decays exponentially as 1/2^d, where d is the linear distance between the words (a short sketch of this baseline follows the list).

PROJ0: Contextualizes the ELMO0 embeddings with a randomly initialized BiLSTM layer of dimensionality identical to ELMo (1024), a surprisingly strong baseline for contextualization (Conneau et al., 2018).
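
For concreteness, a sketch of the DECAY0 baseline under the stated 1/2^d weighting; renormalizing the weights to sum to one is our reading of "weighted average", and the function name is illustrative:

```python
import numpy as np

def decay0(elmo0_embeddings):
    """Each word is assigned a weighted average of all ELMO0 (character-level,
    non-contextual) vectors in the sentence, with the weight on a word at
    linear distance d proportional to 1 / 2**d."""
    n, _ = elmo0_embeddings.shape
    out = np.zeros_like(elmo0_embeddings)
    for i in range(n):
        weights = np.array([0.5 ** abs(i - j) for j in range(n)])
        out[i] = (weights / weights.sum()) @ elmo0_embeddings
    return out
```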

3.1 Tree distance evaluation metrics

We evaluate models on how well the predicted distances between all pairs of words reconstruct gold parse trees and correlate with the parse trees' distance metrics. To evaluate tree reconstruction, we take each test sentence's predicted parse tree distances and compute the minimum spanning tree. We evaluate the predicted tree on undirected attachment score (UUAS)—the percent of undirected edges placed correctly—against the gold tree. For distance correlation, we compute the Spearman correlation between true and predicted distances for each word in each sentence. We average these correlations between all sentences of a fixed length, and report the macro average across sentence lengths 5–50 as the "distance Spearman (DSpr.)" metric.[5]

[5] The 5–50 range is chosen to avoid simple short sentences as well as sentences so long as to be rare in the test data.

Figure 2: Minimum spanning trees resulting from predicted squared distances on BERTLARGE16 and ELMO1 compared to the best baseline, PROJ0, on the sentence "The complex financing plan in the S+L bailout law includes raising $ 30 billion from debt issued by the newly created RTC ." Black edges are the gold parse, above each sentence; blue are BERTLARGE16, red are ELMO1, and purple are PROJ0.
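
A sketch of the UUAS computation described above, using SciPy's minimum spanning tree; the head-list encoding is the same illustrative convention as earlier, and punctuation filtering is omitted for brevity:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def uuas(pred_squared_distances, gold_heads):
    """Build a minimum spanning tree over predicted squared distances and
    count recovered undirected gold edges. `gold_heads` uses 1-indexed heads
    with 0 = root; the paper additionally ignores punctuation tokens."""
    mst = minimum_spanning_tree(np.asarray(pred_squared_distances))
    rows, cols = mst.nonzero()
    pred_edges = {frozenset((i, j)) for i, j in zip(rows, cols)}
    gold_edges = {frozenset((i, h - 1))
                  for i, h in enumerate(gold_heads) if h != 0}
    return len(pred_edges & gold_edges) / len(gold_edges)
```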

3.2 Tree depth evaluation metrics

We evaluate models on their ability to recreate the order of words specified by their depth in the parse tree. We report the Spearman correlation between the true depth ordering and the predicted ordering, averaging first between sentences of the same length, and then across sentence lengths 5–50, as the "norm Spearman (NSpr.)". We also evaluate models' ability to identify the root of the sentence as the least deep, as the "root%".[6]

[6] In UUAS and "root%" evaluations, we ignore all punctuation tokens, as is standard.
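
A sketch of the depth metrics; for brevity it takes a plain mean over sentences rather than the paper's macro-average over sentence lengths 5–50, and omits punctuation filtering. Names are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def depth_metrics(pred_depths_per_sentence, gold_depths_per_sentence):
    """Per-sentence Spearman correlation between predicted and gold depths,
    and how often the shallowest predicted word is the gold root."""
    corrs, root_hits = [], []
    for pred, gold in zip(pred_depths_per_sentence, gold_depths_per_sentence):
        corrs.append(spearmanr(pred, gold).correlation)
        root_hits.append(int(np.argmin(pred) == int(np.argmin(gold))))
    return float(np.mean(corrs)), float(np.mean(root_hits))
```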

4 Results

We report the results of parse distance probes and parse depth probes in Table 1. We first confirm that our probe can't simply "learn to parse" on top of any informative representation, unlike parser-based probes (Peters et al., 2018b). In particular, ELMO0 and DECAY0 fail to substantially outperform a right-branching-tree oracle that encodes the linear sequence of words. PROJ0, which has all of the representational capacity of ELMO1 but none of the training, performs the best among the baselines. Upon inspection, we found that our probe on PROJ0 improves over the linear hypothesis with mostly simple deviations from linearity, as visualized in Figure 2.

We find surprisingly robust syntax embedded in each of ELMo and BERT according to our probes. Figure 2 shows the surprising extent to which a minimum spanning tree on predicted distances recovers the dependency parse structure in both ELMo and BERT. As we note, however, the distance metric itself is a global notion; all pairs of words are trained to know their distance – not just which word is their head; Figure 4 demonstrates the rich structure of the true parse distance metric recovered by the predicted distances. Figure 3 demonstrates the surprising extent to which the depth in the tree is encoded by vector norm after the probe transformation. Between models, we find consistently that BERTLARGE performs better than BERTBASE, which performs better than ELMO.[7] We also find, as in Peters et al. (2018b), a clear difference in syntactic information between layers; Figure 1 reports the performance of probes trained on each layer of each system.

[7] It is worthwhile to note that our hypotheses were developed while analyzing LSTM models like ELMo, and applied without modification to the self-attention based BERT models.

Figure 3: Parse tree depth according to the gold tree (black, circle) and the norm probes (squared) on ELMO1 (red, triangle) and BERTLARGE16 (blue, square).

Figure 4: (left) Matrix representing gold tree distances between all pairs of words in a sentence, whose linear order runs top-to-bottom and left-to-right. Darker colors indicate close words, lighter indicate far. (right) The same distances as embedded by BERTLARGE16 (squared). More detailed graphs are available in the Appendix.

Figure 5: Parse distance tree reconstruction accuracy when the linear transformation is constrained to varying maximum dimensionality.

4.1 Analysis of linear transformation rank

With the result that there exists syntax-encoding vector structure in both ELMo and BERT, it is natural to ask how compactly syntactic information is encoded in the vector space. We find that in both models, the effective rank of linear transformation required is surprisingly low. We train structural probes of varying k, that is, specifying a matrix B ∈ R^{k×m} such that the transformed vector Bh is in R^k. As shown in Figure 5, increasing k beyond 64 or 128 leads to no further gains in parsing accuracy. Intuitively, larger k means a more expressive probing model, and a larger fraction of the representational capacity of the model being devoted to syntax. We also note with curiosity that the three models we consider all seem to require transformations of approximately the same rank; we leave exploration of this to exciting future work.

5 Discussion & Conclusion

Recent work has analyzed model behavior to determine if a model understands hierarchy and other linguistic phenomena (Linzen, 2018; Gulordava et al., 2018; Kuncoro et al., 2018; Linzen and Leonard, 2018; van Schijndel and Linzen, 2018; Tang et al., 2018; Futrell et al., 2018). Our work extends the literature on linguistic probes, found at least in (Peters et al., 2018b; Belinkov et al., 2017; Blevins et al., 2018; Hupkes et al., 2018). Conneau et al. (2018) present a task similar to our parse depth prediction, where a sentence representation vector is asked to classify the maximum parse depth ever achieved in the sentence. Tenney et al. (2019) evaluate a complementary task to ours, training probes to learn the labels on structures when the gold structures themselves are given. Peters et al. (2018b) evaluate the extent to which constituency trees can be extracted from hidden states, but use a probe of considerable complexity, making less concrete hypotheses about how the information is encoded.

Probing tasks and limitations Our reviewers rightfully noted that one might just probe for headedness, as in a bilinear graph-based dependency parser. More broadly, a deep neural network probe of some kind is almost certain to achieve higher parsing accuracies than our method. Our task and probe construction are designed not to test for some notion of syntactic knowledge broadly construed, but instead for an extremely strict notion where all pairs of words know their syntactic distance, and this information is a global structural property of the vector space. However, this study is limited to testing that hypothesis, and we foresee future probing tasks which make other tradeoffs between probe complexity, probe task, and hypotheses tested.

In summary, through our structural probes we demonstrate that the structure of syntax trees emerges through properly defined distances and norms on two deep models' word representation spaces. Beyond this actionable insight, we suggest our probe may be useful for testing the existence of different types of graph structures on any neural representation of language, an exciting avenue for future work.

6 Acknowledgements

We would like to acknowledge Urvashi Khandelwal and Tatsunori B. Hashimoto for formative advice in early stages, Abigail See, Kevin Clark, Siva Reddy, Drew A. Hudson, and Roma Patel for helpful comments on drafts, and Percy Liang for guidance on rank experiments. We would also like to thank the reviewers, whose helpful comments led to increased clarity and extra experiments. This research was supported by a gift from Tencent.


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19. Association for Computational Linguistics.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Richard Futrell, Ethan Wilcox, Takashi Morita, and Roger Levy. 2018. RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436.

Tal Linzen. 2018. What can linguistics and deep learning contribute to each other? arXiv preprint arXiv:1809.04179.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Tal Linzen and Brian Leonard. 2018. Distinct patterns of syntactic agreement errors in recurrent networks and humans. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 692–697. Cognitive Science Society, Austin, TX.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Marten van Schijndel and Tal Linzen. 2018. Modeling garden path effects without explicit hierarchical syntax. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 2600–2605. Cognitive Science Society, Austin, TX.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

A Appendix: Implementation Details

A.1 Squared L2 distance vs. L2 distance

In Section 2.2, we note that while our distance probe specifies a distance metric, we recreate it with a squared vector distance; likewise, while our norm probe specifies a norm, we recreate it with a squared vector norm. We found this to be important for recreating the exact parse tree distances and norms. This does mean that in order to recreate the exact scalar values of the parse tree structures, we need to use the squared vector quantities. This may be problematic, since for example squared distance doesn't obey the triangle inequality, whereas a valid distance metric does.

However, we note that in terms of the graph structures encoded, distance and squared distance are identical. After training with the squared vector distance, we can square-root the predicted quantities to achieve a distance metric. The relative ordering between all pairs of words will be unchanged; the same tree is encoded either way, and none of our quantitative metrics will change; however, the exact scalar distances will differ from the true tree distances.
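
A small numerical check of this point: the minimum spanning tree depends only on the relative ordering of edge weights, so square-rooting the (dummy) squared distances below leaves the recovered tree unchanged.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Dummy symmetric "squared distances" with zero diagonal.
rng = np.random.default_rng(0)
d = rng.uniform(1.0, 10.0, size=(8, 8))
d = np.triu(d, 1) + np.triu(d, 1).T

def mst_edges(m):
    rows, cols = minimum_spanning_tree(m).nonzero()
    return {frozenset((i, j)) for i, j in zip(rows, cols)}

# Square-rooting is a monotone transformation, so the minimum spanning tree
# (the recovered parse) is identical even though the scalar values change.
assert mst_edges(d) == mst_edges(np.sqrt(d))
```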

This raises a question for future work as to why squared distance works better than distance, and beyond that, what function of the L2 distance (or perhaps, what Lp distance) would best encode tree distances. It is possibly related to the gradients of the loss with respect to the function of the distance, as well as how amenable the function is to matching the exact scalar values of the tree distances.

A.2 Probe training details

All probes are trained to minimize L1 loss of the predicted squared distance or squared norm w.r.t. the true distance or norm. Optimization is performed using the Adam optimizer (Kingma and Ba, 2014) initialized at learning rate 0.001, with β1 = 0.9, β2 = 0.999, ε = 10^−8. Probes are trained to convergence, up to 40 epochs, with a batch size of 20. For depth probes, loss is summed over all predictions in a sentence, normalized by the length of the sentence, and then summed over all sentences in a batch before a gradient step is taken. For distance probes, normalization is performed by the square of the length of the sentence. At each epoch, dev loss is computed; if the dev loss does not achieve a new minimum, the optimizer is reset (no momentum terms are kept) with an initial learning rate multiplied by 0.1. All models were implemented in both DyNet (Neubig et al., 2017) and PyTorch (Paszke et al., 2017).
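
A condensed sketch of this training recipe, with illustrative names and the batch handling simplified to one sentence per gradient step (the paper accumulates loss over batches of 20 sentences):

```python
import torch

def train_probe(probe, train_sentences, dev_sentences, max_epochs=40):
    """Sketch of the recipe above for a distance probe. Each element of the
    sentence iterables is a (representations, gold_distances) pair."""
    def sentence_loss(h, gold):
        n = gold.shape[0]
        return torch.abs(probe(h) - gold).sum() / (n ** 2)   # L1 loss, /|s|^2

    lr = 0.001
    opt = torch.optim.Adam(probe.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    best_dev = float("inf")
    for _ in range(max_epochs):
        for h, gold in train_sentences:
            opt.zero_grad()
            sentence_loss(h, gold).backward()
            opt.step()
        dev_loss = sum(sentence_loss(h, gold).item() for h, gold in dev_sentences)
        if dev_loss < best_dev:
            best_dev = dev_loss
        else:                            # no new dev minimum: reset the optimizer
            lr *= 0.1                    # with a decayed learning rate
            opt = torch.optim.Adam(probe.parameters(), lr=lr,
                                   betas=(0.9, 0.999), eps=1e-8)
    return probe
```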

B Appendix: Extra examples

In this section we provide additional examples of model behavior, including baseline model behavior, across parse distance prediction and parse depth prediction. In Figure 6 and Figure 7, we present a single sentence with dependency trees as extracted from many of our models and baselines. In Figure 8, we present tree depth predictions on a complex sentence from ELMO1, BERTLARGE16, and our baseline PROJ0. Finally, in Figure 9, we present gold parse distances and predicted squared parse distances between all pairs of words in large, high-resolution format.


Figure 6: A relatively simple sentence, and the minimum spanning trees extracted by various models (ELMO0, DECAY0, PROJ0, ELMO1, BERTBASE7, BERTLARGE16). Example sentence: "Another $ 20 billion would be raised through Treasury bonds , which pay lower interest rates ."

Figure 7: A complex sentence, and the minimum spanning trees extracted by various models (ELMO0, DECAY0, PROJ0, ELMO1, BERTBASE7, BERTLARGE16). Example sentence: "But the RTC also requires “ working ” capital to maintain the bad assets of thrifts that are sold , until the assets can be sold separately ."


Figure 8: A long sentence with gold dependency parse depths (grey) and dependency parse depths (squared) as extracted by BERTLARGE16 (blue, top), ELMO1 (red, middle), and the baseline PROJ0 (purple, bottom). Note the non-standard subject, "that he was the A's winningest pitcher".


Figure 9: The distance graphs defined by the gold parse distances on a sentence (below) and as extracted from BERTLARGE16 (above, squared).