
Wenliang Chen · Min Zhang

Semi-Supervised Dependency Parsing



Wenliang Chen
Soochow University
Suzhou, Jiangsu, China

Min Zhang
Soochow University
Suzhou, Jiangsu, China

ISBN 978-981-287-551-8        ISBN 978-981-287-552-5 (eBook)
DOI 10.1007/978-981-287-552-5

Library of Congress Control Number: 2015941148

Springer Singapore Heidelberg New York Dordrecht London
© Springer Science+Business Media Singapore 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer Science+Business Media Singapore Pte Ltd. is part of Springer Science+Business Media (www.springer.com)


Preface

Semi-supervised approaches for dependency parsing have become increasingly popular in recent years. One of the reasons for their success is that they can make use of large unlabeled data together with relatively small labeled data, and they have shown their advantages on the task of dependency parsing for many languages. A range of different semi-supervised dependency parsing approaches have been proposed in recent work, utilizing different types of information learned from unlabeled data.

The aim of this book is to give readers a comprehensive introduction to semi-supervised approaches for dependency parsing. The book is intended as a textbook for advanced undergraduate and graduate students and researchers in syntactic parsing and natural language processing. It is partly derived from our earlier publications, and we want to thank our coauthors in those publications: Hitoshi Isahara, Daisuke Kawahara, Jun'ichi Kazama, Kentaro Torisawa, Yoshimasa Tsuruoka, Kiyotaka Uchimoto, Yiou Wang, Yujie Zhang, Xiangyu Duan, Zhenghua Li, Haizhou Li, and Yue Zhang. We also want to thank the attendees of the IJCNLP 2013 and COLING 2014 tutorials on Dependency Parsing: Past, Present, and Future, presented by Zhenghua Li, Wenliang Chen, and Min Zhang. This book is also partly based on the material from those tutorials.

This book was partially supported by the National Natural Science Foundation of China (Grant Nos. 61203314, 61373095, and 61432013) and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Finally, we would like to thank our friends and colleagues from the National Institute of Information and Communications Technology (NICT, Japan), the Institute for Infocomm Research (I2R, Singapore), and the School of Computer Science and Technology, Soochow University, China, for their invaluable help. We are lucky to have worked with them and cherish our friendship. We are also grateful to our research colleagues in the research communities for their encouragement and help over the last ten years.

Suzhou, China        Wenliang Chen
March 2015           Min Zhang



Contents

1 Introduction
  1.1 Dependency Structures
  1.2 Dependency Parsing
  1.3 Supervised, Semi-supervised, and Unsupervised Parsing
  1.4 Data Sets
  1.5 Summary
  References

2 Dependency Parsing Models
  2.1 Graph-Based Models
  2.2 Transition-Based Models
  2.3 Evaluation Measures
  2.4 Performance Summary
  2.5 Summary
  References

3 Overview of Semi-supervised Dependency Parsing Approaches
  3.1 History
  3.2 Framework of Semi-supervised Dependency Parsing
  3.3 Three Levels of Approaches
  3.4 Performance Summary
  3.5 Summary
  References

4 Training with Auto-parsed Whole Trees
  4.1 Self-Training
  4.2 Co-training
  4.3 Ambiguity-Aware Ensemble Training
  4.4 Summary
  References

5 Training with Lexical Information
  5.1 An Approach Based on Word Clusters
  5.2 An Approach Based on Web-Derived Selection Preference
  5.3 Experiments
  5.4 Summary
  References

6 Training with Bilexical Dependencies
  6.1 A Case Study
  6.2 Reliable Bilexical Dependencies
  6.3 Parsing with the Information on Word Pairs
  6.4 Experiments
  6.5 Results Analysis
  6.6 Summary
  References

7 Training with Subtree Structures
  7.1 Subtrees
  7.2 Monolingual Parsing
  7.3 Bilingual Parsing
  7.4 Experiments for Monolingual Parsing
  7.5 Experiments for Bilingual Parsing
  7.6 Summary
  References

8 Training with Dependency Language Models
  8.1 Dependency Language Model
  8.2 Parsing with Dependency Language Model
  8.3 Decoding
  8.4 Bilingual Parsing
  8.5 Experiments for Monolingual Parsing
  8.6 Experiments for Bilingual Parsing
  8.7 Summary
  References

9 Training with Meta-features
  9.1 Baseline Parser
  9.2 Meta-features
  9.3 Experiments
  9.4 Summary
  References

10 Closing Remarks
  References


Chapter 1
Introduction

In this chapter, we briefly review the theoretical foundations of dependency grammar and introduce the task of dependency parsing. The dependency parsing we describe in this book is in a narrow sense, i.e., the parsing system generates a dependency tree for a given input sentence.

First, let us take a broad look at the outline of this book. Chapter 1 provides essential background for novice readers. Then, Chap. 2 introduces the widely used supervised models for dependency parsing. Chapter 3 gives an overview of semi-supervised dependency parsing approaches. In Chaps. 4–9, we introduce several existing semi-supervised approaches in detail. Chapter 10 summarizes the entire book.

Dependency parsing performs structural analysis to generate the dependency relations among the words in a sentence. Although dependency grammar has a long and venerable history, dependency parsing has only recently become prominent in natural language processing (Debusmann 2000; Nivre 2005). The increasing interest in dependency structures is driven by the properties of dependency grammar. Compared with other types of parsing, such as constituency parsing, dependency parsing offers the potential usefulness of bilexical relations for disambiguation and gains in efficiency. In addition, the dependency relations among words can be transferred at low cost across different languages. Thus, dependency parsing has great advantages in tasks of large-scale and multilingual data processing. In recent years, dependency parsing has been applied to many NLP applications such as machine translation (Ding and Palmer 2005; Nakazawa et al. 2006; Xie et al. 2014; Yu et al. 2014), information extraction (Culotta and Sorensen 2004), and question answering (Cui et al. 2005).

In recent years, several tutorials on dependency parsing have been presented at major conferences:

• Tutorial@ACL2006: Dependency Parsing by Nivre and Kubler (2006)
• Tutorial@NAACL2010: Recent Advances in Dependency Parsing by Wang and Zhang (2010)
• Tutorial@IJCNLP2013 and @COLING2014: Dependency Parsing: Past, Present, and Future by Chen et al. (2014)
• Tutorial@EACL2014: Recent Advances in Dependency Parsing by McDonald and Nivre (2014)

1.1 Dependency Structures

Traditional dependency grammar may have been first introduced in Panini's grammar of Sanskrit, long before the Common Era (Kruijff 2002), while modern dependency grammar was designed by Tesnière (1959), a French linguist.

Besides the theory of syntax structures introduced by Tesnière, there are many other formulations of dependency grammar. We will not try to list all the theories here but give a brief overview of some of them. Word Grammar (WG) (Hudson 1984) is defined over general graphs instead of trees. The word order of a dependency link is defined together with the type of dependency relation. WG uses an additional dependency named visitor between the verb and the extractee for the extraction of objects. Dependency Unification Grammar (DUG) (Hellwig 1986) is based on tree structures. The DUG theory is non-projective and uses positional features to handle word order. Functional Generative Description (FGD) (Sgall et al. 1986) uses ordering rules to map a language-independent underlying word order to the concrete surface realization over projective dependency trees. The FGD theory distinguishes five levels of representation. Meaning-Text Theory (MTT) (Melarcuk 1988) maps unordered dependency trees of syntactic representations onto the annotated lexical sequences of morphological representations via rules. MTT assumes seven levels of representation and uses global ordering rules for discontinuities. Functional Dependency Grammar (FDG) (Jarvinen and Tapanainen 1998) defines two different kinds of rules: rules for dependency and rules for surface linearization. The FDG theory is non-projective and uses nuclei, a notion from Tesnière, to represent the primitive elements.

From the above dependency grammar theories, the common observation is very simple: all but one word in a sentence depend on other words, and we call the one word that does not depend on any other the root of the sentence. We use an example below to demonstrate a typical dependency analysis of the sentence "I like it":

• I depends on like, or I is the subject of like.
• It depends on like, or it is the object of like.
• Like is the root of the sentence (it does not depend on any other word).

Alternatively, we can show this analysis as a dependency tree in Fig. 1.1.
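To make the representation concrete, the analysis above can be encoded with one head index per word. The following short Python sketch is our own illustration (not code from the book); it reserves position 0 for an artificial ROOT node, a convention also used for the formal notation later in this chapter.

words = ["ROOT", "I", "like", "it"]
heads = [0, 2, 0, 2]   # "I" -> "like", "like" -> ROOT, "it" -> "like"

# Print each dependency as "dependent depends on head".
for dep in range(1, len(words)):        # skip the artificial ROOT itself
    print(f"{words[dep]} depends on {words[heads[dep]]}")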

1.1.1 Notions of Dependency

Dependency represents the syntactic structure of a sentence as binary asymmetrical relations between the words of the sentence. The idea was first expressed by Tesnière.


Fig. 1.1 An example of a dependency tree (for the sentence "I like it")

All the above grammars describe the relations between words in sentences. Such a relation, called a dependency relation, holds between a head and a dependent. We can also use the terms governor for head and modifier for dependent. Dependency relations can be assigned predefined labels to indicate syntactic categories.

Robinson (1970) formulates four axioms for the well-formed structures of dependency as follows:

1. One and only one word is independent.
2. All others depend directly on some word.
3. No word depends directly on more than one other.
4. If A depends directly on B and some word C intervenes between them (in the linear order of the string), then C depends directly on A or B or some other intervening word.

Axioms 1–3 define the essential conditions for well-formed dependency trees. Axiom 3 states that if word A depends directly on word B, it must not depend on a third word C. This is often called the single-head requirement. Axiom 4 states the requirement of projectivity, i.e., there are no crossing edges in dependency trees. We will discuss the projective and non-projective issues later.
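As a small illustration of these constraints, the sketch below (our own helper functions, not the book's code) checks Axiom 1 (exactly one independent word) and projectivity (no crossing edges) on the head-index encoding introduced above.

def is_single_rooted(heads):
    # Axiom 1: exactly one word depends on the artificial ROOT (index 0).
    return sum(1 for h in heads[1:] if h == 0) == 1

def is_projective(heads):
    # Projectivity: no two dependency edges cross in the linear order.
    edges = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if d > 0]
    for a, (i1, j1) in enumerate(edges):
        for i2, j2 in edges[a + 1:]:
            if i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1:
                return False
    return True

print(is_single_rooted([0, 2, 0, 2]))   # True for "I like it"
print(is_projective([0, 2, 0, 2]))      # True for "I like it"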

Nivre lists some criteria for identifying a syntactic relation between a head H and a dependent D in a dependency structure C (Nivre 2005):

1. H determines the syntactic category of C and can often replace C.
2. H determines the semantic category of C; D gives semantic specification.
3. H is obligatory; D may be optional.
4. H selects D and determines whether D is obligatory or optional.
5. The form of D depends on H (agreement or government).
6. The linear position of D is specified with reference to H.

1.1.1.1 Endocentric and Exocentric

In theoretical linguistics, syntactic constructions are of two main types, endocentric and exocentric, depending on their distribution and the relation between the words. An endocentric construction consists of an obligatory head and one or more dependents, and it preserves the meaning of the head; that is, the head is functionally equivalent to the construction as a whole. Usually, noun phrases, verb phrases, and adjective phrases belong to the endocentric type because the dependents are subordinate to the head, for example, yellow duck (noun phrase) and sing songs (verb phrase). The rest of the construction, apart from the head, is optional and can be removed without losing the basic meaning.

In an exocentric construction, the head does not function like the complete construction. For example, "in rooms" is exocentric because the head "in" functions differently from the prepositional phrase as a whole. Exocentric constructions fail on criterion No. 1 above, but they may satisfy the remaining criteria. Exocentric construction is the opposite of endocentric construction.

1.1.1.2 Projective and Non-projective

The distinction between projective and non-projective dependency structures refers to whether Robinson's Axiom 4 is obeyed or not. In practice, most dependency parsing systems use projective representations, while most dependency-based linguistic theories allow non-projective representations. Some languages with free or flexible word order are hard to describe under the constraint of projective representations.

1.2 Dependency Parsing

Dependency parsing is a task that performs syntactic analysis inspired by dependency grammar. Its target is to build a dependency tree for a given input sentence (Buchholz and Marsi 2006). Dependency parsing can take two different inputs: monolingual sentences or bilingual sentence pairs. We call the former monolingual parsing and the latter bilingual parsing.

1.2.1 Monolingual Parsing

The task of monolingual parsing is to build dependency trees for given monolingual sentences. Figure 1.2 demonstrates the output tree for the input "I ate the fish with a fork."

Fig. 1.2 Example for the monolingual parsing task. Input: "I ate the fish with a fork ." Output: a dependency tree over "ROOT I ate the fish with a fork ."

An input sentence $x$ is denoted by $x = (x_0, x_1, \ldots, x_n)$, where $x_0 = \mathrm{ROOT}$ and $x_i$ refers to a word in the sentence. Using $y$ to represent a dependency tree for $x$, we write $(i, j) \in y$ if there is a dependency in $y$ from word $x_i$ to word $x_j$ ($x_i$ is the head and $x_j$ is the dependent). The parser tries to find a dependency tree $y$ for each sentence $x$.

The target of parsing algorithms for a given sentence $x$ is to find $y$,

$$y = \arg\max_{y \in T(x)} S(x, y) \qquad (1.1)$$

where $T(x)$ is the set of all the possible dependency trees that are valid for sentence $x$ (McDonald and Nivre 2007) and $S(x, y)$ is a scoring function. The scoring function has been defined in different ways in previous studies (McDonald et al. 2005; Nivre and McDonald 2008; Nivre and Scholz 2004). The details will be described in the following chapters.
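For very short sentences, Eq. (1.1) can be made concrete by brute force: enumerate every head assignment, keep those that form a valid tree, and return the highest-scoring one. The sketch below is purely illustrative (the toy arc scorer and helper names are ours, not the book's); real parsers replace the enumeration with the efficient algorithms of Chap. 2.

from itertools import product

def is_tree(heads):
    # Every word must reach ROOT (index 0) without running into a cycle.
    for d in range(1, len(heads)):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def parse(words, arc_score):
    n = len(words) - 1                      # number of real words
    best, best_tree = float("-inf"), None
    for assignment in product(range(n + 1), repeat=n):
        heads = (0,) + assignment           # position 0 is ROOT
        if any(h == d for d, h in enumerate(heads) if d > 0):
            continue                        # no word may head itself
        if not is_tree(heads):
            continue
        score = sum(arc_score(words, h, d) for d, h in enumerate(heads) if d > 0)
        if score > best:
            best, best_tree = score, heads
    return best_tree, best

def toy_score(words, h, d):
    # Toy scorer: prefer "like" as head and short arcs.
    return (2.0 if words[h] == "like" else 0.0) - 0.1 * abs(h - d)

print(parse(["ROOT", "I", "like", "it"], toy_score))   # ((0, 2, 0, 2), 3.6)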

1.2.2 Bilingual Parsing

Parsing bilingual texts (bitexts) is crucial for training machine translation systems that rely on syntactic structures on either the source side or the target side or both (Ding and Palmer 2005; Nakazawa et al. 2006). Bitexts can provide more information for parsing than commonly used monolingual texts. This information can be considered as "bilingual constraints" (Burkett and Klein 2008; Huang et al. 2009). Thus, we expect to obtain more accurate parsing trees that can be effectively used in the training of syntax-based machine translation (MT) systems (Liu and Huang 2010). This has motivated several studies aimed at highly accurate bitext parsing (Burkett and Klein 2008; Chen et al. 2010; Huang et al. 2009; Smith and Smith 2004; Zhao et al. 2009).

Given bilingual sentence pairs, the task of bilingual parsing is to build dependency trees on both sides. Figure 1.3 demonstrates the output trees for the input sentence pair "I ate the fish with a fork." and "我(wo)/用(yong)/叉子(chazi)/吃(chi)/鱼(yu)/。/", where the source sentence is in English, the target is in Chinese, and the dashed undirected links are word alignment links.

For bitext parsing, we denote an input sentence pair by $x_b = (x_s, x_t)$, where $x_s$ is the source sentence and $x_t$ is the target sentence.

The target of bilingual parsing algorithms for a given sentence pair $x_b$ is to find $y_b$,

$$y_b = \arg\max_{y_b \in T(x_b)} S(x_b, y_b) \qquad (1.2)$$

where $T(x_b)$ is the set of all the possible dependency tree pairs of $x_b$ that are valid, $y_b = (y_s, y_t)$ is the dependency tree pair for $x_s$ and $x_t$, and $S(x_b, y_b)$ is a scoring function. Usually, we can use the information of the alignment links $A_{st}$ between $x_s$ and $x_t$.

Fig. 1.3 Example for the bilingual parsing task (1): the input is the English sentence "I ate the fish with a fork ." paired with its Chinese translation and their word alignment links; the output is a dependency tree for each side, each rooted at ROOT

Fig. 1.4 Example for the bilingual parsing task (2): the input is the English sentence "ROOT I ate the fish with a fork ." together with the dependency tree of its aligned Chinese translation; the output is the dependency tree on the English side

The input of bilingual parsing can also be a sentence pair together with the dependency tree on the target side, so we can improve source-language parsing with the help of the tree on the target side. Figure 1.4 shows an example. In the example, it is difficult to determine the head of the word "with" because of the PP-attachment problem. However, on the Chinese side, it is unambiguous. Therefore, we can use the information on the Chinese side to help disambiguation on the English side.

1.3 Supervised, Semi-supervised, and Unsupervised Parsing

Dependency parsers are usually constructed by using supervised techniques (described in Chap. 2), which train the parsers using human-annotated training data (Buchholz et al. 2006; Nivre et al. 2007). However, to obtain dependency parsers with high accuracy, the supervised techniques require a large amount of annotated data, which is extremely expensive. On the other hand, we can easily obtain large-scale unannotated data such as web data and newspaper articles. The use of large-scale unannotated data in training is therefore an attractive idea for improving dependency parsing performance.

Fig. 1.5 Supervised vs. semi-supervised vs. unsupervised dependency parsing

We divide the dependency parsing systems into three types: (1) supervised parsing, which uses human-annotated data to train systems (Nivre et al. 2007; Nivre and McDonald 2008); (2) semi-supervised parsing, which uses unannotated data in addition to human-annotated data (Koo et al. 2008; Sagae and Tsujii 2007); and (3) unsupervised parsing, which uses unannotated data to infer dependency relations (Brody 2010; Headden et al. 2009; Ma and Xia 2014; Marecek and Straka 2013; Marecek and Žabokrtský 2012; Schwartz et al. 2011; Spitkovsky et al. 2011). Figure 1.5 shows the data usages of the three types of systems.

1.4 Data Sets

For data-driven dependency parsing, the labeled data sets are usually derived from treebanks. In the multilingual track of the CoNLL 2006 and 2007 shared tasks (Buchholz et al. 2006; Nivre et al. 2007), there are several treebanks for different languages, including Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, Turkish, etc.

In this book, the experiments are conducted on English and Chinese. For English, the Penn Treebank (PTB) (Marcus et al. 1993) is widely used in the previous work. The standard data split is shown in Table 1.1. "Penn2Malt"[1] is used to convert the data into dependency structures using a standard set of head rules (Yamada and Matsumoto 2003).

For Chinese, the Chinese Treebank versions 4/5 (CTB4/5)[2] are often used in the previous work. The data is also converted by the "Penn2Malt" tool. The data splits of CTB4 and CTB5 are shown in Table 1.1. For bilingual parsing, the translated portion of the Chinese Treebank V2 (CTB2tp) is often used.

[1] http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html
[2] http://www.cis.upenn.edu/~chinese/

Table 1.1 Data sets of PTB and CTB

         Train                    Dev                     Test
PTB      2–21                     22                      23
CTB4     001–270, 400–931         301–325                 271–300
CTB5     001–815, 1,001–1,136     886–931, 1,148–1,151    816–885, 1,137–1,147
CTB2tp   001–270                  301–325                 271–300

1.5 Summary

In this chapter, we have briefly introduced the theoretical foundations of dependency grammar and described the tasks of dependency parsing. There are many formulations of dependency grammar, including Word Grammar, Dependency Unification Grammar, Functional Generative Description, Meaning-Text Theory, and Functional Dependency Grammar. As for the form of dependency structures, Robinson formulates four axioms, and Nivre defines some criteria to identify syntactic relations between heads and dependents. There are two types of dependency parsing: monolingual parsing and bilingual parsing. According to the data usages, we divide the related work into three categories: supervised, semi-supervised, and unsupervised parsing.

References

Brody, S. (2010). It depends on the translation: Unsupervised dependency parsing via word alignment. In Proceedings of the 2010 conference on empirical methods in natural language processing (pp. 1214–1222). Cambridge: Association for Computational Linguistics. http://www.aclweb.org/anthology/D10-1118.

Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X. SIGNLL. Stroudsburg: Association for Computational Linguistics.

Buchholz, S., Marsi, E., Dubey, A., & Krymolowski, Y. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, New York.

Burkett, D., & Klein, D. (2008). Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP 2008 (pp. 877–886). Honolulu: Association for Computational Linguistics.

Chen, W., Kazama, J., & Torisawa, K. (2010). Bitext dependency parsing with bilingual subtree constraints. In Proceedings of ACL 2010 (pp. 21–29). Uppsala: Association for Computational Linguistics.

Chen, W., Li, Z., & Zhang, M. (2014). Dependency parsing: Past, present, and future. In Proceedings of COLING 2014 (Tutorial) (pp. 14–16). Dublin: Dublin City University and Association for Computational Linguistics.

Cui, H., Sun, R., Li, K., Kan, M., & Chua, T. (2005). Question answering passage retrieval using dependency relations. In Proceedings of SIGIR 2005 (pp. 400–407). New York: ACM. doi:http://doi.acm.org/10.1145/1076034.1076103.

Culotta, A., & Sorensen, J. (2004). Dependency tree kernels for relation extraction. In Proceedings of ACL 2004, Barcelona (pp. 423–429).

Debusmann, R. (2000). An introduction to dependency grammar. Hausarbeit für das Hauptseminar Dependenzgrammatik SoSe, 99, 1–16.

Ding, Y., & Palmer, M. (2005). Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of ACL 2005 (pp. 541–548). Morristown: Association for Computational Linguistics. doi:http://dx.doi.org/10.3115/1219840.1219907.

Headden III, W. P., Johnson, M., & McClosky, D. (2009). Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 101–109). Stroudsburg: Association for Computational Linguistics.

Hellwig, P. (1986). Dependency unification grammar. In Proceedings of the 11th conference on computational linguistics (pp. 195–198). Stroudsburg: Association for Computational Linguistics.

Huang, L., Jiang, W., & Liu, Q. (2009). Bilingually-constrained (monolingual) shift-reduce parsing. In Proceedings of EMNLP 2009 (pp. 1222–1231). Singapore: Association for Computational Linguistics.

Hudson, R. (1984). Word grammar. Oxford/New York: Blackwell.

Jarvinen, T., & Tapanainen, P. (1998). Towards an implementable dependency grammar. In Proceedings of the workshop on processing of dependency-based grammars (Vol. 10). Stroudsburg: Association for Computational Linguistics.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Kruijff, G. J. M. (2002). Formal and computational aspects of dependency grammar. Lecture notes for ESSLLI-2002. http://www.infoamerica.org/documentos_pdf/bar03.pdf

Liu, Y., & Huang, L. (2010). Tree-based and forest-based translation. In Tutorial abstracts of ACL 2010 (p. 2). Uppsala: Association for Computational Linguistics.

Ma, X., & Xia, F. (2014). Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (Volume 1: Long papers, pp. 1337–1348). Baltimore: Association for Computational Linguistics. http://www.aclweb.org/anthology/P14-1126.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

Marecek, D., & Straka, M. (2013). Stop-probability estimates computed on a large corpus improve unsupervised dependency parsing. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics (Volume 1: Long papers, pp. 281–290). Sofia: Association for Computational Linguistics. http://www.aclweb.org/anthology/P13-1028.

Marecek, D., & Žabokrtský, Z. (2012). Exploiting reducibility in unsupervised dependency parsing. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 297–307). Jeju Island: Association for Computational Linguistics. http://www.aclweb.org/anthology/D12-1028.

McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005 (pp. 91–98). Stroudsburg: Association for Computational Linguistics.

McDonald, R., & Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. In Proceedings of EMNLP-CoNLL, Prague (pp. 122–131).

McDonald, R., & Nivre, J. (2014). Recent advances in dependency parsing. In Proceedings of EACL 2014, Gothenburg (Tutorial).

Melarcuk, I. A. (1988). Dependency syntax: Theory and practice. Albany: SUNY Press.

Nakazawa, T., Yu, K., Kawahara, D., & Kurohashi, S. (2006). Example-based machine translation based on deeper NLP. In Proceedings of IWSLT 2006, Kyoto (pp. 64–70).

Nivre, J. (2005). Dependency grammar and dependency parsing. MSI report 5133(1959), 1–32.

Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., & Yuret, D. (2007). The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 915–932).

Nivre, J., & Kubler, S. (2006). Dependency parsing: Tutorial at COLING-ACL 2006. In COLING-ACL, Sydney.

Nivre, J., & McDonald, R. (2008). Integrating graph-based and transition-based dependency parsers. In Proceedings of ACL-08: HLT, Columbus.

Nivre, J., & Scholz, M. (2004). Deterministic dependency parsing of English text. In Proceedings of the 20th international conference on computational linguistics (COLING), Geneva (pp. 64–70).

Robinson, J. J. (1970). Dependency structures and transformational rules. Language, 46, 259–285.

Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 1044–1050).

Schwartz, R., Abend, O., Reichart, R., & Rappoport, A. (2011). Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies (pp. 663–672). Portland: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1067.

Sgall, P., Hajicová, E., & Panevová, J. (1986). The meaning of the sentence in its semantic and pragmatic aspects. Prague: Academia.

Smith, D. A., & Smith, N. A. (2004). Bilingual parsing with factored estimation: Using English to parse Korean. In Proceedings of EMNLP 2004, Barcelona (pp. 49–56).

Spitkovsky, V. I., Alshawi, H., Chang, A. X., & Jurafsky, D. (2011). Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 1281–1290). Edinburgh: Association for Computational Linguistics. http://www.aclweb.org/anthology/D11-1118.

Tesnière, L. (1959). Eléments de syntaxe structurale. Librairie C. Klincksieck.

Wang, Q. I., & Zhang, Y. (2010). Recent advances in dependency parsing. In NAACL HLT 2010 tutorial abstracts (pp. 7–8). Los Angeles: Association for Computational Linguistics.

Xie, J., Xu, J., & Liu, Q. (2014). Augment dependency-to-string translation with fixed and floating structures. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers (pp. 2217–2226). Dublin: Dublin City University and Association for Computational Linguistics. http://www.aclweb.org/anthology/C14-1209.

Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy (pp. 195–206).

Yu, H., Wu, X., Xie, J., Jiang, W., Liu, Q., & Lin, S. (2014). RED: A reference dependency based MT evaluation metric. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers (pp. 2042–2051). Dublin: Dublin City University and Association for Computational Linguistics. http://www.aclweb.org/anthology/C14-1193.

Zhao, H., Song, Y., Kit, C., & Zhou, G. (2009). Cross language dependency parsing using a bilingual lexicon. In Proceedings of ACL-IJCNLP 2009 (pp. 55–63). Suntec: Association for Computational Linguistics.


Chapter 2
Dependency Parsing Models

In this chapter, we describe the data-driven supervised dependency parsing models and then summarize the recently reported performance of previous work on the English Penn Treebank, a widely used data set.

For data-driven dependency parsing, there are two major parsing models (Nivre and McDonald 2008): the graph-based model (Carreras 2007; McDonald et al. 2005) and the transition-based model (Nivre 2003; Yamada and Matsumoto 2003), which have achieved state-of-the-art accuracy for a wide range of languages, as shown in recent CoNLL shared tasks (Buchholz et al. 2006; Nivre et al. 2007). Nivre and McDonald (2008) compare the differences between these two models. The graph-based model uses exhaustive inference and local features (Carreras 2007; Ma and Zhao 2012), while the transition-based model uses greedy inference and rich features based on the decision history (Noji and Miyao 2014; Zhang and Nivre 2011).

2.1 Graph-Based Models

In recent years, several researchers have designed different learning and decoding algorithms for graph-based parsing models (Carreras 2007; McDonald et al. 2005; McDonald and Pereira 2006). In graph-based models, dependency parsing is treated as a structured prediction problem in which the graphs are usually represented as factored structures. The information of the factored structures decides the features that the models can utilize. Several previous studies exploit high-order features that lead to significant improvements. McDonald et al. (2005) and Covington (2001) develop models that represent first-order features over a single arc in graphs. By extending the first-order model, McDonald and Pereira (2006) and Carreras (2007) exploit second-order features over two adjacent arcs in second-order models. Koo and Collins (2010) further propose a third-order model that uses third-order features, while Ma and Zhao (2012) use fourth-order features in their system. These models utilize higher-order feature representations and achieve better performance than the first-order models.

An input sentence $x$ is denoted by $x = (x_0, x_1, \ldots, x_n)$, where $x_0 = \mathrm{ROOT}$ and $x_i$ refers to a word in the sentence. Using $y$ to represent a dependency tree for $x$, we write $(i, j) \in y$ if there is a dependency in $y$ from word $x_i$ to word $x_j$ ($x_i$ is the head and $x_j$ is the dependent). A graph is denoted by $G_x$, which consists of a set of nodes $V_x = \{x_0, x_1, \ldots, x_i, \ldots, x_n\}$ and a set of arcs (edges) $E_x = \{(i, j) : i \neq j,\ x_i \in V_x,\ x_j \in V_x \setminus \{x_0\}\}$, where the nodes in $V_x$ are the words in $x$. Let $T(G_x)$ be the set of all the subgraphs of $G_x$ that are valid dependency graphs (McDonald and Nivre 2007) for sentence $x$.

The score of a dependency graph $y \in T(G_x)$ is computed as the sum of its arc scores provided by the scoring function $S$,

$$S(x, y) = \sum_{g \in y} s(\mathbf{w}, x, g) \qquad (2.1)$$

where $g$ is a spanning subgraph of $y$. Then $y$ is represented as a set of factors, and each factor is scored using a weight vector $\mathbf{w}$. $\mathbf{w}$ contains the weights for the features and is learned during training by using the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer 2003; McDonald and Pereira 2006).
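The book's parsers learn $\mathbf{w}$ with MIRA; as a simpler stand-in that conveys the same idea, the sketch below applies a structured-perceptron-style update, shifting weight from the features of a wrongly predicted tree onto the features of the gold tree. The function and feature strings are our own illustration, not the book's implementation.

def perceptron_update(w, gold_feats, pred_feats, lr=1.0):
    # Reward features of the gold tree, penalize features of the prediction.
    for f in gold_feats:
        w[f] = w.get(f, 0.0) + lr
    for f in pred_feats:
        w[f] = w.get(f, 0.0) - lr
    return w

w = perceptron_update({}, ["h-pos=VBD|d-pos=NN|R"], ["h-pos=NN|d-pos=NN|R"])
print(w)   # {'h-pos=VBD|d-pos=NN|R': 1.0, 'h-pos=NN|d-pos=NN|R': -1.0}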

The task of parsing algorithms for a given sentence $x$ is to find $y$,

$$y = \arg\max_{y \in T(G_x)} S(x, y) = \arg\max_{y \in T(G_x)} \sum_{g \in y} s(\mathbf{w}, x, g) \qquad (2.2)$$

The problem is equivalent to finding a maximum spanning tree (MST), that is, the highest-scoring tree in $T(G_x)$. Among MST parsing models, three are widely used: the first-order, second-order, and third-order models.

2.1.1 First-Order Model

In a first-order model, $g$ is a single edge. Then the scoring function $S_1$ is as follows:

$$S_1(x, y) = \sum_{g \in y} s_1(\mathbf{w}, x, g) \qquad (2.3)$$

The first-order features for the first-order model are defined through a feature function that corresponds to a single dependency, i.e., $f_1(x, h, d)$, where $h$ and $d$ are the head and dependent of the dependency $(h, d)$, respectively. Figure 2.1 shows the relations between $h$ and $d$.

Fig. 2.1 Relations in the first-order model (head $h$ and dependent $d$)

Fig. 2.2 Relations in the second-order model (head $h$, sibling $c_h$, dependent $d$, and grandchildren $c_{di}$, $c_{do}$)

We should note that $f_1(x, h, d)$ can include arbitrary features on the edge $(h, d)$ and the input sentence $x$. Then $s_1$ is represented as follows:

$$s_1(\mathbf{w}, x, g) = f_1(x, h, d) \cdot \mathbf{w}_1 \qquad (2.4)$$

where $\mathbf{w}_1$ is a weight vector.
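A minimal sketch of Eq. (2.4) (our own toy code, not the book's feature extractor): $f_1$ returns a sparse list of feature strings for the arc $(h, d)$, and $s_1$ is the sum of their weights in a weight dictionary standing in for $\mathbf{w}_1$. The particular feature strings are illustrative; the full template set is given later in Table 2.1.

def f1(words, tags, h, d):
    direction = "R" if h < d else "L"
    return [
        f"h-word={words[h]}|d-word={words[d]}|{direction}",
        f"h-pos={tags[h]}|d-pos={tags[d]}|{direction}",
        f"h-word={words[h]}|h-pos={tags[h]}",
        f"d-word={words[d]}|d-pos={tags[d]}",
    ]

def s1(w1, words, tags, h, d):
    # Dot product of the sparse feature vector with the weight vector w1.
    return sum(w1.get(feat, 0.0) for feat in f1(words, tags, h, d))

w1 = {"h-pos=VBD|d-pos=NN|R": 1.5, "h-word=ate|d-word=fish|R": 0.8}
words = ["ROOT", "I", "ate", "the", "fish"]
tags = ["ROOT", "PRP", "VBD", "DT", "NN"]
print(s1(w1, words, tags, 2, 4))   # 2.3, the score of the arc ate -> fish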

2.1.2 Second-Order Model

In the second-order model, the features can be defined over two adjacent edges. There are several types of three-node subgraphs, with different levels of computational cost and difficulty in their implementation. For the second-order models, we use two types of features that represent a parent-sibling relation and a parent-child-grandchild relation (Carreras 2007; Johansson and Nugues 2008; McDonald 2006; McDonald and Pereira 2006). The parent-sibling relation is defined over the head, the dependent, and the dependent's sibling. The parent-child-grandchild relation is defined over the head, the dependent, and the children of the dependent. Carreras (2007) and Johansson and Nugues (2008) considered both types of relations, while McDonald (2006) implemented the parent-sibling relation.

The features of the second-order model are defined through a feature function that is represented by $f_2(x, h, d, c)$, where $c$ is one of $\{c_h, c_{di}, c_{do}\}$: $c_h$ is the closest sibling of $d$ inside $[h \ldots d]$, $c_{di}$ is the furthest child of $d$ inside $[h \ldots d]$, and $c_{do}$ is the furthest child of $d$ outside $[h \ldots d]$. We call these second-order features. Figure 2.2 shows the relations of the tokens in the second-order model. In the following content, we use $(h, d, c)$ to denote two adjacent edges (parent-sibling and parent-child-grandchild structures). The scoring function $S_2$ of the second-order model is as follows:

$$S_2(x, y) = \sum_{g \in y} s_2(\mathbf{w}, x, g) \qquad (2.5)$$

where $s_2$ is represented as follows:

$$s_2(\mathbf{w}, x, g) = s_{c1}(h, d) + s_{c2}(h, d, c) \qquad (2.6)$$
$$= s_{c1}(h, d) + s_{ch}(h, d, c_h) + s_{cdi}(h, d, c_{di}) + s_{cdo}(h, d, c_{do}) \qquad (2.7)$$

$$s_{c1}(h, d) = f_1(x, h, d) \cdot \mathbf{w}_1$$
$$s_{ch}(h, d, c_h) = f_2(x, h, d, c_h) \cdot \mathbf{w}_h$$
$$s_{cdi}(h, d, c_{di}) = f_2(x, h, d, c_{di}) \cdot \mathbf{w}_{di}$$
$$s_{cdo}(h, d, c_{do}) = f_2(x, h, d, c_{do}) \cdot \mathbf{w}_{do}$$

where $s_{c1}$ is the function for the first-order features; $s_{ch}$, $s_{cdi}$, and $s_{cdo}$ are the functions for the second-order features of $c_h$, $c_{di}$, and $c_{do}$, respectively; $\mathbf{w}_1$ is as in the first-order model; and $\mathbf{w}_h$, $\mathbf{w}_{di}$, and $\mathbf{w}_{do}$ are the weight vectors that correspond, respectively, to each of the adjacent dependencies.
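The decomposition in Eq. (2.7) can be sketched directly in code (ours, not the book's): the part score adds the sibling and grandchild contributions, each a dot product of second-order features with its own weight vector, to the first-order arc score. A missing sibling or grandchild ("−" in the text) is represented here by None.

def f2(tags, h, d, c, kind):
    c_pos = tags[c] if c is not None else "NULL"   # "-" in the text
    return [f"{kind}|h-pos={tags[h]}|d-pos={tags[d]}|c-pos={c_pos}",
            f"{kind}|d-pos={tags[d]}|c-pos={c_pos}"]

def dot(w, feats):
    return sum(w.get(f, 0.0) for f in feats)

def s2(w, tags, s1_score, h, d, ch, cdi, cdo):
    return (s1_score
            + dot(w["h"],  f2(tags, h, d, ch,  "sib"))     # closest sibling
            + dot(w["di"], f2(tags, h, d, cdi, "gc-in"))   # grandchild inside [h..d]
            + dot(w["do"], f2(tags, h, d, cdo, "gc-out"))) # grandchild outside [h..d]

tags = ["ROOT", "PRP", "VBD", "DT", "NN", "IN"]
w = {"h": {"sib|h-pos=VBD|d-pos=NN|c-pos=NULL": 0.3}, "di": {}, "do": {}}
print(s2(w, tags, 2.3, h=2, d=4, ch=None, cdi=None, cdo=None))   # 2.6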

2.1.3 Third-Order Model

In the third-order model, $g$ is defined over three adjacent edges. Koo and Collins (2010) define two types of third-order structures: grand-sibling and tri-sibling. Figure 2.3a, b show the relations of grand-sibling and tri-sibling, respectively, where $s$ is the closest sibling of $d$ inside $[h \ldots d]$, $t$ is the closest sibling of $s$ inside $[h \ldots s]$, and $g$ is the head of $h$. The scoring function $S_3$ of the third-order model is as follows:

$$S_3(x, y) = \sum_{g \in y} s_3(\mathbf{w}, x, g) \qquad (2.8)$$

where $s_3$ is represented as follows:

$$s_3(\mathbf{w}, x, g) = s_{c1}(h, d) + s_{c2}(h, d, c) + s_{c3g}(g, h, s, d) + s_{c3t}(h, t, s, d) \qquad (2.9)$$

Fig. 2.3 Relations in the third-order model: (a) grand-sibling ($g$, $h$, $s$, $d$); (b) tri-sibling ($h$, $t$, $s$, $d$)


$$s_{c3g}(g, h, s, d) = f_3(x, g, h, s, d) \cdot \mathbf{w}_{gsib}$$
$$s_{c3t}(h, t, s, d) = f_3(x, h, t, s, d) \cdot \mathbf{w}_{tsib}$$

where $s_{c3g}$ and $s_{c3t}$ are the functions for the third-order features grand-sibling and tri-sibling, respectively, and $\mathbf{w}_{gsib}$ and $\mathbf{w}_{tsib}$ are the weight vectors that correspond, respectively, to these features.

Ma and Zhao (2012) extend this to a fourth-order model by considering grand-tri-sibling structures, which are defined over four adjacent edges.

2.1.4 Parsing Algorithm

Algorithm 1 Pseudo-code of the second-order parsing algorithm
 1: Initialization: C[s][s][d][s] = 0.0, O[s][s][d] = 0.0  ∀ s, d
 2: for k = 1 to n do
 3:   for s = 0 to n do
 4:     t = s + k
 5:     if t > n then break
 6:     % Create incomplete items
 7:     % Left direction
 8:     O[s][t][←] = sc1(t, s) + max_{s ≤ r < t} {vlsib + vlgc}
 9:     % Right direction
10:     O[s][t][→] = sc1(s, t) + max_{s ≤ r < t} {vrsib + vrgc}
11:     % Create complete items
12:     for r = s to t do
13:       if r ≠ t then
14:         if r = s then
15:           C[s][t][←][r] = O[r][t][←] + C[s][r][←][r] + scdo(t, r, −)
16:         else
17:           C[s][t][←][r] = O[r][t][←] + max_{s ≤ m < r} {C[s][r][←][m] + scdo(t, r, m)}
18:         end if
19:       end if
20:       if r ≠ s then
21:         if r = t then
22:           C[s][t][→][r] = O[s][r][→] + C[r][t][→][t] + scdo(s, r, −)
23:         else
24:           C[s][t][→][r] = O[s][r][→] + max_{r < m ≤ t} {C[r][t][→][m] + scdo(s, r, m)}
25:         end if
26:       end if
27:     end for
28:   end for
29: end for
30: Return max_{0 < r ≤ n} C[0][n][→][r]

Graph-based parsing algorithms are usually extended from the parsing algorithm of Eisner (1996), which is a modified version of the CKY chart parsing algorithm. For the first-order, second-order, third-order, and fourth-order models, different parsing algorithms are proposed in McDonald et al. (2005), Carreras (2007), McDonald (2006), Koo and Collins (2010), and Ma and Zhao (2012). We use the second-order algorithm of Carreras (2007) as an example to explain the idea of the procedure.

The CKY-style parsing algorithms independently parse the left and right dependents of a word and combine them later (McDonald 2006). There are two types of chart items (McDonald and Pereira 2006): (1) a complete item, in which the words are unable to accept more dependents in a certain direction, and (2) an incomplete item, in which the words can accept more dependents in a certain direction. In the following figures, complete items are represented by triangles and incomplete items are represented by trapezoids. In the CKY-style algorithms, we create both types of chart items with two directions for all the word pairs in a given sentence. The direction of a dependency is from the head to the dependent. The right (left) direction indicates the dependent is on the right (left) side of the head. Larger chart items are created from pairs of smaller ones in a bottom-up style. Figure 2.4 illustrates the cubic parsing actions of the CKY-style parsing algorithm (Eisner 1996) in the right direction, where s, r, and t refer to the start and end indices of the chart items. In Fig. 2.4a, all the items on the left side are complete, and the algorithm creates the incomplete item (trapezoid on the right side) of s–t, where the triangle of s–r and the triangle of r+1–t are complete items. In Fig. 2.4b, the item of s–r is incomplete and the item of r–t is complete. Then the algorithm creates the complete item of s–t. In Fig. 2.4, the longer vertical edge in a triangle or a trapezoid corresponds to the head word of the spanning subgraph. For example, s is the head word of the span s–t in Fig. 2.4a. For the left direction case, the actions are similar.

Once the parser has considered the dependency relations between words of distance 1, it goes on to dependency relations between words of distance 2, and so on. For words of distance 2 and greater, it considers every possible partition of the structures into two parts and chooses the one with the highest score for each direction. We store the obtained chart items in a table. A chart item includes its own information of the optimal splitting point. Thus, by looking up the table, we can obtain the best tree structure (with the highest score) of any chart item.

Fig. 2.4 Cubic parsing actions of Eisner (1996): (a) the complete items s–r and r+1–t are combined into the incomplete item s–t; (b) the incomplete item s–r and the complete item r–t are combined into the complete item s–t


The pseudo-code for the second-order parsing algorithm is given in Algorithm 1. For simplicity, this algorithm calculates only scores, but the actual algorithm also stores the corresponding dependency structures. We let C[s][t][d][r] be a table that stores the score of the best complete item from position s to position t, s ≤ t, with direction d and splitting position r, and let O[s][t][d] be a table that stores the score of the best incomplete item from position s to position t, s ≤ t, with direction d. Here d indicates the direction (← or →) of the dependency, and "−" in the score functions indicates that the dependent does not have a sibling node or grandchild.

In line 8 of the algorithm,

$$v_{lsib} = \begin{cases} C[r+1][t][\leftarrow][t] + s_{ch}(t, s, -), & \text{if } r+1 = t \\ \max_{r<m<t} \{C[r+1][t][\leftarrow][m] + s_{ch}(t, s, m)\}, & \text{if } r+1 \neq t \end{cases}$$

and

$$v_{lgc} = \begin{cases} C[s][r][\rightarrow][r] + s_{cdi}(t, s, -), & \text{if } s = r \\ \max_{s<m\leq r} \{C[s][r][\rightarrow][m] + s_{cdi}(t, s, m)\}, & \text{if } s \neq r \end{cases}$$

In line 10,

$$v_{rgc} = \begin{cases} C[r+1][t][\leftarrow][t] + s_{cdi}(s, t, -), & \text{if } r+1 = t \\ \max_{r<m<t} \{C[r+1][t][\leftarrow][m] + s_{cdi}(s, t, m)\}, & \text{if } r+1 \neq t \end{cases}$$

and

$$v_{rsib} = \begin{cases} C[s][r][\rightarrow][r] + s_{ch}(s, t, -), & \text{if } s = r \\ \max_{s<m\leq r} \{C[s][r][\rightarrow][m] + s_{ch}(s, t, m)\}, & \text{if } s \neq r \end{cases}$$

In the algorithm, we create two types of items: incomplete and complete. Incomplete items are created by lines 8 and 10. Here, we explain the right direction case. Graphically, the intuition behind line 10 is given in Fig. 2.5, where $m_s$ is the furthest right child of $s$ in the complete item (s–r) and $m_t$ is the furthest left child of $t$ in the complete item (r+1–t). Then the algorithm creates an incomplete item (s–t) with $s$ as the head of $t$ by looking for the optimal splitting point $r$, sibling $c_h$, and grandchild $c_{di}$ between $s$ (head) and $t$ (dependent). $v_{rsib}$ indicates that the algorithm looks for the best sibling node (McDonald 2006), and $v_{rgc}$ indicates that it searches for the best grandchild node (Carreras 2007). It searches for the best sibling node and the best grandchild node independently because our features do not consider interactions between them.

Fig. 2.5 An incomplete item (s = h, t = d; $m_s$ = $c_h$ is the furthest right child of s, and $m_t$ = $c_{di}$ is the furthest left child of t)

Fig. 2.6 A complete item (s = h, r = d; m = $c_{do}$ is the furthest right child of r)

The complete items are created by lines 13–19 and 20–26. Here, we explain the right direction case. Graphically, the intuition behind lines 20–26 is given in Fig. 2.6, where $m$ is the furthest right child of $r$. The algorithm creates complete items by looking for the best grandchild $c_{do}$ for each splitting point $r$ (Carreras 2007).

Compared with the second-order parsing algorithm of McDonald (2006), Algorithm 1 can support the parent-child-grandchild structure. For creating the incomplete items (lines 8 and 10), Algorithm 1 searches for the best grandchild node ($c_{di}$) and the best sibling node ($c_h$), while the algorithm of McDonald (2006) only searches for the best sibling node. For creating the complete items (lines 13–19 and 20–26), Algorithm 1 searches for the best grandchild node ($c_{do}$), while the algorithm of McDonald (2006) does not. The computational cost of Algorithm 1 is $O(n^4)$, while the cost of the algorithm of McDonald (2006) is $O(n^3)$. However, in practice, the parsing time of Algorithm 1 is only three times slower than that of the algorithm of McDonald (2006) for parsing sentences whose average length is 25 words.
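To illustrate the chart structure in ordinary code, the sketch below is a much simpler first-order Eisner-style variant (our own simplification, not Algorithm 1): it keeps the same complete/incomplete items in both directions but scores single arcs only, so the sibling and grandchild terms disappear. Like Algorithm 1, it returns only the best score rather than the tree; its cost is $O(n^3)$.

def eisner_first_order(n, arc_score):
    NEG = float("-inf")
    L, R = 0, 1                                   # directions: 0 = "<-", 1 = "->"
    # C[s][t][d]: best complete item over s..t; O[s][t][d]: best incomplete item.
    C = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    O = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    for s in range(n + 1):
        C[s][s][L] = C[s][s][R] = 0.0
    for k in range(1, n + 1):                     # span length, short to long
        for s in range(0, n + 1 - k):
            t = s + k
            # Incomplete items: two adjacent complete items plus one new arc.
            best = max(C[s][r][R] + C[r + 1][t][L] for r in range(s, t))
            O[s][t][L] = best + arc_score(t, s)   # head t, dependent s
            O[s][t][R] = best + arc_score(s, t)   # head s, dependent t
            # Complete items: one incomplete item extended by a complete item.
            C[s][t][L] = max(C[s][r][L] + O[r][t][L] for r in range(s, t))
            C[s][t][R] = max(O[s][r][R] + C[r][t][R] for r in range(s + 1, t + 1))
    return C[0][n][R]                             # best projective tree rooted at 0

# Toy usage: 3 words after ROOT; the best tree is ROOT->2, 2->1, 2->3 (score 5.0).
scores = {(0, 2): 2.0, (2, 1): 1.5, (2, 3): 1.5}
print(eisner_first_order(3, lambda h, d: scores.get((h, d), -1.0)))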

2.1.5 Feature Templates

For the first-order model, the feature templates are defined over each edge (h, d),where h and d are the head and dependent, respectively. These feature templatesconsider the surface form of the head and dependent words, their part-of-speech(POS) tags, and the surface form and POS tags of surrounding words. They alsoinclude conjunctions between these features with direction and distance from thehead to the dependent. The first-order feature templates of McDonald et al. (2005)are listed in Table 2.1, where h-word/pos refers to the word/POS of head h, d-word/pos refers to the word/POS of dependent d, hC1-pos refers to the POS tothe right of the head, h1-pos refers to the POS to the left of the head, dC1-pos


Table 2.1 First-order feature templates

(a) Uni-gram features    (b) Bi-gram features              (c) Other features
h-word, h-pos            h-word, h-pos, d-word, d-pos      h-pos, b-pos, d-pos
h-word                   h-pos, d-word, d-pos              h-pos, h+1-pos, d−1-pos, d-pos
h-pos                    h-word, d-word, d-pos             h−1-pos, h-pos, d−1-pos, d-pos
d-word, d-pos            h-word, h-pos, d-pos              h-pos, h+1-pos, d-pos, d+1-pos
d-word                   h-word, h-pos, d-word             h−1-pos, h-pos, d-pos, d+1-pos
d-pos                    h-word, d-word
                         h-pos, d-pos

Table 2.2 Second-order features

h-pos, d-pos, c-pos

h-pos, c-pos

d-pos, c-pos

h-word, c-word

d-word, c-word

h-pos, c-word

d-pos, c-word

h-word, c-pos

d-word, c-pos

refers to the POS to the right of the dependent, d−1-pos refers to the POS to the left of the dependent, and b-pos refers to the POS of a word in between the head and dependent.
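As an illustration of how such templates might be instantiated in code, the sketch below builds a few of the uni-gram, bi-gram, and surrounding-POS features for one edge; the sentence representation (parallel word/POS lists with a dummy root at index 0), the padding convention, and the distance binning are assumptions for this example only.

def first_order_features(words, tags, h, d):
    """Instantiate a few first-order feature templates for the edge (h, d) (sketch).

    words, tags: parallel lists, with index 0 being an artificial root token (assumed).
    Returns a list of feature strings; each template is conjoined with the
    attachment direction and a (binned) distance, as described in the text.
    """
    def pos(i):                       # POS of a neighbouring token, padded at the edges
        return tags[i] if 0 <= i < len(tags) else "<PAD>"

    direction = "R" if h < d else "L"
    dist = min(abs(h - d), 10)        # simple distance binning (assumption)
    feats = [
        # (a) uni-gram templates
        f"h-word={words[h]}", f"h-pos={tags[h]}", f"d-word={words[d]}", f"d-pos={tags[d]}",
        # (b) one bi-gram template
        f"h-word|h-pos|d-word|d-pos={words[h]}|{tags[h]}|{words[d]}|{tags[d]}",
        # (c) a surrounding-POS template
        f"h-pos|h+1-pos|d-1-pos|d-pos={tags[h]}|{pos(h + 1)}|{pos(d - 1)}|{tags[d]}",
    ]
    # in-between POS templates, one per word between head and dependent
    feats += [f"h-pos|b-pos|d-pos={tags[h]}|{pos(b)}|{tags[d]}"
              for b in range(min(h, d) + 1, max(h, d))]
    return [f"{f}&dir={direction}&dist={dist}" for f in feats]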

For the second-order model, the feature templates are defined over two adjacent edges (h, d, c), where h and d are the head and dependent, respectively, and c is one of ch, cdi, cdo. There are two types of second-order features: the parent-sibling second-order features (McDonald and Pereira 2006) and the parent-child-grandchild features (Carreras 2007). The second-order feature templates are listed in Table 2.2. We also conjoin them with the direction of the dependency. The second-order features are further extended by Bohnet (2010) by introducing more lexical features as the base features.

The third-order feature templates are defined over three adjacent edges, in the same way as in Koo and Collins (2010). The third-order feature templates are listed in Table 2.3, where g, t, and s have the same meanings as in Fig. 2.3. We also conjoin them with the direction of the dependency.

2.2 Transition-Based Models

Transition-based models were proposed by Nivre (2003) and Yamada and Matsumoto (2003). This type of model is simple and works very well in the shared tasks of CoNLL 2006 (Nivre et al. 2006) and CoNLL 2007 (Hall et al. 2007).


Table 2.3 Third-order features

Grand-sibling

g-pos, h-pos, s-pos, d-pos

g-word, h-pos, s-pos, d-pos

g-pos, h-word, s-pos, d-pos

g-pos, h-pos, s-word, d-pos

g-pos, h-pos, s-pos, d-word

g-pos, h-pos, s-pos, d-pos, g+1-pos, h+1-pos, m+1-pos

Tri-sibling

h-pos, t-pos, s-pos, d-pos

h-word, t-pos, s-pos, d-pos

h-pos, t-word, s-pos, d-pos

h-pos, t-pos, s-word, d-pos

h-pos, t-pos, s-pos, d-word

t-pos, s-pos, d-pos

t-pos, d-pos

2.2.1 Parsing Algorithm

The Nivre (2003) model is a shift-reduce-type algorithm, which uses a stack to store processed tokens and a queue to store remaining input tokens. It can perform dependency parsing in O(n) time. The dependency parsing tree is built from atomic actions in a left-to-right pass over the input. The behaviors of the parser are defined by four elementary actions (where TOP is the token on top of the stack and NEXT is the next token in the original input string):

• Left-Arc: Add an arc from NEXT to TOP; pop the stack.
• Right-Arc: Add an arc from TOP to NEXT; push NEXT onto the stack.
• Reduce: Pop the stack.
• Shift: Push NEXT onto the stack.

The two actions Left-Arc and Right-Arc add a dependency relation between TOP and NEXT.

These actions are determined by the parser configurations C, which are represented by triples <S, Q, A>, where S denotes the stack, Q is the list of remaining input tokens, and A is the current set of arcs in the dependency graph. The parser is initialized to <nil, W, ∅> for an input sentence W. The parsing procedure stops when the configuration is <S, nil, A> (for any list S and set of arcs A). For a configuration c, the parser takes the optimal action t* = argmax_{t∈T} s(c, t), where s denotes a score function of an action t in a configuration c. It represents the likelihood of taking action t out of configuration c. The actions Left-Arc and Right-Arc are subject to a condition that ensures the single-head constraint of the graph is satisfied. In contrast, the action Reduce can be applied only if TOP has a head. For Shift, the condition is that Q is non-empty.


Fig. 2.7 Four parsing actions

Figure 2.7 shows the four actions and their conditions, where wi denotes TOP (the token on top of the stack S) and wj denotes NEXT (the first token in the list Q). In Fig. 2.7, k < j. The parser can take the action Left-Arc for wi and wj if wi does not have a head. For the action Left-Arc, we add an arc wi ← wj into A and pop wi from the stack S. Similarly, the parser can take the action Right-Arc for wi and wj if wj does not have a head. For Right-Arc, we add an arc wi → wj into A and push wj onto S. The action Reduce can be taken only if wi has a head. For Reduce, we pop wi from S. For Shift, we push wj onto S.
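A minimal sketch of this arc-eager transition system is given below; choose_action stands in for the classifier score s(c, t) discussed above and is a hypothetical callback, and the preconditions follow the description of Fig. 2.7.

def arc_eager_parse(words, choose_action):
    """Arc-eager shift-reduce parsing sketch.

    words: input tokens w_1..w_n.
    choose_action(stack, queue, arcs): returns one of
        "LEFT_ARC", "RIGHT_ARC", "REDUCE", "SHIFT" (hypothetical classifier).
    Returns the set of arcs (head, dependent) as word indices.
    """
    stack = []                                 # S: processed tokens
    queue = list(range(1, len(words) + 1))     # Q: remaining input tokens
    arcs = set()                               # A: dependency arcs (head, dependent)
    head = {}                                  # dependent -> head, to check preconditions

    while queue:
        action = choose_action(stack, queue, arcs)
        i = stack[-1] if stack else None       # TOP
        j = queue[0]                           # NEXT
        if action == "LEFT_ARC" and i is not None and i not in head:
            arcs.add((j, i)); head[i] = j      # arc NEXT -> TOP, then pop TOP
            stack.pop()
        elif action == "RIGHT_ARC" and i is not None and j not in head:
            arcs.add((i, j)); head[j] = i      # arc TOP -> NEXT, then push NEXT
            stack.append(queue.pop(0))
        elif action == "REDUCE" and i is not None and i in head:
            stack.pop()                        # pop TOP (it already has a head)
        else:
            stack.append(queue.pop(0))         # SHIFT: push NEXT onto the stack
    return arcs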

The parser uses a classifier to produce a sequence of actions for a sentence. For the classifier, we can use several machine learning algorithms such as ME (Maximum Entropy), SVM (Support Vector Machines), and MBL (Memory-Based Learning).

Zhang and Clark (2008), Huang et al. (2009), and Huang and Sagae (2010) use the generalized perceptron for global learning and beam search for decoding. Their decoder is based on the incremental shift-reduce parsing process. The B highest-scoring states are stored in a buffer during the parsing process. At each step of the decoding, existing states from the buffer are processed by applying legal parsing actions. From all newly obtained states, the B highest-scoring states are stored in the buffer. The decoding process finishes when the highest-scoring state builds a complete tree.
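The beam-search decoding loop can be sketched as follows; legal_actions, apply, score, and is_final are hypothetical helpers standing in for the transition system and the globally trained (perceptron) scoring model.

def beam_search_decode(init_state, beam_size, legal_actions, apply, score, is_final):
    """Beam-search decoding sketch for an incremental shift-reduce parser.

    Keeps the beam_size highest-scoring states at every step and stops when
    the best state in the beam has built a complete tree.
    """
    beam = [init_state]
    while not is_final(beam[0]):
        candidates = []
        for state in beam:
            for action in legal_actions(state):
                candidates.append(apply(state, action))
        if not candidates:                 # no legal continuation; stop (sketch guard)
            break
        # keep the B highest-scoring states
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
    return beam[0]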

2.2.2 Feature Templates

At each step of parsing, we have a configuration C = <S, Q, A>. Based on C, the feature templates are shown in Table 2.4, where w refers to the word; p refers to the POS tag; S0 refers to the top of S; Q0, Q1, and Q2 refer to the first, second, and third front


Table 2.4 Feature templates for transition-based models

S0w; S0p; S0wS0p;

Q0w; Q0p; Q0wQ0p;

Q1w; Q1p; Q1wQ1p;

Q2w; Q2p; Q2wQ2p;

S0wS0pQ0wQ0p; S0wS0pQ0w; S0wQ0wQ0p; S0wS0pQ0p;

S0pQ0wQ0p; S0wQ0w; S0pQ0p; Q0pQ1p;

Q0pQ1pQ2p; S0pQ0pQ1p; S0hpS0pQ0p;

S0pS0lpQ0p; S0pS0rpQ0p; S0pQ0pQ0lp;

Table 2.5 Rich feature templates for transition-based models

dS0w; dS0p; dQ0w; dQ0p; dS0wQ0w; dS0pQ0p;

S0wvr; S0pvr; S0wvl; S0pvl; Q0wvl; Q0pvl;

words of Q, respectively; S0h refers to the head of S0 (if any); and S0l and S0r refer to the leftmost and rightmost modifier of S0 (if any), respectively. These features are used in Zhang and Clark (2008), Huang and Sagae (2010), and Zhang and Nivre (2011). The features consider the surface forms and POS tags of single words, word pairs, and word triples.

Zhang and Nivre (2011) explore richer features for transition-based parsers. The features used in their parser are very effective. Table 2.5 shows some features of Zhang and Nivre (2011), where d is the distance between S0 and Q0, and vr and vl refer to the right and left valencies of a node, respectively.

2.3 Evaluation Measures

The standard evaluation measures for dependency parsing in previous work and shared tasks (Buchholz and Marsi 2006; Nivre et al. 2007) are listed as follows:

• LAS: the percentage of tokens for which the system has predicted the correct head and dependency label.

• UAS: the percentage of tokens for which the system has predicted the correct head.

• ROOT: the percentage of sentences for which the system has predicted the correct root.

• COMP: the percentage of sentences for which the system has predicted the correct heads for all tokens.

Note that punctuation tokens are often excluded from scoring in the standard evaluations (Nivre et al. 2007).
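A small sketch of how these scores might be computed from gold and predicted (head, label) pairs is given below; the data layout (one list of (head, label) tuples per sentence, with head 0 marking the root, plus a punctuation mask) is an assumption made for illustration only.

def evaluate(gold, pred, punct):
    """Compute UAS, LAS, ROOT, and COMP over a corpus, excluding punctuation (sketch).

    gold, pred: lists of sentences; each sentence is a list of (head, label) tuples,
                with head == 0 marking the token attached to the artificial root.
    punct:      per-sentence boolean lists marking punctuation tokens (assumed layout).
    """
    tok = head_ok = both_ok = root_ok = comp_ok = 0
    for g_sent, p_sent, p_mask in zip(gold, pred, punct):
        all_heads_ok = True
        for (gh, gl), (ph, pl), is_punct in zip(g_sent, p_sent, p_mask):
            if is_punct:
                continue                       # punctuation excluded from scoring
            tok += 1
            if gh == ph:
                head_ok += 1
                if gl == pl:
                    both_ok += 1
            else:
                all_heads_ok = False
        # ROOT: the token(s) attached to the artificial root must match
        g_root = [i for i, (h, _) in enumerate(g_sent) if h == 0]
        p_root = [i for i, (h, _) in enumerate(p_sent) if h == 0]
        root_ok += int(g_root == p_root)
        comp_ok += int(all_heads_ok)
    n_sent = len(gold)
    return {"UAS": head_ok / tok, "LAS": both_ok / tok,
            "ROOT": root_ok / n_sent, "COMP": comp_ok / n_sent}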


2.4 Performance Summary

In recent years, many studies have worked on supervised dependency parsing. To show the recent progress in this research field, we list the reported scores on the test set of the Penn English Treebank (PTB) in Table 2.6. PTB is widely used in the field of dependency parsing. The data are split into three sets: training set (Sections 2–21), development set (Section 22), and testing set (Section 23).

From the table, we can see that the graph-based systems with higher-order features clearly achieve better scores. The second-order model provides a 1.07-point absolute improvement over the first-order model, while the third-order model also achieves a significant improvement over the second-order model. However, the fourth-order model provides only a 0.4-point improvement over the third-order model. This indicates that it is hard to improve further by increasing the order of the models.

As for the transition-based systems, structured learning algorithms and enlarging the search space help to improve performance. Zhang and Nivre (2011) demonstrate that the transition-based systems can also obtain benefits from rich nonlocal features.

Table 2.6 Performance summary of supervised parsers on PTB (test)

System                                                  UAS     Comment
Graph-based
  McDonald2005 (McDonald et al. 2005)                   90.95   First order
  Koo2008Sup (Koo et al. 2008)                          92.02   Second order
  Koo2010 (Koo and Collins 2010)                        93.0    Third order
  Ma2012 (Ma and Zhao 2012)                             93.4    Fourth order
Transition-based
  Yamada2003 (Yamada and Matsumoto 2003)                90.3    SVM
  Zhang2008 (Zhang and Clark 2008)                      91.4    Perceptron, beam search
  Huang2010 (Huang and Sagae 2010)                      92.1    Perceptron, dynamic programming
  Zhang2011 (Zhang and Nivre 2011)                      92.9    Perceptron, beam search

2.5 Summary

In this chapter, we have described two types of supervised dependency parsing models: graph-based and transition-based parsing models. The graph-based model uses exhaustive search and defines features over a limited scope, while the transition-based model uses greedy search or beam search and defines features over the decision history. The supervised parsing systems have achieved good performance as researchers have proposed different strategies during the past decade.


However, given limited training data, it is difficult to improve further. On the other hand, we can easily obtain large-scale raw data.

How to make full use of the raw data to improve dependency parsing is the main challenge beyond supervised parsing. In the following chapters of this book, we will describe the semi-supervised approaches, which have advanced the progress of dependency parsing in recent years.

References

Bohnet, B. (2010). Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd international conference on computational linguistics (Coling 2010) (pp. 89–97). Beijing: Coling 2010 Organizing Committee. http://www.aclweb.org/anthology/C10-1011.

Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X. SIGNLL, New York.

Buchholz, S., Marsi, E., Dubey, A., & Krymolowski, Y. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, New York.

Carreras, X. (2007). Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007 (pp. 957–961). Prague: Association for Computational Linguistics.

Covington, M. A. (2001). A fundamental algorithm for dependency parsing. In Proceedings of the 39th annual ACM southeast conference, Athens (pp. 95–102).

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951–991. doi:http://dx.doi.org/10.1162/jmlr.2003.3.4-5.951.

Eisner, J. (1996). Three new probabilistic models for dependency parsing: an exploration. In Proceedings of COLING 1996, Copenhagen (pp. 340–345).

Hall, J., Nilsson, J., Nivre, J., Eryigit, G., Megyesi, B., Nilsson, M., & Saers, M. (2007). Single malt or blended? A study in multilingual parser optimization. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 933–939).

Huang, L., & Sagae, K. (2010). Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 1077–1086). Uppsala: Association for Computational Linguistics. http://www.aclweb.org/anthology/P10-1110.

Huang, L., Jiang, W., & Liu, Q. (2009). Bilingually-constrained (monolingual) shift-reduce parsing. In Proceedings of EMNLP 2009 (pp. 1222–1231). Singapore: Association for Computational Linguistics.

Johansson, R., & Nugues, P. (2008). Dependency-based syntactic–semantic analysis with PropBank and NomBank. In CoNLL 2008: proceedings of the twelfth conference on computational natural language learning (pp. 183–187). Manchester: Coling 2008 Organizing Committee.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Koo, T., & Collins, M. (2010). Efficient third-order dependency parsers. In Proceedings of ACL 2010 (pp. 1–11). Uppsala: Association for Computational Linguistics.

Ma, X., & Zhao, H. (2012). Fourth-order dependency parsing. In Proceedings of COLING 2012: posters (pp. 785–796). Mumbai: The COLING 2012 Organizing Committee. http://www.aclweb.org/anthology/C12-2077.

McDonald, R. (2006). Discriminative training and spanning tree algorithms for dependency parsing. Ph.D. thesis, University of Pennsylvania.

McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005, Michigan (pp. 91–98). Association for Computational Linguistics.

McDonald, R., & Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. In Proceedings of EMNLP-CoNLL, Prague (pp. 122–131).

McDonald, R., & Pereira, F. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL 2006, Trento (pp. 81–88).

Nivre, J. (2003). An efficient algorithm for projective dependency parsing. In Proceedings of IWPT 2003, Nancy (pp. 149–160).

Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., & Yuret, D. (2007). The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 915–932).

Nivre, J., Hall, J., Nilsson, J., Eryigit, G., & Marinov, S. (2006). Labeled pseudo-projective dependency parsing with support vector machines. In CoNLL-X, New York.

Nivre, J., & McDonald, R. (2008). Integrating graph-based and transition-based dependency parsers. In Proceedings of ACL-08: HLT, Columbus.

Noji, H., & Miyao, Y. (2014). Left-corner transitions on dependency parsing. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers (pp. 2140–2150). Dublin: Dublin City University and Association for Computational Linguistics. http://www.aclweb.org/anthology/C14-1202.

Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy (pp. 195–206).

Zhang, Y., & Clark, S. (2008). A tale of two parsers: investigating and combining graph-based and transition-based dependency parsing. In Proceedings of EMNLP 2008, Honolulu (pp. 562–571).

Zhang, Y., & Nivre, J. (2011). Transition-based dependency parsing with rich non-local features. In Proceedings of ACL-HLT 2011 (pp. 188–193). Portland: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-2033.


Chapter 3
Overview of Semi-supervised Dependency Parsing Approaches

In this chapter, we briefly review the approaches to semi-supervised dependency parsing and categorize them into different types. We then summarize the performance of recent semi-supervised parsing systems on the Penn Treebank (PTB).

3.1 History

As for semi-supervised parsing, researchers first tried to use traditional semi-supervised algorithms, such as co-training and self-training. Sagae and Tsujii (2007) present an approach based on co-training for dependency parsing. They use two parsers to parse the sentences in unannotated data and select only identical results produced by those two parsers. They then retrain a parser on the newly parsed sentences and the original labeled data. Kawahara and Uchimoto (2008) use a self-training approach to select new sentences for dependency parsing. McClosky et al. (2006) present a self-training approach for phrase structure parsing, which is shown to be effective in practice.

Since the co-training/self-training algorithms only select some sentences as newly labeled data to retrain parsers, it is very hard to make full use of large-scale raw data. Many researchers therefore utilize the information of words or partial structures instead of whole trees. Koo et al. (2008) apply the Brown algorithm to produce word clusters on large-scale unannotated data and represent new features based on the clusters for parsing models. The cluster-based features provided extremely impressive results. Yu et al. (2008) construct case structures from auto-parsed data and utilize them in decoding. Chen et al. (2009a) propose an approach that uses the information on short dependency relations for Chinese dependency parsing and only uses word pairs within two word distances for a transition-based parsing algorithm. Suzuki et al. (2009) extend a Semi-supervised Structured Conditional Model (SS-SCM) (Suzuki and Isozaki 2008) to the dependency parsing problem and combine their method with the approach of



Fig. 3.1 Framework of semi-supervised dependency parsing

Koo et al. (2008). Suzuki et al. (2011) report the best results so far on the standard test sets of PTB using a condensed feature representation combined with the word cluster-based features of Koo et al. (2008).

3.2 Framework of Semi-supervised Dependency Parsing

Figure 3.1 shows a typical procedure of semi-supervised dependency parsing. In the first step, we preprocess the raw sentences. The preprocessing may include word segmentation (if needed) and part-of-speech tagging. After that, we obtain the word-segmented sentences with their POS tags. We then use a baseline parser to parse the sentences in the data. Finally, we obtain the auto-parsed data.

Based on the auto-parsed data, some learning algorithms are designed to learn new information. For instance, in self-training, a conventional semi-supervised method, the task of the learning algorithm is to select some reliable auto-parsed sentences. The selected sentences are used as an additional annotated set for the next step. What kind of information the algorithms learn is the most important question for semi-supervised dependency parsing.

In the final step, a new parser is trained by using the human-annotated data and the new information. The key to this step is to design a training approach which can make full use of the new information and the human-annotated data.
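The overall procedure can be summarized with the following sketch; all callbacks passed in (preprocess, train_parser, learn_new_information, train_with_new_information) are hypothetical placeholders for the components discussed above, not actual components of any particular system.

def semi_supervised_pipeline(labeled_data, raw_sentences,
                             preprocess, train_parser,
                             learn_new_information, train_with_new_information):
    """Typical semi-supervised dependency parsing procedure of Fig. 3.1 (sketch).

    preprocess(sentence)                   -> segmented, POS-tagged sentence
    train_parser(data)                     -> baseline parser with a .parse() method
    learn_new_information(auto_parsed)     -> e.g. selected trees, clusters, subtrees
    train_with_new_information(data, info) -> final parser
    """
    tagged = [preprocess(s) for s in raw_sentences]              # preprocessing step
    baseline = train_parser(labeled_data)                        # baseline parser
    auto_parsed = [baseline.parse(s) for s in tagged]            # auto-parsed data
    info = learn_new_information(auto_parsed)                    # learn new information
    return train_with_new_information(labeled_data, info)        # final training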

3.3 Three Levels of Approaches

Most of the early studies (often applying self- and co-training algorithms) select entire auto-parsed trees as newly labeled data for retraining new parsers (Sagae and Tsujii 2007; Steedman et al. 2003). These methods mainly suffer from two


Fig. 3.2 Three levels

problems: (1) it is difficult to select reliable auto-parsed trees as newly annotated data (Steedman et al. 2003), and (2) it is difficult to scale to large data because of the high computational cost of training models with a large number of newly, automatically annotated sentences (Chen et al. 2008). Instead of using entire trees, several researchers exploit lexical information, such as word clusters and word co-occurrences (Koo et al. 2008; Zhou et al. 2011). Lexical information is easy to use in parsing models, but it ignores the dependency relations among words, which might be useful. The use of bilexical dependencies is attempted in van Noord (2007) and Chen et al. (2008). However, bilexical dependencies provide a relatively poor level of useful information for parsing. To provide richer information, we can consider more words, such as subtrees (Chen et al. 2009a).

In this book, we divide the semi-supervised dependency parsing approaches into three types according to what kind of new information the algorithms learn in the second step (shown in Fig. 3.2). The three types are listed as follows:

• Whole-tree level: The approaches of this type select auto-parsed whole trees as newly annotated data to train new parsers. For example, Kawahara and Uchimoto (2008) use a self-training approach to select new sentences, while Sagae and Tsujii (2007) use the co-training technique to improve parsing performance. The approaches of this type will be described in Chap. 4.

• Word level: The approaches of this type learn lexical information from raw data, but not auto-parsed data. For example, Koo et al. (2008) use a clustering algorithm to produce word clusters on a large amount of unannotated data and represent new features based on the clusters for dependency parsing models. The approaches of the word level will be described in Chap. 5.

• Partial-tree level: The approaches of this type make use of the information of partial trees from auto-parsed data. The use of bilexical dependencies is attempted in van Noord (2007) and Chen et al. (2008). Chen et al. (2009a) propose an approach that uses the information on subtrees from auto-parsed data. To enlarge the scope, dependency language models are used in Chen et al. (2012). This can be extended further: meta-features defined over surface words and part-of-speech tags represent more complex tree structures than bilexical dependencies and lexical subtrees (Chen et al. 2013). The approaches of the partial-tree level will be described in Chaps. 6–9.


Table 3.1 Performance summary of semi-supervised parsers on PTB (test)

Type                 System                               Baseline (Sup)   Semi-Sup   Gap
Whole-tree level     Self-training (Li et al. 2014)       92.34            92.29      -0.05
                     Co-training (Li et al. 2014)         92.34            92.81      +0.47
                     Ambiguity-aware (Li et al. 2014)     92.34            93.19      +0.85
Word level           Koo2008-dep2c (Koo et al. 2008)      92.02            93.16      +1.14
                     Zhou2011 (Zhou et al. 2011)          91.98            92.64      +0.66
Partial-tree level   Suzuki2009 (Suzuki et al. 2009)      92.70            93.79      +1.09
                     Chen2009 (Chen et al. 2009a)         91.92            92.89      +0.97
                     Suzuki2011 (Suzuki et al. 2011)      92.82            94.22      +1.4
                     Chen2012 (Chen et al. 2012)          92.10            92.76      +0.56
                     Chen2013 (Chen et al. 2013)          92.76            93.77      +1.01


3.4 Performance Summary

Many studies have worked on semi-supervised dependency parsing in recent years. To show the recent progress in this research field, we list the reported scores on the test set of the Penn English Treebank (PTB) in Table 3.1, where "Baseline (Sup)" refers to the baseline systems and "Semi-Sup" refers to the semi-supervised systems in previous studies. PTB is widely used in the field of dependency parsing. The data are split into three sets: training set (Sections 2–21), development set (Section 22), and testing set (Section 23). Compared with the performance of the supervised approaches in Table 2.6, the semi-supervised approaches often perform better.

3.5 Summary

This chapter gives an overview of semi-supervised dependency parsing models and summarizes the performance of semi-supervised approaches. The details of the related approaches will be described in the following chapters: Chap. 4 presents the approaches of the whole-tree-level type, Chap. 5 describes the approaches of the word level, and Chaps. 6–9 introduce the approaches of the partial-tree level.


References

Chen, W., Kawahara, D., Uchimoto, K., Zhang, Y., & Isahara, H. (2008). Dependency parsing with short dependency relations in unlabeled data. In Proceedings of IJCNLP, Hyderabad.

Chen, W., Kawahara, D., Uchimoto, K., Zhang, Y., & Isahara, H. (2009a). Using short dependency relations from auto-parsed data for Chinese dependency parsing. ACM Transactions on Asian Language Information Processing (TALIP), 8(3), Article 10.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009b). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of EMNLP, Singapore (pp. 570–579).

Chen, W., Zhang, M., & Li, H. (2012). Utilizing dependency language models for graph-based dependency parsing models. In Proceedings of ACL, Jeju.

Chen, W., Zhang, M., & Zhang, Y. (2013). Semi-supervised feature transformation for dependency parsing. In Proceedings of EMNLP, Seattle (pp. 1303–1313). Association for Computational Linguistics. http://www.aclweb.org/anthology/D13-1129.

Kawahara, D., & Uchimoto, K. (2008). Learning reliability of parses for domain adaptation of dependency parsing. In Proceedings of IJCNLP, Hyderabad.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Li, Z., Zhang, M., & Chen, W. (2014). Ambiguity-aware ensemble training for semi-supervised dependency parsing. In Proceedings of the annual meeting of the association for computational linguistics (ACL 2014), Baltimore (pp. 456–467).

McClosky, D., Charniak, E., & Johnson, M. (2006). Reranking and self-training for parser adaptation. In Proceedings of COLING-ACL, Sydney (pp. 337–344).

van Noord, G. (2007). Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of IWPT-07, Prague.

Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL, Prague (pp. 1044–1050).

Steedman, M., Osborne, M., Sarkar, A., Clark, S., Hwa, R., Hockenmaier, J., Ruhlen, P., Baker, S., & Crim, J. (2003). Bootstrapping statistical parsers from small datasets. In Proceedings of EACL, Budapest (pp. 331–338).

Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using Giga-word scale unlabeled data. In Proceedings of ACL-08: HLT, Columbus (pp. 665–673). Association for Computational Linguistics.

Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP, Singapore (pp. 551–560). Association for Computational Linguistics.

Suzuki, J., Isozaki, H., & Nagata, M. (2011). Learning condensed feature representations from large unsupervised data sets for supervised learning. In Proceedings of ACL, Portland (pp. 636–641). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-2112.

Yu, K., Kawahara, D., & Kurohashi, S. (2008). Chinese dependency parsing with large scale automatically constructed case structures. In Proceedings of COLING, Manchester (pp. 1049–1056).

Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting web-derived selectional preference to improve statistical dependency parsing. In Proceedings of ACL-HLT, Portland (pp. 1556–1565). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1156.


Chapter 4
Training with Auto-parsed Whole Trees

This chapter describes the approaches that make use of entire auto-parsed dependency trees. We first briefly introduce the self-training and co-training approaches and then introduce the approach of ambiguity-aware ensemble training in detail.

The conventional approaches of the whole-tree level pick up some high-quality auto-parsed training instances from unlabeled data using bootstrapping methods, such as self-training (Yarowsky 1995), co-training (Blum and Mitchell 1998), and tri-training (Zhou and Li 2005). However, these methods gain limited success in dependency parsing. Although working well on constituent parsing (Huang and Harper 2009; McClosky et al. 2006), self-training has been shown to be unsuccessful for dependency parsing (Spreyer and Kuhn 2009). The reason may be that dependency parsing models are prone to amplify previous mistakes during training on self-parsed unlabeled data. Sagae and Tsujii (2007) apply a variant of co-training to dependency parsing and report positive results on out-of-domain text. Søgaard and Rishøj (2010) combine tri-training and parser ensembles to boost parsing accuracy. Both works employ two parsers to process the unlabeled data and select as extra training data only sentences on which the 1-best parse trees of the two parsers are identical. In this way, the auto-parsed unlabeled data becomes more reliable. However, one obvious drawback of the self-training and co-training approaches is that they are unable to exploit unlabeled data with different outputs from different parsers. Intuitively, an unlabeled sentence with divergent outputs should contain some ambiguous syntactic structures (such as preposition phrase attachment) that are very hard to resolve and lead to the disagreement of different parsers. Such sentences can provide more discriminative instances for training, which may be unavailable in the labeled data. To solve the above issues, Li et al. (2014) propose a simple yet effective framework to make use of whole trees, referred to as ambiguity-aware ensemble training.



Fig. 4.1 Self-training

4.1 Self-Training

Figure 4.1 shows the standard procedure of self-training for dependency parsing. There are four steps: (1) base training, training a first-stage parser with the labeled data; (2) processing, applying the parser to produce automatic parses for the unlabeled data; (3) selecting, selecting some auto-parsed sentences as newly labeled data; (4) final training, training a better parser by combining the labeled and unlabeled data. It is very hard to select new sentences as newly labeled data. Kawahara and Uchimoto (2008) use an SVM classifier to pick up reliable trees.
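A minimal self-training loop along these lines might look as follows; train_parser and the selection criterion is_reliable are hypothetical placeholders (the latter could be, e.g., a classifier of parse reliability as in Kawahara and Uchimoto (2008)).

def self_train(labeled_data, unlabeled_sentences, train_parser, is_reliable):
    """Self-training for dependency parsing (sketch).

    train_parser(data) -> parser with a .parse(sentence) method (assumed interface).
    is_reliable(tree)  -> bool, the selection criterion for auto-parsed trees.
    """
    base_parser = train_parser(labeled_data)                           # (1) base training
    auto_parsed = [base_parser.parse(s) for s in unlabeled_sentences]  # (2) processing
    selected = [t for t in auto_parsed if is_reliable(t)]              # (3) selecting
    return train_parser(labeled_data + selected)                       # (4) final training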

4.2 Co-training

Figure 4.2 shows the standard procedure of co-training for dependency parsing. The difference from the self-training method is that in co-training there are two base parsers from different views. The new sentences are selected according to the consistency between the two parsers and added into the labeled data (Sagae and Tsujii 2007).
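The agreement-based selection used in such co-training can be sketched as follows; again, the training and parsing interfaces are hypothetical placeholders.

def co_train(labeled_data, unlabeled_sentences, train_parser_a, train_parser_b):
    """Co-training sketch: keep only sentences on which two parsers agree."""
    parser_a = train_parser_a(labeled_data)
    parser_b = train_parser_b(labeled_data)
    selected = []
    for sentence in unlabeled_sentences:
        tree_a = parser_a.parse(sentence)
        tree_b = parser_b.parse(sentence)
        if tree_a == tree_b:          # identical 1-best outputs from the two views
            selected.append(tree_a)
    # retrain (here: one of the parsers) on the original data plus the agreed trees
    return train_parser_a(labeled_data + selected)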

4.3 Ambiguity-Aware Ensemble Training

In this section, we describe the ambiguity-aware ensemble training proposed by Li et al. (2014). Instead of only using 1-best parse trees as in the self-training and co-training approaches, the core idea is to utilize a parse forest (ambiguous labelings)


Fig. 4.2 Co-training

Fig. 4.3 An example sentence with an ambiguous parse forest

to combine multiple 1-best parse trees generated from diverse parsers on unlabeled data. Figure 4.3 shows an example sentence with an ambiguous parse forest. The forest is formed by two parse trees, shown at the upper and lower sides of the sentence, respectively. The differences between the two parse trees are highlighted using dashed arcs. The upper tree takes "deer" as the subject of "riding", whereas the lower one indicates that "he" rides the bicycle. The other difference is where the preposition phrase (PP) "in the park" should be attached, which is also known as the PP attachment problem, a notorious challenge for parsing. Reserving such uncertainty has three potential advantages. First, noise in unlabeled data is largely alleviated, since the parse forest encodes only a few highly probable parse trees with a high oracle score. Note that the parse forest in Fig. 4.3 contains four parse trees after


combination of the two different choices. Second, the parser is able to learn useful features from the unambiguous parts of the parse forest. Finally, with sufficient unlabeled data, it is possible that the parser can learn to resolve such uncertainty by biasing toward more reasonable parse trees.

To construct parse forests on unlabeled data, we employ three supervised parsers based on different paradigms, including our baseline graph-based dependency parser, a transition-based dependency parser (Zhang and Nivre 2011), and a generative constituent parser (Petrov and Klein 2007). The 1-best parse trees of these three parsers are aggregated in different ways. Finally, using a conditional random field (CRF)-based probabilistic parser, a model is trained by maximizing the mixed likelihood of the labeled data and the auto-parsed unlabeled data with ambiguous labelings.

4.3.1 CRF-Based GParser

The second-order graph-based dependency parsing model of McDonald and Pereira (2006) is used as our core parser, which incorporates features from the two kinds of subtrees in Fig. 4.4. The details of the parsing model can be found in Chap. 2. Then the score of a dependency tree is

S(x, y; w) = Σ_{(h,m)∈y} w_dep · f_dep(x, h, m) + Σ_{{(h,s),(h,m)}∈y} w_sib · f_sib(x, h, s, m)

where f_dep(x, h, m) and f_sib(x, h, s, m) are the feature vectors of the two subtrees in Fig. 4.4, w_dep and w_sib are the corresponding feature weight vectors, and the dot products give the scores contributed by the corresponding subtrees.

For syntactic features, we can use the features of Bohnet (2010), which include two categories corresponding to the two types of scoring subtrees in Fig. 4.4. We summarize the atomic features used in each feature category in Table 4.1. These atomic features are concatenated in different combinations to compose rich feature sets. Please refer to Table 4 of Bohnet (2010) for the complete feature list.

Previous work on graph-based dependency parsing mostly adopts linear models and perceptron-based training procedures, which lack probabilistic explanations of

Fig. 4.4 Two types of scoring subtrees in our second-order graph-based parsers: (a) a single dependency (h, m); (b) an adjacent sibling structure (h, s, m)


Table 4.1 Brief illustration of the syntactic features. t_i denotes the POS tag of w_i. b is an index between h and m. dir(i, j) and dist(i, j) denote the direction and distance of the dependency (i, j)

Dependency features f_dep(x, h, m):
    w_h, w_m, t_h, t_m, t_h±1, t_m±1, t_b, dir(h, m), dist(h, m)

Sibling features f_sib(x, h, s, m):
    w_h, w_s, w_m, t_h, t_m, t_s, t_h±1, t_m±1, t_s±1, dir(h, m), dist(h, m)

dependency trees and do not need to compute the likelihood of the labeled training data. Instead, we build a log-linear CRF-based dependency parser, which is similar to the CRF-based constituent parser of Finkel et al. (2008). Assuming the feature weights w are known, the probability of a dependency tree y given an input sentence x is defined as

p(y | x; w) = exp{S(x, y; w)} / Z(x; w)

Z(x; w) = Σ_{y′∈T(x)} exp{S(x, y′; w)}        (4.1)

where Z(x; w) is the normalization factor and T(x) is the set of all legal dependency trees for x.

Suppose the labeled training data is D = {(x_i, y_i)}_{i=1}^N. Then the log likelihood of D is

L(D; w) = Σ_{i=1}^N log p(y_i | x_i; w)

The training objective is to maximize the log likelihood of the training data, L(D; w). The partial derivative with respect to the feature weights w is

∂L(D; w)/∂w = Σ_{i=1}^N ( f(x_i, y_i) − Σ_{y′∈T(x_i)} p(y′ | x_i; w) f(x_i, y′) )        (4.2)

where the first term is the empirical counts and the second term is the model expectations. Since T(x_i) contains exponentially many dependency trees, direct calculation of the second term is prohibitive. Instead, the classic inside-outside algorithm is used to efficiently compute the model expectations within O(n^3) time complexity, where n is the input sentence length.


4.3.2 Ambiguity-Aware Ensemble Training

The key idea is the use of ambiguous labelings for the purpose of aggregating multiple 1-best parse trees produced by several diverse parsers. Here, "ambiguous labelings" means that an unlabeled sentence may have multiple parse trees as its gold-standard reference, represented by a parse forest (see Fig. 4.3). The training procedure aims to maximize the mixed likelihood of both manually labeled and auto-parsed unlabeled data with ambiguous labelings. For an unlabeled instance, the model is updated to maximize the probability of its parse forest, instead of a single parse tree as in traditional tri-training. In other words, the model is free to distribute probability mass among the trees in the parse forest to its liking, as long as the likelihood improves (Täckström et al. 2013).

4.3.2.1 Likelihood of the Unlabeled Data

The auto-parsed unlabeled data with ambiguous labelings is denoted as D′ = {(u_i, V_i)}_{i=1}^M, where u_i is an unlabeled sentence and V_i is the corresponding parse forest. Then the log likelihood of D′ is

L(D′; w) = Σ_{i=1}^M log ( Σ_{y′∈V_i} p(y′ | u_i; w) )

where p(y′ | u_i; w) is the conditional probability of y′ given u_i, as defined in Eq. (4.1). For an unlabeled sentence u_i, the probability of its parse forest V_i is the summation of the probabilities of all the parse trees contained in the forest.

Then we can derive the partial derivative of the log likelihood with respect to w:

∂L(D′; w)/∂w = Σ_{i=1}^M ( Σ_{y′∈V_i} p̃(y′ | u_i, V_i; w) f(u_i, y′) − Σ_{y′∈T(u_i)} p(y′ | u_i; w) f(u_i, y′) )        (4.3)

where p̃(y′ | u_i, V_i; w) is the probability of y′ under the space constrained by the parse forest V_i:

p̃(y′ | u_i, V_i; w) = exp{S(u_i, y′; w)} / Z(u_i, V_i; w)

Z(u_i, V_i; w) = Σ_{y′∈V_i} exp{S(u_i, y′; w)}

The second term in Eq. (4.3) is the same as the second term in Eq. (4.2). The first term in Eq. (4.3) can be efficiently computed by running the inside-outside algorithm in the constrained search space V_i.


Algorithm 2 SGD training with mixed labeled and unlabeled data

1: Input: labeled data D = {(x_i, y_i)}_{i=1}^N and unlabeled data D′ = {(u_j, V_j)}_{j=1}^M; parameters: I, N_1, M_1, b
2: Output: w
3: Initialization: w^(0) = 0, k = 0
4: for i = 1 to I do                          ▷ iterations
5:     Randomly select N_1 instances from D and M_1 instances from D′ to compose a new data set D_i, and shuffle it.
6:     Traverse D_i: a small batch D^b_{i,k} ⊆ D_i at one step.
7:     w^{k+1} = w^k + η_k · (1/b) · ∇L(D^b_{i,k}; w^k)
8:     k = k + 1
9: end for

4.3.2.2 Stochastic Gradient Descent (SGD) Training

L2-norm regularized SGD training is applied to iteratively learn the feature weights w for our CRF-based baseline and semi-supervised parsers. At each step, the algorithm approximates a gradient with a small subset of the training examples and then updates the feature weights. Finkel et al. (2008) show that SGD achieves optimal test performance with far fewer iterations than other optimization routines such as L-BFGS. Moreover, it is very convenient to parallelize SGD since computations among examples in the same batch are mutually independent.

Training with the combined labeled and unlabeled data, the objective is to maximize the mixed likelihood:

L(D, D′; w) = L(D; w) + L(D′; w)

Since D′ contains many more instances than D (1.7M vs. 40K for English and 4M vs. 16K for Chinese), it is likely that the unlabeled data may overwhelm the labeled data during SGD training. Therefore, we propose a simple corpus-weighting strategy, as shown in Algorithm 2, where D^b_{i,k} is the subset of training data used in the kth update, b is the batch size, and η_k is the update step, which is adjusted following the simulated annealing procedure (Finkel et al. 2008). The idea is to use a fraction of the training data (D_i) at each iteration and do corpus weighting by randomly sampling labeled and unlabeled instances in a certain proportion (N_1 vs. M_1).
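A sketch of this corpus-weighting loop is given below; grad is a hypothetical callback returning the mixed-likelihood gradient of Eqs. (4.2)/(4.3) on a batch, and the learning-rate schedule is a simple stand-in for the simulated annealing procedure mentioned above.

import random

def sgd_mixed(labeled, unlabeled, grad, num_feats,
              iterations=100, n1=20_000, m1=50_000, batch_size=100, eta0=0.1):
    """Corpus-weighted SGD over mixed labeled/unlabeled data (sketch).

    grad(batch, w) -> gradient vector (list of floats) of the batch log likelihood.
    """
    w = [0.0] * num_feats
    k = 0
    for _ in range(iterations):
        # sample N1 labeled and M1 unlabeled instances, then shuffle the mixture
        data_i = random.sample(labeled, min(n1, len(labeled))) + \
                 random.sample(unlabeled, min(m1, len(unlabeled)))
        random.shuffle(data_i)
        for start in range(0, len(data_i), batch_size):
            batch = data_i[start:start + batch_size]
            eta_k = eta0 / (1.0 + 0.01 * k)            # stand-in annealing schedule
            g = grad(batch, w)
            w = [wj + eta_k * gj / batch_size for wj, gj in zip(w, g)]
            k += 1
    return w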

Once the feature weights w are learnt, we can parse the test data to find the optimal parse tree:

ŷ = argmax_{y′∈T(x)} p(y′ | x; w) = argmax_{y′∈T(x)} S(x, y′; w)

This can be done with the Viterbi decoding algorithm described in McDonald and Pereira (2006) in O(n^3) parsing time.


4.3.2.3 Forest Construction with Diverse Parsers

To construct parse forests for unlabeled data, we employ three diverse parsers, i.e., our baseline GParser, a transition-based parser (ZPar) (Zhang and Nivre 2011), and a generative constituent parser (Berkeley Parser) (Petrov and Klein 2007). These three parsers are trained on the labeled data and then used to parse each unlabeled sentence.

4.3.3 Experiments and Analysis

The experiments are conducted on the Penn Treebank (PTB) and Penn Chinese Treebank 5.1 (CTB5). For English, the data are split into training (Sections 2–21), development (Section 22), and test (Section 23). For CTB5, the data split follows Duan et al. (2007). Penn2Malt is used to convert the original bracketed structures into dependency structures with its default head-finding rules.

For unlabeled data, the BLLIP WSJ corpus (Charniak et al. 2000) is used for English and the Xinhua portion of Chinese Gigaword Version 2.0 (LDC2009T14) (Huang 2009) is used for Chinese. A CRF-based bigram part-of-speech (POS) tagger with the features described in Li et al. (2012) produces POS tags for all train/development/test/unlabeled sets (10-way jackknifing for the training sets). The tagging accuracy on the test sets is 97.3 % on English and 94.0 % on Chinese. Table 4.2 shows the data statistics.

The results are reported using the standard unlabeled attachment score (UAS), excluding punctuation marks. For significance testing, Dan Bikel's randomized parsing evaluation comparator (Noreen 1989) is used.1

4.3.3.1 Parameter Setting

When training our CRF-based parsers with SGD, we use the batch size b = 100 for all experiments. We run SGD for I = 100 iterations and choose the model that performs best on the development data. For the semi-supervised parsers trained with Algorithm 2, we use N_1 = 20K and M_1 = 50K for English and N_1 = 15K and M_1 = 50K for Chinese, based on a few preliminary experiments. To accelerate the training, we adopt a parallelized implementation of SGD and employ 20 threads for

Table 4.2 Data sets (in number of sentences)

Train Dev Test Unlabeled

PTB 39,832 1,700 2,416 1.7M

CTB5 16,091 803 1,910 4M

1http://www.cis.upenn.edu/~dbikel/software.html


each run. For the semi-supervised cases, one iteration takes about 2 h on an IBM server with 2.0 GHz Intel Xeon CPUs and 72 GB of memory.

Default parameter settings are used for training ZPar and Berkeley Parser. We run ZPar for 50 iterations and choose the model that achieves the highest accuracy on the development data. For Berkeley Parser, we use the model after 5 split-merge iterations to avoid over-fitting the training data, according to the manual. The phrase structure outputs of Berkeley Parser are converted into dependency structures using the same head-finding rules.

4.3.3.2 Methodology Study on Development Data

Using three supervised parsers, we have many options to construct parse forests on the unlabeled data. To examine the effect of different ways of forest construction, we conduct an extensive methodology study on the development data. Table 4.3 presents the results. We divide the systems into three types: (1) supervised single parsers; (2) CRF-based GParser with conventional self-/co-/tri-training; and (3) CRF-based GParser with our approach. For the latter two cases, we also present the oracle accuracy and the averaged number of heads per word ("H/W") of the parse forest when applying different ways to construct forests on the development data sets.

The first major row presents the performance of the three supervised parsers. We can see that the three parsers achieve comparable performance on English, but the performance of ZPar is largely inferior on Chinese.

The second major row shows the results when we use single 1-best parse trees on unlabeled data.

Table 4.3 Main results on PTB (dev) and CTB5 (dev). G is short for GParser, Z for ZPar, and B for Berkeley Parser

                                              English                     Chinese
                                              UAS      Oracle   H/W       UAS      Oracle   H/W
Supervised        GParser                     92.85    –        –         82.28    –        –
                  ZPar                        92.50    –        –         81.04    –        –
                  Berkeley                    92.70    –        –         82.46    –        –
Single            Parse G (self-train)        92.88    92.85    1.000     82.14    82.28    1.000
1-best trees      Parse Z (co-train)          93.15a   92.50    1.000     82.54    81.04    1.000
                  Parse B (co-train)          93.40a   92.70    1.000     83.34a   82.46    1.000
                  Parse B=Z (tri-train)       93.50a   97.52    1.000     83.10a   95.05    1.000
Ambiguity-aware   Parse Z+G                   93.18a   94.97    1.053     82.78    86.66    1.136
ensemble          Parse B+G                   93.35a   96.37    1.080     83.24a   89.72    1.188
                  Parse B+Z                   93.78a,b 96.18    1.082     83.86a,b 89.54    1.199
                  Parse B+(Z∩G)               93.77a,b 95.60    1.050     84.26a,b 87.76    1.106
                  Parse B+Z+G                 93.50a   96.95    1.112     83.30a   91.50    1.281

a: the corresponding parser significantly outperforms the supervised parsers
b: the result significantly outperforms co-/tri-training at a confidence level of p < 0.01


When using the outputs of GParser itself ("Parse G"), the experiment reproduces traditional self-training. The results on both English and Chinese reconfirm that self-training may not work for dependency parsing, which is consistent with previous studies (Spreyer and Kuhn 2009). The reason may be that dependency parsers are prone to amplify previous mistakes on unlabeled data during training.

The next two experiments in the second major row reimplement co-training, where another parser's 1-best results are projected into the unlabeled data to help the core parser. Using unlabeled data with the results of ZPar ("Parse Z") significantly outperforms the baseline GParser by 0.30 % (93.15–92.85) on English. However, the improvement on Chinese is not significant. Using unlabeled data with the results of Berkeley Parser ("Parse B") significantly improves parsing accuracy by 0.55 % (93.40–92.85) on English and 1.06 % (83.34–82.28) on Chinese. We believe the reason is that, being a generative model designed for constituent parsing, Berkeley Parser is more different from discriminative dependency parsers and therefore can provide more divergent syntactic structures. This kind of syntactic divergence is helpful because it can provide complementary knowledge from a different perspective. Surdeanu and Manning (2010) also show that the diversity of parsers is important for performance improvement when integrating different parsers in the supervised track. Therefore, we can conclude that co-training helps dependency parsing, especially when using a more divergent parser.

The last experiment in the second major row is known as tri-training, which only uses unlabeled sentences on which Berkeley Parser and ZPar produce identical outputs ("Parse B=Z"). We can see that with the verification of two views, the oracle accuracy is much higher than using single parsers (97.52 % vs. 92.85 % on English and 95.06 % vs. 82.46 % on Chinese). Although using fewer unlabeled sentences (0.7M for English and 1.2M for Chinese), tri-training achieves comparable performance to co-training (slightly better on English and slightly worse on Chinese).

The third major row shows the results of the semi-supervised GParser with our proposed approach. We experiment with different combinations of the 1-best parse trees of the three supervised parsers. The first three experiments combine the 1-best outputs of two parsers to compose the parse forest on unlabeled data. "Parse B+(Z∩G)" means that the parse forest is initialized with the Berkeley parse and augmented with the intersection of the dependencies in the 1-best outputs of ZPar and GParser. In the last setting, the parse forest contains all three 1-best results.

When the parse forests of the unlabeled data are the union of the outputs of GParser and ZPar, denoted as "Parse Z+G," each word has 1.053 candidate heads on English and 1.136 on Chinese, and the oracle accuracy is higher than using the 1-best outputs of single parsers (94.97 % vs. 92.85 % on English, 86.66 % vs. 82.46 % on Chinese). However, we find that although the parser significantly outperforms the supervised GParser on English, it does not gain a significant improvement over co-training with ZPar ("Parse Z") on either English or Chinese.

Combining the outputs of Berkeley Parser and GParser ("Parse B+G"), we get a higher oracle score (96.37 % on English and 89.72 % on Chinese) and higher syntactic divergence (1.085 candidate heads per word on English and 1.188 on Chinese) than "Parse Z+G," which verifies our earlier discussion that Berkeley Parser produces more different structures than ZPar.


However, it leads to slightly worse accuracy than co-training with Berkeley Parser ("Parse B"). This indicates that adding the outputs of GParser itself does not help the model.

Combining the outputs of Berkeley Parser and ZPar ("Parse B+Z"), we get the best performance on English, which is also significantly better than both co-training ("Parse B") and tri-training ("Parse B=Z") on both English and Chinese. This demonstrates that our proposed approach can better exploit unlabeled data than traditional self-/co-/tri-training. More analysis and discussion are in Sect. 4.3.3.4.

During experimental trials, we find that "Parse B+(Z∩G)" can further boost performance on Chinese. A possible explanation is that by using the intersection of the outputs of GParser and ZPar, the size of the parse forest is better controlled, which is helpful considering that ZPar performs worse on this data than both Berkeley Parser and GParser.

Adding the output of GParser itself ("Parse B+Z+G") leads to an accuracy drop, although the oracle score is higher (96.95 % on English and 91.50 % on Chinese) than "Parse B+Z." We suspect the reason is that, under this setting, the model is likely to distribute the probability mass to the parse trees produced by itself instead of those produced by Berkeley Parser or ZPar.

In summary, we can conclude that our proposed ambiguity-aware ensemble training is significantly better than both the supervised approaches and the semi-supervised approaches that use 1-best parse trees. By appropriately composing the parse forest, our approach outperforms the best results of co-training or tri-training by 0.28 % (93.78–93.50) on English and 0.92 % (84.26–83.34) on Chinese.

4.3.3.3 Comparison with Previous Work

We adopt the best settings on the development data for the semi-supervised GParser with our proposed approach and compare with previous results on the test data. Table 4.4 shows the results.

The first major row lists several state-of-the-art supervised methods. McDonald and Pereira (2006) propose a second-order graph-based parser, but use a smaller feature set than our work. Koo and Collins (2010) propose a third-order graph-based parser. Zhang and McDonald (2012) explore higher-order features for graph-based dependency parsing and adopt beam search for fast decoding. Zhang and Nivre (2011) propose a feature-rich transition-based parser. All work in the second major row adopts semi-supervised methods. The results show that our approach achieves comparable accuracy with most previous semi-supervised methods. Both Suzuki et al. (2009) and Chen et al. (2013) adopt the higher-order parsing model of Carreras (2007), and Suzuki et al. (2009) also incorporate the word cluster features proposed by Koo et al. (2008) in their system. We expect our approach may achieve higher performance with such enhancements, which we leave for future work. Moreover, our method may be combined with other semi-supervised approaches, since they are orthogonal in methodology and utilize unlabeled data from different perspectives.


Table 4.4 UAS comparison on PTB (test)

                                                               Sup      Semi
McDonald2006 (McDonald and Pereira 2006)                       91.5     –
Koo2010 (Koo and Collins 2010) [higher-order]                  93.04    –
Zhang2012 (Zhang and McDonald 2012) [higher-order]             93.06    –
Zhang2011 (Zhang and Nivre 2011) [higher-order]                92.9     –
Koo2008 (Koo et al. 2008) [higher-order]                       92.02    93.16
Chen2009 (Chen et al. 2009) [higher-order]                     92.40    93.16
Suzuki2009 (Suzuki et al. 2009) [higher-order, cluster]        92.70    93.79
Zhou2011 (Zhou et al. 2011) [higher-order]                     91.98    92.64
Chen2011 (Chen et al. 2013) [higher-order]                     92.76    93.77
Ambiguity-aware                                                92.34    93.19

Table 4.5 UAS comparison on CTB5 (test)

                                                              UAS
Supervised   Li2012 (Li et al. 2012) [joint]                  82.37
             Bohnet2012 (Bohnet and Nivre 2012) [joint]       81.42
             Chen2013 (Chen et al. 2013) [higher-order]       81.01
Semi         Chen2013 (Chen et al. 2013) [higher-order]       83.08
             Ambiguity-aware                                  82.89

Table 4.5 makes comparisons with previous results on the Chinese test data. Li et al. (2012) and Bohnet and Nivre (2012) use joint models for POS tagging and dependency parsing, significantly outperforming their pipeline counterparts. Our approach can be combined with their work to utilize unlabeled data to improve both POS tagging and parsing simultaneously. Our work achieves comparable accuracy with Chen et al. (2013), although they adopt the higher-order model of Carreras (2007). Again, our method may be combined with their work to achieve higher performance.

4.3.3.4 Analysis

To better understand the effectiveness of our proposed approach, we make a detailed analysis using the semi-supervised GParser with "Parse B+Z" on the English data sets.

Contribution of unlabeled data with regard to syntactic divergence: We divide the unlabeled data into three sets according to the divergence of the 1-best outputs of Berkeley Parser and ZPar. The first set contains the sentences for which the two parsers produce identical parse trees, denoted by “consistent,” which corresponds to the setting for tri-training. The other sentences are split into two sets according to the averaged number of heads per word in their parse forests, denoted by “low divergence” and “high divergence,” respectively.


Table 4.6 Performance of our semi-supervised GParser with different sets of “Parse B+Z” on PTB (test). “Len” means averaged sentence length

Unlabeled data            UAS     #Sent   Len     H/W     Oracle
NULL                      92.34   0       –       –       –
Consistent (tri-train)    92.94   0.7M    18.25   1.000   97.65
Low divergence            92.94   0.5M    28.19   1.062   96.53
High divergence           93.03   0.5M    27.85   1.211   94.28
ALL                       93.19   1.7M    24.15   1.087   96.09

Fig. 4.5 Performance of GParser with different sizes of “Parse B+Z” on PTB (test): UAS plotted against unlabeled data size (from 0 to 1.7M sentences)

Then we train semi-supervised GParser using the three sets of unlabeled data. Table 4.6 illustrates the results and statistics. We can see that the unlabeled data with identical outputs from Berkeley Parser and ZPar tends to consist of short sentences (18.25 words per sentence on average). The results show that all three sets of unlabeled data can help the parser. In particular, the unlabeled data with highly divergent structures leads to a slightly higher improvement. This demonstrates that our approach can better exploit unlabeled data on which parsers of different views produce divergent structures.
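This partition can be sketched in a few lines of code. The following is a minimal illustration, assuming each unlabeled sentence is available as the pair of head sequences produced by the two parsers; the threshold of 1.1 heads per word used here to separate “low” from “high” divergence is a hypothetical value for illustration, since the exact split point is not given above.

```python
def partition_by_divergence(sentences, split_threshold=1.1):
    """Partition auto-parsed sentences by parser (dis)agreement.

    `sentences` is a list of (heads_a, heads_b) pairs, one head index per
    word from each of the two parsers.  The split threshold on averaged
    heads per word is a hypothetical value chosen for illustration.
    """
    consistent, low_div, high_div = [], [], []
    for heads_a, heads_b in sentences:
        if heads_a == heads_b:
            consistent.append((heads_a, heads_b))
            continue
        # Merge the two 1-best trees into a simple parse forest:
        # each word keeps the set of heads proposed by either parser.
        forest = [{ha, hb} for ha, hb in zip(heads_a, heads_b)]
        heads_per_word = sum(len(s) for s in forest) / len(forest)
        if heads_per_word <= split_threshold:
            low_div.append((heads_a, heads_b))
        else:
            high_div.append((heads_a, heads_b))
    return consistent, low_div, high_div

# toy usage: two 4-word sentences
sents = [([2, 0, 2, 3], [2, 0, 2, 3]),   # parsers agree -> consistent
         ([2, 0, 2, 3], [3, 0, 2, 3])]   # one disagreement
print([len(s) for s in partition_by_divergence(sents)])
```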

Impact of unlabeled data size: To understand how our approach performs with regard to the unlabeled data size, we train semi-supervised GParser with different sizes of unlabeled data. Figure 4.5 shows the accuracy curve on the test set. We can see that the parser consistently achieves higher accuracy with more unlabeled data, demonstrating the effectiveness of our approach. We expect that our approach has the potential to achieve higher accuracy with more additional data.


4.4 Summary

In this chapter, we have introduced the approaches belonging to the whole-tree level, including self-training and co-training. We mainly focus on the generalized framework of ambiguity-aware ensemble training proposed in Li et al. (2014). For each unlabeled sentence, the 1-best parse trees of several diverse parsers are used to compose ambiguous labels, represented by a parse forest. The training objective is to maximize the mixed likelihood of both the labeled data and the auto-parsed unlabeled data with ambiguous labels.

References

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th annual conference on computational learning theory, Madison (pp. 92–100).

Bohnet, B. (2010). Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd international conference on computational linguistics (COLING), Beijing (pp. 89–97). COLING 2010 Organizing Committee. http://www.aclweb.org/anthology/C10-1011.

Bohnet, B., & Nivre, J. (2012). A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of EMNLP, Jeju Island (pp. 1455–1465).

Carreras, X. (2007). Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 957–961). Association for Computational Linguistics.

Charniak, E., Blaheta, D., Ge, N., Hall, K., Hale, J., & Johnson, M. (2000). BLLIP 1987–89 WSJ Corpus Release 1, LDC2000T43. Linguistic Data Consortium.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of EMNLP, Singapore (pp. 570–579).

Chen, W., Zhang, M., & Zhang, Y. (2013). Semi-supervised feature transformation for dependency parsing. In Proceedings of EMNLP, Seattle (pp. 1303–1313). Association for Computational Linguistics. http://www.aclweb.org/anthology/D13-1129.

Duan, X., Zhao, J., & Xu, B. (2007). Probabilistic models for action-based Chinese dependency parsing. In Proceedings of ECML/ECPPKDD, Warsaw.

Finkel, J. R., Kleeman, A., & Manning, C. D. (2008). Efficient, feature-based, conditional random field parsing. In Proceedings of ACL, Columbus (pp. 959–967).

Huang, C. R. (2009). Tagged Chinese Gigaword Version 2.0, LDC2009T14. Linguistic Data Consortium.

Huang, Z., & Harper, M. (2009). Self-training PCFG grammars with latent annotations across languages. In Proceedings of EMNLP, Singapore (pp. 832–841).

Kawahara, D., & Uchimoto, K. (2008). Learning reliability of parses for domain adaptation of dependency parsing. In Proceedings of IJCNLP, Hyderabad.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Koo, T., & Collins, M. (2010). Efficient third-order dependency parsers. In Proceedings of ACL, Uppsala (pp. 1–11). Association for Computational Linguistics.


Li, Z., Zhang, M., Che, W., & Liu, T. (2012). A separately passive-aggressive training algorithm for joint POS tagging and dependency parsing. In Proceedings of the 24th international conference on computational linguistics (COLING 2012), Mumbai. COLING 2012 Organizing Committee.

Li, Z., Zhang, M., & Chen, W. (2014). Ambiguity-aware ensemble training for semi-supervised dependency parsing. In Proceedings of the annual meeting of the association for computational linguistics (ACL 2014), Baltimore (pp. 457–467).

McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. In Proceedings of NAACL, New York (pp. 152–159).

McDonald, R., & Pereira, F. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL, Trento (pp. 81–88).

Noreen, E. W. (1989). Computer-intensive methods for testing hypotheses: An introduction. New York: Wiley.

Petrov, S., & Klein, D. (2007). Improved inference for unlexicalized parsing. In Human language technologies 2007: The conference of the North American chapter of the association for computational linguistics; Proceedings of the main conference, Rochester (pp. 404–411). Association for Computational Linguistics.

Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL, Prague (pp. 1044–1050).

Søgaard, A., & Rishøj, C. (2010). Semi-supervised dependency parsing using generalized tri-training. In Proceedings of ACL, Uppsala (pp. 1065–1073).

Spreyer, K., & Kuhn, J. (2009). Data-driven dependency parsing of new languages using incomplete and noisy training data. In Proceedings of CoNLL, Boulder (pp. 12–20).

Surdeanu, M., & Manning, C. D. (2010). Ensemble models for dependency parsing: Cheap and good? In Proceedings of NAACL, Los Angeles (pp. 649–652).

Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP, Singapore (pp. 551–560). Association for Computational Linguistics.

Täckström, O., McDonald, R., & Nivre, J. (2013). Target language adaptation of discriminative transfer parsers. In Proceedings of NAACL, Atlanta (pp. 1061–1071).

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL, Cambridge (pp. 189–196).

Zhang, H., & McDonald, R. (2012). Generalized higher-order dependency parsing with cube pruning. In Proceedings of EMNLP-CoNLL, Jeju Island (pp. 320–331).

Zhang, Y., & Nivre, J. (2011). Transition-based dependency parsing with rich non-local features. In Proceedings of ACL-HLT, Portland (pp. 188–193). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-2033.

Zhou, Z. H., & Li, M. (2005). Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17(11), 1529–1541.

Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting web-derived selectional preference to improve statistical dependency parsing. In Proceedings of ACL-HLT, Portland (pp. 1556–1565). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1156.


Chapter 5 Training with Lexical Information

This chapter describes the approaches at the word level, which make use of information based on word surfaces. Lexical information is very important for resolving ambiguous relationships in dependency parsing, but lexicalized statistics are sparse and difficult to estimate directly given a limited training data set. Thus it is attractive to consider learning lexical information from large-scale unlabeled data, such as web data.

Koo et al. (2008) introduce intermediate entities which lie between the words and the part-of-speech tags, but capture the information necessary to resolve the ambiguities. They propose a semi-supervised approach which has two steps: (1) use large-scale unlabeled data to learn word clusters and (2) define a set of word cluster-based features for dependency parsing models. The information of word clusters has been used in the task of named entity recognition (Miller et al. 2004). Compared with named entity recognition, dependency parsing has more complex structured relationships.

Zhou et al. (2011) propose an alternative approach to learning lexical information by exploiting web-derived selectional preference. The lexical statistics are obtained from two types of web-scale corpora: the web itself and the Google Web 1T 5-gram corpus. The underlying idea is that web-scale data has large coverage for word pair acquisition. Based on such large data, the parsing model can make use of the additional information to capture word-to-word relationships.

Compared with the approaches of the whole-tree and partial-tree levels, those at the lexical level do not need to parse the unannotated data. The experimental results of Koo et al. (2008) and Zhou et al. (2011) show that the approaches are simple yet effective.


5.1 An Approach Based on Word Clusters

In this section, we briefly introduce the approach proposed by Koo et al. (2008), which uses word clusters to represent a set of new features for parsing models.

5.1.1 Generating Word Clusters

In the approach, the Brown clustering algorithm is used to provide word clusters. The algorithm has been previously used in other NLP applications, such as NER (Miller et al. 2004). It is a bottom-up agglomerative word clustering algorithm which derives a hierarchical clustering of words. The input to the algorithm is a vocabulary of words to be clustered and a text containing these words. The output is a binary tree, in which the leaves are the words. Each internal node is a cluster containing the words in the subtree rooted at that node. Here, the algorithm generates a hard clustering in which each word belongs to exactly one cluster. Initially, each word in the vocabulary is treated as its own distinct cluster. The algorithm repeatedly merges the two clusters that cause the smallest decrease in the likelihood of the text corpus. A class-based bigram language model is defined on the word clusters. Given a clustering C that maps each word to a cluster, the language model assigns a probability to the text corpus $T = w_1, \ldots, w_n$, where the maximum-likelihood estimates of the model parameters are used. The likelihood is computed as follows:

P(T) = \frac{1}{n} \log P(w_1, \ldots, w_n)
     = \frac{1}{n} \log P(w_1, \ldots, w_n, C(w_1), \ldots, C(w_n))
     = \frac{1}{n} \log \prod_{i=1}^{n} P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i))        (5.1)

where C(w_i) refers to the cluster that w_i belongs to.

After we obtain the binary tree, each word can be assigned a binary string by following its path from the root to its leaf, assigning a 0 for each left branch and a 1 for each right branch. Table 5.1 shows example bit strings from the output.1 We specify different sets of clusters by prefixes of the bit strings. Short prefixes specify short paths from the root node and thus large clusters, while long ones specify long paths and small clusters. This strategy is used in Miller et al. (2004) and Koo et al. (2008).

1 The clusters are provided by Koo et al. (2008); they recover at most 1,000 distinct bit strings.
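To make the use of the clustering output concrete, the following minimal sketch shows how bit strings and their prefixes can be looked up. It assumes the clustering result is already available as a word-to-bit-string table (the example strings are taken from Table 5.1); the helper name `cluster` is a hypothetical illustration, not code from Koo et al. (2008).

```python
# Minimal sketch: map words to Brown-cluster bit strings and take prefixes
# of different lengths to obtain coarser (short prefix) or finer (long
# prefix / full string) clusters.  The word-to-bit-string table is assumed
# to be available, e.g. from the clusters distributed by Koo et al. (2008).
word2bits = {
    "coast": "100100100",
    "museum": "100100100",
    "broker": "101001000111",
    "tennis": "111011011110",
}

def cluster(word, prefix_len=None):
    """Return the cluster id of `word`: a prefix of its bit string,
    or the full string when `prefix_len` is None."""
    bits = word2bits.get(word.lower())
    if bits is None:
        return None                     # unknown word: no cluster feature
    return bits if prefix_len is None else bits[:prefix_len]

print(cluster("tennis", 4))    # '1110'   -- short prefix, large cluster
print(cluster("tennis", 6))    # '111011' -- longer prefix, smaller cluster
print(cluster("tennis"))       # full bit string, used in place of the word form
```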


Table 5.1 Sample bit strings

Bit-string       Word
100100100        Coast
100100100        Cartel
100100100        Province
100100100        Museum
100100100        Island
100100100        Region
100100100        Country
100100100        City
...
101001000111     Survivor
101001000111     Vendor
101001000111     Regulator
101001000111     Raider
101001000111     Salesman
101001000111     Lender
101001000111     Competitor
101001000111     Dealer
101001000111     Buyer
101001000111     Broker
101001000111     Source
101001000111     Trader
...
111011011110     Cartoon
111011011110     Cogeneration
111011011110     Convenience
111011011110     Basketball
111011011110     Cruise
111011011110     Tennis
111011011110     Golf
111011011110     Football
111011011110     Baseball
111011011110     Assembly
111011011110     Sports
111011011110     Video
111011011110     Radio
111011011110     TV
111011011110     Television

5.1.2 Word Cluster-Based Features

A set of word cluster-based features is defined based on the word clusters to assist the parser. In the final system, we employ the base features defined over word forms and part-of-speech tags (described in Sect. 2.1.5) and the word cluster-based features.


The cluster-based feature sets are supersets of the base feature set, adding an additional layer of features that incorporate word clusters. As we mentioned above, we use prefixes of the Brown cluster hierarchy to generate different sets of word clusters. Koo et al. (2008) reported that the optimal lengths of bit-string prefixes were task specific for dependency parsing and named entity recognition. After experimenting with many different feature configurations, two different types of word clusters are used in the experiments:

1. Short bit-string prefixes (e.g., 4–6 bits), which are used as replacements for part-of-speech tags.

2. Full bit strings, which are used as substitutes for word forms.

Table 5.2 shows some examples of cluster-based features, where ht is the head POS tag, hw is the head word, hc4 is the 4-bit prefix of the head's bit string, hc6 is the 6-bit prefix, hc* is the full bit string of the head, and m, s, and g refer to the modifier, sibling, and grandchild, respectively. The base features involve word form and part-of-speech tag interactions between the heads and modifiers of dependencies. The cluster-based features are generated by replacing word forms and part-of-speech tags with bit strings in the template structure of the original baseline features. For example, the base template “ht, mt” (the first line in Table 5.2) can be transformed into “hc4, mc4” by using 4-bit strings to substitute for the part-of-speech tags of the head and modifier.

The experiments of Koo et al. (2008) also consider hybrid features including one bit string and one part-of-speech tag. The results show that the hybrid features can provide large improvements. The reason might be that the clustering results can be noisy or only weakly relevant to dependencies. It is useful to use the clusters anchored to word forms and part-of-speech tags.
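The substitution scheme can be illustrated with a short sketch. It assumes a word-to-bit-string table as above and generates only a handful of sample templates in the spirit of Table 5.2; the function name and feature string format are hypothetical, not the exact representation used by Koo et al. (2008).

```python
def cluster_features(head, mod, word2bits):
    """Generate a few cluster-based features for a head-modifier pair by
    substituting bit-string prefixes for POS tags and word forms, in the
    spirit of Table 5.2.  `head` and `mod` are (word, pos) pairs; the
    templates below are only a sample of the full feature set."""
    hw, ht = head
    mw, mt = mod
    hb = word2bits.get(hw.lower(), "")
    mb = word2bits.get(mw.lower(), "")
    hc4, mc4 = hb[:4], mb[:4]
    hc6, mc6 = hb[:6], mb[:6]
    feats = [
        f"hc4={hc4}|mc4={mc4}",       # replaces the base template ht, mt
        f"hc6={hc6}|mc6={mc6}",
        f"hc*={hb}|mc*={mb}",         # full strings substitute for word forms
        f"hc4={hc4}|mt={mt}",         # hybrid: cluster prefix + POS tag
        f"ht={ht}|mc4={mc4}",
    ]
    return feats

word2bits = {"ate": "0101", "fish": "100100100"}
print(cluster_features(("ate", "VBD"), ("fish", "NN"), word2bits))
```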

Table 5.2 Examples of base and cluster-based feature templates (Borrowed from Koo et al. 2008)

Base features        Cluster-based features
ht, mt               hc4, mc4
hw, mw               hc6, mc6
hw, ht, mt           hc*, mc*
hw, ht, mw           hc4, mt
ht, mw, mt           ht, mc4
hw, mw, mt           hc6, mt
hw, ht, mw, mt       ht, mc6
...                  hc4, mw
                     hw, mc4
                     ...
ht, mt, st           hc4, mc4, sc4
ht, mt, gt           hc6, mc6, sc6
...                  ht, mc4, sc4
                     hc4, mc4, gc4
                     ...


5.2 An Approach Based on Web-Derived Selectional Preference

In this section, we introduce the approach proposed by Zhou et al. (2011), which exploits selectional preference to improve dependency parsing. Lexical statistics are needed for resolving ambiguous relationships, but they are sparse and difficult to estimate directly given the limited size of training data. The approach of Koo et al. (2008) introduces word clusters, which are lexical intermediaries at a coarser level than words, but does not consider the selectional preference for word-to-word interactions, which is very important for dependency parsing. The approach uses two sources of unannotated data to capture the bilexical relationship at the word-to-word level: (1) the web, which is a huge data source, and (2) a web-scale N-gram corpus (Google V1 in short) released by Google (Thorsten and Franz 2006). The information is then presented as additional features for the parsing models.

5.2.1 N-Gram Counts

The selectional preference is measured by scores derived from web-scale resources. To calculate the scores, we need N-gram counts first. For the web, N-gram counts are approximated by Google hits (from the Google search engine). For Google V1, we can directly retrieve the counts from the N-gram tables. N-grams that appear fewer than 40 times are removed.

5.2.2 Selectional Preference Features

5.2.2.1 Association Scores

The pointwise mutual information (PMI) is used to compute the association score between a pair of words in the dependency trees. The PMI score is calculated by

\mathrm{PMI}(a, b) = \log \frac{p(\text{“ab”})}{p(\text{“a”}) \, p(\text{“b”})}        (5.2)

where p(“ab”) is the co-occurrence probability of “a b.” We can use the N-gram counts to compute the probabilities. To obtain the N-gram counts for the web, queries with quotation marks are sent to the Google search engine. Two special binary features are defined if the probabilities are zero.

The preference of three words is also measured by the PMI score as follows:

\mathrm{PMI}(a, b, c) = \log \frac{p(\text{“abc”})}{p(\text{“ab”}) \, p(\text{“bc”})}        (5.3)

Based on the preference of three words, we can define tri-gram features for the second-order parsing models.


Table 5.3 Examples of base and N-gram feature templates (Borrowed from Zhou et al. 2011)

Base features        N-gram features
hw, mw               hw, mw, PMI(hw, mw)
hw, ht, mw           hw, ht, mw, PMI(hw, mw)
hw, mw, mt           hw, mw, mt, PMI(hw, mw)
hw, ht, mw, mt       hw, ht, mw, mt, PMI(hw, mw)
...                  ...
hw, mw, sw           hw, mw, sw, PMI(hw, mw, sw)
hw, mw, gw           hw, mw, gw, PMI(hw, mw, gw)
...                  ...

5.2.2.2 Feature Templates

After we obtain the PMI scores, a new set of features is defined based on them. The base features consider word-to-word, tag-to-tag, or word-to-tag interactions between the head and modifier. The N-gram features are defined for word-to-word interactions by mimicking the template structure of the base features. Table 5.3 shows the newly defined N-gram feature templates, where hw and mw refer to the word surfaces of the head and modifier, respectively. All features are combined with the direction of the dependency arc as well as the distance between the head and modifier. The second-order features are defined over the PMI scores of three words.

The parsing models (McDonald and Nivre 2007) contain only binary features, while the values of the PMI scores are real numbers that are not in a bounded range. If the range of the values is too large, they will exert much more influence than the binary features. To solve this problem, Zhou et al. (2011) convert the PMI scores into a scaled range by replacing each PMI score x with (x − μ)/σ, where μ and σ are the mean and standard deviation of the PMI distribution, respectively.
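A minimal sketch of this feature computation is given below, assuming the N-gram counts are already available as a dictionary (the toy counts stand in for Google V1 or web hit counts). It combines Eq. (5.2) with the (x − μ)/σ rescaling; the helper names are hypothetical.

```python
import math
from statistics import mean, stdev

def pmi2(count, a, b, ab, total):
    """PMI(a, b) = log p("a b") / (p("a") p("b")), Eq. (5.2),
    with probabilities estimated from n-gram counts."""
    p_a, p_b = count[a] / total, count[b] / total
    p_ab = count.get(ab, 0) / total
    if p_ab == 0:
        return None       # handled by a special binary feature in the model
    return math.log(p_ab / (p_a * p_b))

def standardize(scores):
    """Rescale PMI scores to (x - mu) / sigma so that real-valued features
    do not overwhelm the binary features."""
    mu, sigma = mean(scores), stdev(scores)
    return [(x - mu) / sigma for x in scores]

# Toy counts standing in for Google V1 / web counts (real counts below 40
# would be discarded).
count = {"specialist": 900, "discussion": 700, "specialist discussion": 120,
         "held": 400, "held discussion": 60}
total = 10_000
raw = [pmi2(count, "specialist", "discussion", "specialist discussion", total),
       pmi2(count, "held", "discussion", "held discussion", total)]
print(standardize(raw))
```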

5.3 Experiments

In this section, we introduce the experimental results of Koo et al. (2008) and show some new results based on a stronger baseline.

5.3.1 Data Sets

We describe the data sets used in Koo et al. (2008) below. The Penn Treebank (Marcus et al. 1993) is used in our experiments, and the tool “Penn2Malt”2 is used to convert the data into dependency structures using a standard set of head rules (Yamada and Matsumoto 2003).

2http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html


The data are divided into a training set (sections 2–21), a development set (section 22), and a test set (section 23), and we use the same setting for part-of-speech tags. We also use the MXPOST (Ratnaparkhi 1996) tagger trained on the training data to provide part-of-speech tags for the development and the test set. In practice, parsers trained on data with auto-generated part-of-speech tags perform a little better than those trained on data with gold part-of-speech tags when parsing sentences with auto-generated part-of-speech tags. Thus, we use 10-way jackknifing (tagging each fold with the tagger trained on the other nine folds) to generate part-of-speech tags for the training set, as did Koo et al. (2008). For the unannotated data, we use the BLLIP corpus (Charniak et al. 2000), which contains about 43 million words of WSJ text.3 We used the MXPOST tagger trained on the training data to assign part-of-speech tags and the baseline parser of Chen et al. (2013) to process the sentences of the BLLIP corpus.

We measure the parser quality by the unlabeled attachment score (UAS), i.e., the percentage of tokens (excluding all punctuation tokens) with the correct HEAD. We also evaluate on complete dependency analysis (COMP).

5.3.2 Experimental Results

We list the results of Koo et al. (2008), which uses the same settings of data sets. The results are shown in Table 5.4, where “Koo2008-dep1” and “Koo2008-dep2” are the baselines for first- and second-order models, respectively, and “Koo2008-dep1c” and “Koo2008-dep2c” are the corresponding systems with the cluster-based features. The results show that the systems with the new features outperform the baselines.

We use the baseline system of Chen et al. (2013) as another baseline to investigate whether the cluster-based features work well on a much stronger baseline. The results are shown in Table 5.4, where “Chen2013-dep2” refers to the baseline system of Chen et al. (2013) and “Chen2013-dep2c” refers to the system with the cluster-based features. From the table, we find that Chen2013-dep2 performs better than the baseline (Koo2008-dep2) of Koo et al. (2008). The results show that the cluster-based features still work very well on such a strong baseline.

Table 5.4 Experimental results on PTB (test)

System            UAS     COMP
Koo2008-dep1      90.84   –
Koo2008-dep1c     92.23   –
Koo2008-dep2      92.02   –
Koo2008-dep2c     93.16   –
Chen2013-dep2     92.78   48.08
Chen2013-dep2c    93.37   49.26

3 We ensure that the text used for extracting subtrees does not include the sentences of the Penn Treebank.


5.4 Summary

In this chapter, we have introduced effective approaches to improving dependency parsing by using lexical information obtained from unlabeled data. The first approach is proposed by Koo et al. (2008), which introduces intermediate entities that lie between the words and the part-of-speech tags but capture the information necessary to resolve the ambiguities. The second one is proposed by Zhou et al. (2011), which learns lexical information by exploiting web-derived selectional preference.

References

Charniak, E., Blaheta, D., Ge, N., Hall, K., Hale, J., & Johnson, M. (2000). BLLIP 1987–89 WSJ Corpus Release 1, LDC2000T43. Linguistic Data Consortium.

Chen, W., Zhang, M., & Zhang, Y. (2013). Semi-supervised feature transformation for dependency parsing. In Proceedings of EMNLP, Seattle (pp. 1303–1313). Association for Computational Linguistics. http://www.aclweb.org/anthology/D13-1129.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

McDonald, R., & Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. In Proceedings of EMNLP-CoNLL, Prague (pp. 122–131).

Miller, S., Guinness, J., & Zamanian, A. (2004). Name tagging with word clusters and discriminative training. In D. Marcu, S. Dumais, & S. Roukos (Eds.), HLT-NAACL 2004: Main proceedings, Boston (pp. 337–342). Association for Computational Linguistics.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP 1996, Philadelphia (pp. 133–142).

Thorsten, B., & Franz, A. (2006). Web 1T 5-gram Version 1, LDC2006T13. Linguistic Data Consortium. https://catalog.ldc.upenn.edu/LDC2006T13.

Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT, Nancy (pp. 195–206).

Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting web-derived selectional preference to improve statistical dependency parsing. In Proceedings of ACL-HLT 2011, Portland (pp. 1556–1565). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1156.


Chapter 6 Training with Bilexical Dependencies

In this chapter, we describe the approach which makes use of the information of bilexical dependencies from auto-parsed data in order to improve parsing accuracy. First, all the sentences in the unlabeled data are parsed by a baseline parser. Subsequently, information on short dependency relations is extracted from the parsed data, because the accuracies for short dependencies are relatively higher than those for others. Finally, we train another parser by using the extracted information as features.

Given a set of labeled sentences, it is easy to train a supervised dependency parser. However, the sizes of training data are usually small, especially for resource-poor languages. This is a problem, in practice, for parsing unknown word pairs. van Noord (2007) and Chen et al. (2008) use the information of bilexical dependencies in auto-parsed data to improve the performance of dependency parsing.

In the procedures of van Noord (2007) and Chen et al. (2008), the first step is to parse the raw sentences using a baseline parser. Then the information of bilexical dependencies is collected from the parsed trees. Finally, a new set of features is defined based on this information for the parsing models. Compared with the self- and co-training methods employed by Sagae and Tsujii (2007) and Reichart and Rappoport (2007), these approaches employ information on word pairs in auto-parsed data instead of selecting entire sentences for training new parsers. It is difficult to detect reliably parsed sentences, but relatively reliable parsed word pairs can be obtained.

6.1 A Case Study

Currently used statistical dependency parsers provide poor results when the dependency length increases (McDonald and Nivre 2007). Here, the length of a dependency from word w_i to word w_j is equal to |i − j|.


Fig. 6.1 F1 score relative to dependency length (baseline parser; F1 on the y-axis, dependency length on the x-axis)

Figure 6.1 shows the F1 score1 obtained by using a deterministic parser relative to the dependency length on our testing data. The figure indicates that the F1 score decreases when the dependency length increases, as observed by McDonald and Nivre (2007). In Fig. 6.1, we also notice that the parser provides quite good results for short dependencies (94.57 % for dependency length = 1 and 89.40 % for dependency length = 2). Figure 6.2 shows the percentages of different dependency lengths in the data. We can find that over 58 % of all the dependencies have either length = 1 or length = 2. In this chapter, short dependency refers to dependencies with length either 1 or 2. The information on short dependencies is expected to help to parse words separated by longer distances.

In general, the two words in a head-dependent relation in one sentence can be adjacent words (word distance = 1), neighboring words (word distance = 2), or words with greater distance (word distance > 2) in other sentences. Here, we define the word distance of word w_i and word w_j to be equal to |i − j|. In Chinese, some modifiers can be added between the two words in a head-dependent relation. For instance, a prepositional phrase can be added between a noun and a verb that are in a subject-predicate relation. And a noun can be added between an adjective and the modified noun. Figure 6.3 shows that “专家级(specialist level)” and “会谈(discussion)” have a head-dependent relationship with different distances in the sentences,

1 Precision represents the percentage of predicted arcs of length d that are correct, and recall measures the percentage of gold-standard arcs of length d that are correctly predicted. F1 = 2 × precision × recall / (precision + recall).


Fig. 6.2 The percentage of different dependency lengths (test data; percentage on the y-axis, dependency length on the x-axis)

Fig. 6.3 Dependencies with different distances

where “专家级(specialist level)” is an adjective (JJ) and “会谈(discussion)” is a noun (NN). “Dependent” is optional to a head-dependent structure (Nivre and Kubler 2006). We expect that the information obtained from word pairs with different distances can be shared with each other and thus the parser can be improved.

First, we demonstrate the method of using the dependency between adjacent words in unlabeled data to parse two words whose word distance is 2. The string “专家级JJ(specialist level)/工作NN(working)/会谈NN(discussion)” should be tagged as solution (a) in Fig. 6.4. However, the baseline parser may select solution (b) in Fig. 6.4 without using any additional information. The question is: how to assign the head for “专家级(specialist level)”? Is it “工作(working)” or “会谈(discussion)”?

As Fig. 6.1 suggests, the baseline parser is good at tagging the relationship between adjacent words. We expect that dependencies of adjacent words can provide useful information for parsing words whose word distances are longer.


Fig. 6.4 Two solutions (a) and (b) for “专家级(specialist level)/工作(working)/会谈(discussion)”

Fig. 6.5 Parsing “专家级(specialist level)/会谈(discussion)” in unlabeled data

By searching for the string “专家级(specialist level)/会谈(discussion)” at www.google.com, many relevant documents can be retrieved. The baseline parser may assign the relationships between two adjacent words in the retrieved documents, as shown in Fig. 6.5. We can find that “会谈(discussion)” is the head of “专家级(specialist level)” in many cases.

Now, consider what a learning model could do to assign the appropriate relationship between “专家级(specialist level)” and “会谈(discussion)” in the string “专家级(specialist level)/工作(working)/会谈(discussion).” In this case, we provide additional information to “会谈(discussion)” by saying that it is the possible head of “专家级(specialist level)” in the unlabeled data. In this manner, the learning model may use this information to make correct decisions.

Thus far, we have demonstrated how to use the dependency relation between adjacent words in unlabeled data to help parse two words whose word distance is 2. Similarly, we can provide information for parsing two words whose word distance is longer. On the basis of the above observations, it is possible to exploit information on bilexical dependencies from large-scale auto-parsed data to improve dependency parsing.

6.2 Reliable Bilexical Dependencies

In this section, we describe the approach of exploiting reliable features from unlabeled data that has been parsed by the baseline parser. First, the unlabeled data is preprocessed to obtain auto-parsed data.


Subsequently, we collect bilexical dependencies (word pairs) from the auto-parsed data. Finally, we represent a set of features on the basis of the collected word pairs and train another parser on the new data representation. Figure 6.6 shows the architecture of the approach.

6.2.1 Unlabeled Data Preprocessing

The input to the approach is unlabeled data, which can easily be obtained. For the baseline parser, the corpus should have part-of-speech (POS) tags. Therefore, we assign the POS tags by using a POS tagger. For Chinese sentences, we segment the sentences into words before POS tagging. After data preprocessing, we obtain the word-segmented sentences with POS tags. We then use the baseline parser to parse all sentences in the unlabeled data. Finally, we obtain auto-parsed data.

6.2.2 Collecting Reliable Word Pairs

The baseline parser can provide complete dependency parse trees for all sentences in the unlabeled data. As shown in Fig. 6.1, short dependencies are more reliable. To offer reliable information to the model, we extract word pairs having short dependency relations from the newly auto-parsed data.

6.2.2.1 Extracting Word Pairs from Auto-parsed Data

Suppose that the current two words are w_i and w_j and the word pair pt is “w_i-w_j.” Because POS tags are too generic, we only consider word pairs.

Fig. 6.6 Architecture of the approach (pipeline components: unlabeled data; segmenting, POS tagging, and parsing; auto-parsed data; extracting word pairs having short dependencies; classifying extracted word pairs into buckets; DepList; training data; new data representation; model training; new parser)


Thus, in this section, we describe the method to extract word pairs that have short dependency relations in the sentences.

In a parsed sentence, if the dependency length between two words (w_i and w_j) is either 1 or 2, we add this word pair pt to a list named DepList and count its frequency. We also record the direction DIR = {LA, RA} and length LEN = {L1, L2} of the dependency. L1 refers to pairs with dependency length 1, L2 refers to pairs with dependency length 2, RA refers to a right arc, and LA refers to a left arc. For example, “专家级(specialist level)” and “会谈(discussion)” are adjacent words in the sentence “我们(We)/举行(held)/专家级(specialist level)/会谈(discussion)/” and have a left dependency arc assigned by the baseline parser. Hence, we have the word pair “专家级(specialist level)-会谈(discussion)” with “LA-L1.” The pair with “LA-L1” and its frequency are added to the DepList. We use Freq(pt, DIR, LEN) to denote the frequency.

6.2.2.2 Classifying into Buckets

We group word pairs into different clusters according to the following equation:

C(pt, \mathrm{DIR}, \mathrm{LEN}) =
\begin{cases}
C_1     & \text{if } Freq(pt, \mathrm{DIR}, \mathrm{LEN}) \le f_1 \\
C_2     & \text{if } f_1 < Freq(pt, \mathrm{DIR}, \mathrm{LEN}) \le f_2 \\
\;\vdots \\
C_n     & \text{if } f_{n-1} < Freq(pt, \mathrm{DIR}, \mathrm{LEN}) \le f_n \\
C_{n+1} & \text{if } f_n < Freq(pt, \mathrm{DIR}, \mathrm{LEN})
\end{cases}        (6.1)

In practice, the pairs are grouped into four clusters corresponding to four levels: “high frequency,” “middle frequency,” “low frequency,” and “infrequency.” The threshold values (f1, f2, f3) are selected by tuning on the development data. After tuning, we obtain four clusters by setting f1 = 1, f2 = 7, and f3 = 14. Among all the collected pairs, the pairs in C1, C2, C3, and C4 account for approximately 70 %, 24 %, 3 %, and 3 %, respectively. To avoid changing the threshold values when using different sizes of unlabeled data, we can use the percentages as threshold values.

We then form buckets by combining the clusters with the dependency length and dependency direction. For example, if the frequency of the pair “专家级(specialist level)-会谈(discussion)” with “LA-L1” is 20, it belongs to the bucket “C4_LA_L1.” Additional examples are provided in Table 6.1.
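The collection and bucketing procedure can be sketched as follows, assuming the auto-parsed data is available as lists of (word, head) pairs; the representation of the DepList as a Python Counter and the helper names are illustrative choices, while the thresholds follow the tuned values reported above.

```python
from collections import Counter

def collect_deplist(parsed_sentences):
    """Build the DepList: frequencies of (pair, DIR, LEN) entries for
    dependencies of length 1 or 2 found in auto-parsed sentences.
    Each sentence is a list of (word, head) pairs with 1-based head
    indices and head 0 marking the root."""
    deplist = Counter()
    for sent in parsed_sentences:
        for i, (word, head) in enumerate(sent, start=1):
            if head == 0:
                continue
            length = abs(i - head)
            if length > 2:                            # keep only short dependencies
                continue
            direction = "LA" if head > i else "RA"    # LA: head to the right
            left, right = sorted((i, head))
            pair = (sent[left - 1][0], sent[right - 1][0])   # sentence order
            deplist[(pair, direction, f"L{length}")] += 1
    return deplist

def bucket(freq, f1=1, f2=7, f3=14):
    """Map a pair frequency to one of the four clusters of Eq. (6.1),
    using the tuned thresholds f1=1, f2=7, f3=14."""
    if freq <= f1:
        return "C1"
    if freq <= f2:
        return "C2"
    if freq <= f3:
        return "C3"
    return "C4"

toy = [[("we", 2), ("held", 0), ("specialist", 4), ("discussion", 2)]]
deplist = collect_deplist(toy)
for (pair, d, l), freq in deplist.items():
    print(pair, f"{bucket(freq)}_{d}_{l}")   # e.g. ('specialist', 'discussion') C1_LA_L1
```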

6.3 Parsing with the Information on Word Pairs

In this section, we build a parser based on the transition-based parsing model (described in Sect. 2.2 of Chap. 2). The parser follows the settings of Nivre (2003).


Table 6.1 The examples in the DepList

Pair                                       LEN   DIR   Freq   Cluster   Bucket
专家(specialist)-学习(study)                L1    LA    1      C1        C1_LA_L1
专家(specialist)-学员(student)              L2    LA    1      C1        C1_LA_L2
专家(specialist)-学者(scholar)              L1    LA    390    C4        C4_LA_L1
专家(specialist)-学者(scholar)              L2    LA    153    C4        C4_LA_L2
专家(specialist)-询问(inquire)              L2    LA    1      C1        C1_LA_L2
询问(inquire)-病情(state of an illness)     L1    RA    2      C2        C2_RA_L1
询问(inquire)-当事人(persons involved)      L1    RA    1      C1        C1_RA_L1

6.3.1 New Features

Based on the buckets, a set of new features is designed for training or parsing with respect to the current two words: S0w and Q0w. We consider word pairs from the context around S0w and Q0w, and we obtain the buckets of the pairs from the DepList. According to dependency lengths, we divide the pairs into two sets: L1 pairs and L2 pairs.

First, we represent the features based on L1 pairs. We name these features L1 features. The L1 features are listed according to different word distances between S0w and Q0w, as follows:

1. Word distance is 1: (TN0) the bucket of the word pair of S0w and Q0w, and (TN1) the bucket of the word pair of S0w and the next token after Q0w.

2. Word distance is 2: (TN0) the bucket of the word pair of S0w and Q0w, (TN1) the bucket of the word pair of S0w and the next token after Q0w, and (TN_1) the bucket of the word pair of S0w and the token immediately before Q0w.

3. Word distance is 3 and 3+: (TN0) the bucket of the word pair of S0w and Q0w, (TN1) the bucket of the word pair of S0w and the next token after Q0w, and (TN_1) the bucket of the word pair of S0w and the token immediately before Q0w.

Thus, we obtain eight types of L1 features, including two types in item (1), three types in item (2), and three types in item (3). The features are formatted as “WordDistance:Position:PairBucket.” For example, suppose we have the string “专家级(specialist level)/w1/w2/w3/会谈(discussion).” Here, “专家级(specialist level)” is S0w and “会谈(discussion)” is Q0w. Thus, we obtain the feature “D3+:TN0:C4_LA_L1” for S0w and Q0w, because the word distance is 4 (3+) and “专家级(specialist level)-会谈(discussion)” belongs to the bucket “C4_LA_L1.” A pair can belong to two buckets because there exist two directions (LA and RA). Here, we use the bucket whose pair has the higher frequency.

Similarly, we represent the features based on L2 pairs. We name these features L2 features. The L2 features are listed as follows:

1. Word distance is 1: (TN1) the bucket of the word pair of S0w and the next token after Q0w.


2. Word distance is 2: (TN0) the bucket of the word pair of S0w and Q0w, and (TN1) the bucket of the word pair of S0w and the next token after Q0w.

We obtain three types of L2 features, including one type in item (1) and two types in item (2).
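The following sketch illustrates how the L1 features are formed in the “WordDistance:Position:PairBucket” format, assuming the DepList has already been reduced to a mapping from word pairs to their best bucket labels; the function name and data layout are hypothetical simplifications.

```python
def l1_features(sent, s0, q0, pair2bucket):
    """Form L1 features for the current pair (S0w, Q0w) in the
    'WordDistance:Position:PairBucket' format.  `sent` is the word list,
    `s0`/`q0` are 0-based indices, and `pair2bucket` maps a word pair to
    its L1 bucket (e.g. 'C4_LA_L1'), keeping the bucket of the more
    frequent direction."""
    dist = abs(q0 - s0)
    dtag = f"D{dist}" if dist <= 3 else "D3+"
    feats = []
    # TN0: S0w with Q0w; TN1: S0w with the token after Q0w;
    # TN_1: S0w with the token before Q0w (only when distance >= 2).
    positions = [("TN0", q0), ("TN1", q0 + 1)]
    if dist >= 2:
        positions.append(("TN_1", q0 - 1))
    for name, j in positions:
        if 0 <= j < len(sent):
            bucket = pair2bucket.get((sent[s0], sent[j]))
            if bucket:
                feats.append(f"{dtag}:{name}:{bucket}")
    return feats

sent = ["specialist", "w1", "w2", "w3", "discussion"]
pair2bucket = {("specialist", "discussion"): "C4_LA_L1"}
print(l1_features(sent, 0, 4, pair2bucket))   # ['D3+:TN0:C4_LA_L1']
```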

6.3.2 Training a New Parser

By using the base features (represented in Table 2.4 of Chap. 2) and the new features (represented in Sect. 6.3.1), the data is represented in a new feature space. A new parser is then trained on the new data representation.

6.4 Experiments

6.4.1 Experimental Settings

For the labeled data, we use the Chinese Treebank (CTB) version 4.02 in the experiments. We use the same rules for conversion and create the same data split as Wang et al. (2007): files 1–270 and files 400–931 for training, files 271–300 for testing, and files 301–325 for development. We use the gold-standard segmentation and POS tags in the CTB.

For the unlabeled data, we use the PFR corpus.3 It includes documents from the People’s Daily in 1998 (12 months). There are approximately 290 thousand sentences and 15 million words in the PFR corpus. For simplicity, we use its segmentation. We discard its POS tags because PFR and CTB use different POS sets. We use the package TNT (Brants 2000), a highly efficient statistical part-of-speech tagger, to train a POS tagger on the training data from the CTB. To verify that our POS tagger is good, we test the TNT package on the standard training and testing sets for full parsing (Wang et al. 2006). The TNT-based tagger provides 91.52 % accuracy, which is comparable with the results obtained by Wang et al. (2006).

6.4.2 Experimental Results

In the experiments, we train the parsers on the training data and tune the parameters on the development data. In the following sections, “baseline” refers to the baseline parser (the model with the basic features), and “Bilex” refers to the new parser (the model with all features).

2 More detailed information can be found at http://www.cis.upenn.edu/~chinese/
3 More detailed information can be found at http://www.icl.pku.edu


Table 6.2 The results with different feature sets on CTB4 (test)

                 UAS     ROOT
Baseline         85.28   88.21
+L1              86.40   89.23
+L1L2 (Bilex)    86.52   89.36

6.4.2.1 Main Results

Table 6.2 shows the results of the parser with different feature sets, where “+L1” refers to the parser with the basic features and L1 features, and “+L1L2” refers to the parser with all features (base features, L1 features, and L2 features). From the table, we find that the system achieves significant improvement (1.12 % for UAS and 1.02 % for ROOT) by adding the L1 features. The L2 features provide a further but small improvement, 0.12 % for UAS and 0.13 % for ROOT. One reason may be that the information from the dependency length 2 data contains more noise: as shown in Fig. 6.1, the score for dependency length 2 is about 5 % lower than that for dependency length 1. Other reasons may be that the number of L2 pairs is much smaller than that of L1 pairs and that 35.8 % of the L2 pairs are included in the list of L1 pairs. Hence, the L2 features could not provide much further information. In total, we achieve a 1.24 % improvement for UAS and 1.15 % for ROOT. The improvement is significant in a one-tailed paired t-test (p < 10^{-5}).

6.4.2.2 Comparison of Other Systems

We compare Bilex with two other systems: SelfTrain and CoTrain. The SelfTrain system is similar to the method described by Reichart and Rappoport (2007), where new auto-parsed sentences are randomly selected. The CoTrain system is similar to the method described by Sagae and Tsujii (2007). First, we train a forward parser (the same as our baseline system) and a backward parser. Then, we select the sentences that have been identically parsed by the two parsers as newly labeled data. Finally, we retrain the forward parser with the new training data. We select sentences containing about 200k words from the PFR data as newly labeled data for the SelfTrain and CoTrain systems.

Table 6.3 shows the experimental results. SelfTrain shows an improvement of 0.43 %, and CoTrain gives a 0.55 % improvement compared with the baseline system. Bilex performs the best among all the systems. The time required for training the SelfTrain and CoTrain systems increases because they employ almost double the training data. We compare the run times (in minutes) for training the four systems. Table 6.3 shows that Bilex requires 1,126 min, a time similar to that of the baseline system. SelfTrain and CoTrain require almost three times as much time as the baseline system.


Table 6.3 The results of several semi-supervised methods on CTB4 (test)

Method      UAS     ROOT    Cost (mins)
Baseline    85.28   88.21   1,110
SelfTrain   85.71   88.50   3,051
CoTrain     85.83   88.79   2,988
Bilex       86.52   89.36   1,126

Fig. 6.7 Improvement relative to dependency length (F1 of baseline and Bilex vs. dependency length)

6.5 Results Analysis

6.5.1 Improvement Relative to Dependency Length

We consider the improvement relative to the dependency length. We conduct the experiments by performing 10-fold cross validation. The training data and testing data are merged into one set, which is then randomly divided into ten parts. We perform the experiments ten times; each time, we use one part as testing data and the others as training data. Figure 6.7 shows the average scores relative to the dependency length. From the figure, we find that Bilex provides better performance for all lengths. In particular, it performs much better than the baseline parser for lengths greater than 4.


6.5.2 Improvement Relative to Unknown Words

The unknown word4 problem is an important issue for parsing. The proposed approach can partially alleviate the unknown word problem. We calculate the number of unknown words in each sentence and list the accuracies for sentences having 0 to 5 unknown words. We discard the sentences having more than five unknown words, because their number is very small. We group each sentence into one of three classes: (Better) those where Bilex’s score increased relative to the baseline’s score, (NoChange) those where the score remains the same, and (Worse) those where the score decreased. We add another class (NoWorse) by merging Better and NoChange.

Figure 6.8 shows the experimental results, where the x axis refers to the number of unknown words in one sentence and the y axis shows the class percentages. For example, for the sentences having three unknown words, about 38.46 % improved, 15.38 % became worse, 46.15 % were unchanged, and 84.61 % did not become worse. The NoWorse curve shows that regardless of the number of unknown words in a sentence, there is more than an 80 % chance that the proposed approach did not harm the result. The Better and Worse curves show that the proposed approach always provides better results. The results also indicate that the proposed approach can achieve a larger improvement in parsing sentences having unknown words than in parsing sentences without any unknown words. The reason may be as follows:

Fig. 6.8 Improvement relative to unknown words (percentage of Better, NoChange, Worse, and NoWorse sentences vs. number of unknown words)

4 An unknown word is a word that is not included in the training data.


We collect word pairs, including unknown word pairs and known word pairs, in Sect. 6.2.2 and group them into buckets by using Eq. (6.1). Unknown word pairs in the testing data are also mapped into one of the buckets by using Eq. (6.1). Hence, known word pairs can share features with unknown word pairs.

6.5.3 Improvement Relative to POS Pairs

In this section, we list the improvements relative to the POS tags of paired words having a dependency relation. Table 6.4 shows the accuracies of baseline and Bilex on the TOP 30 POS pairs (ordered by the frequencies of their occurrences in the testing data), where Ab refers to the accuracy of baseline, Ao refers to the accuracy of Bilex, and “Pairs” gives the POS pair of dependent-head. For example, “NN-VV” means that “NN” is the POS of the dependent and “VV” is the POS of the head. Baseline yields 84.35 % accuracy and Bilex yields 87.87 % (3.52 % higher) on “NN-VV.” The table shows that the proposed approach works well for most POS pairs (better for 19 pairs, no change for 4, and worse for 7).

In Fig. 6.7, we find that Bilex performs well for lengths greater than 4. Considering POS pairs and dependency distances together, we investigate the improvement relative to these two factors. We divide the lengths into two ranges: “D1–D4” for lengths ranging from 1 to 4 and “D5+” for lengths no less than 5. Table 6.5 shows the accuracies of baseline and Bilex on the TOP 10 POS pairs with the two ranges, where “ALL” refers to all the pairs.

Table 6.4 Improvement relative to POS pairs

Pairs     Ab      Ao      Ao−Ab        Pairs     Ab      Ao      Ao−Ab
NN-NN     84.10   84.15   (+0.05)      CD-M      100.00  100.00  (=)
NN-VV     84.35   87.87   (+3.52)      AS-VV     100.00  100.00  (=)
VV-VV     68.86   69.44   (+0.58)      DT-NN     94.05   94.05   (=)
NR-NN     85.90   89.10   (+3.20)      NN-NR     91.57   93.98   (+2.41)
P-VV      83.57   87.50   (+3.93)      VA-DEC    100.00  98.68   (−1.32)
AD-VV     97.72   98.10   (+0.38)      M-NN      82.19   83.56   (+1.37)
JJ-NN     91.39   93.85   (+2.46)      NN-VC     95.59   91.18   (−4.41)
DEG-NN    95.36   94.85   (−0.51)      PN-VV     95.52   98.51   (+2.99)
NR-VV     89.12   89.12   (=)          NT-VV     91.67   93.33   (+1.66)
DEC-NN    97.74   98.31   (+0.57)      LC-P      91.23   89.47   (−1.76)
NN-P      81.44   83.83   (+2.39)      CD-NN     96.36   94.55   (−1.81)
CC-NN     82.27   81.56   (−0.71)      NN-LC     90.57   96.23   (+5.66)
VV-DEC    73.73   77.12   (+3.39)      CC-VV     74.47   70.21   (−4.26)
NN-DEG    94.59   95.50   (+0.91)      NR-P      91.49   95.74   (+4.25)
NR-NR     94.44   95.37   (+0.93)      NT-NN     90.48   95.24   (+4.76)


Table 6.5 Improvement relative to POS pairs with distances

          Two ranges (%)      D1–D4                        D5+
Pairs     D1–D4    D5+        Ab      Ao      Ao−Ab        Ab      Ao      Ao−Ab
NN-NN     88.90    11.10      87.94   87.94   (=)          53.39   53.79   (+0.40)
NN-VV     70.27    29.73      88.50   90.54   (+2.04)      74.56   81.58   (+7.02)
VV-VV     43.91    56.09      84.14   83.70   (−0.44)      56.90   58.28   (+1.38)
NR-NN     90.96    9.04       90.35   92.11   (+1.76)      41.18   58.82   (+17.64)
P-VV      42.86    57.14      93.33   95.00   (+1.67)      76.25   81.88   (+5.63)
AD-VV     89.35    10.65      98.72   99.15   (+0.43)      89.29   89.29   (=)
JJ-NN     98.77    1.23       92.53   94.61   (+2.08)      0.00    33.33   (+33.33)
DEG-NN    90.72    9.28       97.73   97.16   (−0.57)      72.22   72.22   (=)
NR-VV     82.90    17.10      91.25   91.88   (+0.63)      78.79   75.76   (−3.03)
DEC-NN    96.05    3.95       98.82   98.82   (=)          71.43   85.71   (+14.28)
ALL       77.93    22.07      90.94   91.67   (+0.73)      64.63   67.69   (+3.06)

For example, for the pair “NN-VV,” 70.27 % of instances are in the range “D1–D4” and 29.73 % in the range “D5+.” Bilex provides a 2.04 % improvement for the range “D1–D4” and 7.02 % for the range “D5+.” From the table, we find that Bilex provides a larger improvement for “D5+” on 7 pairs out of 10. For the pairs “NN-VV” and “P-VV,” the rates of “D5+” are higher than the average, and Bilex provides an improvement greater than 5 %.

6.6 Summary

This chapter has introduced the approach which uses bilexical dependencies to improve dependency parsing by using unlabeled data (Chen et al. 2008). The information is extracted from short bilexical dependencies in an automatically generated corpus parsed by the baseline parser. We then train a new parser by using this extracted information.

References

Brants, T. (2000). TnT – a statistical part-of-speech tagger. In Proceedings of ANLP, Seattle (pp. 224–231).

Chen, W., Kawahara, D., Uchimoto, K., Zhang, Y., & Isahara, H. (2008). Dependency parsing with short dependency relations in unlabeled data. In Proceedings of IJCNLP 2008, Hyderabad.

McDonald, R., & Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. In Proceedings of EMNLP-CoNLL, Prague (pp. 122–131).

Nivre, J. (2003). An efficient algorithm for projective dependency parsing. In Proceedings of IWPT 2003, Nancy (pp. 149–160).

Nivre, J., & Kubler, S. (2006). Dependency parsing: Tutorial at COLING-ACL 2006. In COLING-ACL, Sydney.

Reichart, R., & Rappoport, A. (2007). Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. In Proceedings of ACL, Prague.

Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 1044–1050).

van Noord, G. (2007). Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of IWPT-07, Prague.

Wang, M., Sagae, K., & Mitamura, T. (2006). A fast, accurate deterministic parser for Chinese. In Proceedings of COLING-ACL 2006, Sydney.

Wang, Q. I., Lin, D., & Schuurmans, D. (2007). Simple training of dependency parsers via structured boosting. In Proceedings of IJCAI 2007, Hyderabad.


Chapter 7 Training with Subtree Structures

In this chapter, we introduce a semi-supervised approach that uses subtree structures to improve dependency parsing. The subtrees are extracted from dependency trees in auto-parsed data, and a set of subtree-based features is designed for the parsing models.

Unlike most of the previous studies (Sagae and Tsujii 2007; Steedman et al. 2003) that improved performance by using entire trees from auto-parsed data, the subtree-based approach exploits partial information (i.e., subtrees) in auto-parsed data. Sagae and Tsujii (2007) and Steedman et al. (2003) used entire auto-parsed trees as newly annotated data to train parsing models. We instead use subtree-based features in training on the original gold-standard data. Methods using whole auto-parsed trees mainly suffer from two problems: (1) it is difficult to select reliable auto-parsed trees as newly annotated data (Steedman et al. 2003), and (2) it is difficult to scale to large data because of the high computational cost of training models with a large number of newly automatically annotated sentences. On the other hand, we can easily extract plenty of subtrees from the large data and derive reliable features from them. We then augment the features of existing parsers and do not enlarge the training data.

The use of word pairs in auto-parsed data was attempted by van Noord (2007)and Chen et al. (2008) (described in Chap. 6). However, we believe that these wordpairs provide a relatively poor level of useful information for parsing. To providericher information, we consider more words, besides word pairs. Specifically, we usesubtrees containing two or three words extracted from dependency trees in the auto-parsed data. In order to encode the information concerning the extracted subtreesin features for dependency parsing, we assign labels to sets of subtrees that areclassified according to a certain policy. The subtree-based approach first is appliedto monolingual parsing and then to bilingual parsing.

Fig. 7.1 Overview of subtree extraction: the unannotated data is preprocessed (POS tagging) and parsed by the baseline parser; subtrees are then extracted from the auto-parsed data and classified, subtree-based features are generated, and a new parser is trained on the annotated data

7.1 Subtrees

In this section, we describe the approach of extracting subtrees from unannotated data. First, we parse the unannotated data using the baseline parser, which is trained with the base features, and obtain auto-parsed data. Subsequently, we extract the subtrees from the dependency trees in the auto-parsed data and perform subtree classification.

Figure 7.1 shows the overview of the approach, where Lst refers to the collected subtrees. The rest of this section describes the following steps: unannotated data preprocessing (in Sect. 7.1.1), subtree extraction (in Sect. 7.1.2), and subtree classification (in Sect. 7.1.3). The output is the subtrees and their labels that are used to train the parsing models.

7.1.1 Unannotated Data Preprocessing

The unannotated data are preprocessed before parsing. We perform word segmentation (if needed) and part-of-speech (POS) tagging. After that, we obtain the word-segmented sentences with the POS tags. We then use the baseline parser to parse the sentences in the data. Finally, we obtain the auto-parsed data.

7.1.2 Subtree Extraction

Figure 7.2 shows a dependency tree generated by the baseline parser. The subtrees are extracted from the dependency trees. If a subtree contains two nodes (words), we call it a bigram-subtree. If a subtree contains three nodes (words), we call it a trigram-subtree. We eliminate the subtrees that occur only once in the data.

7.1.2.1 Bigram-Subtree

We extract the bigram-subtrees from the dependency trees and store them in list Lst. If two words have a dependency relation in a tree, they form a bigram-subtree. For example, "ate" and "fish" have a right dependency arc in the sentence shown in Fig. 7.2; thus, they form a subtree.

Note that the dependency direction and the order of the words in the original sentence are important in the extraction. To enable this, the subtrees are encoded in the string format st = w:wid:hid(w:wid:hid)+,¹ where w refers to a word in the subtree, wid refers to the ID (starting from 1) of w in the subtree (words are ordered according to their positions in the original sentence),² and hid refers to the ID of w's head (hid = 0 means that this word is the root of the subtree). So the subtree of "ate" and "fish" is encoded as "ate:1:0-fish:2:1." Figure 7.3 shows all the bigram-subtrees extracted from the sentence in Fig. 7.2. We exclude bigram-subtrees that contain punctuation marks.
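To make the encoding concrete, the following is a minimal Python sketch of this string format (our illustration, not the authors' code; the function name and the (word, head) input representation are hypothetical):

```python
def encode_subtree(nodes):
    """Encode a subtree in the string format w:wid:hid(w:wid:hid)+.

    `nodes` is a list of (word, head) pairs in original sentence order,
    where `head` is a 0-based index into `nodes` and -1 marks the root
    of the subtree.  (Hypothetical helper, not the authors' code.)
    """
    parts = []
    for wid, (word, head) in enumerate(nodes, start=1):
        hid = 0 if head < 0 else head + 1  # hid = 0 marks the subtree root
        parts.append(f"{word}:{wid}:{hid}")
    return "-".join(parts)

# "ate" is the head of "fish": encoded as "ate:1:0-fish:2:1"
print(encode_subtree([("ate", -1), ("fish", 0)]))
```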

Fig. 7.2 Example of a dependency structure in tree format, for the sentence "I ate the fish with a fork ."

Fig. 7.3 Examples of bigram-subtrees extracted from the sentence in Fig. 7.2: "I:1:2-ate:2:0", "ate:1:0-fish:2:1", "ate:1:0-with:2:1", "the:1:2-fish:2:0", "with:1:0-fork:2:1", and "a:1:2-fork:2:0"

¹ "+" refers to matching the preceding element one or more times.
² So wid is in fact redundant, but we include it for ease of understanding.

Fig. 7.4 Examples of trigram-subtrees extracted from the sentence in Fig. 7.2: (a) sibling-type subtrees such as "ate:1:0-fish:2:1-with:3:1" and "I:1:3-NULL:2:3-ate:3:0"; (b) grandchild-type subtrees such as "ate:1:0-the:2:3-fish:3:1" and "ate:1:0-with:2:1-fork:3:2"

7.1.2.2 Trigram-Subtree

We extract the trigram-subtrees and store them in list Lst. These trigram-subtrees are divided into sibling-type and parent-child-grandchild type (grandchild-type, for short). We do not use arbitrary trigram-subtrees, because we find that the computational cost of parsing is very high if we extract all the trigram-subtrees and encode them in features.

For the sibling-type, we extract the subtrees containing a head h, its dependent d, and d's closest sibling ch in [h ... d]. The structure of the sibling-type subtrees is similar to that of the parent-sibling type in the second-order model. We add a NULL token when ch is null. Figure 7.4a shows the sibling-type trigram-subtrees extracted from the sentence in Fig. 7.2. For simplification, we only show the string format for some subtrees in Fig. 7.4.

For the grandchild-type, we extract the subtrees containing the parent, the child, and the child's furthest child (on the left or right side). The structure of this type is similar to the parent-child-grandchild structure in the second-order model. We also add a NULL token when the grandchild is null. Figure 7.4b shows the grandchild-type trigram-subtrees extracted from the sentence in Fig. 7.2.
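As an illustration of the extraction step, here is a simplified Python sketch (our own, not the authors' implementation). It collects bigram-subtrees plus sibling- and grandchild-type trigram-subtrees from one auto-parsed sentence, but for brevity it keeps every sibling and grandchild triple and omits the closest-sibling/furthest-grandchild restrictions, the NULL padding, and the punctuation filter described above:

```python
from collections import Counter

def extract_subtrees(words, heads):
    """Count bigram- and trigram-subtree strings in one parsed sentence.
    heads[i] is the index of word i's head (-1 for the sentence root)."""
    def enc(nodes, root):
        # encode the nodes (in sentence order) as w:wid:hid(w:wid:hid)+
        order = {i: wid for wid, i in enumerate(sorted(nodes), start=1)}
        return "-".join("%s:%d:%d" % (words[i], order[i],
                                      0 if i == root else order[heads[i]])
                        for i in sorted(nodes))

    counts = Counter()
    for d, h in enumerate(heads):
        if h < 0:
            continue
        counts[enc({h, d}, h)] += 1                      # bigram-subtree
        for g in (i for i, gh in enumerate(heads) if gh == d):
            counts[enc({h, d, g}, h)] += 1               # grandchild-type
        for s in (i for i, sh in enumerate(heads) if sh == h and i > d):
            counts[enc({h, d, s}, h)] += 1               # sibling-type
    return counts
```

In practice these counts would be accumulated over the whole auto-parsed corpus, and subtrees occurring only once would be dropped, as described above.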

7.1.2.3 Higher-Order Subtree

Here, we use only the bigram-subtrees and trigram-subtrees, though in theory we can use k-gram-subtrees in the (k-1)th-order MST parsing models mentioned in McDonald and Pereira (2006). Though the higher-order MST parsing models will be slow with exact inference, requiring O(n^k) time (McDonald and Pereira 2006), it might be possible to use the higher-order k-gram subtrees with beam search in the future.

7.1.3 Subtree Classification

To share the information among the extracted subtrees, we classify the subtrees into sets and assign labels to the sets. We call the labels STLabel in the following content.

The extracted subtrees are grouped into different sets according to their frequencies. After experiments with many different threshold settings on development data sets, we chose the following method. We group the subtrees into three sets corresponding to high frequency (HF), middle frequency (MF), and low frequency (LF). HF, MF, and LF are used as the STLabels of these three sets. The settings are as follows: if a subtree is one of the top 10 % most frequent subtrees, it is labeled as HF; if it is one of the top 20 % subtrees, it is labeled as MF; otherwise, it is labeled as LF. We store the STLabels for every subtree in Lst. For example, if subtree "ate:1:0-with:2:1" is among the top 10 %, its STLabel is HF. This method assumes that the subtrees with higher frequencies are relatively more reliable.

If a subtree is not included in Lst, its STLabel is ZERO. Note that we perform the subtree classification within a set of subtrees belonging to one of the bigram-type, sibling-type, or grandchild-type. Naturally, we can use other methods such as general clustering to perform subtree classification; we leave these alternatives for future work.
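The frequency-based labeling can be sketched as below (a hypothetical helper; the 10 %/20 % thresholds follow the text, and the classification would be run separately for the bigram-, sibling-, and grandchild-type subtree sets):

```python
def classify_subtrees(counts):
    """Assign an STLabel to each subtree by frequency rank:
    top 10% -> HF, top 20% -> MF, the rest -> LF."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    n = len(ranked)
    labels = {}
    for rank, subtree in enumerate(ranked):
        if rank < 0.10 * n:
            labels[subtree] = "HF"
        elif rank < 0.20 * n:
            labels[subtree] = "MF"
        else:
            labels[subtree] = "LF"
    return labels

def st_label(labels, subtree):
    # subtrees not stored in Lst receive the label ZERO
    return labels.get(subtree, "ZERO")
```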

7.2 Monolingual Parsing

In this section, the subtrees are applied to the graph-based dependency parsing model in the monolingual parsing task.

7.2.1 Subtree-Based Features

The base features are defined over each edge (h, d) and over two adjacent edges (h, d, c). In this section, on the basis of the extracted subtrees and their STLabels, we also design the subtree-based features for the edges. The features based on the bigram-subtrees correspond to the first-order features in the parsing model, and those based on the trigram-subtrees correspond to the second-order features.

Fig. 7.5 Word pairs and triple for feature representation: (a) the head h, the dependent d, and their surrounding words h-1, h+1, d-1, d+1; (b) the triple of h, ch, and d

7.2.1.1 First-Order Subtree-Based Features

The first-order features are based on the bigram-subtrees related to word pairs. We design new features for edge (h, d), where h is the head and d is the dependent, in the parsing process. The new features consider the subtrees formed by the head, the dependent, and their surrounding words. Figure 7.5a³ shows the words and their surrounding words, where h-1 refers to the word to the left of the head in the sentence, h+1 refers to the word to the right of the head, d-1 refers to the word to the left of the dependent, and d+1 refers to the word to the right of the dependent. Temporary bigram-subtrees are formed by the word pairs that are linked by the dashed lines in the figure. Then we retrieve these subtrees in Lst to acquire their STLabels.

We then generate the first-order subtree-based features, consisting of indicator functions for the STLabels of the retrieved bigram-subtrees. When generating the subtree-based features, each dashed line in Fig. 7.5a triggers a different feature.

To demonstrate how to generate the first-order subtree-based features, we use the following example. Suppose we intend to parse the sentence "He ate the cake with a fork" as shown in Fig. 7.6, where h is "ate" and d is "with." We can generate the features for the pairs linked by dashed lines, such as h-d and h-(d+1). Then we have temporary bigram-subtrees "ate:1:0-with:2:1" for h-d and "ate:1:0-a:2:1" for h-(d+1), and so on. If we can find subtree "ate:1:0-with:2:1" in Lst and obtain its STLabel HF, we generate feature "BI:H-D:HF," where "BI" refers to the bigram-subtree, "H-D" means that this feature is related to head (H) "ate" and dependent (D) "with," and "HF" is the STLabel. And if the STLabel of subtree "ate:1:0-a:2:1" for h-(d+1) is ZERO, we generate feature "BI:H-D+1:ZERO," where "H-D+1" means that this feature is related to head (H) "ate" and the word d+1 "a" (D+1), and "ZERO" is the STLabel. The other three features are generated similarly.

³ Please note that d could be before h.

Fig. 7.6 First-order subtree-based features for the sentence "He ate the cake with a fork .", with h = "ate", d = "with", and the surrounding words h-1, h+1, d-1, d+1
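A rough sketch of how such features could be assembled for a candidate edge is given below. Only the "BI:H-D:<STLabel>"-style names follow the text; the exact set of five word pairs and the string layout of the temporary subtrees are our assumptions:

```python
def first_order_subtree_features(words, h, d, labels):
    """Generate first-order subtree-based features for edge (h, d).
    `labels` maps subtree strings to STLabels (see classify_subtrees)."""
    pairs = [("H-D", h, d), ("H-D-1", h, d - 1), ("H-D+1", h, d + 1),
             ("H-1-D", h - 1, d), ("H+1-D", h + 1, d)]
    feats = []
    for name, i, j in pairs:       # i is treated as the head of the pair
        if not (0 <= i < len(words) and 0 <= j < len(words)):
            continue
        left, right = (i, j) if i < j else (j, i)
        st = "%s:1:%d-%s:2:%d" % (words[left], 0 if left == i else 2,
                                  words[right], 0 if right == i else 1)
        feats.append("BI:%s:%s" % (name, labels.get(st, "ZERO")))
    return feats

words = "He ate the cake with a fork .".split()
print(first_order_subtree_features(words, 1, 4, {"ate:1:0-with:2:1": "HF"}))
```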

7.2.1.2 Second-Order Subtree-Based Features

The second-order features are based on the trigram-subtrees, which are related to triples of words. We design new features over two adjacent edges (h, d, c), where h and d are the head and dependent, respectively, and c is one of ch, cdi, and cdo.

First, we design sibling-type features for a triple of a head h, its dependent d, and d's closest sibling ch in [h ... d]. The triple is shown in Fig. 7.5b. A temporary trigram-subtree is formed by the word forms of h, d, and ch. Then we retrieve the subtree in Lst to get its STLabel. We also consider the triples of "h-NULL,"⁴ d, and ch, which means that we only check the words of the sibling nodes without checking the head word.

Then, we generate the second-order subtree-based features, consisting of indicator functions for the STLabels of the retrieved sibling-type subtrees. Similarly, we design features based on the grandchild-type subtrees.

For the first- and second-order subtree-based features, we also generate combined features involving the STLabels and the POS tags of heads, and the STLabels and the word forms of heads. Specifically, we remove any feature related to word form if the word is not one of the Top-N most frequent words in the training data. We used N = 1,000 for the experiments. This method can reduce the size of the feature sets.

7.2.2 Subtree-Based Parser

We combine the base features with the subtree-based features by a new scoring function:

S_ST(x, g) = f_b(x, g) · w_b + f_st(x, g) · w_st    (7.1)

⁴ h-NULL is a dummy token.

where x refers to an input sentence, f_b(x, g) refers to the base features, f_st(x, g) refers to the subtree-based features, and w_b and w_st are their corresponding weights, respectively. The feature weights are learned during training using MIRA (Crammer and Singer 2003; McDonald et al. 2005). Note that w_b is also retrained here. Thus, given a sentence x, we find the parsing tree y_ST:

y_ST = argmax_{y ∈ Y(G_x)} Σ_{g ∈ y} S_ST(x, g)

where g is a subgraph.
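As a minimal illustration of Eq. (7.1), the score of one candidate subgraph can be computed by combining the two sparse feature vectors with their own weight vectors. This is a sketch under the assumption that features are represented as name-to-value dictionaries; it is not the authors' implementation:

```python
def score_subgraph(base_feats, subtree_feats, w_base, w_subtree):
    """Compute S_ST(x, g) for one subgraph g: the dot product of the
    base features with w_b plus that of the subtree-based features
    with w_st.  All arguments are dicts mapping feature names to values."""
    score = sum(v * w_base.get(f, 0.0) for f, v in base_feats.items())
    score += sum(v * w_subtree.get(f, 0.0) for f, v in subtree_feats.items())
    return score
```

The decoder then searches for the tree y_ST whose subgraphs maximize the sum of these scores.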

7.3 Bilingual Parsing

In this section, we apply the subtree-based approach to bilingual parsing. In most previous studies on the parsing of bitexts, bilingual treebanks were used to generate the bilingual constraints. There are two types of bilingual treebanks: (1) full bilingual treebanks, in which there are human-annotated tree structures on both sides and the target sentences are translated by hand, and (2) non-full bilingual treebanks, in which there are human-annotated tree structures on the source side and the target sentences are translated by hand. Burkett and Klein (2008) proposed using joint models on bitexts to improve the performance on either or both sides. Their method used full bilingual treebanks. Huang et al. (2009) presented a method to train a source-language parser by using the reordering information on words between the sentences on the two sides. It used non-full bilingual treebanks. Chen et al. (2010) also used non-full bilingual treebanks, but used rules for generating tree structures on the target side. However, full/non-full bilingual treebanks are costly and troublesome to obtain, partly because of the high cost of human translation. Thus, in their experiments, they applied their methods to a small data set, the manually translated portion of the Chinese Treebank (CTB), which contains only about 3,000 sentences. On the other hand, many large-scale monolingual treebanks exist, such as the Penn English Treebank (PTB) (Marcus et al. 1993) (about 40,000 sentences in Version 3) and the latest version of CTB (over 50,000 sentences in Version 7).

The information on subtrees can be used to generate the bilingual constraints on monolingual treebanks with the help of SMT systems. With this method, we aim to improve source-language parsing with the help of auto-translated target sentences.

In the first step, an SMT system translates the sentences of a source monolingual treebank into the target language. Then, the target sentences are parsed by a parser trained on a target monolingual treebank. This results in an auto-generated bilingual treebank that has human-annotated trees on the source side and auto-generated trees on the target side. Although the sentences and parse trees on the target side are not perfect, it should be possible to improve bitext parsing performance by using this auto-generated bilingual treebank. Word alignment links are built automatically using a word alignment tool. Then we can produce a set of bilingual constraints between the two sides. Compared with the full/non-full bilingual treebanks used in the previous work, our auto-generated bilingual treebank requires less human annotation. It contains human-annotated tree structures on the source side, the target sentences are translated by the SMT system, and the tree structures on the target side are parsed by the target parser.

Fig. 7.7 Input and output of bitext parsing: (a) the input, a Chinese sentence ("ta gaodu pingjia le yu lipeng zongli de huitan jieguo") with word alignment links to its English translation; (b) the output dependency tree on the Chinese side

Since the translation, parsing, and word alignment are done automatically, the constraints may not be sufficiently reliable. To overcome this problem, we verify the reliability of the constraints using target monolingual subtrees and bilingual subtrees extracted from a large number of auto-parsed target monolingual sentences and bilingual sentence pairs. Finally, we design a set of bilingual features based on the verified constraints for the parsing models. The basic idea is as follows: if the dependency structures of a bilingual constraint can be found in the target monolingual subtrees or the bilingual subtrees, this constraint is probably reliable.

7.3.1 A Case Study

Bitext dependency parsing is the task of parsing source sentences with the help of their corresponding translations. Figure 7.7a shows an example of bitext parsing input, where ROOT is an artificial root token inserted at the beginning that does not depend on any other token in the sentence, the dashed undirected links are word alignment links, and the directed links between words indicate that they have a dependency relation. Given such input, the parser builds dependency trees for the source sentences. Figure 7.7b shows the output of bitext parsing for the example in Fig. 7.7a.

Fig. 7.8 Example of an ambiguity on the Chinese side: (a) the sentence "ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de liliang he jiqiao" with its POS tags (PN VV DT NN AD VV AD VV VV DEC NN CC NN); (b), (c) the two candidate attachments for "技巧 (jiqiao)"

In bitext parsing, there are many sentences where ambiguities exist on the source side but not on the target side. These differences should help improve source-side parsing.

Suppose we have the Chinese sentence shown in Fig. 7.8a. In this sentence, there is a nominalization case (Li and Thompson 1997) in which the particle "的 (de)" is placed after the verb "起来 (qilai)" to modify "技巧 (jiqiao)." This nominalization is a relative clause, but there are no clues to its boundary. This means that it is difficult to determine which word is the head of "技巧 (jiqiao)." The head may be "发挥 (fahui)" or "培育 (peiyu)," as shown in Fig. 7.8b, c. In this case, (b) is correct.

In the English translation (Fig. 7.9), the second "that" is a clue indicating the boundary of the relative clause, showing the relationship between "skill" and "cultivate." This example shows that a translation can provide useful bilingual constraints. From the dependency tree on the target side, we find that "skill," corresponding to "技巧 (jiqiao)," depends on "demonstrate," corresponding to "发挥 (fahui)," while "cultivate," corresponding to "培育 (peiyu)," is a grandchild of "skill." This is a positive piece of evidence for supporting "发挥 (fahui)" as the head of "技巧 (jiqiao)."

The above case uses the human translation on the target side. To provide the bilingual constraints for bitext parsing, the previous methods of Burkett and Klein (2008), Huang et al. (2009), and Chen et al. (2010) used full/non-full bilingual treebanks as training data. However, since such treebanks are expensive to construct, the bilingual treebanks used are typically small. In contrast, there are large-scale monolingual treebanks available, e.g., the PTB and the latest version of CTB. We use an SMT system and monolingual parsers to automatically construct a bilingual treebank, which, although it may contain errors, should improve source-side parsing.

Fig. 7.9 Example of human translation: "He hoped that all the athletes would fully demonstrate the strength and skill that they cultivate daily," aligned to the Chinese sentence of Fig. 7.8

Fig. 7.10 Example of Moses translation: "he expressed the hope that all athletes used to give full play to the country 's strength and skills," aligned to the same Chinese sentence

Figure 7.10 shows an example of a translation produced by a Moses-based SMT system (Koehn et al. 2007). The translation contains errors, but it also contains correct parts that can be used for source disambiguation. The word "play," corresponding to "发挥 (fahui)," is the grandparent of "skills," corresponding to "技巧 (jiqiao)." This is evidence that the head of "技巧 (jiqiao)" is "发挥 (fahui)."

This example shows that, even if the sentences and parse trees on the target side are not perfect, the bilingual treebank still contains useful information that can be used to improve bitext parsing. Here, we focus on the identification of the unreliable bilingual constraints.

7.3.2 Original Bilingual Features

We generate two types of bilingual features, original and verified. The original bilingual features (described in this section) are generated on the basis of the bilingual constraints without being verified by large-scale unannotated data. The verified bilingual features (described in Sect. 7.3.3) are generated on the basis of the bilingual constraints verified by using a large amount of unannotated data.

Here we use the first- and second-order parsing models in our systems. Thus, we only define first- and second-order bilingual features. We leave higher-order features (Koo and Collins 2010) for future study.

Fig. 7.11 Steps in building an auto-generated bilingual treebank

7.3.2.1 Auto-generated Bilingual Treebank

We use a baseline parser that supports the first-order model and the parent-sibling structures (McDonald and Pereira 2006) and parent-child-grandchild structures (Carreras 2007) of the second-order model. We call the parser with the monolingual features on the source side "Parsers" and the parser with the monolingual features on the target side "Parsert."

We assume that there are monolingual treebanks available on the source side, an SMT system that can translate the source sentences into the target language, and a Parsert trained on the target monolingual treebank.

Figure 7.11 shows the steps in building an auto-generated bilingual treebank. First, the SMT system translates the sentences of the source monolingual treebank into the target language. Usually, SMT systems can output the word alignment links directly. If they cannot, we perform word alignment using publicly available tools such as Giza++ (Och and Ney 2003) or the Berkeley Aligner (DeNero and Klein 2007; Liang et al. 2006). The translated sentences are parsed by Parsert. Finally, an auto-generated bilingual treebank is generated.

7.3.2.2 Bilingual Constraint Functions

Because we focus on the first- and second-order graph-based parsing models (Carreras 2007; McDonald and Pereira 2006), we consider the constraints for bigram (a single edge) and trigram (adjacent edges) dependencies.

Suppose we have a (candidate) dependency relation rs that can be a bigram or trigram dependency. We examine whether the words corresponding to the source words of rs have a dependency relation rt in the target trees. We also consider the direction of the dependency relation. The corresponding word of the head should also be the head in rt. We define a binary function for this bilingual constraint: Fbn(rsn:rtk), where n and k refer to the dependency types (2 for bigram and 3 for trigram). For example, in rs2:rt3, rs2 is a bigram dependency on the source side and rt3 is a trigram dependency on the target side.

Fig. 7.12 Example of bilingual constraints (2to2), over the sentence pair of Fig. 7.10

Fig. 7.13 Example of bilingual constraints (2to3), over the sentence pair of Fig. 7.10

7.3.2.3 Bigram Constraint Function: Fb2

For rs2, we consider two types of bilingual constraints. The first constraint, denoted as Fb2(rs2:rt2), is that the corresponding words also have a direct dependency relation rt2. Figure 7.12 shows an example, where the source word "全体 (quanti)" depends on "运动员 (yundongyuan)" and its corresponding word "all" depends on "athletes," which corresponds to "运动员 (yundongyuan)." In this case, Fb2(rs2:rt2) = +1. However, if the source words are "他 (ta)" and "希望 (xiwang)," their corresponding words "He" and "hope" do not have a direct dependency relation. In this case, Fb2(rs2:rt2) = -1.

In the second constraint, denoted as Fb2(rs2:rt3), the corresponding words form a parent-child-grandchild relation, which often occurs in translation (Koehn et al. 2003). Figure 7.13 shows an example. The source word "技巧 (jiqiao)" depends on "发挥 (fahui)," while its corresponding word "skills" indirectly depends on "play," which corresponds to "发挥 (fahui)," via "to." In this case, function Fb2(rs2:rt3) = +1.
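A sketch of these two bigram constraint checks, under the simplifying assumption that the word alignment is one-to-one, might look as follows (a hypothetical helper, not the authors' code):

```python
def f_b2(src_head, src_dep, align, tgt_heads):
    """Return (Fb2(rs2:rt2), Fb2(rs2:rt3)) for one source edge.
    `align` maps a source word index to its aligned target index;
    tgt_heads[j] is the head of target word j (-1 for the root)."""
    th, td = align.get(src_head), align.get(src_dep)
    if th is None or td is None:
        return (-1, -1)
    rt2 = +1 if tgt_heads[td] == th else -1                  # direct relation
    mid = tgt_heads[td]
    rt3 = +1 if (mid >= 0 and tgt_heads[mid] == th) else -1  # via one word
    return (rt2, rt3)
```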

7.3.2.4 Trigram Constraint Function: Fb3

For a second-order relation on the source side, we consider one type of constraint. We have three source words that form a second-order relation, and all of them have corresponding words that also form a second-order relation. We define function Fb3(rs3:rt3) for this constraint. An example is shown in Fig. 7.14. The source words "力量 (liliang)," "和 (he)," and "技巧 (jiqiao)" form a sibling structure, and their corresponding words ("strength," "and," and "skills") form a sibling structure on the target side. In this case, function Fb3(rs3:rt3) = +1.

Fig. 7.14 Example of bilingual constraints (3to3), over the sentence pair of Fig. 7.10

Table 7.1 Original bilingual features

  First-order features    Second-order features
  <Fro>
  <Fb2, Dir>              <Fb3, Dir>
  <Fb2, Dir, Fro>         <Fb3, Dir, Fro>

7.3.2.5 Bilingual Reordering Function: Fro

Huang et al. (2009) proposed using features based on reordering between languages for a shift-reduce parser. They define the features on the basis of word alignment information to verify whether the corresponding words form a contiguous span, in order to resolve shift-reduce conflicts. We use similar features in our system. For example, in Fig. 7.7a the source span is [会谈 (huitan), 结果 (jieguo)], which maps onto [results, conference]. Because no word within this target span is aligned to a source word outside of the source span, this span is a contiguous span. In this case, function Fro = +1; otherwise, Fro = -1.
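The contiguous-span test can be sketched as below (our illustration; representing the alignment as a set of (source index, target index) links is an assumption):

```python
def f_ro(src_span, align):
    """Bilingual reordering function Fro: +1 if the source span maps to
    a contiguous target span, i.e. no word inside the induced target
    span is aligned to a source word outside src_span; otherwise -1."""
    i, j = src_span
    tgt = [t for s, t in align if i <= s <= j]
    if not tgt:
        return -1
    lo, hi = min(tgt), max(tgt)
    for s, t in align:
        if lo <= t <= hi and not (i <= s <= j):
            return -1
    return +1
```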

7.3.2.6 Original Bilingual Features

We define the original bilingual features on the basis of the bilingual constraint functions and the bilingual reordering function.

Table 7.1 lists the original features, where Dir refers to the directions of the dependencies, Fb2 can be Fb2(rs2:rt2) or Fb2(rs2:rt3), and Fb3 is Fb3(rs3:rt3). Each line in the table defines a feature template that is a combination of functions.

We use examples to show how the original bilingual features are generated in practice. In the example shown in Fig. 7.10, we want to define the bilingual features for the bigram dependency (rs2) between "发挥 (fahui)" and "技巧 (jiqiao)." The corresponding words form a trigram relation rt3. The direction of the bigram dependency is to the right. Thus, we have feature "<Fb2(rs2:rt3) = +1, RIGHT>" for the second first-order feature template in Table 7.1. In the example shown in Fig. 7.14, the source words "力量 (liliang)," "和 (he)," and "技巧 (jiqiao)" form a sibling structure, while their corresponding words "strength," "and," and "skills" form a sibling structure on the target side. The directions of the two dependencies are to the left. We thus define feature "<Fb3(rs3:rt3) = +1, LEFT LEFT>" for the first second-order feature template in Table 7.1.

7.3.3 Verified Bilingual Features

Since we use auto-translation and auto-parsed trees on the target side, using the bilingual constraint alone is not reliable. Therefore, we verify the reliability of the constraints using a large amount of unannotated data. More specifically, the rtk of each constraint is verified by checking a list of target monolingual subtrees, and rsn:rtk is verified by checking a list of bilingual subtrees. The subtrees are extracted from the unannotated data. The basic idea is that, if the dependency structures of a bilingual constraint can be found in the target monolingual subtrees or the bilingual subtrees, the constraint is probably reliable.

Figure 7.15 shows an overview of the proposed approach, where STbi refers to the set of bilingual subtrees and STt refers to the set of monolingual subtrees. First, a large amount of unannotated target monolingual and bilingual data is parsed. Then, the monolingual and bilingual subtrees are extracted from the parsed data. The reliability of the bilingual constraints is verified using the extracted subtrees. Finally, the bilingual features are generated using the verified constraints for the parsing models.

7.3.3.1 Monolingual Target Subtrees

In Sect. 7.1.2, we use a simple method for extracting subtrees from a large amount of monolingual data and using them as features to improve monolingual parsing. Similarly, we use Parsert to parse the unannotated data and obtain a subtree list (STt) on the target side. Two types of subtrees are extracted: bigram (two-word) subtrees and trigram (three-word) subtrees.

We also perform subtree classification to assign the labels to the subtrees (described in Sect. 7.1.3). We use Type(stt) to refer to the label of subtree stt.

Fig. 7.15 Overview of generating verified bilingual features

7.3.3.2 Verified Target Constraint Function: Fvt(rtk)

We use the extracted target subtrees to verify the rtk of the bilingual constraints. In fact, rtk is a candidate subtree. If rtk is included in STt, function Fvt(rtk) = Type(rtk); otherwise, Fvt(rtk) = ZERO. For example, in the example shown in Fig. 7.12, the bigram structure of "all" and "athletes" can form a bigram-subtree that is included in STt, and its label is HF. In this case, Fvt(rt2) = HF.

7.3.3.3 Bilingual Subtrees

We extract bilingual subtrees from a bilingual corpus, which is parsed on the source and target sides by Parsers and Parsert, respectively. We extract three types of bilingual subtrees: bigram-bigram (stbi22), bigram-trigram (stbi23), and trigram-trigram (stbi33). For example, stbi22 consists of a bigram-subtree on the source side and a bigram-subtree on the target side.

From the dependency tree in Fig. 7.16a, we obtain the bilingual subtrees shown in Fig. 7.16b. Figure 7.16b shows the extracted bigram-bigram bilingual subtrees. After extraction, we obtain the bilingual subtrees STbi. We remove the subtrees occurring only once in the data. We do not classify the bilingual subtrees into several sets due to data sparseness.

Fig. 7.16 Examples of bilingual subtree extraction: (a) the aligned sentence pair "ta shi yi ming xuesheng" / "He is a student" with dependency trees on both sides; (b) the bigram-bigram bilingual subtrees extracted from the pair

Table 7.2 Verified bilingual features

  First-order features        Second-order features
  <Fro>
  <Fb2, Fvt(rtk)>             <Fb3, Fvt(rtk)>
  <Fb2, Fvt(rtk), Dir>        <Fb3, Fvt(rtk), Dir>
  <Fb2, Fvb(rbink)>           <Fb3, Fvb(rbink)>
  <Fb2, Fvb(rbink), Dir>      <Fb3, Fvb(rbink), Dir>
  <Fb2, Fro, Fvb(rbink)>

7.3.3.4 Verified Bilingual Constraint Function: Fvb(rbink)

We use the extracted bilingual subtrees to verify the rsn:rtk (rbink for short) of the bilingual constraints. rsn and rtk form a candidate bilingual subtree stbink. If stbink is included in STbi, Fvb(rbink) = +1; otherwise, Fvb(rbink) = -1.

7.3.3.5 Verified Bilingual Features

Next, we define another set of bilingual features by combining the verified constraint functions. We call these bilingual features "verified bilingual features." Table 7.2 lists the verified bilingual features used in our experiments, where each line defines a feature template that is a combination of functions.

We use an example to show how the bilingual features are generated. In the example in Fig. 7.10, we want to define the bilingual features for the bigram dependency (rs2) between "发挥 (fahui)" and "技巧 (jiqiao)." The corresponding words form a trigram relation rt3. The direction of the bigram dependency is to the right. Suppose we can find rt3 in STt with label MF and cannot find the candidate bilingual subtree in STbi. In this case, we have feature "<Fb2(rs2:rt3) = +1, Fvt(rt3) = MF, RIGHT>" for the third first-order feature template and feature "<Fb2(rs2:rt3) = +1, Fvb(rbi23) = -1, RIGHT>" for the fifth first-order feature template in Table 7.2.
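To illustrate, a hypothetical helper that instantiates a few of the verified first-order templates of Table 7.2 as feature strings might look like this (the string layout is our rendering, not the authors' exact encoding):

```python
def verified_bilingual_features(fb2, fvt, fvb, direction):
    """Assemble verified bilingual feature strings for one candidate
    bigram dependency, given the constraint function values."""
    return [
        "<Fb2=%+d, Fvt=%s>" % (fb2, fvt),
        "<Fb2=%+d, Fvt=%s, Dir=%s>" % (fb2, fvt, direction),
        "<Fb2=%+d, Fvb=%+d>" % (fb2, fvb),
        "<Fb2=%+d, Fvb=%+d, Dir=%s>" % (fb2, fvb, direction),
    ]

# the example above: Fb2(rs2:rt3)=+1, Fvt(rt3)=MF, no bilingual subtree found
print(verified_bilingual_features(+1, "MF", -1, "RIGHT"))
```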

7.3.4 Subtree-Based Parser

We combine the base features with the bilingual features by a new scoring function,

S_BST(x, xt, yt, g) = f_b(x, g) · w_b + f_bst(x, xt, yt, g) · w_bst    (7.2)

where xt refers to the target sentence, yt refers to the dependency tree of xt, f_b(x, g) refers to the base features, f_bst(x, xt, yt, g) refers to the bilingual features, and w_b and w_bst are their corresponding weights, respectively. The feature weights are learned during training using MIRA (Crammer and Singer 2003; McDonald et al. 2005). Note that w_b is also retrained here. Thus, given a sentence pair (x, xt, yt, Ast), we find the parsing tree y_BST for x:

y_BST = argmax_{y ∈ Y(G_x)} Σ_{g ∈ y} S_BST(x, xt, yt, g)

7.4 Experiments for Monolingual Parsing

7.4.1 Data Sets

For English, we use the Penn Treebank (Marcus et al. 1993) in our experiments and the tool "Penn2Malt" to convert the data into dependency structures using a standard set of head rules (Yamada and Matsumoto 2003). To match previous work (Koo et al. 2008; McDonald et al. 2005; McDonald and Pereira 2006), we split the data into a training set (sections 2–21), a development set (section 22), and a test set (section 23) and use the same setting for part-of-speech tags. As in the previous works on this data, we use auto-generated part-of-speech tags instead of the gold-standard tags. Following the work of Koo et al. (2008), we use the MXPOST (Ratnaparkhi 1996) tagger trained on the training data to provide part-of-speech tags for the development and test sets. In practice, the parsers trained on the data with auto-generated part-of-speech tags perform a little better than those trained on the data with gold part-of-speech tags when parsing sentences with auto-generated part-of-speech tags. Thus, we use 10-way jackknifing (tagging each fold with the tagger trained on the other nine folds) to generate part-of-speech tags for the training set, as did Koo et al. (2008). For the unannotated data, we use the BLLIP corpus (Charniak et al. 2000), which contains about 43 million words of WSJ text.⁵ We use the MXPOST tagger trained on the training data to assign part-of-speech tags and use the baseline parser to process the sentences of the BLLIP corpus.

⁵ We ensure that the text used for extracting subtrees does not include the sentences of the Penn Treebank.

For Chinese, we use the Chinese Treebank version 4.0 (CTB4)⁶ in the experiments. We also use the "Penn2Malt" tool to convert the data and create a data split: files 1–270 and files 400–931 for training, files 271–300 for testing, and files 301–325 for development. We use gold-standard segmentation and part-of-speech tags in the CTB. The data partition and part-of-speech settings are chosen to match previous work (Chen et al. 2008; Yu et al. 2008). For the unannotated data, we use the XIN_CMN portion of Chinese Gigaword Version 2.0 (LDC2009T14) (Huang 2009), which has approximately 311 million words whose segmentation and POS tags are given.⁷ We discard these annotations due to the differences in annotation policy between the CTB and this corpus. We use the MMA system (Kruengkrai et al. 2009) trained on the training data to perform word segmentation and POS tagging and use the baseline parser to parse all the sentences in the data.

We measure the parser quality by the unlabeled attachment score (UAS), i.e., the percentage of tokens (excluding all punctuation tokens) with the correct HEAD. We also evaluate on complete dependency analysis.
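For concreteness, a minimal sketch of these two metrics, assuming per-sentence lists of gold and predicted heads plus a punctuation mask, is given below (our illustration):

```python
def evaluate(gold_heads, pred_heads, is_punct):
    """Return (UAS, complete-match rate) in percent.  Each argument is a
    list of per-sentence lists; punctuation tokens are excluded."""
    correct = total = complete = 0
    for gold, pred, punct in zip(gold_heads, pred_heads, is_punct):
        ok = [g == p for g, p, pu in zip(gold, pred, punct) if not pu]
        correct += sum(ok)
        total += len(ok)
        complete += all(ok)
    return 100.0 * correct / total, 100.0 * complete / len(gold_heads)
```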

7.4.2 Experimental Results

7.4.2.1 Main Results of English Data

The results on the test set of the PTB are shown in Table 7.3, where Ord1/Ord2 refers to a first-/second-order model with base features, Ord1s/Ord2s refers to a first-/second-order model with base + subtree-based features, and the improvements by the subtree-based features over the base features are shown in parentheses. Note that we use both the bigram- and trigram-subtrees in Ord2s. The parsers using the subtree-based features consistently outperform those using the base features. For the first-order parser, we find that there is an absolute improvement of 0.81 points (UAS) when the subtree-based features are added. For the second-order parser, we obtain an absolute improvement of 0.97 points (UAS) by including the subtree-based features. The improvements in parsing with the subtree-based features are significant in McNemar's test (p < 10⁻⁶). In the second-order model, the number of subtree-based features is about 92 thousand and that of the base features is about 8,823 thousand.

We also check the sole effect of the bigram- and trigram-subtrees. These results are also shown in Table 7.3, where Ord2b/Ord2t refers to a second-order model with bigram-/trigram-subtrees only. The results show that the trigram-subtrees can provide further improvement.

Table 7.3 Main results on PTB (test) for English

           UAS             Complete
  Ord1     90.95           37.45
  Ord1s    91.76 (+0.81)   40.68
  Ord2     91.92           44.28
  Ord2s    92.89 (+0.97)   47.97
  Ord2b    92.26           45.03
  Ord2t    92.67           47.01

⁶ http://www.cis.upenn.edu/~chinese/
⁷ We excluded the sentences of the CTB data from the Gigaword data.

7.4.2.2 Comparative Results of English Data

Table 7.4 shows the performance of the systems that were compared, where Y&M2003 refers to the parser of Yamada and Matsumoto (2003), CO2006 refers to the parser of Corston-Oliver et al. (2006), Hall2006 refers to the parser of Hall et al. (2006), Wang2007 refers to the parser of Wang et al. (2007), Z&C2008 refers to the combined graph-based and transition-based system of Zhang and Clark (2008), KOO08-dep1c/KOO08-dep2c refers to a graph-based system with first-/second-order cluster-based features by Koo et al. (2008), Carreras2008 refers to the parser of Carreras et al. (2008), and Suzuki2009 refers to the parser of Suzuki et al. (2009). The results show that Ord2s performs better than the first five systems. Our system performs worse than KOO08-dep2c, which uses word clusters generated from the BLLIP corpus. Carreras2008 (Carreras et al. 2008) reports a very high accuracy using information on the constituent structure of the TAG grammar formalism. We do not use such knowledge. Suzuki2009 (Suzuki et al. 2009) reports the best reported result by combining a Semi-supervised Structured Conditional Model (Suzuki and Isozaki 2008) with the method of Koo et al. (2008).

Our subtree-based features could be combined with the techniques presented in other work, such as the cluster-based features of Koo et al. (2008), the integrating methods of Zhang and Clark (2008) and Nivre and McDonald (2008), and the model of Suzuki et al. (2009).

To demonstrate that the subtree-based approach and the other work are complementary, we implement a system using all the techniques we have at hand: it uses the subtree- and cluster-based features and applies the integrating method of Nivre and McDonald (2008). We use the word clustering tool,⁸ which was used by Koo et al. (2008), to produce word clusters on the BLLIP corpus. The cluster-based features are the same as those used by Koo et al. (2008). For the integrating method of Nivre and McDonald (2008), we use the transition MaxEnt-based parser of Zhao and Kit (2008) because it is faster than the MaltParser. The results are shown at the bottom part of Table 7.4, where Ord1c/Ord2c refers to a first-/second-order model with cluster-based features, Ord1i/Ord2i refers to a first-/second-order model with integrating-based features, Ord1sc/Ord2sc refers to a first-/second-order model with subtree-based + cluster-based features, and Ord1sci/Ord2sci refers to a first-/second-order model with subtree-based + cluster-based + integrating-based features. Ord1c/Ord2c perform worse than KOO08-dep1c/-dep2c, but Ord1sci/Ord2sci outperform KOO08-dep1c/KOO08-dep2c by using all the techniques we have. These results indicate that the subtree-based features can provide different information and work well with other techniques.

⁸ http://www.cs.berkeley.edu/~pliang/software/brown-cluster-1.2.zip

Table 7.4 Results on PTB (test) for English, for our parsers and previous work

                 UAS     Complete
  Y&M2003        90.3    38.4
  CO2006         90.8    37.6
  Hall2006       89.4    36.4
  Wang2007       89.2    34.4
  Z&C2008        92.1    45.4
  KOO08-dep1c    92.23   –
  KOO08-dep2c    93.16   –
  Carreras2008   93.5    –
  Suzuki2009     93.79   –
  Ord1           90.95   37.45
  Ord1s          91.76   40.68
  Ord1c          91.88   40.71
  Ord1i          91.68   41.43
  Ord1sc         92.20   42.98
  Ord1sci        92.60   44.28
  Ord2           91.92   44.28
  Ord2s          92.89   47.55
  Ord2c          92.67   46.39
  Ord2i          92.53   47.06
  Ord2sc         93.20   47.97
  Ord2sci        93.55   49.95

7.4.2.3 Main Results of Chinese Data

The results are shown in Table 7.5, where the abbreviations used are the same as those in Table 7.3. As in the English experiments, the parsers with the subtree-based features outperform those with the base features, and the second-order parsers outperform the first-order ones. For the first-order parser, the subtree-based features provide an absolute improvement of 1.73 points (UAS). For the second-order parser, the subtree-based features achieve an absolute improvement of 3.18 points (UAS). The improvements in parsing with the subtree-based features are significant in McNemar's test (p < 10⁻⁷). In the second-order model, the number of subtree-based features is about 75 thousand and that of the base features is about 2,075 thousand.

Table 7.5 Main results on CTB4 (test) for Chinese

           UAS             Complete
  Ord1     86.38           40.80
  Ord1s    88.11 (+1.73)   43.10
  Ord2     88.59           48.85
  Ord2s    91.77 (+3.18)   54.31
  Ord2b    89.42           50.00
  Ord2t    91.20           53.16

Table 7.6 Results on CTB4 (test) for Chinese, for our parsers and for previous work

             All words           40 words
             UAS     Complete    UAS     Complete
  Wang2007   –       –           86.6    28.4
  Chen2008   86.52   –           88.4    –
  Yu2008     87.26   –           –       –
  Zhao2009   87.0    –           88.9    –
  Ord1s      88.11   43.10       91.77   55.93
  Ord1si     88.41   45.11       91.92   59.00
  Ord2s      91.77   54.31       94.34   68.19
  Ord2si     91.93   55.45       94.72   70.88

7.4.2.4 Comparative Results of Chinese Data

Table 7.6 shows the comparative results, where Wang2007 refers to the parser of Wang et al. (2007), Chen2008 refers to the parser of Chen et al. (2008), Zhao2009 refers to the parser of Zhao et al. (2009), and Yu2008 refers to the parser of Yu et al. (2008), which reports the best previous results for this data set. Additionally, "all words" refers to all the sentences in the test set and "40 words"⁹ refers to the sentences with a length of up to 40 words. The table shows that our parsers outperform the previous systems.

We also implement integrating systems for the Chinese data. When we apply the cluster-based features, the performance drops a little. The reason may be that we are using gold POS tags for the Chinese data.¹⁰ Thus we do not use cluster-based features for the integrating systems. The results are shown in Table 7.6, where Ord1si/Ord2si refers to the first-/second-order system with subtree-based + integrating-based features. We find that the integrating systems provide better results. Overall, we have achieved a high accuracy, which is the best known result for this data set.

Duan et al. (2007) and Zhang and Clark (2008) report results on a different data split of the Penn Chinese Treebank (CTB5). We also run our systems (Ord2s) on their data. The results are shown in Table 7.7, where Duan2007 refers to the parser of Duan et al. (2007) and Zhang2008 refers to the parser of Zhang and Clark (2008). The scores are reported for non-root words and root words. The results show that our system performs better than the previous systems on this data.

⁹ Wang et al. (2007) and Chen et al. (2008) reported the scores on these sentences.
¹⁰ When we use the cluster-based features for Chinese with the same setting of POS tags as the English data, the cluster-based features do provide improvement.

7.4.2.5 Effect of Different Sizes of Unannotated Data

Here, we consider the improvement relative to the sizes of the unannotated data. Figures 7.17 and 7.18 show the results of the first-order parsers with different numbers of words in the unannotated data. From the figures, we find that the parser obtains more benefits as we add more unannotated data.

7.4.3 Results Analysis

In this section, we investigate the results at the sentence level from different views. For Figs. 7.19–7.22, we classify each sentence into one of the following three classes: "Better" for those where the proposed parsers provide better results relative to the parsers with base features, "Worse" for those where the proposed parsers provide worse results relative to the base parsers, and "NoChange" for those where the accuracies remain the same.

Table 7.7 Results on CTB5 (test)

              Non-root words   Root words
  Duan2007    84.36            73.70
  Zhang2008   86.21            76.26
  Ord2s       88.13            79.42

Fig. 7.17 Results (UAS) with different sizes of large-scale unannotated data for English (x-axis: size of unannotated data in millions of words)

Fig. 7.18 Results (UAS) with different sizes of large-scale unannotated data for Chinese (x-axis: size of unannotated data in millions of words)

Fig. 7.19 Improvement relative to the number of unknown words for English (x-axis: number of unknown words per sentence; y-axis: smoothed percentage of Better/NoChange/Worse sentences)

7.4.3.1 Unknown Words

Here, we consider the unknown word¹¹ problem, which is an important issue for parsing. We calculate the number of unknown words in one sentence and list the changes for the sentences with unknown words. We compare the Ord1 and Ord1s systems.

Figures 7.19 and 7.20 show the results, where the x axis refers to the number of unknown words in one sentence and the y axis shows the percentages of the three classes. For example, for sentences having two unknown words in the Chinese data, 24.61 % improved, 15.38 % worsened, and 60.01 % are unchanged. We do not show the results of sentences with more than six unknown words because their numbers are very small. From the figures, we find that the Better curves are always higher than the Worse curves. This indicates that Ord1s provides better results than the baseline for sentences with different numbers of unknown words. For the Chinese data, the results indicate that the improvements (the gap between the Better and Worse curves) apparently become larger when the sentences have more unknown words. For the English data, the graph shows that the improvements become slightly larger when the sentences have more unknown words, though the improvements for the sentences with three and four unknown words are slightly less than the others. We also find that BST has a greater chance of producing different results, as the NoChange curves show along with the numbers of unknown words, though it may have a risk of providing worse results, as the Worse curves show.

¹¹ An unknown word is a word that is not included in the training data.

Fig. 7.20 Improvement relative to the number of unknown words for Chinese (x-axis: number of unknown words per sentence; y-axis: smoothed percentage of Better/NoChange/Worse sentences)

7.4.3.2 PP Attachment

We analyze the behavior of our new parsers for preposition-phrase attachment, which is also a difficult task for parsing (Ratnaparkhi et al. 1994). We compare the Ord2 system with the Ord2s system. Figures 7.21 and 7.22 show how the subtree-based features affect accuracy as a function of the number of prepositions, where the x axis refers to the number of prepositions in one sentence and the y axis shows the percentages of the three classes. The figures show that BST has a greater chance of producing different results, as the NoChange curves show along with the numbers of prepositions, though it may have a risk of providing worse results, as the Worse curves show. For the English data, the improvements become larger when the sentences have more prepositions. For the Chinese data, the improvements enlarge slightly when the sentences have more prepositions.

Fig. 7.21 Improvement relative to the number of prepositions for English (x-axis: number of prepositions per sentence; y-axis: smoothed percentage of Better/NoChange/Worse sentences)

Fig. 7.22 Improvement relative to the number of prepositions for Chinese (x-axis: number of prepositions per sentence; y-axis: smoothed percentage of Better/NoChange/Worse sentences)

7.5 Experiments for Bilingual Parsing

7.5.1 Data Sets

We evaluate the proposed method on the standard data sets, i.e., the translated portion of the Chinese Treebank V2 (CTB2tp) (Bies et al. 2007), articles 1–325 of CTB, which have English translations with gold-standard parse trees. The tool "Penn2Malt"¹² is used to convert the data into dependency structures. We use the same data settings as in the previous studies (Burkett and Klein 2008; Huang et al. 2009; Chen et al. 2010): 1–270 for training, 301–325 for development, and 271–300 for testing. Note that we do not use the human translation on the English side of this bilingual treebank to train our new parsers. For testing, we use two settings: a test with human translation and another with auto-translation. To process the unannotated data, we train first-order Parsers on the training data.

¹² http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html

Table 7.8 Number of sentences of the evaluation data sets used

           Train    Dev   Test
  CTB2tp   2,745    273   290
  CTB7     50,747   273   290

To determine whether the proposed method also works for larger monolingual treebanks, we test our methods on CTB7 (LDC2010T07), which includes many more sentences than CTB2tp. We use articles 301–325 for development, 271–300 for testing, and the rest for training. That is, we evaluate the systems on the same test data as for CTB2tp. Table 7.8 shows the statistical information on the data sets.

We build Chinese-to-English SMT systems using Moses.¹³ Minimum error rate training (MERT) with respect to the BLEU score is used to tune the parameters of the systems. The translation model is created from the FBIS corpus (LDC2003E14). We use SRILM¹⁴ to train a 5-gram language model. The language model is trained on the target side of the FBIS corpus and the Xinhua news in the English Gigaword corpus (LDC2009T13). The development sentences are from the test set of NIST MT03 and the test sentences are from the test set of the NIST MT06 evaluation campaign.¹⁵

We then use the SMT systems to translate the training data of CTB2tp and CTB7. To enable direct comparison with the results of previous work (Huang et al. 2009; Chen et al. 2010), we also use the same word alignment tool, the Berkeley Aligner (DeNero and Klein 2007; Liang et al. 2006), to perform word alignment for CTB2tp and CTB7. We train the tool on the FBIS corpus and remove notoriously bad links in {a, an, the, 的 (de), 了 (le)}, as was done by Huang et al. (2009).

To train an English parser, we use the Penn English Treebank (PTB) (Marcus et al. 1993) in our experiments and the tool "Penn2Malt" to convert the data. We split the data into a training set (sections 2–21), a development set (section 22), and a test set (section 23). We train first-order and second-order Parsert on the training data. The unlabeled attachment score (UAS) of the second-order Parsert is 91.92, indicating state-of-the-art accuracy on the test data. We use the second-order Parsert to parse the auto-translated and human-translated target sentences in the CTB data.

To extract English subtrees, we use the BLLIP corpus (Charniak et al. 2000), which consists of about 43 million words of WSJ text. We use the MXPOST tagger (Ratnaparkhi 1996) trained on the training data to assign POS tags and use the first-order Parser_t to process the sentences in the BLLIP corpus. To extract bilingual subtrees, we use the FBIS corpus and an additional bilingual corpus containing 800,000 sentence pairs from the training data of the NIST MT08 evaluation campaign. On the Chinese side, we use the morphological analyzer described in Kruengkrai et al. (2009) trained on the training data of CTB2tp to perform word segmentation and POS tagging and use the first-order Parser_s to parse all the sentences in the data. On the English side, we use the same procedure as for the BLLIP corpus. Word alignment is performed using the Berkeley Aligner.

13 http://www.statmt.org/moses/
14 http://www.speech.sri.com/projects/srilm/download.html
15 http://www.itl.nist.gov/iad/mig//tests/mt/

Table 7.9 List of resources

Purpose                                  Resources
(1) Train SMT systems                    The FBIS corpus; the English Gigaword corpus
(2) Train Berkeley Aligner               The FBIS corpus
(3) Train Parser_t (English)             The Penn English Treebank
(4) Extract target (English) subtrees    The BLLIP corpus
(5) Extract bilingual subtrees           The training data of NIST MT08 evaluation; the FBIS corpus

The resources used are summarized in Table 7.9. We use the FBIS corpus and the English Gigaword corpus to train the SMT systems (1), which are used to translate the monolingual treebanks (CTB2tp and CTB7) into the target language. Then we perform word alignment using the Berkeley Aligner trained on the FBIS corpus (2). The target sentences are parsed by Parser_t trained on the Penn English Treebank (3). To verify the bilingual constraints, we extract the target subtrees from the BLLIP corpus (4) and the bilingual subtrees from the FBIS corpus and the training data of the NIST MT08 evaluation (5).

7.5.2 Experimental Results

To compare with the previously reported results of Burkett and Klein (2008), Huang et al. (2009), and Chen et al. (2010), we use the test data with human translation in our experiments. The target sentences are parsed by the second-order Parser_t.

We report the parser quality by the UAS.

7.5.2.1 Training with CTB2tp

We first conduct experiments on the CTB2tp data set, which is also used in other studies (Burkett and Klein 2008; Chen et al. 2010; Huang et al. 2009). The results are given in Table 7.10, where Baseline refers to the system with the base features, Bu refers to the system after adding only the original bilingual features of Table 7.1, BST refers to the system after adding all the verified bilingual features of Table 7.2 to Baseline, and ORACLE refers to using the human translation as training data with the features of Table 7.1.

Table 7.10 Results of training with CTB2tp on the test set (UAS)

          Order-1          Order-2
Baseline  84.35            87.20
Bu        84.71 (+0.36)    87.85 (+0.65)
BST       85.37 (+1.02)    88.49 (+1.29)
ORACLE    85.79 (+1.44)    88.87 (+1.67)

We obtain an absolute improvement of 1.02 points for the first-order model and 1.29 points for the second-order model by adding the verified bilingual features. The improvements of the final systems (BST) over the baselines are significant according to McNemar's test (p < 10^-3 for the first-order model and p < 10^-4 for the second-order model). Adding only the original bilingual features (Bu) results in less improvement (lower by 0.66 points for the first-order model and 0.64 points for the second-order model compared with BST). This indicates that the verified bilingual constraints provide useful information for the parsing models.
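The significance values above come from McNemar's test over per-token attachment decisions. As a rough illustration (our own sketch, not the authors' evaluation code; all names are ours), the statistic can be computed from two parsers' per-token correctness as follows:

```python
import math

def mcnemar(correct_a, correct_b):
    """correct_a / correct_b: per-token booleans (head attached correctly or not)
    for two parsers evaluated on the same test set."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)        # continuity-corrected statistic
    p_value = math.erfc(math.sqrt(stat / 2.0))    # chi-square survival function, 1 d.o.f.
    return stat, p_value
```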

We also find that BST is about 0.3 points lower than ORACLE. The reason is mainly the imperfect translations, although we use the large-scale subtree lists to help verify the reliability of the constraints. We tried adding the features of Table 7.2 to the ORACLE system but obtained worse results. These results indicate that our method benefits from the verified constraints, while ORACLE needs only the bilingual constraints. Note that the UAS scores of ORACLE are affected by the word alignment, which is performed automatically.

7.5.2.2 Training with CTB7

Here, we would like to demonstrate that our method is still able to provide improvement, even if we utilize larger training data that results in strong baseline systems. We randomly select the training sentences from CTB7. Note that CTB7 includes text from different genres and sources, while CTB2tp only includes text from Xinhua newswire. Figure 7.23 shows the results of using different sizes of CTB7 training data, where the numbers on the x-axis refer to the numbers of training sentences used, Baseline1 and Baseline2 refer to the first- and second-order baseline systems, and OURS1 and OURS2 refer to our first- and second-order systems. The figure indicates that our system always outperforms the baseline systems. For small data sizes, our system performs much better than the baselines. For example, when using 5,000 sentences, our second-order system provides a 1.26-point improvement over the second-order baseline. Finally, when we use all of the CTB7 training data, our system achieves 91.66 for the second-order model, while the baseline achieves 91.10. These results indicate that our method continues to achieve improvement when we use larger training data.


Fig. 7.23 Results of using different sizes of training data (UAS vs. amount of training data in thousands of sentences, for Baseline1, OURS1, Baseline2, and OURS2)

Table 7.11 Results of using different settings for training SMT systems

       D10    D20    D50    D100   BTrain  GTran  ORACLE
BLEU   18.45  21.82  25.69  27.16  31.75   n/a    n/a
UAS    87.63  87.67  88.20  88.49  88.51   88.58  88.87

7.5.2.3 Different Settings for Training SMT Systems

We investigate the effects of using different settings for training the SMT systems. We randomly select 10 %, 20 %, and 50 % of the sentences in FBIS and use them to train the Moses systems that are used to translate CTB2tp. The results are reported in Table 7.11, where D10, D20, D50, and D100 indicate training the system using 10 %, 20 %, 50 %, and 100 % of the sentences, respectively. We also train an SMT system on a data set containing nine million sentence pairs (different from FBIS), and the results are shown as BTrain in the table. For reference, we use the Google Translate online system,16 indicated as GTran in the table, to translate CTB2tp.

From the table, we find that the BLEU17 and UAS scores increase with the number of sentences used for training. However, the differences among the UAS scores of D50, D100, BTrain, and GTran are small. This indicates that our method is very robust to imperfect translation results. The reason is that we use a large amount of unannotated data to verify the reliability of the bilingual constraints. Note that the parsing results are also affected by the word alignment, which also contains errors.

16 http://translate.google.com/
17 In Chen et al. (2011), we used an early version of multi-bleu.pl, so the BLEU scores look very low. In this experiment, we use mteval-v11b.pl.


Table 7.12 Comparison of our results with those of previously reported systems

          With CTB2tp            With CTB7
Type      System         UAS     System     UAS
S         Baseline       87.20   Baseline   91.10
HA        Huang2009      86.3    n/a
          Chen2010BI     88.56
          Chen2010ALL    90.13
AG        BST            88.49   BST        91.66
          BST+STs        89.75

7.5.2.4 Comparison with Previous Results

We compare our results with those reported previously for the same data. We divide the systems into three types, S, HA, and AG, which denote training on the monolingual treebank (source side), on a human-annotated bilingual treebank, and on auto-generated bilingual treebanks, respectively. Table 7.12 lists the results, where Huang2009 refers to the result of Huang et al. (2009), Chen2010BI refers to the result of using bilingual features in Chen et al. (2010), and Chen2010ALL refers to the result of using all of the features in Chen et al. (2010). The results show that our new parser achieves better accuracy than Huang2009, which uses a shift-reduce parser, and accuracy comparable to Chen2010BI. To achieve higher performance, we also add the source subtree features (Chen et al. 2009) to our system (BST+STs); the new result is close to Chen2010ALL. Compared with the methods of Huang et al. (2009) and Chen et al. (2010), our method uses an auto-generated bilingual treebank, while theirs require a human-annotated bilingual treebank. Chen et al. (2010) also need great effort in building mapping rules. By using all of the training data of CTB7, we obtain a more powerful baseline that performs much better than the previously reported results. Our parser achieves 91.66, a much higher accuracy than the others.

7.5.3 Results Analysis

We analyze the results at the word level and at the sentence level. At the word level, we compare the UAS scores for the predefined word sets with the average scores for all the words. At the sentence level, we classify each sentence into one of three classes: "Better" for those where the bitext parsers provide better results relative to the baselines, "Worse" for those where the bitext parsers provide worse results relative to the baselines, and "NoChange" for those where the accuracies remain the same.
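A minimal sketch of the sentence-level bookkeeping described above (our own illustration, with hypothetical variable names): each sentence is labeled by comparing the number of correct heads produced by the bitext parser and by the baseline.

```python
def classify_sentences(gold, baseline, bitext):
    """Each argument: list of sentences, where a sentence is a list of head indices."""
    labels = []
    for g, b, n in zip(gold, baseline, bitext):
        base_correct = sum(1 for x, y in zip(g, b) if x == y)
        new_correct = sum(1 for x, y in zip(g, n) if x == y)
        if new_correct > base_correct:
            labels.append("Better")
        elif new_correct < base_correct:
            labels.append("Worse")
        else:
            labels.append("NoChange")
    return labels
```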


Table 7.13 Improvement for "的 (de)" structures (Order-2)

(a) Word level (UAS)
                    SR_DEC   ALL
Baseline            82.31    87.20
BST                 88.44    88.49
(BST − Baseline)    +6.13    +1.29

(b) Sentence level
                    Sent_DEC  Sent_ALL
Worse               18.91     16.55
Better              29.72     23.10
(Better − Worse)    +10.81    +6.55

7.5.3.1 "的 (de)" Structures

In Chinese sentences, the "的 (de)" structures are commonly used and are one of the most difficult problems (Li and Thompson 1997) for parsing. This is because "的 (de)" can play one of two roles (Li and Thompson 1997): (1) a complementizer or a nominalizer and (2) a genitive marker or an associative marker. In the CTB, the first type is tagged as DEC and the second type is tagged as DEG (Xue et al. 2000).

Here, we consider the first case, in which the "的 (de)" structures are relative clauses ("DEC structures" for short). An example is shown in Fig. 7.8. As mentioned, it is hard to determine the head of the subroots of DEC structures, such as the head of "技巧 (jiqiao)" in Fig. 7.8.

We compare the BST system with the baseline system trained on the CTB2tp data. We check the sentences having the DEC structures. Table 7.13 shows the improvement related to the DEC structures for the second-order models. Table 7.13a shows the results at the word level, where SR_DEC refers to the subroots of the DEC structures and ALL refers to all the words. We find that the bitext parser achieves an absolute improvement of 6.13 points for SR_DEC, much better than the average improvement (1.29 points). Table 7.13b shows the results at the sentence level, where Sent_DEC refers to the sentences having the DEC structures and Sent_ALL refers to all the sentences. Again, our method produces better results for Sent_DEC than for Sent_ALL. Overall, these results indicate that the bitext parser provides better results for the DEC structures than the baseline.

7.5.3.2 Conjunction Structures

We analyze the behavior of our bitext parser for coordinating conjunction structures, which are also a very difficult problem for parsing (Kawahara and Kurohashi 2008). Here, we again compare the BST system with the baseline system trained on the CTB2tp data.

Table 7.14 shows the improvement related to conjunction structures for the second-order models. The results are again shown at the word level and the sentence level. Table 7.14a shows the improvement at the word level, where CC refers to the coordinating conjunctions and ALL refers to all the words. We find that the bitext parser achieves an absolute improvement of 4.29 points for the conjunctions, much better than the average improvement (1.29 points). Table 7.14b shows the improvement at the sentence level, where Sent_CC refers to the sentences having at least one conjunction and Sent_ALL refers to all the sentences. For Sent_CC, 35.73 % of the sentences are improved and 23.57 % are worsened, while 23.10 % are improved and 16.55 % are worsened for Sent_ALL. These results indicate that the bilingual features do improve the performance for the coordinating conjunction problem.

Table 7.14 Improvement for conjunction structures (Order-2)

(a) Word level (UAS)
                    CC      ALL
Baseline            80.19   87.20
BST                 84.48   88.49
(BST − Baseline)    +4.29   +1.29

(b) Sentence level
                    Sent_CC  Sent_ALL
Worse               23.57    16.55
Better              35.72    23.10
(Better − Worse)    +12.15   +6.55

7.6 Summary

In this chapter, we have presented a subtree-based semi-supervised approach to improve monolingual and bilingual dependency parsing. In the approach, a baseline parser is first used to parse large-scale unannotated data, and then subtrees are extracted from the dependency parsing trees in the auto-parsed data. We also propose a method to classify the extracted subtrees into sets and assign labels to the sets. Finally, we design new subtree-based features for the parsing models. The subtree-based approach is applied to the monolingual and bilingual parsing tasks.

References

Bies, A., Palmer, M., Mott, J., & Warner, C. (2007). English Chinese translation Treebank V 1.0, LDC2007T02. Linguistic Data Consortium.

Burkett, D., & Klein, D. (2008). Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP 2008 (pp. 877–886). Honolulu: Association for Computational Linguistics.

Carreras, X. (2007). Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007 (pp. 957–961). Prague: Association for Computational Linguistics.

Carreras, X., Collins, M., & Koo, T. (2008). TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In Proceedings of CoNLL 2008 (pp. 9–16). Manchester: Coling 2008 Organizing Committee.

Charniak, E., Blaheta, D., Ge, N., Hall, K., Hale, J., & Johnson, M. (2000). BLLIP 1987–89 WSJ corpus release 1, LDC2000T43. Linguistic Data Consortium.

Chen, W., Kawahara, D., Uchimoto, K., Zhang, Y., & Isahara, H. (2008). Dependency parsing with short dependency relations in unlabeled data. In Proceedings of IJCNLP 2008, Hyderabad.


Chen, W., Kazama, J., & Torisawa, K. (2010). Bitext dependency parsing with bilingual subtree constraints. In Proceedings of ACL 2010 (pp. 21–29). Uppsala: Association for Computational Linguistics.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of EMNLP 2009, Singapore (pp. 570–579).

Chen, W., Kazama, J., Zhang, M., Tsuruoka, Y., Zhang, Y., Wang, Y., Torisawa, K., & Li, H. (2011). SMT helps bitext dependency parsing. In Proceedings of EMNLP 2011, Edinburgh.

Corston-Oliver, S., Aue, A., Duh, K., & Ringger, E. (2006). Multilingual dependency parsing using Bayes point machines. In HLT-NAACL 2006, New York.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951–991. doi:http://dx.doi.org/10.1162/jmlr.2003.3.4-5.951.

DeNero, J., & Klein, D. (2007). Tailoring word alignments to syntactic machine translation. In Proceedings of ACL 2007 (pp. 17–24). Prague: Association for Computational Linguistics.

Duan, X., Zhao, J., & Xu, B. (2007). Probabilistic models for action-based Chinese dependency parsing. In Proceedings of ECML/PKDD, Warsaw.

Hall, J., Nivre, J., & Nilsson, J. (2006). Discriminative classifiers for deterministic dependency parsing. In Proceedings of CoLING-ACL, Sydney.

Huang, C. R. (2009). Tagged Chinese Gigaword version 2.0, LDC2009T14. Linguistic Data Consortium.

Huang, L., Jiang, W., & Liu, Q. (2009). Bilingually-constrained (monolingual) shift-reduce parsing. In Proceedings of EMNLP 2009 (pp. 1222–1231). Singapore: Association for Computational Linguistics.

Kawahara, D., & Kurohashi, S. (2008). Coordination disambiguation without any similarities. In Proceedings of Coling 2008, Manchester (pp. 425–432).

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the Association for Computational Linguistics companion volume: Proceedings of the demo and poster sessions (pp. 177–180). Prague: Association for Computational Linguistics.

Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of NAACL 2003, Edmonton (pp. 48–54). Association for Computational Linguistics.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Koo, T., & Collins, M. (2010). Efficient third-order dependency parsers. In Proceedings of ACL 2010 (pp. 1–11). Uppsala: Association for Computational Linguistics.

Kruengkrai, C., Uchimoto, K., Kazama, J., Wang, Y., Torisawa, K., & Isahara, H. (2009). An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL-IJCNLP 2009 (pp. 513–521). Suntec: Association for Computational Linguistics.

Li, C. N., & Thompson, S. A. (1997). Mandarin Chinese – a functional reference grammar. Oakland: University of California Press.

Liang, P., Taskar, B., & Klein, D. (2006). Alignment by agreement. In Proceedings of NAACL 2006 (pp. 104–111). New York City: Association for Computational Linguistics.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2), 313–330.

McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005 (pp. 91–98). East Stroudsburg: Association for Computational Linguistics.

McDonald, R., & Pereira, F. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL 2006, Trento (pp. 81–88).

Nivre, J., & McDonald, R. (2008). Integrating graph-based and transition-based dependency parsers. In Proceedings of ACL-08: HLT, Columbus.


van Noord, G. (2007). Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of IWPT-07, Prague.

Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP 1996, Philadelphia (pp. 133–142).

Ratnaparkhi, A., Reynar, J., & Roukos, S. (1994). A maximum entropy model for prepositional phrase attachment. In Proceedings of HLT, Plainsboro (pp. 250–255).

Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 1044–1050).

Steedman, M., Osborne, M., Sarkar, A., Clark, S., Hwa, R., Hockenmaier, J., Ruhlen, P., Baker, S., & Crim, J. (2003). Bootstrapping statistical parsers from small datasets. In Proceedings of EACL 2003, Budapest (pp. 331–338).

Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of ACL-08: HLT (pp. 665–673). Columbus: Association for Computational Linguistics.

Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP 2009 (pp. 551–560). Singapore: Association for Computational Linguistics.

Wang, Q. I., Lin, D., & Schuurmans, D. (2007). Simple training of dependency parsers via structured boosting. In Proceedings of IJCAI 2007, Hyderabad.

Xue, N., Xia, F., Huang, S., & Kroch, A. (2000). The bracketing guidelines for the Penn Chinese Treebank. Technical report, University of Pennsylvania.

Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy (pp. 195–206).

Yu, K., Kawahara, D., & Kurohashi, S. (2008). Chinese dependency parsing with large scale automatically constructed case structures. In Proceedings of Coling 2008, Manchester (pp. 1049–1056).

Zhang, Y., & Clark, S. (2008). A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of EMNLP 2008, Honolulu (pp. 562–571).

Zhao, H., & Kit, C. (2008). Parsing syntactic and semantic dependencies with two single-stage maximum entropy models. In Proceedings of CoNLL 2008, Manchester (pp. 203–207).

Zhao, H., Song, Y., Kit, C., & Zhou, G. (2009). Cross language dependency parsing using a bilingual lexicon. In Proceedings of ACL-IJCNLP 2009 (pp. 55–63). Suntec: Association for Computational Linguistics.


Chapter 8 Training with Dependency Language Models

In this chapter, we describe an approach that enriches the feature representations for a graph-based model using a dependency language model (DLM) (Shen et al. 2008). The N-gram DLM has the ability to predict the next child based on the N − 1 immediately preceding children and their head (Shen et al. 2008).

There are several previous studies that exploit high-order features and obtain significant improvements. McDonald et al. (2005) and Covington (2001) develop models that represent first-order features over a single arc in graphs. By extending the first-order model, McDonald and Pereira (2006) and Carreras (2007) exploit second-order features over two adjacent arcs in second-order models. Koo and Collins (2010) further propose a third-order model that uses third-order features. These models utilize higher-order feature representations and achieve better performance than the first-order models, but this comes at the cost of higher decoding complexity, from O(n^2) to O(n^4), where n is the length of the input sentence. Thus, it is very hard to develop higher-order models further in this way.

How to enrich high-order feature representations without increasing the decoding complexity of graph-based models therefore becomes a very challenging problem in dependency parsing. In this chapter, we describe an approach that addresses this issue by using a dependency language model (DLM) (Shen et al. 2008). The N-gram DLM predicts the next child based on the N − 1 immediately preceding children and their head (Shen et al. 2008). The basic idea is that the DLM is used to evaluate whether a valid dependency tree (McDonald and Nivre 2007) is well formed from a large-scope view. The parsing model searches for the final dependency trees by considering both the original scores and the DLM scores.

In the approach, the DLM is built on a large amount of auto-parsed data, which is processed by a first-order baseline parser (McDonald et al. 2005). A set of new features is defined based on the DLM. The DLM-based features can capture the N-gram information of the parent–children structures for the parsing model. They are then integrated directly into the decoding algorithms using beam search. The new parsing model can utilize rich high-order feature representations without increasing the complexity. The DLM-based approach is applied to monolingual text (monotext) parsing. It is extended to parse bilingual texts (bitexts) by adding the DLM-based features on both the source and target sides.

8.1 Dependency Language Model

Language models play a very important role in statistical machine translation (SMT). The standard N-gram-based language model predicts the next word based on the N − 1 immediately preceding words. However, the traditional N-gram language model cannot capture long-distance word relations. To overcome this problem, Shen et al. (2008) proposed a dependency language model (DLM) to exploit long-distance word relations for SMT. The N-gram DLM predicts the next child of a head based on the N − 1 immediately preceding children and the head itself. In this chapter, we define a DLM, which is similar to the one of Shen et al. (2008), to score entire dependency trees.

An input sentence is denoted by x = (x_0, x_1, \ldots, x_i, \ldots, x_n), where x_0 = ROOT does not depend on any other token in x and each token x_i refers to a word. Let y be a dependency tree for x and H(y) be the set of words that have at least one dependent. For each x_h \in H(y), we have a dependency structure D_h = (x_{Lk}, \ldots, x_{L1}, x_h, x_{R1}, \ldots, x_{Rm}), where x_{Lk}, \ldots, x_{L1} are the children on the left side from the farthest to the nearest and x_{R1}, \ldots, x_{Rm} are the children on the right side from the nearest to the farthest. The probability P(D_h) is defined as follows:

P(D_h) = P_L(D_h) \cdot P_R(D_h)   (8.1)

Here, P_L and P_R are the left- and right-side generative probabilities, respectively. Suppose we use an N-gram dependency language model. P_L is defined as follows:

P_L(D_h) \approx P_{Lc}(x_{L1} | x_h) \cdot P_{Lc}(x_{L2} | x_{L1}, x_h) \cdots P_{Lc}(x_{Lk} | x_{L(k-1)}, \ldots, x_{L(k-N+1)}, x_h)   (8.2)

where the approximation is based on the Markov assumption underlying the N-gram model. The right-side probability is defined analogously. For a dependency tree, we calculate the probability as follows:

P(y) = \prod_{x_h \in H(y)} P(D_h)   (8.3)
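To make Eqs. (8.1)–(8.3) concrete, the following minimal sketch scores a tree with an N-gram DLM whose conditional probabilities are supplied as lookup tables. The encoding of the history and all names are our own assumptions, not the authors' implementation; the 1e-12 floor stands in for the mapping-function treatment of unseen events described in Sect. 8.3.3.4.

```python
def dlm_tree_prob(words, heads, p_lc, p_rc, n=3):
    """words[0] is ROOT; heads[i] is the head index of token i (heads[0] is unused).
    p_lc / p_rc map (child_word, head_word, prev_children_tuple) -> probability,
    where prev_children_tuple holds up to n-1 nearer children on the same side."""
    prob = 1.0
    for h in range(len(words)):
        left = sorted((i for i in range(1, len(words)) if heads[i] == h and i < h),
                      reverse=True)                           # nearest to farthest
        right = sorted(i for i in range(1, len(words)) if heads[i] == h and i > h)
        for children, table in ((left, p_lc), (right, p_rc)):
            prev = []
            for c in children:
                key = (words[c], words[h], tuple(prev[-(n - 1):]))
                prob *= table.get(key, 1e-12)                 # floor for unseen events
                prev.append(words[c])
    return prob
```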

A linear model is used to calculate the scores for the parsing models (defined in Sect. 2.1). Accordingly, we reform Eq. (8.3). We define f_DLM as a high-dimensional feature representation based on arbitrary features of P_Lc, P_Rc, and x. Then, the DLM score of tree y is computed as the inner product of f_DLM with a corresponding weight vector w_DLM:

score_DLM(y) = f_DLM \cdot w_DLM   (8.4)

8.2 Parsing with Dependency Language Model

In this section, we describe a parsing model that includes the dependency language model by extending the model of McDonald et al. (2005).

8.2.1 Add DLM Scores

In the DLM-based approach, we consider the scores of the DLM when searching for the maximum spanning tree. Then for a given sentence x, we find y_DLM,

y_DLM = \arg\max_{y \in T(G_x)} \left( \sum_{g \in y} score(w, x, g) + score_DLM(y) \right)

After adding the DLM scores, the new parsing model can capture richer information. Figure 8.1 illustrates the changes. In the original first-order parsing model, we only utilize the information of the single arc (x_h, x_{L(k-1)}) for x_{L(k-1)}, as shown in Fig. 8.1a. If we use a 3-gram DLM, we can utilize the additional information of the two previous children (nearer to x_h than x_{L(k-1)}), x_{L(k-2)} and x_{L(k-3)}, as shown in Fig. 8.1b.

Fig. 8.1 Adding the DLM scores to the parsing model: (a) the head x_h with children x_{Lk} ... x_{L1} and x_{R1} ... x_{Rm}, where only the single arc to x_{L(k-1)} is used; (b) the same structure when a 3-gram DLM additionally makes x_{L(k-2)} and x_{L(k-3)} available


Table 8.1 DLM-based feature templates

< Φ(Pu(ch)), TYPE >
< Φ(Pu(ch)), TYPE, h_pos >
< Φ(Pu(ch)), TYPE, h_word >
< Φ(Pu(ch)), TYPE, ch_pos >
< Φ(Pu(ch)), TYPE, ch_word >
< Φ(Pu(ch)), TYPE, h_pos, ch_pos >
< Φ(Pu(ch)), TYPE, h_word, ch_word >

8.2.2 DLM-Based Feature Templates

A set of DLM-based features is defined for D_h = (x_{Lk}, \ldots, x_{L1}, x_h, x_{R1}, \ldots, x_{Rm}). For each child x_ch on the left side, we have P_Lc(x_ch | HIS), where HIS refers to the N − 1 immediately preceding (nearer) children on the same side together with the head x_h. Similarly, we have P_Rc(x_ch | HIS) for each child on the right side. Let P_u(x_ch | HIS) (P_u(ch) for short) be one of the above probabilities. We use the mapping function Φ(P_u(ch)) to obtain a predefined discrete value (defined in Sect. 8.3.3.4). The feature templates are outlined in Table 8.1, where TYPE refers to one of the types P_L or P_R, h_pos refers to the part-of-speech tag of x_h, h_word refers to the lexical form of x_h, ch_pos refers to the part-of-speech tag of x_ch, and ch_word refers to the lexical form of x_ch.
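The templates in Table 8.1 can be instantiated as plain feature strings. A minimal sketch under our own naming conventions (the bucket value is Φ(P_u(ch)) from Sect. 8.3.3.4, and dlm_type is PL or PR):

```python
def dlm_features(bucket, dlm_type, h_word, h_pos, ch_word, ch_pos):
    """Instantiate the seven DLM-based feature templates for one (head, child) pair."""
    return [
        f"DLM:{bucket}:{dlm_type}",
        f"DLM:{bucket}:{dlm_type}:hp={h_pos}",
        f"DLM:{bucket}:{dlm_type}:hw={h_word}",
        f"DLM:{bucket}:{dlm_type}:cp={ch_pos}",
        f"DLM:{bucket}:{dlm_type}:cw={ch_word}",
        f"DLM:{bucket}:{dlm_type}:hp={h_pos}:cp={ch_pos}",
        f"DLM:{bucket}:{dlm_type}:hw={h_word}:cw={ch_word}",
    ]
```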

8.3 Decoding

In this section, we turn to the problem of adding the DLM to the decoding algorithm. Two solutions are proposed: (1) rescoring, in which we rescore the K-best list with the DLM-based features, and (2) intersect, in which we add the DLM-based features to the decoding algorithm directly.

8.3.1 Rescoring

The DLM-based features are used in the decoding procedure via the rescoring technique of Shen et al. (2008). We can use an original parser to produce the K-best list. This method has the potential to be very fast. However, because the performance of this method is restricted to the K-best list, we may have to set K to a high number in order to find the best parse tree (with DLM) or a tree acceptably close to the best (Shen et al. 2008).
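A hedged sketch of the rescoring variant: given a K-best list of trees with their original model scores, we simply add a weighted DLM score and re-rank. The function name dlm_score and the weighting scheme are our assumptions for illustration.

```python
def rescore_kbest(kbest, dlm_score, weight=1.0):
    """kbest: list of (tree, base_score) pairs; dlm_score(tree) returns the DLM feature score."""
    rescored = [(tree, base + weight * dlm_score(tree)) for tree, base in kbest]
    return max(rescored, key=lambda pair: pair[1])[0]
```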

8.3.2 Intersect

In the second solution, the DLM-based features are used in the decoding algorithm directly. The DLM-based features are generated online during decoding.


For our parser, we use the decoding algorithm of McDonald et al. (2005). The algorithm is an extension of the parsing algorithm of Eisner (1996), which is a modified version of the CKY chart parsing algorithm. Here, we describe how to add the DLM-based features to the first-order algorithm. The second-order and higher-order algorithms can be extended in a similar way.

The parsing algorithm independently parses the left and right dependents of a word and combines them later. There are two types of chart items (McDonald and Pereira 2006): (1) a complete item, in which the words are unable to accept more dependents in a certain direction, and (2) an incomplete item, in which the words can accept more dependents in a certain direction. In the algorithm, we create both types of chart items with two directions for all the word pairs in a given sentence. The direction of a dependency is from the head to the dependent. The right (left) direction indicates that the dependent is on the right (left) side of the head. Larger chart items are created from pairs of smaller ones in a bottom-up style. In the following figures, complete items are represented by triangles, and incomplete items are represented by trapezoids. Figure 8.2 illustrates the cubic parsing actions of the algorithm (Eisner 1996) in the right direction, where s, r, and t refer to the start and end indices of the chart items. In Fig. 8.2a, all the items on the left side are complete, and the algorithm creates the incomplete item (the trapezoid on the right side) of s–t. This action builds a dependency relation from s to t. In Fig. 8.2b, the item of s–r is incomplete and the item of r–t is complete. Then the algorithm creates the complete item of s–t. In this action, all the children of r are generated. In Fig. 8.2, the longer vertical edge in a triangle or a trapezoid corresponds to the subroot of the structure (spanning chart). For example, s is the subroot of the span s–t in Fig. 8.2a. For the left direction case, the actions are similar.

Then, we add the DLM-based features into the parsing actions. Because the parsing algorithm works bottom-up, the nearer children are generated earlier than the farther ones of the same head. Thus, we calculate the left- or right-side probability for a new child when a new dependency relation is built. For Fig. 8.2a, we add the features of P_Rc(x_t | HIS). Figure 8.3 shows the structure, where c_s^R refers to the current children (nearer than x_t) of x_s. In the figure, HIS includes c_s^R and x_s.

Fig. 8.2 Cubic parsing actions of Eisner (1996): (a) two complete items spanning s–r and (r+1)–t are combined into the incomplete item s–t; (b) an incomplete item s–r and a complete item r–t are combined into the complete item s–t


Fig. 8.3 Adding DLM-based features in cubic parsing: the new right child x_t is scored against the head x_s and its current right children c_s^R

Algorithm 3 The first-order decoder of graph-based parsing with DLM, developed from the MST parsing algorithm

 1: Initialization: C[s][s][d] = 0.0, O[s][s][d] = 0.0 for all s, d
 2: for k = 1 to n do
 3:   for s = 0 to n do
 4:     t = s + k
 5:     if t > n then break
 6:     % Create incomplete items
 7:     % Left direction
 8:     O[s][t][←] = max_{s ≤ r < t} ( sco1(t, s) + C[s][r][→] + C[r+1][t][←] + scLc(x_s) )
 9:     % Right direction
10:     O[s][t][→] = max_{s ≤ r < t} ( sco1(s, t) + C[s][r][→] + C[r+1][t][←] + scRc(x_t) )
11:     % Create complete items
12:     C[s][t][←] = max_{s ≤ r < t} ( O[r][t][←] + C[s][r][←] )
13:     C[s][t][→] = max_{s < r ≤ t} ( O[s][r][→] + C[r][t][→] )
14:   end for
15: end for
16: Return C[0][n][→]

The pseudo-code for the modified first-order parsing algorithm is given in Algorithm 3. For simplicity, this algorithm calculates only scores, but the actual algorithm also stores the corresponding dependency structures. Let C[s][t][d] be a table that stores the score of the best complete item from position s to position t, s < t, with direction d, and let O[s][t][d] be a table that stores the score of the best incomplete item from position s to position t, s < t, with direction d. Here, d indicates the direction (← or →) of the dependency.

In the algorithm, we create two types of items: incomplete and complete. The incomplete items are created by lines 8 and 10, and the complete items are created by lines 12 and 13. We add the DLM features to the incomplete items. Here, we explain the right direction case, shown at line 10. As mentioned above, the graphical intuition behind line 10 is given in Fig. 8.3. sco1(s, t) is the score function of the features of the dependency relation (x_s, x_t), and scRc(x_t) is the score function of the features for P_Rc(x_t | HIS).
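For illustration, a compact Python transcription of Algorithm 3 is sketched below under our own assumptions. Like the pseudo-code, it computes only the scores of the best structures; sco1, sc_lc, and sc_rc are abstract callables, and the actual system tracks the previously attached children of each span with beam search (Sect. 8.3.2), which this simplified single-best sketch omits.

```python
NEG_INF = float("-inf")

def first_order_decode_with_dlm(n, sco1, sc_lc, sc_rc):
    """n: sentence length (tokens 1..n, token 0 is ROOT).
    sco1(h, d): arc score for head h and dependent d.
    sc_lc(d) / sc_rc(d): DLM scores for attaching d as a new left / right child.
    Returns the score of the best projective tree rooted at token 0."""
    LEFT, RIGHT = 0, 1
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]          # complete items
    O = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]  # incomplete items

    for k in range(1, n + 1):
        for s in range(0, n + 1 - k):
            t = s + k
            # incomplete items: build an arc between s and t (lines 8 and 10)
            O[s][t][LEFT] = max(sco1(t, s) + C[s][r][RIGHT] + C[r + 1][t][LEFT] + sc_lc(s)
                                for r in range(s, t))
            O[s][t][RIGHT] = max(sco1(s, t) + C[s][r][RIGHT] + C[r + 1][t][LEFT] + sc_rc(t)
                                 for r in range(s, t))
            # complete items: absorb an already-built arc (lines 12 and 13)
            C[s][t][LEFT] = max(C[s][r][LEFT] + O[r][t][LEFT] for r in range(s, t))
            C[s][t][RIGHT] = max(O[s][r][RIGHT] + C[r][t][RIGHT] for r in range(s + 1, t + 1))
    return C[0][n][RIGHT]
```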

We use beam search to choose the parse with the overall best score as the final parse, where K spans are built at each step (Zhang and Clark 2008). At each step, we perform the parsing actions in the current beam and then choose the best K resulting spans for the next step. The time complexity of the new decoding algorithm is O(Kn^3), while that of the original one is O(n^3), where n is the length of the input sentence. With the rich feature set in Table 8.1, the running time of intersect is longer than that of rescoring, but intersect considers more combinations of spans with the DLM-based features than rescoring, which is restricted to a given K-best list.

Fig. 8.4 Overview of the DLM-based approach (the unannotated data is tagged and parsed by the baseline parser into auto-parsed data, from which the DLM is built; the resulting feature representations are combined with the annotated data to train the new parser)

8.3.3 Implementation Details

8.3.3.1 Overview of the Proposed Approach

Figure 8.4 shows an overview of the proposed approach. We use a large amount of unannotated data to build the dependency language model. We first perform word segmentation (if needed) and part-of-speech tagging, after which we obtain word-segmented sentences with part-of-speech tags. The sentences are then parsed by the baseline parser, yielding the auto-parsed data, from which we build the DLM. Finally, the DLM-based features are generated for training a new parser.

8.3.3.2 Baseline Parser

We implement our parsers based on the MSTParser,1 a freely available implementation of the graph-based model proposed by McDonald and Pereira (2006). We train a first-order parser on the training data (described in Sect. 8.5.1) with the features defined in McDonald et al. (2005), which consider the surface forms and part-of-speech (POS) tags of the head and dependent words as well as the surface forms and POS tags of surrounding words, and include conjunctions of these features with the direction and distance from the head to the dependent. We call this first-order parser the baseline parser.

1 http://mstparser.sourceforge.net


8.3.3.3 Build Dependency Language Models

Given the dependency trees, we estimate the probability distribution by relative frequency:

P_u(x_ch | HIS) = \frac{count(x_ch, HIS)}{\sum_{x'_ch} count(x'_ch, HIS)}   (8.5)

No smoothing is performed, because we use the mapping function for the feature representations.

8.3.3.4 Mapping Function

We can define different mapping functions for the feature representations. Here, we use a simple one. First, the probabilities are sorted in decreasing order. Let No(P_u(ch)) be the position number of P_u(ch) in the sorted list. The mapping function is

Φ(P_u(ch)) =
    PH   if No(P_u(ch)) ≤ TOP10
    PM   if TOP10 < No(P_u(ch)) ≤ TOP30
    PL   if TOP30 < No(P_u(ch))
    PO   if P_u(ch) = 0

where TOP10 and TOP30 refer to the position numbers at the top 10 % and top 30 % of the sorted list, respectively. The values 10 % and 30 % are tuned on the development sets in the experiments.
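A minimal sketch of building the buckets from counts (Eq. 8.5) and applying the mapping function, under our own naming conventions; the thresholds follow the 10 %/30 % setting above, and unseen events fall into the PO bucket.

```python
from collections import Counter

def build_buckets(events):
    """events: iterable of (child, history) pairs observed in the auto-parsed data,
    where history is a hashable tuple. Returns {(child, history): 'PH'|'PM'|'PL'}."""
    events = list(events)
    counts = Counter(events)
    hist_totals = Counter(h for _, h in events)
    probs = {e: c / hist_totals[e[1]] for e, c in counts.items()}   # relative frequency, Eq. (8.5)
    ranked = sorted(probs, key=probs.get, reverse=True)
    top10, top30 = int(len(ranked) * 0.10), int(len(ranked) * 0.30)
    return {e: ("PH" if i < top10 else "PM" if i < top30 else "PL")
            for i, e in enumerate(ranked)}

def phi(event, buckets):
    """Mapping function: unseen events (probability 0) map to PO."""
    return buckets.get(event, "PO")
```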

8.4 Bilingual Parsing

In this section, we extend the approach to parse bilingual texts (bitexts). We aim to improve source-language parsing with the help of DLM-based features on both the source and target sides in the bitext parsing task.

8.4.1 Bitext Parsing Model

For bitext parsing, we denote an input sentence pair x^b by x^b = (x^s, x^t), where x^s is the source sentence and x^t is the target sentence. y^t denotes the dependency tree of x^t, and A_st denotes the word alignment links between x^s and x^t.


G^b_x denotes a bitext graph consisting of a set of source nodes V^s_x = {x^s_0, x^s_1, \ldots, x^s_i, \ldots, x^s_n}, a set of target nodes V^t_x = {x^t_0, x^t_1, \ldots, x^t_i, \ldots, x^t_m}, a set of word alignment links (edges) E_a = {a_st | a_st \in A_st}, a set of arcs (edges) on the target side E^t_x = {(i, j) | (x^t_i, x^t_j) \in y^t}, and a set of arcs (edges) on the source side E^s_x = {(i, j) | i \neq j, x_i \in V^s_x, x_j \in (V^s_x − x^s_0)}, where the nodes in V^s_x are the words in x^s and the nodes in V^t_x are the words in x^t. Let T(G^b_x) be the set of all the subgraphs of G^b_x that are valid dependency graphs (McDonald and Nivre 2007) for the source sentence x^s.

The score of a source dependency graph y^s \in T(G^b_x) is defined as the sum of its arc scores provided by scoring function s_b,

s_b(x^b, y^s) = \sum_{g \in y^s} ( score_s(w^s, x^s, g) + score_t(w^t, x^b, y^t, A_st, g) )   (8.6)

where score_s(w^s, x^s, g) is for the monolingual features, score_t(w^t, x^s, x^t, y^t, A_st, g) is for the bilingual features, w^s is a weight vector for the monolingual features, and w^t is a weight vector for the bilingual features.

Scoring function score_s is given by

score_s(w^s, x^s, g) = f_s(x^s, g) \cdot w^s   (8.7)

Scoring function score_t is given by

score_t(w^t, x^s, x^t, y^t, A_st, g) = f_t(x^s, x^t, y^t, A_st, g) \cdot w^t   (8.8)

Then the task of bitext parsing in this section for a given sentence pair (x^s, x^t, y^t, A_st) is to find y^s_b,

y^s_b = \arg\max_{y^s \in T(G^b_x)} s_b(x^b, y^s)   (8.9)

8.4.2 Bitext Parsing with DLM

Here, we add the scores of the DLM to the bitext parsing model. For a dependency structure D^s_h = (x^s_{Lk}, \ldots, x^s_{L1}, x^s_h, x^s_{R1}, \ldots, x^s_{Rm}) on the source side, we have a corresponding dependency structure D^t_h = (x^t_{Lk'}, \ldots, x^t_{L1}, x^t_h, x^t_{R1}, \ldots, x^t_{Rm'}) if x^s_h has a corresponding word x^t_h on the target side. Suppose we use an M-gram DLM on the target side; the target probability P_t(D^t_h) is defined as follows:

P_t(D^t_h) \approx \prod_{x^t_{Li} \in x^t_A} P^t_{Lc}(x^t_{Li} | x^t_{L(i-1)}, \ldots, x^t_{L(i-M+1)}, x^t_h) \cdot \prod_{x^t_{Ri} \in x^t_A} P^t_{Rc}(x^t_{Ri} | x^t_{R(i-1)}, \ldots, x^t_{R(i-M+1)}, x^t_h)

where x^t_A refers to the set of the corresponding words of the children of x^s_h. Following the definitions in Sect. 8.1, the source probability P_s(D^s_h) is defined as follows:

P_s(D^s_h) = P^s_L(D^s_h) \cdot P^s_R(D^s_h)

Here, P^s_L and P^s_R are the left- and right-side generative probabilities, respectively. Suppose we use an N-gram dependency language model on the source side. P^s_L is defined as follows:

P^s_L(D^s_h) \approx P^s_{Lc}(x^s_{L1} | x^s_h) \cdot P^s_{Lc}(x^s_{L2} | x^s_{L1}, x^s_h) \cdots P^s_{Lc}(x^s_{Lk} | x^s_{L(k-1)}, \ldots, x^s_{L(k-N+1)}, x^s_h)

The right-side probability is similar. For a dependency tree, we calculate the probability as follows:

P_b(y^s) = \prod_{x^s_h \in H(y^s)} ( P_s(D^s_h) \cdot P_t(D^t_h) )   (8.10)

Accordingly, we also reform Eq. (8.10). We define f^s_DLM and f^t_DLM as high-dimensional feature representations on the two sides. Then, the DLM score of tree y^s is computed as the inner product of the features with the corresponding weight vectors w^s_DLM and w^t_DLM:

score^b_DLM(y^s) = f^s_DLM \cdot w^s_DLM + f^t_DLM \cdot w^t_DLM   (8.11)

Finally, we use the DLM-based features as the bilingual features in Eq. (8.6) for the bitext parsing model. The task for a given sentence pair (x^s, x^t, y^t, A_st) is to find y^sb_DLM:

y^sb_DLM = \arg\max_{y^s \in T(G^b_x)} \left( \sum_{g \in y^s} score_s(w^s, x^s, g) + score^b_DLM(y^s) \right)   (8.12)
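As a rough illustration of Eq. (8.11), the sketch below combines the source-side and target-side DLM feature scores for one source head, obtaining the target-side children through the word alignment. All function and variable names are our own assumptions, not the authors' implementation.

```python
def bilingual_dlm_score(src_children, align, f_s, w_s, f_t, w_t):
    """Illustrative only. src_children: source-side children of one head (word forms);
    align: dict from a source word to its aligned target word, or None if unaligned
    (a real system would align by token position rather than surface form);
    f_s / f_t: feature extractors returning {feature_name: value} dicts;
    w_s / w_t: weight dicts for the source- and target-side DLM features."""
    tgt_children = [align[w] for w in src_children if align.get(w) is not None]
    score = 0.0
    for feats, weights in ((f_s(src_children), w_s), (f_t(tgt_children), w_t)):
        score += sum(weights.get(name, 0.0) * val for name, val in feats.items())
    return score
```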

8.4.3 DLM-Based Feature Templates for Bitext Parsing

The DLM-based feature templates for bitext parsing are the same as the templates defined in Table 8.1.


8.5 Experiments for Monolingual Parsing

We conduct experiments on English and Chinese data.

8.5.1 Data Sets

For English, we use the Penn Treebank (Marcus et al. 1993) in our experiments. We create a standard data split: sections 2–21 for training, section 22 for development, and section 23 for testing. The tool "Penn2Malt"2 is used to convert the data into dependency structures using a standard set of head rules (Yamada and Matsumoto 2003). Following the work of Koo et al. (2008), we use the MXPOST (Ratnaparkhi 1996) tagger trained on the training data to provide part-of-speech tags for the development and test sets and use 10-way jackknifing to generate part-of-speech tags for the training set. For the unannotated data, we use the BLLIP corpus (Charniak et al. 2000), which contains about 43 million words of WSJ text.3 We use the MXPOST tagger trained on the training data to assign part-of-speech tags and use the baseline parser to process the sentences of the BLLIP corpus.
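The 10-way jackknifing mentioned above can be sketched as follows (our illustration; train_tagger and tag stand in for whatever tagger is used, here an MXPOST-style one): each fold of the training data is tagged by a model trained on the other folds, so training-set tags have test-like quality.

```python
def jackknife_pos_tags(sentences, train_tagger, tag, folds=10):
    """sentences: list of training sentences; train_tagger / tag: user-supplied callables."""
    tagged = [None] * len(sentences)
    for k in range(folds):
        held_out = [i for i in range(len(sentences)) if i % folds == k]
        train_idx = [i for i in range(len(sentences)) if i % folds != k]
        model = train_tagger([sentences[i] for i in train_idx])
        for i in held_out:
            tagged[i] = tag(model, sentences[i])
    return tagged
```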

For Chinese, we use the Chinese Treebank (CTB) version 4.04 in the experiments. We also use the "Penn2Malt" tool to convert the data and create a data split: files 1–270 and files 400–931 for training, files 271–300 for testing, and files 301–325 for development. We use gold-standard segmentation and part-of-speech tags in the CTB. The data partition and part-of-speech settings are chosen to match previous work (Chen et al. 2008, 2009; Yu et al. 2008). For the unannotated data, we use the XIN_CMN portion of Chinese Gigaword5 Version 2.0 (LDC2009T14) (Huang 2009), which has approximately 311 million words whose segmentation and POS tags are given. We discard these annotations because of the differences in annotation policy between CTB and this corpus. We use the MMA system (Kruengkrai et al. 2009) trained on the training data to perform word segmentation and POS tagging and use the baseline parser to parse all the sentences in the data.

8.5.2 Features for Basic and Enhanced Parsers

The previous studies have defined four types of features: (FT1) the first-order features defined in McDonald et al. (2005), (FT2SB) the second-order parent-sibling features defined in McDonald and Pereira (2006), (FT2GC) the second-order parent-child-grandchild features defined in Carreras (2007), and (FT3) the third-order features defined in Koo and Collins (2010).

2 http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html
3 We ensure that the text used for extracting subtrees does not include the sentences of the Penn Treebank.
4 http://www.cis.upenn.edu/~chinese/.
5 We exclude the sentences of the CTB data from the Gigaword data.

Table 8.2 Baseline parsers

System   Features
MST1     (FT1)
MSTB1    (FT1) + (FT2SB + FT2GC + FT3)
MST2     (FT1 + FT2SB)
MSTB2    (FT1 + FT2SB) + (FT2GC + FT3)

Table 8.3 The parsing times on PTB (dev) (seconds for all the sentences)

K        1      2      4      8      16
English  157.1  247.4  351.9  462.3  578.2

We use the first- and second-order parsers of the MSTParser as the basic parsers. We then enhance them with other higher-order features using beam search. Table 8.2 shows the feature settings of the systems, where MST1/2 refers to the basic first-/second-order parser and MSTB1/MSTB2 refers to the enhanced first-/second-order parser. MSTB1 and MSTB2 use the same feature setting but different-order base models; this results in a difference in how FT2SB is handled (beam search in MSTB1 vs. exact inference in MSTB2). We use these four parsers as the baselines in the experiments.

We measure the parser quality by the unlabeled attachment score (UAS), i.e., the percentage of tokens (excluding all punctuation tokens) with the correct HEAD. In the following experiments, we use "Inter" to refer to the parser with intersect and "Rescore" to refer to the parser with rescoring.

8.5.3 Development Experiments

Since the setting of K affects our parsers, we study its influence on the development set for English. We add the DLM-based features to MST1. Figure 8.5 shows the UAS curves on the development set, where K is the beam size for intersect and the K-best list size for rescoring, the X-axis represents K, and the Y-axis represents the UAS scores. The parsing performance generally increases as K increases. The parser with intersect always outperforms the one with rescoring.

Table 8.3 shows the parsing times of intersect on the development set for English. By comparing the curves of Fig. 8.5, we can see that, while using a larger K reduces the parsing speed, it improves the performance of our parsers. In the rest of the experiments, we set K = 8 in order to obtain high accuracy with reasonable speed and use intersect to add the DLM-based features.

Then, we study the effect of adding different N-gram DLMs to MST1. Table 8.4 shows the results. From the table, we find that the parsing performance roughly increases as N increases. When N = 3 and N = 4, the parsers obtain the same scores for English. For Chinese, the parser obtains the best score when N = 4. Note that the size of the Chinese unannotated data is larger than that of English. In the rest of the experiments, we use a 3-gram DLM for English and a 4-gram DLM for Chinese.

Fig. 8.5 The influence of K on PTB (dev) (UAS as a function of K for Rescore and Inter)

Table 8.4 Effect of different N-gram DLMs

N        0      1      2      3      4
English  91.30  91.87  92.52  92.72  92.72
Chinese  87.36  87.96  89.33  89.92  90.40

Table 8.5 Main results on PTB (test) for English

Order1       UAS    Order2       UAS
MST1         90.95  MST2         91.71
MST1-DLM     91.89  MST2-DLM     92.34
MSTB1        91.92  MSTB2        92.10
MSTB1-DLM    92.55  MSTB2-DLM    92.76

8.5.4 Main Results on English Data

We evaluate the systems on the test data for English. The results are shown in Table 8.5, where -DLM refers to adding the DLM-based features to the baselines. The parsers using the DLM-based features consistently outperform the baselines. For the basic models (MST1/2), we obtain absolute improvements of 0.94 and 0.63 points, respectively. For the enhanced models (MSTB1/2), we find improvements of 0.63 and 0.66 points, respectively. The improvements are significant according to McNemar's test (p < 10^-5) (Nivre et al. 2004).


Table 8.6 Main results on CTB5 (test) for Chinese

Order1       UAS    Order2       UAS
MST1         86.38  MST2         88.11
MST1-DLM     90.66  MST2-DLM     91.62
MSTB1        88.38  MSTB2        88.66
MSTB1-DLM    91.38  MSTB2-DLM    91.59

8.5.5 Main Results on Chinese Data

The results are shown in Table 8.6, where the abbreviations are the same as those in Table 8.5. As in the English experiments, the parsers using the DLM-based features consistently outperform the baselines. For the basic models (MST1/2), we obtain absolute improvements of 4.28 and 3.51 points, respectively. For the enhanced models (MSTB1/2), we get improvements of 3.00 and 2.93 points, respectively. We obtain large improvements on the Chinese data. The reasons may be that we use a very large amount of data and a 4-gram DLM that captures high-order information. The improvements are significant according to McNemar's test (p < 10^-7).

8.5.6 Compare with Previous Work on English

Table 8.7 shows the performance of the graph-based systems that we compare with, where McDonald2006 refers to the second-order parser of McDonald and Pereira (2006), Koo2008-standard refers to the second-order parser with the features defined in Koo et al. (2008), Koo2010-model1 refers to the third-order parser with model1 of Koo and Collins (2010), Koo2008-dep2c refers to the second-order parser with cluster-based features of Koo et al. (2008), Suzuki2009 refers to the parser of Suzuki et al. (2009), Chen2009-ord2s refers to the second-order parser with subtree-based features of Chen et al. (2009), and Zhou2011 refers to the second-order parser with web-derived selectional preference features of Zhou et al. (2011).

The results show that our MSTB2-DLM obtains accuracy comparable to the previous state-of-the-art systems. Koo2010-model1 (Koo and Collins 2010) uses third-order features and achieves the best result among the supervised parsers. Suzuki2009 (Suzuki et al. 2009) reports the best result overall by combining a semi-supervised structured conditional model (Suzuki and Isozaki 2008) with the method of Koo et al. (2008). However, their decoding complexities are higher than ours, and we believe that the performance of our parser can be further enhanced by integrating their methods with our parser.

Table 8.7 Relevant results for English. G denotes the supervised graph-based parsers, S denotes the graph-based parsers with semi-supervised methods, and D denotes our new parsers

Type  System             UAS    Cost
G     McDonald2006       91.5   O(n^3)
      Koo2008-standard   92.02  O(n^4)
      Koo2010-model1     93.04  O(n^4)
S     Koo2008-dep2c      93.16  O(n^4)
      Suzuki2009         93.79  O(n^4)
      Chen2009-ord2s     92.51  O(n^3)
      Zhou2011           92.64  O(n^4)
D     MSTB2-DLM          92.76  O(Kn^3)

8.5.7 Compare with Previous Work on Chinese

Table 8.8 shows the comparative results, where Chen2008 refers to the parser of Chen et al. (2008), Yu2008 refers to the parser of Yu et al. (2008), Zhao2009 refers to the parser of Zhao et al. (2009), and Chen2009-ord2s refers to the second-order parser with subtree-based features of Chen et al. (2009). The results show that our score on this data is the best reported so far and significantly higher than the previous scores.

Table 8.8 Relevant results for Chinese

System           UAS
Chen2008         86.52
Yu2008           87.26
Zhao2009         87.0
Chen2009-ord2s   89.43
MSTB2-DLM        91.59

8.5.8 Analysis

We analyze the results to study the effect of the DLM-based features in our systems. We compare the MST1-DLM and MST1 systems on the English data.

8.5.8.1 Number of Children

Dependency parsers tend to perform worse on heads that have many children. Here, we study the effect of the DLM-based features on this structure. We calculate the number of children for each head and list the accuracy changes for different numbers of children. The accuracy is the percentage of heads having all of their children correct.

Figure 8.6 shows the results for English, where the X-axis represents the number of children, the Y-axis represents the accuracies, and Baseline refers to MST1. For example, for heads having two children, Baseline obtains 89.04 % accuracy, while MST1-DLM obtains 89.32 %. From the figure, we find that MST1-DLM achieves better performance consistently in all cases, and the larger the number of children becomes, the more significant the performance improvement is.


Fig. 8.6 Improvement relative to numbers of children (accuracy vs. number of children, for Baseline and MST1-DLM)

Fig. 8.7 Improvement relative to distance lengths of dependency relations (UAS vs. distance length, for Baseline and MST1-DLM)

8.5.8.2 Distance of Dependency Relations

As mentioned in Sect. 8.1, the DLM can capture long-distance word relations (Shen et al. 2008). Here, we investigate the effect of the DLM-based features relative to the distances of dependency relations.

Figure 8.7 shows the results for English, where the X-axis represents the distance length of the dependencies, the Y-axis represents the accuracies, and Baseline refers to MST1. The curves in the figure show that MST1-DLM always achieves better performance, and the longer the distance becomes, the more significant the performance improvement is.


8.6 Experiments for Bilingual Parsing

We evaluate the proposed method on the standard data sets, i.e., the translated portion of the Chinese Treebank V2 (CTB2tp) (Bies et al. 2007), articles 1–325 of CTB, which have English translations with gold-standard parse trees. The tool "Penn2Malt"6 is used to convert the data into dependency structures. We use the same data settings as in the previous studies (Burkett and Klein 2008; Chen et al. 2010; Huang et al. 2009): 1–270 for training, 301–325 for development, and 271–300 for testing.

To enable direct comparison with the results of previous work (Chen et al. 2010; Huang et al. 2009), we also use the same word alignment tool, the Berkeley Aligner (DeNero and Klein 2007; Liang et al. 2006), to perform word alignment for CTB2tp and CTB7. We train the tool on the FBIS corpus (LDC2003E14) and remove notoriously bad links in {a, an, the} × {的 (de), 了 (le)}, as was done by Huang et al. (2009).

To process the unannotated data, we train first-order baseline parsers for Chinese and English on the training data. We also use the English baseline parser to parse the sentences on the English side of the test data. In the evaluation, we use Chinese as the source language and English as the target language.

8.6.1 Main Results

The results are listed in Table 8.9, where -SDLM refers to using the DLM-based features on the source side, -TDLM refers to using the DLM-based features on the target side, and -BDLM refers to using the DLM-based features on both sides. From the table, we find that the parsers obtain improvements with the new features on either the source or target side or on both. The parsers with BDLM perform the best among all the feature settings.

Table 8.9 Main results on the test set of CTB2tp for bitext parsing

Order1       UAS    Order2       UAS
MST1         84.35  MST2         86.38
MST1-SDLM    89.57  MST2-SDLM    90.03
MST1-TDLM    84.92  MST2-TDLM    87.01
MST1-BDLM    90.02  MST2-BDLM    90.30
MSTB1        86.78  MSTB2        87.03
MSTB1-SDLM   90.01  MSTB2-SDLM   90.22
MSTB1-TDLM   87.44  MSTB2-TDLM   87.65
MSTB1-BDLM   90.70  MSTB2-BDLM   90.88

6. http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html


Table 8.10 Relevant results for bitext parsing

Type  System       UAS
MT    Baseline     87.20
BT    Huang2009    86.3
BT    Chen2010ALL  90.13
BT    MSTB2-BDLM   90.88

8.6.2 Comparison with Previous Work on Bitext Parsing

We compare the results of the proposed approach with those reported previously for the same data. We divide the systems into two types: MT and BT, which denote training on the monolingual treebank (source side) and the bilingual treebank, respectively. Table 8.10 lists the results, where Huang2009 refers to the result of Huang et al. (2009) and Chen2010ALL refers to the result of using all of the features in Chen et al. (2010). The results show that the new parser achieves better accuracy than Huang2009 and Chen2010ALL. Our score is the best reported so far on this data.

8.7 Summary

In this chapter, we have presented the approach of utilizing the dependency language model to improve graph-based dependency parsing. New features are defined based on the dependency language model and integrated directly into the decoding algorithm using beam search. The approach enriches the feature representations without increasing the decoding complexity. The DLM-based approach is applied to monolingual and bilingual dependency parsing.

References

Bies, A., Palmer, M., Mott, J., & Warner, C. (2007). English Chinese translation Treebank V 1.0, LDC2007T02. Linguistic Data Consortium.

Burkett, D., & Klein, D. (2008). Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP 2008 (pp. 877–886). Honolulu: Association for Computational Linguistics.

Carreras, X. (2007). Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007 (pp. 957–961). Prague: Association for Computational Linguistics.

Charniak, E., Blaheta, D., Ge, N., Hall, K., Hale, J., & Johnson, M. (2000). BLLIP 1987–89 WSJ Corpus Release 1, LDC2000T43. Linguistic Data Consortium.

Chen, W., Kawahara, D., Uchimoto, K., Zhang, Y., & Isahara, H. (2008). Dependency parsing with short dependency relations in unlabeled data. In Proceedings of IJCNLP 2008, Hyderabad.


Chen, W., Kazama, J., & Torisawa, K. (2010). Bitext dependency parsing with bilingual subtree constraints. In Proceedings of ACL 2010 (pp. 21–29). Uppsala: Association for Computational Linguistics.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of EMNLP 2009, Singapore (pp. 570–579).

Covington, M. A. (2001). A fundamental algorithm for dependency parsing. In Proceedings of the 39th annual ACM southeast conference, Athens (pp. 95–102).

DeNero, J., & Klein, D. (2007). Tailoring word alignments to syntactic machine translation. In Proceedings of ACL 2007 (pp. 17–24). Prague: Association for Computational Linguistics.

Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING 1996, Copenhagen (pp. 340–345).

Huang, C. R. (2009). Tagged Chinese Gigaword version 2.0, LDC2009T14. Linguistic Data Consortium.

Huang, L., Jiang, W., & Liu, Q. (2009). Bilingually-constrained (monolingual) shift-reduce parsing. In Proceedings of EMNLP 2009 (pp. 1222–1231). Singapore: Association for Computational Linguistics.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Koo, T., & Collins, M. (2010). Efficient third-order dependency parsers. In Proceedings of ACL 2010 (pp. 1–11). Uppsala: Association for Computational Linguistics.

Kruengkrai, C., Uchimoto, K., Kazama, J., Wang, Y., Torisawa, K., & Isahara, H. (2009). An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL-IJCNLP 2009 (pp. 513–521). Suntec: Association for Computational Linguistics.

Liang, P., Taskar, B., & Klein, D. (2006). Alignment by agreement. In Proceedings of NAACL 2006 (pp. 104–111). New York City: Association for Computational Linguistics.

Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005 (pp. 91–98). New Brunswick: Association for Computational Linguistics.

McDonald, R., & Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. In Proceedings of EMNLP-CoNLL, Prague (pp. 122–131).

McDonald, R., & Pereira, F. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL 2006, Trento (pp. 81–88).

Nivre, J., Hall, J., & Nilsson, J. (2004). Memory-based dependency parsing. In Proceedings of CoNLL 2004, Boston (pp. 49–56).

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP 1996, Philadelphia (pp. 133–142).

Shen, L., Xu, J., & Weischedel, R. (2008). A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL-08: HLT (pp. 577–585). Columbus: Association for Computational Linguistics. http://www.aclweb.org/anthology/P/P08/P08-1066.

Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using Giga-word scale unlabeled data. In Proceedings of ACL-08: HLT (pp. 665–673). Columbus: Association for Computational Linguistics.

Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP 2009 (pp. 551–560). Singapore: Association for Computational Linguistics.

Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy (pp. 195–206).

Yu, K., Kawahara, D., & Kurohashi, S. (2008). Chinese dependency parsing with large scale automatically constructed case structures. In Proceedings of COLING 2008, Manchester (pp. 1049–1056).


Zhang, Y., & Clark, S. (2008). A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of EMNLP 2008, Honolulu (pp. 562–571).

Zhao, H., Song, Y., Kit, C., & Zhou, G. (2009). Cross language dependency parsing using a bilingual lexicon. In Proceedings of ACL-IJCNLP 2009 (pp. 55–63). Suntec: Association for Computational Linguistics.

Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting web-derived selectional preference to improve statistical dependency parsing. In Proceedings of ACL-HLT 2011 (pp. 1556–1565). Portland: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1156.


Chapter 9 Training with Meta-features

In the previous chapters, we have described the approaches of using the information of bilexical dependencies and subtrees. Those approaches make use of bi-gram and tri-gram lexical subtree structures, and they can be extended further: features defined over surface words and part-of-speech tags can represent more complex tree structures than bilexical dependencies and lexical subtrees.

In this chapter, we describe an approach to semi-supervised dependency parsing via feature transformation (Ando and Zhang 2005). More specifically, we transform base features to a higher-level space. The base features defined over surface words, part-of-speech tags, and dependency trees are high dimensional and have been explored in the previous studies mentioned above. The higher-level features, which we call meta-features, are low dimensional and newly defined in this chapter. The key idea is that we build connections between known and unknown base features via the meta-features. From another viewpoint, we can also interpret the meta-features as a way of doing feature smoothing.

In the approach, the base features are grouped, and each group relates to a meta-feature. In the first step, we use a baseline parser to parse a large amount of unannotated sentences. Then we collect the base features from the parse trees. The collected features are transformed into predefined discrete values via a transformation function. Based on the transformed values, we define a set of meta-features. Finally, the meta-features are incorporated directly into the parsing models.

9.1 Baseline Parser

In this section, we build a baseline parser based on the graph-based parsing model proposed by McDonald et al. (2005). In the system, we use the decoding algorithm proposed by Carreras (2007), which is a second-order CKY-style algorithm (Eisner



Table 9.1 Base feature templates

(a) First-order standard
h[w/p], d[w/p], d(h,d)
h[w/p], d(h,d)
dw, dp, d(h,d)
d[w/p], d(h,d)
hw, hp, dw, dp, d(h,d)
hp, hw, dp, d(h,d)
hw, dw, dp, d(h,d)
hw, hp, d[w/p], d(h,d)

(b) First-order linear
hp, bp, dp, d(h,d)
hp, h+1p, d-1p, dp, d(h,d)
h-1p, hp, d-1p, dp, d(h,d)
hp, h+1p, dp, d+1p, d(h,d)
h-1p, hp, dp, d+1p, d(h,d)

(c) Second-order standard
hp, dp, cp, d(h,d,c)
hw, dw, cw, d(h,d,c)
hp, c[w/p], d(h,d,c)
dp, c[w/p], d(h,d,c)
hw, c[w/p], d(h,d,c)
dw, c[w/p], d(h,d,c)

(d) Second-order linear
h[w/p], h+1[w/p], c[w/p], d(h,d,c)
h-1[w/p], h[w/p], c[w/p], d(h,d,c)
h[w/p], c-1[w/p], c[w/p], d(h,d,c)
h[w/p], c[w/p], c+1[w/p], d(h,d,c)
h-1[w/p], h[w/p], c-1[w/p], c[w/p], d(h,d,c)
h[w/p], h+1[w/p], c-1[w/p], c[w/p], d(h,d,c)
h-1[w/p], h[w/p], c[w/p], c+1[w/p], d(h,d,c)
h[w/p], h+1[w/p], c[w/p], c+1[w/p], d(h,d,c)
d[w/p], d+1[w/p], c[w/p], d(h,d,c)
d-1[w/p], d[w/p], c[w/p], d(h,d,c)
d[w/p], c-1[w/p], c[w/p], d(h,d,c)
d[w/p], c[w/p], c+1[w/p], d(h,d,c)
d[w/p], d+1[w/p], c-1[w/p], c[w/p], d(h,d,c)
d[w/p], d+1[w/p], c[w/p], c+1[w/p], d(h,d,c)
d-1[w/p], d[w/p], c-1[w/p], c[w/p], d(h,d,c)
d-1[w/p], d[w/p], c[w/p], c+1[w/p], d(h,d,c)

1996), and feature weights w are learned during training using the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer 2003; McDonald et al. 2005).

9.1.1 Base Features

Previous studies have defined different sets of features for the graph-based parsing models, such as the first-order features defined in McDonald et al. (2005), the second-order parent-siblings features defined in McDonald and Pereira (2006), and the second-order parent-child-grandchild features defined in Carreras (2007). Bohnet (2010) explores a richer set of features than the above sets. We further extend the features defined by Bohnet (2010) by introducing more lexical features as the base features. The base feature templates are listed in Table 9.1, where h and d refer to the head and the dependent respectively, c refers to d's sibling or child, b refers to the word between h and d, +1 (-1) refers to the next (previous) word, w and p refer to the surface word and part-of-speech tag respectively, [w/p] refers to the surface word or part-of-speech tag, d(h,d) is the direction of the dependency relation between h and d, and d(h,d,c) is the direction of the relation among h, d, and c. We generate the base features based on the above templates.

9.1.2 Baseline Parser

We train a parser with the base features as the baseline parser. We define f_b(x, g) as the base features and w_b as the corresponding weights. The scoring function becomes

S(x, g) = f_b(x, g) · w_b    (9.1)

9.2 Meta-features

In this section, we describe a semi-supervised approach to transform the features in the base feature space (FB) to features in a higher-level space (FM) with the following properties:

• The features in FM are able to build connections between known and unknown features in FB and therefore should be highly informative.

• The transformation should be learnable based on a labeled training set and an automatically parsed data set, and automatically computable for the test sentences.

The features in FM are referred to as meta-features. In order to perform the feature transformation, we choose to define a simple yet effective mapping function. Based on the mapped values, we define feature templates for generating the meta-features. Finally, we build a new parser with the base and meta-features.

9.2.1 Template-Based Mapping Function

We define a template-based function for mapping the base features to predefined discrete values. We first put the base features into several groups and then perform the mapping.

We have a set of base feature templates TB. For each template Ti ∈ TB, we can generate a set of base features Fi from dependency trees in the parsed data, which is automatically parsed by the baseline parser. We collect the features and count their frequencies. The collected features are sorted in decreasing order of frequency. The mapping function for a base feature fb of Fi is defined as follows:

Φ(fb) = Hi   if R(fb) ≤ TOP10
        Mi   if TOP10 < R(fb) ≤ TOP30
        Li   if TOP30 < R(fb)
        Oi   Others

where R(fb) is the position number of fb in the sorted list, "Others" is defined for the base features that are not included in the list, and TOP10 and TOP30 refer to the position numbers of the top 10 % and top 30 %, respectively. The numbers, 10 % and 30 %, are tuned on the development sets in the experiments.


Table 9.2 Meta-feature templates

[Φ(fb)]
[Φ(fb), hp]
[Φ(fb), hw]

For a base feature generated from template Ti, we have four possible values: Hi, Mi, Li, and Oi. In total, we have 4 × N(TB) possible values for all the base features, where N(TB) refers to the number of base feature templates, which is usually small. We can obtain the mapped values of all the collected features via the mapping function.
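A minimal sketch of this mapping follows; it assumes the base features collected for one template are available as a frequency counter, and names such as `build_mapper` are illustrative rather than from the book.

```python
from collections import Counter

def build_mapper(template_id, feature_counts, top1=0.10, top2=0.30):
    """Build the mapping function Phi for one base-feature template T_i.

    Features whose rank in the frequency-sorted list falls in the top 10% map to
    H_i, those up to the top 30% map to M_i, the remaining listed features map to
    L_i, and features absent from the list map to O_i ("Others")."""
    ranked = [f for f, _ in feature_counts.most_common()]     # decreasing frequency
    top10, top30 = int(len(ranked) * top1), int(len(ranked) * top2)
    rank = {f: r for r, f in enumerate(ranked)}               # position R(f_b)

    def phi(fb):
        r = rank.get(fb)
        if r is None:
            return ('O', template_id)
        if r < top10:
            return ('H', template_id)
        if r < top30:
            return ('M', template_id)
        return ('L', template_id)

    return phi

# Toy usage for one hypothetical template T_k.
counts = Counter({'ate,pizza,with,RIGHTSIB': 950,
                  'ate,meat,with,RIGHTSIB': 200,
                  'saw,dog,with,RIGHTSIB': 3})
phi_k = build_mapper('Tk', counts)
print(phi_k('ate,meat,with,RIGHTSIB'))    # a tag such as ('L', 'Tk') on this tiny list
print(phi_k('unseen,feature'))            # ('O', 'Tk')
```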

9.2.2 Meta-feature Templates

Based on the mapped values, we define meta-feature templates in FM for dependency parsing. The meta-feature templates are listed in Table 9.2, where fb is a base feature of FB, hp refers to the part-of-speech tag of the head, and hw refers to the surface word of the head. In the table, the first template uses the mapped value only; the second and third templates combine the value with the head information. The number of meta-features is relatively small: there are 4 × N(TB) for the first type, 4 × N(TB) × N(POS) for the second type, and 4 × N(TB) × N(WORD) for the third one, where N(POS) refers to the number of part-of-speech tags and N(WORD) refers to the number of words. We remove any feature related to the surface form if the word is not one of the Top-N most frequent words in the training data. We use Top-1000 for the experiments in this chapter. This method reduces the size of the feature sets. The empirical statistics of the feature sizes in Sect. 9.3.2.2 show that the number of meta-features is only 1.2 % of the number of base features.

9.2.3 Generating Meta-features

We use an example to demonstrate how to generate the meta-features based on the meta-feature templates in practice. Suppose that we have the sentence "I ate the meat with a fork" and want to generate the meta-features for the relation among "ate," "meat," and "with," where "ate" is the head, "meat" is the dependent, and "with" is the closest left sibling of "meat." Figure 9.1 shows the example.

We demonstrate the generating procedure using template Tk = "hw, dw, cw, d(h,d,c)" (the second template of Table 9.1c), which contains the surface forms of the head, the dependent, and its sibling, and the directions of the dependencies among h, d, and c. We can have a base feature "ate, meat, with, RIGHTSIB," where "RIGHTSIB" refers to the parent-siblings structure with the right direction. In the auto-parsed data, this feature occurs 200 times and ranks between TOP10 and TOP30. According to the mapping function, we obtain the mapped value Mk.


Fig. 9.1 An example of generating meta-features

Finally, we have the three meta-features "[Mk]," "[Mk, VV]," and "[Mk, ate]," where VV is the part-of-speech tag of the word "ate." In this way, we can generate all the meta-features for the graph-based model.
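Continuing the example, the following sketch shows how the three meta-feature templates of Table 9.2 could be instantiated from a mapped value; the function and feature encodings are illustrative assumptions, not the book's implementation.

```python
def meta_features(phi_value, head_pos, head_word, frequent_words):
    """Instantiate the templates [Phi(fb)], [Phi(fb), hp], and [Phi(fb), hw].
    The word-based template is kept only if the head is among the Top-N most
    frequent training words (Top-1000 in the experiments of this chapter)."""
    feats = [('META', phi_value),
             ('META-POS', phi_value, head_pos)]
    if head_word in frequent_words:
        feats.append(('META-WORD', phi_value, head_word))
    return feats

# "ate, meat, with, RIGHTSIB" was mapped to M_k; the head "ate" is tagged VV.
print(meta_features(('M', 'Tk'), 'VV', 'ate', frequent_words={'ate', 'the'}))
```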

9.2.4 Meta Parser

We combine the base features with the meta-features by a new scoring function:

S(x, g) = f_b(x, g) · w_b + f_m(x, g) · w_m    (9.2)

where f_b(x, g) refers to the base features, f_m(x, g) refers to the meta-features, and w_b and w_m are their corresponding weights, respectively. The feature weights are learned during training using MIRA (Crammer and Singer 2003; McDonald et al. 2005). Note that w_b is also retrained here.
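As a sketch, the combined score of Eq. (9.2) is simply the sum of two sparse dot products. The representation below (binary features stored as keys with weights in dictionaries) is an assumption for illustration, not the book's implementation.

```python
def score(base_feats, meta_feats, w_b, w_m):
    """S(x, g) = f_b(x, g) . w_b + f_m(x, g) . w_m with sparse binary features."""
    return (sum(w_b.get(f, 0.0) for f in base_feats) +
            sum(w_m.get(f, 0.0) for f in meta_feats))

w_b = {'hw=ate,dw=meat,d=R': 0.7}
w_m = {('META', ('M', 'Tk')): 0.3}
print(score(['hw=ate,dw=meat,d=R'], [('META', ('M', 'Tk'))], w_b, w_m))   # 1.0
```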

We use the same decoding algorithm in the new parser as in the baseline parser. The new parser is referred to as the meta parser.

9.3 Experiments

In this section, the effect of the meta-features on the graph-based parsers is evaluated on English and Chinese data.

9.3.1 Experimental Settings

In the experiments, we use the Penn Treebank (PTB) (Marcus et al. 1993) for English and the Chinese Treebank version 5.1 (CTB5) (Xue et al. 2005) for Chinese.


Table 9.3 Standard data splits

                 Train                Dev                  Test
PTB (sections)   2–21                 22                   23
CTB5 (files)     001–815, 1001–1136   886–931, 1148–1151   816–885, 1137–1147

The tool "Penn2Malt" is used to convert the data into dependency structures with the English head rules of Yamada and Matsumoto (2003) and the Chinese head rules of Zhang and Clark (2008). We follow the standard data splits as shown in Table 9.3. Following the work of Koo et al. (2008), we use a tagger trained on the training data to provide part-of-speech (POS) tags for the development and test sets and use 10-way jackknifing to generate part-of-speech tags for the training set. We use the MXPOST (Ratnaparkhi 1996) tagger for English and a CRF-based tagger for Chinese. We use gold-standard segmentation in the CTB5. The data partition for Chinese is chosen to match previous work (Duan et al. 2007; Hatori et al. 2011; Li et al. 2011).
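The 10-way jackknifing step can be sketched as follows; `train_tagger` and `tag` are placeholders for whatever tagger is used (MXPOST or a CRF tagger here), so the code only illustrates the fold logic.

```python
def jackknife_pos_tags(sentences, train_tagger, tag, n_folds=10):
    """Tag the training set with 10-way jackknifing: each fold is tagged by a model
    trained on the other folds, so training-set tags have the same (automatic)
    quality as the tags the parser will see at test time."""
    tagged = [None] * len(sentences)
    for k in range(n_folds):
        train = [s for i, s in enumerate(sentences) if i % n_folds != k]
        model = train_tagger(train)
        for i in range(k, len(sentences), n_folds):    # held-out fold
            tagged[i] = tag(model, sentences[i])
    return tagged

# Toy demo with a dummy tagger that labels every word as NN.
dummy_train = lambda sents: None
dummy_tag = lambda model, sent: [(w, 'NN') for w in sent]
print(jackknife_pos_tags([['I', 'ate'], ['a', 'fork']] * 5, dummy_train, dummy_tag)[:2])
```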

For the unannotated data in English, we use the BLLIP WSJ corpus (Charniak et al. 2000) containing about 43 million words.1 We use the MXPOST tagger trained on the training data to assign part-of-speech tags and use the baseline parser to process the sentences of the BLLIP corpus. For the unannotated data in Chinese, we use the Xinhua portion of Chinese Gigaword2 Version 2.0 (LDC2009T14) (Huang 2009), which has approximately 311 million words. We use the MMA system (Kruengkrai et al. 2009) trained on the training data to perform word segmentation and POS tagging and use the baseline parser to parse the sentences in the Gigaword data.

In collecting the base features, we remove the features that occur only once in the English data and fewer than four times in the Chinese data. These thresholds (one occurrence and four occurrences) are chosen based on the development data performance.

We measure the parser quality by the unlabeled attachment score (UAS), i.e., the percentage of tokens (excluding all punctuation tokens) with the correct HEAD. We also report the scores of the complete dependency tree evaluation (COMP).

9.3.2 Feature Selection on Development Sets

We evaluate the parsers with different settings on the development sets to select the meta-features.

1. We ensure that the text used for building the meta-features did not include the sentences of the Penn Treebank.
2. We exclude the sentences of the CTB data from the Gigaword data.


Table 9.4 Categories of base feature templates

Category  Example
N1P       hp, d(h,d)
N1WM      hw, d(h,d); hw, hp, d(h,d)
N2P       hp, dp, d(h,d)
N2WM      hw, dw, d(h,d); hw, dp, d(h,d)
N3P       hp, dp, cp, d(h,d,c)
N3WM      hw, dw, cw, d(h,d,c); dw, d+1p, cp, d(h,d,c)
N4P       hp, h+1p, cp, c+1p, d(h,d,c)
N4WM      hw, h+1w, cw, c+1w, d(h,d,c); hw, h+1p, cp, c+1p, d(h,d,c)

9.3.2.1 Different Models vs. Meta-features

In this section, we investigate the effect of different types of meta-features on the models trained on different sizes of training data on English.

There are too many base feature templates to test one by one, so we divide the templates into several categories. In Table 9.1, some templates are only related to part-of-speech tags (P), some are only related to surface words (W), and the others contain both part-of-speech tags and surface words (M). Table 9.4 shows the categories, where the numbers 1–4 refer to the number of words involved in the templates. For example, the templates of N3WM are related to three words and contain the templates of W and M. Based on the different categories of base templates, we have different sets of meta-features.3

We randomly select 1 % and 10 % of the sentences, respectively, from the training data. We train the POS taggers and baseline parsers on these small training data and use them to process the unannotated data. Then, we generate the meta-features based on the newly auto-parsed data. The meta parsers are trained on the different subsets of the training data with different sets of meta-features. Finally, we have three meta parsers, MP1, MP10, and MPFULL, which are trained on 1 %, 10 %, and 100 % of the training data.

Table 9.5 shows the results, where we add each category of Table 9.4 individually. From the table, we find that the meta-features that are only related to part-of-speech tags do not always help, while the ones related to the surface words are very helpful. We also find that MP1 provides the largest relative improvement among the three settings. These suggest that the more sparse the base features are, the more effective

3. We also test the setting of dividing WM into two subtypes: W and M. The results show that both subtypes provide positive results. To simplify, we merge W and M into one category, WM.


Table 9.5 Results with different categories of meta-features on PTB (dev)

System      MP1    MP10   MPFULL
Baseline    82.22  89.50  93.01
+N1P        82.42  89.48  93.08
+N1WM       82.80  89.42  93.19
+N2P        81.29  89.01  93.02
+N2WM       82.69  90.10  93.23
+N3P        83.32  89.73  93.05
+N3WM       84.47  90.75  93.80
+N4P        82.73  89.48  93.01
+N4WM       84.07  90.42  93.67
MetaParser  85.11  91.14  93.91

Table 9.6 Results with different types of meta-features on PTB (dev)

System      NumOfFeat   UAS
Baseline    27,119,354  93.01
+CORE       +498        93.84
+WithPOS    +14,993     93.82
+WithWORD   +312,373    93.27
MetaParser  +327,864    93.91

the corresponding meta-features are. Thus, we build the final parsers (MetaParser) by adding the meta-features of N1WM, N2WM, N3WM, and N4WM. The results show that MetaParser achieves better performance than the systems with individual sets of meta-features.

9.3.2.2 Different Meta-feature Types

In Table 9.2, there are three types of meta-feature templates. Here, the results of the parsers with different settings are shown in Table 9.6, where CORE refers to the first type, WithPOS refers to the second one, and WithWORD refers to the third one. The results show that with all the types, the parser (MetaParser) achieves the best performance. We also count the numbers of the meta-features: only 327,864 (or 1.2 %) features are added into MetaParser. Thus, we use all three types of meta-features in the final meta parsers.

9.3.3 Main Results on Test Sets

We then evaluate the meta parsers on the English and Chinese test sets.


Table 9.7 Main results on PTB (test)

            UAS    COMP
Baseline    92.76  48.05
MetaParser  93.77  51.36

Table 9.8 Main results on CTB5 (test)

            UAS    COMP
Baseline    81.01  29.71
MetaParser  83.08  32.21

Table 9.9 Effect of different sizes of auto-parsed data

           English  Chinese
Baseline   92.76    81.01
TrainData  91.93    80.40
P0.1       92.82    81.58
P1         93.14    82.23
P10        93.48    82.81
FULL       93.77    83.08

9.3.3.1 English

The results are shown in Table 9.7, where MetaParser refers to the meta parser. We find that the meta parser outperforms the baseline with an absolute improvement of 1.01 points (UAS). The improvement is significant in McNemar's test (p < 10^-7).

9.3.3.2 Chinese

The results are shown in Table 9.8. As in the experiment on English, the meta parser outperforms the baseline. We obtain an absolute improvement of 2.07 points (UAS). The improvement is significant in McNemar's test (p < 10^-8).

In summary, Tables 9.7 and 9.8 convincingly show the effectiveness of the proposed approach.

9.3.4 Different Sizes of Unannotated Data

Here, we consider the improvement relative to the sizes of the unannotated data used to generate the meta-features. We randomly select 0.1 %, 1 %, and 10 % of the sentences from the full data. Table 9.9 shows the results, where P0.1, P1, and P10 correspond to 0.1 %, 1 %, and 10 %, respectively. From the table, we find that the parsers obtain more benefits as we use more raw sentences. We also try generating the meta-features from the training data only, shown as TrainData in Table 9.9. However, the results show that these parsers perform worse than the baselines. The reason might be that only the known base features are included in the training data.


Table 9.10 Relevant results for English. Sup denotes the supervised parsers. Semi denotes the parsers with semi-supervised methods

Type  System        UAS    COMP
Sup   McDonald2006  91.5   –
      Koo2010       93.04  –
      Zhang2011     92.9   48.0
      Li2012        93.12  –
      Baseline      92.76  48.05
Semi  Koo2008       93.16  –
      Suzuki2009    93.79  –
      Chen2009      93.16  47.15
      Zhou2011      92.64  46.61
      Suzuki2011    94.22  –
      Chen2012      92.76  –
      MetaParser    93.77  51.36

Table 9.11 Relevant results for Chinese

System      UAS    COMP
Li2011      80.79  29.11
Hatori2011  81.33  29.90
Li2012      81.21  –
Wu2013      80.89  –
Baseline    81.01  29.71
MetaParser  83.08  32.21

9.3.5 Comparison with Previous Work

9.3.5.1 English

Table 9.10 shows the performance of the previous systems for comparison, where McDonald2006 refers to the second-order parser of McDonald and Pereira (2006), Koo2010 refers to the third-order parser with model1 of Koo and Collins (2010), Zhang2011 refers to the parser of Zhang and Nivre (2011), Li2012 refers to the unlabeled parser of Li et al. (2012), Koo2008 refers to the parser of Koo et al. (2008), Suzuki2009 refers to the parser of Suzuki et al. (2009), Chen2009 refers to the parser of Chen et al. (2009), Zhou2011 refers to the parser of Zhou et al. (2011), Suzuki2011 refers to the parser of Suzuki et al. (2011), and Chen2012 refers to the parser of Chen et al. (2012).

The results show that the meta parser outperforms most of the previous systems and obtains accuracy comparable to the best result, that of Suzuki2011 (Suzuki et al. 2011), which combines the clustering-based word representations of Koo et al. (2008) and a condensed feature representation. However, the proposed approach is much simpler than theirs, and we believe that the meta parser can be further improved by combining their methods.


9.3.5.2 Chinese

Table 9.11 shows the comparative results, where Li2011 refers to the parser of Li et al. (2011), Hatori2011 refers to the parser of Hatori et al. (2011), and Li2012 refers to the unlabeled parser of Li et al. (2012). The previously reported scores on this data are produced by supervised learning methods, and the baseline (supervised) parser provides comparable accuracy. We find that the score of the meta parser on this data is the best reported so far and significantly higher than the previous scores. Note that we use the auto-assigned POS tags in the test set to match the above previous studies.

9.3.6 Analysis

Here, we analyze the effect of the meta-features on the data sparseness problem.

We first check the effect of unknown features on the parsing accuracy. We calculate the number of unknown features in each sentence and compute the average number per word. The average numbers are used to eliminate the influence of varied sentence lengths. We sort the test sentences in increasing order of these average numbers and divide them equally into five bins. BIN 1 is assigned the sentences with the smallest numbers and BIN 5 those with the largest ones. Figure 9.2 shows the average accuracy scores of the baseline parsers against the bins. From the figure, we find that, for both languages, the baseline parsers perform worse when the sentences contain more unknown features.
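A compact sketch of the binning procedure follows; it assumes the per-sentence unknown-feature counts and sentence lengths have already been computed, which is all the procedure needs.

```python
def bin_sentences(sent_lengths, unknown_counts, n_bins=5):
    """Sort sentences by the average number of unknown features per word and
    split them into n_bins roughly equal bins (BIN 1 = smallest averages)."""
    avg = [u / max(n, 1) for u, n in zip(unknown_counts, sent_lengths)]
    order = sorted(range(len(avg)), key=lambda i: avg[i])
    size = len(order) // n_bins
    return [order[k * size:] if k == n_bins - 1 else order[k * size:(k + 1) * size]
            for k in range(n_bins)]

# Toy usage: 5 sentences given as (length, number of unknown features).
print(bin_sentences([10, 20, 5, 8, 30], [1, 10, 0, 4, 3]))   # [[2], [0], [4], [1], [3]]
```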

[Figure: average accuracy (Y-axis, 70–100) for each of the five bins (X-axis), one curve for English and one for Chinese]

Fig. 9.2 Accuracies relative to numbers of unknown features (average per word) by baseline parsers


[Figure: percentage of sentences (Y-axis, 0–50) for each of the five bins (X-axis), with "Better" and "Worse" curves]

Fig. 9.3 Improvement relative to numbers of active meta-features on English (average per word)

[Figure: percentage of sentences (Y-axis, 0–50) for each of the five bins (X-axis), with "Better" and "Worse" curves]

Fig. 9.4 Improvement relative to numbers of active meta-features on Chinese (average per word)

Then, we investigate the effect of the meta-features. For each sentence, we calculate the average number per word of active meta-features that are transformed from the unknown features. We sort the sentences in increasing order of these average numbers and divide them into five bins. BIN 1 is assigned the sentences with the smallest numbers and BIN 5 those with the largest ones. Figures 9.3 and 9.4 show the results, where "Better" is for the sentences on which the meta parsers provide better results than the baselines and "Worse" is for those on which the meta parsers provide worse results. We find that the gap between "Better" and "Worse" becomes larger as the sentences contain more active meta-features for the unknown features. A larger gap indicates a larger performance improvement. This indicates that the meta-features are very effective in processing the unknown features.


9.4 Summary

In this chapter, we have presented a simple but effective semi-supervised approach to learning meta-features from auto-parsed data for dependency parsing. A meta parser is built by combining the meta-features with the base features in a graph-based model. Further analysis indicates that the meta-features are very effective in processing the unknown features.

References

Ando, R., & Zhang, T. (2005). A high-performance semi-supervised learning method for text chunking. Association for Computational Linguistics, 1(43), 1–9.

Bohnet, B. (2010). Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd international conference on computational linguistics (Coling 2010), Beijing (pp. 89–97). Coling 2010 Organizing Committee. http://www.aclweb.org/anthology/C10-1011.

Carreras, X. (2007). Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 957–961). Association for Computational Linguistics.

Charniak, E., Blaheta, D., Ge, N., Hall, K., Hale, J., & Johnson, M. (2000). BLLIP 1987–89 WSJ Corpus release 1. LDC2000T43. Linguistic Data Consortium.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of EMNLP 2009, Singapore (pp. 570–579).

Chen, W., Zhang, M., & Li, H. (2012). Utilizing dependency language models for graph-based dependency parsing models. In Proceedings of ACL 2012, Jeju.

Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3, 951–991. doi:http://dx.doi.org/10.1162/jmlr.2003.3.4-5.951.

Duan, X., Zhao, J., & Xu, B. (2007). Probabilistic models for action-based Chinese dependency parsing. In Proceedings of ECML/ECPPKDD, Warsaw.

Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING 1996, Copenhagen (pp. 340–345).

Hatori, J., Matsuzaki, T., Miyao, Y., & Tsujii, J. (2011). Incremental joint POS tagging and dependency parsing in Chinese. In Proceedings of 5th international joint conference on natural language processing, Chiang Mai (pp. 1216–1224). Asian Federation of Natural Language Processing. http://www.aclweb.org/anthology/I11-1136.

Huang, C. R. (2009). Tagged Chinese Gigaword version 2.0. LDC2009T14. Linguistic Data Consortium.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Koo, T., & Collins, M. (2010). Efficient third-order dependency parsers. In Proceedings of ACL 2010, Uppsala (pp. 1–11). Association for Computational Linguistics.

Kruengkrai, C., Uchimoto, K., Kazama, J., Wang, Y., Torisawa, K., & Isahara, H. (2009). An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proceedings of ACL-IJCNLP 2009, Suntec (pp. 513–521). Association for Computational Linguistics.

Li, Z., Zhang, M., Che, W., & Liu, T. (2012). A separately passive-aggressive training algorithm for joint POS tagging and dependency parsing. In Proceedings of the 24th international conference on computational linguistics (Coling 2012), Mumbai. Coling 2012 Organizing Committee.

Li, Z., Zhang, M., Che, W., Liu, T., Chen, W., & Li, H. (2011). Joint models for Chinese POS tagging and dependency parsing. In Proceedings of EMNLP 2011, Edinburgh.


Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005, Ann Arbor (pp. 91–98). Association for Computational Linguistics.

McDonald, R., & Pereira, F. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of EACL 2006, Trento (pp. 81–88).

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP 1996, Philadelphia (pp. 133–142).

Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP 2009, Singapore (pp. 551–560). Association for Computational Linguistics.

Suzuki, J., Isozaki, H., & Nagata, M. (2011). Learning condensed feature representations from large unsupervised data sets for supervised learning. In Proceedings of ACL 2011, Portland (pp. 636–641). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-2112.

Xue, N., Xia, F., Chiou, F.-D., & Palmer, M. (2005). Building a large annotated Chinese corpus: The Penn Chinese treebank. Journal of Natural Language Engineering, 11(2), 207–238.

Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy (pp. 195–206).

Zhang, Y., & Clark, S. (2008). A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of EMNLP 2008, Honolulu (pp. 562–571).

Zhang, Y., & Nivre, J. (2011). Transition-based dependency parsing with rich non-local features. In Proceedings of ACL-HLT 2011, Portland (pp. 188–193). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-2033.

Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting web-derived selectional preference to improve statistical dependency parsing. In Proceedings of ACL-HLT 2011, Portland (pp. 1556–1565). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1156.


Chapter 10 Closing Remarks

In this chapter, we summarize the entire book. In particular, we list the approaches introduced in this book in a table. We then discuss the approaches further.

In this book, we have presented a comprehensive overview of the approaches in semi-supervised dependency parsing.

• We have introduced two major supervised models for dependency parsing in Chap. 2. The first is the graph-based model, which treats dependency parsing as a structure prediction problem in which the graphs are usually represented as factored structures. The second is the transition-based model, which learns a model for scoring transitions from one parser state to the next, conditioned on the parse history. The graph-based model uses exhaustive search and defines features over a limited scope, while the transition-based model uses greedy search or beam search and defines features over the decision history.

• We have introduced the approaches at the whole tree level, which make use of entire auto-parsed dependency trees, in Chap. 4. The conventional approaches pick up some high-quality auto-parsed training instances from unlabeled data using bootstrapping methods, such as self-training and co-training. The obvious drawback is that they only use the 1-best parse tree for each sentence. To overcome this problem, a new learning framework referred to as ambiguity-aware ensemble training is proposed to make use of a parse forest that combines multiple 1-best parse trees generated from different parsers on raw data.

• We have introduced the approaches at the word level, which utilize information based on word surfaces, in Chap. 5. The lexical information is very important for resolving ambiguous relationships in dependency parsing, but lexicalized statistics are sparse and difficult to estimate directly given a limited training data set. The approaches represent new features based on word clusters or word-to-word relations over large-scale raw data.

• We have introduced the approaches at the partial tree level, which make use of the information of partial structures from auto-parsed dependency trees. In recent years, researchers have advanced this field by exploiting the information


Table 10.1 Approaches of semi-supervised dependency parsing

Type                Approach (brief description)                 Paper
Whole tree level    Co-training                                  Sagae and Tsujii (2007)
                    Self-training                                Spreyer and Kuhn (2009)
                    Tri-training                                 Søgaard and Rishøj (2010)
                    Ambiguity-aware ensemble training            Li et al. (2014)
Partial tree level  Based on bilexical dependencies              van Noord (2007)
                    Based on bilexical dependencies              Chen et al. (2008)
                    Based on bi- and tri-gram subtrees           Chen et al. (2009)
                    Based on generative models                   Suzuki et al. (2009)
                    Based on condensed features                  Suzuki et al. (2011)
                    Based on dependency language model           Chen et al. (2012)
                    Based on meta-features                       Chen et al. (2013)
Word level          Based on word clusters                       Koo et al. (2008)
                    Based on web-derived word-to-word relations  Zhou et al. (2011)

from bilexical dependencies to more complex tree structures. The use of bilexical dependencies is described in Chap. 6. Chapter 7 introduces the approach that uses the information of lexical subtrees from auto-parsed data. We further introduce the approach of applying dependency language models in Chap. 8. Chapter 9 describes meta-features defined over surface words and part-of-speech tags, which represent more complex tree structures than bilexical dependencies and lexical subtrees.

As a summary, we list the representative approaches of semi-supervised dependency parsing in Table 10.1. From the table, we can see that semi-supervised dependency parsing has received more and more attention in recent years.

At the end of this book, let us discuss future directions. There are many ways in which the current semi-supervised approaches could be extended.

• It is worth revisiting the current approaches with truly big data in the big data era. Compared with the web data, the data used in the related work is not that big.

• The domain adaptation problem is still unresolved. The performance of the parsing systems drops a lot when adapting to new domains in the evaluation of SANCL 2012.1 Sentences from the Web, such as those from Twitter or Sina Weibo, are hard to parse correctly (Kong et al. 2014; Wang et al. 2014).

• The multilingual dependency parsing task is very interesting and has received much attention (McDonald et al. 2011; Täckström et al. 2012, 2013). Every language has its own characteristics (Hatori et al. 2011; Li et al. 2012). For example, we can perform character-level dependency parsing for Chinese (Zhang et al. 2014). How to transfer knowledge and capture the differences among languages is very challenging.

1. https://sites.google.com/site/sancl2012/


• How to use deep learning algorithms (Bengio 2009; Hinton et al. 2006) in dependency parsing is a very interesting topic (Chen et al. 2014). Le and Zuidema (2014) apply the Recursive Neural Network (RNN) model to dependency parsing. Chen and Manning (2014) develop a fast dependency parser using neural networks. But their systems still lag behind the state-of-the-art systems.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

Chen, D., & Manning, C. (2014). A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha (pp. 740–750). Association for Computational Linguistics. http://www.aclweb.org/anthology/D14-1082.

Chen, W., Kawahara, D., Uchimoto, K., Zhang, Y., & Isahara, H. (2008). Dependency parsing with short dependency relations in unlabeled data. In Proceedings of IJCNLP 2008, Hyderabad.

Chen, W., Kazama, J., Uchimoto, K., & Torisawa, K. (2009). Improving dependency parsing with subtrees from auto-parsed data. In Proceedings of EMNLP 2009, Singapore (pp. 570–579).

Chen, W., Zhang, M., & Li, H. (2012). Utilizing dependency language models for graph-based dependency parsing models. In Proceedings of ACL 2012, Jeju.

Chen, W., Zhang, M., & Zhang, Y. (2013). Semi-supervised feature transformation for dependency parsing. In Proceedings of EMNLP 2013, Seattle (pp. 1303–1313). Association for Computational Linguistics. http://www.aclweb.org/anthology/D13-1129.

Chen, W., Zhang, Y., & Zhang, M. (2014). Feature embeddings for dependency parsing. In Proceedings of COLING 2014, Dublin.

Hatori, J., Matsuzaki, T., Miyao, Y., & Tsujii, J. (2011). Incremental joint POS tagging and dependency parsing in Chinese. In Proceedings of 5th international joint conference on natural language processing, Chiang Mai (pp. 1216–1224). Asian Federation of Natural Language Processing. http://www.aclweb.org/anthology/I11-1136.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.

Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C., & Smith, N. A. (2014). A dependency parser for tweets. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha (pp. 1001–1012). Association for Computational Linguistics. http://www.aclweb.org/anthology/D14-1108.

Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, Columbus.

Le, P., & Zuidema, W. (2014). The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha (pp. 729–739). Association for Computational Linguistics. http://www.aclweb.org/anthology/D14-1081.

Li, Z., Zhang, M., Che, W., & Liu, T. (2012). A separately passive-aggressive training algorithm for joint POS tagging and dependency parsing. In Proceedings of the 24th international conference on computational linguistics (Coling 2012), Mumbai. Coling 2012 Organizing Committee.

Li, Z., Zhang, M., & Chen, W. (2014). Ambiguity-aware ensemble training for semi-supervised dependency parsing. In Proceedings of the annual meeting of the association for computational linguistics (ACL 2014), Baltimore (pp. 457–467).

McDonald, R., Petrov, S., & Hall, K. (2011). Multi-source transfer of delexicalized dependency parsers. In Proceedings of the conference on empirical methods in natural language processing, Edinburgh (pp. 62–72). Association for Computational Linguistics.


Sagae, K., & Tsujii, J. (2007). Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL 2007, Prague (pp. 1044–1050).

Søgaard, A., & Rishøj, C. (2010). Semi-supervised dependency parsing using generalized tri-training. In Proceedings of ACL, Uppsala (pp. 1065–1073).

Spreyer, K., & Kuhn, J. (2009). Data-driven dependency parsing of new languages using incomplete and noisy training data. In CoNLL, Boulder (pp. 12–20).

Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP 2009, Singapore (pp. 551–560). Association for Computational Linguistics.

Suzuki, J., Isozaki, H., & Nagata, M. (2011). Learning condensed feature representations from large unsupervised data sets for supervised learning. In Proceedings of ACL 2011, Portland (pp. 636–641). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-2112.

Täckström, O., McDonald, R., & Nivre, J. (2013). Target language adaptation of discriminative transfer parsers. In Proceedings of NAACL, Atlanta (pp. 1061–1071).

Täckström, O., McDonald, R., & Uszkoreit, J. (2012). Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies, Montréal (pp. 477–487). Association for Computational Linguistics.

van Noord, G. (2007). Using self-trained bilexical preferences to improve disambiguation accuracy. In Proceedings of IWPT-07, Prague.

Wang, W. Y., Kong, L., Mazaitis, K., & Cohen, W. W. (2014). Dependency parsing for Weibo: An efficient probabilistic logic programming approach. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha (pp. 1152–1158). Association for Computational Linguistics. http://www.aclweb.org/anthology/D14-1122.

Zhang, M., Zhang, Y., Che, W., & Liu, T. (2014). Character-level Chinese dependency parsing. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 1: long papers), Baltimore (pp. 1326–1336). Association for Computational Linguistics. http://www.aclweb.org/anthology/P14-1125.

Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting web-derived selectional preference to improve statistical dependency parsing. In Proceedings of ACL-HLT 2011, Portland (pp. 1556–1565). Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1156.