VACNET: Extracting and analyzing non-trivial linguistic structures at scale
Matthew Brook O'Donnell, Nick C. Ellis, Ute Römer & Gin Corden
English Language Institute, [email protected]
The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining, April 22, 2011


Page 1: VACNET: Extracting and analyzing non-trivial linguistic structures at scale

VACNET: Extracting and analyzing non-trivial linguistic structures at scale

Matthew Brook O'Donnell, Nick C. Ellis, Ute Römer & Gin Corden

English Language Institute

[email protected]

The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining

April 22, 2011

Page 2

Challenge of natural language for data mining

• Much work in NLP, IR and text classification relies upon frequency analysis of
  • single words
  • n-grams (contiguous word sequences of various lengths)
• These units are computationally trivial to retrieve (the Map-Reduce 'Hello World'!)
• Such techniques tend to use a 'bag of words' approach, disregarding structure
• Frequency and statistical measures highlight distinctive items and document 'aboutness'
• But this is a weak proxy for meaning, which remains somewhat elusive

NLP tools in a typical pipeline:

text → sentence splitting → word tokenization → POS tagging → chunking/parsing → named-entity recognition → meaning???

Can linguistic theory help?
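The first two stages of such a pipeline can be sketched with nothing more than regular expressions. This is a deliberately naive illustration, not how production systems work (those use trained tools such as NLTK or spaCy):

```python
import re

def split_sentences(text):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    # Naive word tokenizer: words (with optional apostrophe) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "She talks about her book. He worries about the weather!"
for sent in split_sentences(text):
    print(tokenize(sent))
```

Even this toy version shows why the later stages are hard: tokenization is mechanical, but nothing in the pipeline so far gets any closer to meaning.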

Page 3

Challenge of natural language for data mining

Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous approach that relevant context can be taken into account.

Matthew Russell, Author

Can linguistic theory help?... What is relevant context?

Page 4

Learning meaning in language

How are we able to learn what novel words mean?

① She moogels about her book

• each word contributes individual meaning
• verb meaning is central, yet verbs are highly polysemous
• the larger configuration of words carries meaning
• these configurations we call CONSTRUCTIONS

V about n

moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC): words like talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip

'recurrent patterns of linguistic elements that serve some well-defined linguistic function' (Ellis 2003)

Page 5

VACNET

A collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using:

• the COBUILD Verb Grammar Patterns descriptions
• tools from computational and corpus linguistics
• techniques from data mining, machine learning and network analysis

The project has two components:

(1) a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs
(2) psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected.

Page 6

V about n – some examples

• He grumbled incessantly about the 'disgusting' provincial life we had to lead on the island
• You should try to think ahead about your financial situation
• He worried persistently about the poverty of his social life
• She would keep banging on about her son
• He wondered briefly about the effects of prolonged exposure to solar radiation
• The housekeeper left the room, muttering about ingratitude
• I do not want to carp about the work of the Committee
• 'Any views expressed about Master Matthew?'
• There are several other valid justifications for teaching explicitly about language
• Those who gossip about him tend to meet with nasty accidents.

Page 7

VACNET: Language engineering challenge

• TASK
  – retrieval of 700+ verb argument constructions from a 100-million-word corpus with minimal intervention but a requirement for high precision and high recall
• Multidisciplinary TEAM
  – linguists, psychologists, information scientists
  – undergraduate/graduate student RAs, faculty
• TOOLS
  – dependency-parsed corpus in GraphML format
  – web-based precision analysis tool
  – processing pipeline

Page 8

Architecture: Large-scale extraction of constructions

(architecture diagram) Components: CORPUS (BNC, 100 million words) → POS tagging & dependency parsing → CouchDB document database; COBUILD Verb Patterns supply the construction descriptions; word sense disambiguation (drawing on WordNet and DISCO); statistical analysis of distributions; network analysis & visualization; web application.

Page 9

Method: Collaborative semi-automatic extraction

Page 10

Method: Collaborative semi-automatic extraction

1. DEFINE search graph
2. ENCODE in XML
3. CONVERT to Python code
4. SEARCH corpus and RECORD matches
5. ERROR CODE
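Step 4, searching the parsed corpus, might look like this in miniature. Everything here is illustrative: the token/dependency representation and the relation names `prep` and `pobj` are assumptions for the sketch, not the project's actual GraphML encoding or its generated Python code:

```python
# Hypothetical sketch: matching a "V about n" pattern against one
# dependency-parsed sentence represented as a simple edge list.

def match_v_about_n(tokens, deps):
    """tokens: list of (word, pos); deps: list of (head_idx, rel, dep_idx)."""
    matches = []
    for head, rel, dep in deps:
        # Look for a verb governing the preposition 'about' ...
        if rel == "prep" and tokens[head][1].startswith("V") and tokens[dep][0] == "about":
            # ... whose own dependent is a nominal prepositional object.
            for h2, r2, d2 in deps:
                if h2 == dep and r2 == "pobj" and tokens[d2][1].startswith("N"):
                    matches.append((tokens[head][0], "about", tokens[d2][0]))
    return matches

# "She worries about money"
tokens = [("She", "PRP"), ("worries", "VBZ"), ("about", "IN"), ("money", "NN")]
deps = [(1, "nsubj", 0), (1, "prep", 2), (2, "pobj", 3)]
print(match_v_about_n(tokens, deps))  # → [('worries', 'about', 'money')]
```

The real system compiles each XML-encoded search graph into a matcher of roughly this shape and runs it over every parsed sentence, recording matches for later precision coding.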

Page 11

Precision analysis interface

Page 12

Recall analysis

Page 13

Results: V about n

Verb      VAC freq
talk      2232
think     1810
know      879
hear      349
worry     347
forget    322
write     299
ask       298
say       281
care      250
go        203
complain  192
speak     181
find      148
learn     143
be        124
feel      118
look      115
wonder    102
read      101

Dimensions of analysis:
• Types – the list of different verbs occurring in the VAC
• Frequency – is the distribution Zipfian?
• Contingency – the attraction between verb and construction
• Semantics – prototypicality of meaning & radial structure (Zipfian?)

Page 14


Results: V about n

Verb        VAC freq  Corpus freq  Faithfulness
reminisce   12        98           0.1224
moon        5         51           0.0980
talk        2232      24566        0.0909
brag        5         69           0.0725
carp        5         72           0.0694
worry       347       5027         0.0690
generalize  15        244          0.0615
generalise  10        176          0.0568
enthuse     13        236          0.0551
complain    192       3947         0.0486
grumble     18        407          0.0442
rave        9         205          0.0439
fret        10        265          0.0377
fuss        9         246          0.0366
care        250       7064         0.0354
speculate   26        771          0.0337
gossip      9         270          0.0333
forget      322       10240        0.0314
enquire     38        1341         0.0283
prowl       5         179          0.0279
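The Faithfulness column is simply the proportion of a verb's total corpus occurrences that fall inside the VAC (VAC freq / corpus freq), which can be reproduced directly from the table's own figures:

```python
# Faithfulness = VAC frequency / overall corpus frequency.
counts = {              # (VAC freq, corpus freq) — figures from the table above
    "reminisce": (12, 98),
    "talk": (2232, 24566),
    "worry": (347, 5027),
}
faithfulness = {v: vac / corpus for v, (vac, corpus) in counts.items()}
for verb, f in sorted(faithfulness.items(), key=lambda kv: -kv[1]):
    print(f"{verb:10s} {f:.4f}")
```

Note the effect of the measure: low-frequency verbs like reminisce or brag outrank talk because almost every occurrence of them is in this construction.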

Page 15

Summary across VACs:

VAC             Types  Tokens  TTR    Lead verb    Token*Faith  MIcw
V about n       365    3519    10.37  talk         talk         brag
V across n      799    4889    16.34  come         spread       scud
V after n       1168   7528    15.52  look         look         lust
V among pl-n    417    1228    33.96  find         divide       nestle
V around n      761    3801    20.02  look         revolve      traipse
V as adj        235    1012    23.22  know         regard       class
V as n          1702   34383   4.95   know         act          masquerade
V at n          1302   9700    13.42  look         look         officiate
V between pl-n  669    3572    18.73  distinguish  distinguish  sandwich
V for n         2779   79894   3.48   look         wait         vie
V in n          2671   37766   7.07   find         result       couch
V into n        1873   46488   4.03   go           divide       delve
V like n        548    1972    27.79  look         look         glitter
V n n           663    9183    7.22   give         give         rename
V of n          1222   25155   4.86   think        consist      partake
V over n        1312   9269    14.15  go           preside      pore
V through n     842    4936    17.06  go           riffle       riffle
V to n          707    7823    9.04   go           listen       randomize
V towards n     190    732     25.96  move         bias         gravitate
V under n       1243   8514    14.60  come         come         wilt
V way prep      365    2896    12.60  make         wend         wend
V with n        1942   24932   7.79   deal         deal         pepper
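The TTR column works out as the type/token ratio expressed as a percentage (types / tokens × 100); a quick check against three rows of the table:

```python
# TTR (type/token ratio, as a percentage) recomputed from the table's figures.
rows = {"V about n": (365, 3519), "V for n": (2779, 79894), "V with n": (1942, 24932)}
ttr = {vac: types / tokens * 100 for vac, (types, tokens) in rows.items()}
for vac, value in ttr.items():
    print(f"{vac:10s} TTR = {value:.2f}")
```

A low TTR (V for n at 3.48) means a few verbs dominate the construction; a high one (V among pl-n at 33.96) means its tokens are spread thinly over many verb types.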

Page 16

Initial findings

• The frequency distributions of the types occupying each VAC are Zipfian
• The most frequent verb in each VAC is much more frequent than the other members, taking the lion's share of the distribution
• The most frequent verb in each VAC is prototypical of that construction's functional interpretation: generic in its action semantics
• VACs are selective in their verb form family occupancy:
  – individual verbs select particular constructions
  – particular constructions select particular verbs
  – there is greater contingency between verb types and constructions
• VACs are coherent in their semantics.
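The Zipfian claim is easy to sanity-check: fit a line to log frequency against log rank for the V about n verb counts reported earlier; a slope near −1 is the Zipf signature. A minimal sketch using the slide's own figures:

```python
import math

# V about n verb frequencies from the results slide (top 20 types, rank order).
freqs = [2232, 1810, 879, 349, 347, 322, 299, 298, 281, 250,
         203, 192, 181, 148, 143, 124, 118, 115, 102, 101]

# A Zipfian distribution is roughly linear in log(rank) vs log(freq);
# estimate the slope by ordinary least squares.
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"log-log slope: {slope:.2f}")  # strongly negative, consistent with Zipf
```

Twenty types is of course a tiny sample; the project's actual analyses fit the full distributions.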

Page 17

What do speakers know about verbs in VACs?

s/he/it _____ about the …

Two experiments: 276 native speakers and 276 L1-German speakers of English were asked to fill the gap with the first word that comes to mind given the prompt.

Page 18

But what about meaning?

• We want to quantify the semantic coherence or 'clumpiness' of the verbs extracted in the previous steps
  – {think, know, hear, worry, care, …} ABOUT
• Construction patterns are productive units in language and subject to polysemy just like words. Can we separate meaning groups within verb distributions?
  – COMMUNICATION: {talk, write, ask, say, argue, …} ABOUT
  – COGNITION: {think, know, hear, worry, care, …} ABOUT
  – MOTION: {move, walk, run, fall, wander, …} ABOUT
• The semantic sources must not be based on localized distributional language analysis: use WordNet and Roget's
  – Pedersen et al. (2004) WordNet similarity measures
  – Kennedy, A. (2009). The Open Roget's Project: electronic lexical knowledge base

Page 19

Building a semantic network

• Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.) to create a network
• nodes = lemma forms from the VAC/CEC distribution
• edges = links between nodes for the top n similarity scores for a pair of verbs

(network figure showing COGNITION and COMMUNICATION clusters)
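A toy sketch of that construction, with invented similarity scores (the real project draws them from WordNet, Roget and DISCO): each verb keeps edges only to its top-n most similar neighbours.

```python
# Hypothetical pairwise similarity scores between verb lemmas.
sims = {
    ("talk", "speak"): 0.9, ("talk", "write"): 0.6, ("think", "know"): 0.8,
    ("think", "worry"): 0.7, ("know", "hear"): 0.5, ("speak", "write"): 0.4,
}

def build_network(sims, top_n=2):
    # Collect each node's scored neighbours from the symmetric pair scores.
    neighbours = {}
    for (a, b), score in sims.items():
        neighbours.setdefault(a, []).append((score, b))
        neighbours.setdefault(b, []).append((score, a))
    # Keep only each node's top_n strongest links.
    return {v: sorted(ns, reverse=True)[:top_n] for v, ns in neighbours.items()}

net = build_network(sims)
print(net["talk"])  # strongest neighbours of 'talk'
```

Thresholding to top-n links is what makes the cluster structure visible: weak cross-cluster similarities are pruned, leaving dense within-group neighbourhoods.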

Page 20

Community detection

top 100 verbs in VAC V about n

Page 21

Semantic networks: exploring community detection algorithms

• Edge Betweenness (Girvan & Newman, 2002)
• Fast Greedy (Clauset, Newman & Moore, 2004)
• Label Propagation (Raghavan, Albert & Kumara, 2007)
• Leading Eigenvector (Newman, 2006)
• Spinglass (Reichardt & Bornholdt, 2006)
• Walktrap (Pons & Latapy, 2005)
• Louvain (Blondel, Guillaume, Lambiotte & Lefebvre, 2008)
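As a flavour of how one of these works, here is a minimal pure-Python sketch of label propagation (Raghavan et al., 2007) on an invented six-verb graph with two obvious communities; production runs would use a library such as igraph.

```python
import random

# Toy verb graph: two disconnected triangles (invented for illustration).
edges = [("talk", "speak"), ("talk", "write"), ("speak", "write"),
         ("think", "know"), ("think", "worry"), ("know", "worry")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

random.seed(0)
labels = {v: v for v in adj}           # start with a unique label per node
for _ in range(10):                    # asynchronous update passes
    nodes = list(adj)
    random.shuffle(nodes)
    for v in nodes:
        # Adopt the label held by the majority of v's neighbours (ties random).
        counts = {}
        for nb in adj[v]:
            counts[labels[nb]] = counts.get(labels[nb], 0) + 1
        best = max(counts.values())
        labels[v] = random.choice([l for l, c in counts.items() if c == best])

print(labels)  # labels only spread within a component, separating the two triangles
```

Because labels can only propagate along edges, the two triangles necessarily end up with disjoint label sets, which is exactly the community split the slide's figure shows for COGNITION vs. COMMUNICATION verbs.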

Page 22

Page 23

VACNET summary

• Challenge of natural language for data mining: the project investigates usage of VACs at scale
  – constructions = meaning through patterns
  – IR challenge: retrieving non-trivial structures at scale
• Corpus analysis examines the distributions of verbs in VACs: frequency distribution, contingency, semantics
• Psycholinguistic experiments explore the psychological reality of VACs
• VACNET: a structured inventory, verb-to-construction and construction-to-verb, valuable for NLP and DM tasks
• Future explorations: train classifiers on our datasets; tackle 'big data' sets

Thank you!

[email protected]