Computational Linguistics - The Stanford Natural Language

Computational Linguistics (aka Natural Language Processing) Bill MacCartney SymSys 100 Stanford University 26 May 2011 (some slides adapted from Chris Manning)


Page 1: Computational Linguistics - The Stanford Natural Language

Computational Linguistics (aka Natural Language Processing)

Bill MacCartney
SymSys 100

Stanford University
26 May 2011

(some slides adapted from Chris Manning)

Page 2: Computational Linguistics - The Stanford Natural Language

xkcd snarkiness

OK, Randall, it’s funny… but wrong!

cartoon from xkcd.com

Page 3: Computational Linguistics - The Stanford Natural Language

A word on terminology

If you call it…

•  Computational Linguistics (CL)
   •  …you’re a linguist!
   •  …you use computers to study language

•  Natural Language Processing (NLP)
   •  …you’re a computer scientist!
   •  …you work on applications involving language

But really, they’re pretty much synonymous

Page 4: Computational Linguistics - The Stanford Natural Language

Let’s get situated!

Today, we are in this interstice

cartoon from xkcd.com

Page 5: Computational Linguistics - The Stanford Natural Language

NLP: The Vision

I’m sorry, Dave. I can’t do that.

Oh, dear!

That is correct, captain.

Page 6: Computational Linguistics - The Stanford Natural Language

Language: the ultimate UI

Where is A Bug’s Life playing in Mountain View?

A Bug’s Life is playing at the Century 16 Theater.

When is it playing there?

It’s playing at 2pm, 5pm, and 8pm.

OK. I’d like 1 adult and 2 children for the first show. How much would that cost?

But we need domain knowledge, discourse knowledge, world knowledge
(Not to mention linguistic knowledge!)

Page 7: Computational Linguistics - The Stanford Natural Language

NLP: Goals of the field

•  From the lofty…
   •  full-on natural language understanding
   •  participation in spoken dialogues
   •  open-domain question answering
   •  real-time bi-directional translation

•  …to the mundane
   •  identifying spam
   •  categorizing news stories (& other docs)
   •  finding & comparing product information on the web
   •  assessing sentiment toward products, brands, stocks, …

Predominant in recent years

Page 8: Computational Linguistics - The Stanford Natural Language

NLP in the commercial world

Powerset

Page 9: Computational Linguistics - The Stanford Natural Language

Current motivations for NLP

What’s driving NLP? Three trends:

•  The explosion of machine-readable natural language text
   •  Exabytes (10^18 bytes) of text, doubling every year or two
   •  Web pages, emails, IMs, SMSs, tweets, docs, PDFs, …
   •  Opportunity, and increasing necessity, to extract meaning

•  Mediation of human interactions by computers
   •  Opportunity for the computer in the loop to do much more

•  Growing role of language in human-computer interaction

Page 10: Computational Linguistics - The Stanford Natural Language

Further motivation for CL

One reason for studying language, and for me personally the most compelling reason, is that it is tempting to regard language, in the traditional phrase, as a “mirror of mind”.

Chomsky, 1975

For the same reason, computational linguistics is a compelling way to study psycholinguistics and language acquisition.

Sometimes, the best way to understand something is to build a model of it.

What I cannot create, I do not understand.

Feynman, 1988

Page 11: Computational Linguistics - The Stanford Natural Language

Early history: 50s and 60s

•  Foundational work on automata, formal languages, probabilistic modeling, and information theory

•  First speech systems (Davis et al., Bell Labs)

•  MT heavily funded by military; huge overconfidence

•  But using machines dumber than a pocket calculator

•  Little understanding of syntax, semantics, pragmatics

•  ALPAC report (1966): crap, this is really hard!

Page 12: Computational Linguistics - The Stanford Natural Language

Refocusing: 70s and 80s

•  Foundational work on speech recognition: stochastic modeling, hidden Markov models, the “noisy channel”

•  Ideas from this work would later revolutionize NLP!

•  Logic programming, rules-driven AI, deterministic algorithms for syntactic parsing (e.g., LFG)

•  Increasing interest in natural language understanding: SHRDLU, LUNAR, CHAT-80

•  But symbolic AI hit the wall: “AI winter”

Page 13: Computational Linguistics - The Stanford Natural Language

The statistical revolution: 90s

•  Influx of new ideas from EE & ASR: probabilistic modeling, corpus statistics, supervised learning, empirical evaluation

•  New sources of data: explosion of machine-readable text; human-annotated training data (e.g., the Penn Treebank)

•  Availability of much more powerful machines

•  Lowered expectations: forget full semantic understanding, let’s do text cat, part-of-speech tagging, NER, and parsing!

Page 14: Computational Linguistics - The Stanford Natural Language

The rise of the machines: 00s

•  Consolidation of the gains of the statistical revolution

•  More sophisticated statistical modeling and machine learning algorithms: MaxEnt, SVMs, Bayes Nets, LDA, etc.

•  Big big data: 100x growth of web, massive server farms

•  Focus shifting from supervised to unsupervised learning

•  Revived interest in higher-level semantic applications

Page 15: Computational Linguistics - The Stanford Natural Language

Subfields and tasks

Status labels used on the slide: mostly solved, making good progress, still really hard

•  Spam detection
   OK, let’s meet by the big… ✓
   D1ck too small? Buy V1AGRA… ✗

•  Text categorization
   Phillies shut down Rangers 2-0 → SPORTS
   Jobless rate hits two-year low → BUSINESS

•  Part-of-speech (POS) tagging
   Colorless green ideas sleep furiously.
   ADJ ADJ NOUN VERB ADV

•  Named entity recognition (NER)
   Obama met with UAW leaders in Detroit…
   PERSON ORG LOC

•  Information extraction (IE)
   You’re invited to our bunga bunga party, Friday May 27 at 8:30pm in Cordura Hall
   Party, May 27 (add)

•  Sentiment analysis
   The pho was authentic and yummy.
   Waiter ignored us for 20 minutes.

•  Coreference resolution
   Obama told Mubarak he shouldn’t run again.

•  Word sense disambiguation (WSD)
   I need new batteries for my mouse.

•  Syntactic parsing
   I can see Russia from my house!

•  Machine translation (MT)
   Our specialty is panda fried rice.
   我们的专长是熊猫炒饭

•  Summarization
   Sheen continues rant against… (three stories)
   Sheen is nuts

•  Question answering (QA)
   Q. What currency is used in China?
   A. The yuan

•  Textual inference & paraphrase
   T. Thirteen soldiers lost their lives…
   H. Several troops were killed in the…
   YES

•  Discourse & dialog
   Where is Thor playing in SF?
   Metreon at 4:30 and 7:30

•  Semantic search
   Query: people protesting globalization
   Result: …demonstrators stormed IMF offices…

Page 16: Computational Linguistics - The Stanford Natural Language

Why is NLP hard?

Natural language is:

•  highly ambiguous at all levels

•  complex, with recursive structures and coreference

•  subtle, exploiting context to convey meaning

•  fuzzy and vague

•  involves reasoning about the world

•  part of a social system: persuading, insulting, amusing, …

(Nevertheless, simple features often do half the job!)

Page 17: Computational Linguistics - The Stanford Natural Language

Meanings and expressions

soda

soft drink

pop

beverage

Coke

Page 18: Computational Linguistics - The Stanford Natural Language

One meaning, many expressions

To build a shopping search engine, you need to extract product information from vendors’ websites:

•  Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor

•  Image Capture Device
   Total Pixels: Approx. 3.34 million
   Effective Pixels: Approx. 3.24 million

•  Image sensor
   Total Pixels: Approx. 2.11 million-pixel

•  Imaging sensor
   Total Pixels: Approx. 2.11 million, 1,688 (H) x 1,248 (V)

•  CCD
   Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])
   Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])
   Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])

These all came from the same vendor’s website!

Page 19: Computational Linguistics - The Stanford Natural Language

One meaning, many expressions

Or consider a semantic search application:

Search: Russia increasing price of gas for Georgia

•  Gazprom confirms two-fold increase in gas price for Georgia

•  Gazprom doubles Georgia's gas bill

•  Russia gas monopoly to double price of gas

•  Russia hits Georgia with huge rise in its gas bill

•  Russia plans to double Georgian gas price

•  Russia doubles gas bill to “punish” neighbour Georgia

Page 20: Computational Linguistics - The Stanford Natural Language

One expression, many meanings

cartoon from qwantz.com

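Page 21: Computational Linguistics - The Stanford Natural Language

Syntactic & semantic ambiguity

Fruit flies like a banana

[Figure: two parse trees over “Fruit flies like a banana”, one with node labels NP, NP, VP, S and one with node labels NP, NP, PP, VP, S, with the captions “syntactic ambiguity” and “semantic ambiguity”]

photos from worth1000.com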

Page 22: Computational Linguistics - The Stanford Natural Language

Ambiguous headlines

•  Teacher Strikes Idle Kids
•  China to Orbit Human on Oct. 15
•  Red Tape Holds Up New Bridges
•  Hospitals Are Sued by 7 Foot Doctors
•  Juvenile Court to Try Shooting Defendant
•  Local High School Dropouts Cut in Half
•  Police: Crack Found in Man's Buttocks

Page 23: Computational Linguistics - The Stanford Natural Language

OK, why else is NLP hard?

Oh so many reasons!

•  non-standard English
   Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥

•  segmentation issues
   the New York-New Haven Railroad
   the New York-New Haven Railroad

•  idioms
   dark horse
   get cold feet
   lose face
   throw in the towel

•  neologisms
   unfriend
   retweet
   bromance
   teabagger

•  garden path sentences
   The man who hunts ducks out on weekends.
   The cotton shirts are made from grows here.

•  tricky entity names
   …a mutation on the for gene…
   Where is A Bug’s Life playing…
   Most of Let It Be was recorded…

•  world knowledge
   Mary and Sue are sisters.
   Mary and Sue are mothers.

•  prosody
   I never said she stole my money.
   I never said she stole my money.
   I never said she stole my money.

•  lexical specificity

But that’s what makes it fun!

Page 24: Computational Linguistics - The Stanford Natural Language

So, how to make progress?

•  The task is difficult! What tools do we need?
   •  Knowledge about language
   •  Knowledge about the world
   •  A way to combine knowledge sources

•  The answer that’s been getting traction:
   •  probabilistic models built from language data
   •  P(“maison” → “house”) high
   •  P(“L’avocat général” → “the general avocado”) low

•  Some think this is a fancy new “A.I.” idea
   •  But really it’s an old idea stolen from the electrical engineers…

Page 25: Computational Linguistics - The Stanford Natural Language

Machine translation (MT)

美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。

The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

•  The classic acid test for natural language processing.

•  Requires capabilities in both interpretation and generation.

•  About $10 billion spent annually on human translation.

Page 26: Computational Linguistics - The Stanford Natural Language

Empirical solution

Parallel Texts: The Rosetta Stone

•  Hieroglyphs
•  Demotic
•  Greek

Page 27: Computational Linguistics - The Stanford Natural Language

Empirical solution

Parallel Texts:
–  Hong Kong Legislation
–  Macao Legislation
–  Canadian Parliament Hansards
–  United Nations Reports
–  European Parliament
–  Instruction Manuals
–  Multinational company websites

Hmm, every time one sees “banco”, translation is “bank” or “bench”… If it’s “banco de…”, it always becomes “bank”, never “bench”…

slide from Kevin Knight
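
The “banco” observation on this slide can be sketched as a simple count over sentence pairs. This is only a minimal illustration, not the method of any particular MT system: the toy Spanish-English pairs and the helper name translation_counts are invented here, and a real system would first have to work out which words are aligned with which.

```python
from collections import Counter

# Hypothetical toy parallel corpus (Spanish, English), invented for illustration.
PAIRS = [
    ("el banco de españa subió los tipos", "the bank of spain raised rates"),
    ("trabaja en el banco de inversiones", "she works at the investment bank"),
    ("se sentó en el banco", "he sat on the bench"),
    ("el banco cerró a las dos", "the bank closed at two"),
    ("pintaron el banco del parque", "they painted the park bench"),
]

def translation_counts(pairs, source_phrase, candidates):
    """Count how often each candidate English word appears in the
    translation of sentences that contain the source phrase."""
    counts = Counter()
    for src, tgt in pairs:
        # Pad with spaces so that only whole-word matches count.
        if f" {source_phrase} " in f" {src} ":
            for cand in candidates:
                if cand in tgt.split():
                    counts[cand] += 1
    return counts

banco = translation_counts(PAIRS, "banco", ["bank", "bench"])
banco_de = translation_counts(PAIRS, "banco de", ["bank", "bench"])
print(banco)     # both "bank" and "bench" occur
print(banco_de)  # only "bank" occurs in this toy data
```

With more context (“banco de” rather than “banco”), the counts in this toy data become unambiguous, which is the pattern the slide describes.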

Page 28: Computational Linguistics - The Stanford Natural Language

Sindarin-English

I amar prestar aen.           The world is changed.

Han mathon ne nen.            I feel it in the waters.

Han mathon ne chae.           I feel it in the earth.

A han noston ned 'wilith.     I smell it in the air.

Fellowship of the Ring movie script

slide from Lori Levin

Page 29: Computational Linguistics - The Stanford Natural Language

Statistical MT

Suppose we had a probabilistic model of translation P(e | f)

Example: suppose f is “de rien”

P(you’re welcome | de rien)  = 0.45
P(nothing | de rien)         = 0.13
P(piddling | de rien)        = 0.01
P(underpants | de rien)      = 0.000000001

Then the best translation for f is argmax_e P(e | f)
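
As a small sketch of what argmax_e P(e | f) means in practice, the snippet below picks the most probable translation from the four values given on this slide. The dictionary name p_e_given_f is just a label chosen for this example.

```python
# P(e | f) for f = "de rien", copied from the slide.
p_e_given_f = {
    "you're welcome": 0.45,
    "nothing": 0.13,
    "piddling": 0.01,
    "underpants": 0.000000001,
}

# argmax over e: the candidate with the highest conditional probability.
best = max(p_e_given_f, key=p_e_given_f.get)
print(best)
```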

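Page 30: Computational Linguistics - The Stanford Natural Language

A Bayesian approach

\[
\begin{aligned}
\hat{e} &= \arg\max_e P(e \mid f) \\
        &= \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} \\
        &= \arg\max_e P(f \mid e)\, P(e)
\end{aligned}
\]

(P(f) does not depend on e, so dropping it does not change the argmax.)

P(f | e): translation model (fidelity)

P(e): language model (fluency)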

Page 31: Computational Linguistics - The Stanford Natural Language

The “noisy channel” model

illustration from Jurafsky & Martin

Page 32: Computational Linguistics - The Stanford Natural Language

Language models (LMs)

•  Noisy channel model requires language model P(e)

•  LM tells us which sentences seem likely or “good”

•  Given some candidate translations, LM helps with:
   •  word choice (“shrank from” or “shrank of”?)
   •  word ordering (“tough decisions” or “decisions tough”?)

sentence                                P(e)
He shrank from tough decisions.         1.89e-11
He shrank from important decisions.     9.46e-12
He shrank of tough decisions.           7.11e-16
He shrank from decisions tough.         3.21e-17
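
To make the comparison concrete, the sketch below ranks the four candidates by the P(e) values from the table and measures, in orders of magnitude, how strongly the language model prefers “shrank from” over “shrank of”. The variable names are chosen for this example only.

```python
import math

# P(e) values copied from the table on this slide.
lm_prob = {
    "He shrank from tough decisions.": 1.89e-11,
    "He shrank from important decisions.": 9.46e-12,
    "He shrank of tough decisions.": 7.11e-16,
    "He shrank from decisions tough.": 3.21e-17,
}

# Rank candidates from most to least probable under the language model.
ranking = sorted(lm_prob, key=lm_prob.get, reverse=True)

# Word choice: how many orders of magnitude separate "from" and "of"?
log_ratio = math.log10(
    lm_prob["He shrank from tough decisions."]
    / lm_prob["He shrank of tough decisions."]
)

for sentence in ranking:
    print(f"{lm_prob[sentence]:.2e}  {sentence}")
print(f"'from' beats 'of' by about {log_ratio:.1f} orders of magnitude")
```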

Page 33: Computational Linguistics - The Stanford Natural Language

Statistical language models

•  Where will the language model come from?

•  We’ll build it by counting things in corpus data!

•  Statistical estimation of model parameters

•  But we can’t just count whole sentences

sentence                                count      P(e)
He shrank from tough decisions.         1/49208    2.03e-05    too high!
He shrank from important decisions.     0/49208    0           too low!
He shrank of tough decisions.           0/49208    0           too low!
He shrank from decisions tough.         0/49208    0           too low!
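
The problem with counting whole sentences can be reproduced on a tiny scale. The sketch below assumes a hypothetical four-sentence corpus (invented for this example) and estimates P(e) as count / corpus size: the one sentence that happens to occur gets a large probability, and a perfectly reasonable unseen sentence gets zero.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
CORPUS = [
    "he shrank from tough decisions",
    "she made tough decisions",
    "he shrank from the cold",
    "they faced important decisions",
]
sentence_counts = Counter(CORPUS)

def sentence_mle(sentence):
    """Relative-frequency estimate of P(e) from whole-sentence counts."""
    return sentence_counts[sentence] / len(CORPUS)

seen = sentence_mle("he shrank from tough decisions")
unseen = sentence_mle("he shrank from important decisions")
print(seen)    # far too high for any single sentence
print(unseen)  # zero, although the sentence is perfectly fine
```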

Page 34: Computational Linguistics - The Stanford Natural Language

N-gram language models

•  Instead, we’ll break things into pieces

P(He shrank from tough decisions) = P(He | •) × P(shrank | He) × P(from | shrank) × … × P(decisions | tough)

•  This is called a bigram language model

•  We can estimate bigram probabilities from corpus

w1        w2        C(w1)     C(w1 w2)    P(w2 | w1)
•         He        49208     978         0.0199
He        shrank    53142     21          0.0004
shrank    from      122       17          0.1393
from      tough     18777     184         0.0098
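
The P(w2 | w1) column is just C(w1 w2) / C(w1). The sketch below recomputes it from the counts in the table and multiplies the four listed factors. Since the table does not give counts for P(decisions | tough), the product covers only the prefix “He shrank from tough”; the function names are chosen for this example.

```python
# (w1, w2) -> (C(w1), C(w1 w2)), copied from the table on this slide.
# "•" marks the start of a sentence.
COUNTS = {
    ("•", "He"): (49208, 978),
    ("He", "shrank"): (53142, 21),
    ("shrank", "from"): (122, 17),
    ("from", "tough"): (18777, 184),
}

def bigram_prob(w1, w2):
    """Relative-frequency estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    c_w1, c_w1w2 = COUNTS[(w1, w2)]
    return c_w1w2 / c_w1

def prefix_prob(words):
    """Product of bigram probabilities over a sentence prefix."""
    prob = 1.0
    for w1, w2 in zip(["•"] + words, words):
        prob *= bigram_prob(w1, w2)
    return prob

for (w1, w2) in COUNTS:
    print(f"P({w2} | {w1}) = {bigram_prob(w1, w2):.4f}")

p_prefix = prefix_prob(["He", "shrank", "from", "tough"])
print(f"P(He shrank from tough ...) = {p_prefix:.3e}")
```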

Page 35: Computational Linguistics - The Stanford Natural Language

Statistical translation models

•  Noisy channel also needs translation model P(f | e)

•  Similar strategy: break sentence pairs into phrases

•  Count co-occurring pairs in a large parallel corpus

•  (But I’ll skip the gory details…)

e                  f                       C(e)      C(e, f)    P(f | e)
he shrank          il lui répugnait        17        6          0.3529
from               de                      27111     17855      0.6586
from               des                     27111     6434       0.2373
tough decisions    décisions difficiles    98        81         0.8265
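
The last column can be checked the same way as for the language model: P(f | e) = C(e, f) / C(e). The sketch below recomputes it from the counts in the table; the names PHRASE_COUNTS and translation_prob are chosen for this example, and the phrase extraction step that produces such counts is not shown.

```python
# (e, f) -> (C(e), C(e, f)), copied from the table on this slide.
PHRASE_COUNTS = {
    ("he shrank", "il lui répugnait"): (17, 6),
    ("from", "de"): (27111, 17855),
    ("from", "des"): (27111, 6434),
    ("tough decisions", "décisions difficiles"): (98, 81),
}

def translation_prob(e, f):
    """Relative-frequency estimate P(f | e) = C(e, f) / C(e)."""
    c_e, c_ef = PHRASE_COUNTS[(e, f)]
    return c_ef / c_e

for (e, f) in PHRASE_COUNTS:
    print(f"P({f} | {e}) = {translation_prob(e, f):.4f}")

# The two listed translations of "from" account for most, but not all,
# of its probability mass; other translations share the remainder.
from_mass = translation_prob("from", "de") + translation_prob("from", "des")
print(f"P(de | from) + P(des | from) = {from_mass:.4f}")
```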

Page 36: Computational Linguistics - The Stanford Natural Language

Statistical MT Systems

[Diagram: pipeline of a statistical MT system]

•  French/English Parallel Texts → Statistical Analysis → Translation Model P(f | e)
   Example pair: “Michelle, ma belle, sont les mots qui vont très bien ensemble” / “Michelle, my beautiful, are words that go together well”

•  English Texts → Statistical Analysis → Language Model P(e)
   Example text: “Many great traditions in art originated in the art of one of the five…”

•  French: J’ai très faim

•  Broken English: What hunger have I, Hungry I am so, I am so hungry, Have I that hunger…

•  English: I am so hungry

•  Decoding algorithm: argmax_e P(f | e) P(e)
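
The decoding step, argmax_e P(f | e) P(e), can be sketched over the four candidate outputs shown on this slide. All probability values below are hypothetical, chosen only to illustrate how the two models interact; the function name decode is also an assumption of this example. A real decoder searches a very large space of candidates rather than scoring a fixed list.

```python
import math

# Candidates for f = "J'ai très faim", taken from the slide.
# All probabilities are hypothetical values chosen for illustration.
TRANSLATION_MODEL = {   # P(f | e): fidelity
    "What hunger have I": 0.30,
    "Hungry I am so": 0.20,
    "I am so hungry": 0.20,
    "Have I that hunger": 0.25,
}
LANGUAGE_MODEL = {      # P(e): fluency
    "What hunger have I": 1e-9,
    "Hungry I am so": 1e-10,
    "I am so hungry": 1e-6,
    "Have I that hunger": 1e-9,
}

def decode(candidates):
    """Return the candidate e maximizing log P(f | e) + log P(e)."""
    def score(e):
        return math.log(TRANSLATION_MODEL[e]) + math.log(LANGUAGE_MODEL[e])
    return max(candidates, key=score)

best = decode(list(TRANSLATION_MODEL))
most_faithful = max(TRANSLATION_MODEL, key=TRANSLATION_MODEL.get)
print("translation model alone:", most_faithful)
print("translation model + language model:", best)
```

Summing log probabilities instead of multiplying raw probabilities gives the same argmax and avoids numerical underflow on longer sentences.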

Page 37: Computational Linguistics - The Stanford Natural Language

Applications of the noisy channel

This model can be applied to many different problems!

Channel model                   Language model
speech production               English words
OCR                             English words
typing with spelling errors     English words
translating to English          English words

ê = argmax_e P(x | e) P(e)

(Widely used at Google, for example)
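
As one illustration of the “typing with spelling errors” case, here is a minimal spelling-correction sketch in the noisy channel style. It rests on several assumptions that are not from the slides: the word counts are invented, the channel model is a crude unnormalized score of ERROR_RATE raised to the edit distance, and the language model is a unigram relative frequency.

```python
# Hypothetical unigram counts, invented for illustration.
WORD_COUNTS = {"the": 500, "then": 60, "they": 80, "tea": 5}
TOTAL = sum(WORD_COUNTS.values())
ERROR_RATE = 0.01  # assumed cost factor per edit (not a real estimate)

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(x):
    """argmax over e of channel score P(x | e) times language model P(e)."""
    def score(e):
        channel = ERROR_RATE ** edit_distance(x, e)  # unnormalized P(x | e)
        language = WORD_COUNTS[e] / TOTAL            # unigram P(e)
        return channel * language
    return max(WORD_COUNTS, key=score)

print(correct("thw"))   # a typo one edit away from a frequent word
print(correct("then"))  # an already-correct word is left alone
```

The same argmax structure applies to the other rows of the table; only the channel model changes.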

Page 38: Computational Linguistics - The Stanford Natural Language

If you like NLP / CompLing…

•  learn Java or Python (and play with JavaNLP or NLTK)
•  study logic, probability, statistics, linear algebra
•  get some exposure to linguistics (LING1, …)
•  study AI and machine learning (CS121, CS221, CS229)

•  read Jurafsky & Martin or Manning & Schütze

•  CS124: From Language to Information (Jurafsky)

•  CS224N: Natural Language Processing (Manning)

•  CS224S: Speech Recognition & Synthesis (Jurafsky)
•  CS224U: Natural Language Processing (MacCartney)

Page 39: Computational Linguistics - The Stanford Natural Language

One more for the road

cartoon from qwantz.com