nlp_compromise


Upload: torontonodejs

Post on 14-Apr-2017


TRANSCRIPT

Page 1: nlp_compromise

The pope is catholic.

language as an interface

language as data

introduction

Page 2: nlp_compromise

(@spencermountain)

introduction

Page 3: nlp_compromise

problem

Page 4: nlp_compromise

problem

Page 5: nlp_compromise

problem

Page 6: nlp_compromise

4-gram: london in the rain

3-gram: london in the, in the rain

2-gram: london in, in the, the rain

1-gram: london, in, the, rain

4 words: 10 requests per keystroke

5 words: 15

6 words: 21

7 words: 28

8 words: 36

problem
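The request counts on this slide follow from enumerating every contiguous n-gram of the typed phrase: n words give n(n+1)/2 candidates. A minimal sketch of that enumeration, assuming whitespace tokenization; the function names are illustrative and not part of nlp_compromise:

```javascript
// Enumerate every contiguous n-gram of a phrase, longest first.
function allNgrams(text) {
  const words = text.trim().split(/\s+/);
  const grams = [];
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size).join(" "));
    }
  }
  return grams;
}

// A phrase of n words has n * (n + 1) / 2 contiguous n-grams.
function ngramCount(n) {
  return (n * (n + 1)) / 2;
}

const grams = allNgrams("london in the rain");
console.log(grams.length); // 10
console.log([4, 5, 6, 7, 8].map((n) => ngramCount(n))); // [ 10, 15, 21, 28, 36 ]
```

If each candidate triggers one lookup, the cost per keystroke grows quadratically with phrase length, which is the problem the slide points at.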

Page 7: nlp_compromise

All n-grams: london in the rain, london in the, in the rain, london in, in the, the rain, london, in, the, rain

#1 Stopwords blacklist: in, the, in the

#2 Edge gram filter: london in the, in the rain, london in, the rain

#3 Redundancy check: “rain”, “london” → “london in the rain”

problem
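The three filters named on this slide (stopwords blacklist, edge gram filter, redundancy check) could be chained roughly as follows. This is a sketch under assumptions: the slide does not define the filters precisely, so the semantics below (drop all-stopword grams, drop grams that start or end with a stopword, drop grams contained in a longer survivor) and the two-word stopword list are guesses made for illustration.

```javascript
// Assumed toy blacklist; a real one would be much longer.
const STOPWORDS = new Set(["in", "the"]);
const isStop = (word) => STOPWORDS.has(word.toLowerCase());

// Every contiguous n-gram of the phrase, longest first.
function allNgrams(text) {
  const words = text.trim().split(/\s+/);
  const grams = [];
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size).join(" "));
    }
  }
  return grams;
}

// #1 Stopwords blacklist: drop grams made up only of stopwords.
function dropStopwordGrams(grams) {
  return grams.filter((g) => !g.split(" ").every(isStop));
}

// #2 Edge gram filter: drop grams that start or end with a stopword.
function dropEdgeGrams(grams) {
  return grams.filter((g) => {
    const words = g.split(" ");
    return !isStop(words[0]) && !isStop(words[words.length - 1]);
  });
}

// #3 Redundancy check: drop grams wholly contained in a longer survivor.
function dropRedundant(grams) {
  return grams.filter(
    (g) =>
      !grams.some(
        (other) =>
          other.length > g.length && ` ${other} `.includes(` ${g} `)
      )
  );
}

function candidates(text) {
  return dropRedundant(dropEdgeGrams(dropStopwordGrams(allNgrams(text))));
}

console.log(candidates("london in the rain")); // [ 'london in the rain' ]
```

Under these assumptions the ten candidates collapse to a single lookup, at the cost of never matching "london" on its own.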

Page 8: nlp_compromise

● NLTK - excellent, huge, Python

● Stanford parser - excellent, huge, Java

● Freeling - excellent, huge, C++

● Illinois tagger - excellent, huge, Java

Or an offsite API? Alchemy, TextRazor, OpenCalais, Embedly, Zemanta

When all you’ve got is a jackhammer..

niche

Page 9: nlp_compromise

Can it be hacked?

tldr: yes.

niche

Page 10: nlp_compromise

How big is a language?

Shakespeare - 35,000

niche

Page 11: nlp_compromise

Zipf’s law

The top 10 words account for 25% of language.

The top 100 words account for 50% of language.

The top 50,000 words account for 95% of language.

niche
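Coverage figures like the ones on this slide can be measured on any tokenized corpus by counting word frequencies and summing the top k. A small sketch; the sample sentence is made up for illustration and says nothing about real English frequencies:

```javascript
// Fraction of all tokens accounted for by the k most frequent words.
function coverage(tokens, k) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  // Sort the frequencies from most to least common.
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const covered = sorted.slice(0, k).reduce((sum, c) => sum + c, 0);
  return covered / tokens.length;
}

// Toy corpus: 11 tokens, "the" appears 4 times and "and" twice.
const tokens = "the cat and the dog and the bird saw the fish".split(" ");
console.log(coverage(tokens, 1)); // 4 / 11
console.log(coverage(tokens, 2)); // 6 / 11
```

The steep head of this curve is what makes a small lexicon viable: a few thousand entries already cover most of the running text.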

Page 12: nlp_compromise

602 kb uncompressed

50,000 different words

An average person will ever hear

~4 lookups in binary search

niche

Page 13: nlp_compromise

first, let’s kill the nouns

70%

process

180 kb uncompressed

Page 14: nlp_compromise

Noun: Tomato, Tomatoes; Toronto, Torontonian

Verb: Speak, Spoke, Speaking, will speak, have spoken, had spoken, ...

Adjective: nice, nicer, nicest

Adverb: quickly, quicklier, quickliest

improveify your vocabularies

“awesome” → “awesomeify”

“quickly” → “quick”

“tomatoey” → “tomato”

“speak” → “speaker”

“aggressive” → “aggressiveness”

“civil” → “civilize”

Each word: n/2.3

*not handsome

*not truly

*not is

*not economics

niche
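One way to read this slide: keep only a root form in the lexicon and derive the inflected forms with suffix rules. The sketch below assumes naive English spelling rules; the helper names are hypothetical and are not the nlp_compromise API. Words like the "*not" cases on the slide (is, economics, truly, handsome) would need an explicit exceptions list, which is not shown here.

```javascript
// Hypothetical helpers: derive inflected forms from a stored root so the
// lexicon only has to keep one entry per word. Naive spelling rules only.

// nice -> nicer, nicest; quick -> quicker, quickest; happy -> happier, happiest
function adjectiveForms(root) {
  if (root.endsWith("e")) {
    return { comparative: root + "r", superlative: root + "st" };
  }
  if (/[^aeiou]y$/.test(root)) {
    const stem = root.slice(0, -1);
    return { comparative: stem + "ier", superlative: stem + "iest" };
  }
  return { comparative: root + "er", superlative: root + "est" };
}

// tomato -> tomatoes, city -> cities, monkey -> monkeys
function pluralize(noun) {
  if (/(s|x|z|ch|sh|o)$/.test(noun)) return noun + "es";
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + "ies";
  return noun + "s";
}

// quick -> quickly, happy -> happily
function toAdverb(adjective) {
  if (/[^aeiou]y$/.test(adjective)) return adjective.slice(0, -1) + "ily";
  return adjective + "ly";
}

console.log(adjectiveForms("nice")); // { comparative: 'nicer', superlative: 'nicest' }
console.log(pluralize("tomato")); // tomatoes
console.log(toAdverb("quick")); // quickly
```

Each rule replaces several stored entries with one root plus a few lines of code, which is the kind of trade the size figures in this part of the talk suggest.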

Page 15: nlp_compromise

then, let’s conjugate our verbs

process

110 kb uncompressed

Page 16: nlp_compromise

process

jQuery 256kb

d3js 330kb

react 653kb

lodash 503kb

the whole english language 110kb

110 kb uncompressed

Page 17: nlp_compromise

Ok, let’s roll our own POS tagger..

(what could go rong?)

process

1) ☑ Lexicon

2) ◻ Suffix regexes

3) ◻ Sentence markov

Page 18: nlp_compromise

Suffix rules

process

1) ☑ Lexicon

2) ☑ Suffix regexes

3) ◻ Sentence markov
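The first two steps of the checklist, a lexicon lookup followed by suffix regexes, might look like the sketch below, with a fall back to Noun when nothing matches. The lexicon entries, suffix patterns and tag names are illustrative assumptions, not the rules nlp_compromise actually ships.

```javascript
// Assumed toy lexicon: a handful of very common closed-class words.
const LEXICON = {
  the: "Det",
  a: "Det",
  she: "Pronoun",
  we: "Pronoun",
  could: "Verb",
  is: "Verb",
};

// Assumed suffix rules, tried in order; the first match wins.
const SUFFIX_RULES = [
  [/ly$/, "Adverb"],
  [/(ing|ed|ize|ify)$/, "Verb"],
  [/(ous|ful|ive|able|ic)$/, "Adjective"],
  [/(tion|ness|ment|ity)$/, "Noun"],
];

function tagWord(word) {
  const w = word.toLowerCase();
  // 1) Lexicon lookup
  if (Object.prototype.hasOwnProperty.call(LEXICON, w)) return LEXICON[w];
  // 2) Suffix regexes
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(w)) return tag;
  }
  // Fallback: unknown words are treated as nouns.
  return "Noun";
}

console.log(["the", "quickly", "civilize", "aggressive", "banana"].map(tagWord));
// [ 'Det', 'Adverb', 'Verb', 'Adjective', 'Noun' ]
```

Rule order matters: a word is tagged by the first pattern it matches, so broader suffixes belong lower in the list.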

Page 19: nlp_compromise

Grammar rules - markov

She could walk the walk .

before: Verb - Det - Verb

after: Verb - Det - Noun

process

1) ☑ Lexicon

2) ☑ Suffix regexes

3) ☑ Sentence markov
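The before/after tags for "She could walk the walk" can be reproduced with a single hand-written transition rule: a Verb directly after a Det is retagged as a Noun. Note that this sketch hard-codes one rule instead of learning transition probabilities, so it is not a Markov model in the strict sense; the function name and rule are assumptions made for illustration.

```javascript
// Hypothetical correction pass over a first-guess tag sequence.
// Rule: a determiner is not followed by a verb, so retag that verb as a noun.
function applyGrammarRules(tags) {
  const out = tags.slice(); // do not mutate the caller's array
  for (let i = 1; i < out.length; i++) {
    if (out[i - 1] === "Det" && out[i] === "Verb") {
      out[i] = "Noun";
    }
  }
  return out;
}

const words = ["She", "could", "walk", "the", "walk"];
const before = ["Pronoun", "Verb", "Verb", "Det", "Verb"];
const after = applyGrammarRules(before);

console.log(after.slice(2).join(" - ")); // Verb - Det - Noun
```

A fuller tagger would carry many such rules, each correcting a tag from its neighbours after the lexicon and suffix passes have made their first guess.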

Page 20: nlp_compromise

“Unreasonable effectiveness” of rule-based taggers:

● a 1,000 word lexicon - 45% precision

● fallback to [Noun] - 70% precision

● a little regex - 74% precision

● a little grammar in it - 81% precision

process

Page 21: nlp_compromise

t.text("keep on rocking in the free world")

t.negate()

// "don't keep on rocking in the free world."

outcome

Page 22: nlp_compromise

t.text("it is a cool library")

t.toValleyGirl()

// "so, it is like, a cool library."

outcome

Page 23: nlp_compromise

We gave the monkeys the bananas,

..because they were ripe.

..because they were hungry.

outcome

Page 24: nlp_compromise

list of letters: We gave the monkeys the bananas

POS-tagging: [Pr] [Verb] [Dt] [Noun] [Dt] [Noun]

Dependency parser: We give [Noun] [Noun]

Knowledge engine: [act / transfer / voluntary] [genus / monkey] [plant / banana]

outcome

Page 25: nlp_compromise

#TODOFML

● Mutable/Immutable API

● Speed, performance testing

● Romance-language verb conjugations

● ‘bl.ocks.org’ of demos and docs

outcome

Page 26: nlp_compromise

npm install --wooyeah

@spencermountain

Slack group, mailing list, github, Toronto/coffee