Practical NLP with Lisp
Vsevolod Dyomkin
Grammarly
Topics
* Overview of NLP practice
* Getting Data
* Using Lisp: pros & cons
* A couple of examples
A bit about Grammarly
(c) xkcd
An example of what we deal with
NLP practice
R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy
D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics
Research
1. Set a goal
Business goal:
* Develop the best / a good enough / a better-than-Word / etc. spellchecker
* Develop a set of grammar rules that will catch errors according to the MLA style
* Develop a thesaurus that will produce synonyms relevant to the context
Translate it into a measurable goal
* On a test corpus of 10000 sentences with common errors, achieve a smaller number of FNs (and FPs) than other spellcheckers / the Word spellchecker / etc.
* On a corpus of examples of sentences with each kind of error (and similar sentences without this kind of error) find all sentences with errors and do not find errors in correct sentences
* On a test corpus of 1000 sentences suggest synonyms for all meaningful words that will be considered relevant by human linguists in 90% of the cases
A Note on Terminology
FN and FP instead of precision (P) and recall (R)
FN = 1 - R
FP = 1 - P (or ???)
F1 = 2*P*R/(P+R) = 2*(1 - FN - FP + FN*FP)/(2 - (FN + FP))
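The relation between these quantities can be checked with a few lines of code. Below is a minimal sketch in Python (picked only for illustration), assuming FN means the share of real errors that were missed (1 - R) and FP the share of reported errors that were wrong (1 - P). The function names and the counts are made up for the example, and F1 is taken in its standard form, the harmonic mean 2PR/(P+R).

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts of true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_from_pr(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_rates(fn_rate, fp_rate):
    """The same F1, rewritten with FN = 1 - R and FP = 1 - P."""
    denominator = 2 - (fn_rate + fp_rate)
    if denominator == 0:
        return 0.0
    return 2 * (1 - fn_rate - fp_rate + fn_rate * fp_rate) / denominator


# Hypothetical spellchecker run: 80 real errors caught,
# 20 correct words flagged by mistake, 40 real errors missed.
p, r = precision_recall(tp=80, fp=20, fn=40)
f1 = f1_from_pr(p, r)
f1_again = f1_from_rates(fn_rate=1 - r, fp_rate=1 - p)
```

Both routes give the same score, so reporting FN and FP loses nothing compared to reporting P and R.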
Research contd.
2. Devise an algorithm
3. Train & improve the algorithm
http://nlp-class.org
4. Test its performance
ML: one corpus, divided into training, development, test
Often different corpora:
* for training some part (not whole) of the algorithm
* for testing the whole system
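The single-corpus case can be sketched as follows. This is an illustrative Python snippet, not a prescribed procedure: the function name `split_corpus`, the 80/10/10 shares and the fixed seed are assumptions made for the example.

```python
import random


def split_corpus(sentences, dev_share=0.1, test_share=0.1, seed=42):
    """Shuffle a corpus and cut it into training, development and test parts.

    The shares and the seed are illustrative defaults, not recommendations.
    """
    shuffled = list(sentences)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_share)
    n_dev = int(len(shuffled) * dev_share)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Stand-in corpus of 1000 placeholder sentences.
corpus = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_corpus(corpus)
```

The development part is the one to tune against; the test part should be touched only for the final measurement.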
Theoretical maxima
Theoretical maxima are rarely achievable. Why?
* Because you need their data. (And data is key)
* Domains might differ
Pre/post-processing
What ultimately matters is not crude performance, but...
Acceptance by users (much harder to measure & depends on domain).
The real world is messier than any lab set-up.
Examples of pre-processing
For spellcheck:
* some people write words separated by slashes, like: spell/grammar check
* handling of abbreviations
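One possible treatment of both cases, as a rough Python sketch: slash-joined alternatives are split into separate words before spellchecking, while a small list of slash abbreviations is kept intact. The function name and the abbreviation list are hypothetical, and tokenization is plain whitespace splitting, so punctuation, URLs and dates are deliberately left out of scope.

```python
# Hypothetical list of slash abbreviations that must stay in one piece.
SLASH_ABBREVIATIONS = {"w/o", "n/a", "c/o", "a/c"}


def split_slashed_words(text):
    """Split slash-joined alternatives into separate words,
    leaving known abbreviations untouched."""
    words = []
    for token in text.split():
        if "/" in token and token.lower() not in SLASH_ABBREVIATIONS:
            # "spell/grammar" becomes two words to check separately.
            words.extend(part for part in token.split("/") if part)
        else:
            words.append(token)
    return words


split_result = split_slashed_words("spell/grammar check")
kept_result = split_slashed_words("works w/o a dictionary")
```

Here `split_result` holds three separate words, and `kept_result` still contains `w/o` as a single token.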
Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* Web1T Google N-gram Corpus
* Linguistic Data Consortium (http://www.ldc.upenn.edu/)
More data
Also well-known sources, but with a twist:
* Wikipedia & Wiktionary, DBpedia
* OpenWeb Common Crawl (updated: 2010)
* Public APIs of some services: Twitter, Wordnik
Obscure corpora
Academic resources:
* Stanford
* CoNLL
* Oxford (http://www.ota.ox.ac.uk/)
* CMU, MIT,...
* LingPipe, OpenNLP, NLTK,...
Human-powered?
http://goo.gl/hs4qB
Beyond corpora?
* Bootstrapping
* Seeding
And remember...
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, “The Unreasonable Effectiveness of Data.”
http://youtu.be/yvDCzhbjYWs
Using Lisp for NLP
(c) xkcd
Why Lisp?
Lisp is a carefully crafted tool for:
* Engineers
* Practical researchers
* Artists
* Entrepreneurs
Some examples
* Piano.aero
* ITA Software
* Secure Outcomes
* Impromptu
* Land of Lisp
http://youtu.be/HM1Zb3xmvMc
Research requirements
* Interactivity
* Mathematical basis
* Expressiveness
* Agility / Malleability
* Advanced tools
Specific NLP requirements
* Good support for statistics & number-crunching (matrices) – Statistical AI
* Good support for working with trees & symbols – Symbolic AI
Production requirements
* Scalability
* Maintainability
* Integrability
* ...
...eventually
* Speed
* Speed
* Speed
Heterogeneous systems
You have to split the system and communicate:
“Java” way vs. “Unix” way
* Sockets, Redis, ZeroMQ, etc for communication
* JSON, SEXPs, etc for data
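To make the "Unix" way concrete, here is a small Python sketch of one component handing a parse tree to another as text over a socket. Everything in it is an assumption for illustration: the tree, the helper names `to_sexp` and `send_over_socket`, and the newline-terminated framing. A local socket pair stands in for a real network connection, Redis or ZeroMQ.

```python
import json
import socket


def to_sexp(node):
    """Render a nested list such as ["NP", ["DT", "the"]] as an S-expression.

    Strings are emitted bare, so this sketch assumes they contain
    no spaces, quotes or parentheses.
    """
    if isinstance(node, list):
        return "(" + " ".join(to_sexp(child) for child in node) + ")"
    return str(node)


def send_over_socket(payload):
    """Send one newline-terminated message through a local socket pair
    and return what the receiving side reads."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload.encode("utf-8") + b"\n")
        with receiver.makefile("r", encoding="utf-8") as stream:
            return stream.readline().rstrip("\n")
    finally:
        sender.close()
        receiver.close()


# A made-up parse tree for "the cat sleeps".
tree = ["S", ["NP", ["DT", "the"], ["NN", "cat"]], ["VP", ["VBZ", "sleeps"]]]

as_json = json.dumps(tree)
as_sexp = to_sexp(tree)
received_tree = json.loads(send_over_socket(as_json))
```

The S-expression form can be read directly by the Lisp side, while JSON suits components written in most other languages.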
Lisp drawbacks
There's no OpenNLP or SciPy & generally there are fewer libraries.
But...
* github: eslick/cl-langutils
* github: mathematical-systems/clml
* github: tpapp/lla
* github: blindglobe/common-lisp-stat
* … and http://quicklisp.org
But #2
Porter stemmer: http://tartarus.org/~martin/PorterStemmer & http://www.cliki.net/PorterStemmer
or Soundex: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/0.html
are irrelevant with good data
More drawbacks
Lisp is a fringe language
Not a specialized language (like R, J or Octave)
Example #1
API interaction
Example #2
Lisp FTW
* truly interactive environment
* very flexible => DSLs
* native tree support
* fast and solid
Take-aways
* Take nlp-class
* Data is key: collect it and build tools to work with it easily and efficiently
* A good language for R&D should be first of all interactive & malleable, with as few barriers as possible
* ... it also helps if you don't need to port your code for production
* Lisp is one of the good examples
Thanks!
Vsevolod Dyomkin
@vseloved