Aspects of NLP Practice
Some notes on applying NLP research in an industrial environment
![Page 1: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/1.jpg)
Practical Aspects of NLP Work
Vsevolod Dyomkin, Grammarly
TAAC'2012, Kyiv, Ukraine
![Page 2: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/2.jpg)
Topics
* Practical vs Theoretical NLP work
* Working with Data for NLP
* NLP Tools
![Page 3: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/3.jpg)
A bit about Grammarly
(c) xkcd
![Page 4: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/4.jpg)
An example of what we deal with
![Page 5: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/5.jpg)
Research vs Development
“Trick for productionizing research: read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that.”
-- Jay Kreps, https://twitter.com/jaykreps/status/219977241839411200
![Page 6: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/6.jpg)
NLP practice
R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy
D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics
![Page 7: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/7.jpg)
Research
1. Set a goal
Business goal:
* Develop the best / a good-enough / a better-than-Word / etc. spellchecker
* Develop a set of grammar rules that will catch errors according to MLA Style
* Develop a thesaurus that will produce synonyms relevant to context
![Page 8: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/8.jpg)
Translate it to a measurable goal
* On a test corpus of 10000 sentences with common errors, achieve a smaller number of FNs (and FPs) than other spellcheckers (Word's spellchecker, etc.)
* On a corpus of example sentences with each kind of error (and similar sentences without that kind of error), find all sentences with errors and flag no errors in correct sentences
* On a test corpus of 1000 sentences, suggest synonyms for all meaningful words that human linguists will consider relevant in 90% of the cases
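To make the first goal concrete, here is a minimal sketch of counting false negatives and false positives for a spellchecker on a labeled test corpus. Everything in it is an illustrative assumption rather than anything from the talk: the corpus format (token lists paired with the set of misspelled positions), the `evaluate` helper, and the toy `toy_checker` with its tiny `VOCAB`.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # real errors that were flagged
    fp: int = 0  # correct words that were flagged
    fn: int = 0  # real errors that were missed

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0


def evaluate(checker, corpus):
    """Compare flagged token positions with gold error positions.

    corpus:  iterable of (tokens, gold) pairs, where gold is the set of
             indices of misspelled tokens (assumed annotation format).
    checker: callable taking a token list and returning flagged indices.
    """
    counts = Counts()
    for tokens, gold in corpus:
        flagged = checker(tokens)
        counts.tp += len(flagged & gold)
        counts.fp += len(flagged - gold)
        counts.fn += len(gold - flagged)
    return counts


# Hypothetical toy checker: flags every token missing from a tiny word list.
VOCAB = {"the", "cat", "sat", "on", "a", "mat"}


def toy_checker(tokens):
    return {i for i, t in enumerate(tokens) if t.lower() not in VOCAB}


corpus = [
    (["the", "cat", "satt", "on", "the", "mat"], {2}),
    (["a", "cat", "sat", "on", "Kyiv"], set()),  # proper noun, not an error
    (["teh", "cat", "sat"], {0}),
]
result = evaluate(toy_checker, corpus)
print(result, result.precision, result.recall)
```

Running the same `evaluate` over a competing checker on the same corpus gives directly comparable FN and FP counts, which is what the measurable goal asks for.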
![Page 9: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/9.jpg)
Research
1. Set a goal
2. Devise an algorithm
3. Train & improve the algorithm
![Page 10: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/10.jpg)
http://nlp-class.org
![Page 11: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/11.jpg)
4. Test its performance
ML: one corpus, divided into training, development, test
![Page 12: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/12.jpg)
Often different corpora:
* for training some part of the algorithm
* for testing the whole system
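The single-corpus ML setup can be sketched as a shuffled three-way split. This is only an illustration: the `split_corpus` helper, the 80/10/10 proportions, and the fixed seed are assumptions, not figures from the talk.

```python
import random


def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


# Placeholder corpus; in practice these would be real sentences.
corpus = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))
```

The development part is what you tune on while improving the algorithm; the test part is touched only for the final measurement.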
![Page 13: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/13.jpg)
Theoretical maxima
Theoretical maxima are rarely achievable. Why?
![Page 14: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/14.jpg)
* because you need their data
![Page 15: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/15.jpg)
* domains might differ
![Page 16: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/16.jpg)
Pre/post-processing
What ultimately matters is not crude performance, but...
![Page 17: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/17.jpg)
Acceptance by users (much harder to measure & depends on domain).
![Page 18: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/18.jpg)
The real world is messier than any lab set-up.
![Page 19: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/19.jpg)
Examples of pre-processing
For spellcheck:
* some people write words separated by slashes, like: spell/grammar check
* handling of abbreviations
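A naive sketch of the two pre-processing cases above, run before words reach the spellchecker. The `preprocess` function and the short `ABBREVIATIONS` set are hypothetical. Real input would need more care, since URLs, dates and fractions also contain slashes.

```python
# Assumed, illustrative abbreviation list; a real one would be much larger.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr."}


def preprocess(text):
    """Return the words that should be passed on to the spellchecker."""
    words = []
    for token in text.split():
        if token.lower() in ABBREVIATIONS:
            continue  # known abbreviation: do not spellcheck it
        # Split "spell/grammar" into "spell" and "grammar".
        for part in token.split("/"):
            part = part.strip(".,;:!?()\"'")  # drop surrounding punctuation
            if part:
                words.append(part)
    return words


words = preprocess("Run a spell/grammar check, e.g. on Dr. Smith's essay.")
print(words)
```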
![Page 20: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/20.jpg)
Data
“Data is the next Intel Inside.”
-- Tim O'Reilly, What is Web2.0, http://oreilly.com/web2/archive/what-is-web-20.html?page=3
![Page 21: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/21.jpg)
Categorization of Data
* Structured: small
* Semi-structured: medium
* Unstructured: big
![Page 22: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/22.jpg)
Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* BNC
* Web1T Google N-gram Corpus
* Linguistic Data Consortium (http://www.ldc.upenn.edu/)
![Page 23: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/23.jpg)
More data
Also well-known sources, but with a twist:
* Wikipedia & Wiktionary, DBpedia
* OpenWeb Common Crawl
* Public APIs of some services: Twitter, Wordnik
![Page 24: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/24.jpg)
Academic resources
* Stanford
* CoNLL
* Oxford (http://www.ota.ox.ac.uk/)
* CMU, MIT,...
* LingPipe, OpenNLP, NLTK,...
![Page 25: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/25.jpg)
Crowd-sourced data
Jonathan Zittrain, The Future of the Internet
http://goo.gl/hs4qB
![Page 26: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/26.jpg)
And remember...
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, The Unreasonable Effectiveness of Data, http://youtu.be/yvDCzhbjYWs
![Page 27: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/27.jpg)
Tools
![Page 28: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/28.jpg)
Levels of NLP tools
High-level: user services
Middle-level: NLP algorithms
Low-level: data-crunching
![Page 29: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/29.jpg)
Choosing a language
Requirement types:
* Research
* NLP-specific
* Production
![Page 30: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/30.jpg)
Research requirements
* Interactivity
* Mathematical basis
* Expressiveness
* Agility, malleability
* Advanced tools
![Page 31: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/31.jpg)
Specific NLP requirements
* Good support for statistics & number-crunching – Statistical AI
* Good support for working with trees & symbols – Symbolic AI
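As a small illustration of both requirements, the sketch below holds a parse tree as nested tuples (the symbolic side) and counts word frequencies (the statistical side). The tuple encoding, the Penn-style labels, and the `leaves` and `subtrees` helpers are assumptions made for this example, not a representation the talk prescribes.

```python
from collections import Counter

# A parse tree as nested tuples: (label, child, child, ...); leaves are strings.
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "sat"),
               ("PP", ("IN", "on"),
                      ("NP", ("DT", "the"), ("NN", "mat")))))


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [word for child in children for word in leaves(child)]


def subtrees(tree, label):
    """Yield every subtree whose root carries the given label (pre-order)."""
    if isinstance(tree, str):
        return
    if tree[0] == label:
        yield tree
    for child in tree[1:]:
        yield from subtrees(child, label)


# Symbolic side: pull out all noun phrases.
noun_phrases = [" ".join(leaves(np)) for np in subtrees(TREE, "NP")]

# Statistical side: word frequencies over the same sentence.
frequencies = Counter(leaves(TREE))
print(noun_phrases, frequencies.most_common(1))
```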
![Page 32: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/32.jpg)
Production requirements
* Scalability
* Maintainability
* Integrability
* ...
![Page 33: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/33.jpg)
Choose Lisp
(c) xkcd
![Page 34: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/34.jpg)
Lisp FTW
* Truly interactive environment
* Very flexible => DSLs
* Native tree support
* Fast and solid
- No OpenNLP/NLTK
![Page 35: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/35.jpg)
Heterogeneous systems
“Java way” vs. “Unix way”
Create language-agnostic systems that can easily communicate!
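One possible reading of the "Unix way" is a component that exchanges plain text records over streams, so the neighbouring stages can be written in any language. The sketch below assumes JSON Lines as the wire format and a hypothetical `tokenize_stream` stage. Neither is named in the talk.

```python
import io
import json


def tokenize_stream(infile, outfile):
    """Read one JSON object per line, add a "tokens" field, write it back."""
    for line in infile:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        record["tokens"] = record["text"].split()
        outfile.write(json.dumps(record) + "\n")


# In a real pipeline the arguments would be sys.stdin and sys.stdout;
# in-memory streams are used here so the example runs on its own.
source = io.StringIO('{"id": 1, "text": "Data is key"}\n')
sink = io.StringIO()
tokenize_stream(source, sink)
print(sink.getvalue(), end="")
```

Because the contract is only "one JSON object per line", the stage before or after this one could equally be a Lisp or Java process connected with a pipe or a socket.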
![Page 36: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/36.jpg)
Take-aways
* As they say, in theory research and practice are the same, but in practice...
* Data is key. There are 3 types of it. Collect it, build tools to work with it easily and efficiently
* Choose a good language for R&D: interactive & malleable, with as few barriers as possible
![Page 37: Aspects of NLP Practice](https://reader034.vdocuments.us/reader034/viewer/2022042708/55857260d8b42a422c8b4c2e/html5/thumbnails/37.jpg)
Thanks!
Vsevolod Dyomkin, @vseloved