![Page 1: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/1.jpg)
How to prepare data for NLP
Loryfel Nunez
@lorynyc
![Page 2: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/2.jpg)
California Gold Rush
![Page 3: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/3.jpg)
“Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.”
Nikolas Markou (via LinkedIn)
![Page 4: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/4.jpg)
1. Have an intuition on how things work
2. Breaking data down
3. Keep it simple .. if possible
![Page 5: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/5.jpg)
How does it work, anyway?
1
![Page 6: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/6.jpg)
The General NLP Problem
dog: 3, 2, 1
red coat: 0, 0, 1
😋
😭
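The counts on this slide can be read as a small document-term table: one row per term, one column per document. Below is a minimal sketch of how such a table might be built. The three sample documents are invented for illustration and are not from the talk; tokenization is a plain whitespace split.

```python
def count_term(tokens, term):
    """Count occurrences of a one-word or multi-word term in a token list."""
    parts = term.lower().split()
    n = len(parts)
    return sum(tokens[i:i + n] == parts for i in range(len(tokens) - n + 1))


def term_counts(documents, terms):
    """Return {term: [count in doc 1, count in doc 2, ...]}."""
    tokenized = [doc.lower().split() for doc in documents]
    return {term: [count_term(toks, term) for toks in tokenized] for term in terms}


# Hypothetical documents, chosen so the counts match the slide.
documents = [
    "the dog chased the dog and another dog",
    "dog meets dog",
    "a dog in a red coat",
]

counts = term_counts(documents, ["dog", "red coat"])
for term, row in counts.items():
    print(f"{term}: {', '.join(map(str, row))}")
# dog: 3, 2, 1
# red coat: 0, 0, 1
```

Whatever happens before the split (casing, punctuation, entity handling) changes these numbers, which is why the input is worth controlling.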
![Page 7: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/7.jpg)
Controlling the input
Document Unit
Representation of text
![Page 8: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/8.jpg)
Inside the Machine
Smith acquires shares of Novak and Kline for $10.99 per share .
![Page 9: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/9.jpg)
BREAK IT DOWN
2
![Page 10: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/10.jpg)
Let’s Break it Down
á
Novák
Novák and Kline
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith acquires shares of Novak and Kline for $10.99 per share.
Smith Inc. acquires shares of Novak and Kline for $10.99 per share.
Smith acquires common shares of N & K for $10.99/share.
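A rough sketch of the same breakdown in code: text to sentences, sentences to tokens, with accented characters folded to their base letters. The abbreviation list, the token regex, and the function names are assumptions made for this example; a real project would more likely rely on a library tokenizer.

```python
import re
import unicodedata

# Hypothetical abbreviation list: a trailing period here does not end a sentence.
ABBREVIATIONS = {"inc.", "ltd.", "co."}


def fold_accents(text):
    """Map accented characters to their base letter, e.g. 'Novák' -> 'Novak'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_sentences(text):
    """Naive splitter: end a sentence at . ! ? unless the word is an abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith((".", "!", "?")) and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Keep amounts such as $10.99 whole; split words and punctuation apart."""
    return re.findall(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]", sentence)


text = "Smith Inc. acquires shares of Novák and Kline for $10.99 per share. The deal closed."
sentences = split_sentences(fold_accents(text))
tokens = tokenize(sentences[0])
```

Note how "Inc." and "$10.99" both contain a period that is not a sentence boundary; this is where simple splitting rules tend to go wrong.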
![Page 11: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/11.jpg)
In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
![Page 12: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/12.jpg)
… if possible
2
![Page 13: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/13.jpg)
Character
á
&
Do you know the encoding of your input data?
◉ User tells you
◉ Metadata
◉ Figure it out (using chardet, or similar)
◉ Have your own heuristics
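One possible "own heuristic", sketched with the standard library only: try the encoding the user or metadata declared, then a short list of candidates, and fall back to latin-1, which never fails. The function name and the candidate list are assumptions for this example; a detector such as chardet could supply the first guess instead.

```python
from typing import Optional, Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    declared: Optional[str] = None,
    candidates: Sequence[str] = ("utf-8", "cp1252"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for a byte string of unknown encoding."""
    order = ([declared] if declared else []) + list(candidates)
    for encoding in order:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # latin-1 maps every byte to a character, so this cannot raise,
    # but the result may be mojibake and should be flagged downstream.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback("Novák".encode("cp1252"))
print(text, used)
```

Order matters here: utf-8 is tried first because it rejects most byte sequences that were written in another encoding, while cp1252 and latin-1 accept almost anything.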
![Page 14: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/14.jpg)
Tokens
Forty-two, 42
Post-colonial, postcolonial
eBay, Ebay, EBAY, ebay
Fed, FED, fed
C.A.T., CAT
Heuristics
Mappings
Transformations
numToWord, POS (from spaCy or NLTK)
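A small sketch of the heuristics-plus-mappings idea applied to the variants listed on this slide. The mapping table and the function name are hypothetical. Tokens are deliberately not lowercased across the board, since "Fed" and "fed" may mean different things.

```python
import re

# Hypothetical mapping from a lowercased variant to its canonical form.
CANONICAL = {
    "ebay": "eBay",
    "forty-two": "42",
    "post-colonial": "postcolonial",
}

# Heuristic: two or more single letters each followed by a period, e.g. C.A.T.
DOTTED_ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")


def normalize_token(token):
    """Apply the acronym heuristic, then the mapping; otherwise keep the token."""
    if DOTTED_ACRONYM.fullmatch(token):
        token = token.replace(".", "").upper()
    return CANONICAL.get(token.lower(), token)


for raw in ["Forty-two", "Post-colonial", "EBAY", "C.A.T.", "Fed", "fed"]:
    print(raw, "=>", normalize_token(raw))
```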
![Page 15: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/15.jpg)
Tokens
STEMMING vs LEMMATIZATION
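import spacy
from nltk.stem.porter import PorterStemmer

# 'en' is the model shortcut from older spaCy releases; newer releases
# load a named model instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')
stemmer = PorterStemmer()
doc = nlp(u'She is an intelligence operative.')
for word in doc:
    stemmed = stemmer.stem(word.text)
    print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)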
She LEMMA => -PRON- STEM => she
is LEMMA => be STEM => is
an LEMMA => an STEM => an
intelligence LEMMA => intelligence STEM => intellig
operative LEMMA => operative STEM => oper
. LEMMA => . STEM => .
SpaCy, NLTK
![Page 16: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/16.jpg)
Entities
Novak and Kline, NK, NYSE:NK, Test Company
June 30, 2017
06/30/2017
30/6/2017
Smith acquires shares of Novak and Kline for $10.99 per share .
Smith acquires shares of NK for $10.99 per share .
ORG acquires shares of ORG for $10.99 per share .
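A sketch of both normalizations from this slide: replacing known aliases with an entity type, and mapping the three date spellings to one ISO form. The alias table and the format list are assumptions for this example; in practice the aliases would probably come from an entity recognizer or a curated lookup rather than a hard-coded dict.

```python
import re
from datetime import datetime

# Hypothetical alias table: every known surface form maps to an entity type.
ALIASES = {"Novak and Kline": "ORG", "NYSE:NK": "ORG", "NK": "ORG", "Smith": "ORG"}

# Longest aliases first, so "NYSE:NK" is tried before the shorter "NK".
_ALIAS_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b"
)


def replace_entities(text):
    """Replace each known alias with its entity type."""
    return _ALIAS_RE.sub(lambda m: ALIASES[m.group(0)], text)


# Month-first is tried before day-first, so an ambiguous date such as
# 05/06/2017 is read as May 6. That ordering is a choice, not a fact.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(replace_entities("Smith acquires shares of NK for $10.99 per share ."))
print([normalize_date(d) for d in ["June 30, 2017", "06/30/2017", "30/6/2017"]])
```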
![Page 17: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/17.jpg)
Hot or Not

| | REMOVING | HIGHLIGHTING |
| --- | --- | --- |
| WORDS | Emails, dates, URLs, stop words | hotwords |
| More than WORDS | tables | Hot patterns |

textacy
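A minimal sketch of the remove/highlight split using only regular expressions. The patterns, the stop-word list, and the hotword list are simplified assumptions for this example; textacy and similar libraries offer more thorough versions of the removal steps.

```python
import re

# Simplified patterns; real-world emails and URLs are messier than this.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Hypothetical word lists.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "to", "is"}
HOTWORDS = {"acquires", "acquired", "acquisition"}


def clean_and_flag(text):
    """Remove URLs, emails and stop words; return (cleaned text, hotwords found)."""
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    hits = [t for t in tokens if t.lower().strip(".,;:") in HOTWORDS]
    return " ".join(tokens), hits


cleaned, hits = clean_and_flag(
    "Email anna@example.com or visit https://example.com : Smith acquires shares of Novak"
)
print(cleaned)
print(hits)
```

Removal shrinks what the model sees; highlighting keeps the text intact and records the hit separately, so it can later become a feature.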
![Page 18: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/18.jpg)
In the real world
<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>
![Page 19: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/19.jpg)
IRL
{"title": "Smith Buys …",
 "original_text": "LONDON --- Smith..",
 "transformed_text": {
   "text_with_entities": "LOCATION – ORG acquired …. ",
   "lemmatized": "Smith Inc acquire share..",
   "has_acquired": true
 },
 "table": "<table>….. </table>"
}
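A sketch of how the messy HTML from the earlier slide might be turned into a record of this shape, using only the standard-library HTML parser. The class and function names are hypothetical, the table is stored as a flat list of cell texts rather than raw markup, and only the `has_acquired` flag is computed; the entity and lemma fields would come from an NLP library.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak & Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>"""


class ArticleParser(HTMLParser):
    """Sort text into title (bold), table cells, and body text."""

    def __init__(self):
        super().__init__()
        self.title_parts, self.text_parts, self.table_cells = [], [], []
        self._in_bold = False
        self._in_table = False

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True
        elif tag == "table":
            self._in_table = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False
        elif tag == "table":
            self._in_table = False

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse line breaks and extra spaces
        if not data:
            return
        if self._in_table:
            self.table_cells.append(data)
        elif self._in_bold:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def build_record(html):
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "title": " ".join(parser.title_parts),
        "original_text": text,
        "transformed_text": {"has_acquired": "acquire" in text.lower()},
        "table": parser.table_cells,
    }


record = build_record(SAMPLE_HTML)
```

Keeping `original_text` next to the transformed versions means any later step can be re-run or audited without going back to the source file.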
![Page 20: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/20.jpg)
The General NLP Problem
dog: 3, 2, 1
red coat: 0, 0, 1
😋
😭
![Page 21: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/21.jpg)
1. Have an intuition on how things work
-- how algorithms see text
2. Breaking data down
-- from bytes to documents
3. Keep it simple .. if possible
-- patterns, normalization, metadata, actions (replace, remove, highlight)
![Page 22: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx](https://reader031.vdocuments.us/reader031/viewer/2022030318/5a64aad67f8b9a27568b8909/html5/thumbnails/22.jpg)
References and other stuff
◉ Stanford NLP Group
◉ spaCy Documentation
◉ scikit-learn Documentation
◉ The hard knocks of NLP projects