intro’to’textprocessing’ lecture’2’behrang/15383/lecs/2013.10.23.lec2.pdf · 10/26/13 5...
Post on 14-Sep-2019
25 Views
Preview:
TRANSCRIPT
10/26/13
1
15383: txt proc!
Intro to Text Processing Lecture 2
Behrang Mohit
15383: txt proc!
Today: Text
• Different types of text • Corpus • Levels of text processing – Corpus – Document – Sentence – Tokens – Characters and Encoding
2
10/26/13
2
15383: txt proc!
So Much Text!
• Goal: Organize and process 3
15383: txt proc!
Different forms of text
• Structured text – Databases – Knowledge-‐base
• Semi-‐structured – Wikipedia info-‐boxes
– Excel sheets, tables in documents – XML documents
• Unstructured – Plain text: News, Webpages, Emails, SMS, tweets, …
4
10/26/13
3
15383: txt proc!
From text to knowledge
5
15383: txt proc!
Organizing the knowledge
• Text: non-‐structured informaZon
6
Charlie was born in London in 1889 and …. Born in 1889, Chaplin spent his youth in …. Chaplin (1889-1977) was a true genius of cinema …. He was born 6 years before the birth of cinema!
10/26/13
4
15383: txt proc!
Organizing the knowledge
• Text: non-‐structured informaZon
• Knowledge-‐base: structured informaZon
7
Charlie was born in London in 1889 and …. Born in 1889, Chaplin spent his youth in …. Chaplin (1889-1977) was a true genius of cinema …. He was born 6 years before the birth of cinema!
Person Birth date
Charlie Chaplin 1889
15383: txt proc!
Organizing the knowledge
• Text: non-‐structured informaZon
• Knowledge-‐base: structured informaZon
8
Charlie was born in London in 1889 and …. Born in 1889, Chaplin spent his youth in …. Chaplin (1889-1977) was a true genius of cinema …. He was born 6 years before the birth of cinema!
Person Birth date
Charlie Chaplin 1889
10/26/13
5
15383: txt proc!
Need for textual data
• All text processing systems need data: – TranslaZon – QuesZon Answering – SummarizaZon – … – …
9
15383: txt proc!
Corpus/Corpora
• No access to all texts – We use a representaZve subset
• Corpus: A collecZon of documents – Used for linguisZc analysis – Used for building a text processing system • Human labeled data for staZsZcal systems
– Used for benchmarking/tesZng a system • Performance of a search engine
10
10/26/13
6
15383: txt proc!
CharacterisZcs of a corpus
• Wriaen, speech • General (balanced), domain specific (e.g. news)
• Monolingual, Bilingual, Comparable
• Size of corpus – Web as corpora > 2,000 billion words
• Copyright issues and license 11
15383: txt proc!
Annotated Corpus
• LinguisZcs annotaZon: adding some linguisZc knowledge to the plain-‐text
• Need for annotated data – System that separates sentences. – System that finds names – System that answers quesZons – System that translates – Search Engine – ….
12
10/26/13
7
15383: txt proc!
Examples of corpora
• BriZsh NaZonal Corpus – 90% wriaen, 10% spoken – 100M words, BriZsh English – Plain text with some linguisZc annotaZons
• Brown Corpus – English, balanced, wriaen American, 1M words – Annotated with syntax and semanZcs informaZon
13
15383: txt proc!
Brown corpus: Text + syntax
14
10/26/13
9
15383: txt proc!
LinguisZc AnnotaZon
• CreaZng gold-‐standard data – Used for building and training a system
– Used for evaluaZng a system
• Requirements: – Robust annotaZon sokware – Clear guidelines – Quality control
17
15383: txt proc!
Desktop and web-‐based annotaZon
• Text and GUI-‐based desktop annotaZon • New web-‐based annotaZons – Work any-‐Zme, anywhere
– Direct storage – Browser and bandwidth limitaZons
18
10/26/13
10
15383: txt proc!
Web-‐based annotaZon
19
15383: txt proc!
CMUQ’s QALB Project
• QALB: Qatar Arabic Language Bank – AutomaZc correcZon of Arabic text • Spelling, grammar, word-‐choice • Text: NaZves (news comments), Learners (essays), MT
• Target: A corpus of 2 million words • 10 annotators
– Demo: The annotaZon interface
20
10/26/13
11
15383: txt proc!
Quality Control
• Corpus needs to be an accurate language sample
• Quality of annotated corpus: – Human annotator’s agreement high quality corpus
• Framework: – Randomly assign parts of the task to mulZple annotators
– Measure their agreement • Agreement measures
21
15383: txt proc!
New Models: Crowd Sourcing
• Large scale, simple and cheap annotaZons – Quick and mass annotaZon
• Amazon Mechanical Turk – Cost: 1-‐10 cents/tasks – Frameworks to asses the annotators • Before and during the work
– Limits: Quality and the range of experZse
22
10/26/13
12
15383: txt proc!
Document structure
• Plain text • Markups – SGML: Standard Generalized Markup language
– HTML: Hyper text Mark-‐up Language – XML: Extensible Mark-‐up Language – Wikipedia
• Prints
23
15383: txt proc!
Plain Text
24
10/26/13
14
15383: txt proc!
Today: Text
✓ Different types of text ✓ Corpus
Levels of text processing ✓Corpus ✓Document – Sentence – Tokens – Characters and Encoding
27
15383: txt proc!
Sentences
• Group of words which … – Usually split by a full-‐stop, exclamaZon/quesZon mark.
– Many sentences have a verb
– Some languages have a word order: • English: Subject-‐Verb-‐Object (SVO) • Arabic: Verb Subject Object, SVO
28
10/26/13
15
15383: txt proc!
Sentence spliong
• Sentence level processing Sentence boundary detecZon
• Baseline: split at ./!/? • But: – The class was thought by Dr. Ali Eqbal and Prof. J. Patrick Jr.
– P.O.Box, I.R. Pakistan, U.S.A. • Use rules with list of acronyms
29
15383: txt proc!
StaZsZcal Sentence Spliong
• Collect different informaZon (features) from the surrounding words: – Lexical category of the preceding/following words – Case of the preceding/following words – Is the preceding word an abbreviaZon
30
10/26/13
16
15383: txt proc!
Sentence alignment
• Parallel corpus is ideal – Reality: Comparable corpus • Translators reorganize the text, creaZng new sentences.
– Extremely non-‐aligned: • News arZcles in different languages • Wikipedia pages in different languages
31
15383: txt proc!
Today: Text
✓ Different types of text ✓ Corpus
Levels of text processing ✓Corpus ✓Document ✓Sentence
– Tokens – Characters and Encoding
32
10/26/13
17
15383: txt proc!
Words vs. tokens
• Difference mainly in languages where white space does not necessarily mean
• Graphical words: token – What we see on the screen or in the book
Her email is mariem@yahoo.com !This is a half-cooked idea.!
• Underlying word – Deeper linguis/c
• Half, cooked
33
15383: txt proc!
Types
• How many tokens in a corpus? – The “wc” command
• How many token types in a corpus? – Number of unique tokens – CapitalizaZon is not relevant (The = the)
34
10/26/13
18
15383: txt proc!
CounZng words and types
• The chair of the session starts to chair the session – Tokens? – Types?
35
15383: txt proc!
TokenizaZon/SegmentaZon
“In Baghdad, Kirkuk (Kurdistan) and, Basra you hear different views.”
“ In Baghdad , Kirkuk ( Kurdistan ) and , Basra you hear different views . ”
• An important step for accurate text processing – Search Engine: looking for “Baghdad”
• Not “Baghdad,” – TranslaZon – …
36
10/26/13
19
15383: txt proc!
TokenizaZon
• Finding the word boundaries – Simple for Roman languages • White space and some rules
• StaZsZcal/probabilisZc approaches
• Not always easy for Asian languages
37
15383: txt proc!
Word vs. Lemma
• Lemma is the collapsed form of a word – Goes, went, gone go – Relies, relied, relying rely – Boxes Box
• LemmaZzaZon helps improving some text processing applicaZon – Search – Machine TranslaZon
38
10/26/13
20
15383: txt proc!
CounZng words and types
• The chair of the session starts to chair the session – Tokens? – Types?
39
15383: txt proc!
Stop words
• Tokens with liale stand-‐alone informaZon – The, a, an, there, here, is, am, …
– PunctuaZons
• ~ funcZons words – Content words
40
10/26/13
21
15383: txt proc!
Character Encoding
• ASCII: American Standard Code for InformaZon Interchange – = 128 characters (1 byte, with one control bit) – 99 numbers, punctuaZons, leaers • Mainly English
41
€
27
15383: txt proc!
ASCII characters
42
10/26/13
22
15383: txt proc!
Unicode
• A character encoding standard developed by the Unicode ConsorZum
• Provides a single representaZon for all world's wriZng systems
• "Unicode provides a unique number for every character, no maaer what the plaxorm, no maaer what the program, no maaer what the language.” – hap://www.unicode.org
43
15383: txt proc!
Unicode encodings
• Version 6.2 (2012) has codes for 110,182 characters – Full Unicode standard uses 32 bits (4 bytes) : it can represent
2^32 = 4,294,967,296 characters! • In reality, only 21 bits are needed
• Unicode has three encoding versions – UTF-‐32 (32 bits/4 bytes): direct representaZon – UTF-‐16 (16 bits/2 bytes): 216=65,536 possibiliZes – UTF-‐8 (8 bits/1 byte): 28=256 possibiliZes
• Why UTF-‐16 and UTF-‐8? – They are more compact (for certain languages, i.e., English)
44
10/26/13
23
15383: txt proc!
UTF-‐8
• Unicode has three encoding versions – UTF-‐32 (32 bits/4 bytes): direct representaZon – UTF-‐16 (16 bits/2 bytes): 2^16=65,536 possibiliZes – UTF-‐8 (8 bits/1 byte): 2^8=256 possibiliZes
• But how do you represent all of 2^32 (=4 billion) code points with only one byte (UTF-‐8: 2^8 =256 slots)? – You don't. – In reality, only 2^21 bits are ever uZlized for 110K characters.
– UTF-‐8 and UTF-‐16 use a variable-‐width encoding.
45
15383: txt proc!
Today: Text
✓ Different types of text ✓ Corpus
Levels of text processing ✓Corpus ✓Document ✓Sentence
✓Tokens ✓Characters and Encoding
46
10/26/13
24
15383: txt proc!
Python Lab
• Text processing warm up with standard Python and NLTK
• Lab-‐1 is available on course webpage: – hap://www.qatar.cmu.edu/~behrang/15383
– Task-‐1: Reading a text corpus with Python and basic processing to find frequent word types.
– Task-‐2: RepeaZng task-‐1, using some of NLTK uZliZes.
– Start now and submit before 6 pm on Thursday
47
top related