intro’to’textprocessing’ lecture’2’behrang/15383/lecs/2013.10.23.lec2.pdf · 10/26/13 5...

10/26/13

15383: txt proc!

Intro to Text Processing Lecture 2

Behrang Mohit

Today: Text

•  Different types of text
•  Corpus
•  Levels of text processing
   – Corpus
   – Document
   – Sentence
   – Tokens
   – Characters and Encoding

So Much Text!

•  Goal: Organize and process

Different forms of text

•  Structured text
   – Databases
   – Knowledge base
•  Semi-structured
   – Wikipedia info-boxes
   – Excel sheets, tables in documents
   – XML documents
•  Unstructured
   – Plain text: News, Webpages, Emails, SMS, tweets, …

From text to knowledge

Organizing the knowledge

•  Text: unstructured information

Charlie was born in London in 1889 and ….
Born in 1889, Chaplin spent his youth in ….
Chaplin (1889-1977) was a true genius of cinema ….
He was born 6 years before the birth of cinema!

•  Knowledge base: structured information

Person            Birth date
Charlie Chaplin   1889

Need for textual data

•  All text processing systems need data:
   – Translation
   – Question Answering
   – Summarization
   – …

Corpus/Corpora

•  No access to all texts
   – We use a representative subset
•  Corpus: A collection of documents
   – Used for linguistic analysis
   – Used for building a text processing system
      •  Human-labeled data for statistical systems
   – Used for benchmarking/testing a system
      •  Performance of a search engine

Characteristics of a corpus

•  Written, spoken
•  General (balanced), domain-specific (e.g. news)
•  Monolingual, Bilingual, Comparable
•  Size of corpus
   – Web as corpus > 2,000 billion words
•  Copyright issues and license

Annotated Corpus

•  Linguistic annotation: adding some linguistic knowledge to the plain text
•  Need for annotated data
   – System that separates sentences
   – System that finds names
   – System that answers questions
   – System that translates
   – Search Engine
   – …

Examples of corpora

•  British National Corpus
   – 90% written, 10% spoken
   – 100M words, British English
   – Plain text with some linguistic annotations
•  Brown Corpus
   – English, balanced, written American, 1M words
   – Annotated with syntactic and semantic information

Brown corpus: Text + syntax

Named Entity Corpus

Parallel Corpus

Linguistic Annotation

•  Creating gold-standard data
   – Used for building and training a system
   – Used for evaluating a system
•  Requirements:
   – Robust annotation software
   – Clear guidelines
   – Quality control

Desktop and web-based annotation

•  Text and GUI-based desktop annotation
•  New web-based annotations
   – Work any time, anywhere
   – Direct storage
   – Browser and bandwidth limitations

Web-based annotation

CMUQ’s QALB Project

•  QALB: Qatar Arabic Language Bank
   – Automatic correction of Arabic text
      •  Spelling, grammar, word choice
      •  Text: Natives (news comments), Learners (essays), MT
•  Target: A corpus of 2 million words
•  10 annotators
   – Demo: The annotation interface

Quality Control

•  Corpus needs to be an accurate language sample
•  Quality of annotated corpus:
   – High agreement among human annotators → high-quality corpus
•  Framework:
   – Randomly assign the same parts of the task to multiple annotators
   – Measure their agreement
      •  Agreement measures
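
The slides do not name a specific agreement measure. As one possible illustration, the sketch below computes Cohen's kappa for two annotators, which corrects raw agreement for the agreement expected by chance. The choice of kappa, the function name `cohen_kappa`, and the NAME/O labels are assumptions made for this example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical data: two annotators mark six tokens as NAME or O (other).
annotator_1 = ["NAME", "O", "O", "NAME", "O", "O"]
annotator_2 = ["NAME", "O", "NAME", "NAME", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.667
```

Here the annotators agree on 5 of 6 items (0.833), but half of that agreement would be expected by chance, so kappa is lower than the raw agreement rate.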

New Models: Crowd Sourcing

•  Large-scale, simple and cheap annotations
   – Quick and mass annotation
•  Amazon Mechanical Turk
   – Cost: 1-10 cents/task
   – Frameworks to assess the annotators
      •  Before and during the work
   – Limits: Quality and the range of expertise

Document structure

•  Plain text
•  Markups
   – SGML: Standard Generalized Markup Language
   – HTML: HyperText Markup Language
   – XML: Extensible Markup Language
   – Wikipedia
•  Prints

Plain Text

Today: Text

✓ Different types of text
✓ Corpus

Levels of text processing
✓ Corpus
✓ Document
– Sentence
– Tokens
– Characters and Encoding

Sentences

•  Group of words which …
   – Usually split by a full stop, exclamation mark or question mark
   – Many sentences have a verb
   – Some languages have a typical word order:
      •  English: Subject-Verb-Object (SVO)
      •  Arabic: Verb-Subject-Object (VSO), SVO

Sentence splitting

•  Sentence-level processing → sentence boundary detection
•  Baseline: split at ./!/?
•  But:
   – The class was taught by Dr. Ali Eqbal and Prof. J. Patrick Jr.
   – P.O. Box, I.R. Pakistan, U.S.A.
•  Use rules with a list of acronyms and abbreviations
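
A minimal sketch of the rule-based idea above: split at ./!/? unless the token ending there is a known abbreviation or a single-letter initial. The abbreviation list, the function name `split_sentences`, and the example text are hypothetical; a real list would be much longer.

```python
import re

# Hypothetical abbreviation list, stored in lower case.
ABBREVIATIONS = {"dr.", "prof.", "jr.", "mr.", "mrs.", "p.o.", "i.r.", "u.s.a."}

def split_sentences(text):
    """Split at . ! ? unless the token is an abbreviation or an initial."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if not re.search(r"[.!?]$", token):
            continue  # token does not end in sentence punctuation
        lowered = token.lower()
        is_initial = re.fullmatch(r"[a-z]\.", lowered) is not None
        if lowered in ABBREVIATIONS or is_initial:
            continue  # the period belongs to the abbreviation
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = ("The class was taught by Dr. Ali Eqbal and Prof. J. Patrick. "
        "It starts at noon! Are you coming?")
for sentence in split_sentences(text):
    print(sentence)
```

One limitation of this sketch: when an abbreviation such as "Jr." also ends a sentence, the rule suppresses the split. Cases like this are hard to cover with a fixed list alone.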

Statistical Sentence Splitting

•  Collect different information (features) from the surrounding words:
   – Lexical category of the preceding/following words
   – Case of the preceding/following words
   – Is the preceding word an abbreviation?
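
The features listed above could be collected roughly as follows for each candidate boundary. This is only a sketch: the feature names and the function `boundary_features` are hypothetical, and the lexical category feature is left out because it would need a part-of-speech tagger. A statistical classifier would then be trained on such feature dictionaries together with human-labeled boundaries.

```python
def boundary_features(tokens, i, abbreviations):
    """Features for the candidate boundary after tokens[i],
    where tokens[i] ends in '.', '!' or '?'."""
    prev_word = tokens[i].rstrip(".!?")
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        # Is the preceding word an abbreviation?
        "prev_is_abbreviation": tokens[i].lower() in abbreviations,
        # Case of the preceding and following words.
        "prev_is_capitalized": prev_word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
        # Extra hypothetical features: short words are often abbreviations.
        "prev_length": len(prev_word),
        "is_last_token": next_word == "",
    }

tokens = "He met Dr. Ali yesterday. They talked.".split()
abbreviations = {"dr.", "prof."}  # hypothetical list
features_dr = boundary_features(tokens, 2, abbreviations)    # after "Dr."
features_end = boundary_features(tokens, 4, abbreviations)   # after "yesterday."
print(features_dr)
print(features_end)
```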

Sentence alignment

•  Parallel corpus is ideal
   – Reality: Comparable corpus
      •  Translators reorganize the text, creating new sentences.
   – Extremely non-aligned:
      •  News articles in different languages
      •  Wikipedia pages in different languages

Today: Text

Levels of text processing
✓ Corpus
✓ Document
✓ Sentence
– Tokens
– Characters and Encoding

Words vs. tokens

•  Difference mainly in languages where white space does not necessarily mean a word boundary
•  Graphical words: token
   – What we see on the screen or in the book

Her email is mariem@yahoo.com
This is a half-cooked idea.

•  Underlying word
   – Deeper linguistic
      •  Half, cooked

•  How many tokens in a corpus?
   – The “wc” command
•  How many token types in a corpus?
   – Number of unique tokens
   – Capitalization is not relevant (The = the)

Counting words and types

•  The chair of the session starts to chair the session
   – Tokens?
   – Types?
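
As a rough check of the counts for the sentence above, the sketch below splits on white space (much as `wc -w` does) and lower-cases everything so that "The" and "the" count as one type. The variable names are chosen for this example only.

```python
from collections import Counter

sentence = "The chair of the session starts to chair the session"

# Tokens: every white-space-separated occurrence, case-folded.
tokens = sentence.lower().split()
# Types: the distinct tokens, with their frequencies.
type_counts = Counter(tokens)

num_tokens = len(tokens)
num_types = len(type_counts)
print(num_tokens, num_types)  # 10 6
print(type_counts.most_common(2))
```

Under these assumptions there are 10 tokens and 6 types. Note that the noun "chair" and the verb "chair" fall into the same type here, because the count looks only at the surface string.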

Tokenization/Segmentation

“In Baghdad, Kirkuk (Kurdistan) and, Basra you hear different views.”

“ In Baghdad , Kirkuk ( Kurdistan ) and , Basra you hear different views . ”

•  An important step for accurate text processing
   – Search Engine: looking for “Baghdad”
      •  Not “Baghdad,”
   – Translation
   – …
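
The tokenized form of the example sentence can be reproduced with a small regular expression. This is a sketch under simple assumptions, not a full tokenizer: the pattern `TOKEN_PATTERN` is hypothetical and keeps words joined by an internal hyphen, apostrophe, "@" or period (so an e-mail address stays one token) while splitting off every other punctuation mark.

```python
import re

# A word, optionally continued by -, ', @ or . followed by more word
# characters; otherwise any single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'@.]\w+)*|[^\w\s]")

def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)

text = "“In Baghdad, Kirkuk (Kurdistan) and, Basra you hear different views.”"
tokens = tokenize(text)
print(" ".join(tokens))
print("Baghdad" in tokens)   # True: the search term now matches exactly
print("Baghdad," in tokens)  # False
```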

Tokenization

•  Finding the word boundaries
   – Simple for Roman-script languages
      •  White space and some rules
•  Statistical/probabilistic approaches
•  Not always easy for Asian languages

Word vs. Lemma

•  Lemma is the base form that the inflected forms of a word collapse to
   – Goes, went, gone → go
   – Relies, relied, relying → rely
   – Boxes → box
•  Lemmatization helps improve some text processing applications
   – Search
   – Machine Translation
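
To illustrate why lemmas help search, here is a toy sketch that maps word forms to lemmas with a lookup table and then matches a query against documents at the lemma level. The table, the function `lemmatize`, and the sample documents are all hypothetical; real lemmatizers rely on a full lexicon and morphological rules.

```python
# Hypothetical lookup table covering only the examples on the slide.
LEMMA_TABLE = {
    "goes": "go", "went": "go", "gone": "go",
    "relies": "rely", "relied": "rely", "relying": "rely",
    "boxes": "box",
}

def lemmatize(token):
    """Return the lemma of a token, or the lower-cased token if unknown."""
    word = token.lower()
    return LEMMA_TABLE.get(word, word)

query = "went"
documents = ["She goes home", "They relied on him", "He has gone"]

# A document matches if any of its tokens shares a lemma with the query.
matches = [
    doc for doc in documents
    if lemmatize(query) in {lemmatize(tok) for tok in doc.split()}
]
print(matches)  # ['She goes home', 'He has gone']
```

An exact string match for "went" would find none of these documents; matching on the lemma "go" finds two.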

Counting words and types

•  The chair of the session starts to chair the session
   – Tokens?
   – Types?

Stop words

•  Tokens with little stand-alone information
   – The, a, an, there, here, is, am, …
   – Punctuation
•  ≈ Function words
   – As opposed to content words
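
Stop-word removal can be sketched as a simple filter over a token list. The stop list below contains only the words named on the slide and is an assumption; practical lists are longer and depend on the application.

```python
import string

# Stop list taken from the slide's examples only (hypothetical, incomplete).
STOP_WORDS = {"the", "a", "an", "there", "here", "is", "am"}
PUNCTUATION = set(string.punctuation)

def remove_stop_words(tokens):
    """Drop stop words and tokens made up entirely of punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in STOP_WORDS:
            continue
        if all(ch in PUNCTUATION for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["There", "is", "a", "chair", "in", "the", "room", "."]
content = remove_stop_words(tokens)
print(content)  # ['chair', 'in', 'room']
```

The word "in" survives only because it is missing from this short list, which shows how much the result depends on the list that is used.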

Character Encoding

•  ASCII: American Standard Code for Information Interchange
   – 2^7 = 128 characters (1 byte, with one control bit)
   – 95 printable characters: numbers, punctuation, letters
      •  Mainly English
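
A short sketch of what the 7-bit limit means in practice: every ASCII character has a code below 2^7 = 128 and takes one byte, while a character outside that range cannot be encoded as ASCII at all. The sample strings are chosen for illustration.

```python
text = "Text!"

# Each character maps to a number; for ASCII these all fit in 7 bits.
codes = [ord(ch) for ch in text]
print(codes)  # [84, 101, 120, 116, 33]
all_seven_bit = all(code < 2**7 for code in codes)

# One byte per character when encoded as ASCII.
encoded = text.encode("ascii")
print(len(encoded))  # 5

# A non-English letter has no ASCII code, so encoding fails.
try:
    "café".encode("ascii")
    fits_ascii = True
except UnicodeEncodeError:
    fits_ascii = False
print(fits_ascii)  # False
```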

ASCII characters

Unicode

•  A character encoding standard developed by the Unicode Consortium
•  Provides a single representation for all the world's writing systems
•  “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”
   – http://www.unicode.org

Unicode encodings

•  Version 6.2 (2012) has codes for 110,182 characters
   – Full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
      •  In reality, only 21 bits are needed
•  Unicode has three encoding forms
   – UTF-32 (32 bits/4 bytes): direct representation
   – UTF-16 (16-bit/2-byte units): 2^16 = 65,536 possibilities per unit
   – UTF-8 (8-bit/1-byte units): 2^8 = 256 possibilities per unit
•  Why UTF-16 and UTF-8?
   – They are more compact (for certain languages, e.g., English)

UTF-8

•  Unicode has three encoding forms
   – UTF-32 (32 bits/4 bytes): direct representation
   – UTF-16 (16-bit/2-byte units): 2^16 = 65,536 possibilities per unit
   – UTF-8 (8-bit/1-byte units): 2^8 = 256 possibilities per unit
•  But how do you represent all of 2^32 (≈ 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
   – You don't.
   – In reality, only 21 bits are ever utilized for 110K characters.
   – UTF-8 and UTF-16 use a variable-width encoding.
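
The variable-width behaviour can be observed directly by encoding a few characters and counting bytes. The sample characters are arbitrary choices for this sketch; the `-le` codec variants are used so that no byte-order mark is added to the counts.

```python
import sys

samples = ["A", "é", "ع", "中", "😀"]

widths = {}
for ch in samples:
    widths[ch] = (
        len(ch.encode("utf-8")),      # 1 to 4 bytes
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )
    print(f"U+{ord(ch):04X}", widths[ch])

# The largest valid code point fits in 21 bits.
print(sys.maxunicode < 2**21)  # True
```

English text stays at one byte per character in UTF-8, while UTF-32 spends four bytes on every character, which is the compactness argument made above.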

Today: Text

Levels of text processing
✓ Corpus
✓ Document
✓ Sentence
✓ Tokens
✓ Characters and Encoding

Python Lab

•  Text processing warm-up with standard Python and NLTK
•  Lab-1 is available on the course webpage:
   – http://www.qatar.cmu.edu/~behrang/15383
   – Task-1: Reading a text corpus with Python and basic processing to find frequent word types.
   – Task-2: Repeating task-1, using some of the NLTK utilities.
   – Start now and submit before 6 pm on Thursday
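
As a warm-up in the spirit of Task-1, the sketch below reads text line by line and reports the most frequent word types using only the standard library. It is one possible approach, not the lab's reference solution: the function name `most_frequent_types` is hypothetical, and an in-memory `io.StringIO` object stands in for an opened corpus file.

```python
import io
import re
from collections import Counter

def most_frequent_types(lines, k=3):
    """Count lower-cased word types over an iterable of lines."""
    counts = Counter()
    for line in lines:
        # \w+ keeps runs of letters and digits and drops punctuation.
        counts.update(re.findall(r"\w+", line.lower()))
    return counts.most_common(k)

# Stand-in for open("corpus.txt", encoding="utf-8"); the file name is hypothetical.
corpus_file = io.StringIO("The cat sat on the mat.\nThe dog sat too.\n")
top = most_frequent_types(corpus_file, k=2)
print(top)  # [('the', 3), ('sat', 2)]
```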
