hlt


Upload: mike-tian-jian-jiang

Post on 23-Jul-2015


Category: Technology



TRANSCRIPT

Page 1: HLT

LANGUAGE Technology

日本語 (Japanese language)

動互機人 (人機互動, "human-computer interaction", written right to left)

Page 2: HLT

Human Language Technology

Page 3: HLT

Natural Language Processing
Computational Linguistics

Page 4: HLT

NLP

• Computation

• Linguistics

Page 5: HLT

Information Retrieval
Search Engine

Page 6: HLT

IR

• Vector Space Model (tf-idf)

• Latent Semantic Analysis

• Link Analysis
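
The first item, the vector space model with tf-idf weighting, can be sketched in a few lines. This is a minimal illustration, not the deck's own code: it assumes one common weighting variant (relative term frequency times log(N/df)), and the three toy documents and the function names `tf_idf` and `cosine` are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: relative term frequency times log(N / df).
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy collection.
docs = [
    "human language technology".split(),
    "language technology and search".split(),
    "search engine link analysis".split(),
]
vecs = tf_idf(docs)
```

Documents that share terms (the first two here) get a positive cosine similarity, while documents with no terms in common score zero. Terms that occur in fewer documents, such as "human", receive a higher weight than terms spread across the collection.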

Page 7: HLT

Human-Computer Interaction
Applied Psychology

Page 8: HLT

HCI

• Effectiveness

• Efficiency

• Satisfaction

Page 9: HLT

What are doable tasks?
http://en.wikipedia.org/wiki/Natural_language_processing#Major_tasks_in_NLP

Page 10: HLT

Everything is labeling

Page 11: HLT

e.g. MeCab
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
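
One way to read "everything is labeling" is that word segmentation itself can be cast as assigning a tag to every character. The sketch below assumes a simple two-tag scheme (B = begins a word, I = inside a word). It does not use MeCab or reflect how MeCab works internally, and the function names and the example sentence are hypothetical.

```python
def words_to_tags(words):
    """Encode a segmentation as one tag per character (B or I)."""
    tags = []
    for w in words:
        tags.extend(["B"] + ["I"] * (len(w) - 1))
    return tags

def tags_to_words(text, tags):
    """Decode per-character B/I tags back into a list of words."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words

# Hypothetical example sentence, written without spaces.
words = ["metadata", "is", "the", "data", "of", "data"]
text = "".join(words)
tags = words_to_tags(words)
recovered = tags_to_words(text, tags)
```

Under this view a segmenter is any model that predicts the tag sequence, and decoding the tags recovers the words.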

Page 12: HLT

“Essentially, all models are wrong, but some are useful.”

–– George E. P. Box

Page 13: HLT

What’s the niche?
Know where the limit is.

Page 14: HLT

快、狠、準 (fast, fierce, precise). The Slap of a Thousand Exploding Suns.

(http://en.wikipedia.org/wiki/Slapsgiving_3:_Slappointment_in_Slapmarra)
(http://tune.pk/video/1866501/how-i-met-your-mother-slapsgiving-3-slappointment-in-slapmara-preview)

Page 15: HLT

What can WE do?

• Be creative in combining tasks, e.g.

• MT → Summarization

• Deception detection

Page 16: HLT

Page 17: HLT

<(_ _)>
https://www.coursera.org/course/nlangp

Page 18: HLT

Fundamental Unit?
A meta-communication

Page 19: HLT

What is a Word?
To linguistics

Page 20: HLT

“... the smallest free form that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning) ...”

http://en.wikipedia.org/wiki/Word

Page 21: HLT

“... the task of defining what constitutes a ‘word’ involves determining where one word ends and another word begins...”

http://en.wikipedia.org/wiki/Word#Word_boundaries

Page 22: HLT

Word Boundary?

• Orthographic

• Sociological

• Lexical

• Semantic

• Phonological

• Morphological

• Syntactic

• Psycholinguistic

Page 23: HLT

Orthographic Word

• Writing convention

• Space

• How about Ancient Greek?

• OED: africanization vs. americanization

Page 24: HLT

Sociological Word

• Between a phoneme and a sentence

• 字 (zi, "character") vs. 詞 (ci, "word")

Page 25: HLT

Lexical Word

• Listedness: cannot be generated “on-line”

• Dictionary entry?

• Orthographic?

• Idiomatic phrase?

• “Kick the bucket”

Page 26: HLT

Semantic Word

• Difficult to define

• Without phonological form?

• Closer to morpheme?

• “Bio-”

Page 27: HLT

Phonological Word

• Prosodic word?

• Disyllabic Chinese words

• Hyphenation? Syllabification?

Page 28: HLT

Morphological Word

• Bound root: “bio-”

• [[貓頭]鷹] (貓頭鷹, "owl"): is [貓頭] ("cat head") a word here?

• [[台北]市] vs. Taipei city

Page 29: HLT

Syntactic Word

• Minimally occupies the category slots

• Hence [鴨] in [[鴨]子] (鴨子, "duck") is NOT a word

• Just like “duck” in “duck-ie”

Page 30: HLT

Psycholinguistic Word

• All of the above?

Page 31: HLT

What is a Word?
To computational linguistics

Page 32: HLT

Standard de jure?

• Academia Sinica Balanced Corpus

• Chinese Treebank of University of Pennsylvania

• City University of Hong Kong

• Microsoft Research Asia

• Peking University

Page 33: HLT

Comparison

Pattern       CTB               China    ASBC    Example
ABAB          ABAB              AB-AB    ABAB    研究研究 ("study a bit")
AA看          [AA/V-看/V]/V     AA看     AA看    説説看 ("try saying it")
Person name   One               Two      One
Noun-們       One               Two      Two     朋友們 ("friends")
Ordinals      One               Two      Two     第一 ("first")

Page 34: HLT

... then match standards
The higher the accuracy, the better the communication?

Page 35: HLT

Partial Match?
Or dictionary, concordance, collocation, etc.?

Page 36: HLT
Page 37: HLT

cross-lingual information retrieval

Page 38: HLT

Evaluation Examples

• Gold standard

• [[meta][data]] / is / the / data / of / data

• 5 boundaries, 7 morphemes, 6 words, 5 lexicon types

• Test subject

• meta / data / is / the / data / of / data

• 1 boundary error, 0 morpheme errors, 2 word errors, 1 lexicon type error
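
The boundary, word, and lexicon-type error counts above can be reproduced with a short script. This is a sketch under assumptions: boundaries are taken as character offsets between adjacent words, a word error is a predicted word span that does not appear in the gold standard, and a lexicon type error is a predicted word type absent from the gold lexicon. Morpheme counting is left out, and the helper names `spans` and `boundaries` are hypothetical.

```python
def spans(words):
    """Return the set of (start, end) character spans, one per word."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def boundaries(words):
    """Return the set of character offsets that separate adjacent words."""
    offsets, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        offsets.add(pos)
    return offsets

gold = ["metadata", "is", "the", "data", "of", "data"]
test = ["meta", "data", "is", "the", "data", "of", "data"]

# Boundaries present in one segmentation but not the other.
boundary_errors = boundaries(gold) ^ boundaries(test)
# Predicted word spans that the gold standard does not contain.
word_errors = spans(test) - spans(gold)
# Predicted word types that are missing from the gold lexicon.
type_errors = set(test) - set(gold)

print(len(boundaries(gold)), len(gold), len(set(gold)))             # 5 6 5
print(len(boundary_errors), len(word_errors), len(type_errors))     # 1 2 1
```

A single extra boundary inside "metadata" costs one boundary error but two word errors, which is why the choice of unit changes how severe the same mistake looks.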

Page 39: HLT

Term Type

• Kwok (2002)

• Insensitive: stop-words; frequent, non-content-bearing terms

• Monotonic: content-bearing

• Non-monotonic:

• 西土耳其 (Western Turkey)

• Semantics, syntax, or surface?

• 农 (agricultural) / 作物 (crops)

• 旱 (drought) / 灾 (disaster) vs. 春旱 (spring drought) vs. 旱区 (drought-stricken area)

• Recall or precision?

• 火 (fire) / 山 (mountain) vs. 火山 (volcano)
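
The recall-versus-precision question can be made concrete with a toy inverted index. The sketch below is an illustration only: the two documents are made up, the functions `index_terms` and `search` are hypothetical, and retrieval is plain Boolean AND over either whole words or single characters.

```python
def index_terms(segmented_docs, unit):
    """Build an inverted index mapping each term to a set of document ids.

    unit="word" indexes each segmented word as a term;
    unit="char" indexes each single character as a term.
    """
    index = {}
    for doc_id, words in enumerate(segmented_docs):
        terms = words if unit == "word" else [ch for w in words for ch in w]
        for t in terms:
            index.setdefault(t, set()).add(doc_id)
    return index

def search(index, query_terms):
    """Boolean AND retrieval: documents that contain every query term."""
    postings = [index.get(t, set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy documents, already segmented into words.
docs = [
    ["火山", "爆发"],          # "the volcano erupts"
    ["山", "上", "有", "火"],  # "there is a fire on the mountain"
]
word_index = index_terms(docs, "word")
char_index = index_terms(docs, "char")

volcano_by_word = search(word_index, ["火山"])     # only the volcano document
volcano_by_char = search(char_index, ["火", "山"])  # both documents match
fire_by_word = search(word_index, ["火"])           # misses 火 inside 火山
fire_by_char = search(char_index, ["火"])           # both documents match
```

Character indexing never misses a document that contains the query characters, which helps recall, but it also returns the mountain-fire document for a volcano query, which hurts precision. Word indexing behaves the other way around.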

Page 40: HLT

Surface Pattern

• Ambiguity

• Combinatorial

• 西土耳其、农作物、旱灾、春旱、旱区、火山, etc.

• Overlapping

• 施政 (carry out policy) / 伟 (great) vs. 施 (Shih) / 政伟 (Zheng-Wei)

• Which is more harmful?
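
Both kinds of ambiguity show up as soon as every dictionary-consistent segmentation of a string is enumerated. The following is a minimal sketch with tiny hand-made lexicons (assumptions for illustration, not a real dictionary) and a hypothetical helper `segmentations` that uses exhaustive recursion rather than any particular segmenter's algorithm.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words that are all in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix as a word and segment the remainder.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

# Combinatorial ambiguity: a word whose parts are also words.
combinatorial = segmentations("火山", {"火", "山", "火山"})

# Overlapping ambiguity: the middle character can attach to either side.
overlapping = segmentations("施政伟", {"施政", "伟", "施", "政伟"})

print(combinatorial)  # [['火', '山'], ['火山']]
print(overlapping)    # [['施', '政伟'], ['施政', '伟']]
```

In the combinatorial case the candidates differ only in granularity. In the overlapping case they disagree on which word the middle character belongs to, so no candidate is a refinement of the other.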

http://www.definicionabc.com/general/gestalt-psicologia.php

Page 41: HLT

Tractable Simulation?

http://imgs.xkcd.com/store/glen_shirts/g_try_science_shirt_2.jpg

Page 42: HLT
Page 43: HLT
Page 44: HLT