hlt
Post on 23-Jul-2015
TRANSCRIPT
1
LANGUAGE Technology
日本語 (Japanese) · 人機互動 (human-computer interaction)
1
Human Language Technology
2
Natural Language Processing
Computational Linguistics
3
NLP
• Computation
• Linguistics
4
Information Retrieval
Search Engine
5
IR
• Vector Space Model (tf-idf)
• Latent Semantic Analysis
• Link Analysis
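A minimal sketch of the vector space model with tf-idf weighting, for readers who have not seen it worked through. The weighting here (raw term frequency times log(N/df)) is only one common variant and is an assumption, as are the toy documents and the function names `tfidf_vectors` and `cosine`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency (assumed weighting)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents, invented for illustration only.
docs = [
    "language technology for search".split(),
    "search engine ranking".split(),
    "language and linguistics".split(),
]
vecs = tfidf_vectors(docs)
```

Documents that share a term ("search" in the first two) get a positive cosine similarity; documents with no terms in common score zero.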
6
Human-Computer Interaction
Applied Psychology
7
HCI
• Effectiveness
• Efficiency
• Satisfaction
8
What tasks are doable?
http://en.wikipedia.org/wiki/Natural_language_processing#Major_tasks_in_NLP
9
Everything is labeling
10
e.g. MeCab
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
11
“Essentially, all models are wrong, but some are useful.”
–– George E. P. Box
12
What’s the niche?
Know where the limit is.
13
快、狠、準 (fast, ruthless, accurate): The Slap of a Thousand Exploding Suns.
(http://en.wikipedia.org/wiki/Slapsgiving_3:_Slappointment_in_Slapmarra)
(http://tune.pk/video/1866501/how-i-met-your-mother-slapsgiving-3-slappointment-in-slapmara-preview)
14
What can WE do?
• Be creative in combining tasks, e.g.
• MT → Summarization
• Deception detection
15
16
/ 51
Fundamental Unit?
a meta-communication
18
What is a Word?
to linguistics
19
“... the smallest free form that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning) ...”
http://en.wikipedia.org/wiki/Word
20
“... the task of defining what constitutes a ‘word’ involves determining where one word ends and another word begins...”
http://en.wikipedia.org/wiki/Word#Word_boundaries
21
Word Boundary?
• Orthographic
• Sociological
• Lexical
• Semantic
• Phonological
• Morphological
• Syntactic
• Psycholinguistic
22
Orthographic Word
• Writing convention
• Space
• How about Ancient Greek?
• OED: africanization vs. americanization
23
Sociological Word
• Between a phoneme and a sentence
• 字 (zi) vs. 詞 (ci)
24
Lexical Word
• Listedness: cannot be generated “on-line”
• Dictionary entry?
• Orthographic?
• Idiomatic phrase?
• “Kick the bucket”
25
Semantic Word
• Difficult to define
• Without phonological form?
• Closer to morpheme?
• “Bio-”
26
Phonological Word
• Prosodic word?
• Disyllabic Chinese words
• Hyphenation? Syllabification?
27
Morphological Word
• Bound root: “bio-”
• [[貓頭]鷹]: is [貓頭] a word here?
• [[台北]市] vs. Taipei city
28
Syntactic Word
• Minimally occupy the category slots
• Hence [鴨] in [[鴨]子] is NOT a word
• Just like “duck” in “duck-ie”
29
Psycholinguistic Word
• All of the above?
30
What is a Word?
to computational linguistics
31
De jure standard?
• Academia Sinica Balanced Corpus (ASBC)
• University of Pennsylvania Chinese Treebank (CTB)
• City University of Hong Kong
• Microsoft Research Asia
• Peking University
32
Comparison
Pattern      CTB              China    ASBC    Example
ABAB         ABAB             AB-AB    ABAB    研究研究 (study a bit)
AA看         [AA/V-看/V]/V    AA看     AA看    説説看 (try saying it)
Person Name  One              Two      One
Noun-們      One              Two      Two     朋友們 (friends)
Ordinals     One              Two      Two     第一 (first)
... then match standards
The more accurate, the better the communication?
34
Partial Match?
Or dictionary, concordance, collocation, etc.?
Cross-lingual information retrieval
Evaluation Examples
• Gold standard
• [[meta][data]] / is / the / data / of / data
• 5 boundaries, 7 morphemes, 6 words, 5 lexicon types
• Test subject
• meta / data / is / the / data / of / data
• 1 boundary error, 0 morpheme errors, 2 word errors, 1 lexicon type error
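The counts in this example can be reproduced with a short sketch. It assumes boundaries are compared by character offset, a word error is a system token whose span is absent from the gold standard, and a lexicon type error is a gold word type the system never produces. These definitions and the names `boundaries`, `spans`, and `compare` are illustrative assumptions, not a standard scoring tool. Morpheme counts are left out because they need morpheme-level annotation.

```python
def boundaries(words):
    """Character offsets of the internal boundaries between words."""
    offsets, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        offsets.add(pos)
    return offsets

def spans(words):
    """(start, end) character span of every word token."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def compare(gold, system):
    """Count segmentation disagreements under the assumptions above."""
    if "".join(gold) != "".join(system):
        raise ValueError("both segmentations must cover the same text")
    return {
        "gold_boundaries": len(boundaries(gold)),
        "gold_words": len(gold),
        "gold_types": len(set(gold)),
        # Boundaries present in one segmentation but not the other.
        "boundary_errors": len(boundaries(gold) ^ boundaries(system)),
        # System tokens that do not match any gold token span.
        "word_errors": len(spans(system) - spans(gold)),
        # Gold word types that never appear in the system output.
        "type_errors": len(set(gold) - set(system)),
    }

gold = ["metadata", "is", "the", "data", "of", "data"]
system_output = ["meta", "data", "is", "the", "data", "of", "data"]
report = compare(gold, system_output)
```

One extra boundary (inside "metadata") yields two wrong word tokens, which is why boundary-level and word-level scores can diverge for the same output.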
38
Term Type
• Kwok (2002)
• Insensitive: stop-words; frequent, non-content-bearing terms
• Monotonic: content-bearing
• Non-monotonic:
• 西土耳其 (Western Turkey)
• Semantic, syntax, or surface?
• 农 (agricultural) / 作物 (plants)
• 旱 (drought) / 灾 (disaster) vs. 春旱 (Spring drought) vs. 旱区 (area of drought disaster)
• Recall or precision?
• 火 (fire) / 山 (mountain) vs. 火山 (volcano)
39
Surface Pattern
• Ambiguity
• Combinatorial
• 西土耳其、农作物、旱灾、春旱、旱区、火山, etc.
• Overlapping
• 施政 (practice policy) / 伟 (great) vs. 施 (Shih) / 政伟 (Zheng-Wei)
• Which is more harmful?
http://www.definicionabc.com/general/gestalt-psicologia.php
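Both kinds of ambiguity can be made concrete by enumerating every segmentation that a lexicon allows. The sketch below is a plain recursive enumeration, not any particular segmenter's algorithm. The function name `segmentations` and the tiny lexicons are hypothetical, built only from the examples on these slides.

```python
def segmentations(text, lexicon):
    """Yield every way to split `text` into words found in `lexicon`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                yield [head] + rest

# Overlapping ambiguity: 政 can attach to the left (施政) or to the right (政伟).
overlapping = list(segmentations("施政伟", {"施政", "伟", "施", "政伟"}))

# Combinatorial ambiguity: 火山 is a word, and so are both of its parts.
combinatorial = list(segmentations("火山", {"火", "山", "火山"}))
```

In both cases the lexicon alone licenses two readings, so choosing between them needs information beyond the surface pattern.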
40
Tractable Simulation?
http://imgs.xkcd.com/store/glen_shirts/g_try_science_shirt_2.jpg
41