Web Scale NLP: A Case Study on URL Word Breaking
Web Scale NLP: A Case Study on URL Word Breaking
Kuansan Wang, Chris Thrasher, Bo-June (Paul) Hsu
Microsoft Research, Redmond, USA
WWW 2011, March 31, 2011
More Data > Complex Model
Banko and Brill. Mitigating the Paucity-of-Data Problem. HLT 01
More Data > Complex Model
• CIKM 08
There is no data like more data?
NLP for the Web
• Scale of the Web
– Avoid manual intervention
– Efficient implementations
• Dynamic Nature of the Web
– Fast adaptation
• Global Reach of the Web
– Need rudimentary multi-lingual capabilities
• Diverse Language Styles of Web Contents
– Multi-style language models
Simple models with matched data!
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
Word Breaking
• Large Data + Simple Model (Norvig, CIKM 2008)
– Use unigram model to rank all possible segmentations (see the sketch below)
– Pretty good, but with occasional embarrassing outcomes
• More data does not help!
• Extension to trigram alleviates the problem
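For concreteness, here is a minimal Norvig-style unigram segmenter in Python. The counts are invented toy values standing in for web-scale data; this is an illustration of the approach, not the authors' implementation.

from functools import lru_cache
from math import log10

# Toy unigram counts standing in for a web-scale model; the numbers
# below are invented purely for illustration.
COUNTS = {"choose": 2.0e6, "spain": 1.1e6, "chooses": 3.0e5, "pain": 1.6e6,
          "experts": 7.0e5, "exchange": 1.4e6, "expert": 9.0e5, "sex": 1.2e6}
TOTAL = 1.0e9

def log_p(word):
    # Unigram log-probability; unseen words get a length-based penalty
    # so that long non-words are not preferred.
    return log10(COUNTS.get(word, 10.0 / 10 ** len(word)) / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    # Rank all segmentations by the sum of unigram log-probabilities.
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=lambda words: sum(log_p(w) for w in words))

print(segment("choosespain"))      # ('choose', 'spain')
print(segment("expertsexchange"))  # ('experts', 'exchange')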
Word Breaking for the Web
Web URLs exhibit a variety of language styles…
…and in different languages
Matched data is crucial to accuracy!
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
MAP Decision Rule
• Special case of Bayesian Minimum Risk
– Speech, MT, Parsing, Tagging, Information Retrieval, …
• Problem: Given the observation o, find the signal ŝ = argmax_s P(o | s) P(s)
– P(o | s): transformation model; P(s): prior
• Noisy channel view: Signal → Channel (Distortion) → Observation
MAP for Word Breaker
• o: Twitter hashtag or URL domain name
– Ex. 247moms, w84um8
• s: what the user meant to say
– Ex. 24_7_moms, w8_4_u_m8 (wait for you mate)
• Channel view: Signal → Transformation → Output (a candidate-generation sketch follows)
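To make the search space concrete, this small Python sketch (illustrative only; the function name and underscore convention are not from the paper) enumerates every break pattern for a raw token such as 247moms.

from itertools import product

def candidate_breaks(raw):
    # Enumerate every way of inserting breaks into a raw token.
    # An n-character token has 2**(n-1) candidates, so real systems
    # prune with the language model instead of listing them all.
    out = []
    for mask in product([False, True], repeat=len(raw) - 1):
        words, prev = [], 0
        for i, brk in enumerate(mask, start=1):
            if brk:
                words.append(raw[prev:i])
                prev = i
        words.append(raw[prev:])
        out.append("_".join(words))
    return out

# candidate_breaks("247moms") includes the intended "24_7_moms"; the MAP
# rule then picks the candidate s maximizing P(o | s) * P(s).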
Plug-in MAP Problem
• MAP decision rule is optimal only if P(o | s) and P(s) are the “correct” underlying distributions
• Adjustments are needed when the estimated models P̂(o | s) and P̂(s) have unknown errors (see the sketch below)
– Simple logarithmic interpolation: ŝ = argmax_s log P̂(s) + λ log P̂(o | s)
– “Random Field”/Machine Learning: learn feature weights discriminatively
– Bayesian: point estimation is outdated; assume parameters are drawn from “some” distribution
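A minimal sketch of the logarithmic-interpolation adjustment; the names map_decode, log_prior, log_transform, and lam are illustrative rather than from the paper.

def map_decode(candidates, log_prior, log_transform, lam=1.0):
    # Plug-in MAP with simple logarithmic interpolation:
    #   score(s) = log P_hat(s) + lam * log P_hat(o | s)
    # log_prior(s): n-gram language-model score of the candidate segmentation.
    # log_transform(s): score of rendering s as the observed raw token.
    # lam is tuned to compensate for errors in the two estimated models;
    # lam = 1 recovers the plain MAP decision rule.
    return max(candidates, key=lambda s: log_prior(s) + lam * log_transform(s))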
Baseline Methods
• GM: Geometric Mean (Koehn and Knight, 2003)
– Widely used, especially in MT systems
• BI: Binomial Model (Venkataraman, 2001)
• WL: Word Length Normalization (Khaitan et al., 2009)
All special cases/variations of MAP
Proposed Method
ME: Maximum Entropy Principle Model
• Special case of BI and WL (uniform)
• Prior P(s) computed using Microsoft Web N-gram (see the scoring sketch below)
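A rough sketch of how these transformation models can be written as scoring functions, under my reading of the slide; the exact parameterizations used in the paper are not reproduced here.

import math

def gm_score(words, log_p):
    # GM: geometric mean of per-word probabilities, i.e. the summed
    # log-probability normalized by the number of words.
    return sum(log_p(w) for w in words) / len(words)

def bi_log_transform(words, raw_len, p_break):
    # BI: binomial transformation model: of the raw_len - 1 possible
    # break positions, len(words) - 1 were actually broken.
    k, n = len(words) - 1, raw_len - 1
    return k * math.log(p_break) + (n - k) * math.log(1.0 - p_break)

def wl_log_transform(words, length_log_prob):
    # WL: score the break pattern by the log-probability of each
    # word's length under an assumed word-length distribution.
    return sum(length_log_prob(len(w)) for w in words)

# Per the slide, ME is a special case of BI and of WL (uniform); its
# exact parameter choices are not specified here.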
Microsoft Web N-gram (http://web-ngram.research.microsoft.com)
• Web documents/Bing queries (EN-US market)
• Rudimentary multilingual (NAACL 10)
• Frequent updates (ICASSP 09)
• Multi-style language model (WWW 10, SIGIR 10)
         Body    Title   Anchor   Query
1-gram   1.2 B   60 M    150 M    252 M
5-gram   237 B   3.8 B   8.9 B    -
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
Data Set
• 100K randomly sampled URLs indexed by Bing
– Simple tokenization
– 266K unique tokens
– Mostly ASCII characters
• Metric: Precision@3
– Manually labeled word breaks
– Multiple answers are allowed (see the sketch below)
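A small sketch of the metric as described on the slide; treating a token as correct when any of the top three candidates matches one of the labeled answers is an assumption about the exact definition.

def precision_at_3(system_top3, acceptable):
    # A token counts as correct if any of the system's top-3 segmentations
    # matches one of the manually labeled acceptable answers.
    return 1.0 if any(c in acceptable for c in system_top3[:3]) else 0.0

def corpus_precision_at_3(system_outputs, gold):
    # system_outputs: {token: [best, second, third]}
    # gold: {token: set of acceptable segmentations}
    return sum(precision_at_3(system_outputs[t], gold[t]) for t in gold) / len(gold)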
Language Model Style
[Chart: Precision@3 (93%–99%) by language model style (Body, Title, Query, Anchor), comparing 1-gram, 2-gram, 3-gram, and ME]
• Title is best although Body is 100x larger
• Navigational queries often word-split URLs, but Query is worse than Title
Matched style is crucial to precision!
Model Complexity
• With mismatched data, model choice is crucial
• With matched data, complex models do not help
[Chart: Precision@3 (95%–99%) with 3-gram priors on Body, Title, Query, and Anchor, comparing BI (2), WL (1), and ME (0)]
Simple model is sufficient with matched data!
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
Best = Right Data + Smart Model
• Style of language trumps size of data
– There is no data like more data… provided it’s matched data!
• Right data alleviates the plug-in MAP problem
– Complicated machine learning artillery is not required; simple methods suffice
• Smart model gives us:
– Rudimentary multi-lingual capability
– Fast inclusion of new words/phrases
– No need for human labor in data labeling
http://research.microsoft.com/en-us/um/people/kuansanw/wordbreaker/
[Chart: Precision@3 (95%–99%) by n-gram order (1-gram to 3-gram) for Body, Title, Query, and Anchor language models]
Note: BI, WL are oracle results
GM       1-gram    2-gram    3-gram
Body     59.01%    44.68%    44.78%
Title    61.55%    60.31%    58.70%
Anchor   60.46%    55.25%    54.84%
Query    54.83%    54.27%    54.83%