Web Scale NLP: A Case Study on URL Word Breaking
Web Scale NLP: A Case Study on URL Word Breaking
Kuansan Wang, Chris Thrasher, Bo-June (Paul) Hsu
Microsoft Research, Redmond, USA
WWW 2011, March 31, 2011
More Data > Complex Model
Banko and Brill. Mitigating the Paucity-of-Data Problem. HLT 01
More Data > Complex Model
• CIKM 08
There is no data like more data?
NLP for the Web
• Scale of the Web
– Avoid manual intervention
– Efficient implementations
• Dynamic Nature of the Web
– Fast adaptation
• Global Reach of the Web
– Need rudimentary multi-lingual capabilities
• Diverse Language Styles of Web Contents
– Multi-style language models
Simple models with matched data!
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
Word Breaking
• Large Data + Simple Model (Norvig, CIKM 2008)
– Use unigram model to rank all possible segmentations (see the sketch below)
– Pretty good, but with occasional embarrassing outcomes
• More data does not help!
• Extension to trigram alleviates the problem
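For concreteness, here is a minimal Norvig-style unigram segmenter in Python. The counts are invented toy values standing in for web-scale data; this is an illustration of the approach, not the authors' implementation.

from functools import lru_cache
from math import log10

# Toy unigram counts standing in for a web-scale model; the numbers
# below are invented purely for illustration.
COUNTS = {"choose": 2.0e6, "spain": 1.1e6, "chooses": 3.0e5, "pain": 1.6e6,
          "experts": 7.0e5, "exchange": 1.4e6, "expert": 9.0e5, "sex": 1.2e6}
TOTAL = 1.0e9

def log_p(word):
    # Unigram log-probability; unseen words get a length-based penalty
    # so that long non-words are not preferred.
    return log10(COUNTS.get(word, 10.0 / 10 ** len(word)) / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    # Rank all segmentations by the sum of unigram log-probabilities.
    if not text:
        return ()
    candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(candidates, key=lambda words: sum(log_p(w) for w in words))

print(segment("choosespain"))      # ('choose', 'spain')
print(segment("expertsexchange"))  # ('experts', 'exchange')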
Word Breaking for the Web
Web URLs exhibit a variety of language styles…
…and in different languages
Matched data is crucial to accuracy!
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
MAP Decision Rule
• Special case of Bayesian Minimum Risk
– Speech, MT, Parsing, Tagging, Information Retrieval, …
• Problem: Given the observation o, find the signal ŝ = argmax_s P(o | s) P(s)
– P(o | s): transformation model; P(s): prior
• Noisy channel view: Signal → Channel (Distortion) → Observation
MAP for Word Breaker
• o: Twitter hashtag or URL domain name
– Ex. 247moms, w84um8
• s: what the user meant to say
– Ex. 24_7_moms, w8_4_u_m8 (wait for you mate)
• Channel view: Signal → Transformation → Output (a candidate-generation sketch follows)
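To make the search space concrete, this small Python sketch (illustrative only; the function name and underscore convention are not from the paper) enumerates every break pattern for a raw token such as 247moms.

from itertools import product

def candidate_breaks(raw):
    # Enumerate every way of inserting breaks into a raw token.
    # An n-character token has 2**(n-1) candidates, so real systems
    # prune with the language model instead of listing them all.
    out = []
    for mask in product([False, True], repeat=len(raw) - 1):
        words, prev = [], 0
        for i, brk in enumerate(mask, start=1):
            if brk:
                words.append(raw[prev:i])
                prev = i
        words.append(raw[prev:])
        out.append("_".join(words))
    return out

# candidate_breaks("247moms") includes the intended "24_7_moms"; the MAP
# rule then picks the candidate s maximizing P(o | s) * P(s).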
Plug-in MAP Problem
• MAP decision rule is optimal only if P(o | s) and P(s) are the “correct” underlying distributions
• Adjustments are needed when the estimated models P̂(o | s) and P̂(s) have unknown errors (see the sketch below)
– Simple logarithmic interpolation: ŝ = argmax_s log P̂(s) + λ log P̂(o | s)
– “Random Field”/Machine Learning: learn feature weights discriminatively
– Bayesian: point estimation is outdated; assume parameters are drawn from “some” distribution
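A minimal sketch of the logarithmic-interpolation adjustment; the names map_decode, log_prior, log_transform, and lam are illustrative rather than from the paper.

def map_decode(candidates, log_prior, log_transform, lam=1.0):
    # Plug-in MAP with simple logarithmic interpolation:
    #   score(s) = log P_hat(s) + lam * log P_hat(o | s)
    # log_prior(s): n-gram language-model score of the candidate segmentation.
    # log_transform(s): score of rendering s as the observed raw token.
    # lam is tuned to compensate for errors in the two estimated models;
    # lam = 1 recovers the plain MAP decision rule.
    return max(candidates, key=lambda s: log_prior(s) + lam * log_transform(s))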
Baseline Methods
• GM: Geometric Mean (Koehn and Knight, 2003)
– Widely used, especially in MT systems
• BI: Binomial Model (Venkataraman, 2001)
• WL: Word Length Normalization (Khaitan et al., 2009)
All special cases/variations of MAP
Proposed Method
ME: Maximum Entropy Principle Model
• Special case of BI and WL (uniform)
• Prior P(s) computed using Microsoft Web N-gram (see the scoring sketch below)
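A rough sketch of how these transformation models can be written as scoring functions, under my reading of the slide; the exact parameterizations used in the paper are not reproduced here.

import math

def gm_score(words, log_p):
    # GM: geometric mean of per-word probabilities, i.e. the summed
    # log-probability normalized by the number of words.
    return sum(log_p(w) for w in words) / len(words)

def bi_log_transform(words, raw_len, p_break):
    # BI: binomial transformation model: of the raw_len - 1 possible
    # break positions, len(words) - 1 were actually broken.
    k, n = len(words) - 1, raw_len - 1
    return k * math.log(p_break) + (n - k) * math.log(1.0 - p_break)

def wl_log_transform(words, length_log_prob):
    # WL: score the break pattern by the log-probability of each
    # word's length under an assumed word-length distribution.
    return sum(length_log_prob(len(w)) for w in words)

# Per the slide, ME is a special case of BI and of WL (uniform); its
# exact parameter choices are not specified here.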
Microsoft Web N-gram (http://web-ngram.research.microsoft.com)
• Web documents/Bing queries (EN-US market)
• Rudimentary multilingual (NAACL 10)
• Frequent updates (ICASSP 09)
• Multi-style language model (WWW 10, SIGIR 10)
         Body    Title   Anchor   Query
1-gram   1.2 B   60 M    150 M    252 M
5-gram   237 B   3.8 B   8.9 B    -
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
Data Set
• 100K randomly sampled URLs indexed by Bing
– Simple tokenization
– 266K unique tokens
– Mostly ASCII characters
• Metric: Precision@3
– Manually labeled word breaks
– Multiple answers are allowed (see the sketch below)
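A small sketch of the metric as described on the slide; treating a token as correct when any of the top three candidates matches one of the labeled answers is an assumption about the exact definition.

def precision_at_3(system_top3, acceptable):
    # A token counts as correct if any of the system's top-3 segmentations
    # matches one of the manually labeled acceptable answers.
    return 1.0 if any(c in acceptable for c in system_top3[:3]) else 0.0

def corpus_precision_at_3(system_outputs, gold):
    # system_outputs: {token: [best, second, third]}
    # gold: {token: set of acceptable segmentations}
    return sum(precision_at_3(system_outputs[t], gold[t]) for t in gold) / len(gold)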
Language Model Style
[Chart: Precision@3 (93%–99%) by language model style (Body, Title, Query, Anchor), comparing 1-gram, 2-gram, 3-gram, and ME]
• Title is best although Body is 100x larger
• Navigational queries often word-split URLs, but Query is worse than Title
Matched style is crucial to precision!
Model Complexity
• With mismatched data, model choice is crucial
• With matched data, complex models do not help
[Chart: Precision@3 (95%–99%) with 3-gram priors on Body, Title, Query, and Anchor, comparing BI (2), WL (1), and ME (0)]
Simple model is sufficient with matched data!
Outline
• Web-Scale NLP
• Word Breaking
• Models
• Evaluation
• Conclusion
Best = Right Data + Smart Model
• Style of language trumps size of data
– There is no data like more data… provided it’s matched data!
• Right data alleviates the plug-in MAP problem
– Complicated machine learning artillery is not required; simple methods suffice
• Smart model gives us:
– Rudimentary multi-lingual capability
– Fast inclusion of new words/phrases
– No need for human labor in data labeling
http://research.microsoft.com/en-us/um/people/kuansanw/wordbreaker/
[Chart: Precision@3 (95%–99%) by n-gram order (1-gram to 3-gram) for Body, Title, Query, and Anchor language models]
Note: BI, WL are oracle results
GM       1-gram    2-gram    3-gram
Body     59.01%    44.68%    44.78%
Title    61.55%    60.31%    58.70%
Anchor   60.46%    55.25%    54.84%
Query    54.83%    54.27%    54.83%