
The Wild Thing Goes Mobile And Local

Kenneth Church and Bo Thiesson
Text Mining, Search and Navigation (TMSN)
Microsoft Corporation

Standard Word Wheeling (T9)

Better with Wild Cards!

Find k-best regex matches, subject to language model

Wild Thing Goes Mobile

Search
  Given
    An input pattern (regex), and
    A language model (LM)
      A list of queries and
      Their popularities in the MSN logs
  Find the k-best (most popular) matches
  Conceptually: grep pattern LM | sort -nr | head
  Heuristic speed-ups
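
The conceptual pipeline above (grep, then sort by count, then head) can be sketched in a few lines of Python. This is only an illustration: the toy language model, its counts, and the name `k_best` are assumptions, not data or code from the talk, and the sketch requires the pattern to match the whole query.

```python
import heapq
import re

# Hypothetical language model: query -> popularity count (made-up numbers).
LM = {
    "CLEVELAND OHIO": 120,
    "COLUMBUS OHIO": 300,
    "CHICAGO": 900,
    "CINCINNATI OHIO": 80,
    "OHIO STATE": 500,
}

def k_best(pattern, lm, k=10):
    """Return the k most popular queries that fully match the regex."""
    regex = re.compile(pattern)
    # "grep pattern LM": keep only the matching queries, paired with counts.
    matches = ((count, query) for query, count in lm.items()
               if regex.fullmatch(query))
    # "sort -nr | head": take the k largest counts.
    return [query for count, query in heapq.nlargest(k, matches)]

print(k_best(r"C.* OH.*", LM, k=2))
```

This scans the whole model on every call, which is what the heuristic speed-ups later in the talk aim to avoid.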

Wild Thing > Word Wheeling
  For surnames, filenames, URLs
  Regex
    More general than prefix matching
    /C.* OH.*/ >> /C.*/
  Challenge: Will users enter Wild Cards?

Implicit Wild Cards
  Added after each “word”
  Initials
    K F C
    N Y C
  Two Implicit Wild Cards
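
One way to read "added after each word" is as a rewrite of the typed text into the regex notation used elsewhere in these slides. A minimal sketch, assuming words are separated by spaces; the function name `implicit_wildcards` and the expansion "KENTUCKY FRIED CHICKEN" are illustrative choices, not taken from the talk.

```python
import re

def implicit_wildcards(typed):
    """Turn typed words into a regex with a wild card after each word."""
    return " ".join(re.escape(word) + ".*" for word in typed.split())

pattern = implicit_wildcards("K F C")
print(pattern)
print(bool(re.fullmatch(pattern, "KENTUCKY FRIED CHICKEN")))
```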

Phone Mode And Local

Phone Mode
  Regex notation
  7#6 → /[PQRS].* [MNO].*/
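
The phone-mode translation can be sketched as a table lookup on the standard telephone keypad. Treating `#` as the word separator is an assumption inferred from the single example above, and `phone_to_regex` is a hypothetical helper name.

```python
# Standard letter assignments on a telephone keypad.
KEYPAD = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}

def phone_to_regex(keys):
    """Translate key presses such as '7#6' into a regex over letters."""
    words = []
    for word in keys.split("#"):            # assumption: '#' separates words
        classes = "".join("[" + KEYPAD[digit] + "]" for digit in word)
        words.append(classes + ".*")        # implicit wild card after each word
    return " ".join(words)

print(phone_to_regex("7#6"))
```

The resulting pattern can then be matched against the language model like any other regex.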

Local
  Different Language Model
    Pr(query) → Pr(query | location)
  Local queries are different
    Local: Restaurants (Pizza)
    Non-local
      Web Services (e-mail, shopping)
      Entertainment (adult)

Goal: All Forms Go Wild
  But with different language models for different contexts


Demos here
  Condoleezza Rice
  Arnold Schwarzenegger
  Hot-mail programs

Wild Thing + Virtual Earth
  Better together here

Going Local

Different Expansions In Different Locations
  B C
    British Columbia
    Boeing Company
    Baptist Church
    Bible College
  * beach
    Waikiki
    Narragansett
    Pebble Beach
    Old Orchard
  F
    Detroit
    New London
  * high
  * school
  * univ
  * hospital
  * airport
  * river

One Letter Queries

Conclusions: Why Go Local?
  That’s where the money is
    All politics is local
    Ditto for classified ads
  It is nice to be able to search the world
    But I often want stuff near me
  It is nice to be able to drive my car anywhere
    But most accidents are not far from home

Geo-tagging URLs and Queries
  Method 1: Parse docs (hard)
  Method 2: Logs (easy)

Wild Thing Goes Local

Wild Thing
  Find the k-best matches
    Non-local: k-best ≡ Pr(query)
    Local: k-best ≡ Pr(query | location)
  Probabilities based on query logs
    Non-local case
      Conceptually, search list of queries in freq order
      Stop after finding k matches
    Local case
      Ditto, but store a different list for each location
  Local queries are different from non-local queries
    Lots of requests for pizza near x
    Lots of requests for Britney Spears
      But these are not local searches
      Apparently, not so many people want her nearby???
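
The "search in frequency order, stop after k matches" idea, with one list per location, might look like the sketch below. The lists, their ordering, the location names used as keys, and the fallback to the global list for unknown locations are all assumptions made for illustration.

```python
import re
from itertools import islice

# Hypothetical lists, each already sorted by descending frequency.
# The key None holds the non-local (global) list.
LISTS = {
    None: ["BRITNEY SPEARS", "BEST BUY", "BRITISH COLUMBIA", "BOEING COMPANY"],
    "Seattle": ["BOEING COMPANY", "BEST BUY", "BRITISH COLUMBIA"],
    "Vancouver": ["BRITISH COLUMBIA", "BEST BUY", "BOEING COMPANY"],
}

def k_best_local(pattern, k, location=None):
    """Scan the list for this location in frequency order; stop after k hits."""
    regex = re.compile(pattern)
    queries = LISTS.get(location, LISTS[None])   # assumed fallback: global list
    matches = (q for q in queries if regex.fullmatch(q))
    return list(islice(matches, k))              # lazy: scanning ends at k matches

print(k_best_local("B.* C.*", 1, "Seattle"))
print(k_best_local("B.* C.*", 1, "Vancouver"))
```

Because each list is pre-sorted, no sorting happens at query time; the cost is the storage of one list per location, which motivates the smoothing described next.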

Heuristic Speed-ups

Smoothing
  Computational and statistical motivations
    Can’t store/estimate Pr(query | location)
    For all queries everywhere
  Locations defined by a kd-tree
  Smoothing Rule: Counts → Parent
    Unless significantly larger than sibling’s counts
  One parameter: p (significance level)

[Figure: kd-tree of query counts, split by latitude and by longitude. After smoothing, most counts are 0; a leaf inherits counts from its ancestors.]
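
The slides do not say which significance test the smoothing rule uses. The sketch below assumes a one-sided binomial test against an even split between the two siblings; the `Cell` class, `tail_prob`, and `smooth` are hypothetical names, and each tree holds the counts of a single query.

```python
from dataclasses import dataclass
from math import comb
from typing import Optional

@dataclass
class Cell:
    """A kd-tree cell holding the count of one query inside its region."""
    count: int = 0
    left: Optional["Cell"] = None
    right: Optional["Cell"] = None

def tail_prob(k, n):
    """P[X >= k] for X ~ Binomial(n, 1/2): how surprising a lopsided split is."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def smooth(cell, p=0.05):
    """Move children's counts to the parent unless one child dominates."""
    if cell.left is None or cell.right is None:
        return
    smooth(cell.left, p)
    smooth(cell.right, p)
    big, small = sorted((cell.left, cell.right),
                        key=lambda c: c.count, reverse=True)
    total = big.count + small.count
    if total == 0:
        return
    if tail_prob(big.count, total) >= p:
        # Not significantly larger than its sibling: the big count moves up too.
        cell.count += big.count
        big.count = 0
    # The smaller sibling's count always moves up to the parent.
    cell.count += small.count
    small.count = 0

lopsided = Cell(left=Cell(30), right=Cell(1))
smooth(lopsided)
print(lopsided.count, lopsided.left.count, lopsided.right.count)
```

Under this reading, a smaller p keeps fewer counts at the leaves, so more cells end up at zero and inherit from an ancestor.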

Search Speed-Ups
  grep pattern LM | sort -nr | head
  Heuristic speed-up
    Generate candidates that might match
    Filter candidates with standard regex tool
  Generating candidates (Suffix Array)
    regex → substring
    /C.* OH.*/ → OH
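
The generate-then-filter idea can be sketched as follows. The sketch makes several simplifying assumptions: patterns consist only of literal text and `.*`, the suffix index is a plain sorted list of every suffix of every query, and no query contains the character U+FFFF. All function names are hypothetical.

```python
import bisect
import re

def build_suffix_index(queries):
    """Sort every suffix of every query; remember which query each came from."""
    entries = sorted((q[i:], n) for n, q in enumerate(queries)
                     for i in range(len(q)))
    return [s for s, _ in entries], [n for _, n in entries]

def required_substring(pattern):
    """Longest literal piece of a pattern built only from literals and '.*'."""
    return max((piece.strip() for piece in pattern.split(".*")), key=len)

def search(pattern, queries, suffixes, owners):
    sub = required_substring(pattern)
    # Suffixes that begin with sub form one contiguous block in sorted order.
    lo = bisect.bisect_left(suffixes, sub)
    hi = bisect.bisect_right(suffixes, sub + "\uffff")
    candidates = sorted(set(owners[lo:hi]))      # queries that contain sub
    # Filter the candidates with the standard regex tool.
    return [queries[n] for n in candidates
            if re.fullmatch(pattern, queries[n])]

queries = ["CLEVELAND OHIO", "COLUMBUS OHIO", "CHICAGO",
           "OHIO STATE", "COLUMBUS GEORGIA"]
suffixes, owners = build_suffix_index(queries)
print(search("C.* OH.*", queries, suffixes, owners))
```

Here "OHIO STATE" is generated as a candidate because it contains OH, and the regex filter then rejects it.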

Popularity Modification
  Suffix arrays designed for all matches (not k-best)
  Single sort order → Two
    Alphabetic Order + Popularity
    Alternate on odd and even levels (like a kd-tree)

Standard Suffix Arrays
  Sort
  Suffix arrays: Designed to find
    Frequency and
    Location
    Of pattern (substring)
[Figure labels: First “To Be”, Last “To Be”]
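
A standard suffix array finds the first and last sorted suffix that begins with the pattern; the distance between them is the frequency and the stored offsets are the locations. A minimal sketch, with the sample text chosen to echo the figure labels (the actual text in the figure is not recoverable) and a naive construction that is fine only for short inputs:

```python
def suffix_array(text):
    """Offsets of all suffixes of text, in sorted order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bound(text, sa, pat, strict):
    """First rank whose suffix prefix is >= pat (or > pat when strict)."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + len(pat)]
        if prefix < pat or (strict and prefix == pat):
            lo = mid + 1
        else:
            hi = mid
    return lo

def find(text, sa, pat):
    """Return (frequency, sorted locations) of pat in text."""
    first = bound(text, sa, pat, strict=False)
    last = bound(text, sa, pat, strict=True)
    return last - first, sorted(sa[first:last])

text = "to be or not to be"
sa = suffix_array(text)
print(find(text, sa, "to be"))
```

Both boundaries cost a binary search, which is where the O(log N) figure for the standard structure comes from.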

Single Sort Order → Two
  Alphabetic and popularity
  Standard App
    Find all matches
  Modify Data Structure
    To find k-best
  Search
    On alphabetic splits
      Do the standard thing
    On popularity splits, go left (pop)
      Stop if you have found k matches
      Otherwise, go right, if you have to
[Figure labels: Sort by 1st order, Sort by 2nd order, Sort by 1st]
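
The search rules above can be sketched as a small tree that alternates alphabetic and popularity splits. This is a simplified illustration, not the authors' data structure: it indexes whole queries and matches by prefix rather than indexing suffixes, it assumes alphabetic splits sit on even depths, it prunes with each subtree's smallest and largest key, and it merges both sides at alphabetic splits to keep the result exact. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Item = Tuple[str, int]                      # (query, popularity)

@dataclass
class Node:
    lo: str                                 # smallest key in this subtree
    hi: str                                 # largest key in this subtree
    by_pop: bool                            # True on popularity splits
    items: Optional[List[Item]] = None      # set on leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(items, depth=0):
    keys = [q for q, _ in items]
    node = Node(lo=min(keys), hi=max(keys), by_pop=(depth % 2 == 1))
    if len(items) == 1:
        node.items = list(items)
        return node
    if node.by_pop:
        order = sorted(items, key=lambda it: -it[1])   # most popular first
    else:
        order = sorted(items, key=lambda it: it[0])    # alphabetic
    mid = len(order) // 2
    node.left = build(order[:mid], depth + 1)
    node.right = build(order[mid:], depth + 1)
    return node

def search(node, prefix, k):
    """Up to k most popular items whose key starts with prefix."""
    m = len(prefix)
    if k <= 0 or not (node.lo[:m] <= prefix <= node.hi[:m]):
        return []                           # no key here can start with prefix
    if node.left is None:                   # leaf
        return [it for it in node.items if it[0].startswith(prefix)][:k]
    left = search(node.left, prefix, k)     # go left first
    if node.by_pop:
        if len(left) >= k:
            return left                     # everything on the right is less popular
        return left + search(node.right, prefix, k - len(left))
    right = search(node.right, prefix, k)   # alphabetic split: may need both sides
    return sorted(left + right, key=lambda it: -it[1])[:k]

data = [("cat", 5), ("car", 9), ("cab", 2), ("dog", 7), ("cow", 8), ("cap", 1)]
root = build(data)
print(search(root, "ca", 2))
```

The early return on popularity splits is the step that lets the search stop once k matches are in hand.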

Modified Suffix Array Time Complexity
  O(log N) → O(sqrt(N))
  Worst case: Pattern with 0 matches
    Alphabetic splits are same as before
    Unfortunately, popularity splits don’t help
      Have to go both left and right everywhere (for 0 matches)
  Let
    P(N) be work to process N items on popularity splits
    A(N) be work to process N items on alphabetic splits
  In worst case
    A(N) = P(N/2) + C_2
    P(N) = 2A(N/2) + C_1
  Therefore, P(N) = C_3 sqrt(N) + C_4
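
The step from the two recurrences to the square-root bound is not shown on the slide. One way to fill it in, assuming N is a power of 4 and writing P(1) for the base-case cost (a symbol not used in the talk):

```latex
\begin{aligned}
P(N) &= 2A(N/2) + C_1 = 2\bigl(P(N/4) + C_2\bigr) + C_1 = 2P(N/4) + (C_1 + 2C_2) \\
     &= 2^{j} P\!\left(N/4^{j}\right) + (2^{j} - 1)(C_1 + 2C_2)
        \qquad \text{after } j \text{ substitutions} \\
     &= \sqrt{N}\,P(1) + (\sqrt{N} - 1)(C_1 + 2C_2)
        \qquad \text{at } j = \log_4 N,\ \text{since } 2^{\log_4 N} = \sqrt{N} \\
     &= C_3 \sqrt{N} + C_4,
        \qquad C_3 = P(1) + C_1 + 2C_2,\quad C_4 = -(C_1 + 2C_2).
\end{aligned}
```

The work doubles at every popularity split but the problem size only shrinks by a factor of four per pair of levels, which is what produces the square root.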

Conclusions
  Personalization and collaborative filtering
    To find stuff you search for a lot
      Or other people search for a lot
    You shouldn’t have to type a lot
  Wild Thing
    Users enter wild cards anywhere
      Implicitly or Explicitly
    System finds k-best expansions
      Matching their Favorites and Hot Stuff
[Figure labels: Favorites (Personalization), Hot Stuff]

Simple Uniform Look-And-Feel
  Simple, easy to use
    Even if you can’t spell, type…
    Even Bo’s 3-year-old can do it

Goal: All Forms Go Wild
  Uniform Look-and-Feel
    Currently, different systems are different
      Internet Browser
        Address Bar remembers where you’ve been
        Forms auto-populate name, credit card numbers, etc.
      Outlook
        Remembers your favorite e-mail addresses
  Wild Thing
    Means different things to different people
    Encourage use of wild cards
      Implicit as well as explicit

A Children’s Story
  With apologies to Hippos Go Berserk!
    Wild Thing Goes Mobile!
    Wild Thing Goes Local!
    All Forms Go Wild!!!
  For the young adult
    Wild Thing: You Make My Phone Sing!

© 2006 Microsoft Corporation. All rights reserved.
Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.