the wild thing goes mobile and local kenneth church and bo thiesson text mining, search and...
TRANSCRIPT
The Wild Thing Goes Mobile And Local
Kenneth Church and Bo Thiesson
Text Mining, Search and Navigation (TMSN)
Microsoft Corporation
Standard Word Wheeling (T9)
Better with Wild Cards!
Find k-best regex matches, subject to language model
Wild Thing Goes Mobile
Search
Given
  An input pattern (regex), and
  A language model (LM)
    A list of queries and
    Their popularities in the MSN logs
Find the k-best (most popular) matches
Conceptually
  grep pattern LM | sort -nr | head
  Heuristic speed-ups
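The conceptual pipeline on this slide (filter the language model with the pattern, sort by popularity, keep the head) can be sketched in a few lines. This is only an illustration, not the system's implementation: the function name `k_best`, the dictionary `TOY_LM`, and its counts are made up for the example, and full-string, case-insensitive matching is an assumption.

```python
import re
from heapq import nlargest

# Hypothetical toy language model: query -> popularity count (invented numbers).
TOY_LM = {
    "columbus ohio": 120,
    "cleveland ohio": 90,
    "cincinnati ohio": 75,
    "chicago": 300,
    "costco": 250,
}


def k_best(pattern: str, lm: dict[str, int], k: int = 10) -> list[tuple[str, int]]:
    """Return the k most popular queries in lm that match pattern as a whole."""
    regex = re.compile(pattern, re.IGNORECASE)
    # "grep pattern LM": keep only the queries that match.
    matches = ((query, count) for query, count in lm.items() if regex.fullmatch(query))
    # "sort -nr | head": take the k largest by popularity.
    return nlargest(k, matches, key=lambda item: item[1])


print(k_best("c.* oh.*", TOY_LM, k=2))
# [('columbus ohio', 120), ('cleveland ohio', 90)]
```

This version scans the whole model on every call, which is what the heuristic speed-ups are meant to avoid.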
Wild Thing > Word Wheeling
  For surnames, filenames, URLs
Regex
  More general than prefix matching
  /C.* OH.*/ >> /C.*/
Challenge: Will users enter Wild Cards?
Implicit Wild Cards
  Added after each “word”
Initials
  K F C
  N Y C
Two Implicit Wild Cards
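As a rough sketch of the implicit wild cards described above, the typed words can be turned into a regex by appending `.*` to each one. The helper name `add_implicit_wildcards` is hypothetical, and the sketch assumes the input contains only literal words (explicit wild cards typed by the user are not handled).

```python
import re


def add_implicit_wildcards(typed: str) -> str:
    """Append '.*' to each whitespace-separated word: 'C OH' -> 'C.* OH.*'."""
    # re.escape keeps any punctuation in a word literal (assumption: no explicit wild cards).
    return " ".join(re.escape(word) + ".*" for word in typed.split())


pattern = add_implicit_wildcards("K F C")
print(pattern)                                                 # K.* F.* C.*
print(bool(re.fullmatch(pattern, "Kentucky Fried Chicken")))   # True
```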
Phone Mode And Local
Phone Mode
  Regex notation
  7#6 → /[PQRS].* [MNO].*/
Local
  Different Language Model
  Pr(query) → Pr(query | location)
Local queries are different
  Local: Restaurants (Pizza)
  Non-local
    Web Services (e-mail, shopping)
    Entertainment (adult)
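The phone-mode example above (7#6 becoming /[PQRS].* [MNO].*/) can be reproduced with a small translation table. This is a sketch under assumptions: the function name `phone_to_regex` is invented, `#` is read as the word separator because that is what the slide's example implies, and the letters are those of the standard telephone keypad.

```python
# Standard telephone keypad letters for the digit keys 2-9.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}


def phone_to_regex(keys: str, separator: str = "#") -> str:
    """Translate key presses into a regex, one character class per digit.

    Each word (run of digits between separators) gets an implicit '.*' at the end.
    """
    words = keys.split(separator)
    return " ".join(
        "".join(f"[{KEYPAD[digit]}]" for digit in word) + ".*" for word in words
    )


print(phone_to_regex("7#6"))   # [PQRS].* [MNO].*
```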
Goal: All Forms Go Wild
  But with different language models for different contexts
Demos here
  Condoleezza Rice
  Arnold Schwarzenegger
  Hot-mail programs
Wild Thing + Virtual Earth
  Better together here
Going Local
Different Expansions In Different Locations
B C
  British Columbia
  Boeing Company
  Baptist Church
  Bible College
* beach
  Waikiki
  Narragansett
  Pebble Beach
  Old Orchard
F
  Detroit
  New London
* high
* school
* univ
* hospital
* airport
* river
One Letter Queries
Conclusions: Why Go Local?
That’s where the money is
  All politics is local
  Ditto for classified ads
It is nice to be able to search the world
  But I often want stuff near me
It is nice to be able to drive my car anywhere
  But most accidents are not far from home
Geo-tagging URLs and Queries
  Method 1: Parse docs (hard)
  Method 2: Logs (easy)
Wild Thing Goes Local
Wild Thing
  Find the k-best matches
  Non-local: k-best ≡ Pr(query)
  Local: k-best ≡ Pr(query|location)
Probabilities based on query logs
  Non-local case
    Conceptually, search list of queries in freq order
    Stop after finding k matches
  Local case
    Ditto, but store a different list for each location
Local queries are different from non-local queries
  Lots of requests for pizza near x
  Lots of requests for Britney Spears
    But these are not local searches
    Apparently, not so many people want her nearby???
Heuristic Speed-ups
Smoothing
Computational and statistical motivations
  Can’t store/estimate Pr(query | location) for all queries everywhere
Locations defined by a kd-tree
Smoothing Rule: Counts → Parent
  Unless significantly larger than sibling’s counts
  One parameter: p (significance level)
[Figure: kd-tree of query counts, split by latitude and then by longitude. After smoothing, most counts are 0; a leaf inherits counts from its ancestors.]
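One possible reading of the smoothing rule, for the counts of a single query, is sketched below. The slides give only the rule and the parameter p, so the significance test is an assumption: here a one-sided binomial test asks whether a child's count is larger than its sibling's by more than an even split would explain. The names `Node`, `binomial_tail`, and `smooth` are hypothetical.

```python
from dataclasses import dataclass
from math import comb
from typing import Optional


@dataclass
class Node:
    """One kd-tree cell holding the count of a single query in that cell."""
    count: int = 0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def binomial_tail(a: int, b: int) -> float:
    """P(X >= a) for X ~ Binomial(a + b, 0.5): how likely is a split this lopsided?"""
    n = a + b
    return sum(comb(n, i) for i in range(a, n + 1)) / 2 ** n


def smooth(node: Node, p: float) -> None:
    """Bottom-up: move each child's count to the parent unless it is
    significantly larger than its sibling's count at level p (assumed test)."""
    if node.left is None or node.right is None:
        return
    smooth(node.left, p)
    smooth(node.right, p)
    a, b = node.left.count, node.right.count
    for child, mine, other in ((node.left, a, b), (node.right, b, a)):
        keep = mine > 0 and binomial_tail(mine, other) < p
        if not keep:
            node.count += mine   # the parent absorbs the count
            child.count = 0      # so most cells end up with a count of 0


tree = Node(left=Node(count=30), right=Node(count=1))
smooth(tree, p=0.05)
print(tree.count, tree.left.count, tree.right.count)   # 1 30 0
```

With counts of 30 and 1, the larger cell keeps its own count while the smaller one is folded into the parent; with counts of 4 and 2 neither is significant at p = 0.05, so both move up.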
Search Speed-Ups
grep pattern LM | sort -nr | head
Heuristic speed-up
  Generate candidates that might match
  Filter candidates with standard regex tool
Generating candidates (Suffix Array)
  regex → substring
  /C.* OH.*/ → OH
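The generate-then-filter idea above can be sketched with a naive suffix index: pick a literal substring from the pattern, binary-search for the queries that contain it, then run the full regex only on those candidates. The sketch makes several assumptions: patterns consist only of literal text and `.*`, the index is a plain sorted list of suffixes rather than a space-efficient suffix array, and all function names and the sample queries are invented.

```python
import re
from bisect import bisect_left


def build_suffix_index(queries: list[str]) -> list[tuple[str, int]]:
    """Sorted (suffix, query_id) pairs for every suffix of every query."""
    return sorted((q[i:], qid) for qid, q in enumerate(queries) for i in range(len(q)))


def longest_literal(pattern: str) -> str:
    """Longest literal piece of a pattern built only from literals and '.*'."""
    return max((piece.strip() for piece in pattern.split(".*")), key=len)


def candidates(index: list[tuple[str, int]], substring: str) -> set[int]:
    """Ids of queries containing substring: suffixes that start with it are contiguous."""
    lo = bisect_left(index, (substring, -1))
    hi = bisect_left(index, (substring + "\uffff", -1))
    return {qid for _, qid in index[lo:hi]}


def search(pattern: str, queries: list[str], index: list[tuple[str, int]]) -> list[str]:
    """Generate candidates from the suffix index, then filter with the full regex."""
    regex = re.compile(pattern)
    ids = candidates(index, longest_literal(pattern))
    return [queries[i] for i in sorted(ids) if regex.fullmatch(queries[i])]


QUERIES = ["COLUMBUS OHIO", "CHICAGO", "JOHN DEERE", "CLEVELAND OHIO"]
INDEX = build_suffix_index(QUERIES)
print(longest_literal("C.* OH.*"))            # OH
print(search("C.* OH.*", QUERIES, INDEX))     # ['COLUMBUS OHIO', 'CLEVELAND OHIO']
```

"JOHN DEERE" is generated as a candidate because it contains OH, and the regex filter then rejects it.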
Popularity Modification
Suffix arrays designed for all matches (not k-best)
Single sort order → Two
  Alphabetic Order + Popularity
  Alternate on odd and even levels (like a kd-tree)
Sort
Suffix arrays: Designed to find
  Frequency and
  Location
  Of pattern (substring)
[Figure labels: First “To Be”, Last “To Be”]
Single Sort Order → Two
  Alphabetic and popularity
Standard App
  Find all matches
Modify Data Structure
  To find k-best
Search
  On alphabetic splits
    Do the standard thing
  On popularity splits, go left (pop)
    Stop if you have found k matches
    Otherwise, go right, if you have to
[Figure labels: Sort by 1st order, Sort by 2nd order, Sort by 1st]
Modified Suffix Array Time Complexity
O(log N) → O(sqrt(N))
Worst case: Pattern with 0 matches
  Alphabetic splits are same as before
  Unfortunately, popularity splits don’t help
    Have to go both left and right everywhere (for 0 matches)
Let
  P(N) be work to process N items on popularity splits
  A(N) be work to process N items on alphabetic splits
In worst case
  A(N) = P(N/2) + C_2
  P(N) = 2A(N/2) + C_1
Therefore, P(N) = C_3 sqrt(N) + C_4
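The square-root bound follows by substituting one recurrence into the other. The short derivation below assumes N is a power of 4 and introduces the shorthand c = C_1 + 2C_2, which is not in the slides.

```latex
\begin{align*}
P(N) &= 2A(N/2) + C_1 = 2\bigl(P(N/4) + C_2\bigr) + C_1 = 2P(N/4) + c,
  \qquad c = C_1 + 2C_2 \\
P(4^m) &= 2^m P(1) + c\,(2^m - 1) \\
P(N) &= \bigl(P(1) + c\bigr)\sqrt{N} - c \qquad \text{since } 2^m = \sqrt{N}
\end{align*}
```

Under these assumptions the constants on the slide are C_3 = P(1) + c and C_4 = -c. The work quadruples in N for every doubling in cost, which is where sqrt(N) comes from.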
Conclusions
Personalization and collaborative filtering
  To find stuff you search for a lot
  Or other people search for a lot
  You shouldn’t have to type a lot
Wild Thing
  Users enter wild cards anywhere
    Implicitly or Explicitly
  System finds k-best expansions
    Matching their Favorites and
    Hot Stuff
[Figure labels: Favorites (Personalization), Hot Stuff]
Simple Uniform Look-And-Feel
Simple, easy to use
  Even if you can’t spell, type…
  Even Bo’s 3-year-old can do it
Goal: All Forms Go Wild
Uniform Look-and-Feel
Currently, different systems are different
  Internet Browser
    Address Bar remembers where you’ve been
    Forms auto-populate names, credit card numbers, etc.
  Outlook
    Remembers your favorite e-mail addresses
Wild Thing
  Means different things to different people
Encourage use of wild cards
  Implicit as well as explicit
A Children’s Story
  With apologies to Hippos Go Berserk!
  Wild Thing Goes Mobile!
  Wild Thing Goes Local!
  All Forms Go Wild!!!
For the young adult
  Wild Thing: You Make My Phone Sing!
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.