Spelling Correction for Search Engine Queries
Bruno Martins and Mario J. Silva
Proceedings of EsTAL-04, España for Natural Language Processing (2004)
Swapnil Chhajer
http://schhajer.co.nr
Topics Covered in Class
• Peter Norvig’s Spelling Corrector: Query Processing [33-35]
• Levenshtein Algorithm: Query Processing [36-41]
• Evaluation Metrics: Precision & Recall: Introduction to Information Retrieval [16]
• Soundex Algorithm: Query Processing [18]
April 16, 2013 Spelling Correction for Search Engine Queries
Motivation & Abstract
• Misspelled queries retrieve pages that contain the same misspelled words, leaving out the most relevant pages.
• 10-12% of queries are misspelled.
• Goal: give the user the best possible match instead of making the user choose one of the possible corrections from a correction list.
Google: Spelling Correction
Spelling Correction
• Uses
  • Correcting documents being indexed
  • Retrieving matching documents when the query contains a spelling error
• Flavors
  • Isolated words
    • Check each word on its own
    • Cannot catch typos that are themselves correctly spelled words, e.g., “from” vs. “form”
  • Context-sensitive
    • Look at surrounding words, e.g., “I flew form Heathrow to Narita.”
“a paragraph cud half mini flaws but wood bee past by the isolated spill checker”
General Issues in Spelling Correction
• UI
  • “Did you mean” works for one suggestion.
  • What about multiple possible corrections?
• Computational cost
  • Spelling correction is potentially expensive.
  • Avoid running it on every query.
  • Maybe run it only on queries that match few documents.
  • Guess: spelling correction in major search engines is efficient enough to be run on every query.
Kinds of Spelling Mistakes: Typos
• Wrong characters typed by mistake
• Grouped mainly into 4 categories:
  • Insertions (extra letter)
    • “prejudice” as “prejudsice”
  • Deletions (missing letter)
    • “plaintiff” as “paintiff”, “judgment” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
  • Substitutions (wrong letter)
    • “habeas” as “haceas”, “appellate” as “appellare”
  • Transpositions (two adjacent letters swapped)
    • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”, “subpoena” as “subpeona”, “plaintiff” as “plaitniff”
• 80-95% of typos differ from the correct spelling in just one of the four ways.
• Keyboard layout is important in such cases.
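The four categories above can be told apart mechanically. The following is a minimal sketch, not taken from the paper: `classify_typo` is a hypothetical helper that names the single edit separating a correct word from what was typed, and returns "other" when more than one edit is needed.

```python
def classify_typo(correct: str, typed: str) -> str:
    """Name the single edit that turns `correct` into `typed` (hypothetical helper)."""
    if correct == typed:
        return "none"
    # Skip the prefix that both strings share.
    i = 0
    while i < min(len(correct), len(typed)) and correct[i] == typed[i]:
        i += 1
    # One extra letter was typed at position i.
    if len(typed) == len(correct) + 1 and correct[i:] == typed[i + 1:]:
        return "insertion"
    # One letter of the correct word is missing at position i.
    if len(typed) == len(correct) - 1 and correct[i + 1:] == typed[i:]:
        return "deletion"
    if len(typed) == len(correct):
        # Only position i differs.
        if correct[i + 1:] == typed[i + 1:]:
            return "substitution"
        # Positions i and i + 1 are swapped, everything else matches.
        if (i + 1 < len(correct)
                and correct[i] == typed[i + 1]
                and correct[i + 1] == typed[i]
                and correct[i + 2:] == typed[i + 2:]):
            return "transposition"
    return "other"
```

Applied to the slide's examples, “prejudsice” is an insertion, “paintiff” a deletion, “haceas” a substitution, and “fruad” a transposition, while a phonetic error such as “supena” for “subpoena” falls outside the single-edit cases.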
Kinds of Spelling Mistakes: Brainos
• Wrong characters typed on purpose
• Most common type of mistake in general web queries
• Mistakes derive from pronunciation, spelling, or semantic confusions
• Brainos: sound-alikes (phonetic errors)
  • “subpoena” as “supena”, “voir” as “voire”, “latter” as “ladder”, “withholding” as “witholding”, “foreclosure” as “forclosure”
• Brainos: confusions
  • “preclusion” as “perclusion”, “men” as “mans”, “juries” as “jurys” or “jureys”, “dramshop” as “dram shop”
Dictionary Storage: Ternary Search Trees (TST)
• Data structure: ternary search tree (TST)
  • A type of trie, limited to 3 children per node.
  • A trie is the common name for a tree storing strings, in which there is one node for every common prefix and the strings are stored in extra leaf nodes.
• Searching: O(log(n)+k)
  • n: number of strings in the tree
  • k: length of the string being searched for
TST Continued…
Figure: A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”, all with an associated frequency of 1
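A ternary search tree like the one in the figure can be sketched in a few lines. This is an illustrative sketch only: the class and field names (`Node`, `TernarySearchTree`, `lo`, `eq`, `hi`) are assumptions, and for brevity the frequency is kept on the node of a word's last character rather than in a separate leaf node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    char: str
    freq: int = 0                  # > 0 means a stored word ends at this node
    lo: Optional["Node"] = None    # words whose current character is smaller
    eq: Optional["Node"] = None    # continues with the next character
    hi: Optional["Node"] = None    # words whose current character is larger


class TernarySearchTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, word: str, freq: int = 1) -> None:
        if not word:
            raise ValueError("cannot store an empty word")
        self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node: Optional[Node], word: str, i: int, freq: int) -> Node:
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.char:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq += freq      # last character: record the frequency
        return node

    def frequency(self, word: str) -> int:
        """Return the stored frequency of `word`, or 0 if it is absent."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.char:
                node = node.lo
            elif ch > node.char:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.freq
        return 0


tree = TernarySearchTree()
for w in ("to", "too", "toot", "tab", "so"):
    tree.insert(w)                 # each word gets frequency 1, as in the figure
```

A lookup walks at most one `eq` link per character of the query, plus the `lo`/`hi` steps needed to find each character, which is where the k term and the log(n) term in the search cost come from.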
Spelling Correction Algorithm
• Implemented using edit distance, rule-based techniques, n-grams, probabilistic techniques, neural nets, similarity key techniques, or combinations of these.
• Goal: find the edit distance based on different strategies.
  • A shorter distance implies a better correction.
• Soundex system:
  • Indexing based on sound.
  • Devised to help with the problem of phonetic errors.
• Metaphone systems:
  • Specific to the English language
  • Transform words into codes based on phonetic properties
  • Based on consonants & diphthongs
• Spelling correction for the web
  • Context-dependent correction is of little use, since users rarely type more than three terms per query.
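As a concrete reference point for "shorter distance implies a better correction", here is a minimal sketch of the classic Levenshtein distance covered in class. It is one possible edit-distance strategy, not necessarily the measure used in the paper; note that it counts a transposition as two edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]                                  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # substitute or match
        prev = curr
    return prev[-1]
```

With this measure, “haceas” and “paintiff” are each at distance 1 from the intended word, while the transposition “fruad” is at distance 2 from “fraud”.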
Spelling Correction Algorithm Continued…
• The user-entered query is tokenized, ignoring non-word characters.
• All words are converted to lower case and checked for correct spelling.
• Frequencies are updated for correctly spelled words. This acts as feedback to the system.
• The feedback can help the spell checker predict patterns in users’ searches.
• Misspelled words are replaced by correctly spelled words.
• Finally, a new query is presented to the user as a suggestion, together with the results page for the original query.
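The steps above can be sketched as a small pipeline. This is a hedged illustration: `suggest_query` and its `correct` callback are hypothetical names, and a `Counter` stands in for the dictionary with frequencies that the paper keeps in a ternary search tree.

```python
import re
from collections import Counter
from typing import Callable


def suggest_query(query: str,
                  frequencies: Counter,
                  correct: Callable[[str], str]) -> str:
    """Build a suggested query; `correct` is an assumed callback that maps
    a misspelled word to its best correction."""
    # Tokenize, ignoring non-word characters, and lower-case every word.
    words = re.findall(r"\w+", query.lower())
    suggestion = []
    for word in words:
        if word in frequencies:
            frequencies[word] += 1          # feedback: the word was seen again
            suggestion.append(word)
        else:
            suggestion.append(correct(word))  # replace the misspelled word
    return " ".join(suggestion)


# Usage with a toy dictionary and a toy correction function.
freqs = Counter({"spelling": 1, "checker": 1})
toy_correct = lambda w: {"chekcer": "checker"}.get(w, w)
suggested = suggest_query("Spelling, chekcer!", freqs, toy_correct)
```

In a real system the suggestion would be shown alongside the results for the original query, as the last bullet describes.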
• The algorithm is divided into 2 phases:
  • Phase 1: generate a set of candidate suggestions
  • Phase 2: select the best choice among those suggestions
• Phase 1
  • 9 steps; at each step, look up the dictionary for words that relate to the original misspelling:
    • Differ in one character from the original word.
    • Differ in two characters from the original word.
    • Differ in one letter removed or added.
    • Differ in one letter removed or added, plus one letter different.
    • Differ in repeated characters removed.
    • Correspond to 2 concatenated words (space between words eliminated).
    • Differ in having two consecutive letters exchanged & 1 character different.
    • Have the original word as a prefix.
    • Differ in repeated characters removed & 1 character different.
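A few of the nine Phase 1 steps can be sketched directly. This is a rough illustration under stated assumptions: the dictionary is a plain Python `set` rather than a ternary search tree, the alphabet is lowercase ASCII, the function names are hypothetical, and the combined steps (for example, one letter removed plus one letter different) are omitted.

```python
import re
import string

ALPHABET = string.ascii_lowercase   # assumption: unaccented lowercase letters


def one_char_different(word: str, dictionary: set) -> set:
    variants = {word[:i] + c + word[i + 1:]
                for i in range(len(word)) for c in ALPHABET if c != word[i]}
    return variants & dictionary


def one_letter_removed_or_added(word: str, dictionary: set) -> set:
    removed = {word[:i] + word[i + 1:] for i in range(len(word))}
    added = {word[:i] + c + word[i:]
             for i in range(len(word) + 1) for c in ALPHABET}
    return (removed | added) & dictionary


def repeated_chars_removed(word: str, dictionary: set) -> set:
    # One interpretation: shorten each run of a repeated character.
    variants = set()
    for m in re.finditer(r"(.)\1+", word):
        for keep in range(1, len(m.group(0))):
            variants.add(word[:m.start()] + m.group(1) * keep + word[m.end():])
    return variants & dictionary


def two_concatenated_words(word: str, dictionary: set) -> set:
    return {word[:i] + " " + word[i:] for i in range(1, len(word))
            if word[:i] in dictionary and word[i:] in dictionary}


def has_word_as_prefix(word: str, dictionary: set) -> set:
    return {w for w in dictionary if w != word and w.startswith(word)}


def generate_candidates(word: str, dictionary: set) -> dict:
    """Run the sketched steps in order and report the matches per step."""
    return {
        "one character different": one_char_different(word, dictionary),
        "one letter removed or added": one_letter_removed_or_added(word, dictionary),
        "repeated characters removed": repeated_chars_removed(word, dictionary),
        "two concatenated words": two_concatenated_words(word, dictionary),
        "original word as prefix": has_word_as_prefix(word, dictionary),
    }
```

Scanning the whole dictionary for prefixes is only acceptable for a toy set; a trie or ternary search tree answers that query by walking down from the prefix node instead.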
• Phase 2: heuristics used
  • Return the candidate if it differs only in accented characters.
  • Return the candidate if it differs in only one character, with the error corresponding to an adjacent letter in the same row of the keyboard.
  • Return the smallest candidate, if there are solutions having the same metaphone key as the original string.
  • Return the candidate if it differs in only one character, with the error corresponding to an adjacent letter in an adjacent row of the keyboard.
  • Finally, return the last word.
• Heuristics are applied sequentially, moving on to the next only if no matching words are found.
• If more than one word matches, return the one whose first character matches the original.
• If there is still more than one, choose the word with the highest frequency.
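The sequential selection can be sketched as a loop over heuristic predicates. This is an illustration only: it implements just the accent and same-row heuristics, assumes a QWERTY layout, and uses hypothetical names throughout; the metaphone and adjacent-row heuristics are left out.

```python
import unicodedata

QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")  # assumed keyboard layout


def strip_accents(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accents_only(word: str, cand: str) -> bool:
    return word != cand and strip_accents(word) == strip_accents(cand)


def same_row_typo(word: str, cand: str) -> bool:
    """True if exactly one character differs and the two keys are
    neighbours in the same keyboard row."""
    if len(word) != len(cand):
        return False
    diffs = [(x, y) for x, y in zip(word, cand) if x != y]
    if len(diffs) != 1:
        return False
    typed, intended = diffs[0]
    for row in QWERTY_ROWS:
        if typed in row and intended in row:
            return abs(row.index(typed) - row.index(intended)) == 1
    return False


HEURISTICS = (accents_only, same_row_typo)   # applied in this order


def break_ties(word: str, matches: list, frequencies: dict) -> str:
    if len(matches) > 1:
        # Prefer candidates whose first character matches the original.
        same_first = [c for c in matches if c[:1] == word[:1]]
        matches = same_first or matches
    # If several remain, take the most frequent one.
    return max(matches, key=lambda c: frequencies.get(c, 0))


def select(word: str, candidates: list, frequencies: dict):
    for heuristic in HEURISTICS:
        matches = [c for c in candidates if heuristic(word, c)]
        if matches:
            return break_ties(word, matches, frequencies)
    # No heuristic matched: fall back to the last candidate.
    return candidates[-1] if candidates else None
```

For instance, under these assumptions “appellare” resolves to “appellate” because r and t are neighbours in the top row, and “informacao” resolves to “informação” through the accent heuristic.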
Results Comparison
• Aspell spell checker
  • http://aspell.sourceforge.net/
  • Aspell uses the Metaphone algorithm with a near-miss strategy.
  • 48.33% of the correct forms were guessed correctly.
  • The presented spelling checker outperformed Aspell by 1.66%.
Table legend: “*” means the checker does not detect the misspelling; “-” means it failed to return a suggestion.
Results Comparison Continued…
• Tumba!: search engine for the Portuguese web
Table: Results from spelling checker with Tumba!
Conclusion & Future Work
• The spelling checker uses a ternary search tree data structure for storing the dictionary.
• The data source was two popular Portuguese newspapers.
• Search engine queries may contain company or person names. In such cases, keeping two dictionaries, one in the TST used for correction and another in a hash table used only for checking valid words, could yield good results.
Pros & Cons
• Pros
  • Considers various factors affecting edit distance, including probabilistic estimations.
  • Uses a feedback system to improve the quality of the results for user queries.
• Cons
  • Does not consider context-sensitive spell checking.
  • The system is not language independent; it focuses mainly on Portuguese words.
  • No discussion of spell-corrected completion suggestions as a query is entered incrementally.