keys to building a multilingual search engine thierry sourbier
TRANSCRIPT
Keys to Building a Multilingual Search Engine
Thierry Sourbier
Client-Side (browser)
How to make the best use of the browsers when dealing with multiple languages
Server-Side
How to provide efficient multilingual information retrieval

Search Engine Overview
[Diagram labels: Submit query, Display results, Process query, Create index, HTTP]

Overview of the Server-side
Index creation steps:
  Normalization: gives the pages a standard format
  Segmentation: breaks the pages into units that will be stored in the index
  Index building
Query processing steps:
  Normalization: makes sure that the query has the same format as the indexed pages
  Segmentation: breaks the query into units that will be looked up in the index
  Index search
Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
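
The two pipelines share their first two steps, which can be sketched as follows. This is a minimal illustration in Python, not the speaker's implementation: the names `normalize`, `segment`, `build_index`, and `search` are hypothetical, and the normalization (lowercasing) and segmentation (splitting on white space) are deliberately simplistic placeholders.

```python
from collections import defaultdict

def normalize(text):
    # Placeholder normalization: lowercase and collapse white space.
    return " ".join(text.lower().split())

def segment(text):
    # Placeholder segmentation: split on white space.
    return text.split()

def build_index(pages):
    # pages: dict mapping a page id to its raw text.
    index = defaultdict(set)
    for page_id, raw in pages.items():
        for unit in segment(normalize(raw)):
            index[unit].add(page_id)
    return index

def search(index, query):
    # The query goes through the same normalization and segmentation
    # as the pages, then each unit is looked up in the index.
    units = segment(normalize(query))
    if not units:
        return set()
    hits = [index.get(unit, set()) for unit in units]
    return set.intersection(*hits)

pages = {"p1": "Unicode   Conference", "p2": "Search engines"}
index = build_index(pages)
print(search(index, "UNICODE conference"))
```

Only `normalize` and `segment` would need to change per language; the index building and lookup code stays the same.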

Multilingual Normalization
Normalizing the character encoding
  One size fits all: Unicode
Removing the unnecessary HTML tags, extra white spaces, etc.
Character normalization
  Mapping together characters that have the same meaning
  Locale dependent
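
A rough sketch of these three normalization steps, assuming Python and its standard library. The function name `normalize_page` is hypothetical, the tag removal is a crude regular expression rather than a real HTML parser, and NFKC plus case folding stands in for character normalization. Being locale independent, it does not cover the locale-dependent mappings mentioned on the slide, which would have to be layered on top.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]*>")

def normalize_page(raw_bytes, source_encoding):
    # 1. Character encoding: decode the page into Unicode.
    text = raw_bytes.decode(source_encoding, errors="replace")
    # 2. Remove HTML tags, then resolve character references.
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # 3. Character normalization: NFKC folds compatibility variants
    #    (e.g. full-width letters), casefold() removes case differences.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Collapse the extra white space left over from the steps above.
    return " ".join(text.split())

page = "<p>Ｕｎｉｃｏｄｅ   Conference</p>".encode("utf-8")
print(normalize_page(page, "utf-8"))
```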

Multilingual Segmentation
Linguistic features can’t be used
  Too complex and/or costly to implement
Relying on N-Grams
  N-Gram = a sequence of N contiguous characters
  N-Grams may overlap
  Example with N=4: “unicode conference” => “unic”, “nico”, “icod”, “code”, “ode ”, “de c”, “e co”, “ con”, ...
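
The N-gram segmentation can be sketched in a few lines of Python. The function name `ngrams` is an illustrative choice; the window simply slides one character at a time, so every overlapping 4-gram is produced, including the one that ends on the space between the two words.

```python
def ngrams(text, n):
    # Slide a window of n characters over the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("unicode conference", 4)[:8])
# ['unic', 'nico', 'icod', 'code', 'ode ', 'de c', 'e co', ' con']
```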

N-Grams Advantages
Advantages:
  Simple to implement
  Increased tolerance for typos
  “Free” morphology
  Language independent
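
To illustrate the tolerance for typos and the “free” morphology, the sketch below measures how many of a query's 4-grams also occur in an indexed term. The `overlap` score is a hypothetical measure chosen for illustration; the talk does not say how matches are scored.

```python
def ngram_set(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap(query, term, n=4):
    # Fraction of the query's n-grams that also occur in the term.
    q = ngram_set(query, n)
    if not q:
        return 0.0
    return len(q & ngram_set(term, n)) / len(q)

print(overlap("conferance", "conference"))  # misspelled query
print(overlap("searching", "searched"))     # different word forms
```

In both cases a sizeable share of the N-grams still match, although an exact word comparison would fail.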

N-Grams Disadvantages
Disadvantages:
  Index is bigger
  Minimum query length is N characters
    A shorter query will yield no results
  May introduce “noise”
    Sometimes the system may be too tolerant (e.g., a query for “standing” may return pages containing “understand”)
  Not as good as a linguistics-based IR system
    No explicit word normalization is possible
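
Two of these drawbacks are easy to reproduce with a small Python sketch (N=4 is assumed here, and `ngram_set` is an illustrative helper name).

```python
def ngram_set(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# A query shorter than N produces no n-grams, so nothing can be looked up.
print(ngram_set("abc"))

# "standing" and "understand" share n-grams although the words are unrelated.
print(sorted(ngram_set("standing") & ngram_set("understand")))
# ['stan', 'tand']
```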

What value should N have?
N is language dependent
  Typically we use a value between 1 and 6
A higher N-gram size improves quality, but reduces tolerance and increases the minimal query size
Some languages may require more than one N-Gram size
  Japanese example
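
Indexing with more than one N-gram size could look like the sketch below. The sizes in `NGRAM_SIZES` are hypothetical values picked for illustration; the talk does not state which sizes were used for Japanese or any other language.

```python
def multi_ngrams(text, sizes):
    # Emit the n-grams for each requested size, in the order given.
    units = []
    for n in sizes:
        units.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return units

# Hypothetical per-language settings, not taken from the talk.
NGRAM_SIZES = {"en": (4,), "ja": (1, 2)}

print(multi_ngrams("日本語", NGRAM_SIZES["ja"]))
# ['日', '本', '語', '日本', '本語']
```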

Client-side
Must be compatible with most browsers
  We restrict ourselves to HTML
  We use the “standard” encodings for each language for our pages:
    many people still use browsers that are not Unicode friendly
    this makes content editing easier

Using a FORM
The parameters of the query are passed via the URL to a CGI script
  e.g.: http://www.my_site.com/my_script?query=%22San+Jose%22
What is the charset of the data sent back from the client?
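
On the server, the query string of the example URL can be unpacked as sketched below with Python's `urllib.parse`. Note that `parse_qs` assumes UTF-8 unless told otherwise, which is exactly the assumption the question on the slide calls into doubt; for this ASCII-only query it makes no difference.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.my_site.com/my_script?query=%22San+Jose%22"
params = parse_qs(urlsplit(url).query)
# %22 is a double quote and '+' is a space.
print(params["query"][0])
# "San Jose"
```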

URL Encoding Issues
Different browsers have different behaviors
Example: a Japanese query
  Could be submitted to the server as:
    ...search.pl?Query=%93%FA%96%7B%8C%EA
  Or by another browser as:
    ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
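
Both forms can be brought back to the same Unicode string, as the sketch below suggests. It assumes that the first form is Shift_JIS (the bytes decode cleanly as Shift_JIS, but the talk does not name the encoding) and that the second form carries HTML numeric character references. The helper name `decode_query_value` is hypothetical.

```python
import html
from urllib.parse import unquote_to_bytes

def decode_query_value(raw, charset):
    # Undo the URL encoding first: '+' means space, %XX is one byte.
    data = unquote_to_bytes(raw.replace("+", " "))
    text = data.decode(charset, errors="replace")
    # Some browsers send numeric character references (&#NNNNN;) instead
    # of bytes in the page charset; html.unescape() resolves those.
    return html.unescape(text)

print(decode_query_value("%93%FA%96%7B%8C%EA", "shift_jis"))
print(decode_query_value("%26%2326085%3B%26%2326412%3B%26%2335486%3B", "shift_jis"))
```

Both calls print the same three characters. One side effect of this design is that `html.unescape` would also rewrite a reference such as `&amp;` typed deliberately by the user.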

FORM and CGI
The server tells the client which encoding to use at the HTTP level
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=...">
…
</HEAD>
…
</HTML>

FORM and CGI
The client returns the information to the script using the Private FORM/CGI Protocol
  A “hidden” form field adds a parameter to the query which identifies the locale
<form> ... <input type=hidden name=Locale value=ja> ... </form>
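
On the receiving side, the script can read the hidden `Locale` parameter first and use it to choose the charset for the remaining parameters. The sketch below assumes Python's `urllib.parse.parse_qs`; the `LOCALE_CHARSETS` table and the `read_form` helper are hypothetical, since the talk does not list the encodings used for each locale.

```python
from urllib.parse import parse_qs

# Hypothetical locale-to-charset table, not taken from the talk.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1", "en": "ascii"}

def read_form(query_string, default_locale="en"):
    # First pass: the Locale value is plain ASCII, so any encoding can read it.
    probe = parse_qs(query_string, encoding="latin-1")
    locale = probe.get("Locale", [default_locale])[0]
    charset = LOCALE_CHARSETS.get(locale, "utf-8")
    # Second pass: decode every parameter with the charset of that locale.
    params = parse_qs(query_string, encoding=charset, errors="replace")
    return locale, params.get("Query", [""])[0]

print(read_form("Locale=ja&Query=%93%FA%96%7B%8C%EA"))
# ('ja', '日本語')
```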

Displaying the Results
Simple if only one code set per page is required
For multilingual content:
  use UTF-8
  use multiple frames
Unexpected browser behavior
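
For the UTF-8 option, the main point is that the declared charset and the bytes actually sent must agree. A minimal sketch, assuming Python; the `render_results` helper and the page layout are illustrative only.

```python
import html

def render_results(titles):
    # Build a page whose declared charset matches the bytes actually sent.
    items = "\n".join(f"<LI>{html.escape(t)}</LI>" for t in titles)
    page = (
        '<HTML><HEAD><META HTTP-EQUIV="content-type" '
        'CONTENT="text/html; charset=UTF-8"></HEAD>\n'
        f"<BODY><UL>\n{items}\n</UL></BODY></HTML>"
    )
    headers = [("Content-Type", "text/html; charset=UTF-8")]
    return headers, page.encode("utf-8")

headers, body = render_results(["日本語", "Français", "Deutsch"])
print(headers[0], len(body))
```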

Conclusion
Solutions exist to provide a robust multilingual search engine
Code set issues on the client side can be a limitation, but they will soon disappear as more and more people use UTF-8 friendly browsers

Q&A
Thierry Sourbier
Software Developer