Keys to Building a Multilingual Search Engine Thierry Sourbier

Post on 27-Mar-2015

Page 1

Keys to Building a Multilingual Search Engine

Thierry Sourbier

Page 2

Search Engine Overview

Client-Side (browser)
  How to make the best use of browsers when dealing with multiple languages

Server-Side
  How to provide efficient multilingual information retrieval

[Diagram: Submit query, Display results, Process query, Create index, connected over HTTP]

Page 3

Overview of the Server-side

Index creation steps:
  Normalization gives the pages a standard format
  Segmentation breaks the pages into units that will be stored in the index
  Index building

Query processing steps:
  Normalization makes sure that the query has the same format as the indexed pages
  Segmentation breaks the query into units that will be looked up in the index
  Index search

Typically only Normalization and Segmentation are language dependent. The goal is to reduce these dependencies as much as possible.
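
The two pipelines can be sketched as a pair of functions that share the same normalization and segmentation steps. This is a minimal illustration, not the engine described in the talk: the names `build_index` and `search`, the in-memory inverted index, and the rule that a page must contain every query unit are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Iterable

Normalizer = Callable[[str], str]
Segmenter = Callable[[str], Iterable[str]]


def build_index(pages: dict[str, str], normalize: Normalizer,
                segment: Segmenter) -> dict[str, set[str]]:
    """Index creation: normalize each page, segment it, store the units."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for unit in segment(normalize(text)):
            index[unit].add(page_id)
    return index


def search(index: dict[str, set[str]], query: str, normalize: Normalizer,
           segment: Segmenter) -> set[str]:
    """Query processing: same normalization and segmentation, then lookup."""
    units = list(segment(normalize(query)))
    if not units:
        return set()
    # Assumption: a page matches only if it contains every query unit.
    result = set(index.get(units[0], set()))
    for unit in units[1:]:
        result &= index.get(unit, set())
    return result


# Stand-in language-dependent steps: case folding and whitespace splitting.
pages = {"p1": "Unicode Conference", "p2": "Search Engine"}
index = build_index(pages, str.casefold, str.split)
hits = search(index, "UNICODE", str.casefold, str.split)
```

Only the `normalize` and `segment` arguments change from one language to another; the index building and index search code stays the same.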

Page 4

Multilingual Normalization

Normalizing the character encoding
  One size fits all: Unicode

Removing the unnecessary HTML tags, extra white spaces, etc.

Character normalization
  Mapping together characters that have the same meaning
  Locale dependent
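
A rough sketch of these three steps, assuming the page arrives as bytes with a known charset. The function name `normalize_page` is hypothetical, the regular expression is a crude tag stripper rather than a real HTML parser, and Unicode NFKC plus case folding is used here as a locale-independent stand-in for "mapping together characters that have the same meaning". The locale-dependent rules mentioned on the slide are not covered.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # crude: anything between < and >
SPACE_RE = re.compile(r"\s+")     # any run of white space


def normalize_page(raw: bytes, charset: str) -> str:
    # 1. Normalize the character encoding: everything becomes Unicode.
    text = raw.decode(charset, errors="replace")
    # 2. Remove HTML tags and collapse extra white space.
    text = TAG_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text).strip()
    # 3. Character normalization (assumption: NFKC + case folding).
    return unicodedata.normalize("NFKC", text).casefold()


latin = normalize_page("<p>Caf\u00e9   Unicode</p>".encode("latin-1"), "latin-1")
# Full-width Latin letters are folded to their ASCII counterparts by NFKC.
wide = normalize_page("\uff21\uff22\uff23".encode("utf-8"), "utf-8")
```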

Page 5

Multilingual Segmentation

Linguistic features can’t be used
  Too complex and/or costly to implement

Relying on N-Grams
  N-Gram = a sequence of N contiguous characters
  N-Grams may overlap
  Example with N=4: “unicode conference” => “unic”, “nico”, “icod”, “code”, “ode ”, “de c”, “e co”, “ con”, ...
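
The N-gram segmentation can be written in a few lines. The function name `ngrams` is chosen for this sketch. The window slides one character at a time, so for “unicode conference” the gram “ode ” appears between “code” and “de c”.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n contiguous characters, in order, with overlap."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = ngrams("unicode conference", 4)
first_eight = grams[:8]
# ['unic', 'nico', 'icod', 'code', 'ode ', 'de c', 'e co', ' con']
```

A text shorter than N produces an empty list, which is why queries shorter than N cannot match anything.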

Page 6

N-Grams Advantages

Advantages:
  Simple to implement
  Increased tolerance for typos
  “Free” morphology
  Language independent
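
A small illustration of the typo tolerance and the “free” morphology, assuming N=4 and comparing sets of grams. The misspelling “conferance” and the pair “search”/“searching” are examples picked for this sketch, not taken from the talk.

```python
def ngram_set(text: str, n: int = 4) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Typo tolerance: a misspelled word still shares several grams with the original.
shared = ngram_set("conference") & ngram_set("conferance")

# "Free" morphology: every gram of "search" also occurs in "searching",
# so no stemmer is needed to relate the two forms.
covered = ngram_set("search") <= ngram_set("searching")
```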

Page 7

N-Grams Disadvantages

Disadvantages:
  Index is bigger
  Minimum query length is N characters
    a shorter query yields no results
  May introduce “noise”
    sometimes the system may be too tolerant (e.g., a query for “standing” may return pages containing “understand”)
  Not as good as a linguistics-based IR system
    no explicit word normalization possible
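
The noise and the minimum query length can both be reproduced with N=4. The sketch below only compares gram sets; how many shared grams a real engine needs before it returns a page is not stated in the slides.

```python
def ngram_set(text: str, n: int = 4) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Noise: "standing" and "understand" share grams although the words are unrelated.
noise = ngram_set("standing") & ngram_set("understand")

# Minimum query length: a 3-character query has no 4-grams to look up.
too_short = ngram_set("san")
```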

Page 8

What value should N have?

N is language dependent
  Typically we use a value between 1 and 6
A high N-gram size improves quality, but reduces tolerance and increases the minimum query size
Some languages may require more than one N-Gram size
  Japanese example
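
Indexing with more than one N-gram size could look like the sketch below. The slides do not say which sizes the Japanese example used, so the sizes 1 and 2 and the sample string are assumptions for illustration only.

```python
def multi_size_ngrams(text: str, sizes: tuple[int, ...]) -> list[str]:
    """Concatenate the N-grams of text for each requested size."""
    units: list[str] = []
    for n in sizes:
        units.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return units


# Assumed sizes: unigrams keep one-character queries searchable,
# bigrams add precision for longer queries.
units = multi_size_ngrams("\u65e5\u672c\u8a9e", (1, 2))  # the string 日本語
```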

Page 9

Client-side

Must be compatible with most browsers
  We restrict ourselves to HTML
  We use the “standard” encodings for each language for our pages:
    many people still use browsers that are not Unicode friendly
    this makes content editing easier

Page 10

Using a FORM

The parameters of the query are passed via the URL to a CGI script
  e.g.: http://www.my_site.com/my_script?query=%22San+Jose%22
What is the charset of the data sent back from the client?
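
For reference, this is how the example URL decodes with Python's standard `urllib.parse` module. The query here is pure ASCII, so the charset question does not arise yet: `+` becomes a space and `%22` a double quote.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://www.my_site.com/my_script?query=%22San+Jose%22"
params = parse_qs(urlsplit(url).query)
query = params["query"][0]  # '"San Jose"'
```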

Page 11

URL Encoding Issues

Different browsers have different behaviors
Example: a Japanese query
  Could be submitted to the server as:
    ...search.pl?Query=%93%FA%96%7B%8C%EA
  Or by another browser as:
    ...search.pl?Query=%26%2326085%3B%26%2326412%3B%26%2335486%3B
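
Both submissions carry the same three characters. The first is the percent-encoded Shift_JIS bytes, the second is a percent-encoded string of HTML numeric character references. A sketch of decoding each form with the Python standard library:

```python
import html
from urllib.parse import unquote, unquote_to_bytes

# Form 1: raw Shift_JIS bytes, percent-encoded.
as_sjis = unquote_to_bytes("%93%FA%96%7B%8C%EA").decode("shift_jis")

# Form 2: "&#26085;&#26412;&#35486;", percent-encoded.
references = unquote("%26%2326085%3B%26%2326412%3B%26%2335486%3B")
as_ncr = html.unescape(references)
```

The server cannot tell from the bytes alone which form it received, which is the problem the following slides address.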

Page 12

FORM and CGI

The server tells the client which encoding to use at the HTTP level

  <HTML>
  <HEAD>
  <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=...">
  …
  </HEAD>
  …
  </HTML>
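
On the server, the declared charset and the bytes actually sent have to agree. A minimal sketch, assuming a hypothetical helper `render_page` that builds the header line, the META declaration, and the encoded body from one charset value:

```python
def render_page(body: str, charset: str) -> tuple[str, bytes]:
    """Return the Content-Type header line and the page encoded in charset."""
    page = (
        "<HTML><HEAD>"
        f'<META HTTP-EQUIV="content-type" CONTENT="text/html; charset={charset}">'
        "</HEAD><BODY>" + body + "</BODY></HTML>"
    )
    header = f"Content-Type: text/html; charset={charset}"
    # Encoding with the same charset keeps the declaration and the bytes in sync.
    return header, page.encode(charset)


header, payload = render_page("\u65e5\u672c\u8a9e", "shift_jis")
```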

Page 13

FORM and CGI

The client returns the information to the script using the Private FORM/CGI Protocol
  A “hidden” form field adds a parameter to the query which identifies the locale
    <form> ... <input type=hidden name=Locale value=ja> ... </form>
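
On the script side, the hidden `Locale` parameter can select the charset used to decode the rest of the query. The locale-to-charset table, the default charset, and the function name `decode_query` below are assumptions for this sketch. The parameter names `Query` and `Locale` follow the slides.

```python
from urllib.parse import parse_qs

# Assumed mapping from locale to the "standard" encoding of its pages.
LOCALE_CHARSETS = {"ja": "shift_jis", "fr": "iso-8859-1", "en": "iso-8859-1"}


def decode_query(query_string: str,
                 default_charset: str = "iso-8859-1") -> tuple[str, str]:
    # First pass: Locale is plain ASCII, so any byte-preserving charset works.
    raw = parse_qs(query_string, encoding="latin-1")
    locale = raw.get("Locale", [""])[0]
    charset = LOCALE_CHARSETS.get(locale, default_charset)
    # Second pass: decode the percent-encoded bytes with the locale's charset.
    decoded = parse_qs(query_string, encoding=charset, errors="replace")
    return locale, decoded.get("Query", [""])[0]


locale, query = decode_query("Query=%93%FA%96%7B%8C%EA&Locale=ja")
```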

Page 14

Displaying the Results

Simple if only one code set per page is required
For multilingual content:
  use UTF-8
  use multiple frames
Unexpected browser behavior
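
The reason a mixed-language result page needs UTF-8 (or one frame per code set) can be checked directly: a single legacy code set cannot represent every character. The helper `can_encode` and the sample text are made up for this illustration.

```python
def can_encode(text: str, charset: str) -> bool:
    """Report whether every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


mixed = "\u65e5\u672c\u8a9e fran\u00e7ais"  # Japanese and French on one page
results = {cs: can_encode(mixed, cs) for cs in ("shift_jis", "iso-8859-1", "utf-8")}
```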

Page 15

Conclusion

Solutions exist to provide a robust multilingual search engine

Code set issues on the client side can be a limitation, but they should soon disappear as more and more people use UTF-8-friendly browsers

Page 16

Q&A

Thierry Sourbier
Software Developer
[email protected]