Build Your Own Search Engine

Siddhartha Reddy @the5el

Upload: sids

Post on 15-Jan-2015

Category: Technology



DESCRIPTION

Slides for a talk I gave at The Fifth Elephant conference in 2012. Description copied from the conference's website: No, this is not another tutorial on using Solr/ElasticSearch/Sphinx/Lucene. Imagine that none of these existed and you need a search engine for your shiny new eCommerce startup. What would you do? Build your own search engine, of course. In this session at The Fifth Elephant 2012, Siddhartha Reddy describes what it takes to do that. Here is a video of the talk: https://hasgeek.tv/fifthelephant/2012-1/49-siddhartha-reddy-build-your-own-search-engine

TRANSCRIPT

Page 1: Build Your Own Search Engine

Build Your Own Search Engine

Siddhartha Reddy @the5el

Page 2: Build Your Own Search Engine

SOME Ideas Underlying a Search Engine

Siddhartha Reddy @the5el

Page 3: Build Your Own Search Engine

Siddhartha Reddy

o Loves Lives Search

o Leads Product Discovery (search) team at Flipkart

o @sids

Page 4: Build Your Own Search Engine

“Should I build my own search engine?”

No.

Page 5: Build Your Own Search Engine

“Should I build my own search engine?”

No. Use Lucene/Solr/ElasticSearch/Sphinx/etc.

“What are we doing here then?”

Page 6: Build Your Own Search Engine

[Architecture diagram. Components: “User”, Searcher (with Text Analysis), Index, Indexer (with Text Analysis), Documents, Query/Response]

Page 7: Build Your Own Search Engine

[Architecture diagram. Components: “User”, Searcher (with Text Analysis), Index, Indexer (with Text Analysis), Documents, Query/Response]

Page 8: Build Your Own Search Engine

Text Analysis

Page 9: Build Your Own Search Engine

Text Analysis

“I was killed i' the Capitol; Brutus killed me.”

• Tokenization: I, was, killed, i', the, Capitol, Brutus, killed, me

• Stop-word Removal, Case-folding: i, killed, i', capitol, brutus, killed, me

• Stemming: English Porter (rule-based), KStemmer (dictionary-based)

• Lemmatization (saw => see; saw => hacksaw)

• Normalization, Synonyms (colour <=> color; pavement <=> footpath)

• Others (accents and diacritics, abbreviations etc.)
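The first few analysis steps can be sketched in a handful of lines. This is only an illustration, not the speaker's code: the stop-word list (`was`, `the`) is an assumption chosen so that the output matches the tokens shown on the slide, and the function names are hypothetical. Stemming, lemmatization and synonyms are left out.

```python
import re

# Assumed stop-word list: just enough to reproduce the slide's example.
STOP_WORDS = {"was", "the"}


def tokenize(text):
    """Tokenization: keep runs of letters and apostrophes, drop punctuation."""
    return re.findall(r"[A-Za-z']+", text)


def remove_stop_words(tokens):
    """Stop-word removal: drop very common words that carry little meaning."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def case_fold(tokens):
    """Case-folding: lower-case every token."""
    return [t.lower() for t in tokens]


def analyze(text):
    """Run the three steps in order."""
    return case_fold(remove_stop_words(tokenize(text)))


sentence = "I was killed i' the Capitol; Brutus killed me."
raw_tokens = tokenize(sentence)
tokens = analyze(sentence)
print(raw_tokens)
print(tokens)
```

The same `analyze` function would have to be applied both when indexing documents and when parsing queries, otherwise query terms would not line up with indexed terms.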

Page 10: Build Your Own Search Engine

[Architecture diagram. Components: “User”, Searcher (with Text Analysis), Index, Indexer (with Text Analysis), Documents, Query/Response]

Page 11: Build Your Own Search Engine

The Index

Page 12: Build Your Own Search Engine

Term-Document Matrix

Brutus AND Caesar AND NOT Calpurnia

Term | Incidence vector
Brutus | 110100
Caesar | 110111
Calpurnia | 010000
NOT Calpurnia | 101111

110100 AND 110111 AND 101111 = 100100
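The Boolean query over the term-document matrix is plain bitwise arithmetic. A minimal sketch, assuming six documents numbered 1 to 6 from the leftmost bit (the numbering and the helper name `negate` are assumptions, not from the slides):

```python
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # 0b111111, keeps NOT within six bits

# One row of the term-document matrix per term; leftmost bit = document 1.
incidence = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}


def negate(vector):
    """Bitwise NOT restricted to the N_DOCS document bits."""
    return ~vector & MASK


result = incidence["Brutus"] & incidence["Caesar"] & negate(incidence["Calpurnia"])
bits = format(result, "06b")

# Translate set bits back into (assumed) document numbers.
matching_docs = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(bits, matching_docs)
```

A matrix like this has one bit for every term and document pair, most of them zero, which is why a real index stores only the documents each term occurs in.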


Page 13: Build Your Own Search Engine

Brutus AND Caesar AND NOT Calpurnia

Inverted Index
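The slide's image did not survive in the transcript, so here is a rough sketch of the same query against an inverted index. The postings lists are derived from the incidence vectors on the previous slide (documents numbered 1 to 6, an assumption); the function names are hypothetical.

```python
# term -> sorted postings list of document ids
index = {
    "brutus": [1, 2, 4],
    "caesar": [1, 2, 4, 5, 6],
    "calpurnia": [2],
}


def intersect(p1, p2):
    """AND: walk two sorted postings lists in step, keep common ids."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def difference(p1, p2):
    """AND NOT: keep ids of p1 that do not appear in p2 (both sorted)."""
    i = j = 0
    out = []
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out


# Brutus AND Caesar AND NOT Calpurnia
result = difference(intersect(index["brutus"], index["caesar"]), index["calpurnia"])
print(result)
```

Both operations run in time proportional to the combined length of the lists, which works only because the postings are kept sorted.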


Page 14: Build Your Own Search Engine

[Architecture diagram. Components: “User”, Searcher (Ranking/Scoring, with Text Analysis), Index, Indexer (with Text Analysis), Documents, Query/Response]

Page 15: Build Your Own Search Engine

Relevance Ranking/Scoring

Page 16: Build Your Own Search Engine

Ranking/Scoring

“mysql performance”

Top 25 Best Linux Performance Monitoring and Debugging Tools

8 great MySQL Performance Tips

Linux performance: is Linux becoming just too slow and bloated?

MySQL Performance Blog

Page 17: Build Your Own Search Engine

Ranking/Scoring

“mysql performance”

Term Frequency (Tf)

Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 | 23 | 24
8 great MySQL Performance Tips | 5 | 7 | 12
Linux performance: is Linux becoming just too slow and bloated? | 3 | 12 | 15
MySQL Performance Blog | 6 | 8 | 14

* random numbers

Page 18: Build Your Own Search Engine

Ranking/Scoring

“mysql performance”

Term Frequency (Tf), sorted by total

Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 | 23 | 24
Linux performance: is Linux becoming just too slow and bloated? | 3 | 12 | 15
MySQL Performance Blog | 6 | 8 | 14
8 great MySQL Performance Tips | 5 | 7 | 12

Page 19: Build Your Own Search Engine

Ranking/Scoring

• Inverse Document Frequency (Idf)

• "How common (or rare) is a term?"

• 1 / (no. of documents the term occurs in)

“mysql performance”

Page 20: Build Your Own Search Engine

Ranking/Scoring

• Tf normalized by document length

• Idf dampened by applying a function (log)

“mysql performance”

score = Tf * Idf
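A minimal sketch of this scoring, under stated assumptions: Tf is the term count divided by document length, and Idf is taken as log(N / document frequency), a common log-dampened variant that the slides do not spell out. The three tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tf(term, tokens):
    """Term frequency, normalized by document length."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def idf(term, docs):
    """Inverse document frequency, dampened with a log (assumed variant)."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def score(query_terms, tokens, docs):
    """score = sum over query terms of Tf * Idf."""
    return sum(tf(t, tokens) * idf(t, docs) for t in query_terms)


# Hypothetical, already-analyzed documents.
docs = [
    ["mysql", "performance", "tips"],
    ["linux", "performance", "tools"],
    ["linux", "performance", "monitoring"],
]
query = ["mysql", "performance"]
scores = [score(query, tokens, docs) for tokens in docs]
print(scores)
```

Because "performance" occurs in every document, its Idf is zero here and only "mysql" separates the documents, which is the effect the Idf factor is meant to have.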

Page 21: Build Your Own Search Engine

Ranking/Scoring

“mysql performance”

Term | Idf
mysql | 10
performance | 2

Tf * Idf

Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 * 10 | 23 * 2 | 56
Linux performance: is Linux becoming just too slow and bloated? | 3 * 10 | 12 * 2 | 54
MySQL Performance Blog | 6 * 10 | 8 * 2 | 76
8 great MySQL Performance Tips | 5 * 10 | 7 * 2 | 64

Page 22: Build Your Own Search Engine

Ranking/Scoring

“mysql performance”

Tf * Idf, sorted by total

Document | mysql | performance | Total
MySQL Performance Blog | 6 * 10 | 8 * 2 | 76
8 great MySQL Performance Tips | 5 * 10 | 7 * 2 | 64
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 * 10 | 23 * 2 | 56
Linux performance: is Linux becoming just too slow and bloated? | 3 * 10 | 12 * 2 | 54
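The arithmetic behind this ranking can be reproduced directly from the Tf and Idf values on the slides (the Tf values are the slides' "random numbers"; the variable names are illustrative):

```python
idf = {"mysql": 10, "performance": 2}

# Per-document term frequencies, as given on the slides.
tf = {
    "Top 25 Best Linux Performance Monitoring and Debugging Tools": {"mysql": 1, "performance": 23},
    "8 great MySQL Performance Tips": {"mysql": 5, "performance": 7},
    "Linux performance: is Linux becoming just too slow and bloated?": {"mysql": 3, "performance": 12},
    "MySQL Performance Blog": {"mysql": 6, "performance": 8},
}


def score(doc_tf):
    """Sum of Tf * Idf over the query terms."""
    return sum(doc_tf[term] * idf[term] for term in idf)


scores = {title: score(doc_tf) for title, doc_tf in tf.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for title in ranked:
    print(scores[title], title)
```

With raw Tf alone the Linux monitoring article came first; weighting by Idf moves the two MySQL documents to the top.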

Page 23: Build Your Own Search Engine

Boolean Search vs. Ranked Search

• Boolean Search

o Rich query syntax

o No relevance scoring

o Ex: Patent search, Enterprise search

o Precision & Recall controlled by user

• Ranked Search

o Simple query syntax

o Relevance ranking/scoring is key

o Ex: Web Search, Flipkart Search

o Search Engine needs to balance Precision & Recall
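Both columns refer to precision and recall without defining them. As a reminder, precision is the fraction of returned results that are relevant, and recall is the fraction of relevant documents that were returned. A small sketch with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Returning more documents tends to raise recall and lower precision. In Boolean search the user steers that trade-off through the query, while in ranked search the engine has to do it through scoring.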

Page 24: Build Your Own Search Engine

[Architecture diagram. Components: “User”, Searcher (with Text Analysis), Index, Indexer (with Text Analysis), Documents, Query/Response]

Page 25: Build Your Own Search Engine

Indexing

Page 26: Build Your Own Search Engine

Building an Inverted Index

Documents
=> Text Analysis => <term,documentId> pairs
=> Sort => <term,documentId> pairs, sorted
=> Persist (Disk):

o termId => term

o termId => postingId (dictionary)

o postingId => postingsList (postings file)
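The pipeline above can be sketched in memory as follows. This is an illustration under assumptions: the documents are made up, text analysis is reduced to lower-casing and splitting on whitespace, and the three structures are kept as Python objects instead of being written to disk.

```python
from bisect import bisect_left


def analyze(text):
    """Stand-in for text analysis: lower-case and split on whitespace."""
    return text.lower().split()


def build_index(documents):
    # 1. Text analysis: emit <term, documentId> pairs.
    pairs = [(term, doc_id) for doc_id, text in documents.items() for term in analyze(text)]
    # 2. Sort by term, then documentId (duplicates removed).
    pairs = sorted(set(pairs))
    # 3. Group the sorted pairs into the three structures from the slide.
    terms = []          # termId -> term
    dictionary = {}     # termId -> postingId
    postings_file = []  # postingId -> postings list
    for term, doc_id in pairs:
        if not terms or terms[-1] != term:
            terms.append(term)
            dictionary[len(terms) - 1] = len(postings_file)
            postings_file.append([])
        postings_file[-1].append(doc_id)
    return terms, dictionary, postings_file


def lookup(term, terms, dictionary, postings_file):
    """Binary-search the sorted term list, then follow the dictionary."""
    term_id = bisect_left(terms, term)
    if term_id == len(terms) or terms[term_id] != term:
        return []
    return postings_file[dictionary[term_id]]


# Hypothetical documents.
documents = {1: "Brutus killed Caesar", 2: "Caesar was noble", 3: "noble Brutus"}
terms, dictionary, postings_file = build_index(documents)
print(terms)
print(lookup("noble", terms, dictionary, postings_file))
```

Sorting is what makes grouping a single linear pass. For collections that do not fit in memory, the same idea is usually applied to sorted runs on disk that are merged afterwards.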


Page 27: Build Your Own Search Engine

[Architecture diagram. Components: “User”, Searcher (with Text Analysis), Index, Indexer (with Text Analysis), Documents, Query/Response]

Page 28: Build Your Own Search Engine

Scaling Indexing and Searching

Page 29: Build Your Own Search Engine

`large` Inverted Index

Page 30: Build Your Own Search Engine

Distributed Indexing

Page 31: Build Your Own Search Engine

Distributed Search

brutus | d1, d3, d6, d7
caesar | d1, d2, d4, d8, d9
noble | d5, d10
with | d1, d2, d3, d5
killed | d8

Page 32: Build Your Own Search Engine

Distributed Search

Partitioning by documents

brutus | d1, d3
caesar | d1, d2, d4
noble | d5
with | d1, d2, d3, d5

brutus | d6, d7
caesar | d8, d9
noble | d10
killed | d8

Partitioning by terms

brutus | d1, d3, d6, d7
caesar | d1, d2, d4, d8, d9

noble | d5, d10
with | d1, d2, d3, d5
killed | d8
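The two partitioning schemes can be sketched with the index from the slide. Document ids are written as integers (1 for d1, and so on), and the shard-assignment rules and function names are assumptions made for illustration.

```python
index = {
    "brutus": [1, 3, 6, 7],
    "caesar": [1, 2, 4, 8, 9],
    "noble": [5, 10],
    "with": [1, 2, 3, 5],
    "killed": [8],
}


def partition_by_documents(index, shard_of_doc, n_shards):
    """Each shard holds a complete index over its own subset of documents."""
    shards = [{} for _ in range(n_shards)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[shard_of_doc(doc_id)].setdefault(term, []).append(doc_id)
    return shards


def partition_by_terms(index, shard_of_term, n_shards):
    """Each shard holds the full postings lists for its own subset of terms."""
    shards = [{} for _ in range(n_shards)]
    for term, postings in index.items():
        shards[shard_of_term(term)][term] = list(postings)
    return shards


def shard_of_doc(doc_id):
    return 0 if doc_id <= 5 else 1  # d1..d5 on shard 0, d6..d10 on shard 1


def shard_of_term(term):
    return 0 if term in {"brutus", "caesar"} else 1


doc_shards = partition_by_documents(index, shard_of_doc, 2)
term_shards = partition_by_terms(index, shard_of_term, 2)


def search_doc_shards(shards, term):
    """Document partitioning: ask every shard, merge the partial results."""
    return sorted(d for shard in shards for d in shard.get(term, []))


def search_term_shards(shards, term):
    """Term partitioning: only the shard that owns the term is asked."""
    return shards[shard_of_term(term)].get(term, [])


print(search_doc_shards(doc_shards, "caesar"))
print(search_term_shards(term_shards, "caesar"))
```

With document partitioning every query touches every shard, but each shard can answer multi-term queries on its own. With term partitioning a single-term lookup hits one shard, but multi-term queries have to move postings lists between shards.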


Page 33: Build Your Own Search Engine

Images Attribution

• Introduction to Information Retrieval

o By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze

Page 34: Build Your Own Search Engine

Thank you

[email protected]

@sids