build your own search engine
DESCRIPTION
Slides for a talk I gave at The Fifth Elephant conference in 2012. Description copied from the conference's website: No, this is not another tutorial on using Solr/ElasticSearch/Sphinx/Lucene. Imagine that none of these existed and you need a search engine for your shiny new eCommerce startup. What would you do? Build your own search engine, of course. In this session at The Fifth Elephant 2012, Siddhartha Reddy describes what it takes to do that. Here is a video of the talk: https://hasgeek.tv/fifthelephant/2012-1/49-siddhartha-reddy-build-your-own-search-engine
TRANSCRIPT
Build Your Own Search Engine
Siddhartha Reddy @the5el
SOME Ideas Underlying a Search Engine
Siddhartha Reddy @the5el
Siddhartha Reddy
o Loves Lives Search
o Leads Product Discovery (search) team at Flipkart
o @sids
“Should I build my own search engine?”
No.
“Should I build my own search engine?”
No. Use Lucene/Solr/ElasticSearch/Sphinx/etc.
“What are we doing here then?”
[Diagram: “User”, Query/Response, Searcher (Text Analysis), Index, Indexer (Text Analysis), Documents]
Text Analysis
Text Analysis
“I was killed i' the Capitol; Brutus killed me.”
o Tokenization: I, was, killed, i', the, Capitol, Brutus, killed, me
o Stop-word Removal, then Normalization (Case-folding): i, killed, i', capitol, brutus, killed, me
o Stemming: English Porter (rule-based), KStemmer (dictionary-based)
o Lemmatization (saw => see; saw => hacksaw)
o Synonyms (colour <=> color; pavement <=> footpath)
o Others (accents and diacritics, abbreviations, etc.)
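The first few analysis steps can be sketched in a few lines of Python. This is only an illustration, not the speaker's implementation: the stop-word list is a hypothetical one chosen to reproduce the slide's example, the sentence is written with a plain apostrophe, and stemming, lemmatization and synonyms are left out.

```python
import re

# Hypothetical stop-word list: just enough to reproduce the slide's example.
STOP_WORDS = {"was", "the"}

def tokenize(text):
    """Split text into word tokens, keeping a trailing apostrophe (as in i')."""
    return re.findall(r"[A-Za-z]+'?", text)

def remove_stop_words(tokens):
    """Drop tokens whose lower-cased form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def case_fold(tokens):
    """Normalize tokens by lower-casing them."""
    return [t.lower() for t in tokens]

def analyze(text):
    """Tokenize, remove stop words, then case-fold."""
    return case_fold(remove_stop_words(tokenize(text)))

sentence = "I was killed i' the Capitol; Brutus killed me."
raw_tokens = tokenize(sentence)
tokens = analyze(sentence)
print(tokens)
```

The same `analyze` function has to run on both documents (at index time) and queries (at search time), otherwise the terms will not match.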
The Index
Term-Document Matrix
Brutus AND Caesar AND NOT Calpurnia
Brutus | 110100
Caesar | 110111
Calpurnia | 010000
NOT Calpurnia | 101111
110100 AND 110111 AND 101111 = 100100
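The bit-vector arithmetic on this slide can be checked with Python integers. A minimal sketch, assuming six documents and the incidence vectors shown on the slide (with Calpurnia padded to six bits as 010000, which is what makes NOT Calpurnia equal 101111):

```python
NUM_DOCS = 6
MASK = (1 << NUM_DOCS) - 1  # 0b111111: keeps NOT confined to six documents

# One bit per document; the leftmost bit is the first document.
incidence = {
    "brutus": 0b110100,
    "caesar": 0b110111,
    "calpurnia": 0b010000,
}

def bits(vector):
    """Render a vector as a fixed-width bit string."""
    return format(vector, f"0{NUM_DOCS}b")

not_calpurnia = ~incidence["calpurnia"] & MASK
result = incidence["brutus"] & incidence["caesar"] & not_calpurnia

print(bits(not_calpurnia))  # 101111
print(bits(result))         # 100100
```

The matrix needs one bit per (term, document) pair, almost all of them zero, which is why a real index does not store it this way.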
Brutus AND Caesar AND NOT Calpurnia
Inverted Index
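The same query can be answered from an inverted index by merging sorted postings lists. A rough sketch: the postings below are an assumption, obtained by reading the term-document matrix bit vectors left to right as document IDs 1 to 6, and the function names are illustrative.

```python
# Postings lists: term -> sorted list of document IDs.
postings = {
    "brutus": [1, 2, 4],
    "caesar": [1, 2, 4, 5, 6],
    "calpurnia": [2],
}

def intersect(a, b):
    """AND: walk two sorted lists in step, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_not(a, b):
    """AND NOT: keep IDs from sorted list a that do not appear in sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out

result = and_not(intersect(postings["brutus"], postings["caesar"]),
                 postings["calpurnia"])
print(result)  # [1, 4]
```

Documents 1 and 4 correspond to the set bits in 100100 from the matrix version of the query.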
[Diagram: “User”, Query/Response, Searcher (Ranking/Scoring, Text Analysis), Index, Indexer (Text Analysis), Documents]
Relevance Ranking/Scoring
Ranking/Scoring
“mysql performance”
Top 25 Best Linux Performance Monitoring and Debugging Tools
8 great MySQL Performance Tips
Linux performance: is Linux becoming just too slow and bloated?
MySQL Performance Blog
Ranking/Scoring
“mysql performance”
Term Frequency (Tf)
Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 | 23 | 24
8 great MySQL Performance Tips | 5 | 7 | 12
Linux performance: is Linux becoming just too slow and bloated? | 3 | 12 | 15
MySQL Performance Blog | 6 | 8 | 14
* random numbers
Ranking/Scoring
Term Frequency (Tf)
Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 | 23 | 24
Linux performance: is Linux becoming just too slow and bloated? | 3 | 12 | 15
MySQL Performance Blog | 6 | 8 | 14
8 great MySQL Performance Tips | 5 | 7 | 12
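Ranking by raw term frequency amounts to summing the per-term counts and sorting. A small sketch using the slide's (random) numbers; the variable and function names are illustrative:

```python
# Term frequencies per document, as given on the slide (random numbers).
tf = {
    "Top 25 Best Linux Performance Monitoring and Debugging Tools":
        {"mysql": 1, "performance": 23},
    "8 great MySQL Performance Tips":
        {"mysql": 5, "performance": 7},
    "Linux performance: is Linux becoming just too slow and bloated?":
        {"mysql": 3, "performance": 12},
    "MySQL Performance Blog":
        {"mysql": 6, "performance": 8},
}

query = ["mysql", "performance"]

def tf_score(doc):
    """Score a document by the total frequency of the query terms in it."""
    return sum(tf[doc].get(term, 0) for term in query)

ranked = sorted(tf, key=tf_score, reverse=True)
for doc in ranked:
    print(tf_score(doc), doc)
```

With Tf alone, the two Linux documents outrank both MySQL documents, because the common term "performance" dominates the totals.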
“mysql performance”
Ranking/Scoring
• Inverse Document Frequency (Idf)
• "How common (or rare) is a term?"
• 1 / (no. of documents the term occurs in)
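As a rough illustration of the 1 / (document frequency) definition, the sketch below treats the four result titles from the earlier slide as the whole corpus. That corpus choice is an assumption made for the example; the Idf values used later in the talk (10 and 2) are not derived this way.

```python
import re

# Assumption: the four titles stand in for the full document collection.
docs = [
    "Top 25 Best Linux Performance Monitoring and Debugging Tools",
    "8 great MySQL Performance Tips",
    "Linux performance: is Linux becoming just too slow and bloated?",
    "MySQL Performance Blog",
]

def terms(text):
    """Lower-case a text and return its set of distinct alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def idf(term):
    """1 / (number of documents the term occurs in); 0.0 for unseen terms."""
    df = sum(1 for d in docs if term in terms(d))
    return 1 / df if df else 0.0

print(idf("mysql"))        # 0.5  (occurs in 2 of 4 documents)
print(idf("performance"))  # 0.25 (occurs in all 4 documents)
```

The rarer term gets the larger weight, which is the property the ranking needs.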
“mysql performance”
Ranking/Scoring
• Tf normalized by document length
• Idf dampened by applying a function (log)
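The slide does not say which normalization or which logarithm is used, so the following is one common formulation, shown only as an assumption: Tf divided by the document length in tokens, and Idf computed as log(N / df) for N documents.

```python
import math

def normalized_tf(term_count, doc_length):
    """Term frequency as a fraction of the document's length in tokens."""
    return term_count / doc_length

def dampened_idf(num_docs, doc_freq):
    """Natural log of (total documents / documents containing the term)."""
    return math.log(num_docs / doc_freq)

# A term occurring 3 times in a 100-token document.
print(normalized_tf(3, 100))

# In a 1000-document collection, a term found in 10 documents versus
# a term found in every document.
print(dampened_idf(1000, 10))
print(dampened_idf(1000, 1000))  # 0.0: a term in every document carries no weight
```

Normalizing Tf stops long documents from winning simply by being long, and the log keeps very rare terms from overwhelming everything else.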
“mysql performance”
score = Tf * Idf
Ranking/Scoring
Tf * Idf
Document | mysql | performance | Total
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 * 10 | 23 * 2 | 56
Linux performance: is Linux becoming just too slow and bloated? | 3 * 10 | 12 * 2 | 54
MySQL Performance Blog | 6 * 10 | 8 * 2 | 76
8 great MySQL Performance Tips | 5 * 10 | 7 * 2 | 64
“mysql performance”
Term | Idf
mysql | 10
performance | 2
Ranking/Scoring
Tf * Idf
Document | mysql | performance | Total
MySQL Performance Blog | 6 * 10 | 8 * 2 | 76
8 great MySQL Performance Tips | 5 * 10 | 7 * 2 | 64
Top 25 Best Linux Performance Monitoring and Debugging Tools | 1 * 10 | 23 * 2 | 56
Linux performance: is Linux becoming just too slow and bloated? | 3 * 10 | 12 * 2 | 54
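The Tf * Idf ranking can be reproduced directly from the slide's numbers. A minimal sketch, with illustrative names; the Tf and Idf values are the ones given in the talk:

```python
# Tf values (random numbers from the slides) and the Idf values given in the talk.
tf = {
    "Top 25 Best Linux Performance Monitoring and Debugging Tools":
        {"mysql": 1, "performance": 23},
    "Linux performance: is Linux becoming just too slow and bloated?":
        {"mysql": 3, "performance": 12},
    "MySQL Performance Blog":
        {"mysql": 6, "performance": 8},
    "8 great MySQL Performance Tips":
        {"mysql": 5, "performance": 7},
}
idf = {"mysql": 10, "performance": 2}

query = ["mysql", "performance"]

def tf_idf_score(doc):
    """Sum of Tf * Idf over the query terms."""
    return sum(tf[doc].get(term, 0) * idf.get(term, 0) for term in query)

ranked = sorted(tf, key=tf_idf_score, reverse=True)
for doc in ranked:
    print(tf_idf_score(doc), doc)
```

Weighting by Idf moves both MySQL documents above the Linux documents, the opposite of the Tf-only ordering.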
“mysql performance”
Boolean Search vs. Ranked Search
• Boolean Search
o Rich query syntax
o No relevance scoring
o Ex: Patent search, Enterprise search
o Precision & Recall controlled by user
• Ranked Search
o Simple query syntax
o Relevance ranking/scoring is key
o Ex: Web Search, Flipkart Search
o Search Engine needs to balance Precision & Recall
Indexing
Building an Inverted Index
Documents => (Text Analysis) => <term,documentId> pairs => (Sort) => <term,documentId> pairs, sorted => (Persist) => Disk
Persisted to disk:
o termId => term
o termId => postingId (dictionary)
o postingId => postingsList (postings file)
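The analyze, sort and persist pipeline can be sketched in memory as follows. The toy documents and the whitespace analyzer are hypothetical, plain dicts stand in for the on-disk files, and using the termId as the postingId is a simplification made for this example.

```python
from itertools import groupby

# Hypothetical toy documents: documentId -> text.
documents = {
    1: "brutus killed caesar",
    2: "caesar was noble",
    3: "brutus was noble",
}

def analyze(text):
    """Stand-in for text analysis: lower-case and split on whitespace."""
    return text.lower().split()

# Step 1: emit <term, documentId> pairs.
pairs = [(term, doc_id)
         for doc_id, text in documents.items()
         for term in analyze(text)]

# Step 2: sort the pairs (and drop duplicates) so each term's documents are adjacent.
pairs = sorted(set(pairs))

# Step 3: "persist" into the three structures named on the slide.
term_by_id = {}     # termId => term
dictionary = {}     # termId => postingId
postings_file = {}  # postingId => postingsList
for term_id, (term, group) in enumerate(groupby(pairs, key=lambda p: p[0])):
    term_by_id[term_id] = term
    dictionary[term_id] = term_id  # simplification: postingId equals termId
    postings_file[term_id] = [doc_id for _, doc_id in group]

def lookup(term):
    """Return the postings list for a term, or [] if the term is unknown."""
    for term_id, known_term in term_by_id.items():
        if known_term == term:
            return postings_file[dictionary[term_id]]
    return []

print(lookup("brutus"))  # [1, 3]
```

Sorting is the step that matters at scale: once the pairs are sorted, each postings list can be written out in one sequential pass.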
Scaling Indexing and Searching
`large` Inverted Index
Distributed Indexing
Distributed Search
Full index
o brutus: d1, d3, d6, d7
o caesar: d1, d2, d4, d8, d9
o noble: d5, d10
o with: d1, d2, d3, d5
o killed: d8
Partitioning by terms
o Partition 1: brutus d1, d3, d6, d7; caesar d1, d2, d4, d8, d9
o Partition 2: noble d5, d10; with d1, d2, d3, d5; killed d8
Partitioning by documents
o Partition 1: brutus d1, d3; caesar d1, d2, d4; noble d5; with d1, d2, d3, d5
o Partition 2: brutus d6, d7; caesar d8, d9; noble d10; killed d8
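The two partitioning schemes can be compared with a short sketch built on the slide's postings. The shard-assignment rules (brutus and caesar on the first term shard; documents d1 to d5 on the first document shard) and all function names are assumptions made for illustration.

```python
# Full index from the slide, with document IDs as integers (d1 -> 1).
index = {
    "brutus": [1, 3, 6, 7],
    "caesar": [1, 2, 4, 8, 9],
    "noble": [5, 10],
    "with": [1, 2, 3, 5],
    "killed": [8],
}

def partition_by_terms(index, assign, num_shards):
    """Each shard holds the complete postings lists for a subset of terms."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        shards[assign(term)][term] = list(postings)
    return shards

def partition_by_documents(index, assign, num_shards):
    """Each shard holds a small index over a subset of the documents."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[assign(doc_id)].setdefault(term, []).append(doc_id)
    return shards

term_shards = partition_by_terms(
    index, lambda term: 0 if term in {"brutus", "caesar"} else 1, 2)
doc_shards = partition_by_documents(
    index, lambda doc_id: 0 if doc_id <= 5 else 1, 2)

def search_doc_shards(shards, term):
    """With document partitioning, every shard is queried and the hits merged."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)

print(doc_shards[0])
print(doc_shards[1])
print(search_doc_shards(doc_shards, "brutus"))  # [1, 3, 6, 7]
```

With term partitioning a single-term query touches one shard, while with document partitioning every query fans out to all shards and the partial results are merged.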
Images Attribution
• Introduction to Information Retrieval
o By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze