Introduction to Full-Text Search
![Page 1: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/1.jpg)
Introduction to Full-text search
![Page 2: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/2.jpg)
About me
Full-time (mostly) Java developer
Part-time general technical/sysadmin/geeky guy
Interested in: hard problems, search, performance, parallelism, scalability
![Page 3: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/3.jpg)
Why should you care?
![Page 4: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/4.jpg)
Because every application needs search
![Page 5: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/5.jpg)
We live in an era of big, complex and connected applications.
![Page 6: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/6.jpg)
That means a lot of data
![Page 7: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/7.jpg)
But it's no use if you can't find anything!
![Page 8: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/8.jpg)
But it's no use if you can't quickly find something relevant
![Page 9: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/9.jpg)
Quick
![Page 10: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/10.jpg)
Relevant
![Page 11: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/11.jpg)
Customized Experience
![Page 12: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/12.jpg)
You can't win by being generic, but you can be the best for your specific type of content.
Deathy's Tip
![Page 13: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/13.jpg)
So back to our full-text search...
![Page 14: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/14.jpg)
Some core ideas
"index" (or "inverted index")
"document"
![Page 15: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/15.jpg)
Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience)
Deathy’s Tip
![Page 16: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/16.jpg)
First we need some documents, more specifically some text samples
![Page 17: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/17.jpg)
Documents
Doc1: "The cow says moo"
Doc2: "The dog says woof"
Doc3: "The cow-dog says moof"
"Stolen" from http://www.slideshare.net/tomdyson/being-google
![Page 18: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/18.jpg)
Important: individual words are the basis for the index
![Page 19: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/19.jpg)
Individual words
index = [
    "cow", "dog", "moo", "moof", "The", "says", "woof"
]
![Page 20: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/20.jpg)
For each word we have a list of documents to which it belongs
![Page 21: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/21.jpg)
Words, with appearances
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
}
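As a rough sketch of how such an index could be built from the three sample documents: the helper names `simple_words` and `build_index` are made up for illustration, and splitting "cow-dog" on the hyphen is an assumption inferred from the index shown on the slide.

```python
from collections import defaultdict

# The three sample documents from the slides.
docs = {
    "Doc1": "The cow says moo",
    "Doc2": "The dog says woof",
    "Doc3": "The cow-dog says moof",
}

def simple_words(text):
    # Assumption: hyphens separate words, so "cow-dog" yields "cow" and "dog".
    return text.replace("-", " ").split()

def build_index(documents):
    # Map each word to the list of documents it appears in (an inverted index).
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for word in simple_words(text):
            if doc_id not in index[word]:
                index[word].append(doc_id)
    return dict(index)

index = build_index(docs)
print(index["cow"])  # ['Doc1', 'Doc3']
print(index["The"])  # ['Doc1', 'Doc2', 'Doc3']
```

A real engine stores much more per word (positions, frequencies), but the word-to-documents mapping is the core idea.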
![Page 22: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/22.jpg)
Q1: Find documents which contain "moo"
A1: index["moo"]
![Page 23: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/23.jpg)
Q2: Find documents which contain "The" and "dog"
A2: set(index["The"]) & set(index["dog"])
![Page 24: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/24.jpg)
Try to think of search as unions/intersections or other filters on sets.
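To make the set view concrete, here is a small sketch using the cow/dog index. The helper names (`docs_with_all`, `docs_with_any`, `docs_without`) are hypothetical and only there to label the three set operations.

```python
# The word -> documents index from the earlier slide.
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"],
}

def docs_with_all(index, *words):
    # AND: intersection of the document sets of every word.
    sets = [set(index.get(w, [])) for w in words]
    return set.intersection(*sets) if sets else set()

def docs_with_any(index, *words):
    # OR: union of the document sets of every word.
    return set().union(*(index.get(w, []) for w in words))

def docs_without(index, candidates, word):
    # NOT: remove documents containing the word from a candidate set.
    return set(candidates) - set(index.get(word, []))

print(sorted(docs_with_all(index, "The", "dog")))        # ['Doc2', 'Doc3']
print(sorted(docs_with_any(index, "moo", "woof")))       # ['Doc1', 'Doc2']
print(sorted(docs_without(index, index["The"], "cow")))  # ['Doc2']
```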
![Page 25: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/25.jpg)
Most searches use simple terms and "boolean" operators.
![Page 26: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/26.jpg)
"boolean"
"word" - word MAY/SHOULD appear in document
"+word" - word MUST appear in document
"-word" - word MUST NOT appear in document
![Page 27: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/27.jpg)
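Example Query: "+type:book content:java content:python -content:ruby"
Find books which don't contain "ruby" in content, preferring those with "java" or "python" in content (in Lucene-style engines the plain terms become optional once a "+" term is present, and mainly affect ranking).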
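A toy evaluator for this kind of query might look like the sketch below. Everything in it is illustrative: the index contents and document ids are invented, `parse` and `search` are hypothetical helpers, and the semantics are an assumption modelled on Lucene-like behaviour, where plain (SHOULD) terms only affect ranking once at least one "+" (MUST) term is present.

```python
# Hypothetical field-qualified index; document ids are made up for the example.
index = {
    "type:book": ["B1", "B2", "B3", "B4"],
    "type:article": ["A1"],
    "content:java": ["B1", "B3", "A1"],
    "content:python": ["B2", "B3"],
    "content:ruby": ["B3"],
}

def parse(query):
    # Split the query into MUST (+), MUST NOT (-) and SHOULD (no prefix) terms.
    must, should, must_not = [], [], []
    for term in query.split():
        if term.startswith("+"):
            must.append(term[1:])
        elif term.startswith("-"):
            must_not.append(term[1:])
        else:
            should.append(term)
    return must, should, must_not

def search(index, query):
    must, should, must_not = parse(query)

    def postings(term):
        return set(index.get(term, []))

    if must:
        # Every MUST term has to match.
        candidates = set.intersection(*(postings(t) for t in must))
    else:
        # Assumption: with no MUST terms, at least one SHOULD term has to match.
        candidates = set().union(*(postings(t) for t in should))
    for t in must_not:
        candidates -= postings(t)

    def score(doc):
        # Crude ranking: count how many SHOULD terms the document contains.
        return sum(doc in postings(t) for t in should)

    return sorted(candidates, key=lambda d: (-score(d), d))

print(search(index, "+type:book content:java content:python -content:ruby"))
# ['B1', 'B2', 'B4']
```

Under these assumed semantics B4 still matches (it is a book without "ruby") but ranks last because it contains neither optional term.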
![Page 28: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/28.jpg)
Err...wait...what the hell does "content:java" mean?
![Page 29: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/29.jpg)
Reviewing the "document" concept
![Page 30: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/30.jpg)
An index consists of one or more documents
![Page 31: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/31.jpg)
Each document consists of one or more "field"s. Each field has
a name and content.
![Page 32: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/32.jpg)
Field examples
content
title
author
publication date
etc.
![Page 33: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/33.jpg)
So how are fields handled internally?
In most cases it is very simple. A word belongs to a specific field, so the field name can be stored directly in the term.
![Page 34: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/34.jpg)
New index example
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
}
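One way to picture this: each document becomes a small mapping of field name to content, and the field name is prefixed onto every word before it goes into the index. The sketch below assumes that layout; `build_fielded_index` is a hypothetical helper, and the hyphen splitting is the same assumption used for "cow-dog" earlier.

```python
from collections import defaultdict

# Each document is a set of named fields (layout assumed for illustration).
documents = {
    "Doc1": {"type": "example_documents", "content": "The cow says moo"},
    "Doc2": {"type": "example_documents", "content": "The dog says woof"},
    "Doc3": {"type": "example_documents", "content": "The cow-dog says moof"},
}

def build_fielded_index(documents):
    index = defaultdict(list)
    for doc_id, fields in documents.items():
        for field, value in fields.items():
            # Assumption: hyphens separate words, as in the sample index.
            for word in value.replace("-", " ").split():
                term = f"{field}:{word}"  # the field name is part of the term
                if doc_id not in index[term]:
                    index[term].append(doc_id)
    return dict(index)

index = build_fielded_index(documents)
print(index["content:cow"])             # ['Doc1', 'Doc3']
print(index["type:example_documents"])  # ['Doc1', 'Doc2', 'Doc3']
```

With this layout a query term such as "content:java" is just another key lookup.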
![Page 35: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/35.jpg)
But enough of that
![Page 36: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/36.jpg)
We missed the most important thing!
![Page 37: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/37.jpg)
We saved the most important thing for last!
![Page 38: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/38.jpg)
Analysis
![Page 39: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/39.jpg)
or for mortals: how you get from a long text to small
tokens/words/terms
![Page 40: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/40.jpg)
…borrowing from Lucene naming/API...
![Page 41: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/41.jpg)
(One) Tokenizer
![Page 42: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/42.jpg)
and zero or more Filters
![Page 43: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/43.jpg)
First...
![Page 44: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/44.jpg)
Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
![Page 45: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/45.jpg)
Tokenizer: Breaks up a single string into smaller tokens.
![Page 46: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/46.jpg)
You define what splitting rules are best for you.
![Page 47: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/47.jpg)
Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:
![Page 48: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/48.jpg)
Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]
Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
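In Python, a whitespace tokenizer of this kind is roughly what `str.split()` with no arguments already does. The function name `whitespace_tokenize` below is just an illustrative wrapper, not Lucene's API.

```python
docs = {
    "Doc1": "The quick brown fox jumps over the lazy dog",
    "Doc2": "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!",
    "Doc3": "And the final score is: no TARDIS, no screwdriver, "
            "two minutes to spare. Who da man?!",
}

def whitespace_tokenize(text):
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()

for doc_id, text in docs.items():
    print(doc_id, whitespace_tokenize(text))
```

Punctuation and case are left untouched, which is why tokens such as "Daleks:" and "EXTERMINATE!!!" show up.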
![Page 49: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/49.jpg)
But wait, that doesn't look right...
![Page 50: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/50.jpg)
So we apply Filters
![Page 51: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/51.jpg)
Filter
Transforms one single token into another single token, multiple tokens, or no token at all.
You can apply several of them in a specific order.
![Page 52: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/52.jpg)
Filter 1: lower-case (since we don't want the search to be
case-sensitive)
![Page 53: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/53.jpg)
Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]
Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]
![Page 54: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/54.jpg)
Filter 2: remove punctuation
![Page 55: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/55.jpg)
Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]
Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
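The two filters can be sketched as a small chain applied after whitespace tokenization. The function names are hypothetical, and "punctuation" is assumed here to mean the ASCII characters in Python's `string.punctuation`.

```python
import string

def lowercase_filter(tokens):
    # Filter 1: one token in, one lower-cased token out.
    for token in tokens:
        yield token.lower()

def punctuation_filter(tokens):
    # Filter 2: strip ASCII punctuation; drop tokens that end up empty.
    table = str.maketrans("", "", string.punctuation)
    for token in tokens:
        cleaned = token.translate(table)
        if cleaned:
            yield cleaned

def analyze(text, filters=(lowercase_filter, punctuation_filter)):
    tokens = text.split()        # the (one) tokenizer
    for apply_filter in filters:  # zero or more filters, applied in order
        tokens = apply_filter(tokens)
    return list(tokens)

print(analyze("All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"))
# ['all', 'daleks', 'exterminate', 'exterminate', 'exterminate', 'exterminate']
```

Note that this punctuation filter would also turn "cow-dog" into "cowdog", which is one reason the order and choice of filters deserve some thought.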
![Page 56: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/56.jpg)
Add more filter seasoning until it tastes just right.
![Page 57: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/57.jpg)
Lots of things you can do with filters
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms
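Stopwords and synonyms are handy examples of filters that do not map one token to exactly one token. The sketch below uses a made-up stopword set and synonym table; real lists depend on your language and content.

```python
# Hypothetical word lists, chosen only for this example.
STOPWORDS = {"the", "and", "is", "to", "no"}
SYNONYMS = {"quick": ["fast"], "dog": ["hound"]}

def stopword_filter(tokens):
    # One token in, zero tokens out when the token is a stopword.
    return [t for t in tokens if t not in STOPWORDS]

def synonym_filter(tokens):
    # One token in, several tokens out when synonyms are known.
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
result = synonym_filter(stopword_filter(tokens))
print(result)
# ['quick', 'fast', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'hound']
```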
![Page 58: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/58.jpg)
Possibilities are endless, enjoy experimenting with
them!
![Page 59: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/59.jpg)
Just one warning…
![Page 60: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/60.jpg)
Always use the same analysis rules when indexing and when parsing search text entered by
the user!
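A quick illustration of why this matters, under the same assumptions as before (whitespace tokenizer, lower-casing, ASCII punctuation removal; `analyze` and `search` are hypothetical helpers). The index is built with `analyze`, and the same query is then run once with the matching analyzer and once with a bare whitespace split.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)

def analyze(text):
    # Whitespace tokenizer + lower-case filter + punctuation filter.
    tokens = (t.lower().translate(_PUNCT) for t in text.split())
    return [t for t in tokens if t]

docs = {
    "Doc1": "The quick brown fox jumps over the lazy dog",
    "Doc2": "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!",
}

# Index time: terms are produced by analyze().
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(query, analyzer):
    # AND together the document sets of all query terms.
    results = None
    for term in analyzer(query):
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results or set()

print(search("EXTERMINATE!!!", analyze))    # {'Doc2'}
print(search("EXTERMINATE!!!", str.split))  # set()
```

The second search finds nothing because the raw term "EXTERMINATE!!!" was never stored in the index; only "exterminate" was.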
![Page 61: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/61.jpg)
I bet you want to start working with this
![Page 62: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/62.jpg)
Implementations
Lucene (Java main, .NET, Python, C)
SOLR if using from other languages
Xapian
Sphinx
OpenFTS
MySQL Full-Text Search (kind of…)
![Page 63: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/63.jpg)
Related Books
![Page 64: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/64.jpg)
The theory
Introduction to Information Retrieval
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Warning: contains a lot of math.
![Page 65: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/65.jpg)
The practice (for Lucene at least):
Lucene in Action, second edition:
http://www.manning.com/hatcher3/
Warning: contains a lot of Java
![Page 66: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/66.jpg)
Questions?
![Page 67: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/67.jpg)
Contact me (with interesting problems involving lots of data)
@[email protected]://blog.deathy.info/ (yeah…I know…)
![Page 68: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/68.jpg)
Fin.
![Page 69: Introduction to Full-Text Search](https://reader033.vdocuments.us/reader033/viewer/2022050804/54bede644a7959d0298b4581/html5/thumbnails/69.jpg)
So where’s the Halloween Party?
Happy Halloween !