JUDCon Brazil 2014: Lucene from the bottom up
Lucene from the bottom up!
Gustavo Fernandes
What is Lucene
An ultra-fast, high-throughput, Apache-licensed search library with a low memory footprint and support for incremental indexing. Written in Java, with ports to several languages: Python, .NET, C++.
What Lucene is not
• Service
• Database
• Product
Search
1. GTA V for PS3: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
2. DC Universe Online for PS4: Battle with or against their favourite heroes and outlaws, or your own customised character
3. Assassins Creed Black Flag for PS3: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
4. The Last of Us for PS4: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.
Index
1: a and appearance by character creating criminal customising developing gta v her his in invest or potential ps3 start unique you your
2: against and battle character customised dc favourite heroes online or outlaws own ps4 their universe with your
3: a assassins among and captain caribbean creed developed edward fearsome have is kenway lawless named outlaws pirate pirates ps3 republic rule the these young
4: a across and brave brutal ellie girl hope if joel journey last must of ps4 survive survivor teenage the their they to together us work young
Inverted Index (term, followed by the ids of the documents that contain it)
across 4; against 2; among 3; appearance 1; battle 2; brave 4; brutal 4; captain 3; caribbean 3; character 1, 2; creating 1; criminal 1; customised 2; customising 1; developed 3; developing 1; edward 3; ellie 4; favourite 2; fearsome 3; girl 4; heroes 2; hope 4; invest 1
joel 4; journey 4; kenway 3; lawless 3; must 4; named 3; outlaws 2, 3; own 2; pirate 3; pirates 3; potential 1; republic 3; rule 3; start 1; survive 4; survivor 4; teenage 4; together 4; unique 1; work 4; young 3, 4
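The structure above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene stores its index: the class name `InvertedIndexSketch`, the helper `sampleDocs` and the naive "split on non-letters" tokenisation are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {

    // Maps each term to the sorted set of ids of the documents that contain it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive tokenisation (an assumption): lowercase, split on non-letters/digits.
            String[] tokens = doc.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String term : tokens) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // The four game descriptions used throughout the slides.
    static Map<Integer, String> sampleDocs() {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "You start by creating and developing a unique character and invest in your "
                + "potential criminal by customising his or her appearance");
        docs.put(2, "Battle with or against their favourite heroes and outlaws, or your own "
                + "customised character");
        docs.put(3, "Pirates rule the Caribbean and have developed a lawless pirate republic. "
                + "Among these outlaws is a fearsome young captain named Edward Kenway");
        docs.put(4, "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work "
                + "together if they hope to survive their journey across the US.");
        return docs;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(sampleDocs());
        System.out.println("outlaws -> " + index.get("outlaws")); // outlaws -> [2, 3]
        System.out.println("young -> " + index.get("young"));     // young -> [3, 4]
    }
}
```

Looking up a term is then a single map access, which is why search does not need to scan the documents themselves.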
Documents and Fields
Id: 1
Console: PS3
Title: GTA V
Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

Id: 4
Console: PS4
Title: The Last of Us
Description: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together if they hope to survive their journey across the US.
Fields (one inverted index per field; each term is followed by its document ids)
Field: Description
across 4; against 2; among 3; appearance 1; battle 2; brave 4; brutal 4; captain 3; . . . republic 3; rule 3; start 1; survive 4; survivor 4; teenage 4; together 4; unique 1; work 4; young 3, 4
Field: Title
assassins 3; black 3; creed 3; dc 2; flag 3; gta 1; last 4; of 4; online 2; universe 2; the 4; us 4; v 1
Field: Console
ps3 1, 3; ps4 2, 4
Field: Id
1 1; 2 2; 3 3; 4 4
On Terms
• Unit of search
• Created by a process called tokenisation
• Numerous ways of doing it
• Language-specific “gotchas”
Examples
Original text: Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together.
Tokenised: Joel a brutal survivor and Ellie a brave young teenage girl must work together
Stop words removed: Joel brutal survivor Ellie brave young teenage girl must work together
Lowercased: joel brutal survivor ellie brave young teenage girl must work together
Stemming and synonyms:
• survivor → survive (stemming)
• brave → fearless (synonym)
• teenage → teen-age (synonym)
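The first three steps of the example can be reproduced with a small sketch in plain Java. It is not Lucene's analysis chain: the class name `TokenisationSketch` and the two-word stop list are assumptions chosen only to match the sentence on the slide, and stemming and synonyms are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenisationSketch {

    // Assumed stop-word list, just large enough for the slide's sentence.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // 1. Tokenise: split on anything that is not a letter or a digit.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            // 2. Lowercase, so that "Joel" and "joel" become the same term.
            String term = token.toLowerCase(Locale.ROOT);
            // 3. Drop stop words, which carry little meaning for search.
            if (STOP_WORDS.contains(term)) continue;
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze(
                "Joel, a brutal survivor, and Ellie, a brave young teenage girl must work together."));
        // [joel, brutal, survivor, ellie, brave, young, teenage, girl, must, work, together]
    }
}
```

The same function has to be applied to the query text, otherwise a search for "Joel" would never meet the indexed term "joel".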
Examples (2)
Coca-Cola improved the market share of the flagship brand Diet Coke by 0.4% to 42.4%
coca cola improved market share flagship brand diet coke 0 4 42 4

私の名前はグスタボです ("My name is Gustavo")
私 の 名 前 は グ ス タ ボ で す
Phrase
q=title:"black creed"
q=description:"young teenage"
(Figure: the Description and Title field indexes from the Fields slide, with each posting also carrying the positions at which the term occurs in the document.)
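Phrase matching needs term positions as well as document ids. The following is a minimal sketch of that idea in plain Java, not Lucene's implementation; the class name `PhraseQuerySketch` and the nested-map layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class PhraseQuerySketch {

    // term -> (document id -> positions of the term inside that document)
    static Map<String, Map<Integer, List<Integer>>> index(Map<Integer, String> docs) {
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            int position = 0;
            for (String token : doc.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) continue;
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .computeIfAbsent(doc.getKey(), d -> new ArrayList<>())
                        .add(position++);
            }
        }
        return postings;
    }

    // Returns ids of documents where the terms occur at consecutive positions, in order.
    static SortedSet<Integer> phrase(Map<String, Map<Integer, List<Integer>>> postings,
                                     String... terms) {
        SortedSet<Integer> hits = new TreeSet<>();
        Map<Integer, List<Integer>> first =
                postings.getOrDefault(terms[0], Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            for (int start : entry.getValue()) {
                boolean match = true;
                for (int i = 1; i < terms.length && match; i++) {
                    List<Integer> positions = postings
                            .getOrDefault(terms[i], Collections.emptyMap())
                            .getOrDefault(entry.getKey(), Collections.emptyList());
                    match = positions.contains(start + i);
                }
                if (match) hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, String> titles = new LinkedHashMap<>();
        titles.put(3, "Assassins Creed Black Flag");
        titles.put(4, "The Last of Us");
        Map<String, Map<Integer, List<Integer>>> titleIndex = index(titles);
        System.out.println(phrase(titleIndex, "black", "creed")); // [] : both terms present, wrong order
        System.out.println(phrase(titleIndex, "creed", "black")); // [3]
    }
}
```

Without positions, "black creed" would match document 3 simply because both terms occur in its title; the position check is what rejects it.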
Autocomplete
(Figure: the sorted terms of the Description field index, including captain, caribbean, character, criminal, customised, teenage, together, unique, work and young, with their postings.)
Autocomplete: Finite State Transducer
Terms: captain, caribbean, character, criminal, young, your
Input "c" → captain, caribbean, character, criminal
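The lookup itself can be illustrated without an FST. The sketch below swaps in a sorted set from the Java standard library, which gives the same answers for prefix queries but none of the memory savings an FST gets from sharing prefixes and suffixes; the class name `PrefixLookupSketch` is an assumption for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class PrefixLookupSketch {

    // Returns up to 'limit' terms that start with 'prefix', in sorted order.
    static List<String> complete(NavigableSet<String> terms, String prefix, int limit) {
        List<String> suggestions = new ArrayList<>();
        // Every string starting with the prefix sorts between prefix and prefix + '\uffff'.
        String upperBound = prefix + Character.MAX_VALUE;
        for (String term : terms.subSet(prefix, true, upperBound, false)) {
            if (suggestions.size() == limit) break;
            suggestions.add(term);
        }
        return suggestions;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "captain", "caribbean", "character", "criminal", "young", "your"));
        System.out.println(complete(terms, "c", 5));  // [captain, caribbean, character, criminal]
        System.out.println(complete(terms, "yo", 5)); // [young, your]
    }
}
```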
Relevance
q=description:outlaws

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

Both documents match. Which should rank first: DC Universe Online or Assassins Creed?
Vector
Two dimensions (d1, d2): V = (2, 3) = 2 . d1 + 3 . d2
Three dimensions (d1, d2, d3): V = (2, 3, 4) = 2 . d1 + 3 . d2 + 4 . d3
Score - Vector Model
• Result documents are represented as vectors
• The query is represented as a vector
• Vector dimensions are terms
• Vector ‘quantities’ are Tf-Idf weights
• Score = cosine similarity between the query vector and the document vector
0.4024 DC Universe Online
0.3219 Assassins Creed
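Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for sparse vectors stored as maps; the class name `CosineSketch` and the toy vectors are assumptions. It will not reproduce the scores above, since Lucene's practical scoring applies its own normalisation factors.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {

    // Euclidean length of a sparse vector (dimension name -> weight).
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0;
        for (double weight : vector.values()) sumOfSquares += weight * weight;
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            // Dimensions missing from b contribute nothing to the dot product.
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0 ? 0 : dot / norms;
    }

    public static void main(String[] args) {
        Map<String, Double> v = new HashMap<>();   // V = 2 . d1 + 3 . d2
        v.put("d1", 2.0);
        v.put("d2", 3.0);
        Map<String, Double> q = new HashMap<>();   // Q = 1 . d1
        q.put("d1", 1.0);
        System.out.printf("%.4f%n", cosine(q, v)); // 2 / sqrt(13)
    }
}
```

A query with a single term, like the one in this example, only "sees" that one dimension of each document, so documents where the term carries more of the total weight score higher.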
Documents and Queries as vectors

Id: 2
Console: PS4
Title: DC Universe Online
Description: Battle with or against their favourite heroes and outlaws, or your own customised character

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway

D2 = w21 . against + w22 . battle + … + w23 . outlaws + w2j . own
D3 = w31 . among + w32 . captain + … + w35 . outlaws + … + w3j . young
Q = wq . outlaws
Term Weights
• Term frequency (Tf): number of appearances of the term in the document
• Inverse Document Frequency (Idf): Idf = 1 + ln(nDocs / (docFreq + 1))
• Weight: w = sqrt(Tf) . Idf

nDocs = 4

TERM       among    across   outlaws  young
sqrt(Tf)   1        0        1        1
docFreq    1        1        2        2
Idf        1.6931   1.6931   1.287    1.287
w          1.6931   0        1.287    1.287

D3 = 1.6931 . among + 1.6931 . captain + … + 1.287 . outlaws + … + 1.287 . young

Id: 3
Console: PS3
Title: Assassins Creed
Description: Pirates rule the Caribbean and have developed a lawless pirate republic. Among these outlaws is a fearsome young captain named Edward Kenway
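The numbers in the table can be checked with a few lines of Java. The formula used here, 1 + ln(nDocs / (docFreq + 1)), is an assumption in the sense that the slide's own formula did not survive extraction; it is the Idf of Lucene's classic Tf-Idf similarity and it reproduces the table values. The class name `TfIdfSketch` is made up for this example.

```java
class TfIdfSketch {

    // Inverse document frequency: rarer terms (low docFreq) get a higher value.
    static double idf(int docFreq, int nDocs) {
        return 1.0 + Math.log((double) nDocs / (docFreq + 1));
    }

    // Term weight inside one document: sqrt(Tf) . Idf
    static double weight(int termFreq, int docFreq, int nDocs) {
        return Math.sqrt(termFreq) * idf(docFreq, nDocs);
    }

    public static void main(String[] args) {
        int nDocs = 4;
        // Weights for document 3, as in the table above.
        System.out.printf("among   %.4f%n", weight(1, 1, nDocs)); // docFreq 1 -> about 1.6931
        System.out.printf("across  %.4f%n", weight(0, 1, nDocs)); // absent from doc 3 -> 0
        System.out.printf("outlaws %.4f%n", weight(1, 2, nDocs)); // docFreq 2 -> about 1.2877
    }
}
```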
Tf-Idf
A term weighs more in a document:
• The more often the term appears in that document
• The rarer the term is index-wide
Lucene API
Lucene API - Documents
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;

// One Lucene Document per game; Store.YES keeps the original value retrievable.
Document doc = new Document();

doc.add(new IntField("id", 1, Store.YES));
doc.add(new TextField("console", "PS3", Store.YES));
doc.add(new TextField("title", "GTA V", Store.YES));
doc.add(new TextField("description",
    "You start by creating and developing a unique character and invest in your "
    + "potential criminal by customising his or her appearance", Store.YES));
Lucene API - Analysis
Name            Type    Analysis
Id              Number  None
Console         String  Lowercase
Title           Text    WhiteSpace, Lowercase
Description     Text    WhiteSpace, Lowercase, Remove common words
Description_jp  Text    Japanese Tokenizer
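Id: 1
Console: PS3
Title: GTA V
Description: You start by creating and developing a unique character and invest in your potential criminal by customising his or her appearance
Description_jp: かつてないほど大規模でダイナミックな多様性に富んだオープンワールドを誇る『グランド・セフト・オートV』は、ストーリーテリングとゲームプレイを新しい手法で融合。
(English: "Boasting a larger, more dynamic and more diverse open world than ever before, Grand Theft Auto V blends storytelling and gameplay in new ways.")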
Lucene API - Analysis
Analyzer:
Pirates rule the Caribbean
→ Whitespace Tokenizer
→ Lowercase TokenFilter
→ Stopwords TokenFilter
→ pirates rule caribbean
Lucene API - Analysis
Custom Analyzer
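import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MySimpleAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split on whitespace, then lowercase every token.
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(reader);
        LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, lcFilter);
    }
}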
Lucene API - Analysis
From org.apache.lucene.analysis.ja.JapaneseAnalyzer:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
}
Lucene API - Analysis
From org.apache.lucene.analysis.standard.StandardAnalyzer (abridged):
@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
    …
    TokenStream tok = new StandardFilter(getVersion(), src);
    tok = new LowerCaseFilter(getVersion(), tok);
    tok = new StopFilter(getVersion(), tok, stopwords);
    return new TokenStreamComponents(src, tok);
}
Lucene API - Indexing
// One analyzer per field; fields not listed fall back to StandardAnalyzer.
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
analyzerMap.put("id", new KeywordAnalyzer());
analyzerMap.put("console", new MySimpleAnalyzer());
analyzerMap.put("description", new StandardAnalyzer());
analyzerMap.put("description_jp", new JapaneseAnalyzer());

PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(), analyzerMap);

Directory ramDirectory = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter iw = new IndexWriter(ramDirectory, iwc);
for (Document document : documents) {
    iw.addDocument(document);
}
iw.close();
Lucene API - Directory
• RAMDirectory (for tests only)
• FSDirectory
  • MMapDirectory (default for 64-bit)
  • SimpleFSDirectory (java.io.RandomAccessFile)
  • NIOFSDirectory (java.io.FileChannel)
  • WindowsDirectory (native, requires a .dll)
  • NativeUnixDirectory (experimental)
• InfinispanDirectory (3rd party)
Lucene API - Directory
Files in the directory after each IndexWriter.close():
First IndexWriter.close(): _0.fdt _0.fdx _0.fnm _0.nvd _0.nvm _0.si _0_Lucene41_0.doc _0_Lucene41_0.pos _0_Lucene41_0.tim _0_Lucene41_0.tip
Second IndexWriter.close(): _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1_Lucene41_0.doc _1_Lucene41_0.pos _1_Lucene41_0.tim _1_Lucene41_0.tip
Third IndexWriter.close(): _2.fdt _2.fdx _2.fnm _2.nvd _2.nvm _2.si _2_Lucene41_0.doc _2_Lucene41_0.pos _2_Lucene41_0.tim _2_Lucene41_0.tip
Lucene API - Autocomplete
DirectoryReader reader = DirectoryReader.open(directory);

// Build the suggester from the terms of the "description" field.
AnalyzingSuggester suggester = new AnalyzingSuggester(new StandardAnalyzer());
LuceneDictionary dictionary = new LuceneDictionary(reader, "description");
suggester.build(dictionary);

// Up to 5 suggestions for the prefix "c".
List<Lookup.LookupResult> suggestions = suggester.lookup("c", false, 5);

for (Lookup.LookupResult suggestion : suggestions) {
    System.out.println(suggestion.key);
}

Output:
captain
caribbean
character
creating
criminal
Lucene API - Search
q = description:character

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery termQuery = new TermQuery(new Term("description", "character"));
IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(termQuery, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
0.402401 - DC Universe Online
0.321921 - GTA V
Lucene API - Search
q=description:"young teenage"

DirectoryReader reader = DirectoryReader.open(directory);

// Terms are added in phrase order.
PhraseQuery query = new PhraseQuery();
query.add(new Term("description", "young"));
query.add(new Term("description", "teenage"));

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
0.745207 - The Last of Us
Lucene API - Search
q = console:"PS3" AND (description:"pirate" OR description:"criminal")

DirectoryReader reader = DirectoryReader.open(directory);

TermQuery descriptionOne = new TermQuery(new Term("description", "pirate"));
TermQuery descriptionTwo = new TermQuery(new Term("description", "criminal"));

// SHOULD + SHOULD: either description term may match (OR).
BooleanQuery descriptionQuery = new BooleanQuery();
descriptionQuery.add(descriptionOne, BooleanClause.Occur.SHOULD);
descriptionQuery.add(descriptionTwo, BooleanClause.Occur.SHOULD);

// The term is lowercase because the console field is lowercased at index time.
TermQuery consoleQuery = new TermQuery(new Term("console", "ps3"));

// MUST + MUST: both clauses are required (AND).
BooleanQuery query = new BooleanQuery();
query.add(consoleQuery, BooleanClause.Occur.MUST);
query.add(descriptionQuery, BooleanClause.Occur.MUST);

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

Output:
0.741689 - GTA V
0.741689 - Assassins Creed Black Flag
Lucene API - Search
Query Parser
QueryParser queryParser = new QueryParser("description", analyzer);
Query query = queryParser.parse(
    "console:PS3 AND (description:pirate OR description:criminal)");

IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(query, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output:
0.741689 - GTA V
0.741689 - Assassins Creed Black Flag
Lucene API - Sort
QueryParser queryParser = new QueryParser("description", analyzer);
Query query = queryParser.parse(
    "console:PS3 AND (description:pirate OR description:criminal)");

IndexSearcher indexSearcher = new IndexSearcher(reader);

// Sort by title; the third argument (true) reverses the order.
Sort sort = new Sort(new SortField("title", SortField.Type.STRING, true));
TopDocs topDocs = indexSearcher.search(query, 10, sort);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Document document = reader.document(internalId);
    String title = document.get("title");
    System.out.printf("%f - %s\n", scoreDoc.score, title);
}

Output (scores print as NaN because sorting by a field skips score computation by default):
NaN - Assassins Creed Black Flag
NaN - GTA V
Lucene API - Explain
DirectoryReader reader = DirectoryReader.open(directory);

TermQuery termQuery = new TermQuery(new Term("description", "character"));
IndexSearcher indexSearcher = new IndexSearcher(reader);
TopDocs topDocs = indexSearcher.search(termQuery, 10);

for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    int internalId = scoreDoc.doc;
    Explanation explanation = indexSearcher.explain(termQuery, internalId);
    System.out.println(explanation);
}

Output:
0.40240064 = (MATCH) weight(description:character in 2) [DefaultSimilarity], result of:
  0.40240064 = fieldWeight in 2, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.287682 = idf(docFreq=2, maxDocs=4)
    0.3125 = fieldNorm(doc=2)

0.3219205 = (MATCH) weight(description:character in 0) [DefaultSimilarity], result of:
  0.3219205 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    1.287682 = idf(docFreq=2, maxDocs=4)
    0.25 = fieldNorm(doc=0)
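The explain output states that fieldWeight is the product of tf, idf and fieldNorm, which can be verified by hand. In the sketch below the tf and idf formulas are those of Lucene's classic Tf-Idf similarity, and the fieldNorm values are copied from the output rather than computed; the class name `FieldWeightSketch` is an assumption for this example.

```java
class FieldWeightSketch {

    // fieldWeight = tf . idf . fieldNorm, as listed in the explain output.
    static double fieldWeight(double freq, int docFreq, int maxDocs, double fieldNorm) {
        double tf = Math.sqrt(freq);                                    // 1.0 for freq = 1.0
        double idf = 1.0 + Math.log((double) maxDocs / (docFreq + 1)); // about 1.287682 for 2 of 4
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // fieldNorm values are taken from the explain output: 0.3125 and 0.25.
        System.out.printf("%.6f%n", fieldWeight(1.0, 2, 4, 0.3125)); // compare with 0.40240064
        System.out.printf("%.6f%n", fieldWeight(1.0, 2, 4, 0.25));   // compare with 0.3219205
    }
}
```

Both documents contain "character" once and share the same idf, so the only difference in score comes from fieldNorm, which is higher for the shorter description.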
Reviews provided by ign.com