"searching with solr" - tyler harms, south dakota code camp 2012
DESCRIPTION
"Searching with Solr" by Tyler Harms, given November 10, 2012, at South Dakota Code Camp 2012 in Sioux Falls.
TRANSCRIPT
Tyler Harms
Developer
@harmstyler
AN INTRODUCTION
Searching with Solr
Saturday, November 10, 12
SEARCHING WITH SOLR
Why Implement Solr?
• Does your site need search?
• Is Google enough?
• Do you need/want to control rankings?
• Just text, or Structured Data?
What is Solr?
Solr is a standalone enterprise search server with a REST-like API. You put documents in it [...] over HTTP. You query it via HTTP GET and receive [...] results.
Solr Versions
• Current Version(s)
  • Solr 3.6.1
  • Solr 4
• Released Versions are always stable
$ wget http://(...)/3.6.1/apache-solr-3.6.1.tgz
$ tar -xzf apache-solr-3.6.1.tgz
$ cd apache-solr-3.6.1/example/
$ java -jar start.jar
(a lot of java log...)
Search Alternatives
• Google
• Lucene
• elasticsearch
• Whoosh
• Xapian
• Many Others
NOT a Database Replacement
• Solr is designed to live alongside your website as a separate web app
Frontend Servers [1..n]
Database Master
Database Slaves [0..n]
Solr Master
Solr Slaves [0..n]
Scaling Solr
• Master/Slave Architecture
  • Write to master -> Read from slaves
• Multicore Setup
  • Multiple Solr ‘cores’ running alongside each other within the same install
Solr’s Data Model
• Solr maintains a collection of documents
• A document is a collection of fields and values
• A field can occur multiple times in a doc
• Documents are immutable
  • They can be deleted and replaced by new versions, however.
Querying
• HTTP request
  • http://localhost:8983/solr/select?q=blend&start=0&rows=10
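A minimal Python sketch of building that request URL, assuming the example server on localhost:8983. The helper name select_url is made up for illustration; only the q, start, and rows parameters come from the slide.

```python
from urllib.parse import urlencode

# Assumed location of the example server; adjust for your own install.
SOLR_SELECT = "http://localhost:8983/solr/select"

def select_url(q, start=0, rows=10, base=SOLR_SELECT):
    """Return a /select URL with the query parameters URL-encoded."""
    params = [("q", q), ("start", start), ("rows", rows)]
    return base + "?" + urlencode(params)

print(select_url("blend"))
```

Letting urlencode handle the escaping matters once queries contain characters such as ":" or spaces.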
Solr Query Syntax
• blend (value)
• company:blend (field:value)
• title:"Searching with Solr" AND text:apache
• id:[* TO *]
• *:* (all fields : all values)
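A small sketch of composing such query strings in Python. The helpers phrase, field, and all_of are hypothetical names, not part of any Solr client. The sketch only quotes values that contain whitespace and does not escape other special characters in single-term values.

```python
def phrase(text):
    """Quote a phrase, escaping backslashes and double quotes inside it."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def field(name, value):
    """Build field:value; values containing whitespace become a quoted phrase."""
    if any(ch.isspace() for ch in value):
        value = phrase(value)
    return name + ":" + value

def all_of(*clauses):
    """Join clauses with AND so every clause must match."""
    return " AND ".join(clauses)

q = all_of(field("title", "Searching with Solr"), field("text", "apache"))
print(q)
```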
Using Solr
• Getting Data into Solr
• Getting Data out of Solr
Getting Data into Solr
• POST it
<add>
  <doc>
    <field name="abstract">Lorem ipsum</field>
    <field name="company">Blend Interactive</field>
    <field name="text">Lorem Ipsum</field>
    <field name="title">Some Title</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>
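One way to generate an <add> message like this from Python, sketched with the standard library only. The function name build_add and the dict-per-document input shape are assumptions made for illustration. A list value produces a repeated field, which is how a multi-valued field would be sent.

```python
import xml.etree.ElementTree as ET

def build_add(docs):
    """Build an <add> message from a list of {field name: value or list of values}."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                # ElementTree escapes &, <, and > in the text for us
                ET.SubElement(doc, "field", name=name).text = str(v)
    return ET.tostring(add, encoding="unicode")

xml_body = build_add([{"title": "Some Title", "company": "Blend Interactive"}])
print(xml_body)
```

Building the XML with a library rather than string concatenation avoids broken messages when field values contain markup characters.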
Committing
• Nothing shows up in the index until you commit
• You can just POST <commit/> to:
  • http://<host>:<port>/solr/update
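A hedged sketch of preparing that POST with Python's urllib. The base URL assumes the local example server, and update_request is a made-up helper name. The request is only constructed here; sending it needs a running Solr.

```python
import urllib.request

def update_request(xml_body, base="http://localhost:8983/solr"):
    """Prepare (but do not send) a POST of an XML message to /update."""
    return urllib.request.Request(
        base + "/update",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )

commit = update_request("<commit/>")

# To actually send it against a running server:
# with urllib.request.urlopen(commit) as resp:
#     print(resp.status)
```

The same helper can carry an <add> or <delete> message, followed by a separate <commit/>.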
Getting Data out of Solr
• http://localhost:8983/solr/select/?q=solr
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">19</int>
    <lst name="params">
      <str name="q">solr</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="abstract">A brief introduction to using Apache Solr for implementing search for your website.</str>
      <str name="django_ct">codecamp.session</str>
      <str name="django_id">19</str>
      <str name="id">codecamp.session.19</str>
      <str name="text">Searching with Solr: An Introduction A brief introduction to using Apache Solr for implementing search for your website.</str>
      <str name="title">Searching with Solr: An Introduction</str>
    </doc>
  </result>
</response>
Getting Data out of Solr: JSON
• http://localhost:8983/solr/select/?q=solr&wt=json
{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "wt": "json",
      "q": "solr"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "django_id": "19",
        "title": "Searching with Solr: An Introduction",
        "text": "Searching with Solr: An Introduction\nA brief introduction to using Apache Solr for implementing search for your website.",
        "abstract": "A brief introduction to using Apache Solr for implementing search for your website.",
        "django_ct": "codecamp.session",
        "id": "codecamp.session.19"
      }
    ]
  }
}
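A short sketch of reading such a JSON response in Python. The raw string is a trimmed copy of the response above, and summarize is a hypothetical helper name.

```python
import json

# Trimmed copy of the JSON response from the slide.
raw = """{"responseHeader": {"status": 0, "QTime": 0,
                    "params": {"wt": "json", "q": "solr"}},
 "response": {"numFound": 1, "start": 0, "docs": [
   {"django_id": "19", "title": "Searching with Solr: An Introduction",
    "django_ct": "codecamp.session", "id": "codecamp.session.19"}]}}"""

def summarize(body):
    """Return (total hit count, [(id, title), ...]) for the current page."""
    response = json.loads(body)["response"]
    hits = [(doc["id"], doc.get("title", "")) for doc in response["docs"]]
    return response["numFound"], hits

total, hits = summarize(raw)
print(total, hits)
```

numFound is the total number of matches, while docs holds only the rows requested, so the two can differ when paging.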
Deleting Data from Solr
• POST it
<delete><id>codecamp.session.19</id></delete>
<delete><query>company:blend</query></delete>
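The two delete forms, by id and by query, can be generated with a couple of small helpers. This is a sketch; delete_by_id and delete_by_query are illustrative names, and the messages would be POSTed to the same /update URL as adds, followed by a commit.

```python
from xml.sax.saxutils import escape

def delete_by_id(doc_id):
    """Delete the single document whose unique key equals doc_id."""
    return "<delete><id>" + escape(doc_id) + "</id></delete>"

def delete_by_query(query):
    """Delete every document matching the given Solr query."""
    return "<delete><query>" + escape(query) + "</query></delete>"

print(delete_by_id("codecamp.session.19"))
print(delete_by_query("company:blend"))
```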
The Solr Schema
• schema.xml
  • Defines ‘types’ used in the webapp
  • Defines the fields
  • Defines ‘copyfields’
  • Read the schema inside the example project for more
The Solr Schema
• Types
  • Define how a field and query should be processed
    • Word Stemming
    • Case Folding
    • How would you handle a search for ‘C.I.A.’?
  • Dates, ints, floats, etc. are defined here as well
  • 2 Modes
    • Index Time
    • Query Time
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
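To get a feel for what the index-time and query-time chains in this field type do, here is a toy Python model. It is not Solr's implementation. The stop-word and synonym entries are made-up stand-ins for stopwords.txt and synonyms.txt, and token positions are ignored.

```python
STOPWORDS = {"a", "an", "the", "for", "to"}   # stand-in for stopwords.txt
SYNONYMS = {"tv": ["tv", "television"]}        # stand-in for synonyms.txt

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to tokens."""
    return text.split()

def stop_filter(tokens):
    # ignoreCase="true": match stop words case-insensitively, keep other tokens as-is
    return [t for t in tokens if t.lower() not in STOPWORDS]

def synonym_filter(tokens):
    # expand="true": a matching token is replaced by all of its synonyms
    out = []
    for t in tokens:
        out.extend(SYNONYMS.get(t.lower(), [t]))
    return out

def analyze_index(text):
    return stop_filter(whitespace_tokenize(text))

def analyze_query(text):
    return stop_filter(synonym_filter(whitespace_tokenize(text)))

print(analyze_index("The C.I.A. report"))
print(analyze_query("a TV guide"))
```

In this model "C.I.A." survives as a single token, so a query for "CIA" would not match it, which is the kind of question the type definition has to answer.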
Fields
• The elements of a document
• Both Predefined and Dynamic
• Fields may occur multiple times
• May be indexed and/or stored
<fields>
  <!-- general -->
  <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
  <field name="django_ct" type="string" indexed="true" stored="true" multiValued="false" />
  <field name="django_id" type="string" indexed="true" stored="true" multiValued="false" />
  <!-- dynamic -->
  <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
  <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
  <dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
  <dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
  <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
  <!-- app -->
  <field name="bio" type="text" indexed="true" stored="true" multiValued="false" />
  <field name="title" type="text" indexed="true" stored="true" multiValued="false" />
  <field name="text" type="text" indexed="true" stored="true" multiValued="false" />
  <field name="abstract" type="text" indexed="true" stored="true" multiValued="false" />
  <field name="full_name" type="text" indexed="true" stored="true" multiValued="false" />
  <field name="company" type="text" indexed="true" stored="true" multiValued="false" />
</fields>
Copy Fields
• Two Main Uses
  • Analyze fields in different ways
  • Concatenate Fields
<copyField source="bio" dest="df_text" />
<copyField source="year" dest="century" maxChars="2"/>
• 2000 would be stored as 20
• Useful for custom faceting
The Solr Config File
• solrconfig.xml
  • Defines request handlers, defaults, & caches
  • Read the solrconfig.xml inside the example project for more
Other Solr Tools
• Debug Query
• Boost Functions
• Search Faceting
• Search Filters
• Search Highlighting
• Solr Admin
Debug Query Option
• Add &debugQuery=on to request parameters
• Returns a parsed form of the query
<lst name="debug">
  <str name="rawquerystring">solr</str>
  <str name="querystring">solr</str>
  <str name="parsedquery">text:solr</str>
  <str name="parsedquery_toString">text:solr</str>
  <lst name="explain">
    <str name="codecamp.session.19">
1.2147729 = (MATCH) fieldWeight(text:solr in 17), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.9267395 = idf(docFreq=2, maxDocs=56)
  0.21875 = fieldNorm(field=text, doc=17)
    </str>
  </lst>
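The numbers in the explain output can be reproduced by hand. This sketch assumes Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)). The fieldNorm value is copied from the output rather than derived.

```python
import math

def tf(term_freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(term_freq)

def idf(doc_freq, max_docs):
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Copied from the explain output; it encodes field length and index-time boosts.
field_norm = 0.21875

score = tf(2) * idf(2, 56) * field_norm
print(round(score, 4))
```

The result agrees with the 1.2147729 reported by Solr up to single-precision rounding.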
Boost Function
• Allows you to influence results at query time
• Really useful for tuning scoring
• You can also boost at index time
q=blend&qf=text^2 company
&bq=text:blend^2
• More information available - http://wiki.apache.org/solr/SolrRelevancyFAQ
• Can be used with both the dismax and standard query handlers; I use dismax
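A sketch of assembling the boosted query parameters in Python. It assumes the dismax parser is selected with defType=dismax, and dismax_params is a hypothetical helper name. The field names and boosts are the ones from the slide.

```python
from urllib.parse import urlencode

def dismax_params(q, field_boosts, boost_query=None):
    """field_boosts maps field name -> boost; None means no explicit boost."""
    qf = " ".join(
        name if boost is None else "{}^{}".format(name, boost)
        for name, boost in field_boosts.items()
    )
    params = [("defType", "dismax"), ("q", q), ("qf", qf)]
    if boost_query:
        params.append(("bq", boost_query))
    return urlencode(params)

print(dismax_params("blend", {"text": 2, "company": None}))
```

The "^" and the space in qf are percent-encoded by urlencode, which is easy to get wrong when building the string by hand.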
Solr Faceting
• What is a facet?
  • “Interaction style where users filter a set of items by progressively selecting from only valid values of a faceted classification system” - Keith Instone, SOASIS&T, July 8, 2004
• What does it look like?
• Make sure to use an untokenized field (e.g. string)
  • “San Jose” != “san”+“jose”
q=*:*
facet=on
facet.field=company
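A sketch of reading facet counts from a wt=json response in Python. The sample payload and its counts are invented for illustration. It assumes the default JSON layout, in which each facet field is a flat list alternating value and count.

```python
import json

# Hypothetical facet section of a wt=json response; the counts are made up.
raw = '{"facet_counts": {"facet_fields": {"company": ["Blend Interactive", 3, "Acme", 1]}}}'

def facet_counts(body, field):
    """Return [(value, count), ...] for one facet field."""
    flat = json.loads(body)["facet_counts"]["facet_fields"][field]
    # The flat list alternates value, count, value, count, ...
    return list(zip(flat[0::2], flat[1::2]))

print(facet_counts(raw, "company"))
```

Each pair can then be rendered as a link that adds a matching filter to the next request.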
Solr Filter Query
• Used to narrow your search query
• Restricts the superset of documents that can be returned
• ‘fq’ parameter (short for Filter Query)
q=*:*
fq=company:blend
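Several filters can be combined by repeating the fq parameter. A minimal sketch, with filtered_query as a made-up helper name:

```python
from urllib.parse import urlencode

def filtered_query(q, filters):
    """Encode q plus one fq parameter per filter."""
    params = [("q", q)] + [("fq", f) for f in filters]
    return urlencode(params)

print(filtered_query("*:*", ["company:blend"]))
```

Passing a list of pairs rather than a dict is what allows the same parameter name to appear more than once.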
Search Highlighting
• Allow Solr to generate your highlight
hl=true
hl.simple.pre=<b>
hl.simple.post=</b>
hl.fragsize=200
hl.requireFieldMatch=false
hl.fl=text bio title
hl.snippets=1
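With hl=true, a wt=json response carries a separate highlighting section keyed by document id. The sketch below pairs each result with its first snippet. The sample payload and snippet text are invented, and with_snippets is a hypothetical helper.

```python
import json

# Hypothetical excerpt of a wt=json response with hl=true; the snippet is made up.
raw = json.dumps({
    "response": {"docs": [
        {"id": "codecamp.session.19", "title": "Searching with Solr: An Introduction"},
    ]},
    "highlighting": {
        "codecamp.session.19": {"text": ["Searching with <b>Solr</b>: An Introduction"]},
    },
})

def with_snippets(body, field="text"):
    """Return [(title, first highlighted snippet or None), ...]."""
    data = json.loads(body)
    highlights = data.get("highlighting", {})
    results = []
    for doc in data["response"]["docs"]:
        snippets = highlights.get(doc["id"], {}).get(field, [])
        results.append((doc["title"], snippets[0] if snippets else None))
    return results

print(with_snippets(raw))
```

Documents without a match in the highlighted field simply have no entry, so the lookup falls back to None.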
Solr Admin
• http://localhost:8983/solr/admin/
• Built-in app for testing all search options
  • Field Analysis
  • Schema Browser
  • Full Query Interface
  • Solr Statistics
  • Solr Information
  • Many More Options
Solr/Browse
• Test your search configuration using the /browse requestHandler
Resources
• Apache Solr Website
  • http://lucene.apache.org/solr/
  • Wiki, mailing list, bugs/features
• Books