A noob's lesson on Solr (configuration)
TRANSCRIPT
STEVE, STOP ME IF I’M WRONG
at any point
not exactly a secret, but a disclaimer: I don’t know everything there is to know about Solr or its configuration
EASIEST WAY I CAN EXPLAIN SOLR.
How would you find all the pages a term or phrase appears on in a book?
so we can think of Solr like an index in the back of a book
we use our brains to find the words or terms in the index
Solr’s brain is schema.xml
the words or terms refer to documents (text streams)
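To make the book analogy concrete, here is a minimal sketch of a back-of-the-book index in plain Java. It is only an illustration of the idea, not how Solr is implemented; the class name `BookIndex`, its methods, and the sample pages are all made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: a tiny "back of the book" index (term -> page numbers).
class BookIndex {
    // Build the index from page texts; pages are numbered from 1.
    static Map<String, SortedSet<Integer>> build(List<String> pages) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < pages.size(); i++) {
            // Ignore case and split on anything that is not a letter or digit.
            for (String term : pages.get(i).toLowerCase().split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    // Look a term up the way a reader would: find the word, read off the pages.
    static SortedSet<Integer> lookup(Map<String, SortedSet<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
            "Solr is a search server",
            "A schema describes fields",
            "Solr reads the schema at startup");
        Map<String, SortedSet<Integer>> index = build(pages);
        System.out.println(lookup(index, "Solr"));   // pages that mention "solr"
        System.out.println(lookup(index, "schema")); // pages that mention "schema"
    }
}
```

The point of the sketch: the work happens once, when the index is built, so a lookup never has to reread the pages.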
SO, SCHEMA.XML IS THE BRAIN
an index contains one or more documents
a document is the unit of search and indexing
documents contain fields
so, index = tons of documents, and each document has field(s)
make sense yet?
SO, SCHEMA.XML IS THE BRAIN
<field name="html" type="example" indexed="true" stored="true" multiValued="true" />
<fieldType name="example" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
and schema.xml is where it’s at!
it defines the fields and how to index and search each field
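A rough way to picture what the analyzer in the "example" field type does to a value: strip the HTML, split the text into tokens, lowercase each token. The sketch below uses plain-Java stand-ins for those three steps. It does not call Solr, the regular expressions are much cruder than the real classes, and the names `AnalyzerSketch`, `stripHtml`, `tokenize`, `lowerCase` and `analyze` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximations of the three analysis steps; not Solr's actual classes.
class AnalyzerSketch {
    // Stand-in for HTMLStripCharFilterFactory: drop anything that looks like a tag.
    static String stripHtml(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Stand-in for StandardTokenizerFactory: split on runs of non-letters/non-digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for LowerCaseFilterFactory.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase());
        }
        return out;
    }

    // Apply the steps in the order the <analyzer> element lists them.
    static List<String> analyze(String text) {
        return lowerCase(tokenize(stripHtml(text)));
    }

    public static void main(String[] args) {
        // prints [hello, solr, world]
        System.out.println(analyze("<p>Hello <b>Solr</b> World</p>"));
    }
}
```

The order matters: the char filter sees the raw text, the tokenizer cuts it into tokens, and the filters then rewrite those tokens one at a time.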
FIELD? FIELDTYPE? HALP PLS.
@Test
public void sslCertsHostNameField() throws SolrServerException {
    // testExpectations, hit and miss are helpers from the test suite (not shown here)
    testExpectations("sslcerts-hostname", "d-128-100-108.bootp.virginia.edu",
        hit("VIRGINIA.EDU"),
        hit("bootp.virginia.edu"),
        hit("\"d-128-100-108.bootp.virginia.edu\""),
        miss("mail.virginia.edu"));
}
THIS TEST FAILS :-( So where do we look?
FIELD? FIELDTYPE? HALP PLS.
<field name="sslcerts-hostname" type="text_general" indexed="true" stored="true" multiValued="true" />
FIELD? FIELDTYPE? HALP PLS.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
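Why does the test fail with text_general? Its analyzer is only a whitespace tokenizer, so a hostname with no spaces in it stays one single, case-sensitive token. Here is a simplified sketch of that behaviour, assuming a query term matches only when it equals an indexed token exactly. The class `WhitespaceOnlySketch` and its methods are hypothetical, and this is a model of the idea, not Solr code.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of a field type whose analyzer is only a whitespace tokenizer.
// Assumption: a query term matches only if it equals an indexed token exactly.
class WhitespaceOnlySketch {
    // Split on whitespace and nothing else; case is left untouched.
    static List<String> tokenize(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    static boolean matches(String indexedValue, String queryTerm) {
        return tokenize(indexedValue).contains(queryTerm);
    }

    public static void main(String[] args) {
        String host = "d-128-100-108.bootp.virginia.edu";
        System.out.println(tokenize(host));                      // one token: the whole hostname
        System.out.println(matches(host, "VIRGINIA.EDU"));       // false
        System.out.println(matches(host, "bootp.virginia.edu")); // false
        System.out.println(matches(host, host));                 // true
    }
}
```

So only the full hostname can match, while the test also expects hits for partial and upper-case queries.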
FIELD? FIELDTYPE? HALP PLS.
<field name="sslcerts-hostname" type="sslcerts_hostname" indexed="true" stored="true" multiValued="true" />
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="sslcerts_hostname" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <!-- split the value on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emit the substrings of 3 to 25 characters -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    <!-- lowercase each gram so matching ignores case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
FIELD? FIELDTYPE? HALP PLS.
@Test
public void sslCertsHostNameField() throws SolrServerException {
    testExpectations("sslcerts-hostname", "d-128-100-108.bootp.virginia.edu",
        hit("VIRGINIA.EDU"),
        hit("bootp.virginia.edu"),
        hit("\"d-128-100-108.bootp.virginia.edu\""),
        miss("mail.virginia.edu"));
}
THIS TEST PASSES :-D
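A simplified sketch of why the n-gram and lowercase filters help: on the index side the hostname is cut into the substrings of 3 to 25 characters, and each one is lowercased, so partial and upper-case queries have something to match. The sketch assumes a query hits when its lowercased text is one of those indexed grams. That is a deliberate simplification (Solr also runs the query text through an analyzer), and `NGramSketch`, `indexTerms` and `hit` are hypothetical names, not the test suite's helpers. The quoted full-hostname query is left out because it is longer than maxGramSize and depends on phrase-query handling that this sketch does not model.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified index-side model of the sslcerts_hostname analyzer:
// whitespace tokenizer, then n-grams of minGram..maxGram characters, then lowercasing.
class NGramSketch {
    static Set<String> indexTerms(String value, int minGram, int maxGram) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : value.trim().split("\\s+")) {
            for (int start = 0; start < token.length(); start++) {
                for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                    terms.add(token.substring(start, start + len).toLowerCase());
                }
            }
        }
        return terms;
    }

    // Assumption for this sketch: a query hits if its lowercased text is an indexed gram.
    static boolean hit(Set<String> terms, String query) {
        return terms.contains(query.toLowerCase());
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("d-128-100-108.bootp.virginia.edu", 3, 25);
        System.out.println(hit(terms, "VIRGINIA.EDU"));       // true
        System.out.println(hit(terms, "bootp.virginia.edu")); // true
        System.out.println(hit(terms, "mail.virginia.edu"));  // false
    }
}
```

The trade-off is index size: every value now produces many terms instead of one.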
SUMMARY OF WHAT WE LEARNED.
A Solr index is made up of a bunch of documents (text streams); think of the index in the back of a book
schema.xml holds the brains, the power, the rules: how data gets stored as documents and how those documents are returned for matching queries
thanks to Steve’s exercises, I was able to look at the schema.xml file and, for the most part, understand it
Hopefully you can look at it now and understand it too.