A NOOBS LESSON ON SOLR (CONFIGURATION)

Upload: bti360

Post on 10-Aug-2015

STEVE, STOP ME IF I’M WRONG

at any point

not exactly a secret, but a disclaimer: I don’t know everything there is to know about Solr or its configuration

EASIEST WAY I CAN EXPLAIN SOLR.

How would you find all the pages a term or phrase appears on in a book?

so we can think of Solr like an index in the back of a book

we use our brains to find the words or terms in the index

Solr’s brain is schema.xml

the words or terms refer to documents (text streams)

HOW DOES THE INDEX GET POPULATED?

schema.xml!

HOW DOES THE INDEX GET SEARCHED?

schema.xml!

SO, SCHEMA.XML IS THE BRAIN

an index contains one or more documents

a document is the unit of search and indexing

documents contain fields

so, index = tons of documents, and each document has one or more fields

make sense yet?
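The book-index picture can be sketched in a few lines of Java. This is only a toy illustration of the idea, not how Solr is implemented: the class name `BookIndexSketch` and its methods are made up for this example, and each "document" here has just one text field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical helper): maps each term to the documents containing it. */
final class BookIndexSketch {

    // term -> sorted document ids, like "word -> page numbers" in the back of a book
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Adds one document; the text is lowercased and split on whitespace. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents containing the term, or an empty list. */
    List<Integer> lookup(String term) {
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        BookIndexSketch index = new BookIndexSketch();
        index.add(1, "Solr is a search server");
        index.add(2, "the schema configures Solr");
        System.out.println(index.lookup("solr"));   // [1, 2]
        System.out.println(index.lookup("schema")); // [2]
    }
}
```

Looking up a term never scans the documents themselves; it only consults the term-to-document map, which is the whole point of the book-index analogy.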

<field name="html" type="example" indexed="true" stored="true" multiValued="true" />

<fieldType name="example" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and schema.xml is where it’s at!

it defines the fields and how to index and search each field
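To make "how to index and search each field" a bit more concrete, here is a rough plain-Java imitation of the three analyzer steps in the example fieldType: strip HTML, split into tokens, lowercase. It is a sketch under loud assumptions: the class `AnalyzerChainSketch` and its methods are invented for illustration, the regexes are much cruder than Solr's real HTMLStripCharFilterFactory and StandardTokenizerFactory, and no Solr classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a charFilter -> tokenizer -> filter chain. */
final class AnalyzerChainSketch {

    /** Crude char filter: replaces anything that looks like an HTML tag with a space. */
    static String stripHtml(String input) {
        return input.replaceAll("<[^>]*>", " ");
    }

    /** Crude tokenizer: splits on runs of characters that are not letters or digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Token filter: lowercases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Runs the steps in the same order as the analyzer definition. */
    static List<String> analyze(String raw) {
        return lowerCase(tokenize(stripHtml(raw)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("<p>Hello <b>Solr</b> World</p>")); // [hello, solr, world]
    }
}
```

The order matters: the char filter sees the raw text, the tokenizer sees the cleaned text, and each filter sees the tokens produced by the step before it.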

FIELD? FIELDTYPE? HALP PLS.

@Test
public void sslCertsHostNameField() throws SolrServerException {
    // field name, then the value to index, then the expected query results
    testExpectations("sslcerts-hostname", "d-128-100-108.bootp.virginia.edu",
        hit("VIRGINIA.EDU"),                              // queries that should match
        hit("bootp.virginia.edu"),
        hit("\"d-128-100-108.bootp.virginia.edu\""),
        miss("mail.virginia.edu"));                       // query that should not match
}

THIS TEST FAILS :-( So where do we look?

<field name="sslcerts-hostname" type="text_general" indexed="true" stored="true" multiValued="true" />

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

the field uses text_general, which only splits on whitespace, so the whole hostname is indexed as one case-sensitive token

so we give the field its own fieldType, adding an n-gram filter and a lowercase filter:

<fieldType name="sslcerts_hostname" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.NGramFilterFactory" maxGramSize="25" minGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and point the field at the new type:

<field name="sslcerts-hostname" type="sslcerts_hostname" indexed="true" stored="true" multiValued="true" />

@Test
public void sslCertsHostNameField() throws SolrServerException {
    testExpectations("sslcerts-hostname", "d-128-100-108.bootp.virginia.edu",
        hit("VIRGINIA.EDU"),
        hit("bootp.virginia.edu"),
        hit("\"d-128-100-108.bootp.virginia.edu\""),
        miss("mail.virginia.edu"));
}

THIS TEST PASSES :-D
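One way to picture what the n-gram and lowercase filters put into the index is the sketch below. It is a simplification with assumptions stated up front: `HostnameNGramSketch` and `grams` are invented names, the code uses plain substrings rather than Solr's NGramFilterFactory, and it only models index-time analysis. In Solr the same analyzer also runs on the query, so the quoted full-hostname query (32 characters, longer than maxGramSize of 25) is not covered by this sketch.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of index-time n-grams for a single hostname token. */
final class HostnameNGramSketch {

    /** Returns every lowercased substring of token whose length is between min and max. */
    static Set<String> grams(String token, int min, int max) {
        Set<String> out = new LinkedHashSet<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                out.add(token.substring(start, start + len).toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same sizes as the fieldType: minGramSize=3, maxGramSize=25
        Set<String> indexed = grams("d-128-100-108.bootp.virginia.edu", 3, 25);

        // Queries are lowercased before the lookup, mirroring the lowercase filter
        System.out.println(indexed.contains("VIRGINIA.EDU".toLowerCase(Locale.ROOT))); // true
        System.out.println(indexed.contains("bootp.virginia.edu"));                    // true
        System.out.println(indexed.contains("mail.virginia.edu"));                     // false
    }
}
```

With only the whitespace tokenizer, the index would hold the single token "d-128-100-108.bootp.virginia.edu", so neither a partial nor an upper-case query could equal it; the grams are what make partial hostnames findable.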

SUMMARY OF WHAT WE LEARNED.

a Solr index is made up of a bunch of documents (token streams); think of the index in the back of a book

schema.xml holds the brains, the power, the rules for how data gets stored as documents and how they’re returned for matching queries

thanks to Steve’s exercises, I was able to look at the schema.xml file and, for the most part, understand it

hopefully you can look at it now and understand it too

QUESTIONS?