October 7, 2009 Clay Webster Solr – Soup to Nuts*


Page 1: Solr – Soup to Nuts - food.net. A thriving development community at the Apache Software Foundation is constantly adding new features to Solr, ensuring it will be around for a long time

October 7, 2009

Clay Webster

Solr – Soup to Nuts*

Page 2: Scope

»Solr the Product
»Why Make Solr?
»Building Open Source within a Business
»How to Use Solr

Page 3: Solr Basics

Page 4: The Obligatory “What is Solr?”

»Solr is a scalable enterprise software platform that provides highly relevant full-text search with “faceting”.
»Solr is the de facto implementation of the Lucene Java search library.
»Solr is easy to set up. To get powerful features you do not need to write code. Just configure the schema and word analyzers and craft powerful queries.
»Solr is built to be blindingly fast, reliable, and capable of handling intense operational demands, with many caching, replication, and administration controls.
»Solr is internally programmable and has RESTful APIs.
»Solr is open source, actively supported, and 100% free.

Page 5: Solr Features

»High Relevancy – the right documents with the right search scores. The index schema, word analyzers, and search queries are all easy ways to enhance relevancy.
»Search with Faceted Drill-downs – users narrow their search results to a guaranteed number of results in.
»Open Source – Solr is free, open source software. A thriving development community at the Apache Software Foundation is constantly adding new features to Solr, ensuring it will be around for a long time.
»Fast – Solr is blindingly fast and will not be a bottleneck in your content delivery.

Page 6: Why Make Solr?

Page 7: Sticker-Shock

»All Initial Vendor Pricing Far Higher Than Budgeted
»Not Site License Pricing
»Vendors Use Metrics Like #CPUs That Would Result in >$1M.
»Some Prices by Queries/Sec …
»Vendors’ Prices aren’t Apples to Apples

Page 8: 2005 Vendor Pricing

Vendor | Model | Interested Metrics | Initial Proposed Price/Formula | Rough Initial Cost
F******** | Queries/sec, Doc Space, Tools, Maint | Queries/sec, Num Docs, Size Docs, Doc Space | $627,000 | $627K
C********* | CPUs (a little complex) and Tools | CPUs | $287,000 + ($75,000 * CPU) + Build (Millions) | $2,870K
E********* | not yet received | Licensed Term and based on Query-side CPUs | Siloization puts us off their chart. > $750,000 | $900K
A********* | Interfaces, distrib requirements, perf requirements, failover requirements, query demand | Queries/sec, redundancy reqs, connectors, functionality, num documents, ballpark anonymous users, world-wide-web user license | $500,000 (but I'm guessing larger) | $500K
R********** | CPUs | CPUs | $75,000 * CPU (Millions) | $2,000K
E********* | Enterprise Models (plural), with some PR | Base it on the business | $150,000 => $400,000 | $400K
I********** | CPUs (a little complex) and Internal User License | CPUs, Internal Users | $340,000 + ($75,000 * CPU) + Build (Millions) | $2,000K
V************ | Pretty much, "no". | CPUs, intranet users, content size, sources, response time, training, consulting, etc. | ~$200,000, with larger ones in high six figures. | $800K

Page 9: What To Do?

»Search for a Replacement Search Platform
–commercial: high license fees
–open-source: no good solutions
»FAIL

Page 10: Early (aka 2006) Solr Credits

»Ted Cahall
»Mark Castrovinci
»Clay Webster
»Yonik Seeley
»Bill Au
»Chris Hostetter

(Not necessarily in that order.)

Page 11: Lucene “Refresher”

»Lucene is a search library
»Add documents to an index via IndexWriter
–A document is a collection of fields
–No config files, dynamic field typing
–Flexible text analysis – tokenizers, filters
»Search for documents via IndexSearcher
Hits = search(Query, Filter, Sort, topN)
»Scoring is modified tf*idf
–term frequency–inverse document frequency
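
To make the tf*idf idea concrete, here is a minimal sketch of the textbook formula (raw term count times the log of inverse document frequency). It is not Lucene's actual "modified" scoring, which the slides do not spell out; the function name and the tiny corpus are made up for illustration.

```python
import math


def tf_idf(term, doc, corpus):
    """Textbook tf*idf: raw term count times log(N / document frequency)."""
    tf = doc.count(term)                       # how often the term occurs in this document
    df = sum(1 for d in corpus if term in d)   # how many documents contain the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["solr", "search"],
    ["lucene", "search"],
    ["solr", "solr", "fast"],
]
scores = [tf_idf("solr", doc, corpus) for doc in corpus]
print(scores)
```

The third document mentions "solr" twice, so under this simplified formula it scores twice as high as the first; the second document does not contain the term and scores zero.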

Page 12: Solr Architecture

»Basics
–Java, Lucene, Appserver
»Queries
–Query Language
–HTTP GET requests with XML, Ruby, PHP, JSON responses
–HTTP POSTs of XML documents
–Custom Query Handlers
»Data Architecture and Schema
–Types, Analysis, Tokenizers, Dynamic Fields
»Service Tier(s)
–Master & Query Server
»Data Loading
–Request/Update Handlers and Response Writers, Analyzers, Plugin Architecture, CSV Loader, Java, DB sync
»Distribution
»Administration

Page 13: Building Open Source within a Business

Page 14: Why Open Source Solr? (Development)

»New concepts not of our origin
–Useful features
–Not currently useful features
–Questionable features
»Extra developer power
–Optimizations
–Features we didn’t have time to do
–Expertise
»Bonded to Lucene

Page 15: Why Open Source Solr? (Software Quality)

»Quality contributions/commits
–Embarrassment factor
»Code and functionality reviews
–Normal course of business
»Productization
–More than normal
»Collaborative Design – checks and balances

Page 16: Why Open Source Solr? (Bugs!)

»Bugs Found
–More than CNET would find
–Bugs before CNET hits
–Bugs CNET would never hit
»Bugs Fixed
–Fixing bugs we care about
–Fixing bugs we don’t care about
–Receiving contributed fixes
–Reviewing committed fixes
»Far fewer bugs

Page 17: Why Not Open Source Solr?

»Engineer Time
–Upfront packaging
–Process
–Leadership
»Dumb user questions
»Split infrastructure
–Website, wiki, mailing lists, build systems, repository, bug systems
»Red tape
–Procedural stuff aimed at making Apache projects successful
»Far less control than internal products
–Voting/Consensus

Page 18: Solar vs. Solr

»Solar?
–History of architecturally descriptive acronyms
–Themed from ATOMICS (Apache TO Mysql In Cnet Search)
–Solar == Search on Lucene and Resin
–Solar is also in a Lucene light theme
»Couldn’t Use “Solar”
»Rename to a non-acronym?

Page 19: Solar vs. Solr

»Related “Light” Names

(The first column reproduces the unlabeled v / ^ marks shown next to each name on the slide.)

Mark | Name | Meaning | Origin | Gender | Visual Effect
v | Alena | Light | Slavic | Female | alena-dev@lucene.apache.org
v | Brighid | Bringer/Bearer of Light | Celtic/Gaelic | Female | brighid-dev@lucene.apache.org
v | Huda | Enlightenment, Guidance | Arabic | Female | huda-dev@lucene.apache.org
v | Ilona | Beautiful Sunshine | Hungarian | Female | ilona-dev@lucene.apache.org
^^^ | Inara | Ray of Light - Heaven-Sent | Arabic | Female | inara-dev@lucene.apache.org
v | Leyna | Bright and Shining Light | Russian | Female | leyna-dev@lucene.apache.org
vvvv | Luca | Bringer of Light | Italian | Either | luca-dev@lucene.apache.org
vvvv | Luce | Light | Latin | Male | luce-dev@lucene.apache.org
vvvv | Lucine | Bright, Light | French | Male | lucine-dev@lucene.apache.org
vvvv | Lucius | Bringer of Light | Latin | Male | lucius-dev@lucene.apache.org
 | Misae | White Hot Sun | Native-American | Either | misae-dev@lucene.apache.org
v | Morag | Embracing The Sun | Celtic/Gaelic | Female | morag-dev@lucene.apache.org
^^ | Sulwyn | Bright As The Sun | Welsh | Female | sulwyn-dev@lucene.apache.org
^^^ | Synnove | Sun Gift | Scandinavian | Female | synnove-dev@lucene.apache.org
^ | Lyaeus | An epithet of Dionysus, as the god who releases people from worries | Greek | ? | lyaeus-dev@lucene.apache.org

http://www.pantheon.org/
http://www.babynames.com/

Page 20: History

»CNET grants code to Apache
»Solr enters Incubator 17 Jan 2006
»Solr is a Lucene sub-project
»Production: CNET Reviews, CNET Channel, Shopper.com, News.com, Download.com, ChowHound

Page 21: Community Activity (last 975 days)

»“User” List:
–1113 Current Subscribers
–Posts: 15343
–Avg posts per day: 15.74
»Developer List:
–375 Current Subscribers
–Posts: 12543
–Avg posts per day: 12.86

Page 22: What Did We Get From Open Sourcing?

»Longevity
–Well-received, growing community
–Solr is the open source search software that people use and add to
–Open source software is often preferred
»Collaboration
–Voting/Consensus
–Opposite side of losing control
»Leadership
–Driving position vs. passenger user/customer position
–Solr keeps in sync with CNET needs
–CNET gets the best support of all Solr users
»Positive karma
–Recruiting
–Industry respect
»OOTB goodness

Page 23: Using Solr – content scraped from Yonik and Chris

Page 24: Adding Documents

HTTP POST to /update

<add><doc boost="2">
  <field name="article">05991</field>
  <field name="title">Apache Solr</field>
  <field name="subject">An intro...</field>
  <field name="category">search</field>
  <field name="category">lucene</field>
  <field name="body">Solr is a full...</field>
</doc></add>
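
As a rough illustration of how a client might assemble the `<add>` message above, the sketch below builds it with Python's standard `xml.etree.ElementTree`, which also takes care of XML escaping. The helper name `build_add_message` is hypothetical, and actually POSTing the result to `/update` is left out.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs.

    Repeating a name (e.g. "category") yields a multi-valued field.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("category", "search"),
        ("category", "lucene"),
    ],
    boost=2,
)
print(message)
```

Building the message through an XML library instead of string concatenation avoids malformed updates when a field value contains characters such as `&` or `<`.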

Page 25: Deleting Documents

»Delete by Id
<delete><id>05591</id></delete>
»Delete by Query (multiple documents)
<delete>
  <query>category:lucene</query>
</delete>

Page 26: Commit

»<commit/> makes changes visible
–closes IndexWriter
–removes duplicates
–opens new IndexSearcher

Page 27: Searching: Query Syntax

Lucene Query Syntax

»mission impossible
»+mission +impossible -actor:cruise
»"mission impossible" -actor:cruise
»title:spiderman^10 description:spiderman
»description:"spiderman movie"~10
»+HDTV +weight:[0 TO 100]
»Wildcard queries: te?t, te*t, test*

Page 28: Querying Data

»HTTP GET or POST, params specifying query options...

http://solr/select?q=electronics
http://solr/select?q=electronics&sort=price+desc
http://solr/select?q=electronics&rows=50&start=50
http://solr/select?q=electronics&fl=name+price
http://solr/select?q=electronics&fq=inStock:true
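
The URLs above can be assembled programmatically rather than by hand. This is a small sketch using only the Python standard library; `http://solr/select` is the placeholder host from the slide, and `select_url` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_SELECT = "http://solr/select"  # placeholder host taken from the slide


def select_url(q, **params):
    """Return a /select URL for query q plus any extra query options."""
    query = {"q": q}
    query.update(params)
    # urlencode escapes values, e.g. the space in "price desc" becomes "+".
    return SOLR_SELECT + "?" + urlencode(query)


url = select_url("electronics", sort="price desc", rows=50, start=50, fq="inStock:true")
print(url)
```

Letting `urlencode` do the escaping keeps characters such as spaces and colons from corrupting the request.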

Page 29: Searching: Parameters

Query Arguments for HTTP GET/POST to /select

param | default | description
q | | The query
start | 0 | Offset into the list of matches
rows | 10 | Number of documents to return
fl | * | Stored fields to return
qt | standard | Query type; maps to query handler
df | (schema) | Default field to search

Page 30: Querying Data: Results

»Canonical response format is XML...

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <result name="response" numFound="14" start="0">
    <doc>
      <arr name="cat">
        <str>electronics</str>
        <str>connector</str>
      </arr>
      <arr name="features">
        <str>car power adapter, white</str>
      </arr>
      <str name="id">F8V7067-APL-KIT</str>
...
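
A client consuming this XML might read it roughly as follows. The sketch uses the standard-library `xml.etree.ElementTree`; because the slide truncates the response, `SAMPLE` closes the open tags itself and contains only the one document shown, which is an assumption made for illustration. `parse_response` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The slide's response, with the truncated tail closed by hand (assumption).
SAMPLE = """<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
<result name="response" numFound="14" start="0">
<doc>
<arr name="cat"><str>electronics</str><str>connector</str></arr>
<arr name="features"><str>car power adapter, white</str></arr>
<str name="id">F8V7067-APL-KIT</str>
</doc>
</result>
</response>"""


def parse_response(xml_text):
    """Return (numFound, docs); each doc maps field name to a string or a list."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for child in doc:
            if child.tag == "arr":   # multi-valued field
                fields[child.get("name")] = [item.text for item in child]
            else:                    # single-valued field, kept as text
                fields[child.get("name")] = child.text
        docs.append(fields)
    return int(result.get("numFound")), docs


num_found, docs = parse_response(SAMPLE)
print(num_found, docs[0]["id"], docs[0]["cat"])
```

Note that `numFound` reports all matches (14 here) while the response body carries only the requested page of documents.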

Page 31: Caching

IndexSearcher’s view of an index is fixed
–Aggressive caching possible
–Consistency for multi-query requests
filterCache – unordered set of documents matching a query
resultCache – ordered subset of documents matching a query
documentCache – the stored fields of documents
userCaches – application specific

Page 32: Warming

»Lucene IndexReader warming
–field norms
–FieldCache
»Cache warming
–Configurable static requests to warm new Searchers
»Smart Cache Warming (autowarming)
–Using MRU items in the current cache to pre-populate the new cache
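
The autowarming idea (take the most recently used keys of the current cache and regenerate their values against the new searcher) can be sketched in a few lines. This is a toy model, not Solr's cache implementation: `LRUCache`, `autowarm`, and the `regenerate` callback are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: least recently used first, most recently used last."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)   # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry

    def most_recent_keys(self, n):
        return list(reversed(self.items))[:n]


def autowarm(old_cache, new_cache, n, regenerate):
    """Pre-populate new_cache with fresh values for the n MRU keys of old_cache."""
    # Insert oldest-first so the new cache ends up with the same recency order.
    for key in reversed(old_cache.most_recent_keys(n)):
        new_cache.put(key, regenerate(key))


old = LRUCache(3)
for k in ("a", "b", "c"):
    old.put(k, "stale:" + k)
old.get("a")                      # "a" becomes the most recently used key
new = LRUCache(3)
autowarm(old, new, 2, regenerate=lambda key: "fresh:" + key)
print(list(new.items.items()))
```

Only the keys are carried over; the values are recomputed, since results cached against the old index view may no longer be valid for the new one.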

Page 33: Smart Cache Warming

[Diagram: a Request Handler sends Live Requests to the Registered Solr IndexSearcher and Warming Requests to the On-Deck Solr IndexSearcher. Each searcher has a Filter Cache, Result Cache, Doc Cache, and User Cache; Field Cache and Field Norms are also shown. Regenerators perform autowarming (warm n MRU cache keys) from the registered searcher's caches into the on-deck searcher's caches. Steps are numbered 1, 2, 3.]

Page 34: Schema

»Lucene has no notion of a schema
–Sorting - string vs. numeric
–Ranges - val:42 included in val:[1 TO 5] ?
–Lucene QueryParser has date-range support, but must guess.
»Defines fields, their types, properties
»Defines unique key field, default search field, Similarity implementation
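
The sorting and range bullets above come down to comparing values as strings versus as numbers. The short sketch below shows the difference in plain Python; it only illustrates why a field type is needed and does not reproduce how Lucene or Solr encode numeric terms.

```python
def in_string_range(value, low, high):
    """Range check on raw strings, the way untyped terms would compare."""
    return low <= value <= high


def in_int_range(value, low, high):
    """Range check after interpreting the same strings as integers."""
    return int(low) <= int(value) <= int(high)


print(in_string_range("42", "1", "5"))       # True: "42" sorts between "1" and "5"
print(in_int_range("42", "1", "5"))          # False: 42 is not between 1 and 5
print(sorted(["10", "9", "100"]))            # ['10', '100', '9']
print(sorted(["10", "9", "100"], key=int))   # ['9', '10', '100']
```

A schema that declares the field as numeric lets the engine pick the second behavior consistently instead of guessing.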

Page 35: Field Definitions

»Field Attributes: name, type, indexed, stored, multiValued, omitNorms

<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="reviews" type="text" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>

»Dynamic Fields, in the spirit of Lucene!

<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>

Page 36: Field Type Definitions

<fieldtype name="sint" class="solr.SortableIntField" sortMissingLast="true"/>
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>

Page 37: Analysis Example

[Diagram: two analysis chains shown side by side]

(Source Text) PowerShot SD 500
WhitespaceTokenizer -> PowerShot | SD | 500
WordDelimiterFilter catenateWords=1 -> Power | Shot | PowerShot | SD | 500
LowercaseFilter -> power | shot | powershot | sd | 500

(Query Text) power-shot sd500
WhitespaceTokenizer -> power-shot | sd500
WordDelimiterFilter catenateWords=0 -> power | shot | sd | 500
LowercaseFilter -> power | shot | sd | 500

A Match!
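
The two chains in this example can be imitated with a few lines of Python to see why the query matches. This is a heavily simplified stand-in for the real tokenizer and filters: the splitting rules (case changes, letter/digit boundaries, punctuation) and the `catenate_words` handling are assumptions that cover only this example, and the function names are hypothetical.

```python
import re


def word_delimiter(token, catenate_words):
    """Split a token on punctuation, case changes, and letter/digit boundaries."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
    word_parts = [p for p in parts if p.isalpha()]
    if catenate_words and len(word_parts) > 1:
        parts.append("".join(word_parts))   # e.g. Power + Shot -> PowerShot
    return parts


def analyze(text, catenate_words):
    """Whitespace tokenizer -> word delimiter filter -> lowercase filter."""
    tokens = []
    for raw in text.split():
        tokens.extend(word_delimiter(raw, catenate_words))
    return [t.lower() for t in tokens]


index_terms = analyze("PowerShot SD 500", catenate_words=True)
query_terms = analyze("power-shot sd500", catenate_words=False)
print(index_terms)
print(query_terms)
print(set(query_terms) <= set(index_terms))   # every query term is in the index
```

The point of the example is that differently written source and query text end up as the same terms once both pass through comparable analysis.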

Page 38: Analysis Tool: Output

Page 39: copyField

»Analyze same field different ways
–Boost exact-case or exact-punctuation matches
–translations, thesaurus, soundex
»Index multiple fields into single searchable field

<field name="title" type="text"/>
<field name="title_exact" type="text_ex" stored="false"/>
<field name="catchall" type="text" stored="false"/>
<copyField source="title" dest="title_exact"/>
<copyField source="title" dest="catchall"/>
<copyField source="subject" dest="catchall"/>

Page 40: High Availability

[Diagram labels: Load Balancer, Appservers, Solr Searchers, Solr Master, DB, Updater. Arrows are labeled "queries" (toward the Solr Searchers), "updates" (toward the Solr Master), and "Index Replication" (between the Solr Master and the Solr Searchers).]

Page 41: Replication

[Diagram: a Master and a Searcher, each holding a Lucene index (segments) in solr/data/index; a new segment appears on the Master. Labeled items: solr/data/snapshot-2006062950000 (1. hard links), solr/data/snapshot-2006062950000-WIP (2. hard links), 3. rsync, 4. mv.]

Page 42: Replication

»Replication scripts for efficiently mirroring an index on multiple machines.
–snapshooter
–snappuller
–snapinstaller
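
The replication slides mention hard links as the way snapshots are taken. The sketch below illustrates that one idea in Python: a snapshot directory whose entries are hard links to the index files, so no data is copied. It is not the snapshooter script itself; the function name, directory names, and file names are made up, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def take_snapshot(index_dir, snapshot_dir):
    """Hard-link every file in index_dir into a new snapshot_dir."""
    os.makedirs(snapshot_dir)
    for name in sorted(os.listdir(index_dir)):
        src = os.path.join(index_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(snapshot_dir, name))   # no data is copied
    return snapshot_dir


def demo():
    with tempfile.TemporaryDirectory() as root:
        index_dir = os.path.join(root, "index")
        os.makedirs(index_dir)
        for name in ("segments", "_0.cfs"):   # hypothetical index file names
            with open(os.path.join(index_dir, name), "w") as handle:
                handle.write("data for " + name)
        snap = take_snapshot(index_dir, os.path.join(root, "snapshot-demo"))
        same_inode = all(
            os.stat(os.path.join(index_dir, n)).st_ino
            == os.stat(os.path.join(snap, n)).st_ino
            for n in os.listdir(index_dir)
        )
        return sorted(os.listdir(snap)), same_inode


names, same_inode = demo()
print(names, same_inode)
```

Because index files are written once and not modified afterwards, a hard-linked snapshot stays a consistent point-in-time view even as new segments are added next to it.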


Page 43: Faceted Browsing

[Diagram] Search(Query, Filter[], Sort, offset, n)
Query: computer
Filter[]: computer_type:PC, memory:[1GB TO *]
Sort: price asc
Results: DocList (ordered section of results) and DocSet (unordered set of all results)
Facet counts are the intersection Size() of the DocSet with each constraint:
proc_manu:Intel = 594
proc_manu:AMD = 382
price:[0 TO 500] = 247
price:[500 TO 1000] = 689
manu:Dell = 104
manu:HP = 92
manu:Lenovo = 75
All of this goes into the Query Response.
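
The counting step in the diagram is plain set arithmetic: intersect the query's matches with each filter to get the set of all results, then intersect that set with each facet constraint and take its size. A minimal sketch with Python sets follows; the document ids and helper name are invented for illustration.

```python
def facet_counts(doc_set, facet_sets):
    """Count, per constraint, how many documents of doc_set also match it."""
    return {name: len(doc_set & members) for name, members in facet_sets.items()}


# Hypothetical document ids matching the query and each filter.
query_hits = set(range(1, 11))                       # e.g. q=computer
filters = [{1, 2, 3, 4, 5, 6, 7, 8}, {2, 3, 4, 5, 6, 7, 20}]
doc_set = query_hits.intersection(*filters)          # unordered set of all results

facets = {
    "manu:Dell": {2, 3, 9},
    "manu:HP": {4, 30},
    "manu:Lenovo": {40},
}
counts = facet_counts(doc_set, facets)
print(sorted(doc_set), counts)
```

This is also why cached, unordered document sets (as in the filterCache described earlier) are a natural fit for faceting: each count is one intersection.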

Page 44: Querying Data: Facet Counts

»Constraint counts can be computed for the whole result set using field values or explicit queries....

&facet=true&facet.field=cat&facet.field=inStock
&facet.query=price:[0 TO 10]&facet.query=price:[10 TO *]
...
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="price:[0 TO 10]">0</int>
    <int name="price:[10 TO *]">13</int>
  </lst>
  <lst name="facet_fields">
    <lst name="inStock">
      <int name="true">10</int>
      <int name="false">4</int>
    </lst>
...

Page 45: Querying Data: Facet Counts

Page 46: Web Admin Interface

»Show Config, Schema, Distribution info
»Query Interface
»Statistics
–Caches: lookups, hits, hit ratio, inserts, evictions, size
–RequestHandlers: requests, errors
–UpdateHandler: adds, deletes, commits, optimizes
–IndexReader, open-time, index-version, numDocs, maxDocs,
»Analysis Debugger
–Shows tokens after each Analyzer stage
–Shows token matches for query vs index

Page 47: The Admin Console

Page 48: Odds and Ends

Page 49: Querying Data: Results

»Canonical response format is XML...

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <result name="response" numFound="14" start="0">
    <doc>
      <arr name="cat">
        <str>electronics</str>
        <str>connector</str>
      </arr>
      <arr name="features">
        <str>car power adapter, white</str>
      </arr>
      <str name="id">F8V7067-APL-KIT</str>
...

Page 50: Querying Data: Highlighting

Page 51: Querying Data: Highlighting

»Generates summary "fragments" of stored fields showing matches....

&hl=true&hl.fl=features&hl.fragsize=30
...
<lst name="highlighting">
  <lst name="F8V7067-APL-KIT">
    <arr name="features">
      <str>car power &lt;em&gt;adapter&lt;/em&gt;, white</str>
    </arr>
  </lst>
...

Page 52: Describing Your Data

»schema.xml is where you configure the options for various fields.
–Is it a number? A string? A date?
–Is there a default value for documents that don't have one?
–Is it created by combining the values of other fields?
–Is it stored for retrieval?
–Is it indexed? If so is it parsed? If so how?
–Is it a unique identifier?

Page 53: Fields

»<field> Describes How You Deal With Specific Named Fields
»<dynamicField> Describes How To Deal With Fields That Match A Glob (Unless There Is A Specific <field> For Them)
»<copyField> Describes How To Construct Fields From Other Fields

<field name="title" type="text" stored="false" />
<dynamicField name="price*" type="sfloat" indexed="true" />
<copyField source="*" dest="catchall" />

Page 54: Field Types

»Every Field Is Based On A <fieldType> Which Specifies:
–The Underlying Storage Class (FieldType)
–The Analyzer To Use Or Parsing If It Is A Text Field
»OOTB Solr Has 18 FieldType Classes

<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true" />
<fieldtype name="string" class="solr.StrField" indexed="true" stored="true" />
<fieldtype name="unstored" class="solr.StrField" indexed="true" stored="false" />

Page 55: Analyzers

»'Analyzer' Is A Core Lucene Class For Parsing Text
»Solr Includes 18 Lucene Analyzers That Can Be Used OOTB If They Meet Your Needs

<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>

...BUT WAIT!

Page 56: Tokenizers And TokenFilters

»Analyzers Are Typically Composed Of Tokenizers And TokenFilters
–Tokenizer: Controls How Your Text Is Tokenized
–TokenFilter: Mutates And Manipulates The Stream Of Tokens
»Solr Lets You Mix And Match Tokenizers and TokenFilters In Your schema.xml To Define Analyzers On The Fly
»OOTB Solr Has Factories For 12 Tokenizers and 36 TokenFilters
»Many Factories Have Customization Options -- Limitless Combinations

Page 57: Tokenizers And TokenFilters

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
...

Page 58: Notable Token(izers|Filters)

»StandardTokenizerFactory
»HTMLStripWhitespaceTokenizerFactory
»KeywordTokenizerFactory
»NGramTokenizerFactory
»PatternTokenizerFactory
»EnglishPorterFilterFactory
»SynonymFilterFactory
»StopFilterFactory
»ISOLatin1AccentFilterFactory
»PatternReplaceFilterFactory

Page 59: Interacting With Your Data

»solrconfig.xml is where you configure options for how this Solr instance should behave.
–Low-Level Index Settings
–Performance Settings (Cache Sizes, etc...)
–Types of Updates Allowed
–Types of Queries Allowed

Note:
–solrconfig.xml depends on schema.xml.
–schema.xml does not depend on solrconfig.xml.

Page 60: Solr –Soup to Nuts - food.netA thriving development community at the Apache Software Foundation is constantly adding new features to Solr, ensuring it will be around for a long time

Request Handlers

» Type Of Request Handler Determines Options, Syntax, And Logic For Processing Requests
» OOTB Indexing Handlers:
  – XmlUpdateRequestHandler
  – CSVRequestHandler
  – DataImportHandler
» OOTB Searching Handler:
  – SearchHandler + QParsers


Example: Handler Configuration

<requestHandler name="/select" class="solr.SearchHandler" />

<requestHandler name="/simple" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">catchall</str>
  </lst>
</requestHandler>

<requestHandler name="/complex" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">features^1 name^2</str>
  </lst>
  <lst name="appends">
    <str name="fq">inStock:true</str>
  </lst>
  <lst name="invariants">
    <str name="facet">false</str>

...
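The three lists in a handler definition play different roles: "defaults" apply only when the request omits a parameter, "appends" are added to whatever the request sent, and "invariants" always win. A minimal Python sketch of that merge, assuming each parameter maps to a list of values (the function name and the dict representation are illustrative, not part of Solr):

```python
def resolve_params(request, defaults=None, appends=None, invariants=None):
    """Combine request parameters with a handler's configured lists.

    Every argument maps a parameter name to a list of values.
    """
    resolved = {}
    # defaults: the starting point, used only if the request says nothing.
    for name, values in (defaults or {}).items():
        resolved[name] = list(values)
    # request: anything the client sends replaces the default.
    for name, values in request.items():
        resolved[name] = list(values)
    # appends: added on top of whatever is already there.
    for name, values in (appends or {}).items():
        resolved[name] = resolved.get(name, []) + list(values)
    # invariants: override everything, including the request.
    for name, values in (invariants or {}).items():
        resolved[name] = list(values)
    return resolved


# The "/complex" handler from the slide, written as plain dicts.
complex_handler = {
    "defaults": {"defType": ["dismax"], "qf": ["features^1 name^2"]},
    "appends": {"fq": ["inStock:true"]},
    "invariants": {"facet": ["false"]},
}

# Hypothetical client request that tries to turn faceting on.
request = {"q": ["ipod"], "fq": ["cat:music"], "facet": ["true"]}
print(resolve_params(request, **complex_handler))
```

In this example the client's own fq is kept and inStock:true is added next to it, while its facet=true is overridden by the invariant facet=false.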


Output: Response Writers

» Response Format Can Be Controlled Independently From Request Handler Logic
» Many Useful Response Writers OOTB

http://solr/select?q=electronics
http://solr/select?q=electronics&wt=xml
http://solr/select?q=electronics&wt=json
http://solr/select?q=electronics&wt=python
http://solr/select?q=electronics&wt=ruby
http://solr/select?q=electronics&wt=php
http://solr/select?q=electronics&wt=xslt&tr=example.xsl

<queryResponseWriter name="xml" default="true"
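Since the response format is just another request parameter, a client can switch writers without touching the query itself. A small Python sketch that builds the URLs shown above; the helper name is made up, and "http://solr/select" is the placeholder host from the slide rather than a real endpoint:

```python
from urllib.parse import urlencode


def select_url(query, writer=None, base="http://solr/select", **extra):
    """Build a select URL; `writer` becomes the wt parameter when given."""
    params = {"q": query}
    if writer is not None:
        params["wt"] = writer
    # Any further parameters (for example tr for the XSLT writer) go last.
    params.update(extra)
    return base + "?" + urlencode(params)


for wt in (None, "xml", "json", "python", "ruby", "php"):
    print(select_url("electronics", wt))
print(select_url("electronics", "xslt", tr="example.xsl"))
```

Using urlencode rather than string concatenation also takes care of escaping queries that contain spaces or other reserved characters.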


Installing Solr

» Put The solr.war Where Your Favorite Servlet Container Can Find It
» Create A "Solr Home" Directory
» Steal The Example solr/conf Files
» Point At Your Solr Home Using Either:
  – JNDI
  – System Properties
  – The Current Working Directory

(Or just use the Jetty example setup.)
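The three ways of pointing at Solr Home amount to a lookup with fallbacks. The Python sketch below models that lookup with plain dicts standing in for the JNDI environment and the JVM system properties. The JNDI name "java:comp/env/solr/home", the property name "solr.solr.home", and the "solr" directory default are recalled from the Solr 1.x releases and are assumptions here; check them against the version you deploy.

```python
import os


def locate_solr_home(jndi_env, system_properties, cwd):
    """Return the Solr Home path, trying each source in turn.

    jndi_env and system_properties are plain dicts used as stand-ins
    for the servlet container's JNDI context and the JVM properties.
    """
    # 1. JNDI entry configured in the servlet container (assumed name).
    home = jndi_env.get("java:comp/env/solr/home")
    if home:
        return home
    # 2. JVM system property, e.g. passed as -Dsolr.solr.home=... (assumed name).
    home = system_properties.get("solr.solr.home")
    if home:
        return home
    # 3. Fall back to a "solr" directory under the current working directory.
    return os.path.join(cwd, "solr")


print(locate_solr_home({}, {"solr.solr.home": "/var/solr"}, os.getcwd()))
```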


Understanding The Data: Luke

» The LukeRequestHandler Is Based On A Popular Lucene GUI App For Debugging Indexes (Luke)
» Allows Introspection Of Field Information:
  – Options From The Schema (Either Explicit Or Inherited From Field Type)
  – Statistics On Unique Terms And Terms With High Doc Frequency
  – Histogram Of Terms With Doc Frequency Above Set Thresholds
» Helpful In Understanding The Nature Of Your Data
» Schema Browser: Luke On Steroids
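To make the term statistics concrete, here is a small Python sketch that computes document frequencies for a toy corpus and groups terms into a histogram. It does not call Solr or Luke; the corpus, the function names, and the power-of-two bucket boundaries are all assumptions chosen for illustration.

```python
from collections import Counter


def doc_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated within one document counts only once.
        df.update(set(doc.lower().split()))
    return df


def histogram(df):
    """Bucket terms by the smallest power of two >= their doc frequency."""
    buckets = Counter()
    for count in df.values():
        bucket = 1
        while bucket < count:
            bucket *= 2
        buckets[bucket] += 1
    return dict(sorted(buckets.items()))


docs = [
    "solr search server",
    "lucene search library",
    "solr lucene search",
]
df = doc_frequencies(docs)
print(df.most_common(3))  # terms with the highest doc frequency
print(histogram(df))      # {1: 2, 2: 2, 4: 1}
```

A field where almost every term lands in the lowest bucket behaves like an identifier, while a long tail of high-frequency terms suggests candidates for a stop word list.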


Example: Luke Output


Example: Schema Browser


Refining Your Schema

» Pick Field Types That Make Sense
» Pick Analyzers That Make Sense
» Use <copyField> To Make Multiple Copies Of Fields For Different Purposes:
  – Faceting
  – Sorting
  – Loose Matching
  – Etc...


Example: "BIC" Codes

<!-- used by the bic field, a prefix based code -->
<fieldType name="bicgram" class="solr.TextField" >
  <analyzer type="index">
    <tokenizer class="solr.EdgeNGramTokenizerFactory"
               minGramSize="1"
               maxGramSize="100"
               side="front" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory"/>
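The idea behind this field type is that every prefix of a code is indexed as its own term, so a query for a prefix becomes an ordinary exact term match. A rough Python imitation of the two analyzers follows; the function names and the sample code "LNB" are made up for illustration, and the real tokenizer works on character streams rather than Python strings.

```python
def edge_ngrams(value, min_gram=1, max_gram=100):
    """Index side: every front-anchored prefix of the value, lowercased."""
    longest = min(max_gram, len(value))
    return [value[:n].lower() for n in range(min_gram, longest + 1)]


def query_tokens(query):
    """Query side: split on whitespace and lowercase, with no n-gramming."""
    return [token.lower() for token in query.split()]


def matches(indexed_value, query):
    """True if any query token equals one of the indexed prefixes."""
    grams = set(edge_ngrams(indexed_value))
    return any(token in grams for token in query_tokens(query))


print(edge_ngrams("LNB"))    # ['l', 'ln', 'lnb']
print(matches("LNB", "ln"))  # True: "ln" is a prefix of the code
print(matches("LNB", "nb"))  # False: "nb" is not anchored at the front
```

Only the index analyzer generates n-grams. If the query side did the same, a search for "lnb" would also match every code that merely starts with "l".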


Search Components

» Default Components That Power SearchHandler
  – QueryComponent
  – HighlightComponent
  – FacetComponent
  – MoreLikeThisComponent
  – DebugComponent
» Additional Components You Can Configure
  – SpellCheckComponent
  – QueryElevationComponent


Score Explanations

» Why Did Document X Score Higher Than Y?
» Why Didn't Document Z Match At All?
» Debugging Options Can Answer Both Questions...
  – idf: How Common A Term Is In The Whole Index
  – tf: How Common A Term Is In This Document
  – fieldNorm: How Significant Is This Field In This Document (Usually Based On Length)
  – boost: How Important The Client Said This Clause Is
  – coordFactor: How Many Clauses Matched

&debugQuery=true&explainOther=documentId:Z


Example: Score Explanations

<str name="id=9781841135779,internal_docid=111">
0.30328625 = (MATCH) fieldWeight(catchall:law in 111), product of:
  3.8729835 = tf(termFreq(catchall:law)=15)
  1.0023446 = idf(docFreq=851)
  0.078125 = fieldNorm(field=catchall, doc=111)
</str>
...
<str name="id=9781841135335,internal_docid=696">
0.26578674 = (MATCH) fieldWeight(catchall:law in 696), product of:
  4.2426405 = tf(termFreq(catchall:law)=18)
  1.0023446 = idf(docFreq=851)
  0.0625 = fieldNorm(field=catchall, doc=696)
</str>
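The numbers in the explanations above can be checked by hand. In Lucene's default similarity the tf factor is the square root of the term frequency, and fieldWeight is the product of tf, idf, and fieldNorm. The Python sketch below redoes that arithmetic; the idf and fieldNorm values are copied from the explain output rather than recomputed, and the function names are illustrative.

```python
import math


def tf(term_freq):
    """tf factor as used by Lucene's default similarity: sqrt(termFreq)."""
    return math.sqrt(term_freq)


def field_weight(term_freq, idf, field_norm):
    """fieldWeight is the product of the three factors in the explanation."""
    return tf(term_freq) * idf * field_norm


# Values taken from the two explanations above.
doc_111 = field_weight(term_freq=15, idf=1.0023446, field_norm=0.078125)
doc_696 = field_weight(term_freq=18, idf=1.0023446, field_norm=0.0625)

print(round(doc_111, 4))  # 0.3033
print(round(doc_696, 4))  # 0.2658
print(doc_111 > doc_696)  # True
```

Document 696 contains the term more often (18 occurrences against 15), but its lower fieldNorm, which usually means a longer field, outweighs the extra occurrences, so document 111 ranks higher.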


DataImportHandler

Builds and incrementally updates indexes based on configured SQL or XPath queries.

<entity name="item" pk="ID" query="select * from ITEM"
        deltaQuery="select ID ... where
                    ITEMDATE > '${dataimporter.last_index_time}'">
  <field column="NAME" name="name" />
  ...
  <entity name="f" pk="ITEMID"
          query="select DESC from FEATURE where ITEMID='${item.ID}'"
          deltaQuery="select ITEMID from FEATURE where
                      UPDATEDATE > '${dataimporter.last_index_time}'"
          parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
    <field name="features" column="DESC" />

...
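The delta queries above answer one question: which parent items must be re-indexed since the last import? An item qualifies if its own row changed (deltaQuery on ITEM) or if one of its features changed (deltaQuery on FEATURE, mapped back to the parent by parentDeltaQuery). The Python sketch below reproduces that logic against an in-memory SQLite database. The table contents, timestamps, and function names are invented for the demonstration, the FEATURE table omits the DESC column, and a single join stands in for the two-step child lookup.

```python
import sqlite3


def items_to_reindex(conn, last_index_time):
    """Return the sorted IDs of ITEM rows affected since last_index_time."""
    # Items whose own timestamp is newer (the item-level deltaQuery).
    own = conn.execute(
        "SELECT ID FROM ITEM WHERE ITEMDATE > ?", (last_index_time,)
    ).fetchall()
    # Items owning a feature that changed (child deltaQuery + parentDeltaQuery).
    via_feature = conn.execute(
        "SELECT DISTINCT ITEM.ID FROM ITEM "
        "JOIN FEATURE ON ITEM.ID = FEATURE.ITEMID "
        "WHERE FEATURE.UPDATEDATE > ?",
        (last_index_time,),
    ).fetchall()
    return sorted({row[0] for row in own} | {row[0] for row in via_feature})


def demo_connection():
    """Build a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ITEM (ID INTEGER PRIMARY KEY, NAME TEXT, ITEMDATE TEXT);
        CREATE TABLE FEATURE (ITEMID INTEGER, UPDATEDATE TEXT);
        INSERT INTO ITEM VALUES (1, 'unchanged', '2009-09-01 00:00:00');
        INSERT INTO ITEM VALUES (2, 'changed item', '2009-10-05 00:00:00');
        INSERT INTO ITEM VALUES (3, 'changed feature', '2009-09-01 00:00:00');
        INSERT INTO FEATURE VALUES (1, '2009-09-02 00:00:00');
        INSERT INTO FEATURE VALUES (3, '2009-10-06 00:00:00');
        """
    )
    return conn


print(items_to_reindex(demo_connection(), "2009-10-01 00:00:00"))  # [2, 3]
```

Item 3 is picked up even though its own row is old, which is the case parentDeltaQuery exists to handle.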


References


Resources

» Home Page
  – http://lucene.apache.org/solr
  – Tutorial
  – http://wiki.apache.org/solr/
» Mailing Lists
  – [email protected]
  – [email protected]