haystack training

Post on 04-Apr-2015

DESCRIPTION

From a November 2010 training session with CMG Digital.

TRANSCRIPT

Getting The Most Out Of Haystack

Daniel Lindsley

Pragmatic Badger, LLC

Tuesday, December 28, 2010

Terminology

“Engine”

• The actual search engine

• Here be interesting computer science problems

• Examples: Solr, Xapian, Whoosh

“Document”

• A single record in the index

• Usually accompanied by 1+ fields of metadata

• Heavily processed

“Corpus”

• The collection of indexed documents

• Latin for “body”

“Stemming”

• Find the root of the word

• Part of the “magic” of search

• More on this later...

“Relevance”

• A metric of how well a document matches the query

• Search’s killer feature

• Hard to get 100% right

“Faceting”

• Count of docs meeting certain criteria within your result set

• Drill down!

• Think Amazon/eBay

• More on this later...
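A minimal sketch of the faceting idea in plain Python, not any engine's API. The result dictionaries and the `facet_counts` and `drill_down` helpers are hypothetical names made up for illustration: count the values of a field within the current result set, then narrow the set by one of those values.

```python
from collections import Counter

# Hypothetical result set: documents that already matched some query.
results = [
    {"id": 1, "author": "johndoe", "category": "news"},
    {"id": 2, "author": "sally1982", "category": "sports"},
    {"id": 3, "author": "johndoe", "category": "news"},
    {"id": 4, "author": "bob_the_third", "category": "news"},
]


def facet_counts(docs, field):
    """Count how many documents in the result set carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def drill_down(docs, field, value):
    """Narrow the result set to documents whose `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


category_facets = facet_counts(results, "category")
news_only = drill_down(results, "category", "news")
author_facets = facet_counts(news_only, "author")
```

The counts are always relative to the current result set, which is why the author facet changes after drilling down into one category.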

“Boost”

• A way to artificially increase the relevance of a document

• Types: Document/Field/Term
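A toy illustration of the three boost types, assuming a deliberately naive relevance score (term occurrences per field). Real engines score very differently; the `score` function and the `boost` key are hypothetical and exist only to show where each multiplier applies.

```python
def score(doc, term, term_boost=1.0, field_boosts=None):
    """Toy relevance: occurrences of `term` per field, scaled by boosts."""
    field_boosts = field_boosts or {}
    total = 0.0
    for field, text in doc["fields"].items():
        hits = text.lower().split().count(term.lower())
        # Field boost: matches in some fields count for more.
        total += hits * field_boosts.get(field, 1.0)
    # Term boost scales this term; document boost scales the whole document.
    return total * term_boost * doc.get("boost", 1.0)


fields = {"title": "Search basics", "body": "search is fun"}
plain = {"fields": fields}
boosted = {"fields": fields, "boost": 2.0}

base = score(plain, "search")
with_field = score(plain, "search", field_boosts={"title": 3.0})
with_doc = score(boosted, "search")
with_term = score(plain, "search", term_boost=1.5)
```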

Introduction to Search

Search != RDBMS

• The sooner you get over that, the easier everything that follows will be.

• Think “document store”.

Stemming

• Porter-Stemmer or Snowball

• The engine takes terms & hacks them down to the root word.

• Examples:

“testing” → “test”

“searchers” → “searcher” → “search”
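The examples above can be reproduced with a toy suffix stripper. This is not the Porter or Snowball algorithm, only a rough sketch of the "hack it down to the root" idea; the suffix list and the `toy_stem` name are assumptions made for illustration.

```python
# Illustrative suffix list only; real stemmers use far richer rules.
SUFFIXES = ("ing", "er", "s")


def toy_stem(word, min_root=3):
    """Repeatedly strip a known suffix while the remaining root stays long enough."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


stems = {word: toy_stem(word) for word in ("testing", "searchers", "search")}
```

Because both the indexed text and the query go through the same reduction, a query for "searchers" can match a document that only contains "search".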

Inverted Index

• The power of the engine starts here

• Basically a reverse mapping from the stemmed form of a term to the collection of documents containing the term

...
"search": [3, 104, 238],
...

Inverted Index

• Very fast lookups

• NOT a “contains” or “like” lookup unless you say so (slower)
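A minimal sketch of an inverted index as a plain dictionary, assuming whitespace tokenization and no stemming (a real engine would do both more carefully). The sample documents and the `build_inverted_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical corpus keyed by document ID.
docs = {
    3: "Search engines index documents",
    104: "A search returns documents",
    238: "Haystack makes search simple",
    512: "Django is a web framework",
}


def tokenize(text):
    """Lowercase and split on whitespace; a real engine would also stem."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for term in set(tokenize(documents[doc_id])):
            index[term].append(doc_id)
    return index


index = build_inverted_index(docs)
search_hits = index["search"]
```

A lookup is a single dictionary access on the whole term, which is why it is fast and why a partial string such as "sear" finds nothing unless the engine is told to do a slower wildcard-style match.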

Document Store

• Flat structure

• Generally free-form/schema-less

• Easiest to think about each record as a dictionary

• No relations built-in

Why custom search?

...or...

“Isn’t this what Google is for?”

Why custom search?

• You control what is (and is not) indexed

• Better quality data goes into the index

• Information-specific handling

• Provide context-specific search

Introduction to Haystack

What is Haystack?

At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.

Why Haystack?

• Familiar API

• Declarative

• “Looks” like Django

Why Haystack?

• Pluggable Backends

• Support Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!)

• Your code stays the same regardless of backend.

Why Haystack?

• Advanced Features

• Faceting

• More Like This

• Highlighting

• Boost

Why Haystack?

• Integration with third-party apps

• No need to fork their code

• Put the indexes in your code & register them

• Applies to django.contrib as well.

Why Haystack?

• Real Live Documentation™!

• http://docs.haystacksearch.org/dev/

• Test Coverage!

• Decent coverage

• No new commits without tests

Enough shameless self-promotion already!

Using Haystack

Two Phase Approach

• The “Data In” is SearchIndex

• The “Data Out” is SearchQuerySet

• Note: There’s a disconnect between your database & the search index

SearchIndex

SearchIndex

• Provides the means to get data into the index

• Something of a cross between a Form (the data preparation aspects) and a Model (the persistence)

SearchIndex

    import datetime

    from haystack import indexes, site
    from myapp.models import Entry


    class EntrySearchIndex(indexes.SearchIndex):
        text = indexes.CharField(document=True, use_template=True)
        author = indexes.CharField(model_attr='user__username')
        created = indexes.DateTimeField()

        def get_queryset(self):
            # Only published entries are indexed.
            return Entry.objects.published()

        def prepare_created(self, obj):
            # Fall back to the current time when pub_date is unset.
            return obj.pub_date or datetime.datetime.now()


    site.register(Entry, EntrySearchIndex)

`use_template=True`?

• Use Django templates to prep the data

• Example:

    # search/indexes/myapp/entry_text.txt
    {{ object.title }}
    {{ object.author.get_full_name }}
    {{ object.tease }}
    {{ object.content }}

SearchQuerySet

SearchQuerySet

• The reason to use Haystack

• Very powerful

• Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet

SearchQuerySet

• Fetches data from the index

• Very similar to QuerySet

• Intentional, to reduce conceptual overhead

• Lazily evaluated

• Chain methods

SearchQuerySet

• By default, searches across all models

• Can limit using SearchQuerySet.models

• Caches where possible
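To make "lazily evaluated", "chain methods" and "caches where possible" concrete, here is a toy stand-in written in plain Python. It is not Haystack's implementation; `LazyQuery` and `FakeBackend` are hypothetical classes that only mimic the behaviour described on these slides.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated, chainable query object."""

    def __init__(self, backend, filters=()):
        self._backend = backend
        self._filters = tuple(filters)
        self._cache = None

    def filter(self, **kwargs):
        # Chaining returns a new object; nothing is executed yet.
        return LazyQuery(self._backend, self._filters + tuple(kwargs.items()))

    def _fetch(self):
        # The backend is only queried the first time results are needed.
        if self._cache is None:
            self._cache = self._backend.search(dict(self._filters))
        return self._cache

    def __iter__(self):
        return iter(self._fetch())


class FakeBackend:
    """Hypothetical in-memory engine that counts how often it is queried."""

    def __init__(self, docs):
        self.docs = docs
        self.calls = 0

    def search(self, filters):
        self.calls += 1
        return [d for d in self.docs
                if all(d.get(key) == value for key, value in filters.items())]


backend = FakeBackend([
    {"pk": 5, "author": "johndoe", "kind": "entry"},
    {"pk": 3, "author": "sally1982", "kind": "entry"},
    {"pk": 2, "author": "johndoe", "kind": "page"},
])
query = LazyQuery(backend).filter(kind="entry").filter(author="johndoe")
calls_before = backend.calls            # still 0: building the query is free
first_pass = [d["pk"] for d in query]   # first iteration hits the backend
second_pass = [d["pk"] for d in query]  # second iteration reuses the cache
calls_after = backend.calls
```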

SearchQuerySet

    >>> import datetime
    >>> from haystack.query import SearchQuerySet
    >>> from myapp.models import Entry
    >>> sqs = SearchQuerySet().models(Entry)
    >>> sqs = sqs.filter(created__lte=datetime.datetime.now())
    >>> sqs = sqs.exclude(author='daniel')

    # Lazily performs the query when asked for results.
    >>> sqs
    [<SearchResult: myapp.entry (pk=u'5')>, <SearchResult: myapp.entry (pk=u'3')>, <SearchResult: myapp.entry (pk=u'2')>]

    # Iterable interface.
    # Still hasn't hit the DB.
    >>> [result.author for result in sqs]
    ['johndoe', 'sally1982', 'bob_the_third']

SearchQuerySet

    # Hits the database once per result.
    >>> [result.object.user.first_name for result in sqs]
    ['John', 'Sally', 'Bob']

    # More efficient loading from database (one query total).
    >>> [result.object.user.first_name for result in sqs.load_all()]
    ['John', 'Sally', 'Bob']

SearchView

SearchView

• Class-based view

• Hits 80% of regular usage

• A guideline to more advanced use

• Relies heavily on SearchForm

SearchForm

SearchForm

• Outside of using SearchQuerySet, it’s a standard Django form

• Defines a search method that does the necessary actions

SearchForm

    from django import forms
    from haystack.forms import SearchForm
    from myapp.models import Entry


    class EntrySearchForm(SearchForm):
        # Additional fields go here.
        author = forms.CharField(max_length=255, required=False)

        def search(self):
            sqs = super(EntrySearchForm, self).search()

            if self.cleaned_data.get('author'):
                sqs = sqs.filter(author=self.cleaned_data['author'])

            return sqs

SearchSite

SearchSite

• Registry pattern

• Collects all registered SearchIndex classes

• Used by SearchQuerySet to limit results to only things Haystack knows about

• Think django.contrib.admin.site.
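A bare-bones sketch of the registry pattern these bullets describe. This is not Haystack's `SearchSite` code; `ToySite`, `AlreadyRegistered` and the placeholder model and index classes are hypothetical and only show the shape of the idea.

```python
class AlreadyRegistered(Exception):
    """Raised when a model is registered twice."""


class ToySite:
    """Toy registry mapping model classes to index instances."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise AlreadyRegistered(model.__name__)
        self._registry[model] = index_class()

    def get_index(self, model):
        return self._registry[model]

    def get_indexed_models(self):
        return list(self._registry)


# Placeholder classes standing in for a Django model and its search index.
class Entry:
    pass


class Comment:
    pass


class EntryIndex:
    pass


toy_site = ToySite()
toy_site.register(Entry, EntryIndex)
known_models = toy_site.get_indexed_models()
```

Anything never registered (here `Comment`) is simply unknown to the registry, which is how results can be limited to the models the search layer knows about.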

Haystack Best Practices

Common Fields

• Try to find common fields as much as possible

• Reuse where it makes sense

• But don’t shoehorn if it doesn’t work

It’s Just Python

• When an out-of-the-box solution doesn't work for you, use SearchQuerySet & write what you need.

• It's just Django & Python.

load_all

• Appropriate use of SearchQuerySet.load_all

• One hit to the DB per content type

• But do you need to hit the DB?
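The difference between loading objects one at a time and loading them in one batch per content type can be sketched with a fake database that counts queries. Everything here (`FakeDB`, `load_one_by_one`, `load_batched`, the sample tables) is hypothetical and only illustrates the query counts, not how Haystack does it internally.

```python
from collections import defaultdict


class FakeDB:
    """Hypothetical database that counts how many queries it receives."""

    def __init__(self, tables):
        self.tables = tables
        self.queries = 0

    def get(self, content_type, pk):
        self.queries += 1
        return self.tables[content_type][pk]

    def get_many(self, content_type, pks):
        self.queries += 1
        return {pk: self.tables[content_type][pk] for pk in pks}


def load_one_by_one(db, results):
    """One query per search result."""
    return [db.get(content_type, pk) for content_type, pk in results]


def load_batched(db, results):
    """One query per content type, preserving the search result order."""
    pks_by_type = defaultdict(list)
    for content_type, pk in results:
        pks_by_type[content_type].append(pk)
    loaded = {ct: db.get_many(ct, pks) for ct, pks in pks_by_type.items()}
    return [loaded[content_type][pk] for content_type, pk in results]


tables = {
    "myapp.entry": {5: "Entry 5", 3: "Entry 3", 2: "Entry 2"},
    "myapp.page": {7: "Page 7"},
}
results = [("myapp.entry", 5), ("myapp.page", 7),
           ("myapp.entry", 3), ("myapp.entry", 2)]

slow_db = FakeDB(tables)
slow_objects = load_one_by_one(slow_db, results)
fast_db = FakeDB(tables)
fast_objects = load_batched(fast_db, results)
```

If the fields stored in the index are enough to render a result, neither loader is needed at all, which is the point of the last bullet.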

More Like This

• Cheap & very worth it

• LJWorld saw a 30% jump in traffic by adding it solely on story detail views.

• Cache it!

Other Ideas

• Admin Integration

• Integration with API

• Search “grouping”

• Vertical search

Solr Best Practices

Tomcat vs. Jetty

• Very close performance-wise

• Tomcat better when busy

• Jetty is smaller on RAM & easier to run

Tune JVM settings

-Xms (Minimum size)

-Xmx (Maximum size)

# Something close to...

- ``java -Xms1G -Xmx12G -jar start.jar``

- -XX:+PrintGCDetails (print GC info)

- -XX:+PrintGCTimeStamps (print GC info + timestamps)

JMX Console

• java -Dcom.sun.management.jmxremote -jar start.jar

• Then jconsole

• Find jetty in the process list.

• Lots of instrumentation

Tune solrconfig

• Proper query warming

• The default “solr rocks” doesn’t.

• Remove unused handlers (like partition)

Tune solrconfig

• Tuning the mergeFactor

• Not too high, not too low

• Big trade-off

Schema

• Use omitNorms where possible

• Only needed on full-text fields

• Same goes for indexed & stored

• The fewer fields, the better

Optimize!

• Seriously.

• Goes back through existing indexes & cleans up

• Takes a while to run, so make sure your timeout is high (custom settings file)

Commits

• Commit as infrequently as is reasonable

• Commit as much as you can at once

• queued_search shines here
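The batching advice can be sketched in a few lines of plain Python. This is not how queued_search or any Solr client is implemented; `BatchingIndexer` and `FakeBackend` are hypothetical and only show pending documents being buffered and committed in large groups.

```python
class FakeBackend:
    """Hypothetical engine client that records the size of each commit."""

    def __init__(self):
        self.commits = []

    def commit(self, docs):
        self.commits.append(len(docs))


class BatchingIndexer:
    """Buffer documents and commit them in batches instead of one by one."""

    def __init__(self, backend, batch_size=100):
        self.backend = backend
        self.batch_size = batch_size
        self.pending = []

    def add(self, doc):
        self.pending.append(doc)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Commit whatever is buffered; a no-op when nothing is pending.
        if self.pending:
            self.backend.commit(self.pending)
            self.pending = []


backend = FakeBackend()
indexer = BatchingIndexer(backend, batch_size=100)
for pk in range(250):
    indexer.add({"pk": pk})
indexer.flush()  # push the final partial batch
```

Here 250 documents cost three commits instead of 250; the batch size is a trade-off between commit overhead and how quickly new documents become searchable.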

Debugging

• Use &debugQuery=on to debug queries

• Use the browser interface!

Advanced Bits

• Learn & love the Solr stats page

• Replication

• n-gram based autocomplete

• Spelling suggestions

• the (Haystack) documented config sucks

• Dismax Handler
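For the n-gram based autocomplete bullet, a rough sketch of the underlying idea using edge n-grams (prefixes) in plain Python. In Solr this would be configured in the schema rather than written by hand; the function names and the size limits here are assumptions made for illustration.

```python
def edge_ngrams(term, min_size=2, max_size=15):
    """Return the prefixes of `term` between min_size and max_size characters."""
    term = term.lower()
    upper = min(len(term), max_size)
    return [term[:size] for size in range(min_size, upper + 1)]


def build_autocomplete_index(terms, min_size=2):
    """Map every edge n-gram to the set of full terms it can complete."""
    index = {}
    for term in terms:
        for gram in edge_ngrams(term, min_size=min_size):
            index.setdefault(gram, set()).add(term)
    return index


def suggest(index, prefix):
    """Exact lookup of the typed prefix; the n-grams were computed at index time."""
    return sorted(index.get(prefix.lower(), set()))


autocomplete = build_autocomplete_index(["haystack", "hay", "solr", "search"])
ha_suggestions = suggest(autocomplete, "ha")
```

The work is moved to index time: each keystroke becomes a plain term lookup, at the cost of a larger index.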

Resources

• https://gist.github.com/215331

• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

• http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

• http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

• http://wiki.apache.org/solr/SolrJmx

• http://wiki.apache.org/solr/LargeIndexes

• http://wiki.apache.org/solr/SolrPerformanceFactors

Resources

• http://wiki.apache.org/solr/SolrReplication

• http://www.yashh.com/blog/2010/nov/03/autocomplete-solr/

• http://charlesleifer.com/blog/search-on-djangosnippetsorg/

• http://wiki.apache.org/solr/SpellCheckComponent

Enough Talk. Let’s Go Work With It.

A Big Thanks To CMG Digital & @cmheisel For Having Me!

More Information

http://haystacksearch.org/

http://github.com/toastdriven/django-haystack

#haystack on irc.freenode.net

http://groups.google.com/group/django-haystack/

@daniellindsley on Twitter
