elasticsearch @ shopwiki 2014-03-20

Elasticsearch @ ShopWiki

What is ShopWiki?

• ShopWiki is the retail division of Oversee.net.

• We run a collection of retail websites, including the Comparison Shopping Engines (CSE):
– ShopWiki.com
– Compare.com

How do we use Elasticsearch?

• You know, for search (not logging).

• We index millions of products, offered from hundreds of thousands of stores, and allow users to search them.

Why Elasticsearch?

• ShopWiki was built using a proprietary search server written in C++.

• Served us well for many years, but it needed improvements, especially for non-English language search.

• What about Lucene-based solutions?

Solr3

• We tried out Solr3 when building CouponFinder.com.

• Solr worked well (for English & French), but the coupon dataset is small in comparison to our product dataset.

• The setup was simple master-slave replication.

How do we scale?

• To use Solr for our product data we needed to shard the data across multiple machines.

• But, Solr3’s sharding capabilities were clunky and difficult to use.

• Enter Elasticsearch!

• Designed to scale out-of-the-box.

Compare.com

• Compare.com was built using Elasticsearch from the start.

• Allowed us to get up & running very quickly.

• Allowed us to scale up very quickly.
– 60 million products and growing.

• Allows us to iterate on new features quickly.

Other Languages

• ShopWiki search is being gradually ported to Elasticsearch.

• Allows us to have better non-English search right out-of-the-box.
– French
– German
– Dutch
– Spanish
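A minimal sketch of how per-language search could be wired up, assuming each site's index picks one of Elasticsearch's built-in language analyzers. The language codes, the "product" type, and the "title" field are hypothetical names chosen for illustration, not taken from the ShopWiki setup.

```python
# Built-in Elasticsearch language analyzers, keyed by a hypothetical
# two-letter site language code.
ANALYZERS = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "nl": "dutch",
    "es": "spanish",
}


def product_mapping(lang):
    """Build a mapping body whose text field uses the analyzer for `lang`.

    The "product" type and "title" field are assumptions; unknown
    languages fall back to the "standard" analyzer.
    """
    analyzer = ANALYZERS.get(lang, "standard")
    return {
        "product": {
            "properties": {
                # "string" is the text field type in Elasticsearch 1.x.
                "title": {"type": "string", "analyzer": analyzer},
            }
        }
    }
```

The returned dictionary is only a request body; sending it to the cluster is left to whichever client is in use.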

Our Elasticsearch Cluster

• 12 indices, one for each website.

• 3 replicas per shard.

• 3 master nodes (quorum of 2).

• 6 data nodes.

• Plan to add more data nodes as we proceed with our migration of ShopWiki (500m products).

• Expect to need less hardware than the C++ cluster (which uses 50+ machines).
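The quorum of 2 for 3 master nodes follows the usual majority rule, floor(n / 2) + 1. As a rough sketch, the helper names below are hypothetical; `discovery.zen.minimum_master_nodes` is the Elasticsearch 1.x setting that normally carries this value, though the deck does not show the actual configuration.

```python
def master_quorum(master_eligible_nodes):
    """Smallest majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def quorum_setting(master_eligible_nodes):
    """Return the quorum as a settings fragment (an illustrative dict only)."""
    return {
        "discovery.zen.minimum_master_nodes": master_quorum(master_eligible_nodes)
    }
```

With 3 master nodes this gives 2, so the cluster can lose one master and still elect a leader without risking a split brain.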

Elasticsearch Head

Realtime Updates

• C++ search servers need to have the entire dataset re-indexed and swapped out all at once.

• Could only do this once a day, at night (affects performance).

• With Elasticsearch, we can update our data all the time (it’s not even a limiting factor).

Challenges

• Use TermsFacet to suggest filters to the user.

• E.g. filter by stores or brands.

• Using the 10 most frequent brands from a search can produce bad results.
– A single brand may have lots of products that are all weakly relevant.

Top-N Faceting

• The solution in Solr is to limit facets to the top-N results.

• Elasticsearch doesn’t have this feature (as mentioned at last Meetup).

• Solution: TermsStatsFacet (AKA aggregations in 1.0)

• Allows us to get the brands/stores with the most relevant results.

• E.g. Σ(score^n), where n allows us to tune facet results to our liking.

n = 0 (same as count)

TermsStatsFacet for Brands

Query: “mixing bowl”

Σ(score^n), n = 4
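To illustrate the effect of the exponent, here is a small pure-Python simulation of ranking brands by the sum of each hit's score raised to the power n. It does not call Elasticsearch; the function name, brand names, and scores are made up for the example.

```python
from collections import defaultdict


def rank_brands(hits, n, size=10):
    """Rank brands by the sum of score**n over their hits.

    `hits` is a list of (brand, score) pairs. With n = 0 every hit
    contributes 1, so the total is a plain document count.
    """
    totals = defaultdict(float)
    for brand, score in hits:
        totals[brand] += score ** n
    # Highest total first; ties broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


# Hypothetical hits: one brand with many weak matches, one with few strong ones.
hits = [
    ("bulk_brand", 0.2), ("bulk_brand", 0.2),
    ("bulk_brand", 0.2), ("bulk_brand", 0.2),
    ("niche_brand", 0.9), ("niche_brand", 0.8),
]

by_count = rank_brands(hits, n=0)  # bulk_brand ranks first (4 hits vs 2)
by_score = rank_brands(hits, n=4)  # niche_brand ranks first
```

Raising n shrinks the contribution of weakly relevant hits much faster than that of strong ones, which is why a brand with many marginal products drops down the list.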

De-duping Products

• Use “more_like_this” query to find similar products.

• If result’s score is “high enough”, it’s likely the same product from a different store.

• “High enough” is defined as a fraction of the identity match’s score.
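The thresholding step can be sketched as below. This assumes the source product appears in its own "more_like_this" results, so its score can serve as the identity match. The function names, the 0.8 fraction, and the tuning values in the query body are hypothetical; `like_text` is the Elasticsearch 1.x spelling of the query parameter.

```python
def more_like_this_body(text, fields):
    """Build a more_like_this request body (Elasticsearch 1.x parameter names)."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like_text": text,
                # Illustrative tuning values, not taken from the deck.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


def likely_duplicates(source_id, hits, fraction=0.8):
    """Return ids of hits scoring at least `fraction` of the identity score.

    `hits` is a list of (doc_id, score) pairs that is assumed to include
    the source document itself. Returns [] if the source is missing.
    """
    identity_score = dict(hits).get(source_id)
    if not identity_score:
        return []
    threshold = fraction * identity_score
    return [
        doc_id
        for doc_id, score in hits
        if doc_id != source_id and score >= threshold
    ]
```

For example, with hits `[("p1", 10.0), ("p2", 9.0), ("p3", 4.0)]` and source `"p1"`, only `"p2"` clears the 0.8 threshold and would be treated as the same product from another store.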

• Questions?

• Rob Stewart

• Lead Software Engineer

• [email protected]
