elasticsearch @ shopwiki 2014-03-20
DESCRIPTION
Slides from the NY Elasticsearch Meetup on March 20, 2014.
http://www.meetup.com/Elasticsearch-NY/events/170714812/
http://vimeo.com/90124531
TRANSCRIPT
Elasticsearch @ ShopWiki
What is ShopWiki?
• ShopWiki is the retail division of Oversee.net.
• We run a collection of retail websites,
• Including the Comparison Shopping Engines (CSE)
– ShopWiki.com
– Compare.com
How do we use Elasticsearch?
• You know, for search (not logging).
• We index millions of products, offered from hundreds of thousands of stores, and allow users to search them.
Why Elasticsearch?
• ShopWiki was built using a proprietary search server written in C++.
• Served us well for many years, but it needed improvements, especially for non-English language search.
• What about Lucene-based solutions?
Solr3
• We tried out Solr3 when building CouponFinder.com.
• Solr worked well (for English & French), but the coupon dataset is small in comparison to our product dataset.
• The setup was simple master-slave replication.
How do we scale?
• To use Solr for our product data we needed to shard the data across multiple machines.
• But, Solr3’s sharding capabilities were clunky and difficult to use.
• Enter Elasticsearch!
• Designed to scale out-of-the-box.
Compare.com
• Compare.com was built using Elasticsearch from the start.
• Allowed us to get up & running very quickly.
• Allowed us to scale up very quickly.
– 60 million products and growing.
• Allows us to iterate on new features quickly.
Other Languages
• ShopWiki search is being gradually ported to Elasticsearch.
• Allows us to have better non-English search right out-of-the-box.
– French
– German
– Dutch
– Spanish
Our Elasticsearch Cluster
• 12 indices, one for each website.
• 3 replicas per shard.
• 3 master nodes (quorum of 2).
• 6 data nodes.
• Plan to add more data nodes as we proceed with our migration of ShopWiki (500m products).
• Expect to need less hardware than the C++ cluster (uses 50+ machines).
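The "quorum of 2" for 3 master nodes follows the usual majority rule. A minimal sketch of that arithmetic, assuming the standard floor(n / 2) + 1 formula; the function name is ours, and the link to the Elasticsearch 1.x setting discovery.zen.minimum_master_nodes is our assumption rather than something stated in the slides:

```python
def master_quorum(master_eligible_nodes: int) -> int:
    """Smallest number of master-eligible nodes that forms a majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    # A strict majority: more than half of the master-eligible nodes.
    return master_eligible_nodes // 2 + 1


# With the 3 master nodes from the slide, the majority is 2.
print(master_quorum(3))
```

Requiring a majority means two halves of a partitioned cluster cannot both elect a master at the same time.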
Elasticsearch Head
Realtime Updates
• C++ search servers need to have the entire dataset re-indexed and swapped out all at once.
• Could only do this once a day, at night (affects performance).
• With Elasticsearch, we can update our data all the time (it’s not even a limiting factor).
Challenges
• Use TermsFacet to suggest filters to the user.
• E.g. filter by stores or brands.
• Using the 10 most frequent brands from a search can produce bad results.
– A single brand may have lots of products that are all weakly relevant.
Top-N Faceting
• The solution in Solr is to limit facets to the top-N results.
• Elasticsearch doesn’t have this feature (as mentioned at last Meetup).
• Solution: TermsStatsFacet (AKA aggregations in 1.0)
• Allows us to get the brands/stores with the most relevant results.
• E.g. Σ(score^n), where n allows us to tune facet results to our liking
n = 0 (same as count)
TermsStatsFacet for Brands
Query: “mixing bowl”
Σ(score^n), n = 4
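To illustrate how the exponent changes the ranking, here is a small sketch in plain Python (not an Elasticsearch request). It assumes each hit is a (brand, relevance score) pair; the brand names, scores, and the function name rank_facet_terms are invented for illustration:

```python
from collections import defaultdict


def rank_facet_terms(hits, n, size=10):
    """Rank facet terms by the sum of score**n over their hits."""
    totals = defaultdict(float)
    for term, score in hits:
        # n = 0 turns every hit into 1.0, so the sum is a plain count.
        totals[term] += score ** n
    # Highest total first; ties broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


# Hypothetical hits: BrandA has many weak matches, BrandB has two strong ones.
hits = [("BrandA", 0.2)] * 5 + [("BrandB", 3.0), ("BrandB", 2.5)]

print(rank_facet_terms(hits, n=0))  # count-based ranking favors BrandA
print(rank_facet_terms(hits, n=4))  # score-weighted ranking favors BrandB
```

Raising n makes a few highly relevant products outweigh many weakly relevant ones, which is the tuning knob the slide describes.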
De-duping Products
• Use “more_like_this” query to find similar products.
• If result’s score is “high enough”, it’s likely the same product from a different store.
• “High enough” is defined as a fraction of the identity match’s score.
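The threshold rule above can be sketched as follows. This is plain Python operating on scores that would come back from a “more_like_this” query; the 0.8 fraction, the product IDs, and the function name are hypothetical, since the slides do not give the actual value:

```python
def likely_duplicates(identity_score, candidates, fraction=0.8):
    """Return IDs of candidates scoring at least `fraction` of the identity match.

    identity_score: score of the product matched against itself.
    candidates: iterable of (product_id, score) for other products.
    fraction: assumed cutoff; tune it against known duplicates.
    """
    threshold = fraction * identity_score
    return [pid for pid, score in candidates if score >= threshold]


# Hypothetical scores for products similar to one whose identity score is 10.0.
candidates = [("store1-sku9", 9.1), ("store2-sku4", 4.0), ("store3-sku7", 8.5)]
print(likely_duplicates(10.0, candidates))
```

Expressing the cutoff relative to the identity score keeps it comparable across products, because raw scores vary from query to query.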
• Questions?
• Rob Stewart
• Lead Software Engineer
• [email protected]