Solr, Lucene and Hadoop @ Etsy

Upload: lucenerevolution

Post on 06-May-2015

DESCRIPTION

Presented by David Giffin, Software Engineer, Etsy. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items, and people try to find them over 13 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges, including the evolution of indexing at Etsy, how HBase and Hadoop have taken indexing from hours to minutes, how and why we use BitTorrent for Solr replication, how we track search performance, our approach to shaving crucial milliseconds off every search, and an overview of our continuous deployment strategy, web/search config integration, A/B testing, and analytics.

TRANSCRIPT

Page 1: Solr, Lucene and Hadoop @ Etsy

Solr, Lucene & Hadoop @ Etsy

Thursday, May 10, 12

Page 2: Solr, Lucene and Hadoop @ Etsy

4 Years Lucene and Solr @ Etsy

[email protected]

Page 3: Solr, Lucene and Hadoop @ Etsy

History of Search @ Etsy

Hadoop + HBase Indexing (in development)

Replication

Page 4: Solr, Lucene and Hadoop @ Etsy

About Us

Page 5: Solr, Lucene and Hadoop @ Etsy

Page 6: Solr, Lucene and Hadoop @ Etsy

Page 7: Solr, Lucene and Hadoop @ Etsy

Page 8: Solr, Lucene and Hadoop @ Etsy

13MM Listings

39MM Unique Visitors

880K Shops / 150 Countries

100+ Engineers

Page 9: Solr, Lucene and Hadoop @ Etsy

Architecture Overview

Page 10: Solr, Lucene and Hadoop @ Etsy

Overview

Diagram: Search (+n slaves), Memcached (+n caches), Web (+n webs), Database (+n db shards)

Page 11: Solr, Lucene and Hadoop @ Etsy

Diagram: Web (web, web, +n webs), Thrift, Search (slave, slave, +n slaves)

query = hats for cats

result = 402, 283, 837

Page 12: Solr, Lucene and Hadoop @ Etsy

Diagram: Web (web, web, +n webs), Database (shard, shard, +n shards), Memcached (cache, cache, +n caches)

Hydration
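The query slide (query = hats for cats, result = 402, 283, 837) and this Hydration slide suggest that search returns only listing IDs, and that the web tier then fills in the full records from Memcached or the database shards. The following is a minimal sketch of such a read path, not Etsy's code. The names `hydrate` and `shard_for`, the dict-based stand-ins for the cache and shards, and the modulo shard choice are all assumptions made for illustration.

```python
from typing import Dict, List


def shard_for(listing_id: int, shard_count: int) -> int:
    """Pick a shard by modulo (an assumption; the talk does not say how shards are chosen)."""
    return listing_id % shard_count


def hydrate(ids: List[int],
            cache: Dict[int, dict],
            shards: List[Dict[int, dict]]) -> List[dict]:
    """Turn search result IDs into full records, preserving the search ranking order."""
    # First pass: take whatever the cache already holds.
    found = {i: cache[i] for i in ids if i in cache}
    # Second pass: fall back to the owning shard for cache misses.
    for listing_id in ids:
        if listing_id in found:
            continue
        shard = shards[shard_for(listing_id, len(shards))]
        record = shard.get(listing_id)
        if record is not None:
            cache[listing_id] = record  # populate the cache for the next request
            found[listing_id] = record
    # IDs that no longer exist anywhere are silently dropped.
    return [found[i] for i in ids if i in found]
```

Keeping only IDs in the search result keeps the index small and lets the database remain the source of truth for display data.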

Page 13: Solr, Lucene and Hadoop @ Etsy

The Results

Page 14: Solr, Lucene and Hadoop @ Etsy

History of Search at Etsy

Page 15: Solr, Lucene and Hadoop @ Etsy

History of Search

2007

• 1 Million Listings
• A Single “Master” Postgres Database
• PHP > Twisted > Stored Proc > TSearch
• 18 “Baby” Postgres Databases
• Baby Replicator

Page 16: Solr, Lucene and Hadoop @ Etsy

History of Search

2008

• 2 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 4 Solr Slaves + 2 Masters
• Baby Replicator + DIH for Reindexing

Page 17: Solr, Lucene and Hadoop @ Etsy

History of Search

2009

• 4 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 6 Solr Slaves + 2 Masters
• Webs > ActiveMQ > Solr

Page 18: Solr, Lucene and Hadoop @ Etsy

History of Search

2010

• 7 Million Listings
• A Single “Master” Postgres Database
• PHP > Thrift > Solr
• 10 Solr Slaves + 1 Master
• Custom Import Handler

Page 19: Solr, Lucene and Hadoop @ Etsy

History of Search

2011

• 10 Million Listings
• “Master” Postgres Database + DB SHARDS!
• PHP > Thrift > Solr
• 24 Solr Slaves + 1 Master
• Custom Import Handler

Page 20: Solr, Lucene and Hadoop @ Etsy

Future of Search

2012

• ?? Million Listings
• MORE DB SHARDS!
• PHP > Thrift > Solr
• ?? Solr Slaves + 1 Master
• HBase + Hadoop Indexers

Page 21: Solr, Lucene and Hadoop @ Etsy

What Did We Learn?

Page 22: Solr, Lucene and Hadoop @ Etsy

Lucene + Solr > TSearch

http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/

Page 23: Solr, Lucene and Hadoop @ Etsy

Love Lucene + Solr Trunk!

Page 24: Solr, Lucene and Hadoop @ Etsy

Run, Don’t Walk...

Page 25: Solr, Lucene and Hadoop @ Etsy

Fork it: https://github.com/etsy/deployinator

Deployinator

Page 26: Solr, Lucene and Hadoop @ Etsy

Smoker

Page 27: Solr, Lucene and Hadoop @ Etsy

StatsD, Graph Everything!

Fork it: https://github.com/etsy/statsd
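To make "Graph Everything" concrete, here is a rough sketch of reporting one timing to StatsD. StatsD accepts plain-text UDP datagrams of the form `bucket:value|ms` for timers. The bucket name `search.perform_search`, the helper names, and the host and port defaults (8125 is the usual StatsD port) are illustrative assumptions, not taken from the talk.

```python
import socket


def format_timing(bucket: str, millis: float) -> bytes:
    """Build a StatsD timer datagram: '<bucket>:<milliseconds>|ms'."""
    return f"{bucket}:{int(round(millis))}|ms".encode("ascii")


def send_timing(bucket: str, millis: float,
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; a lost packet only costs one data point."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_timing(bucket, millis), (host, port))
```

Because the transport is UDP, instrumenting a hot path such as a search request adds very little latency and cannot block on the metrics server.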

Page 28: Solr, Lucene and Hadoop @ Etsy

Page 29: Solr, Lucene and Hadoop @ Etsy

95th Percentile
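As a small illustration of why the 95th percentile is worth graphing, the sketch below computes a percentile with the nearest-rank method. This is one common definition and is not necessarily how StatsD or Etsy's dashboards compute it; the function name and sample values are made up for the example.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical search latencies in milliseconds, with one slow outlier.
latencies_ms = [10, 20, 30, 40, 1000]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With these made-up numbers the median looks healthy while the 95th percentile exposes the slow request, which is the case averages and medians tend to hide.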

Page 30: Solr, Lucene and Hadoop @ Etsy

start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render

Page 31: Solr, Lucene and Hadoop @ Etsy

Solr Top Level Cache > Memcached

Page 32: Solr, Lucene and Hadoop @ Etsy

etsy-index.properties

$ cat /search/data/person/index/etsy-index.properties
#Tue Mar 27 13:05:51 EDT 2012
max_update_time=2012-03-27T17\:05\:51.955Z
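The file above is in Java properties format, where colons in values are escaped as `\:`. Below is a minimal sketch of reading `max_update_time` back as a UTC timestamp. It handles only the simple `key=value` lines shown on the slide, not the full properties syntax, and the function names are hypothetical. How Etsy consumes this value is not stated in the slides.

```python
from datetime import datetime, timezone


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping comments and unescaping '\\:'."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # blank line or comment
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def max_update_time(props: dict) -> datetime:
    """Return max_update_time as a timezone-aware UTC datetime."""
    stamp = datetime.strptime(props["max_update_time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return stamp.replace(tzinfo=timezone.utc)
```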

Page 33: Solr, Lucene and Hadoop @ Etsy

Check Index Size

Don’t Install if < 50% Current Size

Page 34: Solr, Lucene and Hadoop @ Etsy

Check if Index is Too Old

Don’t Update if > 10 Days Old
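The two safety checks from this slide and the previous one can be sketched as simple guard functions. The thresholds (50% of current size, 10 days) come from the slides; the function names, the use of byte sizes, and the idea of passing the last update time in explicitly are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

MIN_SIZE_RATIO = 0.5                # "Don't Install if < 50% Current Size"
MAX_INDEX_AGE = timedelta(days=10)  # "Don't Update if > 10 Days Old"


def ok_to_install(new_size_bytes: int, current_size_bytes: int) -> bool:
    """Refuse a new index that is suspiciously small compared to the live one."""
    return new_size_bytes >= MIN_SIZE_RATIO * current_size_bytes


def ok_to_update(last_update: datetime, now: datetime) -> bool:
    """Refuse incremental updates on an index that has fallen too far behind."""
    return now - last_update <= MAX_INDEX_AGE
```

A size check like this catches a truncated or partially built index before it replaces a good one.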

Page 35: Solr, Lucene and Hadoop @ Etsy

What Did We Learn?

Store Nothing

Page 36: Solr, Lucene and Hadoop @ Etsy

Keep Denormalized Data

Page 37: Solr, Lucene and Hadoop @ Etsy

Diagram: Search Database, DB Shard, DB Shard, DB Shard, PHP Denormalizer, JSON

Page 38: Solr, Lucene and Hadoop @ Etsy

Full Reindex

Apply Incremental

Install

Page 39: Solr, Lucene and Hadoop @ Etsy

Full Reindex

Apply Incremental

Install

Apply Incremental

Page 40: Solr, Lucene and Hadoop @ Etsy

Indexer

Database

Page 41: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Page 42: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Why HBase?

Page 43: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Diagram: HBase, DB Shard, DB Shard, DB Shard, PHP Denormalizer, JSON

Page 44: Solr, Lucene and Hadoop @ Etsy

{NAME => 'listings_denormalized', FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}

listings_denormalized

HBase + Hadoop Indexing

Page 45: Solr, Lucene and Hadoop @ Etsy

{NAME => 'listings_denormalized_modified_index', FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}

listings_denormalized_modified_index

HBase + Hadoop Indexing

Page 46: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

SOLR-1301

https://issues.apache.org/jira/browse/SOLR-1301

Page 47: Solr, Lucene and Hadoop @ Etsy

Diagram: HDFS, Disk, Solr Output Format

• Solr Document Converter
• Solr Requires POSIX Disk
• Index Copied Back to HDFS

HBase + Hadoop Indexing

Page 48: Solr, Lucene and Hadoop @ Etsy

• Not Great with Multi-Core Configs
• Added Solr Multi-Core Support
• Solr Config Issues
• Added ENV Support for Configs
• Uses “new” style Hadoop API
• Added Support for both Old and New

HBase + Hadoop Indexing

Page 49: Solr, Lucene and Hadoop @ Etsy

SolrInputDocumentWritable

HBase + Hadoop Indexing

public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {

Page 50: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Oozie

Page 51: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Oozie + HBase?

Page 52: Solr, Lucene and Hadoop @ Etsy

ScanStringGenerator

HBase + Hadoop Indexing

http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/

Page 53: Solr, Lucene and Hadoop @ Etsy

Hadoop Indexer

Diagram: Start, Oozie, Map, HBase, HDFS, Reduce, Disk, Solr Output, Copy, Merge, Install

HBase + Hadoop Indexing

Page 54: Solr, Lucene and Hadoop @ Etsy

IndexerActionMain

HBase + Hadoop Indexing

Page 55: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Deployinator

Page 56: Solr, Lucene and Hadoop @ Etsy

IndexCompare

HBase + Hadoop Indexing

Page 57: Solr, Lucene and Hadoop @ Etsy

$ ./compare
ERROR: please provide two index directories
example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
options:
  -p --percent=   percent of the index to check
  -i --id=        primary key id field in the index
  -h --hash=      comparison or hash field in the index
  <index> <index>

HBase + Hadoop Indexing

Page 58: Solr, Lucene and Hadoop @ Etsy

$ ./compare \
  /search/data/person/index-1332867952588/ \
  /search/data/person/index-1335378487672

id field: user_id
hash field: hash
percentage: 0.0010
files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672

/search/data/person/index-1332867952588 contains 1515512 docs
/search/data/person/index-1335378487672 contains 14837972 docs
1516 of 1516 documents are the same
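The output above suggests that the tool samples a fraction of one index and checks each sampled document's hash field against the other index. The sketch below shows that idea only; it is not the IndexCompare source. Each index is modelled as a dict from primary key to hash value, and the function name, the dict model, and the rounding of the sample size are all assumptions. Rounding up happens to match 1516 samples for 1515512 docs at 0.0010, but the real tool may compute it differently.

```python
import math
import random
from typing import Dict, Hashable, Optional, Tuple


def compare_indexes(old: Dict[Hashable, str],
                    new: Dict[Hashable, str],
                    fraction: float,
                    seed: Optional[int] = None) -> Tuple[int, int]:
    """Sample `fraction` of the ids in `old` and count how many have the same hash in `new`.

    Returns (same, sampled).
    """
    rng = random.Random(seed)
    sample_size = min(len(old), math.ceil(len(old) * fraction))
    sampled_ids = rng.sample(sorted(old), sample_size)
    same = sum(1 for doc_id in sampled_ids if new.get(doc_id) == old[doc_id])
    return same, sample_size


# Hypothetical data: two indexes built from the same rows.
old_index = {user_id: f"hash-{user_id}" for user_id in range(1000)}
new_index = dict(old_index)
same, sampled = compare_indexes(old_index, new_index, 0.1, seed=1)
print(f"{same} of {sampled} documents are the same")
```

Sampling keeps the check cheap enough to run on every build, at the cost of only probabilistic coverage.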

HBase + Hadoop Indexing

Page 59: Solr, Lucene and Hadoop @ Etsy

Copy and Merge

HBase + Hadoop Indexing

Page 60: Solr, Lucene and Hadoop @ Etsy

Open Source

HBase + Hadoop Indexing

Page 61: Solr, Lucene and Hadoop @ Etsy

Replication

Page 62: Solr, Lucene and Hadoop @ Etsy

Replication

Page 63: Solr, Lucene and Hadoop @ Etsy

Diagram: Master, Slaves (+n slaves)

Replication

Page 64: Solr, Lucene and Hadoop @ Etsy

Page 65: Solr, Lucene and Hadoop @ Etsy

BitTorrent Replication

Page 66: Solr, Lucene and Hadoop @ Etsy

BitTorrent

Using BitTornado:

Page 67: Solr, Lucene and Hadoop @ Etsy

Replication

BitTorrent + Solr

Page 68: Solr, Lucene and Hadoop @ Etsy

Replication

BitTorrent + Solr

Page 69: Solr, Lucene and Hadoop @ Etsy

Page 70: Solr, Lucene and Hadoop @ Etsy

Page 71: Solr, Lucene and Hadoop @ Etsy

Fork of TTorrent: https://github.com/etsy/ttorrent

Multi-File Support

Large File Support

Fork BitTorrent: Coming Soon

Replication
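Multi-file support matters for index replication because a Lucene index is a directory of many segment files. In the original BitTorrent metainfo format, the files of a torrent are treated as one concatenated byte stream, cut into fixed-size pieces, and each piece is identified by its SHA-1 digest, so a piece can span a file boundary. The sketch below shows only that piece-hashing step on in-memory data. It is not the ttorrent API, and the function name and sample inputs are hypothetical.

```python
import hashlib
from typing import Iterable, List


def piece_hashes(files: Iterable[bytes], piece_length: int) -> List[bytes]:
    """SHA-1 digest of each fixed-size piece of the concatenated file contents."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    hashes = []
    buffer = b""
    for content in files:
        buffer += content
        # Emit every complete piece currently in the buffer.
        while len(buffer) >= piece_length:
            hashes.append(hashlib.sha1(buffer[:piece_length]).digest())
            buffer = buffer[piece_length:]
    if buffer:
        # The final piece may be shorter than piece_length.
        hashes.append(hashlib.sha1(buffer).digest())
    return hashes
```

Because peers verify each piece independently, slaves can fetch different pieces from one another instead of all pulling the full index from the master.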

Page 72: Solr, Lucene and Hadoop @ Etsy

Need a job?

Page 73: Solr, Lucene and Hadoop @ Etsy

Page 74: Solr, Lucene and Hadoop @ Etsy

Thanks!
