Transcript
Page 1: Solr, Lucene and Hadoop @ Etsy

Solr, Lucene & Hadoop @ Etsy

Thursday, May 10, 12

Page 2: Solr, Lucene and Hadoop @ Etsy

4 Years Lucene and Solr @ Etsy

[email protected]

Page 3: Solr, Lucene and Hadoop @ Etsy

History of Search @ Etsy

Hadoop + HBase Indexing (in development)

Replication

Page 4: Solr, Lucene and Hadoop @ Etsy

About Us

Page 5: Solr, Lucene and Hadoop @ Etsy

Page 6: Solr, Lucene and Hadoop @ Etsy

Page 7: Solr, Lucene and Hadoop @ Etsy

Page 8: Solr, Lucene and Hadoop @ Etsy

13MM Listings

39MM Unique Visitors

880K Shops / 150 Countries

100+ Engineers

Page 9: Solr, Lucene and Hadoop @ Etsy

Architecture Overview

Page 10: Solr, Lucene and Hadoop @ Etsy

Overview

Search +n slaves

Memcached +n caches

Web +n webs

Database +n db shards

Page 11: Solr, Lucene and Hadoop @ Etsy

Thrift

slave

slave

Search

+n slaves

Web

web

web

+n webs

query = hats for cats

result = 402, 283, 837

Page 12: Solr, Lucene and Hadoop @ Etsy

Web

web

web

+n webs

Database

shard

+n shards

shard

Memcached

cache

cache

+n caches

Hydration
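
The two diagrams above suggest that search returns only listing IDs (e.g. 402, 283, 837) and that the web tier then "hydrates" those IDs into full records from Memcached, falling back to the database shards. A minimal sketch of that idea follows. The function name, the use of plain dicts in place of Memcached and the shards, and the modulo shard lookup are all assumptions for illustration, not Etsy's actual code.

```python
from typing import Dict, List


def hydrate(ids: List[int],
            cache: Dict[int, dict],
            shards: List[Dict[int, dict]]) -> List[dict]:
    """Turn search-result IDs into full records, keeping result order.

    `cache` stands in for Memcached and `shards` for the DB shards
    (both hypothetical stand-ins backed by plain dicts).
    """
    records = []
    for listing_id in ids:
        record = cache.get(listing_id)
        if record is None:
            # Assumed shard routing: simple modulo on the ID.
            shard = shards[listing_id % len(shards)]
            record = shard.get(listing_id)
            if record is not None:
                cache[listing_id] = record  # warm the cache for next time
        if record is not None:
            records.append(record)
    return records
```

Keeping only IDs in the search results means the index stays small, and the cache absorbs most of the read load for popular listings.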

Page 13: Solr, Lucene and Hadoop @ Etsy

The Results

Page 14: Solr, Lucene and Hadoop @ Etsy

History of Search at Etsy

Page 15: Solr, Lucene and Hadoop @ Etsy

History of Search: 2007

• 1 Million Listings
• A Single “Master” Postgres Database
• PHP > Twisted > Stored Proc > TSearch
• 18 “Baby” Postgres Databases
• Baby Replicator

Page 16: Solr, Lucene and Hadoop @ Etsy

History of Search: 2008

• 2 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 4 Solr Slaves + 2 Masters
• Baby Replicator + DIH for Reindexing

Page 17: Solr, Lucene and Hadoop @ Etsy

History of Search: 2009

• 4 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 6 Solr Slaves + 2 Masters
• Webs > ActiveMQ > Solr

Page 18: Solr, Lucene and Hadoop @ Etsy

History of Search: 2010

• 7 Million Listings
• A Single “Master” Postgres Database
• PHP > Thrift > Solr
• 10 Solr Slaves + 1 Master
• Custom Import Handler

Page 19: Solr, Lucene and Hadoop @ Etsy

History of Search: 2011

• 10 Million Listings
• “Master” Postgres Database + DB SHARDS!
• PHP > Thrift > Solr
• 24 Solr Slaves + 1 Master
• Custom Import Handler

Page 20: Solr, Lucene and Hadoop @ Etsy

Future of Search: 2012

• ?? Million Listings
• MORE DB SHARDS!
• PHP > Thrift > Solr
• ?? Solr Slaves + 1 Master
• HBase + Hadoop Indexers

Page 21: Solr, Lucene and Hadoop @ Etsy

What Did We Learn?

Page 22: Solr, Lucene and Hadoop @ Etsy

Lucene + Solr > TSearch

http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/

Page 23: Solr, Lucene and Hadoop @ Etsy

Love Lucene + Solr Trunk!

Page 24: Solr, Lucene and Hadoop @ Etsy

Run, Don’t Walk...

Page 25: Solr, Lucene and Hadoop @ Etsy

Fork it: https://github.com/etsy/deployinator

Deployinator

Page 26: Solr, Lucene and Hadoop @ Etsy

Smoker

Page 27: Solr, Lucene and Hadoop @ Etsy

StatsD, Graph Everything!

Fork it: https://github.com/etsy/statsd
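
To make "graph everything" concrete, here is a minimal sketch of emitting a metric in the StatsD plain-text line format (`bucket:value|type`, with an optional `|@rate` suffix) over UDP. The helper names, the bucket names, and the default host and port are assumptions for illustration; a real deployment would normally use an existing client library.

```python
import socket


def format_metric(bucket: str, value, metric_type: str,
                  sample_rate: float = 1.0) -> str:
    """Build one StatsD line, e.g. a timer ("ms") or counter ("c")."""
    line = f"{bucket}:{value}|{metric_type}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: UDP means a down collector never blocks the caller."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical usage: time one search phase and count one query.
# send_metric(format_metric("search.perform_search", 42, "ms"))
# send_metric(format_metric("search.queries", 1, "c", sample_rate=0.1))
```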

Page 28: Solr, Lucene and Hadoop @ Etsy

Page 29: Solr, Lucene and Hadoop @ Etsy

95th Percentile
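
For readers unfamiliar with the metric, a 95th-percentile latency is the value that 95% of requests come in at or under, which hides far less than an average does. A minimal sketch using the nearest-rank method follows. The function name is hypothetical, and other tools may use a different percentile definition.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]
```

With nineteen fast requests and one slow one, the 95th percentile still reports the fast value, while the 100th percentile (the maximum) exposes the outlier.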

Page 30: Solr, Lucene and Hadoop @ Etsy

start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render

Page 31: Solr, Lucene and Hadoop @ Etsy

Solr Top Level Cache > Memcached

Page 32: Solr, Lucene and Hadoop @ Etsy

etsy-index.properties

$ cat /search/data/person/index/etsy-index.properties
#Tue Mar 27 13:05:51 EDT 2012
max_update_time=2012-03-27T17\:05\:51.955Z

Page 33: Solr, Lucene and Hadoop @ Etsy

Check Index Size

Don’t Install if < 50% Current Size
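
A minimal sketch of this guard, assuming "size" means total bytes on disk for each index directory. The function names and the treatment of an empty current index are assumptions, not taken from the talk.

```python
import os


def directory_size(path: str) -> int:
    """Total size in bytes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def safe_to_install(new_size: int, current_size: int,
                    min_ratio: float = 0.5) -> bool:
    """Refuse a new index that is less than half the size of the live one."""
    if current_size == 0:
        # Assumption: with nothing live to compare against, allow the install.
        return True
    return new_size >= min_ratio * current_size
```

A sudden drop in index size usually points to a failed or partial build, so a cheap check like this can stop a bad index before it reaches production.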

Page 34: Solr, Lucene and Hadoop @ Etsy

Check if Index is Too Old

Don’t Update if > 10 Days Old
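
A minimal sketch of the age check, assuming the index's age is derived from the `max_update_time` value in `etsy-index.properties` shown earlier. The function names are hypothetical, and the parser handles only the one key and the escaped-colon format that appears in that file.

```python
from datetime import datetime, timedelta, timezone


def read_max_update_time(properties_text: str) -> datetime:
    """Extract max_update_time (UTC) from Java-properties-style text."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the timestamp comment
        key, _, value = line.partition("=")
        if key.strip() == "max_update_time":
            value = value.strip().replace("\\:", ":")  # undo properties escaping
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise KeyError("max_update_time not found")


def too_old(max_update_time: datetime, now: datetime,
            max_age: timedelta = timedelta(days=10)) -> bool:
    """True when the index is older than the allowed age."""
    return now - max_update_time > max_age
```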

Page 35: Solr, Lucene and Hadoop @ Etsy

What Did We Learn?

Store Nothing

Page 36: Solr, Lucene and Hadoop @ Etsy

Keep Denormalized Data

Page 37: Solr, Lucene and Hadoop @ Etsy

Search

Database

DB Shard

DB Shard

DB Shard

PHP Denormalizer

JSON

Page 38: Solr, Lucene and Hadoop @ Etsy

Full Reindex

Apply Incremental

Install

Page 39: Solr, Lucene and Hadoop @ Etsy

Full Reindex

Apply Incremental

Install

Apply Incremental

Page 40: Solr, Lucene and Hadoop @ Etsy

Indexer

Database

Page 41: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Page 42: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Why HBase?

Page 43: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

HBase

DB Shard

DB Shard

DB Shard

PHP Denormalizer

JSON

Page 44: Solr, Lucene and Hadoop @ Etsy

{NAME => 'listings_denormalized', FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}

listings_denormalized

HBase + Hadoop Indexing

Page 45: Solr, Lucene and Hadoop @ Etsy

{NAME => 'listings_denormalized_modified_index', FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}

listings_denormalized_modified_index

HBase + Hadoop Indexing

Page 46: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

SOLR-1301

https://issues.apache.org/jira/browse/SOLR-1301

Page 47: Solr, Lucene and Hadoop @ Etsy

HDFS

Disk

Solr Output Format

• Solr Document Converter
• Solr Requires Posix Disk
• Index Copied Back to HDFS

HBase + Hadoop Indexing

Page 48: Solr, Lucene and Hadoop @ Etsy

• Not Great with Multi-Core Configs
• Added Solr Multi-Core Support
• Solr Config Issues
• Added ENV support for Configs
• Uses “new” style Hadoop API
• Added Support for both Old and New

HBase + Hadoop Indexing

Page 49: Solr, Lucene and Hadoop @ Etsy

SolrInputDocumentWritable

HBase + Hadoop Indexing

public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {

Page 50: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Oozie

Page 51: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Oozie + HBase?

Page 52: Solr, Lucene and Hadoop @ Etsy

ScanStringGenerator

HBase + Hadoop Indexing

http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/

Page 53: Solr, Lucene and Hadoop @ Etsy

Hadoop Indexer

Start

Oozie

Map

HBase

HDFS

Reduce

Disk

Solr Output

Copy

Merge

Install

HBase + Hadoop Indexing

Page 54: Solr, Lucene and Hadoop @ Etsy

IndexerActionMain

HBase + Hadoop Indexing

Page 55: Solr, Lucene and Hadoop @ Etsy

HBase + Hadoop Indexing

Deployinator

Page 56: Solr, Lucene and Hadoop @ Etsy

IndexCompare

HBase + Hadoop Indexing

Page 57: Solr, Lucene and Hadoop @ Etsy

$ ./compare
ERROR: please provide two index directories
example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
options:
  -p --percent=  percent of the index to check
  -i --id=       primary key id field in the index
  -h --hash=     comparison or hash field in the index
  <index> <index>

HBase + Hadoop Indexing

Page 58: Solr, Lucene and Hadoop @ Etsy

$ ./compare \
/search/data/person/index-1332867952588/ \
/search/data/person/index-1335378487672

id field: user_id
hash field: hash
percentage: 0.0010
files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672

/search/data/person/index-1332867952588 contains 1515512 docs
/search/data/person/index-1335378487672 contains 14837972 docs
1516 of 1516 documents are the same
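
The output above suggests the tool samples a fraction of the documents in one index and compares a per-document hash field against the other index. The sketch below illustrates that idea only. It assumes each index can be represented as a mapping from primary key to hash value, and the function name, the random sampling strategy, and the rounding of the sample size are all assumptions rather than the behavior of the real `compare` tool.

```python
import random
from typing import Dict, Hashable, Tuple


def compare_indexes(index_a: Dict[Hashable, str],
                    index_b: Dict[Hashable, str],
                    fraction: float,
                    seed: int = 0) -> Tuple[int, int]:
    """Sample `fraction` of index_a's documents and count matching hashes.

    Returns (documents_the_same, documents_checked).
    """
    ids = sorted(index_a)
    if not ids:
        return 0, 0
    sample_size = max(1, round(len(ids) * fraction))
    sample = random.Random(seed).sample(ids, sample_size)
    same = sum(
        1 for doc_id in sample
        if doc_id in index_b and index_b[doc_id] == index_a[doc_id]
    )
    return same, sample_size
```

Sampling keeps the check cheap enough to run on every build, at the cost of possibly missing a small number of differing documents.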

HBase + Hadoop Indexing

Page 59: Solr, Lucene and Hadoop @ Etsy

Copy and Merge

HBase + Hadoop Indexing

Page 60: Solr, Lucene and Hadoop @ Etsy

Open Source

HBase + Hadoop Indexing

Page 61: Solr, Lucene and Hadoop @ Etsy

Replication

Page 62: Solr, Lucene and Hadoop @ Etsy

Replication

Page 63: Solr, Lucene and Hadoop @ Etsy

Slaves

+n slaves

Master

Replication

Page 64: Solr, Lucene and Hadoop @ Etsy

Page 65: Solr, Lucene and Hadoop @ Etsy

BitTorrent Replication

Page 66: Solr, Lucene and Hadoop @ Etsy

Bit Torrent

Using BitTornado:

Page 67: Solr, Lucene and Hadoop @ Etsy

Replication

Bit Torrent + Solr

Page 68: Solr, Lucene and Hadoop @ Etsy

Replication

Bit Torrent + Solr

Page 69: Solr, Lucene and Hadoop @ Etsy

Page 70: Solr, Lucene and Hadoop @ Etsy

Page 71: Solr, Lucene and Hadoop @ Etsy

Fork of TTorrent: https://github.com/etsy/ttorrent

• Multi-File Support
• Large File Support

Fork BitTorrent: Coming Soon

Replication

Page 72: Solr, Lucene and Hadoop @ Etsy

Need a job?

Page 73: Solr, Lucene and Hadoop @ Etsy

Page 74: Solr, Lucene and Hadoop @ Etsy

Thanks!
