Solr, Lucene and Hadoop @ Etsy

DESCRIPTION

Presented by David Giffin, Software Engineer, Etsy. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items, and people try to find them over 13 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges, including the evolution of indexing at Etsy, how HBase and Hadoop have taken indexing from hours to minutes, how and why we use BitTorrent for Solr replication, how we track search performance, our approach to shaving crucial milliseconds off every search, and an overview of our continuous deployment strategy, web/search config integration, A/B testing, and analytics.

TRANSCRIPT

Solr, Lucene & Hadoop @ Etsy

Thursday, May 10, 12

4 Years Lucene and Solr @ Etsy

david@etsy.com

History of Search @ Etsy

Hadoop + HBase Indexing (in development)

Replication

About Us

13MM Listings

39MM Unique Visitors

880K Shops / 150 Countries

100+ Engineers

Architecture Overview

Overview

Search (+n slaves)
Memcached (+n caches)
Web (+n webs)
Database (+n db shards)

Thrift

Diagram: Web (web, web, +n webs) and Search (slave, slave, +n slaves)

query = hats for cats

result = 402, 283, 837

Hydration

Diagram: Web (web, web, +n webs), Database (shard, shard, +n shards), Memcached (cache, cache, +n caches)
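
The slides describe a two-step read path: search returns only IDs (402, 283, 837 for "hats for cats"), and the web tier then hydrates those IDs into full records using Memcached and the database shards. The following is a minimal sketch of that idea, assuming a cache-first lookup with a database fallback. The class name, the Map standing in for Memcached, and the lookup function standing in for a shard query are all hypothetical, not Etsy's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical sketch: turn the IDs returned by search into full records. */
class Hydrator {
    private final Map<Long, String> cache;          // stand-in for Memcached
    private final LongFunction<String> shardLookup; // stand-in for a DB shard query

    Hydrator(Map<Long, String> cache, LongFunction<String> shardLookup) {
        this.cache = cache;
        this.shardLookup = shardLookup;
    }

    /** Returns records in the same order as the search result IDs. */
    List<String> hydrate(List<Long> ids) {
        List<String> records = new ArrayList<>();
        for (long id : ids) {
            String record = cache.get(id);
            if (record == null) {
                // Cache miss: read from the database and populate the cache.
                record = shardLookup.apply(id);
                if (record != null) {
                    cache.put(id, record);
                }
            }
            if (record != null) {
                records.add(record); // IDs with no record are skipped
            }
        }
        return records;
    }
}
```

Because the index only has to return IDs, the search tier stays small and the record data has a single source of truth outside Solr.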

The Results

History of Search at Etsy

History of Search: 2007

• 1 Million Listings
• A Single “Master” Postgres Database
• PHP > Twisted > Stored Proc > TSearch
• 18 “Baby” Postgres Databases
• Baby Replicator

History of Search: 2008

• 2 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 4 Solr Slaves + 2 Masters
• Baby Replicator + DIH for Reindexing

History of Search: 2009

• 4 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 6 Solr Slaves + 2 Masters
• Webs > ActiveMQ > Solr

History of Search: 2010

• 7 Million Listings
• A Single “Master” Postgres Database
• PHP > Thrift > Solr
• 10 Solr Slaves + 1 Master
• Custom Import Handler

History of Search: 2011

• 10 Million Listings
• “Master” Postgres Database + DB SHARDS!
• PHP > Thrift > Solr
• 24 Solr Slaves + 1 Master
• Custom Import Handler

Future of Search: 2012

• ?? Million Listings
• MORE DB SHARDS!
• PHP > Thrift > Solr
• ?? Solr Slaves + 1 Master
• HBase + Hadoop Indexers

What Did We Learn?

Lucene + Solr > TSearch

http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/

Love Lucene + Solr Trunk!

Run, Don’t Walk...

Fork it: https://github.com/etsy/deployinator

Deployinator

Smoker

StatsD, Graph Everything!

Fork it: https://github.com/etsy/statsd

95th Percentile
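
Tracking the 95th percentile rather than the average keeps slow searches visible without letting a single outlier dominate the graph. As a minimal sketch, the nearest-rank definition can be computed as follows; this is one common definition and is not necessarily the exact method StatsD uses, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentile over a batch of timings. */
class Percentile {
    /** Smallest sample such that at least p percent of samples are at or below it. */
    static long nearestRank(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }
}
```

With twenty timings where nineteen are 10 ms and one is 900 ms, the 95th percentile is still 10 ms, while the maximum (the 100th percentile) is 900 ms.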

start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
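
Each checkpoint in that list can be reported as a StatsD timer, whose wire format is "bucket:value|ms". The sketch below records a timestamp per checkpoint and emits one timer line for the time spent reaching each checkpoint from the previous one. The class name and the "search.page" prefix are hypothetical, and the slides do not show how Etsy's PHP code actually emits these metrics.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-checkpoint timings formatted as StatsD timer lines. */
class CheckpointTimer {
    // LinkedHashMap keeps checkpoints in the order they were marked.
    private final Map<String, Long> marks = new LinkedHashMap<>();

    void mark(String checkpoint, long nowMillis) {
        marks.put(checkpoint, nowMillis);
    }

    /** One "bucket:value|ms" line per checkpoint after the first. */
    List<String> toStatsdLines(String prefix) {
        List<String> lines = new ArrayList<>();
        Long previous = null;
        for (Map.Entry<String, Long> e : marks.entrySet()) {
            if (previous != null) {
                long elapsed = e.getValue() - previous;
                lines.add(prefix + "." + e.getKey() + ":" + elapsed + "|ms");
            }
            previous = e.getValue();
        }
        return lines;
    }
}
```

Marking start at 1000, build_query at 1004 and perform_search at 1049 yields "search.page.build_query:4|ms" and "search.page.perform_search:45|ms". Each line would then be sent to the StatsD daemon as a UDP datagram.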

Solr Top Level Cache > Memcached

etsy-index.properties

$ cat /search/data/person/index/etsy-index.properties
#Tue Mar 27 13:05:51 EDT 2012
max_update_time=2012-03-27T17\:05\:51.955Z

Check Index Size

Don’t Install if < 50% Current Size
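
A size check like this guards against installing a truncated or partially built index. The following is a minimal sketch, assuming "size" means the total bytes of the files in the index directory; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical sketch of the "don't install if < 50% of current size" rule. */
class IndexSizeGuard {
    /** Total size in bytes of all regular files under dir. */
    static long directorySize(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile).mapToLong(p -> {
                try {
                    return Files.size(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).sum();
        }
    }

    /** True when the candidate index is at least half the size of the live one. */
    static boolean safeToInstall(long candidateBytes, long currentBytes) {
        return candidateBytes >= 0.5 * currentBytes;
    }
}
```

A caller would compare directorySize of the new index directory against directorySize of the live one and skip the install when safeToInstall returns false.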

Check if Index is Too Old

Don’t Update if > 10 Days Old
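
One way to implement the age check is to read max_update_time from etsy-index.properties and compare it with the current time. The sketch below assumes the age is measured from that property, which the slides do not state explicitly; the class and method names are hypothetical. java.util.Properties already unescapes the "\:" sequences seen in the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

/** Hypothetical sketch of the "don't update if > 10 days old" rule. */
class IndexAgeGuard {
    static final Duration MAX_AGE = Duration.ofDays(10);

    /** Parses max_update_time out of the contents of etsy-index.properties. */
    static Instant parseMaxUpdateTime(String fileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContents)); // turns "\:" into ":"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumes the property is present and is an ISO-8601 UTC timestamp.
        return Instant.parse(props.getProperty("max_update_time"));
    }

    /** True when the index is at most MAX_AGE old at the given time. */
    static boolean freshEnough(Instant maxUpdateTime, Instant now) {
        return Duration.between(maxUpdateTime, now).compareTo(MAX_AGE) <= 0;
    }
}
```

An index that is too old would need a full reindex rather than an incremental update, since replaying more than ten days of changes is likely slower and riskier than rebuilding.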

What Did We Learn?

Store Nothing

Keep Denormalized Data

Diagram: DB Shard, DB Shard, DB Shard, PHP Denormalizer, JSON, Search Database

Full Reindex > Apply Incremental > Install

Full Reindex > Apply Incremental > Install > Apply Incremental

Indexer

Database

HBase + Hadoop Indexing

Why HBase?

Diagram: DB Shard, DB Shard, DB Shard, PHP Denormalizer, JSON, HBase

{NAME => 'listings_denormalized', FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}

listings_denormalized

{NAME => 'listings_denormalized_modified_index', FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}

listings_denormalized_modified_index
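
The table name suggests a secondary index that lets an incremental indexer find rows changed since the last run without scanning all of listings_denormalized. The slides do not show the row key design, so the sketch below is purely an assumption: it uses a zero-padded modification time followed by the listing id as the row key, and a TreeMap as a stand-in for HBase, which stores rows sorted by row key. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of a "modified since" lookup over a sorted index table. */
class ModifiedIndexSketch {
    // Sorted map as a stand-in for an HBase table ordered by row key.
    private final TreeMap<String, Long> rows = new TreeMap<>();

    /** Assumed row key: zero-padded modification time, then the listing id. */
    static String rowKey(long modifiedMillis, long listingId) {
        return String.format("%013d:%d", modifiedMillis, listingId);
    }

    void recordChange(long modifiedMillis, long listingId) {
        rows.put(rowKey(modifiedMillis, listingId), listingId);
    }

    /** Listing ids modified at or after sinceMillis, found with a range scan. */
    List<Long> modifiedSince(long sinceMillis) {
        String start = String.format("%013d", sinceMillis);
        return new ArrayList<>(rows.tailMap(start, true).values());
    }
}
```

Zero-padding matters because row keys compare as strings: without it, a key starting with "999" would sort after one starting with "1000".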

SOLR-1301

https://issues.apache.org/jira/browse/SOLR-1301

Solr Output Format

• Solr Document Converter
• Solr Requires POSIX Disk
• Index Copied Back to HDFS

Diagram: HDFS, Disk

• Not Great with Multi-Core Configs
• Added Solr Multi-Core Support
• Solr Config Issues
• Added ENV support for Configs
• Uses “new” style Hadoop API
• Added Support for both Old and New

SolrInputDocumentWritable

public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {
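
The slide shows only the class declaration. Implementing Writable means providing write(DataOutput) and readFields(DataInput) so that Hadoop can pass documents from mappers to reducers. The sketch below illustrates that pattern without the Solr or Hadoop jars: the Writable interface is a local stand-in that mirrors the two methods of org.apache.hadoop.io.Writable, and SimpleDocumentWritable is a hypothetical, simplified document with one string value per field. It is not the body of Etsy's class.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable. */
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Hypothetical simplified document: field name -> single string value. */
class SimpleDocumentWritable implements Writable {
    final Map<String, String> fields = new LinkedHashMap<>();

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(fields.size()); // field count first, so the reader knows when to stop
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fields.clear(); // Hadoop reuses Writable instances between records
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            String value = in.readUTF();
            fields.put(name, value);
        }
    }
}
```

A real implementation would also need to handle multi-valued fields, non-string values and boosts, and writeUTF limits each string to 65535 encoded bytes.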

Oozie

Oozie + HBase?

ScanStringGenerator

http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/

Hadoop Indexer

Diagram: Start, Oozie, Map, HBase, Reduce, HDFS, Solr Output, Disk, Copy, Merge, Install

IndexerActionMain

Deployinator

IndexCompare

$ ./compare
ERROR: please provide two index directories
example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
options:
  -p --percent=   percent of the index to check
  -i --id=        primary key id field in the index
  -h --hash=      comparison or hash field in the index
  <index> <index>

$ ./compare \
  /search/data/person/index-1332867952588/ \
  /search/data/person/index-1335378487672

id field: user_id
hash field: hash
percentage: 0.0010
files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672

/search/data/person/index-1332867952588 contains 1515512 docs
/search/data/person/index-1335378487672 contains 14837972 docs
1516 of 1516 documents are the same
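
The output suggests that the tool samples a fraction of the documents in one index and checks that the hash field matches for the same id in the other: 1516 documents checked is consistent with 0.0010 of 1,515,512. The sketch below shows that sampling idea under stated assumptions. Each index is represented as a Map from id to hash instead of a Lucene index, the sample is drawn from the first index, and all names are hypothetical rather than taken from the actual IndexCompare tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch: spot-check a new index against an old one by id and hash. */
class IndexCompareSketch {
    /** Returns {matching, checked} for a random sample of ids from the old index. */
    static int[] compare(Map<String, String> oldIndex, Map<String, String> newIndex,
                         double fraction, long seed) {
        List<String> ids = new ArrayList<>(oldIndex.keySet());
        Collections.sort(ids);                      // fixed starting order
        Collections.shuffle(ids, new Random(seed)); // reproducible sample for a given seed
        int sampleSize = Math.min(ids.size(), (int) Math.ceil(fraction * ids.size()));
        int same = 0;
        for (String id : ids.subList(0, sampleSize)) {
            // A missing id in the new index yields null and counts as a mismatch.
            if (oldIndex.get(id).equals(newIndex.get(id))) {
                same++;
            }
        }
        return new int[] {same, sampleSize};
    }
}
```

Comparing a stored hash rather than every field keeps the check cheap enough to run before each index install.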

Copy and Merge

Open Source

Replication

Diagram: Master, Slaves (+n slaves)

BitTorrent Replication

Bit Torrent

Using BitTornado:

Replication: BitTorrent + Solr

Fork of TTorrent: https://github.com/etsy/ttorrent

Multi-File Support
Large File Support

Fork BitTorrent: Coming Soon

Need a job?

Thanks!

david@etsy.com
