Solr, Lucene and Hadoop @ Etsy
Post on 06-May-2015
Solr, Lucene & Hadoop @ Etsy
Thursday, May 10, 12
4 Years Lucene and Solr @ Etsy
david@etsy.com
History of Search @ Etsy
Hadoop + HBase Indexing (in development)
Replication
About Us
13MM Listings
39MM Unique Visitors
880K Shops / 150 Countries
100+ Engineers
Architecture Overview
Overview
Search +n slaves
Memcached +n caches
Web +n webs
Database +n db shards
Thrift
Search: slave, slave, +n slaves
Web: web, web, +n webs
query = hats for cats
result = 402, 283, 837
Hydration
Web: web, web, +n webs
Database: shard, shard, +n shards
Memcached: cache, cache, +n caches
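The hydration step can be sketched roughly as follows: search returns only listing IDs (402, 283, 837 in the earlier slide), and the web tier turns them into full records by asking the cache first and a database shard on a miss. This is a minimal sketch, not Etsy's code. The class name, the in-memory maps standing in for Memcached and the shards, and the modulo routing rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HydrationSketch {
    // Hypothetical stand-ins: one map plays Memcached, a list of maps plays the DB shards.
    private final Map<Long, String> cache = new HashMap<>();
    private final List<Map<Long, String>> shards;

    HydrationSketch(List<Map<Long, String>> shards) {
        this.shards = shards;
    }

    // Assumed routing rule (ID modulo shard count); the slides do not say how IDs map to shards.
    private Map<Long, String> shardFor(long id) {
        return shards.get((int) Math.floorMod(id, (long) shards.size()));
    }

    // Turns the IDs returned by search into full records, keeping the search ranking order.
    List<String> hydrate(List<Long> ids) {
        List<String> out = new ArrayList<>();
        for (long id : ids) {
            String record = cache.get(id);
            if (record == null) {
                record = shardFor(id).get(id);   // cache miss: read from the owning shard
                if (record != null) {
                    cache.put(id, record);       // fill the cache for the next request
                }
            }
            if (record != null) {
                out.add(record);                 // IDs missing from the database are skipped
            }
        }
        return out;
    }

    int cachedCount() {
        return cache.size();
    }
}
```

Keeping only IDs in the search result means the index never has to carry display data, which fits the "Store Nothing" lesson later in the deck.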
The Results
History of Search at Etsy
History of Search 2007
• 1 Million Listings
• A Single “Master” Postgres Database
• PHP > Twisted > Stored Proc > TSearch
• 18 “Baby” Postgres Databases
• Baby Replicator
History of Search 2008
• 2 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 4 Solr Slaves + 2 Masters
• Baby Replicator + DIH for Reindexing
History of Search 2009
• 4 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 6 Solr Slaves + 2 Masters
• Webs > ActiveMQ > Solr
History of Search 2010
• 7 Million Listings
• A Single “Master” Postgres Database
• PHP > Thrift > Solr
• 10 Solr Slaves + 1 Master
• Custom Import Handler
History of Search 2011
• 10 Million Listings
• “Master” Postgres Database + DB SHARDS!
• PHP > Thrift > Solr
• 24 Solr Slaves + 1 Master
• Custom Import Handler
Future of Search 2012
• ?? Million Listings
• MORE DB SHARDS!
• PHP > Thrift > Solr
• ?? Solr Slaves + 1 Master
• HBase + Hadoop Indexers
What Did We Learn?
Lucene + Solr > TSearch
http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/
Love Lucene + Solr Trunk!
Run, Don’t Walk...
Fork it: https://github.com/etsy/deployinator
Deployinator
Smoker
StatsD, Graph Everything!
Fork it: https://github.com/etsy/statsd
95th Percentile
start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
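As an illustration of timing each of these stages, the sketch below formats a StatsD timer datagram (the plain-text form "bucket:value|ms", normally sent over UDP) and computes a 95th percentile with the nearest-rank definition. The class name and the bucket name used in the usage note are hypothetical, and the percentile here is a generic definition that may differ in detail from what a StatsD server reports.

```java
import java.util.Arrays;

final class TimingSketch {
    // Formats one StatsD timer datagram: "<bucket>:<millis>|ms".
    static String timerLine(String bucket, long millis) {
        return bucket + ":" + millis + "|ms";
    }

    // Nearest-rank percentile: the smallest sample with at least p percent
    // of all samples at or below it. Integer math avoids rounding surprises.
    static long percentile(long[] samples, int p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) (((long) p * sorted.length + 99) / 100);   // ceil(p * n / 100)
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For example, TimingSketch.timerLine("search.perform_search", 42) yields the line a client would send for one 42 ms perform_search call.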
Solr Top Level Cache > Memcached
etsy-index.properties
$ cat /search/data/person/index/etsy-index.properties
#Tue Mar 27 13:05:51 EDT 2012
max_update_time=2012-03-27T17\:05\:51.955Z
Check Index Size
Don’t Install if < 50% Current Size
Check if Index is Too Old
Don’t Update if > 10 Days Old
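The two safety checks can be sketched as below, assuming the age is taken from max_update_time in etsy-index.properties and that "size" means bytes on disk (the slides do not say which size is measured). The class and method names are hypothetical; only the property name and the 50% and 10-day thresholds come from the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.time.Instant;
import java.util.Properties;

final class IndexChecksSketch {
    // Reads max_update_time from the text of an etsy-index.properties file.
    static Instant maxUpdateTime(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));   // unescapes "\:" to ":"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Instant.parse(props.getProperty("max_update_time"));
    }

    // "Don't install if < 50% current size" (size assumed to be bytes).
    static boolean safeToInstall(long newIndexBytes, long currentIndexBytes) {
        return newIndexBytes * 2 >= currentIndexBytes;
    }

    // "Don't update if > 10 days old".
    static boolean safeToUpdate(Instant maxUpdateTime, Instant now) {
        return Duration.between(maxUpdateTime, now).compareTo(Duration.ofDays(10)) <= 0;
    }
}
```

Both checks fail closed: a suspiciously small new index is never installed, and an index that has fallen too far behind is not patched incrementally.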
What Did We Learn?
Store Nothing
Keep Denormalized Data
Search Database
DB Shard
DB Shard
DB Shard
PHP Denormalizer
JSON
Full Reindex
Apply Incremental
Install
Full Reindex
Apply Incremental
Install
Apply Incremental
Indexer
Database
HBase + Hadoop Indexing
HBase + Hadoop Indexing
Why HBase?
HBase + Hadoop Indexing
HBase
DB Shard
DB Shard
DB Shard
PHP Denormalizer
JSON
{NAME => 'listings_denormalized', FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}
listings_denormalized
HBase + Hadoop Indexing
{NAME => 'listings_denormalized_modified_index', FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}
listings_denormalized_modified_index
HBase + Hadoop Indexing
HBase + Hadoop Indexing
SOLR-1301
https://issues.apache.org/jira/browse/SOLR-1301
HDFS
Disk
SolrOutput Format
• Solr Document Converter
• Solr Requires Posix Disk
• Index Copied Back to HDFS
HBase + Hadoop Indexing
• Not Great with Multi-Core Configs
• Added Solr Multi-Core Support
• Solr Config Issues
• Added ENV support for Configs
• Uses “new” style Hadoop API
• Added Support for both Old and New
HBase + Hadoop Indexing
SolrInputDocumentWritable
HBase + Hadoop Indexing
public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {
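The slide shows only the class declaration. To illustrate what implementing Writable involves, here is a dependency-free sketch: a stand-in document type with the same two method signatures that org.apache.hadoop.io.Writable declares, write(DataOutput) and readFields(DataInput). The class name, the flat string-to-string field map, and the roundTrip helper are assumptions for illustration. This is not Etsy's implementation, and a real SolrInputDocument holds richer values (multi-valued fields, boosts) that would need their own encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for a Solr document: ordered field name -> single string value.
final class DocumentWritableSketch {
    final Map<String, String> fields = new LinkedHashMap<>();

    // Same signature as Writable.write: field count, then each name/value pair.
    // writeUTF limits each string to 65535 encoded bytes.
    public void write(DataOutput out) throws IOException {
        out.writeInt(fields.size());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Same signature as Writable.readFields. Hadoop reuses instances between
    // records, so old state is cleared before reading.
    public void readFields(DataInput in) throws IOException {
        fields.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            fields.put(name, in.readUTF());
        }
    }

    // Serializes a document to bytes and reads it back into a new instance.
    static DocumentWritableSketch roundTrip(DocumentWritableSketch doc) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            doc.write(new DataOutputStream(bytes));
            DocumentWritableSketch copy = new DocumentWritableSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Making the document itself serializable this way lets mappers emit documents directly to reducers, which presumably is why the slide extends SolrInputDocument rather than wrapping it.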
HBase + Hadoop Indexing
Oozie
HBase + Hadoop Indexing
Oozie + HBase?
ScanStringGenerator
HBase + Hadoop Indexing
http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/
Hadoop Indexer (diagram labels): Start, Oozie, Map, HBase, Reduce, HDFS, Solr Output, Disk, Copy, Merge, Install
HBase + Hadoop Indexing
IndexerActionMain
HBase + Hadoop Indexing
HBase + Hadoop Indexing
Deployinator
IndexCompare
HBase + Hadoop Indexing
$ ./compare
ERROR: please provide two index directories
example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
options:
  -p --percent= percent of the index to check
  -i --id= primary key id field in the index
  -h --hash= comparison or hash field in the index
  <index> <index>
HBase + Hadoop Indexing
$ ./compare \
/search/data/person/index-1332867952588/ \
/search/data/person/index-1335378487672
id field: user_id
hash field: hash
percentage: 0.0010
files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
/search/data/person/index-1332867952588 contains 1515512 docs
/search/data/person/index-1335378487672 contains 14837972 docs
1516 of 1516 documents are the same
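The idea behind the comparison can be sketched as follows: sample a fraction of the documents in one index and check that each sampled ID carries the same hash value in the other. The sample of 1516 documents in the output above matches 0.0010 × 1515512 rounded up, so the sketch rounds up as well. Everything else is an assumption: the maps of id to hash stand in for real Lucene indexes, the class and method names are hypothetical, and how the actual tool chooses its sample is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class IndexCompareSketch {
    // Number of documents to check: a fraction of the first index, rounded up.
    static int sampleSize(int docCount, double fraction) {
        return (int) Math.min(docCount, Math.ceil(docCount * fraction));
    }

    // Samples IDs from the old index and counts how many have an identical
    // hash in the new index. Hash values are assumed to be non-null.
    static int countSame(Map<String, String> oldIndex, Map<String, String> newIndex,
                         double fraction, long seed) {
        List<String> ids = new ArrayList<>(oldIndex.keySet());
        Collections.sort(ids);                        // fixed starting order for any map type
        Collections.shuffle(ids, new Random(seed));   // seeded so a run can be repeated
        int same = 0;
        for (String id : ids.subList(0, sampleSize(ids.size(), fraction))) {
            if (oldIndex.get(id).equals(newIndex.get(id))) {
                same++;
            }
        }
        return same;
    }
}
```

Comparing a stored hash field instead of every field keeps the check cheap enough to run on each newly built index before it is installed.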
HBase + Hadoop Indexing
Copy and Merge
HBase + Hadoop Indexing
Open Source
HBase + Hadoop Indexing
Replication
Slaves
+n slaves
Master
Replication
BitTorrent Replication
Bit Torrent
Using BitTornado:
Replication
Bit Torrent + Solr
Fork of TTorrent: https://github.com/etsy/ttorrent
Multi-File Support
Large File Support
Fork BitTorrent: Coming Soon
Replication
Need a job?
Thanks!