Solr Distributed Indexing in WalmartLabs: presented by Shenghua Wan, WalmartLabs


TRANSCRIPT

October 13-16, 2016 • Austin, TX

Shenghua Wan

Sr. Software Engineer, @WalmartLabs
swan@walmartlabs.com

Solr Distributed Indexing in WalmartLabs

Background

•  Search big data, part of the Polaris Search Team at WalmartLabs
•  Audience management, Acxiom Inc.
•  HPC computational scientist, UTSW Medical Center

Our perspective:

•  To help make Solr indexing more scalable
•  From a big data engineer's perspective
•  Solr/Lucene internals are not covered in this talk

Problem definition

•  Input: 96 gzipped XML files
•  Output: 3 shards of binary indexes, one for every 32 XML files (worked example below)
•  Dedicated indexing servers do not scale
•  Indexing time in the dev environment was at least 4 hours -> slows down development iteration
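A tiny worked example of that file-to-shard grouping (the loop and names are ours, for illustration only; the talk states only the 96-files-to-3-shards rule):

    public class ShardAssignmentSketch {
      public static void main(String[] args) {
        // 96 input files, one shard per 32 files, as described above
        for (int fileIndex = 0; fileIndex < 96; fileIndex++) {
          int shard = fileIndex / 32;  // shard 0, 1, or 2
          System.out.println("file " + fileIndex + " -> shard " + shard);
        }
      }
    }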

Existing “Wheels” for Solr Distributed Indexing

•  “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
•  LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
•  MapReduceIndexerTool (Mark Miller, since late 2013)

Existing “Wheels” for Solr Distributed Indexing

❑ “Indexing Files via Solr and Java MapReduce” (Adam Smieszny, since 2012)
❑ LuceneIndexOutputFormat (Twitter’s Elephant-Bird, since 2013)
✓ MapReduceIndexerTool (Mark Miller, since late 2013): this tool is the closest to our use case.

Start from MapReduceIndexerTool

Anatomy of this tool (a mapper-side sketch follows below):

•  MorphlineMapper: uses Morphlines to convert a document into a SolrInputDocument
•  SolrRecordWriter: creates an embedded Solr instance to index the document
•  TreeMergeRecordWriter: merges multiple binary indexes into one

References:
1.  https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
2.  https://github.com/markrmiller/solr-map-reduce-example
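To make the anatomy concrete, here is a minimal sketch, not the tool's actual code, of what the mapper-side path boils down to: build a SolrInputDocument and hand it to an embedded Solr core (Solr 4.x-era SolrJ API; the Solr home, core name, and field names are placeholders):

    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedIndexSketch {
      public static void main(String[] args) throws Exception {
        // "solr-home" and "core1" are placeholders for a real Solr home and core
        CoreContainer container = new CoreContainer("solr-home");
        container.load();
        EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "core1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "sku-12345");         // hypothetical fields
        doc.addField("title", "example title");
        solr.add(doc);    // index into the local embedded core, no HTTP involved
        solr.commit();
        solr.shutdown();
      }
    }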

Our Challenges

•  Not using SolrCloud
•  Not using ZooKeeper
•  Solr version 4.0 (at the time of our experiments)
•  Environment
   •  Hadoop version 1
   •  MapR File System
   •  XML input format
•  Easy to maintain and debug
•  Documentation: a runnable example with source code is the best. Thanks to https://github.com/markrmiller/solr-map-reduce-example.

Customize Design to Our Use Case

Breaking it down into two fundamental utilities:

•  Index Generator: replace Morphlines with XmlInputFormat from Apache Mahout and reuse SolrOutputFormat (job-wiring sketch below)
•  Index Merger: reuse TreeMergeOutputFormat

References:
1.  https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
2.  https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
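A sketch of the Index Generator's job wiring under these choices. The xmlinput.start/xmlinput.end keys are how Mahout's XmlInputFormat is configured; the tag values, driver class, and mapper class here are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.mahout.text.wikipedia.XmlInputFormat;

    public class IndexGeneratorDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // tell XmlInputFormat how to carve one record per document from the stream
        conf.set("xmlinput.start", "<product>");  // hypothetical document tag
        conf.set("xmlinput.end", "</product>");

        Job job = new Job(conf, "index-generator");
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(IndexGeneratorMapper.class);  // hypothetical mapper
        job.waitForCompletion(true);
      }
    }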

Customize Design to Our Use Case – cont.

Breaking it down into two fundamental utilities:

•  Index Generator
•  Index Merger

More complicated logic can be built on top of these two simple map-only jobs. Where is the reduce phase? Our use case does not need one; we want it lean and fast. But you may need it (see the one-liner below).
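In Hadoop, making a job map-only is a single driver setting; a minimal fragment:

    // in the driver: disable the reduce phase; mappers write the final output
    job.setNumReduceTasks(0);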

Experiments and Observations

•  Index Generation
   ✓ CPU-bound
   ✓ Can easily scale and run in parallel
   ✓ Map-only wins by 12~15% over Map-Reduce in our experiments
   ✓ ~5 GB of decompressed XML documents indexed within 10 minutes using 7x3 mappers

Experiments and Observations – cont.

•  Index Merging
   ✓ IO-bound: disk and network, but network was our pain point
   ✓ Two stages: logical merge and optimize (sketch below)
      ○ Logical merge: file movement
      ○ Optimize: reduce the number of index segments
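In Lucene terms, the two stages map onto two IndexWriter calls; a minimal sketch using the Lucene 4.0-era API, with placeholder index paths:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeSketch {
      public static void main(String[] args) throws Exception {
        // destination index; all paths are placeholders
        Directory merged = FSDirectory.open(new File("/index/merged"));
        IndexWriterConfig cfg = new IndexWriterConfig(
            Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(merged, cfg);

        // stage 1, logical merge: pulls the shards' segment files together
        writer.addIndexes(
            FSDirectory.open(new File("/index/shard1")),
            FSDirectory.open(new File("/index/shard2")));

        // stage 2, optimize: rewrites everything into one segment (IO-heavy)
        writer.forceMerge(1);
        writer.close();
      }
    }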

Experiments and Observations – cont.

n-Way Merge: merging n shards of roughly the same size into 1

Nothing suspicious so far

Experiments and Observations – cont.

n-Way Merge: merging n shards of roughly the same size into 1

Why does the merge time shoot up suddenly?
•  Too many shards
•  Resource contention

Experiments and Observations – cont.

n-Way Merge: merging n shards of roughly the same size into 1

Optimize time >> logical merge time, by 5x ~ 8x (the 64-way case is an exception, considered an outlier because of the shared environment)

New Challenges

After contacting the cluster owner team, we were told that the cluster, which consists of almost five dozen nodes, is connected by 1 Gb/s Ethernet.

Experiments and Observations – cont.

How about a “tree”-structured merge?

Seems attractive

Experiments and Observations – cont.

Comparing hierarchical merge and n-way merge total time

Kind of unexpected

Experiments and Observations – cont.

Comparing hierarchical merge and n-way merge total time

Relatively isolated environment: no network involved, only disk IO (4 cores x 2 threads). For four shards, a hierarchical merge costs 4 small reads plus 2 large reads of the intermediate indexes it has already written, while a single 4-way merge costs only the 4 small reads. The extra rereads are why hierarchical merging loses.

Lessons Learnt

•  Index generation in parallel is easy
•  Merging is not
•  N-way merging all shards at once is better
•  Data locality is key

Our Solutions

•  Plan A: “Hey, Sir/Madam, could you please get us a 48 Gb/s InfiniBand network ASAP? Or 10 Gb/s is also fine.”
•  Plan B: a small dedicated indexing Hadoop cluster (starting from one node)

Our Solutions

A small dedicated indexing Hadoop cluster (starting from one node)

Environment       Disk IO (MB/s)
Shared            ~44
Mac Pro (SSD)     ~250
Dedicated         ~202

Dedicated cluster:
•  1 node
•  32 cores
•  128 GB memory

Tips

Tunable parameters (a sketch of where each knob lives follows below):
•  Split size (Map-Reduce)
•  Batch size (Solr index)
•  RAM buffer size (Solr index)
•  Max number of segments (Solr index)
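A hedged sketch of where each knob lives, using Hadoop 1 / Lucene 4-era names; the values are illustrative only, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;

    public class TuningSketch {
      // illustrative values only; tune for your own data and hardware
      static void applyKnobs(Configuration conf, IndexWriterConfig iwc,
                             IndexWriter writer) throws Exception {
        // Map-Reduce: cap split size to control mapper parallelism (Hadoop 1 key)
        conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB

        // Solr index: a larger RAM buffer means fewer flushes while indexing
        iwc.setRAMBufferSizeMB(256);

        // Solr index: cap the number of segments left after the optimize stage
        writer.forceMerge(4);
      }
      // Batch size is application-level: buffer N SolrInputDocuments and call
      // solr.add(batch) once per batch instead of once per document.
    }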

Opportunities

There are some parts missing from our tool which our use case allowed us to skip, but you may want to have them:
1.  Reduce functions (deduplication, other processing logic); a reducer sketch follows below
2.  Try Spark or an equivalent (the bottleneck is the embedded Solr instance when merging)
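If you do add a reduce phase, deduplication is a natural fit for it; a minimal sketch keyed on document id (the class and the key choice are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // hypothetical dedup reducer: keep exactly one record per document id
    public class DedupReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text docId, Iterable<Text> records, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(docId, records.iterator().next());  // drop the duplicates
      }
    }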

Thanks! We are hiring!

Questions?
