scaling solrcloud to a large number of collections: presented by shalin shekhar mangar, lucidworks

27

Upload: lucidworks

Post on 07-Jul-2015

547 views

Category:

Software


0 download

DESCRIPTION

Presented at Lucene/Solr Revolution 2014

TRANSCRIPT

Page 1: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks
Page 2: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Scaling SolrCloud to a Large Number of CollectionsShalin Shekhar Mangar Lucidworks Inc.

Page 3: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Apache Solr has tremendous momentum

8M+ total downloads

Solr is both established & growing

250,000+monthly downloads

Largest community of developers.

2500+open Solr jobs.

The most widely used search solution on the planet.

Lucene/Solr Revolutionworld’s largest open sourceuser conference dedicated

to Lucene/Solr.

Solr has tens of thousands of applications in production.

You use Solr everyday.

Page 4: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Solr Scalability is unmatched

Page 5: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

The traditional search use-case

Example: eCommerce Product Catalogue

• One large index distributed across multiple nodes • A large number of users searching on the same data • Searches happen across the entire cluster

Page 6: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

“The limits of the possible can only be defined by going beyond them into the impossible.”

—Arthur C. Clarke

Page 7: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Analyze, measure and optimize

•Analyze and find missing features •Setup a performance testing environment on AWS •Devise tests for stability and performance •Find bugs and bottlenecks and fixes

Page 8: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Problem #1: Cluster state and updates

•The SolrCloud cluster state has information about all collections, their shards and replicas

•All nodes and (Java) clients watch the cluster state •Every state change is notified to all nodes •Limited to (slightly less than) 1MB by default •1 node restart triggers a few 100 watcher fires and pulls

from ZK for a 100 node cluster (three states: down, recovering and active)

Page 9: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Solution: Split cluster state and scale

• Each collection gets it’s own state node in ZK • Nodes selectively watch only those states

which they are a member of • Clients cache state and use smart cache

updates instead of watching nodes • http://issues.apache.org/jira/browse/

SOLR-5473

Page 10: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Problem #2: Overseer Performance

• Thousands of collections create a lot of state updates

• Overseer falls behind and replicas can’t recover or can’t elect a leader

• Under high indexing/search load, GC pauses can cause overseer queue to back up

Page 11: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Solution - Improve the overseer

• Optimize polling for new items in overseer queue (SOLR-5436)

• Dedicated overseers nodes (SOLR-5476) • New Overseer Status API (SOLR-5749) • Asynchronous execution of collection

commands (SOLR-5477, SOLR-5681)

Page 12: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Problem #3: Moving data around

• Not all users are born equal - A tenant may have a few very large users

• We wanted to be able to scale an individual user’s data — maybe even as it’s own collection

• SolrCloud can split shards with no downtime but it only splits in half

• No way to ‘extract’ user’s data to another collection or shard

Page 13: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Solution: Improved data management

• Shard can be split on arbitrary hash ranges (SOLR-5300)

• Shard can be split by a given key (SOLR-5338, SOLR-5353)

• A new ‘migrate’ API to move a user’s data to another (new) collection without downtime (SOLR-5308)

Page 14: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Problem #4: Exporting data

• Lucene/Solr is designed for finding top-N search results

• Trying to export full result set brings down the system due to high memory requirements as you go deeper

Page 15: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Solution: Distributed deep paging

Page 16: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Testing scale at scale

• Performance goals: 6 billion documents, 4000 queries/sec, 400 updates/sec, 2 seconds NRT sustained performance

• 5% large collections (50 shards), 15% medium (10 shards), 85% small (1 shard) with replication factor of 3

• Target hardware: 24 CPUs, 126G RAM, 7 SSDs (460G) + 1 HDD (200G)

• 80% traffic served by 20% of the tenants

Page 17: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Test Infrastructure

Page 18: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Logging

Page 19: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks
Page 20: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

How to manage large clusters?

• Tim Potter wrote the Solr Scale Toolkit • Fabric based tool to setup and manage

SolrCloud clusters in AWS complete with collectd and SiLK

• Backup/Restore from S3. Parallel clone commands.

• Open source! • https://github.com/LucidWorks/solr-scale-tk

Page 21: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Gathering metrics and analyzing logs

• Lucidworks SiLK (Solr + Logstash + Kibana) • collectd daemons on each host • rabbitmq to queue messages before delivering to log stash • Initially started with Kafka but discarded thinking it is overkill • Not happy with rabbitmq — crashes/unstable • Might try Kafka again soon • http://www.lucidworks.com/lucidworks-silk

Page 22: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Generating data and load

• Custom randomized data generator (re-producible using a seed)

• JMeter for generating load • Embedded CloudSolrServer (Solr Java client)

using JMeter Java Action Sampler • JMeter distributed mode was itself a bottleneck! • Not open source (yet) but we’re working on it!

Page 23: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Numbers

•30 hosts, 120 nodes, 1000 collections, 6B+ docs, 15000 queries/second, 2000 writes/second, 2 second NRT sustained over 24-hours

•More than 3x the numbers we needed •Unfortunately, we had to stop testing at that

point :( •Our biggest cluster cost us just $120/hour :)

Page 24: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Not over yet

• We continue to test performance at scale • Published indexing performance benchmark, working on others

• 15 nodes, 30 shards, 1 replica, 157195 docs/sec • 15 nodes, 30 shards, 2 replicas, 61062 docs/sec • http://searchhub.org/introducing-the-solr-scale-toolkit/

• Setting up an internal performance testing environment • Jenkins CI • Jepsen tests • Single node benchmarks • Cloud tests

• Stay tuned!

Page 25: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Pushing the limits

Page 26: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Not over yet

• SolrCloud continues to be improved • SOLR-6220 - Replica placement strategy • SOLR-6273 - Cross data center replication • SOLR-5656 - Auto-add replicas on HDFS • SOLR-5986 - Don’t allow runaway queries to harm the cluster • SOLR-5750 - Backup/Restore API for SolrCloud • Many, many more

Page 27: Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekhar Mangar, Lucidworks

Thank [email protected] @shalinmangar