rebalance api for solrcloud: presented by nitin sharma, netflix & suruchi shah, bloomreach

O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A

Rebalance API for SolrCloud

Nitin Sharma - Senior Software Engineer, Netflix Suruchi Shah - Software Engineer, Bloomreach

Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

➢ Allocation Strategies

➢ Open Source ➢ Summary

Motivation ●  Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter

search architecture based on solrcloud

●  Scaling issues related to Query Performance:

■  Inability to auto scale Solr serving with increasing index sizes

■  Dynamic Shard Setup based on index size

●  SLA issues due to Availability:

■  Nightmare to manipulate cluster/collection setup with expanding clusters and frequent node replacements (AWS issues)

■  Flexible Replica Allocation based on custom strategies to guarantee 99.995% availability

●  Data Freshness (aka Indexing Performance) hiccups:

■  Break the tight coupling between Indexing and Serving latency (Tp 95) SLAs

●  Generic framework for Solr SLA management that can be open-sourced

Rebalance API ●  Fine-grained SLA management in Solr

●  Smarter Index, Cluster & Data Management for SolrCloud

●  Forms the basis for Solr Auto Scale

●  Admin handler in Solr

●  2 levels of abstraction

●  Scaling Strategies

○  Aid with shard, replica and cluster manipulation techniques to guarantee SLA

○  Zero Downtime operations

○  Tunable for Availability, Performance or Cost

●  Allocation Strategies

○  Decides Core Placement

Rebalance API

Scaling Strategies Allocation Strategies

Auto Shard

Redistribute

Smart Merge

Replace

Scale Up

Scale Down

Least Used

Unused

AZ Aware

Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

01 Query Performance Issues

●  Indexing doubles documents - Per shard latency goes up

●  Tp 95 shoots up ●  No way to change shards

dynamically ●  Delete, Recreate, Re-index ●  Availability goes down

Node 1

Zk Ensemble

Node 2

Solr Collection A

Indexing 2x documents

Shard 1 Shard 2

Query Tp95

Shard 3 Shard 4

03 Rebalance Auto Shard

●  Re-sharding existing collection to any number of destination shards. (e.g can help with reducing latency)

●  Includes re-distributing the index and configs consistently.

●  Zero downtime - No query failures

●  Avoiding any heavy re-indexing processes.

●  Can be size based as well

●  Sample API Call:

/admin/collections?action=REBALANCE &scaling_strategy=AUTO_SHARD &collection=collection_name&num_shards=number of shards &size_cap=1G &allocation_strategy=least_used

Solr Collection A

Node 1

Zk Ensemble

Node 2

Shard 1 Shard 2

Rebalance Auto Shard - Overview

Shard 3 Shard 4

01 Auto Shard (1) - Internals - Simple Strategy

●  Merge documents from all shards - Lucene library based

●  Split the merged shard into desired

number ●  Auto Zk update for shard ranges ●  Heavyweight but works ●  Based on size, could take in 20-30 of

minutes to complete

Solr Collection A

Shard 1 Shard 2

Merged Shard (Temp)

Shard 1 Shard 2 Shard 3 Shard 4

Solr Collection A’

Even Split

01 Auto Shard (2) - Internals - Smart Split Strategy

●  Identify minimum number of splits ●  Split shards in parallel to required desired

setup ●  Relatively high performance ○  2 Tb index from 2 to 4 shards - 2.5 mins

●  Auto Zk Update

Solr Collection A

Shard 1 Shard 2

Shard 1_1

Shard 1_2

Shard 2_1

Shard 2_2

Solr Collection A

Smart Split Renamed Shards

01 Solution : Auto Sharding

●  Dynamically Increase/Decrease shards

●  E.g. Increase shards from 2 to 4 to reduce latency

●  E.g. Tp 95 reduced from > 1 sec to 250

Solr Collection A

Node 1

Zk Ensemble

Node 2

Shard 1 Shard 2

Shard 3 Shard 4

03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance ○  Data Consistency

01 Data distribution issues

●  Adding a new node - Does nothing ●  No Redistribution of solr cores ●  Machines running out of disk space -

heavier collections need to be moved out ●  Problem amplifies at large scale - 100s of

nodes, 1000s of collections ●  Manual Management of core placement

becomes an issue

Default Solr Behavior

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D2

Core D1

Core A1

Core B2

Core C1

Core C2

Node 4

01 Re-distribute Strategy - Internals

●  Internal topology construction from ZK

●  Desired Core placement computation - External

or Trigger based

●  Migration of cores within the cluster

●  Knobs to control min/max to reduce cluster

●  Zero downtime - No query failures and

resiliency to node failures

●  API Call: /admin/collections?

action=REBALANCE&scaling_strategy=REDIST

RIBUTE

Compute & Redistribute

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

01 Solution: Auto Redistribution of Data

Redistribute

●  Adding new node -

triggers redistribution

●  Respects the core

placement allocation

strategy

●  Zero downtime

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D2

Core D1

Core A1

Core B2

Core C1

Core C2

Node 4

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D2

Core D1

Core A1

Core B2

Core C1

Core C2

Node 4

03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

01 Replace Solr Nodes Default Solr Behavior

● A node might die, need to be replaced,

decommissioned

● Default behavior - Do nothing

● Can cause downtime - Heavy cores on

the nodes

● Problem exacerbated with 1000s of

nodes/collections

Node 1

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

01 Replace Nodes with Rebalance

● Read the Topology of cluster

● Migrate replicas from node about to die to

new node

● API Call:

○  /admin/collections?

action=REBALANCE&scaling_strategy=REPLACE

&collection=collectionName&source_node=so

urce_host &dest_node=dest_host

Node 1

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

Node 4

Core B1

Core A1

Replaced Node

03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

01 Indexing Performance ●  Higher the shards, faster the indexing

(parallelism) ●  Faster indexing - Data Freshness SLA ●  Solr - # shards is the same for indexing

vs serving ●  Shard setup - tweaked for serving

query performance ●  E.g. ○  Indexing 100M docs in 2 shards - 2

hours ○  Serving 100M docs in 2 shards - Tp

95 < 100 ms

Indexing

500M Documents

Performance Hit

Serving Queries

01 Indexing Performance - Smart Merge

●  Separate Indexing shard setup vs serving

● More shards for indexing - Merged into lesser shards for serving

● Post Indexing issue API to merge into serving collection

● API Call: ○  /admin/collections?

action=Rebalance&scaling_strategy=SMART_MERGE_DISTRIBUTED&collection=collectionName&num_shards=numRequiredShards

● Parallel Merge

01 Indexing Performance - Smart Merge

●  Index vs Serving has different collections

●  Indexed Collection

merged into Serving - Using smart merge call

●  Indexing can be tuned

independently for performance

● Serving SLA unaffected

Indexing

500M Documents

Serving Queries

Shard 1

Shard 4

Shard 5

Shard 8

Shard 9

Shard 12

Shard 13

Shard 16… … … …

Collection A_Indexing

Collection A

Parallel merge Parallel merge Parallel merge Parallel merge

Rebalance API ●  Fine-grained SLA management in Solr

●  Smarter Index, Cluster & Data Management for SolrCloud

●  Admin handler in Solr

●  2 levels of abstraction

●  Scaling Strategies

○  Aid with shard, replica and cluster manipulation techniques to guarantee SLA

○  Zero Downtime operations

●  Allocation Strategies

○  Decides Core Placement

Rebalance API

Scaling Strategies Allocation Strategies

Auto Shard

Redistribute

Smart Merge

Replace

Scale Up

Scale Down

Least Used

Unused

AZ Aware

01 Allocation Strategies

● Abstracts out the core placement methodology

●  Least Used Strategy - Pick the node that has the least amount of cores

● AZ aware Strategy - Pick the node that is in a different availability zone than the other

cores for a given collection

● Unused Strategy - Pick the node that does not have any cores for a given collection

● All of them are compatible with all scaling strategies

01 Open Source

●  Fully open sourced - SOLR-9241

(4.6.1).

● Contributed patch works on top of

4.6.1 and tested up to 4.10

●  SOLR-9241 (epic) - patches/features

on master. Has sub patches

○  SOLR-93{16-21}

○  SOLR-9407

● Actively working with community to

get the rest of the API 6+ compatible.

01 Summary ●  Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter search

architecture based on solrcloud

●  Rebalance API

○  Scaling Strategies - How to scale?

○  Allocation Strategies - Where to place cores?

●  Zero Downtime operations for

○  Dynamically changing shard setup

○  Decoupling indexing SLA from Serving

○  Replacing Nodes

○  Auto -Redistributing data with cluster expansion

●  Open Source

01 Speakers

Nitin Sharma https://www.linkedin.com/in/knitinsharma nsarma1985@gmail.com

Suruchi Shah https://www.linkedin.com/in/suruchishah suruchi.shah13@gmail.com

rebalance api for solrcloud: presented by nitin sharma, netflix & suruchi shah, bloomreach

Technology

wewe - bloomreach...

administering and monitoring solrcloud clusters

managing a solrcloud cluster using apis

scaling search with solrcloud

gids2014: solrcloud: searching big data

presented bybabita shahi ritu choudhary suruchi singh

bloomreach...

smc web content management guide bloomreach

scaling solr with solrcloud

how solrcloud changes the user experience in a sharded...

bloomreach aws tech

suruchi creations

an integrative approach for addressing depression dr....

solr exchange: introduction to solrcloud

neo4dogs - a data quality platform approach with solrcloud...

nosocomial pneumonia dr suruchi

the state of commerce experience - bloomreach

user guide: bloomreach discovery

apache solrcloud installation and configuration · apache...

whitepaper the bloomreach web relevance...