rebalance api for solrcloud: presented by nitin sharma, netflix & suruchi shah, bloomreach

Post on 07-Jan-2017

100 Views

Category:

Technology

3 Downloads

Preview:

Click to see full reader

TRANSCRIPT

O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A

Rebalance API for SolrCloud

Nitin Sharma - Senior Software Engineer, Netflix Suruchi Shah - Software Engineer, Bloomreach

3

Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

➢ Allocation Strategies

➢ Open Source ➢ Summary

4

Motivation ●  Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter

search architecture based on solrcloud

●  Scaling issues related to Query Performance:

■  Inability to auto scale Solr serving with increasing index sizes

■  Dynamic Shard Setup based on index size

●  SLA issues due to Availability:

■  Nightmare to manipulate cluster/collection setup with expanding clusters and frequent node replacements (AWS issues)

■  Flexible Replica Allocation based on custom strategies to guarantee 99.995% availability

●  Data Freshness (aka Indexing Performance) hiccups:

■  Break the tight coupling between Indexing and Serving latency (Tp 95) SLAs

●  Generic framework for Solr SLA management that can be open-sourced

5

Rebalance API ●  Fine-grained SLA management in Solr

●  Smarter Index, Cluster & Data Management for SolrCloud

●  Forms the basis for Solr Auto Scale

●  Admin handler in Solr

●  2 levels of abstraction

●  Scaling Strategies

○  Aid with shard, replica and cluster manipulation techniques to guarantee SLA

○  Zero Downtime operations

○  Tunable for Availability, Performance or Cost

●  Allocation Strategies

○  Decides Core Placement

○  Tunable for Availability, Performance or Cost

Rebalance API

Scaling Strategies Allocation Strategies

Auto Shard

Redistribute

Smart Merge

Replace

Scale Up

Scale Down

Least Used

Unused

AZ Aware

6

Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

➢ Allocation Strategies

➢ Open Source ➢ Summary

7

01 Query Performance Issues

●  Indexing doubles documents - Per shard latency goes up

●  Tp 95 shoots up ●  No way to change shards

dynamically ●  Delete, Recreate, Re-index ●  Availability goes down

Node 1

Zk Ensemble

Node 2

Solr Collection A

Indexing 2x documents

Shard 1 Shard 2

Query Tp95

Shard 3 Shard 4

8

03 Rebalance Auto Shard

●  Re-sharding existing collection to any number of destination shards. (e.g can help with reducing latency)

●  Includes re-distributing the index and configs consistently.

●  Zero downtime - No query failures

●  Avoiding any heavy re-indexing processes.

●  Can be size based as well

●  Sample API Call:

/admin/collections?action=REBALANCE &scaling_strategy=AUTO_SHARD &collection=collection_name&num_shards=number of shards &size_cap=1G &allocation_strategy=least_used

9

Solr Collection A

Node 1

Zk Ensemble

Node 2

Shard 1 Shard 2

Rebalance Auto Shard - Overview

Shard 3 Shard 4

10

01 Auto Shard (1) - Internals - Simple Strategy

●  Merge documents from all shards - Lucene library based

●  Split the merged shard into desired

number ●  Auto Zk update for shard ranges ●  Heavyweight but works ●  Based on size, could take in 20-30 of

minutes to complete

Solr Collection A

Shard 1 Shard 2

Merged Shard (Temp)

Shard 1 Shard 2 Shard 3 Shard 4

Solr Collection A’

merge

Even Split

11

01 Auto Shard (2) - Internals - Smart Split Strategy

●  Identify minimum number of splits ●  Split shards in parallel to required desired

setup ●  Relatively high performance ○  2 Tb index from 2 to 4 shards - 2.5 mins

●  Auto Zk Update

Solr Collection A

Shard 1 Shard 2

Shard 1_1

Shard 1_2

Shard 2_1

Shard 2_2

Solr Collection A

Shard 1 Shard 2 Shard 3 Shard 4

Solr Collection A

Smart Split Renamed Shards

12

01 Solution : Auto Sharding

●  Dynamically Increase/Decrease shards

●  E.g. Increase shards from 2 to 4 to reduce latency

●  E.g. Tp 95 reduced from > 1 sec to 250

ms.

Solr Collection A

Node 1

Zk Ensemble

Node 2

Shard 1 Shard 2

Shard 3 Shard 4

13

03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance ○  Data Consistency

➢ Allocation Strategies

➢ Open Source ➢ Summary

14

01 Data distribution issues

●  Adding a new node - Does nothing ●  No Redistribution of solr cores ●  Machines running out of disk space -

heavier collections need to be moved out ●  Problem amplifies at large scale - 100s of

nodes, 1000s of collections ●  Manual Management of core placement

becomes an issue

Default Solr Behavior

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D2

Core D1

Core A1

Core B2

Core C1

Core C2

Node 4

15

01 Re-distribute Strategy - Internals

●  Internal topology construction from ZK

●  Desired Core placement computation - External

or Trigger based

●  Migration of cores within the cluster

●  Knobs to control min/max to reduce cluster

load

●  Zero downtime - No query failures and

resiliency to node failures

●  API Call: /admin/collections?

action=REBALANCE&scaling_strategy=REDIST

RIBUTE

Compute & Redistribute

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

16

01 Solution: Auto Redistribution of Data

Redistribute

●  Adding new node -

triggers redistribution

●  Respects the core

placement allocation

strategy

●  Zero downtime

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D2

Core D1

Core A1

Core B2

Core C1

Core C2

Node 4

Node 1

Zk Ensemble

Node 2

Node 3

Core A2

Core B1

Core D2

Core D1

Core A1

Core B2

Core C1

Core C2

Node 4

17

03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

➢ Allocation Strategies

➢ Open Source ➢ Summary

18

01 Replace Solr Nodes Default Solr Behavior

● A node might die, need to be replaced,

decommissioned

● Default behavior - Do nothing

● Can cause downtime - Heavy cores on

the nodes

● Problem exacerbated with 1000s of

nodes/collections

Node 1

Zk

Ense

mbl

e

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

19

01 Replace Nodes with Rebalance

● Read the Topology of cluster

● Migrate replicas from node about to die to

new node

●  Zero downtime

● API Call:

○  /admin/collections?

action=REBALANCE&scaling_strategy=REPLACE

&collection=collectionName&source_node=so

urce_host &dest_node=dest_host

Node 1

Zk

Ense

mbl

e

Node 2

Node 3

Core A2

Core B1

Core D1

Core A1

Core B2

Core C1

Node 4

Core B1

Core A1

Replaced Node

20

03 Agenda ➢ Motivation ➢ Introduction to Rebalance API ➢ Scaling Scenarios ○  Query Performance ○  Redistribution of Data ○  Removing/Replacing Nodes ○  Indexing Performance

➢ Allocation Strategies

➢ Open Source ➢ Summary

21

01 Indexing Performance ●  Higher the shards, faster the indexing

(parallelism) ●  Faster indexing - Data Freshness SLA ●  Solr - # shards is the same for indexing

vs serving ●  Shard setup - tweaked for serving

query performance ●  E.g. ○  Indexing 100M docs in 2 shards - 2

hours ○  Serving 100M docs in 2 shards - Tp

95 < 100 ms

Shard 1 Shard 2 Shard 3 Shard 4

Indexing

500M Documents

Performance Hit

Serving Queries

22

01 Indexing Performance - Smart Merge

●  Separate Indexing shard setup vs serving

● More shards for indexing - Merged into lesser shards for serving

● Post Indexing issue API to merge into serving collection

● API Call: ○  /admin/collections?

action=Rebalance&scaling_strategy=SMART_MERGE_DISTRIBUTED&collection=collectionName&num_shards=numRequiredShards

● Parallel Merge

●  Zero downtime

23

01 Indexing Performance - Smart Merge

●  Index vs Serving has different collections

●  Indexed Collection

merged into Serving - Using smart merge call

●  Indexing can be tuned

independently for performance

● Serving SLA unaffected

Shard 1 Shard 2 Shard 3 Shard 4

Indexing

500M Documents

Serving Queries

Shard 1

Shard 4

Shard 5

Shard 8

Shard 9

Shard 12

Shard 13

Shard 16… … … …

Collection A_Indexing

Collection A

Parallel merge Parallel merge Parallel merge Parallel merge

24

Rebalance API ●  Fine-grained SLA management in Solr

●  Smarter Index, Cluster & Data Management for SolrCloud

●  Forms the basis for Solr Auto Scale

●  Admin handler in Solr

●  2 levels of abstraction

●  Scaling Strategies

○  Aid with shard, replica and cluster manipulation techniques to guarantee SLA

○  Zero Downtime operations

○  Tunable for Availability, Performance or Cost

●  Allocation Strategies

○  Decides Core Placement

○  Tunable for Availability, Performance or Cost

Rebalance API

Scaling Strategies Allocation Strategies

Auto Shard

Redistribute

Smart Merge

Replace

Scale Up

Scale Down

Least Used

Unused

AZ Aware

25

01 Allocation Strategies

● Abstracts out the core placement methodology

●  Least Used Strategy - Pick the node that has the least amount of cores

● AZ aware Strategy - Pick the node that is in a different availability zone than the other

cores for a given collection

● Unused Strategy - Pick the node that does not have any cores for a given collection

● All of them are compatible with all scaling strategies

26

01 Open Source

●  Fully open sourced - SOLR-9241

(4.6.1).

● Contributed patch works on top of

4.6.1 and tested up to 4.10

●  SOLR-9241 (epic) - patches/features

on master. Has sub patches

○  SOLR-93{16-21}

○  SOLR-9407

● Actively working with community to

get the rest of the API 6+ compatible.

27

01 Summary ●  Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter search

architecture based on solrcloud

●  Rebalance API

○  Scaling Strategies - How to scale?

○  Allocation Strategies - Where to place cores?

●  Forms the basis for Solr Auto Scale

●  Zero Downtime operations for

○  Dynamically changing shard setup

○  Decoupling indexing SLA from Serving

○  Replacing Nodes

○  Auto -Redistributing data with cluster expansion

●  Open Source

28

01 Speakers

Nitin Sharma https://www.linkedin.com/in/knitinsharma nsarma1985@gmail.com

Suruchi Shah https://www.linkedin.com/in/suruchishah suruchi.shah13@gmail.com

top related