Architecting Splunk for High Availability and Disaster Recovery


Copyright © 2013 Splunk Inc.

Dritan Bitincka, Professional Services

Architecting Splunk for High Availability and Disaster Recovery

#splunkconf


Legal Notices

During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Splunk, Splunk>, Splunk Storm, Listen to Your Data, SPL and The Engine for Machine Data are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners.

©2013 Splunk Inc. All rights reserved.


About Me

• Member of Professional Services
• Several multi-TB deployments
• 50+ projects/customers
• Third .conf


Agenda

• Disaster recovery
• High availability
  – Data collection
  – Searching
  – Indexing
• Top takeaways

Problems

• Recover in the event of a disaster
• Maintain an acceptable level of continuous service


Disaster Recovery (DR)


What is Disaster Recovery?

Set of processes necessary to ensure recovery of service after a disaster


Disaster Recovery Steps


1. Back up necessary data
   – Back up to a medium at least as resilient as the source
   – Local backup vs. remote

2. Restore
   – Ensure this works
   – A backup is worthless without a restore


Backup


1a. Configurations: $SPLUNK_HOME/etc/*

1b. Indexes: buckets (hot*, warm, cold, frozen)


Backup Configurations


$SPLUNK_HOME/etc/*

Splunk instance
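A minimal sketch of this configuration backup, assuming a POSIX shell with `tar` and `date` available. The function name `backup_splunk_etc` and the destination directory are hypothetical and not part of Splunk:

```shell
#!/bin/sh
# Archive a Splunk instance's configuration directory ($SPLUNK_HOME/etc).
# backup_splunk_etc and the archive naming scheme are illustrative only.
backup_splunk_etc() {
    splunk_home=$1   # e.g. /opt/splunk
    dest_dir=$2      # ideally on separate (remote) storage
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest_dir/splunk-etc-$stamp.tar.gz"
    mkdir -p "$dest_dir" || return 1
    # -C stores paths relative to $SPLUNK_HOME ("etc/..."), which
    # makes it easy to unpack into a different $SPLUNK_HOME later.
    tar -czf "$archive" -C "$splunk_home" etc || return 1
    # Print the archive path so callers can record or verify it.
    echo "$archive"
}
```

It could be called as `backup_splunk_etc "$SPLUNK_HOME" /mnt/backup/splunk`, where the backup mount point is again an assumption.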


Backup: Bucket Lifecycle


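[Diagram: bucket lifecycle]

Events -> hot bucket (Home path)
[Hot bucket is full] -> warm bucket (Home path)
[Too many warms] -> cold bucket (Cold path, cheaper storage)
[Out of space or bucket is old] -> frozen (Frozen path, or deleted)
[Explicit user action] -> thawed (Thawed path)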


Backup Indexes


Bucket type   State          Can back up?
Hot           Read + write   No*
Warm          Read only      Yes
Cold          Read only      Yes
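As a rough illustration of selecting only the buckets that are safe to copy, the sketch below lists warm and cold bucket directories and skips hot ones. It assumes the default on-disk layout, in which an index directory contains `db/` (hot and warm) and `colddb/` (cold), hot buckets are named `hot_v1_<id>`, and warm and cold buckets are named `db_<newest>_<oldest>_<id>`. The function name `list_backup_buckets` is hypothetical:

```shell
#!/bin/sh
# Print the warm and cold bucket directories of one index.
# Hot buckets (hot_v1_*) are still being written to and are skipped.
list_backup_buckets() {
    index_dir=$1   # e.g. $SPLUNK_HOME/var/lib/splunk/defaultdb
    for sub in db colddb; do
        for bucket in "$index_dir/$sub"/db_*; do
            # The glob stays literal when nothing matches, so test first.
            if [ -d "$bucket" ]; then
                echo "$bucket"
            fi
        done
    done
}
```

The output can be fed to a copy tool of your choice; custom homePath or coldPath settings would require adjusting the two subdirectory names.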


What if your backup fails?


Restore Configurations


[Diagram: $SPLUNK_HOME/etc/* from the original Splunk instance -> $SPLUNK_HOME/etc/* on the new Splunk instance]


Restore Data


[Diagram: $Indexes_Location ($SPLUNK_HOME/var/lib/splunk) from the original Splunk instance -> $Indexes_Location ($SPLUNK_HOME/var/lib/splunk) on the new Splunk instance]


Putting Restore Together


2a. New Splunk instance

2b. Configurations

2c. Indexes
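The three restore steps can be sketched as a single function. This is only an illustration under stated assumptions: the new instance is already installed and stopped, the configuration backup is a gzip-compressed tar archive with a top-level `etc/` directory, and the index backup is a directory tree mirroring `$SPLUNK_HOME/var/lib/splunk`. The function name `restore_splunk` and all three arguments are hypothetical:

```shell
#!/bin/sh
# Restore configurations and indexes onto a new (stopped) Splunk instance.
restore_splunk() {
    new_home=$1       # $SPLUNK_HOME of the new instance (step 2a)
    etc_archive=$2    # tar.gz holding a top-level etc/ directory (step 2b)
    index_backup=$3   # directory holding backed-up index directories (step 2c)
    # Unpack configurations into the new $SPLUNK_HOME/etc.
    tar -xzf "$etc_archive" -C "$new_home" || return 1
    # Copy index buckets into the default index location.
    mkdir -p "$new_home/var/lib/splunk" || return 1
    cp -R "$index_backup"/. "$new_home/var/lib/splunk"/ || return 1
}
```

Running this periodically against a scratch instance is one way to check that the backups can actually be restored.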


Things to Think About:

• Recovery time vs. complexity and cost
  – Ex. Backup to tape vs. warm standby
• Tolerable loss vs. complexity and cost
  – Ex. Backup warm/cold buckets vs. mirror hot bucket writes


High Availability (HA)


What is High Availability?

A design methodology whereby a system is continuously operational, bounded by a set of predetermined tolerances

Note: "high availability" != "complete availability"


Splunk High Availability


1 Data collection

2 Searching

3 Indexing


Data Collection


[Diagram: multiple forwarders sending data to indexers A and B]


outputs.conf:

    [tcpout]
    defaultGroup = mygroup

    [tcpout:mygroup]
    server = A:9997, B:9997
    autoLB = true


Searching


2a. Multiple, independent search heads

2b. Splunk native search head pooling


[Diagram: typical search hierarchy over Indexer A, Indexer B, ..., Indexer N]


Searching: Independent SH


[Diagram: independent search heads A and B, each searching Indexer A, Indexer B, ..., Indexer N]

Between the independent search heads:
– Sync configs
– Sync job artifacts


Search Head Pooling


Searching: Pooling


[Diagram: search heads A and B, backed by a shared NFS mount, searching Indexer A, Indexer B, ..., Indexer N]

Use NFS to share:
– Configurations
– Job artifacts


Steps to SHP


1. Configure the shared location (NFS/SMB)

2. Enable pooling on each (stopped) search head:

       ./splunk pooling enable <path_to_share>

3. Copy $SPLUNK_HOME/etc/apps and $SPLUNK_HOME/etc/users to $path_to_share/etc/

4. Start all search heads
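Steps 2 to 4 can be sketched for a single search head as follows. This is a hedged illustration, not an official procedure: it assumes the share from step 1 is already mounted, that the `splunk` binary lives in `$SPLUNK_HOME/bin`, and that the copy in step 3 only needs to be done from one search head. The function name `enable_search_head_pooling` and its arguments are hypothetical:

```shell
#!/bin/sh
# Enable search head pooling on one search head (steps 2-4 above).
enable_search_head_pooling() {
    splunk_home=$1   # $SPLUNK_HOME of this search head
    share=$2         # mount point of the shared location from step 1
    splunk="$splunk_home/bin/splunk"
    # Pooling must be enabled while the search head is stopped.
    "$splunk" stop || return 1
    "$splunk" pooling enable "$share" || return 1
    # Seed the share with this search head's apps and user objects.
    mkdir -p "$share/etc" || return 1
    cp -R "$splunk_home/etc/apps" "$splunk_home/etc/users" "$share/etc/" || return 1
    "$splunk" start
}
```

In a real rollout you would typically stop every search head first, run the copy once, and start them all only after pooling is enabled everywhere.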


Indexing


3a. Reliable storage (SAN)

3b. Mirrored indexers

3c. Index replication


Indexing: Reliable Storage


Use the underlying SAN infrastructure to provide HA

[Diagram: Indexer A attached to SAN storage]


When indexer A fails, mount SAN to another indexer instance

[Diagram: SAN storage remounted from the failed Indexer A to Indexer B]


Indexing: Mirrored Indexers


[Diagram: search head querying primary indexers A.0 and B.0, each paired with a secondary, A.1 and B.1]

Primary indexers X.0 index (locally) and forward to their respective secondaries X.1
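One way a primary could be set up to index locally and forward a copy is through `outputs.conf` with `indexAndForward = true`. The sketch below only writes such a file; the function name `write_primary_outputs`, the group name `mirror`, and the host and port of the secondary are assumptions for illustration, not values from the talk:

```shell
#!/bin/sh
# Write an outputs.conf for a primary indexer (X.0) that indexes
# locally and forwards the same data to its secondary (X.1).
write_primary_outputs() {
    conf_file=$1   # e.g. $SPLUNK_HOME/etc/system/local/outputs.conf
    secondary=$2   # host:port of the mirror, e.g. "A.1:9997" (hypothetical)
    cat > "$conf_file" <<EOF
[tcpout]
defaultGroup = mirror
indexAndForward = true

[tcpout:mirror]
server = $secondary
EOF
}
```

The secondary would also need a matching receiving port enabled, which is not shown here.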


[Diagram: same topology, with the search head redirected from Indexer B.0 to Indexer B.1]

Point search head to mirrored indexer B.1


Index Replication


Indexing: Index Replication

• Introduced with Splunk 5
• Cluster = a group of search peers (indexers) that replicate each other's data
• Multiple copies of all data exist throughout the cluster
• The process that achieves this is known as index replication
• Bucket-based replication
  – i.e. the unit of replication is a bucket


Index Replication Benefits

• Data availability. An indexer is always available to handle incoming data, and the indexed data is always available for searching
• Disaster tolerance. System can tolerate downed indexers without losing data or losing access to data


Index Replication Trade-offs

• Extra storage
  – More disaster tolerance (i.e., more copies of data) requires more storage
  – Governed by the replication factor (RF) attribute
• To a lesser degree, increased processing load
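As a back-of-the-envelope illustration of the storage trade-off, the sketch below multiplies the disk space of one copy of the data by the replication factor. This is a deliberately crude upper bound: it assumes every copy takes as much space as the original and ignores that non-searchable copies are smaller. The function name `cluster_storage_gb` is hypothetical:

```shell
#!/bin/sh
# Rough upper bound on cluster-wide storage for a given replication factor.
# Assumes each of the RF copies uses the same space as a single copy.
cluster_storage_gb() {
    single_copy_gb=$1       # disk used by one copy of the indexed data
    replication_factor=$2   # RF: number of copies kept in the cluster
    echo $(( single_copy_gb * replication_factor ))
}
```

For example, `cluster_storage_gb 500 3` prints `1500`: keeping three copies of 500 GB needs up to 1500 GB across the cluster.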


[Diagram: 3-node cluster with a replication factor of 3]


Ok guys, pair up in threes! – Yogi Berra


Cluster Components

• Master node
  – Orchestrates replication process
  – Informs the SH where to find data
  – Helps manage peer configurations
• Peer nodes
  – Receive and index data (normal peer behavior)
  – Replicate data to/from other peers
  – Number of peer nodes ≥ RF


Cluster Components

• At least one search head
  – Must use one to manage searches across the cluster
  – Can't run searches directly on an indexer, as you can with non-clustered indexers
• Auto load-balanced forwarders
  – Capable of indexer acknowledgement
    ▪ Ensuring data gets reliably indexed
  – Note that other input methods can be used as well, but no indexer acknowledgement is available
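A forwarder that combines auto load balancing with indexer acknowledgement might be configured as sketched below, using the `useACK` setting in `outputs.conf`. The function only writes the file; `write_forwarder_outputs` is a hypothetical name, and the server list passed in is an assumption supplied by the caller:

```shell
#!/bin/sh
# Write a forwarder outputs.conf with auto load balancing and
# indexer acknowledgement enabled for the target group.
write_forwarder_outputs() {
    conf_file=$1   # e.g. $SPLUNK_HOME/etc/system/local/outputs.conf
    servers=$2     # comma-separated host:port list, e.g. "A:9997, B:9997"
    cat > "$conf_file" <<EOF
[tcpout]
defaultGroup = mygroup

[tcpout:mygroup]
server = $servers
autoLB = true
useACK = true
EOF
}
```

With acknowledgement on, the forwarder keeps data in a wait queue until the indexer confirms it, so expect somewhat higher memory use on the forwarder.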


Cluster Concepts: RF

• Replication factor: RF defaults to 3
  – Number of copies of data in the cluster
  – Cluster can tolerate RF-1 node failures
• With the default RF of 3, the cluster can tolerate 2 node failures


Cluster Concepts: SF

• Search factor: SF defaults to 2
  – Number of searchable copies the cluster maintains
  – SF ≤ RF
  – A searchable copy requires more storage and more processing than a replicated (non-searchable) copy
  – Replicated copy vs. searchable copy
  – SF = 1 vs. SF = 2
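The constraints on the two factors can be checked before writing any configuration. The sketch below validates that SF ≤ RF and that there are at least RF peers, then prints a `[clustering]` stanza of the kind that goes into the master node's `server.conf`. The function name `print_master_clustering` is hypothetical, and the stanza is a minimal assumption rather than a complete master configuration:

```shell
#!/bin/sh
# Validate cluster factors, then print a minimal [clustering] stanza
# for the master node. Returns non-zero if a constraint is violated.
print_master_clustering() {
    peers=$1   # number of peer nodes in the cluster
    rf=$2      # replication factor
    sf=$3      # search factor
    if [ "$sf" -gt "$rf" ]; then
        echo "search factor ($sf) must not exceed replication factor ($rf)" >&2
        return 1
    fi
    if [ "$peers" -lt "$rf" ]; then
        echo "need at least $rf peers, have $peers" >&2
        return 1
    fi
    cat <<EOF
[clustering]
mode = master
replication_factor = $rf
search_factor = $sf
EOF
}
```

Calling `print_master_clustering 3 3 2` emits the default factors for a three-peer cluster, while `print_master_clustering 2 3 2` fails because two peers cannot hold three copies.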


Clustered Indexing

• Originating peer node streams copies of data to other clustered peers
  – They store those copies in their own buckets
  – Some peers may also index the copy
• Master determines which peer nodes get replicated data
  – Currently not configurable
• Master manages all peer-to-peer interactions
• Master keeps track of which peers have searchable data
  – Ensures that there are always SF copies of searchable data available
  – When a peer goes down, the master coordinates remedial activities


Clustered Searching

• Search head coordinates all searches in the cluster
• SH relies on master to tell it who its peers are
• Only one replicated bucket is searchable (primary)
  – Searches occur over primary buckets only
• Primary buckets (searchable) may change over time
  – Peers know their status and therefore know where to search


Putting HA All Together


[Diagram: full HA deployment, from forwarders through indexers to search heads sharing NFS]

– Search head pooling HA
– Index replication or SAN-based mirroring
– Forwarders: autoLB HA


Top Takeaways

• DR: process of backing up and restoring service in case of disaster
  – Configuration files: copy of the $SPLUNK_HOME/etc/ folder
  – Indexed data: back up and restore buckets
    ▪ Hot, warm, cold, frozen
    ▪ Can't back up hot, but can safely back up warm and cold
• HA: continuously operational system bounded by a set of tolerances
  – Data collection
    ▪ AutoLB from forwarders to multiple indexers
    ▪ Use indexer acknowledgement to protect in-flight data
  – Searching
    ▪ Search head pooling (SHP)
  – Indexing
    ▪ Use index replication
    ▪ Highly reliable storage, mirrored set of indexers


Questions?


Next Steps

1. Download the .conf2013 Mobile App (if not iPhone, iPad or Android, use the Web App)
2. Take the survey & WIN A PASS FOR .CONF2014… or one of these bags!
3. Go to the sessions listed on the next slide


Other Sessions You May Like:

• Architecting and Sizing Your Splunk Deployments: Brera 2&3, Level 3; today, 3-4pm
• Planning and Execution for Successful Deployments: Brera 2&3, Level 3; October 3rd, 10:15-11:15am
