large scale near real-time log indexing with flume and solrcloud

45
C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 1 log indexing with Apache Flume and SolrCloud Hadoop Summit 2013 Ari Flink Operations Architect Cisco/WebEx June 27 th , 2013

Upload: hadoopsummit

Post on 27-Jan-2015

108 views

Category:

Technology


2 download

DESCRIPTION

Apache Flume’s extensible architecture allows Cisco to stream system and application logs from worldwide production data centers to a central Hadoop cluster and Solr. This architecture enables a new level of scalable indexing so that a larger volume of logs is searchable within seconds. Using Solr 4.0′s near real time features together with Hadoop, we can execute mission critical tasks much quicker, improving our ability to meet tight SLAs. At the same time, using the same infrastructure, we can perform large-scale historical analysis and pattern extraction to help further improve our services. This talk will explore our infrastructure and decisions we?ve made to meet key requirements, i.e. high indexing load, high availability and disaster recovery. We will further explore other uses of Flume and SolrCloud within Cisco including dynamic event routing, parsing and multi-tenancy.

TRANSCRIPT

Page 1: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 1

Large scale near real-time log indexing with Apache Flume and SolrCloud

Hadoop Summit 2013

Ari FlinkOperations ArchitectCisco/WebEx

June 27th, 2013

Page 2: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 2

Agenda

1. Intro

2. Problem to solve?

3. How does Flume/Solr help?

4. Syslog indexing example

5. HA, DR & scalability

Page 3: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 3

Me

Ops Architect at Cisco CCATG (WebEx)Ensure operational readiness for complex distributed services

HA, DR, monitoring, config, deployment

Previously eBay, Excite@Home, IBM, VISAOperations architecture, monitoring, event correlation

Page 4: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 4

What’s that funny accent?

Page 5: Large scale near real-time log indexing with Flume and SolrCloud

© 2012 Cisco and/or its affiliates. All rights reserved. 5

Cisco CCATG - Cloud Collaboration Application Technology Group

Cisco WebEx Meetings

• Voice, video, desktop sharing• Meeting/Event/Support/Training• Centers• Integration with TelePresence

Cisco WebEx Social

• Social networking• Content creation• Integrated IM

Cisco WebEx Messenger

• IM, presence• Integrate with voice, video• XMPP

Page 6: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 6© 2010 Cisco and/or its affiliates. All rights reserved. 6

Participants from over 231 countries, 52% market share

2.2 Billion meeting minutes per month

40.5 Million meeting attendees per month

9.4 million registered hosts worldwide

4 Million mobile downloads

Cisco WebEx - Leader for SaaS Web Conferencing

Page 7: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 7© 2010 Cisco and/or its affiliates. All rights reserved. 7

Cisco WebEx Collaboration Cloud

Datacenter / PoP

Leased network link

Global Scale: 13 datacenters & iPoPs around the globe

Dedicated network: dual path 10G circuits between DCs

Multi-tenant: 95k sites

Real-time collaboration: voice, desktop sharing, video, chat

Page 8: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 8© 2010 Cisco and/or its affiliates. All rights reserved. 8

S#!% happens ..

Datacenter / PoP

Leased network link

People make mistakesHardware failsSoftware failsEven failovers sometimes fail

Page 9: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 9

Recovery-Oriented Computing Philosophy

“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time”

— Shimon Peres (“Peres’s Law”)

People/HW/SW failures are facts, not problems

Operations main goal is to maintain high service availability• Recovery/repair is how we cope with above facts• Improving recovery/repair improves availability

UnAvailability = MTTR / MTBF1/10th MTTR just as valuable as 10x MTBF

Page 10: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 10

How could we make recovery & repair faster?

Even better: proactive

Good: reactive

Your search – What is the root cause of the outage? – did not match any documents.

Page 11: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 11

The goal: NRT full text log search

Page 12: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 12

Cisco WebEx log collection overview

Flume

Log4j

File

Avro

Syslog

Other SinksSolrSink

App

lica

tion

stat

e &

AP

Is

HDFS

Thrift

AMQP RDBMS

Sqoop

HTTP/REST

MySQL

Unstructured/semi-structured data Structured data

Cisco UCS C240 M3 servers

12 x 3TB = 36 TB / server

HDFSSink

SolrCloud

Raw dataSolr index

Page 13: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 13

Global Flume topology

DC 1

HDFS

Flume

SolrCloud

FlumeFlume

DC 2

HDFS

Flume

SolrCloud

FlumeFlume

DC 1

FlumeFlume

Flume

syslog log4j file

DC N

FlumeFlume

Flume

syslog log4j file

… Collectortier

Storagetier

Page 14: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 14

Flume Collector tier

agent agent agent

File Channel 1

Avro src

DC1 Avro sink

DC2 Avro sink

File Channel 2

Replicatingfan-out

flow Flume Collector server

Failover & load balancing agents

Flume Storage tier

All events replicated to both Channels

DC1 DC2

Page 15: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 15

Global Flume topology

DC 1

HDFS

Flume

SolrCloud

FlumeFlume

DC 2

HDFS

Flume

SolrCloud

FlumeFlume

DC 1

FlumeFlume

Flume

syslog log4j file

DC N

FlumeFlume

Flume

syslog log4j file

… Collectortier

Storagetier

Page 16: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 16

Flume Storage tier

File Channel 1

Avro src

SolrSink

HDFSsink

File Channel 2

Multiplexingfan-out

flow Flume Storage tier server

Failover & load balancing agents

FlumeCollector

FlumeCollector

FlumeCollector

HDFSSolrCloud

Routing to Solr by Flume event header

All events to HDFS

Page 17: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 17

Schema or no schema ..

Isn’t Big Data “schema on read”?• Why does Solr require a schema on write?• Dirty little secret: there’s always a schema• Performance & functionality vs flexibility• Optimize operations and storage based on field type - that's how you

get sub second response times

There’s always a schema• Application code vs. central location

Page 18: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 18

Cloudera Morphlines: streaming ETL

Cloudera Morphlines

• Framework to simplify event transformation • Compatible with existing grok patterns• Reusable across multiple index workloads:

Flume & M/R

Command: readLine

Command: grok

Command: loadSolr

Solr

Flume event = headers + body

Record

Document matching schema.xml

Command: tryRules

Command: addValues

…Record

Record

Record

Record

SolrSink

Page 19: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 19

Flume Morphline example

Convert syslog message..

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

.. into Solr schema fields

Severity=[3] Facility=[22] host=[colo01-wxp00-ace01b-connect.webex.com] timestamp=[2013-06-16T04:36:49.000Z]syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234] severity_label=[error] access_token=[54asdf654] id=[b2f839c3-dece-404f-a535-e0141ad549bf] cisco_product=[ACE] cisco_level=[3] cisco_id=[251008] cisco_code=[%ACE-3-251008]

Page 20: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 20

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 1: readLine reads in Flume event headers and body

timestamp=[1371357409000]host=[colo01-wxp00-ace01b-connect.webex.com]category=[545f5sfsd5sf]Severity=[3] Facility=[22] message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]

Headers

Body

Page 21: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 21

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 2: convertTimestamp converts epoch to ISO 8601 format

timestamp=[2013-06-16T04:36:49.000Z]host=[colo01-wxp00-ace01b-connect.webex.com]access_token=[545f5sfsd5sf]message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234]Severity=[3] Facility=[22]

Page 22: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 22

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 3: addValues creates new field access_token

timestamp=[2013-06-16T04:36:49.000Z] category=[545f5sfsd5sf] access_token=[545f5sfsd5sf] message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234] host=[colo01-wxp00-ace01b-connect.webex.com] Severity=[3] Facility=[22]

Page 23: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 23

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 4: tryRules creates field severity_label for severity

timestamp=[2013-06-16T04:36:49.000Z] severity_label=[error] access_token=[545f5sfsd5sf] message=[<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234] host=[colo01-wxp00-ace01b-connect.webex.com] category=[545f5sfsd5sf] Severity=[3] Facility=[22]

Page 24: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 24

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 5: tryRules creates new fields

syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234] cisco_product=[ACE] cisco_level=[3] cisco_id=[251008] cisco_code=[%ACE-3-251008]

Page 25: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 25

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 6: sanitizeUnknownSolrFields drops non-schema fields

timestamp=[2013-06-16T04:36:49.000Z]syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234] severity_label=[error] access_token=[545f5sfsd5sf] host=[colo01-wxp00-ace01b-connect.webex.com] cisco_product=[ACE] cisco_level=[3] cisco_id=[251008] cisco_code=[%ACE-3-251008]

Page 26: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 26

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 7: generateUUID creates an unique id for the document

timestamp=[2013-06-16T04:36:49.000Z]syslog_message=[%ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234] severity_label=[error] access_token=[545f5sfsd5sf] id=[b2f839c3-dece-404f-a535-e0141ad549bf] host=[colo01-wxp00-ace01b-connect.webex.com] cisco_product=[ACE] cisco_level=[3] cisco_id=[251008] cisco_code=[%ACE-3-251008]

Page 27: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 27

Flume Morphline example

Convert syslog message

<179>Jun 16 04:36:49 colo01-wxp00-ace01b-connect.webex.com Jun 16 2013 04:36:49 : %ACE-3-251008: Health probe failed for server 10.240.22.111 on port 1234

Step 8: loadSolr loads a record into a Solr server

Page 28: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 28

Morphlines recap

Command: readLine

Command: grok

Command: loadSolr

SolrCloud

Flume syslog event = headers + body

Record

Document matching schema.xml

Command: tryRules

Command: addValues

…Record

Record

Record

Record

SolrSink

Page 29: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 29

SolrCloud

ZooKeeper

leader1

replica1

Shard1

leader2

replica2

Shard2

leader3

replica3

Shard3

SolrCloud cluster

zk1

zk2

zk3

Pluggable filesystem(local, HDFS)

Add doc to syslog index

• Collections, shards & replicas• Pluggable file system• Central config & coordination with

ZK • Full HA, automatic fail-over• NRT indexing• Automatic routing

Where can I index data?

leader3

Collection

Page 30: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 30

SolrCloud

Collection “syslog” with three shards

Page 31: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 31

Solr index management

Special case of search• Logs are time series data: timestamp + data

• High indexing rate, no updates

• New data is more frequently searched than old

Collection aliases• Time partitioned collections – e.g. one collection per day

• Reduces the workload to near-real-time data only

• One-to-many collection mapping: queries go to a logical representation mapped to multiple, same-schema collection

• Simplifies for hot-warm-cold migration of data

Index expiration• Old data is aged out by Collection Aliases

• Remap only the latest collection to an alias

Page 32: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 32

Solr & HDFS DR

Solr• No multi-datacenter cluster support

HDFS• No multi-datacenter cluster support

Options?• All our services must survive DC outage

• . . so should logging and indexing

Page 33: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 33

DR option 1: Flume dual writes

DC 1

HDFS

Flume

SolrCloud

FlumeFlume

DC 2

HDFS

Flume

SolrCloud

FlumeFlume

DC 1

FlumeFlume

Flume

syslog log4j file

DC 2

FlumeFlume

Flume

syslog log4j file

DC N

FlumeFlume

Flume

syslog log4j file

…Collector

tier

StoragetierPlanned or

unplanned outage

Flume Collector disk channel buffering DC1 events

DC1 Hadoop cluster back online after outage

Replicate aggregate

data

Page 34: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 34

DR option 2: distcp + M/R

DC 1

HDFS

Flume

SolrCloud

FlumeFlume

DC 2

HDFSSolrCloud

DC 1

FlumeFlume

Flume

syslog log4j file

DC 2

FlumeFlume

Flume

syslog log4j file

DC N

FlumeFlume

Flume

syslog log4j file

… Collectortier

Storagetier

FlumeFlume

Flume

distcp

Manual CNAME change to DC2

DC1 back online, sync data from DC2

Data sent only to a single DC

distcp

DNS CNAME change back to DC1

Flip distcp the other way

Flume buffering events at collector tier

Create indexes with M/R

Page 35: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 35

Tiers to scale

• Flume Collector tier• Flume Storage tier• SolrCloud

Page 36: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 36

Flume Collector tier throughput100 – 5000 servers per a datacenter

agent agent agent

File Channel 1

Avro src

DC1 Avro sink

DC2 Avro sink

File Channel 2

Replicatingfan-out

flow

agent agent agent …

…Flume Collector

More agents and data

FileChannel:14MB/sec

NIC:100MB/sec

NIC:100MB/sec

File Channel 1

Avro src

DC1 Avro sink

DC2 Avro sink

File Channel 2

Replicatingfan-out

flow

Max per server:

14MB/s1.2 TB/day

70k events/s

Page 37: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 37

Scalability: Flume Storage tierDC 1 collectors

DC 1storage tier

Flume 1

DC 2 storage tier

Avro sink

1

Avro sink

2

Avro sink

N…

DC 2 collectors

Avro sink

1

Avro sink

2

Avro sink

N…

DC N collectors

Avro sink

1

Avro sink

2

Avro sink

N……

File Chan1

Avro src

HDFSsink

Solrsink

File Chan2

Multiplexingfan-out

flowFile

Chan1

Avro src

HDFSsink

Solrsink

File Chan2

Multiplexingfan-out

flowFile

Chan1

Avro src

HDFSsink

Solrsink

File Chan2

Multiplexingfan-out

flowFile

Chan1

Avro src

HDFSsink

Solrsink

File Chan2

Multiplexingfan-out

flow

Max per server:14MB/s

1.2 TB/day70k events/s

Page 38: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 38

SolrCloud scalablity

ZooKeeper

leader1

replica1

Shard1

leader2

replica2

Shard2

leader3

replica3

Shard3

SolrCloud cluster

zk1

zk2

zk3

Pluggable filesystem(local, HDFS)

New logsto index

Searchqueries

1000 tx/sec/core

2x8 cores 16k tx/sec

3 shards3 x 16k =

48k tx/sec

Page 39: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 39

Syslog indexing recap

Central syslog servers• Network and OS system messages forwarded to several central syslog

servers

Forward syslog to Solr using Flume Morphline SolrSink• Parse messages with Morphline and grok patterns

SolrCloud • Index log lines as documents into a Collection (i.e. index)

HUE Solr search • Simple UI to build a customized search page layout with faceting, sorting.

• Easy drill down with multiple facets: severity, datacenter, hostname, etc

Page 40: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 40

Screen shots

Page 41: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 41

Search by time

Sort by select field

Facets by selected fields

Page 42: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 42

Wildcard query by field

Highlight the query keywords

Page 43: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 43

Data sources: REST/JSON, log4j, syslog, Avro, Thrift

Parsing: Cloudera Morphlines

NRT Indexing: SolrCloud embedded in CDH

Batch indexing: MapReduce

Analytics: Use your favorite tool, raw detailed data stored in HDFS

Summary

Page 44: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 44

email: [email protected]: @raaka

Questions ?

Page 45: Large scale near real-time log indexing with Flume and SolrCloud

C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved. 45

Thank you.