building a near real time search engine & analytics for logs using solr

42
Building a Near Real time “Logs Search Engine & Analytics” using Solr Lucene/Solr Revolution 2013 May 1 st , 2013 Rahul Jain [email protected]

Upload: lucenerevolution

Post on 05-Dec-2014

6.523 views

Category:

Education


4 download

DESCRIPTION

Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd Consolidation and Indexing of logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since the log events mostly have a small size of around 200 bytes to few KBs, makes it more difficult to handle because lesser the size of a log event, more the number of documents to index. In this session, we will discuss the challenges faced by us and solutions developed to overcome them. The list of items that will be covered in the talk are as follows. Methods to collect logs in real time. How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards How choosing a layer based partition strategy helped us to bring down the search response times. Log analysis and generation of analytics using Solr. Design and architecture used to build the search platform.

TRANSCRIPT

Page 1: Building a near real time search engine & analytics for logs using solr

Building a Near Real time “Logs Search Engine & Analytics”

using Solr

Lucene/Solr Revolution 2013 May 1st , 2013

Rahul Jain [email protected]

Page 2: Building a near real time search engine & analytics for logs using solr

Who am I?

Software Engineer

Member of Core technology @ IVY Comptech, Hyderabad, India

6 years of programming experience

Areas of expertise/interest High traffic web applications

JAVA/J2EE

Big data, NoSQL

Information-Retrieval, Machine learning

2

Page 3: Building a near real time search engine & analytics for logs using solr

Agenda

• Overview

• Indexing

• Search

• Analytics

• Architecture

• Lessons learned

• Q&A

3

Page 4: Building a near real time search engine & analytics for logs using solr

Overview

Page 5: Building a near real time search engine & analytics for logs using solr

Issues keep coming in “Production”

5

java.net.ConnectException:

Connection refused

ServerNotRunningException

Too many open files

DBException

NullPointerException

OutOfMemory

Issues Hidden Bugs DB is down Server crashed

OutOfMemory Connection reset Nodes go out of cluster (Due to long GC pause) Attack

DOS (Denial of Service) by sending a lot of requests in a short time frame.

5

Page 6: Building a near real time search engine & analytics for logs using solr

Why Logs Search?

• Enable production support team to immediately check for issues at “one place”

– Saves time from logging on to multiple servers to check the logs

• Debugging production issues

– Is it a server specific or occurring in all other servers for that application?

• Allows to track user activity across multiple servers/applications.

• Correlation of multiple issues with each other.

– e.g. Logins might be failing on X Node due to OutOfMemory on Y node.

6

Page 7: Building a near real time search engine & analytics for logs using solr

Key Problems

• Hundreds of servers/services generating logs

• terabytes of unstructured logs/day to index in Near Real time

• Millions of log events (Priority one)

• Full Text search & storage of log content

• High Indexing Rate of 1GB/min

• Search latency in seconds is acceptable

7

Page 8: Building a near real time search engine & analytics for logs using solr

Logs are different

• varying size – from few bytes to several KBs

– more no. of documents.

• average 6-8 million log messages in 1 GB logs – Each line forms one log message except “exception stack trace”.

• different types – exception stack-trace

– application logs

– http access/error logs

– gclog

• logging format is not uniform across all logs

8

Page 9: Building a near real time search engine & analytics for logs using solr

Indexing

Page 10: Building a near real time search engine & analytics for logs using solr

Improving Indexing Performance

10

Solr in Embedded Mode Bypassing XML Marshalling/Unmarshalling Moving to an Async Approach Route traffic on Alternate Shard once “Commit” starts on Main Shard Other optimizations

Add document does update (add + delete) Changing Buffer size in BufferIndexInput and BufferIndexOutput Reusing Lucene document object

Page 11: Building a near real time search engine & analytics for logs using solr

Old Architecture

11

Solr Server

Centralized Log Collection

Server

Solr Server

Search UI Production Server

Logs Transfer

Page 12: Building a near real time search engine & analytics for logs using solr

Old Architecture

12

Solr Server

Centralized Log Collection

Server

Solr Server

Search UI Production Server

Logs Transfer

Data Copy1 Data Copy2

Page 13: Building a near real time search engine & analytics for logs using solr

Direct Logs transfer

Indexing Server Production Server

Indexing Server

Indexing Server

Open question : Since now Indexing system is exposed to production servers - what if a new Indexing Server is added on the fly or one of them is down

13

Page 14: Building a near real time search engine & analytics for logs using solr

Solr in Embedded Mode

14

Single JVM

Solrj (EmbeddedSolrServer)

Solr Application

Indexing Server

No network latency

Page 15: Building a near real time search engine & analytics for logs using solr

Improving Indexing Performance

15

Solr in Embedded Mode Bypassing XML Marshalling/Unmarshalling Moving to an Async Approach Route traffic on Alternate Shard once “Commit” starts on Main Shard Other optimizations

Add document does update (add + delete) Changing Buffer size in BufferIndexInput and BufferIndexOutput Reusing Lucene document object

Page 16: Building a near real time search engine & analytics for logs using solr

Message Flow

16

SolrInputDocument SolrInputDocument

(new object)

Single JVM

XML Marshalling (UpdateRequest)

XML Unmarshalling

(XMLLoader)

<add> <doc> <field> </field> <field> </field> <doc> </add>

xml

Page 17: Building a near real time search engine & analytics for logs using solr

Bypassing XML Marshalling/Unmarshalling

17

SolrInputDocument

XML Marshalling (UpdateRequest)

XML Unmarshalling

(XMLLoader)

SolrInputDocument (referenced object)

Passing the Direct reference of SolrInputDocument Object

Single JVM

DocContentStream #getSolrInputDocuments()

RefDocumentLoader #load()

DocUpdateRequest #add(List<SolrInputDocument>)

LMEmbeddedSolrServer #add(List<SolrInputDocument>)

Page 18: Building a near real time search engine & analytics for logs using solr

Improving Indexing Performance

18

Solr in Embedded Mode Bypassing XML Marshalling/Unmarshalling Moving to an Async Approach Route traffic on Alternate Shard once “Commit” starts on Main Shard Other optimizations

Add document does update (add + delete) Changing Buffer size in BufferIndexInput and BufferIndexOutput Reusing Lucene document object

Page 19: Building a near real time search engine & analytics for logs using solr

Old Architecture (Sync)

19

Incoming Message

Log Event

Solr

unstructured structured

SolrInput Document

(10K)

Thread Pool with multiple threads

Once Batch size reaches to 10k, one of the thread adds documents to Solr as a Sync call and wait for response

add

UpdateResponse

Batch

Wait for response

Time taken : - Indexing 1 chunk (10k) takes anywhere between 400ms-3000ms#

- while commit it is from 6000ms-23000ms and even more… - In 1 GB there are around 600 chunks - so most of time is just spent in waiting for response

#Indexing time vary based on several factors, for e.g. hardware configurations, application type, nature of data, number of index fields/stored fields, analyzer type etc.

Page 20: Building a near real time search engine & analytics for logs using solr

Moving to an Asynchronous Architecture

20

Incoming Message

Log Event Event Pipeline (BlockingQueue)

Log Event SolrInput

Document

Log Message Transformation

(Analyzer Thread Pool)

Log Event to SolrInputDocument

(Indexer Thread Pool)

Add a Batch of Log Event to Pipeline

Remove Batch of Log Event from Pipeline

Solr

Add to Batch

Remove from Batch

Page 21: Building a near real time search engine & analytics for logs using solr

Improving Indexing Performance

21

Solr in Embedded Mode Bypassing XML Marshalling/Unmarshalling Moving to an Async Approach Route traffic on Alternate Shard once “Commit” starts on Main Shard Other optimizations

Add document does update (add + delete) Changing Buffer size in BufferIndexInput and BufferIndexOutput Reusing Lucene document object

Page 22: Building a near real time search engine & analytics for logs using solr

Commit Strategy

22

Solr

20130501_0

20130501_1

20130501_2

Shard (Single Node)

SolrInputDocument Indexing

Partition function

22

Page 23: Building a near real time search engine & analytics for logs using solr

Indexing traffic on alternate Shard Once commit starts on “Main Shard”

23

Solr

20130501_0

20130501_1

20130501_2

Main Shard (Single Node)

20130501_3

20130501_4

20130501_5

Alternate Shard

SolrInputDocument Indexing

Pair Partition function

23

Page 24: Building a near real time search engine & analytics for logs using solr

Commit Strategy

• Merits

– Scales well

– Indexing can run continuously

• De-Merits

– Search needs to be done on both cores

– but end of the day these two can be merged into one core

24

Page 25: Building a near real time search engine & analytics for logs using solr

Improving Indexing Performance

25

Solr in Embedded Mode Bypassing XML Marshalling/Unmarshalling Moving to an Async Approach Route traffic on Alternate Shard once “Commit” starts on Main Shard Other optimizations

Add document does update (add + delete) Changing Buffer size in BufferIndexInput and BufferIndexOutput Reusing Lucene document object

Page 26: Building a near real time search engine & analytics for logs using solr

Other Optimizations

• In Solr, Add document does update (add + delete) – for each add document call, Solr internally creates a delete term with “id” field

for delete – but log messages are always unique

• Changing Buffer Size in BufferIndexInput and BufferIndexOutput – Increasing buffer size improves the indexing performance especially if disk is

slow. – More Process heap is required accordingly as lot of files are created if data

volume is high.

• Reusing Lucene document and Field instances - Check org/apache/lucene/benchmark/byTask/feeds/DocMaker.java

• Check for more information on Improving Indexing performance http://rahuldausa.wordpress.com/2013/01/14/scaling-lucene-for-indexing-a-billion-documents/

26

Page 27: Building a near real time search engine & analytics for logs using solr

The Result

27

Page 28: Building a near real time search engine & analytics for logs using solr

Data Volume v/s Indexing time (GB/Minutes)

3

14

38

56

112

0.5 2 4.5 9

22

0

20

40

60

80

100

120

1GB 4GB 8GB 17GB 35GB

Ind

exin

g Ti

me

Before

After

28

Page 29: Building a near real time search engine & analytics for logs using solr

Search

Page 30: Building a near real time search engine & analytics for logs using solr

Partition • Partitioning the data properly improves the Search performance significantly

• Partition Type

– Server based Partition • Number of documents does not balance out evenly in all shards

– Date and Time based Partition • Hotspot a single shard

– Least loaded Shard (index)

• By number of documents

• Balances out documents evenly in all shards

• Can’t provide optimal search performance, as all shards needs to be hit

30

Incoming message

Server Based Partition

Date & time Based

Partition Solr Shard

Hybrid Approach

Page 31: Building a near real time search engine & analytics for logs using solr

Multi-tier Partition

jacob

Incoming Message

Date & time based

Partition

20130501_00_0

mia

Solr Shard (date_hour_shardId)

20130501_06_0

20130501_00_1 Server based

Partition

Jacob: { message: hello lucene time:20130501:11:00:00 }

Indexing Server Production Server

mia: { message: hello solr and lucene time:20130501:04:00:00 }

31

Page 33: Building a near real time search engine & analytics for logs using solr

How Search works

Zookeeper (zk)

Search Server (tomcat)

Pushes shard mapping to zk

Create a watcher on zk node and update the In-Memory shard mapping on change

User query QueryParser Indexing Server (maestro)

Search query with shards parameter

Shard Mapping (In Memory structure)

Lookup

33

erver Indexing Server

Page 34: Building a near real time search engine & analytics for logs using solr

How Search works (Cont’d)

from: now-24hour

server: jacob from: now-4hour

from: now-11hour

Indexing server

Indexing server

Indexing server

Lookup on shards for today

shard(s) having data for jacob from last 6hour shard

shard(s) having data for last 12 hours

34

Leader shard (maestro)

Page 35: Building a near real time search engine & analytics for logs using solr

Analytics

• Young GC timings/chart

• Full GC timings

• DB Access/Update Timings

– Reveal is there any pattern across all DB servers?

• Real time Exceptions/Issues reporting using facet query.

• Apache Access/Error KPI

35

Page 36: Building a near real time search engine & analytics for logs using solr

Analytics

• Custom report based on “Key:Value” Pair

For e.g. time – key:value

18:28:28, 541 - activeThreadCount:5

18:28:29, 541- activeThreadCount:8

18:28:30, 541 - activeThreadCount:9

18:28:31, 541- activeThreadCount:3

36

0

2

4

6

8

10

activeThreadCount

Page 37: Building a near real time search engine & analytics for logs using solr

Architecture

Page 38: Building a near real time search engine & analytics for logs using solr

Data Flows

38

Weird Log File

Zero Copy server

Kafka Broker

Indexing Server

Search UI

Periodic Push

Real time transfer from In Memory (Log4jAppender)

o Zero Copy server - Deployed on each Indexing server for data locality - Write incoming files to disk as Indexing server doesn’t index with same rate

o Kafka Broker o Kafka Appender pass the messages from in-Memory

Page 39: Building a near real time search engine & analytics for logs using solr

Periodic Push

39

Zookeeper

Indexing Server

Zero copy server

Node 1

Production

Server

Logs transfer

Daemon

Disk

Node…n

.

.

.

Page 40: Building a near real time search engine & analytics for logs using solr

Real time transfer

40

Indexing Server

Kafka Broker

Indexing Server

Search UI Production

Server (Kafka Appender)

Zookeeper

Indexing Server

Indexing Server

Update Consumed Message offset

Page 41: Building a near real time search engine & analytics for logs using solr

Conclusion Lessons Learned

• Always find sweet-spots for – Number of Indexer threads, that can run in parallel – Randomize

• Merge factor • Commit Interval • ramBufferSize

– Increasing Cache Size helps in bringing down search latency • but with Full GC penalty

• Index size of more than 5GB in one core does not go well with Search

• Search on a lot of cores does not provide optimal response time – Overall query response time is limited by slowest shard’s performance

• Solr scales both vertically and horizontally

• Batching of log messages based on message size (~10KB) in a MessageSet – Kafka adds 10 bytes on each message – Most of the time Log messages are < 100 bytes

41

Page 42: Building a near real time search engine & analytics for logs using solr

Thank You [email protected]

42