Search on Hadoop
DESCRIPTION
Patrick Hunt (Cloudera) talks about search capabilities of Hadoop
TRANSCRIPT
Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem
Patrick Hunt (@phunt) Big Data Gurus Meetup July 2013
Agenda
• Big Data and Search – setting the stage
• Cloudera Search's Architecture
• Component deep dive
• Early performance insights
• What's next?
Feel free to ask questions as we go!
Why Search?
An Integrated Part of the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface
Search Simplifies Interaction
• User Goals
  • Explore
  • Navigate
  • Correlate
• Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
Benefits of Search
• Improved Big Data ROI
  • An interactive experience without technical knowledge
  • Single data set for multiple computing frameworks
• Faster time to insight
  • Exploratory analysis, esp. unstructured data
  • Broad range of indexing options to accommodate needs
• Cost efficiency
  • Single scalable platform; no incremental investment
  • No need for separate systems, storage
• Solid foundations and reliability
  • Apache Solr in production environments for years
  • Hadoop-powered reliability and scalability
What is Cloudera Search?
• Full-text, interactive search and faceted navigation
• Batch, near real-time, and on-demand indexing
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
• Open Source
  • 100% Apache, 100% Solr
  • Standard Solr APIs
Cloudera Search Components
• Refresher – HDFS/MR/Lucene/Solr/SolrCloud
• HDFSDirectoryFactory/HDFSDirectory
• BlockDirectory/BlockDirectoryCache
• Near Real Time (NRT) indexing
  • Apache Flume MorphlineSolrSink
  • Lily HBase Indexer
• Batch – MapReduce Indexer
• ETL – Cloudera Morphlines
• Hue Search Application
Apache Hadoop
• Apache HDFS
  • Distributed file system
  • High reliability
  • High throughput
• Apache MapReduce
  • Parallel, distributed programming model
  • Allows processing of large datasets
  • Fault tolerant
Apache Lucene
• Full text search
  • Indexing
  • Query
• Traditional inverted index
• Batch and incremental indexing
• We are using version 4 (4.3 currently)
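To make the inverted-index bullet concrete, here is a toy sketch in Python (illustrative only – Lucene's real index adds term positions, scoring statistics, compressed postings, and much more):

```python
from collections import defaultdict

# Toy inverted index: maps each term to the set of document ids
# that contain it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: a document must contain every query term.
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "hadoop distributed file system",
    2: "solr search built on lucene",
    3: "lucene full text search",
}
index = build_index(docs)
print(search(index, "lucene search"))  # doc ids containing both terms
```

A query never scans the documents themselves; it intersects the postings for each term, which is why search stays fast as the corpus grows.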
Apache Solr
• Search service built using Lucene • Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
  • Indexing
  • Query
  • Administrative interface
• Also rich web admin GUI via HTTP
Apache SolrCloud
• Provides distributed search capability
• Part of Solr (not a separate library/codebase)
• Shards - both vertically and horizontally scalable
  • Horizontally – partition index for size
  • Vertically – replicate for query performance
• Uses ZooKeeper for coordination
  • No split-brain issues
  • Simplifies operations
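The horizontal-partitioning idea can be sketched with a toy hash router. This is illustrative only: SolrCloud's actual compositeId router maps a MurmurHash of the document id into per-shard hash ranges, but the principle – a stable hash decides which shard owns a document – is the same.

```python
import hashlib

# Toy document-to-shard router (NOT SolrCloud's real algorithm).
# A stable hash of the id, modulo the shard count, picks the shard.
def route(doc_id: str, num_shards: int) -> int:
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

# Each shard may additionally have replicas for query throughput
# (the "vertical" scaling axis from the slide).
shards = 4
assignments = {d: route(d, shards) for d in ["doc1", "doc2", "doc3"]}
print(assignments)
```

Because routing is deterministic, both indexing and querying can locate the owning shard without any central lookup table – ZooKeeper only has to track which nodes host which shards.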
Distributed Search on Hadoop
[Diagram: Flume, the Hue UI, custom UIs, and custom apps send queries to a SolrCloud of Solr servers; index updates flow in from MapReduce, HDFS, and HBase in the Hadoop cluster.]
High Level View
[Diagram: layered architecture – HDFS at the base; Lucene and Solr on top of it; ZooKeeper coordinating SolrCloud; querying and indexing APIs exposed at the top.]
Solr on HDFS
• Scalable, cost-efficient index storage
• Higher availability
• Search and process data in one platform
Cloudera Upstream Contributions
• SOLR-3911 – Directory/DirectoryFactory now first class
  • Solr replication now uses the Directory abstraction
  • Solr Admin UI no longer assumes local directory access
• SOLR-4916 – support for reading/writing Solr index files and transaction log files to/from HDFS
  • HDFSDirectoryFactory/HDFSDirectory implementation
• SOLR-4655 – The Overseer should assign node names by default
• SOLR-3706 – Ship setup to log with log4j
• SOLR-4494 – Clean up and polish Collections API
• SOLR-4718 – Improvements to configurability
  • Configuration now entirely through ZooKeeper (optional)
• Many more improvements/cleanup/hardening/…
Lucene Directory abstraction
• It's how Lucene interacts with index files
• Solr uses it too, but support was spotty prior to 4.x

  class Directory {
    listAll();
    createOutput(file, context);
    openInput(file, context);
    deleteFile(file);
    makeLock(file);
    clearLock(file);
    …
  }
HDFSDirectory
• Originally implemented against Lucene 3 by Blur
• Cloudera ported to Lucene 4 and now upstream
• Solr trunk and version 4.4 (upcoming)
• Uses the HDFS client API

  import org.apache.hadoop.fs.FileSystem;

  public IndexInput openInput(file, context) {
    …
    _inputStream = fileSystem.open(path, bufferSize);
    …
  }
HDFSDirectoryFactory
• Enables plugging HDFSDirectory into Solr
• Configurable through solrconfig.xml
• Also handles
  • Directory configuration
  • Compositing of Directory(s)
    • NRTCachingDirectory
    • BlockDirectory/BlockDirectoryCache
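As a sketch of the solrconfig.xml wiring this describes – the factory class and the solr.hdfs.* parameter names below follow the Cloudera Search beta documentation, and the paths/hostnames are placeholders; verify everything against your release:

```xml
<!-- Sketch: plug the HDFS directory factory into a Solr core. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <!-- BlockDirectory cache (off-heap by default, size configurable) -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
</directoryFactory>
```

The factory composites the directories mentioned above: the block cache wraps the raw HDFSDirectory, and NRTCachingDirectory can sit on top for near-real-time segments.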
BlockDirectory/BlockDirectoryCache
• In-memory cache of index file blocks
• Caches on read, in some cases on write
• Compensates for a less effective file system cache
• Uses DirectByteBuffer, not JVM heap (default)
• Size configurable by user
Near Real Time Indexing with Flume
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest

[Diagram: Flume agents with embedded indexers read log files and index events into Solr while also writing them to HDFS.]
Apache Flume - MorphlineSolrSink
• A Flume Source…
  • Receives/gathers events
• A Flume Channel…
  • Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
  • Sends the events on to the next location
• Flume MorphlineSolrSink
  • Integrates the Cloudera Morphlines library
  • ETL, more on that in a bit
  • Does batching
  • Results sent to Solr for indexing
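A minimal flume.conf sketch for this pipeline – the agent, source, and channel names here are invented for illustration, and the sink's fully qualified type and morphlineFile property should be checked against your Flume/Search release:

```properties
# Hypothetical agent wiring a syslog source through a memory channel
# to the MorphlineSolrSink.
agent.sources = syslogSrc
agent.channels = memCh
agent.sinks = solrSink

agent.sources.syslogSrc.type = syslogtcp
agent.sources.syslogSrc.port = 5140
agent.sources.syslogSrc.channels = memCh

# Swap for a FileChannel if you need durability across restarts.
agent.channels.memCh.type = memory

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = memCh
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
```

The sink batches events, runs each through the configured morphline, and sends the resulting documents to Solr for indexing.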
Near Real Time indexing of Apache HBase
[Diagram: HBase serves interactive load on top of HDFS; indexers trigger on replication updates and feed a cluster of Solr servers. HBase (planet-sized tabular data, immediate access & updates) + Search (fast & flexible information discovery) = big data management.]
Lily HBase Indexer
• Collaboration between NGData & Cloudera
  • NGData are creators of the Lily data management platform
• Lily HBase Indexer
  • Service which acts as an HBase replication listener
  • HBase replication features, such as filtering, supported
  • Replication updates trigger indexing of updates (rows)
  • Integrates the Cloudera Morphlines library for ETL of rows
  • AL2 licensed on GitHub: https://github.com/ngdata
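An indexer definition for the Lily HBase Indexer looks roughly like the fragment below. The table name and morphline path are placeholders, and the mapper class name is an assumption based on the project's documentation of that era – verify against the version you deploy:

```xml
<!-- Hypothetical indexer definition: index rows of the "record"
     table into Solr, transforming each row with a morphline. -->
<indexer table="record"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
</indexer>
```

Because the indexer subscribes to HBase replication rather than intercepting writes, it adds no latency to the HBase write path.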
Scalable Batch Indexing
[Diagram: MapReduce indexers read files from HDFS and build index shards, which are served by Solr servers.]

Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
Scalable Batch Indexing
[Diagram: mappers parse input into indexable documents; arbitrary reducing steps of indexing and merging follow; end-reducers index the documents into shard 1 and shard 2.]
MapReduce Indexer
MapReduce job with two parts:
1) Scan HDFS for files to be indexed
  • Much like Unix "find" – see HADOOP-8989
  • Output is an NLineInputFormat'ed file
2) Mapper/Reducer indexing step
  • Mapper extracts content via Cloudera Morphlines
  • Reducer indexes documents via an embedded Solr server
  • Originally based on SOLR-1301
  • Many modifications to enable linear scalability
MapReduce Indexer "golive"
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and batch (high latency, cheap at scale) indexing
• Results of the MR indexing operation are immediately merged into a live SolrCloud serving cluster
  • No downtime for users
  • No NRT expense
  • Linear scale-out to the size of your MR cluster
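A batch-indexing run with golive might be invoked roughly as follows – the jar path, HDFS paths, hosts, and collection name are placeholders, and the flag names should be checked against the tool's --help output for your release:

```shell
# Sketch: build index shards with MapReduce, then merge them live
# into the running SolrCloud cluster (--go-live).
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://namenode:8020/user/solr/outdir \
  --zk-host zk01:2181/solr \
  --collection logs \
  --go-live \
  hdfs://namenode:8020/user/flume/logs
```

Without --go-live the job simply leaves the shards in the output directory, which is useful for on-demand or scheduled re-indexing.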
Cloudera Morphlines
• Open-source framework for simple ETL
• Ships as part of the Cloudera Developer Kit (CDK)
  • It's a Java library
  • AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats
  • Avro
  • Sequence file
  • Text
  • Etc…
Cloudera Morphlines Architecture
[Diagram: Flume, the MR Indexer, the HBase indexer, etc. – or your own application – embed the Morphline library, which takes anything you want to index (logs, tweets, social media, HTML, images, PDF, text…) and loads it into SolrCloud. Morphlines can be embedded in any application.]
Extraction and Mapping
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Diagram: a syslog event enters a Flume agent; inside the Solr sink, the morphline library runs the commands readLine, grok, and loadSolr, turning the event into records and finally a Solr document.]
Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
  • Single-line records, multi-line records, CSV files
• Regex-based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
Current Command Library (cont)
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
Example Input

  <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record

  syslog_pri:164
  syslog_timestamp:Feb 4 10:46:14
  syslog_hostname:syslog
  syslog_program:sshd
  syslog_pid:607
  syslog_message:listening on 0.0.0.0 port 22
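For readers without a grok dictionary handy, the same extraction can be sketched in plain Python, with the grok patterns (POSINT, SYSLOGTIMESTAMP, …) hand-expanded into named regex groups:

```python
import re

# Hand-expanded equivalent of the grok expression above.
# Each named group corresponds to one grok capture.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>[\w./-]+)(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)

line = "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
record = SYSLOG.match(line).groupdict()
print(record["syslog_program"], record["syslog_pid"])  # sshd 607
```

Grok is essentially this, plus a dictionary of reusable named sub-patterns, which is why morphline configs stay readable.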
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill-down
• Customizable display
• Full-text search, standard Solr API and query language
Performance
• Cloudera internal testing results
• Cisco WebEx results from Hadoop Summit 2013
Cloudera Internal Testing
• We've looked at
  • NRT and batch indexing
  • Query performance
• Performance has been similar to Solr on local disk
  • Indexing/query operations are typically CPU bound
  • Caching obviously plays a big factor for queries
• Limited use cases explored – public beta helping here!
Details shared by WebEx at 2013 Summit
• Cisco presented on their use of Flume, Cloudera Search, and Cloudera Morphlines
• Indexing log events in near real time via Flume
• Cisco UCS C240 M3 servers
  • 2 quad-core CPUs @ 2.3 GHz
  • 16 GB RAM
  • 12 x 3 TB storage
• Ingest rate
  • 70k events/sec, 1.2 TB/day inbound
What’s next
• Usability – "solrctl"
• Security
  • Index, document, and (eventually) field-level security
• Lots of scalability/performance work to be done
  • What are the best Solr/Lucene settings for HDFS?
  • Investigate short-circuit HDFS reads
  • BlockDirectoryCache tuning
  • HDFS block affinity
• More sophisticated index management
  • Take advantage of collection alias support (SOLR-4497)
Conclusion
• Cloudera Search now in public beta
  • Free download
  • Extensive documentation
  • Send your questions and feedback to [email protected]
  • Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
  • Simple management of Search
  • Free download
• QuickStart VM also available!