Search on Hadoop
DESCRIPTION
Patrick Hunt (Cloudera) talks about search capabilities of Hadoop
TRANSCRIPT
Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem
Patrick Hunt (@phunt) Big Data Gurus Meetup July 2013
Agenda
• Big Data and Search – setting the stage
• Cloudera Search's Architecture
• Component deep dive
• Early performance insights
• What's next?
Feel free to ask questions as we go!
Why Search?
An Integrated Part of the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface
Search Simplifies Interaction
• User Goals
  • Explore
  • Navigate
  • Correlate
• Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
Benefits of Search
• Improved Big Data ROI
  • An interactive experience without technical knowledge
  • Single data set for multiple computing frameworks
• Faster time to insight
  • Exploratory analysis, esp. unstructured data
  • Broad range of indexing options to accommodate needs
• Cost efficiency
  • Single scalable platform; no incremental investment
  • No need for separate systems, storage
• Solid foundations and reliability
  • Apache Solr in production environments for years
  • Hadoop-powered reliability and scalability
What is Cloudera Search?
• Full-text, interactive search and faceted navigation
• Batch, near real-time, and on-demand indexing
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
• Open Source
  • 100% Apache, 100% Solr
  • Standard Solr APIs
Cloudera Search Components
• Refresher – HDFS/MR/Lucene/Solr/SolrCloud
• HDFSDirectoryFactory/HDFSDirectory
• BlockDirectory/BlockDirectoryCache
• Near Real Time (NRT) indexing
  • Apache Flume MorphlineSolrSink
  • Lily HBase Indexer
• Batch – MapReduce Indexer
• ETL – Cloudera Morphlines
• Hue Search Application
Apache Hadoop
• Apache HDFS
  • Distributed file system
  • High reliability
  • High throughput
• Apache MapReduce
  • Parallel, distributed programming model
  • Allows processing of large datasets
  • Fault tolerant
Apache Lucene
• Full text search
  • Indexing
  • Query
• Traditional inverted index
• Batch and incremental indexing
• We are using version 4 (4.3 currently)
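To make the inverted-index bullet concrete, here is a toy sketch in Python (illustrative only – Lucene's real index adds term positions, scoring statistics, compressed postings, and much more):

```python
from collections import defaultdict

# Toy inverted index: maps each term to the set of document ids
# that contain it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: a document must contain every query term.
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "hadoop distributed file system",
    2: "solr search built on lucene",
    3: "lucene full text search",
}
index = build_index(docs)
print(search(index, "lucene search"))  # doc ids containing both terms
```

A query never scans the documents themselves; it intersects the postings for each term, which is why search stays fast as the corpus grows.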
Apache Solr
• Search service built using Lucene • Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
  • Indexing
  • Query
  • Administrative interface
• Also rich web admin GUI via HTTP
Apache SolrCloud
• Provides distributed search capability
• Part of Solr (not a separate library/codebase)
• Shards - both vertically and horizontally scalable
  • Horizontally – partition index for size
  • Vertically – replicate for query performance
• Uses ZooKeeper for coordination
  • No split-brain issues
  • Simplifies operations
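The horizontal-partitioning idea can be sketched with a toy hash router. This is illustrative only: SolrCloud's actual compositeId router maps a MurmurHash of the document id into per-shard hash ranges, but the principle – a stable hash decides which shard owns a document – is the same.

```python
import hashlib

# Toy document-to-shard router (NOT SolrCloud's real algorithm).
# A stable hash of the id, modulo the shard count, picks the shard.
def route(doc_id: str, num_shards: int) -> int:
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

# Each shard may additionally have replicas for query throughput
# (the "vertical" scaling axis from the slide).
shards = 4
assignments = {d: route(d, shards) for d in ["doc1", "doc2", "doc3"]}
print(assignments)
```

Because routing is deterministic, both indexing and querying can locate the owning shard without any central lookup table – ZooKeeper only has to track which nodes host which shards.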
Distributed Search on Hadoop
[Diagram: Flume, the Hue UI, custom UIs, and custom apps send queries to a SolrCloud of Solr servers; index updates flow in from MapReduce, HDFS, and HBase in the Hadoop cluster.]
High Level View
[Diagram: layered architecture – HDFS at the base; Lucene and Solr on top of it; ZooKeeper coordinating SolrCloud; querying and indexing APIs exposed at the top.]
Solr on HDFS
• Scalable, cost-efficient index storage
• Higher availability
• Search and process data in one platform
Cloudera Upstream Contributions
• SOLR-3911 – Directory/DirectoryFactory now first class
  • Solr replication now uses the Directory abstraction
  • Solr Admin UI no longer assumes local directory access
• SOLR-4916 – support for reading/writing Solr index files and transaction log files to/from HDFS
  • HDFSDirectoryFactory/HDFSDirectory implementation
• SOLR-4655 – The Overseer should assign node names by default
• SOLR-3706 – Ship setup to log with log4j
• SOLR-4494 – Clean up and polish Collections API
• SOLR-4718 – Improvements to configurability
  • Configuration now entirely through ZooKeeper (optional)
• Many more improvements/cleanup/hardening/…
Lucene Directory abstraction
• It's how Lucene interacts with index files
• Solr uses it too, but support was spotty prior to 4.x

  class Directory {
    listAll();
    createOutput(file, context);
    openInput(file, context);
    deleteFile(file);
    makeLock(file);
    clearLock(file);
    …
  }
HDFSDirectory
• Originally implemented against Lucene 3 by Blur
• Cloudera ported to Lucene 4 and now upstream
• Solr trunk and version 4.4 (upcoming)
• Uses the HDFS client API

  import org.apache.hadoop.fs.FileSystem;

  public IndexInput openInput(file, context) {
    …
    _inputStream = fileSystem.open(path, bufferSize);
    …
  }
HDFSDirectoryFactory
• Enables plugging HDFSDirectory into Solr
• Configurable through solrconfig.xml
• Also handles
  • Directory configuration
  • Compositing of Directory(s)
    • NRTCachingDirectory
    • BlockDirectory/BlockDirectoryCache
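As a sketch of the solrconfig.xml wiring this describes – the factory class and the solr.hdfs.* parameter names below follow the Cloudera Search beta documentation, and the paths/hostnames are placeholders; verify everything against your release:

```xml
<!-- Sketch: plug the HDFS directory factory into a Solr core. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <!-- BlockDirectory cache (off-heap by default, size configurable) -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
</directoryFactory>
```

The factory composites the directories mentioned above: the block cache wraps the raw HDFSDirectory, and NRTCachingDirectory can sit on top for near-real-time segments.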
BlockDirectory/BlockDirectoryCache
• In-memory cache of index file blocks
• Caches on read, in some cases on write
• Compensates for a less effective file system cache
• Uses DirectByteBuffer, not JVM heap (default)
• Size configurable by user
Near Real Time Indexing with Flume
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest

[Diagram: Flume agents with embedded indexers read log files and index events into Solr while also writing them to HDFS.]
Apache Flume - MorphlineSolrSink
• A Flume Source…
  • Receives/gathers events
• A Flume Channel…
  • Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
  • Sends the events on to the next location
• Flume MorphlineSolrSink
  • Integrates the Cloudera Morphlines library
  • ETL, more on that in a bit
  • Does batching
  • Results sent to Solr for indexing
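A minimal flume.conf sketch for this pipeline – the agent, source, and channel names here are invented for illustration, and the sink's fully qualified type and morphlineFile property should be checked against your Flume/Search release:

```properties
# Hypothetical agent wiring a syslog source through a memory channel
# to the MorphlineSolrSink.
agent.sources = syslogSrc
agent.channels = memCh
agent.sinks = solrSink

agent.sources.syslogSrc.type = syslogtcp
agent.sources.syslogSrc.port = 5140
agent.sources.syslogSrc.channels = memCh

# Swap for a FileChannel if you need durability across restarts.
agent.channels.memCh.type = memory

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = memCh
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
```

The sink batches events, runs each through the configured morphline, and sends the resulting documents to Solr for indexing.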
Near Real Time indexing of Apache HBase
[Diagram: HBase serves interactive load on top of HDFS; indexers trigger on replication updates and feed a cluster of Solr servers. HBase (planet-sized tabular data, immediate access & updates) + Search (fast & flexible information discovery) = big data management.]
Lily HBase Indexer
• Collaboration between NGData & Cloudera
  • NGData are creators of the Lily data management platform
• Lily HBase Indexer
  • Service which acts as an HBase replication listener
  • HBase replication features, such as filtering, supported
  • Replication updates trigger indexing of updates (rows)
  • Integrates the Cloudera Morphlines library for ETL of rows
  • AL2 licensed on GitHub: https://github.com/ngdata
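An indexer definition for the Lily HBase Indexer looks roughly like the fragment below. The table name and morphline path are placeholders, and the mapper class name is an assumption based on the project's documentation of that era – verify against the version you deploy:

```xml
<!-- Hypothetical indexer definition: index rows of the "record"
     table into Solr, transforming each row with a morphline. -->
<indexer table="record"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
</indexer>
```

Because the indexer subscribes to HBase replication rather than intercepting writes, it adds no latency to the HBase write path.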
Scalable Batch Indexing
[Diagram: MapReduce indexers read files from HDFS and build index shards, which are served by Solr servers.]

Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
Scalable Batch Indexing
[Diagram: mappers parse input into indexable documents; arbitrary reducing steps of indexing and merging follow; end-reducers index the documents into shard 1 and shard 2.]
MapReduce Indexer
MapReduce job with two parts:
1) Scan HDFS for files to be indexed
  • Much like Unix "find" – see HADOOP-8989
  • Output is an NLineInputFormat'ed file
2) Mapper/Reducer indexing step
  • Mapper extracts content via Cloudera Morphlines
  • Reducer indexes documents via an embedded Solr server
  • Originally based on SOLR-1301
  • Many modifications to enable linear scalability
MapReduce Indexer "golive"
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and batch (high latency, cheap at scale) indexing
• Results of the MR indexing operation are immediately merged into a live SolrCloud serving cluster
  • No downtime for users
  • No NRT expense
  • Linear scale-out to the size of your MR cluster
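A batch-indexing run with golive might be invoked roughly as follows – the jar path, HDFS paths, hosts, and collection name are placeholders, and the flag names should be checked against the tool's --help output for your release:

```shell
# Sketch: build index shards with MapReduce, then merge them live
# into the running SolrCloud cluster (--go-live).
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://namenode:8020/user/solr/outdir \
  --zk-host zk01:2181/solr \
  --collection logs \
  --go-live \
  hdfs://namenode:8020/user/flume/logs
```

Without --go-live the job simply leaves the shards in the output directory, which is useful for on-demand or scheduled re-indexing.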
Cloudera Morphlines
• Open-source framework for simple ETL
• Ships as part of the Cloudera Developer Kit (CDK)
  • It's a Java library
  • AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats
  • Avro
  • Sequence file
  • Text
  • Etc…
Cloudera Morphlines Architecture
[Diagram: Flume, the MR Indexer, the HBase indexer, etc. – or your own application – embed the Morphline library, which takes anything you want to index (logs, tweets, social media, HTML, images, PDF, text…) and loads it into SolrCloud. Morphlines can be embedded in any application.]
Extraction and Mapping
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Diagram: a syslog event enters a Flume agent; inside the Solr sink, the morphline library runs the commands readLine, grok, and loadSolr, turning the event into records and finally a Solr document.]
Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
  • Single-line records, multi-line records, CSV files
• Regex-based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
Current Command Library (cont)
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
Example Input

  <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record

  syslog_pri:164
  syslog_timestamp:Feb 4 10:46:14
  syslog_hostname:syslog
  syslog_program:sshd
  syslog_pid:607
  syslog_message:listening on 0.0.0.0 port 22
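For readers without a grok dictionary handy, the same extraction can be sketched in plain Python, with the grok patterns (POSINT, SYSLOGTIMESTAMP, …) hand-expanded into named regex groups:

```python
import re

# Hand-expanded equivalent of the grok expression above.
# Each named group corresponds to one grok capture.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>[\w./-]+)(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)

line = "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
record = SYSLOG.match(line).groupdict()
print(record["syslog_program"], record["syslog_pid"])  # sshd 607
```

Grok is essentially this, plus a dictionary of reusable named sub-patterns, which is why morphline configs stay readable.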
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill-down
• Customizable display
• Full-text search, standard Solr API and query language
Performance
• Cloudera internal testing results
• Cisco WebEx results from Hadoop Summit 2013
Cloudera Internal Testing
• We've looked at
  • NRT and batch indexing
  • Query performance
• Performance has been similar to Solr on local disk
  • Indexing/query operations are typically CPU bound
  • Caching obviously plays a big factor for queries
• Limited use cases explored – public beta helping here!
Details shared by WebEx at 2013 Summit
• Cisco presented on their use of Flume, Cloudera Search, and Cloudera Morphlines
• Indexing log events in near real time via Flume
• Cisco UCS C240 M3 servers
  • 2 quad-core CPUs @ 2.3 GHz
  • 16 GB RAM
  • 12 x 3 TB storage
• Ingest rate
  • 70k events/sec, 1.2 TB/day inbound
What’s next
• Usability – "solrctl"
• Security
  • Index, document, and (eventually) field-level security
• Lots of scalability/performance work to be done
  • What are the best Solr/Lucene settings for HDFS?
  • Investigate short-circuit HDFS reads
  • BlockDirectoryCache tuning
  • HDFS block affinity
• More sophisticated index management
  • Take advantage of collection alias support (SOLR-4497)
Conclusion
• Cloudera Search now in public beta
  • Free download
  • Extensive documentation
  • Send your questions and feedback to [email protected]
  • Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
  • Simple management of Search
  • Free download
• QuickStart VM also available!