search on hadoop frontier meetup

38
1 Adding Search to the Hadoop Ecosystem Gregory Chanan (gchanan AT cloudera.com) Frontier Meetup Dec 2013

Upload: gregchanan

Post on 10-May-2015

368 views

Category:

Technology


7 download

TRANSCRIPT

Page 1: Search On Hadoop Frontier Meetup

1

Adding Search to the Hadoop Ecosystem

Gregory Chanan (gchanan AT cloudera.com) Frontier Meetup Dec 2013

Page 2: Search On Hadoop Frontier Meetup

Agenda

• Big Data and Search – setting the stage

• Cloudera Search Architecture

• Component deep dive

• Security

• Conclusion

Page 3: Search On Hadoop Frontier Meetup

Why Search?

• Hadoop for everyone

• Typical case:

• Ingest data to storage engine (HDFS, HBase, etc)

• Process data (MapReduce, Hive, Impala)

• Experts know MapReduce

• Savvy people know SQL

• Everyone knows Search!

Page 4: Search On Hadoop Frontier Meetup

Why Search?

An Integrated Part of the Hadoop System

One pool of data

One security framework

One set of system resources

One management interface

Page 5: Search On Hadoop Frontier Meetup

Benefits of Search

• Improved Big Data ROI

• An interactive experience without technical knowledge

• Single data set for multiple computing frameworks

• Faster time to insight

• Exploratory analysis, esp. unstructured data

• Broad range of indexing options to accommodate needs

• Cost efficiency

• Single scalable platform; no incremental investment

• No need for separate systems, storage

Page 6: Search On Hadoop Frontier Meetup

What is Cloudera Search?

• Full-text, interactive search with faceted navigation

• Apache Solr integrated with CDH

• Established, mature search with vibrant community

• In production environments for years

• Open Source

• 100% Apache, 100% Solr

• Standard Solr APIs

• Batch, near real-time, and on-demand indexing

• Generally Available; released 1.1 last month

Page 7: Search On Hadoop Frontier Meetup

Cloudera Search Components

• HDFS/MR/Lucene/Solr/SolrCloud

• Indexing

• Near Real Time (NRT) indexing

• Batch

• ETL – Cloudera Morphlines

• Querying

Page 8: Search On Hadoop Frontier Meetup

Apache Hadoop

• Apache HDFS

• Distributed file system

• High reliability

• High throughput

• Apache MapReduce

• Parallel, distributed programming model

• Allows processing of large datasets

• Fault tolerant

Page 9: Search On Hadoop Frontier Meetup

Apache Lucene

• Full text search

• Indexing

• Query

• Traditional inverted index

• Batch and Incremental indexing

• We are using version 4.4 in current release

Page 10: Search On Hadoop Frontier Meetup

Apache Solr

• Search service built using Lucene

• Ships with Lucene (same TLP at Apache)

• Provides XML/HTTP/JSON/Python/Ruby/… APIs

• Indexing

• Query

• Administrative interface

• Also rich web admin GUI via HTTP

Page 11: Search On Hadoop Frontier Meetup

Apache SolrCloud

• Provides distributed Search capability

• Part of Solr (not a separate library/codebase)

• Shards – provide scalability

• partition index for size

• replicate for query performance

• Uses ZooKeeper for coordination

• No split-brain issues

• Simplifies operations

Page 12: Search On Hadoop Frontier Meetup

SolrCloud Architecture

• Updates automatically sent to the correct shard

• Replicas handle queries, forward updates to the leader

• Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.

Page 13: Search On Hadoop Frontier Meetup

SolrCloud Architecture

Visual representation via admin UI

Page 14: Search On Hadoop Frontier Meetup

Distributed Search on Hadoop

Flume Hue UI

Custom UI

Custom App

Solr

Solr

Solr

SolrCloud query

query

query

index

Hadoop Cluster

MR

HDFS

index

HBase index

ZK

Page 15: Search On Hadoop Frontier Meetup

Indexing

• Near Real Time (NRT)

• Flume

• HBase Indexer

• Batch (MR)

Page 16: Search On Hadoop Frontier Meetup

Indexing

• Near Real Time (NRT)

• Flume

• HBase Indexer

• Batch (MR)

Page 17: Search On Hadoop Frontier Meetup

Near Real Time Indexing with Flume

Log File Solr and Flume • Data ingest at scale • Flexible extraction and

mapping • Indexing at data ingest

HDFS

Flume Agent

Indexer

Other Log File

Flume Agent

Indexer

17

Page 18: Search On Hadoop Frontier Meetup

Apache Flume - MorphlineSolrSink

• A Flume Source… • Receives/gathers events

• A Flume Channel… • Carries the event – MemoryChannel or reliable FileChannel

• A Flume Sink… • Sends the events on to the next location

• Flume MorphlineSolrSink • Integrates Cloudera Morphlines library

• ETL, more on that in a bit

• Does batching

• Results sent to Solr for indexing

Page 19: Search On Hadoop Frontier Meetup

Indexing

• Near Real Time (NRT)

• Flume

• HBase Indexer

• Batch (MR)

Page 20: Search On Hadoop Frontier Meetup

Near Real Time Indexing of Apache HBase

HDFS

HBase

inte

ract

ive

load

HBase Indexer(s)

Rep

licat

ion

Solr server Solr server Solr server Solr server Solr server

Sear

ch

+ = planet-sized tabular data immediate access & updates fast & flexible information discovery

B I G D ATA D ATA M A N A G E M E N T

Page 21: Search On Hadoop Frontier Meetup

Lily HBase Indexer

• Collaboration between NGData & Cloudera

• NGData are creators of the Lily data management platform

• Lily HBase Indexer

• Service which acts as a HBase replication listener • HBase replication features, such as filtering, supported

• Replication updates trigger indexing of updates (rows)

• Integrates Cloudera Morphlines library for ETL of rows

• AL2 licensed on github https://github.com/ngdata

Page 22: Search On Hadoop Frontier Meetup

Indexing

• Near Real Time (NRT)

• Flume

• HBase Indexer

• Batch (MR)

Page 23: Search On Hadoop Frontier Meetup

Scalable Batch Indexing

Index shard

Files

Index shard

Indexer

Files

Solr server

Indexer

Solr server

23

HDFS

Solr and MapReduce • Flexible, scalable batch

indexing • Start serving new indices

with no downtime • On-demand indexing, cost-

efficient re-indexing

Page 24: Search On Hadoop Frontier Meetup

MapReduce Indexer

MapReduce Job with two parts

1) Scan HDFS for files to be indexed

• Much like Unix “find” – see HADOOP-8989

• Output is NLineInputFormat’ed file

2) Mapper/Reducer indexing step

• Mapper extracts content via Cloudera Morphlines

• Reducer indexes documents via embedded Solr server

• Originally based on SOLR-1301 • Many modifications to enable linear scalability

Page 25: Search On Hadoop Frontier Meetup

MapReduce Indexer “golive”

• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing

• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster

• No downtime for users

• No NRT expense

• Linear scale out to the size of your MR cluster

Page 26: Search On Hadoop Frontier Meetup

HBase + MapReduce

• New in search 1.1: run MapReduce job over HBase tables

• Same architecture as running over HDFS

• Similar to HBase’s CopyTable,

Page 27: Search On Hadoop Frontier Meetup

Cloudera Morphlines

• Open Source framework for simple ETL

• Simplify ETL • Built-in commands and library support (Avro format, Hadoop

SequenceFiles, grok for syslog messages)

• Configuration over coding

• Standardize ETL

• Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)

• It’s a Java library

• AL2 licensed on github https://github.com/kite-sdk

Page 28: Search On Hadoop Frontier Meetup

Cloudera Morphlines Architecture

Solr

Solr

Solr

SolrCloud

Logs, tweets, social media, html,

images, pdf, text….

Anything you want to index

Flume, MR Indexer, HBase indexer, etc... Or your application!

Morphline Library

Morphlines can be embedded in any application…

Page 29: Search On Hadoop Frontier Meetup

Extraction and Mapping

• Modeled after Unix pipelines

• Simple and flexible data transformation

• Reusable across multiple index workloads

• Over time, extend and re-use across platform workloads

syslog Flume Agent

Solr sink

Command: readLine

Command: grok

Command: loadSolr

Solr

Event

Record

Record

Record

Document

Mo

rph

line

Lib

rary

Page 30: Search On Hadoop Frontier Meetup

Morphline Example – syslog with grok

morphlines : [ { id : morphline1 importCommands : ["com.cloudera.**", "org.apache.solr.**"] commands : [ { readLine {} } { grok { dictionaryFiles : [/tmp/grok-dictionaries] expressions : { message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}""" } } } { loadSolr {} } ] } ]

Example Input <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22 Output Record syslog_pri:164 syslog_timestamp:Feb 4 10:46:14 syslog_hostname:syslog syslog_program:sshd syslog_pid:607 syslog_message:listening on 0.0.0.0 port 22.

Page 31: Search On Hadoop Frontier Meetup

Current Command Library

• Integrate with and load into Apache Solr

• Flexible log file analysis

• Single-line record, multi-line records, CSV files

• Regex based pattern matching and extraction

• Integration with Avro

• Integration with Apache Hadoop Sequence Files

• Integration with SolrCell and all Apache Tika parsers

• Auto-detection of MIME types from binary data using Apache Tika

Page 32: Search On Hadoop Frontier Meetup

Current Command Library (cont)

• Scripting support for dynamic java code

• Operations on fields for assignment and comparison

• Operations on fields with list and set semantics

• if-then-else conditionals

• A small rules engine (tryRules)

• String and timestamp conversions

• slf4j logging

• Yammer metrics and counters

• Decompression and unpacking of arbitrarily nested container file formats

• Etc…

Page 33: Search On Hadoop Frontier Meetup

Querying

• Built-in solr web UI

• Write your own

• Hue

Page 34: Search On Hadoop Frontier Meetup

Simple, Customizable Search Interface

Hue • Simple UI • Navigated, faceted drill

down • Customizable display • Full text search,

standard Solr API and query language

Page 35: Search On Hadoop Frontier Meetup

Security

• Upstream Solr doesn’t deal with security

• Search 1.0 supports kerberos authentication

• Similar to Oozie / WebHDFS

• Search 1.1 supports index-level authorization via Apache Sentry (incubating)

Page 36: Search On Hadoop Frontier Meetup

Index-Level Authorization

• Sentry works via “policy files” stored in HDFS

• Can grant roles administrative-only, query-only, update-only access

• Example:

[groups]

# Assigns each Hadoop group to its set of roles

dev_ops = engineer_role, ops_role

[roles]

engineer_role = collection = source_code->action=* ops_role = collection = hbase_logs->action=Query

Page 37: Search On Hadoop Frontier Meetup

Index-Level Authorization 2

• Works by hooking into Solr RequestHandlers: <requestHandler name="/update“ class="solr.UpdateRequestHandler">

<lst name="defaults“>

<str name="update.chain">updateIndexAuthorization</str>

</lst>

</requestHandler>

• Also includes secure impersonation support

• Unauthorized attempts get a 401 response and are written to the solr log

• Future work: more fine grain authorization

Page 38: Search On Hadoop Frontier Meetup

Conclusion

• Cloudera Search now Generally Available (1.1)

• Free Download

• Extensive documentation

• Send your questions and feedback to [email protected]

• Take the Search online training

• Cloudera Manager Standard (i.e. the free version)

• Simple management of Search

• Free Download

• QuickStart VM also available!