solrcloud on hadoop

1

SolrCloud on Hadoop Cleveland Big Data and Hadoop User Group, January 2014 Alex Moundalexis [email protected] @technmsg

Disclaimer

•  Technologies, not products •  Cloudera builds things soHware

•  most donated to Apache •  some closed-‐source

•  I will likely menMon “Cloudera Something” •  Cloudera “products” I reference are open source

•  Apache Licensed •  Source code is on GitHub

•  hRps://github.com/cloudera

2

What This Talk Isn’t About

•  Deploying •  Puppet, Chef, Ansible, homegrown scripts, intern labor

•  Sizing & Tuning •  Depends heavily on data and workload

•  Coding •  Unless you count XML or CSV

•  Algorithms

3

4

Quick and dirty, more Mme for use cases.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

•  In the beginning, just Hadoop •  HDFS •  MapReduce

•  Today, dozens of interrelated components •  I/O •  Processing •  Specialty ApplicaMons •  ConfiguraMon •  Workflow

5

ParMal Ecosystem

6

Hadoop

external system

RDBMS / DWH

web server

device logs

API access

log collecMon

DB table import

batch processing

machine learning

external system

API access

user

RDBMS / DWH

DB table export

BI tool + JDBC/ODBC

Search

SQL

HDFS

•  Distributed, highly fault-‐tolerant filesystem •  OpMmized for large streaming access to data •  Based on Google File System

•  hRp://research.google.com/archive/gfs.html

7

Lots of Commodity Machines

8

Image:Yahoo! Hadoop cluster [ OSCON ’07 ] Image:Yahoo! Hadoop cluster [ OSCON ’07 ]



Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

Image:Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm •  Batch oriented, not realMme •  Works well with distributed compuMng •  Lots of Java, but other languages supported •  Based on Google’s paper

•  hRp://research.google.com/archive/mapreduce.html

9

Under the Covers

You specify map() and reduce() functions. ��

��The framework does the

rest. 60

Apache HBase

•  Random, realMme read/write access •  Key/value columnar store •  (b|tr)illions of rows/columns •  Based on Google BigTable

•  hRp://research.google.com/archive/bigtable.html

12

Cloudera Hue

•  Hadoop User Experience •  Hadoop is largely command line •  Hue provides a UI for end-‐users •  SDK to build your own apps on top

13

Apache Tika

•  Content analysis toolkit •  Simply put, a lot of parsers •  Detect/extract metadata/text from documents

•  HTML •  XML •  Office •  PDF •  mbox •  More…

14

Apache ZooKeeper

•  Distributed systems are HARD •  Everyone was trying to implement the same subsystems •  Bugs leads to race condiMons, other bad things

•  ZK: Highly reliable distributed coordinaMon services •  ConfiguraMon •  Naming •  SynchronizaMon •  Group Services

15

Cloudera Morphlines

•  In-‐memory transformaMons •  Load, parse, transform, process •  Records as name-‐value pairs w/ opMonal blob/pojo objects

•  Java library, embedded in your codebase •  Used to ETL data from Flume and MR into Solr

•  Was part of CDK, now part of Kite •  hRp://kitesdk.org

16

Apache Lucene

•  Java-‐based index and search •  ranked or sorted results •  hits streamed through QP

•  mem(results) < mem(collecMon)

•  rich/extensible query operators •  bool, phrase, range, span, spaMal

•  Features •  spellchecking •  hit highlighMng •  tokenizaMon

17

Apache Solr

•  Enterprise search plaporm •  Based on Apache Lucene

•  Full-‐text search •  FaceMng •  NRT indexing •  UI

18

Apache Solr – Simple Indexing via CLI

$ java -‐jar post.jar solr.xml money.xml SimplePostTool: version 1.4 SimplePostTool: POSTing files to http://localhost:8983/solr/update.. SimplePostTool: POSTing file solr.xml SimplePostTool: POSTing file money.xml SimplePostTool: COMMITting Solr index changes.. $ post.sh *.xml

19

Apache Solr – Document money.xml

<add> <doc> <field name="id">USD</field> <field name="name">One Dollar</field> <field name="manu">Bank of America</field> <field name="manu_id_s">boa</field> <field name="cat">currency</field> <field name="features">Coins and notes</field> <field name="price_c">1,USD</field> <field name="inStock">true</field> </doc> <doc> <field name="id">EUR</field> <field name="name">One Euro</field>

20

Apache Solr – More Advanced Indexing

•  From DB, using Data Import Handler (DIH) •  Load a CSV file •  POST JSON documents •  Index binary documents (uses Tika) •  SolrJ for programmaMc document creaMon

21

Apache Solr – Querying

•  HTTP GET •  hRp://solr:8983/solr/collecMon1/select/

•  Examples •  ?q=Mmestamp:[* TO NOW] •  ?q=-‐instock:false •  ?q={!lucene q.op=AND df=text}myfield:foo +bar -‐bat

22

Apache Solr – Querying

•  HTTP GET •  hRp://solr:8983/solr/collecMon1/select/?q=video

•  Examples •  &fl=name,id (return only name and id fields) •  &fl=name,id,score (return relevancy score as well) •  &fl=*,score (return all fields + relevancy score) •  &sort=price desc&fl=name,id,price (sort by price desc) •  &wt=json (return response in JSON format)

23

What the Heck is FaceMng?

•  Generate counts for properMes or categories •  Links allow drill-‐down or refine search results

What?

24

Facets on Amazon.com

25

Apache Solr – Facets at Query Time

•  HTTP GET •  hRp://solr:8983/solr/collecMon1/select/?q=video •  All docs, count by category q=*:*&facet=true&facet.field=cat

•  All docs, count by category and in-‐stock status q=*:*&facet=true&facet.field=cat&facet.field=inStock

•  Docs matching “ipod”, count by price (above/below $100) q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]

26

Apache Solr – Querying via UI

27

Apache SolrCloud

•  IntegraMon of Solr + ZooKeeper •  Provides for shard failover

28

Cloudera Search

•  Based on Apache Solr (incl Lucene and SolrCloud) •  Fault-‐tolerance: collecMons backed by HDFS or HBase •  IntegraMon galore:

•  HBase/Flume/MapReduce w/ Lucene •  Hue w/ Solr •  Avro w/ Tika •  HDFS w/ Solr/Lucene •  Sentry w/ Solr

29

Cloudera Search + Hue

30

Cloudera Search + Hue

31

32

Apologies, I swiped some preRy slides from markeMng…

Why Search?

Search Design Strategy

33

One pool of data

One security framework

One set of system resources

One management interface

An Integrated Part of the Hadoop System

Storage

Integra5on

Resource Management

Metad

ata

Batch Processing MAPREDUCE, HIVE & PIG

…

HDFS HBase

TEXT, RCFILE, PARQUET, AVRO, ETC. RECORDS

Engines

InteracMve SQL

CLOUDERA IMPALA

InteracMve Search CLOUDERA SEARCH

Machine Learning MAHOUT

Math & Sta5s5cs

SAS, R

Benefits of Search IntegraMon

34

Improved Big Data ROI §  An interacMve experience without technical knowledge §  Single data set for mulMple compuMng frameworks

Faster Time to Insight §  Exploratory analysis, esp. unstructured data §  Broad range of indexing opMons to accommodate needs

Cost Efficiency §  Single scalable plaporm; no incremental investment §  No need for separate systems, storage

Solid Founda5ons & Reliability §  Solr in producMon environments for years §  Hadoop-‐powered reliability and scalability

35

Some quick examples.

Search Use Cases

Search Use Cases

36

Offer easy access to non-‐technical resources

Explore data prior to processing and modeling

Gain immediate access and find correlaMons in mission-‐criMcal data

Powerful, proven search capabili5es that let organiza5ons:

Monsanto

37

Scalable, efficient image search for analysis and research

Track plant characterisMcs throughout their lifecycle

Before: Manual aRribute extracMon and search queries within database

Now: Parse and index images at acquisiMon and on demand, index archived images in batch

38

Cloudera: Internal Field Portal

Custom Aggregated Search

Cloudera – Internal Field Portal

•  Single stop for field engineers •  Mailing lists: public, private •  Tickets: support, development, public ASF •  Customer data: accounts, clusters, KB arMcles •  Customer Clusters: configs, audits, logs, events •  Books and papers •  Discussion forums

•  Dogfooding, yes • Makes my life easier

39


40


•  Varied fetchers/observers for web/API content •  Content is retrieved via Flume, Sqoop

•  Search indexes and replicates into HBase •  Each collecMon has collecMon-‐specific filters/fields •  Provides Mtle, content snippet, link to original

• Morphlines extracts books and papers using Tika •  Impala for analyMcs

•  Future: Use MapReduce to ingest logs

41

42

ParMng thoughts… in no parMcular order.

Summary

Search Simplifies InteracMon

43

Explore

Navigate

Correlate Experts know MapReduce. Savvy people know SQL.

Everyone knows Search.

Summary

•  With Hadoop, it depends. •  The tools are out there. •  Open source soHware, hooray!

•  Many interconnected pieces •  Many unexplored opportuniMes •  A thriving community awaits you…

•  Data can make a difference. •  Search allows everyone to interact with data.

•  This is a Big Deal.

44

What’s Next?

•  Search examples •  hRp://blog.cloudera.com/blog/category/search/

•  Cloudera provides pre-‐loaded VMs •  hRp://Mny.cloudera.com/quickstartvm

•  Clone our repos! •  hRps://github.com/cloudera

45

46

Preferably related to the talk…

QuesMons?

47

Thank You! Alex Moundalexis [email protected] @technmsg We’re hiring, kids! Well, not kids.

solrcloud on hadoop

Technology