Lucene InputFormat (lightning talk) - TriHUG, December 10, 2013

Lucene InputFormat

and more!

Lookups on HDFS

SequenceFile is great for fast sequential access, but how do you do lookups?

MapFile, BloomMapFile, HBase, Cassandra, et al. all provide one primary-key type index of the data. But what if you want to index all your fields (or at least many of them)?

What if you want search?

Lucene to the rescue!

Lucene is, among many things, a file format.

The stored fields file (fdt) has fast sequential access, so it acts as our “sequence file” of key/values. In addition to this, you get the power of the inverted index and the search capabilities of Lucene.
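To make the "sequence file plus inverted index" idea concrete, here is a toy in-memory sketch in plain Java. It is not Lucene's file format or API: the class `ToyIndex` and its methods are hypothetical names invented for illustration, and terms are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not Lucene's on-disk format or API. */
class ToyIndex {
    // "Stored fields": documents kept in insertion order, addressed by doc ID.
    private final List<Map<String, String>> stored = new ArrayList<>();
    // "Inverted index": "field:term" -> ascending list of doc IDs.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Stores the document and indexes every whitespace-separated term of every field. */
    int add(Map<String, String> doc) {
        int docId = stored.size();
        stored.add(new HashMap<>(doc));
        for (Map.Entry<String, String> field : doc.entrySet()) {
            for (String term : field.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                List<Integer> ids =
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new ArrayList<>());
                // Record each doc ID once, even if the term repeats in the document.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
            }
        }
        return docId;
    }

    /** Index lookup: doc IDs whose field contains the term. */
    List<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(), Collections.emptyList());
    }

    /** Random access to the stored fields of one document. */
    Map<String, String> document(int docId) {
        return stored.get(docId);
    }

    /** Sequential access over all stored documents, like reading a SequenceFile. */
    List<Map<String, String>> scan() {
        return Collections.unmodifiableList(stored);
    }
}
```

With this model, `scan()` plays the role of the sequential read, while `lookup("body", "anarchy")` answers a question about any indexed field without touching the other documents.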

Solr HDFSDirectory

● Start with SOLR-4916 (HDFS support)
● Pull out Solr-specific bits so we can use it with vanilla Lucene
● Backport to Hadoop 1.x

Lucene InputFormat

● Glob HDFS for Lucene instance directories
● Read SegmentInfos and create a split per segment
● Use a MatchAllDocsQuery to quickly iterate through the doc set
● RecordReader returns docs from DocIdSetIterator
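The flow in the list above can be sketched with plain Java stand-ins. This is only a toy model under stated assumptions: `ToySegment`, `ToySplit`, `ToySplitPlanner` and `ToyRecordReader` are hypothetical names, not Hadoop or Lucene classes, documents are plain strings, and deleted documents are ignored.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Stand-in for one Lucene segment: position in the list = segment-local doc ID. */
class ToySegment {
    final String name;
    final List<String> docs;

    ToySegment(String name, List<String> docs) {
        this.name = name;
        this.docs = docs;
    }
}

/** Stand-in for an input split: one segment of one index directory. */
class ToySplit {
    final String indexDir;
    final ToySegment segment;

    ToySplit(String indexDir, ToySegment segment) {
        this.indexDir = indexDir;
        this.segment = segment;
    }
}

class ToySplitPlanner {
    /** Walks every index directory and emits one split per segment. */
    static List<ToySplit> plan(Map<String, List<ToySegment>> indexes) {
        List<ToySplit> splits = new ArrayList<>();
        for (Map.Entry<String, List<ToySegment>> index : indexes.entrySet()) {
            for (ToySegment segment : index.getValue()) {
                splits.add(new ToySplit(index.getKey(), segment));
            }
        }
        return splits;
    }
}

/** Stand-in for the record reader: a match-all query visits every doc ID in order. */
class ToyRecordReader implements Iterator<String> {
    private final ToySegment segment;
    private int nextDocId = 0;

    ToyRecordReader(ToySplit split) {
        this.segment = split.segment;
    }

    @Override
    public boolean hasNext() {
        return nextDocId < segment.docs.size();
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return segment.docs.get(nextDocId++);
    }
}
```

Because each split covers exactly one segment, the readers never overlap, so mappers can process segments in parallel without coordination.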

Lucene InputFormat cont.

● Gives back a Document with the stored fields
● The time spent searching is negligible compared to iterating through docs
● Think of it as a key/value storage format plus an efficient inverted index

Adding a query

Add a simple TermQuery like “key:value” and specify which fields to return

LIF.setLuceneQuery(job, "body:anarchy");

LIF.setLuceneFields(job, "title", "body");

More complex queries?

Use JavaScript to dynamically set more complicated queries

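// Lucene's TermQuery is constructed from a Term (field, text)
var clause1 = new TermQuery(new Term("body", "anarchy"));
var clause2 = new TermQuery(new Term("title", "revolution"));

// Both clauses are required (logical AND)
var query = new BooleanQuery();
query.add(clause1, BooleanClause.Occur.MUST);
query.add(clause2, BooleanClause.Occur.MUST);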
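
For intuition about what two MUST clauses ask for: conceptually, the result is the intersection of the doc ID lists matching each clause. The sketch below shows that idea with a two-pointer merge in plain Java. `ToyConjunction` is a hypothetical name, and this is not how Lucene implements BooleanQuery internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of AND-ing two clauses; not Lucene code. */
class ToyConjunction {
    /** Intersects two ascending doc ID lists with a two-pointer merge. */
    static List<Integer> must(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                // Doc matches both clauses.
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++; // advance the list that is behind
            } else {
                j++;
            }
        }
        return out;
    }
}
```

Here the first list would hold the docs matching body:anarchy and the second the docs matching title:revolution; only doc IDs present in both survive.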

Adding Pig LoadFunc

X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'
    USING DefaultLuceneLoadFunc('body:anarchy')
    AS (title:chararray, date:long, body:chararray);
Y = FOREACH X GENERATE title, date;

(Anarchism,1355654644000)
(Abraham Lincoln,1357087785000)
(Art,1357159249000)
(Anarcho-capitalism,1356671677000)

Adding some schema

● Schema is hard-coded in previous examples
● InputFormat gives back Lucene Document
● Use Avro to reflect a schema onto the Lucene docs when reading/writing
● Similarly, use Avro to reflect a Pig schema

Avro-ified IF and LoadFunc

X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'
    USING AvroLuceneLoadFunc(
        'com.lucid.MyAvroClass',
        'body:anarchy'
    );
Y = FOREACH X GENERATE title, date;

That’s it!

David Arthur
http://mumrah.github.io/

Bonus Slide - Kafka 0.8

Kafka 0.8.0 was released last week!

Now with 100% more logo:

Apache Kafka
