lucene inputformat (lightning talk) - trihug december 10, 2013

Upload: mumrah

Post on 25-May-2015


DESCRIPTION

7 minute overview of some work I did to build a Hadoop InputFormat for Lucene indexes on HDFS. Includes a Pig LoadFunc, and some Avro schema reflection to make things smoother.

TRANSCRIPT

Page 1: Lucene InputFormat (lightning talk) - TriHUG December 10, 2013

Lucene InputFormat (and more!)

Page 2

Lookups on HDFS

SequenceFile is great for fast sequential access, but how do you do lookups?

MapFile, BloomMapFile, HBase, Cassandra, et al. all provide a single primary-key-style index over the data. But what if you want to index all of your fields (or at least many of them)?

What if you want search?

Page 3

Lucene to the rescue!

Lucene is, among other things, a file format.

The stored fields file (fdt) has fast sequential access, so it acts as our “sequence file” of key/values. In addition to this, you get the power of the inverted index and the search capabilities of Lucene.
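To make the "sequence file" analogy concrete, here is a minimal sketch (assuming Lucene on the classpath; class and field names are illustrative, and the API shown is the modern `DirectoryReader` form rather than the exact code from the talk) that walks an index's stored fields in doc-id order:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class StoredFieldsScan {
    public static void main(String[] args) throws Exception {
        // Open any Lucene Directory; on HDFS this would be the HdfsDirectory
        // pulled out of SOLR-4916.
        Directory dir = FSDirectory.open(Paths.get(args[0]));
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            // Doc ids are dense ints, so this is a sequential scan of the
            // stored-fields (.fdt) data -- the "sequence file" of key/values.
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                Document doc = reader.document(docId);
                System.out.println(docId + "\t" + doc.get("title"));
            }
        }
    }
}
```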

Page 4

Solr HdfsDirectory

● Start with SOLR-4916 (HDFS support)
● Pull out the Solr-specific bits so we can use it with vanilla Lucene
● Backport to Hadoop 1.x

Page 5

Lucene InputFormat

● Glob HDFS for Lucene index directories
● Read SegmentInfos and create a split per segment
● Use a MatchAllDocsQuery to quickly iterate through the doc set
● RecordReader returns docs from a DocIdSetIterator
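A rough sketch of the split/record logic (assumptions: Lucene on the classpath, modern API names, and the `scanAll` helper is illustrative -- the real InputFormat reads SegmentInfos directly and drives iteration with a MatchAllDocsQuery, which this simplifies to a per-leaf scan):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.Directory;

public class SegmentSplits {
    // Each leaf of a DirectoryReader is one segment -- the unit the
    // InputFormat turns into an input split.
    public static int scanAll(Directory dir) throws Exception {
        int records = 0;
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            for (LeafReaderContext leaf : reader.leaves()) {
                // The real RecordReader drives this loop with the
                // DocIdSetIterator of a MatchAllDocsQuery; a plain scan over
                // the segment's doc ids is equivalent when nothing is deleted.
                for (int docId = 0; docId < leaf.reader().maxDoc(); docId++) {
                    Document doc = leaf.reader().document(docId);
                    records++;  // emit (leaf.docBase + docId, doc) here
                }
            }
        }
        return records;
    }
}
```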

Page 6

Lucene InputFormat cont.

● Gives back a Document with the stored fields
● The time spent searching is negligible compared to iterating through the docs
● Think of it as a key/value storage format plus an efficient inverted index

Page 7

Adding a query

Add a simple TermQuery like “key:value” and specify which fields to return:

LIF.setLuceneQuery(job, "body:anarchy");

LIF.setLuceneFields(job, "title", "body");

Page 8

More complex queries?

Use JavaScript to dynamically build more complicated queries:

var clause1 = new TermQuery(new Term("body", "anarchy"));

var clause2 = new TermQuery(new Term("title", "revolution"));

var query = new BooleanQuery();

query.add(clause1, BooleanClause.Occur.MUST);

query.add(clause2, BooleanClause.Occur.MUST);

Page 9

Adding a Pig LoadFunc

X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'

USING DefaultLuceneLoadFunc('body:anarchy')

AS (title:chararray, date:long, body:chararray);

Y = FOREACH X GENERATE title, date;

(Anarchism,1355654644000)

(Abraham Lincoln,1357087785000)

(Art,1357159249000)

(Anarcho-capitalism,1356671677000)

Page 10

Demo!

Page 11

Adding some schema

● Schema is hard-coded in the previous examples
● The InputFormat gives back a Lucene Document
● Use Avro to reflect a schema onto the Lucene docs when reading/writing
● Similarly, use Avro to reflect a Pig schema
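The reflection trick can be sketched with Avro's ReflectData (assumes Avro on the classpath; `MyRecord` is a hypothetical POJO mirroring the fields in the Pig examples, not a class from the talk):

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class AvroReflection {
    // Hypothetical POJO matching the (title, date, body) fields used above.
    public static class MyRecord {
        public String title;
        public long date;
        public String body;
    }

    public static void main(String[] args) {
        // ReflectData derives an Avro record schema from the class's fields;
        // the LoadFunc uses the same idea to map stored fields onto a Pig schema.
        Schema schema = ReflectData.get().getSchema(MyRecord.class);
        System.out.println(schema.toString(true));
    }
}
```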

Page 12

Avro-ified IF and LoadFunc

X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'

USING AvroLuceneLoadFunc(

'com.lucid.MyAvroClass',

'body:anarchy'

);

Y = FOREACH X GENERATE title, date;

Page 13

That’s it!

David Arthur
http://mumrah.github.io/

Page 14

Bonus Slide - Kafka 0.8

Kafka 0.8.0 was released last week!

Now with 100% more logo:

[Apache Kafka logo]