High Performance JSON Search and Relational Faceted Browsing with Lucene

Upload: lucenerevolution

Post on 11-May-2015

DESCRIPTION

Presented by Renaud Delbru, Co-Founder, SindiceTech. In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and how they can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene's extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested, schemaless, data-intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. The take-away lessons from this session are an awareness of how Lucene/Solr and Hadoop can be used for relational and graph data search, and that it is now possible to have relational faceted browsers with sub-second response times on commodity hardware.

TRANSCRIPT

Page 1: High Performance JSON Search and Relational Faceted Browsing with Lucene
Page 2: High Performance JSON Search and Relational Faceted Browsing with Lucene

HIGH PERFORMANCE JSON SEARCH AND RELATIONAL FACETED BROWSING WITH LUCENE

Renaud Delbru Co-Founder, SindiceTech

Post-Doctoral Researcher, NUIG

Page 3: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Lucene / Solr

– User for 7 years

– Built a web search engine – sindice.com (700M documents)

• Academia & Research

– Ph.D. in Information Retrieval and Semantic Web

– Post-doctoral researcher at National University of Ireland, Galway

• Industry

– Technical co-founder of SindiceTech

– Management Platform for Enterprise Knowledge Graph

My Background

Page 4: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Nested Data Model

• SIREn Overview & Theory

• SIREn Plugin Architecture

• Relational Faceted Browsing

• Comparison with BlockJoin

Agenda

Page 5: High Performance JSON Search and Relational Faceted Browsing with Lucene

• SQL

– Query-time join performance penalty

• NoSQL

– Denormalisation of relational data into nested data

– Convert many-to-one/many into one-to-many relationships

Nested Data Model: Why is it important?

Page 6: High Performance JSON Search and Relational Faceted Browsing with Lucene

Denormalising Relational Data

[Diagram: relational view of LucidWorks, its funding rounds Series A and Series B, and the investor Granite Ventures]

Page 7: High Performance JSON Search and Relational Faceted Browsing with Lucene

Denormalising Relational Data

[Diagram: denormalised view of the same data, in which Granite Ventures now appears twice]

Page 8: High Performance JSON Search and Relational Faceted Browsing with Lucene

• SQL

– Query-time join performance penalty

• NoSQL

– Denormalisation of relational data into nested data

– Convert many-to-one/many into one-to-many relationships

– Duplicate data …

– … but avoid joins

Nested Data Model: Why is it important?
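The denormalisation step described on this slide can be sketched as follows. This is a minimal illustration only: the row layouts, the `id` columns and the `denormalise` helper are assumptions made for the example, loosely modelled on the LucidWorks diagram, and are not taken from the presentation.

```python
from collections import defaultdict

# Hypothetical relational rows (assumed for illustration).
companies = [{"id": 1, "name": "LucidWorks"}]
rounds = [
    {"id": 10, "company_id": 1, "round_code": "a"},
    {"id": 11, "company_id": 1, "round_code": "b"},
]
investments = [
    {"round_id": 10, "investor": "Granite Ventures"},
    {"round_id": 11, "investor": "Granite Ventures"},
]


def denormalise(companies, rounds, investments):
    """Nest funding rounds under companies and investors under rounds."""
    investors_by_round = defaultdict(list)
    for inv in investments:
        investors_by_round[inv["round_id"]].append({"name": inv["investor"]})

    rounds_by_company = defaultdict(list)
    for r in rounds:
        rounds_by_company[r["company_id"]].append({
            "round_code": r["round_code"],
            # The investor record is copied into every round it took part in.
            "investments": investors_by_round[r["id"]],
        })

    return [
        {"name": c["name"], "funding_rounds": rounds_by_company[c["id"]]}
        for c in companies
    ]


docs = denormalise(companies, rounds, investments)
```

The investor is stored once per funding round, so the data is duplicated, but reading a company document no longer needs a join at query time.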

Page 9: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Model becoming prevalent: JSON, XML, Avro, …

– Can be arbitrarily nested and large

– No strict schema / structure enforced

• Schema-less brings

– Flexibility

– Ease of development

• Developers do not have to invest significant modelling effort upfront

Schema-Less Nested Data Model

Page 10: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Lucene/Solr plugin for indexing and searching JSON

• Rich data model (JSON)

– Nested objects, nested arrays, datatypes

• Schema-agnostic

– No need to define structure (nested model)

– No need to define schema (fields)

Introducing SIREn

Page 11: High Performance JSON Search and Relational Faceted Browsing with Lucene

Overview of the SIREn API

Document:

{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      "investments" : [
        {
          "name" : "Granite Ventures",
          "type" : "financial-org"
        }
      ]
    }
  ]
}

Query:

(category_code : analytics)
AND
(funding_rounds : {
  round_code : seed OR a OR angel,
  raised_amount : [0 TO 12000000],
  * : {
    type : financial-org
  }
})

Page 12: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Inspired by tree-labelling scheme techniques (XML IR)

– Label each node with a hierarchical identifier (here, Dewey identifiers)

• Full-text search operators over the content of a node

• Structural search operators over the nodes of the tree

– Ancestor-Descendant, Parent-Child, Sibling, …

Theory behind SIREn
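With Dewey identifiers, the structural operators listed above reduce to comparisons on the identifier itself. The sketch below shows one way to express them; the function names are chosen for this example and are not SIREn's API.

```python
from typing import Tuple

Dewey = Tuple[int, ...]


def parse(label: str) -> Dewey:
    """Turn a label such as "1.2.2" into the tuple (1, 2, 2)."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a: Dewey, d: Dewey) -> bool:
    """a is an ancestor of d if a is a strict prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p: Dewey, c: Dewey) -> bool:
    """p is the parent of c if c is p plus exactly one more component."""
    return len(c) == len(p) + 1 and c[:-1] == p


def is_sibling(x: Dewey, y: Dewey) -> bool:
    """x and y are siblings if they are distinct and share the same parent."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]
```

For example, `is_parent(parse("1.2"), parse("1.2.2"))` holds, while `is_parent(parse("1"), parse("1.2.2"))` does not, although `is_ancestor` holds for both pairs.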

Page 13: High Performance JSON Search and Relational Faceted Browsing with Lucene

Theory behind SIREn: Tree-Labelling

{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009
    }
  ]
}

[Figure: the document drawn as a tree, with field nodes (name, funding_rounds, round_code, raised_amount) and value nodes (LucidWorks, a, 6000000)]

Page 14: High Performance JSON Search and Relational Faceted Browsing with Lucene

Theory behind SIREn: Tree-Labelling

[Figure: the tree of the same LucidWorks document with a Dewey identifier on every node. The root is 1, name is 1.1, LucidWorks is 1.1.1 and funding_rounds is 1.2; the nodes below funding_rounds carry the identifiers 1.2.1, 1.2.2, 1.2.2.1, 1.2.2.1.1, 1.2.2.2 and 1.2.2.2.1]

Page 15: High Performance JSON Search and Relational Faceted Browsing with Lucene

Theory behind SIREn: Query Processing

[Figure: the query name : LucidWorks evaluated against the inverted index. Posting list of "name": 1.1, 2.2, 2.5. Posting list of "LucidWorks": 1.5.3, 2.2.1, 4.2.1]
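The query processing step shown on this slide can be sketched as a join between the two posting lists: a field node matches when one of the value nodes is its direct child. The sketch assumes that the first component of each identifier is the document id and that "name" owns the first posting list and "LucidWorks" the second; `parent_child_join` is a name made up for this example. A real engine would merge the sorted lists rather than build a set.

```python
def parent_child_join(parents, children):
    """Return the ids in `parents` that have a direct child in `children`."""
    # The parent of a Dewey id is the id without its last component.
    child_parents = {child[:-1] for child in children}
    return [p for p in parents if p in child_parents]


# Posting lists as they appear on the slide.
name_postings = [(1, 1), (2, 2), (2, 5)]
value_postings = [(1, 5, 3), (2, 2, 1), (4, 2, 1)]

matches = parent_child_join(name_postings, value_postings)
print(matches)  # [(2, 2)]
```

Only node 2.2 has a child (2.2.1) in the "LucidWorks" list, so under the stated assumption only document 2 matches the query.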

Page 22: High Performance JSON Search and Relational Faceted Browsing with Lucene

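SIREn Plugin Architecture - Overview

[Diagram: SIREn components plugged into Lucene extension points. Analysis: JSON Analyzer. Codec: Tree-Labelling Codec. Flexible Query Parser: JSON Query Parser. Query: Node Query. Document. A legend distinguishes SIREn components from Lucene components]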

Page 23: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Field

schema.xml sample

<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="json" type="json" indexed="true" stored="false"/>
</fields>

<types>
  <fieldType name="json"
             class="org.sindice.siren.solr.schema.JsonField"
             datatypeConfig="datatypes.xml"/>
</types>

Page 24: High Performance JSON Search and Relational Faceted Browsing with Lucene

Datatypes

datatypes.xml sample

<datatype name="http://www.w3.org/2001/XMLSchema#string"
          class="org.sindice.siren.solr.schema.TextDatatype">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</datatype>

<datatype name="http://www.w3.org/2001/XMLSchema#int"
          class="org.sindice.siren.solr.schema.TrieDatatype"
          precisionStep="8"
          type="integer"/>

Page 25: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Tokenizer

• Traverses the JSON tree using depth-first search

• Generates one token per JSON node

• Attaches metadata attributes (Dewey id, datatype, …) to each token

Tokenizer Output

[Figure: token stream name, LucidWorks, funding_rounds, round_code, shown with the attribute pairs (1.1, Field), (1.1.1, String), (1.2, Field) and (1.2.2.1, String)]
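The tokenizer behaviour described above can be sketched as a depth-first generator that emits one token per node together with its Dewey id and datatype. This is an illustration, not SIREn's implementation: the numbering rule (children counted from 1 in document order) and the datatype names other than Field and String are assumptions, so the ids it produces for nested nodes may differ from those on the slides.

```python
# Hypothetical mapping from Python value types to datatype names.
DATATYPES = {str: "String", int: "Integer", float: "Double", bool: "Boolean"}


def tokenize(node, dewey=(1,)):
    """Yield (text, dewey_id, datatype) for each node, depth-first."""
    if isinstance(node, dict):
        for i, (key, value) in enumerate(node.items(), start=1):
            field_id = dewey + (i,)
            yield key, field_id, "Field"
            yield from tokenize(value, field_id)
    elif isinstance(node, list):
        for i, item in enumerate(node, start=1):
            yield from tokenize(item, dewey + (i,))
    else:
        # A scalar value becomes the first child of the enclosing node.
        yield str(node), dewey + (1,), DATATYPES.get(type(node), "Null")


doc = {"name": "LucidWorks", "funding_rounds": [{"round_code": "a"}]}
tokens = list(tokenize(doc))
```

The first three tokens are `("name", (1, 1), "Field")`, `("LucidWorks", (1, 1, 1), "String")` and `("funding_rounds", (1, 2), "Field")`, which agrees with the first three attribute pairs shown in the tokenizer output.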

Page 26: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Analyzer – NodeTokenizerFilter

• Tokenizes the content of a node token based on its datatype

[Figure: input token stream name, LucidWorks, funding_rounds, round_code with the attribute pairs (1.1, Field), (1.1.1, String), (1.2, Field) and (1.2.2.1, String); output tokens name, lucid, works, funding, rounds]

Page 27: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Analyzer – NodeTokenizerFilter

• Tokenizes the content of a node token based on its datatype

[Figure: same input and output as on the previous slide, with the callout "Tokenized with String datatype analyzer"]

Page 28: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Analyzer – NodeTokenizerFilter

• Tokenizes the content of a node token based on its datatype

[Figure: same input and output as on the previous slide, with the callout "Tokenized with Field datatype analyzer"]

Page 29: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Encode metadata attributes into a term payload

• Leverage Payload API to transfer attributes to the Codec API

JSON Analyzer – NodePayloadFilter
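As a rough illustration of encoding node metadata into a term payload, the sketch below packs a datatype code and a Dewey id into a byte string using variable-length integers (7 data bits per byte, high bit set when more bytes follow). The layout and the numeric datatype code are assumptions made for this example; the slides do not describe SIREn's actual payload format.

```python
def encode_vint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order bits first."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_vints(data: bytes) -> list:
    """Decode a concatenation of variable-length integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_payload(dewey, datatype_code: int) -> bytes:
    """Hypothetical layout: datatype code, path length, then path components."""
    out = bytearray(encode_vint(datatype_code))
    out += encode_vint(len(dewey))
    for component in dewey:
        out += encode_vint(component)
    return bytes(out)


def decode_payload(data: bytes):
    datatype_code, length, *path = decode_vints(data)
    if length != len(path):
        raise ValueError("truncated payload")
    return tuple(path), datatype_code


payload = encode_payload((1, 2, 2, 1), datatype_code=3)
```

Decoding `payload` returns `((1, 2, 2, 1), 3)`, so whatever reads the payload on the codec side can recover the attributes that the analysis chain attached to the token.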

Page 31: High Performance JSON Search and Relational Faceted Browsing with Lucene

Tree-Labelling Codec – File Structure

File    Content of each block
.doc    Header, doc identifiers, node frequencies
.nod    Header, node identifiers, term frequencies
.pos    Header, term positions

Page 32: High Performance JSON Search and Relational Faceted Browsing with Lucene

Tree-Labelling Codec – Compression

• Adaptive Frame Of Reference (AFOR)

– Adapts the encoding to the integer distribution

– Better tolerance against outliers

– Very effective with frequencies, node identifiers and positions (higher compression rate)

[Figure: a block encoded with FOR carries a single BFS marker, while the same block encoded with AFOR carries four, one per frame]
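The tolerance to outliers can be illustrated with a small cost model. Plain Frame Of Reference picks one bit width for a whole block, so a single large value inflates every entry; splitting the block into smaller frames confines the damage to one frame. The sketch below uses fixed frames of 8 values and an 8-bit header per frame. Both figures are assumptions for illustration, and real AFOR chooses the frame lengths adaptively, which this sketch does not do.

```python
def bits_needed(value: int) -> int:
    """Bits required to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def for_cost(values, header_bits: int = 8) -> int:
    """Cost in bits of one frame: a header plus a fixed width per value."""
    width = max(bits_needed(v) for v in values)
    return header_bits + width * len(values)


def framed_cost(values, frame_len: int = 8, header_bits: int = 8) -> int:
    """Cost in bits when each frame of `frame_len` values has its own width."""
    return sum(
        for_cost(values[i:i + frame_len], header_bits)
        for i in range(0, len(values), frame_len)
    )


block = [1] * 31 + [1000]   # small values with one outlier
print(for_cost(block))      # 328
print(framed_cost(block))   # 136
```

With one frame, all 32 values pay the 10 bits that 1000 needs. With four frames, three of them need only 1 bit per value and only the last frame pays for the outlier.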

Page 38: High Performance JSON Search and Relational Faceted Browsing with Lucene

Node Query

• Query Processing

– Collects matching document and node identifiers

– Posting list traversal order: document ids, node ids, then positions

• Adaptation of all Lucene’s Query classes to the new file structure

– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …

• TwigQuery

– Consists of a root query and one or more descendant or child queries

– Can be nested to form a complex tree structure

– Can be rewritten as a pure boolean query

[Figure: a twig query tree combining Boolean, Phrase (MUST), Twig (NOT), Boolean (SHOULD) and Range (SHOULD) queries]
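The shape of a twig query can be sketched with a small in-memory evaluator: a root query plus clauses that are checked against the descendants of the node matched by the root, with MUST, SHOULD and NOT combined as in a Lucene boolean query. The `Term` and `Twig` classes and the dictionary-based tree are stand-ins invented for this example, not SIREn classes, and only descendant clauses are modelled.

```python
from dataclasses import dataclass, field

MUST, SHOULD, NOT = "MUST", "SHOULD", "NOT"


@dataclass
class Term:
    text: str

    def matches(self, tree, node):
        return tree.get(node) == self.text


@dataclass
class Twig:
    root: Term
    clauses: list = field(default_factory=list)  # (occur, query) pairs

    def matches(self, tree, node):
        if not self.root.matches(tree, node):
            return False
        descendants = [n for n in tree
                       if len(n) > len(node) and n[:len(node)] == node]
        should_hits = []
        for occur, query in self.clauses:
            hit = any(query.matches(tree, d) for d in descendants)
            if occur == MUST and not hit:
                return False
            if occur == NOT and hit:
                return False
            if occur == SHOULD:
                should_hits.append(hit)
        has_must = any(occur == MUST for occur, _ in self.clauses)
        # Without MUST clauses, at least one SHOULD clause has to match.
        return has_must or not should_hits or any(should_hits)


# One document as a map from Dewey id to node text (ids are illustrative).
tree = {
    (1, 2): "funding_rounds",
    (1, 2, 1, 1): "round_code",
    (1, 2, 1, 1, 1): "a",
    (1, 2, 1, 2): "raised_amount",
    (1, 2, 1, 2, 1): "6000000",
}

query = Twig(Term("funding_rounds"), [
    (MUST, Twig(Term("round_code"), [(MUST, Term("a"))])),  # nested twig
    (NOT, Term("angel")),
])
hits = [node for node in sorted(tree) if query.matches(tree, node)]
```

Here `hits` is `[(1, 2)]`: the funding_rounds node has a round_code descendant whose value is "a", and no descendant is labelled "angel".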

Page 39: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Faceted Navigation

– Data-driven exploratory interface

– User incrementally adds constraints

– Restricted to one record collection

• Relational Faceted Navigation

– Enables navigation of interrelated record collections

– Constraints affect all record collections

– New navigation operation: Pivot

• Switch user view to a record collection

Application: Relational Faceted Navigation

Page 41: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Each collection has its own data model (document)

• Lucene fields for facets

• JSON field for relationships with records from other collections

Data Model

Collection    Lucene fields (facets)    JSON field
Company       Country, Category         JSON
Investment    Year, Amount              JSON
Investor      Type                      JSON

Page 42: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Model

• JSON field: Tree covering all the relationships with records from other collections

• Resulting tree can be very large

[Figure: the JSON trees of the three collections (Company, Investment, Investor), built from the fields category_code, country_code, round_code, raised_amount and type, the relations funding_rounds and investments, and their inverses funding_rounds -1 and investments -1]

Page 43: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Model

• JSON field: Tree covering all the relationships with records from other collections

• Resulting tree can be very large

Company
  category_code
  country_code
  funding_rounds
    round_code
    raised_amount
    […]
    investments
      type
      […]

Page 45: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Model

• JSON field: Tree covering all the relationships with records from other collections

• Resulting tree can be very large

Investment
  round_code
  raised_amount
  funding_rounds -1
    category_code
    country_code
    […]
  investments
    type
    […]

Page 47: High Performance JSON Search and Relational Faceted Browsing with Lucene

JSON Model

• JSON field: Tree covering all the relationships with records from other collections

• Resulting tree can be very large

Investor
  type
  investments -1
    round_code
    raised_amount
    […]
    funding_rounds -1
      country_code
      category_code
      […]

Page 49: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Drill-Down

Page 50: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Drill-Down

Lucene query:

collection : Company AND country_code : irl AND category_code : software

Page 51: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Pivot

Page 52: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Pivot

collection : Investment

Lucene query

Page 53: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Pivot

Preceding Lucene query:

collection : Company AND country_code : irl AND category_code : software

Query Rewriting

Lucene query:

collection : Investment

JSON query:

funding_rounds -1 : {
  country_code : irl,
  category_code : software
}
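The rewriting performed on a pivot can be sketched as two string builders: the facet constraints of the preceding Lucene query are moved into a JSON query under the inverse relation, and the Lucene query is reset to the target collection. The function names and the single-line output format are assumptions made for this example; they are not part of SIREn.

```python
def lucene_query(collection, facets=None):
    """Build the drill-down query: the collection plus one clause per facet."""
    clauses = [f"collection : {collection}"]
    clauses += [f"{name} : {value}" for name, value in (facets or {}).items()]
    return " AND ".join(clauses)


def json_query(relation, facets, inner=None):
    """Wrap facet constraints (and an optional inner query) under a relation."""
    parts = [f"{name} : {value}" for name, value in facets.items()]
    if inner is not None:
        parts.append(inner)
    return f"{relation} : {{ {', '.join(parts)} }}"


def pivot(target, inverse_relation, previous_facets, previous_json=None):
    """Return the (Lucene query, JSON query) pair after pivoting to `target`."""
    return (lucene_query(target),
            json_query(inverse_relation, previous_facets, previous_json))


facets = {"country_code": "irl", "category_code": "software"}
before = lucene_query("Company", facets)
after_lucene, after_json = pivot("Investment", "funding_rounds -1", facets)
```

`before` is the drill-down query on Company, `after_lucene` is `collection : Investment`, and `after_json` is `funding_rounds -1 : { country_code : irl, category_code : software }`. Passing that string back as `previous_json` on a second pivot nests it one level deeper.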

Page 54: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Pivot

Lucene query:

collection : Investment

JSON query:

funding_rounds -1 : {
  country_code : irl,
  category_code : software
}

Page 55: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Pivot

Page 56: High Performance JSON Search and Relational Faceted Browsing with Lucene

Navigation Model: Pivot

Lucene query:

collection : Investor

JSON query:

investments -1 : {
  founded_year : 2012,
  funding_rounds -1 : {
    country_code : irl,
    category_code : software
  }
}

Page 57: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Lucene BlockJoin

– Introduced support for indexing and searching nested data …

– … for small and well-defined schemas

Comparison with BlockJoin

Page 58: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Artificially increases the number of documents in the index

– One document per nested data record

• Cache size grows linearly with the number of nested data records

– Increased memory usage

Lucene BlockJoin - Scalability
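To get a feel for the document-count growth mentioned above, the sketch below counts the nested records in one JSON document, assuming that every JSON object, including the root, would be indexed as its own Lucene document under BlockJoin. The `count_records` helper is invented for this example, and the sample document is the LucidWorks example from the earlier slides.

```python
def count_records(node) -> int:
    """Count the JSON objects in a parsed JSON value, including the root."""
    if isinstance(node, dict):
        return 1 + sum(count_records(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_records(item) for item in node)
    return 0  # scalar values are attributes, not records


doc = {
    "name": "LucidWorks",
    "category_code": "analytics",
    "funding_rounds": [
        {
            "round_code": "a",
            "raised_amount": 6000000,
            "funded_year": 2009,
            "investments": [
                {"name": "Granite Ventures", "type": "financial-org"}
            ],
        }
    ],
}

print(count_records(doc))  # 3
```

Under that assumption this single company yields three index documents (company, funding round, investment) instead of one, and the ratio grows with the number of rounds and investors per company.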

Page 59: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Developers must be aware of the relations between nested data records

– At indexing time to tag parent records

– At querying time to filter parent records

• Upfront effort required to design and configure the system

– Define Parent-Child relationships between record collections

– Define attributes for each record collection

• If not properly designed, risk of incorrect matches

Lucene BlockJoin - Flexibility

Page 60: High Performance JSON Search and Relational Faceted Browsing with Lucene

• BlockJoin

+ Works out of the box with all of Lucene’s features

‒ Requires upfront design effort

‒ Memory usage dependent on nested data structure

• Tree-Labelling

+ Can handle arbitrary and large nested model

+ Memory friendly

‒ Lucene’s features have to be re-thought and re-implemented

Comparison with BlockJoin

Page 61: High Performance JSON Search and Relational Faceted Browsing with Lucene

• Nested data model becomes more and more prevalent

• Searching nested data brings new challenges: performance, scalability, flexibility

• Different approaches exist, each one with pros and cons

• SIREn plugin based on tree-labelling techniques

• Enables new kinds of search applications, e.g., relational faceted browsers, with sub-second response time

• SIREn Availability

– Trial license currently available

– In negotiation with the University to open-source

Conclusion

Page 62: High Performance JSON Search and Relational Faceted Browsing with Lucene

This material is based upon works supported by the European FP7 project LOD2 (257943) and the Irish Research Council for Science, Engineering and Technology.

Acknowledgement