high performance json search and relational faceted browsing with lucene
DESCRIPTION
Presented by Renaud Delbru, Co-Founder, SindiceTech. In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene's extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in nested, schemaless, data-intensive scenarios. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Take-away lessons from this session will be an awareness of how Lucene/Solr and Hadoop can be used for relational and graph data search, and that it is now possible to have relational faceted browsers with sub-second response time on commodity hardware.
TRANSCRIPT
HIGH PERFORMANCE JSON SEARCH AND
RELATIONAL FACETED BROWSING WITH LUCENE
Renaud Delbru Co-Founder, SindiceTech
Post-Doctoral Researcher, NUIG
• Lucene / Solr
– User for 7 years
– Built a web search engine – sindice.com (700M documents)
• Academia & Research
– Ph.D. in Information Retrieval and Semantic Web
– Post-doctoral researcher at National University of Ireland, Galway
• Industry
– Technical co-founder of SindiceTech
– Management Platform for Enterprise Knowledge Graph
My Background
• Nested Data Model
• SIREn Overview & Theory
• SIREn Plugin Architecture
• Relational Faceted Browsing
• Comparison with BlockJoin
Agenda
Denormalising Relational Data
[Figure: LucidWorks with its funding rounds Series A and Series B and the investor Granite Ventures; in the second build of the slide, Granite Ventures appears twice]
• SQL
– Query-time join performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Convert many-to-one/many into one-to-many relationships
– Duplicate data …
– … but avoid joins
Nested Data Model: Why is it important ?
• Model becoming prevalent: JSON, XML, Avro, …
– Can be arbitrarily nested and large
– No strict schema / structure enforced
• Schema-less brings
– Flexibility
– Ease of development
• Developers do not have to invest significant modelling effort upfront
Schema-Less Nested Data Model
• Lucene/Solr plugin for indexing and searching JSON
• Rich data model (JSON)
– Nested objects, nested arrays, datatypes
• Schema-agnostic
– No need to define structure (nested model)
– No need to define schema (fields)
Introducing SIREn
Overview of the SIREn API
Document
{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      "investments" : [
        {
          "name" : "Granite Ventures",
          "type" : "financial-org"
        },
        …
      ]
    },
    …
  ]
}
Query
(category_code : analytics)
AND
(funding_rounds : {
  round_code : seed OR a OR angel,
  raised_amount : [0 TO 12000000],
  * : {
    type : financial-org
  }
})
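To make the meaning of the query above concrete, here is a minimal Python sketch that checks the same constraints directly against the parsed JSON document, without any index. It is not SIREn's API or its evaluation strategy. The helper names `nested_objects` and `matches` are hypothetical, the `*` is read here as "any field name" (an interpretation of the slide), and the elided parts of the document are simply left out.

```python
doc = {
    "name": "LucidWorks",
    "category_code": "analytics",
    "funding_rounds": [
        {
            "round_code": "a",
            "raised_amount": 6000000,
            "funded_year": 2009,
            "investments": [
                {"name": "Granite Ventures", "type": "financial-org"},
            ],
        },
    ],
}


def nested_objects(value):
    """Yield the objects held directly by a field value (an object or an array of objects)."""
    if isinstance(value, dict):
        yield value
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, dict):
                yield item


def matches(company):
    """Return True if one funding round satisfies all nested constraints at once."""
    if company.get("category_code") != "analytics":
        return False
    for rnd in company.get("funding_rounds", []):
        code_ok = rnd.get("round_code") in {"seed", "a", "angel"}
        amount_ok = 0 <= rnd.get("raised_amount", -1) <= 12000000
        # "* : { type : financial-org }": any field of the round holding such an object
        wildcard_ok = any(
            obj.get("type") == "financial-org"
            for value in rnd.values()
            for obj in nested_objects(value)
        )
        if code_ok and amount_ok and wildcard_ok:
            return True
    return False


print(matches(doc))
```

The point of the nested form is that all three constraints must hold within the same funding round, not merely somewhere in the document.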
• Inspired by tree-labelling scheme techniques (XML IR)
– Label each node with a hierarchical id (here, Dewey identifiers)
• Full-text search operators over the content of a node
• Structural search operators over the nodes of the tree
– Ancestor-Descendant, Parent-Child, Sibling, …
Theory behind SIREn
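The structural operators listed above reduce to simple comparisons once every node carries a Dewey identifier. The following Python sketch illustrates that idea only; the function names are hypothetical and this is not SIREn code.

```python
def parse(label):
    """Turn a Dewey identifier such as '1.2.2.1' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a, b):
    """a is an ancestor of b if a's label is a strict prefix of b's label."""
    a, b = parse(a), parse(b)
    return len(a) < len(b) and b[:len(a)] == a


def is_parent(a, b):
    """a is the parent of b if it is an ancestor exactly one level up."""
    return is_ancestor(a, b) and len(parse(b)) == len(parse(a)) + 1


def is_sibling(a, b):
    """Two distinct nodes are siblings if they share the same parent label."""
    a, b = parse(a), parse(b)
    return a != b and len(a) == len(b) and a[:-1] == b[:-1]


print(is_ancestor("1.2", "1.2.2.1"), is_parent("1.2.2", "1.2.2.1"), is_sibling("1.2.2.1", "1.2.2.2"))
```

Comparing parsed integer tuples rather than raw strings avoids false prefix matches such as "1.2" against "1.21".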
Theory behind SIREn: Tree-Labelling
[Figure: tree view of the JSON document below, showing the nodes name, LucidWorks, funding_rounds, round_code, a, raised_amount, 6000000, …]
{
"name" : "LucidWorks",
"category_code" : "analytics",
"funding_rounds" : [
{
"round_code" : "a",
"raised_amount" : 6000000,
"funded_year" : 2009,
…
},
…
]
}
Theory behind SIREn: Tree-Labelling
[Figure: the same document and tree, with each node labelled by a Dewey identifier. Labels shown: 1, 1.1, 1.1.1, 1.2, 1.2.1, 1.2.2, 1.2.2.1, 1.2.2.1.1, 1.2.2.2, 1.2.2.2.1]
Theory behind SIREn: Query Processing
[Figure: the query name : LucidWorks evaluated step by step against the inverted index. Posting list of node labels for name: 1.1, 2.2, 2.5. Posting list of node labels for LucidWorks: 1.5.3, 2.2.1, 4.2.1.]
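The query-processing slides can be read as a merge join between two sorted posting lists of node labels. The Python sketch below shows one way such a join could work. It is an illustration under assumptions, not SIREn's implementation: the assignment of each posting list to its term is read off the slide text, the first component of a label is assumed to be the document identifier, and `parent_child_join` is a hypothetical name.

```python
def parse(label):
    """'2.2.1' -> (2, 2, 1)"""
    return tuple(int(part) for part in label.split("."))


def fmt(label):
    """(2, 2, 1) -> '2.2.1'"""
    return ".".join(str(n) for n in label)


def parent_child_join(parents, children):
    """Merge two sorted lists of labels and keep the pairs where the
    child label is the parent label plus exactly one extra component."""
    result = []
    i = j = 0
    while i < len(parents) and j < len(children):
        parent, child = parents[i], children[j]
        if child[:-1] == parent:
            result.append((fmt(parent), fmt(child)))
            j += 1  # the same parent may have further children
        elif child[:-1] < parent:
            j += 1  # child lies before the current parent: skip it
        else:
            i += 1  # parent lies before the current child: skip it
    return result


name_postings = [parse(x) for x in ["1.1", "2.2", "2.5"]]
value_postings = [parse(x) for x in ["1.5.3", "2.2.1", "4.2.1"]]
print(parent_child_join(name_postings, value_postings))
```

With the labels shown on the slide, only the pair (2.2, 2.2.1) survives, so only document 2 has a name node whose child contains LucidWorks. Because both lists are sorted, each is traversed once.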
SIREn Plugin Architecture - Overview
[Figure: Lucene components and the SIREn extensions built on them]
• Codec: Tree-Labelling Codec
• Analysis: JSON Analyzer
• Flexible Query Parser: JSON Query Parser
• Query: Node Query
• Document
JSON Field
schema.xml sample
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="json" type="json" indexed="true" stored="false"/>
  …
</fields>
<types>
  <fieldType name="json"
             class="org.sindice.siren.solr.schema.JsonField"
             datatypeConfig="datatypes.xml"/>
  …
</types>
Datatypes
datatypes.xml sample
<datatype name="http://www.w3.org/2001/XMLSchema#String"
          class="org.sindice.siren.solr.schema.TextDatatype">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</datatype>
<datatype name="http://www.w3.org/2001/XMLSchema#int"
          class="org.sindice.siren.solr.schema.TrieDatatype"
          precisionStep="8"
          type="integer"/>
• Traverses JSON tree using Depth-First Search
• Generates one token per JSON node
• Attaches metadata attributes (Dewey id, datatype, …) to each token
Tokenizer Output
JSON Tokenizer
[Figure: token stream name, LucidWorks, funding_rounds, …, round_code; labels shown under the tokens: 1.1 Field, 1.1.1 String, 1.2 Field, …, 1.2.2.1 String]
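The three bullets above can be sketched as a small recursive generator in Python. This is a simplified illustration, not the SIREn tokenizer. The numbering scheme (root labelled 1, children numbered from 1 in document order, array items as child positions) and the "Number" datatype name are assumptions, so the labels it produces will not necessarily match those on the slides.

```python
def fmt(label):
    """(1, 2, 1) -> '1.2.1'"""
    return ".".join(str(n) for n in label)


def tokenize(node, label=(1,)):
    """Depth-first walk yielding one (text, dewey_id, datatype) token per node."""
    if isinstance(node, dict):
        for i, (key, value) in enumerate(node.items(), start=1):
            field_label = label + (i,)
            yield (key, fmt(field_label), "Field")
            yield from tokenize(value, field_label)
    elif isinstance(node, list):
        for i, item in enumerate(node, start=1):
            yield from tokenize(item, label + (i,))
    else:
        # Leaf value: only strings and numbers are distinguished in this sketch.
        datatype = "String" if isinstance(node, str) else "Number"
        yield (str(node), fmt(label + (1,)), datatype)


doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
tokens = list(tokenize(doc))
for token in tokens:
    print(token)
```

Each token carries everything a later stage needs: the text to analyse, the position of the node in the tree, and the datatype that selects the analyzer.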
• Tokenize the content of a node token based on its datatype
JSON Analyzer – NodeTokenizerFilter
[Figure: Input: the node tokens name, LucidWorks, funding_rounds, … Output: LucidWorks becomes lucid, works (tokenized with the String datatype analyzer); funding_rounds becomes funding, rounds (tokenized with the Field datatype analyzer)]
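A rough Python sketch of datatype-driven tokenization follows. The analyzers are stand-ins chosen only to reproduce the outputs shown on the slide (lucid, works and funding, rounds). In SIREn they are configured per datatype in datatypes.xml, and all names below are hypothetical.

```python
import re


def split_words(text):
    """Split on camelCase boundaries and non-alphanumeric characters, then lowercase."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    return [part.lower() for part in parts]


def keep_whole(text):
    """Emit the node content as a single term."""
    return [text]


# Assumed mapping from datatype to analyzer, for illustration only.
ANALYZERS = {"String": split_words, "Field": split_words, "Number": keep_whole}


def node_tokenizer_filter(node_tokens):
    """Re-tokenize each node token with the analyzer of its datatype.
    Every resulting term keeps the Dewey id and datatype of its node."""
    for text, dewey_id, datatype in node_tokens:
        for term in ANALYZERS[datatype](text):
            yield (term, dewey_id, datatype)


node_tokens = [
    ("name", "1.1", "Field"),
    ("LucidWorks", "1.1.1", "String"),
    ("funding_rounds", "1.2", "Field"),
]
terms = list(node_tokenizer_filter(node_tokens))
print(terms)
```

Terms produced from the same node share its Dewey id, so a phrase or boolean match can later be restricted to a single node.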
• Encode metadata attributes into a term payload
• Leverage Payload API to transfer attributes to the Codec API
JSON Analyzer – NodePayloadFilter
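A Lucene payload is a byte array attached to a term occurrence, so the node metadata has to be serialised. The Python sketch below shows one possible byte layout using variable-length integers. The layout (datatype code, number of label components, then the components) and the function names are assumptions made for illustration; the slides do not show SIREn's actual payload format.

```python
def encode_vint(n):
    """Variable-length encoding: 7 bits per byte, low-order bits first,
    high bit set on every byte except the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_payload(dewey_id, datatype_code):
    """Serialise a datatype code and a Dewey id into one byte string."""
    parts = [int(p) for p in dewey_id.split(".")]
    payload = bytearray(encode_vint(datatype_code))
    payload += encode_vint(len(parts))
    for part in parts:
        payload += encode_vint(part)
    return bytes(payload)


def decode_payload(payload):
    """Inverse of encode_payload: returns (dewey_id, datatype_code)."""
    values, current, shift = [], 0, 0
    for byte in payload:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    datatype_code, count, *parts = values
    if count != len(parts):
        raise ValueError("corrupt payload")
    return ".".join(str(p) for p in parts), datatype_code


payload = encode_payload("1.2.2.1", 1)
print(payload.hex(), decode_payload(payload))
```

Small identifiers, which dominate in practice, take a single byte each with this kind of encoding.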
Tree-Labelling Codec – File Structure
• .doc: Header | Doc identifiers | Node frequencies
• .nod: Header | Node identifiers | Term frequencies
• .pos: Header | Term positions
[Figure label: Block]
Tree-Labelling Codec – Compression
• Adaptive Frame Of Reference
– Adapt the encoding to the integer distribution
– Better tolerance against outliers
– Very effective with frequencies, node identifiers and positions (higher compression rate)
[Figure: a FOR-encoded block with a single BFS next to an AFOR-encoded block with several BFS entries]
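The tolerance against outliers can be illustrated by counting bits. The Python sketch below compares a single frame for the whole block with several smaller frames that each carry their own bit width. It is a deliberately simplified model: the one-byte header per frame, the fixed sub-frame length of 8 and the reference value of 0 are assumptions, and how AFOR actually chooses its frames is not reproduced here.

```python
def bits_needed(values):
    """Bit width required to store the largest value of a frame."""
    return max(1, max(values).bit_length())


def single_frame_cost(block, header_bits=8):
    """One frame for the whole block: every value uses the width of the largest one."""
    return header_bits + len(block) * bits_needed(block)


def multi_frame_cost(block, frame_len=8, header_bits=8):
    """Several frames, each with its own header and its own bit width."""
    frames = [block[i:i + frame_len] for i in range(0, len(block), frame_len)]
    return sum(header_bits + len(frame) * bits_needed(frame) for frame in frames)


# 31 small values and a single outlier in a block of 32 integers.
block = [3] * 31 + [1000]
print(single_frame_cost(block), multi_frame_cost(block))
```

With a single frame, the outlier forces all 32 values to 10 bits each (328 bits in total). With four frames, only the frame containing the outlier pays that price (160 bits in total).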
• Query Processing
– Collects matching document and node identifiers
– Posting list traversal order: document ids, node ids, then positions
• Adaptation of all Lucene’s Query classes to the new file structure
– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
Node Query
• TwigQuery
– Consists of a root query and one or more descendant or child queries
– Can be nested to form a complex tree structure
– Can be rewritten as a pure boolean query
[Figure: example twig query tree with the nodes Boolean, Phrase (MUST), Twig (NOT), Boolean (SHOULD), Range (SHOULD)]
• Faceted Navigation
– Data-driven exploratory interface
– User incrementally adds constraints
– Restricted to one record collection
• Relational Faceted Navigation
– Enables navigation of interrelated record collections
– Constraints affect all record collections
– New navigation operation: Pivot
• Switch user view to a record collection
Application: Relational Faceted Navigation
Relational Faceted Navigation – Demo
HCLS Demo: http://hcls.sindice.com/pivot-browser/
• Each collection has its own data model (document)
• Lucene fields for facets
• JSON field for relationships with records from other collections
Data Model
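[Figure: three collections. Company: facets Country and Category, plus a JSON field. Investment: facets Year and Amount, plus a JSON field. Investor: facet Type, plus a JSON field.]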
• JSON field: Tree covering all the relationships with records from other collections
• Resulting tree can be very large
JSON Model
[Figure: one JSON tree per collection, as far as the slide text shows]
Company: category_code, country_code, funding_rounds { round_code, raised_amount, […], investments { type, […] } }
Investment: round_code, raised_amount, funding_rounds -1 { category_code, country_code, […] }, investments { type, […] }
Investor: type, investments -1 { round_code, raised_amount, […], funding_rounds -1 { category_code, country_code, […] } }
Navigation Model: Drill-Down
collection : Company
AND
country_code : irl
AND
category_code : software
Lucene query
Navigation Model: Pivot
collection : Investment
Lucene query
funding_rounds -1 : {
country_code : irl,
category_code : software
}
JSON query
collection : Company
AND
country_code : irl
AND
category_code : software
Preceding Lucene query
Query Rewriting
Navigation Model: Pivot
collection : Investor
Lucene query
investments -1 : {
  founded_year : 2012,
  funding_rounds -1 : {
    country_code : irl,
    category_code : software
  }
}
JSON query
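The query rewriting behind the pivot operation can be sketched as follows in Python: on each pivot, the constraints accumulated on the current collection are wrapped under the relationship that leads back to it from the target collection. This is an illustration of the idea only. The state layout and the function names are hypothetical, and the "-1" suffix is taken to denote that backward relationship, as the slides suggest.

```python
def drill_down(state, field, value):
    """Add a facet constraint on the current collection."""
    facets = dict(state["facets"])
    facets[field] = value
    return {"collection": state["collection"], "facets": facets, "json": state["json"]}


def pivot(state, target_collection, inverse_relation):
    """Switch to another collection, nesting all current constraints
    under the relationship that points back to the previous collection."""
    nested = dict(state["facets"])
    nested.update(state["json"])
    return {
        "collection": target_collection,
        "facets": {},
        "json": {inverse_relation: nested},
    }


state = {"collection": "Company", "facets": {}, "json": {}}
state = drill_down(state, "country_code", "irl")
state = drill_down(state, "category_code", "software")
state = pivot(state, "Investment", "funding_rounds -1")
state = drill_down(state, "founded_year", 2012)
state = pivot(state, "Investor", "investments -1")
print(state)
```

After the second pivot, the `json` part of the state has the same shape as the JSON query shown for the Investor collection. Each pivot adds one level of nesting while the Lucene part shrinks back to a plain collection filter.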
• Lucene BlockJoin
– Introduced support for indexing and searching nested data …
– … for small and well-defined schema
Comparison with BlockJoin
• Artificially increases the number of documents in the index
– One document per nested data record
• Cache size linear with the number of nested data records
– Increased memory usage
Lucene BlockJoin - Scalability
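To see why the document count grows, the Python sketch below flattens one nested company into the kind of block that BlockJoin indexing works with: one document per nested record, with the children placed before their parent. It only models the document layout and calls no Lucene API. The `doc_type` tag and the sample values are hypothetical.

```python
def to_block(company):
    """Flatten a nested company into a list of flat documents,
    with child records first and the parent record last."""
    block = []
    for rnd in company.get("funding_rounds", []):
        for investment in rnd.get("investments", []):
            block.append({"doc_type": "investment", **investment})
        round_fields = {k: v for k, v in rnd.items() if k != "investments"}
        block.append({"doc_type": "funding_round", **round_fields})
    company_fields = {k: v for k, v in company.items() if k != "funding_rounds"}
    block.append({"doc_type": "company", **company_fields})
    return block


company = {
    "name": "ExampleCo",
    "funding_rounds": [
        {"round_code": "a", "investments": [{"name": "Investor 1"}]},
        {"round_code": "b", "investments": [{"name": "Investor 1"}, {"name": "Investor 2"}]},
    ],
}
block = to_block(company)
print(len(block), [d["doc_type"] for d in block])
```

One company with two rounds and three investments becomes six index documents, which is the growth the slide refers to.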
• Developers must be aware of the relations between nested data records
– At indexing time to tag parent records
– At querying time to filter parent records
• Upfront effort required to design and configure the system
– Define Parent-Child relationships between record collections
– Define attributes for each record collection
• If not properly designed, risk of incorrect matches
Lucene BlockJoin - Flexibility
• BlockJoin
+ Works out of the box with all Lucene’s features
‒ Requires upfront design effort
‒ Memory usage dependent on nested data structure
• Tree-Labelling
+ Can handle arbitrary and large nested models
+ Memory friendly
‒ Have to re-think and re-implement Lucene’s features
Comparison with BlockJoin
• Nested data model becomes more and more prevalent
• Searching nested data brings new challenges: performance, scalability, flexibility
• Different approaches exist, each one with pros and cons
• SIREn plugin based on tree-labelling techniques
• Enables new kinds of search applications, e.g., relational faceted browsers, with sub-second response time
• SIREn Availability
– Trial license currently available
– In negotiation with the University to open-source
Conclusion
This material is based upon works supported by the European FP7 project LOD2
(257943) and the Irish Research Council for Science, Engineering and Technology.
Acknowledgement