Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
TRANSCRIPT
2 © Cloudera, Inc. All rights reserved.
My Background
• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford
Search and Hadoop
• Search is a key component of many big data problems
• Many analytics use cases start with search
• Adding analytics to full-text search has proven to be more effective than vice-versa
• External integrations are challenging for "real-time" (i.e. interactive) results
Solr in Hadoop
• Top Hadoop vendors who have integrated search have all chosen Apache Solr
• For example: Cloudera, Hortonworks, MapR, IBM, ...
• Historical focus on interactive response times
• Historical focus on faceted search / guided navigation
• High performance indexes
• originally for "full-text" search, but just as great for metadata!
Inverted Index
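Documents
0: Little Red Riding Hood
1: Robin Hood
2: Little Women

Term -> document ids
aardvark -> (no matching documents)
hood -> 0, 1
little -> 0, 2
red -> 0
riding -> 0
robin -> 1
women -> 2
zoo -> (no matching documents)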
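The term-to-document mapping on this slide can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, assuming whitespace tokenization and lowercasing; the function name `build_inverted_index` is made up for illustration and is not part of Lucene or Solr.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms so the dictionary reads like the slide's term list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(docs)

for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a single dictionary access, which is why term queries stay fast regardless of how many documents do not contain the term.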
Columnar Storage (DocValues)
[Figure: fields a, b, c for documents 1 to 4, shown in two layouts]
Stored Fields (row oriented): a1 b1 c1 a2 b2 c2 ...
DocValues (column oriented): a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4
• Fast linear scan
• Read only the data you need
• Fast random access
• docid -> value(s)
• High degree of locality
• Compressed
• prefix, delta, table, gcd, etc
• Mostly "Off-Heap"
• Memory mapped from index
• Row vs Column configurable per field!
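The difference between the two layouts can be sketched in Python. This is a toy illustration with made-up field values, not how Lucene encodes either structure on disk: rows keep all fields of a document together, while columns keep all values of one field together, indexed by docid.

```python
# Row-oriented (like stored fields): one record per document.
rows = [
    {"a": 1, "b": 10, "c": 100},
    {"a": 2, "b": 20, "c": 200},
    {"a": 3, "b": 30, "c": 300},
    {"a": 4, "b": 40, "c": 400},
]

# Column-oriented (like DocValues): one list per field, position == docid.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Fast linear scan: aggregating field "b" touches only that column.
total_b = sum(columns["b"])

# Fast random access: docid -> value for a single field.
docid = 2
c_value = columns["c"][docid]

print(total_b, c_value)
```

An aggregation such as a facet or a sum only needs one or two fields per document, so the column layout avoids reading the rest of each record.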
Multi-Segment Index
_0.fnm _0.fdt _0.fdx [...] _0_1.del
_1.fnm _1.fdt _1.fdx […]
segments_3
• Each segment is a self-contained "index"
• Segments are never changed once written
• Per-segment caching very effective
• Point-in-time searcher
• getting a new view means writing & including an additional segment
• turns a weakness into a strength
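The point-in-time behavior described above can be modeled with immutable tuples. The classes below are a toy sketch written for this transcript (the names `Index`, `Searcher`, and `commit` are illustrative, and the third document is invented); they only show why a searcher opened before a commit keeps a stable view.

```python
class Searcher:
    """A point-in-time view: it holds the segment list captured when it was opened."""

    def __init__(self, segments):
        self._segments = segments

    def count(self, word):
        # Count documents containing the word across all captured segments.
        return sum(
            1
            for segment in self._segments
            for doc in segment
            if word in doc.lower().split()
        )


class Index:
    """Toy index: an append-only tuple of immutable segments."""

    def __init__(self):
        self._segments = ()

    def commit(self, docs):
        # A commit writes one new segment; existing segments are never modified.
        self._segments = self._segments + (tuple(docs),)

    def open_searcher(self):
        return Searcher(self._segments)


index = Index()
index.commit(["Little Red Riding Hood", "Robin Hood"])
old_searcher = index.open_searcher()

index.commit(["Hood River"])  # invented document, added after the first searcher opened
new_searcher = index.open_searcher()

print(old_searcher.count("hood"), new_searcher.count("hood"))
```

Because segments never change, anything cached per segment for the old searcher remains valid for the new one; only the added segment needs fresh work.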
Faceted Search
• Breaks search results into buckets
• Generally provides bucket counts
• Allows user to filter / "drill into" results
Facet Module Goals
• Integration
• Performance
• Ease of use
[Diagram labels: New Facet Module, JSON Facet API, Faceting, Statistics, Search, Joins, Grouping, Field Collapsing, Highlighting, Nested Documents, Geosearch]
Slice and Dice with Facet commands
• Domain: A set of documents
• Facet command: create sub-domains / "facet buckets"
[Diagram: a root domain passed through Facet Commands A, B, and C, each producing sub-domains]
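The domain and facet-command vocabulary can be illustrated with plain Python lists. In this sketch a domain is a list of documents and a facet command is a function that splits it into named sub-domains. The helper `terms_facet` and the sample documents are assumptions made for illustration, not Solr code.

```python
from collections import defaultdict


def terms_facet(domain, field):
    """Split a domain (list of docs) into sub-domains keyed by the field's value."""
    buckets = defaultdict(list)
    for doc in domain:
        buckets[doc[field]].append(doc)
    return dict(buckets)


# Hypothetical root domain: the documents matching some query.
root_domain = [
    {"id": 1, "shoe_style": "Hiking", "color": "brown"},
    {"id": 2, "shoe_style": "Hiking", "color": "black"},
    {"id": 3, "shoe_style": "Running", "color": "black"},
    {"id": 4, "shoe_style": "Running", "color": "black"},
    {"id": 5, "shoe_style": "Running", "color": "red"},
]

# First facet command: one sub-domain per shoe style.
by_style = terms_facet(root_domain, "shoe_style")

# Second facet command applied to each sub-domain: nesting falls out naturally.
by_style_and_color = {
    style: terms_facet(sub_domain, "color") for style, sub_domain in by_style.items()
}

for style, colors in by_style_and_color.items():
    print(style, {color: len(docs) for color, docs in colors.items()})
```

Each bucket is itself a domain, which is what allows any facet command to be applied again at any depth.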
Facet Functions / Statistics
• Facet function calculates something over a domain
• Can sort domains by facet functions!
[Diagram: a domain split by Facet Commands A, B, and C into sub-domains, with facet functions such as sum(x), unique(y), min(units), and avg(price) computed over them]
Facet functions
• Calculate (and sort) by things other than document count

Function    Example                         Description
sum         sum(sales)                      Summation of numeric values
avg         avg(popularity)                 Average of numeric values
sumsq       sumsq(rent)                     Sum of squares
min         min(salary)                     Minimum value
max         max(mul(popularity,boost))      Maximum value
unique      unique(state)                   Number of unique values (calc distinct)
hll         hll(state)                      Number of unique values using the HyperLogLog algorithm
percentile  percentile(salary, 25, 50, 75)  Calculates percentiles via the t-digest algorithm
topdocs     topdocs("another query",5)      (in progress) Returns the top documents for another query
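To make the semantics of a few of these functions concrete, here is a small in-memory sketch that computes them per bucket and then sorts the buckets by one of them. It covers only the exact functions (sum, avg, sumsq, min, max, unique) and leaves out hll, percentile, and topdocs. The documents, brand names, and helper names are invented for illustration.

```python
from collections import defaultdict

# Exact aggregations over a list of values; approximate ones (hll, percentile) are omitted.
FUNCTIONS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "sumsq": lambda values: sum(v * v for v in values),
    "min": min,
    "max": max,
    "unique": lambda values: len(set(values)),
}


def facet_function(name, domain, field):
    """Apply a named facet function to one field over a domain (list of docs)."""
    return FUNCTIONS[name]([doc[field] for doc in domain])


def split_by(domain, field):
    """Terms-style bucketing: one sub-domain per distinct field value."""
    buckets = defaultdict(list)
    for doc in domain:
        buckets[doc[field]].append(doc)
    return buckets


docs = [  # hypothetical documents
    {"shoe_style": "Running", "price": 100.0, "brand": "Acme"},
    {"shoe_style": "Running", "price": 121.5, "brand": "Acme"},
    {"shoe_style": "Hiking", "price": 150.0, "brand": "Acme"},
    {"shoe_style": "Hiking", "price": 120.5, "brand": "Summit"},
]

results = [
    {
        "val": value,
        "count": len(domain),
        "x": facet_function("avg", domain, "price"),
        "y": facet_function("unique", domain, "brand"),
    }
    for value, domain in split_by(docs, "shoe_style").items()
]

# Sort buckets by the facet function x instead of by document count.
results.sort(key=lambda bucket: bucket["x"], reverse=True)
for bucket in results:
    print(bucket)
```

Sorting on `x` here plays the role of `sort : {x : desc}` in a facet request: the ordering key is a computed statistic rather than the bucket size.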
Simple request and response
curl http://localhost:8983/solr/query -d '
q=widgets&
json.facet={
  x : "avg(price)",
  y : "unique(brand)"
}
'

[…]
"facets" : {
  "count" : 314,
  "x" : 102.5,
  "y" : 28
}

• q=widgets: root domain defined by docs matching the query
• "count": count of docs in the bucket
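The same request can be assembled from Python using only the standard library. This sketch builds the POST body but does not send it; the helper name `build_facet_request` is an assumption, and the URL is simply the one from the curl example. The facet definition is serialized as strict JSON here, which is a subset of the relaxed syntax used on the slide.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_facet_request(url, query, facets):
    """Build (but do not send) a form-encoded POST carrying q and json.facet."""
    body = urlencode({"q": query, "json.facet": json.dumps(facets)})
    return Request(url, data=body.encode("utf-8"), method="POST")


request = build_facet_request(
    "http://localhost:8983/solr/query",
    "widgets",
    {"x": "avg(price)", "y": "unique(brand)"},
)

# Inspect the encoded body that would be sent to the server.
print(request.data.decode("utf-8"))
```

Against a running server, the request could be passed to `urllib.request.urlopen` and the response parsed with `json.load`; the statistics would then be found under the `facets` key, as in the response shown above.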
All-JSON request example
$ curl http://localhost:8983/solr/query -d '
{
  query : "widgets",        // our JSON parser accepts comments (C-style too)
  filter : "inStock:true",  // bare strings can appear unquoted
  offset : 0,
  limit : 5,
  sort : "price desc",
  fields : ["id","name","price"],  /* could have also used "id,name,price" */
  facet : {
    x : "avg(price)",
    y : "unique(brand)"
  }
}
'
Bucketing Facet Types
• Terms Facet
  • Creates new domains (facet buckets) based on values in a field
• Range Facet
  • Creates multiple buckets based on date ranges or numeric ranges
• Query Facet
  • Creates a single bucket of documents that match any given query
• Unlimited nesting: Any facet types may have any number of sub-facets
Terms facet example
json.facet={
  shoes : {
    type : terms,
    field : shoe_style,
    sort : {x : desc},
    facet : {
      x : "avg(price)",
      y : "unique(brand)"
    }
  }
}

"facets": {
  "count" : 472,
  "shoes": {
    "buckets" : [
      { "val" : "Hiking",
        "count" : 34,
        "x" : 135.25,
        "y" : 17
      },
      { "val" : "Running",
        "count" : 45,
        "x" : 110.75,
        "y" : 24
      },
      […]

• x and y: calculated per-bucket
Sub-facet example

json.facet={
  shoes : {
    type : terms,
    field : shoe_style,
    sort : {x : desc},
    facet : {
      x : "avg(price)",
      y : "unique(brand)",
      colors : {
        type : terms,
        field : color
      }
    }
  }
}

"facets": {
  "count" : 472,
  "shoes": {
    "buckets" : [
      { "val" : "Hiking",
        "count" : 34,
        "x" : 135.25,
        "y" : 17,
        "colors" : {
          "buckets" : [
            { "val" : "brown", "count" : 12 },
            { "val" : "black", "count" : 10 },
            […]
          ]
        } // end of colors sub-facet
      }, // end of Hiking bucket
      { "val" : "Running",
        "count" : 45,
        "x" : 110.75,
        "y" : 24,
        "colors" : {
          "buckets" : […]
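A nested facet response like the one in the sub-facet example can be walked recursively, since every bucket may contain further facets with their own buckets. The sketch below uses a hand-written sample limited to the values visible on the slide (the elided buckets are left out), and the `walk` helper is an illustrative name, not part of any Solr client.

```python
import json

# Sample limited to the values visible on the slide; elided buckets are omitted.
response_text = """
{
  "facets": {
    "count": 472,
    "shoes": {
      "buckets": [
        {"val": "Hiking", "count": 34, "x": 135.25, "y": 17,
         "colors": {"buckets": [
           {"val": "brown", "count": 12},
           {"val": "black", "count": 10}
         ]}},
        {"val": "Running", "count": 45, "x": 110.75, "y": 24}
      ]
    }
  }
}
"""


def walk(facet, depth=0):
    """Yield (depth, facet_name, bucket) for every bucket at every nesting level."""
    for name, value in facet.items():
        if isinstance(value, dict) and "buckets" in value:
            for bucket in value["buckets"]:
                yield depth, name, bucket
                # A bucket is itself a domain and may hold sub-facets.
                yield from walk(bucket, depth + 1)


facets = json.loads(response_text)["facets"]
rows = list(walk(facets))
for depth, name, bucket in rows:
    print("  " * depth + f"{name}={bucket['val']} count={bucket['count']}")
```

The recursion mirrors the request structure: whatever was nested under `facet` in the request appears nested inside each bucket of the response.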
Fantasy ($1045)
  Top Authors
    $423 George R.R. Martin
    $347 Brandon Sanderson
    $155 JK Rowling
  Top Books
    $252 A Game of Thrones
    $113 Emperor of Thorns
    $101 Nine Princes in Amber
    $82 Steel Heart

Sci-Fi ($898)
  Top Authors
    $321 Iain M Banks
    $218 Neal Asher
    $155 Neal Stephenson
  Top Books
    $113 Gridlinked
    $101 Use of Weapons
    $93 Snow Crash
    $82 The Skinner

Mystery ($645)
  Top Authors
    $191 James Patterson
    $145 Patricia Cornwell
    $126 John Grisham
  Top Books
    $85 One for the Money
    $77 Angels & Daemons
    $64 Shutter Island
    $35 The Firm

Filter By
  State
    $852 NJ (14 stores)
    $658 NY (11 stores)
    $421 CT (8 stores)
  Chain
    $984 Amazoon (14 stores)
    $734 Houses&Royalty (9 stores)
    $387 Books-r-us (7 stores)
  Store
    $108 Amazoon Branchburg
    $93 Books-r-us Bridgewater
    $87 H&R NYC

Number of Books
  Chain
    201K Houses&Royalty
    183K Amazoon
    98K Books-r-us
  Store
    193K H&R NYC
    77K Books-r-us Bridgewater
    68K Amazoon Branchburg
Implementation

date_breakout : {
  type : range,
  field : sale_date,
  start : ...,
  end : ...,
  gap : "+1MONTH",
  facet : {
    top_genres : {
      type : terms,
      field : genre,
      sort : "revenue desc",
      limit : 4,
      facet : { revenue : "sum(sales)" }
    },
    by_chain : {
      type : terms,
      field : chain,
      facet : { revenue : "sum(sales)" }
    }
  }
}

Range Facet (sale_date)
  Terms Facet (genre) -> sum(sales)
  Terms Facet (chain) -> sum(sales)
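Nested facet definitions like this one are easier to manage when composed from small helpers. The following is a sketch of one way to do that in Python. The helper names are invented, and the `start` and `end` values are placeholders chosen for illustration, because the slide elides them.

```python
import json


def terms(field, **options):
    """Build a terms facet definition for a field, with optional extra settings."""
    facet = {"type": "terms", "field": field}
    facet.update(options)
    return facet


def revenue_by(field, **options):
    """Terms facet that also computes revenue = sum(sales) for each bucket."""
    return terms(field, facet={"revenue": "sum(sales)"}, **options)


date_breakout = {
    "type": "range",
    "field": "sale_date",
    "start": "2015-01-01T00:00:00Z",  # placeholder: not given on the slide
    "end": "2016-01-01T00:00:00Z",    # placeholder: not given on the slide
    "gap": "+1MONTH",
    "facet": {
        "top_genres": revenue_by("genre", sort="revenue desc", limit=4),
        "by_chain": revenue_by("chain"),
    },
}

# The value that would be sent as the json.facet request parameter.
json_facet = json.dumps({"date_breakout": date_breakout}, indent=2)
print(json_facet)
```

Keeping the per-bucket statistic in one helper makes it harder for the `sort` key and the name of the statistic it refers to, here `revenue`, to drift apart.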
Implementation (continued)

top_genres : {
  type : terms,
  field : genre,
  facet : {
    rev : "sum(sales)",
    top_authors : {
      type : terms,
      field : author,
      sort : "rev desc",
      limit : 3,
      facet : { rev : "sum(sales)" }
    },
    top_books : {
      type : terms,
      field : title,
      sort : "rev desc",
      limit : 4,
      facet : { rev : "sum(sales)" }
    }
  }
}

Terms Facet (genre) -> sum(sales)
  Terms Facet (author) -> sum(sales)
  Terms Facet (title) -> sum(sales)
Tested Facet Request

Legacy (stats component & pivot facets)
  facet=true&stats=true
  &stats.field={!tag=stat1+mean=true}field2
  &facet.pivot={!stats=stat1}field1
  &f.field1.limit=10

JSON Facet API (New Facet Module)
  json.facet={
    f : {
      type : terms,
      field : field1,
      facet : { mean : "avg(field2)" }
    }
  }
Indexing Nested Documents

id : book1
title : The Way of Kings
author : Brandon Sanderson

  id : book1_review1
  review_author : Yonik
  stars : 5
  comment : A great start to what ...

  id : book1_review2
  review_author : Dan
  stars : 3
  comment : This book was too long

id : book2
title : Snow Crash
author : Neal Stephenson

  id : book2_review1
  review_author : Yonik
  stars : 5
  comment : Ahead of its time ...

Lucene index view (flat)
book1_review1
book1_review2
book1
book2_review1
book2

• Group indexed as a "block"
• atomic
• internal document ids contiguous
• enables quick and inexpensive joins
Indexing Nested Documents (JSON format)
{
  id : book1,
  type : book,
  title : "The Way of Kings",
  author : "Brandon Sanderson",
  genre : fantasy,
  pubyear : 2010,
  publisher : Tor,
  _childDocuments_ : [
    {
      id : book1_review1,
      type : review,
      review_dt : "2015-01-03T14:30:00Z",
      stars : 5,
      review_author : Yonik,
      comment : "A great start to what looks like an epic series!"
    },
    {
      id : book1_review2,
      type : review,
      review_dt : "2015-03-15T12:00:00Z",
      stars : 3,
      review_author : Dan,
      comment : "This book was too long."
    }
  ]
}
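The flat index view from the earlier slide, with children first and the parent last in each block, can be derived from this nested JSON shape with a short helper. This is a sketch of the ordering only; `flatten_block` is a made-up name and says nothing about how Solr performs the step internally.

```python
def flatten_block(doc):
    """Return the documents of one block in index order: children, then the parent."""
    children = doc.get("_childDocuments_", [])
    parent = {key: value for key, value in doc.items() if key != "_childDocuments_"}
    return [dict(child) for child in children] + [parent]


book1 = {
    "id": "book1",
    "type": "book",
    "title": "The Way of Kings",
    "author": "Brandon Sanderson",
    "_childDocuments_": [
        {"id": "book1_review1", "type": "review", "stars": 5, "review_author": "Yonik"},
        {"id": "book1_review2", "type": "review", "stars": 3, "review_author": "Dan"},
    ],
}

block = flatten_block(book1)
print([doc["id"] for doc in block])
```

Because the whole block is written together, the children and their parent end up with contiguous internal document ids, which is the property the block join relies on.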
Block Join Queries
• Find reviews mentioning "epic", limiting to reviews for books published by Tor
  q=comment:epic
  fq={!child of="type:book"}publisher:Tor
  sort=review_dt desc
• Find books published by Tor with a review mentioning "epic"
  q=publisher:Tor
  fq={!parent which="type:book"}comment:epic
  sort=pubyear desc
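The two join directions can be emulated over a flat, block-ordered list to show why contiguous ids make them cheap: each direction is a single pass with no lookups. This is a toy model of the semantics, not Lucene's implementation. The helper names are invented, and the publisher given for book2 is a made-up value so that the filter has something to exclude.

```python
# Flat index in block order: children first, parent last in each block.
index = [
    {"id": "book1_review1", "type": "review",
     "comment": "A great start to what looks like an epic series!"},
    {"id": "book1_review2", "type": "review", "comment": "This book was too long."},
    {"id": "book1", "type": "book", "publisher": "Tor"},
    {"id": "book2_review1", "type": "review", "comment": "Ahead of its time"},
    {"id": "book2", "type": "book", "publisher": "ExamplePress"},  # made-up publisher
]


def is_parent(doc):
    # Plays the role of the "type:book" parent filter.
    return doc["type"] == "book"


def to_parents(index, child_matches):
    """Like {!parent which=...}: parents having at least one matching child."""
    results, hit = [], False
    for doc in index:
        if is_parent(doc):
            if hit:
                results.append(doc)
            hit = False  # the block ends at its parent
        elif child_matches(doc):
            hit = True
    return results


def to_children(index, parent_matches):
    """Like {!child of=...}: all children of matching parents."""
    results, block = [], []
    for doc in index:
        if is_parent(doc):
            if parent_matches(doc):
                results.extend(block)
            block = []
        else:
            block.append(doc)
    return results


def mentions_epic(doc):
    return "epic" in doc["comment"].lower()


def published_by_tor(doc):
    return doc["publisher"] == "Tor"


# Reviews mentioning "epic", limited to reviews of books published by Tor.
reviews = [doc for doc in to_children(index, published_by_tor) if mentions_epic(doc)]

# Books published by Tor that have a review mentioning "epic".
books = [doc for doc in to_parents(index, mentions_epic) if published_by_tor(doc)]

print([doc["id"] for doc in reviews], [doc["id"] for doc in books])
```

In both functions the parent acts as the block terminator, so the join never has to search for related documents outside the current run of adjacent entries.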
Block Join Faceting (child to parent)
• Find the number of books I (Yonik) reviewed, broken out by Genre

q=review_author:Yonik&
json.facet={
  genres : {
    type : terms,
    field : genre,
    domain : { blockParent : "type:book" }
  }
}
Block Join Faceting (parent to child)
• Find the top reviewers for sci-fi and fantasy books

q=genre:(sci-fi OR fantasy)&
json.facet={
  top_reviewers : {
    type : terms,
    field : review_author,
    domain : { blockChildren : "type:book" }
  }
}