![Page 1: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/1.jpg)
Advanced search and Top-K queries in Cassandra
Andrés de la Peña @a_de_la_pena
Apache Cassandra Meetup 2015
![Page 2: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/2.jpg)
Who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 70+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
![Page 3: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/3.jpg)
CONTENTS
1 Introduction to Cassandra
2 Cassandra query methods
3 Stratio Lucene-based 2i implementation
4 Integrating Lucene 2i with Apache Spark
![Page 4: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/4.jpg)
Apache Cassandra overview
• Tunable consistency: trade-offs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.
• Incremental scalability: nodes added to a cluster increase throughput in a predictable, linear fashion.
• The best of Dynamo & Bigtable: combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.
• Decentralized: P2P architecture without a master node or single point of failure.
![Page 5: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/5.jpg)
Apache Cassandra operators
![Page 6: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/6.jpg)
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges compared on two axes, Throughput and Expressiveness]
![Page 7: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/7.jpg)
Primary key queries
• O(1) node lookup for partition key
• Range slices for clustering key
• Usually requires denormalization
[Figure: a CLIENT sends a partition key (apena) and a clustering key range to a cluster of three nodes (Node 1, Node 2, Node 3). The partitions aagea, dhiguero and apena hold date-keyed cells such as 2014-04-06:body, 2014-04-10:body and 2014-04-11:body; the query returns apena, 2014-04-10:body, "When you.."]
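The two steps of a primary key query can be sketched in a few lines of Python. This is an illustration only, not Cassandra code: `node_for` uses an MD5 hash and a modulo as a stand-in for the real partitioner and token ring, and the `PARTITIONS` data is made up, loosely following the figure.

```python
import bisect
import hashlib

NODES = ["Node 1", "Node 2", "Node 3"]

# Illustrative data: partition key -> cells sorted by clustering key (a date).
PARTITIONS = {
    "aagea": [
        ("2014-04-06", "To study and"),
        ("2014-04-07", "To think and"),
        ("2014-04-08", "If you see what"),
    ],
    "dhiguero": [("2014-04-06", "The cautious")],
    "apena": [("2014-04-10", "When you"), ("2014-04-11", "When you do")],
}


def node_for(partition_key):
    """Pick the owning node in constant time by hashing the partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def range_slice(partition_key, start, end):
    """Return the cells whose clustering key lies in [start, end]."""
    cells = PARTITIONS.get(partition_key, [])
    keys = [key for key, _ in cells]
    lo = bisect.bisect_left(keys, start)    # first cell >= start
    hi = bisect.bisect_right(keys, end)     # first cell > end
    return cells[lo:hi]


print(node_for("apena"), range_slice("apena", "2014-04-10", "2014-04-10"))
```

Because the cells of a partition are kept sorted, the range slice is a binary search plus a contiguous read. Any query that does not start from the partition key cannot use this path, which is why denormalization is usually needed.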
![Page 8: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/8.jpg)
![Page 9: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/9.jpg)
Secondary index queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
[Figure: a CLIENT queries three C* nodes, each holding its own 2i local column family]
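A rough Python sketch of why such queries may touch every node: each node indexes only its own rows, so an equality lookup has to be sent to all of them. The `Node` class, `query_all_nodes` and the sample rows are hypothetical names for illustration, not Cassandra internals.

```python
from collections import defaultdict


class Node:
    """A node holding some rows plus an inverted index over one column."""

    def __init__(self, rows, column):
        self.rows = rows                   # primary key -> row (a dict)
        self.index = defaultdict(set)      # column value -> primary keys
        for key, row in rows.items():
            self.index[row[column]].add(key)

    def lookup(self, value):
        """Answer an equality query from the local index only."""
        return [self.rows[key] for key in sorted(self.index.get(value, ()))]


def query_all_nodes(nodes, value):
    """Ask every node, since any of them may hold matching rows."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


nodes = [
    Node({1: {"id": 1, "username": "apena"},
          2: {"id": 2, "username": "aagea"}}, "username"),
    Node({3: {"id": 3, "username": "apena"}}, "username"),
    Node({4: {"id": 4, "username": "dhiguero"}}, "username"),
]
print(query_all_nodes(nodes, "apena"))
```

The index maps one column value to row keys, which is also why a single query is limited to one indexed column.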
![Page 10: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/10.jpg)
![Page 11: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/11.jpg)
Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
[Figure: a CLIENT and a Spark master reading from three C* nodes; query = function(all data)]
![Page 12: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/12.jpg)
Combining 2i with MapReduce
• Expressiveness without full scans
• Still limited to one indexed column per query
[Figure: a CLIENT and a Spark master querying three C* nodes, each through its own secondary index]
![Page 13: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/13.jpg)
What do we miss from 2i indexes?
MORE EXPRESSIVENESS
• Range queries
• Multivariable search
• Full text search
• Sorting by fields
• Top-k queries
![Page 14: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/14.jpg)
What do we like about the existing 2i?
ITS ARCHITECTURE
• Each node indexes its own data
• The index implementations do not need to be distributed
• Can be created after design and ingestion
• Natural extension point
![Page 15: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/15.jpg)
Thinking of a custom secondary index implementation…
WHY NOT USE LUCENE?
![Page 16: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/16.jpg)
Why we like Lucene
• Proven stable and fast indexing solution
• Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
• Mature distributed search solutions built on top of it
- Solr, ElasticSearch
• Can be fully embedded in application code
• Published under the Apache License
![Page 17: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/17.jpg)
HOW IT WORKS
![Page 18: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/18.jpg)
Create index

CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id)
);

ALTER TABLE tweets ADD lucene TEXT;

CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message : {type : "text", analyzer : "EnglishAnalyzer"},
            userid : {type : "string"},
            username : {type : "string"}
        }
    }'
};

• Built in the background at any moment
• Real-time updates
• Mapping eases ETL
• Language aware
![Page 19: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/19.jpg)
Filtering query

SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 10;

[Figure: a CLIENT and three C* nodes, each with a local Lucene index. Labels: search 10, found 6, found 4, "We are done!"]
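One reading of the filtering figure, sketched in Python: a filter has no relevance ordering, so the search can stop as soon as enough rows have been collected, and the remaining nodes are never contacted. The function `filtered_search` and the sample data are assumptions made for illustration; the real coordinator logic is not shown on the slide.

```python
def filtered_search(nodes, predicate, limit):
    """Collect up to `limit` matching rows, stopping once the limit is met.

    Returns the rows and the number of nodes that were actually contacted.
    """
    results = []
    contacted = 0
    for rows in nodes:
        if len(results) >= limit:
            break                          # "We are done!"
        contacted += 1
        matches = [row for row in rows if predicate(row)]
        results.extend(matches[: limit - len(results)])
    return results, contacted


# Three hypothetical nodes holding 6, 4 and 7 matching messages.
nodes = [
    ["cassandra rocks %d" % i for i in range(6)] + ["unrelated"],
    ["about cassandra %d" % i for i in range(4)],
    ["more cassandra %d" % i for i in range(7)],
]
rows, contacted = filtered_search(nodes, lambda text: "cassandra" in text, 10)
print(len(rows), contacted)  # prints: 10 2
```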
![Page 20: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/20.jpg)
Top-k query

SELECT * FROM tweets WHERE lucene = '{
    query : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 5;

[Figure: the CLIENT sends "Search top-5" to three C* nodes, each with a local Lucene index. The nodes report Found 5, Found 4 and Found 5; the results are combined: "Merge 14 to best 5"]
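The "merge 14 to best 5" step can be sketched as follows. Unlike a filter, a relevance query cannot stop early: every node must return its own best k hits, because the global best k may all live on one node. The scores and the helper names `node_top_k` and `merge_top_k` are invented for this example.

```python
import heapq


def node_top_k(scored_rows, k):
    """Each node returns its own k best (score, row) pairs."""
    return heapq.nlargest(k, scored_rows, key=lambda hit: hit[0])


def merge_top_k(per_node_hits, k):
    """The coordinator keeps the k best hits among all node results."""
    candidates = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, candidates, key=lambda hit: hit[0])


node_scores = [
    [0.90, 0.80, 0.70, 0.60, 0.50, 0.10],   # 6 matches, returns 5
    [0.95, 0.85, 0.30, 0.20],               # 4 matches, returns 4
    [0.99, 0.40, 0.35, 0.33, 0.31, 0.05],   # 6 matches, returns 5
]
per_node = [
    node_top_k([(s, "node%d-row%d" % (n, i)) for i, s in enumerate(scores)], 5)
    for n, scores in enumerate(node_scores)
]
best = merge_top_k(per_node, 5)
print(sum(len(hits) for hits in per_node), [score for score, _ in best])
# prints: 14 [0.99, 0.95, 0.9, 0.85, 0.8]
```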
![Page 21: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/21.jpg)
Queries can be as complex as you want

SELECT * FROM tweets WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should : [
                {type : "prefix", field : "user", value : "a"},
                {type : "wildcard", field : "user", value : "*b*"},
                {type : "match", field : "user", value : "fast"}
            ]}
        ]
    },
    sort : {
        fields : [
            {field : "time", reverse : true},
            {field : "user", reverse : false}
        ]
    }
}' LIMIT 10000;
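To make the meaning of such a nested search concrete, here is a small in-memory evaluator in Python. It is a sketch under stated assumptions, not the index implementation: `must` is treated as AND, a clause list containing only `should` as OR (following the usual Lucene boolean convention), the `lower` bound is taken as inclusive, and wildcards are matched with `fnmatch`. The rows are made up.

```python
from fnmatch import fnmatchcase


def matches(cond, row):
    """Evaluate one condition of the search against a row (a dict)."""
    kind = cond["type"]
    if kind == "boolean":
        must = cond.get("must", [])
        should = cond.get("should", [])
        if must:                                   # AND of all must clauses
            return all(matches(c, row) for c in must)
        return any(matches(c, row) for c in should)   # OR of should clauses
    value = row[cond["field"]]
    if kind == "range":
        return value >= cond["lower"]              # assumed inclusive
    if kind == "prefix":
        return value.startswith(cond["value"])
    if kind == "wildcard":
        return fnmatchcase(value, cond["value"])
    if kind == "match":
        return value == cond["value"]
    raise ValueError("unsupported condition type: %s" % kind)


def sort_rows(rows, fields):
    """Multi-field sort: apply the least significant field first."""
    ordered = list(rows)
    for spec in reversed(fields):
        ordered.sort(key=lambda r: r[spec["field"]], reverse=spec["reverse"])
    return ordered


search_filter = {"type": "boolean", "must": [
    {"type": "range", "field": "time", "lower": "2014/04/25"},
    {"type": "boolean", "should": [
        {"type": "prefix", "field": "user", "value": "a"},
        {"type": "wildcard", "field": "user", "value": "*b*"},
        {"type": "match", "field": "user", "value": "fast"},
    ]},
]}
sort_fields = [{"field": "time", "reverse": True},
               {"field": "user", "reverse": False}]
rows = [
    {"user": "alice", "time": "2014/04/26"},
    {"user": "bob", "time": "2014/04/27"},
    {"user": "fast", "time": "2014/04/26"},
    {"user": "carol", "time": "2014/04/28"},
    {"user": "abel", "time": "2014/04/20"},
]
hits = sort_rows([r for r in rows if matches(search_filter, r)], sort_fields)
print([r["user"] for r in hits])  # prints: ['bob', 'alice', 'fast']
```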
![Page 22: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/22.jpg)
Some implementation details
• A Lucene document per CQL row, and a Lucene field per indexed column
• SortingMergePolicy keeps the index sorted in the same way that C* does
• Index commits synchronized with column family flushes
• Segment merges synchronized with column family compactions
NO MAINTENANCE REQUIRED
![Page 23: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/23.jpg)
LUCENE AND SPARK
![Page 24: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/24.jpg)
Integrating Lucene & Spark
Split friendly: it supports searches within a token range

SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
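How a MapReduce-style job might use this: the token ring is cut into contiguous ranges and one restricted query is issued per split. The sketch below assumes 64-bit signed tokens (as with the Murmur3 partitioner) and applies `TOKEN()` to `userid`, the partition key of the `tweets` table defined earlier. `split_ring` and `split_query` are hypothetical helpers, not part of any connector.

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

QUERY = (
    "SELECT * FROM tweets WHERE lucene = '{search}' "
    "AND TOKEN(userid) > {start} AND TOKEN(userid) <= {end} LIMIT {limit};"
)


def split_ring(splits):
    """Cut the token ring into `splits` contiguous (start, end] ranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // splits
    bounds = [MIN_TOKEN + i * width for i in range(splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(search, start, end, limit=10000):
    """Build the CQL statement for one split of the ring."""
    return QUERY.format(search=search, start=start, end=end, limit=limit)


search = '{filter : {type:"match", field:"text", value:"cassandra"}}'
for start, end in split_ring(4):
    print(split_query(search, start, end))
```

Each split can then be processed by a different worker, and the index filter is applied on the Cassandra side before any row leaves the node.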
![Page 25: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/25.jpg)
Integrating Lucene & Spark
Paging friendly: it supports starting queries at a certain point

SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
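The paging pattern, sketched in Python over an in-memory partition: each page restarts just after the clustering key of the last row seen. For brevity the sketch assumes `created_at` values are unique within the partition; the real clustering key also includes `id`. `fetch_page` and `fetch_all` are illustrative names only.

```python
def fetch_page(rows, predicate, after, page_size):
    """Return up to page_size matching rows with created_at > after."""
    page = []
    for created_at, message in rows:       # rows are sorted by created_at
        if after is not None and created_at <= after:
            continue
        if predicate(message):
            page.append((created_at, message))
            if len(page) == page_size:
                break
    return page


def fetch_all(rows, predicate, page_size):
    """Read every match page by page, resuming after the last row seen."""
    results, after = [], None
    while True:
        page = fetch_page(rows, predicate, after, page_size)
        if not page:
            return results
        results.extend(page)
        after = page[-1][0]                # restart point for the next page


rows = [(day, "cassandra tweet %d" % day) for day in range(1, 8)]
pages = fetch_all(rows, lambda text: "cassandra" in text, 3)
print(len(pages))  # prints: 7
```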
![Page 26: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/26.jpg)
Integrating Lucene & Spark
• Compute large amounts of data
• Avoid systematic full scans
• Reduce the amount of data to be processed
• Filtering push-down
[Figure: a CLIENT and a Spark master querying three C* nodes, each with a local Lucene index]
![Page 27: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/27.jpg)
WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
![Page 28: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/28.jpg)
Index performance in Spark
[Figure: time versus records returned, comparing Full scan with Lucene 2i]
![Page 29: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/29.jpg)
DEMO Lucene indexes in C*
![Page 30: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/30.jpg)
Conclusions
• Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
• Top-k query support
- Lucene scoring formula
- Sort by field values
• Compatible with MapReduce frameworks
• Preserves Cassandra’s functionality
![Page 31: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/31.jpg)
It's open source
• github.com/stratio/stratio-cassandra
- Published as a fork of Apache Cassandra
- Apache License Version 2.0
• stratio.github.io/crossdata
- Apache License Version 2.0
• github.com/stratio/deep-spark
- Apache License Version 2.0
![Page 32: Advanced search and Top-K queries in Cassandra](https://reader033.vdocuments.us/reader033/viewer/2022042522/55a9396d1a28ab490a8b4860/html5/thumbnails/32.jpg)