Advanced search and Top-K queries in Cassandra

Andrés de la Peña andres@stratio.com @a_de_la_pena

Apache Cassandra Meetup 2015

Who are we?

•  Stratio is a Big Data Company

•  Founded in 2013

•  Commercially launched in 2014

•  70+ employees in Madrid

•  Office in San Francisco

•  Certified Spark distribution

CONTENTS

•  Introduction to Cassandra

•  Cassandra query methods

•  Stratio Lucene based 2i implementation

•  Integrating Lucene 2i with Apache Spark

Apache Cassandra overview

•  Tunable consistency: tradeoffs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.

•  Incremental scalability: nodes added to a cluster increase throughput in a predictable, linear fashion.

•  The best of Dynamo & Bigtable: combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.

•  Decentralized: P2P architecture without a master node or single point of failure.
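The tunable consistency tradeoff can be made concrete with the usual replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The following is a minimal Python sketch of that arithmetic; the function names are hypothetical and chosen only for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that form a majority (QUORUM) for a given replication factor."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set: R + W > N."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
# QUORUM reads combined with QUORUM writes always overlap in one replica.
print(q, is_strongly_consistent(q, q, rf))
# ONE + ONE gives lower latency but no overlap guarantee.
print(is_strongly_consistent(1, 1, rf))
```

Lower read and write counts trade the overlap guarantee for latency, which is the tuning knob the slide refers to.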

Apache Cassandra operators

[Figure: primary key, secondary indexes and token ranges, plotted by throughput against expressiveness.]

Primary key queries

•  O(1) node lookup for partition key

•  Range slices for clustering key

•  Usually requires denormalization
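The two access paths of a primary key query can be sketched in a few lines of Python: hashing the partition key picks the owning node in constant time, and a binary search over rows sorted by clustering key yields a range slice. This is a simplified illustration, not Cassandra's implementation: the node list is hypothetical, and an MD5 hash with a modulo stands in for the real partitioner and token ring.

```python
import bisect
import hashlib

NODES = ["node1", "node2", "node3"]  # hypothetical three-node cluster


def node_for(partition_key: str) -> str:
    """Hash the partition key to pick the owning node (stand-in partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Each partition keeps its rows sorted by clustering key (an ISO date here).
partitions = {
    "dhiguero": [
        ("2014-04-06", "To study and..."),
        ("2014-04-07", "To think and..."),
        ("2014-04-08", "If you see what.."),
    ],
}


def range_slice(partition_key: str, start: str, end: str):
    """Return rows with start <= clustering key <= end via binary search."""
    rows = partitions.get(partition_key, [])
    keys = [key for key, _ in rows]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return rows[lo:hi]


print(node_for("dhiguero"))
print(range_slice("dhiguero", "2014-04-07", "2014-04-08"))
```

Any query that cannot be phrased as "one partition key plus a clustering range" needs another copy of the data keyed differently, which is why this access path usually requires denormalization.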

[Figure: a client sends a partition key and a clustering key range to a three-node cluster (Node 1, Node 2, Node 3). Each partition, such as apena or dhiguero, stores date-keyed body columns, for example 2014-04-06:body, 2014-04-07:body and 2014-04-08:body.]

Secondary indexes queries

•  Inverted index

•  Mitigates denormalization

•  Queries may involve all C* nodes

•  Queries limited to a single column

[Figure: a client querying C* nodes, each with its own local 2i column family.]
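The behaviour listed above (a local inverted index on one column per node, with queries fanned out to every node) can be sketched as follows. This is an illustrative Python model with hypothetical class and function names; it ignores updates, deletes and replication.

```python
from collections import defaultdict


class Node:
    """A node holding its own rows plus a local inverted index on one column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                  # primary key -> row
        self.index = defaultdict(set)   # column value -> primary keys

    def insert(self, key, row):
        # Simplification: assumes each key is inserted once (no re-indexing).
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def lookup(self, value):
        """Resolve the value through the local index only."""
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query_all(nodes, value):
    """Scatter the lookup to every node and concatenate the local results."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


n1, n2 = Node("username"), Node("username")
n1.insert(1, {"id": 1, "username": "apena"})
n1.insert(2, {"id": 2, "username": "dhiguero"})
n2.insert(3, {"id": 3, "username": "apena"})
print(query_all([n1, n2], "apena"))
```

Because each index only knows about its own node's rows, no node can answer for the cluster, which is why a query may have to touch all of them.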

Token range queries

•  Used by MapReduce frameworks such as Hadoop or Spark

•  All kinds of queries are possible

•  Low throughput

•  Ad-hoc queries

•  Batch processing

•  Materialized views

[Figure: a client submits query = function(all data) to the Spark master, which reads from the C* nodes.]

Combining 2i with MapReduce

•  Expressiveness avoiding full scans

•  Still limited by one indexed column per query

[Figure: client, Spark master and C* nodes, with the secondary index applied before data reaches Spark.]

What do we miss from 2i indexes?

MORE EXPRESSIVENESS

•  Range queries

•  Multivariable search

•  Full text search

•  Sorting by fields

•  Top-k queries

What do we like from the existing 2i?

ITS ARCHITECTURE

•  Each node indexes its own data

•  The index implementations do not need to be distributed

•  Can be created after design and ingestion

•  Natural extension point

Thinking about a custom secondary index implementation…

WHY NOT USE LUCENE?

Why we like Lucene

•  Proven stable and fast indexing solution

•  Expressive queries

- Multivariable, ranges, full text, sorting, top-k, etc.

•  Mature distributed search solutions built on top of it

- Solr, ElasticSearch

•  Can be fully embedded in application code

•  Published under the Apache License

HOW IT WORKS

CREATE TABLE tweets (
  id bigint,
  createdAt timestamp,
  message text,
  userid bigint,
  username text,
  PRIMARY KEY (userid, createdAt, id)
);

ALTER TABLE tweets ADD lucene TEXT;

Create index

•  Built in the background at any moment

•  Real time updates

•  Mapping eases ETL

•  Language aware

CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
  'refresh_seconds' : '60',
  'schema' : '{
    default_analyzer : "EnglishAnalyzer",
    fields : {
      createdat : {type : "date", pattern : "yyyy-MM-dd"},
      message   : {type : "text", analyzer : "EnglishAnalyzer"},
      userid    : {type : "string"},
      username  : {type : "string"}
    }
  }'
};

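Filtering query

SELECT * FROM tweets WHERE lucene = '{
  filter : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 10;

[Figure: the client searches for 10 rows. The Lucene index on one C* node finds 6 and the next node finds 4, so the query is complete without contacting further nodes: "We are done!"]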
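The "search 10, found 6, found 4" sequence above can be read as an early-stopping gather: since a filter has no ranking, any 10 matching rows will do, and collection stops as soon as the limit is reached. The Python sketch below models that reading under the assumption that nodes are asked one after another; the function name and the in-memory node lists are hypothetical.

```python
def filtered_search(nodes, predicate, limit):
    """Ask nodes in turn for matching rows and stop once `limit` rows are collected.

    Returns the collected rows and the number of nodes that were contacted.
    """
    collected = []
    contacted = 0
    for node_rows in nodes:
        if len(collected) >= limit:
            break  # enough rows: remaining nodes are never contacted
        contacted += 1
        remaining = limit - len(collected)
        matches = [row for row in node_rows if predicate(row)]
        collected.extend(matches[:remaining])
    return collected, contacted


# Three nodes holding 6, 4 and 5 matching rows respectively.
nodes = [
    ["cassandra tweet %d" % i for i in range(6)] + ["other"],
    ["cassandra tweet %d" % i for i in range(6, 10)],
    ["cassandra tweet %d" % i for i in range(10, 15)],
]
rows, contacted = filtered_search(nodes, lambda r: "cassandra" in r, limit=10)
print(len(rows), contacted)
```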

Top-k query

SELECT * FROM tweets WHERE lucene = '{
  query : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 5;

[Figure: the client sends a top-5 search to every C* node. The Lucene indexes find 5, 4 and 5 rows, and the 14 candidates are merged into the best 5.]
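Unlike a filter, a top-k query cannot stop early: the globally best rows may live on any node, so each node returns its own best 5 and the candidates are merged down to 5. The Python sketch below illustrates that scatter-and-merge step with `heapq`; the scores and the function name are made up for the example and do not reproduce Lucene's scoring formula.

```python
import heapq


def top_k(nodes, k):
    """Each node contributes its k best (score, row) pairs; keep the best k overall.

    Returns the merged top k and the number of candidates that were merged.
    """
    candidates = []
    for scored_rows in nodes:
        # Local top-k: a node never needs to return more than k rows.
        candidates.extend(heapq.nlargest(k, scored_rows, key=lambda pair: pair[0]))
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return best, len(candidates)


nodes = [
    [(0.90, "a1"), (0.80, "a2"), (0.70, "a3"), (0.60, "a4"), (0.50, "a5"), (0.10, "a6")],
    [(0.95, "b1"), (0.40, "b2"), (0.30, "b3"), (0.20, "b4")],
    [(0.85, "c1"), (0.75, "c2"), (0.65, "c3"), (0.55, "c4"), (0.45, "c5")],
]
best, merged = top_k(nodes, 5)
print(merged, [row for _, row in best])
```

The cost of the merge grows with the number of nodes times k rather than with the number of matching rows, which keeps small LIMIT values cheap.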

Queries can be as complex as you want

SELECT * FROM tweets WHERE lucene = '{
  filter : {
    type : "boolean",
    must : [
      {type : "range", field : "time", lower : "2014/04/25"},
      {type : "boolean", should : [
        {type : "prefix", field : "user", value : "a"},
        {type : "wildcard", field : "user", value : "*b*"},
        {type : "match", field : "user", value : "fast"}
      ]}
    ]
  },
  sort : {
    fields : [
      {field : "time", reverse : true},
      {field : "user", reverse : false}
    ]
  }
}' LIMIT 10000;
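Nested search expressions like the one above are easier to get right when they are assembled from small pieces rather than typed by hand. The Python sketch below builds the expression as dictionaries and serializes it with `json.dumps`. The helper names are hypothetical, and the sketch assumes the index also accepts strictly quoted JSON keys, which the slides (written with unquoted keys) do not confirm.

```python
import json


def match(field, value):
    """A single match condition."""
    return {"type": "match", "field": field, "value": value}


def boolean(must=(), should=()):
    """Combine conditions; empty clauses are omitted."""
    condition = {"type": "boolean"}
    if must:
        condition["must"] = list(must)
    if should:
        condition["should"] = list(should)
    return condition


def to_cql(table, search, limit):
    """Embed the search expression in a CQL string literal (single quotes doubled)."""
    expression = json.dumps(search).replace("'", "''")
    return f"SELECT * FROM {table} WHERE lucene = '{expression}' LIMIT {limit};"


search = {
    "filter": boolean(
        must=[
            {"type": "range", "field": "time", "lower": "2014/04/25"},
            boolean(should=[
                {"type": "prefix", "field": "user", "value": "a"},
                {"type": "wildcard", "field": "user", "value": "*b*"},
                match("user", "fast"),
            ]),
        ]
    ),
    "sort": {"fields": [{"field": "time", "reverse": True},
                        {"field": "user", "reverse": False}]},
}
print(to_cql("tweets", search, 10000))
```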

NO MAINTENANCE REQUIRED

Some implementation details

•  A Lucene document per CQL row, and a Lucene field per indexed column

•  SortingMergePolicy keeps the index sorted in the same way that C* does

•  Index commits synchronized with column family flushes

•  Segment merges synchronized with column family compactions

LUCENE AND SPARK

Integrating Lucene & Spark

Split friendly: it supports searches within a token range

SELECT * FROM tweets WHERE lucene = '{
  filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;

Paging friendly: it supports starting queries at a certain point

SELECT * FROM tweets WHERE lucene = '{
  filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
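Split friendliness is what lets a framework such as Spark run one indexed query per token range in parallel. The Python sketch below divides the token space into contiguous ranges and emits one query per split. It assumes the default Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1, and it applies TOKEN to userid because that is the partition key of the tweets table; the function names are hypothetical.

```python
MIN_TOKEN = -2**63       # lower bound of Murmur3Partitioner tokens
MAX_TOKEN = 2**63 - 1    # upper bound of Murmur3Partitioner tokens


def token_splits(n):
    """Divide the full token range into n contiguous (start, end] intervals."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(search, start, end, limit=10000):
    """Restrict an indexed search to one token range."""
    return (
        f"SELECT * FROM tweets WHERE lucene = '{search}' "
        f"AND TOKEN(userid) > {start} AND TOKEN(userid) <= {end} "
        f"LIMIT {limit};"
    )


SEARCH = '{ filter : {type:"match", field:"text", value:"cassandra"}}'
for start, end in token_splits(4):
    print(split_query(SEARCH, start, end))
```

Each split can be handed to a separate worker, and each worker only receives the rows that pass the Lucene filter.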

•  Compute large amounts of data

•  Avoid systematic full scan

•  Reduces the amount of data to be processed

•  Filtering push-down

[Figure: client, Spark master and C* nodes, each node with its own Lucene index.]

WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN

Index performance in Spark

[Figure: full scan versus Lucene 2i, plotted against records returned.]

DEMO Lucene indexes in C*

Conclusions

•  Added new query methods

- Multivariable queries (AND, OR, NOT)

- Range queries (>, >=, <, <=) and regular expressions

- Full text queries (match, phrase, fuzzy...)

•  Top-k query support

- Lucene scoring formula

- Sort by field values

•  Compatible with MapReduce frameworks

•  Preserves Cassandra’s functionality

It's open source

github.com/stratio/stratio-cassandra

•  Published as a fork of Apache Cassandra

•  Apache License Version 2.0

stratio.github.io/crossdata

•  Apache License Version 2.0

github.com/stratio/deep-spark

•  Apache License Version 2.0
