Advanced search and Top-K queries in Cassandra

Post on 18-Jul-2015



Andrés de la Peña andres@stratio.com @a_de_la_pena

Apache Cassandra Meetup 2015

Who are we?

•  Stratio is a Big Data company
•  Founded in 2013
•  Commercially launched in 2014
•  70+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution

CONTENTS

1  Introduction to Cassandra
2  Cassandra query methods
3  Stratio Lucene based 2i implementation
4  Integrating Lucene 2i with Apache Spark

Apache Cassandra overview

•  Tunable consistency: tradeoffs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.
•  Incremental scalability: nodes added to a cluster increase throughput in a predictable, linear fashion.
•  The best of Dynamo & Bigtable: combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.
•  Decentralized: P2P architecture without a master node or single point of failure.

Apache Cassandra operators


Cassandra query methods

[Figure: primary key, secondary index and token range queries, compared by throughput and expressiveness]

Primary key queries

•  O(1) node lookup for partition key
•  Range slices for clustering key
•  Usually requires denormalization

[Figure: a client sends a partition key (apena) and a clustering key range to one of three nodes; the sample partitions aagea, dhiguero and apena hold date-ordered body columns such as 2014-04-10:body]
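The two access paths named on this slide can be illustrated with a small Python sketch. It is a toy model, not Cassandra's code: the MD5-based token function, the Ring and Partition classes, the node names and the row bodies are all assumptions made for illustration (Cassandra's default partitioner uses Murmur3). It shows a partition key being hashed to pick the owning node, and a clustering key range being served as a contiguous slice of a sorted partition.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Toy token function (MD5-based); Cassandra's default partitioner uses Murmur3."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16) % 2**32


class Ring:
    """Hypothetical ring: maps a partition key to the node owning its token."""

    def __init__(self, node_names):
        pairs = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in pairs]
        self.names = [n for _, n in pairs]

    def node_for(self, partition_key):
        # First node whose token is >= the key's token, wrapping around the ring.
        i = bisect.bisect_left(self.tokens, token(partition_key))
        return self.names[i % len(self.names)]


class Partition:
    """Rows of one partition, kept sorted by clustering key."""

    def __init__(self):
        self.keys, self.values = [], []

    def insert(self, clustering_key, value):
        i = bisect.bisect_left(self.keys, clustering_key)
        self.keys.insert(i, clustering_key)
        self.values.insert(i, value)

    def range_slice(self, low, high):
        # Contiguous slice of rows with low <= clustering key <= high.
        i = bisect.bisect_left(self.keys, low)
        j = bisect.bisect_right(self.keys, high)
        return list(zip(self.keys[i:j], self.values[i:j]))


ring = Ring(["node1", "node2", "node3"])
apena = Partition()
# Illustrative rows only; the bodies are placeholders.
for day in ["2014-04-11", "2014-04-06", "2014-04-10"]:
    apena.insert(day, "body of " + day)

owner = ring.node_for("apena")
recent = apena.range_slice("2014-04-07", "2014-04-10")
```

The lookup cost depends on the number of nodes, not on the amount of data, which is the sense in which the slide calls it O(1).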


Secondary index queries

•  Inverted index
•  Mitigates denormalization
•  Queries may involve all C* nodes
•  Queries limited to a single column

[Figure: a client querying three C* nodes, each with its own local 2i column family]
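A rough Python sketch of why such queries may involve all nodes: each node keeps an inverted index over its own rows only, so the coordinator cannot tell in advance which nodes hold matches. The Node class, the query function and the sample rows are hypothetical, and updates and deletions are ignored.

```python
from collections import defaultdict


class Node:
    """Hypothetical node holding some rows plus a local inverted index on one column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # primary key -> row
        self.index = defaultdict(set)  # indexed value -> primary keys stored on this node

    def insert(self, key, row):
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def lookup(self, value):
        # Only local data is visible to the local index.
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query(nodes, value):
    """Scatter the lookup to every node and gather the matches."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


nodes = [Node("username"), Node("username"), Node("username")]
nodes[0].insert(1, {"id": 1, "username": "apena"})
nodes[1].insert(2, {"id": 2, "username": "aagea"})
nodes[2].insert(3, {"id": 3, "username": "apena"})
matches = query(nodes, "apena")
```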


Token range queries

•  Used by MapReduce frameworks such as Hadoop or Spark
•  All kinds of queries are possible
•  Low throughput
•  Ad-hoc queries
•  Batch processing
•  Materialized views

query = function(all data)

[Figure: client, Spark master and three C* nodes]

Combining 2i with MapReduce

•  Expressiveness avoiding full scans
•  Still limited by one indexed column per query

[Figure: client, Spark master and three C* nodes, each with a secondary index]

What do we miss from 2i indexes?

MORE EXPRESSIVENESS

•  Range queries
•  Multivariable search
•  Full text search
•  Sorting by fields
•  Top-k queries

What do we like from the existing 2i?

ITS ARCHITECTURE

•  Each node indexes its own data
•  The index implementations do not need to be distributed
•  Can be created after design and ingestion
•  Natural extension point

Thinking about a custom secondary index implementation…

WHY NOT USE LUCENE?

Why we like Lucene

•  Proven stable and fast indexing solution
•  Expressive queries
   - Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
   - Solr, ElasticSearch
•  Can be fully embedded in application code
•  Published under the Apache License

HOW IT WORKS


Create index

•  Can be built in the background at any moment
•  Real time updates
•  Mapping eases ETL
•  Language aware

CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id)
);

ALTER TABLE tweets ADD lucene TEXT;

CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message   : {type : "text", analyzer : "EnglishAnalyzer"},
            userid    : {type : "string"},
            username  : {type : "string"}
        }
    }'
};

Filtering query

SELECT * FROM tweets
WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
LIMIT 10;

[Figure: the client searches for 10 rows across three C* nodes, each with its own Lucene index; one node finds 6, another finds 4, and the query is done]
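The flow in the figure (search 10, found 6, found 4, done) can be sketched in Python as a loop that stops as soon as the limit is met. This is only a reading of the slide: the filtered_search function and the sample data are hypothetical, and whether the real coordinator contacts nodes one by one or in parallel is an assumption here.

```python
def filtered_search(node_rows, predicate, limit):
    """Collect unordered matches node by node, stopping once `limit` rows are found.

    Returns the matches and the number of nodes that had to be contacted.
    """
    results = []
    contacted = 0
    for rows in node_rows:
        if len(results) >= limit:
            break  # "We are done!": remaining nodes are not needed
        contacted += 1
        matches = [row for row in rows if predicate(row)]
        results.extend(matches[: limit - len(results)])
    return results, contacted


# Hypothetical data: 6 matches on the first node, 4 on the second, 3 on the third.
node_rows = [
    ["cassandra"] * 6 + ["other"],
    ["other"] + ["cassandra"] * 4,
    ["cassandra"] * 3,
]
found, contacted = filtered_search(node_rows, lambda row: row == "cassandra", 10)
```

Because a filter imposes no relevance order, any 10 matching rows are a valid answer, which is what allows the early exit.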

Top-k query

SELECT * FROM tweets
WHERE lucene = '{ query : {type : "match", field : "text", value : "cassandra"} }'
LIMIT 5;

[Figure: the client asks each of three C* nodes to search its Lucene index for the top 5; the nodes find 5, 4 and 5 rows, and the 14 candidates are merged to the best 5]
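The "merge 14 to best 5" step can be sketched in Python with heapq. Unlike a filter, a scored query cannot stop early: every node has to report its own best k, because the global best k may come from any of them. The function names, scores and row ids below are made up for illustration and are not the plugin's API.

```python
import heapq


def node_top_k(scored_rows, k):
    """Best k (score, row_id) pairs held by a single node."""
    return heapq.nlargest(k, scored_rows, key=lambda pair: pair[0])


def merge_top_k(partial_results, k):
    """Merge the per-node candidates into the global best k."""
    candidates = (pair for partial in partial_results for pair in partial)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


# Hypothetical relevance scores on three nodes.
node_a = [(0.90, "a1"), (0.80, "a2"), (0.30, "a3"), (0.20, "a4"), (0.10, "a5"), (0.05, "a6")]
node_b = [(0.95, "b1"), (0.50, "b2"), (0.40, "b3"), (0.35, "b4")]
node_c = [(0.85, "c1"), (0.70, "c2"), (0.60, "c3"), (0.25, "c4"), (0.15, "c5")]

partials = [node_top_k(rows, 5) for rows in (node_a, node_b, node_c)]
best = merge_top_k(partials, 5)
```

With this data the nodes return 5, 4 and 5 candidates, and the coordinator keeps the 5 with the highest scores.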

Queries can be as complex as you want

SELECT * FROM tweets WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should : [
                {type : "prefix", field : "user", value : "a"},
                {type : "wildcard", field : "user", value : "*b*"},
                {type : "match", field : "user", value : "fast"}
            ]}
        ]
    },
    sort : {
        fields : [
            {field : "time", reverse : true},
            {field : "user", reverse : false}
        ]
    }
}' LIMIT 10000;
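Hand-written JSON inside a CQL string is easy to get wrong (a missing comma or a stray quote), so one option is to build the search as a data structure and serialize it. The Python sketch below does that for the query on this slide. The lucene_select helper is hypothetical, and it assumes the index accepts strict JSON with quoted keys in place of the relaxed syntax used on the slides.

```python
import json


def lucene_select(table, search, limit, index_column="lucene"):
    """Build a SELECT statement whose index column carries the JSON search."""
    payload = json.dumps(search)
    # A single quote inside a CQL string literal is escaped by doubling it.
    literal = payload.replace("'", "''")
    return f"SELECT * FROM {table} WHERE {index_column} = '{literal}' LIMIT {limit};"


search = {
    "filter": {
        "type": "boolean",
        "must": [
            {"type": "range", "field": "time", "lower": "2014/04/25"},
            {"type": "boolean", "should": [
                {"type": "prefix", "field": "user", "value": "a"},
                {"type": "wildcard", "field": "user", "value": "*b*"},
                {"type": "match", "field": "user", "value": "fast"},
            ]},
        ],
    },
    "sort": {
        "fields": [
            {"field": "time", "reverse": True},
            {"field": "user", "reverse": False},
        ]
    },
}

statement = lucene_select("tweets", search, 10000)
```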

Some implementation details

NO MAINTENANCE REQUIRED

•  A Lucene document per CQL row, and a Lucene field per indexed column
•  SortingMergePolicy keeps the index sorted in the same way that C* does
•  Index commits synchronized with column family flushes
•  Segment merges synchronized with column family compactions

LUCENE AND SPARK

Integrating Lucene & Spark

Split friendly: it supports searches within a token range

SELECT * FROM tweets
WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
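As a rough illustration of what "split friendly" buys a framework like Spark, the Python sketch below divides the token ring into contiguous ranges and emits one index-filtered statement per range, so each task reads only its slice. It assumes the Murmur3 token range (-2^63 to 2^63 - 1) and uses TOKEN(userid) because userid is the partition key of the tweets table. The token_splits and split_query helpers are hypothetical; a real connector would derive its splits from the cluster's ring metadata, not from an even division.

```python
MIN_TOKEN = -2**63      # lower bound of the Murmur3 token range
MAX_TOKEN = 2**63 - 1   # upper bound of the Murmur3 token range


def token_splits(n):
    """Divide the token ring into n contiguous (low, high] ranges."""
    step = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(search_json, low, high, limit):
    """One statement per split: the Lucene filter plus a token range restriction."""
    return (
        "SELECT * FROM tweets "
        f"WHERE lucene = '{search_json}' "
        f"AND TOKEN(userid) > {low} AND TOKEN(userid) <= {high} "
        f"LIMIT {limit};"
    )


search_json = '{"filter": {"type": "match", "field": "text", "value": "cassandra"}}'
splits = token_splits(4)
statements = [split_query(search_json, low, high, 10000) for low, high in splits]
```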

Integrating Lucene & Spark

Paging friendly: it supports starting queries at a certain point

SELECT * FROM tweets
WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;

Integrating Lucene & Spark

•  Compute large amounts of data
•  Avoid systematic full scan
•  Reduces the amount of data to be processed
•  Filtering push-down

[Figure: client, Spark master and three C* nodes, each with a Lucene index]

WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN

Index performance in Spark

[Figure: time versus records returned, comparing a full scan with the Lucene 2i]

DEMO Lucene indexes in C*


Conclusions

•  Added new query methods

- Multivariable queries (AND, OR, NOT)

- Range queries (>, >=, <, <=) and regular expressions

- Full text queries (match, phrase, fuzzy...)

•  Top-k query support

- Lucene scoring formula

- Sort by field values

•  Compatible with MapReduce frameworks

•  Preserves Cassandra’s functionality

It's open source

github.com/stratio/stratio-cassandra
•  Published as fork of Apache Cassandra
•  Apache License Version 2.0

stratio.github.io/crossdata
•  Apache License Version 2.0

github.com/stratio/deep-spark
•  Apache License Version 2.0
