cassandra summit 2014: interactive olap queries using apache cassandra and spark
DESCRIPTION
Presenter: Evan Chan, Principal Software Engineer at Socrata Inc. How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.TRANSCRIPT
![Page 1: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/1.jpg)
#CassandraSummit
OLAP WITH SPARK ANDCASSANDRA
EVAN CHANSEPT 2014
![Page 2: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/2.jpg)
WHO AM I?Principal Engineer, @evanfchan
Creator of
Socrata, Inc.
http://github.com/velviaSpark Job Server
![Page 3: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/3.jpg)
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MOREPEOPLE.
data.edmonton.ca finances.worldbank.org data.cityofchicago.orgdata.seattle.gov data.oregon.gov data.wa.govwww.metrochicagodata.org data.cityofboston.govinfo.samhsa.gov explore.data.gov data.cms.gov data.ok.govdata.nola.gov data.illinois.gov data.colorado.govdata.austintexas.gov data.undp.org www.opendatanyc.comdata.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.itdata.montgomerycountymd.gov data.cityofnewyork.usdata.acgov.org data.baltimorecity.gov data.energystar.govdata.somervillema.gov data.maryland.gov data.taxpayer.netbronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
![Page 4: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/4.jpg)
WE ARE SWIMMING IN DATA!
![Page 5: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/5.jpg)
BIG DATA AT SOCRATATens of thousands of datasets, each one up to 30 million rowsCustomer demand for billion row datasetsWant to analyze across datasets
![Page 6: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/6.jpg)
BIG DATA AT OOYALA2.5 billion analytics pings a day = almost a trillion events ayear.Roll up tables - 30 million rows per day
![Page 7: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/7.jpg)
HOW CAN WE ALLOW CUSTOMERS TO QUERY AYEAR'S WORTH OF DATA?
Flexible - complex queries includedSometimes you can't denormalize your data enough
Fast - interactive speedsNear Real Time - can't make customers wait hours beforequerying new data
![Page 8: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/8.jpg)
RDBMS? POSTGRES?Start hitting latency limits at ~10 million rowsNo robust and inexpensive solution for querying across shardsNo robust way to scale horizontally
PostGres runs query on single thread unless you partition(painful!)
Complex and expensive to improve performance (eg rolluptables, huge expensive servers)
![Page 9: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/9.jpg)
OLAP CUBES?Materialize summary for every possible combinationToo complicated and brittleTakes forever to compute - not for real timeExplodes storage and memory
![Page 10: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/10.jpg)
When in doubt, use brute force- Ken Thompson
![Page 11: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/11.jpg)
![Page 12: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/12.jpg)
CASSANDRAHorizontally scalableVery flexible data modelling (lists, sets, custom data types)Easy to operateNo fear of number of rows or documentsBest of breed storage technology, huge communityBUT: Simple queries only
![Page 13: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/13.jpg)
APACHE SPARKHorizontally scalable, in-memory queriesFunctional Scala transforms - map, filter, groupBy, sortetc.SQL, machine learning, streaming, graph, R, many more pluginsall on ONE platform - feed your SQL results to a logisticregression, easy!THE Hottest big data platform, huge community, leavingHadoop in the dustDevelopers love it
![Page 14: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/14.jpg)
SPARK PROVIDES THE MISSING FAST, DEEPANALYTICS PIECE OF CASSANDRA!
![Page 15: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/15.jpg)
INTEGRATING SPARK AND CASSANDRAScala solutions:
Datastax integration:
(CQL-based)https://github.com/datastax/spark-cassandra-connectorCalliope
![Page 16: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/16.jpg)
A bit more work:
Use traditional Cassandra client with RDDsUse an existing InputFormat, like CqlPagedInputFormat
Only reason to go here is probably you are not on CQL version ofCassandra, or you're using Shark/Hive.
![Page 17: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/17.jpg)
A SPARK AND CASSANDRAOLAP ARCHITECTURE
![Page 18: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/18.jpg)
SEPARATE STORAGE AND QUERY LAYERSCombine best of breed storage and query platformsTake full advantage of evolution of eachStorage handles replication for availabilityQuery can replicate data for scaling read concurrency -independent!
![Page 19: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/19.jpg)
SCALE NODES, NOTDEVELOPER TIME!!
![Page 20: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/20.jpg)
KEEPING IT SIMPLEMaximize row scan speedColumnar representation for efficiencyCompressed bitmap indexes for fast algebraFunctional transforms for easy memoization, testing,concurrency, composition
![Page 21: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/21.jpg)
SPARK AS CASSANDRA'S CACHE
![Page 22: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/22.jpg)
EVEN BETTER: TACHYON OFF-HEAP CACHING
![Page 23: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/23.jpg)
INITIAL ATTEMPTSval rows = Seq( Seq("Burglary", "19xx Hurston", 10), Seq("Theft", "55xx Floatilla Ave", 5) )
sc.parallelize(rows) .map { values => (values[0], values) } .groupByKey .reduce(_[2] + _[2])
![Page 24: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/24.jpg)
No existing generic query engine for Spark when we started(Shark was in infancy, had no indexes, etc.), so we built our ownFor every row, need to extract out needed columnsAbility to select arbitrary columns means using Seq[Any], notype safetyBoxing makes integer aggregation very expensive and memoryinefficient
![Page 25: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/25.jpg)
COLUMNAR STORAGE AND QUERYING
![Page 26: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/26.jpg)
The traditional row-based data storageapproach is dead- Michael Stonebraker
![Page 27: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/27.jpg)
TRADITIONAL ROW-BASED STORAGESame layout in memory and on disk:
Name AgeBarak 46
Hillary 66
Each row is stored contiguously. All columns in row 2 come afterrow 1.
![Page 28: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/28.jpg)
COLUMNAR STORAGE (MEMORY)Name column
0 10 1
Dictionary: {0: "Barak", 1: "Hillary"}
Age column
0 146 66
![Page 29: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/29.jpg)
COLUMNAR STORAGE (CASSANDRA)Review: each physical row in Cassandra (e.g. a "partition key")stores its columns together on disk.
Schema CF
Rowkey TypeName StringDict
Age Int
Data CF
Rowkey 0 1Name 0 1
Age 46 66
![Page 30: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/30.jpg)
ADVANTAGES OF COLUMNAR STORAGECompression
Dictionary compression - HUGE savings for low-cardinalitystring columnsRLE
Reduce I/OOnly columns needed for query are loaded from disk
Can keep strong types in memory, avoid boxingBatch multiple rows in one cell for efficiency
![Page 31: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/31.jpg)
ADVANTAGES OF COLUMNAR QUERYINGCache locality for aggregating column of dataTake advantage of CPU/GPU vector instructions for ints /doublesavoid row-ifying until last possible momenteasy to derive computed columnsUse vector data / linear math libraries
![Page 32: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/32.jpg)
COLUMNAR QUERY ENGINE VS ROW-BASED INSCALA
Custom RDD of column-oriented blocks of dataUses ~10x less heap10-100x faster for group by's on a single nodeScan speed in excess of 150M rows/sec/core for integeraggregations
![Page 33: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/33.jpg)
SO, GREAT, OLAP WITH CASSANDRA ANDSPARK. NOW WHAT?
![Page 34: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/34.jpg)
![Page 35: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/35.jpg)
DATASTAX: CASSANDRA SPARK INTEGRATIONDatastax Enterprise now comes with HA Spark
HA master, that is.spark-cassandra-connector
![Page 36: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/36.jpg)
SPARK SQLAppeared with Spark 1.0In-memory columnar storeCan read from Parquet and JSON now; direct Cassandraintegration comingQuerying is not column-based (yet)No indexesWrite custom functions in Scala .... take that Hive UDFs!!Integrates well with MLBase, Scala/Java/Python
![Page 37: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/37.jpg)
CACHING A SQL TABLE FROM CASSANDRAval sqlContext = new org.apache.spark.sql.SQLContext(sc)
sc.cassandraTable[GDeltRow]("gdelt, "1979to2009") .registerAsTable("gdelt")sqlContext.cacheTable("gdelt")sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER BY AvgTone DESC LIMIT
Remember Spark is lazy, nothing is executed until thecollect()In Spark 1.1+: registerTempTable
![Page 38: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/38.jpg)
SOME PERFORMANCE NUMBERSGDELT dataset, 117 million rows, 57 columns, ~50GBSpark 1.0.2, AWS 8 x c3.xlarge, cached in memory
Query Avgtime(sec)
SELECT count(*) FROM gdeltWHERE Actor2CountryCode ='CHN'
0.49
SELECT 4 columns Top K 1.51
SELECT Top countries by Avg Tone(Group By)
2.69
![Page 39: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/39.jpg)
IMPORTANT - CACHINGBy default, queries will read data from source - Cassandra -every timeSpark RDD Caching - much faster, but big waste of memory(row oriented)Spark SQL table caching - fastest, memory efficient
![Page 40: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/40.jpg)
WORK STILL NEEDEDIndexesColumnar querying for fast aggregationTachyon support for Cassandra/CQLEfficient reading from columnar storage formats
![Page 41: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/41.jpg)
LESSONSExtremely fast distributed querying for these use cases
Data doesn't change much (and only bulk changes)Analytical queries for subset of columnsFocused on numerical aggregationsSmall numbers of group bys
For fast query performance, cache your data using Spark SQLConcurrent queries is a frontier with Spark. Use additionalSpark contexts.
![Page 42: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/42.jpg)
THANK YOU!
![Page 43: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/43.jpg)
EXTRA SLIDES
![Page 44: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/44.jpg)
EXAMPLE CUSTOM INTEGRATION USINGASTYANAX
val cassRDD = sc.parallelize(rowkeys). flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }
![Page 45: Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark](https://reader031.vdocuments.us/reader031/viewer/2022013118/548276c95806b5e3048b4684/html5/thumbnails/45.jpg)
SOME COLUMNAR ALTERNATIVESMonetdb and Infobright - true columnar stores (storage +querying)Vertica and C-StoreGoogle BigQuery - columnar cloud database, Dremel basedAmazon RedShift