c* summit 2013: searching for a needle in a big data haystack by jason rutherglen
DESCRIPTION
The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, Datastax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A realtime financial application will run for the audience, and then a detailed look at how the application was built. An overview of Datastax Enterprise Solr features will be given, and how the many enhancements in DSE make it unique in the marketplace.TRANSCRIPT
©2012 DataStax 1
DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen
©2012 DataStax 2
• Big Data Engineer at DataStax • Co-author of ‘Programming Hive’
and ‘Introduction to Solr’ from O’Reilly
About the Presenter
©2012 DataStax 3
• The company behind Cassandra • Sells DataStax Enterprise
About DataStax
©2012 DataStax 4
DataStax Enterprise 3.x
©2012 DataStax 5
• Single stack • Cassandra • Solr • Hadoop • Consulting • Support
DataStax Enterprise
©2012 DataStax 6
• ZDNet article: “The biggest cloud app of all: Netflix”
• http://zd.net/ZHtrmW • Built on Cassandra • “According to Cockcroft, if something
goes wrong, Netflix can continue to run the entire service on two out of three zones”
Cassandra at Netflix
©2012 DataStax 7
• Petabytes of growing data • Hadoop is for batch work • What are the solutions for realtime?
What is Big Data?
©2012 DataStax 8
• Near realtime • 1000 millisecond latency
What is realtime?
©2012 DataStax 9
• Cost of scaling to petabytes • Physical limitations
Why not relational databases?
©2012 DataStax 10
• Hadoop for batch • Solr and Cassandra for realtime • Gives most of relational capability at
1/10 the cost, scales linearly
Relational to Big Data
©2012 DataStax 11
• Distributed database heavy lifting • Simple dynamo model • Executes replication tasks extremely
well
Why Cassandra?
©2012 DataStax 12
• Cassandra is easier, code is readable • Fewer moving parts • Multi-datacenter replication • Enables low level IO tuning
Cassandra vs. HBase
©2012 DataStax 13
• HBase runs on HDFS • HDFS is not designed for random
access IO • Multiple hacks / products to perform
random access (MapR, HDFS Jiras)
Cassandra vs. HBase
©2012 DataStax 14
• Cassandra is peer to peer, there is no single point of failure (SPOF)
• The HDFS name node is a single point of failure
Cassandra vs. HBase
©2012 DataStax 15
• Most of Facebook runs on MySQL • Memcache front ends the reads
HBase at Facebook
©2012 DataStax 16
• Hive with Hadoop • A vague dialect of SQL • Requires Java for UDFs • Relational Joins
Batch Analytics
©2012 DataStax 17
• Solr • SQL features except relational joins
• Use Hive for relational joins
• CEP (Complex Event Processing) • Storm
Realtime Analytics
©2012 DataStax 18
• Storm, computes results on streaming data
CEP (Complex Event Processing)
©2012 DataStax 19
• Java inverted indexing library • Text analytics is raw computation
over linear sets of data • High speed computation engine
Lucene
©2012 DataStax 20
• Terms dictionary points to list document ids (integers)
• Tokenizes text • Complete variety of computation on
vectors of data
Inverted Indexes
©2012 DataStax 21
• Search server built around Lucene • Adds faceting, distributed search • Missed the cloud environment
features of NoSQL systems for many years
Solr
©2012 DataStax 22
• Solr Cloud is a Zookeeper based system
• New and probably not production ready
• Playing catch up
Solr Cloud
©2012 DataStax 23
• High overlap with Solr • More mature than Solr Cloud • Less distributed features than
Cassandra
Elastic Search
©2012 DataStax 24
• Columns, column families, keyspaces • Peer to peer • Eventual consistency • Implements basic Google BigTable
model
Cassandra Concepts
©2012 DataStax 25
• Both implement a log structured merge tree file architecture
Lucene and Cassandra
©2012 DataStax 26
• Data is stored in Cassandra • Data placement controlled by
Cassandra • Solr is a secondary index (only)
DataStax Enterprise with Solr
©2012 DataStax 27
• Separation of church and state, eg, data and index
DataStax Enterprise with Solr
©2012 DataStax 28
• Indexing is a CPU intensive task • Not IO bound because of multi-
threading • When a thread is flushing, other
threads are indexing, CPU is saturated at all times
Indexing
©2012 DataStax 29
• IO bound, index needs to fit in RAM, then CPU bound
• Lucene enables multithreading queries
• Solr does not multithread queries
Queries
©2012 DataStax 30
• Eventual consistency, each node has it’s own Lucene index
• Lucene segment files are not replicated (like Solr Cloud and ElasticSearch)
DataStax Enterprise with Solr
©2012 DataStax 31
• Query requests are round robin’d across nodes automatically
Distributed Search Architecture
©2012 DataStax 32
• 3.0.1 is the current release of DataStax Enterprise
DSE 3.0.1
©2012 DataStax 33
• Ease of re-indexing • Re-index the entire cluster or per-
node • Re-indexing occurs when the Solr
schema changes
New Features in DSE 3.0.1
©2012 DataStax 34
• Solr Cloud requires re-indexing from an external data source such as a relational database
New Features in DSE 3.0.1
©2012 DataStax 35
• DSE re-indexes directly from Cassandra
• No custom code is required for re-indexing
New Features in DSE 3.0.1
©2012 DataStax 36
• View the heap memory usage of the field caches
• Perform capacity planning
New Features in DSE 3.0.1
©2012 DataStax 37
• Multithreaded re-indexing and repair • Adding a new Solr node is fast
New Features in DSE 3.0.1
©2012 DataStax 38
• Kerberos and SSL security • Security audit logging
New Features in DSE 3.0.1
©2012 DataStax 39
• Near realtime: per-segment filters, facets, multivalue facets
• Solr 4.3
DataStax Enterprise 3.1
©2012 DataStax 40
• vNodes • Composite keys
DataStax Enterprise 3.1
©2012 DataStax 41
• Multi datacenter live Solr schema updates and re-indexing
• CQL -> Solr queries, makes porting SQL applications easy for SQL developers
Future
©2012 DataStax 42
Demo of Wikipedia
©2012 DataStax 43
• Details about every trade • Tick data generated real time and is
quantitatively query-able • Too big to query on in real time? Not
anymore!
Real World Example: Tick Data
©2012 DataStax 44
• Computing the moving stock price average in real time
• Comparing multiple moving averages for different stock_symbols
• Requires statistical analysis, group by companies, and faceting features
Tick Data - Moving Average
©2012 DataStax 45
• Read latest ticks for a given company • Query ticks for companies in specific
verticals during large events such as press releases
• Compute deviation of stock data over 5 years for groups of companies
Tick Data Analytics - Ad Hoc Searches
©2012 DataStax 46
Real Time Stocks Demo
©2012 DataStax 47
General
©2012 DataStax 48
• Like an SQL CREATE TABLE statement
• Defines field types • Defines fields
Schema
©2012 DataStax 49
• XML based configuration options for Solr
Solr Config
©2012 DataStax 50
• Commits new index segment to RAM • Avoids ‘hard’ commit fsync
Soft Commit
©2012 DataStax 51
Auto Soft Commit <!-- The default high-performance update handler --> !<updateHandler class="solr.DirectUpdateHandler2”>! <autoSoftCommit> !
<maxTime>1000</maxTime> <!– Near Realtime of 1 second -->! </autoSoftCommit> !</updateHandler>!
©2012 DataStax 52
• Loaded for sort and facet queries • Uses heap space
Field Cache
©2012 DataStax 53
• Java based API for interacting with a Solr server
• DSE supports SolrJ/HTTP with no changes
SolrJ / HTTP
©2012 DataStax 54
• Auto data type mapping • Copy fields • Dynamic fields
Insert data with CQL
©2012 DataStax 55
• Exists however is mainly useful for debugging
• Limited functionality, queries a single node
CQL with Solr Query
©2012 DataStax 56
CQL Insert Example INSERT INTO wikipedia (key, text) !VALUES ('1', 'when in rome')!
©2012 DataStax 57
How to convert applications
©2012 DataStax 58
• Common to convert existing SQL applications to Big Data
• Focus on the application functionality
SQL to Solr
©2012 DataStax 59
• Cassandra makes all distributed operations easy
SQL to Solr
©2012 DataStax 60
• SELECT * FROM wikipedia WHERE type = ‘pdf’
• q=type:pdf
SELECT WHERE
©2012 DataStax 61
• SELECT title,text FROM wikipedia • q=*:* • fl=title,text
SELECT columns
©2012 DataStax 62
• SELECT COUNT(*) FROM wikipedia WHERE type = ‘pdf’
• q=type:pdf • Get the num found
SELECT COUNT
©2012 DataStax 63
• SELECT * FROM stocks ORDER BY price ASC
• q=*:* • sort=price asc
SELECT ORDER BY
©2012 DataStax 64
• SELECT AVG(price) FROM stocks • q=*:* • stats=true • stats.field=price • The average is called ‘mean’ in the
Solr results
SELECT AVG
©2012 DataStax 65
• SELECT AVG(price) FROM stocks GROUP BY symbol
• q=*:* • stats=true • stats.field=price • stats.facet=symbol
SELECT AVG GROUP BY
©2012 DataStax 66
• SELECT * FROM wikipedia WHERE text LIKE ‘rom%’
• q=text:rom*
SELECT WHERE LIKE
©2012 DataStax 67