Cassandra at NoSQL Matters 2012
Apache Cassandra: Real-world scalability, today!
Jonathan Ellis, CTO
©2012 DataStax
Cassandra Job Trends
“Big Data” trend
Why Big Data Matters
Research done by McKinsey & Company shows the eye-opening, 10-year category growth rate differences between businesses that smartly use their big data and those that do not.
Big data
Analytics (Hadoop)
Realtime (“NoSQL”)
?
Some Cassandra users
Industries & use cases
• Financial
• Social Media
• Advertising
• Entertainment
• Energy
• E-tail
• Health care
• Government

• Time series data
• Messaging
• Ad tracking
• Data mining
• User activity streams
• User sessions
• Anything requiring: scalable, performant + highly available
Why Cassandra?
• Fully distributed, no SPOF
• Multi-master, multi-DC
• Linearly scalable
• Larger-than-memory datasets
• Best-in-class performance (not just writes!)
• Fully durable
• Integrated caching
• Tuneable consistency
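Tuneable consistency means each read and write chooses how many replicas must respond. A minimal sketch, assuming the commonly cited rule that a read is guaranteed to overlap the latest acknowledged write when R + W > N (N being the replication factor). The function names are illustrative and are not part of any Cassandra API:

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3  # assumed replication factor, for illustration only
    q = quorum(n)
    print(f"RF={n}: quorum={q}")
    print("quorum reads + quorum writes overlap:", overlaps(q, q, n))
    print("ONE read + ONE write overlap:", overlaps(1, 1, n))
```

With three replicas, quorum reads plus quorum writes always overlap, while single-replica reads and writes trade that guarantee for lower latency.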
Availability
• “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.” -- Ben Black: founder, Boundary; ex-AWS
• “The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.” -- Rick Branson: Instagram; ex-DataStax
Classic partitioning with SPOF
[Diagram: client, router, and partition 1, partition 2, partition 3, partition 4]
Fully distributed, no SPOF
[Diagram: client and partitions p1, p3, p6]
Partitioning
jim      age: 36   car: camaro   gender: M
carol    age: 37   car: subaru   gender: F
johnny   age: 12   gender: M
suzy     age: 10   gender: F
Primary key determines placement*
PK       MD5 Hash
jim      5e02739678...
carol    a9a0198010...
johnny   f4eb27cea7...
suzy     78b421309e...
The MD5 hash operation yields a 128-bit number for keys of any size.
The “token ring”
[Diagram: Node A, Node B, Node C, and Node D arranged in a ring]
jim      5e02739678...
carol    a9a0198010...
johnny   f4eb27cea7...
suzy     78b421309e...

Node   Start              End
A      0xc000000000..1    0x0000000000..0
B      0x0000000000..1    0x4000000000..0
C      0x4000000000..1    0x8000000000..0
D      0x8000000000..1    0xc000000000..0
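The lookup implied by the token table can be sketched as follows: hash the key to a 128-bit token, then find the node whose range contains it. This is a simplified model of the slide's four-node ring, not Cassandra's actual partitioner code. The helper names (`token_for`, `token_from_prefix`, `owner`) are hypothetical, and the truncated hashes from the slide are padded with zeros purely for illustration.

```python
import hashlib
from bisect import bisect_left

# End tokens from the slide's table, padded to 128 bits. Each node owns the
# range that ends at its token (inclusive); A's range wraps past the maximum.
RING = [
    (0x00 << 120, "A"),
    (0x40 << 120, "B"),
    (0x80 << 120, "C"),
    (0xC0 << 120, "D"),
]
END_TOKENS = [token for token, _ in RING]


def token_for(key: str) -> int:
    """MD5 of the key, read as an unsigned 128-bit integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def token_from_prefix(hex_prefix: str) -> int:
    """Pad a truncated hex prefix (as printed on the slide) to 128 bits."""
    return int(hex_prefix.ljust(32, "0"), 16)


def owner(token: int) -> str:
    """Node whose range contains the token."""
    index = bisect_left(END_TOKENS, token)
    # Tokens above the last end token wrap around to the first node.
    return RING[index % len(RING)][1]


if __name__ == "__main__":
    slide_rows = [("jim", "5e02739678"), ("carol", "a9a0198010"),
                  ("johnny", "f4eb27cea7"), ("suzy", "78b421309e")]
    for name, prefix in slide_rows:
        print(name, "->", owner(token_from_prefix(prefix)))
```

Using the prefixes printed on the slide, jim and suzy fall in C's range, carol in D's, and johnny in A's wrap-around range.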
Replication
[Diagram: the token ring (Node A, Node B, Node C, Node D) with carol a9a0198010... placed on it]
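The slide only shows the picture, so the placement rule below is an assumption: store the first copy on the node that owns the key's token, then walk clockwise around the ring for the remaining copies. This resembles Cassandra's SimpleStrategy; rack- and datacenter-aware placement is more involved. `RING_ORDER` and `replicas` are hypothetical names.

```python
from typing import List

# Assumed clockwise order, following the ascending token ranges on the slide.
RING_ORDER = ["A", "B", "C", "D"]


def replicas(primary: str, replication_factor: int) -> List[str]:
    """Primary owner plus the next nodes clockwise, without repeats."""
    if not 1 <= replication_factor <= len(RING_ORDER):
        raise ValueError("replication factor must be between 1 and the node count")
    start = RING_ORDER.index(primary)
    return [RING_ORDER[(start + i) % len(RING_ORDER)]
            for i in range(replication_factor)]


if __name__ == "__main__":
    # carol's token (a9a0198010...) falls in node D's range on the slide.
    print(replicas("D", 3))
```

With a replication factor of 3, carol's row would live on D, A, and B under this rule, so losing any single node still leaves two copies.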
Highlights
• Adding capacity is application-transparent and requires no downtime
• No SPOF, not even temporarily
• No “primary” replica
• Configurable synchronous/asynchronous replication
• Tolerates node failure; never have to restart replication “from scratch”
• “Smart” replication avoids correlated failures
What about performance?
• Log-structured storage engine avoids random I/O
• Excellent performance on both reads and writes
• Row-level isolation via concurrent algorithms
• No locking
• Built-in compression improves cache hotness
• “Row cache” can replace memcached
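To illustrate why a log-structured engine avoids random I/O, here is a toy in-memory model of the general idea: every write is an append, recent data sits in a memory table, and full tables are flushed as sorted, immutable segments. It is a sketch of the technique only, not Cassandra's storage engine; the class name, the flush threshold, and the sample values are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


class ToyLogStructuredStore:
    """Append-only writes; sorted, immutable segments; newest data wins on read."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []       # stands in for a sequential on-disk log
        self.memtable: Dict[str, str] = {}                # recent writes, held in memory
        self.segments: List[List[Tuple[str, str]]] = []   # flushed, sorted, never modified
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # sequential append, no seek
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Write the whole memtable out in one sorted, sequential pass.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None


if __name__ == "__main__":
    store = ToyLogStructuredStore(memtable_limit=2)
    store.put("jim", "camaro")
    store.put("carol", "subaru")   # second write triggers a flush
    store.put("jim", "mustang")    # an overwrite is a new append, not an in-place update
    print(store.get("jim"), store.get("carol"), len(store.segments))
```

The overwrite never touches the flushed segment; the newer value simply shadows the older one at read time, which is what keeps writes sequential.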
[Chart: reads/s and writes/s for Cassandra 0.6 and Cassandra 1.0; axis from 0 to 35000]
Netflix
“I can create a Cassandra cluster in any region of the world in 10 minutes. When marketing guys decide we want to move into a certain part of the world, we’re ready.”
Application/Use Case
• Manage subscriber interactions with downloaded movies
• Need to handle distributed databases all over the world (40 countries)
• Need better TCO than Oracle
Why Cassandra?
• Easy scale and multi-data center support for geographical data distribution
• Data model perfect fit for customer interaction data
• Much better TCO than Oracle or SimpleDB
Constant Contact
“Whenever we need new capacity, we just add new nodes online and we’re able to meet whatever demand we have. Cassandra is great for that.”
Application/Use Case
• Manage marketing/email campaigns for small businesses
• Needed database to handle social media data that is very large in volume and must be maintained for a long time
• Data is unstructured in nature
Why Cassandra?
• Cassandra built for big data scale and able to persist, manage, and quickly query big data
• Deployed application on Cassandra in 1/3rd the time and 1/10th the cost of Oracle
ReachLocal
Application/Use Case
• ReachLocal provides end-to-end Internet advertising services to small and medium-sized businesses in eight countries
• Must track most or all user interaction with marketing campaigns on web sites
Why Cassandra?
• The amount of information was beyond the scalability limits of traditional RDBMS’s
• Has to replicate data to six data centers around the world
• Needed integration with real-time data and analytics/search
Backupify
“Cassandra was just a better design all around – more truly horizontally scalable and with less management overhead – and there’s no single point of failure. I looked at Cassandra’s architecture and thought, ‘Yeah, that’s how you do it.’”
Application/Use Case
• Cloud-based utility that enables backups and searches of Google Apps, Gmail, Facebook, Twitter, Blogger and other content
• Must write lots of data very quickly
Why Cassandra?
• Big data requirements necessitated easy scale-out and a continuously available database architecture
• Strong community support for Cassandra
• TCO was much better than others
Openwave
“Here are the big ‘checkbox’ items for us with Apache Cassandra: There is no single point of failure, it offers high read-and-write performance, and it has the ability to work on commodity hardware.”
Application/Use Case
• Openwave Messaging delivers a next-generation converged messaging platform with cloud and social integration capabilities
Why Cassandra?
• Needed new database that would support geographic redundancy, continuous availability, and big data scale
• Required high IOPS database speed
• Better TCO than prior Oracle database
Healthx
“We really like the integration with Solr. We get the full redundancy that you’d expect out of Cassandra as well as the full text indexing of Solr. The two things together make a win.”
Application/Use Case
• Develops and manages online portals for the healthcare market
• Delivered via cloud platform
• Manages provider, patient, and other related data
Why DataStax Enterprise?
• Needed to scale, perform, and search data faster than previous Microsoft SQL Server database farm
• Integrated big data platform that provides one database cluster for all real-time and search data
Big data
Analytics (Hadoop)
Realtime (“NoSQL”)
?
The evolution of Analytics
Analytics + Realtime
The evolution of Analytics
Analytics Realtime
replication
The evolution of Analytics
ETL
Big data
Analytics (Hadoop)
Realtime (Cassandra)
DataStax Enterprise
Reunification of realtime + analytics
Portfolio Demo dataflow
[Diagram, stages as labeled: Portfolios; Historical Prices; Intermediate Results; Largest loss]
[Diagram, stages as labeled: Portfolios; Live Prices for today; Largest loss]
Better Hadoop than Hadoop
• “Vanilla” Hadoop
  • 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...)
  • Single points of failure
  • Can't separate online and offline processing
• DataStax Enterprise
  • Single, simplified component
  • Self-organizes based on workload
  • Peer to peer
  • JobTracker failover
Enterprise search with Solr
SELECT title FROM solr WHERE solr_query='title:natio*';

 title
--------------------------------------------------------------------------
 Bolivia national football team 2002
 List of French born footballers who have played for other national teams
 Lithuania national basketball team at Eurobasket 2009
 Bolivia national football team 2000
 Kenya national under-20 football team
 Bolivia national football team 1999
 Israel men's national inline hockey team
 Bolivia national football team 2001
Managing & Monitoring Big Data
DataStax OpsCenter manages and monitors all Cassandra and Hadoop operations
Questions?
• http://www.datastax.com/docs
• http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1
• http://www.datastax.com/products/enterprise