Scaling Data on Public Clouds
DESCRIPTION
Check which alternatives you have to scale your data on the public cloud
TRANSCRIPT
About Us
• ScaleBase is a new startup targeting the database-as-a-service market (DBaaS)
• We offer unlimited database scalability and availability using our Database Load Balancer
• We currently run in beta mode – contact me if you want to join
Problem Of Data
• Flickr just hit 5B pictures
• Facebook > 0.5B users
• Farmville has more monthly players than the population of France
Monday's Keynote
• More data
• More users
• More complex actions
• Shorter response times
Scalability Pain
http://media.amazonwebservices.com/pdf/IBMWebinarDeck_Final.pdf
[Chart: infrastructure cost over time. Traditional hardware requires a large capital expenditure sized for predicted demand; when actual demand outgrows capacity, you just lost customers (opportunity cost). Automated virtualization lets capacity track actual demand.]
CAP vs. ACID
• CAP = Consistency, Availability, Partition Tolerance
• ACID = Atomicity, Consistency, Isolation, Durability
• Atomicity – A chain of actions treated as one whole, inseparable action
• Isolation – Consistent query snapshots, reads isolated from concurrent writes; 4 isolation levels are supported
ScaleBase Database Scaling In A Box
• The first truly elastic, fault tolerant SQL based data layer
• Enables linear scaling of any SQL based database
[Diagram: legacy clients and applications connect through ScaleBase to database instances; application/web servers connect through ScaleBase to shared-nothing DB machines running on commodity hardware (MySQL? Oracle?) – scalable and highly available.]
THE REQUIREMENTS FOR DATA SLA IN PUBLIC CLOUD ENVIRONMENTS
What We Need
• Availability
• Consistency
• Scalability
Brewer's (CAP) Theorem
• It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (node failures do not prevent survivors from continuing to operate)
– Partition Tolerance (the system continues to operate despite arbitrary message loss)
http://en.wikipedia.org/wiki/CAP_theorem
What It Means
http://guyharrison.squarespace.com/blog/2010/6/13/consistency-models-in-non-relational-databases.html
Reading More On CAP
• This is an excellent read, and some of my samples are from this blog
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
ACHIEVING DATA SCALABILITY WITH RELATIONAL DATABASES
Databases And CAP
• ACID – Consistency
• Availability – tons of solutions, most of them not cloud oriented
– Oracle RAC
– MySQL Proxy
– Etc.
– Replication based solutions can solve at least read availability and scalability (see Azure SQL)
Database Cloud Solutions
• Amazon RDS
• NaviSite Oracle RAC
• Joyent + Zeus
So Where Is The Problem?
• Scaling problems (usually write but also read)
• Schema change problems
• BigData problems
Scaling Up
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database solutions
• Known as ‘scaling out’ or ‘horizontal scaling’
• Different approaches include:
– Master-slave
– Sharding
Scaling RDBMS – Master/Slave
• Master-Slave
– All writes are written to the master. All reads performed against the replicated slave databases
– Critical reads may be incorrect as writes may not have been propagated down
– Large data sets can pose problems, as the master needs to duplicate data to the slaves
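The stale-read caveat above can be sketched with plain Java collections. This is an illustrative in-memory stand-in (not a real replication protocol): the "master" map takes all writes, the "slave" map serves all reads, and an explicit replicate() step stands in for asynchronous propagation, so a read between a write and the next replication cycle returns stale data.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of master/slave read-write splitting, assuming an
// in-memory stand-in for both databases (names are hypothetical).
public class MasterSlaveSketch {
    private final Map<String, String> master = new HashMap<>();
    private final Map<String, String> slave = new HashMap<>();

    public void write(String key, String value) {
        master.put(key, value);      // all writes go to the master
    }

    public String read(String key) {
        return slave.get(key);       // all reads go to the slave replica
    }

    public void replicate() {
        slave.putAll(master);        // propagate pending writes to the slave
    }
}
```

A read issued before replicate() runs returns the old value (or nothing), which is exactly why "critical reads may be incorrect".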
Scaling RDBMS - Sharding
• Partition or sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
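The "not transparent" drawback above can be made concrete: with application-side sharding, the application itself must know the shard count and compute the target partition for every key. A minimal hash-sharding sketch, using in-memory maps as stand-in shards (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of application-side hash sharding over in-memory shards.
public class HashSharding {
    private final List<Map<String, String>> shards;

    public HashSharding(int shardCount) {
        shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) shards.add(new HashMap<>());
    }

    // floorMod keeps the index non-negative even for negative hash codes
    public int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    public void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    public String get(String key) {
        // single-shard lookup only: a join across two shards has no
        // single place to execute, hence the loss of cross-shard joins
        return shards.get(shardFor(key)).get(key);
    }
}
```

Note that changing the shard count changes shardFor() for existing keys, which is why rebalancing is hard with naive modulo sharding.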
Other ways to scale RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– This involves de-normalizing data
• In-memory databases
ACHIEVING DATA SLA WITH NOSQL
NoSQL
• A term used to designate databases which differ from classic relational databases in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term which would include classic relational databases as a subset.
http://en.wikipedia.org/wiki/NoSQL
NoSQL Types
• Key/Value – A big hash table – Examples: Voldemort, Amazon Dynamo
• Big Table – Big table, column families – Examples: HBase, Cassandra
• Document based – Collections of collections – Examples: CouchDB, MongoDB
• Each solves a different problem
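The three data models above differ mainly in the shape of what sits behind a key. As a rough analogy (plain Java collections, not the real client APIs of any of these products):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative shapes only: how the value behind a key is structured
// in each NoSQL family, modeled with nested Java maps.
public class NoSqlShapes {
    // Key/Value (Voldemort, Dynamo): one opaque value per key
    static Map<String, byte[]> keyValueStore = new HashMap<>();

    // Big Table (HBase, Cassandra): row key -> column family -> column -> value
    static Map<String, Map<String, Map<String, String>>> bigTableStore = new HashMap<>();

    // Document (CouchDB, MongoDB): key -> document of named, possibly nested fields
    static Map<String, Map<String, Object>> documentStore = new HashMap<>();

    static {
        keyValueStore.put("user:100", "Steven King".getBytes());

        bigTableStore.computeIfAbsent("100", r -> new HashMap<>())
                     .computeIfAbsent("name", f -> new HashMap<>())
                     .put("first", "Steven");

        Map<String, Object> doc = new HashMap<>();
        doc.put("first", "Steven");
        doc.put("last", "King");
        documentStore.put("100", doc);
    }
}
```

The key/value store can only fetch the whole blob; the big-table and document stores can address individual fields, which is part of why each solves a different problem.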
NO-SQL
http://browsertoolkit.com/fault-tolerance.png
Pros/Cons
• Pros:
– Performance
– BigData
– Most solutions are open source
– Data is replicated to nodes and is therefore fault-tolerant (partitioning)
– Don't require a schema
– Can scale up and down
• Cons:
– Code change
– Limited framework support
– Not ACID
– Eco system (BI, Backup)
– There is always a database at the backend
– Some API is just too simple
Amazon S3 Code Sample
AWSAuthConnection conn = new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey, secure, server, format);
Response response = conn.createBucket(bucketName, location, null);
final String text = "this is a test";
response = conn.put(bucketName, key, new S3Object(text.getBytes(), null), null);
Cassandra Code Sample
CassandraClient cl = pool.getClient();
KeySpace ks = cl.getKeySpace("Keyspace1");
// insert value
ColumnPath cp = new ColumnPath("Standard1" , null,
"testInsertAndGetAndRemove".getBytes("utf-8"));
for(int i = 0 ; i < 100 ; i++){
ks.insert("testInsertAndGetAndRemove_"+i, cp ,
("testInsertAndGetAndRemove_value_"+i).getBytes("utf-8"));
}
//get value
for(int i = 0 ; i < 100 ; i++){
Column col = ks.getColumn("testInsertAndGetAndRemove_"+i, cp);
String value = new String(col.getValue(),"utf-8") ;
}
//remove value
for(int i = 0 ; i < 100 ; i++){
ks.remove("testInsertAndGetAndRemove_"+i, cp);
}
Cassandra Code Sample – Cont’
try{
ks.remove("testInsertAndGetAndRemove_not_exist", cp);
}catch(Exception e){
fail("remove not exist row should not throw exceptions");
}
//get already removed value
for(int i = 0 ; i < 100 ; i++){
try{
Column col = ks.getColumn("testInsertAndGetAndRemove_"+i, cp);
fail("the value should already being deleted");
}catch(NotFoundException e){
}catch(Exception e){
fail("throw out other exception, should be NotFoundException." + e.toString());
}
}
pool.releaseClient(cl);
pool.close();
Cassandra Statistics
• Facebook Search
• MySQL > 50 GB Data
– Writes Average : ~300 ms
– Reads Average : ~350 ms
• Rewritten with Cassandra > 50 GB Data
– Writes Average : 0.12 ms
– Reads Average : 15 ms
MongoDB
Mongo m = new Mongo();
DB db = m.getDB( "mydb" );
Set<String> colls = db.getCollectionNames();
for (String s : colls) {
System.out.println(s);
}
MongoDB – Cont’
DBCollection coll = db.getCollection("testCollection");
BasicDBObject doc = new BasicDBObject();
doc.put("name", "MongoDB");
doc.put("type", "database");
doc.put("count", 1);
BasicDBObject info = new BasicDBObject();
info.put("x", 203);
info.put("y", 102);
doc.put("info", info);
coll.insert(doc);
THE BOTTOM LINE
Data SLA
• There is no golden hammer – See http://sourcemaking.com/antipatterns/golden-hammer
• Choose your tool wisely, based on what you need
• Usually
– Start with RDBMS (shortest TTM, which is what we really care about)
– When scale issues occur – start moving to NoSQL based on your needs
• You can get Data Scalability in the cloud – just think before you code!!!
A BIT MORE ON SCALEBASE
How ScaleBase Works
• ScaleBase takes an application database and splits its data across multiple, separated instances (a technique called Sharding)
• Queries and DML are either:
– Directed to the correct instance, or
– Executed simultaneously across several instances
• Results are aggregated and returned to the original application
[Diagram: application → ScaleBase → database instances]
Example
• The original table:
ID  First name  Last name
100 Steven      King
101 Neena       Kochhar
102 Lex         De Haan
103 Alexander   Hunold
104 Bruce       Ernst
105 David       Austin
106 Valli       Pataballa
• Split across three instances (here, by ID mod 3):
Instance 1 (ID mod 3 = 0):
ID  First name  Last name
102 Lex         De Haan
105 David       Austin
Instance 2 (ID mod 3 = 1):
ID  First name  Last name
100 Steven      King
103 Alexander   Hunold
106 Valli       Pataballa
Instance 3 (ID mod 3 = 2):
ID  First name  Last name
101 Neena       Kochhar
104 Bruce       Ernst
ScaleBase Supports
• 3 table types – Master, Global, Split
• Splitting according to Hash, List, Range
• Rebalance, addition and removal of machines
• Instance replication backup: Shadow and Master
• Full consistent 2-Phase Commit
• Joins, Foreign Keys, Subqueries
• DML, DDL, Batch updates, Prepared Statements
• Aggregations, Group By, Order By, Auto Numbering, Timestamps
Sample Code
• site_owner_id is the split key
• Perform the query on all DBs
• Simple aggregation of results
• No Code Change
SELECT site_owner_id, count(*) FROM google.user_clicks WHERE country = 'BRAZIL' GROUP BY site_owner_id
Sample Code
• Perform the query on all DBs
• Aggregation of the aggregations
• No Code Change
SELECT country, count(*) FROM google.user_clicks GROUP BY country
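The "aggregation of the aggregations" step can be sketched as follows. When GROUP BY uses a non-split key, the same group (here, a country) can appear in the partial result of several shards, so the load balancer must merge the partials by summing matching groups. This is an illustrative sketch of that merge, not ScaleBase's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: merge per-shard GROUP BY count(*) results into a global result.
public class CrossShardGroupBy {
    @SafeVarargs
    public static Map<String, Long> merge(Map<String, Long>... perShard) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> partial : perShard) {
            // the same country can occur on several shards, so sum the counts
            partial.forEach((country, count) -> global.merge(country, count, Long::sum));
        }
        return global;
    }
}
```

By contrast, when the GROUP BY key is the split key (the previous sample), each group lives entirely on one shard and the partial results can simply be concatenated.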
Sample Code
• Is split key dynamic or static?
• Each command is added to the correct DB, execution is on all relevant DBs
• No Code Change
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO emp VALUES(?,?,?,?,?)");
for (int i = 0; i < 10; i++) {
    pstmt.setInt(1, 300 + i);
    pstmt.setString(2, "Something" + i);
    pstmt.setDate(3, new Date(System.currentTimeMillis()));
    pstmt.setInt(4, i);
    pstmt.setLong(5, i);
    pstmt.addBatch();
}
int[] result = pstmt.executeBatch();
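The routing step ("each command is added to the correct DB") can be sketched as partitioning a batch by split key: every record joins the sub-batch of its target shard, and each shard then executes only its own sub-batch. A minimal illustration with integer IDs as the split key and modulo hashing (an assumption for this sketch, not ScaleBase's documented hash function):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: split one logical batch of IDs into one sub-batch per shard.
public class BatchRouter {
    public static Map<Integer, List<Integer>> route(List<Integer> ids, int shardCount) {
        Map<Integer, List<Integer>> batches = new HashMap<>();
        for (int id : ids) {
            int shard = Math.floorMod(id, shardCount);   // hypothetical split function
            batches.computeIfAbsent(shard, s -> new ArrayList<>()).add(id);
        }
        return batches;
    }
}
```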
• Elastic SQL database scaling and high-availability solution
– Complete
– Transparent
– Super scalable
– Out of the box
– Non-intrusive
– Flexible
– Manageable
ScaleBase Solution
• Pay much less for hardware and Database licenses
• Get more power, better data spreading and better availability
• Real linear scalability
• Go for real grid, cloud and virtualization
• ScaleBase:
– Is NOT an RDBMS. It facilitates any secure, highly available, archivable RDBMS (Oracle, DB2, MySQL, any!)
– Does NOT require schema structure modifications
– Does NOT require additional sophisticated hardware
With ScaleBase
Moving To ScaleBase
• Implementing ScaleBase is done in minutes
• Just direct your application to your ScaleBase instance
• Target ScaleBase to your original database and target database instances
• ScaleBase will automatically migrate your schema and data to the new servers
• After all data is transferred ScaleBase will start working with target database instances, giving you linear scalability – with no down time!
Where ScaleBase Fits In
• Cloud databases
– Use SQL databases in the cloud, and get Scale Out features and high availability
• High scale applications
– Use your existing code, without migrating to NOSQL, and get unlimited SQL scalability
CUSTOMER CASE STUDIES
Public Cloud
• Scenario:
– A startup developing a complex iPhone application running on EC2
– Requires a high scaling option for its SQL based database
• Problem:
– Amazon RDS offers only a Scale Up option, no Scale Out
• Solution:
– Customer switched their Amazon RDS-based application to ScaleBase on EC2
– Gained transparent, linear scaling
– Running 4 RDS instances behind ScaleBase
Private Cloud
• Scenario:
– A company selling devices that ‘ping’ home every 5 minutes
– 8 digit number of devices sold
• Problem:
– Evaluated MySQL, Oracle – no single machine can support the required number of devices
– Clustering options too expensive, limited scalability
• Solution:
– Customer moved to ScaleBase with no code changes
– Gained linear scalability. Runs 4 MySQL databases behind ScaleBase