Scaling Data on Public Clouds
DESCRIPTION
Check which alternatives you have to scale your data on the public cloud
TRANSCRIPT
About Us
• ScaleBase is a new startup targeting the database-as-a-service market (DBaaS)
• We offer unlimited database scalability and availability using our Database Load Balancer
• We currently run in beta mode – contact me if you want to join
Problem Of Data
• Flickr just hit 5B pictures
• Facebook > 0.5B users
• Farmville has more monthly players than the population of France
Monday's Keynote
• More data
• More users
• More complex actions
• Shorter response times
Scalability Pain
http://media.amazonwebservices.com/pdf/IBMWebinarDeck_Final.pdf
[Chart: infrastructure cost over time. Traditional hardware requires a large capital expenditure sized for predicted demand; when actual demand outgrows capacity, you just lost customers (opportunity cost). Automated virtualization lets capacity track actual demand.]
CAP vs. ACID
• CAP = Consistency, Availability, Partition Tolerance
• ACID = Atomicity, Consistency, Isolation, Durability
• Atomicity – A chain of actions treated as one whole, inseparable action
• Isolation – Consistent query snapshots, reads isolated from concurrent writes; 4 isolation levels are supported
ScaleBase Database Scaling In A Box
• The first truly elastic, fault tolerant SQL based data layer
• Enables linear scaling of any SQL based database
[Diagram: legacy clients and applications connect through ScaleBase to database instances; application/web servers connect through ScaleBase to shared-nothing DB machines running on commodity hardware (MySQL? Oracle?) – scalable and highly available.]
THE REQUIREMENTS FOR DATA SLA IN PUBLIC CLOUD ENVIRONMENTS
What We Need
• Availability
• Consistency
• Scalability
Brewer's (CAP) Theorem
• It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (node failures do not prevent survivors from continuing to operate)
– Partition Tolerance (the system continues to operate despite arbitrary message loss)
http://en.wikipedia.org/wiki/CAP_theorem
What It Means
http://guyharrison.squarespace.com/blog/2010/6/13/consistency-models-in-non-relational-databases.html
Reading More On CAP
• This is an excellent read, and some of my samples are from this blog
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
ACHIEVING DATA SCALABILITY WITH RELATIONAL DATABASES
Databases And CAP
• ACID – Consistency
• Availability – tons of solutions, most of them not cloud oriented
– Oracle RAC
– MySQL Proxy
– Etc.
– Replication based solutions can solve at least read availability and scalability (see Azure SQL)
Database Cloud Solutions
• Amazon RDS
• NaviSite Oracle RAC
• Joyent + Zeus
So Where Is The Problem?
• Scaling problems (usually write but also read)
• Schema change problems
• BigData problems
Scaling Up
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database solutions
• Known as ‘scaling out’ or ‘horizontal scaling’
• Different approaches include:
– Master-slave
– Sharding
Scaling RDBMS – Master/Slave
• Master-Slave
– All writes are written to the master. All reads performed against the replicated slave databases
– Critical reads may be incorrect as writes may not have been propagated down
– Large data sets can pose problems, as the master needs to duplicate data to the slaves
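The stale-read caveat above can be sketched with plain Java collections. This is an illustrative in-memory stand-in (not a real replication protocol): the "master" map takes all writes, the "slave" map serves all reads, and an explicit replicate() step stands in for asynchronous propagation, so a read between a write and the next replication cycle returns stale data.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of master/slave read-write splitting, assuming an
// in-memory stand-in for both databases (names are hypothetical).
public class MasterSlaveSketch {
    private final Map<String, String> master = new HashMap<>();
    private final Map<String, String> slave = new HashMap<>();

    public void write(String key, String value) {
        master.put(key, value);      // all writes go to the master
    }

    public String read(String key) {
        return slave.get(key);       // all reads go to the slave replica
    }

    public void replicate() {
        slave.putAll(master);        // propagate pending writes to the slave
    }
}
```

A read issued before replicate() runs returns the old value (or nothing), which is exactly why "critical reads may be incorrect".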
Scaling RDBMS - Sharding
• Partition or sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
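The "not transparent" drawback above can be made concrete: with application-side sharding, the application itself must know the shard count and compute the target partition for every key. A minimal hash-sharding sketch, using in-memory maps as stand-in shards (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of application-side hash sharding over in-memory shards.
public class HashSharding {
    private final List<Map<String, String>> shards;

    public HashSharding(int shardCount) {
        shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) shards.add(new HashMap<>());
    }

    // floorMod keeps the index non-negative even for negative hash codes
    public int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    public void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    public String get(String key) {
        // single-shard lookup only: a join across two shards has no
        // single place to execute, hence the loss of cross-shard joins
        return shards.get(shardFor(key)).get(key);
    }
}
```

Note that changing the shard count changes shardFor() for existing keys, which is why rebalancing is hard with naive modulo sharding.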
Other ways to scale RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– This involves de-normalizing data
• In-memory databases
ACHIEVING DATA SLA WITH NOSQL
NoSQL
• A term used to designate databases which differ from classic relational databases in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term which would include classic relational databases as a subset.
http://en.wikipedia.org/wiki/NoSQL
NoSQL Types
• Key/Value – A big hash table – Examples: Voldemort, Amazon Dynamo
• Big Table – Big table, column families – Examples: HBase, Cassandra
• Document based – Collections of collections – Examples: CouchDB, MongoDB
• Each solves a different problem
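The three data models above differ mainly in the shape of what sits behind a key. As a rough analogy (plain Java collections, not the real client APIs of any of these products):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative shapes only: how the value behind a key is structured
// in each NoSQL family, modeled with nested Java maps.
public class NoSqlShapes {
    // Key/Value (Voldemort, Dynamo): one opaque value per key
    static Map<String, byte[]> keyValueStore = new HashMap<>();

    // Big Table (HBase, Cassandra): row key -> column family -> column -> value
    static Map<String, Map<String, Map<String, String>>> bigTableStore = new HashMap<>();

    // Document (CouchDB, MongoDB): key -> document of named, possibly nested fields
    static Map<String, Map<String, Object>> documentStore = new HashMap<>();

    static {
        keyValueStore.put("user:100", "Steven King".getBytes());

        bigTableStore.computeIfAbsent("100", r -> new HashMap<>())
                     .computeIfAbsent("name", f -> new HashMap<>())
                     .put("first", "Steven");

        Map<String, Object> doc = new HashMap<>();
        doc.put("first", "Steven");
        doc.put("last", "King");
        documentStore.put("100", doc);
    }
}
```

The key/value store can only fetch the whole blob; the big-table and document stores can address individual fields, which is part of why each solves a different problem.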
NO-SQL
http://browsertoolkit.com/fault-tolerance.png
Pros/Cons
• Pros:
– Performance
– BigData
– Most solutions are open source
– Data is replicated to nodes and is therefore fault-tolerant (partitioning)
– Don't require a schema
– Can scale up and down
• Cons:
– Code change
– Limited framework support
– Not ACID
– Eco system (BI, Backup)
– There is always a database at the backend
– Some API is just too simple
Amazon S3 Code Sample
AWSAuthConnection conn = new AWSAuthConnection(awsAccessKeyId, awsSecretAccessKey, secure, server, format);
Response response = conn.createBucket(bucketName, location, null);
final String text = "this is a test";
response = conn.put(bucketName, key, new S3Object(text.getBytes(), null), null);
Cassandra Code Sample
CassandraClient cl = pool.getClient();
KeySpace ks = cl.getKeySpace("Keyspace1");
// insert value
ColumnPath cp = new ColumnPath("Standard1" , null,
"testInsertAndGetAndRemove".getBytes("utf-8"));
for(int i = 0 ; i < 100 ; i++){
ks.insert("testInsertAndGetAndRemove_"+i, cp ,
("testInsertAndGetAndRemove_value_"+i).getBytes("utf-8"));
}
//get value
for(int i = 0 ; i < 100 ; i++){
Column col = ks.getColumn("testInsertAndGetAndRemove_"+i, cp);
String value = new String(col.getValue(),"utf-8") ;
}
//remove value
for(int i = 0 ; i < 100 ; i++){
ks.remove("testInsertAndGetAndRemove_"+i, cp);
}
Cassandra Code Sample – Cont’
try{
ks.remove("testInsertAndGetAndRemove_not_exist", cp);
}catch(Exception e){
fail("remove not exist row should not throw exceptions");
}
//get already removed value
for(int i = 0 ; i < 100 ; i++){
try{
Column col = ks.getColumn("testInsertAndGetAndRemove_"+i, cp);
fail("the value should already being deleted");
}catch(NotFoundException e){
}catch(Exception e){
fail("throw out other exception, should be NotFoundException." + e.toString());
}
}
pool.releaseClient(cl);
pool.close();
Cassandra Statistics
• Facebook Search
• MySQL > 50 GB Data
– Writes Average : ~300 ms
– Reads Average : ~350 ms
• Rewritten with Cassandra > 50 GB Data
– Writes Average : 0.12 ms
– Reads Average : 15 ms
MongoDB
Mongo m = new Mongo();
DB db = m.getDB( "mydb" );
Set<String> colls = db.getCollectionNames();
for (String s : colls) {
System.out.println(s);
}
MongoDB – Cont’
DBCollection coll = db.getCollection("testCollection");
BasicDBObject doc = new BasicDBObject();
doc.put("name", "MongoDB");
doc.put("type", "database");
doc.put("count", 1);
BasicDBObject info = new BasicDBObject();
info.put("x", 203);
info.put("y", 102);
doc.put("info", info);
coll.insert(doc);
THE BOTTOM LINE
Data SLA
• There is no golden hammer – See http://sourcemaking.com/antipatterns/golden-hammer
• Choose your tool wisely, based on what you need
• Usually
– Start with RDBMS (shortest TTM, which is what we really care about)
– When scale issues occur – start moving to NoSQL based on your needs
• You can get Data Scalability in the cloud – just think before you code!!!
A BIT MORE ON SCALEBASE
How ScaleBase Works
• ScaleBase takes an application database and splits its data across multiple, separated instances (a technique called Sharding)
• Queries and DML are either:
– Directed to the correct instance, or
– Executed simultaneously across several instances
• Results are aggregated and returned to the original application
[Diagram: application → ScaleBase → database instances]
Example
• The original table:
ID  First name  Last name
100 Steven      King
101 Neena       Kochhar
102 Lex         De Haan
103 Alexander   Hunold
104 Bruce       Ernst
105 David       Austin
106 Valli       Pataballa
• Split across three instances (here, by ID mod 3):
Instance 1 (ID mod 3 = 0):
ID  First name  Last name
102 Lex         De Haan
105 David       Austin
Instance 2 (ID mod 3 = 1):
ID  First name  Last name
100 Steven      King
103 Alexander   Hunold
106 Valli       Pataballa
Instance 3 (ID mod 3 = 2):
ID  First name  Last name
101 Neena       Kochhar
104 Bruce       Ernst
ScaleBase Supports
• 3 table types – Master, Global, Split
• Splitting according to Hash, List, Range
• Rebalance, addition and removal of machines
• Instance replication backup: Shadow and Master
• Full consistent 2-Phase Commit
• Joins, Foreign Keys, Subqueries
• DML, DDL, Batch updates, Prepared Statements
• Aggregations, Group By, Order By, Auto Numbering, Timestamps
Sample Code
• site_owner_id is the split key
• Perform the query on all DBs
• Simple aggregation of results
• No Code Change
SELECT site_owner_id, count(*) FROM google.user_clicks WHERE country = 'BRAZIL' GROUP BY site_owner_id
Sample Code
• Perform the query on all DBs
• Aggregation of the aggregations
• No Code Change
SELECT country, count(*) FROM google.user_clicks GROUP BY country
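The "aggregation of the aggregations" step can be sketched as follows. When GROUP BY uses a non-split key, the same group (here, a country) can appear in the partial result of several shards, so the load balancer must merge the partials by summing matching groups. This is an illustrative sketch of that merge, not ScaleBase's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: merge per-shard GROUP BY count(*) results into a global result.
public class CrossShardGroupBy {
    @SafeVarargs
    public static Map<String, Long> merge(Map<String, Long>... perShard) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> partial : perShard) {
            // the same country can occur on several shards, so sum the counts
            partial.forEach((country, count) -> global.merge(country, count, Long::sum));
        }
        return global;
    }
}
```

By contrast, when the GROUP BY key is the split key (the previous sample), each group lives entirely on one shard and the partial results can simply be concatenated.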
Sample Code
• Is split key dynamic or static?
• Each command is added to the correct DB, execution is on all relevant DBs
• No Code Change
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO emp VALUES(?,?,?,?,?)");
for (int i = 0; i < 10; i++) {
    pstmt.setInt(1, 300 + i);
    pstmt.setString(2, "Something" + i);
    pstmt.setDate(3, new Date(System.currentTimeMillis()));
    pstmt.setInt(4, i);
    pstmt.setLong(5, i);
    pstmt.addBatch();
}
int[] result = pstmt.executeBatch();
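The routing step ("each command is added to the correct DB") can be sketched as partitioning a batch by split key: every record joins the sub-batch of its target shard, and each shard then executes only its own sub-batch. A minimal illustration with integer IDs as the split key and modulo hashing (an assumption for this sketch, not ScaleBase's documented hash function):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: split one logical batch of IDs into one sub-batch per shard.
public class BatchRouter {
    public static Map<Integer, List<Integer>> route(List<Integer> ids, int shardCount) {
        Map<Integer, List<Integer>> batches = new HashMap<>();
        for (int id : ids) {
            int shard = Math.floorMod(id, shardCount);   // hypothetical split function
            batches.computeIfAbsent(shard, s -> new ArrayList<>()).add(id);
        }
        return batches;
    }
}
```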
• Elastic SQL database scaling and high-availability solution
– Complete
– Transparent
– Super scalable
– Out of the box
– Non-intrusive
– Flexible
– Manageable
ScaleBase Solution
• Pay much less for hardware and Database licenses
• Get more power, better data spreading and better availability
• Real linear scalability
• Go for real grid, cloud and virtualization
• ScaleBase:
– Is NOT an RDBMS. It facilitates any secure, highly available, archivable RDBMS (Oracle, DB2, MySQL, any!)
– Does NOT require schema structure modifications
– Does NOT require additional sophisticated hardware
With ScaleBase
Moving To ScaleBase
• Implementing ScaleBase is done in minutes
• Just direct your application to your ScaleBase instance
• Target ScaleBase to your original database and target database instances
• ScaleBase will automatically migrate your schema and data to the new servers
• After all data is transferred ScaleBase will start working with target database instances, giving you linear scalability – with no down time!
Where ScaleBase Fits In
• Cloud databases
– Use SQL databases in the cloud, and get Scale Out features and high availability
• High scale applications
– Use your existing code, without migrating to NOSQL, and get unlimited SQL scalability
CUSTOMER CASE STUDIES
Public Cloud
• Scenario:
– A startup developing a complex iPhone application running on EC2
– Requires a high scaling option for its SQL based database
• Problem:
– Amazon RDS offers only a Scale Up option, no Scale Out
• Solution:
– Customer switched their Amazon RDS-based application to ScaleBase on EC2
– Gained transparent, linear scaling
– Running 4 RDS instances behind ScaleBase
Private Cloud
• Scenario:
– A company selling devices that ‘ping’ home every 5 minutes
– 8 digit number of devices sold
• Problem:
– Evaluated MySQL, Oracle – no single machine can support the required number of devices
– Clustering options too expensive, limited scalability
• Solution:
– Customer moved to ScaleBase with no code changes
– Gained linear scalability. Runs 4 MySQL databases behind ScaleBase