Java Web Development with MongoDB (presented at Devoxx 2010)
DESCRIPTION
In this presentation, we will try to answer the following:
- What is a document and a document database?
- How does replication and sharding enable me to scale my application?
- How does Java web development change when using MongoDB?
- How do I deploy my application with MongoDB?
TRANSCRIPT
Alvin Richards [email protected]
Topics
- Overview
- Data modeling
- Replication & Sharding
- Developing with Java
- Deployment
Drinking from the fire hose
Part One
MongoDB Overview
Strong adoption of MongoDB
90,000 Database downloads per month
Over 1,000 Production Deployments
Web 2.0 companies started out using this, but now:
- enterprises
- financial industries
3 Reasons:
- Performance: large number of readers / writers
- Large data volume
- Agility (ease of development)
NoSQL Really Means:
non-relational, next-generation operational datastores and databases
Past: one size fits all
- RDBMS (Oracle, MySQL)

Present: business intelligence and analytics is now its own segment
- RDBMS (Oracle, MySQL)
- New-generation OLAP (Vertica, Aster, Greenplum)

Future: non-relational operational stores ("NoSQL") join them; we claim the NoSQL segment will be:
- large
- not fragmented
- 'platformitize-able'
Philosophy: maximize features - up to the "knee" in the curve, then stop
[Chart: scalability & performance vs. depth of functionality, plotting memcached, key/value stores, and RDBMS]
No joins, no complex transactions enable:
- Horizontally Scalable Architectures
- New Data Models, improved ways to develop
Platform and Language support
MongoDB is implemented in C++ for best performance

Platforms (32/64-bit):
- Windows
- Linux, Mac OS X, FreeBSD, Solaris

Language drivers for:
- Java
- Ruby / Ruby on Rails
- C#
- C / C++
- Erlang
- Python, Perl, JavaScript
- Scala
- others...

Ease of development is a surprisingly big benefit: faster to code, faster to change, avoid upgrades and scheduled downtime. More predictable performance. Fast single-server performance means the developer spends less time manually coding around the database. Bottom line: usually, developers like it much better after trying.
Part Two
Data Modeling in MongoDB
So why model data?
A brief history of normalization
- 1970: E. F. Codd introduces 1st Normal Form (1NF)
- 1971: E. F. Codd introduces 2nd and 3rd Normal Form (2NF, 3NF)
- 1974: Codd & Boyce define Boyce/Codd Normal Form (BCNF)
- 2002: Date, Darwen, Lorentzos define 6th Normal Form (6NF)
Goals:
- Avoid anomalies when inserting, updating or deleting
- Minimize redesign when extending the schema
- Make the model informative to users
- Avoid bias towards a particular style of query
* source : wikipedia
The real benefit of relational
• Before relational• Data and Logic combined
• After relational• Separation of concerns• Data modeled independent of logic• Logic freed from concerns of data design
• MongoDB continues this separation
Relational made normalized data look like this
Document databases make normalized data look like this
Terminology
RDBMS          MongoDB
Table          Collection
Row(s)         JSON Document
Index          Index
Join           Embedding & Linking
Partition      Shard
Partition Key  Shard Key
DB Considerations
How can we manipulate this data?
• Dynamic Queries
• Secondary Indexes
• Atomic Updates
• Map Reduce
Considerations
- No joins
- Document writes are atomic
Access Patterns ?
• Read / Write Ratio
• Types of updates
• Types of queries
• Data life-cycle
So today’s example will use...
Design Session
Design documents that simply map to your application

post = {author: "Hergé",
        date: new Date(),
        text: "Destination Moon",
        tags: ["comic", "adventure"]}
>db.posts.save(post)
>db.posts.find()
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Hergé",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "Destination Moon",
  tags : [ "comic", "adventure" ] }

Notes:
- _id must be unique, but can be anything you'd like
- MongoDB will generate a default _id if one is not supplied
Find the document
Secondary index for “author”
// 1 means ascending, -1 means descending
>db.posts.ensureIndex({author: 1})
>db.posts.find({author: 'Hergé'}) { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "Hergé", ... }
Add an index, find via index
Verifying indexes exist
>db.system.indexes.find()
// Index on ID { name : "_id_", ns : "test.posts", key : { "_id" : 1 } }
// Index on author { _id : ObjectId("4c4ba6c5672c685e5e8aabf4"), ns : "test.posts", key : { "author" : 1 }, name : "author_1" }
Query operators

Conditional operators:
$lt, $lte, $gt, $gte, $ne, $in, $nin, $mod, $all, $size, $exists, $type, ...

// find posts with any tags
>db.posts.find({tags: {$exists: true}})

Regular expressions:
// posts where author starts with h
>db.posts.find({author: /^h/i })

Counting:
// posts written by Hergé
>db.posts.find({author: "Hergé"}).count()
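These dynamic queries are evaluated against each document by the server. As a rough mental model (not the server's implementation - the `matches` function and the operator subset here are ours, for illustration only), a matcher over plain dictionaries looks like this:

```python
import re

def matches(doc, query):
    """Check a document against a tiny subset of MongoDB-style query
    operators. Illustrative only -- the real server supports far more."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):          # operator form, e.g. {"$exists": True}
            for op, arg in cond.items():
                if op == "$exists":
                    if (field in doc) != arg:
                        return False
                elif op == "$gt":
                    if value is None or not value > arg:
                        return False
                elif op == "$in":
                    if value not in arg:
                        return False
                else:
                    raise ValueError("unsupported operator: " + op)
        elif isinstance(cond, re.Pattern):   # regex form, e.g. /^h/i
            if value is None or not cond.search(value):
                return False
        elif isinstance(value, list):        # array fields match on any element
            if cond not in value:
                return False
        elif value != cond:
            return False
    return True

posts = [
    {"author": "Hergé", "text": "Destination Moon", "tags": ["comic", "adventure"]},
    {"author": "Kyle", "text": "great book"},
]
with_tags = [p for p in posts if matches(p, {"tags": {"$exists": True}})]
by_h = [p for p in posts if matches(p, {"author": re.compile("^h", re.I)})]
```

Note how the regex case-insensitivity flag plays the role of the `/.../i` suffix in the shell.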
Extending the Schema

new_comment = {author: "Kyle",
               date: new Date(),
               text: "great book"}

>db.posts.update({_id: "..." },
    {$push: {comments: new_comment},
     $inc: {comments_count: 1}})
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Hergé",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "Destination Moon",
  tags : [ "comic", "adventure" ],
  comments_count: 1,
  comments : [
    { author : "Kyle",
      date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
      text : "great book" } ]}
Extending the Schema

// create index on nested documents:
>db.posts.ensureIndex({"comments.author": 1})
>db.posts.find({"comments.author": "Kyle"})

// find last 5 posts:
>db.posts.find().sort({date: -1}).limit(5)

// most commented post:
>db.posts.find().sort({comments_count: -1}).limit(1)
When sorting, check if you need an index
Explain a query plan

> db.blogs.find({author: 'Hergé'}).explain()
{
  "cursor" : "BtreeCursor author_1",
  "nscanned" : 1,
  "nscannedObjects" : 1,
  "n" : 1,
  "millis" : 5,
  "indexBounds" : { "author" : [ [ "Hergé", "Hergé" ] ] }
}
Watch for full table scans
> db.blogs.find({text: 'Destination Moon'}).explain()
{
  "cursor" : "BasicCursor",
  "nscanned" : 1,
  "nscannedObjects" : 1,
  "n" : 1,
  "millis" : 0,
  "indexBounds" : { }
}
Map Reduce
Map reduce: count tags

mapFunc = function() {
  this.tags.forEach(function(z) { emit(z, {count: 1}); });
}

reduceFunc = function(k, v) {
  var total = 0;
  for (var i = 0; i < v.length; i++) {
    total += v[i].count;
  }
  return {count: total};
}
res = db.posts.mapReduce(mapFunc, reduceFunc)
>db[res.result].find() { _id : "comic", value : { count : 1 } } { _id : "adventure", value : { count : 1 } }
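The contract between mapFunc and reduceFunc is worth making explicit: map emits (key, value) pairs per document, the server groups the values by key, and reduce folds each group. A minimal in-memory Python sketch of that contract (the `map_reduce` helper and the second sample post are ours, for illustration):

```python
from collections import defaultdict

def map_reduce(docs, map_func, reduce_func):
    """Minimal in-memory model of MongoDB's mapReduce: run map over every
    document, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_func(doc):
            groups[key].append(value)
    return {k: reduce_func(k, vs) for k, vs in groups.items()}

# mirrors mapFunc: emit(tag, {count: 1}) for every tag on the post
def map_tags(post):
    return [(tag, {"count": 1}) for tag in post.get("tags", [])]

# mirrors reduceFunc: sum the partial counts for one tag
def reduce_counts(tag, values):
    return {"count": sum(v["count"] for v in values)}

posts = [
    {"text": "Destination Moon", "tags": ["comic", "adventure"]},
    {"text": "Explorers on the Moon", "tags": ["comic"]},
]
result = map_reduce(posts, map_tags, reduce_counts)
# result maps each tag to its total, e.g. "comic" -> {"count": 2}
```

Because reduce may be re-applied to partial results on the server, its output must have the same shape as map's emitted values - which is why reduceFunc returns `{count: total}` rather than a bare number.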
Group
- Equivalent to a GROUP BY in SQL
- Specify the attributes to group the data by
- Process the results in a reduce function
Group

cmd = { key: { "author": true },
        initial: { count: 0 },
        reduce: function(obj, prev) { prev.count++; } };
result = db.posts.group(cmd);

[ { "author" : "Hergé", "count" : 1 },
  { "author" : "Kyle", "count" : 3 } ]
Review
So far:
- Started out with a simple schema
- Queried data
- Evolved the schema
- Queried / updated the data some more
Single Table Inheritance
>db.shapes.find()
{ _id: ObjectId("..."), type: "circle", area: 3.14, radius: 1 }
{ _id: ObjectId("..."), type: "square", area: 4, d: 2 }
{ _id: ObjectId("..."), type: "rect", area: 10, length: 5, width: 2 }
// find shapes where radius > 0 >db.shapes.find({radius: {$gt: 0}})
// create index >db.shapes.ensureIndex({radius: 1})
One to Many
- Embedded Array / Array Keys
  - $slice operator to return a subset of the array
  - some queries hard, e.g. find the latest comments across all documents
- Embedded tree
  - Single document
  - Natural
  - Hard to query
- Normalized (2 collections)
  - most flexible
  - more queries
Many to Many
Example:
- A product can be in many categories
- A category can have many products

Relational approach:
- Products: product_id
- Category: category_id
- Product_Categories: product_id, category_id
Many to Many

products: { _id: ObjectId("4c4ca23933fb5941681b912e"),
            name: "Destination Moon",
            category_ids: [ ObjectId("4c4ca25433fb5941681b912f"),
                            ObjectId("4c4ca25433fb5941681b92af") ]}

categories: { _id: ObjectId("4c4ca25433fb5941681b912f"),
              name: "Adventure",
              product_ids: [ ObjectId("4c4ca23933fb5941681b912e"),
                             ObjectId("4c4ca30433fb5941681b9130"),
                             ObjectId("4c4ca30433fb5941681b913a") ]}

// All categories for a given product
>db.categories.find({product_ids: ObjectId("4c4ca23933fb5941681b912e")})
Alternative

products: { _id: ObjectId("4c4ca23933fb5941681b912e"),
            name: "Destination Moon",
            category_ids: [ ObjectId("4c4ca25433fb5941681b912f"),
                            ObjectId("4c4ca25433fb5941681b92af") ]}

categories: { _id: ObjectId("4c4ca25433fb5941681b912f"),
              name: "Adventure"}

// All products for a given category
>db.products.find({category_ids: ObjectId("4c4ca25433fb5941681b912f")})

// All categories for a given product
>product = db.products.findOne({_id: some_id})
>db.categories.find({_id: {$in: product.category_ids}})
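Both modeling styles boil down to array-membership lookups. A small in-memory sketch of the two queries (the short string ids here are stand-ins for ObjectIds, and the helper names are ours):

```python
# Hypothetical in-memory stand-ins for the products/categories collections.
products = [
    {"_id": "p1", "name": "Destination Moon", "category_ids": ["c1", "c2"]},
    {"_id": "p2", "name": "Explorers on the Moon", "category_ids": ["c1"]},
]
categories = [
    {"_id": "c1", "name": "Adventure"},
    {"_id": "c2", "name": "Comic"},
]

def products_in_category(category_id):
    # db.products.find({category_ids: <id>})
    # array fields match when ANY element equals the queried value
    return [p for p in products if category_id in p["category_ids"]]

def categories_of_product(product_id):
    # the "Alternative" model takes two steps: fetch the product, then
    # db.categories.find({_id: {$in: product.category_ids}})
    product = next(p for p in products if p["_id"] == product_id)
    return [c for c in categories if c["_id"] in product["category_ids"]]
```

The one-sided model trades a second query in one direction for not having to keep two arrays in sync on every change.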
Trees
Full Tree in Document

{ comments: [
    { author: "Kyle", text: "...",
      replies: [ { author: "Fred", text: "...", replies: [] } ]}
]}

Pros: single document, performance, intuitive
Cons: hard to search, partial results, 4MB document limit
Trees
Parent Links
- Each node is stored as a document
- Contains the id of the parent

Child Links
- Each node contains the ids of its children
- Can support graphs (multiple parents per child)
Array of Ancestors
- Store the ancestors of each node

{ _id: "a" }
{ _id: "b", ancestors: [ "a" ], parent: "a" }
{ _id: "c", ancestors: [ "a", "b" ], parent: "b" }
{ _id: "d", ancestors: [ "a", "b" ], parent: "b" }
{ _id: "e", ancestors: [ "a" ], parent: "a" }
{ _id: "f", ancestors: [ "a", "e" ], parent: "e" }
{ _id: "g", ancestors: [ "a", "b", "d" ], parent: "d" }

// find all descendants of b:
>db.tree2.find({ancestors: 'b'})

// find all ancestors of f:
>ancestors = db.tree2.findOne({_id: 'f'}).ancestors
>db.tree2.find({_id: {$in: ancestors}})
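To see why the ancestor-array pattern is attractive, note that both tree queries become a single pass over indexed fields - no recursion. A Python sketch using the slide's tree (the helper names are ours):

```python
# The "array of ancestors" tree from the slide, as plain dicts.
tree = [
    {"_id": "a", "ancestors": []},
    {"_id": "b", "ancestors": ["a"], "parent": "a"},
    {"_id": "c", "ancestors": ["a", "b"], "parent": "b"},
    {"_id": "d", "ancestors": ["a", "b"], "parent": "b"},
    {"_id": "e", "ancestors": ["a"], "parent": "a"},
    {"_id": "f", "ancestors": ["a", "e"], "parent": "e"},
    {"_id": "g", "ancestors": ["a", "b", "d"], "parent": "d"},
]

def descendants(node_id):
    # db.tree2.find({ancestors: node_id}): every node that lists node_id
    # among its ancestors is a descendant -- one indexable query.
    return [n["_id"] for n in tree if node_id in n["ancestors"]]

def ancestors(node_id):
    # findOne the node; its ancestors array already holds the full
    # root-to-parent path, in order.
    node = next(n for n in tree if n["_id"] == node_id)
    return node["ancestors"]
```

The cost is on writes: moving a subtree means rewriting the ancestors array of every node under it.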
findAndModify
Queue example

// Example: find the highest-priority job and mark it in progress
job = db.jobs.findAndModify({
  query: {inprogress: false},
  sort: {priority: -1},
  update: {$set: {inprogress: true, started: new Date()}},
  new: true})
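What findAndModify buys you here is that the select-sort-update happens as one atomic step on the server, so two workers can never claim the same job. This single-threaded Python sketch models only the selection/update logic, not the atomicity (the jobs data and function name are ours):

```python
from datetime import datetime

jobs = [
    {"_id": 1, "priority": 3, "inprogress": False},
    {"_id": 2, "priority": 7, "inprogress": False},
]

def find_and_modify_job(jobs):
    """Model of the findAndModify queue pattern: pick the highest-priority
    job that is not in progress and mark it claimed in one step."""
    candidates = sorted(
        (j for j in jobs if not j["inprogress"]),
        key=lambda j: j["priority"], reverse=True)   # sort: {priority: -1}
    if not candidates:
        return None
    job = candidates[0]
    job["inprogress"] = True                         # the $set update
    job["started"] = datetime.now()
    return job                                       # new: true -> updated doc

job = find_and_modify_job(jobs)   # claims the priority-7 job first
```

Each call claims the next-highest-priority unclaimed job; once the queue is drained it returns None, mirroring findAndModify matching no document.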
Part Three
Replication & Sharding
Scaling
- Data size only goes up
- Operations/sec only go up
- Vertical scaling is limited
- Hard to scale vertically in the cloud
- Can scale wider than higher

What is scaling? Well - hopefully a concern for everyone here.
Traditional Horizontal Scaling
• read only slaves• caching• custom partitioning code
scaling isn’t newsharding isn’tmanual re-balancing is painful at best
New methods of Scaling
• relational database clustering• consistent hashing (Dynamo)• range based partitioning (BigTable/PNUTS)
Read Scalability: Replication

[Diagram: writes go to the Primary of Replica Set 1; reads can go to the Primary or either Secondary]
Basics
- MongoDB replication is a bit like MySQL replication: asynchronous master/slave at its core
- Variations:
  - Master / Slave
  - Replica Pairs (deprecated - use Replica Sets)
  - Replica Sets

Replica Sets
- A cluster of N servers
- Any (one) node can be primary
- Consensus election of primary
- Automatic failover
- Automatic recovery
- All writes go to the primary
- Reads can go to the primary (default) or a secondary
Replica Sets - Design Concepts

1. A write is durable once available on a majority of members
2. Writes may be visible before a cluster-wide commit has been completed
3. On a failover, if data has not been replicated from the primary, that data is dropped (see #1)
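Concepts #1 and #3 are two sides of the same rule: a majority-acknowledged write survives any failover, and writes that never reached the surviving members are rolled back. A tiny Python sketch of both (the function names and log lists are ours, for illustration):

```python
def durable_on_majority(acked_members, total_members):
    """Design concept #1: a write counts as durable once a strict majority
    of replica-set members have it, because any majority that elects the
    next primary must overlap with (and so retain) that write."""
    return acked_members > total_members // 2

def surviving_writes(old_primary_log, survivor_log):
    """Design concept #3: after a failover, writes that never replicated
    to the surviving members are dropped; only the shared prefix wins."""
    return [w for w in old_primary_log if w in survivor_log]

# With a 3-member set: one ack is not durable, two acks are.
```

This is exactly why the Java driver's WriteConcern (shown later) lets you wait for a specific number of servers before acknowledging a write.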
Replica Set: Establishing
[Members 1, 2 and 3 come up]

Replica Set: Electing primary
[Member 2 elected PRIMARY]

Replica Set: Failure of master
[Member 2 DOWN; Members 1 and 3 negotiate a new master; Member 3 becomes PRIMARY]

Replica Set: Reconfiguring
[Member 2 still DOWN; Member 3 PRIMARY]

Replica Set: Member recovers
[Member 2 RECOVERING; Member 3 PRIMARY]

Replica Set: Active
[All members up; Member 3 PRIMARY]
Set Member Types
- Normal (priority == 1)
- Passive (priority == 0)
- Arbiter (no data, but can vote)
Write Scalability: Sharding

[Diagram: reads and writes are routed across three replica sets - Replica Set 1 (key range 0..30), Replica Set 2 (key range 31..60), Replica Set 3 (key range 61..100) - each with a Primary and two Secondaries]
Sharding
- Scale horizontally for data size, index size, write and consistent-read scaling
- Distribute databases, collections, or objects within a collection
- Auto-balancing, migrations and management happen with no downtime
- Replica Sets for inconsistent (eventually consistent) read scaling
Sharding
- Choose how you partition data
- Can convert from a single master to a sharded system with no downtime
- Same features as a non-sharded single master
- Fully consistent
Range Based
- A collection is broken into chunks by range
- Chunks default to 200MB or 100,000 objects
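Range partitioning means that routing an operation by shard key is just a lookup in a sorted chunk table. A minimal Python sketch (the chunk boundaries mirror the earlier diagram; the table layout and function name are ours):

```python
import bisect

# Hypothetical chunk table: each chunk owns shard keys in [low, high).
chunks = [
    {"low": float("-inf"), "high": 31, "shard": "rs1"},   # keys 0..30
    {"low": 31, "high": 61, "shard": "rs2"},              # keys 31..60
    {"low": 61, "high": float("inf"), "shard": "rs3"},    # keys 61..100
]

def shard_for_key(key):
    """Range-based partitioning: binary-search for the chunk whose
    [low, high) range contains the shard key."""
    highs = [c["high"] for c in chunks]
    return chunks[bisect.bisect_right(highs, key)]["shard"]
```

Because neighbouring keys land in the same chunk, range queries on the shard key touch few shards - the property behind "routed" queries below.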
Architecture

[Diagram: clients connect to one or more mongos routers; each mongos routes operations to the shards (each shard is one or more mongod processes) and consults the config server mongod processes]
Config Servers
- Hold the metadata of where chunks are located
- 1 or 3 of them (3 for availability)
- Changes are made with a two-phase commit
- If a majority are down, the metadata goes read-only
- The system stays online as long as 1 of the 3 is up
Shards
- Hold the actual data
- Can be master, master/slave, or replica sets
- Replica sets give sharding + full auto-failover
- Regular mongod processes
mongos
- Sharding router (or switch)
- Acts just like a mongod to clients
- Can have 1 or as many as you want
- Can run on the app server, so no extra network traffic
Writes
- Inserts: require the shard key, routed
- Removes: routed and/or scattered
- Updates: routed or scattered
Queries
- By shard key: routed
- Sorted by shard key: routed in order
- By non-shard key: scatter/gather
- Sorted by non-shard key: distributed merge sort
Operations
- split: breaking a chunk into 2
- migrate: moving a chunk from one shard to another
- balancing: moving chunks automatically to keep the system in balance
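split and balancing can be sketched as simple functions. This is an illustration of the idea only - the median-cut and one-chunk-at-a-time policies here are simplifications we chose, not MongoDB's exact algorithms:

```python
def split_chunk(chunk, docs, max_docs=100_000):
    """split: when a chunk exceeds its limit, cut it at the median key
    so each half owns a contiguous sub-range."""
    if len(docs) <= max_docs:
        return [chunk]
    keys = sorted(d["key"] for d in docs)
    mid = keys[len(keys) // 2]
    return [
        {"low": chunk["low"], "high": mid, "shard": chunk["shard"]},
        {"low": mid, "high": chunk["high"], "shard": chunk["shard"]},
    ]

def balance(chunk_counts):
    """balancing: migrate one chunk at a time from the most-loaded shard
    to the least-loaded until the spread is at most one chunk."""
    counts = dict(chunk_counts)
    while max(counts.values()) - min(counts.values()) > 1:
        src = max(counts, key=counts.get)
        dst = min(counts, key=counts.get)
        counts[src] -= 1          # one "migrate" operation
        counts[dst] += 1
    return counts

halves = split_chunk({"low": 0, "high": 100, "shard": "rs1"},
                     [{"key": i} for i in range(10)], max_docs=4)
balanced = balance({"rs1": 10, "rs2": 2, "rs3": 3})
```

Note that splitting never moves data (both halves stay on the same shard); only migrate does, which is why balancing is throttled to one chunk at a time.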
Part Four
Java Development
Library Choices
- Raw MongoDB Driver
  - Map<String, Object> view of objects
  - Rough but dynamic
- Morphia (type-safe mapper)
  - POJOs
  - Annotation based (similar to JPA)
  - Syntactic sugar and helpers
- Others
  - Code generators, other JVM languages
MongoDB Java Driver
- BSON Package
  - Types
  - Encode/Decode
  - DBObject (Map<String, Object>)
  - Nested Maps
  - Directly encoded to the binary format (BSON)
- MongoDB Package
  - MongoDBObject (BasicDBObject/Builder)
  - DB/DBCollection
  - DBQuery/DBCursor
BSON Package - Types
- int and long
- Array/ArrayList
- String
- byte[] - binData
- Double (IEEE 754 FP)
- Date (milliseconds since epoch)
- Null
- Boolean
- JavaScript String
- Regex
MongoDB Package
- Mongo
  - Connection, thread-safe
  - WriteConcern*
- DB
  - Auth, Collections
  - getLastError()
  - command(), eval()
  - requestStart/Done
- DBCollection
  - insert/save/find/remove/update/findAndModify
  - ensureIndex
Simple Example

DB db = new Mongo().getDB("blogdb");
DBCollection coll = db.getCollection("posts");

ArrayList<String> tags = new ArrayList<String>();
tags.add("comic");
tags.add("adventure");

coll.save(BasicDBObjectBuilder.start("author", "Hergé")
    .append("text", "Destination Moon")
    .append("date", new Date())
    .append("tags", tags)
    .get());
Simple Example, Again

DB db = new Mongo().getDB("blogdb");
DBCollection coll = db.getCollection("posts");

ArrayList<String> tags = new ArrayList<String>();
tags.add("comic");
tags.add("adventure");

Map<String, Object> fields = new HashMap<String, Object>();
fields.put("author", "Hergé");
fields.put("text", "Destination Moon");
fields.put("date", new Date());
fields.put("tags", tags);

coll.insert(new BasicDBObject(fields));
DBObject <-> (B/J)SON

{ author: "Hergé", text: "Destination Moon", date: ... }

DBObject dbObj = BasicDBObjectBuilder.start()
    .append("author", "Hergé")
    .append("text", "Destination Moon")
    .append("date", new Date())
    .get();

String text = (String) dbObj.get("text");
JSON.parse(...)

DBObject dbObj = (DBObject) JSON.parse(
    "{'author': 'Hergé', " +
    " 'text': 'Destination Moon', " +
    " 'date': 'Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)'}");
Lists

List<String> tags = new ArrayList<String>();
tags.add("comic");
tags.add("adventure");
dbObj.put("tags", tags);

{..., tags: ['comic', 'adventure']}
Maps of Maps
- Can represent an object graph/tree
- Always keyed off a String (field)
Morphia: MongoDB Mapper
- Maps POJOs
- Type-safe
- Access patterns: DAO/Datastore/???
- Data types
- JPA-like
- Many concepts came from Objectify (GAE)
Annotations
- @Entity("collectionName")
- @Id
- @Transient (not transient)
- @Indexed(...)
- @Property("fieldAlias")
- @AlsoLoad({aliases})
- @Reference
- @Serialized
- [@Embedded]
Lifecycle Events
- @PrePersist
- @PreSave
- @PostPersist
- @PreLoad
- @PostLoad
- EntityListeners
- EntityInterceptor
Basic POJO

@Entity
class Blog {
  @Id String author;
  @Indexed Date date;
  String text;
}
Datastore Basics
- get(class, id)
- find(class, [...])
- save(entity, [...])
- delete(query)
- getCount(query)
- update/First(query, upOps)
- findAndModify/Delete(query, upOps)
Add, Get, Delete

Blog entry = new Blog("Hergé", new Date(), "Destination Moon");

Datastore ds = new Morphia().createDatastore();

ds.save(entry);

Blog foundEntry = ds.get(Blog.class, "Hergé");

ds.delete(entry);
Queries

Datastore ds = ...
Query<Blog> q = ds.createQuery(Blog.class);

q.field("author").equal("Hergé").limit(5);

for (Blog e : q.fetch()) print(e);

Blog entry = q.field("author").startsWith("H").get();
Update

Datastore ds = ...
Query<Blog> q = ds.find(Blog.class, "author", "Hergé");
UpdateOperations<Blog> uo = ds.createUpdateOperations(Blog.class);

uo.inc("views", 1).set("lastUpdated", new Date());

UpdateResults res = ds.update(q, uo);
if (res.getUpdatedCount() > 0) // do something?
Update Operations
- set(field, val) / unset(field)
- inc(field, [val]) / dec(field)
- add(field, val) / addAll(field, vals)
- removeFirst/Last(field) / removeAll(field, vals)
Relationships
- [@Embedded]
  - Loaded/saved with the entity
- @Reference
  - Stored as DBRef(s)
  - Loaded with the entity
  - Not automatically saved
- Key<T> (DBRef)
  - Stored as DBRef(s)
  - Just a link, but resolvable by Datastore/Query
MongoDB features in Java
- Durability
- Replication
- Sharding
- Connection options
Durability
What failures do you need to recover from?
- Loss of a single database node?
- Loss of a group of nodes?
Durability - Master only
• Write acknowledged when in memory on master only
Durability - Master + Slaves
• Write acknowledged when in memory on master + slave
• Will survive failure of a single node
Durability - Master + Slaves + fsync
- Write acknowledged when in memory on master + slaves
- Pick a "majority" of nodes
- fsync in batches (since it is blocking)
Setting default error checking

// Do not check or report errors on write
com.mongodb.WriteConcern.NONE;

// Use default level of error checking. Do not send
// a getLastError(), but raise exceptions on errors
com.mongodb.WriteConcern.NORMAL;

// Send getLastError() after each write. Raise an
// exception on error
com.mongodb.WriteConcern.STRICT;
// Set the concerndb.setWriteConcern(concern);
Customized WriteConcern

// Wait for three servers to acknowledge the write
WriteConcern concern = new WriteConcern(3);

// Wait for three servers, with a 1000ms timeout
WriteConcern concern = new WriteConcern(3, 1000);

// Wait for three servers, 1000ms timeout, and fsync
// data to disk
WriteConcern concern = new WriteConcern(3, 1000, true);

// Set the concern
db.setWriteConcern(concern);
Using Replication from Java
slaveOk()
- tells the driver it may send read requests to Secondaries
- the driver will always send writes to the Primary

Can be set on:
- DB.slaveOk()
- Collection.slaveOk()
- find(q).addOption(Bytes.QUERYOPTION_SLAVEOK);
Using sharding from Java
Before sharding
coll.save(BasicDBObjectBuilder.start("author", "Hergé")
    .append("text", "Destination Moon")
    .append("date", new Date())
    .get());
Query q = ds.find(Blog.class, “author”, “Hergé”);
After sharding
No code change required!
Connection options
MongoOptions mo = new MongoOptions();
// Restrict number of connectionsmo.connectionsPerHost = MAX_THREADS + 5;
// Auto reconnection on connection failuremo.autoConnectRetry = true;
Part Five
Deploying MongoDB
- Performance tuning
- Sizing
- O/S tuning / file system layout
- Backup
Backup
- Typically backups are driven from a slave
- Eliminates impact to client/application traffic on the master

Two strategies:
- mongodump / mongorestore
- fsync + lock
mongodump
- binary, compact object dump
- each object written is individually consistent
- not necessarily consistent from start to finish
fsync + lock
- fsync flushes buffers to disk
- lock blocks writes

>db.runCommand({fsync: 1, lock: 1})

- Use a file-system / LVM / storage snapshot
- Unlock:
>db.$cmd.sys.unlock.findOne();
Slave delay
- Protection against app faults
- Protection against administration mistakes
O/S Config
- RAM: lots of it
- Filesystem: EXT4 / XFS (better file allocation & performance)
- I/O: the more disks the better; consider RAID10 or other RAID configs
Monitoring
• Munin, Cacti, Nagios
Primary functions:
- Measure stats over time
- Tell you what is going on with your system
- Alert when a threshold is reached
Remember me?
Summary
MongoDB makes building Java web applications simple.
You can focus on what the app needs to do.
MongoDB has built-in:
- Horizontal scaling (reads and writes)
- Simplified schema evolution
- Simplified deployment and operations
- Best match for development tools and agile processes