mongo db

Post on 03-Sep-2014











by toki

About me

● Delta Electronic CTBD Senior Engineer
● Main developer of
  ○ Website built on MongoDB with 600k page views (PV) per day
  ○ Data grows every day through automated crawler bots

MongoDB - Simple Introduction

● Document-based NOSQL (Not Only SQL) database
● Started in 2007 by 10gen
● Written in C++
● Fast (but uses a lot of memory)
● Stores JSON documents in BSON format
● Full index support on any document attribute
● Horizontal scalability with auto-sharding
● High availability, replication ready

What is database?

● Raw data
  ○ John is a student, he's 12 years old.
● Data
  ○ Student
    ■ name = "John"
    ■ age = 12
● Records
  ○ Student(name="John", age=12)
  ○ Student(name="Alice", age=11)
● Database
  ○ Student Table
  ○ Grades Table

Example of (relational) database


[Figure: relational tables (including Student Grade) linked through shared Student ID, Class ID, and Grade ID columns]


SQL Language - How to find data?

● Find the student whose name is John
  ○ select * from student where name="John"
● Find the class name of John
  ○ select s.name, c.name as class_name from student s, class c where s.name="John" and s.class_id=c.class_id


● Big data
  ○ Modern data sets are too big for a single DB server
  ○ Google search engine
● Connectivity
  ○ Facebook like button
● Semi-structured data
  ○ Car equipment database
● High availability
  ○ The basis of cloud services

Common NOSQL DB characteristics

● Schemaless
● No joins, stores pre-joined/embedded data
● Horizontal scalability
● Replication ready - high availability

Common types of NOSQL DB

● Key-Value
  ○ Based on Amazon's Dynamo paper
  ○ Stores K-V pairs
  ○ Example:
    ■ Dynomite
    ■ Voldemort

Common types of NOSQL DB

● Bigtable clones
  ○ Based on Google's Bigtable paper
  ○ Column oriented, but handles semi-structured data
  ○ Data keyed by: row, column, time, index
  ○ Example:
    ■ Google Bigtable
    ■ HBase
    ■ Cassandra (FB)

Common types of NOSQL DB

● Document based
  ○ Stores multi-level K-V pairs
  ○ Usually uses JSON as the document format
  ○ Example:
    ■ MongoDB
    ■ CouchDB (Apache)
    ■ Redis (more commonly classed as a key-value store)

Common types of NOSQL DB

● Graph
  ○ Focuses on modeling the structure of data - interconnectivity
  ○ Example:
    ■ Neo4j
    ■ AllegroGraph

Start using MongoDB - Installation

● From apt-get (Debian / Ubuntu only)
  ○ sudo apt-get install mongodb
● Using the 10gen MongoDB repository
  ○ mongodb-on-debian-or-ubuntu-linux/
● From pre-built binary or source
● Note: 32-bit builds are limited to around 2GB of data

Manual start your MongoDB

mkdir -p /tmp/mongo
mongod --dbpath /tmp/mongo

mongod -f mongodb.conf

Verify your MongoDB installation

$ mongo
MongoDB shell version: 2.2.0
connecting to: test
> _

mongo localhost/test2
mongo

How many databases do you have?

show dbs

Elements of MongoDB

● Database
  ○ Collection
    ■ Document

What is JSON

● JavaScript Object Notation
● Elements of JSON
  ○ Object: K/V pairs
  ○ Key: a string
  ○ Value: can be
    ■ string
    ■ bool
    ■ number
    ■ array
    ■ object
    ■ null

{
    "key1": "value1",
    "key2": 2.0,
    "key3": [1, "str", 3.0],
    "key4": false,
    "key5": {
        "name": "another object"
    }
}
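A quick way to check that a document is well-formed JSON is to parse it. The sketch below uses the standard `JSON.parse` on a well-formed version of the sample object and reports the JSON type of each value. The helper `jsonType` and the extra `key6` entry (added only to show `null`) are illustrative assumptions, not part of the slides.

```javascript
// A well-formed version of the sample object; key6 is added here to show null.
const text = `{
  "key1": "value1",
  "key2": 2.0,
  "key3": [1, "str", 3.0],
  "key4": false,
  "key5": {"name": "another object"},
  "key6": null
}`;

// Map a parsed JavaScript value back to the JSON type names used on the slide.
function jsonType(value) {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  if (typeof value === "boolean") return "bool";
  return typeof value; // "string", "number" or "object"
}

const doc = JSON.parse(text); // throws SyntaxError on malformed input
const types = {};
for (const key of Object.keys(doc)) {
  types[key] = jsonType(doc[key]);
}
console.log(types);
```

A missing comma or a trailing comma in the text makes `JSON.parse` throw, which is an easy way to catch such mistakes before inserting a document.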


Another sample of JSON

{
    "name": "John",
    "age": 12,
    "grades": {
        "math": 4.0,
        "english": 5.0
    },
    "registered": true,
    "favorite subjects": ["math", "english"]
}


Insert document into MongoDB

s = {
    "name": "John",
    "age": 12,
    "grades": {
        "math": 4.0,
        "english": 5.0
    },
    "registered": true,
    "favorite subjects": ["math", "english"]
}
db.students.insert(s)

Verify inserted document

db.students.find()

also try

show collections

Save document into MongoDB

s.name = "Alice"
s.age = 14
s.grades.math = 2.0
db.students.save(s)

What is _id / ObjectId ?

● _id is the default primary key used to index documents; it can be any value JSON accepts.

● By default, MongoDB auto-generates an ObjectId as _id

● An ObjectId is a 12-byte value that uniquely identifies a document

● Use ObjectId().getTimestamp() to extract the timestamp stored in an ObjectId

Bytes   0-3              4-6       7-8          9-11
Field   unix timestamp   machine   process id   increment
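The byte layout can be checked without a database by slicing the 24-character hex form of an ObjectId. This is a minimal sketch assuming the layout described here (4-byte timestamp, 3-byte machine, 2-byte process id, 3-byte increment); `parseObjectId` and the sample id are illustrative and do not come from the slides.

```javascript
// Decode the fields packed into a 24-character hex ObjectId string.
// Assumed layout: bytes 0-3 timestamp, 4-6 machine, 7-8 process id, 9-11 increment.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    seconds: parseInt(hex.slice(0, 8), 16),     // unix timestamp in seconds
    machine: hex.slice(8, 14),                  // kept as hex text
    processId: parseInt(hex.slice(14, 18), 16),
    increment: parseInt(hex.slice(18, 24), 16),
  };
}

const sampleId = "507f1f77bcf86cd799439011"; // example value, not from the slides
const parts = parseObjectId(sampleId);
console.log(parts, new Date(parts.seconds * 1000).toISOString());
```

Because the timestamp occupies the leading bytes, ObjectIds generated later compare greater than earlier ones, which is why sorting on _id roughly follows insertion time.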

Save document with id into MongoDB

s.name = "Bob"
s.age = 11
s['favorite subjects'] = ["music", "math", "art"]
s.grades.chinese = 3.0
s._id = 1
db.students.save(s)

Save document with existing _id

delete s.registered
db.students.save(s)

How to find documents?

● db.xxxx.find()
  ○ lists all documents in the collection
● db.xxxx.find(
      find spec,    // what the documents look like
      find fields,  // which parts I want to see
      ...
  )
● db.xxxx.findOne()
  ○ returns only the first document that matches the find spec

find by id

db.students.find({_id: 1})
db.students.find({_id: ObjectId('xxx....')})

find and filter return fields

db.students.find({_id: 1}, {_id: 1})
db.students.find({_id: 1}, {name: 1})
db.students.find({_id: 1}, {_id: 1, name: 1})
db.students.find({_id: 1}, {_id: 0, name: 1})

find by name - equal or not equal

db.students.find({name: "John"})
db.students.find({name: "Alice"})
db.students.find({name: {$ne: "John"}})

● $ne : not equal

find by name - ignorecase ($regex)

db.students.find({name: "john"}) => X
db.students.find({name: /john/i}) => O

db.students.find({name: {
    $regex: "^b", $options: "i"
}})


find by range of names - $in, $nin

db.students.find({name: {$in: ["John", "Bob"]}})
db.students.find({name: {$nin: ["John", "Bob"]}})

● $in : value is in the given array of items
● $nin : value is not in the given array

find by age - $gt, $gte, $lt, $lte

db.students.find({age: {$gt: 12}})
db.students.find({age: {$gte: 12}})
db.students.find({age: {$lt: 12}})
db.students.find({age: {$lte: 12}})

● $gt : greater than
● $gte : greater than or equal
● $lt : less than
● $lte : less than or equal

find by field existence - $exists

db.students.find({registered: {$exists: true}})
db.students.find({registered: {$exists: false}})

find by field type - $type

db.students.find({_id: {$type: 7}})
db.students.find({_id: {$type: 1}})

Number  Type            Number  Type
1       Double          11      Regular expression
2       String          13      JavaScript code
3       Object          14      Symbol
4       Array           15      JavaScript code with scope
5       Binary Data     16      32 bit integer
7       Object id       17      Timestamp
8       Boolean         18      64 bit integer
9       Date            255     Min key
10      Null            127     Max key

find in multi-level fields

db.students.find({"grades.math": {$gt: 2.0}})
db.students.find({"grades.math": {$gte: 2.0}})

find by remainder - $mod

db.students.find({age: {$mod: [10, 2]}})
db.students.find({age: {$mod: [10, 3]}})

find in array - $size

db.students.find({'favorite subjects': {$size: 2}})

db.students.find({'favorite subjects': {$size: 3}})

find in array - $all

db.students.find({'favorite subjects': {
    $all: ["music", "math", "art"]
}})
db.students.find({'favorite subjects': {
    $all: ["english", "math"]
}})
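The matching rules behind $size and $all can be modelled on plain arrays. The sketch below is not MongoDB code; it only mirrors the semantics (exact length for $size, "contains every listed value in any order" for $all). The helpers `matchesSize` and `matchesAll` are hypothetical names, and the two sample students reuse values from the earlier slides.

```javascript
// Two sample documents using the field values shown on earlier slides.
const students = [
  {name: "John", "favorite subjects": ["math", "english"]},
  {name: "Bob", "favorite subjects": ["music", "math", "art"]},
];

// $size: the array has exactly n elements.
const matchesSize = (arr, n) => Array.isArray(arr) && arr.length === n;

// $all: every wanted value is present, in any order.
const matchesAll = (arr, wanted) =>
  Array.isArray(arr) && wanted.every((v) => arr.includes(v));

const field = "favorite subjects";
const sizeTwo = students
  .filter((s) => matchesSize(s[field], 2))
  .map((s) => s.name);
const hasEnglishAndMath = students
  .filter((s) => matchesAll(s[field], ["english", "math"]))
  .map((s) => s.name);

console.log(sizeTwo, hasEnglishAndMath);
```

Note that $all ignores order: ["english", "math"] matches John even though his array stores the values as ["math", "english"].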

find in array - find value in array

db.students.find({"favorite subjects": "art"})

db.students.find({"favorite subjects": "math"})


find with bool operators - $and, $or

db.students.find({$or: [
    {age: {$lt: 12}},
    {age: {$gt: 12}}
]})

db.students.find({$and: [
    {age: {$lt: 12}},
    {age: {$gte: 11}}
]})


find with bool operators - $and, $or

db.students.find({$and: [
    {age: {$lt: 12}},
    {age: {$gte: 11}}
]})

is equivalent to

db.students.find({age: {$lt: 12, $gte: 11}})

find with bool operators - $not

$not can only be applied to another operator expression, not to a plain value

X db.students.find({registered: {$not: false}})
O db.students.find({registered: {$ne: false}})

O db.students.find({age: {$not: {$gte: 12}}})

find with JavaScript- $where

db.students.find({$where: "this.age > 12"})



find cursor functions

● count
  db.students.find().count()
● limit
  db.students.find().limit(1)
● skip
  db.students.find().skip(1)
● sort
  db.students.find().sort({age: -1})
  db.students.find().sort({age: 1})

combine find cursor functions

db.students.find().skip(1).limit(1)
db.students.find().skip(1).sort({age: -1})
db.students.find().skip(1).limit(1).sort({age: -1})
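When these functions are combined, the server applies sort first, then skip, then limit, whatever order they were chained in. The plain-array model below illustrates that order; `runCursor` is a hypothetical helper written for this sketch, not a MongoDB API, and the sample ages are taken from the earlier slides.

```javascript
// Plain-array model of find().sort().skip().limit().
// The stages always run as sort -> skip -> limit, like a MongoDB cursor.
function runCursor(docs, {sort, skip = 0, limit = 0} = {}) {
  let result = docs.slice(); // never mutate the input
  if (sort) {
    const [field, direction] = Object.entries(sort)[0];
    result.sort((a, b) => (a[field] - b[field]) * direction);
  }
  result = result.slice(skip);
  if (limit > 0) result = result.slice(0, limit); // 0 means "no limit"
  return result;
}

const students = [
  {name: "John", age: 12},
  {name: "Alice", age: 14},
  {name: "Bob", age: 11},
];

// Mirrors db.students.find().skip(1).limit(1).sort({age: -1})
const page = runCursor(students, {sort: {age: -1}, skip: 1, limit: 1});
console.log(page);
```

With the sample data the result is the second-oldest student, because the skip is applied to the sorted list rather than to insertion order.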

more cursor functions

● snapshot
  ensures the cursor
  ○ returns no duplicates
  ○ misses no object
  ○ returns all matching objects that were present at the beginning and the end of the query
  ○ usually used for export/dump

more cursor functions

● batchSize
  tells MongoDB how many documents to send to the client at once
● explain
  for performance profiling
● hint
  tells MongoDB which index to use for querying/sorting

list current running operations

● list operations
  db.currentOp()
● cancel an operation
  db.killOp(opid)

MongoDB index - when to use index?

● when running complicated queries
● when sorting lots of data

MongoDB index - sort() example

for (i=0; i<1000000; i++){
    db.many.insert({value: i});
}

db.many.find().sort({value: -1})

error: {
    "$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
    "code" : 10128
}


MongoDB index - how to build index

db.many.ensureIndex({value: 1})

● Index options
  ○ background
  ○ unique
  ○ dropDups
  ○ sparse

MongoDB index - index commands

● list indexes
  db.many.getIndexes()
● drop index
  db.many.dropIndex({value: 1})
  db.many.dropIndexes() <-- DANGER!

MongoDB Index - find() example

db.many.dropIndex({value: 1})
db.many.find({value: 5555}).explain()

db.many.ensureIndex({value: 1})
db.many.find({value: 5555}).explain()

MongoDB Index - Compound Index

db.xxxx.ensureIndex({a:1, b:-1, c:1})

Queries/sorts on the following field combinations will be accelerated by this index:
● a
● a, b
● a, b, c
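The prefix rule for a compound index can be written as a small check. This is a simplified sketch: `usesIndexPrefix` is a hypothetical helper, and it only answers whether the queried fields are exactly a leading prefix of the index keys. The real query planner is more flexible (for instance, a query on a and c can still use the index for the a part).

```javascript
// True when the queried fields are exactly the first n keys of the index.
// The order of fields inside the query itself does not matter.
function usesIndexPrefix(indexSpec, queryFields) {
  const indexKeys = Object.keys(indexSpec); // insertion order: a, b, c
  if (queryFields.length === 0 || queryFields.length > indexKeys.length) {
    return false;
  }
  const prefix = indexKeys.slice(0, queryFields.length);
  return queryFields.every((f) => prefix.includes(f));
}

const index = {a: 1, b: -1, c: 1};

console.log(usesIndexPrefix(index, ["a"]));            // prefix: a
console.log(usesIndexPrefix(index, ["b", "a"]));       // prefix: a, b
console.log(usesIndexPrefix(index, ["a", "b", "c"]));  // prefix: a, b, c
console.log(usesIndexPrefix(index, ["b", "c"]));       // not a prefix
```

The last call is the case to watch for: a query on b and c alone skips the leading key a, so this index does not cover it under the prefix rule.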

Remove/Drop data from MongoDB

● Remove
  db.many.remove({value: 5555})
  db.many.find({value: 5555})
  db.many.remove()
● Drop
  db.many.drop()
● Drop database
  db.dropDatabase() EXTREMELY DANGEROUS!!!

How to update data in MongoDB

Easiest way:

s = db.students.findOne({_id: 1})
s.registered =

In place update - update()

update(
    {find spec},
    {update spec},
    upsert=false
)

db.students.update(
    {_id: 1},
    {$set: {registered: false}}
)


Update a non-exist document

db.students.update({_id: 2}, {name: 'Mary', age: 9}, true)

db.students.update({_id: 2}, {$set: {name: 'Mary', age: 9}}, true)


set / unset field value

db.students.update({_id: 1}, {$set: {"age": 15}})

db.students.update({_id: 1}, {$set: {registered: {2012: false, 2011: true}}})

db.students.update({_id: 1}, {$unset: {registered: 1}})

increase/decrease value

db.students.update({_id: 1}, {$inc: {
    "grades.math": 1.1,
    "grades.english": -1.5,
    "grades.history": 3.0
}})


push value(s) into array

db.students.update({_id: 1}, {
    $push: {tags: "lazy"}
})

db.students.update({_id: 1}, {
    $pushAll: {tags: ["smart", "cute"]}
})


add only not exists value to array

db.students.update({_id: 1}, {
    $push: {tags: "lazy"}
})
db.students.update({_id: 1}, {
    $addToSet: {tags: "lazy"}
})
db.students.update({_id: 1}, {
    $addToSet: {tags: {$each: ["tall", "thin"]}}
})

remove value from array

db.students.update({_id: 1}, {
    $pull: {tags: "lazy"}
})
db.students.update({_id: 1}, {
    $pull: {tags: {$ne: "smart"}}
})
db.students.update({_id: 1}, {
    $pullAll: {tags: ["lazy", "smart"]}
})

pop value from array

a = []; for(i=0;i<20;i++){a.push(i);}
db.test.insert({_id:1, value: a})

db.test.update({_id: 1}, {
    $pop: {value: 1}
})
db.test.update({_id: 1}, {
    $pop: {value: -1}
})

rename field

db.test.update({_id: 1}, {
    $rename: {value: "values"}
})


Practice: add comments to student

Add a field to the student with {_id: 1}:
● field name: comments
● field type: array of dictionaries (objects)
● field content:
  ○ {
        by: author name, string
        text: content of comment, string
    }
● add at least 3 comments to this field

Example answer to practice

db.students.update({_id: 1}, {
    $addToSet: { comments: {$each: [
        {by: "teacher01", text: "text 01"},
        {by: "teacher02", text: "text 02"},
        {by: "teacher03", text: "text 03"}
    ]}}
})


The $ position operator (for array)

db.students.update({
    _id: 1,
    "comments.by": "teacher02"
}, {
    $inc: {"comments.$.vote": 1}
})
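What the $ operator does can be modelled on a plain object: the query locates the first array element that matches, and $ in the update path stands for that element's position. The sketch below is an illustration only; `incVoteForAuthor` is a hypothetical helper, and it assumes the query matches on the comment author field `by` used in the practice answer.

```javascript
// Model of: update({_id: 1, "comments.by": author},
//                  {$inc: {"comments.$.vote": 1}})
function incVoteForAuthor(student, author) {
  const index = student.comments.findIndex((c) => c.by === author);
  if (index === -1) return false;           // query matched nothing, no update
  const comment = student.comments[index];  // "$" resolves to this position
  comment.vote = (comment.vote || 0) + 1;   // $inc creates the field if missing
  return true;
}

const student = {
  _id: 1,
  comments: [
    {by: "teacher01", text: "text 01"},
    {by: "teacher02", text: "text 02"},
    {by: "teacher03", text: "text 03"},
  ],
};

incVoteForAuthor(student, "teacher02");
incVoteForAuthor(student, "teacher02");
console.log(student.comments);
```

Only the matched element changes; the other comments keep no vote field at all, which matches how $inc leaves unmatched array elements untouched.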


Atomically update - findAndModify

● Atomically updates a SINGLE DOCUMENT and returns it

● By default, the returned document does not contain the modifications made by the findAndModify command.

findAndModify parameters

db.xxxx.findAndModify({
    query: filter to query
    sort: how to sort and select the 1st document in the query results
    remove: set true if you want to remove it
    update: update content
    new: set true if you want to get the modified object
    fields: which fields to fetch
    upsert: create the object if it does not exist
})


● MongoDB has a 16MB document size limit
● GridFS is for storing large binary objects in MongoDB
● GridFS is a specification, not an implementation
● The implementation is done by the MongoDB drivers
● Currently supported drivers:
  ○ PHP
  ○ Java
  ○ Python
  ○ Ruby
  ○ Perl

GridFS - command line tools

● List
  mongofiles list
● Put
  mongofiles put xxx.txt
● Get
  mongofiles get xxx.txt

MongoDB config - basic

● dbpath
  ○ The folder where MongoDB stores its database files
  ○ MongoDB must have write permission to this folder
● logpath, logappend
  ○ logpath = log filename
  ○ MongoDB must have write permission to the log file
● bind_ip
  ○ IP(s) MongoDB will bind to; the default is all
  ○ Use a comma to separate more than one IP
● port
  ○ Port number MongoDB will use
  ○ Default port = 27017

Small tip - rotate MongoDB log


MongoDB config - journal

● journal
  ○ Turns journaling on/off
  ○ Usually you should keep this on

MongoDB config - http interface

● nohttpinterface
  ○ Set this option to disable the HTTP interface
  ○ By default the HTTP interface listens on http://localhost:28017
  ○ Shows statistics through the HTTP interface
● rest
  ○ Only takes effect when the HTTP interface is enabled
  ○ Example:


MongoDB config - authentication

● auth
  ○ By default, MongoDB runs with no authentication
  ○ If no admin account has been created, you can log in without authentication through a local mongo shell and start managing user accounts.

MongoDB account management

● Add admin user
  > mongo localhost/admin
  db.addUser("testadmin", "1234")
● Authenticate as admin user
  use admin
  db.auth("testadmin", "1234")

MongoDB account management

● Add user to test database
  use test
  db.addUser("testrw", "1234")
● Add read-only user to test database
  db.addUser("testro", "1234", true)
● List users
  db.system.users.find()
● Remove user
  db.removeUser("testro")

MongoDB config - authentication

● keyFile
  ○ At least 6 characters, and smaller than 1KB in size
  ○ Used only for replica/sharding servers
  ○ Every replica/sharding server should use the same key file for communication
  ○ On Unix-like systems, the key file must grant no permissions to group/everyone, or MongoDB will refuse to start

MongoDB configuration - Replica Set

● replSet
  ○ Indicates the replica set name
  ○ All MongoDB instances in the same replica set should use the same name
  ○ Limitations
    ■ Maximum 12 nodes in a single replica set
    ■ Maximum 7 nodes can vote
  ○ A MongoDB replica set is eventually consistent

How's MongoDB replica set working?

● Each replica set has a single primary (master) node and multiple slave nodes

● Data is only written to the primary node and is then synced to the slave nodes.

● Use getLastError() to confirm that the previous write operation has been committed to the whole replica set; otherwise the write may be rolled back if the primary node goes down before the sync.

How's MongoDB replica set working?

● Once the primary node is down, the whole replica set is marked as failed and cannot accept operations until the other nodes vote and elect a new primary node.

● During failover, any write operation not committed to the whole replica set will be rolled back

Simple replica set configuration

mkdir -p /tmp/db01
mkdir -p /tmp/db02
mkdir -p /tmp/db03

mongod --replSet test --port 29001 --dbpath /tmp/db01
mongod --replSet test --port 29002 --dbpath /tmp/db02
mongod --replSet test --port 29003 --dbpath /tmp/db03

Simple replica set configuration

mongo localhost:29001

Another way to config replica set


Extra options for setting replica set

● arbiterOnly
  ○ Arbiter nodes don't receive data and can't become the primary node, but they can vote.
● priority
  ○ A node with priority 0 will never be elected as the primary node.
  ○ Nodes with higher priority are preferred as primary
  ○ To force a node to become primary, do not change its votes; raise its priority value and reconfigure the replica set.
● buildIndexes
  ○ Can only be set to false on nodes with priority 0
  ○ Use false for backup-only nodes

Extra options for setting replica set

● hidden
  ○ Nodes marked as hidden are not exposed to MongoDB clients.
  ○ Nodes marked as hidden do not receive queries.
  ○ Only use this option for nodes used for reporting, integration, backup, etc.
● slaveDelay
  ○ How many seconds a slave node should stay behind the primary node
  ○ Can only be set on nodes with priority 0
  ○ Used to protect against some human errors

Extra options for setting replica set

● votes
  If set to 1, this node can vote; otherwise it cannot.

Change primary node at runtime

config = rs.conf()
config.members[1].priority = 2
rs.reconfig(config)

What is sharding?

Name    Value
Alice   value
Amy     value
Bob     value
:       value
:       value
:       value
:       value
Yoko    value
Zeus    value

Split by name range into shards:

A to F   value
G to N   value
O to Z   value
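The table above splits one collection into three name ranges. A minimal sketch of that routing idea follows, assuming each shard owns a closed range of first letters; the `shards` list and `shardFor` are hypothetical names for illustration and do not reflect how mongos stores chunk boundaries internally.

```javascript
// Each shard owns a closed range of first letters, as in the table above.
const shards = [
  {name: "A to F", from: "A", to: "F"},
  {name: "G to N", from: "G", to: "N"},
  {name: "O to Z", from: "O", to: "Z"},
];

// Route a name to the shard whose letter range contains its first letter.
function shardFor(name) {
  const initial = name.charAt(0).toUpperCase();
  const shard = shards.find((s) => initial >= s.from && initial <= s.to);
  return shard ? shard.name : null; // null when no range covers the key
}

for (const name of ["Alice", "Amy", "Bob", "Yoko", "Zeus"]) {
  console.log(name, "->", shardFor(name));
}
```

With these sample names, three land on the first shard and two on the last while the middle shard stays empty, which shows why the choice of key and ranges matters for balance.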

MongoDB sharding architecture

Elements of MongoDB sharding cluster

● Config Server
  Stores the sharding cluster metadata
● mongos Router
  Routes database operations to the correct shard server
● Shard Server
  Holds the real user data

Sharding config - config server

● A config server is a MongoDB instance that runs with the --configsvr option

● Config servers are synchronized automatically by the mongos process, so DO NOT run them with the --replSet option

● The synchronous replication protocol is optimized for three machines.

Sharding config - mongos Router

● Use mongos (not mongod) to start a mongos router

● mongos routes database operations to the correct shard servers

● Example command for starting mongos
  mongos --configdb db01,db02,db03

● With the --chunkSize option, you can specify a smaller sharding chunk if you're just testing.

Sharding config - shard server

● A shard server is a MongoDB instance that runs with the --shardsvr option

● Shard servers don't need to know where the config server / mongos router is

Example script for building MongoDB shard cluster

mkdir -p /tmp/s00
mkdir -p /tmp/s01
mkdir -p /tmp/s02
mkdir -p /tmp/s03

mongod --configsvr --port 29000 --dbpath /tmp/s00
mongos --configdb localhost:29000 --chunkSize 1 --port 28000
mongod --shardsvr --port 29001 --dbpath /tmp/s01
mongod --shardsvr --port 29002 --dbpath /tmp/s02
mongod --shardsvr --port 29003 --dbpath /tmp/s03

Sharding config - add shard server

mongo localhost:28000/admin

db.runCommand({addshard: "localhost:29001"})
db.runCommand({addshard: "localhost:29002"})
db.runCommand({addshard: "localhost:29003"})

db.printShardingStatus()
db.runCommand({enablesharding: "test"})
db.runCommand({
    shardcollection: "test.shardtest",
    key: {_id: 1},
    unique: true
})

Let us insert some documents

use test

for (i=0; i<1000000; i++) {
    db.shardtest.insert({value: i});
}


Remove 1 shard & see what happens

use admin
db.runCommand({removeshard: "shard0002"})

Let's add it back

db.runCommand({addshard: "localhost:29003"})

Pick your sharding key wisely

● The sharding key cannot be changed after sharding is enabled

● To update any document in a sharded cluster, the sharding key MUST BE INCLUDED in the find spec

EX: sharding key = {name: 1, class: 1}
db.xxxx.update(
    {name: "xxxx", class: "ooo"},
    {..... update spec}
)

Pick your sharding key wisely

● The sharding key strongly affects your data distribution model

EX: sharding by ObjectId
shard001 => data saved 2 months ago
shard002 => data saved 1 month ago
shard003 => data saved recently

Other sharding key examples

EX: sharding by Username
shard001 => Username starts with a to k
shard002 => Username starts with l to r
shard003 => Username starts with s to z

EX: sharding by md5
completely random distribution
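To see why an md5-based key spreads data evenly, the sketch below hashes usernames with Node's built-in `crypto` module and maps each digest to one of three buckets. `shardByMd5` and the generated usernames are hypothetical; this models the idea of using an md5 value as the sharding key and is not how mongos assigns chunks.

```javascript
const crypto = require("crypto");

// Map a username to a bucket using the first 4 bytes of its md5 digest.
function shardByMd5(username, shardCount) {
  const digest = crypto.createHash("md5").update(username).digest();
  return digest.readUInt32BE(0) % shardCount;
}

// Count how 3000 sequential usernames spread over 3 buckets.
const counts = [0, 0, 0];
for (let i = 0; i < 3000; i++) {
  counts[shardByMd5("user" + i, counts.length)] += 1;
}
console.log(counts);
```

Sequential names such as user0, user1, user2 would all land on the same range with a username key, while their md5 values scatter across buckets. The trade-off is that range queries on the original username can no longer be served by a single shard.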

What is Mapreduce?

● Map, then Reduce

● Map calls a function on each document to emit keys & values, which are sent to the reduce function

● Reduce calls a function that combines the keys & values emitted by the map function into a single reduced result.

● Example: map the students' grades and reduce them into the students' total grades.

How to call mapreduce in MongoDB

db.xxxx.mapReduce(
    map function,
    reduce function,
    {
        out: output option,
        query: query filter, optional,
        sort: sort filter, optional,
        finalize: finalize function,
        .... etc
    }
)


Let's generate some data

for (i=0; i<10000; i++){
    db.grades.insert({
        grades: {
            math: Math.random() * 100 % 100,
            art: Math.random() * 100 % 100,
            music: Math.random() * 100 % 100
        }
    });
}



Prepare Map function

function map(){
    for (k in this.grades){
        emit(k, {
            total: 1,
            pass: this.grades[k] >= 60.0 ? 1 : 0,
            fail: this.grades[k] < 60.0 ? 1 : 0,
            sum: this.grades[k],
            avg: 0
        });
    }
}


Prepare reduce function

function reduce(key, values){
    result = {total: 0, pass: 0, fail: 0, sum: 0, avg: 0};
    values.forEach(function(value){
        result.total += value.total;
        result.pass += value.pass;
        result.fail += value.fail;
        result.sum += value.sum;
    });
    return result;
}


Execute your 1st mapreduce call

db.grades.mapReduce(map, reduce, {out: {inline: 1}})


Add finalize function

function finalize(key, value){
    value.avg = value.sum / value.total;
    return value;
}


Run mapreduce again with finalize

db.grades.mapReduce(map, reduce, {out: {inline: 1}, finalize: finalize})
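The whole map, reduce, finalize flow can be tried without a server. The sketch below is a plain JavaScript model, not the MongoDB implementation: `runMapReduce` is a hypothetical driver, and `emit` is passed to the map function as an argument here, whereas the mongo shell provides it as a global. The three sample documents are made up for illustration.

```javascript
// Tiny in-memory model of mapReduce with an optional finalize step.
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc, emit)); // "this" is the document
  const results = {};
  for (const [key, values] of groups) {
    // Like MongoDB, skip reduce when a key has only one emitted value.
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    results[key] = finalize ? finalize(key, reduced) : reduced;
  }
  return results;
}

function map(emit) {
  for (const k in this.grades) {
    const g = this.grades[k];
    emit(k, {total: 1, pass: g >= 60 ? 1 : 0, fail: g < 60 ? 1 : 0, sum: g, avg: 0});
  }
}

function reduce(key, values) {
  const result = {total: 0, pass: 0, fail: 0, sum: 0, avg: 0};
  values.forEach((v) => {
    result.total += v.total;
    result.pass += v.pass;
    result.fail += v.fail;
    result.sum += v.sum;
  });
  return result;
}

function finalize(key, value) {
  value.avg = value.sum / value.total;
  return value;
}

const docs = [
  {grades: {math: 80, art: 50}},
  {grades: {math: 40, art: 70}},
  {grades: {math: 90}},
];
const out = runMapReduce(docs, map, reduce, finalize);
console.log(out);
```

Note that the reduce output has the same shape as each emitted value. That matters because MongoDB may call reduce repeatedly on partial results, so the function has to accept its own output as input.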


Mapreduce output options

● {replace: <result collection name>}
  Replaces the result collection if it already exists.

● {merge: <result collection name>}
  Merges into the existing result collection; existing keys are always overwritten with the new results.

● {reduce: <result collection name>}
  Runs reduce if the same key exists in both the old and the current result collections. Also runs the finalize function, if any.

● {inline: 1}
  Returns the result in memory

Other mapreduce output options

● db - put the result collection in a different database

● sharded - the output collection will be sharded using key = _id

● nonAtomic - partial reduce results will be visible while processing.

MongoDB backup & restore

● mongodump
  mongodump -h localhost:27017

● mongorestore
  mongorestore -h localhost:27017 --drop

● mongoexport
  mongoexport -d test -c students -h localhost:27017 > students.json

● mongoimport
  mongoimport -d test -c students -h localhost:27017 < students.json

Conclusion - Pros of MongoDB

● Agile (schemaless)
● Easy to use
● Built-in replication & sharding
● Mapreduce with sharding

Conclusion - Cons of MongoDB

● Schemaless = everyone needs to know what the data looks like
● Wastes space on keys
● Eats lots of memory
● Mapreduce is hard to handle

Cautions of MongoDB

● Global write lock
  ○ Add more RAM
  ○ Use a newer version (MongoDB 2.2 now has a DB-level write lock)
  ○ Split your database properly
● Removing documents won't free disk space
  ○ You need to run the compact command periodically
● Don't let your MongoDB data disk fill up
  ○ Once the disk used by MongoDB is full, you won't be able to move/delete documents in it.
