NoSQL & MongoDB
Rutgers University Databases Course

Richard M Kreuter
10gen Inc.

[email protected]

November 29, 2011

Why NoSQL is here — CAP Theorem

Consistency (better termed atomicity): every node in a distributed database has the same value for the same field.

Availability: whether the system is usable (for reads/writes).

Partition tolerance: what happens when nodes can’t reach each other?

Eric Brewer’s CAP Theorem (ca. 2000): of C, A, P, pick two.


Why NoSQL is here – Relational Doesn’t Scale (At Cost)

When RDBMSes got started in the 70s, today’s internet-scale applications weren’t even dreamt of (probably).

RDBMSes are strongly consistent at the expense of availability or partition tolerance (in CAP theorem terms).

Scaling them up is possible, but superlinearly expensive.

Scaling out is hard to impossible: in general, it needs network-wide JOINs for reads and distributed transactions for writes.


NoSQL in abstract

By giving up some aspects of ACID (atomicity, consistency, isolation, durability), you gain the advantages of BASE (basically available, soft state, eventual consistency).


Working with non-relational DBMSes: living without JOINs

In general, non-relational databases dispense with JOIN-type operations, because they’re just not practical on large data sets.

In practice, this is less of a limitation than you might expect: many applications’ read/write patterns are fairly limited/predictable.


Working with non-relational DBMSes: denormalization/caching

Consider the problem of showing a Twitter user all new tweets by people the user is following.

It’s not efficient to recompute this query every time the user checks for new tweets.

It’s also not important to enforce consistency across different users’ feeds at any moment.

(This problem turns out to be ubiquitous: ad networks, dating sites, and social media networks are all about this puzzle.)
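One common way to avoid recomputing the feed on every read is to denormalize: copy each new tweet into every follower's precomputed feed at write time. The following is a minimal in-memory sketch of that idea in plain JavaScript; the names follow, postTweet and readFeed are hypothetical and do not come from Twitter or MongoDB.

```javascript
// Hypothetical in-memory sketch of denormalized feeds (fan-out on write).
const followers = new Map(); // author -> Set of users following that author
const feeds = new Map();     // user -> precomputed array of tweets

function follow(user, author) {
  if (!followers.has(author)) followers.set(author, new Set());
  followers.get(author).add(user);
}

// On write: copy the tweet into every follower's feed.
function postTweet(author, text) {
  const tweet = { author, text };
  for (const user of followers.get(author) || []) {
    if (!feeds.has(user)) feeds.set(user, []);
    feeds.get(user).push(tweet);
  }
}

// On read: return the precomputed feed, with no JOIN-like recomputation.
function readFeed(user) {
  return feeds.get(user) || [];
}

follow("alice", "kreuter");
follow("bob", "kreuter");
postTweet("kreuter", "hello");
console.log(readFeed("alice"));
```

The trade-off is the one the slide describes: reads become cheap, and two users' feeds may briefly disagree while a write is still being copied around.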


Working with non-relational DBMSes: model the problem

Traditional data modeling approaches (relational normal forms) organize data to support any imaginable program.

Non-relational DBMSes tend to encourage application-driven data modeling, i.e., designing data layouts that serve specific problems at hand.

Problems that require a more relational modeling are admittedly tricky in a non-relational DBMS.


Ontology of NoSQL systems

Inasmuch as NoSQL systems are defined simply as “not relational”, it’s handy to organize them along a couple of axes:

Data distribution: how do servers propagate data around (if at all)?

Data model: how do clients get to use the data?

Additionally, you can consider whether the system is “eventually consistent” or “strongly consistent”, whether it supports map/reduce, etc.


Distribution: GFS

Google, 2003.

Data is organized into chunks.

A GFS master knows where all the chunks reside.

Chunkservers handle chunk contents.

Chunks get replicated across chunkservers.


Model: Key/Value

Very simple: keys correspond to values (possibly with timestamps).

Simplified API: get(key), put(key, value), delete(key).
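The simplified API above fits in a few lines. This is a minimal in-memory sketch, assuming a single process and no persistence; the class name KeyValueStore and the timestamp field are illustrative and not taken from any particular key/value system.

```javascript
// Hypothetical in-memory key/value store exposing get, put and delete.
class KeyValueStore {
  constructor() {
    this.data = new Map(); // key -> { value, timestamp }
  }

  // Store a value together with the time it was written.
  put(key, value) {
    this.data.set(key, { value, timestamp: Date.now() });
  }

  // Return the stored value, or undefined if the key is absent.
  get(key) {
    const entry = this.data.get(key);
    return entry === undefined ? undefined : entry.value;
  }

  // Remove the key; returns true if something was removed.
  delete(key) {
    return this.data.delete(key);
  }
}

const store = new KeyValueStore();
store.put("user:kreuter", "Richard");
console.log(store.get("user:kreuter"));
```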


Model: BigTable

Keys are associated with some number of columns

Each column contains a datum and a timestamp

Columns are grouped into ColumnFamilies

Simplified API: get(key, column), put(key, column, value), delete(key, column).
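Compared with plain key/value, each key here maps to a set of columns, and each column carries a datum and a timestamp. The sketch below is an in-memory illustration only; the class name ColumnStore and the "family:qualifier" column naming are assumptions made for the example, not a real BigTable client API.

```javascript
// Hypothetical in-memory sketch of the BigTable-style data model.
class ColumnStore {
  constructor() {
    this.rows = new Map(); // key -> Map(column -> { value, timestamp })
  }

  put(key, column, value) {
    if (!this.rows.has(key)) this.rows.set(key, new Map());
    this.rows.get(key).set(column, { value, timestamp: Date.now() });
  }

  // Return the { value, timestamp } cell, or undefined if it is absent.
  get(key, column) {
    const row = this.rows.get(key);
    return row === undefined ? undefined : row.get(column);
  }

  // Delete a single column of a row; other columns are untouched.
  delete(key, column) {
    const row = this.rows.get(key);
    return row === undefined ? false : row.delete(column);
  }
}

const table = new ColumnStore();
// Assumed naming: the "contact" prefix plays the role of a ColumnFamily.
table.put("kreuter", "contact:shell", "/bin/bash");
table.put("kreuter", "contact:homedir", "/home/kreuter");
console.log(table.get("kreuter", "contact:shell").value);
```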


Model: Document-Oriented

Database stores documents (think trees).

Queries can look into subdocuments.

Secondary indexes are feasible, in general.

Simplified API: find(document-attributes), insert(document), remove(document-attributes).
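To make "queries can look into subdocuments" concrete, here is a minimal in-memory sketch of insert and find over tree-shaped documents. The class DocumentStore and the helper getPath are hypothetical names, and only exact matches on dotted attribute paths are handled.

```javascript
// Resolve a dotted path such as "addresses.city" inside a nested document.
function getPath(doc, path) {
  return path.split(".").reduce(
    (node, part) => (node !== null && typeof node === "object" ? node[part] : undefined),
    doc
  );
}

// Hypothetical in-memory document store (no indexes, no persistence).
class DocumentStore {
  constructor() {
    this.docs = [];
  }

  insert(doc) {
    this.docs.push(doc);
  }

  // Return every document whose attributes equal the requested values.
  find(attributes) {
    return this.docs.filter((doc) =>
      Object.entries(attributes).every(([path, expected]) => getPath(doc, path) === expected)
    );
  }
}

const users = new DocumentStore();
users.insert({ user: "kreuter", addresses: { city: "New York" } });
users.insert({ user: "guest", addresses: { city: "Boston" } });
console.log(users.find({ "addresses.city": "Boston" }));
```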


MongoDB

Document-oriented.

GFS data distribution model.

Analogous to MySQL in ease of use, feature set, etc.


Documents

Documents are arbitrarily nestable JSON dictionaries, with no predetermined schema.

{ user : "kreuter",
  addresses : { personal : "[email protected]",
                work : "[email protected]" },
  programs : [ "GNU Emacs", "TeX", "MongoDB",
               "Steel Bank Common Lisp" ] }

Dictionary keys may consist of any sequence of alphanumeric characters and underscores; keys with leading underscores are reserved.


Query Language

Queries are expressed in a JSON-like, “query by example” language.

db.users.findOne({ user : "kreuter" })

Queries can search “into” documents in nifty ways.

db.users.find({"addresses.work" : { $exists: 1 }})

db.users.find({programs : "TeX" })

“Dollarsign” operators exist for a variety of query functions: $exists, $gt, $lt, etc.
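The queries above can be read as small pattern documents that are checked against each stored document. The sketch below evaluates such a pattern in plain JavaScript, without a server. It is only an illustration of the semantics, not MongoDB's implementation: the function names are hypothetical, the logins field is invented for the example, and only $exists, $gt and $lt are supported.

```javascript
// Resolve a dotted path such as "addresses.work" inside a nested document.
function getPath(doc, path) {
  return path.split(".").reduce(
    (node, part) => (node !== null && typeof node === "object" ? node[part] : undefined),
    doc
  );
}

// Check one field value against a plain value or an operator document.
function matchesCondition(actual, condition) {
  const isOperatorDoc =
    condition !== null && typeof condition === "object" && !Array.isArray(condition);
  if (isOperatorDoc) {
    return Object.entries(condition).every(([op, arg]) => {
      switch (op) {
        case "$exists": return (actual !== undefined) === Boolean(arg);
        case "$gt": return actual > arg;
        case "$lt": return actual < arg;
        default: throw new Error("unsupported operator: " + op);
      }
    });
  }
  // A plain value matches an equal scalar, or any element of an array field.
  if (Array.isArray(actual)) return actual.includes(condition);
  return actual === condition;
}

// A document matches when every path in the query matches.
function matches(doc, query) {
  return Object.entries(query).every(([path, condition]) =>
    matchesCondition(getPath(doc, path), condition)
  );
}

const doc = {
  user: "kreuter",
  logins: 5, // invented field, used to show $gt and $lt
  addresses: { work: "work address" },
  programs: ["GNU Emacs", "TeX", "MongoDB"],
};
console.log(matches(doc, { "addresses.work": { $exists: 1 } }));
console.log(matches(doc, { programs: "TeX" }));
console.log(matches(doc, { logins: { $gt: 3, $lt: 10 } }));
```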


Data Manipulation Language

Documents are manipulated (inserted, updated, deleted) via operations invoked from the client, with a JSON-like minilanguage for specifying what to frob.

db.users.insert({ user : "kreuter", addresses : ... , programs : ... })

db.users.update({ user : "kreuter" }, { $set : { homedir : "/home/kreuter" } })

db.users.update({ user : "kreuter" }, { $set : { shell : "/bin/bash" } }, true)

db.users.remove({ user : "kreuter" })

Data manipulation operations are “fire and forget”; that is, they’re asynchronous. The database command getLastError() waits until the last pending operation finishes and reports any errors.
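The third argument in the second update call above requests an upsert: update the matching document if there is one, otherwise insert a new one. The following in-memory sketch illustrates that behaviour for $set only. The function update and its return strings are hypothetical, and the query is limited to top-level equality.

```javascript
// Hypothetical sketch of $set with an optional upsert, over a plain array.
function update(collection, query, modifier, upsert = false) {
  // Find the first document whose fields equal the query's fields.
  const target = collection.find((doc) =>
    Object.entries(query).every(([field, value]) => doc[field] === value)
  );
  if (target !== undefined) {
    Object.assign(target, modifier.$set); // set or overwrite the listed fields
    return "updated";
  }
  if (upsert) {
    // No match: create a document from the query fields plus the $set fields.
    collection.push({ ...query, ...modifier.$set });
    return "inserted";
  }
  return "no-op";
}

const users = [];
console.log(update(users, { user: "kreuter" }, { $set: { shell: "/bin/bash" } }, true));
console.log(update(users, { user: "kreuter" }, { $set: { homedir: "/home/kreuter" } }));
console.log(users);
```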

The mongo shell

MongoDB comes with a convenient shell, permitting interactive experimentation and development with the full power of a Turing-complete programming language (JavaScript).

~$ mongo
MongoDB shell version: 1.6.4
url: test
connecting to: test
type "help" for help
> db.users.findOne({ user : "kreuter" })
{ _id : ObjectId("decafbaddecafbaddecafbad"),
  user : "kreuter",
  addresses : { personal : "[email protected]",
                work : "[email protected]" },
  programs : [ "GNU Emacs", "TeX", "MongoDB",
               "Steel Bank Common Lisp" ] }

The _id attribute is implicitly created; it’s a GUID-like 12-byte value that’s a primary key for the document.

Indexing

The _id attribute (an ObjectId by default) is always indexed, and MongoDB supports secondary indexes, too.

// Create an index on the user attribute
db.users.ensureIndex({ user : 1 })
// Create a compound index on
// the user and homedir attributes
db.users.ensureIndex({ user : 1, homedir : 1 })
// Create an index on the programs
// attribute; will index all values in the list
db.users.ensureIndex({ programs : 1 })
// Create an index on personal addresses
db.users.ensureIndex({ "addresses.personal" : 1 })
// Create a unique index on the user attribute
db.users.ensureIndex({ user : 1 }, { unique : true })
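Two of the index flavors above are worth a closer look: an index on a list attribute gets one entry per list element, and a unique index rejects a second document with the same key. The sketch below models both with a Map in plain JavaScript. It is an illustration under simplifying assumptions (top-level fields only, no ordering), and the class SecondaryIndex is a hypothetical name, not how MongoDB stores its indexes.

```javascript
// Hypothetical in-memory sketch of a secondary index on one field.
class SecondaryIndex {
  constructor(field, options = {}) {
    this.field = field;
    this.unique = Boolean(options.unique);
    this.entries = new Map(); // indexed value -> array of documents
  }

  add(doc) {
    const value = doc[this.field];
    // A list attribute contributes one index entry per element.
    const keys = Array.isArray(value) ? value : [value];
    if (this.unique && keys.some((key) => this.entries.has(key))) {
      throw new Error("duplicate key for unique index on " + this.field);
    }
    for (const key of keys) {
      if (!this.entries.has(key)) this.entries.set(key, []);
      this.entries.get(key).push(doc);
    }
  }

  // Look up documents by indexed value without scanning the collection.
  lookup(key) {
    return this.entries.get(key) || [];
  }
}

const byProgram = new SecondaryIndex("programs");
byProgram.add({ user: "kreuter", programs: ["TeX", "MongoDB"] });
byProgram.add({ user: "guest", programs: ["TeX"] });
console.log(byProgram.lookup("TeX").map((doc) => doc.user));
```

With { unique: true } passed as the second constructor argument, adding a second document with an already indexed value throws instead of inserting.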


Drivers, Frameworks, Tools

10gen supports drivers for many popular languages: Ruby, Python, Perl, PHP, Java, C#, C++, C.

The MongoDB community has developed drivers for many more languages, including Erlang, Haskell, Clojure, Scala, PLT Scheme, Lua.

Folks have developed frameworks for mapping documents to objects in various language-specific ways (Java reflection, Ruby MOP, et al.).


Scalability


Replication

MongoDB supports automatic replication (mirroring).

Good for horizontal read scaling: clients can read from any of a number of slaves.

Recommended for failover, durability, backups (essentially all deployments).

Works well over wide area networks.
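The read-scaling point above comes with a caveat that a small model makes visible: a secondary only knows about writes it has already copied from the primary. The sketch below is a toy in-memory model of log-based replication; the classes Primary and Secondary and the explicit sync() call are assumptions made for the example, whereas a real deployment replicates continuously in the background.

```javascript
// Hypothetical toy model: a primary records writes in an operation log.
class Primary {
  constructor() {
    this.data = new Map();
    this.oplog = []; // ordered list of { key, value } writes
  }

  write(key, value) {
    this.data.set(key, value);
    this.oplog.push({ key, value });
  }
}

// A secondary replays the primary's log from where it last stopped.
class Secondary {
  constructor(primary) {
    this.primary = primary;
    this.data = new Map();
    this.applied = 0; // number of log entries already replayed
  }

  sync() {
    while (this.applied < this.primary.oplog.length) {
      const { key, value } = this.primary.oplog[this.applied];
      this.data.set(key, value);
      this.applied += 1;
    }
  }

  // Reads are served locally and may lag behind the primary.
  read(key) {
    return this.data.get(key);
  }
}

const primary = new Primary();
const replica = new Secondary(primary);
primary.write("shell", "/bin/bash");
console.log(replica.read("shell")); // not replicated yet
replica.sync();
console.log(replica.read("shell"));
```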


Replica set replication

[Figure: a replica set with a primary (write server) and secondaries (read replicas); a later state is labeled with an old primary and a new primary.]


Sharding

Sharding partitions databases into subsets based on ranges of the shard key.

With sharding, writes can scale horizontally: document writes are routed to a shard based on shard key.

With sharding, database working set capacity can grow straightforwardly by adding more hardware to the deployment.

Automatic load balancing in the background allows dynamic shard addition: if data are unevenly distributed in the shard key space, MongoDB will rebalance data among shards to approach an even distribution.

Queries that span shards can parallelize under certain circumstances, which can improve read scalability.

Sharding is range-based, and so supports range queries effectively.

(Essentially the GFS data distribution model.)
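Range-based routing can be sketched as a lookup in a table of key ranges. In the example below the chunk boundaries, the shard names and the functions shardFor and shardsForRange are all hypothetical; the point is only that a single key maps to one shard, and that a range query needs to visit just the shards whose ranges overlap it.

```javascript
// Hypothetical chunk table: each chunk owns the half-open range [min, max).
// A max of null means "no upper bound".
const chunks = [
  { min: "", max: "h", shard: "shard0" },
  { min: "h", max: "q", shard: "shard1" },
  { min: "q", max: null, shard: "shard2" },
];

// Route a single shard key value to the shard that owns it.
function shardFor(key) {
  const chunk = chunks.find(
    (c) => key >= c.min && (c.max === null || key < c.max)
  );
  return chunk.shard;
}

// List the shards that a range query over [low, high) has to visit.
function shardsForRange(low, high) {
  const hit = chunks.filter(
    (c) => (c.max === null || low < c.max) && high > c.min
  );
  return [...new Set(hit.map((c) => c.shard))];
}

console.log(shardFor("kreuter"));
console.log(shardsForRange("g", "k"));
```

Rebalancing, in this picture, amounts to changing which shard a chunk's entry points to, or splitting a chunk into two smaller ranges.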


Sharding (continued)

[Figure: a sharded deployment showing shard servers (mongos), secondaries (read replicas), and configuration server(s).]


So give MongoDB a try!

www.mongodb.org — downloads, docs, community.

[email protected] — mailing list.

#mongodb — irc.freenode.net.

try.mongodb.org — web-based shell.

10gen is hiring! Email [email protected].

10gen offers support, training, and advising services for MongoDB.