CouchBase The Complete NoSql Solution for Big Data - Debajani Mohanty

Upload: debajani-mohanty-scea-pmp

Post on 16-Aug-2015


TRANSCRIPT

Page 1: CouchBase The Complete NoSql Solution for Big Data

CouchBase The Complete NoSql Solution for Big Data

- Debajani Mohanty

CAP Theorem

Before we get into big data and the role of NoSQL, we must first understand the CAP theorem. In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

1. Consistency (all nodes see the same data at the same time)

2. Availability (a guarantee that every request receives a response about whether it succeeded or failed)

3. Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Although all three cannot be guaranteed at the same time, a system can provide any two of them. For example, to get high availability and partition tolerance, you need to sacrifice strict consistency.

The 5 Vs of Big Data

• Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.

• Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. 

• We currently only see the beginnings of a transformation into a big data economy. Any business that doesn’t seriously consider the implications of Big Data runs the risk of being left behind.

• To get a better understanding of what Big Data is, it is often described using 5 Vs: Volume, Velocity, Variety, Veracity and Value.

Volume

Volume refers to the vast amounts of data generated every second. We are not talking terabytes but zettabytes or brontobytes. If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute. This makes most data sets too large to store and analyse using traditional database technology. New big data tools use distributed systems so that we can store and analyse data across databases located anywhere in the world.

Variety

Variety refers to the different types of data we can now use. In the past we focused only on structured data that fitted neatly into tables or relational databases, such as financial data. In fact, by a commonly cited estimate, 80% of the world’s data is unstructured (text, images, video, voice, etc.). With big data technology we can now analyse and bring together data of different types such as messages, social media conversations, photos, sensor data, video or voice recordings.

Velocity

Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds. Technology now allows us to analyse the data while it is being generated (sometimes referred to as in-memory analytics), without ever putting it into databases.

Veracity & Value

Veracity refers to the truthfulness and correctness of the data.

Value: having access to big data is no good unless we can turn it into value. Companies are starting to generate substantial value from their big data.

Big Data and Human Brain

To understand how a big data solution could be architected, let’s try to understand how the human brain is architected.

So the key is parallel processing. Hurray!

Hadoop & MapReduce

• In 2004, Google published a paper on a process called MapReduce that used such an architecture.

• The MapReduce framework provides a parallel processing model and associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open source project named Hadoop.

• But MapReduce on Hadoop only addresses batch processing of the data. How can we store this much data and serve it to applications?
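The Map and Reduce steps described above can be illustrated with a small word count. This is only a single-machine sketch using Java parallel streams, not Hadoop itself; the class name `WordCountSketch` and the sample input are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Map step: every chunk is split into words independently (on parallel threads).
    // Reduce step: occurrences of the same word are gathered and counted.
    static Map<String, Long> wordCount(List<String> chunks) {
        return chunks.parallelStream()
                .flatMap(chunk -> Arrays.stream(chunk.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Each string stands in for a block of input handled by a different node.
        List<String> chunks = List.of("big data big value", "data moves fast", "big clusters");
        System.out.println(wordCount(chunks));
    }
}
```

In a real cluster the chunks live on different machines and the framework moves the intermediate results between them; the shape of the computation is the same.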

NoSQL Database

• A NoSQL (often interpreted as Not only SQL) database, often used in big data-centric real-time web applications, provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

• Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, graph, or document) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases.

• The particular suitability of a given NoSQL database depends on the problem it must solve.

Types of NoSQL databases

• There have been various approaches to classify NoSQL databases, each with different categories and subcategories. Because of the variety of approaches and overlaps it is difficult to get and maintain an overview of non-relational databases. Nevertheless, a basic classification is based on data model. A few examples in each category are:

• Column: Accumulo, Cassandra, Druid, HBase, Vertica

• Document: Lotus Notes, Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB, Qizx

• Key-value: CouchDB, Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS

• Graph: Allegro, Neo4j, InfiniteGraph, OrientDB, Virtuoso, Stardog

• Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB

Graph Database

• This kind of database is designed for data whose relations are well represented as a graph (elements interconnected with an undetermined number of relations between them). The kind of data could be social relations, public transport links, road maps or network topologies.

Key-value stores

• In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. The key-value model is one of the simplest non-trivial data models, and richer data models are often implemented on top of it. 
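As a rough illustration of the key-value model, the sketch below wraps a plain Java map. The class `KeyValueSketch` and the sample keys are hypothetical and are not part of any NoSQL product's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class KeyValueSketch {
    private final Map<String, String> pairs = new HashMap<>();

    // Writing an existing key replaces its value, so a key never appears twice.
    void put(String key, String value) {
        pairs.put(key, value);
    }

    Optional<String> get(String key) {
        return Optional.ofNullable(pairs.get(key));
    }

    int size() {
        return pairs.size();
    }

    public static void main(String[] args) {
        KeyValueSketch store = new KeyValueSketch();
        store.put("user:1", "{\"firstname\":\"John\"}");
        store.put("user:1", "{\"firstname\":\"Jane\"}"); // same key: the value is replaced
        System.out.println(store.size() + " " + store.get("user:1").orElse("missing"));
    }
}
```

The value here happens to be a JSON string, which hints at how a richer model such as a document store can be layered on top of key-value storage.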

Document-oriented databases

• The central concept of a document store is the notion of a "document". While each document-oriented database implementation differs on the details of this definition, in general, they all assume that documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, JSON as well as binary forms like BSON.

• Two of the most widely used NoSQL solutions are MongoDB and Couchbase, and both of them are document-oriented databases.

• Here is a sample document:

{
  "_id": "5897g42s0245afo4o473ai1e7",
  "firstname": "John",
  "lastname": "Doe",
  "age": 26,
  "sex": "M",
  "interests": ["Reading", "Running", "Hacking"]
}

MongoDB vs CouchBase

Results Analysis

Another Analysis

Scalability

• In Couchbase, you can easily add servers to a cluster and obtain a distributed system, and Couchbase is flexible enough to do this without downtime. Indeed, it relies on the power of the Erlang language, a functional and fault-tolerant language that supports hot code changes.

• For MongoDB, the configuration is a bit more complicated. For example, once you have defined the shard key (the key to distribute documents within a sharded cluster), it becomes difficult to change it afterwards. The system is not as flexible, so you have to think carefully about your data modeling before you move your application into production.

• Scalability is why Couchbase is widely used in social gaming, where millions of players can play and their numbers can increase exponentially overnight.
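To make the idea of distributing documents by key more concrete, here is a simplified sketch of hash-based partitioning. The partition count, the use of CRC32 and the node names are assumptions chosen for illustration; this is not the exact algorithm of Couchbase or MongoDB.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

public class KeyPartitionSketch {
    // Assumed number of logical partitions; real systems pick their own value.
    static final int PARTITIONS = 1024;

    // Hash the document key to a fixed logical partition.
    static int partitionFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % PARTITIONS);
    }

    // Naive partition-to-node assignment, for illustration only.
    static String nodeFor(String key, List<String> nodes) {
        return nodes.get(partitionFor(key) % nodes.size());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-a", "node-b", "node-c");
        for (String key : List.of("user:1", "user:2", "user:3")) {
            System.out.println(key + " -> partition " + partitionFor(key)
                    + " on " + nodeFor(key, nodes));
        }
    }
}
```

Because a key always maps to the same logical partition, a cluster can move whole partitions between servers when nodes are added, rather than recomputing the placement of every document.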

Monitoring tool

Couchbase ships with a turnkey monitoring package, while MongoDB requires an additional subscription to a monitoring service. You can monitor MongoDB using the command line, but a monitoring tool without a graphical interface is relatively restrictive.

Introducing CouchBase

• Couchbase positions its product as the world’s most complete, most scalable and best performing NoSQL database.

• It is based on a shared-nothing architecture, a single node type, a built-in caching layer and true auto-sharding. It also offers what Couchbase calls the world’s first NoSQL mobile offering: Couchbase Mobile, a complete NoSQL mobile solution comprising Couchbase Server, Couchbase Sync Gateway and Couchbase Lite.

• Clients: AT&T, Amadeus, Bally’s, Beats Music, Cisco, Comcast, Concur, Disney, eBay / PayPal, Neiman Marcus, Orbitz, Rakuten / Viber, Sky, Tencent, Tesco, Verizon and Willis Group, as well as hundreds of other household names worldwide

Real life Use Cases

Couchbase Server’s distinguishing combination is 1) linear, horizontal scalability, 2) sustained low latency and high throughput performance, and 3) the extensibility of the system.

A few use cases:

• Session store: User sessions are easily stored and managed in Couchbase, for instance, by using the document ID naming scheme “user:USERID”. With Couchbase Server, you can flag items for deletion after a certain amount of time, and therefore you have the option of having Couchbase automatically delete old sessions.

• Social gaming: You can model and store game state, property state, time lines, conversations and chats with Couchbase Server. The asynchronous persistence algorithms of Couchbase were designed, built and deployed to support some of the highest scale social games.

• Ad, offer, and content targeting: The same attributes which serve Couchbase in the gaming context also apply well for real-time ad and content targeting. For example, Couchbase provides a fast storage capability for counters. Counters are useful for tracking visits, associating users with various targeting profiles, tracking ad-offers, and for tracking ad-inventory.
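The session store use case above can be sketched with an in-memory stand-in. The class `SessionStoreSketch` is hypothetical and does not use the Couchbase SDK; it only illustrates the “user:USERID” key scheme and expiry after a time to live. The clock is passed in as a parameter so the behaviour is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionStoreSketch {
    // A stored session plus the moment (in milliseconds) after which it is stale.
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> items = new ConcurrentHashMap<>();

    // Document ID naming scheme from the slide: "user:USERID".
    static String sessionKey(String userId) {
        return "user:" + userId;
    }

    void put(String userId, String sessionJson, long ttlSeconds, long nowMillis) {
        items.put(sessionKey(userId), new Entry(sessionJson, nowMillis + ttlSeconds * 1000));
    }

    // Returns null for unknown or expired sessions; expired entries are dropped.
    String get(String userId, long nowMillis) {
        String key = sessionKey(userId);
        Entry entry = items.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            items.remove(key, entry);
            return null;
        }
        return entry.value;
    }

    public static void main(String[] args) {
        SessionStoreSketch store = new SessionStoreSketch();
        long now = System.currentTimeMillis();
        store.put("42", "{\"cart\":[]}", 30, now);
        System.out.println(store.get("42", now + 1_000));   // still valid
        System.out.println(store.get("42", now + 31_000));  // expired, so null
    }
}
```

With a real server the expiry would be set on the stored item and enforced by the database rather than by application code.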

Buckets

• Couchbase Server stores all of your application data either in RAM or on disk. The data containers used in Couchbase Server are called buckets; there are two bucket types in Couchbase, which reflect the two types of data storage that we use in Couchbase Server. Buckets also serve as namespaces for documents and are used to look up a document by key:

• Couchbase Buckets

• Memcached Buckets

• You can customize the properties of each bucket, within limits, using the Couchbase Admin Console, the Couchbase Command Line Interface (CLI), or the Couchbase REST Admin API. Quotas for RAM and disk space can be configured per bucket so you can manage usage across a cluster.

• Couchbase Server is best suited for fast-changing data items of relatively small size. For in-memory storage using Memcached buckets, the memcached standard 1 megabyte limit applies to each value. Items suitable for storage include shopping carts, user profiles, user sessions, time lines, game states, pages, conversations and product catalogs. Items that are less suitable include large audio or video media files.

• On that note, some Couchbase SDKs offer the additional feature of optionally compressing/decompressing objects stored in Couchbase. The CPU-time versus space trade-off should be considered here.

Couchbase Buckets

• Couchbase Buckets: provide data persistence and data replication. Data stored in Couchbase Buckets is highly-available and reconfigurable without server downtime. They can survive node failures and restore data plus allow cluster reconfiguration while still fulfilling service requests. The main features are:

– Supports items up to 20MB in size.

– Persistence, including data sets that are larger than the allocated memory size for a bucket. You can configure persistence per bucket and Couchbase Server will persist data asynchronously from RAM to disk.

– Fully supports replication and server rebalancing. You can configure one or more replica servers for a Couchbase bucket. If a node fails, a replica node can be promoted to be the host node.

– Full range of statistics supported.

Memcached Buckets

• Memcached Buckets: provide in-memory document storage. Memcached buckets cache frequently used data in memory, thereby reducing the number of queries a database server must perform in response to web application requests. Memcached buckets can work alongside relational database technology, not only NoSQL databases.

– Item size limited to 1 MByte.

– No persistence.

– No replication; no rebalancing.

– Statistics about Memcached Buckets are on RAM usage and client-side operations.

Keys & Metadata

• Everything you store in Couchbase Server is a document with a key and a value. The key is a unique identifier for the document. The value is either a JSON document or, if you choose, a byte stream, a data type, or another form of serialized object.

• Keys are also known as document IDs and serve the same function as a SQL primary key. A key in Couchbase Server can be any string and is unique.

• By default, all documents contain metadata that is provided by the Couchbase Server. The metadata is stored with the document and is used to change how the document is handled.

• CAS Value: Also called CAS token or CAS ID, this value is a unique identifier associated with a document. Couchbase Server verifies it before a document is deleted or changed, which provides a basic form of optimistic concurrency. When Couchbase Server checks a CAS value before changing data, it effectively prevents data loss without having to lock records. Couchbase Server prevents a document from being altered by an operation if another process has altered the document, and therefore its CAS value, in the meantime.

• Time to Live (TTL): This is an expiration for a document, typically specified in seconds. By default, any document created in Couchbase Server without a TTL has an indefinite life span and remains in Couchbase Server until an explicit delete call from a client removes it. Couchbase Server deletes values during regular maintenance if the TTL for an item has expired. Note: The expiration value deletes information from the entire database. It has no effect on when the information is removed from the RAM caching layer.

• Flags: These are SDK-specific flags which are used to provide a variety of options during storage, retrieval, update, and removal of documents. Typically flags are optional metadata used by a Couchbase client library to perform additional processing of a document. One example is a flag specifying that a document be formatted in a specific way before it is stored.
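The CAS check described above can be sketched with a small in-memory store. This is an illustration of optimistic concurrency in general, not the Couchbase implementation or SDK; the class `CasSketch`, its method names and the counter-based CAS values are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class CasSketch {
    // A document value together with the CAS value it had when it was read.
    static final class Versioned {
        final String value;
        final long cas;

        Versioned(String value, long cas) {
            this.value = value;
            this.cas = cas;
        }
    }

    private final Map<String, Versioned> docs = new HashMap<>();
    private long nextCas = 1;

    // Unconditional write: stores the value and hands out a fresh CAS value.
    synchronized long set(String key, String value) {
        long cas = nextCas++;
        docs.put(key, new Versioned(value, cas));
        return cas;
    }

    synchronized Versioned get(String key) {
        return docs.get(key);
    }

    // Succeeds only if the document still carries the CAS value the caller read.
    synchronized boolean replaceIfUnchanged(String key, String newValue, long expectedCas) {
        Versioned current = docs.get(key);
        if (current == null || current.cas != expectedCas) {
            return false; // someone else changed the document in the meantime
        }
        docs.put(key, new Versioned(newValue, nextCas++));
        return true;
    }

    public static void main(String[] args) {
        CasSketch store = new CasSketch();
        store.set("user:42", "{\"visits\":1}");
        Versioned readByA = store.get("user:42");
        Versioned readByB = store.get("user:42");
        boolean aWins = store.replaceIfUnchanged("user:42", "{\"visits\":2}", readByA.cas);
        boolean bWins = store.replaceIfUnchanged("user:42", "{\"visits\":2}", readByB.cas);
        System.out.println(aWins + " " + bWins); // prints "true false"
    }
}
```

The second writer is rejected because its CAS value is stale; it would normally re-read the document and retry, which avoids lost updates without locking records.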

Creating First Application

Components for your development environment:

• Couchbase Server: installed on a virtual or physical machine separate from the machine containing your web application server. Download the appropriate version for your environment here http://www.couchbase.com/download

• Couchbase SDK: installed for runtime on the machine containing your web application server. You will also need to make the SDKs available in your development environment in order to compile/interpret your client-side code. The SDKs are programming-language and platform-specific. You will use your SDK to communicate with the Couchbase Server from your web application. Downloads for your chosen SDK are here: http://www.couchbase.com/develop

• Couchbase Admin Console: administering your Couchbase Server is done via the Couchbase Admin Console, a web application viewable in most modern browsers. Your development environment should therefore have the latest version of Mozilla Firefox 3.6+, Apple Safari 5+, Google Chrome 11, or Internet Explorer 8, or higher. You should set your browser preference to be JavaScript enabled.

The development languages supported by the Couchbase Client SDK Libraries are Java, .NET, PHP, Ruby, C

Connecting A Bucket

• After you have your Couchbase Server up and running, and your chosen Couchbase Client libraries installed on a web server, you create the code that connects to the server from the client.

Make a new bucket request to the REST endpoint for buckets and provide the new bucket settings as request parameters:

shell> curl -u Administrator:password \
    -d name=newBucket -d ramQuotaMB=100 -d authType=none \
    -d replicaNumber=1 -d proxyPort=11215 http://localhost:8091/pools/default/buckets
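The same request could be assembled from Java. The following is a minimal sketch using the JDK's built-in HTTP client types (Java 11 or later); it reuses the endpoint, credentials and parameters from the curl command above, while the class and method names are made up. It only builds the request; the actual send call is left as a comment because it needs a running server.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CreateBucketRequestSketch {
    // Encode the bucket settings as an application/x-www-form-urlencoded body,
    // which is what curl's -d options produce.
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build a POST with HTTP basic authentication, the equivalent of curl -u.
    static HttpRequest build(String user, String password, Map<String, String> params) {
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create("http://localhost:8091/pools/default/buckets"))
                .header("Authorization", "Basic " + credentials)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "newBucket");
        params.put("ramQuotaMB", "100");
        params.put("authType", "none");
        params.put("replicaNumber", "1");
        params.put("proxyPort", "11215");

        HttpRequest request = build("Administrator", "password", params);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formBody(params));
        // To send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```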

Connecting to Couchbase Server

The following are the basic steps for creating a connection:

• Include, import, link, or require Couchbase SDK libraries into your program files. In the example that follows, we require 'couchbase'.

• Provide connection information for the Couchbase cluster. Typically this includes a URI, a bucket ID, a password and optional parameters, and it can be provided as a list or string. To avoid a failure on the initial connection, you should provide at least two URLs for two different nodes. In the following example, we provide connection information as "http://<host>:<port>/pools". In this case there is no password required.

• Create an instance of a Couchbase client object. In the example that follows, we create a new client instance in the client = Couchbase.connect statement.

• Perform any database operations for your applications, such as read, write, delete, or query.

• If needed, destroy the client, and therefore disconnect.

Connecting to Couchbase Server (continued)

• The Java example below demonstrates why it is safest to provide at least two possible node URIs when creating the initial connection with the server. This way, if your application attempts to connect but one node is down, the client automatically re-attempts to connect using the second node URL:

import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import com.couchbase.client.CouchbaseClient;

// Set up at least two URIs in case one server fails.
// Replace each <host> placeholder with the hostname of a different node.
List<URI> servers = new ArrayList<URI>();
servers.add(URI.create("http://<host>:8091/pools"));
servers.add(URI.create("http://<host>:8091/pools"));

// Create a client talking to the default bucket (empty password)
CouchbaseClient cbc = new CouchbaseClient(servers, "default", "");

// Read the value stored under the key "thisname" and print it
System.err.println(cbc.get("thisname") + " is off developing with Couchbase!");