cs-580k/480k advanced topics in cloud computing › ~huilu › slides580ksp20 › nosql.pdf · x is...

33
CS-580K/480K Advanced Topics in Cloud Computing Chapter V 15. NoSQL Database 1

Upload: others

Post on 27-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

CS-580K/480K Advanced Topics in Cloud Computing

Chapter V

15. NoSQL Database

1

Page 2: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Where are we?Cloud Applications

New databases technologies (e.g., Key-value store, and Object store)

New Programming Models (e.g., severless, microservices, Hadoop, and Spark)…

Cloud Platforms

Storage Virtualization

Virtualization Layer

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

VM1 VM2 VM3

Network Virtualization

Virtualization Layer

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

VM1 VM2 VM3

Virtualization Layer

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

VM1 VM2 VM3

Page 3: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

A simple example – web-based applications

3

Storage

SQL Databases

▪ A relational database is a collection of data items organized as a set of formally-described tables – tables and their relations.

▪ Structured Query Language is the standard means of manipulating and querying data in relational databases.

Tables and their relations

Page 4: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Scalability Problem

4

Storage

Stateful SQL Databases

▪As data volume keeps increasing, it is cheaper and more reasonable to scale out horizontally, by adding smaller, less expensive servers rather than investing in a single large server.

Scale horizontally

Page 5: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Adding a cache layer

5

Front-end1

Middle-Tier

Storage

Front-end2

Middle-Tier

Front-end3

Middle-TierStateless

Stateless

Stateful SQL Databases

Cache Cache CacheStateless

As datasets grows, the simple memcache model starts to become problematic.

Page 6: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Scaling RDBMS – Master/Slave

• Master-Slave▪ All writes are written to the master.

▪ All reads performed against the replicated slave databases

▪ Critical reads may be incorrect as writes may not have been propagated down

▪ Large data sets can pose problems as master needs to duplicate data to slaves

6

Master

Slave Slave Slave

Page 7: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Scaling RDBMS - Partitioning

▪ “Sharding”▪ Divide data amongst many cheap databases

▪ Manage parallel access in the application

▪ Scales well for both reads and writes

▪ Not transparent, application needs to be partition-aware

7

Page 8: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Other ways to scale RDBMS

▪ Multi-Master replication

▪ INSERT only, not UPDATES/DELETES

▪ In-memory databases

8

Page 9: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

In Summary

▪ Traditional relational databases do not scale well, when dataset grows

▪ Adding a cache layer scales well for reads, but not for writes

▪ Master/slave also scales well for reads, but not for writes

▪ Sharding scales well for both reads and writes, but not transparent

9

Page 10: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

What is NoSQL

• Stands for No-SQL or Not Only SQL??• Class of non-relational data storage systems

• E.g. BigTable, and Dynamo

• Do not require a fixed table schema nor do they use the concept of joins (no relationships)• Distributed data storage systems

• All NoSQL offerings relax one or more of the ACID properties (i.e., CAP theorem)

10

Page 11: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Why NoSQL?

▪ Relational databases offer a very good general purpose solution to many different data storage needs.

▪ But it cannot fit all

▪ Just as there are different programming languages, need to have other data storage tools in the toolbox

11

Page 12: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Why NoSQL?

▪ Explosion of social media sites (Facebook, Twitter) with large data needs

▪ Explosion of storage needs in large web sites such as Google, Yahoo

▪ Much of the data is not files

▪ Shift to dynamically-typed data with frequent schema changes

12

Page 13: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Dynamo and BigTable

▪Three major papers were the seeds of the NoSQL movement▪BigTable (Google)▪Dynamo (Amazon)▪CAP Theorem (counterpart of ACID)

13

Page 14: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

CAP Theorem (I)

• Three properties of a distributed system• Consistency

• All copies have same value

• Availability• Every request received by a non-failing node the system

must result in a response

• Partition tolerance • Network can break into two or more independent parts (due

to separate optimization and/or failures)• Partition tolerance means that even after the network is

partitioned into multiple sub-systems, it still works correctly.14

Page 15: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

CAP Theorem (II)

• Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system

• Proof:https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

• Very large systems will partition at some point• Choose one of consistency or availability• Traditional database choose consistency• Most Web applications choose availability

• Except for specific parts such as order processing

15

Page 16: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Availability

▪ For a large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes.

▪Availability refers to the percentage of time that the infrastructure, system or a solution remains operational under normal circumstances in order to serve its intended purpose.▪Percentage of availability = (total elapsed time – sum of

downtime)/total elapsed time

▪Traditionally, thought of as the server/process available five 9’s (99.999 %).▪The yearly service downtime could be as much as 5.256

minutes.16

Page 17: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Availability

17

Page 18: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Consistency Model

▪ A consistency model determines rules for visibility and order of updates.

▪ For example:▪ X is replicated on nodes M and N

▪ Client A writes X to node N

▪ Some period of time t elapses.

▪ Client B reads X from node M

▪ Does client B see the write from client A?

▪ For NoSQL, the answer would be:

▪ Yes, if the NoSQL adopts a strict consistency model

18

M N

Client AClient B

X X

Page 19: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Strict Consistency

• All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to.

• It implies nodes employ some kind of distributed transaction protocol to ensure all data copies have the same value

• CAP Theorem: Strict Consistency can’t be achieved at the same time as availability and partition-tolerance.

19

Page 20: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Eventual Consistency

▪ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent

▪ The type of large systems built based on CAP are known as BASE (Basically Available, Soft state, Eventual consistency)

▪ Who builds large-scale distributed systems based on CAP?▪ Google, Yahoo, Facebook, Amazon, eBay, etc…

20

Page 21: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Types of NoSQL (1)

▪Key/Value data model▪Use of a hash table▪Store data as a key/value pair ▪Access data (values) by strings

called keys ▪Value has no require format▪Basics operations include

insert(key, value), fetch(key),update(key), delete(key)

▪E.g., Amazon S3 (Dynamo) DeCandia, Giuseppe, et al. "Dynamo: amazon's highly

available key-value store." ACM SIGOPS operating

systems review 41.6 (2007): 205-220.

Page 22: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Key/Value

▪Pros:▪ Simple model

▪ Very fast

▪ Very scalable

▪ Able to scale horizontally

▪Cons: ▪ Many data structures (objects) can't be easily modeled as key

value pairs

Page 23: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Types of NoSQL (2)

▪Other schema-less data models which come in multiple flavors such as column-based, document-based or graph-based.▪Cassandra (column-based)▪MongoDB (document-based)▪Neo4J (graph-based)▪Redis (key/value-based)

23

Page 24: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Document-based

▪Can model more complex objects

▪Data model: collection of documents

▪Document: JSON ▪ JavaScript Object Notation is a data model, which supports

objects, records, structs, list, array, maps, dates, with nesting.

24

Page 25: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Column family data model

▪ Column based databases use a concept called a keyspace.

▪ The keyspace contains all the column families

▪ A column family consists of multiple rows.

▪ Each row can contain a different number of columns.

▪ Each column contains a name/value pair, along with a timestamp.

25

Page 26: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Graph data model▪ Based on Graph Theory: A graph is

composed of two elements: a node and aedge (i.e., the relationship).

▪ Each node represents an entity (e.g., a person, place, thing, category or other piece of data), and each relationship represents how two nodes are associated.▪ Twitter is a perfect example of a graph database

connecting 330 million monthly active users.

▪ Graph databases, by design, allow simple and fast retrieval of complex hierarchical structures that are difficult to model in relational systems.

26

Page 27: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Typical NoSQL API

▪ Basic API access:

▪ get(key) -- Extract the value given a key

▪ put(key, value) -- Create or update the value given its key

▪ delete(key) -- Remove the key and its associated value

▪ execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map .... etc).

27

Page 28: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Advantages of NoSQL Systems

▪ Easy to use

▪ Easy to scale horizontally on commodity hardware

▪ Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned▪ No single point of failure

▪ Breadth of functionality

▪ Executing code next to the data (e.g., Hadoop)

▪ …

28

Page 29: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

What does NoSQL not Provide?

▪ Joins

▪ Group by

▪ ACID transactions

▪ SQL

▪ Integration with applications that are based on SQL

29

Page 30: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Which One to use?

▪ NoSQL Data storage systems make sense for applications that need to deal with very very large semi-structured data

▪ Log Analysis

▪ Social Networking Feeds

▪ Most of our work is on organizational databases, which are not that large and have low update/query rates

▪ regular relational databases are the correct solution for such applications

30

Page 31: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

In Summary -- NoSQL

• “Not Only SQL”

• 1. Scale horizontally “simple operations”

• 2. Replicate/distribute data over many servers

• 3. Simple call level interface (contrast w/ SQL)

• 4. Weaker concurrency/consistency model than ACID

• 5. Flexible schema (i.e., schema-less)

31

Page 32: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Demos

• Google Big Tables: https://www.youtube.com/watch?v=ChoXxlddGis

32

Page 33: CS-580K/480K Advanced Topics in Cloud Computing › ~huilu › slides580ksp20 › NoSQL.pdf · X is replicated on nodes M and N Client A writes X to node N Some period of time t elapses

Sources

• Reliability vs Availability: What’s the Difference? https://www.bmc.com/blogs/reliability-vs-availability/

• An Illustrated Proof of the CAP Theorem: https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

• What is a Column Store Database: https://database.guide/what-is-a-column-store-database/

• 10 Advantages of NoSQL over RDBMS: https://www.dummies.com/programming/big-data/10-advantages-of-nosql-over-rdbms/

33