trends in nosql technologiesdsb/cs386d/projects14/nosqldatabases.pdf · trends in nosql...
Post on 17-Oct-2020
3 Views
Preview:
TRANSCRIPT
Trends in NoSQL Technologies
Database Systems, CS386D
Instructor: Don Batory
Ankit, Prateek and Dheeraj
1
Agenda Fundamental Concepts
Why NoSQL?
What is NoSQL?
NoSQL Taxonomy
Case Studies
Project Summary
References
2
Why NoSQL? New Trends
3
Growth in data "Big Data" + "Unstructued data"
Source: http://www.couchbase.com
4
Connectedness "Social networks"
Source: http://http://tjm.org/
5
Architecture "Concurrency"
Source: http://www.couchbase.com
6
Architecture "Concurrency"
Source: http://www.slideshare.net/thobe/nosql-for-dummies
7
Couchbase NoSQL Survey 1. Flexibility (49%) 2. Scalability (35%) 3. Performance (29%)
8
Extending the scope of RDBMs Data Partitioning ("sharding") + Enables scalability - Difficult process - No-cross shard joins - Schema management on every shard
Denormalizing + Increases speed + Provides flexibility to sharding - Eliminates relational query benefits
Distributed Caching + Accelerated reads + Scale out : ability to serve larger number of requests - Another tier to manage
9
RDMBS
Not a "One Shoe fits all" solution
Oracle has tried it
Need something different
10
Fundamental Concepts ACID and CAP (A Quick Review)
11
ACID
Atomicity * Consistency * Isolation * Durability
Set of properties that guarantee that database transactions are processed reliably
Example:
Transfer of funds from one bank account to another (lower the FROM account and raise the TO account)
12
CAP It is impossible in a for a distributed computer system to simultaneously provide all three of the following guarantees
Cluster : A distributed network of nodes which acts as a gateway to the user
• Consistency • Data is consistent across all the nodes of the cluster.
• Availability
• Ability to access cluster even if nodes in cluster go down.
• Partition Tolerance • Cluster continues to function even if there is a “partition” between two nodes.
13
CAP
14
Categorization
15
What is different? What is NoSQL database technology
16
Design Features Data Model No schema enforced by database - "Schemaless"
Four major categories Key/Value stores
Document Stores
Columnar stores
Graph Databases
17
Key Value Stores Redis
BDB
Memcached
Membase
Voldemort
Dynamo
At the core all NoSQL databases are key/value systems, the difference is whether the database understands the value or not.
18
Key Value Stores - examples
• In memory / on disk • Key/Value pair storage • Purpose: Caching
• Persistent In-memory • Key/Value pair storage • Provides special data structures • pub/sub capabilities. • Purpose: Caching and beyond
• In memory / on disk • Key/Value pair storage • Transactional support • Purpose: Lightweight DB
• Disk-based with built in memcache • Cache refill on restart • Highly Available (replication) • Add/Remove live cluster • Purpose: Caching
19
Document Stores MongoDB, Couchbase
• DB understands values • Data store in JSON/XML/BSON objects • Secondary Indexes possible • Schemaless • Query on attributes inside values possible
20
Document Stores - examples
• memcached + couchDB • Data stored as JSON objects • Autosharding (replication)* • Highly Available • Create indexes, views. • Query against indexes. • Native support for map-reduce
• Data stored as BSON • Very easy to get started • Disk based with in-memory caching • Auto-sharding* • Supports Ad-hoc queries • Native support for map-reduce.
* Auto-sharding - As system load changes, assignment of data to shards is rebalanced automatically 21
Column Oriented Stores Cassandra
• DB understands values • You don't need to model all the columns required by your
application upfront. • Technically It's a partitioned row store, where rows are organized
into tables with a required primary key.
source: http://stackoverflow.com 22
Column Family :: table
Dynamic Columns
http://www.datastax.com/docs/0.8/ddl/index 23
Column oriented store - examples
• Open source clone to Google's BigTable
• Runs only on top of HDFS • CP based system
• Modeled after Google's BigTable • Clustered like Dynamo • Good cross datacenter support • Supports efficient queries on
columns • Eventually consistent • AP based system
24
Graph Databases Neo4J, GraphDB, Pregel
source: http://stackoverflow.com
• Apply Graph Theory to the storage of information about the relationship between entries
• Used for recommendation engines.
25
Wait for more in the Graph DB presentation!
• Disk-based system • External caching required • Nodes, relationships and paths • Properties on nodes • Complex query on relations
Graph DB - example
26
27
In-house solutions
• No schema required before inserting data • No schema change required to change data format • Auto-sharding without application participation • Distributed queries • Integrated main memory caching • Data Synchronization (multi-datacenter)
28
Case Studies Amazon's Dynamo and Google's Bigtable
29
Google's BigTable 30
BigTable Designed to scale.
And the scale we are talking is of Petabytes!
A distributed store for managing structured data.
Three dimensional Table structure.
Uninterpretated bytes storage.
CP - Choses Consistency over Availability in the case of network partitioning (CAP theorem)
Basically, it is just a sparse, distributed, persistent sorted map store.
31
Sparse
Most of the columns are empty
Persistent
Data gets stored permanently in the disk
Sorted
Data kept in heirarchical fashion
Spatial Locality
Consistent
• sparse • distributed • scalable • persistent • sorted • consistent • map store
Features
32
Tablet dimensions 1. Rows
2. Column Families
3. Timestamps
BigTable's basic data storage structure
Tablet : BigTable's basic unit of storage
33
Sparsity demonstrated in the table
Storage Hierarchy
BigTable indexing hierarchy
Metadata tablet
Root tablet
Chubby file
Tablet
34
Optimizations • Bloom Filters
• Caching
Bloom Filter : Drastically reduces the number of disk seeks required for read!
35
Optimizations Higher Level Cache
• For scenarios where same data is read repeatedly
Lower Level Cache
• For spatial locality
• Bloom Filters
• Caching
36
GFS
HLC
LLC
Tablet Server
Google File System
Amazon DynamoDB 37
Amazon DynamoDB
Motivation:
“ Customers should be able to view and add items to their shopping cart even if network routes are broken or data centers are being destroyed by tornadoes.”
AP: It chooses availability over consistency in the case of network partitioning
38
Features
Highly Available key-value
High performance (low latency)
Highly scalable (hundreds of nodes)
"Always on" available (esp. for writes)
Partition/Fault-tolerant
Eventually consistent
39
Key Techniques
Consistent Hashing
For data partitioning, replicating and load balancing
Sloppy Quorums
Boosts availability in present of failures
40
Consistent Hashing Sharding
source: http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
41
Consistent Hashing Dynamically add nodes
source: http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
42
Consistent Hashing Dynamically remove nodes
source: http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
43
Consistent Hashing load balancing
source: http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
44
Replication N = 3
45
Sloppy Quorums
- R number of nodes that need to participate in read
- W number of nodes that need to participate in write
- R + W > N (a quorum system)
Availability in presence of failures
Dynamo:
W = 1 (Always available for write)
Yields R=N(reads pay penality )
Typical: R=2, W=2, N=4
46
Dynamo Summary An eventually consistent highly available key/value store
AP in CAP space
Focuses on low latency, SLAs
Very low latency writes, reconciliation in reads
Key techniques used in many other distributed systems
Consistent hashing, (sloppy) quorum-based replication, vector clocks, gossip-based membership, merkel tree synchronization
47
Project Summary To be SQL or Not to be SQL
48
Bright future of NoSQL Companies using NoSQL
- Amazon
... many many more.
49
Conclusion Even NoSQL - Not a "One Size fits all" kinda shoe.
Shoe horning your database is just bad, bad, bad!
Use when Data schema keeps on varying often Scalability really becomes an issue Not to use when The data is inherently relational Lots of complex queries to write You need good helping resources eg. debugger, performance tools
50
References - 1 Dynamo: amazon's highly available key-value store. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles (SOSP '07). ACM, New York, NY, USA, 205-220
Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008). Fay Chang et. al
http://docs.mongodb.org/manual/core/sharding-introduction
http://mongodb.com/learn/nosql"http://www.mongodb.com/learn/nosql
http://www.cs.rutgers.edu/~pxk/417/notes/content/bigtable.html
http://en.wikipedia.org/wiki/ACID
51
References - 2 http://www.slideshare.net/mongodb/mongodb-autosharding-at-mongo-seattle
http://www.slideshare.net/danglbl/schemaless-databases"http://www.slideshare.net/danglbl/schemaless-databases
http://infoq.com/presentations/NoSQL-Survey-Comparison"www.infoq.com/presentations/NoSQL-Survey-Comparison
http://info.mongodb.com/rs/mongodb/images/10gen_Top_5_NoSQL_Considerations.pdf
http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
http://technosophos.com/2014/04/11/nosql-no-more.html
52
Questions
Backup Slides Basic Concepts
• Trees
• Graphs
• Key-value
• XML
• etc.
Any storage model other than tabular relations.
12
Auto-sharding
TODO: http://www.scalebase.com/
extreme-scalability-with-mongodb-and-mysql-part-
1-auto-sharding/
Impedence Mismatch You break structured data into pieces and spread it across different tables.
leads to object relational mapping
lots of traffic => buy bigger boxes. Lot of small boxes. SQL was designed to run on single box.
Key Techniques
Consistent Hashing For data partitioning, replicating and load
balancing
Sloppy Quorums Boosts availability in present of failures
Vector Clocks For tracking casual dependencies among different
versions of the same key (data)
Gossip-based group membership protocol For maintaining information about live nodes
Anti-entropy protocol using hash/merkle trees Background synchronization of divergent replicas
46
Availability
Consistent Hashing
Replication and partitioning
48
Vector Clocks Tracking causal dependencies
49
Gossip and Anti-entropy
Merkel Trees Each node keeps a merkel tree for each of its key ranges
Compare the root of the tree with replicas if equal => replicas in synch Traverse the tree and synch those keys that differ
Synchronization and book-keeping of live nodes
Membership:
Node contacts a random node every 1s.
Gossip used for exchanging and partitioning/placement metadata
51
Merkel Trees
Atomicity
Atomicity requires that each transaction is "all or nothing"
ACID
26
Consistency
The consistency property ensures that any transaction will bring the database from one valid state to another valid state.
ACID
27
b
a+b
a+b-10
Isolation
The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e. one after the other.
ACID
28
Durability
Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
ACID
29
• sparse • distributed • scalable • persistent • sorted • map store
Features Persistent Data gets stored permanently in
the disk
Sorted Data kept in heirarchical fashion Spatial Locality
Map Store Just a collection of (key, value)
pairs
40
BASE
An alternative to ACID
• Basically Available • Support partial failures without total system failure.
• Soft state • optimistic and accepts that consistency will be in state of flux.
• Eventual Consistency • Given a sufficiently long period of time over which no changes are sent, all updates can be
expected to propagate eventually.
33
top related