CS 542 Database Management Systems: Parallel and Distributed Databases
Introduction to NoSQL databases and MapReduce
J Singh, February 14, 2011
Today’s meeting
• Parallel and Distributed Databases, Chapter 20.1 – 20.4
– But only a part of 20.4; the rest w/ Query Optimization
• NoSQL Databases
• MapReduce
• References:
– Selected papers
– Lin & Dyer, Data-Intensive Text Processing with MapReduce, 2010 (the MapReduce textbook)
Parallel Databases
• Motivation
– More transactions per second, or less time per query
– Throughput vs. response time
– Speedup vs. scaleup
• Database operations are highly parallelizable
– E.g., consider a join between R and S on R.b = S.b (a sketch follows below)
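This parallelizes naturally because tuples can only join if they agree on b. A minimal sketch (not from the slides; the partitioned hash join shown is a standard technique and all names are illustrative): hash-partition both relations on b so matching tuples land on the same worker, then join locally.

from collections import defaultdict

def partitioned_hash_join(R, S, num_workers):
    # Phase 1: hash-partition both relations on the join attribute b.
    R_parts, S_parts = defaultdict(list), defaultdict(list)
    for a, b in R:                                   # R tuples are (a, b)
        R_parts[hash(b) % num_workers].append((a, b))
    for b, c in S:                                   # S tuples are (b, c)
        S_parts[hash(b) % num_workers].append((b, c))
    # Phase 2: each "worker" (a loop iteration here) joins its partitions
    # independently; no worker ever needs another worker's tuples.
    result = []
    for w in range(num_workers):
        build = defaultdict(list)                    # build a hash table on R
        for a, b in R_parts[w]:
            build[b].append(a)
        for b, c in S_parts[w]:                      # probe it with S
            for a in build[b]:
                result.append((a, b, c))
    return result

print(partitioned_hash_join([(1, 'x'), (2, 'y')], [('x', 10), ('x', 11)], 4))
# [(1, 'x', 10), (1, 'x', 11)]

With p workers, each local join sees roughly 1/p of each relation, so per-node work and memory shrink as machines are added.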
Parallel Databases
• Shared-nothing vs. shared-memory vs. shared-disk
Parallel Databases
• Shared Memory
– Communication between processors: extremely fast
– Scalability: not beyond 32 or 64 processors or so (the memory bus is the bottleneck)
– Notes: cache coherency is an issue
– Main use: low degrees of parallelism
• Shared Disk
– Communication between processors: disk interconnect is very fast
– Scalability: not very scalable (the disk interconnect is the bottleneck)
– Notes: transactions are complicated; natural fault tolerance
– Main use: not used very often
• Shared Nothing
– Communication between processors: over a LAN, so slowest
– Scalability: very scalable
– Notes: distributed transactions are complicated (deadlock detection etc.)
– Main use: everywhere
Date’s Rules for Distributed DBMS
• Rule 0: To the user, a distributed system should look exactly like a non-distributed system.
• Other rules:
1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction management
9. Hardware independence
10. Operating system independence
11. Network independence
12. DBMS independence
Distributed Systems Headaches
• Especially if trying to execute transactions that involve data from multiple sites:
– Keeping the databases in sync
• 2-phase commit for transactions is uniformly hated (we'll cover it on 4/11)
– Autonomy issues
• Even within an organization, people tend to be protective of their unit/department
– Lock/deadlock management
• Works better for query processing
– Since we are only reading the data
Distributed Query Optimization
• Cost-based approach: consider all plans, pick the cheapest; similar to centralized optimization.
– Communication costs must be considered.
– Local site autonomy must be respected.
– New distributed join methods.
• The query site constructs a global plan, with suggested local plans describing the processing at each site.
– If a site can improve its suggested local plan, it is free to do so.
Distributed Query Example
• Find all cities in Africa:
SELECT City.Name
FROM City, Country
WHERE City.CountryCode = Country.Code
  AND Country.Continent = 'Africa';
• Location assumptions:
– Country table: Server A
– City table: Server B
– The request arrives at Server A
• Think it through on the whiteboard (one way to reason about it is sketched below)
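One way to think it through (the row counts are hypothetical, for illustration only): suppose Country has ~240 rows, of which ~60 are African, and City has ~4,000 rows. Plan 1: ship the entire City table from B to A and join at A; this moves ~4,000 rows. Plan 2: apply Continent = 'Africa' at A, ship the ~60 qualifying Country rows to B, join at B, and ship only the African city names back; this moves ~60 rows plus the answer. Plan 3 (a semijoin): ship just the qualifying Country.Code values to B, filter City there, and return the matches. Because the filter is highly selective, Plans 2 and 3 move far less data, which is why communication cost dominates distributed plan selection.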
Detour: Paxos Algorithm
• Paxos is a family of protocols for solving consensus in a network of unreliable processors.
– Consensus is the process of agreeing on one result among a group of participants.
– This problem becomes difficult when the participants or their communication medium may experience failures.
• Includes a spectrum of trade-offs among:
– the number of processors, the number of message delays before learning the agreed value, the activity level of individual participants, the number of messages sent, and the types of failures tolerated.
• Widely used, e.g., in Google's lock management service, among other uses.
– The above definitions are from Wikipedia. Feel free to pursue; not on the exam.
– Contributions from Leslie Lamport, Nancy Lynch, Barbara Liskov.
Updating Distributed Data
• Synchronous Replication: all copies of a modified relation (fragment) must be updated before the modifying transaction commits.
– Data distribution is made transparent to users.
• Asynchronous Replication: copies of a modified relation are only periodically updated; different copies may get out of sync in the meantime.
– Users must be aware of data distribution.
– Many current products follow this approach. Also referred to as Master/Slave.
Synchronous Replication
• Voting: a transaction must write a majority of copies to modify an object, and must read enough copies to be sure of seeing at least one most-recent copy.
– E.g., 10 copies; 7 written for an update, 4 copies read (7 + 4 > 10, so every read quorum overlaps the latest write quorum; a sketch follows below)
– Each copy has a version number.
– Usually not attractive because reads are common.
• Read-any Write-all: writes are slower and reads are faster, relative to Voting.
– The most common approach to synchronous replication.
– But what if one of the 10 nodes is down?
• The choice of technique determines which locks to set.
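A minimal sketch of the Voting scheme with the slide's numbers (N = 10 copies, write quorum W = 7, read quorum R = 4; the class and function names are illustrative, not from any particular system):

import random

N, W, R = 10, 7, 4                # copies, write quorum, read quorum; W + R > N

class Copy:
    def __init__(self):
        self.version, self.value = 0, None

copies = [Copy() for _ in range(N)]

def quorum_write(value, version):
    # Install the new version on any W of the N copies.
    for c in random.sample(copies, W):
        if version > c.version:
            c.version, c.value = version, value

def quorum_read():
    # Read any R copies. Because W + R > N, at least one of them belongs
    # to the latest write quorum, so the highest version seen is current.
    sampled = random.sample(copies, R)
    return max(sampled, key=lambda c: c.version).value

quorum_write("Joe is 35", version=1)
print(quorum_read())              # "Joe is 35" on every run

Read-any Write-all is the degenerate case W = N, R = 1, which is why a single down node blocks writes.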
Peer-to-Peer Replication
• More than one of the copies of an object can be a master in this approach.
• Changes to a master copy must be propagated to other copies somehow.
• If two master copies are changed in a conflicting manner, this must be resolved. (e.g., Site 1: Joe’s age changed to 35; Site 2: to 36)
• Best used when conflicts do not arise:
– E.g., each master site owns a disjoint fragment.
– E.g., updating rights are owned by one master at a time.
Summary
• Parallel DBMSs are designed for scalable performance. Relational operators are very well-suited for parallel execution.
– Pipeline and partitioned parallelism.
• Distributed DBMSs offer site autonomy and distributed administration. Must revisit storage and catalog techniques, concurrency control, and recovery issues.
• Thus far, we have taken an ad hoc approach to database parallelism and distribution.
– Time to formalize it.
Brewer’s Conjecture (p1)
• Source: Eric Brewer's July 2000 PODC keynote
• Main points:
– Classic "distributed systems" don't work
• They focus on computation, not data
• Distributing computation is easy; distributing data is hard
– DBMS research is about ACID (mostly)
• But we forfeit "C" and "I" for availability, graceful degradation, and performance
• This tradeoff is fundamental
– BASE
• Basically Available
• Soft-state
• Eventual Consistency
Brewer’s Conjecture (p2)
• ACID
– Strong consistency
– Isolation
– Focus on "commit"
– Nested transactions
– Availability?
– Conservative (pessimistic)
– Difficult evolution (e.g., schema)
• BASE
– Weak consistency (stale data OK)
– Availability first
– Best effort
– Approximate answers OK
– Aggressive (optimistic)
– Simpler!
– Faster
– Easier evolution
• "But I think it's a spectrum" (Eric Brewer)
CAP Theorem (p1)
• Brewer's take-home messages:
– Can have consistency & availability within a cluster, but it is still hard in practice
– OS/networking is good at BASE/availability, but terrible at consistency
– Databases are better at C than at availability
• Wide-area databases can't have both
– All systems are probabilistic
• Since then:
– Brewer's conjecture was formally proved (Gilbert & Lynch, 2002)
– Thus Brewer's conjecture became the CAP theorem…
– …and contributed to the birth of the NoSQL movement
CAP Theorem (p2)
• But the theory is not settled:
– Aren't availability and partition tolerance the same thing?
– And shouldn't we be thinking about latency?
• Meanwhile, http://nosql-database.org/ lists 122 NoSQL databases
• References:
– Availability and Partition Tolerance, Jeff Darcy, 2010
– Problems with CAP…, Dan Abadi, 2010
– What does the Proof of the CAP Theorem mean?, Dan Weinreb, 2010
CS 542 Database Management Systems: NoSQL Databases
What is NoSQL?
• Stands for "Not Only SQL"
• A class of non-relational data storage systems
• Usually do not require a fixed table schema, nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
Forces at Work
• Three major papers were the seeds of the NoSQL movement:
– CAP Theorem (discussed above)
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
• Some types of data could not be modeled well in an RDBMS:
– Document storage and indexing
– Recursive data and graphs
– Time series data
– Genomics data
The Perfect Storm
• Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that none of the current NoSQL offerings can rival
What kinds of NoSQL
• NoSQL solutions fall into two major areas:
– Key/Value, or "the big hash table":
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
– Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based:
• Cassandra (column-based)
• CouchDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
Amazon SimpleDB
• Key-value store
– Written in Erlang (as is CouchDB)
• Data is modeled in terms of:
– Domain, a container of entities
– Item, an entity
– Attribute and Value, a property of an Item
• Eventually consistent, except when the ConsistentRead flag is specified
– Impressive performance numbers, e.g., 0.7 sec to store 1 million records
• SQL-like SELECT (example below):
select output_list from domain_name [where expression] [sort_instructions] [limit limit]
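A hypothetical example instantiating the grammar above (the domain name cities and its attributes are illustrative):

select Name from cities where Continent = 'Africa' limit 10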
Google Datastore
• Part of App Engine; also used for internal applications
– Used for all storage
– Incorporates a transaction model to ensure high consistency
• Optimistic locking
• Transactions can fail
• CAP implications:
– Datastore isn't just "eventually consistent"
– They offer two commercial options (with different prices):
• Master/Slave: low latency but also lower availability; asynchronous replication
• High Replication: strong availability at the cost of higher latency
App Engine Architecture
[Figure: App Engine architecture. The request/response path enters a Python VM process containing the stdlib and the app; the app calls stateful APIs (memcache, datastore) and stateless APIs (images, urlfetch), plus a read-only filesystem.]
Datastore Programming Model (p1)
• Entities have a Kind, a Key, and Properties
– Entity ~~ record ~~ Python dict ~~ Python class instance
– Key ~~ structured foreign key; includes the Kind
– Kind ~~ table ~~ Python class
– Property ~~ column or field; has a type
• Dynamically typed: Property types are recorded per Entity
• A Key has either an id or a name
– The id is auto-assigned; alternatively, the name is set by the app
– A key can be a path including the parent key, and so on
• Paths define entity groups, which limit transactions (a sketch follows below)
– A transaction locks the root entity (the parentless ancestor key)
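A minimal sketch in the Python db API roughly as it stood circa 2011 (the Guestbook/Greeting model is illustrative, not from the slides):

from google.appengine.ext import db

class Guestbook(db.Model):             # Kind "Guestbook" (a root entity)
    name = db.StringProperty()

class Greeting(db.Model):              # Kind "Greeting"
    author = db.StringProperty()
    content = db.TextProperty()
    date = db.DateTimeProperty(auto_now_add=True)

book = Guestbook(key_name='main', name='Main book')   # key name set by the app
book.put()

# parent=... places the greeting in the guestbook's entity group, so a
# transaction can cover both (it locks the root entity, i.e., the guestbook).
greeting = Greeting(parent=book, author='JD', content='hello')
greeting.put()                         # numeric id auto-assigned by the datastore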
Datastore Programming Model (p2)
• GQL
– GQL offers SELECT but no INSERT, UPDATE, or JOIN
– Use the language bindings for INSERT, etc., and for the transaction primitives
• Grammar (an example follows below):

SELECT [* | __key__] FROM <kind>
  [WHERE <condition> [AND <condition> ...]]
  [ORDER BY <property> [ASC | DESC] [, <property> [ASC | DESC] ...]]
  [LIMIT [<offset>,]<count>]
  [OFFSET <offset>]

<condition> := <property> {< | <= | > | >= | = | != } <value>
<condition> := <property> IN <list>
<condition> := ANCESTOR IS <entity or key>
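For instance, a hypothetical query over the Greeting kind sketched above:

SELECT * FROM Greeting WHERE author = 'JD' ORDER BY date DESC LIMIT 10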
Datastore is Based on BigTable
• Provides scalable, structured storage
– Implemented as a sharded, sorted array
• Sharded: each block (tablet) lives on its own server
• Sorted: engineered to fetch the results of range queries with the fewest disk reads (a toy sketch of this idea follows below)
– Operations: only these six
1. Read
2. Write
3. Delete
4. Update (atomic)
5. Prefix scan
6. Range scan
• Row names (keys): up to 64KB
• Columns: unlimited in size
– Divided into "column families"
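A toy sketch of why a sorted array makes scans cheap (the Tablet class is illustrative, not BigTable's actual interface): sorted keys let point reads use binary search and let prefix/range scans touch one contiguous slice.

import bisect

class Tablet:
    # Toy sorted key-value array supporting the six operations above.
    def __init__(self):
        self.keys, self.vals = [], []

    def write(self, key, value):                 # Write / atomic Update
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = value
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, value)

    def read(self, key):                         # Read
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]

    def delete(self, key):                       # Delete
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i]
            del self.vals[i]

    def range_scan(self, lo, hi):                # Range scan: one contiguous slice
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.vals[i:j]))

    def prefix_scan(self, prefix):               # Prefix scan as a range scan
        return self.range_scan(prefix, prefix + '\xff')   # assumes key chars < '\xff'

t = Tablet()
t.write('apple', 1)
t.write('banana', 2)
t.write('apricot', 3)
print(t.prefix_scan('ap'))                       # [('apple', 1), ('apricot', 3)]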
[Figure: BigTable rows, sorted by row name, split into tablets; e.g., rows a–c in Tablet 1, d–j in Tablet 2, n–z in Tablet 3.]
Datastore Entity Implementation
• Shown:
– Index: Entities By Kind
– Index: Entities By Property ASC
• Not shown:
– Index: Entities By Property DESC
– Index: By Composite Property
[Figure: the same BigTable-rows diagram alongside two index tables: "Entities By Kind," whose row names have the form <AppID, Kind, Key>, and "Entities By Property ASC," whose row names have the form <AppID, propName, propVal, Key>, plus an entity-group tree (Root with Nodes 1, 2, 3).]
Datastore Application at Google
• Some production data, circa 2008.
• For more info, see the video of Ryan Barrett's talk at Google I/O.
Databases and Key-Value Stores
http://browsertoolkit.com/fault-tolerance.png
CS 542 Database Management Systems: MapReduce
The Story of Sam
Conceptual Underpinnings
• A programming model from Lisp and other functional languages (Python equivalents below):
– (map square '(1 2 3 4)) → (1 4 9 16)
– (reduce + '(1 4 9 16)) → 30
• Easy to distribute
• Nice failure/retry semantics
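The same two forms in Python, as a direct translation (not from the slides):

from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # 30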
MapReduce Flow
Example: Reverse index words in docs
• Input: the crawler yields (url, content) pairs
• Map function: map(key = url, value = content)
– For each word w in content, emit (w, [url, offset])
• Reduce function: reduce(key = word, values = list of [url, offset])
– Sort the values
– Emit (word, sorted list of [url, offset]) (a runnable sketch follows below)
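A minimal single-process sketch of this job (the function names and the in-memory shuffle are illustrative; a real framework runs the three phases on many machines):

from collections import defaultdict

def map_fn(url, content):
    # Emit (word, (url, offset)) for every word occurrence.
    for offset, word in enumerate(content.split()):
        yield word, (url, offset)

def reduce_fn(word, postings):
    # Sort each word's postings and emit the final index entry.
    return word, sorted(postings)

def run(docs):
    groups = defaultdict(list)        # the "shuffle": group pairs by key
    for url, content in docs.items():
        for word, posting in map_fn(url, content):
            groups[word].append(posting)
    return dict(reduce_fn(w, ps) for w, ps in groups.items())

index = run({'u1': 'to be or not to be', 'u2': 'not to worry'})
print(index['to'])                    # [('u1', 0), ('u1', 4), ('u2', 1)]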
Implementation Questions
• Map:
– How many processors should we use? 4? 32? 1024?
• Reduce:
– How many processors?
– What's the allocation algorithm for assigning words to processors?
• These are all design decisions, driven by:
– The size and other characteristics of the problem
Implementation Steps
• Split the key/value pairs into M chunks; run a map task on each chunk in parallel
• Partition the output of the map tasks into R regions (the standard partition function is sketched below)
• After all map tasks complete:
– Why do we need to wait? (A reduce task needs every value for its key; a still-running map task could emit more.)
• Run a reduce task on each of the R regions
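The usual partitioning rule is a hash on the intermediate key, modulo R (as in the MapReduce paper); a one-line sketch:

def partition(key, R):
    # All pairs sharing a key land in the same region, so exactly one
    # reduce task sees each key's complete value list.
    return hash(key) % R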
Fault Tolerance
• Problem detection:
– Heartbeat
• Remedy:
– Re-execute in-progress and completed map tasks (completed map output lives on the failed worker's local disk, so it is lost with the worker)
– Re-execute in-progress reduce tasks
Applications
• Not limited to data analysis tasks
– Any task that is parallelizable, e.g., prepare a report for each user
– The importance of idempotence: it must be possible to rerun a task
• Example mapper configuration (YAML):

- name: prepUserReport
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: Service.prepUserReport
    params:
    - name: entity_kind
      default: DataModel.Org
    - name: done_callback
      default: /mr_done/prep
Refinements: Usability
• In the App Engine environment, automation has been the focus:
– Automatic sharding for faster execution
– Automatic rate limiting for slow execution
– Status pages (demo in a minute)
– Counters
– Parameterized mappers
– Batching datastore operations
– Iterating over blob data
Refinement: Redundant Execution
• Slow workers significantly delay completion time:
– Other jobs consuming resources on the machine
– Bad disks with soft errors transfer data slowly
– Weird things: processor caches disabled (!!)
• Solution: near the end of a phase, spawn backup tasks
– Whichever one finishes first "wins"
• Dramatically shortens job completion time
Refinement: Locality Optimization
• Master scheduling policy:
– Asks GFS for the locations of the replicas of the input file blocks
– Map tasks are typically split at 64MB (the GFS block size)
– Map tasks are scheduled so that a GFS replica of the input block is on the same machine or the same rack
• Effect:
– Thousands of machines read input at local-disk speed
• Without this, rack switches limit the read rate
Refinement: Skipping Bad Records
• Map/Reduce functions sometimes fail for particular inputs
– The best solution is to debug & fix
• Not always possible, e.g., with third-party source libraries
– On a segmentation fault:
• Send a UDP packet to the master from the signal handler
• Include the sequence number of the record being processed
– If the master sees two failures for the same record:
• The next worker is told to skip the record
Refinement: Pipelining
• The Hadoop Online Prototype (HOP) supports pipelining within and between MapReduce jobs: push rather than pull
– Preserves the simple fault-tolerance scheme
– Improved job completion time (better cluster utilization)
– Improved detection and handling of stragglers
• The MapReduce programming model is unchanged
– Clients supply the same job parameters
• The Hadoop client interface is backward compatible
– No changes required to existing clients
• E.g., Pig, Hive, Sawzall, Jaql
– Extended to take a series of jobs
MapReduce Statistics @ GOOG
                                Aug '04    Mar '06    Sep '07    Sep '09
Number of jobs (000)                 29        171      2,217      3,467
Average completion time (secs)      637        874        395        475
Machine-years                       217      2,002     11,081     25,562
Input data read (TB)              3,288     52,254    403,152    544,130
Intermediate data (TB)              758      6,743     34,774     90,120
Output data (TB)                    193      2,970     14,018     57,520
Average worker machines             157        268        394        488

• Take-away message:
– MapReduce is not a “new-fangled technology of the future”
– It is here, it is proven, use it!
MapReduce is Still Controversial
• Arg 1: MapReduce is a step backwards in database access
– MapReduce is not a database or a data storage and management system
– MapReduce is an algorithmic technique for the distributed processing of large amounts of data
• Arg 2: MapReduce is a poor implementation
– MapReduce is one way to generate indexes from a large volume of data, but it's not a data storage and retrieval system
• Arg 3: MapReduce is not novel
– Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what?
– The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers
MapReduce is Still Controversial (p2)
• Arg 4: MapReduce is missing features
• Arg 5: MapReduce is incompatible with DBMS tools
– The ability to process a huge volume of data quickly, as in web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness
• Arg 6: Even Google is replacing MapReduce
– Not much has been written about it; the replacement seems more focused on pipelining and incremental processing
Next meetings
• February 21: Mid-Term Exam
• February 28: Presentations
– Due: presentation report
• In-class presentation
– Please bring it on a thumb drive in PDF or PPT format
• A report to accompany the presentation, up to 5 pages
– Submit electronically via Turn-In; deadline: 2/27 midnight
– Due: a proposal for your project
• No more than 1 page, no less than 300 words
• Include an initial bibliography
• Will not be graded independently; feedback will be provided
• Will feed into your project grade