
Page 1: CS 542 Parallel DBs, NoSQL, MapReduce

CS 542 Database Management Systems: Parallel and Distributed Databases

Introduction to NoSQL databases and MapReduce

J Singh February 14, 2011

Page 2: CS 542 Parallel DBs, NoSQL, MapReduce


Today’s meeting

• Parallel and Distributed Databases, Chapter 20.1 – 20.4
– But only a part of 20.4; the rest comes with Query Optimization

• NoSQL Databases

• MapReduce

• References:
– Selected papers
– MapReduce textbook: Lin & Dyer, 2010 (Data-Intensive Text Processing with MapReduce)

Page 3: CS 542 Parallel DBs, NoSQL, MapReduce


Parallel Databases

• Motivation
– More transactions per second, or less time per query
– Throughput vs. response time
– Speedup vs. scaleup

• Database operations are naturally parallelizable
– E.g., consider a join between R and S on R.b = S.b (a sketch follows)
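To make that concrete, here is a minimal Python sketch, with hypothetical relations R and S, of how a shared-nothing system can hash-partition both inputs on the join column so that each worker joins only its own partition:

from collections import defaultdict

def partition(rows, key, k):
    """Hash-partition rows on `key` so matching tuples land on the same worker."""
    parts = [[] for _ in range(k)]
    for row in rows:
        parts[hash(row[key]) % k].append(row)
    return parts

def local_hash_join(r_rows, s_rows, key):
    """Ordinary hash join, run independently on each worker's partition."""
    index = defaultdict(list)
    for r in r_rows:
        index[r[key]].append(r)
    return [dict(r, **s) for s in s_rows for r in index[s[key]]]

k = 4  # number of workers; choosing k is a design decision (see later slides)
R = [{'a': 1, 'b': 10}, {'a': 2, 'b': 20}]
S = [{'b': 10, 'c': 'x'}, {'b': 30, 'c': 'y'}]
r_parts, s_parts = partition(R, 'b', k), partition(S, 'b', k)
result = [row for i in range(k)           # each i could be a separate machine
          for row in local_hash_join(r_parts[i], s_parts[i], 'b')]
print(result)  # [{'a': 1, 'b': 10, 'c': 'x'}]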

Page 4: CS 542 Parallel DBs, NoSQL, MapReduce


Parallel Databases

• Shared-nothing vs. shared-memory vs. shared-disk

Page 5: CS 542 Parallel DBs, NoSQL, MapReduce


Parallel Databases

• Shared Memory
– Communication between processors: extremely fast
– Scalability: not beyond 32 or 64 processors or so (the memory bus is the bottleneck)
– Notes: cache coherency is an issue
– Main use: low degrees of parallelism

• Shared Disk
– Communication between processors: disk interconnect is very fast
– Scalability: not very scalable (the disk interconnect is the bottleneck)
– Notes: transactions complicated; natural fault tolerance
– Main use: not used very often

• Shared Nothing
– Communication between processors: over a LAN, so slowest
– Scalability: very scalable
– Notes: distributed transactions are complicated (deadlock detection, etc.)
– Main use: everywhere

Page 6: CS 542 Parallel DBs, NoSQL, MapReduce


Date’s Rules for Distributed DBMS

• Rule 0: To the user, a distributed system should look exactly like a non-distributed system

• Other rules:
1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction management
9. Hardware independence
10. Operating system independence
11. Network independence
12. DBMS independence

Page 7: CS 542 Parallel DBs, NoSQL, MapReduce


Distributed Systems Headaches

• Especially painful when executing transactions that involve data from multiple sites:
– Keeping the databases in sync
• 2-phase commit for transactions is uniformly hated (we'll cover it on 4/11)
– Autonomy issues
• Even within an organization, people tend to be protective of their unit/department
– Locks/deadlock management

• Distribution works better for query processing
– Since we are only reading the data

Page 8: CS 542 Parallel DBs, NoSQL, MapReduce


Distributed Query Optimization

• Cost-based approach: consider all plans, pick the cheapest; similar to centralized optimization
– Communication costs must be considered
– Local site autonomy must be respected
– New distributed join methods

• The query site constructs a global plan, with suggested local plans describing processing at each site
– If a site can improve its suggested local plan, it is free to do so

Page 9: CS 542 Parallel DBs, NoSQL, MapReduce


Distributed Query Example

• Find all cities in Africa:

SELECT City.Name
FROM City, Country
WHERE City.CountryCode = Country.Code
  AND Country.Continent = 'Africa';

• Location assumptions:
– Country table: Server A
– City table: Server B
– Request arrives at Server A

• Think it through on the whiteboard (one possible plan is sketched below)
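One plan worth thinking through, sketched in Python with plain lists standing in for the two servers and comments standing in for network shipments (a semijoin-style strategy; the table contents are made up):

country = [{'Code': 'KEN', 'Name': 'Kenya', 'Continent': 'Africa'},
           {'Code': 'FRA', 'Name': 'France', 'Continent': 'Europe'}]   # Server A
city = [{'Name': 'Nairobi', 'CountryCode': 'KEN'},
        {'Name': 'Paris', 'CountryCode': 'FRA'}]                       # Server B

# Step 1 (Server A): reduce Country to just the join column; cheap to ship.
african_codes = {c['Code'] for c in country if c['Continent'] == 'Africa'}

# Step 2 (ship african_codes from A to B; Server B): filter City locally.
answer = [c['Name'] for c in city if c['CountryCode'] in african_codes]

# Step 3 (ship answer from B back to A): return to the requester.
print(answer)  # ['Nairobi']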

Page 10: CS 542 Parallel DBs, NoSQL, MapReduce


Detour: Paxos Algorithm

• Paxos is a family of protocols for solving consensus in a network of unreliable processors
– Consensus is the process of agreeing on one result among a group of participants
– The problem becomes difficult when the participants or their communication medium may experience failures

• Includes a spectrum of trade-offs between the number of processors, the number of message delays before learning the agreed value, the activity level of individual participants, the number of messages sent, and the types of failures tolerated

• Widely used, e.g., in Google's lock management service (Chubby), among other uses
– The above definitions are from Wikipedia. Feel free to pursue; not on the exam
– Contributions from Leslie Lamport, Nancy Lynch, Barbara Liskov

Page 11: CS 542 Parallel DBs, NoSQL, MapReduce


Updating Distributed Data

• Synchronous replication: all copies of a modified relation (fragment) must be updated before the modifying transaction commits
– Data distribution is made transparent to users

• Asynchronous replication: copies of a modified relation are only periodically updated; different copies may get out of sync in the meantime
– Users must be aware of data distribution
– Many current products follow this approach; also referred to as Master/Slave

Page 12: CS 542 Parallel DBs, NoSQL, MapReduce


Synchronous Replication

• Voting: a transaction must write a majority of copies to modify an object, and must read enough copies to be sure of seeing at least one most recent copy (sketch below)
– E.g., 10 copies: 7 written for an update, 4 read
– Each copy has a version number
– Usually not attractive because reads are common

• Read-any Write-all: writes are slower and reads are faster, relative to Voting
– Most common approach to synchronous replication
– But what if one of the 10 nodes is down?

• The choice of technique determines which locks to set
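A minimal sketch of the voting scheme, assuming N = 10, W = 7, R = 4 as in the example (a single-process simulation, not a real replication protocol):

import random

N, W, R = 10, 7, 4          # copies, write quorum, read quorum; W + R > N

class Copy(object):
    def __init__(self):
        self.version, self.value = 0, None

copies = [Copy() for _ in range(N)]

def quorum_write(value):
    """Read R copies to learn the latest version, then write any W copies."""
    latest = max(c.version for c in random.sample(copies, R))
    for c in random.sample(copies, W):
        c.version, c.value = latest + 1, value

def quorum_read():
    """Read any R copies; W + R > N guarantees at least one is current."""
    sample = random.sample(copies, R)
    return max(sample, key=lambda c: c.version).value

quorum_write('joe is 35')
print(quorum_read())  # 'joe is 35', whichever R copies we happened to read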

Page 13: CS 542 Parallel DBs, NoSQL, MapReduce


Peer-to-Peer Replication

• More than one copy of an object can be a master in this approach

• Changes to a master copy must somehow be propagated to the other copies

• If two master copies are changed in a conflicting manner, this must be resolved (e.g., Site 1 changes Joe's age to 35; Site 2 changes it to 36); one detection technique is sketched below

• Best used when conflicts do not arise:
– E.g., each master site owns a disjoint fragment
– E.g., updating rights are owned by one master at a time
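The slide doesn't name a detection mechanism; version vectors, the technique used by Dynamo-style systems, are one standard answer. A minimal sketch with hypothetical site names:

def concurrent(vv1, vv2):
    """Version vectors are concurrent iff neither dominates the other."""
    sites = set(vv1) | set(vv2)
    le = all(vv1.get(s, 0) <= vv2.get(s, 0) for s in sites)
    ge = all(vv1.get(s, 0) >= vv2.get(s, 0) for s in sites)
    return not le and not ge

# Site 1 and Site 2 both updated Joe's age without seeing each other's write:
v1 = {'site1': 1}           # Joe's age set to 35
v2 = {'site2': 1}           # Joe's age set to 36
print(concurrent(v1, v2))   # True: a genuine conflict; the application must resolve it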

Page 14: CS 542 Parallel DBs, NoSQL, MapReduce


Summary

• Parallel DBMSs are designed for scalable performance; relational operators are very well suited for parallel execution
– Pipeline and partitioned parallelism

• Distributed DBMSs offer site autonomy and distributed administration; they must revisit storage and catalog techniques, concurrency control, and recovery issues

• Thus far, we have taken an ad hoc approach to database parallelism and distribution
– Time to formalize it

Page 15: CS 542 Parallel DBs, NoSQL, MapReduce


Brewer’s Conjecture (p1)

• Source: Eric Brewer's July 2000 PODC keynote

• Main points:
– Classic “distributed systems” don't work
• They focus on computation, not data
• Distributing computation is easy; distributing data is hard
– DBMS research is (mostly) about ACID
• But we forfeit “C” and “I” for availability, graceful degradation, and performance
• This tradeoff is fundamental
– BASE:
• Basically Available
• Soft-state
• Eventual Consistency

Page 16: CS 542 Parallel DBs, NoSQL, MapReduce


Brewer’s Conjecture (p2)

• ACID
– Strong consistency
– Isolation
– Focus on “commit”
– Nested transactions
– Availability?
– Conservative (pessimistic)
– Difficult evolution (e.g., schema)

• BASE
– Weak consistency (stale data OK)
– Availability first
– Best effort
– Approximate answers OK
– Aggressive (optimistic)
– Simpler!
– Faster
– Easier evolution

“But I think it's a spectrum” (Eric Brewer)

Page 17: CS 542 Parallel DBs, NoSQL, MapReduce


CAP Theorem (p1)

• Brewer's take-home messages:
– Can have consistency & availability within a cluster, but it is still hard in practice
– OS/networking people are good at BASE/availability, but terrible at consistency
– Databases are better at C than at availability
• Wide-area databases can't have both
– All systems are probabilistic

• Since then:
– Brewer's conjecture was formally proved: Gilbert & Lynch, 2002
– Thus Brewer's conjecture became the CAP theorem…
– …and contributed to the birth of the NoSQL movement

Page 18: CS 542 Parallel DBs, NoSQL, MapReduce


CAP Theorem (p2)

• But the theory is not settled:
– Aren't Availability and Partition Tolerance the same thing?
– And shouldn't we be thinking about latency?

• Meanwhile, http://nosql-database.org/ lists 122 NoSQL databases

• References:
– Availability and Partition Tolerance, Jeff Darcy, 2010
– Problems with CAP…, Dan Abadi, 2010
– What does the Proof of the CAP Theorem mean?, Dan Weinreb, 2010

Page 19: CS 542 Parallel DBs, NoSQL, MapReduce

CS 542 Database Management Systems: NoSQL Databases

Page 20: CS 542 Parallel DBs, NoSQL, MapReduce


What is NoSQL?

• Stands for Not Only SQL

• Class of non-relational data storage systems

• Usually do not require a fixed table schema nor do they use the concept of joins

• All NoSQL offerings relax one or more of the ACID properties

Page 21: CS 542 Parallel DBs, NoSQL, MapReduce


Forces at Work

• Three major papers were the seeds of the NoSQL movement:
– CAP Theorem (discussed above)
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency

• Some types of data could not be modeled well in an RDBMS:
– Document storage and indexing
– Recursive data and graphs
– Time series data
– Genomics data

Page 22: CS 542 Parallel DBs, NoSQL, MapReduce


The Perfect Storm

• Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm

• Not a backlash/rebellion against RDBMS

• SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings

Page 23: CS 542 Parallel DBs, NoSQL, MapReduce


What kinds of NoSQL

• NoSQL solutions fall into two major areas:
– Key/Value, or “the big hash table” (sketch below):
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
– Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based:
• Cassandra (column-based)
• CouchDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
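A sketch of why this family is called “the big hash table”: the entire API surface is roughly get and put by key (a toy in-process stand-in, not any particular product):

# No schema, no joins; whatever structure lives inside `value` is the
# application's business, not the store's.
store = {}

def put(key, value):             # value could be an opaque blob or a JSON dict
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put('user:42', {'name': 'Joe', 'age': 35})
print(get('user:42')['name'])    # 'Joe'; lookups by key only, no WHERE clause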

Page 24: CS 542 Parallel DBs, NoSQL, MapReduce


Amazon SimpleDB

• Key-value store
– Written in Erlang (as is CouchDB)

• Data is modeled in terms of:
– Domain, a container of entities
– Item, an entity
– Attribute and Value, a property of an Item

• Eventually consistent, except when the ConsistentRead flag is specified
– Impressive performance numbers, e.g., 0.7 sec to store 1 million records

• SQL-like SELECT (example usage below):

select output_list
from domain_name
[where expression]
[sort_instructions]
[limit limit]
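A hypothetical usage sketch via the boto library of that era; treat the exact method names and keyword arguments as assumptions rather than a verified API reference:

import boto

conn = boto.connect_sdb()                      # credentials from the environment
domain = conn.create_domain('users')           # Domain: a container of Items
domain.put_attributes('joe', {'age': '35', 'city': 'Nairobi'})  # Item + Attributes

# SQL-like SELECT; consistent_read opts out of eventual consistency
rows = domain.select("select * from users where age = '35'", consistent_read=True)
for item in rows:
    print(item.name, dict(item))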

Page 25: CS 542 Parallel DBs, NoSQL, MapReduce


Google Datastore

• Part of App Engine; also used for internal applications
– Used for all storage
– Incorporates a transaction model to ensure high consistency
• Optimistic locking
• Transactions can fail

• CAP implications:
– Datastore isn't just “eventually consistent”
– They offer two commercial options (with different prices):
• Master/Slave: low latency but also lower availability; asynchronous replication
• High Replication: strong availability at the cost of higher latency

Page 26: CS 542 Parallel DBs, NoSQL, MapReduce


App Engine Architecture

[Architecture diagram: a Python VM process running the app and stdlib, fed by request/response; stateful APIs (memcache, datastore), stateless APIs (mail, images, urlfetch), and a read-only filesystem.]

Page 27: CS 542 Parallel DBs, NoSQL, MapReduce


Datastore Programming Model (p1)

• Entities have a Kind, a Key, and Properties
– Entity ~~ record ~~ Python dict ~~ Python class instance
– Key ~~ structured foreign key; includes the Kind
– Kind ~~ table ~~ Python class
– Property ~~ column or field; has a type

• Dynamically typed: property types are recorded per entity

• A key has either an id or a name
– The id is auto-assigned; alternatively, the name is set by the app
– A key can be a path including the parent key, and so on

• Paths define entity groups, which limit transactions (sketch below)
– A transaction locks the root entity (the parentless ancestor key)
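A minimal sketch against the 2011-era App Engine Python binding (google.appengine.ext.db); the Person model and key names are hypothetical:

from google.appengine.ext import db

class Person(db.Model):                 # Kind ~ Python class
    name = db.StringProperty()          # Property: typed, recorded per entity
    age = db.IntegerProperty()

root = Person(key_name='joe', name='Joe', age=35)   # name-based key
root.put()

child = Person(parent=root, key_name='joe-jr', name='Joe Jr')
child.put()   # key path: root is the entity-group root; a transaction
              # locks this group as a unit

def birthday(key):
    p = db.get(key)
    p.age += 1
    p.put()

db.run_in_transaction(birthday, root.key())   # optimistic; can fail and retry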

Page 28: CS 542 Parallel DBs, NoSQL, MapReduce


Datastore Programming Model (p2)

• GQL
– GQL offers SELECT but no INSERT, UPDATE, or JOIN
– Use the language bindings for INSERT, etc., and for transaction primitives (example below)

SELECT [* | __key__] FROM <kind>
  [WHERE <condition> [AND <condition> ...]]
  [ORDER BY <property> [ASC | DESC] [, <property> [ASC | DESC] ...]]
  [LIMIT [<offset>,]<count>]
  [OFFSET <offset>]

<condition> := <property> {< | <= | > | >= | = | !=} <value>
<condition> := <property> IN <list>
<condition> := ANCESTOR IS <entity or key>
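A hypothetical usage sketch of this GQL through the Python binding, reusing the Person model from the previous slide's sketch:

from google.appengine.ext import db

class Person(db.Model):        # as sketched on the previous slide
    name = db.StringProperty()
    age = db.IntegerProperty()

q = db.GqlQuery("SELECT * FROM Person WHERE age >= :1 ORDER BY age DESC", 21)
for person in q.fetch(limit=10):
    print(person.name)

# No INSERT/UPDATE in GQL: mutations go through the language binding instead.
person = Person(name='New User', age=30)
person.put()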

Page 29: CS 542 Parallel DBs, NoSQL, MapReduce


Datastore is Based on BigTable

• Provides scalable, structured storage
– Implemented as a sharded, sorted array
• Sharded: each block (tablet) lives on its own server
• Sorted: engineered to fetch the results of range queries with the fewest disk reads (a sketch follows the figure below)
– Operations: only these six
1. Read
2. Write
3. Delete
4. Update (atomic)
5. Prefix scan
6. Range scan

• Row names (keys): up to 64KB

• Columns: unlimited in size
– Divided into “column families”

[Figure: BigTable rows. Row names a…z are stored in sorted order and split into Tablet 1 (a–c), Tablet 2 (d–j), and Tablet 3 (n–z).]
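Not BigTable's implementation, just an illustration of why a sorted layout makes range and prefix scans cheap, using a sorted in-memory list and binary search:

import bisect

# Keeping rows sorted by key means a range scan touches one contiguous run
# of rows (and so few disk blocks) instead of probing the whole table.
rows = sorted([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('j', 5), ('n', 6)])
keys = [k for k, _ in rows]

def range_scan(lo, hi):
    """All rows with lo <= key < hi: two binary searches plus one slice."""
    return rows[bisect.bisect_left(keys, lo):bisect.bisect_left(keys, hi)]

def prefix_scan(prefix):
    # A prefix scan is just a range scan from prefix to its upper bound.
    return range_scan(prefix, prefix + '\xff')

print(range_scan('b', 'j'))   # [('b', 2), ('c', 3), ('d', 4)]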

Page 30: CS 542 Parallel DBs, NoSQL, MapReduce


Datastore Entity Implementation

• Shown:
– Index: Entities By Kind
– Index: Entities By Property ASC

• Not shown:
– Index: Entities By Property DESC
– Index: By Composite Property

[Figure: the same BigTable tablet layout as above, plus two index tables stored as BigTable rows: “Entities By Kind”, keyed by <AppID, Kind, Key>, and “Entities By Property ASC”, keyed by <AppID, propName, propVal, Key>, with a root/node tree combining scans over them.]

Page 31: CS 542 Parallel DBs, NoSQL, MapReduce


Datastore Application at Google

• Some production data, circa 2008.

• For more info, see the video of Ryan Barrett's talk at Google I/O.

Page 32: CS 542 Parallel DBs, NoSQL, MapReduce


Databases and Key-Value Stores

http://browsertoolkit.com/fault-tolerance.png

Page 33: CS 542 Parallel DBs, NoSQL, MapReduce

CS 542 Database Management Systems: MapReduce

Page 35: CS 542 Parallel DBs, NoSQL, MapReduce


Conceptual Underpinnings

• Programming model from Lisp and other functional languages (Python equivalents below):
– (map square '(1 2 3 4)) → (1 4 9 16)
– (reduce + '(1 4 9 16)) → 30

• Easy to distribute

• Nice failure/retry semantics
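The same two calls in Python (where reduce lives in functools):

from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # 30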

Page 36: CS 542 Parallel DBs, NoSQL, MapReduce


MapReduce Flow

Page 37: CS 542 Parallel DBs, NoSQL, MapReduce


Example: Reverse index words in docs

• Input: crawler yields (url, content) pairs

• Map function:
– map(key = url, value = content)
• For each word w in content, emit (w, [url, offset])

• Reduce function (a Python sketch follows):
– reduce(key = word, values = list of [url, offset])
• Sort values
• Emit (word, sorted list of [url, offset])
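A minimal Python sketch of the two user-supplied functions for this example; emit is simulated by returning lists, and the framework plumbing is omitted:

def map_fn(url, content):
    """map(key=url, value=content): emit (word, (url, offset)) pairs."""
    out = []
    for offset, word in enumerate(content.split()):
        out.append((word, (url, offset)))
    return out

def reduce_fn(word, postings):
    """reduce(key=word, values=list of (url, offset)): emit sorted postings."""
    return (word, sorted(postings))

pairs = map_fn('http://a', 'to be or not to be')
print(pairs[0])                       # ('to', ('http://a', 0))
print(reduce_fn('to', [p for w, p in pairs if w == 'to']))
# ('to', [('http://a', 0), ('http://a', 4)])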

Page 38: CS 542 Parallel DBs, NoSQL, MapReduce


Implementation Questions

• Map:
– How many processors should we use? 4? 32? 1024?

• Reduce:
– How many processors?
– What's the allocation algorithm for assigning words to processors?

• These are all design decisions, driven by the size and other characteristics of the problem

Page 39: CS 542 Parallel DBs, NoSQL, MapReduce


Implementation Steps

• Split the key/value pairs into M chunks; run a map task on each chunk in parallel

• Partition the output of the map tasks into R regions

• After all map tasks complete:
– Why do we need to wait?

• Run a reduce task on each of the R regions (a driver sketch follows)
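A toy driver that makes the M/R split and the barrier concrete (sequential where a real system runs tasks in parallel; the partition function hash(key) % R is the usual default):

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, M=4, R=2):
    chunks = [records[i::M] for i in range(M)]          # 1. M map-task inputs
    regions = [defaultdict(list) for _ in range(R)]     # 2. R reduce regions
    for chunk in chunks:                                # each map task (parallel in reality)
        for key, value in chunk:
            for out_key, out_val in map_fn(key, value):
                regions[hash(out_key) % R][out_key].append(out_val)
    # 3. Barrier: a reduce task needs *all* values for its keys, and the last
    #    map task to finish may still emit to any region; hence the wait.
    return [reduce_fn(k, vals) for region in regions
            for k, vals in sorted(region.items())]

# Tiny demo with inverted-index functions like the previous slide's sketch:
docs = [('u1', 'to be or not to be'), ('u2', 'to go')]
word_map = lambda url, text: [(w, (url, i)) for i, w in enumerate(text.split())]
word_reduce = lambda w, postings: (w, sorted(postings))
print(run_mapreduce(docs, word_map, word_reduce))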

Page 40: CS 542 Parallel DBs, NoSQL, MapReduce


Fault Tolerance

• Problem detection:
– Heartbeat

• Remedy:
– Re-execute in-progress and completed map tasks (completed map output lives on the failed worker's local disk, so it is lost with the worker)
– Re-execute in-progress reduce tasks only (completed reduce output is already in the global file system)

Page 41: CS 542 Parallel DBs, NoSQL, MapReduce


Applications

• Not limited to data analysis tasks
– Any task that is parallelizable, e.g., prepare a report for each user
– The importance of idempotence: it must be possible to rerun a task

- name: prepUserReport
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: Service.prepUserReport
    params:
    - name: entity_kind
      default: DataModel.Org
    - name: done_callback
      default: /mr_done/prep

Page 42: CS 542 Parallel DBs, NoSQL, MapReduce


Refinements: Usability

• In the App Engine environment, automation has been the focus:
– Automatic sharding for faster execution
– Automatic rate limiting for slow execution
– Status pages (demo in a minute)
– Counters
– Parameterized mappers
– Batching datastore operations
– Iterating over blob data

Page 43: CS 542 Parallel DBs, NoSQL, MapReduce


Refinement: Redundant Execution

• Slow workers significantly delay completion time:
– Other jobs consuming resources on the machine
– Bad disks with soft errors transfer data slowly
– Weird things: processor caches disabled (!!)

• Solution: near the end of a phase, spawn backup tasks
– Whichever one finishes first “wins”

• Dramatically shortens job completion time

Page 44: CS 542 Parallel DBs, NoSQL, MapReduce


Refinement: Locality Optimization

• Master scheduling policy:
– Asks GFS for the locations of the replicas of the input file blocks
– Map task inputs are typically split into 64MB chunks (the GFS block size)
– Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack

• Effect:
– Thousands of machines read input at local-disk speed
• Without this, rack switches limit the read rate

Page 45: CS 542 Parallel DBs, NoSQL, MapReduce


Refinement: Skipping Bad Records

• Map/Reduce functions sometimes fail for particular inputs
– The best solution is to debug & fix
• Not always possible: third-party source libraries
– On a segmentation fault:
• Send a UDP packet to the master from the signal handler
• Include the sequence number of the record being processed
– If the master sees two failures for the same record:
• The next worker is told to skip the record

Page 46: CS 542 Parallel DBs, NoSQL, MapReduce


Refinement: Pipelining

• The Hadoop Online Prototype (HOP) supports pipelining within and between MapReduce jobs: push rather than pull
– Preserves the simple fault-tolerance scheme
– Improved job completion time (better cluster utilization)
– Improved detection and handling of stragglers

• MapReduce programming model unchanged
– Clients supply the same job parameters

• Hadoop client interface backward compatible
– No changes required to existing clients (e.g., Pig, Hive, Sawzall, Jaql)
– Extended to take a series of jobs

Page 47: CS 542 Parallel DBs, NoSQL, MapReduce


MapReduce Statistics @ GOOG

                                Aug '04   Mar '06   Sep '07   Sep '09
Number of Jobs (000)                 29       171     2,217     3,467
Average Completion Time (secs)      637       874       395       475
Machine-years                       217     2,002    11,081    25,562
Input Data Read (TB)              3,288    52,254   403,152   544,130
Intermediate Data (TB)              758     6,743    34,774    90,120
Output Data (TB)                    193     2,970    14,018    57,520
Avg Worker Machines                 157       268       394       488

• Take-away message:
– MapReduce is not a “new-fangled technology of the future”
– It is here, it is proven; use it!

Page 48: CS 542 Parallel DBs, NoSQL, MapReduce


MapReduce is Still Controversial

• Arg 1: MapReduce is a step backwards in database access
– MapReduce is not a database, a data storage system, or a management system
– MapReduce is an algorithmic technique for the distributed processing of large amounts of data

• Arg 2: MapReduce is a poor implementation
– MapReduce is one way to generate indexes from a large volume of data, but it's not a data storage and retrieval system

• Arg 3: MapReduce is not novel
– Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what?
– The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers

Page 49: CS 542 Parallel DBs, NoSQL, MapReduce


MapReduce is Still Controversial (p2)

• Arg 4: MapReduce is missing features

• Arg 5: MapReduce is incompatible with DBMS tools
– The ability to quickly process a huge volume of data, as in web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness

• Arg 6: Even Google is replacing MapReduce
– Not much written about it; the replacement seems more focused on pipelining and incremental processing

Page 50: CS 542 Parallel DBs, NoSQL, MapReduce


Next meetings

• February 21: Mid-Term Exam

• February 28: Presentations
– Due: presentation report
• In-class presentation: please bring it on a thumb drive in PDF or PPT format
• Report to accompany the presentation, up to 5 pages: submit electronically via Turn-In; deadline 2/27 midnight
– Due: a proposal for your project
• No more than 1 page, no less than 300 words
• Include an initial bibliography
• Will not be graded independently; feedback will be provided
• Will feed into your project grade