CS 542 Database Management Systems: Parallel and Distributed Databases
Introduction to NoSQL databases and MapReduce
J Singh, February 14, 2011
Today’s meeting
• Parallel and Distributed Databases, Chapter 20.1 – 20.4
– But only a part of 20.4; the rest w/ Query Optimization
• NoSQL Databases
• MapReduce
• References:
– Selected papers
– Lin & Dyer, Data-Intensive Text Processing with MapReduce, 2010 (the MapReduce textbook)
Parallel Databases
• Motivation
– More transactions per second, or less time per query
– Throughput vs. response time
– Speedup vs. scaleup
• Database operations are highly parallelizable
– E.g., consider a join between R and S on R.b = S.b (a sketch follows below)
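This parallelizes naturally because tuples can only join if they agree on b. A minimal sketch (not from the slides; the partitioned hash join shown is a standard technique and all names are illustrative): hash-partition both relations on b so matching tuples land on the same worker, then join locally.

from collections import defaultdict

def partitioned_hash_join(R, S, num_workers):
    # Phase 1: hash-partition both relations on the join attribute b.
    R_parts, S_parts = defaultdict(list), defaultdict(list)
    for a, b in R:                                   # R tuples are (a, b)
        R_parts[hash(b) % num_workers].append((a, b))
    for b, c in S:                                   # S tuples are (b, c)
        S_parts[hash(b) % num_workers].append((b, c))
    # Phase 2: each "worker" (a loop iteration here) joins its partitions
    # independently; no worker ever needs another worker's tuples.
    result = []
    for w in range(num_workers):
        build = defaultdict(list)                    # build a hash table on R
        for a, b in R_parts[w]:
            build[b].append(a)
        for b, c in S_parts[w]:                      # probe it with S
            for a in build[b]:
                result.append((a, b, c))
    return result

print(partitioned_hash_join([(1, 'x'), (2, 'y')], [('x', 10), ('x', 11)], 4))
# [(1, 'x', 10), (1, 'x', 11)]

With p workers, each local join sees roughly 1/p of each relation, so per-node work and memory shrink as machines are added.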
Parallel Databases
• Shared-nothing vs. shared-memory vs. shared-disk
Parallel Databases
• Shared Memory
– Communication between processors: extremely fast
– Scalability: not beyond 32 or 64 processors or so (the memory bus is the bottleneck)
– Notes: cache coherency is an issue
– Main use: low degrees of parallelism
• Shared Disk
– Communication between processors: disk interconnect is very fast
– Scalability: not very scalable (the disk interconnect is the bottleneck)
– Notes: transactions are complicated; natural fault tolerance
– Main use: not used very often
• Shared Nothing
– Communication between processors: over a LAN, so slowest
– Scalability: very scalable
– Notes: distributed transactions are complicated (deadlock detection etc.)
– Main use: everywhere
Date’s Rules for Distributed DBMS
• Rule 0: To the user, a distributed system should look exactly like a non-distributed system.
• Other rules:
1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction management
9. Hardware independence
10. Operating system independence
11. Network independence
12. DBMS independence
Distributed Systems Headaches
• Especially if trying to execute transactions that involve data from multiple sites:
– Keeping the databases in sync
• 2-phase commit for transactions is uniformly hated (we'll cover it on 4/11)
– Autonomy issues
• Even within an organization, people tend to be protective of their unit/department
– Lock/deadlock management
• Works better for query processing
– Since we are only reading the data
Distributed Query Optimization
• Cost-based approach: consider all plans, pick the cheapest; similar to centralized optimization.
– Communication costs must be considered.
– Local site autonomy must be respected.
– New distributed join methods.
• The query site constructs a global plan, with suggested local plans describing the processing at each site.
– If a site can improve its suggested local plan, it is free to do so.
Distributed Query Example
• Find all cities in Africa:
SELECT City.Name
FROM City, Country
WHERE City.CountryCode = Country.Code
  AND Country.Continent = 'Africa';
• Location assumptions:
– Country table: Server A
– City table: Server B
– The request arrives at Server A
• Think it through on the whiteboard (one way to reason about it is sketched below)
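One way to think it through (the row counts are hypothetical, for illustration only): suppose Country has ~240 rows, of which ~60 are African, and City has ~4,000 rows. Plan 1: ship the entire City table from B to A and join at A; this moves ~4,000 rows. Plan 2: apply Continent = 'Africa' at A, ship the ~60 qualifying Country rows to B, join at B, and ship only the African city names back; this moves ~60 rows plus the answer. Plan 3 (a semijoin): ship just the qualifying Country.Code values to B, filter City there, and return the matches. Because the filter is highly selective, Plans 2 and 3 move far less data, which is why communication cost dominates distributed plan selection.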
Detour: Paxos Algorithm
• Paxos is a family of protocols for solving consensus in a network of unreliable processors.
– Consensus is the process of agreeing on one result among a group of participants.
– This problem becomes difficult when the participants or their communication medium may experience failures.
• Includes a spectrum of trade-offs among:
– the number of processors, the number of message delays before learning the agreed value, the activity level of individual participants, the number of messages sent, and the types of failures tolerated.
• Widely used, e.g., in Google's lock management service, among other uses.
– The above definitions are from Wikipedia. Feel free to pursue; not on the exam.
– Contributions from Leslie Lamport, Nancy Lynch, Barbara Liskov.
Updating Distributed Data
• Synchronous Replication: all copies of a modified relation (fragment) must be updated before the modifying transaction commits.
– Data distribution is made transparent to users.
• Asynchronous Replication: copies of a modified relation are only periodically updated; different copies may get out of sync in the meantime.
– Users must be aware of data distribution.
– Many current products follow this approach. Also referred to as Master/Slave.
Synchronous Replication
• Voting: a transaction must write a majority of copies to modify an object, and must read enough copies to be sure of seeing at least one most-recent copy.
– E.g., 10 copies; 7 written for an update, 4 copies read (7 + 4 > 10, so every read quorum overlaps the latest write quorum; a sketch follows below)
– Each copy has a version number.
– Usually not attractive because reads are common.
• Read-any Write-all: writes are slower and reads are faster, relative to Voting.
– The most common approach to synchronous replication.
– But what if one of the 10 nodes is down?
• The choice of technique determines which locks to set.
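A minimal sketch of the Voting scheme with the slide's numbers (N = 10 copies, write quorum W = 7, read quorum R = 4; the class and function names are illustrative, not from any particular system):

import random

N, W, R = 10, 7, 4                # copies, write quorum, read quorum; W + R > N

class Copy:
    def __init__(self):
        self.version, self.value = 0, None

copies = [Copy() for _ in range(N)]

def quorum_write(value, version):
    # Install the new version on any W of the N copies.
    for c in random.sample(copies, W):
        if version > c.version:
            c.version, c.value = version, value

def quorum_read():
    # Read any R copies. Because W + R > N, at least one of them belongs
    # to the latest write quorum, so the highest version seen is current.
    sampled = random.sample(copies, R)
    return max(sampled, key=lambda c: c.version).value

quorum_write("Joe is 35", version=1)
print(quorum_read())              # "Joe is 35" on every run

Read-any Write-all is the degenerate case W = N, R = 1, which is why a single down node blocks writes.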
Peer-to-Peer Replication
• More than one of the copies of an object can be a master in this approach.
• Changes to a master copy must be propagated to other copies somehow.
• If two master copies are changed in a conflicting manner, this must be resolved. (e.g., Site 1: Joe’s age changed to 35; Site 2: to 36)
• Best used when conflicts do not arise:
– E.g., each master site owns a disjoint fragment.
– E.g., updating rights are owned by one master at a time.
Summary
• Parallel DBMSs are designed for scalable performance. Relational operators are very well-suited for parallel execution.
– Pipeline and partitioned parallelism.
• Distributed DBMSs offer site autonomy and distributed administration. Must revisit storage and catalog techniques, concurrency control, and recovery issues.
• Thus far, we have taken an ad hoc approach to database parallelism and distribution.
– Time to formalize it.
Brewer’s Conjecture (p1)
• Source: Eric Brewer's July 2000 PODC keynote
• Main points:
– Classic "distributed systems" don't work
• They focus on computation, not data
• Distributing computation is easy; distributing data is hard
– DBMS research is about ACID (mostly)
• But we forfeit "C" and "I" for availability, graceful degradation, and performance
• This tradeoff is fundamental
– BASE
• Basically Available
• Soft-state
• Eventual Consistency
Brewer’s Conjecture (p2)
• ACID
– Strong consistency
– Isolation
– Focus on "commit"
– Nested transactions
– Availability?
– Conservative (pessimistic)
– Difficult evolution (e.g., schema)
• BASE
– Weak consistency (stale data OK)
– Availability first
– Best effort
– Approximate answers OK
– Aggressive (optimistic)
– Simpler!
– Faster
– Easier evolution
• "But I think it's a spectrum" (Eric Brewer)
CAP Theorem (p1)
• Brewer's take-home messages:
– Can have consistency & availability within a cluster, but it is still hard in practice
– OS/networking is good at BASE/availability, but terrible at consistency
– Databases are better at C than at availability
• Wide-area databases can't have both
– All systems are probabilistic
• Since then:
– Brewer's conjecture was formally proved (Gilbert & Lynch, 2002)
– Thus Brewer's conjecture became the CAP theorem…
– …and contributed to the birth of the NoSQL movement
CAP Theorem (p2)
• But the theory is not settled:
– Aren't availability and partition tolerance the same thing?
– And shouldn't we be thinking about latency?
• Meanwhile, http://nosql-database.org/ lists 122 NoSQL databases
• References:
– Availability and Partition Tolerance, Jeff Darcy, 2010
– Problems with CAP…, Dan Abadi, 2010
– What does the Proof of the CAP Theorem mean?, Dan Weinreb, 2010
CS 542 Database Management Systems: NoSQL Databases
What is NoSQL?
• Stands for "Not Only SQL"
• A class of non-relational data storage systems
• Usually do not require a fixed table schema, nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
Forces at Work
• Three major papers were the seeds of the NoSQL movement:
– CAP Theorem (discussed above)
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
• Some types of data could not be modeled well in an RDBMS:
– Document storage and indexing
– Recursive data and graphs
– Time series data
– Genomics data
The Perfect Storm
• Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that none of the current NoSQL offerings can rival
What kinds of NoSQL
• NoSQL solutions fall into two major areas:
– Key/Value, or "the big hash table":
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
– Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based:
• Cassandra (column-based)
• CouchDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
Amazon SimpleDB
• Key-value store
– Written in Erlang (as is CouchDB)
• Data is modeled in terms of:
– Domain, a container of entities
– Item, an entity
– Attribute and Value, a property of an Item
• Eventually consistent, except when the ConsistentRead flag is specified
– Impressive performance numbers, e.g., 0.7 sec to store 1 million records
• SQL-like SELECT (example below):
select output_list from domain_name [where expression] [sort_instructions] [limit limit]
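A hypothetical example instantiating the grammar above (the domain name cities and its attributes are illustrative):

select Name from cities where Continent = 'Africa' limit 10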
Google Datastore
• Part of App Engine; also used for internal applications
– Used for all storage
– Incorporates a transaction model to ensure high consistency
• Optimistic locking
• Transactions can fail
• CAP implications:
– Datastore isn't just "eventually consistent"
– They offer two commercial options (with different prices):
• Master/Slave: low latency but also lower availability; asynchronous replication
• High Replication: strong availability at the cost of higher latency
App Engine Architecture
[Figure: App Engine architecture. The request/response path enters a Python VM process containing the stdlib and the app; the app calls stateful APIs (memcache, datastore) and stateless APIs (images, urlfetch), plus a read-only filesystem.]
Datastore Programming Model (p1)
• Entities have a Kind, a Key, and Properties
– Entity ~~ record ~~ Python dict ~~ Python class instance
– Key ~~ structured foreign key; includes the Kind
– Kind ~~ table ~~ Python class
– Property ~~ column or field; has a type
• Dynamically typed: Property types are recorded per Entity
• A Key has either an id or a name
– The id is auto-assigned; alternatively, the name is set by the app
– A key can be a path including the parent key, and so on
• Paths define entity groups, which limit transactions (a sketch follows below)
– A transaction locks the root entity (the parentless ancestor key)
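A minimal sketch in the Python db API roughly as it stood circa 2011 (the Guestbook/Greeting model is illustrative, not from the slides):

from google.appengine.ext import db

class Guestbook(db.Model):             # Kind "Guestbook" (a root entity)
    name = db.StringProperty()

class Greeting(db.Model):              # Kind "Greeting"
    author = db.StringProperty()
    content = db.TextProperty()
    date = db.DateTimeProperty(auto_now_add=True)

book = Guestbook(key_name='main', name='Main book')   # key name set by the app
book.put()

# parent=... places the greeting in the guestbook's entity group, so a
# transaction can cover both (it locks the root entity, i.e., the guestbook).
greeting = Greeting(parent=book, author='JD', content='hello')
greeting.put()                         # numeric id auto-assigned by the datastore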
Datastore Programming Model (p2)
• GQL
– GQL offers SELECT but no INSERT, UPDATE, or JOIN
– Use the language bindings for INSERT, etc., and for the transaction primitives
• Grammar (an example follows below):

SELECT [* | __key__] FROM <kind>
  [WHERE <condition> [AND <condition> ...]]
  [ORDER BY <property> [ASC | DESC] [, <property> [ASC | DESC] ...]]
  [LIMIT [<offset>,]<count>]
  [OFFSET <offset>]

<condition> := <property> {< | <= | > | >= | = | != } <value>
<condition> := <property> IN <list>
<condition> := ANCESTOR IS <entity or key>
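For instance, a hypothetical query over the Greeting kind sketched above:

SELECT * FROM Greeting WHERE author = 'JD' ORDER BY date DESC LIMIT 10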
Datastore is Based on BigTable
• Provides scalable, structured storage
– Implemented as a sharded, sorted array
• Sharded: each block (tablet) lives on its own server
• Sorted: engineered to fetch the results of range queries with the fewest disk reads (a toy sketch of this idea follows below)
– Operations: only these six
1. Read
2. Write
3. Delete
4. Update (atomic)
5. Prefix scan
6. Range scan
• Row names (keys): up to 64KB
• Columns: unlimited in size
– Divided into "column families"
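A toy sketch of why a sorted array makes scans cheap (the Tablet class is illustrative, not BigTable's actual interface): sorted keys let point reads use binary search and let prefix/range scans touch one contiguous slice.

import bisect

class Tablet:
    # Toy sorted key-value array supporting the six operations above.
    def __init__(self):
        self.keys, self.vals = [], []

    def write(self, key, value):                 # Write / atomic Update
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = value
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, value)

    def read(self, key):                         # Read
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]

    def delete(self, key):                       # Delete
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i]
            del self.vals[i]

    def range_scan(self, lo, hi):                # Range scan: one contiguous slice
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.vals[i:j]))

    def prefix_scan(self, prefix):               # Prefix scan as a range scan
        return self.range_scan(prefix, prefix + '\xff')   # assumes key chars < '\xff'

t = Tablet()
t.write('apple', 1)
t.write('banana', 2)
t.write('apricot', 3)
print(t.prefix_scan('ap'))                       # [('apple', 1), ('apricot', 3)]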
[Figure: BigTable rows, sorted by row name, split into tablets; e.g., rows a–c in Tablet 1, d–j in Tablet 2, n–z in Tablet 3.]
Datastore Entity Implementation
• Shown:
– Index: Entities By Kind
– Index: Entities By Property ASC
• Not shown:
– Index: Entities By Property DESC
– Index: By Composite Property
[Figure: the same BigTable-rows diagram alongside two index tables: "Entities By Kind," whose row names have the form <AppID, Kind, Key>, and "Entities By Property ASC," whose row names have the form <AppID, propName, propVal, Key>, plus an entity-group tree (Root with Nodes 1, 2, 3).]
Datastore Application at Google
• Some production data, circa 2008.
• For more info, see the video of Ryan Barrett's talk at Google I/O.
Databases and Key-Value Stores
http://browsertoolkit.com/fault-tolerance.png
CS 542 Database Management Systems: MapReduce
The Story of Sam
Conceptual Underpinnings
• A programming model from Lisp and other functional languages (Python equivalents below):
– (map square '(1 2 3 4)) → (1 4 9 16)
– (reduce + '(1 4 9 16)) → 30
• Easy to distribute
• Nice failure/retry semantics
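The same two forms in Python, as a direct translation (not from the slides):

from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # 30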
MapReduce Flow
Example: Reverse index words in docs
• Input: the crawler yields (url, content) pairs
• Map function: map(key = url, value = content)
– For each word w in content, emit (w, [url, offset])
• Reduce function: reduce(key = word, values = list of [url, offset])
– Sort the values
– Emit (word, sorted list of [url, offset]) (a runnable sketch follows below)
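A minimal single-process sketch of this job (the function names and the in-memory shuffle are illustrative; a real framework runs the three phases on many machines):

from collections import defaultdict

def map_fn(url, content):
    # Emit (word, (url, offset)) for every word occurrence.
    for offset, word in enumerate(content.split()):
        yield word, (url, offset)

def reduce_fn(word, postings):
    # Sort each word's postings and emit the final index entry.
    return word, sorted(postings)

def run(docs):
    groups = defaultdict(list)        # the "shuffle": group pairs by key
    for url, content in docs.items():
        for word, posting in map_fn(url, content):
            groups[word].append(posting)
    return dict(reduce_fn(w, ps) for w, ps in groups.items())

index = run({'u1': 'to be or not to be', 'u2': 'not to worry'})
print(index['to'])                    # [('u1', 0), ('u1', 4), ('u2', 1)]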
Implementation Questions
• Map:
– How many processors should we use? 4? 32? 1024?
• Reduce:
– How many processors?
– What's the allocation algorithm for assigning words to processors?
• These are all design decisions, driven by:
– The size and other characteristics of the problem
Implementation Steps
• Split the key/value pairs into M chunks; run a map task on each chunk in parallel
• Partition the output of the map tasks into R regions (the standard partition function is sketched below)
• After all map tasks complete:
– Why do we need to wait? (A reduce task needs every value for its key; a still-running map task could emit more.)
• Run a reduce task on each of the R regions
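The usual partitioning rule is a hash on the intermediate key, modulo R (as in the MapReduce paper); a one-line sketch:

def partition(key, R):
    # All pairs sharing a key land in the same region, so exactly one
    # reduce task sees each key's complete value list.
    return hash(key) % R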
Fault Tolerance
• Problem detection:
– Heartbeat
• Remedy:
– Re-execute in-progress and completed map tasks (completed map output lives on the failed worker's local disk, so it is lost with the worker)
– Re-execute in-progress reduce tasks
Applications
• Not limited to data analysis tasks
– Any task that is parallelizable, e.g., prepare a report for each user
– The importance of idempotence: it must be possible to rerun a task
• Example mapper configuration (YAML):

- name: prepUserReport
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: Service.prepUserReport
    params:
    - name: entity_kind
      default: DataModel.Org
    - name: done_callback
      default: /mr_done/prep
Refinements: Usability
• In the App Engine environment, automation has been the focus:
– Automatic sharding for faster execution
– Automatic rate limiting for slow execution
– Status pages (demo in a minute)
– Counters
– Parameterized mappers
– Batching datastore operations
– Iterating over blob data
Refinement: Redundant Execution
• Slow workers significantly delay completion time:
– Other jobs consuming resources on the machine
– Bad disks with soft errors transfer data slowly
– Weird things: processor caches disabled (!!)
• Solution: near the end of a phase, spawn backup tasks
– Whichever one finishes first "wins"
• Dramatically shortens job completion time
Refinement: Locality Optimization
• Master scheduling policy:
– Asks GFS for the locations of the replicas of the input file blocks
– Map tasks are typically split at 64MB (the GFS block size)
– Map tasks are scheduled so that a GFS replica of the input block is on the same machine or the same rack
• Effect:
– Thousands of machines read input at local-disk speed
• Without this, rack switches limit the read rate
Refinement: Skipping Bad Records
• Map/Reduce functions sometimes fail for particular inputs
– The best solution is to debug & fix
• Not always possible, e.g., with third-party source libraries
– On a segmentation fault:
• Send a UDP packet to the master from the signal handler
• Include the sequence number of the record being processed
– If the master sees two failures for the same record:
• The next worker is told to skip the record
Refinement: Pipelining
• The Hadoop Online Prototype (HOP) supports pipelining within and between MapReduce jobs: push rather than pull
– Preserves the simple fault-tolerance scheme
– Improved job completion time (better cluster utilization)
– Improved detection and handling of stragglers
• The MapReduce programming model is unchanged
– Clients supply the same job parameters
• The Hadoop client interface is backward compatible
– No changes required to existing clients
• E.g., Pig, Hive, Sawzall, Jaql
– Extended to take a series of jobs
MapReduce Statistics @ GOOG
                                Aug '04    Mar '06    Sep '07    Sep '09
Number of jobs (000)                 29        171      2,217      3,467
Average completion time (secs)      637        874        395        475
Machine-years                       217      2,002     11,081     25,562
Input data read (TB)              3,288     52,254    403,152    544,130
Intermediate data (TB)              758      6,743     34,774     90,120
Output data (TB)                    193      2,970     14,018     57,520
Average worker machines             157        268        394        488

• Take-away message:
– MapReduce is not a “new-fangled technology of the future”
– It is here, it is proven, use it!
MapReduce is Still Controversial
• Arg 1: MapReduce is a step backwards in database access
– MapReduce is not a database or a data storage and management system
– MapReduce is an algorithmic technique for the distributed processing of large amounts of data
• Arg 2: MapReduce is a poor implementation
– MapReduce is one way to generate indexes from a large volume of data, but it's not a data storage and retrieval system
• Arg 3: MapReduce is not novel
– Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what?
– The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers
MapReduce is Still Controversial (p2)
• Arg 4: MapReduce is missing features
• Arg 5: MapReduce is incompatible with DBMS tools
– The ability to process a huge volume of data quickly, as in web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness
• Arg 6: Even Google is replacing MapReduce
– Not much has been written about it; the replacement seems more focused on pipelining and incremental processing
Next meetings
• February 21: Mid-Term Exam
• February 28: Presentations
– Due: presentation report
• In-class presentation
– Please bring it on a thumb drive in PDF or PPT format
• A report to accompany the presentation, up to 5 pages
– Submit electronically via Turn-In; deadline: 2/27 midnight
– Due: a proposal for your project
• No more than 1 page, no less than 300 words
• Include an initial bibliography
• Will not be graded independently; feedback will be provided
• Will feed into your project grade