big table alon pluda. introduction data model api building blocks implementation refinements...

Big TableAlon pluda

• Introduction• Data model• API• Building blocks• Implementation• Refinements• Performance• Applications• conclusions

(big) Table of content



לגו שולחן

•Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size• petabytes of data across thousands of commodity servers

•Many projects at Google store data in Bigtable • web indexing, Google Earth, Google Finance…

• These applications place very different demands on Bigtable• in terms of data size (from URLs to web pages to satellite imagery)• In terms of latency requirements (from backend bulk processing to real-time

data serving).

• It is not a relational database, it is a sparse, distributed, persistent multi-dimensional sorted map (key/value store)

introduction

Data model

Column Family1 Column Family2 Column Family )anchor)content: language: my.look.ca: cnnsi.com

com.cnn.www

com.bbc.www

il.co.ynet.www

org.apache.hbase

org.apache.hadoop

<html><html>

<html> t1t2

t3EN anchor1 anchor2

Column family with one column

key

Column family with multiple column key

Every cell is an uninterpreted array of bytestwo per-column-family settings for

automatic garbage-collect:• only the last n versions of

a cell be kept• only new-enough versions

be keptThe row range for a table is dynamically

partitioned in tablets

• Introduction• Data model• API• Building blocks• Implementation• Refinements• Performance • Applications• conclusions


API• The Bigtable API provides functions :• Creating and deleting tables and column families.• Changing cluster , table and column family metadata.• Support for single row transactions• Allows cells to be used as integer counters• Client supplied scripts can be executed in the address space of servers

// Open the tableTable *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchorRowMutation r1(T, "com.cnn.www");r1.Set("anchor:www.c-span.org", "CNN");r1.Delete("anchor:www.abc.com");Operation op;Apply(&op, &r1);

API

Scanner scanner(T);ScanStream *stream;stream = scanner.FetchColumnFamily("anchor");stream->SetReturnAllVersions();scanner.Lookup("com.cnn.www");for (; !stream->Done(); stream->Next()) {

printf("%s %s %lld %s\n",scanner.RowName(),stream->ColumnName(),stream->MicroTimestamp(),stream->Value());

}

API

Building blocksBigtable is built on several other pieces of Google infrastructure:

A. Google File System (GFS):• Bigtable uses the distributed Google File System (GFS) to store Metedata,

data and log.

B. Google SSTable file format(sorted string table)• Bigtable data is internally stored in Google SSTable file format. An SSTable

provides a persistent ordered immutable map from keys to values.

C. Google Chubby• Bigtable relies on a highly-available and persistent distributed lock service

called Chubby

Building blocksGoogle file system:• shared pool of machines that run a wide variety of

other distributed applications. • Bigtable depends on a cluster management system

for scheduling jobs, dealing with machine failures, and monitoring machine status.• Three major compunents: - log files: each tablet have its Own log file. - DATA: data of the tablet (stored in SSTable file) - METADATA: tablets location (also stored in SSTable file)

Building blocksGoogle SSTable file format• Contains a sequence of 64 KB Blocks

• Optionally, an SSTable can be completely mapped into memory

• Block index stored at the end of the file• Used to locate blocks• Index loaded in memory when the SSTable is opened• Lookup is performed with a single dist seek

1. Find the appropriate block by performing a binary search in the in-memory index2. Reading the appropriate block from disk

Building blocksGoogle Chubby• consists of five active replicas, one of which is elected to be the master

and actively serve requests

• Chubby provides A namespace that contains directories and small files (less then 256KB)

– Each Directory or file can be used as a lock– Reads And writes to a file are atomic– Chubby Client library provides consistent caching of Chubby files– Each Chubby Client maintains a session with a Chubby service

• Bigtable use Chubby file for many tasks:– To Ensure there is at most one active master at any time– To Store the bootstrap location of Bigtable Data (Root tablet)– To Discover tablet servers and finalize tablet server deaths– To Store Bigtable Schema information (column Family information for each table)– To Store access control lists (ACL)

1. Master sever– Assigning tab1lets to tablet servers– Detecting the addition and expiration of tablet servers– Balancing tablet server load– Garbage collecting of files in GFS– Handling schema changes (table crea7on, column family creation/deletion)

2. Tablet server– manages a set of tablets– Handles read and write request to the tablets– Splits tablets that have grown too large (100- 200 MB)

3. Client– Do not rely on the master for tablet location information– Communicates directly with tablet servers for reads and writes

Implementation

ImplementationTablet Assignment• Each tablet is assigned to one tablet server at a time• Master keeps tracks of

– the set of live tablet servers (tracking via Chubby, under directory “servers”)

– the current assignment of tablet to tablet servers– the current unassigned tablets (by scaning the B+ Tree)

• When a tablet is unassigned, the master assigns the tablet to an available tablet server by sending a tablet load request to that tablet server

ImplementationMaster startup• When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can changes them: 1. Master grabs a unique master lock in Chubby to prevent concurrent master instantiations 2. Master scans the “servers” directory in Chubby to find the live tablet servers 3. Master communicate with every live tablet servers to discover what tablets are already assigned to each server 4. Master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet is not discovered in step 3. 5. Master scans the METADATA table to learn the set of tablets (and detect unassigned tablets)

ImplementationTablet Service• Writing:

1. Server checks if it is well-formed2. Server Checks if the sender is authorized (list of permitted writers in

Chubby file)3. A valid mutation is written to the commit4. After the write has been committed, its contents are inserted into the

memtable.

ImplementationTablet Service• Reading:

1. Server checks if it is well-formed2. Server Checks if the sender is authorized (list of permitted readers in

Chubby file)3. A valid read operation is executed on a merged view of the sequence of

SSTables and the memtable

ImplementationTablet Service• Recover:

1. Tablet server find the commit log for this tablet by iterating over the METADATA tablets (searching the b+ tree)

2. Tablet server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have a committed since the redo points

ImplementationCompaction• Minor compaction:

When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.

• shrinks the memory usage of the tablet server • reduces the amount of data that has to be read from the commit log during recovery

ImplementationCompaction• Marging compaction:

When the number of SSTables rich its bounde, the server reads the contents of a few SSTables and the memtable, and writes out a new SSTable • Every minor compaction creates a new SSTable. If this behavior continued unchecked,

read operations might need to merge updates from an arbitrary number of SSTables

ImplementationCompaction• Major compaction:

It is a merging compaction that rewrites all SSTables into exactly one SSTable that contains no deletion information or deleted data

Bigtable cycles throught all of it tablets and regularly applies major compaction to them (=reclaim ressources used by deleted data in a timely fashion)

RefinementsCaching for read performance

To improve read performance, tablet servers use two levels of cachin. Scan Cache: a high-level cache that caches key-value pairs returned by the SSTable

interface Block Cache: a lower-level cache that caches SSTable blocks read from file system

Get K13 value Is K13 in

Scan cache?K13 is in block 1

Is K13 in Block cache?

Read K13 from SSTable

RefinementsBloom filter• a read operation has to read from all SSTables that make up the state of a

tablet If these SSTables are not in memory, we may end up doing many disk accesses

• We reduce the number of accesses by using Bloom filters• Bloom filter let us know with high propapility if a SSTable contain a specified row/column pair or

not. (no false negative)• Use only small amount of memory

• Introduction• Data model• API• Building blocks• Implementation• Refinements• Performance • Experience• conclusions


PerformanceCluster• 500 tablet servers

– Configured to use 1 GB RAM

– Dual- ?core Opteron 2 GHz, Gigabit Ethernet NIC

– Write to a GFS cell (1786 machines with 2 x 400 GB IDE)

• 500 clients • Network roundtrip time between any machine < 1 millisecond

. . . . . . client Tablet server

GFS server

500 500

1786

GFS server GFS server

PerformanceRandom reads - Similar to Sequential reads except row key hashed modulo R Random reads (memory) - Similar to Random reads benchmark except locality group that contains the data is marked as in-memory Random writes - Similar to Sequential writes except row key hashed modulo R Sequential reads - Used R row keys partitioned and assigned to N clients Sequential writes - Used R row keys partitioned and assigned to N clientsScans - Similar to Random reads but uses support provided by Bigtable API for scanning over all values in a row range (reduces RPC)

Experience

Characteristics of a few tables in production use:

ExperienceGoogle Earth• Google operates a collection of services that provide users with access

to high-resolution satellite imagery of the world's surface• Relies greatly on Bigtable to keep his data• Requre both low latency raction and store very big data.

ExperienceGoogle Earth•aaa

. . . . . .

. . .

. . . . . .

. . .

big table alon pluda. introduction data model api building blocks implementation refinements...

Documents