cassandra 2.1 boot camp, read/write path

CASSANDRA 2.1 READ/WRITE PATH

Cassandra Summit 2014 Boot Camp

Josh McKenzie


CORE COMPONENTS

Core Components

• Memtable – data in memory (R/W)

• SSTable – data on disk (immutable, R/O)

• CommitLog – data on disk (W/O)

• CacheService (Row Cache and Key Cache) – in-memory caches

• ColumnFamilyStore – logical grouping of “table” data

• DataTracker and View – provides atomicity and grouping of memtable/sstable data

• ColumnFamily – Collection of Cells

• Cell – Name, Value, TS

• Tombstone – Deletion marker indicating TS and deleted cell(s)

MemTable

• In-memory data structure consisting of:

• Memory pools (on-heap, off-heap)

• Allocators for each pool

• Size and limit tracking and CommitLog sentinels

• Map of Key → AtomicBTreeColumns

• Atomic copy-on-write semantics for row-data

• Flush-to-disk logic is triggered when a pool's usage passes a user-configurable ratio of its limit

• The Memtable with the largest ratio of used space (either on or off heap) is flushed to disk
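A minimal sketch of the selection rule above, assuming each memtable reports its on-heap and off-heap bytes and each pool has a positive limit. MemtableUsage and FlushSelector are hypothetical names for illustration, not Cassandra classes.

```java
import java.util.List;

// Hypothetical stand-in for a memtable's accounting; not Cassandra's Memtable class.
final class MemtableUsage {
    final String table;
    final long onHeapBytes;
    final long offHeapBytes;

    MemtableUsage(String table, long onHeapBytes, long offHeapBytes) {
        this.table = table;
        this.onHeapBytes = onHeapBytes;
        this.offHeapBytes = offHeapBytes;
    }

    // Larger of the two usage ratios, each relative to its pool's limit (limits assumed > 0).
    double largestRatio(long onHeapLimit, long offHeapLimit) {
        double onHeap = (double) onHeapBytes / onHeapLimit;
        double offHeap = (double) offHeapBytes / offHeapLimit;
        return Math.max(onHeap, offHeap);
    }
}

final class FlushSelector {
    // Returns the memtable occupying the largest share of either pool, or null for an empty list.
    static MemtableUsage pickLargest(List<MemtableUsage> memtables, long onHeapLimit, long offHeapLimit) {
        MemtableUsage largest = null;
        double best = -1.0;
        for (MemtableUsage m : memtables) {
            double ratio = m.largestRatio(onHeapLimit, offHeapLimit);
            if (ratio > best) {
                best = ratio;
                largest = m;
            }
        }
        return largest;
    }
}
```

A memtable with little on-heap data can still be picked if it dominates the off-heap pool, which is the point of taking the larger of the two ratios.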

On heap vs. Off heap Memtables: an overview

• http://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1

• https://issues.apache.org/jira/browse/CASSANDRA-6689

• https://issues.apache.org/jira/browse/CASSANDRA-6694

• memtable_allocation_type

• offheap_buffers moves the cell name and value to DirectBuffer objects. The values are still “live” Java buffers. This mode only reduces heap significantly when you are storing large strings or blobs

• offheap_objects moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it.

• Default in 2.1 is heap_buffers


On heap vs. Off heap: continued

• Why?

• Reduces sizes of objects in memory – no more ByteBuffer overhead

• More data fitting in memory == better performance

• Code changes that support it:

• MemtablePools allow on vs. off-heap allocation (and Slab, for that matter)

• MemtableAllocators to allow differentiating between on-heap and off-heap allocation

• DecoratedKey and *Cells changed to interfaces to have different allocation implementations based on native vs. heap

SSTable

• Ordered map of key-value pairs

• Immutable

• Consists of 3 files:

• Bloom Filter: optimization to determine if the Partition Key you’re looking for is (probably) in this sstable

• Index file: contains offset into data file, generally memory mapped

• Data file: contains data, generally compressed

• Read by SSTableReader
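To make the "(probably) in this sstable" wording concrete, here is a toy Bloom filter. It is a sketch under stated assumptions: String keys, String.hashCode() as the base hash, and the ToyBloomFilter name are all illustrative, and Cassandra's real filter uses different hashing and sizing.

```java
import java.util.BitSet;

// Toy Bloom filter; illustrative only, not Cassandra's implementation.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit index from two base hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A false answer lets a read skip the sstable entirely; a true answer still requires checking the index, because unrelated keys can set the same bits.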

CommitLog

• Append-only file structure – provides interim durability for writes while they live in Memtables and haven’t yet been flushed to sstables

• Has sync logic that determines the level of durability to disk you want: either PeriodicCommitLogService or BatchCommitLogService

• Periodic (default): checks whether the sync has fallen behind its window limit; if so, blocks and waits for the sync to catch up

• Batch: no ack until fsync to disk. Waits for a specific window before calling fsync so that writes can be coalesced

• Singleton – façade for commit log operations

• Consists of multiple components

• CommitLog.java: interface to subsystem

• CommitLogSegmentManager.java: segment allocation and management

• CommitLogArchiver.java: user-defined commands pre/post flush

• CommitLogMetrics.java
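The interim-durability role of the commit log can be sketched as follows. This is a toy model, assuming an in-memory list stands in for on-disk segments and a single table; ToyNode and its methods are hypothetical names, and real segment recycling and sync modes are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy write path: append to a log first, then apply to an in-memory table.
final class ToyNode {
    final List<String[]> commitLog = new ArrayList<>();      // stands in for on-disk segments
    TreeMap<String, String> memtable = new TreeMap<>();      // lost on "crash"
    final TreeMap<String, String> sstable = new TreeMap<>(); // stands in for flushed data

    void write(String key, String value) {
        commitLog.add(new String[] { key, value }); // durability first
        memtable.put(key, value);                   // then the in-memory update
    }

    // Flushing persists the memtable, so the log entries it covered can be discarded.
    void flush() {
        sstable.putAll(memtable);
        memtable = new TreeMap<>();
        commitLog.clear();
    }

    // Simulates a restart: memory is gone, unflushed writes are replayed from the log.
    void crashAndReplay() {
        memtable = new TreeMap<>();
        for (String[] entry : commitLog) {
            memtable.put(entry[0], entry[1]);
        }
    }
}
```

Only writes made after the last flush need replaying, which is why the log can stay small relative to the data on disk.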

CacheService.java

• In-memory caching service to optimize lookups of hot data

• Contains three caches:

• keyCache

• rowCache

• counterCache

• See:

• AutoSavingCache.java

• InstrumentingCache.java

• Tunable per table; limits set in cassandra.yaml (key cache: keys to cache, size in MB; row cache: rows, size in MB)

• Defaults to keys only, can enable row cache via CQL
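As a rough picture of what a bounded cache of hot keys does, here is a tiny LRU cache mapping a partition key to a data-file offset. It is a sketch only: ToyKeyCache is a hypothetical name, it is built on java.util.LinkedHashMap for brevity, and it is not thread-safe, unlike the caches behind CacheService.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache from partition key to data-file offset. Not thread-safe.
final class ToyKeyCache extends LinkedHashMap<String, Long> {
    private final int capacity;

    ToyKeyCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration runs least- to most-recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        return size() > capacity; // evict the least recently used entry once over capacity
    }
}
```

With a capacity of 2, putting keys a and b, reading a, then putting c evicts b, because b is the least recently used entry at that point.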

ColumnFamilyStore.java

• Contains logic for a “table”

• Holds DataTracker

• Creating and removing sstables on disk

• Writing / reading data

• Cache initialization

• Secondary index(es)

• Flushing memtables to sstables

• Snapshots

• And much more

CFS: DataTracker and View

• DataTracker allows for atomic operations on a “view” of a Table (ColumnFamilyStore)

• Contains various logic surrounding Memtables and flushing, SSTables and compaction, and notification for subscribers on changes to SSTableReaders

• 1 DataTracker per CFS, 1 AtomicReference<View> per DataTracker

• View consists of current Memtable, Memtables pending flush, SSTables for the CFS, and SSTables being actively compacted

• Currently active Memtable is atomically switched out in:

• DataTracker.switchMemtable(boolean truncating)
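The atomic switch can be pictured as a compare-and-set loop over an immutable snapshot. The sketch below assumes memtables are represented by plain strings and leaves sstables out; ToyView and ToyTracker are hypothetical names, not the actual DataTracker code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Immutable snapshot: current memtable plus memtables waiting to be flushed.
final class ToyView {
    final String currentMemtable;
    final List<String> pendingFlush;

    ToyView(String currentMemtable, List<String> pendingFlush) {
        this.currentMemtable = currentMemtable;
        this.pendingFlush = Collections.unmodifiableList(new ArrayList<>(pendingFlush));
    }

    // Builds a new view; the old one is left untouched for readers still using it.
    ToyView switchMemtable(String newMemtable) {
        List<String> pending = new ArrayList<>(pendingFlush);
        pending.add(currentMemtable);
        return new ToyView(newMemtable, pending);
    }
}

final class ToyTracker {
    private final AtomicReference<ToyView> view;

    ToyTracker(String firstMemtable) {
        view = new AtomicReference<>(new ToyView(firstMemtable, new ArrayList<String>()));
    }

    ToyView getView() {
        return view.get();
    }

    // Retry loop: recompute from the latest view until the swap succeeds.
    // Returns the memtable that was switched out.
    String switchMemtable(String newMemtable) {
        while (true) {
            ToyView current = view.get();
            ToyView updated = current.switchMemtable(newMemtable);
            if (view.compareAndSet(current, updated)) {
                return current.currentMemtable;
            }
        }
    }
}
```

A reader that grabbed the view before the switch keeps a consistent snapshot; it never sees a half-updated combination of memtables.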

ColumnFamily.java

• A sorted map of columns

• Abstract class, extended by:

• ArrayBackedSortedColumns

• Array backed

• Non-thread-safe

• Good for iteration, adding cells (especially if in sorted order)

• AtomicBTreeColumns (memtable only)

• Btree backed

• Thread-safe w/atomic CAS

• Logarithmic complexity on operations

• Logic to add / retrieve columns, counters, tombstones, atoms
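Why an array-backed container favours sorted input can be shown with a small sketch: appending at the end is cheap, while an out-of-order add needs a search and an element shift. ToySortedColumns is a hypothetical illustration keyed by String names, not the ArrayBackedSortedColumns implementation, and it overwrites on a name clash where real code would reconcile cells.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy sorted column container keyed by cell name. Not thread-safe.
final class ToySortedColumns {
    private final List<String> names = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    void add(String name, String value) {
        int last = names.size() - 1;
        if (last < 0 || names.get(last).compareTo(name) < 0) {
            names.add(name);   // fast path: input arrives in sorted order
            values.add(value);
            return;
        }
        int pos = Collections.binarySearch(names, name);
        if (pos >= 0) {
            values.set(pos, value);  // same name: overwrite in this sketch
        } else {
            int insertAt = -pos - 1; // binarySearch encodes the insertion point as -(point) - 1
            names.add(insertAt, name);
            values.add(insertAt, value);
        }
    }

    List<String> names() {
        return names;
    }

    // Returns the value stored under name, or null if absent.
    String get(String name) {
        int pos = Collections.binarySearch(names, name);
        return pos >= 0 ? values.get(pos) : null;
    }
}
```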

THE READ PATH

Read Path: Very High Level

Overview – the Read Path

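[Diagram: read path flow]

Coordinator → MessagingService → Keyspace → ColumnFamilyStore → Check Row Cache

• Row Cache hit: return results

• Row Cache miss: CollationController reads the Memtable and SSTables, merges the reads into a ColumnFamily, updates the Row Cache, and returns results

• SSTable read, Key Cache hit: seek to cached position

• SSTable read, Key Cache miss: binary scan index, update cache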
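The row-cache branch of this flow can be sketched in a few lines. This is a toy model under loud assumptions: rows are plain maps, sources are merged by age order instead of by cell timestamp, and ToyReadPath with its fields is a hypothetical name, not Cassandra code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy single-table read path; every name here is illustrative.
final class ToyReadPath {
    final Map<String, Map<String, String>> rowCache = new HashMap<>();
    final Map<String, Map<String, String>> memtable = new HashMap<>();
    final List<Map<String, Map<String, String>>> sstables; // ordered oldest to newest

    ToyReadPath(List<Map<String, Map<String, String>>> sstables) {
        this.sstables = sstables;
    }

    Map<String, String> read(String key) {
        Map<String, String> cached = rowCache.get(key);
        if (cached != null) {
            return cached;                            // row cache hit
        }
        Map<String, String> merged = new TreeMap<>(); // row cache miss: collate sources
        for (Map<String, Map<String, String>> sstable : sstables) {
            Map<String, String> row = sstable.get(key);
            if (row != null) {
                merged.putAll(row);                   // newer sstables overwrite older ones
            }
        }
        Map<String, String> live = memtable.get(key);
        if (live != null) {
            merged.putAll(live);                      // memtable treated as newest here
        }
        rowCache.put(key, merged);                    // update row cache
        return merged;
    }
}
```

A second read of the same key returns the cached row without touching the memtable or sstables.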

Read-specific primitive: QueryFilter

• Wraps IDiskAtomFilter

• IDiskAtomFilter: used to get columns from Memtable, SSTable, or SuperColumn

• IdentityQueryFilter, NamesQueryFilter, SliceQueryFilter

• Contains a variety of iterators to collate on-disk contents, gather tombstones, reduce (merge) Cells with the same name, etc.

• See:

• collateColumns(…)

• gatherTombstones(…)

• getReducer(final Comparator<Cell> comparator)
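Reducing cells with the same name can be sketched as keeping the highest-timestamp version per name and then dropping names whose winner is a tombstone. The code below is illustrative only: ToyCell and ToyReducer are hypothetical names, a null value stands in for a tombstone, and a timestamp tie between two live cells is resolved arbitrarily.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy cell: a null value marks a tombstone.
final class ToyCell {
    final String name;
    final String value;
    final long timestamp;

    ToyCell(String name, String value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    boolean isTombstone() {
        return value == null;
    }
}

final class ToyReducer {
    // Higher timestamp wins; on a tie a tombstone wins, otherwise b is kept arbitrarily.
    static ToyCell reconcile(ToyCell a, ToyCell b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;
        }
        return a.isTombstone() ? a : b;
    }

    // Merges cells from any number of sources and drops names whose winner is a tombstone.
    static List<ToyCell> reduce(List<List<ToyCell>> sources) {
        Map<String, ToyCell> winners = new TreeMap<>();
        for (List<ToyCell> source : sources) {
            for (ToyCell cell : source) {
                ToyCell existing = winners.get(cell.name);
                winners.put(cell.name, existing == null ? cell : reconcile(existing, cell));
            }
        }
        List<ToyCell> live = new ArrayList<>();
        for (ToyCell cell : winners.values()) {
            if (!cell.isTombstone()) {
                live.add(cell);
            }
        }
        return live;
    }
}
```

Because the winner is chosen by timestamp, the order in which memtables and sstables are visited does not change the result.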

Read-specific class: SSTableReader

• Has 2 SegmentedFiles, ifile and dfile, for index and data respectively

• Contains a Key Cache, caching positions of keys in the SSTR

• Contains an IndexSummary w/sampling of the keys that are in the table

• Binary search used to narrow down location in file via IndexSummary

• getIndexScanPosition(RowPosition key)

• Short running operations guarded by ColumnFamilyStore.readOrdering

• See OpOrder.java – producer/consumer synchronization primitive to coordinate readers w/flush operations

• Access is reference counted via acquireReference() and releaseReference() for long running operations (See CASSANDRA-7705 re: moving away from this)

• Provides methods to retrieve an SSTableScanner which gives you access to OnDiskAtoms via iterators and holds RandomAccessReaders on the raw files on disk
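The IndexSummary lookup mentioned above can be sketched as a binary search over a sample of keys that returns where to start scanning the index file. This is a simplified illustration: keys are compared as Strings instead of RowPositions, and ToyIndexSummary and scanPosition are hypothetical names, not the SSTableReader API.

```java
import java.util.Arrays;

// Toy index summary: a sorted sample of keys, each with its index-file offset.
final class ToyIndexSummary {
    private final String[] sampledKeys;
    private final long[] indexOffsets;

    ToyIndexSummary(String[] sampledKeys, long[] indexOffsets) {
        this.sampledKeys = sampledKeys;
        this.indexOffsets = indexOffsets;
    }

    // Offset of the last sampled key <= key, or -1 if key sorts before the first sample.
    long scanPosition(String key) {
        int pos = Arrays.binarySearch(sampledKeys, key);
        if (pos < 0) {
            int insertionPoint = -pos - 1;
            if (insertionPoint == 0) {
                return -1;            // key precedes everything in this sstable
            }
            pos = insertionPoint - 1; // step back to the preceding sample
        }
        return indexOffsets[pos];
    }
}
```

With samples b, f, m at offsets 0, 400, 900, a lookup for g starts scanning at offset 400, since f is the closest sample at or before g.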