cassandra internals overview by sam tunnicliffe (1)

40
 CASSANDRA INTERNALS OVERVIEW DATASTAX BOOTCAMP 2015 Sam Tunnicliffe [email protected] / @beobal

Upload: rajul-srivastava

Post on 04-Nov-2015

216 views

Category:

Documents


0 download

DESCRIPTION

Cassandra Internals presentaiton

TRANSCRIPT

  • CASSANDRAINTERNALS OVERVIEW

    DATASTAX BOOTCAMP 2015Sam Tunnicliffe

    [email protected] / @beobal

  • OVERVIEWSystem startupMessagingGossipSchema PropagationRequest Coordination

  • STARTUPorg.apache.cassandra.service.CassandraDaemonprotectedvoidsetup()

    Load config

    Run preflight checks

    Load schema

    Clean up local temporary state

    Recover CommitLog

    Schedule background compactions

    Initialize storage service

  • PREFLIGHT CHECKSSane clockJNIJVM & InstrumentationFilesystem permissionsSystem keyspace statusUpgrades (#8049)Incompatible SSTables (#8049)

  • STARTUPorg.apache.cassandra.service.CassandraDaemonprotectedvoidsetup()

    Load config

    Run pre-flight checks

    Load schema

    Clean up local temporary state

    Recover CommitLog

    Schedule background compactions

    Initialize storage service

  • CLEAN UP LOCAL STATETruncate compactions_in_progressScrub data directories

  • STARTUPorg.apache.cassandra.db.commitlog.CommitLogpublicintrecover()throwsIOException

    Load config

    Run pre-flight checks

    Load schema

    Clean up local temporary state

    Recover CommitLog

    Schedule background compactions

    Initialize storage service

  • INITIALIZE STORAGE SERVICEorg.apache.cassandra.service.StorageServicepublicsynchronizedvoidinitServer()throwsConfigurationException

    Load ring state (unless don't)

    Start gossip & get initial ring info

    Set tokens

  • BOOTSTRAPAbort if other range movements happening

    Fetch bootstrap data

    Build secondary indexes

  • INITIALIZE STORAGE SERVICELoad ring state (unless don't)

    Start gossip & get initial ring info

    Set tokens

    Setup auth resources

    Ensure gossip stabilized

  • STARTUPLoad config

    Run preflight checks

    Load schema

    Clean up local temporary state

    Recover CommitLog

    Schedule background compactions

    Initialize storage service

  • -- it is done --

    STARTUP

  • MESSAGINGSERVICEorg.apache.cassandra.net.MessagingService

    Low level one-way messagingpublicvoidsendOneWay(MessageOutmessage,InetAddressto)

    Async Request/ResponsepublicintsendRR(MessageOutmessage,InetAddressto,IAsyncCallbackcb)

  • MESSAGINGSERVICEorg.apache.cassandra.net.MessagingService

    ReadspublicintsendRRWithFailure(MessageOutmessage,InetAddressto,IAsyncCallbackWithFailurecb)

    WritespublicintsendRR(MessageOut

  • MESSAGINGSERVICEPre-emptively drops messages when overwhelmed

    Dropped if time at execution > send time + timeout

    Timeout value dependant on message type

    Most client-initated requests can be dropped

    (see MessagingService.DROPPABLE_VERBS)

  • GOSSIPWhat it does do:

    Disseminates members' state around the clusterVersioned: generation (per JVM) & version (per value)Heartbeats: incremented every gossip roundApplication state:

    StatusTokensRelease & schema versionDC & RackAddressesData sizeHealth

  • GOSSIPWhat doesn't it do:

    Notify about up or down nodesPropagate schemaTransmit data filesDistribute mutations

  • GOSSIP

    https://wiki.apache.org/cassandra/ArchitectureGossip

  • GOSSIPorg.apache.cassandra.gms.GossiperprivateclassGossipTaskimplementsRunnable{publicvoidrun(){...

    Each round (1 second) gossip to:

    1 live endpointmaybe 1 unreachable endpointmaybe 1 seed - if neither of the above

  • SCHEMA MIGRATIONAnother custom protocol

    Also uses MessagingService

    Target schema objects serialized as Mutations

    diff/merge schema representations

  • SCHEMA PUSHorg.apache.cassandra.service.MigrationManagerprivatestaticFutureannounce(finalCollectionschema)

  • SCHEMA PULLorg.apache.cassandra.service.MigrationManagerpublicvoidscheduleSchemaPull(InetAddressendpoint,EndpointStatestate)

  • Client request arrives at coordinator:

    COORDINATION

    Transformed into actionable command(s):

    IReadCommandIMutation

    Coordinator distributes execution around the cluster

    Replicas perform commands and respond to coordinator

    Gather responses and determine client response

  • COORDINATIONorg.apache.cassandra.service

    StorageProxyAbstractWriteResponseHandlerAbstractReadExecutor

    org.apache.cassandra.locatorAbstractReplicationStrategyIEndpointSnitch

  • https://wiki.apache.org/cassandra/ArchitectureInternals

    COORDINATING WRITESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutate(Collection

  • DATA REPLICATIONorg.apache.cassandra.locator.SimpleStrategy

  • DATA REPLICATIONorg.apache.cassandra.locator.NetworkTopologyStrategy

  • https://wiki.apache.org/cassandra/ArchitectureInternals

    COORDINATING WRITESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutate(Collection

  • DELIVERING MUTATIONSorg.apache.cassandra.service.StorageProxypublicstaticvoidsendToHintedEndpoints(finalMutationmutation,Iterabletargets,AbstractWriteResponseHandlerresponseHandler,StringlocalDataCenter)

    Mutations sent to replicas using MessagingService

    ResponseHandler registered as callback

    Callback registry triggers an event on expiry

    Sent directly within local datacenter

    Forwarded via single node in each remote DC

  • COORDINATING WRITESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutate(Collection
  • HINTSNodes can be down

    Writes may timeout

    In which case we may hint

    Enabled/disabled globally or enabled per-DC

    Writing a hint counts towards ConsistencyLevel.ANY

    Deliver hints when a node comes back up & periodically

    Too many hints in progress for a replica means we bail early

  • Determine point of failure by WriteType

    LOGGED BATCHESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutateAtomically(Collectionmutations,ConsistencyLevelconsistency_level)

    CommitLog for batches

    Guarantee eventual success of batched statements

    Strives to distribute to across racks in local DC

    On success, cleanup log entries asynchronously

    Failed batches replayed by the nodes holding the logs

    WriteType.BATCH_LOGWriteType.BATCH

  • COORDINATING READSorg.apache.cassandra.service.StorageProxypublicstaticListread(Listcommands,ConsistencyLevelconsistencyLevel,ClientStatestate)

    Partition based reads

    Read Repair & Data vs Digest Requests

    Rapid Read Protection & (non)speculating executors

    Distribution is more slightly complex than for writes

  • IDENTIFY TARGET ENDPOINTSorg.apache.cassandra.service.AbstractReadExecutorpublicstaticAbstractReadExecutorgetReadExecutor(ReadCommandcommand,ConsistencyLevelconsistencyLevel)

    Use replication strategy to get live endpoints

    Snitch sorts by proximity & health of replicas

    Consult table metadata for Read Repair Decision

  • READ REPAIR DECISIONApply filter to sorted list of all live replicas

    NONE: closest n replicas required by CLGLOBAL: all live replicasDC_LOCAL: all local replicas

    Add closest n remotes needed to satisfy CLDefault Global Chance: 0.0Default Local Chance: 0.1

    Give us a list of replicas to send read requests

  • RAPID READ PROTECTIONNever

    Always

    Fixed timeout

    Table latency percentile

  • LIGHTS, CAMERA, EXECUTIONFire off each command using read executor

    Requests are sent via MessagingService

    Closest replica(s) sent full data requests

    Others get digest requests

  • RESOLUTIONResolution can have two outcomes:

  • RESOLUTIONDigestMismatchException

    Trigger a foreground read repairOf all targetted replicas

  • FOREGROUND READ REPAIRAll data requests, no digests

    Includes replicas contacted initially

    Effectively ConsistencyLevel.ALL

    Specialized resolver: RowDataResolver

    Retry any short reads

    May also perform background Read Repair

  • OVERVIEW OVER