Aurora Multi-Master - HPTS
![Page 1: Aurora Multi-Master - HPTS · Input queue is 46Xlessthan MySQL (unamplified, per node) ... Aurora Multi-Master Availability Zone 2 •Membership •Heartbeat •Replication •Metadata](https://reader035.vdocuments.us/reader035/viewer/2022071017/5fd140c12afd2354550aadfc/html5/thumbnails/1.jpg)
Aurora Multi-Master
Justin Levandoski
Agenda
> Aurora Fundamentals
> Aurora Multi-Master Use Cases
> Aurora Multi-Master Architecture
Aurora Fundamentals
Amazon Aurora: A relational database reimagined for the cloud
• Speed and availability of high-end commercial databases
• Simplicity and cost-effectiveness of open source databases
• Drop-in compatibility with MySQL and PostgreSQL
• Simple pay-as-you-go pricing
Delivered as a managed service
Why Aurora
Relational databases were not designed for the cloud
• Monolithic architecture
• Large failure blast radius
Databases in the cloud
• Compute & storage have different lifetimes
• Instances fail/shutdown/scale up & down
• Instances added to a cluster
Compute & storage are best decoupled for scalability, availability, and durability
[Diagram: monolithic database stack of SQL, transactions, caching, and logging on attached storage]
Aurora Architecture
• Database tier
  • Writes redo log to network
  • No checkpointing! The log is the database
  • Pushes log application to storage
  • Master replicates to read replicas for cache updates
• Storage tier
  • Highly parallel scale-out redo processing
  • Data replicated 6 ways across 3 AZs
  • Generate/materialize database pages on demand
  • Instant database redo recovery
• 4/6 Write Quorum with Local Tracking
  • AZ + 1 failure tolerance
  • Read quorum needed only during recovery
[Diagram: a master and read replicas, each with SQL, transactions, and caching layers, spread across AZ1, AZ2, and AZ3 on top of a shared storage volume of distributed storage nodes with SSDs]
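The quorum arithmetic on this slide can be checked with a small sketch. It assumes two copies in each of the three AZs (6 ways across 3 AZs) and a write quorum of 4. The read quorum of 3 is not stated on the slide; it is taken from the published Aurora design and should be read as an assumption here. The helper names are hypothetical.

```python
# Assumed layout: 2 copies per AZ across 3 AZs = 6 copies in total.
AZS = ("AZ1", "AZ2", "AZ3")
COPIES_PER_AZ = 2
WRITE_QUORUM = 4  # from the slide: 4/6 write quorum
READ_QUORUM = 3   # assumption: not on the slide, taken from the published Aurora design

REPLICAS = [(az, i) for az in AZS for i in range(COPIES_PER_AZ)]


def surviving_copies(failed_az=None, extra_failures=0):
    """Count copies left after losing a whole AZ plus some additional nodes."""
    alive = [r for r in REPLICAS if r[0] != failed_az]
    return len(alive) - extra_failures


def can_write(alive):
    return alive >= WRITE_QUORUM


def can_read(alive):
    return alive >= READ_QUORUM


after_az_loss = surviving_copies(failed_az="AZ1")
after_az_plus_one = surviving_copies(failed_az="AZ1", extra_failures=1)
print(can_write(after_az_loss))      # writes survive the loss of one AZ
print(can_write(after_az_plus_one))  # AZ + 1: writes stall
print(can_read(after_az_plus_one))   # AZ + 1: a read quorum remains for repair
```

Under these assumptions, losing a full AZ leaves 4 copies, which still meets the write quorum, while losing an AZ plus one more node leaves 3 copies, enough to read and rebuild but not to write.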
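Aurora Storage Node
[Diagram: the primary instance sends log records to the storage node's incoming queue (1), which persists them to the hot log and returns an ACK (2); the update queue, sort/group (3), peer-to-peer gossip with peer storage nodes (4), coalesce into data blocks (5), point-in-time snapshot to S3 backup (6), GC (7), and scrub (8) follow]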
IO FLOW
① Receive record and add to in-memory queue
② Persist record and ACK
③ Organize records and identify gaps in log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new data block versions
⑥ Periodically stage log and new block versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks
OBSERVATIONS
All steps are asynchronous
Only steps 1 and 2 are in foreground latency path
Input queue is 46X less than MySQL (unamplified, per node)
Favor latency-sensitive operations
Use disk space to buffer against spikes in activity
SUMMARY
Manages 10GB page segments
10GB = right size for repair/fault tolerance
Use fault tolerance for heat management/machine patching
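Steps 1 to 4 of the IO flow can be sketched as a toy in-memory model. This is only an illustration: the `StorageNode` class and its method names are hypothetical, and LSNs are assumed to be consecutive integers so that a gap is simply a missing number, which is a simplification of how a real log chain is tracked.

```python
from collections import deque


class StorageNode:
    """Toy storage node covering IO-flow steps 1 to 4 (hypothetical names)."""

    def __init__(self):
        self.incoming = deque()  # step 1: in-memory queue
        self.hot_log = {}        # step 2: persisted records, LSN -> payload

    def receive(self, lsn, payload):
        # Step 1: receive the record and add it to the in-memory queue.
        self.incoming.append((lsn, payload))

    def persist_and_ack(self):
        # Step 2: persist queued records and return the LSNs being ACKed.
        acked = []
        while self.incoming:
            lsn, payload = self.incoming.popleft()
            self.hot_log[lsn] = payload
            acked.append(lsn)
        return acked

    def gaps(self):
        # Step 3: identify holes, assuming consecutive integer LSNs.
        if not self.hot_log:
            return []
        low, high = min(self.hot_log), max(self.hot_log)
        return [lsn for lsn in range(low, high + 1) if lsn not in self.hot_log]

    def gossip_with(self, peer):
        # Step 4: copy any missing records that the peer already holds.
        for lsn in self.gaps():
            if lsn in peer.hot_log:
                self.hot_log[lsn] = peer.hot_log[lsn]


node, peer = StorageNode(), StorageNode()
for lsn in (1, 2, 4, 5):
    node.receive(lsn, f"redo-{lsn}")
peer.receive(3, "redo-3")
acked = node.persist_and_ack()
peer.persist_and_ack()
gaps_before = node.gaps()
node.gossip_with(peer)
gaps_after = node.gaps()
```

Only `receive` and `persist_and_ack` sit on the foreground path in this model; gap detection and gossip run afterwards, mirroring the observation that only steps 1 and 2 affect write latency.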
Crash Recovery
[Diagram: log records and gaps AT CRASH, with the Volume Complete LSN (VCL) and Consistency Point LSN (CPL) marked, compared with the log IMMEDIATELY AFTER CRASH RECOVERY, which ends at the Consistency Point LSN (CPL)]
Storage establishes consistency points that increase monotonically and are continuously returned to the DB
Transactions commit once DB can prove all changes have met quorum
Volume Complete LSN (VCL) is the highest point where all records have met quorum
Consistency Point LSN (CPL) is the highest commit record below VCL.
Everything past CPL is deleted at crash recovery
Removes the need for 2PC at each commit spanning storage nodes.
No redo or undo processing is required before the database is opened for processing
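The VCL and CPL definitions above can be made concrete with a short sketch. It assumes LSNs are consecutive integers starting at 1 and treats "below VCL" as "at or below VCL"; both are simplifying assumptions, and the function names are hypothetical.

```python
def volume_complete_lsn(durable_lsns, first_lsn=1):
    """Highest LSN such that every record up to it has met quorum."""
    durable = set(durable_lsns)
    vcl = first_lsn - 1
    while vcl + 1 in durable:
        vcl += 1
    return vcl


def consistency_point_lsn(commit_lsns, vcl):
    """Highest commit record at or below the VCL, or None if there is none."""
    eligible = [lsn for lsn in commit_lsns if lsn <= vcl]
    return max(eligible) if eligible else None


def recover(durable_lsns, commit_lsns):
    """Return (VCL, CPL, surviving LSNs); everything past the CPL is dropped."""
    vcl = volume_complete_lsn(durable_lsns)
    cpl = consistency_point_lsn(commit_lsns, vcl)
    kept = sorted(lsn for lsn in set(durable_lsns) if cpl is not None and lsn <= cpl)
    return vcl, cpl, kept


# Record 7 never reached quorum, so records 8 and 9 sit beyond a gap.
vcl, cpl, kept = recover(durable_lsns={1, 2, 3, 4, 5, 6, 8, 9}, commit_lsns={2, 4, 9})
print(vcl, cpl, kept)
```

In this example the VCL is 6 because record 7 is missing, the CPL is 4 because the commit at 9 lies past the VCL, and records 5 through 9 are discarded at recovery.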
Aurora Multi-Master Architecture
Distributed Lock Manager
SHARED DISK CLUSTER
[Diagram: the application talks to database nodes, each with SQL, transactions, caching, and logging layers; the nodes exchange locking protocol messages through a global resource manager and sit on shared storage labeled M1, M2, M3]
Cons
Heavyweight cache coherency traffic on per-lock basis
Networking can get expensive
Negative scaling with hot blocks
Pros
All data available to all nodes
Easy to build applications
Similar cache coherency as in multi-processors
Partitioned with Consensus
SHARED NOTHING
[Diagram: the application talks to database nodes, each with its own SQL, transactions, caching, logging, and storage layers]
Cons
Heavyweight commit and membership change protocols
Can result in hot partitions = expensive repartitioning
Cross partition operations expensive; better at small requests
Pros
Query broken up and sent to data nodes
Less coherence traffic – only for commits
Can scale to many nodes
[Diagram: storage split into data ranges #1 to #5, each marked "L"]
Aurora Multi-Master
• Each instance can execute write transactions with no coordination with the others
• Instances share a distributed storage volume
• Nodes fail and recover independently
• Optimistic Page-Based Conflict Resolution
• No Pessimistic Locking
• No Global Commit Coordination
• Writer instances in two availability zones provide continuous availability
• GA August 2019
Cluster Services
• Membership
• Heartbeat
• Replication
• Metadata
[Diagram: writer instances with SQL, transactions, and caching layers in Availability Zone 1 and Availability Zone 2, on top of a shared storage volume of storage nodes with SSDs]
Non-Conflicting Writes
Non-conflicting writes originating on different masters on different tables

| Time | Blue Master | Green Master |
| --- | --- | --- |
| 1 | Begin Trx (BT1) | Begin Trx (OT1) |
| 2 | Update (table1) | Update (table2) |
| 3 | Commit (BT1) | Commit (OT1) |
| Result | OK | OK |

No Synchronization
[Diagram: storage nodes holding copies of Page 1 and Page 2]
Conflicting Write
Conflicting writes originating on different masters on the same table

| Time | Blue Master | Green Master |
| --- | --- | --- |
| 1 | Begin Trx (BT1) | Begin Trx (OT1) |
| 2 | Update (row1, table1) | Update (row1, table1) |
| 3 | Commit (BT1) | Rollback (OT1) |
| Result | OK | RETRY |

Optimistic Conflict Resolution
[Diagram: storage nodes holding copies of Page 1 and Page 2]
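The OK/RETRY outcomes in the two scenarios above can be imitated with a version check per page. The slides do not describe the actual mechanism at this level, so the sketch below is an assumption: each write names the page version it was based on, and the store (a hypothetical `PageStore`) accepts the first writer and rejects a later one whose base version is stale.

```python
from dataclasses import dataclass, field


@dataclass
class PageStore:
    """Toy page store: one version counter per page (hypothetical model)."""

    versions: dict = field(default_factory=dict)  # page_id -> current version

    def current_version(self, page_id):
        return self.versions.get(page_id, 0)

    def write(self, page_id, based_on_version):
        # Accept only if the writer saw the latest version of the page.
        if based_on_version != self.current_version(page_id):
            return "RETRY"
        self.versions[page_id] = based_on_version + 1
        return "OK"


store = PageStore()

# Non-conflicting: Blue updates Page 1, Green updates Page 2.
blue_ok = store.write("page1", based_on_version=store.current_version("page1"))
green_ok = store.write("page2", based_on_version=store.current_version("page2"))

# Conflicting: both masters read the same version of Page 1, Blue lands first.
seen_by_blue = store.current_version("page1")
seen_by_green = store.current_version("page1")
blue_result = store.write("page1", based_on_version=seen_by_blue)
green_result = store.write("page1", based_on_version=seen_by_green)
print(blue_result, green_result)
```

The losing master receives a rejection rather than waiting on a lock, which matches the "no pessimistic locking" point: the cost of a conflict is a rollback and retry on one side only.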
Logical Conflict
Conflicting writes originating on different masters on the same table

| Time | Blue Master | Green Master |
| --- | --- | --- |
| 1 | Begin Trx (BT1) | Begin Trx (OT1) |
| 2 | Update (row1, table1) | |
| 3 | | Update (row1, table1) and rollback (OT1) |
| 4 | Commit (BT1) | |
| Result | OK | RETRY |

Logical Conflict Detection
[Diagram: storage nodes holding copies of Page 1 and Page 2]
Mechanics From the Head Node
Partitioned LSN and transaction id space
Durability and resolution point at storage constantly increasing (creating new page versions)
Incoming replication from other masters
Database engine must handle rejected write to storage
• Transaction rollback
• B-tree structure modifications
• Etc…
[Diagram: the head node sends an update to storage and receives reject or accept; replication arrives from other masters]
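One way to picture a partitioned LSN space is to give each master a disjoint slice of the integers. The slides do not say how the space is actually partitioned, so the interleaving scheme below (and the `LsnAllocator` name) is purely an assumption chosen for illustration.

```python
class LsnAllocator:
    """Hands out LSNs from one master's slice of an interleaved space.

    Assumed scheme (not from the slides): master k of n owns the LSNs
    k, k + n, k + 2n, ... so no two masters can ever issue the same LSN.
    """

    def __init__(self, master_index, num_masters):
        if not 0 <= master_index < num_masters:
            raise ValueError("master_index must be in [0, num_masters)")
        self.master_index = master_index
        self.num_masters = num_masters
        self.counter = 0

    def next_lsn(self):
        lsn = self.counter * self.num_masters + self.master_index
        self.counter += 1
        return lsn


blue = LsnAllocator(master_index=0, num_masters=2)
green = LsnAllocator(master_index=1, num_masters=2)
blue_lsns = [blue.next_lsn() for _ in range(3)]
green_lsns = [green.next_lsn() for _ in range(3)]
print(blue_lsns, green_lsns)
```

The property that matters is that each master can assign LSNs locally, without asking the others, while the combined set still has no duplicates.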
Multi-Master Commit
[Diagram: Blue Master and Green Master each issue a commit, sending commit log records to the storage nodes]
Mechanics From the Storage Node
Log records written by multiple masters
Quorum commit log records like before, fills in log chain, etc
Detects conflicting writes from other nodes
Returns rejection to log write on conflict
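[Diagram: log records from Master 1, Master 2, … Master N enter the storage node's incoming queue with conflict detection (1), which persists to the hot log and returns ACCEPT/REJECT (2); the update queue, sort/group (3), peer-to-peer gossip with peer storage nodes (4), coalesce into data blocks (5), point-in-time snapshot to S3 backup (6), GC (7), and scrub (8) follow]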
Recovery in Multi-Master
[Diagram, SINGLE MASTER: log records and gaps AT CRASH, with the Volume Complete LSN (VCL) and Consistency Point LSN (CPL) marked, compared with the log IMMEDIATELY AFTER CRASH RECOVERY, which ends at the CPL]
[Diagram, MULTI MASTER: the Green Master crashes; gaps, VCL, and CPL are marked AT CRASH; afterwards the gaps are filled up to the Green Master recovery point, followed by new LSNs and gaps]
Consistency Model
Instance Read-After-Write (INSTANCE_RAW): A transaction can observe all transactions previously committed on this instance, and transactions executed on other nodes, subject to replication lag.
Regional Read-After-Write (REGIONAL_RAW): A transaction can observe all transactions previously committed on all instances in the cluster.
Regional Read-After-Write
Globally consistent results
No waits on the write path
Adds latency ONLY to consistent reads
Configurable per session
[Diagram: nodes N1, N2, and N3 over a shared distributed storage volume; the client issues a read (T1) to N1, and N1 waits for replication to catch up until T2 AND T3]
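The difference between the two read modes can be sketched with a toy model in which each node tracks how far it has applied every peer's commits. The `Node` class is hypothetical, and the "wait" for replication is modelled as a synchronous catch-up for simplicity; a real node would block the read until replication arrives.

```python
class Node:
    """Toy node tracking its own commits and applied peer commits (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.committed = 0      # latest commit sequence produced locally
        self.applied_from = {}  # peer name -> highest peer commit applied here

    def commit(self):
        # The write path never waits on other nodes.
        self.committed += 1
        return self.committed

    def read(self, peers, mode):
        """Return, per peer, the highest commit visible to this read."""
        if mode == "REGIONAL_RAW":
            # Stand-in for waiting until replication from every peer catches up.
            for peer in peers:
                self.applied_from[peer.name] = peer.committed
        return {peer.name: self.applied_from.get(peer.name, 0) for peer in peers}


n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n2.commit()  # T2 on N2
n3.commit()  # T3 on N3

instance_view = n1.read([n2, n3], mode="INSTANCE_RAW")
regional_view = n1.read([n2, n3], mode="REGIONAL_RAW")
print(instance_view, regional_view)
```

The extra latency falls only on the read that asks for the regional guarantee; `commit` is identical in both modes.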
Optimistic Execution: Mini-Transactions (3)
Resolution point constantly advancing
Can pessimistically wait for multi-page mtr to resolve (lack of concurrency/performance)
Aurora optimistically executes multi-page mtr (greater in-memory concurrency)
Rolls back mtr (and all dependent operations) retroactively on conflict
Adaptively switches to pessimistic resolution if high percentage of conflict detected
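The adaptive switch between optimistic and pessimistic resolution can be sketched as a policy driven by a sliding window of recent outcomes. The window size, the threshold, and the `AdaptiveMtrPolicy` name are all assumptions; the slides only say that the switch happens when a high percentage of conflicts is detected.

```python
from collections import deque


class AdaptiveMtrPolicy:
    """Chooses a resolution mode from the recent conflict rate (hypothetical)."""

    def __init__(self, window=100, threshold=0.2):
        # Both defaults are illustrative assumptions, not values from the talk.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, conflicted):
        self.outcomes.append(bool(conflicted))

    def conflict_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def mode(self):
        return "pessimistic" if self.conflict_rate() > self.threshold else "optimistic"


policy = AdaptiveMtrPolicy(window=10, threshold=0.3)
for _ in range(10):
    policy.record(False)
quiet_mode = policy.mode()

for _ in range(5):
    policy.record(True)   # window now holds 5 clean and 5 conflicted outcomes
busy_mode = policy.mode()

for _ in range(10):
    policy.record(False)  # conflicts age out of the window
recovered_mode = policy.mode()
print(quiet_mode, busy_mode, recovered_mode)
```

A bounded window lets the policy return to optimistic execution once a burst of conflicts has passed, rather than staying pessimistic forever.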
Aurora Multi-Master Use Cases
Continuous Availability
A lot of work can be done on a single Aurora writer
Writable replicas provide instant failover
[Diagram: the application connects to instances in AZ1, AZ2, and AZ3, labeled R/W and R/O, on top of a shared storage volume of storage nodes with SSDs]
Multi-Writer Configuration
Structure the workload to limit conflicts between database instances.
Prefer partitioning writes per table (or table partition) from a single database instance.
Aurora MM allows customers to “soft partition,” or re-partition on the fly
Continuous availability through failures and planned maintenance
[Diagram: the application connects to R/W instances in AZ1, AZ2, and AZ3, on top of a shared storage volume of storage nodes with SSDs]
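The advice to send each table's writes through a single instance can be sketched as an application-side router. Everything here is hypothetical, including the `WriteRouter` class, the round-robin default, and the instance names; it shows soft partitioning as a mapping that can be changed at any time because every instance can still write every table.

```python
class WriteRouter:
    """Maps each table to one preferred writer instance (hypothetical helper)."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one writer instance is required")
        self.instances = list(instances)
        self.owner = {}  # table -> instance currently receiving its writes

    def route(self, table):
        # First sight of a table: assign writers round-robin (an assumption).
        if table not in self.owner:
            index = len(self.owner) % len(self.instances)
            self.owner[table] = self.instances[index]
        return self.owner[table]

    def repartition(self, table, instance):
        # Soft partitioning: move a table's writes without moving any data.
        if instance not in self.instances:
            raise ValueError(f"unknown instance: {instance}")
        self.owner[table] = instance

    def fail(self, instance):
        # On failure, hand the lost instance's tables to a surviving writer.
        self.instances.remove(instance)
        if not self.instances:
            raise RuntimeError("no writer instances left")
        for table, current in self.owner.items():
            if current == instance:
                self.owner[table] = self.instances[0]


router = WriteRouter(["instance1", "instance2"])
orders_writer = router.route("orders")
users_writer = router.route("users")
router.repartition("orders", "instance2")
moved_writer = router.route("orders")
router.fail("instance2")
after_failure = (router.route("orders"), router.route("users"))
```

Because the partitioning lives only in the routing table, a mistake costs some conflict retries rather than correctness, which is what makes re-partitioning on the fly practical.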
[Chart: Scaling/Node Failure in Aurora Multi Master. Y-axis: total writes/second, 0 to 160,000. Phases: One Writer (R4.4XL); Scale out to Two Writers (R4.4XL + R4.4XL); Scale Up Instance1 (R4.8XL + R4.4XL); Instance1 offline; Continuous availability]
Questions