TRANSCRIPT
Retrofitting High Availability Mechanism to Tame Hybrid Transaction/Analytical Processing
Sijie Shen, Rong Chen, Haibo Chen, Binyu Zang {ds_ssj, rongchen}@sjtu.edu.cn
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Shanghai Artificial Intelligence Laboratory
Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China
Typical workloads of databases

OLTP (OnLine Transaction Processing): short-term read-write transactions
- Online order processing
- Stock exchange
- E-commerce

OLAP (OnLine Analytical Processing): long-term read-only queries
- Business intelligence
- Financial reporting
- Data mining
HTAP: a new type of workload

HTAP (Hybrid Transaction/Analytical Processing) = OLTP + OLAP: real-time queries on data generated by transactions
- IoT
- Healthcare
- Fraud detection
- Personalized recommendation

Requirements
- Performance: minimal performance degradation (e.g., < 10%); millions of transactions per second [1]
- Freshness: real-time delay between TP and AP (e.g., < 20 ms)

Scenario:    Fraud detection | System monitoring | Personalized ads | Stock exchange
Time delay:  20 ms           | 20 ms             | 100 ms           | 200 ms

[1] Alibaba Cloud. Double 11 Real-Time Monitoring System with Time Series Database.
Data analysis: TP + ETL + AP

[Figure: OLTP systems (MySQL, RocksDB, ...) feed OLAP systems (ClickHouse, MonetDB) through an ETL pipeline (Extract, Transform, Load; e.g., Kafka, IDAA, F1 Lightning). The OLTP side handles data generation on a row store (update, delete, insert); the OLAP side handles data analysis on a column store (SUM, AVG, GROUP BY).]

HTAP alternatives
- Alternative #1 (DUAL-SYSTEM): good performance, but large time delay (from seconds to minutes)
- Alternative #2 (SINGLE-LAYOUT): short time delay, but huge performance degradation
- Alternative #3 (DUAL-LAYOUT): lightweight synchronization (a tradeoff)
[Figure: freshness (x-axis, ms to s) vs. OLTP throughput (TPC-C NewOrder txns/s, 1K to 10M) and OLAP (TPC-H) performance degradation (0% to 80%) for HTAP systems: TiDB (community), IDAA, F1 Lightning, SyPer (SQL Server), MemSQL, BatchDB, HyPer, Hekaton, and VEGITO; OLTP-specific baselines DrTM+H and FaRMv2 shown for reference. Application freshness requirements are annotated: fraud detection (~20 ms), system monitoring (~20 ms), online gaming (50-100 ms), personalized ads (~100 ms), stock price monitoring (~200 ms); human reaction time is 100-250 ms. The goal: ~20 ms freshness with low degradation on both OLTP and OLAP.]
VEGITO: a distributed in-memory HTAP system

Opportunity: high availability (HA) for fast in-memory OLTP
- Replication-based HA mechanisms are common
- Synchronous log shipping during transaction commit

Reuse HA for HTAP
- For performance: OLTP on the primary, OLAP on backup/AP
- For freshness: synchronous logs

[Figure: conventional HA keeps a primary plus backup/TP replicas, each with log cleaners; backup/TP is fresh and fault-tolerant. VEGITO converts one replica into backup/AP, which is fresh, fault-tolerant, and columnar.]
Effects of VEGITO

[Figure: the same freshness vs. performance chart as above, now highlighting VEGITO: it reaches ~20 ms freshness while keeping OLTP throughput close to the OLTP-specific systems and low OLAP degradation.]
Challenge #1: Log cleaning

Log shipping
- TP threads append logs to queues
- Cleaners drain the logs

For high availability
- Logs are drained in parallel
- No consistency is needed until recovery

For AP queries
- Cleaning must be consistent
- Consistency is slow (70% throughput drop) with a global timestamp + sequential cleaning

C#1: Backup/AP needs consistent and parallel log cleaning.
Epoch-based design

Partition time into non-overlapping epochs

Time isolation between OLTP and OLAP; each machine maintains
- E/TX: the epoch of TX logs (increased periodically)
- E/C: the epoch of logs currently being drained
- E/Q: the epoch of stable versions on backup/AP

Freshness: ms-level epochs with the consistency of distributed transactions

[Figure: the primary ships logs to backup/TP and backup/AP; with E/TX=5, E/C=4, and E/Q=3, versions up to epoch 3 are stable on backup/AP, epoch 4 is still unstable, and freshness is the gap between E/Q and E/TX.]
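The interplay of the three counters can be sketched as a tiny model (hypothetical names and methods; the real system maintains these counters per machine in a far more concurrent setting):

```python
# Minimal model of the three per-machine epoch counters (a sketch of the
# idea, not VEGITO's actual implementation).

class EpochState:
    def __init__(self):
        self.e_tx = 1  # E/TX: epoch assigned to new TX logs (bumped periodically)
        self.e_c = 0   # E/C: epoch whose logs are currently being drained
        self.e_q = 0   # E/Q: newest epoch fully stable on backup/AP

    def advance_tx_epoch(self):
        """Called periodically (e.g., every 15 ms): new TXs get a newer epoch."""
        self.e_tx += 1

    def finish_draining(self, epoch):
        """All logs of `epoch` drained on this machine: it becomes stable."""
        assert epoch == self.e_q + 1  # epochs become stable in order
        self.e_q = epoch
        self.e_c = min(epoch + 1, self.e_tx)

    def freshness_gap(self):
        """Queries on backup/AP lag behind OLTP by this many epochs."""
        return self.e_tx - self.e_q

s = EpochState()
s.advance_tx_epoch(); s.advance_tx_epoch()  # E/TX = 3
s.finish_draining(1)                        # epoch 1 stable -> E/Q = 1
print(s.freshness_gap())                    # gap of 2 epochs
```

With a ms-level epoch interval, this gap directly bounds how stale a query on backup/AP can be.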
Consistent epoch assignment

Gossip-style epoch assignment
- An epoch oracle updates the epoch periodically and broadcasts it
- Epochs are gossiped during commit if a dependence would be violated
- Consistency: a preceding TX always has an equal or smaller epoch

[Figure: TX1 commits writes on M1 and M2 with epoch 5, which it heard from the epoch oracle's broadcast; M2 still believes epoch 4. When TX2 on M2 reads TX1's write, the commit gossips epoch 5, so TX2 is not assigned a smaller epoch than TX1 (TX1 happens-before TX2).]
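The consistency rule above can be sketched as follows (hypothetical class and method names): a committing transaction takes the maximum of its machine's local epoch and the epochs of transactions it read from, and writes that value back so later transactions see it too.

```python
# Sketch of gossip-style epoch assignment (hypothetical names, not
# VEGITO's actual code). Rule: a committing TX takes the max of its
# machine's local epoch and the epochs it observed from TXs it depends
# on, so a happens-before predecessor never gets a larger epoch.

class Machine:
    def __init__(self):
        self.local_epoch = 0  # last epoch heard from the oracle or via gossip

    def on_oracle_broadcast(self, epoch):
        # The epoch oracle periodically broadcasts a new epoch.
        self.local_epoch = max(self.local_epoch, epoch)

    def commit(self, observed_epochs=()):
        """Assign an epoch to a committing TX and gossip it to this machine."""
        epoch = max(self.local_epoch, *observed_epochs) if observed_epochs \
            else self.local_epoch
        self.local_epoch = epoch  # gossip: later TXs see at least this epoch
        return epoch

m1, m2 = Machine(), Machine()
m1.on_oracle_broadcast(5)             # the broadcast already reached M1 ...
m2.on_oracle_broadcast(4)             # ... but M2 still believes epoch 4
e1 = m1.commit()                      # TX1 commits on M1 with epoch 5
e2 = m2.commit(observed_epochs=[e1])  # TX2 read TX1's write: gossip pulls M2 to 5
assert e1 <= e2                       # happens-before order respected
```

Without the gossip step, TX2 would commit with epoch 4 and could become visible on backup/AP before TX1, violating consistency.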
Parallel log cleaning

Clean logs following TX dependences
- Parallelism: logs within an epoch are drained in parallel
- Consistency: each machine updates E/C once all logs of an epoch have been drained

[Figure: cleaners on M1 and M2 drain the logs of epoch 4 (including TX1's and TX2's writes, where TX1 happens-before TX2) in parallel; each machine advances E/C from 4 to 5 only after every log of epoch 4 is drained, while E/Q=3 remains the stable epoch for queries.]
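The per-epoch barrier can be sketched like this (hypothetical structure; real cleaners run continuously on sharded log queues): the logs of one epoch are drained by several threads at once, and E/C only advances after all of them finish.

```python
# Sketch of parallel log cleaning with a per-epoch barrier (hypothetical
# structure, not VEGITO's actual code): cleaners drain the log queues of
# one epoch concurrently, and E/C advances only after *all* of them finish.
import threading

class EpochCleaner:
    def __init__(self):
        self.e_c = 0     # E/C: last fully drained epoch on this machine
        self.store = {}  # backup/AP state: key -> value

    def drain_epoch(self, epoch, log_queues):
        def drain(queue):
            for key, value in queue:   # logs within one epoch: drained freely
                self.store[key] = value
        threads = [threading.Thread(target=drain, args=(q,)) for q in log_queues]
        for t in threads:
            t.start()
        for t in threads:              # the joins act as the epoch barrier
            t.join()
        self.e_c = epoch               # safe: every log of this epoch applied

cleaner = EpochCleaner()
cleaner.drain_epoch(4, [[("a", 1), ("b", 2)], [("c", 3)]])
print(cleaner.e_c, sorted(cleaner.store.items()))
```

Because E/C moves only at epoch granularity, queries pinned to E/Q never observe a half-applied epoch, yet all queues within an epoch drain in parallel.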
Challenge #2: Multi-version column store (MVCS)

Multi-version column store (MVCS)
- Needed for isolation & OLAP performance

Different locality for reads & writes
- Column-wise reads vs. row-wise writes

Chain-based MVCS
- Updates efficiently
- But scan performance drops 90% when reading only 0.5 extra versions per record on average

[Figure: chain-based MVCS example: each record keeps a chain of versions tagged with epochs (e.g., 100 @E=1 followed by 97 @E=3), so scans must chase pointers across the chains.]
VEGITO: block-based MVCS

Exploit performance for OLAP
- Scan-efficient (locality)
- Updates: copy-on-write (CoW) at the granularity of blocks

Optimizations: row-split & col-merge
- Exploit temporal & spatial locality
- Row-split: split a column into several pages (the unit of CoW)
- Col-merge: merge highly related columns together

[Figure: block-based multi-version column store: each epoch's snapshot (E=1 to E=4) shares unchanged blocks with earlier epochs and copies only the updated ones.]

12.5x scan performance improvement
Challenge #3: Tree-based index

Interference from heavy inserts
- A write-optimized tree index comes at the expense of read performance
- A read-optimized tree index has limited write performance

Two-phase concurrent updating

Tree insert = in-place insert + balance (costly)

Insert into a per-leaf buffer; balance in batches
- Insert: 8.7x faster than STX+HTM (read-optimized), 1.4x faster than Masstree (write-optimized)
- Read: 9% overhead under insertion
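The two phases can be sketched on a single leaf (hypothetical structure; the real index is a concurrent tree with many leaves): phase 1 appends to a small per-leaf buffer with no rebalancing, and phase 2 periodically merges the buffer and balances in batch.

```python
# Sketch of two-phase concurrent updating on one leaf (hypothetical names,
# not the actual index code). Phase 1: inserts append to a per-leaf buffer,
# avoiding the costly in-place shift/split/balance on the hot path. Phase 2:
# a batch step merges the buffer into the sorted leaf and rebalances once.
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = sorted(keys)   # sorted portion, searched with bisect
        self.buffer = []           # unsorted recent inserts (phase 1)

    def insert(self, key):
        self.buffer.append(key)    # cheap: no shift, no split, no balance

    def contains(self, key):
        i = bisect.bisect_left(self.keys, key)
        found = i < len(self.keys) and self.keys[i] == key
        return found or key in self.buffer  # small extra cost on reads

    def compact(self):
        """Phase 2 (batched): merge the buffer; split/balance would run here."""
        self.keys = sorted(self.keys + self.buffer)
        self.buffer = []

leaf = Leaf([10, 30, 50])
leaf.insert(20); leaf.insert(40)                 # phase 1: buffered
assert leaf.contains(20) and not leaf.contains(25)
leaf.compact()                                   # phase 2: merge in batch
print(leaf.keys)                                 # [10, 20, 30, 40, 50]
```

Reads pay only a small scan over the buffer (matching the ~9% overhead the slide reports), while inserts avoid the per-operation balancing cost entirely.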
Fault tolerance

VEGITO preserves the same availability guarantees
- No extra replicas are needed
- It prefers to recover the primary from backup/TP

Special cases (rare)
- Both the primary and backup/TP fail: rebuild the primary from backup/AP and migrate it to another machine
- Backup/AP fails: rebuild it by the next epoch
Evaluation setup

16 machines, each with
- 2x 12-core Intel Xeon E5-2650 processors, 128 GB RAM
- 2x ConnectX-4 100 Gbps InfiniBand NICs

Benchmark
- CH-benCHmark ≈ TPC-C + TPC-H

Workload settings
- OLTP-only
- OLAP-only
- HTAP (VEGITO: epoch interval = 15 ms)
Comparison with specialized systems

Compared with OLTP-specific systems
- Peak throughput: 3.7M txns/s
- Only 1% lower than DrTM+H (OLTP-specific)

Compared with OLAP-specific systems
- Geometric-mean latency: 57.2 ms
- 2.8x faster than MonetDB (OLAP-specific)
HTAP workloads

VEGITO delivers both performance and freshness
- OLTP: 1.9M txns/s (degradation: 5%)
- OLAP: 24 queries/s (degradation: 1%)
- Freshness (max time delay): < 20 ms
Recovery

Kill one of the primaries, twice
- Recovery from backup/AP: 42 ms to rebuild the primary
Conclusion

VEGITO: retrofitting the high availability mechanism to tame hybrid transaction/analytical processing

OLTP on the primary, OLAP on the backup
- Backup/AP: fresh, columnar, and fault-tolerant

Please check out VEGITO at
- https://github.com/SJTU-IPADS/vegito
Thanks & QA