High Availability Storage (SUSECON 2016)
TRANSCRIPT
![Page 1: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/1.jpg)
High Availability Storage SUSE's Building Blocks
Roger Zhou, Sr. Engineer, [email protected]
Guoqing Jiang, HA Engineer, [email protected]
![Page 2: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/2.jpg)
SLE-HA Introduction
The SUSE Linux Enterprise High Availability Extension, powered by Pacemaker and Corosync, eliminates single points of failure (SPOF) for critical data, applications, and services by implementing HA clusters.
TUT88811 - Practical High Availability
[Diagram: Web Sites A–F distributed across Web Servers 1–3, all attached to shared storage through a Fibre Channel switch]
![Page 3: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/3.jpg)
Storage
Software Elements
– Block Device: hdX, sdX, vdX => vgX, dmX, mdX
– Filesystem: ext3/4, xfs, btrfs => ocfs2, gfs2
Enterprise Requirements (must-have)
– Highly Available (Active-Passive, Active-Active)
– Data Protection (Data Replication)
![Page 4: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/4.jpg)
Small Business
![Page 5: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/5.jpg)
High Availability Block Device – Active/Passive
![Page 6: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/6.jpg)
High Availability – DRBD
• DRBD – Distributed Replicated Block Device
• A special master/slave resource managed by the Pacemaker + Corosync software stack (a crmsh sketch follows the diagram)
• The SLE HA stack manages service ordering, dependencies, and failover
• Active-Passive approach
[Diagram: Host1 and Host2 under Pacemaker + Corosync; the active host stacks Virtual IP → Apache/ext4 → DRBD Master in the kernel, the standby host runs DRBD Slave; on failover the service stack moves to Host2]
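A minimal crmsh sketch of this stack, assuming a DRBD resource named r0 backing /dev/drbd0, an ext4 mount at /srv/www, and the virtual IP 192.168.1.100 (all placeholders):

```sh
# Define the DRBD master/slave resource and the service stack on top of it.
crm configure primitive p-drbd ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=15 role=Master \
    op monitor interval=30 role=Slave
crm configure ms ms-drbd p-drbd \
    meta master-max=1 clone-max=2 notify=true
crm configure primitive p-fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/srv/www fstype=ext4
crm configure primitive p-ip ocf:heartbeat:IPaddr2 params ip=192.168.1.100
crm configure primitive p-web ocf:heartbeat:apache
crm configure group g-web p-fs p-ip p-web
# Run the service group only where DRBD is Master, and start it afterwards.
crm configure colocation col-web-with-master inf: g-web ms-drbd:Master
crm configure order ord-master-then-web Mandatory: ms-drbd:promote g-web:start
```

Pacemaker then promotes DRBD on one node and starts the filesystem, IP, and Apache there; on failure it re-promotes and restarts the group on the peer.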
![Page 7: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/7.jpg)
High Availability LVM
• HA-LVM resources are managed by the Pacemaker + Corosync software stack
• Enabled with: lvmconf --enable-halvm (a sketch follows the diagram)
• Active-Passive approach
[Diagram: same two-host Pacemaker + Corosync layout, with Virtual IP → LVM → DRBD Master/Slave in the kernel on each host; failover moves the stack from Host1 to Host2]
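A sketch of the HA-LVM piece, assuming a volume group vg1 on top of the DRBD device (names are placeholders); the DRBD master/slave resource, colocation, and ordering follow the same pattern as the previous slide:

```sh
# Prepare lvm.conf for host-exclusive (HA-LVM) activation.
lvmconf --enable-halvm
# Activate the VG exclusively on the node that holds the DRBD Master.
crm configure primitive p-lvm ocf:heartbeat:LVM \
    params volgrpname=vg1 exclusive=true \
    op monitor interval=30
```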
![Page 8: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/8.jpg)
High Availability iSCSI Server
• iSCSI exports block devices over TCP/IP (a crmsh sketch follows the diagram)
• Active-Passive approach
[Diagram: Host1 and Host2 under Pacemaker + Corosync; each host stacks Virtual IP → iSCSILogicalUnit → iSCSITarget in the kernel on top of DRBD Master/Slave with LVM; failover moves the target stack to Host2]
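A crmsh sketch of the target stack, assuming an LIO target with placeholder IQN, backing device, and IP:

```sh
crm configure primitive p-target ocf:heartbeat:iSCSITarget \
    params iqn=iqn.2016-11.com.example:disk1 implementation=lio-t
crm configure primitive p-lun ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn=iqn.2016-11.com.example:disk1 lun=0 path=/dev/drbd0
crm configure primitive p-ip-iscsi ocf:heartbeat:IPaddr2 params ip=192.168.1.101
# Start the target, then the LUN, then the IP, together on one node.
crm configure group g-iscsi p-target p-lun p-ip-iscsi
```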
![Page 9: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/9.jpg)
High Availability Filesystem
![Page 10: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/10.jpg)
High Availability NFS (HA-NAS)
• NFS server on top of ext3/ext4 (a crmsh sketch follows the diagram)
• The same applies to:
– xfs
– CIFS (Samba)
– etc.
• Active-Passive approach
[Diagram: Host1 and Host2 under Pacemaker + Corosync, each stacking Virtual IP → NFS Exports → Filesystem (ext4) in the kernel on top of DRBD Master/Slave with LVM; failover moves the stack to Host2]
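A crmsh sketch of the NFS export group, with placeholder paths, client network, and IP; the DRBD/LVM layer underneath follows the earlier slides:

```sh
crm configure primitive p-fs-nfs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/srv/nfs fstype=ext4
crm configure primitive p-export ocf:heartbeat:exportfs \
    params directory=/srv/nfs clientspec=192.168.1.0/24 options=rw fsid=1
crm configure primitive p-ip-nfs ocf:heartbeat:IPaddr2 params ip=192.168.1.102
# Mount, export, then bring up the service IP, as one failover unit.
crm configure group g-nfs p-fs-nfs p-export p-ip-nfs
```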
![Page 11: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/11.jpg)
High Availability NFS (HA-NAS) with Shared Disk
• Active-Passive approach
NOTE: multipathing is a must-have for enterprise deployments (see the sketch after the diagram)
[Diagram: Host1 and Host2 under Pacemaker + Corosync, each stacking Virtual IP → NFS Exports → Filesystem (ext4) in the kernel over one shared disk; failover moves the stack to Host2]
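With a shared disk, the Filesystem resource points at a multipathed device instead of DRBD; a sketch with example device names:

```sh
# Verify the multipath topology of the shared LUN.
multipath -ll
# The filesystem resource now uses the /dev/mapper multipath device.
crm configure primitive p-fs-shared ocf:heartbeat:Filesystem \
    params device=/dev/mapper/mpatha-part1 directory=/srv/nfs fstype=ext4
```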
![Page 12: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/12.jpg)
Motivation for Your Storage Growth
Scalability → extendable storage
Easy Management → cluster-aware components
Cost Effective → shared resources
Performance → active-active
![Page 13: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/13.jpg)
Concurrent Sharing – you need a lock
![Page 14: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/14.jpg)
Clustered LVM2 (cLVM2)
• cLVM2 enables multiple nodes to use LVM2 on a shared disk
• cLVM2 coordinates LVM2 metadata; it does not coordinate access to the shared data
• That said, it is safe for multiple nodes to access data on different dedicated VGs (a sketch follows the diagram)
• Enabled with: lvmconf --enable-cluster
• Active-Active approach
[Diagram: Host1 and Host2 each run clvmd and coordinate LVM2 metadata over Pacemaker + Corosync + DLM, sharing LUN 1 and LUN 2]
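A sketch of the base stack, assuming the SLE HA resource agents and a placeholder shared LUN; DLM and clvmd are cloned so they run on every node:

```sh
# Switch lvm.conf to clustered locking (locking_type=3).
lvmconf --enable-cluster
crm configure primitive p-dlm ocf:pacemaker:controld op monitor interval=60
crm configure primitive p-clvmd ocf:lvm2:clvmd op monitor interval=60
crm configure group g-storage p-dlm p-clvmd
crm configure clone cl-storage g-storage meta interleave=true
# Create the clustered VG on the shared LUN (device ID is a placeholder).
vgcreate --clustered y vg-cluster /dev/disk/by-id/scsi-EXAMPLE
```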
![Page 15: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/15.jpg)
Clustered FS – OCFS2 on shared disk
• OCFS2 resources are managed by the Pacemaker + Corosync + DLM software stack (a sketch follows the diagram)
• Multiple hosts can modify the same file, at a performance penalty
• DLM locks protect the data access
• Active-Active approach
[Diagram: Host1 and Host2 run the clustered file system (OCFS2) in the kernel under Pacemaker + Corosync + DLM, both mounting the same shared disk]
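A sketch of bringing OCFS2 up on a shared disk, assuming the DLM clone from the cLVM2 slide is already running (device and mount point are placeholders):

```sh
# Create the filesystem with slots for up to 4 cluster nodes.
mkfs.ocfs2 -N 4 /dev/disk/by-id/scsi-EXAMPLE
# Clone the mount so every node mounts the same filesystem.
crm configure primitive p-ocfs2 ocf:heartbeat:Filesystem \
    params device=/dev/disk/by-id/scsi-EXAMPLE directory=/srv/cluster fstype=ocfs2 \
    op monitor interval=20
crm configure clone cl-ocfs2 p-ocfs2 meta interleave=true
```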
![Page 16: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/16.jpg)
OCFS2 + cLVM2 (both volumes and data are safe)
• cLVM2 provides the clustered VG; the metadata is protected
• OCFS2 protects the data access inside the volume
• Safe to enlarge the filesystem size (a sketch follows the diagram)
• Active-Active approach
[Diagram: as before, but each host stacks OCFS2 on cLVM2 in the kernel under Pacemaker + Corosync + DLM, over the shared disk]
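Growing the filesystem might look like this (a sketch; names, sizes, and the tunefs.ocfs2 option should be verified against your version):

```sh
# Grow the clustered logical volume by 10 GiB.
lvextend -L +10G /dev/vg-cluster/lv-data
# Grow OCFS2 to fill the enlarged LV (tunefs.ocfs2 -S / --volume-size).
tunefs.ocfs2 -S /dev/vg-cluster/lv-data
```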
![Page 17: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/17.jpg)
Is Data Safe Enough?
![Page 18: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/18.jpg)
Data Replication
Data replication in general:
– File replication (not the focus of this talk)
– Block-level replication
• A great new feature, "Clustered MD RAID 1", is now available in SLE 12 HA SP2
– Object storage replication
• TUT89016 - SUSE Enterprise Storage: Disaster Recovery with Block Mirroring
Requirements in the HA context:
– Multiple nodes
– Active-active
![Page 19: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/19.jpg)
Data Replication – DRBD
DRBD can be thought of as a networked RAID1.
DRBD allows you to create a mirror of two block devices located at two different sites across the network.
It mirrors data in real time, so replication occurs continuously; it also works well for long-distance replication.
SLE 12 HA SP2 now supports DRBD 9.
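A minimal two-node resource file sketch (host names, devices, and addresses are placeholders):

```sh
# /etc/drbd.d/r0.res -- then run: drbdadm create-md r0 && drbdadm up r0 (on both nodes)
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
    device    /dev/drbd0;
    disk      /dev/sda1;
    meta-disk internal;
    net { protocol C; }        # synchronous replication
    on host1 { address 192.168.1.1:7789; }
    on host2 { address 192.168.1.2:7789; }
}
EOF
```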
![Page 20: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/20.jpg)
Data Replication – clvm/cmirrord
● There are different types of LVs in CLVM: snapshot, striped, mirrored, etc.
● CLVM extends LVM to support transparent management of volume groups across the whole cluster.
● With CLVM we can also create mirrored LVs to achieve data replication; cmirrord is used to track the mirror log info in the cluster.
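A sketch of creating such a mirrored LV in an already clustered VG (names and size are placeholders; cmirrord must be running on every node):

```sh
# Two-way mirror; cmirrord keeps the mirror log consistent cluster-wide.
lvcreate --type mirror -m 1 -L 10G -n lv-mirror vg-cluster
```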
![Page 21: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/21.jpg)
Data Replication – clustered md/raid1
Cluster multi-device (Cluster MD) is a software-based RAID storage solution for a cluster.
The biggest motivation for Cluster MD is that CLVM/cmirrord has severe performance issues, and people are not happy about it.
Cluster MD provides the redundancy of RAID1 mirroring to the cluster. SUSE is considering support for other RAID levels after further evaluation.
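A sketch of creating a clustered RAID1 with mdadm (devices are placeholders; requires a running Corosync cluster and the DLM):

```sh
# The clustered write-intent bitmap is what distinguishes Cluster MD.
mdadm --create /dev/md0 --bitmap=clustered --metadata=1.2 \
      --level=mirror --raid-devices=2 /dev/sda /dev/sdb
# On each other node, assemble the same array:
mdadm --assemble /dev/md0 /dev/sda /dev/sdb
```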
![Page 22: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/22.jpg)
Data Replication – clustered md/raid1 (cont.)
Internals:
– Cluster MD keeps a write-intent bitmap for each cluster node.
– During "normal" I/O access, we assume the clustered filesystem ensures that only one node writes to any given block at a time.
– With each node having its own bitmap, there is no locking and no need to resync the array during normal operation.
– Cluster MD only handles the bitmaps when resync/recovery etc. happens.
![Page 23: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/23.jpg)
Data Replication – clustered md/raid1 (cont.)
It coordinates RAID1 metadata, not access to the shared data, and its performance is close to native RAID1.
By contrast, CLVM must send a message to user space that is passed to all nodes; acknowledgments must return to the originating node, and then to the kernel module.
Some related links:
https://lwn.net/Articles/674085/
http://www.spinics.net/lists/raid/msg47863.html
![Page 24: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/24.jpg)
Data Replication – Comparison

| | Active/Active mode | Suitable for Geo | Shared Storage |
|---|---|---|---|
| DRBD | Supported (limited to two nodes) | Yes | No, storage is dedicated to each node |
| CLVM (cmirrord) | Supported; node count limited by Pacemaker and Corosync | No | Yes |
| Clustered md/raid1 | Supported; node count limited by Pacemaker and Corosync | No | Yes |
![Page 25: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/25.jpg)
Data Replication – Performance Comparison
FIO test with sync engine
[Bar charts: average IOPS by block size (4k, 16k) for RawDisk, NativeRaid, Clustermd, and Cmirror; read and write panels]
![Page 26: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/26.jpg)
Data Replication – Performance Comparison
FIO test with libaio engine
[Bar charts: average IOPS by block size (4k, 16k) for RawDisk, NativeRaid, Clustermd, and Cmirror; read and write panels]
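The slides do not show the exact fio invocation; a representative run for one data point might look like this (device, queue depth, and runtime are assumptions):

```sh
fio --name=randwrite --filename=/dev/md0 --direct=1 \
    --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k \
    --runtime=60 --time_based --group_reporting
# Repeat with --ioengine=sync, --bs=16k, and --rw=randread for the
# other bars in the charts.
```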
![Page 27: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/27.jpg)
Recap
![Page 28: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/28.jpg)
Recap: HA Storage Building Blocks
Block-level:
– HA DRBD
– HA LVM2
– HA iSCSI
– cLVM2
– Clustered MD RAID1
Filesystem-level:
– HA NFS / CIFS
– HA EXT3/EXT4/XFS
– OCFS2 (GFS2)
![Page 29: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/29.jpg)
Question & Answer
![Page 30: High Availability Storage (susecon2016)](https://reader033.vdocuments.us/reader033/viewer/2022052418/58edf4971a28aba34a8b4681/html5/thumbnails/30.jpg)