TRANSCRIPT
A Journey with SIGLabs
School of Computing
National University of Singapore
Building Storage on the Cheap
• Yap Yao Jun – [email protected]
• Year 4 – Information Systems (InfoSec)
• Joined SIGLabs (Student Network Associates) in January 2010
About
• Since Late 2010
• Really Huge Storage @ Really Cheap Price
• $5K to build the P.O.C
• 2 storage pods were built (45TB and 135TB)
SIGLabs JBOD Project
• Super Micro
  • Maximum of 36 drives
  • ≈ SGD$5,200 (w/o RAM, CPU, HDD)
• Backblaze 45 drives
  • Chassis only: USD$872 ≈ SGD$1,133
  • Complete (w/o HDD): USD$5,395 ≈ SGD$7,013
The Chassis Hunt
• Shipping cost > USD$250
• Really huge box with lots of bubble wrap
• Flown in over CNY 2011, by SQ from the USA
• Sil3726 chip
• From Protocase @ USD$60 each
• USD$540 ≈ SGD$700
Multiplier Card
• Sil3124
• PCI-e (x1) interface
• 4 SATA II ports
• era-adapter.com @ USD$59.95 each
• USD$179.85 ≈ SGD$234
SATA Expansion Card
P.O.C Build
Phase 1
• AMD Athlon™ II X4 640 Processor
• 2 x 4GB DDR3 RAM
• PSU
  • iCute 500W
  • Cooler Master 460W
  • Seasonic X-760
• 500GB OS HDD
• 10 x 2TB Seagate LP Drives
Storage Pod #1
But!
• Stand-off clearance
  • 30mm brass standoff is too tall!
  • 20mm brass standoff is too short!
  • Cheap Y-Molex power connectors are too tall!
• Power issues
  • iCute: really ugly AND not adorable at all
  • Replaced by the modular Seasonic X-760
• Simultaneous PSU firing
  • 2nd PSU is toggled on using an external switch
Initial Setbacks
• Improvise: el-cheapo standoffs
  • Green wall plugs, 7mm
  • Too long! Cut from 1” down to 0.75”
• 8 per multiplier card; cut 72 of them
• SGD$2 for 100 pieces. Cheap!
  • Else it’s USD$29 for the ones Backblaze is using.
Stand-Offs
• Bought Ready Made
• Daisy Chaining
Right Angle 4Pin Molex
Working. Working. Working.
PSU Hack
Taking the pin out
Connecting a signal cable directly to the MB socket
Connecting directly to the PSU
Broken pin extractor: the plastic guides were too small and had to be cut
Managed to fit 3.5” HDDs into the enclosure
Can power up 10 HDD (no dropouts)
Can power on the 2 PSUs simultaneously
Can do Linux RAID (mdadm)
Can export NFS
Project was supposed to continue but in the later part of 2011…
Phase 1 Achievements
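As a rough illustration of the NFS export from the P.O.C (not from the talk; the path and subnet below are made up):
  # /etc/exports
  /mnt/md0    172.16.0.0/16(ro,async,no_subtree_check)
  # exportfs -ra             # reload the export table
  # showmount -e localhost   # confirm the export is visible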
• HDD prices sky-rocketed
  • Thailand floods
  • Prices only recovered after May 2012
• Project was shelved
The Great HDD Shortage of 2011
Really filling up the Storage Pod with Hard Disks
August 2012
Phase 2
• 47 Seagate 1TB ES2
• 1 Hitachi 1TB
• Unable to power them… again…
48 x 1TB Drives from Sun x4500
• POST with 33 drives on
• Building RAID: drives get dropped while building
  • When 1 drive drops, the whole array on that multiplier drops
  • No success; at best 27 drives
Erm…
• Used a multimeter to measure voltage at the multiplier card
  • The 12V rail was around 11V
  • Under-volt. Tsk. :\
• Get “raw”-er materials from Sim Lim Tower
  • 2x3 mini female Molex: fits into the Seasonic X-760 (kudos to modular PSUs)
  • 4-pin disk drive Molex
  • 16 AWG wire
Problem. Again.
Custom Wiring
16AWG wires
Punch In
My Punch Tool & Cable Stripper
Negative Example
• Got a Second Seasonic X-760
• Each multiplier card has its own dedicated power wiring
• Stable 12V; Stable 5V
• No more HDD array drops
• Replicated to all 9 multiplier cards
Problem? Solved.
Backblaze Configuration
[Figure: 45 data drives, /dev/sdb through /dev/sdat, arranged as three 15-drive RAID-6 + JFS arrays]
• # fdisk /dev/sdb
  • Partition ID = “fd” (Linux RAID autodetect)
• # mdadm --create /dev/md0 --level=6 --raid-devices=15 /dev/sdb1 /dev/sdc1 …
• # mkfs.xfs /dev/md0
RAID-6
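A minimal sketch of how an array like this is typically checked and mounted afterwards (the mount point is illustrative):
  # cat /proc/mdstat            # watch the initial RAID-6 sync progress
  # mdadm --detail /dev/md0     # confirm all 15 member partitions are active
  # mkdir -p /mnt/pod1
  # mount /dev/md0 /mnt/pod1    # mount the freshly made filesystem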
/sys/block || /dev – Before reboot vs. after reboot
# reboot
[Figure: the same 45 drives (sdb through sdat) enumerate in a different order after a reboot, so the /dev/sdX names no longer point to the same physical disks]
• Renaming partitions
  • Linux RAID = ‘fd’
  • mdadm can use only partitions and not whole drives
  • Partition names end with a number
• udev rules to target the partitions
  • Using the HDD serial number
Hacking udev
/etc/udev/rules.d/90-jbod.rules:
KERNEL=="sd*[0-9]", ENV{ID_SERIAL_SHORT}=="5YD1GW0A", NAME="jbod1a"
… (one rule per drive)
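One way to collect the serial numbers for such a rules file is via udevadm; a sketch (the loop, the awk filter and the jbodXX placeholder are illustrative, not from the talk):
  for d in /dev/sd[b-z] /dev/sda[a-t]; do
    serial=$(udevadm info --query=property --name="$d" | awk -F= '/^ID_SERIAL_SHORT=/{print $2}')
    # print a rule skeleton; jbodXX still has to be filled in per physical slot
    echo "KERNEL==\"sd*[0-9]\", ENV{ID_SERIAL_SHORT}==\"$serial\", NAME=\"jbodXX\""
  done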
Partitions at /dev
[Figure: 45 renamed partitions, /dev/jbod1a through /dev/jbod3o, in three groups of 15: jbod1a–1o, jbod2a–2o, jbod3a–3o]
• It Works!
• Solved the problem of drives being loaded at different times at boot
Not Elegant
FreeNAS! “Worked Out of the Box!”
Running off a $8 USB Thumb Drive!
• Zettabyte File System
  • File system and RAID engine in one
  • Copy-on-write
  • Scrubbing prevents silent corruption
  • Incremental snapshots
  • Transparent compression
  • Deduplication
• Zpools
  • Grow-able
  • A mirror vdev can grow
  • But a RAID-Z vdev cannot grow
• http://youtu.be/CN6iDzesEs0?t=3m25s
ZFS
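Most of the features above are single commands; a hedged sketch using a made-up pool/dataset name (tank):
  # zpool scrub tank                      # scrub the pool to catch silent corruption
  # zfs snapshot tank/data@2013-03-01     # cheap, incremental snapshot
  # zfs set compression=lz4 tank/data     # transparent compression
  # zfs set dedup=on tank/data            # deduplication (RAM hungry)
  # zpool attach tank da10 da11           # a mirror vdev grows by attaching another disk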
• Single Parity
• Double Parity
  • Minimum 3 drives; RAID 6 needs a minimum of 4 drives
• Triple Parity (only in ZFS)
RAID-Z
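For illustration, the three RAID-Z parity levels are created like this; each line is an alternative, and the pool/disk names are placeholders:
  # zpool create tank raidz1 da0 da1 da2              # single parity
  # zpool create tank raidz2 da0 da1 da2 da3          # double parity
  # zpool create tank raidz3 da0 da1 da2 da3 da4      # triple parity, ZFS only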
• Performance: ZFS
• Ease of administration
  • Web GUI
  • NFS
  • iSCSI
  • AFP
  • rsync
FreeNAS – NAS made Easy
Uh-oh!
This shit happens every Tuesday night! #swapoff
• FreeNAS 8.3.0 Beta!
  • No more page faults on Tuesday nights
Let’s go Cutting Edge!
• Which drive is which drive?
  • Possible to pinpoint, but very troublesome
  • HDD serial number: # camcontrol devlist
• Configuration
  • kern.geom.label.gptid.enable="0"
  • kern.geom.label.ufsid.enable="0"
• Label partitions (e.g. /dev/ada0)
  • # gpart modify -i 1 -l drivename1 ada0
Problem
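Putting those pieces together, a rough FreeBSD/FreeNAS workflow could look like this (the label text is made up; the two kern.geom.label tunables above would typically go into /boot/loader.conf):
  # camcontrol devlist                     # map serial numbers to adaX devices
  # gpart modify -i 1 -l row1col1 ada0     # attach a human-readable GPT label to partition 1 of ada0
  # ls /dev/gpt                            # the label now shows up here instead of a gptid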
• Logs
  • Not retained, so hard to debug!
  • Used a script to symlink the logs to persistent storage
  • https://raw.github.com/jag3773/FreeNAS-Change-Logging/master/FreeNAS-Change-Logging.sh
• Swap
  • FreeNAS “defaults” 2GiB of every HDD to swap
  • We got 45 disks = 90GiB of swap (madness!)
  • # swapoff
• Export via NFS as nfsnobody:nfsnogroup
  • Issue with user permissions when rsyncing Debian repositories
  • chmod 02775
• Maproot user = root
Problem
Replacement of Hard Disks
Powering 45 Drives
Ease of use via FreeNAS
ZFS
Living in CR1 with Gigabit uplink
Serves NUS Mirror Storage Needs
Psst… It’s performing better than the X4500…
Phase 2 Achievements
For “Production”
The 2nd Build
• Intel Core i3-3220 (Ivy Bridge)
• H77 Chipset
• 16GB DDR3 RAM
• 2x Seasonic X-760 PSU
• 45 x Seagate 3TB (4k Aligned)
Storage Pod #2
Note: SATA cables have to be nudged in the “downward” direction, else there may be connection problems
• CentOS 6.3 – 64-bit
• ZFS-on-Linux 0.6.0-rc14
• World Wide Names (WWN)
  • Bash-scripted rule generator
• Elegant management of hard drive naming
What’s Different?
• WWN & vdevs
• /dev/disk/by-vdev/[1-9][a-e]
• /etc/zfs/vdev_id.conf [alias, multipath, sas_direct, sas_switch]
  • Alias: WWN – World Wide Name
  • It’s like a MAC address, but for hard disks
  • Creates symlinks to sd*
  • Allows other programs to run normally
  • S.M.A.R.T. smartd alerts via email
ZFS-on-Linux
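A sketch of the alias section of /etc/zfs/vdev_id.conf (the WWNs below are placeholders):
  # /etc/zfs/vdev_id.conf
  alias 1a  /dev/disk/by-id/wwn-0x5000c5004a000001
  alias 1b  /dev/disk/by-id/wwn-0x5000c5004a000002
  alias 1c  /dev/disk/by-id/wwn-0x5000c5004a000003
  …
  # udevadm trigger && ls /dev/disk/by-vdev    # regenerate and check the symlinks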
• Divisors of 45: 1x45
What can we do with 45 Drives?
[Figure: 9x5 grid of drive positions 1a–9e, all 45 drives in a single group]
• Divisors of 45: 1x45, 3x15
What can we do with 45 Drives?
[Figure: the same 9x5 grid split into 3 groups of 15 drives]
• Divisors of 45: 1x45, 3x15, 5x9
What can we do with 45 Drives?
[Figure: the same 9x5 grid split into 5 groups of 9 drives]
• Divisors of 45: 1x45, 3x15, 5x9, 9x5
What can we do with 45 Drives?
[Figure: the same 9x5 grid split into 9 groups of 5 drives]
• Divisors of 45: 1x45, 3x15, 5x9, 9x5, 15x3
What can we do with 45 Drives?
[Figure: the same 9x5 grid split into 15 groups of 3 drives]
With ZFS-on-Linux, Bonnie++ and fio
Performance
• Scripted testing
  • “Layouts”
  • RAID-Z level
  • Compression
  • Access time
  • Synchronization
Pursuit of the “Best” performance
• Hybrid storage
  • ARC: RAM
  • L2ARC: SSD
  • ZIL: SSD (SLC)
  • Main storage pool: HDD
• Better performance
  • atime=off
  • sync=disabled
  • compression=lz4
  • Debatable: if a better CPU…
ZFS – Arbudens
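Assuming the vdev aliases from earlier, a 5x9 raidz1 pool with these settings could be created roughly as follows; the slot-to-vdev grouping shown here is illustrative, not the one used in the talk:
  # zpool create tank \
      raidz1 1a 1b 1c 1d 1e 2a 2b 2c 2d \
      raidz1 2e 3a 3b 3c 3d 3e 4a 4b 4c \
      raidz1 4d 4e 5a 5b 5c 5d 5e 6a 6b \
      raidz1 6c 6d 6e 7a 7b 7c 7d 7e 8a \
      raidz1 8b 8c 8d 8e 9a 9b 9c 9d 9e
  # zfs set atime=off tank
  # zfs set sync=disabled tank
  # zfs set compression=lz4 tank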
• # bonnie++ -d /mnt/zfspool/ -s 32G -n 2:16M:8k -r 8G -z 1361377183 -u root -b
  • -d /mnt/zfspool/: directory to test
  • -s 32G: double the RAM is recommended
  • -n 2:16M:8k: 2x1024 files; max 16M; min 8k
  • -r 8G: amount of RAM to use
  • -z 1361377183: random seed
  • -u root: run as root
  • -b: no write buffering, fsync() after every write
Bonnie++
Bonnie++ Result
[Chart: block read and write throughput (K/sec, 0 to 1,500,000) for each tested layout: 1x45 jbod, 1x45/3x15/5x9/9x5 raidz1–3, and 15x3 raidz1–2]
• Best Sequential Write (atime=off, sync=disabled, compression=lz4)
• 5x9 raidz1
  • Tolerates 1 disk failure per vdev
Best Performing Layout
5x9-raidz1-atimeoff-syndisabled-comlz4
5x9-raidz1-atimeon-syndisabled-comlz4
1x45-jbod-atimeoff-syndisabled-comlz4
9x5-raidz1-atimeoff-syndisabled-comlz4
1x45-jbod-atimeon-syndisabled-comlz4
[Chart: Sequential Input, Block K/sec, roughly 960,000–1,000,000 across the five layouts above]
• Best Sequential Read (atime=off, sync=disabled, compression=lz4)
• 3x15 raidz1
  • Tolerates 1 disk failure per vdev
Best Performing Layout
3x15-raidz1-atimeoff-synstandard-comlz4
3x15-raidz1-atimeoff-syndisabled-comlz4
3x15-raidz1-atimeon-synstandard-comlz4
15x3-raidz1-atimeon-synstandard-comlz4
15x3-raidz1-atimeon-syndisabled-comlz4
[Chart: Sequential Output, Block K/sec, roughly 1,300,000–1,400,000 across the five layouts above]
• Very High CPU Utilization
• Includes latency reports too
• The complete test has 360 results
  • Includes testing of various compression algorithms: lzjb, zle, gzip[1-9]
Bonnie++ Results
• IOPS = (MB/s throughput / KB per IO) * 1024
• e.g.
  • 598 / 4 * 1024 = 153,088 IOPS
  • 564 / 4 * 1024 = 144,384 IOPS
IOPS
Device                 Type           Random 4KB IOPS (Write)   Random 4KB IOPS (Read)
Intel 520 Series       MLC SSD        Up to 80,000              Up to 50,000
Intel 313 Series       SLC SSD        4,000                     Up to 36,000
Seagate 15K.7 SAS      SAS HDD        ≈ 121                     ≈ 118
WD VelociRaptor 10k    SATA HDD       ≈ 111                     ≈ 82
Seagate ES2 7.2K       SAS/SATA HDD   ≈ 80                      ≈ 189
Seagate Barracuda      SATA HDD       ≈ 114                     ≈ 261
Known IOPS
*These figures are plucked from various sites by googling for IOPS. Different sites have different testing methods.
• How to test “Random”
• Flexible IO (fio)
  • But ZFS does not support direct IO
• Testing against a hybrid storage
  • Buffered IO, not direct IO
• Quite exhaustive
FIO - IOPS 4K Random
[testname]
rw=randwrite || randread || randrw || read || write || rw
size=32G # Double RAM
directory=<testdir>
numjobs=<number of cores>
group_reporting # Combined Results
bs=4k
thread
write_iops_log=testname # Logs
FIO
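Filled in, one of these job files might look like the following; the directory and job count are assumptions. Run it with: fio randwrite-4k.fio
  ; randwrite-4k.fio
  [randwrite-4k]
  rw=randwrite
  size=32G
  directory=/tank/fio
  numjobs=4
  group_reporting
  bs=4k
  thread
  write_iops_log=randwrite-4k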
Random 4K IOPS Write Test
[Chart: 4K Random Write IOPS vs time (ms) on the 5x9 raidz1 pool: starts at 34,232 IOPS and approaches about 100 IOPS eventually]
Random 4k IOPS Read Test
[Chart: 4K Random Read, 1 thread, IOPS vs time (ms) on the 3x15 raidz2 pool]
Apparently 4K Random Read IOPS are very good with ZFS
Random 4k IOPS Read Test
[Chart: 4K Random Read, 4 threads, IOPS vs time (ms) on the 3x15 raidz2 pool]
Apparently 4K Random Read IOPS are very good with ZFS
• CPU bottleneck, or is it really?
Post FIO Testing Thoughts
• More RAM!
• Better Network Interface – “Intel NICs”
• SSD write cache
  • ZFS Intent Log (ZIL) on SLC SSD drives (e.g. Intel 313 SSD Series)
  • Sync writes → async writes
  • Mirrored
• SSD read cache
  • L2ARC cache on MLC SSD drives
Areas to Improve Performance
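Adding the two SSD tiers to an existing pool is one command each; a sketch with placeholder device names:
  # zpool add tank log mirror /dev/sdx /dev/sdy    # mirrored SLC SSDs as the ZIL / SLOG
  # zpool add tank cache /dev/sdz                  # MLC SSD as the L2ARC read cache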
[Diagram: application writes go through the SSD write cache to disk storage; reads are served from the SSD read cache]
• Android App Available!
Monitoring ZFS
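Besides the app, the usual command-line view of pool health looks like this (pool name is illustrative):
  # zpool status -x                               # one-line health summary
  # zpool iostat -v tank 5                        # per-vdev bandwidth and ops every 5 seconds
  # zfs list -o name,used,avail,compressratio     # space usage and compression ratio per dataset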
• DRBD (Linux)
• HAST (FreeBSD)
• Lustre
• Ceph
• GlusterFS
What’s Next?
Network Replicated Storage
Scalable Network Storage
Questions?
Thank You!