building storage on the cheap

A Journey with SIGLabs School of Computing National University of Singapore Building Storage on the Cheap

Upload: yao-jun-yap

Post on 14-Aug-2015


Category:

Technology



TRANSCRIPT

Page 1: Building Storage on the Cheap

A Journey with SIGLabs

School of Computing

National University of Singapore

Building Storage on the Cheap

Page 2: Building Storage on the Cheap

• Yap Yao Jun – [email protected]

• Year 4 – Information Systems (InfoSec)

• Joined SIGLabs or Student Network Associates in January 2010

About

Page 3: Building Storage on the Cheap

• Since Late 2010

• Really Huge Storage @ Really Cheap Price
  • $5K to build the P.O.C

• 2 storage pods were built (45TB and 135TB)

SIGLabs JBOD Project

Page 4: Building Storage on the Cheap

• Super Micro
  • Maximum of 36 Drives
  • ≈ SGD$5,200 (w/o RAM, CPU, HDD)

• Backblaze 45 Drives
  • Chassis Only: USD$872 ≈ SGD$1,133
  • Complete (w/o HDD): USD$5,395 ≈ SGD$7,013

The Chassis Hunt

Page 5: Building Storage on the Cheap

• Shipping cost > USD$250

• Really huge box with lots of bubble wrap

• Flown in over CNY2011

Flown in by SQ from the USA

Page 6: Building Storage on the Cheap
Page 7: Building Storage on the Cheap

• Sil3726 Chip
  • From Protocase @ USD$60 each
  • USD$540 ≈ SGD$700

Multiplier Card

Page 8: Building Storage on the Cheap

• Sil3124
  • PCI-e (1x) Interface
  • 4 SATA II Ports

• era-adapter.com @ USD$59.95 each
  • USD$179.85 ≈ SGD$234

SATA Expansion Card

Page 9: Building Storage on the Cheap

P.O.C Build

Phase 1

Page 10: Building Storage on the Cheap

• AMD Athlon™ II X4 640 Processor

• 2 x 4GB DDR3 RAM

• PSU
  • iCute 500W
  • Cooler Master 460W
  • Seasonic X-760

• 500GB OS HDD

• 10 x 2TB Seagate LP Drives

Storage Pod #1

Page 11: Building Storage on the Cheap
Page 12: Building Storage on the Cheap
Page 13: Building Storage on the Cheap
Page 14: Building Storage on the Cheap
Page 15: Building Storage on the Cheap
Page 16: Building Storage on the Cheap

But!

Page 17: Building Storage on the Cheap

• Stand-offs
  • Clearance
  • 30mm brass standoff is too tall!
  • 20mm brass standoff is too short!
  • Cheap Y-Molex Power Connectors are too tall!

• Power Issues
  • iCute: Really Ugly AND Not Adorable at all
  • Replaced by the Modular Seasonic X-760

• Simultaneous PSU firing
  • 2nd PSU is toggled on using an external switch

Initial Setbacks

Page 18: Building Storage on the Cheap

• Improvise
  • El-cheapo Standoff
  • Green Wall Plugs, 7mm

• Too Long!
  • Trimmed from 1” to 0.75”

• 8 per Multiplier Card
  • Cut 72 of them

• SGD$2 for 100 pieces. Cheap!
  • Otherwise it is USD$29 for the ones Backblaze is using.

Stand-Offs

Page 19: Building Storage on the Cheap

• Bought Ready Made

• Daisy Chaining

Right Angle 4Pin Molex

Page 20: Building Storage on the Cheap

Working. Working. Working.

Page 21: Building Storage on the Cheap

PSU Hack

Taking the pin out

Connecting a signal cable directly to the MB socket

Connecting directly to the PSU

Broken Pin Extractor

The plastic guides were too small, so they were cut

Page 22: Building Storage on the Cheap

Managed to fit 3.5” HDDs into the enclosure

Can power up 10 HDDs (no dropouts)

Can power on the 2 PSUs simultaneously

Can do Linux RAID (mdadm)

Can export NFS

The project was supposed to continue, but in the later part of 2011…

Phase 1 Achievements

Page 23: Building Storage on the Cheap

• HDD Prices Sky-rocketed
  • Thailand Floods
  • Only recovered after May 2012

• Project was shelved

The Great HDD Shortage of 2011

Page 24: Building Storage on the Cheap

Really filling up the Storage Pod with Hard Disks

August 2012

Phase 2

Page 25: Building Storage on the Cheap

• 47 Seagate 1TB ES2

• 1 Hitachi 1TB

• Unable to power them… again…

48 x 1TB Drives from Sun x4500

Page 26: Building Storage on the Cheap

• POST: 33 Drives On

• Build RAID: drives get dropped while building
  • When 1 drive drops, the whole array on that multiplier drops
  • No success
  • At best, 27 Drives

Erm…

Page 27: Building Storage on the Cheap

• Multimeter to measure voltage at the multiplier card
  • 12V was around 11V.
  • Under-Volt. Tsk. :\

• Get “raw”-er materials from Sim Lim Tower
  • 2x3 Mini Female Molex
  • Fits into Seasonic X-760
  • Kudos to modular PSUs

• 4 Pin Disk Drive Molex
  • 16 AWG wire

Problem. Again.

Page 28: Building Storage on the Cheap

Custom Wiring

16AWG wires

Punch In

My Punch Tool & Cable Stripper

Page 29: Building Storage on the Cheap

Negative Example

Page 30: Building Storage on the Cheap

• Got a Second Seasonic X-760

• Each multiplier card has its own dedicated power wiring

• Stable 12V; Stable 5V

• No more HDD array drops

• Replicated to all 9 multiplier cards

Problem? Solved.

Page 31: Building Storage on the Cheap
Page 32: Building Storage on the Cheap
Page 33: Building Storage on the Cheap

Backblaze Configuration

[Diagram: 45 drives, /dev/sdb through /dev/sdat, split into three groups of 15 (sdb to sdp, sdq to sdae, sdaf to sdat); each group is one RAID-6 + JFS array]

Page 34: Building Storage on the Cheap

• # fdisk /dev/sdb
  • Partition ID = “fd” Linux RAID auto

• # mdadm --create /dev/md0 --level=6 --raid-devices=15 /dev/sdb1 /dev/sdc1…

• # mkfs.xfs /dev/md0

RAID-6
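
As a rough sanity check on the RAID-6 layout above, the sketch below estimates usable capacity. It assumes the 1TB drives used in Phase 2 and three 15-drive arrays as in the Backblaze layout; the function name `raid6_usable_tb` is made up for illustration, and filesystem overhead is ignored.

```python
def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable capacity of one RAID-6 array: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID-6 needs at least 4 drives")
    return (drives - 2) * drive_tb


ARRAYS = 3            # assumption: three arrays, as in the Backblaze layout
DRIVES_PER_ARRAY = 15  # matches --raid-devices=15
DRIVE_TB = 1.0         # assumption: 1TB drives, as in Phase 2

per_array = raid6_usable_tb(DRIVES_PER_ARRAY, DRIVE_TB)
print(f"per array: {per_array} TB, whole pod: {ARRAYS * per_array} TB")
```

With these assumptions, 6 of the 45 drives hold parity and 39 hold data.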

Page 35: Building Storage on the Cheap

/sys/block || /dev – Before reboot

/sys/block || /dev – After reboot

# reboot

[Diagram: drive-bay maps of the 45 sd* device names (sdb through sdat), one before and one after the reboot]

Page 36: Building Storage on the Cheap

• Renaming partitions
  • Linux RAID = ‘fd’
  • mdadm can use only partitions and not whole drives
  • Partition names end with a number

• udev rules to target the partitions
  • Using the HDD serial number

Hacking udev

/etc/udev/rules.d/90-jbod.rules
KERNEL=="sd*[0-9]", ENV{ID_SERIAL_SHORT}=="5YD1GW0A", NAME="jbod1a"

…...
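
Writing 45 such rules by hand is error-prone, so a small generator helps. The sketch below is only an illustration of the idea: `udev_rule` and `jbod_names` are hypothetical helpers, and every serial number except 5YD1GW0A (taken from the slide) is a placeholder. Newer udev releases no longer rename block devices through NAME=, so on a current system SYMLINK+= would likely be the equivalent.

```python
def udev_rule(serial: str, name: str) -> str:
    """One rule in the slide's format: match a partition by drive serial."""
    return (f'KERNEL=="sd*[0-9]", '
            f'ENV{{ID_SERIAL_SHORT}}=="{serial}", NAME="{name}"')


def jbod_names(arrays: int = 3, per_array: int = 15):
    """Yield jbod1a..jbod1o, jbod2a..jbod2o, jbod3a..jbod3o."""
    for array in range(1, arrays + 1):
        for index in range(per_array):
            yield f"jbod{array}{chr(ord('a') + index)}"


# Only the first serial comes from the slide; the rest are placeholders.
serials = ["5YD1GW0A"] + [f"SERIAL{i:04d}" for i in range(2, 46)]
rules = [udev_rule(s, n) for s, n in zip(serials, jbod_names())]

print("\n".join(rules[:3]))
```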

Page 37: Building Storage on the Cheap

Partitions at /dev

[Diagram: the 45 renamed partitions in three groups of 15: jbod1a to jbod1o, jbod2a to jbod2o, jbod3a to jbod3o]

Page 38: Building Storage on the Cheap

• It Works!

• Solved the problem of drives being loaded at different times at boot

Not Elegant

Page 39: Building Storage on the Cheap

FreeNAS! “Worked Out of the Box!”

Running off a $8 USB Thumb Drive!

Page 40: Building Storage on the Cheap

• Zettabyte File System
  • File System and RAID Engine in One
  • Copy on Write
  • Scrubbing detects and repairs silent corruption
  • Incremental Snapshots
  • Transparent Compression
  • Deduplication

• Zpools
  • Grow-able
  • Mirror can grow
  • But RAID-Z cannot grow

• http://youtu.be/CN6iDzesEs0?t=3m25s

ZFS

Page 41: Building Storage on the Cheap

• Single Parity

• Double Parity
  • Minimum 3 Drives
  • RAID 6 needs a minimum of 4 Drives

• Triple Parity Only in ZFS

RAID-Z

Page 42: Building Storage on the Cheap

• Performance
  • ZFS

• Ease of Administration
  • Web GUI
  • NFS
  • iSCSI
  • AFP
  • rsync

FreeNAS – NAS made Easy

Page 43: Building Storage on the Cheap

Uh-oh!

This shit happens every Tuesday night! #swapoff

Page 44: Building Storage on the Cheap

• FreeNAS 8.3.0 Beta!
  • No more page faults on Tuesday Nights

Let’s go Cutting Edge!

Page 45: Building Storage on the Cheap

• What drive is what drive?
  • Possible to pin-point but very troublesome
  • HDD Serial Number
  • # camcontrol devlist

• Configuration
  • kern.geom.label.gptid.enable="0"
  • kern.geom.label.ufsid.enable="0"

• Label Partitions
  • # gpart modify -i 1 -l drivename1 /dev/ada0

Problem

Page 46: Building Storage on the Cheap

• Logs
  • Not retained: hard to debug!
  • Used a script to symlink the logs to persistent storage
  • https://raw.github.com/jag3773/FreeNAS-Change-Logging/master/FreeNAS-Change-Logging.sh

• Swap
  • FreeNAS by default sets aside 2GiB of every HDD for swap
  • We got 45 Disks = 90GiB of Swap (Madness!)
  • # swapoff

• Export via NFS as nfsnobody:nfsnogroup
  • Issue with user permissions
  • rsync Debian repositories
  • chmod 02775

• Maproot user = root

Problem

Page 47: Building Storage on the Cheap

Replacement of Hard Disks

Page 48: Building Storage on the Cheap

Powering 45 Drives

Ease of use via FreeNAS

ZFS

Living in CR1 with Gigabit uplink

Serves NUS Mirror Storage Needs

Psst… It’s performing better than the X4500…

Phase 2 Achievements

Page 49: Building Storage on the Cheap

For “Production”

The 2nd Build

Page 50: Building Storage on the Cheap

• Intel i3 3220 Ivy-Bridge

• H77 Chipset

• 16GB DDR3 RAM

• 2x Seasonic X-760 PSU

• 45 x Seagate 3TB (4k Aligned)

Storage Pod #2

Page 51: Building Storage on the Cheap
Page 52: Building Storage on the Cheap

Note: SATA cables have to be nudged in the “downward” direction, else they may have connection problems

Page 53: Building Storage on the Cheap
Page 54: Building Storage on the Cheap

• CentOS 6.3 – 64-bit

• ZFS-on-Linux 0.6.0-rc14

• World Wide Names (WWN)
  • Bash-scripted rule generator

• Elegance in the management of naming Hard Drives

What’s Different?

Page 55: Building Storage on the Cheap

• WWN & vdevs

• /dev/disk/by-vdev/[1-9][a-e]
  • /etc/zfs/vdev_id.conf.[alias, multipath, sas_direct, sas_switch]
  • Alias

• WWN – World Wide Name
  • It’s like MAC addresses but for Hard Disks

• Creates symlinks to sd*
  • Allows other programs to run normally
  • S.M.A.R.T: smartd alerts via email

ZFS-on-Linux
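
The slides mention a Bash-scripted generator for the alias file; the sketch below shows the same idea in Python rather than the author's script. It emits `alias <name> <device link>` lines as used by vdev_id.conf in alias mode. The slot names follow the [1-9][a-e] scheme above, the WWN values are invented placeholders, and `vdev_aliases` is a hypothetical helper name.

```python
def vdev_aliases(wwn_by_slot: dict) -> str:
    """Render vdev_id.conf alias lines, one per drive slot."""
    lines = ["# generated vdev_id.conf (alias mode)"]
    for slot, wwn in sorted(wwn_by_slot.items()):
        lines.append(f"alias {slot} /dev/disk/by-id/wwn-{wwn}")
    return "\n".join(lines) + "\n"


# Slots 1a..9e: multiplier card 1-9, port a-e (45 drives in total).
slots = [f"{card}{port}" for card in range(1, 10) for port in "abcde"]

# Placeholder WWNs; real ones would be read from /dev/disk/by-id/.
wwn_by_slot = {slot: f"0x5000c500{i:08x}" for i, slot in enumerate(slots, 1)}

conf = vdev_aliases(wwn_by_slot)
print(conf.splitlines()[1])
```

Keying the aliases on physical slot means a failed drive can be located by its pool name alone.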

Page 56: Building Storage on the Cheap

• Divisors of 45
  • 1x45

What can we do with 45 Drives?

[Diagram: drive-bay grid labelled by multiplier card 1 to 9 and port a to e (1a through 9e)]

Page 57: Building Storage on the Cheap

• Divisors of 45
  • 1x45
  • 3x15

What can we do with 45 Drives?

Page 58: Building Storage on the Cheap

• Divisors of 45
  • 1x45
  • 3x15
  • 5x9

What can we do with 45 Drives?

Page 59: Building Storage on the Cheap

• Divisors of 45
  • 1x45
  • 3x15
  • 5x9
  • 9x5

What can we do with 45 Drives?

Page 60: Building Storage on the Cheap

• Divisors of 45
  • 1x45
  • 3x15
  • 5x9
  • 9x5
  • 15x3

What can we do with 45 Drives?
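
The layouts listed on these slides fall out of the divisors of 45. The sketch below enumerates them as (number of vdevs) x (drives per vdev) and counts the data drives left after RAID-Z parity. It assumes the 3TB drives of Storage Pod #2 and ignores filesystem overhead; `layouts` and `usable_drives` are illustrative names.

```python
def layouts(total_drives: int):
    """Return (vdevs, drives_per_vdev) pairs that use every drive."""
    return [(vdevs, total_drives // vdevs)
            for vdevs in range(1, total_drives)
            if total_drives % vdevs == 0]


def usable_drives(vdevs: int, width: int, parity: int):
    """Data drives left after parity; None when the vdev is too narrow."""
    if width <= parity:
        return None
    return vdevs * (width - parity)


DRIVE_TB = 3  # Storage Pod #2 uses 3TB drives

for vdevs, width in layouts(45):
    for parity in (1, 2, 3):
        data = usable_drives(vdevs, width, parity)
        if data is not None:
            print(f"{vdevs}x{width}-raidz{parity}: "
                  f"{data} data drives, about {data * DRIVE_TB} TB raw")
```

A 3-drive vdev leaves no room for data under triple parity, so 15x3 stops at raidz2.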

Page 61: Building Storage on the Cheap

With ZFS-on-Linux, Bonnie++ and fio

Performance

Page 62: Building Storage on the Cheap

• Scripted testing
  • “Layouts”
  • Raid-z level
  • Compression
  • Access time
  • Synchronization

Pursuit of the “Best” performance
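
A scripted sweep like this amounts to a Cartesian product over the tuning knobs. The sketch below is not the author's script; it builds test names in the same style as the chart labels later in the deck (for example 5x9-raidz1-atimeoff-syndisabled-comlz4). The compression list is an assumed subset, and skipping 15x3-raidz3 is an assumption based on a 3-drive vdev being too narrow for triple parity.

```python
from itertools import product

LAYOUTS = ["1x45", "3x15", "5x9", "9x5", "15x3"]
RAIDZ_LEVELS = ["raidz1", "raidz2", "raidz3"]
ATIME = ["on", "off"]
SYNC = ["standard", "disabled"]
COMPRESSION = ["off", "lz4"]  # assumed subset of the algorithms tested


def test_names():
    """Yield one name per combination of layout and ZFS settings."""
    for layout, level, atime, sync, comp in product(
            LAYOUTS, RAIDZ_LEVELS, ATIME, SYNC, COMPRESSION):
        if layout == "15x3" and level == "raidz3":
            continue  # 3 drives cannot hold 3 parity blocks plus data
        yield f"{layout}-{level}-atime{atime}-syn{sync}-com{comp}"


names = list(test_names())
print(len(names), "runs; first:", names[0])
```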

Page 63: Building Storage on the Cheap

• Hybrid Storage
  • ARC: RAM
  • L2ARC: SSD
  • ZIL: SSD (SLC)
  • HDD: Main Storage Pool

• Better Performance
  • atime=off
  • sync=disabled
  • compression=lz4
  • Debatable: if better CPU…

ZFS – Arbudens
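
The three properties above are applied with `zfs set`. The sketch below only builds the command strings so they can be reviewed before running; the dataset name `zfspool` is an assumption taken from the /mnt/zfspool path used in the Bonnie++ run, and `zfs_set_commands` is a hypothetical helper. Note that sync=disabled gives up the durability guarantee of synchronous writes, so it suits benchmarking more than data that must survive a power loss.

```python
def zfs_set_commands(dataset: str, properties: dict) -> list:
    """Build one 'zfs set' command line per property."""
    return [f"zfs set {name}={value} {dataset}"
            for name, value in properties.items()]


TUNING = {"atime": "off", "sync": "disabled", "compression": "lz4"}

commands = zfs_set_commands("zfspool", TUNING)  # dataset name is assumed
for command in commands:
    print(command)
```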

Page 64: Building Storage on the Cheap

• # bonnie++ -d /mnt/zfspool/ -s 32G -n 2:16M:8k -r 8G -z 1361377183 -u root -b
  • -d /mnt/zfspool/ : Directory to test
  • -s 32G : Double RAM recommended
  • -n 2:16M:8k : 2x1024 files; max 16M; min 8k
  • -r 8G : Amount of RAM to use
  • -z 1361377183 : Random Seed
  • -u root : Run as Root
  • -b : No Write Buffering: fsync()

Bonnie++

Page 65: Building Storage on the Cheap

Bonnie++ Result

[Bar chart: Read Block K/Sec and Write Block K/Sec (axis 0 to 1,500,000) for the layouts 1x45-jbod, 1x45-raidz1/2/3, 3x15-raidz1/2/3, 5x9-raidz1/2/3, 9x5-raidz1/2/3 and 15x3-raidz1/2]

Page 66: Building Storage on the Cheap

• Best Sequential Write (atime=off, sync=disabled, compression=lz4)
  • 5x9 raidz1
  • Tolerates 1 disk failure

Best Performing Layout

[Bar chart: Sequential Input, Block K/Sec (axis 960,000 to 1,000,000), with bars for 5x9-raidz1-atimeoff-syndisabled-comlz4, 5x9-raidz1-atimeon-syndisabled-comlz4, 1x45-jbod-atimeoff-syndisabled-comlz4, 9x5-raidz1-atimeoff-syndisabled-comlz4 and 1x45-jbod-atimeon-syndisabled-comlz4]

Page 67: Building Storage on the Cheap

• Best Sequential Read (atime=off, sync=disabled, compression=lz4)
  • 3x15 raidz1
  • Tolerates 1 disk failure

Best Performing Layout

[Bar chart: Sequential Output, Block K/Sec (axis 1,300,000 to 1,400,000), with bars for 3x15-raidz1-atimeoff-synstandard-comlz4, 3x15-raidz1-atimeoff-syndisabled-comlz4, 3x15-raidz1-atimeon-synstandard-comlz4, 15x3-raidz1-atimeon-synstandard-comlz4 and 15x3-raidz1-atimeon-syndisabled-comlz4]

Page 68: Building Storage on the Cheap

• Very High CPU Utilization

• Includes latency reports too

• Complete test has 360 results
  • Includes testing of various compression algorithms
  • lzjb, zle, gzip[1-9]

Bonnie++ Results

Page 69: Building Storage on the Cheap

• IOPS = (MBps Throughput / KB per IO) * 1024
  • e.g. 598 / 4 * 1024 = 153 088 IOPS
  • e.g. 564 / 4 * 1024 = 144 384 IOPS

IOPS
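
The conversion above is easy to wrap in a helper. A minimal sketch, assuming throughput is given in MB/s and the IO size in KB as on the slide; the function name `iops` is made up for illustration.

```python
def iops(throughput_mbps: float, io_size_kb: float) -> float:
    """IOPS = (MB/s throughput / KB per IO) * 1024."""
    if io_size_kb <= 0:
        raise ValueError("IO size must be positive")
    return throughput_mbps / io_size_kb * 1024


# The slide's first example: 598 MB/s at 4 KB per IO.
print(iops(598, 4))  # 153088.0
```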

Page 70: Building Storage on the Cheap

Device               Type          Random 4KB IOPS (Write)  Random 4KB IOPS (Read)
Intel 520 Series     MLC SSD       Up to 80,000             Up to 50,000
Intel 313 Series     SLC SSD       4,000                    Up to 36,000
Seagate 15K.7 SAS    SAS HDD       ≈ 121                    ≈ 118
WD Velociraptor 10k  SATA HDD      ≈ 111                    ≈ 82
Seagate ES2 7.2K     SAS/SATA HDD  ≈ 80                     ≈ 189
Seagate Barracuda    SATA HDD      ≈ 114                    ≈ 261

Known IOPS

*These figures were pulled from various sites by googling for IOPS. Different sites have different testing methods.

Page 71: Building Storage on the Cheap

• How to test “Random”

• Flexible IO (fio)
  • But ZFS does not support direct IO

• Testing against a hybrid storage
  • Buffered IO; Not Direct IO

• Quite exhaustive

FIO - IOPS 4K Random

Page 72: Building Storage on the Cheap

[testname]
; rw is one of: randwrite, randread, randrw, read, write, rw
rw=randwrite
; size: double the RAM
size=32G
directory=<testdir>
numjobs=<number of cores>
; group_reporting: combined results
group_reporting
bs=4k
thread
; write_iops_log: IOPS logs
write_iops_log=testname

FIO
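
Since the same job is repeated for each rw mode, the job files can be generated rather than edited by hand. The sketch below writes one job section per mode using the options from the slide. The directory /mnt/zfspool (borrowed from the Bonnie++ run), numjobs=4 and the `fio_job` helper are all assumptions for illustration.

```python
RW_MODES = ["randwrite", "randread", "randrw", "read", "write", "rw"]


def fio_job(name: str, rw: str, directory: str, numjobs: int,
            size: str = "32G", bs: str = "4k") -> str:
    """Render one fio job section with the options shown on the slide."""
    lines = [
        f"[{name}]",
        f"rw={rw}",
        f"size={size}",
        f"directory={directory}",
        f"numjobs={numjobs}",
        "group_reporting",
        f"bs={bs}",
        "thread",
        f"write_iops_log={name}",
    ]
    return "\n".join(lines) + "\n"


# Assumed test directory and job count; adjust both to the machine under test.
jobs = {mode: fio_job(f"test-{mode}", mode, "/mnt/zfspool", 4)
        for mode in RW_MODES}

print(jobs["randread"])
```

Each rendered string can be saved as its own .fio file and passed to fio on the command line.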

Page 73: Building Storage on the Cheap

Random 4K IOPS Write Test

[Line chart: 4K Random Write on 5x9 raid-Z1; x-axis Time (ms), 0 to 120,000; y-axis IOPS, 0 to 40,000]

Start: 34232 IOPS. Approaches about 100 IOPS eventually.

Page 74: Building Storage on the Cheap

Random 4k IOPS Read Test

[Line chart: 4k Random Read – 1 Thread on 3x15 raid-z2; x-axis Time (ms), 0 to 120,000; y-axis IOPS, 0 to 120,000]

Apparently 4K Random Read IOPS are very good with ZFS

Page 75: Building Storage on the Cheap

Random 4k IOPS Read Test

[Line chart: 4K Random Read – 4 Threads on 3x15 raid-z2; x-axis Time (ms), 0 to 120,000; y-axis IOPS, 0 to 300,000]

Apparently 4K Random Read IOPS are very good with ZFS

Page 76: Building Storage on the Cheap

• CPU Bottleneck or Really?

Post FIO Testing Thoughts

Page 77: Building Storage on the Cheap

• More RAM!

• Better Network Interface – “Intel NICs”

• SSD Write Cache
  • ZFS Intent Log (ZIL)
  • SLC SSD Drives (e.g. Intel 313 SSD Series)
  • Sync Writes → Async Writes
  • Mirrored

• SSD Read Cache
  • L2ARC Cache
  • MLC SSD Drives

Areas to Improve Performance

[Diagram: Application, Write Cache, Read Cache and Disk Storage, connected by Write and Read paths]

Page 78: Building Storage on the Cheap

• Android App Available!

Monitoring ZFS

Page 79: Building Storage on the Cheap

Network Replicated Storage

• DRBD (Linux)

• HAST (FreeBSD)

Scalable Network Storage

• Lustre

• Ceph

• GlusterFS

What’s Next?

Page 80: Building Storage on the Cheap

Questions?

Page 81: Building Storage on the Cheap

Thank You!