
Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups?

Dirk Petersen, Scientific Computing Director,

Fred Hutchinson Cancer Research Center (FHCRC)

Bio-IT World 2015

Who are we and what do we do

What is Fred Hutch?
• Cancer & HIV research
• 3 Nobel Laureates
• $430M budget / 85% NIH funding
• Seattle campus with 13 buildings, 15 acres, 1.5+ million sq ft of facility space

Research at “The Hutch”
• 2,700 employees
• 220 faculty, many with custom requirements
• 13 research programs
• 14 core facilities
• Conservative use of information technology

IT at “The Hutch”
• Multiple data centers with >1,000 kW capacity
• 100 staff in Center IT plus divisional IT
• Team of 3 sysadmins to support storage
• IT funded by indirects (F&A)
• Storage chargebacks started Nov 2014


How did we get here:

Economy File project in production in 2014

• Chargebacks drove the Hutch to embrace more economical storage

• Selected Swift object storage managed by SwiftStack

• Go-live in 2014, strong interest and expansion in 2015

• Researchers do not want to pay the price for standard enterprise storage

• Additional use cases:
– In production: Swift as a backend for Galaxy

– In progress: Swift replaces standard disk deduplication devices for backup

– Planning: Swift as backend for endpoint backup (Druva)

– Planning: Swift as backend for virtual machines (openvstorage)

– Future option: Swift as backend for Enterprise file sharing / NAS

• File System Gateway for CIFS/NFS access phased out


Phasing out of Filesystem Gateway

• Initial deployment used the SwiftStack Gateway (CIFS/NFS)
– User survey: strong preference for traditional file access
– Gateway was the easiest integration option with the existing authentication and authorization process

• However:
– Gateway was up to 10x slower than direct access to the API
– Users had accepted lower performance in exchange for lower costs
– Low performance still causes frustration and increases Ops cost

• Now we have alternatives, and better AD integration of Swift
– Gateway was non-HA and had higher Ops costs
– Removing the gateway allows rolling updates during business hours
– Gateway did not allow full auditing of file access, but Swift does

• Users finally saw the benefit of removing the gateway and were willing to try alternative tools


How chargebacks were implemented

• Custom SharePoint site for storage chargeback processing and allocation to grants

– Each PI can allocate a certain % of charges to up to 3 grant budgets

– Allocation is default setup for next month

– User comments positive: “very easy to use“

• Don’t make chargebacks worse by offering bad tools!


Chargebacks spike Swift utilization

• Started storage chargebacks on Nov 1st

– The announcement already triggered strong growth in October

– Users sought to avoid high cost of enterprise NAS and put as much as possible into lower cost Swift

• Underestimated success of Swift

– Needed to stop migration to buy more hardware

– Can migrate 30+ TB per day today


Chargebacks spike Swift utilization, cont.

• High aggregate throughput

• Current network architecture is an (anticipated) bottleneck

• Many parallel streams required to max out throughput

• Ideal for HPC cluster architecture


Silicon Mechanics – Expert included.


• Commodity hardware selection

• Open source software identification

• Quality assembly process with zero defects

• On-time installation and deployment

• Design consultation for the right solution

• Focused on your real world problems

• Real people behind the product

• Support staff who know your system

Silicon Mechanics: The value of highly customizable hardware


Silicon Mechanics Storform Storage Servers

• Flexible, Configurable, Reliable

• 144TB raw capacity; 130TB usable

• No RAID controllers; no storage lost to RAID

• 36 x 4TB 3.5” Seagate SATA drives

• 2 x 120GB Intel S3700 SSDs; OS + metadata

• 10GBase-T connectivity

• 2 x Intel Xeon E5 CPUs

• 64GB RAM

• Supermicro SC847 4U chassis

Learn more at Booth #361

@ExpertIncluded

Management of OpenStack Swift using SwiftStack

• SwiftStack provides control & visibility
– Deployment automation
  • Lets us roll out Swift nodes in 10 minutes
  • Upgrading Swift across clusters with 1 click
– Monitoring and stats at cluster, node, and drive levels
– Authentication & authorization
– Capacity & utilization management via quotas and rate limits
– Alerting & diagnostics


SwiftStack Architecture Overview

• Standard Linux distribution: off-the-shelf Ubuntu, Red Hat, CentOS
• Standard hardware: Silicon Mechanics, Supermicro, etc.
• Swift runtime: integrated storage engine with all node components
• Integrations & interfaces: end-user web UI, legacy interfaces, authentication, utilization API, etc.
• OpenStack Swift: released and supported by SwiftStack, 100% open source
• SwiftStack nodes: scales from 2 to 1000s
• SwiftStack Controller: deployment automation, ring & cluster management, authentication services, client support, capacity & utilization mgmt., monitoring, alerting & diagnostics, rolling upgrades & 24x7 support


How much does it cost?

• Only small changes vs 2014

– Kryder’s law obsolete at <15%/year?

– Swift now down to Glacier cost (hardware down to $3/TB/month)

– No price reductions in the cloud

• 4TB (~$120) and 6TB (~$250) drives cost the same

– Do you want a fault domain of 144TB or 216TB in your storage servers?

– Don’t save on CPU: erasure coding is coming!

12Bio-IT World 2015

[Bar chart: cost comparison – SwiftStack 11, Google 26, Amazon S3 28, NAS 40]

Object storage systems and traditional file systems – totally different, right?

• No traditional file system hierarchy; we just have buckets (S3 lingo) or containers (Swift lingo) that can contain millions of objects (aka files)

• Huh, no sub-directories? But how the heck can I upload my uber-complex bioinformatics file system with 11 folder hierarchies to Swift?
– Answer: we simulate the hierarchical structure by simply putting forward slashes (/) in the object name (or file name), as the listing sketch at the end of this slide shows
– source /dir1/dir2/dir3/dir4/file5 can simply be copied to /container1/many/fake/dirs/file5

• So, how do I actually copy / migrate data over to Swift if I don’t want to use the API?
– The standard tool is the OpenStack Swift client. Let’s assume I want to copy /my/local/folder to /Swiftcontainer/pseudo/folder; here is the command you have to type:
swift upload --changed --segment-size=2G --use-slo --object-name="pseudo/folder" "Swiftcontainer" "/my/local/folder"
– Really? Can’t we get this a little easier?
– There are a handful of open source tools available; some of them are easier to use (e.g. rclone)
– However, the Swift client is frequently used, well supported, maintained and really fast!
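
A minimal sketch of how those pseudo-directories behave when listing, using the standard Swift client (the container and path names are hypothetical):

# With --delimiter, slashes in object names act like directory separators,
# so only the entries directly below pseudo/folder/ are returned
$ swift list Swiftcontainer --prefix "pseudo/folder/" --delimiter "/"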


Object storage systems and traditional file systems – totally different, right?

• OK, so let’s get over with this and do what HPC shops do all the time: write a wrapper and verify that people who don’t have a lot of patience find it usable.

• Swift Commander, a simple shell wrapper around the Swift client, curl and some other tools, makes working with Swift very easy:

• Sub-commands such as swc ls, swc cd, swc rm and swc more give you a feel quite similar to a Unix file system (idea stolen from Google’s gsutil)

• Actively maintained and available at https://github.com/FredHutch/Swift-commander/


$ swc upload /my/posix/folder /my/Swift/folder
$ swc compare /my/posix/folder /my/Swift/folder
$ swc download /my/Swift/folder /my/scratch/fs

Object storage systems and traditional file systems – totally different, right?

• Didn’t someone say that object storage systems were great at using metadata?

• Yes, and you can just add a few key:value pairs as upload arguments:

• Query the metadata via swc, or use an external search engine such as Elasticsearch


$ swc upload /my/posix/folder /my/Swift/folder project:grant-xyz collaborators:jill,joe,jim cancer:breast

$ swc meta /my/Swift/folder
Meta Cancer: breast
Meta Collaborators: jill,joe,jim
Meta Project: grant-xyz

Object storage systems and traditional file systems – totally different, right?

• Users tend to prefer to work with a POSIX file system with all files in one place... but integrating Swift into your workflows is not really hard

• Example: running samtools using persistent scratch space (files deleted if not accessed for 30 days)

• A complex 50-line HPC submission script prepping a GATK workflow requires just 3 more lines!

• Read the file from persistent scratch space, and if it is not there, simply pull it again from Swift

• If you don’t have scratch space you can pipe the download from Swift directly to samtools (see the sketch after the snippet below)


if ! [[ -f /fh/scratch/delete30/pi/raw/genome.bam ]]; then
  swc download /Swiftfolder/genome.bam /fh/scratch/delete30/pi/raw/genome.bam
fi
samtools view -F 0xD04 -c /fh/scratch/delete30/pi/raw/genome.bam > otherfile
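
A minimal sketch of the no-scratch variant, streaming the object straight into samtools with the plain Swift client (container and object names are hypothetical):

# --output - writes the downloaded object to stdout; samtools reads it from stdin
$ swift download Swiftcontainer raw/genome.bam --output - | samtools view -F 0xD04 -c - > otherfile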

Object storage systems and traditional file systems – totally different, right?

• Use the HPC system to download lots of bam files in parallel

• 30 cluster jobs run in parallel on 30 nodes with 1G links (which is my HPC limit)

• My scratch file system says it loads data at 1.4 GB/s

• This means that each bam file is downloaded at 47 MB/s on average, and downloading this dataset of 1.2 TB takes 14 min


$ swc ls /Ext/seq_20150112/ > bamfiles.txt
$ while read FILE; do
$   sbatch -N1 -c4 --wrap="swc download /Ext/seq_20150112/$FILE ."
$ done < bamfiles.txt

$ squeue -u petersen
JOBID    PARTITION  NAME    USER      ST  TIME   NODES  NODELIST
17249368 campus     sbatch  petersen  R   15:15  1      gizmof120
17249371 campus     sbatch  petersen  R   15:15  1      gizmof123
17249378 campus     sbatch  petersen  R   15:15  1      gizmof130

$ fhgfs-ctl --userstats --names --interval=5 --nodetype=storage
====== 10 s ======
Sum:      13803 [sum]  13803 [ops-wr]  1380.300 [MiB-wr/s]
petersen  13803 [sum]  13803 [ops-wr]  1380.300 [MiB-wr/s]

Scientific file systems are a mixture of small files & large files

• How does Swift handle copying lots of small files?

• Answer: not so fast... but to be honest, your NFS NAS does not handle this too well either

• Example: (ab)using filenames as a database:

dirk@rhino04:# ls metapop_results/corrected/release_test/evo/ | head
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.05_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.15_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.1_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.25_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000

• So, we could tar up this entire directory structure... but then we have one giant tar ball of 1 TB that becomes really hard to handle

• But what if we had a tool that would not tar up all sub-dirs into one file, but create a tar ball for each level: /folder1/folder2/folder3 could turn into:

/folder1.tar.gz
/folder1/folder2.tar.gz
/folder1/folder2/folder3.tar.gz

• So to restore folder2 and below we just need folder2.tar.gz + folder3.tar.gz

Scientific file systems are a mixture of small files & large files

• Solution: Swift Commander contains an archiving module

• Written by the author of the postmark file system benchmark… who has some experience with handling small files

• It’s easy:

• It’s fast:
– Archiving uses multiple processes; measured up to 400 MB/s from one Linux box
– Each process uses pigz multithreaded gzip compression (example: compressing a 1GB DNA string down to 272MB takes 111 sec with gzip, 5 seconds with pigz; see the sketch at the end of this slide)

– Restore can use standard gzip

• It’s simple & free: https://github.com/FredHutch/Swift-commander/blob/master/bin/swbundler.py


archive: $ swc arch /my/posix/folder /my/Swift/folder
restore: $ swc unarch /my/Swift/folder /my/scratch/fs
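
A minimal sketch of the gzip vs. pigz comparison mentioned above (the file name is hypothetical; pigz spreads compression across all cores while gzip uses one):

# -c writes to stdout, so neither run leaves a .gz file behind
$ time gzip -c genome_1gb.fa > /dev/null
$ time pigz -c genome_1gb.fa > /dev/null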

Scientific file systems are a mixture of small files & large files

• Special case: sometimes we have large NGS files mixed with many small files; we want to copy the large files as-is (not tar them) and archive the small files as tar.gz

• The default bundle option in Swift Commander copies files >64MB straight and bundles files <64MB into tar.gz archives

• You can change the default to other sizes:

• Benefit: archives small files effectively and still allows you to open large files directly with other tools, e.g. bam files in a public folder in Swift can be opened by the IGV browser


archive: $ swc bundle /my/posix/folder /my/Swift/folder
         $ swc bundle /my/posix/folder /my/Swift/folder 512M

restore: $ swc unbundle /my/Swift/folder /my/scratch/fs

Access with GUI tools is required for collaboration

• Reality: even if infrequent, every archive requires access via GUI tools

• Needs to work with Windows and Mac

• Tools such as Cyberduck are standard but not perfectly convenient; we need tools that

– are very easy to use,

– do not create any proprietary data structures in Swift that cannot be read by other tools, and

– simply replace a shared drive


Access with GUI tools is required for collaboration

• Another example: ExpanDrive and Storage Made Easy

– Works with Windows and Mac

– Integrates with the Mac Finder and is mountable as a drive in Windows


rclone: mass copy, backup, data migration - better than rsync

• rclone is a multithreaded data copy / mirror tool

• Consistent performance on Linux, Mac and Windows

• E.g. keep a mirror of a Synology workgroup NAS (QNAP has a built-in Swift mirror option); see the sketch below

• Data remains accessible via swc and desktop clients

• Mirror protected by Swift undelete (currently 60 days retention)
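
A minimal sketch of such a mirror job with rclone, assuming a Swift remote was already set up once via rclone config (the remote name fhswift and the container name nas-mirror are hypothetical):

# One-way mirror: make the Swift container match the NAS share;
# --transfers controls how many files are copied in parallel
$ rclone sync /mnt/synology/workgroup fhswift:nas-mirror --transfers 8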


Galaxy integration with OpenStack Swift in production

• Galaxy, the web-based high-throughput computing platform at the Hutch, uses Swift as primary storage in production today

• SwiftStack patches contributed to the Galaxy Project

• Swift allows us to delegate “root” access to bioinformaticians

• Integrated with the Slurm HPC scheduler: automatically assigns the default PI account for each user


Q & A
