TF-Storage Action Item
Measuring storage performance – theory and practice
Maciej Brzeźniak, PSNC Supercomputing Department
Presentation outline
• TF-Storage Action Item on measuring performance
• Motivations for benchmarking storage
• Benchmarking:
  • a bit of theory
  • a bit of practice (in the context of tenders)
• Some conclusions
What is the Action Item?
• TF-Storage aims:
  • ... to be a forum for exchanging and promoting ideas, experience and knowledge related to data storage, data management and cloud storage;
  • to identify and promote common standards, techniques, technology and procedures in the field of data storage...
  • to be a focal point for gathering storage expertise in the .edu area in Europe
• TF-Storage Action Items:
  • forum for storage and cloud
  • overview of (national) activities and deployments
  • file sharing
  • measuring storage performance
  • storage and AAI
• Example actions:
  • we try to contribute interesting material to the TF-Storage wiki
  • and let others use it
  • and present the results during meetings
How knowledge and experience transfer is implemented
• TF-Storage meetings
• TF-Storage Wiki at:
  https://wiki.terena.org/index.php/TF-Storage:Contents
Motivations for measuring performance (1)
• A lot of motivations:
  • curiosity ;)
  • the need to evaluate a storage system against the application
  • comparing with other products / people ;)
  • the lack of a standard, universal way of describing system performance
  • it's hard to re-use existing results or to trust vendor claims:
    • results may come from different, optimally chosen configurations
    • results might be... tuned ☺
    • specifications typically contain best cases
• NRENs running data storage/management infrastructures and services:
  • infrastructure development and planning
  • tenders
Motivations for measuring performance (2)
• MONEY: value vs price
  • having defined needs, we want to spend minimum money to address them
  • with a fixed budget, we want the best performance
• Realistic value-price assessment:
  • difficult without the proper tool (a benchmark)
• Vendors specify performance in tricky ways (greetings to those present in the room ☺):
  • maximum bandwidth/IOPS:
    • cache-to-host, not disks-to-host
    • peak/maximum, but what about sustained?
  • 'catalogue' values may be true only for the best configurations:
    • maximum number of disk drives, FC 15k rpm drives, SSDs...
    • full cache configuration
• Tenders: a formalised methodology producing re-usable results is needed
Motivations for measuring performance (3)
• 'Independent' benchmarks:
  • SPC, ESG, others...
  • only a support for your own methodology
• SPC/ESG limitations:
  • not every vendor/product is there
  • not every access pattern is represented:
    • SPC-1 – random patterns – OLTP, database operations and mail servers
    • SPC-2 – sequential patterns – large file processing, large DB queries, VoD
  • for complex/application scenarios you must do the benchmarking yourself
• Tested configurations are 'optimised' by vendors:
  • results show the performance of the best configs, perhaps not the one you will get
  • FC disks (15k), SSDs, tuning...
• You might be unable to reproduce the results, since:
  • you don't know the product that well
  • you're not able to tune the system as the vendors are
• Hard to compare different configurations...
http://www.storageperformance.org
http://www.enterprisestrategygroup.com
Some theory (1)
How to make a solid benchmark?
[Figure: 'Current GB/s vs time vs tests' – throughput (GB/s) and normalised cache-hit % over time (s) for 12 interleaved write/read test runs]
Some theory (2)
How to make a solid benchmark?
• Be aware of the system complexity
• Apply a top-down or bottom-up approach depending on the test purpose:
  • top-down – quickly check what is possible; if you're not satisfied, go down to the lower level
  • bottom-up – if you want to be sure of each layer's efficiency and its overhead
• Theory: do tuning on all levels!
• Reality: the system is complex and tuning takes time, so use 'heuristics':
  • trust your intuition & experience when choosing the set of tuned attributes and the values you examine
Source: Stijn Eeckhaut's presentation during the 1st TF-Storage Meeting in Espoo, Finland
Some theory (3)
How to make a solid benchmark?
• Try to reflect the real-life system load:
  • run a real application if possible (costly, difficult)
  • if a real test is impossible, use benchmarking tools (iozone, xdd, hdparm...)
• Be sure you're measuring what you want, e.g.:
  • disks-to-host performance, not cache-to-host
• Beware of caching – some methods to avoid its influence (see the sketch below):
  • use large-enough data sets (3-4x the RAM in hosts + storage caches)
  • force syncs and flush caches, re-mount or re-create filesystems
  • use direct-I/O mount flags or appropriate benchmark options...
• BUT avoid artificial setups – examples:
  • switching caching off on all levels typically does not make sense
  • switching cache mirroring off speeds up the test, but will you do that in production?
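A minimal sketch of these cache-avoidance steps on a Linux host; the paths, sizes and thread counts are assumptions for illustration, not the tender settings discussed later:

    sync                                     # flush dirty pages to storage
    echo 3 > /proc/sys/vm/drop_caches        # drop the host page cache (root required)
    umount /mnt/test && mount /mnt/test      # re-mount to invalidate filesystem caches
    # direct I/O bypasses the host page cache entirely:
    dd if=/dev/zero of=/mnt/test/f bs=256k count=40960 oflag=direct   # ~10 GB sequential write
    iozone -I -t 4 -i 0 -i 1 -s 8g -r 256k   # -I makes iozone use O_DIRECT where supported

Note that these steps address host-side caching only; the array cache can still absorb a small test, which is why the large data-set rule above matters.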
Some theory (4)
How to make a solid benchmark?
• Avoid unwanted bottlenecks – try to detect them (a few quick checks are sketched after the figure):
  • host-side performance issues:
    • use host-state monitoring tools
    • optimise kernel settings and the I/O subsystem (e.g. queues)
  • communication network issues:
    • watch the network switches; check MTUs, drops, etc.
    • examine link speeds with different tools (e.g. iperf for IP)
  • storage system issues:
    • explore system traces, logs and performance stats
    • analyse the storage back-ends
• Avoid configuration faults – use enough:
  • hosts, HBAs, network links
  • disks, if you want to examine the controller's features
  • protocol peers (e.g. NFS clients and servers)
• Some tricky errors:
  • LUN swapping between array controllers may kill your test
[Figures: performance in kB/s and in IOPS vs record size (0–256 kB) for sequential and random writers/readers]
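A few of the bottleneck checks above, as they might look on a Linux host (standard utilities; the device and interface names are placeholders):

    iostat -x 5                              # host side: device utilisation, queue length, await
    cat /sys/block/sdb/queue/nr_requests     # block-layer queue depth of one disk
    ip -s link show eth0                     # interface drops/errors; verify the MTU while here
    iperf -s                                 # on one host, and on the other:
    iperf -c <server> -P 4 -t 60             # raw IP throughput, 4 parallel streams, 60 s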
Some theory (4) – continued
An example of a tricky host-side configuration fault – an SSD attached through a SATA controller running in IDE mode:
  00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [IDE mode]
  # cat /proc/scsi/scsi
  Attached devices:
  Host: scsi0 Channel: 00 Id: 00 Lun: 00
    Vendor: ATA   Model: KINGSTON SVP100S   Rev: CJRA....
Some theory (5)
How to make a solid benchmark?
• Use tools appropriate for the layer you want to examine:
  • application level: SPC, SPECsfs2008, TPC, the actual application...
  • file system: iozone, xdd...
  • block level: xdd, dd ☺, iometer...
  • network: iperf...
• Use the appropriate test depending on what is critical for your application (examples below):
  • throughput (backup, large-file storage etc.):
    • sequential read/write tests, possibly multi-host & multi-threaded
  • IOPS (databases):
    • random workloads
  • response time (databases, real-time systems):
    • random workloads with response-time measurement
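For instance, the two test styles map onto iozone roughly like this (thread counts, file and record sizes are illustrative assumptions):

    # throughput: multi-threaded sequential write + read, large records
    iozone -t 8 -i 0 -i 1 -s 16g -r 1m
    # IOPS: random read/write with small records, results reported in ops/s (-O)
    iozone -t 8 -i 0 -i 2 -s 4g -r 4k -O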
More theory (☺)
How to make a solid benchmark?
• Methodology – some information is available on the Internet:
  • HOWTOs, blogs etc.
  • the SPC test specifications – very interesting
• TF-Storage benchmarking HOWTO:
  https://wiki.terena.org/index.php/TF_Storage_Benchmarking_Howto
  • this is an early version
  • if you can/want to contribute, please contact me
  • I would be happy to get any feedback (even critical)
Some practice (1)
• Example 1: PLATON project tender for Popular Archival Service disk arrays and file servers
• Problem: we wanted to buy equipment we had never touched before...
• Application: backup/archive
  • large files, sequential pattern
  • we need throughput and capacity
  • IOPS not critical; no 'crazy features' (thin provisioning, snapshots etc.)
  • costs -> minimum
• Requirements for the storage systems:
  • SATA 1 TB drives – reliability, a high number of spindles, short sparing time; many drives per RAID
  • Disk arrays:
    • 8 Gbit FC front-end, 200 TB (expandable)
    • 2 GB/s under a mixed (50%/50%) read/write workload
  • File servers:
    • SATA 1 TB drives, 10 Gbit front-end, 200 TB (expandable)
    • 1 GB/s under a mixed (50%/50%) read/write workload, or 1.2 GB/s for 100% writes and 100% reads measured independently
  • performance must not degrade when the filesystem is full...
Some practice (2)
• Example 1: PLATON disk arrays & file servers
• Solution:
  • a performance benchmark procedure written into the tender RFP
  • open-source benchmarks: iozone or xdd throughput tests
  • the average result is evaluated (calculated from 5 rounds of tests)
  • workload: 50%/50% read/write with a large block size (256 kB)
  • the vendor can choose the optimal configuration, i.e.:
    • number of testing hosts
    • physical connectivity details
    • LUN configuration and mapping to hosts
  • Why so much freedom?
    • vendors know their systems best
    • there are many architectures/solutions on the market
      -> hard to define all possible low-level details in an RFP
  • BUT: test files must be 4x bigger than the caches + RAM in the hosts (a quick sizing example below)
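A quick sizing check for that last rule, with made-up numbers:

    HOSTS=16; RAM_GB=8; CACHE_GB=32                  # assumed host count, per-host RAM, array cache
    MIN_GB=$(( 4 * (HOSTS * RAM_GB + CACHE_GB) ))
    echo "minimum total test data: ${MIN_GB} GB"     # -> 640 GB for these values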
Some practice (3)
• Example 1: PLATON disk arrays & file servers
• Solution – some additional (formal) requirements:
  • the full test output must be included in the offer
    -> possible 'vendor tricks' can be detected while evaluating the offers
  • tests must be run on a configuration identical to the delivered one
    -> this is a slippery area...
    • setting up the test-bed is costly for the vendor (some of them complained)
    • BUT this way we end up with offers from robust, 'operative' vendors/partners (able to provide support later)
  • we can verify the test results when the system is delivered:
    • vendors will/should help you at this stage
    • if verification fails, we can reject the delivery in accordance with the RFP statements
    • BUT be careful: the vendor is not the only one with a problem in case of failure
Some practice (4)
• Example 1: PLATON disk arrays & file servers tenders
• How did it work? We got:
  • disk arrays: 8 offers from 4 vendors (OEMs), 2 types of disk arrays
  • filers: 4 offers for 2 different file servers (OEMs again)
  • 1 important vendor+product (within the budget) didn't appear
• The best offers were:
  • disk array: IBM DS5300 (LSI 7900 OEM) – declared 2.08 GB/s (confirmed)
  • file server: SGI Infinite Storage NAS with SGI IS220 arrays (BlueArc and LSI OEM) – declared 1.17 GB/s (confirmed)
• Conclusion:
  • this is the right way
  • BUT did it work only thanks to the size of the order? (2 PB of disks in 5 arrays and 5 filers)
Some practice (5)
• Some details of the benchmarks we made (1):
• Example 1: PLATON's disk array – tender benchmark requirements:
  • RFP criterion – mixed read/write (50%/50%) performance of min. 2 GB/s
  • additional requirements:
    • test file size – at least 2x (hosts' RAM + array cache)
    • iozone tool: iozone -t mm -i 0 -i 1 -s nnM -r 256k
      • mm – number of test threads (test files)
      • nnM – test file size
      • -r 256k (I/O request size) – the array is to be used under GPFS
    • no file system specified
    • number of LUNs or hosts not fixed (but >2)
    • no RAID level specified (however, we will use RAID5 or RAID6)
    • the tested configuration must be the same as the delivered one
      • that was difficult for vendors
  • 50%/50% read/write mixture implementation (sketched below):
    • read and write benchmarks are run in parallel
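One possible realisation of the parallel mixture, with invented host names, thread counts and file sizes; iozone's -w flag keeps the test files so a later read-only pass can reuse them:

    # beforehand, on the read hosts: create the files that will be read, and keep them (-w)
    iozone -t 6 -i 0 -s 65536M -r 256k -w
    # the measured run: start writers and readers at the same moment
    ssh write-host 'iozone -t 6 -i 0 -s 65536M -r 256k' &
    ssh read-host  'iozone -t 6 -i 1 -s 65536M -r 256k -w' &
    wait   # both groups must finish before the results are read off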
Some practice (6)
� Example 1: PLATON’s disk array – verification benchmark:
� Array configuration:
� 12 LUNs over 12 RAIDs (each RAID5 composed from 16 drives) = 192 drives in total
� LUNs for writing exported through controller A, LUN for reading through B
� 4 FC ports/controller used for hosts connectivity, theoret. bandwidth ~6.4 GBit/s
� Hosts & SAN configuration:
� A lot of hosts: 16 blades with 1x 4Gbis FC HBA each
� Hosts divided into two groups: reading and writing ones
� 2 separate FC networks (blade FC switches, with 8 Gbit/s and 4 Gbit/s ports)
Some practice (7)
• Example 1: PLATON project's disk array – verification benchmark:
• Issues/problems we had:
  • it is quite a lot of work to configure the testbed:
    • setup – hardware, wiring, LUNs, mapping, hosts etc. – takes about 1 week
    • tuning took some days – hosts, array <- bottlenecks
    • problem analysis and solving – another 2 days <- bottlenecks,
      e.g. performance issues in an FC switch (firmware issues limited performance)
Some practice (8)
• Example 1: PLATON's disk array – verification benchmark:
• Issues/problems we had:
  • difficult to synchronise write/read tests running on 16 machines:
    • iozone supports cluster mode, but only for single workloads (e.g. 100% read)
    • in non-cluster mode the report gives only an 'overall' throughput
    • the particular tests are not really parallel
[Figure: 'Current GB/s vs time vs tests' – throughput (GB/s) and normalised cache-hit % over the 1800 s of the 12 interleaved write/read runs; annotations mark the performance criterion and the benchmark threads that finished faster]
Some practice (9)
• Example 1: PLATON's disk array – verification benchmark:
• Issues/problems we had:
  • difficult to synchronise write/read tests running on 16 machines
    => calculations needed: average performance over the total time of the test (see the sketch below)
[Figure: zoomed fragment of the throughput timeline (3.50–5.00 GB/s range) showing the first write and read runs against the cache-hit curve]
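The correction itself is simple arithmetic – total data moved divided by the wall-clock time of the whole run, so threads that finish early do not inflate the average; the numbers below are invented:

    TOTAL_GB=$(( 12 * 64 ))   # assumed: 12 test threads, a 64 GB file each
    WALL_S=1800               # elapsed time of the slowest thread, in seconds
    echo "scale=2; $TOTAL_GB / $WALL_S" | bc   # average GB/s over the whole test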
Some practice (10)
• Example 1: PLATON disk array – verification benchmark:
  • difficult to synchronise write/read tests running on 16 machines
    -> monitoring of the storage was required to verify the results – the array's performance monitor
    -> it lets us see read and write performance over time
[Figure: array read and write performance (kB/s) over the 1800 s of the test – WRITE, READ and TOTAL curves, with the performance criterion and the calculated average marked]
Some practice (11)
• Example 2: PL-GRID project tender for a fast disk array for HPC (Lustre)
• Problem:
  • again, the real performance was a big question mark
  • application: HPC, computing-job scratch space
    • we need throughput for large blocks (IOPS not critical)
    • redundancy – RAID5 or 6 is OK
• Disk array:
  • SATA 1 TB drives, 8 Gbit FC front-end, 300 TB (expandable)
  • the RFP required 4.7 GB/s write performance
    (we assumed that writes are more difficult to implement efficiently)
• Solution:
  • again, the benchmark was defined in the RFP
  • procedure similar to the one presented before (but only writes to be measured)
  • formalities:
    • the benchmark result must be included in the offer
    • the right to verify the benchmark after delivery (and reject the equipment if it does not meet the requirements)
Some practice (12)
• Example 2: PL-GRID tender for a disk array for an HPC cluster:
• How did it work?
  • we got 8 offers from 4 different vendors (OEMs), 2 types of disk arrays
  • the best offer was:
    • DDN S2A 9900 – declared 4.8 GB/s – confirmed in practice (even exceeded)
• Conclusion:
  • this is the right way
  • BUT it worked perhaps thanks to the fact that the vendor wanted to enter the market (so it was ready to make the effort to perform the tests)
Some practice (13)
• Some details of the benchmarks we made (2):
• Example 2: PL-GRID's disk array for an HPC cluster – tender benchmark:
  • RFP criterion – 100% write performance of min. 4.8 GB/s
  • additional requirements:
    • test file size – at least 2x (hosts' RAM + array cache)
    • iozone tool: similarly as before
    • number of LUNs or hosts not fixed (but >2)
    • no RAID level specified (however, RAID5 or RAID6 expected)
    • the tested configuration must be the same as the delivered one
      • that was again difficult for vendors
Some practice (14)
• Example 2: PL-GRID's HPC disk array – verification benchmark:
• Array configuration:
  • 29 LUNs over 29 RAIDs (each a RAID6 of 10 drives) = 290 drives in total
  • LUNs distributed equally among the array controllers
  • 4 FC ports per controller used for host connectivity, theoretical bandwidth ~6.4 GB/s
• Hosts configuration:
  • hosts divided into two groups, each group using a different array controller
  • 16 blade hosts, each with 1x 4 Gbit/s FC HBA
  • 2 separate FC networks (blade FC switches, with 8 Gbit/s and 4 Gbit/s ports)
Some practice (15)
• Example 2: PL-GRID's HPC disk array – verification benchmark:
• Issues/problems we had:
  • the performance required in the RFP is high...
    -> a lot of hosts and LUNs needed (15 hosts, 29 LUNs, 29 threads)
    -> fortunately we could exploit iozone's cluster mode, which simplified the test (sketch below)
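iozone's cluster mode is driven by a client-list file passed with -+m; the host names, paths and sizes below are invented for illustration (iozone reaches clients over rsh by default; the RSH environment variable can point it at ssh):

    export RSH=ssh
    # one line per client: <hostname> <working directory> <path to iozone binary>
    cat > clients.txt <<EOF
    blade01 /mnt/lun01 /usr/bin/iozone
    blade02 /mnt/lun02 /usr/bin/iozone
    EOF
    iozone -+m clients.txt -t 29 -i 0 -s 65536M -r 256k   # 29 write threads spread over the clients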
Some practice (16)
• Example 2: PL-GRID's HPC disk array – verification benchmark:
  • RFP: the performance had to be measured with the filesystem 60% full:
    -> this required some additional tools (scripts; a sketch follows the figure)
    -> you have to spend time preparing them
[Figure: 'Performance vs FS usage' (iozone without -c -e flags) – writer, re-writer, reader and re-reader throughput at 0–80% filesystem usage, with the 60% tender requirement marked]
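A fill-up helper could look like the sketch below; the mount point, chunk size and reliance on GNU df are assumptions:

    TARGET=60                                # required filesystem usage, in percent
    MNT=/mnt/test                            # assumed mount point of the tested filesystem
    while [ "$(df --output=pcent $MNT | tail -1 | tr -dc '0-9')" -lt "$TARGET" ]; do
        # write 10 GB chunks with direct I/O until the usage threshold is reached
        dd if=/dev/zero of=$MNT/fill.$RANDOM bs=1M count=10240 oflag=direct
    done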
Some practice (17)
• Example 2: PL-GRID's HPC disk array – verification benchmark:
  • again, we wanted to verify what the benchmark says:
    • array-side measurements were used (the S2A monitoring tool)
    • plus SAN network monitoring
[Figure: array-side and FC network measurements during the test]
Some practice (18)
• Example 3: PLATON project tender for a fast file server (NFS):
• Problem:
  • as always... the real performance was a big question mark
• Requirements:
  • application: backup/archive
    • large files, sequential pattern; throughput and capacity needed
    • IOPS not critical; no 'crazy features' (de-dup, snapshots etc.)
  • performance:
    • SATA 1 TB drives, 10 Gbit front-end, 200 TB (expandable)
    • 1 GB/s under a mixed (50%/50%) read/write workload, or 1.2 GB/s for 100% writes and 100% reads measured independently
    • performance must not degrade when the filesystem is full...
• Solution:
  • again, the benchmark was defined in the RFP + the right to verify the results
  • procedure: 100% write + 100% read with iozone, or a 50%/50% read/write test with xdd (sketched below)
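A loose sketch of the xdd variant, assuming xdd's -rwratio option for the read share of a mixed workload; the target path, sizes and pass count are invented (-reqsize counts blocks of -blocksize bytes, so 256 x 1024 B = 256 kB requests):

    xdd -rwratio 50 -targets 1 /mnt/nfs/testfile \
        -blocksize 1024 -reqsize 256 -mbytes 262144 -passes 5 -dio -verbose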
Some practice (19)
• Example 3: PLATON project tender for a fast file server (NFS):
• How did it work? We got:
  • 4 offers for 2 different file servers (OEMs again)
  • 1 important vendor+product (within the budget) didn't appear
• The best offer was:
  • SGI Infinite Storage NAS with SGI IS220 arrays (BlueArc and LSI OEM)
  • declared 1.17 GB/s (confirmed)
• Conclusion:
  • again – RFP & offers – OK
  • now the verification...
Some practice (20)
• Example 3: PLATON file server – verification test:
• Array configuration:
  • 26 LUNs over 224 drives (RAID5s), 3 disk pools (defined in the filer)
  • 38 filesystems (!) (proprietary FS within the filer)
• Hosts configuration:
  • 6 physical machines with 2x 10 Gbit Ethernet
  • 38 virtual hosts (!) – KVM
• Network:
  • 2 separate 10 Gbit Ethernet networks
Some practice (21)
• Example 3: PLATON file server – verification test:
• Issues/problems we had:
  • expected:
    • tuning needed; a lot of settings can influence the results
    • a tight schedule: only 2 weeks for everything, hard deadline
  • unexpected:
    • installation, wiring, setup, definition of RAIDs, LUNs, pools etc. – 4 days (24 h/day)
    • after a week of testing & tuning we realised that 6 client-server pairs could not generate the declared 1.17 GB/s; more were needed
    • we decided to use virtual hosts (4 days before the deadline)
    • 10 Gbit Ethernet driver issues in KVM prevented reaching maximum performance (Friday, 2 days to the deadline)
    • we were working 24 h/day (2 benchmarking teams)
  • in the end:
    • we finished with 38 client-server pairs (run on VMs)
    • iozone's cluster mode helped to manage the benchmark
    • we met the criteria and the deadline:
      • writes: 1.24 GB/s, reads: 1.42 GB/s (average from 10 runs)
      • the expected results arrived at 6:00 AM on Monday – 2 hours before the deadline
Some practice (22)
• Example 3: PLATON file server – verification test:
  • again, we wanted to verify what the benchmark says
[Figure: filer-side monitoring during the test – around 22,50 IOPS/s per NAS node; 1 NFS operation = 32 kB]
Our benchmark conclusions
• Some conclusions for tenders...
• You may want to measure the performance of the system you plan to purchase/use
• This is difficult:
  • it complicates the RFP, the offers and the whole process (more paper, risk of faults, protests...)
  • some effort/resources are needed on the vendor side
  • effort and resources are needed on your side:
    • procedure preparation and verification (+ formalisation)
    • time, knowledge & equipment needed to verify the results
• Some specific knowledge & experience is necessary:
  • be aware of possible tricks and eliminate them in the RFP
  • verification test:
    • get the vendor to assist you in testing (at least in tuning)
    • additional monitoring is useful to verify the results reported by the benchmark
• Plan your tender schedule accordingly (!!!)
  • testing takes time & resources – it must be included in the schedule, or you have a problem
  • leave some spare time for unexpected problems
Presentation conclusions:
• Benchmarking is needed:
  • you can't trust catalogue values, and independent benchmark results can rarely be re-used
  • you want to know what a given configuration may offer you
• It's difficult, as:
  • it takes time and human/hardware resources
  • some knowledge and experience is needed, including:
    • tuning (storage, network, OS, protocols...)
    • benchmarking methods
    • auxiliary tools and techniques (automation, monitoring)
• We try to help you & want to learn something from you:
  • 'best practices' are described on the TF-Storage Wiki – please review and contribute (or at least criticise)
  • TF-Storage benchmarking HOWTO:
    https://wiki.terena.org/index.php/TF_Storage_Benchmarking_Howto
Contact: Maciej Brzeźniak, [email protected]
Measuring storage performance – theory and practice
Maciej Brzeźniak, PSNC
TF-Storage Action Item
Note: most of the pictures used in this presentation come from the sxc.hu service.