An Overview of Flash Storage for Databases
Morgan Tocker<[email protected]>
Wednesday, March 9, 2011
Introduction
★ No vested interest in which hardware I recommend.
✦ [Disclaimer] Some hardware vendors have engaged our services to evaluate and improve the performance of their products.
[Me]
Director of Training. Previously worked at MySQL, Sun Microsystems.
[Percona]
Consulting, Training, Support & Development for MySQL.
What this talk is about
★ Flash technologies (NAND, NOR).
★ Server Usage.
✦ Not USB thumb drives.
✦ Not Consumer usage.
★ “For Database” == MySQL.
✦ Should be more or less applicable for all databases.
Agenda
★ Introduction.
★ A look at the current market.
★ Applications.
Revolutionary
★ Change in technology -
✦ From spinning disk to solid state.
★ No mechanical moving parts.
★ Jump in performance.
★ Requires changes in the Application.
★ Hard not to predict a quick replacement of all hard disks by SSDs in the next 5-10 years*
* However, at the moment hard disks are still getting cheaper (per unit of capacity) faster than SSDs!
“Numbers everyone should know”
Operation                           Latency
L1 cache reference                  0.5 ns
Branch mispredict                   5 ns
L2 cache reference                  7 ns
Mutex lock/unlock                   25 ns
Main memory reference               100 ns
Compress 1K bytes with Zippy        3,000 ns
Send 2K bytes over 1 Gbps network   20,000 ns
NAND Flash (my estimate)            50,000 ns
Read 1 MB sequentially from memory  250,000 ns
Round trip within same datacenter   500,000 ns
Disk seek                           10,000,000 ns
Read 1 MB sequentially from disk    20,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns
See: http://www.linux-mag.com/cache/7589/1.html and Google http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
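To put the NAND figure in perspective, the ratios can be worked out directly from the numbers on this slide. This is only a sketch: the NAND value is the speaker's own estimate, and the dictionary keys and function name are made up for illustration.

```python
# Latencies in nanoseconds, copied from the slide.
# The NAND flash figure is the speaker's estimate, not a measurement.
LATENCY_NS = {
    "main_memory_reference": 100,
    "nand_flash": 50_000,
    "disk_seek": 10_000_000,
}


def times_slower(slower: str, faster: str) -> float:
    """How many times slower one operation is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


print(f"disk seek vs NAND flash:   {times_slower('disk_seek', 'nand_flash'):.0f}x")
print(f"NAND flash vs main memory: {times_slower('nand_flash', 'main_memory_reference'):.0f}x")
```

On these figures flash sits roughly in the middle: a couple of hundred times faster than a disk seek, and a few hundred times slower than main memory.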
Physics Behind
★ “Floating Gate Transistors”
✦ Non volatile memory.
★ Two States (one bit) - Single Level Cell (SLC)
✦ Faster, more reliable, expensive.
★ Many States - Multi Level Cell (MLC)
✦ Usually 4 states (two bits).
✦ Slower, less reliable, cheaper.
Classification
★ NOR
✦ Speeds like memory for reads.
✦ Much, much slower for erase/writing data.
✦ Practical use: storing firmware.
★ NAND
✦ Faster writes.
✦ Only block-level read access (4K).
✦ Idea is to pack as many cells as possible into limited space, to make it competitive with hard drives.
Erasing (NAND)
★ Erase is to set all bits to “1111...”
✦ Erasing process is similar to the “flash” in cameras - this is where the name FLASH comes from.
✦ Erase is slow, done in batch operations (up to 1MB).
★ Change “1” -> “0” is fast.
★ Change “0” -> “1” is possible only by erase.
✦ 1st write: “1111” -> “1110”. Block marked as “written”.
✦ 2nd write: even “1110” -> “1010” is not possible.
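The write rules on this slide can be modelled in a few lines. This is a toy sketch, not how any real controller is written: the `Block` class and its 4-bit width are assumptions chosen to match the “1111” example.

```python
class Block:
    """Toy model of one writable unit of NAND, following the slide's rules."""

    def __init__(self, bits: int = 4):
        self.erased_value = (1 << bits) - 1  # all ones, e.g. 0b1111
        self.value = self.erased_value
        self.written = False

    def write(self, new_value: int) -> bool:
        """Program once; refuse any further write until the next erase."""
        if self.written:
            return False  # even a pure 1 -> 0 change needs an erase first
        self.value &= new_value  # programming can only clear bits (1 -> 0)
        self.written = True
        return True

    def erase(self) -> None:
        """Reset every bit to 1 (real devices erase many units in one batch)."""
        self.value = self.erased_value
        self.written = False


block = Block()
print(block.write(0b1110))  # 1st write: "1111" -> "1110" succeeds
print(block.write(0b1010))  # 2nd write is refused: block is marked as written
block.erase()
print(block.write(0b1010))  # succeeds again after the erase
```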
Erase Challenges
★ Erase is slow
✦ You want to erase many blocks in a single “flash”.
✦ Block Management.
★ [via software] When you write, the card never rewrites the same block in place.
★ Background process to run garbage collection.
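A rough sketch of what “never rewrite in place” plus garbage collection looks like: every write goes to a fresh physical block, a mapping table tracks where each logical block lives, and stale copies are reclaimed in a batch. `TinyFTL` and all its names are hypothetical; real firmware is far more involved (and here garbage collection runs on demand rather than in the background).

```python
class TinyFTL:
    """Toy flash translation layer: out-of-place writes plus garbage collection."""

    def __init__(self, physical_blocks: int):
        self.data = [None] * physical_blocks      # contents of each physical block
        self.free = list(range(physical_blocks))  # erased blocks ready for writing
        self.stale = []                           # blocks holding superseded data
        self.mapping = {}                         # logical block -> physical block

    def write(self, logical: int, payload: str) -> int:
        """Write to a fresh physical block and return its number."""
        if not self.free:
            self.garbage_collect()
        if not self.free:
            raise RuntimeError("device full: nothing stale to reclaim")
        physical = self.free.pop(0)
        self.data[physical] = payload
        old = self.mapping.get(logical)
        if old is not None:
            self.stale.append(old)  # the previous copy becomes garbage
        self.mapping[logical] = physical
        return physical

    def read(self, logical: int) -> str:
        return self.data[self.mapping[logical]]

    def garbage_collect(self) -> int:
        """Erase all stale blocks in one batch; return how many were freed."""
        freed = len(self.stale)
        for physical in self.stale:
            self.data[physical] = None
        self.free.extend(self.stale)
        self.stale = []
        return freed


ftl = TinyFTL(physical_blocks=4)
first = ftl.write(0, "v1")
second = ftl.write(0, "v2")  # same logical block, different physical block
print(first != second, ftl.read(0))
```

Note that the toy device fails when it is completely full of live data. That is one way to see why spare (over-provisioned) space helps.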
Erase Lifecycle
★ SLC ~100K times per cell (may vary).
★ MLC ~10K times per cell (may vary).
★ For many this is a major point of discussion.
✦ How big of an issue it is depends a lot on firmware.
✦ Many cells and even distribution (“wear levelling”) make it last a couple of years under heavy workload.
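A back-of-the-envelope way to see where “a couple of years” can come from, assuming perfect wear levelling. Only the erase-cycle counts come from the slide. The capacity, the sustained write rate and the write-amplification factor below are hypothetical inputs, and the formula is a simplification.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def lifetime_years(capacity_gb: float, erase_cycles: int,
                   write_mb_per_s: float, write_amplification: float = 1.0) -> float:
    """Years until every cell has used its erase cycles, with even wear."""
    total_writable_bytes = capacity_gb * 1e9 * erase_cycles
    physical_bytes_per_s = write_mb_per_s * 1e6 * write_amplification
    return total_writable_bytes / physical_bytes_per_s / SECONDS_PER_YEAR


# Hypothetical workload: 100 MB/s of writes around the clock, each host
# write costing two physical writes because of garbage collection.
print(f"SLC, 100K cycles: {lifetime_years(160, 100_000, 100, 2.0):.1f} years")
print(f"MLC,  10K cycles: {lifetime_years(160, 10_000, 100, 2.0):.2f} years")
```

With everything else equal, the 10x difference in erase cycles between SLC and MLC translates directly into a 10x difference in lifetime.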
Write degradation
★ Expected.
✦ The fuller the device, the harder it is to garbage collect.
★ Graph for Fusion-io 320G MLC card: [graph not reproduced in transcript]
Firmware Really Matters (1)
★ I would expect even less flat performance on a cheaper, non-enterprise class of hardware.
✦ Come to my talk on Friday.
✦ I will tell you consistency of performance is more important than anything else.
Firmware Really Matters (2)
★ Many revisions of firmware for each vendor.
✦ Important to compare apples-to-apples in any comparisons.
✦ I heard a rumour one large SSD vendor is on their 4th successful complete ground-up implementation ;)
Agenda
★ Introduction.
★ A look at the current market.
★ Applications.
The current market (1)
★ Fusion-io.
✦ Established player with a large product line.
✦ Enjoyed a near-monopoly for a while, being the only PCI card vendor.
★ Virident.
✦ Previously a MySQL Appliance vendor.
✦ Switched business model in ~2010 to just ship PCI Flash cards.
✦ Very good, consistent results.
The current market (2)
★ Intel/OCZ/other.
✦ Typically aim for the pro-desktop market.
✦ Do not necessarily offer the same features/promises as the “enterprise hardware”...
You pay more for...
★ Greater amount of over provisioning (more consistent).
★ Internal redundancy (aka RAID).
★ More complex firmware (more consistent).
★ Guarantee of durability (such as a capacitor).
★ Greater life-span (more write cycles).
★ Better Performance (many more IOPS).
Fusion-io
Performance Specification
★ 160G SLC
✦ 110K read IOPS (4K)
✦ 26us read latency.
★ 320G MLC
✦ 71K read IOPS.
✦ 41us read latency.
★ “Duo” Range (not covered).
★ Lifetime:
✦ SLC flash @ 40% write duty | 25 calendar years
✦ MLC flash @ 20% write duty | 10 calendar years
✦ MLC flash @ 40% write duty | 5 calendar years
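The IOPS and latency figures can be related with Little's law (requests in flight = throughput x latency). A sketch, assuming the quoted IOPS and latency were measured under comparable conditions, which the slide does not state; the function names are illustrative.

```python
def requests_in_flight(iops: float, latency_us: float) -> float:
    """Little's law: average concurrency needed to sustain a given IOPS."""
    return iops * latency_us / 1_000_000


def iops_at_queue_depth_one(latency_us: float) -> float:
    """Ceiling for a single thread issuing one synchronous read at a time."""
    return 1_000_000 / latency_us


CARDS = {"160G SLC": (110_000, 26), "320G MLC": (71_000, 41)}
for name, (iops, latency_us) in CARDS.items():
    print(f"{name}: ~{requests_in_flight(iops, latency_us):.1f} requests in flight, "
          f"one thread alone tops out near {iops_at_queue_depth_one(latency_us):,.0f} IOPS")
```

Either way, a single synchronous reader cannot reach the advertised numbers. Several requests have to be outstanding at once.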
Fusion-io Overview
★ Fast. Very fast.
✦ Cheaper than disks in terms of $-per-IOPS.
★ PCI-E - closest to CPU.
★ Durability.
★ Shares host memory / CPU.
★ Most complex part - firmware.
★ Large amount of space reservation for heavy writes.
Fusion-io drawbacks
★ Expensive. Let’s say “$6000+” (retail; your price may be less).
✦ For full performance, requires additional 25% space reservation.
✦ DRAM is actually probably cheaper per GB.
★ PCI-E is not hot swap.
✦ Also has potential for errors (when host fails, garbage keeps being sent. Fusion-io handles this well.)
Fusion-io durability
★ Cache is located on host system.
★ “Transaction log” to prevent lost data.
✦ Crash recovery.
Fusion-io read performance
160GB SLC card
8 threads: 33K IOPS (525MB/sec), 0.28 ms 95% response time
RAID 10 is Dell PERC 6i on 8 disks, 2.5” 15K RPM SAS
Fusion-io write performance
★ 8 threads: 20K IOPS (314MB/sec), 0.26 ms 95% response time.
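Dividing throughput by IOPS gives the average request size behind these benchmark numbers. A quick sketch, treating 1 MB as 10^6 bytes (the slides do not say which unit was used); the function name is illustrative.

```python
def kib_per_io(mb_per_s: float, iops: float) -> float:
    """Average request size in KiB, treating 1 MB as 10**6 bytes."""
    return mb_per_s * 1_000_000 / iops / 1024


print(f"reads:  {kib_per_io(525, 33_000):.1f} KiB per IO")
print(f"writes: {kib_per_io(314, 20_000):.1f} KiB per IO")
```

Both come out close to 16 KiB, which is InnoDB's default page size. Note that the 4K IOPS on the specification slide and the IOPS here are therefore not directly comparable.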
Fusion-io databases
★ Many read / write threads to utilize throughput.
★ “MySQL” is not able to fully use it.
✦ Better in 5.5, MySQL 5.1 with the InnoDB plugin, XtraDB.
★ InnoDB IO path “needs work”.
Virident TachIOn
Virident
★ PCI interface.
★ Has NAND flash upgrade modules.
★ Good stable results.
★ Advertised 300,000 IOPS in 75:25 (read:write).
Virident Options
★ 300G, 400G, 600G, 800G SLC cards.
✦ 400G is $13,600
★ (More or less the same price range as Fusion-io).
2010 Benchmarks:
http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/
Intel SSDs
Intel SSDs
★ Were awesome in 2008.
✦ Many accolades, first SSDs that probably made sense for a lot of pro-desktop users.
★ A couple of iterations of firmware, but mostly Intel treated customers like mushrooms for 2 years.
✦ No clear advance warning of road map.
✦ Finally a replacement 510 series announced last month.
• Slides don’t feature these. Have not used them.
Intel Overview
★ SATA form factor.
★ Intel X25-M Gen I (50nm) & Gen II (34nm).
✦ MLC
★ Intel X25-E (50nm)
✦ SLC
✦ “Enterprise”.
★ New 510 series - just released last month.
X25-E
★ 32G / 64G
★ Throughput: 35K IOPS reads, 3.5K IOPS writes.
★ Latency: 75us reads, 85us writes.
★ 64G - $725
✦ $11/GB
★ Write endurance:
✦ 1 petabyte of random writes (32G)
✦ 2 petabytes of random writes (64G)
X25-M Gen II
★ 80G / 160G
★ Throughput: 35K IOPS reads, 6.5K / 8.5K IOPS writes.
★ Latency: 65us reads, 85us writes.
★ 160GB - $415
✦ ~$3 / GB
★ Write Endurance:
✦ Not mentioned in official specification.
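The per-GB figures quoted on the slides can be checked and put side by side. A small sketch using only prices and capacities from this deck (2011 retail figures as quoted by the speaker); the dictionary and function names are illustrative.

```python
# (price in USD, capacity in GB), as quoted on the slides.
DEVICES = {
    "Intel X25-M 160G (MLC)": (415, 160),
    "Intel X25-E 64G (SLC)": (725, 64),
    "Virident 400G (SLC)": (13_600, 400),
}


def dollars_per_gb(price_usd: float, capacity_gb: float) -> float:
    return price_usd / capacity_gb


for name, (price, capacity) in DEVICES.items():
    print(f"{name}: ${dollars_per_gb(price, capacity):.2f}/GB")
```

Price per GB alone says nothing about IOPS, consistency or durability, which is what the more expensive cards are charging for.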
X25-E and X25-M
★ Even if “E” is enterprise - power loss means data loss.
✦ Loss of transactions.
★ You can disable write cache, but performance is woeful.
X25 Deployments
★ RAID
✦ Software / hardware?
✦ Level 0? 1? 10? 5? 50?
★ Engineering process could be complicated and expensive.
✦ There are/were ready solutions (Schooner[1], Gear6[2], Cisco servers).
[1] Changed business model recently.
[2] Went broke.
Agenda
★ Introduction.
★ A look at the current market.
★ Applications.
MySQL Specific (1)
★ SSD is very good at Random reads.
✦ Not so good at sequential writes!
★ Data files on SSD.
✦ Table files (*.ibd).
✦ Rollback segments (ibdata1).
★ Logs on RAID with BBU.
✦ Binary logs.
✦ Transaction logs.
✦ Double write buffer.
✦ Insert buffer.
✦ Slow log, error log, general log.
See: http://yoshinorimatsunobu.blogspot.com/2009/05/tables-on-ssd-redobinlogsystem.html
MySQL Specific (2)
★ Buy memory, or buy SSDs?
✦ [Usually] Buy memory when it’s possible.
Other Reasons to use Flash (1)
★ Server Consolidation.
✦ Hard drives do ~100-200 IOPS*
✦ Now one card can get 100K (theoretical)!
✦ ~x2 - x10 reduction in many cases (see Craigslist).
* Assuming no RAID controller performing additional merging.
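The consolidation argument is simple arithmetic: how many spindles would it take to match one card on random IOPS? A sketch using the slide's own figures; it ignores RAID overhead, caching and capacity, and the function name is illustrative.

```python
import math


def disks_needed(target_iops: int, iops_per_disk: int) -> int:
    """Number of hard drives needed to reach a random-IOPS target."""
    return math.ceil(target_iops / iops_per_disk)


print(disks_needed(100_000, 200))  # 500 drives at the optimistic end
print(disks_needed(100_000, 100))  # 1000 drives at the pessimistic end
```

Real-world reductions are much smaller than this (the slide says x2 to x10) because the 100K figure is theoretical and servers are rarely limited by IO alone.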
Other Reasons to use Flash (2)
★ Power consumption reduction.
✦ “Transactions per watt” incredibly higher.
• See: http://www.percona.com/files/percona-live/jeremy-Craigslist.pptx.pdf
✦ Important for a large number of people. Even if power is cheap, colo facilities often limit availability per-rack.
Other Reasons to use Flash (3)
★ Limit variance / risk of operational issues from cold starts.
✦ Easy to see something like an advertising network miss response time goals when the aim is 50ms/page.
• Each IO is ~10ms.
• Follow a few secondary keys to a primary key and you miss it.
★ Good for throughput too.
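The cold-start problem is a latency budget. A sketch in integer microseconds: the 50ms goal and ~10ms per IO are from this slide, and the flash latency reuses the speaker's 50,000 ns NAND estimate from the latency table. The count of six IOs is a made-up example of three rows, each needing a secondary-index read followed by a primary-key read.

```python
BUDGET_US = 50_000   # 50ms/page goal
DISK_IO_US = 10_000  # ~10ms per IO on a hard disk
FLASH_IO_US = 50     # speaker's NAND estimate (50,000 ns)


def max_serial_ios(budget_us: int, io_latency_us: int) -> int:
    """How many dependent (one-after-another) IOs fit in the budget."""
    return budget_us // io_latency_us


def page_time_us(serial_ios: int, io_latency_us: int) -> int:
    return serial_ios * io_latency_us


print(max_serial_ios(BUDGET_US, DISK_IO_US))   # 5
print(max_serial_ios(BUDGET_US, FLASH_IO_US))  # 1000

# Hypothetical cold-cache page: 3 rows x (secondary index read + primary key read).
print(page_time_us(6, DISK_IO_US) > BUDGET_US)   # True: goal missed on disk
print(page_time_us(6, FLASH_IO_US) > BUDGET_US)  # False: plenty of room on flash
```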
Applications must change
Short Term (1)
★ Multi-threaded IO is required to exploit all throughput offered.
✦ InnoDB Plugin, MySQL 5.5 ready.
✦ Many other databases are not ready.
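What “multi-threaded IO” means at the application level, as a minimal sketch: several random page reads are kept in flight at once instead of being issued one after another. This is not how InnoDB implements it. The 16 KiB page size and the function names are assumptions, and `os.pread` is only available on Unix-like systems.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 16 * 1024  # assumption: 16 KiB pages


def read_page(fd: int, page_no: int) -> bytes:
    """Read one page at an explicit offset; pread does not move a shared file position."""
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)


def random_reads(path: str, page_numbers, threads: int = 8):
    """Issue the reads from a thread pool so several are outstanding at once."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            return list(pool.map(lambda n: read_page(fd, n), page_numbers))
    finally:
        os.close(fd)


# Demo on a throwaway file. It is served from the OS page cache, so this
# shows the access pattern, not real device performance.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(64):
        f.write(bytes([i]) * PAGE_SIZE)
    demo_path = f.name

wanted = random.sample(range(64), 16)
pages = random_reads(demo_path, wanted)
print(all(page[0] == n for page, n in zip(pages, wanted)))
os.remove(demo_path)
```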
Short Term (2)
★ Opportunities for multi-level caches when data exceeds the SSD’s size.
✦ See Flashcache (Facebook), ZFS L2ARC, Veritas.
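The multi-level idea in miniature: a small RAM cache in front of a larger flash cache in front of slow disk, with pages that fall out of RAM landing on flash. This is a generic sketch and is not modelled on Flashcache, L2ARC or any Veritas product; all class names are hypothetical and plain dictionaries stand in for the devices.

```python
from collections import OrderedDict


class LRU:
    """Fixed-size least-recently-used cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        """Insert and return the evicted (key, value) pair, if any."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            return self.items.popitem(last=False)
        return None


class TieredStore:
    """RAM cache -> flash cache -> disk. Assumes stored values are never None."""

    def __init__(self, disk: dict, ram_pages: int, flash_pages: int):
        self.disk = disk
        self.ram = LRU(ram_pages)
        self.flash = LRU(flash_pages)
        self.hits = {"ram": 0, "flash": 0, "disk": 0}

    def read(self, key):
        value = self.ram.get(key)
        if value is not None:
            self.hits["ram"] += 1
            return value
        value = self.flash.get(key)
        if value is not None:
            self.hits["flash"] += 1
        else:
            value = self.disk[key]
            self.hits["disk"] += 1
        evicted = self.ram.put(key, value)
        if evicted is not None:
            self.flash.put(*evicted)  # pages falling out of RAM land on flash
        return value


store = TieredStore({i: f"page{i}" for i in range(10)}, ram_pages=2, flash_pages=4)
for key in [0, 1, 2, 0, 0]:
    store.read(key)
print(store.hits)
```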
Long Term
★ Decades of hard drive assumptions about random IO cost need to be unwound.
✦ For example, InnoDB, Oracle, PostgreSQL work like this...
Basic Operation (High Level)
SELECT * FROM City WHERE CountryCode='AUS'
[Diagram: Log Files, Buffer Pool, Tablespace]
Basic Operation (cont.)
UPDATE City SET name = 'Morgansville'
WHERE name = 'Brisbane' AND CountryCode='AUS'
[Diagram: Log Files, Buffer Pool, Tablespace]
Long Term (cont.)
★ Examples of “the database is the log” for MySQL are the PBXT and RethinkDB storage engines.
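A toy illustration of “the database is the log”: there is no separate tablespace to update in place, every change is appended, and an in-memory index points at the latest version. This is only a sketch of the general idea and does not reflect how PBXT or RethinkDB are actually built; the class, the record format and the NUL separator (keys must not contain it) are all made up.

```python
import io


class LogStructuredStore:
    """Key-value store whose append-only log is the only copy of the data."""

    def __init__(self, log=None):
        self.log = log if log is not None else io.BytesIO()
        self.index = {}  # key -> (offset, length) of the latest value
        self._end = 0

    def put(self, key: str, value: bytes) -> None:
        """Append a record: 4-byte length, key, NUL separator, value."""
        key_bytes = key.encode()
        record = key_bytes + b"\x00" + value
        self.log.seek(self._end)
        self.log.write(len(record).to_bytes(4, "big") + record)
        self.index[key] = (self._end + 4 + len(key_bytes) + 1, len(value))
        self._end += 4 + len(record)

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

    def rebuild_index(self) -> None:
        """Recover the index by replaying the log from the start."""
        self.index.clear()
        end = self.log.seek(0, io.SEEK_END)
        pos = 0
        while pos < end:
            self.log.seek(pos)
            size = int.from_bytes(self.log.read(4), "big")
            key, _, value = self.log.read(size).partition(b"\x00")
            self.index[key.decode()] = (pos + 4 + len(key) + 1, len(value))
            pos += 4 + size
        self._end = end


store = LogStructuredStore()
store.put("Brisbane", b"name=Brisbane")
store.put("Brisbane", b"name=Morgansville")  # an update is just another append
store.rebuild_index()                        # e.g. after a crash
print(store.get("Brisbane"))
```

Writes are purely sequential appends and reads are random, which suits flash better than it suits spinning disks. Old versions pile up, so a real engine also needs compaction.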
Storage Hardware also changes
★ Most of us are used to buying RAID controllers and placing disks below them.
✦ Only a very limited number of RAID controllers understand SSDs.
✦ RAID controllers are used to optimizing IO for devices capable of 100-200 IOPS.
✦ If we look at Fusion-io, the devices also RAID internally (~RAID4).
Technologies to look at
★ More PCI express cards.
✦ Potential to lower barrier to entry - only ~2-3 players, competition not as hot as it could be (yet).
★ More Enterprise focused MLC.
✦ Better software (firmware) means more wear levelling, improved performance, etc.
✦ More storage in fewer cells = lower cost.
★ Violin Memory
✦ I am not hands-on familiar with their technology, but they have some very high end offerings.
✦ Expect more awesome high end offerings (all vendors).
Questions
★ Thank you to Confoo for letting me speak about such a niche topic!
★ If I’m out of time, please feel free to catch me around.