© 2012 Quest Software Inc. All rights reserved.
Making the most of Solid State Disk in Oracle 11g
Guy Harrison, Executive Director, R&D
Introductions
www.guyharrison.net [email protected]
http://twitter.com/guyharrison
Agenda
• Brief History of Magnetic Disk
• Solid State Disk (SSD) technologies
• SSD internals
• Oracle DB flash cache architecture
• Performance comparisons
• SSD in Exadata
• Recommendations and Suggestions
A brief history of disk
5MB HDD circa 1956
1800 RPM
28MB HDD - 1961
The more that things change....
Moore’s law
• Transistor density doubles every 18 months
• Exponential growth is observed in most electronic components:
  − CPU clock speeds
  − RAM
  − Hard disk drive storage density
• But not in mechanical components:
  − Service time (seek latency) – limited by actuator arm speed and disk circumference
  − Throughput (rotational latency) – limited by speed of rotation, circumference and data density
Disk trends 2001-2009
[Chart: % change over the period — IO rate: +260; disk capacity: +1,635; IO/capacity: −630; CPU: +1,013; IO/CPU: −390]
SSD to the rescue?
[Chart: seek time (µs) — Magnetic disk: 4,000; SATA flash SSD: 80; PCI flash SSD: 25; DDR-RAM SSD: 15]
Economics of SSD
[Chart: $/GB (lower axis, 0-60) and $/IOPS (upper axis, 0-2.5) for Seagate SATA HDD, Seagate SAS HDD, Intel MLC SATA SSD, Intel SLC SATA SSD, FusionIO PCI MLC Duo SSD, and FusionIO PCI SLC SSD. HDDs are far cheaper per GB ($0.09-$1.00 vs $6.88-$53.44); SSDs are far cheaper per IOPS ($0.05-$0.06 vs $1.53-$2.38)]
Tiered storage management
• Main memory
• DDR SSD
• Flash SSD
• Fast disk (SAS, RAID 0+1)
• Slow disk (SATA, RAID 5)
• Tape, flat files, Hadoop
[Pyramid diagram: moving down the tiers, $/GB decreases while $/IOPS increases]
12c Automatic Data Placement
• Active — segment on SSD tablespace
• Frequent access — segment on SAS tablespace
• Occasional access — OLTP compression on SATA tablespace
• Dormant — archive compressed on RAID5 SATA

ALTER TABLE …. ADD ILM POLICY
  TIER TO …. tablespace AFTER 6 months of no access
  COMPRESS FOR QUERY LOW AFTER 12 months of no access
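The pseudo-SQL above can be fleshed out into a hedged sketch of 12c ADO policy syntax. The table and tablespace names here are hypothetical, and the exact clauses vary by release (the talk predates 12c general availability):

```sql
-- Move cold segments to a cheaper tablespace (hypothetical names)
ALTER TABLE sales ILM ADD POLICY
  TIER TO archive_tbs;

-- Compress segments untouched for 12 months
ALTER TABLE sales ILM ADD POLICY
  COMPRESS FOR QUERY LOW SEGMENT
  AFTER 12 MONTHS OF NO ACCESS;
```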
SSD technology and internals

Flavours of SSD:
• DDR RAM drive
• SATA flash drive
• PCI flash drive
• SSD storage server

PCI SSD vs SATA SSD:
− SATA was designed for traditional disk drives with high latencies
− PCI is designed for high-speed devices
− PCI SSD latency is roughly one third that of SATA
Dell Express Flash
Higher performance, durability, flexibility

Express Flash PCIe-SSD drives:
• Up to 1000x more IOPS than traditional HDD
• Front loading, hot swappable
• Maximum read and write lifespan
• Improves workload processing
• Enhances virtual environments

PCIe SSD solution framework:
− Frees up valuable PCIe slot real estate
− 1x16 slot supports 4x4 PCIe SSD backplane; modular, scalable
− Storage device: HDD form factor, fits in a 2.5” carrier; SLC: 175GB & 350GB
[Diagram: media bay and control panel — eight SAS drive slots (SAS 0-7) and four PCIe SSD slots (PCIe SSD 0-3), 4x drives with backplane, extender card (adapter form factor)]
Flash SSD Technology

Storage hierarchy:
• Cell: one (SLC) or two (MLC) bits
• Page: typically 4K
• Block: typically 128-512K

Writes:
• Reads and first writes require a single page IO
• Overwriting a page requires an erase and rewrite of the whole block

Write endurance:
• ~100,000 erase cycles for SLC before failure
• ~5,000-10,000 erase cycles for MLC
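The endurance figures above translate into a back-of-envelope lifetime estimate. This sketch assumes a hypothetical 100GB SLC device with perfect wear levelling and a sustained write rate of 50 MB/s; everything except the 100,000-cycle rating is an assumption for illustration:

```python
# Rough wear-out estimate for a flash device, under stated assumptions:
# 100GB SLC drive, 100,000 erase cycles per block (slide figure),
# perfect wear levelling, 50 MB/s sustained writes.
CAPACITY_GB = 100
ERASE_CYCLES = 100_000          # SLC endurance from the slide
WRITE_RATE_MB_S = 50            # assumed sustained write rate

total_writable_gb = CAPACITY_GB * ERASE_CYCLES       # lifetime write volume
seconds = total_writable_gb * 1024 / WRITE_RATE_MB_S
years = seconds / (3600 * 24 * 365)
print(f"Estimated wear-out horizon: {years:.1f} years")  # roughly 6.5 years
```

In practice write amplification shortens this and idle time extends it, but the point stands: SLC endurance is rarely the limiting factor for a database workload.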
Flash SSD performance
[Chart: microseconds — Read (4K page seek): 25; First insert (4K page write): 250; Update (256K block erase): 2,000]
Flash disk write degradation
• All blocks empty: write time = 250 µs
• 25% full: write time = (¾ × 250 µs + ¼ × 2,000 µs) = 687 µs
• 75% full: write time = (¼ × 250 µs + ¾ × 2,000 µs) = 1,562 µs
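The weighted-average arithmetic above generalises to any fill level. A minimal sketch using the slide's figures (250 µs to write into an empty block, 2,000 µs when a block must first be erased and rewritten):

```python
# Expected write time as a function of how full the device is,
# using the slide's figures: 250 us for a write into an empty block,
# 2,000 us when a block must be erased and rewritten first.
EMPTY_WRITE_US = 250
ERASE_WRITE_US = 2000

def expected_write_us(fraction_full: float) -> float:
    """Weighted-average write time for a device that is
    fraction_full (0.0-1.0) occupied."""
    return (1 - fraction_full) * EMPTY_WRITE_US + fraction_full * ERASE_WRITE_US

print(expected_write_us(0.25))  # 687.5
print(expected_write_us(0.75))  # 1562.5
```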
[Diagrams: SSD controller handling of insert, update, and garbage collection — each block holds valid, invalid, and empty data pages, and the controller maintains free and used block pools; garbage collection reclaims blocks whose pages have been invalidated]
11g DB Flash Cache
Oracle DB flash cache
• Introduced in 11gR2 for OEL and Solaris only
• Secondary cache maintained by the DBWR, but only when idle cycles permit
• Architecture is tolerant of poor flash write performance
Buffer cache and free buffer waits
[Diagram: the Oracle process reads from the database files into the buffer cache and reads from the buffer cache; the DBWR writes dirty blocks to disk; sessions stall on free buffer waits when no clean buffers are available]
Free buffer waits often occur when reads are much faster than writes....
Flash cache
[Diagram: as above, but the DBWR also writes clean blocks to the flash cache (time permitting), and the Oracle process can read from the flash cache instead of from disk]
The DB flash cache architecture is designed to accelerate buffered reads
Configuration
• Create a filesystem on the flash device
• Set DB_FLASH_CACHE_FILE and DB_FLASH_CACHE_SIZE
• Consider FILESYSTEMIO_OPTIONS=SETALL
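A hedged sketch of the corresponding settings; the flash filesystem path and cache size here are hypothetical:

```sql
-- Assumes /flash is a filesystem created on the SSD device
ALTER SYSTEM SET db_flash_cache_file = '/flash/flash_cache.dat' SCOPE=SPFILE;
ALTER SYSTEM SET db_flash_cache_size = 32G SCOPE=SPFILE;
-- Optionally enable asynchronous and direct IO for filesystem datafiles
ALTER SYSTEM SET filesystemio_options = SETALL SCOPE=SPFILE;
-- Restart the instance for the new settings to take effect
```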
Flash KEEP pool
• You can prioritise blocks for important objects using the FLASH_CACHE clause:
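For example (the table name here is hypothetical):

```sql
-- KEEP prioritises this segment's blocks in the DB flash cache;
-- DEFAULT and NONE are the other FLASH_CACHE settings
ALTER TABLE sales STORAGE (FLASH_CACHE KEEP);
```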
Oracle DB flash cache statistics
http://guyharrison.squarespace.com/storage/flash_insert_stats.sql
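A minimal sketch of the kind of query such a script runs (statistic names as they appear in 11gR2 v$sysstat; exact names may vary by release):

```sql
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('physical read flash cache hits',
                'flash cache inserts',
                'flash cache eviction: invalidated');
```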
Flash Cache Efficiency
http://guyharrison.squarespace.com/storage/flash_time_savings.sql
Flash cache Contents
http://guyharrison.squarespace.com/storage/flashContents.sql
Performance tests
Test systems
• Low-end system:
  − Dell Optiplex, dual core, 4GB RAM
  − 2x Seagate 7,200 RPM Barracuda SATA HDD
  − Intel X-25E SLC SATA SSD
• Higher-end system:
  − Dell R510, 2x quad core, 32GB RAM
  − 4x 300GB 15K RPM 6Gbps Dell SAS HDD
  − 1x FusionIO ioDrive SLC PCI SSD
Performance: indexed reads (SATA flash)
[Chart: elapsed time (s), broken into db file IO, flash cache IO and other — No flash: 529.7; Flash cache: 143.27; Flash tablespace: 48.17]
Performance: read/write (SATA flash)
[Chart: elapsed time (s), broken into db file IO, write complete waits, free buffer waits, flash cache IO and other — No flash: 3,289; Flash cache: 1,693; Flash tablespace: 200]
Random reads – PCI flash
[Chart: elapsed time (s), broken into DB file IO, flash cache IO and other — SAS disk, no flash cache: 2,211; SAS disk, flash cache: 583; Table on SSD: 121]
Updates – PCI flash
[Chart: elapsed time (s), broken into db file IO, log file IO, flash cache, free buffer waits and other — SAS disk, no flash cache: 6,219; SAS disk, flash cache: 1,934; Table on SSD: 529]
Buffer cache bottlenecks
• The flash cache architecture avoids free buffer waits due to flash IO, but write complete waits can still occur on hot blocks
• Free buffer waits are still likely against the database files, due to the higher physical read rates the flash cache makes possible
Full table scan – PCI flash
[Chart: elapsed time (s), broken into DB file IO, flash cache IO and other — SAS disk, no flash cache: 418; SAS disk, flash cache: 398; Table on SSD: 72]
Sorting – what we expect
[Chart: elapsed time vs PGA memory available (MB), broken into table/index IO, CPU time and temp segment IO — showing the memory sort, single-pass disk sort, and multi-pass disk sort regions]
Disk sorts – temporary tablespace
[Chart: elapsed time (s) vs sort area size for a SAS-based vs an SSD-based temporary tablespace, across the single-pass and multi-pass disk sort regions]
Redo performance – FusionIO
[Chart: elapsed time (s), with log IO component — SAS-based redo log: 292.39; flash-based redo log: 291.93]
Concurrent redo workload (x10)
[Chart: elapsed time (s), broken into CPU, other and log file IO — SAS-based redo log: CPU 1,605, other 397, log file IO 1,944; flash-based redo log: CPU 1,637, other 331, log file IO 1,681. Flash offers little advantage for sustained sequential redo writes]
Write degradation
• In theory, sustained high write IO can degrade SSD performance when garbage collection fails to keep up with the block erase/rewrite cycle
• In practice, this is rarely noticeable from Oracle:
  − Oracle write IO is largely asynchronous (DBWR)
  − Almost all write activity is matched by at least an equal amount of read activity
  − Garbage collection and wear-levelling algorithms in decent SSD drives are sophisticated
OS-level direct cache

Good candidates: read-intensive, potentially massive tablespaces
• Temp tablespace
• Hot segments
• Hot partitions
• DB flash cache (limited to the size of the SSD)

[Diagram: without directCache, the filesystem/raw devices/ASM sit on a regular block device via the ioMemory VSL; with directCache, they sit on a caching block device that fronts a SAN LUN]
FusionIO directCache – table scans
[Chart: elapsed time (s), broken into IO and other — no cache, 1st scan: 147; no cache, 2nd scan: 147; directCache on, 1st scan: 147; directCache on, 2nd scan: 36]
Exadata
Exadata flash storage
• 4x 96GB PCI flash cards on each storage server (X2)
• Flash can be configured as:
  − Exadata Smart Flash Cache (ESFC)
  − Solid state disk available to ASM disk groups
• ESFC is not the same as the DB flash cache:
  − Maintained by cellsrv, not the DBWR
  − DOES support full table scans
  − DOES NOT support smart scans, unless CELL_FLASH_CACHE=KEEP
  − Statistics accessed via the cellcli program
• Considerations for cache vs. SSD are similar
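The KEEP setting mentioned above is applied through the segment's storage clause; for example (hypothetical table name):

```sql
-- KEEP makes this segment's blocks eligible for Exadata Smart
-- Flash Cache even during smart (direct path) scans
ALTER TABLE sales STORAGE (CELL_FLASH_CACHE KEEP);
```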
Exadata: flash grid disk vs ESFC
100M row table, 200,000 random PK lookups, 1M possible keys
[Chart: elapsed seconds, with CPU component — SAS disk, no flash cache: 1,240; SAS disk with flash cache: 429; SSD grid disks, no flash cache: 119]
Exadata: SSD disk for redo
[Chart: elapsed time (s), broken into log IO and other — SAS redo, 10 concurrent (avg): 352; SSD redo, 10 concurrent (avg): 6,885; SAS redo, 1 concurrent: 236; SSD redo, 1 concurrent: 1,007. SSD-based redo is markedly slower]
Exadata: the effect of redo block size
(CellCLI: Release 11.2.2.3.2)
[Chart: elapsed time (s), with redo IO time and DB time components — SAS disk, blocksize 512: 236.05; SAS disk, blocksize 4K: 243.79; SSD disk, blocksize 512: 1,009.67; SSD disk, blocksize 4K: 309.68]
Exadata Smart Flash Log
Smart Flash Log
• Designed to reduce “outlier” redo log sync waits
• Redo is written simultaneously to disk and flash
• First write to complete wins
• Introduced in Exadata storage software 11.2.2.4
[Diagram: cellsrv on the storage cell issues each redo write to both disk and flash; the DB node is acknowledged as soon as the first write completes]
Pg. 67© 2012 Quest Software Inc. All rights reserved.
All redo log writes (16M log writes, µs)

Flash Log   Min   Median   Mean   99%     Max
ON          1.0   650      723    1,656   75,740
OFF         1.0   627      878    4,662   291,800
Pg. 68© 2012 Quest Software Inc. All rights reserved.
Redo log outliers

WAIT #47124064145648: nam='log file sync' ela= 710 buffer#=129938 sync scn=1266588258 p3=0 obj#=-1 tim=1347583167579790
WAIT #47124064145648: nam='log file sync' ela= 733 buffer#=130039 sync scn=1266588297 p3=0 obj#=-1 tim=1347583167580808
WAIT #47124064145648: nam='log file sync' ela= 621 buffer#=130124 sync scn=1266588332 p3=0 obj#=-1 tim=1347583167581695
WAIT #47124064145648: nam='log file sync' ela= 507 buffer#=130231 sync scn=1266588371 p3=0 obj#=-1 tim=1347583167582486
WAIT #47124064145648: nam='log file sync' ela= 683 buffer#=101549 sync scn=1266588404 p3=0 obj#=-1 tim=1347583167583398
WAIT #47124064145648: nam='log file sync' ela= 2084 buffer#=130410 sync scn=1266588442 p3=0 obj#=-1 tim=1347583167585748
WAIT #47124064145648: nam='log file sync' ela= 798 buffer#=130535 sync scn=1266588488 p3=0 obj#=-1 tim=1347583167586864
WAIT #47124064145648: nam='log file sync' ela= 1043 buffer#=101808 sync scn=1266588527 p3=0 obj#=-1 tim=1347583167588250
WAIT #47124064145648: nam='log file sync' ela= 2394 buffer#=130714 sync scn=1266588560 p3=0 obj#=-1 tim=1347583167590888
WAIT #47124064145648: nam='log file sync' ela= 932 buffer#=101989 sync scn=1266588598 p3=0 obj#=-1 tim=1347583167592057
WAIT #47124064145648: nam='log file sync' ela= 291780 buffer#=102074 sync scn=1266588637 p3=0 obj#=-1 tim=1347583167884090
WAIT #47124064145648: nam='log file sync' ela= 671 buffer#=102196 sync scn=1266588697 p3=0 obj#=-1 tim=1347583167885294
WAIT #47124064145648: nam='log file sync' ela= 957 buffer#=102294 sync scn=1266588730 p3=0 obj#=-1 tim=1347583167886575
WAIT #47124064145648: nam='log file sync' ela= 852 buffer#=120 sync scn=1266588778 p3=0 obj#=-1 tim=1347583167887763
WAIT #47124064145648: nam='log file sync' ela= 639 buffer#=214 sync scn=1266588826 p3=0 obj#=-1 tim=1347583167888778
WAIT #47124064145648: nam='log file sync' ela= 699 buffer#=300 sync scn=1266588853 p3=0 obj#=-1 tim=1347583167889767
WAIT #47124064145648: nam='log file sync' ela= 819 buffer#=102647 sync scn=1266588886 p3=0 obj#=-1 tim=1347583167890829
Pg. 69© 2012 Quest Software Inc. All rights reserved.
Top 10,000 waits
Pg. 70© 2012 Quest Software Inc. All rights reserved.
Exadata 12c Smart Flash Cache Write-back
• Database writes go to the flash cache:
  − LRU aging to magnetic disk
  − Reads serviced by flash prior to age-out
  − Similar restrictions to the flash cache (smart scans, etc.)
• Performance claims:
  − On an X3 system, 1M random single-block write IOPS
  − On an X3 system, 1.5M random single-block read IOPS
  − 500K write IOPS claimed for Exadata V2
  − Note: random IO writes are less problematic for flash than sequential writes
Pg. 71© 2012 Quest Software Inc. All rights reserved.
Summary
Recommendations
• Don’t wait for SSD to become as cheap as HDD:
  − Magnetic HDD will always be cheaper per GB; SSD cheaper per IO
  − Oracle’s Exadata strategy suggests that the era of SSD for Oracle is here
• Consider a mixed or tiered storage strategy:
  − Using the DB flash cache, selective SSD tablespaces or partitions
  − Use SSD where your IO bottleneck is greatest and the SSD advantage is significant
  − Oracle 12c ILM offers a perfect solution for data tablespaces
• The DB flash cache offers an easy way to leverage SSD for OLTP workloads, but has fewer advantages for OLAP or data warehouse workloads
How to use SSD
• Database flash cache:
  − If your bottleneck is single-block (indexed) reads and you are on OEL or Solaris, 11gR2
• Flash tablespace:
  − Optimize reads/writes against “hot” segments or partitions
• Flash temp tablespace:
  − If multi-pass disk sorts or hash joins are your bottleneck
• Exadata (esp. X3):
  − Provides most of the advantages without much setup
• 12c ILM:
  − Should provide an ideal solution for DIY tiering
References
• Latest version of this presentation: http://www.slideshare.net/gharriso/ssd-and-the-db-flash-cache
• Quest whitepaper: http://www.quest.com/documents/landing.aspx?id=15423
• Guy’s SSD guide: http://guyharrison.squarespace.com/ssdguide/