Flash Memory Database Systems and In-Page Logging

DESCRIPTION
Flash Memory Database Systems and In-Page Logging. Bongki Moon, Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A. [email protected]. In collaboration with Sang-Won Lee (SKKU) and Chanik Park (Samsung). Flash Talk. Computer Science Department.

TRANSCRIPT
KOCSEA’09, Las Vegas, December 2009. COMPUTER SCIENCE DEPARTMENT
Flash Memory Database Systems and In-Page Logging

Bongki Moon
Department of Computer Science
University of Arizona, Tucson, AZ 85721, U.S.A.
[email protected]

In collaboration with Sang-Won Lee (SKKU) and Chanik Park (Samsung)

Flash Talk
Magnetic Disk vs Flash SSD
• Champion for 50 years
  - Seagate ST340016A, 40 GB, 7200 RPM
• New challengers!
  - Samsung FlashSSD, 128 GB, 2.5/1.8 inch
  - Intel X25-M Flash SSD, 80 GB, 2.5 inch
Past Trend of Disk
• From 1983 to 2003 [Patterson, CACM 47(10) 2004]
  - Capacity increased about 2500 times (0.03 GB -> 73.4 GB)
  - Bandwidth improved 143.3 times (0.6 MB/s -> 86 MB/s)
  - Latency improved 8.5 times (48.3 ms -> 5.7 ms)

Year                | 1983         | 1990            | 1994            | 1998            | 2003
Product             | CDC 94145-36 | Seagate ST41600 | Seagate ST15150 | Seagate ST39102 | Seagate ST373453
Capacity            | 0.03 GB      | 1.4 GB          | 4.3 GB          | 9.1 GB          | 73.4 GB
RPM                 | 3600         | 5400            | 7200            | 10000           | 15000
Bandwidth (MB/sec)  | 0.6          | 4               | 9               | 24              | 86
Media diameter (in) | 5.25         | 5.25            | 3.5             | 3.0             | 2.5
Latency (msec)      | 48.3         | 17.1            | 12.7            | 8.8             | 5.7
I/O Crisis in OLTP Systems
• I/O becomes the bottleneck in OLTP systems
  - A large number of small random I/O operations must be processed
• Common practice to close the gap
  - Use a large disk farm to exploit I/O parallelism
    • Tens or hundreds of disk drives per processor core
    • E.g., IBM Power 596 Server: 172 15k-RPM disks per processor core
  - Adopt short-stroking to reduce disk latency
    • Use only the outer tracks of disk platters
• Other concerns are raised too
  - Wasted capacity of disk drives
  - Increased energy consumption
• Then, what happens 18 months later? To catch up with Moore's law and keep CPU and I/O balanced (Amdahl's law), must the number of spindles be doubled again?
Flash News in the Market
• Sun Oracle Exadata Storage Server [Sep 2009]
  - Each Exadata cell comes with 384 GB of flash cache
• MySpace dumped disk drives [Oct 2009]
  - Went all flash, saving power by 99%
• Google Chrome ditched disk drives [Nov 2009]
  - SSD is the key to 7-second boot time
• Gordon at UCSD/SDSC [Nov 2009]
  - 64 TB RAM, 256 TB flash, 4 PB disk
• IBM hooked up with Fusion-io [Dec 2009]
  - SSD storage appliance for the System X server line
Flash for Database, Really?
• Immediate benefit for some DB operations
  - Reduce commit-time delay by fast logging
  - Reduce read time for multi-versioned data
  - Reduce query processing time (sort, hash)
• What about the big fat tables?
  - Random scattered I/O is very common in OLTP
  - Can flash SSDs, with their slow random writes, handle this?
Transactional Log
[Diagram: SQL queries pass through the system buffer cache to the database table space, temporary table space, transaction (redo) log, and rollback segments; this slide highlights the transaction (redo) log]
Commit-time Delay by Logging
• Write-Ahead Log (WAL)
  - A committing transaction force-writes its log records
  - Makes it hard to hide latency
  - With a separate disk for logging:
    • No seek delay, but half a revolution of the spindle on average
    • 4.2 msec (7200 RPM), 2.0 msec (15k RPM)
  - With a flash SSD: about 0.4 msec
• Commit-time delay remains a significant overhead
  - Group commit helps, but the delay does not go away altogether
• How much commit-time delay?
  - On average, 8.2 msec (HDD) vs 1.3 msec (SSD): a 6-fold reduction
  - TPC-B benchmark with 20 concurrent users
[Diagram: transactions T1, T2, ..., Tn write through the SQL buffer and log buffer to the DB and LOG devices; page pi shown in the buffer]
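The rotational-latency figures above follow directly from the spindle speed: a force-written log record on a dedicated log disk waits half a revolution on average. A quick sketch of that arithmetic (illustrative, not from the talk):

```python
def half_rotation_ms(rpm):
    # One full revolution takes 60000 / rpm milliseconds; with no seek
    # on a dedicated log disk, a commit write waits half a revolution
    # on average.
    return 60_000 / rpm / 2

print(half_rotation_ms(7200))   # ~4.2 ms, as quoted for a 7200 RPM disk
print(half_rotation_ms(15000))  # 2.0 ms for a 15k RPM disk
```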
Rollback Segments
[Storage diagram repeated; this slide highlights the rollback segments]
MVCC Rollback Segments
• Multi-version Concurrency Control (MVCC)
  - Alternative to traditional lock-based CC
  - Supports read consistency and snapshot isolation
  - Oracle, PostgreSQL, Sybase, SQL Server 2005, MySQL
• Rollback Segments
  - Each transaction is assigned to a rollback segment
  - When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion)
  - To fetch the correct version of an object, check whether it has been updated by other transactions
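A toy sketch may help: each update saves the old value to an append-only rollback segment, and a reader holding an older snapshot walks the segment backwards until it finds a version visible to it. All names and structures here are illustrative simplifications, not the implementation of any of the systems above:

```python
class Store:
    """Toy MVCC store: 'current' holds the newest version of each object;
    old versions are appended to a single rollback segment."""
    def __init__(self):
        self.current = {}   # key -> (value, txn id of the writer)
        self.rollback = []  # append-only: (key, old value, old txn id)

    def update(self, key, value, txn_id):
        if key in self.current:
            old_value, old_txn = self.current[key]
            self.rollback.append((key, old_value, old_txn))  # save old version
        self.current[key] = (value, txn_id)

    def read(self, key, snapshot_txn):
        value, txn = self.current[key]
        if txn <= snapshot_txn:      # newest version is visible to this snapshot
            return value
        # Otherwise traverse the chain of old versions, newest first.
        for k, old_value, old_txn in reversed(self.rollback):
            if k == key and old_txn <= snapshot_txn:
                return old_value
        raise KeyError("no version visible to this snapshot")

s = Store()
s.update("A", 200, txn_id=0)
s.update("A", 100, txn_id=1)  # the value 200 moves to the rollback segment
s.update("A", 50, txn_id=2)   # the value 100 moves to the rollback segment
print(s.read("A", snapshot_txn=1))  # 100: an older snapshot sees an old version
```

A real system spreads the versions over several rollback segments, which is what makes these reads randomly scattered.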
MVCC Write Pattern
• Write requests from the TPC-C workload
  - Concurrent transactions generate multiple streams of append-only traffic in parallel (approximately 1 MB apart)
  - An HDD moves its disk arm very frequently
  - An SSD suffers no penalty, since append-only writes avoid its no-in-place-update limitation
MVCC Read Performance
• To support MV read consistency, I/O activity increases
  - A long chain of old versions may have to be traversed for each access to a frequently updated object
• Read requests are scattered randomly
  - Old versions of an object may be stored in several rollback segments
  - With SSD, a 10-fold read-time reduction was not surprising
[Diagram: transactions T0, T1, T2 and two rollback segments; object A's version chain: update (1) A: 200 -> 100 and update (2) A: 100 -> 50, so a reader may traverse old versions 100A and 200A scattered across rollback segments]
Database Table Space
[Storage diagram repeated; this slide highlights the database table space]
Workload in Table Space
• TPC-C workload
  - Exhibits little locality and sequentiality
    • Mix of small/medium/large read-write and read-only (join) transactions
  - Highly skewed
    • 84% (75%) of accesses go to 20% of tuples (pages)
• Write caching not as effective as read caching
  - Physical read/write ratio is much lower than the logical read/write ratio
• All bad news for flash memory SSDs
  - Due to no in-place update and asymmetric read/write speeds
  - Motivates the In-Page Logging (IPL) approach [SIGMOD'07]
In-Page Logging (IPL)
• Key ideas of the IPL approach
  - Changes are written to a log instead of being updated in place
    • Avoids frequent write and erase operations
  - Log records are co-located with their data pages
    • No need to write them sequentially to a separate log region
    • Current data can be read more efficiently than with sequential logging
  - DBMS buffer and storage managers work together
Design of the IPL
• Logging on a per-page basis in both memory and flash
  - An in-memory log sector (512 B) can be associated with a buffer frame
    • Allocated on demand when a page becomes dirty
  - An in-flash log segment is allocated in each erase unit
    • The log area is shared by all the data pages in the erase unit
[Diagram: database buffer with in-memory data pages (8 KB, updated in place) and in-memory log sectors (512 B); each 128 KB flash erase unit holds 15 data pages (8 KB each) plus a log area (8 KB: 16 sectors)]
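The sizes on this slide are self-consistent: with 8 KB pages in a 128 KB erase unit, reserving one page's worth of space for logging leaves 15 data pages and sixteen 512 B log sectors. A quick check:

```python
# Layout arithmetic for the IPL erase unit described on the slide.
ERASE_UNIT = 128 * 1024      # 128 KB erase unit
PAGE = 8 * 1024              # 8 KB data page
LOG_SECTOR = 512             # 512 B log sector

data_pages = 15
log_area = ERASE_UNIT - data_pages * PAGE
log_sectors = log_area // LOG_SECTOR
print(log_area, log_sectors)  # -> 8192 16, i.e. an 8 KB log area of 16 sectors
```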
IPL Write
• Data pages in memory
  - Updated in place, and
  - Physiological log records are written to the page's in-memory log sector
• The in-memory log sector is written to the in-flash log segment when
  - The data page is evicted from the buffer pool, or
  - The log sector becomes full
• When a dirty page is evicted, its content is NOT written to flash memory
  - The previous version remains intact
  - Data pages and their log records are physically co-located in the same erase unit
[Diagram: buffer pool with 8 KB pages and 512 B log sectors above the flash data block area (128 KB blocks); update/insert/delete applied in place in memory while physiological log records flow to flash]
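The write path above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the class names, the 32-byte record size, and the flash interface are assumptions:

```python
LOG_SECTOR_BYTES = 512   # size of an in-memory log sector
RECORD_BYTES = 32        # assumed size of one physiological log record

class Flash:
    """Stand-in for the in-flash log segments of the erase units."""
    def __init__(self):
        self.log_segments = {}  # page id -> list of flushed log records

    def write_log_sector(self, page_id, records):
        self.log_segments.setdefault(page_id, []).extend(records)

class BufferedPage:
    def __init__(self, page_id):
        self.page_id = page_id
        self.data = {}          # in-memory page, updated in place
        self.log = []           # in-memory log sector, allocated on demand
        self.log_bytes = 0

    def update(self, key, value, flash):
        self.data[key] = value                       # 1. update in place
        self.log.append((self.page_id, key, value))  # 2. log the change
        self.log_bytes += RECORD_BYTES
        if self.log_bytes >= LOG_SECTOR_BYTES:       # log sector full -> flush
            self._flush(flash)

    def evict(self, flash):
        # On eviction only the log records go to flash; the stale page
        # image already in flash remains intact.
        self._flush(flash)

    def _flush(self, flash):
        if self.log:
            flash.write_log_sector(self.page_id, self.log)
            self.log, self.log_bytes = [], 0

flash = Flash()
page = BufferedPage(7)
page.update("x", 1, flash)
page.evict(flash)
print(flash.log_segments[7])  # [(7, 'x', 1)]
```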
IPL Read
• When a page is read from flash, the current version is computed on the fly
  - Read from flash the original copy of Pi and all log records belonging to Pi (I/O overhead)
  - Apply the physiological log records to the copy read from flash (CPU overhead)
  - Re-construct the current in-memory copy
[Diagram: buffer pool receiving page Pi, rebuilt from the data area (120 KB: 15 pages) and log area (8 KB: 16 sectors) of its erase unit]
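A minimal sketch of this read-time reconstruction, treating a page as a dictionary and a physiological log record as a key/value pair (both simplifications):

```python
def read_page(flash_page, log_records):
    # Start from the (stale) original copy read from flash ...
    page = dict(flash_page)
    # ... then replay every log record belonging to this page.
    for key, value in log_records:
        page[key] = value
    return page  # re-constructed current in-memory copy

stale = {"a": 1, "b": 2}
logs = [("a", 10), ("c", 3)]
print(read_page(stale, logs))  # {'a': 10, 'b': 2, 'c': 3}
```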
IPL Merge
• When all free log sectors in an erase unit are consumed
  - Log records are applied to the corresponding data pages
  - The current data pages are copied into a new erase unit
• Consumes, erases, and releases only one erase unit
[Diagram: physical flash block Bold (15 data pages plus a full 8 KB log area of 16 sectors) merged into Bnew (15 up-to-date data pages plus a clean log area); Bold can then be erased]
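The merge step can be sketched under the same simplifying assumptions as above (pages as dictionaries, log records as key/value pairs):

```python
def merge(old_unit):
    """old_unit: {'pages': {pid: page dict}, 'log': {pid: [(key, value), ...]}}"""
    new_unit = {"pages": {}, "log": {}}      # Bnew starts with a clean log area
    for pid, page in old_unit["pages"].items():
        current = dict(page)
        for key, value in old_unit["log"].get(pid, []):
            current[key] = value             # apply the page's log records
        new_unit["pages"][pid] = current     # copy the up-to-date page
    return new_unit                          # Bold can now be erased

old = {"pages": {1: {"x": 0}}, "log": {1: [("x", 5)]}}
print(merge(old))  # {'pages': {1: {'x': 5}}, 'log': {}}
```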
Industry Response
• Common in enterprise-class SSDs
  - Multi-channel, inter-command parallelism
    • Throughput rather than bandwidth; write-followed-by-read pattern
  - Command queuing (SATA-II NCQ)
  - Large RAM buffer (with super-capacitor backup)
    • Even up to 1 MB per GB
    • Write-back caching, controller data (mapping, wear leveling)
  - Fat provisioning (up to ~20% of capacity)
• Impressive improvement

Prototype/Product | EC SSD | X-25M | 15k-RPM Disk
Read (IOPS)       | 10500  | 20000 | 450
Write (IOPS)      | 2500   | 1200  | 450
EC-SSD Architecture
• Parallel/interleaved operations
  - 8 channels, 2 packages/channel, 4 chips/package
  - Two-plane page write, block erase, and copy-back operations
[Diagram: host interface (SATA-II) and ARM9 main controller with ECC and 128 MB DRAM, driving a flash controller with 8 channels of NAND flash packages]
Concluding Remarks
• Recent advances cope with random I/O better
  - Write IOPS 100x higher than early SSD prototypes
  - TPS 1.3~2x higher than a RAID0 array of 8 HDDs for the read-write TPC-C workload, with much less energy consumed
• Write still lags behind: IOPS(Disk) < IOPS(SSD-Write) << IOPS(SSD-Read)
  - IOPS(SSD-Read) / IOPS(SSD-Write) = 4 ~ 17
• A lot more issues to investigate
  - Flash-aware buffer replacement, I/O scheduling, energy
  - Fluctuation in performance, tiered storage architecture
  - Virtualization, and much more ...
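The 4 ~ 17 range quoted here matches the IOPS numbers from the "Industry Response" slide:

```python
# Read/write IOPS ratios for the two SSDs listed earlier in the talk.
devices = {"EC SSD": (10500, 2500), "Intel X-25M": (20000, 1200)}
for name, (read_iops, write_iops) in devices.items():
    print(name, round(read_iops / write_iops, 1))
# The EC SSD gives 4.2 and the X-25M about 16.7, i.e. the 4 ~ 17 range.
```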
Questions?
• For more of our flash memory work
  - In-Page Logging [SIGMOD'07]
  - Logging, Sort/Hash, Rollback [SIGMOD'08]
  - SSD Architecture and TPC-C [SIGMOD'09]
  - In-Page Logging for Indexes [CIKM'09]
• Even more? www.cs.arizona.edu/~bkmoon