
Design of Flash-Based DBMS: An In-Page Logging Approach

Sang-Won Lee
School of Information & Communications Engr.

Sungkyunkwan University
Suwon 440-746, Korea

[email protected]

Bongki Moon
Dept. of Computer Science

University of Arizona
Tucson, AZ 85721, U.S.A.

[email protected]

ABSTRACT

The popularity of high-density flash memory as data storage media has increased steadily for a wide spectrum of computing devices such as PDA's, MP3 players, mobile phones and digital cameras. More recently, computer manufacturers started launching new lines of mobile or portable computers that did away with magnetic disk drives altogether, replacing them with tens of gigabytes of NAND flash memory. Like EEPROM and magnetic disk drives, flash memory is non-volatile and retains its contents even when the power is turned off. As its capacity increases and price drops, flash memory will compete more successfully with lower-end, lower-capacity disk drives. It is thus not inconceivable to consider running a full database system on the flash-only computing platforms or running an embedded database system on the lightweight computing devices. In this paper, we present a new design called in-page logging (IPL) for flash memory based database servers. This new design overcomes the limitations of flash memory such as high write latency, and exploits unique characteristics of flash memory to achieve the best attainable performance for flash-based database servers. We show empirically that the IPL approach can yield considerable performance benefit over traditional design for disk-based database servers. We also show that the basic design of IPL can be elegantly extended to support transactional database recovery.

Categories and Subject Descriptors
H. Information Systems [H.2 DATABASE MANAGEMENT]: H.2.2 Physical Design

General Terms
Design, Algorithms, Performance, Reliability

Keywords
Flash-Memory Database Server, In-Page Logging

∗This work was sponsored in part by MIC & IITA through IT Leading R&D Support Project, MIC & IITA through Oversea Post-Doctoral Support Program 2005, and MIC, Korea under ITRC IITA-2006-(C1090-0603-0046). This work was done while Sang-Won Lee visited the University of Arizona. The authors assume all responsibility for the contents of the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'07, June 12–14, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-686-8/07/0006 ...$5.00.

1. INTRODUCTION

Since a prototype NAND-type flash memory was introduced in 1987, high-density flash memory has been increasingly adopted as data storage media for a wide spectrum of computing devices such as PDA's, MP3 players, mobile phones and digital cameras. The success of flash memory as a storage alternative for mobile computers is due mainly to its advantages over magnetic disk drives, such as smaller size, lighter weight, better shock resistance, lower power consumption, less noise, and faster read performance [6]. Many market experts expect that this trend will continue for the coming years. This trend is reflected in the fact that one of the leading flash device manufacturers recently started launching new lines of mobile personal computers that did away with disk drives altogether, replacing them with tens of gigabytes of NAND flash memory [12]. The integration density of flash memory has doubled every year for the past few years [19]. We anticipate that flash memory products in the hundreds of gigabyte scale will be available in the market in the near future.

Like EEPROM and magnetic disk drives, flash memory is non-volatile and retains its contents even when the power is turned off. As its capacity increases and price drops, flash memory will compete more successfully with lower-end, lower-capacity magnetic disk drives [21]. It is thus not inconceivable to consider running a full database system on the flash-only computing platforms or running an embedded database system on the lightweight computing devices [24]. From the application development perspective, the next generation mobile or embedded devices are expected to handle large, complex, and more data-centric tasks. Therefore, application developers for such devices will be interested in taking advantage of the database technology (e.g., SQL-like API) in order to bring more products to market faster.

                         Access time
Media             Read        Write       Erase
Magnetic Disk†    12.7 ms     13.7 ms     N/A
                  (2 KB)      (2 KB)
NAND Flash‡       80 μs       200 μs      1.5 ms
                  (2 KB)      (2 KB)      (128 KB)

†Disk: Seagate Barracuda 7200.7 ST380011A, average access times including seek and rotational delay
‡NAND Flash: Samsung K9WAG08U1A 16 Gbits SLC NAND

Table 1: Access Speed: Magnetic disk vs. NAND Flash

With a proper software layer, known as the Flash Translation Layer (FTL) [16], which makes linear flash memory appear to upper layers like a disk drive, conventional disk-based database algorithms and access methods will function adequately without any modification. On the other hand, due to a few limitations of flash memory, this approach is not likely to yield the best attainable performance. With flash memory, no data item can be updated in place without erasing a large block of flash memory (called erase unit) containing the data item. As is shown in Table 1, writing a sector into a clean (or erased) region of flash memory is much slower than reading a sector. Since overwriting a sector must be preceded by erasing the erase unit containing the sector, the effective write bandwidth of flash memory will be even worse than that. It has been reported that flash memory exhibits poor write performance, particularly when small-to-moderate sized writes are requested in a random order [2], which is quite a common access pattern for database applications such as on-line transaction processing (OLTP). These unique characteristics of flash memory necessitate elaborate flash-aware data structures and algorithms in order to effectively utilize flash memory as data storage media.

In this paper, we present a novel in-page logging (IPL) approach toward the new design of flash-based database servers, which overcomes the limitations of flash memory and exploits its advantages. To avoid the high latency of write and erase operations that would be caused by small random write requests, changes made to a data page are buffered in memory on the per-page basis instead of writing the page in its entirety, and then the change logs are written sector by sector to the log area in flash memory for the changes to be eventually merged to the database.

The most common types of flash memory are NOR and NAND. NOR-type flash memory has a fully memory-mapped random access interface with dedicated address and data lines. On the other hand, NAND-type flash memory has no dedicated address lines and is controlled by sending commands and addresses through an indirect IO-like interface, which makes NAND-type flash memory behave similarly to the magnetic disk drives it was originally intended to replace [15]. The unit of read and write operations for NAND-type flash memory is a sector of typically 512 bytes, which coincides with the size of a magnetic disk sector. For this reason, the computing platforms we aim at in this paper are assumed to be equipped with NAND-type flash memory instead of magnetic disk drives. Hereinafter, we use the term flash memory to refer to NAND-type flash memory, unless we need to distinguish it from NOR-type flash memory.

The key contributions of this work are summarized as follows.

• A novel storage management strategy called in-page logging is proposed to overcome the limitations of flash memory and exploit its advantages, as flash memory is emerging as a replacement storage medium for magnetic disks. For the first time, we expose the opportunities and challenges posed by flash memory for the unique workload characteristics of database applications. Our empirical study demonstrates that the IPL approach can improve the write performance of conventional database servers by up to an order of magnitude or more for the OLTP type applications.

• The IPL design helps achieve the best attainable performance from flash memory while minimizing the changes made to the overall database server architecture. This shows that it is not only feasible but also viable to run a full-fledged database server on a wide spectrum of computing platforms with flash memory replacing magnetic disk drives.

• With a few simple modifications to the basic IPL design, the update logs of the IPL can be used to realize a lean recovery mechanism for transactions such that the overhead during normal processing and the cost of system recovery can be reduced considerably. This will also help minimize the memory footprint of a database server, which is particularly beneficial to mobile or embedded systems.

The rest of this paper is organized as follows. Section 2 discusses the characteristics of flash memory and their impact on disk-based database servers, and then presents the design objectives of the storage subsystem we propose. Section 3 describes the basic concepts and the design of the in-page logging (IPL) scheme. In Section 4, we analyze the performance of a traditional disk-based database server with the TPC-C benchmark, and demonstrate the potential of the in-page logging for considerable improvement of write performance through a simulation study. Section 5 discusses how the basic IPL design can be extended to support transactional database recovery. Lastly, Section 6 surveys the related work, and Section 7 summarizes the contributions of this paper.

2. DESIGN PRINCIPLES

In this section, we describe the key characteristics of flash memory that distinguish it from magnetic disk drives, and elaborate on how they would affect the performance of traditional disk-based database servers. We then provide the design principles for new flash-based database servers.

2.1 Characteristics of Flash Memory

2.1.1 No In-Place Update

Most traditional database systems assume magnetic disks as the secondary storage media and take advantage of efficient updates of data items by overwriting them in place. On the other hand, with flash memory, no data item (or a sector containing the data item) can be updated in place just by overwriting it. In order to update an existing data item, a time-consuming erase operation must be performed before overwriting. To make it even worse, the erase operation cannot be performed selectively on the particular data item or sector, but can only be done for an entire block of flash memory called erase unit containing the data item, which is much larger than a sector (typically 16 KBytes or 128 KBytes). Since every update request will cause an erase operation followed by a write, the effective update performance may degrade significantly on database servers with a flash-based storage system.

Consequently, in order to overcome the erase-before-write limitation of flash memory, it is essential to reconsider the current design of storage subsystems and reduce erase requests to a minimum so that the overall performance will not be impaired.

2.1.2 No Mechanical Latency

Flash memory is a purely electronic device and thus has no mechanically moving parts like disk heads in a magnetic disk drive. Therefore, flash memory can provide uniform random access speed. Unlike magnetic disks whose seek and rotational delay often becomes the dominant cost of reading or writing a sector, the time to access data in flash memory is almost linearly proportional to the amount of data irrespective of their physical locations in flash memory.1

The ability of flash memory to quickly perform a sector read or a sector (clean) write located anywhere in flash memory is one of the key characteristics we can take advantage of. In fact, this brings new opportunities for more efficient design of flash-based database server architectures (e.g., non-sequential logging with no performance penalty). We will show how this property of flash memory can be exploited in our design of in-page logging.

1 Even though it takes a rather long time for NAND flash to read out the first data byte compared to NOR flash because of the resistance of the NAND cell array, this time is still shorter than the seek time of a magnetic disk by several orders of magnitude [15].

2.1.3 Asymmetric Speed of Read/Write

The read and write speed of flash memory is asymmetric, simply because it takes longer to write (or inject charge into) a cell until reaching a stable status than to read the status from a cell. As is shown in Table 1, the read speed is typically at least twice as fast as the write speed. On the other hand, most existing software systems with magnetic disks implicitly assume that the speed of read and write operations is almost the same. Not surprisingly, there is little work targeted at the principles and algorithms for storage media with asymmetric speed of read and write operations.

This property of asymmetric speed leads us to revisit many existing techniques for DBMS implementation. We ultimately realize that it is critical to find ways to reduce write operations (and, as a result, erase operations), even if this increases the number of read operations, as long as the overall performance improves.

2.2 Problems with Conventional Designs

Most disk-based database systems rely on a paged I/O mechanism for database update and buffer management and take advantage of sequential accesses given the hardware characteristics of disk storage media composed of sectors, tracks and cylinders. One of the immediate implications is that even an update of a single record will cause an entire page (typically of 4 or 8 KBytes) containing the record to be overwritten. If the access pattern to database records is random and scattered, and the update granularity is small (as is often observed in the OLTP applications), the aggregate amount of data to be overwritten is likely to be much larger than the actual amount of data to be updated. Nonetheless, most disk-based database systems are still capable of dealing with such frequent updates, as is mentioned before, by overwriting them in place.

Suppose all the magnetic disks are replaced by flash memory in the computing platform which a conventional disk-based database system runs on. If the database system insists on updating data items in place, then, due to the erase-before-write limitation of flash memory, each update can only be carried out by erasing an entire erase unit after reading its content to memory followed by writing the updated content back to the erase unit. This will obviously lead to high update latency due to the cost of frequent erase operations as well as an increased amount of data to read from and write to flash memory. Moreover, in the presence of hot data items repeatedly updated, this would shorten the life span of flash memory, because an erase unit can be put through a finite number of erase cycles (typically up to 100,000 times) before becoming statistically unreliable.
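To make this read-erase-program cycle concrete, the following self-contained C sketch simulates flash with an in-memory array; the helper names (flash_read_unit, flash_erase_unit, flash_program_unit) and the toy sizes are illustrative assumptions, not any vendor API. It shows that, under strict in-place updating, changing a single 8 KB page drags the whole 128 KB erase unit through a read, an erase, and a re-program.

#include <stdint.h>
#include <string.h>

#define ERASE_UNIT_SIZE (128 * 1024)   /* 128 KB erase unit          */
#define PAGE_SIZE       (8 * 1024)     /* 8 KB database page         */
#define NUM_UNITS       4              /* toy flash of 4 erase units */

static uint8_t flash[NUM_UNITS][ERASE_UNIT_SIZE];   /* simulated flash */

/* Hypothetical primitives mirroring the three operations of Table 1. */
static void flash_read_unit(int u, uint8_t *buf) {          /* ~64 x 80 us  */
    memcpy(buf, flash[u], ERASE_UNIT_SIZE);
}
static void flash_erase_unit(int u) {                        /* ~1.5 ms      */
    memset(flash[u], 0xFF, ERASE_UNIT_SIZE);
}
static void flash_program_unit(int u, const uint8_t *buf) {  /* ~64 x 200 us */
    memcpy(flash[u], buf, ERASE_UNIT_SIZE);
}

/* In-place update: even a single-record change forces the entire
 * erase unit through a read / erase / program cycle.               */
void update_page_in_place(int unit, int page_idx, const uint8_t *new_page)
{
    static uint8_t unit_buf[ERASE_UNIT_SIZE];

    flash_read_unit(unit, unit_buf);                 /* read 128 KB        */
    memcpy(unit_buf + page_idx * PAGE_SIZE, new_page, PAGE_SIZE);
    flash_erase_unit(unit);                          /* erase-before-write */
    flash_program_unit(unit, unit_buf);              /* write 128 KB back  */
}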

Most flash memory devices or host systems adopt a process called wear leveling, within the devices themselves or in the software layers, in order to ensure that erase cycles are evenly distributed across the entire segment of flash memory so that the life span of flash memory is prolonged. For example, a log-based flash file system, ELF, achieves wear leveling by creating a new sequential log entry for each write operation. Thus, flash memory is used sequentially all the way through, only returning to previously used sectors after all of the sectors have been written to at least once [5]. Since an update is performed by writing the content into a new sector different from the current sector it was read from, an update does not require an erase operation as long as a free (i.e., erased) sector is available. This sequential logging approach, however, optimizes write performance to the detriment of read performance to the extent that the overall performance of transactional database processing may not be actually improved [9]. In addition, this approach tends to consume free sectors quite rapidly, which in turn requires frequent garbage collections to reclaim obsolete sectors to the pool of erased sectors.

Under either of these two approaches, namely in-place updating and sequential logging, the average latency of an update operation increases due to frequent execution of costly erase operations. This may become the major bottleneck in the overall performance of a database server, particularly for write-intensive workloads.

2.2.1 Disk-Based Server Performance

To make the points raised above more concrete, we ran a commercial database server on two computer systems that were identical except that one was equipped with a magnetic disk drive and the other with a flash memory storage device instead of the disk drive. In each case of the experiment, we executed an SQL query that accessed the same base table differently, sequentially or randomly, and measured the response time of the query in the wall clock time. Table 2 summarizes the read and write performance of the magnetic disk drive and the flash memory device, respectively, in terms of the random-to-sequential performance ratio.

                     Random-to-Sequential Ratio
Media                Read workload    Write workload
Magnetic Disk†       4.3 ∼ 12.3       4.5 ∼ 10.0
NAND Flash‡          1.1 ∼ 1.2        2.4 ∼ 14.2

†Disk: Seagate Barracuda 7200.7 ST380011A
‡NAND Flash: Samsung K9WAG08U1A 16 Gbits SLC NAND

Table 2: DBMS Performance: Sequential vs. Random

In the case of a magnetic disk drive, the random-to-sequential ratio was fairly high for both read and write queries. This result should not be surprising, given the high seek and rotational latency of magnetic disks. In the case of a flash memory device, the result was mixed and indeed quite surprising. The performance of a read query was insensitive to access patterns, which can be perfectly explained by the no-mechanical-latency property of flash memory. In contrast, the performance of a write query was even more sensitive to access patterns than the case of disk. This is because, with a random access pattern, each update request is very likely to cause an erase unit containing the data page being updated to be copied elsewhere and erased. This clearly demonstrates that database servers would potentially suffer serious update performance degradation if they ran on a computing platform equipped with flash memory instead of magnetic disks. See Section 4.1 for more detail of this experiment.

2.3 Design Manifesto

When designing a storage subsystem for flash-based database servers, we assume that the memory hierarchy of target computing platforms consists of two levels: volatile system RAM and non-volatile NAND flash memory replacing magnetic disks. Guided by the unique characteristics of flash memory described in this section, the design objectives of the storage subsystem are stated as follows.

• Take advantage of new features of flash memory such as uniform random access speed and asymmetric read/write speed. The fact that there is no substantial penalty for scattered random accesses allows us more freedom in locating data objects and log records across the flash-based storage space. In other words, log records can be scattered all over flash memory and need not be written sequentially.

• Overcome the erase-before-write limitation of flash memory. In order to run a database server efficiently on the target computing platforms, it is critical to minimize the number of write and/or erase requests to flash memory. Since the read bandwidth of flash memory is much higher than its write bandwidth, we may need to find ways to avoid write and erase operations even at the expense of more read operations. This strategy can also be justified by an observation that the fraction of writes among all IO operations increases as the memory capacity of database servers grows larger [9].

• Minimize the changes made to the overall DBMS architecture. Due to the modular design of contemporary DBMS architectures, the design changes we propose to make will be limited to the buffer manager and storage manager.

3. IN-PAGE LOGGING APPROACH

In this section, we present the basic concepts of the In-Page Logging (IPL) approach that we propose to overcome the problems of the conventional designs for disk-based database servers. We then present the architectural design and the core operations of the In-Page Logging. In Section 5, we will show how the basic design of IPL can be extended to provide transactional database recovery.

3.1 Basic Concepts

As described in Section 2.2, due to the erase-before-write limitation of flash memory, updating even a single record in a page results in invalidating the current page containing the record, and writing a new version of the page into an already-erased space in flash memory, which leads to frequent write and erase operations. In order to avoid this, we propose that only the changes made to a page are written (or logged) to the database on the per-page basis, instead of writing the page in its entirety.

Like conventional sequential logging approaches (e.g., log-structured file system [23]), all the log records might be written sequentially to a storage medium regardless of the locations of changes in order to minimize the seek latency, if the storage medium were a disk. One serious concern with this style of logging, however, is that whenever a data page is to be read from the database, the current version of the page has to be re-created by applying the changes stored in the log to the previous version of the page. Since log records belonging to the data page may be scattered and can be found only by scanning the log, it may be very costly to re-create the current page from the database.2

In contrast, since flash memory comes with no mechanical component, there is no considerable performance penalty arising from scattered writes [8], and there is no compelling reason to write log records sequentially either. Therefore, we co-locate a data page and its log records in the same physical location of flash memory, specifically, in the same erase unit. (Hence we call it In-Page Logging.) Since we only need to access the previous data page and its log records stored in the same erase unit, the current version of the page can be re-created efficiently under the IPL approach. Although the amount of data to read will increase by the number of log records belonging to the data page, it will still be a sensible trade-off for the reduced write and erase operations, particularly considering the fact that read is typically at least twice as fast as write for flash memory. Consequently, the IPL approach can improve the overall write performance considerably.

2 LGeDBMS, recently developed for embedded systems with flash memory, adopted the sequential logging approach for updating data pages [17].

Figure 1: From Update-In-Place to In-Page Logging

Figure 1 illustrates how the IPL approach evolves from the traditional update-in-place and log-structured approaches. While logging is a consequential decision due to the erase-before-write (or no update-in-place) limitation of flash memory, in-page logging takes advantage of the desirable properties (i.e., no mechanical latency and fast reads) of flash memory.

3.2 The Design of IPL

As is mentioned in the design manifesto (Section 2.3), the main design changes we propose to make to the overall DBMS architecture are limited to the buffer manager and storage manager. In order to realize the basic concepts of the in-page logging with the minimal cost, logging needs to be done by the buffer manager as well as the storage manager. See Figure 2 for the illustration of the IPL design.

Figure 2: The Design of In-Page Logging

Whenever an update is performed on a data page, the in-memory copy of the data page is updated just as done by traditional database servers. In addition, the IPL buffer manager adds a physiological log record on the per-page basis to the in-memory log sector associated with the in-memory copy of the data page. An in-memory log sector can be allocated on demand when a data page becomes dirty, and can be released when the log records are written to a log sector on the flash memory. The log records in an in-memory log sector are written to flash memory when the in-memory log sector becomes full or when a dirty data page is evicted from the buffer pool. The effect of the in-memory logging is similar to that of write caching [22], so that multiple log records can be written together at once and consequently frequent erase operations can be avoided. When a dirty page is evicted, it is not necessary to write the content of the dirty page back to flash memory, because all of its updates are saved in the form of log records in flash memory. Thus, the previous version of the data page remains intact in flash memory, but is just augmented with the update log records.
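As a minimal sketch of this buffer-side behavior (all type and function names below are illustrative, not the authors' implementation), the fragment keeps a 512-byte in-memory log sector next to each cached 8 KB page, appends a log record on every update, and hands the log sector to the storage manager only when it fills up or when the dirty page is evicted; the page image itself is never written back.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE       (8 * 1024)    /* 8 KB data page                      */
#define LOG_SECTOR_SIZE 512           /* flash sector = in-memory log sector */

/* One buffer frame: the cached page plus its on-demand in-memory log sector. */
struct ipl_frame {
    uint8_t page[PAGE_SIZE];          /* in-memory copy of the data page     */
    uint8_t log[LOG_SECTOR_SIZE];     /* pending (physiological) log records */
    int     log_used;                 /* bytes of the log sector in use      */
    int     dirty;
};

/* Stand-in for the IPL storage manager: write one log sector into a free
 * flash log sector of the erase unit that holds this data page.            */
static void ipl_flush_log_sector(uint32_t page_no, const uint8_t *log, int len)
{
    (void)page_no; (void)log; (void)len;  /* flash I/O omitted in this sketch */
}

/* Apply an update: patch the in-memory page and append a log record.
 * Only the log sector is ever flushed; the data page is not written back. */
void ipl_update(struct ipl_frame *f, uint32_t page_no,
                int offset, const void *data, int len,
                const void *log_rec, int log_len)
{
    memcpy(f->page + offset, data, len);                   /* update cached page */
    if (f->log_used + log_len > LOG_SECTOR_SIZE) {         /* log sector full    */
        ipl_flush_log_sector(page_no, f->log, f->log_used);
        f->log_used = 0;
    }
    memcpy(f->log + f->log_used, log_rec, log_len);
    f->log_used += log_len;
    f->dirty = 1;
}

/* On eviction of a dirty page, flush only the pending log records. */
void ipl_evict(struct ipl_frame *f, uint32_t page_no)
{
    if (f->dirty && f->log_used > 0)
        ipl_flush_log_sector(page_no, f->log, f->log_used);
    f->log_used = 0;
    f->dirty = 0;
}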

When an in-memory log sector is to be flushed to flash memory, its content is written to a flash log sector in the erase unit which its corresponding data page belongs to, so that the data page and its log records are physically co-located in the same erase unit. To do this, the IPL storage manager divides each erase unit of flash memory into two segments – one for data pages and the other for log sectors. For example, as shown in Figure 2, an erase unit of 128 KBytes (commonly known as large block NAND flash) can be divided into 15 data pages of 8 KBytes each and 16 log sectors of 512 bytes each. (Obviously, the size of an in-memory log sector must be the same as that of a flash log sector.) When an erase unit runs out of free log sectors, the IPL storage manager merges the data pages and log sectors in the erase unit into a new erase unit. This new merge operation proposed as an internal function of IPL will be presented in Section 3.3 in more detail.

This new logic for update requires the redefinition of the read operation as well. When a data page is to be read from flash memory due to a page fault, the current version of the page has to be computed on the fly by applying its log records to the previous version of the data page fetched from flash memory. This new logic for the read operation clearly incurs additional overhead for both IO cost (to fetch a log sector from flash memory) and computational cost (to compute the current version of a data page). As is pointed out in Section 3.1, however, this in-page logging approach will eventually improve the overall performance of the buffer and storage managers considerably, because write and erase operations will be requested less frequently.
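The read path can be sketched in the same spirit (again with an illustrative record format: real IPL log records are physiological, whereas this sketch uses simple byte-range patches): on a page fault, the previous version of the page and the log sectors co-located in its erase unit are fetched, and the log records are replayed over the page image to reconstruct the current version.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE (8 * 1024)

/* Simplified log record: a byte-range patch of the page (for illustration
 * only; real IPL log records are physiological).                          */
struct log_rec {
    uint16_t offset;     /* where in the page the change applies */
    uint16_t len;        /* number of payload bytes that follow  */
    /* followed by `len` bytes of payload */
};

/* Re-create the current page: start from the old version read from the
 * erase unit and replay every log record found in its flash log sectors. */
void ipl_apply_log(uint8_t page[PAGE_SIZE], const uint8_t *log, int log_bytes)
{
    int pos = 0;
    while (pos + (int)sizeof(struct log_rec) <= log_bytes) {
        struct log_rec r;
        memcpy(&r, log + pos, sizeof r);       /* copy header (alignment-safe) */
        if (r.len == 0)                        /* end of valid records         */
            break;
        if (pos + (int)sizeof r + r.len > log_bytes)
            break;                             /* truncated record             */
        memcpy(page + r.offset, log + pos + sizeof r, r.len);
        pos += (int)sizeof r + r.len;
    }
}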

The memory overhead is another factor to be examined for the design of IPL. In the worst case in which all the pages in the buffer pool are dirty, an in-memory log sector has to be allocated for each buffer page. In real-world applications, however, an update to a base item is likely to be quickly followed by updates to the same or related items (known as update locality) [1], and the average ratio of dirty pages in the buffer is about 5 to 20 percent [14]. With such a low ratio of dirty pages, if a data page is 8 KBytes and an in-memory log sector is 512 bytes, then the additional memory requirement will be no more than 1.3% of the size of the buffer pool. Refer to Section 4.2.2 for the update pattern of the TPC-C benchmark.
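One way to arrive at the 1.3% figure: each in-memory log sector adds 512 bytes per 8 KByte buffer page, i.e., 512/8192 = 6.25% per dirty page; at the 20 percent upper bound of the dirty-page ratio, the expected overhead is 0.2 × 6.25% = 1.25%, or roughly 1.3% of the buffer pool.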

3.3 Merge Operation

An in-memory log sector can store only a finite number of log records, and its content is flushed into a flash log sector when it becomes full. Since there are only a small number of log sectors available in each erase unit of flash memory, if data pages fetched from the same erase unit get updated often, the erase unit may run out of free log sectors. This is when merging data pages and their log sectors is triggered by the IPL storage manager. If there is no free log sector left in an erase unit, the IPL storage manager allocates a free erase unit, computes the new version of the data pages by applying the log records to the previous version, writes the new version into the free erase unit, and then erases and frees the old erase unit. The algorithmic description of the merge operation is given in Algorithm 1.

Algorithm 1: Merge Operation

Input: Bo: an old erase unit to merge
Output: B: a new erase unit with merged content

procedure Merge(Bo, B)
1:  allocate a free erase unit B
2:  for each data page p in Bo do
3:      if any log record for p exists then
4:          p′ ← apply the log record(s) to p
5:          write p′ to B
        else
6:          write p to B
        endif
    endfor
7:  erase and free Bo

The cost of a merge operation is clearly much higher than that of a basic read or write operation. Specifically, the cost of a merge operation will amount to

    c_merge = (k_d + k_l) × c_read + k_d × c_write + c_erase

for IO, plus the computation required for applying log records to data pages. Here, k_d and k_l denote the number of data sectors and log sectors in an erase unit, respectively. Note that a merge operation is requested only when all the log sectors are consumed on an erase unit. This actually helps avoid the frequent write and erase operations that would be requested by the in-place updating or sequential logging method of traditional database servers.

When a merge operation is completed for the data pages stored in a particular erase unit, the content of the erase unit (i.e., the merged data pages) is physically relocated to another erase unit in flash memory. Therefore, the logical-to-physical mapping of the data pages should be updated as well. Most flash memory devices store the mapping information persistently in flash memory, which is maintained as meta-data by the flash translation layer (FTL) [16, 18]. Note again that the mapping information needs to be updated only when a merge operation is performed, and the performance impact will be even less under the IPL design than in traditional database servers that require updating the mapping information more frequently for all write operations.
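A minimal sketch of this bookkeeping (the array name and the erase-unit granularity are assumptions for illustration, not the FTL's actual metadata layout): the logical-to-physical map is touched only by the merge path, never by ordinary log-sector writes.

#include <stdint.h>

#define NUM_LOGICAL_UNITS 4096   /* e.g., a 512 MB volume of 128 KB erase units */

/* Logical-to-physical erase-unit map; an FTL would persist this in flash
 * metadata, here it is just an in-memory array.                           */
static uint32_t l2p[NUM_LOGICAL_UNITS];

/* Called only when a merge relocates an erase unit: remap the logical unit
 * to the freshly written physical unit.                                    */
void ipl_remap_after_merge(uint32_t logical_unit, uint32_t new_physical_unit)
{
    l2p[logical_unit] = new_physical_unit;
}

/* Ordinary reads and log-sector writes just translate through the map. */
uint32_t ipl_translate(uint32_t logical_unit)
{
    return l2p[logical_unit];
}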

4. PERFORMANCE EVALUATION

In this section, we analyze the performance characteristics of a conventional disk-based database server to expose the opportunities and challenges posed by flash memory as a replacement medium for magnetic disk. We also carry out a simulation study with the TPC-C benchmark to demonstrate the potential of the IPL approach for considerable improvement of write performance.

4.1 Disk-Based Server Performance

In this section, we analyze the performance characteristics of a conventional disk-based database server with respect to different types of storage media, namely, magnetic disk and flash memory.

4.1.1 Setup for Experiment

We ran a commercial database server on two Linux systems, each with a 2.0 GHz Intel Pentium IV processor and 1 GB RAM. The computer systems were identical except that one was equipped with a magnetic disk drive and the other with a flash memory storage device instead of the disk drive. The model of the disk drive was Seagate Barracuda 7200.7 ST380011A, and the model of the flash memory device was M-Tron MSD-P35 [11] (shown in Figure 3), which internally deploys Samsung K9WAG08U1A 16 Gbits SLC NAND flash. Both storage devices were connected to the computer systems via an IDE/ATA interface.

                   Query processing time (sec)                       Query processing time (sec)
Read Queries          Disk       Flash         Write Queries           Disk       Flash
Sequential (Q1)       14.04      11.02         Sequential (Q4)         34.03      26.01
Random (Q2)           61.07      12.05         Random (Q5)             151.92     61.76
Random (Q3)           172.01     13.05         Random (Q6)             340.72     369.88

Table 3: Read and Write Query Performance of a Commercial Database Server

Figure 3: MSD-P35 NAND Flash-based Solid State Disk

In order to minimize the interference by data caching and logging, the commercial database server was set to access both types of storage as a raw device, and no logging option was chosen so that most of IO activities were confined to data pages of a base table and index nodes. The size of a buffer pool was limited to 20 MBytes, and the page size was 8 KBytes by default for the database server.

A sample table was created on each of the storage devices, and then populated with 640,000 records of 650 bytes each. Since each data page (of 8 KBytes) stored up to 10 records, the table was composed of 64,000 pages. In case of the flash memory device, this table was spanned over 4,000 erase units, because each 128 KByte erase unit had sixteen 8 KByte data pages in it. The domain of the first two columns of the table was integer, and the values of the first two columns were given by the following formulas

    col1 = ⌊record_id / 160⌋
    col2 = record_id mod 160

so that we could fully control data access patterns through B+-tree indices created on the first two columns.

4.1.2 Read Performance

To compare the read performance of magnetic disk and flash memory, we ran the following queries Q1, Q2 and Q3 on each of the two computer systems. Q1 scans the table sequentially; Q2 and Q3 read the table randomly. The detailed description of the queries is given below.

Q1: scan the entire table of 64,000 pages sequentially.

Q2: pick 16 consecutive pages randomly and read them together at once; repeat this until each and every page of the table is read only once.

Q3: read a page each time such that two pages read in sequence are apart by 16 pages in the table. The id's of pages read by this query are in the following order: 0, 16, 32, ..., 63984, 1, 17, 33, ....

The response times of the queries measured in the wall clock time are presented in Table 3. In the case of disk, the response time increased as the access pattern changed from sequential (Q1) to quasi-random (Q2 and Q3). Given the high seek and rotational latency of magnetic disks, this result was not surprising, because the more random the access pattern is, the more frequently the disk arm has to move. On the other hand, in the case of flash memory, the amount of increase in the response times was almost negligible. This result was also quite predictable, because flash memory has no mechanically moving parts and no mechanical latency.

4.1.3 Write Performance

To compare the write performance of magnetic disk and flash memory, we ran the following queries Q4, Q5 and Q6 on each of the two computer systems. Q4 updates all the pages in the table sequentially; Q5 and Q6 update all the pages in the table in a random order but differently. The detailed description of the queries is given below.

Q4: update each page in the entire table sequentially.

Q5: update a page each time such that two pages updated in sequence are apart by 16 pages in the table. The id's of pages updated by this query are in the following order: 0, 16, 32, ..., 63984, 1, 17, 33, ....

Q6: update a page each time such that two pages updated in sequence are apart by 128 pages in the table. The id's of pages updated by this query are in the following order: 0, 128, 256, ..., 63872, 1, 129, 257, ....

The response times of the queries measured in the wall clock time are presented in Table 3. In the case of disk, the trend in the write performance was similar to that observed in the read performance. As the access pattern became more random, the query response time became longer due to prolonged seek and rotational latency.

In the case of flash memory, however, a striking contrast was observed between the read performance and the write performance. As the access pattern changed from sequential to random, the update query response time became longer, and it was actually worse than that of disk for Q6. As is discussed previously, the dominant factor of write performance for flash memory is how often an erase operation has to be performed.

In principle, each and every update operation can cause an erase unit to be copied elsewhere and erased. In practice, however, most flash memory products are augmented with a DRAM buffer to avoid as many erase operations as possible. The MSD-P35 NAND flash solid state disk comes with a 16 MByte DRAM buffer, each one-MByte segment of which can store eight contiguous erase units. If data pages are updated sequentially, they can be buffered in a DRAM buffer segment and written to the same erase unit at once. This was precisely what happened to Q4. Due to the sequential order of updates, each of the 4000 erase units of the table was copied and erased only once during the processing of Q4.

On the other hand, Q5 and Q6 updated data pages in a random order but differently. Each pair of pages updated by Q5 in sequence were apart by 16 pages, which is equivalent to an erase unit. Since a total of eight consecutive erase units are mapped to a DRAM buffer segment of one MByte, an erase operation was requested every eight page updates (i.e., a total of 64000/8 = 8000 erases). The pages updated by Q6 were apart from the previous and the following pages by 128 pages, which is equivalent to a DRAM buffer segment of one MByte. Consequently, each page updated by Q6 caused an erase operation (i.e., a total of 64000 erases). This is the reason why Q6 took considerably more time than Q5, which in turn took more time than Q4.

4.2 Simulation with TPC-C Benchmark

In this section, we examine the write performance of the IPL approach and compare it with that of a disk-based database server for a more realistic workload. We used a reference stream of the TPC-C benchmark, which is a representative workload for on-line transaction processing, and estimated the performance of a server managing a database stored in flash memory, with and without the IPL features. As pointed out in Section 3, the IPL read operation may incur additional overhead to fetch log sectors from flash memory. However, due to the superior read performance of flash memory irrespective of access patterns, as shown in Table 3, we expect that the benefit from the improved write performance will outweigh the increased cost of read operations.

4.2.1 TPC-C Trace Generation

To generate a reference stream of the TPC-C benchmark, we used a workload generation tool called Hammerora [7]. Hammerora is an open source tool written in Tcl/Tk. It supports TPC-C version 5.1, and allows us to create database schemas, populate database tables of different cardinality, and run queries from a different number of simulated users.

We ran the Hammerora tool with the commercial database server on a Linux platform under a few different configurations, each of which is a combination of the size of the database, the number of simulated users, and the size of the system buffer pool. When the database server ran under each configuration, it generated (physiological) log records during the course of query processing. In our experiments, we used the following traces to simulate the update behaviors of a database server with and without the IPL feature.

100M.20M.10u: 100 MByte database, 20 MByte buffer pool, 10 simulated users
1G.20M.100u: 1 GByte database, 20 MByte buffer pool, 100 simulated users
1G.40M.100u: 1 GByte database, 40 MByte buffer pool, 100 simulated users

Note that each of the traces contained update log records only, because the database server did not produce any log record for read operations. Nonetheless, these traces provided us with enough information, as our empirical study was focused on analyzing the update behaviors of a database server with flash memory.

4.2.2 Update Pattern of the TPC-C Benchmark

First, we examined the lengths of log records. Since the number of log records kept in memory by the IPL buffer manager is determined by the average length of log records and the size of a flash sector, which is typically 512 bytes, the average length of log records is an indicator suggesting how quickly an in-memory log sector becomes full and gets written to flash memory. Table 4 shows the average length of log records by the types of operations for the 1G.20M.100u trace. Since the average length is no longer than 50 bytes, a log sector of 512 bytes can store up to 10 log records on average. This implies that an in-memory log sector will not be flushed to flash memory until its associated data page gets updated 10 times on average, unless the data page is evicted by a buffer replacement mechanism. In addition to the three types of physiological log records, the traces contain log records of physical page writes. For example, the 1G.20M.100u trace contained 625527 log records of physical page writes in addition to the 784278 physiological log records.

Operation     Occurrences           Avg. length (bytes)
Insert        86902 (11.08%)        43.5
Delete        284 (0.06%)           20.0
Update        697092 (88.88%)       49.4
Total         784278 (100.00%)      48.7

Table 4: Update Log Statistics of the 1G.20M.100u Trace

Second, we examined the log reference locality by counting the number of log records that updated individual data pages. As shown in Figure 4(a), the distribution of update frequencies was highly skewed. We also examined the distribution of references in terms of how frequently individual data pages were physically written to the database. Note that the frequency distribution of physical page writes is expected to be correlated to the log reference locality above, but may be slightly different, because a data page cached in the buffer pool can be modified multiple times (generating multiple log records) until it is evicted and written to the database. We obtained the page write count for each data page from the traces, which contained the information of individual data pages being written to the database, in addition to the update log records. Figure 4(b) shows the distribution of the frequency of physical page writes for the 2000 most frequently updated pages in the 1G.20M.100u trace. The distribution of the write frequencies is also clearly skewed. In this case, the 2000 most frequently written pages (1.6% of a total of 128K pages in the database) were written 29% of the times (637823 updates).3 We also derived the physical erase frequencies by mapping each page to its corresponding erase unit. Figure 4(c) shows the erase frequencies of erase units for the same 1G.20M.100u trace.

Third, we examined the temporal locality of data page updates by running a sliding window of length 16 through each trace of physical write operations. We counted the number of distinct pages within individual windows, and averaged them across the entire span of the trace. For the 1G.20M.100u trace, the probability that 16 consecutive physical writes would be done for 16 distinct data pages was 99.9%. We can derive a similar analysis for erase units. The probability that 16 consecutive physical writes would be done for 16 distinct erase units was 93.1% (i.e., on average 14.89 out of 16). Due to this remarkable lack of temporal locality, the update patterns of the TPC-C benchmark are expected to cause a large number of erase operations with flash memory.
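The measurement just described can be reproduced with a short routine along the following lines (a sketch under the assumption that the trace is available as an array of page or erase-unit ids; all names are illustrative): slide a window of 16 consecutive physical writes over the trace and average the number of distinct ids per window.

#include <stdio.h>

#define WINDOW 16

/* Average number of distinct ids within every window of 16 consecutive
 * physical writes; `ids` may hold page ids or erase-unit ids.           */
double avg_distinct_in_window(const unsigned *ids, int n)
{
    long long total_distinct = 0;
    int windows = 0;

    for (int start = 0; start + WINDOW <= n; start++, windows++) {
        int distinct = 0;
        for (int i = start; i < start + WINDOW; i++) {
            int seen = 0;
            for (int j = start; j < i; j++)
                if (ids[j] == ids[i]) { seen = 1; break; }
            if (!seen) distinct++;          /* first occurrence in this window */
        }
        total_distinct += distinct;
    }
    return windows ? (double)total_distinct / windows : 0.0;
}

int main(void)
{
    /* toy trace: mostly distinct pages, one repeat inside the first window */
    unsigned trace[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 16, 17 };
    printf("avg distinct ids per window of %d: %.2f\n", WINDOW,
           avg_distinct_in_window(trace, (int)(sizeof trace / sizeof trace[0])));
    return 0;
}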

4.2.3 Write Performance Estimation

In order to evaluate the impact of the IPL design on the performance of flash-based database servers, we implemented an event-driven simulator modeling the IPL buffer and storage managers as described in Section 3. This simulator reads log records from the TPC-C traces described in Section 4.2.1, and mimics the operations that would be carried out by the IPL managers according to the types of log records. The simulator was implemented in the C language, and its pseudo-code is shown in Algorithm 2.

3 This trend of skewedness appears to coincide with the TPC-C characterization of DB2 UDB traces [4]. The DB2 UDB traces contained both read and write references.

[Figure 4: TPC-C Update Locality in terms of Physical Writes (1G.20M.100u Trace): (a) log reference locality by pages, (b) locality of physical page writes, and (c) locality of physical erases, each plotted for the most frequently updated pages or erase units.]

Algorithm 2: Pseudo-code of the IPL Simulator

Input: {Li}: a trace from the TPC-C benchmark

procedure Simulate({Li})
1:  for each log record Li do
2:      if Li.opcode ∈ {insert, delete, update} then
3:          if log_count(Li.pageid) ≥ τs then
4:              generate a sector-write event
5:              log_count(Li.pageid) ← 0
            endif
6:          log_count(Li.pageid)++
        else    // Li is a log record of physical page write
7:          generate a sector-write event
8:          log_count(Li.pageid) ← 0
        endif
    endfor

event handler SectorWrite({Li})
9:   eid ← erase unit id of the sector
10:  if logsector_count(eid) ≥ τe then
11:      global_merge_count++
12:      logsector_count(eid) ← 0
     endif
13:  logsector_count(eid)++
14:  global_sector_write_count++

There are four types of log records in the traces: three types of physiological log records (namely, insert, delete, and update) plus log records for physical writes of data pages. When a physiological log record is accepted as input, the simulator mimics adding the log record into the corresponding in-memory log sector by incrementing the log record counter of the particular sector.4 If the counter of the in-memory log sector has already reached the limit (denoted by τs in the pseudo-code), then the counter is reset, and an internal sector-write event is created to mimic flushing the in-memory log sector to a flash log sector (Lines 3 to 5).

When a log record of physical page write is accepted as input, the simulator mimics flushing the in-memory log sector of the page being written to the database by creating an internal sector-write event (Lines 7 to 8). Note that the in-memory log sector is flushed even when it is not full, because the log record indicates that the corresponding data page is being evicted from the buffer pool.

Whenever an internal sector-write event is generated, the simulator increments the number of consumed log sectors in the corresponding erase unit by one. If the counter of the consumed log sectors has already reached the limit (denoted by τe in the pseudo-code), then the counter is reset, and the simulator increments the global counter of merges by one to mimic the execution of a merge operation and to keep track of the total number of merges (Lines 10 to 13). The simulator also increments the global counter of sector writes by one to keep track of the total number of sector writes.

4 Since the traces do not include any record of physical page reads, we can not tell when a page is fetched from the database, but it is inferred from the log record that the page referenced by the log record must have been fetched before the reference.

Trace            No. of update logs    No. of sector writes
100M.20M.10u     79136                 46893
1G.40M.100u      784278                594694
1G.20M.100u      785535                559391

Table 5: Statistics of Log Records and Sector Writes

When we ran the IPL simulator through each of the traces, we increased the size of the log region in each erase unit from 8 KBytes to 64 KBytes in increments of 8 KBytes to observe the impact of the log region size on the write performance. The IPL simulator returns two counters at the completion of analyzing a trace, namely, the total number of sector writes and the total number of erase unit merges. The number of sector writes is determined by the update reference pattern in a given trace and the buffer replacement by the database server, independently of the size of a log region in the erase units. Table 5 summarizes the total number of references in each trace and the number of sector writes reported by the IPL simulator for each trace. Note that, even with the relatively small sizes chosen for the buffer pool, which causes in-memory log sectors to be flushed prematurely, the number of sector writes was reduced by a non-trivial margin compared with the number of update references.

[Figure 5: Simulated Merge Counts: total number of merges (×1000) as a function of the log region size per 128 KB erase unit (8 KB to 56 KB), for the 100M.20M.10u, 1G.20M.100u, and 1G.40M.100u traces.]

On the other hand, the number of merges is affected by the size of a log region. Figure 5 shows the number of merges for each of the three traces, 100M.20M.10u, 1G.20M.100u and 1G.40M.100u, with a varying size of the log region. As the size of the log region increased, the number of merge operations decreased dramatically at the cost of increased storage space for the database. As we observed in Figure 4(a), the distribution of update frequencies was so skewed that a small fraction of data pages were updated much more frequently than the rest, before they were evicted from the buffer pool. This implies that the erase units containing the hot pages exhausted their log regions rapidly, and became prone to merge operations very often. For this reason, the more flash log sectors were added to an erase unit, the less frequently the erase unit was merged (i.e., copied and erased), because more updates were absorbed in the log sectors of the erase unit.

[Figure 6: Estimated Write Performance and Space Overhead: (a) predicted write time (sec) and (b) database size (MBytes), as a function of the log region size per 128 KB erase unit (8 KB to 56 KB), for the 1 GB traces.]

[Figure 7: Performance Trend with Varying Buffer Sizes (1GB DB and 100 users; 8KB log sectors): (a) total number of sector writes, (b) total number of merges performed, and (c) estimated write time, as a function of the buffer pool size (20 MB to 100 MB) of the database server.]

To understand the performance impact of the reduced merge operations more realistically, we used the following formula to estimate the time that an IPL-enabled database server would spend on performing the insert, delete and update operations in the TPC-C traces.

t_IPL = (# of sector writes) × 200 μs + (# of merges) × 20 ms

The average time (200 μs) spent on writing a flash sector is taken from Table 1 (see footnote 5 below). The average time (20 ms) spent on merging an erase unit can be calculated from Table 1 by adding the average times taken for reading, writing and erasing an erase unit (see footnote 6 below).
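As a concrete illustration, the cost model can be evaluated directly from the simulator counters. The snippet below is a minimal sketch that uses the sector-write count of the 100M.20M.10u trace from Table 5 together with an assumed merge count; the merge count is a hypothetical placeholder, since the per-trace merge counts are only shown graphically in Figure 5.

# Minimal sketch of the t_IPL cost model, using the per-sector write time
# (200 us) and per-merge time (20 ms) derived from Table 1. The merge count
# used in the example is a hypothetical placeholder, not a reported number.

SECTOR_WRITE_US = 200        # write time of one flash sector
MERGE_MS = 20                # read + write + erase of one 128 KB erase unit

def t_ipl_seconds(sector_writes: int, merges: int) -> float:
    """Estimated write time of an IPL-enabled server, in seconds."""
    return sector_writes * SECTOR_WRITE_US * 1e-6 + merges * MERGE_MS * 1e-3

# Example with the 100M.20M.10u trace (46,893 sector writes, Table 5)
# and an assumed merge count of 10,000:
# print(t_ipl_seconds(46_893, 10_000))   # roughly 209 seconds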

Figure 6(a) shows the write time estimated by the t_IPL formula for each of the three traces. The performance benefit from the increased number of log sectors was evident, but the space overhead, as shown in Figure 6(b), was not trivial. As the price of flash memory is expected to drop steadily, however, we expect that the increased throughput will outweigh the increased cost for storage.

Footnote 5: Writing a 512-byte sector takes the same amount of time as writing a 2-KByte block on large-block NAND flash devices.
Footnote 6: This average time of merge coincides with the measurement given by Birrell et al. [2].

We also examined how the performance of a database server with the IPL features was affected by the capacity of the system buffer pool. To do this, we generated three additional traces called 1G.60M.100u, 1G.80M.100u and 1G.100M.100u by running the database server with a buffer pool of different capacities, namely, 60 MB, 80 MB and 100 MB. Figures 7(a) and 7(b) show the total number of sector writes and the total number of merges with a varying capacity of the buffer pool. Not surprisingly, as the buffer capacity increased, the total number of pages replaced by the buffer manager decreased. Consequently, Figure 7(c) shows a similar trend in the estimated write time as the buffer capacity increases.

In addition, Figure 7(c) shows the expected write time that a conventional database server would spend without the IPL features. The write time of this case was estimated by the following formula,

t_Conv = (α × # of page writes) × 20 ms

where the parameter α denotes the probability that a page write will cause the containing erase unit to be copied and erased. In Figure 7(c), the value of α was set to 90% and 50%. (Recall that in Section 4.2.2 the probability of 16 consecutive physical writes being done to 16 distinct erase units was 93.1%.) Even when the value of α was arbitrarily adjusted from 90% to 50%, the write performance of the IPL server was an order of magnitude better than that of the conventional server. Note that the y-axis of Figure 7(c) is in the logarithmic scale.
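To make the comparison concrete, the conventional-server estimate can be computed alongside t_IPL. The sketch below uses a hypothetical page-write count, since the exact per-trace values behind Figure 7(c) are not listed in the text; it is meant only to show how α enters the estimate.

# Minimal sketch of the conventional cost model t_Conv. The page-write count
# in the example is a hypothetical placeholder, not a measured value.

MERGE_MS = 20  # copy + erase of one 128 KB erase unit (derived from Table 1)

def t_conv_seconds(page_writes: int, alpha: float) -> float:
    """Estimated write time of a conventional (non-IPL) server, in seconds."""
    return alpha * page_writes * MERGE_MS * 1e-3

# Hypothetical example with 500,000 page writes issued by the buffer manager:
# for alpha in (0.9, 0.5):
#     print(alpha, t_conv_seconds(500_000, alpha))   # 9000 s and 5000 s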

4.3 Summary

High-density flash memory has been successfully adopted by personal media players, because flash memory yields excellent read and write performance for sequential access patterns. However, as shown in Table 3, the write performance of flash memory deteriorates drastically as the access pattern becomes random, which is quite common for OLTP-type applications.


The simulation study reported in this section demonstrates that the IPL strategy can help database servers overcome the limitations of flash memory and achieve the best attainable performance.

5. SUPPORT FOR RECOVERY

In this section, we discuss how the basic IPL design can be augmented to support transactional database recovery. The IPL buffer and storage managers, as described in Section 3, rely on logging updates temporarily in main memory and persistently in flash memory in order to overcome the erase-before-write limitation of flash memory. The update logs of IPL can also be used to realize a lean recovery mechanism for transactions with minimal overhead during normal processing, such that the cost of system recovery can be reduced considerably. This will help minimize the memory footprint of database servers, particularly for mobile or embedded systems.

5.1 Additional Logging and Data Structure

For the support of transactional database recovery, it is necessary to adopt conventional system-wide logging, typically maintained in separate storage, for keeping track of the start and end (i.e., commit or abort) of transactions. Like the transaction log of the Postgres No-Overwrite Storage [25], the only purpose of this system-wide logging is to determine the status of transactions at the time of system failure during recovery. Since no additional log records (other than those created by the in-page logging) are created for updates, the overall space and processing overhead is no worse than that of conventional recovery systems.

In addition to the transaction log, a list of dirty pages in the buffer pool can be maintained in memory during normal processing, so that a committing transaction, or a transaction aborting on its own (not by a system failure), can quickly locate the in-memory log sectors containing the log records added by the transaction.

5.2 Transaction Commit

Most disk-based database systems adopt a no-force buffer management policy for performance reasons [13]. With a force policy, all the data pages modified by a committing transaction would have to be forced out to disk, which might often lead to random disk accesses for an increased volume of data, rather than just flushing the log tail sequentially to stable storage. With a no-force policy, only the log tail is forced to stable storage. Consequently, however, data pages resident on disk may not be current. Thus, when a system failure occurs, REDO recovery actions must be performed for committed transactions at system restart.

To adopt a no-force policy for the IPL design, the corresponding in-memory log sectors need to be written to flash memory when a transaction commits. Note that, in the basic IPL design as described in Section 3.2, an in-memory log sector is written to flash memory when it becomes full or its associated data page is evicted from the buffer pool. In addition to that, for the sake of transactional recovery, the IPL buffer manager has to force out an in-memory log sector to flash memory if it contains at least one log record of a committing transaction.
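A minimal sketch of this commit-time rule is shown below, assuming a hypothetical dirty-page list that maps each transaction to the buffer-pool pages it has modified; the class, method, and flash-interface names are illustrative stand-ins, not part of the published IPL design.

# Minimal sketch of commit processing under the IPL no-force policy.
# The dirty-page list and flash interface are hypothetical stand-ins.

from collections import defaultdict

class IPLBufferManager:
    def __init__(self, flash):
        self.flash = flash                       # persists log sectors to flash
        self.dirty_pages = defaultdict(set)      # txn id -> set of dirty page ids
        self.log_sector = {}                     # page id -> in-memory log sector

    def add_log_record(self, txn_id, page_id, record):
        self.dirty_pages[txn_id].add(page_id)
        self.log_sector.setdefault(page_id, []).append((txn_id, record))

    def commit(self, txn_id):
        # Force every in-memory log sector that holds at least one log
        # record of the committing transaction out to flash memory.
        for page_id in self.dirty_pages.pop(txn_id, set()):
            sector = self.log_sector.get(page_id)
            if sector and any(t == txn_id for t, _ in sector):
                self.flash.write_log_sector(page_id, sector)
                self.log_sector[page_id] = []    # sector is now persistent
        self.flash.append_transaction_log(("commit", txn_id))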

Unlike a log tail sequentially written to stable storage by disk-based database systems, the IPL in-memory log sectors are written to non-consecutive locations of flash memory, because they must be co-located with their corresponding data pages. The access pattern is thus expected to be random. With no mechanical latency in flash memory, however, the cost of writing the in-memory log sectors to flash memory will be just about the same as the cost of writing them sequentially, and this process will cause no substantial performance degradation at commit time.

We claim that, even with the no-force policy, the IPL design does not require explicit REDO actions for committed transactions at system restart. Rather, any necessary REDO action will be performed implicitly as part of normal database processing, because the redefined IPL read applies log records on the fly to data pages being fetched from flash memory, and all the changes made by a committed transaction are available in the log records in flash memory. In other words, under the IPL design, the materialized database [13] consists not only of data pages but also of their corresponding log records.
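The implicit REDO follows directly from the IPL read path. The sketch below is a minimal, hypothetical rendering of that path, in which the flash interface returns both the stored page image and the log records co-located in the same erase unit, and an apply() helper replays a logged change; restricting the replay to committed records is our reading of the recovery extension, not a quoted interface.

# Minimal sketch of the IPL read path that makes REDO implicit: every fetch
# re-applies the committed log records stored alongside the data page.
# The flash interface and the apply() helper are hypothetical.

def ipl_read(flash, page_id):
    """Return the current version of a page by replaying its flash-resident log."""
    page = flash.read_data_page(page_id)            # possibly stale base image
    for record in flash.read_log_records(page_id):  # co-located in the erase unit
        if record.committed:                        # replay committed changes only
            page = apply(page, record)              # redo the logged change in memory
    return page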

5.3 Transaction Abort

When an individual transaction T aborts (not due to a system failure), T's log records that still remain in the in-memory log sectors can be located via the list of dirty pages maintained in memory, removed from the in-memory log sectors, and de-applied to the corresponding data pages in the buffer pool. Some of T's log records, however, may have already been written to flash log sectors by the IPL buffer manager. To make the matter even more complicated, the IPL merge operation described in Section 3.3 creates a new version of the data pages in an erase unit by applying the log records to the previous version, and frees the old erase unit in flash memory. Since the log records that were stored in the old erase unit are abandoned once a merge is completed, it would be impossible to roll back the changes made by an aborting transaction without providing a separate logging mechanism for UNDO actions.
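For the in-memory part of an abort, the dirty-page list again provides the entry points. The sketch below reuses the earlier hypothetical buffer-manager structures and assumes an undo_apply() helper that reverses a logged change; both are illustrative assumptions.

# Minimal sketch of the in-memory part of an IPL abort: the aborting
# transaction's log records are dropped from the in-memory log sectors and
# de-applied to the cached pages. Names mirror the earlier hypothetical sketch.

def abort_transaction(mgr, txn_id, buffer_pool):
    """Roll back txn_id's in-memory changes; mgr is the IPLBufferManager sketch."""
    for page_id in mgr.dirty_pages.pop(txn_id, set()):
        sector = mgr.log_sector.get(page_id, [])
        mine = [(t, r) for t, r in sector if t == txn_id]
        mgr.log_sector[page_id] = [(t, r) for t, r in sector if t != txn_id]
        for _, record in reversed(mine):              # undo in reverse order
            buffer_pool[page_id] = undo_apply(buffer_pool[page_id], record)
    mgr.flash.append_transaction_log(("abort", txn_id))
    # Log records already flushed to flash are not de-applied here; they are
    # discarded later by the selective merge operation (Algorithm 3).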

To cope with this issue, we propose a selective merge operation instead of the regular merge, so that we can take advantage of the in-page logging and simplify the UNDO recovery for aborted transactions or transactions incomplete at the time of a system crash. The idea of the selective merge is simply to keep log records from being applied to data pages if the corresponding transactions are still active when a merge is invoked. With this selective merge, we can always roll back the changes made by uncommitted transactions just by discarding their log records, because no changes by those transactions are applied to any data page in flash memory until they commit.

When a selective merge is invoked for a particular erase unit, the IPL storage manager inspects each log record stored in the erase unit and performs a different action according to the status of the transaction responsible for the log record. If the transaction is committed, then the log record is applied to a relevant data page. If the transaction is aborted, then the log record is simply ignored. If the transaction is still active, the log record is moved to a log sector in a newly allocated erase unit. Obviously, when multiple log records are moved to a new erase unit, they are compacted into the fewest number of log sectors.

There is a concern, however, that may be raised when too many log records need to be moved to a new erase unit. For example, if all the transactions associated with the log records are still active, then none of the log records will be dropped when they are moved to a new erase unit. The problem in such a case is that the newly merged erase unit is prone to another merge in the near future due to the lack of free slots in its log sectors, which will cause additional write and erase operations.

One way of avoiding such a thrashing behavior of the selective merge is to allow an erase unit being merged to have overflow log sectors allocated in a separate erase unit. When a selective merge is triggered by an in-memory log sector being flushed to an erase unit (say E), the IPL storage manager estimates what fraction of log records would be carried over to a new erase unit. If the fraction is over a certain threshold value τ, the erase unit E remains intact, and the in-memory log sector is instead written to a flash log sector in a separate erase unit allocated as an overflow area. The algorithmic description of the selective merge operation is presented in Algorithm 3.


Algorithm 3: Selective Merge Operation

Input:  Bo: an old erase unit to merge
Output: B: a new erase unit with merged content

procedure Merge(Bo, B)
 1:  if carry-over-fraction > τ then
 2:      // the log sector is added to an overflow log area
 3:      return Bo as B
     endif
 4:  allocate a free erase unit B
 5:  for each data page p in Bo do
 6:      if any committed log record for p exists then
 7:          p′ ← apply the committed log record(s) to p
 8:          write p′ to B
 9:      else write p to B
         endif
     endfor
10:  compact and write all active log records to B
11:  erase and free Bo

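For readers who prefer an executable form, the following is a minimal Python sketch of Algorithm 3 under simplifying assumptions: an erase unit is modeled as a dictionary of pages plus a list of (page id, transaction status, record) log entries, and the apply(), unit-allocation and overflow-handling details are hypothetical stand-ins rather than the storage manager's actual interfaces.

# Minimal Python sketch of the selective merge (Algorithm 3). Data
# structures and helper names are illustrative, not the IPL implementation.

def selective_merge(old_unit, allocate_unit, apply, tau):
    """Merge old_unit into a fresh erase unit, carrying over active log records.

    old_unit.pages   : {page_id: page image}
    old_unit.log     : [(page_id, txn_status, record)], txn_status in
                       {"committed", "aborted", "active"}
    allocate_unit()  : returns an empty erase unit
    apply(page, rec) : returns the page with the logged change applied
    """
    active = [(pid, rec) for pid, st, rec in old_unit.log if st == "active"]
    carry_over_fraction = len(active) / max(len(old_unit.log), 1)
    if carry_over_fraction > tau:
        # Too many records would be carried over: leave the unit intact and
        # let the flushed log sector go to an overflow area instead (not shown).
        return old_unit

    new_unit = allocate_unit()
    for pid, page in old_unit.pages.items():
        for p, st, rec in old_unit.log:
            if p == pid and st == "committed":
                page = apply(page, rec)        # fold committed changes in
            # records of aborted transactions are simply ignored
        new_unit.pages[pid] = page
    # Compact the surviving (active) log records into the new unit's log sectors.
    new_unit.log = [(pid, "active", rec) for pid, rec in active]
    old_unit.erase()                           # free the old erase unit
    return new_unit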

With the selective merge replacing the regular merge, it is not necessary to explicitly perform UNDO actions for aborted or incomplete transactions. Rather, any necessary UNDO action will be performed implicitly as part of normal database processing, by preventing any change made by aborted or incomplete transactions from being merged into data pages. Note that the log records of aborted or incomplete transactions are not explicitly invalidated by the IPL storage manager, so as not to incur any unnecessary I/O; instead, they are dropped by selective merge operations during normal processing and eventually garbage-collected and erased.

5.4 System Restart

As described above, the IPL storage manager maintains data pages and their log records in such a way that a consistent database state with respect to all the committed transactions can always be derived from the data pages and log records. Consequently, both the REDO and UNDO actions can be performed implicitly, just as they are done during normal processing.

When the database server is recovered from a system failure, the transaction log (described in Section 5.1) is examined to determine the status of transactions at the time of the failure. For a transaction whose commit or abort record appears in the transaction log, no recovery action needs to be performed. For a transaction that was active at the time of failure, an abort record should be added to the transaction log, so that any change made by this transaction can be rolled back by the subsequent processing of the IPL storage manager.
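The restart procedure therefore reduces to a single scan of the transaction log. The sketch below assumes a hypothetical log of ("begin" | "commit" | "abort", txn_id) entries and is meant only to illustrate how in-flight transactions are marked aborted.

# Minimal sketch of IPL system restart: scan the transaction log and produce
# abort records for transactions that have no commit/abort entry. The log
# format (a list of (event, txn_id) tuples) is a hypothetical assumption.

def restart(transaction_log):
    """Return the abort records to append for transactions left in flight."""
    started, finished = set(), set()
    for event, txn_id in transaction_log:
        if event == "begin":
            started.add(txn_id)
        elif event in ("commit", "abort"):
            finished.add(txn_id)
    in_flight = started - finished
    # Their flash-resident log records will be discarded later by selective merges.
    return [("abort", txn_id) for txn_id in in_flight]

# Example:
# log = [("begin", 1), ("commit", 1), ("begin", 2)]
# print(restart(log))   # [("abort", 2)]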

6. RELATED WORK

Most commercial database systems rely on in-place updates and no-force buffer replacement. Without the no-force policy, the commit-time overhead may be high, because scattered random writes are required to propagate all the changes made by each and every committing transaction. With the no-force policy, on the other hand, whenever a transaction commits, the log tail has to be written to stable storage in order to ensure the durability of transactions. Since the log tail is always written in a sequential manner, the commit-time overhead is minimal even with magnetic disk drives and their high mechanical latency. Under the recovery mechanism supported by the IPL scheme (Section 5), when a transaction commits, its log records still cached in the buffer pool are written to the corresponding log sectors in flash memory in a scattered fashion. With no mechanical latency in flash memory, however, small random writes can be processed efficiently as long as costly erase operations are not involved [8]. Since the IPL scheme can keep the number of merge (i.e., copy and erase) operations to a minimum, the commit-time overhead is likely to remain low.

As large and cheap magnetic disks became available in the mid-1990s, the concept of a "no-overwrite storage manager" was proposed for Postgres [25]. The main idea was, instead of overwriting data on disk, to store the historical delta records of updates in addition to the original contents of the data. By taking the no-overwrite policy, it can traverse the history of changes for a data item and, more importantly, recover from database failures very quickly. In this respect, it is similar to our in-page logging. However, the no-overwrite storage manager of Postgres must force to disk at commit time all pages written by a transaction, using random I/O. Therefore, it was concluded in retrospect that the no-overwrite storage would become a viable storage option only with stable main memory (e.g., FeRAM) [26].

In the late 1990s, PicoDBMS [3] was developed for EEPROM, which was then the major storage medium for Smartcards. The main bottleneck of EEPROM is its write performance: while the read time per word is about 60 ∼ 250 ns, the write time per word is about 1 ∼ 5 ms. Since EEPROM allows overwriting, unlike flash memory, PicoDBMS was built on the update-in-place approach. If flash memory is used instead of EEPROM, PicoDBMS will suffer the same performance degradation as the traditional disk-based database servers (see Table 3). Besides, PicoDBMS makes frequent use of pointer-based data access in order to minimize the size of the database and to take advantage of the fast read accesses of EEPROM. However, the read speed of NAND flash memory is not as fast as that of EEPROM. Therefore, the intensive use of pointer-based data access will be another performance bottleneck for flash-based systems.

Finally, we would like to discuss the limitations of flash translation layers (FTLs) for database workloads. The main goal of an FTL is to minimize erase operations even for small random updates. The pattern of random writes typically dealt with by a file system is quite different from that of a database workload. Specifically, in a file system, most random writes are issued for metadata such as the FAT (file allocation table) and the i-node map, and these writes are scattered over only a very limited address space (typically less than several megabytes) [18]. Therefore, this type of random writes can be handled efficiently by the LFS-like techniques adopted by most FTLs. In contrast, the write patterns typical in database workloads are scattered randomly over a large address space (usually more than several gigabytes). Consequently, most existing FTLs are not well suited for processing database workloads.

Table 6 summarizes representatives of the previous work related to the in-page logging approach proposed in this paper. The table classifies database storage techniques with respect to the data update policy (i.e., in-place vs. no in-place) and the data access latency (i.e., with or without mechanical latency).

                      | In-place update                                     | No in-place update
Mechanical latency    | Traditional DB storage and recovery [10, 20] (Disk) | Postgres Storage [25] (Disk)
No mechanical latency | PicoDBMS [3] (EEPROM)                               | In-page logging (Flash memory)

Table 6: Classification of Database Storage Techniques

7. CONCLUSIONS

The evidence that high-density flash memory can replace magnetic disks for a wide spectrum of computing platforms is clear and present.



While multimedia applications tend to access large audio or video files sequentially, database applications tend to read and write data in small pieces in a scattered and random fashion. Due to the erase-before-write limitation of flash memory, traditional database servers designed for disk-based storage systems will suffer seriously deteriorated write performance.

To the best of our knowledge, this is the first time the opportunities and challenges posed by flash memory for the unique workload characteristics of database applications have been exposed by running a commercial database server on a flash-based storage system. The in-page logging (IPL) scheme proposed in this paper has demonstrated its potential for considerable improvement of write performance for OLTP-type applications by exploiting the advantages of flash memory, such as no mechanical latency and high read bandwidth. Besides, the IPL design can be extended to realize a lean recovery mechanism for transactions.

Acknowledgment

The authors thank Dr. Young-Hyun Bae of M-Tron Corp. for providing the company's SSD product prototype and its technical information, and Mr. Jae-Myung Kim for assisting with the experiments.

8. REFERENCES

[1] Brad Adelberg, Ben Kao, and Hector Garcia-Molina. Database Support for Efficiently Maintaining Derived Data. In the 5th International Conference on Extending Database Technology, pages 223–240, Avignon, France, March 1996.
[2] Andrew Birrell, Michael Isard, Chuck Thacker, and Ted Wobber. A Design for High-Performance Flash Disks. Technical Report MSR-TR-2005-176, Microsoft Research, December 2005.
[3] Christophe Bobineau, Luc Bouganim, Philippe Pucheral, and Patrick Valduriez. PicoDBMS: Scaling Down Database Techniques for the Smartcard. In Proceedings of the 26th VLDB Conference, pages 11–20, Cairo, Egypt, September 2000.
[4] R. Bonilla-Lucas et al. Characterization of the Data Access Behavior for TPC-C Traces. In Performance Analysis of Systems and Software, pages 115–122, 2004.
[5] Hui Dai, Michael Neufeld, and Richard Han. ELF: An Efficient Log-Structured Flash File System for Micro Sensor Nodes. In The Second International Conference on Embedded Networked Sensor Systems (SenSys'03), pages 176–187, Baltimore, MD, USA, November 2004.
[6] Fred Douglis, Ramon Caceres, Frans Kaashoek, Kai Li, Brian Marsh, and Joshua A. Tauber. Storage Alternatives for Mobile Computers. In Proceedings of the USENIX 1st Symposium on Operating Systems Design and Implementation (OSDI-94), Monterey, CA, USA, November 1994.
[7] Julian Dyke and Steve Shaw. Pro Oracle Database 10g RAC on Linux: Installation, Administration, and Performance. Apress, 2006.
[8] Eran Gal and Sivan Toledo. Algorithms and Data Structures for Flash Memories. ACM Computing Surveys, 37(2):138–163, June 2005.
[9] Goetz Graefe. Write-Optimized B-Trees. In Proceedings of the 30th VLDB Conference, pages 672–683, Toronto, Canada, September 2004.
[10] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[11] MTRON Media Experts Group. MSD-P Series Production Specification. Technical Report Version 0.7sv, MTRON Co. Ltd., October 2006.
[12] Mark Hachman. New Samsung Notebook Replaces Hard Drive With Flash. http://www.extremetech.com, May 2006.
[13] Theo Harder and Andreas Reuter. Principles of Transaction-Oriented Database Recovery. ACM Computing Surveys, 15(4):287–317, 1983.
[14] Windsor W. Hsu, Alan Jay Smith, and Honesty C. Young. I/O Reference Behavior of Production Database Workloads and the TPC Benchmarks - An Analysis at the Logical Level. ACM Transactions on Database Systems, 26(1):96–143, 2001.
[15] Atsushi Inoue and Doug Wong. NAND Flash Applications Design Guide. Technical Report Revision 2.0, Toshiba America Electronic Components, Inc., March 2004.
[16] Intel. Understanding the Flash Translation Layer (FTL) Specification. Application Note AP-684, Intel Corporation, December 1998.
[17] Gye-Jeong Kim, Seung-Cheon Baek, Hyun-Sook Lee, Han-Deok Lee, and Moon Jeung Joe. LGeDBMS: A Small DBMS for Embedded System with Flash Memory. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 1255–1258. ACM, 2006.
[18] Jesung Kim, Jong Min Kim, Sam H. Noh, Sang Lyul Min, and Yookun Cho. A Space-Efficient Flash Translation Layer for CompactFlash Systems. IEEE Transactions on Consumer Electronics, 48(2):366–375, May 2002.
[19] Katsutaka Kimura and Takashi Kobayashi. Trends in High-Density Flash Memory Technologies. In IEEE Conference on Electron Devices and Solid-State Circuits, pages 45–50, Hong Kong, December 2003.
[20] C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, and Peter M. Schwarz. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Transactions on Database Systems, 17(1):94–162, 1992.
[21] Linda Dailey Paulson. Will Hard Drives Finally Stop Shrinking? IEEE Computer, 38(5):14–16, May 2005.
[22] Mendel Rosenblum. The Design and Implementation of a Log-Structured File System. PhD thesis, UC Berkeley, 1991.
[23] Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System. In the 13th Symposium on Operating System Principles, pages 1–15, Pacific Grove, CA, September 1991.
[24] Rajkumar Sen and Krithi Ramamritham. Efficient Data Management on Lightweight Computing Devices. In Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan, April 2005.
[25] Michael Stonebraker. The Design of the Postgres Storage System. In Proceedings of the 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pages 289–300. Morgan Kaufmann, 1987.
[26] Michael Stonebraker and Greg Kemnitz. The Postgres Next Generation Database Management System. Communications of the ACM, 34(10):78–92, October 1991.
