Optimizing Write Performance for Read Optimized Databases


Jens Krueger1, Martin Grund1, Christian Tinnefeld1, Hasso Plattner1, Alexander Zeier1, and Franz Faerber2

1 Hasso-Plattner-Institut, August-Bebel-Str. 88, 14482 Potsdam, Germany

2 SAP AG, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany

Abstract. Compression in column-oriented databases has been proven to offer both performance enhancements and reductions in storage consumption. This is especially true for read access, as compressed data can be processed directly during query execution. Nevertheless, compression is disadvantageous for write access due to unavoidable re-compression: a write requires significantly more data to be read than is involved in the particular operation, more tuples may have to be modified depending on the compression algorithm, and table-level locks have to be acquired instead of row-level locks as long as no second version of the data is stored. As an effect, the duration of a single modification — both insert and update — limits both throughput and response time significantly. In this paper, we propose an additional write-optimized buffer to maintain the delta that, in conjunction with the compressed main store, represents the current state of the data. This buffer uses an uncompressed, column-oriented data structure. To address the mentioned disadvantages of data compression, we trade write performance for query performance and memory consumption by using the buffer as an intermediate storage for several modifications, which are then populated in bulk by a merge operation. Hereby, the overhead of a single re-compression is shared among all recent modifications. We evaluated our implementation inside SAP's in-memory column store. We then analyze the different parameters influencing the merge process and provide a complexity analysis. Finally, we show optimizations regarding resource consumption and merge duration.

1 Introduction

Nowadays, enterprise database systems are classified as optimized either for online transaction processing (OLTP) or for online analytical processing (OLAP). In fact, enterprise applications today are primarily focused on the day-to-day transaction processing needed to run the business, while the analytical processing necessary to understand and manage the business is added on after the fact. In contrast to this classification, single applications such as Available-To-Promise (ATP) or Demand Planning exist which cannot be assigned exclusively to one or the other workload category. These applications initiate a mixed workload in the sense that they process small sets of transactional data at a time, including write operations, as well as complex, unpredictable, mostly-read queries on large sets of data. Having a mixed workload is nothing new; the insight that it originates from a single application is. Given this, and the fact that databases are built either for OLTP or for OLAP, it is evident that no DBMS adequately addresses the characteristics needed by these complex enterprise applications. For example, within sales order processing systems, the decision whether a product can be delivered at the requested time relies on the ATP check. Its execution results in a confirmation for the sales order containing information about the product quantity and the delivery date. Consequently, the checking operation leads to a database request summing up all available resources in the context of the specific product. Materialized aggregates could be seen as one way to avoid this expensive on-the-fly aggregation. However, they fail when real-time order rescheduling is required, because incoming high-priority orders lead to a reallocation of all products. This check operation, an essential part of the ATP application, exhibits the characteristics of analytical workloads: low selectivity, low projectivity, use of aggregation functions, and read-only queries. Alongside the aforementioned check operation, the write operations that declare products as promised to customers work on a fine-granular transactional level. Looking at their characteristics, these writes clearly belong to the OLTP category; in contrast to the analytic-style operations, they come with high selectivity and high projectivity. This simplified example of a complex enterprise application shows workload characteristics that match those associated with both OLTP and OLAP. As a consequence, today's database management systems cannot fulfill the requirements of such enterprise applications, since they are optimized for one or the other category, leading to a mismatch between enterprise applications and the underlying data management layer. To compensate, enterprise applications have become increasingly complicated to make up for shortcomings in the data management infrastructure.

Besides, enterprise applications have become more sophisticated, data set sizes have grown, requirements on the freshness of input data have increased, and the time allotted for completing business processes has been reduced. At the same time, hardware has evolved; for example, the size of main memory has increased significantly and multi-core CPUs allow on-the-fly calculation. Column databases are a data management technology in which data is organized along columns, in contrast to conventional RDBMSs where it is organized along rows. Column databases are particularly attractive for analytical queries [19,3], which often require subsets of relations and aggregations across columns. This is due to the fact that analytical applications are largely attribute-focused rather than entity-focused [2], in the sense that only a small number of columns in a table might be of interest for a particular query. This allows for a model where only the required columns have to be read while the rest of the table can be ignored. This is in contrast to the row-oriented model, where all columns of a table, even those that are not necessary for the result, must be accessed due to the tuple-at-a-time processing paradigm. Reading only the necessary columns exhibits a more cache-friendly I/O pattern, for both on-disk and in-memory column stores. In the case of an on-disk column store, few disk seeks are needed to locate the page that holds the first field of a particular column; from there, data can be read in large blocks. In the case of an in-memory column store, sequentially laid out columns typically ensure a good second-level cache hit ratio, since modern memory controllers use pre-fetching mechanisms that exploit spatial locality.
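To make the access pattern concrete, the following toy Python sketch (purely illustrative, not tied to any of the systems discussed here) contrasts an aggregate over one attribute in a row layout, where every tuple is visited, with the same aggregate in a column layout, where only the required column is touched.

```python
# Toy comparison of row-wise vs. column-wise access for a single-attribute
# aggregate; illustrative only, not code from any of the cited systems.

rows = [{"id": i, "amount": i * 10, "comment": "x" * 100} for i in range(5)]

# Row-oriented access: every tuple is visited in full, including attributes
# that the query does not need.
total_row_store = sum(record["amount"] for record in rows)

# Column-oriented access: the table is kept as one list per attribute and the
# aggregate reads only the 'amount' column.
columns = {
    "id": [record["id"] for record in rows],
    "amount": [record["amount"] for record in rows],
    "comment": [record["comment"] for record in rows],
}
total_column_store = sum(columns["amount"])

print(total_row_store, total_column_store)   # 100 100 -- same result, far less data touched
```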

2 The Reason for Write-Optimized Storages

The most common light-weight compression technique for column stores is domain coding using a dictionary (see [3,19,12]). With dictionary compression, frequently appearing patterns and variable-length symbols are replaced by smaller fixed-length symbols.

Depending on the data distribution, this reduces memory consumption and at the same time improves query execution performance. The performance improvement comes from specific query plan operators that work directly on compressed data and thus increase the effective bandwidth compared to processing uncompressed data [21].
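As a concrete illustration of operating directly on compressed data, the following minimal Python sketch (hypothetical names, not the SAP BWA operators) encodes a column with a sorted dictionary and evaluates an equality predicate by comparing integer value ids only, without decompressing the column.

```python
import bisect

# Dictionary (domain) coding sketch: the column stores integer value ids and an
# equality scan compares ids only. Hypothetical example, not BWA code.

def dictionary_encode(values):
    """Build a sorted dictionary and encode the column as value ids."""
    dictionary = sorted(set(values))
    code = {value: vid for vid, value in enumerate(dictionary)}
    return dictionary, [code[value] for value in values]

def scan_equals(dictionary, encoded, predicate_value):
    """Return matching row numbers without materializing the original values."""
    pos = bisect.bisect_left(dictionary, predicate_value)
    if pos == len(dictionary) or dictionary[pos] != predicate_value:
        return []                                   # value not in the domain at all
    return [row for row, vid in enumerate(encoded) if vid == pos]

dictionary, encoded = dictionary_encode(["Beta", "Alpha", "Beta", "Delta"])
print(scan_equals(dictionary, encoded, "Beta"))     # [0, 2]
```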

The consequence is that the performance of read operations is bought at the cost of rebuilding the compression scheme as soon as a new value changes the sort order of the dictionary or breaks the currently used bit encoding scheme.

Following this assumption, inserts and updates invoked by OLTP-style applications cannot be executed in a timely manner, since re-compression may delay each query until a stable version of the data is reached, which is only valid until the next bulk of modification operations arrives.

For transactional workloads it becomes unavoidable to find a solution to this problem. A typical approach is to use a dedicated delta buffer for write operations and to later move the data from the delta buffer to the main storage [4,19]. While MonetDB/X100 uses a chunk-based approach for its delta columns, similar to the PAX approach, and C-Store uses an LSM tree optimized for disk access, our approach proposes a memory-only delta buffer with a disk-based delta log for transaction safety. Besides the delta buffer itself, the point in time at which and the way in which the delta values are merged become important. Neither of the before-mentioned approaches discusses this. In the original use cases of, e.g., C-Store, loading and updating delta values is typically done offline.

To support transactional workloads as described before and, in the same system, analytic-style queries, loading, merging, and querying must be performed online.


3 SAP BWA as Read Optimized Column Store

SAP's Netweaver Business Warehouse Accelerator (BWA) is a main-memory column-store database system. All relational tables are completely loaded into memory and stored column-wise. In order to save RAM and to improve access rates, the in-memory structures are highly compressed. This is achieved by different variants of light-weight compression (LWC) techniques such as run-length encoding or multiple versions of fixed-length encoding.

3.1 Data Representation

In SAP Netweaver BWA, the default compression mechanism is dictionary compression with bitwise encoding of columns. Here, all distinct values of a column are extracted, sorted, and inserted into a dictionary. The column itself keeps only references to the dictionary, where each reference value is stored using as many bits as needed for fixed-length encoding. Each value inside the encoded column is associated with an implicit row number — the document ID. By default, all columns are sorted based on the sort order of the key column.

In addition to the default mode, SAP BWA offers a sparse mode for each table that allows choosing the best sort column based on statistical data and sorting all other columns depending on the first one.

Internally, each column is represented using two vectors. The first vector — the dictionary — is a compressed list of all distinct values stored in the column. The second vector represents the row values of the column; as values, bit-encoded keys referring to the dictionary are used. This configuration shows that by default two lightweight compression methods are applied: domain coding and bit encoding.
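A minimal Python sketch of this two-vector layout follows (class and field names are assumptions for illustration, not the actual BWA data structures); it shows how the number of bits per value id follows from the dictionary size.

```python
import math

# Sketch of the two-vector column layout: a sorted dictionary plus a document
# vector of fixed-length, bit-encoded value ids. Names are illustrative.

class CompressedColumn:
    def __init__(self, values):
        self.dictionary = sorted(set(values))                 # vector 1: distinct values
        code = {value: vid for vid, value in enumerate(self.dictionary)}
        self.doc_vector = [code[value] for value in values]   # vector 2: value id per row
        # fixed-length bit encoding: enough bits to address every dictionary entry
        self.bits_per_value = max(1, math.ceil(math.log2(len(self.dictionary))))

    def get(self, row):                                       # row number = document id
        return self.dictionary[self.doc_vector[row]]

column = CompressedColumn(["Delta", "Charlie", "Beta", "Alpha", "Alpha"])
print(column.bits_per_value)   # 2 -- four distinct values fit into 2 bits
print(column.get(1))           # Charlie
```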

When it comes to modifications, this compressed data representation is not well suited, as shown in the following cases:

1. Update without new value - the changed value is already in the dictionary; replace the old value id with the id of the new value from the dictionary.

2. Update with new value - the changed value is not in the dictionary; the new value may change the sort order of the dictionary and force more values to be changed than only the selected one. If the cardinality of distinct values exceeds 2^n, where n is the number of bits of the old encoding, the complete document vector must be rewritten (see the sketch after this list).

3. Append without new value - append a new entry to the document vector using an existing value id from the dictionary.

4. Append with new value - append a new value, with the same complexity as an update with a new value.

5. Delete without value change - since the offset in the document vector is the implicit row number, a delete may force a complete lock of the table until all readers have finished their queries. In the case of long-running OLAP queries, this can decrease the performance of a write operation dramatically.

Fig. 1. Structure of Main and Delta Storage

6. Delete with value change - combined complexity of an update with a new value and a delete without value change.
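The following Python sketch (illustrative names only) spells out case 2: an update with a value that is not yet in the sorted dictionary shifts existing value ids, so the document vector has to be re-encoded even for rows that were never touched.

```python
# Sketch of the "update with new value" case: a value missing from the sorted
# dictionary changes existing value ids, forcing a rewrite of the whole
# document vector. Illustrative names, not BWA code.

def update_with_new_value(dictionary, doc_vector, row, new_value):
    old_values = [dictionary[vid] for vid in doc_vector]    # materialize current rows
    old_values[row] = new_value
    new_dictionary = sorted(set(old_values))                # sort order may change
    code = {value: vid for vid, value in enumerate(new_dictionary)}
    new_doc_vector = [code[value] for value in old_values]  # every row re-encoded
    return new_dictionary, new_doc_vector

dictionary = ["Alpha", "Charlie"]
doc_vector = [0, 1, 0]
print(update_with_new_value(dictionary, doc_vector, 2, "Beta"))
# (['Alpha', 'Beta', 'Charlie'], [0, 2, 1]) -- ids of untouched rows changed too
```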

Each of these operations has a different impact on the overall performance of modifications. To increase their speed, SAP BWA offers a delta buffer for modifications. Instead of using the same representation as the compressed main store, the delta buffer stores its values not in a compressed sorted list but in a CSB+ tree [16], which allows high modification rates and is optimized for main-memory systems. The values stored in the value tree are not compressed, but the document vector still applies bit encoding to save memory. Figure 1 shows the structure of a sample attribute with a main storage and a delta buffer.

To conclude, a relation stored in SAP BWA is a set of columns. Each column is associated with a main storage column and, if modifications exist, a delta buffer column. Furthermore, it is required that all columns have the same length; all inserts go directly to the delta buffer, while updates are handled as a delete operation with a subsequent insert operation on the same key. To make all modifications immediately visible at runtime, every operation has to be performed once on the main store and once on the delta buffer of the requested attribute.

Since deletions cannot be performed immediately, a valid-row bit vector is used to identify the rows that are no longer valid and must not be returned during searches. Furthermore, this bit vector is later used to identify rows that are not merged and may refer to values that are no longer needed.
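The interplay of main store, delta buffer, and valid-row bit vector can be sketched as follows (a simplified Python model with assumed names; the real system uses the compressed structures described above instead of plain lists).

```python
# Conceptual sketch: searches consult main store and delta buffer, filtered by
# the valid-row bit vector; updates are delete + insert into the delta buffer.

class Table:
    def __init__(self, main_values):
        self.main = list(main_values)        # compressed main store (simplified)
        self.delta = []                      # write-optimized delta buffer
        self.valid = [True] * len(main_values)

    def insert(self, value):
        self.delta.append(value)
        self.valid.append(True)

    def update(self, row, value):
        # updates are handled as a delete plus an insert into the delta buffer
        self.valid[row] = False
        self.insert(value)

    def scan_equals(self, value):
        combined = self.main + self.delta
        return [r for r, v in enumerate(combined) if self.valid[r] and v == value]

t = Table(["Alpha", "Beta"])
t.update(0, "Alpha")                 # old row 0 invalidated, new row appended
print(t.scan_equals("Alpha"))        # [2]
```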

4 The Merge Process

Merging the delta and the main store at some point in time is required to maintain the performance of a column store in write-intensive scenarios. The reasons for merging are twofold. On the one hand, merging the delta store into the main relation decreases memory consumption, since better compression techniques can be applied. On the other hand, merging the delta allows better search performance due to the ordered value dictionary of the main store.

The main requirement of the merge process is that it runs asynchronously, has as little impact as possible on all other operations, and does not affect the transaction behavior. To achieve this goal, the merge creates copies of the data and only seldom acquires locks. During the merge, the process consumes additional resources (CPU and main memory).

Fig. 2. Merge Process as Petri Net (overview of the merge process and merging of the key attribute)

4.1 Detailed Description of the Merge Algorithm

The merge can be separated into three phases: prepare merge, attribute merge, and commit merge. During the prepare merge phase, the following actions are executed. At first, the table is locked for all queries and a new, empty delta buffer structure called delta 2 is created. All updates and inserts are sent to this new delta structure while the original delta and the main structure are merged. In addition, a copy of the current valid document ids is created to identify the snapshot of valid documents at merge start time. As soon as these actions are finished, the table lock is released.
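A condensed Python sketch of the prepare phase is given below (lock handling and field names are assumptions for illustration); it highlights that the exclusive lock is held only while delta 2 is created and the valid document ids are snapshotted.

```python
import threading
from types import SimpleNamespace

# Prepare-merge sketch: lock, create delta 2, redirect writes, snapshot the
# valid document ids, release the lock. Names are illustrative assumptions.

def prepare_merge(table):
    with table.lock:                          # short exclusive table lock
        table.delta2 = []                     # new, empty delta buffer
        table.writes_go_to = table.delta2     # new inserts/updates target delta 2
        valid_snapshot = list(table.valid)    # snapshot of valid documents at merge start
    return valid_snapshot                     # lock released; main + delta 1 can be merged

table = SimpleNamespace(lock=threading.Lock(), delta1=["Beta"], valid=[True, False, True])
print(prepare_merge(table))                   # [True, False, True]
```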

Now, for every attribute (key attributes before all other attributes), the following process is executed. Since each attribute is compressed using a dictionary (see Section 3), at first the dictionaries of the main and the delta store must be merged to create a consistent view of all available values. Due to possible changes in value ids, a mapping table is created that maps the value ids from the old main store and the delta store to the new value ids of the merged dictionary. Applying the valid document list, the mapping table, and the value ids from the main store, the value ids from the original main vector are copied first, followed by the value ids from the delta vector. The valid document list is hereby used to identify the rows that were removed from the table. Since either updates or deletions of rows can lead to a point where a value is no longer referenced, the new document vector is searched for unreferenced values, which are then removed from the mapping list and left out during creation of the new dictionary. A simplified sketch of this attribute merge is shown below.
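The following Python sketch uses assumed names and omits the removal of unreferenced values and the bit encoding: the dictionaries are merged, mapping tables translate old value ids to new ones, and the surviving rows of main and delta are copied into the new document vector.

```python
# Simplified attribute merge: merge dictionaries, build value-id mappings,
# rewrite main and delta document vectors into the new main vector.

def merge_attribute(main_dict, main_doc, delta_dict, delta_doc, valid):
    new_dict = sorted(set(main_dict) | set(delta_dict))
    code = {value: vid for vid, value in enumerate(new_dict)}
    map_main = {vid: code[value] for vid, value in enumerate(main_dict)}
    map_delta = {vid: code[value] for vid, value in enumerate(delta_dict)}

    new_doc = []
    for row, vid in enumerate(main_doc):              # copy surviving main rows
        if valid[row]:
            new_doc.append(map_main[vid])
    for row, vid in enumerate(delta_doc):             # then append surviving delta rows
        if valid[len(main_doc) + row]:
            new_doc.append(map_delta[vid])
    return new_dict, new_doc

main_dict, main_doc = ["Alpha", "Charlie"], [0, 1, 0]
delta_dict, delta_doc = ["Beta"], [0]
valid = [True, False, True, True]                     # row 1 was deleted before the merge
print(merge_attribute(main_dict, main_doc, delta_dict, delta_doc, valid))
# (['Alpha', 'Beta', 'Charlie'], [0, 0, 1])
```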

As a last step in the attribute merge phase, the new attribute is written to secondary storage for durability in case of failures. Optimizations when writing this file are possible but are left for future research.


When the merge is finished for this attribute, either the next attribute merge starts or phase three begins.

In phase three, the relation is committed. The commit phase starts with acquiring an exclusive table lock. When the lock is acquired, the original valid document id list fetched at the beginning of the merge process is compared to the current valid document id list to identify additional deletions that happened during the merge. Rows that were deleted during the merge are marked invalid using the valid-bit vector and remain invisible until they are removed in the next merge.

As a next step, the compressed binary representations of all attributes are replaced by the new ones, the old table and delta 1 are unloaded from memory, and delta 2 is renamed to delta 1. When the unload step is done, the commit phase ends and the merge process finishes, making the new compressed version of the table available.

Queries and Lock Granularity. During merge runtime, all queries are answered from the main store, delta 1, and delta 2. The complete merge process works only with copies of the data and is thus totally independent of the original data. Since all new modifications are reflected in delta 2, main and delta 1 can be safely merged. During merge runtime, exclusive locking is required only at two points in time: first during the prepare phase, when delta 2 is created and the valid document id list is copied, and second during the commit phase.

Failure Case and Recovery. Each phase of the merge can fail independently. During the prepare phase, the critical situation is the creation of delta 2 and the locking of delta 1. The entry into and the end of the prepare phase are logged in the delta log. Recovery is performed by erasing an eventually existing delta 2 and freeing the lock. Afterwards, the delta merge process can be restarted.

If a failure occurs during the attribute merge phase (e.g. a power failure), recovery can be performed by reloading the table plus delta 1 and delta 2. All files stored on disk are erased and the merge process starts from the beginning.

In the commit phase, recovery is handled as described for the other phases. Since a table lock is acquired during the commit phase, the merge process can easily erase and unload all partial merge data and restart the process.

In addition, the restart capabilities of main-memory systems become very important. Figure 3 shows a comparison: in this experiment we compare the reload time of the fully compressed main table and the tree-structured, uncompressed delta representation. From the graphs we can derive that it is always beneficial to perform the merge instead of delaying it.

Fig. 3. Time needed to fully load the table into memory, compared between rows in the delta buffer and merged rows in the main store

4.2 Complexity Analysis of the Merge Algorithm

This section aims to identify the complexity of the different merge steps and all parameters that possibly influence the merge. The expected influence will be validated in the next section. The goal of this analysis is to answer the question of when the best possible point in time is to merge the delta values into the main document vector. As described earlier, the merge of an attribute consists of the following stages:

1. Create a value id mapping for all main and delta values: simultaneously iterate once over both value vectors and create a global mapping of both value vectors.

O(|Dict_M| + |Dict_D|)

The linear dependency for the delta values is achieved due to the linear iteration cost of the CSB+ tree.

2. Create the new document vector: iterate over the delta document vector to retrieve the highest document id; iterate over the main document vector and copy the valid value ids to the new main vector; iterate over the delta document vector and copy its values to the new main vector.

O(2 · |Doc_D| + |Doc_M| + |BV_valid|)

3. Remove unreferenced values: iterate over all documents in the new main vector to extract all referenced values, then iterate over all referenced values and compare them to the global mapping of the original delta and main value vectors.

O(|Doc_newM| + |Dict_newM|)

4. Create the new dictionary: creating the new dictionary is done in a similar way as merging the dictionaries. It reuses the sorted list created during the mapping of the main and delta values. Since this list is already sorted, no additional costs for sorting occur.

O(|Dict_M| + |Dict_D|)

5. Adjust value ids: it is possible that, due to the changes of value ids in the remove-unreferenced-values step, value ids are used that no longer match the new dictionary. These must be adjusted and set to the new values from the dictionary. This step uses the mapping from the first step and the new dictionary created one step before.

O(|M_vid| + |Doc_newM|)

Observation. The implementation of our merge process shows a very important characteristic: all operations depend linearly on the size of the delta and main storage. Even though the value storage of a delta attribute is based on a tree representation, the merge process benefits from the linear iteration over all value items.
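As an illustration of these linear terms only (not part of the evaluated system), a back-of-the-envelope cost model can sum them to estimate how expensive a merge would currently be; the weights below are hypothetical placeholders that would have to be calibrated against measurements such as those in Section 4.3.

```python
# Illustrative cost model built from the linear terms above; the weights are
# hypothetical placeholders, not measured constants.

def estimated_merge_cost(dict_main, dict_delta, doc_main, doc_delta,
                         c_dict=1.0, c_doc=1.0):
    dictionary_work = c_dict * 2 * (dict_main + dict_delta)    # steps 1 and 4
    document_work = c_doc * (doc_main + 2 * doc_delta)         # dominant part of step 2
    cleanup_work = c_doc * 2 * (doc_main + doc_delta)          # roughly steps 3 and 5
    return dictionary_work + document_work + cleanup_work

# Example: 1M rows in main, 100k rows in delta, 1k distinct values in each store
print(estimated_merge_cost(1_000, 1_000, 1_000_000, 100_000))
```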

4.3 Evaluation

To verify the complexity of the given access patterns, a test case has been devised which varies the main parameters separately to show their contribution to the merge duration. The setup is as follows: a single integer column is created in SAP BWA with 1M rows in the main document vector and 100k rows in the delta document vector, with one value in both the main and the delta dictionary, so that all rows have the same identical value. In the following tests, each of the parameters is varied over a specific range while the others remain at their base value.

Document Vector Size. Varying the main or delta document vector size results in a linear increase of the merge duration (see Figure 4). This observation confirms that the main and delta document sizes have a linear contribution to the complexity of the merge process.

Fig. 4. Evaluation Results: merge time in seconds for (a) variable main size (merge with 100k delta documents), (b) variable delta size (merge with 1M main documents), and (c) variable dictionary size (merge of 1M main and 100k delta documents)


Fig. 5. Impact on OLTP Queries on Low System Load (OLTP query time in milliseconds over elapsed time in seconds; one index server, low system utilization)

Dictionary Size. The dictionary size has been varied from its base value of one up to the maximum number of entries in the main or delta vector, so that the values in the test column become 100% distinct. The results in Figure 4 show an increase towards the fully distinct data set, which validates the assumption of a linear contribution to the merge time.

Merge Impact on Running Queries. Due to the implementation of the merge, the impact on the rest of the system is determined by resource competition and lock granularity. As described earlier, locking is performed only at the beginning and at the end using an exclusive table lock.

To measure the impact of resource competition, we created a workload that contains queries of two different kinds — OLTP and OLAP. The workload is based on real customer data taken from an SAP financials and accounting system and issues the following queries:

1. Primary key select based on accounting year, customer id, accounting area, accounting entry id, and accounting line

2. Primary key select on configuration tables

3. Select the sum of all open items grouped by customer

From the area of reporting, we extracted a query from the SAP Business Warehouse. This query calculates all open items of one customer based on business area, postal code, and accounting area, and groups them by due date into a grid of 0-29, 30-59, 60-90, and more than 90 days.

The main table containing the accounting entry line items consists of 260 million tuples with 300 columns, a disk size of 250 GB, and a compressed size of 20 GB in memory. The system used for execution is an 8-core (3.0 GHz) Intel Xeon 5450 blade with 32 GB RAM running SuSE Linux Enterprise Edition.

Figure 5 shows the query execution time over a given time span, executed against a single table with low system utilization. During the merge, the overall response time is slowed down by approximately 6%. When running the system in a high-load scenario (see Figure 6), this degradation becomes almost invisible.


Fig. 6. Impact on OLTP Queries on High System Load (OLTP query time in milliseconds over elapsed time in seconds)

Fig. 7. Impact on OLAP Queries on Low System Load (OLAP query time in milliseconds over elapsed time in seconds; two index servers)

We could show the same behavior for OLAP queries as well — see Figures 7 and 8. Our observation leads to the conclusion that, as long as enough resources are available, query execution is not affected by the merge process.

Fig. 8. Impact on OLAP Queries on High System Load (OLAP query time in milliseconds over elapsed time in seconds; two index servers)

As a result, we can identify two major cases in which executing the merge process may result in system failure. In our experiments we used system loads of up to 600% by starting up to 6 server processes and querying them simultaneously. Figure 9 shows the distribution of the load during our tests. Our assumption is that as long as one CPU has enough free resources to perform the merge, the merge process will not have a large negative impact on the other running queries.

Fig. 9. System Load Distribution (CPU load and queries per second per index server)

The other major obstacle is the memory consumption during the merge process. Currently, the complete new main storage is kept in memory until the merge is finished. Only in the last step, when the merge process acquires an exclusive table lock, is the old data unloaded and freed. Until this point, double the amount of the main plus delta storage is required.
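A rough arithmetic illustration of this peak requirement, assuming the 20 GB compressed main store from the evaluation and a hypothetical 1 GB delta:

```python
# Peak memory during the merge: old main + delta plus the newly built main of
# roughly the same size, i.e. about twice main + delta. Illustrative numbers.

def peak_merge_memory_gb(main_gb, delta_gb):
    return 2 * (main_gb + delta_gb)

print(peak_merge_memory_gb(20, 1))   # 42 GB for a 20 GB main store and a 1 GB delta
```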

5 Optimizations and Future Work

Based on our implementation and evaluation, the current merge process has a great advantage: due to its simplistic nature, it is easy to calculate when it is best to execute the merge and how long it will take. On the other hand, this does not take into account how applications are built and does not leverage current hardware trends towards multi-core systems with more than 32 cores. To address this issue, we started to analyze customer systems to identify how much data characteristics could help to improve the merge process.

As an example, we show two results from a customer system in the consumer packaged goods industry and its main tables from the financials and accounting system. Figure 10 shows the result of the distinct value analysis for one calendar year of the accounting document entries table BKPF. The figure shows only the distinct values of the first 50 of 99 attributes, and only the first 10 attributes have many distinct values. Based on source code analysis and interviews, we saw that the rest of the values are typically either a default value or a null value. In addition, most of the expected values are known in advance [14]. The merge process should leverage this knowledge.

The most common compression scheme used is domain coding, but typically the database has no common interpretation of what the domain actually is beyond the datatype used. This becomes even worse since applications tend to use very generic datatypes to be as flexible or compatible as possible. To prove our assumption, we analyzed the total number of distinct values used by the application and processed all change documents created by the system to reconstruct the dictionary status over time — change documents are used by SAP applications to record changes for legal reasons. Figure 11 shows the result of this analysis.


Fig. 10. Distinct Value Analysis for Accounting Document Entries

Fig. 11. Dictionary Fill Level per Attribute over Time

The result is that for many attributes, already after a few months, it is possible to define the complete dictionary without the need to recreate it during the merge.
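A possible way to exploit this, sketched in Python below (hypothetical, not implemented in the evaluated system), is to pre-populate an attribute's dictionary with the domain values known in advance, so that the merge only has to rebuild it when an unknown value actually arrives.

```python
# Hypothetical pre-populated dictionary: as long as no unknown value shows up,
# the merge can reuse the dictionary unchanged instead of recreating it.

class PrePopulatedDictionary:
    def __init__(self, known_domain):
        self.values = sorted(known_domain)
        self.code = {value: vid for vid, value in enumerate(self.values)}
        self.stable = True                    # merge may skip rebuilding it

    def encode(self, value):
        if value not in self.code:
            # unknown value: append it for now; the sort order (and hence the
            # dictionary) has to be rebuilt during the next merge
            self.stable = False
            self.code[value] = len(self.values)
            self.values.append(value)
        return self.code[value]

status = PrePopulatedDictionary(["cleared", "open", "parked"])
print(status.encode("open"), status.stable)       # 1 True
print(status.encode("disputed"), status.stable)   # 3 False -- rebuild needed at merge
```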

6 Related Work

Pure vertical partitioning into a “column store” has been a recent topic of interest in the literature. Copeland and Khoshafian [7] introduced the concept of the Decomposition Storage Model (DSM) as a completely vertical, attribute-wise partitioned schema. This work has been the foundation for multiple commercial and non-commercial column store implementations; popular examples are MonetDB/X100 [4], C-Store [19], and Sybase IQ [8]. Recent research has shown their ability to outperform conventional databases in read-only OLAP scenarios with low selectivity. Sybase IQ and C-Store are purely disk-based approaches to a column-store implementation. Since updates must be propagated to multiple files on disk, they are inappropriate for OLTP workloads, because updates and inserts are spread across different disk locations. Data compression also limits their applicability to OLTP scenarios with frequent updates. As described in [4], MonetDB/X100 implements dedicated delta structures to improve the performance of updates. The MonetDB/X100 storage manager treats vertical fragments as immutable objects, using a separate list for deleted tuples and uncompressed delta columns for appended data; a combination of both is used for updates. In contrast to our research, the merge process itself is not considered any further.

Another line of research focuses on mixed workloads based on conventional RDBMSs. [5,13] assume that certain queries will run longer than others and allocate resources so that class-based SLAs can be met; our target applications require that all queries execute rapidly. [9] modifies a row store to better support analytics, which favors the transactional workload. Our approach modifies a column store to support transactions, which favors the analytical workload.

[15] handles a mixed workload by placing a row store and a column store side by side in the same system. One copy of the data is stored in row format while the other is stored in column format. The row and column representations are interleaved across the two database instances to better distribute the load, as queries are routed according to their access pattern towards the optimal physical data representation. Column compression and merging were not implemented, although they should be present in a production system; the merge optimizations we propose could be applied in that case to improve the performance of the column store.

The authors of [1,20,6] describe how effective attribute-level lightweight database compression can be and how it can furthermore be leveraged for query processing in read-optimized application scenarios. Since recent improvements in CPU speed have outpaced main memory and disk access rates by orders of magnitude, the use of data compression techniques has become more and more attractive for improving the performance of database systems. These techniques are used in our research and lead to our concept of dividing the storages, since we focus on a mixed workload. Other work, such as [10] and [11], studies read-optimized databases to improve the performance of database systems in read-intensive environments.

Furthermore, there has been substantial research on delta indexing [17,18]. The concept of maintaining differential files was particularly considered for very large databases where both the delta and the main index were queried in order to provide the latest results. As in our scenario, in-place updates are too expensive and are thus collected and applied as a batch job.

7 Conclusion

In this paper we presented an implementation strategy for merging values from a delta buffer that is optimized for modification operations into a compressed, read-optimized store. Furthermore, we showed for each of the steps that the participating data stores — main, delta, and dictionary — have a linear contribution to the overall merge costs.

We validated our concept with an implementation used in the read-optimized database SAP BWA. Finally, we showed potential improvements for the merge process and how important it is to observe real-world customer data to improve this process.

References

1. D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD '06, pages 671–682, New York, NY, USA, 2006. ACM.
2. D. J. Abadi. Query Execution in Column-Oriented Database Systems. PhD thesis.
3. P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In VLDB, pages 54–65, 1999.
4. P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225–237, 2005.
5. K. P. Brown, M. Mehta, M. J. Carey, and M. Livny. Towards automated performance tuning for complex workloads. In VLDB, pages 72–84, 1994.
6. Z. Chen, J. Gehrke, and F. Korn. Query optimization in compressed database systems. In SIGMOD '01, pages 271–282, New York, NY, USA, 2001. ACM.
7. G. P. Copeland and S. Khoshafian. A decomposition storage model. In SIGMOD Conference, pages 268–279, 1985.
8. C. D. French. “One size fits all” database architectures do not work for DSS. In SIGMOD Conference, pages 449–450, 1995.
9. C. D. French. Teaching an OLTP database kernel advanced data warehousing techniques. In ICDE, pages 194–198, 1997.
10. S. Harizopoulos, V. Liang, D. J. Abadi, and S. Madden. Performance tradeoffs in read-optimized databases. In VLDB, pages 487–498, 2006.
11. A. L. Holloway and D. J. DeWitt. Read-optimized databases, in depth. PVLDB, 1(1):502–513, 2008.
12. T. Legler, W. Lehner, and A. Ross. Data mining with the SAP NetWeaver BI accelerator. In VLDB '06, pages 1059–1068. VLDB Endowment, 2006.
13. H. Pang, M. J. Carey, and M. Livny. Multiclass query scheduling in real-time database systems. IEEE Trans. Knowl. Data Eng., 7(4):533–551, 1995.
14. H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD Conference, pages 1–2, 2009.
15. R. Ramamurthy, D. J. DeWitt, and Q. Su. A case for fractured mirrors. In VLDB, pages 430–441, 2002.
16. J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In SIGMOD Conference, pages 475–486, 2000.
17. R. L. Rappaport. File structure design to facilitate on-line instantaneous updating. In SIGMOD '75, pages 1–14, New York, NY, USA, 1975. ACM.
18. D. G. Severance and G. M. Lohman. Differential files: their application to the maintenance of large databases. ACM Trans. Database Syst., 1(3):256–267, 1976.
19. M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB, pages 553–564, 2005.
20. T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. SIGMOD Rec., 29(3):55–67, 2000.
21. T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1):385–394, 2009.
