write optimization of column-store databases in out-of-core environment
TRANSCRIPT
![Page 1: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/1.jpg)
WriteOptimizationofColumn-StoreDatabasesinOut-of-CoreEnvironment
YOUNGSTOWNSTATEUNIVERSITY
Dr.Feng“George”YuAssistantProfessor
DepartmentofComputerScienceandInformationSystemsYoungstownStateUniversity
![Page 2: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/2.jpg)
Outline1. PartI:Write-Optimization2. PartII:DataCleaning3. PartIII:ApplicationonBigData
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 3: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/3.jpg)
WhatisColumn-StoreDatabase?Column-storedatabaseisalsoknownascolumnardatabaseorcolumn-oriented database
Thehistoryofcolumn-storedatabasecanbetracedbackto1970s.Notuntilabout2005whenmanyopen-sourceandcommercialimplementationsofcolumn-storedatabasestookoff.
Well-knowncolumn-storedatabases:
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 4: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/4.jpg)
FeaturesofColumn-StoreDatabasesFitswellintothewrite-once-and-read-many environment.•WorksespeciallywellforOLAPanddataminingqueries• Retrievemanyrecordsbutneedonlyafewattributes.
Higherdatacompressionrate• Lowdata-entropy•Muchbetterthanrow-basedstorage
YOUNGSTOWNSTATEUNIVERSITY
![Page 5: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/5.jpg)
Row-BasedtoColumn-Store
YOUNGSTOWNSTATEUNIVERSITY
Fig.1customerDatainRow-BasedandColumn-Store(BAT)Format
id name balance1 Alissa 100.002 Bob 200.003 Charles 300.00
(a) Row-Based Table customer
oid int101 1102 2103 3
(b) BAT customer id
oid varchar101 Alissa102 Bob103 Charles
(c) BAT customer name
oid float101 100.00102 200.00103 300.00
(d) BAT customer balance
Figure 1: customer Data in Row-Based and Column-Store (BAT) Format
much faster in a column-store database.Another featured benefit of the column-store
database is data compression, which can reach a highercompression rate and higher speed than traditionalrow-based database. One of the major reasons is thatthe information entropy in the data of one column islower compared to that of row-based data.
Optimizing write operations in a column-storedatabase has always been a challenge. The data ina column-store database is vertically decomposed intoBATs and randomly distributed over the storage. Fur-thermore, assuming an external memory storage is em-ployed, there is a non-trivial probability that a BAT istoo large to fit into one page on the storage. Therefore,the writing on a column-store database will be signif-icantly delayed by ad hoc access to large BATs acrossmultiple pages. Optimizing the write operations in acolumn-store database is a demanding request.
Existing works majorly focus on write opti-mizations in a main-memory column-store database.Krueger at el [11, 12] introduced the di�erential updateto improve the write performance in MonetDB. A spe-cial columnar data structure called delta bu�er is in-troduced to temporarily store decomposed row-basedinput data. However, to the best of our knowledge,very few works focused on optimizing the write perfor-mance on the out-of-core (OOC or external memory)column-store databases. Vertica [14], a column-storedatabase on large volume OOC storage, introducesa specially designed data storage procedure called k-safety to ensure ACID of update transactions on largevolumes of data and improve the data importation ef-ficiency. Nevertheless, k-safety focuses more on thetransaction control rather than the write performanceimprovement for high velocity update query streams.
In this research, we focus on optimizing the writeoperations (update and deletion) on an OOC column-store database. An operation called AscynchronousOut-of-Core Update was originally designed based ona new data structure called Timestamped Binary As-sociation Table.
The rest of the paper is structured as follows. Sec-tion 2 is the background introduction of column-storedatabases. Section 3 states the proposed method ofupdate and deletion optimization on the OOC column-store database. Experimental results are illustrated in
section 4. And section 5 is the conclusion and futureworks.
2 Background of the Column-StoreDatabase
The data structure of a column-store databaseexclusively uses BAT s (Binary Association Tables). ABAT is a fragment of an attribute in the original row-based storage, which usually consists of an oid (ObjectIdentifier) or ROWID, along with a column of attributevalues, which in a pair is called a BUN (Binary UNits).It is a physical model in a column-store database andthe sole bulk data structure it implements. The BAT iscategorized in a special group of storage models called‘DSM’ (Decomposed Storage Model) [4, 10].
The row-based storage data is the original user in-put data, called the front-end data or logical data. Toinput the data into a column-store database, a map-ping rule should be defined from the logical data struc-ture to the physical data structure, namely BAT.
Example 1. (From Row-Based Table to BAT) Sup-pose the table name is customer, with id as the pri-mary key.
customer(id, name, balance)Primary Key: id
The row-based data is shown in Fig. 1(a).In a columnar database, this logical table willbe decomposed into 3 BATs namely customer id,customer name, customer balance. Each BAT con-tains two columns: an oid and an attribute value col-umn with the column name as the corresponding col-umn data type.
In the example, the logical table is fully decom-posed into 3 BATs, Fig 1(b)-1(d), with each BAT con-tains one of the attributes. This is also referred to asfull vertical fragmentation [1]. Full vertical fragmenta-tion has many advantages. First of all, data accessingis e�cient for queries accessing many rows but withfewer columns involved in the query. Another advan-tage is the reduction of the workload on the CPU andmemory generated by OLAP and data mining queries,
id name balance1 Alissa 100.002 Bob 200.003 Charles 300.00
(a) Row-Based Table customer
oid int101 1102 2103 3
(b) BAT customer id
oid varchar101 Alissa102 Bob103 Charles
(c) BAT customer name
oid float101 100.00102 200.00103 300.00
(d) BAT customer balance
Figure 1: customer Data in Row-Based and Column-Store (BAT) Format
much faster in a column-store database.Another featured benefit of the column-store
database is data compression, which can reach a highercompression rate and higher speed than traditionalrow-based database. One of the major reasons is thatthe information entropy in the data of one column islower compared to that of row-based data.
Optimizing write operations in a column-storedatabase has always been a challenge. The data ina column-store database is vertically decomposed intoBATs and randomly distributed over the storage. Fur-thermore, assuming an external memory storage is em-ployed, there is a non-trivial probability that a BAT istoo large to fit into one page on the storage. Therefore,the writing on a column-store database will be signif-icantly delayed by ad hoc access to large BATs acrossmultiple pages. Optimizing the write operations in acolumn-store database is a demanding request.
Existing works majorly focus on write opti-mizations in a main-memory column-store database.Krueger at el [11, 12] introduced the di�erential updateto improve the write performance in MonetDB. A spe-cial columnar data structure called delta bu�er is in-troduced to temporarily store decomposed row-basedinput data. However, to the best of our knowledge,very few works focused on optimizing the write perfor-mance on the out-of-core (OOC or external memory)column-store databases. Vertica [14], a column-storedatabase on large volume OOC storage, introducesa specially designed data storage procedure called k-safety to ensure ACID of update transactions on largevolumes of data and improve the data importation ef-ficiency. Nevertheless, k-safety focuses more on thetransaction control rather than the write performanceimprovement for high velocity update query streams.
In this research, we focus on optimizing the writeoperations (update and deletion) on an OOC column-store database. An operation called AscynchronousOut-of-Core Update was originally designed based ona new data structure called Timestamped Binary As-sociation Table.
The rest of the paper is structured as follows. Sec-tion 2 is the background introduction of column-storedatabases. Section 3 states the proposed method ofupdate and deletion optimization on the OOC column-store database. Experimental results are illustrated in
section 4. And section 5 is the conclusion and futureworks.
2 Background of the Column-StoreDatabase
The data structure of a column-store databaseexclusively uses BAT s (Binary Association Tables). ABAT is a fragment of an attribute in the original row-based storage, which usually consists of an oid (ObjectIdentifier) or ROWID, along with a column of attributevalues, which in a pair is called a BUN (Binary UNits).It is a physical model in a column-store database andthe sole bulk data structure it implements. The BAT iscategorized in a special group of storage models called‘DSM’ (Decomposed Storage Model) [4, 10].
The row-based storage data is the original user in-put data, called the front-end data or logical data. Toinput the data into a column-store database, a map-ping rule should be defined from the logical data struc-ture to the physical data structure, namely BAT.
Example 1. (From Row-Based Table to BAT) Sup-pose the table name is customer, with id as the pri-mary key.
customer(id, name, balance)Primary Key: id
The row-based data is shown in Fig. 1(a).In a columnar database, this logical table willbe decomposed into 3 BATs namely customer id,customer name, customer balance. Each BAT con-tains two columns: an oid and an attribute value col-umn with the column name as the corresponding col-umn data type.
In the example, the logical table is fully decom-posed into 3 BATs, Fig 1(b)-1(d), with each BAT con-tains one of the attributes. This is also referred to asfull vertical fragmentation [1]. Full vertical fragmenta-tion has many advantages. First of all, data accessingis e�cient for queries accessing many rows but withfewer columns involved in the query. Another advan-tage is the reduction of the workload on the CPU andmemory generated by OLAP and data mining queries,
ABUN consistsof(oid,value)
MappingRules
RelationalData
Column-Store
![Page 6: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/6.jpg)
Challenge•Optimizingwriteoperationsinacolumn-storedatabasehasalwaysbeenachallengebecause:• Dataisvertically decomposedintoBATsandrandomlydistributedoverthestorage.• Thewritingonacolumn-storedatabasewillbesignificantlydelayedbyadhocaccesstolargeBATsacrossmultiplepages.
YOUNGSTOWNSTATEUNIVERSITY
![Page 7: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/7.jpg)
Out-Of-Core(OOC)?Existingworksmajorlyfocusonwriteoptimizationsformain-memorycolumn-storedatabase.Tothebestofourknowledge,veryfewworksfocusonoptimizingthewriteperformanceontheOut-Of-Core (OOC orexternalmemory)column-storedatabases.
YOUNGSTOWNSTATEUNIVERSITY
![Page 8: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/8.jpg)
Traditional Update on BATIntraditionalBAT,anupdatebyagivenOIDinvolvesin2phases:1. SearchthelocationinBATbyOID(Time-consuming)
2.Updatethevalueatthetargetlocation.
YOUNGSTOWNSTATEUNIVERSITY
![Page 9: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/9.jpg)
Motivation1. AvoidSearching!2. Allowmulti-valuesforagivenOID.3. Keepdataconsistent.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 10: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/10.jpg)
Timestamped BinaryAssociationTable(TBAT)
YOUNGSTOWNSTATEUNIVERSITY
oid float
101 100.00
102 200.00
103 300.00
optime oid float
time1 101 100.00
time1 102 200.00
time1 103 300.00
customer_balance customer_balance
BAT TBAT
Supposetheexistingrecordswereinsertedinonebatchattime1.
![Page 11: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/11.jpg)
TheprincipleofAOCupdateistoavoidOOCsearchingandwritingineveryeffortandtousethetimestampfieldofTBATtolabel.InAOCupdate,thenewlyupdateddatathatisdirectlyappendedtotheend ofaTBAT.Insuchamanner,wedon'thavetofrequentlyperformadhocdatasearching.
YOUNGSTOWNSTATEUNIVERSITY
![Page 12: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/12.jpg)
AOCUpdateExample
YOUNGSTOWNSTATEUNIVERSITY
Example:Uupdate queryoncustomer table:update customer set balance=201.00 where id=2Currenttimestampistime2(>time1).
ThenewestTBUNfor201.00isappendedtotheendofTBATcustomer_balance
New update ->
seeking oid and changing values are time consuming,due to the extra costs on data block seeking and writ-ing.
Based on TBAT, we propose the AsynchronousOOC Update (or AOC Update). The principle of AOCupdate is to avoid OOC seeking and writing in everye�ort and to use the timestamp field of TBAT to labelthe newly updated data that is directly appended tothe end of a TBA. In such a manner, we don’t haveto frequently perform ad hoc data seeking and writingbut finish this work after the system peak time. Lateron, a data cleaning will be performed on the TBAT tomerge duplicated tuples with same oid but di�erenttimestamps and attribute values.
3.2.1 An Example of an AOC Update
Without loss of generality we use an example of OOCupdate targeting on a single tuple. It can be easilygeneralized to any update targeting on a collection oftuples. A SQL query of an example of AOC update isshown as follows.
UPDATE customer SET balance = 201.00WHERE id = 2
The target tuple is the record with oid equals102 in the TBAT customer balance. The targetvalue is to change the attribute value from the orig-inal value to 201.00. Instead of seeking the positionto the record with oid=102, AOC update directly ap-pends at the end of the TBAT a new tuple as (time2,102, 201.00). The timestamp when AOC update isperformed is assumed to be time2, and 201.00 is thenewly updated value. The TBAT customer balanceafter the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Updateoptime oid floattime1 101 100.00time1 102 200.00time1 103 300.00time2 102 201.00
3.2.2 Cost Analysis of the AOC Update
Assuming the database keeps the final record positionin TBAT customer balance, then the only cost of theAOC update is to look for the oid and data writing onthe last position of TBAT customer balance. Seek-ing oid in TBAT customer id is inevitable and coststhe same as on BAT. However, this can be overcomeby utilizing indexes or calculating the functional re-lationship between id and oid, assuming the ids areessentially contiguous.
Updating in TBAT customer balance is muchfaster than on a traditional BAT since only one dataappending is involved. Without the help of any indexor function relationship, the complexity of one updateon BAT is O(log n), if the binary search is used ona sorted BAT. However, the complexity of one AOCupdate is O(1) when the last record position in TBATis kept by the database. In a data intensive and highvelocity data input environment, AOC updates willsave significant amounts of CPU, memory, and diskoverhead by just utilizing TBAT.
3.2.3 Selection after the AOC Update
AOC update will not hurt the data consistency on theTBAT. Straightforward proof can be given by observ-ing the selection on a TBAT after an AOC update.We continue to use the previous example and selectthe balance of customer with id=2 after the previousupdate query is executed. The selection query is asfollows.
SELECT balance FROM customer WHERE id=2
Since customer id is intact, seeking the oid ofid=2 is fast. After oid=102 is retrieved, in TBATcustomer balance, two tuples will be returned
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
As we compare the timestamps, time2 is laterthan time1. Then 201.00 is returned which is consis-tent with the last update value.
3.2.4 O�ine Data Cleaning after the AOCUpdate
After a period of time performing an AOC update,there will be many updated tuples in a TBAT with thesame oid and di�erent timestamp. Once a selectionquery is issued in the TBAT, these multiple tupleswill all be returned which will increase the selectionexecution time.
In order to make the selection queries more ef-ficient on TBAT, we propose an o�ine data cleaningprocedure involving two phases. The first phase is tosort the TBAT by oid, which will move all tuple withsame oid but di�erent timestamp together. In the sec-ond phase, for each group of tuples with the same oid,only the one with the newest timestamp is retained.
Algorithm 1 illustrates the data cleaning proce-dure, named merge update. In the first phase, we ten-tatively use merge sort to apply on the OOC storage, ofwhich the complexity is O(n log n). The second phasecan be done by sequentially read all the tuples in the
Body
Appendix
![Page 13: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/13.jpg)
SelectionafterAOCUpdateThedataconsistency willbeintactinaTBATafterAOCupdate.AftertheTBATofcustomerhasappliedAOCupdates,werunthefollowingquery:
SELECT balance FROM customer WHERE id=2IntheupdatedTBATcustomer_balance,twotupleswillbereturned:
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
Wecomparethetimestamps,time2> time1.Then201.00 isreturnedwhichisconsistentwiththelastupdatevalue.
YOUNGSTOWNSTATEUNIVERSITY
![Page 14: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/14.jpg)
AOCUpdateExperimentPreliminaryexperimentresultsaredesignedinordertocomparethespeedperformancebetweenAOCupdatesonTBATsandtraditionalupdatesonBATs.
TheexperimentisperformedonaCentOS 6.5workstationwithIntelCorei7-37003.4GHzCPU,16GBmemory,and250GBSATA7200RPMharddisk.
TheexperimenttestcodeisimplementedinPython2.7.
YOUNGSTOWNSTATEUNIVERSITY
![Page 15: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/15.jpg)
AOCUpdateExperiment(cont.)
2.27
4.71
7.13
9.59
12.01
1.63E-03 3.25E-03 4.81E-03 6.41E-03 7.95E-03 0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
10% 20% 30% 40% 50%
Ela
pase
d Ti
me
(sec
)
Update Percentage
BAT TBAT
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
AOCUpdateandTraditionalUpdateRunningTime
![Page 16: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/16.jpg)
AOCUpdateExperiment(cont.)
YOUNGSTOWNSTATEUNIVERSITY
1392.84
1449.65
1484.23 1495.56
1509.9
1320
1340
1360
1380
1400
1420
1440
1460
1480
1500
1520
10% 20% 30% 40% 50%
Tim
es F
aste
r (x1
.0)
Update Percentage
TimesofAOCUpdateFasterthanTraditionalUpdate=Time(Update on BAT)
Time(AOC Update on TBAT)
Average1466.436
timesfaster
Overheads
![Page 17: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/17.jpg)
PotentialProblemswithAOCUpdateWhenmanyAOCupdatesareperformed,searchingbecomesgraduallyslowerbecauseitsunsortedappendixrequiresalinearsearch;thegreaterthevolumeofupdateddata,theslowerthesearch.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
seeking oid and changing values are time consuming,due to the extra costs on data block seeking and writ-ing.
Based on TBAT, we propose the AsynchronousOOC Update (or AOC Update). The principle of AOCupdate is to avoid OOC seeking and writing in everye�ort and to use the timestamp field of TBAT to labelthe newly updated data that is directly appended tothe end of a TBA. In such a manner, we don’t haveto frequently perform ad hoc data seeking and writingbut finish this work after the system peak time. Lateron, a data cleaning will be performed on the TBAT tomerge duplicated tuples with same oid but di�erenttimestamps and attribute values.
3.2.1 An Example of an AOC Update
Without loss of generality we use an example of OOCupdate targeting on a single tuple. It can be easilygeneralized to any update targeting on a collection oftuples. A SQL query of an example of AOC update isshown as follows.
UPDATE customer SET balance = 201.00WHERE id = 2
The target tuple is the record with oid equals102 in the TBAT customer balance. The targetvalue is to change the attribute value from the orig-inal value to 201.00. Instead of seeking the positionto the record with oid=102, AOC update directly ap-pends at the end of the TBAT a new tuple as (time2,102, 201.00). The timestamp when AOC update isperformed is assumed to be time2, and 201.00 is thenewly updated value. The TBAT customer balanceafter the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Updateoptime oid floattime1 101 100.00time1 102 200.00time1 103 300.00time2 102 201.00
3.2.2 Cost Analysis of the AOC Update
Assuming the database keeps the final record positionin TBAT customer balance, then the only cost of theAOC update is to look for the oid and data writing onthe last position of TBAT customer balance. Seek-ing oid in TBAT customer id is inevitable and coststhe same as on BAT. However, this can be overcomeby utilizing indexes or calculating the functional re-lationship between id and oid, assuming the ids areessentially contiguous.
Updating in TBAT customer balance is muchfaster than on a traditional BAT since only one dataappending is involved. Without the help of any indexor function relationship, the complexity of one updateon BAT is O(log n), if the binary search is used ona sorted BAT. However, the complexity of one AOCupdate is O(1) when the last record position in TBATis kept by the database. In a data intensive and highvelocity data input environment, AOC updates willsave significant amounts of CPU, memory, and diskoverhead by just utilizing TBAT.
3.2.3 Selection after the AOC Update
AOC update will not hurt the data consistency on theTBAT. Straightforward proof can be given by observ-ing the selection on a TBAT after an AOC update.We continue to use the previous example and selectthe balance of customer with id=2 after the previousupdate query is executed. The selection query is asfollows.
SELECT balance FROM customer WHERE id=2
Since customer id is intact, seeking the oid ofid=2 is fast. After oid=102 is retrieved, in TBATcustomer balance, two tuples will be returned
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
As we compare the timestamps, time2 is laterthan time1. Then 201.00 is returned which is consis-tent with the last update value.
3.2.4 O�ine Data Cleaning after the AOCUpdate
After a period of time performing an AOC update,there will be many updated tuples in a TBAT with thesame oid and di�erent timestamp. Once a selectionquery is issued in the TBAT, these multiple tupleswill all be returned which will increase the selectionexecution time.
In order to make the selection queries more ef-ficient on TBAT, we propose an o�ine data cleaningprocedure involving two phases. The first phase is tosort the TBAT by oid, which will move all tuple withsame oid but di�erent timestamp together. In the sec-ond phase, for each group of tuples with the same oid,only the one with the newest timestamp is retained.
Algorithm 1 illustrates the data cleaning proce-dure, named merge update. In the first phase, we ten-tatively use merge sort to apply on the OOC storage, ofwhich the complexity is O(n log n). The second phasecan be done by sequentially read all the tuples in the
Body
Appendix
![Page 18: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/18.jpg)
SearchSpeedDegeneration
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
SelectionQueryExecutionOverhead:TBAToverBAT(time(TBAT)/time(BAT)× 100%)
![Page 19: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/19.jpg)
Outline1. PartI:Write-Optimization2. PartII:DataCleaning3. PartIII:ApplicationonBigData
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 20: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/20.jpg)
DataCleaning AfterAOCUpdateDatacleaninghasrecentlydrawnalotofattention.
Datacleaninginourcontextistheprocessbywhichwemergetheupdatedupdateddatafromtheappendixintothebody.• Removemulti-valuesofthesameOID• Avoidslowerlinearsearch
Duringnon-peaktimes,OfflineDataCleaningallowsfortheseadjustmentstobemadebymergingintothebodytherecentlyupdateddata.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 21: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/21.jpg)
ProblemswiththeOfflineDataCleaningMethodThismethodcausesthedatabasetogooffline,meaningthatanyincomingquerieswillhavetowaituntilthedatabasecomesbackonline.
Thislapseinservicemaynotbeappropriateforenvironmentsthatrequireaconstantworkload;inappropriateforconstantinput-streams.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 22: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/22.jpg)
OnlineDataCleaningThemajordifferenceofonlinedatacleaningistheemploymentofasophisticateddatastructurecalledsnapshot.Theideaoflivesnapshotrootsfromcloudcomputing.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
Body SnapshotofBody
AppendixNew
Appendix(original)
online
merge
read
read
read&write
BodyMerged
AppendixNew
DuringOnlineCleaning AfterOnlineCleaning
![Page 23: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/23.jpg)
OnlineDataCleaning(cont.)TheOnlineEagerDataCleaning(speedpriority)methodmergestheentireappendixoftheTBATsintothebodyinonegotosaveontime.
TheOnlineProgressiveDataCleaning(memory-usagepriority)methodisusedduringmoreextremecaseswhenthefullappendixmaynotfitintomemory.TheDBAmanuallydecidesablocksize,andtheappendixissplitintoseveralofthoseblocksandaddedtoanappendixqueue.Theaboveeagermethodisappliedtotheseappendixfilesandanystreamingupdates(present-time)canbeaddedtoanewsplitappendixfiletobequeuedwhenitfillsuptheblocksize.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 24: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/24.jpg)
ProgressiveData-CleaningResults
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 25: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/25.jpg)
Outline1. PartI:Write-Optimization2. PartII:DataCleaning3. PartIII:ApplicationonBigData
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 26: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/26.jpg)
UpdateonBATinMap-ReduceInaMap-Reduceenvironment,weassumetheupdatelistofOIDsarecollectedandsubmittedinabatchofUPDATE_LIST
1. Map-ReduceJoinBATLEFTOUTERJOINUPDATE_LISTONOID=>(BATcombinedwithUPDATE_LIST)• Map-sidejoin:whenUPDATE_LISTissmallenoughtofitinto
memory• Reduce-sidejoin:whenUPDATE_LISTislargeenough
2. SelectiveProjection(Map-Only)FOReachrecordin(BATcombineUPDATE_LIST)
IFUPDATE_LISTattributeisnotNULL:outputupdatedvalue(keepthemostrecentupdate)
ELSE:outputoriginalvalue
YOUNGSTOWNSTATEUNIVERSITY
![Page 27: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/27.jpg)
TBAT(Timestamped BAT)TBATin HDFS:struct TBUN{
TIMESTAMP optime,ROWID oid,USER_DEFINED_TYPE attrv
}struct TBAT_slip{
TBUN[max_size_per_HDFS_slip] tbuns}
• Noneedforanyglobalpre-sortingorindexing• ‘attrv’iscanbeanyuserdefinedtypethatflexiblydefinearbitrarykindsofschema
YOUNGSTOWNSTATEUNIVERSITY
![Page 28: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/28.jpg)
AMOUpdate(logical)
YOUNGSTOWNSTATEUNIVERSITY
Example:Updatequeryoncustomer table:
update customer set balance=201.00 where id=2Currenttimestampistime2(>time1).
ThenewestTBUNfor201.00isappendedtotheendofTBATcustomer_balance
seeking oid and changing values are time consuming,due to the extra costs on data block seeking and writ-ing.
Based on TBAT, we propose the AsynchronousOOC Update (or AOC Update). The principle of AOCupdate is to avoid OOC seeking and writing in everye�ort and to use the timestamp field of TBAT to labelthe newly updated data that is directly appended tothe end of a TBA. In such a manner, we don’t haveto frequently perform ad hoc data seeking and writingbut finish this work after the system peak time. Lateron, a data cleaning will be performed on the TBAT tomerge duplicated tuples with same oid but di�erenttimestamps and attribute values.
3.2.1 An Example of an AOC Update
Without loss of generality we use an example of OOCupdate targeting on a single tuple. It can be easilygeneralized to any update targeting on a collection oftuples. A SQL query of an example of AOC update isshown as follows.
UPDATE customer SET balance = 201.00WHERE id = 2
The target tuple is the record with oid equals102 in the TBAT customer balance. The targetvalue is to change the attribute value from the orig-inal value to 201.00. Instead of seeking the positionto the record with oid=102, AOC update directly ap-pends at the end of the TBAT a new tuple as (time2,102, 201.00). The timestamp when AOC update isperformed is assumed to be time2, and 201.00 is thenewly updated value. The TBAT customer balanceafter the AOC update is illustrated in Table 3.
Table 3: TBAT customer balance after AOC Updateoptime oid floattime1 101 100.00time1 102 200.00time1 103 300.00time2 102 201.00
3.2.2 Cost Analysis of the AOC Update
Assuming the database keeps the final record positionin TBAT customer balance, then the only cost of theAOC update is to look for the oid and data writing onthe last position of TBAT customer balance. Seek-ing oid in TBAT customer id is inevitable and coststhe same as on BAT. However, this can be overcomeby utilizing indexes or calculating the functional re-lationship between id and oid, assuming the ids areessentially contiguous.
Updating in TBAT customer balance is muchfaster than on a traditional BAT since only one dataappending is involved. Without the help of any indexor function relationship, the complexity of one updateon BAT is O(log n), if the binary search is used ona sorted BAT. However, the complexity of one AOCupdate is O(1) when the last record position in TBATis kept by the database. In a data intensive and highvelocity data input environment, AOC updates willsave significant amounts of CPU, memory, and diskoverhead by just utilizing TBAT.
3.2.3 Selection after the AOC Update
AOC update will not hurt the data consistency on theTBAT. Straightforward proof can be given by observ-ing the selection on a TBAT after an AOC update.We continue to use the previous example and selectthe balance of customer with id=2 after the previousupdate query is executed. The selection query is asfollows.
SELECT balance FROM customer WHERE id=2
Since customer id is intact, seeking the oid ofid=2 is fast. After oid=102 is retrieved, in TBATcustomer balance, two tuples will be returned
t1=(time1, 102, 200.00)t2=(time2, 102, 201.00)
As we compare the timestamps, time2 is laterthan time1. Then 201.00 is returned which is consis-tent with the last update value.
3.2.4 O�ine Data Cleaning after the AOCUpdate
After a period of time performing an AOC update,there will be many updated tuples in a TBAT with thesame oid and di�erent timestamp. Once a selectionquery is issued in the TBAT, these multiple tupleswill all be returned which will increase the selectionexecution time.
In order to make the selection queries more ef-ficient on TBAT, we propose an o�ine data cleaningprocedure involving two phases. The first phase is tosort the TBAT by oid, which will move all tuple withsame oid but di�erent timestamp together. In the sec-ond phase, for each group of tuples with the same oid,only the one with the newest timestamp is retained.
Algorithm 1 illustrates the data cleaning proce-dure, named merge update. In the first phase, we ten-tatively use merge sort to apply on the OOC storage, ofwhich the complexity is O(n log n). The second phasecan be done by sequentially read all the tuples in the
NewData
OldData
![Page 29: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/29.jpg)
AMOUpdateExperimentPerformedonaClouderaDistributedHadoop(CDH)cluster• 1 masterand3slaves• TotalHDFScapacity=310GB(blocksize=64MB)• InterconnectionisGigabitEthernet
Datasets:1GBand10GBrandomsyntheticdatainBATandTBAT.
Updatequeries:from10%to30%oftheoriginaldata.
YOUNGSTOWNSTATEUNIVERSITY
![Page 30: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/30.jpg)
AMOUpdateExperiment(cont.)
YOUNGSTOWNSTATEUNIVERSITY
1GBUpdateRunningTime
0 50
100 150 200 250 300 350 400 450 500
10 15 20 25 30
Run
ning
Tim
e (s
ec)
Update Percentage (%)
BAT TBAT
![Page 31: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/31.jpg)
YOUNGSTOWNSTATEUNIVERSITY
10GBUpdateRunningTime
0 500
1000 1500 2000 2500 3000 3500 4000 4500 5000
10 15 20 25 30
Run
ning
Tim
e (s
ec)
Update Percentage (%)
BAT TBAT
AMOUpdateExperiment(cont.)
![Page 32: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/32.jpg)
YOUNGSTOWNSTATEUNIVERSITY
Relative OverheadChangingoverDataSets
0 20 40 60 80
100 120 140 160 180
10 15 20 25 30
Ove
rhea
d (%
)
Update Percentage (%)
1GB 10GB
AMOUpdateExperiment(cont.)
![Page 33: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/33.jpg)
GeneralizedWriteOptimizationFramework(DEXA’15)
YOUNGSTOWNSTATEUNIVERSITY
AtomicBuffer
(ReadIn)
AtomicBuffer
AtomicBuffer
AtomicBuffer
AtomicBuffer
ReadOptimizedData
SerializedTBAT_1
SerializedTBAT_2
SerializedTBAT_N
InputStream
AtomicBuffer(Full)
AtomicBuffer(Full)
AtomicBuffer(Full)
…
WriteQueue
Buffe
rPoo
l
…
WriteOptim
izedMod
ule
ReadOptim
izedMod
ule
![Page 34: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/34.jpg)
Publications1. HasteningDataRetrievalonOut-of-CoreColumn-StoreDatabasesusingOffsetB+-Tree
F.Yu,E.S.Jones28thInternationalConferenceonComputerApplicationsinIndustryandEngineering(CAINE2015),October12-14,2015,HiltonSanDiego/HarborIsland,SanDiego,California,USA,pp.313-318
2. AFrameworkofWriteOptimizationonRead-OptimizedOut-of-CoreColumn-StoreDatabasesF.Yu,W.-C.Hou26thInternationalConferenceonDatabaseandExpertSystemsApplications(DEXA2015),Valencia,Spain,September1-4,2015,pp.155-169
3. WriteOptimizationusingAsynchronousUpdateonOut-of-CoreColumn-StoreDatabasesinMap-ReduceF.Yu,E.S.Jones,W.-C.Hou2015IEEEInternationalCongressonBigData,June27- July2,2015,NewYork,USA,pp.720-723
4. OnlineDataCleaningforOut-Of-CoreColumn-StoreDatabaseswithTimestamped BinaryAssociationTablesF.Yu,C.Luo,W.-C.Hou,E.S.JonesProceedingof30thInternationalConferenceOnComputersAndTheirApplications(CATA2015),Honolulu,Hawaii,USA,March9-11,2015,pp.407-412
5. AsynchronousUpdateonOut-of-CoreColumn-StoreDatabasesUtilizingtheTimestampedBinaryAssociationTableF.Yu,C.Luo,W.-C.Hou,E.S.JonesProceedingof27thInternationalConferenceon.ComputerApplicationsinIndustryandEngineering(CAINE2014),NewOrleans,Louisiana,LA,October13-15,2014,pp.215-220.
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 35: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/35.jpg)
SourceCodehttps://github.com/YSU-Data-Lab/TBAT-DEXA15
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 36: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/36.jpg)
NewChallenges•NewIndexonC-SDBs• Localandglobal• Searching• DataCleaning• ParallelProcessing
•BigData• Searching• DataCleaning• AutoMapping• ToIndexornottoindex?
•BroaderApplications• ScientificsDataManagement• BigDataAnalytics• MachineLearning• OLAP• OLTP• HPC• HTC
YOUNGSTOWNSTATEUNIVERSITY,OH,USA
![Page 37: Write Optimization of Column-Store Databases in Out-of-Core Environment](https://reader035.vdocuments.us/reader035/viewer/2022062522/58cfbace1a28ab223a8b601b/html5/thumbnails/37.jpg)
Thankyou!Feng“George”Yu
ComputerScienceandInformationSystemsYoungstownStateUniversity,Youngstown,OH
YOUNGSTOWNSTATEUNIVERSITY