
Data & Knowledge Engineering 69 (2010) 383–398


Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling

Carlos Ordonez*, Sasi K. Pitchaimalai
University of Houston, Houston, TX 77204, USA

Article info

Article history: Received 10 June 2009; Received in revised form 9 December 2009; Accepted 10 December 2009; Available online 24 December 2009.

Keywords: Approximation; DBMS; Sampling; Statistical model; Sufficient statistics

0169-023X/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.datak.2009.12.001

* Corresponding author. E-mail address: [email protected] (C. Ordonez).

Abstract

User-Defined Functions (UDFs) represent an extensibility mechanism provided by most DBMSs, whose execution happens in main memory. Also, UDFs leverage the DBMS multi-threaded capabilities and exploit the C language speed and flexibility for mathematical computations. In this article, we study how to accelerate the computation of sufficient statistics on large data sets with UDFs exploiting caching and sampling techniques. We present an aggregate UDF computing multidimensional sufficient statistics that benefit a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory. Otherwise, sampling is required to accelerate processing of very large data sets. Also, sampling can be applied on data sets that can be cached, to further accelerate processing. Experiments carefully analyze performance and accuracy with real and synthetic data sets. We compare UDFs working inside the DBMS and C++ reading flat files, running on the same hardware. We show UDFs can have similar performance to C++ when both exploit caching and multi-threading. As expected, C++ is much faster than UDFs when the data set is scanned from disk. We carefully analyze the case where sampling is required with larger data sets. We show geometric and bootstrapping sampling techniques can be faster than performing full table scans, providing highly accurate estimation of the mean, variance and correlation. Even further, sampling on cached data sets can provide accurate answers in a few seconds. Detailed experiments illustrate UDF optimizations including diagonal matrix computation, join avoidance and acceleration with a multi-core CPU, when available. A profile of UDF run-time execution shows the UDF is slowed down by I/O when reading from disk.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Efficiently computing statistical models on large data sets remains an important problem. Data mining research has introduced fast algorithms and efficient techniques [3,19,20,42] that work outside the DBMS on flat files [33]. Processing outside a DBMS with a systems language (e.g. C++) allows flexibility in developing optimizations (reducing disk access and pushing most processing into main memory). Unfortunately, this approach forgoes the extensive functionality the DBMS offers the end user (querying, security, fault tolerance, sharing data). Integrating data mining techniques with a DBMS [29,19,34] is a problem that has received scant attention due to the mathematical nature of their computations, DBMS software complexity and the comprehensive set of techniques available in statistical packages [21]. Thus in a modern database environment, users generally export data sets to a data mining tool and perform most or all of the analysis outside the DBMS.



SQL has been used as a mechanism to integrate data mining algorithms [34] since it is the standard language in relational DBMSs, but unfortunately it is generally slow and has limitations for performing complex matrix computations. In this work, we show aggregate UDFs combined with caching are an outstanding alternative for high-performance data mining.

UDFs are a standard Application Programming Interface (API) available in modern DBMSs [13,14]. In general, UDFs are developed in the C language (or a similar language), compiled to object code and efficiently executed inside the DBMS like any other SQL function. Thus UDFs represent an outstanding alternative to extend a DBMS with statistical models, exploiting the C language flexibility and speed. Therefore, there is no need to change SQL syntax with new data mining primitives or clauses, making UDF implementation and usage easier [39]. The UDF studied in this article can be programmed on any DBMS supporting aggregate UDFs. Since our research is based on exploiting UDFs as available in a modern DBMS, we assume the DBMS provides basic SQL syntax and programming features to compile and run UDFs. Also, we assume the DBMS source code is not available and thus it cannot be extended. We provide sufficient technical details to program UDFs on any DBMS. Most research on exploiting hardware has focused on faster processors (e.g. multi-core, graphics processors [6]), which is a minor aspect considered in this work. We believe main memory is an aspect that needs more study, especially today. Performing data mining in main memory is important due to the following reasons, among others. Data mining is an iterative process: the same data set can be analyzed multiple times. Data mining algorithms generally require table scans: they do not need random access. Main memory based on 32-bit addressing can hold up to 4 GB, which is enough for fairly large data sets (millions of records). On the other hand, sampling [10,21] is a practical mechanism used to accelerate statistical analysis on large data sets. In our case, sampling is used to accelerate processing when the data set cannot fit in main memory. Unfortunately, sampling introduces error in estimations. Therefore, it is crucial to understand the tradeoff between speed and accuracy. To the best of our knowledge, we are the first to consider accelerating the computation of sufficient statistics by combining UDFs, sampling and caching.

The article is organized as follows. Section 2 provides definitions and an overview of UDFs. Section 3 explains how aggregate UDFs can efficiently compute sufficient statistics that benefit several models. Section 4 presents experiments comparing the UDF and C++, evaluating UDF optimizations and profiling the UDF at run-time. Section 5 discusses related work. Conclusions are discussed in Section 6.

2. Preliminaries

2.1. Definitions

The article focuses on the computation of multidimensional (multivariate) statistical models on a d-dimensional data set. Let X = {x_1, ..., x_n} be a data set having n points, where each point has d dimensions; X is equivalent to a d × n matrix, where x_i represents a column vector. To avoid confusion we use i = 1...n as a subscript for points and h, a, b as dimension subscripts. The T superscript indicates matrix transposition, which is generally used to make matrices compatible for multiplication. To simplify notation, Σ without a subscript means the sum is computed over all rows i = 1...n.

The data set X is stored in the DBMS as a table with an extra column i identifying the point (e.g. customer id), which is not used for statistical purposes. We consider a standard horizontal layout for X, represented by table XH, defined as XH(i, X_1, X_2, ..., X_d) with primary key i. When there is a predicted numeric dimension Y (e.g. linear regression), it can be treated as an additional dimension in X. On the other hand, to avoid issues with high d in SQL queries we use a pivoted version of X with a vertical layout represented by table XV(i, h, val). For instance, a 2-dimensional point x_1 = (5, 7) is stored as the single row (1, 5, 7) in XH and as the two rows (1, 1, 5) and (1, 2, 7) in XV.

2.2. User-defined functions

UDFs are programmed in a high-level programming language (like C or C++) and can be called in a "SELECT" statement, like any other SQL function. There are two classes of UDFs: (1) scalar UDFs, which take one or more parameter values and return a single value, producing one value for each input row; (2) aggregate UDFs, which work like standard SQL aggregate functions. They return one row for each distinct grouping of column value combinations and a column with some aggregation (e.g. "sum()"). If there are no grouping columns then they return one row.

UDFs provide several important advantages. There is no need to modify internal DBMS code, which allows end-users to extend the DBMS with data mining functionality. UDFs are programmed in a traditional programming language and once compiled they can be used in any "SELECT" statement, like other SQL functions. The UDF source code can exploit the flexibility and speed of the programming language. Most importantly, UDFs work in main memory; this is a fundamental feature to reduce disk I/O and run-time. Notice a UDF gets its input from a physical table scan, but such scan comes from evaluating the query calling the UDF. UDFs are automatically executed in parallel exploiting the multi-threaded capabilities of the DBMS. This is an advantage, but also a constraint because code must be developed accordingly. On the other hand, UDFs have important constraints and limitations. UDFs cannot perform any I/O operation, which is a constraint to protect internal storage. UDFs generally return one value of a simple data type. In other words, they cannot return a set of values or a matrix. Currently, UDF parameters in most DBMSs can be only of simple types (e.g. numbers or strings); array support is limited or not available. Scalar functions cannot keep values in main memory from row to row, which means the function can only keep temporary variables in stack memory. In contrast, aggregate functions can keep aggregated values in heap memory from row to row. UDFs cannot access memory outside their allocated heap memory or stack memory. The amount of memory that can be allocated is low, compared to available memory. Finally, UDFs are not portable since their code depends on the internal DBMS architecture.

3. Fast computation of sufficient statistics with UDFs

3.1. Overview of models

Past research has shown sufficient statistics summarize properties of a large data set [5,17,30] and can reduce the number of times it has to be scanned. We introduce the following two matrices that are fundamental sufficient statistics for several data mining techniques [33,30], including PCA [8], K-means clustering [12], Naive Bayes [25,38] and linear regression [21]. It is beyond the scope of this article to explain the mathematical details of how sufficient statistics apply to each model. We focus instead on their efficient computation inside a modern DBMS.

3.2. Multidimensional sufficient statistics

Let L be the linear sum of points, in the sense that each point is taken at power 1. L is a d × 1 matrix, shown below with sum and column-vector notation:

L = \sum_{i=1}^{n} x_i, \qquad (1)

which is equivalent to

L = \begin{bmatrix} \sum X_1 \\ \sum X_2 \\ \vdots \\ \sum X_d \end{bmatrix}.

Let Q be the quadratic sum of points, in the sense that each point is squared with a cross-product. Q is d × d:

Q = X X^T = \sum_{i=1}^{n} x_i x_i^T. \qquad (2)

Matrix Q has sums of squares in the diagonal and sums of cross-products off the diagonal.

Q = \begin{bmatrix}
\sum X_1^2     & \sum X_1 X_2 & \cdots & \sum X_1 X_d \\
\sum X_2 X_1   & \sum X_2^2   & \cdots & \sum X_2 X_d \\
\vdots         &              & \ddots & \vdots       \\
\sum X_d X_1   & \sum X_d X_2 & \cdots & \sum X_d^2
\end{bmatrix}

The most important property of L and Q is that they are much smaller than X when n is large (i.e. d ≪ n). However, L and Q summarize essential properties of X that can be exploited by statistical techniques. Therefore, the basic usage of L and Q is that we can substitute every sum Σ x_i with L and every matrix product X X^T with Q. In other words, L and Q are exploited to get equivalent mathematical equations that do not refer to X. Therefore, such equations will depend on d, and not on n (which in general is the performance challenge). Our goal will be to study how to compute such sufficient statistics as efficiently as possible with a UDF scanning the data set once.
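To make the use of Eqs. (1) and (2) concrete, the following minimal C sketch (our own, standalone and outside any DBMS or UDF API) computes n, L and the lower triangle of Q in a single pass over a small in-memory data set; the data values and variable names are illustrative only.

#include <stdio.h>

#define D 3  /* number of dimensions (illustrative) */

int main(void) {
  /* toy data set: n = 4 points, d = 3 dimensions, stored row by row */
  double X[4][D] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 0, 1} };
  double n = 0.0, L[D] = {0}, Q[D][D] = {{0}};
  for (int i = 0; i < 4; i++) {               /* one pass over the points */
    n += 1.0;
    for (int a = 0; a < D; a++) {
      L[a] += X[i][a];                        /* Eq. (1): linear sum */
      for (int b = 0; b <= a; b++)
        Q[a][b] += X[i][a] * X[i][b];         /* Eq. (2): lower-triangular cross-products */
    }
  }
  printf("n=%.0f  L=(%.1f, %.1f, %.1f)  Q[1][0]=%.1f\n", n, L[0], L[1], L[2], Q[1][0]);
  return 0;
}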

3.2.1. Computing models based on sufficient statistics

All statistical models considered in this article are based on computing the mean vector μ, the covariance matrix Σ and the correlation matrix ρ on data set X as follows:

\mu = \frac{1}{n} L. \qquad (3)

The covariance matrix Σ is given by

\Sigma = \frac{1}{n} Q - \frac{1}{n^2} L L^T. \qquad (4)

Depending on the model, Σ can be assumed to be a diagonal matrix (K-means clustering, EM clustering, Naive Bayes) or a full matrix (linear regression, PCA). Diagonal matrices are used when dimensions are assumed independent (i.e. covariances are assumed zero). Ref. [21] has an excellent overview of all models. Consequently, we consider a diagonal Q or non-diagonal Q, depending on whether the model assumes independence or not. Finally, the correlation matrix is derived by standardizing covariances, dividing by standard deviations. Each entry of the correlation matrix ρ is given by:

\rho_{ab} = \frac{\Sigma_{ab}}{\sigma_a \sigma_b}, \qquad (5)

where Σ_ab is the covariance between variables X_a, X_b and σ_a, σ_b are their respective standard deviations. These three equations are generalized to compute statistics on k different subsets of X, as is the case for K-means and EM clustering [21].
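As an illustration of Eqs. (3)-(5), the following C sketch (ours, not part of the UDF code) derives μ, Σ and ρ from already-computed sufficient statistics for a small d; it assumes a full (non-diagonal) Q and strictly positive variances, and the toy statistics correspond to the four points (1,1), (2,1), (3,2), (4,2).

#include <math.h>
#include <stdio.h>

#define D 2

int main(void) {
  /* sufficient statistics of the 4 points (1,1), (2,1), (3,2), (4,2) */
  double n = 4.0;
  double L[D] = {10.0, 6.0};
  double Q[D][D] = { {30.0, 17.0}, {17.0, 10.0} };
  double mu[D], Sigma[D][D], rho[D][D];

  for (int a = 0; a < D; a++)
    mu[a] = L[a] / n;                                  /* Eq. (3) */
  for (int a = 0; a < D; a++)
    for (int b = 0; b < D; b++)
      Sigma[a][b] = Q[a][b] / n - L[a] * L[b] / (n * n);  /* Eq. (4) */
  for (int a = 0; a < D; a++)
    for (int b = 0; b < D; b++)
      rho[a][b] = Sigma[a][b] / (sqrt(Sigma[a][a]) * sqrt(Sigma[b][b])); /* Eq. (5) */

  printf("mu=(%.2f, %.2f)  rho01=%.3f\n", mu[0], mu[1], rho[0][1]);
  return 0;
}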

3.2.2. Properties of sufficient statistics

We list important properties of the sufficient statistics n, L, Q that are important to achieve fast computation with UDFs and to combine sufficient statistics from different samples. These properties are a generalization of the properties of sufficient statistics for clustering presented in [5,42], but we illustrate their relevance in the context of UDF processing. The main property is the distributive [18] nature of n, L, Q, which can be considered a generalization of the sum() aggregation. Notice these properties do not hold for other statistics like the mode or quantiles, which are used more for exploration or testing hypotheses rather than computing models.

1. Storage space for n, L, Q is O(d²). Therefore, it is independent from n.
2. n, L, Q are distributive [18]. That is, given two subsets of X denoted by S and T, we can get n, L, Q for X as follows: n = n_S + n_T, L = L_S + L_T, Q = Q_S + Q_T. Notice this result states that cross-products are also distributive.
3. Given a partition of X into k subsets we can get sufficient statistics for each of the k subsets in one pass, assuming aggregations accept GROUP BY.
4. Matrix Q is symmetric and can be diagonal when dimensions are assumed independent.

These properties have a practical application as follows:

1. Assuming d is low, storage space is low as well, despite being O(d²). Such a small footprint enables maintaining sufficient statistics n, L, Q in main memory at all times.
2. Assuming S and T are analyzed by separate threads (or processing nodes), the distributive property enables fully parallel processing with a final "merge" step to add all individual matrices per partition into one global set of sufficient statistics. This aspect will be explained in detail below.
3. The sufficient statistics from several sample sets S_1, S_2, ..., where S_j ⊆ X, can be combined into one.
4. Each iteration of a partition clustering algorithm like K-means or EM [5] can be done in one pass. Even further, it becomes feasible to derive incremental algorithms [5].

3.3. Alternatives to integrate statistical models with a DBMS

We discuss four alternatives to evaluate matrix equations taking into account that X is stored inside a DBMS: (1) performing no matrix operations inside the DBMS, exporting the data set to an external statistical or data mining tool; (2) integrating all matrix operations inside the DBMS, modifying its source code, in general written in the C language; (3) performing all matrix computations only with SQL queries, manipulating matrices as relational tables; (4) computing matrix equations involving the data set inside the DBMS with UDFs.

Alternative (1) gives great flexibility to the user to analyze the data set outside the DBMS with any language or statistical package. The drawbacks are the time to export the data set, lack of flexibility to create multiple subsets of it (e.g. selecting records), lack of functionality to manage data sets and models, the potentially lower processing power of a workstation compared to a fast database server and compromised data security. Alternative (2) represents the "ideal" scenario, where all matrix computations are done inside the database system. But this is not a feasible or good alternative due to lack of access to the internal DBMS source code, the need to understand the internal DBMS architecture, the possibility of introducing memory leaks with array computations and the availability of many statistical and machine learning libraries and tools that can easily work outside the DBMS (e.g. on files outside the DBMS). Alternative (3) requires generating SQL code and exploits DBMS functionality. However, since SQL does not provide advanced manipulation of multidimensional arrays, matrix operations can be difficult to express as efficient SQL queries, especially if joins are required. Finally, alternative (4) allows programming matrix computations with UDFs exploiting arrays and the C control statements, enabling great efficiency and flexibility. For instance, inverting a matrix, evaluating a long expression involving many matrices, implementing a Newton-Raphson method and computing a singular value decomposition (SVD) are difficult to program in SQL. Instead, matrices L and Q are used to evaluate complex, but more efficient and equivalent, matrix expressions as explained in Section 3.2. Such matrix expressions can be evaluated by a software system different from the DBMS, which can reside on the database server itself or on a workstation. Therefore, we focus on alternative (4), computing n, L and Q for a large data set X with SQL queries and UDFs. The remaining matrix equations, explained above, can be easily and efficiently computed with a mathematical library inside or outside the DBMS.


3.3.1. Computing sufficient statistics with SQL queries

We assume X is stored on a vertical layout represented by table XV. The queries presented below work under the assumption that some dimension values are zero. When Q is a diagonal matrix (e.g. for clustering) the query for L can also compute Q by squaring values. Otherwise, a self-join with XV is required. These queries are less efficient than queries working on a horizontal layout of X, but they do not have any limitation on d. The three SQL queries to get n, L, Q, assuming Q is non-diagonal, are as follows:

SELECT cast(count(distinct i) AS double) /* n */
FROM XV;

SELECT h, sum(val) /* L_h */
FROM XV GROUP BY h;

SELECT T1.h AS a, T2.h AS b,
       sum(T1.val*T2.val) AS Q /* Q_ab */
FROM XV T1 JOIN XV T2 ON T1.i = T2.i
WHERE a >= b GROUP BY a, b;

3.4. Aggregate UDF for a horizontal layout of data set

We now explain how to efficiently compute the sufficient statistics n, L, Q with an aggregate UDF assuming a horizontal layout for X. We assume points in X appear in a random order and computations are performed in double precision variables to avoid numerical issues.

3.4.1. Aggregate UDF definition in SQL

The SQL definition specifies the call interface with parameters being passed at run-time and the value being returned. The aggregate UDF takes as a parameter the type of Q matrix being computed, diagonal or triangular, to perform the minimum number of operations required. UDFs cannot directly accept arrays as parameters or return arrays. To solve the array parameter limitation, x_i is passed as a list of values to the UDF (currently d ≤ 64 due to DBMS architecture constraints). Some DBMSs provide limited array support through user-defined types (UDTs); our ideas can be easily extended with UDTs.

REPLACE FUNCTION udf_nLQ_triang(
  d INTEGER,
  X_1 FLOAT, ..., X_d FLOAT
)
RETURNS CHAR(36000)
CLASS AGGREGATE (52000)

The maximum amount of heap memory allocated is specified in the UDF definition with the "CLASS AGGREGATE" clause. The amount of memory required by the "big" output value is also specified here.

3.4.2. Aggregate UDF variable storage

Following the single table scan approach of the SQL query, the aggregate UDF also computes n, L and Q in one pass. A C "struct" record is defined to store n, L and Q, which is allocated in main memory in each processing thread. When Q is diagonal a one-dimensional array is sufficient and more efficient, whereas when Q is triangular a two-dimensional array is required. Therefore, we propose to create two versions of the UDF depending on Q; this saves memory and time. X is horizontally partitioned so that each thread can work in parallel. Notice n is double precision and the UDF has a fixed maximum d because the UDF memory allocation is static (the UDF needs to be compiled before being called).

typedef struct {
  int d;
  double n;
  double L[MAX_d];
  double Q[MAX_d][MAX_d]; /* Q triangular */
} UDF_nLQ_storage;

3.4.3. Aggregate UDF run-time execution

We omit discussion of code for handling errors (e.g. empty tables), invalid arguments (e.g. data type), nulls and memory allocation. The aggregate UDF has four main steps that are executed at different run-time stages: (1) Initialization, where memory is allocated and the UDF arrays for L and Q are initialized in each thread. (2) Aggregation, where each x_i is scanned and passed to the UDF, x_i entries are unpacked and assigned to array entries, n is incremented and L and Q entries are incrementally updated by Eqs. (1) and (2). Since all rows are scanned, this step is executed n times and therefore it is the most time-consuming. (3) Merging partial results, which is required to compute totals by adding the sub-totals obtained by each individual thread. Threads return their partial computations of n, L, Q, which are aggregated into a single set of matrices by the master thread. (4) Returning sufficient statistics, where n, L, Q are packed and returned to the user.
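As an illustration of step (1), here is a generic C sketch of per-thread initialization; it uses plain malloc/memset instead of the DBMS-specific allocation calls a real aggregate UDF must use, so only the structure of the step is meant to carry over.

#include <stdlib.h>
#include <string.h>

#define MAX_d 64

typedef struct {
  int d;
  double n;
  double L[MAX_d];
  double Q[MAX_d][MAX_d]; /* Q triangular */
} UDF_nLQ_storage;

/* Step 1: allocate and zero the per-thread storage; d arrives as a run-time parameter.
   Indices 1..d are used later, so d must be smaller than MAX_d. */
UDF_nLQ_storage *init_thread_storage(int d) {
  UDF_nLQ_storage *s;
  if (d < 1 || d >= MAX_d) return NULL;
  s = malloc(sizeof(UDF_nLQ_storage));
  if (s == NULL) return NULL;
  memset(s, 0, sizeof(*s));   /* n = 0, L = 0, Q = 0 */
  s->d = d;
  return s;
}

int main(void) {
  UDF_nLQ_storage *ts = init_thread_storage(8);
  /* ... step (2) would now update ts for every row of the thread's partition ... */
  free(ts);
  return 0;
}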

Dimensionality d of X cannot be known at compile time. Therefore, the UDF "struct" record has a maximum dimensionality, wasting some space. This is because the UDF allocates memory in the heap before the table scan starts. An alternative solution is to create d UDFs, each accepting a k-dimensional vector, where k = 1, ..., d.

Step (2) is the most intensive because it is executed n times, whereas the rest are executed only once or as many times as the number of threads the DBMS uses. Hence optimizations are incorporated in step (2). We start by passing x_i to the UDF. There exist three alternatives: (1) concatenating all vector values as a long string; (2) passing all vector values as parameters individually, each of them having a null indicator flag as required by SQL; (3) passing the vector as an array, if the DBMS supports arrays (this alternative is similar to (2)). For alternative (1) the UDF needs to call a function to unpack x_i, which takes time O(d) and incurs overhead. Overhead is produced for two reasons: at run-time, floating point numbers must be cast as strings, and when the long string is received it must be parsed to get the numbers back, so that they are properly stored in an array. The unpacking routine determines d. For alternative (2) the UDF directly assigns vector entries in the parameter list to the UDF internal array entries. Given the UDF parameter compile-time definition, d must also be passed as a parameter. The UDF updates n, L, Q as follows: n is incremented, L ← L + x_i and Q ← Q + x_i x_i^T, based on the type of matrix: diagonal, triangular or full (default = triangular).
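The unpacking routine for alternative (1) is not listed in the article; a plausible C sketch (our own, using strtod on a comma-separated string) is shown below, and it also illustrates why this alternative adds O(d) parsing overhead per row.

#include <stdio.h>
#include <stdlib.h>

#define MAX_d 64

/* Parse a comma-separated string such as "1.5,2.0,3.25" into X[1..d];
   returns the number of values found (d). */
int unpack_point(const char *str, double X[MAX_d + 1]) {
  int d = 0;
  const char *p = str;
  char *end;
  while (*p != '\0' && d < MAX_d) {
    X[++d] = strtod(p, &end);        /* convert one value */
    if (end == p) break;             /* no number found: stop */
    p = (*end == ',') ? end + 1 : end;
    if (*end == '\0') break;
  }
  return d;
}

int main(void) {
  double X[MAX_d + 1];
  int d = unpack_point("1.5,2.0,3.25", X);
  printf("d=%d  X[1]=%.2f  X[3]=%.2f\n", d, X[1], X[3]);
  return 0;
}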

We show C code for step (2). In the following code the subscripts a, b = 1, ..., d are consistent with the definitions from Section 2.1; since arrays in the C language start at subscript zero, the UDF wastes one entry in L and 2d entries in Q. The UDF updates n, L and Q reading each row once.

/* Step 2: aggregate rows */
thread_storage->n += 1.0;
for (a = 1; a <= d; a++) {
  thread_storage->L[a] += X[a];
  for (b = 1; b <= a; b++)
    thread_storage->Q[a][b] += X[a]*X[b];
}

The partial result aggregation code in step (3) is somewhat similar to the aggregation step code, with the fundamental differences that we aggregate matrices instead of arithmetic expressions and that all partial results are summarized into global totals across all threads.

/* Step 3: aggregate partial results */
thread_storage->n += distributed->n;
for (a = 1; a <= d; a++)
  thread_storage->L[a] += distributed->L[a];
for (a = 1; a <= d; a++)
  for (b = 1; b <= d; b++)
    thread_storage->Q[a][b] += distributed->Q[a][b];

Step (4) is the inverse of getting vector values in step (2): we need to pack n, L and Q into a long string. This is a constraint imposed by SQL. Such a string has the same memory limitation as the UDF, but on the stack. Therefore, both the storage record and the result string are allocated similar memory space, in the heap and on the stack respectively. The code is simple and therefore omitted. The name of the variable (n, L, Q) appears first and then values (floating point numbers) are separated with some symbol (e.g. ","). In addition, rows of Q are separated by another symbol (e.g. ";"). This step has minimal impact on performance since it is executed only once.
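A hedged sketch of the packing idea for step (4) is shown below; the exact separators, the CHAR(36000) convention and the function names are our own illustration (repeated snprintf appends into a fixed-size character buffer), not the authors' code.

#include <stdio.h>

#define MAX_d 64
#define RESULT_LEN 36000   /* matches RETURNS CHAR(36000) in the UDF definition */

/* Append one formatted piece to out, truncating safely if the buffer fills up. */
static int append(char *out, int len, const char *fmt, double v) {
  if (len < RESULT_LEN - 1)
    len += snprintf(out + len, RESULT_LEN - len, fmt, v);
  return len < RESULT_LEN ? len : RESULT_LEN - 1;
}

/* Step 4 (sketch): pack n, L and the lower triangle of Q into a string such as
   "n=4;L=10,6;Q=30;17,10" (the separators are our own choice). */
int pack_result(char *out, double n, int d, double L[], double Q[MAX_d + 1][MAX_d + 1]) {
  int a, b, len = 0;
  len = append(out, len, "n=%g", n);
  len = append(out, len, ";L=%g", L[1]);
  for (a = 2; a <= d; a++)
    len = append(out, len, ",%g", L[a]);
  for (a = 1; a <= d; a++)
    for (b = 1; b <= a; b++)
      len = append(out, len, (a == 1 && b == 1) ? ";Q=%g" : (b == 1 ? ";%g" : ",%g"), Q[a][b]);
  return len;
}

int main(void) {
  static char result[RESULT_LEN];
  static double Q[MAX_d + 1][MAX_d + 1];      /* zero-initialized */
  double L[MAX_d + 1] = {0, 10.0, 6.0};
  Q[1][1] = 30.0; Q[2][1] = 17.0; Q[2][2] = 10.0;
  pack_result(result, 4.0, 2, L, Q);
  printf("%s\n", result);
  return 0;
}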

3.4.4. Calling the aggregate UDF

We first discuss models where Q is diagonal and then models where Q is triangular. For K-means, EM, Naive Bayes and the Bayesian classifier based on class decomposition, Q is diagonal. For Naive Bayes the UDF is called grouping by g in the query. For K-means and EM the UDF is called grouping by j, where j is the closest or most probable cluster obtained at each iteration. For correlation, covariance, linear regression and PCA, Q and Σ are triangular. In general, the UDF is called without grouping rows (GROUP BY clause), except for K-means or EM clustering, in which case the closest cluster subscript is used to compute k sets of sufficient statistics.

3.5. Sampling large data sets

If X is larger than the available memory then we propose to collect samples from X to accelerate the computation of n, L, Q. However, sampling introduces error in estimations and such error has to be controlled and minimized. Sampling theory states that error in estimation decreases as sample size s increases and that sampling becomes more accurate with sampling with replacement [10]. Therefore, we exploit sampling mechanisms with replacement given their desirable statistical properties.

We propose two mechanisms to sample X to quickly compute an approximation of n, L, Q: geometric sampling and bootstrapping [21], adapting both sampling techniques to work with the UDF. Each mechanism provides different advantages and disadvantages with respect to speed and accuracy. The key issue is being able to get accurate approximations.

Geometric sampling starts with some initial sample size s and then the sample size grows geometrically until the error comes down to an acceptable level. In our case, we consider sample size s growing proportionally to powers of 2 (e.g. s = 500, 1000, 2000, ...), where s = 500 is an initial sample size and n, L, Q are independently computed on each sample set. Such a geometric rate allows us to control error behavior as well as measure worst case time complexity. Evidently, it is preferable that s ≪ n when n is large, but not so small as to introduce significant error in the approximations. On the other hand, bootstrapping [21] creates a batch of B samples of fixed size s, where the statistic estimation is constructed as an average. Based on the sufficient statistics properties explained in Section 3.2, n, L, Q are additive (or distributive in OLAP terms [18]). Therefore, we can incrementally update them based on each new sample. The number of times X is sampled (B) grows until an acceptable error level is reached. To keep a uniform statistical framework, we also consider a geometric growth rate for B, where preferably B < s. Notice sampling theory [10] states that s = 500 or s = 1000 are considered large samples. Therefore, they can produce accurate estimations for large n, despite being counterintuitive. In other words, sample size s in absolute terms is more important than the fraction it represents of n. We argue this is a great property of sampling because to develop an efficient UDF on very large n it is preferable that s is independent from n.
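As a simplified illustration of the two mechanisms, the following self-contained C sketch estimates the mean of a one-dimensional in-memory array by geometric sampling and by bootstrapping. The stopping rule for geometric sampling (two consecutive estimates agreeing within a tolerance) and all names are our own simplification; in the article the error is measured against the exact statistics and the samples would come from SQL SAMPLE queries feeding the aggregate UDF.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 100000   /* toy "table": in the article n would be millions of rows */

/* Sample s points with replacement and return the sample mean of a 1-d data set.
   This stands in for running the aggregate UDF on a sample of size s. */
double sample_mean(const double *x, int s) {
  double L = 0.0;
  for (int i = 0; i < s; i++)
    L += x[rand() % N];
  return L / s;   /* Eq. (3) with d = 1 */
}

int main(void) {
  double *x = malloc(N * sizeof(double));
  for (int i = 0; i < N; i++) x[i] = (double)(i % 1000);   /* synthetic values */

  /* Geometric sampling: grow s until two consecutive estimates agree within tol. */
  double tol = 0.01, prev = sample_mean(x, 500), cur;
  for (int s = 1000; ; s *= 2) {            /* s = 1000, 2000, 4000, ... */
    cur = sample_mean(x, s);
    if (fabs(cur - prev) / fabs(cur) < tol || s >= N) break;
    prev = cur;
  }
  printf("geometric estimate of the mean: %.2f\n", cur);

  /* Bootstrapping: B samples of fixed size s = 1000; the estimate is their average. */
  double acc = 0.0;
  int B = 40;
  for (int b = 0; b < B; b++) acc += sample_mean(x, 1000);
  printf("bootstrap estimate of the mean: %.2f\n", acc / B);

  free(x);
  return 0;
}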

We now analyze some relevant DBMS aspects of sampling. SQL has a sampling clause (available in most DBMSs). In the case of the Teradata DBMS the sample clause just needs to specify s, the number of rows (e.g. "SELECT * FROM X SAMPLE s"). Sampling in SQL is more efficient if sampling is done with replacement (SWR) because there is no need to track which points have already been sampled. The DBMS is efficient at collecting a sample set of small size s, using a block-based access scheme [7]. In fact, time is O(s) and not O(n). The reason for such efficiency is that given a sample of size s the DBMS reads up to s blocks from disk, and in many cases fewer when the blocking factor is large. There exists an interaction between sampling and caching since future samples can be collected from blocks cached in main memory. If s is large then the DBMS needs to load many blocks into main memory. Therefore, in the worst case, sampling will be inefficient if the sample size is a big fraction of n. Notice sampling can be a slow mechanism if X needs to be scanned, but in such a case many blocks can be cached. If the DBMS does not provide a sampling clause in SQL then, to sample k random rows, it may be necessary to generate k random numbers between 1...n and then exploit a primary index based on i. In the absence of an index the last resort is to scan X, which would be as slow as just computing n, L, Q on the entire data set; we discourage such an approach. Further discussion on how the DBMS collects samples can be found in [7].

We close this section discussing the main concern about using samples: error in estimations. Sampling to estimate μ works well [10]. In general, a small s can produce tight estimates of μ regardless of the density function. Sampling to estimate σ is more challenging than μ, especially when X does not follow a normal distribution. Finally, sampling to get the covariance matrix Σ or the correlation matrix ρ turns out to be the most difficult issue, but fortunately it is only relevant for PCA and linear regression. Therefore, we will study how accurately n, L and Q approximate the correlation based on a sample set, considering the complementary Q cases: diagonal or triangular Q. Our experiments will analyze error behavior with both sampling methods to approximate the mean, variance and correlation, going from the simplest to the hardest case.

3.6. Time complexity and I/O cost

Time complexity is O(d²n) for triangular Q and O(dn) for diagonal Q. The I/O cost to scan the data set X is O(n). I/O is not proportional to d² since Q is memory-resident. Assuming a sample of size s can be collected in time O(s), the time complexity based on sampling is analogous.

The aggregate UDF for the horizontal layout to compute n, L, Q has time complexity O(d²n) for triangular Q and O(dn) for diagonal Q. On the other hand, since the SQL query for triangular Q requires a self-join with the vertical layout of X, scanning XV twice, the I/O cost is bigger (we assume a hash-based join is possible). Table 1 summarizes I/Os from X for all the alternatives, assuming one I/O operation per row. Clearly SQL is the worst alternative for the vertical layout.

3.6.1. Time complexity for each technique

Once matrices are computed, statistical techniques take the following times to compute models (with C code or a mathematical or statistical library): linear correlation takes O(d²), PCA takes O(d³), which is the time to get the SVD of the correlation matrix ρ, linear regression takes O(d³) to invert Q and clustering takes O(dk) to compute k clusters. In short, time complexity to compute models becomes independent from n. The UDF approach will be efficient as long as d ≪ n. The CPU cost is the same for SQL and UDFs: for diagonal Q there are 2dn floating point operations (flops), whereas for triangular Q there are (d + d²/2)n flops.
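As a concrete instance of these counts, using the default experimental setting d = 8 and n = 1M from Section 4:

\text{diagonal } Q:\; 2dn = 2 \cdot 8 \cdot 10^{6} = 1.6 \times 10^{7} \text{ flops}, \qquad
\text{triangular } Q:\; \Big(d + \frac{d^{2}}{2}\Big) n = (8 + 32) \cdot 10^{6} = 4 \times 10^{7} \text{ flops}.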

3.7. Summary at a conceptual level

All linear statistical models can exploit the same sufficient statistics (summary matrices) and fortunately such matrices can be computed in a single pass with an aggregate UDF. On the other hand, specific scalar UDFs are needed to score data sets based on each model.


Table 1. I/O cost for SQL queries and UDFs to get n, L, Q without caching.

                    Q diagonal (I/Os)    Q triangular (I/Os)
SQL XV              dn                   d²n
UDF XH              n                    n
UDF XH sampling     s                    s


Both scalar and aggregate UDFs perform as many calculations as possible when processing each data point in order to build models and score data sets in a single table scan. The remaining complex equations (rewritten based on n, L, Q), involving only small matrices, can be quickly evaluated outside the DBMS. In other words, all calculations involving large matrices are performed inside the DBMS. UDFs represent an alternative to extend the DBMS with advanced statistical functionality without modifying the internal DBMS source code, which is a difficult task. UDFs can easily manipulate matrices using the C language data types and flow control statements. The query calling the UDF is automatically executed in parallel and such execution is optimal if the data set is evenly distributed among processing threads. The data set can be processed scanning table rows in any order, which makes fast parallel processing possible. That is, each data point updates the summary matrices and gets scored independently of the other points. Finally, since our UDFs have wide applicability in statistics, machine learning, image processing and pattern recognition, they open the possibility of using the DBMS as a back-end server for intensive numeric processing tasks.

4. Experimental evaluation

Our goal is to make a fair comparison between UDFs and C++ running on the same hardware, analyzing the tradeoff between sample size and accuracy for sampling mechanisms, assessing the impact of numerical optimizations (i.e. reducing flops) and identifying bottlenecks in UDF execution.

4.1. Experimental setup

4.1.1. Hardware

We present an experimental evaluation on the Teradata DBMS, which supports UDFs. We used two "average" DBMS servers. Our first server hardware configuration had one Intel dual-core CPU with each core running at 3.2 GHz, 4 GB of RAM and 750 GB of disk. The second server had identical hardware, except the CPU was an Intel quad-core with each core running at 2.13 GHz. The operating system was Microsoft Windows XP. We emphasize the DBMS can cache (load) large tables into main memory (especially with 4 GB). Caching is automatically performed when the table is scanned and the table size is smaller than the available memory. We carefully controlled when the DBMS cached the data set to measure performance. We also consider the case where the data set is large enough that it cannot fit in main memory. Our workstation had a 2.4 GHz CPU, 2 GB of main memory and 160 GB of disk.

4.1.2. Software: DBMS and C++

The UDFs were programmed in the C language variant provided by the DBMS, turning on optimizations for best performance (e.g. enabling unprotected execution, also called unfenced). It is recommended that a UDF is thoroughly tested in protected mode for memory leaks (e.g. accessing an array out of bounds) before deployment. We also compare UDF performance with C++. In order to conduct a fair comparison, the C++ implementation ran on another computer with identical hardware exploiting the same optimizations: caching and multi-threading. For caching we forced the operating system to cache the data set into main memory. On the other hand, we used the OpenMP multi-threading C library. For completeness, we also show C++ time measurements without caching and multi-threading. In general, we computed a non-diagonal Q matrix to assume worst case time complexity.

We now explain the DBMS settings used to enhance performance. UDFs were executed in unprotected (unfenced) mode. This means UDFs take full advantage of the DBMS multi-threading capabilities and run in the same address space as queries. During UDF development and testing, UDFs were executed in protected mode to detect any memory leaks. The DBMS was configured with four threads to improve I/O performance. The recovery log was disabled. All queries were run on temporary (spool) space, instead of creating temporary tables. We managed caching of large data sets, making sure the data set was not cached for disk-based measurements. UDFs in our DBMS were constrained to a 16-bit memory address space (i.e. 64 KB). The DBMS and the OS took about 800 KB of main memory.

The C++ implementation analyzed data sets stored on flat files exported from the DBMS with the standard ODBC interface. The C++ program was optimized to scan X once, keeping L and Q in main memory at all times.
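The article does not list this program; the following is a minimal C sketch (our own, assuming an OpenMP-capable compiler, e.g. compiled with -fopenmp) of the cached, multi-threaded scan. Each thread accumulates private copies of L and Q over its share of the rows and the copies are merged at the end, mirroring the UDF's aggregation and merge steps.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define D 8   /* default dimensionality in the experiments */

int main(void) {
  long n = 1000000;                                    /* 1M rows, as in the experiments */
  double *X = malloc((size_t)n * D * sizeof(double));
  for (long i = 0; i < n * D; i++) X[i] = (double)(i % 7);  /* synthetic, already "cached" */

  double L[D] = {0}, Q[D][D] = {{0}};
  #pragma omp parallel
  {
    double Lt[D] = {0}, Qt[D][D] = {{0}};              /* per-thread partials */
    #pragma omp for nowait
    for (long i = 0; i < n; i++) {
      const double *x = X + i * D;
      for (int a = 0; a < D; a++) {
        Lt[a] += x[a];
        for (int b = 0; b <= a; b++)
          Qt[a][b] += x[a] * x[b];
      }
    }
    #pragma omp critical                                /* merge step */
    for (int a = 0; a < D; a++) {
      L[a] += Lt[a];
      for (int b = 0; b <= a; b++) Q[a][b] += Qt[a][b];
    }
  }
  printf("L[0]=%.0f  Q[1][0]=%.0f\n", L[0], Q[1][0]);
  free(X);
  return 0;
}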

Average time is calculated from five runs of each experiment and in general it is reported in seconds. Times reported for processing in main memory exclude the time to initially cache the data set. When times are a fraction of a second we report 1 s (this happens only a few times; we avoided reporting milliseconds to make the presentation clear). In general, UDF execution was stopped after 30 minutes, indicated by *.

4.1.3. Data sets

We used both real and synthetic data sets to test performance and accuracy. To test accuracy we used real data with different n and d. Most data sets were obtained from the UCI Machine Learning repository. On the other hand, to test scalability, compare with C++ and analyze optimizations, we generated large synthetic data sets with a mixture of normal distributions, varying n and d. Such data sets were stored as tables in the DBMS. To test sampling efficiency we generated large data sets that could not fit in main memory (n = 100M).

4.2. Disk versus memory processing

We start by comparing the running time of our UDF and C++ with real data sets. Table 2 provides a summary. For C++ we use four threads (same as the DBMS), with and without caching. Given the speed of the server these data sets are processed in a few seconds. The UDF is 50% slower than C++ in main memory and about twice as slow on disk for the largest data set. For practical purposes, the UDF in main memory is only slightly slower than C++ running with the same optimizations (caching and multi-threading). We will study scalability with larger data sets in depth below.

We now compare how time grows to compute n, L, Q with a triangular (most demanding) Q. The first plot in Fig. 1 compares the UDF scanning the table from disk and reading the data set in main memory as n grows. The UDF achieves a 6× speedup with the largest n. On the other hand, scalability is linear on both disk and memory, which is expected. The second plot in Fig. 1 compares the UDF scanning the table from disk and reading the data set from main memory varying d (i.e. the number of columns). Notice the speedup decreases as d grows. It becomes just 2× at the highest d. Our explanation is that the O(d²) time complexity to get Q becomes important and it outweighs caching.

We now compare C++ and the UDF. The UDF works on four threads in the DBMS and the data set is cached. Such four threads are configured when the DBMS is installed (they cannot be changed later). To make a fair comparison we also use multi-threading and caching with C++. Fig. 2 compares the UDF and C++. The first plot analyzes data set size n (in millions), whereas the second one considers dimensionality. As expected, C++ is the fastest, but not much faster than the UDF. The gap in performance grows as n and d grow (especially for d), but it is not significant. To understand the impact of caching and multi-threading in more depth we now compare two C++ versions: one with one thread reading from disk (worst case) and a second one with four threads reading from main memory (best case, with the same advantages as the UDF). Table 3 summarizes this detailed comparison varying n, including the time to export the data with ODBC. We want to point out there are much faster export mechanisms. In Teradata large tables can be exported with a utility called FastExport, which is about five times faster than ODBC.

Table 2. Comparing UDF and C++ with real data sets (C++ is multi-threaded; times in s).

Data set    n      d    UDF disk    UDF mem    C++ disk    C++ mem
Corel       68K    7    3           1          1           1
NYSE        693K   4    16          3          8           2
USCensus    10M    10   128         22         66          16

Fig. 1. Caching: disk versus memory UDF processing (defaults n = 1M, d = 8). The two plots show time in seconds versus n in millions and versus d, each comparing the UDF reading from disk and from memory.


Fig. 2. Comparing UDF and C++ (both with multi-threading and caching; defaults n = 1M, d = 8). The two plots show time in seconds versus n in millions and versus d, each comparing C++ in memory (multi-threaded) and the UDF in memory.

Table 3. Comparing C++ and UDF (vary n, d = 8; times in s).

n × 1M    C++ 1 thread disk    C++ 4 threads mem    UDF 4 threads mem    ODBC
1         10                   2                    3                    508
2         19                   4                    5                    1015
4         38                   8                    11                   2028
8         76                   16                   22                   4055


We preferred to show ODBC times because it is a standard interface, but as can be seen even one fifth of the export time is a long time. C++ with multi-threading is the fastest, as expected, but it is closely followed by the UDF: the gap is just 50% more time for the UDF at the largest n. Not surprisingly, C++ with one thread reading from disk is the slowest. We consider our experimental findings significant because C++ enjoys the same optimizations as the UDF, but the UDF executes inside the DBMS with all its overhead. In other words, the UDF and C++ have similar efficiency (the UDF is not twice as slow). For completeness, we show times to export a table with the standard ODBC interface: C++ becomes the worst alternative once ODBC export time is considered.

4.3. Sampling large data sets

We divide these experiments into two parts. The first set of experiments compares the time efficiency of sampling against full table scans, with data sets that can be cached in main memory and larger data sets that do not fit in main memory. The second set of experiments analyzes the accuracy of estimation from samples. We focus on estimating the mean, variance and correlation as they are the common statistics used in most models. To understand the tradeoff between accuracy and sample size, we study time efficiency with synthetic data sets (especially with data sets that do not fit in main memory) and we analyze accuracy with real data sets (to measure approximation error under stringent conditions).

Table 4 compares evaluation time for both sampling mechanisms against a full table scan as a baseline (worst case). We use two data sets with significantly different sizes: n = 1M can be cached in main memory and n = 100M cannot be cached. We also consider dimensionality d = 8 and d = 16 to understand its impact on time. For geometric sampling, s starts at the sample size given in column s and grows in powers of two thereafter, as explained in Section 3.5 (e.g. for s = 2000 the sequence of sample sizes is 2000, 4000, ...). Also, when n = 1M we allow X to be cached after the second iteration. On the other hand, for bootstrapping s is fixed across each row of Table 4 and B grows geometrically. In a similar manner to geometric sampling, for bootstrapping we also allow X to be cached when n = 1M.

Our findings for geometric sampling are as follows (left part of Table 4). When n is small (n = 1M) caching is not used the first time (column 1 below "Geometric"), but it is used thereafter (columns labeled 2 and 3). We can see for n = 1M the second and third samples are computed much faster (the increment in time is small), indicating sampling does benefit from caching the entire data set. Geometric sampling is efficient only when s is small and n = 1M. Otherwise, geometric sampling time with s = 8000 approaches the time to scan the entire table, indicating a high I/O cost to collect a sample data set. When n = 100M caching does not accelerate processing. The explanation is that the sample must be truly random and therefore many new blocks must be read from disk at each iteration. Geometric sampling acceleration becomes important when n is large (n = 100M). But for large n caching produces a marginal acceleration.


Table 4. Sampling: evaluation time (in s).

n      d    s      Table scan    Geometric              Bootstrapping
                                 1      2      3        B=5    B=10   B=20   B=40
1M     8    500    17            5      8      10       1      1      3      6
1M     8    1000   17            9      13     15       1      2      3      7
1M     8    2000   17            13     15     16       1      2      4      8
1M     8    4000   17            17     18     18       1      2      5      9
1M     8    8000   17            17     18     18       1      3      6      12
100M   8    500    1986          19     37     54       75     150    290    662
100M   8    1000   1986          26     48     72       89     165    327    628
100M   8    2000   1986          39     76     110      122    252    505    1004
100M   8    4000   1986          62     121    179      218    487    891    1776
100M   8    8000   1986          111    212    309      416    788    1562   *
100M   16   1000   1995          26     49     72       93     181    345    1196
100M   16   2000   1995          39     76     112      146    300    597    2141


We can see that when X is cached, geometric sampling performs much better than a table scan. In fact, when s is large and n = 1M it turns out that the entire data set X is cached and the second and third geometric samples take a fraction of 1 s. Last, we can see from the last two rows that d plays a minor role in sampling: times are practically the same despite doubling d.

We now discuss bootstrapping, shown on the right part of Table 4. Comparatively, bootstrapping is much slower than geometric sampling. The explanation is, of course, the high number of iterations (B). If both s and B are large (s = 8000, B = 40) bootstrapping time approaches and even surpasses the time to perform a table scan. In such a case, bootstrapping is inefficient. Another interesting aspect is that the difference between s = 500 and s = 1000 is small when n = 100M, indicating that collecting smaller sample sizes does not necessarily improve performance. Bootstrapping does seem a reasonable alternative for s = 1000, which is a decent sample size [10]. The last rows of Table 4 show times for d = 16 with s = 1000, 2000 (notice we do not show all sample sizes s for n = 1M and n = 100M since the trend is the same). For both geometric sampling and bootstrapping, when d is doubled the change in time is marginal, indicating d plays a minor role in performance. In short, sampling can accelerate the computation of n, L, Q, provided s is small (but not too small). In the following paragraphs, we carefully analyze accuracy with s = 1000 by default.

We now turn our attention to accuracy, varying s and B. We first compute n, L, Q on the entire data set X. Such n, L, Q are used to compute the exact values of μ, Σ and ρ (with double precision variables, as explained above). We then compute the relative error of the approximation with respect to the exact value. Table 5 shows the maximum relative error when estimating μ and Σ, assuming Σ and Q are diagonal (i.e. considering only means and variances). That is, we report the highest relative error over all dimensions. To make the table intuitive and concise, errors between 0.001 and 0.01 are shown as 0.01 and errors below 0.001 (1.0E-3) are shown as 0.00; the rationale is that scientific notation takes too much space and errors below 1% prove high accuracy. Our default sample size is s = 1000, which can be efficiently evaluated as shown above. The error in the approximation of μ approaches zero rapidly (such error is always smaller than the error for Σ) and hence we do not show it. The general trend indicates that relative error rapidly decreases as s or B grows. In geometric sampling, the overall trend is that error gradually decreases as s doubles. We must notice that bootstrapping starts off with good accuracy, even at B = 10, and thus its error decreases more slowly. At s = 8000 there are two data sets whose error has gone down to 1%. On the other hand, bootstrapping error decreases as B grows, being 0.01 or better for four data sets at B = 100. Overall, both sampling mechanisms are good, but bootstrapping tends to be more accurate and more stable, although slower (as shown in Table 4).
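The error measure behind Tables 5 and 6 can be stated as a small C function (our own formulation of "highest relative error over all dimensions"; names are illustrative):

#include <math.h>
#include <stdio.h>

/* Maximum relative error between an exact vector (computed on all of X)
   and its approximation (computed on a sample). */
double max_rel_error(const double *exact, const double *approx, int d) {
  double worst = 0.0;
  for (int a = 0; a < d; a++) {
    double e = fabs(approx[a] - exact[a]) / fabs(exact[a]); /* assumes exact[a] != 0 */
    if (e > worst) worst = e;
  }
  return worst;
}

int main(void) {
  double exact[3]  = {10.0, 4.0, 2.0};
  double approx[3] = {10.2, 3.9, 2.1};
  printf("max relative error = %.3f\n", max_rel_error(exact, approx, 3)); /* 0.050 */
  return 0;
}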

Table 6 shows error behavior for the most difficult case, the correlation matrix, with s = 1000 by default. As noted above, to make the table intuitive and concise, errors between 0.001 and 0.01 are shown as 0.01 and errors below 0.001 (1.0E-3) are shown as 0.00 (scientific notation takes too much space). Both sampling techniques become fairly accurate when s = 8000 and B = 100, respectively, but bootstrapping is clearly superior.

Table 5. Sampling: relative error to approximate the μ vector and the Σ (covariance) matrix.

                            Geometric (s)                          Bootstrapping (s = 1000, B)
Data set    n     d    500    1000   2000   4000   8000    10     20     40     100
Arsenic     10K   5    0.10   0.09   0.02   0.02   0.01    0.01   0.01   0.00   0.00
Shuttle     58K   6    0.07   0.06   0.03   0.05   0.02    0.01   0.01   0.01   0.00
Corel       68K   7    0.16   0.17   0.06   0.10   0.06    0.07   0.07   0.02   0.02
NYSE        693K  4    0.36   0.18   0.15   0.03   0.01    0.09   0.05   0.07   0.00
USCensus    10M   10   0.16   0.18   0.09   0.03   0.12    0.03   0.01   0.02   0.01


Given the longer time bootstrapping requires, geometric sampling is a faster, yet reasonably accurate alternative (except for data set Corel, which warrants larger samples). In summary, based on our experimental results we recommend using s = 1000 or larger and geometric sampling by default. If high accuracy is needed, bootstrapping represents a more accurate, but slower, alternative. Another observation is that there is no universal setting for B. In two data sets B ≤ 20 produced 99% accuracy, whereas two data sets required going up to B = 100. It may well be the case that some data set may require going up to B = 500 or even B = 1000 to make the relative error come down, although we must emphasize that was not necessary with the diverse real data sets we used. A similar trend holds for geometric sampling. Such trends indicate that the amount of sampling required depends on the statistical properties of the data set rather than its size.

4.4. Optimizations

Table 7 analyzes the impact of performing O(d) computations instead of O(d²) (diagonal versus triangular Q). For diagonal Q the amount of memory allocated is O(d) (a one-dimensional array), but O(d²) for triangular Q (a two-dimensional array). Therefore, the amount of allocated main memory and time complexity are related. When the UDF reads the data set from disk there is practically no difference: the speedup achieved by this numerical optimization is marginal. On the other hand, when the data set is read from main memory this numerical optimization does make a difference. For the highest d the speedup is 4×. We conclude that the UDF is severely delayed by disk I/O, despite the fact that it is doing fewer computations.

In the second optimization we compare an SQL query and the UDF to get the summary matrices. The SQL query requires a self-join on a vertical layout of X to compute the cross-products in Q and thus it has time complexity O(d²n). We do not compare a more efficient SQL query based solution that requires O(d²) terms in a SELECT statement, because it easily exceeds the maximum number of columns allowed by the DBMS (e.g. d ≥ 32 makes this query useless). As we can see from Table 8, the speedup provided by the UDF is two orders of magnitude, compared to the SQL query. We attempted to exploit caching for the SQL query, but it did not help. Therefore, the SQL query is significantly impacted by disk I/O to read the data set (twice) and to create a temporary table having O(d²n) rows. The lesson is clear: joins between large tables in SQL to perform data mining should be avoided.

We now compare UDF running times between two high-performance CPUs: one with two cores (dual-core) and another one with four cores (quad-core). Table 9 shows a great benefit of running UDFs on cached data sets: the UDF running on the 4-core CPU takes half the time it takes on the 2-core CPU. The reason behind such acceleration is that the threads evaluating the UDF can be easily assigned to each core and work independently, given the distributive nature of n, L, Q. On the other hand, there is no acceleration when the UDF reads X from disk: time is about the same since I/O time dominates. We believe this is a great feature of our UDF approach because this optimization is automatic and it achieves full speedup without developing any complex parallel algorithm.
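The distributive merge behind this automatic speedup can be sketched as follows; this is our illustration of the idea, not the DBMS threading API, and the struct and function names are hypothetical.

/* Minimal sketch of the distributive merge behind the multi-core speedup:
   each thread accumulates private n, L, Q over its share of cached rows and
   the partial results are summed once at the end. */
#include <stddef.h>

typedef struct {
    long    n;   /* number of points seen by this thread         */
    double *L;   /* d linear sums                                 */
    double *Q;   /* d*(d+1)/2 cross-product sums (triangular Q)   */
} SuffStats;

/* Merge thread-local statistics into a global accumulator. */
void merge(SuffStats *global, const SuffStats *local, int d)
{
    global->n += local->n;
    for (int a = 0; a < d; a++)
        global->L[a] += local->L[a];
    for (size_t k = 0; k < (size_t)d * (d + 1) / 2; k++)
        global->Q[k] += local->Q[k];
}

Because the merge cost is O(d²) and independent of n, the per-thread scan dominates, so doubling the cores roughly halves the cached running time, as Table 9 shows.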

We conclude our experiments with a breakdown of UDF execution time at run-time. Table 10 compares the UDF with the data set retrieved from disk and the data set cached in main memory. We wanted to profile the UDF with a data set large enough to use up all RAM. After trial and error we found that n = 4M and d = 32 produced a fairly large data set that took almost all main memory (over 3.2 GB). This experiment was conducted with several versions of the UDF, going from an empty UDF to a UDF performing all calculations and returning matrices. The initial query was a straight table scan.

Table 6
Sampling: relative error to approximate the correlation matrix ρ.

Data set   n     d    Geometric (sample size s)                  Bootstrapping (s = 1000)
                      s=500   s=1000  s=2000  s=4000  s=8000     B=10   B=20   B=40   B=100
Arsenic    10K   5    0.44    0.25    0.15    0.16    0.03       0.15   0.22   0.07   0.00
Shuttle    58K   6    0.03    0.02    0.03    0.03    0.01       0.01   0.01   0.01   0.01
Corel      68K   7    0.47    0.33    0.68    0.40    0.18       0.09   0.12   0.17   0.00
NYSE       693K  4    0.09    0.00    0.08    0.00    0.00       0.03   0.02   0.02   0.01
USCensus   10M   10   0.08    0.05    0.04    0.02    0.01       0.02   0.01   0.01   0.00

Table 7
Numerical optimization: comparing triangular and diagonal Q (n = 1M).

      Disk                         Memory
d     Triang   Diag   Speedup      Triang   Diag   Speedup
8     17       16     1            3        2      2
16    29       26     1            5        3      2
32    49       48     1            12       5      2
64    100      90     1            41       10     4


Table 8
Database optimization: SQL join query (disk) versus UDF (memory) (d = 8).

n × 1M   SQL     UDF mem   Speedup
1        692     3         231
2        1242    5         248
4        2533    11        230
8        5346    22        243
16       11227   44        255

Table 9
Hardware optimization: UDF running on multi-core CPUs (d = 8).

         Disk                 Memory
n × 1M   2 cores   4 cores    2 cores   4 cores
1        16        16         3         1
2        32        33         5         3
4        62        69         11        6
8        122       142        22        11

Table 10
UDF profile (d = 32, n = 4M, triangular Q; time in s).

                     Disk                  Memory
Step                 Time   Δ     %        Time   Δ     %
Scan data set        190    190   97       4      4     9
Call UDF             191    1     1        7      3     7
Alloc arrays         193    2     1        22     15    33
Pass argument        193    0     0        27     5     11
Compute matrices     195    2     1        39     12    26
Return matrices      195    0     0        46     7     15


The second version was a query calling an empty UDF doing no computations. The next UDF version allocated arrays, but performed no computations. The next one passed the input vector as a parameter. We then created a UDF that computed all matrices, but without returning results. Finally, the last UDF returned matrices. In short, each section of the UDF source code was benchmarked to understand its contribution to overall time. To our surprise, the UDF contributed almost no time when reading the data set from disk. Scanning the data set took almost all the time, and the remaining steps in the UDF contributed a marginal fraction. This basically means the UDF execution is interleaved with I/O and thus the quadratic time complexity plays only a minor role. On the other hand, when the UDF works with the cached data set in main memory, I/O contributes less than 10%. We expected the calculation of n, L, Q to take most of the time, but it turned out that allocating memory and maintaining those arrays in main memory was the bottleneck. This is surprising because it indicates significant DBMS overhead managing the UDF execution in main memory. Returning results had a measurable time contribution, explained by the size of Q and ODBC.

5. Related work

There has been considerable work in data mining to develop efficient techniques, but most work has concentrated on proposing algorithms assuming the data set is available in a flat file outside the DBMS. Studying the integration of statistical techniques with a DBMS has received little attention. Most research work has concentrated on association rules, followed by clustering and decision trees [15]. The importance of the linear sum of points and the quadratic sum of points (without cross-products) to decrease I/O in clustering is recognized in [5,42], but they assume the data set is directly accessible with some I/O interface. Computation of sufficient statistics for classification with decision trees in a relational DBMS is proposed in [17]. Such sufficient statistics are different from ours as they are developed to accelerate attribute splits by preprocessing the data set with binning. Sufficient statistics on binary data enable the computation of models for association rules based on binary clusters and the correlation matrix in one pass [31]. A key difference is that sufficient statistics for binary dimensions are simpler than those for continuous dimensions, accelerating and simplifying data mining algorithms. Even further, sufficient statistics also accelerate OLAP cube computations [32]. We are not the first to recognize the importance of sampling to accelerate data mining computations or query evaluation. Some references include [5] to speed up K-means clustering, [4,37] to accelerate data mining algorithms, [24] to efficiently compute range queries and [2,23] to accelerate aggregation queries. Association rules are mined in one pass with a sample and refined on a second pass over the entire data set [37]. Biased sampling with skewed distributions is used to accelerate data mining algorithms, which generally require multiple passes over the data set [4]. Samples are collected from multiple nodes in a hierarchy over a data set to quickly answer different aggregate queries [2]. Maintaining a materialized view indexed for efficient random sampling is proposed in [24]. In a complementary approach, aggregation queries are efficiently answered with histograms instead of samples in [26], where such histograms are optimally computed with splines. In contrast, in our proposal we directly exploit the DBMS random sampling mechanism without developing a new sampling algorithm (which would require modifying the DBMS source code). Our approach does not require multiple passes over the data set as long as blocks can be directly accessed to collect samples. Another difference is that we exploit sampling only when the data set does not fit in main memory, and only to compute its sufficient statistics, not specific models, aggregations or range queries. Reservoir and density-biased sampling [4] could help obtain accurate sufficient statistics on subsets of data sets with skewed distributions in one pass, but cannot exploit existing random sampling mechanisms in the DBMS.

There exists work to integrate data mining with SQL and UDFs. Most proposals add syntax to SQL and optimize queries using the proposed extensions. The importance of pushing statistical and data mining computations into a DBMS is recognized in [11]; this article also emphasizes the bottleneck of exporting tables out of the DBMS. [11] argues that SQL queries and MapReduce can be two complementary mechanisms to analyze large data sets, but favors MapReduce as an external mechanism to analyze large data sets with data mining functionality instead of extending the DBMS; this work falls short of recognizing UDFs as a fundamental and feasible extensibility mechanism. Developing data mining algorithms, rather than statistical techniques, in SQL has received moderate attention. Some important approaches include [28,34,36] to mine association rules and [35] to define primitives for decision trees. SQL syntax is extended to allow spreadsheet-like computations in [41], letting an end-user express complex equations in SQL. Data mining primitive operators are proposed in [9], including an operator to pivot a table and another one for sampling, both useful to build data sets. SQL extensions to define, query and deploy data mining models are proposed in [29]; this proposal focuses on managing models rather than computing them and therefore such extensions are complementary to aggregate UDFs computing sufficient statistics. User-Defined Aggregates have been used to develop other data mining algorithms [39,40,30], but without studying performance aspects on a modern DBMS. ATLaS [40] extends SQL syntax with object-oriented constructs to define aggregate and table functions (with initialize, iterate and terminate clauses), providing a user-friendly interface to the SQL standard. UDFs have been used as a mechanism to implement primitive vector and matrix operators in mathematical equations used by data mining algorithms [33,30]. We must mention that UDFs require different strategies for cost-based optimization [22] since they differ from SPJ queries. Our UDF automatically benefits from computing SPJ queries to produce the data set. To the best of our knowledge, we are the first to consider accelerating the computation of sufficient statistics by combining UDFs, sampling and caching.

We close this section by discussing related work on database systems and on exploiting new hardware. Exploiting main memory to accelerate data mining is not new [1,16,27]. Data mining algorithms under-utilize a modern CPU because their memory access patterns have low temporal locality [16]; the proposed solution is to maintain a separate context per thread. On the other hand, [1] shows that incorporating small processors into memory chips enables fast parallel processing for a graph clustering problem.

6. Conclusions

We focused on efficiently computing sufficient statistics with an aggregate UDF in a DBMS, exploiting caching and sampling. Our sufficient statistics matrices benefit several well-known statistical models, including PCA, clustering, Naive Bayes and linear regression. The UDF enables fast memory-based processing of demanding matrix computations. The DBMS automatically caches a data set when it is accessed with a table scan, the standard read pattern of a data mining algorithm, thereby accelerating UDF processing. We introduced two sampling mechanisms (geometric sampling and bootstrapping) to accelerate the UDF when data sets cannot fit in main memory. Experiments with large real and synthetic data sets show promising results. The UDF is significantly faster when analyzing cached data sets, compared to reading from disk. The UDF has slightly worse performance than C++ in a direct comparison, with both the UDF and C++ exploiting caching and multi-threading. We consider such a performance result encouraging, given the perception that DBMSs are slow for data mining due to their overhead. Also, users can enjoy the extensive DBMS functionality. Not surprisingly, the UDF is significantly faster than C++ when C++ reads from disk or when it works in a single thread. On the other hand, geometric sampling and bootstrapping are generally faster than performing a table scan on large data sets that cannot fit in main memory. Both techniques achieve high accuracy (around 99%) in approximating the data set statistics (mean vector, covariance matrix, correlation) with a large enough sample and a few iterations. Even further, sampling on cached data sets allows the UDF to return sufficient statistics in a few seconds. We studied the relative merit of three optimizations: computing a diagonal matrix instead of a full matrix to reduce CPU time, avoiding joins in SQL and taking advantage of a multi-core CPU, if available. Finally, a profile of the UDF at run-time shows the UDF spends most of the time waiting on disk I/O when the data set is not cached. When the data set is cached, managing matrix arrays in main memory on each thread is the main performance bottleneck; contrary to intuition, the quadratic number of floating point operations is only the second major bottleneck. In short, our UDF exploiting caching and sampling can efficiently compute sufficient statistics on large data sets, even if they cannot fit in main memory.

There exist several issues for future work. Other statistical techniques can follow the same approach: finding matrices that summarize large data sets to build a model, efficiently computing such matrices with aggregate UDFs, and exploiting DBMS features and hardware. We want to investigate how the DBMS can control storage of a small array corresponding to a matrix in L1 or L2 cache memory; such an optimization is beyond the UDF API. We need to develop accurate cost models for UDF evaluation considering caching, multi-threaded execution and sampling. It is feasible that our sampling techniques can enable online model computation with increasing accuracy as more samples are collected.

Acknowledgment

This research work was partially supported by US National Science Foundation Grants CCF 0937562 and IIS 0914861.

References

[1] J. Adibi, T. Barrett, S. Bhatt, H. Chalupsky, J. Chame, M.W. Hall, Processing-in-memory technology for knowledge discovery algorithms, in: Proc. ACM DaMoN Workshop, 2006.
[2] F.N. Afrati, P.V. Lekeas, C. Li, Adaptive-sampling algorithms for answering aggregation queries on web sites, Data and Knowledge Engineering 64 (2) (2008) 462–490.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. VLDB Conference, 1994, pp. 487–499.
[4] H. Akcan, A. Astashyn, H. Brönnimann, Deterministic algorithms for sampling count data, Data and Knowledge Engineering 64 (2) (2008) 405–418.
[5] P. Bradley, U. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in: Proc. ACM KDD Conference, 1998, pp. 9–15.
[6] B. Catanzaro, N. Sundaram, K. Keutzer, Fast support vector machine training and classification on graphics processors, in: Proc. ICML, 2008, pp. 104–111.
[7] S. Chaudhuri, G. Das, U. Srivastava, Effective use of block-level sampling in statistics estimation, in: Proc. ACM SIGMOD Conference, 2004, pp. 287–298.
[8] S. Chitroub, A. Houacine, B. Sansal, A new PCA-based method for data compression and enhancement of multi-frequency polarimetric SAR imagery, Intelligent Data Analysis 6 (2) (2002) 187–207.
[9] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, P. Lohman, Non-stop SQL/MX primitives for knowledge discovery, in: Proc. ACM KDD Conference, 1999, pp. 425–429.
[10] W.G. Cochran, Sampling Techniques, third ed., Wiley, 2007.
[11] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, C. Welton, MAD skills: new analysis practices for big data, in: Proc. VLDB Conference, 2009, pp. 1481–1492.
[12] C. Ding, X. He, K-means clustering via principal component analysis, in: Proc. ICML Conference, 2004, p. 29.
[13] R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, third ed., Addison-Wesley, Redwood City, California, 2000.
[14] H. Garcia-Molina, J.D. Ullman, J. Widom, Database Systems: The Complete Book, first ed., Prentice Hall, 2001.
[15] J. Gehrke, V. Ganti, R. Ramakrishnan, BOAT-optimistic decision tree construction, in: Proc. ACM SIGMOD Conference, 1999, pp. 169–180.
[16] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y. Chen, P. Dubey, A characterization of data mining workloads on a modern processor, in: Proc. ACM DaMoN Workshop, 2005.
[17] G. Graefe, U. Fayyad, S. Chaudhuri, On the efficient gathering of sufficient statistics for classification from large SQL databases, in: Proc. ACM KDD Conference, 1998, pp. 204–208.
[18] J. Gray, A. Bosworth, A. Layman, H. Pirahesh, Data cube: a relational aggregation operator generalizing group-by, cross-tab and sub-total, in: Proc. ICDE Conference, 1996, pp. 152–159.
[19] J. Han, M. Kamber, Data Mining: Concepts and Techniques, first ed., Morgan Kaufmann, San Francisco, 2001.
[20] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proc. ACM SIGMOD Conference, 2000, pp. 1–12.
[21] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning, first ed., Springer, New York, 2001.
[22] Z. He, B.S. Lee, R. Snapp, Self-tuning cost modeling of user-defined functions in an object-relational DBMS, ACM Transactions on Database Systems 30 (3) (2005) 812–853.
[23] M. Hsieh, M. Chen, P.S. Yu, Approximate query processing in cube streams, IEEE Transactions on Knowledge and Data Engineering (TKDE) 19 (11) (2007) 1557–1570.
[24] S. Joshi, C. Jermaine, Materialized sample views for database approximation, IEEE Transactions on Knowledge and Data Engineering (TKDE) 20 (3) (2008) 337–351.
[25] R. Kohavi, Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid, in: Proc. KDD Conference, 1996, pp. 202–207.
[26] X. Lin, Q. Zhang, Y. Yuan, Q. Liu, Error minimization in approximate range aggregates, Data and Knowledge Engineering 62 (1) (2007) 156–176.
[27] S. Manegold, P.A. Boncz, M.L. Kersten, Optimizing main-memory join on modern hardware, IEEE Transactions on Knowledge and Data Engineering (TKDE) 14 (4) (2002) 709–730.
[28] R. Meo, G. Psaila, S. Ceri, An extension to SQL for mining association rules, Data Mining and Knowledge Discovery 2 (2) (1998) 195–224.
[29] A. Netz, S. Chaudhuri, U. Fayyad, J. Berhardt, Integrating data mining with SQL databases: OLE DB for data mining, in: Proc. IEEE ICDE Conference, 2001, pp. 379–387.
[30] C. Ordonez, Building statistical models and scoring with UDFs, in: Proc. ACM SIGMOD Conference, 2007, pp. 1005–1016.
[31] C. Ordonez, Models for association rules based on clustering and correlation, Intelligent Data Analysis 13 (2) (2009) 337–358.
[32] C. Ordonez, Z. Chen, Evaluating statistical tests on OLAP cubes to compare degree of disease, IEEE Transactions on Information Technology in Biomedicine (TITB) 13 (5) (2009) 756–765.
[33] C. Ordonez, J. García-García, Vector and matrix operations programmed with UDFs in a relational DBMS, in: Proc. ACM CIKM Conference, 2006, pp. 503–512.
[34] S. Sarawagi, S. Thomas, R. Agrawal, Integrating association rule mining with relational database systems: alternatives and implications, in: Proc. ACM SIGMOD Conference, 1998, pp. 343–354.
[35] K. Sattler, O. Dunemann, SQL database primitives for decision tree classifiers, in: Proc. ACM CIKM Conference, 2001, pp. 379–386.
[36] X. Shang, K. Sattler, I. Geist, SQL based frequent pattern mining without candidate generation, in: Proc. ACM SAC Conference, 2004, pp. 618–619.
[37] H. Toivonen, Sampling large databases for association rules, in: Proc. VLDB Conference, 1996, pp. 134–145.
[38] R. Vilalta, I. Rish, A decomposition of classes via clustering to explain and improve Naive Bayes, in: Proc. ECML Conference, 2003, pp. 444–455.
[39] H. Wang, C. Zaniolo, User defined aggregates in object-relational systems, in: Proc. ICDE Conference, 2000, pp. 135–144.
[40] H. Wang, C. Zaniolo, C.R. Luo, ATLaS: a small but complete SQL extension for data mining and data streams, in: Proc. VLDB Conference, 2003, pp. 1113–1116.
[41] A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, S. Subramanian, Spreadsheets in RDBMS for OLAP, in: Proc. ACM SIGMOD Conference, 2003, pp. 52–63.
[42] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proc. ACM SIGMOD Conference, 1996, pp. 103–114.


Carlos Ordonez received a degree in applied mathematics and an M.S. degree in computer science from UNAM University, Mexico, in 1992 and 1996, respectively. He received a Ph.D. degree in Computer Science from the Georgia Institute of Technology, USA, in 2000. He is currently an Assistant Professor at the University of Houston. His research is centered on the integration of machine learning techniques into database systems and their application to scientific problems.

Sasi K. Pitchaimalai received his B.E. degree in Computer Science and Engineering from Anna University in 2006 and his M.S. in Computer Science from the University of Houston in 2008, where he is currently pursuing a Ph.D. in Computer Science. His research interests involve high-performance data mining inside database systems.