
High-performance data mining with skeleton-based structured parallel programming

Massimo Coppola *, Marco Vanneschi

Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy

Received 11 March 2001; received in revised form 20 November 2001

Abstract

We show how to apply a structured parallel programming (SPP) methodology based on skeletons to data mining (DM) problems, reporting several results about three commonly used mining techniques, namely association rules, decision tree induction and spatial clustering. We analyze the structural patterns common to these applications, looking at application performance and software engineering efficiency. Our aim is to clearly state what features an SPP environment should have to be useful for parallel DM. Within the skeleton-based PPE SkIE that we have developed, we study the different patterns of data access of parallel implementations of Apriori, C4.5 and DBSCAN. We need to address large partition reads, frequent and sparse access to small blocks, as well as an irregular mix of small and large transfers, to allow efficient development of applications on huge databases. We examine the addition of an object/component interface to the skeleton-structured model, to simplify the development of environment-integrated, parallel DM applications.

© 2002 Elsevier Science B.V. All rights reserved.

Keywords: High performance computing; Structured parallel programming; Skeletons; Data mining; Association rules; Clustering; Classification

1. Introduction

In recent years the process of knowledge discovery in databases (KDD) has been widely recognized as a fundamental tool to improve results in both the

www.elsevier.com/locate/parco

Parallel Computing 28 (2002) 793–813

*Corresponding author. Tel.: +39-50-221-2728; fax: +39-50-221-2726.

E-mail addresses: [email protected] (M. Coppola), [email protected] (M. Vanneschi).

URL: http://www.di.unipi.it/~coppola.

0167-8191/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.

PII: S0167-8191(02)00095-9

industrial and the research field. Parallel computing is a key resource in enhancing the performance of applications and computer systems to match the computational demands of the data mining (DM) phase for huge electronic databases. The exploitation of parallelism is often restricted to specific research areas (scientific calculations) or subsystem implementation (database servers) because of the practical difficulties of parallel software engineering. Parallel applications for industry have to be (1) efficiently developed and (2) easily portable, characteristics that traditional low-level approaches to parallel programming lack. The work of our research group has been directed at addressing the issue of parallel software engineering and shortening the time-to-market of parallel applications. The use of structured parallel programming (SPP) and high-level parallel programming environments (PPE) are the main resources in this perspective. The structured approach has been fostered and supported by several research and development projects, which resulted in the P3L language and the SkIE PPE [1–3].

Here we present our analysis of a significant set of DM techniques, which we have ported from sequential to parallel with SkIE. We report our experiences [4–7] with the problems of association rule extraction, classification and spatial clustering. We have developed three prototype applications by restructuring sequential code into structured parallel programs. The SPP approach of the SkIE coordination language is evaluated against the engineering and performance issues of these I/O- and computationally intensive DM kernels. We also examine object-oriented additions to the skeleton programming model. Shared objects are used as a tool to simplify the implementation of parallel, out-of-core classification algorithms, easing the management of huge data in remote and mass memory. We show that the improvements in program design and maintenance do not impair application performance. The next-generation PPE, called ASSIST [8], will provide remote objects as a common interface to access external libraries, servers, shared data structures and computational grids. The need for a tighter integration of high-performance DM systems with the support for data management is well recognized in the literature [9,10]. We believe that the SPP approach and the availability of standard interfaces within the PPE will simplify the development of integrated parallel KDD environments. The common implementation schemes that emerge, as well as the performance results that we show, support the validity of a structured approach for DM applications.

The next section explains the basics of SPP models, giving an overview of the field, of our research and of the SkIE PPE, as well as a short comparison of the computer architectures we ran our tests on. Section 3 draws the general framework of sequential and parallel DM, and contains some general definitions. Section 4 examines the first prototype, parallel partitioned Apriori: definitions of the problem and the algorithm, a summary of closely related work, a description of the parallel structure, and an analysis of the test results. The same organisation is given to Section 5, about parallel clustering with DBSCAN, and Section 6, which describes a parallel C4.5 classifier employing a shared-object abstraction. Section 7 discusses the advantages of structured programming in light of the experiments we present. In Section 8 conclusions are drawn, and an object interface to external, shared mechanisms is proposed on the grounds of the experience gained.


2. Structured parallel programming

The software engineering problems we mentioned in Section 1 follow from the fact that most high-performance computing technologies do not fully obey the principles of modular programming [11]. The PPE SkIE, stemming from our work on parallel programming models and languages, is based on the concept of a parallel coordination language. Coordination in SkIE follows the parallel skeleton programming model [12,13]. The global structure of the program is expressed by the constructs of the language, providing a high-level description that is machine independent. Skeleton-based models have many powerful features, like compositionality, performance models and semantics-preserving transformations that allow the definition of optimization techniques. The structured approach to coordination merges these advantages with a greater ease of software reuse.

Because of the compositional nature of skeletons, parallel modules can nest inside each other to build complex parallel structures from simple, basic ones. Interaction through clearly defined interfaces makes the implementations of different parallel and sequential modules independent of each other. The concept of a module interface also eases the interaction between different sequential host languages (like C/C++, Fortran, Java) and the environment. The properties of the underlying skeleton model can be exploited for global optimizations, while retaining the existing sequential tools and optimizations for the purely sequential modules. All the low-level details of communication and parallelism management are left to the language support.

The SkIE-CL coordination language provides the user with a subset of the parallel skeletons studied in the literature. We give the informal semantics of the ones that we will use in the paper, with graphical representations shown in Fig. 1. The general semantics of the skeletons is data-flow like, with packets of data we call tasks streaming between the interfaces of linked program modules. The simplest skeleton, the seq, is a means to encapsulate sequential code from the various host languages into a modular structure with well-defined interfaces. Pipeline composition of different stages of a function is realized by the pipe skeleton. Independent functional evaluation over the tasks of a stream, in a load-balancing fashion, is expressed through the farm skeleton. The worker module contained in the farm is seamlessly replicated, each copy operating on a subset of the input stream. The loop skeleton is used to define cyclic, possibly interleaved, data-flow computations, where tasks are repeatedly computed until they satisfy the condition evaluated by a sequential test module.
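The stream semantics of seq, pipe, farm and loop can be sketched with a toy, purely sequential emulation. This is illustrative Python, not SkIE-CL; all names are ours, and the real templates run these constructs as parallel process networks rather than generator chains.

```python
# Minimal sequential emulation of the seq, pipe, farm and loop skeleton
# semantics described above (illustrative only; not SkIE-CL syntax).

def seq(f):
    """Wrap sequential code as a stream transformer (the seq skeleton)."""
    def stage(stream):
        for task in stream:
            yield f(task)
    return stage

def pipe(*stages):
    """Compose stages so tasks flow through them in order (pipe skeleton)."""
    def composed(stream):
        for s in stages:
            stream = s(stream)
        return stream
    return composed

def farm(worker, n_workers=4):
    """The farm replicates a worker over independent tasks; here the
    replication is only notional, since tasks are independent the
    sequential result is the same."""
    return seq(worker)

def loop(body, done):
    """Recompute each task with `body` until the `done` test holds."""
    def stage(stream):
        for task in stream:
            while not done(task):
                task = body(task)
            yield task
    return stage

# Example: square each task in a farm, then increment until >= 10 in a loop.
program = pipe(farm(lambda x: x * x),
               loop(lambda x: x + 1, lambda x: x >= 10))
print(list(program(iter([1, 2, 3, 4]))))  # -> [10, 10, 10, 16]
```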

Fig. 1. A subset of the parallel skeletons available in SkIE.


The map skeleton is used to define data-parallel computations over the portions of a single data structure, which is distributed to a set of virtual processors (VP) according to a decomposition rule. The tasks output by the VPs are recomposed into a new structure which is the output of the map.

2.1. Implementation issues and performance portability

The templates used to implement the skeleton semantics in SkIE are parametric nets of processes which the compiler instantiates and maps onto the target parallel machine. Appropriate techniques are used to hide communication latency and overheads. The current implementation uses static templates, which are optimized at compile time.

Performance portability across parallel platforms is a feature of SPP-PPEs. The results shown in the rest of the paper come from tests we have made on four parallel machines belonging to different architectural classes. Low-cost clusters of workstations and full-fledged parallel machines are represented, which differ in the memory model and in the relative performance of computation, communication and mass-memory support. The first and most generally available platform is Backus, a cluster of 16 Linux workstations (COW) connected by a Fast Ethernet crossbar switch. The CS-2, from QSW, is a multiprocessor architecture with distributed memory, dual-processor nodes and a fat-tree network. The Cray T3E is a massively parallel processor (MPP) with non-uniform-access shared memory supported in hardware. The last architecture in our list is a 4-CPU symmetric multiprocessor (SMP) with uniform memory access (UMA). This kind of parallel architecture is not scalable, but it is often used as a building block for larger, distributed-memory clusters.

On the one hand, we might order the various platforms by the raw computing performance of their CPUs: we would find the CS-2, the SMP, the COW and then the T3E, which is the fastest. On the other hand, the computation-to-communication bandwidth ratio has a more profound impact on parallelism exploitation. By this measure the COW is outweighed by the true parallel machines, and the SMP clearly offers the fastest communications. Finally, I/O speed and scalability are key factors for DM applications. The CS-2 and the COW have local disks in each node in addition to the shared network file system. The distributed I/O capability, even if not a parallel file system, allows for implementing more scalable applications. The fastest local I/O is on the COW, followed by the CS-2, while the SMP is sometimes impaired by its single mass-memory interface. Sustained, irregular and highly parallel I/O on the T3E, in the configuration we used, leads to high latency and insufficient bandwidth.

3. Data mining and integrated environments

The goal of a DM algorithm is to find an optimal description of the input data within a fixed model space, obeying a model cost measure. Each model description language defines a model space, for which one or more DM algorithms exist. A number of interacting factors determine the quality and usefulness of the results: selection and preprocessing of the raw input data, choice of the kind of model, choice of the algorithm to run, and tuning of its execution parameters. In order to select the best combination, the KDD process involves repeated execution of the DM step, supervised by human experts and meta-learning algorithms. Fast DM algorithms are essential, as is an efficient coupling between them and the software managing the data. This problem is manifest in parallel DM, where the I/O bandwidth and communications are two balancing terms of parallelism exploitation [14].

To exploit parallelism at all levels, from the algorithm down to the I/O system, thus removing any bottleneck, a higher degree of integration has already been advocated in the literature [9,15]. Ideally, parallel implementations of the DM algorithms, the file system, the DBMS and the data warehouse should seamlessly cooperate with each other and with visualisation and meta-learning tools. Some high-performance, integrated systems for DM are already being developed for the parallel [10] and the distributed setting [16]. Other works like [17] concentrate on requirements for the data transport layer in parallel and distributed DM. We want to address the system integration issues through the use of a PPE. Besides simplifying software development, a PPE should provide standard interfaces to conventional and parallel file systems and to database services.

Assuming there is an underlying data management and warehousing effort, many DM algorithms use a tabular organisation of data. Each row of the table is a data item, while the columns are the various attributes of the object. The stored objects may be points, sets of related measurements, or fields extracted from a database record. The attributes can be integers, real values, labels or boolean values. Using market basket analysis as a practical example, each ''object'' is a commercial transaction in a store. In the case of clustering, data are usually points in a space R^a, each row being a point and each of the a attributes a spatial coordinate value. In the rest of the paper, D is the input database, N is its number of rows, a the number of attributes, or columns. The number of rows in a horizontal partition is n, when appropriate, and p is the degree of parallelism.

DM algorithms use the input data to build a solution (a point in the model space), but in some cases the intermediate representation may be even larger than the input. We usually have to partition the input, the model space, or the solution data to exploit parallelism and to manage the workload over the available resources. Horizontal partitioning divides the data by rows; vertical partitioning divides it by columns, breaking up the input records. Either approach may suit a particular DM technique for I/O and algorithmic reasons. Of course, besides these two simple schemes, other parallel organisations arise from the coordinated decomposition of the input data and the structure of the search space [14].
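The two partitioning schemes can be illustrated on the tabular layout just described (N rows, a columns). This is a hypothetical Python sketch; the helper names are ours:

```python
# Illustrative horizontal vs vertical partitioning of a table of N rows.
# Hypothetical helpers, not part of any system described in the paper.

def horizontal_partition(table, p):
    """Split the rows of the table into p nearly equal blocks."""
    n = (len(table) + p - 1) // p  # rows per block, rounded up
    return [table[i:i + n] for i in range(0, len(table), n)]

def vertical_partition(table, col_groups):
    """Split each row into projections over the given column groups."""
    return [[[row[c] for c in cols] for row in table] for cols in col_groups]

D = [[1, 10.5, 'A'], [2, 3.2, 'B'], [3, 7.7, 'A'], [4, 0.1, 'C']]
print(horizontal_partition(D, 2))            # two blocks of 2 rows each
print(vertical_partition(D, [[0], [1, 2]]))  # id column vs. attribute columns
```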

4. Apriori association rules

The problem of association rule mining (ARM), first proposed in 1993, has its classical application in market basket analysis. From the sales database we want to detect rules of the form AB ⇒ C, meaning that a customer who buys objects A and B together also buys C with some minimum probability. We refer the reader to the complete description of the problem given in [18], while we concentrate on the computationally hard subproblem of finding frequent sets.

In the ARM terminology the database D is made up of transactions (the rows), each one consisting of a unique identifier and a number of boolean attributes from a set I. The attributes are called items, and a k-itemset contained in a transaction r is a set of k items which are true in r. The support σ(X) of an itemset X is the proportion of transactions that contain X. Given D, the set of items I, and a fixed real number 0 < s < 1, called the minimum support, the solution of the frequent set problem is the collection {X | X ⊆ I, σ(X) ≥ s} of all itemsets that have at least the minimum support. The support information of the frequent sets can be used to infer all the valid association rules in the input. The power set P(I) of the set of items has a lattice structure naturally defined by the set inclusion relation. A level in this lattice is the set of all itemsets with an equal number of elements. The minimum support property is preserved over decreasing chains: (σ(X) > s) ∧ (Y ⊆ X) ⇒ σ(Y) > s. Computing the support count for a single itemset requires a linear scan of D. The database is often in the order of gigabytes, and the number of potentially frequent itemsets, 2^|I|, usually exceeds the available memory. To efficiently compute the frequent sets, their structure and properties have to be exploited.
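The definitions above can be made concrete with a brute-force sketch (our own illustrative Python, usable only on toy data since it enumerates all 2^|I| itemsets; the whole point of Apriori is to avoid this blow-up):

```python
from itertools import combinations

# Brute-force frequent sets as defined above: sigma(X) is the fraction of
# transactions containing X; the solution is { X ⊆ I : sigma(X) >= s }.

def support(D, X):
    """sigma(X): proportion of transactions of D that contain itemset X."""
    return sum(1 for t in D if X <= t) / len(D)

def frequent_sets(D, items, s):
    """Enumerate every itemset and keep those with support >= s."""
    result = {}
    for k in range(1, len(items) + 1):
        for X in combinations(sorted(items), k):
            sigma = support(D, frozenset(X))
            if sigma >= s:
                result[frozenset(X)] = sigma
    return result

# Toy database: 4 transactions over items {a, b, c, d}.
D = [frozenset('abc'), frozenset('ac'), frozenset('abd'), frozenset('bc')]
F = frequent_sets(D, set('abcd'), 0.5)
print(sorted(''.join(sorted(X)) for X in F))
```

Note that the output respects the anti-monotonicity property stated above: every subset of a frequent itemset is itself frequent.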

We classify algorithms for ARM according to their lattice exploration strategy. Sequential and parallel solutions differ in the way they arrange the exploration, in how they distribute the data structures to minimize computation, I/O and memory requirements, and in the fraction of the lattice they explore that is not part of the solution. In the following we restrict our attention to the Apriori algorithm, described in [18], and its direct evolutions.

Apriori builds the lattice level-wise and bottom-up, starting from the 1-itemsets and using as a pruning heuristic the fact that non-frequent itemsets cannot have frequent supersets. From each level Lk of frequent itemsets, a set of candidates Ck+1 is derived. The support of all the candidates is verified on the data to extract the next level of frequent itemsets Lk+1. Apriori is a breakthrough w.r.t. the naive approach, but some issues arise when applying it to huge data. A linear scan of D is required for each level of the solution. The underlying assumption is that the itemsets in Ck are far fewer than all the possible k-itemsets, but this is often false for k = 2, 3, because the pruning heuristic does not apply well. Computing the support values for Ck becomes quite hard if Ck is large. A review of several variants of sequential Apriori, which aim at correcting these problems, is given in [19]. A view of the theoretical background of the frequent itemset problem and its connections to other problems in machine learning can be found in [20].
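The level-wise scheme just described can be sketched as follows. This is a minimal illustrative Python version of the textbook algorithm, not the paper's (hash-tree based) implementation:

```python
from itertools import combinations

# Level-wise Apriori sketch: from each level L_k of frequent k-itemsets,
# generate candidates C_{k+1}, prune those with a non-frequent k-subset,
# then count the survivors' support with one scan of D.

def apriori(D, s):
    n = len(D)
    counts = {}
    for t in D:                       # count the 1-itemsets
        for i in t:
            X = frozenset([i])
            counts[X] = counts.get(X, 0) + 1
    L = {X for X, c in counts.items() if c / n >= s}
    solution = set(L)
    k = 1
    while L:
        # Join step: unions of frequent k-itemsets giving (k+1)-itemsets.
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent.
        C = {X for X in C
             if all(frozenset(Y) in L for Y in combinations(X, k))}
        # Support counting: one linear scan of D per level.
        counts = {X: sum(1 for t in D if X <= t) for X in C}
        L = {X for X, c in counts.items() if c / n >= s}
        solution |= L
        k += 1
    return solution

D = [frozenset('abc'), frozenset('ac'), frozenset('abd'), frozenset('bc')]
print(sorted(''.join(sorted(X)) for X in apriori(D, 0.5)))
```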

4.1. Related work on parallel association rules

We studied the partitioned variant of ARM introduced in [21], which is a two-phase algorithm. The data is horizontally partitioned into blocks that fit inside the available memory, and frequent sets are identified separately in each block, with the same relative value of s. The union of the frequent sets of all the blocks is a superset of the true solution. The second phase is a linear scan of D to compute the support counts of all the elements in the approximate solution. As in [21], we obtain the frequent sets with only two I/O scans. Phase II is efficient, and so is the whole algorithm, provided the approximation built in phase I is not too coarse. This holds as long as the blocks are not too small w.r.t. D, and the data distribution is not too skewed. The essential limits of the partitioned scheme are that both the intermediate solution and the data have to fit in memory, and that too small a block size causes data skew. The clear advantage for the parallelisation is that almost all work is done independently on each partition.

Following [22], we can classify the parallel implementations of Apriori into three main classes, Count, Data and Candidate Distribution, according to the interplay of the partitioning schemes for the input and the Ck sets. We have applied the two-phase partitioned scheme without the vertical representation described in [21], using a sequential implementation of Apriori as the core of the first phase. Count Distribution solutions horizontally partition the input among the processors, and use global communications once per level to compute the candidate support. Although it is more asynchronous and efficient, the parallel implementation of partitioned Apriori asymptotically behaves like Count Distribution w.r.t. the parameters of the algorithm. It is quite scalable with the size of D, but it cannot deal with huge candidate sets or frequent sets, i.e. it is not scalable with lower and lower values of the support parameter s.

4.2. Parallel structure

The structure of the partitioned algorithm is clearly reflected in the skeleton composition we have used, which is shown in Figs. 2 and 3a. The two phases are connected within a pipe skeleton. Since there is no parallel activity between them, they are in fact mapped on the same set of processors. The common internal scheme of the two phases is a three-stage pipeline. The first module within the inner pipe reads the input and controls the computation. The second module is a farm containing p seq workers running the Apriori code. The third module is sequential code performing a stream reduction, computing the sum of the results. In phase I the results are hash-tree structures containing all the locally frequent sets, and they are summed into a hash tree containing the union of the local solutions. Phase II has a simpler code

Fig. 2. SkIE code of the partitioned ARM prototype.


in the workers, and the results are arrays of support counts that are added together to compute the global support for all the selected itemsets.
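The stream reduction performed by the third stage can be sketched as follows. This is an illustrative Python model in which the phase-I partial results are plain dicts (itemset to local count) standing in for the paper's hash-tree structures; all names are ours:

```python
from functools import reduce

# Sketch of the third pipeline stage: partial results streaming in from
# the farm workers are folded, one at a time, into a single accumulator.

def merge_counts(acc, partial):
    """Sum one worker's partial support counts into the accumulator."""
    for itemset, count in partial.items():
        acc[itemset] = acc.get(itemset, 0) + count
    return acc

worker_results = [
    {frozenset('ab'): 2, frozenset('a'): 3},   # from worker 1
    {frozenset('ab'): 1, frozenset('b'): 2},   # from worker 2
]
totals = reduce(merge_counts, worker_results, {})
print(totals[frozenset('ab')])  # -> 3
```

In phase I the same fold computes the union of the local solutions; in phase II it sums the per-worker support-count arrays.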

Since we initially wanted to test the application without assuming the availability of parallel access to disk-resident data, we used sequential modules to interface with the file system and to distribute the input partitions to the other modules. On the COW, where local disks were available and the network performance was inadequate to that of the processors, we also implemented distributed I/O in the workers (see Fig. 3a) by replicating the data over all the disks and retaining the farm for its load-balancing characteristics.

4.3. Results

The partitioned Apriori we realized in SkIE is a very good example of the advantages of SPP. A sequential source code has been restructured into a modular parallel application, whose code is less than 25% larger and reuses 90% of the original. The development times were also quite short, as reported in [4]. The test results of Figs. 4 and 5 are consistent over a range of different architectures.

We used the synthetic dataset generator from the Quest project, whose underlying model is explained in [18], choosing |I| = 1000, average frequent sets of six items and a transaction length of 20. With these parameters, huge Ck sets are generated even for a high value of the minimum support. Values of N = 1, 4, 12 million result

Fig. 3. (a) Skeleton structure of the partitioned ARM prototype. (b) SMP speed-up.

Fig. 4. (a) CS-2 speed-up. (b) T3E completion time, 10M transactions (1.8 Gb).


in datasets of 90, 360 and 1260 Mbytes being produced (twice as much on the T3E, which is a 64-bit architecture).

The CS-2 architecture already shows good behaviour with a small dataset and low load, see Fig. 4a. By comparison, because of its slower communications the COW has a lower efficiency, as the two corresponding curves in Fig. 5a show. A better performance is obtained by removing the I/O bottleneck, thus increasing the computation-to-communication ratio and shortening the startup times w.r.t. the overall computation. In Fig. 5a we find the efficiency results with distributed I/O on the COW. The speedup graphs in Fig. 5b also show that the application is scalable on the COW at higher computational loads. The same application runs on the SMP (Fig. 3b), where an almost ideal speedup of 3.8 is reached with p = 6. The T3E results of Fig. 4b with s = 2% do not look satisfying. A profiling of the running times has shown that the problem lies in the interaction with the file server, a high fixed overhead that becomes less severe at higher workloads, as the behaviour for s = 0.5% shows.

5. DBSCAN spatial clustering

Clustering is the problem of grouping input data into sets in such a way that a similarity measure is high for objects in the same cluster, and low elsewhere. In spatial clustering the input data are seen as points in a suitable space R^a, and the discovered clusters should describe their spatial distribution. Many kinds of data can be represented this way, and their similarity in the feature space can be mapped to a concrete meaning, e.g. for spectral data the similarity of two real-world signals. A high dimension a of the data space is quite common and can lead to performance problems [23]. Usually, the spatial structure of the data has to be exploited by means of appropriate index structures, to enhance the locality of data accesses. DBSCAN is a density-based spatial clustering technique [24], whose parallel form we recently studied in [7]. Density-based clustering identifies clusters from the density of objects in the feature space. In the case of DBSCAN, computing densities in R^a means counting points inside a given region of the space.

Fig. 5. (a) Program efficiency over the Linux COW and the CS-2 with support set at 2%. (b) COW parallel speed-up w.r.t. parallelism and varying load.


The key concept of the algorithm is that of a core point. Given two user parameters ε and MinPts, a core point has at least MinPts other data points within a neighborhood of radius ε. A suitable relation can be defined among the core points that allows us to identify dense clusters made up of core points. We assign non-core points to the boundaries of neighboring clusters, or we label them as noise. To assign cluster labels to all the points, DBSCAN repeatedly searches for a core point, then explores the whole cluster it belongs to. The process is much like a graph visit, where connected points are those closer than ε, and the visit recursively explores all reached core points. When a point in the cluster is considered as a candidate, its neighborhood points are counted. If they are enough, the point is labelled and its neighbors are then put in the candidate queue.
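The visit just described can be sketched in illustrative Python, with a linear scan standing in for the R*-tree region query of the real implementation (eps and min_pts correspond to the ε and MinPts parameters; labels 0, 1, ... are cluster ids and -1 marks noise):

```python
from collections import deque

def region_query(points, q, eps):
    """All point ids within distance eps of q (a linear-scan stand-in
    for the R*-tree spatial query)."""
    return [i for i, p in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = region_query(points, points[i], eps)
        if len(neigh) < min_pts:
            labels[i] = -1           # noise (may be relabelled as border)
            continue
        labels[i] = cluster          # i is a core point: start a cluster
        queue = deque(neigh)
        while queue:                 # graph-like visit of the whole cluster
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = region_query(points, points[j], eps)
            if len(jn) >= min_pts:   # j is a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(dbscan(pts, 1.5, 3))  # -> [0, 0, 0, 1, 1, 1, -1]
```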

DBSCAN holds the whole input set inside the R*-tree spatial index structure [25]. Data are kept in the leaves of a secondary-memory tree with an ad hoc directory organisation and algorithms for building, updating and searching the structure. Given two hypotheses that we will detail in the following, the R*-tree can answer spatial queries (what are the points in a given region) with time and I/O complexity proportional to the depth of the tree, which is O(log N). For each point in the input there is exactly one neighborhood retrieval operation, so the expected complexity of DBSCAN is O(N log N).

The first hypothesis needed is that almost all regions involved in the queries are small w.r.t. the dataset, hence the search algorithm needs to examine only a small number of leaves of the R*-tree. We can assume that the ε parameter is not set to a neighborhood radius comparable to that of the whole dataset. The second hypothesis is that a suitable value for ε exists. It is well known that all spatial data structures lose efficiency as the dimension a of the space grows, in some cases already for a > 10. The R*-tree can easily be replaced with any improved spatial index that supports neighborhood queries, but for a high value of a this may not lead to an efficient implementation anyway. It has been argued in [23], and it is still a matter of debate, that for higher and higher dimensional data the concept of a neighborhood of fixed radius progressively loses its meaning for the sake of spatial organisation of the data. As a consequence, for some distributions of the input, the worst-case performance of good spatial index structures is that of a linear scan of the data [26]. For those cases where applying spatial clustering to huge and high-dimensional data produces useful results, but requires too much time, a parallel implementation is the practical way to speed up DBSCAN.

5.1. Parallel structure

The region queries to the R*-tree are the first issue to address to enhance DBSCAN. The method definition guarantees that the shape of clusters is invariant w.r.t. the order of selection of points inside a cluster, so we have chosen a parallel visit strategy, with several independent operations on the R*-tree at the same time. A Master process executes the sequential algorithm, delegating all the neighborhood retrievals to a Slave process. By relaxing the ordering constraint on the answers from the R*-tree, two kinds of parallelism are exploited in this scheme:


pipelining (pipe) between Master and Slaves, and independent parallelism (farm) among several Slave modules, each one holding a copy of the R*-tree. We see the resulting structure in Fig. 6. This is a proper data-flow computation on a stream of tasks, with a loop skeleton used to make the results flow back to the beginning.

Two factors make the structure effective in decoupling and distributing the workload to the Slaves. First, the Master selects single points of the input, without using spatial queries; second, the Slaves do not need accurate information about the cluster labelling. All the Slaves need to search the R*-tree structure, which is actually replicated. While in the sequential algorithm no already labelled point is inserted again into the visit queue, the process of oblivious parallel expansion of different parts of a cluster may repeatedly generate the same candidates. The Master process checks this condition when enqueuing candidates for the visit, but this parallel overhead is too high if all the neighboring points are returned each time a region query is made.

Two filtering heuristics [7] are used in the Slaves to prune the returned set of points. Neighbors of non-core points do not become candidates for the current cluster. Previously returned points, on the other hand, are surely present in the visit queue, or already labelled, so they are not sent again to the Master. We let the Slaves maintain information about the points already returned by previous answers. Fig. 6c shows that, for the degrees of parallelism in the tests, pruning based on local information only is enough to avoid computation and communication overheads in the Master.
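The two pruning heuristics can be sketched as follows; this is an illustrative reconstruction, assuming each Slave keeps a local set of already-returned point identifiers, and the names are ours, not those of the implementation in [7]:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Per-Slave filter implementing the two pruning heuristics.
// 'neighbours' is the raw answer of a region query for one point;
// minPts is the DBSCAN density threshold.
class SlaveFilter {
public:
    // Returns only the points worth sending back to the Master:
    // - neighbours of a non-core point are not candidates for expansion;
    // - points already returned in a previous answer are dropped, since
    //   they are surely in the Master's visit queue or already labelled.
    std::vector<std::size_t> filter(const std::vector<std::size_t>& neighbours,
                                    std::size_t minPts) {
        std::vector<std::size_t> out;
        if (neighbours.size() < minPts)   // non-core point: no expansion
            return out;
        for (std::size_t id : neighbours)
            if (returned_.insert(id).second)  // true only on first sight
                out.push_back(id);
        return out;
    }

private:
    std::unordered_set<std::size_t> returned_;  // local knowledge only
};
```

Because each Slave filters on local knowledge only, the same point can still be returned once by several different Slaves; the Master resolves those residual duplicates when enqueuing candidates.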

5.2. Results

The results reported are from the COW platform, using up to 10 processors. The data are from the Sequoia 2000 benchmark database, a real-world dataset of 2-d geographical coordinates. DBSCAN was originally evaluated in [24] using samples from that database, D1 and D2, which hold 10% and 20% of the data. We also used the whole dataset (D3, 62 556 points) and a scaled-up version (D4, 437 892 points), made up of seven partially overlapping copies of the original dataset. The DBSCAN parameters were MinPts = 4 and ε ∈ {20000, 30000}. In Fig. 7a we report tests with p ∈ {6, 8}, the parallel degree being the number of Slaves. Efficiency (Fig. 7b) is computed w.r.t. the resources really used, i.e. p + 1. The performance gain from our

Fig. 6. (a) The SkIE code and (b) the skeleton composition for parallel DBSCAN. (c) Average number of points per query with filtering, vs parallelism and ε.


parallel visit scheme does not compensate the parallel overhead on the two small files, but speedup and efficiency are consistently high when the computational load is heavy. We have also verified, by setting ε = 100000, that a higher cost of the R*-Tree retrieval leads to nearly optimal efficiency even for file D1. This can easily be the case when the R*-Tree is out-of-core, or the dimension of the data space is high for the spatial index chosen.

Summing up, the parallel structure used is effective for high loads and large datasets which are impractical to deal with using the sequential algorithm. The D3 and D4 datasets, although small enough to be loaded in memory, require up to 8 h of sequential computation. By contrast, the efficiency of the parallel implementation rises with the load, thus ensuring a good behaviour of the parallel DBSCAN even when the data is out-of-core. It is also possible to devise a shared secondary-memory tree, with all the Slaves reading concurrently and caching data in memory when possible. Further investigation is needed to evaluate the limits, at higher degrees of parallelism, of the heuristics we used, and possibly to devise better ones. Like its sequential counterpart, the parallel DBSCAN is also general w.r.t. the spatial index used. It is thus easy to exploit any improvement or customization of the spatial data structure that may speed up the computation.

6. C4.5 classification

Given a set of objects with assigned class labels, the classification problem consists in building a behaviour model of the class attribute in terms of the other characteristics of the objects. Such a model can be used to predict the class of new, unclassified data. Many classifiers are based on induction of decision trees, and they rely on a common basic scheme. Quinlan's C4.5 classifier [27] is the starting point of our work. The rows in D are called cases, and each of the a columns holds either categorical attributes (discrete, finite domain) or continuous ones (i.e. real-valued). One of the categorical attributes is distinguished as the class of the case.

The decision tree T is a recursive tree model, where the root corresponds to the whole data set. Each interior node is a decision node, performing a test on the values

Fig. 7. (a) Parallel DBSCAN speedup vs dataset, p ∈ {6, 8} and ε ∈ {20000, 30000}. (b) Efficiency vs parallelism, for the D3, D4 files and ε ∈ {20000, 30000}.


of the attributes. The test partitions the cases among the child subtrees, a child for each different outcome of the test. The leaves of the tree are class-homogeneous subsets of the input. Hence a path from the root to any leaf defines a series of tests that all cases in that leaf verify, and implicitly assigns a class label to each case satisfying the tests.

C4.5 decision nodes test a single attribute. A child node is made for each different value of a categorical attribute, while boolean tests of the form x ≤ threshold are used for continuous attributes. The algorithm is divided into two main phases, the building phase and the pruning and evaluation one. The former, which is the most time consuming, is the one that has been parallelized, and it has basically a divide and conquer (D&C) structure. The building phase proceeds top-down, at each decision node choosing a new test by exhaustive search. For each attribute a cost function, the information gain (IG), is evaluated over the node data to select the most informative splitting. The building phase is a greedy search: it is based only on local evaluation and never backtracks. Building a node requires operating on the partition of the input that is associated to that node. The tree itself is a compact knowledge model, but the data partitions can be as large as the whole input. Ensuring efficiency and locality of data accesses is the main issue in building the decision tree.

Assuming that the data fit in memory, to evaluate the IG for a categorical attribute A, histograms are computed of the couples (class, A) in the current partition, which requires O(n) operations per column. For the continuous attributes, to compute the IG we need the class column to be sorted according to the attribute. The cost of repeated sorting (O(n log n) operations) accounts for most of the C4.5 running time. Partitioning the data according to the selected attribute then requires a further sorting step of the whole partition. When data does not fit in memory, the above complexity results are in terms of I/O operations and virtual memory page faults, and the in-core algorithm quickly becomes unusable. External-memory algorithms and memory-hierarchy aware parallel decompositions are needed to overcome this limitation.
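For a categorical attribute the IG evaluation reduces to entropy sums over the (class, A) contingency table built by the O(n) histogram pass. The following is a minimal sketch with hypothetical names, our own illustration rather than code from the C4.5 sources:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Entropy (in bits) of a vector of class counts.
static double entropy(const std::vector<std::size_t>& counts) {
    std::size_t n = 0;
    for (std::size_t c : counts) n += c;
    double h = 0.0;
    for (std::size_t c : counts)
        if (c > 0) {
            double p = static_cast<double>(c) / n;
            h -= p * std::log2(p);
        }
    return h;
}

// Information gain of splitting on a categorical attribute A, given the
// (class, A) contingency table: table[v][k] = number of cases with
// attribute value v and class k. Building the table costs O(n) per column.
double infoGain(const std::vector<std::vector<std::size_t>>& table) {
    std::size_t nClasses = table.empty() ? 0 : table[0].size();
    std::vector<std::size_t> classTotals(nClasses, 0);
    std::size_t n = 0;
    for (const auto& row : table)
        for (std::size_t k = 0; k < nClasses; ++k) {
            classTotals[k] += row[k];
            n += row[k];
        }
    double gain = entropy(classTotals);      // entropy before the split...
    for (const auto& row : table) {          // ...minus weighted entropy after
        std::size_t nv = 0;
        for (std::size_t c : row) nv += c;
        if (nv > 0)
            gain -= (static_cast<double>(nv) / n) * entropy(row);
    }
    return gain;
}
```

A continuous attribute is handled by the same formula applied to the binary table induced by each candidate threshold, which is why the class column must first be sorted by that attribute.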

In its original formulation, each time a continuous attribute is selected, after the split C4.5 looks for the threshold in all the input data. This O(N) search breaks the D&C paradigm. Its effect on sequential computation time is discussed in [28], where better strategies are presented. However, the exact threshold is needed only in the evaluation phase [29], so all the thresholds can be computed in an amortized manner after the building phase. Data access locality in each node split is enhanced, the cost of split operations lowers from O(max(N, n log n)) to O(n log n), and parallel load balancing [5] improves.

6.1. Related work on parallel classifiers

Several different parallel strategies for classification have been explored in the literature. Three of them can be considered as basic paradigms which are combined and specialized in the real algorithms. Attribute parallelism vertically partitions the data and distributes calculation over different columns. Data parallelism employs horizontal partitioning of the data and the coordinated computation of all the processors

M. Coppola, M. Vanneschi / Parallel Computing 28 (2002) 793–813 805

to build each node. Task parallelism is the independent classification of separate nodes and subtrees. These fundamental approaches may use replicated or partitioned data structures, and do static or dynamic load-balancing and computation grain optimization.

We will concentrate on the works based on the same C4.5 definition of IG. Much of the research effort has been spent to avoid sorting the partitions to evaluate the IG, and to split the data using a reasonable number of I/O operations or communications. A commonly used variation is to keep the columns in sorted order, vertically partitioning the data. The drawback is that horizontal partitioning is done at each node split. Additional information and computations are needed to split the columns following the test on the selected attribute while keeping them sorted. Binary splits require some extra processing to form two groups of values from each categorical attribute, but simplify dealing with the data and make the tree structure more regular.

Many existing parallel algorithms address the problem of repeated sorting this way. The parallel implementations of the sequential SLIQ classifier [30] have either in-core memory requirements or communication costs which are O(N) for each node. The SPRINT parallel algorithm [31] lowers the memory requirements and fully distributes the input over the processors, but still requires hash tables of size O(N) to split the larger nodes. Such a large amount of communication per processor makes SPRINT still inherently unscalable. ScalParC [32] uses a breadth-first, level-synchronous approach in building the tree, together with a custom parallel hashing and communication scheme. It is memory-scalable and has a better average split communication cost, even if the worst case is O(N) per level.

Our research has focused on developing a structured parallel classifier based on a D&C formulation. Instead of balancing the computation and communications for a whole level, we aim at a better exploitation of the locality properties of the algorithm. A similar approach is that of [33], which proposes, as a general technique for D&C problems, a mixed approach of data-parallel and task-parallel computation. Essentially, at first all the nodes above a certain size are computed in a data-parallel fashion by all the processors. The smaller nodes are then classified using a simple task parallelisation. The problem of locality exploitation has been addressed also in [34] with a hybrid parallelisation. A level-synchronous approach is still used, but as the amount of communications exceeds the estimated cost of data reorganisation, the available processors are split in two groups that operate on separate sets of subtrees.

6.2. Parallel structure

We started from a task parallelisation approach. Each node classification operation is a task, which generates as sub-tasks the input partitions for the child nodes. To throttle the computation grain size, a single task computation may expand a node to a subtree of more than one level, and return as subtasks all the nodes in the frontier of the subtree. As already noticed, C4.5 is a divide and conquer algorithm, except for the threshold calculation, which anyway is not needed to build the tree. We already verified in [5] that if threshold calculation for continuous attributes is delayed until the pruning phase, the D&C computation can be exploited in a SkIE skeleton structure by means of application-level parallel policies. Removing the O(N) overhead, the cost of each task computation becomes O(n log n) and much less irregular. It is then effective to schedule the tasks in size order, giving precedence to large tasks that generate more parallelism. Note that this does not relieve us from the task of re-sorting data, which has not yet been addressed.

The skeleton structure in Fig. 8 implements the recursive expansion of nodes by letting tasks circulate inside a loop skeleton. A pipeline of two stages expands each task. The anonymous workers in the farm skeleton expand each incoming node to a separate subtree. The second stage in the pipe is a sequential Conquer process coordinating the computation. The template underlying the farm skeleton takes care of load-balancing, so its efficiency depends on the available parallelism and the computation-to-communication ratio.

In a previous version, all the input data were replicated in the workers, to make them anonymous, and the Conquer module kept the decision tree structure locally. Tree management, and the need to explicitly communicate partitioning information through the interfaces of all the modules, were severe bottlenecks for the program. We have designed a shared tree (ST) library, an implementation of a general tree object in shared memory, used to represent the decision tree T. Since data locality follows the evolution of the decision tree, the input is held inside the ST, over the frontier of the expanding tree, and is immediately accessible from each process in the application. C4.5 is a D&C algorithm with a very simple conquer step, which simply consists in merging the subtrees back into T. All the operations required by the algorithm are done in the sequential workers of the farm. They access the shared structure to fetch their input data, they create the resulting sub-tree and store back the data partitions on its frontier.
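A minimal sketch of the role the ST plays (purely illustrative, with invented names; the real library implements a shared-memory object and is described in [6]) might look like:

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Illustrative shared-tree node: the decision tree T with the input data
// partitions held on the frontier, so that any worker can fetch the data
// of the node it expands without routing it through module interfaces.
struct STNode {
    std::vector<std::size_t> caseIds;               // partition on the frontier
    std::vector<std::unique_ptr<STNode>> children;  // empty while a leaf
};

// Conquer step of the D&C: merging an expanded subtree back into T simply
// means attaching the children built by a worker and releasing the data
// that has been pushed down onto the new frontier.
void attachSubtree(STNode& node,
                   std::vector<std::unique_ptr<STNode>> subtrees) {
    node.children = std::move(subtrees);
    node.caseIds.clear();            // data now lives on the children
    node.caseIds.shrink_to_fit();
}
```

In the real library the nodes live in shared memory, so the attach is visible to all processes without any explicit communication of the partitions.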

The Conquer module is still present to apply the task selection policy we previously mentioned. A simple priority queue is used to give precedence to larger tasks, leading to a data-driven expansion scheme of the tree, in contrast to the depth-first scheme of sequential C4.5 and to the level-synchronous approach of ScalParC [32]. We also use a task expansion policy. We made a choice similar to that of [33] in distinguishing the nodes according to their size. In our case we balance the task communication and computation times, which influence dynamic load-balancing, by

Fig. 8. The SkIE code and skeleton structure of task-parallel C4.5.


using three different classes of tasks. The basic heuristic is that large tasks are expanded by one level only, to increase the available parallelism; small ones are fully computed sequentially; and intermediate ones are expanded to incomplete subtrees, up to a given number of nodes and within computation time bounds. The actual limits were tuned following the same experimental approach described in our previous work [5]. For these tests, large tasks have more than 2000 cases (4% of the data), small ones less than 50 (0.1%), and the sequential computation bounds are 1 s and 70 nodes. The input is the file Adult from the UCI machine learning repository.
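The size-based task policy and the size-ordered scheduling can be sketched as follows, using the thresholds reported above for the Adult file; the class and type names are illustrative, not those of the actual prototype:

```cpp
#include <cstddef>
#include <queue>
#include <vector>

enum class TaskClass { Small, Intermediate, Large };

// Size thresholds from the reported tests: tasks over 2000 cases
// (4% of the Adult data) are "large", under 50 cases (0.1%) "small".
TaskClass classify(std::size_t nCases) {
    if (nCases > 2000) return TaskClass::Large;
    if (nCases < 50)   return TaskClass::Small;
    return TaskClass::Intermediate;
}

// The Conquer module keeps a priority queue ordered by task size, so that
// large tasks, which generate more parallelism, are expanded first.
struct Task { std::size_t nCases; };

struct BySize {
    bool operator()(const Task& a, const Task& b) const {
        return a.nCases < b.nCases;   // max-heap on the number of cases
    }
};

using TaskQueue = std::priority_queue<Task, std::vector<Task>, BySize>;
```

Large tasks then get a one-level expansion, small ones are solved sequentially in a worker, and intermediate ones grow bounded subtrees before their frontier re-enters the queue.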

6.3. Results

The results shown in Fig. 9 are good for a task parallelisation of C4.5. Next we have to move toward a full out-of-core computation. The key point in making the application scalable is to exploit parallelism in the processing of large data partitions. Our line of research differs from that of [33] because we aim at developing a general support for operating on large objects. It has already been shown in [35] that exploiting remote memory over high-speed networks can be faster than swapping to virtual memory. To operate on data that is outside the local memory we want to exploit collective parallel operations on huge data, as well as external memory algorithms [36] for a single processor. The application programmer sees these operations as methods of objects, and the run-time support will take care of selecting the appropriate implementation of the methods from time to time.

A first step has been accomplished with the implementation of the ST library. We see in Fig. 9b that it allows us to reduce the centralized work and achieve better scalability for the task parallelisation. Highly dynamic, irregular problems stress all the components of the memory system, and the implementation of shared objects is still in its experimental stage. There is no space here to fully describe such a support, which is better outlined in [6]. Fig. 9a shows the enhancement in the task computation time obtained by introducing a simple preallocation support to the dynamic shared memory handler.

Fig. 9. (a) Per-task completion time vs number of nodes in the subtree, with and without allocation optimization. (b) Relative speedup with and without the ST.


7. Advantages of structured PPE

A comparison between two different programming methodologies must properly take into account their abstraction level. In parallel programming, standard communication libraries like MPI have some of the hindrances of low-level languages. The complete freedom in dealing with the details of communication and parallel work decomposition theoretically allows one to tune the performance of the program, to fully exploit the underlying hardware. However, it results in excessive costs for software development. High-level approaches are to be preferred when the resources for software development are bounded, and when complex structures and performance tuning are impractical. The advantages of the SkIE approach are already noticeable for the DM applications we have shown, even if they all have a simple parallel structure. Moreover, skeleton parallel solutions can easily be nested inside each other, to quickly develop hybrid solutions with higher performance from different parallel ones. Table 1 reports some software cost measures from our experiments, which are to be viewed with respect to the targets of the structured approach: fast code development, code portability, performance portability, stability and integration of standards.

Development costs and code expressiveness––When restructuring existing sequential code, most of the work is spent in making the code modular, as it happens with other approaches. The amount of sequential code needed is reported in Table 1 as modularisation, separate from the true parallel code. Once this task has been accomplished, several SkIE prototypes for different parallel structures were easily developed and evaluated. The skeleton description of a parallel structure (Figs. 2, 6 and 8) is shorter, quicker to write and far more readable than its equivalent written in MPI. Starting from the same sequential modules we developed an MPI version of C4.5. Though it exploits a simpler structure than the skeleton one (master-slaves, no pipelined communications), the parallel code is longer, more complex

Table 1
Software development costs: number of lines and kind of code, development time

                          APRIORI           DBSCAN            C4.5
Kind of parallelisation   SkIE              SkIE              SkIE           SkIE+ST           MPI
Sequential code           C++, 2900 lines   C++, 10 138 lines non-ANSI C, uses global variables, 8179 lines
Modularisation code       630, C++          493, C++          977, C++       977, C++          1087, C++
Parallel structure        350, SkIE-CL, C++ 793, SkIE-CL, C++ 303, SkIE-CL   380, SkIE-CL, C++ 431, MPI, C++
Effort (man-months)       3                 2.5               4              5                 5
Best speed-up (parallelism)
  CS2                     20 (40)           –                 2.5 (7)        5 (14)            –
  COW                     9.4 (10)          6 (9)             2.45 (10)      –                 2.77 (9)
  SMP                     3.73 (4)          –                 –              –                 –
  T3E                     n/av, see Fig. 4b –                 –              –                 –


and error-prone. On the contrary, the speedup results showed no significant gain from the additional programming effort.

Performance––The speed-up and scale-up results of the applications we have shown are not all breakthroughs, but they are comparable to those of similar solutions realized with unstructured parallel programming. The partitioned Apriori is fully scalable w.r.t. database size, like count-distribution implementations. The C4.5 prototype behaves better than other pure task-parallel implementations. It suffers the limits of this parallelisation scheme, due to the object support being incomplete. We know of no other results about spatial clustering using our approach to the parallelisation of cluster expansion.

Code and performance portability––Skeleton code is by definition portable over all the architectures that support the programming environment. As long as the intermediate code agrees with industry standards, as is the case with the MPI and C++ code produced by SkIE, the applications are portable to a broader set of architectures. The SMP and T3E tests of the ARM prototype were performed this way, with no extra development time. These results also show a good degree of performance portability. Since we use compilation to produce the parallel application, the intermediate and support code can exploit all the advantages of parallel communication libraries. On the other hand, the support can be enhanced by using architecture-specific facilities when the performance gain would be valuable.

8. Conclusive remarks and issues for PPE enhancements

We have presented a set of commonly used sequential DM algorithms which were restructured into parallel form by means of the SkIE parallel programming environment. The good reuse of application code and the ease of the conversion confirm the validity of the approach w.r.t. software engineering criteria. The parallel applications produced are nevertheless efficient and scalable. Performance results have been shown over different computer architectures, with low-level issues delegated to the environment support, and application tuning turned into parameter specification for high-level, clearly understandable user-defined policies.

In the three DM applications we have seen there are apparently different access patterns: large block reads intermixed with long computations (partitioned Apriori), frequent small data accesses with poor locality (DBSCAN), and data-intensive computation with unpredictable reading and writing of a great amount of data, with no chance to do static optimizations (C4.5). Actually, if we think about larger databases, all these data will eventually be pushed out-of-core. The R*-Tree used in DBSCAN should be shared and stored in mass memory, and the training data for C4.5 should not be limited by the size of the shared memory. Exploiting the local memories and the shared memory for caching, and applying external memory techniques where appropriate, would make the structure of programs complex and unwieldy.

In the modular, integrated view of a structured PPE, the explicit management of the interactions with parallel file systems and data bases is an obstacle to portability. To simplify, and make modular, the interface to shared resources, we are studying the

810 M. Coppola, M. Vanneschi / Parallel Computing 28 (2002) 793–813

use of the object and component models. The experiments reported with a parallel shared-tree data type show that it allows us to improve performance without sacrificing the advantages of the structured approach. There are currently some groups working to merge the object [37] and component [38] programming models with parallel programming. While in the former work some basic parallel computational patterns are recast as object classes and design patterns, the latter underlines the gain in code reuse obtained by using standard interface definition languages (IDL).

Our attention is on the integration issues: to implement the objects in such a way that it builds on the experiences gained so far, and exposes uniform interfaces both toward the application programmer and the surrounding system environment. Of course, a common way of accessing shared data and I/O services is a starting point to add interfaces to several standard technologies extensively used in the field of KDD. Parallel file systems, DBMSs and CORBA services should transparently be used as data sources and destinations, both at the module interface level and from within the module code.

References

[1] M. Vanneschi, Heterogeneous HPC environments, in: D. Pritchard, J. Reeve (Eds.), Euro-Par '98 Parallel Processing, vol. 1470 of LNCS, Springer, Berlin, 1998, pp. 21–34.
[2] M. Vanneschi, PQE2000: HPC tools for industrial applications, IEEE Concurrency: Parallel, Distributed and Mobile Computing 6 (4) (1998) 68–73.
[3] B. Bacci, M. Danelutto, S. Pelagatti, M. Vanneschi, SkIE: a heterogeneous environment for HPC applications, Parallel Computing 25 (13–14) (1999) 1827–1852.
[4] P. Becuzzi, M. Coppola, M. Vanneschi, Mining of association rules in very large databases: a structured parallel approach, in: Euro-Par '99 Parallel Processing, vol. 1685 of LNCS, Springer, Berlin, 1999, pp. 1441–1450.
[5] P. Becuzzi, M. Coppola, S. Ruggieri, M. Vanneschi, Parallelisation of C4.5 as a particular divide and conquer computation, in: Rolim et al. [39], pp. 382–389.
[6] G. Carletti, M. Coppola, Structured parallel programming and shared objects: experiences in data mining classifiers, in: G. Joubert, A. Murli, F. Peters, M. Vanneschi (Eds.), Parallel Computing, Advances and Current Issues, Proc. of the Internat. Conf. ParCo 2001, Naples, Italy, Imperial College Press, London, 2002.
[7] D. Arlia, M. Coppola, Experiments in parallel clustering with DBSCAN, in: R. Sakellariou, J. Keane, J. Gurd, L. Freeman (Eds.), Euro-Par 2001: Parallel Processing, vol. 2150 of LNCS, 2001.
[8] M. Vanneschi, The programming model of ASSIST, an environment for parallel and distributed portable applications, to appear in Parallel Computing.
[9] W.A. Maniatty, M.J. Zaki, A requirement analysis for parallel KDD systems, in: Rolim et al. [39], pp. 358–365.
[10] G. Williams, I. Altas, S. Bakin, P. Christen, M. Hegland, A. Marquez, P. Milne, R. Nagappan, S. Roberts, The integrated delivery of large-scale data mining: the ACSys data mining project, in: Zaki and Ho [40], pp. 24–54.
[11] D.B. Skillicorn, D. Talia, Models and languages for parallel computation, ACM Computing Surveys 30 (2) (1998) 123–169.
[12] D.B. Skillicorn, Foundations of Parallel Programming, Cambridge University Press, Cambridge, 1994.
[13] J. Darlington, Y. Guo, H.W. To, J. Yang, Skeletons for structured parallel programming, in: Proceedings of the Fifth SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995, pp. 19–28.
[14] D. Skillicorn, Strategies for parallel data mining, IEEE Concurrency 7 (4) (1999) 26–35.
[15] M.J. Zaki, Parallel and distributed data mining: an introduction, in: Zaki and Ho [40], pp. 1–23.
[16] S. Partharasarty, S. Dwarkadas, M. Ogihara, Active mining in a distributed setting, in: Zaki and Ho [40], pp. 65–82.
[17] S. Bailey, E. Creel, R. Grossman, S. Gutti, H. Sivakumar, A high performance implementation of the data space transfer protocol (DSTP), in: Zaki and Ho [40], pp. 55–64.
[18] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, Cambridge, 1996.
[19] A. Mueller, Fast sequential and parallel algorithms for association rule mining: a comparison, Tech. Rep. CS-TR-3515, Department of Computer Science, University of Maryland, College Park, MD, August 1995.
[20] D. Gunopulos, H. Mannila, R. Khardon, H. Toivonen, Data mining, hypergraph transversals, and machine learning (ext. abstract), in: PODS '97: Proceedings of the 16th ACM Symposium on Principles of Database Systems, 1997, pp. 209–216.
[21] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, in: U. Dayal, P. Gray, S. Nishio (Eds.), VLDB '95: Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, Morgan Kaufmann Publishers, Los Altos, CA, 1995, pp. 432–444.
[22] R. Agrawal, J. Shafer, Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering 8 (6) (1996) 962–969.
[23] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is "nearest neighbor" meaningful?, in: C. Beeri, P. Buneman (Eds.), Database Theory – ICDT '99 Seventh International Conference, vol. 1540 of LNCS, 1999, pp. 217–235.
[24] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of KDD '96, 1996, pp. 226–231.
[25] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and rectangles, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1990, pp. 322–331.
[26] S. Berchtold, D.A. Keim, H.-P. Kriegel, The X-Tree: an index structure for high-dimensional data, in: Proceedings of the 22nd International Conference on Very Large Data Bases, 1996, pp. 28–39.
[27] J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.
[28] S. Ruggieri, Efficient C4.5, IEEE Transactions on Knowledge and Data Engineering 14 (2) (2002) 438–444.
[29] J. Darlington, Y. Guo, J. Sutiwaraphun, H.W. To, Parallel induction algorithms for data mining, in: Advances in Intelligent Data Analysis: Reasoning About Data, IDA '97, vol. 1280 of LNCS, 1997, pp. 437–445.
[30] M. Mehta, R. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, in: Proceedings of the Fifth International Conference on Extending Database Technology, 1996.
[31] J. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in: Proceedings of the 22nd VLDB Conference, 1996.
[32] M.V. Joshi, G. Karypis, V. Kumar, ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets, in: Proceedings of the 1998 International Parallel Processing Symposium, 1998.
[33] M.K. Sreenivas, K. AlSabti, S. Ranka, Parallel out-of-core divide-and-conquer techniques with application to classification trees, in: Proceedings of the International Parallel Processing Symposium (IPPS/SPDP), Puerto Rico, 1999, pp. 555–562.
[34] A. Srivastava, E.-H. Han, V. Kumar, V. Singh, Parallel formulations of decision-tree classification algorithms, Data Mining and Knowledge Discovery: An International Journal 3 (3) (1999) 237–261.
[35] M. Oguchi, M. Kitsuregawa, Using available remote memory for parallel data mining application, in: 14th International Parallel and Distributed Processing Symposium, 2000, pp. 411–420.
[36] J.S. Vitter, External memory algorithms and data structures: dealing with MASSIVE DATA, ACM Computing Surveys 33 (2) (2001) 209–271.
[37] D. Goswami, A. Singh, B.R. Preiss, Using object-oriented techniques for realizing parallel architectural skeletons, in: Matsuoka et al. [41], pp. 130–141.
[38] B. Smolinski, S. Kohn, N. Elliott, N. Dykman, Language interoperability for high-performance parallel scientific components, in: Matsuoka et al. [41], pp. 61–71.
[39] J. Rolim, et al. (Eds.), Parallel and Distributed Processing, vol. 1800 of LNCS, Springer, Berlin, 2000.
[40] M.J. Zaki, C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, vol. 1759 of LNAI, Springer, Berlin, 1999.
[41] S. Matsuoka, R. Oldehoeft, M. Tholburn (Eds.), Computing in Object-Oriented Parallel Environments, Third International Symposium, ISCOPE 99, vol. 1732 of LNCS, Springer, Berlin, 1999.
