A High Performance Large Scale Data Warehouse
TRANSCRIPT
function. The query strings are stored in a file that all map functions can access. Each map function then searches for every queried cell in its local data cube and emits a (cell, msr) intermediate key/value pair. These intermediate pairs are partitioned by the key, i.e. the cell name, so the measures are grouped by cell. Finally, the measures for each cell are reduced to one measure by applying the aggregate function (e.g. sum).
Figure 4. The pseudocode of the querying interface
Even though the OLAP query involves every node, the result sets are small. Thus, partitioning and merging the results incur little communication.
For the fact table with 60 million rows, answering 1,000 point queries takes only 203 seconds (the experimental environment is the same as for data cube construction), i.e. each point query takes about 0.2 seconds on average. Moreover, the answering time approaches linear speedup as the number of nodes increases from 5 to 17.
3.4 The Code Base
We implemented our system based on Hadoop [13]. Hadoop is a software platform that allows one to easily write and run parallel or distributed applications that process vast amounts of data. It incorporates features similar to those of the Google File System and of MapReduce. Hadoop also includes HBase, a column-oriented store modeled on Bigtable.
Although Hadoop is implemented in Java, the map and reduce computation tasks were all coded in C++ for efficiency. The C++ programs communicate with Hadoop through Hadoop Streaming.
4. Conclusions
HDW aims at building a large scale data warehouse that accommodates terabytes of data atop inexpensive PC clusters with thousands of nodes. Owing to limited experimental conditions, at present we have demonstrated it on only 18 nodes with 36 cores. But in view of Hadoop's successful sort of 20 terabytes on a 2,000-node cluster within 2.5 hours [13], we believe that HDW has the same potential, which will be proved in the next step. Data extraction, transformation and loading (ETL) will also be considered for incorporation into HDW.
References
[1] David J. DeWitt, Samuel Madden, Michael Stonebraker. How to Build a High-Performance Data Warehouse. http://db.lcs.mit.edu/madden/high_perf.pdf
[2] Michael Stonebraker et al. C-Store: A Column Oriented DBMS. In Proceedings of VLDB, 2005.
[3] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. In 19th Symposium on Operating Systems Principles, 2003.
[4] Fay Chang, Jeffrey Dean, Sanjay Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating System Design and Implementation, 2006.
[5] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Symposium on Operating Systems Design and Implementation, 2004.
[6] PANDA, http://projects.cs.dal.ca/panda/
[7] Ying Chen, Frank Dehne, Todd Eavis et al. Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors. Distributed and Parallel Databases, 2004.
[8] Frank Dehne, Todd Eavis, Andrew Rau-Chaplin. The cgmCUBE Project: Optimizing Parallel Data Cube Generation for ROLAP. Distributed and Parallel Databases, 2006.
[9] Laks V.S. Lakshmanan, Jian Pei, Yan Zhao. QC-Trees: An Efficient Summary Structure for Semantic OLAP. SIGMOD, 2003.
[10] Sanjay Goil, Alok Choudhary. High Performance OLAP and Data Mining on Parallel Computers. Journal of Data Mining and Knowledge Discovery, 1(4):391-417, 1997.
[11] Sanjay Goil, Alok Choudhary. A Parallel Scalable Infrastructure for OLAP and Data Mining. In Proc. International Data Engineering and Applications Symposium (IDEAS'99), Montreal, 1999.
[12] Raymond T. Ng, Alan Wagner, Yu Yin. Iceberg-cube Computation with PC Clusters. SIGMOD, 2001.
[13] Apache org. Hadoop. http://lucene.apache.org/hadoop/
Class QueryMap
map(InputKey blockid, InputValue closedcells)
1. get queried cells from a file;
2. for each cell in queried cells
3.     msr = query(cell, closedcells); // query cell in closedcells
4.     emit(cell, msr);

Class QueryReduce
reduce(InputKey cell, InputValue msrlist)
1. result = 0;
2. for each msr in msrlist
3.     result += msr;
4. emit(cell, result);
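The map side of Figure 4 can be rendered concretely as below. This is a hedged sketch, not the system's code: the local closed-cell block is modeled as a hash map from cell name to measure, and the `query` lookup collapses to a hash-map `find`, whereas the real system searches a condensed-cube structure.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical model of a node's local closed-cell block: cell name -> measure.
using ClosedCells = std::unordered_map<std::string, double>;

// Mirrors QueryMap: for each queried cell found in the local block,
// emit an intermediate (cell, msr) pair.
std::vector<std::pair<std::string, double>>
query_map(const std::vector<std::string>& queried,
          const ClosedCells& closedcells) {
    std::vector<std::pair<std::string, double>> emitted;
    for (const auto& cell : queried) {
        auto it = closedcells.find(cell);   // stands in for query(cell, closedcells)
        if (it != closedcells.end())
            emitted.emplace_back(cell, it->second);
    }
    return emitted;
}
```

Cells absent from a node's block simply emit nothing there, which is why the per-node result sets stay small even though every node runs the map.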