A High Performance Large Scale Data Warehouse
TRANSCRIPT
function. The query strings are stored in a file that all map functions can access. Each map function then searches for every queried cell in its local data cube and emits a (cell, msr) intermediate key/value pair. These intermediate pairs are partitioned by the key, i.e. the cell name, so the measures are grouped by cell. Finally, the measures for each cell are reduced to one measure by applying the aggregate function (e.g. sum).
Figure 4. The pseudocode of the querying interface
Even though the OLAP query involves every node, the result sets are small. Thus, partitioning and merging the results incur little communication.
For the fact table with 60 million rows, answering 1,000 point queries takes only 203 seconds (the experimental environment is the same as for data cube construction), i.e. each point query takes about 0.2 seconds on average. Moreover, the answering time approaches linear speedup as the number of nodes increases from 5 to 17.
3.4 The Code Base
We implemented our system based on Hadoop [13]. Hadoop is a software platform that allows one to easily write and run parallel or distributed applications that process vast amounts of data. It incorporates features similar to those of the Google File System and of MapReduce. Hadoop also includes HBase, a column-oriented store modeled on Bigtable.
Although Hadoop is implemented in Java, the map and reduce computation tasks were all coded in C++ for efficiency. The C++ programs communicate with Hadoop through Hadoop Streaming.
4. Conclusions
HDW aims at building a large scale data warehouse that accommodates terabytes of data atop inexpensive PC clusters with thousands of nodes. Owing to limited experimental conditions, at present we have demonstrated it on only 18 nodes with 36 cores. But in view of Hadoop's successful sort of 20 terabytes on a 2,000-node cluster within 2.5 hours [13], we believe that HDW has the same potential, which will be proved in the next step. Data extraction, transformation and loading (ETL) will also be considered for incorporation into HDW.
References
[1] David J. DeWitt, Samuel Madden, Michael Stonebraker. How to Build a High-Performance Data Warehouse. http://db.lcs.mit.edu/madden/high_perf.pdf
[2] Michael Stonebraker et al. C-Store: A Column Oriented DBMS. In Proceedings of VLDB, 2005.
[3] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. In 19th Symposium on Operating Systems Principles, 2003.
[4] Fay Chang, Jeffrey Dean, Sanjay Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating System Design and Implementation, 2006.
[5] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Symposium on Operating Systems Design and Implementation, 2004.
[6] PANDA, http://projects.cs.dal.ca/panda/
[7] Ying Chen, Frank Dehne, Todd Eavis et al. Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors. Distributed and Parallel Databases, 2004.
[8] Frank Dehne, Todd Eavis, Andrew Rau-Chaplin. The cgmCUBE Project: Optimizing Parallel Data Cube Generation for ROLAP. Distributed and Parallel Databases, 2006.
[9] Laks V.S. Lakshmanan, Jian Pei, Yan Zhao. QC-Trees: An Efficient Summary Structure for Semantic OLAP. SIGMOD, 2003.
[10] Sanjay Goil, Alok Choudhary. High Performance OLAP and Data Mining on Parallel Computers. Journal of Data Mining and Knowledge Discovery, 1(4):391-417, 1997.
[11] Sanjay Goil, Alok Choudhary. A Parallel Scalable Infrastructure for OLAP and Data Mining. In Proc. International Data Engineering and Applications Symposium (IDEAS'99), Montreal, 1999.
[12] Raymond T. Ng, Alan Wagner, Yu Yin. Iceberg-cube Computation with PC Clusters. SIGMOD, 2001.
[13] Apache org. Hadoop. http://lucene.apache.org/hadoop/
Class QueryMap
map(InputKey blockid, InputValue closedcells)
1. get queried cells from a file;
2. for each cell in queried cells
3.     msr = query(cell, closedcells); // query cell in closedcells
4.     emit(cell, msr);

Class QueryReduce
reduce(InputKey cell, InputValue msrlist)
1. result = 0;
2. for each msr in msrlist
3.     result += msr;
4. emit(cell, result);
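The map side of Figure 4 can be rendered concretely as below. This is a hedged sketch, not the system's code: the local closed-cell block is modeled as a hash map from cell name to measure, and the `query` lookup collapses to a hash-map `find`, whereas the real system searches a condensed-cube structure.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical model of a node's local closed-cell block: cell name -> measure.
using ClosedCells = std::unordered_map<std::string, double>;

// Mirrors QueryMap: for each queried cell found in the local block,
// emit an intermediate (cell, msr) pair.
std::vector<std::pair<std::string, double>>
query_map(const std::vector<std::string>& queried,
          const ClosedCells& closedcells) {
    std::vector<std::pair<std::string, double>> emitted;
    for (const auto& cell : queried) {
        auto it = closedcells.find(cell);   // stands in for query(cell, closedcells)
        if (it != closedcells.end())
            emitted.emplace_back(cell, it->second);
    }
    return emitted;
}
```

Cells absent from a node's block simply emit nothing there, which is why the per-node result sets stay small even though every node runs the map.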