A High Performance Large Scale Data Warehouse



…function. The query strings are stored in a file which can be accessed by all map functions. The map function then searches for every queried cell in its local data cube and emits a (cell, msr) intermediate key/value pair. These intermediate pairs are partitioned by the key, i.e. the name of the cell, so that the measures are grouped by cells. Finally, the measures for a cell are reduced to one measure by applying the aggregate function (e.g. sum).

Class QueryMap
map(InputKey blockid, InputValue closedcells)
    get the queried cells from a file;
    for each cell in queried cells
        msr = query(cell, closedcells);  // query cell in closedcells
        emit(cell, msr);

Class QueryReduce
reduce(InputKey cell, InputValue msrlist)
    result = 0;
    for each msr in msrlist
        result += msr;
    emit(cell, result);

Figure 4. The pseudocode of the querying interface
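To make the reduce step concrete, the following is a minimal C++ sketch of QueryReduce as a Hadoop Streaming reducer (as Section 3.4 notes, HDW's map and reduce tasks are coded in C++ over Hadoop Streaming). It relies only on the Streaming contract that reducer input arrives on stdin as key<TAB>value lines sorted by key, so all measures of one cell are contiguous; the tab-separated record layout is an assumption of this sketch, not a detail given in the paper.

    // query_reduce.cpp - sketch of QueryReduce over Hadoop Streaming.
    // Assumes input lines of the form "cell\tmsr", sorted by cell name.
    #include <iostream>
    #include <string>

    int main() {
        std::string line, prev_cell;
        double result = 0.0;
        bool have_prev = false;
        while (std::getline(std::cin, line)) {
            std::string::size_type tab = line.find('\t');
            if (tab == std::string::npos) continue;       // skip malformed lines
            std::string cell = line.substr(0, tab);
            double msr = std::stod(line.substr(tab + 1)); // the cell's measure
            if (have_prev && cell != prev_cell) {
                // Key changed: every msr for prev_cell has been summed.
                std::cout << prev_cell << '\t' << result << '\n';
                result = 0.0;
            }
            result += msr;   // the aggregate function (here: sum)
            prev_cell = cell;
            have_prev = true;
        }
        if (have_prev)
            std::cout << prev_cell << '\t' << result << '\n';
        return 0;
    }

Because the framework sorts the intermediate pairs, the reducer needs only constant state: a running sum that is flushed whenever the cell name changes.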

Even though answering an OLAP query involves every node, the result sets are small. Thus partitioning and merging the results causes little communication.

For the fact table with 60 million rows, answering 1,000 point queries takes only 203 seconds (the experimental environment is the same as that of the data cube construction), i.e. about 0.2 seconds per point query on average. The answering time also approaches linear speedup as the number of nodes increases from 5 to 17.

    3.4 The Code Base

We implemented our system on top of Hadoop [13]. Hadoop is a software platform that allows one to easily write and run parallel or distributed applications that process vast amounts of data. It incorporates features similar to those of the Google File System and of MapReduce. Hadoop also includes HBase, a column-oriented store modeled after Bigtable.

Although Hadoop is implemented in Java, the map and reduce computation tasks were all coded in C++ for efficiency. The C++ programs communicate with Hadoop through Hadoop Streaming.
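As an illustration of this wiring, below is a minimal sketch of the QueryMap side as a stand-alone C++ Streaming task: it reads its input split line by line from stdin and writes each (cell, msr) pair as one tab-separated line to stdout. The file name queries.txt and the query() stub are hypothetical placeholders; the real map function searches the queried cells in its node-local data cube.

    // query_map.cpp - sketch of QueryMap over Hadoop Streaming.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stub standing in for the local data cube search; HDW's real map
    // function looks each cell up among its local closed cells.
    static double query(const std::string& cell, const std::string& closedcells) {
        (void)cell; (void)closedcells;
        return 1.0;                       // placeholder measure
    }

    int main() {
        // Get the queried cells from a file accessible to all map tasks
        // ("queries.txt" is an assumed name).
        std::vector<std::string> cells;
        std::ifstream qf("queries.txt");
        for (std::string c; std::getline(qf, c); )
            if (!c.empty()) cells.push_back(c);

        // For each input record, query every cell and emit (cell, msr).
        for (std::string closedcells; std::getline(std::cin, closedcells); )
            for (const std::string& cell : cells)
                std::cout << cell << '\t' << query(cell, closedcells) << '\n';
        return 0;
    }

Binaries like this pair are handed to the framework through the streaming jar's -mapper and -reducer options; no Java wrapper or JNI glue is needed.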

    4. Conclusions

HDW aims at building a large scale data warehouse that accommodates terabytes of data atop inexpensive PC clusters with thousands of nodes. Owing to our limited experimental conditions, we have so far demonstrated it on only 18 nodes with 36 cores. But in view of Hadoop's successful sorting of 20 terabytes on a 2,000-node cluster within 2.5 hours [13], we believe that HDW has the same potential, which we will verify in the next step. We also plan to incorporate data extraction, transformation and loading (ETL) into HDW.

    References

[1] David J. DeWitt, Samuel Madden, Michael Stonebraker. How to Build a High-Performance Data Warehouse. http://db.lcs.mit.edu/madden/high_perf.pdf

[2] Michael Stonebraker et al. C-Store: A Column-Oriented DBMS. In Proceedings of VLDB, 2005.

[3] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. In 19th Symposium on Operating Systems Principles, 2003.

[4] Fay Chang, Jeffrey Dean, Sanjay Ghemawat et al. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating System Design and Implementation, 2006.

[5] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Symposium on Operating Systems Design and Implementation, 2004.

[6] PANDA, http://projects.cs.dal.ca/panda/

[7] Ying Chen, Frank Dehne, Todd Eavis et al. Parallel ROLAP Data Cube Construction on Shared-Nothing Multiprocessors. Distributed and Parallel Databases, 2004.

[8] Frank Dehne, Todd Eavis, Andrew Rau-Chaplin. The cgmCUBE Project: Optimizing Parallel Data Cube Generation for ROLAP. Distributed and Parallel Databases, 2006.

[9] Laks V.S. Lakshmanan, Jian Pei, Yan Zhao. QC-Trees: An Efficient Summary Structure for Semantic OLAP. SIGMOD, 2003.

[10] Sanjay Goil, Alok Choudhary. High Performance OLAP and Data Mining on Parallel Computers. Journal of Data Mining and Knowledge Discovery, 1(4):391-417, 1997.

[11] Sanjay Goil, Alok Choudhary. A Parallel Scalable Infrastructure for OLAP and Data Mining. In Proc. International Data Engineering and Applications Symposium (IDEAS'99), Montreal, 1999.

[12] Raymond T. Ng, Alan Wagner, Yu Yin. Iceberg-cube Computation with PC Clusters. SIGMOD, 2001.

[13] Apache.org. Hadoop. http://lucene.apache.org/hadoop/
