Benchmarking data warehouse systems in the cloud: new requirements & new metrics

30th May 2013, DWS in the Cloud, AICCSA'13, Fez. Data Warehouse Systems in the Cloud: new requirements and new challenges. Rim Moussa, LaTICE Lab. - University of Tunis, ESTI - University of Carthage, [email protected]. 10th Intl. Conference on Computer Systems and Applications (AICCSA), Fez, Kingdom of Morocco. Keynote @ Intl. Conference on Computing, Networking and Communications, Hammamet, Tunisia.

Upload: rim-moussa

Post on 30-Oct-2014

Page 1: Benchmarking data warehouse systems in the cloud: new requirements & new metrics

30th May 2013 DWS in the Cloud, AICCSA'13, Fez

Data Warehouse Systems in the Cloud: new requirements and new challenges

Rim Moussa

LaTICE Lab. - University of Tunis

ESTI - University of Carthage

[email protected]

10th Intl. Conference on Computer Systems and Applications (AICCSA), Fez, Kingdom of Morocco Keynote @ Intl. Conference on Computing, Networking and Communications, Hammamet, Tunisia

30th May 2013

Page 2: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Context

Benchmarking Data Warehouse Systems

Cloud Rationale

NO

Page 3: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Benchmarking Data Warehouse Systems

Cloud Rationale

NO

Page 4: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Outline

1. Cloud Computing

2. Data Warehouse Systems

3. Overview of DWS Benchmarks

4. New Requirements for DWS in the Cloud

5. Related Work

6. Conclusion

7. Research Perspectives

Page 5: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Cloud Computing

● NIST Definition
– Cloud computing is a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

● Opportunities
– Performance
  ● Faster data analysis through usage of up-to-date hardware infrastructure made available by Cloud Service Providers
– More Economical
  ● Organizations no longer need to expend capital upfront for hardware and software purchases, with services provided on a pay-per-use basis

Page 6: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Cloud Computing --Market share

● Market Share

– Forrester Research expects the global cloud computing market to reach $241 billion in 2020,

– Gartner group: The public cloud services market is forecast to grow 18.5% in 2013 to total $131 billion worldwide, up from $111 billion in 2012,

– Gartner: the public cloud services market in the Middle East and North Africa (MENA) is expected to increase by 24.5% in 2013,

– Gartner group: the public cloud services market in India is forecast to grow 36% in 2013 to total $443 million, up from $326 million in 2012.

Page 7: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--Typical System Architecture

Page 8: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems --Technologies

● Traditional Relational DBMSs & OLAP Servers
– Mature
– Do not scale linearly

● NoSQL solutions
– Adopted by Google, Facebook, Amazon, ...
– Dynamic horizontal scaling
  ● Nodes are added without bringing the cluster down
  ● Shared-nothing architecture
  ● Independent computing and storage nodes interconnected via a high speed network
– MapReduce distributed programming framework
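As a rough illustration of the MapReduce programming model mentioned on this slide, here is a minimal single-process sketch in Python. It is not taken from the talk: the record layout (`product`, `quantity`, `price`), the function names, and the in-memory "partitions" are assumptions made purely for illustration. A real framework would run the map and reduce phases on separate shared-nothing nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    # Emit one (key, value) pair per input record: (product, revenue).
    yield record["product"], record["quantity"] * record["price"]


def shuffle(pairs):
    # Group all values that share a key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Aggregate the values of one key.
    return key, sum(values)


def run_job(partitions):
    # Each partition stands in for the data held by one node (hypothetical).
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


partitions = [
    [{"product": "A", "quantity": 2, "price": 10},
     {"product": "B", "quantity": 1, "price": 5}],
    [{"product": "A", "quantity": 3, "price": 10}],
]
revenue = run_job(partitions)
```

Adding a node amounts to adding one more partition to the list, which is the sense in which this model scales horizontally.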

Page 9: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--challenges with big data management

Page 10: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--Common Optimizations: Hardware Storage Tech.

● DRAM: in-memory data processing (very expensive)
● SSD (Solid State Drives): a non-volatile type of memory
  ● An SSD does not have a mechanical arm to read and write data

                        SSD                HDD
Cost/GB                 $1/GB              $0.075/GB
Typical size            512GB              Up to 2TB
Failure rate (MTBF)     2 million hours    1.5 million hours
Read/Write speed        200-500 MBps       120 MBps

Page 11: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--Common Optimizations: Columnar Storage Principle

● Row-oriented storage
– Read pages containing all columns

● Column-oriented storage
– Read only columns needed for query processing

[Figure: a table with columns Date, Customer, Product, Price, Quantity, stored row-wise vs. column-wise]
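The difference between the two layouts can be sketched in a few lines of Python. This is an illustrative toy, not the slide's material: the sample rows reuse the column names from the figure, and "values read" is only a stand-in for I/O (real systems read pages, not individual values).

```python
rows = [
    {"date": "2013-05-28", "customer": "c1", "product": "p1", "price": 10.0, "quantity": 2},
    {"date": "2013-05-29", "customer": "c2", "product": "p2", "price": 5.0, "quantity": 1},
    {"date": "2013-05-30", "customer": "c1", "product": "p1", "price": 10.0, "quantity": 3},
]


def to_columns(rows):
    # Pivot the row layout into one list per column.
    return {name: [row[name] for row in rows] for name in rows[0]}


def revenue_row_store(rows):
    # A row store fetches whole rows, so every column value is read.
    values_read = sum(len(row) for row in rows)
    total = sum(row["price"] * row["quantity"] for row in rows)
    return total, values_read


def revenue_column_store(columns):
    # A column store touches only the two columns the query needs.
    prices, quantities = columns["price"], columns["quantity"]
    values_read = len(prices) + len(quantities)
    total = sum(p * q for p, q in zip(prices, quantities))
    return total, values_read


columns = to_columns(rows)
row_total, row_reads = revenue_row_store(rows)
col_total, col_reads = revenue_column_store(columns)
```

Both layouts return the same total, but the row layout touches all 15 values while the column layout touches only 6.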

Page 12: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--Common Optimizations: Columnar Storage Benefits

● Allows a better data compression rate, since data values are redundant within a single column

● Eliminates unnecessary I/O through the retrieval of only relevant data

● Vectorwise, a column-oriented DBMS, is in the TPC-H Top Ten Performance Results (14-Jun-2013)
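To see why redundancy within a single column helps compression, consider run-length encoding (RLE). RLE is only one of several techniques column stores use and is chosen here as an assumption for illustration; the slides do not name a specific scheme. The sample column is made up.

```python
from itertools import groupby


def rle_encode(values):
    # Collapse each run of equal adjacent values into (value, run_length).
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    # Expand (value, run_length) pairs back into the original column.
    return [value for value, length in runs for _ in range(length)]


product_column = ["p1", "p1", "p1", "p2", "p2", "p3"]
runs = rle_encode(product_column)
restored = rle_decode(runs)
```

Six stored values shrink to three (value, count) pairs. The same data interleaved with other attributes in a row layout would rarely form such runs.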

Page 13: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--Common Optimizations: Derived Data

● Derived Data:
– Indexes
– Derived attributes
– Aggregate tables

● Pros:
– High performance

● Cons:
– Maintenance: refresh is expensive
– Storage cost
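The trade-off listed above can be made concrete with a toy aggregate table. This is a hedged sketch: the class name `SalesWithAggregate` and its methods are hypothetical and do not correspond to any particular DBMS feature.

```python
from collections import defaultdict


class SalesWithAggregate:
    """Fact table plus one derived aggregate (revenue per product)."""

    def __init__(self):
        self.facts = []                               # base data
        self.revenue_by_product = defaultdict(float)  # derived data: extra storage

    def insert(self, product, price, quantity):
        self.facts.append((product, price, quantity))
        # Maintenance cost: every insert must also refresh the aggregate.
        self.revenue_by_product[product] += price * quantity

    def revenue_scan(self, product):
        # Without derived data: scan every fact row.
        return sum(p * q for prod, p, q in self.facts if prod == product)

    def revenue_lookup(self, product):
        # With derived data: a single lookup.
        return self.revenue_by_product.get(product, 0.0)


sales = SalesWithAggregate()
sales.insert("p1", 10.0, 2)
sales.insert("p2", 5.0, 1)
sales.insert("p1", 10.0, 3)
```

The lookup answers the query without touching the fact rows, at the price of extra storage and extra work on every insert.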

Page 14: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--DWS Benchmarks

● APB-1 OLAP Benchmark --obsolete
– Released by the OLAP Council (www.olapcouncil.org) in 1998
– A simple star schema data model

● TPC DSS Benchmarks
– Released by the Transaction Processing Performance Council (www.tpc.org)
– Examine large volumes of data (from 10GB to 100TB)
– Complex relational data model
– TPC-H
  ● Workload composed of 22 ad-hoc complex SQL statements
  ● The most prominent DSS benchmark
– TPC-DS -successor of TPC-H
  ● Workload composed of 99 SQL business questions
  ● Same metrics as TPC-H

Page 15: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--TPC-H Benchmark Metrics (same for TPC-DS)

● Query-per-hour Performance Metric
– For a given scale factor (warehouse data volume)
– Concurrent users

● Price-Performance Metric
– Ratio of the priced system (cost of ownership: hardware, software, maintenance, and everything else needed to run the TPC-H workload) to the query performance metric
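The two metrics can be sketched as follows. The TPC-H specification defines the composite query-per-hour metric as the geometric mean of a single-user power figure and a multi-user throughput figure, and the price-performance metric as the priced system divided by that composite. The function names and the sample numbers below are illustrative assumptions, not published results.

```python
import math


def composite_qph(power_at_size, throughput_at_size):
    # Composite metric: geometric mean of the power and throughput figures.
    return math.sqrt(power_at_size * throughput_at_size)


def price_performance(total_system_price, qph):
    # Dollars per unit of query-per-hour performance.
    return total_system_price / qph


# Made-up inputs, for illustration only.
qph = composite_qph(power_at_size=40_000.0, throughput_at_size=10_000.0)
ratio = price_performance(total_system_price=100_000.0, qph=qph)
```

Both figures are only meaningful for one scale factor, so results at different data volumes should not be compared directly.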

Page 16: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--TPC-H mismatches Cloud Rationale

● TPC-H does not represent BI suites
– Integration services
– Analytics services (Multi-Dimensional eXpressions language, mining structures)
– Reporting services

● TPC-H Workload Processing Metric
– Qph@Size defines the number of queries processed per hour
– The workload is assumed static, which is not realistic!
– The benchmark should assess the scalability of the SUT (System Under Test) under variable and evolving workload and data volumes

Page 17: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--TPC-H mismatches Cloud Rationale (ctnd.1)

● TPC-H Cost-Performance Metric
– $/Qph@Size, where the cost covers all hardware, software and HR required for running the workload (over 3 years)
– The cost model in the cloud is different, and does not relate to the cost of ownership

● TPC-H does not report a Cost-Effectiveness Metric

● TPC-H implementation vs. CAP theorem
– CAP theorem: a distributed system cannot fulfill all three of Consistency (same view of data), Availability (query response) and Partition Tolerance (cope with hardware crash).
– Since DWS are deployed onto shared-nothing architectures, benchmarks should be CA-, CP- or AP-compliant.

Page 18: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


New Requirements & New Metrics

Page 19: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Performance Requirement--Data Transfer IN/OUT CSP

● Data Transfer Characteristics

– Transfer of huge data volumes INTO and OUT of the Cloud Service Provider (CSP)

– Resulting in network-bound DWS

– Usually, the cost model adopted by CSPs is:
  ● Data upload INTO the CSP is free of charge
  ● Data download OUT of the CSP is priced

● Data Transfer Metrics in the Cloud

– Time and cost for data upload

– Time and cost for data download
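A back-of-the-envelope sketch of these data transfer metrics is given below. The bandwidth and the per-GB download price are hypothetical placeholders, not figures from the talk or from any provider's price list. The only assumption taken from the slide is that upload is free and download is charged.

```python
def transfer_time_hours(volume_gb, bandwidth_mbps):
    # volume in GB (10^9 bytes), bandwidth in megabits per second.
    seconds = volume_gb * 8 * 1000 / bandwidth_mbps
    return seconds / 3600


def transfer_cost(volume_gb, price_per_gb):
    return volume_gb * price_per_gb


# Hypothetical scenario: 1 TB warehouse, 100 Mbps link.
UPLOAD_PRICE_PER_GB = 0.0     # upload into the CSP assumed free
DOWNLOAD_PRICE_PER_GB = 0.10  # made-up download price

upload_hours = transfer_time_hours(1000, 100)
upload_cost = transfer_cost(1000, UPLOAD_PRICE_PER_GB)
download_cost = transfer_cost(1000, DOWNLOAD_PRICE_PER_GB)
```

Under these assumptions the upload alone takes close to a day, which is why transfer time belongs next to cost as a reported metric.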

Page 20: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Performance Requirement--Workload Processing (ctnd. 1)

● Workload Processing Characteristics

– Both I/O-bound and CPU-bound business questions

– Intra-query processing combined with virtual partitioning or physical processing

● Performance across Cluster Size

– For each business question, there is a cluster size that gives the best response time; performance degrades for both smaller and larger clusters

– Shown for both SQL and NoSQL technologies
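One simple way to picture such an optimum is a toy cost model in which the parallelizable work shrinks with the number of nodes while coordination overhead grows with it. The model `T(n) = work / n + overhead * n` and its constants are assumptions chosen for illustration; they are not the measurements reported in the talk.

```python
def response_time(nodes, work_s, overhead_s_per_node):
    # Parallel work is divided among nodes; coordination cost grows with them.
    return work_s / nodes + overhead_s_per_node * nodes


def best_cluster_size(max_nodes, work_s, overhead_s_per_node):
    # Exhaustive search over candidate cluster sizes.
    return min(
        range(1, max_nodes + 1),
        key=lambda n: response_time(n, work_s, overhead_s_per_node),
    )


# Made-up constants: 3600 s of divisible work, 4 s of overhead per node.
best_n = best_cluster_size(max_nodes=100, work_s=3600, overhead_s_per_node=4)
best_time = response_time(best_n, 3600, 4)
```

With these constants the best response time is reached at 30 nodes. Since the constants differ per query, each business question has its own optimum.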

Page 21: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Performance Requirement--Workload Processing (ctnd.2)

● TPC-H benchmarking of Apache Hadoop/Pig Latin on GRID5000 -Bordeaux Site [Moussa,ICCIT'12] (SF=10)

Page 22: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Performance Requirement--Workload Processing (ctnd.3)

● Workload Processing Metrics

– Elapsed times for running business questions,

– Slope: performance - cost

Page 23: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Scalability Requirement

● Definition

– Scalability is the ability of a system to increase total throughput under an increased load when hardware resources are added.

● Scalability Metric

– Query Performance Metric under
  ● Ever increasing workload
  ● Different query frequencies

Page 24: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Elasticity Requirement

● Definition

– Elasticity adjusts the system capacity at runtime by adding and removing resources without service interruption in order to handle the workload variation.

● Elasticity Metric

– Capacity to add/remove resources: (0|1)

– Scaling Latency: elapsed time to scale-down and scale-up

– Impact on SUT performance during scale-up and scale-down

– Scale-up cost (+$)

– Scale-down gain (-$)
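The elasticity metrics listed above could be recorded per scaling event along the following lines. This is a sketch under stated assumptions: the `ScalingEvent` fields, the way performance impact is expressed (relative throughput drop while scaling), and the node price are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingEvent:
    nodes_before: int
    nodes_after: int
    started_s: float    # time the scaling request was issued
    finished_s: float   # time the new capacity was usable
    qph_steady: float   # throughput before the event
    qph_during: float   # throughput measured while scaling


def elasticity_report(event, node_price_per_hour):
    return {
        # Scaling latency: elapsed time to scale up or down.
        "scaling_latency_s": event.finished_s - event.started_s,
        # Impact on SUT performance: relative throughput drop while scaling.
        "performance_impact": 1 - event.qph_during / event.qph_steady,
        # Positive for scale-up cost (+$), negative for scale-down gain (-$).
        "hourly_cost_delta": (event.nodes_after - event.nodes_before)
        * node_price_per_hour,
    }


scale_up = ScalingEvent(4, 6, started_s=0.0, finished_s=90.0,
                        qph_steady=1000.0, qph_during=800.0)
report = elasticity_report(scale_up, node_price_per_hour=0.5)
```

A system that cannot add or remove resources at runtime yields no such event at all, which corresponds to the (0|1) capacity metric.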

Page 25: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Availability Requirement--Redundancy Strategies

● Redundancy Strategies

– Replication (a.k.a. mirroring)

– Erasure-resilient codes

● Redundancy Strategies vs. Workload Type

– Replication suits OLTP workloads

– Erasure-resilient codes suit OLAP workloads

● Comparison [Litwin et al.,ACM TODS'05]

– Data storage cost

– Computation cost

– Communication cost

Page 26: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Availability Requirement--Strategies Comparison (ctnd.1)

Page 27: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


High Availability Requirement--Metrics for the Cloud (ctnd.2)

● High Availability Metrics
– $@k: cost of each targeted level of availability (1-available, ..., k-available, where k is the number of failures the system can tolerate)
– Cost of recovery, expressed as
  ● Time to get the system back
  ● Decreased system productivity caused by the hardware failure ($), from the customer perspective
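The storage component of $@k can be sketched for the two redundancy strategies discussed earlier. Tolerating k failures with replication requires k extra full copies, whereas an erasure code protecting groups of m data buckets needs only k parity buckets per group. The group size `m`, the data volume and the storage price are illustrative assumptions, and computation and communication costs are ignored here.

```python
def replication_storage_factor(k):
    # k additional full copies tolerate k failures.
    return 1 + k


def erasure_code_storage_factor(m, k):
    # m data buckets protected by k parity buckets tolerate k failures.
    return (m + k) / m


def monthly_storage_cost(data_gb, price_per_gb_month, factor):
    return data_gb * factor * price_per_gb_month


# Hypothetical scenario: 1000 GB, 2-available, groups of 4 data buckets.
DATA_GB, PRICE, K, M = 1000, 0.10, 2, 4
cost_replication = monthly_storage_cost(DATA_GB, PRICE, replication_storage_factor(K))
cost_erasure = monthly_storage_cost(DATA_GB, PRICE, erasure_code_storage_factor(M, K))
```

Under these assumptions the erasure-coded layout stores 1.5 times the data against 3 times for replication, so its storage bill is half as large at the same availability level.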

Page 28: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Cost Management Requirement

● CSP pricing model

– Different cloud service price models (IaaS, PaaS, SaaS)

– e.g.
  ● CPU cost for IaaS: instance-based (Amazon, MS Azure) or CPU-cycles-based (Cloud Sites, Google App Engine)
  ● Query processing by Google BigQuery is priced on retrieved bytes (columnar storage)

● Cost-Performance Ratio

● Cost-Effectiveness ratio
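To compare the pricing models above, a benchmark has to turn each one into a cost for the same workload. The sketch below contrasts an instance-based bill with a bytes-retrieved bill and derives a cost-performance ratio. All prices and workload figures are made-up placeholders, and the function names are hypothetical.

```python
def instance_based_cost(nodes, hours, price_per_node_hour):
    # IaaS-style billing: pay for provisioned instances, busy or idle.
    return nodes * hours * price_per_node_hour


def scan_based_cost(bytes_scanned, price_per_tb):
    # Query-service-style billing: pay for the bytes a query retrieves.
    return bytes_scanned / 1e12 * price_per_tb


def cost_performance(cost, queries_per_hour):
    # Dollars per unit of query throughput.
    return cost / queries_per_hour


# Hypothetical workload and prices.
cluster_cost = instance_based_cost(nodes=4, hours=2, price_per_node_hour=0.5)
query_cost = scan_based_cost(bytes_scanned=2 * 10**12, price_per_tb=5.0)
cluster_ratio = cost_performance(cluster_cost, queries_per_hour=200)
```

Because bytes-retrieved pricing depends on the columns a query touches, the same workload can rank the two models differently depending on the storage layout.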

Page 29: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Related Work

● Benchmarking in the cloud

– [Gray, MS'08]: TeraSort Benchmark for data sort evaluations,

– [Cooper et al., SoCC'10]: Yahoo Cloud Serving Benchmark (YCSB) for evaluating the performance of "key-value" and "cloud" serving stores.

– [Sobel et al., ICCSA'08]: CloudStone Benchmark for Web2.0 applications

– [Bennet et al., KDD'10]: MalStone Benchmarking for data mining in the cloud

– [Ang et al., USENIX'10]: CloudCMP project for CSP comparison

– [Binnig et al., DBTest'09], [Kossmann et al., SIGMOD'10]: Benchmarking OLTP systems in the cloud

Page 30: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Related Work (ctnd.1)

● NoSQL and SQL Technologies Assessment in the cloud

– [Pavlo et al. SIGMOD'09],

– [Floratou et al., TPC-TC'11 ],

● More Specific Issues

– [Forrester, 2011]: Storage on-premises vs. in the cloud

– [Nguyen et al., EDBT Workshops'12]: Materialized Views Selection

– [Moussa, IJWA'12]: OLAP Scenarios in the Cloud and OLAP Workload Taxonomy

Page 31: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Conclusion & Future Work

● Keynote scope

– Overview of DWS

– Insight into new requirements and new metrics to be considered for benchmarking DWS in the cloud [Moussa, AICCSA'13]

● Research Perspectives

– Assessment of OLAP systems in the cloud
  ● Amazon RDS
  ● Google BigQuery
  ● MS Azure
  ● ...

Page 32: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Research Perspectives--New OLTP Systems

● Classical Workload Taxonomy

– OLTP: transactions, ACID properties

– OLAP: complex queries, star-joins, grouping, aggregations...

● New OLTP Workload features:

– OLTP

– Big Data

– Real-time analytics

● Examples of systems: Google Spanner, Clustrix, NuoDB and TransLattice

Page 33: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Thank you for Your Attention

Q & A

Rim Moussa

Data Warehouse Systems in the Cloud

N2C'2013, Hammamet, 15th June 2013

Page 34: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--TPC-H Benchmark Relational DB Schema

Page 35: Benchmarking data warehouse systems in the cloud: new requirements & new metrics


Data Warehouse Systems--TPC-H Benchmark Metrics