Benchmarking Data Warehouse Systems in the Cloud: New Requirements & New Metrics
30th May 2013 DWS in the Cloud, AICCSA'13, Fez
Data Warehouse Systems in the Cloud: New Requirements and New Challenges
Rim Moussa
LaTICE Lab. - University of Tunis
ESTI - University of [email protected]
10th Intl. Conference on Computer Systems and Applications (AICCSA), Fez, Kingdom of Morocco
Keynote @ Intl. Conference on Computing, Networking and Communications, Hammamet, Tunisia
30th May 2013
Context
[Figure: Benchmarking Data Warehouse Systems vs. Cloud Rationale: NO]
Outline
1. Cloud Computing
2. Data Warehouse Systems
3. Overview of DWS Benchmarks
4. New Requirements for DWS in the Cloud
5. Related Work
6. Conclusion
7. Research Perspectives
Cloud Computing
● NIST Definition
– Cloud computing is a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
● Opportunities
– Performance
    ● Faster data analysis through up-to-date hardware infrastructure made available by Cloud Service Providers
– More economical
    ● Organizations no longer need to spend capital upfront on hardware and software purchases; services are provided on a pay-per-use basis
Cloud Computing--Market Share
● Market Share
– Forrester Research expects the global cloud computing market to reach $241 billion in 2020,
– Gartner: the public cloud services market is forecast to grow 18.5% in 2013 to total $131 billion worldwide, up from $111 billion in 2012,
– Gartner: the public cloud services market in the Middle East and North Africa (MENA) is expected to grow by 24.5% in 2013,
– Gartner: the public cloud services market in India is forecast to grow 36% in 2013 to total $443 million, up from $326 million in 2012.
Data Warehouse Systems--Typical System Architecture
Data Warehouse Systems--Technologies
● Traditional Relational DBMSs & OLAP Servers
– Mature
– Do not scale linearly
● NoSQL solutions
– Adopted by Google, Facebook, Amazon, ...
– Dynamic horizontal scaling (scale-out)
    ● Nodes are added without bringing the cluster down
    ● Shared-nothing architecture
    ● Independent computing and storage nodes interconnected via a high-speed network
– MapReduce distributed programming framework
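To make the MapReduce model concrete, here is a minimal single-process sketch in Python. It is an illustration only, not the Hadoop API: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample sales records are assumptions made for this example. In a real cluster, the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, quantity)
SALES = [("pen", 2), ("ink", 1), ("pen", 5), ("pad", 3), ("ink", 4)]

def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for product, quantity in rows:
        yield product, quantity

def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group independently (here: SUM of quantity per product)."""
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(SALES)))
print(totals)
```

Because each group is reduced independently, the reduce step can be spread over as many nodes as there are keys, which is what makes the model fit shared-nothing clusters.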
Data Warehouse Systems--Challenges with Big Data Management
Data Warehouse Systems--Common Optimizations: Hardware Storage Tech.
● DRAM: in-memory data processing (very expensive)
● SSD (Solid State Drive): a non-volatile type of memory
● An SSD has no mechanical arm to read and write data

                    SSD              HDD
Cost/GB             $1/GB            $0.075/GB
Typical size        512 GB           Up to 2 TB
Failure rate: MTBF  2 million hours  1.5 million hours
Read/Write speed    200-500 MBps     120 MBps
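The trade-off in the SSD/HDD comparison can be turned into a quick back-of-the-envelope calculation. The sketch below uses the per-GB prices and sequential speeds quoted on the slide; the 1000 GB volume and the helper names `scan_seconds` and `purchase_cost` are assumptions for illustration.

```python
def scan_seconds(volume_gb, speed_mb_per_s):
    """Time to read volume_gb sequentially (1 GB taken as 1000 MB)."""
    return volume_gb * 1000 / speed_mb_per_s

def purchase_cost(volume_gb, price_per_gb):
    """Raw storage cost in dollars for volume_gb."""
    return volume_gb * price_per_gb

VOLUME_GB = 1000  # hypothetical warehouse size

# HDD: 120 MBps at $0.075/GB; SSD: up to 500 MBps at $1/GB (slide figures)
hdd = (scan_seconds(VOLUME_GB, 120), purchase_cost(VOLUME_GB, 0.075))
ssd = (scan_seconds(VOLUME_GB, 500), purchase_cost(VOLUME_GB, 1.0))
print("HDD: %.0f s, $%.0f" % hdd)
print("SSD: %.0f s, $%.0f" % ssd)
```

With these figures a full scan is roughly four times faster on SSD, while the storage itself costs about thirteen times more.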
Data Warehouse Systems--Common Optimizations: Columnar Storage Principle
● Row-oriented storage
– Read pages containing all columns
● Column-oriented storage
– Read only columns needed for query processing
[Figure: a table with columns Date, Customer, Product, Price and Quantity, stored row-wise and column-wise]
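The difference between the two layouts can be sketched in a few lines of Python. The sample rows and the function names are assumptions for illustration; the "fields read" counter is a crude stand-in for I/O volume, not a measurement of a real storage engine.

```python
# Hypothetical sales table with the five columns of the slide.
ROWS = [
    ("2013-05-28", "Alice", "pen", 1.5, 4),
    ("2013-05-29", "Bob", "ink", 7.0, 1),
    ("2013-05-30", "Alice", "pad", 3.0, 2),
]
NAMES = ("date", "customer", "product", "price", "quantity")

# Column-oriented layout: one contiguous list per column.
COLUMNS = {name: [row[i] for row in ROWS] for i, name in enumerate(NAMES)}

def revenue_row_store(rows):
    """Reads every field of every row, although only two are needed."""
    fields_read = 0
    total = 0.0
    for row in rows:
        fields_read += len(row)
        total += row[3] * row[4]
    return total, fields_read

def revenue_column_store(columns):
    """Reads only the price and quantity columns."""
    price, quantity = columns["price"], columns["quantity"]
    fields_read = len(price) + len(quantity)
    total = sum(p * q for p, q in zip(price, quantity))
    return total, fields_read

print(revenue_row_store(ROWS))
print(revenue_column_store(COLUMNS))
```

Both functions compute the same revenue, but the column-oriented version touches 6 fields instead of 15 on this toy table.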
Data Warehouse Systems--Common Optimizations: Columnar Storage Benefits
● Allows a better data compression rate, since values within a single column are often redundant,
● Eliminates unnecessary I/O by retrieving only the relevant data,
● Vectorwise appears in the TPC-H Top Ten Performance Results (14-Jun-2013)
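As an illustration of why redundant column values compress well, the sketch below applies run-length encoding (RLE) to a single column. RLE is only one possible scheme, chosen here for brevity; nothing in the slides says which encodings a given product uses, and the sample column is hypothetical.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical 'product' column, stored in sorted order.
column = ["ink"] * 4 + ["pad"] * 3 + ["pen"] * 5
encoded = rle_encode(column)
print(encoded)
print(len(column), "values stored as", len(encoded), "runs")
```

The same values scattered across rows of a row-oriented page would not form long runs, which is why the columnar layout is the one that benefits.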
Data Warehouse Systems--Common Optimizations: Derived Data
● Derived Data:
– Indexes,
– Derived Attributes,
– Aggregate tables
● Pros:
– High Performance
● Cons:
– Maintenance: refresh is expensive
– Storage cost
Data Warehouse Systems--DWS Benchmarks
● APB-1 OLAP Benchmark -- obsolete
– Released by the OLAP Council (www.olapcouncil.org) in 1998
– A simple star schema data model
● TPC DSS Benchmarks
– Released by the Transaction Processing Performance Council (www.tpc.org)
– Examine large volumes of data (from 10GB to 100TB)
– Complex relational data model
– TPC-H
    ● Workload composed of 22 ad-hoc complex SQL statements
    ● The most prominent DSS benchmark
– TPC-DS, successor of TPC-H
    ● Workload composed of 99 SQL business questions
    ● Same metrics as TPC-H
Data Warehouse Systems--TPC-H Benchmark Metrics (same for TPC-DS)
● Query-per-hour Performance Metric
– For a given scale factor (warehouse data volume)
– Concurrent users
● Price-Performance Metric
– Ratio of the priced system (cost of ownership: hardware, software, maintenance, and the cost of everything needed to run the TPC-H workload) to the query-per-hour performance metric
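The two metrics can be sketched as follows. The formulas follow the TPC-H specification as we read it (power test, throughput test, and their geometric mean); the specification remains the authority. All timings, the stream count and the system price in the usage lines are made-up values.

```python
import math

def power_at_size(query_secs, refresh_secs, scale_factor):
    """3600 * SF over the geometric mean of the 22 query and 2 refresh timings."""
    timings = list(query_secs) + list(refresh_secs)
    log_mean = sum(math.log(t) for t in timings) / len(timings)
    return 3600.0 * scale_factor / math.exp(log_mean)

def throughput_at_size(streams, elapsed_secs, scale_factor, queries_per_stream=22):
    """Queries completed per hour by concurrent streams, scaled by SF."""
    return streams * queries_per_stream * 3600.0 / elapsed_secs * scale_factor

def qphh_at_size(power, throughput):
    """Composite query-per-hour metric: geometric mean of power and throughput."""
    return math.sqrt(power * throughput)

def price_performance(total_system_price, qphh):
    """Dollars per QphH@Size."""
    return total_system_price / qphh

# Hypothetical run at SF=10: every query and refresh takes 10 s,
# 2 concurrent streams finish in 440 s, priced system costs $36,000.
power = power_at_size([10.0] * 22, [10.0] * 2, scale_factor=10)
throughput = throughput_at_size(streams=2, elapsed_secs=440.0, scale_factor=10)
qphh = qphh_at_size(power, throughput)
print(round(qphh), round(price_performance(36000.0, qphh), 2))
```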
Data Warehouse Systems--TPC-H mismatches Cloud Rationale
● TPC-H does not represent BI suites
– Integration services
– Analytics services (Multi-dimensional eXpressions Language, Mining Structures)
– Reporting services
● TPC-H Workload Processing Metric
– Qph@Size defines the number of queries processed per hour
– The workload is assumed static, which is not realistic!
– The benchmark should assess the scalability of the SUT under variable and evolving workloads and data volumes
Data Warehouse Systems--TPC-H mismatches Cloud Rationale (ctnd.1)
● TPC-H Cost-Performance Metric
– $/Qph@Size, where the cost covers all hardware, software and human resources required to run the workload over 3 years
– The cost model in the cloud is different, and does not relate to the cost of ownership
● TPC-H does not report a Cost-Effectiveness Metric
● TPC-H implementation vs. CAP theorem
– CAP theorem: a distributed system cannot fulfill all three of Consistency (same view of data), Availability (query response) and Partition Tolerance (cope with hardware crash) at once.
– Since DWS are deployed onto shared-nothing architectures, benchmarks should be either CA-, CP- or AP-compliant.
New Requirements & New Metrics
High Performance Requirement--Data Transfer IN/OUT CSP
● Data Transfer Characteristics
– Huge data volumes are transferred IN and OUT of the Cloud Service Provider (CSP)
– Resulting in a network-bound DWS
– Usually, the cost model adopted by CSPs is:
    ● Data upload IN the CSP is free of charge
    ● Data download OUT of the CSP is priced
● Data Transfer Metrics in the Cloud
– Time and cost for data upload
– Time and cost for data download
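A minimal sketch of how the two data transfer metrics could be estimated ahead of a benchmark run. The volume, link bandwidth and per-GB prices are hypothetical and not those of any particular CSP; upload is priced at zero to mirror the cost model described above.

```python
def transfer_time_hours(volume_gb, bandwidth_mbps):
    """Hours needed to move volume_gb over a link of bandwidth_mbps (megabits/s)."""
    megabits = volume_gb * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    return megabits / bandwidth_mbps / 3600

def transfer_cost(volume_gb, price_per_gb):
    """Dollar cost of moving volume_gb at a flat per-GB price."""
    return volume_gb * price_per_gb

# Hypothetical scenario: 900 GB over a 1000 Mbps link.
UPLOAD_PRICE = 0.0     # upload IN the CSP assumed free of charge
DOWNLOAD_PRICE = 0.10  # assumed $/GB for download OUT of the CSP

hours = transfer_time_hours(900, 1000)
print("upload:   %.1f h, $%.2f" % (hours, transfer_cost(900, UPLOAD_PRICE)))
print("download: %.1f h, $%.2f" % (hours, transfer_cost(900, DOWNLOAD_PRICE)))
```

A real measurement would replace the nominal bandwidth with the observed throughput, which is usually lower and asymmetric.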
High Performance Requirement--Workload Processing (ctnd. 1)
● Workload Processing Characteristics
– Both I/O-bound and CPU-bound business questions
– Intra-query parallelism combined with virtual or physical partitioning
● Performance across Cluster Size
– For each business question, there is an optimum response time at a particular cluster size; performance degrades for both smaller and larger clusters
– Observed for both SQL and NoSQL technologies
High Performance Requirement--Workload Processing (ctnd. 2)
● TPC-H benchmarking of Apache Hadoop/Pig Latin on GRID5000, Bordeaux site [Moussa, ICCIT'12] (SF=10)
High Performance Requirement--Workload Processing (ctnd. 3)
● Workload Processing Metrics
– Elapsed times for running business questions,
– Slope: performance - cost
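One possible reading of these two metrics is sketched below: pick the cluster size with the lowest elapsed time, then compute, between consecutive cluster sizes, how many seconds are saved per extra dollar spent. This interpretation of the "slope", the elapsed times and the flat per-node price are all assumptions for illustration.

```python
# Hypothetical elapsed times (seconds) of one business question per cluster size.
ELAPSED = {2: 400.0, 4: 220.0, 8: 150.0, 16: 180.0}
NODE_PRICE_PER_HOUR = 0.5  # assumed flat per-node price, in dollars

def optimum_cluster_size(elapsed):
    """Cluster size with the lowest elapsed time."""
    return min(elapsed, key=elapsed.get)

def run_cost(nodes, seconds, node_price_per_hour):
    """Cost of one run, billed proportionally to node-seconds."""
    return nodes * node_price_per_hour * seconds / 3600

def slopes(elapsed, node_price_per_hour):
    """Seconds saved per extra dollar between consecutive cluster sizes."""
    sizes = sorted(elapsed)
    result = {}
    for small, large in zip(sizes, sizes[1:]):
        d_time = elapsed[small] - elapsed[large]
        d_cost = (run_cost(large, elapsed[large], node_price_per_hour)
                  - run_cost(small, elapsed[small], node_price_per_hour))
        if d_cost != 0:  # skip steps where the cost does not change
            result[(small, large)] = d_time / d_cost
    return result

print(optimum_cluster_size(ELAPSED))
print(slopes(ELAPSED, NODE_PRICE_PER_HOUR))
```

A negative slope, as for the step from 8 to 16 nodes here, means that adding nodes costs more and runs slower, i.e. the cluster has grown past its optimum.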
Scalability Requirement
● Definition
– Scalability is the ability of a system to increase total throughput under an increased load when hardware resources are added.
● Scalability Metric
– Query Performance Metric under
    ● Ever increasing workload
    ● Different query frequencies
Elasticity Requirement
● Definition
– Elasticity adjusts the system capacity at runtime by adding and removing resources without service interruption in order to handle the workload variation.
● Elasticity Metric
– Capacity to add/remove resources: (0|1)
– Scaling latency: elapsed time to scale down and to scale up
– Impact on SUT performance during scale-up and scale-down
– Scale-up cost (+$)
– Scale-down gain (-$)
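The elasticity metrics listed above could be collected in a small record such as the one sketched here. The class name `ElasticityReport`, its field names and the sample values are hypothetical; the relative slowdown is one simple way to quantify the impact on SUT performance, not a definition taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class ElasticityReport:
    can_resize: bool             # capacity to add/remove resources (0|1)
    scale_up_latency_s: float    # elapsed time to scale up
    scale_down_latency_s: float  # elapsed time to scale down
    baseline_query_s: float      # response time while the cluster is stable
    resizing_query_s: float      # response time while the cluster is resizing
    scale_up_cost: float         # +$ per hour after scaling up
    scale_down_gain: float       # -$ per hour after scaling down

    def performance_impact(self):
        """Relative slowdown observed while the cluster is being resized."""
        return self.resizing_query_s / self.baseline_query_s - 1.0

# Hypothetical measurements for one scale-up / scale-down cycle.
report = ElasticityReport(True, 120.0, 60.0, 100.0, 125.0, 0.5, 0.5)
print("slowdown during resize: %.0f%%" % (100 * report.performance_impact()))
```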
High Availability Requirement--Redundancy Strategies
● Redundancy Strategies
– Replication (a.k.a. mirroring)
– Erasure-Resilient Codes
● Redundancy Strategies vs. Workload Type
– Replication suits OLTP workloads
– Erasure-resilient codes suit OLAP workloads
● Comparison [Litwin et al.,ACM TODS'05]
– Data storage cost
– Computation cost
– Communication cost
High Availability Requirement--Strategies Comparison (ctnd.1)
High Availability Requirement--Metrics for the Cloud (ctnd.2)
● High Availability Metrics
– $@k: cost of different targeted levels of availability (1-available, ..., k-available, where k is the number of failures the system can tolerate).
– Cost of recovery, expressed as
    ● Time to get the system back
    ● Decreased system productivity caused by the hardware failure ($), from the customer perspective
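To illustrate the storage part of $@k, the sketch below compares the two redundancy strategies for a k-available system. It relies on two standard facts: replication needs k+1 copies to survive k failures, and an erasure code with m data fragments and k parity fragments stores (m+k)/m times the data. Computation and communication costs are ignored, and the data volume, price and value of m are made-up inputs.

```python
def replication_factor(k):
    """Copies needed to survive k failures with plain replication."""
    return k + 1

def erasure_code_factor(m, k):
    """Storage overhead of m data fragments plus k parity fragments."""
    return (m + k) / m

def storage_cost_at_k(data_gb, price_per_gb_month, factor):
    """Monthly storage bill for data_gb stored with the given redundancy factor."""
    return data_gb * factor * price_per_gb_month

DATA_GB = 1000  # hypothetical warehouse volume
PRICE = 0.10    # assumed $/GB/month

for k in (1, 2, 3):
    rep = storage_cost_at_k(DATA_GB, PRICE, replication_factor(k))
    ec = storage_cost_at_k(DATA_GB, PRICE, erasure_code_factor(4, k))
    print("k=%d  replication=$%.0f  erasure code (m=4)=$%.0f" % (k, rep, ec))
```

The gap widens as k grows, which is consistent with the lower storage cost of erasure-resilient codes; their extra encoding and recovery work is what this sketch leaves out.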
Cost Management Requirement
● CSP pricing model
– Different cloud service price models (IaaS, PaaS, SaaS)
– e.g.
    ● CPU cost for IaaS: instance-based (Amazon, MS Azure) or CPU-cycles-based (Cloud Sites, Google App Engine)
    ● Query processing by Google BigQuery is priced on retrieved bytes (columnar storage)
● Cost-Performance Ratio
● Cost-Effectiveness ratio
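The pricing models mentioned above lead to different cost formulas for the same workload. The sketch below contrasts instance-based billing with bytes-scanned billing and derives a cost-performance ratio; all prices are placeholders rather than actual CSP tariffs, and billing every started instance hour is an assumption of this sketch.

```python
import math

def instance_based_cost(nodes, elapsed_hours, price_per_instance_hour):
    """Instance-based pricing; every started hour is billed (assumption)."""
    return nodes * math.ceil(elapsed_hours) * price_per_instance_hour

def bytes_scanned_cost(scanned_tb, price_per_tb):
    """Pricing based on the volume of data read by the queries."""
    return scanned_tb * price_per_tb

def cost_performance(cost, queries_per_hour):
    """Dollars spent per unit of query-per-hour performance."""
    return cost / queries_per_hour

# Hypothetical workload: 4 instances for 1.2 hours, or 0.2 TB scanned.
iaas = instance_based_cost(4, 1.2, 0.5)
scan = bytes_scanned_cost(0.2, 5.0)
print(iaas, scan, cost_performance(iaas, 8))
```

Under bytes-scanned pricing, columnar storage lowers the bill directly, since queries that read fewer columns retrieve fewer bytes.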
Related Work
● Benchmarking in the cloud
– [Gray, MS'08]: TeraSort Benchmark for data sort evaluations,
– [Cooper et al., SoCC'10]: Yahoo Cloud Serving Benchmark (YCSB) for evaluating the performance of "key-value" and "cloud" serving stores.
– [Sobel et al., ICCSA'08]: CloudStone Benchmark for Web2.0 applications
– [Bennet et al., KDD'10]: MalStone Benchmarking for data mining in the cloud
– [Ang et al., USENIX'10]: CloudCMP project for CSP comparison
– [Binnig et al., DBTest'09], [Kossmann et al., SIGMOD'10]: Benchmarking OLTP systems in the cloud
Related Work (ctnd.1)
● NoSQL and SQL Technologies Assessment in the cloud
– [Pavlo et al., SIGMOD'09],
– [Floratou et al., TPC-TC'11],
● More Specific Issues
– [Forrester, 2011]: Storage on-premises vs. in the cloud
– [Nguyen et al., EDBT Workshops'12]: Materialized Views Selection
– [Moussa, IJWA'12]: OLAP Scenarios in the Cloud and OLAP Workload Taxonomy
Conclusion & Future Work
● Keynote scope
– Overview of DWS
– Insight into new requirements and new metrics to be considered for benchmarking DWS in the cloud [Moussa, AICCSA'13]
● Research Perspectives
– Assessment of OLAP systems in the cloud, e.g.
    ● Amazon RDS
    ● Google BigQuery
    ● MS Azure
    ● ...
Research Perspectives--New OLTP Systems
● Classical Workload Taxonomy
– OLTP: transactions, ACID properties
– OLAP: complex queries, star-joins, grouping, aggregations...
● New OLTP Workload features:
– OLTP
– Big Data
– Real-time analytics
● Examples of systems: Google Spanner, Clustrix, NuoDB and TransLattice
Thank you for your attention
Q & A
Rim Moussa
Data Warehouse Systems in the Cloud
N2C'2013, Hammamet, 15th June 2013
Data Warehouse Systems--TPC-H Benchmark Relational DB Schema
Data Warehouse Systems--TPC-H Benchmark Metrics