isncc 2017
TRANSCRIPT
16th May 2017 ISNCC'[email protected] of Morocco 1
Multi-Dimensional Database Modeling:
Literature Survey & Research Agenda in the
Big Data
Alfredo Cuzzocrea University of Trieste & ICAR Rim Moussa LaTICE Lab. & University of Carthage
The 4th Intl. Symposium on Networks, Computers and
Communications @ Marrakesh, Kingdom of Morocco 16th of May, 2017
16th May 2017 ISNCC'[email protected] of Morocco 2
Business Intelligence
How to translate data into Insights & Opportunities! »BI encompasses a variety of tools, applications and methodologies that
enable organizations to collect data from internal systems and external sources, prepare it for analysis, develop and run queries against the data, and create reports, dashboards and data visualizations to make the analytical results available to corporate decision makers as well as operational workers.
»Integrations tools (ETL tools), Workflow design, »Data warehousing tools »OLAP servers technologies, OLAP clients, »mining structures, reporting tools, dashboards’ design tools »Multi-dimensional design methods, …
16th May 2017 ISNCC'[email protected] of Morocco 3
Research agenda?
DOLAP Workshop 2006
IBM White paper 2015
16th May 2017 ISNCC'[email protected] of Morocco 4
Outline
Part I: Data Warehouse Systems DWS Architectures OLAP cube DSS Benchmarks
Part II: Multi-dimensional Database Modeling
Part III: Challenging Problems
Conclusion
16th May 2017 ISNCC'[email protected] of Morocco 5
Lazy Data Integration Query-driven Architecture
Relational Data
Source
WRAPPER WRAPPER
MEDIATOR
WRAPPER WRAPPER
16th May 2017 ISNCC'[email protected] of Morocco 6
Eager Data Integration Warehouse System Architecture
Data Warehouse Relational Data
Source
Integration workflows of the
Integration System
16th May 2017 ISNCC'[email protected] of Morocco
Facts: are the objects that represent the subject of the desired analyses. »Examples: sales records, weather records, cabs trips, … »The fact table contained 3 types of attributes: measured attributes, foreign keys
to dimension tables, degenerate dimensions Dimension(s): »Levels are individual values that make up dimensions »Examples
»Date dimension (Trimester, month, day) »Time dimension (hour, min, sec) »Geography dimension (Country, city, postal code)
Measure(s): »Examples: revenue, lost revenue, sold quantities, expenses, … »Use aggregate functions: min, max, count, distinct-count, sum, average, …
7
OLAP Cube
16th May 2017 ISNCC'[email protected] of Morocco 8
OLAP Operations On-Line Analytical Processing (Q10 of TPC-H benchmark)
16th May 2017 ISNCC'[email protected] of Morocco
Structured Query Language (SQL)
»Relational and static schema »Data Definition, data Manipulation, and Data Control Language »Analytic Functions (window functions over partition by …) »Cube, roll-up and grouping sets operators
MultiDimensional eXpressions (MDX)
»Invented by Microsoft in 1997 »For querying and manipulating the multidimensional data stored in OLAP cubes »Static schema
Data Flow programming language
»Google Sawzall, Apache Pig Latin, IBM Infosphere Streams »Dynamic schema »After data is loaded, multiple operators are applied on that data before the final
output is stored.
9
Query Languages
Load Data Apply
Schema Apply Filter Group Data
Apply Aggregate Function
Sort Data Store
Output
16th May 2017 ISNCC'[email protected] of Morocco 10
Query Languages SQL – Q16 of TPC-H benchmark
16th May 2017 ISNCC'[email protected] of Morocco 11
Query Languages MDX –Q16 of TPC-H Benchmark
WITH SET [Brands] AS 'Except({[Part Brand].Members}, {[Part
Brand].[Brand#45 ]})'
SET [Types] AS 'Filter({[Part Type].Members}, (NOT ([Part
Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
SET [Sizes] AS 'Filter({[Part Size].Members}, ([Part Size].CurrentMember IN
{[Part Size].[3], [Part Size].[9], [Part Size].[14], [Part Size].[19], [Part Size].[23],
[Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
nonemptyCrossjoin(nonemptyCrossjoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]
16th May 2017 ISNCC'[email protected] of Morocco 12
Query Languages Data Flow –Pig Latin script for Q16 of TPC-H benchmark
16th May 2017 ISNCC'[email protected] of Morocco 13
Decision Support Systems Benchmarks
Non-TPC Benchmarks Real datasets
»Open data or proprietary data »fixed size »Devise a workload or trace the proprietary workload
APB-1: no scale factor TPC Benchmarks
The Transaction Processing Council founded in 1988 to define benchmarks
In 2009, TPC-TC is set up as an International Technology Conference Series on Performance Evaluation and Benchmarking
Examples of benchmarks relevant for benchmarking decision support systems: TPC-H, TPC-DS and TPC-DI Common characteristics of TPC benchmarks
»Synthetic data »Scale factor allowing generation of different volumes 1GB to 1PB
16th May 2017 ISNCC'[email protected] of Morocco 14
Decision Support Systems Benchmarks TPC-H Benchmark Schema (1/2) TPC-H Benchmark: complex schema 22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions
16th May 2017 ISNCC'[email protected] of Morocco 15
Decision Support Systems Benchmarks TPC-H Benchmark (2/2)
TPC-H Benchmark 2 Metrics »QphH@Size is the number of queries processed per hour, that the system
under test can handle for a fixed load »$/QphH@Size represents the ratio of cost to performance, where the cost is
the cost of ownership of the SUT (hardware,software, maintenance). Variants of TPC-H Benchmarks
TPC-H*d Benchmark [Cuzzocrea and Moussa, 2013] »Turning TPC-H benchmark into a Multi-dimensional benchmark »Few schema changes »No update to workload requirement »MDX workload for OLAP cubes and OLAP queries
SSB: Star Schema Benchmark [O’Neil et al., 2012] »Turning TPC-H benchmark into star-schema »Workload composed of 12 queries
TPC-H translated into Pig Latin (Apache Hadoop Ecosystem) [Moussa,2012] »22 pig latin scripts which load and process TPC-H raw data files (.tbl files)
16th May 2017 ISNCC'[email protected] of Morocco 16
Decision Support Systems Benchmarks TPC-DS Benchmark (1/2)
TPC-DS Benchmark: 7 data marts
16th May 2017 ISNCC'[email protected] of Morocco 17
Decision Support Systems Benchmarks TPC-DS Benchmark (2/2) TPC-DS Benchmark Workload
Hundred of queries (99 query templates) OLAP, windowing functions, mining, and reporting queries ACID and Concurrent data maintenance (not ACID in TPC-DS 2.x)
TPC-DS Benchmark Metrics 3 Metrics
»QphDS@Size is the number of queries processed per hour, that the system under test can handle for a fixed load.
»Data Maintenance and Load Time are calculated »$/QphDS@Size represents the ratio of cost to performance, where the
cost is a 3 year cost of ownership of the SUT (hardware,software, maintenance)
»System Availability date: the date when the system is available to customers.
TPC-DS implementations TPC-DS v2.0
»Extension for non-relational systems such as Hadoop/Spark big data systems
16th May 2017 ISNCC'[email protected] of Morocco 18
Decision Support Systems Benchmarks TPC-DI Benchmark (1/3) [Poess et al. 2014] For benchmarking Data Integration technologies Synthetic Data of a Factious Retail Brokerage Firm
»Internal Trading system data, Internal Human resources data, Internal CRM System and External data
»Different data scales »Data extracted from different sources:
»Structured (csv) »Semi-structured data (xml) »Multi record (nested data) »Change Data Capture (CDS)
Complex Data Integration Tasks Load large volumes of historical data Load incremental updates Execute complex transformations Check and ensure consistency of data
16th May 2017 ISNCC'[email protected] of Morocco 19
TPC-DI Benchmark Complex Transformations (2/3) TPC-DI implements 18 Complex Transformations which feature the
following characteristics, Transfer XML to relational data Detect changes in dimension data, and applying appropriate tracking
mechanisms for history keeping dimensions Filter input data according to pre-defined conditions Identify new, deleted and updated records in input data Merge multiple input files of the same structure Join data of one input file to data from another input file with different
structure Standardize entries of the input files Join data from input file to dimension table Join data from multiple input files with separate structures Consolidate multiple change records per day and identify most current Perform extensive arithmetic calculations Read data from files with variable type records Check data for errors or for adherence to business rules Detect changes in fact data, and journaling updates to reflect current state
16th May 2017 ISNCC'[email protected] of Morocco 20
TPC-DI Benchmark Metrics (3/3)
Metrics
Performance Metric TPC_DI_RPS = Trunc(GeoMean(TH, Min(TI1 , TI2)))
TH: Throughput of Historical Load
TI1: Throughput of Incremental Update 1
TI2: Throughput of Incremental Update 2
Price/Performance Metric Price-per-TPC_DI_RPS = $ / TPC_DI_RPS
$ is the total 3-year pricing
16th May 2017 ISNCC'[email protected] of Morocco 21
Outline
Introduction
Part I: Data Warehouses
Part II: Experiences
TPC-H*d Experience AutoMDB TPC-DS*d
Part III: Challenging Problems
Conclusion
16th May 2017 ISNCC'[email protected] of Morocco
Given,
A relational Warehouse schema
A Workload -a set of SQL Statements,
W = {Q1, Q2, …, Qn}
where Qi is a parameterized query
How to design the Multi-dimensional DB Schema?
How to define cubes?
Will there be a single cube or multiple cubes?
Are there any rules for merging of cubes?
Are there any rules for definition of virtual cubes?
Which Optimizations are suitable for performance tuning ?
Derived data calculus & refresh?
Data partitioning & parallel cube building?
# 22
MDB Design Problem
16th May 2017 ISNCC'[email protected] of Morocco # 23
Idea
Map each business question to an OLAP cube
>> Obtain a multi-dimensional DB schema
Recommend & Test Optimizations
>> Derived Data
>> Data partitioning
>> Cube Merging
16th May 2017 ISNCC'[email protected] of Morocco # 24
TPC-H*d
Truly OLAP variant of TPC-H benchmark
TPC-H SQL workload translated into MDX (MultiDimensional
eXpressions)
The workload is composed of 23 MDX statements for OLAP
cubes and 23 MDX statements for OLAP business queries. Each business question of TPC-H benchmark is mapped to an OLAP
cube
16th May 2017 ISNCC'[email protected] of Morocco # 25
TPC-H*d Q8: From SQL statement to OLAP cube
16th May 2017 ISNCC'[email protected] of Morocco # 26
TPC-H*d TPC-H*d OLAP Cube C8
Market Share for each supplier nation within a region of customers, for each year and each part type
16th May 2017 ISNCC'[email protected] of Morocco # 27
TPC-H*d TPC-H*d OLAP Query Q8
Market Share for each RUSSIAN Suppliers within AMERICA region, Over the years 1995 and 1996 and for part type ECO. ANODIZED STEEL
16th May 2017 ISNCC'[email protected] of Morocco
Open source software implemented in java
Parses MDB schemas (.xml) files using SAX Library
Performs comparisons of OLAP cubes' characteristics.
»For each pair of OLAP cubes, »show whether they have same fact table or not »compute the nbr of shared | different | coalescable dimensions
»Dimensions are coalescable if they are extracted from the dimension table and their hierarchies are coalescable
»compute the number of shared | different measures »Run merge of OLAP cubes using different similarity functions
»Simple distance function have or not same fact table »K-means clustering
»Distance function is computed with weights according dimensions
»Propose Virtual Cubes »Auto-generate a new MDB Schema (.xml)
»Create MDB Schema from SQL Workload »On-going, tests include TPC-DS benchmark
# 28
AutoMDB
16th May 2017 ISNCC'[email protected] of Morocco # 29
AutoMDB Load OLAP Cubes defined in xml file
16th May 2017 ISNCC'[email protected] of Morocco # 30
AutoMDB Compare OLAP Cubes –have or not same fact table
16th May 2017 ISNCC'[email protected] of Morocco # 31
AutoMDB Compare Cubes –Group cubes which have same fact table
16th May 2017 ISNCC'[email protected] of Morocco # 32
AutoMDB Compare Cubes –Auto-generate a new MDB schema
16th May 2017 ISNCC'[email protected] of Morocco 33
Outline
Introduction
Part I: Data warehousing
Part II: Experiences
Part III: Challenges Big Data Integration Flexible Schema Model Curse of dimensionality Systems which scale-out Intelligent Recommenders Real-time OLAP Advanced Visualization
Conclusion
16th May 2017 ISNCC'[email protected] of Morocco 34
Big Data
Volume
»Volume refers to the amount of data, henceforth the challenge is integration at scale,
Velocity »Velocity refers to the speed at which new data is generated, henceforth
the challenge is to integrate and analyze data while it is being generated, Variety
»Variety refers to different types of data; e.g. structured (relational data), semi-structured (XML, JSON, BSON), unstructured (text); henceforth the challenge is integration of different types of data,
Veracity »Veracity refers to the messiness or trustworthiness of the data,
henceforth the challenge is to integrate uncertain data quality in data sources,
Value »Value refers to our ability to turn data into value.
16th May 2017 ISNCC'[email protected] of Morocco 35
Advanced and highly performance Technologies
»Allow to load and extract different data formats and allow to perform complex transformations
»Solve heterogeneity »Syncing Across Data Sources
»data copies migrated from a wide range of sources on different rates and schedules can rapidly get out of the synchronization
»Perform data integration at scale Solution Cost
TPC-DI is a good start for benchmarking integration solutions TPC-DI implementation is on-going
Challenge #1: Big Data Integration
16th May 2017 ISNCC'[email protected] of Morocco 36
Challenge #2: Data Schema Model
Schema-based data »E.g. relational model, multi-dimensional model »They define what columns appear, their names, and their datatypes. Schemaless Data and Dynamic Schema »Non-uniform data: custom fields and non-uniform data types »Parse data and query on-the-fly »NoSQL systems (e.g. Apache pig latin, SQL-on-hadoop) Data Lake »All data is loaded from source systems i.e., no data is turned away. »Data is stored at the leaf level in an untransformed or nearly
untransformed state. »Data is transformed and schema is applied to fulfill the needs of analysis.
»Load in the raw data, as-is. At processing time give data a structure. schema-on-read
»High agility: configuration and reconfiguration of models, queries and applications as needed on-the-fly
16th May 2017 ISNCC'[email protected] of Morocco 37
Challenge #3: Curse of Dimensionality
Datasets in applications like bioinformatics and text processing are characterized by,
Large column data set a.k.a. highly dimensional data »100 columns of data 100 dimension hyperspace »A data cube of 100 dimensions such each dimension cardinality=10 11100 aggregate cells
Moderate size »One million of tuples
Proposed Solutions Minimal Cubing Approach [Li et al., 2004 ] Dimension Reduction & Feature Selection
»Principal Component Analysis (PCA) »Linear Discriminant Analysis (LDA) »Canonical Correlation Analysis (CCA) »Latent Semantic Indexing (LSI) for text data
»Complex and not scalable
16th May 2017 ISNCC'[email protected] of Morocco 38
Challenge #4: Scale-out
Introduction
Part I: Methods & State-of-the-Art
Part II: Experiences
Part III: Challenging Problems
Conclusion
Data Fragmentation Parallel IO Parallel Processing
Technologies Parallel Cube Calculus Distributed Relational Data Warehouses + Mid-tier for parallel
cube calculus, e.g. OLAP* framework
SQL-on-Hadoop Systems e.g. Hive, Spark SQL, Drill, Impala, BigInsights
Scalable and Distributed Data Structures K-RP*s [Litwin et al.,1996], SkipTree [Messeguer, 1997], SkipWebs
[Arge, 2005], PN-Tree [Ali, 2005]
16th May 2017 ISNCC'[email protected] of Morocco 39
Challenge #5: Intelligent Recommenders Self Manageability Selection of Indexes and Materialized Views are physical
structures that can significantly accelerate performance. Recommenders for performance tuning
»AutoAdmin research project at Microsoft, which explores novel techniques to make databases self-tuning [Agrawal et al., 2000]
»MS Database Tuning Advisor: offline techniques, DBA is out of the picture
Alerter Approach [Hose et al., 2008] »support the aggregate configuration of an OLAP server by (1)
continuously monitoring information about the workload and the benefit of aggregation tables and (2) alerting the DBA if changes to the current configuration would be beneficial
Semi-Automatic Index Tuning: keeping DBAs in the loop [Schnaiter and Polyzotis, 2012] »Online workload analysis with decisions delegated to the DBA »Solution takes into index interactions
16th May 2017 ISNCC'[email protected] of Morocco 40
Challenge #6: Real-time Processing Retrospective Analysis
Traditional Architectures »DWS are deployed as part of an OLAP System separated out from the
OLTP system, »Data propagates down to the OLAP system, but typically after some “time”,
»This is a sufficient system for retrospective analytics, but does not suffice
in situations that require real-time analytics . New Data Processing Architectures Architecture for Big Data processing at Scale »Lambda Architecture by Nathan Marz (2014)
»Batch processing system + Speed processing system »Kappa Architecture
»No batch processing systems
»New solutions for architecting DW systems!
16th May 2017 ISNCC'[email protected] of Morocco 41
Challenge #7: Visualization
Introduction
Part I: Methods & State-of-the-Art
Part II: Experiences
Part III: Challenging Problems
Conclusion
Goal: Visualize multidimensional data sets by capturing the dimensionality of data Data cube
Is a representation of a multidimensional data set OLAP clients technologies Pivot table or cross-tab OLAP ops: drill-down, roll-up, pivot, slice, dice bar-charts, pie-charts, and time series e.g. Apache jPivot, Saiku
Sophisticated visualization techniques challenges Interactive Reflect current data Support of large dimensional datasets Ergonomic
16th May 2017 ISNCC'[email protected] of Morocco 42
Challenge #7: Visualization Multidimensional Networks
Networks’ application domains: social networks (linkedIn, facebook, …) , the web, communication networks, … A multidimensional network
Graph stucture and Multidimensional attributes Example: Graph Cube [Zhao et al. 2011]
-1- and -2- are friends
16th May 2017 ISNCC'[email protected] of Morocco 43
Challenge #7: Visualization Nanocubes by AT&T Real-time exploratory visualization on large spatiotemporal and
multidimensional datasets Real-time: extremely fast queries Spatial: spatial region that can be either a rectangle covering most of
the world, or a heatmap of activity Multidimensional: besides latitude, longitude, and time, there are
other attributes such as tweet device: android or iphone, tweet language: eng, fr, it, sp, ru…
Nanocube [Lins et al., 2013 ] Algorithm outline for every object oi find the finest address of the schema S hit by this object, update the time series associated with this address update in a deepest first fashion, all coarser addresses hit by oi
Cons: memory usage when indexing all six dimensions (latitude, longitude, time, language, device,
application), the 210 million points from Twitter take around 45GB of memory
16th May 2017 ISNCC'[email protected] of Morocco 45
References (1/3) M. Fricke, The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information
Science. 2009. E.F. Codd, S.B. Codd and C.T. Salley, Providing OLAP to User Analysts: an IT mandate, 1993. J. Widom, Integrating Heterogeneous databases: eager or lazy? ACM Computing Surveys (CSUR)
Vol.4, 1996 Y.R. Cho, Data Warehouse and OLAP Operations www.ecs.baylor.edu/faculty/cho/4352 TPC homepage http://www.tpc.org/ M. Poess, T. Rabl and B. Caufield: TPC-DI: The First Industry Benchmark for Data
Integration. PVLDB 7(13): 1367-1378 (2014) http://www.vldb.org/pvldb/vol7/p1367-poess.pdf X. Li, J. Han, H. Gonzalez: High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004. C. Imhoff, N. Galemmo, J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. 2003.
R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, B. Becker. The Data Warehouse
Lifecycle Toolkit. 2nd Edition.
R. Kimball, M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2nd Edition. H. G. Molia. Data Warehousing Overview: Issues, Terminology, Products.
www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)
16th May 2017 ISNCC'[email protected] of Morocco # 46
References (2/3) Modeling Multidimensional Databases (non exhaustive list)
M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. VLDB’1997. R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases. ICDE’1997. J. Gray, A. Bosworth, A. Layman and H. Priahesh. Data Cube: A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’2008. P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations. SSDBM’1998. L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases. EDBT’1998. D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube
schema design. CIKM’1999. T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries. DOLAP’2001. O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems. DEXA’2010. A. Cuzzocrea and R. Moussa. Multidimensional Database Design
via Schema Transformation: Turning TPC-H into the TPC-H*d Multidimensional Benchmark. COMAD’2013.
16th May 2017 ISNCC'[email protected] of Morocco 47
References (3/3)
Introduction
Part I: Methods & State-of-the-Art
Part II: Experiences
Part III: Challenging Problems
Conclusion
M. Fowler, Schemaless data structures. 2013 http://martinfowler.com/articles/schemaless/ N. Marz and J. Warren, Big Data: Principles and best practices of scalable realtime data
systems, 1st Edition S. Agrawal, S. Chaudhuri and V. Narasayya Automated Selection of Materialized Views and
Indexes for SQL Databases. VLDB’2000 http://www.research.microsoft.com/dmx/AutoAdmin K. Hose, D. Klan, M. Marx and K. Sattler. When is it Time to Rethink the Aggregate
Configuration of Your OLAP Server?. VLDB’2008 Karl Schnaitter and Neoklis Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the
Loop. VLDB’2012 P. Zhao, X. Li, D. Xin and J. Han.
Graph cube: on warehousing and OLAP multidimensional networks. SIGMOD’2011 L. D. Lins, J. T. Klosowski and C. E. Scheidegger:
Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. IEEE Trans. Vis. Comput. Graph. 2013 https://github.com/laurolins/nanocube
16th May 2017 ISNCC'[email protected] of Morocco 48
Thank you for your Attention
Q & A
Multi-Dimensional Database Modeling and Querying: Literature
Survey and Research Agenda in the Big Data Era
Alfredo Cuzzocrea and Rim Moussa
16th of May, 2017