qach16

24
CHAPTER 16: QUESTIONS AND ANSWERS 1. Why are operational databases not particularly suited for decision support applications? Ans: Operational databases must be cleaned and integrated for decision support. Decision support needs a broad view that integrates business processes. Because of the different requirements, operational databases are usually separate from databases for decision support. Using a common database for both kinds of processing can significantly degrade database performance and make it difficult to summarize activity across business processes. 2. How is a data warehouse different from a data mart? Ans: A data warehouse is a central repository for summarized and integrated data from operational databases and external resources. A data mart is a subset of a data warehouse typically at a department or functional level. 3. When is the three-tier data warehouse architecture more appropriate than the two-tier data warehouse architecture? Ans: The three-tier data warehouse architecture is more appropria te when the two-tier data warehouse architecture has performance problems for large data warehouse with data-intensive applications for decision support. 4. What are the components of an enterprise data model? Ans: An enterprise data model contains the structure of a data warehouse and the metadata to access operational databases and external data sources. The EDM may also contain details about cleaning and integrating data sources. 5. What are some causes for failures in data warehouse projects? Ans: Failures in data warehouse projects have been caused by poor planning or too much emphasis being placed on first creating an enterprise data model. The difficulty of cleaning and integrating data sources has also been the downfall of some projects. 6. Does a bottom-up data warehouse architecture use an enterprise data model? Ans: No, a bottom-up data warehouse architecture does not use an enterprise data model. However, a data mart contains a smaller-scale data model that contains the metadata to access operational databases and external data sources. 7. What is an oper mart? Ans: An oper mart is a just-in-time data mart, usually built from one operational database in anticipation or in response to major events such as disasters and new product introductions. An oper mart supports peak demand for reporting and business analysis that accompanies a major event. After the decision support demand subsides, the oper mart may be dismantled or it may be merged into an existing data warehouse. 8. What are the advantages of multidimensional representation over relational representation for data used for decision support?

Upload: naglaa-mostafa

Post on 08-Nov-2014

97 views

Category:

Documents


22 download

DESCRIPTION

data warehouse questions

TRANSCRIPT

Page 1: QAch16

CHAPTER 16: QUESTIONS AND ANSWERS 1. Why are operational databases not particularly suited for decision support

applications? Ans: Operational databases must be cleaned and integrated for decision support. Decision support needs a broad view that integrates business processes. Because of the different requirements, operational databases are usually separate from databases for decision support. Using a common database for both kinds of processing can significantly degrade database performance and make it difficult to summarize activity across business processes. 2. How is a data warehouse different from a data mart? Ans: A data warehouse is a central repository for summarized and integrated data from operational databases and external resources. A data mart is a subset of a data warehouse typically at a department or functional level. 3. When is the three-tier data warehouse architecture more appropriate than the two-tier

data warehouse architecture? Ans: The three-tier data warehouse architecture is more appropria te when the two-tier data warehouse architecture has performance problems for large data warehouse with data-intensive applications for decision support. 4. What are the components of an enterprise data model? Ans: An enterprise data model contains the struc ture of a data warehouse and the metadata to access operational databases and external data sources. The EDM may also contain details about cleaning and integrating data sources. 5. What are some causes for failures in data warehouse projects? Ans: Failures in data warehouse projects have been caused by poor planning or too much emphasis being placed on first creating an enterprise data model. The difficulty of cleaning and integrating data sources has also been the downfall of some projects. 6. Does a bottom-up data warehouse architecture use an enterprise data model? Ans: No, a bottom-up data warehouse architecture does not use an enterprise data model. However, a data mart contains a smaller-scale data model that contains the metadata to access operational databases and external data sources. 7. What is an oper mart? Ans: An oper mart is a just- in-time data mart, usually built from one operational database in anticipation or in response to major events such as disasters and new product introductions. An oper mart supports peak demand for reporting and business analysis that accompanies a major event. After the decision support demand subsides, the oper mart may be dismantled or it may be merged into an existing data warehouse. 8. What are the advantages of multidimensional representation over relational

representation for data used for decision support?

Page 2: QAch16

Ans: A multidimensional representation is conceptually easier to understand and visualize if the underlying data model has more than two dimensions. 9. Explain why a dimension may have multiple hierarchies. Ans: A dimension may have multiple hierarchies to provide alternative organizations among dimension elements. For example, a location dimension can have one hierarchy for country, state, and city and a second hierarchy for country, state, and zip code. In a dimension with multiple hierarchies, usually at least one level is shared. 10. What are the advantages of using time-series data in a cell instead of time as a

dimension? Ans: Using time-series data allows users to store all historic data in one cell, instead of specifying a separate time dimension. A major advantage is that a number of statistical functions can operate directly on time-series data. 11. How is slicing a data cube different from dicing? Ans: Slicing data focuses on a subset of the dimensions to gain insights. Dicing data focuses on a subset of members to gain insights. 12. What are the differences between drilling down a data cube and slicing or dicing it? Ans: Drilling down a data cube allows users to naviga te among the levels of hierarchical dimensions. Dicing typically follows slicing which returns a subset of the values displayed in the preceding slice. 13. How is a pivot operation useful for multidimensional databases? Ans: The pivot operation supports rearrangement of the dimensions in a data cube. 14. Explain the significance of sparsity in a data cube. Ans: Sparsity indicates the extent of empty cells in a data cube. Sparsity can be a problem if two or more dimensions are related. For example, if certain products are sold only in selected states, cells may be empty. If a large number of cells are empty, the data cube can waste space and be slow to process. Special compression techniques can be used to reduce the size of sparse data cubes. 15. What is a star schema? Ans: A star schema consists of a fact table surrounded by dimension tables. There is a 1-M relationship from each dimension table to the fact table. 16. What are the differences between fact tables and dimension tables? Ans: Fact tables store numeric data (facts), while dimension tables store descriptive data corresponding to individual dimensions of the data cube. 17. How does a snowflake schema differ from a star schema? Ans: A snowflake schema has multiple levels of dimension tables related to one or more fact tables. In contrast, a star schema only has one level of dimension tables. You should

Page 3: QAch16

consider the snowflake schema instead of the star schema for small dimension tables that are not in 3NF. 18. What is a constellation schema? Ans: A constellation schema contains multiple fact tables in the center related to dimension tables. Typically the fact tables share some dimension tables. 19. What is the purpose of the Oracle CREATE DIMENSION statement? Ans: The star schema and its variations do not provide explicit representation of hierarchical dimensions. A dimension table does not define the hierarchical relationships among the levels of a dimension. Because dimension definition is important to support data cube operations as well as optimization techniques for relational database queries, many relational DBMS vendors have created proprietary SQL extensions for dimensions such as the CREATE DIMENSION statement. 20. What is the purpose of the SQL CUBE operator? Ans: The CUBE operator augments the normal GROUP BY result with all combinations of subtotals. The CUBE operator is appropriate to summarize columns from independent dimensions rather than columns representing different levels of a single dimension. 21. What is the purpose of the SQL ROLLUP operator? Ans: The CUBE operator augments the normal GROUP BY result a partial set of subtotals. The ROLLUP operator is appropriate to summarize columns from a single dimension. 22. What is the purpose of the SQL GROUPING SETS operator? Ans: The GROUPING SETS operator requires explicit specification of column combinations. The GROUPING SETS operator is appropriate when precise control over grouping and subtotals is required. 23. Briefly list some of the variations of the CUBE, ROLLUP, and GROUPING SETS

operators. Ans: A partial CUBE can be done to produce subtotals for a subset of independent dimensions. A partial ROLLUP can be done to produce subtotals for a subset of columns from the same dimension. Composite columns can be used with the CUBE or ROLLUP operators to skip some subtotals. CUBE and ROLLUP operations can be included in a GROUPING SETS operation. 24. Why are materialized views important for data warehouses but not important for

operational databases? Ans: Materialized views are attractive in data warehouses because the source data is stable except for periodic refreshments that can usually be performed during non peak times. In contrast, non-materialized views dominate operational database processing because the refresh cost can be high. 25. What materialization properties does Oracle 9i provide for materialized views?

Page 4: QAch16

Ans: The Oracle materialization properties include the method of refresh (incremental or complete), the timing of refresh (on demand or on commit), and the timing of build (immediate or deferred). Oracle has a number of restrictions on the types of materialized views that can be incrementally refreshed. 26. Compare and contrast query rewriting for materialized views to query modification

for traditional (non materialized) views. Ans: The query rewriting process for materialized views reverses the query modification process for traditional views. The query modification process substitutes base tables for views in queries that reference views so that materialization of the views is not needed. In contrast, the query rewriting process substitutes materialized views for base tables to avoid accessing large fact tables. In both processes, the DBMS performs the substitution process, not the user. In the query rewrite process, the query optimizer must evaluate whether the substitution will improve performance over the original query. Overall, query rewriting is more complex than query modification because query rewriting involves a more complex substitution process and query rewriting requires the optimizer to evaluate costs. 27. Explain the significance of indexing fact and dimension tables in a data warehouse. Ans: Because data warehouses are used mostly for retrieval, many indexes can be used to speed retrieval speed. When refreshing a data warehouse, indexes can be dropped to speed the loading process and then recreated after loading completes. 28. What are the pros and cons of a MOLAP storage engine? Ans: The storage engines of MOLAP systems are optimized for the unique characteristics of multidimensional data such as sparsity and complex aggregation across thousands of cells. Because data cubes are precomputed, MOLAP query performance is generally better than competing approaches that use relational database storage. Even with techniques to deal with sparsity, MOLAP engines can be overwhelmed by the size of data cubes. A fully calculated data cube may expand many times as compared to the raw input data. This data explosion problem limits the size of data cubes that MOLAP engines can manipulate. 29. What are the pros and cons of a ROLAP storage engine? Ans: Despite the intensive research and development on ROLAP storage and optimization techniques, MOLAP engines still provide faster query response time. However, MOLAP storage suffers from limitations in data cube size so that ROLAP storage is preferred for fine-grained data warehouses. In addition, the difference in response time has narrowed so that ROLAP storage may involve only a slight performance penalty if ROLAP storage and optimization techniques are properly utilized. 30. What are the pros and cons of a HOLAP storage engine? Ans: HOLAP allows a data warehouse to be divided between relational storage of fact and dimension tables and multidimensional storage of summary data cubes. When an OLAP query is submitted, the HOLAP system can combine data from the ROLAP managed data and the MOLAP managed data. Despite the appeal of HOLAP, there are

Page 5: QAch16

potential disadvantages that may limit its use. First, HOLAP can be more complex than either ROLAP or MOLAP especially if the DBMS vendor does not provide full HOLAP support. To fully support HOLAP, the DBMS vendor must provide both MOLAP and ROLAP engines as well as tools to combine both storage engines in the design and operation of a data warehouse. Second, there is considerable overlap in functionality between the storage and optimization techniques in ROLAP and MOLAP engines. It is not clear whether the ROLAP storage and optimization techniques should be discarded or used in addition to the MOLAP techniques. Third, because the difference in response time has narrowed between ROLAP and MOLAP, the combination of MOLAP and ROLAP may not provide significant performance improvement to justify the added complexity. 31. List some storage and optimization techniques used in ROLAP engines. Ans: Bitmap join indexes are particularly useful for columns in dimension tables with few values. Star join query optimization uses bitmap join indexes on dimension tables to reduce the number of rows in the fact table to retrieve. Query rewriting using materialized views can eliminate the need to access large fact and dimension tables. Summary storage advisors determine the best set of materialized views that should be created and maintained for a given query workload. Partitioning, striping, and parallel query execution provide opportunities to reduce the execution time of data warehouse queries. 32. What is cooperative change data? Ans: Cooperative data involves notification from the source system about changes. The notification typically occurs at transaction completion time using a trigger. Cooperative change data can be input immediately into the data warehouse or placed in queue for later batch input with other changes. 33. What is logged change data? Ans: Logged change data involves files that record changes or other user activity. For example, a transaction log contains every change made by a transaction, and a web log contains page access histories (called clickstreams) by website visitors. Logged change data usually involves no changes to source systems as logs are readily available for most source systems. 34. What is queryable change data? Ans: Queryable change data comes directly from a data source via a query. Queryable change data requires timestamping in the data source. Since few data sources completely support timestamps, queryable change data must be augmented with other kinds of change data. Queryable change data is most applicable for fact tables using fields such as order data, shipment data, and hire data that are stored in a data source. 35. What is snapshot change data? Ans: Snapshot change data involves periodic dumps of source data. To derive change data, a difference operation uses the two most recent snapshots. The result of a difference operation is called a delta. Snapshots are the most common form of change data because

Page 6: QAch16

of applicability. Snapshots are the only form of change data without requirements on a source system. 36. Briefly describe the phases of data warehouse maintenance. Ans: The preparation phase manipulates change data from individual source systems. The preparation phase involves extraction, transportation, cleaning, and auditing. The integration phase involves merging the separate, cleaned sources into one source. Merging can involve removal of inconsistencies among the source data. Auditing involves recording results of the merging process, performing completeness and reasonableness checks, and handling exceptions. The update phase involves propagating the integrated change data to various parts of the data warehouse including fact and dimension tables, materialized views, stored data cubes, and data marts. After propagation, notification can be sent to user groups and administrators. 37. List common data quality problems that should be resolved by the preparation and

integration phases. Ans: Multiple identifiers, multiple names, different units, missing values, orphaned transactions, multipurpose fields, and conflicting data. 38. What is the benefit of using ETL tools in data warehouse maintenance? Ans: ETL tools eliminate the need to write custom coding for many maintenance tasks. Most ETL tools use rule specifications rather than procedural coding to indicate logic and actions. Some ETL tools can generate code that can be customized for more flexibility. 39. What is valid time lag? Ans: Valid time lag is the difference between the occurrence of an event in the real world (valid time) and the storage of the event in an operational database (transaction time). 40. What is load time lag? Ans: Load time lag is the difference between transaction time (storage of an event in an operational database) and the storage of the event in the data warehouse (load time). 41. What is the primary objective in managing the refresh process for a data warehouse? Ans: The primary objective in managing the refresh process is to determine the refresh frequency for each data source. The optimal refresh frequency maximizes the net refresh benefit defined as the value of data timeliness minus the cost of refresh. The value of data timeliness depends on the sensitivity of decision making to the currency of the data. The cost to refresh a data warehouse includes both computer and human resources. Computer resources are necessary for all tasks in the maintenance workflow. Human resources may be used in the auditing tasks in the preparation and integration phases. 42. What are three alternatives for preserving historical integrity in a data warehouse? Ans: The Type I approach overwrites old values with the changed data providing no historical integrity. The Type II approach uses version numbers to augment the primary key of dimension tables. For each change to a dimension row, insert a row in the

Page 7: QAch16

dimension table with a larger version number. The Type III approach uses columns to maintain a fixed history.

Page 8: QAch16

PROBLEM SOLUTIONS Part 1: Automobile Insurance Problems Part 1 provides practice with data cube definition and operations as well as star schemas. The questions in Part 1 involve the database shown below used by an automobile insurance provider to support policy transactions (create and maintain customer policies) and claims transactions (claims made by other parties). The policy transactions utilize the tables Policy, Item, InsuredParty, InsuredAuto, PolicyItem, and Agent, while the claims transactions use the tables Claim, Claimant, InsuredAuto, Policy, InsuredParty, and ThirdParty. The nine tables in this database are shown below, along with the primary keys (shown in bold font) and the foreign keys (shown in italics). For each table, you may assume field formats of your own choice. Table: Item ItemNo ItemDesc ItemMinCoverage ItemMaxCoverage Primary key: ItemNo Foreign keys: None Table: Agent AgentNo AgentName AgentPhone AgentDept AgentType AgentRegion Primary key: AgentNo Foreign keys: None Table: InsuredParty IPSSN IPDrivLicNo IPState IPNam e IPPhone IPAddr IPDOB IPCi t y I P Z i p IPRiskCat Primary key: IPSSN Foreign keys: none IPDOB is the insured party’s date of birth Table: Claimant ClmtNo ClmtName ClmtPhone ClmtInsComp ClmtPolNo ClmtAddr ClmtCity ClmtState ClmtZip Primary key: ClmtNo Foreign keys: None Table: ThirdParty T P S S N T P N a m e TPPhon e TPDes c T P A d d r TPCit y TPState T P Z i p Primary key: TPSSN Foreign keys: None Table: Policy PolNo P o l B e g D a t e P o l E n d D a te I P S S N A g e n t N o PolPremium P o l E f f D a te

Page 9: QAch16

Primary key: PolNo Foreign keys: IPSSN, AgentNo Table: InsuredAuto IAVIN IALicPlateNo IAState IAMake IAModel IAYear IAAirBags IADriverSSN PolNo Primary key: IPVIN Foreign keys: IADriverSSN (refers to InsuredParty), PolNo Table: PolicyItem IAVIN ItemNo PICoverage PIPremium P I C o m m e n ts Primary key: IAVIN, ItemNo Foreign keys: IAVIN, ItemNo Table: Claim ClaimNo ClaimAmount ClaimEstimate ClaimDesc ClaimDate I A V IN ClmtNo T P S S N Primary key: ClaimNo Foreign keys: IAVIN, ClmtNo, TPSSN 1. Identify dimensions and measures in a data cube for automobile policy analysis. Ans: Dimensions: ItemNo, Insured automobile characteristics such as make, model, and year, insured party location characteristics such as the city, the state, and the zip code, insured party characteristics such as the age (derived from the date of birth column, IPDOB) and the risk category, AgentNo and other agent characteristics such as the agent type, the department, and the region, and characteristic about the policy date such as the beginning, the ending, and the effective dates Measures: premium for the covered item, coverage limit for the covered item, the policy premium is a derived measure (sum of the premium of individual policy items) 2. Draw a star schema to support the dimensions and measures in the data cube from

problem (1). You may want to use denormalization to reduce the number of dimension tables and the relationships among dimension tables.

Ans: Snowflake schema for the policy data cube Dimension tables:

Table: Item ItemNo ItemDesc ItemMinCoverage ItemMaxCoverage Table: Agent AgentNo AgentName AgentPhone AgentDept AgentType AgentRegion

Page 10: QAch16

Table: InsuredParty IPSSN IPDrivLicNo IPState IPName IPPhone IPAddr IPDOB IPCity IPZip IPRiskCa t Table: InsuredAuto IAVIN IALicPlateNo IAState IAMake IAModel IAYear IAAirBags IADriverSSN PolNo Table: Policy PolNo PolBegDate PolEndDate IPSSN AgentNo PolPremium PolEffDate To reduce the number of dimension tables, the InsuredAuto, Policy, and InsuredParty tables may be combined into one large table. The large table is not in 3NF but this should not be a problem because the dimension tables are not updated. In addition, there are typically only one or two automobiles and insured parties per policy so there will not be too much redundancy.

Fact table:

Table: PolicyItem I A V I N I t e m N o PICoverage PIPremiu m To support a data cube at the policy level (instead of policy item level), the Policy table would be the fact table and the InsuredParty and Agent tables would be the dimension tables. 3. Identify dimensions and measures in a data cube for claims analysis. Ans: Dimensions: Insured automobile characteristics such as make, model, and year, insured party, location characteristics such as city, state, and zip code, insured party characteristics such as age and risk category, claimant location characteristics such as state and zip code, third party location characteristics such as the state and zip code, and characteristic about the policy date such as the beginning and ending dates Measures: claim estimate and claim amount 4. Draw a star schema to support the dimensions and measures in the data cube from

problem (3). You may want to use denormalization to reduce the number of dimension tables and the relationships among dimension tables.

Ans: Star schema for claims data cube: Dimension tables:

Table: Claimant ClmtNo ClmtName ClmtPhone ClmtInsurCom p ClmtPolNo ClmtAddr ClmtCity ClmtState ClmtZip

Page 11: QAch16

Table: ThirdParty TPSSN TPName TPPhone TPDesc TPAddr TPCit y TPState T P Z i p Table: InsuredParty IPSSN IPDrivLicNo IPState IPName IPPhone IPAddr IPDOB IPCity IPZip IPRiskCa t Table: InsuredAuto IAVIN IALicPlateNo IAState IAMake IAModel IAYear IAAirBags IADriverSSN PolNo Table: Policy PolNo PolBegDate PolEndDate IPSSN AgentNo PolPremium PolEffDate To reduce the number of dimension tables, the InsuredAuto, Policy, and InsuredParty tables may be combined into one large table. The large table is not in 3NF but this should not be a problem because the dimension tables are not updated. In addition, there is typically only one or two automobiles and insured parties per policy so there will not be too much redundancy.

Fact table:

Table: Claim ClaimNo ClaimAmount ClaimEstimate ClaimDesc ClaimDate IAVIN ClmtNo TPSSN 5. In the policy schema, how can the data be divided between a production database

and a data warehouse? What should be the finest level grain in a data cube? Ans: To provide analysis flexibility, the data warehouse should contain the same level of detail as the production tables. To reduce the number of tables and the number of columns, the data warehouse tables may use denormalization on the dimension tables and omit columns that are not needed for policy analysis. The data warehouse tables should support many data cubes that use summarized data as well as data mining applications that use detailed transaction data. 6. Identify hierarchies in the dimensions in the data cube for the policy transaction. Ans: Hierarchies for the dimensions of the policy data cube are: year within make within model for insured auto; zip code within state and city within state for insured party; department within region for agent, a time hierarchy for the beginning, the ending, and the effective dates for the policy, and age categories for the insured party’s age. 7. In the claims schema, how can the data be divided between a production database

and a data warehouse? What should be the finest level grain in a data cube? Ans:

Page 12: QAch16

To provide analysis flexibility, the data warehouse should contain the same level of detail as the production tables. To reduce the number of tables and the number of columns, the data warehouse tables may use denormalization on the dimension tables and omit columns that are not needed for claims analysis. The data warehouse tables should support many data cubes that use summarized data as well as data mining applications that use detailed transaction data. 8. Identify hierarchies in the dimensions in the data cube for the claims analysis. Ans: Hierarchies for the dimensions of the claims data cube are: year within make within model for insured auto; zip code within state and city within state for the insured party, the claimant, and the third party; a time hierarchy (such as year, month, and date) for the beginning, the ending, and the effective dates for the policy, a time hierarchy for the claim date, and age categories for the insured party’s age. 9. Describe the data cube resulting from the operation to slice the policy data cube by a

certain agent. Ans: Slicing the policy data cube by the agent produces a data cube with the dimensions of item, insured automobile characteristics, insured party location, insured party characteristics, and policy date dimensions. The specific agent number replaces the agent dimension. 10. Describe the data cube resulting from the operation to dice the data cube result of the

slice operation in problem 9 by insured parties having zip codes in a specified state. Ans: The number of dimensions of the data cube remains the same from problem 9. However, the insured party location dimension is reduced to the subset of zip codes within the specified state. 11. Begin with a data cube with four dimensions (InsuredParty, InsuredAuto, Item, and

Agent) and one measure (policy amount) in the cells. From this data cube, describe the operation to generate a new data cube with three dimensions (InsuredParty, Item, and Agent) and one measure (average auto policy amount).

Ans: Use a slice-summarize operation to summarize the insured auto dimension. The slice operation should compute the average of the policy amount over the insured auto dimension. The insured auto dimension is replaced with the average policy amount. 12. Identify dimension levels and hierarchies within the Agent table. You do not need to

use the Oracle CREATE DIMENSION statement.

Page 13: QAch16

Ans: The levels in the agent dimension are the agent number, phone, department, type, and region. Each of these levels resides in a separate hierarchy with agent number. If department and region are related, these levels can be combined in one dimension. The phone number can be parsed to provide a hierarchy of country code, area code, and prefix. 13. Identify dimension levels and hierarchies within the InsuredParty table. You do not

need to use the Oracle CREATE DIMENSION statement. Ans: The levels in the insured party dimension are the drivers license number, social security number, state, city, address, phone, risk category, date of birth, and phone. For each dimension hierarchy, the most detailed level is either the drivers license number or the social security number. The hierarchies are (state, city, address), (state, zip, address), birth date, risk category, and phone. The date of birth can have a hierarchy consisting of year, month, and day. The phone number can be parsed to yield a hierarchy of country code, area code, and prefix. 14. Identify dimension levels and hierarchies within the InsuredAuto table. You do not

need to use the Oracle CREATE DIMENSION statement. Ans: The levels in the insured auto dimension are the vehicle identification number, license plate number, state, make, model, and year. For each dimension hierarchy, the most detailed level is either the license plate number or the vehicle identification number. The hierarchies are state and (make, model, year). 15. Is it necessary to have a time dimension table for the automobile insurance data

warehouse? Please justify your answer and explain how time is represented if not in a separate dimension table.

Ans: A time dimension table is not required. The time dimension can be represented using a single field for each date field rather than a foreign key to a time table. For time hierarchies, there are functions to extract the year, month, day, and other attributes from a date field.

Part 2: Store Sales Problems Part 2 provides practice with relational database manipulation of multidimensional data. The questions in Part 2 involve the store sales snowflake schema used in Section 16.3. For your reference, Figure 16P.1 displays the ERD for the store sales snowflake schema. To support the usage of Oracle with these problems, the student section of the Online

Page 14: QAch16

Learning Center contains Oracle CREATE TABLE statements and sample data for the tables of the store sales schema.

CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation

StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNation

SalesSalesNoSalesUnitsSalesDollarSalesCost

ItemItemIdItemNameItemUnitPriceItemBrandItemCategory

TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear

ItemSales

CustSales

TimeSales

StoreSales

DivisionDivIdDivNameDivManager

DivStore

Figure 16.P1: ERD Snowflake Schema for the Store Sales Example

In the solutions, Oracle table names are used. To prevent naming conflicts with other tables, the prefix “SS” has been prepended to all table names. The student section of the Online Learning Center contains Oracle CREATE TABLE statements and sample data for the tables of the store sales schema. 1. Write a SELECT statement to summarize sales by store state, year, and item brand.

The result should compute the SUM of the dollar sales for the years 2002 and 2003. The result should include full totals for every combination of grouped fields.

Ans:

SELECT StoreState, TimeYear, ItemBrand, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY CUBE(StoreState, TimeYear, ItemBrand); 2. Write a SELECT statement to summarize sales by year, quarter, and month. The

result should compute the SUM of the dollar sales for the years 2002 and 2003. The result should include partial totals in order of the grouped fields (year, quarter, and month).

Page 15: QAch16

Ans:

SELECT TimeYear, TimeQuarter, TimeMonth, SUM(SalesDollar) as TotalDollarSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY ROLLUP(TimeYear, TimeQuarter, TimeMonth); 3. Write a SELECT statement to summarize sales by store state, month, and year. The

result should compute the SUM of the dollar sales for the years 2002 and 2003. The result should include partial totals in order of the grouped fields (year and month). Do not use the GROUPING SETS operator in your SQL statement.

Ans: SELECT StoreState, TimeYear, TimeMonth, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY StoreState, ROLLUP(TimeYear, TimeMonth); 4. Rewrite the SQL statement solution for problem (1) without using the CUBE,

ROLLUP, or GROUPING SETS operators. To reduce your time, you may write the first few query blocks and then indicate the pattern to rewrite the remaining query blocks. In rewriting the queries, you may use two single quotes (with nothing inside) as the default text value and 0 as the default numeric value.

Ans: Both UNION and UNION ALL give the same result. In the statements, 0 is used for the null value for TimeYear and '' is used for ItemBrand and StoreState. SELECT StoreState, TimeYear, ItemBrand, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY StoreState, TimeYear, ItemBrand UNION SELECT StoreState, TimeYear, '', SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem

Page 16: QAch16

WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY StoreState, TimeYear UNION SELECT StoreState, 0, ItemBrand, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY StoreState, ItemBrand UNION ALL SELECT '', TimeYear, ItemBrand, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY TimeYear, ItemBrand UNION SELECT StoreState, 0, '', SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY StoreState UNION SELECT '', TimeYear, '', SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY TimeYear UNION SELECT '', 0, ItemBrand, sum(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003) GROUP BY ItemBrand UNION ALL

Page 17: QAch16

SELECT '', 0, ''s, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales, SSItem WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND SSItem.ItemId = SSSales.ItemId AND TimeYear IN (2002, 2003); 5. Rewrite the SQL statement solution for problem (2) without using the CUBE,

ROLLUP, or GROUPING SETS operators. In rewriting the queries, you may use two single quotes (with nothing inside) as the default text value and 0 as the default numeric value.

Ans: Both UNION and UNION ALL give the same result. In the statements, 0 is used for the default value for all fields. SELECT TimeYear, TimeQuarter, TimeMonth, SUM(SalesDollar) as TotalDollarSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY TimeYear, TimeQuarter, TimeMonth UNION SELECT TimeYear, TimeQuarter, 0, SUM(SalesDollar) as TotalDollarSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY TimeYear, TimeQuarter UNION SELECT TimeYear, 0, 0, SUM(SalesDollar) as TotalDollarSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY TimeYear UNION SELECT 0, 0, 0, SUM(SalesDollar) as TotalDollarSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003); 6. Rewrite the SQL statement solution for problem (3) without using the CUBE,

ROLLUP, or GROUPING SETS operators. In rewriting the queries, you may use

Page 18: QAch16

two single quotes (with nothing ins ide) as the default text value and 0 as the default numeric value.

Ans: Both UNION and UNION ALL give the same result. In the statements, 0 is used for the default value for all fields. SELECT StoreState, TimeYear, TimeMonth, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY StoreState, TimeYear, TimeMonth UNION SELECT StoreState, TimeYear, 0, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY StoreState, TimeYear UNION SELECT StoreState, 0, 0, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY StoreState; 7. Rewrite the SQL statement solution for problem (3) using the GROUPING SETS

operator instead of the ROLLUP operator. Ans: The GROUPING SETS operator requires enumeration of all grouping combinations. SELECT StoreState, TimeYear, TimeMonth, SUM(SalesDollar) as TotalDollarSales FROM SSStore, SSTime, SSSales WHERE SSStore.StoreId = SSSales.StoreId AND SSTime.TimeNo = SSSales.TimeNo AND TimeYear IN (2002, 2003) GROUP BY GROUPING SETS(StoreState, (StoreState, TimeYear),

Page 19: QAch16

(StoreState, TimeYear, TimeMonth)); 8. Write an Oracle CREATE DIMENSION statement for a customer dimension

consisting of the customer identifier, the city, the state, the zip, and the country. Define two hierarchies grouping the customer identifier with the city, the state, and the nation and the customer identifier with the zip, the state, and the nation.

Ans: Oracle CREATE DIMENSION statement for the customer dimension CREATE DIMENSION Customer_D LEVEL CustId IS SSCustomer.CustId LEVEL City IS SSCustomer.CustCity LEVEL Zip IS SSCustomer.CustZip LEVEL State IS SSCustomer.CustState LEVEL Nation IS SSCustomer.CustNation HIERARCHY city_rollup ( CustId CHILD OF City CHILD OF State CHILD OF Nation ) HIERARCHY zip_rollup ( CustId CHILD OF Zip CHILD OF State CHILD OF Nation ); 9. Write an Oracle CREATE DIMENSION statement for a time dimension consisting

of the time identifier, the day, the month, the quarter, the year, the fiscal year, and the day of the week. Define three hierarchies grouping the time identifier, the day, the month, the quarter, and the year, the time identifier and the fiscal year, and the time identifier and the day of the week.

Ans: Oracle CREATE DIMENSION statement for the time dimension CREATE DIMENSION Time_D LEVEL TimeNo IS SSTime.TimeNo LEVEL Day IS SSTime.TimeDay LEVEL Mon IS SSTime.TimeMonth LEVEL Qtr IS SSTime.TimeQuarter LEVEL Year IS SSTime.TimeYear LEVEL DayofWeek IS SSTime.TimeDayofWeek LEVEL FiscalYear IS SSTime.TimeFiscalYear HIERARCHY date_rollup ( timeno CHILD OF

Page 20: QAch16

Day CHILD OF Mon CHILD OF Qtr CHILD OF Year) HIERARCHY date_day_rollup ( TimeNo CHILD OF DayofWeek) HIERARCHY date_fy_rollup ( TimeNo CHILD OF FiscalYear); 10. Write an Oracle CREATE MATERIALIZED VIEW statement to support the store

sales schema. The materialized view should include the sum of the dollar sales and the sum of the cost of sales. The materialized view should summarize the measures by the store identifier, the item identifier, and the time number. The materialized view should include sales in the year 2002.

Ans: Oracle CREATE MATERIALIZED VIEW statement for 2002 store sales. DROP MATERIALIZED VIEW SalesByStoreTimeMV2002; CREATE MATERIALIZED VIEW SalesByStoreTimeMV2002 BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND AS SELECT StoreId, SSTime.TimeNo, ItemId, SUM(SalesDollar) as TotalDollarSales, SUM(SalesCost) AS TotalCostSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear = 2002 GROUP BY StoreId, SSTime.TimeNo, ItemId; 11. Write an Oracle CREATE MATERIALIZED VIEW statement to support the store

sales schema. The materialized view should include the sum of the dollar sales and the sum of the cost of sales. The materialized view should summarize the measures by the store identifier, the item identifier, and the time number. The materialized view should include sales in the year 2003.

Ans: Oracle CREATE MATERIALIZED VIEW statement for 2003 store sales. DROP MATERIALIZED VIEW SalesByStoreTimeMV2003;

Page 21: QAch16

CREATE MATERIALIZED VIEW SalesByStoreTimeMV2003 BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND AS SELECT StoreId, SSTime.TimeNo, ItemId, SUM(SalesDollar) as TotalDollarSales, SUM(SalesCost) AS TotalCostSales FROM SSTime, SSSales WHERE SSTime.TimeNo = SSSales.TimeNo AND TimeYear = 2003 GROUP BY StoreId, SSTime.TimeNo, ItemId; 12. Rewrite the SQL statement solution for problem (1) using the materialized views in

problems (10) and (11). You should ignore the CUBE operator in the solution for problem (1). Your SELECT statement should reference the materialized views as well as base tables if needed.

Ans: 12. Rewrite of query 1 (except CUBE operator) using the materia lized views in problems 10 and 11. SELECT StoreState, TimeYear, ItemBrand, SUM(TotalDollarSales) as TotalSales FROM SSStore, SSTime, SSItem, SalesByStoreTimeMV2002 SS0 WHERE SSStore.StoreId = SS0.StoreId AND SSTime.TimeNo = SS0.TimeNo AND SSItem.ItemId = SS0.ItemId GROUP BY StoreState, TimeYear, ItemBrand UNION SELECT StoreState, TimeYear, ItemBrand, SUM(TotalDollarSales) as TotalSales FROM SSStore, SSTime, SSItem, SalesByStoreTimeMV2003 SS1 WHERE SSStore.StoreId = SS1.StoreId AND SSTime.TimeNo = SS1.TimeNo AND SSItem.ItemId = SS1.ItemId GROUP BY StoreState, TimeYear, ItemBrand; 13. Rewrite the SQL statement solution for problem (1) using the materialized views in

problems (10) and (11). Your SELECT statement should reference the materialized views as well as base tables if needed. You should think carefully about how to handle the CUBE operator in your rewritten query.

Ans:

Page 22: QAch16

Rewrite of query 1 (including the CUBE operator) using the materialized views in problems 10 and 11. The CUBE must be done after the UNION. Either a view can be used or a nested query in the FROM clause. -- Solution 1 using a view DROP VIEW Exercise13View; CREATE VIEW Exercise13View AS SELECT StoreState, TimeYear, ItemBrand, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SSItem, SalesByStoreTimeMV2002 SS0 WHERE SSStore.StoreId = SS0.StoreId AND SSTime.TimeNo = SS0.TimeNo AND SSItem.ItemId = SS0.ItemId GROUP BY StoreState, TimeYear, ItemBrand UNION SELECT StoreState, TimeYear, ItemBrand, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SSItem, SalesByStoreTimeMV2003 SS1 WHERE SSStore.StoreId = SS1.StoreId AND SSTime.TimeNo = SS1.TimeNo AND SSItem.ItemId = SS1.ItemId GROUP BY StoreState, TimeYear, ItemBrand; SELECT StoreState, TimeYear, ItemBrand, SUM(TotalSales1) as TotalSales FROM Exercise13View GROUP BY CUBE(StoreState, TimeYear, ItemBrand); -- Solution 2 using a nested query in the FROM clause SELECT StoreState, TimeYear, ItemBrand, SUM(TotalSales1) as TotalSales FROM ( SELECT StoreState, TimeYear, ItemBrand, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SSItem, SalesByStoreTimeMV2002 SS0 WHERE SSStore.StoreId = SS0.StoreId AND SSTime.TimeNo = SS0.TimeNo AND SSItem.ItemId = SS0.ItemId GROUP BY StoreState, TimeYear, ItemBrand UNION SELECT StoreState, TimeYear, ItemBrand, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SSItem, SalesByStoreTimeMV2003 SS1 WHERE SSStore.StoreId = SS1.StoreId AND SSTime.TimeNo = SS1.TimeNo

Page 23: QAch16

AND SSItem.ItemId = SS1.ItemId GROUP BY StoreState, TimeYear, ItemBrand ) GROUP BY CUBE(StoreState, TimeYear, ItemBrand); 14. Rewrite the SQL statement solution for problem (3) using the materialized views in

problems (10) and (11). You should ignore the ROLLUP operator in the solution for problem (3). Your SELECT statement should reference the materialized views as well as base tables if needed.

Ans: Rewrite of query 3 (except ROLLUP operator) using the materialized views in problems 10 and 11. SELECT StoreState, TimeYear, TimeMonth, SUM(TotalDollarSales) as TotalSales FROM SSStore, SSTime, SalesByStoreTimeMV2002 SS0 WHERE SSStore.StoreId = SS0.StoreId AND SSTime.TimeNo = SS0.TimeNo GROUP BY StoreState, TimeYear, TimeMonth UNION SELECT StoreState, TimeYear, TimeMonth, SUM(TotalDollarSales) as TotalSales FROM SSStore, SSTime, SalesByStoreTimeMV2003 SS1 WHERE SSStore.StoreId = SS1.StoreId AND SSTime.TimeNo = SS1.TimeNo GROUP BY StoreState, TimeYear, TimeMonth; 15. Rewrite the SQL statement solution for problem (3) using the materialized views in

problems (10) and (11). Your SELECT statement should reference the materialized views as well as base tables if needed. You should think carefully about how to handle the ROLLUP operator in your rewritten query.

Ans: Rewrite of query 3 (including the ROLLUP operator) using the materialized views in problems 10 and 11. The ROLLUP must be done after the UNION. Either a view can be used or a nested query in the FROM clause. -- Solution 1 using a view DROP VIEW Exercise15View; CREATE VIEW Exercise15View AS SELECT StoreState, TimeYear, TimeMonth, SUM(TotalDollarSales) as TotalSales1

Page 24: QAch16

FROM SSStore, SSTime, SalesByStoreTimeMV2002 SS0 WHERE SSStore.StoreId = SS0.StoreId AND SSTime.TimeNo = SS0.TimeNo GROUP BY StoreState, TimeYear, TimeMonth UNION SELECT StoreState, TimeYear, TimeMonth, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SalesByStoreTimeMV2003 SS1 WHERE SSStore.StoreId = SS1.StoreId AND SSTime.TimeNo = SS1.TimeNo GROUP BY StoreState, TimeYear, TimeMonth; SELECT StoreState, TimeYear, TimeMonth, SUM(TotalSales1) as TotalSales FROM Exercise15View GROUP BY StoreState, ROLLUP(TimeYear, TimeMonth); -- Solution 2 using a nested query in the FROM clause SELECT StoreState, TimeYear, TimeMonth, SUM(TotalSales1) as TotalSales FROM ( SELECT StoreState, TimeYear, TimeMonth, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SalesByStoreTimeMV2002 SS0 WHERE SSStore.StoreId = SS0.StoreId AND SSTime.TimeNo = SS0.TimeNo GROUP BY StoreState, TimeYear, TimeMonth UNION SELECT StoreState, TimeYear, TimeMonth, SUM(TotalDollarSales) as TotalSales1 FROM SSStore, SSTime, SalesByStoreTimeMV2003 SS1 WHERE SSStore.StoreId = SS1.StoreId AND SSTime.TimeNo = SS1.TimeNo GROUP BY StoreState, TimeYear, TimeMonth ) GROUP BY StoreState, ROLLUP(TimeYear, TimeMonth);