data warehouse overview -...
TRANSCRIPT
1
1
Introduction toData Warehouse
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
2
Introduction to Data Warehousing and Data Mining
Introduction to Data Warehousing and Data Mining
1) Data Warehouse Introduction2) Engineering Conflicts3) OLTP and DSS4) Stovepipe vs. Integration5) Data Warehouse Solution6) Enterprise Information System7) Security in a Data Warehouse8) Moving Data to a Data Warehouse9) Data Marts10) Data Mining
2
3
IntroductionIntroduction
• Key topics for this course include:– Data Warehouse– Data Mart
– Data Mining
• Background and review of relational database systems
• Main focus on data warehouse and data mining
4
Data Warehouse IntroductionData Warehouse Introduction
• A data warehouse is a single source for key, corporate information needed to enable business decisions
• A database application is a piece of software that provides a user interface for users to add, delete, query and update data
• Typically, a database management system is used to actually do the work of adding, deleting, querying or updating data
DatabaseSystemApplication Data
3
5
Engineering Conflicts, Query and UpdateEngineering Conflicts, Query and Update
• It is often an engineering problem when data is updated and long-running queries occur at the same time
• In some cases, the users who are doing updates must wait for queries to complete
• One way to avoid this is to make a read-onlycopy of data
Application
Database System
Datafor update
Datafor query
6
Database System
OLTPApplication
DSSDataDSS
Application
OLTPData
OLTP and DSS DefinedOLTP and DSS Defined
• An application that updates is called an on-line transaction processing (OLTP) application
• An application that issues queries to the read-only database is called a decision support system (DSS)
4
7
Applications in a Typical EnterpriseApplications in a Typical Enterprise
• Most organizations have several disparate OLTP/DSS applications in several databases
InventoryOLTP
Application
FinanceOLTP
Application
Finance DSS
Application
InventoryDSS
Application
SalesOLTP
Application
SalesDSS
Application
F i n a n c eDSSData
F i n a n c eOLTPData
Inven to ryDSSData
Inven to ryOLTPData
S a l e sDSSData
S a l e sOLTPData
DATABASE SYSTEM
8
Stovepipe vs IntegrationStovepipe vs Integration
• When systems stand by themselves they are often referred to as “stovepipes”
• Systems that easily share data are called “well integrated systems”
Finance DSS
Application
InventoryOLTP
Application
FinanceOLTP
Application
InventoryDSS
Application
5
9
• Problems:– Users who wish to access data must query several different DSS to
find it– Data may have fundamental conflicts between DSS
– a department code table in one DSS may differ in another DSS– a measurement may be stored in meters in one DSS and yards in
another
• Solution:– Use a data warehouse, where data is integrated
from the several different stovepipe systems– Data warehouse is really sharing-lite -- you
don’t have to co-ordinate as much when applications are built and you still reap the benefits of data sharing
Problems with Stovepipe ArchitectureProblems with Stovepipe Architecture
10
Data Warehouse SolutionData Warehouse Solution
• A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions
• A data warehouse has the key, corporate data in the organization
• A data warehouse tracks historical data
6
11
Data Warehouse - A Success StoryData Warehouse - A Success Story
• Largest data warehouse is Wal-Mart (9 TB)
• Uses for Wal-Mart data warehouse– Identifies where a new store should be built based on customer demand– Identifies how stores are performing across the nation– Contains every “scan” from every purchase
• Benefits Wal-Mart gained from their data warehouse– Provided competitive advantage over K-Mart– Reduced excess inventory in individual stores– Avoided wasted funds in building stores which would fail
12
Selling the Data WarehouseSelling the Data Warehouse
• A data warehouse project will fail without corporate sponsorship– Preferably, the project should be sponsored by the CEO– The CEO must be sold on the value to the business to
improve competitive advantage by deploying a data warehouse
• If an active, corporate sponsor does not exist, data sources will be very difficult to identify
• Only add data to the warehousethat will answer key, corporate questions asked by the corporate sponsor.
Otherwise, you will have a data dump
7
13
Building a Useful Data WarehouseBuilding a Useful Data Warehouse
• You really need:– strong executive sponsorship
– good knowledge of the data
– sound software engineering
– stability from source systems
– users who want a success
• A 75 percent failure rate is often cited
• It is WORTH the effort!!!
14
Enterprise Information SystemEnterprise Information System
Data Warehouse
EnterpriseInformation
System
• An EIS (Enterprise Information System) allows users to query data in a data warehouse
• Users can access key, corporate data in the data warehouse
8
15
Users of an Enterprise Information SystemUsers of an Enterprise Information System
• Frequently, multiple EIS are needed to satisfy different types of users– Some users only want a system that has pre-defined reports so they
only need to “click one button” to see data they need. These users want the system to be no harder to use than a “coffee pot”
– Other users want to delve into the data and build their own queries
• Executives want a high-level, summarydata and a simple tool– Must be VERY easy to use, users want to click a few
buttons and get data they want– Results must be graphs– Users should be able to drill-down into key areas.
16
Users of an Enterprise Information SystemUsers of an Enterprise Information System
• Analysts want a flexible, more detailed tool– Often very knowledgeable about the data– Willing to do more work to learn about the data– Sometimes even learn SQL to issue their own
ad-hoc queries
• General users want a tool that provides detailed data, but is very easy to use– Want access to the data warehouse to do
routine tasks such as “Find me Hank’s phone number”, etc.
– Simple application, but not so focused on large reports
9
17
Data Warehouse / EISData Warehouse / EIS
Data Warehouse
FinanceOLTP
Application
SalesOLTP
Application
F i n a n c eSub jec t
Area
F i n a n c eOLTPData
Inven to rySub j ec t
Area
Inven to ryOLTPData
Sales Sub j ec t
Area
S a l e sOLTPData
InventoryOLTP
Application
EnterpriseInformation
System
S a l e sOLTPData
18
Need for Data WarehousesNeed for Data Warehouses
• Data warehouses provide a single place to store key corporate data– The idea is that users can go one place to find this key data using an
enterprise information system (EIS)
• Data warehouse is also a place to store and access historical data– Users measure performance goals for their company over a period of
time– Company statistics are available– Data not stored in the same place is difficult to locate
and compare, easily lost– Single query can be used to access key data
10
19
Security in Data WarehouseSecurity in Data Warehouse
• Building a data warehouse does increase security risk because key, corporate information is all in one place
• To mitigate that risk, database system components can be used to protect the data warehouse. These include– Views – Access control– Security Administration– Encryption– Audit
20
Moving Data into the Data Warehouse Moving Data into the Data Warehouse
• Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing
• Updates to the data warehouse are performed periodically– weekly– nightly– monthly
• Occasionally, real-time data is needed in a data warehouse, but this
is not very common
11
21
Using Middleware to Move DataUsing Middleware to Move Data
SourceOLTP
System
DataWarehouseMigrationSoftware
“Middleware”
DataWarehouse
• Data can be moved to the warehouse via data migration software
• This is often called “middleware” because it sits between the source OLTP and the data warehouse
22
Need for a Data MartNeed for a Data Mart
• A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data – Sometimes, users only need a piece of data from the data
warehouse
• The data mart is typically fed from the data warehouse
Data Warehouse
F i n a n c eSub jec t
Area
Inven to rySub j ec t
Area
Sales Sub j ec t
Area
New YorkD a t a M a r t
CaliforniaD a t a M a r t
12
23
Data Mart in ActionData Mart in Action
New YorkD a t a M a r t
CaliforniaD a t a M a r t
Data Warehouse
FinanceOLTP
Application
SalesOLTP
Application
F i n a n c eSub jec t
Area
F i n a n c eOLTPData
Inven to rySub j ec t
Area
Inven to ryOLTPData
Sales Sub j ec t
Area
S a l e sOLTPData
InventoryOLTP
Application
EnterpriseInformation
System
S a l e sOLTPData
24
Data Mining IntroductionData Mining Introduction
• Data Mining is done by running software that examines a database and looks for patterns in the data
• A data warehouse by itself will respond to queries from users– It will not tell users about patterns in data that users may not have
thought about– To find patterns in data, data mining is
used to try and mine key information from a data warehouse
13
25
Advantages of Data MiningAdvantages of Data Mining
• Data mining allows companies to collect information and make them more productive and beat their competition
• Data mining helps identify – why customers buy certain products
– ideas for very direct marketing– ideas for shelf placement– training of employees vs. employee retention– employee benefits vs. employee retention
26
Implementing Data MiningImplementing Data Mining
• Apply data mining tools to run data mining algorithms against data
• There are two approaches:– Copy data from the Data Warehouse and mine it– Mine the data in the Data Warehouse
• Popular tools use a variety of different data mining algorithms:– association rules
– genetic algorithms– decision trees– neural networks
14
27
Data Mining using Separate Data Data Mining using Separate Data
• You can move data from the data warehouse to data mining tools– Advantages
– Data mining tools may organize data so they can run faster– Disadvantages
– Could be very expensive to move largeamounts of data
Data Warehouse
Data Mining ToolCopy of data made
by the Data Mining Tool
28
Data Mining Against the Data WarehouseData Mining Against the Data Warehouse
• Data mining tools can access data directly in the Data Warehouse– Advantages
– No copy of data is needed for data mining– Disadvantages
– Data may not be organized in a way that isefficient for the tool
Data Warehouse
Data Mining Tool
15
29
Data Mining: SummaryData Mining: Summary
• Data mining attempts to find patterns in data that we did not know about
• Often data mining is just a new buzzword for statistics
• Data mining differs from statistics in that large volumes of data are used
• Many different data mining algorithms exist and we will discuss them in the course
• Examples– identify users who are most likely to commit credit card fraud– identify what attributes about a person most results in them buying
product x.
30
SQL Review
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
16
31
Introduction to SQLIntroduction to SQL
1) Introduction to SQL2) Data Definition Language (DDL)3) Data Manipulation Language (DML)4) SELECT Construct 5) SELECT Operators6) Wildcard Searches7) Aggregate Operators8) Calculated Attributes9) Sorting Results
32
Introduction to Structured Query LanguageIntroduction to Structured Query Language
• Structured Query Language (SQL) is the language used to communicate with a relational database– Industry standard– Based on set theory
• SQL composed of two types of constructs:– Data Definition Language (DDL)
– Defines the structure of the database– Data Manipulation Language (DML)
– Provides the constructs to input and retrieve data
17
33
SQL Overview - DDLSQL Overview - DDL
• Data Definition Language (DDL) is used to describe the structure of the database– Create tables, indexes, etc. – Typical Operations are:
– CREATE TABLE defines what columns are in the table and establishes the table
– CREATE INDEX defines an index for the table. Indexes are used to improve database performance
34
SQL Overview - DMLSQL Overview - DML
• Data Manipulation Language (DML) is used for storing, updating, and retrieving data.
• Typical operations include:– SELECT is used to retrieve data.
– Ex: SELECT * FROM PRODUCTS– INSERT is used to add new rows to the database.
– INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares')
– UPDATE is used to change rows that already exist in the database.– UPDATE PRODUCTS SET PRICE = PRICE + 4
– DELETE is used to eliminate rows of data from the database.– DELETE FROM PRODUCTS
18
35
SELECT OverviewSELECT Overview
• SELECT is used to retrieve records from the database.
• Single table SELECT constructs:– WHERE– IN– BETWEEN– LIKE– Aggregate Operators– DISTINCT– ORDER BY
36
SELECT ExamplesSELECT Examples
• Query Purpose: Retrieve names and prices of all products
SELECT ProductName, PriceFROM TinyProducts
• Query Purpose: Retrieve all information for all employees from the TinyProducts table
SELECT *FROM TinyProducts
19
37
SELECT with WHERESELECT with WHERE
• The WHERE clause is used to filter which information is returned from a SELECT
• Query Purpose: Retrieve all information only for product type of “food”
SELECT * FROM TinyProductsWHERE ProductType = ‘Food’
38
Use of Boolean OperatorsUse of Boolean Operators
• Conditions can be separated by Boolean operators: – AND, OR, NOT
• Query Purpose: List all information about food products that are either cereal or fruit
SELECT *FROM TinyProductsWHERE (ProductName = 'Cereal')OR (ProductName = 'Fruit')
20
39
Boolean Operator ExampleBoolean Operator Example
• Query Purpose: List the names of all products that the type is fruit and the price is less than $2.00
SELECT ProductType, ProductNameFROM TinyProducts
WHERE Price < 2AND ProductName = 'Fruit'
40
IN OperatorIN Operator
• The IN operator allows a search for records that match one value in a set of unordered values
• Example questions to use IN:– 'Find all products whose type is Food, Hardware, or Housewares'– 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'
21
41
IN ExampleIN Example
• Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes
SELECT ProductName, ProductTypeFROM TinyProductsWHERE ProductName in ('Cookware', 'Linens', 'Dishes')
instead of:
SELECT ProductName, ProductTypeFROM TinyProducts
WHERE (ProductName = ’Cookware') OR (ProductName = 'Linens') OR (ProductName = 'Dishes')
42
BETWEEN OperatorBETWEEN Operator
• The BETWEEN operator allows a search for a range of values
• Example Queries:– 'Find all fruit between Bananas and Grapes'– 'Find all cereals whose price is between $1.50 and $4.00 a box
1.50 4.00
22
43
BETWEEN ExampleBETWEEN Example
• Query Purpose: Find all products whose price is between $2.00 and $8.00
SELECT ProductName, PriceFROM TinyProductsWHERE Price BETWEEN 2.00 AND 8.00
instead of:
SELECT ProductName, HardwareFROM TinyProductsWHERE (Price >= 2.00) OR (Price <= 8.00)
44
Wildcard Searches of StringsWildcard Searches of Strings
• The LIKE operator is used to search parts of a string
• The following wildcard characters are used:% to match any zero or more characters_ to match exactly one character
23
45
Wildcard Search ExamplesWildcard Search Examples
• Query Purpose: List all products whose name starts with an ’C'
SELECT * FROM TinyProductsWHERE ProductName LIKE 'C%'
• Query Purpose: List all products that have a SKU number with the last 2 characters of ’23' when you don't know the first character
SELECT * FROM TinyProductsWHERE SKUNumber LIKE '_23'
46
Aggregate OperatorsAggregate Operators
• MIN, MAX, and AVERAGE are used when computing statistics on a range of data
• Query Examples:– 'What is the highest batting average on the team?'– 'What is the average number of hits for all the little league teams in
the National League?'– 'What are the names of the players that had the lowest average on
the little league team?'
24
47
Aggregate Operators ExampleAggregate Operators Example
• Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League
SELECT MIN(Average), MAX(Average),AVG(Average)
FROM PLAYERSWHERE League = 'National'
48
SUM and COUNT OperatorsSUM and COUNT Operators
• Use the SUM operator to total the results of a query
• COUNT will count the total number of occurrences of an item in a search
11 + + 22 + + 33 + + 44
25
49
SUM And COUNT ExamplesSUM And COUNT Examples
• Query Purpose: Find the total number of homeruns hit by all players in the American League?
SELECT SUM(HomeRuns)FROM PLAYERSWHERE League='American'
• Query Purpose: List the names of players that have hit 3 home runs in the National League?
SELECT COUNT(*) FROM PLAYERSWHERE HomeRuns = '3' AND League = 'National'
50
Calculated AttributesCalculated Attributes
• A new attribute can be obtained by using arithmetic operators (+,-, *, /) on other numeric attributes
• All operators follow standard precedence:– Multiplication and division are computed first left to right– Addition and subtraction are computed last left to right– Use parenthesis to override the standard precedence
( ( + + ,, -- ,, * * ,, / / ))
26
51
Calculated Attributes ExampleCalculated Attributes Example
Query Purpose: List all players with their hits, at bats, and their batting average
SELECT Name, Hits, AtBats, (Hits / AtBats)
FROM PLAYERS
52
DISTINCT OperatorDISTINCT Operator
• DISTINCT is used to exclude duplicate occurrences in the result of a query
• Query Purpose: List all distinct batting averages
SELECT DISTINCT(Average)FROM PLAYERS
27
53
Sorting Query ResultsSorting Query Results
• The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query
• Use DESC on the end of the ORDER BY clause to sort the data in descending order. Otherwise, the result will be in ascending order
54
Sorting ExampleSorting Example
• Query Purpose: List all players in ascending order of their batting average
SELECT Name, AverageFROM PLAYERSORDER BY Average
• For descending order add the keyword DESC
SELECT Name, AverageFROM PLAYERSORDER BY Name DESC
28
55
Sorting Calculated AttributesSorting Calculated Attributes
• To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT
• Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query)
SELECT Name, Hits, AtBats, Hits / AtBats
FROM PLAYERSORDER BY 3 DESC
56
More SQLMore SQL
1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins5) Equijoins6) Cartesian Product7) Nulls8) OUTER JOIN
29
57
GROUP BY ClauseGROUP BY Clause
• GROUP BY will partition a table into multiple groups of related rows.
• As an example, consider the EMPLOYEE table where Department partitions the EMPLOYEE set into subsets:
Engineering
Marketing Customer
Finance
58
GROUP BY ExampleGROUP BY Example
• Query Purpose: For each department, list the average salary using the EMPLOYEE table
SELECT Department, AVG(Salary)FROM EMPLOYEEGROUP BY Department
30
59
• To filter data further, we can use the WHEREclause with GROUP BY clause
Query Purpose: For each department, list the highest salary of their administrative assistants.
SELECT Department, MAX(Salary)FROM EMPLOYEEWHERE Title='administrative assistant'GROUP BY Department
GROUP BY With WHEREGROUP BY With WHEREGROUP BY With WHEREGROUP BY With WHERE
60
HAVING ConstructHAVING Construct
• HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX and AVG, to only those groups of rows that meet some condition.
Query Purpose: List the average salary for all departments that have more than three employees.
SELECT Department, AVG(Salary)FROM EMPLOYEE
GROUP BY DepartmentHAVING COUNT(*) > 3
31
61
EmpID Name Salary1 Fred 200
2 Ethel 300
3 Mike 400
4 David 100
EMPLOYEE
Multi-Table SQLMulti-Table SQL
• It is often necessary to combine data into multiple tables.
ATTENDSEmpID Name
1 Harvard2 GMU2 Yale3 MIT3 Stanford3 GMU
62
JoinsJoins
• Joins are the means by which multiple tables can be combined.
• A join allows us to combine data from different tables. A join operation is done through the SELECT construct.
• Types of Joins: Equijoin, Outer Join, Inner Join
32
63
EquijoinEquijoin
• Joins only those rows where a foreign key matches the primary key
• Allows information from multiple tables to be linked together in a single query
• Can be used to link as many tables as needed in a single query
64
• Query Purpose: List the names of all colleges attended by Ethel
SELECT b.NameFROM EMPLOYEE a, ATTENDS bWHERE a.EmpID = b.EmpIDAND a.Name = 'Ethel'
Equijoin Query ExampleEquijoin Query Example
33
65
Equijoin ExampleEquijoin Example
EmpID College GPA
1 Harvard 2.452 GMU 3.792 Nova 3.653 Yale 2.853 Nova 2.653 GMU 4.0
EmpID Name Salary
1 Fred 2002 Ethel 3003 Mike 400
EMPLOYEE
ATTENDS
66
Warning about Joining TablesWarning about Joining Tables
• A join is really just a subset of a cartesianproduct. When no fields are 'joined' in the WHEREclause, a cartesian product is produced
– Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.
Sample Query:
SELECT b.NameFROM EMPLOYEE a, ATTENDS bWHERE a.Name = 'Ethel'
34
67
Cartesian ProductCartesian Product
• Each row in one table with every other row in other table
a.EmpID a.Name a.Salary b.EmpID b.GPA
2 Ethel 300 1 3.42 Ethel 300 2 2.82 Ethel 300 3 3.72 Ethel 300 4 3.5
....
68
NullsNulls
• An attribute may be defined as null.
• This indicates that the value is unknown and avoids the need for user-defined special indicators.
• To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLEstatement when setting up the database.
35
69
Nulls ExamplesNulls Examples
Statement Purpose: Add an employee whose salary is unknown
INSERT INTO EMPLOYEE (3,'Hank', NULL)
Query Purpose: Find all employees whose salary is unknown (or null)
SELECT *FROM EMPLOYEEWHERE Salary IS NULL
70
• An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables.
• A LEFT OUTER JOIN returns all rows from the 'left' table.
• Nulls are returned when a row in the 'left' table has no corresponding rows in the right table.
OUTER JOINOUTER JOIN
36
71
LEFT OUTER JOIN ExampleLEFT OUTER JOIN Example
• Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges
SELECT a.Name, b.GPAFROM EMPLOYEE aLEFT OUTER JOIN ATTENDS bon a.EmpID = b.EmpID
72
LEFT OUTER JOIN ExampleLEFT OUTER JOIN Example
• Result of the outer join – All employees are listed. – For an equijoin, only those who attended a college would be listed– Here, employee number 4 did not attend college, but is still
retrieved by the outer join.
Name GPA
---------- -----Fred 2.45Ethel 3.79Ethel 3.65Mike 2.85Mike 2.65Mike 4.00David NULL
37
73
Advanced SQL
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
74
Advanced SQLAdvanced SQL
1) Finding the nth element in a list2) Finding the median3) Correlated subquery4) Data Definition Language Constructs
38
75
Find the Nth ElementFind the Nth Element
• It is very common to try to find the nth element in a list.– Examples:
– Who makes the second highest salary in marketing department? – What is the fifth best product in sales?
– This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving the result set until the threshold is crossed.
• We show another way ofdoing this using standard SQL.
76
Find the Nth Element: Example TableFind the Nth Element: Example Table
• Consider a table, called TEST, with just one column, x, with the following values:
X
4
5
8
39
77
• First join TEST with itself, this yields each element matched with every other element:
Find the Nth Element: Step 1Find the Nth Element: Step 1
444555888
458458458
78
Find the Nth Element: Step 2Find the Nth Element: Step 2
• Next keep only those rows where the first column is greater than or equal the second column.
Notice the pattern that just developed, each number on the list now has a certain number of values that match on the right. This number matches the position of this value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, 8 has three matches.
455888
445458
444555888
458458458
40
79
Find the Nth Element: Step 3Find the Nth Element: Step 3
• Now group by the column on the left and identify the size of each group.
• The same ideas can be applied to any SELECTstatement output.
455888
445458
458
123
80
Finding the Nth Element: ExampleFinding the Nth Element: Example
• Query Purpose: Find the information about the product with the second highest price.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts bWHERE a.Price >= b.PriceGROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumberHAVING COUNT(*) =
(SELECT COUNT(*)-1 FROM TinyProducts)
41
81
Finding the Top N Elements: ExampleFinding the Top N Elements: Example
• To ask for the top n values instead of the nthvalue, specify a range (>=) instead of just an equality (=) in the HAVING.
• Query Purpose: Find information about the products with the two highest prices.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.PriceGROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumberHAVING COUNT(*) >=
(SELECT COUNT(*)-1 FROM TinyProducts)ORDER BY a.Price
82
Finding the MedianFinding the Median
• The median is defined as the element in the middle of the list.
• Query Purpose: Find the median price in TinyProducts.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts a, TinyProducts bWHERE a.Price >= b.PriceGROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber
HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)
42
83
Using SubqueriesUsing Subqueries
• A subquery may be used in the middle of a query.
• Query Purpose: Find the information about the highest priced product, using a simple subquery.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber FROM TinyProducts aWHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)
84
Correlated SubqueryCorrelated Subquery
• If the subquery references a data element from outside of the subquery, it is called a correlatedsubquery.– For each row in the outer part of the query, the correlated subquery
is executed.
The following query will indicate who makes more money than ‘Ethel’
SELECT a.Name, a.Salary FROM Employee a WHERE EXISTS
(SELECT b.Salary FROM Employee b
WHERE a.Salary > b.Salary AND b.Name = 'Ethel')
43
85
• INSERT– Add rows to a single table
• UPDATE– Modify rows in a single table
• DELETE– Remove rows from a single table
Other Data ManipulationOther Data Manipulation
86
INSERT ExamplesINSERT Examples
• Statement Purpose: Add a record for employee #1, ’Fred' with a salary of 200 to the EMPLOYEEtable
INSERT INTO Employee VALUES(1, ’Fred', 200)
• Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE
INSERT INTO New_EmployeeSELECT * FROM Employee
44
87
UPDATE ExampleUPDATE Example
• Statement Purpose: Modify Fred’s salary to 150
UPDATE EmployeeSET Salary = 150.00WHERE Name = 'Fred'
• Statement Purpose: Give all employees a ten percent raise
UPDATE EmployeeSET Salary = Salary * 1.10
88
DELETE ExamplesDELETE Examples
• Statement Purpose: Remove all employees who have a salary higher than 100.
DELETE FROM EmployeeWHERE Salary > 100
• To remove all employees:
DELETE FROM Employee
45
89
CREATE TABLE ExampleCREATE TABLE Example
• Statement Purpose: Create a table to store employee information
CREATE TABLE EMPLOYEE(EmpId SMALLINT,Name CHAR(10),Salary DECIMAL(5,2))
To drop the EMPLOYEE table
DROP TABLE EMPLOYEE
90
Data WarehouseSecurity
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
46
91
Data Warehouse SecurityData Warehouse Security
1) Key Security Services2) Views3) Access Control 4) Roles5) Encryption6) Audit Trails7) Security Holes8) Intrusion Detection9) Misuse Detection
92
IntroductionIntroduction
• A key feature provided by database systems is good security services. – In a database system with good security, applications do not have
to worry about problems that arise with security violations.
• A data warehouse also requires good security services because it holds key, corporate data.
EIS
Database System
SecurityServices
47
93
Key Security ServicesKey Security Services
• Access Control– Controls who accesses what data
• Administration of Access Control– Used to give access to users as well as track who has various
accesses and what kind of accesses are given to a user or group of users
– Audit tracks the usage of the data warehouse
94
Security in a Data WarehouseSecurity in a Data Warehouse
• A data warehouse consolidates organizations key data in one place.– A data warehouse increases the security risk that unauthorized
users will try to obtain this data
• Security aspects of EIS applications must be designed and implemented very thoroughly.
• Access control and audits are two of the critical components of security.
48
95
Data Warehouse Security ComponentsData Warehouse Security Components
• Database system components that can be used to protect a data warehouse include:– Views
– Allow users to only see certain rows or columns of data– Access control
– Indicate which users have access to what data
– Administration– This component is used to actually give access to groups of users
and to define the accesses given to either an individual or a group. – Encryption– Protect data from access outside of the DBMS
– Audit– Track what users are doing
96
Views in Data WarehouseViews in Data Warehouse
• A view is a logical view into one or more tables. Users may be given access to the view without access to the base table.
• Views provide some security assistance because they can hide data from users.
Name Address Salary
H a n kE s t h e rTomSueDavePeteKathy
1 S o u t h S t r e e t2 N o r t h S t r e e t3 4 M a i n S t r e e t4 5 E a s y S t r e e t5 6 5 t h A v e n u e7 Broadway8 9 W e s t e r n A v e n u e
$50 ,000$80 ,000$90 ,000$28 ,500$35 ,000$60 ,000$85 ,000
EMPLOYEE
49
97
View ExampleView Example
VIEW (SAFE_EMPLOYEE)“Salary” i s e f fect ive ly h idden
• A view called SAFE_EMPLOYEE may be createdas:CREATE VIEW SAFE_EMPLOYEE AS
(SELECT name, address FROM EMPLOYEE)
Now users of the view SAFE_EMPLOYEE will noteven know that salary exists.
Name Address Salary
H a n kE s t h e rTomSueDavePeteKathy
1 S o u t h S t r e e t2 N o r t h S t r e e t3 4 M a i n S t r e e t4 5 E a s y S t r e e t5 6 5 t h A v e n u e7 Broadway8 9 W e s t e r n A v e n u e
SAFE_EMPLOYEE
98
Updating ViewsUpdating Views
• Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.
– Example:
INSERT INTO SAFE_EMPLOYEE VALUES (‘Hank’, 300)
This will insert a NULL into the SALARY column of the base table EMPLOYEE.
• Other restrictions to view updates exist:– Cannot update a view that is defined with an aggregate – Cannot update a view that is defined with a GROUP BY
50
99
Data Warehouse Access ControlData Warehouse Access Control
• Access control is implemented in a data warehouse with the SQL Grant and Revoke commands.
• Syntax
– GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON<object-name> TO <user name>
– Example: GRANT SELECT ON EMPLOYEE TO MARY
• Access control is done by DBAs and creators of tables.
• To remove access the REVOKE command is used.– Example: REVOKE SELECT ON EMPLOYEE FROM MARY
100
Database RolesDatabase Roles
• Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users. – As an example, some roles for a company might be:
– Administrative assistant– Loan officer– Salesperson
• Accesses may be assigned based on roles. – This dramatically simplifies administration.– If new tables are created, it is not necessary to add thousands of
new accesses. – Examples:
CREATE ROLE loan_officer AS (Hank, John, Mike)GRANT SELECT ON LOAN TO LOAN_OFFICER
51
101
Example of Application-based RolesExample of Application-based Roles
• Consider:
• If the database system controls accesses than it does not matter what the application does, accesses are controlled consistently (same for SALES as MARKETING)
• However, more fine-grained access control can be granted in the application.
DatabaseSystemApplications DataUsers
102
Application RolesApplication Roles
• The application can restrict:– Data entry screens– Reports
• Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.
52
103
Role Based Security in a Data WarehouseRole Based Security in a Data Warehouse
• Both application and database level security are useful in a data warehouse.
• Database level security is needed so that users are only allowed to see data they
need to see.
• Application level security canbe used to control access to certain menus so that users do not even know what reports exist.
104
EncryptionEncryption
• Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data. – Example:A message “sell 500 shares” would appear as “xyzzy”
without the key. Once the key is paired with the encrypted string “xyzzy”, it can then be decrypted.
– The size of the key is a factor in how difficult it is to attack the encryption scheme.
• Three places where encryption might be used in a data warehouse:– Network– Data
– Tape backups
53
105
Network EncryptionNetwork Encryption
• In a data warehouse application, data and queries are transmitted through a network. – Attackers might be able to steal network traffic just by breaking
into the network medium.
• One way to reduce the risk of this threat is to encrypt traffic on the network.
User
DatabaseSystem
Application
DataWarehouse
N e t w o r k
Tape Backup
106
• Network encryption is critical because the network connects all of the key components in a data warehouse.
• Encrypting network traffic mitigates the risk that an attacker could succeed with the “man in the middle” attack.
• Without this, it may be possible for the“man in the middle” to masquerade asanother user and circumvent existingapplication and database security.
Network EncryptionNetwork Encryption
54
107
Data EncryptionData Encryption
• Data encryption refers to encrypting the actual data in the data warehouse.
• If the attackers were to retrieve data fromthe warehouse, they would have to decrypt it in order to read it.
EIS DatabaseSystem
DataWarehouse
108
Backup EncryptionBackup Encryption
• Periodically, databases are copied to some kind of long-term storage (usually tapes).
• If the database is encrypted, but the tapes are not encrypted, the risk exists of someone walking off with the tapes.
Tape Backup
EIS DatabaseSystem
DataWarehouse
55
109
Audit TrailsAudit Trails
• Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.– Audit trails are turned on when the DBMS is started and all
activity that uses the data warehouse is tracked in the audit trail.
• If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by users.
110
Details of DW Audit TrailsDetails of DW Audit Trails
• An audit trail of a database system typically includes the following information:– User ID, Date, Time, Object that has been accessed (table or view),
Action that accessed the object (INSERT, UPDATE, DELETE, SELECT)
– For UPDATE, the old value and new value is tracked.
• For data warehouses, the SELECT is often used to track the queries that have
been run against the warehouse.
56
111
Other Uses for DW Audit TrailsOther Uses for DW Audit Trails
• Audit trails can be used to identify the most popular data in the warehouse.
– This information can be used to optimize queries
• An additional use for audit trails is performance tuning of the data warehouse.
– Administrators know where to focus their efforts– Reduces administrative overhead
112
Dealing with Known Security HolesDealing with Known Security Holes
• Commercial database systems and operating systems are often filled with holes that allow users to obtain unauthorized access.– To reduce the risk of these known holes, vendors often provide
“fixes” to their products as soon as these holes become public.
• It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released.
• One of the key risks surrounding a data warehouse is that privileged users have the “keys to the kingdom”.
57
113
The Risk of “Privileged Users”The Risk of “Privileged Users”
• "Privileged users" include:– Data warehouse administrators– Operating system programmers– Operators in the computer center
– These users can:– Modify, delete and query any data in the warehouse– Modify the audit trail to mask their actions– Give other users unauthorized access
• Numbers of "privileged users" could be anywhere from 20 to 30 in some
organizations.
114
Reducing the Risk of Privileged UsersReducing the Risk of Privileged Users
• One way to reduce the risk of privileged users is to separate security administration from database administration. – This would separate the task of giving accesses and managing the
audit trail from the task of making sure the data in the warehouse was correct and properly optimized.
Secur i ty Serv icesAccess Cont ro l
Audit
Database Serv icesDatabase Tun ing
Q u e r y O p t i m i z a t i o nBackups
SecurityServices
Access Cont ro lAudit
DatabaseServices
Database Tun ingQ u e r y O p t i m i z a t i o n
Backups
58
115
Information Security Attacks Information Security Attacks
• Two types of Information security attacks on data warehouses are:– Intrusion
– An intrusion occurs when an unauthorized user gains access to the data warehouse.
– Misuse– Misuse, often referred to as the insider
problem occurs when a user who has access to the warehouse uses that access for an
unauthorized purpose
• Audit Trails can be used to identify either type of attack, but
identification of misuse is typically MUCH harder to do than intrusion.
116
Intrusion Detection Intrusion Detection
• An intrusion is defined as an unauthorized access to a system. The assumption is the user is external to the environment (e.g.; a hacker).
• To reduce the risk of intrusion, intrusion detection tools are used. – These tools monitor access to the data warehouse and sound an
alarm if unauthorized accesses are detected.
DATAWAREHOUSEUSER
INTRUSION DETECTION SYSTEM
59
117
Misuse DetectionMisuse Detection
• Unwanted access by a user that has the ability to access data is referred to as misuse.
– This is also known as the insider problem.– Some estimates have shown that 80 % of computer crime is a
result of misuse.
• For data warehouses the threat of misuse is high especially by privileged users.
118
SummarySummary
• DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE.
• Application Level Security assists EIS that access data warehouses by hiding certain reports from users.
• Encryption can be used to further protect against the risk of someone walking off with the data warehouse.
• Audit Trails are useful for:– Catching attackers– Identifying usage trends of the data warehouse
60
119
Moving Datato the
Data Warehouse
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
120
Moving Data to the Data WarehouseMoving Data to the Data Warehouse
1) Moving Data into the Data Warehouse2) Updating the Data Warehouse3) Full Refresh 4) Copy Only the Changes5) BCP6) Simple Transformations7) Complex Transformations8) Commercial ETL Tools
61
121
Moving Data into the Data WarehouseMoving Data into the Data Warehouse
• Data must be moved to the data warehouse from source systems.
• Some key issues:– Determine the frequency of data updates -- how often should data
be moved from source systems to the data warehouse. – Various means of updating data in the warehouse exist:
– SQL Commands– Database system load programs (e.g.; SQL Server’s BCP) – Commercial tools
122
Updating the Data WarehouseUpdating the Data Warehouse
• OLTP (On-Line Transaction Processing) Systems have to send their updates to the data warehouse.
InventoryOLTP
Application
FinanceOLTP
Application
SalesOLTP
Application
Data Warehouse
F i n a n c eSub jec t
Area
Inven to rySubjec t
Area
S a l e sSub j ec t
Area
62
123
Frequency of Updates to the Data Warehouse
Frequency of Updates to the Data Warehouse
• Updates may occur daily, weekly, monthly, or in real-time.
Daily Update
Weekly U
pdate
Monthly U
pdateInventory
OLTPApplication
FinanceOLTP
Application
SalesOLTP
Application
Data Warehouse
F i n a n c eSub jec t
Area
Inven to rySubjec t
Area
S a l e sSub j ec t
Area
124
Determining the Frequency of UpdatesDetermining the Frequency of Updates
• Requirements should drive update frequency
• Range of updates runs from real-time, to quarterly.– Real time update
– Expensive– Requires update of warehouse while users are
querying – Daily update
– Somewhat cheaper than real time, but significantmaintenance required if the warehouse has lots of tables.
– Monthly or weekly update– Much more manageable
63
125
Updating the WarehouseUpdating the Warehouse
• Full Refresh vs. Only the Changes
InventoryOLTP
ApplicationFinanceOLTP
Application
SalesOLTP
Application
Full Refresh
Changes since last update
Full refresh
of some tables
chan
ges for other tables
Data Warehouse
F i n a n c eSub jec t
Area
Inven to rySubjec t
Area
S a l e sSub j ec t
Area
126
Full RefreshFull Refresh
• Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.
SourceTable
Source OLTP
TargetTable
Target Data Warehouse
64
127
Copy Only the ChangesCopy Only the Changes
• Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.
SourceTable
Source OLTP
TargetTable
Target Data Warehouse
Modif ied da ta s i n c e l a s t u p d a t e t o t h e w a r e h o u s e
D a t a f r o m t w o u p d a t e s a g o . H i s to r i ca l da t a no longe r i n source OLTP.
128
Full Refresh vs. Only the ChangesFull Refresh vs. Only the Changes
• Full Refresh– Pros
– Much easier to implement– Less chance of messing up your database (good data integrity)
– Cons– Can take a lot longer to actually do -- may “run out of night”– Can lose out on warehouse ability to track historical data.
• Only the Changes (DELTA)– Pros
– Tracks historical data– Cons
– Can be very hard to implement– Can require changes in source applications (more on this later)
65
129
• One way to move data from one table to another is via the INSERT-SELECT.– Syntax: INSERT INTO <target_table>
<any sql SELECT statement>
• Example:INSERT INTO DW_EMPLOYEE
SELECT *FROM EMPLOYEE
Full Refresh Using INSERT-SELECTFull Refresh Using INSERT-SELECT
TARGET
130
Updating Changes Using INSERT-SELECTUpdating Changes Using INSERT-SELECT
• Changes may be moved by adding a WHEREclause to the INSERT-SELECT.
• Example:– INSERT INTO DW_EMPLOYEE
SELECT *FROM EMPLOYEE
WHERE DATE-UPDATED = DATEPART(m, CURRENT_TIMESTAMP)
66
131
Updating Using BCPUpdating Using BCP
• BCP is the bulk copy program that comes with MS SQL Server.– Bulk copy (BCP) moves data to or from a flat file to a SQL table.
• Syntax: bcp <table> [in | out] <data file>
SourceTable
Source OLTP
TargetTable
Target DataWarehouse
TemporaryF l a tFile
Unload Load
132
BCP ExampleBCP Example
• To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:bcp pubs..publishers out publishers.txt -c
-Sservername -Usa -Ppassword
• To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:bcp pubs..pub2 in publishers.txt -c
-Sservername -Usa -Ppassword
67
133
Simple TransformationSimple Transformation
• In addition to moving data from OLTP to the warehouse, it is often necessary to transform data. – Example: System A stores TOTAL_CLOTH in meters and system
B stores TOTAL_CLOTH in yards. Before the data is moved from system A, we need to transform the data.
Data Warehouse
Pat tern = 31 , Tota l C lo th = 50 yardsPat tern = 32 , Tota l C lo th = 70 yards
S t o r e 3 2( P a t t e r n = 3 2 ,
To ta l Clo th = 20 meters)
S t o r e 3 1( P a t t e r n = 3 1 ,
TOTAL_CLOTH = 50 yards )TRANSFORMATION
134
Data Warehouse
Complex TransformationComplex Transformation
• More complex transformations occur when a value in a source table must be moved to several locations in a data warehouse.
BLUE3 4 8 4(Color = Blue, 34 Inches, LS)
TABLE 1
COLOR
BLUE
34 in
CONVERT TOCENTIMETERS
TABLE 2
8 6 . 3 6 c m
84
CONVERT TO
CODE 84
(long sleev
es) and
put in two tab
les
TABLE 3
LongS l e e v e s
TABLE 4
LongS l e e v e s
68
135
• Key tools in the marketplace– Informatica– Ardent– DecisionBase (Platinum)– Microsoft Data Transformation Services
• All provide libraries of common transformations.
• All provide the ability to code complex transformations.
Commercial ETL ToolsCommercial ETL Tools
136
Data Transformation ServicesData Transformation Services
69
137
Choose a SourceChoose a Source
138
Choose a DestinationChoose a Destination
70
139
Choose to use a Query for TransferChoose to use a Query for Transfer
140
Enter SQL Query Enter SQL Query
71
141
Choose Destination TableNameChoose Destination TableName
142
Verify TransformationVerify Transformation
72
143
Decide When to Run TransformationDecide When to Run Transformation
144
Final VerificationFinal Verification
73
145
Run TransformationRun Transformation
146
Check ResultsCheck Results
orderid orderdate productid productname quantity unitprice discount
10248 1996-07-04 00:00:00.000 11 Queso Cabrales 12 14.0000 0.010248 1996-07-04 00:00:00.000 42 Singaporean Hokkien Fried 10 9.8000 0.010248 1996-07-04 00:00:00.000 72 Mozzarella di Giovanni 5 34.8000 0.010249 1996-07-05 00:00:00.000 14 Tofu 9 18.6000 0.010249 1996-07-05 00:00:00.000 51 Manjimup Dried Apples 40 42.4000 0.0
select *from orderfact
74
147
SummarySummary
• ETL is one of the hard parts of building a data warehouse.
• Either full refreshes of data or just the changes may be done.
• Doing full refresh is easy, but historical data is lost and it may take a lot of time.
• Tracking changes is a tough business.
• ETL commercial tools are beginning to mature and can lessen the pain of this task.
148
More Ways of Moving Data to
the Data Warehouse
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
75
149
More Ways of Moving Datato the Data Warehouse
More Ways of Moving Datato the Data Warehouse
1) Determining What Data Has Changed2) Recovery Logs3) Triggers4) Insert Triggers5) Delete Triggers6) Update Triggers7) Manual Detection
150
• There is a need to move data into the data warehouse from OLTP and DSS applications
• The problem is detecting what data needs to be moved into the data warehouse
• Three methods:
– Recovery Logs
– Triggers
– Manual Techniques
More Ways of Moving Datato the Data Warehouse
More Ways of Moving Datato the Data Warehouse
76
151
Determining What Data Has ChangedDetermining What Data Has Changed
• Problem: How to get updates made to the source to the same information in the data warehouse?
TABLE A
SOURCE
OLTP
DATA WAREHOUSE
TABLE B??UPDATES
H o w t o g e t u p d a t e s f r o mSource Table A to
Data Warehouse Table B
152
??
Determining What Data Has Changed (cont.)Determining What Data Has Changed (cont.)
• Problem: How to get updates made to multiple sources to the same information in the data warehouse?
TABLE A
SOURCE
OLTP
UPD
ATES
“ROW X”
NAME DEPT. SALARY
Fred Mktg 35000
Hank Sales 60000Sue IT 71000Joe Sales 50000
Employee
RO
W X
UPDATES
Insert intoEmployee Values(‘Joe’,’Sales’,’50000)
DATA WAREHOUSE
TABLE A
“ROW X”
TABLE B
“ROW X”
DEPT COUNTMktg 1Sales 1IT 1HR 0
??EmployeeCount
SalaryInfo
2DEPT AVG SAL TOT SAL
Mktg 35000 35000IT 71000 71000HR 0 0Sales 60000 60000
55000 110000
77
153
What is the Recovery Log?What is the Recovery Log?
• Recovery log is used for transaction processing– Used to handle errors– Does contain before and after image.
• Recovery log can be used toidentify the data to be updated
in the data warehouse.
– Change Data Capture Utility– This scans the database log and identifies all changes that the user
is interested in and either writes them to a file or stores them in another table.
154
Change Data Capture Utility in ActionChange Data Capture Utility in Action
DBMS
DATA
LOG
A l l c h a n g e sto DBMS
SOURCE
OLTP
CHANGEDATA
CAPTUREUTILITY
DATA WAREHOUSE
READSREADS
WRITESWRITES
RECOVERY LOG
78
155
Example of Using Recovery LogExample of Using Recovery Log
UPDATE EMPLOYEEWhere SSN=10
SET Salary=Salary*2.0
LOGTABLE=EMPLOYEE
SSN=10
OldSalary=100,
NewSalary=200
CHANGEDATA
CAPTURERECONSTRUCTS
UPDATE
DATA WAREHOUSE
• Consider an update to the Employee table– The information is recorded in the log– The change data capture reconstructs update– Can then be sent to the data warehouse
156
Using the Recovery LogUsing the Recovery Log
• Recovery logs are usually in proprietary format. Use commercial tools to read the log and identify the changes.
• Commercial tools such as CA’s log analyzer can place the results of their work in a table.
79
157
Summary of Change Data CaptureSummary of Change Data Capture
• Pro
– Log exists anyway, might as well use it to find what has changed
• Con– Some difficult scenarios may occur where it is hard to see what the
new update should be in the Data Warehouse.
– Proprietary format, may not be supported in many DBMS and will always lag behind DBMS development.
– Many tables will be in the source that have nothing to do with the data warehouse, but change data capture will process their changes as well.
158
TriggersTriggers
• Triggers allow DBA’s to specify that when an “event” such as an INSERT, UPDATE, or DELETEoccurs on a table, another event is triggered.– Triggers are used to identify changes that are needed by the
warehouse. – A trigger can be added to a source table and whenever the source
table is updated, an update can be placed either directly in thewarehouse or in a staging table that tracks all updates.
• Triggers can be used to detect the changes and perform data warehouse updates.
– A different trigger might be run on key updates so that the datawarehouse nightly process would know what data has changed.
80
159
Example of a TriggerExample of a Trigger
TABLE A
INSERT intoTABLE A
VALUES (X, Y)
STEP 1
STEP 2
Values (X, Y) are inser ted
STEP 4
When va lues a r einser ted , se t s o f f
the TRIGGER
Nigh t ly P rocess i n se r t svalues (X, Y) into
t h e D a t a W a r e h o u s e
DATA WAREHOUSE
TABLE A
Values (X, Y)
X, YSTAGING
STEP 3
TRIGGER inser tsvalues (X, Y) intoa “STAGING” area
NightlyProcess
160
Real-Life Trigger ExampleReal-Life Trigger Example
• OLTP/DSS Data - Employee table:
–Employee (ssn, name, salary)
• DW Data - Summary table:
–EmployeeStatistics (total number employees, total salary paid, average salary).
• When a row is inserted in the employee table, we need to do an insert into the EmployeeStatisticstable. – Shown on the next page
81
161
Insert Trigger ExampleInsert Trigger Example
CREATE TRIGGER EmployeeInsertTriggerON EmployeeFOR INSERT AS
BEGINUPDATE EmployeeStatistics
SET NoEmployee = NoEmployee + (SELECT COUNT(*) FROM INSERTED)
UPDATE EmployeeStatisticsSET TotSalary = TotSalary +
(SELECT SUM(Salary) FROM INSERTED)UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployeeEND
162
Insert Trigger in ActionInsert Trigger in Action
INSERT INTO EMPLOYEEVALUES (1, 'John', 300) (1 ROW(S) AFFECTED)
INSERT INTO EMPLOYEEVALUES (2,'Mike', 400) (1 ROW(S) AFFECTED)
SELECT * FROM EMPLOYEEEmployeeEmpId Name Salary ------ --------------------------1 John 300.002 Mike 400.00
SELECT * FROMEMPLOYEESTATISTICS
EmployeeStatisticsNoEmployee TotSalary AvgSalary---------- ---------- ---------2 700.00 350.00
COMMANDS RESULTS
82
163
Delete Trigger ExampleDelete Trigger Example
CREATE TRIGGER EmployeeDeleteTriggerON EmployeeFOR DELETE AS
BEGINDECLARE @numberEmployee int
UPDATE EmployeeStatistics SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)
UPDATE EmployeeStatisticsSET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)
SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics
IF @numberEmployee > 0 BEGIN
UPDATE EmployeeStatisticsSET AvgSalary = TotSalary / NoEmployee
EndELSE
UPDATE EmployeeStatistics SET AvgSalary = 0.0END
164
Update Trigger ExampleUpdate Trigger Example
CREATE TRIGGER EmployeeUpdateTriggerON Employee
FOR UPDATE ASBEGIN
IF UPDATE (Salary)UPDATE EmployeeStatisticsSET TotSalary = TotSalary -
(SELECT SUM(Salary) FROM DELETED) +(SELECT SUM(Salary) FROM INSERTED)
UPDATE EmployeeStatisticsSET AvgSalary = TotSalary / NoEmployee
END
83
165
Summary of Using TriggersSummary of Using Triggers
• Pro– Only needed for tables whose data is going to go to the DW
• Con– Additional work needed to create detailed triggers
– Non-trivial to generate a trigger to implement appropriate action
– May not be acceptable for commercial software on source system
166
Other Ways to Determine What Has ChangedOther Ways to Determine What Has Changed
• There are other manual ways of detecting the change and doing DW updates– Look at each row of OLTP and the data in the warehouse– Compare the differences between the two files, if the data is not in
the warehouse, add it!
Hank
J o h n
Mike
S a m
OLTP DATA WAREHOUSE
Hank
J o h n
MikeCOMPARE
ADD THE DIFFERENCESADD THE DIFFERENCES
84
167
Manually Identifying What Has ChangedManually Identifying What Has Changed
• Pro– Flexible
• Con– Very expensive
– Could take a long time
168
SummarySummary
• Recovery Logs
• Triggers
• Manual Detection
85
169
Data WarehouseDesign
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
170
Data Warehouse DesignData Warehouse Design
1) Overview2) Describing a Design - ER Diagrams3) Design Normalization4) Star Schema Design
86
171
OverviewOverview
• How to describe a design
– Entity Relationship (ER) Diagram
• Types of Designs
– Normalized– Star Schema– Snowflake
172
Describing a DesignDescribing a Design
• Different techniques exist, the most prevalent is the ER (Entity-Relationship) Diagram
• Entities– Things that occur in the real world, usually nouns e.g.; employee,
part, product, etc.
• Relationships– How entities interact, example: one employee may attend many
colleges -- usually verbs– Types of relationships
– 1-1– 1-Many– Many-1– Many-Many
87
173
Examples of Relationships Examples of Relationships
1-MANY
MANY-1
1-1
MANY-MANY
174
Normalized DesignNormalized Design
• Methodology– All 1-1 relationships are placed in a single table.– Many-many relationships require two tables that store the single-
valued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.
• Checking the design– In a Normalized Design, there are many different normalized
forms. Each normal form (NF) builds on the previous one so that a table in 2NF is, by definition, in 1NF.
– 1NF– 2NF– 3NF
88
175
Dealing With Many-Many RelationshipsDealing With Many-Many Relationships
• For Many-Many– Two 1-1 Tables (SUPPLIER, PARTS)– One linking table (SP)– Ex: Suppliers, Parts are the 1-1, SP is the linking table that says
who sells what parts.
S# SNAME
1 SEARS2 OFFICE DEPOT
SUPPLIER PARTS
P# PNAME1 HAMMERS
2 NAILS
SP
S# P#
1 11 22 1
2 2
176
Normalized Design: ExampleNormalized Design: Example
• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.– Entities: Customer, Product, Store– Relationships: Customer buys Product
– Product is located in Store– Product is supplied By a Vendor
CUSTOMER PRODUCT
BUYS
STORE
IS-LOCATED-IN
VENDOR
89
177
Checking a Normalized DesignChecking a Normalized Design
• Normalization – Used to reduce data insertion, delete, and update anomalies caused
by bad designs. – Enables users to quickly check a design and make sure there are no
glaring holes in the design. – 1NF
– All “cells” are atomic -- i.e. each entry in a column contains only one value
– 2NF– All non-key values are functionally dependent upon the entire
primary key -- i.e. if the primary key changes, all other columns change.
– 3NF– No transitive dependencies -- i.e. all keys are completely
dependent on the primary key. If the primary key changes, all non-key columns are affected.
178
Overview of Normalized DesignOverview of Normalized Design
• Pro– Relatively easy to change
• Con– Queries can involve numerous joins– The massive number of tables and links between tables makes it
hard for customers to build their own queries
90
179
Star SchemaStar Schema
• Methodology– Single fact table in the middle describing a key event (e.g. sale)
surrounded by dimension tables (i.e. location, time, employee)
FACT
D5
D1
D3
D2
D4
D = DIMENSIONS
180
Star Schema: MethodologyStar Schema: Methodology
• Identify a key fact that occurs.– Usually some event creates a real fact. Selling a product in a store
on Wednesday, patient visiting a hospital, etc.
• Identify all the dimensions of the data being used. Think of a dimension as a way to slice the data.– Ex: by time, by product, by customer, etc.
• Drill down operations are very well supported
91
181
Star Schema: ExampleStar Schema: Example
• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
• Fact– CustomerPurchase
• Dimensions are– Customer– Product– Time
– Vendor
182
Star Schema: Example (cont.)Star Schema: Example (cont.)
Sale
C u s t o m e r
Store
Time
Product
SALE ID
1
CUST. ID
3
STORE ID
7
PROD. ID
4
PRICE
$ 3 . 0 0
TIME
4 / 2 4 / 9 9
CUST. ID
3
NAME
FRED
PHONE
1 2 3 4
Buys App le s
Y
Has Big Car
Y
DAY
24
MONTH
4
QTR
2Q
YEAR
99
Price
SALE
CUSTOMER
TIME
92
183
Star Schema: OverviewStar Schema: Overview
• Pro– Easy for users to navigate and understand
• Con– Performance
– Can end up with one monster fact table, millions of rows– Flexibility
– Not as easy for customers to change the design
184
MakeChips
Pa r t sManu-
facturing
PRODUCT
Price
Labor
Cost
Snowflake SchemaSnowflake Schema
• Several stars can be connected to form a snowflake
Sale
Price
R e v e n u e
Product
Marketing
Vendor
SALES
Direct MailPr ice
Ad
Location
MARKETING
Distrib-u t i o n
Sales
93
185
SummarySummary
• Two basic types of design– Star Schema– Normalized
• Many Data Warehouse vendors sell products built specifically for the star schema
• Some data warehouses insist that normalization is the way to build the data warehouse.
186
Building aData Warehouse
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
94
187
Building a Data WarehouseBuilding a Data Warehouse
1) Top Down Approaches2) Enterprise Data Model Approach3) "Let Data Users Decide" 4) "Let Data Warehouse Builders Decide" 5) "Let Senior Management Decide"6) Bottom Up Approach
188
Building the Data WarehouseBuilding the Data Warehouse
• How to decide what data goes into the data warehouse?
• Methods:
– Top Down– Using Enterprise Data Models– "Let data users decide" approach– "Let data warehouse builders decide" approach– "Let senior management decide" approach
– Bottom Up– Combine data marts into a data warehouse
95
189
Using Enterprise Data ModelsUsing Enterprise Data Models
• Use the Enterprise Data Model to decide what data goes into the data warehouse.– Model key processes. This approach says let the business decide.– Identify key data used by these processes in an enterprise data
model -- might be a giant Entity-Relationship diagram.
• Put data in the warehouse based on the enterprise data model.
190
CHIPRECIPES
An Enterprise Data Model ExampleAn Enterprise Data Model Example
MAKECHIPS
PUT INBAGS
SELL CHIPS
COUNT$$
BUY MOREPOTATOES
INGREDIANTS
CHIPSUPPLIERS
96
191
"Enterprise Data Model" Approach"Enterprise Data Model" Approach
• Pro– All inclusive -- no chance of leaving key data out.
• Con
– Very difficult to build an EDM.
– If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.
• Ways of Avoiding the Con
– In some cases you can buy an EDM -- if the business is common enough the packaged EDM might be very close and then you just have to modify it to fit your business.
192
USE
RS
SO
UR
CE
"Let Data Users Decide""Let Data Users Decide"
• Let the users of the data warehouse choose what data will go into the warehouse. – The data users deciding the data warehouse data and design will
pay for it as well.– Also, you can charge users who
query the data as well.
DATA WAREHOUSE
97
193
"Let Data Users Decide": An Example"Let Data Users Decide": An Example
DATA WAREHOUSE
MARKETING HUMANRESOURCES FINANCE
DATA
d e m o g r a p h i c s
Adver t is ing
?
trends
e d u c a t i o n
Ethnicgroup
Age
?
DATA
budget
spendingR e v e n u e
?
DATA
194
"Let Data Users Decide" Approach"Let Data Users Decide" Approach
• Pro– Reduces budget problems– Users know best!
• Con– Requires marketing– Could end up with data in the warehouse that is meaningless to the
people who run the place. – Users may not place important data in the warehouse because their
budget is small.– Users who need the data may not use the DW because of budget
concerns.
• Ways of Mitigating the Con– Do not just take money -- try to determine if data is really
corporate.
98
195
Pay As You Go Warehouse AnalogyPay As You Go Warehouse Analogy
I-495
196
"Let Data Warehouse Builders Decide""Let Data Warehouse Builders Decide"
LETS PUTINFORMATION ON
HOW TO BUILDVIRUSES IN THE
DATA WAREHOUSE
DATA WAREHOUSE
• The technical staff who is building the warehouse decides what data gets put in the warehouse.
99
197
"Let Data Warehouse Builders Decide" Approach
"Let Data Warehouse Builders Decide" Approach
• Pro– Very easy to design – Does not take much time– Do not have to deal with users
• Con
– Could easily result in data DUMP not data warehouse
• Ways to mitigate the con– Talk to lots of users to help you guess what should go in the DW
198
“Let Senior Management Decide”“Let Senior Management Decide”
• The senior management decides what data goes into the warehouse.
• Asking the senior management is the safest way to build a data warehouse.
• Identify the key questions on senior management’s mind and get the data to answer these questions.
100
199
“Let Senior Management Decide” Approach“Let Senior Management Decide” Approach
• Pro– Ensures executive support for the project
• Con– Senior management does not have much time for this -- you will
have to only get a few questions at a time
– This dramatically increases visibility - if you do not move quickly senior management will become very angry with the DW.
• Ways to mitigate the con– Do your homework before talking to the senior management -- talk
to the aides of senior management to find out what is on their mind.
– Allocate resources so you can plan to move very quickly once you hear from the senior management.
200
Bottom-Up ApproachBottom-Up Approach
• Move data from existing OLTP Applications to data marts.
• Combine data marts into a data warehouse.
DATAMART
25YARDS
DATAMART
50METERS
DATAMART200CM
DATAWAREHOUSE
OLTPAPP
OLTPAPP
OLTPAPP
101
201
• Pro– Data marts are much easier to build than full-fledged DW.
• Con– Could end up with a bunch of stove pipe data marts.
• Ways to mitigate the con
– Develop standards for data when building the data marts so that you can glue data from different data marts together.
Bottom-Up ApproachBottom-Up Approach
202
Recommendations for an ApproachRecommendations for an Approach
"Let senior management decide"
102
203
User Interface to the
Data Warehouse
( s l ides in th i s sec t ion a r e u s e d c o u r t e s y o fCar r ig Emerg ing Techno logyPh: 410- 553 - 6760www.c a r r i g e t. com)
204
User Interface to the Data WarehouseUser Interface to the Data Warehouse
1) Introduction2) Types of Users3) Functions Users Want to Do 4) Approaches to Building a User Interface5) Hand Built6) Class Libraries7) OLAP Tools8) Types of User Interfaces
103
205
IntroductionIntroduction
Inven to ryOLTP
Appl ica t ion
F i n a n c eOLTP
Appl ica t ion
S a l e sOLTP
Appl ica t ion
F i n a n c eOLTPData
Inven to ryOLTPData
S a l e sOLTPData
DATA WAREHOUSE
USER INTERFACE
• A User Interface (UI) is a front end application designed for the user that presents information in a simplified manner.– Data in a data warehouse does nothing if users cannot access it
– Users do not want to learn SQL to drive DW applications
206
Building User InterfacesBuilding User Interfaces
• DW applications have different types of users with different functionality requirements.– It is critical to identify the key users.– Once you do this, you need to identify their functional
requirements.
• There are three main approaches to building UI’s– Build your own entirely– Use commercial Class Libraries– Using OLAP Tools
104
207
Types of UsersTypes of Users
Everyone
Analysts
Marketing
Executive
Everyone
Analysts
Sales
Executive
Everyone
Analysts
Finance
Executive
CEO
208
• Executives– People who run the place– Need answers quickly– May not be very technical– Expect UI to get them what they
want quickly and efficiently without any need for special training
• Analysts– Have time to really analyze data and think about it
– May have strong statistical and IT background(i.e. Power user of Excel)
– Expect UI to have many complex features, and provide the ability to generate new queries and perform statistical
analysis of the data.
Types of Users (cont.)Types of Users (cont.)
105
209
Types of Users (cont.)Types of Users (cont.)
• Regular User– All other users – Just need some simple answers to simple questions like “What is
Hank’s phone number)– Expect UI to be simplistic, easy to understand, and provide access
to basic information.
210
Subject Matter Experts ExpectSubject Matter Experts Expect
• Query data in the data warehouse• Trend analysis
– “show me how much money we have spent on computers in the last four years”
• Benchmark to competitors– “what are all our competitors charging for product X”
Trend
Sales
1995 1999
106
211
WAL-MART
Subject Matter Experts Expect (cont.)Subject Matter Experts Expect (cont.)
• Drill Down– “on that chart you just showed me, I noticed that revenue was
down in Region #4. Please drill down and show me the breakdown of each area in Region #4.”
DRILL DOWN
05
101520
1 2 3 4
REGIONS
RE
VE
NU
E
Y Values
X Values
DCVAMD
DRILL DOWN
Reg ion 4
Reven
ue
212
Approaches to Building User InterfacesApproaches to Building User Interfaces
• Hand-Built– Write all of your own code
• Use Class Libraries– Use an object oriented approach and buy the CLASS libraries that
do all the hard work
• OLAP– Use an On-Line Analytical Processing package to build user
interfaces for you.
107
213
Architecture of User Interfaces (cont.)Architecture of User Interfaces (cont.)
• Hand Built
• Class Libraries
USER INTERFACEDATA
WAREHOUSE
C o m m e r c i a lOff The Shelf
(COTS)
i.e. JAVAi.e. JAVA
USERINTERFACE
DBMSDBMS
GRAPHICSCLASSLIBRARY OLAPCLASS
LIBRARY
USER
INTERFACE
CLASS
LIBRARYHandHandBuiltBuilt
214
Architecture of User Interfaces (cont.)Architecture of User Interfaces (cont.)
• OLAPST
ORE
REVENUE
YEAR
REGION
USER INTERFACE
DBMS
C o m m e r c i a lOff The Shelf
(COTS)
Resu l t Cube
108
215
Hand-Building User InterfacesHand-Building User Interfaces
• Write all the code yourself– Requires many design documents, coding and testing for all of the
code components.
• Pros
– Very flexible
• Cons
– Could take a long time to develop
– Requires substantial resources– May need lots of testing and debugging
216
Using Class Libraries to Build User InterfacesUsing Class Libraries to Build User Interfaces
• Write initial user dialog yourself and call class libraries for the hard part (graphics and data access functionality).
• Pro– Many class libraries available -- avoid doing a lot of coding
yourself
• Con– Not as flexible -- if the class library does not do what you want it
to do you have to – Find a new class library– Live without the functionality
– Can take a while to find the class library you need and learn how to interface to it
109
217
Using OLAP Tools to Build User InterfacesUsing OLAP Tools to Build User Interfaces
• Many different OLAP tools– Need to survey an OLAP tool– Buy an OLAP tool– Install it– If it does not match all requirements some code may be needed to
communicate with the OLAP tool.
• Three types multi-dimensional OLAP
– Relational OLAP (ROLAP)– Multi-dimensional (MOLAP)– Hybrid (HOLOP)– Distributed (DOLAP)
218
Summary of Tools for UI Development of DWSummary of Tools for UI Development of DW
• Tools that may be used include:– Development of in-house software
– Do it all yourself– Use Class Libraries
– OLAP– ROLAP– MOLAP– HOLAP– DOLAP
• Different tools or techniques may be useful depending upon what kind of user interface is being developed. – Executive Information Systems– Analytical Systems– Enterprise Information Systems
110
219
Types of User InterfacesTypes of User Interfaces
• Executive Information System – Developed for the person who runs the place
• Analytical System– Developed for business analysts
• Enterprise Information System– Developed for users throughout the organization
Everyone
Analysts
Marketing
Executive
Everyone
Analysts
Sales
Executive
Everyone
Analysts
Finance
Executive
CEOEXECUTIVEINFORMATION SYSTEM
ANALYTICAL SYSTEM
ENTERPRISEINFORMATION SYSTEM
220
Executive Information SystemExecutive Information System
• The Executive IS is developed specifically for people who run the organization.
• Development process:– No clean life cycle
– Prototype constantly. Usually have to guess atwhat executives will want to see
– Show executives let them come up with ideasfor revisions
– Drill down functionality required
• Tools– Frequently hand-built, but purchasing a class library can help
lower the development cost.– May just want to use tools that allow development of a
subscription service in which users may “Subscribe” to a fewcanned reports.
111
221
Analytical SystemAnalytical System
• Analytical systems are user interfaces developed for business analysts in an organization.
• Development process:– Allow users to drag-and-drop data around to further the analysis of
this data.– More complex interface is acceptable
– Users may be required to know some SQL knowledge
• Tools:
– OLAP Tools are frequently used to build the interface
222
Enterprise Information SystemEnterprise Information System
• Enterprise IS is written for the general user to retrieve simple, key information.
• Development process:– Frequently developed in-house– So many users around that you really cannot pick a few and ask
what they need.– Simpler than Executive IS as it does not require drill down
functionality.
• Tools
– Place some simple, key informationon a few screens and controlaccess and then deploy.
112
223
Summary of Types of User InterfacesSummary of Types of User Interfaces
• Executive Information System– For the senior executives– Use in-house development or in -house development augmented by
class libraries
• Analytical System– OLAP may make sense here as the interface is more complicated,
but OLAP has drawbacks due to:– Data sparseness– No well accepted query language
• Enterprise Information System– Much simpler than executive system– Good candidate for in-house development