Download - 58528578-SEM-6-BC0058-1-Data-WareHouse
BCA VI-Sem BC0058-(01) – Data Warehousing
1. What are Data Warehouse requirements? How do you gather the requirements?
Ans:
Requirements analysis is fundamental to the design and development of business
intelligence. The user requirements drive decisions about the data to incorporate in the data
warehouse, how to organize (dimension) the data and how often to refresh the data. User
requirements also determine the type of tool required and how the user will interact with the
data. Data granularity decisions should not be driven simply by the availability of data -- the
users should have a need for the data beyond their current abilities.
Business intelligence requirements analysis has three fundamental steps:
1. Identify the data and granularity the users interact with, or would like to interact with,
via interviews.
2. Develop a data model to provide the data at the appropriate granularity.
3. Determine how the data will be sourced to this data model.
There are other related tasks which also need to be addressed as a result of the
requirements analysis process such as: query/OLAP tool features to support the user's
interactions with the data; hardware infrastructure required to load, store and provide data to
the end user; source data analysis; archiving strategy; and sizing.
Data Warehouse Requirements Gathering
The initial requirements are identified through interviews, with a representative set of end
users. In preparation for the interviews, it's often useful for the end users to collect a sample
of the reports they work with in reporting and analyzing their data as well as screen shots of
any tools they useWhat are your business objectives?
• How would you interpret data set results?
• How should the data you work with be organized? Should it be organized by
customer, product, geography and time? Should it be organized by account, salesperson,
distribution channel and month?
Suresh Kumar Suthar Roll : 52077676 1
BCA VI-Sem BC0058-(01) – Data Warehousing
• What are the hierarchies, rollups or aggregations used with these dimensions? Do
customers roll up to geographies that roll up to total? Do products roll up to product
groups? Do salespeople roll up to districts that roll up to regions? What types of summary
reports do you work with?
• What are the measures or facts you work with (e.g., revenues, expenses, balances,
variances, percent growth, percent of total)? How are they defined?
• Do you "filter" the data? Do you need data only at the top and bottom accounts? Do
you review the performance of only certain types of products? Do you segment the data
based on demographics?
• How often do you obtain refreshes of the data? Do you obtain them daily, weekly,
monthly or quarterly? Do you need it this often?
• What tools do you use to interact with the data?
• Is the data clean?
• Do you receive the data in a timely fashion?
• Do the tools you use support your requirements?
• What types of things would you like to do that you can't do today?
• What is your tool functionality?
• What is your data availability?
• Do you spend most of your time on analysis or preparation for analysis?
Suresh Kumar Suthar Roll : 52077676 2
BCA VI-Sem BC0058-(01) – Data Warehousing
2. What is Data Transformation? Explain the tasks involved in it.
Ans:
Data transformations are often the most complex and, in terms of processing time, the most
costly part of the extraction, transformation, and loading (ETL) process. They can range
from simple data conversions to extremely complex data scrubbing techniques. Many, if not
all, data transformations can occur within an Oracle database, although transformations are
often implemented outside of the database (for example, on flat files) as well.
In metadata, a data transformation converts data from a source data format into destination data.
Data transformation can be divided into two steps:
1. data mapping maps data elements from the source to the destination and captures any transformation that must occur
2. code generation that creates the actual transformation program
Data element to data element mapping is frequently complicated by complex transformations that require one-to-many and many-to-one transformation rules.
The code generation step takes the data element mapping specification and creates an executable program that can be run on a computer system. Code generation can also create transformation in easy-to-maintain computer languages such as Java or XSLT.
When the mapping is indirect via a mediating data model, the process is also called data mediation.
Basic Tasks in Data Transformation
The data transformation contains the following basic tasks:
Selection: This takes place at the beginning of the whole process of data transformation.
We select either whole records or parts of several records from the source system. The task
of selection usually forms part of the extraction function itself.
Splitting/Joining: This task includes the types of data manipulation we need to perform on
the selected parts of source records. Sometimes we will be splitting the selected parts even
Suresh Kumar Suthar Roll : 52077676 3
BCA VI-Sem BC0058-(01) – Data Warehousing
further during data transformation. Joining of parts selected from many source systems is
more widespread in the Data Warehouse environment.
Conversion: This is an all-inclusive task. It includes a large variety of rudimentary
conversions of single fields for two primary reasons – one to standardize among the data
extraction from disparate source systems, and the other to make the fields usable and
understandable to the users.
Summarization: Sometimes we may find that it is not feasible to keep data at the lowest
level of details in your Data Warehouse. It may be that none of our users ever need data at
the lowest granularity for analysis or querying. For example, for a grocery chain, sales data
at the lowest level of details for every transaction at the checkout may not needed. Storing
sales byproduct by store by day in the Data Warehouse may be quite adequate. So, in this
case, the data transformation function includes summarization of daily sales by product and
by store.
Enrichment: This task is the rearrangement and simplification of individual fields to make
them more useful for the Data Warehouse environment. We may use one or more fields
from the same input record to create a better view of the data for the Data Warehouse. This
principle is extended when one or more fields originate from multiple records, resulting in a
single field for the Data Warehouse.
Suresh Kumar Suthar Roll : 52077676 4
BCA VI-Sem BC0058-(01) – Data Warehousing
4. What is Data Management? Explain the different topics in Data Management
Ans:
Data Management is a broad field of study, but essentially is the process of managing data
as a resource that is valuable to an organization or business. One of the largest
organizations that deal with data management, DAMA (Data Management Association),
states that data management is the process of developing data architectures, practices and
procedures dealing with data and then executing these aspects on a regular basis.
There are many topics within data management, some of the more popular topics include
data modeling, data warehousing, data movement, database administration and data
mining.
Data Modeling
Data modeling is first creating a structure for the data that you collect and use and then
organizing this data in a way that is easily accessible and efficient to store and pull the data
for reports and analysis. In order to create a structure for data, it must be named
appropriately and show a relationship with other data. It also must fit appropriately in a
class. For instance, if you have a database of media, you might have a hierarchal structure
of objects that include photos, videos, and audio files. Within each category, you can
classify objects accordingly.
Data Warehousing
Data warehousing is storing data effectively so that it can be accessed and used efficiently.
Different organizations collect different types of data, but many organizations use their data
the same way, in order to create reports and analyze their data to make quality business
decisions. Data warehousing is usually an organizational wide repository of data, however
for very large corporations in can encompass just one office or one department.
Suresh Kumar Suthar Roll : 52077676 5
BCA VI-Sem BC0058-(01) – Data Warehousing
Data Movement
Data movement is the ability to move data from one place to another. For instance, data
needs to be moved from where it is collected to a database and then to an end user, but this
process takes quite a bit of logistic insight. Not only do all hardware, applications and data
collected need to be compatible with one another, they must also be able to be classified,
stored and accessed with ease within an organization. Moving data can be very expensive
and can require lots of resources to make sure that data is moved efficiently, that data is
secure in transit and that once it reaches the end user it can be used effectively either to be
printed out as a report, saved on a computer or sent as an email attachment.
Database Administration
Database administration is extremely important in managing data. Every organization or
enterprise needs database administrators that are responsible for the database
environment. Database administrators are usually given the authority to do the following
tasks that include recoverability, integrity, security, availability, performance and
development & testing support.
Recoverability is usually defined as a way to store data as a back up and then test the back
ups to make sure that they are valid. The task of integrity means that data that is pulled for
certain records or files are in fact valid and have high data integrity. Data integrity is
extremely important especially when creating reports or when data is used for analysis. If
you have data that is deemed invalid, your results will be worthless.
Database security is an essential task for database administrators. For instance, database
administrators are usually in charge of giving clearance and access to certain databases or
trees in an organization. Another important task is availability. Availability is defined as
making sure a database is up and running. The more up time, usually the higher level of
productivity. Performance is related to availability, it is considered getting the most out of the
hardware, applications and data as possible. Performance is usually in relation to an
organizations budget, physical equipment and resources.
Suresh Kumar Suthar Roll : 52077676 6
BCA VI-Sem BC0058-(01) – Data Warehousing
Finally, a database administrator is usually involved in database development and testing
support. Database administrators are always trying to push the envelope, trying to get more
use out of the data and add better performing and more powerful applications, hardware and
resources to the database structure. A database that is administered correctly is not only a
sign of competent database administrator, but it also means that all end users have a huge
resource in the data that is available. This makes it easy to create reports, conduct analysis
and make high quality decisions based on data that is collected and used within the
organization.
Data Mining
Another important topic regarding data management is data mining. Data mining is a
process in which large amounts of data are sifted through to show trends, relationships, and
patterns. Data mining is a crucial component to data management because it exposes
interesting information about the data being collected. It is important to note that data is
primarily collected so it can be used to find these patterns, relationships and trends that can
help a business grow or create profit.
While there are many topics within data management, they all work together from the
beginning where data is collected to the end of the process where it is sifted through;
analyzed and formatted where specialists can then make quality decisions based upon it.
Suresh Kumar Suthar Roll : 52077676 7
BCA VI-Sem BC0058-(01) – Data Warehousing
5. Define OLAP. Mention its advantages.
Ans:
OLAP stands for On Line Analytical Processing, a series of protocols used mainly for
business reporting. Using OLAP, businesses can analyze data in all manner of different
ways, including budgeting, planning, simulation, data warehouse reporting, and trend
analysis. A main component of OLAP is its ability to make multidimensional calculations,
allowing a wide and lightning-fast array of possibilities. In addition, the bigger the business,
the bigger its business reporting needs. Multidimensional calculations enable a large
business to complete in seconds what it otherwise would have waited a handful of minutes
to receive.
One main benefit of OLAP is consistency of calculations. No matter how fast data is
processed through OLAP software or servers, the reporting that results is presented in a
consistent presentation, so executives always know what to look for where. This is
especially helpful when comparing information from previous reports to information
contained in new ones and projected future ones. "What if" scenarios are some of the most
popular uses of OLAP software and are made eminently more possible by multidimensional
processing.
Another benefit of multidimensional data presentation is that it allows a manager to pull
down data from an OLAP database in broad or specific terms. In other words, reporting can
be as simple as comparing a few lines of data in one column of a spreadsheet or as
complex as viewing all aspects of a mountain of data. Also, multidimensional presentation
can create an understanding of relationships not previously realized. All of this, of course,
can be done in the blink of an eye.
Producers of OLAP software are familiar, including Oracle, IBM, and Hyperion Solutions.
Oracle, which has a reputation for being different, refers to OLAP software as Business
Intelligence. IBM and Hyperion Solutions, wishing to remain consistent with industry
standards, call their software OLAP.
Following are some of the benefits of OLAP -
Suresh Kumar Suthar Roll : 52077676 8
BCA VI-Sem BC0058-(01) – Data Warehousing
1. The main benefit of the OLAP is its steadiness in calculations. The reporting is
always represented in a coherent presentation irrespective of how fast data is deal
with through the OLAP server or software and thus allows the executives and
analysts to know exactly to look for where.
2. The other convenience of OLAP is that it allows the manager to tear down data from
OLAP database in specific or broad terms. In layman's term, the report can be as
simple of comparing two columns or as complex of analyzing a huge amount of data.
Moreover, it also helps to realize relationships that were forgotten earlier.
Suresh Kumar Suthar Roll : 52077676 9
BCA VI-Sem BC0058-(01) – Data Warehousing
6. Explain Indexing on Dimensional Tables and Fact Tables.
Ans:
Indexing in the Dimension Tables:
Columns in the dimension tables are used in the predicates of queries. A query may run like
this: How much are the sales of Product A in the month of March for Northern Division?
Here the columns product, month, and division from three different dimension tables are
candidates for indexing. Inspect the columns or each dimension table carefully and plan the
indexes for these tables. We may be not be able to achieve performance improvement by
indexing the columns in the fact tables but the columns in the dimension tables offer
tremendous possibilities to improve performance through indexing.
Creating Index on Dimension Table:
Create a unique B-Tree index on the single-column primary key. Example the columns that
are commonly used to constrain the queries. These are candidates for bitmapped indexes.
Look for columns that are frequently accessed together in large Dimension Tables.
Determine bow these columns may be arranged and used to create multicolumn indexes.
Remember that the columns that are more frequently accessed or the columns that are at
higher hierarchical levels in the dimension table are placed at the high order of the
multicolumn indexes. Individually index every column likely to be used frequently in join
condition.
Indexing in the Fact Table:
The primary key of the fact table consists of the primary keys of all the connected
dimensions. If we have four dimension tables store, product time and promotion then the full
primary key of the fact table is the concatenation of the primary keys of the store, product,
time and promotion tables. What are the other columns? The other columns are metrics
such as sale units, sale dollars, cost dollars, and so on. These are the types of columns to
be considered for indexing the fact tables.
If the DBMS does not create an index on the primary key, deliberately create a B-Tree
index on the full primary key.
Suresh Kumar Suthar Roll : 52077676 10
BCA VI-Sem BC0058-(01) – Data Warehousing
Review the individual components of concatenated key. Create indexes on combinations
based on query processing requirements. Carefully design the order of individual key
elements in the full-concatenated key for indexing. In the high order of the concatenated
key, place the keys of the dimension tables frequently referred to while querying.
Suresh Kumar Suthar Roll : 52077676 11
BCA VI-Sem BC0058-(01) – Data Warehousing
7. What is repository? How is it helpful to the data warehouse maintenance?
Ans:
Data Repository is a logical (and sometimes physical) partitioning of data where multiple
databases which apply to specific applications or sets of applications reside. For example,
several databases (revenues, expenses) which support financial applications (A/R, A/P)
could reside in a single financial Data Repository.
A database warehouse is one large Data Repository of all business related information
including all historical data of the business organization implementing the data warehouse.
Data warehousing is a complex process of building a data repository in the form of a
relational database so that the company can support web or text mining in order to leverage
data and transform or aggregate them into useful information.
In all cases, organizations use data warehousing to gain a competitive advantage, support
for decision making processes through comprehensive data analysis.
Some of the key components of data warehousing are Decision Support Systems (DSS)
and Data Mining (DM).
Data volumes in data warehouse could grow at an exponential rate so there should be a
way to handle this tremendous growth. With respect to storage requirements, the critical
needs that need to be seriously considered in a data warehouse are high availability, high
data volume, high performance and scalability, simplification and usability and easy
management.
Partitioning of data into a logical or in some cases physical Data Repository could greatly
help meet the requirement in relation to dealing with the exponential growth of data volumes
in the data warehouse. If all the data in the data warehouse were not partitioned into several
Data Repositories, then there will be profound disadvantage in terms of perfomance and
efficiency.
Suresh Kumar Suthar Roll : 52077676 12
BCA VI-Sem BC0058-(01) – Data Warehousing
For one, if the central server fails, the system would come to a halt. This is because data is
just located in one monolithic system, and when the hardware fails, there is no sort back up.
It may take some time to get the server up, depending on the nature of the problem. But in a
business company, even a few minutes of business stoppage can already translate into
thousands of potential dollars lost from the business.
When Data Repository is employed in the data warehouse, the load can be distributed
across many databases or even across many servers. For instance, instead of having one
computer handle the database related to customers, several databases could be handling
the different aspects of customers.
In a very large company such as a company that has several branches around the country,
instead of having all the customers in one database, several databases may be handling
different branch customer database in a data repository. Or as earlier mentioned, several
company departmental database may be broken down into various Data Repository such as
one data repository supporting several databases (revenues, expenses) which support
financial applications (A/R, A/P) could reside in a single financial Data Repository.
Data Repository offers easier and faster access due to the fact that related information are,
to some degree, lumped or clustered together. For instance, in the example with financial
Data Repository, anybody from the financial department or any other data use wanting
information related to financials will not have to dig through the entire volume of the data in
the data warehouse.
Suresh Kumar Suthar Roll : 52077676 13