TRANSCRIPT
BCA 5th Sem (Data warehousing & Mining)

Syllabus of Data Warehousing & Mining
UNIT I
Introduction to Data Warehousing, The need for data warehousing, Operational & Informational Data Stores, Data Warehouse Characteristics, Data Warehouse Role & Structure, The cost of warehousing data.
Introduction to OLAP & OLTP, Difference between OLAP & OLTP, OLAP Operations.
UNIT II
Building a Data Warehouse, Design/Technical/Implementation Considerations, Data Pre-processing Overview, Data Summarization, Data Cleaning, Data Transformation, Concept Hierarchy, Structure, Patterns & Models, Artificial Intelligence (Overview).
Multidimensional Data Model, Schemas for Multidimensional Data (Star Schema, Snowflake Schema, Fact Constellation), Data Warehouse Architecture, Data Warehouse Design, OLAP Three-tier Architecture, Indexing & Querying in OLAP, OLAM, Efficient Methods of Cube Computation, Discovery-Driven Exploration of Data Cubes, Attribute-Oriented Induction.
UNIT III
Association Rule Mining, Market Basket Analysis, Apriori Algorithm, Mining Multilevel Association Rules, From Association Mining to Correlation Analysis, Constraint-Based Association Mining, Introduction to Classification, Classification by Decision Tree, Attribute Selection Measure.
UNIT IV
Introduction to Prediction Techniques, Accuracy of a Classifier, Cross-Validation, Bootstrap, Boosting, Bagging, Introduction to Clustering, Classification of Various Clustering Algorithms, Selecting and Using the Right DM Technique, Data Visualization.
INDEX
S.No Content
1 Introduction to Data Warehousing
2 Introduction to OLAP & OLTP
3 Building a Data Warehouse
4 Multidimensional Data Model
5 Three-tier Architecture
6 Association Rule Mining
7 Introduction to Prediction techniques
UNIT-I
Data Warehousing:
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on a product, a supplier, or a consumer: that historical data will not be available for analysis, because it has been overwritten by subsequent transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.
For example, in the business world, a data warehouse might incorporate customer
information from a company's point-of-sale systems (the cash registers), its
website, its mailing lists and its comment cards.
Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
There are many types of data warehouses, but these three are the most common:
Enterprise data warehouse – Provides a central repository tailored to support decision-making for the entire enterprise.
Operational Data Store – Similar to the enterprise warehouse in terms of scope, but data is refreshed in near real time and can be used for operational reporting.
Data Mart – This is a subset of a data warehouse used to support a specific region, business unit, or functional area.
Understanding a Data Warehouse
A data warehouse is a database, which is kept separate from the organization's
operational database.
There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyze
its business.
A data warehouse helps executives to organize, understand, and use their data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system helps in consolidated historical data analysis.
Benefits of a Data Warehouse
Organizations have a common goal – to make better business decisions. A data
warehouse, once implemented into your business intelligence framework, can
benefit your company in numerous ways. A data warehouse:
1. Delivers enhanced business intelligence
By having access to information from various sources from a single platform,
decision makers will no longer need to rely on limited data or their instinct.
Additionally, data warehouses can effortlessly be applied to a business’s
processes, for instance, market segmentation, sales, risk, inventory, and financial
management.
2. Saves time
A data warehouse standardizes, preserves, and stores data from distinct sources,
aiding the consolidation and integration of all the data. Since critical data is
available to all users, it allows them to make informed decisions on key aspects. In
addition, executives can query the data themselves with little to no IT support,
saving more time and money.
3. Enhances data quality and consistency
A data warehouse converts data from multiple sources into a consistent format.
Since the data from across the organization is standardized, each department will
produce results that are consistent. This will lead to more accurate data, which will
become the basis for solid decisions.
4. Generates a high Return on Investment (ROI)
Companies that invest in a data warehouse experience higher revenues and cost savings than those that haven't.
5. Provides competitive advantage
Data warehouses help organizations get a holistic view of their current standing and evaluate
opportunities and risks, thus providing companies with a competitive advantage.
6. Improves the decision-making process
Data warehousing provides better insights to decision makers by maintaining a
cohesive database of current and historical data. By transforming data into
purposeful information, decision makers can perform more functional, precise,
and reliable analysis and create more useful reports with ease.
7. Enables organizations to forecast with confidence
Data professionals can analyze business data to make market forecasts, identify
potential KPIs, and gauge predicted results, allowing key personnel to plan
accordingly.
8. Streamlines the flow of information
Data warehousing facilitates the flow of information through a network
connecting all related or non-related parties.
The Disadvantages of a Data Warehouse
Data warehouses are relational databases that act as data analysis tools,
aggregating data from multiple departments of a business into one data store.
Data warehouses are typically updated as an end-of-day batch job, rather than
being churned by real-time transactional data. Their primary benefit is giving managers better and timelier data with which to make strategic decisions for the company.
However, they have some drawbacks as well.
Extra Reporting Work
Depending on the size of the organization, a data warehouse runs the risk of imposing extra work on departments. Each type of data that's needed in the warehouse
typically has to be generated by the IT teams in each division of the business.
This can be as simple as duplicating data from an existing database, but at other
times, it involves gathering data from customers or employees that wasn't
gathered before.
Cost/Benefit Ratio
A commonly cited disadvantage of data warehousing is the cost/benefit analysis.
A data warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT staff hours and budget money to produce a tool that doesn't get used often enough to justify the implementation expense. And that is before even considering the expense of maintaining the data warehouse and updating it as the business grows and adapts to the market.
Data Ownership Concerns
Data warehouses are often, but not always, Software as a Service
implementations, or cloud services applications. Your data security in this
environment is only as good as your cloud vendor. Even if implemented locally,
there are concerns about data access throughout the company. Make sure that the
people doing the analysis are individuals that your organization trusts, especially
with customers' personal data. A data warehouse that leaks customer data is a
privacy and public relations nightmare.
Data Flexibility
Data warehouses tend to have static data sets with minimal ability to "drill
down" to specific solutions. The data is imported and filtered through a schema,
and it is often days or weeks old by the time it's actually used. In addition, data
warehouses are usually subject to ad hoc queries and are thus notoriously
difficult to tune for processing speed and query speed. While the queries are
often ad hoc, the queries are limited by what data relations were set when the
aggregation was assembled.
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge sets of data that have either been compiled by computer systems or have been downloaded into the computer. In the data mining process, the computer analyzes the data and extracts useful information from it. It looks for hidden patterns within the data set and tries to predict future behavior. Data mining is primarily used to discover and indicate relationships among the data sets.
Examples of data mining are common in industry, for instance in marketing and banking. Banks use data mining to better understand market risks. It is commonly applied to credit ratings and to intelligent anti-fraud systems that analyze transactions, card transactions, purchasing patterns, and customer financial data.
Data mining aims to enable business organizations to view business behaviors, trends, and relationships that allow the business to make data-driven decisions. It is also known as Knowledge Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and machine learning systems to discover relationships within the data. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
Important features of Data Mining:
The important features of Data Mining are given below:
It utilizes the automated discovery of patterns.
It predicts the expected results.
It focuses on large data sets and databases.
It creates actionable information.
Advantages of Data Mining:
i. Market Analysis:
Data mining can predict market behavior, which helps a business make decisions. For example, it can predict who is likely to purchase what type of products.
ii. Fraud detection:
Data Mining methods can help to find which cellular phone calls, insurance
claims, credit, or debit card purchases are going to be fraudulent.
iii. Financial Market Analysis:
Data mining techniques are widely used to help model financial markets.
iv. Trend Analysis:
Analyzing the trends currently existing in the marketplace is a strategic benefit, because it helps reduce costs and align the manufacturing process with market demand.
Differences between Data Mining and Data Warehousing:

Data Mining: Data mining is the process of determining data patterns.
Data Warehousing: A data warehouse is a database system designed for analytics.

Data Mining: Data mining is generally considered the process of extracting useful data from a large set of data.
Data Warehousing: Data warehousing is the process of combining all the relevant data.

Data Mining: Business users carry out data mining with the help of engineers.
Data Warehousing: Data warehousing is carried out entirely by engineers.

Data Mining: In data mining, data is analyzed repeatedly.
Data Warehousing: In data warehousing, data is stored periodically.

Data Mining: Data mining uses pattern recognition techniques to identify patterns.
Data Warehousing: Data warehousing is the process of extracting and storing data to allow easier reporting.

Data Mining: One notable data mining application is the detection and identification of unwanted errors that occur in a system.
Data Warehousing: One advantage of the data warehouse is its ability to be updated frequently, which is why it is ideal for business users who want to stay up to date.

Data Mining: Data mining techniques are cost-efficient compared to other statistical data applications.
Data Warehousing: The responsibility of the data warehouse is to simplify every type of business data.

Data Mining: Data mining techniques are not 100 percent accurate and may lead to serious consequences in certain conditions.
Data Warehousing: In the data warehouse, there is a possibility that the data required for analysis by the company is not integrated into the warehouse, which can lead to loss of information.

Data Mining: Companies can benefit from this analytical tool by equipping themselves with suitable and accessible knowledge-based data.
Data Warehousing: A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
Difference between Data Warehouse and Data Mart

Definition:
  Data Warehouse: A Data Warehouse is a large repository of data collected from different organizations or departments within a corporation.
  Data Mart: A data mart is a subtype of a Data Warehouse. It is designed to meet the needs of a certain user group.

Usage:
  Data Warehouse: It helps to make strategic decisions.
  Data Mart: It helps to make tactical decisions for the business.

Objective:
  Data Warehouse: The main objective of a Data Warehouse is to provide an integrated environment and a coherent picture of the business at a point in time.
  Data Mart: A data mart is mostly used in a business division at the department level.

Designing:
  Data Warehouse: The designing process of a Data Warehouse is quite difficult. It may or may not use a dimensional model, but it can feed dimensional models.
  Data Mart: The designing process of a Data Mart is easy. It is built around a dimensional model using a star schema.

Data Handling:
  Data Warehouse: Data warehousing covers a large area of the corporation, which is why it takes a long time to process.
  Data Mart: Data marts are easy to use, design, and implement, as they handle only small amounts of data.

Focus:
  Data Warehouse: Data warehousing is broadly focused across all departments. It can even represent the entire company.
  Data Mart: A Data Mart is subject-oriented and is used at a department level.

Data type:
  Data Warehouse: The data stored in a Data Warehouse is always detailed when compared with a data mart.
  Data Mart: Data Marts are built for particular user groups; therefore, the data is short and limited.

Subject-area:
  Data Warehouse: Provides an integrated environment and a coherent picture of the business at a point in time.
  Data Mart: Mostly holds only one subject area; for example, sales figures.

Data storing:
  Data Warehouse: Designed to store enterprise-wide decision data, not just marketing data.
  Data Mart: Dimensional modeling and star schema design are employed to optimize the performance of the access layer.

Data design:
  Data Warehouse: Time variance and non-volatile design are strictly enforced.
  Data Mart: Mostly includes consolidated data structures to meet the subject area's query and reporting needs.

Data value:
  Data Warehouse: Read-only from the end-users' standpoint.
  Data Mart: Transaction data, regardless of grain, is fed directly from the Data Warehouse.

Scope:
  Data Warehouse: Data warehousing is more helpful, as it can bring information from any department.
  Data Mart: A data mart contains data of a specific department of a company. There may be separate data marts for sales, finance, marketing, etc. It has limited usage.

Source:
  Data Warehouse: In a Data Warehouse, data comes from many sources.
  Data Mart: In a Data Mart, data comes from very few sources.

Size:
  Data Warehouse: The size of a Data Warehouse may range from 100 GB to 1 TB+.
  Data Mart: The size of a Data Mart is less than 100 GB.

Implementation time:
  Data Warehouse: The implementation process of a Data Warehouse can extend from months to years.
  Data Mart: The implementation process of a Data Mart is restricted to a few months.
Need For Data Warehouse
1. Improving integration
An organization registers data in various systems which support the various
business processes. In order to create an overall picture of business operations,
customers, and suppliers – thus creating a single version of the truth – the data
must come together in one place and be made compatible. Both external (from the
environment) and internal data (from ERP and financial systems) should merge
into the data warehouse and then be grouped.
2. Speeding up response times
The source systems are fully optimized in order to process many small
transactions, such as orders, in a short time. Generating information about the
performance of the organization only requires a few large ‘transactions’ in which
large volumes of data are gathered and aggregated. The structure of a data
warehouse is specifically designed to quickly analyze such large volumes of (big)
data.
3. Faster and more flexible reporting
The structure of both data warehouses and data marts enables end users to report
in a flexible manner and to quickly perform interactive analysis based on various
predefined angles (dimensions). They may, for example, with a single mouse click
jump from year level, to quarter, to month level, and quickly switch between the
customer dimension and the product dimension, all while the indicator remains
fixed. In this way, end users can actually mix the data and thus quickly gain
knowledge about business operations and performance indicators.
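The kind of flexible reporting described above can be sketched in a few lines of Python with pandas. The data, column names, and the "amount" indicator below are hypothetical, invented only to show roll-up, drill-down, and dimension switching:

```python
# A minimal sketch of flexible dimensional reporting (hypothetical data).
# The DataFrame plays the role of a small fact table; groupby() switches the
# reporting angle the way a user would in a BI front end.
import pandas as pd

sales = pd.DataFrame({
    "year":     [2022, 2022, 2022, 2023, 2023, 2023],
    "quarter":  ["Q1", "Q2", "Q2", "Q1", "Q1", "Q3"],
    "customer": ["Acme", "Acme", "Birch", "Acme", "Birch", "Birch"],
    "product":  ["Bolt", "Nut", "Bolt", "Nut", "Bolt", "Nut"],
    "amount":   [100.0, 150.0, 80.0, 120.0, 90.0, 60.0],
})

# Year level: the indicator (total amount) rolled up to years.
print(sales.groupby("year")["amount"].sum())

# Drill down: the same indicator one level deeper, at quarter level.
print(sales.groupby(["year", "quarter"])["amount"].sum())

# Switch dimension: same indicator, viewed by product instead of customer.
print(sales.groupby("product")["amount"].sum())
```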
4. Recording changes to build history
Source systems don’t usually keep a history of certain data. For example, if a
customer relocates or a product moves to a different product group, the (old)
values will most likely be overwritten. This means they disappear from the system
– or at least they’re very difficult to trace back.
That’s really bad, because in order to generate reliable information, we actually
need these old values, as users sometimes want to be able to look back in time. In
other words: we want to be able to look at the organization’s performance from a
historical perspective – in accordance with the organizational structure and
product classifications of that time – instead of in the current context. A data
warehouse ensures that data changes in the source system are recorded, which
enables historical analysis.
5. Increasing data quality
Stakeholders and users frequently overestimate the quality of data in the source
systems. Unfortunately, source systems quite often contain data of poor quality.
When we use a data warehouse, we can greatly improve the data quality, either
through – where possible – correcting the data while loading or by tackling the
problem at its source.
6. Unburdening the IT department
A data warehouse and Business Intelligence tools allow employees within the
organization to create reports and perform analyses independently. However, an
organization will first have to invest in order to set up the required infrastructure
for that data warehouse and those BI tools. The following principle applies: the
better the architecture is set up and developed, the more complex reports users can
independently create. Obviously, users first need sufficient training and support,
where necessary. Yet, what we see in practice is that many of the more complex
reports end up being created by the IT department. This is mostly due to users
lacking either the time or the knowledge.
7. Increasing findability
When we create a data warehouse, we make sure that users can easily access the
meaning of data. (In the source system, these meanings are either non-existent or
poorly accessible.) With a data warehouse, users can find data more quickly, and
thus establish information and knowledge faster. All the goals of the data
warehouse serve the aims of Business Intelligence: making better decisions faster
at all levels within the organization and even across organizational boundaries.
Data Warehouse Features
The key features of a data warehouse are discussed below −
Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations, rather it focuses on
modelling and analysis of data for decision making.
Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.
Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from the
historical point of view.
Non-volatile − Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Data Warehouse Applications
Data warehouse helps business executives to organize, analyze, and use their data
for decision making. Data warehouses are widely used in the following fields −
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Characteristics and Functions of Data warehouse
Data warehouse can be controlled when the user has a shared way of explaining
the trends that are introduced as specific subject. Below are
major characteristics of data warehouse:
Subject-oriented –
A data warehouse is always subject-oriented, as it delivers information about a theme instead of the organization's current operations. It is built around a specific theme, meaning the data warehousing process is designed to handle a specific, well-defined subject. These themes can be sales, distribution, marketing, etc.
A data warehouse never puts its emphasis only on current operations. Instead, it focuses on the modeling and analysis of data used to make decisions. It also delivers an easy and precise view of the particular theme by excluding data that is not required for decision-making.
Integrated –
This is closely related to subject orientation: the data is kept in a reliable, consistent format. Integration means establishing a shared definition with which to reconcile all similar data from the different databases. The data also has to reside in the data warehouse in a shared and generally accepted manner.
A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database. In addition, it must have reliable naming conventions, formats, and codes. Integration of the data warehouse benefits effective analysis of data. Consistency in naming conventions, attribute measures, encoding structures, etc. should be ensured. Integration allows the data warehouse to handle various subject-related data.
Time-Variant –
Data is maintained over different intervals of time, such as weekly, monthly, or annually. The time horizon of a data warehouse is much wider than that of operational systems (OLTP). The data residing in the data warehouse is associated with a specific interval of time and delivers information from a historical perspective; it contains an element of time, either explicitly or implicitly. Another feature of time-variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.
Non-Volatile –
As the name implies, the data residing in the data warehouse is permanent. Data is not erased or deleted when new data is inserted; the warehouse accumulates a very large quantity of data over time.
The data is read-only and is refreshed at particular intervals. This is beneficial for analyzing historical data and for understanding what happened. The warehouse does not need transaction processing, recovery, or concurrency control mechanisms. Operations such as delete, update, and insert that are done in an operational application are absent in the data warehouse environment. Only two types of data operations are performed in the data warehouse:
Data Loading
Data Access
Data warehouse role and structure
The purpose of the data warehouse is mostly to integrate corporate data across an organization. It contains the "single version of the truth" for the organization, carefully constructed from data stored in disparate internal and external operational databases. Data is stored at a very granular level of detail.
The data warehouse is composed of data structures populated by data extracted
from the OLTP database and transformed to fit a flatter schema. Ultimately
the warehouse structures are exposed as star schemas through views of fact and
dimension tables.
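As a purely illustrative sketch of this structure, the following Python snippet builds a tiny star schema in SQLite: one fact table, three dimension tables, and a view that exposes the star for reporting. All table and column names are assumptions made for the example, not taken from any particular product.

```python
# A minimal star schema sketch in SQLite (all names are illustrative).
# The fact table holds the measures; each dimension table describes one
# analysis angle, as in the fact and dimension tables described above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, year INT, quarter INT, month INT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);

CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,
    amount      REAL
);

-- A view exposing the star as a flat structure for reporting.
CREATE VIEW v_sales AS
SELECT d.year, d.quarter, p.category, c.region, f.quantity, f.amount
FROM fact_sales f
JOIN dim_date d     ON d.date_id     = f.date_id
JOIN dim_product p  ON p.product_id  = f.product_id
JOIN dim_customer c ON c.customer_id = f.customer_id;
""")
print("star schema created")
```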
Data Warehouse Cost
Budgeting – Without cost justification, projects will always be in jeopardy.
During future budget cycles, management will be looking for ways to reduce cost
and if there is no documented reason for completing the project, they are likely to
forget the flush of excitement that accompanied the project’s initiation.
Staffing – Without cost justification, staffing with the right people may be
difficult. By having some real dollar numbers to back up a request, the request is
more likely to be satisfied.
Prioritization – Without a cost/benefit analysis, project prioritization is difficult, and management has little with which to compare projects other than a gut feeling that one is more important than another. Without cost/benefit analysis, the line-of-business
manager with the most power is likely to get her project approved. The project
that is most important to the enterprise may never be implemented.
Controlling Costs
Costs can and must be controlled. It is the project manager who has the
responsibility for controlling costs along with the other responsibilities. Adhering
to the Project Agreement is a major start for controlling costs. The Project
Agreement specifies the data that will be in the data warehouse, the periods for
which the data is kept, the number of users and predefined queries and reports.
Any one of these factors, if not held in check, will increase the cost and possibly
the schedule of the project. A primary role of the project manager will be to
control scope creep.
Additional Support
User Support staff or the Help Desk staff will be the users’ primary contact when
there are problems. Providing adequate User Support will require more people,
and more training of those people, to answer questions and help the users through
difficult situations. The cost for the additional people, the training and possibly an
upgrade in the number and knowledge-level of the staff answering the phones
must be added into the data warehouse costs.
Consultants and Contractors
Consultant and contractor expenses can balloon a project's cost. Consultants are used to supplement the lack of experience of the project team; contractors are used to supplement the lack of skilled personnel. There are two types of
consultant/contractors:
Product specific contractors – These persons are brought in because they know
the product. They can either help or actually install the product, and they can tune
the product. They will customize the product, if it is necessary. The product-
specific consultants may either be in the employ of the tool vendor or may be
independent. An example of their services would be installing and using an ETL
tool to extract, transform and load data from your source files to the data
warehouse. In this activity they may be generating the ETL code on their own or
working with your people in this endeavor.
General data warehouse consultants – These consultants may have a specific
niche such as data modeling, performance, data mining, tool selection,
requirements gathering or project planning. They will typically be involved for a
shorter period of time than the product-specific consultant/contractor. They have
two roles that are equally important. The first is working with your people to
complete a task such as selecting a query tool or developing a project plan. The
second is the knowledge transfer to your staff so they can perform the activity the
next time on their own. Just as in the case of the product-specific
consultant/contractor, your goal is to make your staff as self-sufficient as soon as
possible.
Products
The software products that support the data warehouse can be very expensive. The
first thing to consider is which categories of tools you need. Do not bring in more
categories of products than you need. Do not try to accomplish everything with
your first implementation. Be very selective.
Hopefully, you have someone in your organization experienced in dealing with
vendors and understanding their contracts. You will be working closely with this
person. They will know the things to watch out for in a contract, but you will need
to give them some help to acquaint them with data warehousing. You will also have to give them some warning if you have heard anything negative about the vendor.
Your contract people will know how to include protection in the contract to keep
the vendor from arbitrarily raising their prices.
Existing tools
Your organization most likely already has an RDBMS. Should you have to pay for
it as part of your data warehouse project? If there is a site license, there may be no
charge to your department or you may have to pay a portion of the site license.
You may have to pay if the data warehouse will be on another CPU, and if the
RDBMS is charged by CPU. You may have to pay an upgrade if the data
warehouse requires going to a larger CPU, and if there is an additional cost for the
larger CPU.
Capacity planning
Capacity planning for a data warehouse is extremely difficult because:
The actual amount of data that will be in the warehouse is very difficult to
anticipate.
The number of users will also be difficult to estimate.
The number of queries each user will run is difficult to anticipate.
The time of day and the day in the week when the queries will be run is difficult to
guess (we know there will not be an even distribution, expecting more activity at
month-end, etc.).
The nature of the queries, the number of I/Os, the internal processing is almost
impossible to estimate.
Hardware Costs
For the data warehouse, you will need CPUs, disks, networks and desktop
workstations. The hardware vendors can help size the machines and disks. Be
aware that unanticipated growth of the data, increased number of users and
increased usage will explode the hardware costs. Existing desktop workstations
may not be able to support the query tool. Do not ask the query tool vendor for the
minimum desktop configuration. Ask for the recommended configuration. Call
references to find out if and how they had to upgrade their desktop workstations.
Raw Data Multiplier
There are many debates over how much disk is needed as a multiplier of the raw
data. Besides the raw data itself, space is needed for indexes, summary tables and
working space. Additional space may be needed for replicated data that may be
required for both performance and security reasons. The actual space is very
dependent on how much is indexed and how many summary tables are needed.
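A rough back-of-the-envelope calculation can make the multiplier idea concrete. The fractions below are assumptions chosen only for illustration; as noted above, the real figures depend heavily on how much is indexed and how many summary tables are needed.

```python
# Illustrative capacity estimate (all multipliers are assumptions, not
# figures from this text): indexes, summaries and working space are
# expressed as fractions of the raw data size.
raw_gb     = 100                # raw data
index_gb   = raw_gb * 1.0       # assume indexes roughly equal raw data
summary_gb = raw_gb * 0.5       # assume summary tables at half of raw
working_gb = raw_gb * 0.5       # assume working/staging space at half of raw

total_gb = raw_gb + index_gb + summary_gb + working_gb
print(f"total: {total_gb} GB (multiplier = {total_gb / raw_gb:.1f}x raw data)")
```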
Existing Hardware
How should you account for existing hardware that can be used for the data
warehouse? It may mean you do not have to buy any additional hardware. The
Y2K testing may have required hardware that is now redundant and unused.
Should that be included in our data warehouse cost? It is a safe assumption that
your organization will need additional hardware in the future. By using the
redundant hardware for the data warehouse, it means that additional hardware for
non-data warehouse purposes must be purchased sooner. You may be able to defer
the cost of the redundant hardware; you will eventually have to pay. At the time
the hardware is purchased, it will undoubtedly be less than today’s costs.
Controlling Hardware Costs
Your ability to control hardware costs will depend primarily on whether your
organization has a chargeback system. Even though department heads are
supposed to have the best interests of the organization at heart, what they care
most about is meeting their performance objectives. These, of course, include the
costs assigned to their department. If department heads are paying for what they
get, they will be more thoughtful about asking for resources that may not be cost
justified. We had an experience with a user asking to store ten years worth of
detailed data. When he was presented with the bill (an additional $1.5 million), he
decided that two years worth of data was adequate.
Internal People Costs
These people are getting paid anyway regardless of whether we use them on this
project or not. Why should we have to include their costs in our budget? We have
to assume these people would be working on other productive projects. Otherwise,
there is no reason for the organization to keep them employed. Count on having to
include the fully burdened costs of the people on your project. Keep in mind that
you are much better off with a small team of highly skilled and dedicated workers
than with a larger team of the type of people to avoid for your project.
User Training
User training is usually done on the premises and not at a vendor site. There are
four cost areas for user training that must be considered.
The cost to engage a trainer from the outside or the time it takes for your in-house
trainer to develop and teach the class.
The facilities including the desktop workstations for the workshop.
The time the users spend away from the job being in class, and the time it takes
them to become proficient with the tool.
If not all the users are in the same location, travel expenses for either the users or
the trainer must be included.
IT Training
Generic training may be appropriate. Examples are classes in logical data
modeling, data warehouse project management or star schema database designs.
Data warehouse conferences and seminars can provide an overall perspective as
well as training in specific areas. IT will need to attend training on the complex
tools and products. IT will also need enough time to work with the products to
become proficient. The cost of training is sometimes included in the price of the
tool.
On-Going Costs
Most organizations focus on the cost to implement the initial data warehouse
application and give little thought to on-going expense. Over a period of years, the
continuing cost will very likely exceed the cost of the initial application. The data
warehouse will grow in size, in the number of users and in the number of queries
and reports. The database will not remain static. New data will be added,
sometimes more than for the initial implementation; the design will most probably change, and the database will need to be tuned. New software will be
introduced, new releases will be installed and some interfaces will have to be
rewritten. As the data warehouse grows, the hardware and network will have to be
upgraded.
Operational & Informational Data Stores
Operational Data Store (ODS)
The Operational Database is the source of information for the data warehouse. It
includes detailed information used to run the day to day operations of the business.
The data frequently changes as updates are made and reflect the current value of
the last transactions.
Operational Database Management Systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in real time.
Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.
Operational data store benefits
An ODS provides current, clean data from multiple sources in a single
place, and the benefits apply primarily to business operations.
The ODS provides a consolidated repository into which previously isolated
or inefficiently communicating IT systems can feed.
ODS reporting, which is focused on a snapshot of operational data, can be
more sophisticated than reports from individual underlying systems. The
ODS is architected to provide a consolidated view of data integrated from
multiple systems, so reports can provide a holistic perspective on
operational processes.
The up-to-date view into operational status also makes it easier for users to
diagnose problems before digging into component systems. For example, an
ODS enables service representatives to immediately find a customer order,
its status, and any troubleshooting information that might be helpful.
An ODS contains critical, time-sensitive business rules, such as those
automatically notifying a financial institution when a customer has
overdrawn an account. These rules, in aggregate, are a kind of process
automation that greatly improves efficiency, which would be impossible
without current and integrated operational data.
Informational Data Store (IDS)
There are some functions that go on within the enterprise that have to do with planning, forecasting, and managing the organization. These functions are also critical to the survival of the organization, especially in our current fast-paced world.
Functions like "marketing planning", "engineering planning", and "financial analysis" also require information systems to support them. But these functions are different from operational ones, and the types of systems and information required are also different.
Informational systems have to do with analyzing data and making decisions,
often major decisions.
Where operational data needs are normally focused upon a single area,
informational data needs often span a number of different areas and need
large amounts of related operational data.
Differences between OPERATIONAL & INFORMATIONAL Data Stores

Data Content:
  Operational: Current values, day-to-day values.
  Informational: Archived, derived, summarized, historical.

Data Structure:
  Operational: Optimized for transactions.
  Informational: Optimized for complex queries.

Access Frequency:
  Operational: High.
  Informational: Medium to low.

Access Type:
  Operational: Read, update, delete.
  Informational: Read only.

Queries:
  Operational: Predictable, repetitive.
  Informational: Ad hoc, random.

Response Time:
  Operational: Sub-second.
  Informational: Several seconds to minutes.

Kind of Users:
  Operational: Clerks, DBAs, database professionals.
  Informational: Knowledge workers, e.g. analysts, managers, executives.

Number of Users:
  Operational: Large number, thousands.
  Informational: Relatively small number, hundreds.

Usage:
  Operational: Used to run the business.
  Informational: Used to analyze the state of the business.

Focus:
  Operational: Focused on storing data.
  Informational: Focused on outputting information.

Models:
  Operational: E-R model.
  Informational: Star schema, snowflake schema, fact constellation.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases due to the following reasons −
An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP implements the multidimensional analysis of business information and supports the capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions, including Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end clients to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.
Who uses OLAP and Why?
OLAP applications are used by a variety of functions within an organization.
Finance and accounting:
Budgeting
Activity-based costing
Financial performance analysis
Financial modeling
Sales and Marketing
Sales analysis and forecasting
Market research analysis
Promotion analysis
Customer analysis
Market and customer segmentation
Production
Production planning
Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a
data model more intuitive to them than a tabular model. This model is called a
Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to
achieve using tabular models.
Characteristics of OLAP
OLAP methods can be summarized by the FASMI characteristics, a term derived from the first letters of the five characteristics below:
Fast
The system is targeted to deliver most responses to users within about five seconds, with the most elementary analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
The system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keeps it easy enough for the target user. Although some pre-programming may be needed, the system should allow the user to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program. Products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility are therefore excluded.
Share
The system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the growing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. An OLAP system must provide a multidimensional conceptual view of the data, including full support for hierarchies, as this is certainly the most logical way to analyze businesses and organizations.
Information
The system should be able to hold all the data needed by the applications. Data
sparsity should be handled in an efficient manner.
The main characteristics of OLAP are as follows:
Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. This helps in carrying out slice and dice operations.
Multi-User Support: Since OLAP systems are shared, they should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
Accessibility: OLAP acts as a mediator between data warehouses and front-end tools; the OLAP engine should sit between the data sources (e.g., data warehouses) and the OLAP front end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
OLAP provides for distinguishing between zero values and missing values so that aggregates are computed correctly; an OLAP system should ignore missing values and compute correct aggregate values (see the sketch after this list).
OLAP facilitates interactive querying and complex analysis for users.
OLAP allows users to drill down for greater detail or roll up for aggregation of metrics along a single business dimension or across multiple dimensions.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and
graphs.
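The zero-versus-missing point is easy to demonstrate. In this small Python sketch (with made-up numbers), treating a missing value as zero drags the aggregate down, while ignoring it yields the correct mean:

```python
# Why zeros and missing values must be distinguished in aggregates
# (the data values are hypothetical).
import numpy as np

sales = np.array([10.0, 0.0, np.nan, 30.0])  # 0.0 = genuine zero sale,
                                             # nan = value never recorded

print(np.nanmean(sales))              # 13.33... -> missing value ignored (correct)
print(np.mean(np.nan_to_num(sales)))  # 10.0    -> nan treated as zero (wrong)
```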
Benefits of OLAP
OLAP holds several benefits for businesses: -
OLAP helps managers in decision-making through the multidimensional views of data that it efficiently provides, thus increasing their productivity.
OLAP applications are largely self-sufficient, owing to the inherent flexibility they provide over organized databases.
It facilitates simulation of business models and problems, through extensive
management of analysis-capabilities.
In conjunction with data warehouse, OLAP can be used to support a
reduction in the application backlog, faster data retrieval, and reduction in
query drag.
Types of OLAP
There are three main types of OLAP servers, as follows:
ROLAP stands for Relational OLAP, an application based on relational DBMSs.
MOLAP stands for Multidimensional OLAP, an application based on
multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.
Relational OLAP (ROLAP) Server
These are intermediate servers which stand in between a relational back-end
server and user frontend tools.
They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This
model permits the multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as the sketch below illustrates.
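A minimal sketch of this idea, using Python's built-in sqlite3 module and an invented fact table: the slice and the dice differ from the base query only in their WHERE conditions.

```python
# ROLAP-style slicing and dicing: each operation is just an extra WHERE
# condition on the relational tables (schema and data are illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (year INT, region TEXT, product TEXT, amount REAL);
INSERT INTO fact_sales VALUES
 (2022, 'North', 'Bolt', 100), (2022, 'South', 'Nut', 150),
 (2023, 'North', 'Bolt',  90), (2023, 'South', 'Nut',  60);
""")

base = "SELECT year, region, SUM(amount) FROM fact_sales"

# Slice: fix one dimension (year = 2022) -> one WHERE condition.
print(con.execute(base + " WHERE year = 2022 GROUP BY year, region").fetchall())

# Dice: restrict several dimensions -> several WHERE conditions.
print(con.execute(base + " WHERE year = 2023 AND region = 'North'"
                         " GROUP BY year, region").fetchall())
```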
Relational OLAP Architecture
ROLAP Architecture includes the following components:
Database server.
ROLAP server.
Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market. This method allows multiple multidimensional views of two-dimensional relational tables to be created, avoiding the need to structure data around the desired view.
Some products in this segment support sophisticated SQL engines to handle the complexity of multidimensional analysis. This includes creating multiple SQL statements to handle user requests, being RDBMS-aware, and being capable of generating SQL statements tuned to the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends on the data size of the underlying RDBMS; ROLAP itself does not restrict the data amount.
Disadvantages
Performance can be slow: Since each ROLAP report is a SQL query (or multiple SQL queries) against the relational database, query time can be prolonged if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies upon developing SQL statements to query the relational database, and SQL statements do not suit all needs.
Multidimensional OLAP (MOLAP) Server
A MOLAP system is based on a native logical model that directly supports
multidimensional data and operations. Data are stored physically into
multidimensional arrays, and positional techniques are used to access them.
One of the significant distinctions of MOLAP from ROLAP is that the data is summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data is structured into proprietary formats according to clients' reporting requirements, with the calculations pre-generated on the cubes.
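The array-based storage can be sketched with NumPy. The cube, its dimensions, and its values below are all hypothetical; the point is that cells are reached by position along each dimension, and aggregations roll up along an axis with no SQL involved.

```python
# MOLAP-style storage sketch: measures live in a multidimensional array and
# are addressed positionally (dimensions and data are illustrative).
import numpy as np

years    = [2022, 2023]
regions  = ["North", "South"]
products = ["Bolt", "Nut"]

# cube[year, region, product] -> sales amount
cube = np.array([[[100.0, 150.0],
                  [ 80.0, 120.0]],
                 [[ 90.0,  60.0],
                  [ 70.0, 110.0]]])

# Positional access: sales for 2023 / South / Bolt.
print(cube[years.index(2023), regions.index("South"), products.index("Bolt")])

# Pre-aggregation: roll the product dimension up, leaving a year x region view.
print(cube.sum(axis=2))
```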
MOLAP Architecture
MOLAP Architecture includes the following components:
Database server.
MOLAP server.
Front-end tool.
The MOLAP structure primarily reads precompiled data. It has limited capabilities to dynamically create aggregations or to evaluate results that have not been pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends
are well suited for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server, Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients relate to maintaining support for multiple subject areas in an RDBMS. Some vendors solve these problems by providing access from MOLAP tools to detailed data in an RDBMS.
Advantages
Excellent Performance: A MOLAP cube is built for fast information
retrieval, and is optimal for slicing and dicing operations.
Can perform complex calculations: All calculations are pre-generated when the cube is created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all
calculations are performed when the cube is built, it is not possible to
contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary
and does not already exist in the organization. Therefore, to adopt MOLAP
technology, chances are other investments in human and capital resources
are needed.
Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems store larger quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.
Advantages of HOLAP
HOLAP provides the benefits of both MOLAP and ROLAP.
It provides fast access at all levels of aggregation.
HOLAP balances the disk space requirement, as it only stores the aggregate
information on the OLAP server and the detail record remains in the
relational database. So no duplicate copy of the detail record is maintained.
Disadvantages of HOLAP
HOLAP architecture is very complicated because it supports both MOLAP and
ROLAP servers.
Other Types
There are also less popular OLAP styles that one may stumble upon every so often. Some of the less common variants in the OLAP industry are listed below.
Web-Enabled OLAP (WOLAP) Server
WOLAP pertains to OLAP application which is accessible via the web browser.
Unlike traditional client/server OLAP applications, WOLAP is considered to have
a three-tiered architecture which consists of three components: a client, a
middleware, and a database server.
Desktop OLAP (DOLAP) Server
DOLAP permits a user to download a section of the data from the database or
source, and work with that dataset locally, or on their desktop.
Mobile OLAP Server
Mobile OLAP enables users to access and work on OLAP data and applications
remotely through the use of their mobile devices.
Spatial OLAP (SOLAP) Server
SOLAP includes the capabilities of both Geographic Information Systems (GIS)
and OLAP into a single user interface. It facilitates the management of both spatial
and non-spatial data.
Difference between ROLAP, MOLAP, and HOLAP

Full form:
  ROLAP: ROLAP stands for Relational Online Analytical Processing.
  MOLAP: MOLAP stands for Multidimensional Online Analytical Processing.
  HOLAP: HOLAP stands for Hybrid Online Analytical Processing.

Storage of aggregations:
  ROLAP: The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
  MOLAP: The MOLAP storage mode causes the aggregations of the partition, and a copy of its source data, to be stored in a multidimensional structure in Analysis Services when the partition is processed.
  HOLAP: The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance.

Storage of source data:
  ROLAP: ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.
  MOLAP: The MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data.
  HOLAP: HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.

Performance:
  ROLAP: Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP.
  MOLAP: Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition.
  HOLAP: Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.
OLTP (On-Line Transaction Processing)
OLTP (On-Line Transaction Processing) is characterized by a large number of short online transactions (INSERT, UPDATE, and DELETE). The primary emphasis of OLTP systems is very rapid query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains accurate, current records, and the schema used to store transactional data is the entity model.
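A short OLTP-style transaction can be sketched with Python's sqlite3 module. The account table and the transfer are invented for illustration; the point is the shape of the workload: a small, fast unit of work that must either commit completely or roll back.

```python
# One short OLTP transaction (schema and values are illustrative): many such
# small units of work run concurrently in a real OLTP system.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 200.0)])

with con:  # one transaction: commit on success, roll back on error
    con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    con.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")

print(con.execute("SELECT * FROM account").fetchall())  # [(1, 450.0), (2, 250.0)]
```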
Advantages of an OLTP System:
OLTP systems are user-friendly and can be used by anyone with a basic understanding.
They allow users to perform operations like reading, writing, and deleting data quickly.
They respond to user actions immediately, as they can process queries very quickly.
These systems are the original source of the data.
They help administer and run fundamental business tasks.
They help widen the customer base of an organization by simplifying individual processes.
Characteristics of OLTP
Following are important characteristics of OLTP:
OLTP uses transactions that include small amounts of data.
Indexed data in the database can be accessed easily.
OLTP has a large number of users.
It has fast, short response times.
Databases are directly accessible to end-users.
OLTP uses a fully normalized schema for database consistency.
It strictly performs only predefined operations on a small number of records.
OLTP stores the records of the last few days or a week.
It supports complex data models and tables.
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.
Difference between OLAP & OLTP

Basis for comparison | OLTP | OLAP
Basic | It is an online transactional system and manages database modification. | It is an online data retrieving and data analysis system.
Focus | Insert, update, delete information from the database. | Extract data for analysis that helps in decision making.
Data | OLTP and its transactions are the original source of data. | Different OLTP databases become the source of data for OLAP.
Transaction | OLTP has short transactions. | OLAP has long transactions.
Time | The processing time of a transaction is comparatively less in OLTP. | The processing time of a transaction is comparatively more in OLAP.
Queries | Simpler queries. | Complex queries.
Normalization | Tables in an OLTP database are normalized (3NF). | Tables in an OLAP database are not normalized.
Integrity | An OLTP database must maintain data integrity constraints. | An OLAP database is not modified frequently; hence, data integrity is not affected.
UNIT- II
Building a Data Warehouse
A data warehouse is a collection of data from different, heterogeneous data sources, organized under a unified schema. Builders should take a broad view of the anticipated use of the warehouse while constructing it, because during the design phase there is no way to anticipate all possible queries or analyses. Some characteristics of a data warehouse are:
Subject oriented
Integrated
Time Variant
Non-volatile
Building a Data Warehouse –
Some steps that are needed for building any data warehouse are as follows:
To extract the (transactional) data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft offers an excellent tool that comes free of cost when you purchase Microsoft SQL Server.
To transform the transactional data:
Companies store their data in various DBMSs, such as MS Access, MS SQL Server, Oracle, and Sybase. They also keep data in spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building the data warehouse.
To load the (transformed) data into the dimensional database:
After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together, or it may split one field into several columns. Transformation of the data can be performed at two stages: while loading the data into the dimensional model, or while extracting the data from its origins.
To purchase a front-end reporting tool:
Top-notch analytical tools from several major vendors are available in the market. Microsoft has also released its own cost-effective tool, Data Analyzer.
For the warehouse there is an acquisition of data: multiple, heterogeneous sources (for example, databases) must be used for data extraction. The data must then be formed consistently within the warehouse: names, meanings and domains of data from unrelated sources must be reconciled, and data from the various sources must be installed in the data model of the warehouse. Further tasks include:
To provide time-variant data
To store the data as per the data model of the warehouse
To purge the data
To support the updating of the warehouse data
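As a rough illustration of the extract, transform and load steps described above, here is a minimal ETL sketch in Python; the file name daily_sales.csv, its columns, and the sales_fact table are hypothetical, not taken from any particular product.

import csv
import sqlite3

# A minimal ETL sketch: extract transactional rows from a flat file,
# transform them, and load them into a dimensional table.
conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                (sale_date TEXT, product TEXT, amount REAL)""")

with open("daily_sales.csv", newline="") as f:          # extract
    for row in csv.DictReader(f):
        amount = float(row["amount"])                   # transform: fix types,
        product = row["product"].strip().upper()        # reconcile names
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     (row["date"], product, amount))    # load

conn.commit()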
Design considerations
A data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a database suited to querying and analysis. Designs must perform well under expected and worst-case query loads, and because ad-hoc analysis dominates, not every query can be anticipated during the design phase. Several design considerations should therefore be taken into account; they apply to individual data marts as well as to the complete warehouse.
Data content: Determine which data should be stored in the warehouse and at what level of detail (granularity). The warehouse should contain clean, reconciled, detailed data organized around the subjects the business analyzes.
Metadata: Plan for metadata that defines the data elements, their sources and timing, and the transformation rules applied to them, so that users can locate and interpret the warehouse data.
Data distribution: As the warehouse grows, decide how the data will be partitioned and distributed, for example by subject area, location, or time, since a single server may not be able to store and process all of it.
Tools: Select the extraction, cleansing, loading, querying, and reporting tools early, and make sure they are compatible with each other and with the warehouse's metadata.
Performance: Although query loads cannot be fully predicted in advance, the design should anticipate indexing, aggregation, and summarization strategies so that common queries run efficiently.
Implementation Considerations
i. Access Tools
Currently no single tool on the market can handle all possible data warehouse access needs. Therefore, most implementations rely on a suite of tools.
ii. Data Placement Strategies
As a data warehouse grows, there are at least two options for data placement. One is to put some of the warehouse data onto another storage medium (WORM, RAID). The second option is to distribute the data across multiple servers.
iii. Data Extraction, Cleanup, Transformation, and Migration
As components of the data warehouse architecture, proper attention must be given to data extraction, which represents a critical success factor for a data warehouse architecture.
1. The ability to identify data in the data source environments that can be read by the conversion tool is important.
2. Support for flat files (VSAM, ISAM, IDMS) is critical, since the bulk of corporate data is still maintained in this type of storage.
3. The capability to merge data from multiple data stores is required in many installations.
4. The specification interface to indicate the data to be extracted and the conversion criteria is important.
5. The ability to read information from data dictionaries or import information from repository products is desired.
iv. Metadata
A frequently occurring problem with data warehouses is communicating to the end user what information resides in the warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Metadata needs to be collected as the warehouse is designed and built.
Data Warehouse - Technical Considerations
The following technical issues need to be considered when designing and implementing a data warehouse:
1. Hardware platform for the data warehouse
2. DBMS for supporting the data warehouse
3. Communication and network infrastructure for the data warehouse
4. Software tools for building, operating and using the data warehouse
1. Hardware platforms for the data warehouse: Organizations normally like to utilize already existing platforms for data warehouse development. However, the disk storage requirements for a data warehouse are significantly large, especially in comparison with a single application. Thus, hardware with large data storage capacity is essential for data warehousing. For every data size identified, the disk space provided should be two to three times that of the data, to accommodate processing, indexing, etc.
2. DBMS for supporting the data warehouse: After hardware selection, the most important factor is DBMS selection, since it determines the speed and performance of the data warehousing environment. The requirements of a DBMS for a data warehousing environment are scalability, performance in high-volume storage and processing, and throughput under heavy query traffic. All the well-known RDBMS vendors, such as IBM, Oracle and Sybase, support parallel database processing, and some of them have improved their architectures to better suit the specialized requirements of a data warehouse.
3. Communication and network infrastructure for the data warehouse: A data warehouse can be internet (web) enabled or intranet enabled, as the choice may be. If web enabled, the networking is taken care of by the internet. If only intranet based, then an appropriate LAN operational environment should be provided so that the warehouse is accessible to all the identified users.
4. Software tools for building, operating, and using the data warehouse: Not all data warehouse vendors currently provide comprehensive single-window software tools capable of handling all aspects of a data warehousing project implementation.
The types of access and reporting are as follows:
Statistical analysis and forecasting
Data visualization, graphing, and charting
Complex textual search
Ad hoc user specific queries
Predefined repeatable queries
Reporting and analysis by drilling down
Data Pre-processing
Today's real-world databases are highly susceptible to noisy, missing and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Some of the major reasons for noisy data are:
The data collection instruments used may be faulty.
There may have been human or computer errors at data entry.
Errors in data transmission can also occur.
There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption.
Objectives of data preprocessing
Size reduction of the input space: Reducing the number of input variables or the size of the input space is a common goal of preprocessing. The objective is to get a reasonable overview of the data set without losing the most important relationships in the data. If the input space is large, one may identify the most important input variables and eliminate the unimportant ones, or combine several variables into a single variable.
Smoother relationship: Another commonly used type of preprocessing is problem transformation, where the original problem is transformed into a simpler one.
Data normalization: For many practical problems, the units used to measure the input variables can distort the data and make the range of values along some dimensions much larger than others. This results in unnecessarily complex relationships by making the nature of the mapping along some dimensions much different from others.
Noise reduction: A sequence of data may involve useful data, noisy data, and inconsistent data. Preprocessing may reduce the noisy and inconsistent data; data corrupted with noise can often be recovered with preprocessing techniques.
Feature extraction: If the key attributes or features characterizing the data can be extracted, the problem encountered can be solved more easily.
Data Summarization
Data summarization means condensing a large body of data into a short conclusion or compact result. Data summarization has great importance in data mining, since many programmers and developers now work with big data. Declaring a final result used to be difficult, but there are now many relevant tools on the market that can be used, in a program or elsewhere, to summarize data.
Why Data Summarization?
Why do we need summarization of data in the mining process? We live in a digital world where data is transferred in seconds, far faster than a human's capacity to absorb it. In the corporate field, employees work on huge volumes of data derived from different sources such as social networks, media, newspapers, books, cloud storage, etc. This can make it difficult to summarize the data, and the data volume is often unpredictable: when you retrieve data from relational sources, you cannot predict how much data will be stored in the database.
As a result, data becomes more complex and takes time to summarize. A practical solution to this problem is to retrieve data by category, that is, to apply filtration when retrieving data. The data summarization technique then yields a good-quality summary of the data, from which a customer or user can benefit in their research.
Data Preprocessing Techniques
Data Cleaning
Data cleaning is the process of cleaning raw data by handling irrelevant and missing tuples. While working on machine learning projects, the data sets we take might not be perfect: they might have many impurities and noisy values, and often actual data might be missing. The major problems we face during data cleaning are:
1. Missing Values: If it is noted that there are many tuples that have no recorded value for several attributes, then the missing values can be filled in for the attribute by the various methods described below:
Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown", or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not recommended.
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction.
Methods 3 to 6 bias the data, since the filled-in value may not be correct. Method 6, however, is a popular strategy: in comparison to the other methods, it uses the most information from the present data to predict missing values.
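As an illustration of methods 3 to 5 above, here is a small sketch using pandas; the column names price and category and their values are hypothetical sample data (method 6 would require a learned model such as a decision tree, which is omitted here).

import pandas as pd
import numpy as np

# Sample data with missing values (NaN).
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "price":    [10.0, np.nan, 30.0, np.nan],
})

df["global_const"] = df["price"].fillna(-1)                  # method 3: global constant
df["mean_fill"]    = df["price"].fillna(df["price"].mean())  # method 4: attribute mean
df["class_mean"]   = df["price"].fillna(                     # method 5: mean per class
    df.groupby("category")["price"].transform("mean"))
print(df)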
2. Noisy Data: Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can the data be "smoothed" to remove the noise? The following data smoothing techniques address this (a small sketch of binning appears after this list):
1. Binning methods: Binning methods smooth a sorted data value by consulting the "neighborhood", or values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters".
3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones.
This is much faster than having to manually search through the entire database. The garbage patterns can then be removed from the (training) database.
4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.
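Here is a minimal sketch of smoothing by bin means with equal-frequency bins; the sorted prices 4, 8, 15, 21, 21, 24, 25, 28, 34 are sample data, and the choice of three bins is arbitrary.

import numpy as np

# Smoothing by bin means: partition the sorted values into
# equal-frequency bins, then replace each value by its bin's mean.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)  # already sorted
n_bins = 3

bins = np.array_split(prices, n_bins)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]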
3. Inconsistent data: There may be inconsistencies in the data recorded for
some transactions. Some data inconsistencies may be corrected manually
using external references. For example, errors made at data entry may be
corrected by performing a paper trace. This may be coupled with routines
designed to help correct the inconsistent use of codes. Knowledge engineering
tools may also be used to detect the violation of known data constraints. For
example, known functional dependencies between attributes can be used to
find values contradicting the functional constraints.
Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (see the sketch after this list).
2. Smoothing works to remove the noise from data. Such techniques include binning, clustering, and regression.
3. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple granularities.
4. Generalization of the data, where low level or 'primitive' (raw) data are
replaced by higher level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to higher level
concepts, like city or county. Similarly, values for numeric attributes, like age,
may be mapped to higher level concepts, like young, middle-aged, and senior.
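A small sketch of min-max normalization as described in point 1 above; the age values are made-up sample data.

# Min-max normalization: scale values linearly into [new_min, new_max].
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

ages = [18, 25, 40, 60, 90]        # illustrative sample values
print(min_max(ages))               # scaled to [0.0, 1.0]
print(min_max(ages, -1.0, 1.0))    # scaled to [-1.0, 1.0]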
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Consider a concept hierarchy for
the dimension location. City values for location include Vancouver, Toronto, New
York, and Chicago. Each city, however, can be mapped to the province or state to
which it belongs. For example, Vancouver can be mapped to British Columbia,
and Chicago to Illinois. The provinces and states can in turn be mapped to the
country (e.g., Canada or the United States) to which they belong. These mappings
form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
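One simple way to represent such a hierarchy in code is as a child-to-parent mapping; the sketch below uses the location values from the text, with the remaining parents (e.g., Ontario for Toronto) added as illustrative assumptions.

# A concept hierarchy for the dimension "location", child -> parent.
hierarchy = {
    "Vancouver": "British Columbia", "Toronto": "Ontario",
    "New York": "New York State",    "Chicago": "Illinois",
    "British Columbia": "Canada",    "Ontario": "Canada",
    "New York State": "United States", "Illinois": "United States",
}

def roll_up(value, levels=1):
    # Climb from a low-level concept to a higher-level, more general one.
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value

print(roll_up("Vancouver"))            # British Columbia
print(roll_up("Chicago", levels=2))    # United States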
Patterns and Models
A pattern is an entity used to represent an abstract concept or a physical object. It may contain several attributes or features that characterize the object. Data preprocessing removes irrelevant information and extracts the key features of the data to simplify a pattern recognition problem without throwing away any important information. Patterns provide a common framework for describing problems and solutions.
Models are a cornerstone of design. Engineers build a model of a car to work out any details before putting it into production. In the same manner, system designers develop models to explore ideas and improve the understanding of the database design.
A data model is a graphical view of data created for analysis and design purposes; data modeling is designing data warehouse databases in detail. A data model can be defined as an integrated collection of concepts that describe the structure of the database, including data types, relationships between data, and constraints that should apply to the data.
A data model comprises the following three components:
1) Structural part: It consists of a set of rules according to which the database can be constructed.
2) Manipulative part: It defines the types of operations that are allowed on the data. This includes the operations that are used for updating or retrieving data from the database and for changing the structure of the database.
3) Integrity rules: Rules which ensure that the data is accurate.
Artificial Intelligence
Artificial intelligence (AI) refers to the simulation of human intelligence in
machines that are programmed to think like humans and mimic their actions. The
term may also be applied to any machine that exhibits traits associated with a
human mind such as learning and problem-solving.
The ideal characteristic of artificial intelligence is its ability to rationalize and take
actions that have the best chance of achieving a specific goal.
Application of AI
Following are some sectors which have the application of Artificial Intelligence
1. AI in Astronomy
o Artificial intelligence can be very useful for solving complex problems about the universe. AI technology can be helpful for understanding the universe: how it works, its origin, etc.
2. AI in Healthcare
o In the last five to ten years, AI has become more advantageous for the healthcare industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than humans. AI can help doctors with diagnoses and can inform them when patients are worsening, so that medical help can reach the patient before hospitalization.
3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games like chess, where the machine needs to think about a large number of possible positions.
4. AI in Finance
o The AI and finance industries are a great match for each other. The finance industry is implementing automation, chatbots, adaptive intelligence, algorithmic trading, and machine learning into financial processes.
5. AI in Data Security
o The security of data is crucial for every company, and cyber-attacks are growing very rapidly in the digital world. AI can be used to make your data more safe and secure. Tools such as the AEG bot and the AI2 platform are used to detect software bugs and cyber-attacks in a better way.
6. AI in Social Media
o Social media sites such as Facebook, Twitter, and Snapchat contain billions of user profiles, which need to be stored and managed in a very efficient way. AI can organize and manage massive amounts of data, and it can analyze lots of data to identify the latest trends, hashtags, and requirements of different users.
7. AI in Travel & Transport
o AI is in high demand in the travel industry. It is capable of doing various travel-related tasks, from making travel arrangements to suggesting hotels, flights, and the best routes to customers. Travel companies are using AI-powered chatbots which can make human-like interactions with customers for better and faster responses.
8. AI in the Automotive Industry
o Some automotive companies are using AI to provide virtual assistants to their users for better performance. For example, Tesla has introduced TeslaBot, an intelligent virtual assistant.
o Various companies are currently working on developing self-driving cars, which can make your journey safer and more secure.
9. AI in Robotics
o Artificial intelligence has a remarkable role in robotics. Usually, general robots are programmed to perform repetitive tasks, but with the help of AI we can create intelligent robots which can perform tasks from their own experience without being pre-programmed.
o Humanoid robots are the best examples of AI in robotics; recently, the intelligent humanoid robots named Erica and Sophia have been developed, which can talk and behave like humans.
10. AI in Entertainment
o We are already using some AI-based applications in our daily life through entertainment services such as Netflix and Amazon. With the help of ML/AI algorithms, these services show recommendations for programs or shows.
11. AI in Agriculture
o Agriculture is an area which requires various resources, labor, money, and time for the best results. Nowadays agriculture is becoming digital, and AI is emerging in this field. Agriculture is applying AI through agricultural robotics, soil and crop monitoring, and predictive analysis. AI in agriculture can be very helpful for farmers.
12. AI in E-commerce
o AI is providing a competitive edge to the e-commerce industry, and it is increasingly in demand in the e-commerce business. AI is helping shoppers discover associated products in their preferred size, color, or even brand.
13. AI in Education
o AI can automate grading so that tutors have more time to teach. AI chatbots can communicate with students as teaching assistants.
o In the future, AI may work as a personal virtual tutor for students, easily accessible at any time and any place.
Multi Dimensional Data Model
A multidimensional model views data in the form of a data-cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for an item may contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example,
sales. This theme is represented by a fact table. Facts are numerical measures. The
fact table contains the names of the facts or measures of the related dimensional
tables.
Logical cubes: Logical cubes are designed to organize measures that have the same shape, that is, exactly the same dimensions. Measures in the same cube have the same relationship to other logical objects and can easily be analyzed and displayed together.
Logical measures: With logical measures, the cells of the logical cube are filled with facts collected about an organization's operations or functions. The measures are organized according to the dimensions, which typically include a time dimension.
Logical dimensions: Dimensions contain a set of unique values that identify and categorize data. They represent the different views of an entity that an organization is interested in. For example, a store will create a sales data warehouse in order to keep track of the store's sales with respect to different dimensions such as time, branch and location.
Data Cube
When data is grouped or combined into multidimensional matrices, the result is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently inquired about.
For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized into a set of eight views, one for each subset of the three grouping attributes (2^3 = 8), where psc indicates a view consisting of aggregate function values (such as total sales) computed by grouping on the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping on part alone, and so on.
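The eight views correspond to the eight subsets of {part, supplier, customer}; here is a small sketch that enumerates them in the psc notation used above (the empty grouping, written "all", is the grand total).

from itertools import combinations

dims = ["part", "supplier", "customer"]

# All subsets of the grouping attributes, from psc down to the empty grouping.
cuboids = [subset
           for r in range(len(dims), -1, -1)
           for subset in combinations(dims, r)]

for c in cuboids:
    name = "".join(d[0] for d in c) or "all"
    print(name, c)   # psc, ps, pc, sc, p, s, c, all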
Example: In the 2-D representation, we look at the AllElectronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table; the 3-D data of the table can be represented as a series of 2-D tables.
Schemas For Multidimensional Data
A multidimensional schema is designed specifically to model data warehouse systems. These schemas address the unique needs of very large databases designed for analytical purposes (OLAP).
Types of Data Warehouse Schema:
Following are the 3 chief types of multidimensional schemas, each having its unique advantages:
Star Schema
Snowflake Schema
Galaxy Schema
1. Star Schema
In the STAR schema, the center of the star can have one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema is the simplest type of data warehouse schema. It is also known as the Star Join Schema and is optimized for querying large data sets.
In the following example, the fact table is at the center and contains keys to every dimension table, like Dealer_ID, Model ID, Date_ID, Product_ID and Branch_ID, plus other attributes like units sold and revenue.
Characteristics of Star Schema:
Every dimension in a star schema is represented by only one dimension table.
The dimension table contains the set of attributes.
The dimension table is joined to the fact table using a foreign key.
The dimension tables are not joined to each other.
The fact table contains keys and measures.
The star schema is easy to understand and provides optimal disk usage.
The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table, as an OLTP design would have.
The schema is widely supported by BI tools.
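A minimal star-schema sketch in SQL via Python's sqlite3 follows; the table and column names (sales_fact, dim_date, dim_product) are illustrative assumptions, loosely mirroring the example above. Note how the star join needs only a single join per dimension.

import sqlite3

# One fact table joined to dimension tables by foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

# A star join: each dimension is reached from the fact table in one join.
query = """
SELECT d.year, p.brand, SUM(f.revenue)
FROM sales_fact f
JOIN dim_date d    ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, p.brand;
"""
print(conn.execute(query).fetchall())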
2. Snowflake Schema
A SNOWFLAKE SCHEMA is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema in which the dimension tables are normalized, which splits the data into additional tables. In the following example, Country is further normalized into an individual table.
Characteristics of Snowflake Schema:
The main benefit of the snowflake schema is that it uses smaller disk space.
It is easier to implement when a dimension is added to the schema.
Due to multiple tables, query performance is reduced.
The primary challenge you will face while using the snowflake schema is that you need to perform more maintenance because of the additional lookup tables.
Star Vs Snowflake Schema: Key Differences

Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | One fact table is surrounded by dimension tables, which are in turn surrounded by dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using star-join query optimization; tables may be connected with multiple dimensions. | Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.
3. Fact Constellation Schema (Galaxy Schema)
A GALAXY SCHEMA contains two or more fact tables that share dimension tables between them. It is also called a fact constellation schema. The schema is viewed as a collection of stars, hence the name galaxy schema.
As you can see in the above example, there are two fact tables:
1. Revenue
2. Product
In a galaxy schema, shared dimensions are called conformed dimensions.
Characteristics of Galaxy Schema:
The dimensions in this schema are separated into separate dimensions based on the various levels of hierarchy.
For example, if geography has four levels of hierarchy, like region, country, state, and city, then the galaxy schema should have four dimensions.
Moreover, it is possible to build this type of schema by splitting one star schema into more star schemas.
The dimensions are large in this schema, as they need to be built based on the levels of hierarchy.
This schema is helpful for aggregating fact tables for better understanding.
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Mart
Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files
A flat file system is a system of files in which transactional data is stored; every file in the system must have a different name.
Meta Data
Metadata is a set of data that defines and gives information about other data. Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
End-User access Tools
The principal purpose of a data warehouse is to provide information to the business
managers for strategic decision-making. These customers interact with the
warehouse using end-client access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse). A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc. The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
Data Warehouse Design
A data warehouse is a single data repository where records from multiple data sources are integrated for online business analytical processing (OLAP). This implies a data warehouse needs to meet the requirements of all the business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy, and hence error-prone process. Furthermore, business analytical functions change over time, which results in changes in the requirements for the systems. Therefore, data warehouse and OLAP systems are dynamic, and the design process is continuous.
There are two approaches
1. "top-down" approach
2. "bottom-up" approach
Top-down Design Approach
In the "top-down" design approach, a data warehouse is described as a subject-oriented, time-variant, non-volatile and integrated data repository for the entire enterprise. Data from different sources are validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse. The data warehouse stores "atomic" information, the data at the lowest level of granularity, from which dimensional data marts can be built by selecting the data required for specific business subjects or particular departments. This is a data-driven approach, as the information is gathered and integrated first, and the business requirements by subjects for building data marts are formulated afterwards. The advantage of this method is that it supports a single integrated data source, so the data marts built from it will be consistent where they overlap.
Advantages of top-down design
Data Marts are loaded from the data warehouses.
Developing new data marts from the data warehouse is very easy.
Disadvantages of top-down design
This technique is inflexible to changing departmental needs.
The cost of implementing the project is high.
Bottom-Up Design Approach
In the "bottom-up" approach, a data warehouse is described as "a copy of transaction data specifically architected for query and analysis," termed the star schema. In this approach, a data mart is created first to provide the necessary reporting and analytical capabilities for particular business processes (or subjects). It is thus a business-driven approach, in contrast to Inmon's data-driven approach.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data warehouse. Also, the risk of failure is lower. This method is inherently incremental, and it allows the project team to learn and grow.
Advantages of bottom-up design
Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units; it is just a matter of developing new data marts and then integrating them with the other data marts.
Disadvantages of bottom-up design
The positions of the data warehouse and the data marts are reversed in the bottom-up design approach.
Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

Top-Down Design Approach | Bottom-Up Design Approach
Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems and integrates them into a higher one.
Inherently architected; not a union of several data marts. | Inherently incremental; can schedule essential data marts first.
Single, central storage of information about the content. | Departmental information stored.
Centralized rules and control. | Departmental rules and control.
It includes redundant information. | Redundancy can be removed.
It may see quick results if implemented with repetitions. | Less risk of failure, favorable return on investment, and proof of techniques.
Three-Tier Data Warehouse Architecture
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational
database management system. The ROLAP maps the operations on
multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly
implements the multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of data warehouse
From the architecture point of view, there are three data warehouse models:
1. Enterprise warehouse
2. Data marts
3. Virtual warehouse
Enterprise Warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems and from external information providers. It takes extensive business modeling and may take many years to design and build.
Data Marts: A data mart consists of a subset of corporate-wide data that is of value to a specific group of users. The scope is restricted to specific selected subjects. The data contained in a data mart tends to be summarized.
Virtual Warehouse: A virtual warehouse is a set of views over operational databases; it is essentially a business database. The data found in a virtual warehouse is usually drawn from multiple sources throughout a production system.
OLAP
OLAP (online analytical processing) is a computing method that enables users to
easily and selectively extract and query data in order to analyze it from different
points of view. OLAP business intelligence queries often aid in trends analysis,
financial reporting, sales forecasting, budgeting and other planning purposes.
For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period.
Working of an OLAP system
To facilitate this kind of analysis, data is collected from multiple data sources, stored in data warehouses, then cleansed and organized into data cubes.
Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time period) derived from dimension tables in the data warehouse. Dimensions are then populated by members (such as customer names, countries and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.
Basic analytical operations of OLAP
Four types of analytical operations in OLAP are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in two ways:
1. Reducing dimensions
2. Climbing up the concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
Consider the following diagram:
In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively; they become 2000 after roll-up.
In this aggregation process, data in the location hierarchy moves up from city to country.
In the roll-up process at least one dimension needs to be removed. In this example, the Quarter dimension is removed.
2) Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:
Moving down the concept hierarchy
Increasing a dimension
Consider the diagram above:
Quarter Q1 is drilled down to the months January, February, and March, and the corresponding sales are also registered.
In this example, the dimension months is added.
3) Slice:
Here, one dimension is selected, and a new sub-cube is created. The following diagram explains how the slice operation is performed:
The dimension Time is sliced with Q1 as the filter.
A new cube is created altogether.
Dice:
This operation is similar to a slice. The difference is that in dice you select two or more dimensions, which results in the creation of a sub-cube.
4) Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data. In the following example, the pivot is based on item types.
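A small sketch of roll-up, slice, dice, and pivot using pandas on a tiny sales table follows; the city totals mirror the roll-up example above (New Jersey 440 and Los Angeles 1560, totaling 2000 for the USA), while the quarterly split is made up.

import pandas as pd

sales = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA"],
    "city":    ["New Jersey", "New Jersey", "Los Angeles", "Los Angeles"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [200, 240, 700, 860],
})

# Roll-up: climb the location hierarchy from city to country
# (the quarter dimension is removed too); USA -> 2000.
rollup = sales.groupby("country")["amount"].sum()

# Slice: pick one value of one dimension (quarter = Q1) -> a sub-cube.
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Los Angeles")]

# Pivot: rotate the axes to present city x quarter.
pivot = sales.pivot_table(index="city", columns="quarter",
                          values="amount", aggfunc="sum")
print(rollup, pivot, sep="\n")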
Types of OLAP systems

OLAP Hierarchical Structure

Type of OLAP | Explanation
Relational OLAP (ROLAP) | ROLAP is an extended RDBMS along with multidimensional data mapping to perform the standard relational operations.
Multidimensional OLAP (MOLAP) | MOLAP implements operations on multidimensional data.
Hybrid Online Analytical Processing (HOLAP) | In the HOLAP approach, the aggregated totals are stored in a multidimensional database while the detailed data is stored in the relational database. This offers both the data efficiency of the ROLAP model and the performance of the MOLAP model.
Desktop OLAP (DOLAP) | In Desktop OLAP, a user downloads a part of the data from the database locally, or onto their desktop, and analyzes it. DOLAP is relatively cheap to deploy, as it offers very few functionalities compared to other OLAP systems.
Web OLAP (WOLAP) | Web OLAP is an OLAP system accessible via a web browser. WOLAP has a three-tiered architecture consisting of three components: a client, middleware, and a database server.
Mobile OLAP | Mobile OLAP helps users to access and analyze OLAP data using their mobile devices.
Spatial OLAP (SOLAP) | SOLAP is created to facilitate the management of both spatial and non-spatial data in a Geographic Information System (GIS).
ROLAP
ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
Advantages of the ROLAP model:
High data efficiency. It offers high data efficiency because query performance and access language are optimized particularly for multidimensional data analysis.
Scalability. This type of OLAP system offers scalability for managing large volumes of data, even when the data is steadily increasing.
Drawbacks of the ROLAP model:
Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
Aggregate data limitations. ROLAP tools use SQL for all calculations of aggregate data, and SQL is not well suited to every kind of complex computation.
Slow query performance. Query performance in this model is slow when compared with MOLAP.
MOLAP
MOLAP uses array-based multidimensional storage engines to display
multidimensional views of data. Basically, they use an OLAP cube.
Hybrid OLAP
Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers the fast computation of MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:
1. Aggregated or computed data is stored in a multidimensional OLAP cube.
2. Detailed information is stored in a relational database.
Benefits of Hybrid OLAP:
This kind of OLAP helps to economize disk space, and it also remains compact, which helps to avoid issues related to access speed and convenience.
HOLAP uses cube technology, which allows faster performance for all types of data.
The ROLAP side is instantly updated, and HOLAP users have access to this real-time data, while the MOLAP side brings cleaning and conversion of data, thereby improving data relevance. This brings the best of both worlds.
Drawbacks of Hybrid OLAP:
Greater complexity: the major drawback of HOLAP systems is that they must support both ROLAP and MOLAP tools and applications, which makes them very complicated.
Potential overlaps: there are higher chances of overlap, especially in their functionalities.
Advantages of OLAP
OLAP is a platform for all types of business analysis, including planning, budgeting, reporting, and analysis.
Information and calculations are consistent in an OLAP cube. This is a crucial benefit.
Quickly create and analyze "what if" scenarios.
Easily search the OLAP database for broad or specific terms.
OLAP provides the building blocks for business modeling tools, data mining tools, and performance reporting tools.
Allows users to slice and dice cube data by various dimensions, measures, and filters.
It is good for analyzing time series.
Finding clusters and outliers is easy with OLAP.
It is a powerful online analytical and visualization system which provides fast response times.
Disadvantages of OLAP
OLAP requires organizing data into a star or snowflake schema. These schemas are complicated to implement and administer.
You cannot have a large number of dimensions in a single OLAP cube.
Transactional data cannot be accessed with an OLAP system.
Any modification in an OLAP cube needs a full update of the cube. This is a time-consuming process.
OLAP software then locates the intersection of dimensions, such as all products
sold in the Eastern region above a certain price during a certain time period, and
displays them. The result is the "measure"; each OLAP cube has at least one to
perhaps hundreds of measures, which are derived from information stored in fact
tables in the data warehouse.
Types of OLAP systems
OLAP (online analytical processing) systems typically fall into one of three types:
Multidimensional OLAP (MOLAP) is OLAP that indexes directly into
a multidimensional database.
Relational OLAP (ROLAP) is OLAP that performs dynamic
multidimensional analysis of data stored in a relational database.
Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP
was developed to combine the greater data capacity of ROLAP with the
superior processing capability of MOLAP.
Indexing & Querying in OLAP
To facilitate efficient data access, most data warehouse systems support index structures and materialized views. Two indexing techniques that are popular for OLAP data are:
Bitmap Indexing
Join Indexing
1) Bitmap Indexing
The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
A bitmap index is a very efficient method for storing sparse data columns, i.e., columns which contain data values from a very small set of possibilities.
In the bitmap index for a given attribute, there is a distinct bit vector for each value v in the domain of the attribute.
If the domain of the attribute consists of n values, then n bits are needed for each entry in the bitmap index.
The length of each bit vector is equal to the number of records in the base table.
Bitmap indexing is not suitable for high-cardinality domains.
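A minimal sketch of a bitmap index on a low-cardinality column follows; the gender values are made-up sample data.

# One bit vector per distinct value; vector length = number of records.
rows = ["M", "F", "F", "M", "F"]          # base table column (sample data)

bitmap = {v: [1 if r == v else 0 for r in rows] for v in set(rows)}
# bitmap["M"] -> [1, 0, 0, 1, 0]
# bitmap["F"] -> [0, 1, 1, 0, 1]

# Selections like "gender = F AND <other predicate>" reduce to fast
# bitwise AND/OR operations over the bit vectors.
other_predicate = [1, 1, 1, 0, 0]
result = [b & m for b, m in zip(bitmap["F"], other_predicate)]
print(result)  # [0, 1, 1, 0, 0]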
2) Join Indexing
A join index is an index on the result of a join.
The join indexing method gained popularity from its use in relational database query processing.
In data warehousing, join indexing is especially useful in the star schema
model to join the records of the fact table with the corresponding dimension
table.
Consider two relations R(RID, A) and S(B, SID) that join on attributes A and B. Then the join index contains the pairs (RID, SID), where RID and SID are record identifiers from the R and S relations, as in the sketch below.
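The following short Python sketch (illustrative only; the relations are hypothetical) materializes the (RID, SID) pairs described above:

    # Hypothetical relations R(RID, A) and S(B, SID) joining on A = B.
    R = [(1, "laptop"), (2, "phone"), (3, "laptop")]   # (RID, A)
    S = [("laptop", 101), ("phone", 102)]              # (B, SID)

    # The join index precomputes the (RID, SID) pairs of joinable records.
    join_index = [(rid, sid) for rid, a in R for b, sid in S if a == b]
    print(join_index)   # [(1, 101), (2, 102), (3, 101)]

At query time, the stored pairs let the system join fact and dimension records without rescanning either table.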
Querying in OLAP
OLAP is a database technology that has been optimized for querying and reporting, rather than for processing transactions. The source data for OLAP are Online Transaction Processing (OLTP) databases, which are commonly stored in data warehouses.
OLAP is implemented in a multi-user client/server environment and offers consistently rapid responses to queries, regardless of database size and complexity.
The purpose of constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should
proceed as follows:
1. Determine which operations should be performed on the available cuboids: This involves transforming any selection, projection, roll-up, and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.
2. Determine to which materialized cuboid(s) the relevant operations should be applied: This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning this set, estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
OLAM ( ONLINE ANALYTICAL MINING)
Online analytical mining integrates online analytical processing (OLAP) with data mining. It represents a promising direction for mining large databases and data warehouses.
Importance of OLAM
OLAM is important for the following reasons −
High quality of data in data warehouses − Data mining tools need to work on integrated, consistent, and cleaned data, and these preprocessing steps are very costly. The data warehouses constructed by such preprocessing are valuable sources of high-quality data for OLAP and for data mining as well.
Available information processing infrastructure surrounding data
warehouses − Information processing infrastructure refers to accessing,
integration, consolidation, and transformation of multiple heterogeneous
databases, web-accessing and service facilities, reporting and OLAP
analysis tools.
OLAP−based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides the facility for data mining on various subsets of data and at different levels of abstraction.
Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and to swap data mining tasks dynamically.
OLAM Architecture
An OLAM engine performs analytical mining in data cubes in a manner similar to the way an OLAP engine performs online analytical processing. It is therefore suggested to have an integrated OLAM and OLAP architecture, where the OLAM and OLAP engines both accept users' online queries via a graphical user interface.
An OLAM engine can perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, and time-series analysis. It therefore usually consists of multiple, integrated data mining modules, making it more sophisticated than an OLAP engine. There is no fundamental difference between the data cube required for OLAP and that required for OLAM, although OLAM analysis might require more powerful data cube construction and accessing tools.
Efficient methods of cube computation
Data cube computation is an important task in data warehouse implementation. The precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing. However, such computation is challenging, since it may require large computational time and storage space. This section explores efficient methods for data cube computation.
Multiway Array Aggregation for full cube computation
The multiway array aggregation (or simply MultiWay) method computes a full data cube by using a multidimensional array as its basic data structure.
It is a typical MOLAP (Multidimensional Online Analytical Processing) approach that uses direct array addressing, where dimension values are accessed via the position or index of their corresponding array locations.
The following approach is used for array-based cube construction:
1. Partition the array into chunks: A chunk is a subcube that is small enough to fit into the memory available for cube computation. Chunking is a method for dividing an n-dimensional array into small n-dimensional chunks, where each chunk is stored as an object on disk. The chunks are compressed so as to remove wasted space resulting from empty array cells.
2. Compute aggregates by visiting cube cells: The order in which cells are visited can be optimized so as to minimize the number of times each cell must be revisited, thereby reducing memory access and storage costs. A simplified sketch of the chunking idea follows.
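As a simplified illustration of the chunking idea (a sketch only, not the full MultiWay algorithm; the array and chunk size are hypothetical), the following Python code aggregates a small 3-D base cuboid chunk by chunk to produce the AB cuboid:

    import numpy as np

    # Hypothetical 4x4x4 base cuboid over dimensions A, B, C, in 2x2x2 chunks.
    cube = np.arange(64, dtype=float).reshape(4, 4, 4)
    AB = np.zeros((4, 4))                      # cuboid AB: aggregate C away

    # Visit the cube one chunk at a time, so that only a chunk (not the
    # whole array) needs to be memory-resident at once.
    for i in range(0, 4, 2):
        for j in range(0, 4, 2):
            for k in range(0, 4, 2):
                chunk = cube[i:i+2, j:j+2, k:k+2]
                AB[i:i+2, j:j+2] += chunk.sum(axis=2)   # aggregate along C

    assert np.array_equal(AB, cube.sum(axis=2))          # same result as one pass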
BUC (Bottom-up Construction) : Computing Iceberg cubes from the apex
cuboid downward
BUC (bottom-up construction) is an algorithm for the computation of sparse and iceberg cubes.
Unlike MultiWay, BUC constructs the cube from the apex cuboid toward the base cuboid. This allows BUC to share data partitioning costs.
This representation of a lattice of cuboids, with the apex at the top and the base at the bottom, is commonly accepted in data warehousing. It is consistent with the notions of drill-down and roll-up.
Star cubing: computing iceberg cubes using a dynamic star tree structure
Star cubing integrates top-down and bottom-up cube computation and explores multidimensional aggregation.
It operates from a data structure called a star tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.
A key idea behind star cubing is the concept of shared dimensions.
The order of computation is from the base cuboid upward toward the apex cuboid. This order of computation is similar to that of MultiWay.
Discovery Driven Exploration of data cubes
A data cube may have a large number of cuboids, and each cuboid may contain a large number of cells. With such an extremely large space, it becomes a burden for users to just browse a cube. Tools need to be developed to assist users in intelligently exploring the huge aggregated space of a data cube.
Discovery driven exploration is such a cube exploration approach. The main
features of this approach are:
In discovery-driven exploration, precomputed measures or procedures indicating data exceptions are used to guide the user in the data analysis process, at all levels of aggregation.
In this approach, an exception is a data cube cell value that is significantly different from the expected value, based on a statistical model.
This approach considers variations and patterns in the measure's value across all of the dimensions to which a cell belongs.
Visual cues, such as background color, are used to reflect the degree of exception of each cell, based on the precomputed exception indicators.
The computation of exception indicators can be overlapped with cube
construction, so that the overall construction of data cubes for discovery
driven exploration is efficient.
Three measures are used as exception indicators to help identify data
anomalies.
1. SelfExp: - This indicates the degree of surprise of the cell value,
relative to other cells at the same levels of aggregation.
2. InExp: - This indicates the degree of surprise somewhere under the
cell, if we were to drill down from it.
3. PathExp: - This indicates the degree of surprise for each drill down
path from the cell.
Attribute Oriented Induction In Data Mining - Data
Characterization
Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989 (at the KDD '89 workshop), a few years before the introduction of the data cube approach.
The data cube approach can be considered a data-warehouse-based, precomputation-oriented, materialized approach.
It performs off-line aggregation before an OLAP or data mining query is submitted
for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational-database-query-oriented, generalization-based, on-line data analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on
online aggregation versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line
precomputation of multidimensional space can speed up attribute-oriented
induction as well.
AOI is not confined to categorical data, nor to particular measures.
Basic Principles Of Attribute Oriented Induction
Data focusing:
Analyzing task-relevant data, including dimensions; the result is the initial relation.
Attribute removal:
Remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control:
Typically 2-8; the threshold is either specified by the user or a default value.
Generalized relation threshold control (typically 10-30):
Controls the final relation/rule size.
Algorithm for Attribute Oriented Induction
InitialRel:
Query processing of the task-relevant data, deriving the initial relation.
PreGen:
Based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize.
PrimeGen:
Based on the PreGen plan, perform the generalization to the right level to derive a "prime generalized relation", accumulating the counts.
Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross-tabs, or visualization presentations. A small generalization sketch follows.
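To make the PreGen/PrimeGen steps concrete, here is a minimal Python sketch (illustrative only; the concept hierarchy and tuples are hypothetical) that generalizes one attribute up its concept hierarchy and accumulates counts:

    from collections import Counter

    # Hypothetical concept hierarchy: city -> region.
    city_to_region = {"Delhi": "North", "Mumbai": "West", "Jaipur": "North"}

    # Hypothetical task-relevant tuples: (city, course).
    tuples = [("Delhi", "BCA"), ("Mumbai", "BCA"), ("Jaipur", "MCA"), ("Delhi", "BCA")]

    # Generalize city one level up the hierarchy and accumulate counts,
    # producing a small "prime generalized relation".
    generalized = Counter((city_to_region[city], course) for city, course in tuples)
    for row, count in generalized.items():
        print(row, "count =", count)
    # ('North', 'BCA') count = 2, ('West', 'BCA') count = 1, ('North', 'MCA') count = 1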
UNIT III
Association Rule Mining
Association rule mining is a technique used to find interesting and frequent patterns in transactional, spatial, temporal, or other databases, and to establish associations or relations among patterns (also known as itemsets) in order to discover knowledge.
Association rules can be applied in various fields like network management,
catalog design, clustering, classification, marketing etc.
A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they also to buy bread (and what kind of bread) on the same trip to the supermarket? Such information can help retailers do selective marketing and plan their shelf space.
Market Basket Analysis
Suppose, as manager of a branch of HTC company, you would like to learn more about the buying habits of your customers. You may want to know which groups or sets of items customers are likely to purchase on a given trip to the store.
To answer your question, market basket analysis may be performed on the retail data of customer transactions at your store. The results may be used to plan marketing or advertising strategies, as well as catalogue design. For instance, market basket analysis may help managers design different store layouts. In one strategy, items that are frequently purchased together can be placed in close proximity in order to further encourage the sale of such items together. If customers who purchase computers also tend to buy cell phones at the same time, then placing the computer display close to the cell phone display may help increase the sales of both of these items. In an alternative strategy, placing computers and cell phones at opposite ends of the store may attract customers who purchase such items to pick up other items along the way.
Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
Association rule mining is a two-step process:
1. Find all frequent itemsets or patterns.
2. Generate strong association rules from the frequent itemsets.
Market basket analysis is just one form of association rule mining. In fact, there
are many kinds of association rules. Association rules can be classified in
various ways, based on the following criteria.
Based on the types of values handled in the rule:- If a rule concerns associations between the presence or absence of items, it is a Boolean association rule.
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals.
Based on the dimensions of data involved in the rule:- If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. Such a rule could be written as:
buys(X, "computer") => buys(X, "cell phone")
Based on the levels of abstraction involved in the rule set:- Some methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of mined association rules includes the following rules, where the second rule is expressed at a higher level of abstraction than the first:
age(X, "30...39") => buys(X, "laptop computer")
age(X, "30...39") => buys(X, "computer")
Based on various extensions to association mining:- Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified. It can also be extended to mining maxpatterns and frequent closed itemsets.
Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative approach, or level-wise search, where frequent k-itemsets are used to find (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.
Apriori Property
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (the Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Before we start the algorithm, recall the definitions of support and confidence.
Consider the following dataset; we will find its frequent itemsets and generate association rules from them.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2; if the support_count of a candidate item is less than min_support, remove that item). This gives us the itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common. Check whether all subsets of each candidate itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.) Now find the support count of these itemsets by searching the dataset.
(II) Compare each candidate's (C2) support count with the minimum support count (here min_support = 2; if the support count of a candidate itemset is less than min_support, remove that itemset). This gives us the itemset L2.
Step-3: K=3
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}. Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.) Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidates' (C3) support counts with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support, remove that itemset). This gives us the itemset L3.
Step-4: K=4
(I) Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K = 4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first two elements (items) should match.
(II) Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
We stop here because no further frequent itemsets are found. A compact implementation sketch is given below.
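To make the level-wise search concrete, here is a compact, self-contained Python sketch of Apriori. The transaction list is a hypothetical stand-in for the dataset referred to above, with min_support = 2 as in the walk-through:

    from itertools import combinations

    transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
                    {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
                    {"I1","I2","I3"}]
    min_support = 2

    def support(itemset):
        # Number of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    k = 2
    while L:
        print("L%d:" % (k - 1), sorted(sorted(s) for s in L))
        # Join step: union pairs of (k-1)-itemsets that agree on k-2 items.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = [c for c in C if support(c) >= min_support]
        k += 1

Strong rules can then be generated from each frequent itemset by keeping only those rules whose confidence (the support of the whole itemset divided by the support of the antecedent) meets a minimum confidence threshold.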
Limitations of Apriori algorithm
Computationally expensive. Even though the Apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low. An alternative solution is to reduce the number of comparisons by using advanced data structures, such as hash tables, to sort candidate itemsets more efficiently.
Spurious Associations. Analysis of large inventories would involve more
itemset configurations, and the support threshold might have to be lowered
to detect certain associations. However, lowering the support threshold
might also increase the number of spurious associations detected. To ensure
that identified associations are generalizable, they could first be distilled
from a training dataset, before having their support and confidence assessed
in a separate test dataset.
Mining Multilevel Association Rules
Multilevel association means mining the data at different levels of abstraction. For many applications, it is difficult to find strong associations among data items at low levels of abstraction due to the sparsity of data in multidimensional space.
Strong associations discovered at high concept levels may represent common-sense knowledge. However, what represents common sense to one user may seem new or novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction.
Approaches to mining multilevel Association Rules
“How can we mine multilevel association rules efficiently using concept hierarchies?” Let’s look at some approaches based on a support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working down toward the lower, more specific concept levels, until no more frequent itemsets can be found.
Uniform minimum support for all levels (referred to as uniform support):- The same minimum support threshold is used when mining at each level of abstraction.
In the figure, a minimum support threshold of 5% is used throughout (e.g., for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, whereas "desktop computer" is not.
The uniform support approach, however, has some difficulties: it is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction.
Reduced minimum support at lower levels (referred to as reduced support):- Each abstraction level has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold.
For example, in the figure, the minimum support thresholds for levels 1 and 2 are 6% and 4%, respectively. In this way, "computer", "laptop computer", and "desktop computer" are all considered frequent.
From Association mining to correlation Analysis
In association rule mining, many rules can be found using a low support threshold, yet many of the rules generated are still not interesting to the users. Unfortunately, this is especially true when mining at low support thresholds or when mining long patterns. This has been one of the major bottlenecks for the successful application of association rule mining.
Strong rules are not necessarily interesting: - Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only the user can judge whether a given rule is interesting, and this judgment, being subjective, may differ from one user to another.
However, objective interestingness measures, based on the statistics "behind" the data, can be used as one step toward the goal of removing uninteresting rules from presentation to the user.
Correlation analysis: - Correlation analysis is a good method for detecting duplicated variables in the data. Sometimes the same variable accidentally appears more than once in a dataset, because the dataset was merged from multiple sources, because the same phenomenon was measured in different units, and so on. One simple way to find such redundancies is to look at a correlation matrix, which shows all the pairwise correlations between variables (see the sketch below).
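A minimal pandas sketch (illustrative only; the dataset is hypothetical) shows how a correlation matrix exposes such a redundancy:

    import pandas as pd

    # Hypothetical data in which one variable duplicates another in different units.
    df = pd.DataFrame({
        "height_cm": [150, 160, 170, 180],
        "height_in": [59.1, 63.0, 66.9, 70.9],
        "weight_kg": [52, 61, 68, 80],
    })

    # Off-diagonal values near 1.0 flag likely duplicated variables.
    print(df.corr())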
Constraint Based Association Mining
For a given set of task-relevant data, the data mining process may uncover thousands of rules, many of which are uninteresting to the user. In constraint-based mining, mining is performed under the guidance of various kinds of constraints provided by the user. These constraints include the following:
Knowledge type constraints: These specify the type of knowledge to be
mined, such as association.
Data constraints: - These specify the set of task-relevant data.
Dimension/level constraints: - These specify the dimensions of the data, or the levels of the concept hierarchies, to be used.
Interestingness constraints: - These specify thresholds on statistical measures of rule interestingness, such as support and confidence.
Rule constraints: - These specify the form of rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes.
Introduction to Classification
Classification is the process of finding a model or function that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
A classifier assigns items to classes. In data mining, the items are typically records from a database. The classes are defined by the person doing the data mining, and each class has a class label or name. A classifier decides which class an item belongs to on the basis of the values of its attributes.
The following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to predict whether the customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data, and yes or no for the marketing data.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes
two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and
their associated class labels.
Each tuple that constitutes the training set is referred to as a training sample. These tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be
applied to the new data tuples if the accuracy is considered acceptable.
Classification by Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buys_computer; it indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction
The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner. The algorithm, given below, is a well-known decision tree induction algorithm.
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: Data partition D, which is a set of training tuples and their associated class labels; attribute_list, the set of candidate attributes; and attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes (this criterion consists of a splitting_attribute and possibly a split point).
Output: A decision tree.
Method:
1. Create a node N;
2. if the tuples in D are all of the same class C, then
return N as a leaf node labeled with class C;
3. if attribute_list is empty, then
return N as a leaf node labeled with the most common class in D;
4. apply attribute_selection_method(D, attribute_list) to find the best
splitting_criterion;
5. label node N with the splitting_criterion;
6. if the splitting_attribute is discrete-valued and multiway splits are allowed, then
attribute_list = attribute_list - splitting_attribute;
7. for each outcome j of the splitting_criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j;
if Dj is empty, then
attach a leaf labeled with the majority class (most common class) in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
8. return N;
Attribute selection measures
An attribute selection measure is a heuristic used for selecting the splitting criterion that best separates a given data partition, D, of class-labeled training tuples into individual classes. If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure.
Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split. The attribute selection
measures provide a ranking for each attribute describing the given training tuples.
The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples.
The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. This section describes three popular attribute selection measures:
I. Information gain: The information gain measure is used to select the test attribute at each node in the tree. The attribute with the highest information gain is chosen as the splitting attribute for the current node.
II. Gain ratio: The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values. The gain ratio adjusts information gain to compensate for this bias.
III. Gini index: The Gini index measures the impurity of a data partition or set of training tuples.
A short sketch of information gain is given below.
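As an illustration of the first measure, here is a short, self-contained Python sketch of information gain (the training tuples are hypothetical; this is a sketch, not a full decision tree inducer):

    import math
    from collections import Counter

    def entropy(labels):
        # Expected information (in bits) needed to classify a tuple.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted
        # entropy of the partitions induced by attribute A.
        split = 0.0
        for value in {r[attr] for r in rows}:
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            split += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - split

    # Hypothetical tuples: (age_group, student) with class label buys_computer.
    rows   = [("youth", "no"), ("youth", "yes"), ("middle", "no"), ("senior", "yes")]
    labels = ["no", "yes", "yes", "yes"]
    print(info_gain(rows, labels, 0), info_gain(rows, labels, 1))

The attribute with the larger gain would be chosen as the splitting attribute at the current node.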
UNIT – IV
Introduction to Prediction
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample or object, or to assess the value or value range of an attribute that a given object is likely to have.
In this view, classification and regression are the two major types of prediction
problems, where classification is used to predict discrete or nominal values, while
regression is used to predict continuous or ordered values.
Data prediction is a two-step process, similar to that of data classification as described earlier. However, for prediction, we drop the terminology of "class label attribute" because the attribute for which values are being predicted is continuous-valued rather than categorical. The attribute can simply be referred to as the predicted attribute.
Issues Regarding classification and prediction
The major issue is preparing the data for Classification and Prediction. Preparing
the data involves the following activities −
Data Cleaning − Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
Relevance Analysis − Databases may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
Data Transformation and reduction − The data can be transformed by
any of the following methods.
o Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
o Generalization − The data can also be transformed by generalizing it
to the higher concept. For this purpose we can use the concept
hierarchies.
Comparison of Classification and Prediction Methods
Here is the criterion for comparing the methods of Classification and Prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
Speed − This refers to the computational cost in generating and using the
classifier or predictor.
Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor can be understood.
Prediction Techniques
1. Neural Network(NN)
Neural network is a network of several interconnected units. Each of
these units consists of input/output characteristics that implement a
local computation or function.
The function could be a computation of weighted sums of the inputs, producing an output.
The output could serve as an input to other neurons in the network; this process iterates until a final output is produced.
A neural network is considered to copy the way the human brain works, in the sense that it acquires knowledge from its environment through a learning process.
There are numerous types of neural networks with the main categories
being feedforward and recurrent neural networks.
In a feedforward network, the information moves in only one direction: forward from the input neurons, through the hidden neurons, and on to the output neurons.
A recurrent network, on the other hand, is a neural network where the connections between the units form a directed cycle.
The advantages of neural networks are: first, they are capable of handling a wide range of problems; second, they have an ability to produce good results even in complicated domains; and third, they are able to handle both categorical and continuous data types.
2. Decision Tree
Decision trees are well known to be another effective prediction technique in several domains.
Decision tree models are commonly used in data mining to examine
data and induce the tree and its rules that will be used to make
predictions.
The prediction could be of categorical values, when instances are to be placed in categories or classes. The technique can also be utilized in the prediction of continuous variables, where absolute values are required.
Decision tree is developed through an iterative process of splitting
data into distinct groups, where the objective is to maximize the
distance between groups at each split.
Decision trees are known to have an advantage over several techniques due to the output they produce. The output of a decision tree is transparent, which makes it easy for users or non-technical persons to understand.
However, decision tree techniques are known to have scalability and efficiency problems.
3. Logistic Regression
Logistic regression is considered the standard statistical approach to modeling binary data.
It is classified as a member of the class of generalized linear models, a
broad set of models designed to generalize the normal linear models
to target variables of many different types.
LR has been applied to many data mining classification and prediction problems; for example, it has been used for the classification and prediction of cancer genes.
Accuracy of a classifier
The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data. Common techniques used for assessing the accuracy of a classifier are:
I. Holdout method and random sub-sampling
II. Cross validation
III. Bootstrap
Holdout method and random sub sampling
In this method, the given data are randomly partitioned into two independent
sets, a training set and a test set.
Typically , two thirds of the data are allocated to the training set, and the
remaining one third is allocated to the test set.
The training set is used to derive the model or classifier, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.
Random subsampling is a variation of the holdout method in which the holdout method is repeated K times (see the sketch below).
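A minimal Python sketch of the holdout split (the data is hypothetical):

    import random

    data = list(range(30))            # hypothetical labeled tuples
    random.shuffle(data)
    cut = (2 * len(data)) // 3        # two thirds for training
    train, test = data[:cut], data[cut:]
    # Random subsampling: repeat this split K times and average the K accuracies.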
Cross validation
Cross validation is a technique used for estimating the performance of a
predictive model or a classifier.
In cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets D1, D2, ..., Dk, each of approximately equal size.
Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions collectively serve as the training set: the first model is trained on subsets D2, ..., Dk and tested on D1; the second is trained on D1, D3, ..., Dk and tested on D2; and so on.
The cross-validation method uses sampling without replacement: there are no duplicate records in the training and test sets.
Unlike the holdout and random subsampling methods above, here each sample or object is used the same number of times for training and exactly once for testing.
For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples. A minimal sketch of the fold construction follows.
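A minimal Python sketch of k-fold partitioning (indices only; illustrative):

    def k_fold_indices(n, k):
        # Partition indices 0..n-1 into k mutually exclusive folds.
        folds = [list(range(i, n, k)) for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds for j in f if f is not test]
            yield train, test

    for train, test in k_fold_indices(10, 5):
        print("test fold:", test)   # each index lands in exactly one test fold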
Bootstrap
Bootstrap is a technique used for assessing accuracy of a classifier in data
mining.
Bootstrap method works well with small data sets.
The bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
The bootstrap sample is created via random sampling with replacement: if the original data set contains M tuples, it is sampled M times, yielding a bootstrap sample of M tuples.
This sample constitutes the training set, whereas the rest of the original data forms the test set (see the sketch below).
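A minimal Python sketch of drawing one bootstrap sample (the data is hypothetical):

    import random

    data = list(range(10))                        # hypothetical data set of M tuples
    sample = [random.choice(data) for _ in data]  # training set: M draws with replacement
    test = [x for x in data if x not in sample]   # tuples never drawn make up the test set
    # On average, about 63.2% of the original tuples appear in the sample.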
Bagging and Boosting
Bagging and boosting are two such techniques or strategies used for improving classifier and predictor accuracy. They are useful techniques for improving the predictive performance of tree models.
Bagging
One way to obtain multiple classifiers is through bagging, or bootstrap aggregating, proposed by Breiman in 1996 to improve classification by combining the outputs of classifiers that are trained using randomly generated training sets.
Bagging is a technique that can be used with many classification and regression methods to reduce the variance associated with prediction, and thereby improve the prediction process.
Boosting
Boosting, like bagging, is a committee-based approach that can be used to improve the accuracy of classification or regression methods. Unlike bagging, which uses a simple averaging of results to obtain an overall prediction, boosting uses a weighted average of results obtained from applying a prediction method to various samples.
Basically, boosting takes multiple random samples from the data and builds a classification model for each. The training set is changed based on the results of the previous models. The final classification is the class assigned most often by the models. The exact algorithms for boosting have evolved from the original, but the underlying idea is the same. Boosting has become a very popular addition to data mining packages.
While bagging combines the results of independently built classifiers and models, boosting combines classifiers and models of the same type to form a new target concept.
For example, if you were to build a model using one sample of data and then build a new model using the same algorithm but on a different sample, you might get a different result. After validating the two models, you could choose the one that best met your objectives. Even better results might be achieved if you built several models and let them vote, making a prediction based on what the majority recommended (see the sketch below).
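A minimal Python sketch of the voting idea (illustrative only; each "model" here is a trivial stand-in that predicts the majority class of its bootstrap sample):

    import random
    from collections import Counter

    labels = ["yes", "yes", "no", "yes", "no", "yes"]   # hypothetical training labels

    predictions = []
    for _ in range(5):                                  # 5 bootstrap replicates
        sample = [random.choice(labels) for _ in labels]
        predictions.append(Counter(sample).most_common(1)[0][0])

    final = Counter(predictions).most_common(1)[0][0]   # majority vote
    print(predictions, "->", final)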
Differences Between Bagging and Boosting –
1. Bagging: the simplest way of combining predictions that belong to the same type. Boosting: a way of combining predictions that belong to different types.
2. Bagging: aims to decrease variance, not bias. Boosting: aims to decrease bias, not variance.
3. Bagging: each model receives equal weight. Boosting: models are weighted according to their performance.
4. Bagging: each model is built independently. Boosting: new models are influenced by the performance of previously built models.
5. Bagging: different training data subsets are randomly drawn with replacement from the entire training dataset. Boosting: every new subset contains the elements that were misclassified by previous models.
6. Bagging: tries to solve the over-fitting problem. Boosting: tries to reduce bias.
7. Bagging: if the classifier is unstable (high variance), apply bagging. Boosting: if the classifier is stable and simple (high bias), apply boosting.
8. Bagging: example, random forest. Boosting: example, gradient boosting.
Introduction to Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
Now that we understand what clustering is, let's take a look at the types of clustering.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, in the scenario above, each customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point being in each cluster is assigned. For example, in the scenario above, each customer is assigned a probability of belonging to each of the 10 clusters of the retail store.
Strict partitioning clustering: Each object belongs to exactly one cluster.
Strict partitioning clustering with outliers: Objects can also belong to no cluster, in which case they are considered outliers.
Overlapping clustering / alternative clustering: While usually a hard clustering, objects may belong to more than one cluster.
Hierarchical clustering: Objects that belong to a child cluster also belong to the parent cluster.
Subspace clustering: While an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.
Requirements of Clustering in Data Mining
Interpretability
The result of clustering should be usable, understandable and interpretable.
Helps in dealing with messy data
Usually, the data is messy and unstructured. It cannot be analyzed quickly, and that is why the clustering of information is so significant in data mining. Grouping can give some structure to the data by organizing it into groups of similar data objects. It becomes more comfortable for the data expert to process the data and also to discover new things.
High dimensionality
Data clustering should be able to handle high-dimensional data as well as data of small size.
Arbitrary shape clusters are discovered
Arbitrary-shape clusters are detected by using clustering algorithms. Small clusters with a spherical shape can also be found.
Algorithm usability with multiple data kinds
Many different kinds of data can be used with clustering algorithms, such as binary data, categorical data, and interval-based data.
Clustering scalability
The database to deal with is usually enormous, so the algorithm should be scalable enough to handle extensive databases.
Applications of clustering
Some of the main application areas are:
Market Research: Cluster analysis is widely used in market research. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. They can characterize their customer groups based on purchasing patterns.
World Wide Web: In the study of social networks, clustering may be used to recognize communities within large groups of people.
Image processing: Clustering can be used to divide a digital image into
distinct regions for border detection or object recognition.
Crime Analysis: Cluster analysis can be used to identify areas where there
are greater incidences of particular types of crime. By identifying these
distinct areas or hot spots where a similar crime has happened over a period
of time, it is possible to manage law enforcement resources more effectively.
Education: Cluster analysis is used to identify groups of schools or
students with similar properties.
City Planning: Identifying groups of houses according to their house type,
value and geographical location.
Robotics: Clustering algorithms are used for robotic situational awareness to
track objects and detect outliers in sensor data.
Different Types of Clustering Algorithm
There are many types of clustering algorithms. The following overview lists only the most prominent examples, as there are possibly over 100 published clustering algorithms. Not all of them provide models for their clusters, and such algorithms cannot easily be categorized.
Distribution based methods
It is a clustering model in which we fit the data based on the probability that it may belong to the same distribution. The grouping is often normal (Gaussian). The Gaussian distribution is most prominent where we have a fixed number of distributions, and all incoming data is fitted into them such that the likelihood of the data is maximized. This results in the grouping shown in the figure.
This model works well on synthetic data and diversely sized clusters, but it may have problems if constraints are not used to limit the model's complexity. Furthermore, distribution-based clustering produces clusters that assume concisely defined mathematical models underlying the data, a rather strong assumption for some data distributions.
For example, the expectation-maximization algorithm, which uses multivariate normal distributions, is one popular example of this approach.
Centroid based methods
This is basically one of iterative clustering algorithm in which the clusters are
formed by the closeness of data points to the centroid of clusters. Here , the cluster
center i.e. centroid is formed such that the distance of data points is minimum with
the center. This problem is basically one of NP- Hard problem and thus solutions are
commonly approximated over a number of trials.
BCA 5th Sem (Data warehousing & Mining)
117
For Ex- K – means algorithm is one of popular example of this algorithm .
The biggest problem with this algorithm is that we need to specify K in advance. It
also has problem in clustering density based distributions.
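A minimal one-dimensional K-means sketch (illustrative only; the points are hypothetical and K = 2):

    import random

    points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
    centroids = random.sample(points, 2)

    for _ in range(10):                            # iterate: assign, then re-centre
        clusters = {0: [], 1: []}
        for p in points:
            nearest = min((0, 1), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]

    print(sorted(centroids))   # approximately [1.5, 8.5]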
Connectivity based methods
The core idea of the connectivity-based model is similar to the centroid-based model: it defines clusters on the basis of the closeness of data points. Here we work on the notion that data points that are closer to each other behave more similarly than data points that are farther apart.
It does not produce a single partitioning of the data set; instead, it provides an extensive hierarchy of clusters that merge with each other at certain distances. Here the choice of distance function is subjective. These models are very easy to interpret, but they lack scalability.
For example, hierarchical algorithms and their variants.
Density Models
In this clustering model, the data space is searched for areas of varied density of data points. The model isolates different density regions based on the densities present in the data space.
For example, DBSCAN and OPTICS.
Subspace clustering
Subspace clustering is an unsupervised learning problem that aims at grouping data points into multiple clusters so that the data points in a single cluster lie approximately on a low-dimensional linear subspace. Subspace clustering is an extension of feature selection: just as with feature selection, subspace clustering requires a search method and evaluation criteria, but in addition, subspace clustering limits the scope of the evaluation criteria. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, overlapping subspaces. Subspace clustering was originally proposed to solve a very specific computer vision problem having a union-of-subspaces structure in the data, but it has gained increasing attention in the statistics and machine learning communities. People use this tool in social networks, movie recommendation, and biological datasets. Subspace clustering raises concerns about data privacy, as many such applications involve dealing with sensitive information. Data points are assumed to be incoherent, as this protects only the differential privacy of individual features of a user rather than the entire profile of the user in the database.
There are two branches of subspace clustering based on their search strategy.
Top-down algorithms find an initial clustering in the full set of dimensions and then evaluate the subspace of each cluster.
Bottom-up approaches find dense regions in low-dimensional spaces and then combine them to form clusters.
Selecting and Using Right DM Technique
Introduction: To do effective business information processing in an enterprise, we should use the right data mining techniques. The real value of a product for a business depends on the bottom-line selection of a data mining technology.
This always translates to the bottom line: profit, increased revenue, decreased cost, or return on investment. If the technique and tool do not provide one of these four assets in a measurable way, it is unlikely anyone in your business will have time to mine their data.
Data mining needs to be more than finding interesting patterns in large databases if it is to be successfully deployed in your business.
In order to make an intelligent selection among data mining tools and technologies, it is helpful to categorize the areas where they differ. To see this more clearly, and because there is so much overlap, one of the best ways to see the valid distinctions between the algorithms is to see what is similar.
For instance, although they differ in how they accomplish them, each data mining algorithm has the following:
Model structure: The structure that defines the model.
Search: The manner in which the algorithm modifies the model over time as more data is made available.
Validation: The determination, by the algorithm, of whether a model is valid.
Data Visualization
Data visualization is used to communicate information clearly and efficiently to users through information graphics such as tables and charts. It helps users analyze a large amount of data in a simpler way. It makes complex data more accessible, understandable, and usable.
Tables are used where users need to see the pattern of a specific parameter, while
charts are used to show patterns or relationships in the data for one or more
parameters.
Tips to follow while representing data visually −
Number all diagrams
Label all diagrams
Ensure that units of measurement on axes are clearly labelled
Place any explanatory information in footnotes below the visual
Check layouts to ensure maximum clarity
Pro and Cons of Data Visualization
Here are some pros and cons to representing data visually −
Pros
It can be accessed quickly by a wider audience.
It conveys a lot of information in a small space.
It makes your report more visually appealing.
Cons
It can misrepresent information – if an incorrect visual representation is
made.
It can be distracting – if the visual data is distorted or excessively used
Importance of Data Visualization
Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large amounts of complex data is more comfortable than studying spreadsheets and reports.
Data visualization is an easy and quick way to convey concepts universally. You can experiment with a different outline by making a slight adjustment.
Data visualization has some more specialties, such as:
o Data visualization can identify areas that need improvement or
modifications.
o Data visualization can clarify which factors influence customer behavior.
o Data visualization helps you to understand which products to place where.
o Data visualization can predict sales volumes.
Data visualization tools have been essential for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easier to operate than earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.
Data visualization techniques
Depending on the data and the goal of the analysis, you can choose different data visualization techniques and configure their features. Here are the common types of visualization techniques:
Charts
The easiest way to show the development of one or several data sets is a chart.
Charts vary from bar and line charts that show the relationship between elements
over time to pie charts that demonstrate the components or proportions between the
elements of one whole.
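As a small illustration (a sketch with hypothetical numbers, using the matplotlib library), a bar chart of one data set might be produced as follows:

    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]    # hypothetical quarterly sales
    sales = [120, 150, 90, 180]

    plt.bar(quarters, sales)
    plt.xlabel("Quarter")                  # label axes clearly (see tips above)
    plt.ylabel("Sales (units)")
    plt.title("Quarterly sales")
    plt.show()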
Plots
Plots allow two or more data sets to be distributed over a 2D or even 3D space to show the relationship between these sets and the parameters on the plot. Plots also vary: scatter and bubble plots are some of the most widely used visualizations. When it comes to big data, analysts often use more complex box plots that help visualize the relationship between large volumes of data.
Maps
Maps are popular ways to visualize data used in different industries. They allow elements to be located on relevant objects and areas: geographical maps, building plans, website layouts, etc. Among the most popular map visualizations are heat maps, dot distribution maps, and cartograms.
Diagrams and matrices
Diagrams are usually used to demonstrate complex data relationships and links, and they can include various types of data in one visualization. They can be hierarchical, multidimensional, or tree-like.
The matrix is one of the advanced data visualization techniques; it helps determine the correlation between multiple constantly updating (streaming) data sets.