TRANSCRIPT
BCA 5th Sem (Data warehousing & Mining)

Syllabus of Data Warehousing & Mining
UNIT I
Introduction to Data Warehousing, The need for data warehousing, Operational & Informational Data Stores, Data Warehouse Characteristics, Data Warehouse Role & Structure, The cost of warehousing data.
Introduction to OLAP & OLTP, Difference between OLAP & OLTP, OLAP Operations.
UNIT II
Building a Data Warehouse, Design/Technical/Implementation Considerations, Data Pre-processing Overview, Data Summarization, Data Cleaning, Data Transformation, Concept Hierarchy, Structure, Patterns & Models, Artificial Intelligence (Overview).
Multidimensional Data Model, Schemas for Multidimensional Data (Star Schema, Snowflake Schema, Fact Constellation), Data Warehouse Architecture, Data Warehouse Design, OLAP Three-tier Architecture, Indexing & Querying in OLAP, OLAM, Efficient Methods of Cube Computation, Discovery-Driven Exploration of Data Cubes, Attribute-Oriented Induction.
UNIT III
Association Rule Mining, Market Basket Analysis, Apriori Algorithm, Mining Multilevel Association Rules, From Association Mining to Correlation Analysis, Constraint-Based Association Mining, Introduction to Classification, Classification by Decision Tree, Attribute Selection Measure.
UNIT IV
Introduction to Prediction Techniques, Accuracy of a Classifier, Cross-Validation, Bootstrap, Boosting, Bagging, Introduction to Clustering, Classification of Various Clustering Algorithms, Selecting and Using the Right DM Technique, Data Visualization.
INDEX
S.No Content
1 Introduction to Data Warehousing
2 Introduction to OLAP & OLTP
3 Building a Data Warehouse
4 Multidimensional Data Model
5 Three-tier Architecture
6 Association Rule Mining
7 Introduction to Prediction techniques
UNIT-I
Data Warehousing:
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on a product, a supplier, or a consumer: that historical data will not be available for analysis, because it has been overwritten by subsequent transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.
For example, in the business world, a data warehouse might incorporate customer
information from a company's point-of-sale systems (the cash registers), its
website, its mailing lists and its comment cards.
Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
There are many types of data warehouses, but these three are the most common:
Enterprise data warehouse – Provides a central repository tailored to support decision-making for the entire enterprise.
Operational Data Store – Similar to the enterprise warehouse in terms of scope, but data is refreshed in near real time and can be used for operational reporting.
Data Mart – This is a subset of a data warehouse used to support a specific region, business unit, or functional area.
Understanding a Data Warehouse
A data warehouse is a database, which is kept separate from the organization's
operational database.
There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyze
its business.
A data warehouse helps executives to organize, understand, and use their data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system helps in consolidated historical data analysis.
Benefits of a Data Warehouse
Organizations have a common goal – to make better business decisions. A data
warehouse, once implemented into your business intelligence framework, can
benefit your company in numerous ways. A data warehouse:
1. Delivers enhanced business intelligence
By having access to information from various sources from a single platform,
decision makers will no longer need to rely on limited data or their instinct.
Additionally, data warehouses can effortlessly be applied to a business’s
processes, for instance, market segmentation, sales, risk, inventory, and financial
management.
2. Saves time
A data warehouse standardizes, preserves, and stores data from distinct sources,
aiding the consolidation and integration of all the data. Since critical data is
available to all users, it allows them to make informed decisions on key aspects. In
addition, executives can query the data themselves with little to no IT support,
saving more time and money.
3. Enhances data quality and consistency
A data warehouse converts data from multiple sources into a consistent format.
Since the data from across the organization is standardized, each department will
produce results that are consistent. This will lead to more accurate data, which will
become the basis for solid decisions.
4. Generates a high Return on Investment (ROI)
Companies that invest in a data warehouse experience higher revenues and cost savings than those that haven't.
5. Provides competitive advantage
Data warehouses help organizations get a holistic view of their current standing and evaluate
opportunities and risks, thus providing companies with a competitive advantage.
6. Improves the decision-making process
Data warehousing provides better insights to decision makers by maintaining a
cohesive database of current and historical data. By transforming data into
purposeful information, decision makers can perform more functional, precise,
and reliable analysis and create more useful reports with ease.
7. Enables organizations to forecast with confidence
Data professionals can analyze business data to make market forecasts, identify
potential KPIs, and gauge predicted results, allowing key personnel to plan
accordingly.
8. Streamlines the flow of information
Data warehousing facilitates the flow of information through a network
connecting all related or non-related parties.
The Disadvantages of a Data Warehouse
Data warehouses are relational databases that act as data analysis tools,
aggregating data from multiple departments of a business into one data store.
Data warehouses are typically updated as an end-of-day batch job, rather than
being churned by real-time transactional data. Their primary benefit is giving managers better and timelier data with which to make strategic decisions for the company.
However, they have some drawbacks as well.
Extra Reporting Work
Depending on the size of the organization, a data warehouse runs the risk of imposing extra work on departments. Each type of data that's needed in the warehouse
typically has to be generated by the IT teams in each division of the business.
This can be as simple as duplicating data from an existing database, but at other
times, it involves gathering data from customers or employees that wasn't
gathered before.
Cost/Benefit Ratio
A commonly cited disadvantage of data warehousing is the cost/benefit analysis.
A data warehouse is a big IT project, and like many big IT projects, it can consume a lot of IT staff hours and budget money to produce a tool that doesn't get used often enough to justify the implementation expense. And that is before even considering the expense of maintaining the data warehouse and updating it as the business grows and adapts to the market.
Data Ownership Concerns
Data warehouses are often, but not always, Software as a Service
implementations, or cloud services applications. Your data security in this
environment is only as good as your cloud vendor. Even if implemented locally,
there are concerns about data access throughout the company. Make sure that the
people doing the analysis are individuals that your organization trusts, especially
with customers' personal data. A data warehouse that leaks customer data is a
privacy and public relations nightmare.
Data Flexibility
Data warehouses tend to have static data sets with minimal ability to "drill
down" to specific solutions. The data is imported and filtered through a schema,
and it is often days or weeks old by the time it's actually used. In addition, data
warehouses are usually subject to ad hoc queries and are thus notoriously
difficult to tune for processing speed and query speed. While the queries are
often ad hoc, the queries are limited by what data relations were set when the
aggregation was assembled.
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge sets of data that have either been compiled by computer systems or have been downloaded into the computer. In the data mining process, the computer analyzes the data and extracts useful information from it. It looks for hidden patterns within the data set and tries to predict future behavior. Data mining is primarily used to discover and indicate relationships among the data sets.
Examples of data mining are common in industry, for instance in marketing and banking. Banks use data mining to better understand market risks. It is commonly applied to credit ratings and to intelligent anti-fraud systems that analyze transactions, card transactions, purchasing patterns, and customer financial data.
Data mining aims to enable business organizations to view business behaviors, trends, and relationships that allow the business to make data-driven decisions. It is also known as Knowledge Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and machine learning systems to discover relationships within the data. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
Important features of Data Mining:
The important features of Data Mining are given below:
It utilizes the automated discovery of patterns.
It predicts the expected results.
It focuses on large data sets and databases.
It creates actionable information.
Advantages of Data Mining:
i. Market Analysis:
Data mining can predict market behavior, which helps a business make decisions. For example, it can predict who is likely to purchase what type of products.
ii. Fraud detection:
Data Mining methods can help to find which cellular phone calls, insurance
claims, credit, or debit card purchases are going to be fraudulent.
iii. Financial Market Analysis:
Data mining techniques are widely used to help model financial markets.
iv. Trend Analysis:
Analyzing the trends currently existing in the marketplace is a strategic benefit, because it helps reduce costs and align the manufacturing process with market demand.
Differences between Data Mining and Data Warehousing:

Data Mining: Data mining is the process of determining data patterns.
Data Warehousing: A data warehouse is a database system designed for analytics.

Data Mining: Data mining is generally considered the process of extracting useful data from a large set of data.
Data Warehousing: Data warehousing is the process of combining all the relevant data.

Data Mining: Business users carry out data mining with the help of engineers.
Data Warehousing: Data warehousing is carried out entirely by engineers.

Data Mining: In data mining, data is analyzed repeatedly.
Data Warehousing: In data warehousing, data is stored periodically.

Data Mining: Data mining uses pattern recognition techniques to identify patterns.
Data Warehousing: Data warehousing is the process of extracting and storing data to allow easier reporting.

Data Mining: One notable data mining application is the detection and identification of unwanted errors that occur in a system.
Data Warehousing: One advantage of the data warehouse is its ability to be updated frequently, which is why it is ideal for business users who want to stay up to date.

Data Mining: Data mining techniques are cost-efficient compared to other statistical data applications.
Data Warehousing: The responsibility of the data warehouse is to simplify every type of business data.

Data Mining: Data mining techniques are not 100 percent accurate and may lead to serious consequences in certain conditions.
Data Warehousing: In the data warehouse, there is a possibility that the data required for analysis by the company is not integrated into the warehouse, which can lead to loss of information.

Data Mining: Companies can benefit from this analytical tool by equipping themselves with suitable and accessible knowledge-based data.
Data Warehousing: A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
Difference between Data Warehouse and Data Mart

Definition:
  Data Warehouse: A Data Warehouse is a large repository of data collected from different organizations or departments within a corporation.
  Data Mart: A data mart is a subtype of a Data Warehouse. It is designed to meet the needs of a certain user group.

Usage:
  Data Warehouse: It helps to make strategic decisions.
  Data Mart: It helps to make tactical decisions for the business.

Objective:
  Data Warehouse: The main objective of a Data Warehouse is to provide an integrated environment and a coherent picture of the business at a point in time.
  Data Mart: A data mart is mostly used in a business division at the department level.

Designing:
  Data Warehouse: The designing process of a Data Warehouse is quite difficult. It may or may not use a dimensional model, but it can feed dimensional models.
  Data Mart: The designing process of a Data Mart is easy. It is built around a dimensional model using a star schema.

Data Handling:
  Data Warehouse: Data warehousing covers a large area of the corporation, which is why it takes a long time to process.
  Data Mart: Data marts are easy to use, design, and implement, as they handle only small amounts of data.

Focus:
  Data Warehouse: Data warehousing is broadly focused across all departments. It can even represent the entire company.
  Data Mart: A Data Mart is subject-oriented and is used at a department level.

Data type:
  Data Warehouse: The data stored in a Data Warehouse is always detailed when compared with a data mart.
  Data Mart: Data Marts are built for particular user groups; therefore, the data is short and limited.

Subject-area:
  Data Warehouse: Provides an integrated environment and a coherent picture of the business at a point in time.
  Data Mart: Mostly holds only one subject area; for example, sales figures.

Data storing:
  Data Warehouse: Designed to store enterprise-wide decision data, not just marketing data.
  Data Mart: Dimensional modeling and star schema design are employed to optimize the performance of the access layer.

Data design:
  Data Warehouse: Time variance and non-volatile design are strictly enforced.
  Data Mart: Mostly includes consolidated data structures to meet the subject area's query and reporting needs.

Data value:
  Data Warehouse: Read-only from the end-users' standpoint.
  Data Mart: Transaction data, regardless of grain, is fed directly from the Data Warehouse.

Scope:
  Data Warehouse: Data warehousing is more helpful, as it can bring information from any department.
  Data Mart: A data mart contains data of a specific department of a company. There may be separate data marts for sales, finance, marketing, etc. It has limited usage.

Source:
  Data Warehouse: In a Data Warehouse, data comes from many sources.
  Data Mart: In a Data Mart, data comes from very few sources.

Size:
  Data Warehouse: The size of a Data Warehouse may range from 100 GB to 1 TB+.
  Data Mart: The size of a Data Mart is less than 100 GB.

Implementation time:
  Data Warehouse: The implementation process of a Data Warehouse can extend from months to years.
  Data Mart: The implementation process of a Data Mart is restricted to a few months.
Need For Data Warehouse
1. Improving integration
An organization registers data in various systems which support the various
business processes. In order to create an overall picture of business operations,
customers, and suppliers – thus creating a single version of the truth – the data
must come together in one place and be made compatible. Both external (from the
environment) and internal data (from ERP and financial systems) should merge
into the data warehouse and then be grouped.
2. Speeding up response times
The source systems are fully optimized in order to process many small
transactions, such as orders, in a short time. Generating information about the
performance of the organization only requires a few large ‘transactions’ in which
large volumes of data are gathered and aggregated. The structure of a data
warehouse is specifically designed to quickly analyze such large volumes of (big)
data.
3. Faster and more flexible reporting
The structure of both data warehouses and data marts enables end users to report
in a flexible manner and to quickly perform interactive analysis based on various
predefined angles (dimensions). They may, for example, with a single mouse click
jump from year level, to quarter, to month level, and quickly switch between the
customer dimension and the product dimension, all while the indicator remains
fixed. In this way, end users can actually mix the data and thus quickly gain
knowledge about business operations and performance indicators.
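The kind of flexible reporting described above can be sketched in a few lines of Python with pandas. The data, column names, and the "amount" indicator below are hypothetical, invented only to show roll-up, drill-down, and dimension switching:

```python
# A minimal sketch of flexible dimensional reporting (hypothetical data).
# The DataFrame plays the role of a small fact table; groupby() switches the
# reporting angle the way a user would in a BI front end.
import pandas as pd

sales = pd.DataFrame({
    "year":     [2022, 2022, 2022, 2023, 2023, 2023],
    "quarter":  ["Q1", "Q2", "Q2", "Q1", "Q1", "Q3"],
    "customer": ["Acme", "Acme", "Birch", "Acme", "Birch", "Birch"],
    "product":  ["Bolt", "Nut", "Bolt", "Nut", "Bolt", "Nut"],
    "amount":   [100.0, 150.0, 80.0, 120.0, 90.0, 60.0],
})

# Year level: the indicator (total amount) rolled up to years.
print(sales.groupby("year")["amount"].sum())

# Drill down: the same indicator one level deeper, at quarter level.
print(sales.groupby(["year", "quarter"])["amount"].sum())

# Switch dimension: same indicator, viewed by product instead of customer.
print(sales.groupby("product")["amount"].sum())
```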
4. Recording changes to build history
Source systems don’t usually keep a history of certain data. For example, if a
customer relocates or a product moves to a different product group, the (old)
values will most likely be overwritten. This means they disappear from the system
– or at least they’re very difficult to trace back.
That’s really bad, because in order to generate reliable information, we actually
need these old values, as users sometimes want to be able to look back in time. In
other words: we want to be able to look at the organization’s performance from a
historical perspective – in accordance with the organizational structure and
product classifications of that time – instead of in the current context. A data
warehouse ensures that data changes in the source system are recorded, which
enables historical analysis.
5. Increasing data quality
Stakeholders and users frequently overestimate the quality of data in the source
systems. Unfortunately, source systems quite often contain data of poor quality.
When we use a data warehouse, we can greatly improve the data quality, either
through – where possible – correcting the data while loading or by tackling the
problem at its source.
6. Unburdening the IT department
A data warehouse and Business Intelligence tools allow employees within the
organization to create reports and perform analyses independently. However, an
organization will first have to invest in order to set up the required infrastructure
for that data warehouse and those BI tools. The following principle applies: the
better the architecture is set up and developed, the more complex reports users can
independently create. Obviously, users first need sufficient training and support,
where necessary. Yet, what we see in practice is that many of the more complex
reports end up being created by the IT department. This is mostly due to users
lacking either the time or the knowledge.
7. Increasing findability
When we create a data warehouse, we make sure that users can easily access the
meaning of data. (In the source system, these meanings are either non-existent or
poorly accessible.) With a data warehouse, users can find data more quickly, and
thus establish information and knowledge faster. All the goals of the data
warehouse serve the aims of Business Intelligence: making better decisions faster
at all levels within the organization and even across organizational boundaries.
Data Warehouse Features
The key features of a data warehouse are discussed below −
Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations, rather it focuses on
modelling and analysis of data for decision making.
Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.
Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from the
historical point of view.
Non-volatile − Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
Data Warehouse Applications
Data warehouse helps business executives to organize, analyze, and use their data
for decision making. Data warehouses are widely used in the following fields −
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Characteristics and Functions of Data warehouse
Data warehouse can be controlled when the user has a shared way of explaining
the trends that are introduced as specific subject. Below are
major characteristics of data warehouse:
Subject-oriented –
A data warehouse is always subject-oriented, as it delivers information about a theme instead of the organization's current operations. It is built around a specific theme, meaning the data warehousing process is designed to handle a specific, well-defined subject. These themes can be sales, distribution, marketing, etc.
A data warehouse never puts its emphasis only on current operations. Instead, it focuses on the modeling and analysis of data used to make decisions. It also delivers an easy and precise view of the particular theme by excluding data that is not required for decision-making.
Integrated –
This is closely related to subject orientation: the data is kept in a reliable, consistent format. Integration means establishing a shared definition with which to reconcile all similar data from the different databases. The data also has to reside in the data warehouse in a shared and generally accepted manner.
A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database. In addition, it must have reliable naming conventions, formats, and codes. Integration of the data warehouse benefits effective analysis of data. Consistency in naming conventions, attribute measures, encoding structures, etc. should be ensured. Integration allows the data warehouse to handle various subject-related data.
Time-Variant –
Data is maintained over different intervals of time, such as weekly, monthly, or annually. The time horizon of a data warehouse is much wider than that of operational systems (OLTP). The data residing in the data warehouse is associated with a specific interval of time and delivers information from a historical perspective; it contains an element of time, either explicitly or implicitly. Another feature of time-variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.
Non-Volatile –
As the name implies, the data residing in the data warehouse is permanent. Data is not erased or deleted when new data is inserted; the warehouse accumulates a very large quantity of data over time.
The data is read-only and is refreshed at particular intervals. This is beneficial for analyzing historical data and for understanding what happened. The warehouse does not need transaction processing, recovery, or concurrency control mechanisms. Operations such as delete, update, and insert that are done in an operational application are absent in the data warehouse environment. Only two types of data operations are performed in the data warehouse:
Data Loading
Data Access
Data warehouse role and structure
The purpose of the data warehouse is mostly to integrate corporate data across an organization. It contains the "single version of the truth" for the organization, carefully constructed from data stored in disparate internal and external operational databases. Data is stored at a very granular level of detail.
The data warehouse is composed of data structures populated by data extracted
from the OLTP database and transformed to fit a flatter schema. Ultimately
the warehouse structures are exposed as star schemas through views of fact and
dimension tables.
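As a purely illustrative sketch of this structure, the following Python snippet builds a tiny star schema in SQLite: one fact table, three dimension tables, and a view that exposes the star for reporting. All table and column names are assumptions made for the example, not taken from any particular product.

```python
# A minimal star schema sketch in SQLite (all names are illustrative).
# The fact table holds the measures; each dimension table describes one
# analysis angle, as in the fact and dimension tables described above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, year INT, quarter INT, month INT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);

CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,
    amount      REAL
);

-- A view exposing the star as a flat structure for reporting.
CREATE VIEW v_sales AS
SELECT d.year, d.quarter, p.category, c.region, f.quantity, f.amount
FROM fact_sales f
JOIN dim_date d     ON d.date_id     = f.date_id
JOIN dim_product p  ON p.product_id  = f.product_id
JOIN dim_customer c ON c.customer_id = f.customer_id;
""")
print("star schema created")
```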
Data Warehouse Cost
Budgeting – Without cost justification, projects will always be in jeopardy.
During future budget cycles, management will be looking for ways to reduce cost
and if there is no documented reason for completing the project, they are likely to
forget the flush of excitement that accompanied the project’s initiation.
Staffing – Without cost justification, staffing with the right people may be
difficult. By having some real dollar numbers to back up a request, the request is
more likely to be satisfied.
Prioritization – Without a cost/benefit analysis, project prioritization is difficult, and management has little with which to compare projects other than a gut feeling that one is more important than another. Without cost/benefit analysis, the line-of-business
manager with the most power is likely to get her project approved. The project
that is most important to the enterprise may never be implemented.
Controlling Costs
Costs can and must be controlled. It is the project manager who has the
responsibility for controlling costs along with the other responsibilities. Adhering
to the Project Agreement is a major start for controlling costs. The Project
Agreement specifies the data that will be in the data warehouse, the periods for
which the data is kept, the number of users and predefined queries and reports.
Any one of these factors, if not held in check, will increase the cost and possibly
the schedule of the project. A primary role of the project manager will be to
control scope creep.
Additional Support
User Support staff or the Help Desk staff will be the users’ primary contact when
there are problems. Providing adequate User Support will require more people,
and more training of those people, to answer questions and help the users through
difficult situations. The cost for the additional people, the training and possibly an
upgrade in the number and knowledge-level of the staff answering the phones
must be added into the data warehouse costs.
Consultants and Contractors
Consultant and contractor expenses can balloon a project's cost. Consultants are used to supplement the lack of experience of the project team; contractors are used to supplement the lack of skilled personnel. There are two types of
consultant/contractors:
Product specific contractors – These persons are brought in because they know
the product. They can either help or actually install the product, and they can tune
the product. They will customize the product, if it is necessary. The product-
specific consultants may either be in the employ of the tool vendor or may be
independent. An example of their services would be installing and using an ETL
tool to extract, transform and load data from your source files to the data
warehouse. In this activity they may be generating the ETL code on their own or
working with your people in this endeavor.
General data warehouse consultants – These consultants may have a specific
niche such as data modeling, performance, data mining, tool selection,
requirements gathering or project planning. They will typically be involved for a
shorter period of time than the product-specific consultant/contractor. They have
two roles that are equally important. The first is working with your people to
complete a task such as selecting a query tool or developing a project plan. The
second is the knowledge transfer to your staff so they can perform the activity the
next time on their own. Just as in the case of the product-specific
consultant/contractor, your goal is to make your staff as self-sufficient as soon as
possible.
Products
The software products that support the data warehouse can be very expensive. The
first thing to consider is which categories of tools you need. Do not bring in more
categories of products than you need. Do not try to accomplish everything with
your first implementation. Be very selective.
Hopefully, you have someone in your organization experienced in dealing with
vendors and understanding their contracts. You will be working closely with this
person. They will know the things to watch out for in a contract, but you will need
to give them some help to acquaint them with data warehousing. You will also have to give them some warning if you have heard anything negative about the vendor.
Your contract people will know how to include protection in the contract to keep
the vendor from arbitrarily raising their prices.
Existing tools
Your organization most likely already has an RDBMS. Should you have to pay for
it as part of your data warehouse project? If there is a site license, there may be no
charge to your department or you may have to pay a portion of the site license.
You may have to pay if the data warehouse will be on another CPU, and if the
RDBMS is charged by CPU. You may have to pay an upgrade if the data
warehouse requires going to a larger CPU, and if there is an additional cost for the
larger CPU.
Capacity planning
Capacity planning for a data warehouse is extremely difficult because:
The actual amount of data that will be in the warehouse is very difficult to
anticipate.
The number of users will also be difficult to estimate.
The number of queries each user will run is difficult to anticipate.
The time of day and the day in the week when the queries will be run is difficult to
guess (we know there will not be an even distribution, expecting more activity at
month-end, etc.).
The nature of the queries, the number of I/Os, the internal processing is almost
impossible to estimate.
Hardware Costs
For the data warehouse, you will need CPUs, disks, networks and desktop
workstations. The hardware vendors can help size the machines and disks. Be
aware that unanticipated growth of the data, increased number of users and
increased usage will explode the hardware costs. Existing desktop workstations
may not be able to support the query tool. Do not ask the query tool vendor for the
minimum desktop configuration. Ask for the recommended configuration. Call
references to find out if and how they had to upgrade their desktop workstations.
Raw Data Multiplier
There are many debates over how much disk is needed as a multiplier of the raw
data. Besides the raw data itself, space is needed for indexes, summary tables and
working space. Additional space may be needed for replicated data that may be
required for both performance and security reasons. The actual space is very
dependent on how much is indexed and how many summary tables are needed.
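A rough back-of-the-envelope calculation can make the multiplier idea concrete. The fractions below are assumptions chosen only for illustration; as noted above, the real figures depend heavily on how much is indexed and how many summary tables are needed.

```python
# Illustrative capacity estimate (all multipliers are assumptions, not
# figures from this text): indexes, summaries and working space are
# expressed as fractions of the raw data size.
raw_gb     = 100                # raw data
index_gb   = raw_gb * 1.0       # assume indexes roughly equal raw data
summary_gb = raw_gb * 0.5       # assume summary tables at half of raw
working_gb = raw_gb * 0.5       # assume working/staging space at half of raw

total_gb = raw_gb + index_gb + summary_gb + working_gb
print(f"total: {total_gb} GB (multiplier = {total_gb / raw_gb:.1f}x raw data)")
```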
Existing Hardware
How should you account for existing hardware that can be used for the data
warehouse? It may mean you do not have to buy any additional hardware. The
Y2K testing may have required hardware that is now redundant and unused.
Should that be included in our data warehouse cost? It is a safe assumption that
your organization will need additional hardware in the future. By using the
redundant hardware for the data warehouse, it means that additional hardware for
non-data warehouse purposes must be purchased sooner. You may be able to defer
the cost of the redundant hardware; you will eventually have to pay. At the time
the hardware is purchased, it will undoubtedly be less than today’s costs.
Controlling Hardware Costs
Your ability to control hardware costs will depend primarily on whether your
organization has a chargeback system. Even though department heads are
supposed to have the best interests of the organization at heart, what they care
most about is meeting their performance objectives. These, of course, include the
costs assigned to their department. If department heads are paying for what they
get, they will be more thoughtful about asking for resources that may not be cost
justified. We had an experience with a user asking to store ten years worth of
detailed data. When he was presented with the bill (an additional $1.5 million), he
decided that two years worth of data was adequate.
Internal People Costs
These people are getting paid anyway regardless of whether we use them on this
project or not. Why should we have to include their costs in our budget? We have
to assume these people would be working on other productive projects. Otherwise,
there is no reason for the organization to keep them employed. Count on having to
include the fully burdened costs of the people on your project. Keep in mind that
you are much better off with a small team of highly skilled and dedicated workers
than with a larger team of the type of people to avoid for your project.
User Training
User training is usually done on the premises and not at a vendor site. There are
four cost areas for user training that must be considered.
The cost to engage a trainer from the outside or the time it takes for your in-house
trainer to develop and teach the class.
The facilities including the desktop workstations for the workshop.
The time the users spend away from the job being in class, and the time it takes
them to become proficient with the tool.
If not all the users are in the same location, travel expenses for either the users or
the trainer must be included.
IT Training
Generic training may be appropriate. Examples are classes in logical data
modeling, data warehouse project management or star schema database designs.
Data warehouse conferences and seminars can provide an overall perspective as
well as training in specific areas. IT will need to attend training on the complex
tools and products. IT will also need enough time to work with the products to
become proficient. The cost of training is sometimes included in the price of the
tool.
On-Going Costs
Most organizations focus on the cost to implement the initial data warehouse
application and give little thought to on-going expense. Over a period of years, the
continuing cost will very likely exceed the cost of the initial application. The data
warehouse will grow in size, in the number of users and in the number of queries
and reports. The database will not remain static. New data will be added,
sometimes more than for the initial implementation; the design will most probably change, and the database will need to be tuned. New software will be
introduced, new releases will be installed and some interfaces will have to be
rewritten. As the data warehouse grows, the hardware and network will have to be
upgraded.
Operational & Informational Data Stores
Operational Data Store (ODS)
The Operational Database is the source of information for the data warehouse. It
includes detailed information used to run the day to day operations of the business.
The data frequently changes as updates are made and reflect the current value of
the last transactions.
Operational Database Management Systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in real time.
Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.
Operational data store benefits
An ODS provides current, clean data from multiple sources in a single
place, and the benefits apply primarily to business operations.
The ODS provides a consolidated repository into which previously isolated
or inefficiently communicating IT systems can feed.
ODS reporting, which is focused on a snapshot of operational data, can be
more sophisticated than reports from individual underlying systems. The
ODS is architected to provide a consolidated view of data integrated from
multiple systems, so reports can provide a holistic perspective on
operational processes.
The up-to-date view into operational status also makes it easier for users to
diagnose problems before digging into component systems. For example, an
ODS enables service representatives to immediately find a customer order,
its status, and any troubleshooting information that might be helpful.
An ODS contains critical, time-sensitive business rules, such as those
automatically notifying a financial institution when a customer has
overdrawn an account. These rules, in aggregate, are a kind of process
automation that greatly improves efficiency, which would be impossible
without current and integrated operational data.
Informational Data Store (IDS)
There are some functions that go on within the enterprise that have to do with planning, forecasting, and managing the organization. These functions are also critical to the survival of the organization, especially in our current fast-paced world.
Functions like "marketing planning", "engineering planning", and "financial analysis" also require information systems to support them. But these functions are different from operational ones, and the types of systems and information required are also different.
Informational systems have to do with analyzing data and making decisions,
often major decisions.
Where operational data needs are normally focused upon a single area,
informational data needs often span a number of different areas and need
large amounts of related operational data.
Differences between OPERATIONAL & INFORMATIONAL Data Stores

Data Content:
  Operational: Current values, day-to-day values.
  Informational: Archived, derived, summarized, historical.

Data Structure:
  Operational: Optimized for transactions.
  Informational: Optimized for complex queries.

Access Frequency:
  Operational: High.
  Informational: Medium to low.

Access Type:
  Operational: Read, update, delete.
  Informational: Read only.

Queries:
  Operational: Predictable, repetitive.
  Informational: Ad hoc, random.

Response Time:
  Operational: Sub-second.
  Informational: Several seconds to minutes.

Kind of Users:
  Operational: Clerks, DBAs, database professionals.
  Informational: Knowledge workers, e.g. analysts, managers, executives.

Number of Users:
  Operational: Large number, thousands.
  Informational: Relatively small number, hundreds.

Usage:
  Operational: Used to run the business.
  Informational: Used to analyze the state of the business.

Focus:
  Operational: Focused on storing data.
  Informational: Focused on outputting information.

Models:
  Operational: E-R model.
  Informational: Star schema, snowflake schema, fact constellation.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases due to the following reasons −
An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP implements the multidimensional analysis of business information and supports the capability for complex estimations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions, including Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end clients to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.
Who uses OLAP and Why?
OLAP applications are used by a variety of functions within an organization.
Finance and accounting:
Budgeting
Activity-based costing
Financial performance analysis
Financial modeling
Sales and Marketing
Sales analysis and forecasting
Market research analysis
Promotion analysis
Customer analysis
Market and customer segmentation
Production
Production planning
Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a
data model more intuitive to them than a tabular model. This model is called a
Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to
achieve using tabular models.
Characteristics of OLAP
OLAP methods can be summarized by the FASMI characteristics, a term derived from the first letters of the five characteristics below:
Fast
The system is targeted to deliver most responses to users within about five seconds, with the most elementary analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
The system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keeps it easy enough for the target user. Although some pre-programming may be needed, the system should allow the user to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program. Products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility are therefore excluded.
Share
The system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the growing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. An OLAP system must provide a multidimensional conceptual view of the data, including full support for hierarchies, as this is certainly the most logical way to analyze businesses and organizations.
Information
The system should be able to hold all the data needed by the applications. Data
sparsity should be handled in an efficient manner.
The main characteristics of OLAP are as follows:
Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. This helps in carrying out slice and dice operations.
Multi-User Support: Since OLAP systems are shared, they should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
Accessibility: OLAP acts as a mediator between data warehouses and front-end tools; the OLAP engine should sit between the data sources (e.g., data warehouses) and the OLAP front end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
OLAP provides for distinguishing between zero values and missing values so that aggregates are computed correctly; an OLAP system should ignore missing values and compute correct aggregate values (see the sketch after this list).
OLAP facilitates interactive querying and complex analysis for users.
OLAP allows users to drill down for greater detail or roll up for aggregation of metrics along a single business dimension or across multiple dimensions.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and
graphs.
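The zero-versus-missing point is easy to demonstrate. In this small Python sketch (with made-up numbers), treating a missing value as zero drags the aggregate down, while ignoring it yields the correct mean:

```python
# Why zeros and missing values must be distinguished in aggregates
# (the data values are hypothetical).
import numpy as np

sales = np.array([10.0, 0.0, np.nan, 30.0])  # 0.0 = genuine zero sale,
                                             # nan = value never recorded

print(np.nanmean(sales))              # 13.33... -> missing value ignored (correct)
print(np.mean(np.nan_to_num(sales)))  # 10.0    -> nan treated as zero (wrong)
```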
Benefits of OLAP
OLAP holds several benefits for businesses: -
OLAP helps managers in decision-making through the multidimensional views of data that it efficiently provides, thus increasing their productivity.
OLAP applications are largely self-sufficient, owing to the inherent flexibility they provide over organized databases.
It facilitates simulation of business models and problems, through extensive
management of analysis-capabilities.
In conjunction with data warehouse, OLAP can be used to support a
reduction in the application backlog, faster data retrieval, and reduction in
query drag.
Types of OLAP
There are three main types of OLAP servers, as follows:
ROLAP stands for Relational OLAP, an application based on relational DBMSs.
MOLAP stands for Multidimensional OLAP, an application based on
multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and
multidimensional techniques.
Relational OLAP (ROLAP) Server
These are intermediate servers which stand in between a relational back-end
server and user frontend tools.
They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of
aggregation navigation logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database,
where the base data and dimension tables are stored as relational tables. This
model permits the multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as the sketch below illustrates.
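A minimal sketch of this idea, using Python's built-in sqlite3 module and an invented fact table: the slice and the dice differ from the base query only in their WHERE conditions.

```python
# ROLAP-style slicing and dicing: each operation is just an extra WHERE
# condition on the relational tables (schema and data are illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales (year INT, region TEXT, product TEXT, amount REAL);
INSERT INTO fact_sales VALUES
 (2022, 'North', 'Bolt', 100), (2022, 'South', 'Nut', 150),
 (2023, 'North', 'Bolt',  90), (2023, 'South', 'Nut',  60);
""")

base = "SELECT year, region, SUM(amount) FROM fact_sales"

# Slice: fix one dimension (year = 2022) -> one WHERE condition.
print(con.execute(base + " WHERE year = 2022 GROUP BY year, region").fetchall())

# Dice: restrict several dimensions -> several WHERE conditions.
print(con.execute(base + " WHERE year = 2023 AND region = 'North'"
                         " GROUP BY year, region").fetchall())
```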
Relational OLAP Architecture
ROLAP Architecture includes the following components:
Database server.
ROLAP server.
Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market. This method allows multiple multidimensional views of two-dimensional relational tables to be created, avoiding the need to structure data around the desired view.
Some products in this segment support sophisticated SQL engines to handle the complexity of multidimensional analysis. This includes creating multiple SQL statements to handle user requests, being RDBMS-aware, and being capable of generating SQL statements tuned to the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends on the data size of the underlying RDBMS; ROLAP itself does not restrict the data amount.
Disadvantages
Performance can be slow: Since each ROLAP report is a SQL query (or multiple SQL queries) against the relational database, query time can be prolonged if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies upon developing SQL statements to query the relational database, and SQL statements do not suit all needs.
Multidimensional OLAP (MOLAP) Server
A MOLAP system is based on a native logical model that directly supports
multidimensional data and operations. Data are stored physically into
multidimensional arrays, and positional techniques are used to access them.
One of the significant distinctions of MOLAP from ROLAP is that the data is summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data is structured into proprietary formats according to clients' reporting requirements, with the calculations pre-generated on the cubes.
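The array-based storage can be sketched with NumPy. The cube, its dimensions, and its values below are all hypothetical; the point is that cells are reached by position along each dimension, and aggregations roll up along an axis with no SQL involved.

```python
# MOLAP-style storage sketch: measures live in a multidimensional array and
# are addressed positionally (dimensions and data are illustrative).
import numpy as np

years    = [2022, 2023]
regions  = ["North", "South"]
products = ["Bolt", "Nut"]

# cube[year, region, product] -> sales amount
cube = np.array([[[100.0, 150.0],
                  [ 80.0, 120.0]],
                 [[ 90.0,  60.0],
                  [ 70.0, 110.0]]])

# Positional access: sales for 2023 / South / Bolt.
print(cube[years.index(2023), regions.index("South"), products.index("Bolt")])

# Pre-aggregation: roll the product dimension up, leaving a year x region view.
print(cube.sum(axis=2))
```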
MOLAP Architecture
MOLAP Architecture includes the following components:
Database server.
MOLAP server.
Front-end tool.
The MOLAP structure primarily reads precompiled data. It has limited capabilities to dynamically create aggregations or to evaluate results that have not been pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends
are well suited for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server, Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients relate to maintaining support for multiple subject areas in an RDBMS. Some vendors solve these problems by providing access from MOLAP tools to detailed data in an RDBMS.
Advantages
Excellent Performance: A MOLAP cube is built for fast information
retrieval, and is optimal for slicing and dicing operations.
Can perform complex calculations: All calculations are pre-generated when the cube is created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all
calculations are performed when the cube is built, it is not possible to
contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary
and does not already exist in the organization. Therefore, to adopt MOLAP
technology, chances are other investments in human and capital resources
are needed.
Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems store larger quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.
Advantages of HOLAP
HOLAP provides the benefits of both MOLAP and ROLAP.
It provides fast access at all levels of aggregation.
HOLAP balances the disk space requirement, as it only stores the aggregate
information on the OLAP server and the detail record remains in the
relational database. So no duplicate copy of the detail record is maintained.
Disadvantages of HOLAP
HOLAP architecture is very complicated because it supports both MOLAP and
ROLAP servers.
Other Types
There are also less popular OLAP styles that one may stumble upon every so often. Some of the less common variants in the OLAP industry are listed below.
Web-Enabled OLAP (WOLAP) Server
WOLAP pertains to OLAP application which is accessible via the web browser.
Unlike traditional client/server OLAP applications, WOLAP is considered to have
a three-tiered architecture which consists of three components: a client, a
middleware, and a database server.
Desktop OLAP (DOLAP) Server
DOLAP permits a user to download a section of the data from the database or
source, and work with that dataset locally, or on their desktop.
Mobile OLAP Server
Mobile OLAP enables users to access and work on OLAP data and applications
remotely through the use of their mobile devices.
Spatial OLAP (SOLAP) Server
SOLAP includes the capabilities of both Geographic Information Systems (GIS)
and OLAP into a single user interface. It facilitates the management of both spatial
and non-spatial data.
Difference between ROLAP, MOLAP, and HOLAP

Full form:
  ROLAP: ROLAP stands for Relational Online Analytical Processing.
  MOLAP: MOLAP stands for Multidimensional Online Analytical Processing.
  HOLAP: HOLAP stands for Hybrid Online Analytical Processing.

Storage of aggregations:
  ROLAP: The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.
  MOLAP: The MOLAP storage mode causes the aggregations of the partition, and a copy of its source data, to be stored in a multidimensional structure in Analysis Services when the partition is processed.
  HOLAP: The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance.

Storage of source data:
  ROLAP: ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.
  MOLAP: The MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data.
  HOLAP: HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.

Performance:
  ROLAP: Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also frequently slower with ROLAP.
  MOLAP: Query response times can be reduced substantially by using aggregations. The data in the partition's MOLAP structure is only as current as the most recent processing of the partition.
  HOLAP: Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.
OLTP (On-Line Transaction Processing)
OLTP (On-Line Transaction Processing) is characterized by a large number of short online transactions (INSERT, UPDATE, and DELETE). The primary emphasis of OLTP systems is very rapid query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains accurate, current records, and the schema used to store transactional data is the entity model.
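A short OLTP-style transaction can be sketched with Python's sqlite3 module. The account table and the transfer are invented for illustration; the point is the shape of the workload: a small, fast unit of work that must either commit completely or roll back.

```python
# One short OLTP transaction (schema and values are illustrative): many such
# small units of work run concurrently in a real OLTP system.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 200.0)])

with con:  # one transaction: commit on success, roll back on error
    con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    con.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")

print(con.execute("SELECT * FROM account").fetchall())  # [(1, 450.0), (2, 250.0)]
```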
Advantages of an OLTP System:
OLTP systems are user-friendly and can be used by anyone with a basic understanding.
They allow users to perform operations like reading, writing, and deleting data quickly.
They respond to user actions immediately, as they can process queries very quickly.
These systems are the original source of the data.
They help administer and run fundamental business tasks.
They help widen the customer base of an organization by simplifying individual processes.
Characteristics of OLTP
Following are important characteristics of OLTP:
OLTP uses transactions that include small amounts of data.
Indexed data in the database can be accessed easily.
OLTP has a large number of users.
It has fast, short response times.
Databases are directly accessible to end-users.
OLTP uses a fully normalized schema for database consistency.
It strictly performs only predefined operations on a small number of records.
OLTP stores the records of the last few days or a week.
It supports complex data models and tables.
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.
Difference between OLAP & OLTP

Basis for comparison | OLTP | OLAP
Basic | It is an online transactional system and manages database modification. | It is an online data retrieving and data analysis system.
Focus | Insert, update, delete information from the database. | Extract data for analysis that helps in decision making.
Data | OLTP and its transactions are the original source of data. | Different OLTP databases become the source of data for OLAP.
Transaction | OLTP has short transactions. | OLAP has long transactions.
Time | The processing time of a transaction is comparatively less in OLTP. | The processing time of a transaction is comparatively more in OLAP.
Queries | Simpler queries. | Complex queries.
Normalization | Tables in an OLTP database are normalized (3NF). | Tables in an OLAP database are not normalized.
Integrity | An OLTP database must maintain data integrity constraints. | An OLAP database is not modified frequently; hence, data integrity is not affected.
UNIT- II
Building a Data Warehouse
A data warehouse is a collection of data from different, heterogeneous data sources, organized under a unified schema. Builders should take a broad view of the anticipated use of the warehouse while constructing it, because during the design phase there is no way to anticipate all possible queries or analyses. Some characteristics of a data warehouse are:
Subject oriented
Integrated
Time Variant
Non-volatile
Building a Data Warehouse –
Some steps that are needed for building any data warehouse are as follows:
To extract the (transactional) data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft offers an excellent tool that comes free of cost when you purchase Microsoft SQL Server.
To transform the transactional data:
Companies store their data in various DBMSs, such as MS Access, MS SQL Server, Oracle, and Sybase. They also keep data in spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building the data warehouse.
To load the (transformed) data into the dimensional database:
After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together, or it may split one field into several columns. Transformation of the data can be performed at two stages: while loading the data into the dimensional model, or while extracting the data from its origins.
To purchase a front-end reporting tool:
Top-notch analytical tools from several major vendors are available in the market. Microsoft has also released its own cost-effective tool, Data Analyzer.
For the warehouse there is an acquisition of data: multiple, heterogeneous sources (for example, databases) must be used for data extraction. The data must then be formed consistently within the warehouse: names, meanings and domains of data from unrelated sources must be reconciled, and data from the various sources must be installed in the data model of the warehouse. Further tasks include:
To provide time-variant data
To store the data as per the data model of the warehouse
To purge the data
To support the updating of the warehouse data
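As a rough illustration of the extract, transform and load steps described above, here is a minimal ETL sketch in Python; the file name daily_sales.csv, its columns, and the sales_fact table are hypothetical, not taken from any particular product.

import csv
import sqlite3

# A minimal ETL sketch: extract transactional rows from a flat file,
# transform them, and load them into a dimensional table.
conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                (sale_date TEXT, product TEXT, amount REAL)""")

with open("daily_sales.csv", newline="") as f:          # extract
    for row in csv.DictReader(f):
        amount = float(row["amount"])                   # transform: fix types,
        product = row["product"].strip().upper()        # reconcile names
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     (row["date"], product, amount))    # load

conn.commit()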
Design considerations
A data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a database suited to querying and analysis. Designs must perform well under expected and worst-case query loads, and because ad-hoc analysis dominates, not every query can be anticipated during the design phase. Several design considerations should therefore be taken into account; they apply to individual data marts as well as to the complete warehouse.
Data content: Determine which data should be stored in the warehouse and at what level of detail (granularity). The warehouse should contain clean, reconciled, detailed data organized around the subjects the business analyzes.
Metadata: Plan for metadata that defines the data elements, their sources and timing, and the transformation rules applied to them, so that users can locate and interpret the warehouse data.
Data distribution: As the warehouse grows, decide how the data will be partitioned and distributed, for example by subject area, location, or time, since a single server may not be able to store and process all of it.
Tools: Select the extraction, cleansing, loading, querying, and reporting tools early, and make sure they are compatible with each other and with the warehouse's metadata.
Performance: Although query loads cannot be fully predicted in advance, the design should anticipate indexing, aggregation, and summarization strategies so that common queries run efficiently.
Implementation Considerations
i. Access Tools
Currently no single tool on the market can handle all possible data warehouse access needs. Therefore, most implementations rely on a suite of tools.
ii. Data Placement Strategies
As a data warehouse grows, there are at least two options for data placement. One is to put some of the warehouse data onto another storage medium (WORM, RAID). The second option is to distribute the data across multiple servers.
iii. Data Extraction, Cleanup, Transformation, and Migration
As components of the data warehouse architecture, proper attention must be given to data extraction, which represents a critical success factor for a data warehouse architecture.
1. The ability to identify data in the data source environments that can be read by the conversion tool is important.
2. Support for flat files (VSAM, ISAM, IDMS) is critical, since the bulk of corporate data is still maintained in this type of storage.
3. The capability to merge data from multiple data stores is required in many installations.
4. The specification interface to indicate the data to be extracted and the conversion criteria is important.
5. The ability to read information from data dictionaries or import information from repository products is desired.
iv. Metadata
A frequently occurring problem with data warehouses is communicating to the end user what information resides in the warehouse and how it can be accessed. The key to providing users and applications with a roadmap to the information stored in the warehouse is the metadata. It can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations. Metadata needs to be collected as the warehouse is designed and built.
Data Warehouse - Technical Considerations
The following technical issues need to be considered when designing and implementing a data warehouse:
1. Hardware platform for the data warehouse
2. DBMS for supporting the data warehouse
3. Communication and network infrastructure for the data warehouse
4. Software tools for building, operating and using the data warehouse
1. Hardware platforms for the data warehouse: Organizations normally like to utilize already existing platforms for data warehouse development. However, the disk storage requirements for a data warehouse are significantly large, especially in comparison with a single application. Thus, hardware with large data storage capacity is essential for data warehousing. For every data size identified, the disk space provided should be two to three times that of the data, to accommodate processing, indexing, etc.
2. DBMS for supporting the data warehouse: After hardware selection, the most important factor is DBMS selection, since it determines the speed and performance of the data warehousing environment. The requirements of a DBMS for a data warehousing environment are scalability, performance in high-volume storage and processing, and throughput under heavy query traffic. All the well-known RDBMS vendors, such as IBM, Oracle and Sybase, support parallel database processing, and some of them have improved their architectures to better suit the specialized requirements of a data warehouse.
3. Communication and network infrastructure for the data warehouse: A data warehouse can be internet (web) enabled or intranet enabled, as the choice may be. If web enabled, the networking is taken care of by the internet. If only intranet based, then an appropriate LAN operational environment should be provided so that the warehouse is accessible to all the identified users.
4. Software tools for building, operating, and using the data warehouse: Not all data warehouse vendors currently provide comprehensive single-window software tools capable of handling all aspects of a data warehousing project implementation.
The types of access and reporting are as follows:
Statistical analysis and forecasting
Data visualization, graphing, and charting
Complex textual search
Ad hoc user specific queries
Predefined repeatable queries
Reporting and analysis by drilling down
Data Pre-processing
Today's real-world databases are highly susceptible to noisy, missing and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Some of the major reasons for noisy data are:
The data collection instruments used may be faulty.
There may have been human or computer errors at data entry.
Errors in data transmission can also occur.
There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption.
Objectives of data preprocessing
Size reduction of the input space: Reducing the number of input variables or the size of the input space is a common goal of preprocessing. The objective is to get a reasonable overview of the data set without losing the most important relationships in the data. If the input space is large, one may identify the most important input variables and eliminate the unimportant ones, or combine several variables into a single variable.
Smoother relationship: Another commonly used type of preprocessing is problem transformation, where the original problem is transformed into a simpler one.
Data normalization: For many practical problems, the units used to measure the input variables can distort the data and make the range of values along some dimensions much larger than others. This results in unnecessarily complex relationships by making the nature of the mapping along some dimensions much different from others.
Noise reduction: A sequence of data may involve useful data, noisy data, and inconsistent data. Preprocessing may reduce the noisy and inconsistent data; data corrupted with noise can often be recovered with preprocessing techniques.
Feature extraction: If the key attributes or features characterizing the data can be extracted, the problem encountered can be solved more easily.
Data Summarization
Data summarization means condensing a large body of data into a short conclusion or compact result. Data summarization has great importance in data mining, since many programmers and developers now work with big data. Declaring a final result used to be difficult, but there are now many relevant tools on the market that can be used, in a program or elsewhere, to summarize data.
Why Data Summarization?
Why do we need summarization of data in the mining process? We live in a digital world where data is transferred in seconds, far faster than a human's capacity to absorb it. In the corporate field, employees work on huge volumes of data derived from different sources such as social networks, media, newspapers, books, cloud storage, etc. This can make it difficult to summarize the data, and the data volume is often unpredictable: when you retrieve data from relational sources, you cannot predict how much data will be stored in the database.
As a result, data becomes more complex and takes time to summarize. A practical solution to this problem is to retrieve data by category, that is, to apply filtration when retrieving data. The data summarization technique then yields a good-quality summary of the data, from which a customer or user can benefit in their research.
Data Preprocessing Techniques
Data Cleaning
Data cleaning is the process of cleaning raw data by handling irrelevant and missing tuples. While working on machine learning projects, the data sets we take might not be perfect: they might have many impurities and noisy values, and often actual data might be missing. The major problems we face during data cleaning are:
1. Missing Values: If it is noted that there are many tuples that have no recorded value for several attributes, then the missing values can be filled in for the attribute by the various methods described below:
Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown", or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown". Hence, although this method is simple, it is not recommended.
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction.
Methods 3 to 6 bias the data, since the filled-in value may not be correct. Method 6, however, is a popular strategy: in comparison to the other methods, it uses the most information from the present data to predict missing values.
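As an illustration of methods 3 to 5 above, here is a small sketch using pandas; the column names price and category and their values are hypothetical sample data (method 6 would require a learned model such as a decision tree, which is omitted here).

import pandas as pd
import numpy as np

# Sample data with missing values (NaN).
df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "price":    [10.0, np.nan, 30.0, np.nan],
})

df["global_const"] = df["price"].fillna(-1)                  # method 3: global constant
df["mean_fill"]    = df["price"].fillna(df["price"].mean())  # method 4: attribute mean
df["class_mean"]   = df["price"].fillna(                     # method 5: mean per class
    df.groupby("category")["price"].transform("mean"))
print(df)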
2. Noisy Data: Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can the data be "smoothed" to remove the noise? The following data smoothing techniques address this (a small sketch of binning appears after this list):
1. Binning methods: Binning methods smooth a sorted data value by consulting the "neighborhood", or values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters".
3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones.
This is much faster than having to manually search through the entire database. The garbage patterns can then be removed from the (training) database.
4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.
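Here is a minimal sketch of smoothing by bin means with equal-frequency bins; the sorted prices 4, 8, 15, 21, 21, 24, 25, 28, 34 are sample data, and the choice of three bins is arbitrary.

import numpy as np

# Smoothing by bin means: partition the sorted values into
# equal-frequency bins, then replace each value by its bin's mean.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)  # already sorted
n_bins = 3

bins = np.array_split(prices, n_bins)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]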
3. Inconsistent data: There may be inconsistencies in the data recorded for
some transactions. Some data inconsistencies may be corrected manually
using external references. For example, errors made at data entry may be
corrected by performing a paper trace. This may be coupled with routines
designed to help correct the inconsistent use of codes. Knowledge engineering
tools may also be used to detect the violation of known data constraints. For
example, known functional dependencies between attributes can be used to
find values contradicting the functional constraints.
Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (see the sketch after this list).
2. Smoothing works to remove the noise from data. Such techniques include binning, clustering, and regression.
3. Aggregation, where summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple granularities.
4. Generalization of the data, where low level or 'primitive' (raw) data are
replaced by higher level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to higher level
concepts, like city or county. Similarly, values for numeric attributes, like age,
may be mapped to higher level concepts, like young, middle-aged, and senior.
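A small sketch of min-max normalization as described in point 1 above; the age values are made-up sample data.

# Min-max normalization: scale values linearly into [new_min, new_max].
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

ages = [18, 25, 40, 60, 90]        # illustrative sample values
print(min_max(ages))               # scaled to [0.0, 1.0]
print(min_max(ages, -1.0, 1.0))    # scaled to [-1.0, 1.0]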
Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Consider a concept hierarchy for
the dimension location. City values for location include Vancouver, Toronto, New
York, and Chicago. Each city, however, can be mapped to the province or state to
which it belongs. For example, Vancouver can be mapped to British Columbia,
and Chicago to Illinois. The provinces and states can in turn be mapped to the
country (e.g., Canada or the United States) to which they belong. These mappings
form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
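One simple way to represent such a hierarchy in code is as a child-to-parent mapping; the sketch below uses the location values from the text, with the remaining parents (e.g., Ontario for Toronto) added as illustrative assumptions.

# A concept hierarchy for the dimension "location", child -> parent.
hierarchy = {
    "Vancouver": "British Columbia", "Toronto": "Ontario",
    "New York": "New York State",    "Chicago": "Illinois",
    "British Columbia": "Canada",    "Ontario": "Canada",
    "New York State": "United States", "Illinois": "United States",
}

def roll_up(value, levels=1):
    # Climb from a low-level concept to a higher-level, more general one.
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value

print(roll_up("Vancouver"))            # British Columbia
print(roll_up("Chicago", levels=2))    # United States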
Patterns and Models
A pattern is an entity used to represent an abstract concept or a physical object. It may contain several attributes or features that characterize the object. Data preprocessing removes irrelevant information and extracts the key features of the data to simplify a pattern recognition problem without throwing away any important information. Patterns provide a common framework for describing problems and solutions.
Models are a cornerstone of design. Engineers build a model of a car to work out any details before putting it into production. In the same manner, system designers develop models to explore ideas and improve the understanding of the database design.
A data model is a graphical view of data created for analysis and design purposes; data modeling is designing data warehouse databases in detail. A data model can be defined as an integrated collection of concepts that describe the structure of the database, including data types, relationships between data, and constraints that should apply to the data.
A data model comprises the following three components:
1) Structural part: It consists of a set of rules according to which the database can be constructed.
2) Manipulative part: It defines the types of operations that are allowed on the data. This includes the operations that are used for updating or retrieving data from the database and for changing the structure of the database.
3) Integrity rules: Rules which ensure that the data is accurate.
Artificial Intelligence
Artificial intelligence (AI) refers to the simulation of human intelligence in
machines that are programmed to think like humans and mimic their actions. The
term may also be applied to any machine that exhibits traits associated with a
human mind such as learning and problem-solving.
The ideal characteristic of artificial intelligence is its ability to rationalize and take
actions that have the best chance of achieving a specific goal.
Application of AI
Following are some sectors which have the application of Artificial Intelligence
1. AI in Astronomy
o Artificial intelligence can be very useful for solving complex problems about the universe. AI technology can be helpful for understanding the universe: how it works, its origin, etc.
2. AI in Healthcare
o In the last five to ten years, AI has become more advantageous for the healthcare industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than humans. AI can help doctors with diagnoses and can inform them when patients are worsening, so that medical help can reach the patient before hospitalization.
3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games like chess, where the machine needs to think about a large number of possible positions.
4. AI in Finance
o The AI and finance industries are a great match for each other. The finance industry is implementing automation, chatbots, adaptive intelligence, algorithmic trading, and machine learning into financial processes.
5. AI in Data Security
o The security of data is crucial for every company, and cyber-attacks are growing very rapidly in the digital world. AI can be used to make your data more safe and secure. Tools such as the AEG bot and the AI2 platform are used to detect software bugs and cyber-attacks in a better way.
6. AI in Social Media
o Social media sites such as Facebook, Twitter, and Snapchat contain billions of user profiles, which need to be stored and managed in a very efficient way. AI can organize and manage massive amounts of data, and it can analyze lots of data to identify the latest trends, hashtags, and requirements of different users.
7. AI in Travel & Transport
o AI is in high demand in the travel industry. It is capable of doing various travel-related tasks, from making travel arrangements to suggesting hotels, flights, and the best routes to customers. Travel companies are using AI-powered chatbots which can make human-like interactions with customers for better and faster responses.
8. AI in the Automotive Industry
o Some automotive companies are using AI to provide virtual assistants to their users for better performance. For example, Tesla has introduced TeslaBot, an intelligent virtual assistant.
o Various companies are currently working on developing self-driving cars, which can make your journey safer and more secure.
9. AI in Robotics
o Artificial intelligence has a remarkable role in robotics. Usually, general robots are programmed to perform repetitive tasks, but with the help of AI we can create intelligent robots which can perform tasks from their own experience without being pre-programmed.
o Humanoid robots are the best examples of AI in robotics; recently, the intelligent humanoid robots named Erica and Sophia have been developed, which can talk and behave like humans.
10. AI in Entertainment
o We are already using some AI-based applications in our daily life through entertainment services such as Netflix and Amazon. With the help of ML/AI algorithms, these services show recommendations for programs or shows.
11. AI in Agriculture
o Agriculture is an area which requires various resources, labor, money, and time for the best results. Nowadays agriculture is becoming digital, and AI is emerging in this field. Agriculture is applying AI through agricultural robotics, soil and crop monitoring, and predictive analysis. AI in agriculture can be very helpful for farmers.
12. AI in E-commerce
o AI is providing a competitive edge to the e-commerce industry, and it is increasingly in demand in the e-commerce business. AI is helping shoppers discover associated products in their preferred size, color, or even brand.
13. AI in Education
o AI can automate grading so that tutors have more time to teach. AI chatbots can communicate with students as teaching assistants.
o In the future, AI may work as a personal virtual tutor for students, easily accessible at any time and any place.
Multi Dimensional Data Model
A multidimensional model views data in the form of a data-cube. A data cube
enables data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further. For example, a dimension table for an item may contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example,
sales. This theme is represented by a fact table. Facts are numerical measures. The
fact table contains the names of the facts or measures of the related dimensional
tables.
Logical cubes: Logical cubes are designed to organize measures that have the same shape, that is, exactly the same dimensions. Measures in the same cube have the same relationship to other logical objects and can easily be analyzed and displayed together.
Logical measures: With logical measures, the cells of the logical cube are filled with facts collected about an organization's operations or functions. The measures are organized according to the dimensions, which typically include a time dimension.
Logical dimensions: Dimensions contain a set of unique values that identify and categorize data. They represent the different views of an entity that an organization is interested in. For example, a store will create a sales data warehouse in order to keep track of the store's sales with respect to different dimensions such as time, branch and location.
Data Cube
When data is grouped or combined into multidimensional matrices, the result is called a data cube. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently inquired about.
For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized into a set of eight views, one for each subset of the three grouping attributes (2^3 = 8), where psc indicates a view consisting of aggregate function values (such as total sales) computed by grouping on the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping on part alone, and so on.
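The eight views correspond to the eight subsets of {part, supplier, customer}; here is a small sketch that enumerates them in the psc notation used above (the empty grouping, written "all", is the grand total).

from itertools import combinations

dims = ["part", "supplier", "customer"]

# All subsets of the grouping attributes, from psc down to the empty grouping.
cuboids = [subset
           for r in range(len(dims), -1, -1)
           for subset in combinations(dims, r)]

for c in cuboids:
    name = "".join(d[0] for d in c) or "all"
    print(name, c)   # psc, ps, pc, sc, p, s, c, all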
Example: In the 2-D representation, we look at the AllElectronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table; the 3-D data of the table can be represented as a series of 2-D tables.
Schemas For Multidimensional Data
A multidimensional schema is designed specifically to model data warehouse systems. These schemas address the unique needs of very large databases designed for analytical purposes (OLAP).
Types of Data Warehouse Schema:
Following are the 3 chief types of multidimensional schemas, each having its unique advantages:
Star Schema
Snowflake Schema
Galaxy Schema
1. Star Schema
In the STAR schema, the center of the star can have one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema is the simplest type of data warehouse schema. It is also known as the Star Join Schema and is optimized for querying large data sets.
In the following example, the fact table is at the center and contains keys to every dimension table, like Dealer_ID, Model ID, Date_ID, Product_ID and Branch_ID, plus other attributes like units sold and revenue.
Characteristics of Star Schema:
Every dimension in a star schema is represented by only one dimension table.
The dimension table contains the set of attributes.
The dimension table is joined to the fact table using a foreign key.
The dimension tables are not joined to each other.
The fact table contains keys and measures.
The star schema is easy to understand and provides optimal disk usage.
The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table, as an OLTP design would have.
The schema is widely supported by BI tools.
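A minimal star-schema sketch in SQL via Python's sqlite3 follows; the table and column names (sales_fact, dim_date, dim_product) are illustrative assumptions, loosely mirroring the example above. Note how the star join needs only a single join per dimension.

import sqlite3

# One fact table joined to dimension tables by foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

# A star join: each dimension is reached from the fact table in one join.
query = """
SELECT d.year, p.brand, SUM(f.revenue)
FROM sales_fact f
JOIN dim_date d    ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, p.brand;
"""
print(conn.execute(query).fetchall())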
2. Snowflake Schema
A SNOWFLAKE SCHEMA is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema in which the dimension tables are normalized, which splits the data into additional tables. In the following example, Country is further normalized into an individual table.
Characteristics of Snowflake Schema:
The main benefit of the snowflake schema is that it uses smaller disk space.
It is easier to implement when a dimension is added to the schema.
Due to multiple tables, query performance is reduced.
The primary challenge you will face while using the snowflake schema is that you need to perform more maintenance because of the additional lookup tables.
Star Vs Snowflake Schema: Key Differences

Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimension table. | Hierarchies are divided into separate tables.
It contains a fact table surrounded by dimension tables. | One fact table is surrounded by dimension tables, which are in turn surrounded by dimension tables.
Only a single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple DB design. | Very complex DB design.
Denormalized data structure; queries also run faster. | Normalized data structure.
High level of data redundancy. | Very low level of data redundancy.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Cube processing is faster. | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using star-join query optimization; tables may be connected with multiple dimensions. | Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.
3. Fact Constellation Schema (Galaxy Schema)
A GALAXY SCHEMA contains two or more fact tables that share dimension tables between them. It is also called a fact constellation schema. The schema is viewed as a collection of stars, hence the name galaxy schema.
As you can see in the above example, there are two fact tables:
1. Revenue
2. Product
In a galaxy schema, shared dimensions are called conformed dimensions.
Characteristics of Galaxy Schema:
The dimensions in this schema are separated into separate dimensions based on the various levels of hierarchy.
For example, if geography has four levels of hierarchy, like region, country, state, and city, then the galaxy schema should have four dimensions.
Moreover, it is possible to build this type of schema by splitting one star schema into more star schemas.
The dimensions are large in this schema, as they need to be built based on the levels of hierarchy.
This schema is helpful for aggregating fact tables for better understanding.
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Mart
Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files
A flat file system is a system of files in which transactional data is stored; every file in the system must have a different name.
Meta Data
Metadata is a set of data that defines and gives information about other data. Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
End-User access Tools
The principal purpose of a data warehouse is to provide information to the business
managers for strategic decision-making. These customers interact with the
warehouse using end-client access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse). A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department or operation in the company, e.g., sales, payroll, production, etc. The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or mine historical information to make predictions about customer behavior.
Data Warehouse Design
A data warehouse is a single data repository where records from multiple data sources are integrated for online business analytical processing (OLAP). This implies a data warehouse needs to meet the requirements of all the business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy, and hence error-prone process. Furthermore, business analytical functions change over time, which results in changes in the requirements for the systems. Therefore, data warehouse and OLAP systems are dynamic, and the design process is continuous.
There are two approaches
1. "top-down" approach
2. "bottom-up" approach
Top-down Design Approach
In the "top-down" design approach, a data warehouse is described as a subject-oriented, time-variant, non-volatile and integrated data repository for the entire enterprise. Data from different sources are validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse. The data warehouse stores "atomic" information, the data at the lowest level of granularity, from which dimensional data marts can be built by selecting the data required for specific business subjects or particular departments. This is a data-driven approach, as the information is gathered and integrated first, and the business requirements by subjects for building data marts are formulated afterwards. The advantage of this method is that it supports a single integrated data source, so the data marts built from it will be consistent where they overlap.
Advantages of top-down design
Data Marts are loaded from the data warehouses.
Developing new data marts from the data warehouse is very easy.
Disadvantages of top-down design
This technique is inflexible to changing departmental needs.
The cost of implementing the project is high.
Bottom-Up Design Approach
In the "bottom-up" approach, a data warehouse is described as "a copy of transaction data specifically architected for query and analysis," termed the star schema. In this approach, a data mart is created first to provide the necessary reporting and analytical capabilities for particular business processes (or subjects). It is thus a business-driven approach, in contrast to Inmon's data-driven approach.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data warehouse. Also, the risk of failure is lower. This method is inherently incremental, and it allows the project team to learn and grow.
Advantages of bottom-up design
Documents can be generated quickly.
The data warehouse can be extended to accommodate new business units; it is just a matter of developing new data marts and then integrating them with the other data marts.
Disadvantages of bottom-up design
The positions of the data warehouse and the data marts are reversed in the bottom-up design approach.
Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

Top-Down Design Approach | Bottom-Up Design Approach
Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems and integrates them into a higher one.
Inherently architected; not a union of several data marts. | Inherently incremental; can schedule essential data marts first.
Single, central storage of information about the content. | Departmental information stored.
Centralized rules and control. | Departmental rules and control.
It includes redundant information. | Redundancy can be removed.
It may see quick results if implemented with repetitions. | Less risk of failure, favorable return on investment, and proof of techniques.
Three-Tier Data Warehouse Architecture
Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational
database management system. The ROLAP maps the operations on
multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly
implements the multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of data warehouse
From the architecture point of view, there are three data warehouse models:
1. Enterprise warehouse
2. Data marts
3. Virtual warehouse
Enterprise Warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems and from external information providers. It takes extensive business modeling and may take many years to design and build.
Data Marts: A data mart consists of a subset of corporate-wide data that is of value to a specific group of users. The scope is restricted to specific selected subjects. The data contained in a data mart tends to be summarized.
Virtual Warehouse: A virtual warehouse is a set of views over operational databases; it is essentially a business database. The data found in a virtual warehouse is usually drawn from multiple sources throughout a production system.
OLAP
OLAP (online analytical processing) is a computing method that enables users to
easily and selectively extract and query data in order to analyze it from different
points of view. OLAP business intelligence queries often aid in trends analysis,
financial reporting, sales forecasting, budgeting and other planning purposes.
For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period.
Working of an OLAP system
To facilitate this kind of analysis, data is collected from multiple data sources, stored in data warehouses, then cleansed and organized into data cubes.
Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time period) derived from dimension tables in the data warehouse. Dimensions are then populated by members (such as customer names, countries and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.
Basic analytical operations of OLAP
Four types of analytical operations in OLAP are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can be performed in two ways:
1. Reducing dimensions
2. Climbing up the concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
Consider the following diagram:
In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively; they become 2000 after roll-up.
In this aggregation process, data in the location hierarchy moves up from city to country.
In the roll-up process at least one dimension needs to be removed. In this example, the Quarter dimension is removed.
2) Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:
Moving down the concept hierarchy
Increasing a dimension
Consider the diagram above:
Quarter Q1 is drilled down to the months January, February, and March, and the corresponding sales are also registered.
In this example, the dimension months is added.
3) Slice:
Here, one dimension is selected, and a new sub-cube is created. The following diagram explains how the slice operation is performed:
The dimension Time is sliced with Q1 as the filter.
A new cube is created altogether.
Dice:
This operation is similar to a slice. The difference is that in dice you select two or more dimensions, which results in the creation of a sub-cube.
4) Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data. In the following example, the pivot is based on item types.
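A small sketch of roll-up, slice, dice, and pivot using pandas on a tiny sales table follows; the city totals mirror the roll-up example above (New Jersey 440 and Los Angeles 1560, totaling 2000 for the USA), while the quarterly split is made up.

import pandas as pd

sales = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA"],
    "city":    ["New Jersey", "New Jersey", "Los Angeles", "Los Angeles"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [200, 240, 700, 860],
})

# Roll-up: climb the location hierarchy from city to country
# (the quarter dimension is removed too); USA -> 2000.
rollup = sales.groupby("country")["amount"].sum()

# Slice: pick one value of one dimension (quarter = Q1) -> a sub-cube.
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Los Angeles")]

# Pivot: rotate the axes to present city x quarter.
pivot = sales.pivot_table(index="city", columns="quarter",
                          values="amount", aggfunc="sum")
print(rollup, pivot, sep="\n")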
Types of OLAP systems

OLAP Hierarchical Structure

Type of OLAP | Explanation
Relational OLAP (ROLAP) | ROLAP is an extended RDBMS along with multidimensional data mapping to perform the standard relational operations.
Multidimensional OLAP (MOLAP) | MOLAP implements operations on multidimensional data.
Hybrid Online Analytical Processing (HOLAP) | In the HOLAP approach, the aggregated totals are stored in a multidimensional database while the detailed data is stored in the relational database. This offers both the data efficiency of the ROLAP model and the performance of the MOLAP model.
Desktop OLAP (DOLAP) | In Desktop OLAP, a user downloads a part of the data from the database locally, or onto their desktop, and analyzes it. DOLAP is relatively cheap to deploy, as it offers very few functionalities compared to other OLAP systems.
Web OLAP (WOLAP) | Web OLAP is an OLAP system accessible via a web browser. WOLAP has a three-tiered architecture consisting of three components: a client, middleware, and a database server.
Mobile OLAP | Mobile OLAP helps users to access and analyze OLAP data using their mobile devices.
Spatial OLAP (SOLAP) | SOLAP is created to facilitate the management of both spatial and non-spatial data in a Geographic Information System (GIS).
ROLAP
ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
Advantages of the ROLAP model:
High data efficiency. It offers high data efficiency because query performance and access language are optimized particularly for multidimensional data analysis.
Scalability. This type of OLAP system offers scalability for managing large volumes of data, even when the data is steadily increasing.
Drawbacks of the ROLAP model:
Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
Aggregate data limitations. ROLAP tools use SQL for all calculations of aggregate data, and SQL is not well suited to every kind of complex computation.
Slow query performance. Query performance in this model is slow when compared with MOLAP.
MOLAP
MOLAP uses array-based multidimensional storage engines to display
multidimensional views of data. Basically, they use an OLAP cube.
Hybrid OLAP
Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers the fast computation of MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:
1. Aggregated or computed data is stored in a multidimensional OLAP cube.
2. Detailed information is stored in a relational database.
Benefits of Hybrid OLAP:
This kind of OLAP helps to economize disk space, and it also remains compact, which helps to avoid issues related to access speed and convenience.
HOLAP uses cube technology, which allows faster performance for all types of data.
The ROLAP side is instantly updated, and HOLAP users have access to this real-time data, while the MOLAP side brings cleaning and conversion of data, thereby improving data relevance. This brings the best of both worlds.
Drawbacks of Hybrid OLAP:
Greater complexity: the major drawback of HOLAP systems is that they must support both ROLAP and MOLAP tools and applications, which makes them very complicated.
Potential overlaps: there are higher chances of overlap, especially in their functionalities.
Advantages of OLAP
OLAP is a platform for all types of business analysis, including planning, budgeting, reporting, and analysis.
Information and calculations are consistent in an OLAP cube. This is a crucial benefit.
Quickly create and analyze "what if" scenarios.
Easily search the OLAP database for broad or specific terms.
OLAP provides the building blocks for business modeling tools, data mining tools, and performance reporting tools.
Allows users to slice and dice cube data by various dimensions, measures, and filters.
It is good for analyzing time series.
Finding clusters and outliers is easy with OLAP.
It is a powerful online analytical and visualization system which provides fast response times.
Disadvantages of OLAP
OLAP requires organizing data into a star or snowflake schema. These schemas are complicated to implement and administer.
You cannot have a large number of dimensions in a single OLAP cube.
Transactional data cannot be accessed with an OLAP system.
Any modification in an OLAP cube needs a full update of the cube. This is a time-consuming process.
OLAP software then locates the intersection of dimensions, such as all products
sold in the Eastern region above a certain price during a certain time period, and
displays them. The result is the "measure"; each OLAP cube has at least one to
perhaps hundreds of measures, which are derived from information stored in fact
tables in the data warehouse.
Types of OLAP systems
OLAP (online analytical processing) systems typically fall into one of three types:
Multidimensional OLAP (MOLAP) is OLAP that indexes directly into
a multidimensional database.
Relational OLAP (ROLAP) is OLAP that performs dynamic
multidimensional analysis of data stored in a relational database.
Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP
was developed to combine the greater data capacity of ROLAP with the
superior processing capability of MOLAP.
Indexing & Querying in OLAP
To facilitate efficient data access, most data warehouse systems support index structures and materialized views. Two indexing techniques that are popular for OLAP data are:
Bitmap Indexing
Join Indexing
1) Bitmap Indexing
The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
A bitmap index is a very efficient method for storing sparse data columns, i.e., columns which contain data values from a very small set of possibilities.
In the bitmap index for a given attribute, there is a distinct bit vector for each value v in the domain of the attribute.
If the domain of the attribute consists of n values, then n bits are needed for each entry in the bitmap index.
The length of each bit vector is equal to the number of records in the base table.
Bitmap indexing is not suitable for high-cardinality domains.
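A minimal sketch of a bitmap index on a low-cardinality column follows; the gender values are made-up sample data.

# One bit vector per distinct value; vector length = number of records.
rows = ["M", "F", "F", "M", "F"]          # base table column (sample data)

bitmap = {v: [1 if r == v else 0 for r in rows] for v in set(rows)}
# bitmap["M"] -> [1, 0, 0, 1, 0]
# bitmap["F"] -> [0, 1, 1, 0, 1]

# Selections like "gender = F AND <other predicate>" reduce to fast
# bitwise AND/OR operations over the bit vectors.
other_predicate = [1, 1, 1, 0, 0]
result = [b & m for b, m in zip(bitmap["F"], other_predicate)]
print(result)  # [0, 1, 1, 0, 0]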
2) Join Indexing
A join index is an index on the result of a join.
The join indexing method gained popularity from its use in relational database query processing.
In data warehousing, join indexing is especially useful in the star schema
model to join the records of the fact table with the corresponding dimension
table.
Consider two relations R(RID, A) and S(B, SID) that join on attributes A and B. Then the join index contains the pairs (RID, SID), where RID and SID are record identifiers from the R and S relations, as in the sketch below.
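The following short Python sketch (illustrative only; the relations are hypothetical) materializes the (RID, SID) pairs described above:

    # Hypothetical relations R(RID, A) and S(B, SID) joining on A = B.
    R = [(1, "laptop"), (2, "phone"), (3, "laptop")]   # (RID, A)
    S = [("laptop", 101), ("phone", 102)]              # (B, SID)

    # The join index precomputes the (RID, SID) pairs of joinable records.
    join_index = [(rid, sid) for rid, a in R for b, sid in S if a == b]
    print(join_index)   # [(1, 101), (2, 102), (3, 101)]

At query time, the stored pairs let the system join fact and dimension records without rescanning either table.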
Querying in OLAP
OLAP is a database technology that has been optimized for querying and reporting, rather than for processing transactions. The source data for OLAP are Online Transaction Processing (OLTP) databases, which are commonly stored in data warehouses.
OLAP is implemented in a multi-user client/server environment and offers consistently rapid responses to queries, regardless of database size and complexity.
The purpose of constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should
proceed as follows:
1. Determine which operations should be performed on the available cuboids: This involves transforming any selection, projection, roll-up, and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.
2. Determine to which materialized cuboid(s) the relevant operations should be applied: This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning this set, estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
OLAM ( ONLINE ANALYTICAL MINING)
Online analytical mining integrates online analytical processing (OLAP) with data mining. It represents a promising direction for mining large databases and data warehouses.
Importance of OLAM
OLAM is important for the following reasons −
High quality of data in data warehouses − Data mining tools need to work on integrated, consistent, and cleaned data, and these preprocessing steps are very costly. The data warehouses constructed by such preprocessing are valuable sources of high-quality data for OLAP and for data mining as well.
Available information processing infrastructure surrounding data
warehouses − Information processing infrastructure refers to accessing,
integration, consolidation, and transformation of multiple heterogeneous
databases, web-accessing and service facilities, reporting and OLAP
analysis tools.
OLAP−based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides the facility for data mining on various subsets of data and at different levels of abstraction.
Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and to swap data mining tasks dynamically.
OLAM Architecture
An OLAM engine performs analytical mining in data cubes in a manner similar to the way an OLAP engine performs online analytical processing. It is therefore suggested to have an integrated OLAM and OLAP architecture, where the OLAM and OLAP engines both accept users' online queries via a graphical user interface.
An OLAM engine can perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, and time-series analysis. It therefore usually consists of multiple, integrated data mining modules, making it more sophisticated than an OLAP engine. There is no fundamental difference between the data cube required for OLAP and that required for OLAM, although OLAM analysis might require more powerful data cube construction and accessing tools.
Efficient methods of cube computation
Data cube computation is an important task in data warehouse implementation. The precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing. However, such computation is challenging, since it may require large computational time and storage space. This section explores efficient methods for data cube computation.
Multiway Array Aggregation for full cube computation
The multiway array aggregation (or simply MultiWay) method computes a full data cube by using a multidimensional array as its basic data structure.
It is a typical MOLAP (Multidimensional Online Analytical Processing) approach that uses direct array addressing, where dimension values are accessed via the position or index of their corresponding array locations.
The following approach is used for array-based cube construction:
1. Partition the array into chunks: A chunk is a subcube that is small enough to fit into the memory available for cube computation. Chunking is a method for dividing an n-dimensional array into small n-dimensional chunks, where each chunk is stored as an object on disk. The chunks are compressed so as to remove wasted space resulting from empty array cells.
2. Compute aggregates by visiting cube cells: The order in which cells are visited can be optimized so as to minimize the number of times each cell must be revisited, thereby reducing memory access and storage costs. A simplified sketch of the chunking idea follows.
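As a simplified illustration of the chunking idea (a sketch only, not the full MultiWay algorithm; the array and chunk size are hypothetical), the following Python code aggregates a small 3-D base cuboid chunk by chunk to produce the AB cuboid:

    import numpy as np

    # Hypothetical 4x4x4 base cuboid over dimensions A, B, C, in 2x2x2 chunks.
    cube = np.arange(64, dtype=float).reshape(4, 4, 4)
    AB = np.zeros((4, 4))                      # cuboid AB: aggregate C away

    # Visit the cube one chunk at a time, so that only a chunk (not the
    # whole array) needs to be memory-resident at once.
    for i in range(0, 4, 2):
        for j in range(0, 4, 2):
            for k in range(0, 4, 2):
                chunk = cube[i:i+2, j:j+2, k:k+2]
                AB[i:i+2, j:j+2] += chunk.sum(axis=2)   # aggregate along C

    assert np.array_equal(AB, cube.sum(axis=2))          # same result as one pass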
BUC (Bottom-up Construction) : Computing Iceberg cubes from the apex
cuboid downward
BUC (bottom-up construction) is an algorithm for the computation of sparse and iceberg cubes.
Unlike MultiWay, BUC constructs the cube from the apex cuboid toward the base cuboid. This allows BUC to share data partitioning costs.
This representation of a lattice of cuboids, with the apex at the top and the base at the bottom, is commonly accepted in data warehousing. It is consistent with the notions of drill-down and roll-up.
Star cubing: computing iceberg cubes using a dynamic star tree structure
Star cubing integrates top-down and bottom-up cube computation and explores multidimensional aggregation.
It operates from a data structure called a star tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.
A key idea behind star cubing is the concept of shared dimensions.
The order of computation is from the base cuboid upward toward the apex cuboid. This order of computation is similar to that of MultiWay.
Discovery Driven Exploration of data cubes
A data cube may have a large number of cuboids, and each cuboid may contain a large number of cells. With such an extremely large space, it becomes a burden for users to just browse a cube. Tools need to be developed to assist users in intelligently exploring the huge aggregated space of a data cube.
Discovery driven exploration is such a cube exploration approach. The main
features of this approach are:
In discovery-driven exploration, precomputed measures or procedures indicating data exceptions are used to guide the user in the data analysis process, at all levels of aggregation.
In this approach, an exception is a data cube cell value that is significantly different from the expected value, based on a statistical model.
This approach considers variations and patterns in the measure's value across all of the dimensions to which a cell belongs.
Visual cues, such as background color, are used to reflect the degree of exception of each cell, based on the precomputed exception indicators.
The computation of exception indicators can be overlapped with cube
construction, so that the overall construction of data cubes for discovery
driven exploration is efficient.
Three measures are used as exception indicators to help identify data
anomalies.
1. SelfExp: - This indicates the degree of surprise of the cell value,
relative to other cells at the same levels of aggregation.
2. InExp: - This indicates the degree of surprise somewhere under the
cell, if we were to drill down from it.
3. PathExp: - This indicates the degree of surprise for each drill down
path from the cell.
Attribute Oriented Induction In Data Mining - Data
Characterization
Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989 (at the KDD '89 workshop), a few years before the introduction of the data cube approach.
The data cube approach can be considered a data-warehouse-based, precomputation-oriented, materialized approach.
It performs off-line aggregation before an OLAP or data mining query is submitted
for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational-database-query-oriented, generalization-based, on-line data analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on
online aggregation versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line
precomputation of multidimensional space can speed up attribute-oriented
induction as well.
AOI is not confined to categorical data, nor to particular measures.
Basic Principles Of Attribute Oriented Induction
Data focusing:
Analyzing task-relevant data, including dimensions; the result is the initial relation.
Attribute removal:
Remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control:
Typically 2-8; the threshold is either specified by the user or a default value.
Generalized relation threshold control (typically 10-30):
Controls the final relation/rule size.
Algorithm for Attribute Oriented Induction
InitialRel:
Query processing of the task-relevant data, deriving the initial relation.
PreGen:
Based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize.
PrimeGen:
Based on the PreGen plan, perform the generalization to the right level to derive a "prime generalized relation", accumulating the counts.
Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross-tabs, or visualization presentations. A small generalization sketch follows.
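To make the PreGen/PrimeGen steps concrete, here is a minimal Python sketch (illustrative only; the concept hierarchy and tuples are hypothetical) that generalizes one attribute up its concept hierarchy and accumulates counts:

    from collections import Counter

    # Hypothetical concept hierarchy: city -> region.
    city_to_region = {"Delhi": "North", "Mumbai": "West", "Jaipur": "North"}

    # Hypothetical task-relevant tuples: (city, course).
    tuples = [("Delhi", "BCA"), ("Mumbai", "BCA"), ("Jaipur", "MCA"), ("Delhi", "BCA")]

    # Generalize city one level up the hierarchy and accumulate counts,
    # producing a small "prime generalized relation".
    generalized = Counter((city_to_region[city], course) for city, course in tuples)
    for row, count in generalized.items():
        print(row, "count =", count)
    # ('North', 'BCA') count = 2, ('West', 'BCA') count = 1, ('North', 'MCA') count = 1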
UNIT III
Association Rule Mining
Association rule mining is a technique used to find interesting and frequent patterns in transactional, spatial, temporal, or other databases, and to establish associations or relations among patterns (also known as itemsets) in order to discover knowledge.
Association rules can be applied in various fields like network management,
catalog design, clustering, classification, marketing etc.
A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they also to buy bread (and what kind of bread) on the same trip to the supermarket? Such information can help retailers do selective marketing and plan their shelf space.
Market Basket Analysis
Suppose, as manager of a branch of HTC company, you would like to learn more about the buying habits of your customers. You may want to know which groups or sets of items customers are likely to purchase on a given trip to the store.
To answer your question, market basket analysis may be performed on the retail data of customer transactions at your store. The results may be used to plan marketing or advertising strategies, as well as catalogue design. For instance, market basket analysis may help managers design different store layouts. In one strategy, items that are frequently purchased together can be placed in close proximity in order to further encourage the sale of such items together. If customers who purchase computers also tend to buy cell phones at the same time, then placing the computer display close to the cell phone display may help increase the sales of both of these items. In an alternative strategy, placing computers and cell phones at opposite ends of the store may attract customers who purchase such items to pick up other items along the way.
Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
Association rule mining is a two-step process:
1. Find all frequent itemsets or patterns.
2. Generate strong association rules from the frequent itemsets.
Market basket analysis is just one form of association rule mining. In fact, there
are many kinds of association rules. Association rules can be classified in
various ways, based on the following criteria.
Based on the types of values handled in the rule:- If a rule concerns associations between the presence or absence of items, it is a Boolean association rule.
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals.
Based on the dimensions of data involved in the rule:- If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. Such a rule could be written as:
buys(X, "computer") => buys(X, "cell phone")
Based on the levels of abstraction involved in the rule set:- Some methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of mined association rules includes the following rules, where the second rule is expressed at a higher level of abstraction than the first:
age(X, "30...39") => buys(X, "laptop computer")
age(X, "30...39") => buys(X, "computer")
Based on various extensions to association mining:- Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified. It can also be extended to mining maxpatterns and frequent closed itemsets.
Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative approach, or level-wise search, where frequent k-itemsets are used to find (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.
Apriori Property
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (the Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Before we start the algorithm, recall the definitions of support and confidence.
Consider the following dataset; we will find its frequent itemsets and generate association rules from them.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2; if the support_count of a candidate item is less than min_support, remove that item). This gives us the itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common. Check whether all subsets of each candidate itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.) Now find the support count of these itemsets by searching the dataset.
(II) Compare each candidate's (C2) support count with the minimum support count (here min_support = 2; if the support count of a candidate itemset is less than min_support, remove that itemset). This gives us the itemset L2.
Step-3: K=3
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}. Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.) Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidates' (C3) support counts with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support, remove that itemset). This gives us the itemset L3.
Step-4: K=4
(I) Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K = 4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first two elements (items) should match.
(II) Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
We stop here because no further frequent itemsets are found. A compact implementation sketch is given below.
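To make the level-wise search concrete, here is a compact, self-contained Python sketch of Apriori. The transaction list is a hypothetical stand-in for the dataset referred to above, with min_support = 2 as in the walk-through:

    from itertools import combinations

    transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
                    {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
                    {"I1","I2","I3"}]
    min_support = 2

    def support(itemset):
        # Number of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    k = 2
    while L:
        print("L%d:" % (k - 1), sorted(sorted(s) for s in L))
        # Join step: union pairs of (k-1)-itemsets that agree on k-2 items.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = [c for c in C if support(c) >= min_support]
        k += 1

Strong rules can then be generated from each frequent itemset by keeping only those rules whose confidence (the support of the whole itemset divided by the support of the antecedent) meets a minimum confidence threshold.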
Limitations of Apriori algorithm
Computationally expensive. Even though the Apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low. An alternative solution is to reduce the number of comparisons by using advanced data structures, such as hash tables, to sort candidate itemsets more efficiently.
Spurious Associations. Analysis of large inventories would involve more
itemset configurations, and the support threshold might have to be lowered
to detect certain associations. However, lowering the support threshold
might also increase the number of spurious associations detected. To ensure
that identified associations are generalizable, they could first be distilled
from a training dataset, before having their support and confidence assessed
in a separate test dataset.
Mining Multilevel Association Rules
Multilevel association means mining the data at different levels of abstraction. For many applications, it is difficult to find strong associations among data items at low levels of abstraction due to the sparsity of data in multidimensional space.
Strong associations discovered at high concept levels may represent common-sense knowledge. However, what represents common sense to one user may seem new or novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction.
Approaches to mining multilevel Association Rules
“How can we mine multilevel association rules efficiently using concept hierarchies?” Let’s look at some approaches based on a support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working down toward the lower, more specific concept levels, until no more frequent itemsets can be found.
Uniform minimum support for all levels (referred to as uniform support):- The same minimum support threshold is used when mining at each level of abstraction.
In the figure, a minimum support threshold of 5% is used throughout (e.g., for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, whereas "desktop computer" is not.
The uniform support approach, however, has some difficulties: it is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction.
Reduced minimum support at lower levels (referred to as reduced support):- Each abstraction level has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold.
For example, in the figure, the minimum support thresholds for levels 1 and 2 are 6% and 4%, respectively. In this way, "computer", "laptop computer", and "desktop computer" are all considered frequent.
From Association mining to correlation Analysis
In association rule mining, many rules can be found using a low support threshold, yet many of the rules generated are still not interesting to the users. Unfortunately, this is especially true when mining at low support thresholds or when mining long patterns. This has been one of the major bottlenecks for the successful application of association rule mining.
Strong rules are not necessarily interesting: - Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only the user can judge whether a given rule is interesting, and this judgment, being subjective, may differ from one user to another.
However, objective interestingness measures, based on the statistics "behind" the data, can be used as one step toward the goal of removing uninteresting rules from presentation to the user.
Correlation analysis: - Correlation analysis is a good method for detecting duplicated variables in the data. Sometimes the same variable accidentally appears more than once in a dataset, because the dataset was merged from multiple sources, because the same phenomenon was measured in different units, and so on. One simple way to find such redundancies is to look at a correlation matrix, which shows all the pairwise correlations between variables (see the sketch below).
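A minimal pandas sketch (illustrative only; the dataset is hypothetical) shows how a correlation matrix exposes such a redundancy:

    import pandas as pd

    # Hypothetical data in which one variable duplicates another in different units.
    df = pd.DataFrame({
        "height_cm": [150, 160, 170, 180],
        "height_in": [59.1, 63.0, 66.9, 70.9],
        "weight_kg": [52, 61, 68, 80],
    })

    # Off-diagonal values near 1.0 flag likely duplicated variables.
    print(df.corr())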
Constraint Based Association Mining
For a given set of task-relevant data, the data mining process may uncover thousands of rules, many of which are uninteresting to the user. In constraint-based mining, mining is performed under the guidance of various kinds of constraints provided by the user. These constraints include the following:
Knowledge type constraints: These specify the type of knowledge to be
mined, such as association.
Data constraints: - These specify the set of task-relevant data.
Dimension/level constraints: - These specify the dimensions of the data, or the levels of the concept hierarchies, to be used.
Interestingness constraints: - These specify thresholds on statistical measures of rule interestingness, such as support and confidence.
Rule constraints: - These specify the form of rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes.
Introduction to Classification
Classification is the process of finding a model or function that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
A classifier assigns items to classes. In data mining, the items are typically records from a database. The classes are defined by the person doing the data mining, and each class has a class label or name. A classifier decides which class an item belongs to on the basis of the values of its attributes.
The following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to predict whether the customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data, and yes or no for the marketing data.
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes
two steps −
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and
their associated class labels.
Each tuple that constitutes the training set is referred to as a training sample. These tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be
applied to the new data tuples if the accuracy is considered acceptable.
Classification by Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buys_computer; it indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction
The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner. The algorithm, given below, is a well-known decision tree induction algorithm.
Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: Data partition D, which is a set of training tuples and their associated class labels; attribute_list, the set of candidate attributes; and attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes (this criterion consists of a splitting_attribute and possibly a split point).
Output: A decision tree.
Method:
1. Create a node N;
2. if the tuples in D are all of the same class C, then
return N as a leaf node labeled with class C;
3. if attribute_list is empty, then
return N as a leaf node labeled with the most common class in D;
4. apply attribute_selection_method(D, attribute_list) to find the best
splitting_criterion;
5. label node N with the splitting_criterion;
6. if the splitting_attribute is discrete-valued and multiway splits are allowed, then
attribute_list = attribute_list - splitting_attribute;
7. for each outcome j of the splitting_criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j;
if Dj is empty, then
attach a leaf labeled with the majority class (most common class) in D to node N;
else
attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
8. return N;
Attribute selection measures
An attribute selection measure is a heuristic used for selecting the splitting criterion that best separates a given data partition, D, of class-labeled training tuples into individual classes. If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure.
Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split. The attribute selection
measures provide a ranking for each attribute describing the given training tuples.
The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples.
The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. This section describes three popular attribute selection measures:
I. Information gain: The information gain measure is used to select the test attribute at each node in the tree. The attribute with the highest information gain is chosen as the splitting attribute for the current node.
II. Gain ratio: The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values. The gain ratio adjusts information gain to compensate for this bias.
III. Gini index: The Gini index measures the impurity of a data partition or set of training tuples.
A short sketch of information gain is given below.
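As an illustration of the first measure, here is a short, self-contained Python sketch of information gain (the training tuples are hypothetical; this is a sketch, not a full decision tree inducer):

    import math
    from collections import Counter

    def entropy(labels):
        # Expected information (in bits) needed to classify a tuple.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted
        # entropy of the partitions induced by attribute A.
        split = 0.0
        for value in {r[attr] for r in rows}:
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            split += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - split

    # Hypothetical tuples: (age_group, student) with class label buys_computer.
    rows   = [("youth", "no"), ("youth", "yes"), ("middle", "no"), ("senior", "yes")]
    labels = ["no", "yes", "yes", "yes"]
    print(info_gain(rows, labels, 0), info_gain(rows, labels, 1))

The attribute with the larger gain would be chosen as the splitting attribute at the current node.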
UNIT – IV
Introduction to Prediction
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample or object, or to assess the value or value range of an attribute that a given object is likely to have.
In this view, classification and regression are the two major types of prediction
problems, where classification is used to predict discrete or nominal values, while
regression is used to predict continuous or ordered values.
Data prediction is a two-step process, similar to that of data classification as described earlier. However, for prediction, we drop the terminology of "class label attribute" because the attribute for which values are being predicted is continuous-valued rather than categorical. The attribute can simply be referred to as the predicted attribute.
Issues Regarding classification and prediction
The major issue is preparing the data for Classification and Prediction. Preparing
the data involves the following activities −
Data Cleaning − Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
Relevance Analysis − Databases may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
Data Transformation and reduction − The data can be transformed by
any of the following methods.
o Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
o Generalization − The data can also be transformed by generalizing it
to the higher concept. For this purpose we can use the concept
hierarchies.
Comparison of Classification and Prediction Methods
Here is the criterion for comparing the methods of Classification and Prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
Speed − This refers to the computational cost in generating and using the
classifier or predictor.
Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor can be understood.
Prediction Techniques
1. Neural Network(NN)
Neural network is a network of several interconnected units. Each of
these units consists of input/output characteristics that implement a
local computation or function.
The function could be a computation of weighted sums of the inputs, producing an output.
The output could serve as an input to other neurons in the network; this process iterates until a final output is produced.
A neural network is considered to copy the way the human brain works, in the sense that it acquires knowledge from its environment through a learning process.
There are numerous types of neural networks with the main categories
being feedforward and recurrent neural networks.
In a feedforward network, the information moves in only one direction: forward from the input neurons, through the hidden neurons, and on to the output neurons.
A recurrent network, on the other hand, is a neural network where the connections between the units form a directed cycle.
The advantages of neural networks are: first, they are capable of handling a wide range of problems; second, they have an ability to produce good results even in complicated domains; and third, they are able to handle both categorical and continuous data types.
2. Decision Tree
Decision trees are well known to be another effective prediction technique in several domains.
Decision tree models are commonly used in data mining to examine
data and induce the tree and its rules that will be used to make
predictions.
The prediction could be of categorical values, when instances are to be placed in categories or classes. The technique can also be utilized in the prediction of continuous variables, where absolute values are required.
Decision tree is developed through an iterative process of splitting
data into distinct groups, where the objective is to maximize the
distance between groups at each split.
Decision trees are known to have an advantage over several techniques due to the output they produce. The output of a decision tree is transparent, which makes it easy for users or non-technical persons to understand.
However, decision tree techniques are known to have scalability and efficiency problems.
3. Logistic Regression
Logistic regression is considered the standard statistical approach to modeling binary data.
It is classified as a member of the class of generalized linear models, a
broad set of models designed to generalize the normal linear models
to target variables of many different types.
LR has been applied to many data mining classification and prediction problems; for example, it has been used for the classification and prediction of cancer genes.
Accuracy of a classifier
The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data. Common techniques used for assessing the accuracy of a classifier are:
I. Holdout method and random sub-sampling
II. Cross validation
III. Bootstrap
Holdout method and random sub sampling
In this method, the given data are randomly partitioned into two independent
sets, a training set and a test set.
Typically , two thirds of the data are allocated to the training set, and the
remaining one third is allocated to the test set.
The training set is used to derive the model or classifier, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.
Random subsampling is a variation of the holdout method in which the holdout method is repeated K times (see the sketch below).
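A minimal Python sketch of the holdout split (the data is hypothetical):

    import random

    data = list(range(30))            # hypothetical labeled tuples
    random.shuffle(data)
    cut = (2 * len(data)) // 3        # two thirds for training
    train, test = data[:cut], data[cut:]
    # Random subsampling: repeat this split K times and average the K accuracies.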
Cross validation
Cross validation is a technique used for estimating the performance of a
predictive model or a classifier.
In cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets D1, D2, ..., Dk, each of approximately equal size.
Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions collectively serve as the training set: the first model is trained on subsets D2, ..., Dk and tested on D1; the second is trained on D1, D3, ..., Dk and tested on D2; and so on.
The cross-validation method uses sampling without replacement: there are no duplicate records in the training and test sets.
Unlike the holdout and random subsampling methods above, here each sample or object is used the same number of times for training and exactly once for testing.
For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples. A minimal sketch of the fold construction follows.
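A minimal Python sketch of k-fold partitioning (indices only; illustrative):

    def k_fold_indices(n, k):
        # Partition indices 0..n-1 into k mutually exclusive folds.
        folds = [list(range(i, n, k)) for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds for j in f if f is not test]
            yield train, test

    for train, test in k_fold_indices(10, 5):
        print("test fold:", test)   # each index lands in exactly one test fold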
Bootstrap
Bootstrap is a technique used for assessing accuracy of a classifier in data
mining.
Bootstrap method works well with small data sets.
The bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
The bootstrap sample is created via random sampling with replacement: if the original data set contains M tuples, it is sampled M times, yielding a bootstrap sample of M tuples.
This sample constitutes the training set, whereas the rest of the original data forms the test set (see the sketch below).
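A minimal Python sketch of drawing one bootstrap sample (the data is hypothetical):

    import random

    data = list(range(10))                        # hypothetical data set of M tuples
    sample = [random.choice(data) for _ in data]  # training set: M draws with replacement
    test = [x for x in data if x not in sample]   # tuples never drawn make up the test set
    # On average, about 63.2% of the original tuples appear in the sample.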
Bagging and Boosting
Bagging and boosting are two such techniques or strategies used for improving classifier and predictor accuracy. They are useful techniques for improving the predictive performance of tree models.
Bagging
One way to obtain multiple classifiers is through bagging, or bootstrap aggregating, proposed by Breiman in 1996 to improve classification by combining the outputs of classifiers that are trained using randomly generated training sets.
Bagging is a technique that can be used with many classification and regression methods to reduce the variance associated with prediction, and thereby improve the prediction process.
Boosting
Boosting, like bagging, is a committee-based approach that can be used to improve the accuracy of classification or regression methods. Unlike bagging, which uses a simple averaging of results to obtain an overall prediction, boosting uses a weighted average of results obtained from applying a prediction method to various samples.
Basically, boosting takes multiple random samples from the data and builds a classification model for each. The training set is changed based on the results of the previous models. The final classification is the class assigned most often by the models. The exact algorithms for boosting have evolved from the original, but the underlying idea is the same. Boosting has become a very popular addition to data mining packages.
While bagging combines the results of independently built classifiers and models, boosting combines classifiers and models of the same type to form a new target concept.
For example, if you were to build a model using one sample of data and then build a new model using the same algorithm but on a different sample, you might get a different result. After validating the two models, you could choose the one that best met your objectives. Even better results might be achieved if you built several models and let them vote, making a prediction based on what the majority recommended (see the sketch below).
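A minimal Python sketch of the voting idea (illustrative only; each "model" here is a trivial stand-in that predicts the majority class of its bootstrap sample):

    import random
    from collections import Counter

    labels = ["yes", "yes", "no", "yes", "no", "yes"]   # hypothetical training labels

    predictions = []
    for _ in range(5):                                  # 5 bootstrap replicates
        sample = [random.choice(labels) for _ in labels]
        predictions.append(Counter(sample).most_common(1)[0][0])

    final = Counter(predictions).most_common(1)[0][0]   # majority vote
    print(predictions, "->", final)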
Differences Between Bagging and Boosting –
1. Bagging: the simplest way of combining predictions that belong to the same type. Boosting: a way of combining predictions that belong to different types.
2. Bagging: aims to decrease variance, not bias. Boosting: aims to decrease bias, not variance.
3. Bagging: each model receives equal weight. Boosting: models are weighted according to their performance.
4. Bagging: each model is built independently. Boosting: new models are influenced by the performance of previously built models.
5. Bagging: different training data subsets are randomly drawn with replacement from the entire training dataset. Boosting: every new subset contains the elements that were misclassified by previous models.
6. Bagging: tries to solve the over-fitting problem. Boosting: tries to reduce bias.
7. Bagging: if the classifier is unstable (high variance), apply bagging. Boosting: if the classifier is stable and simple (high bias), apply boosting.
8. Bagging: example, random forest. Boosting: example, gradient boosting.
Introduction to Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
Now that we understand what clustering is, let's take a look at the types of clustering.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, in the scenario above, each customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point being in each cluster is assigned. For example, in the scenario above, each customer is assigned a probability of belonging to each of the 10 clusters of the retail store.
Strict partitioning clustering: Each object belongs to exactly one cluster.
Strict partitioning clustering with outliers: Objects can also belong to no cluster, in which case they are considered outliers.
Overlapping clustering / alternative clustering: While usually a hard clustering, objects may belong to more than one cluster.
Hierarchical clustering: Objects that belong to a child cluster also belong to the parent cluster.
Subspace clustering: While an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.
Requirements of Clustering in Data Mining
Interpretability
The result of clustering should be usable, understandable and interpretable.
Helps in dealing with messy data
Usually, the data is messy and unstructured. It cannot be analyzed quickly, and that is why the clustering of information is so significant in data mining. Grouping can give some structure to the data by organizing it into groups of similar data objects. It becomes more comfortable for the data expert to process the data and also to discover new things.
High dimensionality
Data clustering should be able to handle high-dimensional data as well as data of small size.
Arbitrary shape clusters are discovered
Arbitrary-shape clusters are detected by using clustering algorithms. Small clusters with a spherical shape can also be found.
Algorithm usability with multiple data kinds
Many different kinds of data can be used with clustering algorithms, such as binary data, categorical data, and interval-based data.
Clustering scalability
The database to deal with is usually enormous, so the algorithm should be scalable enough to handle extensive databases.
Applications of clustering
Some of the main application areas are:
Market Research: Cluster analysis is widely used in market research. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. They can characterize their customer groups based on purchasing patterns.
World Wide Web: In the study of social networks, clustering may be used to recognize communities within large groups of people.
Image processing: Clustering can be used to divide a digital image into
distinct regions for border detection or object recognition.
Crime Analysis: Cluster analysis can be used to identify areas where there
are greater incidences of particular types of crime. By identifying these
distinct areas or hot spots where a similar crime has happened over a period
of time, it is possible to manage law enforcement resources more effectively.
Education: Cluster analysis is used to identify groups of schools or
students with similar properties.
City Planning: Identifying groups of houses according to their house type,
value and geographical location.
Robotics: Clustering algorithms are used for robotic situational awareness to
track objects and detect outliers in sensor data.
Different Types of Clustering Algorithm
There are many types of clustering algorithms. The following overview lists only the most prominent examples, as there are possibly over 100 published clustering algorithms. Not all of them provide models for their clusters, and such algorithms cannot easily be categorized.
Distribution based methods
It is a clustering model in which we fit the data based on the probability that it may belong to the same distribution. The grouping is often normal (Gaussian). The Gaussian distribution is most prominent where we have a fixed number of distributions, and all incoming data is fitted into them such that the likelihood of the data is maximized. This results in the grouping shown in the figure.
This model works well on synthetic data and diversely sized clusters, but it may have problems if constraints are not used to limit the model's complexity. Furthermore, distribution-based clustering produces clusters that assume concisely defined mathematical models underlying the data, a rather strong assumption for some data distributions.
For example, the expectation-maximization algorithm, which uses multivariate normal distributions, is one popular example of this approach.
Centroid based methods
This is basically one of iterative clustering algorithm in which the clusters are
formed by the closeness of data points to the centroid of clusters. Here , the cluster
center i.e. centroid is formed such that the distance of data points is minimum with
the center. This problem is basically one of NP- Hard problem and thus solutions are
commonly approximated over a number of trials.
BCA 5th Sem (Data warehousing & Mining)
117
For Ex- K – means algorithm is one of popular example of this algorithm .
The biggest problem with this algorithm is that we need to specify K in advance. It
also has problem in clustering density based distributions.
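A minimal one-dimensional K-means sketch (illustrative only; the points are hypothetical and K = 2):

    import random

    points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
    centroids = random.sample(points, 2)

    for _ in range(10):                            # iterate: assign, then re-centre
        clusters = {0: [], 1: []}
        for p in points:
            nearest = min((0, 1), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]

    print(sorted(centroids))   # approximately [1.5, 8.5]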
Connectivity based methods
The core idea of the connectivity-based model is similar to the centroid-based model: it defines clusters on the basis of the closeness of data points. Here we work on the notion that data points that are closer to each other behave more similarly than data points that are farther apart.
It does not produce a single partitioning of the data set; instead, it provides an extensive hierarchy of clusters that merge with each other at certain distances. Here the choice of distance function is subjective. These models are very easy to interpret, but they lack scalability.
For example, hierarchical algorithms and their variants.
Density Models
In this clustering model, the data space is searched for areas of varied density of data points. The model isolates different density regions based on the densities present in the data space.
For example, DBSCAN and OPTICS.
Subspace clustering
Subspace clustering is an unsupervised learning problem that aims at grouping data points into multiple clusters so that the data points in a single cluster lie approximately on a low-dimensional linear subspace. Subspace clustering is an extension of feature selection: just as with feature selection, subspace clustering requires a search method and evaluation criteria, but in addition, subspace clustering limits the scope of the evaluation criteria. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, overlapping subspaces. Subspace clustering was originally proposed to solve a very specific computer vision problem having a union-of-subspaces structure in the data, but it has gained increasing attention in the statistics and machine learning communities. People use this tool in social networks, movie recommendation, and biological datasets. Subspace clustering raises concerns about data privacy, as many such applications involve dealing with sensitive information. Data points are assumed to be incoherent, as this protects only the differential privacy of individual features of a user rather than the entire profile of the user in the database.
There are two branches of subspace clustering based on their search strategy.
Top-down algorithms find an initial clustering in the full set of dimensions and then evaluate the subspace of each cluster.
Bottom-up approaches find dense regions in low-dimensional spaces and then combine them to form clusters.
Selecting and Using Right DM Technique
Introduction: To do effective business information processing in an enterprise, we should use the right data mining techniques. The real value of a product for a business depends on the bottom-line selection of a data mining technology.
This always translates to the bottom line: profit, increased revenue, decreased cost, or return on investment. If the technique and tool do not provide one of these four assets in a measurable way, it is unlikely anyone in your business will have time to mine their data.
Data mining needs to be more than finding interesting patterns in large databases if it is to be successfully deployed in your business.
In order to make an intelligent selection among data mining tools and technologies, it is helpful to categorize the areas where they differ. To see this more clearly, and because there is so much overlap, one of the best ways to see the valid distinctions between the algorithms is to see what is similar.
For instance, although they differ in how they accomplish them, each data mining algorithm has the following:
Model structure: The structure that defines the model.
Search: The manner in which the algorithm modifies the model over time as more data is made available.
Validation: The determination, by the algorithm, of whether a model is valid.
Data Visualization
Data visualization is used to communicate information clearly and efficiently to users through information graphics such as tables and charts. It helps users analyze a large amount of data in a simpler way. It makes complex data more accessible, understandable, and usable.
Tables are used where users need to see the pattern of a specific parameter, while
charts are used to show patterns or relationships in the data for one or more
parameters.
Tips to follow while representing data visually −
Number all diagrams
Label all diagrams
Ensure that units of measurement on axes are clearly labelled
Place any explanatory information in footnotes below the visual
Check layouts to ensure maximum clarity
Pro and Cons of Data Visualization
Here are some pros and cons to representing data visually −
Pros
It can be accessed quickly by a wider audience.
It conveys a lot of information in a small space.
It makes your report more visually appealing.
Cons
It can misrepresent information – if an incorrect visual representation is
made.
It can be distracting – if the visual data is distorted or excessively used
Importance of Data Visualization
Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large amounts of complex data is more comfortable than studying spreadsheets and reports.
Data visualization is an easy and quick way to convey concepts universally. You can experiment with a different outline by making a slight adjustment.
Data visualization has some more specialties, such as:
o Data visualization can identify areas that need improvement or
modifications.
o Data visualization can clarify which factors influence customer behavior.
o Data visualization helps you to understand which products to place where.
o Data visualization can predict sales volumes.
Data visualization tools have been essential for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easier to operate than earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.
Data visualization techniques
Depending on the data and the goal of the analysis, you can choose different data visualization techniques and configure their features. Here are the common types of visualization techniques:
Charts
The easiest way to show the development of one or several data sets is a chart.
Charts vary from bar and line charts that show the relationship between elements
over time to pie charts that demonstrate the components or proportions between the
elements of one whole.
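As a small illustration (a sketch with hypothetical numbers, using the matplotlib library), a bar chart of one data set might be produced as follows:

    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]    # hypothetical quarterly sales
    sales = [120, 150, 90, 180]

    plt.bar(quarters, sales)
    plt.xlabel("Quarter")                  # label axes clearly (see tips above)
    plt.ylabel("Sales (units)")
    plt.title("Quarterly sales")
    plt.show()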
Plots
Plots allow two or more data sets to be distributed over a 2D or even 3D space to show the relationship between these sets and the parameters on the plot. Plots also vary: scatter and bubble plots are some of the most widely used visualizations. When it comes to big data, analysts often use more complex box plots that help visualize the relationship between large volumes of data.
Maps
Maps are popular ways to visualize data used in different industries. They allow elements to be located on relevant objects and areas: geographical maps, building plans, website layouts, etc. Among the most popular map visualizations are heat maps, dot distribution maps, and cartograms.
Diagrams and matrices
Diagrams are usually used to demonstrate complex data relationships and links, and they can include various types of data in one visualization. They can be hierarchical, multidimensional, or tree-like.
The matrix is one of the advanced data visualization techniques; it helps determine the correlation between multiple constantly updating (streaming) data sets.