
DATA MINING

PART-A

1.Define Data mining.

Data mining refers to extracting or "mining" knowledge from large amounts of data. Data mining is a process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

2.Give some alternative terms for data mining.

• Knowledge mining

• Knowledge extraction

• Data/pattern analysis.

• Data Archaeology

3.What is KDD?

KDD is Knowledge Discovery in Databases.

4.What are the steps involved in KDD process?

• Data cleaning

• Data integration

• Data selection

• Data transformation

• Data mining

• Pattern evaluation

• Knowledge presentation

5.Mention some of the data mining techniques.

• Clustering

• Classification

• Association Rule Mining

• Web Mining


6. Give some data mining tools.

DBMiner

GeoMiner

Multimedia miner

WeblogMiner

7. Mention some of the application areas of data mining

DNA analysis

Financial data analysis

Retail Industry

Telecommunication industry

Market analysis

Banking industry

Health care analysis.

8.What is meta learning?

Concept of combining the predictions made from multiple models of data mining and

analyzing those predictions to formulate a new and previously unknown

prediction.

9. What are the challenges to data mining regarding data mining methodology and user

interaction issues?

Mining different kinds of knowledge in databases

Interactive mining of knowledge at multiple levels of abstraction

Incorporation of background knowledge

Data mining query languages and ad hoc data mining

Presentation and visualization of data mining results

Handling noisy or incomplete data

10. What are the challenges to data mining regarding performance issues?

• Efficiency and scalability of data mining algorithms

• Parallel, distributed, and incremental mining algorithms


11. What are the issues relating to the diversity of database types?

• Handling of relational and complex types of data

• Mining information from heterogeneous databases and global information

systems

12. What is meant by pattern?

A pattern represents knowledge if it is easily understood by humans; valid on test data with some degree of certainty; and potentially useful, novel, or validates a hunch about which the user was curious. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.

13. How is a data warehouse different from a database?

Data warehouse is a repository of multiple heterogeneous data sources, organized under a

unified schema at a single site in order to facilitate management decision-making. Database

consists of a collection of interrelated data.

14. Define Association Rule Mining.

Association rule mining searches for interesting relationships among items in a given data

set.

15. When we can say the association rules are interesting?

Association rules are considered interesting if they satisfy both a minimum support

threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.

16. Write the Association rule in mathematical notations.

Let I = {i1, i2, ..., im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. The rule A ⇒ B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.


17. Define support and confidence in Association rule mining.

Support s is the percentage of transactions in D that contain A ∪ B.
Confidence c is the percentage of transactions in D containing A that also contain B.

support(A ⇒ B) = P(A ∪ B)

confidence(A ⇒ B) = P(B | A)

18. How are association rules mined from large databases?

• Find all frequent item sets:

• Generate strong association rules from frequent item sets

19. What is the purpose of Apriori Algorithm?

Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean

association rules. The name of the algorithm is based on the fact that the algorithm uses prior

knowledge of frequent item set properties.

20.Define anti-monotone property.

If a set cannot pass a test, all of its supersets will fail the same test as well.

21. How to generate association rules from frequent item sets?

Association rules can be generated as follows: for each frequent itemset l, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
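A minimal Python sketch of this rule-generation step (the itemset support counts and the min_conf value below are invented for illustration; they are not taken from the text):

```python
from itertools import combinations

# Support counts for frequent itemsets assumed to have been found already
# (e.g., by Apriori). Keys are frozensets of items; values are counts.
support_count = {
    frozenset(["milk"]): 6,
    frozenset(["bread"]): 7,
    frozenset(["butter"]): 5,
    frozenset(["milk", "bread"]): 4,
    frozenset(["milk", "butter"]): 3,
    frozenset(["bread", "butter"]): 3,
    frozenset(["milk", "bread", "butter"]): 2,
}
min_conf = 0.6  # minimum confidence threshold

def generate_rules(support_count, min_conf):
    """For each frequent itemset l, output s => (l - s) for every nonempty
    proper subset s with support_count(l) / support_count(s) >= min_conf."""
    rules = []
    for l, l_count in support_count.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = l_count / support_count[s]  # subsets of a frequent itemset are frequent
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

for antecedent, consequent, conf in generate_rules(support_count, min_conf):
    print(antecedent, "=>", consequent, f"(confidence = {conf:.2f})")
```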

22. Give few techniques to improve the efficiency of Apriori algorithm.

• Hash based technique

• Transaction Reduction

• Partitioning

• Sampling

• Dynamic itemset counting


23. What are the factors that affect the performance of the Apriori candidate generation technique?

• Need to generate a huge number of candidate sets

• Need to repeatedly scan the database and check a large set of candidates by pattern matching

24. What are the steps in generating frequent item sets without candidate Generation?

Frequent-pattern growth (or FP-growth) adopts a divide-and-conquer strategy.

Steps:

Compress the database representing frequent items into a frequent-pattern tree, or FP-tree

Divide the compressed database into a set of conditional databases

Mine each conditional database separately

25. Define the concept of classification.

A model is built describing a predefined set of data classes or concepts.

The model is constructed by analyzing database tuples described by

attributes. The model is used for classification.

26. What is Decision tree?

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.

27. What is Attribute Selection Measure?

The information Gain measure is used to select the test attribute at each node in the decision tree.

Such a measure is referred to as an attribute selection measure or a measure of the goodness of

split.

28. What is Tree pruning?

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.


29. What are the Tree Pruning Methods?

• Pre pruning

• Post pruning

30. Define Pre Pruning

A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf.

The leaf may hold the most frequent class among the subset samples.

31. Define Post Pruning.

Post pruning removes branches from a "fully grown" tree. A tree node is pruned by removing its branches.

32. Define cluster analysis

Cluster analysis analyzes data objects without consulting a known class label. The class labels are not present in the training data simply because they are not known to begin with.

33. Define Clustering.

Clustering is a process of grouping the physical or conceptual data object into

clusters.

34. What do you mean by Cluster Analysis?

Cluster analysis is the process of analyzing the various clusters in order to organize the different objects into meaningful and descriptive groups.

35. What are the fields in which clustering techniques are used?

Clustering is used in biology to derive plant and animal taxonomies. Clustering is used in business to enable marketers to discover distinct groups of their customers and characterize the customer groups on the basis of purchasing patterns. Clustering is also used in the identification of groups of automobile insurance policy customers.

36. What are the requirements of cluster analysis?

The basic requirements of cluster analysis are


• Dealing with different types of attributes.

• Dealing with noisy data.

• Constraints on clustering.

• Determining input parameters, and

• Scalability

37. What are the different types of data used for cluster analysis?

The different types of data used for cluster analysis are Quantitative, binary,

Qualitative nominal, Qualitative ordinal.

38. Define Binary variables? And what are the two types of binary variables?

Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric binary variables. Symmetric variables are those whose states have the same values and weights. Asymmetric variables are those whose states do not have the same values and weights.

39. Define nominal variable.

A nominal variable is a generalization of the binary variable. A nominal variable has more than two states; for example, a nominal variable color may consist of four states: red, green, yellow, or black. For nominal variables the total number of states is N, and the states are denoted by letters, symbols or integers.

40. Define Ordinal Variable.

An ordinal variable also has more than two states but all these states are ordered in a

meaningful sequence.

41. What is meant by partitioning method?

In the partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two main types of partitioning methods are k-means and k-medoids.
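As an illustration of the k-means style of partitioning, a short sketch using scikit-learn (assuming scikit-learn is installed; the sample points and the choice k = 2 are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D objects to be arranged into k partitions (clusters)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.8, 9.0]])

k = 2  # number of partitions, chosen by the user
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

print(model.labels_)           # which partition each object fell into
print(model.cluster_centers_)  # the mean point of each partition
```

The k-medoids method differs only in that an actual object (a medoid), rather than the mean, is used as each cluster's representative point.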


42. What is Hierarchical method?

Hierarchical method groups all the objects into a tree of clusters that are arranged in a

hierarchical order. This method works on bottom-up or top-down approaches.

43. What is Agglomerative Clustering?

The agglomerative hierarchical clustering method works on the bottom-up approach. In the agglomerative hierarchical method, each object initially forms its own cluster. These singleton clusters are merged to make larger clusters, and the process of merging continues until all the singleton clusters are merged into one big cluster that contains all the objects.

44. What is Divisive Hierarchical Clustering?

Divisive Hierarchical clustering method works on the top-down approach. In this method all

the objects are arranged within a big singular cluster and the large cluster is continuously divided

into smaller clusters until each cluster has a single object.

45. Define Density based method.

Density based method deals with arbitrary shaped clusters. In density-based method,

clusters are formed on the basis of the region where the density of the objects is high.

46. What is a DBSCAN?

Density-Based Spatial Clustering of Applications with Noise is called DBSCAN. DBSCAN is a density-based clustering method that converts regions of high object density into clusters with arbitrary shapes and sizes. DBSCAN defines a cluster as a maximal set of density-connected points.
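A hedged illustration using scikit-learn's DBSCAN implementation (library availability, the toy points and the eps/min_samples values are assumptions made for this example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense regions plus one isolated point (which should end up as noise)
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
              [9.0, 1.0]])

# eps = neighbourhood radius, min_samples = density threshold
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise points
```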

47. What do you mean by Grid Based Method?

In this method objects are represented by a multi-resolution grid data structure. All the objects are quantized into a finite number of cells, and the collection of cells builds the grid structure of objects. The clustering operations are performed on that grid structure. This method is widely used because its processing time is very fast and is independent of the number of objects.


48. What is the use of Regression?

Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in essence, regression takes a set of data and fits the data to a formula.

49. What are the reasons for not using the linear regression model to estimate the output data?

There are many reasons for this. One is that the data do not fit a linear model. It is possible, however, that the data do generally represent a linear model, but the linear model generated is poor because noise or outliers exist in the data.

50. Comment on Outliers.

Outliers are data values that are exceptions to the usual and expected data.

51. Define data warehouse.

A data warehouse is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making. A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.

51.What are operational databases?

The large databases maintained by organizations and updated by daily transactions are called operational databases.

52. Define OLTP.

If an on-line operational database system is used for efficient retrieval, efficient storage and management of large amounts of data, then the system is said to be an on-line transaction processing (OLTP) system.


53. Define OLAP.

Data warehouse systems serve users (or) knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats. These systems are known as on-line analytical processing (OLAP) systems.

54. How a database design is represented in OLAP systems?

Star schema

Snowflake schema

Fact constellation schema

55. Write short notes on multidimensional data model.

Data warehouses and OLAP tools are based on a multidimensional data model. This model is used for the design of corporate data warehouses and departmental data marts. This model contains the star schema, snowflake schema and fact constellation schemas. The core of the multidimensional model is the data cube.

56. Define data cube.

It consists of a large set of facts (or) measures and a number of dimensions.

57. What are facts?

Facts are numerical measures. Facts can also be considered as quantities by which we can

analyze the relationship between dimensions.

58. What are dimensions?

Dimensions are the entities (or) perspectives with respect to an organization for keeping

records and are hierarchical in nature.

59. Define dimension table.

A dimension table is used for describing the dimension. (e.g.) A dimension table for item may contain the attributes item_name, brand and type.


60. Define fact table.

Fact table contains the name of facts (or) measures as well as keys to each of the related

dimensional tables.

61. What are lattice of cuboids?

In the data warehousing research literature, a cube is also referred to as a cuboid. For different sets of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization. The lattice of cuboids is also referred to as a data cube.

62. What is apex cuboid?

The 0-D cuboid which holds the highest level of summarization is called the apex

cuboid. The apex cuboid is typically denoted by all.

63. List out the components of star schema.

A large central table (fact table) containing the bulk of data with no redundancy. A set of

smaller attendant tables (dimension tables), one for each dimension.

64. What is snowflake schema?

The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the tables into additional tables.

65. List out the components of fact constellation schema.

This requires multiple fact tables to share dimension tables. This kind of schema can be

viewed as a collection of stars and hence it is known as galaxy schema (or) fact constellation

schema.

66. Give the major difference between the star schema and the snowflake Schema.

The dimension table of the snowflake schema model may be kept in normalized form to

reduce redundancies. Such a table is easy to maintain and saves storage space.


67. Define concept hierarchy.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to

higher-level concepts.

68. List out the OLAP operations in multidimensional data model.

Roll-up

Drill-down

Slice and dice

Pivot (or) rotate

69. What is roll-up operation?

The roll-up operation is also called drill-up operation which performs aggregation on a data

cube either by climbing up a concept hierarchy for a dimension (or) by dimension reduction.

70. What is drill-down operation?

Drill-down is the reverse of roll-up operation. It navigates from less detailed data to more

detailed data. Drill-down operation can be taken place by stepping down a concept hierarchy for a

dimension.

71.What is slice operation?

The slice operation performs a selection on one dimension of the cube resulting in a sub

cube.

72. What is dice operation?

The dice operation defines a sub cube by performing a selection on two (or) more

dimensions.

73. What is pivot operation?

This is a visualization operation that rotates the data axes in an alternative presentation of

the data.
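These OLAP operations can be imitated on a tiny fact table with pandas (assuming pandas is available; the sales figures, dimensions and column names are invented):

```python
import pandas as pd

# Tiny fact table: the sales measure along time, location and item dimensions
fact = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "item":    ["TV", "TV", "Phone", "TV"],
    "sales":   [100, 150, 200, 120],
})

# Roll-up: aggregate away the city level (dimension reduction)
rollup = fact.groupby(["quarter", "item"])["sales"].sum()

# Slice: select on one dimension (quarter = "Q1") to obtain a sub cube
slice_q1 = fact[fact["quarter"] == "Q1"]

# Pivot: rotate the axes for an alternative presentation of the data
pivoted = fact.pivot_table(index="city", columns="quarter",
                           values="sales", aggfunc="sum")

print(rollup, slice_q1, pivoted, sep="\n\n")
```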


74. List out the views in the design of a data warehouse.

Top-down view

Data source view

Data warehouse view

Business query view

75. Define ETL.

The ETL process involves extracting, transforming and loading data from source systems.

76. List out the steps of the data warehouse design process.

Choose a business process to model.

Choose the grain of the business process

Choose the dimensions that will apply to each fact table record.

Choose the measures that will populate each fact table record.

77. Define ROLAP.

The ROLAP model is an extended relational DBMS that maps operations on

multidimensional data to standard relational operations.

78. Define MOLAP.

The MOLAP model is a special purpose server that directly implements multidimensional

data and operations.

79. Define HOLAP.

The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from

the greater scalability of ROLAP and the faster computation of MOLAP,(i.e.) a HOLAP server

may allow large volumes of detail data to be stored in a relational database, while aggregations are

kept in a separate MOLAP store.


80. What is enterprise warehouse?

An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one (or) more operational systems (or) external information providers. It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes (or) beyond.

81. What is data mart?

Data mart is a database that contains a subset of data present in a data warehouse. Data

marts are usually implemented on low-cost departmental servers that are UNIX (or) windows/NT

based. The implementation cycle of the data mart is likely to be measured in weeks rather than

months (or) years.

82. What are dependent and independent data marts?

Dependent data marts are sourced directly from enterprise data warehouses. Independent data marts are data captured from one (or) more operational systems (or) external information providers, (or) data generated locally within a particular department (or) geographic area.

83. Define metadata.

Metadata in a data warehouse describes data about data; (i.e.) metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.

84. What is web mining?

Web mining is the technique of processing information available on the Web and searching for useful data. It is used to discover web pages, text documents, multimedia files, images, and other types of resources from the Web. It is used in several fields such as e-commerce, information filtering, fraud detection, education and research.

85. What is Web Content mining?

Web content mining examines the content of Web pages as well as results of Web

searching. The content includes text as well as graphics data.


86. What is a Crawler?

A robot(or spider or crawler) is a program that traverses the hypertext structure in the Web.

87. What is Web Usage Mining?

Web usage mining performs mining on Web usage data, or Web logs. A Web log is a listing

of page reference data. It is also called click stream data because each entry corresponds to a

mouse click.

88. What is a Harvest System?

The Harvest system is based on the use of caching, indexing, and crawling. Harvest is

actually a set of tools that facilitate gathering of information from diverse sources.

89. Define spatial data mining.

Spatial data mining is the extraction of undiscovered and implied spatial information. Spatial data is data that is associated with a location. Spatial data mining is used in several fields such as geography, geology, medical imaging, etc.

90. What is Information Privacy?

Information privacy is the individual's ability to control the circulation of information relating to him/her.

91. What is anonymization?

Anonymization is the removal of identifying information from data in order to protect the privacy of the data subjects while allowing the data so modified to be used for secondary purposes like data mining.

92. Give any two pitfalls of Data Mining.

Intentional abuse

Security breach


PART-B

1. Explain the steps in Data mining

There are various steps involved in mining data, as described below; a minimal preprocessing sketch in Python follows the list.

1. Data Integration: First of all the data are collected and integrated from all the different sources.

2. Data Selection: We may not need all the data we have collected in the first step. So in this step we select only those data which we think are useful for data mining.

3. Data Cleaning: The data we have collected are not clean and may contain errors, missing

values, noisy or inconsistent data. So we need to apply different techniques to get rid of

such anomalies.

4. Data Transformation: The data even after cleaning are not ready for mining as we need to

transform them into forms appropriate for mining. The techniques used to accomplish this

are smoothing, aggregation, normalization etc.

5. Data Mining: Now we are ready to apply data mining techniques on the data to discover

the interesting patterns. Techniques like clustering and association analysis are among the

many different techniques used for data mining.

6. Pattern Evaluation and Knowledge Presentation: This step involves visualization,

transformation, removing redundant patterns etc from the patterns we generated.

7. Decisions / Use of Discovered Knowledge: This step helps user to make use of the

knowledge acquired to take better decisions.
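A minimal sketch of the selection, cleaning and transformation steps (assuming pandas is available; the customer table and its column names are invented for illustration):

```python
import pandas as pd

# Steps 1-2: integrated data, from which useful attributes are selected
raw = pd.DataFrame({
    "age":    [25, 32, None, 41, 29],
    "income": [30000, 54000, 41000, None, 38000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})
selected = raw[["age", "income"]]            # data selection

# Step 3: data cleaning -- fill missing values with the column mean
cleaned = selected.fillna(selected.mean())

# Step 4: data transformation -- min-max normalization to [0, 1]
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())

print(normalized)  # the data is now ready for a mining step such as clustering
```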

2. Explain the KDD process.


Knowledge discovery as a process is depicted and consists of an iterative sequence of the following

steps:

1. Data cleaning: to remove noise and inconsistent data

2. Data integration: where multiple data sources may be combined

3. Data selection: where data relevant to the analysis task are retrieved from the database

4. Data transformation: where data are transformed or consolidated into forms appropriate for

mining by performing summary or aggregation operations.

5. Data mining: an essential process where intelligent methods are applied in order to extract

data patterns.

6. Pattern evaluation to identify the truly interesting patterns representing knowledge based on

some interestingness measures;

7. Knowledge presentation where visualization and knowledge representation techniques are

used to present the mined knowledge to the user.

Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The

data mining step may interact with the user or a knowledge base.

The interesting patterns are presented to the user and may be stored as new knowledge in the

knowledge base. Data mining is only one step in the entire process but an essential one because it

uncovers hidden patterns for evaluation.

Therefore, data mining is a step in the knowledge discovery process.

3. Explain the concept of Frequent Itemsets, Closed Itemsets and Association Rules.

Frequent Itemsets:

1. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset.

2. The occurrence frequency (called support count, frequency or count)of an itemset is the

number of transactions that contain the itemset.


3. If the relative support of an itemset I satisfies a prespecified minimum support threshold

(i.e., the absolute support of I satisfies the corresponding minimum support count

threshold), then I is a frequent itemset.

4. The set of frequent k-itemsets is commonly denoted by Lk.

Closed Itemsets:

An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S. An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S.

Association Rules:

1. They are used for finding patterns or relationship in the database.

2. For example, the information that customers who purchase computers also tend to buy

antivirus software at the same time is represented in Association Rule below:

i. Computer → antivirus_software [support = 2%; confidence = 60%]

1. Rule support and confidence are two measures of rule interestingness. They respectively

reflect the usefulness and certainty of discovered rules.

2. A support of 2% for Association Rule means that 2% of all the transactions under analysis

show that computer and antivirus software are purchased together.

3. A confidence of 60% means that 60% of the customers who purchased a computer also

bought the software.

4. Typically, association rules are considered interesting if they satisfy both a minimum

support threshold and a minimum confidence threshold.

5. Such thresholds can be set by users or domain experts. Additional analysis can be

performed to uncover interesting statistical correlations between associated items.

4. Define and explain

i) Support ii) Confidence


Support: The support of an itemset expresses how often the itemset appears in the transactions of the database, i.e., the support of an item is the percentage of transactions in which that item (or items) occurs.

For an association rule, A→B

Support(A→B) = (number of tuples containing both A and B) / (total number of tuples)

Confidence: The confidence or strength for an association rule A→B is the ratio of the number of

transaction that contain A∪B to the number of transactions that contain A. Confidence is defined as

the measure of certainty associated with each discovered pattern.

For an association rule, A→B

Confidence(A→B) = (number of tuples containing both A and B) / (number of tuples containing A)
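A small worked example in Python (the five transactions and the rule milk → bread are invented purely to illustrate the two formulas):

```python
# Five invented transactions; compute support and confidence for milk -> bread
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "bread"},
]

n = len(transactions)
count_A  = sum("milk" in t for t in transactions)             # tuples containing A
count_AB = sum({"milk", "bread"} <= t for t in transactions)  # tuples containing A and B

support    = count_AB / n        # 3 / 5 = 0.60
confidence = count_AB / count_A  # 3 / 4 = 0.75
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```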

5. Define and explain:

i) Gini Index ii) Information Gain

i)Gini Index:

The Gini index is used in CART. The Gini index measures the impurity of D, a data partition or set

of training tuples, as

Gini(D) = 1 − Σ(i=1..m) pi²

where pi is the probability that a tuple in D belongs to class Ci.

The Gini index considers a binary split for each attribute.

For each attribute, each of the possible binary splits is considered. For a discrete-valued attribute,

the subset that gives the minimum gini index for that attribute is selected as its splitting subset.


For continuous-valued attributes, each possible split-point must be considered. The point giving the minimum Gini index for a given (continuous-valued) attribute is taken as the split-point of that attribute.

The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini

index) is selected as the splitting attribute. This attribute and either its splitting subset (for a

discrete-valued splitting attribute) or split-point (for a continuous valued splitting attribute) together

form the splitting criterion.
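A minimal sketch of the computation (the class labels and the candidate binary split are invented; the weighted form GiniA(D) = |D1|/|D|·Gini(D1) + |D2|/|D|·Gini(D2) is used for the split):

```python
def gini(labels):
    """Gini(D) = 1 - sum over classes of p_i^2 for the partition D."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted Gini index of a binary split of D into D1 and D2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

D = ["yes", "yes", "no", "no", "yes", "no"]          # invented class labels
D1, D2 = ["yes", "yes", "yes"], ["no", "no", "no"]   # one candidate binary split

print(gini(D))             # impurity before splitting: 0.5
print(gini_split(D1, D2))  # a pure split gives 0.0, i.e. maximal impurity reduction
```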

ii)Information Gain:

ID3 uses information gain as its attribute selection measure. Let node N represent or hold the

tuples of partition D. The attribute with the highest information gain is chosen as the splitting

attribute for node N. This attribute minimizes the information needed to classify the tuples in the

resulting partitions and reflects the least randomness in these partitions.

Such an approach minimizes the expected number of tests needed to classify a given tuple and

guarantees that a simple (but not necessarily the simplest) tree is found. The expected information

needed to classify a tuple in D is given by

Info(D) = − Σ(i=1..m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci.

Information gain is defined as the difference between the original information requirement (i.e.,

based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning

on A). That is,

Gain(A) = Info(D) − InfoA(D)

InfoA(D) is the expected information required to classify a tuple from D based on the partitioning

by A. The smaller the expected information (still) required, the greater the purity of the partitions.


In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected

reduction in the information requirement caused by knowing the value of A. The attribute A with

the highest information gain, (Gain(A)), is chosen as the splitting attribute at node N. This is

equivalent to saying that we want to partition on the attribute A that would do the "best classification," so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum InfoA(D)).
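A short Python sketch of Info(D), InfoA(D) and Gain(A) (the toy attribute and class labels are invented):

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy): Info(D) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(labels, attribute_values):
    """Gain(A) = Info(D) - InfoA(D), where InfoA(D) is the weighted entropy
    of the partitions of D induced by the values of attribute A."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    info_a = sum(len(p) / n * info(p) for p in partitions.values())
    return info(labels) - info_a

# Invented toy data: does the attribute "outlook" help predict the class "play"?
play    = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rain", "rain", "sunny", "rain"]
print(gain(play, outlook))  # 1.0 here, since the split produces pure partitions
```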

6. Write short note on Market Basket Analysis.

1. Frequent itemset mining leads to the discovery of associations and correlations among items

in large transactional or relational data sets. The discovery of interesting correlation

relationships among huge amounts of business transaction records can help in many

business decision-making processes, such as catalog design, cross-marketing, and customer

shopping behavior analysis.

2. A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets".

3. The discovery of such associations can help retailers develop marketing strategies by

gaining insight into which items are frequently purchased together by customers.

4. For instance, if customers are buying milk, how likely are they to also buy bread (and what

kind of bread) on the same trip to the supermarket? Such information can lead to increased

sales by helping retailers do selective marketing and plan their shelf space.


5. Example: Suppose, as manager of a store, one would like to learn more about the buying

habits of the customers like which groups or sets of items are customers likely to purchase

on a given trip to the store.

6. Market basket analysis may be performed on the retail data of customer transactions at the

store. The results can be used to plan marketing or advertising strategies, or in the design of

a new catalog.

7. For instance, market basket analysis may help design different store layouts.

8. In one strategy, items that are frequently purchased together can be placed in proximity in

order to further encourage the sale of such items together.

9. In an alternative strategy, placing hardware and software at opposite ends of the store may

entice customers who purchase such items to pick up other items along the way.

10. Market basket analysis can also help retailers plan which items to put on sale at reduced

prices.

11. Each item in the store has a Boolean variable representing the presence or absence of that

item. Each basket can then be represented by a Boolean vector of values assigned to these

variables.

12. The Boolean vectors can be analyzed for buying patterns that reflect items that are

frequently associated or purchased together. These patterns can be represented in the form

of association rules.
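A tiny illustration of the Boolean-vector representation described in points 11 and 12 above (the store items and baskets are invented):

```python
# Each basket is converted to a Boolean vector over the store's items
items = ["milk", "bread", "butter", "beer"]
baskets = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "beer"},
]

boolean_vectors = [[int(item in basket) for item in items] for basket in baskets]
for vec in boolean_vectors:
    print(vec)
# [1, 1, 0, 0]
# [0, 1, 1, 0]
# [1, 1, 0, 1]
```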

7. Write a short note on improving the efficiency of Apriori.

The variations of Apriori to improve the efficiency of the original algorithm are:

1. Hash-based technique (hashing itemsets into corresponding buckets):

A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.

2. Transaction reduction (reducing the number of transactions scanned in future

iterations):

A transaction that does not contain any frequent k-itemsets is useless in subsequent scans.


3. Partitioning (partitioning the data to find candidate itemsets):

Any itemset that is potentially frequent with respect to the database D must occur as a frequent itemset in at least one of the partitions.

4. Sampling (mining on a subset of the given data):

Mine on a subset of the given data with a lower support threshold, together with a method to determine the completeness.

5. Dynamic itemset counting (adding candidate itemsets at different points during a

scan):

Add new candidate itemsets only when all of their subsets are estimated to be frequent.

8. Explain direct hashing pruning.

The direct hashing and pruning (DHP) algorithm claims to be efficient in the generation of frequent itemsets and effective in trimming the transaction database by discarding items from the transactions, or removing whole transactions, when they no longer need to be scanned. The algorithm uses a hash-based technique to reduce the number of candidate itemsets generated in the first pass. It is claimed that the number of itemsets in C2 generated using DHP can be orders of magnitude smaller, so that the scan required to determine L2 is more efficient.

The algorithm may be divided into the following three parts:

1. The first part finds all the frequent 1-itemsets and all the candidate 2-itemsets.

2. The second part is the more general part including hashing

3. The third part is without the hashing.

Both the second and third parts include pruning. Part 2 is used for early iterations and Part 3 for later iterations.

Part-1:


Essentially the algorithm goes through each transaction counting all the 1-itemsets.

At the same time all the possible 2-itemsets in the current transaction are hashed to a hash table.

The algorithm uses the hash table in the next pass to reduce the number of candidate itemsets. Each bucket in the hash table has a count, which is increased by one each time an itemset is hashed to that bucket. Collisions can occur when different itemsets are hashed to the same bucket. A bit vector is associated with the hash table to provide a flag for each bucket. If the bucket count is equal to or above the minimum support count, the corresponding flag in the bit vector is set to 1, otherwise it is set to 0.
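A simplified sketch of this first pass (the transactions, the number of buckets, the support threshold and the use of Python's built-in hash as the hash function are illustrative assumptions, not the choices made by the DHP authors):

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
NUM_BUCKETS = 7
min_support_count = 2

bucket_count = [0] * NUM_BUCKETS
item_count = {}

# One pass: count 1-itemsets and hash every 2-itemset of each transaction
for t in transactions:
    for item in t:
        item_count[item] = item_count.get(item, 0) + 1
    for pair in combinations(sorted(t), 2):
        bucket_count[hash(pair) % NUM_BUCKETS] += 1   # built-in hash as a stand-in

# Bit vector: flag is 1 if the bucket count reaches the minimum support count
bit_vector = [int(c >= min_support_count) for c in bucket_count]

# A candidate 2-itemset is kept only if its bucket's flag is set
frequent_items = [i for i, c in item_count.items() if c >= min_support_count]
C2 = [pair for pair in combinations(sorted(frequent_items), 2)
      if bit_vector[hash(pair) % NUM_BUCKETS]]
print(C2)
```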

Part-2:

This part has two phases. In the first phase, Ck is generated. In the Apriori algorithm Ck is generated by joining Lk−1 with Lk−1, but the DHP algorithm uses the hash table to reduce the number of candidate itemsets in Ck. An itemset is included in Ck only if the corresponding bit in the hash table bit vector has been set, that is, the number of itemsets hashed to that location is greater than the support count. Although having the corresponding bit vector bit set does not guarantee that the itemset is frequent, due to collisions, the hash table filtering does reduce Ck significantly. Each itemset that passes the hash filtering is a member of Ck and is stored in a hash tree, which is used to count the support for each itemset in the second phase of this part.

In the second phase, the hash table for the next step is generated. Both during the support counting and when the hash table is generated, pruning of the database is carried out. Only itemsets that are important to future steps are kept in the database. A k-itemset is not considered useful in a frequent (k+1)-itemset unless it appears at least k times in a transaction. The pruning not only trims each transaction by removing the unwanted items but also removes transactions that have no itemsets that could be frequent.

Part-3:

The third part of the algorithm continues until there are no more candidate itemsets.

Instead of using a hash table to find the frequent itemsets, the transaction database is now scanned

to find the support count for each itemset. The database is likely to be now significantly smaller

because of the pruning. When the support count is established the algorithm determines the

frequent itemsets as before by checking against the minimum support. The algorithm then generates

candidate itemsets like the Apriori algorithm does.

9. What are the Guidelines for successful Data Mining?


1. Data mining projects should be carried out by small teams with a strong internal integration and a loose management style.

2. Before starting a major data mining project, it is recommended that a small pilot project be carried out. This may involve a steep learning curve for the project team. This is of vital importance.

3. A clear problem owner should be identified who is responsible for the project.

Preferably such a person should not be a technical analyst or a consultant but

someone with direct business responsibility, for example someone in a sales or

marketing environment. This will benefit the external integration.

4. The positive return on investment should be realised within 6 to 12 months.

5. Since the roll-out of the results in a data mining application mostly involves larger groups of people and is technically less complex, it should be a separate and more strictly managed project.

6. The whole project should have the support of the top management of the company.

10. Write short note on K-Medoids.

Basic Principle of K-Medoids:

Instead of taking the mean value of the objects in a cluster as a reference point, actual

objects can be picked to represent the clusters, using one representative object per

cluster.

Each remaining object is clustered with the representative object to which it is the most

similar.

The partitioning method is then performed based on the principle of minimizing the sum

of the dissimilarities between each object(cost function) and its corresponding reference

point.

That is, an absolute-error criterion is used, defined as


E = Σ(j=1..k) Σ(p ∈ Cj) |p − oj|

where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given object in cluster Cj; and oj is the representative object of Cj.

In general, the algorithm iterates until, eventually, each representative object is actually

the medoid, or most centrally located object, of its cluster.

To determine whether a nonrepresentative object, orandom, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects, p.

Case 1: p currently belongs to representative object, oj. If oj is replaced by

orandom as a representative object and p is closest to one of the other representative

objects, oi, i ≠ j, then p is reassigned to oi.

Case 2: p currently belongs to representative object, oj. If oj is replaced by orandom as

a representative object and p is closest to orandom, then p is reassigned to orandom.

Case 3: p currently belongs to representative object, oi, i ≠ j. If oj is replaced by

orandom as a representative object and p is still closest to oi, then the assignment does not

change.

Case 4: p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom

as a representative object and p is closest to orandom, then p is reassigned to orandom


PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects.

11. List the Applications of Clustering:

Market research: can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.

Pattern recognition.

Data analysis: it can also be used to help classify documents on the Web for information discovery.

Image processing.

Biology: can be used to derive plant and animal taxonomies, categorize genes with

similar functionality, and gain insight into structures inherent in populations.

Geography: help in the identification of areas of similar land use in an earth

observation database and in the identification of groups of houses in a city according to

house type, value, and geographic location

Automobile insurance: the identification of groups of automobile insurance policy

holders with a high average claim cost

Outlier detection: the detection of credit card fraud and the monitoring of criminal

activities in electronic commerce.

12. Give the methods for estimating the accuracy of classification.

Holdout Method:

The holdout method requires a training set and a test set. It is sometimes called the

test sample method. The sets are mutually exclusive. It may be that only one dataset is available

which has been divided into two subsets (perhaps 2/3 and 1/3), the training subset and the test or

holdout subset. Once the classification method produces the model using the training set, the test

set can be used to estimate the accuracy. Interesting questions arise in this estimation since a larger

training set would produce a better estimate of the accuracy. A balance must be achieved. Since none of the test data is used in training, the estimate is not biased, but a good estimate is obtained only if the test set and the training set are large enough and representative of the whole population.

Random Sub-Sampling Method:


Random sub-sampling is very much like the holdout method except that it does not rely on a single test set. Essentially, the holdout estimation is repeated several times and the accuracy estimate is obtained by computing the mean of the several trials. Random sub-sampling is likely to produce a better error estimate than that of the holdout method.

K-fold Cross-validation Method:

In k-fold cross-validation, the available data is randomly divided into k disjoint subsets of approximately equal size. One of the subsets is then used as the test set and the remaining k−1 sets are used for building the classifier. The test set is then used to estimate the accuracy. This is done repeatedly k times so that each subset is used as a test subset once. The accuracy estimate is then the mean of the estimates for each of the classifiers. Cross-validation has been tested extensively and has been found to generally work well when sufficient data is available. A value of 10 for k has been found to be adequate and accurate.
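A hedged scikit-learn illustration of 10-fold cross-validation (assuming scikit-learn is available; the iris data set and the decision tree classifier are arbitrary choices for the example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each of the 10 disjoint subsets is used once as the
# test set while the remaining 9 are used to build the classifier.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # accuracy estimate = mean of the 10 per-fold estimates
```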

Leave-one –out Method:

Leave-one-out is a simpler version of k-fold cross-validation. In this method, one of the training samples is taken out and the model is generated using the remaining training data. Once the model is built, the one remaining sample is used for testing and the result is coded as 1 or 0 depending on whether it was classified correctly or not. The average of such results provides an estimate of the accuracy. The leave-one-out method is useful when the dataset is small. For large training datasets, leave-one-out can become expensive since many iterations are required. Leave-one-out is unbiased but has high variance and is therefore not particularly reliable.

Bootstrap Method:

In this method, given a dataset of size n, a bootstrap sample is randomly selected uniformly with replacement (that is, a sample may be selected more than once) by sampling n

times and used to build a model. It can be shown that only 63.2% of these samples are unique. The

error in building the model is estimated by using the remaining 36.8% of objects that are not in the

bootstrap sample. The final error is then computed as 0.368 times the training error plus 0.632 times the testing error. The figures 0.632 and 0.368 are based on the assumption that if there were n samples available initially from which n samples with replacement were randomly selected for training data, then the expected percentage of unique samples in the training data would be 0.632 or 63.2% and the number of remaining unique objects used in the test data would be 0.368 or 36.8% of the initial sample. This is repeated and the average of error estimates is obtained. The bootstrap method is unbiased and, in contrast to leave-one-out, has low variance, but many iterations are needed for good error estimates if the sample is small, say 25 or less.
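A small numpy check of the 63.2% figure and of the weighted estimate (numpy is assumed; the training and testing error values at the end are invented purely to show the weighting):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Draw a bootstrap sample of size n with replacement from n objects
sample = rng.integers(0, n, size=n)
print(len(np.unique(sample)) / n)   # fraction of unique objects, close to 0.632

# Weighted estimate combining the two parts (values below are illustrative only)
training_error, testing_error = 0.05, 0.20
print(0.368 * training_error + 0.632 * testing_error)
```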

k-fold cross-validation, leave-one-out and bootstrap techniques are used in many decision tree packages including CART and Clementine.

13. Explain the evaluation criteria for classification .

Criteria for evaluation for classification methods, like those for all data mining techniques,

are:

1. Speed

2. Robustness

3. Scalability

4. interpretability

5. goodness of the model

6. flexibility

7. time complexity

Speed:

Speed involves not just the time or computation cost of constructing a model (e.g. a decision tree); it also includes the time required to learn to use the model. Obviously a user wishes to minimize both times, although it has to be understood that any significant data mining project will take time to plan and prepare the data. If the problem to be solved is large, a careful study of the methods available may need to be carried out so that an efficient classification method may be chosen.

Robustness:

Data errors are common, in particular when data is being collected from a number of

sources and errors may remain even after data cleaning. It is therefore desirable that a method be

able to produce good results in spite of some errors and missing values in datasets.

Scalability:

Many data mining methods were originally designed for small datasets. Many have been

modified to deal with large problems. Given that large datasets are becoming common, it is

desirable that a method continues to work efficiently for large disk-resident databases as well.

Interpretability:


An important task of the data mining professional is to ensure that the results of data mining are explained to the decision makers. It is therefore desirable that the end user is able to understand and gain insight from the results produced by the classification method.

Goodness of the model:

For a model to be effective, it needs to fit the problem that is being solved. For example, in decision tree classification, it is desirable to find a decision tree of the "right" size and compactness with high accuracy.

14. Discuss various distance measure used for computing distance in cluster analysis.

Most cluster analysis methods are based on measuring similarity between objects by

computing the distance between each pair.

Distance is a well understood concept that has a number of simple properties.

1. Distance is always positive.

2. Distance from point x to itself is always zero.

3. Distance from point x to point y cannot be greater than the sum of the distance from

x to some other point z and distance from z to y.

4. Distance from x to y is always the same as from y to x.

Before the data is subjected to cluster analysis, it is best to ensure that only the most

descriptive and discriminatory attributes are used in the input. Furthermore the unit of

measurement of data can significantly influence the result of analysis.

Let the distance between two points x and y (both vectors) be D(x,y).

Euclidean Distance:

Euclidean distance or the L2 norm of the difference vector is most commonly used

to compute distances and has an intuitive appeal but the largest valued attribute may

dominate the distance. It is therefore essential that the attributes are properly scaled.

D(x, y) = (Σi (xi − yi)²)^(1/2)

It is possible to use this distance measure without the square root if one wanted to place greater

weight on differences that are large.

A Euclidean distance measure is more appropriate when the data is not standardized, but as noted

above the distance measure can be greatly affected by the scale of the data.


Manhattan Distance:

Another commonly used distance metric is the Manhattan distance, or the L1 norm of the difference vector. In most cases, the results obtained by the Manhattan distance are similar to those obtained by using the Euclidean distance. Once again the largest-valued attribute can dominate the distance, although not as much as with the Euclidean distance.

D(x, y) = Σi |xi − yi|

Chebyshev distance:

This distance metric is based on the maximum attribute difference. It is also called the L∞ norm of the difference vector.

D(x, y) = maxi |xi − yi|

Categorical data distance:

This distance measure may be used if many attributes have categorical values with only a small number of values (e.g. binary values). Let N be the total number of categorical attributes.

D(x, y) = (number of attributes for which xi and yi differ) / N
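The four measures written out as small Python functions (numpy is assumed; the example vectors are invented):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))   # L2 norm of the difference vector

def manhattan(x, y):
    return np.sum(np.abs(x - y))           # L1 norm

def chebyshev(x, y):
    return np.max(np.abs(x - y))           # L-infinity norm (maximum attribute difference)

def categorical(x, y):
    # fraction of categorical attributes on which the two objects differ
    return sum(a != b for a, b in zip(x, y)) / len(x)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
print(categorical(["red", "S", "yes"], ["red", "M", "no"]))  # 2 of 3 differ -> 0.67
```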

15. Describe the Methods for computing distance between clusters.

The hierarchical clustering methods require the distance between clusters to be computed. These distance metrics are often called linkage metrics. Computing distances between large clusters can be expensive.

The following methods for computing distances between clusters:

1. Single-link algorithm

2. Complete-link algorithm

3. Centroid algorithm

4. Average-link algorithm

5. Ward's minimum-variance algorithm

1. Single-link algorithm:

The single-link algorithm is perhaps the simplest algorithm for computing distance

between two clusters. The algorithm determines the distance between two clusters as the

minimum of the distances between all pairs of points (a, x) where a is from the first cluster

and x is from the second. The algorithm therefore requires that all pairwise distances be


computed and the smallest distance found. The algorithm can form chains and can form elongated clusters.


2. Complete-link:

The complete-link algorithm is also called the farthest neighbor algorithm. In this algorithm

the distance between two clusters is defined as the maximum of the pairwise distances (a, x). Therefore, if there are m elements in one cluster and n in the other, all mn pairwise distances must be computed and the largest chosen.

Complete-link is strongly biased towards compact clusters. The figure below shows two clusters A and B and the complete-link distance between them. Complete-link can be distorted by moderate outliers in one or both of the clusters.

(Figure: two clusters of points A and B; the complete-link distance is the maximum distance between a point in A and a point in B.)

Both single-link and complete-link measures have their difficulties. In the single-link algorithm each cluster may have an outlier, and the two outliers may be nearby, so the distance between the two clusters would be computed to be small. Single-link can form chains of objects as clusters are combined since there is no constraint on the distance between objects that are far away from each other.

On the other hand, the two outliers may be very far away although the clusters are nearby

and the complete-link algorithm will compute the distance as large. The complete-link algorithm


generally works well, but if a cluster is naturally of an elongated shape, say like a banana, then perhaps complete-link is not appropriate.

3. Centroid:

The centroid algorithm computes the distance between two clusters as the distance between

the average point of each of the two clusters. Usually the squared Euclidean distance between the

centroids is used. This approach is easy and generally works well and is more tolerant of somewhat

longer clusters than the complete-link algorithm.

4. Average-link:

The average-link algorithm on the other hand computes the distance between two clusters as

the average of all pairwise distances between an object from one cluster and another from the other

cluster. Therefore if there are m elements in one cluster and n in the other, there are mn distances to

be computed, added and divided by mn. This approach also generally works well. It tends to join

clusters with small variances although it is more tolerant of somewhat longer clusters than the

complete-link algorithm.

5. Ward’s minimum-variance Method:

The method generally works well and results in creating small tight clusters. Ward‘s

distance is the difference between the total within-cluster sum of squares for the two clusters taken separately and the within-cluster sum of squares resulting from merging the two clusters.

An expression for Ward‘s distance may be derived. It may be expressed as follows:

DW(A,B) = NA NB DC(A,B) / (NA + NB)

where DW(A,B) is Ward's minimum-variance distance between clusters A and B with NA and NB objects in them respectively. DC(A,B) is the centroid distance between the two clusters, computed as the squared Euclidean distance between the centroids. It has been observed that Ward's method tends to join clusters with a small number of objects and is biased towards

producing clusters with roughly the same number of objects. The distance measure can be sensitive

to outliers.
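As a rough illustration of the five linkage measures described above, the following Python sketch computes each of them for two small clusters of points. The centroid and Ward's measures use the squared Euclidean distance between centroids, following the definitions in this answer; the data itself is invented for illustration.

import numpy as np

def pairwise(A, B):
    # all pairwise Euclidean distances between points of cluster A and cluster B
    return np.array([[np.linalg.norm(a - b) for b in B] for a in A])

def single_link(A, B):
    return pairwise(A, B).min()

def complete_link(A, B):
    return pairwise(A, B).max()

def average_link(A, B):
    return pairwise(A, B).mean()

def centroid_dist(A, B):
    # squared Euclidean distance between the two cluster centroids
    return np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

def ward_dist(A, B):
    # DW(A,B) = NA * NB * DC(A,B) / (NA + NB)
    nA, nB = len(A), len(B)
    return nA * nB * centroid_dist(A, B) / (nA + nB)

A = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
B = np.array([[6.0, 5.0], [7.0, 6.0]])
for f in (single_link, complete_link, average_link, centroid_dist, ward_dist):
    print(f.__name__, round(float(f(A, B)), 3))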

16. Discuss the advantages & disadvantages of hierarchical approaches.

Advantages of the hierarchical approach:

a. The hierarchical approach can provide more insight into the data by showing a hierarchy of clusters, rather than the flat cluster structure created by a partitioning method like the K-Means method.


b. Hierarchical methods are commonly simpler and can be implemented easily.

c. In some applications only proximity data is available and then the hierarchical

approach may be better.

d. Hierarchical methods can provide clusters at different levels of granularity.

Disadvantages of the hierarchical approach:

1. The hierarchical methods do not include a mechanism by which objects that

have been incorrectly put in a cluster may be reassigned to another cluster.

2. The time complexity of hierarchical methods can be shown to be O(n^3).

3. The distance matrix requires O(n^2) space and becomes very large for a large number of objects.

4. Different distance metrics and scaling of data can significantly affect the results.

Hierarchical methods are quite different from partition-based clustering. In hierarchical

methods there is no need to specify the number of clusters and the cluster seeds. The

agglomerative approach essentially starts with each object in an individual cluster and then

merges clusters to build larger and larger clusters. The divisive approach is the opposite of

that. There is no metric to judge the quality of results.

17. Discuss DBSCAN Method.

DBSCAN (Density based spatial clustering of applications with noise) is one example of a

density-based method for clustering. The method was designed for spatial databases but can

be used in other applications.

It requires two input parameters:

1. The size of the neighbourhood (R).

2. The minimum number of points in the neighbourhood (N).

Essentially these two parameters determine the density within the clusters the user is willing

to accept since they specify how many points must be in a region. The number of points not

only determines the density of acceptable clusters but it also determines which objects will be

labeled outliers or noise. Objects are declared to be outliers if there are few other objects in

their neighbourhood. The size parameter R determines the size of the clusters found. If R is

big enough, there would be one big cluster and no outliers. If R is small, there will be small

dense clusters and there might be many outliers.


A number of concepts are required in the DBSCAN method:

1. Neighbourhood:

The neighbourhood of an object y is defined as all the objects that are within

the radius R from y.

2. Core object:

An object y is called a core object if there are at least N objects within its neighbourhood.

3. Proximity:

Two objects are defined to be in proximity to each other if they belong to the

same cluster. Object x1 is in proximity to object x2 if two conditions are

satisfied:

(a) The objects are close enough to each other, i.e. within a distance of R.

(b) x2 is a core object as defined above.

4. Connectivity:

Two objects x1 and xn are connected if there is a path or chain of objects x1,

x2,……… xn from x1 to xn such that each xi+1 is in proximity to object xi.

The basic algorithm for density-based clustering:

1. Select values of R and N.

2. Arbitrarily select an object p.

3. Retrieve all objects that are connected to p, given R and N.

4. If p is a core object, a cluster is formed.

5. If p is a border object, no objects are in its proximity. Choose another object and go to step 3.

6. Continue the process until all of the objects have been processed.

Essentially the above algorithm forms clusters by connecting neighbouring core objects and those non-core objects that lie on the boundaries of clusters. The remaining non-core objects are labelled outliers. This is useful if one is trying to cluster data with a significant level of noise, since density-based methods can extract clusters in spite of high levels of noise.


The performance of DBSCAN has been found to be sensitive to the parameters R

and N and sometimes the user has no sound basis to choose good values of these

parameters. If R and N are not chosen carefully, an algorithm like DBSCAN may

result in either a large number of very small clusters or a few very large clusters.
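A minimal sketch of the method, using scikit-learn's DBSCAN implementation (assumed available): eps plays the role of the neighbourhood size R and min_samples the minimum number of points N. The sample data and parameter values are illustrative assumptions only.

import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups of points plus one isolated point (an outlier)
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
              [9.0, 0.0]])

# R (eps) and N (min_samples) determine the density the user is willing to accept
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print(labels)                      # e.g. [0 0 0 1 1 1 -1]; label -1 marks noise/outliers
print("outliers:", X[labels == -1])

Choosing a larger eps would tend to merge the groups into one big cluster with no outliers, while a very small eps would leave many points labelled as noise, which is exactly the sensitivity to R and N described above.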

18. Write short notes on over fitting and Pruning .

The decision tree building algorithm given earlier continues until either all leaf nodes are single-class nodes or no more attributes are available for splitting a node that has objects of more than one

class. When the objects being classified have a large number of attributes and a tree of maximum

possible depth is built, the tree quality may not be high since the tree is built to deal correctly with

the training set. In fact, in order to do so, it may become quite complex, with long and very uneven

paths. Some branches of the tree may reflect anomalies due to noise or outliers in the training

samples. Such decision trees are a result of over fitting the training data and may result in poor

accuracy for unseen samples.

According to the Occam‘s razor principle (due to the medieval philosopher William of

Occam) it is best to posit that the world is inherently simple and to choose the simplest model from similar models, since the simplest model is more likely to be a better model. One can therefore "shave off" nodes and branches of a decision tree, essentially replacing a whole sub tree by a leaf node, if

it can be established that the error rate in the sub tree is greater than that in the single leaf. This

makes the classifier simpler. A simpler model has less chance of introducing inconsistencies,

ambiguities and redundancies.

Pruning is a technique to make an over fitted decision tree simpler and more general. There

are a number of techniques for pruning a decision tree by removing some splits and sub trees

created by them. One approach involves removing branches from a ―fully grown‖ tree to obtain a

sequence of progressively pruned trees. The accuracy of these trees is then computed and pruned

tree that is accurate enough and simple enough is selected. It is advisable to use a set of data different from the training data to decide which is the "best pruned tree".

Another approach is called pre-pruning in which tree construction is halted early.

Essentially a node is not split if this would result in the goodness measure of the tree falling below

a threshold. It is, however, quite difficult to choose an appropriate threshold.
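A rough sketch of post-pruning along the lines described above, using scikit-learn's cost-complexity pruning (one of several possible pruning techniques, used here purely for illustration): a fully grown tree yields a sequence of progressively pruned trees, and a data set separate from the training data is used to pick the best pruned tree. The dataset and random seeds are assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# keep a validation set separate from the training data to judge the pruned trees
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# the pruning path yields a sequence of alpha values, i.e. progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best = None
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # accuracy on unseen data, not training data
    if best is None or score >= best[0]:
        best = (score, alpha, tree.get_n_leaves())

print("best validation accuracy %.3f with alpha %.5f and %d leaves" % best)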

19. What are the types of clustering techniques?

The cluster analysis methods may be divided into the following categories:


Partitional Methods:

Partitional Methods obtain a single level partition of objects. These Methods usually are

based on greedy heuristics that are used iteratively to obtain a local optimum solution. Given n objects, these methods build k < n clusters of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and that each object belongs to only one cluster. Objects may be

relocated between clusters as the clusters are refined. Often these methods require that the number

of clusters be specified apriori and this number usually does not change during the processing.

Hierarchical Methods:

Hierarchical Methods obtain a nested partition of the objects resulting in a tree of clusters.

These methods either start with one cluster and then split it into smaller clusters (called divisive or top down), or start with each object in an individual cluster and then try to merge similar clusters into larger and larger clusters (called agglomerative or bottom up). In this approach, in contrast to partitioning, tentative clusters may be merged or split based on some criteria.

Density –based methods:

In this class of methods, typically for each data point in a cluster, at least a minimum number of points must exist within a given radius. Density-based methods can deal with arbitrarily shaped clusters since the major requirement of such methods is that each cluster be a dense region of points surrounded by regions of low density.

Grid –based Methods:

In this class of methods, the object space rather than the data is divided into a grid. Grid partitioning is based on characteristics of the data, and such methods can deal with non-numeric data more easily. Grid-based methods are not affected by data ordering.

Model –based Methods:

A model is assumed, perhaps based on a probability distribution. Essentially the algorithm

tries to build clusters with a high level of similarity within them and a low level of similarity

between them. Similarity measurement is based on the mean values and the algorithm tries to

minimize the squared –error function.

The above categories provide a simple taxonomy of cluster analysis methods.
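As a small illustration of the partitional approach described above (and, for contrast, the agglomerative hierarchical approach), the following sketch uses scikit-learn; the data, the choice of k and the linkage are illustrative assumptions only.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [2, 1],        # one group of objects
              [8, 8], [8, 9], [9, 8]])       # another group of objects

# partitional: the number of clusters k is specified a priori and objects are
# iteratively relocated between clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# agglomerative hierarchical: starts with each object in its own cluster and
# merges the closest clusters (single-link here) until k clusters remain
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

print(kmeans_labels, agglo_labels)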

20. What are types of web mining? Explain them.


Web mining is the application of data mining techniques to find interesting and potentially

useful knowledge from web data. It is normally expected that either the hyperlink structure

of the web or the web log data or both have been used in the mining process.

Web content mining :

It deals with discovering useful information or knowledge from web page

contents. Web contents mining focuses on the web page content rather than the

links. It is a very rich information resource consisting of many types of information,

for example unstructured free text, image, audio, video and metadata as well as

hyperlinks. The content of web pages includes no machine-readable semantic

information. Search engines, subject directories, intelligent agents, cluster analysis,

and portals are employed to find what a user might be looking for. It has been

suggested that users should be able to pose more sophisticated queries than just

specifying the keywords.

Web structure mining:

It deals with discovering and modeling the link structure of the web. Work

has been carried out to model the web based on the topology of the hyperlinks. This

can help in discovering similarity between sites or in discovering important sites for

a particular topic or discipline or in discovering web communities.

Web usage mining:

It deals with understanding user behaviour when interacting with the web or with a

web site. One of the aims is to obtain information that may assist web site

reorganization or assist site adaptation to better suit the user. The mined data often

includes data logs of users‘ interactions with the web. The logs include the web

server logs, proxy server logs, and browser logs. The logs include information about

the referring pages, user identification, time a user spends at a site and the sequence

of pages visited. Information is also collected via cookie files. While web structure

mining shows that page A has a link to page B, web usage mining shows who or

how many people took that link, which site they came from and where they went

when they left page B.

21. What are types of web pages? Explain them

It is possible to classify web pages into several types:


Homepage or the head page:

These pages represent an entry point for the website of an enterprise or a section

within the enterprise or an individual‘s web page.

Index page:

These pages assist the user to navigate through the enterprise‘s web site. A home page

in some cases may also act as an index page.

Reference page:

These pages provide some basic information that is used by a number of other pages.

For example, each page in a web site may have a link to a page that provides the

enterprise‘s privacy policy.

Content page:

These pages only provide content and have little role in assisting a user's navigation.

Often these pages are larger in size, have few out-links, and are well down the tree in a

website. They are often leaf nodes of a tree.

22. What are the linkage principles used in designing website?

Relevant linkage principle:

It is assumed that links from a page point to other relevant resources. This

is similar to the assumption that is made for citations in scholarly publications,

where it is assumed that a publication cites only relevant publications. Links are

often assumed to reflect the judgment of the page creator. By providing a link to

another page it is assumed that the creator is making a recommendation for the other

relevant page.

Topical unity principle:

It is assumed that web pages that are co-cited are related. Many web mining

algorithms make use of this relevance assumption as a measure of mutual relevance

between web pages.

Lexical affinity principle:

It is assumed that the text and the links within a page are

relevant to each other. Once again, it is assumed that the text on a page has been

chosen carefully by the creator to be related to a theme.

23. Write short notes on Fingerprinting.


An approach for comparing a larger number of documents is based on the idea of

fingerprinting documents.

A document may be divided into all possible substrings of length L. These substrings are called shingles. Based on the shingles we can define resemblance R(X,Y) and containment C(X,Y) between two documents X and Y as follows. Assume S(X) and S(Y) to be the sets of shingles for documents X and Y respectively.

R(X,Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|

C(X,Y) = |S(X) ∩ S(Y)| / |S(X)|

An algorithm like the following may now be used to find similar documents:

1. Collect all the documents that one wishes to compare

2. Choose a suitable shingle width and compute the shingles for each document.

3. Compare the shingles for each pair of documents

4. Identify those documents that are similar.

The algorithm is very inefficient for a large collection of documents. The web is

very large and this algorithm requires enormous storage to store the shingles and

very long processing time to finish pairwise comparison for, say, even 100 million documents. This approach is sometimes called full fingerprinting.
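A minimal sketch of the shingling idea, assuming two short text strings and a shingle width of L = 3 characters; it follows the resemblance and containment definitions given above. The strings are made up for illustration.

def shingles(text, L):
    # all substrings of length L (the document's shingles)
    return {text[i:i + L] for i in range(len(text) - L + 1)}

def resemblance(x, y, L=3):
    sx, sy = shingles(x, L), shingles(y, L)
    return len(sx & sy) / len(sx | sy)

def containment(x, y, L=3):
    sx, sy = shingles(x, L), shingles(y, L)
    return len(sx & sy) / len(sx)

X = "data mining is fun"
Y = "web data mining is fun too"
print(round(resemblance(X, Y), 3), round(containment(X, Y), 3))

Comparing every pair of documents this way is what makes full fingerprinting so expensive for a large collection.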

24. Give the problems with Hits algorithm.

Hubs and authorities:

A clear-cut distinction between hubs and authorities may not be appropriate since

many sites are hubs as well as authorities.

Topic drift:

Certain arrangements of tightly connected documents, perhaps due to mutually

reinforcing relationships between hosts, can dominate the HITS computation. These

documents in some instances may not be the most relevant to the query that was

posed. It has been reported that in one case, when the search term was "jaguar", the HITS algorithm converged to a football team called the Jaguars. Other examples of topic drift have been found on topics like "gun control", "abortion", and "movies".


Automatically generated links:

Some of the links are computer generated and represent no human judgment but

HITS still gives them equal importance.

Non-relevant documents :

Some queries can return non-relevant documents among the highly ranked results and this

can then lead to erroneous results from the HITS algorithm. The problem is more

common than it might appear.

Efficiency :

The real-time performance of the algorithm is not good given the steps that involve

finding sites that are pointed to by pages in the root set.

A number of proposals have been made for modifying HITS. These include:

More careful selection of the base set to reduce the possibility of topic drift. One

possible approach might be to modify the HITS algorithm so that the hub and

authority weights are modified only based on the best hubs and the best authorities.

One may argue that the in-link information is more important than the out-link

information. A hub can become important just by pointing to a lot of authorities.
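For reference, a minimal sketch of the basic HITS hub/authority update that the above proposals modify, using a small made-up link graph; the adjacency matrix and the number of iterations are illustrative assumptions, not part of the original answer.

import numpy as np

# adjacency matrix of a tiny web graph: A[i, j] = 1 if page i links to page j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n = A.shape[0]
hubs = np.ones(n)
auths = np.ones(n)

for _ in range(50):
    # a page is a good authority if good hubs point to it
    auths = A.T @ hubs
    # a page is a good hub if it points to good authorities
    hubs = A @ auths
    # normalise so the scores do not grow without bound
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

print("authority scores:", np.round(auths, 3))
print("hub scores:      ", np.round(hubs, 3))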

25. What are the goals of web search?

It has been suggested that the information needs of the user may be divided into three classes:

Navigational:

The primary information need in these queries is to reach a web site that the user

has in mind. The user may know that such a site exists or may have visited the

site earlier but does not know the site URL.

Informational:

The primary information need in such queries is to find a website that provides

useful information about a topic of interest. The user does not have a particular

website in mind. For example, a user may wish to find out the difference

between an ordinary digital camera and a digital SLR camera or a user may want

to investigate IT career prospects in Kolkata.

Transactional:


The primary need in such queries is to perform some kind of transaction. The

user may or may not know the target website. An obvious example is to buy a

book. The user may wish to go to amazon.com or the user may want to go to a

local website if there is one. Another user may be interested in downloading

some images or some other multimedia content from a web site.

26. List out the functions of search engine.

A search engine is a rather complex collection of software modules.

A search engine carries out a variety of tasks. These include:

Collecting information:

A search engine would normally collect web pages or information about them by

web crawling or by human submission of pages.

Evaluating and categorizing information:

In some cases, for example, when web pages are submitted to a directory, it may be

necessary to evaluate and decide whether a submitted page should be selected. In

addition it may be necessary to categorize information based on some ontology used

by the search engine.

Creating a database and creating indexes:

The information collected needs to be stored either in a database or some

kind of file system. Indexes must be created so that the information may be searched

efficiently.

Computing ranks of the web documents:

A variety of methods are being used to determine the rank of each page

retrieved in response to a user query. The information used may include frequency

of key words, value of in-links and out-links from the page and frequency of use of

the page. In some cases, human judgment may be used.

Checking queries and executing them:

Queries posed by users need to be checked, for example, for spelling errors

and whether words in the query are recognizable. Once checked, a query is

executed by searching the search engine database.

Presenting results:


How the search engine presents the results to the user is important. The

search engine must determine what results to present and how to display them.

Profiling the users:

To improve search performance, the search engines carry out user profiling that deals with the way users use search engines.

27. Discuss the weakness of Page Rank algorithm.

The Page Rank algorithm also has a similar weakness since a link cannot always be

considered a recommendation and as noted earlier people are now trying to

influence the links.

The major weakness of the Page Rank algorithm arises because of Page Rank‘s

focus on the importance of a page rather than on its relevancy given the user query.

Page Rank is calculated independently of a user query so the result served on the

first screen may not be the most relevant; it may be the one with the highest Page Rank amongst the pages retrieved. A side effect of Page Rank therefore can be

something like this. For example, suppose amazon.com decides to get into the real

estate business. Obviously its real estate web site will have links from all other

amazon.com sites which have no relevance whatsoever with real estate. When a user

then searches for real estate, the amazon.com real estate site is likely to be served

high in the results because of its high Page Rank while most other real estate

businesses, whatever their quality, will appear further down the list.

By the nature of the algorithm, Page Rank does not deal with new pages fairly since it makes high Page Rank pages even more popular by serving them at the top of the results. A new page can improve its rank only by advertising its existence on the web. Since the Page Rank algorithm is used for ranking web pages, it will take a very long time for a new page to become popular even if the new page is of high quality.

Other Issues in Page Ranking:

The algorithm given above is good, but do users interpret relevance and quality in the same way as the page ranking algorithm does?

Page ranking algorithms consider the following in determining page rank:

1. Page title including key words


2. Page content including links, keywords in links text, spelling errors,

length of the page

3. Out-links including relevance of links to the content

4. In-links and their importance

5. In-links text-keywords

6. In-links page content

7. Traffic including the number of unique visitors and the time spent on the

page.

The page rank algorithms are not perfect and there is a serious problem regarding the visibility of high quality new pages, so perhaps other approaches are needed.

One possibility might be to introduce some randomness to the selection

criteria so some new pages are displayed.

Another option might be let users select if they are willing to look at

some new sites that do not have high page rank.
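A minimal sketch of the Page Rank computation discussed above, on a tiny made-up link graph with the usual damping factor; the graph, the damping value of 0.85 and the iteration count are illustrative assumptions, not part of the original answer.

import numpy as np

# out-links of a tiny web graph: page -> pages it links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)

# column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

d = 0.85                        # damping factor
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * (M @ rank)

print(np.round(rank / rank.sum(), 3))   # pages with many important in-links rank highest

Note that the rank is computed independently of any query, which is exactly why a high Page Rank page can be served first even when it is not the most relevant result.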

28. Give the difference between searching conventional text and searching the web.

1. Hyperlink:

The text documents obviously do not have hyperlinks, while the links are very

important components of web documents. In hard copy documents in a library, the

documents are usually structured (e.g. books) and they have been catalogued by cataloguing

experts. The web hyperlinks provide important information. A link from page A to page B

indicates that the author of page A considered page B of some value.

2. Types of information:

Web pages differ widely in structure, quality and usefulness. Web pages can consist of text, frames, multimedia objects, animation and other types of information, quite different from text documents which mainly consist of text but may have some other objects like tables, diagrams, figures and perhaps some images.

3. Dynamics:


The text documents do not change unless a new edition of a book appears while web

pages change frequently because the information on the web including linkage information

is updated all the time and new pages appear almost every second. Finding a previous version of a page is almost impossible on the web, and links pointing to a page may work today but not tomorrow.

4. Quality:

The text documents are usually of high quality since they usually go through some

quality control process because they are very expensive to produce. In contrast, much of the

information on the web is of low quality. Compared to the size of the web it may be that

less than 10% of web pages are really useful and of high quality.

5. Huge Size:

Although some libraries are very large, the web in comparison is much larger, with its size perhaps approaching 100 terabytes. That is equivalent to about 200 million books. All the books in the U.S. Library of Congress would fit in about 20 terabytes.

6. Document use:

Compared to the use of conventional documents, the use of web documents is very different. Web users tend to pose short queries, browse perhaps the first page of the results and then move on.

29. What are the desired qualities of search results?

The results from a search engine ideally should satisfy the following quality requirements:

1. Precision: Only relevant documents should be returned.

2. Recall: All the relevant documents should be returned.

3. Ranking: A ranking of the documents providing some indication of the relative ranking of the results should be returned.

4. First screen: The first page of results should include the most relevant results.

5. Speed: Results should be provided quickly since users have little patience.

Information retrieval evaluation uses test reference collections which consist of a collection of documents and a set of queries. The result of any retrieval run can then be compared with the set of relevant documents that is already known. A similar approach is not possible for a search engine.


Precision deals with questions like "what percentage of documents retrieved are relevant?" and "what rankings are the most relevant documents being given?".

Recall deals with the question "what percentage of relevant documents are returned by the search engine?". The requirements of a high level of recall are generally in

conflict with precision and correct ranking. To improve the recall, precision is almost

always going to go down unless some user feedback mechanism is used, which adds

complexity and cost to the search process.

It is well known that in this trade-off relationship:

1. Precision decreases faster than recall increases

2. It is much more difficult to improve precision than to improve recall

It has been claimed that in a web search users are often looking for one single document

or a website. Users usually do not, and do not wish to, go through many documents. It

is therefore important that a search engine provides the most relevant documents with

high ranking so that they are visible generally in the first screen of information.
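A minimal sketch of the precision and recall computations referred to above, assuming we know the set of documents a search engine returned and the set of documents actually relevant to the query; the document identifiers are invented for illustration.

retrieved = {"d1", "d2", "d3", "d4", "d5"}   # documents returned by the engine
relevant  = {"d2", "d4", "d6", "d7"}          # documents actually relevant

hits = retrieved & relevant

precision = len(hits) / len(retrieved)   # what fraction of retrieved documents are relevant
recall    = len(hits) / len(relevant)    # what fraction of relevant documents were retrieved

print("precision = %.2f, recall = %.2f" % (precision, recall))   # 0.40, 0.50

Returning more documents tends to raise recall but lower precision, which is the trade-off described above.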

30. What are the FASMI Characteristics of OLAP? Explain them.

In the FASMI characterization of OLAP systems, the name is derived from the first letters of the following characteristics:

Fast: As noted earlier, most OLAP queries should be answered very quickly, perhaps within

seconds. The performance of an OLAP system has to be like that of a search engine. If the response

takes more than say 20 seconds, the user is likely to move away to something else assuming there is

a problem with the query. Achieving such performance is difficult. The data structures must be

efficient. The hardware must be powerful enough for the amount of data and the number of users.

Full pre-computation of aggregates helps but is often not practical due to the large number of

aggregates. One approach is to pre-compute the most commonly queried aggregates and compute

the remaining on-the-fly.

Analytic: An OLAP system must provide rich analytic functionality and it is expected that

most OLAP queries can be answered without any programming. The system should be able to cope

with any relevant queries for the application and the user. Often the analysis will be using the


vendor‘s own tools although OLAP software capabilities differ widely between products in the

market.

Shared: An OLAP system is a shared resource although it is unlikely to be shared by hundreds of users. An OLAP system is likely to be accessed only by a select group of managers and may be used merely by dozens of users. Being a shared system, an OLAP system should provide adequate security for confidentiality as well as integrity.

Multidimensional: This is the basic requirement. Whatever OLAP software is being used,

it must provide a multidimensional conceptual view of the data. It is because of the

multidimensional view of data that we often refer to the data as a cube. A dimension often has

hierarchies that show parent / child relationships between the members of a dimension. The

multidimensional structure should allow such hierarchies.

Information: OLAP systems usually obtain information from a data warehouse. The

system should be able to handle a large amount of input data. The capacity of an OLAP system to

handle information and its integration with the data warehouse may be critical.

31. Describe the characteristics of OLAP SYSTEMS.

The following are the differences between OLAP and OLTP systems.

1. Users: OLTP systems are designed for office workers while the OLAP systems are designed for decision makers. Therefore while an OLTP system may be accessed by hundreds or

even thousands of users in a large enterprise, an OLAP system is likely to be accessed only by a

select group of managers and may be used only by dozens of users.

2. Functions: OLTP systems are mission-critical. They support day-to-day operations of an

enterprise and are mostly performance and availability driven. These systems carry out simple

repetitive operations. OLAP systems are management-critical; they support an enterprise's decision-support functions using analytical investigations. They are more functionality driven, involving ad hoc and often much more complex operations.

3. Nature: Although SQL queries often return a set of records, OLTP systems are designed to

process one record at a time, for example a record related to the customer who might be on the

phone or in the store. OLAP systems are not designed to deal with individual customer records.

Instead they involve queries that deal with many records at a time and provide summary or


aggregate data to a manager. OLAP applications involve data stored in a data warehouse that has

been extracted from many tables and perhaps from more than one enterprise database.

4. Design: OLTP database systems are designed to be application-oriented while OLAP

systems are designed to be subject-oriented. OLTP systems view the enterprise data as a collection

of tables (perhaps based on an entity-relationship model). OLAP systems view enterprise

information as multidimensional.

5. Data: OLTP systems normally deal only with the current status of information. For

example, information about an employee who left three years ago may not be available on the

Human Resources System. The old information may have been archived on some type of stable

storage media and may not be accessible online. On the other hand, OLAP systems require

historical data over several years since trends are often important in decision making.

6. Kind of use: OLTP systems are used for reading and writing operations while OLAP

systems normally do not update the data.

32. What are the Motivation of using OLAP?

1. Understanding and improving sales: For an enterprise that has many products

and uses a number of channels for selling the products, OLAP can assist in finding the most

popular products and the most popular channels. In some cases it may be possible to find the most

profitable customers. For example, considering the telecommunications industry and only

considering one product, communication minutes, there is a large amount of data if a company

wanted to analyze the sales of product for every hour of the day (24 hours), differentiate between

weekdays and weekends (2 values) and divide regions to which calls are made into 50 regions.

2. Understanding and reducing costs of doing business: Improving sales is one

aspect of improving a business, the other aspect is to analyze costs and to control them as much as

possible without affecting sales. OLAP can assist in analyzing the costs associated with sales. In

some cases, it may also be possible to identify expenditures that produce a high return on

investment (ROI). For example, recruiting a top salesperson may involve significant costs, but the

revenue generated by the salesperson may justify the investment.

33. Explain the methods of Data Cube Implementation.

1. Pre-compute and store all: This means that millions of aggregates will need to be

computed and stored. Although this is the best solution as far as query response time is concerned,


the solution is impractical since resources required to compute the aggregates and to store them will

be prohibitively large for a large data cube. Indexing large amounts of data is also expensive.

2. Pre-compute (and store) none: This means that the aggregates are computed on-the-fly using the raw data whenever a query is posed. This approach does not require additional space

for storing the cube but the query response time is likely to be very poor for large data cubes.

3. Pre-compute and store some: This means that we pre-compute and store the most

frequently queried aggregates and compute others as the need arises.
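A rough sketch of the "pre-compute and store some" idea using pandas: a commonly queried aggregate is pre-computed and stored, while less common queries are answered from the raw data on the fly. The sales table, its columns and the queries are illustrative assumptions.

import pandas as pd

# raw fact data (illustrative)
sales = pd.DataFrame({
    "region":  ["N", "N", "S", "S", "N", "S"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [100, 150, 120, 80, 90, 110],
})

# pre-compute and store the most frequently queried aggregate (region x quarter)
by_region_quarter = sales.groupby(["region", "quarter"])["amount"].sum()

# a frequent query is answered instantly from the stored aggregate
print(by_region_quarter.loc[("N", "Q1")])          # 250

# a less frequent aggregate (by product) is computed on the fly from the raw data
print(sales.groupby("product")["amount"].sum())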

34. What is the difference between a Data warehouse and a Data mart.

Data Warehouse:

Corporate/enterprise-wide
Union of all data marts
Takes time to build
Low risk of failure
Structure for corporate view of data
Data received from the staging area
Well structured and architected
Queries on presentation resource

Data Mart:

Departmental-wide
A single business process
Faster and easier implementation
High risk of failure
Structure to suit the departmental view of data
Data received from star joins (facts & dimensions)
Each data mart has its own narrow view of data
Technology optimal for data access and analysis

35. What are the different types of OLAP Servers?

Relational OLAP (ROLAP) servers:


These are the intermediate servers that stand in between a relational back-end server and

client front-end tools. They use a relational or extended-relational DBMS to store and

manage warehouse data, and OLAP middleware to support missing pieces.

ROLAP servers include optimization for each DBMS back end, implementation of

aggregation navigation logic, and additional tools and services. ROLAP technology

tends to have greater scalability than MOLAP technology.

The DSS server of Microstrategy, for example, adopts the ROLAP approach.

Advantages:

can handle large amounts of data

Disadvantages:

performance is slow.

Multidimensional OLAP (MOLAP) servers:

These servers support multidimensional views of data through array-based

multidimensional storage engines. They map multidimensional views directly to data

cube array structures.

With multidimensional data stores, the storage utilization may be low if the data set is

sparse. In such cases, sparse matrix compression techniques should be explored.

Many MOLAP servers adopt a two-level storage representation to handle dense and

sparse data sets: denser subcubes are identified and stored as array structures, whereas

sparse subcubes employ compression technology for efficient storage utilization.

Advantages:

faster indexing to precomputed summarized data

Disadvantages:

can handle limited amount of data


Hybrid OLAP (HOLAP) servers:

The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting

from the greater scalability of ROLAP and the faster computation of MOLAP.

For example, a HOLAP server may allow large volumes of detail data to be stored in a

relational database, while aggregations are kept in a separate MOLAP store. The

Microsoft SQL Server 2000 supports a hybrid OLAP server.

Specialized SQL servers:

To meet the growing demand of OLAP processing in relational databases, some

database system vendors implement specialized SQL servers that provide advanced

query language and query processing support for SQL queries over star and snowflake

schemas in a read-only environment.

36. What are the applications of OLAP?

OLAP is widely used in several realms of data management. Some of these applications include:

1. Financial Applications

Activity-Based Costing (Resource Allocation)

Budgeting

2. Marketing/Sales Applications

Market Research Analysis

Sales Forecasting

Promotions Analysis

Customer Analysis

Market/Customer Segmentation

3. Business Modelling

Simulating business behavior


Extensive, real-time decision support system for managers.

All these applications need ability to provide managers with the information they need to make

effective decisions about an organization‘s strategic directions. The key indicator of a successful

OLAP application is its ability to provide information as needed, that is, its ability to provide 'just-in-time' information for effective decision making. This requires more than base level detailed

data.

37. Explain ETL process.

The ETL process involves extracting, transforming and loading data from source systems. The

process may sound very simple since it only involves reading information from source

databases, transforming it to fit the ODS database model and loading it in the ODS. In practice,

the process is much more complex and may require significant resources to implement. There

are a variety of tools in the market that may help reduce the cost. These tools are available

from several vendors.

The following examples show the importance of data cleaning:

If an enterprise wishes to contact its customers, it is essential that a

complete, accurate and up-to-date list of contact addresses, email addresses and

telephone numbers be available. Correspondence sent to a wrong address that is

then redirected does not create a very good impression about the enterprise.

If a customer or supplier calls, the staff responding should be quickly able to

find the person in the enterprise database but this requires that the caller‘s name

or his/her company name is accurately listed in the database.

If a customer appears in the databases with two or more slightly different

names or different account numbers, it becomes difficult to update the

customer‘s information.

ETL requires skills in management, business analysis and technology and is often a significant

component of developing an ODS or data warehouse. The ETL process tends to be different

for every ODS and data warehouse since every system is different. It should not be assumed

that an off-the-shelf ETL system can magically solve all ETL problems. It is best to use a

flexible ETL approach combining perhaps an off-the-shelf tool with some in-house developed

software to perform the tasks that are required for a proposed ODS or DW. It is essential that


the ETL process be adequately documented so that at a later stage another staff member

can understand what exactly the ETL does.

An ODS will be refreshed regularly and frequently and therefore ETL will continue to play an

important role in maintaining the ODS since the refreshing task itself will require an ETL

process. The ETL process needed in refreshing would usually need to be automated so that

whenever the ODS is scheduled to be refreshed the ETL process can be run automatically.

Building an ETL system involves resolving many issues, including the following:

1. What are the source systems? These systems may include relational database systems (e.g. Oracle, DB2, SQL Server, MySQL), legacy systems (e.g. IMS, ISAM, VSAM, flat files) as well as other systems (e.g. CICS, COBOL legacy systems).

2. To what extent are the source systems and the target system interoperable?

The more different the sources and target, the more complex the ETL

process.

3. What ETL technology will be used? One approach might be to use an in-house custom-built solution. Some tools may be available to help in building such a solution. Another approach is to acquire a generic ETL package; as noted earlier, a large variety of tools are available in the market.

4. How big is the ODS likely to be at the beginning and in the long term?

Database systems tend to grow with time. Consideration may have to be

given to whether some of the data from the ODS will be archived regularly

as the data becomes old and is no longer needed in the ODS.

5. How frequently will the ODS be refreshed or reloaded? Once the system has

been populated, it should be regularly and frequently refreshed. What ETL

process will be used for refreshing?

38. Give the difference between ODS & data warehouse.

ODS:

Data of high quality at detailed level and assumed availability
Contains current and near-current data
Real-time and near-real-time data loads
Mostly updated at data field level (even if it may be appended)
Typically detailed data only
Modelled to support rapid data updates (3NF)
Transactions similar to those in OLTP systems
Used for detailed decision making and operational reporting
Used at the operational level

DW:

Data may not be perfect, but sufficient for strategic analysis; data does not have to be highly available
Contains historical data
Normally batch data loads
Data is appended, not updated
Contains summarized and detailed data
Variety of modelling techniques used, typically multidimensional for data marts to optimize query performance
Complex queries processing larger volumes of data
Used for long-term decision making and management reporting
Used at the managerial level


39. What are the benefits of implementing a data warehouse?

The benefits of implementing a data warehouse are as follows:

To provide a single version of truth about enterprise information. This may appear

obvious but it is not uncommon in an enterprise for two database systems to have

two different versions of the truth.

To speed up ad hoc reports and queries that involve aggregations across many attributes, which are resource intensive. The managers require trends, sums and

aggregations that allow, for example comparing this year‘s performance to last

year‘s or preparation of forecasts for next year.

To provide a system in which managers who do not have a strong technical

background are able to run complex queries. If the managers are able to access the

information they require, it is likely to reduce the bureaucracy around the

managers.



To provide a database that stores relatively clean data. By using a good ETL

process, the data warehouse should have data of high quality. When errors are

discovered it may be desirable to correct them directly in the data warehouse and

then propagate the correction to the OLTP system.

To provide a database that stores historical data that may have been deleted from

the OLTP systems. To improve response time, historical data is usually not

retained in OLTP systems other than that which is required to respond to customer

queries. The data warehouse can then store the data that is purged from the OLTP

systems.

40. What are the pitfalls of data mining?

The pitfalls fall into five categories:

1. Unintentional mistake (mistaken identity):

This kind of problem arises when an innocent person shares some identifier

information with one or more persons that have been identified as having a

profile in a data mining application. A number of such incidents have been

found to have taken place in the last year or two, for example in airline passenger screening systems.

2. Unintentional mistake (faulty inference):

This kind of problem arises when profiles are matched and the data mining

users misinterpret the result of the match. Faulty inference then may be

derived resulting, for example in an innocent person becoming a suspect and

leading to further investigation about the person.

3. Intentional abuse:

People employed in running a data mining system and those in security

organizations have access to sensitive personal information and not all of

them are trustworthy. Sensitive information may be used for unauthorized

purposes such as personal financial gain. For example, an employee of an

intelligence or law enforcement agency may provide a profile checking

service for a private company for a fee.


4. Security breach:

Given that data mining systems using personal information have sensitive

information, information may be stolen or carelessly disclosed without strict

security measures. In February 2004, a magnetic tape with information on

about 120,000 Japanese customers of Citibank disappeared while being

shipped by truck from a data management centre in Singapore. The tape held names, addresses, account numbers and balances. In June 2005, an entire box of

Citibank tapes in the care of the United Parcel Service, with personal

information on nearly four million American customers disappeared. In

February 2005, a US company, ChoicePoint, announced that personal data of

145,000 persons was accessed by an unauthorised person. These cases

highlight the dangers of security breach.

5. Mission creep:

Once a data mining system, for example for a national security application, has

been established and personal information from a variety of sources

collected, there is likely to be a temptation to use the information for other

applications. Such mission creep has been reported in data mining

applications.

41. Give the primary aims of data mining .

The aims of the majority of such data mining activities are laudable but the techniques

are not always perfect. Data mining is used for many purposes that are beneficial to society, as the list of some of the common aims of data mining below shows.

The primary aim of many data mining applications is to better understand

the customer and improve customer service.

Some applications aim to discover anomalous patterns in order to help

identify, for example, fraud, abuse, waste, terrorist suspects, or drug

smugglers.

In many applications in private enterprise, the primary aim is to improve the

profitability of an enterprise.

In a number of data mining applications the primary purpose is to improve human judgement, for example in making diagnoses, in resolving crime, in sorting


out manufacturing problems, in predicting share prices or currency

movements or commodity prices.

In some government applications, one of the aims of data mining is to identify criminal and fraudulent activities.

In some applications, data mining is used to find patterns that are simply not possible to find without the help of data mining, given the huge amounts of data

that must be processed.

42. Write about anonymization?

Anonymization:

Anonymization is removal of identifying information from data in order to protect

privacy of the data subjects while allowing the data so modified to be used for

secondary purposes like data mining . for example, anonymization may be used by a

web user to prevent collection of information by hiding personal information like

cookies and IP addresses.

Anonymization may be based on a one-way hashing technique. Hashing allows different

people to share information because when a hash function is applied to any

information the resulting hashed value will always be the same if the same hash

function is applied to the same information. It therefore allows sharing and matching

of information without disclosing personal information about a person. Although

anonymization is usually safe, in some extreme cases identity of the person may be

discovered.

Consider an example of some transaction data for an individual:

Name : Suresh Kumar

Identifier :87652468

Non-sensitive information: whatever

Using hashing, assume that we obtain the following:

Name: hashed value of (Suresh Kumar)=297632139

Identifier: hashed value of (87652468)=333765889

Now when the information is being shared, someone receiving two transaction records

that have been hashed is able to conclude by comparing the hashed values whether the


two records belong to the same individual or not. This could be useful; for example, government departments like customs and immigration, and airlines, that carry a watch

list of names and other information do not need to have a list of actual names. They

could carry a list of hashed values for matching.
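A minimal sketch of one-way hashing with Python's hashlib (SHA-256 is chosen here purely for illustration, and the numeric hashed values shown in the example above are likewise only illustrative). Two parties hashing the same identifier with the same function obtain the same value and can match records without seeing the actual name.

import hashlib

def one_way_hash(value: str) -> str:
    # same input always gives the same digest, but the input cannot be recovered from it
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

record_1 = {"name": one_way_hash("Suresh Kumar"), "identifier": one_way_hash("87652468")}
record_2 = {"name": one_way_hash("Suresh Kumar"), "identifier": one_way_hash("87652468")}

# two hashed records can be compared to decide whether they belong to the same individual
print(record_1["name"] == record_2["name"])              # True
print(record_1["identifier"] == record_2["identifier"])  # True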

K-Anonymization:

The idea of k-anonymization is related to the problem of compromising a statistical

database, for example a census database. Although no personal information is

included in statistical databases, it is possible to derive personal information if a

number of attribute values together help identify a person or a small number of people

in the database.

Although this is not particularly relevant to data mining related privacy violations, the concept of a quasi-identifier is defined as the set of attributes that together can be used to identify an individual. It might be something like {date of birth, suburb, number of children}.

The main idea in k-anonymization is to modify the data in the database to ensure that

each set of quasi-identifier values in it appears at least k times so that a person using

the quasi-identifier to retrieve information will retrieve at least k records.
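A small sketch, assuming a pandas table with the quasi-identifier {date of birth, suburb, number of children} mentioned above, of how one could check whether a data set satisfies k-anonymity, i.e. whether every combination of quasi-identifier values appears at least k times. The data is made up for illustration.

import pandas as pd

data = pd.DataFrame({
    "dob":      ["1981", "1981", "1981", "1975", "1975"],
    "suburb":   ["Anna Nagar", "Anna Nagar", "Anna Nagar", "Adyar", "Adyar"],
    "children": [2, 2, 2, 1, 1],
    "income":   [40, 55, 60, 80, 75],   # sensitive attribute
})

quasi_identifier = ["dob", "suburb", "children"]
k = 2

group_sizes = data.groupby(quasi_identifier).size()
print(group_sizes)
print("k-anonymous for k =", k, ":", bool((group_sizes >= k).all()))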

Privacy-Preserving Data Mining:

Privacy-preserving data mining techniques usually involve randomizing the data subjects'

personal data. An enterprise wishing to carry out data mining obtains randomized user

information which may be used for data mining but randomization ensures that the

personal information is not disclosed. This may involve use of client-side software

that randomizes the data. For example, the date of birth 1/1/81 may be randomized as 2/2/82, and 2/2/82 may be randomized as 3/4/81. The true date of birth then cannot be determined by analyzing the randomized date of birth. The randomization

may be based on some parameters, and approximations to the true distributions may

be determined knowing the randomization parameters.
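A rough sketch of the randomization idea: each data subject's date of birth is shifted by a random offset before it is released, so the true value cannot be recovered from any single record, while aggregate patterns can still be estimated if the randomization parameters are known. The offset range and the dates are illustrative assumptions.

import random
from datetime import date, timedelta

def randomize_dob(dob: date, max_days: int = 400) -> date:
    # add a random offset so the true date of birth is not disclosed
    return dob + timedelta(days=random.randint(-max_days, max_days))

true_dob = date(1981, 1, 1)
released = randomize_dob(true_dob)
print(true_dob, "->", released)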

43. Why current privacy principles are ineffective for data mining?

Collection Limitation:

In most current data mining applications, data used are normally not collected for the

primary purpose of data mining. Although in some cases additional information may be collected,

for example through loyalty cards. Often data have been collected for another purpose and data


mining is a secondary use of the data. For example, in market basket analysis the data are collected at the checkout counters for determining what the customer needs to pay for the purchase, and the customer is likely to be totally oblivious to this secondary use.

Furthermore, data mining may involve collection of information from a number of the organization's OLTP systems to build a data warehouse. The data subject would have no idea about such consolidation.

Data Quality:

Data mining is often used only when there is a large amount of information available. It may be historical information as well as current information collected from a variety of sources. Data extraction, transformation and loading (ETL) often precedes data mining. An ETL process should ensure that the quality of data is high, but studies suggest that error levels in such data are significant in spite of ETL. Given that individuals generally do not know about data mining activities, they are not able to request access to their personal information and the corrections, erasures or deletions which could improve the quality of the data.

Purpose specification and use Limitation:

Data for data mining are usually not collected directly from the data subjects. Even when information is directly collected, the individuals are usually not informed of all the purposes of data collection. Data mining techniques may then be used to analyse the personal data, perhaps from several visits to the site, without the individual being aware of such analysis. Data mining of personal information where the data subject's consent has not been obtained is a violation of privacy principles.

The problem of purpose specification and use limitation may be overcome by obtaining data

subjects‘ consent which should be done in a transparent way rather than just including the words

―data mining‖ in the fine print.

Security of Storage:

Data mining often involves collection of information from a number of OLTP systems to build a data warehouse. This makes security even more critical since it can allow an intruder to access much more personal information than the intruder could access from a single OLTP system.

It is therefore important that a system used for data mining be provided with tight security controls.

Strict security procedures should be followed when it is required to transfer information to another

site either through the network or by transporting magnetic media physically.


Openness, Individual Participation and Accountability:

The openness principle requires that if data mining is being carried out on personal information then the data subjects should know that their information is being used for such processing. Because data mining is often so far removed from the primary uses of the information, it is not an open and transparent activity, and therefore an individual usually cannot participate.

Accountability may also not be possible since data mining is invisible and the data subjects do not know that the information they supplied for one purpose is being used for other purposes as well. To make the process open and accountable, it is essential that data subjects know about all uses of their data so that they are in a position to ask questions about those uses.

44. Write a short note on Privacy-Preserving Data Mining.

Privacy-Preserving Data Mining:

Information privacy is the individual's ability to control the circulation of information relating to him/her.

Privacy-preserving data mining techniques usually involve randomizing the data subjects' personal data. An enterprise wishing to carry out data mining obtains randomized user information which may be used for data mining, but randomization ensures that the personal information is not disclosed. This may involve use of client-side software that randomizes the data.

For example, the date of birth 1/1/81 may be randomized as 2/2/82, and 2/2/82 may be randomized as 3/4/81. The true date of birth then cannot be determined by analyzing the randomized date of birth. The randomization may be based on some parameters, and approximations to the true distributions may be determined knowing the randomization parameters.

45. Write any five applications of data mining.

Data Mining Applications in Sales/Marketing

Data mining is used for market basket analysis to provide information on what product

combinations were purchased together, when they were bought and in what sequence. This

information helps businesses promote their most profitable products and maximize the


profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.

Retail companies use data mining to identify customers' buying behavior patterns.

Data Mining Applications in Banking / Finance

Several data mining techniques e.g., distributed data mining have been researched, modeled

and developed to help credit card fraud detection.

Data mining is used to identify customer loyalty by analyzing the data of customers' purchasing activities, such as the frequency of purchases in a period of time, the total monetary value of all purchases and the recency of the last purchase. After analyzing those dimensions, a relative measure is generated for each customer; the higher the score, the more loyal the customer is.

Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets.

Data mining is applied in claims analysis such as identifying which medical procedures are

claimed together.

Data mining enables insurers to forecast which customers will potentially purchase new policies.

Data mining allows insurance companies to detect risky customers‘ behavior patterns.

Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation

Data mining helps determine the distribution schedules among warehouses and outlets and

analyze loading patterns.


Data Mining Applications in Medicine

Data mining enables providers to characterize patient activities and anticipate incoming office visits.

Data mining helps identify the patterns of successful medical therapies for different

illnesses.

PART-C

1.Write the applications of Data Mining in various fields.

Data Mining is a process that analyzes a large amount of data to find new and hidden information

that improves business efficiency. Various industries have been adopting data mining to their

mission-critical business processes to gain competitive advantages and help their businesses grow.

Data Mining Applications in Sales/Marketing

Data mining enables businesses to understand the hidden patterns inside historical purchasing transaction data, thus helping in planning and launching new marketing campaigns in a prompt and cost-effective way. The following illustrates several data mining applications in sales and marketing.

Data mining is used for market basket analysis to provide information on what product

combinations were purchased together, when they were bought and in what sequence. This

information helps businesses promote their most profitable products and maximize the

profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.

Retail companies use data mining to identify customers' buying behavior patterns.

Data Mining Applications in Banking / Finance

Several data mining techniques e.g., distributed data mining have been researched, modeled

and developed to help credit card fraud detection.

Data mining is used to identify customer loyalty by analyzing the data of customers' purchasing activities, such as the frequency of purchases in a period of time, the total monetary value of all purchases and the recency of the last purchase. After analyzing those dimensions, a relative measure is generated for each customer; the higher the score, the more loyal the customer is.

To help banks retain credit card customers, data mining is applied. By analyzing past data, data mining can help banks predict customers who are likely to change their credit card affiliation, so the banks can plan and launch special offers to retain those customers.

Credit card spending by customer groups can be identified by using data mining.

The hidden correlations between different financial indicators can be discovered by using data mining.

From historical market data, data mining enables analysts to identify stock trading rules.


Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are listed below:

Data mining is applied in claims analysis such as identifying which medical procedures are

claimed together.

Data mining enables insurers to forecast which customers will potentially purchase new policies.

Data mining allows insurance companies to detect risky customers‘ behavior patterns.

Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation

Data mining helps determine the distribution schedules among warehouses and outlets and

analyze loading patterns.

Data Mining Applications in Medicine

Data mining enables providers to characterize patient activities and anticipate incoming office visits.

Data mining helps identify the patterns of successful medical therapies for different

illnesses.

Data mining applications are continuously developing in various industries to provide more hidden

knowledge that increases business efficiency.

2..Explain various Data Mining Techniques

There are several major data mining techniques that have been used in data mining projects. They

are association, classification, clustering, prediction, sequential patterns and decision tree.

Association

Association is one of the best known data mining techniques. In association, a pattern is discovered based on a relationship between items in the same transaction. That is the reason why the association technique is also known as the relation technique. The association technique is used in market basket analysis to identify sets of products that customers frequently purchase together.

Retailers use the association technique to research customers' buying habits. Based on historical sales data, retailers might find out that customers always buy crisps when they buy beer, and can therefore put beer and crisps next to each other to save time for the customer and increase sales.


Classification

Classification is a classic data mining technique based on machine learning. Basically, classification is used to classify each item in a set of data into one of a predefined set of classes or groups. The classification method makes use of mathematical techniques such as decision trees, linear programming, neural networks and statistics. In classification, we develop software that can learn how to classify the data items into groups. For example, we can apply classification in an application that, ―given all records of employees who left the company, predicts who will probably leave the company in a future period.‖ In this case, we divide the records of employees into two groups named ―leave‖ and ―stay‖, and then ask our data mining software to classify the employees into the separate groups.

Clustering

Clustering is a data mining technique that makes meaningful or useful clusters of objects which have similar characteristics using an automatic technique. The clustering technique defines the classes and puts objects in each class, while in the classification technique, objects are assigned to predefined classes. To make the concept clearer, we can take book management in a library as an example. In a library, there is a wide range of books on various topics available. The challenge is how to keep those books in a way that readers can take several books on a particular topic without hassle. By using the clustering technique, we can keep books that have some kind of similarity in one cluster or on one shelf and label it with a meaningful name. If readers want to grab books on that topic, they would only have to go to that shelf instead of searching the entire library.

Prediction

Prediction, as its name implies, is a data mining technique that discovers the relationship between independent variables and the relationship between dependent and independent variables. For instance, the prediction analysis technique can be used in sales to predict future profit if we consider sales as the independent variable and profit as the dependent variable. Then, based on the historical sales and profit data, we can draw a fitted regression curve that is used for profit prediction.
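As a small illustration, the following Python sketch fits a straight line to invented sales/profit figures by least squares and uses it to predict profit for a new sales value (the numbers are made up for the example).

# Fit profit = a*sales + b by simple least squares (illustrative figures only).
sales  = [10, 20, 30, 40, 50]      # independent variable
profit = [ 3,  7, 11, 14, 19]      # dependent variable

n = len(sales)
mean_x = sum(sales) / n
mean_y = sum(profit) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sales, profit)) / \
    sum((x - mean_x) ** 2 for x in sales)
b = mean_y - a * mean_x

new_sales = 60
print(f"predicted profit for sales={new_sales}: {a * new_sales + b:.1f}")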

Sequential Patterns

Sequential pattern analysis is a data mining technique that seeks to discover or identify similar patterns, regular events or trends in transaction data over a business period.

In sales, with historical transaction data, businesses can identify sets of items that customers buy together at different times in a year. Businesses can then use this information to recommend these items to customers with better deals, based on their purchasing frequency in the past.


Decision trees

The decision tree is one of the most used data mining techniques because its model is easy for users to understand. In the decision tree technique, the root of the decision tree is a simple question or condition that has multiple answers. Each answer then leads to a further set of questions or conditions that help us narrow down the data so that we can make the final decision based on it. For example, we can use the following decision tree to determine whether or not to play tennis:

Starting at the root node, if the outlook is overcast then we should definitely play tennis. If it is rainy, we should only play tennis if the wind is weak. And if it is sunny, then we should play tennis only if the humidity is normal.

3. Explain the architecture of a typical Data Mining System.

The architecture of a typical data mining system may have the following major components :

Database, data warehouse, World Wide Web, or other information repository

This is one or a set of databases, data warehouses, spreadsheets, or other kinds of

information repositories.

Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server:

The database or data warehouse server is responsible for fetching the relevant data,

based on the user‘s data mining request.


Knowledge base:

This is the domain knowledge that is used to guide the search or evaluate the

interestingness of resulting patterns.

Such knowledge can include concept hierarchies, used to organize attributes or

attribute values into different levels of abstraction.

Knowledge such as user beliefs, which can be used to assess a pattern‘s

interestingness based on its unexpectedness, may also be included.

Other examples of domain knowledge are additional interestingness constraints or

thresholds, and metadata (e.g., describing data from multiple heterogeneous

sources).

Data mining engine:

This is essential to the data mining system and ideally consists of a set of functional

modules for tasks such as characterization, association and correlation analysis,

classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module:

This component typically employs interestingness measures and interacts with the

data mining modules so as to focus the search toward interesting patterns. It may

use interestingness thresholds to filter out discovered patterns.

Alternatively, the pattern evaluation module may be integrated with the mining

module, depending on the implementation of the data mining method used.

For efficient data mining, it is highly recommended to push the evaluation of pattern

interestingness as deep as possible into the mining process so as to confine the

search to only the interesting patterns.

User interface:

This module communicates between users and the data mining system, allowing the

user to interact with the system by specifying a data mining query or task,

providing information to help focus the search, and performing exploratory data

mining based on the intermediate data mining results.

Also, it allows the user to browse database and data warehouse schemas or data

structures, evaluate mined patterns, and visualize the patterns in different forms.


4. What are the Data Mining primitives?

A data mining query is defined in terms of data mining task primitives. These primitives allow the

user to interactively communicate with the data mining system during discovery in order to direct

the mining process, or examine the findings from different angles or depths. The data mining

primitives specify the following:

The set of task-relevant data to be mined:

This specifies the portions of the database or the set of data in which the user is interested.

This includes the database attributes or data warehouse dimensions of interest (the relevant

attributes or dimensions).


The kind of knowledge to be mined:

This specifies the data mining functions to be performed, such as characterization,

discrimination, association or correlation analysis, classification, prediction, clustering,

outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process:

This knowledge about the domain to be mined is useful for guiding the knowledge

discovery process and for evaluating the patterns found. Concept hierarchies are a popular

form of background knowledge, which allow data to be mined at multiple levels of

abstraction. User beliefs regarding relationships in the data are another form of background

knowledge.

The interestingness measures and thresholds for pattern evaluation:

They may be used to guide the mining process or, after discovery, to evaluate the

discovered patterns. Different kinds of knowledge may have different interestingness

measures. For example, interestingness measures for association rules

include support and confidence. Rules whose support and confidence values are below user-

specified thresholds are considered uninteresting.

The expected representation for visualizing the discovered patterns:

This refers to the form in which discovered patterns are to be displayed, which may include

rules, tables, charts, graphs, decision trees, and cubes.


5. Explain Apriori Algorithm.

Apriori is a seminal algorithm for mining frequent itemsets for Boolean association rules.

The name of the algorithm is based on the fact that the algorithm uses prior knowledge of

frequent itemset properties.

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used

to explore (k+1)-itemsets.

Apriori property is used to improve the efficiency of the level-wise generation of frequent

itemsets.

1. Apriori property states that all nonempty subsets of a frequent itemset must also be

frequent.

2. The algorithm uses a two-step process: using Lk-1, Lk is found for k >= 2.

The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1

with itself. This set of candidates is denoted Ck. In the join component, Lk-1 is joined

with Lk-1 to generate potential candidates.

The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent,

but all of the frequent k-itemsets are included in Ck. The prune component employs

the Apriori property to remove candidates that have a subset that is not frequent.

Example: A database has four transactions. Let minimum support and confidence be 50%.

Tid Items bought

1 A,B,D

2 A,D

3 A,C

4 B,D,E,F


Find out the frequent item sets and strong association rules for the above example using

Apriori Algorithm.

Step 1: Scan for support count of each candidate –{A,B,C,D,E,F}

C1=

Item Support

A ¾ =75%

B 2/4=50%

C ¼=25%

D ¾=75%

E ¼=25%

F ¼=25%

Step 2: Comparing with the min support threshold count of 50%.

L1=

Item Support

A 3

B 2

D 3

Step 3: Generate candidate C2 from L1.

C2:

Item

{A,B}

{A,D}

{B,D}

Step 4:

Scan for support count of each candidate in C2.

Item Support

{A,B} ¼=25%

{A,D} 2/4=50%


{B,D} 2/4=50%

Step 5:

Compare C2 with minimum support

L2=

Items Support

{A,D} 2

{B,D} 2

Step 6: The data contains the frequent 2-itemsets {A,D} and {B,D}. (No candidate 3-itemset survives the prune step, since its subset {A,B} is not frequent.)

Therefore the association rules that can be generated from L2 are:

Association Rule Support Confidence

A->D 2 2/3 = 66%

D->A 2 2/3 = 66%

B->D 2 2/2 = 100%

D->B 2 2/3 = 66%

As the minimum confidence threshold is 50%, all four rules are output since their confidence is above 50%.

The rules are:

Rule 1: A->D

Rule 2: D->A

Rule 3: B->D

Rule 4: D->B
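A minimal Python sketch of the same level-wise (join and prune) computation on the four transactions above; it is a brute-force illustration rather than an optimized implementation, and it reproduces the frequent itemsets and rules found by hand.

from itertools import combinations

transactions = [{"A", "B", "D"}, {"A", "D"}, {"A", "C"}, {"B", "D", "E", "F"}]
min_support = 0.5      # 50%
min_confidence = 0.5   # 50%

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise (Apriori) generation of frequent itemsets.
items = sorted(set().union(*transactions))
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
while frequent[-1]:
    prev = frequent[-1]
    # Join step: combine frequent (k-1)-itemsets; prune step: keep candidates
    # whose every (k-1)-subset is frequent, then test support.
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})

all_frequent = set().union(*frequent)
print("frequent itemsets:", [sorted(f) for f in all_frequent])

# Rule generation from frequent itemsets of size >= 2.
for itemset in (f for f in all_frequent if len(f) >= 2):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                print(f"{set(lhs)} -> {set(itemset - lhs)}  confidence = {conf:.0%}")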

6. Write FP-Growth Algorithm.

Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment growth.

Input:

D, a transaction database;

Min_sup, the minimum support count threshold.

Output: The complete set of frequent patterns.

Method:

1. The FP-tree is constructed in the following steps:

(a) Scan the transaction database D once.

Collect F, the set of frequent items, and their support counts.

Sort F in support count descending order as L, the list of frequent items.


(b) Create the root of an FP-tree, and label it as ―null.‖

For each transaction Trans in D do the following.

Select and sort the frequent items in Trans according to the order of L.

Let the sorted frequent item list in Trans be [p|P], where p is the first element and P is the

remaining list. Call insert tree([p|P], T), which is performed as follows.

If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be 1, link its parent link to T, and link its node-link to the nodes with the same item-name via the node-link structure.

If P is nonempty, call insert tree(P, N) recursively.

2. The FP-tree is mined by calling FP growth(FP tree, null), which is implemented as follows.

procedure FP growth(Tree, α)

(1) if Tree contains a single path P then

(2) for each combination (denoted as β) of the nodes in the path P

(3) generate pattern β U α with support_count = minimum support count of nodes in β;

(4) else for each ai in the header of Tree{

(5) generate pattern β = ai U α with support_count = ai:support count;

(6) construct β‘s conditional pattern base and then β‘s conditional FP tree Treeβ;

(7) if Treeβ ≠ ∅ then

(8) call FP growth(Tree β, β); }
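The following Python sketch covers step 1 only (building the FP-tree with node-links for a header table); the recursive mining procedure above is not implemented, and the transactions are just an illustration.

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: find frequent items and sort them in support-descending order (the list L).
    counts = Counter(i for t in transactions for i in t)
    order = {i: r for r, (i, c) in
             enumerate(sorted(counts.items(), key=lambda kv: -kv[1])) if counts[i] >= min_sup}

    root = Node(None, None)
    node_links = defaultdict(list)   # header table: item -> nodes carrying that item

    # Pass 2: insert each transaction's frequent items, sorted according to L.
    for t in transactions:
        items = sorted([i for i in t if i in order], key=lambda i: order[i])
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
    return root, node_links

transactions = [["A", "B", "D"], ["A", "D"], ["A", "C"], ["B", "D", "E", "F"]]
root, links = build_fp_tree(transactions, min_sup=2)
for item, nodes in links.items():
    print(item, [(n.item, n.count) for n in nodes])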

7. What is Frequent Pattern Mining?

Frequent pattern mining searches for recurring relationships in a given data set.

Frequent itemset mining leads to the discovery of associations and correlations among items in

large transactional or relational data sets.

The discovery of interesting correlation relationships among huge amounts of business

transaction records can help in many business decision-making processes, such as catalog

design, cross-marketing, and customer shopping behavior analysis.

Market basket analysis is just one form of frequent pattern mining.

Classification of Frequent pattern mining:

Based on the completeness of patterns to be mined:


o We can mine the complete set of frequent itemsets, the closed frequent itemsets,

and the maximal frequent itemsets, given a minimum support threshold. We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, and top-k frequent itemsets.

o Different applications may have different requirements regarding the

completeness of the patterns to be mined, which in turn can lead to different

evaluation and optimization methods.

Based on the levels of abstraction involved in the rule set:

o Some methods for association rule mining can find rules at differing levels of

abstraction.

o For example,

buys(X, ―computer‖)→buys(X, ―HP printer‖)

buys(X, ―laptop computer‖)→buys(X, ―HP printer‖)

o In these Rules, the items bought are referenced at different levels of abstraction.

o The mined rule set then consists of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.

Based on the number of data dimensions involved in the rule:

o If the items or attributes in an association rule reference only one dimension, then it

is a single-dimensional association rule

o For example,

buys(X, ―computer‖)→buys(X, ―antivirus software‖)

o If a rule references two or more dimensions, then it is a multidimensional

association rule. For example,

age(X, ―30…39‖)^income(X, ―42K…48K‖)→buys(X, ―high resolution

TV‖):

Based on the types of values handled in the rule:

o If a rule involves associations between the presence or absence of items, it is a

Boolean association rule.

o For example,

The rule buys(X, ―computer‖)→buys(X, ―HP printer‖) is a Boolean association rule.


o If a rule describes associations between quantitative items or attributes, then it is a

quantitative association rule.

age(X, ―30…39‖)^income(X, ―42K…48K‖)→buys(X, ―high resolution TV‖)

Here the quantitative attributes age and income have been discretized into ranges.

Based on the kinds of rules to be mined:

o Frequent pattern analysis can generate various kinds of rules and other interesting

relationships. Many of the rules are redundant or do not indicate a correlation

relationship among itemsets. Thus, the discovered associations can be further

analyzed to uncover statistical correlations, leading to correlation rules.

o We can also mine strong gradient relationships among items. Example ―The average

sales from Sony Digital Camera increase over 16% when sold together with Sony

Laptop Computer‖: both Sony Digital Camera and Sony Laptop Computer are

siblings, where the parent itemset is Sony.

Based on the kinds of patterns to be mined:

o Many kinds of frequent patterns can be mined from different kinds of data sets like

frequent itemset mining, Sequential pattern mining, Structured pattern mining.

8. Explain Decision Tree Induction Algorithm using Information Gain

A decision tree is a classifier expressed as a recursive partition of the instance

space. The decision tree consists of nodes that form a rooted tree,

meaning it is a directed tree with a node called ―root‖ that has no incoming

edges. All other nodes have exactly one incoming edge. A node with outgoing

edges is called an internal or test node. All other nodes are called leaves (also

known as terminal or decision nodes). In a decision tree, each internal node

splits the instance space into two or more sub-spaces according to a certain

discrete function of the input attributes values. In the simplest and most frequent case, each test

considers a single attribute, such that the instance space is

partitioned according to the attribute‘s value. In the case of numeric attributes,

the condition refers to a range.

Each leaf is assigned to one class representing the most appropriate target

value. Alternatively, the leaf may hold a probability vector indicating the probability

of the target attribute having a certain value. Instances are classified by


navigating them from the root of the tree down to a leaf, according to the

outcome of the tests along the path.

The weather attributes are Outlook, Temperature, Humidity, and Wind Speed. They can

have the following values:

Outlook = {sunny, overcast, rain}

Temperature = {hot, mild, cool}

Humidity ={high, normal}

Wind = {weak, strong}

Ans.

Day Outlook Temperature Humidity Wind Play Tennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

Construct decision tree using classification algo:

Decide on which day you can play tennis.

Attribute <play Tennis> has 2 values {yes,no}

Info gained before splitting:

I(S1,S2) = -Σ (Si/S) log2(Si/S)

I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14)


=0.4097 + 0.5305

=0.9402

Outlook:

S11= Sunny + Play tennis (yes) (2)

S21= Sunny + Play tennis (no) (3)

S12= Overcast + Play tennis (yes) (4)

S22= Overcast + Play tennis (no) (0)

S13= Rain + Play tennis (yes) (3)

S23= Rain + Play tennis (no) (2)

E(outlook) =(5/14) I(S11,S21) +(4/14) I(S12,S22) +(5/14)I(S13,S23)

I(S11,S21) = -(2/5)log2(2/5) - (3/5)log2(3/5)

= 0.9708

I(S12,S22) = -(4/4)log2(4/4) + 0

= 0

I(S13,S23) = -(2/5)log2(2/5) - (3/5)log2(3/5)

=0.9708

E(Outlook) =(5/14) I(S11,S21) +(4/14) I(S12,S22) +(5/14)I(S13,S23)

=(5/14)*(0.9708) + (4/14)*(0) +(5/14)*(0.9708)

=0.6930

Gain(Outlook) =I(S1,S2) –E(Outlook)


= 0.9402 - 0.6930

=0.2472

Temperature:

S11= Hot + Play tennis (yes) (2)

S21= Hot + Play tennis (no) (2)

S12= Mild + Play tennis (yes) (4)

S22= Mild + Play tennis (no) (2)

S13= Cool + Play tennis (yes) (3)

S23= Cool + Play tennis (no) (1)

I(S11,S21) = -(2/4)log2(2/4) - (2/4)log2(2/4)

= 1

I(S12,S22) = -(4/6)log2(4/6) - (2/6)log2(2/6)

= 0.9179

I(S13,S23) = -(3/4)log2(3/4) - (1/4)log2(1/4)

=0.811

E(Temperature) = (4/14) I(S11,S21) + (6/14) I(S12,S22) + (4/14) I(S13,S23)

=(4/14)*(1) + (6/14)*(0.9179) +(4/14)*(0.811)

=0.9108

Gain(Temperature)= I(S1,S2) –E(Temperature)

= 0.9402 - 0.9108

= 0.0294

Humidity:

S11= high + Play tennis (yes) (3)

S21= high + Play tennis (no) (4)

S12= normal + Play tennis (yes) (6)

S22= normal + Play tennis (no) (1)

E(Humidity) =(7/14) I(S11,S21) +(7/14) I(S12,S22)

I(S11,S21) = -(3/7)log2(3/7) - (4/7)log2(4/7)

=0.523 + 0.4613

= 0.9843

I(S12,S22) = -(6/7)log2(6/7) - (1/7)log2(1/7)

=0.5916

E(Humidity) =(7/14) I(S11,S21) +(7/14) I(S12,S22)

= (7/14)*(0.9843) + (7/14) * (0.5916)

=0.7879

Gain(Humidity) = 0.9402-0.7879

=0.152

Wind:

S11= Weak+ Play tennis (yes) (6)

S21= Weak + Play tennis (no) (2)

S12= Strong + Play tennis (yes) (3)


S22= Strong+ Play tennis (no) (3)

I(S11,S21) = -(6/8)log2(6/8) - (2/8)log2(2/8)

= 0.311 + 0.5

= 0.811

I(S12,S22) = -(3/6)log2(3/6) - (3/6)log2(3/6)

=1

E(Wind) =(8/14) I(S11,S21) +(6/14) I(S12,S22)

=0.892

Gain(Wind)= 0.9402-0.892

= 0.048

As Outlook has the maximum gain, it is used as the decision attribute for the root node.

Outlook

Sunny Overcast Rain

Similarly, for the next level of nodes, find the gain with respect to the three values Sunny, Overcast and Rain.
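A short Python sketch that recomputes these gains from the table above using the same entropy and information-gain formulas, confirming that Outlook has the highest gain.

from math import log2
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) rows D1..D14 from the table.
data = [
    ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

base = entropy([row[-1] for row in data])          # I(9,5) = 0.940
for idx, name in enumerate(attributes):
    expected = 0.0
    for value in {row[idx] for row in data}:
        subset = [row[-1] for row in data if row[idx] == value]
        expected += len(subset) / len(data) * entropy(subset)   # E(attribute)
    print(f"Gain({name}) = {base - expected:.4f}")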

9. Explain K-means clustering

K-means clustering has following steps

The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k

clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.

Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can

be viewed as the cluster‘s centroid or center of gravity.

The k-means algorithm proceeds as follows:


First, it randomly selects k of the objects, each of which initially represents a cluster mean or

center. For each of the remaining objects, an object is assigned to the cluster to which it is the

most similar, based on the distance between the object and the cluster mean. It then computes

the new mean for each cluster. This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ(i=1 to k) Σ(p ∈ Ci) |p - mi|²

where E is the sum of the square error for all objects in the data set; p is the point in space

representing a given object; and mi is the mean of cluster Ci (both p and mi are

multidimensional).

Algorithm:

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

(1) arbitrarily choose k objects from D as the initial cluster centers;

(2) repeat

(3) (re)assign each object to the cluster to which the object is the most similar, based on the

mean value of the objects in the cluster;

(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;

(5) until no change;


Advantages:

Works well when the clusters are compact clouds that are rather well separated from one

another.

relatively scalable and efficient in processing.

Limitations:

Can be applied only when the mean of a cluster is defined.

The necessity for users to specify k in advance can be seen as a disadvantage.

Not suitable for discovering clusters with nonconvex shapes or clusters of very different size.

Sensitive to noise and outlier data points because a small number of such data can substantially

influence the mean value.

Example: Create 3 clusters for a set of values

{2,3,6,8,9,12,15,18,22}

Ans.

Step 1: Break the set into 3 clusters. Randomly assign data to three clusters and calculate the

mean value:

K1: 2,8,15 – mean 8

K2: 3,9,18 – mean 10

K3: 6,12,22 – mean 13.3


Step 2: Reassign the values to the numbers which are close to the mean of K1(8) to K1, mean of

K2(10) to K2, mean of K3(13.3) to K3.

K1: 2,3,6,8,9 – mean 5.6

K2: (empty) – mean taken as 0

K3: 12,15,18,22 – mean 16.75

Step 3: Reassign

K1: 6,8,9 – mean 7.67

K2: 2,3 – mean 2.5

K3: 12,15,18,22 – mean 16.75

Step 4:Reassign

K1: 6,8,9 – mean 7.67

K2: 2,3 – mean 2.5

K3: 12,15,18,22 – mean 16.75

Last two groups are same. 3 clusters are:

Cluster 1= {6,8,9}, Cluster 2 ={2,3} , Cluster 3={12,15,18,22}
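A minimal Python sketch of the same procedure on this one-dimensional data. It starts from the initial means of step 1 and keeps the previous centre whenever a cluster becomes empty; because k-means is sensitive to initialization and to how ties and empty clusters are handled, the converged clusters may differ from the hand-worked ones above.

def kmeans_1d(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest current centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:      # criterion has converged
            break
        centroids = new_centroids
    return clusters, centroids

data = [2, 3, 6, 8, 9, 12, 15, 18, 22]
clusters, means = kmeans_1d(data, centroids=[8, 10, 13.3])   # initial means from step 1 above
print(clusters, means)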

10. Explain Naïve Bayesian Classification with an example.

Predict a class label of an unknown tuple X= {age =’<=20’ , Income= ‘Medium’,

Student=’Yes’, Credit_Rating=’Fair’} using Naïve Bayesian Classification.

Age Income Student Credit_rating Class: Buys

Laptop

>30 Medium No Excellent No

<=20 High No Fair No

21..30 High Yes Fair Yes

<=20 High No Excellent No

21..30 Medium No Excellent Yes

21..30 High No Fair Yes

<=20 Medium Yes Excellent Yes

>30 Medium No Fair Yes

>30 Medium Yes Fair Yes


>30 Low Yes Fair Yes

<=20 Low Yes Fair Yes

>30 Low Yes Excellent No

21..30 Low Yes Excellent Yes

<=20 Medium No Fair No

Ans.

In order to classify, divide the age attribute into 3 ranges:

(0,20]

(20,30]

(30,∞)

Attribute Value Count Probabilities

Yes No Yes(9) No(5)

Age (0,20] 2 3 2/9 3/5

(20,30] 4 0 4/9 0

(30,∞) 3 2 3/9 2/5

Income Low 3 1 3/9 1/5

Medium 4 2 4/9 2/5

High 2 2 2/9 2/5

Student Yes 6 1 6/9 1/5

No 3 4 3/9 4/5

Credit Rating Fair 6 2 6/9 2/5

Excellent 3 3 3/9 3/5

X=< age =’<=20’ , Income= ‘Medium’, Student=’Yes’, Credit_Rating=’Fair’>

Step 1: Calculating the prior probability

P(Yes) P(No)

9/14 5/14

Step 2: Find conditional probability

P(X/Yes) P(X/No)

(2/9)*(4/9)*(6/9)*(6/9)=32/729 (3/5)*(2/5)*(1/5)*(2/5)=(12/625)

Step 3: Calculating likelihood of tuple X being in each class

Likelihood of buying a laptop= P(X/Yes) * P(Yes)


= (32/729) *(9/14)=32/1134 =0.028

Likelihood of not buying a laptop= P(X/No) * P(No)

=(12/625)*(5/14) = 6/875 = 0.0068

Step 4: We estimate P(X) by summing the individual likelihood values:

P(X) = 0.028 + 0.0068 = 0.0348

Step 5: Obtain the actual probabilities of each class:

P(Yes/X) = Likelihood of buying a laptop / P(X)

= 0.028/0.0348 = 0.8045

P(No/X) = Likelihood of not buying a laptop / P(X)

= 0.0068/0.0348 = 0.1954

X belongs to class ‗Yes‘.

Based on these probabilities, we classify the new tuple as ‗Yes: buys a laptop‘ because that class has the highest probability.
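A small Python sketch that recomputes these probabilities directly from the training table (no smoothing is applied, matching the hand calculation).

from collections import Counter

# (Age, Income, Student, Credit_rating, Buys_laptop) rows from the training table.
rows = [
    (">30","Medium","No","Excellent","No"),    ("<=20","High","No","Fair","No"),
    ("21..30","High","Yes","Fair","Yes"),      ("<=20","High","No","Excellent","No"),
    ("21..30","Medium","No","Excellent","Yes"),("21..30","High","No","Fair","Yes"),
    ("<=20","Medium","Yes","Excellent","Yes"), (">30","Medium","No","Fair","Yes"),
    (">30","Medium","Yes","Fair","Yes"),       (">30","Low","Yes","Fair","Yes"),
    ("<=20","Low","Yes","Fair","Yes"),         (">30","Low","Yes","Excellent","No"),
    ("21..30","Low","Yes","Excellent","Yes"),  ("<=20","Medium","No","Fair","No"),
]
X = ("<=20", "Medium", "Yes", "Fair")          # the unknown tuple to classify

class_counts = Counter(r[-1] for r in rows)
likelihoods = {}
for cls, n_cls in class_counts.items():
    prior = n_cls / len(rows)                  # P(class)
    cond = 1.0
    for i, value in enumerate(X):              # P(X | class) = product of P(xi | class)
        cond *= sum(1 for r in rows if r[-1] == cls and r[i] == value) / n_cls
    likelihoods[cls] = prior * cond            # P(X | class) * P(class)

total = sum(likelihoods.values())              # estimate of P(X)
for cls, lik in likelihoods.items():
    print(cls, round(lik / total, 4))          # posterior P(class | X)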

Example: Using the following training data (with Output 1 as the class label), classify the new tuple t = <Adam, M, 1.95 m> using naïve Bayesian classification.

Name Gender Height (in

metres)

Output 1 Output 2

Kristina F 1.6 Short Medium

Jim M 2 Tall Medium

Maggie F 1.9 Medium Tall

Martha F 1.88 Medium Tall

Stephanie F 1.7 Short Medium

Bob M 1.85 Medium Medium

Kathy F 1.6 Short Medium

Dave M 1.7 Short Medium

Worth M 2.2 Tall Tall

Steven M 2.1 Tall Tall

Debbie F 1.8 Medium Medium

Todd M 1.95 Medium Medium

Kim F 1.9 Medium Tall

Amy F 1.8 Medium Medium

Wynette F 1.75 Medium Medium


Ans.

In order to classify, divide the height attribute into 6 ranges:

(0,1.6]

(1.6,1.7]

(1.7,1.8]

(1.8,1.9]

(1.9,2]

(2,∞)

Probabilities associated with attributes:

Attribute Value Count Probabilities

Short Med Tall Short(4) Med(8) Tall(3)

Gender M 1 2 3 1/4 2/8 3/3

F 3 6 0 ¾ 6/8 0

Height (0,1.6] 2 0 0 2/4 0 0

(1.6,1.7] 2 0 0 2/4 0 0

(1.7,1.8] 0 3 0 0 3/8 0

(1.8,1.9] 0 4 0 0 4/8 0

(1.9,2] 0 1 1 0 1/8 1/3

(2,∞) 0 0 2 0 0 2/3

The table shows the counts and the corresponding probabilities associated with the attributes. Using the training data, we estimate the prior probabilities.

t = <Adam, M, 1.95 m>

Step 1: Calculating the prior probability

P(Short) P(Med) P(Tall)

4/15 = 0.267 8/15 = 0.533 3/15 = 0.2

Step 2: Find conditional probability

P(t/Short) P(t/Med) P(t/Tall)

(1/4)*0=0 (1/8)*(2/8)=(1/32) (3/3)*(1/3)=(1/3)

Step 3: Calculating likelihood of tuple t being in each class


Likelihood of being short= P(t/Short) * P(Short)

= 0*0.268=0

Likelihood of being medium= P(t/Med) * P(Med)

=(1/32)*(8/15)=0.01667

Likelihood of being tall= P(t/Tall) * P(Tall)

= (1/3)*(3/15)=0.0667

Step 4: We estimate P(t) by summing the individual likelihood values:

P(t) = 0 + 0.0167 + 0.0667 = 0.0833

Step 5: Obtain the actual probabilities of each

P(short/t) = Likelihood of being short/P(t)

= 0

P(Med/t) = Likelihood of being medium/P(t)

= 0.0167/0.0833 = 0.2

P(Tall/t) = Likelihood of being tall/P(t)

= 0.0667/0.0833 = 0.80

t belongs to class TALL.

Based on these probabilities, we classify the new tuple as TALL because that class has the highest probability.

11. Explain NAÏVE ALGORITHM with example

It is a brute-force algorithm to find the associations between items. This algorithm is explained with an example. In our example we consider 4 items for sale (Bread, Cheese, Juice, Milk) and 4 transactions, listed in Table 1.

Table1

Transaction ID Items

100 Bread,Cheese

200 Bread,Cheese,Juice

300 Bread,Milk

400 Cheese,Juice,Milk


The 4 items, all the combinations of these 4 items, and their frequencies of occurrence are given in Table 2.

Table2

Itemset Frequency

Bread 3

Cheese 3

Juice 2

Milk 2

Bread,Cheese 2

Bread,Juice 1

Bread,Milk 1

Cheese,Juice 2

Cheese,Milk 1

Juice,Milk 1

Bread,Cheese,Juice 1

Bread,Cheese,Milk 0

Cheese,Juice,Milk 1

Juice,Bread,Milk 0

Bread,Cheese,Juice, Milk 0

Our aim is to find association rules with a minimum support of 50% and a minimum confidence of 75%.

A minimum support of 50% means that an itemset must occur in at least two of the four transactions.

Such itemsets are called frequent.

The set of frequent itemsets is listed in Table 3.

Table 3

Itemset Frequency

Bread 3

Cheese 3

Juice 2

Milk 2

{Bread, Cheese} 2

{Cheese, Juice} 2

[Frequency >= 2]

The two-itemsets leading to association rules with the required confidence of 75% are determined as follows.

Confidence(X -> Y) = support(X U Y) / support(X) = P(X ∩ Y) / P(X)

In Table 3 the two-itemsets are (Bread, Cheese) and (Cheese, Juice):

Confidence(Bread -> Cheese) = 2/3 = 67%

Confidence(Cheese -> Bread) = 2/3 = 67%

Confidence(Cheese -> Juice) = 2/3 = 67%

Confidence(Juice -> Cheese) = 2/2 = 100%

Only the last rule (Juice -> Cheese) has confidence greater than 75%. Rules whose confidence exceeds the user-specified minimum confidence are called confident; similarly, itemsets that satisfy the required minimum support are called frequent.
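A minimal Python sketch of this brute-force enumeration over all itemsets of the four transactions; it reproduces the frequent itemsets of Table 3 and the single confident rule found above.

from itertools import combinations

transactions = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
                {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
items = sorted(set().union(*transactions))
min_support, min_confidence = 0.5, 0.75

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Brute force: enumerate every non-empty itemset and keep the frequent ones.
frequent = [c for r in range(1, len(items) + 1)
              for c in combinations(items, r) if support(c) >= min_support]
print("frequent itemsets:", frequent)

# Check every possible rule built from the frequent 2-itemsets.
for itemset in (f for f in frequent if len(f) == 2):
    for lhs in itemset:
        rhs = [i for i in itemset if i != lhs][0]
        conf = support(itemset) / support((lhs,))
        if conf >= min_confidence:
            print(f"{lhs} -> {rhs}  confidence = {conf:.0%}")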

12.What is Web Structure mining?

1. With Web structure mining, information is obtained from the actual organization of pages on the

Web.

2. It can be viewed as creating a model of the Web organization or a portion of it.

3. Uses:

to classify Web pages

to create similarity measures between documents.

4. Techniques of Web Structure Mining:

Page Rank and HITS Algorithm

5. Page Rank

Was designed to increase the effectiveness and improve the efficiency of the search engines.

It is used to measure the importance of a page and to prioritize the pages returned from a

traditional search engine using keyword searching.

Google uses this technique.

The PageRank value for a page is calculated based on the number of pages that point to it.

Given a page p, we use Bp to be the set of pages that point to p and Fp to be the set of links

out of p. The PageRank of p is defined as:

PR(p) = c × Σ over q in Bp of PR(q) / Nq

where 0 < c < 1 is used for normalization and Nq = |Fq|, the number of links out of q.

A rank sink is a problem that arises with PageRank when a cyclic reference occurs.
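A small Python sketch of the iterative PageRank computation on a made-up link graph. The teleport term (1 - c)/N added below is not part of the formula above; it is one common way of dealing with rank sinks.

def pagerank(links, c=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to (Fp)."""
    pages = list(links)
    pr = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Bp = pages q that point to p; each contributes PR(q) / |Fq|.
            new[p] = c * sum(pr[q] / len(links[q]) for q in pages if p in links[q]) \
                     + (1 - c) / len(pages)   # teleport term (assumption, not in the formula above)
        pr = new
    return pr

# Toy link graph (made up for the example).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))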


13. Explain HITS Algorithm

HITS aims at finding both authoritative pages and hubs.

An authority is the ‗best source‘ for the requested information and a hub is a page that

contains links to authoritative pages.

It identifies authoritative pages and hub pages by creating weights. A search can be viewed

as having a goal of finding the best hubs and authorities.

HITS algorithm:

Hyperlink-induced topic search (HITS) finds hubs and authoritative pages. The technique contains two components:

Based on a given set of keywords(found in a query), a set of relevant pages is found.

Hub and authority measures are associated with these pages. Pages with the highest values are returned.

A search engine SE, is used to find a small set, root set (R), of pages P, which satisfy

the given query, q. This set is then expanded into a larger set, base set (B), by adding

pages linked either to or from R. This is used to induce a subgraph of the Web. This

graph is the one that is actually examined to find the hubs and authorities.

HITS Algorithm:

R = SE(W, q)

B = R ∪ {pages that link to, or are linked from, pages in R}

G(B, L) = subgraph of W induced by B, with link set L

G(B, L') = G(B, L) with the links within the same site deleted

Xp = Σ over all (q, p) in L' of Yq    (find authority weights)

Yp = Σ over all (p, q) in L' of Xq    (find hub weights)

A = {p | p has one of the highest Xp}

H = {p | p has one of the highest Yp}
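A small Python sketch of the hub and authority weight iteration on an induced subgraph; the root/base set construction via a search engine is skipped and the link set L' is made up for the illustration.

from math import sqrt

# L': links of the induced subgraph after removing same-site links (made up).
links = [("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4"), ("p4", "p1")]
pages = sorted({p for edge in links for p in edge})

auth = {p: 1.0 for p in pages}   # Xp
hub = {p: 1.0 for p in pages}    # Yp

for _ in range(20):
    # Authority weight: sum of hub weights of pages pointing to p.
    auth = {p: sum(hub[q] for q, r in links if r == p) for p in pages}
    # Hub weight: sum of authority weights of pages p points to.
    hub = {p: sum(auth[r] for q, r in links if q == p) for p in pages}
    # Normalize so the weights stay bounded.
    a_norm = sqrt(sum(v * v for v in auth.values()))
    h_norm = sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("best authorities:", sorted(auth, key=auth.get, reverse=True)[:2])
print("best hubs:", sorted(hub, key=hub.get, reverse=True)[:2])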

14. What is Web Usage Mining?


1. Web usage mining performs mining on Web usage data, or Web logs. A Web log is a listing of

page reference data. It is also called clickstream data because each entry corresponds to a mouse

click.

2. When these logs are examined from a server perspective, mining uncovers information about the sites where the service resides and can help improve the design of the sites.

3. By evaluating a client‘s sequence of clicks, information about a user (or a group of users) is

detected. This could be used to perform pre-fetching and caching of pages.

4. Uses of Web usage Mining:

Personalization for a user can be achieved by keeping track of previously accessed

pages.

By determining frequent access behavior for users, needed links can be identified to

improve the performance.

Identifying common access behaviors can help improve actual design of Web pages and

make other modifications to the site.

Patterns can be used to gather business intelligence to improve sales and advertisement.

5. Activities of Web usage Mining:

Preprocessing activities center around reformatting the Web log data before processing.

Pattern discovery activities form the major portion of the mining activities because these

activities look to find hidden patterns within the log data.

Pattern analysis is the process of looking at and interpreting the results of the discovery

activities.

6. Preprocessing:

It includes cleansing, user identification , session identification, path completion and

formatting.

Before processing the log, the data may be changed in several ways. For security or privacy reasons, the page addresses may be changed into unique page identifications. This will also save storage space. Also, the data may be cleansed by removing irrelevant information.

All pages visited from one source could be grouped by a server to understand the

patterns of page references from each user.

A common technique is for a server site to divide the log records into sessions. A

session is a set of page references from one source site during one logical period. Login

and logoff represents the logical start and end of the session.

7. Data Structures:

Data Structures are used to keep track of patterns identified during the Web usage

mining process.


One of the basic data structures is the trie. A trie is a rooted tree, where each path from the root to a leaf represents a sequence. Tries are used to store strings for pattern-matching applications.

The wastage of space in a trie is reduced by compressing together nodes that have a degree of one. A compressed trie is called a suffix tree.

Figure shows sample tries

With the help of a suffix tree, it is efficient to find not only a subsequence within a sequence but also the common subsequences among multiple sequences.
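A minimal Python sketch of a trie storing page-reference sequences (the suffix-tree compression of single-child chains is not shown).

def insert(trie, sequence):
    # Each path from the root to a marked node represents one stored sequence.
    node = trie
    for page in sequence:
        node = node.setdefault(page, {})
    node["$"] = True            # end-of-sequence marker

def contains_prefix(trie, sequence):
    node = trie
    for page in sequence:
        if page not in node:
            return False
        node = node[page]
    return True

trie = {}
for session in [["home", "catalog", "item1"], ["home", "catalog", "item2"], ["home", "help"]]:
    insert(trie, session)

print(contains_prefix(trie, ["home", "catalog"]))   # True: shared prefix stored once
print(contains_prefix(trie, ["home", "cart"]))      # False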

8. Pattern Discovery

The most common data mining technique used on clickstream data is that of uncovering traversal patterns. A traversal pattern is a set of pages visited by a user in a session. Other types of patterns may be uncovered by Web usage mining.

Different types of Traversal Patterns are:

Association Rules

Episodes

Sequential patterns

Forward sequences

Maximal frequent sequences

9. Pattern Analysis:


Pattern Analysis is the process of looking at and interpreting the results of discovery

activities.

Once the patterns are identified, they must be analyzed to determine how that information

can be used. Some of the generated patterns may be deleted and determined not to be of

interest.

Web logs can identify patterns that are of interest because of their uniqueness or statistical

properties. Each log is examined to find patterns based on any desired requirements. The

patterns found across two logs are compared for similarity.

15. What is Web Content mining?

1. Web content mining examines the content of Web pages as well as results of Web searching.

The content includes text as well as graphics data. Web content mining is further divided into Web

page content mining and search results mining.

2. Web page content mining is traditional searching of Web pages via content, while Search results

mining is a further search of pages found from a previous search.

3. Web content mining can improve on traditional search engines through such techniques as

concept hierarchies and synonyms, user profiles, and analyzing the links between pages.

4. Web content mining uses data mining techniques for efficiency, effectiveness and scalability.

5. Web content mining can be divided into:

Agent-based approach

Database based approach

Agent-based approach:

They have software systems(agents) that perform the content mining.

For example, search engines. Intelligent search agents use techniques like user

profiles or knowledge concerning specific domains.

Information filtering utilizes IR techniques and knowledge of the link structures to retrieve and categorize documents.

Personalize Web agents use information about user preferences to direct their

search.

Database-based Approach:


It views the Web data as belonging to a database.

There have been approaches that view the Web as a multilevel database, and there

have been many query languages that target the Web.

6. Type of Text Mining:

Basic content mining is a type of text mining.

Text mining hierarchy

Simplest functions at the top and more complex functions at the bottom.

Many Web content mining activities have centred around techniques to summarize the information found.

Simple search engines retrieve relevant documents usually using a keyword-based retrieval technique similar to those found in traditional IR systems.

Figure shows the text mining hierarchy

16. Differentiate between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).

OLTP | OLAP

Application oriented | Used to analyze and forecast business needs

Up to date and consistent at all times | Data is consistent only up to the last update

Detailed data | Summarized data

Isolated data | Integrated data

Queries touch small amounts of data | Queries touch large amounts of data

Fast response time | Slow response time

Updates are frequent | Updates are less frequent

Concurrency is the biggest performance concern | Each report or query requires a lot of resources

Clerical users | Managerial/business users

Targets a specific process, such as ordering from an online store | Integrates data from different processes (ordering, processing, inventory, sales, etc.)

Performance sensitive | Performance relaxed

Few records accessed at a time | Large volumes accessed at a time

Read/update access | Mostly read, occasional update

No redundancy | Redundancy cannot be avoided

Database size is usually around 100 MB to 100 GB | Database size is usually around 100 GB to a few TB

17..Describe the Three-Tier Data Warehouse Architecture

Data warehouses often adopt a three-tier architecture, as described below.

1. The bottom tier is a warehouse database server that is almost always a relational

database system. Back-end tools and utilities are used to feed data into the bottom

tier from operational databases or other external sources (such as customer profile

information provided by external consultants). These tools and utilities perform data

extraction, cleaning, and transformation (e.g., to merge similar data from different


sources into a unified format), as well as load and refresh functions to update the

data warehouse. The data are extracted using application program

interfaces known as gateways. A gateway is supported by the underlying DBMS and

allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity); a small connection sketch appears after this architecture description.

This tier also contains a metadata repository, which stores information about

the data warehouse and its contents.

2. The middle tier is an OLAP server that is typically implemented using either


(1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations

on multidimensional data to standard relational operations; or

(2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server

that directly implements multidimensional data and operations.

3. The top tier is a front-end client layer, which contains query and reporting tools,

analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
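A hedged sketch of the gateway-based access described in the bottom tier, assuming the third-party pyodbc package is installed and that an ODBC data source named "warehouse" exists; the DSN, credentials, and table name are hypothetical:

    import pyodbc

    # Connect through the ODBC gateway; the gateway ships SQL to the warehouse DBMS.
    conn = pyodbc.connect("DSN=warehouse;UID=analyst;PWD=secret")
    cursor = conn.cursor()

    # The client only generates SQL; execution happens at the server.
    cursor.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)

    conn.close()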

18. What are the features of Data warehouse?

Subject-oriented:

A data warehouse is organized around major subjects, such as customer, supplier, product,

and sales. Rather than concentrating on the day-to-day operations and transaction

processing of an organization, a data warehouse focuses on the modeling and analysis of

data for decision makers.

Hence, data warehouses typically provide a simple and concise view around particular

subject issues by excluding data that are not useful in the decision support process.

Integrated:


A data warehouse is usually constructed by integrating multiple heterogeneous sources,

such as relational databases, flat files, and on-line transaction records.

Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on (a small sketch of this appears after the list of features below).

Time-variant:

Data are stored to provide information from a historical perspective (e.g., the past 5–10

years). Every key structure in the data warehouse contains, either implicitly or explicitly, an

element of time.

Nonvolatile:

A data warehouse is always a physically separate store of data transformed from the

application data found in the operational environment.


Due to this separation, a data warehouse does not require transaction processing, recovery,

and concurrency control mechanisms. It usually requires only two operations in data

accessing: initial loading of data and access of data.
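To illustrate the integrated property above, a minimal pandas sketch (the source layouts and values are invented) that harmonizes naming conventions and attribute encodings before loading:

    import pandas as pd

    # Two heterogeneous sources with different column names and encodings (invented).
    crm = pd.DataFrame({"cust_id": [1, 2], "sex": ["M", "F"]})
    web = pd.DataFrame({"customer_id": [3, 4], "gender": ["male", "female"]})

    # Harmonize column names and attribute encodings to one warehouse convention.
    crm = crm.rename(columns={"cust_id": "customer_id", "sex": "gender"})
    crm["gender"] = crm["gender"].map({"M": "male", "F": "female"})

    # Load the integrated, consistent view.
    customers = pd.concat([crm, web], ignore_index=True)
    print(customers)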

19. Explain DATA CUBE OPERATIONS

A number of operations may be applied to data cubes. The common ones are:

Roll-up

Drill-down

Slice and dice

Pivot


Roll-up

Roll-up is like zooming out on the data cube. It is required when the user needs further abstraction or less detail. This operation performs further aggregations on the data, for example from single degree programs to all programs offered by a school or department. We first define hierarchies on two dimensions. The hierarchy defined on the dimension degree is:
1. Science (BSc, BIT)
2. Medicine (MBBS)
3. Business and Law (BCom, LLB)

Drill-down


Drill-down is like zooming in on the data and is therefore the reverse of roll-up. It is an appropriate operation when the user needs further details or when the user wants to partition more finely or wants to focus on some particular values of certain dimensions. Drill-down adds more details to the data.

Slice and dice

Slice and dice are operations for browsing the data in the cube. The terms refer to the ability to look at information from different viewpoints. A slice is a subset of the cube corresponding to a single value for one or more members of the dimensions. For example, a slice operation is performed when the user wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. Let the degree dimension be fixed as degree = BIT. The slice will then not include any information about other degrees.


The dice operation is similar to slice but dicing does not involve reducing the number of dimensions. A dice is obtained by performing a selection on two or more dimensions.


Pivot or Rotate

The pivot operation is used when the user wishes to re-orient the view of the data cube. It may involve swapping the rows and columns, or moving one of the row dimensions into the column dimension.
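The following is an illustrative pandas sketch of these operations (not from the notes; the enrolment figures and the year dimension are invented, while the degree hierarchy follows the example above):

    import pandas as pd

    # Base-level cube data: students per degree, with the degree -> group hierarchy.
    enrol = pd.DataFrame({
        "degree":   ["BSc", "BIT", "MBBS", "BCom", "LLB", "BSc"],
        "group":    ["Science", "Science", "Medicine", "Business and Law", "Business and Law", "Science"],
        "year":     [2023, 2023, 2023, 2024, 2024, 2024],
        "students": [40, 25, 30, 50, 20, 45],
    })

    # Drill-down view: the most detailed level, per degree and year.
    detail = enrol.groupby(["degree", "year"])["students"].sum()

    # Roll-up: aggregate along the degree hierarchy (degree -> group).
    rolled_up = enrol.groupby(["group", "year"])["students"].sum()

    # Slice: fix a single value on one dimension (degree = BIT).
    bit_slice = enrol[enrol["degree"] == "BIT"]

    # Dice: select on two or more dimensions without dropping any dimension.
    diced = enrol[(enrol["group"] == "Science") & (enrol["year"] == 2023)]

    # Pivot (rotate): move a row dimension into the column dimension.
    pivoted = enrol.pivot_table(index="group", columns="year", values="students", aggfunc="sum")

    print(detail, rolled_up, bit_slice, diced, pivoted, sep="\n\n")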

20. Explain various Schemas used in Designing Data Warehouse

A data warehouse environment usually transforms the relational data model into some special

architectures. There are many schema models designed for data warehousing but the most


commonly used are:

- Star schema

- Snowflake schema

- Fact constellation schema

Which schema model should be used for a data warehouse is determined by analysing the project requirements, the available tools, and the project team's preferences.

STAR SCHEMA

The star schema architecture is the simplest data warehouse schema. It is called a star schema

because the diagram resembles a star, with points radiating from a center. The center of the star

consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas dimension tables are de-normalized. Despite being the simplest architecture, the star schema is the most commonly used nowadays and is recommended by Oracle.

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables, and measures that contain numeric facts. A fact table can contain fact data at the detail or aggregated level.

Dimension Tables

A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally much smaller than the fact table.

Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.

The main characteristics of the star schema:

• Simple structure -> easy-to-understand schema.
• Great query effectiveness -> small number of tables to join.
• Relatively long time to load data into dimension tables -> de-normalization and redundant data can make the tables large.
• The most commonly used schema in data warehouse implementations -> widely supported by a large number of business intelligence tools.
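A rough pandas sketch of a star schema query (not from the notes; all table contents are invented): one fact table with foreign keys and a numeric measure, joined to small de-normalized dimension tables and then aggregated:

    import pandas as pd

    # De-normalized dimension tables (invented).
    dim_product = pd.DataFrame({"product_key": [1, 2],
                                "product": ["Laptop", "Phone"],
                                "category": ["Computers", "Mobiles"]})
    dim_city = pd.DataFrame({"city_key": [10, 11],
                             "city": ["Chennai", "Delhi"],
                             "region": ["South", "North"]})

    # Fact table: foreign keys into the dimensions plus a numeric measure.
    fact_sales = pd.DataFrame({"product_key": [1, 2, 1],
                               "city_key":    [10, 10, 11],
                               "amount":      [900.0, 300.0, 950.0]})

    # A typical star query: join the fact table to its dimensions, then aggregate.
    joined = fact_sales.merge(dim_product, on="product_key").merge(dim_city, on="city_key")
    report = joined.groupby(["category", "region"])["amount"].sum()
    print(report)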


SNOWFLAKE SCHEMA

The snowflake schema architecture is a more complex variation of the star schema used in a data

warehouse, because the tables which describe the dimensions are normalized.

FACT CONSTELLATION

For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each of which describes facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.


The main shortcoming of the fact constellation schema is a more complicated design because many

variants for particular kinds of aggregation must be considered and selected. Moreover, dimension

tables are still large.


21. Describe Codd’s OLAP Characteristics

Codd et al.'s 1993 paper listed 12 characteristics (or rules) of OLAP systems. Another six followed in 1995. Codd restructured the 18 rules into four groups. These rules provide another point of view on what constitutes an OLAP system. All 18 rules are available at http://www.olapreport.com/fasmi.htm. Here we discuss the 10 characteristics that are most important.

1. Multidimensional conceptual view: As noted above, this is the central characteristic of an OLAP system. By requiring a multidimensional view, it is possible to carry out operations like slice and dice.

2. Accessibility (OLAP as a mediator): The OLAP software should sit between data sources (e.g. a data warehouse) and an OLAP front-end.

3. Batch extraction vs interpretive: An OLAP system should provide multidimensional data staging plus precalculation of aggregates in large multidimensional databases.

4. Multi-user support: Since the OLAP system is shared, the OLAP software should provide many normal database operations including retrieval, update, concurrency control, integrity and security.

5. Storing OLAP results: OLAP results data should be kept separate from source data. Read-write OLAP applications should not be implemented directly on live transaction data if OLAP source systems are supplying information to the OLAP system directly.

6. Extraction of missing values: The OLAP system should distinguish missing values from zero values. A large data cube may have a large number of zeros as well as some missing values. If a distinction is not made between zero values and missing values, the aggregates are likely to be computed incorrectly.

7. Treatment of missing values: An OLAP system should ignore all missing values regardless of their source. Correct aggregate values will be computed once the missing values are ignored.

8. Uniform reporting performance: Increasing the number of dimensions or database size should not significantly degrade the reporting performance of the OLAP system. This is a good objective, although it may be difficult to achieve in practice.

9. Generic dimensionality: An OLAP system should treat each dimension as equivalent in both its structure and operational capabilities. Additional operational capabilities may be granted to selected dimensions, but such additional functions should be grantable to any dimension.

10. Unlimited dimensions and aggregation levels: An OLAP system should allow unlimited dimensions and aggregation levels. In practice, the number of dimensions is rarely more than 10 and the number of hierarchies rarely more than six.
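For characteristics 6 and 7, a tiny pandas sketch (values invented) showing why missing values must be distinguished from zeros when computing aggregates:

    import pandas as pd

    # One genuine zero and one missing value (None becomes NaN in pandas).
    sales = pd.Series([100.0, 0.0, None, 300.0])

    mean_ignoring_missing = sales.mean()                  # missing value skipped -> about 133.3
    mean_if_missing_were_zero = sales.fillna(0).mean()    # missing treated as zero -> 100.0

    print(mean_ignoring_missing, mean_if_missing_were_zero)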


22.Explain OECD principles to protect information Privacy

1. Collection Limitation Principle

There should be limits to the collection of personal data and any such data should be obtained by

lawful and fair means and, where appropriate, with the knowledge or consent of the data subject.

2. Data Quality Principle

Personal data should be relevant to the purposes for which they are to be used, and, to the extent

necessary for those purposes, should be accurate, complete and kept up-to-date.

3. Purpose Specification Principle

The purposes for which personal data are collected should be specified not later than at the time of

data collection and the subsequent use limited to the fulfilment of those purposes or such others as

are not incompatible with those purposes and as are specified on each occasion of change of

purpose.

4. Use Limitation Principle

Personal data should not be disclosed, made available or otherwise used for purposes other than

those specified in accordance with Paragraph 9 except:

a) with the consent of the data subject; or

b) by the authority of law.

5. Security Safeguards Principle

Personal data should be protected by reasonable security safeguards against such risks as loss or

unauthorised access, destruction, use, modification or disclosure of data.

6. Openness Principle

There should be a general policy of openness about developments, practices and policies with

respect to personal data. Means should be readily available of establishing the existence and nature

of personal data, and the main purposes of their use, as well as the identity and usual residence of

the data controller.

7. Individual Participation Principle

An individual should have the right:


a) to obtain from a data controller, or otherwise, confirmation of whether or not the data controller

has data relating to him;

b) to have communicated to him, data relating to him

i) within a reasonable time;

ii) at a charge, if any, that is not excessive;

iii) in a reasonable manner; and

iv) in a form that is readily intelligible to him;

c) to be given reasons if a request made under subparagraphs (a) and (b) is denied, and to be able to

challenge such denial; and

d) to challenge data relating to him and, if the challenge is successful to have the data erased,

rectified, completed or amended.

8. Accountability Principle

A data controller should be accountable for complying with measures which give effect to the

principles stated above.

23. What are the major issues in Data Mining?

Mining methodology and user interaction issues:

Mining different kinds of knowledge in databases:

o Because different users can be interested in different kinds of knowledge, data

mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

o These tasks may use the same database in different ways and require the

development of numerous data mining techniques.

Interactive mining of knowledge at multiple levels of abstraction:


o Interactive mining allows users to focus the search for patterns, providing and

refining data mining requests based on returned results.

o User can interact with the data mining system to view data and discovered

patterns at multiple granularities and from different angles.

Incorporation of background knowledge:

o Background knowledge, or information regarding the domain under study, may

be used to guide the discovery process and allow discovered patterns to be

expressed in concise terms and at different levels of abstraction.

Data mining query languages and ad hoc data mining:

o Relational query languages (such as SQL) allow users to pose ad hoc queries for

data retrieval.

o High-level data mining query languages need to be developed to allow users to

describe ad hoc data mining tasks by facilitating the specification of the relevant

sets of data for analysis, the domain knowledge, the kinds of knowledge to be

mined, and the conditions and constraints to be enforced on the discovered

pattern.

Presentation and visualization of data mining results:

o Discovered knowledge should be expressed in high-level languages, visual

representations, or other expressive forms so that the knowledge can be easily

understood and directly usable by humans.

Handling noisy or incomplete data:

o Data cleaning methods and data analysis methods that can handle noise are

required, as well as outlier mining methods for the discovery and analysis of

exceptional cases.


Pattern evaluation—the interestingness problem:

o Several challenges remain regarding the development of techniques to assess the

interestingness of discovered patterns, particularly with regard to subjective

measures that estimate the value of patterns with respect to a given user class,

based on user beliefs or expectations.

Performance issues:

Efficiency and scalability of data mining algorithms:

o To effectively extract information from a huge amount of data in databases, data

mining algorithms must be efficient and scalable.

o In other words, the running time of a data mining algorithm must be predictable

and acceptable in large databases.


Parallel, distributed, and incremental mining algorithms:

o The huge size of many databases, the wide distribution of data, and the

computational complexity of some data mining methods are factors motivating

the development of parallel and distributed data mining algorithms

Issues relating to the diversity of database types:

Handling of relational and complex types of data:


o Databases may contain complex data objects, hypertext and multimedia data, spatial

data, temporal data, or transaction data. Specific data mining systems should be

constructed for mining specific kinds of data.

Mining information from heterogeneous databases and global information systems:

o The discovery of knowledge from different sources of structured, semistructured, or

unstructured data with diverse data semantics poses great challenges to data mining.

Data mining may help disclose high-level data regularities in multiple heterogeneous

databases that are unlikely to be discovered by simple query systems and may

improve information exchange and interoperability in heterogeneous databases.

o Web mining becomes a very challenging and fast-evolving field in data mining.