

Christoph F. Eick

A Few Answers Review September 23, 2010

1. In the news clustering problem we computed the distance between two news entities based on their (key-)wordlists A and B as follows: distance(A,B) = 1 - (|A ∩ B| / |A ∪ B|), with '| |' denoting set cardinality; e.g. |{a,b}| = 2. Why do we divide by |A ∪ B| in the formula?

2. What is the main difference between ordinal and nominal attributes?
   The values of ordinal attributes are ordered; this fact has to be considered when assessing similarity between two attribute values.

3. Name two descriptive data mining methods!

4. What are the reasons for the current popularity of knowledge discovery in commercial and scientific applications?

5. Most prediction techniques employ supervised learning approaches. Explain!

6. What role does exploratory data analysis play in a data mining project?

7. Assume we have a dataset in which the median of the first attribute is twice as large as the mean of the first attribute. What does this tell you about the distribution of the first attribute?

8. What is (are) the characteristic(s) of a good histogram (for an attribute)?
   It captures the most important characteristics of the underlying density function.

9. Assume you find out that two attributes have a correlation of 0.02; what does this tell you about the relationship of the two attributes? Answer the same question assuming the correlation is -0.98!
   0.02: (almost) no linear relationship. -0.98: a strong negative linear relationship exists; if the value of one attribute goes up, the value of the other goes down.
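As an illustration of question 9, the following is a minimal sketch that computes the Pearson correlation by hand. The function name `pearson` and the toy data are assumptions made for this example and do not come from the slides. The first data set shows that a correlation near 0 only rules out a linear relationship (here y is fully determined by x); the second shows a strongly negative correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalized standard deviations; the factor 1/n cancels in the ratio.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (sd_x * sd_y)


xs = [-2, -1, 0, 1, 2]
quadratic = [x * x for x in xs]        # y depends on x, but not linearly
decreasing = [-3 * x + 1 for x in xs]  # y falls as x rises

r_quadratic = pearson(xs, quadratic)    # close to 0: no linear relationship
r_decreasing = pearson(xs, decreasing)  # close to -1: strong negative linear relationship
print(r_quadratic, r_decreasing)
```

The quadratic case is the reason the answer says "no linear relationship" rather than "no relationship".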


More Answers

1. The decision tree induction algorithm, discussed in class, is a greedy algorithm. Explain!
   It does not backtrack or change previously made decisions; in general, greedy algorithms center on finding a path from the current state to the goal state, and do not revise the currently taken path from the initial state to the current state.

2. Compute the Gini-gain for a 3-way split for a 3-class classification problem; the class distribution before the split is (10, 5, 5) and after the split the class distributions are (0, 0, 5), (9, 2, 0) and (1, 3, 0).
   G(1/2, 1/4, 1/4) - (1/4*0 + 11/20*G(9/11, 2/11, 0) + 4/20*G(1/4, 3/4, 0))

3. What is overfitting? What is underfitting? What can be done to address overfitting/underfitting in decision tree induction?
   Overfitting: the model is too complex; the training error is very low but the testing error is not minimal.
   Underfitting: the model is too simple; both training error and testing error are high.

4. Most decision tree learning tools use gain-ratio and not information gain; why?

5. Are decision trees suitable for classification problems involving continuous attributes when classes have multi-modal (http://en.wikipedia.org/wiki/Multimodal) distributions? Give reasons for your answer.
   Yes, because
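The Gini-gain expression given for question 2 can be evaluated numerically. The sketch below assumes the usual definition G(p1, ..., pk) = 1 - sum(pi^2) and weights each child by its share of the examples; the function names `gini` and `gini_gain` are chosen for this example only.

```python
def gini(counts):
    """Gini index of a node, given its per-class example counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an empty node counts as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent, children):
    """Gini of the parent minus the size-weighted Gini of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


parent = (10, 5, 5)
children = [(0, 0, 5), (9, 2, 0), (1, 3, 0)]

gain = gini_gain(parent, children)
print(round(gain, 4))  # 0.3864, i.e. 5/8 - (0 + 9/55 + 3/40) = 17/44
```

The intermediate values are G(1/2, 1/4, 1/4) = 0.625, G(9/11, 2/11, 0) = 36/121 and G(1/4, 3/4, 0) = 0.375.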


Questions Review September 23, 2010

1. In the news clustering problem we computed the distance between two news entities based on their (key-)wordlists A and B as follows: distance(A,B) = 1 - (|A ∩ B| / |A ∪ B|), with '| |' denoting set cardinality; e.g. |{a,b}| = 2. Why do we divide by |A ∪ B| in the formula?

2. What is the main difference between ordinal and nominal attributes?

3. Name two descriptive data mining methods!

4. What are the reasons for the current popularity of knowledge discovery in commercial and scientific applications?

5. Most prediction techniques employ supervised learning approaches. Explain!

6. What role does exploratory data analysis play in a data mining project?

7. Assume we have a dataset in which the median of the first attribute is twice as large as the mean of the first attribute. What does this tell you about the distribution of the first attribute?

8. What is (are) the characteristic(s) of a good histogram (for an attribute)?

9. Assume you find out that two attributes have a correlation of 0.02; what does this tell you about the relationship of the two attributes? Answer the same question assuming the correlation is -0.98!

10. The decision tree induction algorithm, discussed in class, is a greedy algorithm. Explain!

11. Compute the Gini-gain for a 3-way split for a 3-class classification problem; the class distribution before the split is (10, 5, 5) and after the split the class distributions are (0, 0, 5), (9, 2, 0) and (1, 3, 0).

12. What is overfitting? What is underfitting? What can be done to address overfitting/underfitting in decision tree induction?

13. Most decision tree learning tools use gain-ratio and not information gain; why?

14. Are decision trees suitable for classification problems involving continuous attributes when classes have multi-modal (http://en.wikipedia.org/wiki/Multimodal) distributions? Give reasons for your answer.
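The distance from question 1 is the Jaccard distance between two keyword sets. The sketch below is one possible way to compute it; the function name `jaccard_distance`, the sample keyword lists, and the convention for two empty lists are assumptions made for this example. Dividing the overlap by the size of the union keeps the result between 0 (identical keyword sets) and 1 (no shared keyword), independent of how long the lists are.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two keyword collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty wordlists are treated as identical
    return 1 - len(set_a & set_b) / len(union)


# Hypothetical keyword lists for two news entities.
news1 = ["election", "senate", "vote", "budget"]
news2 = ["election", "vote", "economy"]

# 2 shared keywords out of 5 distinct keywords overall: 1 - 2/5
d = jaccard_distance(news1, news2)
print(d)
```

Without the division, two long articles would appear more similar than two short ones simply because they have more keywords that can overlap.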