K-Means Clustering Exercises and Loan Data in R


DESCRIPTION

Analyze evaluation measures for clustering algorithms, including the silhouette value and the pseudo-F statistic. The iris data set and a loans data set were analyzed to find the optimal number of clusters as determined by the silhouette value and the pseudo-F statistic. The R statistical language was used.

TRANSCRIPT

    STAT 522 Clustering and Affinity Analysis
    Assignment 3: Measuring Cluster Goodness
    Kevin Bahr

    1. Why do we need evaluation measures for clustering algorithms?

    As with any statistical algorithm, we need evaluation measures to determine how effective clustering algorithms are. There are three main evaluation tasks for clustering algorithms that help focus the analysis. We deploy methods to determine 1) the optimal number of clusters to identify, 2) whether one set of clusters is preferable to another, and 3) whether the clusters correspond to reality and not just mathematical convenience.

    To identify the optimal number of clusters, we can compare the pseudo-F statistic and/or the silhouette values of each clustering solution. Both methods take into account cluster cohesion and cluster separation.

    Determining if one set of clusters is preferable to another is also done through the pseudo-F statistic and/or the silhouette values of each clustering solution. We can conduct exploratory data analysis once we identify the optimal number of clusters. We can also split the data into training and testing sets in order to validate the clustering solution.

    We want the clustering solution to correspond to reality and not just mathematical convenience. Take the iris data set, for example. Figure 1 shows that there are 3 separate flower species. A clustering solution may recommend that 2 clusters are optimal (Figure 2) and have the highest silhouette value (0.630). The versicolor and virginica species could easily be combined into one cluster, since their Petal.Width and Petal.Length values lie much further from those of the setosa species. However, the analyst may want to use 3 clusters (Figure 3) in order to ease interpretation of the data and separate the 3 species, even though the silhouette value is slightly lower (0.507).

    Figure 1 - Iris Petal.Width vs Petal.Length w/ Species Overlay

    Figure 2 - Iris k=2 cluster solution

    Figure 3 - Iris k=3 cluster solution
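    As a rough illustration (not the exact code behind the figures above), the k = 2 and k = 3 solutions and their silhouette values might be compared in R as sketched below, assuming the two petal measurements are min-max normalized first and that the cluster package is available. Note that silhouette() uses the classic pairwise-distance definition, so its values may differ slightly from the simplified center-based values quoted in this assignment.

      library(cluster)                                   # provides silhouette()
      x <- iris[, c("Petal.Length", "Petal.Width")]      # variables shown in Figures 1-3
      x <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))

      set.seed(1)                                        # k-means results depend on the seed
      for (k in 2:3) {
        km  <- kmeans(x, centers = k, nstart = 25)
        sil <- silhouette(km$cluster, dist(x))
        cat("k =", k, " mean silhouette =", round(mean(sil[, "sil_width"]), 3), "\n")
      }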

    2. What is cluster separation and cluster cohesion?

    Cluster separation and cluster cohesion are two concepts that, when measured, help determine how good a cluster and/or a clustering solution is. Cluster separation is the distance from the center of one cluster to the center of another cluster, while cluster cohesion refers to how close together the data values are within a single cluster.

    The solid lines in Figure 4 show the cluster cohesion of cluster A, and the dotted lines show cluster separation.

    Figure 4 - Cluster Cohesion vs. Cluster Separation

    3. Why is SSE not necessarily a good measure of cluster quality?

    Sum of squares error (SSE) is not necessarily a good measure of cluster quality because it only measures the distance between each record and its cluster center. SSE always decreases as the number of clusters increases, so by itself it says little about cluster goodness, since cluster separation is not taken into account.

    Figures 5 and 6 show how SSE tends to always decrease as the number of clusters increases.

    Figure 5 - SSE Iris Data Set

    Figure 6 - SSE Loans Training Data Set
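    A minimal sketch of how an SSE-versus-k curve like Figures 5 and 6 might be produced in R, assuming x holds the min-max normalized data from the earlier sketch:

      # Total within-cluster SSE for k = 1..10; SSE keeps falling as k grows.
      set.seed(1)
      sse <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
      plot(1:10, sse, type = "b", xlab = "Number of clusters k", ylab = "SSE")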

    4. What is a silhouette? What is its range? Is it a characteristic of a cluster, a variable, or a data value?

    A silhouette is a measure of both cluster cohesion and cluster separation for a particular data value. Its full range is -1 to 1; in practice values typically fall between 0 and 1, with negative values indicating misclassified records. It is a characteristic of a data value.

    For a one-dimensional data set, the silhouette value of record i is determined by:

    $$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

    where b_i is the distance between the data value and the nearest cluster center other than its own, and a_i is the distance between the data value and its own cluster center.

    For multi-dimensional data, the silhouette value is computed in the same way, but a_i and b_i are determined by the Euclidean distance formula for each record. Let x represent the data point for the record, y represent the center of the cluster to which x is assigned, and z represent the next nearest cluster center. Then, for m variables:

    $$a_i = d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_m - y_m)^2}$$

    $$b_i = d(x, z) = \sqrt{(x_1 - z_1)^2 + (x_2 - z_2)^2 + \cdots + (x_m - z_m)^2}$$

    The silhouette value of a cluster is found by averaging the silhouette values of all records assigned to that cluster, whereas the silhouette value of a clustering solution is found by averaging the silhouette values of all records. These averages assess the goodness of a cluster and of the solution as a whole, respectively.
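    For concreteness, here is a sketch of the center-based silhouette calculation described above for a single record, using the normalized data x and a kmeans fit km from the earlier sketch (illustrative only):

      i  <- 1                                            # pick one record
      xi <- as.numeric(x[i, ])
      d  <- apply(km$centers, 1, function(m) sqrt(sum((xi - m)^2)))  # distance to every center
      a_i <- d[km$cluster[i]]                            # distance to its own cluster center
      b_i <- min(d[-km$cluster[i]])                      # distance to the nearest other center
      s_i <- (b_i - a_i) / max(a_i, b_i)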

    5. How do we interpret a silhouette value?

    We can interpret a silhouette value s_i through the following ranges (the labeled points in Figure 7 illustrate each case):

    s_i = 1: The data point lies exactly on the center of its cluster.

    0.50 < s_i < 1: Strong evidence that the data point fits its cluster assignment. Values in this range lie close to the center of their assigned cluster and far from the nearest other cluster.

    0.25 < s_i <= 0.50: Moderate evidence that the data point fits its cluster assignment.

    0 <= s_i <= 0.25: Weak evidence that the data point fits its cluster assignment; such values lie near the edge of their assigned cluster.

    s_i < 0: The data point has likely been assigned to the wrong cluster, since b_i, the distance to the nearest other cluster center, is smaller than a_i, the distance to its own cluster center.

    Figure 7 - Silhouette Points and Estimated Values

    6. Explain how silhouette accounts for both separation and cohesion.

    The silhouette accounts for both separation and cohesion by including both within-cluster and between-cluster variation in the calculation. Figure 8 shows several points assigned to cluster A, along with their distances to their own cluster center (solid lines, accounting for cluster cohesion) and to the nearest other cluster center (dotted lines, accounting for cluster separation). Points close to the center of cluster A will have larger silhouette values than points near its edge.

    Figure 8 - Cluster Separation and Cohesion with a_i and b_i

    7. How is average silhouette interpreted?

    The average silhouette value across all records measures the goodness of the clustering solution. Average silhouette values closer to 0 indicate a weak solution, whereas values closer to 1 indicate a better one, with tighter cohesion around the cluster centers and greater separation between the clusters. A subject matter expert may have insight into what counts as an acceptable average silhouette value for a given application.

    8. When will a data value have a perfect silhouette value? What is this value?

    A data value will have a perfect silhouette when it lies exactly on the center of its cluster; the silhouette value of such a point is 1. In Figure 9, the marked point has a perfect silhouette value of 1 because it lies on its cluster center, so a_i = 0 and s_i = (8 - 0)/8 = 1 (with b_i = 8 in that example).

    Figure 9 - A Perfect Silhouette Value of 1

    9. Describe what a silhouette plot is.

    A silhouette plot displays all of the silhouette values in a clustering solution in descending order, grouped by cluster. Figures 10, 11, and 12 show the silhouette plots for the iris data set with k = 2, 3, and 4.
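    A sketch of how silhouette plots like Figures 10-12 might be drawn in R with the cluster package, again assuming the normalized data x from the earlier sketch:

      for (k in 2:4) {
        km  <- kmeans(x, centers = k, nstart = 25)
        sil <- silhouette(km$cluster, dist(x))
        plot(sil, main = paste("Iris silhouette plot, k =", k))
      }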

    Figure 10 - Iris Silhouette Plot, k=2

    Figure 11 - Iris Silhouette Plot, k=3

    Figure 12 - Iris Silhouette Plot, k=4

    As the number of cluster centers increases, the integrity of the clusters decreases. The strongest cluster is cluster 1 in Figure 10, where k=2; no silhouette values in that cluster fall below 0.50. The silhouette value of the k=2 clustering solution, as we saw earlier in Figure 2, is 0.630, which is higher than those of the k=3 and k=4 clustering solutions.

    10. Should the analyst always choose the cluster solution with the better mean silhouette value? Explain.

    No, the analyst should not always choose the cluster solution with the better mean silhouette value. There may be times when the client is looking for a specific type of solution or when mathematical convenience does not correspond to reality.

    We saw earlier in Figure 1 that there are 3 species of flowers in the iris data set, but the clustering solution with k=2 has the highest mean silhouette value (Table 1 shows the mean silhouette values for k = 2, 3, and 4). In this situation, the analyst can propose 2 clusters based on the best silhouette value or 3 clusters based on the species structure. A subject matter expert can decide which to use.

    Table 1 - Centers and Mean Silhouette for Iris Data

    Centers   Mean Sil.
    2         0.732
    3         0.626
    4         0.547

    11. Explain how the pseudo-F statistic accounts for both separation and cohesion.

    The pseudo-F statistic accounts for both separation and cohesion by dividing the separation between clusters (MSB) by the cohesion within the clusters (MSE). With SSB defined as the sum of squares between clusters, SSE as the sum of squares within the clusters, k the number of clusters, and N the number of records, we can calculate the pseudo-F statistic as:

    $$F = \frac{MSB}{MSE} = \frac{SSB / (k - 1)}{SSE / (N - k)}$$

    We can conduct a quick example of the pseudo-F statistic with the iris data set for k=2.

    Running the k-means clustering leads us to:

    The grand mean M over (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) equals (0.428, 0.440, 0.440, 0.458).

    We then have the following cluster counts and centers:

    Cluster 1: n_1 = 50 and m_1 = (0.545, 0.363, 0.662, 0.656)
    Cluster 2: n_2 = 100 and m_2 = (0.196, 0.595, 0.078, 0.061)

    The SSB for each cluster is

    Cluster 1: $50 \times \{(0.545 - 0.428)^2 + (0.363 - 0.440)^2 + (0.662 - 0.440)^2 + (0.656 - 0.458)^2\}$

    Cluster 2: $100 \times \{(0.196 - 0.428)^2 + (0.595 - 0.440)^2 + (0.078 - 0.440)^2 + (0.061 - 0.458)^2\}$

    The two SSB values sum to 29.038, while the SSE (calculations not shown) is 12.127. We can now calculate the pseudo-F statistic:

    $$F = \frac{MSB}{MSE} = \frac{29.038 / 1}{12.127 / 148} \approx 354.4$$

    From the calculation, we can see clearly how the pseudo-F statistic accounts for both separation and cohesion.
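    The same quantity can be read off a kmeans fit directly, since R reports the between- and within-cluster sums of squares. A sketch, assuming x here holds all four min-max normalized iris measurements (unlike the two-variable sketch earlier, so the result may differ slightly from the hand calculation above depending on preprocessing and seed):

      km <- kmeans(x, centers = 2, nstart = 25)
      k  <- 2
      N  <- nrow(x)
      pseudo_F <- (km$betweenss / (k - 1)) / (km$tot.withinss / (N - k))
      pseudo_F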

    12. Why does the pseudo-F statistic have the word pseudo in its name?

    The pseudo-F statistic has the word pseudo in its name because it does not behave like a true F test: it tends to reject the null hypothesis (that there are no clusters in the data) and to find clusters even in randomly generated data that contains no true cluster structure.

    13. Explain how we can use the pseudo-F statistic to select the optimal number of clusters.

    The pseudo-F statistic can be used to select the optimal number of clusters by conducting k-means analysis on numerous different values of k and then calculating the pseudo-F statistic and p-value for each solution. The solution with the smallest p-value is considered to be the best clustering solution, even if the pseudo-F statistic is larger in another clustering solution. The different degrees of freedom need to be accounted for in each model.

    As a continuation of problem 11, we can calculate the p-value from the pseudo-F statistic and the two degrees of freedom: df1 = k - 1 = 1 and df2 = N - k = 148. We could then compare this p-value to those of other clustering solutions, say for k=3 and k=4. This is how we would use the pseudo-F statistic to find the optimal number of clusters.
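    A sketch of the corresponding p-value calculation in R, continuing from the pseudo-F sketch above:

      # Compare pseudo_F against the F distribution with df1 = k - 1 and df2 = N - k.
      p_value <- pf(pseudo_F, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
      p_value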

    14. True or false: The best clustering model is the one with the largest value of pseudo-F. Explain.

    False. One must account for the two degrees of freedom, k - 1 and N - k, for each model. The model whose pseudo-F statistic has the lower p-value is the preferable clustering solution.

    15. What is our cluster validation methodology?

    Our cluster validation methodology is used to ensure that the clusters found in the training data set mirror those in the test data set. This is done by first applying cluster analysis to the training data set to find an appropriate clustering solution, then applying that solution to the test data set, and finally using graphics and statistics to confirm that the training and test clusters match. How closely the two data sets must match is at the discretion of the analyst and/or a subject matter expert.
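    One way this flow might look in R is sketched below; train_x and test_x are hypothetical names for the normalized training and test data, and assigning each test record to the nearest training center is just one common way of applying the training solution to the test set (the exercises later in this assignment instead re-run k-means on the test set):

      km_train <- kmeans(train_x, centers = 3, nstart = 25)

      nearest_center <- function(row, centers) {
        which.min(apply(centers, 1, function(m) sqrt(sum((row - m)^2))))
      }
      test_cluster <- apply(test_x, 1, nearest_center, centers = km_train$centers)

      # Compare training and test cluster profiles
      aggregate(train_x, by = list(cluster = km_train$cluster), FUN = mean)
      aggregate(test_x,  by = list(cluster = test_cluster),     FUN = mean)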

    16. Why might statistical hypothesis tests not be very helpful for big data applications?

    Statistical hypothesis tests might not be very helpful for big data applications because the tests often reject the null hypothesis for very large sample sizes: even small differences between samples cause the null hypothesis to be rejected. With sample sizes as large as those in big data applications, the analyst should instead judge how practically significant these small differences are before supporting or rejecting the conclusion of the hypothesis test.

    17. What are the criteria for determining whether there is a match between the clusters uncovered in the training and test data sets?

    The p-value can be calculated and examined to see whether the null hypothesis should be rejected, but if the sample size is large, we risk the null hypothesis being unduly rejected. We can also examine how similar the data sets are by comparing the cluster means and standard deviations, and make a judgment based on how close together or far apart they are.

    Hands-on Exercises

    Use the Loans data set for Exercises 18-22.

    It is interesting to note, before analysis, that Request Amount and Interest are perfectly correlated. The Interest variable was therefore removed from the discussion, but can be added back upon request.

    All values were standardized using min-max normalization for the clustering exercises. The Approval variable was removed prior to clustering and added back to the data after clustering for analysis.
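    A sketch of the min-max normalization step described above; loans is a hypothetical name for the training data frame, and Approval is assumed to be coded 0/1:

      minmax   <- function(v) (v - min(v)) / (max(v) - min(v))
      approval <- loans$Approval                                  # set aside for later
      loans_mm <- as.data.frame(lapply(loans[, c("Debt.Income", "FICO", "Amount")], minmax))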

    18. Use k-means with k = 3 to generate a cluster model with the training data set.

    K-means was applied with k = 3 in R using the kmeans() function, as sketched below. The cluster sizes and centers were found, and the approval rating per cluster was calculated (Table 2).
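    A sketch of the calls that might produce a summary like Table 2, using loans_mm and approval from the normalization sketch above (exact figures depend on the seed and preprocessing):

      set.seed(1)
      km3 <- kmeans(loans_mm, centers = 3, nstart = 25)
      km3$size                                         # records per cluster
      round(km3$centers, 3)                            # cluster centers
      round(tapply(approval, km3$cluster, mean), 2)    # approval rate per cluster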

    Table 2 - Cluster Centers and Aggregate Approval, k=3

    Cluster #Records Debt.Income FICO Amount Approval

    1 71472 0.163 0.569 0.119 46%

    2 27454 0.212 0.621 0.666 37%

    3 51376 0.180 0.643 0.346 62%

    Cluster 1 is the largest and consists of 71,472 records. These records have the lowest Debt to Income Ratio and lowest Request Amount, but also the lowest average FICO Score, resulting in just under half of these loans being approved (46%).

    Cluster 2 is the smallest with 27,454 records. These records have higher Debt to Income Ratio than the other clusters, above average FICO Score, and the highest Request Amount. These records are more risky to lenders due to the higher Request Amount and Debt to Income Ratio.

    Cluster 3 has 51,376 records. They have the highest FICO Score, low-moderate Request Amount, and the most approvals out of the clusters (62%). Lenders view these attributes as less risky.

    We go on to examine the silhouette plot (figure 13) and mean silhouette values by cluster (table 3) to get a glimpse of the integrity of the clusters. Silhouette values were calculated using the Euclidean distance formula for multiple dimensions.

    Figure 13 - Silhouette Plot for Loans Training Data Set, k=3

    Table 3 - Mean Silhouette Values, k=3

    Cluster 1   Cluster 2   Cluster 3   Overall Mean Silhouette
    0.4765      0.5104      0.4398      0.4701

    Cluster 2, the green line in Figure 13, has drastically fewer individual records with silhouette values below 0.25, which corresponds to its having the highest average silhouette, 0.5104, in Table 3. The overall silhouette value for the clustering solution is 0.4701; we might consider the solution moderately good, though the judgment is open to interpretation by a subject matter expert.

    19. Use k-means with k = 4 to generate a cluster model with the training data set.

    Table 4 shows the results of the clustering solution with 4 centers and with the aggregated Approval rating added to the data.

    Table 4 - Cluster Centers and Aggregate Approval, k=4

    Cluster #Records Debt.Income FICO Amount Approval

    1 44791 0.187 0.640 0.376 59%

    2 24889 0.213 0.620 0.682 36%

    3 59271 0.152 0.656 0.140 66%

    4 21351 0.190 0.365 0.114 ~0%

    Cluster 1 consists of 44,791 records, has a moderate Request Amount, and above average FICO Score. The higher FICO Scores and modest Request Amounts result in over half of the records being approved for a loan.

    Cluster 2 consists of 24,889 records and has the highest Debt to Income Ratio and Request Amount, along with an above-average FICO Score. These records likely pose a moderate risk to lenders: the FICO Score is above average, but the Debt to Income Ratio is higher than in the other clusters and the Request Amount is the highest.

    Cluster 3 consists of 59,271 records and has the lowest Debt to Income Ratio, highest FICO Score, and very low Request Amount. These records had the highest approval rating, most likely because they tend to have the highest FICO Score, lowest Debt to Income Ratio, and low Request Amount. Naturally, these pose a lesser risk to the lender.

    Cluster 4 consists of 21,351 records and has the lowest FICO Score along with a low Request Amount. The low FICO Score, combined with the other two variables, resulted in only 4 records in this cluster being approved. Lenders probably view applicants with such low FICO Scores as riskier, resulting in barely any approvals.

    Moving on to the silhouette plot (Figure 14), we notice that cluster 2 has the lowest proportion of records with silhouette values below 0.25, while the majority of cluster 4's records fall below 0.50.

    Figure 14 - Silhouette Plot for Loans Training Data Set, k=4

    Table 5 - Mean Silhouette Values, k=4

    Cluster 1   Cluster 2   Cluster 3   Cluster 4   Overall Mean Silhouette
    0.4422      0.5021      0.4676      0.3979      0.4558

    Table 5 shows the overall silhouette value for the k=4 clustering solution is 0.4558, which is just slightly lower than the k=3 model. Cluster 2 has the strongest cluster integrity at 0.5021 while cluster 4 has the weakest cluster integrity at 0.3979.

    20. Compare the mean silhouette values for the two cluster models. Which model is preferred?

    The k = 3 model is preferred by the silhouette criterion, but the k = 4 solution is also very interesting, and a subject matter expert may want to judge which solution to proceed with. Using the silhouette method, mathematical convenience says to proceed with the k=3 solution because of its slightly higher average silhouette value (0.47 vs. 0.45). But when looking at how the clusters differ, the k=4 clustering solution has a wider range of approval ratings and cluster means.

    21. With the test data set, apply k-means with the value of k from the preferred model above.

    K-means clustering with k=3 was applied to the test data, which consists of 49,698 records. Values were first normalized using min-max normalization. The Approval variable was removed prior to clustering and added back to the data after clustering for analysis.
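    A sketch of that step, with test_loans as a hypothetical name for the test data frame and minmax from the earlier normalization sketch:

      approval_test <- test_loans$Approval
      test_mm  <- as.data.frame(lapply(test_loans[, c("Debt.Income", "FICO", "Amount")], minmax))
      km3_test <- kmeans(test_mm, centers = 3, nstart = 25)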

    22. Perform validation of the clusters you uncovered in Exercises 20 and 21.

    Density plots, means, standard deviations, p-values, t-stats, and approval ratings were compared for the training and test data.

    Figures 15-20 compare training and test cluster results through density plots. As seen in the previous analysis of the k=3 training data, the peaks of the density plots roughly correspond to the cluster centers. From the density plots, there appears to be little difference between the training and test data clusters, which bodes well for our analysis.
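    A sketch of one such density-plot comparison (Debt to Income Ratio by cluster in the training data; the test-data version is analogous, using test_mm and km3_test):

      plot(density(loans_mm$Debt.Income[km3$cluster == 1]),
           main = "Debt to Income Ratio by cluster, training data",
           xlab = "Debt.Income (min-max normalized)")
      lines(density(loans_mm$Debt.Income[km3$cluster == 2]), lty = 2)
      lines(density(loans_mm$Debt.Income[km3$cluster == 3]), lty = 3)
      legend("topright", legend = paste("Cluster", 1:3), lty = 1:3)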

    Figure 15 - Density Plot of Debt to Income Ratio, Training Data

    Figure 16 - Density Plot of Debt to Income Ratio, Test Data

    Figure 17 - Density Plot of FICO Score, Training Data

    Figure 18 - Density Plot of FICO Score, Test Data

    Figure 19 - Density Plot of Request Amount, Training Data

    Figure 20 - Density Plot of Request Amount, Test Data

    After looking at the density plots, we want a closer look at how much the training and test clusters differ from each other. Tables 6, 7, and 8 show the cluster means and standard deviations. Again, all values remain standardized using min-max normalization.

    Table 6 - Summary Statistics for Cluster 1 (Training: 71,472 records; Test: 23,741 records)

              DIR                  FICO                 Amount
              Training   Test      Training   Test      Training   Test
    Mean      0.164      0.160     0.570      0.574     0.119      0.125
    SD        0.130      0.126     0.155      0.155     0.066      0.069

    Table 7 - Summary Statistics for Cluster 2 (Training: 27,454 records; Test: 9,046 records)

              DIR                  FICO                 Amount
              Training   Test      Training   Test      Training   Test
    Mean      0.213      0.207     0.622      0.625     0.667      0.700
    SD        0.151      0.148     0.121      0.123     0.110      0.115

    Table 8 - Summary Statistics for Cluster 3 (Training: 51,376 records; Test: 16,911 records)

              DIR                  FICO                 Amount
              Training   Test      Training   Test      Training   Test
    Mean      0.180      0.178     0.643      0.645     0.347      0.364
    SD        0.123      0.124     0.107      0.111     0.080      0.084

    At a quick glance, the clusters appear to be very similar. We can be more precise by calculating the hypothesis test statistics for each cluster and variable. Table 9 shows the exact differences, t-statistics, and p-values.

    Table 9 - Variable Differences and Hypothesis Test Results

              Cluster 1                Cluster 2                Cluster 3
              DIR     FICO    Amt      DIR     FICO    Amt      DIR     FICO    Amt
    Diff      0.004   0.004   0.006    0.006   0.003   0.003    0.002   0.002   0.003
    t-stat    4.273   -3.97   -11.7    3.388   -2.26   -23.5    1.877   -2.05   -23.4
    p-val     0.000   0.000   0.000    0.000   0.024   0.000    0.060   0.039   0.000

    The hypotheses being tested for each variable in each cluster are:

    $$H_0: \mu_{\text{training}} = \mu_{\text{test}} \qquad H_a: \mu_{\text{training}} \neq \mu_{\text{test}}$$

    So we would reject H_0 if the p-value were sufficiently small, say below 0.05, for 95% confidence.
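    A sketch of one cell of Table 9, comparing the training and test means of Debt to Income Ratio within cluster 1 using a two-sample t-test; the remaining cells repeat this for each cluster and variable (object names follow the earlier sketches):

      # Note: cluster labels from separate kmeans runs may need to be matched
      # by comparing centers before this comparison is meaningful.
      t.test(loans_mm$Debt.Income[km3$cluster == 1],
             test_mm$Debt.Income[km3_test$cluster == 1])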

    At first glance, the hypothesis test results suggest that the cluster means from the training and test data sets do not match. However, we need to keep in mind that the null hypothesis is easily rejected for large sample sizes with small differences. Nearly all p-values are small, with the largest being 0.060. Instead, we want to look at the actual variable differences and judge how far apart they really are. The differences range from 0.002 to 0.006, which is quite small, even on the min-max normalization scale. It appears that we can consider the training and test clusters validated. Additionally, Table 10 compares approval rates between the training and test data.

    Table 10 - Approval Comparison between Training and Test Data

    Cluster   Training Data   Test Data
    1         46%             46%
    2         37%             36%
    3         62%             61%

    The approval ratings, which weren't included in the clustering analysis, match almost exactly across the clusters. This further validates the training and test clusters. For additional analysis, it would be interesting to obtain data on loan payback rates in order to optimize the analysis for profit and loss.