exploratory data analysis project 2

MATH 189 HOMEWORK 3

Yufan Gao

Yingbai He

Wenting Li

Zijian Tao

Yingjie Wu

Mengbin Zhang

1. Introduction

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks, created as a collaboration between LabROSA (Columbia University) and The Echo Nest. The Year Prediction MSD data set is a subset of the Million Song Dataset. Its task is to predict the release year of a song from its audio features.

2. Data

We start by drawing box plots of the data. The timbre vector generated by The Echo Nest has 12 features. The following 12 graphs show a box plot of each feature per year.
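A plot of this kind can be produced in R roughly as follows. This is only a sketch on simulated data, not the code used for the report: the data frame `songs` and its columns `year` and `timbre1` are hypothetical names chosen for illustration.

```r
# Simulated stand-in for the real data; `songs`, `year` and `timbre1`
# are hypothetical names, not taken from the report.
set.seed(1)
n <- 2000
songs <- data.frame(year = sample(1990:2010, n, replace = TRUE))
songs$timbre1 <- 40 + 0.2 * (songs$year - 1990) + rnorm(n, sd = 6)

# One box per year; points beyond 1.5 * IQR from the box are drawn as outliers.
bp <- boxplot(timbre1 ~ year, data = songs,
              xlab = "Year", ylab = "Timbre feature 1",
              main = "Timbre feature 1 by year")

# bp$stats holds the five-number summary per year (row 3 is the median),
# which makes it easy to inspect how the median moves over time.
yearly_median <- setNames(bp$stats[3, ], bp$names)
```

Repeating the call for each of the 12 timbre columns would give one panel per feature.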

The plots show that many points are outliers. The medians and the interquartile ranges fluctuate widely in the early years, possibly because there are too few songs from those years to give stable estimates. Some features grow over the years, such as Feature 1, while others decrease, such as Feature 6. Feature 2 is interesting: it increases in the early years but stays almost the same in the later years, which may mean that later music has similar values for this feature. Many other features are stable, with medians very close to each other from year to year. They do not seem to have changed much in recent decades, so they may have less impact on predicting the year than the other features. For example, Feature 1 seems more useful than Feature 12 for predicting the year of a song, since the year-to-year difference in Feature 12 is very small, which makes it harder to determine the relationship between that feature and the year.

In the following parts, we will analyze the data more closely.

3. Methods and Analysis

The following is a basic plot of the number of songs in each year. Its shape resembles an exponential curve, although investigating whether the counts actually follow such a pattern would contribute little to what we explore in this dataset.

According to the UCI repository, the data should be divided into a training part and a test part. A summary of the years in the training data shows that they span 1922 to 2011, with the majority lying in the 2000s.

4. Regression Time

We first fit a linear regression using the lm function. The output on the left is a small portion of the model summary. We then drop the predictors that have little significance, in other words those with a large p-value.

Then we use the formula “lm(V1 ~ . - V5 - V13 - V23 - V31 - V52 - V55 - V56 - V57 - V61 - V68 - V69 - V80 - V81 - V82 - V87 - V91, data = train)” to build a regression model from our training data, and we predict the year in the test data with interval = "confidence".

The summary shows the odd behavior of linear regression when it is applied to what is really a classification problem: the maximum prediction, 2051, is far beyond 2011, and the minimum, 1932, means that the model never predicts any year between 1922 and 1932. We can confirm this undesired behavior by calculating the mismatch rate.
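The regression step and the mismatch rate can be sketched as follows. The data here are simulated and the column names V1 to V6 are placeholders, so this is an illustration under assumptions rather than the report's actual code. In particular, it assumes that a prediction counts as a hit when it rounds to the true year; the report does not state the exact rule.

```r
# Simulated stand-in: V1 plays the role of the year, V2..V6 of the predictors.
set.seed(2)
n <- 3000
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("V", 2:6)))
year <- round(2000 + 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = 8))
dat <- data.frame(V1 = year, X)
train <- dat[1:2000, ]
test  <- dat[2001:3000, ]

# "." stands for all other columns; "- V5 - V6" drops predictors
# that were judged insignificant.
fit <- lm(V1 ~ . - V5 - V6, data = train)

# Point predictions plus a confidence interval for the mean response
# (columns "fit", "lwr", "upr").
pred <- predict(fit, newdata = test, interval = "confidence")
pred_range <- range(pred[, "fit"])  # not confined to the observed years

# Assumed hit rule: the rounded prediction equals the true year.
hit_rate <- mean(round(pred[, "fit"]) == test$V1)
mismatch_rate <- 1 - hit_rate
```

Because the fitted values are continuous and unbounded, nothing forces them to fall on an observed year, which is why exact hits are rare.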

The hit rate is only 5.1%, even on the training data. Such a low rate indicates that we should try another method, so we turn to linear discriminant analysis (LDA) for year classification. We fit the training data with the lda function and use the predict function to evaluate the rate of successful predictions.

The success rate is 8.7%, an improvement over the linear regression model. What we really care about, however, is the success rate on the test data, which we calculate as follows:

We then suspected that reducing the number of predictors might give better or similar results. A simpler model is easier to interpret than a complicated one, even when the two perform similarly.

The number above, 7.81%, is the training success rate, which is still better than that of the linear regression model. The test success rate is 7.48%, only slightly lower.
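The LDA workflow described above (fit on the training data, then score both the training and the test data, with a full and a reduced set of predictors) can be sketched as follows. The data are simulated, the names V1 to V5 are placeholders, and the helper `success_rate` is a hypothetical function written for this illustration; lda comes from the MASS package.

```r
library(MASS)  # provides lda()

# Simulated stand-in: five "years" as classes and four numeric predictors.
set.seed(3)
n <- 3000
year <- sample(2001:2005, n, replace = TRUE)
X <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("V", 2:5)))
X[, 1] <- X[, 1] + 0.8 * (year - 2003)   # only V2 carries information here
dat <- data.frame(V1 = factor(year), X)
train <- dat[1:2000, ]
test  <- dat[2001:3000, ]

# Hypothetical helper: share of observations whose predicted class is correct.
success_rate <- function(model, data) {
  pred <- predict(model, newdata = data)$class
  mean(as.character(pred) == as.character(data$V1))
}

# Full model: every predictor.
fit_full <- lda(V1 ~ ., data = train)
train_rate <- success_rate(fit_full, train)
test_rate  <- success_rate(fit_full, test)

# Reduced model: fewer predictors, easier to interpret.
fit_small <- lda(V1 ~ V2 + V3, data = train)
small_test_rate <- success_rate(fit_small, test)
```

Comparing `test_rate` with `small_test_rate` shows how much accuracy, if any, is lost by the simpler model.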

In general, though, the rate is below 10%, which is low. We therefore tried to predict the decade of a song instead, which should be easier. First we convert years into decades: for example, 1922 becomes 2 and 2007 becomes 10. Then we predict the decade, again using linear discriminant analysis. Here is what we get for the training data:

and for test data:
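For reference, one conversion consistent with the two examples given (1922 becomes 2, 2007 becomes 10) is integer division of the years since 1900 by 10. The function name `year_to_decade` is hypothetical, and the exact rule used in the report is an assumption.

```r
# Assumed rule: decades counted from 1900, so 1920-1929 -> 2, 2000-2009 -> 10.
year_to_decade <- function(year) (year - 1900) %/% 10

decades <- year_to_decade(c(1922, 1929, 1930, 2007, 2011))

# With years from 1922 to 2011 this leaves 10 classes (2 through 11)
# instead of about 90, which makes the classification task much coarser.
n_classes <- length(unique(year_to_decade(1922:2011)))
```

The resulting values would then be turned into a factor and passed to lda in place of the year.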

A rate of around 60% is far better than 7%. Finally, we try principal component analysis (PCA) to reduce the number of predictors by projecting all 90 predictors onto a lower-dimensional space.

We then put the new variables together to form new training and test data sets. We first predict with all of the components, and the result looks as follows:

and for test data:

We then try different numbers of principal components and calculate the test success rate for each, as shown in the following plot:

From this plot we can observe that the rate decreases nearly monotonically as we add later components to the model. The best result is therefore obtained when we train with only the first principal component. This is an intriguing phenomenon, and it reflects the essence of PCA, in which the first component explains the largest share of the variance in the features. In this case, we are able to build a simple model using only one predictor and reach a result similar to the earlier one, 57% versus 60%. We can therefore represent each observation by PC1 alone to observe the trend of music throughout the century, at least as represented by our data.
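The PCA experiment above can be sketched as follows: fit the components on the training predictors, project the test predictors onto them, and record the LDA test success rate for the first k components. Everything here is simulated and the variable names are placeholders, so the resulting curve need not look like the one in the report.

```r
library(MASS)  # provides lda()

# Simulated stand-in: three decades, six predictors, the first three of which
# share a latent signal tied to the decade.
set.seed(4)
n <- 3000
p <- 6
decade <- sample(8:10, n, replace = TRUE)
latent <- (decade - 9) + rnorm(n)
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 2:(p + 1))))
X[, 1:3] <- X[, 1:3] + latent
train_idx <- 1:2000

# Fit PCA on the training predictors only, then project the test predictors.
pca <- prcomp(X[train_idx, ], center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # share of variance per component
Z_train <- pca$x
Z_test  <- predict(pca, newdata = X[-train_idx, ])

# Test success rate of LDA using only the first k principal components.
test_rate <- sapply(1:p, function(k) {
  tr <- data.frame(decade = factor(decade[train_idx]),
                   Z_train[, 1:k, drop = FALSE])
  te <- data.frame(Z_test[, 1:k, drop = FALSE])
  fit <- lda(decade ~ ., data = tr)
  pred <- predict(fit, newdata = te)$class
  mean(as.character(pred) == as.character(decade[-train_idx]))
})

plot(1:p, test_rate, type = "b",
     xlab = "Number of principal components", ylab = "Test success rate")
```

Fitting the rotation on the training set alone keeps information from the test data out of the model. Note that the first component is the direction of largest variance, which does not by itself guarantee that it is the most predictive one.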

The plot suggests that later years have more varied music, since the PC1 values of later years span a wider range than those of earlier years. However, our dataset contains far more songs from later years, so the wider range might simply be a result of the larger volume of data. To see what really happens, we draw a final box plot of PC1 against year.

From this plot we can see that, as the years proceed, the timbre of the majority of songs does not change much. However, there are more and more “outliers”, which can be interpreted as other kinds of songs that are quite distinct from the majority. The plot might also imply that the variety of songs has increased in later years, although the mainstream remains the same.
