And the Oscar goes to… Creating an Academy Award winner predictor using supervised learning methods
Romi Gelman Noa Rogozinski
Introduction to Artificial Intelligence, The Hebrew University, 2015
ABSTRACT: Our goal in this project is to predict whether an Oscar nominated movie would indeed win an Oscar for “Best
Picture”. In order to create a worthy predictor we implemented two supervised learning techniques – the decision tree algorithm
with two different heuristics and the binary perceptron model. In this project we present the results, analyze how the implemented algorithms behave with different parameters, and compare the methods to arrive at the best predictor.
1. INTRODUCTION
1.1. The Problem
The Academy Awards, or The Oscars, is an annual American awards ceremony honoring cinematic
achievements in the film industry. It is considered to be the most important and prestigious cinema awards
ceremony in the world. “The Best Picture” award is the most important award category, as it represents all
of the efforts that are put into a film, including directing, acting, music composition, writing, and editing.
Thus, each year after the nominees for “Best Picture” are announced, the entire film industry, and the media that cover it, fill with speculations and predictions about the winner. Traditionally, these predictions have
been made by movie experts, but advances in technology have opened the way for new prediction methods.
For instance, a team of researchers1 has recently found crowdsourcing to be a very successful predictor,
translating Oscar predictions data from online betting markets into probabilities.
1.2. Our Approach
In this project, we wish to predict which movies out of a given list of real Oscar-nominees for “Best Picture”
would win an Oscar. We base our project on the presumption that Oscar-winning movies are picked, year after year, according to recurring patterns relating to the movies’ attributes.
Our approach for predicting whether Oscar nominated movies will win an Oscar is to address the problem as
a supervised learning problem. Supervised learning was a natural choice, since each movie can be defined as
a set of its attributes, and, with the use of existing information about past nominees, can be considered as an
example, with the attributes as the input and a label ("won" or "lost") as the desired output. The attributes of
these Oscar-winning and Oscar-losing examples can be then fed into various models from the supervised
learning field in order to train the models, and also to later test them.
For this project, we used the decision tree and the binary perceptron model, as they were the clear choice for
supervised learning models and fit well with our representation of the problem. The algorithms in both
models identify which of a movie’s attributes are more relevant to its chance at winning an Oscar and then
try to predict if a movie would win according to the values of its attributes. The performance of each model
and comparisons between them will be discussed later in the project.
We believe information about the significance of the different components of a film and their effect on
whether a film would have a higher chance to win an Oscar may be highly valuable for production
companies.
1 http://blog.revolutionanalytics.com/2014/02/oscars-betfair.html, http://www.predictwise.com/node/2837
2. REPRESENTATION OF THE PROBLEM
2.1. The Movie Attributes
Each of the movies that were nominated for “Best Picture” in the 87 years the Academy Awards have taken
place is represented as a set of attributes retrieved from IMDb and stored as an instance of the Movie class.
We decided to concentrate on attributes relating to the movie’s cast and crew and their previous success at
the Oscars, dry-stats attributes, content-related attributes, and whether the movie won other cinematic
awards that took place before the Oscars.
In order for the learning algorithms to allow generalization (and to avoid over-fitting the learned examples),
we discretized (made a coarser division of) the possible values of some attributes.
Following is the list of attributes used in our project –
a) Attributes relating to the content:
Genre, as extracted from the IMDb database (Drama, Comedy, Crime, Action, Biography,
Adventure, Animation, Western, Romance, Musical, History, Fantasy, War, Mystery, Thriller,
Sport, Family, Music, Sci-Fi)
MPAA rating – the Motion Picture Association of America rating of a film’s suitability for
certain audiences, based on its content: “G” – for general audiences, “PG” – parental guidance
suggested, “PG-13” – parents strongly cautioned, “R” – restricted
Age Category of Main Character – the age of the main character, rounded down to the nearest
10-year mark it belongs to (0, 10, 20, …, 60, 70); determined by the real age of the first
actor/actress listed for the movie in IMDb
Theme of movie – we picked three themes: war, physical and mental disease, and show
business; created a pool of representative keywords for each theme; and checked for each movie
whether it contains any of the keywords. If so, we classified the movie as belonging to the
corresponding theme.
Note – the selection of the themes is based on “movie experts' chatter” during the 2015 Oscars
ceremony.
b) Attributes relating to dry stats:
IMDb rating – the rating given to the movie by IMDb users: from 0 to 10, rounded to the
nearest multiple of 0.5
Runtime category - length of movie, rounded down to the closest half hour
Season of release – January to March, April to June, July to September, October to December
c) Attributes relating to the quality of the cast and crew:
Total number of times cast won an Oscar – counts the total number of times any of the first 3
actors listed for the movie in IMDb had won an Oscar (not including current movie)
Total number of times cast was nominated for an Oscar – counts the total number of times
any of the first 3 actors listed for the movie in IMDb were nominated for an Oscar, including
wins (not including current movie). This attribute was also discretized.
Producers’ quality – this attribute takes one of 3 values: “won” – if any of the producers of
the film had won an Oscar for other movies; “nominated” – if any of the producers of the film
had been nominated for an Oscar; “nada” – no nominations for any of the producers for other
films
Director’s quality – takes one of the same 3 values as producers' quality.
Writer’s quality – takes one of the same 3 values as producers' quality.
Editors' quality – takes one of the same 3 values as producers' quality.
d) Attributes relating to other cinematic achievements:
Won Bafta – whether the movie had won a BAFTA award
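To illustrate, the coarser divisions described above can be sketched as simple rounding helpers. The function names and exact bin edges below are illustrative, not the project's actual code:

```python
def discretize_rating(rating):
    """Round an IMDb rating (0-10) to the nearest multiple of 0.5."""
    return round(rating * 2) / 2

def discretize_runtime(minutes):
    """Round a runtime down to the closest half hour (in minutes)."""
    return (minutes // 30) * 30

def discretize_age(age):
    """Round the main character's age down to its 10-year mark, capped at 70."""
    return min((age // 10) * 10, 70)

def season_of_release(month):
    """Map a release month (1-12) to one of the four quarters."""
    seasons = ["JanToMar", "AprToJune", "JulyToSept", "OctToDec"]
    return seasons[(month - 1) // 3]
```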
2.2. Training Set and Test Set
In order to measure the predictive power of each of the models (their ability to correctly predict whether a
nominated movie will win an Oscar), we had to test the performance of each model on movies that weren’t
given as training examples during the model’s learning process. Therefore, for each trial, we divided our list
of movie nominees (our set of examples) into a training set and a test set; the split was done so that all of the
movies nominated in the same year were in the same subset. The movies in the training set were used to
teach the model, and the ones in the test set were used to measure the quality of the learning of the model.
The size of the training set differs from trial to trial. In each trial we applied one of two methods to divide
the examples into a training set and a test set. In the first method, the set is randomly created each time
and is therefore independent of the training sets in the other trials. As for the second method, we decided it
is more natural for predictions regarding Oscar winners to rely on the history of winnings, so the
training set contains the earlier years and the test set the later years.
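The two splitting methods can be sketched as follows; the `movies_by_year` mapping and the function name are hypothetical, not the project's actual interfaces:

```python
import random

def split_by_year(movies_by_year, train_fraction, chronological=True):
    """Split {year: [nominees]} into a training and a test set so that all
    movies nominated in the same year land in the same subset."""
    years = sorted(movies_by_year)
    if not chronological:
        random.shuffle(years)        # method 1: an independent random split
    cut = int(len(years) * train_fraction)
    train = [m for y in years[:cut] for m in movies_by_year[y]]
    test = [m for y in years[cut:] for m in movies_by_year[y]]
    return train, test
```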
2.3. Extracting the Data
We used a table2 listing all Oscar nominations and wins for “Best Picture”, “Best Actor”, “Best Supporting
Actor”, “Best Actress”, “Best Supporting Actress”, “Best Director”, “Best Film Editing”, “Best Writing –
Original Screenplay”, and “Best Writing – Adapted Screenplay”. This table was used as the base input list
of movies nominated for “Best Picture”, as well as for fetching the Oscar nomination and winning statistics
of the cast and crew for each movie. In order to extract the attributes for each movie in the input list, we
used the IMDb python package and the OMDb API web service (using the Python requests package).
Note: some of the attributes for some of the movies were missing from the database, so we created a
separate “doesn’t exist in db” value for each attribute (we will later regard this as the “missing attributes”
problem).
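A minimal sketch of fetching and normalizing attributes through the OMDb API with the requests package; the chosen fields and the handling of the sentinel value are simplified assumptions, and a valid OMDb API key is required:

```python
MISSING = "doesn't exist in db"  # sentinel for the "missing attributes" problem

def parse_omdb_response(data):
    """Map an OMDb JSON response onto a few of our attribute names,
    substituting the MISSING sentinel where a field is absent or 'N/A'."""
    if data.get("Response") != "True":
        return None  # movie not found in the database
    def field(key):
        value = data.get(key, "N/A")
        return MISSING if value == "N/A" else value
    return {"mpaa": field("Rated"), "imdbRating": field("imdbRating"),
            "runtime": field("Runtime"), "genre": field("Genre")}

def fetch_movie_attributes(title, api_key):
    """Fetch a nominee by title; assumes a valid OMDb API key."""
    import requests  # the external package the project used for OMDb calls
    resp = requests.get("http://www.omdbapi.com/",
                        params={"t": title, "apikey": api_key})
    return parse_omdb_response(resp.json())
```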
3. METHODS
3.1. Models and Algorithms
3.1.1. Decision Tree:
The Decision Tree model uses a list of attributes and a set of classified examples defined by their
attributes in order to create a flowchart-like structure (the tree) that is used to classify new inputs. Each
internal node of the tree represents an attribute, and its branches represent the different values for that
attribute. Each leaf node represents a classification. A new input can be mapped, according to its
attributes, to a path from the root of the tree to one of the tree's leaves, and is classified according to
that leaf (“true” or “false”).
The Decision Tree Learning algorithm (DTL) is based on recursively choosing the “best” attribute to
divide the current examples by and setting it as the root of the subtree. Choosing the “best” attribute is
done using various heuristics. We used the entropy based heuristics "Information Gain" and
"Information Gain Ratio" in our implementation.
The "Information Gain" heuristic chooses the attribute A with the largest IG, i.e. minimizes the
average conditional entropy - H(Examples|A).
IG(Examples|A) = H(Examples) − H(Examples|A)
H(Examples|A) = Σ_{v∈A} p(v) · H(Examples|A = v)
The "Information Gain Ratio" heuristic chooses the attribute with the largest ratio of information
gain to the intrinsic information.
2 http://www.ya-shin.com/awards/awards.html
IGR(A) = IG(Examples|A) / H(A)
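Both heuristics can be sketched in a few lines. Here an example is assumed to be an (attributes, label) pair, an illustrative representation rather than the project's Movie class:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H of a list of labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """IG(Examples|A) = H(Examples) - sum_v p(v) * H(Examples|A=v)."""
    groups = {}
    for attrs, label in examples:
        groups.setdefault(attrs[attribute], []).append(label)
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy([label for _, label in examples]) - remainder

def gain_ratio(examples, attribute):
    """IGR(A) = IG(Examples|A) / H(A), H(A) being the intrinsic information."""
    split_info = entropy([attrs[attribute] for attrs, _ in examples])
    return information_gain(examples, attribute) / split_info if split_info else 0.0
```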
3.1.2. Binary Perceptron:
The binary perceptron model is a classical classification model taken from the field of artificial neural
networks and supervised learning. This model is based on the idea of a single neuron that fires if and
only if its output, which is calculated as the total weighted sum of its inputs, is higher than a certain
threshold (in our case, 0). An n-dimensional binary perceptron has n inputs and n weights. During its
training process, it receives examples, which are n-dimensional inputs with their classification label
(⟨X⃗, y₀⟩), and adjusts its weights to better fit each example seen (according to the online algorithm shown below). Once the training process is done, an input can be classified into one of two categories according to whether the trained perceptron fired or not upon receiving it.
In our case, the n-dimensional input X⃗ is the set of attributes of a certain nominated movie. A movie’s
(binary) label in the training set (y₀) represents whether or not it won an Oscar. The weight given to the
i’th attribute is marked Wᵢ, and the perceptron’s output for a movie is marked Y.
The following equation describes the classification done by a trained perceptron –
Y = sgn(W⃗ · X⃗) = sgn(Σᵢ₌₁ⁿ Wᵢ · Xᵢ)
The Binary Perceptron Learning Algorithm –
Given a set of P inputs {⟨X⃗₁, y₀₁⟩, …, ⟨X⃗ₚ, y₀ₚ⟩}, where X⃗ᵢ is the i’th n-dimensional input and
y₀ᵢ ∈ {1, −1} is that input’s label, and given that the inputs are linearly separable, the following
algorithm finds a set of weights W⃗ ∈ ℝⁿ so that for each i ∈ {1…P} it holds that y₀ᵢ = Yᵢ.
1. Start with an arbitrary set of weights W⃗₀.
2. Loop over the given set of inputs from first to last repeatedly until convergence:
If y₀ᵢ = sgn(W⃗ᵢ₋₁ · X⃗ᵢ), go to the next example (the perceptron's output for the example
matches its label). Else, update the weights: W⃗ᵢ = W⃗ᵢ₋₁ + y₀ᵢ · X⃗ᵢ
3.2. Implementation
3.2.1. Decision Tree Implementation:
The implementation of the Decision Tree and the two different heuristics is as learned in class. Since
we build the decision tree using only the attribute values of the examples in the training set, when
testing movies from the test set we occasionally encounter a missing attribute value. In this case we
return the default value of false. We made this choice since most nominated movies don't win an Oscar,
and changing the default value to true did not noticeably affect the results.
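A sketch of the recursive tree construction and of the false-default classification described above, assuming an example is an (attributes, label) pair and a heuristic function such as information gain is passed in; this is an illustration, not the project's code:

```python
from collections import Counter

def build_tree(examples, attributes, score):
    """Recursive DTL sketch; `score` is a heuristic such as information gain."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority leaf
    best = max(attributes, key=lambda a: score(examples, a))
    by_value = {}
    for attrs, label in examples:
        by_value.setdefault(attrs[best], []).append((attrs, label))
    rest = [a for a in attributes if a != best]
    return {"attribute": best,
            "branches": {v: build_tree(subset, rest, score)
                         for v, subset in by_value.items()}}

def classify(tree, attrs):
    """Walk the tree; an attribute value unseen during training falls back to
    the default False, since most nominees lose."""
    while isinstance(tree, dict):
        value = attrs.get(tree["attribute"])
        if value not in tree["branches"]:
            return False
        tree = tree["branches"][value]
    return tree
```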
3.2.2. Perceptron Implementation:
The implementation of the Binary Perceptron is according to the algorithm described above. Two
problems we had to address while implementing the learning algorithm and testing of this model were
how to handle missing attributes and the convergence of the weights vector.
Any "missing attributes" (i.e. attributes of the perceptron that couldn’t be determined when creating the
movie instance of a nominee) were given a default value of 0. We chose 0 as the default value so that in
the case that the output of the perceptron doesn’t match the true winning status of the nominee and an
update of the weights vector is made, the update wouldn’t affect the current learned weight value of the
weight that represents the missing attribute.
Regarding the convergence of the algorithm, the algorithm described above converges only if the
examples are linearly separable (can be divided by a hyperplane into two classification areas so that all
of the examples with the same label are in the same area). According to Cover’s Theorem, as long as
P < 2N (the number of examples is less than double the dimension of the perceptron), all dichotomies
are possible – i.e. the examples can be linearly separable. In our case, N ≈ 30, and in most of the cases
we tested P > 60, so the examples are not necessarily linearly separable, and the algorithm is not
guaranteed to converge. For this reason, we limited the number of iterations as part of the algorithm's
implementation.
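The iteration-limited training loop can be sketched as follows; the cap of 6 passes mirrors the base configuration used later, and the list-based vectors are an illustrative simplification:

```python
def train_perceptron(examples, max_iterations=6):
    """Capped perceptron learning: since our examples (P > 60, N ~ 30) are not
    guaranteed to be linearly separable, stop after max_iterations passes.
    An example is (x, y0) with x a list of numbers and y0 in {1, -1}."""
    w = [0.0] * len(examples[0][0])     # an arbitrary initial weight vector
    for _ in range(max_iterations):
        converged = True
        for x, y0 in examples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if s > 0 else -1) != y0:                  # output != label
                w = [wi + y0 * xi for wi, xi in zip(w, x)]  # w <- w + y0 * x
                converged = False
        if converged:
            break
    return w
```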
The implementation of the input vectors for the perceptron was done with consideration to the fact that
some movie attributes have continuous numerical values (such as “imdb rating” and “times cast was
nominated”), while others have non-numeric values (such as “genre” and “plot keywords”).
“Continuously-valued” attributes were assigned a single weight in the perceptron (and so a single index
in the input vector), and “discrete-valued” attributes were assigned several “Boolean” weights – one for
each of its possible values.
In the process of teaching the perceptron a single example, translating a movie’s “continuously-valued”
attribute value into a value in its input vector involved simply extracting the value of the attribute and
assigning the attribute’s corresponding index in the input vector a value that matches the attribute’s
value in its magnitude. For example, a movie with a higher “imdb rating” would have a greater value in
the index representing “imdb rating” in its input vector, compared to a movie with a lower rating. On
the other hand, translating a movie’s “discrete-valued” attribute value into perceptron inputs involved
assigning a positive value to the index that represents that value in the input vector and assigning a
negative value to the indices that represent the other possible values for that attribute.
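The encoding described in this paragraph might be sketched like this; the structure of the arguments and the exact ±1 scheme are simplified assumptions rather than the project's actual vectors:

```python
def encode_movie(movie, continuous_attrs, discrete_values):
    """Translate a movie's attributes into a perceptron input vector: a
    continuous attribute keeps one index carrying its value; each possible
    value of a discrete attribute gets its own index, set to +1 when the
    movie has that value and -1 otherwise; missing attributes become 0 so a
    weight update leaves their weights untouched."""
    vector = []
    for attr in continuous_attrs:
        value = movie.get(attr)
        vector.append(0.0 if value is None else float(value))
    for attr, possible in discrete_values.items():
        value = movie.get(attr)
        for v in possible:
            vector.append(0.0 if value is None else (1.0 if value == v else -1.0))
    return vector
```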
4. THE TESTING PROCESS
After implementing the basic algorithm, we played with several parameters relating to different aspects of the
project (algorithms, attributes, and training examples) in order to find the configuration that gives the best
predictions. All the parameters chosen have some degree of freedom that allows us to change their values within
the limits of the chosen methods.
The parameters we chose are:
a) Method for choosing the test set examples – randomly or not
b) Number of perceptron training iterations
c) Training set size
d) Default choice for missing attribute values – the value returned by the decision tree when, while
testing an example from the test set, it follows a path that reaches an attribute value that didn’t exist
in the training set.
e) Attributes – which attributes from the above description are taken under consideration (inserted to the list of
attributes given to the DTL algorithm and perceptron model)
f) Level of Discretization – regular (the level of discretization described for each attribute in the second
chapter) or a more coarse division of the values for some attributes. A stricter discretization can be done on
attributes such as ‘genre’ (by defining five main genres – Action, History, Drama, Family and Sci-fi),
‘runtime’, ‘IMDb rating’ and ‘age category of main character’ (by narrowing it down to only 4 age groups).
5. RESULTS AND ANALYSIS
We examined our algorithms on two different sets of data: the first contained all the movies nominated from the
last 15 years and the second contained all movies since 1928, the first year the Academy Awards were held. For
each data set, we performed several tests, differing in several parameters.
In this section, we will show the results of the different tests, and in the next section we will analyze the results
and try to understand if we can answer the question of whether it is possible to predict if a movie will win an
Oscar, and if so, which method works best.
5.1. Evaluating the quality of the prediction
Since it is problematic to determine the quality of our predictors according to a single form of measurement,
we will use 5 different indicators:
Overall correct = (number of matches) / (test set size)
True positives = (number of positive matches) / (number of Oscar winners in test set) = percent of actual winners that were rightly predicted to be winners
True negatives = (number of negative matches) / (number of Oscar losers in test set) = percent of actual losers that were rightly predicted to be losers
False positives = (number of Oscar losers with a positive predictor outcome) / (number of positive predictor outcomes)
False negatives = (number of Oscar winners with a negative predictor outcome) / (number of negative predictor outcomes)
* A match signifies that the predictor’s output for the movie was the same as its real label, a positive match
is a match for a movie with a “true” label (did win the Oscar) and a negative match is a match for a movie
with a “false” label (didn’t win an Oscar).
We used several measures and not only the “Overall correct” measure due to the fact that most movies do
not win the Oscars (only 1 out of 5-6 movies wins each year), and thus even a predictor that returns “false”
for every input would show “good” performance according to this measure. The use of several relative
measures, when taken into account together, can portray a predictor’s quality in a more accurate way.
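The five indicators can be computed from parallel lists of predicted and true labels. This is a sketch, with an empty denominator arbitrarily mapped to 0.0:

```python
def evaluate(predictions, labels):
    """Compute the five indicators from parallel lists of predicted and true
    boolean labels; rates are returned as fractions in [0, 1]."""
    pairs = list(zip(predictions, labels))
    winners = sum(labels)
    losers = len(labels) - winners
    positives = sum(predictions)               # positive predictor outcomes
    negatives = len(predictions) - positives
    def rate(count, total):
        return count / total if total else 0.0
    return {
        "overall": rate(sum(p == y for p, y in pairs), len(pairs)),
        "true_positives": rate(sum(p and y for p, y in pairs), winners),
        "true_negatives": rate(sum(not p and not y for p, y in pairs), losers),
        "false_positives": rate(sum(p and not y for p, y in pairs), positives),
        "false_negatives": rate(sum(not p and y for p, y in pairs), negatives),
    }
```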
5.2. Results from the first data set
This data-set contains movies nominated for the "best picture" award, starting from the 2000 Academy
Awards and up to the 2014 Academy Awards.
5.2.1. Base Configuration results
We began our testing with a base configuration of parameters:
a) Training set size – 80% of the data set (training set consisted of movies from 2000-2011 and testing
set consisted of movies from 2012-2014).
b) Choosing the test set examples – examples were not chosen randomly, i.e. the movies in the training
set were taken from the chronologically first 80% of years in the last 15 years.
c) Perceptron training iterations – 6
d) Default choice for missing attribute values – False
e) Attributes - using all of the attributes described above
f) Level of Discretization – regular
Chart 1 shows that the Decision Tree with the information gain ratio heuristic gave the best prediction
for the base configuration. A figure of the learned tree can be seen in figure 1 in the appendix.
As can be seen in Chart 1, and repeatedly in the rest of the results, the false positives are a lot higher
than the false negatives. This may be misleading, since the false positive rate is calculated in relation to
the total number of positive predictor outcomes, which is small, whereas the false negative rate is
calculated in relation to the much larger number of negative predictor outcomes. Therefore, even a
single false positive creates a giant leap in the percentage of false positives.
Chart 1 - base configuration results (small data set)
When looking at the predictions this configuration returns for the movies nominated in the 2015
Academy Awards we receive the following –
Movie | Did the movie win the Oscars? | Prediction of the Decision Tree with the Information Gain heuristic | Prediction of the Decision Tree with the Information Gain Ratio heuristic | Prediction of the Perceptron
American Sniper | False | False | True | False
Birdman | True | False | True | True
Boyhood | False | False | False | False
Selma | False | False | False | False
The Grand Budapest Hotel | False | True | False | False
The Imitation Game | False | False | True | True
The Theory Of Everything | False | False | False | True
Whiplash | False | False | False | False
5.2.2. Comparisons to the base configuration
Our goal was to improve the results of the base configuration by playing with the values of the different
parameters listed in the previous chapter. For each of the parameters for comparison, we compared the
results of several different values in order to find the best one. We will now discuss the conclusions
drawn from the analysis of these comparisons.
When comparing different numbers of training iterations of the perceptron, we found that the
optimal number of training iterations in our model is approximately 6 (as can be seen in chart 2 in the
appendix – 7.1). Furthermore, we found that the size of the training set changes the quality of the
prediction significantly; an 80-20 split achieves the best results for this data-set (see chart 3).
Most of the attributes we used to represent a movie (listed above) have a positive effect on the quality
of the prediction. An exception is the 'age category of main character' attribute: removing it completely
improves the results of both heuristics of the decision tree (see chart 4).
When creating a strict discretization of some of the movie attributes ('genre', 'runtime', 'imdb rating',
'age category of main character' and 'times cast nominated'), we notice a clear loss in the quality of the
prediction of the decision tree with the information gain ratio heuristic and a clear gain in the quality of
the prediction of the decision tree with the information gain heuristic. The perceptron does better in
some measurements and worse in others. On the other hand, when narrowing the range of values of
only the 'genre' and 'age category of main character' attributes (partial discretization), we get a
significant improvement in the quality of the prediction of the decision tree with the information gain
ratio heuristic (this is in fact the best predictor found for this data-set). In the two other methods, some
measurements improve while others do worse (see chart 5).
Figure 2 - The best predictor found for the first data set. Trained on 80% of data set with base configuration
parameter values and discretization of the ‘genre’ and 'age category of main character' attributes.
In addition to these results, changes such as changing the default classification of the decision trees,
shuffling the years before assigning them to the training and test sets, balancing the number of winners
and losers in the data set, and removing additional attributes resulted in worse results compared to the
base configuration described above.
To conclude the results of the different comparisons, the best predictor for this data set was received
when using the decision tree model with the information gain ratio heuristic and setting the parameters
to the base configuration parameter values with the further discretization of the ‘genre’ and 'age
category of main character' attributes. This prediction gave 66.6% true positives, 95.6% true negatives,
33.3% false positives and 4.3% false negatives. The learned tree is shown in figure 2; the perceptron
weights of this configuration can be seen in the appendix.
5.3. Results from the second data set
The second data-set contains all the movies ever nominated for the "best picture" award, starting from the
1928 Academy Awards and up to the 2014 Academy Awards.
5.3.1. Base Configuration results
We began our testing with a base configuration of parameters:
a) Training set size – 70% of the data set
b) Choosing the test set examples – examples were not chosen randomly, i.e. the movies in the training
set were taken from the chronologically first 70% of years in the data set
c) Perceptron training iterations – 6
d) Default choice for missing attribute values – False
e) Attributes - using all of the attributes described above
f) Level of discretization - regular
Chart 6 shows that the Decision Tree with the information gain ratio heuristic gave the best prediction
for the base configuration when taking into account all of the measures together.
Chart 6 - base configuration results (complete data set)
When looking at the predictions this configuration returns for the movies nominated in the 2015
Academy Awards we get the following –
Movie | Did the movie win the Oscars? | Prediction of the Decision Tree with the Information Gain heuristic | Prediction of the Decision Tree with the Information Gain Ratio heuristic | Prediction of the Perceptron
American Sniper | False | False | False* | True*
Birdman | True | False | False* | True
Boyhood | False | False | False | False
Selma | False | False | False | True*
The Grand Budapest Hotel | False | False* | False | True*
The Imitation Game | False | False | False* | False*
The Theory Of Everything | False | False | False | True
Whiplash | False | False | False | True*
Predictions marked with * differ from the corresponding results on the small data set
We can see that training on the complete data set caused the decision trees to become more “negative”,
and the perceptron to become more “positive”, and overall the results are less accurate.
5.3.2. Comparisons to the base configuration
Similarly to the comparisons done with the first data set, we tried to improve the above results by
playing with the different parameter values and also by applying some of the conclusions from the first
data set. The results are presented here.
We found that the size of the training set does indeed impact the results, and that the 70%-30% split
seems to be best, though the comparison is not decisive, since the impact on the performance differs
from measure to measure (see chart 7 in the appendix – 7.2). Attempts to completely
remove an attribute didn’t result in an improvement (some of these attempts are shown in chart 8).
We also tried to recreate the success of the predictor that was achieved for the first data set by further
discretizing attributes. When we applied discretization to the ‘genre’ attribute no significant change
was noticed. By discretizing both the ‘genre’ and ‘age of main character’ attributes, the decision tree
with the information gain ratio heuristic and the perceptron both produced worse results (became
too “negative”), in contrast to the beneficial impact this change had on the first data set
(see chart 9).
In addition, once again we saw that changing the default classification of the decision trees and
shuffling the years before assigning them to the training and test sets resulted in the same or worse
results compared to the base configuration described above.
To conclude, the predictors for the complete data set were not very effective, and it is hard to say which
configuration was best. We present here one of the best configurations with its results. The following
prediction was received when using the decision tree model with the information gain ratio heuristic
and setting the parameters to the base configuration parameter values. This prediction gave results of
48% true positives, 78% true negatives, 69% false positives and 12% false negatives. The learned tree
is shown in an attached file named bestDecisionTreeBigDataSet(iGRatio).gv.pdf.
6. CONCLUSIONS AND DISCUSSION
6.1. Can we predict if a movie will win an Oscar?
The question of whether it is possible to predict, through the use of “dry” attributes regarding the movie and
its cast and crew, whether a nominated movie will win an Oscar doesn’t have a very clear answer.
According to the results, it seems that predictions have a better chance at being accurate if made based on a
relatively small range of years immediately prior to the range of years in the test set. Overall, the results for
the first and smaller data set (2000-2014) were better than the results for the large and complete data set. To
illustrate, the “true positives” rate in the complete data set never reaches the 50% mark, while the better
configurations used on the small data set all exceed 65% for “true positives”. Perhaps the qualities by which
movies are picked for winning the Oscars have changed over the years, and so a predictor that is trained from
a training set that contains movies from a long span of years may in fact be learning from “inconsistent” data,
which results in poor performance.
The results show that for most of the configurations, the prediction is better than chance level, but we still
wouldn’t put our money on it. The only configuration we found that we can consider a good predictor
involved using the decision tree model with the information gain ratio heuristic, setting the parameters to
the base configuration parameter values with the further discretization of the ‘genre’ and 'age category of
main character’ attributes, and running this configuration on the smaller data set (from the last 15 years)
with an 80-20 split. We can assume the discretization of these two attributes helped to avoid overfitting, and thus
produced better results.
Chart 10 – prediction results of applying the decision tree model with the information gain ratio heuristic on the small
data set, when setting the parameters to the base configuration parameter values and with the further discretization of the ‘genre’ and 'age category of main character’ attributes
6.2. Comparing between the methods
As described above, for both data sets the best results were achieved when using the decision tree model with
the information gain ratio heuristic. Even so, throughout most of the configurations in both data sets, the
performance of the perceptron is similar in quality to that of the IG ratio heuristic of the decision tree.
It is difficult to make an accurate distinction between the methods because their performance varies
from measure to measure: the perceptron does relatively well in the “true positives”
measurement, whereas the IG ratio heuristic of the decision tree usually has more “true negatives”.
When comparing the two heuristics of the decision tree, we see that the information gain ratio heuristic
usually does better. This is probably due to the fact that the information gain heuristic has a bias for choosing
attributes with many possible values, a bias that is reduced in the information gain ratio heuristic because it
takes the number and size of branches into account when choosing an attribute. This can also explain why
when we created a stricter discretization of attribute values or removed attributes with many values, the
performance of information gain improved, and sometimes even exceeded that of information gain ratio.
It is interesting to see that adding “Boolean” inputs to the perceptron (which classically receives
“continuous” inputs) did not have a negative impact on its prediction abilities. In fact, removing these kinds
of attributes from the perceptron did have a negative effect.
6.3. Going Forward
Throughout our project, we identified several issues that should be taken under consideration, and that could
be used to make improvements:
1) Using even distribution to divide attribute values –
When discretizing the different attributes' values, we chose bins using fixed intervals. Dividing the
values according to their distribution could contribute to making better distinctions between them.
2) A single winner per year –
A main issue is that when we implemented our algorithms we simplified the problem and didn’t take into
account the fact that only one of the nominees wins each year. Our algorithms test each movie separately
and not as part of the group of nominees for that year. Our predictor in fact seeks the “Oscar winner
quality” and not the “relative Oscar winner quality” (the movie with the best chances of winning the
Oscars among a group of specific nominees).
This desired type of prediction does not fit the decision tree and binary perceptron models. Trying to
solve this problem with different methods that address this issue could improve our results and lower our
typically high “false positives” rate.
3) Improving the Movie object –
We could of course expand our data sources and also improve the plot analysis in order to find new
attributes that might make more essential distinctions between movies. For example, we could add more
attributes relating to other prestigious awards the nominated movies had won.
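The first suggested improvement, even-distribution (quantile) binning, could be sketched as follows; this was not implemented in the project:

```python
def quantile_boundaries(values, num_bins):
    """Pick bin boundaries so each bin holds roughly the same number of
    training values, instead of the fixed intervals we actually used."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // num_bins] for k in range(1, num_bins)]

def assign_bin(value, boundaries):
    """Index of the bin a value falls into, given sorted boundaries."""
    return sum(value >= b for b in boundaries)
```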
7. APPENDIX
7.1. Comparison results of small data-set
Figure 1 – The decision tree received using the gain ratio heuristic on the first data set with the base configuration. An example of the path for the movie “The Imitation Game”, which was wrongly predicted:
wonBafta: false → timesCastNominated: 1 (rounded to 0) → timesCastWon: 0 → imdbRating: 8.0 → monthOfRelease: OctToDec → true
Chart 2 - comparing results of perceptron with different number of training iterations
Training set size – 70% | Training set size – 90%
Chart 3 - comparing the results of different sizes of training sets
Results when removing ‘times cast nominated’ attribute | Results when removing ‘age category of main character’ attribute
Chart 4 – examples of results when removing attributes
High discretization (genre, runtime, imdb rating etc.) | Added discretization of ‘genre’ and ‘age category of main character’
Chart 5 – comparing results when changing the level of discretization
Example of weight vector of the perceptron –
'wonBafta': 72, 'Sci-Fi': -24, 'Comedy': -11, 'Drama': -11, 'Action': -11, 'History': -11, 'Family': -11,
'ageCategoryOfMainCharacter-20': 0, 'ageCategoryOfMainCharacter-30': 16, 'ageCategoryOfMainCharacter-45': -8,
'ageCategoryOfMainCharacter-60': -8, 'AprToJune': 14, 'OctToDec': 6, 'JulyToSept': -26, 'JanToMar': 6,
'timesCastNominated': 0, 'timesCastWon': 6, 'editorQual': 20, 'writerQual': 0, 'producerQual': 10,
'directorQual': -26, 'imdbRating': 9.0, 'war': -26, 'retarded': 50, 'showBiz': -26, 'runtime': 0, 'mpaa': 6
7.2. Comparison results of big data set
Training set size – 60% | Training set size – 80%
Chart 7 - comparing the results of different sizes of training sets of the complete data set
Results when removing ‘genre’ attribute | Results when removing ‘times cast nominated’ attribute
Chart 8 – examples of results when removing attributes in complete data set
With discretization of the ‘genre’ attribute | With discretization of ‘genre’ and ‘age category of main character’
Chart 9 – comparing results when changing the level of discretization in complete data set