
cs229.stanford.edu/proj2014/

Is Beauty Really in the Eye of the Beholder?
Yun (Albee) Ling (yling), Jocelyn Neff (jfneff), and Jessica Torres (jntorres)

Abstract

Recent research suggests that high facial symmetry plays a large role in whether or not a person is deemed beautiful. To test this hypothesis, a beauty classifier was built utilizing various machine learning algorithms in hopes of modeling society's general idea of beauty. Data collection included survey responses and online facial images. Facial recognition and facial feature extraction were performed with Haar Cascades. Beauty classifiers consisted of algorithms such as Random Forest, Support Vector Machines, K-Nearest Neighbors, and Logistic Regression. Facial features were detected with 0.02 error, and beauty was correctly classified using the Random Forest model with 0.21 error.

1 Introduction

A common belief is that more symmetric faces are more beautiful. If this is true, could it be possible that beauty is not subjective, but in fact a result of a mathematical compilation of facial features? By applying machine learning to face detection, facial feature extraction, and the quantification of facial symmetry, a model could theoretically be built that attempts to predict a face's beauty based on features alone. Using features that represent the symmetry of a face, a model's high accuracy could suggest that society's idea of beauty can be learned by a machine. Moreover, a model's poor performance could highlight intrinsic qualities of beauty that cannot be quantified in model features alone.

2 Materials and Methods

2.1 Data

2.1.1 Kaggle

Kaggle provided 7,048 96x96 gray-scale images with the pupils, eye corners, eyebrow corners, and mouth corners identified. Each image was a head shot of a face. The images were not necessarily head-on shots, and the subjects had varied facial expressions. In addition, there were only 50 subjects in total: each subject had multiple photos with a variety of poses and facial expressions. In order to train the Haar Cascades classifier, the Kaggle dataset was partitioned into 70% training and 30% testing.

2.1.2 Survey

Fourteen celebrity images were selected to be distributed in a survey asking users for their assessment of the celebrity's beauty. Widely held opinions of a celebrity's attractiveness could impact responses, so lesser-known celebrities were chosen. In addition, the set of celebrities was limited to Caucasians to reduce biases caused by ethnicity. There were equal numbers of males and females. Each image was selected so that each face was forward-facing with little facial tilt. These qualifications eased facial detection, ensuring that the features and distances derived from these facial features were correct. Each celebrity was rated on a scale of 1 to 10. With 242 responses, there were a total of 3,388 ratings of facial beauty. The mean rating across all faces was 6.22. The rating distribution for each face followed a Normal distribution whose mean varied from 4.26 to 8.23. These photos served as the test set for the models later developed (Figure 1).

2.1.3 Online Images

In addition to the celebrity images, 100 images of faces were collected online to serve as the training set for the models. These images also served as the test set for the Haar Cascade feature detector: in order to train the models, facial features first needed to be identified. For each of these images, a binary classification of beautiful or not beautiful was given. These images, like the celebrity images, were cropped to 96x96 pixels so that there would be closer resemblance between the Kaggle training set and the test set.

2.2 Facial Feature Extraction

Using the Haar Cascades facial feature detector provided by OpenCV [4], separate models were developed to detect the bounding box for the face, eyes, nose, and mouth. OpenCV had a face and eye detector, but lacked a mouth and nose detector. In order to detect the nose, a Haar Cascade model was trained on 45 positive images and 100 negative images selected from the 70% of the Kaggle data labeled as training data. For the developed model, a window size of 24 x 24 pixels was used with 5 stages of training, a minimum hit rate of 0.99, and a maximum false alarm rate of 0.5. The positive images were tight bounding boxes of the facial features to be detected.

In a visual comparison of the developed and publicly available models, the available model produced a tighter and more accurate bounding box for the features. Therefore, the online model was used for test-set feature extraction. Similar results were found with a nose detector available online [4].

In order to minimize false positives, the search space for the classifier at each stage was minimized. First, the bounding box for the face was identified. Then, the pixel range for searching for the eyes, nose, and mouth was reduced to within that box. The eye model produced the largest number of false positives, so additional constraints were implemented. Given that all faces were 96x96 pixels and were constrained to be similar facial shots, the search space was reduced to the upper 30 pixels of the face. If the detector still found more than two "eyes", all boxes were considered questionable and the classifier did not classify the eyes. If, however, the detector found one or two "eyes", the box with the lower x-axis pixel location (i.e., whose top-left corner was further to the left side of the picture) was deemed to be the right eye (left and right sides of the face are from the subject's point of view).

Nose and mouth were also detected. The mouth classifier was likely to return many "mouths", so only the bounding box with the largest y coordinate (and thus furthest toward the bottom of the picture) was classified as the mouth.
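The eye- and mouth-selection heuristics above can be sketched as follows. This is an illustrative Python sketch, not the project's actual code (which used OpenCV); `pick_right_eye` and `pick_mouth` are hypothetical helper names, with boxes given as (x, y, w, h) tuples and the origin at the image's top-left corner.

```python
# Sketch of the detection heuristics described above (hypothetical helpers).
# Boxes are (x, y, w, h) tuples; the origin is the image's top-left corner.

def pick_right_eye(eye_boxes):
    """With one or two candidate 'eyes', the box whose top-left x is
    smallest (furthest left in the image) is taken as the subject's
    right eye; more than two candidates means no classification."""
    if not 1 <= len(eye_boxes) <= 2:
        return None  # questionable detections: leave eyes unclassified
    return min(eye_boxes, key=lambda box: box[0])

def pick_mouth(mouth_boxes):
    """Among candidate 'mouths', keep the box with the largest y
    coordinate, i.e. the one nearest the bottom of the picture."""
    if not mouth_boxes:
        return None
    return max(mouth_boxes, key=lambda box: box[1])
```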

2.3 Feature Matrix Creation

After facial features were detected, a feature matrix was created as input for the beauty classification models. In order to calculate various distance statistics, coordinates for the center of the boxes enclosing each facial feature (eyes, nose, and mouth) were identified. The final feature matrix included basic facial measurements commonly cited in the literature [6], such as the length and width of the nose. In addition, attributes concerned with symmetry were added, since our hypothesis is that facial symmetry is associated with beauty. Specifically, two measurements cited in multiple studies [5,7], Central Facial Asymmetry (CFA) and Facial Asymmetry (FA), were part of our feature matrix. FA is the sum of pairwise differences between the midpoints of each facial feature in the x-axis direction, while CFA is the sum of differences between the midpoints of adjacent facial features. Figure 2 is a visualization of all the distance statistics calculated for each face.
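The two asymmetry statistics can be sketched as follows. The exact formulas in [5,7] differ in details, so this illustrative Python sketch is one literal reading of the definitions above, assuming each feature's midpoint is an (x, y) coordinate and features are listed top-to-bottom.

```python
# One plausible reading of FA and CFA as defined above (illustrative, not
# the papers' exact formulas). Midpoints are (x, y) coordinates.
from itertools import combinations

def facial_asymmetry(midpoints):
    """FA: sum of pairwise x-axis differences between feature midpoints."""
    return sum(abs(a[0] - b[0]) for a, b in combinations(midpoints, 2))

def central_facial_asymmetry(midpoints):
    """CFA: sum of x-axis differences between midpoints of adjacent
    features (adjacent in the given top-to-bottom order)."""
    return sum(abs(a[0] - b[0]) for a, b in zip(midpoints, midpoints[1:]))
```

A perfectly symmetric face, with all feature midpoints on one vertical line, scores zero on both statistics.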

2.4 Classification

Four classification algorithms were implemented in R: Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF) [1]. All classification models were evaluated with the leave-one-out cross-validation (LOOCV) method.
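The LOOCV procedure can be sketched as follows: each of the n examples is held out once, a model is fit on the remaining n-1, and the held-out prediction is scored. This is an illustrative Python sketch (the project used R); `fit_predict` is a hypothetical stand-in for any of the four classifiers.

```python
# Minimal sketch of leave-one-out cross-validation. `fit_predict` is a
# hypothetical callable: it trains on (train_X, train_y) and returns the
# predicted label for a single held-out example.

def loocv_error(X, y, fit_predict):
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i+1:]   # all examples except the i-th
        train_y = y[:i] + y[i+1:]
        if fit_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors / len(X)
```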

2.4.1 Feature Selection

The original feature matrix included 40 features. A reduced feature matrix included the 10 features most correlated with human ratings of beauty in the 100 images, selected using the "FSelector" package in R [1]. All machine learning models were run on both feature matrices for comparison. P-values of FA and CFA for each of the 14 celebrity faces were computed based on the 100 training images: each p-value was calculated by dividing the number of training faces with the same or lower value of FA or CFA as the test example by 100.
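The empirical p-value computation above can be sketched as follows (illustrative Python; `asymmetry_p_value` is a hypothetical helper name). A low value means the test face is unusually symmetric relative to the training set.

```python
# Sketch of the empirical p-value described above: the fraction of
# training faces whose asymmetry statistic (FA or CFA) is at or below
# the test face's value.

def asymmetry_p_value(train_values, test_value):
    at_or_below = sum(1 for v in train_values if v <= test_value)
    return at_or_below / len(train_values)
```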

2.4.2 Random Forest

Random Forest is a classification method utilizing combinations of tree predictors such that each tree is built independently from the others. A single classification tree is often not an accurate classification function. Thus, RF aggregates many individually weak classification trees using a "bagging" procedure: at each step of the algorithm, several observations and several variables are randomly chosen and a classification tree is built from this new data set. For example, given a training set X = x1, . . . , xn with responses Y = y1, . . . , yn, bagging repeatedly selects a random sample with replacement of the training set and fits a tree to each sample. The final classification decision for an unseen sample x′ is obtained either by a majority vote over all the classification trees or, for regression trees, by averaging the individual predictions: f̂(x′) = (1/B) ∑_{b=1}^{B} f̂_b(x′), where B is the number of bootstrap samples. For our purposes here, a majority vote over all classification trees was taken. The package "rpart" from R was used in generating our random forest model.

2.4.3 Logistic Regression

Logistic regression (LR) is a parametric classification model used to estimate the probability that a specific observation belongs to a particular class. The built-in function "glm" from R was used to build our binary LR model assuming a binomial distribution. LOOCV was performed with the R package "boot".
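The probability estimate behind the model can be sketched as follows. This is an illustrative Python sketch of the logistic function itself, not the glm fit; `logistic_prob` and `classify` are hypothetical helper names, and the 0.5 threshold is the usual default for a binary decision.

```python
# Sketch of the logistic model's probability estimate:
# p(beautiful | x) = 1 / (1 + exp(-(b0 + b . x))), thresholded at 0.5.
import math

def logistic_prob(x, intercept, coefs):
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, intercept, coefs, threshold=0.5):
    return int(logistic_prob(x, intercept, coefs) >= threshold)
```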

2.4.4 Support Vector Machines (SVM)

SVM is a supervised learning technique that finds the maximum-margin separating hyperplane between classes. Kernel tricks are often applied to map input features into a high-dimensional space, especially when the data are not linearly separable. The function "ksvm" in the R package "kernlab" was utilized in all analyses [8]. Various SVM models and kernels were applied to the entire feature matrix with 40 features per face and to the reduced feature matrix with only the 10 most correlated features. The SVM models include C-SVM classification (C-svc), nu-SVM classification (nu-svc), and bound-constraint SVM classification (C-bsvc). The kernels include the radial basis ("Gaussian") kernel function (rbfdot), the polynomial kernel function (polydot), the linear kernel function (vanilladot), the hyperbolic tangent kernel function (tanhdot), the Laplacian kernel function (laplacedot), and the ANOVA RBF kernel function (anovadot).
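The radial basis kernel named above can be sketched as follows (illustrative Python, matching the form kernlab's rbfdot uses: k(x, y) = exp(-sigma * ||x - y||^2)); the sigma value here is an assumed default.

```python
# Sketch of the radial-basis ("Gaussian") kernel: points close in feature
# space get kernel values near 1, distant points near 0.
import math

def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)
```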

2.4.5 K-Nearest Neighbor

KNN is a nonparametric supervised learning algorithm that takes an integer input k. For a given test example, the classification comes from the majority vote of the k training points that are closest in Euclidean distance. Ties are broken at random. This method was chosen because of its excellent performance in a previous study [6]. The function "knn" in the R package "class" was used in all analyses, trying all possible values of k [1].
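The KNN rule can be sketched as follows (illustrative Python, not the "class" package's implementation; `knn_classify` is a hypothetical helper name).

```python
# Sketch of KNN: majority vote over the k training points closest in
# Euclidean distance, with ties broken at random.
import math
import random
from collections import Counter

def knn_classify(X, y, x_new, k=3, rng=random):
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_new))
    votes = Counter(y[i] for i in order[:k])
    top = max(votes.values())
    tied = [label for label, n in votes.items() if n == top]
    return rng.choice(tied)  # ties broken at random
```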

3 Results

3.1 Facial Feature Extraction

The combination of feature detectors collected yielded a successful facial feature detector model. Table 1 summarizes the results from the Haar cascade classifiers developed and used. The training error was the error from testing on the 1,400 Kaggle images, where faces were not necessarily forward-facing, untilted, or showing relaxed expressions. To determine whether the algorithm correctly found a feature, the x,y coordinate specified by Kaggle for the feature had to lie within the bounding box the feature detector found. There were two test sets: the 100 images that would then comprise the training set for the beauty models, and the 16 celebrity images that comprised the test set for the models. Because neither of these sets of images had feature locations identified, correct classifications were checked by eye. Accuracy was much higher on these sets, as all the testing images were forward-facing, minimally tilted, relaxed faces. In cases where the model did not find a feature, no prediction was made. Given that there were only 116 test images, predictions that were not made by the feature detector were manually tagged in order to provide all features to the beauty models.
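The correctness criterion above can be sketched as follows (illustrative Python; `detection_correct` is a hypothetical helper name, with the box again given as an (x, y, w, h) tuple).

```python
# Sketch of the training-error check described above: a detection counts
# as correct when the annotated (x, y) landmark falls inside the
# predicted bounding box.

def detection_correct(landmark, box):
    px, py = landmark
    x, y, w, h = box
    return x <= px <= x + w and y <= py <= y + h
```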

3.2 Feature Selection

The 10 most correlated features are: distance between right pupil and top of forehead; distance between midpoint of two pupils and top of forehead; distance between midpoint of the forehead and midpoint of two pupils; distance between nose tip and left corner of mouth; distance between right pupil and right edge of forehead; distance between left pupil and left edge of face; FA; distance between midpoint of two pupils and nose tip; CFA; and distance between two pupils divided by length of face. FA and CFA, which measure the overall asymmetry of each face, are among the most associated features. Furthermore, the average values of FA and CFA are both smaller in people who were rated beautiful. This is evidence that the most beautiful people have more symmetric faces. Among the seven celebrities that were deemed beautiful, 3 have significantly symmetric faces. However, only 1 of the seven celebrities that were not rated beautiful has a significantly symmetric face based on the p-values.

3.3 KNN

KNN was first implemented using all 40 features in the feature matrix. It achieved its lowest training error of 0.48 with k = 42 or 46; eight and nine of the 14 final test examples, respectively, were predicted successfully. Next, KNN was applied using only the 10 features most correlated with human ratings of the 100 training images. This classifier performs best when k = 3 or 5, with a training error of 0.36. Using k = 3 and 5, KNN accurately classified ten and nine of the 14 celebrity faces, respectively. At its best, KNN reaches a training error of 0.36 and a testing error of 0.29, after feature selection by correlation and with three nearest neighbors considered. See Figure 3.

3.4 SVM

With the full feature matrix, the lowest training error was zero, achieved by nu-SVM with three different kernels: the radial basis ("Gaussian") kernel, the Laplacian kernel, and the ANOVA RBF kernel. Among the three, the ANOVA RBF kernel had the lowest testing error (0.29). Using the reduced feature matrix, the lowest training error rate (0.0) came from nu-SVM classification with the Laplacian kernel, although the corresponding testing error rate was very high (0.57). The lowest testing error rate (0.36) came from bound-constraint SVM classification with the hyperbolic tangent kernel, as well as from nu-SVM classification with the polynomial kernel function. The overall lowest training and testing errors are reported in Table 2.

3.5 Logistic Regression

Logistic Regression was implemented on the complete feature matrix. Table 3 describes the top ten features sorted by most significant p-values. As observed, very few features held compelling p-values. In training our LR model, the observed training error with LOOCV was 0.44. On the test set, our LR model performed worse than on the training set, classifying images no better than at random, with a test error of 0.5.

3.6 Random Forest

Random Forest was implemented for the binary classification of beautiful or not, again using the full feature matrix. The training error obtained was 0.20 on 100 images with LOOCV. The FA and CFA feature variables were observed to play two of the most important roles in classification (see Table 4). The test error, 0.21, was similar to the training error. This suggests that our model held an adequate trade-off between variance and bias.

4 Discussion

4.1 Facial Feature Extraction

In order to better understand the Haar Cascades model, a model was developed from scratch. While the training algorithm provided by OpenCV [4] is optimized, it is still very slow to run. In order to speed up the runtime so iterations on the model could be made, a smaller window and fewer images were chosen. Given that the algorithm loops over all pixel boxes up to the specified box size (24 x 24 pixels), increasing the pixel size, while likely increasing the accuracy of the model, drastically lengthens training time. Another drastic hold-up in training was the number of negative images: while the algorithm trained almost instantaneously on the positive images, negative images were processed much more slowly.

The results showed that testing precision was perfect, meaning that all error came from the predictor not detecting any feature on a test example. Given that the models and symmetry features depended on an accurate representation of the facial features, it was best to sacrifice some accuracy for higher precision. Thus, the feature detector would only mark a feature when there was high confidence in the prediction, explaining the high precision.

4.2 Comparison of Classification Models

The observation that the Random Forest classifier outperforms the Support Vector Machine, Logistic Regression, and K-Nearest Neighbor classifiers might be attributed to the presence of noise within the data. Tree classifiers like RF indirectly treat features unequally by discarding irrelevant attributes and using informative, discriminative features more frequently. On the other hand, in SVM classification all features are treated equally, and hence SVMs are more sensitive to high-dimensional data. Interestingly, because LR is conceptually simple and has proven to produce robust results, it was expected to perform much better than what was actually observed. One reason could be that LR tries to find broad relationships between independent variables and predicted classes, but without guidance it cannot find non-linear signals or interactions between multiple variables. KNN's low performance evidently highlights that some features do not contain enough information about the beauty of faces. This would explain why RF outperforms KNN in this project. Another drawback of our KNN implementation is that the best model used a small value of k (k = 3): the bias is low due to the small value of k, but the variance is high [3]. This means that our model might not perform well on other data sets. Similarly to KNN, SVM also overfit our data, since the best training error was zero, much smaller than the test error. The nu-SVM model worked best among the three SVM models because it puts bounds on the number of support vectors and errors to manage the variance-bias trade-off [8].

Random Forest's high performance on both training and test sets gave it a clear victory over all other algorithms tested. In addition to classification, RF offers a powerful tool: variable importance ranking. This is done by estimating the predictive value of each variable by scrambling it and seeing how much the model's performance drops. When applied here, both FA and CFA, which measure overall facial asymmetry, are the two most important feature variables for classification. This correlated well with the survey responses: the more beautiful faces were more symmetric according to FA and CFA. Thus, we conclude that symmetry can be positively associated with facial beauty.
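The scramble-and-score importance measure above can be sketched as follows. This is an illustrative Python sketch of permutation importance, not RF's internal implementation; `predict` is a hypothetical stand-in for an already-trained classifier.

```python
# Sketch of permutation importance: shuffle one feature column and record
# how much the fixed model's accuracy drops. A large drop means the model
# relied on that feature.
import random

def permutation_importance(X, y, predict, feature, seed=0):
    base = sum(predict(x) == t for x, t in zip(X, y)) / len(X)
    rng = random.Random(seed)
    col = [x[feature] for x in X]
    rng.shuffle(col)
    X_perm = [x[:feature] + [v] + x[feature+1:] for x, v in zip(X, col)]
    perm = sum(predict(x) == t for x, t in zip(X_perm, y)) / len(X)
    return base - perm  # accuracy drop attributable to this feature
```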


5 Future Work

Survey results could have been more deeply analyzed to discover unconscious biases that could impact the results. Sexual orientation could also impact a person's perception of beauty. While the survey was pitched as a beauty rating and not an attractiveness rating, the two often go hand in hand, so that an attractive person is also deemed beautiful.

With more time to train the Haar Cascade model, a better feature detector could be built using more positive and negative images and a larger window size. A tighter bounding box would yield a more accurate representation of the face. In fact, a bounding box that is simply a point identifying separate landmarks on a feature would be preferred; then, no approximation would be needed to find the center of the feature. Additional features could also be detected to test other features' effect on symmetry and beauty. For instance, Emily Deschanel's face was deemed by the algorithm to be relatively symmetric, but she had a mean beauty score of 4 from survey respondents. Perhaps other features affected people's perception of her beauty; for instance, she has a very strong jawline for a woman. Other detectors could be trained to detect such features, as more features would aid in developing a more accurate model. There is also more we could do to improve our beauty classifier.

Based on our classifier results, there is a good amount of noise in our feature matrix, and using only the reduced feature matrix left out some information. Feature selection could have been done more carefully, i.e., by comparing various other methods like mutual information or PCA to better represent our data. On a separate note, although the feature matrix was created to best capture the facial features in our case, there could be other statistics that also contain information about the beauty of each face. Some studies identified points on each face and used pairwise distances between the points as the feature matrix [6]; they further reduced the feature matrix using PCA. In order to quickly collect survey data on the additional 100 images that served as the training set for the model, only binary classifications of beautiful or not beautiful could be gathered. With more allotted time, a more precise label could have been collected, i.e., a scaled rating rather than a binary one. Other models like SVR, GLM, EM, and K-Means could also have been evaluated. Lastly, because the models were implemented with default settings, optimizing the model parameters might have improved performance.

6 Bibliography

1. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

2. Haar Cascades. URL http://alereimondo.no-ip.org/OpenCV/34

3. Hastie, Trevor, et al. The elements of statistical learning. Vol. 2. No. 1. New York: Springer, 2009.

4. OpenCV Github Repository. URL https://github.com/Itseez/opencv

5. Kagian, Amit, et al. ”A machine learning predictor of facial attractiveness revealing human-like psychophysical biases.” Vision research 48.2 (2008): 235-243.

6. Eisenthal, Yael, Gideon Dror, and Eytan Ruppin. ”Facial attractiveness: Beauty and the machine.” Neural Computation 18.1 (2006): 119-142.

7. Grammer, Karl, and Randy Thornhill. "Human (Homo sapiens) facial attractiveness and sexual selection: The role of symmetry and averageness." Journal of Comparative Psychology 108.3 (1994): 233.

8. Hornik, Kurt, David Meyer, and Alexandros Karatzoglou. ”Support vector machines in R.” Journal of statistical software 15.9 (2006): 1-28.

7 Tables and Figures

Model                     | Training Error | Testing Error (100 images) | Testing Error (16 celebrities) | Testing Precision (116 images)
Developed Nose Detector   | 0.80           | N/A                        | 0.29                           | 1.00
Online [4] Nose Detector  | 0.23           | 0.18                       | 0.07                           | 1.00
Online [4] Eye Detector   | 0.17           | 0.30                       | 0.25                           | 1.00
Online [4] Mouth Detector | 0.35           | 0.00                       | 0.07                           | 1.00

Table 1: Error results for the Haar Cascade Feature Detectors

Figure 1: Feature detections made by Haar Cascade model.


Figure 2: Calculated features represented by purple lines.

Figure 3: Accuracy of KNN by K Using Whole Feature Matrix in 100 Training Images

Feature                | RF Importance Ranking
FA                     | 11.33
CFA                    | 10.78
nose mouth r           | 9.46
forehead nose midpoint | 9.22
eye forhead l          | 8.19
nose leye              | 6.74
diff pupil forhead r   | 4.48
diff pupil face l      | 4.44
nose mouse             | 3.32
nose mouth midpoint    | 2.95
eye forhead r          | 2.68

Table 4: Top Ten Features from Random Forest Classification Model

Model               | Training Error | Testing Error
Random Forest       | 0.20           | 0.21
Logistic Regression | 0.44           | 0.50
KNN                 | 0.36           | 0.29
SVM                 | 0.00           | 0.29

Table 2: Comparison of Classification Models for Beauty Detection

Feature              | Estimate | Std. Error | z value | Pr(>|z|)
(Intercept)          | 84.92    | 32.79      | 2.59    | 0.01
pupil face ratio     | -189.67  | 77.10      | -2.46   | 0.01
nose mouth l         | -4.26    | 1.94       | -2.20   | 0.03
diff nose face l     | 5.34     | 2.67       | 2.00    | 0.05
nose mouth r         | 2.66     | 1.50       | 1.77    | 0.08
d pupil              | 2.53     | 1.44       | 1.75    | 0.08
nose mouth midpoint  | -0.74    | 0.44       | -1.70   | 0.09
diff mouth face l    | -8.32    | 5.38       | -1.55   | 0.12
diff pupil forhead r | -4.78    | 3.96       | -1.21   | 0.23
mouth chin midpoint  | 0.57     | 0.48       | 1.19    | 0.23

Table 3: Top Ten Features from Logistic Regression Model
