
Feature Extraction with Computational Intelligence for Head Pose Estimation

Shane Reid
School of Computing, Engineering and Intelligent Systems
Ulster University, Magee College, UK
[email protected]

Sonya Coleman
School of Computing, Engineering and Intelligent Systems
Ulster University, Magee College, UK
[email protected]

Dermot Kerr
School of Computing, Engineering and Intelligent Systems
Ulster University, Magee College, UK
[email protected]

Philip Vance
School of Computing, Engineering and Intelligent Systems
Ulster University, Magee College, UK
[email protected]

Siobhan O'Neill
School of Psychology
Ulster University, Magee College, UK
[email protected]

Abstract—Measuring social signals has often proved challenging as they are characterized by subtle movements which are difficult to detect. Head pose is one such social signal, used to indicate where an individual's attention is focused. This paper discusses the problem of head pose estimation by defining it in terms of two fields of view, pan and tilt. A novel approach for head pose estimation is described that uses histogram of oriented gradients with support vector machines. The approach is compared with a template matching approach, among others, using a well-known dataset. The results show that the histogram of oriented gradients approach is the most accurate, able to determine head pan to within one class approximately 79% of the time, and head tilt to within one class approximately 82% of the time.

Keywords—head pose estimation, template matching, social signal processing, histogram of oriented gradients, support vector machine

I. INTRODUCTION

Social signals are observable behaviors which are displayed during the interaction of two or more people. The social signals of any given individual will induce changes in the social signals of others within their vicinity [1]. For example, if person A smiles at person B, then person B will typically respond by smiling back. By interpreting social signals, we can make predictions about potential future behavior. Figure 1 shows two boxers squaring off before a fight. They both maintain eye contact, lean slightly forward with their shoulders raised, and maintain a neutral expression. These are all social signals that indicate conflict.

Figure 1: Muhammad Ali squaring off with Ernie Terrell, 1967

For humans, interpreting and displaying social signals is often performed unconsciously (although there are occasions where people consciously regulate their social signals, such as public speaking or job interviews). For computers, however, it is often very difficult to detect social signals, and even more difficult to interpret them. A kiss and a head-butt might look similar, but they mean two very different things. To this end, in order to successfully interpret social signals, it is important that there are first robust systems in place for describing them.

Gestures in general are used to regulate human interaction; they are often performed consciously or unconsciously to convey some specific meaning and are therefore a category of social signals. Furthermore, gestures can also express an emotion and hence convey social signals for behaviors like shame (e.g. breaking eye contact) and embarrassment (e.g. blushing). Posture is unconsciously regulated and thus can be considered the most reliable nonverbal social cue. The three main social signals posture conveys are inclusive vs. non-inclusive, face-to-face vs. parallel, and congruent vs. non-congruent. These cues may be used to help distinguish between an extroverted and an introverted person [16]. The systematic analysis of body movement dates back to the early photography experiments of Étienne-Jules Marey and Eadweard Muybridge in the late 1800s [17]. Later, with the advent of video recording and playback equipment, researchers were able to analyze behavior on a finer timescale [18]. From videos, researchers coded specific behaviors that they were interested in, such as mutual gaze [4].

In [19] a generic method for automated measurement and analysis of body motion is presented, which posits that, since motion takes place at the joints, the skeleton can be considered as a set of body parts connected by joints. Body poses can then be represented as instantiations of these joint positions. All joints and body parts in the human body form a kinematic tree, a hierarchical model of interconnectivity. The joint at the top of the tree (typically the pelvis) forms a root to which all other joints are relative. When two joints are connected by a body part, the one higher in the tree hierarchy is considered the parent and the other the child [9]. This model provides a quantitative analysis of full body motion; however, there is still scope for generating qualitative description labels by defining rules on the measured motion, such as "right hand is moving up".

Rather than focusing on general gesture recognition, the initial aim of this work is head pose estimation. The ability to estimate head pose, innate in humans, is an important social signal which is used in a number of different ways. From an evolutionary standpoint, being able to quickly determine sudden changes in another's head pose allows a person to determine where they should focus their attention in order to avoid any perceived danger [2]. This ability is not limited to fight-or-flight scenarios but is also indicative of good communication skills, where a person is expected to orientate their head towards another (intended target) during a conversation.

Head pose estimation is a significant problem that has been surveyed extensively in [3], which reviews over 90 papers. A selection of the approaches are based on template matching algorithms. In appearance template matching, a given image is compared to a set of exemplars, each labelled with a discrete pose, in order to find the most similar view. Figure 2 shows an example which uses three templates. By computing the similarity of each template to the test image, the head pose of the test image is determined as that of the most similar template. However, consider two images of the same individual at slightly different poses and two images of different individuals at the same pose. In this scenario the effect of identity can cause more dissimilarity than the change in head pose. Methods to counter this include applying a filter to limit the effect of appearance, as in Figure 2, or using multiple templates to generate a more robust similarity score, as described in [20]. Additionally, the approach taken by [4] uses image comparison to compute nearest neighbor template matching. In [5] normalized cross correlation appearance-based template matching is used to perform facial recognition regardless of the individual's head pose. The effect of Gabor filtering on appearance-based template matching was investigated in [6]. The approach taken in [7] used support vector machines to discriminate between three different head poses without any feature detection. Finally, the approach taken by [8] generates a sparse representation of the face image before classification using support vector regression.

Certain gestures are also conveyed using head movements, such as nodding to show understanding. Hence a method for automatically determining head pose robustly and with a high degree of accuracy is of great importance. For the purposes of this research, head pose is defined in terms of two dimensions. The first is pan, which describes whether the head is oriented left or right. The second is tilt, which describes whether the head is oriented up or down. In Section II the proposed method for estimating head pose is outlined. This is a feature detection method in which a set of labelled training images is converted into a set of feature vectors. These feature vectors are then used to train a machine learning algorithm, in this case a support vector machine. To predict the label of a given test image, that image is converted into a feature vector and classified. The results are presented and discussed in Section III. In Section IV the results are compared with the benchmark algorithms described in [3] and [20], and future work is discussed.

II. METHODOLOGIES

In this paper the Pointing'04 head pose image dataset [9] is used. This dataset contains 15 sets of images, each consisting of two series of 93 images of the same person at different poses. There are 15 individuals in the dataset, some wearing glasses, and with various skin colors. The pose is defined by two angles, pan and tilt, which both vary from -90 to +90 degrees. The possible angles in each dimension are as follows:

Pan = {-90, -75, -60, -45, -30, -15, 0, +15, +30, +45, +60, +75, +90}
Tilt = {-90, -60, -30, -15, 0, +15, +30, +60, +90}

The first 12 sets (80% of the images) are used for training or as templates; the remaining three sets (20%) are used for testing. Each image has a resolution of 384x288 pixels. Two approaches are presented: the proposed approach, which combines HOG and SVM for head pose estimation (pan and tilt independently), and the template matching approach presented in [20].

Figure 2: Crude example of a head pose detector using three templates. These images have been pre-processed using Gabor wavelets.

A. HOG+SVM

Histogram of oriented gradients (HOG) is a feature descriptor used for object detection. It works by first splitting an image into a number of cells. For the pixels within each cell, a histogram of gradient directions is compiled. Figure 3 provides a visualization of this, with the original image on the left and the corresponding histograms on the right. The feature vector is the concatenation of these histograms. In this paper, each cell is 16x16 pixels and there are 9 direction bins per cell. Since the images are 384x288 pixels, this results in a vector of 3888 features for each image.

Figure 3: Visualization of histogram of oriented gradients
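As an illustration, the following is a minimal sketch of this feature extraction step using scikit-image; the paper does not name a library, so this choice and the filename are assumptions:

```python
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.io import imread

# Load one dataset image and convert to grayscale (filename is hypothetical).
image = rgb2gray(imread("person01_pose_sample.jpg"))

# 16x16-pixel cells with 9 orientation bins; for a 384x288 image this gives
# (384/16) * (288/16) * 9 = 3888 features.
features = hog(image,
               orientations=9,
               pixels_per_cell=(16, 16),
               cells_per_block=(1, 1),
               feature_vector=True)
print(features.shape)  # (3888,)
```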

The HOG feature vectors of the training set (80% of the total images) were then calculated and used to train two support vector machines: one for classifying head pan and the other for classifying head tilt. Both support vector machines used a radial basis function kernel with a C value of 10 and a gamma value of 0.01, and both were trained using a one-vs-rest decision function. Using the remaining 20% of the dataset for testing, the head pose of a given image can then be predicted by converting the image into a HOG feature vector and generating a prediction from the two trained support vector machines.
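A minimal sketch of this training step, assuming scikit-learn (the paper does not name a library) and that X_train holds the HOG vectors with y_pan and y_tilt the corresponding angle labels:

```python
from sklearn.svm import SVC

# Two RBF-kernel SVMs with the stated hyperparameters; the one-vs-rest
# decision function mirrors the paper's setup.
pan_svm = SVC(kernel="rbf", C=10, gamma=0.01, decision_function_shape="ovr")
tilt_svm = SVC(kernel="rbf", C=10, gamma=0.01, decision_function_shape="ovr")
pan_svm.fit(X_train, y_pan)
tilt_svm.fit(X_train, y_tilt)

# Predict the pose of a test image from its HOG feature vector.
pan_pred = pan_svm.predict(x_test.reshape(1, -1))
tilt_pred = tilt_svm.predict(x_test.reshape(1, -1))
```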

B. Template Matching

We implement the template matching approach proposed in [20] for comparison purposes. In this approach, cross correlation is used to measure the similarity of a given image and a template. For a given image I and template T the correlation is given by:

$$\mathrm{corr}(x,y) = \sum_{x',y'} T(x',y') \cdot I(x+x', y+y') \qquad (1)$$

However, this algorithm has problems: primarily, if one image has more energy than the other (i.e. one image is much brighter than the other), the correlation will be hard to detect. Hence we use the normalised similarity measure:

$$\mathrm{normcorr}(x,y) = \frac{\sum_{x',y'} T(x',y') \cdot I(x+x', y+y')}{\sqrt{\sum_{x',y'} T(x',y')^2 \cdot \sum_{x',y'} I(x+x', y+y')^2}} \qquad (2)$$
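For illustration, a minimal sketch of Eq. (2) at a single offset (x, y), written directly in NumPy (an assumption; OpenCV's cv2.matchTemplate with TM_CCORR_NORMED computes the same quantity over all offsets):

```python
import numpy as np

def normcorr(template: np.ndarray, image: np.ndarray, x: int, y: int) -> float:
    """Normalized cross correlation of `template` with the image patch at (x, y)."""
    h, w = template.shape
    patch = image[y:y + h, x:x + w].astype(np.float64)
    t = template.astype(np.float64)
    numerator = np.sum(t * patch)
    denominator = np.sqrt(np.sum(t ** 2) * np.sum(patch ** 2))
    return numerator / denominator  # 1.0 indicates a perfect match
```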

As previously noted, 80% of the images are used as templates and the remaining 20% are used for testing. This means that for each head pose there are 24 templates. Figure 4 shows an example of how the robust similarity score for each head pose is calculated. Once all 24 similarities are calculated using normalized cross correlation, the three largest and the three smallest similarities are removed. The mean of the remaining 18 values is taken as the new, more robust similarity score. This process is repeated for each of the 93 sets of templates until we are left with 93 similarity scores. Head tilt is estimated by taking the mean of the four largest similarity scores for each head tilt (with the exception of -90 and +90, as they only have one similarity score each) to calculate a tilt similarity score. The tilt class with the largest tilt similarity score is estimated as the correct tilt pose. Head pan is estimated in a similar manner, by taking the mean of the three largest similarity scores for each head pan to calculate a pan similarity score. The head pan is estimated as the class with the largest pan similarity score. A sketch of this scoring procedure is given after Figure 4.

Figure 4: Example of how multiple templates are used to compute a more accurate similarity score
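A minimal sketch of this trimmed-mean scoring and class estimation, assuming sims_per_pose maps each of the 93 (pan, tilt) poses to its 24 template similarity scores:

```python
import numpy as np

def robust_score(sims: np.ndarray) -> float:
    # Drop the 3 largest and 3 smallest scores, then average the remaining 18.
    ordered = np.sort(sims)
    return float(ordered[3:-3].mean())

def estimate_class(sims_per_pose: dict, axis: int, top_k: int) -> int:
    # axis=0 groups scores by pan, axis=1 by tilt; top_k is 3 for pan, 4 for tilt.
    grouped: dict = {}
    for pose, sims in sims_per_pose.items():
        grouped.setdefault(pose[axis], []).append(robust_score(sims))
    # Mean of the top_k largest pose scores per class (tilt -90/+90 have one each).
    class_scores = {c: np.mean(sorted(s)[-top_k:]) for c, s in grouped.items()}
    return max(class_scores, key=class_scores.get)

# pan_estimate = estimate_class(sims_per_pose, axis=0, top_k=3)
# tilt_estimate = estimate_class(sims_per_pose, axis=1, top_k=4)
```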

III. RESULTS

In this section we compare the proposed HOG and SVM based approach with the template matching approach in [20] and a number of other benchmark approaches. We compare the approaches by calculating the absolute error value for each image x using

$$\mathrm{AbsError}(x) = \left| \mathrm{Label}(x) - \mathrm{PredictedValue}(x) \right| \qquad (3)$$

where Label(x) is the actual pan or tilt label of image x, and PredictedValue(x) is the value predicted by either the template matching framework or the HOG+SVM approach. The absolute error for each test image is summarized in Table 2. The mean absolute error for the set of n test images was calculated by:

$$\mathrm{MeanAbsError}(n) = \frac{\sum_{x \in n} \mathrm{AbsError}(x)}{|n|} \qquad (4)$$

and the accuracy was then calculated by:

$$\mathrm{Accuracy}(n) = \frac{1}{|n|} \sum_{x \in n} \begin{cases} 1, & |l_x - p_x| \le 1 \\ 0, & |l_x - p_x| > 1 \end{cases} \qquad (5)$$

where l_x is the template label and p_x is the predicted label. The template matching framework had a pan accuracy of 51.79%, compared with the HOG+SVM approach which had a pan accuracy of 79.03%. Similarly, the template matching approach had a tilt accuracy of 71.33%, compared with 82.29% for the HOG+SVM approach. The mean absolute pan error for the template matching approach was 31.08°, compared with 14.95° for the proposed HOG+SVM approach. Similarly, the mean absolute tilt error for the template matching approach was 25.03°, compared with 15.78° for the proposed HOG+SVM approach. These results show that the use of HOG feature extraction in combination with the SVM gives both a higher accuracy and a lower mean error than the template matching framework for both head pan and head tilt.
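A minimal sketch of Eqs. (3)-(5), assuming angle labels in degrees for the error and class indices for the within-one-class accuracy:

```python
import numpy as np

def mean_abs_error(angles_true: np.ndarray, angles_pred: np.ndarray) -> float:
    # Eqs. (3) and (4): mean absolute error in degrees over the test set.
    return float(np.mean(np.abs(angles_true - angles_pred)))

def accuracy(classes_true: np.ndarray, classes_pred: np.ndarray) -> float:
    # Eq. (5): a prediction counts as correct if it is the true class or an
    # immediate neighbor (|l_x - p_x| <= 1 in class indices).
    return float(np.mean(np.abs(classes_true - classes_pred) <= 1))
```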

Additionally, Figure 5 shows the frequency of pan error for both methods. It demonstrates that the majority of cases were classified within 30 degrees of the actual pan value for both methods. The HOG+SVM approach was within 15 degrees 79.03% of the time and within 30 degrees 93.73% of the time. The template matching approach was less accurate, however: within 15 degrees only 51.79% of the time, and within 30 degrees 70.07% of the time. Similarly, Figure 6 shows the frequency of tilt error for both methods. The graph shows that the majority of cases were classified within 30 degrees. The HOG+SVM approach correctly estimated the tilt to within 15 degrees 71.15% of the time, and within 30 degrees 89.43% of the time. The template matching approach was again less accurate, estimating the correct tilt within 15 degrees 53.94% of the time, and within 30 degrees 77.78% of the time.

Table 2: Frequency of absolute error for each method, for pan and tilt

Abs Error (deg) | Pan, Template Matching | Pan, HOG and SVM | Tilt, Template Matching | Tilt, HOG and SVM
0   | 124 (22.22%) | 199 (35.66%) | 197 (35.30%) | 243 (43.55%)
15  | 165 (29.57%) | 242 (43.37%) | 104 (18.64%) | 154 (27.60%)
30  | 102 (18.28%) | 82 (14.70%)  | 133 (23.84%) | 102 (18.28%)
45  | 56 (10.04%)  | 19 (3.41%)   | 37 (6.63%)   | 32 (5.73%)
60  | 37 (6.63%)   | 10 (1.79%)   | 39 (6.99%)   | 18 (3.23%)
75  | 26 (4.66%)   | 0 (0.00%)    | 18 (3.23%)   | 5 (0.90%)
90  | 30 (5.38%)   | 2 (0.36%)    | 14 (2.51%)   | 2 (0.36%)
105 | 6 (1.08%)    | 0 (0.00%)    | 8 (1.43%)    | 1 (0.18%)
120 | 4 (0.72%)    | 1 (0.18%)    | 8 (1.43%)    | 1 (0.18%)
135 | 2 (0.36%)    | 0 (0.00%)    | 0 (0.00%)    | 0 (0.00%)
150 | 1 (0.18%)    | 1 (0.18%)    | 0 (0.00%)    | 1 (0.18%)
165 | 1 (0.18%)    | 1 (0.18%)    | 0 (0.00%)    | 0 (0.00%)
180 | 4 (0.72%)    | 1 (0.18%)    | 0 (0.00%)    | 0 (0.00%)
Accuracy ± one class | 51.79% | 79.03% | 71.33% | 82.29%
Mean error (deg) | 31.08 | 14.95 | 25.03 | 15.78

Figure 5: Graph of frequency of pan error

Figure 6: Graph of frequency of tilt error

IV. COMPARISON WITH BENCHMARK ALGORITHMS

For robustness, we compare the proposed HOG+SVM approach with similar benchmark algorithms. Table 3 shows the benchmark results for the dataset as presented in [3]. The methods outlined in this paper compare favorably, particularly the HOG+SVM approach, which has a mean absolute error better than or equal to the majority of the benchmarks.

Of the methods presented, only two conclusively beat the proposed HOG+SVM approach in terms of having a lower mean absolute error: Stiefelhagen's neural network method [11], which was trained using 80% of the images randomly selected from the set of all 15 individuals, plus additional artificial images generated by mirroring the training set, achieving a mean error of 9.5° and 9.7° for pan and tilt respectively; and Voit's neural network method [14], which used 50% of the images (the first series from each set) for training and also mirrored the dataset to obtain more images. This demonstrates that the proposed HOG+SVM algorithm works better than other existing algorithms when trained with a relatively small dataset.

Table 3: Benchmarks for the Pointing'04 dataset (mean absolute error; figures in braces are {pan, tilt} accuracy)

Method | Pan | Tilt
J. Wu [10] | - | -
Template matching framework | 31.08° | 25.03° {51.79%, 71.33%}*†
HOG + SVM | 14.95° | 15.78° {79.03%, 82.29%}*†
Stiefelhagen [11] | 9.5° | 9.7°
Human performance [12] | 11.8° | 9.4°
Gourier (Associative Memories) [12] | 10.1° | 15.9°
Tu (High-order SVD) [13] | 12.9° | 19.97° {49.25%, 54.84%}†
Tu (PCA) [13] | 14.11° | 14.98° {55.20%, 57.99%}†
Tu (LEA) [13] | 15.88° | 17.44° {45.16%, 50.61%}†
Voit [14] | 12.3° | 12.77°
Zhu (PCA) [15] | 26.9° | 35.1°
Zhu (LDA) [15] | 25.8° | 26.9°
Zhu (LPP) [15] | 24.7° | 22.6°
Zhu (Local-PCA) [15] | 24.5° | 37.6°
Zhu (Local-LPP) [15] | 29.2° | 40.2°
Zhu (Local-LDA) [15] | 19.1° | 30.7°

* Any neighboring poses considered correct classification as well
† DOF classified separately
1. Used 80% of Pointing'04 images for training and 10% for evaluation
2. Human performance with training
3. Better results over different reported methods
4. Better results have been obtained with manual localization
5. Results for 32-dim embedding

V. CONCLUSION

We have presented the problem of head pose estimation, defining it in terms of two dimensions, pan and tilt. We have presented a computational intelligence-based approach to accurately estimate the pan and tilt of head pose using a feature matching approach which utilizes histogram of oriented gradients to train two support vector machines, one for pan and one for tilt. The results demonstrate that the feature matching approach is the most accurate, achieving a pan accuracy of 79.03% within one class and a tilt accuracy of 82.29% within one class, compared with the recently proposed template matching approach, which achieved a head pan accuracy of 51.79% within one class and a tilt accuracy of 71.33% within one class.

For future work we propose mirroring the training dataset in order to double the number of training images and improve performance. Additionally, we will extend the HOG+SVM approach for use with video data, as the algorithm runs in near real time. However, the main goal of this research is the ability to estimate head pose and interpret it for the purpose of social signal processing.
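As a brief illustration of the proposed augmentation, flipping an image horizontally negates its pan angle while leaving tilt unchanged; a minimal sketch (the function name is ours):

```python
import numpy as np

def mirror_sample(image: np.ndarray, pan: int, tilt: int):
    # A horizontal flip turns a head facing left into one facing right,
    # so the pan label is negated while the tilt label is unchanged.
    return np.fliplr(image), -pan, tilt
```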

REFERENCES

[1] J. K. Burgoon, N. Magnenat-Thalmann, M. Pantic, and A. Vinciarelli, Social Signal Processing, Cambridge University Press, 2017.

[2] J. Shen and L. Itti, "Top-down influences on visual attention during listening are modulated by observer sex," Vision Research, vol. 65, pp. 62-76, July 2012.

[3] E. Murphy-Chutorian and M. M. Trivedi, "Head pose estimation in computer vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 607-626, April 2009.

[4] S. Niyogi and W. T. Freeman, "Example-based head tracking," in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, VT, 1996, pp. 374-378.

[5] D. J. Beymer, "Face recognition under varying pose," Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, Memorandum Rept. ADA290205, 1993.

[6] J. Sherrah, S. Gong, and E. J. Ong, "Face distributions in similarity space under varying head pose," Image and Vision Computing, vol. 19, no. 12, pp. 807-819, Oct. 2001.

[7] J. Huang, X. Shao, and H. Wechsler, "Face pose discrimination using support vector machines (SVM)," in Proceedings of the Fourteenth International Conference on Pattern Recognition, Brisbane, Qld, 1998, pp. 154-156, vol. 1.

[8] H. Moon and M. L. Miller, "Estimating facial pose from a sparse representation [face recognition applications]," in International Conference on Image Processing, Singapore, 2004, pp. 75-78, vol. 1.

[9] N. Gourier, D. Hall, and J. Crowley, "Estimating face orientation from robust detection of salient facial structures," in Proc. ICPR Workshop on Visual Observation of Deictic Gestures, 2004, pp. 25-33.

[10] J. Wu, J. Pedersen, D. Putthividhya, D. Norgaard, and M. M. Trivedi, "A two-level pose estimation framework using majority voting of Gabor wavelets and bunch graph analysis," in Proc. ICPR Workshop on Visual Observation of Deictic Gestures, 2004, pp. 13-20.

[11] R. Stiefelhagen, "Estimating head pose with neural networks: Results on the Pointing04 ICPR workshop evaluation data," in Proc. Pointing 2004 Workshop: Visual Observation of Deictic Gestures, vol. 1, no. 5, pp. 21-24, 2004.

[12] N. Gourier, J. Maisonnasse, D. Hall, and J. L. Crowley, "Head pose estimation on low resolution images," in International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006: Multimodal Technologies for Perception of Humans, Southampton, UK, 2006, pp. 270-280.

[13] J. Tu, Y. Fu, Y. Hu, and T. Huang, "Evaluation of head pose estimation for studio data," in International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006: Multimodal Technologies for Perception of Humans, Southampton, UK, 2006, pp. 281-290.

[14] M. Voit, K. Nickel, and R. Stiefelhagen, "Neural network-based head pose estimation and multi-view fusion," in International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006: Multimodal Technologies for Perception of Humans, Southampton, UK, 2006, pp. 291-298.

[15] Z. Li, Y. Fu, J. Yuan, T. S. Huang, and Y. Wu, "Query driven localized linear discriminant models for head pose estimation," in IEEE International Conference on Multimedia and Expo, Beijing, 2007, pp. 1810-1813.

[16] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image and Vision Computing, vol. 27, no. 12, pp. 1743-1759, Nov. 2009.

[17] R. Klette and G. Tee, "Understanding human motion: A historic review," in Human Motion, B. Rosenhahn, R. Klette, and D. Metaxas, Eds., Springer Netherlands, 2008, pp. 1-22.

[18] R. Poppe, "Automatic analysis of bodily social signals," in Social Signal Processing, J. K. Burgoon, N. Magnenat-Thalmann, M. Pantic, and A. Vinciarelli, Eds., Cambridge University Press, pp. 155-167.

[19] R. Poppe, S. Van Der Zee, D. K. Heylen, and P. J. Taylor, "AMAB: Automated measurement and analysis of body motion," Behavior Research Methods, vol. 46, no. 3, pp. 625-633, Sept. 2014.

[20] S. Reid, S. Coleman, S. O'Neill, D. Kerr, and P. Vance, "Template matching for head pose estimation," in Proc. Irish Machine Vision and Image Processing Conference, Belfast, UK, 2018, unpublished.
