

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Marko Linna

Real-time Human Pose Estimation from Video with Convolutional Neural Networks

Master's Thesis
Degree Programme in Computer Science and Engineering

September 2016


Linna M. (2016) Real-time Human Pose Estimation from Video with Convolutional Neural Networks. Department of Computer Science and Engineering, University of Oulu, Oulu, Finland. Master's thesis, 48 p.

ABSTRACT

There is a growing need for real-time human pose estimation from monocular RGB images in applications such as human computer interaction, assisted living, video surveillance, people tracking, activity recognition and motion capture. For the task, depth sensors and multi-camera systems are usually more expensive and difficult to set up than conventional RGB video cameras. Recent advances in convolutional neural network research have allowed traditional methods to be replaced with more efficient convolutional neural network based methods in many computer vision tasks.

This thesis presents a method for real-time multi-person human pose estimation from video by utilizing convolutional neural networks. The method is aimed at use case specific applications, where good accuracy is essential and variation of the background and poses is limited. This makes it possible to use a generic network architecture, which is both accurate and fast.

The problem is divided into two phases: (1) pretraining and (2) fine-tuning. In pretraining, the network is trained with highly diverse input data from publicly available datasets, while in fine-tuning it is trained with application specific data recorded with Kinect.

The method considers the whole system, including the person detector, the pose estimator and an automatic way to record application specific training material for fine-tuning. The method can also be thought of as a replacement for Kinect, and it can be used for higher level tasks such as gesture control, games, person tracking and action recognition.

Keywords: human pose estimation, person detection, convolutional neural networks, computer vision


Linna M. (2016) Real-time Recognition of Human Poses from Video with Convolutional Neural Networks. University of Oulu, Department of Computer Science and Engineering. Master's thesis, 48 p.

TIIVISTELMÄ (ABSTRACT IN FINNISH)

There is a growing need for real-time human pose recognition from monocular RGB images in many applications, such as human computer interaction, care homes, video surveillance, people tracking, activity recognition and motion capture. For this task, depth sensors and multi-camera systems are usually a more expensive and harder-to-install option than a conventional video camera. Recent progress in convolutional neural network research has led to traditional methods being replaced with better performing convolutional neural network based methods in many computer vision tasks.

This work presents a method for real-time multi-person pose recognition from video using convolutional neural networks. The method is intended for use case specific applications, where good accuracy is essential and changes in backgrounds and poses are limited. Under these conditions, it is possible to use a generic network architecture that is both accurate and fast.

The problem is divided into two phases: (1) pretraining and (2) fine-tuning. In pretraining, the network is trained with diverse data originating from several publicly available datasets. In fine-tuning, the network is trained with use case specific data recorded with Kinect.

The method considers the whole system, including person detection, pose recognition and an automatic way to record use case specific training data with Kinect. The method can also be thought of as a replacement for Kinect, and it can be used for higher level tasks, such as gesture control, games, people tracking and activity recognition.

Keywords: human pose estimation, person detection, convolutional neural networks, computer vision


TABLE OF CONTENTS

ABSTRACT

TIIVISTELMÄ

TABLE OF CONTENTS

FOREWORD

ABBREVIATIONS

1. INTRODUCTION

2. BACKGROUND AND RELATED WORK
   2.1. AlexNet and variants
   2.2. Early human pose estimation methods using ConvNets
   2.3. VGG Net
   2.4. GoogLeNet
   2.5. Residual Networks
   2.6. Human pose estimation from video
   2.7. Top ranking methods

3. METHOD
   3.1. Person detection
   3.2. Data augmentation
   3.3. Pretraining
   3.4. Fine-tuning
   3.5. Network architecture
        3.5.1. Convolutional layer
        3.5.2. Fully connected layer
        3.5.3. Rectified linear unit
        3.5.4. Local response normalization
        3.5.5. Pooling
        3.5.6. Dropout
   3.6. Training details
   3.7. Testing details
   3.8. Network visualization

4. EVALUATION
   4.1. Metric
   4.2. Data
   4.3. Experiments
   4.4. Results

5. ADDITIONAL EXPERIMENTS
   5.1. Input image flipping
   5.2. Arm pose network

6. DISCUSSION
   6.1. Comparison to state-of-the-art
   6.2. Necessity of fine-tuning
   6.3. Convergence problem
   6.4. Other ConvNet architectures
   6.5. Personal comments

7. CONCLUSION

8. REFERENCES


FOREWORD

I would like to thank the Machine Vision Group at CMVS for the opportunity to study and write a thesis about computer vision and convolutional neural networks. The topic was interesting and I look forward to possibilities to continue these studies in the future. I would also like to express my gratitude to CMVS for the possibility of employment as a research assistant during the writing.

I would like to thank Iaroslav Melekhov for instructions related to Caffe and a particular computing server at the CSE department. I would also like to thank Hannu Rautio for support with a certain Linux PC. Especially I would like to thank Dr. Juho Kannala and Dr. Esa Rahtu for supervising this work and giving valuable information and guidance on the topic.

Oulu, Finland, September 19, 2016

Marko Linna


ABBREVIATIONS

BBC     British Broadcasting Corporation
CIFAR   Canadian Institute for Advanced Research
CNN     Convolutional Neural Network
DOF     Degree of Freedom
DPM     Deformable Part Model
FC      Fully Connected
FLIC    Frames Labeled In Cinema
GPU     Graphics Processing Unit
GT      Ground Truth
ILP     Integer Linear Programming
ILSVRC  ImageNet Large Scale Visual Recognition Challenge
IoU     Intersection-over-Union
LRN     Local Response Normalization
LSP     Leeds Sports Pose
mAP     Mean Average Precision
MPII    Max Planck Institute for Informatics
MODEC   Multimodal Decomposable Model
NMS     Non-Maximum Suppression
PCK     Percentage of Correct Keypoints
RGB     Red-Green-Blue
R-CNN   Region-based Convolutional Neural Network
RPN     Region Proposal Network
ReLU    Rectified Linear Unit
RoI     Region of Interest
SGD     Stochastic Gradient Descent
VGG     Visual Geometry Group
ZF      Zeiler and Fergus


1. INTRODUCTION

Articulated human pose estimation has been a long-standing problem in computer vision. The problem can be defined as the localization of body joints. There has been a growing need for accurate human pose estimation in applications such as human computer interaction, assisted living, video surveillance, people tracking, action recognition and motion capture. While depth sensors and multi-camera systems have been used for the task successfully, the focus of recent research is on methods using monocular RGB images. There is a need for single RGB camera systems as they are often easier and cheaper to set up in the target environment.

Despite the advances in recent research, humans still perform better than computers. Human pose estimation is a difficult problem for many reasons. Firstly, the human body has 244 degrees of freedom (DOFs) [1] with a few hundred joints. Most of the movements between joints are not relevant, but the remaining DOFs (roughly 20) still result in a countless number of poses, which in turn causes challenges in joint localization. Moreover, occlusions and variability in clothing, body shape, scale and illumination make pose estimation further difficult. Ultimately, in many applications, the method should be fast enough for real-time operation.

In recent years, the research has moved from traditional methods [2, 3, 4, 5] towards convolutional neural networks (ConvNets, CNNs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Due to this, significant improvements in accuracy have been accomplished. Efficient ConvNet architectures require heavy computation and lots of memory in the training phase. The number of network parameters can range from a few million to hundreds of millions. Powerful GPUs are practically mandatory for training, and even then it can take several days or weeks to train a network. However, at testing time the computational need is much smaller and the memory requirement is often reasonable.

Lack of training data is often a problem with ConvNets, causing poorer generalization and accuracy. Data augmentation tries to counteract this by artificially extending the training data. Typical data augmentation methods are mirroring, rotating, scaling and jittering. In addition, advanced data augmentation methods have been studied recently. Pishchulin et al. [18] create and reshape a 3D human shape model in order to adjust the pose and shape of persons. Further, they render the reshaped persons over different backgrounds.

Many state-of-the-art ConvNet human pose estimation methods use more complex network architectures and perform considerably well in unconstrained environments [15, 17, 19], where large variations in pose, clothing, view angle and background exist. While these methods have high accuracy, they are usually too slow for real-time pose estimation. Recent research [7, 8] shows that by using a generic ConvNet architecture, a competitive accuracy can be achieved, while still maintaining a short forward pass time. This is the main motivation of the method presented in this thesis. The method does not aim for overall human pose estimation on diverse input data, but rather targets specific use cases where high accuracy and speed are required. In such cases, the problem is different, because the environment is usually constrained, persons are in close proximity to the camera and poses are restricted.

The method presented in this thesis is a single-camera multi-person human pose estimation system, targeted at use case specific applications. Such applications are, for example, gesture control, games, action recognition and person tracking.


In order to support multiple people, a person detector is utilized, which gives the locations and scales of the persons in the target image. This brings the method towards practice, since the location and scale of a person are not expected to be known, unlike in many state-of-the-art methods [15, 16, 17]. The method uses a generic eight-layer ConvNet architecture. The key idea is to pretrain the network with highly diverse input data and then fine-tune it with use case specific data. The evaluation of the method shows that a competitive accuracy can be achieved in application specific pose estimation, while operating in real-time. The main contributions of this thesis are:

1. A working replacement for Kinect [20], using a fast and accurate pose estimation network together with a state-of-the-art person detector.

2. Utilization of Kinect for automatic training data generation, making it easy to generate a large amount of annotated training data¹.

3. Utilization of a person detector to crop person centered images in both training and testing, thus enabling multi-person pose estimation in real world images.

4. Ability to learn from heterogeneous training data, where the set of joints is not the same in all the training samples, thus making it possible to use more varied datasets in training.

¹ The source code of the tool is available at https://github.com/malinna/PersonTracker


2. BACKGROUND AND RELATED WORK

Despite the fact that ConvNet history goes back to the 1980s [21], it is only in recent years that ConvNets have gained popularity in computer vision tasks, mainly due to the availability of computational resources. However, the first successful ConvNet application was developed back in 1998. It was called LeNet [22] and it could classify handwritten digits. LeNet had two stacked convolutional layers, followed by three fully connected layers. Each convolutional layer was followed by a pooling layer. LeNet had 60K parameters.

2.1. AlexNet and variants

ConvNets became popular much later, in 2012, when AlexNet [23] was introduced. AlexNet could classify images into different categories and it won the ILSVRC 2012¹ competition by a significant margin over the second contestant. AlexNet had a deeper and spatially larger network architecture than LeNet. It had five stacked convolutional layers, followed by three fully connected layers. In addition, three pooling layers followed the first, second and fifth convolutional layers. The architecture of AlexNet is illustrated in Figure 1. AlexNet had 60M parameters.

Figure 1. The architecture of AlexNet. The stride in the first convolutional layer is 4. All other convolutional layers have a stride of 1. All pooling layers use a 3 × 3 kernel and a stride of 2. The last layer outputs class probabilities for 1000 categories.

ZF Net [24] was an improvement on AlexNet. It had an otherwise similar architecture, but the filter size and stride were smaller in the first convolutional layer. Because of this, the first and second convolutional layers retained much more information, which in turn improved classification performance.

2.2. Early human pose estimation methods using ConvNets

ConvNets were used in human pose estimation for the first time in late 2013, when Toshev and Szegedy [7] presented their method. Their network architecture was similar to AlexNet, but the last layer was replaced by a regression layer, which output joint coordinates. In addition, they trained a cascade of pose regression networks.

¹ http://image-net.org/challenges/LSVRC/2012


The cascade started off by estimating an initial pose. Then, at subsequent stages, additional regression networks were trained to predict a transition of the joint locations from the previous stage to the true location. Thus, each subsequent stage refined the currently predicted pose. A similar idea is applied in more recent work by Carreira et al. [13].

Jain et al. [6] demonstrated that ConvNet based human pose estimation can meet the performance of, and in many cases outperform, the traditional methods, particularly deformable part models (DPMs) [2] and multimodal decomposable models (MODECs) [5].

Their network architecture consisted of three convolutional layers, followed by three fully connected layers. Pooling was applied after the first two convolutional layers. They trained a network for each body part (e.g. wrist, shoulder, head) separately. Each network was applied in a sliding window fashion to overlapping regions of the input image. A window of pixels was mapped to a single binary output: the presence or absence of that body part. This made it possible to use a much smaller network, at the expense of having to maintain a separate set of parameters for each body part. Their network had 4M parameters.

2.3. VGG Net

VGG Net [25], introduced in 2014 by Simonyan and Zisserman, showed that by using very small convolution filters (3 × 3) and pushing the depth of the network to 16-19 layers (see Figure 2), a significant improvement over the prior architectures could be achieved on image recognition tasks. VGG Net placed second in the ILSVRC 2014² image recognition challenge. Since then, VGG Net has been used successfully also in object detection, most importantly in Fast R-CNN [26] and Faster R-CNN [27]. The latter is used in this thesis for person detection and is explained more closely in Section 3.1.

² http://image-net.org/challenges/LSVRC/2014

Figure 2. The architecture of the 16 layer VGG Net (configuration D in [25]).


2.4. GoogLeNet

GoogLeNet [28], introduced by Szegedy et al., ranked first in the ILSVRC 2014 image recognition challenge. GoogLeNet presented a new ConvNet architecture, where the network consists of Inception Modules. The key idea of the Inception Module is to feed the input data simultaneously to several convolutional layers and then concatenate the outputs of each layer into a single output. Each convolutional layer has a different filter size and they produce spatially equal sized outputs. Because of this, a single Inception Module can process information at various scales simultaneously, thus leading to better performance. An Inception Module can also have a pooling layer side by side with the convolutional layers. In order to avoid a computational blow up, Inception Modules utilize 1 × 1 convolutional layers for dimension reduction. A typical Inception Module is described in Figure 3. The main benefit of this architecture is that it allows for increasing both the depth and width of the network, while keeping the computational complexity in control. GoogLeNet has 5M parameters (AlexNet 60M). Since its introduction, a few follow-up versions with performance improvements have been proposed [29, 30].

Figure 3. The architecture of a typical Inception Module. There are three convolutional layers with kernel sizes 1 × 1, 3 × 3 and 5 × 5, in addition to one pooling layer with kernel size 3 × 3. Dashed volumes are 1 × 1 convolutions implementing dimension reduction. A stride of 1 is used in every layer.

2.5. Residual Networks

In 2015, He et al. introduced the Residual Network (ResNet) architecture [31], which won the ILSVRC 2015³ image recognition challenge. They replaced traditional stacks of convolutional layers with Residual Modules. In a single Residual Module, a couple of stacked convolutional layers are bypassed with a skip connection. The output of the skip connection is then added to the output of the stacked layers. Every convolutional layer in a Residual Module utilizes Batch Normalization [33] to cope with internal covariate shift. ReLU [34, 35] is used for non-linearity. Two different Residual Modules, basic and bottleneck, were proposed by the authors.

³ http://image-net.org/challenges/LSVRC/2015


Figure 4. The architectures of the (a) basic and (b) bottleneck Residual Modules. A Residual Module keeps the spatial size unchanged. The purpose of the 1 × 1 convolutions in the bottleneck Residual Module, before and after the 3 × 3 convolution, is to reduce and then increase (restore) the dimensions, leaving the 3 × 3 layer a bottleneck with smaller input/output dimensions. Batch Normalization and ReLU precede each convolution, which is in contrast to [31], where they come after. It was later shown in [32] that this way training is faster and better results can be achieved. If needed, the skip connection branch can have a 1 × 1 convolution (dashed volumes) to match dimensions before the element-wise addition.

These are described more closely in Figure 4. A typical ResNet architecture consists of a great number of stacked Residual Modules, making the network much deeper (from tens to hundreds of layers) compared to traditional networks. The authors of ResNet demonstrate that it is easier to optimize a very deep Residual Network than its counterpart, a traditional network with stacked layers. With Residual Networks, the training error is much lower when the depth increases, which in turn gives accuracy improvements.

Since the introduction of the Residual Network architecture, several improvements have been proposed. The authors, He et al., introduced a new Residual Module, which further makes training easier and improves generalization [32]. Zagoruyko and Komodakis argued that very deep ResNet architectures are not needed for state-of-the-art performance [36]. They decreased the depth of the network and increased the size of a Residual Module by adding more features and convolutional layers. Their Wide Residual Network trains faster and it outperformed previous ResNet architectures on the CIFAR-10 and CIFAR-100 datasets⁴ by a distinct margin.


Currently, ResNets are state-of-the-art ConvNet models and they have been shown to perform remarkably well both in image recognition [31, 32, 36] and human pose estimation tasks [17, 19].
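The skip connection idea can be summarized in a few lines of Python. The sketch below uses a per-pixel linear map as a stand-in for a 3 × 3 convolution (an assumption made purely to keep the example short); the essential part is that the module output is F(x) + x.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def conv_stand_in(x, w):
    # Stand-in for a shape-preserving convolution: a linear map over depth.
    return x @ w

H, W, D = 8, 8, 16
x = rng.normal(size=(H, W, D))
w1 = rng.normal(size=(D, D)) * 0.1
w2 = rng.normal(size=(D, D)) * 0.1

# Basic Residual Module: two stacked layers bypassed by a skip connection.
f_x = conv_stand_in(relu(conv_stand_in(x, w1)), w2)
y = f_x + x  # element-wise sum; gradients flow directly through "+ x"
print(y.shape)  # (8, 8, 16): spatial size and depth unchanged
```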

2.6. Human pose estimation from video

Since the introduction of the first ConvNet method for human pose estimation, a number of related methods have been proposed [9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 37, 38]. While most of the methods focus on estimating poses in isolated still images, only a few concentrate on pose estimation in videos. Utilizing the temporal information of subsequent frames of a video may be a valuable cue when estimating keypoint locations.

A method for human pose estimation from video was introduced by Pfister et al. in 2014 [8]. Their method utilized the temporal information in constrained gesture videos. This was achieved by training the network with multiple frames so that the frames were inserted into separate color channels of the input. For example, with three input frames, the number of color channels would be nine. The network architecture was similar to AlexNet, having five convolutional layers followed by three fully connected layers, of which the last one was a regression layer. Pooling was done after the first, second and fifth convolutional layers. However, there were some differences compared to the previous architectures. Some of the convolutional layers were much deeper and pooling was non-overlapping, whereas in most of the previous architectures it was overlapping. The network had 100M parameters and it produced significantly better pose predictions on constrained gesture videos than the previous work. For this reason, this architecture is used in this thesis. The architecture is explained more closely in Section 3.5.
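As a minimal illustration of this input scheme, the snippet below stacks three RGB frames along the color-channel axis, turning them into a single nine-channel network input.

```python
import numpy as np

# Three consecutive 224x224 RGB frames (zeros stand in for real pixel data).
frames = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(3)]

# Insert the frames into separate color channels of one input volume.
net_input = np.concatenate(frames, axis=2)
print(net_input.shape)  # (224, 224, 9)
```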

Optical flow has been used successfully in several works. Jain et al. [9] use it to create motion feature images, which are fed to a ConvNet together with the corresponding RGB frames. In addition, optical flow has been used in [11, 14] to warp the keypoint heatmaps of neighboring frames in order to reinforce the confidence of the current frame.

2.7. Top ranking methods

Wei et al. [16] proposed a multi-stage ConvNet architecture for articulated human pose estimation, where each stage increasingly refines the body part estimates. In their architecture, they used image features and confidence maps from previous stages jointly to make the network also learn spatial relationships between body parts, thus considerably reducing the chance of selecting the wrong body part from the final confidence map. As their network had 6 stages and many layers (52), it was at risk of encountering vanishing gradients during training. To prevent this, an intermediate loss layer was used after each stage to enforce intermediate supervision. Their network performed remarkably well on standard benchmarks, FLIC [5], MPII [39] and LSP [40], outperforming previous methods at the time. However, their network had problems with multiple people in close proximity.

⁴ https://www.cs.toronto.edu/~kriz/cifar.html


Recently, Newell et al. [17] introduced a new ConvNet architecture for human pose estimation, which achieved state-of-the-art results on the FLIC [5] and MPII [39] benchmarks, outperforming all recent methods. Their network architecture benefits from the recently introduced ResNets, convolution-deconvolution architectures [41, 42, 43] and intermediate supervision [16]. The core of the architecture is the Hourglass Module, which implements a bottom-up, top-down architecture, making it possible to better process features across different scales (see Figure 5). The Hourglass Network consists of two stacked Hourglass Modules and it outputs heatmaps in two stages, where the network predicts the probability of each joint's presence at every pixel. The first stage outputs initial predictions, while the second outputs final predictions. In training with intermediate supervision, the loss is applied to both predictions separately using the same ground truth. The architecture of the Hourglass Network is described in Figure 6.

The majority of the recent human pose estimation methods focus on the single person case [12, 15, 16, 17, 38], where the approximate location of a person is expected to be known. However, there has also been some research on the multi-person case, where the close proximity of different persons causes challenges to pose estimation. Pishchulin et al. [10] use body part detection and pose estimation jointly to draw a conclusion about the number of persons in an image, identify occluded body parts and disambiguate body parts between people in close proximity of each other. Their method differs from previous work in that they don't use separate person detection and pose estimation steps, but instead solve both problems together. The main idea of their method is to use a ConvNet and a graphical model jointly.

Figure 5. The architecture of the Hourglass Module. The main building block is a bottleneck Residual Module (green volumes) as illustrated in Figure 4b. There are two types of Residual Modules. They both have 1 × 1 → 3 × 3 → 1 × 1 convolutions but the number of filters is different. Accordingly, the filter sizes are: (1) 128 → 128 → 256 and (2) 256 → 256 → 512. The network branches off at different resolutions and joins together with the main pipeline after each nearest neighbor upsampling of the lower resolution. This allows the preservation of spatial accuracy at each scale in addition to the further processing of spatial features that are best captured at a particular resolution.


Figure 6. The architecture of the Hourglass Network. The two Hourglass Modules are identical. The bottleneck Residual Modules have 1 × 1 → 3 × 3 → 1 × 1 convolutions with (1) 64 → 64 → 128 and (2) 128 → 128 → 256 filter sizes. The network outputs one joint presence probability heatmap for each joint.

The method starts by detecting body part candidates (e.g. a potential head, shoulder or knee) by utilizing an altered version of Fast R-CNN [26]. The body part candidates are then used to form a graph, where every distinct body part is connected to all other body parts by a pairwise term. A pairwise term is used to generate a cost or reward to be paid by all feasible solutions of the pose estimation problem for which both body parts belong to the same person. The pose estimation problem is regarded as an Integer Linear Program (ILP) that minimizes over the set of feasible solutions. Additional costs, variables and constraints ensure that feasible solutions unambiguously select and classify body part candidates as body part classes, and that body part candidates are clustered into distinct people.

Recently, Insafutdinov et al. [19] introduced a follow-up work, which further improves performance by utilizing ResNets, new image-conditioned pairwise terms between body parts and a new incremental optimization method. Several other methods, using a similar joint use of a ConvNet and a graphical model for learning spatial relationships between body parts, have also been introduced recently [37, 44, 45].


3. METHOD

The presented pose estimation method is targeted at video inputs. The rough steps for a single video frame in testing are:

1. Detect persons.

2. Crop person centered images.

3. Feedforward person images to the pose estimation network.

The person bounding boxes are obtained from the input frame by using a separate object detector [27]. The pose estimation is done for each person individually. As a result of the pose estimation, the network outputs the predicted (x, y) locations of the body keypoints. The pose estimation process is described in Figure 7, and a schematic sketch of the pipeline is given below.

The network is pretrained with data from multiple publicly available datasets, thus offering good initialization values for fine-tuning. Pretraining and fine-tuning are evaluated separately. For the training and evaluation of the fine-tuning, data recorded with Kinect is used. As for the ConvNet framework, Caffe [46] with small modifications¹ is utilized.

¹ https://github.com/malinna/caffe-pose_network
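The following Python sketch outlines the per-frame test pipeline under simplifying assumptions: detect_persons and pose_net are trivial stand-ins for the Faster R-CNN detector and the Caffe pose network, and the rescaling to 224 × 224 is omitted. The flip averaging follows the process in Figure 7; note that a full implementation would also swap the left/right joint labels of the flipped prediction.

```python
import numpy as np

def detect_persons(frame):
    # Stand-in detector: one fake (x, y, side) square box in the frame.
    h, w = frame.shape[:2]
    return [(w // 4, h // 4, min(h, w) // 2)]

def crop_person(frame, box):
    # Square, person centered crop; zero padding outside the image borders.
    x, y, side = box
    out = np.zeros((side, side, 3), dtype=frame.dtype)
    crop = frame[max(y, 0):y + side, max(x, 0):x + side]
    out[:crop.shape[0], :crop.shape[1]] = crop
    return out

def pose_net(image):
    # Stand-in for the ConvNet: 13 predicted (x, y) keypoints.
    return np.zeros((13, 2))

def estimate_poses(frame):
    poses = []
    for box in detect_persons(frame):
        crop = crop_person(frame, box)
        p1 = pose_net(crop)                      # original crop
        p2 = pose_net(crop[:, ::-1])             # horizontally flipped crop
        p2[:, 0] = crop.shape[1] - 1 - p2[:, 0]  # mirror x back
        poses.append((p1 + p2) / 2.0)            # average the two estimates
    return poses

print(estimate_poses(np.zeros((480, 640, 3)))[0].shape)  # (13, 2)
```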

Figure 7. The pose estimation process for a single input frame. Zero padding is added to cropped person images on regions outside the image borders. Person images are rescaled to size 224 × 224 before feedforwarding them to the network. The network returns 13 body keypoints for each person.


3.1. Person detection

Faster R-CNN [27] is utilized for detecting persons in the training and testing images. It is a ConvNet based method and it has shown very good results in object detection tasks. The basic idea of Faster R-CNN is to use a separate region proposal network (RPN) for detecting regions of an input image that most likely contain objects. These region proposals are then fed to a detection network [26], which makes the final decision about object existence and bounding boxes. The important thing is that the RPN shares convolutional features with the detection network, thus enabling nearly cost-free region proposals.

The network architecture of Faster R-CNN is described in Figure 8. The layers in group A are shared fully convolutional layers. The authors of Faster R-CNN conducted experiments with two network models, the Zeiler and Fergus model [24] (ZF) and the Simonyan and Zisserman model [25] (VGG). These models have 5 and 16 shareable convolutional layers, respectively.

The layers in group B are RPN specific layers and they are fully convolutional. The RPN operates in a sliding window fashion. At each sliding window location, 9 region proposals are predicted simultaneously. These proposals are parametrized relative to 9 reference boxes, called anchors. Each anchor is centered at the sliding window in question, and is associated with a scale and an aspect ratio. The class score layer outputs scores that are used by the following softmax layer to estimate the probability of object / not object for each proposal.

Figure 8. The network architecture of Faster R-CNN. Local response normalization (LRN) and pooling layers are omitted for simplification.


The bounding box prediction layer outputs the parametrized coordinates. At each sliding window position, there are 2 × 9 = 18 class probabilities and 4 × 9 = 36 coordinates. Some of the resulting RPN proposals highly overlap with each other, thus creating redundant proposals. In order to reduce the number of proposals, non-maximum suppression (NMS) is applied to the proposal regions based on their class probabilities. The remaining RPN proposals are input to the detection network.
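A minimal sketch of the anchor construction is given below: 3 scales × 3 aspect ratios yield the 9 anchors per sliding window position. The scales (128², 256² and 512² areas) and ratios (1:1, 1:2, 2:1) follow the Faster R-CNN paper [27].

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """9 anchors (x1, y1, x2, y2) centered at one sliding window position."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)  # width and height chosen so that the anchor
            h = s / np.sqrt(r)  # area stays s*s while w/h equals the ratio r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

anchors = make_anchors(100.0, 100.0)
print(anchors.shape)  # (9, 4): hence 2x9 class scores and 4x9 coordinates
```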

The layers in group C are detection network specific layers. The key layer in the detection network is the RoI pooling layer, which dynamically expands the volume based on the number of input proposals. The other input to the RoI pooling layer is the output of the last shared convolutional layer, from which pooling is done based on the RoI of each proposal. The RoI pooling layer is followed by two fully connected layers. After these come the class score and probability layers, and the bounding box regression layer. Class scores are computed for m object classes and one background class.

As a summary, the layers A and B together form the region proposal network, and the layers A and C the detection network. The operation of Faster R-CNN with example images is shown in Figure 9.

Faster R-CNN training requires four steps. The goal is to use the convolutional parameters of the detection network also with the RPN. The training steps are:

1. Train all the RPN specific layers (A and B).

2. Train all the detection network specific layers (A and C) by using the proposals generated by the RPN resulting from the previous step. At this point, two separate networks exist.

3. Fine-tune the RPN specific layers (B) by fixing the parameters of the shared convolutional layers to the values resulting from the previous step. Now the two networks share the convolutional layers.

4. Fine-tune the detection network specific layers (C) by keeping the shared layers fixed and using the proposals generated by the RPN resulting from the previous step.

The time per image of Faster R-CNN on an NVidia K40 GPU is 59 ms with the ZF model and 198 ms with the VGG model. The main results, as the authors report in the paper [27], are presented in Table 1. The experiments were performed on the PASCAL VOC 2007² and 2012³ datasets.

Table 1. The mean average precision (mAP) and time per image of Faster R-CNN as reported in the paper [27].

Model   Shared layers   Train data      Test data   mAP (%)   Time (ms)
ZF      5               VOC 2007        VOC 2007    59.9      59
VGG     16              VOC 2007        VOC 2007    69.9      198
VGG     16              VOC 2007+2012   VOC 2007    73.2      198
VGG     16              VOC 2012        VOC 2012    67.0      198
VGG     16              VOC 2007+2012   VOC 2012    70.4      198


Figure 9. The results of Faster R-CNN on samples from the MPII Human Pose dataset. The top 100 RPN proposals after non-maximum suppression are in yellow. The probability of these bounding boxes containing an object is the highest. The final bounding boxes resulting from the detection network are in green. On the top-left corner of a bounding box is the probability of the box containing a person. Object types other than person are not shown here. In addition, a person detection result is discarded and not shown if the probability of the bounding box in question is below 0.6. The VGG model was utilized for these results.


Both datasets have 20 object classes: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa and tv/monitor. As the person class is included, it was sufficient to use already trained network models, hence the training of Faster R-CNN could be skipped. In addition, it was noted that the accuracy of the models in question was sufficient for the datasets used in this thesis.

However, it was noticed that Faster R-CNN sometimes gives false positives. This is not a problem in training, since both the ground truth and the Faster R-CNN detections are used together to crop the training image. But in testing, the pose estimation is also performed for false positives. Most likely these false positives could be filtered, especially with constrained images, by adjusting the parameters of Faster R-CNN (e.g. tightening the decision threshold or reducing the number of region proposals). In the evaluation, the ground truth is also used to decide whether a frame has a person or not, so it is guaranteed that all the evaluation frames contain a person. Apart from this, Faster R-CNN was run on the original fine-tuning evaluation frames, where the ground truth was not yet used for the frame selection. This resulted in a false positive rate of 2.86% and a false negative rate of 0.65%. In all of the original evaluation frames, there is one fully visible person making gestures in a constrained environment. A person detection was considered false if the resulting bounding box did not contain a person, or if it had a partially visible person on the edges of the bounding box; in other words, if the intersection-over-union (IoU) ratio between the detection and the ground truth was 0.5 or less.
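For reference, the IoU criterion used above can be computed as follows for two boxes in (x1, y1, x2, y2) form; this is a generic sketch, not code from the thesis.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...: counted as false
```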

3.2. Data augmentation

The Faster R-CNN person detector is applied to every training image. For each detected person, the IoU between the detected person bounding box and the expanded ground truth bounding box is calculated. The expanded ground truth person box is the tightest bounding box including all the joints, expanded by a factor of 1.2. The person box having the biggest IoU is selected as the best choice. Based on the best IoU, the training image is augmented as described in Table 2.

² http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html

³ http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

Table 2. The relation between the person box overlapping ratio and the data augmentation.

Overlapping ratio    Person box type used in augmentation
IoU > 0.7            Faster R-CNN box
IoU < 0.5            Ground truth box
0.5 ≤ IoU ≤ 0.7      Both boxes


Figure 10. Visualization of data augmentation with three selected samples from the MPII Human Pose dataset [39]. In the left column are the image samples and on the right the cropped person centered images for each sample. The ground truth (the points in the skeleton) and the expanded ground truth bounding boxes are in green. Detected person boxes are in blue and the value in the top-left corner of the box is the overlapping ratio (IoU) between the expanded ground truth box and the detected person box. The crop area is in red (dashed square) and it is expanded according to the longest side of the person bounding box. The middle image sample shows well that when person detection fails to capture the whole person, the ground truth is used to crop the person image. Otherwise the goal is to use the detected person box.


If the detected person box is close to the expanded ground truth person box, only the former is used to crop the person image. If the detected person box is far from the expanded ground truth person box, only the latter is used to crop the person. In between these, both person boxes are used to crop the person, resulting in two training images with small differences in translation and scale. The shortest side of a person box is expanded to equal the longest side, resulting in a square crop area, which defines the person image used in training. Zero padding is added where needed. A single cropped person image is rescaled to size 224 × 224 before feeding it to the network.

In addition to the aforementioned, each training image is augmented with a horizontal flip, giving a mirrored version of the image. All in all, a single person image from a source dataset can result in either two or four augmented person centered training images, as sketched below. Data augmentation is visualized in Figure 10.
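The crop selection rule of Table 2 can be written compactly as below; a minimal sketch, assuming an iou helper like the one in Section 3.1 has already been evaluated for the box pair.

```python
def boxes_for_cropping(detected_box, gt_expanded_box, iou_value):
    """Return the person box(es) used to crop training images (Table 2)."""
    if iou_value > 0.7:
        return [detected_box]               # detection is close: use it alone
    if iou_value < 0.5:
        return [gt_expanded_box]            # detection is far: fall back to GT
    return [detected_box, gt_expanded_box]  # in between: crop with both

# Each returned box yields one square, zero-padded crop, and every crop is
# additionally mirrored, giving 2 or 4 training images per source image.
```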

3.3. Pretraining

The model is pretrained from scratch by using several publicly available datasets (see Table 3). The number of annotated joints varies between the datasets. The MPII Human Pose [39], Fashion Pose [47] and LSP [40] have full body annotations, while the FLIC [5] and BBC Pose [48] have only the upper body annotated. Since the presented method uses a single point for the head, and because the MPII Human Pose and LSP have annotations for the neck and head top, the center point of these two is taken and used as the head point.

As the aim was to study whether additional partially annotated training data brings improvement over using only fully annotated samples, the validation samples should be fully annotated. Thus, all the fully annotated (13 joints) person images are put into a single pool, from which 2000 images are sampled randomly for validation.

Table 3. Overview of the datasets used in pretraining. Only the training set of the MPII Human Pose is used, because the annotations are not available for the test set. In the BBC Pose, the training set is annotated semi-automatically [49], while the test set is manually annotated. Only manually annotated data from the BBC Pose is used. Data augmentation is utilized to expand the training data. The Train, Test and Total columns count the person boxes used from each dataset; the Train (aug.) and Validation columns count the person boxes used for pretraining and validation.

Dataset                Annotated points   Train   Test   Total   Train (aug.)   Validation
MPII Human Pose [39]   1-16               28821   0      28821   71018          1160
Fashion Pose [47]      13                 6530    765    7295    14538          694
LSP [40]               14                 1000    1000   2000    5074           146
FLIC [5]               11                 3987    1016   5003    14780          0
BBC Pose [48]          7                  0       2000   2000    6764           0
Total                                     40338   4781   45119   112174         2000


Figure 11. Example pose estimations with the pretrained network. The samples are taken randomly from the testing set of the MPII Human Pose dataset. The green bounding boxes are the results of person detection and the number in the top-left corner is the probability of a box containing a person.


The validation images are then removed from the pool. Next, all the partially annotated images are put into the same pool, so that it eventually contains person images with a heterogeneous set of annotated joints. Finally, the pool is used in training.

The purpose of the pretrained model is to offer good weight initialization values for fine-tuning. Pretraining takes 23 hours on three NVidia Tesla K80 GPUs. Figure 11 contains example pose estimations with the pretrained network.

3.4. Fine-tuning

The purpose of the fine-tuning is to adapt the pretrained model to a particular use case, for instance a gesture control system or a game. The pretrained model alone is not a good enough pose estimator for the use cases in this thesis, because the shallow network architecture used lacks the capacity to perform well with highly diverse training data. More complicated network architectures, such as [17, 15], would certainly give better results, but then the speed gain achieved with the shallow network architecture would most likely be lost.

In fine-tuning, the pretrained model is used for weight initialization. When the network is fine-tuned with use case specific data, for example to estimate poses in a gesture control system, the training data is most likely consistent with respect to poses and environment. This is beneficial for accuracy. Even a shallow network can produce very good estimates if the training data is limited to a particular use case. Using more complicated, and potentially slower, network architectures in these situations is therefore not necessary.

Figure 12. Example pose estimations with the fine-tuned network. Predictions are in red and the Kinect ground truth in green. The columns show five different frames from the evaluation data. The first row shows the results of the full fine-tuning (experiment 3) and the second row shows the results of phase 1 (experiment 1). The experiments are explained later in Section 4. Full videos are available at https://youtu.be/qjD9NBEHapY and https://youtu.be/e-P5SYL-Aqw.


In this thesis, Kinect [20] is used to produce the annotations for the fine-tuning data, but alternative methods can be considered as well. Kinect produces accurate enough annotations in restricted environments, so it was not necessary to consider other methods, for instance manual annotation. Figure 12 contains example pose estimations with the fine-tuning data.

3.5. Network architecture

A generic ConvNet architecture is utilized for the pose estimation task. The network has five convolutional layers followed by three fully connected layers, of which the last is a regression layer. The architecture is described in Figure 13.

Figure 13. The network architecture. The letters k and s denote kernel size and stride.

The regression layer produces position estimates (x, y) for the human body joints. More precisely, one estimate for the head, six for the arms and six for the legs, a total of 13 position estimates. The network input size is 224 × 224 × 3. The network does not utilize any spatiotemporal information, but treats all training images individually. A generic ConvNet architecture is used, because it has been shown to perform well in human pose estimation tasks [6, 7, 8]. The forward pass time of the network is 20 ms on an NVidia Tesla K80 GPU, which makes it highly capable for real-time tasks.

3.5.1. Convolutional layer

The first five layers of the network are convolutional layers. A convolutional layer has a set of learnable filters with spatially small receptive fields. These filters learn to activate for different features in the input, for instance oriented edges or color blobs. The operation of an example convolutional layer is illustrated in Figure 14.

During the forward pass, each filter is slid (convolved) across the width and height of the input volume. At every filter position, the neuron output y is computed as a dot product of the weight and input vectors (w and x) summed with a bias b. This is defined as

y = Σ_i w_i x_i + b    (1)

In most cases, convolutional layers use an activation function f after the neuron output y, defined as

a = f(y)    (2)


Figure 14. An example convolutional layer. The input volume of size 8 × 8 × 2 is in dark red. The output volume of size 4 × 4 × 6 is in blue. Typically the dimensions of the volumes would be much larger. The input volume can be, for example, an RGB image (if the current layer is the first layer), or the output of a previous layer. The receptive field of size 2 × 2 (equivalently, the filter size) is in magenta. In this example, a stride of 2 is used to slide the filter spatially. The blue circles are neurons. Neuron inputs are denoted as x_i, weights as w_i and the bias as b. The neurons along the depth are connected to the same local region in the input volume. An activation function f is applied to the neuron output y. The neurons in each 4 × 4 depth slice have the same weights and bias. In this example, there are 4 × 4 × 6 = 96 neurons, 2 × 2 × 2 × 6 = 48 weights, 6 biases and 6 filters.

Typically the activation function implements a non-linearity. The most commonly used activation functions are the logistic sigmoid f(x) = 1/(1 + e^(−x)), the hyperbolic tangent f(x) = tanh(x) and the rectified linear unit f(x) = max(0, x). The last one is used in this thesis and is explained in more detail in Section 3.5.3. The weight vector and the bias are called the parameters of a convolutional layer and they are shared at every depth slice. The network optimization tries to find optimal values for these parameters by utilizing the gradient descent and backpropagation algorithms.
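The computation of Equation (1) can be made concrete with a naive single-filter convolution in Python. This is an illustrative sketch, not the Caffe implementation; it reproduces the Figure 14 setting (8 × 8 × 2 input, 2 × 2 filter, stride 2).

```python
import numpy as np

def conv2d_single_filter(x, w, b, stride, f=lambda y: np.maximum(0.0, y)):
    """x: HxWxC input volume, w: kxkxC filter, b: scalar bias, f: activation."""
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = f(np.sum(w * patch) + b)  # a = f(sum_i w_i x_i + b)
    return out

x = np.random.rand(8, 8, 2)   # input volume as in Figure 14
w = np.random.rand(2, 2, 2)   # one 2x2 filter over 2 input channels
print(conv2d_single_filter(x, w, b=0.1, stride=2).shape)  # (4, 4)
```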

3.5.2. Fully connected layer

The last three layers of the network are fully connected layers. As the name suggests, each neuron is connected to all the activations of the previous layer, instead of a small receptive field, as is the case with convolutional layers. This is also how neurons are connected in regular neural networks. The output of a fully connected layer is a vector.

3.5.3. Rectified linear unit

All the layers, except the last one, utilize the rectifier [34, 35] as a non-linearity. The rectifier is an activation function defined as

f(x) = max(0, x), (3)

where x is the neuron output before the activation function. In other words, the neuron output is simply thresholded at zero. A unit employing the rectifier is called a rectified linear unit (ReLU). The rectifier has been the most popular activation function for ConvNets in recent years [50]. It has been shown that supervised training of very deep neural networks is much faster if the hidden layers are composed of ReLUs [35, 23]. The main advantages of ReLU over the logistic sigmoid and the hyperbolic tangent are (1) easily obtained sparse representations of a network and (2) a smaller probability of vanishing gradients [35]. Sparse networks can be thought of as being more biologically plausible neural networks. The sparsity of such networks is closer to the hypothesized sparsity of the human brain, which may explain some of the benefits of using rectifiers.

3.5.4. Local response normalization

Local response normalization (LRN) is performed after the first convolutional layer. LRN normalizes over local input regions and it has been shown to give a small performance improvement in image classification tasks [23]. It has also been used previously in pose estimation tasks [8]. LRN is computed for each spatial position (x, y) as

bi = ai / ( k + (α/n) ∑_{j=i−n/2}^{i+n/2} aj^2 )^β , (4)

where the sum is taken over n adjacent channels (or depth slices). Zero padding is added where needed and the dimension of the input volume remains unchanged. The output of the LRN in the ith depth slice is denoted as bi and the input as ai. The constants k = 2, n = 5, α = 10^−4 and β = 0.75 used in this thesis are the same as in [23], where they were determined using validation data. While these values worked in [23], they are perhaps not optimal for the presented method, because the training datasets are not the same. However, the effect of LRN on pose estimation accuracy is not considered significant, so even with optimal values there would be no considerable gain in accuracy.
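
As an illustration, equation (4) can be implemented in a few lines. The sketch below follows the reconstruction above (channels-first activations and the α/n scaling are assumptions) and is not the implementation used in this work:

    import numpy as np

    def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
        # a: activations with shape (channels, height, width).
        # Each value is divided by a term summed over n adjacent
        # channels, as in equation (4).
        channels = a.shape[0]
        b = np.empty_like(a, dtype=np.float64)
        for i in range(channels):
            lo = max(0, i - n // 2)
            hi = min(channels, i + n // 2 + 1)
            s = np.sum(a[lo:hi] ** 2, axis=0)
            b[i] = a[i] / (k + (alpha / n) * s) ** beta
        return b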


3.5.5. Pooling

The basic idea of pooling is described in Figure 15. The purpose of pooling is to reduce the spatial size of the input data. This leads to fewer parameters and thus reduces computational cost and controls overfitting. However, it has been argued that pooling may not be necessary and could be replaced by a convolutional layer with a larger stride. Springenberg et al. [51] show that this can be done without loss in accuracy on several image recognition benchmarks. It is likely that future ConvNet architectures will contain very few or no pooling layers.

In the presented method, pooling is carried out in the layers 1, 2 and 5. All the layers use non-overlapping pooling, which effectively reduces the parameter count and the computation time in the network. The kernel window size and the stride of the pooling in the layers 1, 2 and 5 are described in Table 4. The pooling function for a single kernel window is defined as

yi = max(x0, x1, . . . , xn), (5)

where xn is the nth value in the kernel window of the input volume and yi is the ith output value of the pooling. Pooling is visualized in Figure 16.
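
A minimal sketch of non-overlapping max pooling on a single depth slice, reproducing the numeric example of Figure 15, could look as follows (illustrative only, not the implementation used in this work):

    import numpy as np

    def max_pool(x, k=2, stride=2):
        # Non-overlapping max pooling (stride == kernel size) on a
        # single depth slice of shape (h, w).
        h, w = x.shape
        out = np.empty((h // stride, w // stride), dtype=x.dtype)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = x[i*stride:i*stride+k, j*stride:j*stride+k].max()
        return out

    x = np.array([[4, 8, 2, 1],
                  [9, 2, 0, 1],
                  [7, 3, 6, 5],
                  [2, 0, 8, 4]])
    print(max_pool(x))  # [[9 2]
                        #  [7 8]]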

[Figure 15 graphic: a 4 × 4 × 6 input volume pooled with a 2 × 2 kernel and stride 2 into a 2 × 2 × 6 output volume; the single-slice example [[4, 8, 2, 1], [9, 2, 0, 1], [7, 3, 6, 5], [2, 0, 8, 4]] max pools to [[9, 2], [7, 8]].]

Figure 15. An example pooling layer. Pooling downsamples the input volume spatially and keeps the volume depth unchanged. Each depth slice is pooled independently. The input volume of size 4 × 4 × 6 is in dark red. The output volume of size 2 × 2 × 6 is in dark green. Typically the dimensions of the volumes would be much larger. The input volume is typically an output of a convolutional layer. A filter of size 2 × 2 (the receptive field in magenta) and a stride of 2 are used to slide the filter spatially. Thus, the pooling is non-overlapping. The pooling function is denoted as f. The most common pooling function is the max operation. On the bottom is an example of max pooling on a single depth slice.


Table 4. Pooling parameters implementing non-overlapping pooling in the layers 1, 2 and 5. Typically larger kernel sizes and strides are not used, because pooling would then be too destructive. Non-overlapping max pooling with these parameters has been used successfully in pose estimation and localization tasks [8, 52].

Layer   1      2      5
Kernel  3 × 3  2 × 2  3 × 3
Stride  3      2      3

Figure 16. Visualization of max pooling in the layers 1, 2 and 5: 109 × 109 → 37 × 37, 33 × 33 → 17 × 17 and 17 × 17 → 6 × 6.

3.5.6. Dropout

Dropout with a probability of p = 0.5 is utilized in the fully connected layers 6 and 7. Dropout performs regularization over the training data and it has been shown to improve ConvNet performance in vision tasks [53]. In training, dropout temporarily removes neurons, along with all their incoming and outgoing connections. This prevents neurons from learning too much from the training data, making them generalize better to unseen data. Neurons to be dropped are selected randomly. A neuron at training time is present with a fixed probability of p. All neurons are always present at testing time. Using dropout has several advantages. Firstly, it reduces overfitting, because not all neurons are trained at once. To put it another way, it makes the network generalize better to new data by learning more robust features. It works as a countermeasure for a limited amount of training data. However, one drawback of dropout is that it significantly increases the training time. This is because each training iteration effectively tries to train a different random architecture and the gradients that are being computed are not gradients of the final architecture that will be used at test time. By using a higher dropout ratio, training can suffer less overfitting, at the cost of longer training time.
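
The training-time behaviour described above can be sketched as follows; the test-time scaling by p follows the classic formulation of [53], and the sketch is illustrative rather than the implementation used in this work:

    import numpy as np

    def dropout(x, p=0.5, training=True):
        # During training each neuron is kept with probability p,
        # together with its connections. At test time all neurons are
        # present; the outputs are scaled by p so that the expected
        # activation matches training (as in [53]).
        if not training:
            return p * x
        mask = np.random.rand(*x.shape) < p
        return x * mask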


3.6. Training details

In model optimization, the network weights are updated using batched stochastic gradient descent (SGD), which performs well with large training sets [54]. The hyperparameters of SGD are momentum and learning rate. The momentum is set to 0.9. In pretraining, where the network is trained from scratch, the learning rate is set to 10^−2, weights are initialized randomly using the Xavier algorithm [55] and biases are set to zero. In fine-tuning, the learning rate is set to 10^−3. The loss function used in optimization penalizes the distance between predictions and ground truth. The loss function is a weighted Euclidean (L2) loss defined as

E = 1/(2N) ∑_{i=1}^{N} wi ‖xgt_i − xpred_i‖₂^2 , (6)

where the vector w holds the joint weights and the vectors xgt and xpred hold the joint coordinates in the form (x1, y1, x2, y2, ..., x13, y13). The weight wi is set to 0 if the ground truth of the joint coordinate xgt_i is not available. Otherwise, it is set to 1. This way only the annotated joints contribute to the loss. This enables training the network using datasets having only upper body annotations, along with datasets having full body annotations. The ability to utilize heterogeneous training data, where the set of joints is not the same in all training samples, potentially leads to better performance as more training data can be used.
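
One plausible reading of equation (6), treating N as the number of training samples in a batch, can be sketched as follows (illustrative only):

    import numpy as np

    def weighted_l2_loss(pred, gt, w):
        # pred, gt, w: arrays of shape (N, 26) in the order
        # (x1, y1, ..., x13, y13). w is 0 where the ground truth
        # coordinate is missing and 1 otherwise, so unannotated
        # joints contribute nothing to the loss.
        n = pred.shape[0]
        return np.sum(w * (gt - pred) ** 2) / (2.0 * n)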

For comparison, pretraining was also done without the weighted Euclidean loss. In this case, only images with fully annotated joint positions (13 joints) were used, so that the training data is homogeneous with respect to joint annotations. Doing this reduces the size of the training data from 112174 to 66598 images. The average joint prediction errors with heterogeneous and homogeneous data are 15.7 and 16.6 pixels on 224 × 224 images. With heterogeneous data, there is about a 5% improvement in prediction error.

In batched SGD, a batch size of 256 is used. Each iteration selects images for the batch randomly from the full training set. A training image contains a roughly centered person whose joints are annotated. The training images are resized to 224 × 224 before feeding them to the network. A mean pixel value of 127 is subtracted from every pixel component and the pixel components are normalized to the range [-1, 1]. Joint annotations are normalized to the range [0, 1], according to the cropped person-centered image.
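
The preprocessing steps can be sketched as below; the use of OpenCV (cv2) and the divisor 128 are assumptions for illustration, as the text only fixes the mean value 127 and the target ranges:

    import numpy as np
    import cv2  # the image library is an assumption, any resize works

    def preprocess(img, joints):
        # img: cropped, person-centered image; joints: (13, 2) pixel
        # coordinates in the crop. Dividing by 128 is one plausible
        # way to reach approximately [-1, 1] after subtracting 127.
        h, w = img.shape[:2]
        img = cv2.resize(img, (224, 224)).astype(np.float32)
        img = (img - 127.0) / 128.0
        joints = joints.astype(np.float32) / np.array([w, h], np.float32)
        return img, joints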

3.7. Testing details

The person detector is applied to an image from which poses are to be estimated. Person images are cropped based on the detections as described earlier. In addition, for each person image, a horizontally flipped double is created. Both the original and the doubled person images are fed to the network. The final joint prediction vector is the average of the estimations of these two (the predictions of the doubled image are flipped so that they correspond to the predictions of the original image). By doing this, a small gain in accuracy is achieved. This is covered in detail in Section 5.
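
The flip-and-average scheme can be sketched as follows, assuming a hypothetical callable net that returns joint coordinates normalized to [0, 1]; a skeleton distinguishing left and right sides would additionally require swapping the joint labels of the flipped prediction:

    import numpy as np

    def predict_with_flip(net, img):
        # `net` is a hypothetical callable returning a (26,) array of
        # joint coordinates normalized to [0, 1] in the order
        # (x1, y1, ..., x13, y13).
        p1 = net(img)
        p2 = net(img[:, ::-1]).copy()   # horizontally flipped input
        p2[0::2] = 1.0 - p2[0::2]       # mirror x coordinates back
        # Left/right joint labels would also be swapped here if the
        # skeleton distinguishes sides.
        return 0.5 * (p1 + p2)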


3.8. Network visualization

The pretrained network is analyzed through filter and activation visualization. The convolutional layers are visualized in Figures 17, 18, 19, 20 and 21, where the learned filters are on the left and the activations on the right. Every activation box shows an activation map corresponding to the filter at the same location on the left. The fully connected layers are visualized in Figure 22. The activations are obtained with an example image shown in Figure 23.

Filter visualizations display the learned weights and they can be used to observe what kind of patterns or blobs the network has learned. In addition, they may unveil possible problems in learning. A well-trained network usually displays nice and smooth filters without any noise patterns. The filters on the first convolutional layer are easier to interpret as they see the input RGB data directly. For simplicity, in the other convolutional layers, the filters are summed along the input depth. These filters may not be as interpretable as the filters on the first layer, but they may still reveal valuable information.


Figure 17. The network visualizations of the 1st layer. (a) The filters have captured oriented edges and color blobs. However, they seem to be slightly noisy, which can be a symptom of too short a training time or poor regularization, which may have led to overfitting. (b) The activations.



Figure 18. The network visualizations of the 2nd layer. (a) The filters have captured mostly edges and seem to resemble Gabor filters, which may be a good sign as Gabor filters are similar to those of the human visual system. More importantly, Gabor filters are used for edge detection, so it is likely that these filters have learned something useful. (b) The activations. There are quite many all-zero activation maps (roughly half), which indicates dead filters and may be a symptom of a too high learning rate.


Figure 19. The network visualizations of the 3rd layer. (a) The filters are not as interpretable as the filters of the previous layers. Even so, there are some edges and corners visible. (b) The activations.



Figure 20. The network visualizations of the 4th layer. (a) Again, the filters of this layer are not as interpretable as the filters of the first two layers. Even so, there are some edges and corners visible. (b) The activations.


Figure 21. The network visualizations of the 5th layer. (a) The filters. Mostly horizontal and vertical edges can be seen. (b) The activations.



Figure 22. The network visualizations of the fully connected layers. On top are the layer outputs and on the bottom their histograms. (a) The 6th layer. (b) The 7th layer.

Figure 23. The network visualization of the last layer. Predicted joint coordinates superimposed onto the validation sample used in the forward pass.


4. EVALUATION

4.1. Metric

Pretraining and fine-tuning are evaluated with the percentage of correct keypoints (PCK) metric [5], where a joint location estimate is considered correct if its L2 distance to the ground truth is at most 20% of the torso length. The torso length is the L2 distance between the right shoulder and the left hip.
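
A sketch of the metric for a single person could look as follows (illustrative only; the joint indices are assumptions):

    import numpy as np

    def pck(pred, gt, r_shoulder=2, l_hip=9, thr=0.2):
        # pred, gt: (n_joints, 2) coordinates for one person. A joint
        # is correct if its L2 error is at most thr times the torso
        # length (right shoulder to left hip).
        torso = np.linalg.norm(gt[r_shoulder] - gt[l_hip])
        err = np.linalg.norm(pred - gt, axis=1)
        return float(np.mean(err <= thr * torso))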

4.2. Data

2000 randomly selected samples are used for the evaluation of the pretraining. For fine-tuning, data recorded with Kinect for Windows v2 is used (see Table 5). The joint estimates produced by Kinect are used as ground truth. It was made sure that the data was recorded in such a way that the error in the joint estimations is minimal. In practice this means good lighting conditions, no extremely rapid movements and no body part occlusions. The gestures performed in the data try to mimic different gesture control events, where the hands are used for tasks like object selection, moving, rotating and zooming, in addition to hand drawing and wheel steering. An additional 4000 frames, with identical clothing, are recorded for the evaluation of the fine-tuning.

Table 5. Kinect-recorded fine-tuning data for training. All the frames have a similar background, person and gestures, but the clothing differs. An additional 4000 frames, which have identical clothing (clothing number 1), are recorded for the evaluation.

Clothing  Frames
1         27222
2         18760
3         20244
4         20726
5         10560
6         11666
7         10136
Total     119314

4.3. Experiments

Three different fine-tuning experiments are performed, using a different set of training frames in each case (see Table 6). Experiments 1 and 2 together use the same training frames as experiment 3. Basically, experiment 3 is the same as experiments 1 and 2 performed consecutively. The purpose of this division is to see the effect of using the same or different clothing between the training and testing data.


Table 6. Fine-tuning experiments. The training data has (1) different clothing from the testing data in every frame, (2) the same clothing as the testing data in every frame, (3) the same clothing as the testing data in some of the frames. In phase 2, the fine-tuning is done over the already fine-tuned network of phase 1. Otherwise it is done over the pretrained network.

#  Name     Initialization network  Clothing in training frames
1  phase 1  pretrain                2, 3, 4, 5, 6, 7
2  phase 2  phase 1                 1
3  full     pretrain                1, 2, 3, 4, 5, 6, 7

Experiment 1 expresses more the ability to generalize (to all people), while experiments 2 and 3 express specificity (to certain people).

4.4. Results

The results are displayed in Table 7 and Figure 24. In full fine-tuning (experiment 3), with the use case specific data, an accuracy of 96.8% is achieved. In fine-tuning phase 1 (experiment 1), where no same clothing occurs between the training and testing data, the accuracy is 90.6%. However, if we look at the accuracy of the wrist (pretrain: 24.5%, phase 1: 67.4%, full: 89.2%), which is the most challenging body joint to estimate, but perhaps also the most important one considering a gesture control system, we can see that additional case specific training data can significantly improve the accuracy and make the system usable in practice. This originates partially from the fine-tuning data, where the wrist location variation is biggest. Most likely, if more training data were used, and perhaps better data augmentation, a better wrist accuracy could be achieved with the current network architecture. After all, the wrist accuracy is still decent, making the presented method useful for many use cases.

The results indicate that a trade-off between generalization and specificity exists between pretraining and fine-tuning. This can be seen by comparing accuracies between

Table 7. Pose estimation results (PCK@0.2). The first three cases use the pretrain validation samples (2000 images) in testing, while the other models use the fine-tuning validation samples (4000 frames).

Network              Head  Wrist  Elbow  Shoulder  Hip   Knee  Ankle  All
Mean pose            31.1  18.9   8.5    11.8      10.0  40.8  33.5   21.4
Pretrain             84.2  41.6   60.5   76.9      72.8  62.6  53.7   63.1
Fine-tune (full)     77.5  22.2   42.9   49.8      52.5  42.6  38.6   44.2
Pretrain             86.1  24.5   64.1   86.8      88.0  82.5  64.0   69.6
Fine-tune (phase 1)  95.3  67.4   87.3   98.4      96.3  96.4  95.5   90.6
Fine-tune (phase 2)  99.6  88.1   95.6   99.9      97.3  98.5  98.5   96.6
Fine-tune (full)     99.3  89.2   95.9   99.7      97.6  98.5  98.6   96.8


[Figure 24 graphic: eight PCK curve panels (Head, Wrist, Elbow, Shoulder, Hip, Knee, Ankle, All), each plotting detection rate (%) against normalized distance (0–0.2) for the mean pose, pretrain, fine-tune (phase 1), fine-tune (phase 2) and fine-tune (full) models.]

Figure 24. Pose estimation results (PCK@0.2). The dashed lines use the pretrain validation samples (2000 images) in testing, while the solid lines use the fine-tuning validation samples (4000 frames). Put another way, the dashed lines represent generalization accuracy, while the solid lines represent use case specific accuracy. The label indicates which network is used in testing.


the pretrained and fine-tuned networks, first with the pretrain validation samples and then with the fine-tuning validation samples. The pretrain validation samples express the case of generalization as they contain a large variation of persons and poses in an unconstrained environment. On the contrary, the fine-tuning validation samples reflect the case of specificity as they have restricted poses in a constrained environment. After the full fine-tuning, the accuracy on the pretrain validation set drops from 63.1% to 44.2% (light red and dark red curves in Figure 24), while at the same time, the use case specific accuracy increases from 69.6% to 96.8% (blue and magenta curves). In certain cases, the loss in generalization is acceptable if a gain in specificity is achieved at the same time. One example of such a case is a gesture control system set up in a factory, where all the persons wear identical clothing. Most importantly, while generic person detection in highly varying poses and contexts is an important and challenging problem, our results show that in some use cases the state of the art for the generic problem may produce inferior results compared to a simpler approach which has been specifically trained for the problem at hand.


5. ADDITIONAL EXPERIMENTS

The additional experiments described in this section were made using only the training frames of the MPII Human Pose dataset with a train/test split ratio of 9:1 for person boxes. Accuracy was calculated using the PCKh@0.5 metric, which is a PCK measure with the matching threshold set to 50% of the head segment length.

5.1. Input image flipping

In testing, the pose estimation network takes two inputs. The first one is the original image and the second one is a horizontally flipped version of the original image. The final prediction is the average of these two. Concerning this, two experiments were made:

1. Input original image only (1 input).

2. Input both the original and the flipped image (2 inputs).

Experiment 1 (original image only) resulted in an accuracy of 61.8%, while with experiment 2 (original and flipped image) the accuracy was 63.2%. As the latter produced slightly better accuracy, it was decided to use two inputs in the final solution. However, it was also noted that with two inputs, the network forward pass time increased from 14 ms to 20 ms. Considering the whole system, this increase is not regarded as significant, because the bottlenecking component is the person detection network (59 ms or 198 ms).

5.2. Arm pose network

It was experimented whether a separate arm pose network could improve the overall accuracy. This required training Faster R-CNN to detect both arms and persons. The training resulted in a mAP of 85.9% for persons and 73.4% for arms. The resulting network was then used to crop arm images for arm pose network training, in a similar way as with persons, but now with heavier data augmentation, including rotation, scaling, mirroring and jittering. The arm pose network predicts the locations of three joints: wrist, elbow and shoulder. Distinct arm detections were linked to persons based on shoulder distances and the overlapping ratios of arm and person boxes. If an arm detection could not be linked to any person detection, it was discarded.
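
A hypothetical sketch of such a linking rule, using only the box overlap ratio (the threshold and the omission of the shoulder-distance term are illustrative assumptions, not the thesis implementation), is given below:

    def overlap_ratio(arm, person):
        # Fraction of the arm box covered by the person box;
        # boxes are (x1, y1, x2, y2).
        ix = max(0.0, min(arm[2], person[2]) - max(arm[0], person[0]))
        iy = max(0.0, min(arm[3], person[3]) - max(arm[1], person[1]))
        area = (arm[2] - arm[0]) * (arm[3] - arm[1])
        return ix * iy / area if area > 0 else 0.0

    def link_arm(arm, persons, thr=0.5):
        # Assign the arm to the best-overlapping person box, or
        # return None to discard the detection.
        best = max(persons, key=lambda p: overlap_ratio(arm, p), default=None)
        if best is None or overlap_ratio(arm, best) < thr:
            return None
        return best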

With the arm pose network, the overall PCKh@0.5 accuracy increased from 63.2% to 68.2%. The arm detection rate was 69.0%. The arm pose network improved accuracy in 68.9% of the test images. However, in 25.4% of the images the accuracy was reduced and in 5.7% of the images the arm pose network had no effect. Based on this experiment, it can be hypothesized that by doing the same for the legs and the head, the accuracy improvement would be even better. However, this would increase memory and computational requirements, thus making it harder to maintain real-time operation.


6. DISCUSSION

6.1. Comparison to state-of-the-art

The evaluation results were not compared against related work in public benchmarks [39, 40, 5] for a number of reasons. Firstly, the primary objective of this thesis was to experiment how ConvNets perform if they are trained purely for specific use cases. Because of this, the training and evaluation data needed to be use case specific too. This in turn meant that the results were not directly comparable to public benchmarks, which contain highly varying unconstrained images. Secondly, the joints are not fully equal to those of most benchmarks, thus making comparison unfair. The joints were assigned by Kinect, as it was used to produce the use case specific data. Lastly, evaluation data from several benchmarks were used in pretraining, causing the network to overfit on the data which was supposed to be used in evaluation.

6.2. Necessity of fine-tuning

While the current evaluation mostly investigates the effect of fine-tuning after pretraining, it does not tell about the necessity of fine-tuning itself. What if the network were trained with both the pretraining and the use case specific fine-tuning data at once? Would the results be the same, better or worse than with separate pretraining and fine-tuning steps? Also, what if only fine-tuning data were used to train the network? How would the results change? One can make a plausible hypothesis that training with both data at once would result in better generalization, but at the same time, would the use case specific accuracy suffer deterioration? Also, it can be hypothesized that training with fine-tuning data only would most likely give poorer generalization, but how would it affect the use case specific accuracy? Would the accuracy be better or worse? These are important questions which could be studied in future work.

6.3. Convergence problem

This thesis taught the author a lot about ConvNets, as the previous knowledge was zero. The number of variables in ConvNets is huge and it is all but easy to get them to work as wanted. In the early phase, the biggest problem was to get the network to converge in training. Weight initialization has an impact on convergence, especially with deeper architectures. Every subsequent layer can cause gradual shrinking or growing of the signal if the weights are too small or too large. Xavier initialization [55] addresses this by making sure that the weights are appropriately scaled. Switching from Gaussian to Xavier initialization solved the convergence problems encountered in this thesis. The recently introduced batch normalization [33] tries to address the same problem by unifying the scale and mean between every layer.
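
For reference, the Xavier (Glorot) uniform initialization of [55] can be sketched as follows (illustrative only):

    import numpy as np

    def xavier_init(fan_in, fan_out):
        # Glorot & Bengio [55]: draw weights from a uniform range
        # scaled by fan-in and fan-out so that activation and
        # gradient variances stay roughly constant across layers.
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-limit, limit, size=(fan_in, fan_out))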


6.4. Other ConvNet architectures

The results presented in this thesis were achieved with two ConvNet architectures: (1) VGG [25] for person detection and (2) an AlexNet variant [8] for pose estimation. It would have been interesting to test different architectures, especially the recent ResNets [31] and Inception networks [28], in pose estimation to see how they perform against generic architectures. In person detection, the faster and less accurate ZF Net [24] was not experimented with. The accuracy decrease from VGG to ZF is not considerably large, so it can be hypothesized that the ZF Net accuracy would be enough, especially in more constrained environments. Another interesting object detector, called YOLO [56], was introduced recently. It is considerably faster than Faster R-CNN, but has problems with small objects and multiple objects close to each other. Also, its detection accuracy is not quite as good as that of Faster R-CNN, but it would still be a potential choice for the person detector.

6.5. Personal comments

For the author, writing this thesis was both an interesting and beneficial project. Afterwards it is easy to see that ConvNets are truly a significant technology for computer vision tasks. ConvNets have replaced traditional methods in areas such as object detection, object recognition, pose estimation and segmentation. And this has happened relatively fast in recent years. Research interest is rising and most of the big companies (Google, Microsoft, NVidia, Apple, Amazon, Facebook, car companies, etc.) invest in it. The potential is huge. For the author, it will be interesting to see where the research leads. The applications are numerous: self-driving cars, robots, driver aiding systems in vehicles, smart homes, human computer interaction, gesture control, surveillance, person recognition, action recognition, games, etc.


7. CONCLUSION

A real-time ConvNet based system for human pose estimation was introduced in this thesis. It achieved an accuracy of 96.8% (PCK@0.2) by fine-tuning the network for a specific use case. The method can be thought of as a replacement for Kinect, and it can be used in various tasks, like gesture control, gaming, person tracking and action recognition. The method supports heterogeneous training data, where the set of joints is not the same in all the training samples, thus enabling the utilization of different datasets in training. The use of a separate person detector brings the method towards practice, where the person locations in the input images are not expected to be known. In addition, an automatic and easy way to create large amounts of annotated training data by using Kinect was demonstrated. The network forward time of the presented method is 20 ms. In addition, the time per image of the person detector is about 60 or 200 ms.

As for future work, there are several things that could be considered in order to achieve better accuracy. One option would be to use the current network as a coarse estimator and use another network or networks to refine the pose estimation. In addition, as the presented method is targeted for video inputs, the utilization of spatiotemporal data would most likely give an accuracy boost. The network forward time of the person detector is relatively slow compared to the pose estimation network (60/200 ms vs. 20 ms). While the person detector works well with diverse input data, with most pose estimation use cases that is perhaps not necessary. By using a more restricted and possibly faster person detector, a good enough performance could most likely be achieved in more constrained environments. Also, with ConvNets it generally holds that the more data is used in training, the better the performance. Hence, the use of more advanced data augmentation methods, such as [18], especially in the fine-tuning, would most probably lead to better accuracy. Advanced data augmentation could, for example, change the colors of the clothes, adjust limb poses and change backgrounds.


8. REFERENCES

[1] Zatsiorsky V. & Prilutsky B. (2012) Biomechanics of skeletal muscles. Human Kinetics.

[2] Felzenszwalb P., McAllester D. & Ramanan D. (2008) A discriminatively trained, multiscale, deformable part model. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8.

[3] Andriluka M., Roth S. & Schiele B. (2009) Pictorial structures revisited: People detection and articulated pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1014–1021.

[4] Yang Y. & Ramanan D. (2011) Articulated pose estimation with flexible mixtures-of-parts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1385–1392.

[5] Sapp B. & Taskar B. (2013) MODEC: Multimodal decomposable models for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3674–3681.

[6] Jain A., Tompson J., Andriluka M., Taylor G.W. & Bregler C. (2014) Learning human pose estimation features with convolutional networks. In: International Conference on Learning Representations (ICLR). URL: https://arxiv.org/abs/1312.7302.

[7] Toshev A. & Szegedy C. (2014) DeepPose: Human pose estimation via deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1653–1660.

[8] Pfister T., Simonyan K., Charles J. & Zisserman A. (2014) Deep convolutional neural networks for efficient pose estimation in gesture videos. In: Asian Conference on Computer Vision (ACCV).

[9] Jain A., Tompson J., LeCun Y. & Bregler C. (2014) MoDeep: A deep learning framework using motion features for human pose estimation. In: Asian Conference on Computer Vision (ACCV), Springer, pp. 302–315.

[10] Pishchulin L., Insafutdinov E., Tang S., Andres B., Andriluka M., Gehler P. & Schiele B. (2016) DeepCut: Joint subset partition and labeling for multi person pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: http://arxiv.org/abs/1511.06645.

[11] Pfister T., Charles J. & Zisserman A. (2015) Flowing ConvNets for human pose estimation in videos. In: International Conference on Computer Vision (ICCV). URL: https://arxiv.org/abs/1506.02897.

[12] Tompson J., Goroshin R., Jain A., LeCun Y. & Bregler C. (2015) Efficient object localization using convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 648–656.

[13] Carreira J., Agrawal P., Fragkiadaki K. & Malik J. (2016) Human pose estimation with iterative error feedback. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1507.06550.

[14] Charles J., Pfister T., Magee D., Hogg D. & Zisserman A. (2016) Personalizing human video pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1511.06676.

[15] Lifshitz I., Fetaya E. & Ullman S. (2016) Human pose estimation using deep consensus voting. In: European Conference on Computer Vision (ECCV). URL: https://arxiv.org/abs/1603.08212.

[16] Wei S.E., Ramakrishna V., Kanade T. & Sheikh Y. (2016) Convolutional pose machines. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1602.00134.

[17] Newell A., Yang K. & Deng J. (2016) Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision (ECCV). URL: https://arxiv.org/abs/1603.06937.

[18] Pishchulin L., Jain A., Andriluka M., Thormählen T. & Schiele B. (2012) Articulated people detection and pose estimation: Reshaping the future. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3178–3185.

[19] Insafutdinov E., Pishchulin L., Andres B., Andriluka M. & Schiele B. (2016) DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In: European Conference on Computer Vision (ECCV). URL: https://arxiv.org/abs/1605.03170.

[20] Shotton J., Sharp T., Kipman A., Fitzgibbon A., Finocchio M., Blake A., Cook M. & Moore R. (2013) Real-time human pose recognition in parts from single depth images. Communications of the ACM 56, pp. 116–124.

[21] Rumelhart D.E., Hinton G.E. & Williams R.J. (1985) Learning internal representations by error propagation. Tech. rep., DTIC Document.

[22] LeCun Y., Bottou L., Bengio Y. & Haffner P. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, pp. 2278–2324.

[23] Krizhevsky A., Sutskever I. & Hinton G.E. (2012) ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105.

[24] Zeiler M.D. & Fergus R. (2014) Visualizing and understanding convolutional networks. In: European Conference on Computer Vision (ECCV), pp. 818–833. URL: https://arxiv.org/abs/1311.2901.

[25] Simonyan K. & Zisserman A. (2015) Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR). URL: https://arxiv.org/abs/1409.1556.

[26] Girshick R. (2015) Fast R-CNN. In: IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448.

[27] Ren S., He K., Girshick R. & Sun J. (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 91–99. URL: https://arxiv.org/abs/1506.01497.

[28] Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D., Erhan D., Vanhoucke V. & Rabinovich A. (2015) Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1409.4842.

[29] Szegedy C., Vanhoucke V., Ioffe S., Shlens J. & Wojna Z. (2016) Rethinking the Inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1512.00567.

[30] Szegedy C., Ioffe S. & Vanhoucke V. (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.

[31] He K., Zhang X., Ren S. & Sun J. (2016) Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1512.03385.

[32] He K., Zhang X., Ren S. & Sun J. (2016) Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027.

[33] Ioffe S. & Szegedy C. (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML). URL: https://arxiv.org/abs/1502.03167.

[34] Nair V. & Hinton G.E. (2010) Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (ICML), pp. 807–814.

[35] Glorot X., Bordes A. & Bengio Y. (2011) Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 315–323.

[36] Zagoruyko S. & Komodakis N. (2016) Wide residual networks. In: British Machine Vision Conference (BMVC).

[37] Tompson J.J., Jain A., LeCun Y. & Bregler C. (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in Neural Information Processing Systems (NIPS), pp. 1799–1807.

[38] Gkioxari G., Toshev A. & Jaitly N. (2016) Chained predictions using convolutional neural networks. In: European Conference on Computer Vision (ECCV). URL: https://arxiv.org/abs/1605.02346.

[39] Andriluka M., Pishchulin L., Gehler P. & Schiele B. (2014) 2D human pose estimation: New benchmark and state of the art analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Johnson S. & Everingham M. (2010) Clustered pose and nonlinear appearance models for human pose estimation. In: British Machine Vision Conference (BMVC). DOI: 10.5244/C.24.12.

[41] Noh H., Hong S. & Han B. (2015) Learning deconvolution network for semantic segmentation. In: IEEE International Conference on Computer Vision (ICCV), pp. 1520–1528.

[42] Zhao J., Mathieu M., Goroshin R. & LeCun Y. (2015) Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351.

[43] Rematas K., Ritschel T., Fritz M., Gavves E. & Tuytelaars T. (2016) Deep reflectance maps. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1511.04384.

[44] Fan X., Zheng K., Lin Y. & Wang S. (2015) Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1347–1355.

[45] Chen X. & Yuille A.L. (2014) Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in Neural Information Processing Systems (NIPS), pp. 1736–1744.

[46] Jia Y., Shelhamer E., Donahue J., Karayev S., Long J., Girshick R., Guadarrama S. & Darrell T. (2014) Caffe: Convolutional architecture for fast feature embedding. In: 22nd ACM International Conference on Multimedia, pp. 675–678. URL: https://arxiv.org/abs/1408.5093.

[47] Dantone M., Gall J., Leistner C. & van Gool L. (2013) Human pose estimation using body parts dependent joint regressors. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3041–3048.

[48] Charles J., Pfister T., Everingham M. & Zisserman A. (2013) Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision (IJCV).

[49] Buehler P., Everingham M., Huttenlocher D.P. & Zisserman A. (2011) Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision (IJCV) 95, pp. 180–197.

[50] LeCun Y., Bengio Y. & Hinton G. (2015) Deep learning. Nature 521, pp. 436–444.

[51] Springenberg J.T., Dosovitskiy A., Brox T. & Riedmiller M. (2015) Striving for simplicity: The all convolutional net. In: International Conference on Learning Representations (ICLR). URL: https://arxiv.org/abs/1412.6806.

[52] Sermanet P., Eigen D., Zhang X., Mathieu M., Fergus R. & LeCun Y. (2014) OverFeat: Integrated recognition, localization and detection using convolutional networks. In: International Conference on Learning Representations (ICLR). URL: https://arxiv.org/abs/1312.6229.

[53] Srivastava N., Hinton G.E., Krizhevsky A., Sutskever I. & Salakhutdinov R. (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR) 15, pp. 1929–1958.

[54] Bottou L. (2012) Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, Springer, pp. 421–436.

[55] Glorot X. & Bengio Y. (2010) Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS).

[56] Redmon J., Divvala S., Girshick R. & Farhadi A. (2016) You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL: https://arxiv.org/abs/1506.02640.