[email protected], mathieu.salzmann@epfl ... · can play guitar in a bedroom, a...

13
Encouraging LSTMs to Anticipate Actions Very Early Mohammad Sadegh Aliakbarian 1,3 , Fatemeh Sadat Saleh 1,3 , Mathieu Salzmann 2 , Basura Fernando 1 , Lars Petersson 1,3 , Lars Andersson 3 1 Australian National University, 2 CVLab, EPFL, Switzerland, 3 Smart Vision Systems, CSIRO [email protected], [email protected], [email protected] Abstract In contrast to the widely studied problem of recogniz- ing an action given a complete sequence, action anticipa- tion aims to identify the action from only partially available videos. As such, it is therefore key to the success of com- puter vision applications requiring to react as early as pos- sible, such as autonomous navigation. In this paper, we pro- pose a new action anticipation method that achieves high prediction accuracy even in the presence of a very small percentage of a video sequence. To this end, we develop a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduce a novel loss func- tion that encourages the model to predict the correct class as early as possible. Our experiments on standard bench- mark datasets evidence the benefits of our approach; We outperform the state-of-the-art action anticipation methods for early prediction by a relative increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on UCF-101. 1. Introduction Understanding actions from videos is key to the success of many real-world applications, such as autonomous nav- igation and sports analysis. While great progress has been made to recognize actions from complete sequences [18, 42, 7, 9, 32, 45, 33, 10, 2] in the past decade, action anticipa- tion [27, 28, 26, 49, 34, 35, 23], which aims to predict the observed action as early as possible, has become a popu- lar research problem only recently. Anticipation is crucial in scenarios where one needs to react before the action is finalized, such as to avoid hitting pedestrians with an au- tonomous car, or to forecast dangerous situations in surveil- lance scenarios. The key difference between recognition and anticipation lies in the fact that the methods tackling the latter should predict the correct class as early as possible, given only a few frames from the beginning of the video sequence. To address this, several approaches have introduced new train- Figure 1. Overview of our approach. Given a small portion of sequential data, our approach is able to predict the action cate- gory with very high performance. For instance, in UCF-101, our approach anticipates actions with more than 80% accuracy given only the first 1% of the video. To achieve this, we design a a model that leverages action- and context-aware features together with a new loss function that encourages the model to make cor- rect predictions as early as possible. ing losses encouraging the score [34] or the rank [23] of the correct action to increase with time, or penalizing increas- ingly strongly the classification mistakes [13]. In practice, however, the effectiveness of these losses remains limited for very early prediction, such as from 1% of the sequence. In this paper, we introduce a novel loss that encourages making correct predictions very early. Specifically, our loss models the intuition that some actions, such as running and high jump, are highly ambiguous after seeing only the first few frames, and false positives should therefore not be pe- nalized too strongly in the early stages. 
By contrast, we would like to predict a high probability for the correct class as early as possible, and thus penalize false negatives from the beginning of the sequence. Our experiments demon- strate that, for a given model, our new loss yields signif- icantly higher accuracy than existing ones on the task of early prediction. In particular, in this paper, we also contribute a novel multi-stage Long Short Term Memory (LSTM) architec- ture for action anticipation. This model effectively extracts and jointly exploits context- and action-aware features (see 1 arXiv:1703.07023v3 [cs.CV] 14 Aug 2017

Upload: others

Post on 25-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: firstname.lastname@data61.csiro.au, mathieu.salzmann@epfl ... · can play guitar in a bedroom, a concert hall or a yard. To overcome this, some methods localize the feature extrac-tion

Encouraging LSTMs to Anticipate Actions Very Early

Mohammad Sadegh Aliakbarian1,3, Fatemeh Sadat Saleh1,3, Mathieu Salzmann2, Basura Fernando1, Lars Petersson1,3, Lars Andersson3

1Australian National University, 2CVLab, EPFL, Switzerland, 3Smart Vision Systems, CSIRO
[email protected], [email protected], [email protected]

Abstract

In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos. It is therefore key to the success of computer vision applications that must react as early as possible, such as autonomous navigation. In this paper, we propose a new action anticipation method that achieves high prediction accuracy even when only a very small percentage of a video sequence has been observed. To this end, we develop a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduce a novel loss function that encourages the model to predict the correct class as early as possible. Our experiments on standard benchmark datasets evidence the benefits of our approach; we outperform the state-of-the-art action anticipation methods for early prediction by a relative increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on UCF-101.

1. Introduction

Understanding actions from videos is key to the success of many real-world applications, such as autonomous navigation and sports analysis. While great progress has been made on recognizing actions from complete sequences [18, 42, 7, 9, 32, 45, 33, 10, 2] in the past decade, action anticipation [27, 28, 26, 49, 34, 35, 23], which aims to predict the observed action as early as possible, has become a popular research problem only recently. Anticipation is crucial in scenarios where one needs to react before the action is completed, such as to avoid hitting pedestrians with an autonomous car, or to forecast dangerous situations in surveillance scenarios.

The key difference between recognition and anticipation lies in the fact that methods tackling the latter should predict the correct class as early as possible, given only a few frames from the beginning of the video sequence.

Figure 1. Overview of our approach. Given a small portion of sequential data, our approach is able to predict the action category with very high performance. For instance, on UCF-101, our approach anticipates actions with more than 80% accuracy given only the first 1% of the video. To achieve this, we design a model that leverages action- and context-aware features together with a new loss function that encourages the model to make correct predictions as early as possible.

To address this, several approaches have introduced new training losses encouraging the score [34] or the rank [23] of the correct action to increase with time, or penalizing classification mistakes increasingly strongly [13]. In practice, however, the effectiveness of these losses remains limited for very early prediction, such as from only 1% of the sequence.

In this paper, we introduce a novel loss that encourages making correct predictions very early. Specifically, our loss models the intuition that some actions, such as running and high jump, are highly ambiguous after seeing only the first few frames, and false positives should therefore not be penalized too strongly in the early stages. By contrast, we would like to predict a high probability for the correct class as early as possible, and thus penalize false negatives from the beginning of the sequence. Our experiments demonstrate that, for a given model, our new loss yields significantly higher accuracy than existing ones on the task of early prediction.

As a second contribution, we introduce a novel multi-stage Long Short-Term Memory (LSTM) architecture for action anticipation. This model effectively extracts and jointly exploits context- and action-aware features (see Fig. 1).



This is in contrast to existing methods that typically extract either global representations for the entire image [6, 45, 46, 7] or video sequence [38, 16], thus not focusing on the action itself, or localize the feature extraction process to the action via dense trajectories [43, 42, 9], optical flow [8, 45, 20] or actionness [44, 3, 14, 48, 39, 24], thus failing to exploit contextual information. To the best of our knowledge, only two-stream networks [30, 8, 4, 22] have attempted to jointly leverage both types of information, by making use of RGB frames in conjunction with optical flow to localize the action. Exploiting optical flow, however, does not allow these methods to explicitly leverage appearance in the localization process. Furthermore, computing optical flow is typically expensive, which significantly increases the runtime of these methods. By not relying on optical flow, our method is significantly more efficient: on a single GPU, our model analyzes a short video (e.g., 50 frames) 14 times faster than [30] and [8].

Our model is depicted in Fig. 2. In a first stage, it focuses on the global, context-aware information by extracting features from the entire RGB image. The second stage then combines these context-aware features with action-aware ones obtained by exploiting class-specific activations, which typically correspond to regions where the action occurs. In short, our model first extracts the contextual information, and then merges it with the localized one.

As evidenced by our experiments, our approach significantly outperforms the state-of-the-art action anticipation methods on all the standard benchmark datasets that we evaluated on, including UCF-101 [36], UT-Interaction [26], and JHMDB-21 [15]. In the supplementary material, we further show that our combination of context- and action-aware features is also beneficial for the more traditional task of action recognition. Moreover, we evaluate the effect of optical flow features for both action recognition and anticipation.

2. Related Work

The focus of this paper is twofold: action anticipation, with a new loss that encourages correct prediction as early as possible, and action modeling, with a model that combines context- and action-aware information using multi-stage LSTMs. Below, we discuss the most relevant approaches for these two aspects.

2.1. Action Anticipation

The idea of action anticipation was introduced in [28], which models causal relationships to predict human activities. This was followed by several attempts to model the dynamics of the observed actions, such as by introducing integral and dynamic bag-of-words [27], using spatial-temporal implicit shape models [49], extracting human body movements via skeleton information [55], and accounting for the complete and partial history of observed features [17].

More recently, [35] proposed to make use of binary SVMs to classify video snippets into sub-action categories and obtain the final class label in an online manner using dynamic programming. To overcome the need to train one classifier per sub-action, [34] extended this approach to using a structural SVM. Importantly, this work further introduced a new objective function to encourage the score of the correct action to increase as time progresses.

While the above-mentioned works made use of handcrafted features, recent advances have naturally led to the development of deep learning approaches to action anticipation. In this context, [23] proposed to combine a Convolutional Neural Network (CNN) with an LSTM to model both spatial and temporal information. The authors further introduced new ranking losses whose goal is to enforce either the score of the correct class, or the margin between that score and the best score of the other classes, to be non-decreasing over time. Similarly, in [13], a new loss that penalizes classification mistakes increasingly strongly over time was introduced in an LSTM-based framework that uses multiple modalities. While the two above-mentioned methods indeed aim at improving classification accuracy over time, they do not explicitly encourage making correct predictions as early as possible. By contrast, while accounting for ambiguities in the early stages, our new loss still aims to prevent false negatives from the beginning of the sequence.

Instead of predicting the future class label, the authors of [40] proposed to predict the future visual representation. However, the main motivation for this was to work with unlabeled videos, and the learned representation is therefore not always related to the action itself.

2.2. Action Modeling

Most recent action recognition approaches extract global representations for the entire image [6, 46, 7] or video sequence [38, 16]. As such, these methods do not truly focus on the actions of interest, but rather compute a context-aware representation. Unfortunately, context does not always provide reliable information about the action. For example, one can play guitar in a bedroom, a concert hall or a yard. To overcome this, some methods localize the feature extraction process by exploiting dense trajectories [43, 42, 9] or optical flow [20]. Inspired by objectness, the notion of actionness [44, 3, 14, 48, 39, 24] has recently also been proposed to localize the regions where a generic action occurs. The resulting methods can then be thought of as extracting action-aware representations. In other words, these methods go to the other extreme and completely discard the notion of context, which can be useful for some actions, such as playing football on a grass field.

There is nevertheless a third class of methods that aims to leverage these two types of information [30, 8, 4, 22, 45].


Figure 2. Overview of our approach. We propose to extract context-aware features, encoding global information about the scene, and to combine them with action-aware ones, which focus on the action itself. To this end, we introduce a multi-stage LSTM architecture that leverages the two types of features to predict the action or forecast it. Note that, for the sake of visualization, the color maps were obtained from 3D tensors (512 × W × H) via an average pooling operation over the 512 channels.

By combining RGB frames and optical flow in two-stream architectures, these methods truly exploit context and motion, from which the action can be localized by learning to distinguish the relevant motion. This localization, however, does not directly exploit appearance. Here, inspired by the success of these methods, we develop a novel multi-stage network that also leverages context- and action-aware information. However, we introduce a new action-aware representation that exploits the RGB data to localize the action. As such, our approach effectively leverages appearance for action-aware modeling and, by avoiding the expensive optical flow computation, is much more efficient than the above-mentioned two-stream models. In particular, our model is about 14 times faster than the state-of-the-art two-stream network of [8] and has fewer parameters. The reduction in the number of parameters is due to the fact that [8] and [30] rely on two VGG-like networks (one for each stream) with a few additional layers (including 3D convolutions for [8]). By contrast, our model has, in essence, a single VGG-like architecture, with some additional LSTM layers, which only have few parameters. Moreover, our work constitutes the first attempt at explicitly leveraging context- and action-aware information for action anticipation. Finally, we introduce a novel multi-stage LSTM fusion strategy to integrate action- and context-aware features.

3. Our Approach

Our goal is to predict the class of an action as early as possible, that is, after having seen only a very small portion of the video sequence. To this end, we first introduce a new loss function that encourages making correct predictions very early. We then develop a multi-stage LSTM model that makes use of this loss function while leveraging both context- and action-aware information.

3.1. A New Loss for Action Anticipation

As argued above, a loss for action anticipation should encourage having a high score for the correct class as early as possible. However, it should also account for the fact that, early in the sequence, there can often be ambiguities between several actions, such as running and high jump. Below, we introduce a new anticipation loss that follows these two intuitions.

Specifically, let y_t(k) encode the true activity label at time t, i.e., y_t(k) = 1 if the sample belongs to class k and 0 otherwise, and let \hat{y}_t(k) denote the corresponding label predicted by a given model. We define our new loss as

L(y, \hat{y}) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{t=1}^{T} \left[ y_t(k) \log \hat{y}_t(k) + \frac{t}{T} \big(1 - y_t(k)\big) \log\big(1 - \hat{y}_t(k)\big) \right],   (1)

where N is the number of action classes and T the length (number of frames) of the input sequence.

This loss function consists of two terms. The first one penalizes false negatives with the same strength at any point in time. By contrast, the second one focuses on false positives, and its strength increases linearly over time, eventually reaching the same weight as that on the false negatives. Therefore, the relative weight of the first term compared to the second one is larger at the beginning of the sequence. Altogether, this encourages predicting a high score for the correct class as early as possible, i.e., preventing false negatives, while accounting for the potential ambiguities at the beginning of the sequence, which give rise to false positives. As we see more frames, however, the ambiguities are removed, and these false positives are encouraged to disappear.
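To make the behavior of Eq. (1) concrete, the following NumPy sketch computes the loss for a single sequence; the array shapes, the clipping constant and the function name are our own choices for illustration, not part of the method description.

import numpy as np

def anticipation_loss(y_true, y_pred, eps=1e-7):
    # y_true: one-hot ground truth of shape (T, N); y_pred: per-frame class
    # probabilities of shape (T, N). Implements Eq. (1) for one sequence.
    T, N = y_true.shape
    y_pred = np.clip(y_pred, eps, 1.0 - eps)           # avoid log(0)
    t_over_T = (np.arange(1, T + 1) / T)[:, None]      # linearly growing weight t/T
    fn_term = y_true * np.log(y_pred)                  # false negatives, constant weight
    fp_term = t_over_T * (1.0 - y_true) * np.log(1.0 - y_pred)  # false positives, growing weight
    return -(fn_term + fp_term).sum() / N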


Figure 3. Our feature extraction network. Our CNN model for feature extraction is based on the VGG-16 architecture with some modifications. Up to conv5-2, the network is the same as VGG-16. The output of this layer is connected to two sub-models. The first one extracts context-aware features by providing a global image representation. The second one relies on another network to extract action-aware features.

Our new loss therefore satisfies the desirable properties of an action anticipation loss. In the next section, we introduce a novel multi-stage architecture that makes use of this loss.

3.2. Multi-stage LSTM Architecture

To tackle action anticipation, we develop the novel multi-stage recurrent architecture based on LSTMs depicted in Fig. 2. This architecture consists of a stage-wise combination of context- and action-aware information. Below, we first discuss our approach to extracting these two types of information, and then present our complete multi-stage recurrent network.

3.2.1 Context- and Action-aware Modeling

To model the context- and action-aware information, we introduce the two-stream architecture shown in Fig. 3. The first part of this network is shared by both streams and, up to conv5-2, corresponds to the VGG-16 network [31], pre-trained on ImageNet for object recognition. The output of this layer is connected to two sub-models: one for context-aware features and the other for action-aware ones. We then train these two sub-models for the same task of action recognition from a single image, using a cross-entropy loss function defined on the output of each stream. In practice, we found that training the entire model in an end-to-end manner did not yield a significant improvement over training only the two sub-models. In our experiments, we therefore opted for this latter strategy, which is less expensive computationally and memory-wise. Below, we first discuss the context-aware sub-network and then turn to the action-aware one.

Context-Aware Feature Extraction. This sub-model is similar to VGG-16 from conv5-3 up to the last fully-connected layer, with the number of units in the last fully-connected layer changed from 1000 (the original 1000-way ImageNet classification model) to the number of activities N.

In essence, this sub-model focuses on extracting a deep representation of the whole scene for each activity and thus incorporates context. We then take the output of its fc7 layer as our context-aware features.
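As an illustration, a context-aware sub-model of this kind could be assembled as follows with Keras; the keras.applications VGG16 layer name 'block5_conv2', the layer sizes and all other names are assumptions on our side, since the text only specifies that the architecture follows VGG-16 with an N-way output.

from keras.applications import VGG16
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model

N_CLASSES = 101  # e.g., UCF-101

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
shared = base.get_layer('block5_conv2').output            # trunk shared with the action-aware stream

x = Conv2D(512, (3, 3), activation='relu', padding='same', name='ctx_conv5_3')(shared)
x = MaxPooling2D((2, 2), strides=(2, 2), name='ctx_pool5')(x)
x = Flatten()(x)
x = Dense(4096, activation='relu', name='ctx_fc6')(x)
ctx_fc7 = Dense(4096, activation='relu', name='ctx_fc7')(x)    # context-aware features
ctx_pred = Dense(N_CLASSES, activation='softmax', name='ctx_pred')(ctx_fc7)

context_model = Model(inputs=base.input, outputs=ctx_pred)
# After training with a cross-entropy loss on single frames, the ctx_fc7
# activations are used as the per-frame context-aware features.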

Action-Aware Feature Extraction. As mentioned before, the context of an action does not always correlate with the action itself. Our second sub-model therefore aims at extracting features that focus on the action itself. To this end, we draw inspiration from the object classification work of [56]. At the core of this work lies the idea of Class Activation Maps (CAMs). In our context, a CAM indicates the regions in the input image that contribute most to predicting each class label. In other words, it provides information about the location of an action. Importantly, this is achieved without requiring any additional annotations.

More specifically, CAMs are extracted from the activations of the last convolutional layer in the following manner. Let f_l(x, y) represent the activation of unit l in the last convolutional layer at spatial location (x, y). A score S_k for each class k can be obtained by performing global average pooling [21], which yields, for each unit l, a feature F_l = \sum_{x,y} f_l(x, y), followed by a linear layer with weights {w_l^k}. That is, S_k = \sum_l w_l^k F_l. A CAM for class k at location (x, y) can then be computed as

M_k(x, y) = \sum_l w_l^k f_l(x, y),   (2)

which indicates the importance of the activations at location (x, y) in the final score for class k.
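For clarity, a direct NumPy transcription of these quantities is given below; here f is the (H, W, L) block of last-layer activations and w the (L, N) weight matrix of the linear layer, with variable names chosen by us.

import numpy as np

def class_activation_maps(f, w):
    # f: last-layer activations of shape (H, W, L); w: linear-layer weights
    # of shape (L, N), one column per class.
    F = f.sum(axis=(0, 1))                    # F_l = sum_{x,y} f_l(x, y), shape (L,)
    S = F @ w                                 # class scores S_k = sum_l w_l^k F_l, shape (N,)
    M = np.tensordot(f, w, axes=([2], [0]))   # CAMs M_k(x, y) = sum_l w_l^k f_l(x, y), shape (H, W, N)
    return M, S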

Here, we propose to make use of the CAMs to extract action-aware features. To this end, we use the CAMs in conjunction with the output of the conv5-3 layer of the model. The intuition behind this is that conv5-3 extracts high-level features that provide a very rich representation of the image [53] and typically correspond to the most discriminative parts of the object [1, 29], or, in our case, of the action. Therefore, we incorporate a new layer into our sub-model, whose output can be expressed as

A_k(x, y) = \text{conv5-3}(x, y) \times \mathrm{ReLU}(M_k(x, y)),   (3)

where \mathrm{ReLU}(M_k(x, y)) = \max(0, M_k(x, y)). As shown in Fig. 4, this new layer is then followed by fully-connected ones, and we take our action-aware features as the output of the corresponding fc7 layer.
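Continuing the previous sketch, the action-aware map of Eq. (3) can be obtained by gating the conv5-3 activations with the rectified CAM; again, the shapes and names are our assumptions.

import numpy as np

def action_aware_map(conv5_3, M, k):
    # conv5_3: activations of shape (H, W, 512); M: CAMs of shape (H, W, N).
    cam_k = np.maximum(0.0, M[:, :, k])       # ReLU(M_k(x, y))
    return conv5_3 * cam_k[:, :, None]        # A_k(x, y) = conv5-3(x, y) x ReLU(M_k(x, y))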

3.2.2 Sequence Learning for Action Anticipation

To effectively combine the information contained in the context-aware and action-aware features described above, we design the novel multi-stage LSTM model depicted in Fig. 2. This model first focuses on the context-aware features, which encode global information about the entire image. It then combines the output of this first stage with our action-aware features to provide a refined class prediction.


Figure 4. Action-aware feature extraction. Given the fine-tuned feature extraction network, we introduce a new layer that alters the output of conv5-3. This lets us filter out the conv5-3 features that are irrelevant, to focus on the action itself. Our action-aware features are taken as the output of the last fully-connected layer.

To train this model for action anticipation, we make use of our new loss introduced in Section 3.1. Therefore, ultimately, our network models long-range temporal information and yields increasingly accurate predictions as it processes more frames.

Specifically, we write the overall loss of our model as

L_o = \frac{1}{V} \sum_{i=1}^{V} \big( L_{c,i} + L_{a,i} \big),   (4)

where V is the total number of training sequences. This loss function combines losses corresponding to the context-aware stage and to the action-aware stage, respectively. Below, we discuss these two stages in more detail.

Learning Context. The first stage of our model takes as input our context-aware features, and passes them through a layer of LSTM cells followed by a fully-connected layer that, via a softmax operation, outputs a probability for each action class. Let \hat{y}_{c,i} be the vector of probabilities for all classes and all time steps predicted by the first stage for sample i. We then define the loss for a single sample as

L_{c,i} = L(y_i, \hat{y}_{c,i}),   (5)

where L(·) is our new loss defined in Eq. 1, and y_i is the ground-truth class label for sample i.

Learning Context and Action. The second stage of our model aims at combining context-aware and action-aware information. Its structure is the same as that of the first stage, i.e., a layer of LSTM cells followed by a fully-connected layer that outputs class probabilities via a softmax operation. However, its input merges the output of the first stage with our action-aware features. This is achieved by concatenating the hidden activations of the first-stage LSTM layer with our action-aware features. We then make use of the same loss function as before, but defined on the final prediction. For sample i, this can be expressed as

L_{a,i} = L(y_i, \hat{y}_{a,i}),   (6)

where \hat{y}_{a,i} is the vector of probabilities for all classes predicted by the second stage.
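A minimal Keras sketch of this two-stage design is given below, assuming pre-extracted per-frame context- and action-aware features; the feature dimensionality, layer names and exact wiring are our assumptions, and both outputs would be trained with the loss of Eq. (1).

from keras.layers import Input, LSTM, Dense, TimeDistributed, concatenate
from keras.models import Model

T, D, N_CLASSES = 50, 4096, 101   # frames, feature dimension, classes (assumed values)

ctx_in = Input(shape=(T, D), name='context_features')
act_in = Input(shape=(T, D), name='action_features')

# Stage 1: LSTM over the context-aware features, with a per-frame softmax.
h1 = LSTM(2048, return_sequences=True)(ctx_in)
stage1_pred = TimeDistributed(Dense(N_CLASSES, activation='softmax'), name='stage1')(h1)

# Stage 2: the hidden activations of stage 1, concatenated with the
# action-aware features, are fed to a second LSTM and softmax.
h2 = LSTM(2048, return_sequences=True)(concatenate([h1, act_in]))
stage2_pred = TimeDistributed(Dense(N_CLASSES, activation='softmax'), name='stage2')(h2)

ms_lstm = Model(inputs=[ctx_in, act_in], outputs=[stage1_pred, stage2_pred])
# Stage 1 is supervised with Eq. (5) and stage 2 with Eq. (6).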

Inference. At inference time, the input RGB frames are forward-propagated through our model, which yields a probability vector over the classes at each frame. While one could simply take the probabilities of the current frame t and obtain the class label at time t via an argmax, we propose to increase robustness by leveraging the predictions of all the frames up to time t. To this end, we make use of an average pooling of these predictions over time.
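In code, this inference rule amounts to a running average of the per-frame probabilities; the function name below is ours.

import numpy as np

def label_at_time_t(probs_up_to_t):
    # probs_up_to_t: per-frame class probabilities of shape (t, N) predicted
    # so far; the label is the argmax of their average over time.
    return int(np.argmax(probs_up_to_t.mean(axis=0)))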

4. Experiments

In this section, we first compare our method with state-of-the-art techniques on the task of action anticipation, and then analyze various aspects of our model, such as the influence of the loss function and of the different feature types. In the supplementary material, we provide additional experiments analyzing the effectiveness of different LSTM architectures, and the influence of the number of hidden units and of our temporal average pooling strategy. We also report the performance of our method on the task of action recognition from complete videos with and without optical flow, and on action anticipation with optical flow.

4.1. Datasets

For our experiments, we made use of the standard UCF-101 [36], UT-Interaction [26], and JHMDB-21 [15] benchmarks, which we briefly describe below.

The UCF-101 dataset consists of 13,320 videos (each containing a single action) of 101 action classes, covering a broad set of activities such as sports, playing musical instruments and human-object interaction, with an average length of 7.2 seconds. UCF-101 is one of the most challenging datasets due to its large diversity of actions and the presence of large variations in camera motion, cluttered background and illumination conditions. There are three standard training/test splits for this dataset. In our comparisons to the state-of-the-art for both action anticipation and recognition, we report the average accuracy over the three splits. For the detailed analysis of our model, however, we rely on the first split only.

The JHMDB-21 dataset is another challenging dataset of realistic videos from various sources, such as movies and web videos, containing 928 videos and 21 action classes. Similarly to UCF-101, in our comparison to the state-of-the-art, we report the average accuracy over the three standard splits of the data. As in UCF-101, each video contains a single action starting from the beginning of the video.

The UT-Interaction dataset contains videos of continuous executions of 6 human-human interaction classes: shake-hands, point, hug, push, kick and punch. The dataset contains 20 video sequences whose length is about 1 minute each. Each video contains at least one execution of each interaction type, providing us with 8 executions of human activities per video on average.


Following the recommended experimental setup, we used 10-fold leave-one-out cross-validation for each of the two standard sets of 10 videos. That is, within each set, we leave one sequence out for testing and use the remaining 9 for training. Following standard practice, we also made use of the annotations provided with the dataset to split each video into sequences containing individual actions.

4.2. Implementation Details

CNN and LSTM Configuration. The parameters of the CNN were optimized using stochastic gradient descent with a fixed learning rate of 0.001, a momentum of 0.9, a weight decay of 0.0005, and mini-batches of size 32. To train our LSTMs, we similarly used stochastic gradient descent with a fixed learning rate of 0.001, a momentum of 0.9, and a mini-batch size of 32. For all LSTMs, we used 2048 hidden units. To implement our method, we used Python and Keras [5]. We will make our code publicly available.
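In Keras, these settings correspond roughly to the sketch below; note that we express the 0.0005 weight decay as an L2 kernel regularizer, which is our interpretation of how it maps onto the Keras API rather than a statement about the authors' code.

from keras.optimizers import SGD
from keras.regularizers import l2

sgd = SGD(lr=0.001, momentum=0.9)   # fixed learning rate and momentum, for both the CNN and the LSTMs
weight_reg = l2(0.0005)             # weight decay, applied as a kernel regularizer on the CNN layers
# model.compile(optimizer=sgd, loss=anticipation_loss)
# model.fit(x_train, y_train, batch_size=32)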

Training Procedure. To fine-tune the network on each dataset, we augment the data so as to reduce the effect of over-fitting. The input images were randomly flipped horizontally and rotated by a random amount in the range -8 to 8 degrees. We then extracted crops according to the following procedure: (1) compute the maximum cropping rectangle with a given aspect ratio (320/240) that fits inside the input image; (2) scale the width and height of the cropping rectangle by a factor randomly selected in the range 0.8-1; (3) select a random location for the cropping rectangle within the original input image and extract the corresponding subimage; (4) scale the subimage to 224 × 224.

After these geometric transformations, we further applied RGB channel shifting [47], followed by randomly adjusting the image brightness, contrast and saturation with a factor α = 0.3. The operations are: for brightness, α × Image; for contrast, Image × α + (1.0 − α) × mean(grey(Image)); and for saturation, Image × α + (1.0 − α) × grey(Image).
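The photometric operations above can be written directly as NumPy functions on a float RGB image in [0, 255]; the grayscale conversion used here (per-pixel channel mean) is a simplifying assumption on our part.

import numpy as np

def grey(img):
    return img.mean(axis=2, keepdims=True)                 # simple per-pixel grayscale

def adjust_brightness(img, alpha=0.3):
    return alpha * img                                     # alpha x Image

def adjust_contrast(img, alpha=0.3):
    return img * alpha + (1.0 - alpha) * grey(img).mean()  # Image x alpha + (1 - alpha) x mean(grey(Image))

def adjust_saturation(img, alpha=0.3):
    return img * alpha + (1.0 - alpha) * grey(img)         # Image x alpha + (1 - alpha) x grey(Image)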

4.3. Comparison to the State-of-the-Art

We compare our approach to the state-of-the-art action anticipation results reported on each of the three datasets discussed above. We further complement these state-of-the-art results with additional baselines that make use of our context-aware features with the loss of either [23] or [13]. Note that a detailed comparison of different losses within our model is provided in Section 4.4.1.

Following standard practice, we report the so-called earliest and latest prediction accuracies. Note, however, that there is no real agreement on the proportion of frames that the earliest setting corresponds to. For each dataset, we make use of the proportion that has been employed by the baselines (i.e., either 20% or 50%).

Table 1. Comparison with state-of-the-art baselines on the task of action anticipation on the JHMDB-21 dataset. Note that our approach outperforms all baselines significantly in both settings.

Method                          Earliest   Latest
DP-SVM [34]                     5%         46%
S-SVM [34]                      5%         43%
Where/What [35]                 10%        43%
Ranking Loss [23]               29%        43%
Context-Aware + Loss of [13]    28%        43%
Context-Aware + Loss of [23]    33%        39%
Ours                            55%        58%

Table 2. Comparison with state-of-the-art baselines on the task of action anticipation on the UT-Interaction dataset.

Method                          Earliest   Latest
D-BoW [27]                      70.0%      85.0%
I-BoW [27]                      65.0%      81.7%
CuboidSVM [26]                  31.7%      85.0%
BP-SVM [19]                     65.0%      83.3%
CuboidBayes [27]                25.0%      71.7%
DP-SVM [34]                     13.0%      14.6%
S-SVM [34]                      11.0%      13.4%
Context-Aware + Loss of [13]    45.0%      65.0%
Context-Aware + Loss of [23]    48.0%      60.0%
Ours                            84.0%      90.0%

Table 3. Action anticipation on the UCF-101 dataset.

Method                          Earliest   Latest
Context-Aware + Loss of [13]    30.6%      71.1%
Context-Aware + Loss of [23]    22.6%      73.1%
Ours                            80.5%      83.4%

Note also that our approach relies on at most T frames (with T = 50 in practice). Therefore, in the latest setting, where the baselines rely on the complete sequences, we only exploit the first T frames. We believe that the fact that our method significantly outperforms the state-of-the-art in this setting despite using less information further evidences the effectiveness of our approach.

JHMDB-21. The results for the JHMDB-21 dataset are provided in Table 1. In this case, following the baselines, earliest prediction corresponds to observing the first 20% of the sequence. Note that we clearly outperform all the baselines by a significant margin in both the earliest and latest settings. Remarkably, we also outperform the methods that rely on additional information as input, such as optical flow [34, 35, 23] and Fisher vector features based on Improved Dense Trajectories [34]. This clearly demonstrates the benefits of our approach for anticipation.

UT-Interaction. We provide the results for the UT-Interaction dataset in Table 2. Here, following standard practice, 50% of the sequence was observed for earliest prediction, and the entire sequence for latest prediction.


Figure 5. Comparison of different losses for action anticipation on UCF-101. We evaluate the accuracy of our model trained with different losses as a function of the number of frames observed. This plot clearly shows the superiority of our loss function.

Recall that our approach uses at most T = 50 frames for prediction in both settings, while the average length of a complete sequence is around 120 frames. Therefore, as evidenced by the results, our approach yields significantly higher accuracy despite using considerably less data as input.

UCF-101. We finally compare our approach with our two baselines on the UCF-101 dataset. While this is not a standard benchmark for action anticipation, this experiment is motivated by the fact that this dataset is relatively large, has many classes, some of which are similar to each other, and contains variations in video capture conditions. Altogether, this makes it a challenging dataset on which to anticipate actions, especially when only a small amount of data is available. The results on this dataset are provided in Table 3. Here, the earliest setting corresponds to using the first 2 frames of the sequences, which corresponds to around 1% of the data. Again, we clearly outperform the two baselines consisting of exploiting context-aware features with the loss of either [23] or [13]. We believe that this further evidences the benefits of our approach, which leverages both context- and action-aware features with our new anticipation loss. A detailed evaluation of the influence of the different feature types and losses is provided in the next section.

4.4. Analysis

In this section, we provide a more detailed analysis of the influence of our loss function and of the different feature types on anticipation accuracy. Finally, we also provide a visualization of our different feature types, to illustrate their respective contributions.

4.4.1 Influence of the Loss Function

Throughout the paper, we have argued that our novel loss, introduced in Section 3.1, is better suited to action anticipation than existing ones.

Figure 6. Influence of our average pooling strategy. Our simple yet effective average pooling leverages the predictions of all the frames up to time t. As shown on the UCF-101 dataset, this increases anticipation performance, especially at very early stages.

To evaluate this, we trained several versions of our model with different losses. In particular, as already done in the comparison to the state-of-the-art above, we replaced our loss with the ranking loss of [23] (ranking loss on the detection score) and the loss of [13], but this time within our complete multi-stage model, with both context- and action-aware features.

Furthermore, we made use of the standard cross-entropy (CE) loss, which only accounts for one activity label for each sequence (at time T). This loss can be expressed as

L_{CE} = -\sum_{k=1}^{N} \Big[ y_T(k) \log \hat{y}_T(k) + \big(1 - y_T(k)\big) \log\big(1 - \hat{y}_T(k)\big) \Big].   (7)

We also modified the loss of [13], which consists of an exponentially weighted softmax, into an exponentially weighted cross-entropy (ECE) loss, written as

L_{ECE} = -\sum_{t=1}^{T} e^{-(T-t)} \sum_{k=1}^{N} \Big[ y_t(k) \log \hat{y}_t(k) + \big(1 - y_t(k)\big) \log\big(1 - \hat{y}_t(k)\big) \Big].   (8)

The main drawback of this loss comes from the fact that it does not strongly encourage the model to make correct predictions as early as possible. To address this issue, we also introduce a linearly growing loss (LGL), defined as

L_{LGL} = -\sum_{t=1}^{T} \frac{t}{T} \sum_{k=1}^{N} \Big[ y_t(k) \log \hat{y}_t(k) + \big(1 - y_t(k)\big) \log\big(1 - \hat{y}_t(k)\big) \Big].   (9)
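The difference between these baselines lies only in their time weights, which the following NumPy sketch makes explicit: ECE uses an exponential weight e^{-(T-t)} and LGL a linear weight t/T, both applied to the full per-frame binary cross-entropy, whereas Eq. (1) applies the growing weight to the false-positive term only. The function names are ours.

import numpy as np

def frame_bce(y_true, y_pred, eps=1e-7):
    # Per-frame binary cross-entropy summed over classes; shapes are (T, N).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).sum(axis=1)

def ece_loss(y_true, y_pred):                 # Eq. (8): exponential weight e^{-(T-t)}
    T = y_true.shape[0]
    w = np.exp(-(T - np.arange(1, T + 1)))
    return (w * frame_bce(y_true, y_pred)).sum()

def lgl_loss(y_true, y_pred):                 # Eq. (9): linear weight t/T
    T = y_true.shape[0]
    w = np.arange(1, T + 1) / T
    return (w * frame_bce(y_true, y_pred)).sum()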

While our new loss, introduced in Section 3.1, also makes use of a linearly increasing term, this term applies to the false positives in our case, as opposed to the false negatives in the LGL.


Table 4. Importance of the different feature types.

Feature          Model     Earliest (K=1)   Latest (K=50)
Context-Aware    LSTM      62.80%           72.71%
Action-Aware     LSTM      69.33%           77.86%
Context+Action   MS-LSTM   80.5%            83.37%

Since some actions are ambiguous in the first few frames, we find it more intuitive not to penalize false positives too strongly at the beginning of the sequence. This intuition is supported by our results below, which show that our loss yields better results than the LGL.

In Fig. 5, we report the accuracy of the corresponding models as a function of the number of observed frames on the UCF-101 dataset. Note that our new loss yields much higher accuracies than the other ones, particularly when only a few frames of the sequence are observed; with only 2 frames observed, our loss yields an accuracy similar to that of the other losses with 30–40 frames. At 30 fps, this essentially means that we can predict the action 1 second earlier than the other methods. The importance of this result is exemplified by research showing that a large proportion of vehicle accidents are due to mistakes/misinterpretations of the scene in the immediate time leading up to the crash [13, 25].

Moreover, in Fig. 6, we report the performance of the corresponding models as a function of the number of observed frames when using our average pooling strategy. Note that this strategy can generally be applied to any action anticipation method and, as shown by comparing Figs. 5 and 6, increases accuracy, especially at very early stages, which clearly demonstrates its effectiveness. Note that using it in conjunction with our loss still yields the best results by a significant margin.

4.4.2 Influence of the Features

We then evaluate the importance of the different feature types, context-aware and action-aware, on anticipation accuracy. To this end, we compare models trained using each feature type individually with our model that uses them jointly. For the models using a single feature type, we made use of a single LSTM to model temporal information. By contrast, our approach relies on a multi-stage LSTM, which we denote by MS-LSTM. Note that all models were trained using our new anticipation loss. The results of this experiment on the UCF-101 dataset are provided in Table 4. These results clearly evidence the importance of using both feature types, which consistently outperforms using either one individually.

Since we extract action-aware features, and not motion-aware ones, our approach is not affected by irrelevant motion. The CNN that extracts these features learns to focus on the discriminative parts of the images, thus discarding the irrelevant information.

Figure 7. Visualization of context-aware and action-aware features on UCF-101 (rows: RGB frame, context-aware conv5, action-aware conv5; columns: Apply Lipstick, Band Marching, Archery, Blowing Candles). These samples are representative of the data.

To confirm this, we conducted an experiment on some classes of UCF-101 that contain irrelevant motion/multiple actors, such as Baseball Pitch, Basketball, Cricket Shot and Ice Dancing. The accuracies of our action-aware and context-aware frameworks on these classes are: 66.1% vs. 58.2% for Baseball Pitch, 83% vs. 76.4% for Basketball, 65.1% vs. 58% for Cricket Shot, and 92.3% vs. 91.7% for Ice Dancing. This shows that our action-aware features can effectively discard irrelevant motion/actors to focus on the relevant one(s).

4.4.3 Visualization

Finally, to provide a better intuition of the kind of information each of our feature types encodes, we visualize them in Fig. 7. This visualization was computed by average pooling over the 512 channels of conv5-3 (of both the context-aware and action-aware sub-networks). As can be observed in the figure, our context-aware features have high activations on regions corresponding to any relevant object in the scene (context). By contrast, in our action-aware features, high activations correctly correspond to the focus of the action. Therefore, they can reasonably localize the parts of the frame that most strongly participate in the action happening in the video and reduce the noise coming from context.

5. Conclusion

In this paper, we have introduced a novel loss function to address very early action anticipation. Our loss encourages the model to make correct predictions as early as possible in the input sequence, thus making it particularly well-suited to action anticipation. Furthermore, we have introduced a new multi-stage LSTM model that effectively combines context-aware and action-aware features. Our experiments have evidenced the benefits of our new loss function over existing ones.


Furthermore, they have shown the importance of exploiting both context- and action-aware information. Altogether, our approach significantly outperforms the state-of-the-art in action anticipation on all the datasets we applied it to. In the future, we intend to study new ways to incorporate additional sources of information, such as dense trajectories and human skeletons, into our framework.

References

[1] G. Bertasius, J. Shi, and L. Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4380–4389, 2015.

[2] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[3] W. Chen, C. Xiong, R. Xu, and J. J. Corso. Actionness ranking with lattice conditional ordinal random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 748–755, 2014.

[4] G. Cheron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3218–3226, 2015.

[5] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.

[6] A. Diba, A. Mohammad Pazandeh, H. Pirsiavash, and L. Van Gool. DeepCAMP: Deep convolutional action & attribute mid-level patterns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3557–3565, 2016.

[7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[8] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[9] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[10] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[11] G. Gkioxari and J. Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 759–768, 2015.

[12] S. Herath, M. Harandi, and F. Porikli. Going deeper into action recognition: A survey. Image and Vision Computing, 60:4–21, 2017.

[13] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3118–3125. IEEE, 2016.

[14] M. Jain, J. Van Gemert, H. Jegou, P. Bouthemy, and C. G. Snoek. Action localization with tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 740–747, 2014.

[15] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199, 2013.

[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.

[17] Y. Kong, D. Kit, and Y. Fu. A discriminative model with multiple temporal scales for action prediction. In European Conference on Computer Vision, pages 596–611. Springer, 2014.

[18] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

[19] K. Laviers, G. Sukthankar, D. W. Aha, M. Molineaux, C. Darken, et al. Improving offensive performance through opponent modeling. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, pages 58–63.

[20] Y. Li, W. Li, V. Mahadevan, and N. Vasconcelos. VLAD3: Encoding dynamics of deep features for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1951–1960, 2016.

[21] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[22] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.

[23] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1942–1950, 2016.

[24] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In European Conference on Computer Vision, pages 737–752. Springer, 2014.

[25] T. Rueda-Domingo, P. Lardelli-Claret, J. Luna del Castillo, J. Jimenez-Moleon, M. Garcia-Martin, and A. Bueno-Cavanillas. The influence of passengers on the risk of the driver causing a car collision in Spain: Analysis of collisions from 1990 to 1999. Accident Analysis and Prevention, 2004.

[26] M. Ryoo, C.-C. Chen, J. Aggarwal, and A. Roy-Chowdhury. An overview of contest on semantic description of human activities (SDHA) 2010. In Recognizing Patterns in Signals, Speech, Images and Videos, pages 270–285. Springer, 2010.


[27] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In IEEE International Conference on Computer Vision (ICCV), pages 1036–1043. IEEE, 2011.

[28] M. S. Ryoo and J. K. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In IEEE 12th International Conference on Computer Vision, pages 1593–1600. IEEE, 2009.

[29] F. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In European Conference on Computer Vision, pages 413–432. Springer, 2016.

[30] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[32] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961–1970, 2016.

[33] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961–1970, 2016.

[34] K. Soomro, H. Idrees, and M. Shah. Online localization and prediction of actions and interactions. arXiv preprint arXiv:1612.01194, 2016.

[35] K. Soomro, H. Idrees, and M. Shah. Predicting the where and what of actors and actions through online action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2648–2657, 2016.

[36] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. 2012.

[37] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.

[38] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.

[39] M. Van den Bergh, G. Roig, X. Boix, S. Manen, and L. Van Gool. Online video seeds for temporal window objectness. In Proceedings of the IEEE International Conference on Computer Vision, pages 377–384, 2013.

[40] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.

[41] D. Waltisberg, A. Yao, J. Gall, and L. Van Gool. Variations of a Hough-voting action recognition system. In Recognizing Patterns in Signals, Speech, Images and Videos, pages 306–312. Springer, 2010.

[42] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.

[43] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.

[44] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. arXiv preprint arXiv:1604.07279, 2016.

[45] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.

[46] X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.

[47] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.

[48] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.

[49] G. Yu, J. Yuan, and Z. Liu. Predicting human activities using spatio-temporal structure of interest points. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1049–1052. ACM, 2012.

[50] T.-H. Yu, T.-K. Kim, and R. Cipolla. Real-time action recognition by spatiotemporal semantic and structural forests. In BMVC, volume 2, page 6, 2010.

[51] F. Yuan, V. Prinet, and J. Yuan. Middle-level representation for human activities recognition: The role of spatio-temporal relationships. In ECCV, pages 168–180. Springer, 2010.

[52] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.

[53] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[54] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector CNNs. arXiv preprint arXiv:1604.07669, 2016.

[55] X. Zhao, X. Li, C. Pang, X. Zhu, and Q. Z. Sheng. Online human gesture recognition from motion data streams. In Proceedings of the 21st ACM International Conference on Multimedia, pages 23–32. ACM, 2013.

[56] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.


Supplementary Material

In this supplementary material, we analyze different aspects of our approach via several additional experiments. While the main paper discusses action anticipation, here we focus on evaluating our approach on the task of action recognition. Therefore, we first provide a comparison to state-of-the-art action recognition methods on three standard benchmarks, and evaluate the effect of exploiting additional optical flow features for both action recognition and anticipation. We then analyze the effect of our different feature types with several loss functions, the influence of the number of hidden units and of our average pooling in the LSTMs, and, finally, the effect of our multi-stage LSTM architecture.

Comparison to State-of-the-Art Action Recognition Methods

We first compare the results of our approach to state-of-the-art methods on UCF-101, JHMDB-21 and UT-Interaction in terms of average accuracy over the standard training and testing partitions. In Table 5, we provide the results on the UCF-101 dataset. Here, for the comparison to be fair, we only report the results of the baselines that do not use any other information than the RGB images and the activity labels (we refer the reader to the baselines' papers and the survey [12] for more detail). In other words, while it has been shown that additional, handcrafted features, such as dense trajectories and optical flow, can help improve accuracy [42, 43, 20, 30, 2], our goal here is to truly evaluate the benefits of our method, not of these features. Note, however, that, as discussed in the next section of this supplementary material, our approach can still benefit from such features. As can be seen from the table, our approach outperforms all these RGB-based baselines. In Tables 6 and 7, we provide the results for JHMDB-21 and UT-Interaction. Again, we outperform all the baselines, even though, in this case, some of them rely on additional information such as optical flow [11, 44, 34, 35, 23] or IDT Fisher vector features [34]. We believe that these experiments show the effectiveness of our approach at tackling the action recognition problem.

Exploiting Optical Flow

Note that our approach can also be extended into a two-stream architecture to benefit from optical flow information, as state-of-the-art action recognition methods do. In particular, to extract optical flow features, we made use of the pre-trained temporal network of [30]. We then computed the CNN features from a stack of 20 optical flow frames (10 frames in the x-direction and 10 frames in the y-direction), from t − 10 to t at each time t. As these features are potentially only loosely related to the action (by focusing on motion), we merge them with the input to the second stage of our multi-stage LSTM.

Table 5. Comparison with state-of-the-art methods on UCF-101 (average accuracy over all training/testing splits). For the comparison to be fair, we focus on the baselines that, like us, only use the RGB frames as input.

Method                                    Accuracy
Dynamic Image Network [2]                 70.0%
Dynamic Image Network + Static RGB [2]    76.9%
Rank Pooling [9]                          72.2%
DHR [9]                                   78.8%
Zhang et al. [54]                         74.4%
LSTM [37]                                 74.5%
LRCN [7]                                  68.8%
C3D [38]                                  82.3%
Spatial Stream Net [30]                   73.0%
Deep Network [16]                         65.4%
ConvPool (Single frame) [52]              73.3%
ConvPool (30 frames) [52]                 80.8%
ConvPool (120 frames) [52]                82.6%
Ours                                      83.3%
Diff. to State-of-the-Art                 +0.7%

Table 6. Comparison with state-of-the-art methods on JHMDB-21 (average accuracy over all training/testing splits). Note that while the methods of [11, 44, 34, 35] use motion/optical flow information and [34] uses IDT Fisher vector features, our method yields better performance.

Method                          Accuracy
Where and What [35]             43.8%
DP-SVM [34]                     44.2%
S-SVM [34]                      47.3%
Spatial-CNN [11]                37.9%
Motion-CNN [11]                 45.7%
Full Method [11]                53.3%
Actionness-Spatial [44]         42.6%
Actionness-Temporal [44]        54.8%
Actionness-Full Method [44]     56.4%
Ours                            58.3%
Diff. to State-of-the-Art       +1.9%

In Table 8, we compare the results of our modified approach with state-of-the-art methods that also exploit optical flow. Note that our two-stream approach yields accuracy comparable to the state-of-the-art.

We also conducted an experiment to evaluate the effectiveness of incorporating optical flow in our framework for action anticipation. To handle the case where fewer than 10 frames are available, we padded the frame stack with gray images (with values of 127.5). Our flow-based approach achieved 86.8% for earliest and 91.8% for latest prediction on UCF-101, thus showing that, if runtime is not a concern, optical flow can indeed help increase the accuracy of our approach.
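For concreteness, the sketch below shows how such a 20-channel flow stack with gray padding could be assembled. The helper name flow_stack, the list-of-arrays inputs and the x-then-y channel ordering are illustrative assumptions, not the exact implementation used here.

import numpy as np

def flow_stack(flow_x, flow_y, t, length=10, gray_value=127.5):
    """Build a 2*length-channel optical-flow stack covering frames
    [t - length, t). Frames before the start of the video are replaced
    by gray images, as described in the text.

    flow_x, flow_y: lists of HxW float arrays, one per frame
    (hypothetical inputs); the x-then-y channel order is an assumption.
    """
    h, w = flow_x[0].shape
    gray = np.full((h, w), gray_value, dtype=np.float32)
    xs, ys = [], []
    for i in range(t - length, t):
        if i < 0:                       # before the first frame: pad with gray
            xs.append(gray)
            ys.append(gray)
        else:
            xs.append(flow_x[i].astype(np.float32))
            ys.append(flow_y[i].astype(np.float32))
    return np.stack(xs + ys, axis=0)    # shape: (2 * length, H, W)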


Table 7. Comparison with state-of-the-art methods on UT-Interaction (average accuracy over all training/testing splits). Note that while the method of [34] uses motion/optical flow information and IDT Fisher vector features, our method yields better performance.

Method                        Accuracy
D-BoW [27]                    85.0%
I-BoW [27]                    81.7%
Cuboid SVM [26]               85.0%
BP-SVM [19]                   83.3%
Cuboid/Bayesian [27]          71.7%
DP-SVM [34]                   14.6%
Yu et al. [50]                83.3%
Yuan et al. [51]              78.2%
Waltisberg et al. [41]        88.0%
Ours                          90.0%
Diff. to State-of-the-Art     +2.0%

Table 8. Comparison with the state-of-the-art approaches that use optical flow. For the comparison to be fair, we focus on the baselines that, like us, use RGB frames + optical flow as input.

Method                                       Accuracy
Spatio-temporal ConvNet [16]                 65.4%
LRCN + Optical Flow [7]                      82.9%
LSTM + Optical Flow [37]                     84.3%
Two-Stream Fusion [8]                        92.5%
CNN features + Optical Flow [30]             73.9%
ConvPool (30 frames) + Optical Flow [52]     87.6%
ConvPool (120 frames) + Optical Flow [52]    88.2%
VLAD3 + Optical Flow [20]                    84.1%
Two-Stream ConvNet [30]                      88.0%
Two-Stream Conv. Pooling [52]                88.2%
Two-Stream TSN [45]                          91.5%
Ours + Optical Flow                          91.8%

We further compare our approach with the two-stream network [30], designed for action recognition, applied to the task of action anticipation. On UCF-101, this model achieved 83.2% for earliest and 88.6% for latest prediction, which our approach with optical flow clearly outperforms.

Effect of Different Feature Types

Here, we evaluate the importance of the different feature types, context-aware and action-aware, on recognition accuracy. To this end, we compare models trained using each feature type individually with our model that uses them jointly. For all models, we made use of LSTMs with 2048 units. Recall that our approach relies on a multi-stage LSTM, which we denote by MS-LSTM. The results of this experiment for different losses are reported in Table 9. These results clearly evidence the importance of using both feature types, which consistently outperforms using individual ones in all settings.

Table 9. Importance of the different feature types using different losses. Note that combining both types of features consistently outperforms using a single one. Note also that, for a given model, our new loss yields higher accuracies than the other ones.

Feature          Sequence Learning   Accuracy
Context-Aware    LSTM (CE)           72.38%
Action-Aware     LSTM (CE)           74.24%
Context+Action   MS-LSTM (CE)        78.93%
Context-Aware    LSTM (ECE)          72.41%
Action-Aware     LSTM (ECE)          77.20%
Context+Action   MS-LSTM (ECE)       80.38%
Context-Aware    LSTM (LGL)          72.58%
Action-Aware     LSTM (LGL)          77.63%
Context+Action   MS-LSTM (LGL)       81.27%
Context-Aware    LSTM (Ours)         72.71%
Action-Aware     LSTM (Ours)         77.86%
Context+Action   MS-LSTM (Ours)      83.37%


Robustness to the Number of Hidden Units

Based on our experiments, we found that, for large datasets such as UCF-101, the 512 hidden units that some baselines use (e.g., [7, 37]) do not suffice to capture the complexity of the data. Therefore, to study the influence of the number of units in the LSTM, we evaluated versions of our model with 1024 and 2048 hidden units (since 512 yields poor results and higher numbers, e.g., 4096, would require too much memory), training on 80% of the training data and validating on the remaining 20%. For a single LSTM, we found that using 2048 hidden units performs best. For our multi-stage LSTM, 2048 hidden units also yields the best results. We also evaluated the importance of relying on average pooling in the LSTM. The results of these different versions of our MS-LSTM framework are provided in Table 10; they show that, typically, more hidden units and average pooling slightly improve accuracy.
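As an illustration of what average pooling in the LSTM amounts to, the sketch below averages the per-frame hidden states of a single LSTM before the classifier. Apart from the 2048 hidden units discussed above, the layer layout and class names are assumptions made for illustration only.

import torch
import torch.nn as nn

class PooledLSTM(nn.Module):
    """Illustrative single-stage LSTM with temporal average pooling:
    the classifier sees the mean of the per-frame hidden states rather
    than only the last one."""
    def __init__(self, feat_dim, num_classes, hidden=2048):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)              # per-frame hidden states
        return self.fc(h.mean(dim=1))    # average pooling over time, then classify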

Effect of the LSTM Architecture

Finally, we study the effectiveness of our multi-stage LSTM architecture at merging our two feature types. To this end, we compare the results of our MS-LSTM with the following baselines: a single-stage LSTM that takes as input the concatenation of our context-aware and action-aware features (Concatenation); two parallel LSTMs whose outputs are merged by concatenation and then fed to a fully-connected layer (Parallel); and a multi-stage LSTM where the two feature types are processed in the reverse order (Swapped), that is, where the model processes the action-aware features first and, in a second stage, combines them with the context-aware ones.


Table 10. Influence of the number of hidden LSTM units and of our average pooling strategy in our multi-stage LSTM model. These experiments were conducted on the first splits of UCF-101 and JHMDB-21.

Setup        Average Pooling   Hidden Units   UCF-101   JHMDB-21
Ours (CE)    wo/               1024           77.26%    52.80%
Ours (CE)    wo/               2048           78.09%    53.43%
Ours (CE)    w/                2048           78.93%    54.30%
Ours (ECE)   wo/               1024           79.10%    55.33%
Ours (ECE)   wo/               2048           79.41%    56.12%
Ours (ECE)   w/                2048           80.38%    57.05%
Ours (LGL)   wo/               1024           79.76%    55.70%
Ours (LGL)   wo/               2048           80.10%    56.83%
Ours (LGL)   w/                2048           81.27%    57.70%
Ours         wo/               1024           81.94%    56.24%
Ours         wo/               2048           82.16%    57.92%
Ours         w/                2048           83.37%    58.41%

Table 11. Comparison of our multi-stage LSTM model with diverse fusion strategies. We report the results of a simple concatenation of the context-aware and action-aware features, of their use in two parallel LSTMs with late fusion, and of swapping their order in our multi-stage LSTM, i.e., action-aware first, followed by context-aware. Note that the multi-stage architectures yield better results, with the best ones achieved by using context first, followed by action, as proposed in this paper.

Feature Order   Sequence Learning   Accuracy
Concatenation   LSTM                77.16%
Parallel        2 Parallel LSTMs    78.63%
Swapped         MS-LSTM (Ours)      78.80%
Ours            MS-LSTM (Ours)      83.37%

The results of this comparison are provided in Table 11. Note that both multi-stage LSTMs outperform the single-stage one and the two parallel LSTMs, thus indicating the importance of treating the two types of features sequentially. Interestingly, processing the context-aware features first, as we propose, yields higher accuracy than considering the action-aware ones at the beginning. This matches our intuition that context-aware features carry global information about the image and will thus yield noisy results, which can then be refined by exploiting the action-aware features.
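The sketch below illustrates this context-first, action-second processing order. The concatenation-based merge, the placement of the pooling and the layer sizes are assumptions made for illustration, not the exact model described in the main paper.

import torch
import torch.nn as nn

class TwoStageLSTM(nn.Module):
    """Minimal sketch of the two-stage fusion compared in Table 11:
    context-aware features are processed first, and the resulting hidden
    states are then merged with the action-aware features in a second
    stage before temporal average pooling and classification."""
    def __init__(self, ctx_dim, act_dim, num_classes, hidden=2048):
        super().__init__()
        self.stage1 = nn.LSTM(ctx_dim, hidden, batch_first=True)
        self.stage2 = nn.LSTM(hidden + act_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, ctx, act):         # both: (batch, time, feature_dim)
        h1, _ = self.stage1(ctx)         # stage 1: context-aware features only
        h2, _ = self.stage2(torch.cat([h1, act], dim=-1))  # stage 2: refine with action-aware
        return self.fc(h2.mean(dim=1))   # temporal average pooling + classifier

# The "Swapped" baseline corresponds to feeding the action-aware features
# to stage 1 and the context-aware ones to stage 2 (with the input
# dimensions exchanged accordingly).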

Furthermore, we evaluate a CNN-only version of our approach, where we removed the LSTM, but kept our average pooling strategy, to show the effect of our MS-LSTM architecture on top of the CNN. On UCF-101, this achieved 69.53% for earliest and 73.80% for latest prediction. This shows that, while this CNN-only framework yields reasonable predictions, our complete approach with our multi-stage LSTM benefits from explicitly being trained on multiple frames, thus achieving significantly higher accuracy (80.5% and 83.4%, respectively). While the LSTM could in principle learn to perform average pooling, we believe that the lack of data prevents this from happening.

Acknowledgement. The authors thank Oscar Friberg for his assistance in conducting the additional experiments in this supplementary material.
