
Recognition of Two-person Interactions Using a Hierarchical Bayesian Network

Sangho Park
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712, USA
[email protected]

J.K. Aggarwal
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712, USA
[email protected]

ABSTRACT

Recognizing human interactions is a challenging task due to the multiple body parts of interacting persons and the concomitant occlusions. This paper presents a method for the recognition of two-person interactions using a hierarchical Bayesian network (BN). The poses of simultaneously tracked body parts are estimated at the low level of the BN, and the overall body pose is estimated at the high level of the BN. The evolution of the poses of the multiple body parts is processed by a dynamic Bayesian network. The recognition of two-person interactions is expressed in terms of semantic verbal descriptions at multiple levels: individual body-part motions at the low level, single-person actions at the middle level, and two-person interactions at the high level. Example sequences of interacting persons illustrate the success of the proposed framework.

Categories and Subject Descriptors

I.4.8 [Image Processing and Computer Vision]: Scene Analysis—motion; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—video analysis, motion; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—video

General Terms

Algorithms

Keywords

Surveillance, event recognition, scene understanding, human interaction, motion, hierarchy, Bayesian network

1. INTRODUCTION

The recognition of human interaction has many applications in video surveillance, video-event annotation, virtual reality, and robotics. Recognizing human interactions, however, is a challenging task due to the ambiguity caused by body articulation, loose clothing, and mutual occlusion between body parts. This ambiguity makes it difficult to track moving body parts and to recognize their interaction.


The recognition task depends on the reliable performance of low-level vision algorithms that include segmentation and tracking of salient image regions and extraction of object features. Involving more than one person makes the task more complicated, since the individual tracking of multiple interacting body parts needs to be maintained along the image sequence.

In our previous paper [15], we presented a method to segment and track multiple body parts in two-person interactions. Our method is based on multi-level processing at the pixel level, blob level, and object level. At the pixel level, individual pixels are classified into homogeneous blobs according to color. At the blob level, adjacent blobs are merged to form larger blobs according to a blob similarity metric. At the object level, sets of multiple blobs are labeled as human body-part regions according to domain knowledge. The multiple body-part regions are tracked along the image sequence. As shown in fig. 1, the tracked body parts lack information about their poses, such as the orientation of the head, the hand position of the upper body, the foot position of the lower body, etc.

In this paper, we present a methodology that estimates body-part pose and recognizes different two-person interactions including 'pointing', 'punching', 'standing hand-in-hand', 'pushing', and 'hugging'. The recognition algorithm is preceded by a feature extraction algorithm that extracts body-pose features from the segmented and tracked body-part regions. We use ellipses and convex hulls to represent body parts, and build a hierarchical Bayesian network to estimate individual body poses at each frame. Including the dynamics of the body-pose changes along the sequence leads us to the recognition of two-person interactions.

Fig. 2 shows the overall system diagram. The lower-left panel shows a tree structure that represents the individual body part regions of a person; the root node represents the whole body region, and its children nodes represent the head, upper body, and lower body, respectively. Each body part is subdivided into skin and non-skin parts such as face and hair, hand and torso, etc. The lower-right panel shows a Bayesian network to estimate the overall body pose in each frame, including the poses of the torso, head, arm, and leg. The upper-left panel shows the image sequence represented by the concatenation of two tree structures for the two interacting persons at each frame. The upper-right panel shows a dynamic Bayesian network that recognizes the two-person interactions. The result of recognition of the two-person interactions is provided in terms of a semantic description with 'subject + verb + object' format for a user-friendly verbal interface.

The contributions of this paper are as follows: (1) A new hierarchical framework is proposed for the recognition of two-person interactions at a detailed level using a hierarchical Bayesian network. (2) Ambiguity in human interaction due to occlusion is handled by inference with the Bayesian network. (3) A human-friendly vocabulary is generated for high-level event description. (4) A stochastic graphical model is proposed for the recognition of two-person interactions in terms of 'subject + verb + object' semantics.

Figure 1: An exemplar frame of the 'pointing' sequence: the input image (a) and its tracked body parts indexed by different colors (b).

The rest of the paper is organized as follows: Section 2 summarizes research related to the representation of the human body and the recognition paradigms for event recognition. Section 3 presents the hierarchical Bayesian network that estimates the individual body-part poses as well as the overall body pose, and Section 4 addresses the need to process relative information between interacting persons. Section 5 presents the recognition method that uses the dynamics of pose evolution along the sequence. Results and conclusion follow in Sections 6 and 7, respectively.

2. PREVIOUS WORK

In the last decade, the computer vision community has conducted extensive research on the analysis of human motion in general. The many approaches that have been developed for the analysis of human motion can be classified into two categories: model-based and appearance-based. See [1, 6, 11] for reviews. We will discuss appearance-based methods below.

Early research on human motion has focused on analyzing a single person in isolation [3, 17], or on tracking only a subset of the body parts, such as the head, torso, hands, etc. [5, 19], while research on analyzing the motion of multiple people has focused on silhouettes [7, 12, 18], stick figures [4], contours [20, 14], or color [15].

In addition, research has used a coarse-level representation of the human body such as a moving bounding box or silhouette, or has focused on the detection of specific interactions predefined in terms of head or hand velocity. However, a coarse-level representation of a human body using a bounding box is not powerful enough for detailed recognition of human interactions involving body parts. Silhouette-based or contour-based representation may be hampered by mutual occlusion caused by moving body parts in human interactions. Model-based representation is not robust due to ambiguities in body shape caused by loose clothing.

Many approaches have been proposed for behavior recognition using various methods including hidden Markov models, finite state automata, and context-free grammars. Oliver et al. [12] presented a coupled hidden Markov model (CHMM) for gross-level human interactions between two persons such as 'approach', 'meet', 'walk together', and 'change direction' from bird's-eye-view sequences. Sato et al. [18] presented a method that uses the spatio-temporal velocity of pedestrians to classify their interaction patterns. Hongeng et al. [8] proposed probabilistic finite state automata (FA) for gross-level human interactions. Their system utilized user-defined hierarchical multiple scenarios of human interaction. Park et al. [14] proposed a string-matching method using a nearest neighbor classifier for detailed-level recognition of two-person interactions such as 'hand-shaking', 'pointing', and 'standing hand-in-hand'. Wada et al. [21] used nondeterministic finite state automata (NFA) over a state product space. Kojima et al. [10] presented a natural-language-based description of a single person's activities. Park et al. [16] presented a recognition method that combines model-based tracking and deterministic finite state automata.

Figure 2: System diagram. The lower-left panel shows a tree structure that represents the individual body part regions of a person. The lower-right panel shows a Bayesian network to estimate the body poses in each frame. The upper-left panel shows a sequence represented by the concatenation of two tree structures for two interacting persons at each frame. The upper-right panel shows a dynamic Bayesian network, which recognizes the two-person interactions.

3. HIERARCHICAL BAYESIAN NETWORK

Understanding events in video requires a representation of causality among random variables. Causal relations and conditional independencies among the random variables are efficiently represented by a directed acyclic graph (DAG) in a Bayesian network (BN). Given a set of N variables H1:N = H1, . . . , HN, the joint probability distribution P(H1:N) = P(H1, H2, . . . , HN) can be factored into a sparse set of conditional probabilities according to the conditional independencies:

P(H1:N) = Π_{i=1}^{N} P(Hi | pa(Hi))    (1)

where pa(Hi) is the set of parent nodes of node Hi.

A video sequence of human interactions will exhibit occlusion caused by obstructed body parts. Occlusion may result in incomplete values for the random variables. Occlusion problems caused by human interaction are well suited to the BN, which can perform inference with a partially observed set e of variables referred to as 'evidence'. With the evidence provided, the posterior probability distribution for a state h ∈ H is P(h | e). The inference task is to estimate the most likely joint configuration of the variables given the evidence:

h* = arg max_h P(h | e)    (2)

Because our proposed BN is a tree without loops (fig. 4(a)), inference is effectively performed by a message-passing algorithm [9].
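To make Eqs. (1) and (2) concrete, the minimal sketch below enumerates a tiny tree-structured fragment with made-up prior and conditional probability tables (the state names and numbers are illustrative assumptions, not the paper's trained tables). On a tree, Pearl's message-passing algorithm computes the same posteriors far more efficiently; brute-force enumeration is used here only because it makes the factorization explicit.

```python
# Toy tree-structured fragment: one hidden pose variable H with two
# observation children V1 and V2. All tables are illustrative assumptions.
states_H = ["front-view", "left-view", "right-view", "rear-view"]
P_H = {"front-view": 0.4, "left-view": 0.25, "right-view": 0.25, "rear-view": 0.1}
P_V1 = {  # P(V1 | H); V1 already quantized to two codebook values {0, 1}
    "front-view": [0.7, 0.3], "left-view": [0.2, 0.8],
    "right-view": [0.2, 0.8], "rear-view": [0.05, 0.95],
}
P_V2 = {  # P(V2 | H)
    "front-view": [0.6, 0.4], "left-view": [0.3, 0.7],
    "right-view": [0.3, 0.7], "rear-view": [0.1, 0.9],
}

def posterior(evidence):
    """P(h | e) via the factored joint of Eq. (1); missing nodes are skipped."""
    scores = {}
    for h in states_H:
        p = P_H[h]                        # prior term of the factorization
        if "V1" in evidence:
            p *= P_V1[h][evidence["V1"]]  # child terms P(Vi | pa(Vi) = h)
        if "V2" in evidence:
            p *= P_V2[h][evidence["V2"]]
        scores[h] = p
    z = sum(scores.values())
    return {h: p / z for h, p in scores.items()}

post = posterior({"V1": 1, "V2": 0})
h_star = max(post, key=post.get)          # Eq. (2): MAP configuration
print(post, "->", h_star)
```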

Understanding semantic events from video requires the incorporation of domain knowledge combined with image data. A human body model is introduced as the domain knowledge to group free-form blobs into meaningful human body parts: head, upper body, and lower body, each of which is recursively subdivided into skin and non-skin parts. The body model is appearance-based and represents image regions corresponding to each body part.

We need body-part features such as the arm tip, leg tip, and torso width to estimate the body poses. In this paper we propose a method to model human body parts by combining an ellipse representation and a convex-hull-based polygonal representation of the interacting human body parts.

An ellipse is defined by its major and minor axes and their orientations in two-dimensional (2D) space, and optimally represents the dispersedness of arbitrary 2D data (i.e., an image region in our case). The directions and lengths of the major and minor axes are computed by principal component analysis (PCA) of the 2D data; the two eigenvectors of the image region define the center positions and directions of the axes, and the eigenvalues define the lengths of the axes.
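A minimal sketch of such a PCA-based ellipse fit, assuming the body-part region is available as an N x 2 array of (x, y) pixel coordinates (the function name, the 2*sqrt(eigenvalue) axis scaling, and the synthetic example are our own choices, not the paper's implementation):

```python
import numpy as np

def fit_ellipse(region_xy):
    """Fit an ellipse to a 2D point set by PCA of its pixel coordinates.

    Returns the region center, the axis directions (eigenvectors, major axis
    first), the axis half-lengths derived from the eigenvalues, and the
    orientation of the major axis in degrees.
    """
    center = region_xy.mean(axis=0)
    cov = np.cov((region_xy - center).T)           # 2x2 covariance of the region
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # put the major axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    half_lengths = 2.0 * np.sqrt(eigvals)          # illustrative scaling choice
    orientation = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    return center, eigvecs, half_lengths, orientation

# Example: an elongated synthetic blob standing in for an upper-body region.
pts = np.random.randn(500, 2) * np.array([30.0, 8.0]) + np.array([160.0, 120.0])
print(fit_ellipse(pts))
```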

A convex hull is defined to be the minimum convex closure of an arbitrary image region specified by a set of vertex points, and represents the minimum convex area containing the region [13]. We use the set of contour points of the individual body-part regions segmented in fig. 1(b) to generate the convex hull of the corresponding body part. The maximum curvature points of the convex hull indicate candidate positions of the hand or foot, and are used as extra observation evidence for the BN. Our method uses color- and region-based segmentation of body parts, and is effective in handling mutual occlusion. Mutual occlusion inevitably involves uncertainty in the estimation of body pose.
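The convex-hull step can be sketched with scipy's ConvexHull over the body-part contour points. Ranking hull vertices by their distance from the region center is a simple stand-in for the maximum-curvature criterion (an assumption on our part); the paper then keeps only candidates that coincide with a skin blob.

```python
import numpy as np
from scipy.spatial import ConvexHull

def extremity_candidates(contour_xy, top_k=3):
    """Return hull vertices ranked as candidate hand/foot positions.

    contour_xy: (N, 2) array of contour points of one segmented body part.
    The vertices farthest from the region center are returned; a skin mask
    would subsequently filter these candidates, as described in the text.
    """
    hull = ConvexHull(contour_xy)
    vertices = contour_xy[hull.vertices]
    center = contour_xy.mean(axis=0)
    dist = np.linalg.norm(vertices - center, axis=1)
    return vertices[np.argsort(dist)[::-1]][:top_k]

# Example: a synthetic upper-body blob with an outstretched "arm".
torso = np.random.randn(400, 2) * 15 + np.array([100.0, 100.0])
arm = np.random.randn(60, 2) * 3 + np.array([160.0, 80.0])
print(extremity_candidates(np.vstack([torso, arm])))
```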

Note that the ellipse and the convex hull have trade-offs. The advantage of the ellipse representation is its robustness against outlier pixels due to image noise; however, it is not sensitive enough to detect a hand position (i.e., the arm tip) located at an extreme position of the upper-body distribution, resulting in false negatives in detection. In contrast, the convex hull representation of the upper body effectively detects a candidate arm tip as one of the maximum curvature points of the convex hull. However, the convex hull representation is prone to false positives in the detection of the arm tip when the hand is closer to the trunk than the elbow is, or when a thin, long region is attached to the upper body due to image noise. In order to cope with this problem, we choose as the candidate hand positions the maximum curvature points that coincide with a skin blob. We use the skin detector described in [15].

We introduce a hierarchical Bayesian network that infers body pose. The Bayesian network estimates the configuration of the head, hand, leg, and torso in a hierarchical way by utilizing partial evidence (i.e., the ellipse parameters and convex hull parameters) observed in each frame. All the observation node states are initially continuous, but they are normalized and discretized to generate codebook vectors by vector quantization, as follows:

Let v ∈ R be a random variable with a uniform prior probability distribution in the range [va, vb]. Assume that V = {1, . . . , L} is the discretized space of R and has L clusters. Then v is converted to a codebook vector vk ∈ V by multiple thresholding as follows:

vk = arg min_i | (v − va)/(vb − va) − i/L |  for all i ∈ [1, L]    (3)
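Equation (3) amounts to snapping the normalized observation onto the nearest of L equally spaced codebook levels. A direct sketch (variable names are ours):

```python
def quantize(v, v_a, v_b, L):
    """Eq. (3): map a continuous value v in [v_a, v_b] to a codebook index in 1..L.

    The normalized value (v - v_a) / (v_b - v_a) is compared against the L
    levels i / L, and the index of the closest level is returned.
    """
    x = (v - v_a) / (v_b - v_a)
    return min(range(1, L + 1), key=lambda i: abs(x - i / L))

# Example: quantizing a torso rotation angle in [-90, 90] degrees into 8 bins.
print([quantize(a, -90.0, 90.0, 8) for a in (-90, -30, 0, 42, 90)])
```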

It is assumed that the conditional probability distribution of each of the observation variables is a triangular distribution centered at the most probable state. The most probable state is determined by training data.

A two-person interaction is represented in terms of the configuration changes of the individual body parts, and is recognized by a dynamic Bayesian network (DBN) that utilizes the hierarchical information of the body pose. The recognition results are expressed in terms of semantic motion descriptions. Example sequences illustrate the success of the proposed framework. The body parts are represented by ellipses and convex hulls (fig. 3), and then processed by a hierarchical Bayesian network. The hierarchical Bayesian network estimates the configuration of the head, hand, leg, and torso by utilizing partial evidence observed in the frame, analyzes temporal changes of the body poses along the sequence, and interprets human interactions.

Figure 3: Body-part parameterization for fig. 1(b): each body part is parameterized by both an ellipse (a) and a convex hull (b).

3.1 Head Pose Estimation

A Bayesian network for estimating the head pose of a person is constructed as shown in fig. 4(a), with a geometric transformation parameter α, an environmental setup parameter β, and head pose H2 as hidden nodes, and head appearance V1 and V2 as observation nodes. The geometric transformation, which may include camera geometry, determines the imaging process. The environmental setup, which may include lighting conditions, determines the reflectance of light from the head. The head pose, which may include the head's three-dimensional rotation angles, determines which part of the head is visible.

An example of the observations (i.e., evidence) of the head appearance is shown in fig. 5. For each head, the large ellipse represents the overall head region and the small ellipse, the face region. Arrows represent the vector from the head center to the face center. The angle of the vector corresponds to the value of observation node V1, and the ratio of the two ellipses in a head corresponds to the value of observation node V2 in fig. 4. Those observations of the head appearance are determined by the generative model parameters α, β, and H2. The causal relations between the generative model parameters and the observations are represented by the arrows between the hidden nodes and the observation nodes in fig. 4. In this paper we assume, for simplicity, that the geometric transformation parameter α and the environmental setup parameter β are fixed and known. This assumption is based on our experience that the detailed recognition of human interactions requires that body-part information be available or able to be inferred from image data. That is, we assume that a fixed camera is used with a viewing axis parallel to the ground (i.e., the horizon), and that the people are in an indoor situation with constant lighting conditions. (Note that this assumption may be relaxed by learning the parameters α and β, given enough training data.) Currently, this assumption allows the nodes α and β to be omitted from computation and simplifies the structure of the Bayesian network, as shown in fig. 4(b).

Our task is to estimate the posterior probability P(H2 | V1:2) of the head pose H2 given the joint observations V1:2 = V1, V2. By Bayes' rule, the posterior probability is proportional to the product of the conditional probability P(V1:2 | H2) and the prior probability P(H2) as follows:

P(H2 | V1:2) ∝ P(V1:2 | H2) P(H2)    (4)


Figure 4: Bayesian network for head pose estimation of a person. See the legend below for the node labels.

The legend for fig. 4 is as follows:

α: Geometric transformation parameter such as camera model

β: Environmental setup parameter such as lighting condition

H2: Head pose

V1: Angle of vector from head center to face center

V2: Ratio of face area to head area

Fig. 5 shows the observation data for head poses of the humans in fig. 3(a). The angle of the vector from the head center to the face center is represented by an arrow for each person, and the ratio of face area to head area is computed from the two overlapping ellipses of a human head. The hidden node's states are defined as H2 = {'front-view', 'left-view', 'right-view', 'rear-view'}.

Figure 5: Observations for head pose estimation. For each head, the large ellipse represents the overall head region and the small ellipse, the face region. Arrows represent the vector from the head center to the face center.

Our goal is to estimate the hidden node's discrete states, represented as a verbal vocabulary, by using the evidence of the observation node states.
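The two head-pose observations can be sketched directly from the two fitted ellipses; the function below (our own, with illustrative inputs) computes V1 as the image-plane angle of the head-center-to-face-center vector and V2 as the ratio of the ellipse areas, both of which would then be quantized with Eq. (3) before entering the BN.

```python
import numpy as np

def head_pose_observations(head_center, head_axes, face_center, face_axes):
    """Compute the head-pose evidence V1 (angle) and V2 (area ratio).

    head_axes / face_axes: (major, minor) half-lengths of the fitted ellipses.
    """
    dx, dy = np.asarray(face_center) - np.asarray(head_center)
    v1_angle = np.degrees(np.arctan2(dy, dx))      # head-center -> face-center
    area = lambda axes: np.pi * axes[0] * axes[1]  # ellipse area
    v2_ratio = area(face_axes) / area(head_axes)
    return v1_angle, v2_ratio

# Example: a face ellipse offset to one side of the head center, which the BN
# would interpret as a side view under whichever angle convention is trained.
print(head_pose_observations((100, 50), (12, 10), (104, 50), (7, 6)))
```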

3.2 Arm Pose Estimation

Another Bayesian network is built for estimating the pose of the salient arm. The salient arm is defined to be the arm that is involved in the human interaction. Usually, the salient arm is the outermost arm stretching toward the opposite person. Fig. 6 shows the BN for the arm pose. The geometric transformation parameter α and the environmental setup parameter β are dropped from the computation, as in section 3.1. The legend for fig. 6 is as follows:

H3: Vertical pose of the outermost arm

H4: Horizontal pose of the outermost arm

V4: Vertical position of the maximum curvature point of the upper-body convex hull

V5: Torso ellipse aspect ratio of major to minor axes

V6: Torso ellipse rotation on the image plane

V7: Horizontal position of the maximum curvature point of the upper-body convex hull

Figure 6: Bayesian network for arm pose estimation of a person. See the legend above for the node labels.

Fig. 7 shows the observations for estimating the arm pose. We observe the torso ellipse and the convex hull for each person.

The ellipse A in fig. 7(a) represents the degree of spatial distribution of the upper-body region of the left person, and is summarized in terms of the major and minor axes and their orientations computed by principal component analysis (PCA). The torso ellipse's aspect ratio and rotation angle on the image plane are used as observation evidence V5 and V6 of the BN. A convex hull represents the maximal area of the region. The image coordinate of the maximum curvature point C of the convex hull of the left person in fig. 7(b) is detected as a candidate hand position, and is used as extra observation evidence V4 and V7 for the BN. The same computations apply to the right person in fig. 7 with the ellipse B and the maximum curvature point D. A separate BN is used for the right person.

If multiple maximum curvature points exist for a candidate hand position, we choose the maximum curvature point that coincides with a skin blob detected by the skin detector described in [15]. However, the skin blob can be occluded by articulation, and it is not guaranteed that the maximum curvature point of a convex hull actually represents an arm tip. Our solution for this ambiguity is to estimate the arm pose based on the observations using the BN in fig. 6.

Figure 7: Observations for the arm pose estimation. Each individual upper body is represented by an ellipse A or B in (a) and the corresponding convex hull in (b). The maximum curvature points C and D in (b) are detected as candidate hand positions.

In this case our goal is to estimate the posterior probability P(H3:4 | V4:7) of the arm's vertical pose H3 and horizontal pose H4 given the joint observations V4:7 = V4, V5, V6, V7 as follows:

P(H3:4 | V4:7) ∝ P(V4:7 | H3:4) P(H3:4)    (5)

The hidden node’s states are defined below:

H3 = {'low', 'mid-low', 'mid-high', 'high'}
H4 = {'withdrawn', 'intermediate', 'stretching'}


3.3 Leg Pose Estimation

A Bayesian network similar to that in section 3.2 is built for estimating the leg pose of a person, as shown in fig. 8, with hidden nodes H5 and H6 and observation nodes V8-V11.

Figure 8: Bayesian network for leg pose estimation of a person. See the legend below for the node labels.

The legend for fig. 8 is as follows:

H5: Vertical pose of the outermost leg

H6: Horizontal pose of the outermost leg

V8: Vertical position of maximum curvature point of lower-body convex hull

V9: Lower-body ellipse aspect ratio of major to minor axes

V10: Lower-body ellipse rotation on image plane

V11: Horizontal position of maximum curvature point of lower-body convex hull

Fig. 9 shows the lower body of each person represented by an ellipse and a convex hull. The BN for the leg pose has a similar structure to the BN for the arm pose, but the prior probability distribution and the conditional probability distributions may differ significantly.

Figure 9: Observations for the leg pose estimation. Each individual lower body is represented by an ellipse A or B in (a) and the corresponding convex hull in (b). The maximum curvature points C and D in (b) are detected as candidate foot positions.

Our goal is to estimate the posterior probability P(H5:6 | V8:11) of the leg's vertical pose H5 and horizontal pose H6 given the observations V8:11 as follows:

P(H5:6 | V8:11) ∝ P(V8:11 | H5:6) P(H5:6)    (6)

The hidden node’s states are defined below:

H5 = {'low', 'middle', 'high'}
H6 = {'withdrawn', 'intermediate', 'out-reached'}

3.4 Overall Body Pose Estimation

The individual Bayesian networks are integrated in a hierarchy to estimate the overall body pose of a person, as shown in fig. 10. The proper hierarchy depends on domain-specific knowledge. It is also possible to learn the structure of the BN, given enough training data. However, in the video surveillance domain there is usually not enough training data available. Therefore, we manually constructed the structure of the BN.

The overall BN is specified by the prior probability distribution of H1 and the conditional probability distributions of the rest of the nodes. The BN has discrete hidden nodes and discrete observation nodes that are discretized by the vector quantization described in section 3. The probability distributions are approximated in discrete tabular form, and trained by counting the frequency of co-occurrence of the states of a node and its parent nodes using training sequences.
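Training a tabular conditional probability distribution by counting state co-occurrences can be sketched as follows; the data layout and the Laplace smoothing constant are our own assumptions, not details given in the paper.

```python
from collections import Counter
from itertools import product

def train_cpt(samples, parent_states, child_states, alpha=1.0):
    """Estimate P(child | parents) by counting co-occurrences in training data.

    samples: iterable of (parent_state_tuple, child_state) pairs taken from
    labelled training frames. alpha is a smoothing pseudo-count.
    """
    counts = Counter(samples)
    cpt = {}
    for pa in product(*parent_states):
        row = {c: counts[(pa, c)] + alpha for c in child_states}
        z = sum(row.values())
        cpt[pa] = {c: n / z for c, n in row.items()}
    return cpt

# Example: torso pose H1 conditioned on head pose H2 (a single parent),
# with a handful of made-up frame labels.
views = ["front-view", "left-view", "right-view", "rear-view"]
data = [(("front-view",), "front-view"), (("front-view",), "front-view"),
        (("left-view",), "left-view"), (("left-view",), "front-view")]
print(train_cpt(data, parent_states=[views], child_states=views)[("front-view",)])
```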

We introduce an additional hidden node, H1, for the torso pose as the root node of the hierarchical BN, and an additional observation node, V3, the median width of the torso in the image. The torso pose is inferred from the poses of the head, arm, and leg, as well as the additional observation of the median width of the torso.

The states of the hidden node H1 are defined as H1 = {‘front-view’, ‘left-view’, ‘right-view’, ‘rear-view’}.

Our goal is to estimate the posterior probability P(H1 | H2:6, V3) of the torso pose H1 given the other body-part poses H2:6 and the observation V3, as follows:

P(H1 | H2:6, V3) ∝ P(H2:6, V3 | H1) P(H1)    (7)

Our hierarchical Bayesian network efficiently decomposes the overall task of estimating the joint posterior probability P(H1:6 | V1:11) into individual posterior probabilities, as described in sections 3.1 through 3.4.

4. RELATIVE POSITION

The pose estimation of body parts in the previous section is based on a single person's posture in terms of an object-centered coordinate system. However, knowing the body poses is not enough for the recognition of interaction between people.

Recognizing two-person interactions requires information about the relative positions of the two persons' individual body parts. For example, in order to recognize 'approaching', we need to know the relative distance and the direction of torso movement with respect to the other person. This fact leads us to the notion of the dynamics of body movements: the evolution of body-part motion along the sequence.

We represent the interacting persons' relative position at both the gross and detailed levels. Fig. 11 shows the results of gross-level processing for body-part segmentation. Each person's foreground silhouette is vertically projected and modeled by a one-dimensional (1D) Gaussian (fig. 11(a)). A mixture of the 1D Gaussians is used to model multiple people. Two Gaussians are fitted, each of which models a single person without occlusion. Frame-to-frame update of these Gaussian parameters along the sequence (fig. 11(b)) amounts to tracking the whole-body translation of each person in the horizontal image dimension. In fig. 11(b), the plot's x-axis represents the frame number and the y-axis, the horizontal image dimension. We can see that the left person (i.e., the first person) approaches the right person (i.e., the second person), who stands still up to frame 17; then the second person moves back while the first person stands still. However, we do not know what kind of event happens between the two persons: does the first person push the second person, or does the first person meet the second person and the second person turn back and depart?
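One frame of the gross-level tracker can be sketched as below: the foreground mask is projected onto the horizontal axis and a 1D Gaussian (mean, standard deviation) is re-fitted for each person within a gate around the previous estimate. The gating window is our simplification of the per-person data association; it is not spelled out in the paper.

```python
import numpy as np

def update_gaussian(fg_mask, prev_mean, window=40):
    """Re-fit one person's 1D Gaussian from the vertical projection of a frame.

    fg_mask: (H, W) binary foreground mask; prev_mean: previous horizontal
    torso-center estimate for this person. Returns the updated (mean, std).
    """
    proj = fg_mask.sum(axis=0).astype(float)        # vertical projection onto x
    xs = np.arange(proj.size)
    w = proj * (np.abs(xs - prev_mean) <= window)   # gate to this person
    if w.sum() == 0:
        return prev_mean, float(window)             # keep the prediction if no support
    mean = float((xs * w).sum() / w.sum())
    std = float(np.sqrt(((xs - mean) ** 2 * w).sum() / w.sum()))
    return mean, std

# Example: a synthetic 240x320 mask with one silhouette near x = 100.
mask = np.zeros((240, 320), dtype=bool)
mask[60:200, 85:115] = True
print(update_gaussian(mask, prev_mean=95.0))
```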

Figure 10: Bayesian network for pose estimation of an interacting person. The BN is composed of 6 hidden nodes H1:6 and 11 visible nodes V1:11. See the legends in sections 3.1 through 3.4 for the node labels.

Figure 11: Gross-level estimation of proximity between two persons in the 'pushing' sequence of fig. 1 using a Gaussian-based tracker [15]. (a) shows the input image, its 1D projection, and the 1D Gaussian approximation from top to bottom, respectively. (b) shows the estimated horizontal position of the torso center, with vertical bars representing uncertainty in terms of the standard deviation of the Gaussian tracker. The x-axis is the frame number, and the y-axis is the pixel coordinate of the horizontal image dimension.

In order to recognize the specific interaction, we further analyze the configuration of the persons. Fig. 12(a) shows an example of detailed information about the relative position of the interacting body parts in fig. 7. It represents the position of the estimated hand, C, of the left person and the estimated ellipse, B, of the upper body of the right person. The proximity between the estimated arm tip and the estimated upper body is measured in terms of the eigenvalue λ of the principal component analysis results that specify the ellipse B. The proximity is discretized into three states: {'touching', 'near', 'far'}. The proximity relations are computed for all combinations of the body parts (i.e., the head, arm, leg, and torso) of the left person with respect to those of the right person, resulting in a 4 × 4 matrix. The most significant combination is chosen as the salient proximity in a given frame of a specific interaction type. For example, in the 'pushing' interaction, the proximity relation of the pushing person's arm with respect to the nearest body part of the pushed person is chosen as the salient proximity in the interaction. The proximity relation is in the 'touching' state if the distance is less than λ, in the 'near' state if it is less than 1.5 × λ, and in the 'far' state otherwise.
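The proximity relations can be sketched directly from the thresholds above; here λ is passed in per target body part (our reading is that it comes from the PCA eigenvalue of that part's ellipse), and the 4 x 4 relation matrix is filled for all part pairs with hypothetical positions.

```python
import numpy as np

def proximity_state(distance, lam):
    """Discretize a distance into {'touching', 'near', 'far'} using the thresholds above."""
    if distance < lam:
        return "touching"
    if distance < 1.5 * lam:
        return "near"
    return "far"

def proximity_matrix(parts_a, parts_b, lam_b):
    """Proximity relation for every pair of person A's and person B's body parts."""
    return {(a, b): proximity_state(
                np.linalg.norm(np.subtract(parts_a[a], parts_b[b])), lam_b[b])
            for a in parts_a for b in parts_b}

# Hypothetical positions: person A's arm tip close to person B's torso.
A = {"head": (80, 40), "arm": (140, 90), "leg": (85, 200), "torso": (85, 110)}
B = {"head": (200, 40), "arm": (160, 95), "leg": (200, 200), "torso": (190, 110)}
lamB = {"head": 15.0, "arm": 25.0, "leg": 40.0, "torso": 45.0}
print(proximity_matrix(A, B, lamB)[("arm", "torso")])   # 'near' in this layout
```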

We also determine the relative orientations of the torso poses between the two persons. An interaction between two persons facing each other may have a very different connotation from a similar interaction in which one faces the other's back. Fig. 12(b) shows an example of relative torso poses; the top panel shows the camera setup viewing the individual persons from a distance, and the bottom panel shows the corresponding appearance. In the example, the left person is in front-view (i.e., azimuth1 = 0°), and the right person is in left-view (i.e., azimuth2 = 90°). The relative azimuth Ra defines the relative torso poses between the two persons as follows:

Ra = azimuth2 − azimuth1    (8)

Figure 12: Relative pose information. (a) shows the relative distance between the left person's hand C and the right person's upper-body ellipse B of fig. 7. (b) shows the relative orientation between the torso poses of the two simulated persons captured by a camera.

5. RECOGNITION BY DYNAMIC BAYESIAN NETWORK

We model a human interaction as a sequence of state changes that represents the configuration and movement of individual body parts (i.e., torso, arms, and legs) in the spatio-temporal domain. This model requires a representation of 6 interacting processes (3 body parts × 2 persons) for a two-person interaction. Involving multiple people and multiple body parts makes it difficult to apply a simple HMM model that has a state space composed of a single random variable. Because the overall state space is very large, the joint probability distribution becomes intractable, and a prohibitively large amount of data is required to train the model. It is well known that the exact solution of extensions of the basic HMM to three or more processes is intractable [12].

Our solution to overcome the exponential growth of the parameter space is to represent the human behavior at multiple levels in a hierarchy. We utilize the hierarchy of our BN and abstract the states and the events at multiple levels. Individual body-part motions are analyzed at the low level, and the whole-body motion of a person is processed at the middle level by combining the individual body-part motions. The two-person interaction patterns are recognized at the high level by incorporating the whole-body motion and spatial/temporal constraints on relative positions and causal relations between the two persons.

One of the ultimate goals of recognizing visual events is the automatic generation of user-friendly descriptions of the events. At the high level, we consider the recognition of human behavior from the viewpoint of language understanding in terms of 'subject + verb + (object)'. The subject corresponds to the person of interest in the image, the verb to the motion of the subject, and the object to the optional target of the motion (i.e., usually the other person's body part). Our language-based framework is similar to [10], but our interest is in the interaction between two autonomous persons.
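The final formatting step is trivial but worth making explicit: once the subject, verb, and optional object labels are recognized, the description is a direct concatenation. The helper below is ours; the label strings follow the spirit of the paper's vocabulary.

```python
def describe(subject, verb, obj=None):
    """Format a recognized event as a 'subject + verb + (object)' phrase."""
    return f"{subject} {verb}" + (f" {obj}" if obj else "")

# Examples in the spirit of the paper's verbal descriptions.
print(describe("the first person", "stretches his arm toward", "the second person's torso"))
print(describe("the second person", "moves backward"))
```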

5.1 Body-part pose evolution

Our dynamic Bayesian network (DBN) is constructed by establishing temporal links between the identical hidden nodes of the BN in fig. 10 at times t and t + 1. This means that we model the hidden state evolution as a first-order Markov process.

We use a DBN equivalent of an HMM for each body part: legs, torso, and arms. The difference between a DBN and an HMM is that a DBN represents the hidden state in terms of a set of random variables, ^1q_t, ..., ^Nq_t, i.e., it uses a distributed representation of state. By contrast, in an HMM, the state space consists of a single random variable q_t. The DBN is a stochastic finite state machine defined by an initial state distribution π, an observation probability B, and a state transition probability A:

A = {a_ij | a_ij = P(q_{t+1} = j | q_t = i)}
B = {b_j(k) | b_j(k) = P(v_t = k | q_t = j)}
π = {π_i | π_i = P(q_1 = i)}

where q_t is the state value of the hidden node H at time t, and v_t is the observation value of the observation node V at time t.

The observation probability B and the initial state distribution π specify the DBN at a given time t and are already established in our Bayesian network in section 3. Therefore, our DBN only needs the state transition probability A that specifies the state evolution from time t to time t + 1. Fig. 13 shows a DBN equivalent of an HMM unrolled for T = 4 time slices, where ^jq_t is the hidden node state for the j-th body part at time t: ^1q for the legs, ^2q for the torso, and ^3q for the arms, respectively.

The HMM hidden states correspond to the progress of the gesture with time.

^1Q, the set of HMM states for the legs, has three kinds: ^1Q = {"both legs are together on the ground", "both legs are spread on the ground", "one foot is moving in the air while the other is on the ground"}.

Similarly, ^2Q, the set of HMM states for the torso, has three kinds: ^2Q = {"stationary", "moving forward", "moving backward"}.

^3Q, the set of HMM states for the arms, has three kinds: ^3Q = {"both arms stay down", "at least one arm stretches out", "at least one arm gets withdrawn"}.

The assumption of independence between the individual HMMs in our network structure drastically reduces the size of the overall state space and the number of relevant joint probability distributions.
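Each per-body-part process is therefore a small discrete HMM specified by (π, A, B). The sketch below decodes the most likely torso-state sequence with the three states listed above; all probability values and the quantized observation symbols are illustrative placeholders, and Viterbi decoding is our choice of inference, used here only to keep the example self-contained.

```python
import numpy as np

# Torso-process states from section 5.1; the numbers are illustrative only.
states = ["stationary", "moving forward", "moving backward"]
pi = np.array([0.8, 0.1, 0.1])                   # initial state distribution
A = np.array([[0.90, 0.05, 0.05],                # state transition probabilities
              [0.10, 0.85, 0.05],
              [0.10, 0.05, 0.85]])
B = np.array([[0.80, 0.10, 0.10],                # P(observation symbol | state)
              [0.15, 0.80, 0.05],
              [0.15, 0.05, 0.80]])

def viterbi(obs):
    """Most likely hidden state sequence for a list of observation symbols."""
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # best predecessor per state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# A torso that stays still, moves forward, then moves back.
print(viterbi([0, 0, 1, 1, 1, 2, 2, 0]))
```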

Figure 13: DBN equivalent of an HMM for the j-th body part, unrolled for T = 4. Nodes represent states, and arrows represent allowable transitions, i.e., transitions with non-zero probability. The self-loop on state i means P(q_t = i | q_{t−1} = i) = A(i, i) > 0.

5.2 Whole-body pose evolution

The evolution of a person's body pose is defined by the three results, ^{1:3}q_t, of the HMMs in section 5.1 and forms the description of the overall body pose evolution of the person. Examples of the description include {"stand still with arms down", "move forward with arm(s) stretched outward", "move backward with arm(s) raised up", "stand stationary while kicking with leg(s) raised up"}, etc.

Instead of using an exponentially increasing number of all possible combinations of the joint states of the DBN, we focus only on a subset of all states to register the semantically meaningful body motions involved in two-person interactions. We observe that a typical two-person interaction usually involves a significant movement of either arm(s) or leg(s), but not both. This means that we assume the joint probability distribution of meaningless combinations in ^{1:3}q_t has a probability virtually equal to 0 in usual body-part pose combinations.

5.3 Two-person interaction

To recognize an interaction pattern between two persons, we incorporate the relative poses of the two persons. We juxtapose the results of the mid-level descriptions of the individual persons along a common time line, and add spatial and temporal constraints of a specific interaction type based on domain knowledge about causality in interactions.

The spatial constraints are defined in terms of the relative position and orientation described in section 4. For example, "standing hand-in-hand" requires parallel torso poses between the two persons (Ra = 0°), while "pointing at the opposite person" requires opposite torso poses between them (Ra = 180°). The "pointing at the opposite person" interaction also involves a systematic increase of proximity between a person's moving arm and the torso or head of the opposite person. The proximal target of the arm motion (i.e., the torso or head) is regarded as the 'object' term in our semantic verbal description of the interaction. Our body-part segmentation module provides a set of meaningful body-part labels: {head, hair, face, upper body, hand, leg, foot} as a vocabulary for the 'object' term.

The temporal constraints are defined in terms of the causal relations of the two persons' body pose changes. We use Allen's interval temporal logic [2] to describe the causal relations in the temporal domain (i.e., before, meets, overlaps, starts, during, finishes, etc.). For example, a 'pushing' interaction involves "a person moving forward with arm(s) stretched outward toward the second person" followed by "move-backward of the second person" as a result of the 'pushing'.
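A small sketch of the temporal-constraint check: intervals are frame ranges, and the relation of one body-pose interval to another is classified into the Allen relations mentioned above (any other configuration is simply reported as 'other'); the thresholds and the example intervals are hypothetical.

```python
def allen_relation(a, b):
    """Classify the Allen relation of interval a with respect to interval b.

    Intervals are (start_frame, end_frame) pairs with start < end.
    """
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 == b1 and a2 < b2:
        return "starts"
    if a1 > b1 and a2 < b2:
        return "during"
    if a2 == b2 and a1 > b1:
        return "finishes"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    return "other"

# 'Pushing': the pusher's arm-stretch interval should precede or meet/overlap
# the pushed person's move-backward interval.
arm_stretch = (10, 17)
move_backward = (17, 30)
print(allen_relation(arm_stretch, move_backward))   # -> 'meets'
```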

We introduce appropriate spatial/temporal constraints for each of the various two-person interaction patterns as domain knowledge. The satisfaction of specific spatial/temporal constraints gates the high-level recognition of the interaction. Therefore, the mid-level and high-level recognition is characterized by the integration of domain-specific knowledge, whereas the low-level recognition is more closely related to the pure motion of a body part.

6. EXPERIMENTAL RESULTS

We have tested our system for various human interaction types including (1) approaching, (2) departing, (3) pointing, (4) standing hand-in-hand, (5) shaking hands, (6) hugging, (7) punching, (8) kicking, and (9) pushing. The images used in this work are 320 × 240 pixels in size, obtained at a rate of 15 frames/sec. Six pairs of different interacting people with various clothing were used to obtain the total of 56 sequences (9 interactions × 6 pairs of people), in which a total of 285, 293, 232, 372, 268, 281, 220, 230, and 264 frames are contained in each of the above interaction types (1)-(9), respectively.

We first evaluated the performance of the BN using the test sequences of the 'pointing' interaction (fig. 14). Fig. 18(a) shows more frames of the 'pointing' interaction sequence, where the left person stands stationary while the right person raises and lowers his arm to point at the left person. The desired performance of the BN is that a stationary posture should be recognized as 'stationary' and a moving posture should be recognized as 'moving'. Our BN showed the corresponding results (see fig. 15).

Figure 14: Example frames of the 'pointing' sequence: (a) input image frame, (b) segmented and tracked body parts, (c) ellipse parameterization of each body part.

Fig. 15 shows the values of the observation nodes V1-V7 before quantization and the corresponding beliefs (i.e., the posterior probabilities) for the two persons in fig. 18(a). In figs. 15(a) and (d), the angle of the vector from the head ellipse center to the face ellipse center (V1) and the angle of rotation of the torso ellipse (V6) refer to the left y-axis, while all the other values of the visible nodes are normalized in terms of the height of the person and refer to the right y-axis. The normalization makes the system robust against the height variation of different people. We can incorporate a validation procedure in the Bayesian network in a straightforward manner to check whether an observation is valid in a given situation. For example, if the first person's hand is occluded, its observation is nullified and removed from the evidence.

Figs. 15(b) and (c) show the BN's belief about each state of the hidden nodes for the horizontal and vertical poses of the salient arm of the left person in the sequence, given the evidence in fig. 15(a). The BN's belief shows that the horizontal pose of the arm of the left person stays in the withdrawn state, and the vertical pose of the arm stays in the low or the mid-low state. These results demonstrate that the belief of the BN for the left person is stable for stationary poses in the sequence.

Corresponding data for the right person are shown in figs. 15(d)-(f). Fig. 15(d) shows the systematic change of the hand position (V4, V7) and the torso ellipse (V5, V6) as the second person points at the first person. Figs. 15(e) and (f) show the BN's belief about each state of the hidden nodes for the horizontal and vertical poses of the salient arm of the right person in the sequence, given the evidence in fig. 15(d). The BN's belief shows that the horizontal pose of the arm of the right person changes through the intermediate → stretching → intermediate → withdrawn states, and that the vertical pose of the arm changes through mid-low → mid-high → high → mid-high → mid-low → low as he moves his arm. This result corresponds well to the human interpretation of the input sequence. These results demonstrate that the belief of the BN for the right person properly detects the state changes as the person moves his arm for the interaction.

Figure 16: Example frames of different interactions, including positive interactions: 'standing hand-in-hand' (a), 'shaking hands' (b), 'hugging' (c), and negative interactions: 'punching' (d), 'kicking' (e), 'pushing' (f). The degree of occlusion increases from the left to the right examples in each row.

Fig. 16 shows other exemplar poses of positive interactions (standing hand-in-hand, shaking hands, and hugging) and negative interactions (punching, kicking, and pushing). The degree of occlusion increases from the left to the right examples in each row.

The results of the BN for the individual sequences are processed by the dynamic Bayesian network for the sequence interpretations. The dynamic BN generates verbal descriptions of the interactions at the body-part level, the single-person level, and the mutual-interaction level. The overall description is plotted over a time line.

A cross-validation procedure is used to classify the 6 sequences for each interaction type; that is, among the 6 sequences obtained from different pairs of people, 5 sequences were used as training data to test the remaining sequence. All 6 sequences were tested in the same way for each of the 9 interaction types. The accuracies of the sequence classification are 100, 100, 67, 83, 100, 50, 67, 50, and 83 percent for the above interaction types (1)-(9), respectively. The overall average accuracy is 78 percent. Fig. 17 shows examples of the time lines of the interaction behaviors 'pointing', 'punching', 'standing hand-in-hand', 'pushing', and 'hugging', represented in terms of events and simple behaviors. Fig. 18 shows subsampled frames of various interaction sequences. Note that the degree of occlusion increases from the top to the bottom rows.

7. CONCLUSION

This paper has presented a method for the recognition of two-person interactions using a hierarchical Bayesian network (BN). The poses of simultaneously tracked body parts are estimated at the low level, and the overall body pose is estimated at the high level. The evolution of the poses of the multiple body parts during the interaction is analyzed by a dynamic Bayesian network. The recognition of two-person interactions is achieved by incorporating domain knowledge about relative poses and event causality.

The key contributions of this paper are as follows: (1) A hierarchical framework is used for multiple levels of event representation, from the body-part level to the multiple-bodies level to the video-sequence level. (2) Occlusions occurring in human interaction are handled by inference with a Bayesian network. (3) A user-friendly description of the event is provided in terms of 'subject + verb + object' semantics.

Our experiments show that the proposed method efficiently recognizes detailed interactions that involve the motions of multiple body parts. Our plans for future research include the exploration of the following aspects: (1) extending the method to crowd behavior recognition, (2) incorporating various camera viewpoints, and (3) recognizing more diverse interaction patterns.

Figure 15: BN's beliefs about the arm poses of the two persons in the 'pointing' sequence of fig. 18(a). The plots in the left and right columns show the results for the left and right person in the sequence, respectively. See the text for details.

Figure 17: Semantic interpretation of two-person interactions: 'pointing' (a), 'punching' (b), 'standing hand-in-hand' (c), 'pushing' (d), and 'hugging' (e).

8. REFERENCES

[1] J. Aggarwal and Q. Cai. Human motion analysis: a review. Computer Vision and Image Understanding, 73(3):295–304, 1999.

[2] J. F. Allen and G. Ferguson. Actions and events in interval temporal logic. Journal of Logic and Computation, 4(5):531–579, 1994.

[3] A. Bakowski and G. Jones. Video surveillance tracking using color region adjacency graphs. In Image Processing and Its Applications, pages 794–798, 1999.

[4] A. Datta, M. Shah, and N. Lobo. Person-on-person violence detection in video data. In Proc. Int'l Conference on Pattern Recognition, volume 1, pages 433–438, Quebec City, Canada, 2002.

[5] A. M. Elgammal and L. Davis. Probabilistic framework for segmenting people under occlusion. In Int'l Conference on Computer Vision, volume 2, pages 145–152, Vancouver, Canada, 2001.

[6] D. Gavrila. The visual analysis of human movement: a survey. Computer Vision and Image Understanding, 73(1):82–98, 1999.

[7] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):797–808, August 2000.

[8] S. Hongeng, F. Bremond, and R. Nevatia. Representation and optimal recognition of human activities. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 818–825, 2000.

[9] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, pages 337–340. Morgan Kaufmann, San Mateo, CA, 1988.

[10] A. Kojima and T. Tamura. Natural language description of human activities from video images based on concept hierarchy of actions. Int'l Journal of Computer Vision, 6, 2001.

[11] T. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.

[12] N. M. Oliver, B. Rosario, and A. P. Pentland. A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):831–843, August 2000.

[13] J. O'Rourke. Computational Geometry in C, chapter 3, pages 70–112. Cambridge University Press, 1994.

[14] S. Park and J. Aggarwal. Recognition of human interaction using multiple features in grayscale images. In Proc. Int'l Conference on Pattern Recognition, volume 1, pages 51–54, Barcelona, Spain, September 2000.

[15] S. Park and J. K. Aggarwal. Segmentation and tracking of interacting human body parts under occlusion and shadowing. In IEEE Workshop on Motion and Video Computing, pages 105–111, Orlando, FL, 2002.

[16] S. Park, J. Park, and J. K. Aggarwal. Video retrieval of human interactions using model-based motion tracking and multi-layer finite state automata. In Lecture Notes in Computer Science: Image and Video Retrieval. Springer Verlag, 2003.

[17] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. In Computer Vision and Pattern Recognition, pages 721–727, Hilton Head Island, South Carolina, 2000.

[18] K. Sato and J. K. Aggarwal. Recognizing two-person interactions in outdoor image sequences. In IEEE Workshop on Multi-Object Tracking, Vancouver, Canada, 2001.

[19] J. Sherrah and S. Gong. Resolving visual uncertainty and occlusion through probabilistic reasoning. In British Machine Vision Conference, pages 252–261, Bristol, UK, 2000.

[20] N. Siebel and S. Maybank. Real-time tracking of pedestrians and vehicles. In IEEE Workshop on PETS, Kauai, Hawaii, 2001.

[21] T. Wada and T. Matsuyama. Multiobject behavior recognition by event driven selective attention method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):873–887, August 2000.

Figure 18: Subsampled frames of various interaction sequences: 'pointing', 'punching', 'standing hand-in-hand', 'pushing', and 'hugging' interactions from top to bottom rows.