Tracking people’s hands and feet using mixed network AND/OR search
Vlad I. Morariu, Member, IEEE, David Harwood, Member, IEEE, and Larry S. Davis, Fellow, IEEE
Abstract
We describe a framework that leverages mixed probabilistic and deterministic networks and their AND/OR search space to efficiently find and track the hands and feet of multiple interacting humans in 2D from a single camera view. Our framework detects and tracks multiple people’s heads, hands, and feet through partial or full occlusion; requires few constraints (does not require multiple views, high image resolution, knowledge of performed activities, or large training sets); and makes use of constraints and AND/OR Branch-and-Bound with lazy evaluation and carefully computed bounds to efficiently solve the complex network that results from the consideration of inter-person occlusion. Our main contributions are 1) a multi-person part-based formulation that emphasizes extremities and allows for the globally optimal solution to be obtained in each frame, and 2) an efficient and exact optimization scheme that relies on AND/OR Branch-and-Bound, lazy factor evaluation, and factor cost sensitive bound computation.
We demonstrate our approach on three datasets: the public single person HumanEva dataset, outdoor sequences where multiple people interact in a group meeting scenario, and outdoor one-on-one basketball videos. The first dataset demonstrates that our framework achieves state-of-the-art performance in the single person setting, while the last two demonstrate robustness in the presence of partial and full occlusion and fast non-trivial motion.
Index Terms
Tracking, Motion, Pictorial Structures.
V. I. Morariu, D. Harwood, and L. S. Davis are with the Department of Computer Science, AV Williams Bldg, University of Maryland, College Park, MD 20742.
E-mail: {morariu,lsd}@cs.umd.edu, [email protected].
I. INTRODUCTION
The problem of tracking people’s hands and feet has been widely studied, as its solution is
often required for higher-level reasoning about human activities. Even in the single person, known
activity case, ambiguities are introduced by substantial appearance variations and possible self-
occlusions. Reasoning about the hands and feet of multiple interacting people is more difficult
due to inter-person occlusion, particularly without the simplifying assumption that the space of
poses and movements is constrained by a set of known activities.
Our goal is to track people’s hands and feet efficiently and reliably, without strong assumptions on pose or activity space. We reason about inner joints (knees, elbows, neck and waist) only to ensure that the solution matches the image and satisfies physical constraints; however, we do not track them over time, to reduce computational complexity. Thus, we first detect and track people, obtain a set of extremity detections separately for each person, and extend them to tracks to fill in periods where extremities are not detected. These tracks become extremity candidates, which are subsequently labeled as people’s hands and feet.
We formulate the extremity candidate to person assignment problem using mixed probabilistic and deterministic networks. The probabilistic network is an undirected graphical model which encodes the probability of a body configuration by decomposing it into a product of factors that measure image likelihoods of individual body parts or other relationships between them. The deterministic network enforces hard constraints (i.e., probabilities of 0 or 1) which ensure that body segments have bounded length, or that two overlapping extremity candidates can only be assigned simultaneously to two limbs if they are observed to occlude each other. A transition model between hand/foot assignments in consecutive frames ensures temporal consistency.
To solve the assignment problem, we perform AND/OR search on our mixed deterministic and probabilistic network [1], [2].^1 In the presence of determinism, AND/OR search has been shown to reduce search space (and complexity) by checking hard constraints during the search process to prune inconsistent paths early. Moreover, given suitable bounds on the optimal solution, AND/OR Branch-and-Bound [3] can provide a dramatic additional reduction of the search space while still ensuring that the globally optimal solution is found. An important aspect of our approach is that
^1 In [2], the term mixed networks denotes the case where a belief network and a constraint network are mixed. Instead of a directed model, we use an undirected graphical model to represent the (non-deterministic) probability of a configuration.
Step 1: Track people. Step 2: Track extremity candidates. Step 3: Detect overlap/occlusion. Step 4: Create graph. Step 5: Obtain assignment.
Fig. 1. Framework overview. Assignments are also tracked temporally (not shown).
it reduces image likelihood evaluations during search by computing factor entries on-demand; this requires careful computation of bounds for Branch-and-Bound, since a naive approach tends to touch all factor entries. We believe that our lazy factor evaluation approach, coupled with factor cost sensitive bounds computation, is generally applicable to other machine learning tasks where the computation of factors dominates total inference times.
Our framework has the following advantages: 1) it detects and tracks multiple people’s heads, hands, and feet through partial or full occlusion; 2) it requires few constraints (does not require multiple views, high image resolution, knowledge of performed activities, or large training sets); and 3) it exploits determinism during AND/OR search, reducing the number of likelihood evaluations through lazy factor evaluation and cost sensitive bound computation, while obtaining the exact solution to the complex loopy network generated by inter-person occlusion constraints.
Our main contributions are 1) a multi-person part-based formulation that emphasizes extremities and allows for the globally optimal solution to be obtained in each frame, and 2) an efficient and exact optimization scheme that relies on AND/OR Branch-and-Bound, lazy factor evaluation, and factor cost sensitive bottom-up bound computation. The first contribution is important because few approaches currently exist that deal with the difficulties of pose estimation for multiple interacting people. The novelty in our optimization scheme lies not in our use of AND/OR Branch-and-Bound on the mixed network search space, but in our bound computation approach that focuses on evaluating as few factor entries (i.e., image likelihoods) as possible.
A. Related Work
Several approaches to body part tracking deal with self-occlusion and appearance variations by mapping a set of high-dimensional features computed from a person’s image region to low dimensional coordinates, such as joint locations, angles, or some other latent variable [4], [5]. These approaches can obtain great results, but they either restrict the activity and require cyclic behavior [4] or require very large training sets [5]. Additionally, they rely on global shape or image features that are unreliable when multiple people occlude each other.
Part-based approaches reduce the reliance on large training sets and activity constraints by modeling the human body as a set of articulating body parts. Generally, these approaches detect a set of body part candidates and assemble them according to an articulated model that imposes local part constraints. To handle occlusion, some maintain a tree structure by applying dynamic programming in stages (detecting the first arm/leg, and then searching for a second, given the first) [6], [7]. Others use loopy graphs to directly incorporate part self-occlusion, and resort to approximate inference methods such as Belief Propagation (BP) [8], [9], [10]. Jiang and Martin [11] use a nontree model with hard mutual exclusion constraints to deal with occlusion, approximating its globally optimal solution by the relaxation of an integer linear program (ILP). To incorporate arbitrary pairwise constraints, Ren et al. [12] approximately solve an integer quadratic program (IQP) by a linear relaxation and subsequent gradient descent step. Hua et al. [13] employ a Belief Propagation Monte Carlo approach, detecting a small set of part candidates to construct an importance sampling distribution for non-parametric message passing. Other single person pose approaches employ a chains model to encode part context (e.g., hands) [14]; max-covering to incorporate higher order constraints, image features, and body linkage [15]; branch-and-bound search on non-tree models [16]; and A* search [17].
Our approach also uses an articulated model to factorize the overall score into a product of local part scores; however, our variables represent part endpoints, instead of part candidates. This allows us to detect only extremity endpoints (head, hands, feet) cheaply, constrain the remaining endpoints (neck, waist, knees, elbows) using length constraints, and evaluate part likelihoods on-demand (instead of searching for all parts independently over the whole image). We reason about self-occlusion and inter-person occlusion by dynamically adding hard constraints between extremities, and instead of resorting to approximate methods, we obtain the globally optimal solution to the loopy network by AND/OR Branch-and-Bound [3]. In contrast to previous globally optimal search approaches [17], [16], ours considers the larger graphical model induced by inter-person occlusion constraints, exploits hard constraints before and during search, and computes the search heuristic while explicitly avoiding costly image feature computations.
Recent approaches to tracking multiple people simplify the problem by assuming a small set of activities, such as walking, running, etc. [18], [19], [20]. Once people are localized,
Fig. 2. Probabilistic and deterministic graphical models: (a) probabilistic image likelihoods, (b) deterministic length constraints, (c) deterministic self-occlusion, (d) deterministic inter-person occlusion, (e) probabilistic temporal.
these approaches generally infer each person’s pose independently, using strong motion and pose assumptions to deal with inter-person occlusion. Park and Aggarwal [21] track interacting humans without this assumption by using a hierarchical blob-based approach; however, they deal with only two interacting people and require body parts to have different colors. Though we still require that people are first localized, hands and feet are assigned globally, considering inter-person occlusion constraints. The idea of dynamically adding constraints between person tracks has been explored in [22]; here, we dynamically add constraints between pairs of potentially occluding extremities. More recently, Eichner and Ferrari [23] introduce an occlusion predictor and mutual exclusion terms between body parts of different people to jointly model poses of interacting people. The model includes only upper body pose and requires approximate inference due to its complexity. In contrast, our proposed approach models both upper and lower body pose and obtains the globally optimal solution in each frame. Our approach employs hard mutual exclusion occlusion constraints, but these can be relaxed to be probabilistic as in [23].
Zhu et al. [24] also employed AND/OR graphs for determining human pose, but only to compactly represent the pose space learned during training. They manually create the structure of AND/OR graphs, and automatically learn parameters of potentials defined on the graph. In our work, AND/OR graphs are used to efficiently search for solutions to dynamically constructed graphical models; the AND/OR graph is not explicitly created, but is implicitly searched. Since computer vision problems have large search spaces and image operations are costly, Branch-and-Bound can drastically reduce computations, and has been employed for object detection and segmentation [25], [26]. One advantage of Branch-and-Bound on AND/OR graphs is that they can greatly reduce image evaluations if factor entries are computed only when necessary.
II. PROBABILISTIC AND DETERMINISTIC GRAPHICAL MODELS
In this section we describe our mixed probabilistic and deterministic graphical model that inherently handles multiple people. We assume that bounding boxes have been obtained for each person, and that each person track has a set of extremity candidate tracklets. Additionally, we require a set of functions that evaluate image likelihoods given body segment configurations. Since various detection, tracking, and image likelihood approaches exist, and our graphical model does not depend on the particular approach, we first describe our general model, and then provide implementation details in section IV.
A. Probabilistic graphical model
We use an undirected graphical model $\mathcal{P} = (X, D, F)$ to represent the image likelihood of a configuration. The nodes $X = \{x_i^p \mid i = h, a_1, a_2, f_1, f_2, n, e_1, e_2, w, k_1, k_2\}$ are the locations of the head ($h$), hands ($a_1, a_2$), feet ($f_1, f_2$), neck ($n$), elbows ($e_1, e_2$), waist ($w$), and knees ($k_1, k_2$) of each person (indexed by $p$). Note that each $x_i^p$ denotes the endpoint of one or more body segments in the articulated model, e.g., $x_{k_1}^p$ and $x_{f_1}^p$ are endpoints of a lower leg segment. The discrete domain $D_i^p \in D$ of each variable is a set of candidate locations plus the unknown value. Candidates for head, hands, and feet are the tracks described in section IV-A; candidates for the remaining inner joints are obtained as described in section II-D. The unknown value allows for missed detections, when true hands and feet are not present among the choices, and false positives, when extremity candidates are not actually hands and feet. Finally, $F$ is a set containing three types of factors: (1) $f_{skel}(x_i, x_j, I, \theta)$ takes two endpoints of a skeletal body segment and measures how well the segment is explained by the image (by edges, segmentation, etc.); (2) $f_{app}(x_i, x_j, I, \theta)$ takes two endpoints of symmetric body parts (hands and feet) and enforces appearance similarity; and (3) $f_{unk}(x_i)$ is a constant penalty $c_{unk}$^2 if $x_i$ is unknown, and 1 otherwise. Here, $I$ and $\theta$ are the image and external parameters (such as body segment widths), respectively. We maximize the posterior, $P(X|I, \theta) \propto P(I|X, \theta)P(X|\theta)$, such that

$$P(I|X, \theta) \propto \prod_p \prod_{(x_i^p, x_j^p) \in E_{skel}} f_{skel}(x_i^p, x_j^p, I, \theta) \prod_{(x_i^p, x_j^p) \in E_{app}} f_{app}(x_i^p, x_j^p, I, \theta)$$

^2 This value manages the tradeoff between a bad assignment and declaring a part unknown. We set this value to .004.
Fig. 3. Occlusion constraints: simultaneous assignment of overlapping candidates is disallowed if there is no evidence that the overlap is caused by occlusion. When there is visual evidence of occlusion, the disallowed assignment pairs list is empty, and two overlapping detections can be simultaneously assigned. When there is no visual evidence of occlusion, the disallowed assignment pairs list ensures that only one of the overlapping detections is assigned. In this figure, disallowed pairs included by an assignment are highlighted in yellow.
$$P(X|\theta) \propto \prod_p \prod_i f_{unk}(x_i^p) \prod_{(x_i^p, x_j^p) \in E} f_{prior}(x_i^p, x_j^p, \theta)$$
where $E$ is the set of edges in the model (partitioned into symmetric appearance and skeletal edges, $E_{skel}$ and $E_{app}$); skeletal and symmetric edges are solid and dotted in Figure 2a, respectively. See section IV for additional details on factors. Instead of precomputing the factors (which is costly, as we will show), we employ a lazy evaluation scheme, evaluating each factor entry the first time that it is needed during search. The prior $P(X|\theta)$ penalizes unknown locations through $f_{unk}$ and can include other priors on nodes through $f_{prior}$, e.g., pairwise length priors.
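To make the factorization concrete, the following minimal Python sketch (not the authors’ implementation) accumulates the log-score of a full assignment; the factor functions are hypothetical placeholders with the signatures implied above, and node ids are illustrative.

```python
import math

# Minimal sketch of the posterior factorization: the score of a complete
# assignment is a product of per-factor values, accumulated here in
# log-space for numerical stability. f_skel, f_app, f_unk are placeholder
# callables with the signatures implied by the text.

def log_score(assignment, skel_edges, app_edges, f_skel, f_app, f_unk, image, theta):
    """assignment maps node ids (e.g., ('p1', 'a1')) to locations or None (unknown)."""
    s = 0.0
    for i in assignment:                      # unknown-part penalties f_unk
        s += math.log(f_unk(assignment[i]))
    for (i, j) in skel_edges:                 # skeletal image likelihoods
        s += math.log(f_skel(assignment[i], assignment[j], image, theta))
    for (i, j) in app_edges:                  # symmetric appearance similarity
        s += math.log(f_app(assignment[i], assignment[j], image, theta))
    return s
```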
B. Deterministic constraints
A deterministic network $\mathcal{R} = (X, D, C)$ encodes length and occlusion constraints. Variables $X$ and their domains $D$ are the same as in the previous section.
1) Length constraints: Length constraints, depicted in Figure 2b, ensure that body segments have bounded length. The motivation for hard length constraints is that body segment length is bounded in 3D as a ratio of height, and will be bounded even after projection to 2D under mild assumptions (i.e., the camera is not pointed down toward people’s heads). As long as length constraints are satisfied, we do not prefer one length over another, since foreshortening can cause body segments to be arbitrarily short; we therefore use uniform length priors. Minimum lengths can also be imposed for practical reasons, e.g., to avoid the degenerate case of zero-length segments.
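A minimal sketch of how such a hard constraint can be checked is shown below; the ratio bounds are illustrative placeholders, not values given in the paper.

```python
import math

# Sketch of a hard length constraint as a boolean predicate: a segment is
# feasible iff its 2D length lies within bounds expressed as ratios of the
# person's height (the ratios here are illustrative only).

def length_ok(p, q, height, max_ratio, min_ratio=0.0):
    """True iff the segment from endpoint p to endpoint q has a length
    within [min_ratio * height, max_ratio * height]."""
    d = math.hypot(p[0] - q[0], p[1] - q[1])
    return min_ratio * height <= d <= max_ratio * height
```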
2) Occlusion constraints: Intra- and inter-person occlusion constraints are added only between extremities (not inner joints), as we are interested mainly in tracking them. As Figure 3 shows,
Fig. 4. Occlusion evidence over time. Partial overlap: if a period of no overlap (between tracklet pairs) occurs immediately before or after a period of partial overlap, then this is counted as evidence of occlusion for the partial overlap period, and simultaneous assignment is allowed. Full overlap: during periods of full overlap, the track of the occluded part is assumed to be unreliable and only one of the overlapping tracks can be assigned. No overlap: simultaneous assignment is always allowed during periods of no overlap.
two extremity tracklets can overlap either due to occlusion (first column) or detector false positives (second and third columns). In the former case, each tracklet corresponds to a real extremity, so both should be assigned; in the latter, only one tracklet should be assigned to a person. To be conservative, when two candidates overlap (as in Figs 1 and 3), they can be assigned simultaneously only if visual evidence suggests that they are occluding each other, e.g., if two initially non-overlapping candidates are observed to move toward each other and partially overlap. If two candidates $v \in D_i^p$ and $v' \in D_j^q$ overlap for a period of time but no evidence of occlusion is observed, then mutual exclusion constraints are automatically added between $x_i^p$ and $x_j^q$ during that time period, preventing $v$ and $v'$ from being simultaneously assigned.
We use a two threshold approach to determine if there is enough evidence to allow simultaneous assignment of overlapping tracklets, based on the following assumptions: 1) low-level trackers generally track extremities through partial occlusion if they were at some point observed in isolation, and 2) low-level trackers generally fail for extremities that become fully occluded. Using these assumptions, we apply two thresholds to the area of overlap between two tracked extremities to create three types of relationships between extremity pairs: full overlap, partial overlap, and no overlap. During a period of partial overlap, we allow two tracklets to be simultaneously assigned to two body parts only if there is a period of no overlap immediately before or after the partial overlap period, due to the first assumption. During a period of full overlap, tracklets are not allowed to be simultaneously assigned, due to the second assumption. During periods of no overlap we allow simultaneous assignment. Figure 4 illustrates this approach.
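The rule can be sketched as follows, assuming the per-frame overlap states have already been labeled none/partial/full by the two thresholds (which are not specified numerically here); the function name and interface are illustrative.

```python
# Sketch of the two-threshold occlusion-evidence rule: a contiguous run of
# partial overlap allows simultaneous assignment only if it borders a
# no-overlap period; full overlap never does; no overlap always does.

def allow_simultaneous(labels):
    """labels: per-frame 'none' / 'partial' / 'full' overlap states for a
    tracklet pair. Returns per-frame booleans: may both be assigned?"""
    allowed = [lab == 'none' for lab in labels]
    t = 0
    while t < len(labels):
        if labels[t] == 'partial':
            start = t
            while t < len(labels) and labels[t] == 'partial':
                t += 1
            # evidence of occlusion: a no-overlap period borders this run
            before = start > 0 and labels[start - 1] == 'none'
            after = t < len(labels) and labels[t] == 'none'
            for u in range(start, t):
                allowed[u] = before or after
        else:
            t += 1
    return allowed
```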
Intra- and inter-person occlusion constraints, both obtained by this approach, are illustrated in
Figures 2c and 2d. Note that Figure 2c shows the worst case scenario (all extremity pairs have
occlusion constraints), but in Figure 2d only some pairs have overlapping domains. A graph
Fig. 5. Example of how inner joint domains are obtained. Fixed head and hand locations each constrain the location of the elbow, (a) and (b), respectively, and the elbow must lie in their intersection (c). The resulting domain for each inner joint (color-coded) can be obtained this way given all extremities (d). If we have a silhouette (e), we further constrain locations (f).
with fully connected extremities is very unlikely, because length constraints cause domains to be spatially localized; thus, hand candidates generally do not overlap foot candidates, and a person only has constraints between immediate neighbors.
C. Temporal assignment tracking
An additional set of factors, $f_{trans}(x_i^{t-1}, x_i^t)$, links the extremities of people who appear in consecutive frames, enforcing temporal assignment consistency (see Figure 2e). Structural changes over time can result in a complex overall graph, so we approximate the solution for a sequence as follows. We first ignore temporal factors to obtain the exact top $k$ solutions in each frame by AND/OR Branch-and-Bound on the mixed network defined by $\mathcal{P}$ and $\mathcal{R}$. We then compute the transition probability between two frames as the product of all temporal factors between extremity nodes that appear in both frames, $f_{trans}(X^{t-1}, X^t) = \prod_p \prod_i f_{trans}(x_i^{p,t-1}, x_i^{p,t})$, and obtain the best sequence of assignments by dynamic programming (the Viterbi algorithm).
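A minimal sketch of this multi-frame step follows, assuming the per-frame top-$k$ log-scores and a transition log-probability function are given; both are placeholders for the quantities defined above.

```python
# Sketch of the Viterbi pass over per-frame top-k assignments: each frame
# contributes a per-candidate log-score; transitions between consecutive
# candidates contribute log-probabilities from the temporal factors.

def viterbi(frame_scores, log_trans):
    """frame_scores[t][i]: log-score of the i-th candidate assignment in
    frame t. log_trans(t, i, j): log transition from candidate i at frame
    t-1 to candidate j at frame t. Returns one candidate index per frame."""
    T = len(frame_scores)
    best = [list(frame_scores[0])]
    back = [[None] * len(frame_scores[0])]
    for t in range(1, T):
        cur, ptr = [], []
        for j, sj in enumerate(frame_scores[t]):
            v, i = max((best[t - 1][i] + log_trans(t, i, j), i)
                       for i in range(len(frame_scores[t - 1])))
            cur.append(v + sj)
            ptr.append(i)
        best.append(cur)
        back.append(ptr)
    # backtrack from the best final candidate
    j = max(range(len(best[-1])), key=lambda i: best[-1][i])
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return path[::-1]
```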
D. Node domains
Recall that for extremities, $X_e^p = \{x_i^p \mid i = h, a_1, a_2, f_1, f_2\}$, domains consist of detected candidate locations plus the unknown state. Instead of explicitly detecting internal joints, one advantage of our formulation is that a set of extremity candidates and hard length constraints yield compact feasible regions for each internal joint. Assume that we fix an extremity assignment; then, given an extremity location, any internal joint must lie inside of a circular region centered at that extremity, with radius equal to the maximum allowable distance imposed by the length constraints. To satisfy all constraints, an internal joint must lie in the intersection of the feasible regions given each extremity. Let $D_{ij}^p(v_j^p)$ denote the feasible domain of inner joint $x_i^p$ if extremity $x_j^p$ is fixed to a single location $v_j^p$; then the domain of $x_i^p$ is $D_i^p = \bigcap_{j \text{ s.t. } x_j \in X_e^p} D_{ij}^p(v_j^p)$, as depicted in Figure 5. However, $x_j^p$ is not yet known, so the internal joint domain given an extremity is the union over the set of potential extremity locations: $D_i^p = \bigcap_{j \text{ s.t. } x_j \in X_e^p} \left( \bigcup_{v \in D_j^p} D_{ij}^p(v) \right)$. We discretize inner joint regions into uniform grids to obtain feasible joint domains.
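A sketch of this construction on a discretized grid is given below, assuming candidate extremity locations and maximum allowable distances are provided; array names and shapes are illustrative.

```python
import numpy as np

# Sketch of inner-joint domain construction: for each constraining extremity,
# the joint must lie within a disk around SOME candidate location of that
# extremity (union over candidates); the feasible domain is the intersection
# of these unions, rasterized on a uniform grid.

def joint_domain(extremity_candidates, radii, grid_xy):
    """extremity_candidates: list (one per constraining extremity) of (M, 2)
    arrays of candidate (x, y) locations; radii: max allowable distance per
    extremity; grid_xy: (N, 2) grid points. Returns an (N,) boolean mask."""
    feasible = np.ones(len(grid_xy), dtype=bool)
    for cands, r in zip(extremity_candidates, radii):
        d = np.linalg.norm(grid_xy[:, None, :] - np.asarray(cands)[None, :, :],
                           axis=2)          # distances to every candidate
        feasible &= (d <= r).any(axis=1)    # union over candidates, then intersect
    return feasible
```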
III. AND/OR SEARCH WITH COSTLY FACTORS
Consider the problem as described in Section II-A, where the waist and neck joints, together with a width parameter, define the torso segment of the body. If there are $N_w$ candidate locations for the waist, and $N_n$ locations for the neck, then there are $N_w N_n$ candidate configurations for the torso segment. Thus, the factor that evaluates local image likelihoods of torso configurations would need to be evaluated $N_w N_n$ times to precompute all factor entries before processing. If $N_w$ and $N_n$ are large enough relative to the graphical model complexity (as they are in our experiments), the process of precomputing factor entries can be more time-consuming than the subsequent optimization problem. Since informed search (e.g., Branch-and-Bound) avoids a large part of the search space, lazy evaluation can reduce image evaluations. However, if the upper bounds needed to guide the search are computed using standard approaches, e.g., via MBE [3], all factor entries would be evaluated in the process. By using the careful bound computation approach we describe below, AND/OR search spaces can be used to reduce evaluation cost while obtaining the exact global solution. Unlike other approaches, such as Belief Propagation (BP), whose performance degrades as determinism (probabilities close or equal to 0 or 1) is introduced,^3 AND/OR Branch-and-Bound leverages determinism present in the network.
We first summarize AND/OR search spaces [1], mixed networks [2], and AND/OR Branch-and-Bound [3], and then describe how to efficiently compute upper bounds directly from the data while evaluating few costly factor entries.
A. AND/OR search spaces
A graphical model $\mathcal{P} = (X, D, F)$ is defined by a set of variables $X = \{x_1, \ldots, x_n\}$, the domain $D_i \in D$ of each variable, and a set of functions (or factors) $F$ defined on subsets of $X$. The primal graph of a graphical model is the undirected graph $G = (V, E)$ whose nodes
^3 See [27] for an analysis of BP in the presence of determinism or near-determinism.
$V$ are the variables, $X$, and whose edge set $E$ is formed by connecting any two nodes if their corresponding variables appear together as arguments in one or more of the factors in $F$ (see Fig 6a). A pseudo tree $\mathcal{T} = (V, E')$ is a directed rooted tree defined on all of the graph nodes; any arc of $G$ which is not included in $E'$ is a back-arc (see Fig 6b). Given this pseudo tree, the associated AND/OR search tree $S_{\mathcal{T}}$ has alternating levels of OR and AND nodes, labeled $X_i$ and $\langle X_i, x_i \rangle$, respectively, where $X_i$ is one of the variables in $X$, and $x_i$ is a value from $D_i$. The root of the search tree is an OR node labeled with the root of $\mathcal{T}$. Each child of an OR node $X_i$, labeled $\langle X_i, x_i \rangle$, represents an instantiation of $X_i$ with a value $x_i$. The children of each AND node $\langle X_i, x_i \rangle$ are the children of $X_i$ in the pseudo tree $\mathcal{T}$. Thus, depth first traversal of the AND/OR search space begins with the root node, and alternates between choosing from possible assignments for a variable at OR nodes and decomposing the search into independent sub-problems at AND nodes. A solution is a subtree $T \subseteq S_{\mathcal{T}}$, and is defined as follows: (1) it contains the root node, (2) each OR node in $T$ must have exactly one of its successors in $T$, and (3) each non-terminal AND node in $T$ must have all of its successors in $T$ (see Fig 6c).
Each node $X_i$ of the pseudo tree has an associated bucket $B_{\mathcal{T}}(X_i)$ containing each factor in $F$ whose scope includes $X_i$ and is fully contained along the path from the root down to $X_i$. During the search process, at each AND node $n = \langle X_i, x_i \rangle$, the factors in $B_{\mathcal{T}}(X_i)$ can be evaluated, as all their arguments have been instantiated along the path from the root to node $n$. During depth first traversal of the graph, factors in a node’s bucket are evaluated when the node is first opened. After a node’s subtree has been evaluated, the partial solution of that subtree is propagated to its predecessor. For a max-product task such as ours, the value at an OR node is the maximum of its children, and the value at each AND node is the product of the values of its children and the values of factors in its bucket $B_{\mathcal{T}}(X_i)$.
The AND/OR search graph can be implicitly searched by caching solutions at OR nodes. A node $n$ is cached based on values assigned to its parent set, which is defined as the union of all ancestors of $n$ in the pseudo-tree connected by an edge in the primal graph to $n$ or its descendants in the pseudo-tree. The graph size and search complexity are determined by the maximum parent set size in the graph. Letting the maximum parent set size be the induced width (or treewidth) and denoting it by $w$, graph size and hence search complexity are $O(nk^w)$, where $k = \max_i |D_i|$ is the maximum domain size. Parent sets are directly affected by the ordering of nodes in the search pseudo-tree, so it is important to choose an ordering which yields the smallest possible parent sets. This can only be done efficiently by greedy approximation algorithms such as Min-Fill [3]. Figures 6b and 6d respectively show the parent sets in square brackets for each variable, and the implicit AND/OR graph that results.
Fig. 6. Sample AND/OR search space: (a) network description and primal graph, (b) pseudo-tree (with the parent set used for defining OR context in square brackets, e.g., [A] is the parent set for node B and [A,C] is the parent set for node D), (c) AND/OR search tree, with solution highlighted, (d) AND/OR search graph, in which nodes are implicitly merged by caching previously reached solutions, (e) AND/OR search tree and solution of our graphical model with two candidate locations for elbows, knees, waist, and neck, and a single candidate location for all other joints (we are using a limited number of candidates for display only; joints are round nodes, their candidate assignments are square nodes), (f) corresponding AND/OR search graph, (g) corresponding domains in image coordinates, with color coded joint domains and the solution highlighted in red.
B. Mixed networks
Given a constraint network $\mathcal{R} = (X, D, C)$, where $X$ and $D$ are defined as above, and $C$ is a set of deterministic constraints defined on subsets of $X$ which allow or disallow certain tuples of variable assignments, AND/OR search can be modified to check constraints of $\mathcal{R}$ while maximizing $\mathcal{P}$. This can be done by replacing the primal graph described above with the union of the primal graphs of $\mathcal{P}$ and $\mathcal{R}$. During the search process, constraints (and constraint processing techniques [28]) can be applied to prune inconsistent paths early (see [2] for details). In our case, it is beneficial to avoid exploring portions of the search tree which have no solutions by performing backtrack free search. This can be done by preprocessing constraints using Bucket
Elimination (BE) [29]; in cases where BE is infeasible, Mini-bucket Elimination (MBE) [30] can be used to reduce the number of dead ends explored.
C. Branch-and-Bound
Branch-and-Bound [3] can further reduce the search space and complexity required to obtain the optimal solution. Branch-and-Bound search involves updating the lower bound on the best solution each time a solution sub-tree is traversed. The lower bound, coupled with upper bounds on the best extensions of a partial solution, can be used to prune the search space. In particular, a node is opened only if the upper bound on extensions of the current partial solution is greater than the current lower bound. Unlike the lower bound, which is computed during search, the upper bound on the value of a subtree is obtained from the graphical model before search, using the process of Mini-Bucket Elimination (MBE) [30], [3], which partitions buckets with large parent sets into smaller mini-buckets. This partitioning ensures that the approximate best solution has a score greater than or equal to that of the original best solution, but is cheaper to compute.
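The pruning test itself reduces to a one-line comparison; a sketch, with the MBE-style bound standing in as a placeholder:

```python
# Sketch of the Branch-and-Bound rule layered on the traversal above: a node
# is opened only if an admissible upper bound on the best extension of the
# current partial solution exceeds the best complete solution found so far.

def should_open(partial_score, ub_rest, best_so_far):
    """partial_score: product of factors instantiated so far; ub_rest: upper
    bound (e.g., from MBE) on everything not yet instantiated."""
    return partial_score * ub_rest > best_so_far
```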
D. Branch-and-Bound with costly factors
A disadvantage of Mini-Bucket Elimination is that it evaluates most if not all factor entries. This is usually acceptable, but in our case accessing the value of a factor entry is an expensive operation. If the cost of evaluating one entry of factor $f_j \in F$ is $\alpha_j$ for $j = 1, \ldots, m$, then the cost of evaluating all entries of $f_j$ is $\beta_j = \alpha_j \prod_{X_i \in \text{scope}(f_j)} |D_i|$. To avoid evaluating all entries, we first sort factors by $\beta_j$, and then iteratively remove the largest factors until the total remaining evaluation cost is below some ratio $r$. To ensure that any solution of the resulting graphical model bounds the original from above, we must replace each removed factor by a constant that bounds all of its entries. By construction, the factors described hereafter are always bounded by 1, so upper bounds obtained by MBE on the reduced problem can be used to obtain the exact global solution, while ensuring that few entries need to be evaluated in the process.
Our lazy evaluation approach to dealing with costly factors is summarized by the algorithm in Figure 7. The first step is to compute the variable ordering needed to construct the pseudo tree $\mathcal{T}$. Constraint processing is then performed to yield a new set of constraint factors $C'$. In our implementation, constraint processing involves propagating hard constraints toward the root of the pseudo tree (using MBE), producing a new set of constraints that are consistent with
Inputs:
(1) probabilistic and constraint networks, $\mathcal{P} = (X, D, F)$ and $\mathcal{R} = (X, D, C)$, respectively
(2) factor entry evaluation costs, $\alpha_j$ for each $f_j \in F$
(3) upper bound cost ratio, $r$
Outputs: Top $k$ assignments $x^1, \ldots, x^k$, ranked by $\mathcal{P}$ s.t. $x^i \in \mathcal{R}$.
1: compute variable ordering and pseudo tree $\mathcal{T}$ using Min-fill heuristics
2: process constraints, $C' \leftarrow \text{process}(C, \mathcal{T})$
3: $F_{ub} \leftarrow F$, $\beta \leftarrow \sum_j \beta_j = \sum_j \alpha_j \prod_{X_i \in \text{scope}(f_j)} |D_i|$
4: while $\sum_{j \text{ s.t. } f_j \in F_{ub}} \beta_j > r\beta$ do
5:   $f_j \leftarrow \arg\max_{f_j \in F_{ub}} \beta_j$
6:   $F_{ub} \leftarrow F_{ub} \setminus \{f_j\}$
7: end while
8: compute upper bounds, $U \leftarrow \text{bounds}(C', F_{ub})$, by MBE or AO traversal
9: perform AND/OR Branch-and-Bound on $\mathcal{R}$ and $\mathcal{P}$ using constraints $C'$ and bounds $U$
10: return top $k$ assignments
Fig. 7. AND/OR Branch-and-Bound with Lazy Evaluation of Costly Factors
the initial pseudo tree $\mathcal{T}$. The constraint processing step does not change the pseudo tree, but allows the algorithm to prune paths earlier during depth-first search. In the next steps, a subset of factors $F_{ub} \subseteq F$ is chosen for upper bound computation to ensure that most factor entries are not evaluated in computing the upper bounds $U$. Finally, AND/OR Branch-and-Bound search is performed on the mixed network, using the processed constraints $C'$ and the upper bound $U$. The top-$k$ solutions are obtained by maintaining pointers to the top-$k$ partial solutions obtained in each subtree during depth first search, merging and pruning the list of solutions while exiting explored subtrees. The constraint processing step may be redundant if it is performed using MBE and upper bounds are also computed by MBE using the same mini-buckets. However, it is not redundant if upper bounds are computed by AND/OR traversal, as constraint processing will allow AND/OR traversal to encounter dead ends earlier. Similarly, it is not redundant if MBE is optimized to the binary nature of constraints to reduce memory and computational requirements, enabling MBE to perform less approximation (by using larger mini-buckets). Also, note that
upper bounds, which are computed using MBE in [3], can also be computed by traversal of the AND/OR search space of the reduced problem. The traversal is more computationally intensive than MBE, but touches fewer entries in factors from $F_{ub}$, leading to an overall gain in efficiency.
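Steps 3-7 of Figure 7 (selecting which factors participate in bound computation) can be sketched as follows; since every factor here is bounded by 1, a dropped factor is implicitly replaced by that constant bound. Names and the container interfaces are illustrative.

```python
# Sketch of steps 3-7 of Figure 7: iteratively drop the most expensive
# factors from the upper-bound computation until the remaining total
# evaluation cost falls below a ratio r of the full cost.

def select_bound_factors(factors, alpha, domain_sizes, r):
    """factors: list of factor ids; alpha[f]: per-entry evaluation cost;
    domain_sizes[f]: product of domain sizes in f's scope.
    Returns the subset F_ub to use for bound computation."""
    beta = {f: alpha[f] * domain_sizes[f] for f in factors}
    total = sum(beta.values())
    f_ub = set(factors)
    while f_ub and sum(beta[f] for f in f_ub) > r * total:
        f_ub.remove(max(f_ub, key=lambda f: beta[f]))  # drop costliest factor
    return f_ub
```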
IV. IMPLEMENTATION DETAILS
Various approaches can be used to detect person, hand, and foot candidates [13], [18], [21],
[31], [32], [33], or link them into tracklets [31], [34], [35], [36]. Similarly, various approaches
exist for evaluating body segment image likelihoods [7], [6], [32], [37]. Because they require
little training or parameter tuning, we use silhouette-based approaches based on [21], [31] for
person and extremity detection, [34], [35] for tracking, and [6] for image likelihood evaluation.
A. Candidate tracklets
Before detecting and tracking body parts, we adopt the common strategy of first tracking people’s bounding boxes ([19], [31]). Towards this end, we first detect moving foreground pixels using background subtraction [38]. Once foreground pixels are obtained, we detect people and assemble them into tracklets using a data association approach based on [34]. These tracklets provide for each person an estimate of the neighborhood and scale at which to search for potential hands and feet, which are also linked into tracklets. In the following subsections, we describe the detection association approach, assuming that a set of detections is provided, and then provide a brief overview of the human and extremity detection and post-processing steps we perform.
1) Tracking by association: The algorithm presented in [34] consists of three stages, each of which involves linking the tail of one tracklet (a sequence of already linked detections) to the head of another tracklet. Our scenes have no static occluders and entries/exits are not of interest; consequently, we use a simplified approach, iteratively applying the Hungarian algorithm to match tails to heads such that the product of the link probabilities is maximized, considering longer time gaps between tracklets at each iteration. We avoid using motion-based features that assume constant velocity since their performance suffers when people change directions often (as they would in a basketball game); we instead use optical flow [39] based features for linking probabilities. We model each detection as a tuple $D = (c, R, L)$, containing the center $c$, a neighborhood $R$ around $c$, and a set of pairs $L = \{(l_i, d_i)\}$. Here, $l_i$ is a 2D image location inside $R$ at distance $d_i$ from the center $c$. To compare two detections $D_1$ and $D_2$ in frames
$t_1$ and $t_2$, respectively, we propagate the $l_i$ locations of the earlier detection $D_1$ using flow as $l_i^t \leftarrow l_i^{t-1} + u_{t-1,t}(l_i^{t-1})$, until we reach frame $t_2$, where $u_{t-1,t}(l_i^{t-1})$ is the flow vector at location $l_i^{t-1}$ from frame $t-1$ to frame $t$. The linking probability $P_{link}$ is given by

$$P_{link}(D_1, D_2) \propto \frac{1}{N_{1,2}} \sum_{\hat{l}_i^{t_2} \in R_2} \exp\left(-\frac{(d_i^{t_1} - \|\hat{l}_i^{t_2} - c_2\|)^2}{2\sigma^2}\right).$$

The sum is over all pixels from $D_1$ propagated to frame $t_2$ whose propagated locations $\hat{l}_i^{t_2}$ (the hat indicates that these are predicted, instead of actual detected pixel locations) are inside region $R_2$, and $N_{1,2}$ is the size of the union of the propagated pixels from $D_1$ and the pixels of $D_2$. The intuition behind this measure is that if $D_1$ and $D_2$ are the same detection, when propagated from $t_1$ to $t_2$, pixels inside $R_1$ will move to $R_2$ and their distance from the center will remain the same, leading to a value of 1 for the above equation. The exponential term penalizes pixels that change distance from the detection center while providing some rotational invariance.
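A sketch of this computation, assuming the flow-propagated pixels and region membership have been computed beforehand (all array names are illustrative):

```python
import numpy as np

# Sketch of the linking probability: pixels of detection D1 are propagated
# by optical flow to frame t2; each propagated pixel that lands inside
# region R2 votes according to how well it preserves its original distance
# to the detection center.

def p_link(propagated, d1_dists, in_r2, c2, n_union, sigma):
    """propagated: (N, 2) flow-propagated pixel locations from D1;
    d1_dists: (N,) original distances to D1's center; in_r2: (N,) boolean
    mask of pixels landing inside R2; c2: center of D2;
    n_union: |union of propagated pixels and D2's pixels|."""
    d2 = np.linalg.norm(propagated - np.asarray(c2), axis=1)
    votes = np.exp(-((d1_dists - d2) ** 2) / (2 * sigma ** 2))
    return votes[in_r2].sum() / n_union
```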
2) Person tracklets: Person detections are obtained from silhouette contour peaks, similar to [31], though we do not perform silhouette-based tracking. The neighborhood region $R$ used for association consists of pixels whose pathlength (or inner distance) from the contour head point through the foreground mask is less than $.15h$, where $h$ is the person’s height approximated from the median $y$-coordinate of the contour points below the head.^4 Tracklets are formed iteratively by flow-based linking with tracklet gaps of $\{1, 8, 16, 40\}$ frames. Gaps in tracklets are filled by the mean location of the pixels propagated from region $R_1$ of one tracklet’s tail to region $R_2$ of the other tracklet’s head. To handle longer gaps, we perform another iteration allowing gaps up to 160 frames, but flow becomes unreliable for long gaps, so the linking probability is based on the $\chi^2$ distance measure between the two RGB histograms accumulated for each tracklet. We fill gaps between linked tracklets by meanshift tracking [40], keeping approximated locations only if meanshift tracking reaches one tracklet from the other (either forward or backward in time).
3) Extremity tracklets: Hands and feet are represented as circular regions at high curvature locations on silhouette or skin blob contours (for hands only), within some distance from the head. The centers are near the wrist and ankle, with radii given as a fixed fraction of the person height ($r_{hand} = \frac{h}{30}$, $r_{foot} = \frac{h}{15}$). Curvature is approximated by the angle between the two segments
^4 Only points with a horizontal displacement within a fixed ratio of the vertical displacement from the head (.3 in our experiments) are used for estimating height.
Fig. 8. Body segment likelihoods are split into four regions, defined by their distance from the body segment.
formed by a contour point and its two neighbors of equal distance $\delta$ forward and backward along the contour. After non-maximal suppression, high curvature points are binned into hand and foot detections based on distance from the head, and centers are set to the centroid of pixels within distance $2r$ of the contour point, where $r$ is the hand/foot radius. Detections are linked into tracklets using maximum gaps of $\{1, 8, 16, 40\}$ frames, filling in gaps using pixels from region $R_1$ of the tail propagated to region $R_2$ of the head.
B. Image likelihoods and priors
1) Body segment likelihood: Our body segment likelihoods are based on those described by Felzenszwalb et al. [6]. Two end-points and a width parameter define the rectangular region that roughly represents the body segment and its nearby neighborhood. The rectangular region is divided into multiple sub-regions $R_r$ ([6] used two sub-regions). The likelihood of a body part given the two end-points is then $f_{skel}(x_i, x_j, I, \theta) = \prod_r q_r^{c_r}(1 - q_r)^{a_r - c_r}$, where $q_r$ is the foreground pixel probability for region $R_r$, $a_r$ is the area of region $R_r$, and $c_r$ is the foreground pixel count in region $R_r$. We use up to four regions for all body segments, defined by the distance from the center axis of a body segment, with manually selected parameters $q_1 = .99$, $q_2 = .9$, $q_3 = .5$, and $q_4 = .3$ (ordered from center-most to outer-most regions; see figure 8); for upper arm, lower arm, and head segments, which are thinner, we use only regions $R_2$ and $R_4$.
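A sketch of this likelihood follows, with extraction of the sub-regions from the two endpoints elided; the $q$ values follow the text, and in practice log-space evaluation is advisable for large regions.

```python
# Sketch of the body segment likelihood: each rectangle sub-region r
# contributes q_r^{c_r} (1 - q_r)^{a_r - c_r}, where c_r is the foreground
# pixel count and a_r the region area. Region extraction is assumed done.

Q = [0.99, 0.9, 0.5, 0.3]   # q values from the text, center-most to outer-most

def f_skel(fg_counts, areas, q=Q):
    """fg_counts[r], areas[r]: foreground count and area of sub-region r.
    Note: may underflow for large regions; use logs in a real implementation."""
    like = 1.0
    for qr, cr, ar in zip(q, fg_counts, areas):
        like *= qr ** cr * (1 - qr) ** (ar - cr)
    return like
```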
2) Appearance similarity: Appearance similarity factors represented by dotted lines in figure 2a are computed as $f_{app}(x_i, x_j, I, \theta) = \exp(-c_{app} d_{KL})$, where $d_{KL}(x_i, x_j)$ is the symmetric Kullback-Leibler divergence between appearance models of $x_i$ and $x_j$. Appearance is represented by normalized RGB triples $(R/S, G/S, S/3)$, where $S = R + G + B$, sampled around the extremity locations, and divergence is computed efficiently by Kernel Density Estimation [41].
3) Temporal motion consistency: Temporal factors reuse the tracklet linking probability $P_{link}$ described above, modified to consider transitions to and from the unknown state and to allow for more flexibility for cases where flow fails (e.g., very fast motions). For the case where both $x_i^{t-1}$ and $x_i^t$ are actual candidates (i.e., not unknown) the factor is defined as

$$f_{trans}(x_i^{t-1}, x_i^t) = \max\left(c_{switch}, \frac{N_{lost}}{N_{1,2}} e^{-\frac{\|c_i^{t-1} - c_i^t\|^2}{2\sigma^2}} + P_{link}(x_i^{t-1}, x_i^t)\right)$$

where $c_i^t$ is the center of $x_i$ at time $t$, and $N_{lost}$ is the number of pixels for which flow could not be estimated (in [39], this occurs when forward and backward flow do not match, usually due to fast motion or occlusion). We truncate the factor to a minimum of $c_{switch}$ to make large jumps equally likely. The switch from a known candidate to the unknown state or vice versa is given a cost of $\sqrt{c_{switch}}$, so a large jump can be seen as a switch from the tracklet of $x_i^{t-1}$ to unknown followed by a switch to the tracklet of $x_i^t$.
4) Priors: We do not use strong pose priors, to avoid becoming pose or activity specific. In particular, we assume a uniform distribution on body segment length as long as hard length constraints are satisfied. We do, however, impose simple priors on the torso and head. The torso prior penalizes torso segments that deviate substantially from vertical (we generally detect and track people who are close to a standing position). The head prior penalizes acute neck angles to coincide with physical constraints; this prior involves three joints (head, neck, and waist), but since we have a single head location in our approach, it is effectively a pair-wise prior.
V. EXPERIMENTS
We demonstrate our approach on three datasets: outdoor scenes containing a group of three to five interacting people; one-on-one basketball in outdoor scenes;^5 and part of the publicly available HumanEva I dataset [42]. While our focus is on tracking extremities of multiple people efficiently, we use the single person HumanEva I dataset to show that our model compares well to state-of-the-art approaches, observing only a small performance drop in the absence of large training sets or activity models. For all experiments below, parameters were set manually but they were not tuned. That is, we used the same parameters for all datasets (e.g., $c_{unk}$, $c_{switch}$, occlusion thresholds, body part parameters), and all body part likelihoods were weighted equally (i.e., all had a weight of 1). Table I provides dataset statistics, Figure 9 shows qualitative results, and tables II and III quantify performance in terms of average pixel error and precision-recall.
A. Performance measures
We measure system performance in two ways: average pixel error and precision-recall measures. Both measures require that detected people and their extremities are associated to ground
^5 We will make the outdoor and basketball datasets publicly available.
TABLE I
DATASETS

Dataset      Frames   Height (mean±std)   Annot. poses   People/frame
HumanEva I   5533     292 ± 34            4029           1
Group        1429     128 ± 14            5497           3-5
Basketball   101933   133 ± 27            2499           2

TABLE II
AVERAGE PIXEL ERROR. FOR THE HUMANEVA DATASET, WHERE WE INTERPOLATED MISSING HANDS/FEET TO BE ABLE TO COMPARE WITH EXISTING WORK, WE SHOW THE ERRORS BY VISIBILITY STATUS. FOR THE OTHER TWO DATASETS, WE SHOW THE AVERAGE ERROR ONLY FOR DETECTED HANDS AND FEET, AND SPECIFY WHAT PERCENTAGE OF ALL ANNOTATED EXTREMITIES WERE UNKNOWN, GROUPED BY VISIBILITY STATUS.

                                              Overall   Hands   Feet
HumanEva    Average over all errors           12.65     13.64   11.66
            Fully and partially visible only  10.86     11.17   10.55
            Fully visible only                10.28     10.39   10.18
group       Average error                     2.87      2.34    3.17
            Unknown, fully visible            8.0%      10.2%   5.9%
            Unknown, partially visible        7.1%      10.0%   4.2%
            Unknown, not visible              23.4%     35.6%   11.4%
basketball  Average error                     7.42      8.56    6.68
            Unknown, fully visible            15.5%     20.6%   10.4%
            Unknown, partially visible        6.4%      7.6%    5.2%
            Unknown, not visible              11.2%     19.0%   3.5%
truth people and extremities. We do this hierarchically, first fixing associations between detected and ground truth people, and then computing extremity associations. First, we match detected person bounding boxes and ground truth tracks (one-to-one), minimizing the average distance between head locations. Our framework does not differentiate between left and right hands or feet, so average pixel errors are computed from the maximum matching between detected and ground truth extremities that minimizes average error. The average pixel errors for each dataset are visually displayed in Figure 10. Precision and recall are computed in a similar way, but we also allow detected hands (or feet) of a detected person to match the ground truth hands (or feet) of other nearby people, a stricter evaluation since a detected hand (or foot) that matches the ground truth hand (or foot) of a wrong ground truth person will be counted as a false positive. A detected hand or foot is considered a false positive if the distance to its matching ground truth location is greater than one-tenth of the ground truth person height. If a detected person does not match a ground truth person, all detected extremities will be counted as false positives; all unmatched extremities of a ground truth person are false negatives.
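The one-to-one matching steps can be sketched with the Hungarian algorithm as implemented in SciPy; this is an illustration of the evaluation protocol, not the authors’ code, and the same routine can serve both for person boxes (via head locations) and for extremities.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of the one-to-one evaluation matching: detected and ground truth
# locations are paired so as to minimize total (hence average) distance.

def match(detected, ground_truth):
    """detected: (M, 2) array, ground_truth: (N, 2) array of 2D locations.
    Returns a list of matched index pairs (detected_idx, gt_idx)."""
    cost = np.linalg.norm(detected[:, None, :] - ground_truth[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(rows, cols))
```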
TABLE III
PRECISION-RECALL EVALUATION OF CANDIDATE TRACKLETS, SINGLE- AND MULTI-FRAME ASSIGNMENTS. THE FOLLOWING ABBREVIATIONS ARE USED: P–PRECISION; RNV–RECALL AS A RATIO OF ALL ANNOTATIONS, EVEN IF THEY ARE NOT VISIBLE; RPV–RECALL AS A RATIO OF AT LEAST PARTIALLY VISIBLE EXTREMITIES; RV–RECALL AS A RATIO OF FULLY VISIBLE EXTREMITIES; F1{NV, PV, V}–F1 SCORES COMPUTED WITH THE CORRESPONDING RECALL VALUE. NOTE THAT THE SINGLE- OR MULTI-FRAME ASSIGNMENT PROCESS CAN ONLY REMOVE FROM THE CANDIDATE LIST, SO THE RECALL VALUES FOR THE CANDIDATES PROVIDE AN UPPER BOUND ON THE RECALL AFTER ASSIGNMENT.

                                         hands                                  feet
                        P    Rnv  Rpv  Rv   F1nv F1pv F1v   P    Rnv  Rpv  Rv   F1nv F1pv F1v
HumanEva I
  candidates            0.69 0.75 0.91 0.96 0.72 0.78 0.80  1.00 0.77 0.91 0.95 0.87 0.95 0.97
  assign., single-frame 0.96 0.74 0.90 0.95 0.84 0.93 0.96  1.00 0.76 0.90 0.94 0.86 0.95 0.97
  assign., multi-frame  0.96 0.74 0.90 0.95 0.84 0.93 0.96  1.00 0.76 0.90 0.94 0.86 0.94 0.97
group
  candidates            0.51 0.42 0.67 0.80 0.46 0.58 0.62  0.49 0.80 0.88 0.93 0.61 0.63 0.64
  assign., single-frame 0.90 0.40 0.64 0.77 0.56 0.75 0.83  0.97 0.76 0.85 0.90 0.85 0.91 0.93
  assign., multi-frame  0.92 0.41 0.65 0.79 0.57 0.76 0.85  0.98 0.77 0.85 0.90 0.86 0.91 0.94
basketball, automatic
  candidates            0.51 0.36 0.46 0.53 0.42 0.49 0.52  0.68 0.68 0.72 0.78 0.68 0.70 0.73
  assign., single-frame 0.74 0.33 0.43 0.50 0.46 0.55 0.59  0.85 0.62 0.65 0.71 0.72 0.74 0.77
  assign., multi-frame  0.74 0.34 0.44 0.50 0.46 0.55 0.60  0.85 0.62 0.65 0.71 0.72 0.74 0.78
basketball, interactive
  candidates            0.58 0.43 0.55 0.63 0.50 0.57 0.61  0.75 0.76 0.80 0.86 0.76 0.77 0.80
  assign., single-frame 0.80 0.42 0.53 0.61 0.55 0.64 0.69  0.89 0.71 0.74 0.81 0.79 0.81 0.85
  assign., multi-frame  0.81 0.42 0.54 0.61 0.55 0.65 0.70  0.90 0.72 0.75 0.82 0.80 0.82 0.86
Fig. 9. Sample results for datasets with multiple people. All videos are obtained from a static camera; however, some images are cropped to highlight the person of interest.
B. Datasets
HumanEva I dataset. We processed all frames of S2 Box 1, S2 Gestures 1, S2 Walking 1, S3 Box 1, S3 Gestures 1, and S3 Walking 1 (700 to 1100 frames per sequence), all from camera 2. The video resolution is 640 by 480 and people are on average 292 pixels tall. Errors are
Fig. 10. Average errors from table II in context (circle radius = average error; panels: HumanEva, Group, Basketball).
computed from the ground truth motion capture data projected onto the single camera image. To compare to previously reported average error measurements, we automatically filled in unknown hand and foot locations as a post-processing step using one of two approaches: for short periods where hands/feet are not known, we interpolate between known locations using cubic spline interpolation; for long periods, we set the location to a standard position for the unknown part along the vertical axis of the person’s bounding box ($.5h$ and $2r_{foot}$ from the bottom of the bounding box for hands and feet, respectively). Table II reports the results on the videos in their original size of 640 x 480. The errors are competitive with the state-of-the-art techniques tabulated by Martinez et al. in [43], where the authors themselves report average error rates of 13.2 pixels using bipedal motion constraints for the legs. The best results were obtained by [4], which were estimated by Martinez et al. from the 3D errors to be about 5 to 7 pixels in 2D. This approach was activity specific and assumed cyclic motion. Other approaches had errors between 10 and 14 pixels [44], [45], and all were activity (and sometimes even view) specific. Our approach was not trained on the HumanEva I dataset, nor on any activity or view, but was still able to obtain comparable results (average errors for feet are 11.66 pixels). Since our algorithm does not train on poses and is not activity specific, we do not expect it to accurately hallucinate positions of occluded extremities, so we annotated hand and foot ground truth as fully visible, partially visible, and not visible (fully occluded). Table II shows errors for all extremities; performance is best when extremities are not occluded, but the error introduced by interpolation is small.
Outdoor group dataset. The outdoor dataset of multiple interacting people includes actions such as hand-shakes, drinking from mugs, and gestures, and contains periods of partial and full inter-person occlusion. The video resolution is 480 by 270 at a frame-rate of 30fps, with an average of 128 pixels of vertical resolution for each person. In most sequences, people wore similar clothing, making the task particularly difficult in the presence of inter-person occlusion. We annotated 1324 frames of this dataset (5497 pose instances), and as table II shows, our results were very good, with average errors of roughly 3 pixels. For these videos, we do not interpolate or guess unknown hand/foot locations; instead, we report how often (as a percentage) parts are declared unknown by our system. As table II indicates, most unknown extremities are missed because they were fully occluded. Hands are occluded more often than feet (e.g., hands in pockets, hands occluded by the torso), so they are declared unknown more often than feet. Since they are also smaller, they are missed more often even when they are visible.
Outdoor one-on-one basketball dataset. The one-on-one basketball dataset contains videos of roughly 100,000 frames of 960 by 540 video at 30fps (about 1hr). These videos contain a large variety of natural poses that occur during a game, and include rapid motions and severe occlusion, as the defensive player often maintains close proximity to the offensive player. Players are on average 132 pixels tall. Hands are more difficult to detect in these videos due to their small size and relatively fast motions. Because some of the sequences are longer than 25,000 frames, automatic head tracking fails a number of times. For this reason, we report results with both fully automatic and partially annotated bounding box tracking. Partially annotated results are obtained by allowing the user to provide input during the person bounding box tracking step to remove player track merges/switches. This does not require the user to add missing detections, i.e., person bounding boxes are automatically detected by the person detector, and manual interaction involves only the association of a player identity to these detections to correct tracker mistakes. The amount of user interaction is minimal: for the seven sequences of roughly 100,000 frames, a total of 168 tracklets were obtained using the fully automatic approach, whereas manual intervention ensures that only 14 tracks are obtained. For evaluation, every 100th frame of each sequence is manually annotated with ground truth extremity locations. Hand and foot average errors (with user interaction) are shown in table II; the performance hit caused by incorrectly associating tracklets is small, as can be seen in table III.
C. Inference approach
We evaluate our lazy evaluation inference approach by performing inference using (Loopy) Belief Propagation (BP), Variable (or Bucket) Elimination (VE) [29], and AND/OR Branch-and-Bound (AO). For BP, we use a C/C++ version of the code provided by [46]; we also used the open source library libDAI and obtained similar results [47]. We also compare various ways of checking hard constraints before evaluating probabilistic factor entries. Constraints are either
ignored (only in the BP case), partially considered (P), or fully processed (F). Constraints are ignored by fully populating all factor tables with actual factor values as a preprocessing step. They are partially considered by first checking any immediate constraints before evaluating a probabilistic factor (by immediate, we mean any constraint functions defined on the same variable subset as the factor in question, and in the AND/OR case any other constraints instantiated along the search). Constraints are fully processed as described in section III.
For the AND/OR approaches, we compare three methods of precomputing upper bounds from factors (lower bounds are obtained dynamically during search). The straightforward approach assumes a constant upper bound (C) based on knowledge of how factors are constructed; in our case no factor has a value greater than 1, so we can assume this upper bound without actually evaluating any factor entries. We also compare Mini-bucket Elimination (MBE) [30], [3] to a full traversal of the reduced AND/OR search space (AO); in both of these cases, the most costly factors are removed as determined by the parameter r to give a reduced search space. Thus the methods that we compare are (BP), (BP,P), (VE,P), (AO,P,C), (AO,P,MBE), (AO,F,C), and (AO,F,AO). The naming pattern uses (method, constraint, bounds) triplets to describe the general inference, constraint handling, and upper bound approaches, respectively. We omit the constraint and bounds parts of the triplets when they are ignored. Also, note that because our full constraint processing consists of performing MBE on the constraints only (a bottom-up approach, with respect to the pseudo tree), (AO,P,MBE) and (AO,F,MBE) are equivalent to each other; however, other constraint processing techniques (such as propagating constraints top-down with respect to the pseudo tree) may be appropriate for the case where upper bounds are computed with MBE. Note that all of these inference approaches except for BP and BP,P obtain the exact globally optimal solution in each frame. Because we are dealing with loopy graphical models, Belief Propagation is not guaranteed to converge, and the determinism (zeros and ones as probabilities) in our problems further reduces the convergence rate. In our experiments, the BP-based approaches often did not converge, leading to solutions which did not satisfy the hard constraints or attain the globally optimal solution. Exact approaches work for our problem because we apply occlusion constraints at the extremity level, not between internal nodes, sufficiently reducing the overall treewidth to allow for exact inference. We perform these experiments only on the group dataset, as it contains the most complex interactions.
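The pruning logic behind these bound comparisons can be illustrated with a simplified Branch-and-Bound over a fixed variable ordering (a plain OR search, not the full AND/OR decomposition; all names are hypothetical). The constant-bound variant (C) corresponds to setting every precomputed bound to 1, while MBE or the reduced AO traversal would supply tighter values.

```python
# Simplified Branch-and-Bound sketch (hypothetical names). ub[i] is a
# precomputed upper bound on the best product achievable by variables
# i..n-1; the constant-bound variant (C) sets ub[i] = 1.0 everywhere.

def branch_and_bound(domains, score_var, ub):
    best = [0.0, None]  # best score found so far and its assignment

    def search(i, partial, score):
        if i == len(domains):
            if score > best[0]:
                best[0], best[1] = score, list(partial)
            return
        if score * ub[i] <= best[0]:
            return  # prune: optimistic completion cannot beat the incumbent
        for v in domains[i]:
            s = score_var(i, v, partial)  # 0.0 if a hard constraint fails
            if s > 0.0:
                partial.append(v)
                search(i + 1, partial, score * s)
                partial.pop()

    search(0, [], 1.0)
    return best[0], best[1]
```

Tighter bounds ub[i] prune earlier, but cost more to precompute; this is exactly the trade-off studied below.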
Inference vs. evaluation. Figure 11 shows the average time spent evaluating image likelihoods
Fig. 11. Inference and likelihood evaluation times: average time per frame by inference approach (outdoor group dataset only). VE and AO approaches are globally optimal and yield the same solution. BP almost never converges, due to the hard constraints, and does not yield valid solutions.
[Two panels plot, against the total number of factor entries (thousands): evaluated entries (thousands), and total time relative to the minimum time, for BP; BP,P; VE,P; AO,P,C; AO,P,MBE; AO,F,C; and AO,F,AO.]
Fig. 12. Benefits of lazy evaluation: number of image likelihoods evaluated vs. number of total possible image likelihoods. Inference speedup: time ratio of each approach to our proposed inference approach (AO,F,AO) by problem size. A Lowess curve [48] is fitted to each method's scatter plot to show the general trend (sub-sampled data points are also shown).
(red) versus performing inference (blue) for each method. It is evident that almost all time is spent on evaluating image likelihoods, not on performing inference while searching for the best assignment. Inference time is so small relative to factor evaluation time that it is less useful to speed up inference itself, and more useful to avoid image likelihood evaluations by consulting constraints before evaluation and performing lazy evaluation. Figure 12 shows the total number of evaluated factor entries and the speedup as problem size varies (measured by total possible image evaluations). Two results become evident: (1) methods that leverage constraints and perform lazy evaluation are faster than those that do not, and (2) they are also more scalable. The scalability observation is based on the fact that the percentage of evaluated factor entries becomes almost constant once a certain problem size is reached (approximately 300,000 total factor entries). For the largest problems, our proposed approach AO,F,AO is almost 25 times faster than straightforward BP.
Constraints. The time difference between BP and BP,P in Figure 11 shows that even a simple application of constraints (pairwise constraints in this case) before evaluating image likelihoods can significantly reduce average times. In the absence of hard constraints, average times would be similar to BP. Moreover, additional constraint processing lowers total times despite the increased inference time spent processing constraints: the two approaches that use full constraint processing achieve the fastest average total times.
Upper bounds. Figures 11 and 12 show that our approach of using full traversal of the reduced
[Two panels plot average time per frame (sec): one against k for VE,P; AO,P,C; AO,P,MBE; AO,F,C; and AO,F,AO, and one against r for AO,P,MBE and AO,F,AO.]
Fig. 13. Inference parameters: effect of varying k (number of top solutions) and r (upper bound ratio).
AND/OR search space for computing upper bounds results in the fastest average times. This is
faster than using MBE for upper bound computation because MBE evaluates more of the entries
of factors kept for upper bound computation, whereas much of the search space is pruned by the
top-down search space traversal, especially when bottom-up constraint processing is performed
by MBE. A surprising result is that using constant upper bounds is faster than using MBE upper
bounds; in this case, the time saved by not evaluating any entries to compute upper bounds was
greater than the time saved by the more informative upper bounds provided by MBE.
Parameters. Finally, we evaluate performance as a function of the parameters k (number of returned solutions) and r (factor approximation ratio). Figure 13 shows a trade-off between the quality of the upper bounds and the cost of the evaluated entries required to improve those bounds. For high values of r (meaning more factors are kept for upper bound computation), upper bounds are more accurate, but too much time is spent obtaining those bounds. If r decreases too much, upper bounds become uninformative and more time is spent during search. The optimal value of r is around 0.10; a desirable property visible in the graph of time vs. r is that total time is insensitive to deviations from the optimal value of r. The second graph shows that total time scales well with larger values of k, the number of solutions per connected component in the primal graph.
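As an illustration only (our own reconstruction, not the paper's code), one way to realize the factor approximation ratio r is to sort factors by evaluation cost and keep the cheapest ones within a fraction r of the total cost; the removed, most costly factors are then bounded trivially by 1.

```python
# Our own reconstruction (illustrative) of one way the ratio r could be
# applied: keep the cheapest factors within a fraction r of total cost,
# dropping the most costly ones, which are then bounded trivially by 1.

def reduce_factors(factors, cost, r):
    total = sum(cost(f) for f in factors)
    kept, acc = [], 0.0
    for f in sorted(factors, key=cost):  # cheapest first
        if acc + cost(f) > r * total:
            break
        kept.append(f)
        acc += cost(f)
    return kept  # used by MBE or the reduced AO traversal for upper bounds
```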
D. Inter-person occlusion
We compare our joint inter-person occlusion reasoning to two alternative approaches: (1) computing the best assignment individually by ignoring assignments of other people, and (2) computing the best joint assignment iteratively by fixing the assignment of the best scoring person that is consistent with the already-fixed set. Since our implementation obtains the head location from people's bounding boxes and does not allow it to change during the assignment process, we enforce occlusion constraints between the head and other body parts as a pre-processing
TABLE IV
COMPARISON OF INTER-PERSON OCCLUSION APPROACHES

                            hands                feet
occlusion approach       P     R     F1       P     R     F1
individual              0.69  0.66  0.67     0.84  0.82  0.83
individual, pre-proc.   0.82  0.66  0.73     0.86  0.83  0.84
iterative, pre-proc.    0.87  0.62  0.72     0.91  0.78  0.84
joint, single-frame     0.90  0.64  0.75     0.97  0.85  0.91
joint, multi-frame      0.92  0.65  0.76     0.98  0.85  0.91
step by removing from consideration any assignments that violate a head-limb constraint. Table IV shows detection results on the group dataset for various occlusion approaches, in terms of precision, recall (computed only when extremities are at least partially visible), and F1 measures. The first row shows the result of finding the best assignment for each person individually, without head constraint pre-processing (thus, skin blobs corresponding to people's faces are incorrectly assigned as hands). The second and third rows show the individual and iterative approaches, both with head constraint pre-processing. Finally, the last two rows show the results of our approach, with and without temporal assignment tracking. Note that the single-frame results are single-frame only in the sense that we are reporting the best solution found per frame; candidates are still obtained from candidate tracklets obtained from multiple frames. Table IV shows that as we increase the complexity of occlusion reasoning, performance increases. F1 remains fixed or is lower only between the individual and iterative occlusion handling approaches; in this case, obtaining joint solutions iteratively increases precision significantly, but also lowers recall. Our joint approach increases both precision and recall. The multi-frame results show that our temporal transition factors do improve results, but only slightly; this might be explained by the fact that some temporal information is already included during low-level tracklet formation. In addition, single-frame performance is already very high (up to the quality of the detected set of candidates), and since
the multi-frame approach does not add candidate locations, it provides limited gains in performance.
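For reference, the iterative baseline can be sketched as a greedy loop (hypothetical names; a simplification of the actual procedure): at each step, the highest-scoring person whose best assignment is consistent with the committed set is fixed.

```python
# Hypothetical sketch of the iterative baseline (names are ours): commit one
# person at a time, always the highest-scoring person whose best assignment
# is consistent with the already-committed set.

def iterative_assignment(people, best_assignment, consistent):
    fixed = {}                 # person -> committed assignment
    remaining = set(people)
    while remaining:
        candidates = []
        for p in remaining:
            score, a = best_assignment(p, fixed)
            if consistent(a, fixed):
                candidates.append((score, p, a))
        if not candidates:
            break              # no consistent option remains
        score, p, a = max(candidates, key=lambda t: t[0])
        fixed[p] = a
        remaining.remove(p)
    return fixed
```

Early commitments can lock out better joint solutions, which is consistent with the lower recall observed for the iterative approach in Table IV.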
E. Generalization in the absence of silhouettes
We explore the use of Histograms of Oriented Gradients (HOG), as implemented by Felzenszwalb et al. [49], as replacements for silhouettes. We detect humans using [49], train detectors for axis-aligned joints and rotated body parts on the TUD training set [18] using Partial Least Squares (PLS) and
Fig. 14. Generalization to case with HOG features and no silhouettes. Individual detectors generate informative response maps.
[Precision-recall curves; AP: torso 0.90, head 0.59, lowerleg1 0.37, foot1 0.31, foot2 0.22, lowerleg2 0.17, upperleg1 0.048, upperleg2 0.046.]
                      P     R     F1
detections           0.33  0.63  0.43
candidates           0.32  0.64  0.43
top candidate        0.68  0.54  0.60
assignments, single  0.83  0.61  0.70
assignments, multi   0.84  0.61  0.71
Fig. 15. Quantitative results with HOG features replacing silhouettes. Left: precision-recall curves of individual models evaluated as detectors on fixed size 300 by 300 pixel windows extracted from the tud-campus test sequence, where people are scaled to a height of 200 pixels. Middle: tracking performance (feet only). The top candidate row is a baseline comparison to keeping the top-scoring location for each detector individually. Right: sample results of our framework using only HOG features.
Quadratic Discriminant Analysis (QDA) as in [50], and evaluate on the tud-campus dataset [18]. (We use templates of size (64, 64), (48, 48), and (72, 72) for the head, feet, and other joints, and (64, 104), (64, 72), and (80, 144) for the lower legs, upper legs, and torso, respectively, all using a HOG block stride of 8.) These videos have constant foreground motion and are short, making background model training difficult. Figures 14 and 15 show qualitative and quantitative results of HOG-based detectors individually as well as part of our entire framework (using raw uncalibrated responses to detect humans/feet and to model part likelihoods). Our focus is on extremity detection and tracking in these experiments, so while humans were automatically detected using [49], manual supervision was used, as for the basketball dataset, to fix identity switches/merges (however, detector noise remains; no bounding boxes were added or adjusted). Our results show that individual part detectors and likelihood models are informative but not sufficient to associate feet to people. However, as part of our framework, precision increases significantly.
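A dense response map of the kind shown in figure 14 can be sketched as follows (illustrative only; extract_hog, w, and b are placeholders, and a generic linear scorer stands in for the PLS projection and QDA model):

```python
import numpy as np

# Illustrative only: dense part response map over a sliding window grid.
# extract_hog, w, and b are placeholders; a generic linear scorer stands
# in for the PLS + QDA part models used in the paper.

def response_map(image, extract_hog, w, b, win=(64, 64), stride=8):
    H, W = image.shape[:2]
    ys = range(0, H - win[0] + 1, stride)
    xs = range(0, W - win[1] + 1, stride)
    out = np.empty((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            patch = image[y:y + win[0], x:x + win[1]]
            out[i, j] = float(np.dot(w, extract_hog(patch)) + b)
    return out  # higher values indicate likely part locations
```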
The common approach of running part models at discretized translation and rotation intervals requires over 500,000 evaluations per person to run 10 part detectors every 8 pixels and 10 degrees for a 300 by 300 pixel region of interest, as in our experiments. As figure 12 shows, even if all possible likelihood evaluations are performed in our framework, most cases require below 500,000 evaluations for all (up to 5) people combined at a much finer set of angles (since parts are defined by their endpoints); AOBB reduces this to roughly 50,000 image evaluations. Figure 11 shows that likelihood evaluation (not inference) dominates the total time, so little can be gained by more efficient inference, e.g., the fast generalized distance transform [6].
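For concreteness, the per-person count in the first sentence above follows from the sampling grid, assuming rotations span the full 360 degrees:

\[
\left(\frac{300}{8}\right)^{2} \times \frac{360^{\circ}}{10^{\circ}} \times 10
= 1406.25 \times 36 \times 10 \approx 5.1 \times 10^{5}\ \text{evaluations.}
\]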
VI. CONCLUSION
We proposed a framework for detecting and tracking extremities of multiple interacting people. We quantitatively evaluated our approach on the publicly available HumanEva I dataset, a dataset of a group of interacting people, and a dataset of one-on-one basketball games. Our experiments show that AND/OR Branch-and-Bound with lazy evaluation can significantly reduce computational cost while yielding the globally optimal solution in each frame. Our approach proves flexible enough to deal with significant occlusion of people in groups as well as the rapid motions and large pose variations observed during basketball games.
REFERENCES
[1] R. Dechter and R. Mateescu, "AND/OR search spaces for graphical models," Artif. Intell., 2007.
[2] R. Mateescu and R. Dechter, "Mixed deterministic and probabilistic networks," ICS Technical Report (Submitted), 2008.
[3] R. Marinescu and R. Dechter, "AND/OR branch-and-bound search for combinatorial optimization in graphical models," ICS Technical Report, 2008.
[4] C.-S. Lee and A. Elgammal, "Body pose tracking from uncalibrated camera using supervised manifold learning," in EHuM, 2006.
[5] R. Urtasun and T. Darrell, "Sparse probabilistic regression for activity-independent human pose inference," CVPR, 2008.
[6] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," IJCV, 2005.
[7] D. Ramanan, D. Forsyth, and A. Zisserman, "Tracking people by learning their appearance," PAMI, 2007.
[8] E. B. Sudderth, M. I. Mandel, W. T. Freeman, and A. S. Willsky, "Distributed occlusion reasoning for tracking with nonparametric belief propagation," in NIPS, 2004.
[9] L. Sigal and M. Black, "Measure locally, reason globally: Occlusion-sensitive articulated pose estimation," in CVPR, 2006.
[10] A. Gupta, A. Mittal, and L. S. Davis, "Constraint integration for efficient multiview pose estimation with self-occlusions," PAMI, 2008.
[11] H. Jiang and D. Martin, "Global pose estimation using non-tree models," in CVPR, 2008.
[12] X. Ren, A. Berg, and J. Malik, "Recovering human body configurations using pairwise constraints between parts," in ICCV, 2005.
[13] G. Hua, M.-H. Yang, and Y. Wu, "Learning to estimate human pose with data driven belief propagation," in CVPR, 2005.
[14] L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman, "The chains model for detecting parts by their context," in CVPR, 2010.
[15] H. Jiang, "Human pose estimation using consistent max-covering," in ICCV, 2009.
[16] T.-P. Tian and S. Sclaroff, "Fast globally optimal 2D human detection with loopy graph models," in CVPR, 2010.
[17] M. Bergtholdt, J. H. Kappes, S. Schmidt, and C. Schnorr, "A study of parts-based object class detection using complete graphs," IJCV, 2010.
[18] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," CVPR, 2008.
[19] S. Gammeter, A. Ess, T. Jaggli, K. Schindler, B. Leibe, and L. V. Gool, "Articulated multi-body tracking under egomotion," in ECCV, 2008.
[20] T. Zhao and R. Nevatia, "Tracking multiple humans in complex situations," PAMI, 2004.
[21] S. Park and J. K. Aggarwal, "Simultaneous tracking of multiple body parts of interacting persons," CVIU, 2006.
[22] T. Yu and Y. Wu, "Decentralized multiple target tracking using netted collaborative autonomous trackers," in CVPR, 2005.
[23] M. Eichner and V. Ferrari, "We are family: Joint pose estimation of multiple persons," in ECCV, 2010.
[24] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. Yuille, "Max margin AND/OR graph learning for parsing the human body," in CVPR, 2008.
[25] C. H. Lampert, M. B. Blaschko, and T. Hofmann, "Beyond sliding windows: Object localization by efficient subwindow search," in CVPR, 2008.
[26] V. Lempitsky, A. Blake, and C. Rother, "Image segmentation by branch-and-mincut," in ECCV, 2008.
[27] R. Dechter, B. Bidyuk, R. Mateescu, and E. Rollon, "On the power of belief propagation: A constraint propagation perspective," in Heuristics, Probabilities and Causality: A Tribute to Judea Pearl, 2010.
[28] R. Dechter, Constraint Processing. Elsevier Morgan Kaufmann, 2003.
[29] ——, "Bucket elimination: A unifying framework for probabilistic inference," in UAI, 1996.
[30] K. Kask and R. Dechter, "Mini-bucket heuristics for improved search," in UAI, 1999.
[31] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-time surveillance of people and their activities," PAMI, 2000.
[32] R. Ronfard, C. Schmid, and B. Triggs, "Learning to parse pictures of people," in ECCV, 2002.
[33] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis, "Human detection using partial least squares analysis," in ICCV, 2009.
[34] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses," in ECCV, 2008.
[35] M. Everingham, J. Sivic, and A. Zisserman, "'Hello! My name is... Buffy' - automatic naming of characters in TV video," in BMVC, 2006, pp. 889-908.
[36] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: HybridBoosted multi-target tracker for crowded scene," in CVPR, 2009, pp. 2953-2960.
[37] H. Sidenbladh and M. J. Black, "Learning image statistics for Bayesian tracking," in ICCV, 2001.
[38] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, "Real-time foreground-background segmentation using codebook model," Real-Time Imaging, 2005.
[39] A. S. Ogale and Y. Aloimonos, "A roadmap to the integration of early visual modules," IJCV, 2007.
[40] D. Comaniciu and P. Meer, "Mean shift analysis and applications," in ICCV, 1999.
[41] V. I. Morariu, B. V. Srinivasan, V. C. Raykar, R. Duraiswami, and L. S. Davis, "Automatic online tuning for fast Gaussian summation," in NIPS, 2008.
[42] L. Sigal, A. O. Balan, and M. J. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," IJCV, vol. 87, no. 1-2, pp. 4-27, 2010.
[43] J. Martinez del Rincon, J. Nebel, D. Makris, and C. Orrite, "Tracking human body parts using particle filters constrained by human biomechanics," in BMVC, 2008.
[44] R. Poppe, "Evaluating example-based pose estimation: Experiments on the HumanEva sets," in EHuM2, 2007.
[45] N. Howe, "Evaluating lookup-based monocular human pose tracking on the HumanEva test data," in EHuM, 2006.
[46] C. Yanover and Y. Weiss, "Finding the M most probable configurations using loopy belief propagation," in NIPS, 2004.
[47] J. M. Mooij, "libDAI: A free and open source C++ library for discrete approximate inference in graphical models," Journal of Machine Learning Research, vol. 11, pp. 2169-2173, Aug. 2010.
[48] W. S. Cleveland and S. J. Devlin, "Locally weighted regression: An approach to regression analysis by local fitting," Journal of the American Statistical Association, vol. 83, no. 403, pp. 596-610, Sept. 1988.
[49] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," PAMI, 2010.
[50] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis, "Human detection using partial least squares analysis," in ICCV, 2009.
Vlad I. Morariu received the BS and MS degrees in computer science from the Pennsylvania State
University in 2005, and the PhD degree in computer science from the University of Maryland in 2010.
He is currently a research associate in the Computer Vision Laboratory of the Institute for Advanced
Computer Studies at the University of Maryland, College Park. His research interests include activity
recognition, object detection, graphical models, and probabilistic logic models. He is a member of the
IEEE.
David Harwood received the graduate degrees from the University of Texas and the Massachusetts Institute of Technology. His research, with many publications, is in the fields of computer image and video analysis and AI systems for computer vision. He is a longtime member of the research staff of the Computer Vision Laboratory of the Institute for Advanced Computer Studies at the University of Maryland, College Park. He is a member of the IEEE.
Larry S. Davis received the BA degree from Colgate University in 1970 and the MS and PhD degrees in
computer science from the University of Maryland in 1974 and 1976, respectively. From 1977 to 1981,
he was an assistant professor in the Department of Computer Science at the University of Texas, Austin.
He returned to the University of Maryland as an associate professor in 1981. From 1985 to 1994, he was
the director of the University of Maryland Institute for Advanced Computer Studies. He is currently a
professor in the institute and in the Computer Science Department, as well as the chair of the Computer
Science Department.