
Using Wearable Inertial Sensors for Posture and Position Tracking in Unconstrained Environments through Learned Translation Manifolds

Aris Valtazanos
School of Informatics
University of Edinburgh
Edinburgh, EH8 9AB
[email protected]

D. K. Arvind
School of Informatics
University of Edinburgh
Edinburgh, EH8 9AB
[email protected]

S. Ramamoorthy
School of Informatics
University of Edinburgh
Edinburgh, EH8 9AB
[email protected]

ABSTRACT

Despite recent advances in 3-D motion capture, the problem of simultaneously tracking human posture and position in an unconstrained environment remains open. Optical systems provide both types of information, but are confined to a restricted area of capture. Inertial sensing alleviates this restriction, but at the expense of capturing only relative (postural) and not absolute (positional) information. In this paper, we propose an algorithm combining the relative merits of these systems to track both position and posture in challenging environments. Offline, we combine an optical (Kinect) and an inertial sensing (Orient-4) platform to learn a mapping from posture variations to translations, which we encode as a translation manifold. Online, the optical source is removed, and the learned mapping is used to infer positions using the postures computed by the inertial sensors. We first evaluate our approach in simulation, on motion sequences with ground-truth positions for error estimation. Then, the method is deployed on physical sensing platforms to track human subjects. The proposed algorithm is shown to yield a lower average cumulative error than comparable position tracking methods, such as double integration of accelerometer data, on both simulated and real sensory data, and in a variety of motions and capture settings.

Categories and Subject Descriptors

H.4.0 [Information Systems Applications]: General

Keywords

Wearable inertial sensors, optical motion capture, translation models, manifold learning

1. INTRODUCTION

Many information processing applications deal with the analysis of human motion, as captured by an ensemble of sensor devices.


In this context, it is often essential to determine both the absolute position of tracked subjects, as well as finer-grained information on their body posture, such as arm movements or gait patterns. Furthermore, it is often required that motion be captured in challenging, unconstrained areas, for example, a large office with multiple rooms and corridors, or an open outdoor environment.

The problem of simultaneous posture and position tracking is of interest to a wide range of motion capture applications. One notable domain is tracking in construction and fire-fighting missions, which requires monitoring of the deployed responders in challenging and potentially unknown environments. Position tracking systems that have been considered in this domain (e.g. indoor global positioning system (GPS) technologies, ultra-wide band [12], radio-frequency identification tags [11]) typically require an infrastructure that relies on detailed knowledge of the environment (e.g. placing signal receivers at known positions), while also suffering from line-of-sight constraints. Addressing these needs is a challenging task when the environment is difficult to negotiate or cannot be accessed by the system designers. Furthermore, these systems cannot monitor posture, which is often important in order to determine and assess the actions performed by the deployed subjects. Similar needs arise in application areas such as physical gaming [18] and human-robot interaction in rescue missions [6], where human position and posture must also be captured in unconstrained settings. Existing techniques in these domains are similarly restricted to either providing only part of the required data, or requiring special infrastructure in order to be deployed. The above issues raise the need for tracking systems that can produce position and posture from a single set of sensory devices, which are not sensitive to the morphology of the capture environment, and which do not rely on synchronisation between multiple heterogeneous sources.

Despite recent advances in motion capture and sensing technologies, fulfilling the above requirements in a robust manner is a challenging task. For instance, optical motion capture systems can capture both positional and postural data, but only within contained environments limited to a small area of capture. Inertial sensing systems allow for greater flexibility in the capture environment, as they are not restricted by line-of-sight constraints between the sensing devices and the tracked subject. However, they do so at the expense of not yielding absolute positions, as their calculations are based on relative rotational estimates, e.g. gyroscope readings.


Figure 1: Overall structure of the proposed system. An inertial sensing and an optical source are synchronised and jointly used to learn generative models of whole-body translations in an offline phase. These translations are encoded as linear regression-based mappings from projected latent representations of posture differences, as detected from the inertial source, and positional variations, as detected from the optical source. Online, the optical source is removed, and the learned model is used to predict translations for the tracked subject.

Similarly, general-purpose GPS sensors can compute absolute positions, but at a coarse level of precision and without supplying postural information, while also having limited applicability in indoor environments. Other motion capture technologies, e.g. magnetic systems, also suffer from one or more of the above limitations.

In order to address the challenges of simultaneous posture and position capture in unconstrained environments, one would need to combine the relative strengths of these heterogeneous systems in a principled manner. In this paper, we propose a hybrid position and posture tracking algorithm (Figure 1), which jointly uses an inertial and an optical motion capture system to learn local models of translation for a given human subject. The algorithm consists of an offline learning and an online generation phase. In the offline phase, body posture data collected from the inertial sensors are synchronised with position data captured from the optical source and aggregated into a single dataset. Due to their high dimensionality, the posture data are projected and clustered on a low-dimensional manifold, which captures the salient kinematic structure of the dataset. For each cluster, the projected data are used to learn a mapping from local posture differences to whole-body translations through linear regression. In the online phase of the algorithm, the optical source is removed, and the learned models are used to generate translation vectors from estimated posture differences. By iteratively applying these translations, the proposed system can track both the position and the posture of a subject performing motions similar to those captured in the offline phase, thus overcoming the main limitation of inertial systems discussed above. Moreover, due to the removal of the optical source in the online phase, the system is not affected by the morphology of the capture environment.

In the remainder of this paper, we first review related motion capture technologies, explaining how our approach combines their relative strengths (Section 2). Then, we describe our method for posture and position tracking, distinguishing between the translation learning and generation phases (Section 3). In Section 4, our approach is evaluated in simulations and physical experiments; first, on data from the Carnegie Mellon Motion Capture Database [3], which are annotated with ground-truth positions, and then, in a human motion capture environment, where we use the Kinect and the Orient platform as sensing systems (Figure 2). Our algorithm is shown to yield a lower overall position error than the related established tracking method of acceleration integration. Furthermore, in the physical experiments, we demonstrate examples of successful position tracking in a challenging office environment, where existing motion capture technologies cannot be applied in isolation. We review the key features of our work in Section 5.

Figure 2: Motion tracking platforms. (a): Optical source – Kinect device with two cameras and a depth-finding sensor. (b): Inertial measurement unit – Orient-4 device with tri-axial gyroscopes, accelerometers, and magnetometers.

2. BACKGROUND AND RELATED WORK

Traditional optical motion capture systems, e.g. [4], use an ensemble of high-resolution cameras to track the locations of a set of reflective markers placed on the body of a subject. The marker positions are used to compute the full pose (position and posture) of the subject. However, optical systems suffer from a number of drawbacks that impact their applicability. First, motion capture must be carried out in dedicated studios, which are often expensive to set up and maintain. Second, the total area of capture is limited to a small volume, and subjects cannot be tracked outside its boundaries. Thus, optical systems cannot be used to track subjects interacting outside the studio, e.g. moving in and out of rooms, or navigating along corridors in a building. Third, occlusion problems impact the ability of these systems to track subjects consistently and reliably.

A recent development has been the creation of devices combining stereo cameras and depth-estimating sensors, notably the Kinect [1] (Figure 2(a)).


The output of these sensors is used to generate a three-dimensional point cloud, which can then be analysed to determine the pose of a tracked subject, or to fit the data to a skeleton model [16]. These stereo-camera devices are significantly cheaper than traditional optical systems, and remove the need for markers on the subject's body. Furthermore, the portability of the sensors allows for anyplace motion capture. However, like traditional optical systems, these devices must remain fixed during tracking, and the volume of capture is limited to approximately 15 m³. This makes them unsuitable for tracking subjects in large or unconstrained spaces.

An alternative to optical systems are wireless inertial sensing platforms, such as the Orient interface [21] (Figure 2(b)). Inertial sensing systems collect data from an ensemble of sensor nodes placed on the subject's body. Each device typically consists of 3-axis inertial sensors such as gyroscopes, accelerometers, and magnetometers. These sensors jointly estimate the rotation of the body part the device is placed on, relative to a fixed point on the subject's body. Data from the different body parts are transmitted wirelessly to a base station and aggregated to determine the overall posture of the subject. Alternatively, data can also be stored on the devices and analysed offline at a later time. The latter feature makes these devices suitable for motion capture and processing in environments impacted by communication and data transmission constraints.

Due to the wireless transmission and capture of data, inertial sensing avoids the occlusion problems arising in optical systems. More importantly, as no fixed tracking source is required, subjects can be tracked in a greater variety of environments and in larger areas than with optical systems. The main drawback of inertial sensing is the relative rotational nature of its estimates, which means that only postures can be determined directly. Unfortunately, this approach does not extend to absolute spatial positions, as computations are performed relative to a stationary reference point.

Position tracking using inertial measurement units has been the subject of several studies, most following a model-based approach, where measurements are filtered through a position model to predict the most likely translation of the tracked subject. Employed models range from Kalman filters [20, 8, 10] and particle filters [13], to alternative heuristic approaches based on gait event detection [15, 22, 9].

Our approach differs in being a model-free method with respect to the measurement units and their output data, where no assumptions are made on the placement of the sensors or the nature of the motion being performed. Instead, the aggregated sensory data is treated as a single feature vector, from which a mapping to whole-body translations is learned. This leads to an unsupervised method that can learn a model of translation without the incorporation of additional knowledge about the problem. However, the lack of a specific model also means that the quality of the learned mappings inevitably depends on the quality and variability of the training examples. In later sections of this paper, we provide illustrations of how the accuracy of our approach is affected by the provided data.

Motivated by the above features, in our experimental evaluation we compare against the established model-free method of double integration of accelerometer data. This method relies on the integration of data coming from a single sensor in order to track position over time, without an internal model of the placement of that sensor. In Section 4, we show that using multiple sensors in conjunction with a learning algorithm can lead to better tracking of whole-body positions, in both simulated and physical experiments. Furthermore, our approach can also be combined with existing models, where the generated translations can be treated as predictive estimates for a filter. Integration with model-based filtering methods is an area of future work for us.

Dimensionality reduction has been used in inertial sensor networks as a discriminative model for activity recognition and gait phase detection [19, 17]. In our work, we extend this notion by using low-dimensional subspaces as generative models for translations. In this generative context, learned manifolds have been used in robotics, in order to facilitate imitation of human gaits by humanoid robots [14, 7]. By adopting this flexible representation, our objective is to similarly approximate a wide variety of motion dynamics.

3. METHOD

3.1 Sensory device outputs

3.1.1 Kinect

Figure 3: Body contour tracking using the Kinect. The tracking software automatically detects the outline of a human body, and tracks it as a cloud of points (shown as a blue blob).

We use the OpenNI body tracking interface [2] to detect and track the position of human subjects (Figure 3). The software automatically detects the outline of a human body, and tracks it as a collection of $N_I$ image point coordinates, $\vec{B} = \{(x_1, y_1), \ldots, (x_{N_I}, y_{N_I})\}$. The absolute position of the tracked body is approximated as the centroid of these points, as computed through image moments.

Let $W$, $H$ be the width and height (in pixels) of the camera image, and let $I$ be a 2-D array, such that

$$I(a, b) = \begin{cases} 1, & (x_a, y_b) \in \vec{B} \\ 0, & (x_a, y_b) \notin \vec{B}, \end{cases} \quad (1)$$

where $1 \le a \le W$, $1 \le b \le H$. The raw image moments, $M_{ij}$, are defined as

$$M_{ij} = \sum_{a=1}^{W} \sum_{b=1}^{H} x_a^i \cdot y_b^j \cdot I(a, b). \quad (2)$$

Based on these definitions, the image coordinates of the centroid of the tracked body, $C \doteq (x, y)$, are given by

$$(x, y) = (\,\mathrm{round}(M_{10}/M_{00}),\ \mathrm{round}(M_{01}/M_{00})\,). \quad (3)$$


The depth of each image pixel (with respect to the device) is measured by the range-finding sensor of the Kinect. This information is used to convert the computed image centroid, $(x, y)$, to the centroid of the body surface that is visible to the Kinect. These coordinates approximate to the absolute positional coordinates of the tracked body,

$$p = (x_B, y_B, z_B). \quad (4)$$
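As an illustration of Equations (1)-(4), the following minimal Python sketch computes the body centroid from a binary body mask and a depth image. The array-based interface and the direct depth lookup at the centroid are our assumptions, not part of the OpenNI API, and the back-projection through the camera intrinsics that yields metric coordinates is omitted.

    import numpy as np

    def body_centroid(mask, depth):
        """Approximate body position from a binary body mask and a depth image.

        mask:  H x W boolean array, True where the pixel belongs to the tracked body (Eq. 1).
        depth: H x W array of per-pixel depths from the range-finding sensor.
        """
        I = mask.astype(float)
        ys, xs = np.mgrid[0:mask.shape[0], 0:mask.shape[1]]

        # Raw image moments (Eq. 2).
        m00 = I.sum()
        m10 = (xs * I).sum()
        m01 = (ys * I).sum()

        # Image-plane centroid (Eq. 3).
        cx = int(round(m10 / m00))
        cy = int(round(m01 / m00))

        # The depth at the centroid approximates the position of the visible body
        # surface (Eq. 4); back-projection through the camera intrinsics is omitted.
        return cx, cy, depth[cy, cx]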

3.1.2 Orient inertial measurement units

Figure 4: Posture estimation using the Orient devices. (a): 3-D model of the tracked body. Each device is mapped to a different limb. (b): Orientation estimation. Quaternions are computed from raw sensor data and converted to Euler angles. (c): Example of angles produced by an Orient device (angle in rad vs. time in s).

The posture of the tracked subject is computed by the Orient devices. Each device is placed on a limb of the subject's body (Figure 4(a)). The raw data from the device's sensors (triaxial gyroscope, accelerometer, and magnetometer) are used to compute a quaternion representing the orientation at that limb, which is in turn converted into three-dimensional Euler angles (Figure 4(b)) based on a pre-specified rotation order. Due to this convention, an Euler angle representation can describe the orientation more succinctly than the corresponding quaternion, thus reducing the size of our feature set. An example of the angles output by an Orient is given in Figure 4(c). By aggregating the angles computed by all devices placed on the subject's body, we obtain the posture vector

$$\pi = \{(\theta_1^x, \theta_1^y, \theta_1^z), \ldots, (\theta_{N_D}^x, \theta_{N_D}^y, \theta_{N_D}^z)\}, \quad (5)$$

where $N_D$ is the number of deployed units, and $(\theta_i^x, \theta_i^y, \theta_i^z)$ are the angles computed by the $i$-th unit.
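A minimal sketch of the aggregation in Equation (5), assuming each Orient device reports a unit quaternion in (x, y, z, w) order and using SciPy for the conversion; the 'xyz' rotation order here is illustrative, since the paper only states that a pre-specified order is used.

    import numpy as np
    from scipy.spatial.transform import Rotation

    def posture_vector(quaternions):
        """Aggregate per-device orientations into the posture vector of Eq. (5).

        quaternions: list of N_D unit quaternions (x, y, z, w), one per Orient device.
        Returns a flat array of 3 * N_D Euler angles (theta^x, theta^y, theta^z per device).
        """
        angles = [Rotation.from_quat(q).as_euler('xyz') for q in quaternions]
        return np.concatenate(angles)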

3.2 Learning translation manifolds

3.2.1 Offline learning phase

In the offline phase, Kinect positions are synchronised with data from the Orient devices. From this data, a mapping from posture variations, as computed by the Orient devices, to translations, as computed by the Kinect, is learned through local linear regression.

Let $\{(p_1, \pi_1, t_1), \ldots, (p_{\tau+1}, \pi_{\tau+1}, t_{\tau+1})\}$ be a set of recorded synchronised training data, comprising $(\tau+1)$ absolute position and posture pairs, along with the times $t$ at which each pair was recorded. By taking the difference of successive instances, we obtain a training data set of $\tau$ unnormalised translations (i.e. position differences), posture variations¹, and time differences,

$$D = \{(dp_1, d\pi_1, dt_1), \ldots, (dp_\tau, d\pi_\tau, dt_\tau)\} = \{(p_2 - p_1, \pi_2 - \pi_1, t_2 - t_1), \ldots, (p_{\tau+1} - p_\tau, \pi_{\tau+1} - \pi_\tau, t_{\tau+1} - t_\tau)\}. \quad (6)$$

At this stage, translations $dp = (dx, dy, dz)$ do not account for the absolute orientation of the subject's body. To address this problem, we assume that at least one inertial measurement unit, $u$, is placed on a point where it can measure the subject's absolute orientation, $\theta$, with respect to the transverse plane of motion. We focus on this single angle (instead of computing three-dimensional absolute orientations) because it is closely correlated with most turning movements that occur during walking motion sequences. Thus, by normalising with respect to $\theta$, we can compensate for turns and changes of direction in the motion of the subject. We take $u$ to be the device placed on the subject's waist or hips, as a representative location for this purpose. The required angle $\theta$ is computed through $u$'s magnetometers, which measure absolute orientations using Earth's magnetic field. Thus, the unnormalised translation components on the transverse plane, $(dx, dy)$, can be normalised through the rotation

$$\begin{pmatrix} \overline{dx} \\ \overline{dy} \end{pmatrix} = \begin{pmatrix} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{pmatrix} \begin{pmatrix} dx \\ dy \end{pmatrix}. \quad (7)$$

We also normalise translations and posture differences with respect to their recorded time intervals. Thus, the normalised training data set is given by

$$\overline{D} = \{(\overline{dp}_1, \overline{d\pi}_1), \ldots, (\overline{dp}_\tau, \overline{d\pi}_\tau)\} = \{(dp_1/dt_1, d\pi_1/dt_1), \ldots, (dp_\tau/dt_\tau, d\pi_\tau/dt_\tau)\}. \quad (8)$$
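The construction of the normalised training set (Equations (6)-(8)), including the heading normalisation of Equation (7), could be sketched as follows; the array-based interface and the choice of the heading at the start of each interval are our assumptions.

    import numpy as np

    def wrap_angle(a):
        # Constrain angle differences to [-pi, +pi), as in footnote 1.
        return (a + np.pi) % (2.0 * np.pi) - np.pi

    def build_training_set(positions, postures, times, headings):
        """Normalised translations and posture variations (Eqs. 6-8).

        positions: (tau+1) x 3 array of Kinect positions p_i.
        postures:  (tau+1) x D array of posture vectors pi_i from the Orient devices.
        times:     (tau+1)   array of timestamps t_i.
        headings:  (tau+1)   array of absolute orientations theta from the waist unit.
        Returns (dP, dPi): tau x 3 translations and tau x D posture variations,
        heading-normalised (Eq. 7) and divided by their time intervals (Eq. 8).
        """
        dp = np.diff(positions, axis=0)                # unnormalised translations (Eq. 6)
        dpi = wrap_angle(np.diff(postures, axis=0))    # posture variations (Eq. 6)
        dt = np.diff(times)

        # Rotate the transverse-plane components by -theta (Eq. 7).
        th = headings[:-1]                             # heading at the start of each interval
        c, s = np.cos(-th), np.sin(-th)
        dx = c * dp[:, 0] - s * dp[:, 1]
        dy = s * dp[:, 0] + c * dp[:, 1]
        dp = np.stack([dx, dy, dp[:, 2]], axis=1)

        # Normalise by the recorded time intervals (Eq. 8).
        return dp / dt[:, None], dpi / dt[:, None]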

-Dimensionality reduction: The size of each posture variation vector, $d\pi$, is $D = 3 \cdot N_D$, where $N_D$ is the number of deployed devices. Even if $N_D$ is not particularly large, it may be difficult to learn a direct mapping to associated translations, due to the different modalities of the posture data. To overcome this problem, we project posture variations to a latent space, from which a mapping can be learned more efficiently. We use Principal Component Analysis (PCA), which embeds data into a low-dimensional linear manifold by maximising their variance [5]. Thus, this method seeks to preserve the high-dimensional structure of the data in the projected space. We review the key features of PCA below.

Let $\{d\pi_i\}$, $1 \le i \le \tau$, be the set of posture variation vectors, each having dimensionality $D$. The mean, $\overline{d\pi}$, and covariance matrix, $S$, of these vectors are given by

$$\overline{d\pi} = \frac{1}{\tau} \sum_{i=1}^{\tau} d\pi_i, \qquad S = \frac{1}{\tau} \sum_{i=1}^{\tau} (d\pi_i - \overline{d\pi})(d\pi_i - \overline{d\pi})^T, \quad (9)$$

¹Angle differences are constrained to lie in $[-\pi, +\pi)$.


respectively. Now let $d$ be the target dimensionality of the low-dimensional latent space, where $d < D$. We obtain the $d$ eigenvectors (or principal components) of $S$, $u_1, \ldots, u_d$, each of dimensionality $D$, corresponding to the $d$ largest eigenvalues, $\lambda_1, \ldots, \lambda_d$, of this matrix. These vectors are set as the columns of a $D \times d$ matrix

$$M = \begin{pmatrix} u_{1,1} & \cdots & u_{d,1} \\ \vdots & \ddots & \vdots \\ u_{1,D} & \cdots & u_{d,D} \end{pmatrix}. \quad (10)$$

The latent representation of a $D$-dimensional posture variation vector $d\pi$ is given by

$$\phi = d\pi \cdot M. \quad (11)$$

We refer to the manifold projections $\phi$ as the feature vectors of our translation learning algorithm. In both simulated and physical experiments (Section 4), we set the target subspace dimensionality to $d = 3$.
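A compact PCA sketch for Equations (9)-(11); note that Eq. (11) projects $d\pi$ directly onto $M$, which the sketch follows (re-centring before projection would be the more common convention).

    import numpy as np

    def learn_projection(dPi, d=3):
        """Projection matrix M of Eq. (10) from posture variations dPi (tau x D)."""
        mean = dPi.mean(axis=0)
        S = (dPi - mean).T @ (dPi - mean) / dPi.shape[0]   # covariance matrix (Eq. 9)
        eigvals, eigvecs = np.linalg.eigh(S)               # eigenvalues in ascending order
        top = np.argsort(eigvals)[::-1][:d]
        return eigvecs[:, top]                             # columns = top-d principal components

    # Latent feature vectors (Eq. 11): Phi = dPi @ M, one d-dimensional row per training point.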

Figure 5: Feature vector clustering example. (a): Projected points. (b): Division into 100 clusters, each represented by a different colour.

-Feature vector clustering: In a given set of training examples, there may be groups of similar posture variations leading to related translation vectors. To exploit this similarity, we group the projected feature vectors into clusters of related data points, and learn a separate translation mapping for each cluster (instead of a single mapping for the whole dataset). We use the k-means clustering algorithm, which groups input points into a specified number of $k$ distinct clusters [5]. As we use clustering in a learning context, we use small (with respect to the size of the dataset) values of $k$ to avoid overfitting the training data; in our experiments, this value does not exceed 2% of the overall number of training points. Despite the need for this manually specified parameter, k-means clustering has the advantage that it does not make assumptions about cluster structure (whereas distribution-based methods such as expectation-maximisation [5] assume a Gaussian form), while also favouring clusters of approximately equal sizes.

Figure 5 illustrates an example of clustering on a set of three-dimensional points. When applied to a dataset of $\tau$ feature vectors, the algorithm returns the set of clusters $C = \{c_1, \ldots, c_k\}$, with centres $\vec{\mu} = \{c_1.\mu_1, \ldots, c_k.\mu_k\}$, where each cluster $c_i$, $1 \le i \le k$, consists of a set of $N_i$ feature vectors

$$c_i = \{\phi_{i1}, \ldots, \phi_{iN_i}\}. \quad (12)$$
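A minimal clustering sketch; the use of scikit-learn's k-means is our choice (the paper does not prescribe an implementation), and k should be kept below roughly 2% of the number of training points, as described above.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_features(Phi, k):
        """Group the latent feature vectors into k clusters (Section 3.2.1).

        Phi: tau x d matrix of projected feature vectors.
        Returns the cluster label of each point and the k cluster centres mu_i.
        """
        km = KMeans(n_clusters=k, n_init=10).fit(Phi)
        return km.labels_, km.cluster_centers_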

-Translation mapping learning: For each data cluster $c_i = \{\phi_{i1}, \ldots, \phi_{iN_i}\}$, we learn a mapping from its constituent feature vectors to the corresponding translations, $T^c = \{dp_{i1}, \ldots, dp_{iN_i}\}$. We learn a separate mapping for each direction of motion $(x, y, z)$ through linear regression on the training points. In other words, we represent each translation component as a linear function of the projected feature vectors.

To learn these mappings, we collect all feature vectors of a cluster as an $N_i \times d$ design matrix,

$$X = \begin{pmatrix} \phi_{i1,1} & \cdots & \phi_{i1,d} \\ \vdots & \ddots & \vdots \\ \phi_{iN_i,1} & \cdots & \phi_{iN_i,d} \end{pmatrix}. \quad (13)$$

Furthermore, we define three observation vectors, one for each of the directions of motion, such that

$$\mathbf{x} = \begin{pmatrix} dx_{i1} \\ \vdots \\ dx_{iN_i} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} dy_{i1} \\ \vdots \\ dy_{iN_i} \end{pmatrix}, \qquad \mathbf{z} = \begin{pmatrix} dz_{i1} \\ \vdots \\ dz_{iN_i} \end{pmatrix}. \quad (14)$$

For each observation vector $\mathbf{v}$, we learn a linear mapping from the design matrix $X$ using least squares approximation, represented by a set of $d$ weights $\mathbf{w}$:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{v}. \quad (15)$$

By applying this procedure to all three observation vectors, $\mathbf{x}, \mathbf{y}, \mathbf{z}$, we obtain the linear mapping weights for cluster $c_i$, $\mathbf{w}_x^i, \mathbf{w}_y^i, \mathbf{w}_z^i$, respectively. These can be collectively represented as the cluster translation mapping

$$W^i = \begin{pmatrix} w_{x,1}^i & \cdots & w_{x,d}^i \\ w_{y,1}^i & \cdots & w_{y,d}^i \\ w_{z,1}^i & \cdots & w_{z,d}^i \end{pmatrix}, \quad (16)$$

with a different $W^i$ computed for each cluster. Thus, the latent space becomes a translation manifold that can generate translations from given feature vectors.
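The per-cluster regression of Equations (13)-(16) could be implemented as sketched below; solving the three directions jointly with a least-squares routine (rather than the explicit normal equations of Eq. (15)) is a numerical convenience, not part of the paper.

    import numpy as np

    def learn_cluster_mappings(Phi, dP, labels, k):
        """Per-cluster linear mappings from feature vectors to translations (Eqs. 13-16).

        Phi:    tau x d matrix of latent feature vectors.
        dP:     tau x 3 matrix of normalised translations (dx, dy, dz).
        labels: cluster index of each training point (from the clustering step).
        Returns a list of k matrices W_i of size 3 x d (Eq. 16).
        """
        mappings = []
        for i in range(k):
            X = Phi[labels == i]                  # design matrix of cluster c_i (Eq. 13)
            V = dP[labels == i]                   # observation vectors x, y, z side by side (Eq. 14)
            # Least-squares fit of Eq. (15), solved jointly for the three directions;
            # lstsq is preferred to the explicit normal equations for numerical stability.
            W, *_ = np.linalg.lstsq(X, V, rcond=None)
            mappings.append(W.T)                  # rows w_x^i, w_y^i, w_z^i (Eq. 16)
        return mappings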

3.2.2 Online translation generation

Learned mappings can be applied to novel instances of posture variations to predict whole-body translations. Assuming a known initial estimate of position, $(x_0, y_0, z_0)$, and orientation, $\theta_0$, predicted translations can be chained together to track position over time.

Let $d\pi_t$ be the subject's estimated posture variation at time $t$, and let $\theta_t$ be the subject's absolute orientation at that time. Furthermore, let $dt_t$ be the length of the time interval over which $d\pi_t$ was recorded. The projection of $d\pi_t$ on the translation manifold, $\phi_t$, is computed as $\phi_t = d\pi_t \cdot M$, where $M$ is the learned projection mapping from the high-dimensional to the latent low-dimensional space. The cluster nearest to $\phi_t$ is given by

$$c^* = \arg\min_{c_i \in C} \delta(\phi_t, c_i.\mu_i), \quad (17)$$

where $\delta(\cdot, \cdot)$ is the Euclidean distance between two points. Then, if $W^*$ is the cluster translation mapping for $c^*$, our model predicts a normalised translation for $\phi_t$ as

$$\overline{dp}_t \doteq (\overline{dx}, \overline{dy}, \overline{dz}) = W^* \cdot \phi_t^T. \quad (18)$$


Figure 6: Cumulative translation errors for 12 CMU database motions (panels (a)-(b): walk, (c)-(d): fast walk, (e)-(f): walk with turn, (g)-(h): run, (i)-(j): run with turn, (k)-(l): run with turn and pause; cumulative error in m vs. time in s). Red: learned translation manifold trained on the given motion sequence. Blue: double integration of root joint acceleration.

The updated predicted position at time $t$, $(x_t, y_t, z_t)$, is obtained by applying the orientation $\theta_t$ to $\overline{dp}_t$, scaling it by $dt_t$ to reflect the length of the current time interval, and adding it to the previously estimated position, $(x_{t-1}, y_{t-1}, z_{t-1})$:

$$\begin{pmatrix} x_t \\ y_t \\ z_t \end{pmatrix} = \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{pmatrix} + dt_t \begin{pmatrix} \cos(\theta_t) & -\sin(\theta_t) & 0 \\ \sin(\theta_t) & \cos(\theta_t) & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \overline{dx} \\ \overline{dy} \\ \overline{dz} \end{pmatrix}, \quad (19)$$

starting at the position $(x_{t-1}, y_{t-1}, z_{t-1}) = (x_0, y_0, z_0)$.

Our method can generate translations from novel instances of feature vectors, and track the position of a subject without an optical source. This property is important in complex unconstrained environments, where optical systems cannot be directly applied. As joint angles are inherently supplied by the inertial devices, our approach can simultaneously track both position and posture from a single set of sensors.
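One online tracking step (Equations (17)-(19)) then reduces to a projection, a nearest-centre lookup, and a rotation; a minimal sketch under the same array conventions as the earlier snippets.

    import numpy as np

    def online_step(pos, dpi, theta, dt, M, centres, mappings):
        """One online tracking step (Eqs. 17-19).

        pos:      previous position estimate (x_{t-1}, y_{t-1}, z_{t-1}).
        dpi:      current posture variation from the Orient devices (length D).
        theta:    current absolute orientation from the waist magnetometer.
        dt:       length of the current time interval.
        M:        learned D x d projection matrix.
        centres:  k x d matrix of cluster centres.
        mappings: list of k learned 3 x d matrices W_i.
        """
        phi = dpi @ M                                           # project onto the manifold
        c = np.argmin(np.linalg.norm(centres - phi, axis=1))    # nearest cluster (Eq. 17)
        dp = mappings[c] @ phi                                  # normalised translation (Eq. 18)

        # Rotate into the world frame and scale by the time step (Eq. 19).
        R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0,            0.0,           1.0]])
        return np.asarray(pos) + dt * (R @ dp)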

4. RESULTS

4.1 Simulation results

The learning framework was first evaluated on sequences from the Carnegie Mellon University (CMU) Database [3]. Motions in this dataset were captured using an optical system that tracks reflective body markers. Posture vectors were formed by aggregating the marker positions for the lower body joints (thighs, shins, ankles, feet). Note that this is a slightly different representation to the one given in Section 3, where posture vectors consisted of joint angles, not joint positions. However, this differentiation does not impact the applicability of our algorithm, which has no internal model of the nature of the supplied feature vectors.

The position of the root joint, at the subject's hips, was taken as the absolute position of the body. This was used as a ground-truth benchmark, against which the iteratively predicted positions were checked.

We compared against a related open-loop position generation technique, the double integration of acceleration. This method similarly generates local translations that can be chained together to compute positions. As the CMU dataset does not explicitly provide acceleration data, we simulated this information by extracting accelerations from successive positions at the subject's root joint, and integrating them twice to generate translations.
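For reference, a minimal sketch of this dead-reckoning baseline; simple rectangular (Euler) integration and prior gravity removal are our assumptions, as the paper does not specify the integration scheme.

    import numpy as np

    def double_integrate(acc, dt, p0=(0.0, 0.0, 0.0), v0=(0.0, 0.0, 0.0)):
        """Dead-reckoning baseline: double integration of root joint acceleration.

        acc: T x 3 array of accelerations with gravity already removed.
        dt:  sampling interval in seconds.
        Returns a T x 3 array of positions obtained by integrating twice from (p0, v0).
        """
        v = np.asarray(v0) + np.cumsum(acc * dt, axis=0)   # first integration: velocity
        p = np.asarray(p0) + np.cumsum(v * dt, axis=0)     # second integration: position
        return p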

We first assessed the ability of translation manifolds to reproduce translations on the datasets they are trained on. Towards this end, we trained the learning algorithm on several different motion sequences. We then compared the similarity of the generated translations to the ground-truth translations, as estimated by differences of consecutive root joint positions. Our metric is the cumulative translation error, obtained by iteratively summing the Euclidean distance of each generated vector from the corresponding ground truth.

The results for 12 distinct motion sequences, ranging from simple straight walking to running with turns, are shown in Figure 6.


Figure 7: Overall position error estimation for 12 novel instances of unseen motion sequences (panels (a)-(b): walk, (c)-(d): fast walk, (e)-(f): walk with turn, (g)-(h): run, (i)-(j): run with turn, (k)-(l): run with turn and pause; x and y in m). Red: trajectories estimated by the learned translation manifold. Black: ground-truth trajectories.

In all cases, the translations generated by the learned manifold yield a lower cumulative error than the corresponding double integration ones. For the simpler walking motions, the discrepancy between the two methods is shown to increase over time, thus suggesting that the translation manifold is more effective at capturing motion dynamics.

The true potential of learned translation manifolds can be fully assessed when they are applied to novel instances of previously unseen motions. In our second simulated experiment, we trained our model on a dataset consisting of 11 different motions by the same subject: a straight walk, two straight walks followed by a 90° left/right turn, two walks with a left/right veer, a fast straight walk, a straight run, two straight runs followed by a 90° left/right turn, and two runs with a left/right veer. The total duration of these captures is 114 seconds, with walking-type and running-type motions accounting for 86 and 28 seconds, respectively. By including different types of motions, our aim was to model a wide range of posture variation-translation pairs, and improve generalisation to novel motion instances.

The learned mapping was applied to 12 new motion sequences of various types. For these motions, we measured the discrepancy between the trajectories predicted by the manifold and the ground-truth trajectories. The resulting trajectories are shown in Figure 7. As previously, our algorithm is shown to reproduce accurate positions for normal walks, with the error increasing for running-type motions. This increase is partly explained by the larger number of walking motion data points in the training set, which biases the manifold towards translations of smaller magnitude.

4.2 Experimental results

In the physical experiments, we evaluated the translation learning algorithm on sensory data obtained from physical devices, using the Kinect as the optical and the Orient platform as the inertial sensing source. In the training phase, synchronised data from the two sources were used to learn translation manifolds. An important restriction in this case was the small capture volume of the Kinect (approximately 15 m³), which limited the variety of motions that could be performed by the subject. Thus, our framework is evaluated mainly on walking motions, which require less physical space.

4.2.1 Constrained environment experiments

The learning algorithm was first compared with the acceleration integration method against ground-truth positions estimated by the Kinect. Unlike the simulation experiments, accelerations were now directly supplied by the accelerometers of the Orient devices, so translations were generated through double integration of this data.

We captured 18 motion sequences of variable length, ranging from 20 to 180 seconds. We used a total of 4 Orient devices, placed on the subject's waist (root joint), right thigh, left thigh, and left ankle. Motions were captured in an office environment, which impacted the quality of the sensory readings, especially the magnetometers, due to metal in the building structure.


Figure 8: Cumulative generated translation errors for 6 novel motion sequences. The learning algorithm was trained on 12 sequences of varying length and motion composition. Top of each subfigure: ground-truth positions computed by the body tracking interface (x and y in m). Bottom: cumulative errors (in m) over time. Red: learned translation manifold. Blue: double integration of the acceleration of the root joint.

For each capture, the subject was allowed to perform any sequence and combination of walking and standing, provided s/he remained within the capture area of the Kinect.

We selected 12 of the captured sequences as the training set, and used the remaining 6 as novel instances for evaluation. As with the simulation experiments, we first compared the cumulative error of translations generated by the learned manifold with that of translations from double integration of root joint accelerations.

Figure 8 illustrates this comparison, along with the corresponding ground-truth positions captured by the Kinect. In all 6 trials, the subject was observed to repeatedly move around the capture area in a loop. Although in some cases the cumulative error generated by the double integration method was initially lower, in all trials the learning method had a considerably lower error at the end of the sequence.


Figure 9: Effect of training data set size on generated translations (panels: (a) 1 sequence, 369 points; (b) 2 sequences, 675 points; (c) 3 sequences, 945 points; (d) 5 sequences, 1,741 points; (e) 10 sequences, 3,127 points; (f) 12 sequences, 4,357 points). The cumulative error is shown to decrease as the number of training motion sequences (and training data points) increases.

Figure 10: Unconstrained environment illustration. (a): Office room and corridor. (b): Illustration of the approximate trajectory followed by the subject, starting and ending at the same point. (c): Approximate dimensions of the trajectory.

This superior performance was achieved despite some irregularities in the captured positional data, as, for example, in Figures 8(b) and 8(c). This demonstrates that our approach can learn a robust translation model from noisy sensory data, which can yield more accurate translations than methods operating directly on raw data.

The error of the translation manifold algorithm inevitably depends on the size and quality of the training data set. To better understand this effect, we assessed the performance of the algorithm on the last trial of Figure 8 under varying training sets. The results are shown in Figure 9, where we start with just one training sequence, comprising only a few data points, and progressively increase this number. It can be seen that when only one short sequence is supplied, the performance is considerably worse than that of the double integration method. However, as more motion instances are added to the training set, the error is shown to decrease significantly over time. This indicates that the learning model relies on a good coverage of the posture and translation space in order to be able to generalise effectively to novel instances. Thus, when recording data, it is important to ensure that the tracked subjects perform a wide range of motions, including various combinations of different motion types (e.g. straight walks and turns).

Another related constraint on the performance of our algorithm is that motions captured during training must be similar to those executed in the online generation phase. For example, if a manifold is learned only from walking motions, it is highly unlikely to yield accurate translations on novel running motions. Thus, it is essential to capture not only a significant quantity of data (as shown in Figure 9), but also representative sequences that are qualitatively similar to the motions the system will be tested on when deployed.


Figure 11: Generated positions for a subject following the trajectory shown in Figure 10, over 6 trials. Top subfigures: generated positions (x and y in m). Bottom: cumulative errors (in m) over time and comparison with double integration of the hip accelerometer data.

4.2.2 Unconstrained environment experiments

In the second set of physical experiments, we evaluated the learning method in an unconstrained office environment (Figure 10). This represents a setting where an optical source cannot be used to track subjects, due to its larger area and morphology (doors, corridors). The subject was asked to follow a trajectory consisting of several landmark points, located inside an office room and in an adjacent corridor. There was no restriction on the time given to follow this trajectory, so the subject was allowed to pause for arbitrary periods of time.

We used the same set of training sequences as in the first set of physical experiments to learn a translation manifold. Figure 11 shows the generated trajectories for six distinct trials of the subject moving along the prescribed path, along with the corresponding error comparison with the double integration method. The true precise trajectory followed in each case was not known; however, the subject always ended his path at the same point where he started.


Thus, by comparing the difference between the start and end points of each generated trajectory, we can get an estimate of the resulting error.

Despite not being aware of the duration and nature of the motions performed by the subject, the learning algorithm is observed to produce translations that closely follow the true trajectory. The computed mean error for the final position was 1.783 m, with the overall sum of distances between landmark points being about 30 m. Furthermore, as shown in the bottom subfigures of Figure 11, the learning algorithm maintains its superior performance over the double integration method. A common trait of both sets of physical experiments is that they feature several alternations between straight walking and turning motions, which are characterised by repeated variations in the velocity profile of the tracked subject. In this context, the double integration method initially produces an error comparable with that of the learning algorithm, but in both cases the margin increases exponentially over time. The learning method is therefore successful in identifying the salient structure of the high-dimensional data, and using it to learn a mapping that can be applied to novel motions.

5. CONCLUSIONS

We have presented a method for simultaneous posture and position tracking in unconstrained environments, based on learned generative translation manifolds. In an offline learning phase, two heterogeneous tracking sources, an inertial sensing (Orient-4) and an optical (Kinect) platform, are jointly used to learn a mapping from posture variations, as estimated by the former, to whole-body translations, as estimated by the latter. This mapping is learned through linear regression on clustered latent representations of posture variations. Online, the optical source is removed, and the learned translation manifold is used to generate translations for novel motion instances. The generative method is experimentally shown to outperform the related model-free, dead-reckoning method of acceleration integration, and to correctly reproduce the structure of previously unseen trajectories in unconstrained environments.

One drawback of our approach is that a different mapping must be learned whenever the system is tested on a new user. This characteristic is due to skeletal morphology and limb dimension constraints, which vary among different subjects. Thus, motion sequences captured for a specific subject may not adequately cover the posture difference and translation space for a different subject, leading to incomplete mappings. Nevertheless, one interesting extension to our work would be to learn translation manifolds from datasets which contain motions from various subjects with different characteristics (e.g. short/tall). This extension would be well-suited to the feature vector clustering procedure described in Section 3.2.1. In this context, we would employ a hierarchical clustering approach, where the evaluated subject would first be matched to the nearest (in terms of body morphology) user in the training data set, and then a translation would be generated based on the learned mappings for the matched subject.

A major strength of our approach is that it does not make assumptions about the nature of the performed motion, the number and placement of inertial measurement units, or the morphology of the tracked subject's body. This property is advantageous for two reasons. First, our method can be applied to complex motions spanning all three dimensions (e.g. forward jumps), where traditional model-based approaches tracking gait events and foot contacts would fail. Second, for simpler, planar motion types (e.g. walking sequences), our method can be used as a predictive step for model-based filtering approaches, in order to obtain lower positional errors. Extending our work in these directions would further emphasise the benefits of using machine learning techniques to exploit the structure of high-dimensional data produced by physical sensor networks.

Acknowledgments

AV has been supported by a doctoral studentship from the Centre for Speckled Computing, funded by the Scottish Funding Council (Strategic Research Development Programme R32329) and EPSRC (Basic Technology Programme C523881).

6. REFERENCES

[1] Microsoft Kinect. http://www.xbox.com/kinect.

[2] OpenNI Kinect body tracking interface. http://www.openni.org.

[3] CMU motion capture database. http://mocaps.cs.cmu.edu.

[4] Vicon motion capture systems. http://www.vicon.com.

[5] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.

[6] J. Casper and R. Murphy. Human-robot interactions during the robot-assisted urban search and rescue response at the World Trade Center. IEEE Transactions on Systems, Man, and Cybernetics, 33(3):367–385, 2003.

[7] R. Chalodhorn, D. B. Grimes, K. Grochow, and R. P. N. Rao. Learning to walk through imitation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2084–2090, 2007.

[8] J. A. Corrales, F. A. Candelas, and F. Torres. Hybrid tracking of human operators using IMU/UWB data fusion by a Kalman filter. In International Conference on Human-Robot Interaction (HRI), pages 193–200, 2008.

[9] R. Feliz, E. Zalama, and J. G. Garcia-Bermejo. Pedestrian tracking using inertial sensors. Journal of Physical Agents, 3(1):35–42, 2009.

[10] E. Foxlin. Pedestrian tracking with shoe-mounted inertial sensors. IEEE Computer Graphics and Applications, 25(6):38–46, 2005.

[11] J. Guerrieri, M. Francis, P. Wilson, T. Kos, L. Miller, N. Bryner, D. Stroup, and L. Klein-Berndt. RFID-assisted indoor localization and communication for first responders. In European Conference on Antennas and Propagation (EuCAP), pages 1–6, 2006.

[12] H. Khoury and V. Kamat. Evaluation of position tracking technologies for user localization in indoor construction environments. Automation in Construction, 18(4):444–457, 2009.

[13] Y. Kobayashi and Y. Kuno. People tracking using integrated sensors for human robot interaction. In IEEE International Conference on Industrial Technology, pages 1617–1622, 2010.


[14] K. F. MacDorman, R. Chalodhorn, and M. Asada. Periodic nonlinear principal component neural networks for humanoid motion segmentation, generalization, and generation. In International Conference on Pattern Recognition (ICPR) (4), pages 537–540, 2004.

[15] L. Ojeda and J. Borenstein. Personal dead-reckoning system for GPS-denied environments. In IEEE International Workshop on Safety, Security and Rescue Robotics, pages 1–6, 2007.

[16] J. Shotton, A. W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297–1304, 2011.

[17] A. Valtazanos, D. K. Arvind, and S. Ramamoorthy. Comparative study of segmentation of periodic motion data for mobile gait analysis. In ACM International Conference on Wireless Health, pages 145–154, 2010.

[18] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.

[19] A. Yang, S. Iyengar, S. Sastry, R. Bajcsy, P. Kuryloski, and R. Jafari. Distributed segmentation and classification of human actions using a wearable motion sensor network. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1–8, 2008.

[20] A. D. Young. From posture to motion: the challenge for real time wireless inertial motion capture. In International Conference on Body Area Networks (BodyNets), pages 131–137, 2010.

[21] A. D. Young, M. J. Ling, and D. K. Arvind. Orient-2: a realtime wireless posture tracking system using local orientation estimation. In Workshop on Embedded Sensor Networks (EmNets), pages 53–57, 2007.

[22] X. Yun, E. Bachmann, H. Moore, and J. Calusdian. Self-contained position tracking of human movement using small inertial/magnetic sensor modules. In International Conference on Robotics and Automation (ICRA), pages 2526–2533, 2007.