

System Overview

Robust Realtime Visual-Inertial Navigation and Mapping
Konstantine Tsotsos, Stefano Soatto

<[email protected]> <[email protected]>

Kinematic and Measurement Models for Monocular Visual-Inertial Fusion

(Figure panels: crowded poster hall with many moving people; indoor building environment with many specular highlights; outdoor natural environment with many occluding boundaries due to foliage.)

! Measurements of linear acceleration and rotational velocity from an inertial measurement unit (IMU) are used as inputs to a kinematic-statistical model

Map Building from Filter Output

! World scale recovered by fusion with accelerometers

! Fully realtime implementation using an Extended Kalman Filter

! Further details to appear in [1] and [2].

! Estimated 3D coordinates of features are stored in their initial coordinate frames and saved in a relative pose & co-visibility graph

! Lines denote co-visibility edges and store relative transformations, and each 3D point in the map is associated with a node of the graph

! Detected loop closures add an edge to the graph, enabling efficient local map correction (a minimal sketch of this graph structure follows below)
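Below is a minimal sketch of such a relative pose & co-visibility graph, assuming a simple dictionary-based structure; the class and method names (CovisibilityGraph, add_loop_closure, ...) are illustrative, not the system's actual implementation.

```python
class CovisibilityGraph:
    """Nodes hold reference frames and the 3D points expressed in them;
    edges hold relative transformations between co-visible frames."""

    def __init__(self):
        self.nodes = {}   # node id -> list of 3D points in that node's local frame
        self.edges = {}   # (id_a, id_b) -> (R_ab, T_ab): transform taking frame b into frame a

    def add_node(self, node_id, points_local):
        self.nodes[node_id] = points_local

    def add_covisibility_edge(self, id_a, id_b, R_ab, T_ab):
        self.edges[(id_a, id_b)] = (R_ab, T_ab)

    def add_loop_closure(self, id_a, id_b, R_ab, T_ab):
        # A detected loop closure is simply another relative-pose edge; the cycle it
        # creates is what allows a local correction of the accumulated drift.
        self.add_covisibility_edge(id_a, id_b, R_ab, T_ab)
```

Because relative transformations live only on the edges, a loop-closure correction can be confined to the cycle it affects rather than requiring a global re-anchoring of all points.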

Results

University of Toronto, Research Assistant (Summer 2008, Summer 2009): Developed algorithms for LIDAR-based lighting-invariant imaging, and assisted graduate students of the Autonomous Space Robotics Laboratory (supervisor: Prof. T. Barfoot) in the development of terrain mapping and stereo vision algorithms. Provided logistical support for field expeditions.

Awards

University of California, Los Angeles

2011-2012: Dean’s Graduate Student Researcher Fellowship

University of Toronto

2010: Best Poster Presentation at Undergrad. Eng. Research Day (title: Depth estimation and occlusion modeling in monocular images)
2010: Natural Sciences and Engineering Research Council of Canada (NSERC) Undergraduate Student Research Awards Program Scholarship (supervisor: Prof. R. Zemel)
2009: Best Poster Presentation Award at Undergrad. Eng. Research Day (title: Using a laser scanner for vision based localization and 3D path planning on a mobile robot)
2009: Eng. Sci. Research Opportunities Program (ESROP) Fellowship (supervisor: Prof. T. Barfoot)
2007: University of Toronto Scholar Award

Publications Submitted for Review

[1] K. Tsotsos, A. Chiuso, and S. Soatto. Robust Inference for Visual-Inertial Sensor Fusion. IEEE International Conference on Robotics and Automation (in review), 2015.
[2] J. Hernandez, K. Tsotsos, and S. Soatto. Observability, Identifiability, and Sensitivity of Vision-Aided Inertial Navigation. IEEE International Conference on Robotics and Automation (in review), 2015.

Refereed Conference Proceedings

[1] K. Tsotsos, A. Pretto, and S. Soatto. Visual-Inertial Ego-Motion Estimation for Humanoid Platforms. IEEE-RAS International Conference on Humanoid Robots, December 2012.

Manuscripts in Preparation

[1] K. Tsotsos, J. Hernandez, A. Pretto, A. Chiuso, and S. Soatto. Visual-Inertial Sensor Fusion: Modeling, Analysis and Robust Inference. For the International Journal of Robotics Research.
[2] K. Tsotsos, J. Hernandez, G. Graber, T. Campbell, T. Pock, J. P. How, and S. Soatto. Task-Aware Object-Level Scene Segmentation. For the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Professional Service

Reviewer: International Journal of Computer Vision, IEEE Transactions on Robotics

Teaching Experience

Spring 2013: Teaching Assistant, CS174B: Intro. to Comp. Vision, UCLA. Instructor: Prof. S. Soatto
Fall 2010: Teaching Assistant, ESC101H1F: Introduction to Engineering Design, U. of T. Instructor: J. Foster

Relevant Coursework

University of California, Los Angeles: Computer Vision (S. Soatto), Convex Optimization (L. Vandenberghe), Numerical Analysis, Bayesian Inference for Vision (A. Yuille), Statistical Modeling for Vision (S. Zhu), Stochastic Estimation and Dynamical Systems (J. Speyer)
University of Toronto: Computer Vision (D. Fleet), Machine Learning (R. Zemel), Artificial Intelligence (S. McIlraith), Control Theory (M. Maggiore), Robot Manipulator Modeling & Control (M. Maggiore)

Relevant Skills

Programming Languages: C/C#/C++, Java, PIC & NIOS2 assembly, MATLAB

Hardware Description Languages: Verilog 2001

Software: LaTeX, OrCad / PSPICE, ROS, Blender, OpenCV, VLFeat

Hardware Experience:

Mobile Robot Platforms: Pioneer 3-AT, Robosoft RobuROC 6, AscTec Pelican, ARDrone2.0

FPGAs: Altera Cyclone II, Xilinx XUP Virtex II Pro Development System

In this standard model, data from the IMU are considered as (output) measurements. However, it is customary to treat them as (known) input to the system, by writing $\omega$ in terms of $\omega_{\rm imu}$ and $\alpha$ in terms of $\alpha_{\rm imu}$:

$$\omega = \omega_{\rm imu} - \omega_b + \underbrace{n_R}_{-n_\omega}, \qquad \alpha = R(\alpha_{\rm imu} - \alpha_b) + \gamma + \underbrace{n_v}_{-R\,n_\alpha} \quad (5)$$

This equality is valid for samples (realizations) of the stochastic processes involved, but it can be misleading as, if considered as stochastic processes, the noises above are not independent of the states. Such a dependency is nevertheless typically neglected. The resulting mechanization model is

$$\begin{cases} \dot T = v, & T(0) = 0 \\ \dot R = R(\widehat{\omega}_{\rm imu} - \widehat{\omega}_b) + n_R, & R(0) = R_0 \\ \dot v = R(\alpha_{\rm imu} - \alpha_b) + \gamma + n_v \\ \dot \omega_b = w_b \\ \dot \alpha_b = \xi_b. \end{cases} \quad (6)$$
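As a concrete illustration of how the mechanization model (6) can drive a filter's prediction step, here is a minimal Euler-integration sketch with the noise terms omitted; the function name and signature are assumptions for illustration, not the authors' implementation, and a full EKF would also propagate the covariance through the linearized dynamics.

```python
import numpy as np

def hat(w):
    """Skew-symmetric (hat) operator mapping a 3-vector to so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def propagate(T, R, v, omega_b, alpha_b, omega_imu, alpha_imu, gamma, dt):
    """One Euler step of (6): T' = v, R' = R (w_imu - w_b)^, v' = R(a_imu - a_b) + gamma.
    Biases are random walks, so their deterministic prediction is the identity."""
    T_next = T + dt * v
    R_next = R @ (np.eye(3) + dt * hat(omega_imu - omega_b))  # first-order rotation update
    v_next = v + dt * (R @ (alpha_imu - alpha_b) + gamma)
    return T_next, R_next, v_next, omega_b, alpha_b
```

In practice the rotation increment would be applied through the exponential map (or the rotation re-orthonormalized) to stay on SO(3).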

1.5 Imaging model and alignment

Initially we assume there is a collection of points $X^i$, $i = 1, \dots, N$, visible from time $t = 0$ to the current time $t$. If $\pi : \mathbb{R}^3 \to \mathbb{R}^2;\ X \mapsto [X_1/X_3,\ X_2/X_3]$ is a canonical central (perspective) projection, assuming that the camera is calibrated,¹ aligned,² and that the spatial frame coincides with the body frame at time 0, we have

$$y^i(t) = \frac{R_{1:2}^T(t)\,\big(X^i - T(t)\big)}{R_{3}^T(t)\,\big(X^i - T(t)\big)} \doteq \pi\big(g^{-1}(t)X^i\big) + n^i(t) \quad (7)$$

If the feature first appears at time $t = 0$ and if the camera reference frame is chosen to be the origin of the world reference frame, so that $T(0) = 0$, $R(0) = I$, then we have that $y^i(0) = \pi(X^i) + n^i(0)$, and therefore

$$X^i = \bar y^i(0) Z^i + \bar n^i \quad (8)$$

where $\bar y$ is the homogeneous coordinate of $y$, $\bar y = [y^T\ 1]^T$, and $\bar n^i = [n^i(0)^T Z^i\ \ 0]^T$. Here $Z^i$ is the (unknown, scalar) depth of the point at time $t = 0$, and again the dependency of the noise on the state is neglected. With an abuse of notation, we write the map that collectively projects all points to their corresponding locations on the image plane as $y(t) = \pi(g^{-1}(t)X) + n(t)$, or:

$$y(t) \doteq \begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix}\!(t) = \begin{bmatrix} \pi\big(R^T(X^1 - T)\big) \\ \pi\big(R^T(X^2 - T)\big) \\ \vdots \\ \pi\big(R^T(X^N - T)\big) \end{bmatrix} + \begin{bmatrix} n^1(t) \\ n^2(t) \\ \vdots \\ n^N(t) \end{bmatrix} \quad (9)$$

In practice, the measurements $y(t)$ are known only up to a transformation $g_{cb}$ mapping the body frame to the camera, often referred to as "alignment":

$$y^i(t) = \pi\big(g_{cb}\, g^{-1}(t) X^i_s\big) + n^i(t) \in \mathbb{R}^2 \quad (10)$$

We can then, as done for the points $X^i$, add it to the state with trivial dynamics $\dot g_{cb} = 0$. It may be convenient in some cases to represent the points $X^i_s$ in the reference frame where they first appear, say at time $t_i$, rather than in the spatial frame. This is because the uncertainty is highly structured in the frame where they first appear: if $X^i(t_i) = \bar y^i(t_i) Z^i(t_i)$, then $y^i(t_i)$ has the same uncertainty as the feature detector (small and isotropic on the image plane) and $Z^i$ has a large uncertainty, but it is constrained to be positive.
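For concreteness, a minimal sketch of evaluating the measurement map (10) for one point could look as follows; the helper names and the representation of group elements as (R, T) pairs acting by X -> RX + T are assumptions for illustration only.

```python
import numpy as np

def pi(X):
    """Canonical central (perspective) projection R^3 -> R^2."""
    return X[:2] / X[2]

def se3_apply(R, T, X):
    return R @ X + T

def se3_inv(R, T):
    """Inverse of g = (R, T): g^{-1} X = R^T (X - T)."""
    return R.T, -R.T @ T

def predict_pixel(R_t, T_t, R_cb, T_cb, X_s):
    """y = pi(g_cb g^{-1}(t) X_s), eq. (10), with the noise term n(t) omitted."""
    R_inv, T_inv = se3_inv(R_t, T_t)        # g^{-1}(t): spatial frame -> body frame
    X_body = se3_apply(R_inv, T_inv, X_s)   # point expressed in the body (IMU) frame
    X_cam = se3_apply(R_cb, T_cb, X_body)   # alignment g_cb: body frame -> camera frame
    return pi(X_cam)
```

The filter residual for feature $i$ is then the difference between the tracked image coordinates and this prediction.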

¹ Intrinsic calibration parameters are known and compensated for.
² The pose of the camera relative to the IMU is known and compensated for.


! Acceleration and rotational velocity are known, but biased, inputs, with gravity as a known parameter. The model is driven by Gaussian process noise.
! Feature tracks from a monocular camera rigidly attached to the IMU provide measurements of relative pose and world structure.


However, to relate $X^i(t_i)$ to the state, we must bring it to the spatial frame via $g(t_i)$, which is unknown. Although we may have a good approximation of it, namely the current estimate of the state at $t_i$, the pose at which the point first appears should be estimated along with the coordinates of the points. Therefore, we can represent $X^i$ using $y^i(t_i)$, $Z^i(t_i)$ and $g(t_i)$:

$$X^i_s = X^i_s(g_{t_i}, y_{t_i}, Z_{t_i}) = g_{t_i}\, \bar y_{t_i} Z_{t_i} \quad (11)$$

Clearly this is an over-parametrization, since each point is now represented by $3+6$ parameters instead of 3. However, the pose $g_{t_i}$ can be pooled among all points that appear at time $t_i$, which are therefore considered as a group. At each time, there may be a number $j = 1, \dots, K(t)$ of groups, each of which has a number $i = 1, \dots, N_j(t)$ of points. We indicate the group index with $j$ and the point index with $i = i(j)$, omitting the dependency on $j$ for simplicity. The representation of $X^i_s$ then evolves according to

$$\begin{cases} \dot y^i_{t_i} = 0, & i = 1, \dots, N(j) \\ \dot Z^i_{t_i} = 0 \\ \dot g_j = 0, & j = 1, \dots, K(t). \end{cases} \quad (12)$$
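The group parametrization in (11)-(12) can be sketched as a small data structure; the class and field names below are illustrative assumptions rather than the actual state layout.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class FeatureGroup:
    """Points first seen at the same time t_i share one reference pose g_{t_i} = (R_i, T_i)."""
    R_i: np.ndarray                          # rotation of the group reference frame
    T_i: np.ndarray                          # translation of the group reference frame
    y0: dict = field(default_factory=dict)   # feature id -> image coordinates y^i(t_i)
    Z0: dict = field(default_factory=dict)   # feature id -> depth Z^i(t_i), constrained > 0

    def point_in_spatial_frame(self, fid):
        """X^i_s = g_{t_i} (ybar^i(t_i) Z^i(t_i)), eq. (11)."""
        y_bar = np.append(self.y0[fid], 1.0)   # homogeneous coordinate of y^i(t_i)
        return self.R_i @ (y_bar * self.Z0[fid]) + self.T_i
```

Since all of these quantities have trivial dynamics (12), their deterministic prediction in the filter is the identity; only measurement updates refine them.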

2 Analysis of the model

The goal here is to exploit imaging and inertial measurements to infer the sensor platform trajectory. For this problem to be well-posed, a (sufficiently exciting) realization of $\omega_{\rm imu}$, $\alpha_{\rm imu}$ and $y$ should constrain the set of trajectories that satisfy (6)-(12) to be unique. If there are different trajectories that satisfy (4) with the same outputs, they are indistinguishable. If the set of indistinguishable trajectories is a singleton (contains only one element, presumably the "true" trajectory), the model (4) is observable, and one may be able to retrieve a unique point-estimate of the state using a filter, or observer.

While it is commonly accepted that the model (4), or its equivalent reduced realization, is observable, this is the case only when biases are exactly constant. But if biases are allowed to change, however slowly, the observability analysis conducted thus far cannot be used to conclude that the indistinguishable set is a singleton. Indeed, we show that this is not the case by computing the indistinguishable set explicitly. The following claim is proven in the appendix.

Claim 1 (Indistinguishable Trajectories). Let $g(t) = (R(t), T(t)) \in SE(3)$ satisfy (6)-(12) for some known constant $\gamma$ and functions $\alpha_{\rm imu}(t)$, $\omega_{\rm imu}(t)$, and for some unknown functions $\alpha_b(t)$, $\omega_b(t)$ that are constrained to have $\|\alpha_b(t)\| \le \epsilon$, $\|\omega_b(t)\| \le \epsilon$, and $\|\dot\omega_b(t)\| \le \epsilon$ at all $t$, for some $\epsilon < 1$. Suppose $\tilde g(t) \doteq \lambda\big(g_B\, g(t)\, g_A\big)$ for some constant $g_A = (R_A, T_A)$, $g_B = (R_B, T_B)$, $\lambda > 0$, with bounds on the configuration space such that³ $\|T_A\| \le M_A$ and $0 < m_\lambda \le |\lambda| \le M_\lambda$. Then, under sufficient excitation conditions, $\tilde g(t)$ satisfies (6)-(12) if and only if

$$\|I - R_A\| \le \frac{2\epsilon}{m(\omega_{\rm imu} : \mathbb{R}_+)} \quad (13)$$

$$|\lambda - 1| \le \frac{k_1 \epsilon + M_\lambda \|I - R_A\|}{m(\alpha_{\rm imu} : I_1)} \quad (14)$$

$$\|T_A\| \le \frac{\epsilon\,\big(k_2 + (2 M_\lambda + 1) M_A\big)}{m_\lambda\, m(\omega_{\rm imu} : I_2)} \quad (15)$$

$$\|(I - R_B^T)\gamma\| \le \frac{\epsilon\,(k_3 + M_\lambda M_A)}{m_\lambda\, m(\omega_{\rm imu} - \omega_b : I_3)} + \frac{(|\lambda - 1| + \epsilon)\, M(\omega_{\rm imu} - \omega_b : I_3)\, \|\gamma\|}{m_\lambda\, m(\omega_{\rm imu} - \omega_b : I_3)} \quad (16)$$

for $I_i$ and $k_i$ determined by the sufficient excitation conditions.

³ Here $\lambda(g)$ is a scaled rigid motion: if $g = (R, T)$, then $\lambda(g) = (R, \lambda T)$.
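Spelling out the transformation in the claim, with a group element $g = (R, T)$ acting on points as $X \mapsto RX + T$, gives

$$\tilde g(t) = \lambda\big(g_B\, g(t)\, g_A\big) = \Big(R_B R(t) R_A,\ \ \lambda\big(R_B\big(R(t) T_A + T(t)\big) + T_B\big)\Big),$$

so the inequalities (13)-(16) bound how far $g_A$ and the scale $\lambda$ can stray from the identity, and constrain $R_B$ through its action on gravity $\gamma$, in terms of $\epsilon$ and the excitation measures $m(\cdot)$, $M(\cdot)$.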


! Additional complexities: 1) camera-IMU alignment, included in the state with trivial dynamics $\dot g_{cb} = 0$; 2) temporal misalignment between camera and IMU, included with $\dot t_d = 0$ (a sketch of this augmentation follows below)
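As a sketch of what these two augmentations amount to in a filter (the names and layout are illustrative assumptions only):

```python
import numpy as np

# The state is augmented with the camera-IMU alignment g_cb = (R_cb, T_cb) and the
# time offset t_d; both have trivial dynamics, so prediction leaves them unchanged
# and only vision measurement updates refine them.
calibration = {
    "R_cb": np.eye(3),    # camera-IMU rotation (alignment)
    "T_cb": np.zeros(3),  # camera-IMU translation (alignment)
    "t_d": 0.0,           # temporal misalignment between camera and IMU clocks
}

def predict_calibration(calibration, dt):
    """dot(g_cb) = 0 and dot(t_d) = 0: identity prediction step."""
    return calibration

def camera_time_in_imu_clock(t_camera, calibration):
    """A camera frame stamped t_camera is associated with IMU time t_camera + t_d."""
    return t_camera + calibration["t_d"]
```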

Key Novelty in Measurement Processing: Outlier Rejection
! Flow-based tracking provides long tracks for accurate depth estimation, but is prone to drift
! Track lengths are limited only by visibility
! Unpredictable drift can cause a feature to become an outlier at any time, e.g. at occlusion boundaries, specular highlights, or in a non-rigid scene

Goal: Ensure features are not outliers at all times before use
! Perform updates at the current time using only features represented in the state

Key Modeling Assumption: The residual process should be temporally uncorrelated, not just instantaneously small
! Typical robust filtering strategies rely on instantaneous tests of the residual
! The residual process being temporally white is a stronger condition
! Perform outlier testing on the residual process rather than on individual samples instantaneously (contrasted in the sketch below)
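To contrast the two strategies, here is a minimal sketch of an instantaneous residual gate versus a test on the temporal correlation of a feature's residual history; the thresholds and the autocorrelation form are illustrative assumptions, not the statistic used in [1].

```python
import numpy as np

def instantaneous_gate(residual, S, chi2_threshold=5.99):
    """Typical robust-filtering check: Mahalanobis test of a single residual sample
    against its predicted covariance S (95% chi-square gate, 2 dof for an image residual)."""
    return float(residual @ np.linalg.solve(S, residual)) < chi2_threshold

def whiteness_test(residual_history, max_lag=5, corr_threshold=0.3):
    """Stronger check: the residual process of an inlier should be approximately
    temporally white, i.e. show negligible autocorrelation at nonzero lags."""
    r = np.asarray(residual_history, dtype=float)  # shape (T, 2): residuals over time
    r = r - r.mean(axis=0)
    var = (r * r).sum()
    if var == 0.0:
        return True
    for lag in range(1, max_lag + 1):
        corr = (r[lag:] * r[:-lag]).sum() / var    # normalized autocorrelation at this lag
        if abs(corr) > corr_threshold:
            return False                           # correlated residuals -> likely outlier
    return True
```

A feature would only contribute an update once its stored residual history passes such a test, in the spirit of testing the residual process rather than individual samples.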