
University of Klagenfurt
Department of Computer Science
ISYS

Master's thesis

HYBRID TRACKING FOR AUGMENTED REALITY

Tymoteusz Sielach

Supervisor: prof. Martin Hitz

Klagenfurt, 2009

Contents

1 Introduction
2 Analysis
  2.1 Tracking
    2.1.1 Visual Tracking
      Marker-based tracking
      Tracking without markers
      Software libraries
    2.1.2 Inertial Tracking
      Accelerometers
      Gyroscopes
      6-DOF inertial trackers
    2.1.3 Other technologies
      GPS and DGPS
      Electronic compass
      Mechanical tracking
      Gravity sensors
      Ultra sound tracking
      Ultra-Wideband
    2.1.4 Hybrid tracking
      Examples
  2.2 Hardware
    2.2.1 Ultra mobile PC and TabletPC
    2.2.2 Mobile phone
      Windows Mobile
      Android
      iPhone
      Symbian
    2.2.3 AR Telescopes
  2.3 Propositions of solution
    2.3.1 Hybrid tracking
    2.3.2 Tracking from pre-acquired data
3 Development of the prototype
  3.1 Description of prototype
    3.1.1 Overview
      Technical specification
    3.1.2 Database
    3.1.3 Camera pose estimation
    3.1.4 Point coordinates estimation
    3.1.5 Features - finding, describing, matching
      SURF based feature tracking
      Pyramid Lucas-Kanade algorithm
    3.1.6 Marker based tracking
    3.1.7 Initial camera calibration
    3.1.8 Inertial tracking
  3.2 Possible improvements
    3.2.1 Descriptor based tracking
    3.2.2 Kalman filtering
    3.2.3 Weighted tracking error function
    3.2.4 Speeding-up and better robustness
    3.2.5 Summary
  3.3 History of development
    Natural feature tracker
    Pose estimation and reconstruction procedures
    Improvement of architecture
    Improvements of tracking quality
    Change of feature tracker
    Speeding up
4 Conclusion
  4.1 Results
    4.1.1 Outdoor
    4.1.2 Indoor
    4.1.3 Verification of requirements
  4.2 Hardware recommendation
  4.3 Tracking method recommendation
    4.3.1 Pose estimation algorithm
    4.3.2 Map management
      Tracking with extensible map
      Tracking from pre-acquired data
    4.3.3 Feature tracking and detection method
    4.3.4 Inertial subsystem
    4.3.5 Summary
  4.4 Initialisation and marker subsystem
  4.5 Problems unsolved
  4.6 Technical problems during development
Bibliography

Chapter 1

    Introduction

There is a historical project called "Burgbau zu Friesach", which takes place in Friesach (Carinthia). The project consists of building a medieval castle from the ground up using medieval methods and technologies. The whole process will take 30 to 40 years. Currently the construction site is fixed and the designers are working in their office in Friesach. The project is already in progress, but the first stones will stand on the construction site only in a few years. The problem is:

• How to make the construction site attractive for visitors, especially in the early phase of construction.

• How to document the whole process.

• How to convey knowledge about the construction to visitors in an attractive way.

The idea is to solve these problems using modern informatics. The documentation problem can be solved using cameras which automatically take pictures of the construction, store them in a database, and process them later. The problems of conveying knowledge and presenting the building can be solved by creating an augmented reality system which overlays multimedia content and a 3D model of the rising castle onto the construction site. Augmented reality is a very good approach for this purpose, because it does not neglect the things that happen in the real world (the building itself), which are very important. For example, a 3D presentation of the castle and its construction watched on a desktop PC at home does not convey nearly as much information or leave the same impression. AR delivers information in an unobtrusive way. Hence, the construction site can look as it did in medieval times, without information boards and arrows guiding visitors. It will give the impression of a truly living construction site, not just a museum.

I would like to introduce the vision of the system through user scenarios. There will be two types of users:

• Visitor - someone who comes to sightsee the construction site and is focused on getting information about the construction process and its history.

• Designer - a person who wants to see how the castle will look in the future, compare some variants of the design, or adjust some dimensions of the building.

    Visitor

A family came to see the construction of the castle. They bought tickets and got a handheld device with a 7-inch screen on one side and a camera on the other. After turning it on, an image from the camera appeared on the screen. The instructions recommend looking through the device at two colored stakes stuck into the ground near the entrance to the construction site. After a second of looking at the stakes, the caption 'ready' popped up on the screen. From that moment, all movements of the device are tracked by the system. The father of the family directed the lens at the empty place where the castle is to be built over the next 30 years or so, but on the screen it appeared immediately. In the right corner of the screen a date appeared, indicating a moment in time about 30 years ahead. The main functionality of the system is to render the castle in the concrete state of construction indicated by that date. The family discovered that the application contains a time slider which allows choosing the moment in time at which the castle is displayed. It appears after pushing the 'time line' button and covers the whole screen. It contains dates and some milestones of the construction process. After changing the selected time stamp (simply by sliding a finger on the touch screen), the time line disappears and the augmented castle in the selected construction phase becomes visible. The family approached the building to see the details of the 3D model better. They realized that there is a rectangle in the middle of the screen (similar to the one in a camera's viewfinder). When the Romanesque window was visible in that rectangle, its color changed. When the father pushed the 'shutter' button, a few windows with text and multimedia content about that detail appeared. They were displayed as long as he was watching that detail through the device, and they disappeared after he changed the virtual gaze direction. Later they learned that in order to prevent the windows from disappearing, they can be held by hand on the touch screen. They also noticed that apart from the castle there are other 3D objects. When they came closer, they saw that these are virtual posters containing information about the castle and its current construction phase. They were easy to read through the handheld device, and it was also easy to intuitively change the view between the castle and an information poster. When they changed the date on the time line, the content of the posters also changed according to the castle's state. After sightseeing the castle, the family went to other parts of the construction site. They came to the place where stones were being prepared. Virtual posters appeared around the area they entered, while the posters near the castle vanished. The father approached a man dressed like a medieval mason with the device. Multimedia content related to the mason appeared on the screen, so he could read about his work and role in the construction of the castle.

After sightseeing the virtual castle the family came back home and decided to visit the project's website. There was a text field on that website where they entered their ticket ID number. After sending the form, they could see their whole trip through the construction site and a list of things they saw, read the descriptions again, and view the multimedia content. The son said that it would be very helpful for him when writing an essay for his history class.

    Designer

The designer is sitting in his office in the center of Friesach. He draws a few versions of the castle and its more detailed parts using CAD software. He tries to imagine how they will look in the real world and to choose the best one. He is also thinking about how the castle will fit the shape of the terrain and which dimensions would be best. Answers to all of those questions can be obtained using augmented reality. The designer saves models of the castle in the BZF file format (BZF is an extended 3DStudio file [*.3ds] used in the whole platform for storing 3D objects) using a plug-in for his design software. Then he loads that file onto a handheld device (the same one used by the family in the previous scenario) and goes to the construction site. He calibrates the device using the stake markers at the construction site (the same way as in the visitor's scenario) and starts to view his project. First he looks at the castle from a few places and from a large distance, then he approaches the walls and looks at some details. After that he switches the system to another version of the project. He walks further away to examine it from a distance. Then he realizes that the tower is too low. He quickly switches to the dimension adjusting mode and starts to tune the tower's height. All of his changes are visible immediately, which is very helpful for finding the ideal height. After that he corrects a few other dimensions in the same way. He also realizes that a defensive wall does not fit the other parts of the castle very well, but he has prepared another variant beforehand. He quickly switches to that version. When he has an ideal solid of the castle on the screen, he pushes the 'shutter' button (also used in the visitor's scenario). In designer's mode this button has a different function: it stores a screenshot on the hard drive. Then he sends this image to his co-workers by email, using standard software (integrated in the UMPC's OS). In order to save the adjusted dimensions he chooses the 'save' option from the menu (available in the dimension adjusting mode). The changes are stored in the BZF file and can be propagated back to the CAD software.

The design of such a system is not very common and causes some problems. Initially we can divide them into two groups:

• Interface design problems. I suspect that the system can display too much multimedia content at one time, which can leave the user bewildered and overloaded with information. Rendering the castle from the inside can also be a problem. Let us imagine that someone wants to see the courtyard, but its floor is quite high and not built yet. The user stands under it and does not see anything. A similar problem can happen if he walks into the inside of some solid (for example a defensive wall). It is an open problem whether the castle should be rendered from the inside and how to present it to the user.

• Technical problems. The design of augmented reality applications is hard and still full of atypical problems. Each system is a new challenge. But on the macro scale there are some common problems in augmented reality applications. Augmented reality relies on rendering 3D models and other data over an image of the real world taken by a camera. In order to do this, the system must have information about the exact camera position and orientation. The problem of acquiring this data is called 'tracking'. There are many methods of tracking, which utilize many different physical phenomena. All of them have advantages and disadvantages, which makes them very suitable for some purposes and unsuitable for others. A very common approach is to combine several methods in one system. The accuracy of tracking is also crucial for AR purposes. The second problem is 3D rendering. It relies on placing a 3D model in the scene so that it does not overlap objects which occur in front of the virtual object. It should look as if it were part of the real world. The computational power needed for rendering (and tracking too) is also a problem. Many mobile phones and PDAs do not support hardware floating point operations.

In this thesis I will focus on solving the tracking problem in an outdoor environment.

Chapter 2

    Analysis

The aim of this work is to design an augmented reality system which:

• will make sightseeing at the construction site of the 'Burgbau zu Friesach' project more attractive

• will be able to render big buildings

• will work robustly in a limited outdoor area (the area can be prepared)

• will have a short response time

• can render at least 25 fps

• will be easy to use - intuitive interface, lightweight

• will be able to display multimedia content of several types (3D models, videos, images, text)

• will be able to play sounds

With the design of an augmented reality system come 3 types of problems:

• Rendering - a subsystem which allows drawing several objects on the screen. It must be fast enough to render the image fluently.

• Tracking - to properly render an augmented scene, the system must know the exact trajectory and position of the camera. This is called the geometrical registration problem.

• Hardware platform - in the contemporary world no one constructs a computer from scratch. There are many ready-made micro computers on the market. Mainly they use Intel-architecture processors or ARM processors. The problem is to choose the one which is the fastest and has enough built-in sensors for the purpose of tracking.

The first section considers the tracking problem. The next section describes hardware platforms. Rendering is outside the scope of this thesis.

    2.1 Tracking

In this section we will consider a variety of tracking techniques. They are ordered by how frequently they appear in contemporary augmented reality systems. In the implementation of tracking we can distinguish two approaches:


• Outside-in tracking - sensing devices are placed at motionless positions in the environment. Moving objects are tracked.

• Inside-out tracking - sensors are mounted on the moving object which is to be tracked.

The first approach works in specially prepared environments, which are relatively small. The second approach can work in an unprepared and unbounded environment.

    2.1.1 Visual Tracking

Visual tracking is an approach which uses one or more cameras. Special algorithms look for features in the raster image and describe and classify them. Consecutive frames are analyzed and the spatial correspondence between them is calculated. The features can be known to the software and even specially placed in the environment - this approach is called "marker-based tracking". Markers do not have to be visible to humans; there are visual tracking systems which rely, for example, on infrared LEDs [CN98]. There are also tracking systems which rely only on natural features (edges, blobs, T-junctions, colors, etc.). They can work in unprepared environments - this fact greatly increases the range of applications of Augmented Reality. The camera can be attached to the tracked device (for example a Tablet, PDA or HMD) - the inside-out approach. Or the system can consist of many cameras mounted in the environment - the outside-in approach. In the next subsections we will discuss marker-based tracking, tracking without markers, and software libraries which can be helpful in implementing visual tracking.

    Marker-based tracking

Marker-based tracking is one of the most developed topics. During the last 15 years a great many types of fiducial markers have been developed. The features which differentiate marker systems are:

• Robustness, velocity, and accuracy of tracking.

• Type of markers. In the simplest classification we can distinguish color markers and shape markers.

• The variety of distances from which the marker can be recognized. For example, a very big marker which is noticeable from a great distance cannot be recognized when the camera is next to it and captures only a part of it.

• How many bits of information can be coded on the marker. Some marker standards assume the use of checksum bits.

• How big the markers are and how obtrusive they are. How large is the area they cover? The Studierstube Tracker library [WLS08] contains some concepts for unobtrusive markers.

• Whether the marker system is scalable. Is it possible to use it over a large area?

• How many markers must be in one frame to calculate the 6-DOF pose.

Here I present the list of marker systems which I studied:

ISO/IEC 16022 standard [Wikb]. The marker contains a 2D data matrix and two solid edges, which help to find the marker. ISO/IEC 16022 markers can be combined together to make a large matrix consisting of segments delimited by solid edges. The segments have the shape of a square or rectangle. There are many standard dimensions for such markers - from 8x8 up to 144x144. The largest version can contain 2355 bits of data. Each marker has an ECC200 checksum. ISO/IEC 16022 was designed to replace old-fashioned barcodes. There are many free libraries for reading and generating ISO/IEC 16022 codes. The Studierstube Tracker library can also estimate the 6-DOF pose of the camera using such a marker. A big advantage of this system is that it can contain a large amount of data; we can store a whole URL or even a simple 3D model there. See 2.1(a).

Figure 2.1: Types of markers - (a) ISO/IEC 16022 marker, (b) ARTag marker, (c) frame marker, (d) split marker, (e) dot marker, (f) circular markers, (g) nested marker, (h) stake marker.

There are a few other standards using data matrices, for example ARTag (2.1(b)) [Fia05]. An ARTag marker contains 36 bits of information; 10 of them contain ID data (1024 different markers), while the other 26 bits provide redundancy to decrease the chance of false identification. ARTag is designed to be identified and tracked very robustly.

ARToolkit markers [Kat] - a system of square-shaped markers identified by a graphic template. The template can be any black-and-white image, but the more detailed the symbol is, the worse the quality of identification. Symbols consisting of big black-and-white areas yield the best robustness and enlarge the distance from which markers are recognizable.

Frame markers [WLS08] - rectangular markers which consist only of a frame (inside the marker there can be anything). On the interior side of the frame the ID information (9 bits) is encoded, making it look like a decoration. The code contains a checksum and is arranged so that it allows determining the orientation. See 2.1(c).


Split markers [WLS08] - a variation of the frame marker which consists of only two parallel sides of the frame. Both sides contain barcodes with the same ID information (6 bits) encoded. The pair of barcodes differ in only one bit, which stores the orientation. See 2.1(d).

Dot markers [WLS08]. Each marker is a black dot with a white ring. The markers form a two-dimensional grid which is applied to a flat surface. Each 4 dots form a grid cell, which is matched against a precomputed template. This allows determining the position and orientation of the camera. The solution is scalable but limited to flat surfaces. The dots cover a very small part of the surface. See 2.1(e).

Nested markers [TKK+06] - a solution for improving the scalability and accuracy of tracking for black-and-white rectangular markers. See 2.1(g). The concept is recursive: there is one high-level marker which consists of some smaller (lower-level) markers. Lower-level markers can also contain nested smaller markers. Each marker contains a visual code and can be unambiguously identified. When the camera is far from the marker, the system uses the top-level marker. If the camera is close, the lower-level markers are used for geometrical registration. The system can also use markers of many levels simultaneously to increase the tracking accuracy.

Multi-ring color markers [CN98] - another approach to improving the scalability of marker-based tracking. They are quite similar to nested markers, because this concept also relies on embedding one marker in another. The author of the article [CN98] proposed two solutions: "proportional width ring markers" and "constant width ring markers". The first concept increases the tracking range. A marker consists of a few concentric rings, and the width of each ring is 2 times bigger than the width of the ring inside it. There exist markers of many levels: level 1 markers consist of a centre and one ring, level 2 markers have 2 rings, and so on. To determine the 6-DOF pose the camera must see three or more markers. Therefore level 1 markers must be arranged densely, while bigger markers can be arranged more sparsely, because the system uses them when the distance to the lens is large. Furthermore, the rings have one of 6 colors, which helps with identification. Constant width ring markers also have a constant number of rings; hence the tracking range of each marker is also constant.

Invisible markers embedded in images [Hwa07] - standard black-and-white square markers can be applied to static or moving images as noise which is unnoticeable to humans. The encoded information can be extracted using a standard camera and a Wiener filter. If we subtract the original image from the acquired one after noise reduction, we receive an image of the marker. Then it has to be normalized and converted to a binary image using a global precomputed threshold. The resulting picture can be robustly processed with marker tracking software.

IR markers [PP04] - the proposed system uses ARToolkit-style markers drawn with an invisible IR ink. The application uses 2 cameras aligned through a half mirror; one of them is equipped with an IR filter for capturing markers, and the second is equipped with a visible light cutoff filter for capturing the real scene. The signal from the IR camera is processed with the ARToolkit library and 3D models are augmented onto the image from the scene camera. To the user it looks as if the application were a robust markerless visual tracking system.

Stake markers - my own idea for the "Burgbau zu Friesach" project. The marker is a stake with color bars on the upper end, while the other end is sharp so it can be stuck vertically into the earth (see figure 2.1(h)). The color bars create a visual code which allows markers to be distinguished. The width of a bar is constant and big enough to be recognized by the camera from a large distance. In order to calculate the position and orientation, two markers must be in the field of view. The color code should be similar to the one used in multi-ring color markers. The number of used colors should be low, to avoid recognition errors in different lighting conditions. In addition, the colors can be divided into two groups according to the wave frequency (for example: (red, yellow, orange) and (blue, violet, green)). Colors from the two groups could be used alternately, also to avoid recognition errors. For each stake the system must remember its 2D position on the ground and its height. The heights of all markers must be normalized to some common imaginary level. I assume that all stakes are perpendicular to that level. Stake markers are designed to be stuck around the area where the tracking system should work. In figure 2.2 two methods of placement are presented (in the context of the "Burgbau zu Friesach" project). The gray color indicates the area where spectators should not stay; red squares represent stakes. We must also assume that virtual objects are placed in the middle of the environment (in the circle called castle). In figure 2.2(a) we see that markers are placed in the middle, so they are always visible when looking at the castle. They also should not disturb the viewing, because they will be covered by the rendered 3D model. But when the user stands very close to the stakes or between them, tracking can stop working, because the rising castle will cover some markers on the opposite side of the circle. In figure 2.2(b) the stakes are placed around the area. The big advantage of this method is that spectators can be anywhere in the area, but the danger of markers being covered by the building is larger.

The effectiveness of the proposed marker system depends on where the user directs the camera. I assume that in most cases the camera will look along an axis parallel to the surface of the earth (small tilts are possible). In order to support 6-DOF in every location in the tracking area, stake markers must be integrated with some other technique.

During tracking the system always looks for the points where color bars meet. The length of a color bar is constant, so the 2D and 3D coordinates of those points are easy to calculate. If the camera sees 2 stakes, 2 points from each of them can be used to estimate the camera pose. The more points or stakes are visible, the better the quality that can be achieved.

Figure 2.2: Methods of placement of stake markers - (a) in the middle, (b) around the area.

All systems presented in this section can be used with both inside-out and outside-in tracking. However, for augmented reality applications inside-out tracking is more desirable. This approach relies on tracking markers placed in the environment. If the environment grows or changes so that some markers become covered, we can simply add some cheap markers. In an outside-in system the network of cameras would have to be extended - which is more expensive and more complicated.

    Tracking without markers

As already mentioned, visual tracking relies on finding features in consecutive frames and calculating the frame-to-frame correspondence. In this section we will consider the case where natural features (not known in advance) are utilized. We can divide the problem into parts:

Table 2.1: Marker-based solutions

Name                     | Type                     | Data | MFT | SDK                                                                   | TR
ISO/IEC 16022            | square, B&W              | 2355 | 1   | Studierstube Tracker (pose estimation); many others (reading only)   | S
ARTag                    | square, B&W              | 10   | 1   | ARToolkitPlus                                                         | S
ARToolkit markers        | square, B&W, any pattern | N/A  | 1   | ARToolkit and Studierstube Tracker                                    | S
Frame markers            | square, frame            | 9    | 1   | Studierstube Tracker                                                  | S
Split markers            | barcode                  | 6    | 2   | Studierstube Tracker                                                  | S
Dot markers              | circle                   | N/A  | 3   | Studierstube Tracker                                                  | S
Nested markers           | square, nested           | N/A  | 1   | N/A                                                                   | I
Multi-ring color markers | circle, nested           | N/D¹ | 3   | N/A                                                                   | I
Invisible markers        | square, invisible        | N/A  | 1   | ARToolkit with filtering                                              | S
IR markers               | square, invisible        | N/A  | 1   | ARToolkit with hardware                                               | S

Explanation: Data - amount of data expressed in bits; MFT - minimal number of markers in view to calculate the 6-DOF pose of the camera; SDK - implementation is available on the internet; TR - tracking range (S - standard, I - the design of the marker increases the tracking range).
¹ Each bit has 6 possible values; the number of bits is variable.

• Finding features. The features can be edges, blobs, T-junctions, colors, or even the horizon silhouette. There are many well-investigated solutions for finding natural features, like the Laplace or Sobel operators and their modifications. The two most important parameters of a finding method are repeatability (the same features are found when an object is seen from different angles and under different lighting conditions) and computation time. One of the best natural feature detectors is the FAST algorithm [RD05], which was mentioned in the article [WRM+08]. (A minimal detection sketch follows this list.)

• Finding two corresponding features in two different camera frames. The system must know whether feature A in one frame and feature B in the second frame are the same point in space. This problem also has some solutions, but in many cases they are computationally expensive and do not work on-line, or are robust only under specific conditions. Solutions to this problem rely on describing the features and comparing the descriptions frame-to-frame, or on comparing raw patches of the image. There is no guarantee that two or more features will not have the same description. Hence, methods of evaluating the certainty of a match have been investigated. They rely on predicting the camera position or on the assumption that the velocity of the camera movement is limited.

• Outlier removal. Not all matches with a previous frame are true (the matching algorithm can simply fail on a repetitive texture). Not all features found by the system belong to static objects in the scene. Sometimes a feature can belong to a moving object (like a moving person). The system cannot use such features for tracking.

• Calculation of the geometrical pose. Methods of finding the 2D-3D correspondence from a 2D projection have already been investigated. The problem of finding the camera pose is similar to problems appearing in stereo-vision. In stereo-vision there are two cameras which are looking at the same scene (their fields of view partially overlap). The geometrical relation between them is known, but not always. In vision-based tracking we have two consecutive frames which partially overlap, and the geometrical relation between the camera positions is also unknown. The mathematical model of the camera is similar to the one used in 3D graphics, enriched by modeling of radial distortion and slight tangential distortion. The mathematical model of the lens has a few coefficients which can be acquired by analysing a picture from the camera. It is important to find, in the picture, projections of points whose 3D coordinates are known. In the model we can distinguish the intrinsic and extrinsic camera matrices. The intrinsic matrix models the parameters of the lens and translates 3D coordinates into 2D coordinates; it does not change during the work of the camera (if we do not use zoom). The extrinsic matrix models the geometrical relation between the camera and the world (rotation and translation coefficients).

• Initial calibration. If the system tracks the camera pose using unknown features, it can only estimate the camera pose in a coordinate system which is not connected with the coordinate system of the real world. (This is called incremental tracking.) To create that connection, the system must find some features whose coordinates in the real world are already known (for example a calibrated marker or a set of previously learned features).
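To make the detection step concrete, here is a minimal sketch (my illustration, not code from the prototype) running the FAST detector as shipped in OpenCV; the input file name and threshold value are assumptions:

    #include <opencv2/opencv.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        // Load one camera frame in grayscale; the file name is a placeholder.
        cv::Mat frame = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
        if (frame.empty()) return 1;

        // FAST corner detection [RD05]: a pixel is a corner if a contiguous
        // arc of pixels on a circle around it is consistently brighter or
        // darker than the pixel itself.
        std::vector<cv::KeyPoint> keypoints;
        const int threshold = 20;   // assumed intensity difference threshold
        cv::FAST(frame, keypoints, threshold, /*nonmaxSuppression=*/true);

        // keypoints now holds candidate natural features for matching.
        std::printf("detected %zu FAST corners\n", keypoints.size());
        return 0;
    }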

Feature selection and tracking. As mentioned, finding corresponding features between two images is crucial. We can divide feature tracking methods into two groups according to the value being compared (a tracking sketch follows this list):

• Raw patches. In the simplest case raw image patches can be compared; of course the search area of each feature must be limited because of time consumption. Such an approach would also require that the searched features have the same size in the compared frames, which is not an option for augmented reality. But there are feature tracking algorithms based on raw patches which are robust and quite fast. An example of such a solution is the pyramid implementation of the Lucas-Kanade algorithm described in [Bou02]. That version of the Lucas-Kanade algorithm is implemented in the OpenCV library and is very easy to use. This class of algorithms is called optical-flow techniques. They are robust only when the differences between images are small. The pyramid implementation relaxes that limitation, but is more computationally expensive than the standard version. In augmented reality applications, in order to respect the small-difference limitation, the frames must be captured very frequently, or some initial guesses of the feature positions must be provided.

• Descriptor based methods are more robust than the previously described techniques. In order to check differences between features, their descriptors are compared. Most feature descriptors have the property of affine and/or scale invariance. This means that when features in two images are scaled with respect to each other (for example one image is a part of the second one) or warped, they can still be robustly matched. Another advantage is that, unlike the optical flow methods, they do not require small differences between images. But if we assume that limitation, much computational effort can be saved. Typical applications of feature descriptors are image stitching and pattern recognition. A survey and comparison of feature descriptors is presented in [MS05]. An example of a descriptor used in augmented reality applications is SIFT [Low04] (scale-invariant feature transform). An example of an AR application utilizing SIFT is described in [WRM+08]. The successor of SIFT is SURF [BTGL06].
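As a concrete illustration of the raw-patch branch, the sketch below selects features with the Shi-Tomasi criterion (discussed next) and tracks them with OpenCV's pyramid Lucas-Kanade implementation; the camera index and all parameter values are assumptions rather than values from the thesis prototype:

    #include <opencv2/opencv.hpp>
    #include <vector>

    int main() {
        cv::VideoCapture cap(0);              // assumed camera index
        if (!cap.isOpened()) return 1;

        cv::Mat frame, gray, prevGray;
        std::vector<cv::Point2f> prevPts, nextPts;

        cap >> frame;
        cv::cvtColor(frame, prevGray, cv::COLOR_BGR2GRAY);
        // Shi-Tomasi selection: keep corners whose smaller gradient-matrix
        // eigenvalue is large (see the 'good features to track' discussion).
        cv::goodFeaturesToTrack(prevGray, prevPts, 200, 0.01, 10.0);

        for (;;) {
            cap >> frame;
            if (frame.empty() || prevPts.empty()) break;
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

            std::vector<unsigned char> status;
            std::vector<float> err;
            // Pyramid Lucas-Kanade: coarse-to-fine search tolerates larger
            // frame-to-frame motion than the single-level algorithm.
            cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, nextPts,
                                     status, err);

            // Keep only successfully tracked points for the next iteration.
            std::vector<cv::Point2f> kept;
            for (size_t i = 0; i < nextPts.size(); ++i)
                if (status[i]) kept.push_back(nextPts[i]);

            prevGray = gray.clone();
            prevPts = kept;
        }
        return 0;
    }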

However, the quality and reliability of tracking depends highly on the features themselves, and not only on the algorithm. Important issues of this problem are described in the paper "Good features to track" by Shi and Tomasi [ST94]. The obvious property of a feature is the difference between its intensity and the intensity of its surroundings. Distinctive edges and corners are easier to find, and their position can be determined more accurately. Another problem is the orientation of features. Let us assume that a tracked edge is parallel to the direction of movement. In this case tracking cannot be accurate. Shi and Tomasi proposed a measure which allows evaluating the usefulness of a feature for tracking. It relies on an analysis of the eigenvalues of the gradient matrix of an image patch. The outcome of Shi and Tomasi is that both eigenvalues of a 'good' feature must be large. Large eigenvalues indicate corners or 'salt and pepper' textures, which can be reliably tracked. Another issue of tracking is the feature displacement model. The simplest case assumes that a feature can be translated by a 2D vector. A more sophisticated model assumes feature warping, which better reflects reality. The feature warping is represented by a 2x2 affine transformation matrix (the 2D displacement vector must also be included). This model has 6 parameters; hence it is harder to solve the equation. The convergence of this model is better when the features are 'good' in terms of the eigenvalue analysis. The affine feature displacement model has been used in the AR application described in the paper [KM07]. But in that case the affine transformation matrix was calculated from knowledge about the camera position and orientation.

Natural feature tracking using pre-acquired data. Natural features are not always unknown in advance. There is a project called Archeoguide [SK] which uses 'reference images'. First of all, the area where the tracking system should work is photographed from many sides and under different angles (the pose of the camera for each snapshot is known). When the tracking system works, each live frame is compared with the best matching frame from the database and a frame-to-frame correspondence is calculated (the frames in the database are calibrated). The system works similarly to marker-based tracking (we can treat reference images as markers), but the matching operation is more complicated and requires the usage of the techniques described above. The advantage of this solution is that we can pre-process some information and reduce the complexity of the on-line calculations. The disadvantage: the database of reference images must be kept up to date. A similar solution is proposed in the article [CCP02].

    Issues of camera pose estimation - camera models

    Here I present the mathematical model of camera projection, which is common for all camera

    registration techniques.

\[
\begin{pmatrix} x_c \\ y_c \\ z_c \end{pmatrix}
= R \begin{pmatrix} x_w \\ y_w \\ z_w \end{pmatrix}
+ \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}
\tag{2.1}
\]

\[
\begin{pmatrix} u \\ v \end{pmatrix}
= \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \end{pmatrix}
\begin{pmatrix} x_c/z_c \\ y_c/z_c \\ 1 \end{pmatrix}
\tag{2.2}
\]

where R is a 3 × 3 rotation matrix and t_x, t_y, t_z are the coordinates of the translation vector. The 2 × 3 matrix is the intrinsic camera matrix: f_x and f_y are the focal lengths, and c_x and c_y are the coordinates of the camera central point, expressed in pixels counted from the upper corner of the image. There are two focal lengths in order to model cameras with non-square pixels. This equation does not contain a distortion model. Below I present the camera model with distortion used in OpenCV.


    Figure 2.3: The camera model

\[
\begin{aligned}
x' &= x/z, \qquad y' = y/z \\
x'' &= x'\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x' y' + p_2 (r^2 + 2 x'^2) \\
y'' &= y'\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y'^2) + 2 p_2 x' y' \\
r^2 &= x'^2 + y'^2 \\
\begin{pmatrix} u \\ v \end{pmatrix}
&= \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \end{pmatrix}
\begin{pmatrix} x'' \\ y'' \\ 1 \end{pmatrix}
\end{aligned}
\tag{2.3}
\]

where k_1, k_2, k_3 are radial distortion coefficients and p_1, p_2 are tangential distortion coefficients. They do not change when the camera resolution changes, but the focal lengths and the center point must be scaled.
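To make the model concrete, the following small function (an illustration, not prototype code) projects a world point to pixel coordinates by applying equations 2.1-2.3 directly; OpenCV's own cv::projectPoints performs the same computation:

    #include <opencv2/opencv.hpp>

    // Project a world point to pixel coordinates: rigid transform (2.1),
    // perspective division, radial/tangential distortion, and intrinsic
    // matrix (2.3). All parameter values would come from calibration.
    cv::Point2d projectPoint(const cv::Matx33d& R, const cv::Vec3d& t,
                             double fx, double fy, double cx, double cy,
                             double k1, double k2, double k3,
                             double p1, double p2, const cv::Vec3d& pw) {
        cv::Vec3d pc = R * pw + t;             // eq. 2.1: world -> camera
        double xp = pc[0] / pc[2];             // x' = x/z
        double yp = pc[1] / pc[2];             // y' = y/z
        double r2 = xp * xp + yp * yp;         // r^2 = x'^2 + y'^2
        double radial = 1 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2;
        double xpp = xp * radial + 2 * p1 * xp * yp + p2 * (r2 + 2 * xp * xp);
        double ypp = yp * radial + p1 * (r2 + 2 * yp * yp) + 2 * p2 * xp * yp;
        return cv::Point2d(fx * xpp + cx, fy * ypp + cy);  // eq. 2.3
    }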

In order to calculate the intrinsic and extrinsic parameters, a vector of 3D points and their projections must be given. This data can easily be acquired by photographing a calibrated fiducial marker. It is also important to calculate the image point coordinates with subpixel accuracy. OpenCV contains algorithms which can estimate the intrinsic camera parameters, but only from coplanar points. The more control points in the image and the more images taken, the better the estimates will be.
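A plausible calibration routine using a planar chessboard (coplanar points, as noted above) could look like the following sketch; the board geometry and image file names are assumptions:

    #include <opencv2/opencv.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        const cv::Size board(9, 6);        // assumed inner-corner count
        const float square = 0.025f;       // assumed square size in metres

        // The same coplanar 3D points (z = 0) for every view of the board.
        std::vector<cv::Point3f> object;
        for (int y = 0; y < board.height; ++y)
            for (int x = 0; x < board.width; ++x)
                object.push_back(cv::Point3f(x * square, y * square, 0.f));

        std::vector<std::vector<cv::Point3f>> objectPoints;
        std::vector<std::vector<cv::Point2f>> imagePoints;
        cv::Size imageSize;

        for (int i = 0; i < 10; ++i) {     // assumed ten calibration shots
            cv::Mat img = cv::imread(cv::format("calib%02d.png", i),
                                     cv::IMREAD_GRAYSCALE);
            if (img.empty()) continue;
            imageSize = img.size();

            std::vector<cv::Point2f> corners;
            if (!cv::findChessboardCorners(img, board, corners)) continue;
            // Refine the corner locations to subpixel accuracy.
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                 30, 0.01));
            objectPoints.push_back(object);
            imagePoints.push_back(corners);
        }

        // Estimate the intrinsic matrix and the distortion coefficients
        // k1, k2, p1, p2, k3 of equation 2.3.
        cv::Mat cameraMatrix, distCoeffs;
        std::vector<cv::Mat> rvecs, tvecs;
        double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                         cameraMatrix, distCoeffs, rvecs, tvecs);
        std::printf("reprojection RMS error: %f\n", rms);
        return 0;
    }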

The intrinsic camera parameters must be estimated only once, but the extrinsic parameters for each frame. There are a few methods of estimating the camera pose. Most of them rely on solving the equations presented above. As we see, they are non-linear because of the division by z. But the camera geometry is also 3-dimensional and can be introduced that way (see figure 2.3). In order to solve equations based on the geometry introduced in the figure, the units in the world coordinate frame and the camera coordinate frame must be uniform. There is a scalar ratio which translates pixels to world units and vice-versa. That scalar is 'hidden' in f_x and f_y in equation 2.2.

    Solving the camera registration problem

Here I present a short list of methods for solving the camera registration problem. The list of course does not contain all algorithms, but gives an overview of already investigated approaches. We can distinguish two approaches: analytic - based on modeling rays from the object to the camera in 3-dimensional space - and iterative - based on minimizing the re-projection error. Some methods combine these approaches and exploit the advantages of both.


• Least squares minimisation of the re-projection error. This is a family of iterative methods. The base is a least squares iteration scheme (for example the Gauss-Newton method) which minimises the difference between real points in the image and projected points. The estimated parameters are the elements of the rotation matrix and the translation vector. The parameters must be initialized with some guessed values. For visual tracking purposes the best initial values are the extrinsic camera parameters from the previous frame. Many implementations of least squares solvers require a matrix of first derivatives with respect to the parameters, but it can be calculated using the difference quotient. Such a method is briefly described in the paper "Pose Tracking from Natural Features on Mobile Phones" [WRM+08].

• DLT method (direct linear transformation). The method relies on solving linear equations based on the perspective camera geometry. It can estimate the extrinsic parameters and the scaling factor between coordinate frames. A detailed description can be found in [dlt] [Qin96].

• POSIT algorithm [DD95] - an approach which combines analytic and iterative methods. First the orientation and translation are calculated by solving linear equations, then the resulting values are refined using an iterative approach. The algorithm converges very quickly; hence, it is good for real-time calculations.

• Lowe's algorithm - an approach relying on iterative least squares minimisation. The object is described in units of the camera coordinate frame. This makes the mathematical model so simple that it was possible to find equations for the first derivatives with respect to the position and rotation parameters. A detailed description can be found in [ACB98].

• SCAAT. An incremental tracking approach based on an iterative Kalman filter. It is able to estimate the pose even when information is incomplete (for example when too few points have been found). The algorithm is presented in [WB97]. Another visual tracking algorithm based on the ideas of SCAAT is described in [JNY00].

The presented algorithms are full of great ideas which can be applied without using the whole solution. For example, the first method (least squares minimisation) can be enriched with a Kalman filter. Or any analytic method can be improved by iterative refinement. The choice of method must be adjusted to the specifics of the problem and the initial knowledge about the camera movement.
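For reference, modern OpenCV ships a ready-made pose estimator, cv::solvePnP, whose default iterative mode performs a least squares minimisation of the re-projection error like the first method above. A hedged sketch, with the 3D-2D correspondences assumed as inputs:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Estimate the extrinsic parameters (R, t) of one frame from known
    // 3D points and their detected 2D projections.
    bool estimatePose(const std::vector<cv::Point3f>& worldPts,
                      const std::vector<cv::Point2f>& imagePts,
                      const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                      cv::Mat& R, cv::Mat& tvec) {
        cv::Mat rvec;  // rotation as a Rodrigues vector
        // SOLVEPNP_ITERATIVE refines an initial guess by Levenberg-Marquardt
        // minimisation of the re-projection error.
        if (!cv::solvePnP(worldPts, imagePts, cameraMatrix, distCoeffs,
                          rvec, tvec, /*useExtrinsicGuess=*/false,
                          cv::SOLVEPNP_ITERATIVE))
            return false;
        cv::Rodrigues(rvec, R);   // convert to a 3x3 rotation matrix
        return true;
    }

Passing the previous frame's pose together with useExtrinsicGuess=true would match the initialisation advice given above.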

    3D reconstruction for visual tracking

As stated at the beginning, all algorithms require knowledge of the 3D coordinates of world points. In the case of calibrated fiducial markers this is not a problem, but if tracking has to rely on initially unknown natural features, the coordinates must be obtained somehow. Techniques of 3D reconstruction must be applied. In order to find the 3D coordinates of a point, two different calibrated camera shots are needed. The coordinates of a point are found by calculating the intersection of two rays going from the center points of the cameras (two views) through the image planes to the world point. The algorithm is described in detail in [TV98]. The coordinates of a point found this way are not very accurate, because of camera calibration errors and inaccurate feature finding. The accuracy also depends on the difference in camera pose between the two considered views. The biggest error appears on the Z-axis of the camera coordinate system (the line from N to O in figure 2.3), but it should not disturb the solving of the camera registration problem. The requirement of two views means that a point must be found three times before it can be utilized for tracking. Sometimes that is too late, when the camera moves fast or the point is tracked unstably. A correct calculation of a point's position from only 2 views is also an optimistic case.
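A minimal two-view reconstruction sketch using OpenCV's linear triangulation; the projection matrices are assumed to come from the two calibrated camera poses:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Intersect rays from two calibrated views to recover 3D coordinates.
    // P1 and P2 are the 3x4 projection matrices K[R|t] of the two poses;
    // pts1 and pts2 are matched (undistorted) image points.
    std::vector<cv::Point3f> reconstruct(const cv::Mat& P1, const cv::Mat& P2,
                                         const std::vector<cv::Point2f>& pts1,
                                         const std::vector<cv::Point2f>& pts2) {
        cv::Mat pts4d;                            // 4xN homogeneous result
        cv::triangulatePoints(P1, P2, pts1, pts2, pts4d);

        std::vector<cv::Point3f> out;
        for (int i = 0; i < pts4d.cols; ++i) {
            cv::Mat c = pts4d.col(i);
            float w = c.at<float>(3);             // homogeneous scale
            out.push_back(cv::Point3f(c.at<float>(0) / w,
                                      c.at<float>(1) / w,
                                      c.at<float>(2) / w));
        }
        return out;
    }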


In my opinion it is possible to obtain the 3D coordinates of a point from a single view. To do this, some more information is needed - for example, information about the plane on which the point lies. If the tracking system had knowledge of a rough 3D model of the environment and the initial camera pose, it could calculate on which plane a point lies, and hence its 3D coordinates. A point on a plane has two degrees of freedom, and the camera also delivers 2 variables (the u and v coordinates), so the equation system has one unambiguous solution.
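A small sketch of this single-view idea, under the assumption that the plane is given in world coordinates as n · x = d: the viewing ray of pixel (u, v) is cast from the camera center and intersected with the plane:

    #include <opencv2/opencv.hpp>
    #include <cmath>

    // Recover the 3D point seen at pixel (u, v) when it is known to lie on
    // the plane n . x = d (world coordinates). R, t are the camera
    // extrinsics (world -> camera), so the camera center is C = -R^T t.
    bool backprojectToPlane(double u, double v,
                            double fx, double fy, double cx, double cy,
                            const cv::Matx33d& R, const cv::Vec3d& t,
                            const cv::Vec3d& n, double d, cv::Vec3d& X) {
        // Viewing ray direction in camera coordinates, rotated to world.
        cv::Vec3d dirCam((u - cx) / fx, (v - cy) / fy, 1.0);
        cv::Vec3d dirWorld = R.t() * dirCam;
        cv::Vec3d C = -(R.t() * t);           // camera center, world frame

        double denom = n.dot(dirWorld);
        if (std::abs(denom) < 1e-9) return false; // ray parallel to plane
        double s = (d - n.dot(C)) / denom;        // ray parameter at hit
        if (s <= 0) return false;                 // plane behind the camera
        X = C + s * dirWorld;
        return true;
    }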

There is another approach which allows finding the coordinates of points from 2 views along with the camera pose. It is called the 'five point algorithm', because it needs at least five corresponding points. This algorithm can calculate all of those things only up to a scale factor. If the real distance between two points in the image is known, all coordinates can be expressed in metric units. This limitation is not a problem: data about the metric distance can be provided using fiducial markers, or it can simply be assumed. Such an approach is presented in the system described in the paper [KM07]. The described application is initialised like this: the user directs the camera at the scene, pushes the button, then moves the camera 10 cm to the side (the program believes that it was 10 cm) and pushes the button again. In that way 2 views are indicated and the five point algorithm is applied. The main product of the five point algorithm is the 3 × 3 essential matrix. From that matrix the relative rotation and translation can be derived. The way to do it, and the whole algorithm with an implementation, is described in [Nis03a] and [SEN]. If the relative geometric correspondence between each pair of frames were accumulated (the initial camera pose should be known), we would get the actual camera pose, and hence a tracking system. This approach has one advantage: not all points have to be tracked from frame to frame; only a few are needed to keep the scaling factor. But the big disadvantage is drift. If any measurement of relative movement is inaccurate, that error will influence the estimated camera pose forever. There is no chance to correct that drift, because the system does not remember the previous state or frame. This approach is called 'Structure from Motion'. In order to improve the quality of the essential matrix estimation, all outliers must be removed. The most sophisticated method of outlier removal and pose estimation is preemptive RANSAC, described in the paper [Nis03b]. That system runs live, but it is not the only solution. A second, competitive technology is described below.
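Modern OpenCV (3.x and later) bundles a RANSAC-wrapped five-point solver; a hedged sketch of relative pose recovery from matched points (the matched point lists are assumed inputs) might look like this:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Recover relative camera motion (R, t up to scale) from two views
    // using the five-point algorithm [Nis03a] with RANSAC outlier removal.
    bool relativePose(const std::vector<cv::Point2f>& pts1,
                      const std::vector<cv::Point2f>& pts2,
                      const cv::Mat& cameraMatrix, cv::Mat& R, cv::Mat& t) {
        cv::Mat inlierMask;
        cv::Mat E = cv::findEssentialMat(pts1, pts2, cameraMatrix,
                                         cv::RANSAC, 0.999, 1.0, inlierMask);
        if (E.empty()) return false;
        // Decompose E and pick the physically valid (R, t) by a cheirality
        // check; t is a unit vector - the metric scale stays unknown.
        int inliers = cv::recoverPose(E, pts1, pts2, cameraMatrix,
                                      R, t, inlierMask);
        return inliers >= 5;
    }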

Simultaneous Localisation and Mapping (SLAM) is a group of technologies which originates from robotics. It was developed in order to build a map of an unknown environment and use it to track the position of a robot. We can argue that the movement of a robot is different and more predictable than the movement of a handheld camera, but successful implementations of AR based on SLAM [KM07] are the best evidence that the difference is not so large. The problem of creating the map and immediately tracking from that map is like a 'chicken and egg' problem. Errors made during the map creation phase influence tracking, and tracking influences the calculation of the positions of new points and the correction of previous ones. The idea which has been widely used to solve that problem is the Kalman filter [Kal]. The Kalman filter is a stochastic model of some process. It is able to predict the state of the process in the future. Generally two operations can be performed on the filter structure: prediction - calculation of the future process state, and correction - entering an actual measurement of the state, which is connected with an update of the coefficients of the stochastic model. The Kalman filter has a modification called the Extended Kalman filter (EKF), which supports non-linear models. KFs and EKFs are widely used in automation.

In terms of augmented reality the Kalman filter has a few applications. It can be used to predict the camera position and orientation. That predicted value (if it is close to the truth) can be used for finding the 2D positions of environment points in the camera frame. It can also be used as an initial guess of the camera pose for the iterative pose refinement procedure. It should also reduce the drift appearing in visual tracking systems. Kalman filters also have an application in the interpretation of data from sensors: they are able to smooth the noise coming from a sensor (like an accelerometer).
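To illustrate the predict/correct cycle, here is a hedged sketch of OpenCV's linear cv::KalmanFilter used to smooth and predict a camera position under a constant-velocity assumption; the state layout and noise values are illustrative, and a full 6-DOF pose filter would need an EKF:

    #include <opencv2/opencv.hpp>

    int main() {
        // State: (x, y, z, vx, vy, vz); measurement: (x, y, z).
        cv::KalmanFilter kf(6, 3, 0, CV_32F);
        const float dt = 1.0f / 25.0f;     // assumed 25 fps frame interval

        // Constant-velocity transition model: position += velocity * dt.
        kf.transitionMatrix = (cv::Mat_<float>(6, 6) <<
            1, 0, 0, dt, 0,  0,
            0, 1, 0, 0,  dt, 0,
            0, 0, 1, 0,  0,  dt,
            0, 0, 0, 1,  0,  0,
            0, 0, 0, 0,  1,  0,
            0, 0, 0, 0,  0,  1);
        cv::setIdentity(kf.measurementMatrix);          // observe position
        cv::setIdentity(kf.processNoiseCov, cv::Scalar(1e-4));
        cv::setIdentity(kf.measurementNoiseCov, cv::Scalar(1e-2));
        cv::setIdentity(kf.errorCovPost, cv::Scalar(1));

        // Each frame: predict, then correct with the visually measured
        // position. The prediction can serve as a search-area hint or as
        // an initial guess for iterative pose refinement.
        cv::Mat predicted = kf.predict();
        cv::Mat measurement = (cv::Mat_<float>(3, 1) << 0.f, 0.f, 0.f);
        cv::Mat smoothed = kf.correct(measurement);
        return 0;
    }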

Here I would like to describe how a typical SLAM system works, on the basis of the solution from [Dav03]. In such a system the camera movements and also the positions of points have probabilistic models. This means that the state of the system is modeled as a vector which can be divided into two parts: the first one describes the actual camera position and orientation, and the second one has an entry for each feature point. When a new frame is acquired, the system predicts the camera pose. After that, all feature points are reprojected, and according to their pose uncertainty, elliptic search areas are created for each of them. The size of an ellipse depends on the uncertainty of the point and of the camera pose. The images of the points are searched for in those search areas, and the covariance matrix and state vector are updated according to their real positions. In such a system the covariance matrix is quite large: (7 + 3x) × (7 + 3x), where x is the number of feature points. But the maintenance of that matrix is very important (including its non-diagonal entries), because it models the correlations between points. For example, if some points lie on the same surface and near each other, their absolute positions can be uncertain while knowledge about the relative positions of those points can be very certain. Initialization of a new point is also done in a top-down way. A new point is added to the database using information from one view only. It is modeled as an infinite straight line. The depth of a point cannot be calculated from one view, so some assumptions are made according to initial knowledge about the environment. The depth is modeled as a 1D range on that straight line, which is then filled with a big number of regularly distributed particles (see figure 2.4(b)). These particles create a discrete model of the probability density function. At the beginning all particles are equal, and in the following frames observations of the point are applied to them. After some number of frames the particles should form a single peak, as in figure 2.4(a). If that does not happen, the point is discarded. Otherwise the point is used for tracking, with the depth estimate at the location of the peak. The quality of the position estimate is very high, but it takes much time and requires that the point be visible for some period of time. Moreover, after the invested effort, the point should be used for tracking for a long time. The design of that system assumes that the tracking viewpoint will be repeatable.

Figure 2.4: Initialisation of a point in probabilistic SLAM - (a) probability density of depth, (b) graphic interpretation of particles.
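The depth-initialization step described above can be sketched as a simple 1D particle filter; this toy example (my own illustration, with an assumed Gaussian measurement likelihood) reweights regularly spaced depth hypotheses with each new observation:

    #include <cmath>
    #include <vector>

    // One depth hypothesis (particle) on the ray of a new feature.
    struct Particle { double depth; double weight; };

    // Reweight all particles with the likelihood of the current observation.
    // error(d) would be the reprojection error of the feature in the new
    // frame if its depth were d; a Gaussian likelihood is assumed here.
    void update(std::vector<Particle>& ps, double (*error)(double),
                double sigma) {
        double sum = 0;
        for (Particle& p : ps) {
            double e = error(p.depth);
            p.weight *= std::exp(-0.5 * e * e / (sigma * sigma));
            sum += p.weight;
        }
        if (sum <= 0) return;
        for (Particle& p : ps) p.weight /= sum;  // normalize to a density
    }

    int main() {
        // Uniform prior: regularly distributed particles on an assumed
        // 0.5-10 m depth range.
        std::vector<Particle> ps;
        const int n = 100;
        for (int i = 0; i < n; ++i)
            ps.push_back({0.5 + i * (10.0 - 0.5) / (n - 1), 1.0 / n});

        // Toy likelihood pretending the true depth is 3 m; repeated over
        // several frames, the weights should form a single peak there.
        update(ps, [](double d) { return d - 3.0; }, 0.5);
        return 0;
    }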

    Survey of visual markerless tracking

Figure 2.5 presents a survey of all natural feature tracking techniques discussed in this section. They are divided into the processing stages of the algorithm. A red pentagon with the letter K indicates techniques utilizing Kalman filters; the possible types of Kalman filter are presented in the bottom left corner of the image.

Figure 2.5: Summary of visual tracking from natural features

    Software libraries

Computer vision systems are not standardized, but there are some software libraries which gather implementations of algorithms useful for creating augmented reality systems. This section is divided into two parts: the first describes frameworks - implementations of whole functionalities like marker-based tracking or markerless tracking - and the second describes libraries which implement only individual algorithms (without connections between them).

    Frameworks

    Table 2.2: AR Frameworks

    Name                  LC   NFT  Markers                              PE   License  Language / platform
    ARToolkit             Yes  No   ARToolkit                            Yes  GPL      C/C++ and Java (wrapper); Linux, Windows, macOS
    ARToolkitPlus         Yes  No   ARToolkit, ARTag                     Yes  GPL      C/C++; Windows, Linux, WinCE
    Studierstube Tracker  Yes  No   ARToolkit, ARTag, Frame, Split, Dot  Yes  N/A      C/C++; Windows, Linux, WinCE, Symbian, iPhone
    Studierstube ES       Yes  Yes  ARToolkit, ARTag, Frame, Split, Dot  Yes  N/A      C/C++; Windows, Linux, WinCE, Symbian
    SceneLib              No   Yes  No                                   Yes  LGPL     Linux

    LC - lens calibration, PE - pose estimation, NFT - natural feature tracking


    Of the entries in table 2.2, SceneLib deserves particular attention: it is a framework for designers of SLAM systems in robotics. I did not examine that library in depth, but there are AR solutions based on it, for example [Dav03].

    Algorithms and other functionality

    OpenCV - a large library for computer vision and image processing; Windows and Linux implementations are available. The library is divided into 4 parts:

    • CXCORE: data structures used in computer vision; simple drawing and rendering; time measurement; mathematical operations (matrices).

    • HighGUI: functions for displaying windows and refreshing the graphics in them; capturing images from devices and files. Allows rapid prototyping and portability.

    • CV: typical algorithms for computer vision, e.g. SURF, filters, corner detection, camera calibration, stereo vision.

    • Machine learning.

    OpenCV offers very broad functionality, sufficient for many computer vision applications, and allows rapid development. It also contains a very interesting type system: arrays, matrices and images are aggregated under one super-type, and functions recognize the input type at runtime.
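    A short sketch (my own, against the OpenCV 1.x C API) illustrating that super-type: cvSmooth accepts a CvArr*, so the very same call works on an IplImage and on a CvMat; the file name is a placeholder:

    #include <cv.h>
    #include <highgui.h>

    int main() {
        // cvSmooth takes CvArr*, the common super-type of IplImage and CvMat;
        // the concrete type is resolved at runtime.
        IplImage *img = cvLoadImage("frame.png", CV_LOAD_IMAGE_GRAYSCALE);
        if (!img) return 1;
        IplImage *blurred = cvCreateImage(cvGetSize(img), img->depth, 1);
        cvSmooth(img, blurred, CV_GAUSSIAN, 5, 5, 0, 0);      // on an image

        CvMat *m  = cvCreateMat(3, 3, CV_32FC1);
        CvMat *ms = cvCreateMat(3, 3, CV_32FC1);
        cvSetIdentity(m, cvRealScalar(1));
        cvSmooth(m, ms, CV_GAUSSIAN, 3, 3, 0, 0);             // same call on a matrix

        cvReleaseImage(&img); cvReleaseImage(&blurred);
        cvReleaseMat(&m); cvReleaseMat(&ms);
        return 0;
    }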

    OpenSURF [Eva09] - an open-source implementation of SURF. I compiled the library and ran the sample program, and it worked very slowly. I replaced the Fast-Hessian feature detector (standard for SURF) with FAST [RD05], but the program still worked too slowly. The SURF implementation in OpenCV performs much better.

    Integrating Vision Toolkit (http://ivt.sourceforge.net/) - a library similar to OpenCV (a set of algorithms). Here I want to point out the differences with respect to OpenCV:

    • Contains a SIFT implementation

    • Its Harris corner detector is faster than the one in OpenCV

    • Very good object-oriented architecture

    2.1.2 Inertial Tracking

    This approach relies on capturing translations and rotations of an object using the phenomenon of inertia. The movement of a body in space can be represented as a sequence of translations and rotations. The inertial sensor must be mounted on the tracked object (inside-out approach). Inertial tracking is sourceless: it does not need any reference object in the environment. Its most considerable problem is drift. Devices measuring translation are called accelerometers; devices measuring rotation are called gyroscopes.

    Accelerometers

    Accelerometers are devices that measure the linear acceleration of a body along one axis. In the simplest case they consist of a mass and a subsystem that measures the force acting on the mass (for example, a mass mounted on a piezoelectric crystal [RJD+01]). The acceleration of the body is proportional to the inertial force acting on the mass. Double integration of the acceleration over time gives the position. However, the calculated position is not error-free:

    • measurement errors

    • discretization errors

    • numerical integration errors

    • accumulation of previous errors

    Table 2.3: 3D accelerometers

    Phidget Accelerometer [2]
      range ±3 G; noise 6 milli-G; update rate 60 Hz; price £114.95; OS: Windows, Windows CE.
      USB; designed for quick prototyping of AR solutions. The price includes shipment.

    MOD-MMA7260Q [3]
      range ±1.5 G to ±6 G; noise N/A; update rate N/A; price €38.95; OS: N/A.
      Board with an ARM processor and a 3D accelerometer; must be programmed by the user.

    USB1600-PC [4]
      range ±1.5 G to ±6 G; noise N/A; update rate >60 Hz; price $275; OS: Windows, Windows CE.
      Similar to the Phidget Accelerometer; comes with drivers, software and code samples.

    These factors mean that accelerometers can track the position precisely, but only over a short period of time. Hence, a tracking system cannot rely solely on accelerometers; it should be supported by a second system that does not drift (for example, visual tracking), as the sketch below illustrates. A single accelerometer has one degree of freedom. A 3D accelerometer (3-DOF) consists of three single accelerometers measuring acceleration along three perpendicular axes.
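    The following sketch (my own) shows numerically why drift is so severe: a constant bias of 6 milli-G (the noise figure of the Phidget sensor in table 2.3, used here as an assumed bias) already produces a position error of about 3 m after 10 seconds of double integration:

    #include <cstdio>

    // Naive dead reckoning by double integration: a tiny constant bias 'bias'
    // in the measured acceleration grows quadratically in the position estimate
    // (error ~ 0.5 * bias * t^2), which is why accelerometer-only tracking drifts.
    int main() {
        const double dt   = 1.0 / 60.0;     // 60 Hz update rate (cf. table 2.3)
        const double bias = 0.006 * 9.81;   // 6 milli-G bias in m/s^2 (assumed)
        double v = 0.0, x = 0.0;
        for (int i = 1; i <= 600; ++i) {    // 10 seconds, true motion = none
            double a = 0.0 + bias;          // measured acceleration = truth + bias
            v += a * dt;                    // first integration: velocity
            x += v * dt;                    // second integration: position
            if (i % 60 == 0)
                std::printf("t=%2ds  position error = %.3f m\n", i / 60, x);
        }
        return 0;
    }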

    Fortunately, accelerometers are readily available on the market. There are analog electronic circuits with built-in 3D accelerometers; they can be connected to a one-chip microcomputer (with analog inputs) that transmits the data to the PC (USB or RS-232). Devices that can be connected directly to the PC (without soldering or low-level programming) are also available, and some products include an SDK for C++ and other programming languages. A few handheld devices (for example smartphones) have built-in 3D accelerometers.

    Gyroscopes

    Gyroscopes are devices that measure orientation in space (insensitive to translation, in contrast to accelerometers). Most old-fashioned gyroscopes use a quickly rotating wheel as a reference: the movement of the object is sensed by rotation encoders, which measure the angles between the tracked body and the wheel. The main problem of gyroscopes is drift, caused by the small friction between the axis of the wheel and its bearing. This error can be reduced by calculating the orientation iteratively, but that results in an accumulation of numerical errors. Hence, the gyroscope must be periodically re-calibrated to remain accurate over time. An electronic gyroscope can also be implemented as a micro-electro-mechanical system (MEMS), which contains micro-miniature vibrating elements instead of spinning wheels. As an example of a MEMS-based PC gyroscope, table 2.4 introduces the InertiaCube2+ [FHA98]. Another approach to building gyroscopes (and also accelerometers) are silicon micromachines (iMEMS) [LKPB02]; the working principle of an iMEMS gyroscope is described in [GK]. Silicon micromachined sensors are less accurate than MEMS sensors, but their dimensions, power consumption and cost are very low.

    [2] http://www.trossenrobotics.com/store/p/5160-PhidgetAccelerometer-3-Axis.aspx, http://www.active-robots.com/products/phidgets/three-axis-accelerometer.shtml
    [3] http://www.olimex.com/dev/mod-mma7260q.html
    [4] http://www.embeddedsys.com/subpages/products/usb1600.shtml


    Table 2.4: MEMS and iMEMS gyroscopes

    InertiaCube 2+ (MEMS)
      maximum angular speed 1200 °/s; angular resolution 0.01°; update rate 180 Hz;
      interface RS-232/USB adapter; price about €2000.
      http://www.intersense.com/uploadedFiles/Products/IC2+_datasheet_0908.pdf

    Gyro Breakout Board (iMEMS)
      maximum angular speed 150 °/s; sensitivity 12.5 mV/°/s; update rate 80 Hz;
      interface N/A; price about €69.
      Note: the board contains only a one-chip 1-DOF analog gyroscope and requires a microcomputer with ADCs.
      http://www.watterott.com/Gyro-Breakout-Board-ADXRS150-150-degree-sec_1

    IMU Combo Board (iMEMS)
      maximum angular speed 75 °/s; sensitivity 15 mV/°/s; update rate 40 Hz;
      interface N/A; price about €69.
      Note: the board contains only a one-chip 3-DOF analog gyroscope and requires a microcomputer with ADCs.
      http://www.sparkfun.com/commerce/product_info.php?products_id=842

    The lack of accuracy can be compensated by a visual tracking system. These characteristics make silicon micromachines a very frequent choice for mobile devices.

    6-DOF inertial trackers

    Accelerometers and gyroscopes deliver different types of information. In order to robustly calculate the 6-DOF pose, the system must integrate both accelerometers and gyroscopes. A few such solutions are available on the market, for example the InertiaCube [FHA98], which contains 3 accelerometers, 3 vibrating gyroscopes and 3 magnetometers. The idea of an inertial tracking system based on silicon micromachines is proposed in [LKPB02] and implemented, for example, in the Wii Remote [Wii] game controller. This device contains a 3D accelerometer and can be extended with a gyroscope connected to the expansion slot; the Wii Remote also uses a visual tracking system consisting of IR diodes and an IR camera on the controller. Table 2.5 introduces a cheap 6-DOF inertial tracker consisting of 3 iMEMS accelerometers, 3 gyroscopes and a microcontroller with A/D converters for data acquisition and communication with the PC.
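    One common way to fuse the two sensor types is a complementary filter; the sketch below (my own, not the InertiaCube's proprietary algorithm) blends the fast but drifting integrated gyro angle with the drift-free but noisy tilt angle derived from gravity. The axis conventions and the weight alpha are assumptions:

    #include <cmath>

    // Complementary filter for one tilt angle: the gyroscope path dominates at
    // high frequency (responsive, but drifts), the accelerometer path at low
    // frequency (noisy, but anchored to gravity).
    class TiltFilter {
        double angle;   // estimated pitch angle [rad]
        double alpha;   // weight of the gyro path, e.g. 0.98
    public:
        TiltFilter(double a = 0.98) : angle(0.0), alpha(a) {}

        // gyroRate: angular speed [rad/s]; ax, az: accelerometer axes [m/s^2]
        double update(double gyroRate, double ax, double az, double dt) {
            double gyroAngle  = angle + gyroRate * dt;   // integrate gyro (drifts)
            double accelAngle = std::atan2(ax, az);      // gravity-based tilt (noisy)
            angle = alpha * gyroAngle + (1.0 - alpha) * accelAngle;
            return angle;
        }
    };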

    Each accelerometer is able to measure dynamic (vibration) and static (gravity, tilt) acceleration. If the system is to measure tilt, high-sensitivity devices are recommended. It is possible to measure the tilt around the two axes parallel to the earth, but rotation around the third (perpendicular) axis is not sensed. It could be sensed if the rotation axis were quite far (10 cm) from the sensor, so that the linear acceleration caused by the angular speed affects the sensor. Even then, motion along a circle and motion along a straight line remain indistinguishable when the system senses only the data from a single accelerometer. But if the rigid body whose pose is measured is equipped with a second 3D accelerometer, the rotation becomes distinguishable.


    Table 2.5: 6-DOF inertial sensors

    Atomic IMU - 6 Degrees of Freedom
      gyroscope range ±300 °/s; gyroscope sensitivity 3.3 mV/°/s; gyroscope update rate 88 Hz;
      accelerometer range ±1.5 G to ±6 G; price €115.00; interface UART.
      Note: the device consists of the sensors and an Atmel ATmega168 microcomputer (must be programmed by the user).


    6-DOF tracking with 2 accelerometers. I think that estimating the 6-DOF pose is possible using two 3D accelerometers mounted on the two opposite ends of the handheld device. In this case there is only one axis (the one passing through both accelerometers) around which rotation cannot be sensed. It is mentioned in [acc] that such a system can measure roll, pitch and yaw as long as the common axis of the accelerometers does not point along gravity; the signals begin to disappear as the common axis approaches the acceleration vector. Only sharp movements are measured accurately. I suppose that the accuracy of such a system would be much worse than that of a silicon micromachined gyroscope.
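    A planar sketch of the idea (my own, with illustrative numbers): for a rigid body, the readings of two accelerometers differ only by the rotational terms, so subtracting them cancels the common translation:

    #include <cstdio>

    // For a rigid body, a2 - a1 = alpha x r + omega x (omega x r), where r is
    // the baseline vector between the sensors. In the 2-D case, the component
    // of the difference perpendicular to the baseline gives the angular
    // acceleration directly.
    int main() {
        const double d = 0.20;                    // 20 cm baseline (assumed)
        double a1_perp = 0.10, a2_perp = 0.46;    // perpendicular components [m/s^2]
        double alpha = (a2_perp - a1_perp) / d;   // angular acceleration [rad/s^2]
        std::printf("angular acceleration = %.2f rad/s^2\n", alpha);
        // Pure translation moves both sensors identically, so it cancels here;
        // rotation about the baseline axis itself remains unobservable.
        return 0;
    }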

    2.1.3 Other technologies

    GPS and DGPS

    Global positioning system. GPS is a system for 3-DOF position tracking over the whole world. Its accuracy is about 10-20 meters (after Selective Availability was switched off in 2000) [Roy]. The system consists of 28 satellites orbiting the earth; each satellite carries 3-4 atomic clocks to ensure accurate time measurement and broadcasts the current time and its position. A special receiver calculates the distance to each visible satellite and, from that, its own position. If it sees 4 satellites, both position and altitude can be estimated; if the receiver sees fewer than 4 satellites, the 2D position can be calculated using an altitude entered by the user.

    DGPS - differential GPS, a supporting system for GPS. It consists of earth-bound stations whose positions are known very precisely. These stations receive the satellite signals and calculate corrections, which are transmitted to special DGPS receivers. The receivers analyse both the satellite signals and the signals from the earth-bound stations and calculate the position with 1-3 m precision. The HA-NDGPS system, currently under development, is expected to reach an accuracy of 0.1-0.15 m [DGP]. There are many DGPS standards and many stations on the earth [Gal]; anyone can set up a station of their own if there is a need. DGPS is used in some augmented reality applications, for example LIFEPLUS [Vla04]. The accuracy of DGPS is not good enough to base the camera registration on it, but it can serve as a component of hybrid large-area AR systems.

    GPS and DGPS have become very common in the last few years. They are built into handheld devices (phones, PDAs) and are also available as separate devices with a Bluetooth or USB interface. Most GPS receivers use the NMEA protocol to communicate with the outside world. One problem of GPS technology is the low update rate (about 1 Hz). Another is that the signal disappears when, for example, a vehicle passes under a wide overpass; to mitigate such situations, the GPS receiver can be augmented with a low-cost accelerometer [Dav08]. GPS receivers can also integrate other sensors, such as magnetometers (electronic compass).
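    As an illustration of the NMEA protocol mentioned above, here is a hedged C++ sketch that extracts the position from a "GGA" sentence; checksum verification and the many other sentence types are omitted:

    #include <cstdlib>
    #include <sstream>
    #include <string>
    #include <vector>

    // Minimal parser for the NMEA GGA sentence that most GPS receivers stream
    // over their serial interface.
    struct Fix { double lat, lon; int satellites; };

    // NMEA encodes angles as ddmm.mmmm; convert to decimal degrees.
    static double nmeaToDegrees(const std::string& v, const std::string& hemi) {
        double raw = std::atof(v.c_str());
        double deg = static_cast<int>(raw / 100);
        deg += (raw - deg * 100.0) / 60.0;
        return (hemi == "S" || hemi == "W") ? -deg : deg;
    }

    bool parseGGA(const std::string& line, Fix& out) {
        if (line.compare(0, 6, "$GPGGA") != 0) return false;
        std::vector<std::string> f;
        std::stringstream ss(line);
        std::string tok;
        while (std::getline(ss, tok, ',')) f.push_back(tok);
        if (f.size() < 8 || f[6] == "0") return false;   // field 6: fix quality
        out.lat = nmeaToDegrees(f[2], f[3]);
        out.lon = nmeaToDegrees(f[4], f[5]);
        out.satellites = std::atoi(f[7].c_str());
        return true;
    }

    // Example: "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,..."
    // yields lat 48.1173, lon 11.5167, 8 satellites.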

    Electronic compass

    An electronic compass is a very sensitive magnetometer, which measures the magnetic field of the earth. A magnetometer consists of a coil (or several coils): if the magnetic field around the coil changes, a varying current arises in the coil, and that current is a function of the distance from the magnetic field source and of the relative orientation between the emitting and receiving coils. Magnetic trackers are widely used in VR because of their low price. Electronic compasses use magneto-inductive elements instead of coils, mostly containing many of them. The measurement of such a system can obviously be affected by other sources of magnetic fields and also by the tilt of the device; electronic compasses therefore often integrate tilt sensors (for example inclinometers) to improve the quality of the measurement. The accuracy of an electronic compass is about 0.5 degree. One problem of such sensors is that the magnetic field of the earth is not homogeneous. The electronic compass is a very good complement to GPS, because it provides data about the 2D orientation. An interesting AR solution that exploits the compass is Wikitude [Wika].
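    A sketch of the tilt compensation mentioned above (one common textbook formulation, my own code, with assumed axis conventions): the magnetometer vector is rotated back into the horizontal plane using pitch and roll from the tilt sensor before the heading is computed:

    #include <cmath>

    // mx, my, mz: magnetometer axes; pitch, roll in radians (from an inclinometer).
    // Returns the heading in radians, 0 = magnetic north.
    double tiltCompensatedHeading(double mx, double my, double mz,
                                  double pitch, double roll) {
        // Rotate the measured field back to the horizontal plane.
        double xh = mx * std::cos(pitch) + mz * std::sin(pitch);
        double yh = mx * std::sin(roll) * std::sin(pitch)
                  + my * std::cos(roll)
                  - mz * std::sin(roll) * std::cos(pitch);
        double heading = std::atan2(yh, xh);
        if (heading < 0) heading += 2.0 * 3.14159265358979323846;
        return heading;
    }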

    Mechanical tracking

    This type of tracking system uses mechanical linkages between the reference and the tracked object. There are two types of mechanical trackers:

    • The reference and the target are connected by a chain of linkages. The position is computed from the angles between the linkages, which are measured using potentiometers or incremental encoders.

    • The reference and the target are connected by a system of wires. The wires are rolled on coils and tensed by a spring system so that distances can be measured accurately.

    The number of degrees of freedom depends on the construction. Most systems support 6 DOF, but only a limited range of motion is possible: the tracking range is about 1.8 m. Such a system can be used in an immersive human interface (in cases where the user cannot walk very far). Mechanical linkages have found successful application in force-feedback systems, which are also useful in user interfaces. Mechanical tracking has also been utilized for augmented reality purposes in a group of systems called AR telescopes; they are used for tracking the camera orientation (a change of position is impossible) within a very limited range of movement (only yaw and pitch). However, they have been replaced by visual tracking because of their lack of accuracy: electronic rotation encoders are very temperature-sensitive.

    Gravity sensors

    Inclinometers are devices that measure their orientation with respect to the gravitational field. In most cases they consist of a closed vessel containing a fluid and sensors that measure the level of the fluid; the orientation of the vessel can be obtained by measuring the pressure or the level of the liquid. There are also solutions that use an electrolytic fluid or opto-electric sensors. The main limitations of this technology are a long response time (because of the viscosity of the liquid) and sensitivity to shocks and acceleration. The second problem can be solved by integrating inclinometers with accelerometers for shock measurement. Like inertial sensors and compasses, inclinometers do not need any reference object in the environment.


    Ultra sound tracking

    Ultrasound tracking is based on measuring the time of flight of a sound wave. The frequency of the pulse signal is between 20 kHz and 40 kHz, to prevent the user from hearing it. In order to track the position and orientation of an object, 3 or more emitters must be placed on it; to measure the 3D position of each emitter, 3 receivers must be mounted on the reference. The emitters are small and lightweight and can easily be carried by a person or mounted on any object. They send their signals sequentially, or each of them uses a different frequency. The tracking accuracy is very good (0.5 mm - 6 mm). The limitations of such systems are the range (from 25 cm to 4.5 m) and the sensitivity to temperature, pressure, humidity and occlusion: all factors that affect the speed of the sound wave decrease the accuracy or make tracking impossible. This limits the application of such systems to rooms.
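    A minimal sketch (my own, with illustrative numbers) of the underlying geometry: the time of flight gives a distance to each receiver, and with three receivers at known positions the emitter location follows in closed form:

    #include <cmath>
    #include <cstdio>

    // Three receivers at (0,0,0), (d,0,0) and (0,d,0) measure times of flight;
    // distance = speed of sound * time. Subtracting the sphere equations yields
    // closed-form expressions for the emitter position (a z-sign ambiguity remains).
    int main() {
        const double c = 343.0;   // speed of sound [m/s] at ~20 °C; temperature
                                  // changes this, hence the sensitivity noted above
        const double d = 1.0;     // receiver baseline [m]
        double t1 = 0.004373, t2 = 0.004373, t3 = 0.004373;  // measured ToF [s]
        double r1 = c * t1, r2 = c * t2, r3 = c * t3;

        double x = (r1*r1 - r2*r2 + d*d) / (2.0 * d);
        double y = (r1*r1 - r3*r3 + d*d) / (2.0 * d);
        double z = std::sqrt(std::fmax(0.0, r1*r1 - x*x - y*y));
        std::printf("emitter at (%.3f, %.3f, %.3f) m\n", x, y, z);
        return 0;
    }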

    Ultra-Wideband

    Ultra-Wideband is a radio-wave technology similar to Bluetooth or WiFi. It operates at very low energy and uses a very wide part of the spectrum. The Ubisense company developed a tracking system based on this technology [uwb]. The system consists of emitters placed on objects and receivers mounted on the reference; the density of the receiver network is similar to the density of access points in a WLAN. The tracking accuracy is about 10-15 cm. It was designed to track people in a big building. In general, the system has 3 degrees of freedom, and extending it to 6 DOF would be hard or impossible because of the low accuracy.

    2.1.4 Hybrid tracking

    As we have seen, the technologies described in the previous sections have their advantages and disadvantages, and none of them is ideal for tracking in an outdoor environment. Hence, intuition suggests that a good system should combine two or more technologies. Such an approach is called hybrid tracking: the weakness of one technology can be compensated by another. For example, a system can integrate GPS with a 3D accelerometer that tracks the position (over a short period of time) when the GPS signal is lost. In order to show how hybrid tracking works, I will present a few examples. The example projects are also summarized in table 2.6, which gives an overview of the techniques integrated in each of them.

    Examples

    System II (2003)

    System II (described in [CMC03]) is very interesting because it uses a camera with a wide-angle 'fish-eye' lens (190°). Additionally, the system is equipped with a 3D gyroscope, whose measurements serve as predictions for the feature tracker. The camera pose is estimated using a typical 'Structure from Motion' approach: the essential matrix is calculated, and the rotation matrix and translation vector are derived from it. With a fish-eye lens the spherical distortions are very strong, so they have to be compensated. The wide view has many advantages: the field of view covers a very large area, so even during large rotations some part of it stays common to two successive frames, and a fish-eye lens also allows movements along the focal axis of the camera to be estimated properly. But it also has disadvantages: the resolution of the image in front of the camera is very low, which generates estimation errors for movements perpendicular to the camera axis, such as panning or tilting.
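    For illustration, here is a hedged sketch (my own, using the OpenCV 1.x C API; System II's actual implementation is not public) of the textbook decomposition of an essential matrix into a rotation and a translation direction:

    #include <cv.h>

    // E = U * diag(1,1,0) * V^T;  R = U * W * V^T (or U * W^T * V^T),
    // t = +/- last column of U. Four (R, t) combinations exist; the right one
    // is chosen by checking that triangulated points lie in front of both
    // cameras (omitted here). E, R: 3x3 CV_64F; t: 3x1 CV_64F.
    void decomposeEssential(CvMat* E, CvMat* R, CvMat* t) {
        CvMat *U = cvCreateMat(3, 3, CV_64F);
        CvMat *S = cvCreateMat(3, 3, CV_64F);
        CvMat *V = cvCreateMat(3, 3, CV_64F);
        cvSVD(E, S, U, V, 0);                            // E = U * S * V^T

        double w[] = { 0,-1, 0,   1, 0, 0,   0, 0, 1 };
        CvMat W = cvMat(3, 3, CV_64F, w);
        CvMat *UW = cvCreateMat(3, 3, CV_64F);
        cvMatMul(U, &W, UW);                             // U * W
        cvGEMM(UW, V, 1.0, NULL, 0.0, R, CV_GEMM_B_T);   // R = U * W * V^T

        for (int i = 0; i < 3; ++i)                      // t ~ third column of U
            cvmSet(t, i, 0, cvmGet(U, i, 2));

        cvReleaseMat(&U); cvReleaseMat(&S); cvReleaseMat(&V); cvReleaseMat(&UW);
    }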


    The system also provides 3D reconstruction, implemented as triangulation from 2 views; it is done during the essential matrix calculation. Note that the coordinates of these features are not utilized directly for tracking. The authors of the article state that it is not a mature solution; moreover, they write that the system drifts.

    System I (2004)

    B. Jiang, U. Neumann and S. You presented a hybrid tracking system in their paper "A Robust Hybrid Tracking System for Outdoor Augmented Reality" [JNY04]. I call it 'System I' because it does not have a name of its own. It integrates a digital 3D gyroscope and visual tracking. The designers assumed that the change in view caused by a small rotation is much bigger than the change resulting from a small linear movement; hence the system contains a gyroscope and no accelerometer, which decreases the cost. The system is designed for urban environments. Visual registration is based on detecting unknown lines and tracking them from frame to frame. The global orientation is, in general, measured by the gyroscope; the measurements are updated by the visual tracking system from time to time to prevent drift. Each camera pose estimate from the visual subsystem is evaluated as reliable or not, and only reliable estimates are used to correct the drift. If a measurement is pronounced unreliable, the visual subsystem goes back to the previous reliable state and estimates the movement to the current camera image. The technologies used here complement each other: when the rotation of the camera is too fast for visual tracking, only the inertial subsystem is used; but when the user stops for a moment or starts to move slowly, the information from the camera corrects the drift. This system could not be used in a natural environment, because straight lines do not exist in the non-artificial world. However, the way of data processing proposed in this system is a very good basis for every system that integrates visual and inertial techniques.

    LIFEPLUS (2004)

    The LIFEPLUS system [Vla04] was designed to support sightseeing at cultural heritage sites. It integrates visual tracking, DGPS and a compass. The system is large and complicated: besides tracking, it contains a network infrastructure based on GPRS and WLAN, which allows access to a remote database of multimedia content. The system is designed to work over a very large area. The information from the camera and the compass is sufficient for tracking, but the GPS data provide the initial calibration and allow the other parts of the tracking system to work more reliably.

    Vidente (2008)

    The Vidente system [SMK+08], developed at the Graz University of Technology, has been designed to visualize pipes and cables under the surface of the earth. The tracking system integrates a GPS receiver supporting EGNOS and a very accurate inertial orientation sensor (InertiaCube 3). The GPS delivers the position with an accuracy of a few meters; the InertiaCube3 is a very accurate sensor (less than one degree) and costs €2000 (data from http://www.cybermind.nl/Info/EURO_PriceList.htm#ISense). Together, these two sensors deliver tracking accurate enough to render pipes and cables whose positions are taken from a GIS database. Such a tracking system is quite easy to implement and does not require much computational power, but it is too expensive.

    WikiTude (2009)

    WikiTude [Wika] has already been mentioned in this work. The system integrates GPS and a compass in order to show descriptions of interesting places on the earth. It is not hard to notice that such a tracking system is not very accurate, but rendering labels on big buildings or other large objects does not require high precision. This system shows how the weakness of one technique can be compensated by another: the GPS delivers the 3-DOF position and the compass a 1-DOF orientation. I suspect that the two other orientation parameters are sensed by accelerometers, but I have not found any extensive documentation of that software.


    Table 2.6: Hybrid augmented reality systems (• = technology used, ◦ = not used)

    Name       Visual  Inertial  GPS  Compass
    System I     •        •       ◦      ◦
    WikiTude     ◦        •       •      •
    Vidente      ◦        •       •      ◦
    LIFEPLUS     •        ◦       •      •
    System II    •        •       ◦      ◦

    These few examples show that hybrid systems are a good direction of development. If one technology is too weak to cover all possible conditions, why not add a second one that works better in some specific situations? Combining technologies brings another benefit: for example, in visual tracking the computations are heavy and consume much time, but hints from inertial sensors can simplify them, as the sketch below illustrates.
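    A small sketch (my own; all names and constants are illustrative assumptions) of how such a hint works: the gyroscope's rotation estimate predicts where a feature will reappear, so the visual tracker searches only a small window instead of the whole frame:

    #include <cmath>

    struct Px { int x, y; };

    // Predict the feature's new pixel position from the yaw/pitch change measured
    // by the gyro (small-angle approximation, pinhole camera, focal length in px).
    Px predictFeature(Px last, double dYaw, double dPitch, double focalPx) {
        Px p;
        p.x = static_cast<int>(last.x + focalPx * dYaw);
        p.y = static_cast<int>(last.y + focalPx * dPitch);
        return p;
    }

    // The search radius grows with the gyro's drift uncertainty but stays far
    // smaller than the image: e.g. a ~20 px window instead of an exhaustive
    // search over a 640x480 frame.
    int searchRadius(double gyroSigmaRad, double focalPx) {
        return static_cast<int>(std::ceil(3.0 * gyroSigmaRad * focalPx)) + 5;
    }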

    2.2 Hardware

    In this section I would like to discuss hardware platforms for augmented reality systems. A few years ago AR was implemented on wearable computers, and the user watched the world through HMDs. Those times are fortunately gone: contemporary AR systems are implemented on tablet PCs or mobile phones equipped with a camera. We can say that the user looks through the handheld device, with the augmented camera image displayed on the screen in real time. In the following sections I will consider 3 hardware solutions.

    2.2.1 Ultra mobile PC and TabletPC

    Ultra Mobile PCs (UMPCs) are handheld devices with a big display (7-10 inches). Their processors are compatible with the PC architecture, hence they run PC operating systems like Windows or Linux, and the size of the RAM is also like in a normal PC. They have communication interfaces like USB ports, WiFi and Bluetooth, and some of them are equipped with a keyboard. Here I briefly analyse the usefulness of UMPCs for the 'Burgbau zu Friesach' project; table 2.7 further below lists examples of UMPCs and TabletPCs.

    Advantages:

    • Big display (7-10 inches). A large display lets the user see more. The ages of the exhibition visitors will be diverse, so the devices have to be adjusted to the needs of all users; I am thinking here of elderly people, who very often have problems with their eyesight. With a bigger display they feel more comfortable and may not have to use glasses. Bigger displays also have a higher resolution, which helps with rendering fonts, and they can fit a larger amount of text or graphics, which helps to avoid scrolling. The quality of the display is also important because of the lighting conditions: the display must have a very high contrast to be readable in sunny weather.

    • High portability. In the introduction I described a user story where a family walks through the construction site with a mobile device. A UMPC fits that scenario very well: it is lightweight and can work for about 5 hours on battery. I am not sure whether it would last that long with a running AR application; probably not, because of the high processor usage. An additional battery may therefore be necessary.

    • Other devices (cameras, sensors) can be connected. Many UMPCs available on the market have built-in cameras of quite high resolution, but these may not be sufficient for outdoor augmented reality. Very often the camera is mounted on the wrong side, which prevents its use for AR purposes. Fortunately, UMPCs have USB ports (some of them even IEEE 1394); most cameras and other sensors use these interfaces, so the possibilities of expansion are large. However, a UMPC equipped with an additional camera and sensors requires a casing that keeps all the parts together. An example of a system based on a UMPC with a housing and grip is presented in [SMK+08].

    • Large computing power, plenty of memory. The computing power of a UMPC is sufficient for augmented reality; the main difference from laptops is the single-core processor. UMPCs also have 2D/3D hardware graphics accelerators. The PC-like processor contains a floating-point unit, which is useful for the calculations connected with tracking and rendering.

    • Easy implementation of software. Software for such a device can be developed on a standard PC and simply run on the UMPC: no emulator or special compiler is needed, and the operating systems do not differ. The main difference lies in the hardware user interface: UMPCs have touch screens and sometimes QWERTY keyboards.

    Disadvantages:

    • The exhibitor must own the devices and rent them to spectators, which increases the risk of theft or damage by the visitors.

    • The UMPC can be too big. I know this contradicts something I said before, but the weight of a UMPC is about 800 g, and for some people walking with something so heavy in their hands can be uncomfortable.

    Table 2.7: UMPCs and TabletPCs available on the market

    Name                      Processor                Speed    RAM     Price  Cam/CB/WiFi/BT  USB/COM/IEEE1394/GPU  Weight
    Samsung Q1EX-71G          VIA Nano                 1.2 GHz  2 GB    $749   1/1/1           2/0/0/1               640 g
    Samsung Q1 Ultra          Intel Ultra Mobile A110  800 MHz  1 GB    €1029  1/1/1           2/0/0/1               860 g
    ELV Touchscreen-Panel-PC  VIA Nano                 1.0 GHz  1 GB    €1049  0/0/0           3/2/0/1               3600 g
    Gigabyte M704             VIA C7M ULV              1.2 GHz  768 MB  €745   1/0/1/1         2/0/1/1               780 g

    CB - camera on the back side

    2.2.2 Mobile phone

    Implementing augmented reality on mobile phones is also possible, but much harder than on a PC. I am speaking here about phones that allow running native code on their processors (Symbian, WinCE, iPhone); implementations of AR applications on J2ME phones are not known. In comparison to contemporary PCs, the computational power and the amount of RAM are much lower. Some mobile phones also do not support floating-point operations in hardware; an FPU is not essential, but numerical algorithms behave more stably when running on an FPU. I decided to discuss 4 mobile platforms: Symbian, WinCE, iPhone and Android. The last one is programmed in Java, but it does not contain the standard Java VM known from PCs or J2ME phones; hence it is faster and allows native code to run on the processor.


    Figure 2.6: Examples of ultra mobile PCs: (a) Samsung Q1 Ultra, (b) Gigabyte M704

    As crucial criteria I consider:

    • The presence and quality of the camera.

    • The presence of inertial sensors. More and more new mobile phones are equipped with accelerometers and gyroscopes.

    • The ability to run native code implemented in C/C++. Most computer vision libraries are implemented in C/C++, and mathematical libraries as well. C/C++ offers the best trade-off between code speed and programming comfort. However, porting software from a PC to a mobile phone requires a complete re-engineering of the code (described in [WS09]). But there