
University of Klagenfurt
Department of Computer Science
ISYS

Master's thesis

HYBRID TRACKING FOR AUGMENTED REALITY

Tymoteusz Sielach

Supervisor: prof. Martin Hitz

Klagenfurt, 2009

Contents

1 Introduction
2 Analysis
  2.1 Tracking
    2.1.1 Visual Tracking
      Marker-based tracking
      Tracking without markers
      Software libraries
    2.1.2 Inertial Tracking
      Accelerometers
      Gyroscopes
      6-DOF inertial trackers
    2.1.3 Other technologies
      GPS and DGPS
      Electronic compass
      Mechanical tracking
      Gravity sensors
      Ultra sound tracking
      Ultra-Wideband
    2.1.4 Hybrid tracking
      Examples
  2.2 Hardware
    2.2.1 Ultra mobile PC and TabletPC
    2.2.2 Mobile phone
      Windows Mobile
      Android
      iPhone
      Symbian
    2.2.3 AR Telescopes
  2.3 Propositions of solution
    2.3.1 Hybrid tracking
    2.3.2 Tracking from pre-acquired data
3 Development of the prototype
  3.1 Description of prototype
    3.1.1 Overview
      Technical specification
    3.1.2 Database
    3.1.3 Camera pose estimation
    3.1.4 Point coordinates estimation
    3.1.5 Features - finding, describing, matching
      SURF based feature tracking
      Pyramid Lucas-Kanade algorithm
    3.1.6 Marker based tracking
    3.1.7 Initial camera calibration
    3.1.8 Inertial tracking
  3.2 Possible improvements
    3.2.1 Descriptor based tracking
    3.2.2 Kalman filtering
    3.2.3 Weighted tracking error function
    3.2.4 Speeding-up and better robustness
    3.2.5 Summary
  3.3 History of development
    Natural feature tracker
    Pose estimation and reconstruction procedures
    Improvement of architecture
    Improvements of tracking quality
    Change of feature tracker
    Speeding up
4 Conclusion
  4.1 Results
    4.1.1 Outdoor
    4.1.2 Indoor
    4.1.3 Verification of requirements
  4.2 Hardware recommendation
  4.3 Tracking method recommendation
    4.3.1 Pose estimation algorithm
    4.3.2 Map management
      Tracking with extensible map
      Tracking from pre-acquired data
    4.3.3 Feature tracking and detection method
    4.3.4 Inertial subsystem
    4.3.5 Summary
  4.4 Initialisation and marker subsystem
  4.5 Problems unsolved
  4.6 Technical problems during development
Bibliography

Chapter 1

    Introduction

There is a historical project called "Burgbau zu Friesach", which takes place in Friesach (Carinthia). The project consists of building a medieval castle from the ground up using medieval methods and technologies. The whole process will take 30 to 40 years. Currently the construction site is fixed and the designers are working in their office in Friesach. The project is already in progress, but the first stones will stand on the construction site only in a few years. The problem is:

• How to make the construction site attractive for visitors, especially in the early phase of construction.

• How to document the whole process.

• How to convey knowledge about the construction to visitors in an attractive way.

The idea is to solve these problems using modern informatics. The documentation problem can be solved using cameras which automatically take pictures of the construction, store them in a database, and process them later. The problems of conveying knowledge and presenting the building can be solved by creating an augmented reality system which overlays multimedia content and a 3D model of the rising castle onto the construction site. Augmented reality is a very good approach for this purpose, because it does not neglect the things that happen in the real world (the building itself), which are very important. For example, a 3D presentation of the castle and its construction watched on a desktop PC at home does not convey nearly as much information or leave the same impression. AR delivers information in an unobtrusive way. Hence, the construction site can look as it did in medieval times, without information boards and arrows guiding visitors. It will give the impression of a truly living construction site, not just a museum.

I would like to introduce the vision of the system through user scenarios. There will be two types of users:

• Visitor - someone who comes to sightsee the construction site and is focused on getting information about the construction process and its history.

• Designer - a person who wants to see how the castle will look in the future, compare some variants of the design, or adjust some dimensions of the building.

    Visitor

A family came to see the construction of the castle. They bought tickets and got a handheld device with a 7-inch screen on one side and a camera on the other. After turning it on, an image from the camera appeared on the screen. The instructions recommend looking through the device at two colored stakes stuck into the ground near the entrance to the construction site. After a second of looking at the stakes, the caption 'ready' popped up on the screen. From that moment, all movements of the device are tracked by the system. The father of the family directed the lens at the empty place where the castle is to be built over the next 30 years or so, but on the screen it appeared immediately. In the right corner of the screen a date appeared, indicating a moment in time about 30 years ahead. The main functionality of the system is to render the castle in the concrete state of construction indicated by that date. The family discovered that the application contains a time slider which allows choosing the moment in time at which the castle is displayed. It appears after pushing the 'time line' button and covers the whole screen. It contains dates and some milestones of the construction process. After changing the selected time stamp (simply by sliding a finger on the touch screen), the time line disappears and the augmented castle in the selected construction phase becomes visible. The family approached the building to see the details of the 3D model better. They realized that there is a rectangle in the middle of the screen (similar to the one in a camera's viewfinder). When the Romanesque window was visible in that rectangle, its color changed. When the father pushed the 'shutter' button, a few windows with text and multimedia content about that detail appeared. They were displayed as long as he was watching that detail through the device, and they disappeared after he changed the virtual gaze direction. Later they learned that in order to prevent the windows from disappearing, they can be held by hand on the touch screen. They also noticed that apart from the castle there are other 3D objects. When they came closer, they saw that these are virtual posters containing information about the castle and its current construction phase. They were easy to read through the handheld device, and it was also easy to intuitively change the view between the castle and an information poster. When they changed the date on the time line, the content of the posters also changed according to the castle's state. After sightseeing the castle, the family went to other parts of the construction site. They came to the place where stones were being prepared. Virtual posters appeared around the area they entered, while the posters near the castle vanished. The father approached a man dressed like a medieval mason with the device. Multimedia content related to the mason appeared on the screen, so he could read about his work and role in the construction of the castle.

After sightseeing the virtual castle the family came back home and decided to visit the project's website. There was a text field on that website where they entered their ticket ID number. After sending the form, they could see their whole trip through the construction site and a list of things they saw, read the descriptions again, and view the multimedia content. The son said that it would be very helpful for him when writing an essay for his history class.

    Designer

The designer is sitting in his office in the center of Friesach. He draws a few versions of the castle and its more detailed parts using CAD software. He tries to imagine how they will look in the real world and to choose the best one. He is also thinking about how the castle will fit the shape of the terrain and which dimensions would be best. Answers to all of those questions can be obtained using augmented reality. The designer saves models of the castle in the BZF file format (BZF is an extended 3DStudio file [*.3ds] used in the whole platform for storing 3D objects) using a plug-in for his design software. Then he loads that file onto a handheld device (the same one used by the family in the previous scenario) and goes to the construction site. He calibrates the device using the stake markers at the construction site (the same way as in the visitor's scenario) and starts to view his project. First he looks at the castle from a few places and from a large distance, then he approaches the walls and looks at some details. After that he switches the system to another version of the project. He walks further away to examine it from a distance. Then he realizes that the tower is too low. He quickly switches to the dimension adjusting mode and starts to tune the tower's height. All of his changes are visible immediately, which is very helpful for finding the ideal height. After that he corrects a few other dimensions in the same way. He also realizes that a defensive wall does not fit the other parts of the castle very well, but he has prepared another variant beforehand. He quickly switches to that version. When he has an ideal solid of the castle on the screen, he pushes the 'shutter' button (also used in the visitor's scenario). In designer's mode this button has a different function: it stores a screenshot on the hard drive. Then he sends this image to his co-workers by email, using standard software (integrated in the UMPC's OS). In order to save the adjusted dimensions he chooses the 'save' option from the menu (available in the dimension adjusting mode). The changes are stored in the BZF file and can be propagated back to the CAD software.

The design of such a system is not very common and causes some problems. Initially we can divide them into two groups:

• Interface design problems. I suspect that the system can display too much multimedia content at one time, which can leave the user bewildered and overloaded with information. Rendering the castle from the inside can also be a problem. Let us imagine that someone wants to see the courtyard, but its floor is quite high and not built yet. The user stands under it and does not see anything. A similar problem can happen if he walks into the inside of some solid (for example a defensive wall). It is an open problem whether the castle should be rendered from the inside and how to present it to the user.

• Technical problems. The design of augmented reality applications is hard and still full of atypical problems. Each system is a new challenge. But on the macro scale there are some common problems in augmented reality applications. Augmented reality relies on rendering 3D models and other data over an image of the real world taken by a camera. In order to do this, the system must have information about the exact camera position and orientation. The problem of acquiring this data is called 'tracking'. There are many methods of tracking, which utilize many different physical phenomena. All of them have advantages and disadvantages, which makes them very suitable for some purposes and unsuitable for others. A very common approach is to combine several methods in one system. The accuracy of tracking is also crucial for AR purposes. The second problem is 3D rendering. It relies on placing a 3D model in the scene so that it does not overlap objects which occur in front of the virtual object. It should look as if it were part of the real world. The computational power needed for rendering (and tracking too) is also a problem. Many mobile phones and PDAs do not support hardware floating point operations.

In this thesis I will focus on solving the tracking problem in an outdoor environment.

Chapter 2

    Analysis

The aim of this work is to design an augmented reality system which:

• will make sightseeing at the construction site of the 'Burgbau zu Friesach' project more attractive

• will be able to render big buildings

• will work robustly in a limited outdoor area (the area can be prepared)

• will have a short response time

• can render at least 25 fps

• will be easy to use - intuitive interface, lightweight

• will be able to display multimedia content of several types (3D models, videos, images, text)

• will be able to play sounds

With the design of an augmented reality system come 3 types of problems:

• Rendering - a subsystem which allows drawing several objects on the screen. It must be fast enough to render the image fluently.

• Tracking - to properly render an augmented scene, the system must know the exact trajectory and position of the camera. This is called the geometrical registration problem.

• Hardware platform - in the contemporary world no one constructs a computer from scratch. There are many ready-made micro computers on the market. Mainly they use Intel-architecture processors or ARM processors. The problem is to choose the one which is the fastest and has enough built-in sensors for the purpose of tracking.

The first section considers the tracking problem. The next section describes hardware platforms. Rendering is outside the scope of this thesis.

    2.1 Tracking

In this section we will consider a variety of tracking techniques. They are ordered by how frequently they appear in contemporary augmented reality systems. In the implementation of tracking we can distinguish two approaches:


• Outside-in tracking - sensing devices are placed at motionless positions in the environment. Moving objects are tracked.

• Inside-out tracking - sensors are mounted on the moving object which is to be tracked.

The first approach works in specially prepared environments, which are relatively small. The second approach can work in an unprepared and unbounded environment.

    2.1.1 Visual Tracking

Visual tracking is an approach which uses one or more cameras. Special algorithms look for features in the raster image and describe and classify them. Consecutive frames are analyzed and the spatial correspondence between them is calculated. The features can be known to the software and even specially placed in the environment - this approach is called "marker-based tracking". Markers do not have to be visible to humans; there are visual tracking systems which rely, for example, on infrared LEDs [CN98]. There are also tracking systems which rely only on natural features (edges, blobs, T-junctions, colors, etc.). They can work in unprepared environments - this fact greatly increases the range of applications of Augmented Reality. The camera can be attached to the tracked device (for example a Tablet, PDA or HMD) - the inside-out approach. Or the system can consist of many cameras mounted in the environment - the outside-in approach. In the next subsections we will discuss marker-based tracking, tracking without markers, and software libraries which can be helpful in implementing visual tracking.

    Marker-based tracking

Marker-based tracking is one of the most developed topics. During the last 15 years a great many types of fiducial markers have been developed. The features which differentiate marker systems are:

• Robustness, velocity, and accuracy of tracking.

• Type of markers. In the simplest classification we can distinguish color markers and shape markers.

• The variety of distances from which the marker can be recognized. For example, a very big marker which is noticeable from a great distance cannot be recognized when the camera is next to it and captures only a part of it.

• How many bits of information can be coded on the marker. Some marker standards assume the use of checksum bits.

• How big the markers are and how obtrusive they are. How large is the area they cover? The Studierstube Tracker library [WLS08] contains some concepts for unobtrusive markers.

• Whether the marker system is scalable. Is it possible to use it over a large area?

• How many markers must be in one frame to calculate the 6-DOF pose.

Here I present the list of marker systems which I studied:

ISO/IEC 16022 standard [Wikb]. The marker contains a 2D data matrix and two solid edges, which help to find the marker. ISO/IEC 16022 markers can be combined together to make a large matrix consisting of segments delimited by solid edges. The segments have the shape of a square or rectangle. There are many standard dimensions for such markers - from 8x8 up to 144x144. The largest version can contain 2355 bits of data. Each marker has an ECC200 checksum. ISO/IEC 16022 was designed to replace old-fashioned barcodes. There are many free libraries for reading and generating ISO/IEC 16022 codes. The Studierstube Tracker library can also estimate the 6-DOF pose of the camera using such a marker. A big advantage of this system is that it can contain a large amount of data; we can store a whole URL or even a simple 3D model there. See 2.1(a).

Figure 2.1: Types of markers - (a) ISO/IEC 16022 marker, (b) ARTag marker, (c) frame marker, (d) split marker, (e) dot marker, (f) circular markers, (g) nested marker, (h) stake marker.

There are a few other standards using data matrices, for example ARTag (2.1(b)) [Fia05]. An ARTag marker contains 36 bits of information; 10 of them contain ID data (1024 different markers), while the other 26 bits provide redundancy to decrease the chance of false identification. ARTag is designed to be identified and tracked very robustly.

ARToolkit markers [Kat] - a system of square-shaped markers identified by a graphic template. The template can be any black-and-white image, but the more detailed the symbol is, the worse the quality of identification. Symbols consisting of big black-and-white areas yield the best robustness and enlarge the distance from which markers are recognizable.

Frame markers [WLS08] - rectangular markers which consist only of a frame (inside the marker there can be anything). On the interior side of the frame the ID information (9 bits) is encoded, making it look like a decoration. The code contains a checksum and is arranged so that it allows determining the orientation. See 2.1(c).


Split markers [WLS08] - a variation of the frame marker which consists of only two parallel sides of the frame. Both sides contain barcodes with the same ID information (6 bits) encoded. The pair of barcodes differ in only one bit, which stores the orientation. See 2.1(d).

Dot markers [WLS08]. Each marker is a black dot with a white ring. The markers form a two-dimensional grid which is applied to a flat surface. Each 4 dots form a grid cell, which is matched against a precomputed template. This allows determining the position and orientation of the camera. The solution is scalable but limited to flat surfaces. The dots cover a very small part of the surface. See 2.1(e).

Nested markers [TKK+06] - a solution for improving the scalability and accuracy of tracking for black-and-white rectangular markers. See 2.1(g). The concept is recursive: there is one high-level marker which consists of some smaller (lower-level) markers. Lower-level markers can also contain nested smaller markers. Each marker contains a visual code and can be unambiguously identified. When the camera is far from the marker, the system uses the top-level marker. If the camera is close, the lower-level markers are used for geometrical registration. The system can also use markers of many levels simultaneously to increase the tracking accuracy.

Multi-ring color markers [CN98] - another approach to improving the scalability of marker-based tracking. They are quite similar to nested markers, because this concept also relies on embedding one marker in another. The author of the article [CN98] proposed two solutions: "proportional width ring markers" and "constant width ring markers". The first concept increases the tracking range. A marker consists of a few concentric rings, and the width of each ring is 2 times bigger than the width of the ring inside it. There exist markers of many levels: level 1 markers consist of a centre and one ring, level 2 markers have 2 rings, and so on. To determine the 6-DOF pose the camera must see three or more markers. Therefore level 1 markers must be arranged densely, while bigger markers can be arranged more sparsely, because the system uses them when the distance to the lens is large. Furthermore, the rings have one of 6 colors, which helps with identification. Constant width ring markers also have a constant number of rings; hence the tracking range of each marker is also constant.

Invisible markers embedded in images [Hwa07] - standard black-and-white square markers can be applied to static or moving images as noise which is unnoticeable to humans. The encoded information can be extracted using a standard camera and a Wiener filter. If we subtract the original image from the acquired one after noise reduction, we receive an image of the marker. Then it has to be normalized and converted to a binary image using a global precomputed threshold. The resulting picture can be robustly processed with marker tracking software.

IR markers [PP04] - the proposed system uses ARToolkit-style markers drawn with an invisible IR ink. The application uses 2 cameras aligned through a half mirror; one of them is equipped with an IR filter for capturing markers, and the second is equipped with a visible light cutoff filter for capturing the real scene. The signal from the IR camera is processed with the ARToolkit library and 3D models are augmented onto the image from the scene camera. To the user it looks as if the application were a robust markerless visual tracking system.

Stake markers - my own idea for the "Burgbau zu Friesach" project. The marker is a stake with color bars on the upper end, while the other end is sharp so it can be stuck vertically into the earth (see figure 2.1(h)). The color bars create a visual code which allows markers to be distinguished. The width of a bar is constant and big enough to be recognized by the camera from a large distance. In order to calculate the position and orientation, two markers must be in the field of view. The color code should be similar to the one used in multi-ring color markers. The number of used colors should be low, to avoid recognition errors in different lighting conditions. In addition, the colors can be divided into two groups according to the wave frequency (for example: (red, yellow, orange) and (blue, violet, green)). Colors from the two groups could be used alternately, also to avoid recognition errors. For each stake the system must remember its 2D position on the ground and its height. The heights of all markers must be normalized to some common imaginary level. I assume that all stakes are perpendicular to that level. Stake markers are designed to be stuck around the area where the tracking system should work. In figure 2.2 two methods of placement are presented (in the context of the "Burgbau zu Friesach" project). The gray color indicates the area where spectators should not stay; red squares represent stakes. We must also assume that virtual objects are placed in the middle of the environment (in the circle called castle). In figure 2.2(a) we see that markers are placed in the middle, so they are always visible when looking at the castle. They also should not disturb the viewing, because they will be covered by the rendered 3D model. But when the user stands very close to the stakes or between them, tracking can stop working, because the rising castle will cover some markers on the opposite side of the circle. In figure 2.2(b) the stakes are placed around the area. The big advantage of this method is that spectators can be anywhere in the area, but the danger of markers being covered by the building is larger.

The effectiveness of the proposed marker system depends on where the user directs the camera. I assume that in most cases the camera will look along an axis parallel to the surface of the earth (small tilts are possible). In order to support 6-DOF in every location in the tracking area, stake markers must be integrated with some other technique.

During tracking the system always looks for the points where color bars meet. The length of a color bar is constant, so the 2D and 3D coordinates of those points are easy to calculate. If the camera sees 2 stakes, 2 points from each of them can be used to estimate the camera pose. The more points or stakes are visible, the better the quality that can be achieved.

Figure 2.2: Methods of placement of stake markers - (a) in the middle, (b) around the area.

All systems presented in this section can be used with both inside-out and outside-in tracking. However, for augmented reality applications inside-out tracking is more desirable. This approach relies on tracking markers placed in the environment. If the environment grows or changes so that some markers become covered, we can simply add some cheap markers. In an outside-in system the network of cameras would have to be extended - which is more expensive and more complicated.

    Tracking without markers

As already mentioned, visual tracking relies on finding features in consecutive frames and calculating the frame-to-frame correspondence. In this section we will consider the case where natural features (not known in advance) are utilized. We can divide the problem into parts:

Table 2.1: Marker-based solutions

Name                     | Type                     | Data | MFT | SDK                                                                   | TR
ISO/IEC 16022            | square, B&W              | 2355 | 1   | Studierstube Tracker (pose estimation); many others (reading only)   | S
ARTag                    | square, B&W              | 10   | 1   | ARToolkitPlus                                                         | S
ARToolkit markers        | square, B&W, any pattern | N/A  | 1   | ARToolkit and Studierstube Tracker                                    | S
Frame markers            | square, frame            | 9    | 1   | Studierstube Tracker                                                  | S
Split markers            | barcode                  | 6    | 2   | Studierstube Tracker                                                  | S
Dot markers              | circle                   | N/A  | 3   | Studierstube Tracker                                                  | S
Nested markers           | square, nested           | N/A  | 1   | N/A                                                                   | I
Multi-ring color markers | circle, nested           | N/D¹ | 3   | N/A                                                                   | I
Invisible markers        | square, invisible        | N/A  | 1   | ARToolkit with filtering                                              | S
IR markers               | square, invisible        | N/A  | 1   | ARToolkit with hardware                                               | S

Explanation: Data - amount of data expressed in bits; MFT - minimal number of markers in view to calculate the 6-DOF pose of the camera; SDK - implementation is available on the internet; TR - tracking range (S - standard, I - the design of the marker increases the tracking range).
¹ Each bit has 6 possible values; the number of bits is variable.

• Finding features. The features can be edges, blobs, T-junctions, colors, or even the horizon silhouette. There are many well-investigated solutions for finding natural features, like the Laplace or Sobel operators and their modifications. The two most important parameters of a finding method are repeatability (the same features are found when an object is seen from different angles and under different lighting conditions) and computation time. One of the best natural feature detectors is the FAST algorithm [RD05], which was mentioned in the article [WRM+08]. (A minimal detection sketch follows this list.)

• Finding two corresponding features in two different camera frames. The system must know whether feature A in one frame and feature B in the second frame are the same point in space. This problem also has some solutions, but in many cases they are computationally expensive and do not work on-line, or are robust only under specific conditions. Solutions to this problem rely on describing the features and comparing the descriptions frame-to-frame, or on comparing raw patches of the image. There is no guarantee that two or more features will not have the same description. Hence, methods of evaluating the certainty of a match have been investigated. They rely on predicting the camera position or on the assumption that the velocity of the camera movement is limited.

• Outlier removal. Not all matches with a previous frame are true (the matching algorithm can simply fail on a repetitive texture). Not all features found by the system belong to static objects in the scene. Sometimes a feature can belong to a moving object (like a moving person). The system cannot use such features for tracking.

• Calculation of the geometrical pose. Methods of finding the 2D-3D correspondence from a 2D projection have already been investigated. The problem of finding the camera pose is similar to problems appearing in stereo-vision. In stereo-vision there are two cameras which are looking at the same scene (their fields of view partially overlap). The geometrical relation between them is known, but not always. In vision-based tracking we have two consecutive frames which partially overlap, and the geometrical relation between the camera positions is also unknown. The mathematical model of the camera is similar to the one used in 3D graphics, enriched by modeling of radial distortion and slight tangential distortion. The mathematical model of the lens has a few coefficients which can be acquired by analysing a picture from the camera. It is important to find, in the picture, projections of points whose 3D coordinates are known. In the model we can distinguish the intrinsic and extrinsic camera matrices. The intrinsic matrix models the parameters of the lens and translates 3D coordinates into 2D coordinates; it does not change during the work of the camera (if we do not use zoom). The extrinsic matrix models the geometrical relation between the camera and the world (rotation and translation coefficients).

• Initial calibration. If the system tracks the camera pose using unknown features, it can only estimate the camera pose in a coordinate system which is not connected with the coordinate system of the real world. (This is called incremental tracking.) To create that connection, the system must find some features whose coordinates in the real world are already known (for example a calibrated marker or a set of previously learned features).
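To make the detection step concrete, here is a minimal sketch (my illustration, not code from the prototype) running the FAST detector as shipped in OpenCV; the input file name and threshold value are assumptions:

    #include <opencv2/opencv.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        // Load one camera frame in grayscale; the file name is a placeholder.
        cv::Mat frame = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
        if (frame.empty()) return 1;

        // FAST corner detection [RD05]: a pixel is a corner if a contiguous
        // arc of pixels on a circle around it is consistently brighter or
        // darker than the pixel itself.
        std::vector<cv::KeyPoint> keypoints;
        const int threshold = 20;   // assumed intensity difference threshold
        cv::FAST(frame, keypoints, threshold, /*nonmaxSuppression=*/true);

        // keypoints now holds candidate natural features for matching.
        std::printf("detected %zu FAST corners\n", keypoints.size());
        return 0;
    }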

Feature selection and tracking. As mentioned, finding corresponding features between two images is crucial. We can divide feature tracking methods into two groups according to the value being compared (a tracking sketch follows this list):

• Raw patches. In the simplest case raw image patches can be compared; of course the search area of each feature must be limited because of time consumption. Such an approach would also require that the searched features have the same size in the compared frames, which is not an option for augmented reality. But there are feature tracking algorithms based on raw patches which are robust and quite fast. An example of such a solution is the pyramid implementation of the Lucas-Kanade algorithm described in [Bou02]. That version of the Lucas-Kanade algorithm is implemented in the OpenCV library and is very easy to use. This class of algorithms is called optical-flow techniques. They are robust only when the differences between images are small. The pyramid implementation relaxes that limitation, but is more computationally expensive than the standard version. In augmented reality applications, in order to respect the small-difference limitation, the frames must be captured very frequently, or some initial guesses of the feature positions must be provided.

• Descriptor based methods are more robust than the previously described techniques. In order to check differences between features, their descriptors are compared. Most feature descriptors have the property of affine and/or scale invariance. This means that when features in two images are scaled with respect to each other (for example one image is a part of the second one) or warped, they can still be robustly matched. Another advantage is that, unlike the optical flow methods, they do not require small differences between images. But if we assume that limitation, much computational effort can be saved. Typical applications of feature descriptors are image stitching and pattern recognition. A survey and comparison of feature descriptors is presented in [MS05]. An example of a descriptor used in augmented reality applications is SIFT [Low04] (scale-invariant feature transform). An example of an AR application utilizing SIFT is described in [WRM+08]. The successor of SIFT is SURF [BTGL06].
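As a concrete illustration of the raw-patch branch, the sketch below selects features with the Shi-Tomasi criterion (discussed next) and tracks them with OpenCV's pyramid Lucas-Kanade implementation; the camera index and all parameter values are assumptions rather than values from the thesis prototype:

    #include <opencv2/opencv.hpp>
    #include <vector>

    int main() {
        cv::VideoCapture cap(0);              // assumed camera index
        if (!cap.isOpened()) return 1;

        cv::Mat frame, gray, prevGray;
        std::vector<cv::Point2f> prevPts, nextPts;

        cap >> frame;
        cv::cvtColor(frame, prevGray, cv::COLOR_BGR2GRAY);
        // Shi-Tomasi selection: keep corners whose smaller gradient-matrix
        // eigenvalue is large (see the 'good features to track' discussion).
        cv::goodFeaturesToTrack(prevGray, prevPts, 200, 0.01, 10.0);

        for (;;) {
            cap >> frame;
            if (frame.empty() || prevPts.empty()) break;
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

            std::vector<unsigned char> status;
            std::vector<float> err;
            // Pyramid Lucas-Kanade: coarse-to-fine search tolerates larger
            // frame-to-frame motion than the single-level algorithm.
            cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, nextPts,
                                     status, err);

            // Keep only successfully tracked points for the next iteration.
            std::vector<cv::Point2f> kept;
            for (size_t i = 0; i < nextPts.size(); ++i)
                if (status[i]) kept.push_back(nextPts[i]);

            prevGray = gray.clone();
            prevPts = kept;
        }
        return 0;
    }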

However, the quality and reliability of tracking depends highly on the features themselves, and not only on the algorithm. Important issues of this problem are described in the paper "Good features to track" by Shi and Tomasi [ST94]. The obvious property of a feature is the difference between its intensity and the intensity of its surroundings. Distinctive edges and corners are easier to find, and their position can be determined more accurately. Another problem is the orientation of features. Let us assume that a tracked edge is parallel to the direction of movement. In this case tracking cannot be accurate. Shi and Tomasi proposed a measure which allows evaluating the usefulness of a feature for tracking. It relies on an analysis of the eigenvalues of the gradient matrix of an image patch. The outcome of Shi and Tomasi is that both eigenvalues of a 'good' feature must be large. Large eigenvalues indicate corners or 'salt and pepper' textures, which can be reliably tracked. Another issue of tracking is the feature displacement model. The simplest case assumes that a feature can be translated by a 2D vector. A more sophisticated model assumes feature warping, which better reflects reality. The feature warping is represented by a 2x2 affine transformation matrix (the 2D displacement vector must also be included). This model has 6 parameters; hence it is harder to solve the equation. The convergence of this model is better when the features are 'good' in terms of the eigenvalue analysis. The affine feature displacement model has been used in the AR application described in the paper [KM07]. But in that case the affine transformation matrix was calculated from knowledge about the camera position and orientation.

Natural feature tracking using pre-acquired data. Natural features are not always unknown in advance. There is a project called Archeoguide [SK] which uses 'reference images'. First of all, the area where the tracking system should work is photographed from many sides and under different angles (the pose of the camera for each snapshot is known). When the tracking system works, each live frame is compared with the best matching frame from the database and a frame-to-frame correspondence is calculated (the frames in the database are calibrated). The system works similarly to marker-based tracking (we can treat reference images as markers), but the matching operation is more complicated and requires the usage of the techniques described above. The advantage of this solution is that we can pre-process some information and reduce the complexity of the on-line calculations. The disadvantage: the database of reference images must be kept up to date. A similar solution is proposed in the article [CCP02].

    Issues of camera pose estimation - camera models

    Here I present the mathematical model of camera projection, which is common for all camera

    registration techniques.

\[
\begin{pmatrix} x_c \\ y_c \\ z_c \end{pmatrix}
= R \begin{pmatrix} x_w \\ y_w \\ z_w \end{pmatrix}
+ \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}
\tag{2.1}
\]

\[
\begin{pmatrix} u \\ v \end{pmatrix}
= \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \end{pmatrix}
\begin{pmatrix} x_c/z_c \\ y_c/z_c \\ 1 \end{pmatrix}
\tag{2.2}
\]

where R is a 3 × 3 rotation matrix and t_x, t_y, t_z are the coordinates of the translation vector. The 2 × 3 matrix is the intrinsic camera matrix: f_x and f_y are the focal lengths, and c_x and c_y are the coordinates of the camera central point, expressed in pixels counted from the upper corner of the image. There are two focal lengths in order to model cameras with non-square pixels. This equation does not contain a distortion model. Below I present the camera model with distortion used in OpenCV.


    Figure 2.3: The camera model

\[
\begin{aligned}
x' &= x/z, \qquad y' = y/z \\
x'' &= x'\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x' y' + p_2 (r^2 + 2 x'^2) \\
y'' &= y'\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y'^2) + 2 p_2 x' y' \\
r^2 &= x'^2 + y'^2 \\
\begin{pmatrix} u \\ v \end{pmatrix}
&= \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \end{pmatrix}
\begin{pmatrix} x'' \\ y'' \\ 1 \end{pmatrix}
\end{aligned}
\tag{2.3}
\]

where k_1, k_2, k_3 are radial distortion coefficients and p_1, p_2 are tangential distortion coefficients. They do not change when the camera resolution changes, but the focal lengths and the center point must be scaled.
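To make the model concrete, the following small function (an illustration, not prototype code) projects a world point to pixel coordinates by applying equations 2.1-2.3 directly; OpenCV's own cv::projectPoints performs the same computation:

    #include <opencv2/opencv.hpp>

    // Project a world point to pixel coordinates: rigid transform (2.1),
    // perspective division, radial/tangential distortion, and intrinsic
    // matrix (2.3). All parameter values would come from calibration.
    cv::Point2d projectPoint(const cv::Matx33d& R, const cv::Vec3d& t,
                             double fx, double fy, double cx, double cy,
                             double k1, double k2, double k3,
                             double p1, double p2, const cv::Vec3d& pw) {
        cv::Vec3d pc = R * pw + t;             // eq. 2.1: world -> camera
        double xp = pc[0] / pc[2];             // x' = x/z
        double yp = pc[1] / pc[2];             // y' = y/z
        double r2 = xp * xp + yp * yp;         // r^2 = x'^2 + y'^2
        double radial = 1 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2;
        double xpp = xp * radial + 2 * p1 * xp * yp + p2 * (r2 + 2 * xp * xp);
        double ypp = yp * radial + p1 * (r2 + 2 * yp * yp) + 2 * p2 * xp * yp;
        return cv::Point2d(fx * xpp + cx, fy * ypp + cy);  // eq. 2.3
    }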

In order to calculate the intrinsic and extrinsic parameters, a vector of 3D points and their projections must be given. This data can easily be acquired by photographing a calibrated fiducial marker. It is also important to calculate the image point coordinates with subpixel accuracy. OpenCV contains algorithms which can estimate the intrinsic camera parameters, but only from coplanar points. The more control points in the image and the more images taken, the better the estimates will be.
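A plausible calibration routine using a planar chessboard (coplanar points, as noted above) could look like the following sketch; the board geometry and image file names are assumptions:

    #include <opencv2/opencv.hpp>
    #include <cstdio>
    #include <vector>

    int main() {
        const cv::Size board(9, 6);        // assumed inner-corner count
        const float square = 0.025f;       // assumed square size in metres

        // The same coplanar 3D points (z = 0) for every view of the board.
        std::vector<cv::Point3f> object;
        for (int y = 0; y < board.height; ++y)
            for (int x = 0; x < board.width; ++x)
                object.push_back(cv::Point3f(x * square, y * square, 0.f));

        std::vector<std::vector<cv::Point3f>> objectPoints;
        std::vector<std::vector<cv::Point2f>> imagePoints;
        cv::Size imageSize;

        for (int i = 0; i < 10; ++i) {     // assumed ten calibration shots
            cv::Mat img = cv::imread(cv::format("calib%02d.png", i),
                                     cv::IMREAD_GRAYSCALE);
            if (img.empty()) continue;
            imageSize = img.size();

            std::vector<cv::Point2f> corners;
            if (!cv::findChessboardCorners(img, board, corners)) continue;
            // Refine the corner locations to subpixel accuracy.
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                 30, 0.01));
            objectPoints.push_back(object);
            imagePoints.push_back(corners);
        }

        // Estimate the intrinsic matrix and the distortion coefficients
        // k1, k2, p1, p2, k3 of equation 2.3.
        cv::Mat cameraMatrix, distCoeffs;
        std::vector<cv::Mat> rvecs, tvecs;
        double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                         cameraMatrix, distCoeffs, rvecs, tvecs);
        std::printf("reprojection RMS error: %f\n", rms);
        return 0;
    }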

The intrinsic camera parameters must be estimated only once, but the extrinsic parameters for each frame. There are a few methods of estimating the camera pose. Most of them rely on solving the equations presented above. As we see, they are non-linear because of the division by z. But the camera geometry is also 3-dimensional and can be introduced that way (see figure 2.3). In order to solve equations based on the geometry introduced in the figure, the units in the world coordinate frame and the camera coordinate frame must be uniform. There is a scalar ratio which translates pixels to world units and vice-versa. That scalar is 'hidden' in f_x and f_y in equation 2.2.

    Solving the camera registration problem

Here I present a short list of methods for solving the camera registration problem. The list of course does not contain all algorithms, but gives an overview of already investigated approaches. We can distinguish two approaches: analytic - based on modeling rays from the object to the camera in 3-dimensional space - and iterative - based on minimizing the re-projection error. Some methods combine these approaches and exploit the advantages of both.


• Least squares minimisation of the re-projection error. This is a family of iterative methods. The base is a least squares iteration scheme (for example the Gauss-Newton method) which minimises the difference between real points in the image and projected points. The estimated parameters are the elements of the rotation matrix and the translation vector. The parameters must be initialized with some guessed values. For visual tracking purposes the best initial values are the extrinsic camera parameters from the previous frame. Many implementations of least squares solvers require a matrix of first derivatives with respect to the parameters, but it can be calculated using the difference quotient. Such a method is briefly described in the paper "Pose Tracking from Natural Features on Mobile Phones" [WRM+08].

• DLT method (direct linear transformation). The method relies on solving linear equations based on the perspective camera geometry. It can estimate the extrinsic parameters and the scaling factor between coordinate frames. A detailed description can be found in [dlt] [Qin96].

• POSIT algorithm [DD95] - an approach which combines analytic and iterative methods. First the orientation and translation are calculated by solving linear equations, then the resulting values are refined using an iterative approach. The algorithm converges very quickly; hence, it is good for real-time calculations.

• Lowe's algorithm - an approach relying on iterative least squares minimisation. The object is described in units of the camera coordinate frame. This makes the mathematical model so simple that it was possible to find equations for the first derivatives with respect to the position and rotation parameters. A detailed description can be found in [ACB98].

• SCAAT. An incremental tracking approach based on an iterative Kalman filter. It is able to estimate the pose even when information is incomplete (for example when too few points have been found). The algorithm is presented in [WB97]. Another visual tracking algorithm based on the ideas of SCAAT is described in [JNY00].

The presented algorithms are full of great ideas which can be applied without using the whole solution. For example, the first method (least squares minimisation) can be enriched with a Kalman filter. Or any analytic method can be improved by iterative refinement. The choice of method must be adjusted to the specifics of the problem and the initial knowledge about the camera movement.
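For reference, modern OpenCV ships a ready-made pose estimator, cv::solvePnP, whose default iterative mode performs a least squares minimisation of the re-projection error like the first method above. A hedged sketch, with the 3D-2D correspondences assumed as inputs:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Estimate the extrinsic parameters (R, t) of one frame from known
    // 3D points and their detected 2D projections.
    bool estimatePose(const std::vector<cv::Point3f>& worldPts,
                      const std::vector<cv::Point2f>& imagePts,
                      const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                      cv::Mat& R, cv::Mat& tvec) {
        cv::Mat rvec;  // rotation as a Rodrigues vector
        // SOLVEPNP_ITERATIVE refines an initial guess by Levenberg-Marquardt
        // minimisation of the re-projection error.
        if (!cv::solvePnP(worldPts, imagePts, cameraMatrix, distCoeffs,
                          rvec, tvec, /*useExtrinsicGuess=*/false,
                          cv::SOLVEPNP_ITERATIVE))
            return false;
        cv::Rodrigues(rvec, R);   // convert to a 3x3 rotation matrix
        return true;
    }

Passing the previous frame's pose together with useExtrinsicGuess=true would match the initialisation advice given above.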

    3D reconstruction for visual tracking

As stated at the beginning, all algorithms require knowledge of the 3D coordinates of world points. In the case of calibrated fiducial markers this is not a problem, but if tracking has to rely on initially unknown natural features, the coordinates must be obtained somehow. Techniques of 3D reconstruction must be applied. In order to find the 3D coordinates of a point, two different calibrated camera shots are needed. The coordinates of a point are found by calculating the intersection of two rays going from the center points of the cameras (two views) through the image planes to the world point. The algorithm is described in detail in [TV98]. The coordinates of a point found this way are not very accurate, because of camera calibration errors and inaccurate feature finding. The accuracy also depends on the difference in camera pose between the two considered views. The biggest error appears on the Z-axis of the camera coordinate system (the line from N to O in figure 2.3), but it should not disturb the solving of the camera registration problem. The requirement of two views means that a point must be found three times before it can be utilized for tracking. Sometimes that is too late, when the camera moves fast or the point is tracked unstably. A correct calculation of a point's position from only 2 views is also an optimistic case.
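A minimal two-view reconstruction sketch using OpenCV's linear triangulation; the projection matrices are assumed to come from the two calibrated camera poses:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Intersect rays from two calibrated views to recover 3D coordinates.
    // P1 and P2 are the 3x4 projection matrices K[R|t] of the two poses;
    // pts1 and pts2 are matched (undistorted) image points.
    std::vector<cv::Point3f> reconstruct(const cv::Mat& P1, const cv::Mat& P2,
                                         const std::vector<cv::Point2f>& pts1,
                                         const std::vector<cv::Point2f>& pts2) {
        cv::Mat pts4d;                            // 4xN homogeneous result
        cv::triangulatePoints(P1, P2, pts1, pts2, pts4d);

        std::vector<cv::Point3f> out;
        for (int i = 0; i < pts4d.cols; ++i) {
            cv::Mat c = pts4d.col(i);
            float w = c.at<float>(3);             // homogeneous scale
            out.push_back(cv::Point3f(c.at<float>(0) / w,
                                      c.at<float>(1) / w,
                                      c.at<float>(2) / w));
        }
        return out;
    }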


In my opinion it is possible to obtain the 3D coordinates of a point from a single view. To do this, some more information is needed - for example, information about the plane on which the point lies. If the tracking system had knowledge of a rough 3D model of the environment and the initial camera pose, it could calculate on which plane a point lies, and hence its 3D coordinates. A point on a plane has two degrees of freedom, and the camera also delivers 2 variables (the u and v coordinates), so the equation system has one unambiguous solution.
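A small sketch of this single-view idea, under the assumption that the plane is given in world coordinates as n · x = d: the viewing ray of pixel (u, v) is cast from the camera center and intersected with the plane:

    #include <opencv2/opencv.hpp>
    #include <cmath>

    // Recover the 3D point seen at pixel (u, v) when it is known to lie on
    // the plane n . x = d (world coordinates). R, t are the camera
    // extrinsics (world -> camera), so the camera center is C = -R^T t.
    bool backprojectToPlane(double u, double v,
                            double fx, double fy, double cx, double cy,
                            const cv::Matx33d& R, const cv::Vec3d& t,
                            const cv::Vec3d& n, double d, cv::Vec3d& X) {
        // Viewing ray direction in camera coordinates, rotated to world.
        cv::Vec3d dirCam((u - cx) / fx, (v - cy) / fy, 1.0);
        cv::Vec3d dirWorld = R.t() * dirCam;
        cv::Vec3d C = -(R.t() * t);           // camera center, world frame

        double denom = n.dot(dirWorld);
        if (std::abs(denom) < 1e-9) return false; // ray parallel to plane
        double s = (d - n.dot(C)) / denom;        // ray parameter at hit
        if (s <= 0) return false;                 // plane behind the camera
        X = C + s * dirWorld;
        return true;
    }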

There is another approach which allows finding the coordinates of points from 2 views along with the camera pose. It is called the 'five point algorithm', because it needs at least five corresponding points. This algorithm can calculate all of those things only up to a scale factor. If the real distance between two points in the image is known, all coordinates can be expressed in metric units. This limitation is not a problem: data about the metric distance can be provided using fiducial markers, or it can simply be assumed. Such an approach is presented in the system described in the paper [KM07]. The described application is initialised like this: the user directs the camera at the scene, pushes the button, then moves the camera 10 cm to the side (the program believes that it was 10 cm) and pushes the button again. In that way 2 views are indicated and the five point algorithm is applied. The main product of the five point algorithm is the 3 × 3 essential matrix. From that matrix the relative rotation and translation can be derived. The way to do it, and the whole algorithm with an implementation, is described in [Nis03a] and [SEN]. If the relative geometric correspondence between each pair of frames were accumulated (the initial camera pose should be known), we would get the actual camera pose, and hence a tracking system. This approach has one advantage: not all points have to be tracked from frame to frame; only a few are needed to keep the scaling factor. But the big disadvantage is drift. If any measurement of relative movement is inaccurate, that error will influence the estimated camera pose forever. There is no chance to correct that drift, because the system does not remember the previous state or frame. This approach is called 'Structure from Motion'. In order to improve the quality of the essential matrix estimation, all outliers must be removed. The most sophisticated method of outlier removal and pose estimation is preemptive RANSAC, described in the paper [Nis03b]. That system runs live, but it is not the only solution. A second, competitive technology is described below.
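Modern OpenCV (3.x and later) bundles a RANSAC-wrapped five-point solver; a hedged sketch of relative pose recovery from matched points (the matched point lists are assumed inputs) might look like this:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Recover relative camera motion (R, t up to scale) from two views
    // using the five-point algorithm [Nis03a] with RANSAC outlier removal.
    bool relativePose(const std::vector<cv::Point2f>& pts1,
                      const std::vector<cv::Point2f>& pts2,
                      const cv::Mat& cameraMatrix, cv::Mat& R, cv::Mat& t) {
        cv::Mat inlierMask;
        cv::Mat E = cv::findEssentialMat(pts1, pts2, cameraMatrix,
                                         cv::RANSAC, 0.999, 1.0, inlierMask);
        if (E.empty()) return false;
        // Decompose E and pick the physically valid (R, t) by a cheirality
        // check; t is a unit vector - the metric scale stays unknown.
        int inliers = cv::recoverPose(E, pts1, pts2, cameraMatrix,
                                      R, t, inlierMask);
        return inliers >= 5;
    }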

Simultaneous Localisation and Mapping (SLAM) is a group of technologies which originates from robotics. It was developed in order to build a map of an unknown environment and use it to track the position of a robot. We can argue that the movement of a robot is different and more predictable than the movement of a handheld camera, but successful implementations of AR based on SLAM [KM07] are the best evidence that the difference is not so large. The problem of creating the map and immediately tracking from that map is like a 'chicken and egg' problem. Errors made during the map creation phase influence tracking, and tracking influences the calculation of the positions of new points and the correction of previous ones. The idea which has been widely used to solve that problem is the Kalman filter [Kal]. The Kalman filter is a stochastic model of some process. It is able to predict the state of the process in the future. Generally two operations can be performed on the filter structure: prediction - calculation of the future process state, and correction - entering an actual measurement of the state, which is connected with an update of the coefficients of the stochastic model. The Kalman filter has a modification called the Extended Kalman filter (EKF), which supports non-linear models. KFs and EKFs are widely used in automation.

In terms of augmented reality the Kalman filter has a few applications. It can be used to predict the camera position and orientation. That predicted value (if it is close to the truth) can be used for finding the 2D positions of environment points in the camera frame. It can also be used as an initial guess of the camera pose for the iterative pose refinement procedure. It should also reduce the drift appearing in visual tracking systems. Kalman filters also have an application in the interpretation of data from sensors: they are able to smooth the noise coming from a sensor (like an accelerometer).
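To illustrate the predict/correct cycle, here is a hedged sketch of OpenCV's linear cv::KalmanFilter used to smooth and predict a camera position under a constant-velocity assumption; the state layout and noise values are illustrative, and a full 6-DOF pose filter would need an EKF:

    #include <opencv2/opencv.hpp>

    int main() {
        // State: (x, y, z, vx, vy, vz); measurement: (x, y, z).
        cv::KalmanFilter kf(6, 3, 0, CV_32F);
        const float dt = 1.0f / 25.0f;     // assumed 25 fps frame interval

        // Constant-velocity transition model: position += velocity * dt.
        kf.transitionMatrix = (cv::Mat_<float>(6, 6) <<
            1, 0, 0, dt, 0,  0,
            0, 1, 0, 0,  dt, 0,
            0, 0, 1, 0,  0,  dt,
            0, 0, 0, 1,  0,  0,
            0, 0, 0, 0,  1,  0,
            0, 0, 0, 0,  0,  1);
        cv::setIdentity(kf.measurementMatrix);          // observe position
        cv::setIdentity(kf.processNoiseCov, cv::Scalar(1e-4));
        cv::setIdentity(kf.measurementNoiseCov, cv::Scalar(1e-2));
        cv::setIdentity(kf.errorCovPost, cv::Scalar(1));

        // Each frame: predict, then correct with the visually measured
        // position. The prediction can serve as a search-area hint or as
        // an initial guess for iterative pose refinement.
        cv::Mat predicted = kf.predict();
        cv::Mat measurement = (cv::Mat_<float>(3, 1) << 0.f, 0.f, 0.f);
        cv::Mat smoothed = kf.correct(measurement);
        return 0;
    }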

Here I would like to describe how a typical SLAM system works, on the basis of the solution from [Dav03]. In such a system the camera movements and also the positions of points have probabilistic models. This means that the state of the system is modeled as a vector which can be divided into two parts: the first one describes the actual camera position and orientation, and the second one has an entry for each feature point. When a new frame is acquired, the system predicts the camera pose. After that, all feature points are reprojected, and according to their pose uncertainty, elliptic search areas are created for each of them. The size of an ellipse depends on the uncertainty of the point and of the camera pose. The images of the points are searched for in those search areas, and the covariance matrix and state vector are updated according to their real positions. In such a system the covariance matrix is quite large: (7 + 3x) × (7 + 3x), where x is the number of feature points. But the maintenance of that matrix is very important (including its non-diagonal entries), because it models the correlations between points. For example, if some points lie on the same surface and near each other, their absolute positions can be uncertain while knowledge about the relative positions of those points can be very certain. Initialization of a new point is also done in a top-down way. A new point is added to the database using information from one view only. It is modeled as an infinite straight line. The depth of a point cannot be calculated from one view, so some assumptions are made according to initial knowledge about the environment. The depth is modeled as a 1D range on that straight line, which is then filled with a big number of regularly distributed particles (see figure 2.4(b)). These particles create a discrete model of the probability density function. At the beginning all particles are equal, and in the following frames observations of the point are applied to them. After some number of frames the particles should form a single peak, as in figure 2.4(a). If that does not happen, the point is discarded. Otherwise the point is used for tracking, with the depth estimate at the location of the peak. The quality of the position estimate is very high, but it takes much time and requires that the point be visible for some period of time. Moreover, after the invested effort, the point should be used for tracking for a long time. The design of that system assumes that the tracking viewpoint will be repeatable.

Figure 2.4: Initialisation of a point in probabilistic SLAM - (a) probability density of depth, (b) graphic interpretation of particles.
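The depth-initialization step described above can be sketched as a simple 1D particle filter; this toy example (my own illustration, with an assumed Gaussian measurement likelihood) reweights regularly spaced depth hypotheses with each new observation:

    #include <cmath>
    #include <vector>

    // One depth hypothesis (particle) on the ray of a new feature.
    struct Particle { double depth; double weight; };

    // Reweight all particles with the likelihood of the current observation.
    // error(d) would be the reprojection error of the feature in the new
    // frame if its depth were d; a Gaussian likelihood is assumed here.
    void update(std::vector<Particle>& ps, double (*error)(double),
                double sigma) {
        double sum = 0;
        for (Particle& p : ps) {
            double e = error(p.depth);
            p.weight *= std::exp(-0.5 * e * e / (sigma * sigma));
            sum += p.weight;
        }
        if (sum <= 0) return;
        for (Particle& p : ps) p.weight /= sum;  // normalize to a density
    }

    int main() {
        // Uniform prior: regularly distributed particles on an assumed
        // 0.5-10 m depth range.
        std::vector<Particle> ps;
        const int n = 100;
        for (int i = 0; i < n; ++i)
            ps.push_back({0.5 + i * (10.0 - 0.5) / (n - 1), 1.0 / n});

        // Toy likelihood pretending the true depth is 3 m; repeated over
        // several frames, the weights should form a single peak there.
        update(ps, [](double d) { return d - 3.0; }, 0.5);
        return 0;
    }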

    Survey of visual markerless tracking

Figure 2.5 presents a survey of all natural feature tracking techniques discussed in this section. They are divided into the processing stages of the algorithm. A red pentagon with the letter K indicates techniques utilizing Kalman filters; the possible types of Kalman filter are presented in the bottom left corner of the image.

Figure 2.5: Summary of visual tracking from natural features

    Software libraries

Computer vision systems are not standardized, but there are some software libraries which gather implementations of algorithms useful for creating augmented reality systems. This section is divided into two parts: the first describes frameworks - implementations of whole functionalities like marker-based tracking or markerless tracking - and the second describes libraries which implement only individual algorithms (without connections between them).

    Frameworks

    Table 2.2: AR Frameworks

    Name                  LC   NFT  Markers                              PE   License  Language / platform
    ARToolkit             Yes  No   ARToolkit                            Yes  GPL      C/C++ and Java (wrapper); Linux, Windows, macOS
    ARToolkitPlus         Yes  No   ARToolkit, ARTag                     Yes  GPL      C/C++; Windows, Linux, WinCE
    Studierstube Tracker  Yes  No   ARToolkit, ARTag, Frame, Split, Dot  Yes  N/A      C/C++; Windows, Linux, WinCE, Symbian, iPhone
    Studierstube ES       Yes  Yes  ARToolkit, ARTag, Frame, Split, Dot  Yes  N/A      C/C++; Windows, Linux, WinCE, Symbian
    SceneLib              No   Yes  No                                   Yes  LGPL     Linux

    LC - lens calibration, PE - pose estimation, NFT - natural feature tracking


    Of the entries in table 2.2, SceneLib deserves particular attention: it is a framework for designers of SLAM systems in robotics. I did not examine that library in depth, but there are AR solutions based on it, for example [Dav03].

    Algorithms and other functionality

    OpenCV - a large library for computer vision and image processing; Windows and Linux implementations are available. The library is divided into 4 parts:

    • CXCORE: data structures used in computer vision; simple drawing and rendering; time measurement; mathematical operations (matrices).

    • HighGUI: functions for displaying windows and refreshing the graphics in them; capturing images from devices and files. Allows rapid prototyping and portability.

    • CV: typical algorithms for computer vision, e.g. SURF, filters, corner detection, camera calibration, stereo vision.

    • Machine learning.

    OpenCV offers very broad functionality, sufficient for many computer vision applications, and allows rapid development. It also contains a very interesting type system: arrays, matrices and images are aggregated under one super-type, and functions recognize the input type at runtime.
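    A short sketch (my own, against the OpenCV 1.x C API) illustrating that super-type: cvSmooth accepts a CvArr*, so the very same call works on an IplImage and on a CvMat; the file name is a placeholder:

    #include <cv.h>
    #include <highgui.h>

    int main() {
        // cvSmooth takes CvArr*, the common super-type of IplImage and CvMat;
        // the concrete type is resolved at runtime.
        IplImage *img = cvLoadImage("frame.png", CV_LOAD_IMAGE_GRAYSCALE);
        if (!img) return 1;
        IplImage *blurred = cvCreateImage(cvGetSize(img), img->depth, 1);
        cvSmooth(img, blurred, CV_GAUSSIAN, 5, 5, 0, 0);      // on an image

        CvMat *m  = cvCreateMat(3, 3, CV_32FC1);
        CvMat *ms = cvCreateMat(3, 3, CV_32FC1);
        cvSetIdentity(m, cvRealScalar(1));
        cvSmooth(m, ms, CV_GAUSSIAN, 3, 3, 0, 0);             // same call on a matrix

        cvReleaseImage(&img); cvReleaseImage(&blurred);
        cvReleaseMat(&m); cvReleaseMat(&ms);
        return 0;
    }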

    OpenSURF [Eva09] - an open-source implementation of SURF. I compiled the library and ran the sample program, and it worked very slowly. I replaced the Fast-Hessian feature detector (standard for SURF) with FAST [RD05], but the program still worked too slowly. The SURF implementation in OpenCV performs much better.

    Integrating Vision Toolkit (http://ivt.sourceforge.net/) - a library similar to OpenCV (a set of algorithms). Here I want to point out the differences with respect to OpenCV:

    • Contains a SIFT implementation

    • Its Harris corner detector is faster than the one in OpenCV

    • Very good object-oriented architecture

    2.1.2 Inertial Tracking

    This approach relies on capturing translations and rotations of an object using the phenomenon of inertia. The movement of a body in space can be represented as a sequence of translations and rotations. The inertial sensor must be mounted on the tracked object (inside-out approach). Inertial tracking is sourceless: it does not need any reference object in the environment. Its most considerable problem is drift. Devices measuring translation are called accelerometers; devices measuring rotation are called gyroscopes.

    Accelerometers

    Accelerometers are devices that measure the linear acceleration of a body along one axis. In the simplest case they consist of a mass and a subsystem that measures the force acting on the mass (for example, a mass mounted on a piezoelectric crystal [RJD+01]). The acceleration of the body is proportional to the inertial force acting on the mass. Double integration of the acceleration over time gives the position. However, the calculated position is not error-free:

    • measurement errors

    • discretization errors

    • numerical integration errors

    • accumulation of previous errors

    Table 2.3: 3D accelerometers

    Phidget Accelerometer [2]
      range ±3 G; noise 6 milli-G; update rate 60 Hz; price £114.95; OS: Windows, Windows CE.
      USB; designed for quick prototyping of AR solutions. The price includes shipment.

    MOD-MMA7260Q [3]
      range ±1.5 G to ±6 G; noise N/A; update rate N/A; price €38.95; OS: N/A.
      Board with an ARM processor and a 3D accelerometer; must be programmed by the user.

    USB1600-PC [4]
      range ±1.5 G to ±6 G; noise N/A; update rate >60 Hz; price $275; OS: Windows, Windows CE.
      Similar to the Phidget Accelerometer; comes with drivers, software and code samples.

    These factors mean that accelerometers can track the position precisely, but only over a short period of time. Hence, a tracking system cannot rely solely on accelerometers; it should be supported by a second system that does not drift (for example, visual tracking), as the sketch below illustrates. A single accelerometer has one degree of freedom. A 3D accelerometer (3-DOF) consists of three single accelerometers measuring acceleration along three perpendicular axes.
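    The following sketch (my own) shows numerically why drift is so severe: a constant bias of 6 milli-G (the noise figure of the Phidget sensor in table 2.3, used here as an assumed bias) already produces a position error of about 3 m after 10 seconds of double integration:

    #include <cstdio>

    // Naive dead reckoning by double integration: a tiny constant bias 'bias'
    // in the measured acceleration grows quadratically in the position estimate
    // (error ~ 0.5 * bias * t^2), which is why accelerometer-only tracking drifts.
    int main() {
        const double dt   = 1.0 / 60.0;     // 60 Hz update rate (cf. table 2.3)
        const double bias = 0.006 * 9.81;   // 6 milli-G bias in m/s^2 (assumed)
        double v = 0.0, x = 0.0;
        for (int i = 1; i <= 600; ++i) {    // 10 seconds, true motion = none
            double a = 0.0 + bias;          // measured acceleration = truth + bias
            v += a * dt;                    // first integration: velocity
            x += v * dt;                    // second integration: position
            if (i % 60 == 0)
                std::printf("t=%2ds  position error = %.3f m\n", i / 60, x);
        }
        return 0;
    }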

    Fortunately, accelerometers are readily available on the market. There are analog electronic circuits with built-in 3D accelerometers; they can be connected to a one-chip microcomputer (with analog inputs) that transmits the data to the PC (USB or RS-232). Devices that can be connected directly to the PC (without soldering or low-level programming) are also available, and some products include an SDK for C++ and other programming languages. A few handheld devices (for example smartphones) have built-in 3D accelerometers.

    Gyroscopes

    Gyroscopes are devices that measure orientation in space (insensitive to translation, in contrast to accelerometers). Most old-fashioned gyroscopes use a quickly rotating wheel as a reference: the movement of the object is sensed by rotation encoders, which measure the angles between the tracked body and the wheel. The main problem of gyroscopes is drift, caused by the small friction between the axis of the wheel and its bearing. This error can be reduced by calculating the orientation iteratively, but that results in an accumulation of numerical errors. Hence, the gyroscope must be periodically re-calibrated to remain accurate over time. An electronic gyroscope can also be implemented as a micro-electro-mechanical system (MEMS), which contains micro-miniature vibrating elements instead of spinning wheels. As an example of a MEMS-based PC gyroscope, table 2.4 introduces the InertiaCube2+ [FHA98]. Another approach to building gyroscopes (and also accelerometers) are silicon micromachines (iMEMS) [LKPB02]; the working principle of an iMEMS gyroscope is described in [GK]. Silicon micromachined sensors are less accurate than MEMS sensors, but their dimensions, power consumption and cost are very low.

    [2] http://www.trossenrobotics.com/store/p/5160-PhidgetAccelerometer-3-Axis.aspx, http://www.active-robots.com/products/phidgets/three-axis-accelerometer.shtml
    [3] http://www.olimex.com/dev/mod-mma7260q.html
    [4] http://www.embeddedsys.com/subpages/products/usb1600.shtml


    Table 2.4: MEMS and iMEMS gyroscopes

    InertiaCube 2+ (MEMS)
      maximum angular speed 1200 °/s; angular resolution 0.01°; update rate 180 Hz;
      interface RS-232/USB adapter; price about €2000.
      http://www.intersense.com/uploadedFiles/Products/IC2+_datasheet_0908.pdf

    Gyro Breakout Board (iMEMS)
      maximum angular speed 150 °/s; sensitivity 12.5 mV/°/s; update rate 80 Hz;
      interface N/A; price about €69.
      Note: the board contains only a one-chip 1-DOF analog gyroscope and requires a microcomputer with ADCs.
      http://www.watterott.com/Gyro-Breakout-Board-ADXRS150-150-degree-sec_1

    IMU Combo Board (iMEMS)
      maximum angular speed 75 °/s; sensitivity 15 mV/°/s; update rate 40 Hz;
      interface N/A; price about €69.
      Note: the board contains only a one-chip 3-DOF analog gyroscope and requires a microcomputer with ADCs.
      http://www.sparkfun.com/commerce/product_info.php?products_id=842

    The lack of accuracy can be compensated by a visual tracking system. These characteristics make silicon micromachines a very frequent choice for mobile devices.

    6-DOF inertial trackers

    Accelerometers and gyroscopes deliver different types of information. In order to robustly calculate the 6-DOF pose, the system must integrate both accelerometers and gyroscopes. A few such solutions are available on the market, for example the InertiaCube [FHA98], which contains 3 accelerometers, 3 vibrating gyroscopes and 3 magnetometers. The idea of an inertial tracking system based on silicon micromachines is proposed in [LKPB02] and implemented, for example, in the Wii Remote [Wii] game controller. This device contains a 3D accelerometer and can be extended with a gyroscope connected to the expansion slot; the Wii Remote also uses a visual tracking system consisting of IR diodes and an IR camera on the controller. Table 2.5 introduces a cheap 6-DOF inertial tracker consisting of 3 iMEMS accelerometers, 3 gyroscopes and a microcontroller with A/D converters for data acquisition and communication with the PC.
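    One common way to fuse the two sensor types is a complementary filter; the sketch below (my own, not the InertiaCube's proprietary algorithm) blends the fast but drifting integrated gyro angle with the drift-free but noisy tilt angle derived from gravity. The axis conventions and the weight alpha are assumptions:

    #include <cmath>

    // Complementary filter for one tilt angle: the gyroscope path dominates at
    // high frequency (responsive, but drifts), the accelerometer path at low
    // frequency (noisy, but anchored to gravity).
    class TiltFilter {
        double angle;   // estimated pitch angle [rad]
        double alpha;   // weight of the gyro path, e.g. 0.98
    public:
        TiltFilter(double a = 0.98) : angle(0.0), alpha(a) {}

        // gyroRate: angular speed [rad/s]; ax, az: accelerometer axes [m/s^2]
        double update(double gyroRate, double ax, double az, double dt) {
            double gyroAngle  = angle + gyroRate * dt;   // integrate gyro (drifts)
            double accelAngle = std::atan2(ax, az);      // gravity-based tilt (noisy)
            angle = alpha * gyroAngle + (1.0 - alpha) * accelAngle;
            return angle;
        }
    };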

    Each accelerometer is able to measure dynamic (vibration) and static (gravity, tilt) acceleration. If the system is to measure tilt, high-sensitivity devices are recommended. It is possible to measure the tilt around the two axes parallel to the earth, but rotation around the third (perpendicular) axis is not sensed. It could be sensed if the rotation axis were quite far (10 cm) from the sensor, so that the linear acceleration caused by the angular speed affects the sensor. Even then, motion along a circle and motion along a straight line remain indistinguishable when the system senses only the data from a single accelerometer. But if the rigid body whose pose is measured is equipped with a second 3D accelerometer, the rotation becomes distinguishable.


    Table 2.5: 6-DOF inertial sensors

    Atomic IMU - 6 Degrees of Freedom
      gyroscope range ±300 °/s; gyroscope sensitivity 3.3 mV/°/s; gyroscope update rate 88 Hz;
      accelerometer range ±1.5 G to ±6 G; price €115.00; interface UART.
      Note: the device consists of the sensors and an Atmel ATmega168 microcomputer (must be programmed by the user).


    6-DOF tracking with 2 accelerometers. I think that estimating the 6-DOF pose is possible using two 3D accelerometers mounted on the two opposite ends of the handheld device. In this case there is only one axis (the one passing through both accelerometers) around which rotation cannot be sensed. It is mentioned in [acc] that such a system can measure roll, pitch and yaw as long as the common axis of the accelerometers does not point along gravity; the signals begin to disappear as the common axis approaches the acceleration vector. Only sharp movements are measured accurately. I suppose that the accuracy of such a system would be much worse than that of a silicon micromachined gyroscope.
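    A planar sketch of the idea (my own, with illustrative numbers): for a rigid body, the readings of two accelerometers differ only by the rotational terms, so subtracting them cancels the common translation:

    #include <cstdio>

    // For a rigid body, a2 - a1 = alpha x r + omega x (omega x r), where r is
    // the baseline vector between the sensors. In the 2-D case, the component
    // of the difference perpendicular to the baseline gives the angular
    // acceleration directly.
    int main() {
        const double d = 0.20;                    // 20 cm baseline (assumed)
        double a1_perp = 0.10, a2_perp = 0.46;    // perpendicular components [m/s^2]
        double alpha = (a2_perp - a1_perp) / d;   // angular acceleration [rad/s^2]
        std::printf("angular acceleration = %.2f rad/s^2\n", alpha);
        // Pure translation moves both sensors identically, so it cancels here;
        // rotation about the baseline axis itself remains unobservable.
        return 0;
    }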

    2.1.3 Other technologies

    GPS and DGPS

    Global positioning system. GPS is a system for 3-DOF position tracking over the whole world. Its accuracy is about 10-20 meters (after Selective Availability was switched off in 2000) [Roy]. The system consists of 28 satellites orbiting the earth; each satellite carries 3-4 atomic clocks to ensure accurate time measurement and broadcasts the current time and its position. A special receiver calculates the distance to each visible satellite and, from that, its own position. If it sees 4 satellites, both position and altitude can be estimated; if the receiver sees fewer than 4 satellites, the 2D position can be calculated using an altitude entered by the user.

    DGPS - differential GPS, a supporting system for GPS. It consists of earth-bound stations whose positions are known very precisely. These stations receive the satellite signals and calculate corrections, which are transmitted to special DGPS receivers. The receivers analyse both the satellite signals and the signals from the earth-bound stations and calculate the position with 1-3 m precision. The HA-NDGPS system, currently under development, is expected to reach an accuracy of 0.1-0.15 m [DGP]. There are many DGPS standards and many stations on the earth [Gal]; anyone can set up a station of their own if there is a need. DGPS is used in some augmented reality applications, for example LIFEPLUS [Vla04]. The accuracy of DGPS is not good enough to base the camera registration on it, but it can serve as a component of hybrid large-area AR systems.

    GPS and DGPS have become very common in the last few years. They are built into handheld devices (phones, PDAs) and are also available as separate devices with a Bluetooth or USB interface. Most GPS receivers use the NMEA protocol to communicate with the outside world. One problem of GPS technology is the low update rate (about 1 Hz). Another is that the signal disappears when, for example, a vehicle passes under a wide overpass; to mitigate such situations, the GPS receiver can be augmented with a low-cost accelerometer [Dav08]. GPS receivers can also integrate other sensors, such as magnetometers (electronic compass).
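    As an illustration of the NMEA protocol mentioned above, here is a hedged C++ sketch that extracts the position from a "GGA" sentence; checksum verification and the many other sentence types are omitted:

    #include <cstdlib>
    #include <sstream>
    #include <string>
    #include <vector>

    // Minimal parser for the NMEA GGA sentence that most GPS receivers stream
    // over their serial interface.
    struct Fix { double lat, lon; int satellites; };

    // NMEA encodes angles as ddmm.mmmm; convert to decimal degrees.
    static double nmeaToDegrees(const std::string& v, const std::string& hemi) {
        double raw = std::atof(v.c_str());
        double deg = static_cast<int>(raw / 100);
        deg += (raw - deg * 100.0) / 60.0;
        return (hemi == "S" || hemi == "W") ? -deg : deg;
    }

    bool parseGGA(const std::string& line, Fix& out) {
        if (line.compare(0, 6, "$GPGGA") != 0) return false;
        std::vector<std::string> f;
        std::stringstream ss(line);
        std::string tok;
        while (std::getline(ss, tok, ',')) f.push_back(tok);
        if (f.size() < 8 || f[6] == "0") return false;   // field 6: fix quality
        out.lat = nmeaToDegrees(f[2], f[3]);
        out.lon = nmeaToDegrees(f[4], f[5]);
        out.satellites = std::atoi(f[7].c_str());
        return true;
    }

    // Example: "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,..."
    // yields lat 48.1173, lon 11.5167, 8 satellites.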

    Electronic compass

    An electronic compass is a very sensitive magnetometer, which measures the magnetic field of the earth. A magnetometer consists of a coil (or several coils): if the magnetic field around the coil changes, a varying current arises in the coil, and that current is a function of the distance from the magnetic field source and of the relative orientation between the emitting and receiving coils. Magnetic trackers are widely used in VR because of their low price. Electronic compasses use magneto-inductive elements instead of coils, mostly containing many of them. The measurement of such a system can obviously be affected by other sources of magnetic fields and also by the tilt of the device; electronic compasses therefore often integrate tilt sensors (for example inclinometers) to improve the quality of the measurement. The accuracy of an electronic compass is about 0.5 degree. One problem of such sensors is that the magnetic field of the earth is not homogeneous. The electronic compass is a very good complement to GPS, because it provides data about the 2D orientation. An interesting AR solution that exploits the compass is Wikitude [Wika].
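    A sketch of the tilt compensation mentioned above (one common textbook formulation, my own code, with assumed axis conventions): the magnetometer vector is rotated back into the horizontal plane using pitch and roll from the tilt sensor before the heading is computed:

    #include <cmath>

    // mx, my, mz: magnetometer axes; pitch, roll in radians (from an inclinometer).
    // Returns the heading in radians, 0 = magnetic north.
    double tiltCompensatedHeading(double mx, double my, double mz,
                                  double pitch, double roll) {
        // Rotate the measured field back to the horizontal plane.
        double xh = mx * std::cos(pitch) + mz * std::sin(pitch);
        double yh = mx * std::sin(roll) * std::sin(pitch)
                  + my * std::cos(roll)
                  - mz * std::sin(roll) * std::cos(pitch);
        double heading = std::atan2(yh, xh);
        if (heading < 0) heading += 2.0 * 3.14159265358979323846;
        return heading;
    }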

    Mechanical tracking

    This type of tracking system uses mechanical linkages between the reference and the tracked object. There are two types of mechanical trackers:

    • The reference and the target are connected by a chain of linkages. The position is computed from the angles between the linkages, which are measured using potentiometers or incremental encoders.

    • The reference and the target are connected by a system of wires. The wires are rolled on coils and tensed by a spring system so that distances can be measured accurately.

    The number of degrees of freedom depends on the construction. Most systems support 6 DOF, but only a limited range of motion is possible: the tracking range is about 1.8 m. Such a system can be used in an immersive human interface (in cases where the user cannot walk very far). Mechanical linkages have found successful application in force-feedback systems, which are also useful in user interfaces. Mechanical tracking has also been utilized for augmented reality purposes in a group of systems called AR telescopes; they are used for tracking the camera orientation (a change of position is impossible) within a very limited range of movement (only yaw and pitch). However, they have been replaced by visual tracking because of their lack of accuracy: electronic rotation encoders are very temperature-sensitive.

    Gravity sensors

    Inclinometers are devices that measure their orientation with respect to the gravitational field. In most cases they consist of a closed vessel containing a fluid and sensors that measure the level of the fluid; the orientation of the vessel can be obtained by measuring the pressure or the level of the liquid. There are also solutions that use an electrolytic fluid or opto-electric sensors. The main limitations of this technology are a long response time (because of the viscosity of the liquid) and sensitivity to shocks and acceleration. The second problem can be solved by integrating inclinometers with accelerometers for shock measurement. Like inertial sensors and compasses, inclinometers do not need any reference object in the environment.


    Ultra sound tracking

    Ultrasound tracking is based on measuring the time of flight of a sound wave. The frequency of the pulse signal is between 20 kHz and 40 kHz, to prevent the user from hearing it. In order to track the position and orientation of an object, 3 or more emitters must be placed on it; to measure the 3D position of each emitter, 3 receivers must be mounted on the reference. The emitters are small and lightweight and can easily be carried by a person or mounted on any object. They send their signals sequentially, or each of them uses a different frequency. The tracking accuracy is very good (0.5 mm - 6 mm). The limitations of such systems are the range (from 25 cm to 4.5 m) and the sensitivity to temperature, pressure, humidity and occlusion: all factors that affect the speed of the sound wave decrease the accuracy or make tracking impossible. This limits the application of such systems to rooms.
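    A minimal sketch (my own, with illustrative numbers) of the underlying geometry: the time of flight gives a distance to each receiver, and with three receivers at known positions the emitter location follows in closed form:

    #include <cmath>
    #include <cstdio>

    // Three receivers at (0,0,0), (d,0,0) and (0,d,0) measure times of flight;
    // distance = speed of sound * time. Subtracting the sphere equations yields
    // closed-form expressions for the emitter position (a z-sign ambiguity remains).
    int main() {
        const double c = 343.0;   // speed of sound [m/s] at ~20 °C; temperature
                                  // changes this, hence the sensitivity noted above
        const double d = 1.0;     // receiver baseline [m]
        double t1 = 0.004373, t2 = 0.004373, t3 = 0.004373;  // measured ToF [s]
        double r1 = c * t1, r2 = c * t2, r3 = c * t3;

        double x = (r1*r1 - r2*r2 + d*d) / (2.0 * d);
        double y = (r1*r1 - r3*r3 + d*d) / (2.0 * d);
        double z = std::sqrt(std::fmax(0.0, r1*r1 - x*x - y*y));
        std::printf("emitter at (%.3f, %.3f, %.3f) m\n", x, y, z);
        return 0;
    }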

    Ultra-Wideband

    Ultra-Wideband is a radio-wave technology similar to Bluetooth or WiFi. It operates at very low energy and uses a very wide part of the spectrum. The Ubisense company developed a tracking system based on this technology [uwb]. The system consists of emitters placed on objects and receivers mounted on the reference; the density of the receiver network is similar to the density of access points in a WLAN. The tracking accuracy is about 10-15 cm. It was designed to track people in a big building. In general, the system has 3 degrees of freedom, and extending it to 6 DOF would be hard or impossible because of the low accuracy.

    2.1.4 Hybrid tracking

    As we have seen, the technologies described in the previous sections have their advantages and disadvantages, and none of them is ideal for tracking in an outdoor environment. Hence, intuition suggests that a good system should combine two or more technologies. Such an approach is called hybrid tracking: the weakness of one technology can be compensated by another. For example, a system can integrate GPS with a 3D accelerometer that tracks the position (over a short period of time) when the GPS signal is lost. In order to show how hybrid tracking works, I will present a few examples. The example projects are also summarized in table 2.6, which gives an overview of the techniques integrated in each of them.

    Examples

    System II (2003)

    System II (described in [CMC03]) is very interesting because it uses a camera with a wide-angle 'fish-eye' lens (190°). Additionally, the system is equipped with a 3D gyroscope, whose measurements serve as predictions for the feature tracker. The camera pose is estimated using a typical 'Structure from Motion' approach: the essential matrix is calculated, and the rotation matrix and translation vector are derived from it. With a fish-eye lens the spherical distortions are very strong, so they have to be compensated. The wide view has many advantages: the field of view covers a very large area, so even during large rotations some part of it stays common to two successive frames, and a fish-eye lens also allows movements along the focal axis of the camera to be estimated properly. But it also has disadvantages: the resolution of the image in front of the camera is very low, which generates estimation errors for movements perpendicular to the camera axis, such as panning or tilting.
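    For illustration, here is a hedged sketch (my own, using the OpenCV 1.x C API; System II's actual implementation is not public) of the textbook decomposition of an essential matrix into a rotation and a translation direction:

    #include <cv.h>

    // E = U * diag(1,1,0) * V^T;  R = U * W * V^T (or U * W^T * V^T),
    // t = +/- last column of U. Four (R, t) combinations exist; the right one
    // is chosen by checking that triangulated points lie in front of both
    // cameras (omitted here). E, R: 3x3 CV_64F; t: 3x1 CV_64F.
    void decomposeEssential(CvMat* E, CvMat* R, CvMat* t) {
        CvMat *U = cvCreateMat(3, 3, CV_64F);
        CvMat *S = cvCreateMat(3, 3, CV_64F);
        CvMat *V = cvCreateMat(3, 3, CV_64F);
        cvSVD(E, S, U, V, 0);                            // E = U * S * V^T

        double w[] = { 0,-1, 0,   1, 0, 0,   0, 0, 1 };
        CvMat W = cvMat(3, 3, CV_64F, w);
        CvMat *UW = cvCreateMat(3, 3, CV_64F);
        cvMatMul(U, &W, UW);                             // U * W
        cvGEMM(UW, V, 1.0, NULL, 0.0, R, CV_GEMM_B_T);   // R = U * W * V^T

        for (int i = 0; i < 3; ++i)                      // t ~ third column of U
            cvmSet(t, i, 0, cvmGet(U, i, 2));

        cvReleaseMat(&U); cvReleaseMat(&S); cvReleaseMat(&V); cvReleaseMat(&UW);
    }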


    The system also provides 3D reconstruction, implemented as triangulation from 2 views; it is done during the essential matrix calculation. Note that the coordinates of these features are not utilized directly for tracking. The authors of the article state that it is not a mature solution; moreover, they write that the system drifts.

    System I (2004)

    B. Jiang, U. Neumann and S. You presented a hybrid tracking system in their paper "A Robust Hybrid Tracking System for Outdoor Augmented Reality" [JNY04]. I call it 'System I' because it does not have a name of its own. It integrates a digital 3D gyroscope and visual tracking. The designers assumed that the change in view caused by a small rotation is much bigger than the change resulting from a small linear movement; hence the system contains a gyroscope and no accelerometer, which decreases the cost. The system is designed for urban environments. Visual registration is based on detecting unknown lines and tracking them from frame to frame. The global orientation is, in general, measured by the gyroscope; the measurements are updated by the visual tracking system from time to time to prevent drift. Each camera pose estimate from the visual subsystem is evaluated as reliable or not, and only reliable estimates are used to correct the drift. If a measurement is pronounced unreliable, the visual subsystem goes back to the previous reliable state and estimates the movement to the current camera image. The technologies used here complement each other: when the rotation of the camera is too fast for visual tracking, only the inertial subsystem is used; but when the user stops for a moment or starts to move slowly, the information from the camera corrects the drift. This system could not be used in a natural environment, because straight lines do not exist in the non-artificial world. However, the way of data processing proposed in this system is a very good basis for every system that integrates visual and inertial techniques.

    LIFEPLUS (2004)

    The LIFEPLUS system [Vla04] was designed to support sightseeing at cultural heritage sites. It integrates visual tracking, DGPS and a compass. The system is large and complicated: besides tracking, it contains a network infrastructure based on GPRS and WLAN, which allows access to a remote database of multimedia content. The system is designed to work over a very large area. The information from the camera and the compass is sufficient for tracking, but the GPS data provide the initial calibration and allow the other parts of the tracking system to work more reliably.

    Vidente (2008)

    The Vidente system [SMK+08], developed at the Graz University of Technology, has been designed to visualize pipes and cables under the surface of the earth. The tracking system integrates a GPS receiver supporting EGNOS and a very accurate inertial orientation sensor (InertiaCube 3). The GPS delivers the position with an accuracy of a few meters; the InertiaCube3 is a very accurate sensor (less than one degree) and costs €2000 (data from http://www.cybermind.nl/Info/EURO_PriceList.htm#ISense). Together, these two sensors deliver tracking accurate enough to render pipes and cables whose positions are taken from a GIS database. Such a tracking system is quite easy to implement and does not require much computational power, but it is too expensive.

    WikiTude (2009)

    WikiTude [Wika] has already been mentioned in this work. The system integrates GPS and a compass in order to show descriptions of interesting places on the earth. It is not hard to notice that such a tracking system is not very accurate, but rendering labels on big buildings or other large objects does not require high precision. This system shows how the weakness of one technique can be compensated by another: the GPS delivers the 3-DOF position and the compass a 1-DOF orientation. I suspect that the two other orientation parameters are sensed by accelerometers, but I have not found any extensive documentation of that software.


    Table 2.6: Hybrid augmented reality systems (• = technology used, ◦ = not used)

    Name       Visual  Inertial  GPS  Compass
    System I     •        •       ◦      ◦
    WikiTude     ◦        •       •      •
    Vidente      ◦        •       •      ◦
    LIFEPLUS     •        ◦       •      •
    System II    •        •       ◦      ◦

    These few examples show that hybrid systems are a good direction of development. If one technology is too weak to cover all possible conditions, why not add a second one that works better in some specific situations? Combining technologies brings another benefit: for example, in visual tracking the computations are heavy and consume much time, but hints from inertial sensors can simplify them, as the sketch below illustrates.
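    A small sketch (my own; all names and constants are illustrative assumptions) of how such a hint works: the gyroscope's rotation estimate predicts where a feature will reappear, so the visual tracker searches only a small window instead of the whole frame:

    #include <cmath>

    struct Px { int x, y; };

    // Predict the feature's new pixel position from the yaw/pitch change measured
    // by the gyro (small-angle approximation, pinhole camera, focal length in px).
    Px predictFeature(Px last, double dYaw, double dPitch, double focalPx) {
        Px p;
        p.x = static_cast<int>(last.x + focalPx * dYaw);
        p.y = static_cast<int>(last.y + focalPx * dPitch);
        return p;
    }

    // The search radius grows with the gyro's drift uncertainty but stays far
    // smaller than the image: e.g. a ~20 px window instead of an exhaustive
    // search over a 640x480 frame.
    int searchRadius(double gyroSigmaRad, double focalPx) {
        return static_cast<int>(std::ceil(3.0 * gyroSigmaRad * focalPx)) + 5;
    }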

    2.2 Hardware

    In this section I would like to discuss hardware platforms for augmented reality systems. A few years ago AR was implemented on wearable computers, and the user watched the world through HMDs. Those times are fortunately gone: contemporary AR systems are implemented on tablet PCs or mobile phones equipped with a camera. We can say that the user looks through the handheld device, with the augmented camera image displayed on the screen in real time. In the following sections I will consider 3 hardware solutions.

    2.2.1 Ultra mobile PC and TabletPC

    Ultra Mobile PCs (UMPCs) are handheld devices with a big display (7-10 inches). Their processors are compatible with the PC architecture, hence they run PC operating systems like Windows or Linux, and the size of the RAM is also like in a normal PC. They have communication interfaces like USB ports, WiFi and Bluetooth, and some of them are equipped with a keyboard. Here I briefly analyse the usefulness of UMPCs for the 'Burgbau zu Friesach' project; table 2.7 further below lists examples of UMPCs and TabletPCs.

    Advantages:

    • Big display (7-10 inches). A large display lets the user see more. The ages of the exhibition visitors will be diverse, so the devices have to be adjusted to the needs of all users; I am thinking here of elderly people, who very often have problems with their eyesight. With a bigger display they feel more comfortable and may not have to use glasses. Bigger displays also have a higher resolution, which helps with rendering fonts, and they can fit a larger amount of text or graphics, which helps to avoid scrolling. The quality of the display is also important because of the lighting conditions: the display must have a very high contrast to be readable in sunny weather.

    • High portability. In the introduction I described a user story where a family walks through the construction site with a mobile device. A UMPC fits that scenario very well: it is lightweight and can work for about 5 hours on battery. I am not sure whether it would last that long with a running AR application; probably not, because of the high processor usage. An additional battery may therefore be necessary.

    • Other devices (cameras, sensors) can be connected. Many UMPCs available on the market have built-in cameras of quite high resolution, but these may not be sufficient for outdoor augmented reality. Very often the camera is mounted on the wrong side, which prevents its use for AR purposes. Fortunately, UMPCs have USB ports (some of them even IEEE 1394); most cameras and other sensors use these interfaces, so the possibilities of expansion are large. However, a UMPC equipped with an additional camera and sensors requires a casing that keeps all the parts together. An example of a system based on a UMPC with a housing and grip is presented in [SMK+08].

    • Large computing power, plenty of memory. The computing power of a UMPC is sufficient for augmented reality; the main difference from laptops is the single-core processor. UMPCs also have 2D/3D hardware graphics accelerators. The PC-like processor contains a floating-point unit, which is useful for the calculations connected with tracking and rendering.

    • Easy implementation of software. Software for such a device can be developed on a standard PC and simply run on the UMPC: no emulator or special compiler is needed, and the operating systems do not differ. The main difference lies in the hardware user interface: UMPCs have touch screens and sometimes QWERTY keyboards.

    Disadvantages:

    • The exhibitor must own the devices and rent them to spectators, which increases the risk of theft or damage by the visitors.

    • The UMPC can be too big. I know this contradicts something I said before, but the weight of a UMPC is about 800 g, and for some people walking with something so heavy in their hands can be uncomfortable.

    Table 2.7: UMPCs and TabletPCs available on the market

    Name                      Processor                Speed    RAM     Price  Cam/CB/WiFi/BT  USB/COM/IEEE1394/GPU  Weight
    Samsung Q1EX-71G          VIA Nano                 1.2 GHz  2 GB    $749   1/1/1           2/0/0/1               640 g
    Samsung Q1 Ultra          Intel Ultra Mobile A110  800 MHz  1 GB    €1029  1/1/1           2/0/0/1               860 g
    ELV Touchscreen-Panel-PC  VIA Nano                 1.0 GHz  1 GB    €1049  0/0/0           3/2/0/1               3600 g
    Gigabyte M704             VIA C7M ULV              1.2 GHz  768 MB  €745   1/0/1/1         2/0/1/1               780 g

    CB - camera on the back side

    2.2.2 Mobile phone

    Implementing augmented reality on mobile phones is also possible, but much harder than on a PC. I am speaking here about phones that allow running native code on their processors (Symbian, WinCE, iPhone); implementations of AR applications on J2ME phones are not known. In comparison to contemporary PCs, the computational power and the amount of RAM are much lower. Some mobile phones also do not support floating-point operations in hardware; an FPU is not essential, but numerical algorithms behave more stably when running on an FPU. I decided to discuss 4 mobile platforms: Symbian, WinCE, iPhone and Android. The last one is programmed in Java, but it does not contain the standard Java VM known from PCs or J2ME phones; hence it is faster and allows native code to run on the processor.


    Figure 2.6: Examples of ultra mobile PCs: (a) Samsung Q1 Ultra, (b) Gigabyte M704

    As crucial criteria I consider:

    • The presence and quality of the camera.

    • The presence of inertial sensors. More and more new mobile phones are equipped with accelerometers and gyroscopes.

    • The ability to run native code implemented in C/C++. Most computer vision libraries are implemented in C/C++, and mathematical libraries as well. C/C++ offers the best trade-off between code speed and programming comfort. However, porting software from a PC to a mobile phone requires a complete re-engineering of the code (described in [WS09]). But there