Field Service Support with Google Glass and WebRTC
Support av Fälttekniker med Google Glass och WebRTC

PATRIK OLDSBERG

Degree project in Computer Engineering, First level, 15 credits
Supervisor at KTH: Reine Bergström
Examiner: Ibrahim Orhan
TRITA-STH 2014:68

KTH School of Technology and Health
136 40 Handen, Sweden



Abstract

The Internet is dramatically changing the way we communicate, and it is becoming increasingly important for communication services to adapt to the context in which they are used.

The goal of this thesis was to research how Google Glass and WebRTC can be used to create a communication system tailored for field service support.

A prototype was created where an expert is able to provide guidance for a field technician who is wearing Google Glass. A live video feed is sent from Glass to the expert, from which the expert can select individual images. When a still image is selected it is displayed to the technician through Glass, and the expert is able to provide instructions using real time annotations.

An algorithm that divides the selected image into segments was implemented using WebGL. This made it possible for the expert to highlight objects in the image by clicking on them.

The thesis also investigates different options for accessing the hardware video encoder on Google Glass.


Sammanfattning

The Internet has dramatically changed how we communicate, and it is becoming increasingly important for communication systems to be able to adapt to the context in which they are used.

The goal of this thesis was to investigate how Google Glass and WebRTC can be used to create a communication system tailored for the support of field technicians.

A prototype was created that lets an expert provide guidance to a field technician who is using Google Glass. A video stream is sent from Glass to the expert, who can then select individual images from the video. When a still image is selected it is shown on Glass to the technician, and the expert can then give instructions with the help of real time annotations.

An algorithm that divides the selected image into segments was implemented with WebGL. It made it possible for the expert to highlight objects in the image by clicking on them.

The thesis also investigates different ways of accessing the hardware video encoder in Google Glass.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Research Goals and Contributions
  1.4 Limitations

2 Literature Review
  2.1 Wearable Computers
    2.1.1 History of Wearable Computers
  2.2 Augmented Reality
    2.2.1 Direct vs. Indirect Augmented Reality
    2.2.2 Video vs. Optical See-Through Display
    2.2.3 Effect on Health
    2.2.4 Positioning
  2.3 Collaborative Communication
  2.4 Image Processing
    2.4.1 Edge Detection
    2.4.2 Noise Reduction Filters
    2.4.3 Hough Transform
    2.4.4 Image Region Labeling

3 Technology
  3.1 Google Glass
    3.1.1 Timeline
    3.1.2 Interaction
    3.1.3 Microinteractions
    3.1.4 Hardware
    3.1.5 Development
    3.1.6 Display
  3.2 Alternative Devices
    3.2.1 Meta Pro
    3.2.2 Vuzix M100
    3.2.3 Oculus Rift Development Kit 1/2
    3.2.4 Recon Jet
    3.2.5 XMExpert
  3.3 Web Technologies
    3.3.1 WebRTC
    3.3.2 WebGL
  3.4 Mario
    3.4.1 GStreamer
  3.5 Video Encoding
    3.5.1 Hardware Accelerated Video Encoding

4 Implementation of Prototype
  4.1 Baseline Implementation
  4.2 Hardware Accelerated Video Encoding
    4.2.1 gst-omx on Google Glass
  4.3 Ideas to implement
    4.3.1 Annotating the Technician's View
    4.3.2 Aligning Annotations and Video
  4.4 Still Image Annotation
    4.4.1 WebRTC Data Channel
    4.4.2 Out of Order Messages
  4.5 Image Processing
    4.5.1 Image Processing using WebGL
    4.5.2 Image Segmentation using Hough Transform
    4.5.3 Image Segmentation using Median Filters
    4.5.4 Region Labeling
  4.6 Glass Application
    4.6.1 Configuration
    4.6.2 OpenGL ES 2.0
    4.6.3 Photographs by Technician
  4.7 Signaling Server
    4.7.1 Sessions and users
    4.7.2 Server sent events
    4.7.3 Image upload

5 Result
  5.1 Web Application
  5.2 Google Glass Application

6 Discussion
  6.1 Analysis of Method and the Result
    6.1.1 Live Video Annotations
    6.1.2 Early Prototype
    6.1.3 WebGL
    6.1.4 OpenMAX
    6.1.5 Audio
  6.2 Further Improvements
    6.2.1 Image Segmentation
    6.2.2 Video Annotations
    6.2.3 UX Evaluation
    6.2.4 Gesture and Voice Input
    6.2.5 More Annotations Options
    6.2.6 Logging
  6.3 Reusability
  6.4 Capabilities of Google Glass
  6.5 Effects on Human Health and the Environment
    6.5.1 Environmental Impacts
    6.5.2 Health Concerns

7 Conclusion

Bibliography


Introduction

1.1 Background

The Internet is dramatically changing the way we communicate, and it is becoming increasingly important for communication services to be adaptable to the context in which they are used. They also need to be flexible enough to be able to integrate into new contexts without excessive effort.

Using a wearable device allows a communication system to be tailored to the context to a greater extent. With additional information available, such as the movement, heart rate, or perspective of the user, a richer user experience can be achieved.

Wearable devices have huge potential in many different business fields. An upcoming form of wearable device is the head-mounted display (HMD). HMDs have the advantage of being able to display information in a hands-free format, which has huge potential in businesses such as medicine and field service. Perhaps the most recognized wearable device at the moment is Google Glass, which for brevity will sometimes be referred to simply as 'Glass'.

1.2 Problem Definition

The focus of the thesis involved a generalized use case where a field service technician has traveled to a remote site to solve a problem. The technician is equipped with Google Glass, or any equivalent HMD.

While on his or her way to the site, the technician has information such as the location of the site and the support ticket available. The technician can look up information such as the hardware on the site, the spare parts expected to resolve the issue, and recent tickets for the same site.

Once the technician has arrived at the site, the back office support will be notified. When at the site, the technician can view manuals and use the device to document the work.

If the problem is more complicated than expected, or the technician is unable to resolve the issue for some other reason, the device can be used to call an expert in the back office support.

The purpose of this thesis was to research how a contextual communication system can be tailored for this use case. The focus was the call to the back office support, which is made after the technician has arrived on site and requires assistance to resolve the issue.

1.3 Research Goals and Contributions

The goal was to research collaborative communication and find ways to tailor a communication system for this specialized kind of call. Different ways to give instructions and display information to the wearer of an HMD were investigated, as well as how these could be implemented.

An experimental prototype using some of the ideas was then constructed. The prototype was implemented using web real-time communication (WebRTC). At the time when the prototype was built there was no implementation of WebRTC available on Google Glass. A framework called Mario, developed internally at Ericsson Research, was used as the WebRTC implementation, as it runs on Android among other platforms.

The prototype comprised several different subsystems that were all implemented from scratch:

• A web application built with HTML5, WebGL and WebRTC technology.

• A NodeJS server acting as web server and signaling server.

• A Google Glass application using the Glass Development Kit (GDK).

An evaluation of the prototype was done with focus on how it could be further improved, and on whether any of the implemented ideas could be applied to similar use cases and devices. An evaluation of the capabilities of Google Glass with regard to media processing, and of the performance of Mario, was also performed.

1.4 Limitations

The time limit of the thesis was ten weeks; therefore a number of limitations were made so that it would be completable within this limit.

No in-depth evaluation of the user experience (UX) of the prototype would be done. The prototype was to be designed with UX in mind, using the result of the initial research, but no further evaluation would be done. A broad enough UX evaluation was not considered possible within the limited time frame.

The prototype would not be optimized for battery life and bandwidth usage. These restrictions are of course important, and the limitations imposed by them would be taken into consideration, but no analysis and optimization would be done to find optimal solutions with regard to these issues.


Literature Review

2.1 Wearable Computers

A wearable computer is an electronic device that is worn by the user. They often take the form of a watch or head-mounted display, but can also be worn on other parts of the body or e.g. be sewn into the fabric of clothing.

“An important distinction between wearable computers and portable computers (handheld and laptop computers for example) is that the goal of wearable computing is to position or contextualize the computer in such a way that the human and computer are inextricably intertwined, so as to achieve Humanistic Intelligence —i.e. intelligence that arises by having the human being in the feedback loop of the computational process.” —Steve Mann [1]

The definition of what a wearable computer is has changed over time. By some definitions a digital clock with an alarm from the 90's is a wearable computer, but that is not what we think of as a wearable computer today.

2.1.1 History of Wearable Computers

The first wearable computer was invented by Edward O. Thorp and Claude Shannon [2]. They built a device which helped them cheat at roulette, with an expected gain of 44%. It was a cigarette pack sized analog device that could predict in which octant of the table the ball was most likely to stay.

The next step in wearable computing was the first version of Steve Mann's EyeTap from 1981 [3]. It consisted of a computer in a backpack wired up to a camera and its viewfinder, which were mounted on a helmet. EyeTap has since then been significantly reduced in size, and the only visible part is an eyepiece used to record and display video. The eyepiece uses a beam splitter to reflect some of the incoming light to a camera. The camera then records the incoming image and sends it to a computer, which in turn processes the image and sends it to a projector. The projector is then able to overlay images on top of the user's normal view by projecting them onto the other side of the beam splitter.

In the mid 90’s the ‘wrist computer‘ was invented by Edgar Matias and MikeRuicci, and it had very a very di�erent interaction method. The entire device was

13

Page 14: Field Service Support with Google Glass and WebRTC724724/FULLTEXT02.pdf• A Google Glass application using the Glass Development Kit (GDK). An evaluation of the prototype was done

CHAPTER 2. LITERATURE REVIEW

strapped to the wrist and was marketed as a recording device. Although it wassignificantly larger than today’s smartwatches [4].

Even though there had been a number of great inventions, wearable devices had not gained any commercial traction at the turn of the millennium [5]. In the following years a few new devices were created, but it took another decade until wearable devices started getting public recognition.

In 2009 Glacier Computers designed the W200, a wrist computer that could run both Linux and Windows CE and had a 3.5" color touch screen. It was intended for use in security, defence, emergency services, and field logistics and could be equipped with Bluetooth, Wi-Fi and GPS [6].

In 2010 and the following years there was a newfound interest in the idea of smart wearable technology. A more practical alternative to the W200 became available with the optional attachment for the 6th generation iPod Nano which turned it into a wrist computer.

In April 2012, Pebble Technology launched a Kickstarter campaign for their Pebble smartwatch [7]. The initial fundraising target was set to $100,000, and it was reached within two hours of going live. It took six days for the project to become the highest funded project through Kickstarter so far. The fundraising closed after a little over a month with $10,266,844 raised, more than a hundred times the initial goal.

The success of the Pebble smartwatch was one of the indicators that wearable devices were getting some widespread recognition. Another example is the Oculus Rift, which is a virtual reality head-mounted display [8]. Just like Pebble, the Oculus Rift had a successful Kickstarter campaign, raising over $2.4M against a $250,000 target.

Another device unveiled in April 2012 was Google Glass, which we will look at in depth in the next chapter (3.1).

These devices were only a few of many that had reached the market or were about to be released. Some other smartwatches are the Sony SmartWatch [9], Samsung's Galaxy Gear [10], and a rumored smartwatch from Apple [11, 12].

More important to this thesis were the HMDs that were being developed or already released, such as the Meta Pro [13], Vuzix M100 [14], and Recon Jet [15].

2.2 Augmented Reality

Augmented reality (AR) is a variation of virtual reality (VR) where the real world is mixed with virtual objects [16]. VR completely immerses the user in a virtual world; the user is unable to see the real world. AR, on the other hand, keeps the user in the real world and superimposes virtual objects onto it. VR replaces reality whereas AR supplements it, enabling it to be used as a tool for communication and collaboration when solving problems in the real world.

A similar concept is augmented virtuality (AV), which is a middle point between AR and VR. It allows the user to still be present in the real world, but it is mixed with a virtual world to a further extent than AR. Physical objects are dynamically integrated into the virtual world and shown to the user through their virtual representation [17].

AR can be divided into two categories, direct AR and indirect AR [18]. There are also multiple ways in which AR can be achieved; for this thesis the focus was video and optical see-through displays.

2.2.1 Direct vs. Indirect Augmented Reality

Indirect AR is when the superimposed AR elements are not seen with the same alignment or perspective as the user's normal view [18]. Indirect AR is achieved by displaying an image of reality to the user, and overlaying elements on this image. It is possible to use this technique to achieve direct AR, but the image then has to be taken from the user's perspective, and displayed in alignment with the user's eyes.

An example of indirect AR is the iOS and Android application Word Lens [19]. The application translates text by superimposing a translation on top of the original text. The text is recorded using the back facing camera on the device, and the result is displayed on the device's screen. This is indirect AR, because the perspective and alignment of the screen differ from the user's. Direct AR can be emulated using this application by holding the device in alignment with one's eyes.

An example of direct AR is the system used to help workers heat up a precise line on a plate presented in [20]. The system uses AR markers to show a line where the worker is supposed to heat the plate. It uses a video see-through system that is aligned with the user's field of view and is mounted directly in front of one eye.

Even though Google Glass is often referred to as an AR device, has a forward facing camera, and a transparent display, it is not capable of direct AR. It is perhaps possible to calibrate an application to allow overlaying objects on the area that is behind the display, but this is an ineffective method as the display only takes up a very small area of the user's field of view. It would also be difficult if not impossible for the calibration to remain accurate, as the display's position relative to the user's eyes will change when moving.

Developing for direct AR devices provides more possibilities, but also places higher demands on the application. It is possible for an indirect AR application to show video with a high latency, or even still images. When using direct AR however, the latency is crucial, especially with a video see-through display [21]. Direct AR also requires precise calibration of the display, and sometimes a camera, in relation to the user's normal view [22].

2.2.2 Video vs. Optical See-Through Display

Video see-through can be used for both direct and indirect AR, while optical see-through can only be used for direct AR.


Optical see-through uses a transparent display that is able to show images independently on different parts of the screen. This makes it possible to display only the elements that are overlaid on reality, as the user is able to see the world through the transparent display [23].

Video see-through uses an opaque display, and the user's view is recorded with one or more cameras. The elements are superimposed on the recorded video, and the augmented video is displayed to the user. This technique requires very low latency and a precise alignment of the cameras with the user's usual field of view [21].

2.2.3 Effect on Health

When using a video see-through display for direct AR, it is important that the display is correctly calibrated for the user's normal field of view. When the perspective of the video displayed to the user differs heavily from the expected view, it takes a long time for the brain to adapt to the new perspective, but when the user takes off the display, it will only take a short time to restore the view. On the other hand, if the perspective is almost correct but differs slightly, it will take the brain a short time to adjust, but an extended period of time to restore the view. During this period the user might experience motion sickness with symptoms such as nausea.

Using an HMD for an extended period of time may cause symptoms such as nausea and eye strain. Age has some significance for the eye strain, with older users experiencing more apparent symptoms. It is also possible that females have a higher susceptibility to motion sickness caused by HMDs, although this might be the result of males being less likely to accurately report their symptoms [24].

2.2.4 Positioning

In many AR applications it is important to have knowledge of the device's position and orientation in relation to the surrounding space or objects. An AR system that has an accurate understanding of a user's position will be able to create a far more immersive experience, and often this is required for the system to be of any use at all.

Some sensors that can be used to help determine the position and orientation of a device are:

• Gyroscope

• Compass

• Accelerometer

• Global Positioning System (GPS)

• Camera


Sensing Orientation

An accurate, responsive and normalized measurement of the device's orientation in relation to the earth can be acquired using combined data from the gyroscope, accelerometer and compass.

The accelerometer is used to find the direction of gravity. It provides a noisy but accurate measurement of the device's tilt and roll in relation to the earth [25]. In order to remove the noise, a low-pass filter has to be used. The downside is that a low-pass filter will add latency to the sensor data, but this is solved using the gyroscope.

The gyroscope provides a responsive measurement of the device's rotation, but it cannot be used to reliably determine the rotation over time since that would cause the perceived angle to drift. It also has no sense of the direction of gravity or north.

The drift is a result of how the rotation is calculated using the angular speed received from the gyroscope. In order to calculate the angle of the device, the angular speed has to be integrated. With a perfect gyroscope this would not be an issue, but all gyroscopes will add at least a small amount of noise. When noise is integrated it is converted to drift, see Figure 2.1.

Figure 2.1: Noise vs. Drift

The drift is compensated for using the accelerometer and compass. The compass cannot be used on its own because the output is noisy, and it needs to be tilt compensated, i.e. it needs to be aligned with the horizontal plane of the earth [26].

In the end, the gyroscope is used to calculate the orientation of the device; it is then corrected with regard to the angle of gravity by the accelerometer, and the direction of the magnetic north pole by the compass.
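As an illustration of the fusion described above, the following is a minimal sketch of a complementary filter for a single axis, written in JavaScript. The filter coefficient, the sample period and all names are illustrative assumptions, not taken from the thesis or from any particular sensor API.

    // Complementary filter for one axis: the gyroscope is integrated for a
    // responsive estimate, and the result is pulled towards the angle derived
    // from the accelerometer, which is noisy but does not drift.
    function createOrientationFilter(alpha = 0.98) {
      let angle = 0; // current estimate in radians

      return function update(gyroRate, accelAngle, dt) {
        const gyroAngle = angle + gyroRate * dt;              // integrate angular speed
        angle = alpha * gyroAngle + (1 - alpha) * accelAngle; // correct the drift
        return angle;
      };
    }

    // Usage: call once per sensor sample, e.g. every 10 ms.
    const updatePitch = createOrientationFilter();
    // updatePitch(gyroRateRad, Math.atan2(accel.y, accel.z), 0.01);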

Sensing Position

Finding the position is a far greater challenge than determining the orientation of a device.

GPS is only usable outdoors and provides a position with an error of a few meters [27]. This makes it unusable for any AR system that is not viewed at a large scale, e.g. navigation.

The output from the accelerometer is acceleration, which can be integrated once to calculate speed, and twice to calculate position. Like the gyroscope, the output from the accelerometer is noisy, and as explained, integrating noise causes drift. This effect is amplified when noise is integrated twice; the perceived position of the device can drift several inches in a second [28].

The largest problem is not the double integration of noise, but small errors in the computed orientation of the device. When calculating the position of the device, the linear acceleration needs to be used, i.e. the observed acceleration minus gravity. The value of the linear acceleration is obtained by subtracting gravity from the observed acceleration using the orientation calculated with the method described in the previous section.

The dependency on a correct value of the orientation of the device turns out to be a huge problem when trying to calculate the position of the device. A small error in the orientation will lead to a very large error in the calculated position. If the computed orientation is off by 1%, the device will be perceived to move at 8 m/s [28].

This means that the only way to accurately determine the position of the device using the listed sensors is with the camera, using image processing. A common method of achieving this is by using markers, figures that are easily recognizable by a camera [29]. This technique will often only determine the markers' position in relation to the device, and not the device's position in relation to the world. It is also possible to use data from the accelerometer to assist in marker detection with methods such as the one described in [30].

2.3 Collaborative Communication

A central part of the research was the interaction between the expert and the technician. The expert should have a better way of giving instructions than simply providing voice guidance while viewing a live video feed.

The study presented in [31] explores different methods for remote collaboration. Two pilot studies were conducted that give an understanding of the difference in effectiveness between the different methods. The study did not come to a clear conclusion as to which method was the most effective, but it did rule out some solutions.

The study compared the effectiveness of instructions given using a single pointer (a red dot controlled by the instructor) and real time annotations made by drawing. It also compared the use of a live video feed and a shared still image. In total, the compared methods were:

1. Pointer on a Still Image

2. Pointer on Live Video

3. Annotation on a Still Image

4. Annotation on Live Video

The tests were carried out using a pair of software applications. The person that was receiving the instructions used a tablet to record video or capture images, while the instructor used a PC application. The instructor was given an image of a number of objects arranged in a specific manner, while the other person had the same objects laying in front of them in a different arrangement. The instructor then used the remote collaboration tool to tell the other person how the objects should be arranged. The data presented was: task completion time, number of errors made, and total distance the mouse was moved by the instructor.

The study clearly showed that method 1 was ineffective when considering the time taken to complete the task. It took on average 25% longer than any other method. The other results were less clear; methods 4 and 3 had a slight advantage over 2, while 3 and 4 were equivalent.

In a second test, where only methods 2 and 4 were included, it was shown that it was important to have a method of erasing the annotations. If the drawings were not removed, the old drawings would confuse the user.

The tests also showed that the pointer based method caused the instructor to move the mouse roughly five times as far compared to the annotation based method.

2.4 Image Processing

As a part of the collaborative communication interface, one of the ideas that were chosen to be implemented in the prototype was highlighting of objects in an image. Therefore, different methods for detecting objects in an image were investigated, as well as methods of representing the region that constituted the object in a way that is simple to transmit over the network.

2.4.1 Edge Detection

Edge detection is a method for identifying points in an image where a given property of the components has discontinuities. The method can be used to detect changes in e.g. color or brightness and is an important tool in image processing [32]. An example of edge detection by color is displayed in figure 2.2. Figure 2.2a shows an arbitrary image, while figure 2.2b shows a possible result of an edge detection.

Figure 2.2: Edge detection by color. (a) Before; (b) After.


There are many different methods that can be used for edge detection. For this thesis, a very simple implementation was chosen, which is found in [33]. The method provides a baseline edge detection implementation which can be improved upon if required by the application.

The method is as follows:

For each pixel $I_x$, the average color difference of the opposing surrounding pixels is computed. Consider the arrangement of pixels surrounding any pixel $I_x$:

    I_2  I_5  I_8
    I_1  I_x  I_7
    I_0  I_3  I_6

The following formula is used to calculate the value of $I_x$:

$$I_x = \frac{|I_1 - I_7| + |I_5 - I_3| + |I_0 - I_8| + |I_2 - I_6|}{4} \qquad (2.1)$$

The result from 2.1 is then multiplied by a scale factor $s$ and compared to two thresholds, $T_{min}$ and $T_{max}$, using the following formulas:

$$I'_x = s I_x \qquad (2.2)$$

$$I''_x = \begin{cases} 0 & \text{if } I'_x < T_{min} \\ 1 & \text{if } I'_x > T_{max} \\ I'_x & \text{if } T_{min} \le I'_x \le T_{max} \end{cases} \qquad (2.3)$$

The scale factor $s$ and the thresholds $T_{min}$ and $T_{max}$ are all parameters to the filter, and will give varying results depending on their values.
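As a concrete illustration, the filter described by equations 2.1–2.3 can be written as follows in JavaScript, here operating on a single-channel image stored as a flat array (the thesis applies it to color differences). The function and parameter names are illustrative, and border pixels are skipped for brevity.

    // Edge detection following equations 2.1-2.3. src is a flat array of
    // intensity values, s is the scale factor, tMin and tMax are the thresholds.
    function detectEdges(src, width, height, s, tMin, tMax) {
      const dst = new Float32Array(src.length);
      const at = (x, y) => src[y * width + x];

      for (let y = 1; y < height - 1; y++) {
        for (let x = 1; x < width - 1; x++) {
          // Average difference of the four opposing pixel pairs (eq. 2.1).
          const ix = (
            Math.abs(at(x - 1, y) - at(x + 1, y)) +          // |I1 - I7|
            Math.abs(at(x, y - 1) - at(x, y + 1)) +          // |I5 - I3|
            Math.abs(at(x - 1, y + 1) - at(x + 1, y - 1)) +  // |I0 - I8|
            Math.abs(at(x - 1, y - 1) - at(x + 1, y + 1))    // |I2 - I6|
          ) / 4;

          // Scale (eq. 2.2) and apply the two thresholds (eq. 2.3).
          const scaled = s * ix;
          dst[y * width + x] = scaled < tMin ? 0 : scaled > tMax ? 1 : scaled;
        }
      }
      return dst;
    }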

2.4.2 Noise Reduction Filters

An important part of the image processing was to reduce the noise in the image, improving the result of the edge detection. A few methods of noise reduction were explored:

Gaussian Blur

Gaussian blur is a low-pass filter that blurs the image using a Gaussian function. It calculates the sum of all pixels in an area, after applying a weight to each pixel.


The weights are computed using the two-dimensional Gaussian function 2.5, a composition of the one-dimensional Gaussian function 2.4.

$$G(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{x^2}{2\sigma^2}} \qquad (2.4)$$

$$G(x, y) = G(x) \cdot G(y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (2.5)$$

When implementing a Gaussian filter, a matrix is computed using the Gaussian function and subsequently normalized. A new value of every pixel in the image is then calculated by multiplying it and the surrounding pixels by the corresponding matrix component. This operation can be divided into two steps, horizontal blur and then vertical, or vice versa. Doing this yields the same result, but fewer calculations are needed.
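A sketch of this two-pass approach is shown below in JavaScript: a normalized one-dimensional kernel is built from equation 2.4 and applied horizontally; the vertical pass is identical with the axes swapped. The radius and sigma parameters and the clamping at the borders are illustrative choices, not prescribed by the thesis.

    // Build a normalized 1D Gaussian kernel from equation 2.4.
    function gaussianKernel(radius, sigma) {
      const kernel = [];
      let sum = 0;
      for (let i = -radius; i <= radius; i++) {
        const w = Math.exp(-(i * i) / (2 * sigma * sigma));
        kernel.push(w);
        sum += w;
      }
      return kernel.map(w => w / sum); // weights sum to 1 after normalization
    }

    // Horizontal blur pass over a single-channel image stored as a flat array.
    function blurHorizontal(src, width, height, kernel) {
      const radius = (kernel.length - 1) / 2;
      const dst = new Float32Array(src.length);
      for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
          let acc = 0;
          for (let k = -radius; k <= radius; k++) {
            const xx = Math.min(width - 1, Math.max(0, x + k)); // clamp at borders
            acc += src[y * width + xx] * kernel[k + radius];
          }
          dst[y * width + x] = acc;
        }
      }
      return dst;
    }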

Median Filter

A median filter is a nonlinear filtering technique which is often used to remove noise from images. It will preserve edges in an image, which makes it well suited for segmenting an image. A filter such as Gaussian blur will remove noise, but also blur edges, which can make a subsequent edge detection filter less effective [34].

A median filter is similar to the Gaussian blur in that each pixel's value is calculated using the surrounding pixels. But instead of calculating a weighted average, the median of the surrounding pixels is calculated [35]. This causes common colors in an area to become more dominant, while removing noise.
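For comparison, a minimal 3×3 median filter might look as follows in JavaScript; the window size is an illustrative choice and border pixels are simply copied unchanged.

    // 3x3 median filter on a single-channel image: each output pixel becomes the
    // median of its 3x3 neighbourhood, which removes noise while keeping edges.
    function medianFilter3x3(src, width, height) {
      const dst = Float32Array.from(src); // borders are copied unchanged
      for (let y = 1; y < height - 1; y++) {
        for (let x = 1; x < width - 1; x++) {
          const window = [];
          for (let dy = -1; dy <= 1; dy++) {
            for (let dx = -1; dx <= 1; dx++) {
              window.push(src[(y + dy) * width + (x + dx)]);
            }
          }
          window.sort((a, b) => a - b);
          dst[y * width + x] = window[4]; // middle of the 9 sorted values
        }
      }
      return dst;
    }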

2.4.3 Hough Transform

The Hough transform is a technique which can be used to isolate features of any particular shape within an image [36]. It is applied on a binary image, typically after applying an edge detection method. The simplest form is the Hough line transform, used to detect lines.

The idea of the Hough transform is that each point in the image that meets a certain criterion is selected, e.g. all white pixels. Each point then votes for every line that could pass through that point, which in practice is a discrete number of lines selected depending on desired performance and resolution. When every point has voted, the votes are counted and lines are selected using the result of the vote. The lines that are selected span the entire image, which means that a second algorithm has to be employed in order to find the line segments that were present in the image.

An efficient implementation of the Hough transform requires the possibility to represent a line using two variables. For example, in the Cartesian coordinate system a line can be represented with the parameters (m, b), using y = mx + b. In the polar coordinate system a line is represented with the parameters (r, θ), where r = x cos θ + y sin θ. Because it is impossible to represent a line which is parallel to the y-axis using y = mx + b, polar coordinates are used to represent a line in a Hough transform. Polar coordinates were not used initially; this improvement was suggested by Duda and Hart in [37].

The voting step of the Hough transform is typically implemented using a two-dimensional array called an accumulator. The two dimensions represent all possible values of r and θ respectively.

All fields in the accumulator are initialized to 0. When a vote is cast for a line represented by two distinct values of r and θ, the corresponding field in the accumulator is incremented.

Once all votes are counted, the field in the accumulator with the highest value gives the r and θ values of the most prominent line in the image. Several lines can be found using various techniques such as finding local maxima, selecting a set number of lines, or finding all values above a threshold.
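A sketch of the voting step in JavaScript is shown below, using the polar (r, θ) parameterization described above. The angular resolution and the way the strongest line is picked out are illustrative simplifications, not the thesis's implementation.

    // Hough line transform over a binary edge image (non-zero pixels vote).
    function houghStrongestLine(edges, width, height, numAngles = 180) {
      const maxR = Math.ceil(Math.hypot(width, height));
      // Accumulator indexed by [angle][r + maxR] so that negative r fits.
      const acc = Array.from({ length: numAngles },
                             () => new Uint32Array(2 * maxR + 1));

      for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
          if (!edges[y * width + x]) continue;
          for (let t = 0; t < numAngles; t++) {
            const theta = (t * Math.PI) / numAngles;
            const r = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
            acc[t][r + maxR]++; // vote for the line (r, theta)
          }
        }
      }

      // Pick the single field with the most votes; local maxima or a threshold
      // could be used instead to find several lines.
      let best = { votes: 0, r: 0, theta: 0 };
      for (let t = 0; t < numAngles; t++) {
        for (let ri = 0; ri < acc[t].length; ri++) {
          if (acc[t][ri] > best.votes) {
            best = { votes: acc[t][ri], r: ri - maxR, theta: (t * Math.PI) / numAngles };
          }
        }
      }
      return best;
    }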

2.4.4 Image Region Labeling

Figure 2.3: Region labeling

Region labeling, or connected-component labeling, is used to connect and label regions in an image. What connects the different regions can vary, but a binary image such as the one in figure 2.3 is typically used. In this case all the white regions are labeled [38].

An algorithm that connects the pixels in an image based on 4-way connectivity is presented in [39].

The algorithm is split into two passes; the first assigns temporary labels and remembers which labels are connected, while the second pass resolves all connected labels so that each region only has one label.

The pixels are iterated through in row major order, from the top left to the bottom right. Each pixel is checked for connectivity to the north and west, the pixels marked in table 2.1. For every pixel that is encountered that should be labeled, i.e. is white, the following steps are carried out:


       N
    W  ·

Table 2.1: Connected Pixels (the north and west neighbors of the current pixel)

1. If neither the North nor the West pixel is labeled, assign a new label to the pixel and move on to the next pixel. Otherwise, move to the next step.

2. If only one of the North and West pixels is labeled, assign that label to the pixel and move on to the next pixel. Otherwise, move to the next step.

3. If both the North and West pixels are labeled, take the following steps.

   a) If the North and West pixels have the same label, assign that label to the pixel and move on to the next pixel. Otherwise, move to the next step.

   b) The North and West pixels must have different labels; record that the two labels are equivalent, assign the lower of the two labels to the pixel, and move on to the next pixel.

The second pass consists of iterating through every pixel in the image once more. Any pixel that is encountered that has a label which has an equivalent label with a lower value has its label reassigned to the lower value.

After the second pass all pixels are guaranteed to have the same label if and only if they are connected to each other.
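A compact sketch of the two passes in JavaScript is given below. The union-find structure used to resolve equivalent labels is an implementation choice for illustration; any way of recording and resolving label equivalences works.

    // Two-pass 4-connectivity region labeling of a binary image.
    function labelRegions(binary, width, height) {
      const labels = new Int32Array(width * height);
      const parent = [0];                       // union-find parents, label 0 = background
      const find = l => { while (parent[l] !== l) l = parent[l]; return l; };
      let next = 1;

      // First pass: assign temporary labels and record equivalences.
      for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
          if (!binary[y * width + x]) continue;
          const west = x > 0 ? labels[y * width + x - 1] : 0;
          const north = y > 0 ? labels[(y - 1) * width + x] : 0;
          if (!west && !north) {
            labels[y * width + x] = next;       // new temporary label
            parent.push(next);
            next++;
          } else if (west && north && west !== north) {
            const lo = Math.min(find(west), find(north));
            const hi = Math.max(find(west), find(north));
            parent[hi] = lo;                    // record that the labels are equivalent
            labels[y * width + x] = lo;
          } else {
            labels[y * width + x] = west || north;
          }
        }
      }

      // Second pass: replace every label with the lowest equivalent label.
      for (let i = 0; i < labels.length; i++) {
        if (labels[i]) labels[i] = find(labels[i]);
      }
      return labels;
    }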


Technology

This chapter describes the hardware and software involved in the thesis.

3.1 Google Glass

Google Glass is a wearable computer in the form of an HMD. It has a camera, a horizontal touchpad, and an optical see-through display. Audio is received by the user through a bone conduction transducer, and there are one or two internal microphones [40]. The battery capacity is enough for a day of normal use, but will drain rapidly when e.g. recording a video [41]. The device is delivered with a pair of attachable shades, nose pads of varying size, and a Micro USB mono headphone.

Google Glass runs a modified version of Android. At the beginning of the work on this thesis, the base Android version was 4.0.4 (Ice Cream Sandwich), but this was later changed to version 4.4 (KitKat) with the XE16 update.

The device is configured using either a website or a smartphone companion application. The configuration is done either using Bluetooth, if using the smartphone application, or with a QR code, if done through the web page.

3.1.1 Timeline

The timeline is the central part of the interaction with Google Glass. It consists of cards, with each card being some content attached to a point in time. Cards can include simple static content such as a URL or an image. They can also be more complex and display dynamic content; such cards are called live cards. Live cards are used e.g. in the compass or navigation applications.

The centerpiece of the timeline is the clock screen, which is the default screen to be shown when the display is activated. From this screen, the timeline can be scrolled in two directions, left and right.

Scrolling to the right represents going back in time. When new cards are inserted in the timeline, they are placed next to the center. As newer cards are inserted, the older cards are gradually pushed to the right and eventually disappear once they are too old.

Scrolling to the left reveals cards that are happening “now”. Some of the cards shown here are permanent, such as the weather, calendar, and settings cards. They are mixed with cards that are temporarily inserted by the user, e.g. the navigation, stopwatch, and compass cards.

3.1.2 Interaction

The main methods of interaction with Google Glass are the touchpad and voice commands. The multi-touch touchpad is located on the outside of the device along the bearer's temple. The touchpad is currently the only way to navigate through the timeline, although newly inserted cards are shown if the screen is activated soon after they are received.

Some of the cards in the timeline have a text at the bottom reading “ok glass”; the user can either tap the touchpad or say “ok glass” to activate these cards. If the card is activated using voice input, a list of all voice commands that are available to the user will be shown. If it is activated through touch input, the user can scroll through the same commands and activate them by tapping once more.

The clock card also has the “ok glass” text at the bottom. When this card is activated it shows a list of all the voice commands that can be used to start new interactions with the device, such as “take a picture” and “get directions”. Developers can add their own applications to this list, as long as they adhere to a predetermined set of voice commands.

There are two intended ways of activating the screen while wearing Google Glass: tilting the head backwards, or tapping the touchpad. Since the display can be activated by simply tilting one's head, it is possible to perform most common tasks using only voice input. This allows the device to be used completely hands-free. This has great potential for the field support use case, but if it is used as the only input method, the application will be vulnerable to noisy environments, or crowded environments where the technician would prefer not to use voice input.

A less apparent way of interacting with the device is an inward facing IR sensor. The sensor is able to detect when the user blinks. This is used by a functionality that is built into the system which lets the user take pictures by winking [42].

3.1.3 Microinteractions

The design of Google Glass is focused around microinteractions [43]. The goal is to allow the user to go from intent to action in a short time.

When a user takes a note there should only be a few seconds between when the user realizes that a note needs to be written down, and when the note has been completed. When a wearer of Google Glass receives an email, it will take a few seconds before the user is looking at the email and determining if it is important or not, provided the user is not preoccupied.

This reduction in time for simple interactions results in the removal of the barrier towards the world that technology creates around the user. When using a smartphone, if the user has invested 20 seconds to retrieve the device and navigate to the appropriate application, she is bound to devote even more time while it is in her hand. An incoming SMS leads to a quick scan of the inbox, and why not check the email and social media as well? By limiting the time it takes to do these small tasks, the time in which technology is out of the way can be increased [43].

3.1.4 Hardware

The hardware in Google Glass has not been made official, but a teardown and an analysis with ADB have revealed some information about what components are inside [40]:

• Texas Instruments OMAP4430 SoC

• 1 GB RAM, 682 MB unreserved

• 16 GB Flash storage, 4 GB reserved by system

• 640x360 pixel display

• 5 MP camera capable of recording up to 720p

• Bone conduction transducer

• One or two internal microphones

• 802.11b/g Wi-Fi

• Bluetooth 4.0

• 3 axis gyroscope, accelerometer, magnetometer

• Ambient light sensor

• Proximity sensor

The CPU and RAM are comparable to a low-end smartphone such as the Motorola Moto G [44].

A notable component that is absent is a 3G or 4G modem. This is because Google Glass is not intended to be used as the only device carried by the user.

3.1.5 Development

Initially, the only way to develop applications for Google Glass that could run without rooting the device was through the Mirror API [45]. It allows developers to build web services, called Glassware, that interact with the users' timelines. Static cards can be added, removed and updated using the provided REST API.

In late November 2013 the Glass Development Kit (GDK) was released. It was built on top of Android 4.0.4 and provides functionality such as live cards, simplified touch gestures, user interface elements, and other components that streamline application development for Google Glass.

In April 2014, the XE16 update was released. This update brought the Android version up to 4.4, which includes the new WebView and MediaRecorder APIs.

3.1.6 Display

The display in Google Glass is a 640 by 360 pixel display and is roughly the equivalent of looking at a 25 inch screen from eight feet away. It is a transparent optical display, and it takes up a very small portion of the user's field of view, approximately 14° [46].

The display is positioned above the user's natural field of view, which means that in order to look at the display the user needs to look up. This keeps the display from obscuring the user's vision, even while it is active.

3.2 Alternative Devices

Other devices than Google Glass were explored to gain a broader understanding of wearable devices and to provide alternatives. The type of wearable computer that was the focus of this thesis was the head-mounted display, which meant that only such devices were evaluated. It is possible to use e.g. a smartwatch with a camera, or a chest mounted projector such as SixthSense [47], for the use case, but HMDs were prioritized as it is more likely that a solution for Google Glass is reusable on another HMD. The list was also limited to devices that were commercially available or soon to be commercially available.

The devices were compared to Google Glass with a few main aspects in mind:

• How similar is the hardware?

• What display technology is used, and how does it differ?

• How do the development platforms compare?

3.2.1 Meta Pro

Although the Meta Pro [13] might seem similar to Google Glass, using it would significantly change the user interaction. It provides direct AR through optical see-through displays as well as motion detection using depth sensors. It has significantly more powerful hardware directly available, as all the processing is done on a separate high-performance pocket computer capable of running x86-based desktop operating systems [48].

Applications are developed either using Meta's internally developed SDK or with Unity3D [49]. It is unlikely that any non-native Android application that is developed for Google Glass will be easily portable to this device. In the end this does not matter, as the user interaction is so fundamentally different that any application designed with Google Glass in mind would have to be redesigned from the ground up.

3.2.2 Vuzix M100

The Vuzix M100 [14] is very similar to Google Glass; both have an OMAP 4 CPU, although the M100 has a slightly faster OMAP4460. The M100 also runs Android 4.0.4, has a similar display which only makes it capable of indirect AR, and shares many other hardware features. Any application developed for Google Glass would likely be easy to port to the M100. Differences between the two are that the M100 has an opaque display and has multiple buttons instead of a touchpad.

3.2.3 Oculus Rift Development Kit 1/2

The Oculus Rift [8] is primarily intended for use as an alternative to a conventional display when playing computer games. The available development version does not have a transparent display or a front-facing camera, which makes it incapable of performing AR. This can be solved using custom modifications such as attaching a front-facing camera, which is potentially going to be included in the consumer version [50].

The Oculus Rift has to be connected to a control box, which in turn receives images from an external device. This makes it very impractical to use and unlikely to fit the use case.

3.2.4 Recon Jet

The Recon Jet [15] has a lot in common with Google Glass; it has a small optical HMD and is also incapable of direct AR. Since it also runs Android, but with an unknown CPU, it would likely be easy to port any application from Glass to run on the Recon Jet. This device has an advantage over Glass in that the touch sensor works in all weather conditions, and also while wearing gloves [51]. This can be a big advantage depending on the typical work environment of the field technician.

3.2.5 XMExpert

The XMExpert [52] is a complete solution for field service support, and is included here as a parallel solution rather than an alternative device. It includes a portable workstation for the expert, and a helmet mounted AR system for the field technician. Instructions are given by the expert using his hands, which are overlaid on the technician's view in a similar way to the Vipaar system [53].

Just like the Oculus Rift, the AR helmet system was in itself quite large, and the expert also has to carry a backing unit when moving around. This makes the system less suited for the entire use case, where the technician is able to use the same device to display information when traveling to the location as well as when working.

3.3 Web Technologies

The number of technologies available in web browsers has grown significantly over the past years; the days when browsers could only view static HTML pages are long gone. HTML5 is the latest version of HTML, the previous version being 4.01, which was released in 1999 [54]. It adds multiple features that can be used to allow rich content in the browser without the need for plugins.

3.3.1 WebRTC

RTC stands for Real-Time Communication, technology that enables peer-to-peer audio, video, and data streaming. WebRTC provides browsers with RTC capabilities, allowing browser-to-browser teleconferencing and data sharing, without requiring plug-ins or third-party software [55].

WebRTC is currently supported by Google Chrome, Mozilla Firefox, Opera, and Opera Mobile. The implementation of WebRTC varies between the different browsers; not all features are implemented everywhere and some do not have interoperability. This is partly because the WebRTC standard is still under debate [56].

The WebRTC API has three important interfaces: getUserMedia, RTCPeerConnection and RTCDataChannel [57]. getUserMedia is used to gain access to the user's video and audio recording devices through a LocalMediaStream object. When getUserMedia is called, the user is prompted to allow the use of the recording devices. The user is able to deny the request, protecting against malicious websites and recording at inconvenient times.

The RTCPeerConnection interface handles the connection to another peer. The LocalMediaStream object retrieved with getUserMedia can be added to an RTCPeerConnection in order to allow the remote peer to receive a media stream from the user. Likewise, the RTCPeerConnection will receive remote streams when these are added by the peer.

An RTCPeerConnection instance will emit events with signaling data that needs to be received by the other peer. How these signals should be transmitted is not included in the WebRTC specification, and needs a separate implementation.

The RTCDataChannel interface provides a real-time data channel which transmits text over the Stream Control Transmission Protocol (SCTP) [58]. SCTP supports multiple channels that can be configured depending on what kind of reliability is required. By default the channels will retransmit lost packets and guarantee that packets will arrive in order, but both of these behaviors can be switched off.
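The following is a minimal sketch of how these interfaces fit together in the browser, written against the current promise-based JavaScript API, which differs in detail from the callback-based API that was available in 2014 (e.g. addTrack instead of addStream). The sendToPeer function is a placeholder for the application's own signaling channel, which WebRTC does not provide.

    // Capture local media, create a peer connection and a data channel,
    // and produce an offer that must be relayed to the remote peer.
    async function startCall(sendToPeer) {
      const pc = new RTCPeerConnection();

      // Prompt the user for camera and microphone access and send the tracks.
      const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
      stream.getTracks().forEach(track => pc.addTrack(track, stream));

      // A data channel, e.g. for transmitting annotation coordinates.
      const annotations = pc.createDataChannel('annotations');
      annotations.onopen = () => annotations.send(JSON.stringify({ type: 'hello' }));

      // ICE candidates and the session description are signaling data that the
      // application has to deliver to the other peer itself.
      pc.onicecandidate = e => { if (e.candidate) sendToPeer({ candidate: e.candidate }); };
      const offer = await pc.createOffer();
      await pc.setLocalDescription(offer);
      sendToPeer({ sdp: pc.localDescription });

      return pc;
    }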


3.3.2 WebGL

WebGL (Web Graphics Library) enables browsers to access graphics hardware on the host machine in order to render accelerated real time graphics. It is primarily used to render 3D scenes, but can be used for other purposes as well, such as image processing.

The WebGL API is based on OpenGL ES 2.0 [59]. OpenGL ES is in turn a subset of OpenGL [60], aimed at embedded devices. The API uses an HTML5 canvas element [61] as its drawing surface.

Figure 3.1: Simplified WebGL Pipeline

WebGL uses a pipeline to do its work; a simplified version of the WebGL pipeline can be seen in figure 3.1. The blue elements are set with the WebGL API, while the red elements are fixed, although in some cases their properties can be changed. The green elements are called shaders, and it is through these that the developer can program the GPU. The shader programs are written in the OpenGL Shading Language (GLSL) [62].

WebGL is essentially a 2D API, and not a 3D API [63]. The output of a WebGL pipeline is a 2D surface, and the developer is responsible for creating the 3D effect. A program that wishes to display an object that is specified in 3D coordinates has to project the object onto a 2D plane called clipspace. The coordinates of the clipspace plane are always -1 to +1 along both the x-axis and y-axis.

The vertex shader is responsible for converting 3D points (vertices) to clipspace coordinates. It takes a single vertex as input and outputs the clipspace coordinates of the vertex. Each vertex is a part of a primitive shape, such as a triangle or line. When the vertex shader has computed the position for all points in a primitive, it is rasterized.

Rasterization is the process of converting a shape specified by coordinates to a raster image (pixels). An illustration of this is shown in figure 3.2. When the rasterization is complete, every pixel that is produced is sent to the fragment shader (this is not always the case, e.g. if multisampling is used [64]).

Figure 3.2: Illustration of Rasterization

The fragment shader is responsible for setting the color of each pixel it receives and can also discard the pixel. When OpenGL is used for image processing, this is commonly where all the processing is done.

The end of the pipeline, the framebuffer [65], is by default the buffer that is displayed to the user once the rendering is complete. It is however possible to replace the default framebuffer with one specified by the programmer. Using this feature it is possible to render a scene to a texture, which can then be used when rendering the next scene.

3.4 Mario

Mario is an implementation of WebRTC. It is built using the GStreamer library and runs on OS X, Android, iOS and Linux.

Mario can be used to write both web-based and native applications. A C API is used for native applications, while web-based applications use a JavaScript API. The JavaScript API uses remote procedure calls (RPC) [66] to send commands to a server running Mario in the background.

3.4.1 GStreamer

GStreamer [67] is an open source library that runs on Linux, OS X, iOS, Windows, and many other platforms. GStreamer relies on GObject [68] to provide object oriented abstractions. The main purpose of GStreamer is to construct media processing pipelines using a large number of provided elements.

GStreamer is designed with a plugin architecture which allows for plugins to be dynamically loaded upon request. Each plugin contains one or more elements that can be linked together. An example pipeline of a simple MP3 player with an audio visualizer can be seen in figure 3.3. The pipeline does not function unless a queue element is inserted, which is left as an exercise for the reader.

Each element in the pipeline has its own function:

filesrc reads a file and passes on a raw byte stream

mpegaudioparse parses the raw data into an encoded MP3 stream.

mpg123audiodec decodes the MP3 stream and passes on raw audio data.

tee splits the stream.

autoaudiosink automatically selects a suitable method for audio playback and plays the received audio stream.

wavescope creates a wave visualization of the audio stream.

autovideosink automatically selects a suitable method for video playback and shows the received video stream.

Figure 3.3: Example GStreamer Pipeline

This pipeline architecture makes it simple to swap out parts of the pipeline while keeping the rest intact. For example, if the previous MP3 player wanted to play back a Vorbis stream instead, this could be achieved by replacing mpegaudioparse and mpg123audiodec with vorbisparse and vorbisdec respectively. This is used in Mario to be able to encode and decode different video and audio formats.

3.5 Video Encoding

The current implementation of Mario on Android uses software encoders, some of which are NEON [69] optimized. A high-end consumer smartphone has enough processing power to encode video in realtime using a software encoder. However, the OMAP 4430 processor in Google Glass XE does not have enough processing power to encode video at an acceptable resolution and frame rate.

The desired resolution was VGA (640 x 480 pixels), with a frame rate of 25 frames per second. When using the default software encoder with these settings, no video was received at all. The NEON optimized encoder was able to transmit video at a few frames per second with a very low video quality.

While this was enough for initial testing, the performance of the video encoder would have to be improved significantly if used in a real application.

3.5.1 Hardware Accelerated Video Encoding

The most computationally intensive task in a real time multimedia communication application is normally video encoding [70]. Therefore, many mobile devices have dedicated hardware that can be used for video (and audio) encoding. Google Glass is no exception; its OMAP 4430 SoC has an IVA3 multimedia hardware accelerator.

If hardware video encoding was to be used, it would need to be incorporated in the GStreamer pipeline in Mario. The only sensible way to enable hardware encoded video was by using an existing GStreamer plugin. Three possible alternatives were found: gst-ducati, gst-omx and the androidmedia plugin.

All the listed plugins access the hardware at different layers: gst-ducati uses libdce (distributed codec engine), which is the lowest level; gst-omx uses OpenMAX IL, a middle layer; and androidmedia uses the Android MediaCodec API, the highest layer.

The layer that is chosen has an effect on where the application will be able to run. Using gst-ducati will limit the use to devices with an OMAP processor, while gst-omx and androidmedia should theoretically allow the application to run on any Android device [71].

androidmedia

This plugin uses the Android MediaCodec API, which was added in Android 4.1. The API is built on top of OpenMAX IL [71], and makes it easier to provide cross-compatibility. Since Google Glass initially ran Android 4.0.4 it could not be used.

gst-ducati

gst-ducati can be found at [72]; it depended on the following libraries: libdce, libdrm, a custom branch of gst-plugins-bad, and libpthreadstub, which were all also available at [72]. The Distributed Codec Engine (DCE) which gst-ducati uses is written specifically for the IVA-HD subsystem in the OMAP4 processor. It interfaces directly with the co-processor without the need of OpenMAX.

gst-omx

OMX is an abbreviation of OpenMAX (Open Media Acceleration), which is a cross-platform C-language interface that provides abstractions for audio, video, and image processing. OpenMAX provides three different interface layers: the application layer (AL), the integration layer (IL) and the development layer (DL). A number of different GStreamer plugins exist for OMX, which all use OpenMAX IL. The most actively maintained alternative was chosen, which can be found at [73].

Implementation of Prototype

The research goals of the thesis were achieved by reviewing existing research, and creating and evaluating a prototype. The information gathered from the literature review was complemented with original ideas and used in the construction of the prototype. This chapter describes the process of implementing the prototype, the problems encountered, and how they were solved.

4.1 Baseline Implementation

The work began by implementing a basic version of the prototype, with minimal functionality.

A web server was built using NodeJS [74]. It served a static web page and also acted as signaling server for the WebRTC call. The web page had controls to initiate a simple two-way audio and video WebRTC call. Basic session and user handling was added where any participant could freely choose to join any session using any username.

A basic Android application for Google Glass was created, which provided a minimal implementation of WebRTC using Mario. The application could join a session on the server and set up a WebRTC call.

The early prototype revealed several issues:

• The encoded video on Glass had to have a very low resolution, or no video would be received.

• The Glass application became sluggish over time and would eventually freeze completely.

• Glass became very warm and displayed a warning message.

These problems all hinted that the CPU in Google Glass was not powerful enough to allow software encoding of video. Different alternatives for hardware encoded video on Google Glass were therefore researched.

4.2 Hardware Accelerated Video Encoding

Mario was built using GStreamer version 1.0, so any plugin that provided hardware encoding would also have to be built on top of GStreamer 1.0. This was a significant issue, as most of the libraries discovered were aimed at GStreamer 0.10. Three plugins were however found that used GStreamer 1.0: gst-omx, gst-ducati, and androidmedia, which were described in the previous chapter.

Each of the plugins was built and tested, but none of them worked without modifications.

It was possible to build androidmedia and load it into Mario. But the plugin used the Android MediaCodec API, which was only available in Android 4.1. Since Glass at the time ran Android 4.0.4, the plugin could not be used.

gst-omx could be built and loaded, but the plugin did not work and printed a long list of error messages.

It was possible to build gst-ducati and all of its dependencies, but trying to link [75] it with Mario caused a large number of symbol conflicts.

Out of the three plugins tested it was concluded that gst-omx was the one that most likely could be adapted and integrated into Mario and Google Glass. androidmedia was limited by the Android version, and gst-omx was still actively developed and seemed to have a larger community than gst-ducati.

4.2.1 gst-omx on Google Glass

The TI E2E community [76] was a valuable source of information. A post by a TI employee explained how to debug OpenMAX on devices with an OMAP4 CPU: a log on the device, located at /d/remoteproc/remoteproc0/trace1, logged errors from the remote processor, including errors generated by OpenMAX. This was a vital tool as it reported errors that were not shown in the GStreamer log.

A number of issues were found and resolved, enabling the use of hardware encoded video through gst-omx. Using hardware accelerated encoding, VGA video could be streamed at 25 frames per second, which was adequate for the prototype.

4.3 Ideas to Implement

Ideas for how to continue the development of the prototype were explored.

An aspect which was apparent from the beginning was that a remote view of the expert was not very useful to the technician. A solution similar to XMReality and VIPAAR was considered, where an augmented view of the expert is shown to the technician. This idea was abandoned, as it would require a more sophisticated setup on the expert side, and it would be difficult to provide something new to the solution.

The idea that was chosen instead was that the expert would use simple annotation tools to annotate the view of the technician.

4.3.1 Annotating the Technician’s View

The idea was to allow the expert to draw on top of a still image or a video feed, and that the annotations would be sent to the technician. It was undecided if this should be done on a video feed or still images.

In the collaborative communication study (see 2.3), it was concluded that annotating live video and still images are roughly as efficient.

An important aspect of the test conducted in the research was that the person receiving the instructions had the same viewpoint all the time. When annotating live video it allowed the person to align the annotations with the view, solving the problem of keeping the annotations stable. In the field technician use case, it would be beneficial if the technician is able to do work that requires moving around, while receiving instructions. This meant that if live video annotations were to be used, some form of alignment of the annotations and the video would have to be done.

4.3.2 Aligning Annotations and Video

If live video was to be annotated, there would have to be a way to track annotations so that e.g. a circle around an object would stay around the object even as the technician moves around.

An approach considered was to track the technician's position. It does not solve the problem of finding the annotations' position in relation to the real world objects, but it simplifies the problem.

The concept was tested with a simple prototype application that aimed at showing a 3D cube in the real world using AR. A combination of the gyroscope, accelerometer and compass was used to track the rotation of the device, as described earlier (see 2.2.4). This worked well: the cube stayed in the same position when rotating one's head, without any noticeable delay.

The challenge proved to be to track the position of the device. As was also concluded earlier (see 2.2.4), an accelerometer cannot be used to accurately track the position of an object. This meant that image processing would have to be used to track the location of the technician, as well as to find the annotations' location relative to the real world objects.

A number of different approaches to this issue were considered:

1. Do all video processing on the expert side and send annotated video to the technician.

2. Recognize points in the video that can be used as anchors, and send annotations relative to these points.

3. Same approach as 1, but transfer the annotations from the incoming video to a live feed on Glass.

When the video processing is said to be done on the expert or technician side, it is not required to be done on the actual device. It could be offloaded to another device or server, as long as it can be done with very low latency.

Alternative 1 would cause a big delay between the technician's real view and his or her view in Glass. It would also require twice the bandwidth, as video is sent in both directions, and use more processing power on Glass to decode the incoming video.

Alternative 2 was to synchronize the annotations using image recognition. This could be done by recognizing points in the video on both the expert and technician side, and then sending annotation coordinates that are relative to the recognized points. This is perhaps the most realistic and optimal solution, if it is possible. It would most likely require the image processing on Glass to be offloaded to a nearby device.

Alternative 3 solves the latency issue present in alternative 1, but it increases the complexity significantly. It would likely require the computation to be offloaded to another device because of the limited processing power of Google Glass.

In the end none of the alternatives was implemented. Alternatives 1 and 3 require the ability to modify an incoming video stream and to transmit that stream to another peer, which is not supported by WebRTC. Alternative 2 might be possible to implement, but it was considered too complex to implement within the time frame. It was also not thought of from the beginning and is left as a possible path for future improvement.

It was decided that still image annotation would be used instead. According to the research found, it should also be at least as efficient as annotated video. The earlier mentioned experiment had also shown that humans are very good at correlating objects in an image with the real world, even if the image is from a different perspective, which would be the case when using still images.

4.4 Still Image Annotation

The implementation of still image annotation presented a few questions and challenges: how does the expert select the image, what types of annotations should be possible, and how are the annotations transmitted to the technician?

The problem of selecting an image was solved by allowing the expert to select a video frame from a few seconds in the past. The live video was shown in a section of the web application, with each shown video frame being pushed onto a queue. Whenever the user pressed a button on the page, the queued video was displayed in a separate section. The user could then seek through the queued video and select a suitable frame. The solution made it possible to always display the live video feed, while also being able to go back in time to select an optimal video frame. How the images were sent to the technician is described later in this chapter.
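
A sketch of how such a frame queue can be kept on the expert side is shown below. It is only an illustration: the element id, capture rate, and queue length are made up, and the prototype's actual implementation may differ.

    // Hypothetical sketch of buffering the last few seconds of the remote video.
    var remoteVideo = document.querySelector('video#remote');   // assumed element
    var frameQueue = [];
    var FPS = 10;                  // capture rate of the queue, not of the call
    var MAX_FRAMES = 5 * FPS;      // keep roughly the last five seconds

    setInterval(function () {
      var canvas = document.createElement('canvas');
      canvas.width = remoteVideo.videoWidth;
      canvas.height = remoteVideo.videoHeight;
      canvas.getContext('2d').drawImage(remoteVideo, 0, 0);

      frameQueue.push(canvas);
      if (frameQueue.length > MAX_FRAMES) frameQueue.shift();   // drop the oldest frame
    }, 1000 / FPS);

    // A slider (0 .. MAX_FRAMES - 1) can then be used to seek through the queue
    // and pick one canvas as the still image to annotate.
    function selectFrame(sliderIndex) {
      return frameQueue[Math.min(sliderIndex, frameQueue.length - 1)];
    }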

The first annotation method to be implemented was free drawing. The drawing was done using a 2D canvas in both the Glass application and the web application. The Android Canvas API [77] is very similar to the HTML5 Canvas API [78], which made it very simple to implement a drawing system with a consistent result for the same data. Both canvases also supported drawing text, which was added at a later stage.

4.4.1 WebRTC Data Channel

WebRTC data channels were chosen as transport for the annotation data. This assured the lowest possible latency. It also caused the annotation data to be transmitted by the same means as the audio data, ensuring that the audio and annotations were synchronized.

Mario had not implemented WebRTC data channels at the time; however, this would be added in the future. A simple solution that allowed data to be sent directly over UDP was implemented. The transport of the annotation data was designed with the assumption that SCTP would later be used.

A protocol was created that assumed that all data was eventually delivered, but could handle out-of-order delivery. This reduces the latency, and more importantly it removes the spikes caused when a message is lost and has to be retransmitted. This was also a necessity as the messages temporarily had to be transported over UDP, although no retransmission of messages was implemented.

4.4.2 Out of Order Messages

A method that allowed the annotation data to be received out of order was devised. Each time the expert started drawing, a message was sent marking the beginning of a new line and any properties of the line, such as color. When the expert had moved the cursor far enough (10 px), a message was sent with the coordinates of the point the cursor had been moved to. This limited the amount of data sent by dividing the drawing into segments instead of pixels.

In order to support out-of-order delivery, each line and point was given an index, with the point indices starting at 0 for each line being drawn. Every start of line message included the index of the line, while every point message contained the line index and its own index. A point message buffer was also added in the Glass application that buffered any point message that was received before the associated line message.
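
The message layout and the receiving logic could look roughly like the following sketch. The field names are invented for illustration and do not necessarily match the prototype's wire format.

    // Hypothetical message shapes for the drawing protocol.
    var lineMessage  = { type: 'line',  lineIndex: 4, color: '#ff0000' };
    var pointMessage = { type: 'point', lineIndex: 4, pointIndex: 0, x: 312, y: 140 };

    // Receiver side: buffer points that arrive before their line message.
    var lines = {};           // lineIndex -> { color, points }
    var pendingPoints = {};   // lineIndex -> points that arrived too early

    function onAnnotationMessage(msg) {
      if (msg.type === 'line') {
        lines[msg.lineIndex] = { color: msg.color, points: [] };
        (pendingPoints[msg.lineIndex] || []).forEach(addPoint);  // flush buffered points
        delete pendingPoints[msg.lineIndex];
      } else if (msg.type === 'point') {
        addPoint(msg);
      }
    }

    function addPoint(msg) {
      var line = lines[msg.lineIndex];
      if (!line) {
        // The line message has not arrived yet: keep the point for later.
        (pendingPoints[msg.lineIndex] = pendingPoints[msg.lineIndex] || []).push(msg);
        return;
      }
      line.points[msg.pointIndex] = { x: msg.x, y: msg.y };
    }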

As shown in the research on collaborative communication, it was important that the expert was able to remove old annotations to avoid confusing the technician by cluttering the scene. This was achieved by sending a clear message. The clear message supported out-of-order delivery by using the same incrementing index as was used by the line messages.

Figure 4.1: Out of Order Delivery of Messages

A method for ensuring that the correct annotations were cleared had to be implemented in the Glass application. This was achieved by keeping track of the minimum allowed index, as illustrated in figure 4.1. Whenever a clear message is received, the minimum index is set to the index of the clear message. At the same time, any annotation with an index lower than the new minimum is removed. Any message that is received with an index below the minimum is ignored.

This solution was later expanded to allow clearing of individual colors by keeping track of a minimum index for each color.
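
The per-color variant of the clear handling might look like the sketch below; the structure is illustrative rather than the prototype's exact code.

    // Hypothetical sketch of clear handling with one minimum index per color.
    var minIndexPerColor = {};   // color -> lowest line index still allowed
    var annotations = [];        // { lineIndex, color, points }

    function onClearMessage(msg) {
      // msg: { type: 'clear', color: '#ff0000', index: 7 }
      minIndexPerColor[msg.color] = msg.index;
      // Remove every annotation of that color with a lower index.
      annotations = annotations.filter(function (a) {
        return a.color !== msg.color || a.lineIndex >= msg.index;
      });
    }

    function onLineMessage(msg) {
      // Ignore line messages that arrive after a clear that already covers them.
      var minIndex = minIndexPerColor[msg.color] || 0;
      if (msg.lineIndex < minIndex) return;
      annotations.push({ lineIndex: msg.lineIndex, color: msg.color, points: [] });
    }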

4.5 Image Processing

An idea that was brought up during the development of the prototype was the ability to highlight objects in the selected image by clicking on them. This was chosen as the feature that would be implemented after annotations by free drawing.

Multiple different methods for selecting objects were considered:

1. The user draws a rough outline around the object, and an algorithm then refines the edge.

2. Recognize objects in the image and allow the user to select them.

3. The image is split into different regions. When a user clicks inside a region it is selected.

Out of the options listed, number 1 would provide the most accurate outline, but the goal of the object selection method was to provide a quick way for the expert to select an object. Alternative 1 defeats the purpose of the object highlighting, as the rough drawing would signal the object of interest just as well as a precise outline.

Alternatives 2 and 3 were both chosen, and methods for implementing them were researched.

The image processing had to be done either directly in the web application or offloaded to a server. A web based solution was preferred, as this would minimize the delay and server load.

Different libraries were found that could provide face detection or hand detection, but no library was found that provided either detection of objects that stood out in an image or some method of segmenting an image by color. An attempt was instead made to implement alternative 2 or 3 using WebGL.

4.5.1 Image Processing using WebGL

The image processing was implemented using fragment shaders. While they only output one color, fragment shaders can read multiple points from many different textures that are loaded onto the GPU.
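
As an illustration of this kind of processing, a simple edge-detecting fragment shader can be written as a GLSL string in JavaScript. The uniform names and the threshold below are invented for the example and are not taken from the prototype.

    // Hypothetical GLSL ES 2.0 fragment shader: mark a pixel as an edge when it
    // differs too much from its right and lower neighbours.
    var edgeFragmentShader =
      'precision mediump float;' +
      'uniform sampler2D u_image;' +      // input texture
      'uniform vec2 u_texelSize;' +       // 1.0 / texture dimensions
      'varying vec2 v_texCoord;' +
      'void main() {' +
      '  vec3 center = texture2D(u_image, v_texCoord).rgb;' +
      '  vec3 right  = texture2D(u_image, v_texCoord + vec2(u_texelSize.x, 0.0)).rgb;' +
      '  vec3 below  = texture2D(u_image, v_texCoord + vec2(0.0, u_texelSize.y)).rgb;' +
      '  float diff = length(center - right) + length(center - below);' +
      // Black on edges, white inside regions (as in figure 4.3).
      '  gl_FragColor = diff > 0.2 ? vec4(0.0, 0.0, 0.0, 1.0) : vec4(1.0, 1.0, 1.0, 1.0);' +
      '}';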

A series of fragment shaders can be composited with the help of framebuffers. The image is first loaded into a texture. The texture is then rendered to a framebuffer with an attached texture using the first shader. The new texture is then rendered to another framebuffer using the second shader, and so on. The result obtained in the final step can then be rendered to the screen or read into main memory.

Figure 4.2: Ping-Pong Framebuffers

Because a framebuffer cannot render to itself, at least two framebuffers had to be used. The render target alternated between the two framebuffers, using the unbound framebuffer's attached texture as input to the fragment shader. An illustration of this is shown in figure 4.2. The figure also shows how new framebuffers can be created to save the intermediate steps in the process. This method was used as a debugging tool to be able to look through the different steps afterwards.
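
A sketch of such a ping-pong loop is shown below. Here gl is an existing WebGL context, and createFramebufferWithTexture and drawWithShader are assumed helper functions, so the snippet only illustrates the alternation between the two render targets.

    // Hypothetical ping-pong rendering between two framebuffer/texture pairs.
    function runPasses(gl, shaders, sourceTexture, width, height) {
      // Two targets are enough, because a framebuffer cannot render to itself.
      var targets = [
        createFramebufferWithTexture(gl, width, height),
        createFramebufferWithTexture(gl, width, height)
      ];

      var inputTexture = sourceTexture;
      shaders.forEach(function (shader, i) {
        var target = targets[i % 2];                  // alternate between the two
        gl.bindFramebuffer(gl.FRAMEBUFFER, target.framebuffer);
        gl.viewport(0, 0, width, height);
        drawWithShader(gl, shader, inputTexture);     // one full-screen quad pass
        inputTexture = target.texture;                // this output feeds the next pass
      });

      gl.bindFramebuffer(gl.FRAMEBUFFER, null);       // back to the default framebuffer
      return inputTexture;   // can be drawn to the canvas or read with gl.readPixels
    }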

4.5.2 Image Segmentation using Hough Transform

A challenge faced when implementing the image segmentation was how the selection of a region would be shared with the technician. Even if the image could be divided into suitable regions, the preferred way of sending the annotations to the technician was through the text based WebRTC data channel.

For this reason, the first method that was tried was a form of object recognition using a Hough Line transform (see 2.4.3). The intention was that a Hough transform would find line segments in the image that could then be stitched together, creating a graph. With some basic collision detection the coordinates of a mouse click could be used to find the smallest subgraph that encompassed the click. The coordinates of the line segments could then be sent to the technician, and it would be trivial to render the outline.

The image was preprocessed by applying a Gaussian blur and then the simple edge detection algorithm described in the literature review (see 2.4.1). The Hough Line transform was then done, revealing any lines in the image. The lines were then supposed to be split into the actual line segments in the image, and the segments would then be combined into a graph.

This method proved to have some big disadvantages. The biggest one was that a Hough transform can only reveal objects with a predetermined shape; in this case it would only be able to detect objects with straight and sharp edges. This was known from the beginning, but it became apparent how limiting it was once the algorithm was run on some sample images.

It also proved to be difficult to pick out the lines in the image by finding relevant local maximum points. Many of the sample images used had lines that were similar to each other but did not originate from the same object, e.g. two tables standing next to each other at slightly different angles.

4.5.3 Image Segmentation using Median Filters

While experimenting with the Hough transform, a median filter was tested to see if it would be a more suitable preprocessing step than the Gaussian blur filter. It turned out that a decent segmentation of the image could be achieved by applying a median filter several times and then running an edge detection shader.

This method was further improved upon by creating a second type of median filter that sampled the surrounding pixels in a circular pattern instead of a square. The two different median filters were applied after each other a number of times, and by trial and error a pattern that yielded an acceptable result was found.

The method provided a decent segmentation of the image. It was not able to properly identify objects when an edge was blurred, e.g. if out of focus.

Metropolis-Hastings Algorithm

An attempt was made to implement the Metropolis-Hastings algorithm [79]. Good results were achieved with a slow cooling process and an iteration count of several thousands, but this resulted in a computation time far above what was acceptable for the application. The shortcut method with a fast cooling process that results in domain fragmentation was never achieved, most likely due to bad parameters or errors in the implementation.

Figure 4.3: Result of Median Filters and Edge Detection ((a) before, (b) after)

4.5.4 Region Labeling

An example of a resulting image after the median filters and the edge detection have been applied can be seen in figure 4.3. The regions found in the image are white, while the borders between the regions are black.

A way to select a region and to share this selection with the technician had to be implemented. This was achieved with the help of the labeling algorithm described earlier (see 2.4.4).

Initially, a more naive algorithm was used, which took roughly 1.3 s to label an image with VGA resolution. When using the new algorithm the time was reduced to below 1 s. This was still too slow to be an acceptable solution, especially since the web browser is frozen while the algorithm is running.

The solution was to use a web worker [80], which can be thought of as a thread in JavaScript. Web workers are more closely related to processes, however, as they run in a separate context and can only share immutable data with other contexts.

When moving the computation to a web worker, the goal was to allow the main thread in the browser to keep running while the computation was being executed. It turned out to have a side effect: the execution time went down to approximately 30 ms. It was guessed that isolating the code in an otherwise empty context allowed better optimizations to be made, but this was never studied further.
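
A sketch of how the labeling could be moved to a web worker is shown below. The file name labelWorker.js and the message fields are invented for the example, and the labeling itself is left out.

    // main.js -- hypothetical sketch of offloading region labeling to a web worker.
    var worker = new Worker('labelWorker.js');   // assumed worker file

    function labelRegions(imageData, callback) {
      worker.onmessage = function (event) { callback(event.data.labels); };
      // The pixel buffer is copied (structured clone) into the worker's context.
      worker.postMessage({
        width: imageData.width,
        height: imageData.height,
        pixels: imageData.data
      });
    }

    // labelWorker.js -- runs in its own context, so the page stays responsive.
    self.onmessage = function (event) {
      var width = event.data.width;
      var height = event.data.height;
      var pixels = event.data.pixels;
      var labels = new Int32Array(width * height);
      // ... run the labeling algorithm over `pixels`, writing region ids to `labels` ...
      self.postMessage({ labels: labels });
    };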

Once the regions had been labeled, an outline around a single region could be rendered using the region label and an edge detection shader. The same method was also used to send the label to the technician.

4.6 Glass Application

The development of the Glass application was very straightforward as it did not have any graphical user interface (GUI) to speak of. In most cases the implementation simply mimicked the web application.

4.6.1 Configuration

The address to the server was initially hardcoded in the application. This meant that every time a different network was used, the address in the code had to be changed. This quickly became a tedious task, so a method of configuring the address was added.

QR codes are used to configure the Wi-Fi network on Glass, and this idea was copied when adding the address configuration option. A library to generate QR codes was added to the web application. When clicking a button, a QR code was shown that contained the current address of the server. Since Android does not provide a QR code reader, a library called ZXing [81] was used to provide that functionality.

The QR code scanner could be accessed through a context menu in the Glass application. When scanning the QR code provided by the web application, the address received was stored in the shared preferences for the application. The next time the application tried to connect to the server it would use the updated address.

4.6.2 OpenGL ES 2.0

The OpenGL version used in the Glass application was OpenGL ES 2.0, which WebGL is based upon. This meant that all the methods used for the rendering could be reused, and the shaders could be reused without any modification. The Glass application did not mirror the image processing functionality; only the methods used to render an outline around an image region were implemented.

4.6.3 Photographs by Technician

When testing the system it was realized that it would be useful for the technician to be able to take photographs himself and send these to the expert. This would allow the expert to instruct the technician to e.g. take a close up picture of an object, using annotations or voice guidance. With the solution at the time, the technician would have to simply look at the object for some time so that the expert could then select a suitable frame.

This proved to be a technical challenge, as the camera has to be acquired before a photograph can be taken, and it is already in use by Mario. A possible solution would be to end the call, take the picture, and then resume the call. This seemed like a solution that would make the feature pointless, as it would take much longer to take the picture than to simply look at the object for some time.

An alternative solution was found that allowed Mario to keep running while a photograph is taken. GStreamer provides methods for intercepting buffers that are being sent through the pipeline. This was used to add a function to Mario which copied the next video frame sent from the camera. This meant that the picture had the same resolution as the video, but it had better quality as it was extracted before the video was encoded. This solution also allowed photographs to be taken with a very small delay, as the camera is already recording video.

4.7 Signaling Server

This section describes the implementation of the signaling server. The server had several tasks that it had to be able to handle:

• Keeping track of sessions and users.

• Providing a signaling channel between the users for WebRTC.

• Allowing images and photographs to be uploaded to a session.

• Sending notifications to users on events such as image uploads or user events.

4.7.1 Sessions and Users

The session and user handling was very simple. No authentication was implemented, as this was not a needed feature in the prototype. Session IDs and usernames are chosen by the users when connecting to the server. Users receive events when another user joins or leaves their session, and are able to send messages to any user in the same session.

4.7.2 Server-Sent Events

The signaling channel was chosen to be implemented using either web sockets [82] or server-sent events [83]. Server-sent events are implemented in the browser using the EventSource API, while web sockets use the WebSocket API.

The biggest difference between server-sent events and web sockets is that web sockets are bidirectional, while server-sent events only allow data to be sent from the server to the client. Data can be sent to the server when using server-sent events, but this has to be done separately using regular HTTP requests. This causes a large overhead compared to web sockets if small chunks of data are sent.

The server implementation of server-sent events would also be simpler than the web socket implementation, since server-sent events simply keep an HTTP request open.

Before the XE16 update to Google Glass, which brought the new Android 4.4 WebView, neither of these technologies was supported by the WebView in the GDK. This could be solved using a polyfill, which is a drop-in script that implements some missing functionality in the browser [84].

Using a polyfill caused the browser to fall back to long polling. Long polling is a form of polling where the browser requests the resource at a much slower rate. If the server does not have any response available when the request is received, it is left open until there is something to send. In the end this issue disappeared when the XE16 update was released.

Server-sent events were chosen because the data sent to the server was sent in large chunks and infrequently; only the notifications from the server to the clients were small and frequent. On the server they were implemented by keeping one request open for each connected user. When the web application had finished loading, a request was sent to the path /join along with the required parameters. Upon receiving the request, the response was simply left open and the content type was set to text/event-stream. This allowed events to be sent using the open response.
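
A minimal Node.js sketch of this pattern is shown below. It uses only the built-in http module, reduces the session handling to a single map, and the port and query parameter name are made up, so it only illustrates keeping the response open and pushing events.

    // Hypothetical minimal server-sent events endpoint.
    var http = require('http');
    var url = require('url');

    var clients = {};   // username -> open response object

    http.createServer(function (req, res) {
      var parsed = url.parse(req.url, true);

      if (parsed.pathname === '/join') {
        // Keep the response open and mark it as an event stream.
        res.writeHead(200, {
          'Content-Type': 'text/event-stream',
          'Cache-Control': 'no-cache',
          'Connection': 'keep-alive'
        });
        clients[parsed.query.user] = res;
        req.on('close', function () { delete clients[parsed.query.user]; });
      } else {
        res.writeHead(404);
        res.end();
      }
    }).listen(8080);

    // Push a named event to one connected user over the open response.
    function sendEvent(user, eventName, data) {
      var res = clients[user];
      if (!res) return;
      res.write('event: ' + eventName + '\n');
      res.write('data: ' + JSON.stringify(data) + '\n\n');
    }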

Server-sent events were used to implement both the server notifications, such as users disconnecting and images being uploaded, and the signaling channel that was used to initiate the WebRTC call.

4.7.3 Image Upload

The image upload was implemented by saving images that were sent to a specific path using a PUT request. The images were saved per session and were stored in memory only.

The images uploaded by the technician were assigned a unique id, and all uploaded photographs were saved. Of the images uploaded by the expert, only the most recent one was saved.

Each time an image was uploaded, an event was sent to all users in the session. Different events were emitted depending on whether the expert had uploaded an image or the technician had uploaded a photograph. This allowed the clients to start downloading images as soon as they had been uploaded.

An improvement that was considered was to send the image events as soon as the image upload had started. The image would then be streamed to any involved client while it was still being uploaded, resulting in a reduced latency. This was never implemented due to time constraints.

Result

This chapter provides a summary of the final implementation of the prototype from a user's point of view.

5.1 Web Application

Figure 5.1 shows a view of the web application when it has just loaded. The configure button (1) displays a QR code which can be scanned by the technician to set the server address to the address that the expert is currently connected to. Once the technician has connected to the same session, an “Initiate Call” button will be displayed in area A.

Area A is where the remote view of the technician is displayed. When button 2 is clicked, a scan through the last few seconds of video is initiated. While scanning through the video with the slider in area 3, the frames are displayed in area B.

When the technician takes a photograph it will appear in area D. The expert is able to preview the pictures by hovering over them.

At any point during a call, images can be dragged from areas A, B, and D to area C to select an image for annotation. The selected image is immediately displayed to the technician through the display on Google Glass. Images can also be dragged from other sources such as other websites or the file system.

When the expert has selected an image it can be annotated using any of the methods seen in area 5: text, highlighting, or free drawing. The color of the annotations is selected with the color picker (4).

In free drawing mode the expert annotates the image by drawing with the mouse on top of the image.

Highlight mode allows the expert to select regions in the image that have been found using an image segmentation algorithm. When the expert hovers over the regions of the image, a preview of the highlighting will be shown. If the expert clicks the region it will be selected and the highlighting will be sent to the technician.

When in text mode the expert can click anywhere within the image to display a box where text can be inserted. If the expert hits enter the text is converted to an annotation and displayed to the technician, while escape aborts the text input.

The erase buttons (6) can be used to remove the annotations. The “Erase Color” button only removes annotations of the currently selected color, while “Erase All” removes all annotations. All annotations are automatically removed if a new image is selected.

Buttons 7 and 8 are debug tools and are used to display sample images and enable debug keybindings. The debug keys can be used to go back and view the different steps of the image segmentation process.

Figure 5.1: Web Application

5.2 Google Glass Application

The user opens the application using a voice command or by navigating to it in the command list.

When inside the application, the user is always able to quickly tap the touchpad to bring up the context menu. Through the menu the user is able to stop the application, which alternatively can be done by swiping down on the touchpad. The menu also provides access to a QR code scanner which is used to configure the server address by scanning the code generated by the web application.

Once the application has started it automatically tries to connect to the server using the configured address. The user gets feedback through a large message in the middle of the screen which shows the connection status. If the application successfully connects to the server, a message saying “Connected” will be shown.

If the connection is not successful, the user can tap the touchpad with two fingers to attempt to reconnect.

Once the application is successfully connected to the server, the technician has to wait for the expert to initiate the call. When the call is started, the text disappears and the technician will be able to speak to the expert. As soon as the call has started, the technician is able to long tap the touchpad to send a photograph to the expert.

When the expert has selected an image to annotate, the image will be displayed in full screen to the technician. As soon as the expert adds or removes any annotations they are instantly sent to the technician and displayed on top of the image.

Discussion

This chapter will analyze the work and result of the thesis. In addition to implementing a prototype of a collaborative communication system, a research goal was to evaluate the prototype from a viewpoint of further improvements, reusability, and how well suited Google Glass is for the task.

6.1 Analysis of Method and the Result

This section discusses some of the choices made when implementing the prototype and parts that could have been done differently.

6.1.1 Live Video Annotations

The choice of sending still images to the technician instead of video was sensible given the time frame, but it would have been interesting to investigate the use of live video annotations. If the time spent implementing the image segmentation had instead been spent on investigating live video annotation, it is possible that the discoveries made would have been more closely related to the goal of the thesis.

In either case it would have been a good idea to do a simple implementation of live video annotations, without any compensation for delay or movement, if only to get a better understanding of the issues and benefits of live video annotations.

6.1.2 Early Prototype

Implementing an early prototype with minimal functionality provided valuable information about the performance issues of Google Glass. It allowed planning to be changed at an early stage to ensure that a solution was found. If the implementation of the prototype had not started until later, it is likely that the unexpected issues would have had a bigger impact on the result.

6.1.3 WebGL

An issue with WebGL on displays with a high pixel density was discovered after the prototype had been completed: it took a very long time to read the image data needed by the region labeling algorithm from the GPU, up to several seconds. Given the current state of WebGL, which is mostly stable but has performance issues in some edge cases, it would have been favorable to run the image segmentation algorithm on a separate server.

Even so, if going back with the knowledge of the issues with WebGL, it would still be compelling to use. Implementing an image processing algorithm in WebGL was an interesting example of the possibilities of new web technologies.

6.1.4 OpenMAX

Once the Android MediaCodec API was available on Google Glass, the androidmedia plugin was tested. While simply plugging an androidmedia video encoder into the pipeline did not work, it is possible that using androidmedia would have been a lot easier than gst-omx, while also providing a more portable solution. This was not something that could have been done differently, however, as the MediaCodec API was made available toward the end of the thesis work.

6.1.5 Audio

The visual component of the communication system received far more attention than the audio, which is perhaps justified. Some time should nevertheless have been spent on doing a small-scale comparison of different audio codecs and settings and finding a combination that works particularly well with the bone conduction transducer used by Google Glass.

6.2 Further Improvements

This section lists possible improvements and some ideas that were never implemented.

6.2.1 Image Segmentation

The most direct improvement to the prototype would be to replace the image segmentation algorithm. The implemented algorithm is good enough to be able to demonstrate the concept, but it has a lot of flaws. It was very vulnerable to bad focus and soft edges on objects, and would often fail to outline an object that was clearly distinguishable from the background. A method that was discovered towards the end of the work was mean shift; it would probably be a big improvement over the current solution. Alternatively, the implementation of the Metropolis-Hastings algorithm could be fixed and/or tuned.

6.2.2 Video Annotations

Something that should be investigated is how the still image solution compares to a live video solution. This would mean implementing live video annotations, which could be considered an entirely new system rather than an improvement to the existing prototype.

6.2.3 UX Evaluation

A thorough evaluation of the user experience would also have to be done. This evaluation should be focused on the technician's work, how easy it is to adapt to the use of Google Glass, and how easy it is to follow the instructions provided.

6.2.4 Gesture and Voice Input

An idea that was considered was adding some form of gesture or voice input for the technician. This would allow the technician to provide input in situations where it might not be possible otherwise.

6.2.5 More Annotation Options

The usability of the annotation interface could be improved by adding features such as drawing shapes, undo and redo, and working with multiple images simultaneously.

6.2.6 Logging

A system that records the technician's actions could also be implemented. The video and audio would be recorded, along with other data such as facing, position, temperature, and bandwidth usage. This would require a more sophisticated setup where the video is transmitted via an intermediate server where it can be recorded.

6.3 Reusability

The system can easily be adapted for any other device that is similar to Google Glass, such as the Recon Jet or Vuzix M100. Because these devices also run Android, the application could likely be ported with minimal effort, only replacing the GDK-specific code with a solution written with their SDK.

It would however be difficult to adapt the prototype to any direct AR device, such as the Meta Pro. It does not make sense to use still images to give instructions to someone wearing a direct AR device. The images could perhaps be displayed on a simulated floating screen just like on the other devices, but this would just cause a loss of detail compared to the other options. The Meta Pro also provides depth sensors and other tools that would make live video annotation a lot simpler to implement.

6.4 Capabilities of Google Glass

It was obvious from the beginning that a big issue would be the processing power available on Google Glass. What was also noticed early was the problem of overheating. When using the built-in video recording software it took only a few minutes for the device to become quite warm. When developing the prototype, Glass had to be removed from time to time because it became too hot to be comfortable to wear.

Once the hardware could be accessed to accelerate the video encoding, it was proven that it was possible to use Glass for the use case. The device did not overheat as quickly, although it still happened after 5 – 10 minutes of use. In its current state it is not suitable for long calls to the back office support, but it can definitely be used for short calls that do not last longer than about 5 minutes.

The battery capacity was most often not an issue, although this was most likely because the device overheated before the battery could be depleted. Battery life is an issue that could quite easily be resolved in the use case. The technician could e.g. charge the device while driving to the location. There are also several products that aim to extend the battery life of Google Glass by attaching an extra battery.

An issue that was quickly recognized was the audio quality. Glass uses a bone conduction transducer, which results in quite bad audio playback. One of the design goals of the device is that it should not shut out the user from his or her surroundings. This feature comes at a cost, as the audio quality does suffer.

6.5 Effects on Human Health and the Environment

This section discusses possible effects on the user and the environment.

6.5.1 Environmental Impacts

Any communication system that is able to replace the need to be at a location in person has the potential to reduce both environmental and economic costs.

When the problem at a site turns out to be too complicated for the technician to solve alone, he or she can call support instead of having to give up and send someone else to the site. If the issue can be solved remotely using the instructions from an expert, unnecessary pollution and expenses arising from traveling to the site can be avoided.

Problems will be solved faster and more reliably the better adapted the communication system is. Failures are avoided by providing the expert with the tools necessary to give comprehensive instructions to the technician, which is what the proposed system aims to do.

6.5.2 Health Concerns

The effect that the prototype has on the health of the technician is minimal. The indirect AR used in the prototype does not cause simulation sickness, which is a risk when using a video see-through direct AR system. The effects are likely limited to eye-strain caused by looking at the screen, and some possible discomfort if Google Glass becomes overheated.

Conclusion

In conclusion, the prototype that was created shows that it is possible to create a communication system involving Google Glass that is well adapted to the use case. With some work the system is likely to be an improvement over any smartphone- or tablet-based solution.

Glass is also well suited for the entire use case, as it can be used for navigation when driving to the remote site, as well as for notifying the technician of any events.

There are however some issues with the hardware on Google Glass, the most apparent being the problem of overheating. But one should not forget that the version of Google Glass used is the Explorer Edition, and much can change before the device is officially released to the market. It is also the first generation of a new type of device, and it is likely that future versions will fill in the gaps.

Bibliography

[1] Mann S. In: Soegaard M, Dam RF, editors. Wearable Computing. Aarhus, Denmark: The Interaction Design Foundation; 2013.

[2] Thorp EO. The Invention of the First Wearable Computer. In: ISWC '98 Proceedings of the 2nd IEEE International Symposium on Wearable Computers. Pittsburgh, PA, USA: IEEE Computer Society Press; 1998. p. 4–8.

[3] Peng J, Seymour S. Envisioning the Cyborg in the 21st Century and Beyond; p. 18.

[4] The history of wearable computing [homepage on the Internet]. Dr. Holger Kenn; [cited 2014 May 24]. Available from: http://www.cubeos.org/lectures/W/ln_2.pdf.

[5] The History of Wearable Technology [homepage on the Internet]. Chicago: The Association; c1995-2002 [updated 2001 Aug 23; cited 2002 Aug 12]. Available from: http://217.199.187.63/click2history.com/?p=21.

[6] Glacier Computer – Press Release [homepage on the Internet]. Glacier Computer; [updated 2014 May 24; cited 2014 May 25]. Available from: https://github.com/zxing/zxing.

[7] Pebble: E-Paper Watch for iPhone and Android [homepage on the Internet]. Kickstarter, Inc.; [cited 2014 May 25]. Available from: https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android.

[8] Oculus Rift [homepage on the Internet]. Oculus VR, Inc.: The Khronos Group; [cited 2014 May 25]. Available from: http://www.oculusvr.com/rift/.

[9] SmartWatch [homepage on the Internet]. Sony Mobile Communications; [cited 2014 May 25]. Available from: http://www.sonymobile.com/global-en/products/accessories/smartwatch/.

[10] U.S. News Center [homepage on the Internet]. Samsung; [updated 2013 Sep 05; cited 2014 May 25]. Available from: http://www.samsung.com/us/news/newsRead.do?news_seq=21647.

[11] Apple iWatch release date, news and rumors [homepage on the Internet]. TechRadar; [updated 2014 April 09; cited 2014 May 25]. Available from: http://www.techradar.com/news/portable-devices/apple-iwatch-release-date-news-and-rumours-1131043.

[12] Apple iWatch: Price, rumours, release date and leaks [homepage on the Internet]. Beauford Court, 30 Monmouth Street, Bath BA1 2BW: Future Publishing Limited; [updated 2014 Apr 09; cited 2014 May 25]. Available from: http://www.t3.com/news/apple-iwatch-rumours-features-release-date.

[13] Meta – SpaceGlasses [homepage on the Internet]. META; [cited 2014 May 25]. Available from: https://www.spaceglasses.com/.

[14] M100 – Smart Glasses | Vuzix [homepage on the Internet]. Vuzix; [cited 2014 May 25]. Available from: http://www.vuzix.com/consumer/products_m100/.

[15] Recon Jet [homepage on the Internet]. Recon Instruments; [cited 2014 May 25]. Available from: http://reconinstruments.com/products/jet/.

[16] Azuma R, Baillot Y, Behringer R, Feiner S, Julier S, MacIntyre B. Recent Advances in Augmented Reality. IEEE Comput Graph Appl. 2001 Nov;21(6):34–47.

[17] Milgram P, Takemura H, Utsumi A, Kishino F. Augmented Reality: A Class of Displays on the Reality-Virtuality Continuum; 1994. p. 282–292.

[18] Liestol G, Morrison A. Views, alignment and incongruity in indirect augmented reality. In: Mixed and Augmented Reality - Arts, Media, and Humanities (ISMAR-AMH), 2013 IEEE International Symposium on; 2013. p. 23–28.

[19] Word Lens on App Store on iTunes [homepage on the Internet]. Apple Inc.; [updated 2014 Apr 18; cited 2014 May 25]. Available from: https://itunes.apple.com/en/app/word-lens/id383463868?mt=8.

[20] Iwamoto K, Kizuka Y, Tsujino Y. Plate bending by line heating with interactive support through a monocular video see-through head mounted display. In: Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on; 2010. p. 185–190.

[21] Kijima R, Ojika T. Reflex HMD to compensate lag and correction of derivative deformation. In: Virtual Reality, 2002. Proceedings. IEEE; 2002. p. 172–179.

[22] Genc Y, Sauer F, Wenzel F, Tuceryan M, Navab N. Optical see-through HMD calibration: a stereo method validated with a video see-through system. In: Augmented Reality, 2000. (ISAR 2000). Proceedings. IEEE and ACM International Symposium on; 2000. p. 165–174.


[23] Rolland JP, Fuchs H. Optical Versus Video See-Through Head-Mounted Displays in Medical Visualization. Presence: Teleoper Virtual Environ. 2000 Jun;9(3):287–309.

[24] Hakkinen J, Vuori T, Paakka M. Postural stability and sickness symptoms after HMD use. In: Systems, Man and Cybernetics, 2002 IEEE International Conference on. vol. 1; 2002. p. 147–152.

[25] Sachs D. Sensor Fusion on Android Devices: A Revolution in Motion Processing; 2010. 15:20 - 16:20. [Google Tech Talk]. Available from: https://www.youtube.com/watch?v=C7JQ7Rpwn2k.

[26] Sachs D. Sensor Fusion on Android Devices: A Revolution in Motion Processing; 2010. 21:30 - 23:10. [Google Tech Talk]. Available from: https://www.youtube.com/watch?v=C7JQ7Rpwn2k.

[27] Deak G, Curran K, Condell J. A survey of active and passive indoor localisation systems. Computer Communications. 2012;35(16):1939–1954.

[28] Sachs D. Sensor Fusion on Android Devices: A Revolution in Motion Processing; 2010. 23:10 - 27:40. [Google Tech Talk]. Available from: https://www.youtube.com/watch?v=C7JQ7Rpwn2k.

[29] Kato H, Billinghurst M. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In: Augmented Reality, 1999. (IWAR ’99) Proceedings. 2nd IEEE and ACM International Workshop on; 1999. p. 85–94.

[30] Skoczewski M, Maekawa H. Augmented Reality System for Accelerometer Equipped Mobile Devices. In: Computer and Information Science (ICIS), 2010 IEEE/ACIS 9th International Conference on; 2010. p. 209–214.

[31] Kim S, Lee GA, Sakata N. Comparing pointing and drawing for remote collaboration. In: Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on; 2013. p. 1–6.

[32] Umbaugh SE. Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools, Second Edition. 2nd ed. Boca Raton, FL, USA: CRC Press, Inc.; 2010.

[33] Edge detection Pixel Shader [homepage on the Internet]. Agnius Vasiliauskas; [updated 2010 Jun 3; cited 2014 May 25]. Available from: http://coding-experiments.blogspot.se/2010/06/edge-detection.html.

[34] Pinaki Pratim Acharjya, DG, Soumya Mukherjee. Digital Image Segmentation Using Median Filtering and Morphological Approach. The Annals of Statistics. 2014 01;4(1):552–557.


[35] Arias-Castro E, Donoho DL. Does median filtering truly preserve edges better than linear filtering? The Annals of Statistics. 2009 06;37(3):1172–1206.

[36] Strzodka R, Ihrke I, Magnor M. A graphics hardware implementation of the generalized Hough transform for fast object recognition, scale, and 3D pose detection. In: Image Analysis and Processing, 2003. Proceedings. 12th International Conference on; 2003. p. 188–193.

[37] Duda RO, Hart PE. Use of the Hough Transformation to Detect Lines and Curves in Pictures. Commun ACM. 1972 Jan;15(1):11–15.

[38] Dillencourt MB, Samet H, Tamminen M. A General Approach to Connected-component Labeling for Arbitrary Image Representations. J ACM. 1992 Apr;39(2):253–280.

[39] Connected Components Labeling [homepage on the Internet]. R. Fisher, S. Perkins, A. Walker and E. Wolfart; [cited 2014 May 25]. Available from: http://homepages.inf.ed.ac.uk/rbf/HIPR2/label.htm.

[40] Google Glass Teardown [homepage on the Internet]. Scott Torborg, Star Simpson; [cited 2014 May 25]. Available from: http://www.catwig.com/google-glass-teardown/.

[41] Tech specs - Google Glass Help [homepage on the Internet]. Google; [cited 2014 May 25]. Available from: https://support.google.com/glass/answer/3064128?hl=en.

[42] Wink - Google Glass Help [homepage on the Internet]. Google; [cited 2014 May 25]. Available from: https://support.google.com/glass/answer/4347178?hl=en.

[43] Starner T. Project Glass: An Extension of the Self. Pervasive Computing, IEEE. 2013 Apr;12(2):14–16.

[44] Motorola Moto G – Full phone specifications [homepage on the Internet]. GSMArena.com; [cited 2014 May 25]. Available from: http://www.gsmarena.com/motorola_moto_g-5831.php.

[45] Alain Vongsouvanh JM. Building Glass Services with the Google Mirror API; 2013. 40:00 - 40:30. [Google I/O 2013]. Available from: https://www.youtube.com/watch?v=CxB1DuwGRqk.

[46] They’re No Google Glass, But These Epson Specs Offer A New Look At Smart Eyewear [homepage on the Internet]. Say Media Inc.; [updated 2014 May 20; cited 2014 May 25]. The New Reality; [about 3 screens]. Available from: http://readwrite.com/2014/05/20/augmented-reality-epson-moverio-google-glass-oculus-rift-virtual-reality.


[47] Mistry P, Maes P. SixthSense: A Wearable Gestural Interface. In: ACM SIGGRAPH ASIA 2009 Sketches. SIGGRAPH ASIA ’09. New York, NY, USA: ACM; 2009. p. 11:1–11:1.

[48] Meta Pro AR Goggles Kick Google’s Glass - Yahoo News [homepage on the Internet]. Yahoo News; c1995-2002 [updated 2014 Jan 8; cited 2014 May 25]. Available from: http://news.yahoo.com/meta-pro-ar-goggles-kick-122918255.html.

[49] meta: The Most Advanced Augmented Reality Glasses [homepage on the Internet]. Kickstarter, Inc.; [cited 2014 May 25]. The Tech Specs – Software; [about 2 screens from the bottom]. Available from: https://www.kickstarter.com/projects/551975293/meta-the-most-advanced-augmented-reality-interface.

[50] The Tantalizing Possibilities of an Oculus Rift Mounted Camera [homepage on the Internet]; [updated 2014 May 14; cited 2014 May 25]. Available from: http://www.roadtovr.com/oculus-rift-camera-mod-lets-you-bring-the-outside-world-in/.

[51] Tech Specs | Recon Jet [homepage on the Internet]. Recon Instruments; [cited 2014 May 25]. Available from: http://www.reconinstruments.com/products/jet/tech-specs/.

[52] XMReality – XMExpert consists of two parts [homepage on the Internet]. XMReality; [cited 2014 May 25]. Available from: http://xmreality.se/product/?lang=en.

[53] Mobile Video Platform for Field Service [homepage on the Internet]. VIPAAR; [cited 2014 May 25]. Available from: http://www.vipaar.com/platform.

[54] HTML5 Introduction [homepage on the Internet]. Refsnes Data; [cited 2014 May 25]. Available from: http://www.w3schools.com/html/html5_intro.asp.

[55] WebRTC 1.0: Real-time Communication Between Browsers [homepage on the Internet]. World Wide Web Consortium; [updated 2014 Apr 10; cited 2014 May 24]. Available from: http://dev.w3.org/2011/webrtc/editor/webrtc.html.

[56] Is WebRTC ready yet? [homepage on the Internet]; [cited 2014 May 24]. Available from: http://iswebrtcreadyyet.com/.

[57] WebRTC | MDN [homepage on the Internet]. Mozilla Developer Network; [updated 2014 May 15; cited 2014 May 24]. Available from: https://developer.mozilla.org/en-US/docs/WebRTC.


[58] RFC 4960 — Stream Control Transmission Protocol [homepage on the Internet]. Internet Engineering Task Force; [updated 2007 Sep; cited 2014 May 24]. Available from: http://www.ietf.org/rfc/rfc2960.txt.

[59] The Standard for Embedded Accelerated 3D Graphics [homepage on the Internet]. Beaverton, OR, USA: The Khronos Group; [cited 2014 May 22]. Available from: http://www.khronos.org/opengles/2_X/.

[60] The Industry’s Foundation for High Performance Graphics [homepage on the Internet]. Beaverton, OR, USA: The Khronos Group; [cited 2014 May 22]. Available from: http://www.khronos.org/opengl/.

[61] HTMLCanvasElement [homepage on the Internet]. Mozilla Developer Network; [updated 2014 Mar 21; cited 2014 May 22]. Available from: https://developer.mozilla.org/en-US/docs/Web/API/HTMLCanvasElement.

[62] Marroquim R, Maximo A. Introduction to GPU Programming with GLSL. In: Proceedings of the 2009 Tutorials of the XXII Brazilian Symposium on Computer Graphics and Image Processing. SIBGRAPI-TUTORIALS ’09. Washington, DC, USA: IEEE Computer Society; 2009. p. 3–16.

[63] WebGL Fundamentals [homepage on the Internet]. Gregg Tavares; [updated 2012 Feb 9; cited 2014 May 22]. Available from: http://www.html5rocks.com/en/tutorials/webgl/webgl_fundamentals/.

[64] Litherum: How Multisampling Works in OpenGL [homepage on the Internet]. Litherum; [updated 2014 Jan 6; cited 2014 May 24]. Available from: http://litherum.blogspot.se/2014/01/how-multisampling-works-in-opengl.html.

[65] OpenGL Wiki - Framebuffer [homepage on the Internet]; [cited 2014 May 22]. Available from: http://www.opengl.org/wiki/Framebuffer.

[66] Birrell AD, Nelson BJ. Implementing Remote Procedure Calls. ACM Trans Comput Syst. 1984 Feb;2(1):39–59.

[67] GStreamer [homepage on the Internet]; [cited 2014 May 22]. Available from: http://gstreamer.freedesktop.org/.

[68] GObject [homepage on the Internet]. The GNOME Project; [cited 2014 May 22]. Available from: https://developer.gnome.org/gobject/stable/.

[69] NEON – ARM [homepage on the Internet]. ARM Ltd.; [cited 2014 May 24]. Available from: http://www.arm.com/products/processors/technologies/neon.php.


[70] Chandra S, Dey S. Addressing computational and networking constraints to enable video streaming from wireless appliances. In: Embedded Systems for Real-Time Multimedia, 2005. 3rd Workshop on; 2005. p. 27–32.

[71] Android - Media [homepage on the Internet]. Android Open Source Project; [cited 2014 May 22]. Available from: https://source.android.com/devices/media.html.

[72] gstreamer-omap [homepage on the Internet]; [cited 2014 May 22]. Available from: https://gitorious.org/gstreamer-omap.

[73] gstreamer/gst-omx [git repository]; [cited 2014 May 22]. Available from: http://cgit.freedesktop.org/gstreamer/gst-omx/.

[74] NodeJS [homepage on the Internet]. Joyent, Inc.; [cited 2014 May 22]. Available from: http://nodejs.org/.

[75] GNU Development Tools — LD [homepage on the Internet]. Panagiotis Christias; 1994 [cited 2014 May 23]. Available from: http://unixhelp.ed.ac.uk/CGI/man-cgi?ld.

[76] TI E2E Community [homepage on the Internet]. Texas Instruments; [cited 2014 May 23]. Available from: http://e2e.ti.com/.

[77] Android API Reference — Canvas [homepage on the Internet]; [updated 2014 May 20; cited 2014 May 24]. Available from: http://developer.android.com/reference/android/graphics/Canvas.html.

[78] Mozilla Developer Network — Canvas [homepage on the Internet]. Mozilla; [updated 2014 May 23; cited 2014 May 24]. Available from: https://developer.mozilla.org/en-US/docs/Web/HTML/Canvas.

[79] Abramov A, Kulvicius T, Wörgötter F, Dellen B. Real-Time Image Segmentation on a GPU. In: Keller R, Kramer D, Weiss JP, editors. Facing the Multicore-Challenge. vol. 6310 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2010. p. 131–142.

[80] Mozilla Developer Network — Canvas [homepage on the Internet]. Mozilla; [updated 2014 May 21; cited 2014 May 24]. Available from: https://developer.mozilla.org/en-US/docs/Web/Guide/Performance/Using_web_workers.

[81] Official ZXing (“Zebra Crossing”) project home [homepage on the Internet]. GitHub, Inc.; [updated 2014 May 24; cited 2014 May 24]. Available from: https://github.com/zxing/zxing.

[82] The Web Sockets API [homepage on the Internet]. World Wide Web Consortium; [cited 2014 May 24]. Available from: http://www.w3.org/TR/2009/WD-websockets-20091222/.


[83] Server-Sent Events [homepage on the Internet]. World Wide Web Consortium; [updated 2012 Dec 11; cited 2014 May 24]. Available from: http://www.w3.org/TR/eventsource/.

[84] Where polyfill came from / on coining the term [homepage on the Internet]. Remy Sharp; [updated 2010 Oct 8; cited 2014 May 24]. Available from: http://remysharp.com/2010/10/08/what-is-a-polyfill/.
