Remote Augmentation in Multipoint Telepresence

Innovation Partners | White Paper


1 Introduction

Augmented Reality (AR) is a concept and a set of technologies for merging real and virtual elements to produce new visualizations, typically a video, in which physical and digital objects co-exist and interact in real time. Most AR applications support real-time interaction with content (an AR scene with virtual objects) that has been produced in advance or offline. In many cases, for example in ad hoc remote guidance applications, AR interaction over a network is required. AR interaction over a network means, for example, adding virtual objects into a video feed captured in a remote location.

AR interaction over networks includes solutions for both: 1) real-time situations, where users are simultaneously interacting with each other and with AR content, and 2) offline situations, where the users are not simultaneously interacting with each other but still want to produce or share AR content over a network. Support for remote AR interaction also needs to be available when real-time and offline sessions follow or alternate with each other. This requires that the AR content be produced, stored, and updated seamlessly in successive sessions.

2 Background

2.1 Local Augmented Reality

3D models and animations are the most obvious virtual elements to be visualized in AR. However, AR objects can basically be any digital information for which spatiality (3D position and orientation in space) gives added value, such as pictures, videos, graphics, text, and audio.

Augmented reality visualizations require a means to see augmented virtual elements as a part of the physical view. This can be implemented by, for example, a tablet with an embedded camera, which captures video from the user’s environment and shows it together with virtual elements on its display. Augmented reality glasses, either video-see-through or optical-see-through, and either monocular or stereoscopic, can also be used for viewing. The user of the viewing device can also be a remote user watching, over a network, the same augmented scene as the local user.

AR visualizations can be seen correctly from different viewpoints, so that when the user changes his or her viewpoint, virtual elements remain in place and act as if they are part of the physical scene. This requires that the positions of virtual objects be defined with respect to the 3D coordinates of the environment, and that tracking technologies be used to track the viewer's (camera) position with respect to the environment.
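In practice, this amounts to composing two transforms each frame: the tracked camera pose in the environment's coordinate frame and the virtual object's pose in that same frame together give the object's pose relative to the camera for rendering. A minimal sketch with hypothetical poses (NumPy; all numeric values are illustrative, not from the white paper):

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Pose of the virtual object in the environment (world) frame, e.g. on a table top.
world_T_object = make_pose(np.eye(3), np.array([0.5, 0.0, 0.8]))

# Camera pose in the world frame, as reported by the tracker each frame.
world_T_camera = make_pose(np.eye(3), np.array([0.0, -1.5, 1.2]))

# Object pose in camera coordinates: invert the camera pose and compose.
camera_T_object = np.linalg.inv(world_T_camera) @ world_T_object
print(camera_T_object[:3, 3])   # where to render the object relative to the camera
```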

Traditionally, printed graphical markers are placed in the environment; they are detected from the video and used as a reference both for augmenting virtual information in the right orientation and scale and for tracking the viewer's (camera) position. Recently, much research has been devoted to so-called markerless AR, which, instead of sometimes disturbing markers, relies on detecting distinctive features of the environment and using them for augmenting virtual information and tracking the user's position.

The majority of AR applications are meant for local viewing of the AR content; in other words, the viewer is in the same physical space that is being augmented. However, as the result is typically shown as a video on a display, the augmented video can naturally also be seen remotely over a network, if desired.



2.2 Remote Augmented Reality

Producing AR content remotely (i.e., adding virtual objects and animations to be augmented over a network) is a very useful feature in many applications, such as remote guidance applications where a remote expert can add virtual objects that are augmented to a video viewed by a local user. A poorly supported area with growing importance is delivering virtual objects in telepresence and social media applications. The common feature of all these applications is the need for synchronous interaction between two or more users, both AR content producers and consumers. Here, synchronous interaction means that the remote and local users have a video conference and see the virtual objects that are added to the video stream in real time. For many applications, supporting real-time AR interaction is quite demanding due to the required bandwidth, processing time, latency, etc.

Asynchronous interaction is about delivering and sharing information (messages, for example) without a hard real-time constraint. In many cases, asynchronous interaction is preferred, as it does not require simultaneous presence of the interacting parties. In teleconferencing, asynchronous interaction may happen after the live conference has ended, allowing participants to add virtual objects to other participants' environments. The participants can then see the virtual objects when later accessing the conference space.

In many applications, supporting synchronous and asynchronous functionalities in parallel is necessary or beneficial. They can also be mixed in more integral ways in order to create new ways of interacting.

2.2.1 Producing AR content remotely

If graphical markers are attached to the local environment, remote augmentation can be made simply by detecting the markers' pose (position, orientation, and scale) from the transmitted local video and aligning virtual objects with respect to them. This is fairly simple and fast, can even be partly automated, and is well suited to synchronous interaction. An early example of this approach is given by Hirokazu Kato and Mark Billinghurst [Kato & Billinghurst 1999].
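As an illustration only, the marker-based step can be sketched with OpenCV's ArUco module. This assumes the classic cv2.aruco API of OpenCV contrib releases up to 4.6; the marker size and camera intrinsics below are placeholder values, not taken from the white paper:

```python
import cv2
import numpy as np

# Placeholder intrinsics; in practice these come from calibrating the transmitting camera.
camera_matrix = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
dist_coeffs = np.zeros(5)
MARKER_SIZE_M = 0.10  # printed marker edge length in metres (assumption)

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)
params = cv2.aruco.DetectorParameters_create()

def marker_poses(frame_bgr):
    """Detect ArUco markers in a video frame and return their poses in camera coordinates."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary, parameters=params)
    if ids is None:
        return {}
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, MARKER_SIZE_M, camera_matrix, dist_coeffs)
    # A virtual object anchored to a marker is rendered using that marker's (rvec, tvec).
    return {int(i): (r.ravel(), t.ravel()) for i, r, t in zip(ids.ravel(), rvecs, tvecs)}
```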

Markerless 3D feature-based methods can be used in cases where visible markers are too disturbing or do not work at all, as in large-scale outdoor augmentations. Feature-based methods typically require more advance preparation than marker-based methods. They also require more complex data capture, more complex processing, and more complex tools for AR content production than a marker-based approach. In addition, they do not give as explicit a scale reference for the augmentations as markers do.

Feature-based methods can be used for augmenting remote spaces if the required preparations (for example, 3D scanning of the environment) can be made in advance, and if the local environment stays stable enough that the results of those preparations can be reused in several synchronous sessions. In these solutions, 3D scanning of the local space can be performed using a moving camera or a depth sensor.

Marker-based methods can be applied even if there are no predefined markers in the local environment. In this approach the application allows a remote user to select a known feature set (e.g., a poster on the wall or the logo of a machine) from the local environment. The feature set used for tracking is in practice an image that can be used in the same way as a marker to define 3D position and 3D orientation [Reitmayr et al. 2007, Siltanen et al. 2015].
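A hedged sketch of this idea, assuming the reference image (e.g., the poster) and its physical dimensions are known in advance: match local features between the reference image and the live frame, fit a homography, and recover the pose of the planar target with a PnP solve. File names and sizes below are hypothetical (OpenCV):

```python
import cv2
import numpy as np

# Reference image of a known planar feature set (e.g. a poster) and its physical size.
POSTER_W, POSTER_H = 0.60, 0.40          # metres (assumed, must be measured)
ref_img = cv2.imread("poster.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(1000)
ref_kp, ref_desc = orb.detectAndCompute(ref_img, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def poster_pose(frame_gray, camera_matrix, dist_coeffs):
    """Estimate the poster's pose in camera coordinates from one video frame."""
    kp, desc = orb.detectAndCompute(frame_gray, None)
    matches = matcher.match(ref_desc, desc)
    if len(matches) < 12:
        return None
    src = np.float32([ref_kp[m.queryIdx].pt for m in matches])
    dst = np.float32([kp[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    # Map the poster's image corners into the frame, then solve PnP against
    # the poster's known 3D corner coordinates (the z = 0 plane).
    h_px, w_px = ref_img.shape
    img_corners = np.float32([[0, 0], [w_px, 0], [w_px, h_px], [0, h_px]]).reshape(-1, 1, 2)
    frame_corners = cv2.perspectiveTransform(img_corners, H)
    obj_corners = np.float32([[0, 0, 0], [POSTER_W, 0, 0],
                              [POSTER_W, POSTER_H, 0], [0, POSTER_H, 0]])
    ok, rvec, tvec = cv2.solvePnP(obj_corners, frame_corners, camera_matrix, dist_coeffs)
    return (rvec, tvec) if ok else None
```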

With restrictions, even unknown planar features, recognized and defined by the remote user, can be used to help remote augmentation. In this case, however, the depth and scale cannot be derived precisely from the remote video, and the augmentation is often restricted to replacing planar feature sets with other subjectively scaled planar objects (e.g., a poster with another poster).

For a local environment with no features that are known in advance, a method called simultaneous localization and mapping (SLAM) has been developed [Klein & Murray 2007]. SLAM simultaneously estimates the 3D pose of the camera and the 3D features of the scene from a live video stream. The method results in a set of 3D points, which a remote user can use to align virtual objects to a desired 3D position while the local user moves a camera to show the local environment from different angles.
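Once such a sparse point map exists, one simple way for a remote user to place a virtual object is to pick a map point and orient the object along the local surface normal estimated from neighbouring points. The sketch below only illustrates that placement step (NumPy; the point cloud itself would come from the SLAM system):

```python
import numpy as np

def anchor_on_point_cloud(points, picked_idx, k=30):
    """Anchor a virtual object at a picked map point, oriented along the local
    surface normal estimated from the k nearest neighbouring points."""
    p = points[picked_idx]
    d = np.linalg.norm(points - p, axis=1)
    nbrs = points[np.argsort(d)[:k]]
    # Normal = eigenvector of the neighbourhood covariance with the smallest eigenvalue.
    cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]
    # Build an orthonormal frame with the object's z-axis along the surface normal.
    x = np.cross(normal, [0.0, 0.0, 1.0])
    x = x / np.linalg.norm(x) if np.linalg.norm(x) > 1e-6 else np.array([1.0, 0.0, 0.0])
    y = np.cross(normal, x)
    world_T_object = np.eye(4)
    world_T_object[:3, :3] = np.column_stack([x, y, normal])
    world_T_object[:3, 3] = p
    return world_T_object
```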

Local 3D features can also be captured with a set of fixed video cameras, each filming the environment from a different angle. These streams can be used to calculate a set of 3D points that can be used by the remote user [Seitz et al. 2006]. Alternatively, the 3D point cloud can be created using a depth camera [Izadi et al. 2011]. For producing the point cloud, the related camera- and/or depth-sensor-based solutions described for 3D telepresence are also applicable [Maimone et al. 2012].
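For the depth-camera case, the conversion from a depth image to 3D points is the standard pinhole back-projection; a minimal sketch (NumPy; the intrinsic parameters fx, fy, cx, cy come from whatever calibration the sensor provides):

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image (in metres) into a 3D point cloud in camera
    coordinates using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]          # drop pixels with no depth reading
```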

2.3 Coding and transmission of 3D data

Coding and transmission of real-time captured 3D data requires much more bandwidth than real-time video. For example, the raw data bitrate of the Kinect 1 sensor is almost 300 MB/s (9.83 MB per frame). So, obviously, efficient compression methods are needed. Compression methods for Kinect-type depth data (either RGB-D or ToF) are, however, still in their infancy. The amount of real-time captured depth sensor data (color plus depth) is much larger than that of a video camera. The same relative comparison also holds for multi-sensor systems compared to multi-camera systems.
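The raw rate follows directly from the per-frame size and the frame rate; the arithmetic below reproduces the figure quoted above under the assumption of a 30 fps stream:

```python
# Raw (uncompressed) data rate of a depth sensor stream: per-frame size times frame rate.
frame_mb = 9.83                      # MB per frame (color plus depth), as quoted above
fps = 30                             # frames per second (assumed)
print(f"{frame_mb * fps:.0f} MB/s")  # ~295 MB/s, i.e. "almost 300 MB/s"
```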

US patent application US 2013/0321593 (Dec. 5, 2013 by Microsoft) is an example of a method for real-time transmission of 3D geometry and texture data, on demand, from a viewpoint specified by the rendering client. The application describes a method of transmitting only the texture data and geometry necessary to render the view from a selected viewpoint, instead of transmitting the whole 3D-captured geometry.

3 New methods for supporting unassisted remote AR

As teleconferencing systems have become increasingly popular, people have grown accustomed to using real-time video in their communication. Enabling a remote user to enrich remote live video with virtual information is a natural extension to standard teleconferencing that brings added value in many applications. In order to add virtual objects to the video feed, the remote user needs 3D information about the environment to position the virtual objects in 3D space. The 3D information can be captured as described in Section 2.2 and transmitted to the remote user. The remote user can then use the 3D model as a reference when positioning virtual information with respect to the environment.


3.1 Unassisted feature-based remote AR

This white paper introduces new solutions for unassisted remote AR, which do not require local assistance for capturing 3D features of the local environment. Advance preparations or local assistance can be avoided by capturing the local space with a fixed setup of 3D cameras and/or sensors, and providing this information to a remote user for making accurate 3D augmentations. 3D feature capture and reconstruction has been studied extensively, and many technologies and solutions exist. Since our goal is to support real-time applications without requiring local assistance in 3D feature capture, methods based on moving a single camera or depth sensor in space are not applicable.

Solutions for real-time, unassisted 3D capture are used, for example, in real-time 3D telepresence. In these systems, multi-sensor capture is typically used for deriving a 3D representation of the captured scene. In [Kuster 2011], a combination of three color cameras and two Kinects (using only their IR parts) was used for high-quality view synthesis at about 7 fps, and the frame rate was expected to increase fourfold through parallelization of the algorithms. In [Maimone et al. 2012], five Kinects were used for capturing and rendering conferencing participants at an average rate of 8.5 fps. New viewpoints of an unchanged volume were rendered at 26.3 fps.

Related US patent US 8,872,817 B2 (Oct. 28, 2014, by ETRI) describes an apparatus and method for real-time 3D reconstruction. Another patent, US 8,134,556 B2 (Mar. 13, 2012, by Elsberg et al.), describes a method for forming a photorealistic view of a three-dimensional simulation model on demand using a ray tracing method.

3.2 Adding a virtual object to a remote environment

The ability to enrich remote live video with virtual information does not yet exist in standard teleconferencing systems, partly because transmitting real-time captured 3D information over a network has some disadvantages. For example, the bandwidth requirements of 3D information are much higher than those of normal video. Also, in order to generate an accurate 3D representation, the environment must be captured from several directions. Normally, users are accustomed to pointing a video camera so that it captures only those parts of the environment that they want to show to the remote participants of the video conference. In the case of 3D capture, there is no obvious way to restrict the information shown to the remote participants, because the whole point of 3D capture is to gather as complete 3D information as possible.

In the new InterDigital solution [US 62/316,884], the user may restrict the 3D captured information transmitted to the remote users using the same method as in traditional video conferencing: by pointing a video camera in the desired direction. Even though the local 3D capture set-up generates 3D information of the whole environment, the system computes which 3D objects are visible in the camera view and transmits only those to the remote participants. The user may use the camera of a phone or a tablet to show which area is visible to the remote users (Figure 1), or the visible area may be restricted by the laptop camera that the local user is already using to communicate with the remote participants in the teleconference.
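The visibility computation can be thought of as frustum culling of the captured 3D data against the user's camera: only points that project inside the camera image (and lie within a sensible depth range) are transmitted. The sketch below is only an illustration of that idea with hypothetical parameter names, not the patented implementation:

```python
import numpy as np

def visible_points(points_world, world_T_camera, K, width, height, near=0.1, far=10.0):
    """Return only the 3D points that fall inside the user's camera view, so that
    just this subset (not the full room capture) is transmitted to remote parties."""
    camera_T_world = np.linalg.inv(world_T_camera)
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (camera_T_world @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    proj = (K @ pts_cam.T).T
    with np.errstate(divide="ignore", invalid="ignore"):
        uv = proj[:, :2] / proj[:, 2:3]
    in_view = (z > near) & (z < far) & \
              (uv[:, 0] >= 0) & (uv[:, 0] < width) & \
              (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return points_world[in_view]
```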

Since the amount of 3D information within the visible area is smaller than that of the whole environment, the solution reduces the bandwidth requirements when transmitting the 3D model to the remote participants. Also, since a remote user sees the local environment from only one direction at a time, the InterDigital solution suggests transmitting virtual camera views from the angle requested by the remote user, reducing the bandwidth requirement to that of a normal video feed.


Figure 1. Restricting the visible area: a 3D reconstruction set-up captures the room while the user holds a tablet; 3D information is transmitted only from the area visible to the local user's camera, where the augmented object is placed.



3.3 Calibrating the capture setup

The 3D information is generated by a capture setup, using a set of sensors (which may be depth cameras or normal video cameras) mounted on the room walls (cf. Figure 1). The sensor system needs to be calibrated in order to create a common coordinate system for the whole setup, including the user's camera. In the InterDigital solution, the sensor setup is self-calibrating, so that a user need only follow basic instructions for assembling the sensors and the user's camera in the environment, and the system takes care of the calibration. The calibration is implemented so that it allows flexible re-configuration of the setup, for example to enable better capture of some areas of the space.

In the InterDigital solution [US 62/316,884], automatic calibration is achieved by using so-called camera markers [US 62/202,431] as the capture set-up sensors, where each sensor is combined with a marker of known dimensions (Figure 2). The markers may be printed on paper, or the camera markers may have a display on which the marker is shown when needed.

The capture setup consists of wide-angle cameras with markers. The setup needs to be calibrated in order to create a common coordinate system for the whole arrangement, including the user's video camera. A common coordinate system is needed so that the system can augment virtual objects to the video feed transmitted from the user's camera to the remote participants. The camera markers are positioned so that each of the cameras captures at least one of the markers, allowing the system to calibrate that camera automatically using prior-art methods, such as [Brückner 2010]. If the marker is shown on the camera marker's display, it needs to be shown only during the calibration process; at other times the display may be used for other purposes.
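The core of such a calibration is transform chaining: every detected marker yields a camera-to-marker transform, and any camera that sees a marker whose pose is already known in the common frame can itself be expressed in that frame. A simplified sketch, assuming the per-camera marker poses come from a detector such as the ArUco example in Section 2.2.1 (all numeric values are illustrative):

```python
import numpy as np

def pose(R, t):
    """4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Camera A is chosen as the origin of the common coordinate system.
world_T_camA = np.eye(4)

# Marker 1 pose as detected by camera A (e.g. via marker detection / solvePnP).
camA_T_m1 = pose(np.eye(3), np.array([0.0, 0.0, 2.0]))
world_T_m1 = world_T_camA @ camA_T_m1

# Camera B also detects marker 1; chaining the transforms places B in the common frame.
camB_T_m1 = pose(np.eye(3), np.array([0.5, 0.0, 1.5]))
world_T_camB = world_T_m1 @ np.linalg.inv(camB_T_m1)

# The user's hand-held camera can be placed in the same frame whenever it sees any marker.
print(world_T_camB[:3, 3])
```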

The user's camera also needs to be calibrated to obtain accurate position and orientation information in the local space. Calibrating the user's camera requires orienting it so that it captures at least one of the camera markers. Once the camera is calibrated, virtual objects can be augmented to the video feed it captures, and they remain in the correct 3D position as long as the camera captures enough distinctive features of the local space.

3.4 Synchronous and asynchronous interaction

Using the capture setup described above, a 3D model of the local environment is produced in real time. The model can be used by local or remote user(s) as a spatial reference for producing an accurate AR scene, i.e., a compilation of virtual elements, each with a precise position and orientation. In synchronous interaction, this 3D data is provided to remote users together with a real-time video view of the local space. The video view is generated by the local user's video camera, for example on a laptop or a tablet. A remote user can edit the local AR scene using both the 3D data and the video view, and other users may see the scene augmented to the video feed they receive in real time.
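One way to represent such an AR scene, so that it can be edited during a session and shared between participants, is a simple list of virtual elements, each with a content reference and a pose in the local space's coordinate system. The schema below is purely hypothetical and is not taken from the referenced patent applications:

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json, time

@dataclass
class VirtualElement:
    """One augmentation: any spatial digital content plus its pose in the local space."""
    element_id: str
    content_uri: str                # 3D model, picture, video, text, audio, ...
    position: List[float]          # x, y, z in the local space's coordinate frame
    orientation: List[float]       # quaternion (x, y, z, w)
    scale: float = 1.0
    author: str = ""               # who added it (local or remote participant)

@dataclass
class ARScene:
    """The compilation of virtual elements attached to one local space."""
    space_id: str
    elements: List[VirtualElement] = field(default_factory=list)
    updated_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

scene = ARScene("meeting-room-1")
scene.elements.append(VirtualElement(
    "note-1", "https://example.invalid/arrow.glb",
    position=[0.5, 0.0, 0.8], orientation=[0, 0, 0, 1], author="remote-expert"))
print(scene.to_json())
```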

However, synchronous interaction is only one way people interact with each other. It is possible only when people are in the same space (virtual or real) at the same time. People are also accustomed to interacting asynchronously, by changing the real space, e.g., by leaving notes or moving objects, so that another person sees the change when entering the space.

Figure 2. Self-calibrating capture set-up: camera markers with wide-angle cameras.


For supporting asynchronous interaction, the InterDigital solution [US 62/320,098] allows remote users to change the AR scene related to the local user's physical space when the local user is not present in a teleconference. The 3D data generated during synchronous sessions is stored and can be accessed by the remote users for AR scene creation even after the local user has left the teleconference. Again, the AR objects' poses can be set using different perspective views of the 3D data. The AR scene generated during an asynchronous session can then be augmented to the local user's video view in the next synchronous session. As before, the local user may restrict the space visible to the remote users as described in Section 3.2.
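Supporting this alternation between synchronous and asynchronous sessions mainly requires persisting the AR scene (and, in practice, the captured 3D reference data) between sessions. A minimal sketch of that storage step, continuing the hypothetical ARScene schema above; the file-based store is only an illustration:

```python
import json
from pathlib import Path

SCENE_DIR = Path("ar_scenes")            # hypothetical persistent store
SCENE_DIR.mkdir(exist_ok=True)

def save_scene(space_id: str, scene_json: str) -> None:
    """Persist the AR scene when a synchronous session ends, so remote users
    can keep editing it asynchronously while the local user is away."""
    (SCENE_DIR / f"{space_id}.json").write_text(scene_json)

def load_scene(space_id: str) -> dict:
    """Load the latest AR scene at the start of the next synchronous session; the
    stored virtual objects are then augmented onto the local user's live video."""
    path = SCENE_DIR / f"{space_id}.json"
    return json.loads(path.read_text()) if path.exists() else {"space_id": space_id, "elements": []}
```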

4 Conclusions

The InterDigital solution enables support for remote AR interaction using familiar video-based communication. A remote user can add virtual objects (any digital information for which spatiality gives added value, such as pictures, videos, graphics, text, and audio) to the local scene. These objects are augmented to the video feeds transmitted from the local site over the network to the remote participants, and they are also presented to the local user through augmentation. The solution supports both synchronous and asynchronous interaction using a fixed real-time capture setup.

A special feature of the system is the management of user privacy: the user controls what 3D information he or she shares with remote parties simply by adjusting the real-time video view.

Since the disclosed concept adds value to already familiar video communication solutions, the new solution can be expected to be readily accepted by users and easily implemented on top of existing solutions. The solution also supports non-symmetrical use cases, in which not all users need to have the 3D capture setup installed at their site.





5 References

Kato, H.; Billinghurst, M. (1999). “Marker tracking and HMD calibration for a video-based augmented reality conferencing system.” Proceedings 2nd IEEE and ACM International Workshop on Augmented Reality (IWAR’99).

Reitmayr, G.; Eade, E.; Drummond, T. W. (2007). “Semi-automatic annotations in unknown environments.” 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007).

Siltanen, P.; Valli, S.; Ylikerälä, M.; Honkamaa, P. (2015). “An Architecture for Remote Guidance Service.” 22nd ISPE Concurrent Engineering Conference (CE2015).

Klein, G.; Murray, D. (2007). “Parallel tracking and mapping for small AR workspaces.” 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007).

Seitz, S. M.; Curless, B.; Diebel, J.; Scharstein, D.; Szeliski, R. (2006). “A comparison and evaluation of multi-view stereo reconstruction algorithms.” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06).

Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D. (2011). “KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera.” 24th ACM Symposium on User Interface Software and Technology (UIST 2011).

Maimone, A.; Fuchs, H. (2012). “Real-time volumetric 3D capture of room-sized scenes for telepresence.” 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON 2012).

Kuster, C.; Popa, T.; Zach, C.; Gotsman, C.; Gross, M. (2011). “FreeCam: A hybrid camera system for interactive free-viewpoint video.” 16th Annual Workshop on Vision, Modeling and Visualization (VMV 2011).

Brückner, M.; Denzler, J. (2010). “Active self-calibration of multi-camera systems.” 32nd Annual Symposium of the German Association for Pattern Recognition (DAGM 2010).

InterDigital Patent Applications Referenced:

Valli, S. T., Siltanen P. K., Apparatus and Method for Supporting Interactive Augmented Reality Functionalities, US 62/316,884, filed as a Provisional patent application on 01 Apr 2016.

Valli, S.T., Siltanen P.K., Apparatus and Method for Supporting Interactive Augmented Reality Functionalities, US 62/202,431, filed as a Provisional patent application on 07 Aug 2015.

Valli, S.T., Siltanen P.K., Apparatus And Method For Supporting Synchronous And Asynchronous Augmented Reality Functionalities, US 62/320,098, filed as a Provisional patent application on 08 Apr 2016.