photo-realistic 3d model extraction from camera array capture

The authors are solely responsible for the content of this technical presentation. The technical presentation does not necessarily reflect the official position of the Society of Motion Picture and Television Engineers (SMPTE), and its printing and distribution does not constitute an endorsement of views which may be expressed. This technical presentation is subject to a formal peer-review process by the SMPTE Board of Editors, upon completion of the conference. Citation of this work should state that it is a SMPTE meeting paper. EXAMPLE: Author's Last Name, Initials. 2010. Title of Presentation, Meeting name and location: SMPTE. For information about securing permission to reprint or reproduce a technical presentation, please contact SMPTE at [email protected] or 914-761-1100 (3 Barker Ave., White Plains, NY 10601).

Author(s)

| First Name | Middle Name | Surname | Role | Email | SMPTE Member? |
| --- | --- | --- | --- | --- | --- |
| John | Robert | Naylor | Author | [email protected] | Yes |
| Callum | Rex | Reid | Author | [email protected] | No |

Affiliation

| Organization | Address | Country |
| --- | --- | --- |
| Timeslice USA | 415 N. State St, Ste 190, Lake Oswego, OR, 97034 | USA |
| Digicave Ltd | 3 Orange Row, Brighton, East Sussex, BN1 1UQ | UK |

| Pub ID | Pub Date |
| --- | --- |
| 2011 Stereo 3D Conference | June 20, 2011 |

SMPTE Meeting Presentation

Photo-realistic 3D Model Extraction from Camera Array Capture

John R. Naylor B.Sc. (Hons.), M.B.A., C.Eng., M.I.E.T

Timeslice USA, Lake Oswego, Oregon, USA, [email protected]

Callum Rex Reid B.A. (Hons.)

Digicave Ltd, Brighton, East Sussex, UK, [email protected]

Written for presentation at the

International Conference on Stereoscopic 3D for Media and Entertainment

Abstract. In which the authors describe a process of capturing photo-realistic 3D computer models using purely passive methods based on subject capture with arrays of cameras, and image analysis to permit the instant capture of both form and texture of the subject or talent.

The key performance characteristics of the array are discussed, particularly the challenges of triggering, and the limits imposed on the content created by the native resolution of the cameras used in the array.

Details of the rig design and camera layout and configuration for efficient and effective subject capture are presented.

The process by which multiple still photographs are processed to produce a point cloud, which in turn becomes the model mesh, is presented, together with examples of the current state of the art of this approach.

Tradeoffs such as the decision to eschew the use of active techniques such as laser scanning, structured light projection, or time-of-flight techniques are discussed, together with their benefits.

Keywords. 3D scanning; Full Body; Camera array; Photometrics; Passive scanning; Sculptural Photography, Free Viewpoint Media.


Introduction

The system described in this paper aims to achieve sculptural photography within standard creative-industry working practices and realistic budget constraints, producing content for online interactive use as “Free Viewpoint Media” (FVM).

Let's start by defining “sculptural photography” as an expression of 3D scanning technology.

3D Scanning as Sculptural Photography

Currently 3D scanning is mainly used for applications such as approximating someone's dress size or creating 3D references for special effects. But we think that it is important to look at what 3D scanning can offer as a way of creating content in its own right; as a tool to empower traditional photographic techniques that can give artists the ability to capture form as well as image. We term this simultaneous capture of form and image “Sculptural Photography”.

With this in mind, sculptural photography systems have to perform under a wide variety of lighting conditions to which some 3D scanning techniques are better suited than others. They also need to be affordable, and the choice of a 3D scanning system is the single biggest cost driver in current solutions. The scanning cost is dictated by simple economics that cause special purpose solutions for niche applications to be orders of magnitude more costly than those that repurpose off-the-shelf components.

Sculptural photographic content can be experienced in current interactive platforms as a form of “free viewpoint media” whereby the end user can explore a real moment from virtual camera perspectives. Technological considerations about the final delivery platform can often influence capture procedures and should be taken into account throughout the process; with the choice between passive and active 3D scanning methods having most influence.

Passive Scanning versus Active Scanning

3D scanning falls into two main categories: passive scanning and active scanning.

The active scanning category covers all the techniques that require the projection of light onto the subject to estimate depth.

These techniques range from structured light and infrared (IR) to time-of-flight scanners. Those that briefly expose the subject to patterned visible light invariably require the form to be captured separately from the natural colours of the subject, while those that use invisible wavelengths are known to fail on a wide variety of surfaces, because light reflects differently at these wavelengths.

In contrast, passive scanning uses only the ambient light reflected by the subject, as captured by passive sensors, and sophisticated algorithms to infer the geometry of the form. Passive scanning is an emerging science, and a rich variety of techniques has been explored for this purpose, including stereo photo-metrics, artificial intelligence, Bayesian algebra, and other image analysis techniques. Most of these rely on tightly calibrated multi-camera systems, because the capability of a passive capture system depends on the calibration, resolution, and image quality of its input.

Table 1 introduces some of the key characteristics of 3D scanning, and compares active with passive methods.


| Characteristic | Active Scanning | Passive Scanning | Comments |
| --- | --- | --- | --- |
| Typical Capture Components | Lasers, IR lamps, structured filters, special purpose sensors and processors (e.g. time-of-flight) | Digital stills cameras, ranging from specialist instrumentation units to off-the-shelf commercial products | |
| Lighting | Normally constrained to be flat | Can be what the Director of Photography wants | |
| Scanning | Usually progressive | Whole frame | Rolling shutter effects of progressive scan restrict talent's ability to move |
| Economics | Niche Market Supply | Mass Market Supply | Mass market demand potential |
| Main capability drivers | Resolution of custom sensors, some of which are only 200x200. Degree to which wavelengths used behave the same as visible light. | Camera resolution, dynamic range, triggering, and sophistication of analysis algorithms | |

Table 1 Comparison of Passive and Active 3D Scanning Methods

From this analysis Digicave and Timeslice chose to develop a sculptural photography system based on passive scanning to better satisfy the technical, artistic, and economic criteria that will enable sculptural photography as a mass market phenomenon.

The rest of this paper describes the capture system, and image analysis pipeline that has been developed for sculptural photography; and some of its constraints, and trade-offs. It concludes by discussing delivery mechanisms for sculptural photographs, and steps towards the capture of photographic sculptures in motion.

The Capture System

Capture systems for sculptural photography are derived from conventional camera array systems and share all of a camera array's requirements for high quality results. These have been described previously by Macmillan (2010), so here we concentrate on the requirements that are altered or special to sculptural photography:

• Portability and Calibration

• Configuration

• On-set Preview

Portability and Calibration

It may seem strange to mix the two requirements of portability and calibration under the same heading, until one considers that a system that takes more than an hour to calibrate isn't really portable. We have therefore developed a calibration method that is quick to execute, yet robust, at the cost of a 10% loss in the maximum resolution provided by the cameras in use. The process relies on targeting each camera at a common fixation point at which we have placed a tracking target; and centering, leveling, focusing, and framing each camera in the array by eye. The array is then triggered, and off-the-shelf motion tracking software is used to stabilise the tracking target in the array sequence.

This process produces stable-looking sequences, at the cost of lost pixels around the border of each individual frame. The entire initial calibration can be accomplished by an experienced crew in less than an hour.
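The trade-off between stabilization and border pixels can be illustrated with a minimal sketch. It assumes a translation-only alignment (commercial tracking software typically also corrects rotation and scale), and the function name, the 4000 x 3000 frame size, and the target positions are hypothetical values chosen for illustration, not measurements from the rig.

```python
def stabilised_crop(frame_w, frame_h, target_positions):
    """Align the tracking target across frames by translation only.

    target_positions holds the (x, y) pixel position of the tracking
    target in each camera's frame. Returns the per-frame shifts and the
    size of the region that every shifted frame still covers.
    """
    n = len(target_positions)
    # Move every frame so that the target lands on the mean position.
    ref_x = sum(x for x, _ in target_positions) / n
    ref_y = sum(y for _, y in target_positions) / n
    shifts = [(ref_x - x, ref_y - y) for x, y in target_positions]

    # A frame shifted by (dx, dy) covers [dx, dx + w] x [dy, dy + h];
    # the usable picture is the intersection of all shifted frames.
    left = max(dx for dx, _ in shifts)
    right = min(dx for dx, _ in shifts) + frame_w
    top = max(dy for _, dy in shifts)
    bottom = min(dy for _, dy in shifts) + frame_h
    return shifts, max(0.0, right - left), max(0.0, bottom - top)


# Hypothetical example: three 4000 x 3000 frames whose hand-aimed
# framing differs by up to 100 pixels horizontally and 50 vertically.
shifts, crop_w, crop_h = stabilised_crop(
    4000, 3000, [(2000, 1500), (2100, 1450), (1900, 1550)]
)
retained = (crop_w * crop_h) / (4000 * 3000)
```

In this example the common region is 3800 x 2900 pixels, so roughly 92% of each frame survives. The more accurately each camera is aimed by eye, the smaller the border that has to be discarded.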

The level of calibration achieved by this process is good enough to produce stabilized sequences for on-set preview, but the stabilized sequence is not used for the extraction of 3D models. The second level of calibration required for model extraction is covered later in this paper.

Staying calibrated is as important as getting calibrated in the first place, and is accomplished by paying attention to numerous practical details such as ensuring that the camera supports are stable, and isolated from sources of vibration, wind, and rapid changes in temperature.

Configuration

How many cameras? Where do we put them? Where do we point them? These are the three questions that drive the configuration of a sculptural photography rig.

The number of cameras is driven by resolution and subject size: the higher the resolution, the fewer cameras are needed, and the smaller the subject matter, the fewer. With today's 12 Mega-Pixel (MP) devices, and a 2m diameter action area, we get good results with a 6m diameter circular rig that has 36 cameras fixated on the center of the circle, with a camera height of 1m to get even vertical coverage of the (mainly 2m tall human) subjects.

Note that all cameras are located at equal spacing on the rig's equator. Somewhat counter-intuitively, we have found that satellite cameras at higher latitudes or the poles are not needed to capture form and image of human subjects, and it is quicker and easier to build and calibrate rigs that do not contain them.
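The layout described above can be sketched in a few lines. The numbers (36 cameras, 6m rig diameter, 1m camera height) come from the text; the function name, the coordinate convention (z up, origin at the rig center), and the yaw convention are assumptions made for illustration.

```python
import math


def ring_rig(num_cameras=36, rig_diameter_m=6.0, camera_height_m=1.0):
    """Return (x, y, z, yaw_deg) for cameras equally spaced on a circle.

    yaw_deg is the heading in the ground plane that points the camera
    at the vertical axis through the rig center; z is height above the
    floor.
    """
    radius = rig_diameter_m / 2.0
    cameras = []
    for i in range(num_cameras):
        theta = 2.0 * math.pi * i / num_cameras   # position angle on the ring
        x = radius * math.cos(theta)
        y = radius * math.sin(theta)
        yaw = math.degrees(math.atan2(-y, -x))    # look back towards (0, 0)
        cameras.append((x, y, camera_height_m, yaw))
    return cameras


rig = ring_rig()
angular_spacing_deg = 360.0 / len(rig)               # angle between neighbours
chord_m = 2.0 * 3.0 * math.sin(math.pi / len(rig))  # straight-line gap between neighbours
```

With these figures, neighbouring cameras are 10 degrees apart as seen from the center, which is a straight-line gap of a little over half a meter on the 3m radius.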

We have experimented with other configurations and cannot claim that the simple arrangement described above is in any way optimum apart from its ease of assembly and operation.

On-set Preview

Converting an image sequence into a photo-realistic 3D model is currently heavily compute intensive, so it's important that the sequences that are input to the process are going to match the client's expectations. For this purpose we have developed a facility to view each stabilized rig capture within 20s of each take, so that clients can quickly select the shots they wish to move forward to the next phase of the pipeline: image processing.

Model Extraction

Extracting 3D models from picture sequences that have been captured using the methods described above is achieved with the following sequence of operations, based on algorithms developed by Hernández and Vogiatzis (2010):

• Calibration

• Create a low complexity visual hull of the subject

• Refine the mesh

• Apply the texture as captured

• Touch up both the model and textures by hand, if necessary

The following figures illustrate the first four stages of this pipeline.

Calibration

Calibration for model extraction is achieved using the same plates that were used for the on-set preview stabilization described above. The calibration routine in this part of the workflow calculates a solution for each camera in the rig (position, gimbal angles, focal length) based on photogrammetric analysis of the tracking target. With a solidly constructed rig and stable environmental conditions, it is only necessary to recalibrate the rig after 3 hours of use. The process typically takes 5 minutes.
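To make the per-camera unknowns concrete, the sketch below shows a simple pinhole model that maps a 3D point to pixel coordinates from a camera position, two gimbal angles, and a focal length. It is an illustrative forward model only, not the calibration routine used in the pipeline: it assumes zero roll, no lens distortion, and a focal length expressed in pixels, and all names and numeric values are hypothetical. A calibration solver would adjust these parameters until the predicted position of the tracking target matches its observed position in every camera.

```python
import math


def project_point(point, cam_pos, yaw_deg, pitch_deg, focal_px, centre_px):
    """Project a world point (z up) into pixel coordinates (u right, v down)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    # Orthonormal camera axes expressed in world coordinates.
    forward = (math.cos(pitch) * math.cos(yaw),
               math.cos(pitch) * math.sin(yaw),
               math.sin(pitch))
    right = (math.sin(yaw), -math.cos(yaw), 0.0)
    up = (-math.sin(pitch) * math.cos(yaw),
          -math.sin(pitch) * math.sin(yaw),
          math.cos(pitch))

    offset = tuple(p - c for p, c in zip(point, cam_pos))
    dot = lambda a, b: sum(i * j for i, j in zip(a, b))
    depth = dot(offset, forward)
    if depth <= 0:
        raise ValueError("point is behind the camera")
    u = centre_px[0] + focal_px * dot(offset, right) / depth
    v = centre_px[1] - focal_px * dot(offset, up) / depth
    return (u, v)


def reprojection_error_px(observed, predicted):
    """Pixel distance between where the target was seen and where the model puts it."""
    return math.hypot(observed[0] - predicted[0], observed[1] - predicted[1])


# Hypothetical camera on the rig at (3, 0, 1) looking at the rig center.
target_px = project_point((0.0, 0.0, 1.0), (3.0, 0.0, 1.0), 180.0, 0.0,
                          3000.0, (2000.0, 1500.0))
head_px = project_point((0.0, 0.0, 2.0), (3.0, 0.0, 1.0), 180.0, 0.0,
                        3000.0, (2000.0, 1500.0))
```

Here a target at camera height on the rig axis lands on the image center, and a point 1m higher lands 1000 pixels above it, since 3000 px x (1m / 3m) = 1000 px.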

Initial Capture

Figure 1 Initial Subject showing Camera Array

Visual Hull and Mesh Refinement

Figure 2 Hull, Datamap, and Refinement

Figure 2 illustrates the rough visual hull that is initially extracted from the array capture.

Reading left to right, the next picture is a map of the amount and quality of the point cloud data that has been inferred by the algorithm's first pass. Notice that there is a paucity of data in the area below the knees of the subject, which is indicated by dark areas. In contrast, from the waist upwards, the algorithm has a large amount of good point cloud data with which to attempt refinement of the rough visual hull.

The results of the refinement step are shown in the third picture. Note the improved detail around the arms and hands, the shirt collar and the corrected orientation of the head.

Despite looking like one, the final picture in this sequence is not a photograph, but a rendering of the refined model with the texture applied.
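The idea behind the rough visual hull can be shown with a minimal silhouette-carving sketch: a point in space is kept only if it falls inside the subject's silhouette in every view. This is a generic illustration rather than the pipeline's own implementation. It assumes orthographic cameras on a ring, a coarse grid of sample points, and a synthetic cylindrical "subject" whose silhouette is the same rectangle in every view; all names and dimensions are hypothetical.

```python
import math


def carve_visual_hull(silhouettes, grid_points):
    """Keep the grid points whose projection lies inside every silhouette.

    silhouettes is a list of (yaw_rad, inside) pairs, where inside(u, v)
    reports whether image-plane point (u, v) is covered by the subject
    in the view taken along heading yaw_rad.
    """
    hull = []
    for (x, y, z) in grid_points:
        keep = True
        for yaw, inside in silhouettes:
            # Orthographic projection: u is the horizontal offset seen by a
            # camera looking along heading yaw, v is simply the height.
            u = x * math.sin(yaw) - y * math.cos(yaw)
            if not inside(u, z):
                keep = False
                break
        if keep:
            hull.append((x, y, z))
    return hull


def cylinder_silhouette(u, v):
    """Silhouette of an upright cylinder, 0.3m radius and 2m tall."""
    return abs(u) <= 0.3 and 0.0 <= v <= 2.0


# 36 views at 10 degree spacing and a 0.1m sampling grid.
views = [(math.radians(10 * i), cylinder_silhouette) for i in range(36)]
step = 0.1
grid = [(round(ix * step, 2), round(iy * step, 2), round(iz * step, 2))
        for ix in range(-5, 6) for iy in range(-5, 6) for iz in range(0, 21)]
hull = carve_visual_hull(views, grid)
```

Because silhouettes only bound the subject from outside, the hull cannot recover concavities, which is why a refinement step driven by point cloud data follows it.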

A Closer Look at Areas with Low Data Availability

Figure 3 Low Data Management

Figure 3 shows in close up the areas with gaps in the point cloud data, indicated by the dark areas in the first two pictures.

The third picture illustrates how the algorithm's heuristics create a smooth, even mesh over these areas, and in the main, produce convincing results, especially on low complexity areas such as illustrated here. It's important to produce an even mesh to achieve the best results when the model is rendered.

Skin Detail Considerations

Figure 4 Resolution and Skin Detail


Figure 4 illustrates the level of skin and facial detail that can be captured using 12 Mega Pixel cameras at a range of approximately 3m from the subject. The first image is a direct capture from one of the cameras. It shows that skin and facial hair detail is adequately captured for 2D photographic purposes, and that this level of detail is also adequate when used to texture the extracted model when the distance between it and the virtual camera is similar to that of the initial capture.

The center image in this sequence illustrates the ability of the pipeline to extract a large amount of good positional data from the subject's facial area.

Figure 5 Present and Future Resolution Performance

Reading Figure 5 from left to right shows where the current state of the art starts to hit its technological limits. By pushing the virtual camera closer to the rendered subject, we see, in the first picture, that a noticeable amount of skin detail has been lost.

The second and third pictures in this figure were also captured with 12MP cameras, but at a much closer distance to the subject to emulate the performance of 25MP cameras in the large rig. They convincingly demonstrate that the algorithm will produce much better output with higher resolution input.
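The emulation of a higher resolution camera by moving closer can be estimated with simple arithmetic. The sketch below assumes the same lens and sensor aspect ratio in both cases, so that linear detail on the subject scales with the square root of the pixel count and inversely with distance. The closer capture distance is not stated in the text; the value computed here is an estimate under those assumptions, and the function names are hypothetical.

```python
import math


def equivalent_distance_m(native_mp, target_mp, rig_distance_m):
    """Distance at which a native_mp camera samples the subject as finely
    as a target_mp camera would at rig_distance_m (same lens assumed)."""
    return rig_distance_m * math.sqrt(native_mp / target_mp)


def linear_detail_gain(native_mp, target_mp):
    """Increase in pixels per unit length on the subject."""
    return math.sqrt(target_mp / native_mp)


closer_m = equivalent_distance_m(12, 25, 3.0)   # about 2.08 m
gain = linear_detail_gain(12, 25)               # about 1.44x
```

Roughly doubling the pixel count therefore buys only about 1.44 times the linear skin detail, which gives a sense of how much additional sensor resolution close-up rendering demands.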


Motion Capture

Figure 6 Capturing Motion

Figure 6 illustrates the results obtained with a subject that is moving in a dynamic manner. Note the integrity of the form and its freedom from rolling shutter artifacts. This is a starting point for full motion capture in 3D.

Compression and Delivery

Delivering “naturally lit” 3D scanned content in a freely explorable interactive environment gives rise to a unique way of engaging with and viewing the “real”. Static sculptural photography delivered in this way can be regarded as Free Viewpoint Media, and the motion version of it is undoubtedly the beginnings of “virtualized reality” or “free viewpoint TV”, Kanade (1997).

For content to be served as FVM, the delivery mechanism must be considered carefully. Interactive delivery platforms today are split between console, web, smart phone, and tablet. Most of these platforms access content through the Internet and therefore require data packages of manageable size, which brings compression to the forefront of FVM delivery.

In order to maintain “photo-realism”, a balance has to be struck between capture resolution and the resolution requirements of the final screen size. The quality is reduced further by the necessary compression, but holds up somewhat because the real light is baked onto the model at the point of capture.


Figure 7 Compression Effects

Figure 7 illustrates the dramatic perceptual improvement that is gained by using self-lit models. Both the models shown comprise roughly 6000 polygons, compressed from a 200,000+ polygon original. The one on the left is self-lit; the one on the right uses modeled lighting. This effect can be exploited to make Internet delivery of sculptural photographs both convincing and efficient.

Conclusion

In this paper we have shown that the capture of photo-realistic 3D models is practical using passive 3D scanning techniques, coupled with photogrammetric analysis and commercially available digital cameras. We have demonstrated the limits that the state of the art imposes on the ability to capture skin detail with high fidelity, but also shown that this will likely be overcome by the next generation of DSLRs. Finally, we have demonstrated the viability of the content to be experienced interactively, via efficient, Internet delivery to underscore our claim that what we term “Sculptural Photography” is a practical, new form of digital content.

Glimpsing the future, we plan to extend the technology to full motion sculptural photography. The initial challenge is to increase the frame rate achievable by a camera array in order to produce frame-based motion capture. Parallel algorithmic development will deploy Artificial Intelligence methods to generate Inverse Kinematic metadata from the forms captured, thereby delivering models that are tractable to key-frame animation and motion path editing.

References

Hernández, C and Vogiatzis, G. (2010). Shape from Photographs: A Multi-view Stereo Pipeline. Computer Vision: Detection, Recognition and Reconstruction, Cipolla, Battiato, Farinella (Eds.), 2010 Springer-Verlag.

Kanade, Takeo et al (1997). Constructing Virtual Worlds from Real Scenes. J. Multimedia Vol 4, Issue 1, IEEE. Los Alamitos, CA.


Macmillan, Tim (2010). Stereo Image Acquisition using Camera Arrays. International Conference on Stereoscopic 3D for Media and Entertainment. SMPTE. New York.
