
Stereo Image Acquisition Using Camera Arrays

Tim Macmillan, Time-Slice® Films, Ltd, Bath, UK

John R. Naylor, Convergent Intelligence℠, Portland, OR, USA

“Simultaneity” was one of the central concepts

of the Cubist movement in early 20th century art,

in which an object or scene is recorded from

multiple angles to create an artwork that

synthesises time and space. The instigators of

Cubism (Picasso and Braque) took some of their

inspiration from the work of Eadweard

Muybridge, who, though rightly regarded as the

father of motion pictures, was also the first

person to record “frozen time” sequences with

camera arrays. Today, such sequences can be

produced at high resolution within 30s of a

telegenic event by using arrays of digital stills or

video cameras. They can also produce content

in stereo; simultaneity indeed!

The principle of using camera arrays is familiar,

having caught the world’s imagination with the

bullet-time sequences in The Matrix. Time and

technology have progressed since this movie’s

release in 1999, to the extent that material that

used a wet film process and required

substantial post-production to achieve the

desired results is now fully digitized and

automated. Simply stated, the process is this ...

1) Build an array of still cameras

2) Trigger them all at once

3) Organize the stills into a stereo

sequence

... but getting acceptable results is more

complicated. This paper is organized to

describe the practical challenges in this process

and to show how they are overcome at present.

We then go on to consider the demands of the

artist in using camera arrays, and describe some

techniques and approaches that are unique to

this method of capturing scenes.

Camera Considerations

It’s typical to use between 30 and 60 DSLR type

cameras in a camera array. Each camera in the

array corresponds to one frame of output. So

the duration of any sequence, in frames, is equal to the

number of cameras in the array. For stereo,

overlapped pairs of cameras are normally used

which limits the duration of the output

sequence to be one less frame than the total

number of cameras. The key characteristics of a

camera array are: temporal resolution and

spatial resolution. These are impacted by the

characteristics of the individual cameras as

follows:

Physical size. Smaller is better because the

closer together the cameras can be in the

array, the higher the array’s spatial and

temporal resolution, and this also defines

how close the cameras can get to the

subject.

Chip size. The bigger the better for these

reasons:

o High sensitivity and low noise

o Good match to the lenses. Note that

large sensors also provide a shallower

depth of field at large apertures

compared to smaller sensors.

However, this is not necessarily a

valuable characteristic in camera-array

capture because shallow depths of field

vary between cameras in the array

causing undesirable “jitter” in the depth

of field in resulting shots.

Resolution. The higher the better. We have

established a rule of thumb that the capture

resolution of the RAW Bayer sensor must

be at least 200% higher than the desired


RGB result to overcome the following

effects:

o Converting “photosites”, which are the

unistimulus point values captured by

the DSLR Bayer sensors, to “pixels”,

which are tristimulus RGB or YCbCr

point values.

o 10% allowance for overscan, which

allows the unavoidable registration

differences between members of the

array to be eliminated by automated

cropping.

o A further 50% allowance to provide a

large enough “action safe” area of the

desired resolution and aspect ratio.

Figure 1 Comparing Capture and Output Resolutions

Figure 1 illustrates the spatial compromises

required to produce 1920x1080 HD: starting

with around 12M photosites, the capture area is

reduced by 10% to permit registration variances

to be corrected. Unless the location of the

subject can be tightly controlled the output

frame is panned and cropped from within the

action area that is roughly half the size in each

linear dimension.
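
The resolution budget described above can be checked with a little arithmetic. The following is only a sketch under stated assumptions: it treats the 10% registration allowance as a reduction in area, takes the action-safe region to be half the registered frame in each linear dimension, and uses a hypothetical 4272x2848 sensor (about 12.2M photosites). The function names are illustrative and are not part of the authors' pipeline.

```python
import math


def action_safe_size(sensor_w, sensor_h,
                     overscan_area_fraction=0.10,
                     action_linear_fraction=0.5):
    """Return the (width, height) of the action-safe region in photosites.

    Assumption: the overscan allowance removes a fraction of the AREA, so
    each linear dimension shrinks by sqrt(1 - fraction). The action-safe
    region is then a fixed linear fraction of what remains.
    """
    keep = math.sqrt(1.0 - overscan_area_fraction)
    registered_w = sensor_w * keep
    registered_h = sensor_h * keep
    return (int(registered_w * action_linear_fraction),
            int(registered_h * action_linear_fraction))


def fits_output(sensor_w, sensor_h, out_w=1920, out_h=1080):
    """True if an out_w x out_h frame fits inside the action-safe region."""
    safe_w, safe_h = action_safe_size(sensor_w, sensor_h)
    return safe_w >= out_w and safe_h >= out_h


if __name__ == "__main__":
    # Hypothetical 12.2M-photosite sensor used purely for illustration.
    print(action_safe_size(4272, 2848))
    print(fits_output(4272, 2848))
```

Under these assumptions the action-safe region for the hypothetical sensor is about 2026x1350 photosites, which leaves room for a 1920x1080 output frame.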

Note that some of these requirements are

pushing in opposite directions – large sensors

tend not to come packaged in small cameras.

They also need bigger lenses, which raises the

very real issue of budget because up to sixty

DSLRs, and as many lenses, are needed.

Lens Considerations

It’s possible to go to great lengths such as

polishing lenses from the same block of glass

when manufacturing a matched pair of lenses

for stereo movie making. This simply isn’t

practical when sixty lenses are needed. The

challenge is to prevent the inevitable differences

that occur between nominally identical lenses

being visibly present in the content produced.

We address this challenge with the following

approaches:

Where possible, shoot with prime lenses

because this minimizes focal length

deviations. Alternatively, use zoom lenses

as “double primes” where only each end-

stop position is used.

Measure and correct the geometrical

aberrations lens by lens. We do this by

using a test card to calibrate each lens, and

have processing in our pipeline to correct

individual variations – covered later.

Use the same lens with the same camera

from shoot to shoot. This is a simple matter

of labeling each lens and camera and

ensuring that they are always used together.

We’ve recently incorporated a “LiveView”

function which provides a real-time view of

the full imager (often not the case in

viewfinders or on-camera LCDs) and also

interrogates the metadata provided by each

lens to allow focal lengths between the end-

stops to be selected and matched.

The other lens characteristics to design for are:


Optimum aperture to provide a large depth

of field at f8 and above.

Resolving power should be at least as good

as the sensor’s photosite density; this will

become more of an issue as the megapixel

count increases.

Low weight – sixty times any mass can be

problematic!

Focal length: a wider lens has a more

inherently 3D look than a long-focus lens. For

this reason, the maximum focal length we

recommend is 50mm. Also bear in mind

that the cropping performed for frame

stabilization and subject tracking effectively

increases the focal length experienced by

viewers. However, this has the added

benefit of eliminating areas that are most

heavily affected by lens distortion.

Triggering Tradeoffs

Without a deterministic and reliable method of

triggering the shutters of all the cameras in the

array at precisely the right time, one simply does

not have a useful camera array.

Figure 2 Trigger Design Rules

DSLR cameras used in camera arrays possess

two characteristics that are critical to

synchronised triggering performance. The first

is average trigger response time, which is the

measured latency between the trigger being

asserted, and the center of the shutter-open

period. We typically measure the average over

100 trials, and use this figure to calibrate the

trigger control system. Fortunately, the sample-to-sample

difference between DSLRs of the same make

and model is a few tens of microseconds, which

means that we don’t have to compensate for

camera-to-camera deviations within the rig.

The second and more critical parameter is the

variation about the mean trigger response time

exhibited by the camera. A good rule of thumb

is that, over 100 trials, the worst deviation

should be no more than ±500µs. This rule provides a “safe

trigger window” of 1ms in duration centered on

the nominal trigger point. Note that this rule

fails for fast moving objects that occupy a large

portion of the imager. Capturing such content

requires the use of high-speed flash lighting.
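
The two measurements described above (mean trigger response time and worst deviation about that mean) can be sketched as follows. This is a minimal illustration that assumes latencies have already been measured in microseconds by some external means; the function names are hypothetical, and only the ±500µs threshold comes from the text.

```python
from statistics import mean

# Half-width of the "safe trigger window" quoted in the text, in microseconds.
SAFE_HALF_WINDOW_US = 500.0


def characterize_trigger(latencies_us):
    """Return (mean latency, worst absolute deviation from the mean)."""
    avg = mean(latencies_us)
    worst = max(abs(sample - avg) for sample in latencies_us)
    return avg, worst


def within_safe_window(latencies_us, half_window_us=SAFE_HALF_WINDOW_US):
    """True if every trial falls inside the window centered on the mean."""
    _, worst = characterize_trigger(latencies_us)
    return worst <= half_window_us


if __name__ == "__main__":
    # Made-up trial data for illustration; a real test would use ~100 trials.
    trials = [50_000.0, 50_200.0, 49_800.0]
    print(characterize_trigger(trials))
    print(within_safe_window(trials))
```

The mean would be used to calibrate the trigger controller, while the worst deviation decides whether a camera model is suitable for the array at all.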

Figure 2 illustrates the relationship between the

number of cameras in a rig, the desired

reliability of the rig, and the variance in trigger

latency that can be tolerated.

Let Ra be defined as the probability that all the

cameras in the rig trigger within the safe

window, and N be the number of cameras, and

Pc the probability for an individual camera to

trigger outside of the safe window. Assuming

that the cameras trigger independently, Pc is

given by the following expression:

$$P_c = 1 - R_a^{1/N}$$

And the reliability of the rig by its inverse:

$$R_a = (1 - P_c)^N$$


As the figure shows, far tighter tolerances are

required for a 99% reliable 60 camera array,

than for a 15 camera array that produces usable

sequences just half the time.
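
To make the trade-off concrete, the relationship between rig reliability, array size and per-camera miss probability can be evaluated numerically. The sketch below assumes that cameras miss the safe window independently of one another, so that the rig reliability is (1 - Pc) raised to the power N; the function names are illustrative only.

```python
def rig_reliability(p_miss, n_cameras):
    """Probability that all n_cameras trigger inside the safe window.

    Assumes each camera misses independently with probability p_miss.
    """
    return (1.0 - p_miss) ** n_cameras


def max_miss_probability(target_reliability, n_cameras):
    """Largest per-camera miss probability that still meets the target."""
    return 1.0 - target_reliability ** (1.0 / n_cameras)


if __name__ == "__main__":
    for n, target in ((60, 0.99), (15, 0.50)):
        p = max_miss_probability(target, n)
        print(f"{n} cameras at {target:.0%} reliability: "
              f"per-camera miss probability <= {p:.5f}")
```

Under the independence assumption, a 60 camera array that is 99% reliable tolerates a per-camera miss probability of roughly 0.017%, whereas a 15 camera array that succeeds half the time tolerates about 4.5%.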

Accurate triggering becomes even more

important for stereo capture, because the human

visual system has much higher acuity (sub-pixel)

for stereo disparity than for luminance, even if the image has

motion blur.

Convergence

Although, in principle, the shape of rig and

arrangement of cameras is arbitrary, we’ve

found that most artistic demands can be met in

practice with three basic riggings as illustrated in

Figure 3.

We’ll revisit the convergent topologies later

when considering virtual dolly moves. The

Linear Parallel topology has the interesting

property of producing sequences which work

best with motion adaptive interpolation software

to create longer sequences. This is because the

motion between frames is uniform; the

convergent topologies, by contrast, produce

sequences where there is always motion

towards and away from the viewer between

successive frames.

Color Matching

The final main consideration when building a

Time-Slice rig is to ensure that the color

responses of the cameras fall within a tight

band.

This is achieved by calibrating each camera

against a test card and performing color

correction in real-time in the image processing

pipeline.
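
One simple way to picture per-camera color calibration is a set of channel gains derived from a grey patch on the test card. The sketch below is an assumption for illustration, not the authors' method: it computes gains that map a camera's measured grey patch onto a shared reference value and applies them to a linear RGB pixel. A production pipeline would typically use more patches and a fuller color transform.

```python
def channel_gains(measured_grey, reference_grey):
    """Per-channel gains that map the measured grey patch to the reference.

    Both arguments are (R, G, B) tuples of linear values; the measured
    values must be non-zero.
    """
    return tuple(ref / meas for meas, ref in zip(measured_grey, reference_grey))


def apply_gains(pixel, gains):
    """Apply per-channel gains to one linear (R, G, B) pixel."""
    return tuple(value * gain for value, gain in zip(pixel, gains))


if __name__ == "__main__":
    # Made-up measurement: this camera reads the grey patch with a color cast.
    gains = channel_gains((0.5, 0.4, 0.25), (0.5, 0.5, 0.5))
    print(gains)
    print(apply_gains((0.2, 0.4, 0.1), gains))
```

Storing one set of gains per camera lets the same correction be applied to every still that camera produces during a shoot.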

Figure 3 Rig Topologies in Common Use

The outstanding benefit of DSLRs is being able

to shoot in RAW format, which spares us the

vagaries of compression artifacts compounded

with clipping of gamma and contrast. So, using

DSLR RAW we can confidently match any grade

or exposure response required by our clients.

To summarize:

Select cameras based on triggering

accuracy, sensor size, physical size, and

resolution.

Select lenses that match the sensor size, and

minimize geometric distortions.

Correct for registration, geometric

distortion, color differences and talent

framing in the image processing pipeline

that forms the “brains”.

[Figure 3 panels: Circular Convergent, Linear Convergent, Linear Parallel]

So let’s consider the image processing pipeline

and the tradeoffs between operational speed,

image quality, and system cost.

Image Processing Pipeline and

Network Infrastructure

Figure 4 illustrates the arrangement of custom

and off-the-shelf hardware that forms the

platform for the image acquisition and

processing pipeline, together with the storage

and networking arrangements. Each group of

four cameras is controlled by custom-designed

“slave” units that serve to concentrate image

data before it travels across the network, and to

share the trigger signal in a daisy-chain fashion.

The cameras are controlled, and trigger

sequences programmed from a proprietary

application that typically runs on a laptop as

illustrated. The trigger parameters are

communicated via USB to the Trigger Control Head End,

which is also custom-designed

hardware. These parameters in particular are set

up in each camera by this process:

Trigger delay (for sequential triggering)

Image format, either RAW for jobs that will

be post-produced, or JPEG for play-to-air

applications

Shutter speed, aperture and ISO

Thereafter, the rest of the platform comprises

off-the-shelf, standards-based components: a

non-blocking gigabit Ethernet switch and dual

RAID6 stores that capture and mirror image data

in both RAW and JPEG formats from the

cameras.

A high-performance workstation assembles the

individual stills that are stored on the RAIDs into

sequences that are suitable for playout after they

have been stabilized, graded and formatted for

the desired output. The software here is a

collection of commercially available

compositing packages including Nuke¹ and

After Effects², and some custom plug-ins and

scripts.

The platform, collectively termed a “rig”, is used

in three distinct phases which are illustrated in

Figure 5, and described in the following section.

Calibration

A registration plate (test model) is acquired by

each camera. It contains enough information

for frame to frame alignment, color and grey-

scale calibration, and a 3D camera solve.

The registration plate is downloaded to the

workstation as an ordered image sequence.

A multi-point, two-pass, 2D stabilization is

computed from the registration plate and used

to solve a tracker that corrects for x, y, z and

rotation. Even though the disparity in ‘z’ and ‘r’

will be very small, smoother visuals result if

they are also corrected for. This tracking

solution is stored so that it can be applied to live

action sequences during production.
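
The idea of correcting x, y, z and rotation from tracked points can be illustrated with a two-point similarity solve. This is a single-pass sketch, not the authors' tracker: it assumes two reference marks on the registration plate have been located in a camera's image, and it interprets the ‘z’ correction as a uniform scale. Points are represented as complex numbers so that scale and rotation reduce to one complex multiplication.

```python
import cmath
import math


def solve_similarity(p1, p2, q1, q2):
    """Solve q = a * p + b from two point pairs.

    p1, p2 are (x, y) positions seen by this camera; q1, q2 are where the
    same marks should sit in the reference frame. Returns complex (a, b).
    """
    p1, p2, q1, q2 = (complex(*pt) for pt in (p1, p2, q1, q2))
    a = (q2 - q1) / (p2 - p1)   # combined scale and rotation
    b = q1 - a * p1             # translation applied after scale/rotation
    return a, b


def describe(a, b):
    """Break the solved transform into human-readable components."""
    return {
        "dx": b.real,
        "dy": b.imag,
        "scale": abs(a),
        "rotation_deg": math.degrees(cmath.phase(a)),
    }


def apply_similarity(a, b, point):
    """Map one (x, y) point from the camera frame to the reference frame."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


if __name__ == "__main__":
    a, b = solve_similarity((0, 0), (1, 0), (2, 3), (2, 5))
    print(describe(a, b))
```

Saving the solved `a` and `b` per camera corresponds to storing the tracking solution so that it can be reapplied to every live-action still from that camera.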

1 Nuke is a trademark of The Foundry Ltd.

2 After Effects is a trademark of Adobe Inc.


Figure 4 Control and Data Network Arrangements

Figure 5 Image Processing Pipeline

[Figure 5 stages. Calibration: shoot registration plate sequence (test card); solve for frame alignment, color and grey-scale; compute 2 point, 2 pass, 2D stabilization track. Production: capture “Hero” sequence; stabilize; color and grey-scale correction; live to air. Post Production: correct lens distortions; 3D camera solve and point cloud; stereo authoring, preparation for coalescence.]


Production

The production process follows this sequence:

1. The rig is triggered, either manually or

automatically, for example by an infra-

red beam being broken.

2. The still pictures are read from each

camera by the slave units that forward

them to the RAIDs. JPEG format is used

for live-to-air work or for on-set reviews,

and RAW when more specialized

output is required.

3. From there, they are accessed by the

high performance workstation for

stabilizing, cropping and automated

grading.

4. In live-to-air situations, the sequence is

ready for playout approximately 30s

after the trigger, and a watch folder can

be set up so that the image processing

is initiated by the image files being

written to the RAIDs.
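
A watch folder of the kind mentioned in step 4 can be approximated by polling a directory until one still per camera has arrived. The sketch below is an assumption for illustration: the folder layout, the `*.jpg` pattern and the function name are hypothetical, and it does not check that files have finished writing, which a real system would need to do.

```python
import time
from pathlib import Path


def wait_for_sequence(folder, n_cameras, pattern="*.jpg",
                      poll_s=0.5, timeout_s=30.0):
    """Block until n_cameras matching files exist, then return them sorted.

    Raises TimeoutError if the stills do not all arrive within timeout_s.
    Sorting by name assumes files are named in camera order.
    """
    folder = Path(folder)
    deadline = time.monotonic() + timeout_s
    while True:
        files = sorted(folder.glob(pattern))
        if len(files) >= n_cameras:
            return files[:n_cameras]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(files)} of {n_cameras} stills arrived")
        time.sleep(poll_s)
```

The returned, ordered list of files would then be handed to the stabilizing, cropping and grading steps.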

Post Production

At this stage, the sequence is already acceptable

for general, mono use. Additional stages that

can be undertaken are:

Generation of an accurate 3D camera

solve and point cloud to facilitate lens

distortion correction, compositing and

the inclusion of CGI objects within the

captured scene.

Interpolation – a scene can be prepared

via traditional rotoscoping and keying

techniques for optical-flow retiming

should a different shot duration be

required. This works best with the

Linear Parallel rig topology.

Stereoscopic authoring – the sequence

is prepared for coalescence via

programs like Ocula³ or Neo 3D⁴.

Note that, prior to these post production stages,

the sequence is normally rendered out as 32 bit

floating point in either OpenEXR or TIFF format.

Artistic Considerations

There are some important differences between

an array of still cameras as described here, and

motion picture cameras that can be exploited to

achieve unusual effects. Some of these

differences are:

A motion picture camera cannot use a

shutter time that is longer than the

inverse of the frame rate. In a camera

array, the shutter speed is independent of

the frame rate. This property permits

the capture of large amounts of motion

blur as an in-camera effect, subject to

the caveat that it may take several

attempts to achieve the desired take.

The fine control over individual camera

triggers means that the array can be

programmed to produce interesting

effects. Firing all cameras simultaneously yields the

classic “tracking shot through frozen

time” effect. Cameras can also be set

up to trigger in sequence to achieve the

effect of a tracking shot at up to

1000fps. Triggers can even be arranged

to run “backwards” over part of the

array (normally half) to create the effect

of time running forwards during the

tracking shot, until the halfway point, at

which point time goes into reverse, but

the tracking shot continues on its

established locus.

3 Ocula is a trademark of The Foundry, Ltd.

4 Neo 3D is a trademark of Cineform, Inc.


Camera arrays can be augmented with

motion picture cameras at the head and tail

of the array to provide linear-time

“handles” used to enter and exit the

frozen scene.
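
The trigger patterns described above can be expressed as a per-camera delay table. The sketch below is illustrative only: the mode names and the function are hypothetical, and it assumes cameras are indexed in order along the array with delays expressed in microseconds relative to the first trigger.

```python
def trigger_delays_us(n_cameras, mode="frozen", fps=1000.0):
    """Return one trigger delay per camera, in microseconds.

    mode "frozen":          all cameras fire together (frozen-time shot).
    mode "sequential":      each camera fires 1/fps after its neighbour.
    mode "forward_reverse": time advances to the middle of the array and
                            then runs backwards over the remainder.
    """
    step_us = 1_000_000.0 / fps
    if mode == "frozen":
        return [0.0] * n_cameras
    if mode == "sequential":
        return [i * step_us for i in range(n_cameras)]
    if mode == "forward_reverse":
        last = n_cameras - 1
        return [min(i, last - i) * step_us for i in range(n_cameras)]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("frozen", "sequential", "forward_reverse"):
        print(mode, trigger_delays_us(5, mode))
```

At an effective 1000fps the step between neighbouring cameras is 1ms, which is why the per-camera trigger deviation discussed earlier matters so much for sequential modes.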

And, of course, stereoscopic sequences are a

natural extension to a technique that has always

been “two and a half D” subjectively. The

simplest way to create stereo from a camera

array is to “ripple” the stills from individual

cameras thus:

Frame number    Source for Left Eye    Source for Right Eye
1               1                      2
2               2                      3
3               3                      4
...             ...                    ...
n-1             n-1                    n

This yields a stereo sequence that is one frame

shorter than the number of cameras in the array.
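
The “ripple” pairing is simple enough to state in a few lines. The sketch below assumes the stills are already ordered by camera position along the array; the function name is illustrative.

```python
def ripple_stereo_pairs(stills):
    """Pair each still with its neighbour: (left eye, right eye) per frame.

    Frame k uses camera k for the left eye and camera k+1 for the right,
    so n stills yield n-1 stereo frames.
    """
    return [(stills[i], stills[i + 1]) for i in range(len(stills) - 1)]


if __name__ == "__main__":
    cameras = [1, 2, 3, 4]  # camera numbers standing in for image files
    for frame, (left, right) in enumerate(ripple_stereo_pairs(cameras), start=1):
        print(f"frame {frame}: left={left} right={right}")
```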

Conclusion and Future Trends

We have described the challenges of building a

DSLR based camera array and how they can be

overcome by careful testing and selection of the

cameras and lenses, and the process by which

the array is characterized and calibrated prior to

use. In particular we have explained the

empirical rules governing acquisition resolution,

and its relationship with that of the output; and

the importance of tolerances on triggering

latency to the reliability and practical size of an

array.

We have also described the image processing

pipeline that sequences individual stills,

stabilizes and grades them for near

instantaneous playback. Further, off-line

processing for stereoscopic rendering, sequence

re-timing and the integration of CGI objects into

the captured scene has been introduced.

All the above is a practical reality today with the

following refinements and developments likely

in the near future:

Video arrays. We have already

experimented with arrays of video

cameras that promise the benefit of

never missing a shot, though they

present practical difficulties of their

own, not least that of getting small

cameras with consistent color behavior

from frame to frame, and dealing with

storage and bandwidth issues.

However, it is also a common

misconception that 25-30fps will

always ‘capture the moment’. In reality

operator eye-to-trigger response far

exceeds standard time-base temporal

resolutions. Video arrays need to

operate at 100fps to deliver

Higher resolution, higher sensitivity,

lower noise DSLRs, possibly including

quantum dot sensors. Higher resolution

would be used either to increase the

resolution of the delivered output, or to

further virtualize the effective focal

length of cameras in the array and

permit the construction of stereo dolly

moves that provide correct convergence

as the virtual camera moves towards or

away from the subject. This technique

has already been proven in Street

Dance 3D using our existing capture

resolutions, where we were able to

adjust both convergence and

interocular to create dramatic virtual

pans and zooms. This is also the

situation where the convergent rig

topologies offer tremendous future

benefit.


Commoditization of the technique to an

extent that makes it practical to use in

live sports applications – including the

start/finish line in athletics, or integrated

into the backboard in basketball or the

crossbar in soccer – provided that we

can achieve high enough frame rates

and temporal resolutions with video

rigs.

In short, stereo output simply requires some

scripting and add-ons to the post-production

pipeline, and it’s very exciting to see a coming

convergence between camera-array capture, which has

3D capability already embedded within it, and

industry demand for 3D visual effects content.

There's an interesting historical footnote as well.

The Digital Age has heralded the death of

Motion Picture as we used to understand it.

Shot length is no longer defined by mag size,

the camera is no longer tied to dolly, crane or

sticks; gone is the old 'chemistry'! We know the

camera never entirely told the truth, but we now

avidly embrace the possibilities of a new

chapter in visual language; the new 'camera'

can be an expression of thought unhampered by

technical doctrine. We find ourselves back with

Muybridge and the Cubists thinking 'what if'?
