stereo image acquisition using camera arrays
DESCRIPTION
This is a white paper I wrote and co-presented with Tim Macmillan, inventor of multi-camera capture that entered the public consciousness with The Matrix movie franchise. We did this at SMPTE's inaugural conference on stereo and 3D production.
TRANSCRIPT
1 | P a g e S t e r e o A c q u i s i t i o n u s i n g C a m e r a A r r a y s T i m M a c m i l l a n J o h n N a y l o r
Stereo Image Acquisition Using Camera Arrays
Tim Macmillan, Time-Slice® Films, Ltd, Bath, UK
John R. Naylor, Convergent Intelligence℠, Portland, OR, USA
“Simultaneity” was one of the central concepts
of the Cubist movement in early 20th century art,
in which an object or scene is recorded from
multiple angles to create an artwork that
synthesises time and space. The instigators of
Cubism (Picasso and Braque) took some of their
inspiration from the work of Eadweard
Muybridge, who though rightly regarded as the
father of motion pictures, was also the first
person to record “frozen time” sequences with
camera arrays. Today, such sequences can be
produced at high resolution within 30s of a
telegenic event by using arrays of digital stills or
video cameras. They can also produce content
in stereo; simultaneity indeed!
The principle of using camera arrays is familiar,
having caught the world’s imagination with the
bullet-time sequences in The Matrix. Time and
technology have progressed since this movie’s
release in 1999, to the extent that a technique
that once relied on a wet film process and required
substantial post-production to achieve the
desired results is now fully digitized and
automated. Simply stated, the process is this ...
1) Build an array of still cameras
2) Trigger them all at once
3) Organize the stills into a stereo
sequence
... but getting acceptable results is more
complicated. This paper is organized to
describe the practical challenges in this process
and to show how they are overcome at present.
We then go on to consider the demands of the
artist in using camera arrays, and describe some
techniques and approaches that are unique to
this method of capturing scenes.
Camera Considerations
It’s typical to use between 30 and 60 DSLR type
cameras in a camera array. Each camera in the
array corresponds to one frame of output, so
the duration of any sequence, in frames, is equal
to the number of cameras in the array. For stereo,
overlapped pairs of cameras are normally used,
which limits the output sequence to one frame
fewer than the total number of cameras. The key
characteristics of a camera array are its temporal
resolution and spatial resolution. These are
affected by the characteristics of the individual
cameras as follows:
Physical size. Smaller is better because the
closer together the cameras can be in the
array, the higher the array’s spatial and
temporal resolution. Camera size also
determines how close the cameras can get
to the subject.
Chip size. The bigger the better for these
reasons:
o High sensitivity and low noise
o Good match to the lenses. Note that
large sensors also provide a shallower
depth of field at large apertures
compared to smaller sensors.
However, this is not necessarily a
valuable characteristic in camera-array
capture because shallow depths of field
vary between cameras in the array
causing undesirable “jitter” in the depth
of field in resulting shots.
Resolution. The higher the better. We have
established a rule of thumb that the capture
resolution of the RAW Bayer sensor must
be at least 200% higher than the desired
RGB result to overcome the following
effects:
o Converting “photosites”, which are the
unistimulus point values captured by
the DSLR Bayer sensors, to “pixels”,
which are tristimulus RGB or YCbCr
point values.
o 10% allowance for overscan, which
allows the unavoidable registration
differences between members of the
array to be eliminated by automated
cropping.
o A further 50% allowance to provide a
large enough “action safe” area of the
desired resolution and aspect ratio.
Figure 1 Comparing Capture and Output Resolutions
Figure 1 illustrates the spatial compromises
required to produce 1920x1080 HD: starting
with around 12M photosites, the capture area is
reduced by 10% to permit registration variances
to be corrected. Unless the location of the
subject can be tightly controlled the output
frame is panned and cropped from within the
action area that is roughly half the size in each
linear dimension.
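The arithmetic behind Figure 1 can be sketched as follows. This is an illustrative calculation only: the 4256x2832 sensor is an assumed example of a roughly 12M-photosite DSLR, the 10% overscan is interpreted here as a loss of area, and the function name is hypothetical.

```python
import math

def output_budget(sensor_w, sensor_h, overscan_area=0.10, action_linear=0.5):
    """Return the (width, height) in pixels left for the panned-and-cropped
    output frame after the overscan and action-safe allowances."""
    # Assumption: the overscan allowance is a fraction of the capture AREA,
    # so each linear dimension shrinks by sqrt(1 - overscan_area).
    linear_keep = math.sqrt(1.0 - overscan_area)
    # Assumption: the output frame is about half the remaining size
    # in each linear dimension.
    width = int(sensor_w * linear_keep * action_linear)
    height = int(sensor_h * linear_keep * action_linear)
    return width, height

w, h = output_budget(4256, 2832)
print(w, h, "fits 1920x1080:", w >= 1920 and h >= 1080)
```

Under these assumptions a sensor of this size leaves just enough room for a 1920x1080 output frame, which is consistent with the rule of thumb above.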
Note that some of these requirements
pull in opposite directions: large sensors
tend not to come packaged in small cameras.
They also need bigger lenses, which raises the
very real issue of budget when up to sixty
DSLRs and sixty lenses are required.
Lens Considerations
It’s possible to go to great lengths such as
polishing lenses from the same block of glass
when manufacturing a matched pair of lenses
for stereo movie making. This simply isn’t
practical when sixty lenses are needed. The
challenge is to prevent the inevitable differences
that occur between nominally identical lenses
being visibly present in the content produced.
We address this challenge with the following
approaches:
Where possible, shoot with prime lenses
because this minimizes focal length
deviations. Alternatively, use zoom lenses
as “double primes” where only each end-
stop position is used.
Measure and correct the geometrical
aberrations lens by lens. We do this by
using a test card to calibrate each lens, and
have processing in our pipeline to correct
individual variations – covered later.
Use the same lens with the same camera
from shoot to shoot. This is a simple matter
of labeling each lens and camera and
ensuring that they are always used together.
We’ve recently incorporated a “LiveView”
function which provides a real-time view of
the full imager (often not the case in
viewfinders or on-camera LCDs) and also
interrogates the metadata provided by each
lens to allow focal lengths between the end-
stops to be selected and matched.
The other lens characteristics to design for are:
Optimum aperture to provide a large depth
of field at f8 and above.
Resolving power should be at least as good
as the sensor’s photosite density. This will
become more of an issue as the megapixel
count increases.
Low weight – sixty times any mass can be
problematic!
Focal length: a wider lens has a more
inherent 3D look than a long focus lens. For
this reason, the maximum focal length we
recommend is 50mm. Also bear in mind
that the cropping performed for frame
stabilization and subject tracking effectively
increases the focal length experienced by
viewers. However, this cropping has the added
benefit of eliminating the areas that are most
heavily affected by lens distortion.
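The effect of cropping on apparent focal length can be estimated with the usual crop-factor relation. The sketch below is illustrative only; the function name and the example figures (a 35mm lens and a half-width crop) are assumptions, not values from our rigs.

```python
def effective_focal_length(focal_mm, captured_width_px, cropped_width_px):
    """Focal length that would give the cropped field of view
    if the whole sensor width were used."""
    if cropped_width_px <= 0 or cropped_width_px > captured_width_px:
        raise ValueError("crop must be non-empty and fit inside the capture")
    # Cropping to a fraction of the width narrows the field of view as if
    # the focal length had been multiplied by the inverse of that fraction.
    return focal_mm * captured_width_px / cropped_width_px

# A half-width crop doubles the effective focal length.
print(effective_focal_length(35, 4256, 2128))
```

So a lens chosen at the wide end of the range can still look considerably longer to the viewer once the stabilization and tracking crop has been applied.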
Triggering Tradeoffs
Without a deterministic and reliable method of
triggering the shutters of all the cameras in the
array at precisely the right time, one simply does
not have a useful camera array.
Figure 2 Trigger Design Rules
DSLR cameras used in camera arrays possess
two characteristics that are critical to
synchronised triggering performance. The first
is average trigger response time, which is the
measured latency between the trigger being
asserted, and the center of the shutter-open
period. We typically measure the average over
100 trials, and use this figure to calibrate the
trigger control system. Fortunately, the
difference between samples of the same DSLR
make and model is a few tens of microseconds,
which means that we don’t have to accommodate
camera-to-camera deviations within the rig.
The second and more critical parameter is the
variation about the mean trigger response time
exhibited by the camera. A good rule of thumb
is that, over 100 trials, the worst deviation
should be no more than ±500µs. This rule provides a “safe
trigger window” of 1ms in duration centered on
the nominal trigger point. Note that this rule
fails for fast moving objects that occupy a large
portion of the imager. Capturing such content
requires the use of high-speed flash lighting.
Figure 2 illustrates the relationship between the
number of cameras in a rig, the desired
reliability of the rig, and the variance in trigger
latency that can be tolerated.
Let Ra be defined as the probability that all the
cameras in the rig trigger within the safe
window, N be the number of cameras, and
Pc be the probability that an individual camera
triggers outside the safe window. Pc is given by
the following expression:
And the reliability of the rig by its inverse:
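One plausible form of these two expressions is given here as an assumption rather than as our exact formulation. It supposes that trigger latency is normally distributed about the calibrated mean with standard deviation σ, that the safe window has half-width w (500µs under the rule of thumb above), and that the cameras trigger independently. The symbols σ and w are introduced purely for illustration.

```latex
P_c = 1 - \operatorname{erf}\!\left(\frac{w}{\sigma\sqrt{2}}\right),
\qquad
R_a = \left(1 - P_c\right)^{N}
```

Under these assumptions Ra falls quickly as N grows, so a large array needs a much smaller σ to reach the same reliability.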
As the figure shows, far tighter tolerances are
required for a 99% reliable 60 camera array,
than for a 15 camera array that produces usable
sequences just half the time.
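The trade-off between array size, reliability and latency spread can be explored numerically. The sketch below is not our design tool: it assumes independent cameras and normally distributed trigger latency, and the function names and the 500µs half-window default are illustrative.

```python
from statistics import NormalDist

def rig_reliability(n_cameras, sigma_us, half_window_us=500.0):
    """Probability that every camera fires inside the safe window,
    assuming independent, normally distributed trigger latency."""
    z = half_window_us / sigma_us
    # Probability that one camera lands within +/- half_window_us.
    p_inside = 2.0 * NormalDist().cdf(z) - 1.0
    return p_inside ** n_cameras

def max_sigma_us(n_cameras, target_reliability, half_window_us=500.0):
    """Largest latency standard deviation that still meets the target."""
    # Each camera must succeed with probability target ** (1 / N).
    p_inside = target_reliability ** (1.0 / n_cameras)
    z = NormalDist().inv_cdf((1.0 + p_inside) / 2.0)
    return half_window_us / z

print("60 cameras, 99% reliable:", round(max_sigma_us(60, 0.99)), "us")
print("15 cameras, 50% reliable:", round(max_sigma_us(15, 0.50)), "us")
```

Under this model the 60 camera, 99% reliable case tolerates a markedly smaller standard deviation than the 15 camera, 50% case, which is the qualitative behavior described above.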
Accurate triggering becomes even more
important for stereo capture, because the human
visual system has much higher acuity for stereo
disparity (sub-pixel) than for luminance, even if
the image has motion blur.
Convergence
Although, in principle, the shape of rig and
arrangement of cameras is arbitrary, we’ve
found that most artistic demands can be met in
practice with three basic riggings as illustrated in
Figure 3.
We’ll revisit the convergent topologies later
when considering virtual dolly moves. The
Linear Parallel topology has the interesting
property of producing sequences which work
best with motion adaptive interpolation software
to create longer sequences. This is because the
motion between frames is uniform; the
convergent topologies, by contrast, produce
sequences where there is always motion
towards and away from the viewer between
successive frames.
Color Matching
The final main consideration when building a
Time-Slice rig is to ensure that the color
responses of the cameras fall within a tight
band.
This is achieved by calibrating each camera
against a test card and performing color
correction in real-time in the image processing
pipeline.
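In its simplest form, matching a camera to a reference can be reduced to per-channel gains measured from a neutral patch on the test card. The sketch below shows that idea only. It is an assumption about one possible approach, not a description of our pipeline, and the patch values and names are hypothetical.

```python
def channel_gains(measured_gray, reference_gray):
    """Per-channel gains that map a camera's reading of the gray patch
    onto the reference reading (linear-light RGB assumed)."""
    return tuple(ref / meas for ref, meas in zip(reference_gray, measured_gray))

def apply_gains(pixel, gains):
    """Apply the calibration gains to one linear RGB pixel."""
    return tuple(value * gain for value, gain in zip(pixel, gains))

reference = (0.18, 0.18, 0.18)   # assumed target value for the gray patch
camera_07 = (0.20, 0.18, 0.15)   # hypothetical reading from one camera
gains = channel_gains(camera_07, reference)
print(gains)
print(apply_gains(camera_07, gains))
```

A fuller calibration would typically use several patches and a color matrix per camera, but the stored-per-camera, applied-at-processing-time structure is the same.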
Figure 3 Rig Topologies in Common Use (Circular Convergent, Linear Convergent, Linear Parallel)
The outstanding benefit of DSLRs is being able
to shoot in RAW format, which avoids the
vagaries of compression artifacts compounded
with clipping of gamma and contrast. So, using
DSLR RAW we can confidently match any grade
or exposure response required by our clients.
To summarize:
Select cameras based on triggering
accuracy, sensor size, physical size, and
resolution.
Select lenses that match the sensor size, and
minimize geometric distortions.
Correct for registration, geometric
distortion, color differences and talent
framing in the image processing pipeline
that forms the “brains”.
So let’s consider the image processing pipeline
and the tradeoffs between operational speed,
image quality, and system cost.
Image Processing Pipeline and
Network Infrastructure
Figure 4 illustrates the arrangement of custom
and off-the-shelf hardware that forms the
platform for the image acquisition and
processing pipeline, together with the storage
and networking arrangements. Each group of
four cameras is controlled by custom-designed
“slave” units that serve to concentrate image
data before it travels across the network, and to
share the trigger signal in a daisy-chain fashion.
The cameras are controlled, and trigger
sequences programmed from a proprietary
application that typically runs on a laptop as
illustrated. The trigger parameters are
communicated via USB to the Trigger Control
Head End, which is also custom-designed
hardware. In particular, the following parameters
are set up in each camera by this process:
Trigger delay (for sequential triggering)
Image format, either RAW for jobs that will
be post-produced, or JPEG for play-to-air
applications
Shutter speed, aperture and ISO
Thereafter, the rest of the platform comprises
off-the-shelf, standards-based components: a
non-blocking gigabit Ethernet switch and dual
RAID6 stores that capture and mirror image data
in both RAW and JPEG formats from the
cameras.
A high-performance workstation assembles the
individual stills that are stored on the RAIDs into
sequences that are suitable for playout after they
have been stabilized, graded and formatted for
the desired output. The software here is a
collection of commercially available
compositing packages including Nuke¹ and
After Effects², and some custom plug-ins and
scripts.
The platform, collectively termed a “rig”, is used
in three distinct phases which are illustrated in
Figure 5, and described in the following section.
Calibration
A registration plate (test model) is acquired by
each camera. It contains enough information
for frame to frame alignment, color and grey-
scale calibration, and a 3D camera solve.
The registration plate is downloaded to the
workstation as an ordered image sequence.
A multi-point, two-pass, 2D stabilization is
computed from the registration plate and used
to solve a tracker that corrects for x, y, z and
rotation. Even though the disparity in ‘z’ and ‘r’
will be very small, smoother visuals result if
they are also corrected for. This tracking
solution is stored so that it can be applied to live
action sequences during production.
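The kind of correction a two-point track yields can be illustrated with a small similarity-transform solve. This is a sketch under assumptions, not our tracker: it treats the ‘z’ correction as a uniform scale, uses exactly two tracked points per camera, and all names are hypothetical.

```python
import cmath

def two_point_solve(ref_a, ref_b, cur_a, cur_b):
    """Return (scale, rotation_radians, (tx, ty)) of the similarity
    transform that maps this camera's two tracked points onto their
    positions in the reference frame. Points are (x, y) tuples."""
    ra, rb = complex(*ref_a), complex(*ref_b)
    ca, cb = complex(*cur_a), complex(*cur_b)
    if ca == cb:
        raise ValueError("tracked points must be distinct")
    m = (rb - ra) / (cb - ca)   # combined scale ('z') and rotation ('r')
    t = ra - m * ca             # translation (x, y) applied after m
    return abs(m), cmath.phase(m), (t.real, t.imag)

def stabilize_point(point, scale, rotation, translation):
    """Apply a stored solve to a point from the same camera."""
    z = complex(*point) * cmath.rect(scale, rotation) + complex(*translation)
    return (z.real, z.imag)

# A camera whose image is offset by (2, 3) pixels from the reference.
solve = two_point_solve((0, 0), (10, 0), (2, 3), (12, 3))
print(solve)
print(stabilize_point((12, 3), *solve))
```

Because the solve depends only on the rig geometry, it can be computed once per camera from the registration plate and reused for every subsequent take.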
1 Nuke is a trademark of The Foundry Ltd.
2 After Effects is a trademark of Adobe Inc.
Figure 4 Control and Data Network Arrangements
Figure 5 Image Processing Pipeline
[Figure 5 labels. Phases: Calibration, Production, Post Production.
Steps: Shoot Registration Plate Sequence (Test Card); Solve for Frame
alignment, Color, Grey-scale; Compute 2 point, 2 pass, 2D stabilization
track; Capture “Hero” Sequence; Stabilize; Color and Grey-scale
correction; Live to Air; Correct lens distortions; 3D Camera solve, and
point cloud; Stereo authoring, preparation for coalescence.]
Production
The production process follows this sequence:
1. The rig is triggered, either manually or
automatically, for example by an infra-
red beam being broken.
2. The still pictures are read from each
camera by the slave units that forward
them to the RAIDs. JPEG format is used
for live-to-air work or for on-set reviews,
and RAW when more specialized
output is required.
3. From there, they are accessed by the
high performance workstation for
stabilizing, cropping and automated
grading.
4. In live-to-air situations, the sequence is
ready for playout approximately 30s
after the trigger, and a watch folder can
be set up so that the image processing
is initiated by the image files being
written to the RAIDs.
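The watch-folder idea in step 4 can be sketched as a simple polling check. The file naming scheme (`cam_01.jpg`, `cam_02.jpg`, ...), the function names and the timing defaults are all assumptions made for illustration; a production watcher would also need to confirm that each file has been completely written.

```python
import time
from pathlib import Path

def sequence_ready(folder, n_cameras, pattern="cam_{:02d}.jpg"):
    """True once a still from every camera is present in the folder."""
    folder = Path(folder)
    return all((folder / pattern.format(i)).is_file()
               for i in range(1, n_cameras + 1))

def wait_for_sequence(folder, n_cameras, poll_s=0.5, timeout_s=30.0):
    """Poll the watch folder until the sequence is complete or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if sequence_ready(folder, n_cameras):
            return True
        time.sleep(poll_s)
    return sequence_ready(folder, n_cameras)
```

When `wait_for_sequence` returns `True`, the stabilizing, cropping and grading step could be launched on the completed set of stills.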
Post Production
At this stage, the sequence is already acceptable
for general, mono use. Additional stages that
can be undertaken are:
Generation of an accurate 3D camera
solve and point cloud to facilitate lens
distortion correction, compositing and
the inclusion of CGI objects within the
captured scene.
Interpolation – a scene can be prepared
via traditional rotoscoping and keying
techniques for optical-flow retiming
should a different shot duration be
required. This works best with the
Linear Parallel rig topology.
Stereoscopic authoring – the sequence
is prepared for coalescence via
programs like Ocula³ or Neo 3D⁴.
Note that, prior to these post production stages,
the sequence is normally rendered out as 32-bit
floating point in either OpenEXR or TIFF format.
Artistic Considerations
There are some important differences between
an array of still cameras as described here, and
motion picture cameras that can be exploited to
achieve unusual effects. Some of these
differences are:
A motion picture camera cannot use a
shutter time that is longer than the
inverse of the frame rate. In a camera
array, the shutter speed is independent of
the frame rate. This property permits
the capture of large amounts of motion
blur as an in-camera effect, subject to
the caveat that it may take several
attempts to achieve the desired take.
The fine control over individual camera
triggers means that the array can be
programmed to produce interesting
effects. Firing all cameras simultaneously
yields the classic “tracking shot through
frozen time” effect. Cameras can also be set
up to trigger in sequence to achieve the
effect of a tracking shot at up to
1000fps. Triggers can even be arranged
to run “backwards” over part of the
array (normally half) to create the effect
of time running forwards during the
tracking shot, until the halfway point, at
which point time goes into reverse, but
the tracking shot continues on its
established locus.
3 Ocula is a trademark of The Foundry, Ltd.
4 Neo 3D is a trademark of Cineform, Inc.
Camera arrays can be augmented with
motion picture ones at the head and tail
of the array to provide linear-time
“handles” used to enter and exit the
frozen scene.
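The trigger patterns described above amount to different per-camera delay schedules. The sketch below generates three of them; the mode names, the function name and the use of whole microseconds are assumptions for illustration and do not describe our trigger controller's interface.

```python
def trigger_delays_us(n_cameras, mode="frozen", fps=1000):
    """Per-camera trigger delays in microseconds, first camera first."""
    step = round(1_000_000 / fps)   # time between neighbouring cameras
    if mode == "frozen":
        # Every camera fires at the same instant.
        return [0] * n_cameras
    if mode == "sequential":
        # Each camera fires one step after its neighbour.
        return [i * step for i in range(n_cameras)]
    if mode == "reverse_at_half":
        # Time advances until the middle of the array, then runs backwards.
        return [min(i, n_cameras - 1 - i) * step for i in range(n_cameras)]
    raise ValueError(f"unknown mode: {mode}")

for mode in ("frozen", "sequential", "reverse_at_half"):
    print(mode, trigger_delays_us(5, mode))
```

In the last schedule, cameras equidistant from the two ends of the array capture the same instant, so the played-back sequence appears to rewind while the tracking move continues.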
And, of course, stereoscopic sequences are a
natural extension to a technique that has always
been “two and a half D” subjectively. The
simplest way to create stereo from a camera
array is to “ripple” the stills from individual
cameras thus:
Frame number   Source for Left Eye   Source for Right Eye
1              1                     2
2              2                     3
3              3                     4
...            ...                   ...
n-1            n-1                   n
(where n is the number of cameras in the array)
This yields a stereo sequence that is one frame
shorter than the number of cameras in the array.
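The ripple assignment is simple to script. The following sketch uses 1-based camera numbers to match the table; the function name is hypothetical.

```python
def ripple_pairs(n_cameras):
    """Return (frame, left_camera, right_camera) tuples for a rippled
    stereo sequence, using 1-based camera numbers."""
    if n_cameras < 2:
        raise ValueError("stereo needs at least two cameras")
    # Frame k takes its left eye from camera k and its right eye from k + 1.
    return [(frame, frame, frame + 1) for frame in range(1, n_cameras)]

for row in ripple_pairs(4):
    print(row)
```

Widening the interocular is then a matter of pairing camera k with camera k + 2 or beyond, at the cost of further shortening the sequence.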
Conclusion and Future Trends
We have described the challenges of building a
DSLR based camera array and how they can be
overcome by careful testing and selection of the
cameras and lenses, and the process by which
the array is characterized and calibrated prior to
use. In particular we have explained the
empirical rules governing acquisition resolution,
and its relationship with that of the output; and
the importance of tolerances on triggering
latency to the reliability and practical size of an
array.
We have also described the image processing
pipeline that sequences individual stills,
stabilizes and grades them for near
instantaneous playback. Further, off-line
processing for stereoscopic rendering, sequence
re-timing and the integration of CGI objects into
the captured scene has been introduced.
All the above is a practical reality today with the
following refinements and developments likely
in the near future:
Video arrays. We have already
experimented with arrays of video
cameras that promise the benefit of
never missing a shot, though they
present practical difficulties of their
own, not least that of getting small
cameras with consistent color behavior
from frame to frame, and dealing with
storage and bandwidth issues.
However, it is also a common
misconception that 25-30fps will
always ‘capture the moment’. In reality
operator eye-to-trigger response far
exceeds standard time-base temporal
resolutions. Video arrays need to
operate at 100fps to deliver
Higher resolution, higher sensitivity,
lower noise DSLRs, possibly including
quantum dot sensors. Higher resolution
would be used either to increase the
resolution of the delivered output, or to
further virtualize the effective focal
length of cameras in the array and
permit the construction of stereo dolly
moves that provide correct convergence
as the virtual camera moves towards or
away from the subject. This technique
has already been proven in Street
Dance 3D using our existing capture
resolutions, where we were able to
adjust both convergence and
interocular to create dramatic virtual
pans and zooms. This is also the
situation where the convergent rig
topologies offer tremendous future
benefit.
Commoditization of the technique to an
extent that makes it practical to use in
live sports applications, including the
start/finish line in athletics, or integrated
into the backboard in basketball or the
crossbar in soccer, provided that we
can achieve high enough frame rates
and temporal resolutions with video
rigs.
In short, stereo output simply requires some
scripting and add-ons to the post-production
pipeline, and it’s very exciting to see a coming
convergence between camera-array capture, which
has 3D capability already embedded within it, and
industry demand for 3D visual effects content.
There's an interesting historical footnote as well.
The Digital Age has heralded the death of
Motion Picture as we used to understand it.
Shot length is no longer defined by mag size,
the camera is no longer tied to dolly, crane or
sticks; gone is the old 'chemistry'! We know the
camera never entirely told the truth, but we now
avidly embrace the possibilities of a new
chapter in visual language; the new 'camera'
can be an expression of thought unhampered by
technical doctrine. We find ourselves back with
Muybridge and the Cubists thinking 'what if'?