TRANSCRIPT
Queensland University of Technology, CyPhy Lab
Vision-only place recognition
Peter Corke
http://tiny.cc/cyphy
ICRA 2014 Workshop on Visual Place Recognition in Changing Environments
cyphy laboratory
Navigation system
• Integrative component: dead reckoning – odometry, VO, inertial etc.
• Exteroceptive component: GPS, visual place recognition, landmark recognition
The core problem
• Given a new image of a place, determine which previously seen image is the most similar, from which we infer similarity of place
• Similar to the computer-vision image retrieval problem. Differences:
– we can assume temporal and spatial continuity (locality) across images in the sequence
– viewpoint might be quite different
– the scene might appear different due to external factors
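The matching step above can be sketched as nearest-neighbour search over stored image descriptors, with the temporal/spatial continuity assumption expressed as a locality prior. A minimal sketch (the descriptor choice, window size and function names are illustrative, not from the talk):

```python
import numpy as np

def match_place(query, db, prev_idx=None, window=5):
    """Index of the stored descriptor most similar to `query`.

    db: (N, D) array of global image descriptors (e.g. flattened
    low-resolution images). If prev_idx is given, exploit temporal
    continuity: only places within `window` of the previous match
    may win.
    """
    dist = np.linalg.norm(db - query, axis=1)  # Euclidean distance to each place
    if prev_idx is not None:
        penalty = np.full(len(db), np.inf)     # places outside the window can never win
        lo = max(0, prev_idx - window)
        penalty[lo:prev_idx + window + 1] = 0.0
        dist = dist + penalty
    return int(np.argmin(dist))
```

The locality prior is what distinguishes this from generic image retrieval: a near-identical image of a distant place is rejected because it cannot be reached from the previous match.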
Semantic classification
• Given a new image of a place, determine what type of place it is – e.g. kitchen, bathroom, auditorium
• Can be useful if we have strong priors, like a map with labelled places
• Can be useful if place types are unique within the environment
Issue #1: Appearance & geometry
• Geometry is the 3D structure of the world
• Appearance is a 2D projection of the world
• Geometry → appearance (computer graphics)
• Appearance ↛ geometry (the inverse is ill-posed)
...issue #1: Appearance & geometry
• door or not a door?
Issue #2: Confounding factors
Weather and lighting · shadows · seasons

Image credits (L to R): Milford and Wyeth (ICRA2012), Corke et al (IROS2013), Neubert et al (ECMR2013).
Issue #3: Distractors
• Many pixels in the scene are not discriminative – sky, road, etc.
Issue #4: Aliasing
• Where am I?
– Can I tell?
– Does it matter if I can’t?
Issue #5: Viewpoint
• What do we actually mean by place?
– Is this the same place?
– What if it is the same location, but facing the other way?
...issue #5: Viewpoint
• Viewpoint affects the scene globally – all pixels change
• However, small elements of the scene are unchanged (invariant) – just shifted
Issue #6: Getting good images
• Robots move – motion blur
• Huge dynamic range outdoors – from dark shadows to highlights
• Huge variation in mean illumination:
– 0.001 lx moonless with clouds
– 0.27 lx full moon
– 500 lx office lighting
– 100,000 lx direct sunlight
• Color constancy
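The illumination figures above imply the span a sensor must encode; a quick back-of-envelope calculation (the values come from the bullets above, the function name is mine):

```python
import math

def bits_for_ratio(lo, hi):
    """Bits needed to linearly encode the intensity ratio hi/lo."""
    return math.log2(hi / lo)

# 0.001 lx (moonless, clouds) up to 100,000 lx (direct sunlight):
print(round(bits_for_ratio(0.001, 100_000)))  # about 27 bits
```

By the same arithmetic, the eye's quoted 1,000,000:1 total range corresponds to roughly 20 bits.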
Summary: the nub of the problem
• Appearance is a function of – scene 3D geometry – materials – viewpoint – lighting changes (intensity, color) – exogenous factors (leaves, rain, snow)
• This function is complex, non-linear and not invertible
• Lots of undiscriminative stuff like sky, road etc.

The easy way out: go for geometry
• Roboticists began to use laser scanners in the early 1990s
• Increases in resolution, rotation rate, reflectance data
• Maximum range and cost little changed
Why do we like lasers?
• metric
• sufficient range
• we are suckers for colored 3D models of our world
Measurement principles
• Time of flight
• Phase shift
• Frequency modulated continuous wave (FMCW), or chirp
2D scanning
• High-speed rotating mirror
• Typically a pulse every 0.5°
The curse of 1/R⁴
3D scanning
• 2-axis scanner
• multi-beam laser
• pushbroom
• flash LIDAR
Long range flash LIDAR
• 1 foot resolution at 1 km

US Patent 4,935,616 (1990)
Laser sensing
✓ Clearly sufficient
✓ We have great algorithms for scan matching, building maps, closing loops etc.
✓ Great hardware: SICK, Velodyne
− Price point still too high
− How will we cope with many vehicles using the same sensor?
− Misses out on color and texture.
The perpetual promise of vision
• Visual place recognition is possible
The (amazing) sense of vision
• the eye was invented 540 million years ago
• 10 different eye designs
• the lensed eye was invented 7 times
Compound Eyes of a Holocephala fusca Robber Fly
Anterior Median and Anterior Lateral Eyes of an Adult Female Phidippus putnami Jumping Spider
Datasheet for the eye/brain system
• 4.5M cone cells – 150,000 per mm² (~2 µm square) – daylight only
• 100M rod cells – night time only – respond to a single photon
• Total dynamic range 1,000,000:1 (20 bits)
• Human brain – 1.5 kg – 10¹¹ neurons – ~20 W – ~1/3 for vision
We’ve been here before
• Eureka project 1987–95
• 1000 km on Paris highways, up to 130 km/h
• 1600 km Munich to Copenhagen, overtaking, up to 175 km/h
• distance between interventions: mean 9 km, max 158 km
(1987)
...we’ve been here before
98% autonomous
1995
☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure – laser, stereo, active range camera (e.g. Kinect)
Getting good images
(failure examples: underexposed, flare, blurry)

Pixel brightness

    e ∝ G ( L T A cos⁴θ / (4 F²) + h ),    F = f/d

where e is the pixel brightness, L the scene luminance, T the exposure time, A the pixel area, θ the off-axis angle, F the f-number (focal length f over aperture diameter d), G the gain, and h the noise.
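The exposure relation can be evaluated directly; a sketch assuming consistent units, with symbols as in the slide's formula (the function name is mine, and the proportionality is treated as equality):

```python
import math

def pixel_value(L, T, A, theta, f, d, G=1.0, h=0.0):
    """e = G * (L*T*A*cos^4(theta) / (4*F^2) + h), F = f/d.

    L: scene luminance, T: exposure time, A: pixel area,
    theta: off-axis angle, f: focal length, d: aperture diameter,
    G: gain, h: noise/offset.
    """
    F = f / d  # f-number
    return G * (L * T * A * math.cos(theta) ** 4 / (4 * F ** 2) + h)
```

Two properties worth checking: doubling the exposure time T doubles the (noise-free) pixel value, and halving the aperture diameter d doubles F and so darkens the pixel by 4×.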
Exposure time T
• T has to be compatible with camera + scene dynamics
Increase ϕ
Photon boosting
Log-response cameras
• Similar dynamic range to the human eye (10⁶:1), but without the slow chemical adaptation time
Excerpt from Loose et al., IEEE Journal of Solid-State Circuits, vol. 36, no. 4, April 2001: a 384×288-pixel CMOS image sensor based on a self-calibrating photoreceptor with logarithmic response (complete camera: 35 mm long, 25 mm diameter). Measured at a 50 Hz frame rate, the averaged photoreceptor response is logarithmic over 6 decades of intensity (3 mW/m² to 3 kW/m²) with a slope of about 250 mV per decade, adjustable between 130 and 720 mV/decade via the readout amplifier's feedback capacitors; sensitivity falls off at very low intensities (dark current, parasitic capacitances) and very high intensities (discharge of the storage capacitors). Per-pixel self-calibration reduces fixed-pattern noise to an RMS of 10.7 mV at 1 W/m², i.e. 3.8% of a decade or 0.63% of the total dynamic range, with normally distributed pixel offsets; an uncalibrated logarithmic test column in the same 0.6-µm CMOS process is more than 20× worse. The correction voltages are held on analog memory cells (capacitors) between calibration cycles, and their parasitic discharge, which is fast at high illumination, limits the usable time between calibration and readout (usually a few hundred microseconds). Fig. 16 shows a high-dynamic-range scene of about 5 decades, a bright incandescent bulb next to a printed logo: the logarithmic sensor captures both simultaneously with scarcely visible FPN, whereas a CCD camera can render only one or the other depending on aperture.
Markus Loose, “A Self-Calibrating CMOS Image Sensor with Logarithmic Response”, Ph.D. thesis, Institut für Hochenergiephysik, Universität Heidelberg, 1999.
Human visual response
• Rods – night time
• Cones – daytime – 3 flavours (S, M, L)
The silicon equivalent
• light → colored filter array (CFA) → silicon photosensor, or pixel
Dichromats
Why stop at 3 cones?
• FluxData Inc. FS-1665
• 3 Bayer + 2 NIR channels
• 3 CCDs
Multispectral cameras
Assorted pixel arrays
(Figure 2 from the source paper: Nyquist limits of previous assorted-pixel CFA designs used with sub-micron-pixel image sensors – (a) 3 colors and 4 exposures, (b) 7 colors and 1 exposure – shown against the optical resolution limit (N=f/5.6, λ=555 nm, p=1.0). By the sampling theorem, aliasing does not occur at pixels whose input signal stays below the Nyquist limit.)

• Better dynamic range – 2×2 Bayer filter cells with 3 levels of neutral density filter
• More colors – 3×3 or 4×4 filter cells ➙ 9 or 16 primaries
Wide field of view
☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure – laser, stereo, active range camera (e.g. Kinect)
Whole scene descriptors
• GIST
• HoG
• SIFT/SURF on the whole image
• Color histograms
...issue #5: Viewpoint
• Viewpoint affects the scene globally – all pixels change
• However, small elements of the scene are unchanged (invariant) – just shifted
Visual elements
• Bag of visual words (BoW)–FABMAP, OpenFABMAP
• Feature-detection front ends fail completely across extreme perceptual change
Future work
• Really interesting recent work on learning distinctive elements of a scene
• Contextual priming: choose the features for the situation – day/night, indoor/outdoor

☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure – laser, stereo, active range camera (e.g. Kinect)
Understand variation over time
• Traditional visual localization methods are not robust to appearance change
• How do features change over time?
• Can we predict appearance based on time?
Generalization of Temporal Change over Space
• Assume we have a “training set” of paired image sequences from locations at two different times of day
Training Images
• Use known matched images to generate a temporal “codebook” across the two appearance configurations
Generalizing about change
Generalizing about change: results

☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure – laser, stereo, active range camera (e.g. Kinect)
Camera Resolutions...
courtesy Barry Hendy
Similar story for storage and compute
Pixel subtended angle
• 10 Mpixel sensor, 30° FOV – 0.01° per pixel
• 64 pixel sensor, 30° FOV – 4° per pixel
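The per-pixel angles above follow from dividing the field of view by the number of pixels across the sensor; a sketch of that arithmetic (the 4:3 aspect ratio assumed for the 10 Mpixel sensor is mine):

```python
import math

def deg_per_pixel(fov_deg, total_pixels, aspect=4 / 3):
    """Approximate angle subtended by one pixel: horizontal FOV
    divided by pixels-across, inferred from the total pixel count
    and an assumed aspect ratio."""
    across = math.sqrt(total_pixels * aspect)
    return fov_deg / across

print(round(deg_per_pixel(30, 10_000_000), 3))  # ~0.008 deg per pixel
print(deg_per_pixel(30, 64, aspect=1))          # 8x8 array: 3.75
```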
Use fewer pixels
How many pixels do you need?
…Eynsham dataset or datasets with odometry, it is possible to use a small range or even a single value of v_k.
By considering the sum of sub-route difference scores s(i) as a sum of normally distributed random variables, each with the same mean and variance, the sum of normalized differences over a sub-route of length n frames has mean zero and variance n, assuming that frames are captured far enough apart to be considered independent. Dividing by the number of frames produces a normalized route difference score with mean zero and variance 1/n. Percentile rank scores can then be used to determine an appropriate sub-route matching threshold. For example, for the primary sub-route length n = 50 used in this paper, a sub-route threshold of −1 yields a 7.7×10⁻¹³ probability of the match occurring by chance.
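The quoted probability is just the normal tail of the N(0, 1/n) score distribution; it can be reproduced with the standard library (the function name is mine):

```python
import math

def chance_match_prob(threshold, n):
    """P(score < threshold) when the normalized sub-route score
    is N(0, 1/n): the standard normal CDF evaluated at
    threshold * sqrt(n)."""
    z = threshold * math.sqrt(n)
    return 0.5 * math.erfc(-z / math.sqrt(2))

print(f"{chance_match_prob(-1.0, 50):.1e}")  # 7.7e-13, as quoted
```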
To determine whether the current sub-route matches to any stored sub-routes, the minimum matching score is compared to a matching threshold sm. If the minimum score is below the threshold, the sub-route is deemed to be a match, otherwise the sub-route is assigned as a new sub-route. An example of the minimum matching scores over every frame of a dataset (the Eynsham dataset described in this paper) is shown in Figure 2. In the second half of the dataset the route is repeated, leading to lower minimum matching scores.
Figure 2. Normalized sub-route difference scores for the Eynsham dataset with the matching threshold s_m that yields 100% precision performance.
IV. EXPERIMENTAL SETUP

In this section we describe the four datasets used in this work and the image pre-processing for each study.
A. Datasets

A total of four datasets were processed, each of which
consisted of two traverses of the same route. The datasets were: a 70 km road journey in Eynsham, the United Kingdom, 2 km of motorbike circuit racing in Rowrah, the United Kingdom, 40 km of off-road racing up Pikes Peak in the Rocky Mountains, the United States, and 100 meters in an Office building (italics indicate dataset names). The Eynsham route was the primary dataset on which extensive quantitative analysis was performed. The other datasets were added to provide additional evidence for the general applicability of the algorithm. Key dataset parameters are provided in Table I, including the storage space required to represent the entire dataset using low resolution images.
Figure 3 shows aerial maps and imagery of the Eynsham, Rowrah and Pikes Peak datasets, with lines showing the route that was traversed twice. The Eynsham dataset consisted of
high resolution image captures from a Ladybug2 camera (circular array of five cameras) at 9575 locations spaced along the route. The Rowrah dataset was obtained from an onboard camera mounted on a racing bike. The Pikes Peak dataset was obtained from cameras mounted on two different racing cars racing up the mountain, with the car dashboard and structure cropped from the images. This cropping process could most likely be automated by applying some form of image matching process to small training samples from each of the camera types. The route consisted of heavily forested terrain and switchbacks up the side of a mountain, ending in rocky open terrain partially covered in snow.
TABLE I. DATASETS

Dataset      Distance  Number of frames  Distance between frames  Image storage
Eynsham      70 km     9575              6.7 m (median)           306 kB
Rowrah       2 km      440               4.5 m (mean)             7 kB
Pikes Peak   40 km     4971              8 m (mean)               159 kB
Office       53 m      832               0.13 m (mean)            1.6 kB

Videos: Rowrah http://www.youtube.com/watch?v=_UfLrcVvJ5o
Pikes Peak http://www.youtube.com/watch?v=4UIOq8vaSCc and http://www.youtube.com/watch?v=7VAJaZAV-gQ
Office data http://df.arcs.org.au/quickshare/790eb180b9e87d53/data3.mat
Figure 3. The (a) 35 km Eynsham, (b) 1 km Rowrah and (c) 20 km Pikes Peak routes, each of which were repeated twice. Copyright 2011 Google.
Figure 4. (a) The Lego Mindstorms dataset acquisition rig with 2 sideways facing light sensors and GoPro camera for evaluation of matched routes. (b)
The 53 meter long route which was repeated twice to create the dataset.
B. Image Pre-Processing

1) Eynsham Resolution Reduced Panoramic Images
For the Eynsham dataset, image processing consisted of image concatenation and resolution reduction (Figure 5). The raw camera images were crudely cropped to remove overlap between images. No additional processing such as camera undistortion, blending or illumination adjustment was performed. The subsequent panorama was then resolution reduced (re-sampling using pixel area relation in OpenCV 2.1.0) to the resolutions shown in Table II.
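Resampling by pixel-area relation reduces to simple block averaging when the output size divides the input evenly; a sketch of that operation (this is an illustration of the idea, not the OpenCV call itself):

```python
import numpy as np

def area_downsample(img, out_h, out_w):
    """Block-average a greyscale image down to (out_h, out_w);
    approximates OpenCV's INTER_AREA resampling for integer
    size ratios."""
    h, w = img.shape
    assert h % out_h == 0 and w % out_w == 0
    # group rows and columns into blocks, then average each block
    return img.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))
```

For example, `area_downsample(panorama, 8, 32)` turns a 480×640 strip into a 256-value descriptor of the kind the low-resolution experiments use.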
Example Route Match - Eynsham
Eynsham Resolution Reduction Results
(plot annotation: direction of goodness)
Eynsham Pixel Bit Depth Results
32 pixel images
2-bit images!
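Reducing bit depth is a uniform quantization of intensities; a sketch of what a 2-bit image retains (function and variable names are mine):

```python
import numpy as np

def quantize(img, bits):
    """Map intensities in [0, 1] onto 2**bits discrete levels."""
    levels = 2 ** bits
    return np.clip((img * levels).astype(int), 0, levels - 1)

print(quantize(np.linspace(0, 1, 8), 2))  # [0 0 1 1 2 2 3 3]
```

At 2 bits every pixel carries one of only four values, yet, per the results above, route recognition still works.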
☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure – laser, stereo, active range camera (e.g. Kinect)
Eynsham Sequence Length Results
32 pixel images
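Matching a sequence of recent images rather than a single frame can be sketched as summing single-frame difference scores along a constant-velocity trajectory through the frame-difference matrix, in the spirit of SeqSLAM (Milford and Wyeth, ICRA2012); this is an illustrative simplification, not the published algorithm:

```python
import numpy as np

def best_sequence_match(D, cur, n):
    """D[i, j]: difference between query frame i and database
    frame j. Return the database frame whose trailing n-frame
    sequence (assumed velocity 1) best matches the query's
    trailing sequence ending at frame `cur`."""
    k = np.arange(n)

    def score(j):
        # sum of frame differences along the diagonal ending at (cur, j)
        return D[cur - k, j - k].sum()

    return min(range(n - 1, D.shape[1]), key=score)
```

Even when individual frames are ambiguous, a run of n consistently low differences is very unlikely to occur by chance, which is why longer sequences improve precision.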
Milford and Wyeth, ICRA2012
Remember ALVINN?
• Back in the 70s and 80s, robotics, AI and computer vision researchers used only low-resolution images
– camera limitations
– compute limitations
– algorithm limitations

☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure – laser, stereo, active range camera (e.g. Kinect)
Shadows are everywhere! Yet, the human visual system is so adept at filtering them out, that we never give shadows a second thought; that is until we need to deal with them in our algorithms. Since the very beginning of computer vision, the presence of shadows has been responsible for wreaking havoc on a variety of applications....
Lalonde, Efros, Narasimhan, ECCV 2010
(Plots: pixel intensity vs. distance along profile (pixels); color ratios R/G and B/G vs. distance along profile.)
Blackbody illuminants
(illuminant labels: T = 2000–3000 K, T = 3000 K, T = 5000–5400 K, T = 8000–10000 K)

    log r_R = c₁ − c₂/T        log r_B = c₁′ − c₂′/T

where r_R = R/G and r_B = B/G; both log-chromaticities are linear in 1/T (plotted as log r_B against log r_R).
Log-log chromaticity

(Plots: in the (log r_R, log r_B) plane, points move along a line of increasing T, with the orthogonal direction encoding material property; and invariance-image variance as a function of the invariant line angle (rad).)
Angle of the projection line

(the angle θ is measured in the (log r_R, log r_B) plane)
Car park sequence
Outdoor localization
Fig. 14. The approach does not compensate for shadows containing reflected lighting from objects in the scene. The figure illustrates an example where shadows next to coloured walls are not fully removed.

…textures induced by shadows rather than the underlying structure. We have applied standard point feature extraction (e.g. SIFT, SURF etc. [20]) to the invariant image with success. Despite the lower SNR of the invariant image, all but the smallest scale features reliably associate with material rather than lighting features of the scene.

F. Limitations

One of the limitations of this method is that the model assumes scene lighting by a single Planckian source (Section IV) and hence cannot fully compensate when shadows are partly illuminated by light reflected from objects populating a scene. For example, Figure 14 shows a strong shadow next to a building, but the shadow is clearly still evident in the invariant image. In this case the shadow region is illuminated by sky light reflected from the coloured wall of the building, which makes its spectrum non-Planckian.

VI. CONCLUSION

In this paper we have described an approach to eliminate shadows from colour images of outdoor scenes that is known in the computer vision community and applied it to a hard robotic problem of outdoor vision-based place recognition. We have described the details of key implementation steps such as minimising camera spectral channel overlap and estimating the direction of the projection line, and discussed approaches to overcome practical problems with low and high pixel values.
VII. ACKNOWLEDGEMENTS
Peter Corke was supported by Australian Research Council project DP110103006 Lifelong Robotic Navigation using Visual Perception. Winston Churchill was supported by an EPSRC Case Studentship with Oxford Technologies Ltd. Paul Newman was supported by an EPSRC Leadership Fellowship, EPSRC Grant EP/I005021/1. The authors thank Mark Sheehan and Dr. Alastair Harrison for insightful discussion on JR divergence and Dr. Benjamin Davis for maintaining the robotic platform used for this work. We thank Dominic Wang for valuable suggestions on this paper.
REFERENCES
[1] J. Lalonde, A. Efros, and S. Narasimhan, “Detecting ground shadows in outdoor consumer photographs,” Computer Vision – ECCV 2010, pp. 322–335, 2010.
[2] W. Churchill and P. Newman, “Practice makes perfect? Managing and leveraging visual experiences for lifelong navigation,” IEEE International Conference on Robotics and Automation, 2012.
[3] J. Zhu, K. Samuel, S. Masood, and M. Tappen, “Learning to recognize shadows in monochromatic natural images,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 223–230.
[4] R. Guo, Q. Dai, and D. Hoiem, “Single-image shadow detection and removal using paired regions,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2033–2040.
[5] G. Finlayson, M. Drew, and C. Lu, “Intrinsic images by entropy minimization,” Computer Vision – ECCV 2004, pp. 582–595, 2004.
[6] G. Finlayson, S. Hordley, C. Lu, and M. Drew, “On the removal of shadows from images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 59–68, 2006.
[7] S. Narasimhan, V. Ramesh, and S. Nayar, “A class of photometric invariants: separating material from shape and illumination,” IEEE International Conference on Computer Vision, Oct. 2003, pp. 1387–1394, vol. 2.
[8] S. Nayar and S. Narasimhan, “Vision in bad weather,” IEEE International Conference on Computer Vision, 1999, vol. 2, pp. 820–827.
[9] S. Narasimhan and S. Nayar, “Chromatic framework for vision in bad weather,” IEEE Conference on Computer Vision and Pattern Recognition, 2000, vol. 1, pp. 598–605.
[10] V. Kwatra, M. Han, and S. Dai, “Shadow removal for aerial imagery by information theoretic intrinsic image analysis,” IEEE International Conference on Computational Photography (ICCP), 2012, pp. 1–8.
[11] S. Park and S. Lim, “Fast shadow detection for urban autonomous driving applications,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2009, pp. 1717–1722.
[12] M. Drew and H. Joze, “Sharpening from shadows: sensor transforms for removing shadows using a single image,” Color Imaging Conference, 2009, pp. 267–271.
[13] M. Milford, “Visual route recognition with a handful of bits,” Proceedings of Robotics: Science and Systems, Sydney, Australia, July 2012.
[14] P. I. Corke, Robotics, Vision & Control: Fundamental Algorithms in MATLAB. Springer, 2011, ISBN 978-3-642-20143-1.
[15] F. Wang, T. Syeda-Mahmood, B. Vemuri, D. Beymer, and A. Rangarajan, “Closed-form Jensen–Renyi divergence for mixture of Gaussians and applications to group-wise shape registration,” Medical Image Computing and Computer-Assisted Intervention – MICCAI 2009, pp. 648–655, 2009.
[16] Z. Botev, J. Grotowski, and D. Kroese, “Kernel density estimation via diffusion,” The Annals of Statistics, vol. 38, no. 5, pp. 2916–2957, 2010.
[17] M. Sheehan, A. Harrison, and P. Newman, “Self-calibration for a 3D laser,” The International Journal of Robotics Research, 2011.
[18] A. Hamza and H. Krim, “Image registration and segmentation by maximizing the Jensen–Renyi divergence,” Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, 2003, pp. 147–163.
[19] P. Felzenszwalb and D. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[20] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
cyphy laboratory
Place 5
Place 8
Image similarity
• 5 places
• 2-9 images of each
• total 28 images (48x64)
• compared using ZNCC
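ZNCC (zero-mean normalised cross-correlation) compares two equal-sized images after removing their mean intensity and normalising their energy, which gives some robustness to global brightness and contrast changes. A minimal sketch of the comparison used here; the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def zncc(a, b):
    """Zero-mean normalised cross-correlation of two equal-sized
    greyscale images; result lies in [-1, 1]."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return (a * b).sum() / denom if denom else 0.0

def similarity_matrix(images):
    """Pairwise ZNCC similarity for a list of images."""
    n = len(images)
    S = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = zncc(images[i], images[j])
    return S
```

Because the mean is removed and the result is normalised, an affine intensity change (gain and offset) leaves the score unchanged, which is exactly why ZNCC is a common choice for whole-image place comparison.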
PR curve
[Figure: precision (y) vs. recall (x), both on 0 to 1 axes; two curves: greyscale and shadow invariant]
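A precision-recall curve like this one is typically produced by sweeping a decision threshold over the pairwise similarity scores and, at each threshold, counting declared matches against ground truth. A hedged sketch with illustrative names:

```python
import numpy as np

def pr_curve(scores, labels, thresholds):
    """Precision and recall over a sweep of similarity thresholds.
    scores: similarity of each query/database pair;
    labels: True if the pair is actually the same place."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    precision, recall = [], []
    for t in thresholds:
        match = scores >= t                 # declared matches at this threshold
        tp = np.sum(match & labels)         # correctly declared
        fp = np.sum(match & ~labels)        # falsely declared
        fn = np.sum(~match & labels)        # missed
        precision.append(tp / (tp + fp) if tp + fp else 1.0)
        recall.append(tp / (tp + fn) if tp + fn else 0.0)
    return np.array(precision), np.array(recall)
```

For place recognition the top-left of the plot matters most: a high threshold should give few, but almost exclusively correct, matches.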
Park example
[Figure: three 800x600-pixel image panels, axes u (pixels) and v (pixels)]
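The shadow-invariant images used in comparisons like this follow the log-chromaticity idea of Finlayson et al.: per-pixel log band-ratio chromaticities are projected along a direction chosen so that illuminant colour-temperature changes (shadows) cancel, leaving a 1-D image. A minimal sketch; the angle `theta` is camera-dependent and must come from calibration or entropy minimisation, and the function name is illustrative:

```python
import numpy as np

def shadow_invariant(rgb, theta):
    """1-D illumination-invariant image from log band-ratio
    chromaticities, projected along angle theta.  theta is
    camera-dependent and found offline (calibration or entropy
    minimisation); the value passed in here is illustrative."""
    eps = 1e-6                        # avoid log(0)
    r = rgb[..., 0] + eps
    g = rgb[..., 1] + eps
    b = rgb[..., 2] + eps
    gm = (r * g * b) ** (1.0 / 3.0)   # geometric mean of the channels
    x1, x2 = np.log(r / gm), np.log(g / gm)
    # any theta cancels overall intensity scaling; the right theta
    # also cancels illuminant colour-temperature (shadow) changes
    return x1 * np.cos(theta) + x2 * np.sin(theta)
```

Note that the geometric-mean normalisation alone already removes uniform intensity scaling, so a grey shadow that simply darkens all channels equally leaves the output unchanged for any `theta`.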
Fail!
☛
Approaches to robust place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
–laser, stereo, active range camera (e.g. Kinect)
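The "sequence of recent images" idea, as in SeqSLAM-style matching, scores a candidate place by summing frame-to-frame similarity along a diagonal of the query-vs-database similarity matrix, so a single confusing frame is diluted by its neighbours. A much-simplified sketch, assuming equal query and database speeds; the names are illustrative:

```python
import numpy as np

def sequence_score(S, query_end, db_end, length):
    """Average similarity along a diagonal of similarity matrix S
    (rows: query frames, cols: database frames) ending at
    (query_end, db_end)."""
    idx = np.arange(length)
    return S[query_end - idx, db_end - idx].mean()

def best_match(S, query_end, length):
    """Database frame whose trailing sequence best matches the
    query's trailing sequence."""
    cols = range(length - 1, S.shape[1])
    return max(cols, key=lambda d: sequence_score(S, query_end, d, length))
```

The real SeqSLAM additionally normalises similarities locally and searches over a range of relative speeds; this sketch keeps only the core diagonal-sum idea.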
Determining distance
1. Occlusion
2. Height in visual field
3. Relative size
4. Texture density
5. Aerial perspective
6. Binocular disparity
7. Accommodation
8. Convergence
9. Motion perspective
Chauvet et al., 1995; Hobbs, 1991) and Egyptian art (see Hagen, 1986; Hobbs, 1991), where it is often used alone, with no other information to convey depth. Thus, one can make a reasonable claim that occlusion was the first source of information discovered and used to depict spatial relations in depth.
Because occlusion can never be more than ordinal information (one can only know that one object is in front of another, but not by how much), it may not seem impressive. Indeed, some researchers have rejected it as information about depth (e.g., Landy, Maloney, Johnston, & Young, 1995). But the range and power of occlusion is striking: As is suggested in Figure 1, it can be trusted at all distances without attenuation, and its depth threshold exceeds that of all other sources. Even stereopsis seems to depend on partial occlusion (Anderson & Nakayama, 1994). Normalizing size over distance, occlusion provides depth thresholds of 0.1% or better. This is the width of one sheet of paper against another at 30 cm, the width of a person against a wall at 500 m, or the width of a car against a building at 2 km. Cutting and Vishton (1995) have provided more background on occlusion along with justifications for this plotted function, as well as for those of the other sources of information discussed here.
2. Height in the visual field measures relations among the bases of objects in a 3-D environment as projected to the eye, moving from the bottom of the visual field (or image) to the top, and assuming the presence of a ground plane, of gravity, and the absence of a ceiling (see Dunn, Gray, & Thompson, 1965). Across the scope of many different traditions in art, a pattern is clear: If one source of information about layout is present in a picture beyond occlusion, that source is almost always height in the visual field. The conjunction of occlusion and height, with no other sources, can be seen in the paintings at Chauvet; in classical Greek art and in Roman wall paintings; in 10th-century Chinese landscapes; in 12th- to 15th-century Japanese art; in Western works of Cimabue, Duccio di Buoninsegna, Simone Martini, and Giovanni di Paolo (13th–15th centuries); and in 15th-century Persian art (see Blatt, 1984; Chauvet et al., 1995; Cole, 1992; Hagen, 1986; Hobbs, 1991; Wright, 1983). Thus, height appears to have been the second source of information discovered, or at least mastered, for portraying depth and layout.
The potential utility of height in the visual field is suggested in Figure 1, dissipating with distance. This plot assumes an upright, adult observer standing on a flat plane. Since the observer's eye is at a height of about 1.6 m, no base closer than 1.6 m will be available; thus, the function is truncated in the near distance, which will have implications later. I also assume that a height difference of about 5′ of arc between two nearly adjacent objects is just detectable; but a different value would simply shift the function up or down. When one is not on a flat plane,
Figure 1. Just-discriminable ordinal depth thresholds as a function of the logarithm of distance from the observer, from 0.5 to 10,000 m, for nine sources of information about layout. I assume that more potent sources of information are associated with smaller depth-discrimination thresholds, and that these thresholds reflect suprathreshold utility. This array of functions is idealized for the assumptions given in Table 1. From "Perceiving Layout and Knowing Distances: The Integration, Relative Potency, and Contextual Use of Different Information About Depth," by J. E. Cutting and P. M. Vishton, 1995, in W. Epstein and S. Rogers (Eds.), Perception of Space and Motion (p. 80), San Diego: Academic Press. Copyright 1995 by Academic Press. Reprinted with permission.
How the Eye Measures Reality and Virtual Reality, 1995, Cutting, J. E. & Vishton, P. M. | Reprinted from Perception of Space and Motion, W. Epstein and S. Rogers (Eds.), "Perceiving layout and knowing distances: The integration, relative potency, and contextual use of different information about depth," p. 80, Copyright (1995), with permission from Elsevier.
How do we estimate distance?
1. Occlusion
2. Height in visual field
3. Relative size
4. Texture density
5. Aerial perspective
6. Binocular disparity
7. Accommodation
8. Convergence
9. Motion perspective
http://www.youtube.com/watch?v=6GliSCGkpZ4
Eye Movement Terminology, YouTube, 2008, Sam Tapsell | Used with permission.
How do we estimate distance?
video from handheld camera while walking, with near and far objects moving past
3D camera
Use 3D structure to identify places
Summary: the nub of the problem
• Appearance is a function of
–scene 3D geometry
–materials
–viewpoint
–lighting changes (intensity, color)
–exogenous factors (leaves, rain, snow)
• This function is complex, non-linear and not invertible
• Lots of non-discriminative content like sky, road, etc.
Approaches to robust visual place recognition
• Get better images
• Robust 2D image descriptors
• Understand variation over time
• Use fewer pixels
• Use a sequence of recent images
• Use some invariant
• Use 3D structure
–laser, stereo, active range camera (e.g. Kinect)
We’re hiring
• We’re doing: robust vision, semantic vision, vision & action, algorithms & architectures
• Looking for 16 postdocs
www.roboticvision.org