TRANSCRIPT
SLAM: from Frames to Events
Davide Scaramuzza
http://rpg.ifi.uzh.ch
Institute of Informatics | Institute of Neuroinformatics
Research Topics
Real-time, Onboard Computer Vision and Control for Autonomous, Agile Drone Flight
P. Foehn et al., AlphaPilot: Autonomous Drone Racing, RSS 2020, Best System Paper Award. PDF. Video.
Kaufmann et al., Deep Drone Acrobatics, RSS 2020, Best Paper Award finalist. PDF. Video.
Loquercio et al., Agile Autonomy: Learning High-Speed Flight in the Wild, Science Robotics, 2021. PDF. Video. Code & Datasets.
Today’s Outline
• A brief history of visual SLAM
• SVO and real-world applications
• Active exposure control
• Event cameras
A Brief history of Visual Odometry & SLAM
• Scaramuzza, D., Fraundorfer, F., Visual Odometry: Part I - The First 30 Years and Fundamentals, IEEE Robotics and Automation Magazine, Volume 18, issue 4, 2011. PDF
• Fraundorfer, F., Scaramuzza, D., Visual Odometry: Part II - Matching, Robustness, and Applications, IEEE Robotics and Automation Magazine, Volume 19, issue 1, 2012. PDF
• C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I.D. Reid, J.J. Leonard, Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age, IEEE Transactions on Robotics, Vol. 32, Issue 6, 2016. PDF
• Scaramuzza, Zhang, Visual-Inertial Odometry of Aerial Robots, Encyclopedia of Robotics, Springer, 2019, PDF.
• Huang, Visual-inertial navigation: A concise review, International conference on Robotics and Automation (ICRA), 2019. PDF.
• Gallego, Delbruck, Orchard, Bartolozzi, Taba, Censi, Leutenegger, Davison, Conradt, Daniilidis, Scaramuzza, Event-based Vision: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. PDF
A Brief history of Visual Odometry & SLAM
• 1980: First known VO implementation on a robot, in Hans Moravec's PhD thesis (NASA/JPL), for Mars rovers using a single sliding camera (sliding stereo)
• 1980 to 2000: VO research was dominated by NASA/JPL in preparation for the 2004 mission to Mars
• 2000-2004: First real-time monocular VSLAM solutions (e.g., S. Soatto, A. Davison, D. Nister, G. Klein)
• 2004: VSLAM was first used on a robot on another planet: the Mars rovers Spirit and Opportunity (see the seminal paper from NASA/JPL, 2007)
• 2015-today: VSLAM becomes a fundamental component of several products: vacuum cleaners, scanners, VR/AR, drones, robots, smartphones
• 2021: VSLAM used on the Mars helicopter
Recently founded VO & SLAM companies
• AI Incorporated: SLAM for autonomy, software
• Artisense: SLAM for autonomy, software and hardware
• Augmented Pixels: SLAM for mapping, software
• GeoSLAM: SLAM for mapping, software
• Indoors: SLAM for indoor positioning and mapping, software
• Kudan: SLAM for autonomy, software
• ModalAI: SLAM hardware for drones
• MYNT EYE: manufacturer of camera-IMU sensors, hardware
• NAVVIS: SLAM for mapping, software and hardware
• Roboception: SLAM for robot arms, software and hardware
• Sevensense: SLAM for autonomy, software and hardware
• SLAMcore: SLAM for autonomy, software and hardware
• SUIND: SLAM for drone autonomy, software and hardware
• VanGogh Imaging: SLAM for object tracking and mapping, software
• Wikitude: SLAM for AR/VR, software
A Short Recap of the last 40 years of VIO
Feature-based (1980-2000) → feature-based + direct (from 2000) → +IMU (from 2007; 10x accuracy) → +machine learning (from 2012) → +event cameras (from 2014)
Progress is measured along three axes: accuracy; robustness (adverse environmental conditions, HDR, motion blur, low texture, dynamic environments); and efficiency (speed, memory, and CPU load).
We need more datasets to evaluate the performance of SLAM
• Accuracy: HILTI-SLAM dataset 2021, ETH 3D Dataset 2021, TUM VI Benchmark 2021, Devon Island 2013, TUM-RGBD 2012, KITTI 2012, EuRoC 2016, …
• Robustness (adverse environmental conditions, HDR, motion blur, low texture): TartanAir Dataset 2021, BlackBird 2018, UZH-FPV dataset 2018, DSEC 2021, MVSEC 2018, Event Camera 2017, …
• Efficiency (speed, memory, and CPU load): SLAMBench 3, 2019
• Realistic simulators: AirSim 2017, Flightmare, FlightGoggles 2019, ESIM 2018, …
Algorithms are tuned to overfit datasets! We need a Common Task Framework!
HILTI SLAM Dataset & Challenge
Helmberger et al., The Hilti SLAM Challenge Dataset, arXiv preprint, 2021. PDF. Dataset
• 2 LiDARs + 5 standard cameras + 3 IMUs
• Goal: benchmark the accuracy of structure and motion
HILTI SLAM Challenge – Leader Board
Top entries use predominantly LiDAR; the others use predominantly vision.
UZH-FPV Drone Racing Dataset & Challenge
• Goal: benchmarking VIO & VSLAM algorithms at high speed, where motion blur and high dynamic range are detrimental
• Recorded with a drone flown by a professional pilot at speeds of over 70 km/h
• Contains over 30 sequences with images, events, IMU, and ground truth from a robotic total station: https://fpv.ifi.uzh.ch/
• VIO leader board: https://fpv.ifi.uzh.ch/?sourcenova-comp-post=2019-2020-uzh-fpv-temporary-leader-board
Delmerico et al., "Are We Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset", ICRA'19. PDF. Video. Datasets.
UZH-FPV Challenge
Leader board entries: sliding-window optimization methods (à la OKVIS or VINS-Mono) and filter-based methods (MSCKF).
No event cameras have been used yet!
Today’s Outline
• A brief history of visual SLAM
• SVO and real-world applications
• Active exposure control
• Event cameras
SVO
• Key needs: low latency, low memory, high speed
• Combines indirect + direct methods
• Direct (minimizes photometric error):
• Used for frame-to-frame motion estimation
• Corners and edgelets
• Jointly optimizes poses & structure (sliding window)
• Indirect (minimizes reprojection error):
• Frame-to-keyframe pose refinement
• Mapping: probabilistic depth estimation (heavy-tailed Gaussian distribution)
• Faster than real time: up to 400 fps on i7 laptops and 100 fps on embedded PCs (Odroid ARM boards, NVIDIA Jetson)
Forster, Zhang, Gassner, Werlberger, Scaramuzza, SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems, IEEE Transactions on Robotics (T-RO), 2017. PDF, code, videos.
[Figure: edgelet and corner features]
Source code of SVO Pro: https://github.com/uzh-rpg/rpg_svo_pro_open
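As a toy illustration of the two residuals SVO combines (a minimal sketch of the general idea under a pinhole camera model, not SVO's actual implementation; all function names here are mine):

```python
import numpy as np

def reprojection_error(K, T, point_w, observed_px):
    """Indirect residual: pixel distance between a projected 3D map
    point and its observed 2D feature location."""
    p_cam = T[:3, :3] @ point_w + T[:3, 3]   # world -> camera frame
    uvw = K @ p_cam                          # pinhole projection
    return observed_px - uvw[:2] / uvw[2]

def photometric_error(img_ref, img_cur, px_ref, px_cur):
    """Direct residual: intensity difference between corresponding
    pixel locations in the reference and current frames."""
    u0, v0 = px_ref
    u1, v1 = px_cur
    return float(img_ref[v0, u0]) - float(img_cur[v1, u1])
```

SVO minimizes the photometric residual for frame-to-frame motion estimation and the reprojection residual for frame-to-keyframe pose refinement.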
SVO: Probabilistic Depth Estimation
[Video: probabilistic depth estimation with edgelet and corner features]
Processing time of SVO vs. ORB-SLAM, LSD-SLAM, DSO
[Table: processing times in milliseconds]
The SVO front end is over 10x faster than state-of-the-art systems and 4x more efficient (it runs on half a CPU core instead of two cores).
This makes it appealing for real-time applications on embedded PCs (drones, smartphones)
SVO Pro (2021), just released, does full SLAM! It includes:
• Support for monocular and stereo systems, as well as omnidirectional camera models (fisheye and catadioptric)
• A visual-inertial sliding-window optimization backend (modified from OKVIS)
• Loop closure via DBoW2
• Global bundle adjustment or pose-graph optimization via iSAM2 in real time (at frame rate)
SVO Pro contains a full SLAM system running in real time
Source code of SVO Pro: https://github.com/uzh-rpg/rpg_svo_pro_open
More here: http://rpg.ifi.uzh.ch/svo2.html
• Throw-and-go (2015) (inspired many products, like the DJI Tello drone)
• Autonomous quadrotor navigation in dynamic scenes with a down-looking camera (running at 90 fps on an Odroid U3 board, ARM Cortex-A9)
• 20 m/s obstacle-free autonomous quadrotor flight at DARPA FLA (2015)
• Virtual reality with SVO running on an iPhone 6 (with the company Dacuda at CES 2017)
Startup: “Zurich-Eye” – Today: Facebook-Oculus Zurich
• Vision-based Localization and Mapping systems for mobile robots
• Born in Sep. 2015, became Facebook Zurich in Sep. 2016. Today >200 employees
• In 2018, Zurich-Eye launched Oculus Quest (2 million units shipped so far)
"From the lab to the living room": the story behind Facebook's Oculus Insight technology, from Zurich-Eye to Oculus Quest: https://tech.fb.com/the-story-behind-oculus-insight-technology/
SVO and its derivatives are used today in many products…
• DJI drones
• Magic Leap AR headsets
• Oculus VR headsets
• Huawei phones
• Nikon cameras
• …
Takeaway: Partner with industry to understand the key problems
• Industry provides use cases
• They have very stringent requirements:
• Low latency
• Low energy (e.g., AR, VR, always-on devices): see NAVION or PULP chips
• Robustness to HDR, blur, dynamic environments, harsh environment conditions
• Accuracy: e.g., construction monitoring requires maps with <5mm absolute error
Today’s Outline
• A brief history of visual SLAM
• SVO and real-world applications
• Active exposure control
• Event cameras
HDR scenes are challenging for SLAM
• Cameras have limited dynamic range
• Built-in auto-exposure is optimized for image quality, not for SLAM!
Idea: Actively adjust the exposure time
Active Camera Exposure Control
Zhang, Forster, Scaramuzza, Active Exposure Control for Robust Visual Odometry in HDR Environments, ICRA'17. PDF. Video.
[Video: standard built-in auto-exposure vs. our active exposure control]
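The core idea of active exposure control can be sketched in a few lines: score each candidate exposure by how much trackable gradient the resulting image would contain, and pick the best. This is a simplified stand-in for the information metric in [Zhang, ICRA'17]; the linear exposure model, the clipping-based saturation, the threshold, and all names are illustrative assumptions.

```python
import numpy as np

def gradient_metric(img, thresh=2.0):
    """Count pixels with a strong intensity gradient:
    a crude proxy for the amount of trackable texture."""
    gx = np.abs(np.diff(img, axis=1))
    gy = np.abs(np.diff(img, axis=0))
    return int((gx > thresh).sum() + (gy > thresh).sum())

def pick_exposure(radiance, candidate_times):
    """Pick the exposure time whose simulated image maximizes the
    gradient metric. Exposure is modeled as a linear scaling of
    scene radiance, with saturation as clipping to the 8-bit range."""
    def simulate(t):
        return np.clip(radiance * t, 0.0, 255.0)
    return max(candidate_times, key=lambda t: gradient_metric(simulate(t)))
```

Too short an exposure buries gradients in the dark end; too long an exposure saturates them away; the metric peaks in between.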
Takeaway: make your algorithm scene aware!
Cameras have many parameters that can be adaptively tuned or actively controlled to achieve the best performance
• Scene-aware exposure-time control [Zhang, ICRA'17]
• Scene-aware motion-blur & rolling-shutter compensation [Meilland, ICCV'13] [Liu, ICCV'21]
• More generally, we need scene-aware, continuous self-calibration & parameter control
Today’s Outline
• A brief history of visual SLAM
• SVO and real-world applications
• Active exposure control
• Event cameras
Open Challenges in Computer Vision
The past 60 years of research have been devoted to frame-based cameras but they are not good enough!
Frames suffer from limited dynamic range, latency, and motion blur. Event cameras do not suffer from these problems!
What is an event camera?
• Novel sensor that measures only motion in the scene
• Key advantages:
• Low latency (~1 μs)
• No motion blur
• Ultra-low power (mean: 1 mW vs. 1 W)
• High dynamic range (140 dB instead of 60 dB)
[Video: VGA event camera from Prophesee]
Traditional vision algorithms cannot be directly applied, because the event camera output is an asynchronous stream of per-pixel events.
Opportunities
• Low latency: AR/VR, automotive (<10 ms)
• Low energy: AR/VR, always-on devices (see SynSense)
• HDR & no motion blur
Who sells event cameras and how much are they?
• Prophesee & Sony:
• ATIS sensor: events, IMU, absolute intensity at the event pixel
• Resolution: 1 Mpixel
• Cost: ~5,000 USD
• iniVation & Samsung:
• DAVIS sensor: frames, events, IMU
• Resolution: VGA (640x480 pixels)
• Cost: ~5,000 USD
• CelePixel Technology & OmniVision:
• CeleX One: events, IMU, absolute intensity at the event pixel
• Resolution: 1 Mpixel
• Cost: ~1,000 USD
• Cost expected to sink below 5 USD once a killer application is found (recall that the first ToF cameras cost >10,000 USD; today they cost <50 USD)
Generative Event Model
• Consider the intensity at a single pixel.
• An event is triggered when the log intensity change passes a threshold 𝐶:
log I(x, t) − log I(x, t − Δt) = ±C
where C is the contrast threshold.
[Figure: log intensity log I(x, t) at one pixel over time; ON events are triggered when the intensity increases by C, OFF events when it decreases by C]
Note that events are generated asynchronously.
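The generative model translates directly into a per-pixel event simulator. A minimal, idealized sketch (constant threshold, no noise, no refractory period; the function name is mine):

```python
import math

def generate_events(times, intensities, C=0.2):
    """Simulate one event-camera pixel: emit (timestamp, polarity)
    events whenever the log intensity has changed by the contrast
    threshold C since the last emitted event."""
    events = []
    log_ref = math.log(intensities[0])  # log intensity at the last event
    for t, intensity in zip(times[1:], intensities[1:]):
        log_i = math.log(intensity)
        # A large change can trigger several events at once
        while abs(log_i - log_ref) >= C:
            polarity = 1 if log_i > log_ref else -1  # ON / OFF
            log_ref += polarity * C
            events.append((t, polarity))
    return events
```

For example, doubling the intensity (a log change of ≈0.69) with C = 0.2 fires three ON events.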
Do events carry the same visual information as normal cameras?
[Videos: event streams and image reconstructions from Munda, IJCV'18 and Scheerlinck, ACCV'18]
From the event-generation model, we can reconstruct images up to an unknown absolute intensity offset.
The results are far from perfect, mainly because the contrast threshold is not constant (it depends on the scene content).
Can we learn to reconstruct video from events?
[Video: events vs. reconstructed video from events]
Rebecq et al., “High Speed and High Dynamic Range Video with an Event Camera”, T-PAMI’19. PDF Video Code
The video reconstruction is now very accurate because the network learns an implicit noise model
Learned from Simulation only – One-Shot
• Recurrent neural network based on U-Net
• Trained in simulation only, deployed on a real event camera without fine-tuning
• We randomize the contrast sensitivity to reduce the sim-to-real gap
• Generalizes to real and different event cameras without fine-tuning
Source code & Datasets: https://github.com/uzh-rpg/rpg_e2vid
Rebecq et al., “High Speed and High Dynamic Range Video with an Event Camera”, T-PAMI’19. PDF Video Code
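For intuition: networks like E2VID do not consume raw event lists; the events are first accumulated into a spatio-temporal voxel grid tensor. A simplified sketch of that conversion (the exact normalization in the released code may differ; the function name is mine):

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a (num_bins, H, W) voxel grid.
    events: array of (t, x, y, polarity) rows, polarity in {-1, +1}.
    Each event's polarity is split between the two nearest temporal
    bins with linear interpolation weights."""
    grid = np.zeros((num_bins, height, width))
    t = events[:, 0]
    # Normalize timestamps to the range [0, num_bins - 1]
    t_norm = (t - t[0]) / max(t[-1] - t[0], 1e-9) * (num_bins - 1)
    for tn, (_, x, y, p) in zip(t_norm, events):
        b = int(tn)
        w = tn - b
        grid[b, int(y), int(x)] += p * (1.0 - w)
        if b + 1 < num_bins:
            grid[b + 1, int(y), int(x)] += p * w
    return grid
```

This keeps the fine temporal structure of the event stream while producing a fixed-size tensor a convolutional network can process.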
Reconstructed video inherits all advantages of event cameras: e.g., high temporal resolution
Bullet: 1,300 km/h
[Video: Huawei P20 phone camera vs. our reconstruction from events at over 5,000 fps]
Reconstructed video inherits all advantages of event cameras: e.g., high dynamic range
[Video: Huawei P20 phone camera vs. our reconstruction from events, with the raw events shown for comparison]
What happens if we feed reconstructed video to a state-of-the-art SLAM algorithm?
Rebecq et al., “High Speed and High Dynamic Range Video with an Event Camera”, T-PAMI’19. PDF Video Code
The SLAM inherits all the advantages of event cameras: no motion blur, HDR, low-latency!
The Key Challenge
• The fact that we can reconstruct high-quality video means that event cameras carry the same visual information as standard cameras
• So it must be possible to perform all the vision tasks of standard cameras
• But we want to build efficient, low-energy algorithms that compute the output without passing through an intermediate image reconstruction
[Diagram: today's pipeline goes events → image reconstruction → CV algorithm → output; the goal is to compute the output from events directly]
Application 1: Low-Latency & Low-Energy Tracking
• [1] Gallego et al., Event-based 6-DOF Camera Tracking from Photometric Depth Maps, T-PAMI'18. PDF. Video.
• [2] Mueggler et al., Continuous-Time Visual-Inertial Odometry for Event Cameras, T-RO'18. PDF.
• [3] Rosinol et al., Ultimate SLAM?, RAL'18, Best Paper Award finalist. PDF. Video. IEEE Spectrum.
• [4] Gehrig et al., EKLT: Asynchronous, Photometric Feature Tracking using Events and Frames, IJCV 2019. PDF. YouTube. Evaluation Code. Tracking Code.
Application 2: “Ultimate SLAM”
Goal: combining events, images, and IMU for robustness to HDR and high speed scenarios
Front end: feature tracking from events and frames. Back end: state-of-the-art nonlinear-optimization-based VIO.
Rosinol-Vidal, Rebecq, Horstschaefer, Scaramuzza, Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios, IEEE Robotics and Automation Letters (RAL), 2018. PDF. Video. Best Paper Award Honorable Mention.
Application 2: “Ultimate SLAM”
85% accuracy gain over standard VIO in HDR and high-speed scenarios
[Video: standard camera vs. event camera]
Application of Ultimate SLAM: Autonomous Flight despite Rotor Failure
• Quadrotors subject to a full rotor failure require accurate position estimates to avoid crashing
• State-of-the-art systems relied on external position-tracking systems (e.g., GPS, Vicon, UWB)
• We achieve this with only onboard cameras; with event cameras, we can make it work in very low light!
Sun, Cioffi, de Visser, Scaramuzza, Autonomous Quadrotor Flight despite Rotor Failure with Onboard Vision Sensors: Frames vs. Events, IEEE RAL 2021. PDF. Video. Code. 1st place winner of the NASA TechBrief Award: Create the Future Contest.
Application 3: Slow Motion Video
• We can combine an event camera with an HD RGB camera
• We use events to upsample low-framerate video by over 50 times, with only 1/40th of the memory footprint!
Tulyakov et al., TimeLens: Event-based Video Frame Interpolation, CVPR'21. PDF. Video. Code.
Code & Datasets: http://rpg.ifi.uzh.ch/timelens
Application 4: Event-Guided Depth Sensing
Muglikar et al., Event Guided Depth Sensing, 3DV'21. PDF
• Problem: standard depth sensors (ToF, LiDAR, structured light) sample depth uniformly and at a fixed scan rate, thus oversampling redundant static information → large power consumption and high latency
• Idea: use an event camera to guide the depth measurement process: scan areas that generate events with high spatial density, and the remaining areas with low density
• Finding: since moving edges correspond to less than 10% of the scene on average, event-guided depth sensing could reduce the power consumed by the illumination source by almost 90%
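The sampling idea can be sketched as a probabilistic scan mask: pixels that recently fired events are scanned densely, the static remainder sparsely. The rates and the binary-mask formulation here are illustrative assumptions, not the method of the 3DV'21 paper.

```python
import numpy as np

def sampling_mask(event_count, high_rate=0.5, low_rate=0.02, seed=0):
    """Decide which pixels the depth sensor should scan.
    Pixels that generated events (moving edges) are sampled with
    probability high_rate, static pixels with probability low_rate.
    Returns a boolean scan mask."""
    rng = np.random.default_rng(seed)
    prob = np.where(event_count > 0, high_rate, low_rate)
    return rng.random(event_count.shape) < prob
```

If moving edges cover only a small fraction of the image, the expected number of depth samples (and hence illumination power) is far below that of a full uniform scan.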
Conclusion
• Visual-inertial SLAM theory is well established
• The biggest challenges today are reliability and robustness to HDR, low light, adverse environmental conditions, motion blur, low texture, and dynamic environments
• Active control of camera parameters, such as exposure time, can greatly help
• Machine learning exploits context and provides robustness; it works best when combined with geometric approaches
• Event cameras are complementary to standard cameras: they provide robustness to high-speed motion and HDR scenes, and allow low-latency, low-energy operation, which is key for AR/VR and always-on devices
• Current SLAM datasets are saturated: new datasets & challenges are needed (a Common Task Framework)!
Thanks!
Code, datasets, videos, and publications, slides: http://rpg.ifi.uzh.ch/
I am hiring PhD students and Postdocs in AI
@davsca1 @davidescaramuzzaailabRPG