TRANSCRIPT
Deep Learning
Daniel Gehrig
Institute of Informatics – Institute of Neuroinformatics
Outline
• Introduction
  • Motivation and history
• Supervised Learning
  • Semantic Segmentation
  • Image Description
  • Image Localization
• Unsupervised Learning
  • Depth estimation
  • Optical Flow estimation
  • Depth and egomotion
• Applications to Computer Vision
  • LIFT: Learned Invariant Feature Transform
  • SuperPoint
  • Place Recognition: NetVLAD
• Conclusions
Relevant for the exam
The Deep Learning Revolution
Medicine • Media & Entertainment • Autonomous Driving • Surveillance & Security
Some History
• 1958: Perceptron
• 1969: Critics
• 1974: Back-Propagation
• AI Winter
• 1995: SVM
• 1998: Convolutional Neural Networks for Handwritten Digit Recognition
• 2006: Restricted Boltzmann Machines
• 2012: AlexNet
What changed?
• Hardware Improvements
• Big Data Available
• Algorithmic Progress
Image Classification
Task of assigning an input image a label from a fixed set of categories.
Slide adapted from CNNs for Visual Recognition (Stanford)
The Semantic Gap
• What computers see vs. what we see
Classification Challenges
Slide adapted from CNNs for Visual Recognition (Stanford)
Directly specifying what a category looks like is impossible.
We need to use a data-driven approach.
Supervised Learning
Find a function f(x, θ) that imitates a ground-truth signal.
• θ: the function parameters or weights
• Output: N numbers representing class scores
• Predicted: [0.1, 0.7, …, 0.0] vs. ground truth yᵢ: [1.0, 0.0, …, 0.0]
• Loss(f(xᵢ, θ), yᵢ) drives the parameter update
Machine Learning Keywords
• Loss: quantifies how good the parameters θ are
• Optimization: the process of finding θ that minimizes the loss
• Function: problem modelling -> deep networks are highly non-linear functions f(x, θ)
Slide adapted from CNNs for Visual Recognition (Stanford)
Classifiers: K-Nearest Neighbor
f(x, θ) = majority label of the K training examples nearest to x
• How fast is training? How fast is testing? Training: O(1); testing: O(n) (a sketch follows below).
• What is a good distance metric? What K should be used?
[Figure: a test example surrounded by training examples from class 1 and class 2; features are represented in the descriptor space.]
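To make the data-driven approach concrete, here is a minimal k-NN classifier sketch in Python/NumPy; the toy 2-D descriptors and the choice K = 5 are made-up assumptions, not from the lecture:

```python
import numpy as np

def knn_predict(x, train_x, train_y, k=3):
    """Predict the majority label of the k training examples nearest to x."""
    dists = np.linalg.norm(train_x - x, axis=1)  # O(n) distances at test time
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # majority vote

# Toy 2-D descriptor space: two classes clustered around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
train_y = np.array([0] * 20 + [1] * 20)
print(knn_predict(np.array([4.5, 4.8]), train_x, train_y, k=5))  # -> 1
```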
Classifiers: Linear
• Find a linear function to separate the classes:
f(x, θ) = sgn(wᵀx + b)
• What is θ now? What is the dimensionality of images?
Classifiers: Non-linear
Bad classifier (overfitting) vs. good classifier
• What is f(x, θ)?
Biological Inspiration
[Figure: a biological neuron with dendrites, axon, and terminal branches of the axon, next to the perceptron model: inputs x₁, x₂, …, xₙ weighted by w₁, w₂, …, wₙ and summed before the activation.]
f(x, θ) = F(Wx), where F is a non-linear activation function (step, ReLU, sigmoid)
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Frank Rosenblatt (1958)
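As a hedged illustration of the unit above, f(x, θ) = F(Wx) with a step activation; the weights below are made up so the neuron behaves like a 2-input AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    """Single neuron: weighted sum of the inputs followed by a step activation F."""
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

# Made-up weights implementing an AND-like decision on two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
print(perceptron(np.array([1.0, 1.0]), w, b))  # -> 1.0
print(perceptron(np.array([1.0, 0.0]), w, b))  # -> 0.0
```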
Multi Layer Perceptron
f(x, θ)
Non-linear Activation functions (ReLU, sigmoid, etc.)
Neural Networks and Deep Learning, Chapter 2 – Michael Nielsen
Forward propagation
Forward Pass: compute Loss(f(xᵢ, θ), yᵢ)
Neural Networks and Deep Learning, Chapter 2 – Michael Nielsen
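A minimal NumPy sketch of the forward pass through a two-layer perceptron with a ReLU hidden layer and softmax class scores; the layer sizes and random weights are arbitrary assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, theta):
    """Forward pass f(x, theta): affine -> ReLU -> affine -> softmax scores."""
    W1, b1, W2, b2 = theta
    h = relu(W1 @ x + b1)        # hidden-layer activations
    s = W2 @ h + b2              # raw class scores
    e = np.exp(s - s.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
theta = (rng.normal(0, 0.1, (16, 8)), np.zeros(16),   # W1, b1
         rng.normal(0, 0.1, (4, 16)), np.zeros(4))    # W2, b2
p = forward(rng.normal(size=8), theta)
print(p, p.sum())  # class probabilities that sum to 1
```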
Optimization: Back-propagation
Backward Pass: compute the gradients of the loss with respect to all parameters and perform gradient descent:
θ_new = θ_old − μ ∇Loss(f(xᵢ, θ), yᵢ)
Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure – Stuart E. Dreyfus
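A toy sketch of the update rule θ_new = θ_old − μ∇Loss; the gradient here is approximated by finite differences on a made-up quadratic loss purely for illustration, whereas back-propagation computes it analytically:

```python
import numpy as np

def numerical_grad(loss, theta, eps=1e-6):
    """Finite-difference approximation of dLoss/dtheta (illustration only)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (loss(theta + d) - loss(theta - d)) / (2 * eps)
    return g

loss = lambda th: ((th - np.array([1.0, -2.0])) ** 2).sum()  # toy quadratic loss
theta, mu = np.zeros(2), 0.1
for _ in range(100):
    theta = theta - mu * numerical_grad(loss, theta)  # theta_new = theta_old - mu*grad
print(theta)  # converges to the minimizer [1, -2]
```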
Problems of fully connected networks
• Too many parameters -> possible overfitting (see the sketch below)
• We are not using the fact that inputs are images!
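A back-of-the-envelope comparison of parameter counts; the image and layer sizes are assumed examples, not from the slides:

```python
# Fully connected layer on a flattened 224x224x3 image with 1000 hidden units:
fc_params = 224 * 224 * 3 * 1000           # weights only, biases ignored
print(f"fully connected: {fc_params:,}")    # ~150 million parameters

# A convolutional layer with 64 filters of size 3x3x3 shares weights spatially:
conv_params = 3 * 3 * 3 * 64
print(f"convolutional:   {conv_params:,}")  # 1,728 parameters
```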
Convolutional Neural Networks
Gradient-based learning applied to document recognition, Y. LeCun et al. (1998)
Going Deep
[Figure: deep network class scores for an input image: Flower 10.2%, Cup 11.5%, Dog 15.6%, Car 62.7%.]
Why Deep?
• Inspired by the human visual system
• Learn multiple layers of transformations of the input
• Extract progressively more sophisticated representations
Supervised Learning
• In supervised learning it is assumed that we have access to both the input data (or images) and ground-truth labels.
• Networks trained with supervision usually perform best.
• However, it is usually hard to generate this data, since it often has to be hand-labelled.
[Diagram: image → network f(x, θ) → prediction, compared against the ground-truth label y by the loss L(f(x, θ), y).]
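For classification, the loss L(f(x, θ), y) is typically the cross-entropy between the softmax of the predicted class scores and the one-hot ground-truth label. A minimal sketch with made-up scores:

```python
import numpy as np

def cross_entropy(scores, y):
    """Cross-entropy between softmax(scores) and the ground-truth class index y."""
    e = np.exp(scores - scores.max())   # numerically stable softmax
    p = e / e.sum()
    return -np.log(p[y])

scores = np.array([0.1, 0.7, 0.0])  # predicted class scores
print(cross_entropy(scores, y=1))   # lower loss: the labelled class scores highest
print(cross_entropy(scores, y=2))   # higher loss for a wrong label
```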
Supervised Learning: Image Segmentation
Fully Convolutional Networks for Semantic Segmentation – J. Long, E. Shelhamer, 2015
Supervised Learning: Image Captioning
Deep Visual-Semantic Alignments for Generating Image Descriptions – Karpathy et al., 2015
Supervised Learning: Image Localization
PlaNet - Photo Geolocation with Convolutional Neural Networks – Weyand et al., 2016
Unsupervised Learning
• In unsupervised learning we only have access to input data or images.
• Usually, these methods are more popular because they can use much larger datasets that do not need to be manually labelled.
L(f(x, θ))
[Diagram: image → network f(x, θ) → prediction; the loss is computed from the prediction alone, without labels.]
Unsupervised Learning: Monocular Depth Estimation
Unsupervised Monocular Depth Estimation with Left-Right Consistency – Godard et al., 2017
Unsupervised Learning: Structure from Motion
Unsupervised Learning of Depth and Ego-Motion from Video – Zhou et al., 2017
Unsupervised Learning: Dense Optical Flow
Characteristics of the learned flow:
• Robustness against light changes (Census Transform)
• Occlusion handling (bi-directional flow)
• Smooth flow
UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss – Meister et al., 2018
Unsupervised vs. Supervised Learning
• Performance: supervised is usually better for the same dataset size; unsupervised is usually worse, but can outperform supervised methods due to larger data availability.
• Data availability: low for supervised, due to manual labelling; high for unsupervised, since no labelling is required.
• Training: simple for supervised, since ground truth gives a strong supervision signal; sometimes difficult for unsupervised, since loss functions have to be engineered to get good results.
• Generalizability: good for supervised, although sometimes the network learns to blindly copy the labels provided, leading to poor generalizability; often better for unsupervised, since unsupervised losses encode the task in a more fundamental way.
Applications to Computer Vision
Deep Descriptors: LIFT
The LIFT pipeline consists of 3 neural networks:
• A keypoint detector
• An orientation estimator
• A descriptor generator
LIFT: Learned Invariant Feature Transform, Kwang Moo Yi et al., 2016
Applications to Computer Vision: Keypoint Detection and Description
LIFT Loss
LIFT Results
• Works better than SIFT! (well, on some datasets)
SuperPoint: Self-Supervised Interest Point Detection and Description
• SIFT and friends are complicated heuristic algorithms
• Still, SIFT is a hard-to-beat baseline for new methods
• Can we do better with a data-driven approach?
• SuperPoint shows how a convolutional neural network can be trained to predict keypoints and descriptors simultaneously.
• The detector is less accurate than SIFT, but the descriptor is shown to outperform SIFT in some scenarios.
Applications to Computer Vision: Keypoint Detection and Description
SuperPoint: Self-Supervised Interest Point Detection and Description – D. DeTone et al., CVPRW 2018
SuperPoint - Training
a) Train on a synthetic dataset to bootstrap the detector
b) Use this detector on real images and generate ground-truth correspondences by sampling perspective transformations (a sketch follows below)
c) Jointly train both the detector and the descriptor
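Step b) works because warping an image with a known homography yields keypoint correspondences for free. A hedged OpenCV sketch of that idea; the corner-perturbation magnitude is an assumption, not the paper's exact setting:

```python
import numpy as np
import cv2

def random_homography_warp(img, max_shift=0.15, rng=None):
    """Warp img with a random homography; return the warp and the 3x3 matrix H."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Randomly perturb the four corners by up to max_shift of the image size.
    offsets = rng.uniform(-max_shift, max_shift, (4, 2)) * [w, h]
    dst = (src + offsets).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h)), H

# Keypoints detected at (x, y) in the original image map to H @ [x, y, 1]^T in
# the warped image, giving ground-truth correspondences without manual labels.
```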
SuperPoint - Qualitative (Cherry-Picked) Results
SuperPoint - Quantitative Results
- ε: Pixel distance to count keypoint as correct. SIFT is very accurate due to sub-pixel refinement.- MLE (Mean Localization Error): Lower is better. Metric for determining detector accuracy- NN mAP (nearest-neighbor mean average precision): Measures discriminativeness of descriptors. Higher is
better.- M. Score (Matching Score): Evaluates detector and descriptor jointly on groundtruth correspondences.
38
Applications to Computer Vision: Place Recognition
Design an "image representation" extractor f(I, θ).
[Figure: a query image and a geotagged image database are mapped by f into a shared image representation space.]
NetVLAD
Mimic the classical pipeline with deep learning:
• Classical: image I -> extract local features (SIFT) -> aggregate (BoW, VLAD, FV) -> representation F(I)
• Deep: image I -> convolutional layers from AlexNet or VGGNet -> trainable pooling layer -> representation F(I)
Slide adapted from NetVLAD presentation, CVPR 2017
NetVLAD Loss
• Triplet loss formulation: matching samples should end up closer to the query than non-matching samples, by at least a margin.
Disclaimer: the actual NetVLAD loss is a slightly more complicated version of the one above.
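A simplified triplet-loss sketch in NumPy; per the disclaimer above, the real NetVLAD loss is more involved, and the margin value here is an assumption:

```python
import numpy as np

def triplet_loss(query, positive, negative, margin=0.1):
    """Push the matching sample closer to the query than the non-matching one,
    by at least `margin`, in the image-representation space."""
    d_pos = np.sum((query - positive) ** 2)  # distance to the matching sample
    d_neg = np.sum((query - negative) ** 2)  # distance to the non-matching sample
    return max(0.0, d_pos - d_neg + margin)  # zero once the margin is respected

q = np.array([0.2, 0.9])
pos, neg = np.array([0.25, 0.85]), np.array([0.9, 0.1])
print(triplet_loss(q, pos, neg))  # ~0: the positive is already much closer
```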
NetVLAD Results
• Code, dataset, and trained network online: give it a try!
• http://www.di.ens.fr/willow/research/netvlad/
[Figure: query images and top retrieved results; green: correct, red: incorrect.]
Slide adapted from NetVLAD presentation, CVPR 2017
Deep Visual Odometry (DeepVO)
DeepVO: End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, S. Wang et al., ICRA 2017
- End-to-end trained method that computes the camera pose from images directly
- Encodes image pairs to geometric features using a convolutional neural network (CNN)
- Feeds these features to a recurrent neural network (RNN) which outputs the 6-DOF pose
Deep Visual Odometry (DeepVO) - Training
- The network is trained on the KITTI VO/SLAM [1] benchmark with 7410 samples. To deal with the small amount of data, they use a network pretrained on FlowNet [2].
- During training, DeepVO minimizes the difference between the ground-truth and predicted positions and orientations (expressed in Euler angles); a sketch of such a loss follows below.
- The network is later tested on KITTI sequences that were not seen during training.
[1] Are we ready for autonomous driving? The KITTI vision benchmark suite, A. Geiger et al., CVPR 2012.
[2] FlowNet: Learning Optical Flow with Convolutional Networks, A. Dosovitskiy et al., ICCV 2015.
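A sketch of the kind of training loss described above: mean squared error on positions plus a κ-weighted mean squared error on Euler-angle orientations; κ and the sample values are illustrative assumptions (see the DeepVO paper for the exact formulation):

```python
import numpy as np

def pose_loss(p_pred, p_gt, phi_pred, phi_gt, kappa=100.0):
    """MSE on position plus kappa-weighted MSE on Euler-angle orientation."""
    pos_err = np.sum((p_pred - p_gt) ** 2, axis=1)      # per-sample position error
    ori_err = np.sum((phi_pred - phi_gt) ** 2, axis=1)  # per-sample orientation error
    return np.mean(pos_err + kappa * ori_err)

p_pred, p_gt = np.array([[1.0, 0.1, 0.0]]), np.array([[1.1, 0.0, 0.0]])
phi_pred, phi_gt = np.array([[0.01, 0.0, 0.02]]), np.zeros((1, 3))
print(pose_loss(p_pred, p_gt, phi_pred, phi_gt))
```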
Deep Visual Odometry (DeepVO) - Results
- DeepVO is compared to VISO2 [1] (monocular and stereo setups) in terms of rotation and translation error for different sub-trajectory lengths and speeds.
- While DeepVO outperforms the monocular setup of VISO2 (VISO2_M), it is still outperformed by the stereo setup (VISO2_S).
- An important thing to note is that learning-based visual odometry is not yet solved, since many of these methods overfit to the driving scenario (2D motion). However, this field is growing rapidly.
[1] Are we ready for autonomous driving? The KITTI vision benchmark suite, A. Geiger et al., CVPR 2012
Deep Learning Limitations
• Require lots of data to learn
• Difficult debugging and fine-tuning
• Poor generalization across similar tasks
Things to remember
• Deep Learning is able to extract meaningful patterns from data.
• It can be applied to a wide range of tasks.
• Artificial Intelligence ≠ Deep Learning
Come over for projects in DL! Visit our webpage:
http://rpg.ifi.uzh.ch/student_projects.php
Additional Readings
• Neural Networks and Deep Learning, by Michael Nielsen [Chapter 2]
• Practical Recommendations for Gradient-Based Training of Deep Architectures, Y. Bengio
• Deep Learning, I. Goodfellow, Y. Bengio, A. Courville
• All the references above!
Agile Autonomy:
High-Speed Flight with On-board Sensing and Computing
Antonio Loquercio
Code, datasets, videos, publications, and slides: https://antonilo.github.io/
@antoniloq @antonilo
The drone market is valued at $130 billion today
Inspection • Agriculture • Transport • Search and Rescue
Davide Scaramuzza
https://www.pwc.pl/pl/pdf/clarity-from-above-pwc.pdf
How are current commercial drones controlled?
➢ By a human pilot
▪ requires line of sight or a video link
▪ requires a lot of training
➢ By the autopilot: autonomous navigation
▪ GPS: doesn't work in GPS-denied or degraded environments
▪ Lidar (e.g., Exyn): expensive, heavy, power hungry
▪ Cameras (e.g., Parrot, DJI, Skydio): cheap, lightweight, passive (i.e., low power)
Last 10 Years of Progress on Autonomous Vision-based Flight
• 2010: EU SFLY Project (2009-2012) [Bloesch, ICRA 2010]: 1st onboard goal-oriented vision-based flight (previous research focused on reactive navigation)
• 2020: Skydio (2018-2020), DJI (2018-2020), NASA Mars Helicopter (2020): 1st products in the market or sent to another planet ☺
The Traditional Drone Control Architecture
[Diagram: images and IMU feed state estimation (VIO), followed by planning and control, which outputs control commands to track a reference.]
• Mature algorithms, but brittle at high speed due to motion blur.
• Require strong assumptions about the environment.
• Need significant tuning, especially at high speed.
Can we use Deep Learning to Control Drones? Yes!
[Diagram: images, IMU, and a reference feed a neural network that directly outputs control commands.]
What are the biggest challenges?
1. Data collection.
2. Input and output representation.
3. Guaranteeing platform safety during training and testing.
Data Collection: imitate cars and bicycles
➢ They are already integrated in the streets of a city
➢ Can leverage large publicly available datasets
➢ No need for an expert drone pilot
➢ Easy to collect negative experiences
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code.
A drone that thinks it is a car
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code
Generalization: Indoor corridors
➢ Extremely different in appearance from the outdoor training data: no fine-tuning!
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code
What is DroNet looking at?
➢ Line-like features are a strong cue for direction
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code
Data Collection: Simulation to Reality Transfer
What are the conditions for a neural network to transfer knowledge between domains?
[Figure: training environments (simulation) vs. test environment (real world).]
• Cannot harm drones at train time.
• Cheap to collect data.
• Can train on ANY trajectory, even those difficult for drone pilots.
Flightmare Quadrotor Simulator
Song, Naji, Kaufmann, Loquercio, Scaramuzza, Flightmare: A Flexible Quadrotor Simulator, CoRL 2020, PDF, Video, Website
Source code: https://uzh-rpg.github.io/flightmare
Sim2Real: Domain Randomization for Drone Racing
Loquercio et al., Deep Drone Racing with Domain Randomization, IEEE T-RO, 2019. PDF, Video. Best System Paper Award at the Conference on Robot Learning (CoRL), 2018.
Training Environments
Loquercio et al., Deep Drone Racing with Domain Randomization, IEEE T-RO, 2019. PDF, Video. Best System Paper Award at the Conference on Robot Learning (CoRL), 2018.
Sim2Real: Domain Randomization for Drone Racing
Sim2Real: Input Abstractions for Drone Acrobatics
Kaufmann*, Loquercio*, Ranftl, Mueller, Koltun, Scaramuzza, Robotics: Science and Systems, 2020, PDF, Video, Code
Source code: https://github.com/uzh-rpg/deep_drone_acrobatics
Deep Drone Acrobatics
Acrobatics are Challenging even for Human Pilots
Kaufmann*, Loquercio*, Ranftl, Mueller, Koltun, Scaramuzza, Deep Drone Acrobatics, RSS 2020. PDF, Video, Code
WARNING: This drone is controlled by a human pilot!
Sim2Real: Learning via Intermediate Representations
[Diagram: instead of feeding raw images, IMU data, and the reference directly to a neural network that outputs thrust and body rates, the images are abstracted into feature tracks and the IMU measurements are integrated* before being passed to the network.]
Sim2Real: Learning via Intermediate Representations
➢ The simulation-to-reality gap in performance of a finite-horizon controller directly depends on the covariate shift introduced by the simulated and real observation models.
➢ Find an abstraction function that reduces the gap between the observation models (a sketch follows below).
[Figure: simulated vs. real observations.]
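One concrete abstraction of this kind, used in Deep Drone Acrobatics, replaces raw images with sparse feature tracks, which look alike in simulation and reality. A hedged OpenCV sketch; the function name and parameters are assumptions, not the paper's implementation:

```python
import numpy as np
import cv2

def abstract_to_feature_tracks(prev_gray, curr_gray, max_corners=64):
    """Abstraction function: a pair of images -> sparse tracks (x, y, dx, dy)."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    ok = status.ravel() == 1                    # keep successfully tracked points
    pts, nxt = pts.reshape(-1, 2)[ok], nxt.reshape(-1, 2)[ok]
    return np.hstack([pts, nxt - pts])          # positions and displacements
```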
Processing Asynchronous Data
• IMU and camera generate observations at different rates.
• We propose an asynchronous architecture that can process inputs arriving at different frequencies.
[Diagram: feature tracks, IMU measurements sampled at 100 Hz, and the reference trajectory sampled at 50 Hz are each encoded by a PointNet branch; temporal convolutions and concatenation feed a multilayer perceptron that outputs the command: collective thrust c and body rates ωx, ωy, ωz.]
Controlled Experiments in Simulation
We outperform the traditional pipeline based on estimation (VIO) and control (MPC).
• Intel RealSense T265 for visual-inertial measurements
• NVIDIA Jetson TX2 for neural network inference and flight control
• Weight: 900 g; maximum thrust: 70 N; thrust-to-weight ratio: ca. 7
Results: Indoor & Outdoor Flight
Kaufmann*, Loquercio*, Ranftl, Mueller, Koltun, Scaramuzza, Deep Drone Acrobatics, RSS 2020, PDF, Video, Code
Takeaways: Learning vs. Model-Based Controllers
Learning-based controller:
• Robust to imperfect perception
• Easy to program
• Does not generalize to different tasks
• Difficult to interpret what it is doing
Traditional controller:
• Sensitive to imperfect perception
• Requires extensive expertise
• High generalization: same controller for all maneuvers
• Transparent and easy to interpret
The future will be about how to best combine these two worlds. Let's now see a way to do it!
Measuring Uncertainty in Deep Neural Networks
• A robotic system cannot blindly trust neural network predictions:
  • What happens if the current observation is very different from the training ones?
  • What if some sensors fail?
[Figure: the prediction y has low uncertainty in nominal conditions and high uncertainty under sensor failure.]
Loquercio et al., A General Framework for Uncertainty Estimation in Deep Learning, RA-L 2020, PDF, Video, Code.
Total Uncertainty
The uncertainty associated with a prediction depends on two factors:
• Data uncertainty
  • Noise inherent in the input data
  • Independent of the training dataset
• Model uncertainty
  • Uncertainty about the network weights
  • Decreases with the amount of training data
The total uncertainty is the combination of the two.
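A sketch of one common way to combine the two terms: the network predicts a mean and a data-uncertainty variance, several stochastic forward passes (e.g., with dropout kept active) expose model uncertainty, and the total is their sum. The `stochastic_forward` callable is an assumed placeholder, not the paper's API:

```python
import numpy as np

def total_uncertainty(stochastic_forward, x, n_samples=20):
    """Combine data and model uncertainty from n stochastic forward passes.

    `stochastic_forward(x)` stands for a network that returns (mean, variance)
    and is stochastic, e.g., because dropout stays active at test time.
    """
    means, variances = zip(*(stochastic_forward(x) for _ in range(n_samples)))
    data_unc = np.mean(variances)  # average predicted (data) variance
    model_unc = np.var(means)      # spread of the means (model uncertainty)
    return data_unc + model_unc    # total predictive variance
```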
Uncertainty in Deep Learning: The Goal
[Figure: the network predicts both y and a variance σ²: low uncertainty in nominal conditions, high uncertainty under sensor failure.]
Main contribution:
• We propose a novel framework for uncertainty estimation of deep neural network predictions.
• Our framework accounts for the two sources of prediction uncertainty: data uncertainty & model uncertainty.
• Code available at https://github.com/uzh-rpg/deep_uncertainty_estimation
Uncertainty Estimation for Closed-Loop Control
Whenever the neural network predictions are too uncertain, fall back to a safe mode.
Loquercio et al., A General Framework for Uncertainty Estimation in Deep Learning, RA-L 2020, PDF, Video, Code.
Conclusions & Takeaways
➢ Autonomous vision-based agile flight via deep learning is very promising:
• Car-driving datasets or simulation can be used to train deep navigation systems.
• Transfer knowledge via input and output abstractions.
• Measuring the networks' uncertainty is necessary to guarantee safety.
@antoniloq @antonilo