TRANSCRIPT
Deep Learning
Daniel Gehrig
Institute of Informatics – Institute of Neuroinformatics
Outline
• Introduction
  • Motivation and history
• Supervised Learning
  • Semantic Segmentation
  • Image Description
  • Image Localization
• Unsupervised Learning
  • Depth estimation
  • Optical Flow estimation
  • Depth and egomotion
• Applications to Computer Vision
  • LIFT: Learned Invariant Feature Transform
  • SuperPoint
  • Place Recognition: NetVLAD
• Conclusions
Relevant for the exam
The Deep Learning Revolution
Medicine • Media & Entertainment • Autonomous Driving • Surveillance & Security
Some History
• 1958: Perceptron
• 1969: Critics
• 1974: Back-Propagation
• AI Winter
• 1995: SVM
• 1998: Convolutional Neural Networks for Handwritten Digit Recognition
• 2006: Restricted Boltzmann Machines
• 2012: AlexNet
What changed?
• Hardware Improvements
• Big Data Available
• Algorithmic Progress
Image Classification
Task of assigning an input image a label from a fixed set of categories.
Slide adapted from CNNs for Visual Recognition (Stanford)
The Semantic Gap
• What computers see vs. what we see
Classification Challenges
Slide adapted from CNNs for Visual Recognition (Stanford)
Directly specifying what a category looks like is impossible.
We need to use a data-driven approach.
Supervised Learning
Find a function f(x, θ) that imitates a ground-truth signal.
• θ: the function parameters or weights
• Output: N numbers representing class scores
• Predicted: [0.1, 0.7, …, 0.0] vs. ground truth yᵢ: [1.0, 0.0, …, 0.0]
• Loss(f(xᵢ, θ), yᵢ) drives the parameter update
Machine Learning Keywords
• Loss: quantifies how good the parameters θ are
• Optimization: the process of finding θ that minimizes the loss
• Function: problem modelling -> deep networks are highly non-linear functions f(x, θ)
Slide adapted from CNNs for Visual Recognition (Stanford)
Classifiers: K-Nearest Neighbor
f(x, θ) = majority label of the K training examples nearest to x
• How fast is training? How fast is testing? Training: O(1); testing: O(n) (a sketch follows below).
• What is a good distance metric? What K should be used?
[Figure: a test example surrounded by training examples from class 1 and class 2; features are represented in the descriptor space.]
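To make the data-driven approach concrete, here is a minimal k-NN classifier sketch in Python/NumPy; the toy 2-D descriptors and the choice K = 5 are made-up assumptions, not from the lecture:

```python
import numpy as np

def knn_predict(x, train_x, train_y, k=3):
    """Predict the majority label of the k training examples nearest to x."""
    dists = np.linalg.norm(train_x - x, axis=1)  # O(n) distances at test time
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # majority vote

# Toy 2-D descriptor space: two classes clustered around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
train_y = np.array([0] * 20 + [1] * 20)
print(knn_predict(np.array([4.5, 4.8]), train_x, train_y, k=5))  # -> 1
```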
Classifiers: Linear
• Find a linear function to separate the classes:
f(x, θ) = sgn(wᵀx + b)
• What is θ now? What is the dimensionality of images?
Classifiers: Non-linear
Bad classifier (overfitting) vs. good classifier
• What is f(x, θ)?
Biological Inspiration
[Figure: a biological neuron with dendrites, axon, and terminal branches of the axon, next to the perceptron model: inputs x₁, x₂, …, xₙ weighted by w₁, w₂, …, wₙ and summed before the activation.]
f(x, θ) = F(Wx), where F is a non-linear activation function (step, ReLU, sigmoid)
The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Frank Rosenblatt (1958)
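As a hedged illustration of the unit above, f(x, θ) = F(Wx) with a step activation; the weights below are made up so the neuron behaves like a 2-input AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    """Single neuron: weighted sum of the inputs followed by a step activation F."""
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

# Made-up weights implementing an AND-like decision on two binary inputs.
w, b = np.array([1.0, 1.0]), -1.5
print(perceptron(np.array([1.0, 1.0]), w, b))  # -> 1.0
print(perceptron(np.array([1.0, 0.0]), w, b))  # -> 0.0
```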
Multi Layer Perceptron
f(x, θ)
Non-linear Activation functions (ReLU, sigmoid, etc.)
Neural Networks and Deep Learning, Chapter 2 – Michael Nielsen
Forward propagation
Forward Pass: compute Loss(f(xᵢ, θ), yᵢ)
Neural Networks and Deep Learning, Chapter 2 – Michael Nielsen
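A minimal NumPy sketch of the forward pass through a two-layer perceptron with a ReLU hidden layer and softmax class scores; the layer sizes and random weights are arbitrary assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, theta):
    """Forward pass f(x, theta): affine -> ReLU -> affine -> softmax scores."""
    W1, b1, W2, b2 = theta
    h = relu(W1 @ x + b1)        # hidden-layer activations
    s = W2 @ h + b2              # raw class scores
    e = np.exp(s - s.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
theta = (rng.normal(0, 0.1, (16, 8)), np.zeros(16),   # W1, b1
         rng.normal(0, 0.1, (4, 16)), np.zeros(4))    # W2, b2
p = forward(rng.normal(size=8), theta)
print(p, p.sum())  # class probabilities that sum to 1
```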
Optimization: Back-propagation
Backward Pass: compute the gradients of the loss with respect to all parameters and perform gradient descent:
θ_new = θ_old − μ ∇Loss(f(xᵢ, θ), yᵢ)
Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure – Stuart E. Dreyfus
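A toy sketch of the update rule θ_new = θ_old − μ∇Loss; the gradient here is approximated by finite differences on a made-up quadratic loss purely for illustration, whereas back-propagation computes it analytically:

```python
import numpy as np

def numerical_grad(loss, theta, eps=1e-6):
    """Finite-difference approximation of dLoss/dtheta (illustration only)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (loss(theta + d) - loss(theta - d)) / (2 * eps)
    return g

loss = lambda th: ((th - np.array([1.0, -2.0])) ** 2).sum()  # toy quadratic loss
theta, mu = np.zeros(2), 0.1
for _ in range(100):
    theta = theta - mu * numerical_grad(loss, theta)  # theta_new = theta_old - mu*grad
print(theta)  # converges to the minimizer [1, -2]
```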
Problems of fully connected networks
• Too many parameters -> possible overfitting (see the sketch below)
• We are not using the fact that inputs are images!
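A back-of-the-envelope comparison of parameter counts; the image and layer sizes are assumed examples, not from the slides:

```python
# Fully connected layer on a flattened 224x224x3 image with 1000 hidden units:
fc_params = 224 * 224 * 3 * 1000           # weights only, biases ignored
print(f"fully connected: {fc_params:,}")    # ~150 million parameters

# A convolutional layer with 64 filters of size 3x3x3 shares weights spatially:
conv_params = 3 * 3 * 3 * 64
print(f"convolutional:   {conv_params:,}")  # 1,728 parameters
```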
Convolutional Neural Networks
Gradient-based learning applied to document recognition, Y. LeCun et al. (1998)
Going Deep
[Figure: deep network class scores for an input image: Flower 10.2%, Cup 11.5%, Dog 15.6%, Car 62.7%.]
Why Deep?
• Inspired by the human visual system
• Learn multiple layers of transformations of the input
• Extract progressively more sophisticated representations
Supervised Learning
• In supervised learning it is assumed that we have access to both the input data (or images) and ground-truth labels.
• Networks trained with supervision usually perform best.
• However, it is usually hard to generate this data, since it often has to be hand-labelled.
[Diagram: image → network f(x, θ) → prediction, compared against the ground-truth label y by the loss L(f(x, θ), y).]
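For classification, the loss L(f(x, θ), y) is typically the cross-entropy between the softmax of the predicted class scores and the one-hot ground-truth label. A minimal sketch with made-up scores:

```python
import numpy as np

def cross_entropy(scores, y):
    """Cross-entropy between softmax(scores) and the ground-truth class index y."""
    e = np.exp(scores - scores.max())   # numerically stable softmax
    p = e / e.sum()
    return -np.log(p[y])

scores = np.array([0.1, 0.7, 0.0])  # predicted class scores
print(cross_entropy(scores, y=1))   # lower loss: the labelled class scores highest
print(cross_entropy(scores, y=2))   # higher loss for a wrong label
```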
Supervised Learning: Image Segmentation
Fully Convolutional Networks for Semantic Segmentation – J. Long, E. Shelhamer, 2015
Supervised Learning: Image Captioning
Deep Visual-Semantic Alignments for Generating Image Descriptions – Karpathy et al., 2015
Supervised Learning: Image Localization
PlaNet - Photo Geolocation with Convolutional Neural Networks – Weyand et al., 2016
Unsupervised Learning
• In unsupervised learning we only have access to input data or images.
• Usually, these methods are more popular because they can use much larger datasets that do not need to be manually labelled.
L(f(x, θ))
[Diagram: image → network f(x, θ) → prediction; the loss is computed from the prediction alone, without labels.]
Unsupervised Learning: Monocular Depth Estimation
Unsupervised Monocular Depth Estimation with Left-Right Consistency – Godard et al., 2017
Unsupervised Learning: Structure from Motion
Unsupervised Learning of Depth and Ego-Motion from Video – Zhou et al., 2017
Unsupervised Learning: Dense Optical Flow
Characteristics of the learned flow:
• Robustness against light changes (Census Transform)
• Occlusion handling (bi-directional flow)
• Smooth flow
UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss – Meister et al., 2018
Unsupervised vs. Supervised Learning
• Performance: supervised is usually better for the same dataset size; unsupervised is usually worse, but can outperform supervised methods due to larger data availability.
• Data availability: low for supervised, due to manual labelling; high for unsupervised, since no labelling is required.
• Training: simple for supervised, since ground truth gives a strong supervision signal; sometimes difficult for unsupervised, since loss functions have to be engineered to get good results.
• Generalizability: good for supervised, although sometimes the network learns to blindly copy the labels provided, leading to poor generalizability; often better for unsupervised, since unsupervised losses encode the task in a more fundamental way.
Applications to Computer Vision
Deep Descriptors: LIFT
The LIFT pipeline consists of 3 neural networks:
• A keypoint detector
• An orientation estimator
• A descriptor generator
LIFT: Learned Invariant Feature Transform, Kwang Moo Yi et al., 2016
Applications to Computer Vision: Keypoint Detection and Description
LIFT Loss
LIFT Results
• Works better than SIFT! (well, on some datasets)
SuperPoint: Self-Supervised Interest Point Detection and Description
• SIFT and friends are complicated heuristic algorithms
• Still, SIFT is a hard-to-beat baseline for new methods
• Can we do better with a data-driven approach?
• SuperPoint shows how a convolutional neural network can be trained to predict keypoints and descriptors simultaneously.
• The detector is less accurate than SIFT, but the descriptor is shown to outperform SIFT in some scenarios.
Applications to Computer Vision: Keypoint Detection and Description
SuperPoint: Self-Supervised Interest Point Detection and Description – D. DeTone et al., CVPRW 2018
SuperPoint - Training
a) Train on a synthetic dataset to bootstrap the detector
b) Use this detector on real images and generate ground-truth correspondences by sampling perspective transformations (a sketch follows below)
c) Jointly train both the detector and the descriptor
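Step b) works because warping an image with a known homography yields keypoint correspondences for free. A hedged OpenCV sketch of that idea; the corner-perturbation magnitude is an assumption, not the paper's exact setting:

```python
import numpy as np
import cv2

def random_homography_warp(img, max_shift=0.15, rng=None):
    """Warp img with a random homography; return the warp and the 3x3 matrix H."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Randomly perturb the four corners by up to max_shift of the image size.
    offsets = rng.uniform(-max_shift, max_shift, (4, 2)) * [w, h]
    dst = (src + offsets).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h)), H

# Keypoints detected at (x, y) in the original image map to H @ [x, y, 1]^T in
# the warped image, giving ground-truth correspondences without manual labels.
```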
SuperPoint - Qualitative (Cherry-Picked) Results
SuperPoint - Quantitative Results
- ε: Pixel distance to count keypoint as correct. SIFT is very accurate due to sub-pixel refinement.- MLE (Mean Localization Error): Lower is better. Metric for determining detector accuracy- NN mAP (nearest-neighbor mean average precision): Measures discriminativeness of descriptors. Higher is
better.- M. Score (Matching Score): Evaluates detector and descriptor jointly on groundtruth correspondences.
38
Applications to Computer Vision: Place Recognition
Design an "image representation" extractor f(I, θ).
[Figure: a query image and a geotagged image database are mapped by f into a shared image representation space.]
NetVLAD
Mimic the classical pipeline with deep learning:
• Classical: image I -> extract local features (SIFT) -> aggregate (BoW, VLAD, FV) -> representation F(I)
• Deep: image I -> convolutional layers from AlexNet or VGGNet -> trainable pooling layer -> representation F(I)
Slide adapted from NetVLAD presentation, CVPR 2017
NetVLAD Loss
• Triplet loss formulation: matching samples should end up closer to the query than non-matching samples, by at least a margin.
Disclaimer: the actual NetVLAD loss is a slightly more complicated version of the one above.
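A simplified triplet-loss sketch in NumPy; per the disclaimer above, the real NetVLAD loss is more involved, and the margin value here is an assumption:

```python
import numpy as np

def triplet_loss(query, positive, negative, margin=0.1):
    """Push the matching sample closer to the query than the non-matching one,
    by at least `margin`, in the image-representation space."""
    d_pos = np.sum((query - positive) ** 2)  # distance to the matching sample
    d_neg = np.sum((query - negative) ** 2)  # distance to the non-matching sample
    return max(0.0, d_pos - d_neg + margin)  # zero once the margin is respected

q = np.array([0.2, 0.9])
pos, neg = np.array([0.25, 0.85]), np.array([0.9, 0.1])
print(triplet_loss(q, pos, neg))  # ~0: the positive is already much closer
```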
NetVLAD Results
• Code, dataset, and trained network online: give it a try!
• http://www.di.ens.fr/willow/research/netvlad/
[Figure: query images and top retrieved results; green: correct, red: incorrect.]
Slide adapted from NetVLAD presentation, CVPR 2017
Deep Visual Odometry (DeepVO)
DeepVO: End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks, S. Wang et al., ICRA 2017
- End-to-end trained method that computes the camera pose from images directly
- Encodes image pairs to geometric features using a convolutional neural network (CNN)
- Feeds these features to a recurrent neural network (RNN) which outputs the 6-DOF pose
Deep Visual Odometry (DeepVO) - Training
- The network is trained on the KITTI VO/SLAM [1] benchmark with 7410 samples. To deal with the small amount of data, they use a network pretrained on FlowNet [2].
- During training, DeepVO minimizes the difference between the ground-truth and predicted positions and orientations (expressed in Euler angles); a sketch of such a loss follows below.
- The network is later tested on KITTI sequences that were not seen during training.
[1] Are we ready for autonomous driving? The KITTI vision benchmark suite, A. Geiger et al., CVPR 2012.
[2] FlowNet: Learning Optical Flow with Convolutional Networks, A. Dosovitskiy et al., ICCV 2015.
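A sketch of the kind of training loss described above: mean squared error on positions plus a κ-weighted mean squared error on Euler-angle orientations; κ and the sample values are illustrative assumptions (see the DeepVO paper for the exact formulation):

```python
import numpy as np

def pose_loss(p_pred, p_gt, phi_pred, phi_gt, kappa=100.0):
    """MSE on position plus kappa-weighted MSE on Euler-angle orientation."""
    pos_err = np.sum((p_pred - p_gt) ** 2, axis=1)      # per-sample position error
    ori_err = np.sum((phi_pred - phi_gt) ** 2, axis=1)  # per-sample orientation error
    return np.mean(pos_err + kappa * ori_err)

p_pred, p_gt = np.array([[1.0, 0.1, 0.0]]), np.array([[1.1, 0.0, 0.0]])
phi_pred, phi_gt = np.array([[0.01, 0.0, 0.02]]), np.zeros((1, 3))
print(pose_loss(p_pred, p_gt, phi_pred, phi_gt))
```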
Deep Visual Odometry (DeepVO) - Results
- DeepVO is compared to VISO2 [1] (monocular and stereo setups) in terms of rotation and translation error for different sub-trajectory lengths and speeds.
- While DeepVO outperforms the monocular setup of VISO2 (VISO2_M), it is still outperformed by the stereo setup (VISO2_S).
- An important thing to note is that learning-based visual odometry is not yet solved, since many of these methods overfit to the driving scenario (2D motion). However, this field is growing rapidly.
[1] Are we ready for autonomous driving? The KITTI vision benchmark suite, A. Geiger et al., CVPR 2012
Deep Learning Limitations
• Require lots of data to learn
• Difficult debugging and fine-tuning
• Poor generalization across similar tasks
Things to remember
• Deep Learning is able to extract meaningful patterns from data.
• It can be applied to a wide range of tasks.
• Artificial Intelligence ≠ Deep Learning
Come over for projects in DL! Visit our webpage:
http://rpg.ifi.uzh.ch/student_projects.php
Additional Readings
• Neural Networks and Deep Learning, by Michael Nielsen [Chapter 2]
• Practical Recommendations for Gradient-Based Training of Deep Architectures, Y. Bengio
• Deep Learning, I. Goodfellow, Y. Bengio, A. Courville
• All the references above!
Agile Autonomy:
High-Speed Flight with On-board Sensing and Computing
Antonio Loquercio
Code, datasets, videos, publications, and slides: https://antonilo.github.io/
@antoniloq @antonilo
The drone market is valued at $130 billion today
Inspection • Agriculture • Transport • Search and Rescue
Davide Scaramuzza
https://www.pwc.pl/pl/pdf/clarity-from-above-pwc.pdf
How are current commercial drones controlled?
➢ By a human pilot
▪ requires line of sight or a video link
▪ requires a lot of training
➢ By the autopilot: autonomous navigation
▪ GPS: doesn't work in GPS-denied or degraded environments
▪ Lidar (e.g., Exyn): expensive, heavy, power hungry
▪ Cameras (e.g., Parrot, DJI, Skydio): cheap, lightweight, passive (i.e., low power)
Last 10 Years of Progress on Autonomous Vision-based Flight
• 2010: EU SFLY Project (2009-2012) [Bloesch, ICRA 2010]: 1st onboard goal-oriented vision-based flight (previous research focused on reactive navigation)
• 2020: Skydio (2018-2020), DJI (2018-2020), NASA Mars Helicopter (2020): 1st products in the market or sent to another planet ☺
The Traditional Drone Control Architecture
[Diagram: images and IMU feed state estimation (VIO), followed by planning and control, which outputs control commands to track a reference.]
• Mature algorithms, but brittle at high speed due to motion blur.
• Require strong assumptions about the environment.
• Need significant tuning, especially at high speed.
Can we use Deep Learning to Control Drones? Yes!
[Diagram: images, IMU, and a reference feed a neural network that directly outputs control commands.]
What are the biggest challenges?
1. Data collection.
2. Input and output representation.
3. Guaranteeing platform safety during training and testing.
Data Collection: imitate cars and bicycles
➢ They are already integrated in the streets of a city
➢ Can leverage large publicly available datasets
➢ No need for an expert drone pilot
➢ Easy to collect negative experiences
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code.
A drone that thinks it is a car
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code
Generalization: Indoor corridors
➢ Extremely different in appearance from the outdoor training data: no fine-tuning!
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code
What is DroNet looking at?
➢ Line-like features are a strong cue for direction
DroNet: Learning to Fly by Driving, Loquercio et al., Robotics and Automation Letters 2018, PDF, Video, Code
Data Collection: Simulation to Reality Transfer
What are the conditions for a neural network to transfer knowledge between domains?
[Figure: training environments (simulation) vs. test environment (real world).]
• Cannot harm drones at train time.
• Cheap to collect data.
• Can train on ANY trajectory, even those difficult for drone pilots.
Flightmare Quadrotor Simulator
Song, Naji, Kaufmann, Loquercio, Scaramuzza, Flightmare: A Flexible Quadrotor Simulator, CoRL 2020, PDF, Video, Website
Source code: https://uzh-rpg.github.io/flightmare
Sim2Real: Domain Randomization for Drone Racing
Loquercio et al., Deep Drone Racing with Domain Randomization, IEEE T-RO, 2019. PDF, Video. Best System Paper Award at the Conference on Robot Learning (CoRL), 2018.
Training Environments
Loquercio et al., Deep Drone Racing with Domain Randomization, IEEE T-RO, 2019. PDF, Video. Best System Paper Award at the Conference on Robot Learning (CoRL), 2018.
Sim2Real: Domain Randomization for Drone Racing
Sim2Real: Input Abstractions for Drone Acrobatics
Kaufmann*, Loquercio*, Ranftl, Mueller, Koltun, Scaramuzza, Robotics: Science and Systems, 2020, PDF, Video, Code
Source code: https://github.com/uzh-rpg/deep_drone_acrobatics
Deep Drone Acrobatics
Acrobatics are Challenging even for Human Pilots
Kaufmann*, Loquercio*, Ranftl, Mueller, Koltun, Scaramuzza, Deep Drone Acrobatics, RSS 2020. PDF, Video, Code
WARNING: This drone is controlled by a human pilot!
Sim2Real: Learning via Intermediate Representations
[Diagram: instead of feeding raw images, IMU data, and the reference directly to a neural network that outputs thrust and body rates, the images are abstracted into feature tracks and the IMU measurements are integrated* before being passed to the network.]
Sim2Real: Learning via Intermediate Representations
➢ The simulation-to-reality gap in performance of a finite-horizon controller directly depends on the covariate shift introduced by the simulated and real observation models.
➢ Find an abstraction function that reduces the gap between the observation models (a sketch follows below).
[Figure: simulated vs. real observations.]
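One concrete abstraction of this kind, used in Deep Drone Acrobatics, replaces raw images with sparse feature tracks, which look alike in simulation and reality. A hedged OpenCV sketch; the function name and parameters are assumptions, not the paper's implementation:

```python
import numpy as np
import cv2

def abstract_to_feature_tracks(prev_gray, curr_gray, max_corners=64):
    """Abstraction function: a pair of images -> sparse tracks (x, y, dx, dy)."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    ok = status.ravel() == 1                    # keep successfully tracked points
    pts, nxt = pts.reshape(-1, 2)[ok], nxt.reshape(-1, 2)[ok]
    return np.hstack([pts, nxt - pts])          # positions and displacements
```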
Processing Asynchronous Data
• IMU and camera generate observations at different rates.
• We propose an asynchronous architecture that can process inputs arriving at different frequencies.
[Diagram: feature tracks, IMU measurements sampled at 100 Hz, and the reference trajectory sampled at 50 Hz are each encoded by a PointNet branch; temporal convolutions and concatenation feed a multilayer perceptron that outputs the command: collective thrust c and body rates ωx, ωy, ωz.]
Controlled Experiments in Simulation
We outperform the traditional pipeline based on estimation (VIO) and control (MPC).
• Intel RealSense T265 for visual-inertial measurements
• NVIDIA Jetson TX2 for neural network inference and flight control
• Weight: 900 g; maximum thrust: 70 N; thrust-to-weight ratio: ca. 7
Results: Indoor & Outdoor Flight
Kaufmann*, Loquercio*, Ranftl, Mueller, Koltun, Scaramuzza, Deep Drone Acrobatics, RSS 2020, PDF, Video, Code
Takeaways: Learning vs. Model-Based Controllers
Learning-based controller:
• Robust to imperfect perception
• Easy to program
• Does not generalize to different tasks
• Difficult to interpret what it is doing
Traditional controller:
• Sensitive to imperfect perception
• Requires extensive expertise
• High generalization: same controller for all maneuvers
• Transparent and easy to interpret
The future will be about how to best combine these two worlds. Let's now see a way to do it!
Measuring Uncertainty in Deep Neural Networks
• A robotic system cannot blindly trust neural network predictions:
  • What happens if the current observation is very different from the training ones?
  • What if some sensors fail?
[Figure: the prediction y has low uncertainty in nominal conditions and high uncertainty under sensor failure.]
Loquercio et al., A General Framework for Uncertainty Estimation in Deep Learning, RA-L 2020, PDF, Video, Code.
Total Uncertainty
The uncertainty associated with a prediction depends on two factors:
• Data uncertainty
  • Noise inherent in the input data
  • Independent of the training dataset
• Model uncertainty
  • Uncertainty about the network weights
  • Decreases with the amount of training data
The total uncertainty is the combination of the two.
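A sketch of one common way to combine the two terms: the network predicts a mean and a data-uncertainty variance, several stochastic forward passes (e.g., with dropout kept active) expose model uncertainty, and the total is their sum. The `stochastic_forward` callable is an assumed placeholder, not the paper's API:

```python
import numpy as np

def total_uncertainty(stochastic_forward, x, n_samples=20):
    """Combine data and model uncertainty from n stochastic forward passes.

    `stochastic_forward(x)` stands for a network that returns (mean, variance)
    and is stochastic, e.g., because dropout stays active at test time.
    """
    means, variances = zip(*(stochastic_forward(x) for _ in range(n_samples)))
    data_unc = np.mean(variances)  # average predicted (data) variance
    model_unc = np.var(means)      # spread of the means (model uncertainty)
    return data_unc + model_unc    # total predictive variance
```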
Uncertainty in Deep Learning: The Goal
[Figure: the network predicts both y and a variance σ²: low uncertainty in nominal conditions, high uncertainty under sensor failure.]
Main contribution:
• We propose a novel framework for uncertainty estimation of deep neural network predictions.
• Our framework accounts for the two sources of prediction uncertainty: data uncertainty & model uncertainty.
• Code available at https://github.com/uzh-rpg/deep_uncertainty_estimation
Uncertainty Estimation for Closed-Loop Control
Whenever the neural network predictions are too uncertain, fall back to a safe mode.
Loquercio et al., A General Framework for Uncertainty Estimation in Deep Learning, RA-L 2020, PDF, Video, Code.
Conclusions & Takeaways
➢ Autonomous vision-based agile flight via deep learning is very promising:
• Car-driving datasets or simulation can be used to train deep navigation systems.
• Transfer knowledge via input and output abstractions.
• Measuring the networks' uncertainty is necessary to guarantee safety.
@antoniloq @antonilo