3DVTech ECCV ’16 Trends Report · 2016-10-17 · Amsterdam
The measurement rate of an event-based camera is on the order of a microsecond, its independent pixel architecture provides very high dynamic range, and the bandwidth of an event stream is much lower than that of a standard video stream. These superior properties give event-based cameras the potential to overcome some limitations of conventional cameras.
Facts: 1,673 registered attendees · 415 accepted papers (26.6%) · 342 posters (21.9%) · 45 spotlights (2.9%) · 28 orals (1.8%)
“Top-down Neural Attention by Excitation Backprop”,
by Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff
A new backpropagation scheme, Excitation Backprop, based on a probabilistic Winner-Take-All formulation, is proposed to model top-down neural attention in CNN classifiers. The authors also present a contrastive top-down attention, which captures the differential effect between a pair of contrastive top-down signals and significantly improves the discriminativeness of the generated attention maps.
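As a rough illustration of the probabilistic Winner-Take-All idea (a sketch under simplifying assumptions of this report, not the authors' implementation): winning probability at the top layer is redistributed to lower-layer neurons through excitatory (positive) weights, in proportion to each child's activation times its connection strength.

```python
import numpy as np

def excitation_backprop(activations, weights, p_top):
    """Sketch of Excitation Backprop's probabilistic Winner-Take-All.
    activations: list of layer activation vectors, input to output;
    weights[l] has shape (n_out, n_in) and maps layer l to layer l+1;
    p_top: winning probabilities at the output layer.
    Returns marginal winning probabilities at the input layer."""
    p = p_top
    for a, W in zip(reversed(activations[:-1]), reversed(weights)):
        Wp = np.maximum(W, 0.0)                        # excitatory connections only
        contrib = Wp * a[None, :]                      # a_j * W_ij^+ for parent i, child j
        contrib /= contrib.sum(axis=1, keepdims=True) + 1e-12
        p = contrib.T @ p                              # pass probability down one layer
    return p
```

The resulting probability map over input units plays the role of the attention map; the contrastive variant subtracts the map produced by a negated top-down signal.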
European Conference on Computer Vision
ECCV ’16 Trends
3DVTech Trends Report
Amsterdam, October 2016
Short version
Best papers
CNN reduction
Human pose
SFM-MVS
Other
“Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera”, by Hanme Kim, Stefan Leutenegger and Andrew Davison from Imperial College London. This paper presents a method which can perform real-time 3D reconstruction from a single hand-held event camera with no additional sensing, and which works in unstructured scenes of which it has no prior knowledge. It is based on three decoupled probabilistic filters, each estimating one quantity: 6-DoF camera motion, scene logarithmic (log) intensity gradient, and scene inverse depth relative to a keyframe. They build a real-time graph of these to track and model over an extended local workspace.
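To make the event-stream idea concrete, here is a toy model (not the paper's method; the frame-based timing and the fixed threshold are simplifying assumptions) of how an event camera emits a sparse stream of per-pixel log-intensity change events:

```python
import numpy as np

def events_from_frames(frames, threshold=0.2):
    """Emit (t, x, y, polarity) events whenever the log intensity at a
    pixel changes by more than `threshold` since that pixel's last event.
    Real sensors do this asynchronously per pixel; this sketch
    quantizes time to frame indices."""
    log_ref = np.log(frames[0] + 1e-6)          # per-pixel reference level
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        log_i = np.log(frame + 1e-6)
        diff = log_i - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, x, y, 1 if diff[y, x] > 0 else -1))
            log_ref[y, x] = log_i[y, x]         # reset reference at this pixel
    return events
```

Static pixels emit nothing at all, which is why the bandwidth of an event stream is far below that of full video frames.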
Downloadable Codes & links
• ECCV ’16 conference: http://eccv2016.org/
• SSD: Single Shot MultiBox Detector: SSD is a unified framework for object detection with a single network. You can use the code to train or evaluate a network for an object-detection task.
https://github.com/weiliu89/caffe/tree/ssd
Download this report at http://www.3dvtech.com/
A longer version or a specific review of the conference can be requested; please contact 3DVTech.
Best Student Award
Emma Alexander, for “Focal Flow: Measuring Distance and Velocity with Defocus and Differential Motion”.
She presents a new system for perceiving depth: the Focal Flow sensor. It is an unactuated, monocular camera that simultaneously exploits defocus and differential motion to measure a depth map and a 3D scene velocity field. It does so using an optical-flow-like, per-pixel linear constraint that relates image derivatives to depth and velocity.
Best Paper Awards
3DVTech trends report – ECCV ’16 – Short version
3DVTech - CréACannes
11 avenue Maurice Chevalier 06150 Cannes La Bocca
Téléphone : 06 21 13 81 28
Email : [email protected]
www.3DVTech.com
See: https://youtu.be/yHLyhdMSw7w
• Pixelwise View Selection for Unstructured Multi-View Stereo:
COLMAP is a general-purpose Structure-from-Motion (SfM) and Multi-View
Stereo (MVS) pipeline for robust and dense modeling from unstructured images.
https://colmap.github.io
[Figure: Binary-Weight-Networks and XNOR-Networks]
[Figure: human pose and 3D mesh computed by the SMPL method]
New challenges and datasets
Having datasets for learning, for evaluating, or for comparing results is an important concern that was discussed throughout the conference. Here is a list of proposed datasets:
• Tracking:
http://www.votchallenge.net/
https://motchallenge.net/
• Learning:
Common Objects in Context, http://mscoco.org/, now has detection and keypoint challenges.
New http://image-net.org/ 2016 challenges; the authors received the PAMI Everingham Prize.
• City/Road:
https://www.cityscapes-dataset.com/
GTA-based game dataset: http://download.visinf.tu-darmstadt.de/data/from_games/
Toronto announced for 2017 a new large dataset covering aerial and street views and maps of the full Toronto area.
• Features:
https://github.com/featw/hpatches for local descriptor matching
https://archive.ics.uci.edu/ml/datasets/SIFT10M
• Other 2016 datasets:
ChaLearn 2016: http://gesture.chalearn.org/2016-looking-at-people-eccv-workshop-challenge
Movie description: https://sites.google.com/site/describingmovies/
CNN reduction
Using CNNs currently requires large memory and processing power, which are not always available on devices such as smartphones. Several papers proposed solutions to tackle those limitations.
M. Rastegari proposes two approximations of standard CNNs: Binary-Weight-Networks, which approximate the filters with binary values, and XNOR-Networks, which approximate the convolutions with binary operations. This results in 58× faster convolution operations and 32× memory savings.
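The binary-weight approximation can be sketched as follows (the function name and the tiny example matrix are illustrative; only the alpha·sign(W) form follows the paper's idea):

```python
import numpy as np

def binarize_weights(W):
    """Binary-Weight-Networks style approximation (a sketch of the idea,
    not the paper's training procedure): approximate the real-valued
    filter W by alpha * B, where B holds only +1/-1 and the scalar
    alpha = mean(|W|) minimizes ||W - alpha * B||^2."""
    B = np.where(W >= 0.0, 1.0, -1.0)
    alpha = np.abs(W).mean()
    return alpha, B

W = np.array([[0.5, -0.3],
              [0.1, -0.9]])
alpha, B = binarize_weights(W)   # alpha = 0.45, B = [[1, -1], [1, -1]]
```

Convolving with alpha·B needs only additions, subtractions and one final scaling, which is where the speed and memory savings come from.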
“LIFT: Learned Invariant Feature Transform”, by Kwang Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua
The paper presents a novel deep network architecture that implements the full feature-point handling pipeline, that is, detection, orientation estimation, and feature description. While previous works have successfully tackled each of these problems individually, they show how to learn to do all three in a unified manner while preserving end-to-end differentiability. The fully learned detector and descriptor outperform the state of the art in feature detection and description.
Code is available:
https://github.com/cvlab-epfl/LIFT
In Deep Networks with Stochastic Depth, Gao Huang proposes a training procedure that enables the seemingly contradictory setup of training short networks while using deep networks at test time. This is achieved by randomly skipping layers entirely during training. The authors could reduce training time by 25% while keeping the same accuracy.
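A minimal sketch of the stochastic-depth idea on a toy residual network (the block form and the single survival probability are simplifying assumptions; the paper varies the survival probability per layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_depth_forward(x, weights, survival=0.8, train=True):
    """Sketch of stochastic depth: during training each residual block
    is dropped entirely (identity shortcut only) with probability
    1 - survival, so the effective network is short; at test time every
    block runs, with its output scaled by the survival probability."""
    for W in weights:
        if train:
            if rng.random() < survival:
                x = x + np.tanh(W @ x)         # block survives
            # else: block skipped, x passes through unchanged
        else:
            x = x + survival * np.tanh(W @ x)  # expected-value scaling
    return x
```

Skipping a block removes both its forward and backward pass, which is where the training-time savings come from.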
In Less is More: Towards Compact CNNs, Hao Zhou proposes to remove some neurons during training using sparse constraints. Experimental results on four well-known CNN architectures demonstrate a significant reduction in the number of neurons and in the memory footprint of the testing phase without affecting the classification accuracy (see the table below).
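A minimal sketch of the neuron-removal idea, assuming an L1-style criterion (the threshold and matrix shapes are illustrative; the paper's exact sparse constraints differ):

```python
import numpy as np

def prune_neurons(W_in, W_out, threshold=1e-3):
    """Sketch of sparsity-driven neuron removal: after training with a
    sparse (e.g. L1) penalty, a hidden neuron whose incoming weights
    have near-zero L1 norm contributes almost nothing, so it can be
    dropped, shrinking both adjacent weight matrices.
    W_in: (n_hidden, n_in), W_out: (n_out, n_hidden)."""
    importance = np.abs(W_in).sum(axis=1)   # L1 norm of each neuron's inputs
    keep = importance > threshold
    return W_in[keep], W_out[:, keep], keep
```

Removing a whole neuron (rather than individual weights) shrinks the dense matrices themselves, which is what reduces the test-time memory footprint.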
Human Poses
A large number of papers focused on recovering the pose, the skeleton, and even the shape of humans from a single image. Here are some examples.
In Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image, Federica Bogo proposes a two-step method to automatically estimate the 3D pose of the human body as well as its 3D shape. In the first step, 2D joint locations are detected on the body using a CNN-based method called DeepCut.
[Table: neuron reduction by compacting CNNs]
Then a gender-specific top-down model called SMPL is used to fit a 3D shape directly to the 2D joints. The resulting model can be immediately posed and animated.
Code is available at:
http://smplify.is.tue.mpg.de/
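The second step can be caricatured as reprojection-error minimization. Below is a toy sketch with a stand-in linear "body model" (the `basis` matrix and `fit_pose_2d` are hypothetical names for illustration; the real SMPL fit optimizes the pose and shape parameters of a full mesh model with priors):

```python
import numpy as np

def fit_pose_2d(joints_2d, basis, theta0, lr=0.1, steps=200):
    """Toy sketch of the fitting step: find parameters theta that
    minimize the squared reprojection error between model joints and
    detected 2D joints. The "body model" here is a stand-in linear map
    joints(theta) = basis @ theta, fitted by gradient descent."""
    theta = theta0.copy()
    for _ in range(steps):
        pred = basis @ theta                    # model-predicted 2D joints
        grad = 2.0 * basis.T @ (pred - joints_2d)
        theta -= lr * grad                      # gradient step on the error
    return theta
```

In the real pipeline the 2D targets come from DeepCut's detections, and the objective also includes pose and shape priors to keep the fit plausible.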
In DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model, E. Insafutdinov augments the DeepCut framework by improving the 2D joint location estimation and by adding a step that selects a body model from a list of proposed body-part configurations.
They introduce novel image-conditioned pairwise terms between body parts that significantly push the performance in the challenging case of multi-person pose estimation and dramatically reduce the run-time.
SFM and MVS
Pixelwise View Selection for Unstructured Multi-View Stereo, by Johannes Schönberger. This paper presents a Multi-View Stereo system for robust and efficient dense modeling from unstructured image collections.
https://colmap.github.io/
Other papers
The Fast Bilateral Solver, J. Barron. It is a generalization of the bilateral filter (Honorable Mention paper).
https://github.com/poolio/bilateral_solver
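For background, here is a minimal 1-D bilateral filter (illustrative only; the fast bilateral solver replaces such local filtering with a fast edge-aware global optimization):

```python
import numpy as np

def bilateral_filter_1d(signal, sigma_s=2.0, sigma_r=0.1, radius=4):
    """Minimal 1-D bilateral filter: each sample is replaced by an
    average of its neighbors, weighted by both spatial distance and
    intensity difference, so smoothing does not cross edges."""
    out = np.empty_like(signal)
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        pos = np.arange(lo, hi)
        w = (np.exp(-((pos - i) ** 2) / (2 * sigma_s ** 2))          # spatial weight
             * np.exp(-((window - signal[i]) ** 2) / (2 * sigma_r ** 2)))  # range weight
        out[i] = (w * window).sum() / w.sum()
    return out
```

The range term is what makes the filter edge-preserving: samples on the other side of an intensity step get near-zero weight.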
Focal Flow: Measuring Depth and Velocity from Defocus and Differential Motion, E. Alexander. A new depth-sensing technology.
[Figure: poses jointly estimated by DeeperCut]
Workshop: Geometry Meets Deep Learning
Recovering the 3D geometry of the world from 2D and 3D visual data is a central task in computer vision. Traditional approaches to geometric vision problems are mostly based on handcrafted geometric representations and image features.
The goal of this workshop was to encourage the interplay between 3D vision and deep learning through invited talks from 11 experts in both domains.
The best paper award was attributed to Learning Covariant Feature Detectors by Karel Lenc. Detection is different from description: Karel presented a way to improve detection by learning to extract viewpoint-covariant features from images.
Code is available:
https://github.com/lenck/ddet
ImageNet and COCO Visual Recognition Challenges Joint Workshop
This challenge is the main competition in object classification and detection. The workshop presented the latest methods and results on a new dataset.
The best results in object detection from images and video were achieved by the Chinese University of Hong Kong:
http://www.ee.cuhk.edu.hk/~wlouyang/projects/GBD/index.html
The Scene Classification Challenge was won by Hikvision.