
Physics-based Feature Extraction and Image Manipulation via Autoencoders

Abstract

- We experiment with the extraction of physics-based features by utilizing synthesized data as ground truth.

- We utilize these extracted features to perform image space manipulations.

- We model our network as a semi-supervised adversarial autoencoder, and train our encoder to extract features corresponding to physical properties of the scene described by the image.

Motivation

An existing issue with style transfer is that results are often aesthetically pleasing yet not very realistic. We think it would be interesting to address this problem by enforcing physical constraints on image generation via semi-supervised methods.

Motivated by recent work on applications of deep learning to image synthesis, we explore a hybrid technique between purely data-based methods and physics-based generative models, by training a joint encoder-decoder network that performs extraction of graphical appearance on the encoder end, and learns a feature-based 2D render engine on the decoder end.

Our Approach

We use a similar framework to [3], where we train an encoder on some number of extrinsic features such as depth, surface normals, texture, and lighting, as well as a variable number of hidden intrinsic parameters, and train a decoder to act as a 2D image-space renderer, which attempts to output the original image given the feature vector generated by the encoder.

Based on the positive results reported by [1] and [9] for image generation from feature vector modifications, we use a pretrained VGG19 network as the first layer of our encoder network, and combine it with the existing architecture from [3] for our initial results.
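
A minimal PyTorch sketch of this encoder/decoder pairing is shown below. The class names, layer sizes, feature dimensions, and the frozen VGG19 backbone are illustrative assumptions rather than our exact configuration.

    import torch
    import torch.nn as nn
    from torchvision import models

    class FeatureEncoder(nn.Module):
        """Pretrained VGG19 features followed by a small head that predicts one
        feature code (e.g. lighting) plus the hidden intrinsic parameters."""
        def __init__(self, feature_dim=64, intrinsic_dim=128):
            super().__init__()
            vgg = models.vgg19(pretrained=True).features
            for p in vgg.parameters():          # keep the pretrained layers frozen
                p.requires_grad = False
            self.backbone = vgg
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.head = nn.Linear(512, feature_dim + intrinsic_dim)

        def forward(self, x):
            h = self.pool(self.backbone(x)).flatten(1)
            return self.head(h)                 # [B, feature_dim + intrinsic_dim]

    class Decoder(nn.Module):
        """A learned 2D 'renderer' that maps the concatenated feature vector
        back to an image."""
        def __init__(self, in_dim, img_channels=3):
            super().__init__()
            self.fc = nn.Linear(in_dim, 512 * 4 * 4)   # swapped out per training stage
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Sigmoid(),
            )

        def forward(self, z):
            h = self.fc(z).view(-1, 512, 4, 4)
            return self.deconv(h)               # 64x64 reconstruction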

Architecture

The architecture can be split into three general components: the encoders, the decoder, and the discriminators.

We train separate encoders via semi-supervised methods. First, for the intrinsic feature vector, we append the ground truth features we obtained to the encoder output before passing it to the decoder, as sketched below.
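
A hypothetical version of this first training stage (function names and the loss are assumptions; only the overall flow is taken from the description above):

    import torch

    # Stage 1: the intrinsic encoder's output is concatenated with the
    # ground-truth extrinsic features, and the decoder must reconstruct
    # the input image from the concatenation.
    def intrinsic_step(intrinsic_enc, decoder, recon_loss, image, gt_features):
        z_intrinsic = intrinsic_enc(image)                  # [B, intrinsic_dim]
        z = torch.cat([z_intrinsic, gt_features], dim=1)    # append ground truth
        recon = decoder(z)
        return recon_loss(recon, image)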

For each extrinsic feature encoder, we remove the corresponding ground truth feature and train an encoder for it, still appending the other ground truth features and the trained intrinsic vector (see the sketch after this paragraph). This is a variation on [3]'s methods.
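
A hypothetical per-feature step for this second stage; whether the intrinsic encoder is frozen and whether a supervised term ties the predicted feature to ground truth are assumptions, not details from the poster:

    import torch

    # Stage 2, for one extrinsic feature (e.g. lighting): that feature is
    # predicted by its own encoder instead of taken from ground truth, while
    # the remaining ground-truth features and the previously trained intrinsic
    # code are still appended before decoding.
    def extrinsic_step(enc_k, intrinsic_enc, decoder, recon_loss, feat_loss,
                       image, other_gt_features, gt_feature_k):
        with torch.no_grad():                       # reuse the trained intrinsic encoder
            z_intrinsic = intrinsic_enc(image)
        z_k = enc_k(image)                          # predicted extrinsic feature k
        z = torch.cat([z_intrinsic, z_k, other_gt_features], dim=1)
        recon = decoder(z)
        return recon_loss(recon, image) + feat_loss(z_k, gt_feature_k)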

For each feature encoder, we utilize an adversarial network discriminator during training time to enforce a Gaussian prior on the encoder outputs.
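
A minimal sketch of this regularizer in the style of adversarial autoencoders [6]; the discriminator architecture and code size are assumptions:

    import torch
    import torch.nn as nn

    code_dim = 64   # assumed size of one feature code

    # A small MLP discriminator separates encoder codes from standard-Gaussian
    # samples; the encoder is trained to fool it, pushing codes toward the prior.
    disc = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    bce = nn.BCEWithLogitsLoss()

    def prior_step(encoder, image):
        z_fake = encoder(image)
        z_real = torch.randn_like(z_fake)                     # prior samples
        ones = torch.ones(z_real.size(0), 1)
        zeros = torch.zeros(z_real.size(0), 1)
        d_loss = bce(disc(z_real), ones) + bce(disc(z_fake.detach()), zeros)
        g_loss = bce(disc(z_fake), ones)                      # encoder fools disc
        return d_loss, g_loss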

For our decoder, we swap out the first fully connected layer each time we replace the ground truth features with the feature encodings.
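
A hypothetical helper illustrating that swap (it assumes the decoder exposes its first fully connected layer as an attribute, as in the earlier sketch):

    import torch.nn as nn

    # Re-initialize only the decoder's first fully connected layer when the
    # input dimensionality changes; the convolutional stack is kept as-is.
    def swap_decoder_input(decoder, new_input_dim, hidden=512 * 4 * 4):
        decoder.fc = nn.Linear(new_input_dim, hidden)
        return decoder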

Data

The main challenge of this project is the data collection; as described above, we need ground truth that is not easily measurable in the real world. Few existing datasets go beyond RGBD, so we ended up having to spend a significant amount of time on data collection.

We synthesize our own data by heavily augmenting pbrt3, a state-of-the-art research-oriented renderer, to output quantities such as material approximations, surface normals, depth, and lighting. A visualization of two scenes from different angles is shown below.

From left to right: original image, depth map, light intensity map, material approximation, and normal map.
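
A hypothetical loader for the synthesized data; the on-disk layout (one PNG per channel, matched by file stem under per-channel folders) is an assumption about how the augmented pbrt outputs are organized, not our actual pipeline.

    import glob
    import os
    from PIL import Image
    from torch.utils.data import Dataset
    from torchvision import transforms

    class SyntheticScenes(Dataset):
        # per-pixel ground-truth channels emitted alongside each rendered image
        CHANNELS = ["rgb", "depth", "light", "material", "normal"]

        def __init__(self, root):
            self.root = root
            self.stems = sorted(
                os.path.splitext(os.path.basename(p))[0]
                for p in glob.glob(os.path.join(root, "rgb", "*.png")))
            self.to_tensor = transforms.ToTensor()

        def __len__(self):
            return len(self.stems)

        def __getitem__(self, i):
            sample = {}
            for ch in self.CHANNELS:
                path = os.path.join(self.root, ch, self.stems[i] + ".png")
                sample[ch] = self.to_tensor(Image.open(path))
            return sample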

Implementation

We used PyTorch to implement our models, and augmented existing C++ code to synthesize our data.

Results

Experiments are still running due to complications with the dataset; results will be added when the poster is printed.

Future Work

1. How does the number of intrinsics in the encoder output affect accuracy?

2. Can we effectively perform image relighting or recoloring via manipulation of the decoder's input?

3. What types of images are hard to work with? Can we capture complex phenomena such as reflection and subsurface scattering?

References

[1] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[2] P. Khungurn, D. Schroeder, S. Zhao, K. Bala, and S. Marschner. Matching real fabrics with micro-appearance models. ACM Transactions on Graphics (TOG), 35(1):1, 2015.

[3] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. CoRR, abs/1503.03167, 2015.

[4] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014.

[5] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.

[6] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[7] R. Ng, R. Ramamoorthi, and P. Hanrahan. All-frequency shadows using non-linear wavelet lighting approximation. In ACM Transactions on Graphics (TOG), volume 22, pages 376–381. ACM, 2003.

[8] P. Ren, Y. Dong, S. Lin, X. Tong, and B. Guo. Image based relighting using neural networks. ACM Transactions on Graphics (TOG), 34(4):111, 2015.

[9] P. Upchurch, J. Gardner, K. Bala, R. Pless, N. Snavely, and K. Weinberger. Deep feature interpolation for image content changes. arXiv preprint arXiv:1611.05507, 2016.

Winnie Lin
CS 231N 2017: Final Project
Stanford University