

Source: cs231n.stanford.edu/reports/2017/posters/805.pdf

Fast and CLEVR Visual Reasoning Programs via Improved Topologically Batched Graph Execution

Joseph Suarez and Clare Zhu

Background

Johnson et al. recently demonstrated the effectiveness of neural-module-based programs for visual question answering, reducing error rates on CLEVR by 8.6X over strong baselines and 2.3X over human evaluators. We address and remedy performance issues of the authors’ architecture, providing both strong computational complexity bounds and practical speedups of over 14X.

Motivation

Modern neural network architectures achieve >10X speedups on GPU via parallelization over batch size. Because the model of Johnson et al. assembles a separate architecture for each example, this parallelization is not feasible in the forward pass and in practice yields <2X speedups during the backward pass, regardless of batch size. This impedes extension and adaptation of their model.

Practical Considerations

• Theoretical bounds assume perfect parallelization over batch size.

• Improved Topological Sort introduces additional CPU code and data split/concat operations that could be further optimized.

• The execution engine is the main performance bottleneck.

• The backward pass is inefficient because of the current PyTorch backend; this is not a theoretical limitation. Our bound holds for the backward pass.

Conclusion

We successfully implemented the authors’ model. Applying our improved topological sort to it, we achieved speedups of over:

• 2.0X for overall training

• 5.5X for the forward pass

• 14X during cell execution

while being ~30% more memory efficient.

Improved Topological Program Graph Sort

Theoretical Bounds

Notation:

• p = program vocabulary size

• s = max program length

• b = batch size

• d = max tree depth

Bounds:

• Vanilla: O(bs)

• Topological Sort: O(ps)

• Improved Topological Sort: O(pd)

• + Balanced Program Trees: O(p log2 d)

Efficiency Gains

[Figure: panels labeled "Topological" and "Aggregate"]

• All programs in a batch are aggregated at once.

• Blocks are executed at O(p) per block, for O(d) blocks.
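The idea of aggregating all programs in a batch and executing them block by block can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy cells (`scene`, `inc`, `add`), the nested-tuple program format, and the helper names `flatten`, `run_vanilla`, and `run_batched` are all hypothetical. Each node is assigned a height (0 for leaves), nodes from every program are grouped by (height, cell type), and each group is executed with one batched call, so the number of calls is at most p per level over d levels.

```python
from collections import defaultdict

# Hypothetical toy cells. Each takes a list of argument tuples (one per node)
# and returns a list of outputs, standing in for one batched module call.
CELLS = {
    "scene": lambda batch: [1.0 for _ in batch],
    "inc": lambda batch: [a + 1.0 for (a,) in batch],
    "add": lambda batch: [a + b for (a, b) in batch],
}

def flatten(program, nodes):
    """Append the tree's nodes in post-order as (op, child_ids, height); return the root id."""
    op, *children = program
    child_ids = [flatten(child, nodes) for child in children]
    height = 1 + max((nodes[c][2] for c in child_ids), default=-1)  # leaves get 0
    nodes.append((op, child_ids, height))
    return len(nodes) - 1

def run_vanilla(programs):
    """One cell call per node per program: O(bs) calls."""
    results, calls = [], 0
    for program in programs:
        nodes = []
        root = flatten(program, nodes)
        out = {}
        for i, (op, kids, _) in enumerate(nodes):  # post-order: children come first
            out[i] = CELLS[op]([tuple(out[k] for k in kids)])[0]
            calls += 1
        results.append(out[root])
    return results, calls

def run_batched(programs):
    """Aggregate all programs, then run one call per (height, cell type) group."""
    nodes, roots = [], []
    for program in programs:
        roots.append(flatten(program, nodes))
    groups = defaultdict(list)
    for i, (op, _, height) in enumerate(nodes):
        groups[(height, op)].append(i)
    out, calls = {}, 0
    for height, op in sorted(groups):  # ascending height is a valid topological order
        ids = groups[(height, op)]
        args = [tuple(out[k] for k in nodes[i][1]) for i in ids]
        for i, value in zip(ids, CELLS[op](args)):
            out[i] = value
        calls += 1
    return [out[r] for r in roots], calls

programs = [
    ("add", ("inc", ("scene",)), ("scene",)),
    ("inc", ("inc", ("scene",))),
    ("add", ("scene",), ("scene",)),
]
results_v, calls_v = run_vanilla(programs)
results_b, calls_b = run_batched(programs)
print(results_v, calls_v)
print(results_b, calls_b)
```

Both functions return the same values, and the batched version issues fewer cell calls. In a tensor-based model, building `args` and scattering the outputs would correspond to the data concat and split operations mentioned under Practical Considerations.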

Johnson et al., 2017. Inferring and Executing Programs for Visual Reasoning.

Dataset

CLEVR is a synthetic but realistic visual question answering dataset consisting of (question, image, answer) triplets. It also contains a functional program representation of each question (accessible during training).
