TRANSCRIPT
![Page 1: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/1.jpg)
DrillDown: Interactive Retrieval of Complex Scenes Using Natural Language Queries
![Page 2: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/2.jpg)
When we’d like to retrieve an image of a complex scene
Difficult to describe the whole scene in one sentence
![Page 3: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/3.jpg)
Image Search Engine
Single sentence as query
No refinement (no interaction)
![Page 4: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/4.jpg)
Find a specific image in our gallery album
or online image collection
![Page 5: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/5.jpg)
Image Retrieval with Multiple Rounds Queries
Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries
Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez.
Conf. on Neural Information Processing Systems. NeurIPS 2019. Vancouver, Canada. December 2019.
![Page 6: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/6.jpg)
Previous efforts on Image-Text Matching
Two women sitting on the sofa
Woman in white shirt holding a dog
Woman in yellow shirt holding a cat
CNN RNN
1D Feature Space
[1] DeViSE: A Deep Visual-Semantic Embedding Model. Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov. NIPS 2013.
[2] Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Andrej Karpathy, Armand Joulin, Li Fei-Fei. NIPS 2014.
![Page 7: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/7.jpg)
Previous efforts on Image-Text Matching
[3] Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations. Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma. CVPR 2019.
![Page 8: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/8.jpg)
Observations
Feature channels
Spatial dimensions
2D image representation can help distinguish instances sharing the same feature subspace
![Page 9: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/9.jpg)
Observations
Feature channels
Spatial dimensions
Two women sitting on the sofa
Woman in white shirt holding a dog
Woman in yellow shirt holding a cat
1D sentence representation can NOT distinguish instances sharing the same feature subspace
![Page 10: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/10.jpg)
Observations
Feature channels
Spatial dimensions
Two women sitting on the sofa
Woman in white shirt holding a dog
Woman in yellow shirt holding a cat
2D sentence representation
“person” subspace
“dog” subspace
“cat” subspace
Instance1
Instance2
Instance3
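The observation on the last three slides can be illustrated with a toy example. This is only a sketch under stated assumptions: the four hand-made feature channels and the mean-pooling step are hypothetical stand-ins, not the encoders used in the talk. It shows two different scenes whose pooled 1D vectors are identical, while keeping one vector per instance still tells them apart.

```python
# Toy illustration only: hand-made binary channels, not learned features.
CHANNELS = ["white shirt", "yellow shirt", "dog", "cat"]  # hypothetical channels


def encode(attributes):
    """Encode one instance as a vector with one slot per channel."""
    return [1.0 if c in attributes else 0.0 for c in CHANNELS]


def mean_pool(vectors):
    """Collapse per-instance vectors into a single 1D vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Scene A: woman in white holds the dog, woman in yellow holds the cat.
scene_a = [encode({"white shirt", "dog"}), encode({"yellow shirt", "cat"})]
# Scene B: the pets are swapped.
scene_b = [encode({"white shirt", "cat"}), encode({"yellow shirt", "dog"})]

same_1d = mean_pool(scene_a) == mean_pool(scene_b)  # pooled vectors collide
same_2d = sorted(scene_a) == sorted(scene_b)        # per-instance sets differ

print(same_1d, same_2d)  # prints: True False
```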
![Page 11: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/11.jpg)
We still want compact representations
Especially for retrieval applications
Feature vector 1: Sentence 1
Feature vector 2: Sentence 2
Feature vector 3: Sentence 3
...
![Page 12: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/12.jpg)
Text input
Pre-allocated state vectors
![Page 13: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/13.jpg)
Text feature
![Page 14: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/14.jpg)
Action: which state vector to update
![Page 15: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/15.jpg)
Update the state vector
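The select-then-update step can be sketched as follows. The slides do not say how the action is chosen or how the state is updated, so everything here is an assumption made for illustration: the cosine-similarity heuristic, the `threshold` parameter, and the averaging update are hand-written stand-ins for whatever learned components the model actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def is_empty(state):
    """A pre-allocated slot that has not been written to yet."""
    return not any(state)


def select_state(states, text_feature, threshold=0.5):
    """Action: reuse the most similar slot, else take the first empty one."""
    sims = [cosine(s, text_feature) for s in states]
    best = max(range(len(states)), key=lambda i: sims[i])
    if sims[best] >= threshold:
        return best
    for i, s in enumerate(states):
        if is_empty(s):
            return i
    return best  # no empty slot left: overwrite the closest one


def update_state(state, text_feature):
    """Update: copy into an empty slot, otherwise average (assumed rule)."""
    if is_empty(state):
        return list(text_feature)
    return [(s + t) / 2 for s, t in zip(state, text_feature)]


def step(states, text_feature):
    """One query round: returns the chosen index and the new state list."""
    i = select_state(states, text_feature)
    new_states = [list(s) for s in states]
    new_states[i] = update_state(states[i], text_feature)
    return i, new_states


states = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]       # pre-allocated state vectors
i1, states = step(states, [1.0, 0.0, 0.0])        # new instance, first slot
i2, states = step(states, [0.0, 1.0, 0.0])        # different instance, second slot
i3, states = step(states, [1.0, 0.2, 0.0])        # refines the first instance
print(i1, i2, i3)  # prints: 0 1 0
```

Only one slot changes per round, so the number of vectors stored per query session stays fixed no matter how many sentences the user provides.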
![Page 16: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/16.jpg)
Pairwise alignment between state vectors and image regions
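One way to turn pairwise alignments into a ranking score is sketched below. The scoring rule is an assumption, not taken from the slides: each non-empty state vector is matched to its best image region by cosine similarity, and the matches are averaged. The gallery names and the 2-dimensional features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def alignment_score(states, regions):
    """Average, over non-empty states, of the best-matching region similarity."""
    active = [s for s in states if any(s)]
    if not active or not regions:
        return 0.0
    return sum(max(cosine(s, r) for r in regions) for s in active) / len(active)


def rank_images(states, gallery):
    """Sort image names by alignment score, best first."""
    return sorted(gallery, key=lambda name: alignment_score(states, gallery[name]),
                  reverse=True)


states = [[1.0, 0.0], [0.0, 1.0]]                  # two described instances
gallery = {
    "img_a": [[1.0, 0.0], [0.0, 1.0]],             # contains both instances
    "img_b": [[1.0, 0.0], [1.0, 0.0]],             # contains only the first
}
ranking = rank_images(states, gallery)
print(ranking)  # prints: ['img_a', 'img_b']
```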
![Page 17: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/17.jpg)
Simulated queries through region-phrase annotations at training time
Human queries
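Simulating a multi-round query from region-phrase annotations might look like the sketch below. The slide only says that such annotations are used at training time; the sampling strategy (uniform, without replacement), the `(box, phrase)` tuple layout, and the function name are assumptions made for illustration.

```python
import random


def simulate_queries(region_phrases, num_rounds, seed=None):
    """Draw up to num_rounds distinct phrases, one per query round."""
    rng = random.Random(seed)
    k = min(num_rounds, len(region_phrases))
    picked = rng.sample(region_phrases, k)
    return [phrase for _box, phrase in picked]


# Hypothetical annotations: (bounding box as x, y, w, h; region phrase).
annotations = [
    ((10, 20, 80, 120), "woman in white shirt holding a dog"),
    ((110, 25, 85, 118), "woman in yellow shirt holding a cat"),
    ((0, 90, 220, 100), "two women sitting on the sofa"),
]

queries = simulate_queries(annotations, num_rounds=2, seed=0)
for round_index, query in enumerate(queries, start=1):
    print(round_index, query)
```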
![Page 18: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/18.jpg)
Quantitative evaluation on a test set of 10,000 images
Although the more state vectors, the better
![Page 19: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/19.jpg)
Although the more state vectors, the better
We could have an even more compact representation
Quantitative evaluation on a test set of 10,000 images
![Page 20: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/20.jpg)
Quantitative evaluation on a test set of 10,000 images
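The transcript does not preserve which metric the plots report. A common choice for retrieval is Recall@K, the fraction of queries whose target image appears among the top K results; the sketch below assumes that kind of measure, and the image names are hypothetical.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target is within the top-k ranked images."""
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Three queries, each with a ranked list of gallery image names.
rankings = [
    ["img_3", "img_1", "img_2"],
    ["img_2", "img_3", "img_1"],
    ["img_1", "img_2", "img_3"],
]
targets = ["img_1", "img_2", "img_3"]

r1 = recall_at_k(rankings, targets, k=1)
r3 = recall_at_k(rankings, targets, k=3)
print(r1, r3)
```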
![Page 21: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/21.jpg)
Target
![Page 22: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/22.jpg)
Target
![Page 23: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/23.jpg)
Target
![Page 24: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/24.jpg)
Target
![Page 25: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/25.jpg)
Target
![Page 26: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/26.jpg)
Target
![Page 27: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/27.jpg)
Target
![Page 28: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/28.jpg)
Future work: instance-aware text encoder for dialog-based applications?
Potential challenges:
● Named entity detection
● Coreference resolution
● Negation
● ...
![Page 29: DrillDown: Interactive Retrieval of Complex Scenes Using ...ft3ex/projects/drilldown/DrillDown_slides.pdfImage Retrieval with Multiple Rounds Queries Drill-down: Interactive Retrieval](https://reader033.vdocuments.us/reader033/viewer/2022043010/5fa04095a8301d4ddd2f4ee8/html5/thumbnails/29.jpg)
Q&A