Post on 30-Jul-2020
End-to-End Localization and Ranking
for Relative Attributes
Krishna Kumar Singh and Yong Jae Lee
Presented by Minhao Cheng
Visual attributes
High heel, Smile, Mountainous, Cozy
[Farhadi et al. 2009, Kumar et al. 2009, Lampert et al. 2009, Berg et al. 2010, Rastegari et al. 2012, …]
[Slide: Xiao and Lee, ICCV 2015]
Relative attributes
Is she smiling? Hard to say. It is a lot easier to say "the right one is smiling more":
[left image] < [right image]
[Parikh & Grauman 2011, Shrivastava et al. 2012,
Kovashka et al. 2013, Sandeep et al. 2014, …]
[Slide: Xiao and Lee, ICCV 2015]
Localization of attributes
Spatial regions that are most relevant to a particular attribute
Smile, Mountainous, Cozy
[Slide: Xiao and Lee, ICCV 2015]
Prior work on localizing attributes
• Attribute localization with human-in-the-loop: [Duan et al. 2012]
• Attribute localization with pre-trained detectors: [Bourdev et al. 2011, Zhang et al. 2014, Sandeep et al. 2014]
• Attribute localization with binary attributes: [Berg et al. 2010, Bourdev et al. 2011, Duan et al. 2012, Zhang et al. 2014]
Requires strong human supervision or binary attribute annotations
[Slide: Xiao and Lee, ICCV 2015]
Prior work on localizing attributes
• Attribute localization in weakly-supervised setting: [Xiao and Lee, ICCV 2015]
“Pipeline” where features, localizer, and classifier are trained separately and sequentially; suboptimal and slow
[Slide: Xiao and Lee, ICCV 2015]
Our idea: jointly learn features, localizer, and ranker end-to-end using a deep network
End-to-end network for attribute localization and ranking
Attribute: Smile
Training: training pairs
Testing: test images ranked from weak to strong
[Singh and Lee, ECCV 2016]
Overview of our end-to-end approach
[Singh and Lee, ECCV 2016]
[Figure: two Siamese networks, S1 and S2. Each passes its input image (I1 or I2) through a Localization Network and a Ranker Network to produce a score (V1 or V2); both scores feed the Loss Function]
Attribute: Smile
• Goal: Given pairs of ordered training images, simultaneously localize the attribute in each image and learn a ranker
Our end-to-end approach
[Figure: Localization Network: I → layers labeled 96, 256, 384, 384, 384, 128, 128, 3 → θ → Grid generator. Ranker Network: layers labeled 96, 256, 384, 384, 384, 4096, 4096, then 8192, then 1 → V]
[Singh and Lee, ECCV 2016]
Our end-to-end approach
[Figure: Localization Network: I → layers labeled 96, 256, 384, 384, 384, 128, 128, 3 → θ → Grid generator]
• Localization network discovers the region of interest for the attribute
• Learns transformation parameters θ that map the input image to the output region
• Based on Spatial Transformer Networks [Jaderberg et al. 2015]
[Singh and Lee, ECCV 2016]
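The grid generator step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes θ = (s, tx, ty), an isotropic scale plus a translation in normalized [-1, 1] image coordinates, and the function names make_grid and bilinear_sample are hypothetical. A general spatial transformer would use a full six-parameter affine matrix.

```python
import math


def make_grid(theta, out_h, out_w):
    """Map each output pixel to a source location in normalized [-1, 1] coords.

    theta = (s, tx, ty) is an assumed parameterization: isotropic scale
    plus translation.
    """
    s, tx, ty = theta
    grid = []
    for r in range(out_h):
        # Normalized target coordinate; a single row or column maps to 0.
        yt = 0.0 if out_h == 1 else -1.0 + 2.0 * r / (out_h - 1)
        row = []
        for c in range(out_w):
            xt = 0.0 if out_w == 1 else -1.0 + 2.0 * c / (out_w - 1)
            row.append((s * xt + tx, s * yt + ty))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a 2D list `image` at the normalized (x, y) locations in `grid`.

    Locations outside the image read as 0.
    """
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalized coordinates to pixel coordinates.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            c0, r0 = math.floor(px), math.floor(py)
            ax, ay = px - c0, py - r0
            top = (1 - ax) * pixel(r0, c0) + ax * pixel(r0, c0 + 1)
            bottom = (1 - ax) * pixel(r0 + 1, c0) + ax * pixel(r0 + 1, c0 + 1)
            out_row.append((1 - ay) * top + ay * bottom)
        out.append(out_row)
    return out
```

With θ = (1, 0, 0) the sampled output reproduces the input; with θ = (0.5, 0.5, 0.5) it crops the bottom-right quadrant. Because both steps are differentiable in θ, gradients from the ranker can flow back into the localization network.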
Our end-to-end approach
[Figure: Localization Network (I → θ → Grid generator) feeding the Ranker Network: layers labeled 96, 256, 384, 384, 384, 4096, 4096, then 8192, then 1 → V]
• Ranker network takes the localized region and produces a ranking score
• Combines it with the global image for global context
[Singh and Lee, ECCV 2016]
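One way to read the figure's 8192-dimensional layer is as the concatenation of a 4096-dimensional feature from the localized region and a 4096-dimensional feature from the global image, followed by a single linear output. That reading is an assumption, and the function below (ranking_score, a hypothetical name) is only a sketch of the final scoring step with plain lists standing in for feature vectors.

```python
def ranking_score(local_feat, global_feat, weights, bias=0.0):
    """Concatenate local and global features and apply one linear layer.

    Assumed structure: the score is a weighted sum over the combined
    feature vector. Feature extraction itself is not modeled here.
    """
    combined = list(local_feat) + list(global_feat)
    if len(combined) != len(weights):
        raise ValueError("weights must match the combined feature length")
    return sum(w * x for w, x in zip(weights, combined)) + bias
```

For example, ranking_score([1.0, 2.0], [3.0], [0.5, 0.5, 1.0]) evaluates to 4.5.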
Training
[Figure: Siamese networks S1 and S2, each a Localization Network followed by a Ranker Network, map I1 and I2 to scores V1 and V2, which feed the Loss Function]
Attribute: Smile
• Cross entropy:
[Singh and Lee, ECCV 2016]
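The equation on this slide did not survive in the transcript. A common pairwise cross-entropy formulation (RankNet-style) is sketched below as an assumption: the probability that image 1 shows more of the attribute is the sigmoid of V1 - V2, and the loss is the cross entropy between that probability and a target t. The target convention (1.0, 0.0, or 0.5 for similar pairs) and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_rank_loss(v1, v2, target):
    """Cross entropy between sigmoid(v1 - v2) and the target ordering.

    target is assumed to be 1.0 if image 1 has more of the attribute,
    0.0 if image 2 does, and 0.5 if the pair is judged similar.
    """
    diff = v1 - v2
    log_p = -softplus(-diff)      # log(sigmoid(diff))
    log_not_p = -softplus(diff)   # log(1 - sigmoid(diff))
    return -(target * log_p + (1.0 - target) * log_not_p)
```

The loss shrinks as the score gap agrees more strongly with the target ordering, and it is symmetric: swapping the two images and flipping the target gives the same value.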
Training
[Figure: Siamese training architecture with Loss Function]
Attribute: Smile
• The localized region can fall outside the image bounds, which makes learning difficult
[Singh and Lee, ECCV 2016]
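The slide does not say how the out-of-bounds problem is handled. One plausible remedy, shown purely as an illustration, is a penalty that grows when the localized region extends past the image. The sketch assumes θ = (s, tx, ty) in normalized [-1, 1] coordinates, so the region spans [tx - s, tx + s] horizontally and [ty - s, ty + s] vertically; the function name and the squared-overshoot form are hypothetical.

```python
def out_of_bounds_penalty(theta):
    """Squared overshoot of the localized region past the image edges.

    Assumes theta = (s, tx, ty) with the image occupying [-1, 1] on both
    axes. Returns 0.0 when the region lies fully inside the image.
    """
    s, tx, ty = theta
    half = abs(s)
    penalty = 0.0
    for center in (tx, ty):
        low, high = center - half, center + half
        # Overshoot past either edge of the normalized range [-1, 1].
        penalty += max(0.0, -1.0 - low) ** 2 + max(0.0, high - 1.0) ** 2
    return penalty
```

Such a term could be added to the ranking loss with a weighting factor (another assumed hyperparameter) so that the localization network is pushed back toward regions inside the image.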
Training
[Figure: Siamese training architecture with Loss Function]
Attribute: Smile
• Optimized using backpropagation with mini-batch stochastic gradient descent
[Singh and Lee, ECCV 2016]
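To illustrate mini-batch SGD on ordered pairs, the sketch below replaces the deep network with a linear scorer w · x, which is a deliberate simplification. Both images in a pair are scored with the same weights, mirroring the weight sharing of the Siamese branches. The loss is assumed to be a pairwise cross entropy on sigmoid(V1 - V2), whose gradient with respect to the score gap is sigmoid(V1 - V2) - t. All names and hyperparameters are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sgd_train(pairs, dim, lr=0.1, epochs=50, batch_size=2):
    """Train a linear ranker on (x1, x2, target) pairs with mini-batch SGD.

    target is 1.0 if x1 should rank higher and 0.0 if x2 should.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            grad = [0.0] * dim
            for x1, x2, t in batch:
                diff = [a - b for a, b in zip(x1, x2)]
                gap = sum(wi * di for wi, di in zip(w, diff))  # V1 - V2
                coeff = sigmoid(gap) - t  # d(loss) / d(V1 - V2)
                for i in range(dim):
                    grad[i] += coeff * diff[i] / len(batch)
            # One gradient step per mini-batch.
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w
```

In the real network the same gradient signal would be backpropagated through the ranker and, via the differentiable sampling grid, into the localization network.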
Progression of localized region over training epochs
[Figure: heatmaps over training epochs for Attribute: Smile and Attribute: Dark hair]
• Heatmap: distribution of the localized region across the entire training dataset
[Singh and Lee, ECCV 2016]
Testing
[Figure: Test image → Siamese (S1): Localization Network → Ranker Network → Vtest]
• Localize the relevant attribute region
• Produce a ranking score for the test image
[Singh and Lee, ECCV 2016]
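Because each test image receives a single score from one branch, ordering a test set by attribute strength reduces to a sort. A minimal sketch, assuming the scores have already been computed and stored in a dictionary keyed by a hypothetical image identifier:

```python
def rank_images(scores):
    """Return image ids ordered from weakest to strongest attribute score.

    scores maps an image id to the scalar produced by the ranker network.
    """
    return sorted(scores, key=scores.get)
```

For example, rank_images({"a": 0.3, "b": -1.2, "c": 2.0}) returns ["b", "a", "c"].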
Experiments: Relative attribute datasets
LFW-10 (2k images) [Sandeep et al., CVPR 2014]: Visible teeth, Eyes open, Dark hair, Smile, Good looking...
UT-Zap50K (50k images) [Yu & Grauman, CVPR 2014]: Pointy, Open, Sporty, Comfort
[Singh and Lee, ECCV 2016]
Results: Discovered regions and ranking on LFW-10 Faces
[Figure: face images ordered from weak to strong for Bald, Dark hair, Eyes open, Smile]
• Our network discovers relevant attribute regions
• Leads to accurate rankings
Results: Discovered regions and ranking on LFW-10 Faces
[Figure: face images ordered from weak to strong for Masculine, Good looking, Young]
• Global attributes are harder to interpret
• The network focuses on larger areas for these attributes
[Singh and Lee, ECCV 2016]
Results: Discovered regions and ranking on UT-Zap50K Shoes
[Figure: shoe images ordered from weak to strong for Open, Pointy, Sporty, Comfort]
[Singh and Lee, ECCV 2016]
Results: Image pair ranking accuracy
• % of test image pairs whose predicted relative attribute ranking is correct
• State-of-the-art results on LFW-10, UT-Zap50K, OSR, Shoe-with-Attribute
Combining global image context with localized fine-grained information performs best
[Singh and Lee, ECCV 2016]
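The accuracy metric described above can be computed as follows. This is a sketch under simple assumptions: each ground-truth pair is given as (weaker_id, stronger_id), ties in predicted score count as incorrect, and the function name is hypothetical.

```python
def pair_ranking_accuracy(scores, pairs):
    """Percentage of ground-truth pairs whose predicted ordering is correct.

    scores maps image id to predicted ranking score; pairs is a list of
    (weaker_id, stronger_id) tuples. Ties are counted as incorrect.
    """
    if not pairs:
        raise ValueError("at least one test pair is required")
    correct = sum(1 for weak, strong in pairs if scores[strong] > scores[weak])
    return 100.0 * correct / len(pairs)
```

With scores {"a": 0.1, "b": 0.9, "c": 0.5} and pairs [("a", "b"), ("c", "b"), ("c", "a"), ("a", "c")], three of the four pairs are ordered correctly, giving 75.0.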
Conclusions
• Novel end-to-end network for ranking and localizing attributes.
• State-of-the-art attribute ranking performance on benchmark face, shoe, and outdoor scene datasets.
• Our approach is 100 times faster than [Xiao and Lee, ICCV 2015].
Question
• Could using multiple localization networks instead of one improve performance? (For example, eye features could also help rank the smile attribute.)