Collective Vision: Using Extremely Large Photograph Collections
Mark Lenz
CameraNet Seminar
University of Wisconsin – Madison
January 26, 2010
Acknowledgments: These slides combine and modify slides provided by Yantao Zheng et al. (National University of Singapore/Google)
Introduction
• Distributed Collaboration
• Google Goggles
  – Personal object recognition
• World-Wide Landmark Recognition
• Building Rome in a Day
  – Distributed matching and reconstruction
Distributed Collaboration
• Disaster or emergency
  – Time is of the essence
• Telecommunication networks down
• No maps or GPS
What can we do to help ourselves and those around us?
Mobile Phones for Distributed Collaboration
• Camera for collecting visual information
• Ad-hoc wireless LAN
  – e.g. Bluetooth
Goals:
  – Determine location, exits and hazardous paths
Have I or someone else been here before?
Model Scenarios
• Firefighters
• Trapped miners
• Natural disasters
  – Large population exodus
  – Building collapse
Multiple agents collaborating to traverse an unknown environment
Google Goggles
• Visual search using a picture as the query
• Combination of algorithms
  – Object recognition
  – Optical character recognition
  – Geo-location (GPS & compass)
• Identify
  – Books and products
  – Businesses and landmarks
A World-Wide Landmark Recognition Engine with Web Learning
• Goal: Build a landmark recognition engine at earth-scale
Challenge I
No list of landmarks in the world
We only have noisy data on the Internet:
• Tourist web articles
• Tourist photos
• Geographical location
Challenge II
How to learn landmark visual models
Image search engine
Photo-sharing websites
Challenge III
• Efficiency
  – Learning from enormous data
  – Recognizing from a huge model
Discovering landmarks in the world
Two approaches:
• Photos on photo-sharing websites (geo-tagged)
• Online tourist articles (landmark names)
Learning landmarks from GPS-Tagged photos
GPS-tagged photos
20M images from picasa.com and panoramio.com
Geo-clustering
geo cluster = landmarks?
validate by photo authors
Noisy image pool
Visual clustering
Graph clustering based on local features
Validate by photo authors
Analyzing text tags
Compute frequency of n-grams of text tags
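The tag analysis step can be sketched as below. This is a minimal illustration, assuming each photo in a cluster carries a free-text tag string; the function names, the maximum n-gram length and the sample tags are hypothetical, and the slides do not specify how tags are tokenized.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(tag_strings, max_n=3):
    """Count in how many photos each n-gram (n = 1..max_n) occurs."""
    counts = Counter()
    for tags in tag_strings:
        tokens = tags.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        counts.update(seen)  # each n-gram counted once per photo
    return counts

# Hypothetical tags of the photos in one cluster.
cluster_tags = [
    "eiffel tower paris",
    "paris eiffel tower night",
    "eiffel tower",
    "vacation 2008",
]
freq = ngram_frequencies(cluster_tags)
print(freq.most_common(3))
```

The most frequent n-gram is then a candidate name for the cluster; how ties and generic words are handled is left open here.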
Premise: Landmark photos are
• geographically adjacent
• visually similar
• uploaded by different users
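The "geographically adjacent" and "uploaded by different users" parts of the premise can be sketched in a few lines. This is an illustration only: grouping photos that lie within a fixed radius of each other (single linkage via union-find), the 0.5 km radius and the two-author minimum are assumptions of this sketch, not parameters given in the slides.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def geo_clusters(photos, radius_km=0.5):
    """Group photos (lat, lon, author) linked by chains of nearby neighbours."""
    parent = list(range(len(photos)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(photos)):
        for j in range(i + 1, len(photos)):
            if haversine_km(photos[i][:2], photos[j][:2]) <= radius_km:
                parent[find(i)] = find(j)

    clusters = {}
    for i, photo in enumerate(photos):
        clusters.setdefault(find(i), []).append(photo)
    return list(clusters.values())

def validate_by_authors(clusters, min_authors=2):
    """Keep clusters whose photos come from enough distinct authors."""
    return [c for c in clusters if len({p[2] for p in c}) >= min_authors]

# Hypothetical photos: three authors at one spot, one author at another.
photos = [
    (48.8584, 2.2945, "alice"), (48.8586, 2.2950, "bob"), (48.8580, 2.2940, "carol"),
    (40.4168, -3.7038, "dave"), (40.4170, -3.7040, "dave"),
]
clusters = geo_clusters(photos)
landmarks = validate_by_authors(clusters)
print(len(clusters), len(landmarks))
```

The pairwise loop is quadratic and only suitable for a toy example; at the 20M-photo scale described here a spatial index or grid bucketing would be needed.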
Landmarks from GPS-Tagged photos
• ~20 million GPS-tagged photos
• 140k geo-clusters and 14k visual clusters
• 2240 landmarks from 812 cities in 104 countries
  – Biased distribution, mostly in Europe
Landmarks per country:
United States    263
Spain            194
Italy            183
France           141
United Kingdom   136
Greece            51
Portugal          48
Russia            45
Austria           42
Learning landmarks from tourist web articles
Explore article corpus in wikitravel.com
Assume a geographical hierarchy
Landmark mining = named entity extraction
HTML is a structure tree
• Node: an HTML tag
• Value: text
Classify each tree node based on semantic clues embedded in the document structure
Learning landmarks from tourist web articles
Heuristic rules
• Nodes are in the "To See" or "See" section
• Nodes are children of "bullet list" nodes
• Nodes indicate bold font format
Extract all named entities as landmark candidates
Validate by visual models
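The three heuristic rules can be approximated with the standard-library HTML parser. This is a rough sketch, assuming sections are introduced by `<h2>` headings, bullet items are `<li>` elements and bold text uses `<b>` or `<strong>`; the class name and the sample page are hypothetical, and real wikitravel markup may differ.

```python
from html.parser import HTMLParser

class LandmarkCandidateParser(HTMLParser):
    """Collect bold text inside list items of a "See" / "To See" section."""
    SECTION_TITLES = {"see", "to see"}

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, innermost last
        self.in_see = False       # are we inside a matching section?
        self.heading_text = None  # text of the <h2> being read, if any
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "h2":
            self.heading_text = ""

    def handle_endtag(self, tag):
        if tag == "h2" and self.heading_text is not None:
            title = self.heading_text.strip().lower()
            self.in_see = title in self.SECTION_TITLES
            self.heading_text = None
        if tag in self.stack:  # pop up to the matching tag
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.heading_text is not None:
            self.heading_text += data
            return
        text = data.strip()
        is_bold = bool(self.stack) and self.stack[-1] in ("b", "strong")
        if text and self.in_see and "li" in self.stack and is_bold:
            self.candidates.append(text)

# Hypothetical article fragment.
page = """
<h2>Understand</h2><ul><li><b>History</b> of the city</li></ul>
<h2>See</h2>
<ul><li><b>Eiffel Tower</b>, the iron tower</li><li><b>Louvre</b></li></ul>
<p><b>Note</b> on opening hours</p>
<h2>Eat</h2><ul><li><b>Cafe X</b></li></ul>
"""
parser = LandmarkCandidateParser()
parser.feed(page)
print(parser.candidates)
```

The extracted strings are only candidates; as the slide notes, they still have to be validated by visual models.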
Learning landmarks from tourist web articles
~7000 landmarks from 787 cities in 145 countries
More evenly distributed
Unsupervised learning of landmark images
Geo-clusters
Landmarks from tour articles
Noisy image pool
Visual clustering
Premise: photos of the same landmark should be visually similar
Clustering based on local features
Validate and clean models
Visual model validates landmarks!
Photo vs. non-photo classifier to filter out noisy images
……
Local Feature Detection
• Find invariant and robust features
• Create distinctive feature descriptions
Laplacian-of-Gaussian (LoG)
• Scale-invariant edge detection
• Gaussian image filter to remove noise
• Laplacian filter to find areas of rapid change
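The two filtering steps can be folded into a single Laplacian-of-Gaussian kernel. The sketch below is a pure-Python illustration on a tiny synthetic image; the sign flip, the multiplication by sigma squared (scale normalization), the 3-sigma kernel radius and the helper names are choices made for this example, not details taken from the slides.

```python
import math

def log_kernel(sigma):
    """Sampled Laplacian-of-Gaussian kernel, truncated at 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r2 = x * x + y * y
            g = math.exp(-r2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            row.append((r2 - 2 * sigma ** 2) / sigma ** 4 * g)
        kernel.append(row)
    return kernel

def log_response(image, sigma):
    """Sign-flipped, scale-normalized LoG response (bright blobs are positive)."""
    k = log_kernel(sigma)
    r = len(k) // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding
                        acc += k[dy + r][dx + r] * image[yy][xx]
            out[y][x] = -sigma ** 2 * acc
    return out

def strongest_point(response):
    """(row, col) of the largest response."""
    return max((v, (y, x))
               for y, row in enumerate(response)
               for x, v in enumerate(row))[1]

# Synthetic 21x21 image with one bright blob centred at row 10, column 12.
image = [[math.exp(-((y - 10) ** 2 + (x - 12) ** 2) / (2 * 2.0 ** 2))
          for x in range(21)] for y in range(21)]
peak = strongest_point(log_response(image, sigma=2.0))
print(peak)
```

Running the same filter at several values of sigma and keeping maxima across both position and scale is the usual route to scale invariance; that loop is omitted here.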
Local Feature Description
• Invariant and distinctive description
• Texture from a 118-dimensional Gabor wavelet
Object matching based on local features
Sim($I_i$, $I_j$) = image match score
Image representation
• Interest points: Laplacian-of-Gaussian (LoG) filter
• Local feature: Gabor wavelets
match score = $1 - P_{FP}(I_i, I_j)$
• $P_{FP}(I_i, I_j)$: probability that the match of $I_i$ and $I_j$ is a false positive
• Probability that at least m out of n features match, if the match is a false positive
• $p$: probability of a feature match by chance
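The slide names the quantity "probability of at least m out of n features match" but the transcript does not preserve its formula. One common way to model it, assuming each of the n features matches by chance independently with probability p, is a binomial tail. The function name below is hypothetical and the independence assumption belongs to this sketch.

```python
from math import comb

def prob_at_least_m_matches(m, n, p):
    """P(at least m of n features match) if each matches by chance with prob. p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(m, n + 1))

# With a small chance probability, many simultaneous matches are very unlikely,
# so observing them is strong evidence that the image match is genuine.
for m in (1, 5, 10):
    print(m, prob_at_least_m_matches(m, n=50, p=0.02))
```

How this tail probability is turned into the final false-positive probability and match score is not shown in the transcript and is not modeled here.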
Constructing match region graph
Image matching
• Node is a match region
• 2 types of edges:
  – Match edge: measures match confidence
  – Overlap region edge: measures spatial overlap
Graph clustering on match regions
Distance between any two regions = shortest path connecting them
Why hierarchical agglomerative clustering, and not K-means, GMM, etc.?
• Because we have no a priori knowledge of the number of clusters
• Intuitively, each cluster should correspond to one aspect of a landmark
Agglomerative hierarchical clustering
Match region graph → Visual clusters
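The clustering idea can be illustrated on a toy graph: shortest-path distances between regions, then bottom-up merging until the closest pair of clusters is too far apart. In this sketch the edge lengths, the single-linkage merge rule and the stopping threshold are all assumptions; the slides state only that the distance is the shortest path and that the clustering is agglomerative.

```python
import math
from itertools import combinations

def shortest_paths(n, edges):
    """All-pairs shortest-path lengths (Floyd-Warshall) on an undirected graph."""
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for u, v, w in edges:
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def agglomerative(dist, max_distance):
    """Single-linkage merging; stops when the closest clusters exceed max_distance."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        (a, b), gap = min(
            (((a, b), min(dist[i][j] for i in clusters[a] for j in clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if gap > max_distance:
            break
        clusters[a] |= clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters

# Hypothetical match regions 0..4; a short edge means a confident match.
edges = [(0, 1, 0.2), (1, 2, 0.3), (3, 4, 0.1), (2, 3, 5.0)]
dist = shortest_paths(5, edges)
visual_clusters = agglomerative(dist, max_distance=1.0)
print(sorted(sorted(c) for c in visual_clusters))
```

Because merging stops at a distance threshold, the number of clusters falls out of the data rather than being fixed in advance, which is the property the slide contrasts with K-means and GMM.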
Visual cluster example
• Corcovado, Rio de Janeiro, Brazil
• Acropolis, Athens, Greece
Visual cluster validation and cleaning
• Validate by authors or hosting websites of images
  – Reflects the popular appeal of landmarks
• Filter out non-photographic images, like maps and logos
  – Train an AdaBoost classifier
  – Features: color histogram, Hough transform, etc.
• Clean clusters by detecting large-area human faces
Efficiency issues
Issue 1: learning landmark images (21.4M photos)
• Parallel computing to learn true landmark images
• Efficient hierarchical clustering
Issue 2: recognizing landmarks (query image against a recognition engine of ~5000 landmarks)
• Indexing local features for matching: kd-tree indexing
• Query time: ~0.2 sec on a P4 computer
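As an illustration of kd-tree indexing, the sketch below builds a small exact kd-tree over labeled feature vectors and answers a nearest-neighbor query. The 3-dimensional toy vectors, the dictionary-based node layout and the labels are hypothetical; the engine described here indexes much higher-dimensional descriptors, where approximate search is the usual choice.

```python
def build_kdtree(points, depth=0):
    """points: list of (vector, label). Returns a nested-dict kd-tree."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return (squared distance, label) of the closest indexed vector."""
    if node is None:
        return best
    vec, label = node["point"]
    d = sq_dist(vec, query)
    if best is None or d < best[0]:
        best = (d, label)
    diff = query[node["axis"]] - vec[node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:  # the far side may still hold a closer point
        best = nearest(far, query, best)
    return best

# Deterministic toy "descriptors", each tagged with a hypothetical landmark id.
features = [(((i * 37 % 101) / 101, (i * 53 % 97) / 97, (i * 71 % 89) / 89),
             f"landmark_{i % 5}") for i in range(60)]
tree = build_kdtree(features)
print(nearest(tree, (0.5, 0.5, 0.5)))
```

The pruning test on `diff ** 2` is what lets a query skip most of the tree instead of comparing against every stored feature.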
Experiments: statistics of learned landmarks
              From photos   From articles   Total
Landmark #    2240          3246            5486
City #        812           626             1259
Country #     104           130             144
Small overlap: 174 landmarks shared between the two sources
China: 101 landmarks. Under-counted! Why?
U.S.: high Internet penetration rate and an enormous number of tour sites
Evaluation of landmark image learning
• Randomly select 1000 visual clusters
• 68 (6.8%) are outliers: maps, logos, human photos
• Apply photographic vs. non-photographic classifier
• 37 outliers remain: 6.8% => 3.7%
Evaluation of landmark recognition
• Positive testing images:
  – 728 images from 124 landmarks
• Negative testing images:
  – Caltech-256 (30,524) + Pascal VOC 07 (9,986) = 40,510 images
• For positive images:
  – 417 images detected to be landmarks
  – 337/417 (80.8%) are correct
  – Identification rate: 337/728 (46.3%)
• For negative images:
  – 463 images detected to be landmarks
  – False acceptance rate: 1.1%
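The reported rates follow directly from the counts on this slide. The short check below recomputes them; the variable names are chosen for this example.

```python
positives = 728               # positive test images
detected_positive = 417       # positives flagged as landmarks
correct = 337                 # flagged positives with the right landmark
negatives = 30524 + 9986      # Caltech-256 + Pascal VOC 07
false_accepts = 463           # negatives flagged as landmarks

accuracy_on_detected = correct / detected_positive
identification_rate = correct / positives
false_acceptance_rate = false_accepts / negatives

print(f"correct among detected: {accuracy_on_detected:.1%}")
print(f"identification rate:    {identification_rate:.1%}")
print(f"false acceptance rate:  {false_acceptance_rate:.1%}")
```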
Landmarks can be similar!
False detected images
• Match is technically correct, but match region is not a landmark
  – A problem of model generation
• Match is technically false, due to visual similarity
  – A problem of image feature and matching mechanism