summarizing data collections by their spatial, temporal ... · andreas henrich and daniel blank −...
TRANSCRIPT
15. November 2012
Summarizing data collections by their
spatial, temporal, textual, and image footprint:
Techniques for source selection and beyond
Andreas Henrich and Daniel Blank [email protected] University of Bamberg Media Informatics Group Leipzig, 15.11.2012
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 2)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Agenda
1. Motivation and Introduction
(1) Distributed indexing and search
(2) Media types in use
(3) Scenario: source selection for image retrieval
2. Source Selection: Research Goals & Work in our Group
(1) Efficient source selection with HFS and UFS
(2) Visualizing the source selection process
(3) Applicability in other application fields (geographic IR, access methods)
3. Conclusion and Outlook
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 3)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Dramatic increase in data volumes in many areas:
in the world wide web
on private devices
in companies
…
adequate indexing and search techniques needed
distributed indexing and search
1. Motivation and Introduction
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 4)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
global vs. local indexing:
global indexing: “one” global index is distributed
inserts and updates → high network load
no locality of data (autonomy) → replication
local Indexing: many local indexes & query routing based on
local data summaries
query processing with logarithmic cost hard to guarantee
hybrid approaches: caching, replication, data specialization, …
1. Motivation and Introduction
Distributed indexing and search
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 5)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Visualization: features of 1 object/image 1 point in 2d
1. Motivation and Introduction
Distributed indexing and search
…
Müller, Boykin, Sarshar, Roychowdhury: Comparison of CBIR Queries in P2P Cambridge, 2006-09-06
query (dot in 0; 1 × 0; 1 )
query result (4-NN)
white points: 2d vectors
black squares with axes: peers
25 peers/resources and their data
1. Motivation and Introduction
Distributed indexing and search
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 7)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
CAN image
Peer/resource responsible for (a) region(s) within the whole feature space
global indexing 1 coordinate system in total
1. Motivation and Introduction
Distributed indexing and search
Data has to be transferred to / indexed by the responsible peer
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 8)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
local indexing 1 coordinate system per resource
Data summaries per resource enhance query routing
Querying resource knows summaries of all other resources (scalability: Rumorama)
our scenario 1. Motivation and Introduction
Distributed indexing and search
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 9)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Administration of distributed media items:
Content-based image retrieval (CBIR) just an example scenario.
Goal: Provide resource description and selection techniques | for general
metric spaces | and apply them elsewhere (e.g. geo-context)
media item (MI)
low-level features
1. Motivation and Introduction
Media types in use
textual content
timestamps
geographic footprint
location
color and texture
tags, description
shot date
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 10)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Resource description and selection: Identify promising MAs
according to a query based on a set of known data summaries.
media item
textual content
low-level features
timestamps
geographic footprint
…
…
…
…
…
… …
MA desc. text
MA desc. low-level
MA desc. time
media archive (MA)
MA desc. geo
four summaries per MA (one per criterion)
1. Motivation and Introduction
Media types in use
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 11)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
text → (counting) Bloom filters, Topic models, etc.
time → histogram, clustering, our approach feasible
geo → discussed at the end of the talk
low-level → focus in the remainder of the talk (CBIR)
Search by multiple criteria: e.g. ‘town hall Leipzig’ at sunset
merging of several criteria-specific resource rankings
need to model correlations in the resource descriptions
future work: not covered so far
1. Motivation and Introduction
Media types in use → resource descriptions
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 12)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Callan [Cal00] defines distributed information retrieval (IR) based on three
problems and tasks:
1. resource description
adequate descriptions/summaries
2. resource selection
adequate selection mechanisms
3. result merging
merging of resource-specific search results
Often in literature (and also in this talk): resource selection = resource description & selection;
result merging trivial
space requirements
selectivity
1. Motivation and Introduction
Scenario: source selection for image retrieval
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 13)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Image retrieval: often text search & tags
Only text search not sufficient:
Bolettieri et al. [BEF+09]: >100 Mio. Flickr images
30%: no comments and no tags → searchable?
on average: 0.5 comments and 5.0 tags
tag spam, homonyms/synonyms,
selectivity (‘phone’), …
Benefits of content-based image retrieval (CBIR):
automatically extracted image properties
features: color (histograms), texture, salient points, …
1. Motivation and Introduction
Scenario: source selection for image retrieval
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 14)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
given: query object 𝑞
database 𝑂
wanted: ‘similar’ images w.r.t. query; criterion: 𝑑 𝑞, 𝑜 with 𝑜 ∈ 𝑂
important query types:
range queries (search radius 𝑟):
range 𝑞, 𝑟 = 𝑜 ∈ 𝑂 𝑑 𝑞, 𝑜 ≤ 𝑟
𝑘-nearest-neighbor queries (desired #hits: 𝑘):
kNN 𝑞, 𝑘 = 𝐾 with ∀𝑜 ∈ 𝐾, 𝑜′ ∈ 𝑂 ∖ 𝐾: 𝑑 𝑞, 𝑜 ≤ 𝑑 𝑞, 𝑜′ and 𝐾 = 𝑘
1. Motivation and Introduction
Scenario: source selection for CBIR
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 15)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
resource A
resource B resource C
1. Motivation and Introduction
Scenario: source selection for CBIR
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 16)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
1. Motivation and Introduction
Scenario: source selection for CBIR resource A
resource B resource C
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 17)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
1. Motivation and Introduction
Scenario: source selection for CBIR
C
A
B
resource description
similarity query
1. C 2. A 3. B
resource selection
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 18)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
2. Source Selection:
Research Goals and Work in our Group
Design resource selection techniques which improve
earlier work w.r.t. efficiency:
Space efficiency and selectivity of the resource descriptions
Visualize the source selection process
Show usefulness of the techniques in other scenarios (apart from distributed IR: metric access methods, geo IR, …)
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 19)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
1. Assign every DB object to the closest reference object (RO).
2. Count per reference object, how many DB objects have been assigned to it.
cluster histogram computation
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
DB objects
reference objects
distance d
Baseline: [MEH05b]
resource description
|.|= γ
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 20)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
● reference objects ■ feature objects
assumption: 2-dim. feature space euclidean distance
cluster histogram:
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 21)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Anfragebearbeitung mit
Zusammenfassungen
cluster histograms as robust approximations of
…
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 22)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Sicht auf andere Peers mit
Zusammenfassungen
… density distributions
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 23)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Histogrammerstellung cluster histogram
computation
DB objects
reference objects
distance d
[MEH05b]
distributed clustering
resource description
γ
γ
Zusammenfassung
1. Assign every DB object to the closest reference object (RO).
2. Count per reference object, how many DB objects have been assigned to it.
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 24)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Histogrammerstellung cluster histogram
computation
DB objects
reference objects
distance d resource description
random selection γ
γ
Zusammenfassung
1. Assign every DB object to the closest reference object (RO).
2. Count per reference object, how many DB objects have been assigned to it.
[EMH+06]
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 25)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Source Selection as in [EMH+06]: version 1 for small 𝛾
1. compute the reference object 𝑐𝑗 closest to 𝑞 w.r.t. 𝑑(𝑞, 𝑐𝑗)
2. ranking of peers
peer 𝑝𝑎 has more DB objects in cluster 𝑐𝑗 than 𝑝𝑏
⇒ 𝑝𝑎 is contacted before 𝑝𝑏
peer 𝑝𝑏 has more DB objects in cluster 𝑐𝑗 than 𝑝𝑎
⇒ 𝑝𝑏 is contacted before 𝑝𝑎
otherwise (same number of objects in 𝑐𝑗) a random decision is made
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 26)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Anfragebearbeitung mit
Zusammenfassungen
starting point
query
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 27)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Starting 4-NN query: resource ranking
criterion: number of images in query
cluster
query object
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 28)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
intermediate result
top 4 documents of contacted
resource
1st resource visited
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 29)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
intermediate result (improved)
top 4 documents of contacted
resource
2nd resource visited
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 30)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3rd resource visited
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 31)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
4th resource visited
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 32)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
5th resource visited
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 33)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Cluster Histogramm Erstellung
HFS computation
DB objekts
distance d resource description
random selection
γ* ≫ γ
γ* ≫ γ
reference objects
[BEM+07]
1. Assign every DB object to the closest reference object (RO).
2. Count per reference object, how many DB objects have been assigned to it.
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 34)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
20 ROs
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 35)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
1000 ROs
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 36)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Source Selection as in [EMH+06]: version 2 for higher values of 𝛾
1. compute list 𝐿
sort reference objects 𝑐𝑗 ascending w.r.t. 𝑑(𝑞, 𝑐𝑗)
2. ranking of peers
Iterate over the list 𝐿 (from beginning to end):
peer 𝑝𝑎 has more DB objects in current cluster than 𝑝𝑏
⇒ 𝑝𝑎 is contacted before 𝑝𝑏
peer 𝑝𝑏 has more DB objects in current cluster than 𝑝𝑎
⇒ 𝑝𝑏 is contacted before 𝑝𝑎
otherwise (same number of objects in all clusters) random decision is made
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 37)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Cluster Histogramm Erstellung
HFS computation
DB objects
distance d
random selection γ* ≫ γ
reference objects
[BEM+07]
resource description !
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
γ* ≫ γ
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 38)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Cluster Histogramm Erstellung
HFS computation
DB objects
distance d
random selection
reference objects
[BEM+07]
compression
resource description
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
γ* ≫ γ
γ* ≫ γ
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 39)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Cluster Histogramm Erstellung
HFS computation
DB objects
distance d
reference objects
[BEM+07]
compression
resource description
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
γ* ≫ γ
γ* ≫ γ
transferred with software updates
external collection selection
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 40)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Cluster Histogramm Erstellung
UFS computation
DB objects
distance d
reference objects
compression
resource description
2. Source Selection – Research Goals and Work in our Group
Efficient source selection with HFS and UFS
γ* ≫ γ
γ* ≫ γ
transferred with software updates
external collection selection
[BH10a, BH10b]
UFS are bit vectors. A bit set at position 𝑗 indicates, that at least one DB object is assigned to reference object 𝑐𝑗.
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 41)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Another view on the source selection process with UFS:
2. Source Selection – Research Goals and Work in our Group
Visualizing the source selection process
Set of peers: 𝑃 = 𝑝𝑖 1 ≤ 𝑖 ≤ 𝑛 ,
(1 summary 𝑠𝑖 𝑗 per row)
Set of ROs: 𝐶 = 𝑐𝑗 1 ≤ 𝑗 ≤ 𝛾 ,
(1 RO 𝑐𝑗 with cluster 𝑐𝑗 per column)
…
𝑠𝑖 𝑗 = 0
𝑝𝑖 has NO docs in 𝑐𝑗
𝑠𝑖 𝑗 > 0
𝑝𝑖 has ≥ 1 doc(s) in 𝑐𝑗
source selection as an 𝛾 × 𝑛 image.
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 42)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Another view on the source selection process with UFS:
2. Source Selection – Research Goals and Work in our Group
Visualizing the source selection process
Set of peers: 𝑃 = 𝑝𝑖 1 ≤ 𝑖 ≤ 𝑛 ,
(1 summary 𝑠𝑖 𝑗 per row)
Set of ROs: 𝐶 = 𝑐𝑗 1 ≤ 𝑗 ≤ 𝛾 ,
(1 RO 𝑐𝑗 with cluster 𝑐𝑗 per column)
…
𝑠𝑖 𝑗 = 0
𝑝𝑖 has NO docs in 𝑐𝑗
𝑠𝑖 𝑗 > 0
𝑝𝑖 has ≥ 1 doc(s) in 𝑐𝑗
source selection as an 𝛾 × 𝑛 image.
one cluster /
reference object
one peer
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 43)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
2. Source Selection – Research Goals and Work in our Group
Visualizing the source selection process
…
no ranking
top 80 peers
lowest 80 peers
MIRFLICKR 25k collection: http://press.liacs.nl/mirflickr/ 24.258 images assigned to 9.862 peers based on the Flickr-UserID CEDD descriptor + Hellinger distance
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 44)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
2. Source Selection – Research Goals and Work in our Group
Visualizing the source selection process
… …
no ranking by peer size HFS/UFS ranker
…
𝑐𝑗 ordered by
Increasing
𝑑 𝑞, 𝑐𝑗
top 80 peers
lowest 80 peers
Peers having docs in
query cluster
Peers having NO docs in
query cluster
…
…
…
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 45)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
a) Design of a MAM based on inverted files:
𝑐0 𝑜7 𝑜8
𝑐1 𝑜1 𝑜4 𝑜6
𝑐γ−1 𝑜5 …
…
…
… 𝑑 𝑐1, 𝑜4 , … , 𝑑 𝑐γ−1 , 𝑜4
store object-pivot distances in the postings
map clusters to posting list
Search: pruning of clusters/lists pruning of DB objects
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 46)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
b) Apply HFS/UFS and extensions in the context of geographic data and
compare it with:
geometric approaches space partitioning approaches
→ also: hybrid combinations!
= peer data
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 47)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
: peer data
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
kMeans++ MBR
RecMBR CH
geometric approaches
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 48)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Exemplary ranking for MBR summaries, other rankers for geometric
resource descriptions work likewise!
Given:
Query 𝑞 and two peers: 𝑝𝑎 with MBRa and 𝑝𝑏 with MBRb
Peer Ranking:
kMeans++ and RecMAR: For each peer a queue of shapes is created (ordered
by minimum distance to query point), queue based consideration of shapes.
),,(),(),(
),(),(),(
),(),(
),(),(
baba
baba
abab
baba
MBRMBRqdistToMBRqMBRcontainsqMBRcontains
MBRMBRrecMBRSizeqMBRcontainsqMBRcontains
ppqMBRcontainsqMBRcontains
ppqMBRcontainsqMBRcontains
reciprocal size of area covered
by MBR
minimum distance to MBR
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 49)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
: peer data
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
GFBun
UFSn
GRIDn
DFSn
: reference object
space partitioning approaches
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 50)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
1. Create list L of subspaces/clusters sorted by minimum
distance of subspace to query object (cf. HFS/UFS as before)
2. j = 1;
3. Choose subspace [cj], positioned j-th in list L (1 ≤ j ≤ γ):
pa administers documents in [cj], pb does not ⇒ 𝑝𝑎 ≻ 𝑝𝑏
pa and pb both do (not) administer documents in [cj]
j++; GOTO 3;
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
Grid: considers rings of neighboring cells instead of single subspaces
UFS/DFS: list ordered by min. distance of ROs to query object
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 51)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
20
40
60
80
100
120
140
160
0,5
0,7
0,9
1,1
1,3
1,5
1,7
1,9
512 2048 8192
avg s
um
mary
siz
e in b
yte
fraction o
f conta
cte
d p
eers
in %
number of subspaces
queryMode=1
GFBu_e UFS_e DFS_e
2. Source Selection – Research Goals and Work in our Group
Applicability in other application fields
Experimental Results – Space Partitioning Approaches
better selectivity than
grid-based approach
storing additional
distances increases
selectivity but increases
storage requirements
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 52)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
…
02040
…
0
peers
ROs cluster
populations peer sizes
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 53)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
…
02040
…
0
no ranking
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 54)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
… …
0 050 0
ranking by peer size
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 55)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
… …
0 050 0
visualize ROs
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 56)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
… …
0 050 0
‘arbitrary’ ROs possible
(metric space)
AGSTVAVREA
| |||||| |
ATSTVAVRSA
X
Y Z
B
D
visualize ROs
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 57)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
… …
0 050 0
choose RO as query
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 58)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
…
0 0 02040
[cj] ordered by increasing d(q,cj)
UFS ranking
0
100
200
300
0
…
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 59)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
…
0 0 02040
[cj] ordered by increasing d(q,cj)
0
100
200
300
0
… …
9 NNs
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 60)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
0 0 02040
…
0
…
[cj] ordered by increasing d(q,cj)
visualize peer summaries
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 61)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
0 0 02040
…
0
…
[cj] ordered by increasing d(q,cj)
visualize peer summaries
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 62)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
0 0 02040
…
0
…
[cj] ordered by increasing d(q,cj)
find similar peers
Similarity on UFS/HFS summaries: Hamming dist. (ext.) quadratic form KL divergence … → using correlations: distances between ROs
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 63)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
0 0 02040
…
0
…
[cj] ordered by increasing d(q,cj)
browsing: images/peers least similar to q → interestingness?
interestingness specialties
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 64)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
0 0 02040
…
0
…
[cj] ordered by increasing d(q,cj)
browsing: images/peers least similar to q → interestingness?
interestingness specialties
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 65)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0
100
200
300
0 0 02040
…
0
…
interestingness specialties
browsing: images/peers least similar to q → interestingness?
Summarizing data collections by their spatial, temporal, textual and image footprint (p. 66)
Andreas Henrich and Daniel Blank − "Scalable Visual Analytics" Text-Workshop in Leipzig, November 15, 2012
15. November 2012
Thank you very much for your attention!
…
Questions?
3. Conclusion and Outlook: ‘Source selection through visual analytics’
0100200300 0
50
100
0
the end
Literaturverzeichnis (1/2)
[BDP04] S. Berretti, A. Del Bimbo, and P. Pala. Merging results for distributed content based image
retrieval. Multimedia Tools Appl., 24(3):215–232, 2004.
[BEF+09] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli, and F. Rabitti. CoPhIR: a Test
Collection for Content-Based Image Retrieval. CoRR, abs/0905.4627v2,
http://arxiv.org/abs/0905.4627v2 (last visit: 2.11.2011), 2009.
[BEM+07] D. Blank, S. El Allali, W. Müller, and A. Henrich. Sample-based creation of peer summaries for
efficient similarity search in scalable peer-to-peer networks. In Intl. SIGMM Workshop on
Multimedia Information Retrieval, pages 143–152, Augsburg, Germany, 2007. ACM.
[BH10a] D. Blank and A. Henrich. Binary histograms for resource selection in peer-to-peer media
retrieval. In Proc. of LWA Workshop Lernen, Wissen, Adaptivität, pages 183–190, Kassel, Germany,
2010.
[BH10b] D. Blank and A. Henrich. Description and selection of media archives for geographic nearest
neighbor queries in P2P networks. In Proc. of the Information Access for Personal Media Archives
Workshop, pages 22–29, http://doras.dcu.ie/15373/ (last visit: 2.11.2011), 2010.
[BH12] D. Blank and A. Henrich. Inverted file-based indexing for efficient multimedia information
retrieval in metric spaces. In Proc. of the 27th ACM Intl. Symp. on Applied Computing. Riva del
Garda, Italy, 2012 (to appear).
p. 69 Summarizing data collections by their spatial, temporal, textual and image footprint:
Techniques for source selection and beyond
Literaturverzeichnis (2/2)
[Cal00] J. Callan. Distributed information retrieval. In W. B. Croft, editor, Advances in Information
Retrieval, pages 127–150. Kluwer Academic Publishers, 2000.
[EMH+06] M. Eisenhardt, W. Müller, A. Henrich, D. Blank, and S. El Allali. Clustering-based source selection
for efficient image retrieval in peer-to-peer networks. In Proc. of the 8th Intl. Symp. on Multimedia,
pages 823–830, San Diego, CA, USA, 2006. IEEE.
[MEH05a] W. Müller, M. Eisenhardt, and A. Henrich. Scalable summary based retrieval in P2P networks. In
Proc. of the 14th Intl. Conf. on Information and Knowledge Management, pages 586–593, Bremen,
Germany, 2005. ACM.
[MEH05b] W. Müller, M. Eisenhardt, and A. Henrich. Fast retrieval of high-dimensional feature vectors in
P2P networks using compact peer data summaries. Multimedia Systems, 10(6):464–474, 2005.
[MKL+03] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu.
Peer-to-peer computing. Technical Report HPL-2002-57 (R.1), HP Laboratories, Palo Alto, CA, USA, July
2003.
[ZAD+05] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach.
Springer New York, Inc., Secaucus, NJ, USA, 2005.
p. 70 Summarizing data collections by their spatial, temporal, textual and image footprint:
Techniques for source selection and beyond
Bildquellen des Vortrags
Seite 3: [BEF+09]
Seite 5: http://commons.wikimedia.org/wiki/File:Bamberg-Rathaus1-Asio.JPG
Seite 6: http://commons.wikimedia.org/wiki/File:Dreieck.svg?uselang=de
Seite 7 und 12: http://commons.wikimedia.org/wiki/File:Bamberg-Rathaus3-Asio.JPG
Seite 10 und 11:
http://commons.wikimedia.org/wiki/File:Bamberg-Rathaus1-Asio.JPG
http://commons.wikimedia.org/wiki/File:Bamberg_Klein-Venedig_I.jpg
http://commons.wikimedia.org/wiki/File:Marienberg_wuerzburg.jpg
http://commons.wikimedia.org/wiki/File:Marienkapelle.JPG
http://commons.wikimedia.org/wiki/File:Zugspitze_Sonnenuntergang_Westseite.JPG
http://commons.wikimedia.org/wiki/File:Sonnenuntergang_001.jpg
Das pinke Sonnenuntergangsbild ist derzeit nicht mehr bei Wikimedia verfügbar.
Seite 17: http://commons.wikimedia.org/wiki/File:Voronoi_diagram.svg
p. 71 Summarizing data collections by their spatial, temporal, textual and image footprint:
Techniques for source selection and beyond