
Photo Stream Alignment and Summarization for Collaborative Photo Collection and Sharing

Jianchao Yang, Student Member, IEEE, Jiebo Luo, Fellow, IEEE, Jie Yu, Member, IEEE, and Thomas Huang, Life Fellow, IEEE

Abstract

With the popularity of digital cameras and camera phones, it is common for different people, who

may or may not know each other, to attend the same event and take pictures and videos from different

spatial or personal perspectives. Within the realm of social media, it is desirable to enable these people

to select and share their pictures and videos in order to enrich memories and facilitate social networking.

However, it is cumbersome to manually manage these photos from different cameras, whose clock

settings are often not calibrated. In this paper, we propose automatic algorithms to address the above

problems. First, we accurately align different photo streams or sequences from different photographers

for the same event in chronological order on a common timeline, while respecting the time constraints

within each photo stream. Given the preferred similarity measures (e.g., visual and spatial similarities),

our algorithm performs photo stream alignment via matching on a bipartite kernel sparse representation

graph that forces the data connections to be sparse in an explicit fashion. Furthermore, we can produce

a summary master stream from the aligned super stream of photos for efficient sharing by removing

those redundant photos in the super stream while accounting for the temporal integrity. Based on a

similar kernel sparse representation graph, our master stream summarization algorithm performs greedy

backward selection to drop redundant photos without affecting the integrity of remaining photos for the

entire event. We evaluate our algorithms on real-world personal online albums for thirty-six events and

demonstrate their efficacy in automatically facilitating collaborative photo collection and sharing.

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other

purposes must be obtained from the IEEE by sending a request to [email protected].

Jianchao Yang and Thomas Huang are with Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL-61801,

USA. E-mail: {jyang29, huang}@ifp.illinois.edu.

Jiebo Luo is with the Department of Computer Science, University of Rochester, Rochester, NY-14627, USA. E-mail:

[email protected].

Jie Yu is with GE Global Research, Niskayuna, NY-12309, USA. E-mail: [email protected].


Fig. 1. Collaborative photo collection and sharing is common in social media websites such as Facebook and Picasa.

I. INTRODUCTION

Today, millions and millions of users worldwide capture images and videos to record various events

in their lives. Such image data capturing is done for two important reasons: first, for the participants

themselves to relive the events of their lives at later points in time, and second, for them to share these

events with friends and family who were not present at the events but are still interested

in knowing how the event (e.g., the vacation, the wedding, or the trip) went. When it comes to sharing,

it is also common for people who attended the same event to share pictures taken at that

event by different people with different viewpoints, timing, or subjects. The importance of sharing

is indeed underscored by the billions of image uploads per month on social media sites like Facebook,

Flickr, Picasa, and so on.

In this study, we consider a very common scenario for many photo-worthy events: for example, you

and several friends took a trip to Yellowstone National Park, and each of you took your own camera

to record the trip. At the end, collectively you ended up with several photo albums, each created by a

different camera and composed of hundreds or even thousands of photos. One natural problem arises:

How do you share these photo albums or collections among the friends in an effective and organized


manner? Such a scenario occurs often for events that involve many people, such as trips, excursions,

sports activities, concerts and shows, graduations, weddings, and picnics. Many photo sharing sites now

provide functions or apps to facilitate photo sharing. As shown in Figure 1, for example, Picasa now

allows people one shares photos with to contribute to one’s album, while a Facebook app called Friends

Photos searches in one’s network to present an overview of the friends’ photo albums. However, these

functions do not automatically and selectively align photos of the same event from different contributors.

Currently, people must view the individual photo collections separately, and few are willing to invest

time to augment their own collections with photos taken by others. It is clearly not a solution to simply

merge all the photos from different albums into one super collection. Because different albums use

different photo naming conventions, putting those photos together will result in either ordering the photos

into disjoint groups bearing no semantic meaning, or worse, disorder due to naming conflicts.

Second, it is unlikely that one can merge these photos based on their timestamps. Within each photo

collection from one camera, the photos can be arranged in chronological order based on their timestamps,

which forms the photo stream. However, since people rarely bother to calibrate their camera clocks before

taking photos, the timestamps from different cameras are typically out of sync and not reliable for aligning

the photos in different streams. Typically, the camera clocks can be offset by minutes, hours and even

days when people travel through different time zones. In fact, this is true for all of the real-world photo

collections gathered for the experiments in this work, none of which were created with this project in mind.

Third, not all the photos are wanted for sharing, given that many of them are redundant among different

photo streams, as they capture the same event. Such redundancy occurs even within a single photo

stream, given the low cost of photo taking with modern digital cameras. The photo redundancies, both within

and across multiple streams, make both sharing and browsing rather inefficient.

Therefore, it is desirable to develop automatic algorithms that help different people, who

may or may not know each other but attend the same event and take photos from different spatial or personal

perspectives, share their pictures and videos in an effective way, especially in online albums, in order

to enrich memories and promote social networking.

In recent years, the explosion of consumer digital photos has drawn growing research interest in

the organization and sharing of community photo collections, due to the popularity of web-based user-

centric multimedia social networks, such as Facebook, Flickr, Picasa, and Youtube. While many efforts

have been devoted to photo organization [15], annotation [8], [10], [14], [23], [26], summarization [6], [18],

browsing [12], [20], and search [13], [16], little has been done to relate media collections that are

about the same event for effective sharing. As a practical need we encounter every day, intelligent photo

collection and sharing is on many people's wish lists.

Fig. 2. Overview of our collaborative photo collection and sharing system.

This paper attempts to address the problems raised in the beginning and aims to develop automatic

algorithms to facilitate the sharing of multiple photo albums captured for the same event. Figure 2

illustrates the diagram of our system, where multiple albums are first aligned in the chronological order

of the event to create a super photo stream, which captures the integrity of the whole event for sharing

among different users. A master photo stream is then created by discarding the redundant photos for

efficient sharing of the same event. In both steps, our algorithms rely on the kernel

sparse representation graph that is constructed by explicitly sparsifying the photo connections using ℓ1-

norm minimization, as a generalization of the ℓ1-graph [4], [5] in the kernel space for broader application

purposes. For consumer albums, most photos by different cameras are visually uncorrelated, but some of

them overlap in content and thus form the basis for aligning different photo streams. Therefore, the visual

correlation links between different photo streams are sparse. As we will see, by explicitly accounting

for this sparseness in the graph construction via ℓ1-norm minimization, our matching algorithm is more

robust than conventional baseline methods, and the master summarization algorithm can properly discard

those redundant photos.

The remainder of the paper is organized as follows. Section II describes the kernel sparse representation

graph, given the preferred photo similarity measures. Tailored to our problems, Section III introduces a

specific sparse bipartite graph for robust alignment of multiple photo streams, and Section IV presents

the master stream summarization algorithm from the aligned super photo stream. In Section V, we report

the experimental results on a total of 36 real-world photo datasets, each of which involves two or more


cameras and was collected from the Picasa Web Album. Finally, Section VI concludes our paper with

future work.

II. KERNEL SPARSE GRAPH CONSTRUCTION

In many computer vision and machine learning tasks, finding the correct data relationships, typically

represented as a graph, is essential to the success of many algorithms. Due to the limitations of the existing

similarity measures, sparse graphs usually offer certain advantages because they can reduce spurious

connections between data points and thus tend to exhibit high robustness [27]. Existing sparse graph

construction methods are mainly based on the k-nearest-neighbor and ϵ-ball methods for neighbor

selection. Various approaches, e.g., binary weights, the Gaussian kernel [1], and ℓ2-reconstruction [17], are then used

to determine the edge weights. However, a graph constructed by the k-nearest-neighbor or ϵ-ball method is

based on pair-wise Euclidean distance, which is very sensitive to data noise, especially in high-dimensional

spaces. Furthermore, both the k-nearest-neighbor and ϵ-ball methods rely on a fixed parameter (k or ϵ) to

select the neighborhood. Therefore, they cannot adaptively select the most kindred neighbors for the data

points, which are unevenly distributed in most cases.

As is common in computer vision applications, we are facing very high-dimensional visual data,

whose intrinsic dimension is known to be much lower. In data analysis and dimensionality reduction, one

of the most fundamental steps consists of approximating the data set by a single low-dimensional space,

classically achieved by the Principal Component Analysis (PCA). However, in many situations, the data

points do not lie in a single low-dimensional space, but near a union of low-dimensional subspaces, e.g.

handwritten digits [11] and faces [24]. By formulating all other data points as the dictionary, each data

point in the union of subspaces will have a sparse representation, which can robustly identify the linear

subspace this data point lies in [19]. For clustering and discriminant analysis, people empirically found

that visual data or features from the same class typically lie near the same low-dimensional subspaces

[24], [7], [25], [5], [4]. Therefore, finding the sparse representation of a data point in terms of all others

can be applied in robust recognition and subspace clustering for high-dimensional data, e.g. using sparse

representation for face recognition [24]. Based on this observation, Cheng et al. [4] proposed a new

graph construction method via ℓ1-norm minimization, where the graph connections are established based

on the sparse representation coefficients of the current data point in terms of all the other data points.

Compared with the conventional graph construction algorithms, the new ℓ1-graph is robust to noise and

can adaptively select the kindred neighbors, demonstrating substantial improvements on various graph-

based applications. However, the method in [4] is limited to applications where data can be roughly


aligned, e.g. faces and digits. In this section, we propose to generalize the concept of ℓ1-graph for

exploring data relationships in a general kernel space, thus making our new graph applicable to a much

broader range of applications.

A. Similarity Measure

To construct the graph, we first define the similarity measure for photos. With the associated meta-data

of consumer photos, we represent each photo as {x, g}, where x denotes the image itself, and g its

geo-location. To keep the notation uncluttered, we simply use x instead of the duplet in the following

presentation. We define the photo similarity as

$$S(x_i, x_j) = \frac{1}{2}\left[S_v(x_i, x_j) + S_g(x_i, x_j)\right], \quad (1)$$

where Sv and Sg are the visual and geo-location similarities between photos xi and xj respectively.

Other information, e.g. photo tags for online albums, can also be incorporated if available.

Visual similarity Sv is the most important cue for our tasks. In this paper, we choose the following

three visual features to compute the visual similarity, due to their simplicity and effectiveness:

1) Color Histogram, an evidently important cue for consumer photos;

2) GIST [21], a simple and popular feature to capture the visual global shape;

3) LLC [22], a state-of-the-art appearance feature for image classification.

We concatenate these features with equal weights, normalize them into unit length, and simply use their

inner products as our visual similarity.

Photos taken at nearby locations are likely to capture the same content. Given the geo-locations gi and

gj for photos xi and xj , a geo-location similarity can be defined by a Gaussian kernel owing to its

simplicity:

$$S_g(x_i, x_j) = \exp\left(-\frac{\|g_i - g_j\|_2^2}{\sigma^2}\right). \quad (2)$$

Finally, the similarity measure S defines a valid kernel κ(·, ·) with Φ(·) being the implicit feature mapping

function, i.e.

$$\kappa(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = S(x_i, x_j). \quad (3)$$

In this paper, we mainly rely on the visual similarity; the geo-location similarity is used when GPS

information is recorded.
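
To make the combined kernel concrete, the following is a minimal sketch of Eqns. 1-3, assuming each photo comes with a unit-length concatenation of the three visual features and an optional geo-location; the function names and the default bandwidth sigma are illustrative, not from the paper.

```python
import numpy as np

def visual_similarity(f_i, f_j):
    """Inner product of unit-length concatenated features (color histogram, GIST, LLC)."""
    return float(np.dot(f_i, f_j))

def geo_similarity(g_i, g_j, sigma=1.0):
    """Gaussian kernel on geo-locations (Eqn. 2); sigma is an assumed bandwidth."""
    d2 = float(np.sum((np.asarray(g_i) - np.asarray(g_j)) ** 2))
    return np.exp(-d2 / sigma ** 2)

def photo_similarity(f_i, f_j, g_i=None, g_j=None, sigma=1.0):
    """Combined similarity kernel (Eqn. 1); falls back to visual-only without GPS."""
    s_v = visual_similarity(f_i, f_j)
    if g_i is None or g_j is None:
        return s_v
    return 0.5 * (s_v + geo_similarity(g_i, g_j, sigma))
```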


B. Graph Construction

The basic question of graph construction is, given one datum xt, how to connect it with many other

data points {xi}ni=1 based on some given similarity measure. The graph construction based on sparse

representation is formulated as

$$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|\Phi(x_t) - D\alpha\|_2^2 \le \epsilon, \quad (4)$$

where D = [Φ(x1),Φ(x2), ...,Φ(xn)] serves as the dictionary to represent Φ(xt). The connection

between xt and each xi is determined by the solution α∗: if α∗(i) = 0, there is no edge between

them; otherwise, the edge weight is defined as |α∗(i)|. Eqn. 4 is an NP-hard combinatorial problem,

whose tightest convex relaxation is through ℓ1-norm minimization [3],

$$\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2 \quad \text{s.t.} \quad \|\Phi(x_t) - D\alpha\|_2^2 \le \epsilon. \quad (5)$$

We further add a small ℓ2-norm regularization term to stabilize the sparse solution [28].

In many scenarios, we can easily define the similarity measure between data points whereas explicit

feature mapping Φ(·) may not be available, i.e. we only have the kernel function κ(·, ·). Eqn. 5 can be

solved implicitly in the kernel space by expanding the constraint [9],

$$\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2 \quad \text{s.t.} \quad 1 + \alpha^T \kappa(D, D)\alpha - 2\kappa(x_t, D)\alpha \le \epsilon, \quad (6)$$

where κ(D,D) is a matrix, with the (i, j)-th entry κ(D(:, i), D(:, j)) = S(xi,xj), and κ(xt, D) is a

vector whose k-th entry is κ(xt, D(:, k)) = S(xt, xk). Eqn. 6 can be solved efficiently in a similar way to

Eqn. 5.
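
For illustration, a penalized (Lagrangian) form of Eqn. 6 can be solved with a plain proximal-gradient (ISTA) loop. The sketch below assumes that reformulation; lam, beta, and n_iter are illustrative settings, and this is not necessarily the solver the authors used.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def kernel_sparse_code(K_DD, k_tD, lam=0.1, beta=1e-4, n_iter=200):
    """Minimize  a^T K a - 2 k^T a + lam*||a||_1 + beta*||a||_2^2  via ISTA.

    K_DD : (n, n) kernel matrix kappa(D, D) over the dictionary photos.
    k_tD : (n,)  kernel vector kappa(x_t, D) for the query photo.
    """
    alpha = np.zeros(K_DD.shape[0])
    # Step size from the Lipschitz constant of the smooth part's gradient.
    L = 2.0 * (np.linalg.norm(K_DD, 2) + beta)
    for _ in range(n_iter):
        grad = 2.0 * (K_DD @ alpha - k_tD + beta * alpha)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```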

Co-event photo collections by different cameras usually cover a wide variety of contents with some

amount of redundancy, i.e. most photos are visually uncorrelated while some of them overlap in content,

because different photographers may have captured correlated contents at the same moments. Consequently,

the visual correlation links between photo streams are usually sparse with respect to all possible edges

between photos, i.e. for each photo node in one photo stream, we only want to connect it with few photo

nodes in the other photo stream. On the other hand, real image data is usually very noisy. It is therefore

essential that the graph construction procedure is robust to such noise to select kindred neighbors. By

explicitly incorporating the sparsity constraint via ℓ1-norm in our kernel graph construction procedure,

we can adaptively and robustly select the most visually correlated photos for each node in the graph [4],

and thus discover the most informative links between different photo streams of the same event. In the


following, with the basic kernel sparse graph construction procedure, we propose a principled approach

for aligning photo streams based on a sparse bipartite graph in Section III-B.

III. PHOTO STREAM ALIGNMENT

In this section, we describe our approach for aligning multiple photo streams from different cameras

whose time settings are not calibrated. For each pair of photo streams, our alignment algorithm is based

on matching on a bipartite graph constructed with the kernel sparse representation graph discussed

in the previous section. A max linkage selection procedure is further introduced to let candidate photo

links compete, yielding robust matching.

A. Problem Statement

Suppose we are given two photo streams $X^1 = [x^1_1, x^1_2, \ldots, x^1_m]$ and $X^2 = [x^2_1, x^2_2, \ldots, x^2_n]$ about

the same event, associated with which are their own camera timestamps $T^1 = [t^1_1, t^1_2, \ldots, t^1_m]$ and

$T^2 = [t^2_1, t^2_2, \ldots, t^2_n]$, where $x^j_i$ denotes the $i$-th photo and $t^j_i$ its camera timestamp in stream $j \in \{1, 2\}$. In

most cases, we can assume that the relative time within both $T^1$ and $T^2$ is correct, but the relative time

shift between $T^1$ and $T^2$ is unknown. Our goal is to estimate the correct time shift $\Delta T$ between the two

time sequences. To make accurate photo stream alignment possible, we make the following assumption:

Assumption. The photo streams to be aligned contain a certain amount of temporal-visual correlations.

By finding such temporal-visual correlations between photos from different streams, though in many

cases sparse, we can align the two photo streams in chronological order to describe the complete event.

Although there is only one parameter ∆T to infer, robust and accurate alignment turns out to be nontrivial

for the following reasons:

1) Limited effectiveness of the visual features, i.e., semantically similar photos may be distant under the

visual similarity measure, and vice versa. For example, in Figure 3, the left two photos have low visual

similarity, but they are related to the same moment and same scene of the event.

2) Photos are not taken consciously to facilitate alignment, e.g. different photographers may capture

largely different contents. This is very common since different photographers have different spatial

and personal perspectives about the same event.

3) Misleading data may exist for alignment, e.g. similar scenes may be captured at different times.

For example, in Figure 3, the two photos are about the same scene, but they were taken at different

times by the two photographers respectively.

Fig. 3. The left two photos are visually distant, but semantically they are about the same scene (from the Horse Trail dataset).

The right two photos are about the same scene, but were taken at different times (from the Lijiang dataset).

As such, consumer photo streams are extremely noisy for accurate alignment, and decisions made on an

isolated pair of images can be incorrect without proper context of the corresponding photo streams. In

fact, as discussed later in our experiments in Section V, heuristic approaches are not reliable and often

run into contradictions. Therefore, we propose a principled approach for robust and accurate alignment by

matching two photo streams on a sparse bipartite graph between each pair of photo streams, constructed

based on kernel sparse representation.

B. Sparse Bipartite Graph

Different photo streams for the same event usually share some similar photo contents. If we can build

a bipartite graph $G = (X^1, X^2, E)$ linking the informative pairs, i.e., the distinctive photo pairs from the two

streams that share large visual similarities, then under Assumption III-A we will be able

to find the correct $\Delta T$. Since consumer albums typically contain photos diverse in content and appearance,

the informative pairs are few compared with the album sizes. Therefore, the bipartite graph $G$,

which includes the links of informative pairs between photo streams as its edges, should be sparse, i.e.,

$|E| \ll |X^1| \cdot |X^2|$.

In this case, we can only use the visual information and possible GPS information for measuring the

photo similarities. Based on the basic technique presented in Section II, Algorithm 1 shows the procedures

of constructing the bipartite graph between two photo streams $X^1$ and $X^2$, where $E^{12}$ records the directed

bipartite graph edges from $X^1$ to $X^2$, and $E^{21}$ records the reverse graph edges. The final affinity matrix

simply averages the two directed affinity matrices:

$$E_{ij} = \frac{1}{2}\left(E^{12}_{ij} + E^{21}_{ji}\right). \quad (7)$$

Using the average of the two directed edge weights makes the bipartite graph linkage more distinctive:

if $E^{12}_{ij}$ and $E^{21}_{ji}$ are both nonzero, i.e., $x^1_i$ and $x^2_j$ each choose the other as one of its informative

neighbors among many candidates, then $x^1_i$ and $x^2_j$ are strongly connected and more likely to form

an informative pair for the alignment task.


Algorithm 1 Sparse Bipartite Graph Construction

1: Input: photo streams $X^1$ and $X^2$, kernel function $\kappa$.

2: for each $x^1_i \in X^1$ do

3: Solve the following optimization: $\alpha^1_i = \arg\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2$ s.t. $1 + \alpha^T \kappa(X^2, X^2)\alpha - 2\kappa(x^1_i, X^2)\alpha \le \epsilon$.

4: Assign $E^{12}_{ij} = |\alpha^1_i(j)|$ for $j = 1, 2, \ldots, n$.

5: end for

6: for each $x^2_j \in X^2$ do

7: Solve the following optimization: $\alpha^2_j = \arg\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2$ s.t. $1 + \alpha^T \kappa(X^1, X^1)\alpha - 2\kappa(x^2_j, X^1)\alpha \le \epsilon$. (8)

8: Assign $E^{21}_{ji} = |\alpha^2_j(i)|$ for $i = 1, 2, \ldots, m$.

9: end for

10: Output: sparse bipartite graph affinity matrix $E = \left[E^{12} + (E^{21})^T\right]/2$.

C. Max Linkage Selection for Robust Matching

The above sparse bipartite graph construction is based on the similarity measure only, without respecting

the chronological order constraint within each photo stream. Yet these sparse links provide the candidate

photo matches critical for alignment. However, due to the limitations of the photo similarity measures,

some of these candidate matches may be spurious, hindering precise alignment. For photos that have very similar

neighbors in the other stream, the representation coefficients will be sparse with large magnitudes, and

thus these connections are more reliable for matching. For photos that do not have similar neighbors

(they behave like outliers in the photo streams), their representations will be less sparse and the coefficient

magnitudes are smaller, an observation similar to the face verification scenario in [24]. Therefore, we

propose a procedure called max linkage selection to prune those false candidate matches (usually with

small representation coefficients): if a photo has multiple links with other nodes, we only keep the edge

with maximum weight and break the rest.

In this way, the remaining matched pairs are more informative for the alignment task, as verified by

our experiments. Note that the max linkage selection is not equivalent to finding the most similar photo


based on nearest neighbor. Nearest neighbor selection is based on some distance metric, which is very

sensitive to noise in high-dimensional spaces. In contrast, our sparse graph is more robust to noise, and

thus the max linkage selection has a better chance of selecting the correct neighbors. In the experimental section,

we will compare with two baseline algorithms, "DNN" and "R-kNN", which are more advanced and

thus better than the simple nearest neighbor selection strategy. In fact, the "R-kNN" algorithm reduces

to finding the most similar photo when we set k = 1. As we will see later, our kernel sparse graph works

remarkably better than these baseline algorithms.

Denote the set of pruned matched pairs as

$$\mathcal{M} = \left\{ (x^1_i, t^1_i;\, x^2_j, t^2_j) \mid E_{ij} \neq 0 \right\}. \quad (9)$$

The correct time shift ∆T (in seconds) is found by

$$\Delta T = \arg\max_{\Delta t} \sum_{(i,j) \in \mathcal{M}} E_{ij}\, \delta\left(|t^1_i - t^2_j - \Delta t| \le \tau\right), \quad (10)$$

where $\delta$ is the indicator function, and $\tau$ is a small time-displacement tolerance for reliable matching

(chosen as 60 s in our experiments). Once we have the $\Delta T$'s for each pair of photo streams, we can merge

the multiple streams into a super photo stream in chronological order for sharing among different users.
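
The sketch below illustrates one reading of max linkage selection together with the voting of Eqn. 10: an edge survives only if it is the strongest link of both endpoint photos, and each surviving pair votes for the time shift it implies. The helper names and the row/column interpretation are ours.

```python
import numpy as np

def max_linkage_prune(E):
    """Keep an edge only if it is the maximum-weight link of both endpoints."""
    row_max = E == E.max(axis=1, keepdims=True)
    col_max = E == E.max(axis=0, keepdims=True)
    return np.where(row_max & col_max & (E > 0), E, 0.0)

def estimate_time_shift(E_pruned, t1, t2, tau=60.0):
    """Eqn. 10 sketch: weighted voting over the candidate shifts t1_i - t2_j.

    t1, t2 : camera timestamps (in seconds) of the two streams.
    tau    : matching tolerance (60 s in the paper).
    """
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    ii, jj = np.nonzero(E_pruned)
    shifts = t1[ii] - t2[jj]            # candidate Delta t from each matched pair
    weights = E_pruned[ii, jj]
    # Score each candidate shift by the total weight of the pairs it explains.
    scores = [weights[np.abs(shifts - dt) <= tau].sum() for dt in shifts]
    return shifts[int(np.argmax(scores))]
```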

D. Multiple Sequence Adjustment

In practice, we typically have more than two photo streams, which can provide complementary visual

matching information for alignment. Since pair-wise stream matching does not ensure time consistency as

a whole, we need to combine the matching results for multiple stream pairs. Suppose we have s streams

in total. For each pair of matched photo streams, we have the matched photo pair set $\mathcal{M}^*_{pq}$, $1 \le p, q \le s$,

found by Eqn. 10. Let $T^*_p$ and $T^*_q$ denote the timestamp sequences of the matched photo pair set, and let

$w_{pq}$ be the edge weights of the matched photo pairs, i.e., $w_{pq}(i)$ denotes the edge weight of the $i$-th

matched photo pair in $\mathcal{M}^*_{pq}$. Our goal is, for a chosen reference timestamp sequence $T^*_{ref}$, to infer a

time shift $\Delta T_p$ for each timestamp sequence $T^*_p$ with respect to $T^*_{ref}$, so that multiple photo streams

are mapped onto a common time axis. We define the matching error for two sequences as

$$\epsilon_{pq} = h\left(w_{pq}^T \left(T^*_p + \Delta T_p - T^*_q - \Delta T_q\right)\right), \quad (11)$$

where $h$ is the Huber function. As we can see, the larger $w_{pq}$, the more we trust the match between $T^*_p$

and $T^*_q$. In the ideal case, $\Delta T_p$ and $\Delta T_q$ are simply the time shifts of $T^*_p$ and $T^*_q$ with respect to

$T^*_{ref}$, and we have $\epsilon_{pq} = 0$. However, it sometimes happens that $T^*_p$ and $T^*_q$ are wrongly matched due to insufficient

visual hints. Based on all photo streams, we hope $\Delta T_p$ and $\Delta T_q$ can rectify such a matching outlier, for which

$w_{pq}^T(T^*_p + \Delta T_p - T^*_q - \Delta T_q) \neq 0$. Thanks to the Huber cost, our objective function is more robust to such matching

outliers. Combining the matching errors from all sequence pairs, the consistent time alignments can thus

be found by minimizing

$$\min_{\{\Delta T_l\}_{l=1}^{s}} \sum_{p=1}^{s} \sum_{q \neq p} \epsilon_{pq}. \quad (12)$$
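
A minimal sketch of the adjustment in Eqns. 11-12, assuming the pairwise matched timestamp sequences and weights are precomputed and stream 0 serves as the reference; the Huber threshold and the derivative-free solver are illustrative choices rather than the authors' settings.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1.0):
    """Huber cost: quadratic near zero, linear in the tails."""
    a = abs(r)
    return 0.5 * r ** 2 if a <= delta else delta * (a - 0.5 * delta)

def adjust_time_shifts(pairs, n_streams):
    """Jointly fit per-stream time shifts by minimizing Eqn. 12.

    pairs : dict mapping (p, q) -> (Tp, Tq, w): matched timestamp sequences
        and edge weights from Eqn. 10 for that stream pair.
    """
    def cost(free):
        dT = np.concatenate(([0.0], free))   # pin the reference shift to 0
        total = 0.0
        for (p, q), (Tp, Tq, w) in pairs.items():
            r = float(w @ (Tp + dT[p] - Tq - dT[q]))  # Eqn. 11 residual
            total += huber(r)
        return total

    res = minimize(cost, np.zeros(n_streams - 1), method="Nelder-Mead")
    return np.concatenate(([0.0], res.x))
```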

IV. MASTER STREAM SUMMARIZATION FOR SHARING

Since people have similar photo taking interests and camera viewpoints, they tend to capture the

same scenes in the same event, leaving redundant photos in the aligned super photo stream. Even

within one photo stream, many redundant photos exist due to the low cost of photo taking with modern

digital cameras. Such heavy redundancy makes automatic photo sharing, especially online sharing, rather

inefficient in both browsing and storage.

In this section, we aim to create a compact master stream from the aligned super photo stream $\mathcal{X} = \{x_1, x_2, x_3, \ldots, x_{|\mathcal{X}|}\}$, created by merging multiple photo streams from different cameras in chronological

order. The problem can be formally stated as

Definition 1. Given the super photo stream $\mathcal{X}$, find a subset $\mathcal{C} \subseteq \mathcal{X}$, called the master stream, which has

good coverage of the visual and temporal content information of $\mathcal{X}$ with minimal redundancy.

In contrast to previous summarization works such as [18], which aims to represent the scene set by

only a few example photos, our master stream summarization is designed to discard those redundant

photos while maintaining the temporal integrity of the event. Because the size of the photo summary produced

by [18] is much smaller than that of the original photo collection, the selection can become subjective or

sensitive to the summarization method and its parameters. In our master stream summarization, we are not

explicitly concerned with selecting a few semantically and subjectively accurate photos from a collection.

Therefore, the output of our master stream summarization will have more photos than those selected

by the conventional summarization works. We will present a first-party evaluation of our summarization

results at the end of the next section.

We define two mathematical criteria for our desired master stream C:

1) Compactness: the master stream should be as small as possible;

2) Coverage: the master stream should represent the original set well in terms of the similarity

measure.


Based on these two criteria, we propose the following cost function for finding such a master stream $\mathcal{C}$:

$$\mathcal{C}^* = \arg\min_{\mathcal{C} \subseteq \mathcal{X}} L_s(\mathcal{C}, \mathcal{X}) + \gamma L_r(\mathcal{C}), \quad (13)$$

where $L_s(\mathcal{C}, \mathcal{X})$ denotes the Information Loss incurred by representing $\mathcal{X}$ with the master stream $\mathcal{C}$,

$L_r(\mathcal{C})$ denotes the Information Redundancy contained in $\mathcal{C}$, and $\gamma$ balances the two terms. We define

these two terms after introducing the kernel sparse graph below.

Algorithm 2 Sparse Representation Graph Construction

1: Input: super stream $\mathcal{X}$, and kernel function $\kappa$.

2: Initialize: $W = I \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{X}|}$.

3: for $k = 1$ to $|\mathcal{X}|$ do

4: Solve the following optimization with $D_k = \mathcal{X} \backslash x_k$: $\alpha^* = \arg\min_{\alpha} \|\alpha\|_1 + \beta\|\alpha\|_2^2$ s.t. $1 + \alpha^T \kappa(D_k, D_k)\alpha - 2\kappa(x_k, D_k)\alpha \le \epsilon$.

5: Assign $W_{kt} = |\alpha^*(t)|$ for $t \neq k$.

6: end for

7: Output: the affinity matrix $W$.

A. Sparse Graph on the Super Stream

As mentioned earlier, consumer photo albums typically cover a large variety of visual contents,

while containing some amount of highly correlated redundant photos, especially in the case where

multiple cameras capture the same event. Therefore, the graph constructed on the super photo stream

$\mathcal{X} = \{x_1, x_2, \ldots, x_{|\mathcal{X}|}\}$ should have sparse visual correlation links. We again employ the kernel sparse

representation graph to adaptively choose the affinitive neighbors for each photo. Different from the

sparse bipartite graph discussed in Section III-B, for each photo in the super stream, we model all the

remaining photos as the dictionary, which yields an asymmetric affinity matrix W as in Algorithm 2. We obtain

the final symmetric affinity matrix by

$$W = \frac{1}{2}\left(W + W^T\right). \quad (14)$$


We again use the average of the two directed edge weights to emphasize strongly connected photo

pairs. For now, we assume the affinity matrix $W$ defines a valid kernel on the super stream photos¹, and

we denote by $\kappa$ its kernel function. Then the Information Loss term $L_s(\mathcal{C}, \mathcal{X})$ and Information Redundancy

term $L_r(\mathcal{C})$ in Eqn. 13 can be defined as

$$L_s(\mathcal{C}, \mathcal{X}) = \sum_{i=1}^{|\mathcal{X}|} L_s(\mathcal{C}, x_i), \quad (15)$$

with

$$L_s(\mathcal{C}, x_i) = \min_{\|\alpha\|_0 \le s} \alpha^T \kappa(\mathcal{C}, \mathcal{C})\alpha - 2\kappa(x_i, \mathcal{C})\alpha, \quad (16)$$

and

$$L_r(\mathcal{C}) = \|\kappa(\mathcal{C}, \mathcal{C}) - I\|_F, \quad (17)$$

where $I$ is the identity matrix and $s$ is the sparsity of the representation. Eqn. 16 is the sparse

representation problem for $x_i$ using the master stream $\mathcal{C}$ in the kernel space $\kappa$ defined by $W$, where we

choose $s = 1$ in this problem. Eqn. 17 requires the master stream $\mathcal{C}$ to be as orthogonal as possible.

Note that the Frobenius norm used here also favors a smaller master stream.
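
With s = 1 and unit self-similarities (the diagonal of W is initialized to the identity in Algorithm 2), the inner minimization of Eqn. 16 admits a simple closed form: the best single-atom code for x_i is alpha = kappa(x_i, c_t) for the strongest c_t. The sketch below uses that simplification, which is our reading rather than a stated implementation, and assumes nonnegative similarities.

```python
import numpy as np

def info_loss_s1(K_CX):
    """Eqns. 15-16 with s = 1: each photo contributes -max_t kappa(x_i, c_t)^2.

    K_CX : (|C|, |X|) sparsified similarities of master-stream photos (rows)
           to all super-stream photos (columns).
    """
    return float(-np.sum(np.max(K_CX, axis=0) ** 2))

def info_redundancy(K_CC):
    """Eqn. 17: Frobenius distance of the master-stream kernel from identity."""
    return float(np.linalg.norm(K_CC - np.eye(K_CC.shape[0]), "fro"))

def master_cost(K_CX, K_CC, gamma=1e-5):
    """Objective of Eqn. 13 for a candidate master stream."""
    return info_loss_s1(K_CX) + gamma * info_redundancy(K_CC)
```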

B. Greedy Backward Selection

Eqn. 13 seeks a master stream C that can well represent the original super stream, while favoring

orthogonality. The problem is similar to the maximum independent set problem in graphs, which

is known to be NP-hard. Therefore, we use a greedy backward model selection algorithm

to find an approximate solution. The basic idea is to greedily prune the samples in $\mathcal{X}$ until we find the

best objective function value. Algorithm 3 describes the procedure for searching for the master stream $\mathcal{C}$.

Although this greedy algorithm is suboptimal, as shown in Section V-C, it is effective in

practice.
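
The greedy loop can be sketched as follows, reusing the cost helpers from the sketch in Section IV-A; it returns the current set as soon as removing any further photo would increase the cost of Eqn. 13.

```python
import numpy as np
# info_loss_s1 / info_redundancy: the cost sketches from Section IV-A, assumed here.

def greedy_backward_selection(K, gamma=1e-5):
    """Algorithm 3 sketch: prune photos while the cost of Eqn. 13 keeps decreasing.

    K : (N, N) symmetric sparsified affinity matrix W over the super stream.
    Returns the indices of the photos kept in the master stream.
    """
    N = K.shape[0]
    all_cols = list(range(N))

    def cost(idx):
        return (info_loss_s1(K[np.ix_(idx, all_cols)])
                + gamma * info_redundancy(K[np.ix_(idx, idx)]))

    keep = list(range(N))
    best = cost(keep)
    while len(keep) > 1:
        trials = [cost(keep[:i] + keep[i + 1:]) for i in range(len(keep))]
        k = int(np.argmin(trials))
        if trials[k] > best:       # pruning no longer helps: stop
            break
        best = trials[k]
        keep.pop(k)
    return keep
```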

V. EXPERIMENTAL EVALUATION

A. Experiment Datasets

To evaluate the performance of our algorithm, we have collected a total of 36 real-world consumer

photo datasets, each corresponding to one event and containing multiple personal photo albums.

¹Empirically, we found that W is positive semi-definite. We will see in the later algorithm that this requirement can be

discarded, as we only care about the sparsified similarity.


Algorithm 3 Greedy Backward Selection

1: Input: sparse affinity matrix $W$.

2: Initialize: $\mathcal{C} \leftarrow \mathcal{X}$, $f = \infty$, and $\gamma = 10^{-5}$.

3: loop

4: for $i = 1$ to $|\mathcal{C}|$ do

5: $v[i] = L_s(\mathcal{C}_c, \mathcal{X}) + \gamma L_r(\mathcal{C}_c)$, with $\mathcal{C}_c = \mathcal{C} \backslash \mathcal{C}(i)$.

6: end for

7: $f_o \leftarrow f$; $[f, k] \leftarrow \min(v)$.

8: if $f > f_o$ then

9: Return $\mathcal{C}$.

10: end if

11: Update $\mathcal{C} \leftarrow \mathcal{C} \backslash \mathcal{C}(k)$.

12: end loop

The photographers were not aware of this project at the time of creating their photo albums; therefore,

the photos were taken without alignment in mind, avoiding bias in the later photo alignment. The number of

photos in each dataset ranges from several dozen to several hundred or even over a thousand. The

entire collection of datasets is rather diverse: the content ranges from traveling (numerous natural and

urban scenes) to social events (wedding, car racing, sports, stage shows, etc.); tens of photographers were

involved and tens of different camera models were used in total. In Table I, we list the 36 real-world

consumer photo datasets, with the number of photographers and the number of photos for each event.

spanning multiple days, there are multiple albums, each corresponding to one day. In the following, we

will first present our photo stream alignment results and compare with several baseline algorithms, and

then show our master stream summarization results in comparison with human selection. Based on the

proposed kernel sparse representation graph, our photo stream alignment and summarization algorithms

provide a robust and practical solution to the collaborative photo collection and sharing problem.

B. Photo Stream Alignment

For all the 36 real-world consumer photo datasets we have collected, the camera clocks were not

calibrated with each other in situ, and therefore, the absolute ground truth is unattainable. However, for

photo stream alignment, all we care about is the sequential order of the photos (from different cameras)

on a common timeline. Thus, we can provide the pseudo ground truth in the form of correct alignment


between photo streams according to the consensus of the involved photographers. In other words, as long

as an algorithm determines a time shift that produces the correct sequential order among all the involved

photo streams, that time shift is considered as good as the absolute ground truth. Therefore, we evaluate

the alignment results by checking whether the merged super photo stream is in the same sequential order

as the verified pseudo ground truth: if so, we count it as correct; otherwise, we count it as a failure no

matter how large the actual alignment error is. We also noticed that, given the constraints from multiple

matching photo pairs, the uncertainty range for the absolute ground truth tends to be small (usually

within a few seconds) for most of the datasets we collected.

TABLE I

LIST OF THE 36 REAL-WORLD CONSUMER PHOTO ALBUMS WE HAVE COLLECTED FOR PHOTO STREAM ALIGNMENT AND SUMMARIZATION.

Events              Albums  Photographers  Photos
Carnival Trip       8       5              798
Lijiang Trip        4       2              467
Zurich Trip         1       2              397
Yellowstone Trip    9       3              1060
Yellow Mountain     1       2              276
Grand Canyon        7       2              1498
Car Racing          1       4              1034
Wedding             1       3              294
Pageant Show        1       2              156
Moon Festival Show  1       3              258
Soccer              1       2              87
Horse Trial         1       2              111

TABLE II

THE PHOTO STREAM ALIGNMENT ACCURACY ON THE 36 PHOTO DATASETS WITH DIFFERENT ALGORITHMS.

Alg.  DNN    SIFT   kNN    R-kNN  SRG    R-SRG
Acc.  25/36  25/36  27/36  29/36  32/36  34/36

1) Alignment Results: The key to the photo stream alignment is to find the informative photo pairs. One

can come up with many possible heuristic approaches for this problem. However, heuristic approaches

often run into contradictions and fail to obtain robust and accurate alignment, suggesting that a principled

approach is needed for robust alignment. In the following, we describe and compare with the three best-


performing baseline methods among those we have tried.

1) Distinctive nearest neighbor search (DNN). For photo x1 from the first stream, x2 in the second

stream is its distinctive nearest neighbor if the similarity between x1 and x2 is at least r (r > 1)

times larger than that between x1 and any other photo in the second stream; otherwise, there

is no match for photo x1. There are also other ways to define DNN, e.g. only link those nearest

neighbors with similarities larger than some threshold µ. However, we find that our definition of

DNN is more robust for different datasets, since it introduces a competing procedure instead of

relying on a certain fixed threshold (see the sketch after this list).

2) SIFT feature matching. Another straightforward way to find the informative pairs is to use near-

duplication detection techniques, such as local SIFT feature matching by RANSAC [2]. However,

on one hand, SIFT feature matching tends to miss many visually similar but not quite duplicate

photos, leading to too few detections of the informative photo pairs in some cases. On the other

hand, this method tends to be misled by strong outliers, e.g., near-duplicate scenes that in some

cases actually occurred at different times. After all, the photographers do not always walk in lockstep

and take pictures. In practice, this approach is also too slow.

3) R-kNN graph matching. Instead of the proposed sparse graph, one can use the conventional kNN

graph to establish the sparse links, and assign the edge weights with the calculated similarities

between the photo nodes. To reject spurious links, one can also apply the max linkage selection

procedure as in our algorithm for robust matching, referred to as R-kNN.
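
For reference, a minimal sketch of the DNN baseline as defined above, reading "at least r times larger" as a ratio test between the best and second-best similarities; r = 1.5 is an illustrative value.

```python
import numpy as np

def distinctive_nn_pairs(S, r=1.5):
    """DNN baseline sketch: link x1_i to x2_j only when its best similarity
    beats the runner-up in the second stream by a factor of at least r.

    S : (m, n) similarity matrix between the two photo streams (n >= 2).
    Returns (stream-1 index, stream-2 index, similarity) triples.
    """
    pairs = []
    for i in range(S.shape[0]):
        order = np.argsort(S[i])[::-1]          # most similar first
        best, second = S[i, order[0]], S[i, order[1]]
        if best >= r * second:
            pairs.append((i, int(order[0]), float(best)))
    return pairs
```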

After obtaining the informative photo pairs from the above three baseline matching methods, the same

procedures as in Eqns. 10-12 are employed to achieve the final alignment results. In Table II, we list the

alignment accuracies of the different algorithms. By using the proposed max linkage selection procedure,

“R-kNN” performs better than kNN graph matching thanks to spurious linkage rejection. Directly using

the sparse representation graph, “SRG” already outperforms all the heuristic methods, and it is further

improved by using the competing procedure of max linkage selection (referred to as "R-SRG").

Overall, our algorithm can achieve excellent alignment results in merging different photo streams in

chronological order on most of the datasets, in a fashion comparable with human observers. In very

few other datasets (2 out of 36), such as the "soccer" dataset, where Assumption III-A is violated, our

algorithm fails as do unrelated human observers. In the “soccer” dataset, the photographers merely sat

around the same location, taking photos that were visually very similar but at different times. Figure 4

shows some example photos from two streams (indicated by different border colors) in this dataset, where

the photos are visually similar across different times.

Fig. 4. Example photos from the "soccer" dataset, which shows hardly any informative visual-temporal correlations.

Alignment on this dataset is also very challenging

for a human: only from a single pair of photos could a very observant human roughly align the two streams, with

careful examination of the semantic content of the photos (i.e., the positions and moving directions of

the soccer players).

Figure 5 shows an alignment example for the “Lijiang” trip photo dataset by our algorithm. There

is one particular difficulty with the alignment for this dataset—visually similar photos do not always

occur at the same time, which causes a problem for SIFT feature matching. The SIFT feature matching

method will strongly link the photo pairs connected by yellow dotted arrows shown in the figure, and

consequently produces incorrect alignment. In contrast, by utilizing more graph links from other photos,

our method can ultimately identify the correct time shift. Figure 6 shows another alignment example

for the "Wedding" event, which occurred in the courthouse. Many of those photos are visually very

similar (same persons with same backgrounds). All the baseline algorithms fail in this case, since many

matched pairs found by these algorithms are false links. By adaptively selecting the most relevant photos

via sparsity constraint followed by a competing procedure of max linkage selection, our algorithm can

effectively reject those spurious links and correctly identify the true time shift.

Figure 7 shows the curve of matching score versus time shift on three of the datasets. For the first two

cases, our algorithm can successfully locate the accurate time shift ∆T by picking the sharp peak from

the curve. However, for the third “soccer” dataset, the algorithm could not locate a clear peak. Compared

with the previous two cases, the curve has high entropy and multiple peaks, which are strong indications

of poor matching.

Finally, we note that the proposed algorithm is computationally as efficient as the heuristic methods

DNN and kNN (R-kNN), and much faster than SIFT matching.


Fig. 5. Alignment example by our proposed algorithm on the Lijiang Trip dataset. Photos from different cameras are indicated

by different border colors. Photos connected by the yellow dotted arrows denote the photo pairs linked by SIFT matching.

Fig. 6. Alignment examples by our proposed algorithm on the Wedding dataset. Photos from different cameras are indicated

by different border colors.

2) Effects of GPS Information: The proposed kernel sparse representation graph is a general framework

in which any similarity measure can be incorporated. In particular, we include the GPS information when

available in addition to the visual similarity for our tasks. Take the “Grand Canyon” dataset as an example,

which has geo-locations recorded. To show the effects of the GPS information, we first run the alignment

algorithm based on visual information only to obtain the informative photo pairs, and then deliberately

discard all the photos in these informative pairs from the original dataset to obtain a more challenging new

dataset. As expected, the subsequent alignment on this new dataset based only on (handicapped) visual

information results in disorder of the merged master photo stream ( an alignment error of approximate 5

minutes compared with the verified ground truth ). With geo-location information incorporated, however,

our algorithm can still achieve satisfactory alignment results, even though we do not have all the visually

informative photo pairs.

Fig. 7. The matching score vs. time shift. Left: Grand Canyon; middle: Lijiang day 4; right: Soccer.

TABLE III

MASTER STREAM SUMMARIZATION FOR THE RAW PHOTO DATASETS.

Photosets        Original  Master Stream
Wedding          291       125
Zurich           388       281
Pageant Show     156       97
Yellow Mountain  272       181

C. Master Stream Summarization

Many of the datasets we have in this study were already pre-selected by the users before they were

uploaded to online albums. Therefore, we only perform our summarization experiment on several raw

photo datasets, which have high redundancy within and across photo streams. Table III shows the sizes

of the original set and master stream using the proposed algorithm, with γ = 0.005. Figure 8 shows

how the cost function value changes as the algorithm drops more and more redundant photos for the

"Wedding" and "Yellow Mountain" datasets. Our algorithm terminates at the lowest point of the cost

curve, where information loss and redundancy reach a compromise.

Figure 9 shows two examples for master stream summarization on the “Zurich” and “Yellow Mountain”

datasets, where the grayed and borderless photos denote those discarded by our algorithm. In both cases,

our algorithm successfully selects the representative photos along the time line to summarize the entire

event, and discards those redundant ones both within and across photo streams.

Fig. 8. The cost function value reaches an optimum as the algorithm removes redundant photos. Left: Wedding; right: Yellow Mountain.

Fig. 9. Example photos of the master stream selected for the (a) "Zurich" and (b) "Yellow Mountain" datasets.

The first evaluation experiment is conducted on the “Yellow Mountain” photoset. In this case, we

have the raw photoset, and also its compact version “Yellow Mountain Compact”, where each user has

filtered his own album before uploading to Picasa. Running our algorithm on both datasets results in 181

photos for the raw photoset (C1) and 131 photos for the compact one (C2). Comparing the two master

streams, we found that 110 out of 131 photos (84%) in C2 are covered by C1. Consumers usually select

photos with a certain subjectivity, especially when they have many photos, e.g., they may neglect some

photos even though those photos are unique in visual content. Considering this, the 84% coverage is


remarkable although our algorithm selects slightly more photos than humans did.

The second evaluation experiment is conducted on the “Wedding” and “Zurich” photo datasets. In both

cases, we ask the first parties of the albums to evaluate the quality of the master streams created by our

algorithm. To quantitatively measure the master stream qualities, we ask the subjects to report the number

of photos they would like to adjust (add to or delete from our master streams) to create their

own master streams. For the “Wedding” photoset, our algorithm selects 125 photos. The subjects decide

to delete 3 photos from our automatic master stream and add 6 more photos back from the super photo

stream to create their own. For the “Zurich” photoset, our algorithm selects 281 photos. Starting from

this master stream, the corresponding subjects want to delete 8 photos from it and add 12 more photos

back. Given the subjective nature of the photo summarization task, the above numbers demonstrate that

our algorithm can produce master streams that are close to the expectation of consumers.

VI. CONCLUSION

In this paper, we address the practical problem of photo stream alignment and summarization for

collaborative photo collection and sharing in social media. Since people have similar photo taking interests

and viewpoints, there are photos with overlapping visual content when several cameras (photographers)

capture the same event. Given that the photo streams contain a sufficient amount of temporal-visual

correlations, we are able to align multiple photo streams along a common chronological timeline of

the event, by employing a sparse bipartite graph to find the informative photo pairs and a max linkage

selection competing procedure to prune the false links. Compared with several common baseline matching

algorithms, our alignment algorithm can achieve satisfactory performance that are comparable to that of

human. To further facilitate photo sharing, we also propose an algorithm to create a compact master

stream from the aligned multiple photo streams, by removing those redundant photos. The proposed

framework also lends itself to other applications, such as geo-tag and user-tag transfer between aligned

photo streams, which we will investigate in future work.

ACKNOWLEDGEMENT

This work is supported in part by Kodak Research. It is also supported by the U.S. Army Research

Laboratory and the U.S. Army Research Office under grant number W911NF-09-1-0383.

REFERENCES

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation,

15(6):1373–1396, 2003.


[2] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of

Computer Vision, 74:59–73, 2007.

[3] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete

frequency information. IEEE Transactions on Information Theory, 52:489–509, Feb. 2006.

[4] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang. Learning with ℓ1-graph for image analysis. IEEE Transactions on Image

Processing (TIP), 19(4):858–866, 2010.

[5] H. Cheng, Z. Liu, and J. Yang. Sparsity induced similarity measure for label propagation. In IEEE International Conference

on Computer Vision, 2009.

[6] W. T. Chu and C.-H. Lin. Automatic summarization of travel photos using near-duplicate detection and feature filtering.

In Proceedings of the ACM International Conference on Multimedia, 2009.

[7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.

[8] S. Gammeter, L. Boassard, T. Quack, and L. V. Gool. I know what you did last summer: object-level auto-annotation of

holiday snaps. In IEEE International Conference on Computer Vision, pages 614–621, 2009.

[9] S. Gao, I. W.-H. Tsang, and L.-T. Chia. Kernel sparse representation for image classification and face recognition. In

European Conference on Computer Vision (ECCV), 2010.

[10] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Tagprop: Discriminative metric learning in nearest neighbor

models for image auto-annotation. In IEEE International Conference on Computer Vision, 2009.

[11] T. Hastie and P. Y. Simard. Metrics and models for handwritten character recognition. Statistical Science, 1998.

[12] D. Huynh, S. Drucker, P. Baudisch, and C. Wong. Time quilt: scaling up zoomable photo browsers for large, unstructured

photo collections. In SIGCHI Conference on Human factors in computing systems, pages 1937–1940, 2005.

[13] D. Kirk, A. Sellen, C. Rother, and K. Wood. Understanding photowork. In SIGCHI Conference on Human factors in

computing systems, 2006.

[14] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an

automatic framework. In IEEE International Conference on Computer Vision, pages 2036–2043, 2009.

[15] A. C. Loui and A. Savakis. Automated event clustering and quality screening of consumer pictures for digital albuming.

IEEE Transactions on Multimedia, 2003.

[16] T. Quack, B. Leibe, and L. V. Gool. World-scale mining of objects and events from community photo collections. In

International Conference on Content-based Image and Video Retrieval, 2008.

[17] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323 – 2326,

2000.

[18] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In IEEE 11th International

Conference on Computer Vision, pages 1–8, 2007.

[19] M. Soltanolkotabi and E. J. Candes. A geometric analysis of subspace clustering with outliers. CoRR, 2011.

[20] G. Strong and M. Gong. Organizing and browsing photos using different feature vectors and their evaluations. In ACM

International Conference on Image and Video Retrieval, 2009.

[21] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition.

In Proceedings of International Conference on Computer Vision, 2003.

[22] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In

Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.

[23] X.-J. Wang, L. Zhang, M. Liu, Y. Li, and W.-Y. Ma. Arista-image search to annotation on billions of web photos. In

IEEE Conference on Computer Vision and Pattern Recognition, pages 2987–2994, 2010.


[24] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions

on Pattern Analysis and Machine Intelligence, pages 210 – 227, 2009.

[25] X.-T. Yuan and S. Yan. Visual classification with multi-task joint sparse representation. In CVPR, 2010.

[26] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. N. Metaxas. Automatic image annotation using group sparsity. In IEEE

Conference on Computer Vision and Pattern Recognition, pages 3312–3319, 2010.

[27] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin Madison, 2008.

[28] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society,

Series B, 67:301–320, 2005.

Jianchao Yang (S’08) received his B.E. degree in the Department of Electronics Engineering and Infor-

mation Science, University of Science and Technology of China (USTC), China, in 2006; and his M.S. and

Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign,

Illinois, in 2011. His research interests include computer vision, machine learning, sparse representation,

compressive sensing, image and video processing.

Jiebo Luo (S’93-M’96-SM’99-F’09) received the B.S. and M.S. degrees in electrical engineering from

the University of Science and Technology of China (USTC), Hefei, China, in 1989 and 1992, respectively;

and the Ph.D. degree in electrical and computer engineering from the University of Rochester, Rochester,

NY, in 1995. He was a Senior Principal Scientist with the Kodak Research Laboratories before joining the

Computer Science Department, University of Rochester, in 2011. His research interests include signal and

image processing, machine learning, computer vision, social media data mining, and medical imaging.

He has authored over 190 technical papers and holds over 60 U.S. patents. Dr. Luo is a Fellow of the SPIE and IAPR. He has

been actively involved in numerous technical conferences, including serving as the General Chair of ACM CIVR 2008; program

Co-Chair of IEEE CVPR 2012 and ACM Multimedia 2010; area Chair of IEEE ICASSP 2009–2012, ICIP 2008–2012, CVPR

2008, and ICCV 2011. He has served on the editorial boards of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND

MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON CIRCUITS

AND SYSTEMS FOR VIDEO TECHNOLOGY, Pattern Recognition, Machine Vision and Applications, and the Journal of

Electronic Imaging.


Jie Yu received his Ph.D. degree in Computer Science from the University of Texas at San Antonio,

San Antonio, Texas, in 2007 and his B.E. degree in Electrical Engineering from Dong Hua University,

Shanghai, China, in 2000. He is currently a Lead Computer Scientist at GE Global Research. Before

that, he was a Research Scientist at Kodak Research Labs. His research interests include computer vision,

machine learning, and pattern recognition. He has published over 30 journal articles, conference papers,

and book chapters in these fields. He is the recipient of the Best Poster Paper Award of ACM CIVR 2008

and the Student Paper Contest Winner Award of IEEE ICASSP 2006. He is a member of the IEEE and the ACM.

Thomas S. Huang (LF’01) received his B.S. Degree in Electrical Engineering from National Taiwan

University, Taipei, Taiwan, China; and his M.S. and Sc.D. Degrees in Electrical Engineering from the

Massachusetts Institute of Technology, Cambridge, Massachusetts. He was on the Faculty of the Depart-

ment of Electrical Engineering at MIT from 1963 to 1973; and on the Faculty of the School of Electrical

Engineering and Director of its Laboratory for Information and Signal Processing at Purdue University

from 1973 to 1980. In 1980, he joined the University of Illinois at Urbana-Champaign, where he is now

William L. Everitt Distinguished Professor of Electrical and Computer Engineering, Research Professor at the Coordinated

Science Laboratory and the Beckman Institute for Advanced Science and Technology, and Co-Chair of the Institute's major

research theme Human Computer Intelligent Interaction.

Dr. Huang’s professional interests lie in the broad area of information technology, especially the transmission and processing

of multidimensional signals. He has published 21 books, and over 600 papers in Network Theory, Digital Filtering, Image

Processing, and Computer Vision. He is a Member of the National Academy of Engineering; a Member of the Academia Sinica,

Republic of China; a Foreign Member of the Chinese Academies of Engineering and Sciences; and a Fellow of the International

Association of Pattern Recognition, IEEE, and the Optical Society of America.

Among his many honors and awards: Honda Lifetime Achievement Award, IEEE Jack Kilby Signal Processing Medal, and

the King-Sun Fu Prize of the International Association for Pattern Recognition.