
ExSample: Efficient Searches on Video Repositories through Adaptive Sampling

Oscar Moll, MIT CSAIL, [email protected]
Favyen Bastani, MIT CSAIL, [email protected]
Sam Madden, MIT CSAIL, [email protected]
Mike Stonebraker, MIT CSAIL, [email protected]
Vijay Gadepally, MIT Lincoln Laboratory, [email protected]
Tim Kraska, MIT CSAIL, [email protected]

ABSTRACT

Capturing and processing video is increasingly common as cameras become cheaper to deploy. At the same time, rich video understanding methods have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas.

Because state-of-the-art object detection methods are slow and expensive, our ability to process even simple ad-hoc object search queries (‘find 100 traffic lights in dashcam video’) over this accumulated data lags far behind our ability to collect it. Processing video at reduced sampling rates is a reasonable default strategy for these types of queries; however, the ideal sampling rate is both data and query dependent. We introduce ExSample, a low-cost framework for object search over un-indexed video that quickly processes search queries by adapting the amount and location of sampled frames to the particular data and query being processed.

ExSample prioritizes the processing of frames in a video repository so that processing is focused in portions of video that most likely contain objects of interest. It continually re-prioritizes processing based on feedback from previously processed frames. On large, real-world datasets, ExSample reduces processing time by up to 6x over an efficient random sampling baseline and by several orders of magnitude over state-of-the-art methods that train specialized per-query surrogate models. ExSample is thus a key component in building cost-efficient video data management systems.

PVLDB Reference Format:
xxx. xxx. PVLDB, 21(xxx): xxxx-yyyy, 2021.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

1. INTRODUCTION

Video cameras have become incredibly affordable over the last decade, and are ubiquitously deployed in static and mobile settings, such as smartphones, vehicles, surveillance cameras, and drones.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 21, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

Large video datasets are enabling a new generation of applications. For example, video data from vehicle dashboard-mounted cameras (dashcams) is used to train object detection and tracking models for autonomous driving systems [35]; to annotate map datasets like OpenStreetMap with locations of traffic lights, stop signs, and other infrastructure [26]; and to analyze collision scenes in dashcam footage to automate insurance claims processing [28].

However, these applications must process large amounts of video to extract useful information. Consider the task of finding examples of traffic lights – to, for example, annotate a map – within a large collection of dashcam video collected from many vehicles. The most basic approach to evaluate this query is to run an object detector frame by frame over the dataset. Because state-of-the-art object detectors run at about 10 frames per second (fps) over 1080p video on modern high-end GPUs, scanning through a collection of 1000 hours of 30 fps video with a detector on a GPU would take 3000 GPU-hours. In the offline query case, which is the case we focus on in this paper, we can parallelize our scan over the video across many GPUs, but, as the rental price of a GPU is around $0.50 per hour (for the cheapest g4 AWS instance in 2020) [1], our bill for this one ad-hoc query would be $1.5K, regardless of parallelism. Hence, this workload presents challenges in both time and monetary cost. Note that accumulating 1000 hours of video represents just 10 cameras recording for less than a week.
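For concreteness, the arithmetic behind these figures (using the detector throughput and GPU price quoted above) works out as follows:

    1000 hours × 3600 s/hour × 30 frames/s = 1.08 × 10^8 frames
    1.08 × 10^8 frames ÷ 10 frames/s ≈ 1.08 × 10^7 s ≈ 3000 GPU-hours
    3000 GPU-hours × $0.50/GPU-hour = $1500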

A straightforward means for mitigating this issue is to reduce the number of sampled frames: for example, only run the detector on one frame per second of video. After all, we might think it reasonable to assume all traffic lights are visible for longer than one second. The savings are large compared to inspecting every frame: processing only one frame every second decreases costs by 30x for a video recorded at 30 fps. Unfortunately, this strategy has limitations. First, the one frame out of every 30 that we look at may not show the light clearly, causing the detector to miss it completely. Second, lights that remain visible in the video for long intervals, like 30 seconds, would be seen multiple times unnecessarily, thereby wasting resources detecting those objects repeatedly. Worse still, we may miss other types of objects that remain visible for shorter intervals, and in general the optimal sampling rate is unknown, and varies across datasets depending on factors such as whether the camera is moving or static, and the angle and distance to the object.

In this paper, we introduce ExSample, a video sampling technique designed to reduce the number of frames that need to be processed by an expensive object detector for object search queries over large video repositories. ExSample models this problem as one of deciding which frame from the dataset to look at next based on what it has seen in the past. Specifically, ExSample splits the dataset into temporal chunks (e.g., half-hour chunks), and maintains a per-chunk estimate of the probability of finding a new object in a frame randomly sampled from that chunk. ExSample iteratively selects the chunk with the best estimate, samples a frame randomly from the chunk, and processes the frame through the object detector. ExSample updates the per-chunk estimates after each iteration so that the estimates become more accurate as more frames are processed.

Recent state-of-the-art methods for optimizing queries over large video repositories have focused on training cheap, surrogate models to approximate the outputs of expensive, deep neural networks [2, 12, 18, 20, 24]. In particular, BlazeIt [17] adapts such surrogate-based methods for object search limit queries, and optimizes these queries by training a lightweight surrogate model to output high scores on frames containing the object of interest and low scores on other frames; BlazeIt computes scores through the surrogate over the entire dataset, and then prioritizes high-scoring frames for processing through the object detector. However, this approach has two critical shortcomings, which we expand on in Section 2.2. First, even when processing limit queries, where the user seeks a fixed number of results, BlazeIt still requires a time-consuming upfront dataset scan in order to compute surrogate scores on every video frame. Second, while frames with high surrogate scores are likely to contain object detections relevant to a query, those objects may have already been seen in neighboring frames earlier during processing, and so applying the object detector on those frames does not yield new objects.

ExSample addresses the aforementioned shortcomings to achieve an order of magnitude speedup over the prior state-of-the-art (BlazeIt). First, rather than select frames for applying the object detector based on surrogate scores, ExSample employs an adaptive sampling technique that eliminates the need for a surrogate model and an upfront full scan. A key challenge here is that, in order to generalize across datasets and object types, ExSample must not make assumptions about how long objects remain visible to the camera and how frequently they appear. To address this challenge, ExSample guides future processing using feedback from object detector outputs on previously processed frames. Second, ExSample gives higher weight to portions of video likely to not only contain objects, but also to contain objects that haven't been seen already. Thus, ExSample avoids redundantly processing frames that only contain detections of objects that were previously seen in other frames.

We evaluate ExSample on a variety of search queries spanning different objects, different kinds of video, and different numbers of desired results. We show savings in the number of frames processed ranging up to 6x on real data, with a geometric average of 2x across all settings, in comparison to an efficient random sampling baseline. In the worst case, ExSample does not perform worse than random sampling, something that is not always true of alternative approaches. Additionally, in comparison to BlazeIt [17], our method processes fewer frames to find the same number of distinct results in many cases, showing that ExSample provides higher precision than a surrogate model in selecting frames with new objects; even in the remaining cases, ExSample still requires one to two orders of magnitude less runtime (or, equivalently, dollar cost) because it does not require an upfront per-query scan of the entire dataset.

In summary, our contributions are (1) ExSample, an adaptive sampling method that facilitates ad-hoc object searches over video repositories, (2) a formal analysis of ExSample's design, and (3) an empirical evaluation showing ExSample is effective on real datasets and under real system constraints, and that it outperforms existing approaches for object search.

2. BACKGROUND

In this section we review object detection, introduce distinct object queries, and discuss limitations of prior work.

2.1 Object Detection

An object detector is a model that operates on still images, taking an image as input and outputting a set of boxes within the image containing the objects of interest. The number of objects found can range from zero to arbitrarily many. Well-known examples of object detectors include Yolo [31] and Mask R-CNN [10]. In Figure 1, we show two example frames that have been processed by an object detector, with yellow boxes indicating detections of traffic lights.

Object detectors with state-of-the-art accuracy in benchmarks such as COCO [23] typically execute at around 10 frames per second on modern GPUs, though it is possible to achieve real-time rates by sacrificing accuracy [13, 31].

In this paper we do not seek to improve on state-of-the-art object detection approaches. Instead, we treat object detectors as a black box with a costly runtime, and aim to substantially reduce the number of video frames that need to be processed by the detector.
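To make this black-box interface concrete, the sketch below wraps an off-the-shelf detector behind the single call that such a search system needs. It is only an illustration, not the detector used in the paper: it assumes a recent torchvision with a pretrained Faster R-CNN, and a caller-supplied COCO class id (e.g., the id for traffic lights) along with an illustrative score threshold.

    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor

    # Load a pretrained COCO detector once; the search system only calls it as a black box.
    # (Model choice, weights argument, and threshold are illustrative assumptions.)
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def detect(rgb_frame, target_label, score_threshold=0.5):
        """Return boxes for one object class in a single RGB frame (H x W x 3, uint8 array)."""
        with torch.no_grad():
            pred = model([to_tensor(rgb_frame)])[0]    # dict with 'boxes', 'labels', 'scores'
        keep = (pred["labels"] == target_label) & (pred["scores"] >= score_threshold)
        return pred["boxes"][keep].tolist()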

2.2 Distinct Object Queries

In this paper, we are interested in processing higher-level queries on video enabled by the availability of object detectors. In particular, we are concerned with distinct object limit queries such as “find 20 traffic lights in my dataset” over large repositories of multiple videos from multiple cameras. For distinct object queries, each result should be a detection of a different object. For example, in Figure 1, we detected the same traffic light in two frames several seconds apart; although these detections are at different positions and are computed in different frames, in a distinct object query, these two detections yield only one distinct result.

Figure 1: Two video frames showing the same traffic light instance several seconds apart. In a distinct object query, these two boxes count as only one result.

Then, to specify a query, users must specify not only the object type of interest and the number of desired results, but also a discriminator function that determines whether a new detection corresponds to an object that was already seen earlier during processing. In this paper, we assume a fixed discriminator that applies an object tracking algorithm to determine whether a detection is new: it applies a tracker similar to SORT [3] backwards and forwards through video for each detection of a new object to compute the position of that object in each frame where the object was visible; then, future detections are discarded if they match previously observed positions.
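The discriminator interface relied on in the rest of the paper is small: given a frame and its detections, report which detections are new (d0) and which match exactly one earlier result (d1), and record detections for future matching. The sketch below is a deliberately simplified stand-in that matches detections by IoU overlap against boxes stored from temporally nearby frames; it is not the SORT-based forward/backward tracker described above, but it shows the get_matches/add interface assumed by the algorithm in Section 3.5. The IoU threshold and frame window are illustrative choices.

    def iou(a, b):
        """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    class SimpleDiscriminator:
        """Simplified stand-in for the tracker-based discriminator."""
        def __init__(self, iou_threshold=0.5, frame_window=30):
            self.iou_threshold = iou_threshold
            self.frame_window = frame_window
            self.history = []                      # list of (frame_id, box) stored so far

        def get_matches(self, frame_id, dets):
            d0, d1 = [], []                        # new dets / dets with exactly one match
            for box in dets:
                n_matches = sum(1 for fid, prev in self.history
                                if abs(fid - frame_id) <= self.frame_window
                                and iou(prev, box) >= self.iou_threshold)
                if n_matches == 0:
                    d0.append(box)
                elif n_matches == 1:
                    d1.append(box)
            return d0, d1

        def add(self, frame_id, dets):
            self.history.extend((frame_id, box) for box in dets)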

The goal of this paper is to reduce the cost of processing such queries. Moreover, we want to do this for ad-hoc distinct object queries over large video repositories, where there are diverse videos in our dataset and where it is too costly to compute all object detections ahead of time for the type of objects we are looking for. This distinction affects multiple decisions in the paper, including not only the results we return but also the main design of ExSample and how we measure result recall.

Below, we discuss two baselines and prior work for optimizing the processing of distinct object queries.

Naive execution. A straightforward method is to process frames sequentially, applying the object detector on each frame of each video, using the discriminator function to discard repeated detections of the same object. Once we collect enough distinct objects to satisfy the query's limit clause, we can stop scanning. A natural extension is to sample only 1 out of every n frames. Sequential processing exhibits high variance in execution time due to the uneven distribution of objects in video. Moreover, if objects appear in the video for much longer than the sampling interval, we may repeatedly compute detections of the same object. Similarly, if objects appear for less time than the sampling interval, we may completely miss some objects.

Random sampling. A better strategy is to iteratively process frames uniformly sampled from the video repository (without replacement). This method reduces the query execution time over naive execution as it explores more areas of the data more quickly, whereas sequential execution can get stuck in a long segment of video with no objects. Additionally, early in query execution, randomly sampled frames are less likely to contain previously seen objects compared to frames sampled sequentially.

Surrogate-based methods. Methods that optimize video query execution by training cheap, surrogate models to approximate the outputs of expensive object detectors have recently attracted much interest. In particular, BlazeIt [17] proposes an adaptation of surrogate model techniques for processing distinct object queries with limit clauses: rather than randomly sampling frames for processing through the object detector, BlazeIt processes frames in order of the score computed by the surrogate model on the frame, beginning with the highest-scoring frames. Since the surrogate model is trained to produce higher scores on frames containing relevant object detections, this approach effectively ensures that frames processed early during execution contain relevant detections (but not necessarily new objects).

However, surrogate-based methods have several critical shortcomings when used for processing object search queries. First, for queries seeking rare objects that appear infrequently in video, these methods require substantial pre-processing to collect annotations for training the surrogate models, which can be as costly as solving the search problem in the first place; indeed, BlazeIt resorts to random sampling if the number of positive training labels is too small. Moreover, when processing limit queries, surrogate-based methods require an upfront per-query dataset scan in order to compute surrogate scores on every video frame in the dataset. As we will show in our evaluation, oftentimes the cost of performing just this scan is already larger than simply processing a limit query under random sampling.

3. ExSample

In this section we explain how our approach, ExSample, optimizes query execution. Due to the high compute demand of object detectors, runtime in ExSample is roughly proportional to the number of frames processed by the detector. Thus, we focus on minimizing the number of frames processed to find some number of distinct objects.

To do so, ExSample estimates which portions of a video are more likely to yield new results, and samples frames more often from those portions. Importantly, ExSample prioritizes finding distinct objects, rather than purely maximizing the number of computed object detections.

At a high level, ExSample conceptually splits the input into chunks, and scores each chunk separately based on the frequency with which distinct objects have been found in that chunk in the past. This scoring system allows ExSample to allocate resources to more promising chunks while also allowing it to diversify where it looks next over time. In our evaluation, we show that this technique helps ExSample outperform greedy surrogate-guided strategies, even when they employ duplicate-avoidance heuristics (i.e., do not process frames that are close to previously processed frames).

To make it practical, ExSample is composed of two core components: an estimate of future results per chunk, described in Section 3.1, and a mechanism to translate these estimates into a sampling decision which accounts for estimate errors, introduced in Section 3.4. In those two sections we focus on quantifying the types of error. Later, in Section 3.5, we detail the algorithm step by step.

3.1 Scoring a Single Chunk

In this section we derive our estimate for the future value of sampling a chunk. To make an optimal decision of which chunk to sample next, assuming n samples have been taken already, ExSample estimates R(n+1) for each chunk, which represents the number of new results we expect to find on the next sample. New means R does not include objects already found in the previous n samples, even if they also appear in the (n+1)th frame. Intuitively, the chunk with the largest R(n+1) is a good location to pick a frame from next.

Let N be the number of distinct objects in the data. Each object i out of those N will be visible for a different number of frames, which means that when sampling frames at random from a chunk, each i will have a different probability of being found, which we call p_i. For example, in video collected from a moving vehicle that stops at a red light, red lights will tend to have large p_i, lasting on the order of minutes, while green and yellow lights are likely to have much smaller p_i, perhaps on the order of a few seconds. We find that in practice these p_i vary widely even for a single object class, corresponding to durations from tens to thousands of frames.

We emphasize that the p_i and N are not known in advance; neither the user nor ExSample knows them. Instead, ExSample relies on estimating the sum R(n+1) = Σ_{i=1}^{N} p_i · [i ∉ seen(n)] without estimating N and the p_i directly.


The estimate is similar to the Good-Turing estimator [7], and relies on counting the number of distinct objects seen exactly once so far, which we denote N1(n). That is, after sampling a frame and seeing a new object for the first time, N1(n) increases by 1, and after observing the same object a second time in a different frame, N1(n) decreases by 1. Unlike the p_i or N, which are unknown to the user and to ExSample, N1(n) is a quantity we can observe and track as we sample frames. For a single chunk, our estimate is:

    R(n+1) ≈ N1(n)/n    (1)

This formula is similar to the Good-Turing estimator [7], but under a different setting where the meaning of N1(n) is somewhat different. Specifically, in object search over video, we sample frames rather than symbols at random, and so a single frame can contain many objects, or no object at all. This means that in ExSample, N and N1(n) could range from 0 to far larger than n itself, e.g. in a crowded scene with many distinct objects per frame. In the traditional use of the Good-Turing estimator (such as the one explained in the original paper [7]) there is exactly one instance per sample and the only question is whether it is new. In that situation N1(n) will always be smaller than n.

In the following subsections, we describe the bias and variance of the estimate in terms of n, N, the average duration of a result µ_p = Σ p_i / N, and the standard deviation of the p_i, σ_p.

In particular, we will first bound the relative bias:

    rel. err = (E[N1(n)]/n − E[R(n+1)]) / (E[N1(n)]/n)

The following inequalities bound the relative bias error of our estimate:

Theorem (Bias).

    0 ≤ rel. err    (2)
    rel. err ≤ max p_i    (3)
    rel. err ≤ √N (µ_p + σ_p)    (4)

Intuitively, (2) implies that N1(n)/n tends to overestimate R(n+1); (3) implies that the size of the overestimate is guaranteed to be less than the largest probability, which is likely small; and (4) implies that even if a few p_i outliers were large, as long as the durations within one standard deviation σ_p of the average µ_p are small, the error will remain small. A large N or a large µ_p or σ_p may seem problematic for Equation 4, but we note that a large number of results N or a long average duration µ_p implies many results will be found after only a few samples, so the end goal of the search problem is easy in the first place and having guarantees of accurate estimates is less important.

In later experiments, we will show that we consistently perform well both in simulated skewed data containing many objects, and in five real-world datasets with diverse skews.

Proof. We now prove Equation 2, Equation 3, and Equation 4. The chance that object i is seen on the (n+1)th random sample from the chunk after being missed on the first n samples is p_i(1−p_i)^n. For the rest of the paper, we denote this quantity π_i(n+1). By linearity of expectation:

    E[R(n+1)] = Σ_{i=1}^{N} π_i(n+1)

For the rest of this proof, we will avoid explicitly writing summation indexes i, since our summations are always over the result instances, from i = 1 to N.

E[N1(n)] can also be expressed directly. The chance of having seen instance i exactly once after n samples is n p_i(1−p_i)^{n−1}, with the extra n coming from the possibility of the instance having shown up at any of the first n samples. We can rewrite this as n π_i(n). So E[N1(n)] = n Σ π_i(n), giving:

    E[N1(n)]/n = Σ π_i(n)

Now we focus on the numerator E[N1(n)]/n − E[R(n+1)]:

    E[N1(n)]/n − E[R(n+1)] = Σ [π_i(n) − π_i(n+1)]
                           = Σ p_i^2 (1−p_i)^{n−1}
                           = Σ p_i π_i(n)

We can see here that each term in the error is positive, hence we always overestimate, which proves Equation 2.

Now we bound this overestimate. Intuitively, the overestimate is small because each term is a scaled-down version of the corresponding term in E[N1(n)]/n = Σ π_i(n); more precisely:

    Σ p_i π_i(n) ≤ (max p_i) · Σ π_i(n)
                 = (max p_i) · E[N1(n)]/n    (5)

For example, if all the p_i are less than 1%, then the overestimate is also less than 1% of N1(n)/n. However, if we know there may be a few outliers with large p_i which are not representative of the rest, unavoidable in real data, then we would like to know our estimate will still be useful. We can show this is still true as follows:

    Σ p_i π_i(n) ≤ √( (Σ p_i^2)(Σ π_i(n)^2) )    (Cauchy-Schwarz)
                 ≤ √( (Σ p_i^2)(Σ π_i(n))^2 )    (the π_i are positive)
                 = √(Σ p_i^2) · Σ π_i(n)
                 = √(Σ p_i^2) · E[N1(n)]/n
                 = √N · √(Σ p_i^2 / N) · E[N1(n)]/n

The term Σ p_i^2 / N is the second moment of the p_i, and can be rewritten as σ_p^2 + µ_p^2, where µ_p and σ_p are the mean and standard deviation of the underlying result durations:

                 = √N · √(σ_p^2 + µ_p^2) · E[N1(n)]/n
                 ≤ √N (σ_p + µ_p) · E[N1(n)]/n    (6)

Putting together Equation 5 and Equation 6 we get:

    E[N1(n)]/n − E[R(n+1)] ≤ min{ (max p_i) · E[N1(n)]/n,  √N (σ_p + µ_p) · E[N1(n)]/n }

And dividing both sides by E[N1(n)]/n, we get:

    rel. err = (E[N1(n)]/n − E[R(n+1)]) / (E[N1(n)]/n) ≤ min{ max p_i,  √N (σ_p + µ_p) }

which justifies the remaining two bounds: Equation 3 and Equation 4.

3.2 Scoring Multiple Chunks

If we knew R_j(n_j + 1) for every chunk j, the algorithm could simply take the next sample from the chunk with the largest estimate. However, if we simply use the raw estimate

    R_j(n_j + 1) ≈ N1_j(n_j)/n_j    (7)

and pick the chunk with the largest estimate, then we run into two potential problems. First, instances may span multiple chunks. This corner case can be handled easily by adjusting the definition of N1_j, and can be verified directly in a similar way to Equation 1 (see our technical report [25]). Second, each chunk's estimate may be off due to random sampling error, especially in chunks with small n_j, which we handle next.

3.3 Instances Spanning Multiple Chunks

If instances span multiple chunks, for example a traffic light that spans the boundary between two neighboring chunks, Equation 7 is still accurate with the caveat that N1_j(n_j) is interpreted as the number of instances seen exactly once globally and which were found in chunk j. The same object found once in two chunks j and k does not contribute to either N1_j or to N1_k, even though each chunk has only seen it once. The derivation of this rule is similar to that in the previous section, and is given below. In practice, if only a few rare instances span multiple chunks then results are almost the same and this adjustment does not need to be implemented.

At runtime, the numerator N1_j will only increase in value the first time we find a new result globally, decrease back as soon as we find it again either in the same chunk or elsewhere, and finding it a third time would not change it anymore. Meanwhile n_j increases upon sampling a frame from that chunk. This natural relative increase and decrease of N1_j and n_j with respect to each other allows ExSample to seamlessly shift where it allocates samples over time.

Here we prove Equation 7 is also valid when different chunks may share instances. Assume we have sampled n_1 frames from chunk 1, n_2 from chunk 2, etc. Assume instance i can appear in multiple chunks: with probability p_i1 of being seen after sampling chunk 1, p_i2 of being seen after sampling chunk 2, and so on. We will assume we are working with chunk 1, without loss of generality. The expected number of new instances if we sample once more from chunk 1 is:

    R_1(n_1+1) = Σ_{i=1}^{N} [ p_i1 (1−p_i1)^{n_1} · Π_{j=2}^{M} (1−p_ij)^{n_j} ]    (8)

Similarly, the expected number of instances seen exactly once in chunk 1, and in no other chunk up to this point, is:

    N1_1 = Σ_{i=1}^{N} [ n_1 p_i1 (1−p_i1)^{n_1−1} · Π_{j=2}^{M} (1−p_ij)^{n_j} ]    (9)

In both equations, the expression Π_{j=2}^{M} (1−p_ij)^{n_j} factors in the need for instance i to not have been seen while sampling chunks 2 to M. We will abbreviate this factor as q_i. When instances only show up in one chunk, q_i = 1, and everything is the same as in Equation 1.

The expected error is:

    N1_1(n_1)/n_1 − R_1(n_1+1) = Σ_{i=1}^{N} [ p_i1^2 (1−p_i1)^{n_1−1} q_i ]    (10)

which again is term-by-term smaller than N1_1(n_1)/n_1 by a factor of p_i1.

3.4 Sampling Chunks Under Uncertainty

Before we develop our sampling method, we need to handle the problem that an observed N1(n) will fluctuate randomly due to randomness in our sampling. This is especially true early in the sampling process, where only a few samples have been collected but we need to make a sampling decision. Because the quality of the estimates themselves is tied to the number of samples we have taken, and we do not want to stop sampling a chunk due to a small amount of bad luck early on, it is important that we estimate how noisy our estimate is. The usual way to do this is by estimating the variance of our estimator: Var[N1(n)/n]. Once we have a good idea of how this variance error depends on the number of samples taken, we can derive our sampling method, which balances both the raw estimate of new objects and the number of samples it is based on.

Theorem (Variance). If instances occur independently of each other, then

    Var[N1(n)/n] ≤ E[N1(n)]/n^2    (11)

Note that this bound also implies that the variance error is more than n times smaller than the value of the raw estimate, because we can rewrite it as (E[N1(n)]/n)/n. Note, however, that the independence assumption is necessary for proving this bound. While in reality different results may not occur and co-occur truly independently, our experiments in the evaluation show that our estimate works well in practice.

Proof. We estimate the variance of N1(n) assuming independence of the different instances. We can express N1(n) as a sum of binary indicator variables X_i, which are 1 if instance i has shown up exactly once. X_i = 1 with probability n p_i(1−p_i)^{n−1} = n π_i(n). Because of our independence assumption, the total variance can be estimated by summing: N1(n) = Σ_i X_i and Var[N1(n)] = Σ_i Var[X_i]. Because X_i is a Bernoulli random variable, its variance is bounded by its success probability n π_i(n). Therefore, Var[N1(n)] ≤ Σ n π_i(n). This latter sum we know from before is E[N1(n)]. Therefore Var[N1(n)/n] ≤ E[N1(n)]/n^2.

In fact, we can go further and fully characterize the distribution of values N1(n) takes.

Theorem (Sampling distribution of N1(n)). Assuming the p_i are small or n is large, and assuming independent occurrence of instances, N1(n) follows a Poisson distribution with parameter λ = Σ π_i(n).

Proof. We prove this by showing N1(n)'s moment generating function (MGF) matches that of a Poisson distribution with parameter λ, M(t) = exp(λ[e^t − 1]).

As in the proof of Equation 11, we think of N1(n) as a sum of independent binary random variables X_i, one per instance. Each of these variables has a moment generating function M_{X_i}(t) = 1 + π_i(e^t − 1). Because 1 + x ≈ exp(x) for small x, and π_i(e^t − 1) will be small, then 1 + π_i(e^t − 1) ≈ exp(π_i(e^t − 1)). Note that π_i(e^t − 1) is always eventually small for some n because π_i(n+1) = p_i(1−p_i)^n ≤ p_i e^{−n p_i} ≤ 1/(en).

Because the MGF of a sum of independent random variables is the product of the terms' MGFs, we arrive at:

    M_{N1(n)}(t) = Π_i M_{X_i}(t) = exp( [Σ_i π_i] [e^t − 1] )

3.4.1 Thompson Sampling

Now we can use this bound to design a sampling strategy. The goal is to pick between (N1_j(n), n_j) and (N1_k(n), n_k) instead of only between N1_j and N1_k, where the only reasonable answer would be to pick the largest. One way to implement this comparison is to randomize it, which is what Thompson sampling [32] does.

Thompson sampling works by modeling unknown parameters such as R_j not with point estimates such as N1_j(n_j)/n_j but with a wider distribution over their possible values. The width should depend on our uncertainty. Then, instead of using the point estimate to guide decisions, we use a sample from this distribution. This effectively adds noise to our point estimate in proportion to how uncertain it is. In our implementation, we choose to model the uncertainty around R_j(n+1) as following a Gamma distribution:

    R_j(n+1) ∼ Γ(α = N1_j, β = n_j)    (12)

A Gamma distribution is shaped much like the Normal when N1/n is large, but behaves more like a single-tailed distribution when N1/n is near 0, which is desirable because N1/n will become very small over time, yet we know our hidden λ is always non-negative. The Gamma distribution is a common way to model the uncertainty around an unobservable parameter λ of a Poisson distribution whose samples we can observe. Because we have shown that N1(n) does in fact follow a Poisson distribution with an unknown parameter, this choice is especially suitable for our use case. The mean value of the Gamma distribution in Equation 12 is α/β = N1_j/n_j, which is by construction consistent with Equation 7, and its variance is α/β^2 = N1_j/n_j^2, which is by construction consistent with the variance bound of Equation 11.

Finally, the Gamma distribution is not defined when α or β are 0, so we need a way to handle both the scenario where N1(n) = 0, which can happen often due to objects being rare or due to having exhausted all results, and the initialization case, when both N1 and n are 0. We do this by adding a small quantity α_0 and β_0 to the two terms, obtaining:

    R_j(n_j + 1) ∼ Γ(α = N1_j(n_j) + α_0, β = n_j + β_0)    (13)

We used α_0 = 0.1 and β_0 = 1 in practice, though we did not observe a strong dependence on these values. We experimented with alternatives to Thompson sampling, specifically Bayes-UCB [19], which uses upper quantiles from Equation 13 instead of samples to make decisions, but we did not observe significantly different results.
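To make the sampling step concrete, the sketch below draws one Thompson sample per chunk from the belief distribution of Equation 13 and returns the chunk to sample next. This is a minimal sketch rather than the paper's implementation; note that numpy parameterizes the Gamma distribution with a scale rather than a rate, so β becomes scale = 1/β.

    import numpy as np

    def choose_chunk(N1, n, alpha0=0.1, beta0=1.0, rng=None):
        """Thompson sampling over Gamma(N1_j + alpha0, rate = n_j + beta0), one draw per chunk.
        N1, n: arrays with one entry per chunk (objects seen exactly once, frames sampled)."""
        rng = rng if rng is not None else np.random.default_rng()
        shape = np.asarray(N1, dtype=float) + alpha0
        rate = np.asarray(n, dtype=float) + beta0
        samples = rng.gamma(shape, 1.0 / rate)     # one draw of R_j(n_j + 1) per chunk
        return int(np.argmax(samples))

    # Example: choose_chunk([3, 0, 1], [40, 40, 10]) usually returns 2, the chunk whose
    # estimate is both the largest and the most uncertain.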

3.4.2 Empirical Validation

In this section we provide an empirical validation of the estimates from the previous sections, including Equation 7 and Equation 13. The question we are interested in is: given an observed N1 and n, what is the true expected R(n+1), and how does it compare to the belief distribution Γ(N1, n) which we propose using?

We ran a series of simulation experiments. To do this, we first generate 1000 probabilities p_1, ..., p_1000 at random to represent 1000 results with different durations. To ensure the kind of duration skew we observe in real data, we use a lognormal distribution to generate the p_i. To illustrate the skew in the values, the smallest p_i is 3 × 10^−6, while the max p_i is 0.15. The median p_i is 1 × 10^−3. The parameters µ_p and σ_p computed from the p_i are 3 × 10^−3 and 8 × 10^−3, respectively. For a dataset with 1 million frames (about 10 hours of video), these durations correspond to objects spanning from 1/10 of a second up to about 1.5 hours, more skew than what normally occurs.

Then, we model the sampling of a random frame as follows: each instance is in the frame independently at random with probability p_i. To decide which of the instances will show up in our frame we simulate tossing 1000 coins independently, each with its own p_i, and the positive draws give us the subset of instances visible in that frame. We then proceed drawing these samples sequentially, tracking the number of frames we have sampled, n, and how many instances we have seen exactly once, N1. We also record E[R(n+1)]: the expected number of new instances we can expect in a newly sampled frame, which we can compute directly as Σ_i [i ∉ seen(n)] · p_i because in the simulation we know the remaining unseen instances and their hidden probabilities p_i. We sequentially sample frames up to n = 180000, and repeat the experiment 10K times, obtaining hundreds of millions of tuples of the form (n, N1, R(n+1)) for our fixed set of p_i. Using this data, we can answer our original question (given an observed N1 and n, what is the true expected R(n+1)?) by selecting different observed n and N1 at random, conditioning (filtering) on them, and plotting a histogram of the actual R(n+1) that occurred in all our simulations. We show these histograms for 10 pairs of n and N1 in Figure 2, alongside our belief distribution.
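A condensed version of this simulation can be written as below. It is a sketch under the assumptions just described (1000 lognormally distributed p_i, each instance visible in a frame independently with probability p_i); the lognormal parameters are illustrative values chosen to roughly match the quoted median, and the sketch runs far fewer steps and repetitions than the full experiment.

    import numpy as np

    rng = np.random.default_rng(0)

    # 1000 per-instance probabilities with lognormal skew, clipped to the quoted maximum.
    p = np.minimum(rng.lognormal(mean=-7.0, sigma=2.0, size=1000), 0.15)

    seen_counts = np.zeros(len(p), dtype=int)
    records = []                                     # tuples (n, N1, true expected R(n+1))

    for n in range(1, 20001):                        # the paper runs to n = 180000, 10K times
        visible = rng.random(len(p)) < p             # which instances appear in this frame
        seen_counts += visible
        N1 = int(np.sum(seen_counts == 1))           # instances seen exactly once so far
        true_R = float(np.sum(p[seen_counts == 0]))  # expected new instances in one more frame
        records.append((n, N1, true_R))

    # Conditioning on observed (n, N1) pairs in `records` and histogramming true_R
    # reproduces the comparison against the Gamma belief of Equation 13.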

Figure 2 shows a mix of 3 important scenarios. The first 3 subplots, with n ≤ 100, are representative of the early stages of sampling. Here we see that the Γ model has substantially more variance than the underlying true distribution of R(n+1). This is intuitively expected, because early on both the bias and the variance of our estimate are bottlenecked by the number of samples, and not by the inherent uncertainty of R(n+1). As n grows to mid-range values (next 4 plots), we see that the curve fits the histograms very well, and also that the curve keeps shifting left to lower and lower orders of magnitude on the x axis. Here we see that the one-sided nature of the Gamma distribution fits the data better than a bell-shaped curve would. The final 3 subplots show scenarios where n has grown large and N1 is potentially very small, including a case where N1 = 0. In that last subplot, we see the effect of having the extra α_0 in Equation 13, which means Thompson sampling will continue producing non-zero values at random and we will eventually correct our estimate when we find a new instance. In 3 of the subplots there is a clear bias to overestimate, though it is not that large despite the large skew.

This empirical validation was based on simulated data. In our evaluation we show these modeling choices work well in practice on real datasets as well, where our assumption of independence is not guaranteed and where durations may not necessarily follow the same distribution law.

Figure 2: Comparing our Gamma heuristic of Equation 13 with a histogram of the true values R(n+1) from a simulation with heavily skewed p_i. The details of the simulation are discussed in Section 3.4.2. The histograms show the range of values seen for R(n+1) given the observed N1 and n. The point estimate N1/n of Equation 7 is shown as a vertical line. The belief distribution density is plotted as a thicker orange line, and 5 samples drawn using Thompson sampling are shown with dashed lines.

3.5 Algorithm

The pseudocode for ExSample is shown in Algorithm 1. The inputs to the algorithm are:

- video: The video dataset itself, which may be a single video or a collection of files.
- chunks: The collection of logical chunks that we have split our video dataset into. There are M chunks total.
- detector: An object detector that processes video frames and returns object detections relevant to the query.
- discrim: The discriminator function that decides whether a detection is new or redundant.
- result limit: The limit clause indicating when to stop.

After initializing arrays to hold per-chunk statistics, the code can be understood in three parts: choosing a frame, processing the frame, and updating the sampler state.

input : video, chunks, detector, discrim, result limit
output: ans

1   ans ← []
    // arrays for stats of each chunk
2   N1 ← [0, 0, ..., 0]
3   n ← [0, 0, ..., 0]
4   while len(ans) < result limit do
        // 1) choice of chunk and frame
5       for j ← 1 to M do
6           Rj ← Γ(N1[j] + α_0, n[j] + β_0).sample()
7       end
8       j* ← arg max_j Rj
9       frame id ← chunks[j*].sample()
        // 2) io, decode, detection and matching
10      rgb frame ← video.read and decode(frame id)
11      dets ← detector(rgb frame)
        // d0 are the unmatched dets
        // d1 are dets with only one match
12      d0, d1 ← discrim.get matches(frame id, dets)
        // 3) update state
13      N1[j*] ← N1[j*] + len(d0) − len(d1)
14      n[j*] ← n[j*] + 1
15      discrim.add(frame id, dets)
16      ans.add(d0)
17  end

Algorithm 1: ExSample

First, ExSample decides which frame to process next. It applies the Thompson sampling step in line 6, where we draw a separate sample Rj from the belief distribution of Equation 13 for each of the chunks, which is then used in line 8 to pick the highest-scoring chunk. The j in the code is used as the index variable for any loop over the chunks. During the first execution of the while loop all the belief distributions are identical, but their samples will not be, breaking the ties at random. Once we have decided on a best chunk index j*, we sample a frame index at random from that chunk in line 9.

Second, ExSample reads and decodes the chosen frame, and applies the object detector on it (line 11). Then, we pass the computed detections on to a discriminator, which compares the detections with objects we have returned earlier during processing in other frames, and decides whether each detection corresponds to a new, distinct object. The discriminator returns two subsets: d0, the detections that did not match any previous results (i.e., new objects), and d1, the detections that matched exactly once with any previous detection. We only use the sizes of those sets to update our statistics in the next stage. This frame processing stage of ExSample is the main bottleneck, and is dominated first by the detector call in line 11, and second by the random read and decode in line 10.

Third, we update the state of our algorithm, updating N1 and n for the chunk we sampled from. We also store detections in the discriminator and append the new detections to the result set (ans). We note that the amount of state we need to track only grows with the number of results we have output so far, and not with the size of the dataset.
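For reference, the same loop can be written compactly in Python. This is a sketch, not the paper's implementation: it assumes caller-provided video, chunks, detector, and discrim objects exposing the interfaces used in Algorithm 1 (sample, read_and_decode, get_matches, add), and it reuses the choose_chunk helper sketched in Section 3.4.1.

    def exsample(video, chunks, detector, discrim, result_limit, alpha0=0.1, beta0=1.0):
        """Python sketch of Algorithm 1 (ExSample)."""
        M = len(chunks)
        N1 = [0] * M                   # per-chunk count of objects seen exactly once
        n = [0] * M                    # per-chunk count of frames sampled
        ans = []

        while len(ans) < result_limit:
            # 1) choice of chunk and frame (lines 5-9)
            j = choose_chunk(N1, n, alpha0, beta0)
            frame_id = chunks[j].sample()

            # 2) io, decode, detection and matching (lines 10-12)
            rgb_frame = video.read_and_decode(frame_id)
            dets = detector(rgb_frame)
            d0, d1 = discrim.get_matches(frame_id, dets)

            # 3) update state (lines 13-16)
            N1[j] += len(d0) - len(d1)
            n[j] += 1
            discrim.add(frame_id, dets)
            ans.extend(d0)

        return ans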

3.6 Other Optimizations


We now extend Algorithm 1 with two optimizations.

Batched execution. Algorithm 1 processes one frame at a time, but on modern GPUs, inference is much faster when performed on batches of images. The code for a batched version is similar to that in Algorithm 1, but on line 6 we draw B samples per chunk j, instead of one sample from each belief distribution, creating B cohorts of samples. In Figure 2 we show 5 different values from Thompson sampling of the same distribution as dashed lines. Because each sample for the same chunk will be different, the chunk with the maximum value will also vary, and we will get B chunk indices, biased toward the more promising chunks. The logic for the state update only requires small modifications. In principle, we might fear that picking the next B frames at random instead of only 1 frame could lead to suboptimal decision making within that batch, but at least for small values of B up to 50, which is what we use in our evaluation, we saw no significant drop. This batch size is sufficient to fully utilize a machine with 4 GPUs.
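A minimal sketch of this batched selection step (assuming numpy, with N1, n, α_0, β_0 as above) draws a B × M matrix of Gamma samples and takes one argmax per row:

    import numpy as np

    def choose_chunks_batched(N1, n, B, alpha0=0.1, beta0=1.0, rng=None):
        """Return B chunk indices, one per cohort of Thompson samples (batched line 6)."""
        rng = rng if rng is not None else np.random.default_rng()
        shape = np.asarray(N1, dtype=float) + alpha0                   # per-chunk Gamma shape
        rate = np.asarray(n, dtype=float) + beta0                      # per-chunk Gamma rate
        samples = rng.gamma(shape, 1.0 / rate, size=(B, len(shape)))   # B draws per chunk
        return samples.argmax(axis=1)                                  # best chunk in each cohort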

We do not implement or evaluate asynchronous, distributed execution in this paper, but the same reasoning suggests ExSample could be made to scale to an asynchronous setting, with workers processing a batch of frames without waiting for other workers. Ultimately, all the updates to N1_j and n_j are commutative because they are additive.

Avoiding near duplicates within a single chunk. While random sampling is both a reasonable baseline and a good method for sampling frames within selected chunks, it allows samples to land very close to each other in quick succession: for example, in a 1000-hour video, random sampling is likely to start sampling frames within the same one-hour block after having sampled only about 30 different hours, instead of after having sampled most of the hours once. For this reason, we introduce a variation of random sampling, which we call random+, that deliberately avoids sampling temporally near previous samples when possible: it samples one random frame out of every hour, then one frame out of every not-yet-sampled half hour at random, and so on, until eventually sampling the full dataset. We evaluate the separate effect of this change in our evaluation. We also use random+ to sample within a chunk in ExSample, by modifying the internal implementation of the chunk.sample() method in line 9.
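One possible realization of this random+ ordering is sketched below; it is an illustration rather than the paper's implementation. The coarsest interval length (e.g., one hour's worth of frames) is a caller-supplied parameter; at each level the sketch samples one frame from every interval that does not yet contain a sample, then halves the interval size.

    import random

    def random_plus_order(num_frames, coarse_interval, seed=0):
        """Return frame indices ordered so coarse intervals are each covered before finer ones."""
        rng = random.Random(seed)
        sampled, order = set(), []
        size = coarse_interval
        while size >= 1 and len(sampled) < num_frames:
            covered = {f // size for f in sampled}          # interval ids already holding a sample
            starts = [s for s in range(0, num_frames, size) if (s // size) not in covered]
            rng.shuffle(starts)
            for lo in starts:
                f = rng.randrange(lo, min(lo + size, num_frames))
                sampled.add(f)
                order.append(f)
            size //= 2
        rest = [f for f in range(num_frames) if f not in sampled]   # any frames still unsampled
        rng.shuffle(rest)
        return order + rest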

4. ANALYSIS

We have explained how ExSample allocates samples across chunks. In this section, we develop an analytical model to help us understand the circumstances under which our method performs better than random.

4.1 Upper Bound on Chunking Effectiveness

After n samples, ExSample will have allocated those samples across M different chunks, effectively weighting each chunk differently. In this section we derive an equation, parameterized by dataset statistics, that gives an upper bound on how well any method that allocates samples across chunks can do, in terms of samples taken, versus random sampling across the entire dataset.

The number of results discovered by random sampling after n samples follows the curve

    N(n) = Σ (1 − (1−p_i)^n) = N − Σ (1−p_i)^n    (14)

If we split the dataset into M chunks, then instead of having a single probability p_i of being found, instance i has a vector p_i of M entries p_ij, where p_ij is the conditional probability of instance i being found in a sample from chunk j. Every weighted sampling scheme over these chunks corresponds to an M-dimensional vector w, where w_j is the probability of sampling from chunk j. The total chance of sampling instance i under w is the dot product p_i · w. Hence, N(n, w) = N − Σ (1 − p_i · w)^n. The optimal allocation of samples to chunks corresponds to the convex problem:

    N*(n) = N − min_w Σ (1 − p_i · w)^n    (15)
    subject to w ≥ 0 and Σ_j w_j = 1.    (16)

If we assume the M chunks are equally sized, then random sampling corresponds to w = 1/M, and so it is only one of the feasible solutions, which makes clear that optimal weighted sampling is never worse than random sampling, though some implementations of weighted sampling could perform worse than random if weights are chosen poorly.

Note that ExSample does not have access to the p_i ahead of time; instead, ExSample solves the same optimization problem in an online fashion. However, for evaluation purposes in this paper, this bound can be computed directly from the p_i using a solver package such as cvxpy [4], which is what we do in later sections. This bound is useful because, once chunks are fixed, no method that relies on weighted random sampling can do better than this curve in expectation; this includes design variations of ExSample using different estimators or different ways to model uncertainty. We later show empirically that ExSample converges to this bound, depending partially on our choice of chunks.
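The sketch below shows how this bound could be computed with cvxpy. It is a sketch under stated assumptions, not the paper's evaluation code: P is assumed to be an N × M array with P[i, j] = p_ij, n is the number of samples, and solver defaults are left unchanged.

    import cvxpy as cp

    def optimal_allocation_bound(P, n):
        """Solve Equation 15 for the best weighted-sampling curve N*(n).
        P: (N, M) array, P[i, j] = chance instance i is found in one sample of chunk j."""
        N, M = P.shape
        w = cp.Variable(M, nonneg=True)                   # sampling weight per chunk (Equation 16)
        misses = cp.power(1 - P @ w, n)                   # per-instance miss probability after n samples
        problem = cp.Problem(cp.Minimize(cp.sum(misses)), [cp.sum(w) == 1])
        problem.solve()
        return N - problem.value, w.value                 # upper bound N*(n) and optimal weights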

4.2 Data Dependent Parameters

The previous section provided a tight upper bound N*(n) for any weighted sampling strategy, but it does not guarantee ExSample or even N*(n) is always better than random sampling. This section explores how different parameters, in particular skew in instance locations and average duration of instances, affect ExSample performance in simulation.

Instance skew is a key parameter governing the performance of ExSample. If we knew 95% of our results lie in the first half of the dataset, then we could allocate all our samples to that first half. This would boost the p_i of those results in the first half by 2x, without requiring knowledge of their precise locations. Hence we could expect about a 2x savings in the frames needed to reach the same number of results. Skew arises in datasets whenever the occurrence of an event correlates with a relevant latent factor that varies across time, e.g., time of day or location (city, country, highway, camera angle) where video is taken. Note that ExSample does not know of these correlated factors or expect them as inputs in any way, but exploits this skew via sampling when it exists in the data.

In our simulation we show that under a wide variety of situations ExSample can outperform random by 2x to 80x, and it never does significantly worse. We also explore in detail situations where random is competitive with ExSample. Results are shown in Figure 3.

To run the simulation, we fixed the number of instances N to 2000, and the number of frames to 16 million. We placed these 2000 instances according to a normal distribution centered in these 16 million frames, varying the standard deviation to result in no skew (left column in the figure) and skew where 95% of the instances appear in the center 1/4, 1/32, and 1/256 of the frames (additional columns in the figure). To generate varied durations for each of these instances, we use a LogNormal distribution with a target mean of 700 frames. This creates a set of durations where the shortest one is around 50 frames and the longest is around 5000 frames. To change the average duration we simply multiply or divide all durations by powers of 7, to get average durations of 700/49 = 14, 700/7 = 100, 700, and 700 × 7 = 4900 frames. These correspond to the rows in Figure 3.

Figure 3: Simulated savings in frames from ExSample range from 1x to 84x, depending on instance skew (increasing from left to right, starting with no skew) and on the average duration of an instance in frames (increasing from top to bottom). Solid lines show the median frames sampled by ExSample and random. Shaded areas mark the 25-75 percentile range. We label with text the savings in samples needed to reach 10, 100 and 1000 results. The dashed blue line shows the expected results if samples were allocated optimally based on perfect prior knowledge of chunk statistics. The dashed red line shows the expected results from random. For a complete explanation see Section 4.2.

Once we have fixed these parameters, the different algorithms proceed by sampling frames, and we track the instances that have been found (y axis of subplots) as we increase the number of samples we have processed (x axis). For ExSample, we divide the frames into 128 chunks. We run each experiment 21 times. In Figure 3 we show the median trajectory (solid line) as well as a confidence band computed from the 25th to the 75th percentile (shaded). Additionally, the dashed lines show the expectation estimates computed using the formula in Equation 14. ExSample and random start off similarly, which is expected because ExSample starts by sampling uniformly at random, but as samples accumulate, ExSample's trajectory moves more and more toward the dashed blue line, which represents the expected number of instances if the allocation of samples across chunks had been optimal from the beginning. ExSample outperforms random but performs similarly in two cases: 1) when there is very little skew in the locations of results (leftmost column of the figure) and 2) when results are very rare, which makes getting to the first result equally hard for both (top row of the figure).

4.3 Chunks

The way we split the dataset into chunks is an external parameter that affects the performance of ExSample. We consider two extremes. Suppose we have a single chunk. This makes ExSample equivalent to random sampling. The fewer chunks there are, the less able ExSample is to exploit any skew. Using two chunks imposes a hard ceiling on savings of 2x, even if the underlying skew is much larger, in this case 32. At the opposite extreme, one chunk per frame would also be equivalent to random sampling: we would only sample from each chunk once before running out of frames, and we would not be able to use that information to inform the following samples.

In Figure 4 we show how ExSample behaves for a range of chunk scenarios spanning several orders of magnitude. In this simulation, we fix the instance skew (32) and p_i (with a mean duration of 700 frames), corresponding to the third column and third row of Figure 3. We vary the number of chunks from 1 to 1024. Dashed lines show the optimal allocation from Equation 15; these lines improve with more chunks because we have exact knowledge of the frame distribution and can exploit skew at smaller time scales. Note that a linear increase in chunks does not result in a linear speedup, because the underlying instance skew is limited.

Solid lines in Figure 4 show the median performance of ExSample under different chunk settings. Here, increasing the number of chunks can decrease performance: for example, going from 128 to 1024 chunks shows a decrease, even though both have similar optimal allocation (dashed) curves. The reason for this drop is that as we increase the number of chunks, we also increase the number of samples ExSample needs before it can tell which chunks are more promising: when working with 1024 chunks, we need at least 1024 samples for each chunk to be sampled once.


[Figure 4 appears here: two subplots over frames sampled (out of 16 million). (a) Instances found vs. samples for 2, 16, 128, and 1024 chunks and for random. (b) The vertical gap between the dashed (optimal allocation) and solid (simulation) lines for the same settings.]

Figure 4: Varying the number of chunks for a fixed workload. Different colors correspond to the different numbers of chunks in the legend. Dashed lines correspond to optimal sample allocation. Solid lines correspond to simulation results.

During that time, ExSample performs like random.

The bottom subplot in Figure 4 presents the vertical gap between the dashed line and the solid line in the top subplot. We can see that the gap indeed peaks around the time the number of samples n reaches the number of chunks M, and decreases afterwards.

Thus, more chunks M can help improve how much of the instance skew any weighted sampling can exploit, but they also imply a larger exploration "tax", proportional to M. For ExSample, this trade-off shows up as a longer time to converge to the optimal allocation trajectory with larger M. We note, however, that this sensitivity to the number of chunks is not large: in our simulation we vary the number of chunks by 3 orders of magnitude and we still see a benefit of chunking versus random across all settings.

5. EVALUATION

Our goals for this evaluation are to demonstrate the benefits of ExSample on real data, comparing it to alternatives including random sampling as well as existing work in video processing. We show that on these challenging datasets ExSample achieves savings of up to 4x with respect to random sampling, and orders of magnitude higher relative to approaches based on specialized surrogate models.

5.1 Experimental Setup

Baselines. We compare ExSample against two of the baselines discussed in Section 2.2: uniform random sampling and BlazeIt [17]. BlazeIt is the state-of-the-art representative of surrogate model methods for distinct object limit queries, and optimizes these queries by first performing a full scan of the dataset to compute surrogate scores on every video frame, and then processing frames from highest to lowest score through the object detector. As in our approach, BlazeIt uses an object tracker to determine whether each computed detection corresponds to a distinct object.

Implementation. Our implementations of random sampling, BlazeIt, and ExSample are at their core sampling loops where the choice of which frame to process next is based on an algorithm-specific score. Based on this score, the system will fetch a batch of frames from the video and run that batch through an object detector. We implement this sampling loop in Python, using PyTorch to run inference on a GPU. For the object detector, we use Faster-RCNN with a ResNet-50 backbone. To achieve fast random access frame decoding rates, we use the Hwang library from the Scanner project [30], and re-encode our video data to insert keyframes every 20 frames.
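The following is a simplified sketch of this sampling loop, not our exact implementation: decode_frames() is a stand-in for the Hwang-based random access decoder, and choose_next_batch() is a stand-in for the algorithm-specific scoring logic (random, BlazeIt, or ExSample).

import torch
import torchvision

# Faster-RCNN with a ResNet-50 backbone from torchvision, run on the GPU.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval().cuda()

def sampling_loop(video, choose_next_batch, decode_frames, n_iters, batch_size=8):
    results = []
    for _ in range(n_iters):
        frame_ids = choose_next_batch(batch_size)      # algorithm-specific choice
        frames = decode_frames(video, frame_ids)       # list of HxWx3 uint8 arrays
        batch = [torch.as_tensor(f).permute(2, 0, 1).float().div(255).cuda()
                 for f in frames]
        with torch.no_grad():
            detections = detector(batch)               # per-frame boxes, labels, scores
        results.append((frame_ids, detections))
    return results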

We implemented the subset of BlazeIt that optimizes distinct object limit queries, based on the description in the paper [17] as well as their published code. We opted for our own implementation to make sure the I/O and decoding components of both ExSample and BlazeIt were equally optimized, and also because extending BlazeIt to handle our own datasets and metrics was infeasible. For BlazeIt's cheap surrogate model, we use an ImageNet pre-trained ResNet-18. This model is more expensive than the ones used in that paper, but also more accurate. Note that our timed results do not depend strongly on the inference cost of ResNet-18, since the model is still sufficiently lightweight that BlazeIt's full scan is dominated by video I/O and decoding time.
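The sketch below shows the shape of the surrogate scoring pass in this kind of reimplementation. It is a hedged illustration rather than our exact code: the single-output score head and the helper names score_all_frames and score_order are stand-ins, since the actual surrogate is trained per query on labels produced by the expensive detector.

import numpy as np
import torch
import torchvision

# ImageNet pre-trained ResNet-18 with a one-output score head (illustrative).
surrogate = torchvision.models.resnet18(pretrained=True)
surrogate.fc = torch.nn.Linear(surrogate.fc.in_features, 1)
surrogate.eval().cuda()

def score_all_frames(frame_batches):
    # Full scan: score every frame of the dataset with the lightweight model.
    scores = []
    with torch.no_grad():
        for batch in frame_batches:               # batches of preprocessed CxHxW tensors
            scores.append(surrogate(batch.cuda()).squeeze(1).cpu())
    return torch.cat(scores).numpy()

def score_order(scores):
    # Visit frames from highest to lowest surrogate score.
    return np.argsort(-scores)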

Datasets. We use 5 different datasets in this evaluation. Two datasets consist of vehicle-mounted dashboard camera video, while three are from static street cameras.

The dashcam dataset consists of 10 hours of video, or over 1.1 million video frames, collected from a dashcam over several drives in cities and on highways. Each drive can range from around 20 minutes to several hours. The BDD dataset used for this evaluation consists of a subset of the Berkeley Deep Drive dataset [35]. Our subset is made up of 1000 randomly chosen clips. Because the BDD dataset has a wider variety of locations (multiple cities) and is made up of 1000 40-second clips, the number of chunks is forced to be high for this dataset. So, even though both datasets come from dashcams, they are quite different from each other.

The other three datasets are taken from [17], and consist of video recorded by fixed cameras overlooking urban locations. They are named archie, amsterdam, and night-street, and each consists of 3 segments of 10 hours each, taken on different days. These datasets are used in the evaluations of several existing works on optimizing video queries over static camera datasets [12, 16].

Queries. Each dataset contains slightly different types of objects, so we have picked between 6 and 8 object classes per dataset based on the content of the dataset. In addition to searching for different object classes, we also vary the limit parameter, expressed as a recall of 10%, 50%, and 90%, where recall is the fraction of distinct instances found. These recall rates are meant to represent different kinds of applications: 10% represents a scenario where an autonomous vehicle data scientist is looking for a few test examples, whereas a higher recall like 90% would be more useful in an urban planning or mapping scenario where finding most objects is desired.


Ground Truth. None of the datasets have human-generated object instance labels that identify objects over time. Therefore we approximate ground truth by sequentially scanning every video in the dataset and running each frame through a reference object detector (we reuse the Faster-RCNN model that we apply during query execution). If any objects are detected, we match the bounding boxes with those from previous frames and resolve which correspond to the same instance. To match objects across neighboring frames, we employ an IoU matching approach similar to SORT [3]. IoU matching is a simple baseline for multi-object tracking that leverages the output of an object detector and matches detection boxes based on overlap across adjacent frames.
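The following is a minimal sketch of this IoU matching step, in the spirit of SORT but without its motion model; the 0.5 threshold, the greedy assignment, and the helper names iou and match_tracks are illustrative simplifications.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_tracks(prev_tracks, detections, thresh=0.5):
    # prev_tracks: {track_id: box from the previous frame};
    # detections: list of boxes in the current frame.
    # Greedily assign each detection to the unmatched track with the highest
    # IoU above the threshold; unmatched detections become new instances.
    assignments, used = {}, set()
    for i, det in enumerate(detections):
        best_id, best_iou = None, thresh
        for tid, box in prev_tracks.items():
            if tid in used:
                continue
            score = iou(det, box)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is not None:
            assignments[i] = best_id
            used.add(best_id)
    return assignments  # detections absent from the dict start new tracks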

5.2 Results

Here we evaluate the cost, in both time and frames, of processing distinct object queries using ExSample, random+, and BlazeIt. Because some classes are much more common than others, the absolute times and frame counts needed to process a query vary by many orders of magnitude across different object classes. It is easier to get a global view of cost savings by normalizing the query processing times of ExSample and BlazeIt against that of random. That is, if random takes 1 hour to find 20% of traffic lights, and ExSample takes 0.5 hours to do the same, the time savings are 2x. We also apply this normalization when comparing frames processed. In the first part of this section we cover results for the dashcam dataset (Figure 5) and for bdd (Figure 6). We cover the remaining datasets later.

5.2.1 Dashcam Dataset Results

Overall, ExSample saves more than 3x in frames processed vs. random, averaged across all classes and recall levels. Savings do vary by class, reaching above 5x for the bicycle class in the dashcam dataset. In terms of frames, BlazeIt performs very well at a recall of 10% on the dashcam dataset, reaching 10x the performance of random. The reason is that on the dashcam dataset BlazeIt's surrogate models learn to classify frames much better than random, as is evident in Figure 7. For high recall levels, however, this initial advantage diminishes because it is hard to avoid resampling the same instances in different frames, despite skipping a gap of 100 frames when a result is successfully found.

The story becomes more complex when comparing runtime. On the right panel of Figure 5 we show the total runtime costs, including overhead incurred prior to processing the query, which erases the early wins from better prediction seen on the left panel. The impact of this overhead shrinks as more samples are taken, so the difference is less dramatic at higher recall levels. For BlazeIt, the total time in the plot is a sum of three separate costs: training, scoring, and sampling. Training costs are mostly due to needing to label a fraction of the dataset using the expensive detector. Scoring costs come from needing to scan and process the full dataset with the surrogate model. Finally, sampling costs come from visiting frames in score order.

In Table 1 we quantify the costs of the three steps, where for BlazeIt we estimate sampling costs by assuming a well-trained surrogate model finds instances 100x faster than random. The table shows that BlazeIt's runtime is dominated first by scoring (i.e., the full scan to process every frame through the surrogate model), and second by labeling. While it may be possible to amortize the cost of labeling in BlazeIt across many queries, because the expensive model may provide general-purpose labels, the full scan that dominates the overhead cannot be amortized because the surrogate is query-specific. Because random does not need to scan the entire dataset before yielding results, it can afford to be much worse in precision while still providing a faster runtime, especially at low recalls. ExSample outperforms the other methods because it provides high sampling precision while avoiding the need for a full scan.
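The following back-of-the-envelope sketch makes this accounting explicit, using the per-stage throughputs listed in Table 1; the function names blazeit_cost and random_cost are illustrative helpers, not part of any system.

def blazeit_cost(total_frames, labeled_frames, sampled_frames,
                 label_fps=20, score_fps=100, sample_fps=20):
    label_time = labeled_frames / label_fps    # labeling with the expensive detector
    score_time = total_frames / score_fps      # full scan with the surrogate model
    sample_time = sampled_frames / sample_fps  # detector on score-ordered frames
    return label_time + score_time + sample_time

def random_cost(sampled_frames, sample_fps=20):
    return sampled_frames / sample_fps         # no labeling or scoring stages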

[Figure 5 appears here: savings in frames (left column) and time savings (right column) on the Dashcam dataset, one row per recall level (0.1, 0.5, 0.9), with one group of bars per class (bicycle, person, traffic light, fire hydrant, stop sign, bus, truck) for BlazeIt, random+, and this work.]

Figure 5: Results on the Dashcam dataset. Comparing frames (left column) and times (right column) for different recall levels (each row) on the dashcam dataset. Different methods (colors) are compared relative to the cost of using random to process the same query and reach the same level of instance recall.

stage     throughput (fps)    BlazeIt frames / time (s)    random frames / time (s)
label     20                  110K / 5.8Ks                 0
score     100                 1.1M / 12Ks                  0
sample    20                  120 / 5.8s                   1.2K / 58s

Table 1: Time breakdown: random vs. BlazeIt.

5.2.2 BDD Results

The maximum benefit from ExSample is lower in Figure 6, reaching only 3x. BDD is a challenging scenario for ExSample because, as mentioned earlier, it is composed of 1000 40-second clips. BDD is also challenging for BlazeIt, because surrogate models have a harder time learning with high accuracy on this dataset, which presents many points of view and small objects. Figure 7 shows that the surrogate models do not reach particularly high Average Precision. This is not an issue for ExSample because we do not rely on a cheaper approximation, which may not be as easy to obtain for highly variable datasets.

5.2.3 Additional Datasets

In addition to the Dashcam and BDD datasets, we evaluated ExSample on archie, amsterdam, and night-street from the BlazeIt paper [17], and measured how ExSample performs compared to random in terms of frame savings on all five datasets.


[Figure 6 appears here: savings in frames (left column) and time savings (right column) on the BDD dataset, one row per recall level (0.1, 0.5, 0.9), with one group of bars per class (person, bike, motor, rider, bus, truck, traffic light, traffic sign) for BlazeIt, random+, and this work.]

Figure 6: Results on the BDD dataset. The surrogate has a harder time learning on this dataset, with lower accuracy scores. Counterintuitively, this makes its savings comparable to those of random sampling, improving its results compared to the Dashcam dataset.

This evaluation includes around 35 classes in total across the datasets. The results are aggregated below by dataset in Table 2 (averaging across recall levels) and by recall level in Table 4 (averaging across datasets).

Table 2 shows that savings are both query and data dependent. This is reflected in Figure 8, where we show an object class for each dataset and how its distribution is skewed across chunks in the dataset. The dataset/class combinations with the largest skews (dashcam, bdd1k, and night-street) also show the largest gains, as expected based on Section 4.2. Amsterdam and archie have little skew overall on any class, hence ExSample cannot perform much better than random. On BDD, there are many short chunks due to the dataset structure, so the savings are lower than the skew would suggest, as explained in Section 4.3.

dataset         classes   min (savings)   median   max
dashcam         7         1.7             4.1      6.5
bdd1k           8         1.5             1.9      3.5
night-street    6         1.2             1.6      3.1
amsterdam       7         0.9             1.5      1.7
archie          6         1.2             1.4      1.5

Table 2: Aggregate savings of ExSample relative to random for the different datasets.

Rather than running a specific surrogate-based system like BlazeIt on these additional datasets, we compared ExSample to an idealized surrogate, which has 100% recall in identifying instances, so that its cost is just the cost of scoring all frames in the dataset. We show the fraction of instances that ExSample can identify in the time it takes this idealized surrogate to do this scoring, using surrogate timing data based on the experiments in Section 5.2 and the same five classes shown in Figure 8. The results are shown in Table 3.

[Figure 7 appears here: "Surrogate model AP vs random AP", showing the ratio of surrogate AP to the AP of a randomly assigned score, per class, for the bdd and dashcam datasets.]

Figure 7: Average Precision of surrogate models vs. a randomly assigned score. For BDD, the surrogate model does only slightly better than random in precision, which causes it to perform similarly to random when measured in savings in Figure 6. For the Dashcam dataset, the surrogate models are more accurate than random. Counterintuitively, higher accuracy in scores hurts relative performance in Figure 5 due to the greediness of the algorithm.

dataset          category   recall (this work)   fraction sampled
A-dashcam        bicycle    0.72                 0.07
B-bdd1k          motor      0.38                 0.22
C-night street   person     0.90                 0.02
D-archie         car        0.90                 0.02
E-amsterdam      boat       0.90                 0.02

Table 3: Recall level reached by ExSample in the same amount of time it would take to scan the dataset. The experiment stops at recall 0.9.

From these results, we can see that across a range of object classes and degrees of skew, ExSample can detect most instances in a dataset before a surrogate can score all frames.

Finally, Table 4 compares ExSample to random on all 5 datasets, aggregating by recall. Here ExSample performs slightly better at 50% recall, because at 10% recall, with fewer samples, ExSample has had less time to identify chunks with more instances, and at 90% recall, some of the instances are in less skewed chunks that ExSample samples less.

recall   min (savings)   median   max
10%      0.8             1.4      6.9
50%      0.9             1.7      6.9
90%      1.2             1.9      4.8

Table 4: Aggregate savings broken down by recall.

6. RELATED WORK

Video Data Management. There has recently been renewed interest in developing data management systems for video analytics, motivated in large part by the high cost of applying state-of-the-art video understanding methods, which almost always involve expensive deep neural networks, on large datasets. Recent systems adapt prior work in visual data management [5, 29] for modern tasks that involve applying machine learning techniques to extract insights from large video datasets potentially comprising millions of hours of video.


[Figure 8 appears here: one panel per dataset/class pair, each showing the number of instances per chunk: A-dashcam/bicycle (N=249, S=14, savings=7), B-bdd1k/motor (N=509, S=19, savings=2), C-night street/person (N=2078, S=4.5, savings=3), D-archie/car (N=33546, S=1.1, savings=1), E-amsterdam/boat (N=588, S=1.6, savings=0.9).]

Figure 8: Instance skew and savings for different classes and datasets of Table 2. Each vertical bar corresponds to a chunk's number of instances. The blue colored bars are the minimum set that covers 50% of instances. S is our skew metric defined in Section 4.2.

In particular, DeepLens [21] and VisualWorldDB [9] explore a wide range of opportunities that an integrated data management system for video data would enable. DeepLens [21] proposes an architecture split into storage, data flow, and query processing layers, and considers tradeoffs in each layer: for example, the system must choose from several storage formats, including storing each frame as a separate record, storing video in an encoded format, and utilizing a hybrid segmented file. VisualWorldDB proposes storing and exposing video data from several perspectives as a single multidimensional visual object, achieving not only better compression but also faster execution.

Speeding up Video Analytics. Several optimizations have been proposed to speed up execution of various components of video analytics systems to address the cost of mining video data. Many recent approaches train specialized surrogate classification models on reduced-dimension inputs derived from the video [18, 24, 20, 12, 2]. As we detailed in Section 2.2, most related to our work is BlazeIt [17], which adapts surrogate-based video query optimization techniques for object search limit queries.

Video analytics methods employing surrogate models are largely orthogonal to our work: we can apply ExSample to sample frames, while still leveraging a cascaded classifier so that expensive models are only applied on frames where fast surrogate models have sufficient confidence. The extensions to limit queries proposed in BlazeIt are not orthogonal, as they involve controlling the sampling method, but a fusion sampling approach that combines ExSample with BlazeIt may be possible. Nevertheless, a major limitation of BlazeIt is that it requires applying the surrogate model on every video frame even for limit queries; as we showed in Section 5, this presents a major bottleneck that can make BlazeIt slower on limit queries than random sampling.

Besides methods employing surrogate models, several approaches, such as Chameleon [16] as well as [14, 33, 34], consider tuning optimization knobs such as the neural network model architecture, input resolution, and sampling framerate to achieve the optimal speed-accuracy tradeoff. Like surrogate models, these approaches are orthogonal to ExSample and can be applied in conjunction with it.

Speeding up Object Detection. Accelerating video analytics by improving model inference speed has been extensively studied in the computer vision community. Because these methods optimize speed outside of the context of a specific query, they are orthogonal to our approach and can be straightforwardly incorporated. Several general-purpose techniques improve neural network inference speed by pruning unimportant connections [8, 22] or by introducing models that achieve high accuracy with fewer parameters [11, 15]. Coarse-to-fine object detection techniques first detect objects at a lower resolution, and only analyze particular regions of a frame at higher resolutions if there is a large expected accuracy gain [6, 27]. Cross-frame feature propagation techniques accelerate object tracking by applying an expensive detection model only on sparse key frames, and propagating its features to intermediate frames using a lightweight optical flow network [36, 37].

7. CONCLUSION

Over the next decade, workloads that process video to extract useful information may become a standard data mining task for analysts in areas such as government, real estate, and autonomous vehicles. Such pipelines present a new systems challenge due to the cost of applying state-of-the-art machine vision techniques over large video repositories.

In this paper we introduced ExSample, an approach for processing distinct object search queries on large video repositories through chunk-based adaptive sampling. Specifically, the aim of the approach is to find frames of video that contain objects of interest, without running an object detection algorithm on every frame, which is often prohibitively expensive. Instead, in ExSample, we adaptively sample frames with the goal of finding the most distinct objects in the shortest amount of time: ExSample iteratively processes batches of frames, tuning the sampling process based on whether new objects were found in the frames sampled on each iteration. To do this tuning, ExSample partitions the data into chunks and dynamically adjusts the frequency with which it samples from different chunks based on the rate at which new objects are sampled from each chunk. We formulate this sampling process as an instance of Thompson sampling, using a Good-Turing estimator to compute the per-chunk likelihood of finding a new object in a random frame. In this way, as objects in a particular chunk are exhausted, ExSample naturally refocuses its sampling on other, less frequently sampled chunks.
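To make the idea concrete, the following sketch shows one way such a per-chunk Thompson sampling step could be organized. It is an illustration only: the ChunkSampler class is hypothetical, and the Beta(N1+1, n-N1+1) parameterization of the Good-Turing statistic is one plausible instantiation, not the exact estimator derived in the paper's equations.

import random

class ChunkSampler:
    def __init__(self, n_chunks):
        self.n = [0] * n_chunks    # samples taken from each chunk so far
        self.n1 = [0] * n_chunks   # instances seen exactly once in each chunk

    def choose_chunk(self):
        # Thompson sampling: draw a score per chunk from a Beta distribution
        # whose mean tracks the Good-Turing statistic N1 / n (the estimated
        # probability that a new random frame yields a new instance), then
        # sample from the chunk with the highest draw.
        draws = [random.betavariate(self.n1[j] + 1, self.n[j] - self.n1[j] + 1)
                 for j in range(len(self.n))]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, chunk, seen_once_count):
        # seen_once_count: number of distinct instances observed exactly once
        # so far in this chunk, recomputed after matching new detections.
        self.n[chunk] += 1
        self.n1[chunk] = seen_once_count

As chunks get exhausted, their N1 counts fall, the Beta draws shrink, and sampling naturally shifts to less explored chunks, mirroring the behavior described above.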

Our evaluation of ExSample on a real-world dataset of dashcam video shows that it is able to substantially reduce both the number of sampled frames and the execution time needed to achieve a particular recall compared to both random sampling and methods based on lightweight surrogate models, such as BlazeIt [17], that are designed to estimate frames likely to contain objects of interest with lower overhead. Thus, we believe that ExSample represents a key step in building cost-efficient video data management systems.

8. REFERENCES

[1] AWS. Amazon EC2 on-demand pricing. https://aws.amazon.com/ec2/pricing/on-demand. Accessed: 2020-09-10.


[2] F. Bastani, S. He, A. Balasingam, K. Gopalakrishnan, M. Alizadeh, H. Balakrishnan, M. Cafarella, T. Kraska, and S. Madden. MIRIS: Fast object track queries in video. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1907–1921, 2020.

[3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. CoRR, abs/1602.00763, 2016.

[4] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[5] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, et al. Query by image and video content: The QBIC system. Computer, 28(9):23–32, 1995.

[6] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Dynamic zoom-in network for fast object detection in large images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6926–6935, 2018.

[7] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, December 1953.

[8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.

[9] B. Haynes, M. Daum, A. Mazumdar, M. Balazinska, A. Cheung, and L. Ceze. VisualWorldDB: A DBMS for the visual world. In CIDR, 2020.

[10] K. He, G. Gkioxari, P. Dollar, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.

[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

[12] K. Hsieh, G. Ananthanarayanan, P. Bodik, S. Venkataraman, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 269–286, 2018.

[13] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.

[14] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose. VideoEdge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.

[15] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

[16] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica. Chameleon: Scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 253–266, 2018.

[17] D. Kang, P. Bailis, and M. Zaharia. BlazeIt: Fast exploratory video queries using neural networks. CoRR, abs/1805.01046, 2018.

[18] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. Proc. VLDB Endow., 10(11):1586–1597, Aug. 2017.

[19] E. Kaufmann. On Bayesian index policies for sequential resource allocation. Ann. Stat., 46(2):842–865, Apr. 2018.

[20] N. Koudas, R. Li, and I. Xarchakos. Video monitoring queries. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1285–1296. IEEE, 2020.

[21] S. Krishnan, A. Dziedzic, and A. J. Elmore. DeepLens: Towards a visual data management system. In Conference on Innovative Data Systems Research (CIDR), 2019.

[22] G. Leclerc, R. C. Fernandez, and S. Madden. Learning network size while training with ShrinkNets. In Conference on Systems and Machine Learning, 2018.

[23] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.

[24] Y. Lu, A. Chowdhery, S. Kandula, and S. Chaudhuri. Accelerating machine learning inference with probabilistic predicates. ACM SIGMOD, June 2018.

[25] O. Moll, F. Bastani, S. Madden, M. Stonebraker, V. Gadepally, and T. Kraska. ExSample: Efficient searches on video repositories through adaptive sampling (technical report). https://github.com/orm011/tmp/blob/master/techreport.pdf, September 2020.

[26] T. Murray. Help improve imagery in your area with our new camera lending program. https://www.openstreetmap.us/2018/05/camera-lending-program/, 2018.

[27] M. Najibi, B. Singh, and L. S. Davis. AutoFocus: Efficient multi-scale inference. In Proceedings of the IEEE International Conference on Computer Vision, pages 9745–9755, 2019.

[28] Nexar. Nexar. https://www.getnexar.com/company, 2018.

[29] V. E. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. Computer, 28(9):40–48, 1995.

[30] A. Poms, W. Crichton, P. Hanrahan, and K. Fatahalian. Scanner: Efficient video analysis at scale. ACM Trans. Graph., 37(4):138:1–138:13, July 2018.

[31] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.

[32] D. Russo, B. V. Roy, A. Kazerouni, and I. Osband. A tutorial on Thompson sampling. CoRR, abs/1707.02038, 2017.

[33] R. Xu, J. Koo, R. Kumar, P. Bai, S. Mitra, G. Meghanath, and S. Bagchi. ApproxNet: Content and contention aware video analytics system for the edge. arXiv preprint arXiv:1909.02068, 2019.

[34] T. Xu, L. M. Botelho, and F. X. Lin. VStore: A data store for analytics on large videos. In Proceedings of the Fourteenth EuroSys Conference, pages 1–17, 2019.

[35] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018.

[36] X. Zhu, J. Dai, L. Yuan, and Y. Wei. Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7210–7218, 2018.

[37] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2349–2358, 2017.
