

Food Recognition Using Statistics of Pairwise Local Features

Shulin (Lynn) Yang¹   Mei Chen²   Dean Pomerleau²,³   Rahul Sukthankar²,³

[email protected] [email protected] [email protected] [email protected]

¹University of Washington   ²Intel Labs Pittsburgh   ³Robotics Institute, Carnegie Mellon

Abstract

Food recognition is difficult because food items are deformable objects that exhibit significant variations in appearance. We believe the key to recognizing food is to exploit the spatial relationships between different ingredients (such as meat and bread in a sandwich). We propose a new representation for food items that calculates pairwise statistics between local features computed over a soft pixel-level segmentation of the image into eight ingredient types. We accumulate these statistics in a multi-dimensional histogram, which is then used as a feature vector for a discriminative classifier. Our experiments show that the proposed representation is significantly more accurate at identifying food than existing methods.

1. Introduction

Automatic food recognition is emerging as an important research topic in object recognition because of the demand for better dietary assessment tools to combat obesity. The goals of such systems are to enable people to better understand the nutritional content of their dietary choices and to provide medical professionals with objective measures of their patients' food intake [18]. People are not very accurate when reporting the food that they consume. As an alternative to manual logging, we investigate methods for automatically recognizing foods based on their appearance. Unfortunately, standard object recognition approaches based on aggregating statistics of descriptive local features perform poorly on this task [5] because food items are deformable and exhibit significant intra-class variations in appearance; the latter is the case even for relatively standardized items from fast food menus.

Our research is motivated by the observation that a food item can largely be characterized by its ingredients and their relative spatial relationships. For instance, sandwiches are often composed of a layer of meat surrounded on either side by slices of bread, while salads consist of assorted greens whose spatial layout can vary widely.

[Figure 1 labels: "Bread&Tomato" and "Vegetable&Beef", each pair of regions connected by a pairwise feature.]

Figure 1. Exploiting spatial relationships between ingredients using pairwise feature statistics improves food recognition accuracy.

Our hypothesis is that although detectors for any given ingredient are likely to be unreliable, one can still derive sufficient information by aggregating pairwise statistics about ingredient types and their spatial arrangement to reliably identify food items, both at the coarse (e.g., sandwich vs. salad) and fine (e.g., Big Mac vs. Baconator) levels. Fig. 1 illustrates how we can exploit the spatial relationship between ingredients to identify this item as a Double Bacon Burger from Burger King.

Although this hypothesis has intuitive appeal, applying it to food recognition is challenging for several reasons. First, even a simple food item can contain numerous visually distinct ingredients, not all of which may be visible in a given image due to occlusion or variations in assembly. In particular, we cannot rely on the presence of distinctive low-level features nor on reliable edges between ingredients. Second, although perfect accuracy is not required at the ingredient level, the proposed method does require the pixel-level segmentation to generate distributions over ingredients from very small local patches. Finally, there can be significant intra-class variation in the observed sizes and spatial relationships between ingredients, requiring us to build representations that can cope with this appearance variability in a principled manner.



Figure 2. Framework of the proposed approach: (1) Each pixel in the food image is assigned a vector representing the probability with which the pixel belongs to each of nine food ingredient categories, using STF [16]. (2) The image pixels and their soft labels are used to extract statistics of pairwise local features, forming a multi-dimensional histogram. (3) This histogram is passed to a multi-class SVM to classify the given image.

Our proposed approach is illustrated in Fig. 2 and can be summarized as follows. First, we assign a soft label (a distribution over ingredients) to each pixel in the image using a semantic texton forest [16]. For instance, the distribution corresponding to a pixel in a red patch could have a high likelihood for "tomato" and low for "cheese", while another pixel might have high values for "bread" and "cheese" but not "tomato". Next, we construct a multi-dimensional histogram feature where each bin corresponds to a pair of ingredient labels and a discretized geometric relationship between two pixels: for example, the likelihood of observing a pair of "bread" and "tomato" labels that are 20–30 pixels apart at an angle of 40–50 degrees. Since we use soft labels, each observation from a pair of pixels contributes probability mass to a number of bins. Thus, the histogram aggregated from many samples of pixel pairs in the image serves as a representation of the spatial distribution of ingredients for that image. Finally, we treat this histogram as a feature vector in a discriminative classifier to recognize the food item. Our representation is not specific to food and could also be suitable for recognizing objects or scenes that are composed of visually distinctive "ingredients" arranged in predictable spatial configurations.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 details our proposed approach. Section 4 describes our experimental methodology. Section 5 presents food recognition results on the PFID [5] dataset. Finally, Section 6 presents our conclusions and proposes directions for future work in this area.

2. Related Work

There has been relatively little work on the problem of food recognition. Shroff et al. [17] proposed a wearable computing system to recognize food for calorie monitoring. Bolle et al. [2] developed an automatic produce identification system called VeggieVision to help with the produce checkout process.

Chen et al. [5] introduced the PFID dataset for food recognition, along with benchmarks using two baseline algorithms: color histograms and bags of SIFT [11] features. Russo et al. [14] monitored the production of fast food using video. Wu and Yang [20] analyzed eating videos to recognize food items and estimate caloric intake.

Object category recognition has been a popular research area in computer vision for many years (see [6] for a comprehensive survey). Existing methods include approaches based on local features such as the SIFT [11] descriptor, or on global features such as color histograms or GIST [12] features; images are typically represented as "bags" of these features without explicit spatial information. Other approaches, such as the "constellation model" [3] and its numerous extensions, model objects as a collection of parts with a predictable spatial arrangement.

Of particular interest to food recognition are methods that explicitly address deformable objects. Shape context [1] picks n pixels from the contours of a shape, obtains n − 1 vectors by connecting each pixel to the others, and uses these vectors as a description of the shape at that pixel. Felzenszwalb proposes techniques [7] to represent a deformable shape using triangulated polygons. Leordeanu et al. [10] proposed an approach that recognizes categories using pairwise interactions of simple features. A recent work [8] learns a mean shape of the object class based on a thin plate spline parametrization.

These approaches work well for many problems in object recognition, but two issues make them hard to apply to foods. First, these approaches all require detecting meaningful feature points, such as edges, contours, keypoints or landmarks, but precise features like these are typically not available in images of food. Second, shape similarity in food images is hard to exploit with these approaches, since the shape of real foods is often quite amorphous.

To overcome these challenges, our approach does not rely on the detection of features such as edges or keypoints. Instead, we use local, statistical features defined over randomly selected pairs of pixels. The statistical distribution of pairwise local features effectively captures important shape characteristics and spatial relationships between food ingredients, thereby facilitating more accurate recognition.

3. Pairwise local feature distribution (PFD)

In this section, we define our image representation, the statistics of pairwise local features, which we call the pairwise feature distribution (PFD). Using this representation, each image is represented by a multi-dimensional histogram.


3.1. Soft labeling of pixels

Before obtaining the statistics of local features, we classify all image pixels into several categories based on the appearance of the image patch around each pixel. The local features are later gathered separately according to these pixel categories. We use nine handpicked categories: eight common fast food ingredients (beef, chicken, pork, bread, vegetable, tomato/tomato sauce, cheese/butter, egg/other), plus a category for the background.

Rather than labeling every pixel as a member of a single category, we use soft labeling to create a vector of probabilities corresponding to the likelihood that the given pixel belongs to each of the nine categories. Soft labeling defers commitment about the identity of a pixel, allowing for uncertainty about its true label and for the fuzzy distinction between ingredient categories.

Many methods could be used for pixel classification. We chose to employ the Semantic Texton Forest (STF) [16]. STF is a method for image categorization and segmentation that generates soft labels for a pixel based on local, low-level characteristics (such as the colors of nearby pixels). This is achieved using ensembles of decision trees that are each trained on subsets of manually-segmented images. A pixel is classified by descending each tree from root to leaf based on low-level features computed around the given pixel. The leaves contain probabilistic class distributions (over ingredients) that our algorithm uses to compute its soft labels. We chose STF because it incorporates human knowledge in the form of manually labeled training images, and because it has proven effective for soft labeling in other challenging domains such as the PASCAL Visual Object Classes (VOC) task.

With STF, each pixel is assigned a vector of K probabilities, where K is the number of pixel categories; K = 9 in our case. Fig. 3 shows a visualization of the soft labeling of an image.
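To make the labeling step concrete, the following is a minimal sketch of soft labeling with an ensemble of decision trees. It is illustrative only and not the STF implementation of [16]: the Node structure, the patch feature tests, and the training procedure are hypothetical stand-ins.

```python
import numpy as np

# Illustrative sketch of forest-based soft labeling (not the actual STF
# of [16]; node tests and training are hypothetical stand-ins).

class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, distribution=None):
        self.feature = feature            # index into the patch feature vector
        self.threshold = threshold
        self.left, self.right = left, right
        self.distribution = distribution  # leaf only: probabilities over 9 classes

def descend(node, patch_features):
    """Route a patch from root to leaf; return the leaf's class distribution."""
    while node.distribution is None:
        node = node.left if patch_features[node.feature] < node.threshold \
               else node.right
    return node.distribution

def soft_label(forest, patch_features):
    """Average the leaf distributions of all trees to get a soft label."""
    dists = [descend(root, patch_features) for root in forest]
    p = np.mean(dists, axis=0)
    return p / p.sum()   # vector of 9 probabilities, one per category
```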

3.2. Global Ingredient Representation (GIR)

The simplest way to use the soft pixel labels produced by STF to represent a food image is to create a single one-dimensional histogram with 8 bins representing how frequently each of the 8 food ingredient categories appears in the image. To create an overall histogram representing the distribution of ingredients for the image, we sum the soft labels for the 8 food ingredients over all non-background pixels, then normalize the histogram by the number of non-background pixels in the image. This global ingredient representation (GIR) is intuitive and easy to compute, but it does not capture the spatial relationships between ingredients that are important for distinguishing one food from another. It serves as a useful benchmark for comparison with the more sophisticated pairwise local features described below.

[Figure 3 panels: (a) the original food image; (b) nine probability maps labeled Beef, Chicken, Pork, Bread, Vegetable, Tomato, Cheese, Egg, and Background.]

Figure 3. Visualization of soft labeling: (a) original food image. In (b), each of the nine images represents the probability that a given pixel belongs to that category. The brighter the pixel, the larger its probability.

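A minimal sketch of the GIR computation, assuming the soft labels are stored as an (H, W, 9) NumPy array with the background as channel 8; deciding background membership by the argmax of the soft label is our own assumption, since the text does not specify the rule.

```python
import numpy as np

# Sketch of the global ingredient representation (GIR). `soft_labels`
# is an (H, W, 9) array of per-pixel STF probabilities; channel 8 is
# assumed to hold the background category.

def global_ingredient_representation(soft_labels):
    flat = soft_labels.reshape(-1, 9)
    # Assumption: a pixel is background if that is its most likely category.
    foreground = flat.argmax(axis=1) != 8
    # Sum the 8 ingredient probabilities over non-background pixels,
    # then normalize by the number of non-background pixels.
    hist = flat[foreground, :8].sum(axis=0)
    return hist / max(foreground.sum(), 1)   # 8-bin histogram
```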

3.3. Pairwise Features

We attempt to capture the spatial relationships between pixels of different food ingredients using pairwise local features. Fig. 4 describes the four pairwise features we have explored.

Pairwise distance reflects the distance between two pixels in an image. Similar to the feature descriptors in the Shape Context method [1], the pairwise distance feature is binned more finely when the two pixels in the pair are close to each other. We employ Sturges' formula [19], k = ⌈log₂ n + 1⌉, to choose the number of bins, and use the log of the distance between the pair of pixels P1 and P2:

    D(P1, P2) = ⌈log(‖P1 − P2‖ + 1)⌉.   (1)

Pairwise orientation, O(P1, P2), is defined as the angle of the line connecting the pixel pair. It ranges over [0°, 360°), with positive angles measured counterclockwise.


[Figure 4(a) table:

Pairwise feature | Feature description | Expression | Figure
Distance | The distance between two pixels P1 and P2 | D(P1, P2) | (b)
Orientation | The orientation of the line between two pixels P1 and P2 | O(P1, P2) | (c)
Midpoint category | The type of the pixel in the middle of two pixels P1 and P2 | M(P1, P2) | (d)
Between-pair category | The type of all pixels between two pixels P1 and P2 | B(P1, P2) | (e)

Panels (b)-(e) illustrate each feature on an example pair (P1: a bread pixel, P2: a cheese pixel), annotating the midpoint and between-pair pixels with soft-label distributions over ingredients such as beef, bread, and pork.]

Figure 4. Four pairwise local features. In the table in (a), column 1 is the name of the feature; column 2 is a description of the feature; column 3 gives the notation used in the text; column 4 refers to the illustration panel. Panels (b), (c), (d) and (e) illustrate these pairwise local features.

Pairwise midpoint, M(P1, P2), is a discrete feature defined as the soft label of the pixel at the midpoint of the line connecting the two pixels. When this feature is added to the multi-dimensional histogram, the midpoint pixel's soft label (a distribution over ingredient types) determines the weight of its contribution to the relevant bins.

Like the pairwise midpoint feature, the between-pair feature captures information about the pixel labels lying on the line connecting the two pixels in the pair. But rather than using only the soft label of the midpoint, the between-pair feature incorporates the soft labels of all pixels along the line connecting the pair. The feature for each pixel pair therefore has t discrete values, where t is the number of pixels along the line between the pair. We use B(P1, P2) = {Ti | i = 1, 2, ..., t} to represent the feature set for pixels P1 and P2.
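The four pairwise features can be sketched as follows. The bin widths, the ceiling and log base in Eq. (1), and the rasterization of the connecting line are illustrative choices rather than the authors' exact implementation.

```python
import numpy as np

# Illustrative sketch of the four pairwise local features for one pixel
# pair; `soft_labels` is an (H, W, 9) array, pixels are (row, col) tuples.

def pairwise_distance(p1, p2):
    # Eq. (1): log-scaled distance (natural log used here as an assumption).
    return int(np.ceil(np.log(np.linalg.norm(np.subtract(p1, p2)) + 1)))

def pairwise_orientation(p1, p2, n_bins=12):
    # Angle of the line joining the pair, in [0, 360), counterclockwise.
    dy, dx = p2[0] - p1[0], p2[1] - p1[1]
    angle = np.degrees(np.arctan2(dy, dx)) % 360
    return int(angle // (360 / n_bins))

def pairwise_midpoint(p1, p2, soft_labels):
    # Soft label (distribution over categories) of the midpoint pixel.
    mid = ((p1[0] + p2[0]) // 2, (p1[1] + p2[1]) // 2)
    return soft_labels[mid]

def between_pair(p1, p2, soft_labels):
    # Soft labels of all pixels on the segment joining the pair.
    t = max(int(np.linalg.norm(np.subtract(p1, p2))), 2)
    rows = np.linspace(p1[0], p2[0], t).astype(int)
    cols = np.linspace(p1[1], p2[1], t).astype(int)
    return soft_labels[rows, cols]          # shape: (t, 9)
```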

In addition to these individual features, we also explore joint features that are composites of the above features, such as a joint feature of distance and orientation, DO(P1, P2), and a joint feature of orientation and midpoint, OM(P1, P2), as shown in Fig. 5.

[Figure 5 table:

Joint pairwise feature | Feature description | Expression
Distance and orientation | The conjunction of the distance and orientation features between pixels P1 and P2 | DO(P1, P2)
Orientation and midpoint category | The conjunction of the orientation and midpoint category features between pixels P1 and P2 | OM(P1, P2)]

Figure 5. Joint pairwise features


3.4. Histogram representation for pairwise feature distribution

Computing the complete set of pairwise features in an image can be computationally expensive since, for an image with M pixels, we would need to consider C(M, 2) = M(M − 1)/2 pairs. It suffices to consider C(M, 2) rather than M² pairs because our pairwise relationships are symmetric; we convert directed orientation to a symmetric form by averaging counts in both directions. Nonetheless, C(M, 2) can still be prohibitively large for high-resolution images.

Thus, for computational efficiency, we randomly sample N (N = 1000) pixels from the non-background portion of the image and estimate pairwise statistics using these C(N, 2) pixel pairs. We use a set P to represent the N pixels randomly picked from an image: P = {P1, P2, ..., PN}. The soft labels of the N pixels are represented as S = {Sik | i = 1, 2, ..., N; k = 1, 2, ..., 8}. We calculate the pairwise local features (D(Pi, Pj), O(Pi, Pj), M(Pi, Pj) and B(Pi, Pj)) for all pixel pairs, and accumulate their values into a distribution. The distribution is weighted by the soft labels of the two pixels: specifically, the weight is the product of the probabilities that each pixel is assigned the ingredient label corresponding to the given bin.

We use a multi-dimensional histogram to represent the distribution of pairwise local features, as detailed in Fig. 6. The first two dimensions of the histogram are the ingredient labels of the two pixels; the remaining dimensions are the pairwise features calculated for the given pair of pixels.

The first and second dimensions of these multi-dimensional histograms have 8 bins each, representing the eight pixel categories (excluding the background). The other dimensions have either 12 bins, for pairwise distance and orientation, or 8 bins, for the midpoint and between-pair categories.
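As an illustration, the following sketch accumulates the joint OM histogram (Fig. 5) from sampled pixel pairs, reusing the helper functions sketched in Section 3.3; iterating over ordered pairs realizes the symmetrization by averaging both directions, and the final normalization is our own convention.

```python
import numpy as np

# Sketch of accumulating the OM (orientation x midpoint) histogram.
# `pixels` is a list of N (row, col) tuples sampled from non-background
# regions; `soft_labels` is an (H, W, 9) array. Reuses
# pairwise_orientation() and pairwise_midpoint() from the earlier sketch.

def pfd_histogram_om(pixels, soft_labels, n_orient=12):
    hist = np.zeros((8, 8, n_orient, 8))
    for i, p1 in enumerate(pixels):
        for j, p2 in enumerate(pixels):
            if i == j:
                continue   # ordered pairs: both directions are counted
            o = pairwise_orientation(p1, p2, n_orient)
            s1 = soft_labels[p1][:8]   # ingredient probabilities, pixel 1
            s2 = soft_labels[p2][:8]
            m = pairwise_midpoint(p1, p2, soft_labels)[:8]
            # Each pair spreads probability mass across bins: the weight
            # of bin (k1, k2, o, km) is the product s1[k1] * s2[k2] * m[km].
            hist[:, :, o, :] += np.einsum('a,b,c->abc', s1, s2, m)
    return hist / hist.sum()
```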

3.5. Histogram normalization

We normalize the distribution of certain pairwise features to achieve invariance to transformations such as rotation and scaling.


[Figure 6 table:

Histogram type | Histogram description | Size
Distance | label × label × distance | 8 × 8 × 12
Orientation | label × label × orientation | 8 × 8 × 12
Midpoint category | label × label × midpoint | 8 × 8 × 8
Between-pair category | label × label × between-pair | 8 × 8 × 8
Distance and orientation | label × label × distance × orientation | 8 × 8 × 12 × 12
Orientation and midpoint category | label × label × orientation × midpoint | 8 × 8 × 12 × 8]

Figure 6. Histogram representation of PFD

We compute the mode of the pairwise distance or orientation dimension, and shift the corresponding dimension of the multi-dimensional histogram so that it is centered on the mode. Since we apply a logarithm to the distance feature, changing the scale of an image is equivalent to shifting its histogram in the distance dimension, so mode-centering that dimension yields scale invariance. Similarly, centering the orientation histogram at its mode yields rotation invariance. Since we only measure relative distances between pixels, our representation is also translation invariant.

For joint features, we normalize the joint histogram by centering it at the mode value of the marginal distribution of each of its continuous dimensions.
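A sketch of mode-centering along one histogram dimension follows; the circular shift is exact for the orientation dimension, and applying the same roll to the log-distance dimension is a simplification of ours.

```python
import numpy as np

# Sketch of mode-centering one dimension of the PFD histogram (e.g.,
# orientation, for rotation invariance). Axis conventions are ours.

def center_at_mode(hist, axis):
    # Marginalize over all other axes to find the mode along `axis`.
    other = tuple(k for k in range(hist.ndim) if k != axis)
    marginal = hist.sum(axis=other)
    shift = hist.shape[axis] // 2 - int(marginal.argmax())
    # Circularly shift so the mode sits at the center bin.
    return np.roll(hist, shift, axis=axis)
```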

3.6. Classification with local feature distributions

Using PFD or the global ingredient representation, each image is represented as a multi-dimensional histogram. We employ a Support Vector Machine (SVM) for classification using a χ² kernel [15].

The symmetrized approximation of the χ² distance is defined as

    dχ²(x, y) = Σᵢ (xᵢ − yᵢ)² / (xᵢ + yᵢ),   (2)

where x and y represent the PFD histograms of two images, and xᵢ and yᵢ are corresponding bins of the two histograms.

The χ² kernel is then defined as

    kχ²(x, y) = exp(−dχ²(x, y)).   (3)

With this pre-computed kernel matrix, we apply the SVM to classify images into multiple food categories. In our implementation, we use the libSVM package [4].
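A sketch of Eqs. (2)-(3) with a precomputed kernel; the paper uses libSVM [4], and scikit-learn's SVC with kernel='precomputed' serves as our stand-in for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of classification with a precomputed chi-squared kernel,
# Eqs. (2)-(3). The epsilon guard against empty bins is ours.

def chi2_kernel(X, Y, eps=1e-10):
    # d(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i);  k(x, y) = exp(-d)
    d = ((X[:, None, :] - Y[None, :, :]) ** 2 /
         (X[:, None, :] + Y[None, :, :] + eps)).sum(axis=2)
    return np.exp(-d)

# Usage sketch: rows of X_train / X_test are flattened PFD histograms.
# clf = SVC(kernel='precomputed')
# clf.fit(chi2_kernel(X_train, X_train), y_train)
# predictions = clf.predict(chi2_kernel(X_test, X_train))
```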

4. Experimental Methodology

In this section, we briefly review our dataset and baseline approaches, and detail the implementation of the pixel-level segmentation with which we generate our ingredient soft labels. Experimental results are given in Section 5.

4.1. Dataset

We evaluate our work on the recently-released Pittsburgh Food Image Dataset (PFID) [5] and compare our proposed approach against its two baseline methods. The PFID dataset is a collection of fast food images and videos from 13 chain restaurants acquired under laboratory and realistic settings. Our experiments focus on the set of 61 categories of specific food items (e.g., McDonald's Big Mac) with masked backgrounds. Each food category contains three different instances of the food (bought on different days from different branches of the restaurant chain), with six images from six viewpoints (60 degrees apart) of each food instance.

We follow the experimental protocol proposed by Chen et al. [5] and perform 3-fold cross-validation, using the 12 images from two instances for training and the 6 images from the third for testing. We repeat this procedure three times, with a different instance serving as the test set each time, and average the results. The protocol ensures that no image of any given food item ever appears in both the training and test sets, and guarantees that food items were acquired from different restaurants on different days.
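As an illustration, the fold construction can be sketched as follows, assuming a hypothetical images[category][instance] layout (a dict mapping each of the 61 categories to three instances of six images each):

```python
# Sketch of the PFID cross-validation protocol: hold out one instance
# per category for testing and train on the remaining two. The
# images[category][instance] data layout is hypothetical.

def pfid_folds(images, n_instances=3):
    for held_out in range(n_instances):
        train, test = [], []
        for category, instances in images.items():
            for idx, imgs in enumerate(instances):
                target = test if idx == held_out else train
                target.extend((img, category) for img in imgs)
        yield train, test   # 12 training / 6 test images per category
```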

4.2. Baseline approaches

We use the two standard baseline algorithms specified by PFID: a color histogram with an SVM, and a bag of SIFT [11] features with an SVM, as briefly described below.

Due in part to its simplicity, the color histogram method has been popular in object recognition for more than a decade. We employ a standard 3-dimensional RGB histogram with four quantization levels per color band. Each pixel in the image is mapped to its closest cell in the histogram to generate a 4³ = 64 dimensional representation for each image. Then we use a multi-class support vector machine (SVM) implementation [4] for classification.
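A sketch of this baseline descriptor, assuming an (H, W, 3) uint8 image:

```python
import numpy as np

# Sketch of the RGB color histogram baseline: 4 quantization levels per
# band give a 4^3 = 64-bin descriptor.

def color_histogram(image, levels=4):
    q = (image.astype(int) * levels) // 256           # quantize each band to 0..3
    cells = q[..., 0] * levels**2 + q[..., 1] * levels + q[..., 2]
    hist = np.bincount(cells.ravel(), minlength=levels**3)
    return hist / hist.sum()                          # 64-dim descriptor
```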

Several studies (e.g., [9]) have demonstrated the merits of using a bag of SIFT features. The basic idea of this approach is to represent each image as a histogram of occurrence frequencies defined over a discrete vocabulary of features (pre-generated using k-means clustering) and then to use the histogram as a multi-dimensional vector in an SVM.

To compare against these baseline approaches, we conduct experiments using the six pairwise local features and the GIR feature proposed in Section 3. We describe the experimental details below.



Figure 7. (a) shows the nine colors used to label training images for STF; each color represents one of the food ingredient types or the background. (b) shows two samples of manually labeled images for STF: the nine labels in (a) are used to color all pixels of the 16 STF training images. (Best viewed in color.)

4.3. Pre-processing with STF

The initial step in generating the six pairwise local features is to obtain pixel-wise soft labels for the images. We use Semantic Texton Forests (STF) [16] to generate these as follows. First, we train the STF using 16 manually-segmented food images, two examples of which are shown in Fig. 7(b). We found 16 training images to be sufficient to cover the appearance of the food ingredients in the PFID dataset. While more data could potentially generate better STF ingredient labels, using a small training set shows that our algorithm can operate with a small amount of relatively noisy training data.

After training, we employ the STF to soft-label all 61 × 6 × 3 food images in the dataset, creating a 9-element probability vector for each pixel, representing the likelihood of it belonging to each of the eight food ingredient types or the background.

The output of STF is far from a precise parsing of an image. Fortunately, our method requires neither a perfect segmentation of food ingredients nor precise soft labeling results.

5. Results

This section summarizes our results, both at the individual food item level (61 categories) and at the major food type level (7 broad categories).

Figure 8. Classification accuracy for the 61 categories (chance: 1.6%). Color histogram: 11.3%; bag of SIFT: 9.2%; GIR: 18.9%; D: 19.2%; O: 20.8%; M: 22.6%; B: 21.3%; DO: 21.2%; OM: 28.2%.


5.1. Classification accuracy on the 61 categories

Fig. 8 summarizes the classification accuracy results on the 61 categories for the two baselines (color histogram, bag of SIFT), the global ingredient representation (GIR), and the six pairwise local features.

The chance recognition rate for each of the 61 categories is below 2% (1/61). The two baselines (color histogram and bag of SIFT features) reach only about 10%. The GIR global ingredient histogram achieves 19%, and the pairwise local feature methods range from 19% to 28%. The latter figure, achieved with the OM feature, is more than twice the accuracy of the PFID baseline algorithms [5] and nearly 20× chance.

Fig. 9 shows the confusion matrices for four approaches: color histogram, bag of SIFT features, GIR, and the pairwise feature OM. The darker the diagonal, the more effective the approach, because it has a higher probability of classifying food of a given category as itself. There is a clearly more salient diagonal in the matrix for the OM feature than in the matrices for the other three approaches, indicating that a greater number of food items are correctly classified using the joint feature OM. In both of the baseline approaches, we can see straight vertical lines in the confusion matrices, showing that certain food categories are frequently (and incorrectly) chosen as the classification result; GIR and OM do not exhibit this problem.

There is a straightforward explanation for why the combination of orientation and midpoint is the higher-order feature that gives the best accuracy. This pair of features is able to leverage the vertically-layered structure of many fast foods. This vertical layering is particularly characteristic of burgers and sandwiches, where meat and vegetables are


Figure 9. Confusion matrix comparison for the 61 categories (panels: color histogram, bag of SIFT, GIR, and OM). Rows represent the 61 categories of food, and columns represent the ground truth categories.

situated between two (or more) horizontally oriented slices of bread. The orientation feature relates only two endpoints and therefore by itself cannot capture three-way relationships like bread-meat-bread or bread-vegetable-bread that are characteristic of burgers and sandwiches. The midpoint feature captures three-way relationships, but without the orientation constraint, the statistics of the midpoint feature are dominated by uniform triples like bread-bread-bread and meat-meat-meat created from horizontal samples, which form the majority. By combining the orientation and midpoint features, the algorithm is able to discover and leverage the vertical triplets that are key to accurately discriminating burgers and sandwiches from other food classes (e.g., pizza), as well as distinguishing between individual burgers and sandwiches (e.g., a Big Mac vs. a Quarter Pounder).

5.2. Classification accuracy into 7 major food types

The relatively low accuracy of even the best method described above is due in part to the fact that many foods with similar appearances and similar ingredients are assigned to different categories in the PFID food database, as shown in Fig. 10. Such cases are challenging even for humans to distinguish.

To match more closely the way people categorize foods, we organize the 61 PFID food categories into seven major groups: sandwiches, salads/sides, chicken, breads/pastries, donuts, bagels, and tacos. Then, we use the baseline methods and our approach to classify images in the dataset into these categories. Fig. 11 shows the classification results.

Figure 10. Similar food items in PFID: these two foods are semantically and visually similar, but are distinct food items and thus appear in different PFID categories. Such subtle distinctions make fast food recognition inherently challenging for vision algorithms. (left) Arby's Roast Turkey and Swiss; (right) Arby's Roast Turkey Ranch Sandwich.

Figure 11. Classification accuracy for the 7 major types (chance: 14.3%). Color histogram: 49.7%; bag of SIFT: 55.3%; GIR: 69.0%; D: 69.7%; O: 71.0%; M: 74.3%; B: 73.8%; DO: 72.2%; OM: 78.0%.

The rankings between algorithms observed in the previous experiment continue to hold in this broader categorization task, with OM achieving nearly 80% accuracy. More interestingly, the category-level confusion matrices (Fig. 12) provide some insight into the classification errors. We see that sandwiches are often confused with foods whose major ingredients include bread, such as donuts and bagels.

6. Conclusion

Food recognition is a new but growing area of exploration. Although many of the standard techniques developed for object recognition are ill-suited to this problem, we argue that exploiting the spatial characteristics of food, in combination with statistical methods for pixel-level image labeling, will enable us to develop practical systems for food recognition.

Our experiments on a recently released public dataset of fast food images demonstrate that our proposed method significantly outperforms the baseline bag-of-features models


[Figure 12 panels: color histogram, bag of SIFT, GIR (STF), and OM, each a 7 × 7 confusion matrix over Sandwich, Salad&sides, Bagel, Donut, Bread&pastry, Chicken, and Taco, annotated with per-category accuracies.]

Figure 12. Confusion matrix comparison for the 7 major categories. Rows represent the 7 major food categories, and columns represent the ground truth major categories.

based on SIFT or color histograms, particularly when we augment pixel-level features with shape statistics computed on pairwise features.

In future work, we plan to extend our method to: (1) perform food detection and spatial localization in addition to whole-image recognition, (2) handle cluttered images containing several foods and non-food items, (3) develop practical food recognition applications, and (4) explore how the proposed method generalizes to other recognition domains. Preliminary experiments indicate that pairwise statistics of STF features in conjunction with Gist [13] perform significantly better than Gist+STF+SVM alone on scene recognition tasks.

Acknowledgments

We would like to thank Lei Yang for providing the SIFT+SVM baseline code, and the PFID project for help with the food recognition dataset.

References

[1] S. Belongie, J. Malik, and J. Puzicha. Shape context: a new descriptor for shape matching and object recognition. In NIPS, 2000.

[2] R. Bolle, J. Connell, N. Haas, R. Mohan, and G. Taubin. VeggieVision: A produce recognition system. In WACV, 1996.

[3] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In ECCV, 1998.

[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.

[5] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang. PFID: Pittsburgh fast-food image dataset. In Proceedings of the International Conference on Image Processing, 2009.

[6] S. Dickinson. The evolution of object categorization and the challenge of image abstraction. In Object Categorization: Computer and Human Vision Perspectives. Cambridge University Press, 2009.

[7] P. F. Felzenszwalb. Representation and detection of deformable shapes. PAMI, 27(2), 2005.

[8] T. Jiang, F. Jurie, and C. Schmid. Learning shape prior models for object matching. In CVPR, 2009.

[9] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In Proc. Workshop on Statistical Learning in Computer Vision, 2004.

[10] M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond local appearance: Category recognition from pairwise interactions of simple features. In CVPR, 2007.

[11] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2), 2004.

[12] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3), 2001.

[13] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3), 2001.

[14] R. Russo, N. da Vitoria Lobo, and M. Shah. A computer vision system for monitoring production of fast food. In Proceedings of the Asian Conference on Computer Vision, 2002.

[15] B. Schiele and J. L. Crowley. Object recognition using multidimensional receptive field histograms. In ECCV, 1996.

[16] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, 2008.

[17] G. Shroff, A. Smailagic, and D. Siewiorek. Wearable context-aware food recognition for calorie monitoring. In Proceedings of the International Symposium on Wearable Computing, 2008.

[18] R. Spector. Science and pseudoscience in adult nutrition research and practice. Skeptical Inquirer, 33, 2009.

[19] H. A. Sturges. The choice of a class interval. Journal of the American Statistical Association, 1926.

[20] W. Wu and J. Yang. Fast food recognition from videos of eating for calorie estimation. In Proceedings of the International Conference on Multimedia and Expo, 2009.