
This is the version originally submitted to Pattern Recognition

View-based recognition of real-world textures

Matti Pietikäinen*, Tomi Nurmela, Topi Mäenpää, Markus Turtinen
Machine Vision Group, Infotech Oulu, P.O. Box 4500, FIN-90014 University of Oulu, Finland

* Corresponding author. Tel.: +358 8 553 2782; fax: +358 8 553 2612.

E-mail address: [email protected] (M. Pietikäinen).

Abstract

A new method for recognizing 3D textured surfaces is proposed. Textures are modeled with multiple histograms of micro-textons, instead of more macroscopic textons used in earlier studies. The micro-textons are extracted with the recently proposed multiresolution local binary pattern operator. Our approach has many advantages compared to the earlier approaches and provides the leading performance in the classification of Columbia-Utrecht database (CUReT) textures imaged under different viewpoints and illumination directions. It also provides very promising results in the classification of outdoor scene images. An approach for learning appearance models for view-based texture recognition using self-organization of feature distributions is also proposed. The method performs well in experiments. It can be used for quickly selecting model histograms and rejecting outliers, thus providing an efficient tool for vision system training even when the feature data has a large variability.

Keywords: 3D texture; Local binary pattern; Appearance-based; Classification; Self-organization

1. Introduction

The analysis of 3D textured surfaces has been a topic of increasing interest recently due to many potential applications, including classification of materials and objects from varying viewpoints, classification and segmentation of scene images for outdoor navigation, aerial image analysis, and retrieval of scene images from databases.

Due to the changes in viewpoint and illumination, the visual appearance of different surfaces can vary greatly, which makes their recognition very difficult. The simplest solution to the recognition of 3D textured objects is to apply 2D texture analysis methods "as they are". Recently, Castano et al. assessed the performance of two different classifiers using features extracted by Gabor filter banks. Real-world images relevant to autonomous navigation on cross-country terrain and to autonomous geology were used in experiments [1]. Some approaches specifically tailored towards 3D analysis try to deal with the changes of surface appearance due to the changes in illumination and viewing direction. Among the first papers of this kind were [2,3,4,5].

Malik et al. proposed a method based on learning representative texture elements (textons) by clustering, and then describing the texture by their distribution [6]. The image is first processed with a multichannel filter bank. Then, the filter responses are clustered into a small set of prototype response vectors, i.e. textons. The vocabulary of textons corresponds to the dominant features in the image: bars and edges at various orientations and phases. Each pixel in the texture gets the label of the best matching texton. The histogram of the labels computed over a region is used for texture description. Leung and Malik extended this approach to 3D surfaces by constructing the vocabulary of 3D textons [7] and demonstrating the applicability of the method in the classification of Columbia-Utrecht database (CUReT) textures [8] taken in different views and illuminations.

Recent findings from human psychophysics, neurophysiology and computer vision provide converging evidence for a framework in which objects and scenes are represented as collections of viewpoint-specific local features rather than two-dimensional templates or three-dimensional models [9].


Cula and Dana [10,11] and Varma and Zisserman [12] used histograms of 2D textons extracted from training samples in different viewing and illumination conditions as texture models, instead of determining 3D textons. As an alternative to the texton-based approach, the models used for texture description can also be built by quantizing the filter responses into bins and then normalizing the resultant histogram [13]. Based on the results of these studies, a robust view-based classification of 3D textured surfaces from a single image acquired under unknown viewpoint and illumination seems to be feasible. Using rotation-invariant features computed at three different scales (the MR8 filter bank), Varma and Zisserman were able to classify all 61 CUReT textures with an accuracy of 96% when 46 models for each texture class were used [12]. Their results represent the current state of the art in the recognition of these textures.

A problem with the proposed approaches is that the methods need many parameters to be set and are computationally complex, requiring learning of a representative texton library using e.g. K-means clustering, intensity normalization of gray-scale samples e.g. to have zero mean and unit standard deviation, feature extraction by a multiscale filter bank, normalization of filter responses, and vector quantization of the multidimensional feature data to find the textons.

In this paper, we propose an approach in which 3D textures are modeled with multiple histograms of micro-textons, instead of the more macroscopic textons used in earlier studies. The micro-textons are extracted with the recently introduced local binary pattern operator [14,15]. This provides us with several advantages and improvements over the state of the art.

The performance of our approach is first assessed with the same CUReT textures that were used by [10,11,12,13] in their recent studies. In addition, classification experiments with a new set of outdoor scene images are also presented.

How to find proper models of objects for a view-based vision system is a largely open research issue. In this paper, we also propose a very promising approach based on self-organization of texture feature distributions. Texture samples with a similar appearance are likely to cluster close to each other in a self-organizing map, while more dissimilar samples may form their own clusters.

2. Texture description by micro-texton histograms

Varying lighting conditions and viewing angles greatly affect the gray-scale properties of an image due to effects such as shading, shadowing or local occlusions. Therefore, it is important to use features that are invariant with respect to gray-scale changes. The textures may also be arbitrarily oriented, which suggests using rotation-invariant features. This was also recently proposed in [12]. In Castano et al.'s experiments with scene images, however, the rotation-invariant features did not perform as well as the normal ones, which demonstrates that the choice of features is naturally dependent on the application [1]. Due to foreshortening and other geometric transformations in a 3D environment, invariance to affine transformations should also be considered.

For our approach, we chose the local binary pattern operator, which has recently shown excellent performance in the classification of 2D textures [14,15]. LBP is a gray-scale invariant texture primitive statistic. For each pixel in an image, a binary code is produced by thresholding its neighborhood with the value of the center pixel. A histogram is created to collect up the occurrences of different binary patterns. The basic version of the LBP operator considers only the eight neighbors of a pixel, but the definition can be extended to include all circular neighborhoods with any number of pixels. By extending the neighborhood one can collect larger-scale texture primitives.
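To make the operator concrete, the following is a minimal Python sketch of the basic 8-neighbor, radius-1 LBP described above; the function names and array conventions are ours, not the implementation used in the experiments.

```python
import numpy as np

def lbp_8_1(image):
    """Basic LBP with 8 neighbors at radius 1 (an illustrative sketch,
    not the authors' implementation)."""
    img = np.asarray(image, dtype=np.int32)
    h, w = img.shape
    center = img[1:h-1, 1:w-1]
    code = np.zeros_like(center)
    # The 8 neighbors, enumerated clockwise starting from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # A neighbor >= center contributes one bit to the 8-bit code.
        code |= (neighbor >= center).astype(np.int32) << bit
    return code  # (h-2, w-2) array of codes in 0..255

def lbp_histogram(codes, bins=256):
    """Histogram of LBP codes over a region, normalized to sum to one."""
    hist = np.bincount(codes.ravel(), minlength=bins).astype(np.float64)
    return hist / hist.sum()
```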

In our research, we considered neighborhoods with 8, 16 and 24 samples and radii 1, 3 and 5 [15]. The rotation-dependent operators chosen were LBP8,1 (8 samples, radius 1), and the multiresolution LBP8,1+16,3+24,5 obtained by concatenating histograms produced by operators at three resolutions into a single histogram. In order to reduce the number of bins needed, we adopted the "uniform" pattern approach (see [15]) in the rotation-dependent case for the radii 3 and 5. We also used the rotation-invariant operators LBPriu2 8,1 and LBPriu2 8,1+16,3+24,5 as described in [15], in which the matches of similar uniform patterns at different orientations are collected into a single bin.

LBP can be regarded as a "micro-texton" operator. At each pixel, it detects the best matching local binary pattern representing different types of (slowly or deeply sloped) curved edges, spots, flat areas etc. For the LBPriu2 8,1 operator, for example, the length of the feature vector and the size of the texton vocabulary is as low as 10 (9 + 1 for "miscellaneous"). After scanning the whole image to be analyzed with the chosen operator, each pixel will have a label corresponding to one texton in the vocabulary. The histogram of the labels computed over a region is then used for texture description. For multiresolution operators, the texton histograms computed at different resolutions are concatenated into a single histogram containing, for example, 54 textons (= 10 + 18 + 26) in the case of the LBPriu2 8,1+16,3+24,5 operator.
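As an illustration of how the 10-texton vocabulary of LBPriu2 8,1 arises, the following sketch maps an 8-bit LBP code to its rotation-invariant uniform label following the riu2 scheme of [15]; the function name and interface are ours.

```python
def riu2_label(code, P=8):
    """Map a P-bit LBP code to its rotation-invariant 'uniform' label,
    following the riu2 scheme of [15]. Uniform patterns (at most two
    0/1 transitions around the circle) are labeled 0..P by their number
    of one-bits; all non-uniform patterns share the label P+1, giving
    P+2 bins in total (10 for P=8, i.e. 9 + 1 for "miscellaneous")."""
    bits = [(code >> i) & 1 for i in range(P)]
    transitions = sum(bits[i] != bits[(i + 1) % P] for i in range(P))
    return sum(bits) if transitions <= 2 else P + 1
```

A multiresolution histogram is then simply the concatenation of the per-resolution texton histograms, e.g. np.concatenate([h_8_1, h_16_3, h_24_5]), giving the 54 bins mentioned above.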

3. Use of multiple histograms as texture models

The 3D appearance of textures can be efficiently described with histograms of textons computed from training images taken from different views and illumination conditions [10,11,12]. In our basic method, one model histogram for each training sample is used. An unknown sample is then classified by comparing its texton histogram to the models of all training images, and the sample is assigned to the class of the nearest model.


We used LBP histograms as texture models. The histograms were normalized with respect to image size variations by setting the sum of their bins to one. For comparing histograms, a log-likelihood statistic was used:

L(S, M) = \sum_{b=1}^{B} S_b \log M_b                                  (1)

where B is the number of bins and S_b and M_b correspond to the sample and model probabilities at bin b, respectively [15].
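A direct transcription of Eq. (1) and of the nearest-model classification rule might look as follows; the epsilon guard for empty model bins is our own practical addition, not part of the published statistic.

```python
import numpy as np

def log_likelihood(sample, model, eps=1e-12):
    """L(S, M) = sum_b S_b log M_b of Eq. (1). The eps term guards
    against empty model bins (our practical addition)."""
    return float(np.sum(sample * np.log(model + eps)))

def classify(sample_hist, model_hists, model_labels):
    """Assign the sample to the class of the model histogram that
    maximizes the log-likelihood statistic (the 'nearest' model)."""
    scores = [log_likelihood(sample_hist, m) for m in model_hists]
    return model_labels[int(np.argmax(scores))]
```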

Training of a system can be very problematic if a good coverage of training samples taken from different viewing angles and illumination conditions is required. Collecting such a large training set may even be impossible in real-world applications. A fundamental assumption in view-based vision is that each object of interest can be represented with a small number of models obtained from images taken from selected views. How many of these "keyframes" are needed and how they are selected depends on the data, features, and requirements of the given application.

For rough textures, reasonably many models may be needed because the visual appearance of these textures can vary greatly, whereas smooth textures may require a much smaller number of models. By using invariant features (rotation, gray scale, affine) the within-class variability is likely to decrease, which should reduce the number of models needed. One should remember, however, that while adding feature invariance, the discriminative power of a feature (and between-class differences) might in fact decrease.

How to select a good reduced set of models depends on the application. If a good coverage of training images of all classes taken from different viewpoints and illumination conditions is available and all classes are known beforehand, a model reduction method based on clustering or optimization can be used [12]. The method based on dimensionality reduction by principal component analysis considers only within-class variations [10,11]. In the experiments presented in Section 5.1, we adopt the optimization approach in order to compare our results to the state of the art [12].

The optimization method is a simple hill-climbing search in which the set of samples is divided into two parts, one for training and the other for testing. Each sample in the training set is dropped off in turn, and a classification result against the test set is obtained. In each iteration, the sample whose removal results in the best classification accuracy is moved to the test set. Since there is no data to validate the optimization result, this kind of method obviously results in over-learning and biased results. Furthermore, the classification results are not directly comparable due to the fact that a different testing set is used in each iteration.

Due to the excessive number of model textures, we modified the algorithm slightly. The number of models was reduced class-wise until 20 models (for 3x3 operators) or 15 models (for multiresolution operators) were left in each class. Only then was the global reduction started.
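The greedy reduction can be sketched as below; the accuracy callback, which is assumed to reclassify the current test set (including previously dropped samples) against the remaining models, stands in for the bookkeeping described above.

```python
def greedy_reduce(n_models, target, accuracy):
    """Greedy hill-climbing model reduction in the spirit of [12]:
    repeatedly drop the model whose removal yields the best accuracy.
    `accuracy(kept)` is an assumed callback that classifies the current
    test set against the models indexed by `kept` and returns the
    classification rate."""
    kept = list(range(n_models))
    while len(kept) > target:
        best_acc, best_i = -1.0, None
        for i in kept:
            trial = [k for k in kept if k != i]
            acc = accuracy(trial)
            if acc > best_acc:
                best_acc, best_i = acc, i
        kept.remove(best_i)  # Drop the least useful model.
    return kept
```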

If it is not possible to have enough representative training samples, the selection of models could be done, for example, by utilizing information about the imaging positions and geometries of the objects, or temporal information available in image sequences. In the following section, we will propose an approach for finding appearance models for each class by using self-organization.

4. Learning appearance models by self-organization

Recent psychophysical studies suggest that the human brain represents objects as series of interconnected views. Bülthoff et al. have proposed a system which learns such representations of objects through the process of feature tracking [9]. They were able to demonstrate that employing temporal information during learning by means of a keyframe representation yields a large increase in recognition performance compared to a simple view-based representation of the same complexity using only static views.

How can such interconnected frames be found for the recognition of arbitrary 3D textures? It is obvious that the optimization method presented in Section 3, for example, which is based on the assumption that all possible classes are known beforehand, cannot be used. The view-based representation of a texture should be more or less independent of the context, and be applicable to various collections of textures.

One possibility would be to track textures in consecutive frames digitized by an image acquisition system, and to choose keyframes according to the changes in texture appearance. This approach resembles the method used in [9]. Instead, we propose the use of self-organization of texture feature distributions over a sequence of images to learn the keyframes for each texture class separately. The samples of each texture type in the CUReT database can be thought of as a sequence of frames taken under different viewpoints and illumination directions.

The main idea in finding representative keyframes is to utilize the self-organization of feature distributions. The LBP histograms of each training sample belonging to a certain class are fed to a SOM [16], which strives to distribute the feature vectors as uniformly as possible. The dimensionality of the data is reduced to two while preserving its topology. Since samples with similar textural characteristics cluster close to each other, representative models (keyframes) can be selected by considering only a subset of samples in each local neighborhood.
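The following bare-bones SOM trainer illustrates this self-organization step on LBP histograms; it is a generic Kohonen update with linearly decaying learning rate and neighborhood radius, not the specific implementation of [16], and all parameter values are illustrative.

```python
import numpy as np

def train_som(histograms, rows=4, cols=4, iters=2000, lr0=0.5, sigma0=2.0,
              seed=0):
    """Bare-bones SOM training on LBP histograms (illustrative only)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(histograms, dtype=np.float64)
    weights = rng.random((rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1).astype(float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit (BMU): node with the closest weight vector.
        dist = np.linalg.norm(weights - x, axis=2)
        bmu = np.array(np.unravel_index(np.argmin(dist), dist.shape))
        # Decay the learning rate and neighborhood radius over time.
        frac = t / iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        g = np.exp(-np.sum((grid - bmu) ** 2, axis=2) / (2.0 * sigma ** 2))
        # Pull every node's weights towards x, scaled by neighborhood g.
        weights += lr * g[..., None] * (x - weights)
    return weights
```

After training, each sample is mapped to the node of its best-matching unit, so similar samples fall into the same or nearby nodes.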

We also consider the case in which all training samples of all classes are fed to a common self-organizing map. Based on the knowledge of the clustering properties of the SOM, we should find structural dependencies in the inspected data. When the actual class labels are attached to the nodes of the SOM, badly confused and spread-out classes are visually revealed. A problem with this approach is that requiring all classes to be known and well covered at once is too restrictive for a more generic view-based recognition system.

5. Experiments with CUReT textures

5.1. Classification

The CUReT database contains 61 different real-world textures, shown in Fig. 1, each imaged under more than 200 different combinations of viewing and illumination directions [8]. In order to be able to compare the results, we carried out experiments similar to those in earlier studies [10,11,12,13]. In the first image set, 118 images taken under varying viewpoint and illumination with a viewing angle less than 60 degrees were extracted from each texture class in the database. The second set contained 156 images of each texture class with a viewing angle less than 70 degrees. Three different classification problems were considered, containing 20, 40 and all 61 different textures. Half of the images (i.e. 59 or 78) for each texture were used for training and the other half for testing. The images for the training (and test) set were selected in two alternative ways: by taking every alternate image, or by choosing them randomly and computing the average accuracy of 10 trial runs. Each texture was then modeled by 59 or 78 LBP histograms, respectively, and in classification each sample from the test set was assigned to the class of the closest model histogram.

Figure 1

Table 1 shows the classification rates for different LBP features in each classification problem when the viewing angle was less than 60 degrees and every alternate image was chosen for the training set. For comparison, the best corresponding results obtained by [12] with the MR8 filter bank are also presented. The results are not fully comparable, because the image sets were not exactly the same. Varma and Zisserman [12] used only 92 "sufficiently large" images for each class with a viewing angle less than 60 degrees, while we used all 118 images available with the same viewing angle limitation. Our classification problem is more challenging, because we also included small, often highly tilted samples, for which it is not easy to extract reliable feature statistics.

Table 1

The multiresolution rotation-invariant LBP performed best, exceeding the classification rates obtained with the MR8 filter bank, even though we used a more difficult image set than in [12].

The same operator achieved high classification rates also when the training samples were chosen randomly: 97.67%, 94.81% and 94.30% for the 20, 40 and 61 class problems, respectively. When using 156 images per class with a viewing angle less than 70 degrees and alternate sampling, very high rates of 97.63%, 95.22% and 93.57% were still obtained for the three problems. The larger viewing angle means that highly tilted, often almost featureless samples were included in this image set.

Next, we investigated how many models are needed for each class to obtain good performance when rotation-invariant LBP features are used. The problem of classifying 20 textures with a viewing angle less than 60 degrees was chosen for this study. Because a good coverage of training images for all classes was available, the greedy reduction method used by [12] was adopted.

First, we did experiments with the optimization approach of [12], which maximized the total number of images classified. That is, if only M models per texture are used for training, then the rest of the 59-M training images are added to the test set so that they may also be classified. Table 2 shows the classification accuracy versus the number of models.

To see how well the optimized results generalize, we also did another set of experiments. This time, the samples were divided into two sets: one for optimization (79 samples/texture), and one for validation (39 samples/texture). The optimization set was further divided into two smaller sets, one for training (40 samples/texture) and the other for testing (39 samples/texture). The optimization algorithm of [12] was run on the optimization set, with the exception that the testing set was not modified during the process. The samples for each set were selected so that the first sample was for training, the second one for optimization, the third one for testing, the fourth for training, etc. The classification accuracy of the optimized features on the independent validation set is shown in Table 3. These rates can be considered conservative due to the smaller image sets used in the different phases.

Table 2


Table 3

We can see that the multiresolution rotation-invariant LBP provides outstanding performance, achieving a classification rate of 99.77% in the optimistic case when using nine models per class on average, and 93.21% in the more realistic experimental setup.

5.2. Learning appearance models

The problem of classifying 20 textures with a viewing angle less than 60 degrees was chosen for the experiments. As earlier (Table 1), the data set was divided into two groups of equal size, 59 samples per class for training and 59 disjoint samples per class for testing. The multiresolution rotation-invariant LBP was used as the texture measure with which class-wise SOMs were created.

Fig. 2 illustrates the SOM obtained for the class "Terrycloth". As shown by the numbers in the figure, many samples can be grouped into a single node of the SOM during the self-organization process. These samples are likely to be very similar to each other. The dissimilarity between samples grows with the geometrical distance between nodes.

Figure 2

A simple subsampling scheme was used for finding the keyframes. Given a requested number of models M, M nodes were selected so that they covered the SOM as well as possible. If a selected node contained more than one sample, one of them was randomly chosen. Fig. 3 shows an example of how nine models are selected from a 5x5 SOM. With three models per class, the nodes in the top-left corner, in the bottom-right corner and in the middle were selected. A sketch of this selection heuristic is given below.

Figure 3
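The subsampling scheme can be sketched as follows; the lattice-based spreading of the M target positions is our own reading of "covering the SOM as well as possible", since the exact layout is only specified through the examples above.

```python
import numpy as np

def pick_keyframes(node_of_sample, rows, cols, m):
    """Select m samples whose SOM nodes cover the map as evenly as
    possible. `node_of_sample` maps sample index -> (row, col) of its
    best-matching node (an assumed data structure)."""
    k = int(np.ceil(np.sqrt(m)))
    targets = [(r, c) for r in np.linspace(0, rows - 1, k)
                      for c in np.linspace(0, cols - 1, k)][:m]
    chosen = []
    for tr, tc in targets:
        # Nearest not-yet-chosen sample to this target grid position.
        best = min((s for s in node_of_sample.items() if s[0] not in chosen),
                   key=lambda s: (s[1][0] - tr) ** 2 + (s[1][1] - tc) ** 2)
        chosen.append(best[0])
    return chosen
```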

When using this rough model selection principle with 3, 6, and 9 models per class on average, we obtained classification rates of 75.17%, 86.02% and 90.26%, respectively. These results are quite close to those obtained with optimization (Table 3). They show that the presented approach accounts well for the within-class variations and is able to find proper models for each texture class.

We also investigated whether this kind of approach could be used for selecting models when the whole training set of all 20 classes is used to train a common SOM. In this case, the between-class variations are also taken into account, as in the optimization-based method. Fig. 4 presents the resulting map.

Figure 4

Different classes are drawn with different colors. When a node contained samples from more than one class, the class with the most hits was chosen to represent that node. Thick red borders show the occurrence of one class (Velvet). We can notice that some samples of this class are mixed with some other classes (e.g. Sponge and Linen), and that the cluster is actually divided into two separate connected regions. Our training system also allows us to visualize the original texture images representing each node. This helps a user (vision system trainer) to inspect which samples in a SOM node belong to a given class. By using the map and its visualization capabilities, we picked the models for each class manually by choosing samples from the edges and centers of each cluster. With this approach, a classification rate of 89.24% was obtained for the 20-class problem when using nine models per class on average.

The accuracy was about the same as that obtained by choosing models for each class separately. This further confirms our hypothesis that it is possible to find good appearance models for 3D-texture classification by applying self-organization to each textured object separately. Of course, an approach taking into account both within-class and between-class variations should provide somewhat better results, but requiring all classes to be known and well covered at once is too restrictive for a more generic view-based recognition system.

The presented preliminary results are very encouraging. With the presented approach, the user (teacher) can easily visualize the inspected data, see how badly the classes are confused, and detect possible outliers. The time needed for model selection was only a fraction of that needed by the greedy optimization method.

6. Experiments with scene images

The analysis of outdoor scene images, for example for navigating a mobile robot, is very challenging. Texture could play an important role in this kind of application, because it is much more robust than color with respect to changes in illumination, and it could also be utilized in night vision [1]. Castano et al. argue that in the vast majority of applications classification is a more important issue than clustering or unsupervised segmentation. In their work, they assessed the performance of two Gabor filtering based texture classification methods on a number of real-world images relevant to autonomous navigation on cross-country terrain and to autonomous geology. They obtained satisfactory results for rather simple terrain images containing four classes (soil, trees, bushes/grass, and sky), but the results for the rock dataset were much poorer.


Unfortunately, we were not able to obtain their images for our research. Therefore, we created our own test set of outdoor images by taking a sequence of 22 color images of 2272x1704 pixels with a digital camera. A person walking on a street took a new picture after about every five meters. The camera was looking forward and its orientation with respect to the ground was kept roughly constant, simulating a navigating vehicle. The image set is available in the Outex database [17]. Half of the (gray-scale) images were used for training and the other half for testing. The images for training (and testing) were selected by taking every alternate image.

Five texture classes were defined: sky, trees, grass, road and buildings. Due to the considerable changes of illumination, the following sub-classes were also used: trees in the sun, grass in the sun, road in the sun, and buildings in the sun. Following the approach of Castano et al. [1], we labeled ground-truth areas by hand from each training and testing image, i.e. areas in which a given texture class is dominating and not too close to another class, to avoid border effects. Fig. 5(a) shows an original (test) image and Fig. 5(b) a labeled image. Next, each training image was processed with the chosen LBP operator and model distributions were created in two alternative ways: 1) for each class and sub-class, a single model histogram was computed using the LBP-labeled pixels inside the ground-truth areas of the training set (a sketch of this scheme follows Fig. 5); 2) for each of the 11 images in the training set, a separate model histogram was created for those classes or sub-classes that were present in the given image. In the first case, the total number of model histograms was nine (on average 1.8 models per class), while in the second case it was 68 (on average 13.6 models per class).

Figure 5
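The first model-building scheme (one pooled histogram per class or sub-class) might be sketched like this; the array shapes and the convention that label 0 means "unlabeled" are our assumptions.

```python
import numpy as np

def class_models(lbp_images, label_images, n_classes, bins):
    """Way 1: one pooled model histogram per class or sub-class,
    computed from the LBP codes of ground-truth pixels over the whole
    training set. Label 0 is assumed to mean 'unlabeled' (our
    convention); labels 1..n_classes index the (sub-)classes."""
    models = np.zeros((n_classes + 1, bins))
    for lbp, lab in zip(lbp_images, label_images):
        for c in range(1, n_classes + 1):
            # Accumulate LBP code counts over this class's pixels.
            models[c] += np.bincount(lbp[lab == c], minlength=bins)
    # Normalize each model histogram to sum to one (empty rows guarded).
    models /= np.maximum(models.sum(axis=1, keepdims=True), 1.0)
    return models[1:]
```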

Pixel-wise classification of the test images was done by centering a circular disk with radius r (r = 30) at the pixel being classified, computing the sample LBP histogram over the disk, and assigning the pixel to the class whose model was most similar to the sample. After classifying all pixels inside the ground-truth regions in this way, classification rates for each of the five classes can be computed after combining classes with their possible sub-classes. Fig. 5(c) shows an example of classified ground-truth pixels. Fig. 5(d) demonstrates how the whole image could be segmented by classifying all pixels in the image in the same way.
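A sketch of the pixel-wise disk classification, reusing the log-likelihood score of Eq. (1); the boolean disk mask and the epsilon guard are implementation details of ours.

```python
import numpy as np

def classify_pixel(lbp_codes, y, x, models, r=30, bins=256, eps=1e-12):
    """Classify one pixel: build the sample LBP histogram over a disk
    of radius r centered at (y, x) and return the index of the model
    maximizing the log-likelihood statistic of Eq. (1)."""
    h, w = lbp_codes.shape
    yy, xx = np.ogrid[:h, :w]
    disk = (yy - y) ** 2 + (xx - x) ** 2 <= r * r
    hist = np.bincount(lbp_codes[disk], minlength=bins).astype(np.float64)
    hist /= hist.sum()
    # Score every model at once: sum_b S_b log M_b for each model row.
    scores = (hist * np.log(models + eps)).sum(axis=1)
    return int(np.argmax(scores))
```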

Fig. 6 presents the classification rates for the whole test set using different LBP operators with single or multiple histograms as models. The rotation-variant multiresolution operator LBPu2 8,1+16,3+24,5 achieved a very good accuracy of 85.43%, but the simple LBP8,1 operator also performed well (80.92%). Multiple histogram models were clearly better than single histograms. In this application area, the rotation-invariant LBP operators did not perform as well as their rotation-variant counterparts. Fig. 7 shows the classification rates for different test images in the case of the LBP8,1 operator. The results for images P9100033 and P9100035 were quite poor, largely because the classes "road" and "grass" were often mixed.

Figure 6

Figure 7

The scene textures used in the experiments had a wide variability within and between images due to variations in illumination, shadows, foreshortening, self-occlusion, and non-homogeneity of the texture classes. Therefore, the results obtained can be considered very promising.

7. Discussion

The experiments presented in this paper show that histograms of micro-textons provided by the multiresolution LBP operator are very efficient descriptors of 3D surface textures imaged under different viewing and illumination conditions, providing the leading performance in the classification of CUReT textures. Due to the gray-scale and rotation invariance of the features used, only a few models per texture are needed for robust view-based recognition. The method also performed well in the classification of outdoor scene textures. These textures had a wide variability within and between images due to changes in illumination, shadows, foreshortening, self-occlusion, and non-homogeneity of the texture classes. Therefore, the results obtained can be considered very promising, confirming that our approach has much potential for a wide variety of applications.

A significant advantage of the proposed method is that there is no need to create a specific texton library as in the earlier approaches; a generic library of micro-textons can be used instead. Due to the invariance of the LBP features with respect to monotonic gray-scale changes, our method can tolerate the considerable gray-scale variations common in natural images, and no normalization of input images is needed. Unlike the earlier approaches, the proposed features are very fast to compute and do not require many parameters to be set. The only parameters needed in the LBP approach are the number of scales and the number of neighborhood samples.

Problems of finding representative models for view-based texture recognition were also considered. A method based on self-organization of feature histograms was proposed. We showed that it is possible to quickly learn a good reduced set of appearance models for each texture class separately. The presented visualization-based approach can easily be used for selecting model histograms and rejecting outliers, thus providing an efficient tool for vision system training even when the feature data has a large variability.


Our results suggest that the micro-textons detected by the LBP approach contain more discriminative texture information than the coarser-grained textons used in earlier studies. Micro-textons contain information about curved edges, spots and flat areas in a texture. These might be used as basic primitives for describing various types of textures, including microtextures, macrotextures, and non-homogeneous textures.

Acknowledgements

The authors would like to thank O.G. Cula for providing information about the CUReT samples used in the experiments. This work was partly supported by the Academy of Finland.

References

[1] R. Castano, R. Manduchi, J. Fox, Classification experiments on real-world texture, Proc. Third Workshop on Empirical Evaluation Methods in Computer Vision, Kauai, Hawaii, 2001, pp. 3-20.

[2] M. Chantler, Why illuminant direction is fundamental to texture analysis, IEE Proceedings Vision, Image and Signal Processing 142(4) (1995) 199-206.

[3] K.J. Dana, S.K. Nayar, Histogram model for 3D textures, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998, pp. 618-624.

[4] P. Suen, G. Healey, Analyzing the bidirectional texture function, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998, pp. 753-758.

[5] B. van Ginneken, M. Stavridi, J.J. Koenderink, Diffuse and specular reflectance from rough surfaces, Applied Optics 37 (1998) 130-139.

[6] J. Malik, S. Belongie, T. Leung, J. Shi, Contour and texture analysis for image segmentation, International Journal of Computer Vision 43(1) (2001) 7-27.

[7] T. Leung, J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, International Journal of Computer Vision 43(1) (2001) 29-44.

[8] K.J. Dana, B. van Ginneken, S.K. Nayar, J.J. Koenderink, Reflectance and texture of real world surfaces, ACM Transactions on Graphics 18(1) (1999) 1-34.

[9] H.H. Bülthoff, C. Wallraven, A. Graf, View-based object recognition based on human perception, Proc. 16th International Conference on Pattern Recognition, Vol. 3, 2002, pp. 768-776.

[10] O.G. Cula, K.J. Dana, Compact representation of bidirectional textures, Proc. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, 2001, pp. 1041-1047.

[11] O.G. Cula, K.J. Dana, Recognition methods for 3D textured surfaces, Proc. SPIE Conference on Human Vision and Electronic Imaging VI, Vol. 4299, 2001, pp. 209-220.

[12] M. Varma, A. Zisserman, Classifying images of materials: achieving viewpoint and illumination independence, Proc. 7th European Conference on Computer Vision, Vol. 3, 2002, pp. 255-271.

[13] M. Varma, A. Zisserman, Classifying materials from images: to cluster or not to cluster, Proc. 2nd International Workshop on Texture Analysis and Synthesis, 2002, pp. 139-143.

[14] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on feature distributions, Pattern Recognition 29 (1996) 51-59.

[15] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7) (2002) 971-987.

[16] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1997.

[17] T. Ojala, T. Mäenpää, M. Pietikäinen, J. Viertola, J. Kyllönen, S. Huovinen, Outex - new framework for empirical evaluation of texture analysis algorithms, Proc. 16th International Conference on Pattern Recognition, Vol. 1, 2002, pp. 701-706. (http://www.outex.oulu.fi)


Figure and table captions

Fig. 1: CUReT textures.

Fig. 2: A 4x4 SOM created for the class Terrycloth.

Fig. 3: Nine selected Terrycloth models from the corners and center of the SOM.

Fig. 4: SOM representing the whole training data.

Fig. 5: A scene image. (a) The original image. (b) Ground-truth regions. (c) Classified pixels within ground-truth regions. (d) Segmented image.

Fig. 6: Classification rates (%) for different versions of LBP.

Fig. 7: Classification rates of different images in the test set for the LBP8,1 operator.

Table 1: Classification rates (%) for different numbers of texture classes.

Table 2: Optimistic classification rates (%) for different average numbers of models (20-class problem).

Table 3: Conservative classification rates (%) for different average numbers of models (20-class problem).


Table 1. Classification rates (%) for different numbers of texture classes.

                              # of texture classes
Operator                      20        40        61
LBP8,1                        97.54     91.57     87.02
LBP8,1+16,3+24,5              98.73     94.49     90.03
LBPriu2 8,1                   93.73     83.69     81.47
LBPriu2 8,1+16,3+24,5         98.81     97.25     96.55
MR8 [12]                      97.50     96.30     96.07

Table 2. Optimistic classification rates (%) for different average numbers of models (20-class problem).

                              Average # of models per texture
Operator                      3         6         9
LBPriu2 8,1                   77.70     88.39     92.89
LBPriu2 8,1+16,3+24,5         91.96     98.79     99.77
MR8 [12]                      90.67     98.14     98.61

Table 3. Conservative classification rates (%) for different average numbers of models (20-class problem).

                              Average # of models per texture
Operator                      3         6         9
LBPriu2 8,1                   72.18     82.05     85.51
LBPriu2 8,1+16,3+24,5         83.36     91.92     93.21
