CSCI 5280 Image Processing and Computer Vision
Last Updated: 2013.5.10
The goal of the project is to implement an image classification system. The main task is to train image classifiers on the Training set, and do the testing on the Testing set. A dataset is provided, which contains 5 classes ("bedroom", "industrial", "kitchen", "living room", "store").
In this project, we have implemented a image classification based on the skeleton code provided. The system is programmed in MATLAB. Some third-party libraries were also used to promote the performance. Download links of the code and other libraries are listed here
VL_Feat (Feature Extraction Library)
LibSVM (High Performance SVM Implementation)
LibLinear (Large Scale Linear SVM)
Sparse Coding(OMP Sparse Projection, SPAM)
LLC coding (local constraint linear coding)
The system consists of three parts: 1) Feature Extraction Module; 2)Image Representation Module; 3)Classifier Module. Based on these modules, we implemented a complete pipeline from input image to classification result, and can evaluate the performance of our algorithm by the classification accuracy.
We have tried a few image features, including dense sift, surf, and gist. The best feature selection we found is the dense sift with gist. To our understanding, the gist feature reveal the local detail of the image, while the gist feature gives an holistic decription of the image. The dense sift feature is extracted using VL_Feat library. In the extraction process, we set the spacing of the dense grid to 6 pixels, and the sampling window is set to 16*16 pixels.
Figure 1 Dense Grid to Extract DSIFT
The gist feature is a well-developed feature used in scene recognition. It uses Gabor filter to extract a holistic description of a lot properties of the image. These properties are highly related to the underlying scene where the image was taken. It will help determine what type of object the image is showing. However, the images we used in the experiment are only grayscale images, which will constraint the performance of the gist feature. But gist feature actual improves our recognition performance.
After feature extraction, we get the raw form of feature descriptors. These descriptors cannot be directly used in recognition. We do a coding processing on the descriptors to get the final image representation for recognition. Here we also tried a few methods. We set the codebook length to 1500 and tried these coding method: 1)nearest neighbor and k-NN 2)LLC (local linear constraint coding) 3)sparse coding. These method has variant performance on the dataset. The best performance is achieved with LLC.
In the experiment, the coding is committed on dense sift features. In the training process, we extract a dataset of about 350 thousands descriptor vectors. Then we try to learn a codebook of 1500 vectors from the dataset. The learning here we used is kmeans for NN and LLC, and sparse dictionary learning for sparse coding (using SPAM software).
Nearest neighbor method
In the nearest neighbor method, to code a feature vector, we find its 1 or k nearest neighbors in the codebook. Then we use the Gaussian kernel similarity metrics to represent this vector by codebook.
The local linear constraint coding is a fast coding method. Using the kmeans result for the code book, we achieved a good recognition performance. We have modified the original code to adapt to our usage.
Sparse coding method is a novel concept raised in recent years. It aims to represent an input with sparse composition of a codebook set.
After we code the descriptor vectors, we still have a lot of code vector, we still have a lot of vectors with one image. Commonly, we can make a histogram by sum up vectors of one image. However, a direct histogram loses spatial information of these descriptors, which may be very important for recognition. To preserve spatial relations of the code vectors, we introduced the spatial pyramid matching techniques. Pooling is also used, which is like the way biological visual system does.
Spatial pyramid match aims to preserve the spatial relationship between local descriptors. The basic idea is to divide the image into a few rectangle blocks and do the histogram in the block. In layer of the pyramid, the image is equally divided into blocks. Each block has a histogram of codes. Thus the image is represented by a vector, where is the length of the codebook. The emphasize the importance of the spatial position, histogram of each layer is assigned with a weight , where L is the number of layers.
Figure 2 Spatial Pyramid of 3 Layers
Pooling is a concept comes from biology neuroscience. Instead of histograming, pooling method apply a function on each element of all code vectors in the dataset, which lead to a value assigned to the corresponding element in the representation vector. Experiments shows that pooling can improve the result of LLC and sparse coding.
The image representation we obtained need to be fed to a set of classifiers to determine their classes. In this project, I used the LibLinear library, a SVM library used for large-scale data. This library is very suitable for our mission, which needs a fast classifiers to handle the long feature vector given by spatial pyramid technique. Our problem can be formulated as a multi-class classification problem. We choose the one-vs-all strategy to train the SVM. This means we train n SVMs, each labels the sample inside one class as +1 and other samples as -1, in total we have n classes. In classification, the test sample will be fed to n classifiers, the highest score given by classifier i means the sample belongs to class i.
We do experiments on the data given. The training set has 5 classes, each has 60 images. The testing set has 5 corresponding classes, each has 40 images. To test the performance of our algorithm, we train our codebook and classifiers on the training set. Then we permute the testing set to a set of 200 images, their class label are saved in the ground truth file. Then we calculate the recognition accuracy on the testing set. The experiment results are shown below. Note the spatial pyramid is set to 3 layers, and the pooling is max-pooling.
Figure 3 Image Classification Result. Here LLC use 20 nearest neighbors.
We can see that the best performance is obtained using dSIFT with gist features, LLC coding and a 1500 entries codebook. The best recognition accuracy was 74%.
We have discussed in previous section that dSIFT adding gist can give us a local and global representation of the image scene. We It is not so trivial that LLC, sparse coding, and kNN are on par with each other in performance. One possibility is that the codebook of 1500 words may have mostly covered the image visual sementics.
One interesting fact is that when we use the KMeans codebook to do the sparse coding experiment, performance only slightly decreased. We suggest that the sparse coding result is not so dependent on the codebook learning method. But it is more related to the sparse model. When we use the Homotopy solver to do the sparse coding, we achieved a result that is even lower than our baseline, the nearest neighbor hard coding.
Another important fact is that spatial pyramid matching and pooling can significantly improve the performance of both LLC and sparse coding. For LLC, the improvement of pyramid with pooling is about 10%. For sparse coding, the improvement from pyramid matching (pooling is default for sparse coding) is about 5.5%. This shows that spatial information is important for recognition.
In this project, we built up a image classification system use various features, codebooks and coders. We also test the importance of spatial pyramid matching and pooling. Finally the best recognition rate we achieved is 74%. In the implementation, the feature extraction, sparse coding and classifiers are from 3-rd party libraries, other part of the system are all implemented by myself based on the structure of given skeleton code.
 R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification, Journal of Machine Learning Research 9(2008), 1871-1874.
 Andrea Vedaldi and Brian Fulkerson. 2010. Vlfeat: an open and portable library of computer vision algorithms. In Proceedings of the international conference on Multimedia (MM '10). ACM, New York, NY, USA, 1469-1472.
 Lazebnik, S.; Schmid, C.; Ponce, J., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on , vol.2, no., pp.2169,2178, 2006
 David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.
 Oliva A, Torralba A. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 2001, 42(3): 145-175.
 Wang J, Yang J, Yu K, et al. Locality-constrained linear coding for image classification. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 3360-3367.
 Allen Yang, Arvind Ganesh, Zihan Zhou, Shankar Sastry, and Yi Ma. Fast L1-Minimization Algorithms for Robust Face Recognition. (preprint)
 J. Mairal, F. Bach, J. Ponce and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. Journal of Machine Learning Research, volume 11, pages 19-60. 2010.
1. Install and run
On installing, just extract the package to any place you want. To test the recognition performance, just put the test images into the “/testing” folder just as the current images does, every class has its pictures in the subfolder of the class id (0,1,2…). Then run the matlab script “image_classification_system.m” you will see the result when script runs to an end. Note that the order of test image will be uniform random permuted.
The package contains a standard configuration file “config.ini”. The content of the file and the setting we used to achieve the best performance is listed below:
If the image numbers every class in testing set is changed, please set the parameter “nImgPerClass_testing” to the actual number you are using. This package also contain pretrained codebooks using kmeans and sparse dictionary learning. If you want to train your own codebook, please change the value of “trainBook” to ‘true’.