2017年4月1日 星期六

Why is the traditional classifier better than the Neural Network?

Overview

    This article gives my interpretation of why a traditional classifier can surpass a (fully-connected) neural network when the input features are already given.

Introduction

     To recognize certain objects in an image, a lot of research uses a Convolutional Neural Network (CNN) to extract features and then feeds those features into a traditional classifier to predict the result.  This still holds even when the data sets are large enough (i.e., when the effect of overfitting can be excluded).
     It's hard to believe, though, since the CNN is trained for exactly this purpose and the last layer (such as the fully-connected layer or the softmax layer) is optimized together with the others as a whole.  Namely, during training, each layer adjusts its parameters so that together they give the best result.  In this situation, the earlier layers extract the features that work best with the last layer, which performs the classification.  And the last layer is specifically shaped to accept those kinds of features and do the classification.
     However, plenty of papers use a traditional classifier to give the final results.  Even the SVM (often regarded as the simplest classifier, whose trivial version can only separate data linearly) is considered better than the fully-connected layer.  Besides overfitting (too little data compared to the number of parameters to fit), the following is the interpretation I propose.

Interpretation

     Suppose we take a 2D feature space to describe a system:
1. If the distribution of data looks like Fig. 1(a), it is very simple to separate it into two classes.
(All of the following graphs are calculated by libsvm.)
Fig. 1 (a)(b)

2. If the distribution of data looks like Fig. 2(a), it still makes sense.
Fig. 2 (a)(b)
For example, the horizontal axis represents the BMI of students, and the vertical axis their sports performance.  This distribution can also be perfectly separated by an SVM if a certain transformation is performed on the axes.  For example, if you transform the BMI into some score that represents the health of the human body, the graph will look like Fig. 1 (which means a "health score" is more suitable for describing such a system).
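As a small sketch of this idea (the ideal BMI of 22, the width of the "fit" band, and all the numbers below are made up for illustration), a linear SVM cannot isolate a class that sits in the middle of the raw BMI axis, but it separates the same data perfectly once the axis is mapped to a hypothetical health score:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
bmi = rng.uniform(15, 35, 400).reshape(-1, 1)
# hypothetical rule: "fit" students have a BMI close to 22
fit = (np.abs(bmi[:, 0] - 22) < 4).astype(int)

# raw BMI axis: the "fit" class sits in a middle band,
# so a single linear boundary cannot isolate it
acc_raw = SVC(kernel="linear", C=100.0).fit(bmi, fit).score(bmi, fit)

# hypothetical "health score": distance from the ideal BMI, sign flipped,
# which turns the middle band into one side of a threshold
health = -(bmi - 22) ** 2
acc_tr = SVC(kernel="linear", C=100.0).fit(health, fit).score(health, fit)
print(acc_raw, acc_tr)
```

After the transformation, the same data becomes linearly separable, exactly the Fig. 2(a) to Fig. 1 situation described above.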

3. If the distribution of the data looks like Fig. 3(a), some red dots that are surrounded by blue dots may be noise (i.e., data that is mistakenly labeled).  You want your algorithm to separate the data and give a result like Fig. 3(b).  However, if your algorithm makes a fuss about such noise, the result will look like Fig. 3(c), and that will drag down the performance.
Fig. 3 (a) (b)
(c)
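This trade-off can be sketched with a soft-margin SVM (the blob centers, the single flipped label, and the C values below are all made up): a moderate C treats the mislabeled point as noise, as in Fig. 3(b), while a very large C "makes a fuss" and carves out an island around it, as in Fig. 3(c):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two well-separated blobs: class 0 around (-2, 0), class 1 around (2, 0)
X0 = rng.normal(loc=(-2.0, 0.0), scale=0.4, size=(100, 2))
X1 = rng.normal(loc=(2.0, 0.0), scale=0.4, size=(100, 2))
noise = np.array([[-2.0, 0.0]])            # sits in the middle of class 0 ...
X = np.vstack([X0, X1, noise])
y = np.array([0] * 100 + [1] * 100 + [1])  # ... but is mislabeled as class 1

soft = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)   # tolerates the noise
hard = SVC(kernel="rbf", gamma=1.0, C=1e6).fit(X, y)   # insists on every label

print(soft.predict(noise))  # class 0: the noisy label is ignored
print(hard.predict(noise))  # class 1: an island is carved around the noise
```

In libsvm terms, C is the penalty for each margin violation, so a huge C forces the boundary to bend around every mislabeled point.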

4. If the distribution of the data looks like Fig. 4(a), this feature space may make no sense, in my opinion.  It probably means this feature space is not adequate to describe the system.  If your algorithm tries hard to separate these data into many small groups, it is probably doing it wrong.
Fig. 4 (a) (b)
In this case, you should probably abandon one of the features (or even both of them).  Or find new features that describe the system well by putting some restrictions on your model (such as regularization).  This may also explain why clustering can't give a better result than SVM: although clustering has quite few parameters (compared to the input data), trying to separate every data point perfectly is meaningless.
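One common restriction of this kind is L1 regularization, which drives the weights of unhelpful features toward zero, effectively abandoning them for you.  A small sketch (both features below are synthetic: one decides the label, the other is pure noise):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)   # this feature decides the label
noise = rng.normal(size=n)         # this feature is pure noise
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

# L1-penalized linear SVM: the sparsity-inducing penalty prunes
# the weight of the feature that does not help the classification
clf = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
w_informative, w_noise = clf.coef_[0]
print(w_informative, w_noise)
```

The weight on the noise feature ends up at (or very near) zero, which is the "abandon one of the features" outcome, found automatically by the model.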

    Now, let us go back to the high-dimensional feature space.  If your algorithm has many parameters and can fit itself to many circumstances (such as a fully-connected neural network), it may get into trouble whenever the current data point is noise or some of the features describe the system poorly.  Instead, an algorithm that can ignore some noise and down-weight the features that do not help to classify objects may give a better result.
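To illustrate (with a made-up toy set: two blobs in 2D plus a handful of flipped labels), a many-parameter fully-connected network can bend around the noisy labels in the training set, while a linear SVM cannot, which is exactly what keeps the SVM's boundary simple:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X0 = rng.normal((-2.0, 0.0), 0.5, (40, 2))
X1 = rng.normal((2.0, 0.0), 0.5, (40, 2))
X = np.vstack([X0, X1])
y_noisy = np.array([0] * 40 + [1] * 40)
y_noisy[:5] = 1                      # mislabel five class-0 points

svm = SVC(kernel="linear", C=1.0).fit(X, y_noisy)
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=5000,
                    random_state=0).fit(X, y_noisy)

# the flexible network tends to reproduce even the flipped labels,
# while the linear SVM cannot bend around the five noisy points
print(svm.score(X, y_noisy), mlp.score(X, y_noisy))
```

Fitting the noisy training labels better is precisely the failure mode of Fig. 3(c): the extra capacity is spent memorizing noise instead of generalizing.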

Multiple tasks in a Convolutional Neural Network (CNN) - the reuse of the extracted features

Overview

     This article discusses one approach to performing multiple tasks in computer vision: extracting the features from a CNN and sending those features to traditional classification methods (such as SVM, Joint Bayesian, etc.).  This article also shows the range of accuracy, so that one can judge whether this approach satisfies one's own purposes.

Motivation

     In some circumstances, we want more than one task to be executed.  For example, suppose we want to develop an app that can recommend different products to different consumer groups by examining age, gender, whether the user wears glasses, etc., and judge their response to the advertisement by examining their facial expressions.  It seems that multiple CNNs would have to be executed at a time.
     However, passing an image through even one CNN is time-consuming, not to mention that there are so many tasks to be performed.  There should be an alternative.

Introduction

     As shown below, a Convolutional Neural Network (CNN) is usually composed of several convolution layers at the front of the network and fully-connected layers at its end.

The earlier part of the network is believed to perform feature extraction (i.e., to extract the characteristics of the image), while the later layers, especially the fully-connected layers, are believed to perform the classification of the image using the previously derived features.
     Due to this characteristic of CNNs, it is straightforward to replace the classification part (i.e., the fully-connected layer) with a traditional classifier (such as an SVM), both for efficiency and even for accuracy (see this article).
     In this article, we approach multiple tasks by training a CNN for one task, then performing the other tasks by extracting features from that CNN and feeding those features to traditional classifiers.  We believe that the features used by one task may be close to those needed by another task, as long as the two tasks are similar to each other.
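The pipeline can be sketched with small stand-ins (an MLP on scikit-learn's digits data instead of a real CNN, and "is the digit even" as a hypothetical second task): train the network for task one, reuse its last hidden layer as features, and feed those features to an SVM for task two:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = X / 16.0
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# task one: recognize the digit itself (the "age" task in our analogy)
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                    random_state=0).fit(Xtr, ytr)

def hidden_features(net, X):
    # manual forward pass up to the last hidden layer (ReLU units)
    a = X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0.0)
    return a

# task two: reuse those features for a different label (the "smile" analogue)
svm = SVC(kernel="linear").fit(hidden_features(net, Xtr), ytr % 2 == 0)
acc = svm.score(hidden_features(net, Xte), yte % 2 == 0)
print(acc)
```

The second task never touches the network's weights; it only consumes the activations, which is why adding a task costs one cheap SVM instead of one more forward pass through another network.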

CNN + traditional classifiers

     In our case, we have at least three tasks to perform: to infer whether the user is smiling, whether they are wearing glasses, and the user's age.  We start from the DEX model (which is designed for age estimation), take features from the first fully-connected layer (so that the feature size is large enough for further processing) and, finally, feed those features to a traditional classifier (an SVM in our case).


Result


Age

      Since we do not re-train the CNN, the error of the Age task should remain the same as in the original paper (MAE = 3.221).
-------------------------------------------------------------------------------------------------------------------------------------------------------------

      At the beginning of this project, our team had a heated discussion about whether the result of the Smile task would be more accurate than that of the Glasses task.  One might say that, since glasses are more apparent than any other texture of the human face, the Glasses task should get the better result.  On the other hand, one might also say that, since the DEX model is trained to estimate human age, it might tend to extract only the features related to age (e.g., wrinkles), and those features are more suitable for judging smiling.

Smile

      In this task, we only output two results: Smile or Not-Smile.  We collected about 3000 training samples for each class to train the SVM.
     The accuracy is around 88%, which could be better if we trained a model especially for detecting smiles.  However, this accuracy is suitable for our purpose: rather than making users wait, some mistakes are tolerable.

Glasses

      In this task, we output three results: No Glasses, Glasses, and Sunglasses.  We also collected around 3000 training samples for each class.
      The accuracy is only around 84%.  It looks like the Smile task wins!  To give a brief interpretation after the fact: this might be because the DEX model is trained to predict age.  The later parts of the network might therefore tend to ignore the features that belong to the glasses, since wearing glasses does not change the age of a human.
      Furthermore, the result for glasses might get better if we extracted features from an earlier layer.  However, the features of the earlier layers are quite large (about 10 times as many as we used), which would drag down the efficiency.

Conclusions

    It's possible to reuse features extracted by a CNN that was trained for another purpose.  The performance is around 80%~90%, depending on how close the current task is to the original task (for which the CNN was designed).