Overview
This article will discuss one of the approaches to perform multi-tasks in computer vision: By extracting the features from a
CNN and send those features to the traditional classification methods (such as
SVM,
Joint Bayesian... etc). This article also shows the range of accuracy, so that one can judge if this approach satisfies his own purposes.
Motivation
In some circumstances, we want more than one task is executed. For example, supposed that we want to develop an APP that can recommend different products to different consumer groups by examining the age, gender, wear glasses... etc. And judge their response to such advertisement by examining their facial expressions. It seems that multiple
CNN should be executed at a time.
However, passing an image through one
CNN is time-consuming. Not to mention that there're so many tasks to be performed. There should be other alternatives.
Introduction
As shown below, the Convolution Neural Network (
CNN) is usually composed of several
Convolution Layer at the front of the network, as well as the
Fully-Connected Layer at its end.
The former of the network are believed to perform the feature extraction (i.e. to extract the characteristic of such image). While the later layer, especially the
Fully-Connected Layer, is believed to perform the classification of such image by using the previously derived features.
Due to such characteristics of
CNN, it is straightforward to replace the classification part (i.e. the
Fully-Connected Layer) by the traditional classifiers (such as
SVM). For its efficiency and even its accuracy (see
this article).
In this article, we approach the multi-tasks by training a
CNN for one task. And perform other tasks by extracting features from that
CNN and input those features to the traditional classifier to perform other tasks. We believe that the features that used by one task may be close to the feature of the other task, if both of the tasks are similar to each other.
CNN + traditional classifiers
In our case, we have at least three tasks to perform: to infer whether the user is smiling, wearing glasses, and the user's age. We start from the
DEX model (which is designed for age deduction), take features from the first the
Fully-Connected Layer (so that the size of the features is large enough to do further processing) and finally, input those features to the traditional classifier (
SVM in our cases).
Result
Age
Since we do not re-train the
CNN, the error of Age should remain the same (i.e. from its paper: MAE = 3.221).
------------------------------------------------------------------------------------------------------------------------
-------------------------------------
On the beginning of this project, our team has a heated discussion about whether the result of
Smile-Task is more accurate than the
Glasses-Task. One might say that: since the glasses is more apparent than any other textures of the human faces, the
Glasses-Task should get the better result. On the other hand, the other may say: since the
DEX model is trained for detecting the human age, it might tend to only extract the features that relate to it (e.g. the wrinkle) and these features are more suitable for judging smiling.
Smile
In this task, we only output two results:
Smile or
Not-Smile. We collect about 3000 training data for each class to train
SVM.
The accuracy is around 88% which can be better if we train a model especially for detecting
Smile. However, this accuracy is suitable for our purpose. Instead of letting users wait, some mistakes are tolerable.
Glasses
In this task, we output three results:
No Glasses,
Glasses, and
Sun Glasses. We also collect around 3000 training data for each class.
The accuracy is only around 84%. Looks like the
Smile-Task win! To give a brief interpretation after the result, this might due to the
DEX model is trained for predicting age. Therefore, the later parts of the network might tend to ignore some features that belong to the glasses, since wearing a glasses does not change the age of a human.
Furthermore, the result of glasses may get better if we extract features from the former layers. However, since the features of the former layer is quite large (10 times as many features than we have used) and will drag down the efficiency.
Conclusions
It's possible to use features that extracted by
CNN that is trained for another purpose. And the performance is around 80%~90% depend on whether current task is close to the original task (for which the CNN is designed).