Overview
This article will discuss one of the approaches to perform multi-tasks in computer vision: By extracting the features from a CNN and send those features to the traditional classification methods (such as SVM, Joint Bayesian... etc). This article also shows the range of accuracy, so that one can judge if this approach satisfies his own purposes.Motivation
In some circumstances, we want more than one task is executed. For example, supposed that we want to develop an APP that can recommend different products to different consumer groups by examining the age, gender, wear glasses... etc. And judge their response to such advertisement by examining their facial expressions. It seems that multiple CNN should be executed at a time.However, passing an image through one CNN is time-consuming. Not to mention that there're so many tasks to be performed. There should be other alternatives.
Introduction
As shown below, the Convolution Neural Network (CNN) is usually composed of several Convolution Layer at the front of the network, as well as the Fully-Connected Layer at its end.Due to such characteristics of CNN, it is straightforward to replace the classification part (i.e. the Fully-Connected Layer) by the traditional classifiers (such as SVM). For its efficiency and even its accuracy (see this article).
In this article, we approach the multi-tasks by training a CNN for one task. And perform other tasks by extracting features from that CNN and input those features to the traditional classifier to perform other tasks. We believe that the features that used by one task may be close to the feature of the other task, if both of the tasks are similar to each other.
CNN + traditional classifiers
In our case, we have at least three tasks to perform: to infer whether the user is smiling, wearing glasses, and the user's age. We start from the DEX model (which is designed for age deduction), take features from the first the Fully-Connected Layer (so that the size of the features is large enough to do further processing) and finally, input those features to the traditional classifier (SVM in our cases).Result
Age
Since we do not re-train the CNN, the error of Age should remain the same (i.e. from its paper: MAE = 3.221).-------------------------------------------------------------------------------------------------------------------------------------------------------------
On the beginning of this project, our team has a heated discussion about whether the result of Smile-Task is more accurate than the Glasses-Task. One might say that: since the glasses is more apparent than any other textures of the human faces, the Glasses-Task should get the better result. On the other hand, the other may say: since the DEX model is trained for detecting the human age, it might tend to only extract the features that relate to it (e.g. the wrinkle) and these features are more suitable for judging smiling.
Smile
In this task, we only output two results: Smile or Not-Smile. We collect about 3000 training data for each class to train SVM.The accuracy is around 88% which can be better if we train a model especially for detecting Smile. However, this accuracy is suitable for our purpose. Instead of letting users wait, some mistakes are tolerable.
Glasses
In this task, we output three results: No Glasses, Glasses, and Sun Glasses. We also collect around 3000 training data for each class.The accuracy is only around 84%. Looks like the Smile-Task win! To give a brief interpretation after the result, this might due to the DEX model is trained for predicting age. Therefore, the later parts of the network might tend to ignore some features that belong to the glasses, since wearing a glasses does not change the age of a human.
Furthermore, the result of glasses may get better if we extract features from the former layers. However, since the features of the former layer is quite large (10 times as many features than we have used) and will drag down the efficiency.
沒有留言:
張貼留言