Monday, May 14, 2018

Notes on applying the LSTM in Computer Vision

Overview

  This article illustrates several key points to note when combining a CNN with an LSTM for computer vision tasks.

Introduction

  Nowadays, the Convolutional Neural Network (CNN) shows great success in many computer vision tasks, such as image classification, object detection, and object segmentation.  However, most of these algorithms can only be applied to a single image (or, a single frame).  For example, most object detectors can detect and highlight a car driving along a road.  Since the object detector is designed to detect the car in a single frame (i.e. it cannot take temporal information into consideration), it may fail to detect the car while the car is occluded by a tree.  A human, however, can still estimate the position of the car, because the brain takes the history of the object of interest into consideration.
  To overcome this issue, it is intuitive to combine the CNN with the LSTM.  However, there are several tricks to note when applying the LSTM in computer vision.

Notes:

1. Adding dropout and regularization

  The LSTM overfits as easily as a fully-connected layer, not to mention that an LSTM cell can be seen as a combination of four fully-connected layers.
  Fortunately, TensorFlow provides a dropout wrapper for the LSTM that can perform dropout between stacked LSTMs (Note: this paper states that it is better to apply dropout between the LSTM cells than inside the cell).  See this website for more detail.
  For the regularization, you need to get all of the trainable variables first, then filter out the variables of the LSTM by their name scope.  See this website for more detail.
  Finally, the wrapper function of LSTM is defined as follows:
import tensorflow as tf

def LSTM(name_, inputTensor_, numberOfOutputs_, isTraining_, dropoutProb_=None):
    # Note: 'layerSettings' and 'L2_Regularizer' are helpers defined elsewhere in the project.
    with tf.name_scope(name_):
        cell = tf.nn.rnn_cell.LSTMCell(num_units=numberOfOutputs_,
                                       use_peepholes=True,
                                       initializer=layerSettings.LSTM_INITIALIZER,
                                       forget_bias=1.0,
                                       state_is_tuple=True,
                                       activation=tf.nn.tanh,
                                       name=name_+"_cell")

        # Perform dropout between the stacked LSTMs (not inside the cell).
        if dropoutProb_ is not None:
            dropoutProbTensor = tf.cond(isTraining_, lambda: dropoutProb_, lambda: 1.0)
            cell = tf.nn.rnn_cell.DropoutWrapper(cell,
                                                 input_keep_prob=dropoutProbTensor,
                                                 output_keep_prob=dropoutProbTensor)

        # The cell state (c, h) is fed by the user through these placeholders.
        statePlaceHolder = tf.nn.rnn_cell.LSTMStateTuple(
                               tf.placeholder(layerSettings.FLOAT_TYPE, [None, numberOfOutputs_]),
                               tf.placeholder(layerSettings.FLOAT_TYPE, [None, numberOfOutputs_]))

        outputTensor, stateTensor = tf.nn.dynamic_rnn(cell=cell,
                                                      initial_state=statePlaceHolder,
                                                      inputs=inputTensor_)

        # Add the L2 regularization loss for the LSTM weights (biases are excluded),
        # filtering the trainable variables by the name scope.
        for eachVariable in tf.trainable_variables():
            if name_ in eachVariable.name:
                if ('bias' not in eachVariable.name) and (layerSettings.REGULARIZER_WEIGHTS_DECAY is not None):
                    regularizationLoss = L2_Regularizer(eachVariable)
                    tf.losses.add_loss(regularizationLoss, loss_collection=tf.GraphKeys.REGULARIZATION_LOSSES)

    return outputTensor, stateTensor, statePlaceHolder

2. Unrolling the LSTM

  When training single-frame algorithms, the batch data has the shape (batch_size, w, h, c).  When training video algorithms, the batch data has the shape (batch_size, unrolled_size, w, h, c).  Here, batch_size means how many videos are in the batch, while unrolled_size means how many frames per video.
  tf.nn.dynamic_rnn(), which takes the input tensor and the LSTM cell as arguments, unrolls the input along the second dimension and feeds it into the LSTM cell.
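  The following is a minimal sketch of this shape handling (the image size, the tiny CNN, and the number of LSTM units are arbitrary choices for illustration, not the ones used in my project): the unroll axis is folded into the batch axis so the CNN processes every frame, and the per-frame features are then reshaped back into a sequence for tf.nn.dynamic_rnn():

import tensorflow as tf

videoBatch = tf.placeholder(tf.float32, [None, None, 64, 64, 3])   # (batch_size, unrolled_size, w, h, c)
batchSize = tf.shape(videoBatch)[0]
unrollSize = tf.shape(videoBatch)[1]

# Fold the unroll axis into the batch axis so the CNN processes every frame independently.
frames = tf.reshape(videoBatch, [-1, 64, 64, 3])                    # (batch_size*unrolled_size, w, h, c)
conv = tf.layers.conv2d(frames, filters=8, kernel_size=3, activation=tf.nn.relu)
features = tf.reduce_mean(conv, axis=[1, 2])                        # global average pooling -> (batch_size*unrolled_size, 8)

# Restore the (batch, unroll) layout; dynamic_rnn() then unrolls along the second (time) dimension.
featureSequence = tf.reshape(features, [batchSize, unrollSize, 8])
featureSequence.set_shape([None, None, 8])                          # keep the feature depth static for the LSTM cell
cell = tf.nn.rnn_cell.LSTMCell(num_units=16)
outputTensor, stateTensor = tf.nn.dynamic_rnn(cell=cell, inputs=featureSequence, dtype=tf.float32)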

3. Start training with a small unroll size

  Several studies (such as Re3 and here) show that one should start with a larger batch_size and a smaller unroll_size when training a network with an LSTM; otherwise, the network will not converge.  Namely, one should start with a bunch of videos (say, 40 videos) but only a few frames per video (say, 2 frames).  Then, double the number of unrolls and halve the batch_size whenever the loss reaches a plateau, as suggested by Re3.  For example, input 20 videos with 4 frames per video after the loss reaches its first plateau.
  Nonetheless, this situation did not occur in my recent project: without the above method, the network still converged easily, and using the above method did not help the network converge to a lower minimum either.  However, I think it is still advantageous to start training with the above method.
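  As a rough sketch of this schedule (SampleBatch() and TrainOneStep() are hypothetical stand-ins for the project's data loader and training step, and the plateau test here is just a simple patience counter of my own choosing):

MAX_STEPS = 100000
batchSize, unrollSize = 40, 2          # many videos, few frames per video
bestLoss, patience = float('inf'), 0

for step in range(MAX_STEPS):
    batchOfVideos = SampleBatch(batchSize, unrollSize)   # hypothetical data loader
    loss = TrainOneStep(batchOfVideos)                   # hypothetical training step

    if loss < bestLoss - 1e-4:
        bestLoss, patience = loss, 0
    else:
        patience += 1

    # Treat a long run without improvement as a plateau, then double the unrolls and halve the batch.
    if patience > 500 and batchSize > 1:
        batchSize //= 2
        unrollSize *= 2
        bestLoss, patience = float('inf'), 0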

4. Which should be fed into the current network: the output or the label from the previous time step?

  At deployment time in NLP, the output from the previous time step is also fed into the network, as shown in the following figure.

In training, however, the label from the previous time step should be fed instead, as shown below.
  Normally, this mechanism is not applied in computer vision.  However, it is applied in some special cases.  For example, in Re3, the user is asked to drag a bounding box around an object of interest.
  The Re3 network then tracks the object in the next frame by taking two cropped images as input:  The first cropped image is taken from the previous frame, cropped by the bounding box of the object of interest (actually, the crop is twice the size of the bounding box).  The second cropped image is a sub-image cropped in the same way as the first one (i.e. with the same (x, y, w, h) as the previous crop), but applied to the current frame.  The output of the network is the bounding box of the object of interest in the current frame.  The current frame is then cropped by this output bounding box and used as the previous frame for the next run.
  The way the current frame is cropped is similar to the mechanism shown in the figure above: during deployment, the crop coordinates of the current frame are determined by the previous output.  During training, however, one should crop by the previous label of the bounding box, and gradually increase the probability of using the previous output after some condition is met.  In Re3, the probability of using the previous output starts at 0 and is increased by 0.25 each time the unroll size is increased.
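  A minimal sketch of this choice (the function and variable names are made up for illustration; the exact schedule in Re3 may differ):

import random

def ChooseCropBox(previousLabelBox_, previousOutputBox_, probabilityOfUsingOutput_):
    # With a small (but growing) probability, behave like deployment and crop by the
    # network's own previous prediction; otherwise use the previous ground-truth box.
    if random.random() < probabilityOfUsingOutput_:
        return previousOutputBox_
    return previousLabelBox_

# e.g. start with probabilityOfUsingOutput = 0.0 and add 0.25 each time the unroll size is increased.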

5. The cell state of the LSTM should be maintained differently in training and in deployment
  While training the LSTM, the cell state is reset to zeros for each video.  Suppose the input batch has the shape (b, u, w, h, c) and the LSTM has N neurons; one should create the initial cell state as:
 initialCellState = tuple( [np.zeros([b, N])] * 2 )
 initialCellState = tf.nn.rnn_cell.LSTMStateTuple(initialCellState[0], initialCellState[1])
Note that the initialCellState includes both the cell state (c) and the hidden (output) state (h).

  While deploying, the cell state should be maintained by the user and passed to the feed_dict at each frame.  Furthermore, the user should fetch the values of the LSTM states by running the LSTM state tensors with sess.run().  The deployment procedure looks like:
  inputFeedDict = { self.net.inputImage : batchData.batchOfImages,
      ...
      self.net.isTraining : False
    }
  cellStateFeedDict = self.net.GetFeedDictOfLSTM(...)

  inputFeedDict.update(cellStateFeedDict)

  loss, listOfPreviousCellStates = session.run( [ self._lossOp] + self.net.GetListOfStatesTensorInLSTMs(),
                            feed_dict = inputFeedDict )

To decouple training/deployment from the network design, I define the interface of the networks as follows:
from abc import ABCMeta, abstractmethod

class NetworkBase:
 __metaclass__ = ABCMeta
 @abstractmethod
 def Build(self):
  pass

 @abstractmethod
 def GetListOfStatesTensorInLSTMs(self):
  pass

 @abstractmethod
 def GetFeedDictOfLSTM(self, BATCH_SIZE_, listOfPreviousStateValues_=None):
  pass
And one possible implementation of a network that contains one LSTM is shown as follows (see here for more detail):
class Net(NetworkBase):
 def __init__(...):
  ...

 def Build(self):
  ...

  out, self._stateTensorOfLSTM_1, self._statePlaceHolderOfLSTM_1 = LSTM( "LSTM_1",
           out,
           self._NUMBER_OF_NEURONS_IN_LSTM,
           isTraining_=self._isTraining,
           dropoutProb_=self._DROPOUT_PROB)
  ...



 def GetListOfStatesTensorInLSTMs(self):
  return [self._stateTensorOfLSTM_1]


 def GetFeedDictOfLSTM(self, BATCH_SIZE_, listOfPreviousStateValues_=None):
  if listOfPreviousStateValues_ is None:
   '''
       For the first time step (i.e. the first of the unrolls), there is no previous
       state; return a zero state.
   '''
   initialCellState = tuple( [np.zeros([BATCH_SIZE_, self._NUMBER_OF_NEURONS_IN_LSTM])] * 2 )
   initialCellState = tf.nn.rnn_cell.LSTMStateTuple(initialCellState[0], initialCellState[1])

   return {self._statePlaceHolderOfLSTM_1 : initialCellState }

  else:
   return { self._statePlaceHolderOfLSTM_1 : listOfPreviousStateValues_[0] }

Therefore, in training, one could just (see here for more detail):
  inputFeedDict = { self.net.inputImage : batchData.batchOfImages,
      ...
      self.net.isTraining : True,
    }

  '''
      For Training, do not use previous state.  Set the argument:
      'listOfPreviousStateValues_'=None to ensure using the zeros
      as LSTM state.
  '''
  cellStateFeedDict = self.net.GetFeedDictOfLSTM(batchData.batchSize, listOfPreviousStateValues_=None)
  inputFeedDict.update(cellStateFeedDict)

  session_.run( [self._optimzeOp],
         feed_dict = inputFeedDict )
While in deploying, one could simply (see here for more detail):
  inputFeedDict = { self.net.inputImage : batchData.batchOfImages,
      ...
      self.net.isTraining : False,
    }
  cellStateFeedDict = self.net.GetFeedDictOfLSTM(batchData.batchSize, self._listOfPreviousCellState)

  inputFeedDict.update(cellStateFeedDict)

  tupleOfOutputs = session.run( [ self._lossOp] + self.net.GetListOfStatesTensorInLSTMs(),
                feed_dict = inputFeedDict )
  listOfOutputs = list(tupleOfOutputs)
  batchLoss = listOfOutputs.pop(0)
  self._listOfPreviousCellState = listOfOutputs

6. Gradient Clipping

  It is well known that LSTMs suffer from exploding gradients.  Therefore, the gradients are often examined and clipped to a certain range.  See here for gradient clipping in TensorFlow.
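  As a minimal, self-contained sketch (the model and loss below are dummies, not the ones in my project), the usual pattern is to compute the gradients explicitly, clip them, and then apply them instead of calling minimize() directly:

import tensorflow as tf

# Dummy model and loss, for illustration only.
weights = tf.Variable(tf.random_normal([10, 1]))
inputs = tf.placeholder(tf.float32, [None, 10])
labels = tf.placeholder(tf.float32, [None, 1])
lossOp = tf.reduce_mean(tf.square(tf.matmul(inputs, weights) - labels))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
gradients, variables = zip(*optimizer.compute_gradients(lossOp))

# Either truncate each gradient to a fixed range...
clippedGradients = [tf.clip_by_value(g, -5.0, 5.0) for g in gradients]
# ...or rescale them jointly by their global norm:
# clippedGradients, _ = tf.clip_by_global_norm(gradients, clip_norm=5.0)

trainOp = optimizer.apply_gradients(list(zip(clippedGradients, variables)))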
  In Re3, the authors state that it is not necessary to clip the gradients if one applies the training strategy of starting with a few unrolls.  However, in my recent project, exploding gradients occurred even when I applied such a training strategy.

Friday, May 4, 2018

Violence Detection by CNN + LSTM

Overview

  This article shows the details of my recently developed project: Violence Detection.  The proposed approach outperforms the state-of-the-art methods while still processing the videos in real time.  A comparison between my method and the previous Conv3D-based work is also shown.


Introduction

  School fights were always a big issue while I was guarding the high school.  However, having staff monitor the surveillance cameras is infeasible, not to mention that they may have other responsibilities.  Fortunately, the recently risen AI (or, Deep Learning) techniques may be able to detect such anomalies automatically [1].  Such anomaly detection is very fast and can be used as a preprocessing step to filter out the normal surveillance videos, so that only the anomalous videos are sent to other, more accurate algorithms for further examination.
  The previous work on violence detection uses traditional features such as BoVW, STIP, and MoSIFT, and classifies the features with an SVM [2].  Ding et al. extract the spatial and temporal features by 3D convolution layers and classify the features by fully-connected layers (as shown in Fig. 1) [3].  However, neither of these methods supports variable-length videos well.  Moreover, the computational cost of 3D convolution grows rapidly with the depth of the temporal axis.


Fig. 1.  Violence detection by 3D convolutional networks in ref. [3].


  Moreover, the lack of richness in large-scale video datasets is also an issue.  Although the Sports-1M dataset [4] provides a million videos, most of the categories are about sports (i.e. the diversity of the videos is not as rich as in ImageNet).  Several studies imply that if the pre-trained model has already seen an object that will be used later in the transfer learning task, the performance on the target task will be better [4, 5].  Therefore, being able to use models pre-trained on ImageNet is also important, not to mention that plenty of such pre-trained models are available.
  In this work, a new network is proposed:  A CNN takes the input video frames and outputs features to a Long Short-Term Memory (LSTM) to learn global temporal features, and the features are finally classified by fully-connected layers.  This network can not only be built on models pre-trained on ImageNet, but also has the flexibility to accept variable-length videos, and it boosts the accuracy to 98.5% while still processing the images in real time (80 fps on an Nvidia GTX 1080 Ti).


Method

Network Architecture

  The proposed network architecture is shown in Fig. 2.  It has been shown that, in addition to adding an LSTM (which is supposed to extract global temporal features) after the CNN, the local temporal features that can be obtained from optical flow are also important [6].  Furthermore, it has been reported that the virtue of optical flow is due to its invariance in appearance as well as its accuracy at boundaries and at small displacements [7].  Therefore, in this work, the effect of optical flow is mimicked by taking two video frames as input.  The two input frames are processed by the pre-trained CNN.  The outputs of the bottom layer of the pre-trained model for the two frames are concatenated along the last channel and then fed into an additional CNN (labeled in orange in Fig. 2).  Since the outputs of the bottom layer are regarded as low-level features, this additional CNN is supposed to learn local motion features as well as appearance-invariant features by comparing the feature maps of the two frames.  The outputs of the top layer of the pre-trained network for the two frames are also concatenated and fed into another additional CNN to compare the high-level features of the two frames.  The outputs of the two additional CNNs are then concatenated and passed to a fully-connected layer and the LSTM cell to learn global temporal features.  Finally, the outputs of the LSTM cell are classified by a fully-connected layer containing two neurons that represent the two categories (fight and non-fight), respectively.


Fig. 2.  The proposed network architecture.  The layers that labeled by blue color are pre-trained on the ImageNet dataset and are frozen during training.  The layers that labeled by the orange color are trained on the video dataset.


  The pre-trained model is implemented by Darknet19 [8] due to its accuracy on ImageNet and its faster-than-real-time performance.  Since Darknet19 already contains 19 convolutional layers, the additional CNNs are implemented by residual layers [9] to avoid the degradation problem [9].
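  The following is a rough, simplified sketch of the data flow described above (the two-layer feature extractor is only a stand-in for the frozen Darknet19, and all layer sizes are made up for illustration):

import tensorflow as tf

def BottomAndTopFeatures(frame_, reuse_):
    # Stand-in for the pre-trained, frozen CNN; returns (low-level, high-level) feature maps.
    with tf.variable_scope("pretrainedCNN", reuse=reuse_):
        low = tf.layers.conv2d(frame_, 16, 3, strides=2, activation=tf.nn.relu, trainable=False, name="bottom")
        high = tf.layers.conv2d(low, 32, 3, strides=2, activation=tf.nn.relu, trainable=False, name="top")
        return low, high

previousFrame = tf.placeholder(tf.float32, [None, 64, 64, 3])
currentFrame = tf.placeholder(tf.float32, [None, 64, 64, 3])

# The two frames share the same frozen weights (variable reuse).
lowPrevious, highPrevious = BottomAndTopFeatures(previousFrame, reuse_=False)
lowCurrent, highCurrent = BottomAndTopFeatures(currentFrame, reuse_=True)

# Concatenate the two frames' feature maps along the channel axis and let the additional
# (trainable) CNNs learn local motion / appearance-invariant features by comparison.
lowPair = tf.concat([lowPrevious, lowCurrent], axis=-1)
highPair = tf.concat([highPrevious, highCurrent], axis=-1)
lowComparison = tf.layers.conv2d(lowPair, 32, 3, strides=2, activation=tf.nn.relu)
highComparison = tf.layers.conv2d(highPair, 32, 3, activation=tf.nn.relu)

merged = tf.concat([tf.layers.flatten(lowComparison), tf.layers.flatten(highComparison)], axis=-1)
frameFeatures = tf.layers.dense(merged, 128, activation=tf.nn.relu)
# 'frameFeatures' (one vector per frame pair) would then be unrolled in time, fed into the
# LSTM, and classified by a final 2-neuron fully-connected layer (fight / non-fight).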

Accuracy Evaluation

  The proposed model in this work can output a classification result per frame.  However, the previous research evaluates the accuracy at the video level.  To be able to compare with the previous work, the frame-level results are gathered and processed by the following strategy:  The video is classified into a certain category if and only if the number of continuous signals of that category is larger than a certain threshold.  Such a threshold can be derived by scanning the threshold from 0 to the length of the video and checking which threshold yields the best accuracy on the validation set, as shown in Fig. 3.  If multiple thresholds yield the same accuracy, the smallest one is chosen.
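  A minimal sketch of this threshold search (the function and variable names are made up; the frame-level predictions are assumed to be available as one array of 0/1 signals per video):

import numpy as np

def LongestPositiveRun(framePredictions_):
    # Length of the longest run of continuous positive (fight) frame signals.
    longest = current = 0
    for isFight in framePredictions_:
        current = current + 1 if isFight else 0
        longest = max(longest, current)
    return longest

def FindBestThreshold(listOfFramePredictions_, videoLabels_, maxVideoLength_):
    bestThreshold, bestAccuracy = 0, 0.0
    for threshold in range(maxVideoLength_ + 1):
        # A video is classified as 'fight' iff its longest run of positive frames exceeds the threshold.
        videoPredictions = [LongestPositiveRun(p) > threshold for p in listOfFramePredictions_]
        accuracy = np.mean(np.array(videoPredictions) == np.array(videoLabels_))
        # Strict '>' keeps the smallest threshold among those that yield the best accuracy.
        if accuracy > bestAccuracy:
            bestThreshold, bestAccuracy = threshold, accuracy
    return bestThreshold, bestAccuracy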




Fig. 3.  The threshold-accuracy curve on the validation set.  The horizontal axis represents the threshold on the number of continuous frames that have a positive signal.  The vertical axis represents the accuracy at that threshold on the validation set.  In the figure, thresholds from 3 to 9 all yield the best accuracy.  The smallest threshold (i.e. threshold = 3) is chosen so that continuous false positives in the test set can be reflected by this metric.

Gradient Clipping

  It is well known that the gradients of a recurrent network may increase rapidly due to the long-term components [10].  The usual way to deal with the exploding gradient is to truncate the gradients so that they remain in a reasonable range.  Several studies solve this issue by another approach: start training with a few unrolls, and then double the number of unrolls when the loss reaches a plateau [5].  With the second approach, they found that it is not even necessary to clip the gradients.  They also state that without starting from small unrolls, the network may not converge at all [5].
  In this work, I found that the network converges easily even when the initial number of unrolls is set to the length of the videos.  However, the absence of gradient clipping makes the loss curve oscillate during training, even if training starts with a small number of unrolls.  Therefore, the gradients of the network are truncated to the range from -5.0 to 5.0.  Clipping the gradients to a smaller range (e.g. from -1.0 to 1.0) has also been tested; however, my experiments show that this makes it hard for the network to converge to a lower minimum.


Results

Experiment on the Hockey dataset

  The Hockey dataset proposed by Bermejo et al. has 500 fighting clips and 500 non-fighting clips collected from hockey games [2].  Following the experiment proposed by Ding et al. [3], the dataset is further split into the following configuration: 400 clips (including 200 fighting clips and 200 non-fighting clips) for testing, 500 clips for training, and 100 clips for validation.  The result is shown in Table 1.  One can see that the proposed method outperforms the other state-of-the-art methods.



Method                                           Accuracy
STIP(HOG)+HIK with 1000 vocabulary [3]           84.25%
STIP(HOF)+HIK with 1000 vocabulary [3]           78.00%
STIP(HOG+HOF)+HIK with 1000 vocabulary [3]       78.50%
MOSIFT+HIK with 1000 vocabulary [3]              90.90%
Conv3D [3]                                       91.00%
Darknet19 + Residual Layers + LSTM               98.50%
Table 1.  The comparison between the previous methods and the proposed method.

The Single Frame Baseline

  It has been reported that single frame models (i.e. models that do not consider temporal information) already have strong performance [4].  This may be due to the fact that several categories in video classification tasks (such as Sports-1M and UCF-101) can be recognized from the scene or the background of the videos (such as football or swimming).  The network does not necessarily need to learn the motion features of the moving objects.
  In this work, however, all of the videos are shot in a hockey field, and several frames are necessary even for human eyes to make a judgment.  Therefore, the performance of the single frame model was not expected to be as good as that of the models that take temporal information into consideration.  Nevertheless, to compare the proposed method with a single frame model, a simple single frame network has also been built.  As shown in Fig. 4, the single frame model takes the output of Darknet19 and sends the output feature map into three fully-connected layers to classify the input.

Fig. 4.  The single frame model.


  The result of the comparison is shown in Table 2.  Surprisingly, the single frame model also achieves a very high video-level accuracy.  However, the frame-level accuracy of the single frame model is much lower than that of the network that considers temporal information.  Moreover, its threshold on the number of continuous positive signals is much larger than that of the network with the LSTM unit.  This is reasonable, since the single frame model does not have any temporal information, and the only way to decrease misjudgments is to increase the threshold on continuous positive signals.


Method                                  Threshold    Frame accuracy    Video accuracy
Darknet19 + 3Fc                         14           93.77%            96.00%
Darknet19 + Residual Layers + LSTM      3            97.81%            98.50%
Table 2.  The comparison between the single frame model and the proposed method.


Conclusion

  In this article, a new network architecture has been proposed.  Part of the network can use models pre-trained on the ImageNet dataset, and the other part is supposed to be able to extract both global and local temporal features.  In addition to its high accuracy in detecting violence, the proposed method offers real-time processing speed, the ability to detect violent events frame by frame, and the ability to support variable-length videos, as shown in the following video.  The source code of this project is available here.


Reference







Friday, December 29, 2017

TensorFlow not running on GPU?

Problem

     You're sure you installed the GPU version of TensorFlow with:
pip install tensorflow-gpu
However, you find that the computation is extremely slow and your CPU load is pretty high.

Solution

    This might be because of the environment variable that controls which graphics cards should be used:  CUDA_VISIBLE_DEVICES should be set to 0 if you have only one GPU.  In my case, I had downloaded a deep learning project that assigned this variable to an empty string (CUDA_VISIBLE_DEVICES = ""), resulting in the project running on the CPU.
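    A minimal sketch to check this (set the variable before importing TensorFlow):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # "" would hide every GPU and silently fall back to the CPU

import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())     # should contain a '/device:GPU:0' entry
print(tf.test.is_gpu_available())          # True if the GPU build and drivers are working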

Batch Normalization in TensorFlow

Overview

    This article will not illustrate the basic concept of Batch Normalization.  Instead, it will focus on the implementation details of Batch Normalization in TensorFlow.

Introduction

    Batch Normalization can make the convergence of neural networks easier, and sometimes can even improve the accuracy.  The formula is shown below:

y = gamma * (x - mean) / sqrt(variance + epsilon) + beta

where

mean     = (1/m) * sum_i( x_i )              (the mean over the mini-batch)
variance = (1/m) * sum_i( (x_i - mean)^2 )   (the variance over the mini-batch)

Here m is the mini-batch size, gamma (the scale) and beta (the offset) are learnable parameters, and epsilon is a small constant for numerical stability.  At test time, the mini-batch statistics are replaced by averages collected during training.

   The idea is simple and elegant.   However, when it comes to implementation, it becomes a little tricky due to the average parts (the average of the means and the average of the variances):  Calculating such averages over the whole training set seems infeasible.
     A common way to solve this is to invoke the Moving Average algorithm.  In this case, you cannot merely wrap Batch Normalization in a function and return its output tensor; you should also return the update operations of the Moving Average calculation, and session run these update operations after you backpropagate the neural network in each step.  Therefore, you should expect some kind of update operation after each training step (at least implicitly), no matter which Batch Normalization API you use.
     TensorFlow provides two Batch Normalization APIs: tf.nn.batch_normalization() and tf.layers.batch_normalization().  The following sections use these different APIs to implement a function called BatchNormalization() and compare their performance.

tf.nn.batch_normalization:

    This is the low-level API for Batch Normalization:  You not only need to session run the update operations at each training step, but also need to calculate the averages of the mean and variance, as well as create variables such as gamma and beta, all on your own.  The code is shown as follows (ref: [1], [2]):
def BatchNormalization(isTraining_, currentStep_, inputTensor_, isConvLayer_, layerName_="BatchNorm"):
        with tf.variable_scope(layerName_):
                currentBatchMean = None
                currentBatchVariance = None
                outputChannels = None
                if isConvLayer_:
                        currentBatchMean, currentBatchVariance = tf.nn.moments(inputTensor_, [0, 1, 2])
                else:
                        currentBatchMean, currentBatchVariance = tf.nn.moments(inputTensor_, [0])

                averageCalculator = tf.train.ExponentialMovingAverage(decay=0.99,
                                                                      num_updates=currentStep_)
                updateVariablesOperation = averageCalculator.apply( [currentBatchMean, currentBatchVariance] )

                totalMean = tf.cond(isTraining_,
                                    lambda: currentBatchMean, lambda: averageCalculator.average(currentBatchMean) )

                totalVariance = tf.cond(isTraining_,
                                        lambda: currentBatchVariance, lambda: averageCalculator.average(currentBatchVariance) )

                outputChannels = int(inputTensor_.shape[-1])
                gamma = tf.Variable( tf.ones([outputChannels]) )
                betta = tf.Variable( tf.zeros([outputChannels]) )
                epsilon = 1e-5
                outputTensor = tf.nn.batch_normalization(inputTensor_, mean=totalMean, variance=totalVariance, offset=betta,
                                                         scale=gamma, variance_epsilon=epsilon)
                return outputTensor, updateVariablesOperation
Note that, as remarked in ref [1], one should assign the current training step to num_updates in the construction of tf.train.ExponentialMovingAverage() to "prevent from averaging across non-existing iterations".  I think this just means that the variables are randomly initialized in the first steps, so their importance should be scaled down.  If you don't assign num_updates, according to the TensorFlow documentation, the mean is calculated simply by:
totalMean = (1 - decay)*currentBatchMean + decay*totalMean

And if you assign the current training step to num_updates, the mean will be calculated as follows:
decay = min(decay, (1 + step)/(10+step) )
totalMean = (1 - decay)*currentBatchMean + decay*totalMean

Usage:

You can build your net as follows:
class AlexnetBatchNorm(SubnetBase):
        def __init__(self, isTraining_, trainingStep_, input_, ...):
                self.isTraining = isTraining_
                self.trainingStep = trainingStep_
                self.input = input_
                ...

        def Build(self):
                net = ConvLayer(self.input, 3, 8, stride_=1, padding_='SAME', layerName_='conv1')
                net, updateVariablesOp1 = BatchNormalization(self.isTraining, self.trainingStep, net, isConvLayer_=True)
                net = tf.nn.relu(net)

                net = ConvLayer(net, 3, 16, stride_=1, padding_='SAME', layerName_='conv2')
                net, updateVariablesOp2 = BatchNormalization(self.isTraining, self.trainingStep, net, isConvLayer_=True)
                net = tf.nn.relu(net)

                ...

                updateOperations = tf.group(updateVariablesOp1, updateVariablesOp2, ...)
                return net, updateOperations
where ConvLayer() is a simple wrapper for a convolution layer.  The isTraining_, trainingStep_, and input_ are the placeholders that will be fed when you session run.
     Finally, session run the update operation for each training step:
while step < MAX_TRAINING_STEPS:
        session.run( trainOp,
                     feed_dict={self.net.isTraining : True,
                                self.net.trainingStep : step,
                                self.net.input : x,
                                ...})

        session.run( self.updateNetOp,
                     feed_dict={self.net.isTraining : False,
                                self.net.trainingStep : step,
                                self.net.input : x,
                                        ...})

You can refer to our GitHub repository (files: Train.py, src/subnet/AlexBatchNorm.py, src/layers/BasicLayers.py) for more detail.

Performance:

     The following figure shows the training and validation curves of models that apply Batch Normalization (the green and pink curves) and models that do not (the blue and orange curves):
One can see that the model with Batch Normalization converges very fast and even improves the result by a small percentage.

Recover:

    In the above implementation, we used tf.train.ExponentialMovingAverage to calculate the averages of the mean and variance.  However, its documentation suggests that when you try to restore the graph from checkpoints, you should do something like:
variables_to_restore = ema.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
For Batch Normalization, however, it seems that we can restore the network as usual (probably because tf.nn.batch_normalization() has already handled it?).  If I try to restore the network as suggested above, I get an error.
Therefore, one should just restore the network as follows:
modelLoader = tf.train.Saver()
modelLoader.restore(session, PATH_TO_MODEL_CHECKPOINT)
    One more proof of this is to re-train the model and see whether its loss starts from the same value as that of the pre-trained model:
As shown above, the blue curve is the validation of the pre-trained model.  The red curve is the model restored from the last step of the pre-trained model and trained again.  You can see that they match perfectly.  Therefore, we can conclude that the means and variances of the variables are perfectly recovered.


tf.layers.batch_normalization:

    TensorFlow also provides a high-level API for Batch Normalization.  However, its odd behavior made us decide not to use it in the end.
    The wrapper function of Batch Normalization that applies this API is simply:
def BatchNormalization(isTraining_, inputTensor_, layerName_=None):
        return tf.layers.batch_normalization(inputTensor_, training=isTraining_, name=layerName_)
    In this implementation, you can see that we don't need to specify whether the previous layer is a convolution, and the function just returns the output tensor.  However, this does not mean that you don't need to update the network.  The update operations are stored in a tf collection, and you should pull them out and session run them after each training step.  The documentation suggests that you can also declare a dependency between the update operations and the training operation, so that when you session run the training operation, the update operations are run automatically for you:
updateOps = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
optimizer = tf.train.AdamOptimizer(learning_rate=self.learningRate)
with tf.control_dependencies(updateOps):
                self.trainOp = optimizer.minimize(lossOp)

while step < MAX_TRAINING_STEPS:
        session.run( self.trainOp,
                     feed_dict={self.net.isTraining : True,
                                self.net.trainingStep : step,
                                self.net.input : x,
                                ...})

Performance:

The following is a comparison of the two implementations.
The two upper curves are the training & validation curves using the tf.layers.batch_normalization API, and the two lower curves are the training & validation curves using the tf.nn.batch_normalization API.  One can see that both of them converge to the same limit.  However, the tf.layers.batch_normalization API goes through a small hump around training epochs 45~50.  It is this strange behavior that made us finally choose the tf.nn.batch_normalization API.

Conclusion

     This article concentrates on the implementation details of Batch Normalization in TensorFlow.  Two approaches have been compared.  Moreover, this article also shows that one does not need to restore the tf.train.ExponentialMovingAverage variables manually (as the documentation suggests) when restoring the network.

Thursday, August 10, 2017

AI + AR: The new generation of Augmented Reality

Overview

    AI is hot and AR is awesome, so why not combine them?  From the very first time I understood how AI (strictly speaking, the Convolutional Neural Network (CNN)) works, I wanted to apply it to enhance the object detection of traditional AR.  However, it was not until recently that I decided to make it come true.  This article will show you how AI improves the user experience by presenting a simple demo of a PokemonGo-like AR application.

Introduction

    Augmented Reality (AR) can be roughly separated into two categories: camera-based and location-based.

Camera-based AR

    In this category, the application interacts with the user through the camera.  Basically, traditional camera-based AR detects the corners of an image, then uses machine learning algorithms to determine whether the target objects are present in the image, given the distribution of its corners.  The word 'corner' here means a pixel that has a prominent color difference from its surrounding pixels, for example, the tip of a roof.  As shown below (left), corners are marked by red circles:
    Traditional AR is usually used to recognize 2D images, as shown above (right).  However, when it comes to 3D object recognition, the distribution of corners varies when the object is viewed from different perspectives, as shown below:

Therefore, it's hard to apply the traditional AR to recognize 3D objects.

Location-based AR

    The most famous location-based AR application is PokemonGo.  It detects the location of the user and determines whether to place Game Objects to interact with the user.

    It also provides a camera-based mode of sorts, but there is no additional functionality whether you open the camera or not.  For example, if I find a Squirtle near a bush and then step back, the Squirtle does not look smaller or respond to my movement (such as attacking, chasing, or fleeing).  Moreover, why is the Squirtle in the bush and not in the pond?  Furthermore, if I keep stepping back, the Squirtle looks as if it is being dragged along by me, rather than chasing me onto the road.

    
This makes playing PokemonGo with the camera open feel cumbersome.  Furthermore, it is somewhat disappointing, since one of their advertisements from a couple of years ago made it look as if the user could interact with the surroundings (such as a Pikachu being found near a bush, or a Snorlax sleeping on a bridge).

   
     Nevertheless, with a CNN, many of the problems discussed above can be solved.  The next section will show you how a CNN can improve this user experience tremendously:

Improve AR by CNN

     Besides classification, another basic functionality of a CNN is object detection.  It can output the x, y, width, height, and probability (relative to the image coordinates) of the trigger objects (e.g. a bridge).  Such information can be used to place Game Objects (e.g. a Snorlax) on the image, so that you can only find the Snorlax if you open the camera.
     Furthermore, traditional camera-based AR can only recognize one specific bridge (e.g. the white bridge shown above) rather than all bridges, not to mention that it can only recognize that bridge from a certain perspective.  These disadvantages can be overcome if one applies a CNN to perform the object detection.
    I have spent a few weeks writing a simple application based on this idea:  If a bush is detected, it triggers the MushBro (sorry, to avoid copyright issues, I didn't use Pikachu) to jump out and interact with the player, as shown below.  You can see that the MushBro keeps standing near the bush.  Although it is not shown in the video, if one steps back, the MushBro keeps shrinking.  And if one keeps stepping back so that the bush becomes too small to be recognized, the MushBro flees.

Other application scenarios

Still not interested yet?  Consider the following scenario:
"You know there's Lapras located in Loch Ness, but only if you point your camera to the center of Loch Ness so that the Lapras will emerge."
and
"You have heard that there will be an Ho-Oh on the roof of an old temple.  And you can only find it when you aim your cell phone to the roof of that temple."

as well as
"After raining, you can go fishing near the newly created puddle."
or
"You have a small probability to find the Mew with while you point your cell phone to the rainbow."

Note that in the last two scenarios, the appearance of the Pokemons is even beyond the developer's expectations:  The Pokemons are placed by nature, not located by human hands.

Implementation

    I put the recognition system on a server and wrote an application that keeps sending images to that server.  When the recognition finishes, the server sends the recognition result back to the phone, and the phone application determines where to place the Game Object (the MushBro).
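    The following is a minimal sketch of the server side (it assumes a Flask-based HTTP endpoint and a hypothetical DetectBush() detector; the actual project may be organized differently):

import cv2
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

def DetectBush(image_):
    # Hypothetical placeholder for the CNN object detector; it should return
    # (x, y, w, h, probability) of the trigger object, or None if nothing is found.
    return None

@app.route("/detect", methods=["POST"])
def Detect():
    # The phone POSTs one encoded frame as the raw request body.
    frame = cv2.imdecode(np.frombuffer(request.data, np.uint8), cv2.IMREAD_COLOR)
    detection = DetectBush(frame)
    if detection is None:
        return jsonify({"found": False})
    x, y, w, h, probability = detection
    return jsonify({"found": True, "x": x, "y": y, "w": w, "h": h, "probability": probability})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)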


Conclusion

    This article shows how a CNN can improve the user experience of traditional AR, and its potential to be extended to more fascinating scenarios.
    However, there are also some drawbacks to this approach:

1. It's very expensive to perform the object recognition on the server.  In this case, reducing the recognition rate can reduce the burden on the server, for example, performing one object detection every several frames, or, once the trigger object has been detected, no longer sending new images.  Note: the latter solution will not change the Game Object's size according to the movement of the user, and might therefore degrade the user experience.

2. The Game Object wiggles.  This can be solved by applying algorithms to stabilize the recognition results.  As future work, I'll try several tracking algorithms, such as a simple LSTM, to perform this trick.

3. The user can use a certain image to cheat the object recognizer:  For example, if a Ho-Oh will appear at a certain temple in Tainan, the user can just show a picture of that temple to the camera to trigger the Ho-Oh without actually going there.  This can be solved by checking the location of the user first and only then performing the object recognition.  This check can also be used to reduce the burden on the recognition server.

Friday, July 14, 2017

How do you think about Neural Networks: the intuition perspective or the parameter perspective?

Overview

    This article compares two perspectives on Neural Networks (NN).  With different perspectives, one might design the NN in very different ways and end up with different accuracies.

Introduction

    Recently, there was a competition on counting the number of sea lions in pictures (as shown below).
The winner beat the other competitors with an error nearly 15% lower than the second place.  That's a huge success.  Here is a quote from a resident of Google Brain:
While everyone tried object detection/segmentation, winner is simple VGG16 regressor that directly outputs sea lion counts from raw images.

     This reminds me that my colleagues and I had a similar debate about another competition: The Nature Conservancy Fisheries Monitoring.


The Nature Conservancy Fisheries Monitoring contest

    In this competition, participants are asked to classify the fishes in the given pictures (as shown below).

There are many objects in each image.  However, only a small part of the image is relevant: the Neural Network (NN) should learn to ignore most of the irrelevant objects (such as the people and tools), focus on the fish, and classify its species.
    In this circumstance, how should we design and label the output of the NN?  One might say: since classification is relatively simpler than detection (which outputs bounding boxes of the fishes), it'd be better for the NN to just output the categories of the fishes.  Another might say: if the NN were a human, how could it learn without being given a hint (such as labeled bounding boxes, so that the NN learns to focus on the important part)?
    If we regard the first approach as "the parameter perspective" and the second approach as "the intuition perspective", the following lists several arguments for each perspective.

The parameter perspective

Supporters:  

    The number of parameters shared per output in classification is larger than in detection.  Namely, there are plenty of parameters that can be devoted to optimizing the classification.

Opponents:

    How does the NN learn to recognize fish if we don't give it a hint?  If we only label what kind of fish is in the image, the NN may end up learning some misleading features of the image.

The intuition perspective

Supporters:

    Like a human, if you have marked what's important in the picture (as shown below), the NN will learn to focus on the things that really matter.

Opponents:

    To do so, one may decrease the number of parameters per output to one-fifth of the original (from predicting only the category to predicting the category, the upper-left point, and the bottom-right point).  This shifts the NN from optimizing only the category to optimizing the category as well as other irrelevant variables (at least irrelevant to the competition).

We did not perform any experiment to test which perspective is correct.  However, the champion of the sea lion counting competition seems to support the parameter perspective.

Conclusion

    A Neural Network is a black box:  Instead of designing an algorithm by hand, one lets the model automatically learn to solve the problem by fitting its inner parameters.  Therefore, it might not carry as much meaning as human-designed algorithms.  However, from time to time, people tend to give meaning to the NN or to interpret its behavior.  That's fine.  Some interpretations of NNs even have strong evidence.
    However, never forget that it is also a model that contains a large number of parameters!  When the two points of view (the human intuition perspective and the parameter perspective) conflict with each other, the parameter perspective seems the better choice in my opinion.

Monday, June 12, 2017

Connect C++ to Python

Background

    In order to support features such as function overloading and templates, C++ applies the name mangling technique.   However, the C++ standard does not specify how to do it.  Therefore, every vendor who implements a C++ compiler may do it in their own way.  This fact makes C++ hard to connect with other languages.
    With this in mind, I had always thought the only way to connect C++ with other languages was through a C interface.  That was until I saw Boost.Python.

Boost.Python

    This module provides a convenient way to build such a connection.  With this module, you don't even need to change your C++ code at all (it is non-intrusive; see its Quick Start for more detail).  However, when it comes to more advanced features such as polymorphism and templates, one needs to put in some effort to make it work...