Thursday, August 10, 2017

AI + AR: The new generation of Augmented Reality

Overview

    AI is hot, and AR is awesome, so why not combine them?  Ever since I first understood how AI (strictly speaking, the Convolutional Neural Network, or CNN) works, I have wanted to apply it to enhance the object detection of traditional AR.  However, it was not until recently that I decided to make it happen.  This article shows how AI improves the user experience, by presenting a simple demo of a PokemonGo-like AR application.

Introduction

    Augmented Reality (AR) can be roughly separated into two categories: camera-based and location-based.

Camera-based AR

    In this category, the application interacts with the user through the camera.  Basically, traditional camera-based AR detects the corners of an image, then uses machine learning algorithms to determine whether the target object is present in the image, given the distribution of its corners.  The word 'corner' here means a pixel whose color differs prominently from that of its surrounding pixels, for example, the tip of the roof.  As shown below (left), corners are marked by red circles:
    Traditional AR is usually used to recognize 2D images, as shown above (right).  However, when it comes to 3D object recognition, the distribution of corners varies with the perspective from which the object is viewed, as shown below:

Therefore, it's hard to apply traditional AR to recognize 3D objects.
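As an aside, the kind of corner detection described above can be sketched in a few lines with OpenCV's Harris detector.  The filename and thresholds below are illustrative, not from any actual AR pipeline:

```python
import cv2
import numpy as np

# Harris corner detection works on a single-channel float image.
img = cv2.imread("roof.jpg")  # hypothetical input image
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# blockSize: neighborhood size; ksize: Sobel aperture; k: Harris free parameter.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep pixels whose corner response is within 1% of the maximum response,
# and mark them with red circles, as in the figure above.
for y, x in np.argwhere(response > 0.01 * response.max()):
    cv2.circle(img, (int(x), int(y)), 3, (0, 0, 255), 1)

cv2.imwrite("roof_corners.jpg", img)
```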

Location-based AR

    The most famous location-based AR application is PokemonGo.  It detects the user's location and determines whether to place Game Objects there to interact with the user.

    It also provides a camera-based mode of sorts, but the camera adds no real functionality: the game behaves the same whether it is open or not.  For example, suppose I find a Squirtle near a bush and then step back.  The Squirtle does not look smaller, nor does it respond to my movement (by attacking, chasing, or fleeing).  Moreover, why is the Squirtle in the bush and not in the pond?  Furthermore, if I keep stepping back, the Squirtle looks as if it is being dragged along by me, rather than chasing me onto the road.

This makes playing PokemonGo with the camera open feel cumbersome.  It is also somewhat disappointing, since one of their advertisements from a couple of years ago suggested that the player could interact with the surroundings (for example, a Pikachu could be found near a bush, or a Snorlax could be sleeping on a bridge).

     Nevertheless, with a CNN, many of the problems discussed above can be solved.  The next section shows how a CNN can improve the user experience tremendously.

Improving AR with CNN

     Besides classification, another basic capability of a CNN is object detection.  It can output the x, y, width, height (relative to the image coordinates) and probability of the trigger objects (e.g. the bridge).  Such information can be used to place the Game Objects (e.g. the Snorlax) on the image, so that you can find the Snorlax only when you open the camera.
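To make this concrete, here is a minimal placement sketch.  It assumes the detector returns normalized (x, y, w, h, label, prob) tuples; the function name, tuple layout, and threshold are illustrative, not an actual app's API:

```python
# Map a detected trigger object (e.g. a bridge) to a screen anchor for the
# Game Object (e.g. the Snorlax). Boxes are assumed normalized to [0, 1].
def place_game_object(detections, screen_w, screen_h,
                      trigger_label="bridge", min_prob=0.7):
    for x, y, w, h, label, prob in detections:
        if label == trigger_label and prob >= min_prob:
            anchor_x = (x + w / 2) * screen_w  # top-center of the box
            anchor_y = y * screen_h
            scale = w  # a wider box means a closer bridge, so a bigger sprite
            return anchor_x, anchor_y, scale
    return None  # no trigger object: draw nothing
```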
     Furthermore, traditional camera-based AR can recognize only one particular bridge (e.g. the white bridge, as shown above), not bridges in general, and only from a certain perspective at that.  These disadvantages can be overcome by applying a CNN to perform the object detection, since a CNN learns the general appearance of a class of objects rather than the corner distribution of a single instance.
    I spent a few weeks writing a simple application based on this idea: when a bush is detected, it triggers the MushBro (sorry, to avoid copyright issues, I didn't use Pikachu) to jump out and interact with the player, as shown below.  You can see that the MushBro keeps standing near the bush.  Although it is not shown in the video, if you step back, the MushBro keeps shrinking; and if you keep stepping back until the bush becomes too small to be recognized, the MushBro flees.
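The shrink-and-flee behavior can be sketched as a simple per-frame rule.  Everything here (the threshold, the returned actions) is a hypothetical reconstruction, not the app's actual code:

```python
MIN_BUSH_WIDTH = 0.05  # below this, the bush is "too small to be recognized"

def mushbro_action(bush_width):
    """Decide the MushBro's behavior from the bush's normalized box width
    (None when no bush is detected in the current frame)."""
    if bush_width is None or bush_width < MIN_BUSH_WIDTH:
        return ("flee", None)  # bush lost: the MushBro runs away
    # Stepping back shrinks the bush's box, so the sprite scales down with it.
    return ("stand", bush_width)
```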

Other application scenarios

Still not interested?  Consider the following scenarios:
"You know there's Lapras located in Loch Ness, but only if you point your camera to the center of Loch Ness so that the Lapras will emerge."
and
"You have heard that there will be an Ho-Oh on the roof of an old temple.  And you can only find it when you aim your cell phone to the roof of that temple."

as well as
"After raining, you can go fishing near the newly created puddle."
or
"You have a small probability to find the Mew with while you point your cell phone to the rainbow."

Note that in the last two scenarios, where the Pokemon appear is beyond even the developer's expectations: the Pokemon are placed by nature, not by human hands.

Implementation

    I put the recognition system on a server and wrote a phone application that keeps sending camera frames to that server.  When the recognition finishes, the server sends the result back to the phone, and the phone application determines where to place the Game Object (the MushBro).
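As a rough illustration of the phone side of this loop, here is a minimal sketch.  The endpoint URL, JSON schema, and helper name are my own placeholders, assuming a simple HTTP API rather than the app's actual protocol:

```python
import requests

SERVER_URL = "http://192.168.0.10:5000/detect"  # hypothetical server address

def detect_frame(jpeg_bytes):
    """Send one camera frame to the recognition server, return its detections."""
    resp = requests.post(
        SERVER_URL,
        data=jpeg_bytes,
        headers={"Content-Type": "image/jpeg"},
        timeout=2.0,  # drop slow responses rather than stall the render loop
    )
    resp.raise_for_status()
    # Assumed response shape:
    # [{"label": "bush", "x": 0.4, "y": 0.6, "w": 0.2, "h": 0.1, "prob": 0.93}]
    return resp.json()
```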


Conclusion

    This article showed how a CNN can improve the user experience of traditional AR, and how the idea can be extended to more fascinating scenarios.
    However, this approach also has some drawbacks:

1. It's very expensive to perform the object recognition on the server.  Reducing the recognition rate can lighten the server's burden, for example by performing one object detection per several frames, or by no longer sending new images once the trigger object has been detected.  Note that the latter solution means the Game Object's size no longer changes with the user's movement, and therefore might hurt the user experience.  (A minimal throttling sketch appears after this list.)

2. The Game Object wiggles.  This can be solved by applying algorithms that stabilize the recognition results across frames.  As future work, I'll try several tracking algorithms, such as a simple LSTM, to perform this trick.  (A simple smoothing sketch appears after this list.)

3. The user can cheat the object recognizer with a picture: for example, if a Ho-Oh appears at a certain temple in Tainan, the user could simply show a photo of that temple to the camera and trigger the Ho-Oh without actually going there.  This can be solved by checking the user's location first and only then performing the object recognition.  The same check can also reduce the burden on the recognition server.  (A location-gate sketch appears after this list.)
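For the first drawback, the frame-throttling idea can be sketched as follows, assuming the detect_frame() helper from the Implementation section; DETECT_EVERY is an illustrative tuning knob:

```python
DETECT_EVERY = 10     # run detection on one frame out of every ten
last_detections = []  # reuse the latest result for the frames in between

def on_camera_frame(frame_index, jpeg_bytes):
    """Call the server only every DETECT_EVERY frames; otherwise reuse the
    previous detections so the render loop never waits on the network."""
    global last_detections
    if frame_index % DETECT_EVERY == 0:
        last_detections = detect_frame(jpeg_bytes)
    return last_detections
```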
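For the second drawback, one simple stabilization option, a stand-in for the tracking/LSTM ideas mentioned above, is an exponential moving average over the detected box:

```python
ALPHA = 0.3  # 0 = frozen, 1 = raw detections; an illustrative value

def smooth_box(prev_box, new_box):
    """Blend the new (x, y, w, h) box with the previous smoothed one so the
    Game Object stops wiggling from frame to frame."""
    if prev_box is None:
        return new_box
    return tuple(ALPHA * n + (1 - ALPHA) * p for p, n in zip(prev_box, new_box))
```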
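For the third drawback, the location check can be sketched as a simple distance gate run before any detection; the coordinates and radius are made up for illustration:

```python
import math

TEMPLE_LAT, TEMPLE_LON = 23.00, 120.22  # hypothetical temple in Tainan
MAX_DISTANCE_M = 200.0                  # how close the user must be

def near_temple(lat, lon):
    """Equirectangular distance approximation; accurate enough at this scale.
    Frames from users outside the radius never reach the detector."""
    dx = math.radians(lon - TEMPLE_LON) * math.cos(math.radians(TEMPLE_LAT))
    dy = math.radians(lat - TEMPLE_LAT)
    return 6371000.0 * math.hypot(dx, dy) <= MAX_DISTANCE_M
```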