Gradient Descent for Deep Learning and Motion Estimation

# Gradient Descent for Deep Learning and Motion Estimation I found something really striking when I came across deep learning. It shares the same approach / technique as motion estimation, a completely different field, on which I worked during my PhD thesis. You should understand that the purpose of motion estimation is to find the amount of motion in a video sequence. With deep learning you can solve complex problems such as autonomous driving, handwriting- and/or letters’ recognition and object detection. To my surprise, these fields, with completely different goals, come close thanks to a mathematical tool called “The Gradient Descent”.

Let us take a step back, so that we can look at it closer.

## Deep Learning from a mathematical point of view

Please don’t get scared by the title, there are going to be just a few simple formulas and a lot of text. Many problems in computer vision are solved using convolutional neural networks. When you want to build such a network, you put together different types of layers (convolutional, Relu, pooling and so on) in a structure called a “model”. After this the real magic starts… the model learns. And this happens as follows:

In order to understand if the model correctly learns the problem of interest, you need an accuracy measure. In other words, a mathematical formula that defines the delta between the predictions of the model and real examples (this set of examples is called “test set”). This mathematical formula is called “the cost function” and is usually composed of hundreds of parameters. Ideally, you’d like the distance between the model predictions and real data to be zero. However, this is often not possible (we live in an imperfect world), and you therefore calculate the parameters that give the best result when used in the function = the minimal distance. In the following formula there is an example of cost function: where x are the input images, l(x) are the expected results, and y(x) are the neural network output.

In order to minimize the cost function, you use an algorithm (or one of its variations) called “Gradient Descent”. This algorithm solves non-linear systems of equations. Normally you can calculate the solution of a problem directly, if it is expressed as a linear system. A linear system is a set of linear equations that have a solution. Everyone loves linear systems because they are easy and fast to solve. An example of linear system follows: Sadly, it is not always possible to describe problems with linear systems, especially for really interesting problems. In the case of a non-linear system, help comes from my friend introduced in the beginning of the article, the gradient descent. This works as follows:

I am sure that the non-linear problem has a solution, however I cannot calculate it directly. Following the opposite direction of the gradient, the result will end up in a minimum of the cost function (this is mathematically proven but I will spare complicated details).

This process also works when the non-linear function is large, with hundreds of parameters, like in the case of deep learning functions.

## Motion Estimation and the mathematicians’ shortcut

You may wonder how this is connected with motion estimation. In order to calculate the motion available in a video, mathematicians describe the problem with a functional (a function of functions). Minimizing a functional is quite a challenge, so you take a shortcut: you minimize a function that has the same solution as the functional. Could you guess what type of function this is? Well done, if you guessed that this is a non-linear equation!

Let us look at the assumptions for motion estimation.

The functional for motion estimation is based on a simple assumption: all the points belonging to an object preserve their color while the object moves. If my face (the object being looked at) moves 10 cm to the right, all the colors of my face remain exactly the same. The best possible movement from a mathematical point of view is the one that preserves all the colors. In mathematical terms: where I is the input image, x and y are the coordinates of a point in the image and u,v is the movement in x and y direction.

Since movement happens in microseconds, it is supposed to be smooth in space and time. In other words, from one frame to the next, the movement is small and coherent with the movement of the neighboring points. For example, if I move my face, as described above, all the points belonging to my face move with the same speed in the same direction. (You would not recognize my face if they were behaving differently).

In mathematical terms: where ∇u and ∇v are respectively the gradients of the functions u and v. Like before u,v is the movement.

The result of all these assumptions put together is a functional. Following further mathematical steps, this ends up in a non-linear equation to be minimized. And here comes again my friend, the gradient descent.

Let us look at an example of motion estimation result. In the following image we see the arm of a person that is moving. The calculated motion is visualized with blue vectors. The direction and the size of each vector is proportional to the direction and size of the motion. The “descent” in the “The Gradient Descent” refers to the fact that one goes in the decreasing direction of the function, meaning that the function is minimized. You’re never alone with a Gradient Decent.