[K-MOOC]  Introductory Mathematics for Artificial Intelligence

[Cover image: cover-new-1.jpg]

                Translated by

  Sang-Gu LEE with Youngju NIELSEN, Yoonmee HAM

         from the original Korean text written by

      Sang-Gu LEE with Jae Hwa LEE, Yoonmee HAM, Kyung-Eun PARK


    Part Ⅴ. PCA and ANN

13. Artificial Neural Network

A neural network is a model of neurons, the basic unit of the nervous system, and performs tasks such as image recognition.

In this section, we will explain how the artificial neural network works.


  13.1 ANN

  13.2 How the artificial neural network works

  13.3 How the neural network learns

  13.4 Backpropagation (BP)


13.1 Artificial Neural Network

Neurons (nerve cells) in the human brain make <Neurotransmitters> from amino acids in the blood. Nerve stimulation occurs in one direction, and the signals are transmitted to other nerve cells nearby using the generated neurotransmitters. One nerve cell responds to signals (input signals) received from several other nerve cells. If the cell membrane reaches a particular threshold potential (threshold value), it transmits a signal (output signal) of a certain size to the next nerve cell. Since this is vector-in and vector-out, the process can be understood as the input vector being multiplied by a matrix to obtain the resulting vector. The key is to find this function, that is, this matrix.

In general, a nerve does not respond when a small stimulus is initially given. But as the intensity of the stimulus gradually increases and exceeds a threshold value, it suddenly responds with 'Ouch!'. The corresponding function can be considered a Heaviside function. But since the Heaviside function is not differentiable, we use the Sigmoid function (or ReLU function), which behaves similarly, as the activation function.


Machine Learning (ML) is the study of computer algorithms that improve automatically through experience. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

An artificial neural network (ANN) is a computational model based on the structure and functions of biological neural networks.

Information that flows through the network affects the structure of the ANN because a neural network changes or learns based on that input and output.


When we talk about machine learning in an ANN (artificial neural network), if there are hidden layers in the ANN, we call it deep learning. The process of creating and updating the weights between hidden layers in deep learning is called Backpropagation. The mathematical principle of this Backpropagation will be introduced; the Backpropagation algorithm explains what happens between the hidden layers in an ANN. The human brain has about 100 billion neurons and performs difficult tasks very well, such as image and pattern recognition. The following figure shows a model of a neuron.


[Figure: model of a neuron (ANN_neuron.svg)]

     [Source] https://commons.wikimedia.org/wiki/File:ANN_neuron.svg


A neural network is a model of a neuron, which is a basic unit of the nervous system. Neurons are in effect the primary transmission lines of the nervous system. One artificial neuron (node) receives multiple input signals and produces one output signal. This is similar to the way real neurons transmit information by sending electrical signals. When a neuron transmits signals, the weights play a role: each input signal has its own weight, and a signal with a larger weight is a more important signal. The figure below is an example of an artificial neural network.

     

[Figure: example of an artificial neural network (Arjun-Chandran_BSJF192.png)]

[Source]  https://bsj.berkeley.edu/wp-content/uploads/2019/12/Arjun-Chandran_BSJF192.png


13.2  How the artificial neural network works

Now we will study how this artificial neural network works. As shown below, one node can be seen as a function that receives an input signal and transmits the result. Therefore, it can be expressed as a function that receives an input signal $x$ and outputs $y = f(wx + b)$. Here, $b$ is a bias. For example, when an input signal $x$ is given and $wx + b$, computed with the weight $w$ assigned in advance, exceeds the threshold value $\theta$, then we will have the output $y = 1$.


                  If $wx + b > \theta$, then output 1 ($y = 1$).

[Figures: a single node viewed as a function]

            [Source : Prof. Ho-Sung NAM at Korea Univ.]


Similarly, it can be expressed as a linear function when we have two input signals. In the figure, when $x_1$ and $x_2$ are inputs with given weights $w_1, w_2$, then the output will be $y = f(w_1x_1 + w_2x_2 + b)$. And when there are three input signals $x_1, x_2, x_3$, then the output will be $y = f(w_1x_1 + w_2x_2 + w_3x_3 + b)$.


[Figures: artificial neurons with two and three input signals]


If there are two outputs, not one, it can be expressed using the matrix product as follows. For convenience, let the bias be 0. We may write the weights as $w_{ij}$'s since there are multiple weights. Here $w_{ij}$ is called the weight connected from the $j$-th node in the input layer to the $i$-th node in the output layer.


[Figure: a network with three inputs and two outputs]

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \text{that is, } \mathbf{y} = W\mathbf{x}.$$


In linear algebra, the input $\mathbf{x}$ is a (column) vector, the output $\mathbf{y}$ is a (column) vector, and the function is the matrix $W$, so $\mathbf{y} = W\mathbf{x}$. But in statistics or artificial intelligence, $W^{T}$, the transpose of $W$, is used for convenience. Here the row vector $\mathbf{x}^{T}$ is the input, and the matrix $W^{T}$ is multiplied on the right to yield the row vector $\mathbf{y}^{T}$ as follows.

$$\mathbf{y}^{T} = \mathbf{x}^{T}W^{T}$$

This way is more intuitive in some sense, so this notation will be seen in the field of artificial intelligence and statistics. Now we see that

we can model neurons using functions, especially matrices.
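The two conventions give the same numbers, only arranged differently. Here is a minimal NumPy sketch illustrating this (the matrix and vector entries are arbitrary example values, not from the text):

```python
import numpy as np

# Weight matrix W: W[i][j] connects the j-th input node to the i-th output node.
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])      # 2 outputs, 3 inputs
x = np.array([1.0, 2.0, 3.0])        # input vector

y = W @ x                    # linear-algebra convention: y = Wx
y_T = x @ W.T                # AI/statistics convention: y^T = x^T W^T

print(y)                     # [1.4 3.2]
print(np.allclose(y, y_T))   # True: both conventions agree
```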

If a number of input signals are given to an artificial neuron, it computes the weighted sum with the weights assigned in advance; if the sum exceeds a predetermined threshold, it outputs 1, and if not, it outputs 0 (or –1). The function that determines the output is called the <Activation Function>. A typical activation function is the Sigmoid function, which is infinitely differentiable.

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The Sigmoid function has almost the same behavior as the following Heaviside function (step function).

$$H(x) = \begin{cases} 1 & (x \ge 0) \\ 0 & (x < 0) \end{cases}$$


[Figure: graphs of the Sigmoid and Heaviside functions]



The Sigmoid function has the following properties.

 ① For all $x$, $0 < \sigma(x) < 1$.

 ② As $x \to \infty$, $\sigma(x) \to 1$; as $x \to -\infty$, $\sigma(x) \to 0$.

 ③ The Sigmoid function has the following nice property:

    $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$



Compared with the Heaviside function (step function), the Sigmoid function normalizes the output signal to a continuous value between 0 and 1, rather than an extreme value (0 or 1). In addition, the Heaviside function is not differentiable at $x = 0$, but the Sigmoid function is differentiable on all real numbers and its computations are very simple. Besides, the ReLU (Rectified Linear Unit) function, the hyperbolic tangent ($\tanh$), etc. are used more often as activation functions in many ANN problems. For now, we will focus on the Sigmoid function and develop the theory.

[More details in https://reniew.github.io/12/]
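Property ③ can be checked numerically by comparing the analytic derivative with a central-difference approximation. A minimal sketch (the grid and step size are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric  = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))                # property ③
print(np.max(np.abs(numeric - analytic)))               # tiny: the two agree
```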


Let’s organize this process.


① First, artificial neurons receive input data ($x_1, x_2, \dots, x_n$) from other artificial neurons or from outside.

② When input comes in, the artificial neuron combines the input data with the given weights into a single value $w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$ (a linear combination with the weights, plus the bias $b$).

③ Substituting the value obtained in ② into the activation function $f$, the output $y = f(w_1x_1 + \cdots + w_nx_n + b)$ is returned.


Activation functions can be written as $f$ or $\sigma$. In the following figure, the activation function is written as $f$. Schematically, the output value is obtained through this process. In other words, when the linear combination $w_1x_1 + \cdots + w_nx_n + b$ of an input ($x_1, \dots, x_n$) and the weights is given, the activation function gives an output $y = f(w_1x_1 + \cdots + w_nx_n + b)$.


[Figure: from input data through weights and the activation function to output data]

   Input data → Weights → Linear combination → Activation function → Output data
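The whole process in the figure can be written as a few lines of code. This is a minimal sketch of one artificial neuron, with made-up numbers for the inputs, weights, and bias:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, w, b):
    # Linear combination of inputs and weights, plus the bias, then activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, 0.8, 0.2])   # input signals x1, x2, x3
w = np.array([0.4, 0.7, 0.1])   # weights w1, w2, w3
b = -0.5                        # bias
print(neuron(x, w, b))          # an output between 0 and 1 (about 0.57 here)
```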

 

13.3  How the neural network learns

The neural network consists of an input layer, hidden layers, and an output layer. It receives a signal from the input layer and propagates it

to the hidden layer through a given activation function after computation with a weight assigned in advance. The prediction result is transmitted

to the output layer in the same way. Suppose there is an error between the predicted value obtained using the neural network

from the given data and the observed value known in advance. In that case, the weight is gradually updated to reduce the error.

Backpropagation and the gradient descent method are used in this process.


We can interpret a neural network as a model of a collection of neurons connected by a non-circulating (acyclic) graph. As shown in the figure below, it consists of an input layer, hidden layers, and an output layer, from the left.

When a signal is received from the input layer, it is propagated to the hidden layer through a given activation function

after calculating the weight given in advance. Similarly, the signal is calculated and transmitted to the next layer, and after repeating this process,

the output layer produces the corresponding result.


[Figure: a network with an input layer, hidden layers, and an output layer]

   [Source] https://deepai.org/machine-learning-glossary-and-terms/hidden-layer-machine-learning


Suppose that we are given a large input data set together with the correct answers based on it. Our objective is to determine how efficiently a neural network can be used to reproduce the correct answers. The weights are not computed by any formula; they are given randomly at the beginning and then gradually updated in the direction that reduces the error between the values predicted by the neural network and the correct answers obtained from the given data. The Backpropagation (BP) algorithm performs this weight-updating process.


13.4  Backpropagation (BP)

Let’s look at the Backpropagation algorithm, which is used to train artificial neural networks.


[Backpropagation algorithm]

 

① [Data Cleaning] The process of organizing data to fit into files is called <Data Preprocessing (Data Cleaning)>. After preprocessing the data and securing the essential data to build the necessary model, divide this data set into training data, validation data, and a test data set. In general, 80% of the original data is used as training and validation data, and 20% is used as a test data set. The modeling is performed using the training data, and the model is refined more and more elaborately using the validation data. Then it is checked, using the test data, whether the model fits well. Through this process, we can find a function, in the form of a matrix, that fits the given artificial neural network.

② [Start with the initial conditions] Enter the input and observed values and set the weights of the artificial neural network arbitrarily.

③ [Set up matrix, activation ft.] Set up initial weight values for the ANN (for example, a matrix and an activation function). In other words, provide an arbitrary matrix for an appropriate function. After that, set the Sigmoid function (or ReLU function) as the activation function and provide appropriate coefficients. For example, we can start with a matrix whose entries are all 1.

④ [Process] When the input layer receives a signal from the training data, compute it with the weights (or the updated weights after step ⑥) and transmit it to the hidden layer through the given activation function. The signal passes through all hidden layers, and we get a predicted value at the output layer.

⑤ [Find error] There will be an error between this predicted value and the observed value.

⑥ [Minimize error, GDM] Minimize this error by applying the gradient descent method. Using the result, modify the weights of the matrices (functions) in ③.

⑦ [Repeat the algorithm until the error is minimized] After correcting the weights, repeat steps ④~⑥ until the overall error is minimized.

⑧ [Stop with the optimal solution, Check the model] If we find the optimal solution through the whole process, we have the neural network model. After modeling, check that the neural network model works well using the test data set.

 

This whole process, operated within the hidden layers, is called the <Backpropagation> algorithm.
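A minimal runnable sketch of steps ②–⑦ for a single neuron is given below (the toy data, initial values, and learning rate are made up for illustration; a full multi-layer example follows at the end of this section):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training data: pairs (input x, observed value t)
data = [(0.0, 0.0), (1.0, 1.0)]
w, b, eta = 1.0, 0.0, 0.5            # steps 2-3: arbitrary initial weights

for _ in range(1000):                # step 7: repeat until the error is small
    for x, t in data:
        y = sigmoid(w * x + b)       # step 4: forward pass (predicted value)
        e = y - t                    # step 5: error vs. the observed value
        grad = e * y * (1 - y)       # step 6: gradient of E = (y - t)^2 / 2
        w -= eta * grad * x          #         gradient descent update
        b -= eta * grad

# The outputs now move close to the observed values 0 and 1.
print(sigmoid(w * 0.0 + b), sigmoid(w * 1.0 + b))
```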


If we want to increase the accuracy, we may increase the amount of training data and the number of hidden layers until we find a model with acceptable or target accuracy.


Deep learning is a learning process of an artificial neural network with more than two hidden layers.


The method of updating the weights based on the error transmitted to each layer by the Backpropagation method is the same as applying the gradient descent method to minimize the error function. For example, let $\mathbf{y}$ be the output value obtained from the activation function after the computation with the weights; let $x_i$ ($i = 1, \dots, n$) be the input signals received from each layer; let $\mathbf{t}$ be the real observed value (the value corresponding to the correct answer). There must be an error between $\mathbf{y}$ and $\mathbf{t}$. Then we can write an error function as the squared error between $\mathbf{y}$ and $\mathbf{t}$, using a vector norm.

$$E = \frac{1}{2}\,\|\mathbf{t} - \mathbf{y}\|^2$$

Then the formula for updating each weight is given by the gradient descent method. As we can see, $w^{\text{new}} = w^{\text{old}} - \eta\,\frac{\partial E}{\partial w}$ is an algorithm to find the NEW weight from the OLD one, one by one. In the previous description of the gradient descent method, the same algorithm was used to define the new $x_{n+1}$ as $x_{n+1} = x_n - \eta\,f'(x_n)$. Here $\frac{\partial E}{\partial w}$ means the partial derivative of $E$ with respect to the variable $w$.


After calculating the weights with the input signals $x_i$'s obtained from a hidden layer, make a linear combination to compute the error of the next output layer. The Backpropagation algorithm is used in this process: we can transmit the error back in proportion to the degree of influence on the error. In the same way, if the output shown in the following picture comes from the hidden layer, we compute the error between it and the expected value and apply it to the next step. Let's go through the mathematical details involved in this process.


① Compute the error passed to each layer. The objective of this step is to compute the error in the output layer. Assume that the hidden layer receives the input signals $x_1$ and $x_2$, computes with the weights $w_{11}$, $w_{12}$, $w_{21}$, $w_{22}$, and then obtains the outputs $y_1$ and $y_2$ through the Sigmoid function $\sigma$.
 

[Figure: two inputs, weights $w_{ij}$, and two outputs $y_1, y_2$]


   Then, the squared error in the output layer can be obtained as follows.

$$E = \frac{1}{2}\,\|\mathbf{t} - \mathbf{y}\|^2 = \frac{1}{2}(t_1 - y_1)^2 + \frac{1}{2}(t_2 - y_2)^2$$


   Now we compute the error in the hidden layer. Unlike the output layer, the hidden layer does not have any observed values, so there is no corresponding error (difference) as in the output layer. At this point, the aforementioned backpropagation method is used. The error in the neural network is the result of the errors in the hidden layers accumulated as the input signal propagates from the input layer to the final output layer. The error can therefore be passed back in proportion to the degree of contribution, that is, the weight.


[Figure: errors propagated back from the output layer in proportion to the weights]


   So, assume that the error $e_1$ was obtained at the first node of the output layer; it is connected to the first node of the hidden layer with the weight $w_{11}$ and to the second node of the hidden layer with the weight $w_{12}$. If a weight is large, its effect on the error is large, so the error is transferred to the hidden layer in proportion to the weights. For example, if $e_1^{h}$ is the error propagated to the first node of the hidden layer, it can be expressed as below; in the same way, $e_2^{h}$ can be expressed as below.

$$e_1^{h} = \frac{w_{11}}{w_{11}+w_{12}}\,e_1 + \frac{w_{21}}{w_{21}+w_{22}}\,e_2, \qquad e_2^{h} = \frac{w_{12}}{w_{11}+w_{12}}\,e_1 + \frac{w_{22}}{w_{21}+w_{22}}\,e_2$$


   Even if we do the computation with the denominators of these fractions removed, the ratios are still kept, so there will be no big difference from the result we obtained, since this does not significantly affect the optimal solution. Now we may simplify the equations into the following linear system.

$$\begin{bmatrix} e_1^{h} \\ e_2^{h} \end{bmatrix} = \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}\begin{bmatrix} e_1 \\ e_2 \end{bmatrix} = W^{T}\mathbf{e}$$


   Then we can now handle an artificial neural network with hidden layers.
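In matrix form, the simplified error propagation above is just multiplication by the transposed weight matrix. A small NumPy sketch with made-up numbers:

```python
import numpy as np

# W[i][j] connects the j-th hidden node to the i-th output node (example values).
W = np.array([[2.0, 3.0],
              [1.0, 4.0]])
e_out = np.array([0.8, 0.5])     # errors e1, e2 at the two output nodes

e_hidden = W.T @ e_out           # simplified backpropagation of the errors
print(e_hidden)                  # [2.1 4.4]
```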


② The objective of this step is to update the weights between the hidden layer and the output layer from the error obtained at the output layer. From the properties of the Sigmoid function and the chain rule, we can find $\frac{\partial E}{\partial w_{11}}$ as follows.

$$\frac{\partial E}{\partial w_{11}} = \frac{\partial E}{\partial y_1}\,\frac{\partial y_1}{\partial w_{11}} = -(t_1 - y_1)\,\sigma(w_{11}x_1 + w_{12}x_2)\bigl(1 - \sigma(w_{11}x_1 + w_{12}x_2)\bigr)\,x_1 = -(t_1 - y_1)\,y_1(1 - y_1)\,x_1$$


   As we have seen before, the derivative above can be obtained easily because of the property $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ of the Sigmoid function.


   In the same way, the partial derivatives for the other weights are calculated as follows.

$$\frac{\partial E}{\partial w_{12}} = -(t_1 - y_1)\,y_1(1 - y_1)\,x_2$$

$$\frac{\partial E}{\partial w_{21}} = -(t_2 - y_2)\,y_2(1 - y_2)\,x_1$$

$$\frac{\partial E}{\partial w_{22}} = -(t_2 - y_2)\,y_2(1 - y_2)\,x_2$$


[Figure: updating the weights between the hidden layer and the output layer]


   Similarly, it is easy to obtain the other weights. The weight $w_{ij}$ is connected only with the $j$-th node of the hidden layer and the $i$-th node of the output layer, so $\frac{\partial E}{\partial w_{ij}}$ can be expressed simply as follows.

$$\frac{\partial E}{\partial w_{ij}} = -e_i\,y_i(1 - y_i)\,x_j$$


   Here, $x_j$ is the input from the $j$-th node in the hidden layer, $y_i$ is the output from the $i$-th node in the output layer, and $e_i = t_i - y_i$ is the error at the $i$-th node in the output layer. Essentially, we are using the gradient descent method to update the weights.

$$w_{ij}^{\text{new}} = w_{ij}^{\text{old}} - \eta\,\frac{\partial E}{\partial w_{ij}}$$
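Since the formula above comes from the chain rule, it can be checked against a numerical derivative. A minimal sketch for the weights into a single output node (all numbers are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.4, 0.9])    # inputs x1, x2 coming from the hidden layer
w = np.array([0.3, 0.7])    # weights into one output node
t = 1.0                     # observed (target) value

def E(w):                   # squared error at this output node
    y = sigmoid(np.dot(w, x))
    return 0.5 * (y - t) ** 2

# Chain-rule gradient: dE/dw_j = -(t - y) * y * (1 - y) * x_j
y = sigmoid(np.dot(w, x))
analytic = -(t - y) * y * (1 - y) * x

# Central-difference check of the same gradient
h = 1e-6
numeric = np.array([(E(w + h * np.eye(2)[j]) - E(w - h * np.eye(2)[j])) / (2 * h)
                    for j in range(2)])
print(np.max(np.abs(analytic - numeric)))   # tiny: the formula checks out
```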


③ The objective of this step is to update the weights between the input layer and the hidden layer from the error passed to the hidden layer. Basically, we can compute it in the same way as when updating the weights between the hidden layer and the output layer. For convenience, if we use the same symbols as before, we can see that the same algorithm, Backpropagation, works in the hidden layer.


   Here, $x_j$ is the input from the $j$-th node in the input layer, $y_i$ is the output from the $i$-th node in the hidden layer, and $e_i^{h}$ is the error passed to the $i$-th node of the hidden layer.

$$\frac{\partial E}{\partial w_{ij}} = -e_i^{h}\,y_i(1 - y_i)\,x_j$$


   All the weights can be updated as follows by applying the gradient descent method.

$$w_{ij}^{\text{new}} = w_{ij}^{\text{old}} - \eta\,\frac{\partial E}{\partial w_{ij}}$$


   Neural networks require a lot of data before the model can predict adequately, and they may also have many hidden layers between the input and the output layers. Such artificial neural networks with several hidden layers are called a <Deep Neural Network>, and machine learning for training a deep neural network is called <Deep Learning>. If there are multiple hidden layers, we can still use this Backpropagation method to update the weights by applying the gradient descent method to the error propagated from the previous layer. There are various references related to this topic that we can consult.

   This is the Backpropagation method. Extra information can be found in the following references.
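Putting the whole derivation together, the following self-contained sketch trains a one-hidden-layer network on the XOR data using the update rules above (the hidden-layer size, learning rate, random seed, and epoch count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR data: 2 inputs -> 1 output, a classic small test problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 nodes; weights start at small random values
W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros(1)   # hidden -> output
eta = 1.0

for _ in range(10000):
    # Forward pass through the hidden layer to the output layer
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Backward pass: output error, then error passed back through W2^T
    dY = (Y - T) * Y * (1 - Y)        # (y - t) * y * (1 - y), from the chain rule
    dH = (dY @ W2.T) * H * (1 - H)    # error propagated to the hidden layer

    # Gradient descent updates: w_new = w_old - eta * dE/dw
    W2 -= eta * H.T @ dY; b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(axis=0)

print(Y.round(2).ravel())   # typically close to [0, 1, 1, 0]
```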


 [Reference]   ‘What is Backpropagation doing?’

http://makeyourownneuralnetwork.blogspot.com/2016/07/error-backpropagation-revisted.html 

https://youtu.be/Ilg3gGewQ5U 


 [Lecture]   The Maths behind Backpropagation algorithm
https://towardsdatascience.com/the-maths-behind-back-propagation-cf6714736abf


 [Lecture] Math in ANN: https://youtu.be/d4WercT_OnU   


◩ Open Problem 5

Briefly describe the ANN and the Backpropagation algorithm as you understand them.


Copyright @ 2021 SKKU Matrix Lab. All rights reserved.
Made by Manager: Prof. Sang-Gu Lee and Dr. Jae Hwa Lee