﻿

(English version) Introductory Math4AI

[K-MOOC] Introductory Math for Artificial Intelligence

[K-MOOC]

Introductory Mathematics

for Artificial Intelligence

Sang-Gu LEE with Jae Hwa LEE, Yoonmee HAM, Kyung-Eun PARK

Translated by the main author from the original Korean text

Sang-Gu LEE with Youngju NIELSEN, Yoonmee HAM

<Kyungmoon Publishers, 2021>

Reference : http://matrix.skku.ac.kr/KOFAC/book/

English :   http://matrix.skku.ac.kr/Intro-Math4AI/

(Korean :  http://matrix.skku.ac.kr/math4ai-intro/ )

 Contents

Preface

I. Basic Math for AI

1. Function, Graph and Solution of Equations

1.1 Functions and its Graph

1.2 Polynomial functions

1.3 Rational functions

1.4 Trigonometric functions

1.5 Exponential and Logarithmic functions

1.6 Solution of the equation

II. Artificial Intelligence and Matrix

2.  Data and Matrices

2.1 Ordered tuples and vectors

2.2 Vector operations

2.3 Matrices and Tensors

2.4 Matrix Operations

2.5 Rules for Matrix Operations

3. Classification of Data

3.1 Data Similarity

3.2 Distance

3.3  Norm

3.4 Similarity Measure using Norm

3.5 Comparison of Data using Angle

3.6 Cosine Similarity

3.7 Inner product

3.8 Angle between two vectors

3.9 Computing cosine similarity

4. Solution Sets for System of Linear Equations

4.1 System of linear equations

4.2 Augmented matrices

4.3 Gauss-Jordan elimination

4.4 Solution Set for System of Linear Equations

5. Orthogonal Projection and Least Squares Problem

5.1 Least Squares Problem

5.2 Meaning of the least-squares problem

5.3 Projection and least-squares solution

5.4 Finding a suitable curve for data (Curve Fitting)

6. Matrix decompositions (LU, QR, SVD)

6.1 LU decomposition

6.2 QR decomposition

6.3 SVD(Singular Value Decomposition)

III.  AI and Optimal solution (Calculus)

7. Limits of Functions

7.1. Limits of Functions

7.2. Derivative and Differentiation

8. Local Maximum and Minimum,

Absolute Maximum and Minimum

8.1. Applications of Derivatives

8.2. Applications of the second derivative

8.3. Local Maximum and Minimum, Absolute Maximum and Minimum

*9.2. Application of the Gradient Descent Method

IV. AI and Statistics

10. Permutation, combination, probability, random variable, probability distribution,  Bayes' theorem

10.1 Permutations and combinations

10.2 Probability

10.3 Conditional probability

10.4 Bayes' theorem

10.5. Random variables

10.6 Discrete probability distributions

10.7 Continuous probability distributions

11. Expectation, variance, covariance, correlation coefficient, covariance matrix

11.1 Expectation, variance, standard deviation

11.2 Joint probability distribution

*11.3 Covariance, correlation coefficient

11.4 Covariance matrix

V. PCA and ANN

12. Principal Component Analysis

12.1 Dimension reduction

12.2 Principal Component Analysis  (PCA)

12.3 How to Compute principal component

12.4 Examples of PCA

*12.5 PCA and Covariance matrix

*12.6 PCA and Linear Regression

13. Artificial Neural Network

13.1 Artificial Neural Network

13.2  How artificial neural network works

13.3  How the neural network study

13.4  Backpropagation (BP)

14.  Hand Writing Numbers detection using ANN on MNIST Dataset

14.1 Hand Writing Numbers detection using ANN

VI. References and Appendix

A1. References

A2. Appendix

▪ Final PBL Report [Week 14] [TakeHome/OpenBook Exam]

▪ Web resources

▪ Index                                                           200

◼ Information of Authors

Original Book by Sang-Gu LEE
with Jae Hwa LEE, Yoonmee HAM, Kyung-Eun PARK

Translated by the main author from the original Korean text

Sang-Gu LEE

Professor of Mathematics, Sungkyunkwan University

http://matrix.skku.ac.kr/sglee/vita/LeeSG.htm

Youngju NIELSEN

Yoonmee HAM

Professor of Mathematics, Kyonggi University

https://www.researchgate.net/profile/Yoonmee_Ham

Preface

Artificial intelligence(AI) came nearby to our lives already. For example, we can think of Netflix, Amazon's product recommendation system, AI voice assistant 'Siri', AlphaGo Master who won all the best Go players, and NEON and Bally which was first introduced at CES 2020. Besides, voice recognition, face recognition, self-driving car, the understanding and generation of natural language are all related to AI. Soon, artificial intelligence robots will be working for humans in factories, roads, and homes. As a result, humans may not have to go to work and lose their jobs. We need to understand the basic principles of what artificial intelligence is and how it works.

Large university hospitals have medical records of many patients, showing persons’ height, sleeping hours, exercise hours, calorie intake, weight, blood pressure, etc.  If we give the inputs such as sleeping hours, exercise hours, calorie intake, and height into AI, then AI gives us an output such as normal weight or blood pressure. This is a process of the function. If both Input and output are vectors, then the matrix acts like a function. Therefore, if we can find this matrix, it is possible to predict that 'people with certain inputs will have certain outputs'. It is a mathematics. We have tried to build up a Mathematical Modeling for this purpose. Now, when we have enough input data and answers, a Machine can find rules, functions, or matrices from those data on its own. In particular, artificial neural networks are modeled on neurons, the basic unit of a nervous system. The human brain has about 100 billion neurons and performs tasks like image recognition. When the nerve is stimulated, it does not respond to small stimuli, but when the stimuli go beyond the threshold value, we will react with "Ahhhhh!". The sigmoid function is used for these neural responses. An artificial neural network with many layers is a model of this process of outputting results through the layers when data is entered. We now can see this process of finding a Matrix(function) is not magic.

In AI, we use mathematical and statistical techniques to improve Artificial neural networks. In that process, we use the backpropagation algorithm. Backpropagation is a process of gradually increasing the precision of the desired matrix by correcting errors while going back and forth between the layers. A Backpropagation is a process of minimizing the error function to improve the neural network. In finding a Least Squares Solution(LSS), the Singular Value Decomposition(SVD) is used. And the Gradient descent method(GDM) is used as an algorithm to minimize error in this error function. We will learn mathematics such as LSS, SVD, and GDM needed to obtain this matrix and complete our mathematical modeling.

Take another example. Let's explain how retailers recommend products for their consumers. AI analyzes the data, checks what people need, deduces what they will like, and constantly sends recommendation messages. We keep getting these messages, too. Like this, AI works according to a process that we can fully understand. AI is not a smart box that we don't know. In other words, AI is Not Magic. AI is based on algorithms that operate using data and processing power.

Students should learn eigenvalue, eigenvector, diagonalization of matrices, SVD, PCA, local extremum points, counting techniques, probability, probability distribution, covariance matrix in Math/Statistics subject, and some coding using Python language and R language.

If we do well, we will have the ability to understand and utilize artificial intelligence. We will be able to use artificial intelligence without fear and prepare for a future job. An artificial neural network is a key player in artificial intelligence. An artificial neural network produces a function that predicts output values from input values. At this time, if both input and output are vectors, the function will be a matrix.

If we can find that matrix (or understand the process of finding the matrix), we can always estimate the proper output from the given data. Once we understand that the core of an artificial neural network is a proper use of mathematics, we will be able to face Artificial Intelligence without fear.  Therefore, we select and explain the knowledge required for artificial intelligence, such as linear algebra, calculus, and statistics. And we will use the 'text Code' to do computations and simulations. In particular, we have prepared a textbook with code, and hence we can practice problems directly in Sage which is a Python based language and R language. As a result, we can handle functions with Code, so we can easily analyze the given data.

Mathematical theory and Coding has been a key factor in the development of AI. In this book, we will explain most of the basic mathematical knowledge that is necessary to understand AI. We will also offer free web content, free cyber lab using ready-made Python based code. Welcome. We look forward to your pleasant learning of <Introductory Mathematics for Artificial Intelligence>. We hope all can enjoy it.

Finally, we would like to emphasize our creation of a technology-rich mathematics textbook. We greatly appreciate the valuable suggestions and comments of our colleagues including Dr. Dong Ryeol SHIN, Thomas Yew Sing LEE, Jae Hwa LEE, Eungki KIM, Natanael KARJANTO, Kinkar Chandra DAS, Nhan-Phu CHUNG, Lois SIMON, Minki KIM and the efforts of contributors including Robert BEEZER and William STEIN to the Python based  free open-source SageMath software project who inspired us to write this book. We dedicate this book to our friends, family, mentors and students who shared the meaning and joy of mathematics.

From the Authors

at Sungkyunkwan University

2021. 1. 15.

Royal Palace, Sungkyunkwan University and BukAk Mt. (Photo by the author)

 Reading Material 1 What is AI?     Math DongA,  May 2020   Artificial intelligence(AI) now cooks and wash dishes for us, gives answers to questions, and take care our schedules. AI works very closely with us already. We now need to understand AI and how it works. We set the goal of this course to learn and practice the contents of linear algebra, calculus, and statistics (especially SVD, GDM, PCA, Artificial neural network) that are needed to understand artificial intelligence. It seems that the age of robots is already nearby us. In the past, we saw robots in movies such as 'Terminator, Matrix, and I, Robot’. Now robots drive and work for humans at home. Therefore, we need to understand how AI works. Already, robots are going beyond human abilities in quiz games, Go-games, and Internet games. Let's take an example. When we say "I want Pizza" to Siri, AI confirms that there is no term like "cooking," and judges that the speaker is looking for a pizza place to eat. And AI uses our GPS information to find nearby Pizza restaurants, ranking them according to proximity, reputation, price, etc. Also, AI refers to your records of restaurants visit, then recommends a high-scored restaurant recommended by other people. As a result, AI generates a sentence “There is a pizza restaurant called Gino's Pizza about three blocks ahead from here.” and reads it in her voice. Here is an example. AI works for the operational process of the purchasing system. Suppose there is a record of the books we have purchased so far. In that case, AI selects people with a similar purchasing history to us, checks out the books that others also bought recently, and books that you haven't bought yet then recommends a book by email or Social Network Service(SNS). AI also works for many marriage brokerage firms. AI uses deep learning algorithm and neural networks that uses several mathematical concepts. 'The Handwritten Number Recognition System Using MNIST Data Set' explains ‘recognizing that a given letter from the data is a number close to 7’ and ‘the code that implements the artificial neural network to recognize which number this handwritten letter is close to.’ And using the given codes, AI  increases the accuracy of the neural network by changing the number of data, layers, and nodes. If we can choose necessary pieces of information from the given big data, our friend may think we are a genius, but we will see that anyone with observation and analysis could do the same thing. We will see 'what makes artificial intelligence works' by reading this book. Welcome to .   Alan Turing and his Monumental paper on Computer   Founding Fathers of AI (1)   Main figures in 1956 Dartmouth Conference (2)

The Roadmap of Mathematics for Deep Learning | by Tivadar Danka | Oct, 2020 | Towards Data Science – https://towardsdatascience.com/the-roadmap-of-mathematics-for-deep-learning-357b3db8569b

English :   http://matrix.skku.ac.kr/Intro-Math4AI/

Part I.

Part I. Basic Math for AI

1. Function, Graph and Solution of Equations

1    Week 1 Function, Graph and Solution of Equations

In this first chapter, we start to draw graphs of any given function .  Then we learn how to find the solution of equations of and . And then we see how to point out the points(solutions) approximately.

 1.1 Functions and its Graph 1.2 Polynomial functions 1.3 Rational functions 1.4 Trigonometric functions 1.5 Exponential and logarithmic functions 1.6 Solution of the equation

 Homework Week 1

After solving the above open questions, please share a brief self-introduction, motivation for taking this course, and the summary/ practice/question/answered content on the QnA board.

[Source] https://math.meta.stackexchange.com/questions/6479/a-graph-map-of-math-se

Part II.

Part II. AI and Matrix

2. Data and matrices

3. Classification of data

4. Solution set of system of linear equations

5. Orthogonal Projection and Least Squares Problem

6. Matrix decompositions

2.   Week 2   Data and Matrices  http://matrix.skku.ac.kr/intro-math4ai/W2/

 ◩ Open Problem 6

Find bigger than matrix from the internet or other textbook and check whether transpose and inverse matrices exist. If those exist, find out what those are.

 Homework Week 2

3   . Week 3   Classification of Data

Data similarity in distance

Data similarity in direction

 3.1 Data similarity 3.2 Distance 3.3 Norm 3.4 Similarity measure using norms 3.5 Similarity measure using angles 3.6 Cosine similarity 3.7 Inner product 3.8 Angle between two vectors 3.9 Computing cosine similarity

 ◩ Open Problem 7

What kinds of data can you apply the similarity measures we just discussed?

 ◩ Open Problem 8

Can you think of the data you can apply to the similarity measure we just discussed? [Hint: Vectors/Data in the same directions]

Can you think of the data you cannot apply to the similarity measure we just discussed? Can you think of any other measure you can use?

 ◩ Open Problem 9

Make (randomly generate) two 7-dimensional vectors and then calculate the distance between two vectors. [Hint: Lab <Distance similarity> ]

 ◩ Open Problem 10

What kinds of data can you apply a cosine similarity measure to?

 ◩ Open Problem 11

Make two 5-dimensional vectors and find the inner product and the angle of two vectors.

[Reference] Public data portals https://www.data.go.kr/

[Reference] Sample of public data usage

[Reference] National Core data https://www.data.go.kr/tcs/eds/selectCoreDataListView.do

[Reference] High School AI Mathematics http://matrix.skku.ac.kr/KOFAC/book/

4.   Week 4   Solution Sets for System of Linear Equations

http://matrix.skku.ac.kr/intro-math4ai/W4/

Linear algebra is known as a subject that can be used to solve most of our problems. Alan Tucker even said, "Linear algebra is the model that mathematical theory pursues." A method to find the solution set of a given system of linear equations will be explained.

 4.1 System of linear equations 4.2 Augmented matrices 4.3 Gaussian elimination 4.4 Solution Sets for System of Linear Equations

 ◩ Open Problem 12

Find a system of linear equations from other textbooks. Find a solution to that by using the code that you learned.

 ◩ Open Problem 13

Explain how and what we can determine for a given linear system of equations with a unique solution or infinitely many solutions, or no solutions. Use an RREF of to do this. Explain to others what you understand.

Reference Solution Sets http://matrix.skku.ac.kr/K-MOOC-LA/cla-week-4.html

Week 5. Orthogonal Projection and Least Squares Problem

The least-squares method is the most intuitive way to obtain the best fit curve representing the data patterns. Let's find out how we can get the best possible solution using the method of least squares (curve fitting).

 5.1 Least Squares Problem 5.2 The meaning of the least squares problem 5.3 Orthogonal projection and least squares solution 5.4 Curve fitting

5.1 Least Squares Problem

http://matrix.skku.ac.kr/intro-math4ai/W5/

 ◩ Open Problem 14

Discuss the conditions for the existence of the inverse matrix of and find a least-squares solution for the given 6-points in  to make the curve of .

 Lab - Least squares problem: http://matrix.skku.ac.kr/2020-math4AI/LSS/ - Projection:https://www.geogebra.org/m/ewP9ybUP

6.  Week 6.   Matrix decompositions (LU, QR, SVD)

We need to understand how our computer understands and computes data. Let's briefly introduce three matrix decompositions that AI often uses to handle real-life data.

 6.1 LU decomposition 6.2 QR decomposition 6.3 SVD (Singular value decomposition)

We have discussed vector calculations, a system of linear equations, the least-squares problem. Computer programs use matrix decompositions to solve these problems. We introduce the decomposition, the -decomposition, and the Singular Value Decomposition (SVD).

 ◩ Open Problem 16

Find the SVD of a matrix you found from other textbooks by using the codes above.

In the code for finding an SVD, RDF works for real matrices, and CDF works for complex matrices.

More details on matrix diagonalizations can be found in the following link: http://matrix.skku.ac.kr/LA-K/ ( particularly in http://matrix.skku.ac.kr/LA-K/Ch-8/)

The derivation of matrix decomposition is beyond the high-school mathematics level. Refer to the below references for more details.

[Linear Algebra Textbook] http://matrix.skku.ac.kr/LA-K/

[Ejgenvalue] http://matrix.skku.ac.kr/LA-K/Ch-4/

[Matrix Diagonalization] http://matrix.skku.ac.kr/LA-K/Ch-8/

[Math for AI Book] http://matrix.skku.ac.kr/math4ai/Math4AI.pdf

[AI and Matrix] http://matrix.skku.ac.kr/math4ai/part1/

[LU decom tool] http://matrix.skku.ac.kr/LA-K/LU-decom.html

[QR decomposition tool] http://matrix.skku.ac.kr/LA-K/LS-QR-decom.html

[SVD Lecture] http://youtu.be/7-qG-A8nXmo

 ◩ Open Problem 1

Sketch the graph of three polynomial functions found in your other math books.

 ◩ Open Problem 2

Make the sketch of the graph for the function .

[ Hint: plot(x*sin(1/x), (x, -2, 2), ymin = -1.5, ymax = 1.5) ]

Practice in http://matrix.skku.ac.kr/KOFAC/

 ◩ Open Problem 3

Choose a composite function from the functions that you already learned and draw a graph of it.

[Hint: plot(sin(e^(1/3)^x),  (x, -1.2, 10))]

Practice in  http://matrix.skku.ac.kr/KOFAC/

 ◩ Open Problem 4

Find (approximate) solutions to various equations from other textbooks that you have used.

 ◩ Open Problem 5

Perform matrix operation by applying the operations on vectors and matrices you have learned in this book to matrices you find in other textbooks. [http://matrix.skku.ac.kr/KOFAC/]

 ◩ Open Problem 6

Find bigger than matrix from the internet or other textbooks and check whether transpose and inverse matrices exist. If those exist, find out what those are.

 ◩ Open Problem 7

What kinds of data can you apply the similarity measures we just discussed?

 ◩ Open Problem 8

What kind of data you cannot use this similarity measure using the distance? Any other measures that you can think of? (Hint: Vectors/Data in the same directions)

 ◩ Open Problem 9

Make (randomly generate) two 7-dimensional vectors and then calculate the distance between two vectors.

(Hint: Lab <Distance similarity>) http://matrix.skku.ac.kr/math4AI-tools/distance_similarity/

 ◩ Open Problem 10

What kinds of data can you  apply the cosine similarity to?

 ◩ Open Problem 11

Make two 5-dimensional vectors and find the inner product and the angle of two vectors.

 ◩ Open Problem 12

Find a system of linear equations from other textbooks. Find a solution to that by using the code that you learned.

 ◩ Open Problem 13

Explain how and what we can determine for a given linear system of equations with a unique solution or infinitely many solutions, or no solutions. Use RREF of to do this. Explain to others what you understand.

 ◩ Open Problem 14

Discuss the conditions for the existence of the inverse matrix of and find a least-squares solution for the given 6-points in to make the curve of .

 ◩ Open Problem 15

Find a system of linear equations problem from other textbooks. Then find that problem's least-squares solution using the same method as in example 4.

 ◩ Open Problem 16

Find the SVD for a matrix from the other textbook by using the codes above.

[AI School at SKKU]

STRONG KOREA Forum 2020   https://youtu.be/R0KGx_npa8o(Movie)

Part III.

Part III. AI and Optimal solution(Calculus)

7. Limits and Differentiation

8. Local Maximum and Minimum, Absolute Maximum and Minimum

7  . Week 7. Limits of Functions

An optimal solution is a solution in which a function defined in a set has a maximum or minimum value. The problem of finding an optimal solution involves generalized concepts and operations of derivatives. Techniques used to find an approximate solution help when we find an optimal solution.

* History and Concept of Calculus (Storytelling)

 7.1 Limits of Functions 7.2 Derivative and differentiation

 ◩ Open Problem 1

Find the third derivative of a differentiable function from the textbook.

[Differentiation]

[Lectures] http://matrix.skku.ac.kr/2019-album/

[Textbook] http://matrix.skku.ac.kr/Cal-Book1/

[Problem/Solution] http://matrix.skku.ac.kr/Cal-Book/

1. Single variable Calculus  http://matrix.skku.ac.kr/PBL/

2. Multivariable Calculus http://matrix.skku.ac.kr/PBL2/

Week 8.  Local Maximum and Minimum,  Absolute Maximum and Minimum

The derivative can be used to determine whether a function increases or decreases based on the sign of slope at a point (or on a given interval). There might be a changing point of the slope (e.g, decreasing to increasing, increasing to decreasing). We call it a critical point, and the derivative of the function will be zero at a critical point. The second derivative can be used to determine whether a given function has a local maximum or minimum value at a critical point. Using the second derivative, we can 'check that a function has the absolute maximum or minimum on a given interval'.

 8.1 Application of derivatives      8.2 Application of the second derivative      8.3 Local Maximum and Minimum,          Absolute Maximum and Minimum

 ◩ Open Problem 2

Find a complicated twice differentiable function in other textbooks, and use the code to find the local maximum, the local minimum, the absolute maximum, and the absolute minimum of the function.

[Fermat's (interior extremum) theorem]

The optimal solution satisfies:

Therefore, we solve an equation , and determine which critical points give us an optimal solution. However, if the function is too complicated, it is also difficult to find all critical points from the equation . In such cases, we can find them by using numerical methods. The gradient descent method is one of the most popular numerical optimization methods for solving this problem. We will study the Gradient Descent Method(GDM) in the next class.

< Web resources >

[A lecture on the limit] http://youtu.be/mXVU8OqIHJY

[Calculus Lab] http://matrix.skku.ac.kr/Cal-Book1/Ch4/

[Appl. of derivative Lecture] https://youtu.be/O4lN5zEZnMA

http://matrix.skku.ac.kr/intro-math4ai/W9/

For any differentiable function that has the local minimum value at , then . Therefore, we need to find all points that satisfy , critical points,  first and then determine which of them will  be the optimal solution. However, when the function is complicated, it is not easy to solve the equation to find  critical points. In such cases, the critical point can be obtained by a numerical method ,which is called the gradient descent method. The gradient descent method is also the core algorithm for updating the weights in deep learning.

 9.1  Gradient descent method *9.2  Application (least-square problem)

We now introduce the Gradient Descent Method to obtain an approximate solution to the least-squares problem numerically.

There is a problem of finding the minimum value of a one-variable function that can be differentiated as follows.

Then, the point giving the minimum value of the function (it is called the optimal solution) by [Fermat's critical point theorem] satisfies the following.

Therefore, we can determine which of the solutions satisfying the equation is the optimal solution. However, when the function is complex, it is not easy to find a critical point by solving the equation. In this case, the critical point is obtained numerically. In this section, we look at the gradient descent method, which is the most popular numerical optimization method to find the minimum value of a given function. The basic idea of ​​​​the gradient descent method is to find the slope of the function, move it toward downhill, and repeat it until it reaches the extreme value.

 [Gradient Descent Method] Algorithm [Step 1] Set an initial iterate , tolerance , initial learning         rate (eta) and iteration number . [Step 2] Compute . If , then stop. [Step 3] Set , and go to [Step 2].

Let's explain this GDM algorithm. Set and compute . If (it satisfies the critical point theorem within the margin of error we allow), then the algorithm stops and will give    as the optimal solution. Here, (epsilon, tolerance) satisfies . A notation ‘<<’ in this inequality means that the epsilon is much smaller than 1. Hence, we give a small tolerance epsilon , which is very close to zero, and with a learning rate . And let the iteration number be 1. If , is determined using the formula. Then is calculated to determine whether or not the critical point theorem is satisfied. If the critical point theorem is not satisfied, then is determined similarly. In this way, we create , , (). If the value of is within a given tolerance after some repeated steps , then the algorithm stops. We expect some or the limit to satisfy

Let's see how the GDM works in detail. . Suppose the slope of the tangent line to the function is negative at the approximate solution after the -th iteration. It is  as shown in the figure below. This means the function decreases when moves from left to right, so we can expect is located on the right side of . If moves from the left to the right, then is closer to the optimal value . Therefore, we move from to with direction.

Similarly, the slope of the tangent line to the function is positive at the approximate solution after the -th iteration. It is  as shown in the figure below. It means the function is increasing when moves from left to right, so we can expect is located on the left side of . If moves the right to the left ,then is closer  to the optimal value . Therefore, we move from to with direction. If we repeat the process until moves to , then will converge to 0. In this way, we get the approximate solution .

This shows that the GDM will find , , , that satisfy . Here, determines the  magnitude of movements. We call this the ‘learning rate.’ If the learning rate is too large, can pass over , or even the function value may increase. So we should be careful to choose a reasonable learning rate .

Do not converse                            Too slow

 ☞ Note It is important to determine the appropriate learning rate, but we leave the detail of this issue for the next stage. It is usual to set in (, ). As an initial learning rate, or is usually chosen.

Find the minimum of . Take, ,, and in the following code.

Solution. From , the critical point becomes and is convex downward. So the minimum value  can be obtained at  .

-------- http://matrix.skku.ac.kr/KOFAC/ --------

f(x) = 2*x^2 - 3*x + 2  # function

df(x) = diff(f(x), x)  # derivative

x0 = 0.0  # the initial iterate

tol = 1e-6 # the tolerance level

eta = 0.1  # the learning rate

for k in range(300):

g0 = df(x0)

if abs(g0) <= tol:

print ("Algorithm Succeed!")

break

x0 = x0 - eta*g0

print("x* =", x0)

print("|g*| =", abs(g0))

print("f(x*) =", f(x0))

print(" iteration number  =", k + 1)

-------------------------------------------------------------------------------

Algorithm Succeed!

x* = 0.749999834194560

|g*| = 6.63221759289456e-7

f(x*) = 0.875000000000055

iteration number  = 31

In the above example, the points created by the gradient descent method , , ,   are shown on the coordinate plane along with the graph of the function as follows. From the figure, it is easy to see that it is intuitively converged to the point on the curve with the minimum value.

It can be seen that the sequence converges to the optimal solution .

 ◩ Open Problem 3

Find the minimum value of .  Use , , and with the above GDM code.

So far, the algorithm and the principle of GDM have been explained using simple examples. The GDM is a key algorithm that is used to update weights in Deep Learning. When we introduced the GDM algorithm, we assumed a convex down function on the domain. However, if we have a function that contains both convex down and convex up in an interval (non-convex), the GDM algorithm may not work well, as you can see in the figure below. Under this circumstance, the algorithm may converge to different points. It might be another local minimum or might not even an extreme or a critical point. It depends on where the starting point is given.

 ◩ Open Problem 4

Draw graphs for various differentiable functions, as shown in the figure above.

Determine a small interval containing the local minimum identified from the figure (by your eye), and apply the GDM code to each interval with a reasonable . Discuss your output.

■  In addition to the gradient descent method, you can use Newton's method to find the optimal solution. It is not in the scope of our book. Please refer to the following for more information on Newton's method and Taylor series. Visit the following link for more information on Newton's method.

[Math4 AI Book] http://matrix.skku.ac.kr/math4ai/Math4AI.pdf

[AI and Optimal solution] http://matrix.skku.ac.kr/math4ai/part2/

[Newton's method] http://matrix.skku.ac.kr/Cal-Book1/Ch4/

[Taylor series] http://matrix.skku.ac.kr/Cal-Book1/Ch9/

*9.2. Application of the Gradient Descent Method

The GDM introduced in the first session is for minimizing a function  of one variable. Let's generalize the algorithm for minimizing a function of several variables. The least-squares problem we learned is a multivariate function with at least two independent variables. The least-squares problem was solved using the error function  transformed using multiple variables. The least-squares problem can also be solved by the gradient descent method(GDM).

 [GDM Algorithm for minimizing a multi-variable function] [Step 1] Set an initial iterate , tolerance , initial learning          rate (eta), and iteration number . [Step 2] Compute . If , then stop. [Step 3] Set , and go to [Step2].

All the steps of the GDM algorithm for minimizing a multi-variable function are the same as for a one-variable function.

■ One variable function      <=>    Multi-variable function

① Scalar                      <=>      Vector,

② Absolute value        <=>      Norm of vector,

Let be independent variables and be a third variable. Now we can consider a function , which is dependent on and . Functions of several variables can be defined in the same manner.

In a 3-dimensional (coordinate) space, can be considered as a point . As and move, the graph of becomes a surface. For example, the graph of a two-variable function can be drawn using the codes as follows.

---------------------------------------------

var('x, y')   # variables

f(x, y) = -x*y*exp(-x^2 - y^2)

plot3d(f(x, y), (x, -2, 2), (y, -2, 2), opacity = 0.6, aspect_ratio = [1, 1, 10])

----------------------------------------------------------------------------------

The graph of in the figure above intuitively shows that has a local maximum value at the peak and a local minimum value at the valley. To find those critical points, the notion of a 'gradient of a multi-variable function' is required.

The gradient of a function of two variables is defined as follows:

The gradient of is a 2×1 vector, where means the partial derivative of with respect to . The partial derivative with respect to considers other variables except  is constant. Similarly, means the partial derivative of with respect to y.

The gradient of can be found as follows.

---------------------------------------------

var('x, y')   # variable

f(x, y) = -x*y*exp(-x^2 - y^2)

------------------------------------------------------------------------------------------

(2*x^2*y*e^(-x^2 - y^2) - y*e^(-x^2 - y^2),

2*x*y^2*e^(-x^2 - y^2) - x*e^(-x^2 - y^2))

Similarly, the gradient of a multi-variable function with three or more variables  can be obtained. In general, the gradient of the -variable function is as following;

◆ In [Section 5.2], the first example of the least-squares problem was as follows. For each data , let's say is the value obtained by substituting in a linear function . So we have, . If this equation's solution  does not exist, it is possible to find , where the square of error is minimized. The error function , which is the sum of squared errors , is defined as follows: (The reason we multiplied it by at the front of the error expression is only for computational convenience, so it doesn't affect anything on the conclusion.)

Let's use the GDM to get the approximate solution that minimizes .

Set an initial iterate , tolerance , and initial learning rate .

-------- http://matrix.skku.ac.kr/KOFAC/ --------

var('a, b'

# error function

E(a, b) = 1/2*((a - 1)^2 + (a + b - 3)^2 + (a + 2*b - 4)^2 + (a + 3*b - 4)^2)

u = vector([-2.5, -2.5])  # the initial value

tol = 1e-6 # the tolorence level 10^(-6)

eta = 0.1 # the learning rate

r = []    # graph

for k in range(300):

gn = g.norm()

r.append((u[0], u[1], E(u[0], u[1])))

if gn <= tol:

print("Algorithm Succeed!")

break

u = u - eta*g

print("u* =", u)

print("E(x*) =", E(u[0], u[1]))

print("iteration number =", k + 1)

p1 = plot3d(E(a, b), (a, -3, 5), (b, -3, 5), opacity = 0.6)  #  E(a, b)

p2 = line3d(r, color = 'red') + point3d(r, color = 'black', size = 50)

show(p1 + p2, aspect_ratio = [5, 5, 1]) # graph

------------------------------------------------------------------------------------

Algorithm Succeed!

u* = (1.49999929120196, 1.00000033198324)

E(x*) = 0.500000000000364

iteration number = 118

The solution is the same as that obtained from QR decomposition. The least-square line is .

Therefore, the line we get from the least-squares solution is .

As we found earlier, we found the <least-square line> by algorithm from the given data. The least-squares solution is obtained with the algorithm. The least-squares line  was obtained with the answer as a coefficient. We can also find a quadratic function that best fits the given data. In this case, the error function is as follows.

We can use the same GDM code to find the best fit for the below quadratic function . Take an initial value , tolerance and learning rate .

-------- http://matrix.skku.ac.kr/KOFAC/ --------

var('a, b, c')

E(a, b, c) = 1/2*((a - 1)^2 + (a + b + c - 3)^2 + (a + 2*b + 4*c - 4)^2 + (a + 3*b + 9*c - 4)^2)

u = vector([1.0, 1.0, 1.0])

tol = 1e-6  # tor 10^(-6)

eta = 0.01  # the learning rate

for k in range(5000):

gn = g.norm()

if gn <= tol:

print("Algorithm Succeed!")

break

u = u - eta*g

print("u* =", u)

print("E(x*) =", E(u[0], u[1], u[2]))

print("iteration number =", k + 1)

-----------------------------------------------------------------------------------

Algorithm Succeed!

u* = (1.00000138155249, 2.49999723768073, -0.499999180018564)

E(x*) = 1.59664737265671e-12

iteration number = 4207

Hence, the least square quadratic curve is since the output has .

After 4207 iterations, we now have the least square curve.

It means that we can always have the  best fit least-square quadratic curve that passes through four or more points.

[Web resources]

[GDM lecture] https://youtu.be/BME4lOvnE-U  (New)

[Calculus and Optimal solution]  http://matrix.skku.ac.kr/math4ai/part2/

 ◩ Open Problem 1

Find the third derivative for the differentiable function from the textbook.

 ◩ Open Problem 2

Find a complicated twice differentiable function in other textbooks, and use the code to find the local maximum, the local minimum, the absolute maximum, and the absolute minimum of the function.

 ◩ Open Problem 3

Find the minimum value of .

Use , , and using the above GDM code.

 ◩ Open Problem 4

Draw a graph for various differentiable functions, as shown in the figure above. Determine a small interval containing the local minimum identified from the figure (by your eye), and apply the GDM code to each interval with a reasonable starting point . Discuss your output.

Sungkyunkwan University Students at the old classroom

Part IV.

Part IV. AI and Statistics

10. Permutation, combination, probability, random variable,

probability distribution,  Bayes' theorem

11. Expectation, variance, covariance, correlation coefficient,

covariance matrix

10. Week 10. Permutation, combination, probability, random variable, probability distribution,  Bayes' theorem

 10.1 Permutations and combinations 10.2 Probability 10.3 Conditional probability 10.4 Bayes' theorem 10.5 Random variable 10.6 Discrete probability distributions 10.7 Continuous probability distributions

10.1 Permutations and combinations

http://matrix.skku.ac.kr/intro-math4ai/W10/

 ◩ Open Problem 1

Discuss the Monty Hall problem of conditional probabilities. It is one of the examples in which Bayes' theorem can be applied.  https://destrudo.tistory.com/5

[Lecture]   Permutation, combination, probability: https://youtu.be/KQXO-XbJauU

[Lecture] Bayes’ theorem: https://youtu.be/VAGLigLt2Hw

10.5. Random variables

10.6 Discrete probability distributions

10.7 Continuous probability distributions

[Lecture on] Discrete probability distributions https://youtu.be/Fq7D7bGG_cE

[Lecture on] Continuous probability distributions https://youtu.be/4wx1raETI8o

[Lecture on] Integral https://youtu.be/62OxYG7VMsE

11. Week 11.  Expectation, variance, covariance, correlation coefficient, covariance matrix

 11.1 Expectation, variance, standard deviation 11.2 Joint probability distribution *11.3 Covariance, correlation coefficient 11.4 Covariance matrix

11.1 Expectation, variance, standard deviation

The expected value, or expectation, of a random variable is an average value for probabilistic events, which is the sum of products of a value obtained by each event and each event's probability. The variance of a random variable is a measure of the dispersion of numbers (data), which indicates how far a set of numbers (data) are spread out from their average value. The standard deviation is defined as a square root of variance.

You can find the expectation, variance, and standard deviation of a discrete random variable with these formulas.

(1) Expectation:

(2) Variance:

(3) Standard deviation:

Find the expected value, variance, and standard deviation of a random variable .

 0 1 2 3 Sum Probability 0.01 0.84 0.145 0.005 1

Solution.  This problem can be solved easily using the Sage code.

-------- http://matrix.skku.ac.kr/KOFAC/ --------

var_x = [0, 1, 2, 3]  #A random variable

prob_x = [0.010, 0.840, 0.145, 0.005] #Probability distribution

E_x = sum(var_x[i]*prob_x[i] for i in range(4)) #Expected value

V_x = sum((var_x[i] - E_x)^2*prob_x[i] for i in range(4))  #Variance

S_x = sqrt(V_x)  #Standard Deviation

print("E(X) =", E_x.n(digits = 7))

print("V(X) =", V_x.n(digits = 7))

print("S(X) =", S_x.n(digits = 7))

--------------------------------------------------------------------------------------------------------------

E(X) = 1.145000      #    Expected value

V(X) = 0.1539750     #   Variance

S(X) = 0.3923965      #    Standard Deviation

We can find it with R code.

-------- http://matrix.skku.ac.kr/KOFAC/ - < R  code> --

x <- c(0, 1, 2, 3)                    #A random variable

pr.x <- c (0.010, 0.840, 0.145, 0.005)  #Probability distribution

e.x <- sum(x*pr.x)                    #Expected value

var.x <- sum((x^2)*pr.x) - e.x^2      #Variance

sd.x <- sqrt(var.x)                   #Standard Deviation

cat(e.x, var.x, sd.x)                 # Print many lines at the once

----------------------------------------------------------------------------------

1.145              Expected value

0.153975        Variance

0.3923965     #    Standard Deviation

The expectation, the variance, and the standard deviation of a continuous random variable X can be found as follows:

(1) Expectation

(2) Variance

(3) Standard deviation

Here are the properties of the expected value and variance.

③ When you define ,  Z is called 'a standardized random variable'. The expected value of Z is 0 and the variance of Z is 1 according to the properties of expected value and variance 1 and 2.

Let the probability density function of variable is . Find the variance of and

Solution.

-------- http://matrix.skku.ac.kr/KOFAC/ --------

var('t')

f(t) = 3*t^2

Ex = integral(t*f(t), t, 0, 1)

Ex2 = integral(t^2*f(t), t, 0, 1)

print(Ex)                  # the Expected value of X

print(Ex2 – Ex^2)         # the variance of X

Ey = integral((4*t + 2)*f(t), t, 0, 1)

Ey2 = integral((4*t + 2)^2*f(t), t, 0, 1)

print(Ey)                 #the expected value of Y=4X+2

print(Ey2 – Ey^2)       # the variance of Y

-------------------------------------------------------------------------------

3/4

3/80    # the variance of X,

5

3/5   # Variance of Y

 ◩ Open Problem 2

Try to find an example of a continuous random variable from other textbooks for your exercise. Find the expected value, the variance, and the standard deviation of .

11.2 Joint probability distribution

When there are two or more random variables, we should also look at the probability distribution for those random variables together (e.g., the event two events happen simultaneously) and  the probability distributions for each variable. We call the joint probability the probability two random variables happen at the same time. When we know the probability distribution of two random variables together and the probability of one random variable, we can figure out the rest of one random variable's probability distribution. We need to know the concept of the joint probability distribution(joint distribution)  to understand it.

Let’s look at the case with two discrete random variables.

(1) If and are discrete random variables, the joint probability function for and is defined as below:

(2) We can show for every possible value of and as in the below table. We call it the joint probability distribution

 Sum the marginal probability distribution for Sum the marginal probability distribution for

(3) When the joint distribution for and is given, the marginal probability distribution is defined as below.

The marginal probability distribution is the probability distribution for one variable of two variables described in the given joint probability distribution.

Here are the properties of the joint probability distribution.

① For all , ,

② For all , the sum of is 1.

③ For all ,

There are three blue balls, two red balls, and three green balls of the same size in a pocket and we are going to take out two balls randomly. Let’s say that among the balls we took out, the number of blue balls is , and the number of red balls is . Please answer the following questions

(1) Find the joint probability function of and ,

(2) Make a table of the joint probability distribution of and .

(3) Find ,

(4) Find the marginal distribution of , and

(5) Find the marginal distribution of .

Solution. We can get the table below using the definitions of joint distribution and marginal distribution.

① 과 ②

 Sum Sum

 Sum

 Sum

◆ As for a continuous random variable, we use the joint density function. The joint density function for continuous random variable and is defined as below. [We use double integrals here. Refer to this YouTube lecture https://youtu.be/T1z_GYt85rI.]

①  , for all real

②  , for all real

③  , for all real

④ The probability of belongs to the area is as follows.

⑤ We can define the Marginal probability density function of and   as follows.

*   The joint density function of the two random variables and is given below.

Find the marginal density function of , in the given area of (0,1).

Solution. Using the definition, we can solve this problem as follows.

-------- http://matrix.skku.ac.kr/KOFAC/ --------

var( 'x, y' )

f( x, y ) = x + y

print( "f_X(x) =" , integral( f(x, y), y, 0, 1)) #the marginal distribution of X

print( "f_Y(y) =" , integral( f(x, y), x, 0, 1)) # the marginal distribution of Y

--------------------------------------------------------------------------------------------------------------

f_X(x) = x + 1/2      #the marginal distribution of X

f_Y(y) = y + 1/2    #the marginal distirubiton of Y

[Reference]   A random variable, Expectation https://youtu.be/SUsZHarQqqg

*11.3 Covariance, correlation coefficient

The first thing we use to understand the distribution of a random variable is the mean. Using the mean, information about the distribution can be expressed as a single number (the middle part of the distribution). The second concept we use is variance. Using the variance indicates how far the distribution is from the mean.

Then, what method is needed to understand the probability distribution with two random variables and ? First, we can think of the mean of and the mean of . Next, we can use the variance to determine how each variable is spread out. However, the concept of covariance is necessary to know the correlation between two random variables.

The covariance of random variables and is defined as follows.

The  covariance is the average of the product of the deviation of and the deviation of . However, there is a problem that the covariance is affected by the size of the units of and . To compensate for this, we use a correlation coefficient. The correlation (coefficient) can be thought of as standardization by dividing the covariance by the standard deviations of two random variables. Therefore, it is not affected by the absolute size of the random variable.

The correlation coefficient between the random variables and is defined as follows.

11.4 Covariance matrix

Using matrices, it is easy to express how several random variables relate to each other. The covariance matrix is created by using the variance and covariance of each random variable.

The covariance matrix for {, , } of random variables is defined using a matrix with the covariance between the th and th random variable when , and the variance of the th random variable when , It is denoted by

In simple terms, The covariance matrix is a square matrix with the variance in the main diagonal and covariance of two variables in off-diagonal.

The covariance matrix can be seen as representing the distribution of data, as shown in the figure below.

Given the sample data as follows, find the covariance matrix of it.

 1 2 3 4 5 6 2 3 5 6 1 9 3 5 5 5 10 8 10 20 30 40 50 55 7 8 9 4 6 10

We can easily find the covariance matrix using code.

-------- http://matrix.skku.ac.kr/KOFAC/ --<Sage  code >--

X = matrix([[1, 2, 3, 4, 5, 6],

[2, 3, 5, 6, 1, 9],

[3, 5, 5, 5, 10, 8],

[10, 20, 30, 40, 50, 55],

[7, 8, 9, 4, 6, 10]])

n, p = X.nrows(), X.ncols()  # n is the number of rows, p is the number of columns

X_ctr = zero_matrix(RDF, n, p)  # Make a matrix

import numpy as np

for row in range(n):

v = np.array(X.row(row))

m = mean(v)

for col in range(p):

X_ctr[row, col] = X[row, col] - m

print("Covariance_Matrix =")

print(1/(p - 1)*X_ctr*X_ctr.transpose().n(digits = 7))

--------------------------------------------------------------------------------------------------------------

Covariance_Matrix =

[ 3.500000  3.000000  4.000000  32.50000 0.4000000]

[ 3.000000  8.666667 0.4000000  25.33333  2.466667]

[ 4.000000 0.4000000  6.400000  38.00000 0.4000000]

[ 32.50000  25.33333  38.00000  304.1667  1.333333]

[0.4000000  2.466667 0.4000000  1.333333  4.666667]

R code to generate covariance matrix from the given data

-------- http://matrix.skku.ac.kr/KOFAC/ - < R  code > --

# create vectors

a <- c(1,2,3,4,5,6)

b <- c(2,3,5,6,1,9)

c <- c(3,5,5,5,10,8)

d <- c(10,20,30,40,50,55)

e <- c(7,8,9,4,6,10)

# create a matrix from vectors

M <- cbind(a,b,c,d,e)

cov(M)

--------------------------------------------------------------------------------------------------------------

 ◩ Open Problem 3

Find the covariance matrix of the sample data found in other textbooks.

 ☞ Note The covariance matrix plays an important role in dimension reduction, effectively reducing the dimension while maintaining the distribution of high-dimensional data as much as possible. A typical technique is principal component analysis (PCA). When calculating the principal component, the singular value decomposition (SVD) is mainly used.

 ◩ Open Problem 1

Discuss the Monty Hall problem of conditional probability, one example  in which Bayes' theorem is applied. https://destrudo.tistory.com/5

 ◩ Open Problem 2

Find the expected value, the variance, and the standard deviation of the continuous random variable found in other textbooks.

 ◩ Open Problem 3

Find the covariance matrix of the data found in other textbooks.

Bicheon dang (비천당, 丕闡堂), place of National Gwageo Exam, the highest-level examination to recruit officials of the Joseon Dynasty (Photo by Author)

Part V

Part V. PCA and ANN

Principal Component Analysis (PCA) and Artificial Neural Network (ANN)

12. Principal Component Analysis (PCA)

13. Artificial Neural Network (ANN)

12. Week 12. Principal Component Analysis

 12.1 Dimensional reduction 12.2 Principal Components Analysis (PCA) 12.3 Finding principal component 12.4 Examples of principal component analysis *12.5 Principal component analysis and covariance matrix *12.6 Principal component analysis and linear regression

12.1 Dimension reduction         http://matrix.skku.ac.kr/intro-math4ai/W12/

<Principal Components Analysis (PCA)> is the most important part of this book. Large or high-dimensional data is difficult to analyze because it is difficult to compute and visualize. Therefore, it is necessary to reduce the dimension of the data while maintaining the distribution of the original data as much as possible. It is called a dimension reduction. Principal component analysis is one of the most widely used dimension reduction techniques. It converts data in high-dimensional space into data that can be handled in a low-dimensional space while preserving the original data's distribution as much as possible. It is why the PCA is one of the most fundamental and essential part in AI.

In 1974, <MOTOR TREND MAGAZINE> published some data that characterize and quantify cars released during the year 1973-74. Given below are some of the data on the 11 variables from that publication:

 mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 ...

As we can see in this table, each car's data is presented as an 11-dimensional vector. Such multidimensional data is difficult to analyze because it is not easy to visualize and compute. Therefore, it is necessary to reduce the dimension of the data while preserving the distribution of the original data as much as possible. This process is called a dimension reduction. It is possible to extract and only use the data of some essential variables(feature selection), but it is impossible to know in advance what a close relationship exists between the variables. So even if we select and analyze only the weight and displacement (engine/piston/cylinder) in this way, we may not be sure that it properly reflects the distribution of the initially sufficiently investigated data. Therefore, an analysis of the choice of the principal component is necessary to determine which function or element to analyze. This process is called the Principal Component Analysis (PCA).

12.2 Principal Component Analysis  (PCA)

The Principal Component Analysis (PCA) is one of the most widely used dimension reduction techniques. Basically, PCA converts a data set from high-dimensional space into low-dimensional, easy-to-handle spaces while preserving the distribution of the original data as much as possible. The PCA combines existing variables to find new variables that are not related to each other, namely, principal components. The first principal component PC1 preserves the original data distribution as much as possible, the second principal component PC2 preserves the distribution of the next original data as much as possible too, and so on.

In the case of 11-dimensional data, the 11 principal components can be created by combining existing variables. For example, let us assume that PC1, PC2, and PC3 preserve about 90% of the distribution (property) of the original data. Then, we still can perform a rational analysis and capture majority of the insights even though 10% of the information is lost when we do the analysis with only PC1, PC2, and PC3. So we can simply reduce the dimension of our analysis to 3D data by selecting only PC1, PC2, and PC3 for the next level of analysis. In this case, it is much easier to compute and visualize, and it allows us to analyze the data without much difficulty.

◆ To find the principal component, we need to find a new axis, which is called <the principal axes or principal direction>. The first axis sets the largest distribution of data obtained by projecting the original data onto this axis. Below is a picture of the given 2D data (left picture) orthogonal to the first axis (right picture). At this time, the projected data constitutes PC1.

[Source] High school grades (GPA) and university grades (GPA) data of 105 computer science students (centered) are available at the following links.

How to find the new axis? When the data are given, the orthogonal projection should be used to find the new axis where the error is relatively small. We take  the first axis, whose distribution is the largest, and call it PC1. And we continue to do this for PC2 and PC3, etc., in the same way. For the second axis, when the original data is projected onto the PC1 axis, the obtained data distribution is the largest after PC1. Again, using the concept of projection, we consider the size of the sum of the errors. And so on, then PC1 and PC2 are not related to each other, which means the second axis is not related to the first axis. This is the reason that we studied the orthogonal projection and Gram-Schmidt orthonormalization process was for this.

In order to implement process, a principal component analysis is needed to set the axis so that essential variables are collected as each principal component, and that each principal components are not related to each other and are (if possible) linearly independent. It is a concept related to the diagonalization of matrices. This is why it is necessary to ensure the orthogonality of the different principal axes. The process of finding axes that are linearly independent of each other in the orthogonalization method is used for principal component analysis. We could say that it was to be able to do this that we studied Linear Algebra, particularly about the matrix diagonalization (or SVD).

The picture on the left below is a projection of the given 2D data by finding the second axis. Here, the projected data forms PC2. And if we let the first main axis as the -axis, the second main axis as -axis and do the orthogonal projection of the data, it will be looked like a picture on the right. Hold the - and -axes like below. From this, we can intuitively understand that the distribution of the orthogonal data along the first principal axis is much larger. The -axis distribution is longer and the -axis distribution is shorter.

12.3 How to Compute principal component

Let's do the principal component analysis. is called an data matrix. Suppose that is the number of samples, and is the number of random variables {, , } representing the features of the data. In statistics, in general, we simply use lowercase notation {, , } instead of uppercase notation {, , } for random variables to distinguish it from the data matrix . So, here we will use random variables in the lower case of as {, , }. Let's say that the component of the data matrix means one data set for the th random variable in the -th sample. And , the column of , means all the data set of the random variable .

Let be the matrix that is centered ("centering, adjust the mean of the random variable to zero") from the data matrix as above, so the mean of the random variable in any mean centered matrix are all zero. There are a number of reasons for this. It makes all the computations become simple, and the analysis becomes straight forward and easy.

① [Define the centered matrix as (tilda)]. Given a matrix named , we define a centered matrix , so that the mean of the random variables are all zero.

② To do this, find averages of each column in (each random variable). The mean of the random variable is denoted by   , . We now have  the centered matrix after subtract the mean (of the column) from the data for each column of as following.

This centered matrix is called a mean-centered matrix because it is made for that purpose. From now on, we will assume that a given data matrix is ​​simply a mean-centered matrix = already for an easy subsequent analysis when we develop the theory.

③ Find the Singular Value Decomposition (SVD) of a mean-centered matrix .

Here, and are orthogonal matrices, and is the diagonal matrix with  singular values as the main diagonal component arranged in order of magnitude. The interesting thing here is that only (rank of the matrix ) singular values are greater than 0 and the rest are 0. Hence, after the multiplication, the result is , which means that only the first columns were used in the orthogonal matrix , and only columns were used in the orthogonal matrix .

④ In the Singular Value Decomposition of , the column vector of becomes the principal axes.

⑤ If the product of and is expressed as , the column vectors of , obtained by the original data's orthogonal projection  to the principal axis, become <principal component scores (PC scores)>.

⑥ After that, computed with singular values becomes the variance of -th principal components, and becomes the ratio at which the -th principal component preserves the distribution of the original data. By computing the proportion that preserves the distribution of the original data for each PC(principal component), we can decide the reduced dimension that we like to use.

⑦ If we decide the reduced dimension as () from the original dimension , we only select the first column vectors () of and the -th leading principal submatrix () of , then becomes an matrix containing the first PC (principal components).

.

 ◩ Open Problem 1

Explain how a singular value decomposition (SVD) is used in the principal component analysis.

12.4 Examples of PCA

In this section, we introduce one example of principal component analysis using a numerical data set. Specifically, using this dat set, we will perform a PCA as a practice example. We create an example so that we can check the data implemented with the real Sage and R code.

A 4-item Likert survey with a scale of 7 was conducted on 16 people on what people are interested in when choosing a new computer.

(1: strongly disagree – 7: strongly agree)

Price        Price value

Software     Compatibility with user's OS

Aesthetics    Design value

Brand       Brand value

Using the data (price, software, design, brand preference) obtained from a survey of 16 people, a “Data matrix” is formed and the PCA is conducted using the Sage code given below. This data comes from the following open source.  [Source]  http://yatani.jp/teaching/doku.php?id=hcistats:PCA

-------- http://matrix.skku.ac.kr/KOFAC/ --------

# Enter the data matrix.

# Here, each column represents data corresponding to the variables Price, Software, Aesthetics, and Brand.

# data matrix

X = matrix(RDF, [[6, 5, 3, 4], [7, 3, 2, 2], [6, 4, 4, 5], [5, 7, 1, 3],

[7, 7, 5, 5], [6, 4, 2, 3], [5, 7, 2, 1], [6, 5, 4, 4],

[3, 5, 6, 7], [1, 3, 7, 5], [2, 6, 6, 7], [5, 7, 7, 6],

[2, 4, 5, 6], [3, 5, 6, 5], [1, 6, 5, 5], [2, 3, 7, 7]])

print("X =")

print(X)

print()  # One line indent

# centering

# scaling

n = X.nrows()  # (number of rows)

p = X.ncols()  # (number of columns)

X_ctr = zero_matrix(RDF, n, p)  # Prepare memory for centered matrix

import numpy as np

for col in range(p):

v = np.array(X.column(col))

m = mean(v)

st = std(v)

for row in range(n):

X_ctr[row,col] = (X[row,col] - m)/st

# (X_ctr is a centered matrix)

print("X_ctr =")

print(X_ctr.n(digits = 4))

print() # (One line indent)

# Singular Value Decomposition(SVD). X_ctr = U*S*V.transpose()

U, S, V = X_ctr.SVD()

Var_PC = [(S[i, i]^2/n).n(digits = 4) for i in range(p)] # PC variance

Prop_Var = [100*Var_PC[i]/sum(Var_PC) for i in range(p)] # Ratio of variance

print("PC variance      ", Var_PC)

print("Ratio of variance    ", Prop_Var)

print()  # (One line indent)

# Scree Plot. Visually show variance. Decide to what dimension to reduce.

show(list_plot(Var_PC, plotjoined = True, color = 'red', axes_labels = ['', 'Variances']))

print()  # (One line indent)

# Compute PC score.  Z = U*S

# Calculate the PC score.  Z = U*S

PC_score = U*S

PC12 = PC_score[:,0:2]# (Use only the first and second components for visualization)

# (Visualize the data on a plane.)

point([PC12.row(i) for i in range(n)], color = 'red')

----------------------------------------

[6.0 5.0 3.0 4.0]

[7.0 3.0 2.0 2.0]

[6.0 4.0 4.0 5.0]

[5.0 7.0 1.0 3.0]

[7.0 7.0 5.0 5.0]

[6.0 4.0 2.0 3.0]

[5.0 7.0 2.0 1.0]

[6.0 5.0 4.0 4.0]

[3.0 5.0 6.0 7.0]

[1.0 3.0 7.0 5.0]

[2.0 6.0 6.0 7.0]

[5.0 7.0 7.0 6.0]

[2.0 4.0 5.0 6.0]

[3.0 5.0 6.0 5.0]

[1.0 6.0 5.0 5.0]

[2.0 3.0 7.0 7.0]

X_ctr =

[  0.8485 -0.04218  -0.7500  -0.3866]

[   1.317   -1.392   -1.250   -1.511]

[  0.8485  -0.7170  -0.2500   0.1757]

[  0.3804    1.307   -1.750  -0.9489]

[   1.317    1.307   0.2500   0.1757]

[  0.8485  -0.7170   -1.250  -0.9489]

[  0.3804    1.307   -1.250   -2.074]

[  0.8485 -0.04218  -0.2500  -0.3866]

[ -0.5559 -0.04218   0.7500    1.300]

[  -1.492   -1.392    1.250   0.1757]

[  -1.024   0.6327   0.7500    1.300]

[  0.3804    1.307    1.250   0.7380]

[  -1.024  -0.7170   0.2500   0.7380]

[ -0.5559 -0.04218   0.7500   0.1757]

[  -1.492   0.6327   0.2500   0.1757]

[  -1.024   -1.392    1.250    1.300]

PC variance      [2.278, 0.9011, 0.4356, 0.1348]

Ratio of variance     [60.76, 24.03, 11.62, 3.596]

# Hence, from here it can be understood that PC1 and PC2 retain about 85% of the original data variance.

# In this case, the original data was four-dimensional, but even if we reduced it to two-dimensional and analyzed, it also retains about 85% of the original data's properties. So, in general, they are useful for decision-making. It means that we can cut the time in half, or at least more than a quarter. We got such a result because we reduced the 4 dimension problem to a 2 the dimension problem. From this simple example, one can easily envision the benefit of large dimensional reduction. For example, if we are able to reduce the data of dimension 1,000,000- to a data of dimension 1,000, then we can significantly reduce the amount of economic costs and computing time. Through the use of this theory and computer programming code, we can immediately analyze the result.

# For the principal components' eigenvalue below, we can see the process of deciding how many principal components to select when we reduce the dimension of the data using the following line plot. Eigenvalues are sorted in the order of magnitude from the largest component to the component to the left of the “elbow point”, usually where the slope is bent. Therefore, even if we lose about 15% of the information, we can reduce the dimension 2 by selecting only PC1 and PC2. If we look at these figures, we can see it is enough to have two axes to understand about 85% of the overall distribution. More details can be found in http://math1.skku.ac.kr/home/pub/212/.

 ◩ Open Problem 2

Take a data matrix from sources related to your major, then use the code above to apply a Principal Component Analysis (PCA) algorithm to this matrix, and then, post the results on the QnA Board for discussion. This will help you to understand the topic.

*12.5 PCA and Covariance matrix

The principal component analysis is a technique that converts data from high-dimensional space to low-dimensional space. However, the information about the distribution of the raw (original) data is also contained in the covariance matrix. So, there is a very close relationship between principal components analysis and covariance matrices. This is another reason that motivates us to learned Linear Algebra. To do principal component analysis, we need to be able to apply SVD on the covariance matrix. From now on, we assume that we always start with a mean-centered data matrix , and we create a covariance matrix based on the given data, and since this covariance matrix is a symmetric matrix, it is also diagonalizable, which means that its eigenvalues are always real numbers.

The goal of the principal component analysis is to "Find the smallest number of new variables" that can "preserve the information from the covariance matrix as much as possible". Our goal is to use this method to reduce the rank to (MAKE) a much smaller size () matrix. Thus, the covariance matrix can be expressed with the orthogonal matrix used in SVD. [rank-p reduction, dimension reduction]

Let's have a SVD of .

Then we have

If we look closely at this relationship, we will see that it is like =, which means the eigenvalue-eigenvector relation. That is,

, that menas ().

Now is an eigenvalue of , and is an eigenvector of corresponding to . This means the variance of the -th PC, and means the th principal axis.

 ☞ Note eigenvalue-eigenvector relation: http://matrix.skku.ac.kr/math4ai/part1/

 ◩ Open Problem 3

Discuss what you understand on the process of dimension reduction by PC’s on the covariance matrix.

*12.6 PCA and Linear Regression

In linear regression, and were determined to minimize the distance between the data and the computed points from (i.e., distance on -axis) (left figure). But in PCA, and were determined to minimize the perpendicular distance between the data and (right figure). That is, the straight line obtained using principal component analysis is closer to the data. The figure below shows the straight-line obtained by linear regression and principal component analysis together. (Because principal component analysis is sensitive to the scale of the data, centering and normalization should be assumed. Refer to the relevant literature for details.)

 ◩ Open Problem 4

Discuss what you understand about the similarity and difference between the least squares line and the linear regression.

[Lecture on PCA] https://youtu.be/ukIttphmM_4

Week 13. Artificial Neural Network

A neural network is a model of neurons, the basic unit of the nervous system, and performs tasks such as image recognition. In this section, we will explain how the artificial neural network works.

 13.1 ANN 13.2 How artificial neural network works 13.3 How the neural network study 13.4 Backpropagation (BP)

13.1 Artificial Neural Network

Neurons(nerve cells) in a human brain make <Neuro transmitters from amino acids in the blood>. Nerve stimulation occurs in one direction, and the signals are transmitted to other nerve cells around it by using the generated neurotransmitter. One nerve cell responds to signals (input signals) received from several other nerve cells. If the cell membrane reaches a particular threshold potential(threshold value), it transmits a signal (output signal) of a certain size to the next nerve cell. Since this is a vector-in and vector-out, it can be understood that the resulting vector was obtained by multiplying a matrix to the input vector. The key is to find this function, which is a matrix.

In general, the nerve does not respond when a small stimulus is initially given. But the nerve's intensity is gradually increased, and when it exceeds a threshold value, it suddenly responds with 'Ouch!'. The corresponding function can be considered a Heaviside function. But the Heaviside function is not differentiable, we use the Sigmoid function (or ReLU function) as the activation function which is similar to it.

Machine Learning (ML) is the study of computer algorithms that improve automatically through experience Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. An artificial neural network (ANN) is a computational model based on the structure and functions of biological neural networks. Information that flows through the network affects the structure of the ANN because a neural network changes or learns based on that input and output.

When we talk about machine learning in ANN(artificial neural network), if there are hidden layers in an ANN, we call it a deep learning. The process of creating and updating weights between hidden layers in a deep learning is called a Backpropagation. The mathematical principle of this Backpropagation will be introduced. The Backpropagation algorithm explains what happens between the hidden layers in a ANN. The human brain has about 100 billion neurons and performs difficult tasks very well, such as image and pattern recognition. It is a picture of a neural network in the brain.

A neural network is a model of a neuron, which is a basic unit of the nervous system. Neurons are in effect the primary transmission lines of the nervous system. One artificial neuron(node) receives multiple input signals and one output signal. This is similar to transmitting information by sending electrical signals in real neurons. When a neuron transmits signals, the weight plays a role. Each input signal has its own weight, and a signal with a larger weight will be a more important signal. This figure is an example of an artificial neural network.

13.2  How artificial neural network works

Now we will study how this artificial neural network works. As shown below, one node can be seen as a function that receives an input signal and transmits the result. Therefore, it can be expressed as a function that receives an input signal and outputs . Here, is a bias. For example, when an input signal is given and the weights assigned in advance exceed the threshold value , then we will have the output .

If     output 1  ()

[Source : Prof. Ho-Sung NAM at Korea Univ.]

Similarly, it can be expressed as a linear function when we have two input signals. In the figure, when and are inputs with given weights , then the output will be . And when there are three input signals, and , then the output will be .

If there are two outputs, not one, it can be expressed using the matrix product as follows. For convenience, let the bias 0 for simplicity. We may put weights as 's since there are multiple weights. This is called the weight connected from the th node in the input layer to the th node in the output layer.

.

In linear algebra, the input () is a vector and the output () is a vector and the function is a matrix . But in statistics or artificial intelligence, they use which is the transpose of for their convenience. Here vector is the input and matrix is multiplied it to yield a vector as following.

This way is more intuitive in some sense, so this notation will be seen in the field of artificial intelligence and statistics. Now we see that we can model neurons using functions, especially matrices.

If a number of input signals are given in an artificial neuron, calculate with the weights assigned in advance, and if the sum exceeds a predetermined threshold, outputs 1. If not, outputs 0 (or –1). At this time, the function that determines the output is called the Activation Function. A role model of an activation function is the Sigmoid function which is -th differentiable.

The Sigmoid function has almost the same behavior as the following Heaviside function (step function),

.

-------- http://matrix.skku.ac.kr/KOFAC/ --------

# Sigmoid function

f(x) = 1/(1 + exp(-x))  # Sigmoid function

H(x) = heaviside(x)   # Heaviside function

p1 = plot(f(x), (x, -5, 5))

p2 = plot(H(x), (x, -5, 5), color = 'red')

p1 + p2

------------------------------------------------------------------------

Sigmoid function has the following properties.

① For all , .

② If then and if then .

③ Sigmoid function has the following a nice property.

-------- http://matrix.skku.ac.kr/KOFAC/ --------

#

f(x) = 1/(1 + exp(-x))

print("f'(x) =", diff(f(x), x))

bool(diff(f(x), x) == f(x)*(1 - f(x)))

--------------------------------------------

True

Compared with the Heaviside function(step function), this Sigmoid function normalizes the output signal to a continuous value between 0 and 1, rather than an extreme value (0 or 1). In addition, the Heaviside function is not differentiable when , but the Sigmoid function is differentiable on all real numbers and the computations are very simple. Besides, ReLU(Rectified Linear Unit) function, hyperbolic tangent(), etc are used more often as activation functions in many ANN problems. For now, we will focus on the Sigmoid function and develop the theory. [More details in https://reniew.github.io/12/]

Let’s organize this process.

① First, artificial neurons receive input data () from other artificial neurons or from outside.

② When input comes in, the artificial neuron combines the input data with the given weights and creates an output (by linearly combining a single value with the weight and adding the bias).

③ Substituting the value obtained in ⓶ into the activation function , output is returned.

Activation functions can be written as or . In the following figure, the activation function was written as . Schematically, the output value is obtained through this process. In other words, when the linear combination of an input () and weight is given, the activation function gives an output .

Input data, Weight, Linear combination,      Activation function,     Output data

13.3  How the neural network study

The neural network consists of an input layer, hidden layers, and an output layer. It receives a signal from the input layer and propagates it to the hidden layer through a given activation function after computation with a weight assigned in advance. The prediction result is transmitted to the output layer in the same way. Suppose there is an error between the predicted value obtained using the neural network from the given data and the observed value known in advance. In that case, the weight is gradually updated to reduce the error. Backpropagation and the gradient descent method are used in this process.

We can interpret a neural network is a model of a collection of neurons connected by a non-circulating graph. As shown in the figure below, it consists of an input layer, hidden layers, and an output layer from the left.

When a signal is received from the input layer, it is propagated to the hidden layer through a given activation function after calculating the weight given in advance. Similarly, the signal is calculated and transmitted to the next layer, and after repeating this process, the output layer produces the corresponding result.

[Source] https://deepai.org/machine-learning-glossary-and-terms/hidden-layer-machine-learning

Suppose that we are given a large input data set and we can obtained the correct answer based on the given data set. Our objective in this exercise to determine how efficient neural network can be used to reproduce the correct answer. At this time, the weights are not computed by any formula, but the weights are given randomly at the beginning, and the weights are gradually updated in the direction of reducing the error between the predicted values using the neural network and the correct answers obtained from the given data. For performing this weights updating process is the Backpropagation(BP) algorithm.

13.4  Backpropagation (BP)

Let’s look at the Backpropagation algorithm, which is used to train artificial neural networks.

 [Back propagation algorithm]   ① [Data Cleaning] The process of organizing data to fit into files is called a . After data preprocessing and securing the essential data to make the necessary model, divide this data set into training data, validation data, and a test data set. In general, 80% of the original data is used as training and verification data, and 20% is used as a test data set. The modeling is performed using the training data, and the model is refined more and more elaborately using the validation data. Then, it is checked whether it fits well by using test data. Through this process, we can find a functsion, in the form of a matrix, that fits the given artificial neural network. ② [Start with the initial conditions] Enter input and observed value and set the weight of artificial neural network arbitrarily. ③ [Set up matrix, activation ft.] Set up initial weight values for ANN(example; matrix, activation ft.) In other words, provide an arbitrary matrix for an appropriate function. After that, set the Sigmoid function (or ReLU function) as an activation function and provide an appropriate coefficient. For example, we can start with a matrix whose entries are all 1. ④ [Process] When the input layer receives a signal from training data, compute it with the weight (or updated weight after step ⑥) and transmits it to the hidden layer through a given activation function. The signal passes through all hidden layers, and we can get a predicted value at the output layer. ⑤[Find error] There will be an error between this predicted value and the observed value. ⑥ [Minimize error, GDM] Minimize this error by applying the gradient descent method. With the answer, we modify weights of the matrices (functions) in ③. ⑦ [Repeat the algorithm until the error is minimized] After correcting the weights, repeat the ④~⑥ steps until the overall error is minimized. ⑧ [Stop with the optimal solution, Check the model] If we find the optimal solution through the whole process, we have the neural network model. After modeling, check that the neural network model works well using the test data set.   This whole process operated within the hidden layers is called the algorithm.

If we want to increase the accuracy, we may increase the number of training data and the number of hidden layers until we find a model with acceptable or target accuracy as we need.

Deep learning is a learning process of an artificial neural network with more than two hidden layers.

The method of updating the weights based on error transmitted to each layer by the Backpropagation method is the same as applying the gradient descent method to minimize the error function. For example, let be the output value obtained from the activation function after the computation with weights; let () be the input signal received from each layer; let be the real observed value(the value corresponding to the correct answer). There must be an error between and . Then we can write an error function as a squared error between and , using a vector norm.

Then the formula for updating each weight is given by the gradient descent method. As we can see, is an algorithm to find NEW weights comes from the OLD one by one. In the previous description of the gradient descent method, the same algorithm was used to define for the new . Here means a partial derivative of with respect to the variable .

After calculating the weights with input signals 's obtained from a hidden layer, make a linear combination to compute an error of the next output layer. The Backpropagation algorithm is used in this process. We can transmit the error inversely proportional to the degree of influence on error. In the same way, if the output is shown in the following picture from the hidden layer, we compute the error between the expected value and apply it to the next step. Let's go through the mathematical details involve in this process.

① Compute the error passed to each layer. The objective for this step is to compute the error in the output layer. Assume that the hidden layer receives the input signal and and compute the weights , , , and then obtains the output and through the Sigmoid function .

Then, the squared error in the output layer can be obtained as follows.

Now we compute the error in the hidden layer. Unlike the output layer, the hidden layer does not have any observed values, so there is no corresponding error(difference) in the output layer. At this time, the aforementioned backpropagation method is used. The error in the neural network means that the result was made due to the error in hidden layers when the input signal propagates from the input layer to the final output layer. The error can be reversed in proportion to the degree, that is, the weight.

So, assuming that the error was obtained at the first node of the  output layer, it is connected to the first node of the hidden layer with the weight and the second node of the hidden layer with the weight . If the weight is large, the effect on the error will be large. So the error is transferred to the hidden layer in proportion to the weight. For example, if is an error propagated to the first node of the hidden layer, it can be expressed as below. Also, in the same way, can be expressed as below.

,

Even if we do the computation with the denominator of the error function 's are removed, the ratio still be kept. So there will be no big difference in our analysis from the result we obtained since it does not significantly affect the optimal solution. Now we may simplify the equation into the following linear system of equations.

Then, we now can handle the artificial neural network with hidden layers.

The objective of this step is to update the weights between the hidden layers and the output layer from the error obtained at the output layer. From the properties of the Sigmoid function and the chain law, we can find as follows.

As we have seen before, the derivative of the above function can be obtained easily due to the properties of the Sigmoid function. Specifically, we can get the result easily because we know the property of the Sigmoid function.

In the same way, the other weights are calculated as follows.

Similarly, it is easy to obtain other weights. The weight is connected only with the th node of the hidden layer and the th node in the output layer, so can be expressed simply as follows.

Here, is an input from the th node in the hidden layer, is an output from the th node in the output layer and is an error at the th node in the output layer. Essentially, we are using the gradient descent method to update the weights.

The objective of this step is to update the weights between the input layer and the hidden layer from the error passed to the hidden layer. After reflecting it, update the weights between the input layer and the hidden layer from the error passed to the hidden layer. Basically, we can compute it by updating the weights between the hidden layer and the output layer. For convenience, if we use the symbol as it is, we can see that the same algorithm, the Backpropagation, works in the hidden layer.

Here, is an input from the th node in the input layer, is an output from the th node in the output layer, is an error passed to th node.

All weights can be updated as follow by applying the gradient descent method.

Neural networks require a lot of data until the model can predict adequately and may also have many hidden layers between the input and the output layers. Such artificial neural networks with several hidden layers are called a <Deep Neural Network>, and machine learning for training a deep neural network is called <Deep Learning>. If there are multiple hidden layers, we can also use this Backpropagation method to update the weights by applying a gradient descent method from the error propagated in the previous layer. There are various informations/references related to this topic that we can refer to it.

This is the Backpropagation method. The extra info can be found in the following reference.

[Reference]   ‘What is Backpropagation doing?’

[Lecture]   The Maths behind Backpropagation algorithm
https://towardsdatascience.com/the-maths-behind-back-propagation-cf6714736abf

[Lecture] Math in ANN: https://youtu.be/d4WercT_OnU  ﻿

 ◩ Open Problem 5

Describe simply ANN and Backpropagation algorithm as you understand.

14.  Week 14. Hand Writing Numbers detection using ANN on MNIST Dataset

 14.1 Hand Writing Numbers detection using Artificial Neural Network

For a long time, post offices have spent a lot of effort in recognizing handwritten zip codes, sorting them, and then the physical process of delivering mail. This section shows how the artificial neural network recognizes the handwritten numbers in the MNIST data set and finds the closest number. It is called <Handwriting Number Recognition Using Artificial Neural Network>. We will learn about the process and principle of how artificial intelligence, instead of humans, does the process of automatically classifying envelopes so that the postal delivery department in charge of the area can carry out their work with minimum delay. We practice the process that AI figure out what the zip code we wrote. We explain <Examples of Handwritten Number Recognition Using Artificial Intelligence> and introduce related codes. The Backpropagation and the Gradient descent method are the main algorithms to update Artificial Neural Network.

14.1 Hand Writing Numbers detection using ANN

In the past, many people were involved in [Recognition of Handwritten (postal) ZIP Codes in a Postal Sorting system]. In this section, we learn about how Computer/AI do recognize Handwritten ZIP Codes by artificial neural networks. Now, artificial intelligence (Robot) is replacing the workforce at the post office by recognizing and sorting ZIP code and sorting them out by themselves. If we look at the images here, the letters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 can be written in various ways by each person.

In order to utilize the artificial neural network, as shown in the picture above, hand-written numeral images (number) are needed first. MNIST(Modified National Institute of Standards and Technology) database contains approximately 70,000 numeric images in black and white, 28 pixels wide by 28 pixels tall. Of these, we divide it into 60,000 learning data and 10,000 test data. Each data contains a hand-written image (with a label) of numbers from 0 to 9. The most widely used MNIST data for benchmarking the performance of machine learning can be found on the Web site below. Now let's check out the number images of the MNIST data.

Here, we have modified the Python code to Sage, which was released on a GitHub page in https://github.com/freebz/Make-Your-Own-Neural-Network.

As the title 'Make your own neural network' show, it works with the code that fits us. We added detailed explanations to make it easier to be understood.

Let's read the code and try to understand the algorithm. Then we can practice it in the Lab http://matrix.skku.ac.kr/intro-math4ai/W14/.

The first part of the code calls <numpy and matplotlib> libraries and some part of the code needs libraries.

[Source]  THE MNIST DATABASE of handwritten digits http://yann.lecun.com/exdb/mnist/

Let's check the numerical image of the MNIST data. The following was implemented in Sage with some modifications of the Python code from

-------- http://matrix.skku.ac.kr/KOFAC/ --------

import numpy

import matplotlib.pyplot

import csv

import urllib.request

import codecs

# Load csv file from web, mnist training data, into list

url = 'https://media.githubusercontent.com/media/freebz/Make-Your-Own-Neural-Network/master/mnist_dataset/mnist_test_10.csv'

# Open web data

response = urllib.request.urlopen(url)

training_data_list = list(training_data_file)  # list

all_values = training_data_list[0] # a data

# Figure out what the stored number is

print(all_values[0])

# Convert to 28 x 28 matrix

image_array = numpy.asfarray(all_values[1:]).reshape((28,28))

# Visualize the matrix as an image

matplotlib.pyplot.imshow(image_array, cmap = 'Greys', interpolation = 'None')

# Save image as foo.png

matplotlib.pyplot.savefig('foo.png')

# Close web data

response.close()

----------------------------------------------------------------------------

7

foo.png

When we copy and execute the command in our lab, the number 7 of training_data_list[0] appears. And the image is created in a png file foo.png as follows.

At this time, the number stored in training_data_list[0] was 7. And when we click on this, the number 7 is shown as an image as we see now. If we change this number [0] to training_data_list[1] in the same way, the stored number is 2. So the image 2 will be seen. In other words, if we choose another (number) image, ANN will tell us what that number was.

all_values = training_data_list[1]

An image of a number is saved as a matrix.

① Let's take a look at how the images of a number were converted into a matrix. Divide the image into sizes of 28 pixels wide by 28 pixels tall.

② Now, we have an image/matrix of 784= pixels. And we classify the brightness of each pixel with the natural numbers. The brightness of 0 is set to black and the number 255 to white, and then we have a matrix whose entries indicating its brightness.

③ For example, “the horizontal 12th and vertical 6th pixel and brightness is 82” would be expressed as (12, 6, 82). Thus, a number image can be represented in a matrix as shown below.

Here, all black image is a zero matrix. A colored area is marked with the number 82.  Now the letter 8 is shown below. It can be understood that this matrix data means a hand-written letter.

Now, if we look at the process of recognizing the image of a number through an ANN, the image of a number in our learning data is a matrix of 28×28, which converts it into a 784×1 vector.

Each component of this vector is an input signal and will be sent to the node of the input layer. Then the output layer through the ANN will give us an answer.

If there is an error from comparing this with the correct answer, the neural network's weight is adjusted using the gradient descent method and the Backpropagation algorithm. After all these learning data were used (when the learning process is completed), the remaining test data will be used to determine our neural network's performance.

The following is a neural network of the input, hidden layers, and output layers. This code shows how ANN worked when we use 100 learning data and 10 test data in it.

[Source]

The neural network's input layer for recognizing numbers consists of 784 nodes, and the output layer consists of 10 nodes representing numbers from 0 to 9. There are no specific regulations concerning the number of nodes in the hidden layer, but the number of nodes in this section was set at 100. Let's take a closer look at the lines in the code.

This code explains that it learns from the data with a three-layer neural network. It starts to import a 'numpy' and load 'scipy.special' to use the 'sigmoid function'. It imports a library for visualizing matrices. It defines 'Class of neural network' as 'neuralNetwork' and initializes a 'Neural network'. Then it defines 'input, hidden layer, output, learning rate, and weight matrix, and the weights in the array are denoted by which means connecting from node to node , in the next layer. It defines the learning rate and the activation function as a Sigmoid function.

-------- http://matrix.skku.ac.kr/KOFAC/ --------

# Code to learn MNIST data with 3 layers of neural network

import numpy

# Call scipy.special to use the sigmoid function expit()

import scipy.special

# Library for visualizing matrices

import matplotlib.pyplot

#Definition of neural network class

class neuralNetwork:

# Initialize the neural network

def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):

# set up the number of Nodes

self.inodes = inputnodes

self.hnodes = hiddennodes

self.onodes = outputnodes

# Weight matrix wih and who
# Weight w_i_j means connecting from node i to node j in the next layer.
# w11 w21
# w12 w22 etc

self.wih = numpy.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))

self.who = numpy.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))

# Learning rate

self.lr = learningrate

# Use sigmoid function as activation function

self.activation_function = lambda x: scipy.special.expit(x)

pass

# Train neural network

def train(self, inputs_list, targets_list):

# Convert input list to 2-dimensional matrix

inputs = numpy.array(inputs_list, ndmin=2).T

targets = numpy.array(targets_list, ndmin=2).T

# Compute the signal coming into the hidden layer

hidden_inputs = numpy.dot(self.wih, inputs)

# Compute the signal going out of the hidden layer

hidden_outputs = self.activation_function(hidden_inputs)

# Compute the signal coming into the final output layer

final_inputs = numpy.dot(self.who, hidden_outputs)

# Compute the outgoing signal from the final output layer

final_outputs = self.activation_function(final_inputs)

# The error of the output layer is (actual value-calculated value)

output_errors = targets - final_outputs

# The error of the hidden layer is calculated by recombining the errors of the output layer divided by the weight.

hidden_errors = numpy.dot(self.who.T, output_errors)

# Update weights between hidden layer and output layer

self.who += self.lr * numpy.dot((output_errors * final_outputs * (1.0 - final_outputs)), numpy.transpose(hidden_outputs))

# Update weights between input layer and hidden layer

self.wih += self.lr * numpy.dot((hidden_errors * hidden_outputs * (1.0 - hidden_outputs)), numpy.transpose(inputs))

pass

# Querying the neural network

def query(self, inputs_list):

# Convert input list to 2D matrix

inputs = numpy.array(inputs_list, ndmin=2).T

# Compute the signal coming into the hidden layer

hidden_inputs = numpy.dot(self.wih, inputs)

# Compute the signal going out of the hidden layer

hidden_outputs = self.activation_function(hidden_inputs)

# Compute the signal coming into the final output layer

final_inputs = numpy.dot(self.who, hidden_outputs)

# Compute the outgoing signal from the final output layer

final_outputs = self.activation_function(final_inputs)

return final_outputs

# Number of input, hidden, and output nodes

input_nodes = 784

hidden_nodes = 100  # If you increase it to 3000, 300, it is from 65% to 70%

output_nodes = 10   # Increase it  to 10000 to improve the accuracy to 95%

# Learning rate

learning_rate = 0.3

# Create an instance of the neural network

n = neuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

# Load csv file, mnist training data, into a list

import csv

import urllib.request

import codecs

url = 'https://media.githubusercontent.com/media/freebz/Make-Your-Own-Neural-Network/master/mnist_dataset/mnist_train_100.csv'

response = urllib.request.urlopen(url)

training_data_list = list(training_data_file)

response.close()

# Train neural network
# Epoch means the number of times the training data is used for learning.

epochs = 5

for e in range(epochs):

# Explore all records within the training data set

for record in training_data_list:

all_values = record

# Adjust the range and value of the input value

inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01

# Generate result value (all 0.01 except the actual value of 0.99)

targets = numpy.zeros(output_nodes) + 0.01

# all_values[0] is the result value for this record

targets[int(all_values[0])] = 0.99

n.train(inputs, targets)

pass

pass

# Load csv file, mnist test data, into list

url = 'https://media.githubusercontent.com/media/freebz/Make-Your-Own-Neural-Network/master/mnist_dataset/mnist_test_10.csv'

response1 = urllib.request.urlopen(url)

test_data_list = list(test_data_file)

response1.close()

# Testing the neural network

# Initialize the report card, which is an indicator of the performance of the neural network, to have no value

scorecard = []

# Explore all records within the test data set

for record in test_data_list:

all_values = record

# Correct answer is the first value

correct_label = int(all_values[0])

# Adjust the range and value of the input value

inputs = (numpy.asfarray(all_values[1:]) / 255.0 * 0.99) + 0.01

# Query the neural network

outputs = n.query(inputs)

# The index of the highest value matches the index of the label

label = numpy.argmax(outputs)

if (label == correct_label):

# If the answer is correct, add 1 to the report card

scorecard.append(1)

else:

# If the answer is not correct, add 0 to the report card

scorecard.append(0)

pass

pass

# Calculate and print the grade, which is the percentage of correct answers

scorecard_array = numpy.asarray(scorecard)

print("performance = ", float(scorecard_array.sum()) / scorecard_array.size)

---------------------------------------------------------------------------------------

performance =  0.7

After we run the above algorithm (the result of our practice), we learned the following. When we started this ANN with 100 training data and 10 test data, this neural network's accuracy was 70%, which is not bad. It took only a very short time. However, as the number of data increases to 60,000 training data and 10,000 test data from 100 training data and 10 test data, we see that the same algorithm's accuracy is roughly 95% accurate (more accurate than humans). More details in https://linux-blog.anracom.com/2019/09/29/a-simple-program-for-an-ann-to-cover-the-mnist-dataset-i/.

The case of recognizing hand-written numbers in the MNIST data set was possible because enough data was available to train the neural network. Previously, there was not much data available, and the computer took a lots of time to perfome the task. But nowadays, a lot of data is being produced every day, and AI systems have been improving better and better with vast amounts of training data. As with the example of [Recognition of Hand-written number with MNIST dataset], it is possible to recognize speech, signals, or text similarly. Furthermore, research that processes our speech (our natural language is more challenging to recognize than hand-writing, numbers, voices, signals, and letters) is ongoing.

Note: PCA is used to analyze MNIST data. The original data has 6304 rows and 785 columns. The shape of handwriting data is described using 24 by 24 (784) pixels. The Principal Components were constructed using 784 pixel data from the columns. The figure shows 0 to 9 based on the first two Principal Components.

We showed the result of using PCA in MNIST data. The following figure is the result of using t-Stochastic Neighbor Embedding(t-SNE). t-SNE does a similar job as PCA. It represents high dimension data on a lower dimension as PCA does. However, it uses a non-linear methodology, vs. PCA is a linear methodology. Satisfactory results require computationally extensive methods such as t-SNE.

[Source: https://bigsnarf.files.wordpress.com/2016/11/tsne_mnist_all.png

With the help of Natural Language Process, ANN will analyze the real meaning of what we said and will try to give us the best possible answers. This should enable us to react faster and make better decision.

For example, when I say "I want to pizza!" to a secretary in my smartphone, the AI system goes further to understand the meaning. 'Oh, this person does not want to make pizza but wants to buy or eat pizza from recognizing there is no term in the text such as 'recipe'. 'After AI understand the true meaning of what I said', AI understand that I said, "I want to eat pizza!". Next, AI uses GPS information to find restaurants that serve pizza near me. Then AI order the restaurants by distance and cost-effectiveness. After finding a restaurant in this way and figuring out the restaurant's name and address. AI is ready to make recommendations. Lastly, a message is generated in natural language sentences. For example, AI may returned the message "There is a Gino's Pizza house in the right of 1km ahead." And if we ask AI in a driveless car to take me to COEX convention center, AI will recognize the meaning and find the best route through the navigation and drive me safely to COEX. While AI driving the car, we can read, prepare for a talk, and be ready for the work to be done there. It will be a lifestyle that we will enjoy sooner or later.

 ◩ Open Problem 6

Find out some other examples on number recognition, face recognition, voice recognition, and natural language generation technologies currently used with ANN.

 ◩ Open Problem 7

Find and discuss topics in mathematics that have been used in artificial intelligence. http://matrix.skku.ac.kr/math4ai/part4/

 Homework

[Week 13]

 ◩ Open Problem 1

Explain how a singular value decomposition (SVD) is used in the  principal component analysis.

 ◩ Open Problem 2

Take a data matrix from sources related to your major, then use the code above to apply a Principal Component Analysis (PCA) algorithm to this matrix. Then, post the results on the QnA Board for discussion. It will help you to understand the topic.

 ◩ Open Problem 3

Discuss what you understand on the process of dimension reduction by PC’s on the covariance matrix.

 ◩ Open Problem 4

Discuss what you  understand on the similarity and difference between the least-squares line and the linear regression.

 ◩ Open Problem 5

Describe ANN and Backpropagation as you understand.

 Homework

[Week 14]

 ◩ Open Problem 6

Find out some other examples on number recognition, face recognition, voice recognition, and natural language generation technologies currently used with ANN .

 ◩ Open Problem 7

Find and discuss topics in mathematics that have been used in artificial intelligence.

[Source] http://matrix.skku.ac.kr/KOFAC2/

Part VI.

Part VI. References

A1. References

A2. Appendix

A1. References

[1] [Bigbook] Sang-Gu Lee, Jae Hwa Lee, Math for AI, Kyobo book PubPle, 2019, ISBN 978-89-24-06612-8. http://pod.kyobobook.co.kr/index.ink http://matrix.skku.ac.kr/math4ai/Math4AI.pdf

[2] Sang-Gu Lee, Jae Hwa Lee, Yoonmee Ham, Artificial Intelligence and College Mathematics Education, Communications of Mathematical Education, Vol. 34, No. 1, Feb. 2020. 1-15.

[3] Kyung-Eun Park, Sang-Gu Lee, Yoonmee Ham, Jae Hwa Lee, Teaching and Learning of University Calculus with Python-based Coding Education, Communications of Mathematical Education, Vol. 33, No. 3, Sep. 2019. 163-180.

[4]  Sang-Gu Lee, Jae Hwa Lee, Student-Centered Discrete Mathematics Class with Cyber Lab, Communications of Mathematical Education, Vol. 33, No. 1, Feb. 2019. 1-19.

[5]  Sang-Gu Lee, Jae Hwa Lee, Young Rock Kim, Yoonmee Ham, The Fourth Industrial Revolution and College Mathematics Education6) - Case study of Linear Algebra approach, Communications of Mathematical Education, Vol. 32, No. 3, Sep. 2018. 245-255.

A2. Appendix

Final PBL Report [Week 14] [TakeHome/OpenBook Exam]

English Version MS Word file PBL report (Form and Sample) :

< PBL report Form : http://matrix.skku.ac.kr/PBL-Form/PBL.hwp >

Web resources

[Math4AI PBL Report, Fall] http://matrix.skku.ac.kr/math4ai/PBL-Record/

[Math4AI PBL Report, Spring] http://matrix.skku.ac.kr/2020-Math4AI-PBL/

[Student Project, 1] http://matrix.skku.ac.kr/math4ai/Project-1/

[Student Project, Presentation 1] https://youtu.be/zJ4PTgnWyac

[Student Project, 2] http://matrix.skku.ac.kr/math4ai/Project-2/

[Student Project, Presentation 2] https://youtu.be/3e0VybhOS1U

[KMS Talk on Math for AI] https://youtu.be/KP1WSi69Mu0

<Math for AI Textbook> http://matrix.skku.ac.kr/math4ai/Math4AI.pdf

Junior PolyMath [AI, Do it with Math]

http://www.polymath.co.kr/contents/list/020207  http://www.polymath.co.kr/contents/list/020201

[PolyMath] (Interview 1) http://www.polymath.co.kr/contents/view/740?page=1 ﻿

[PolyMath] (Interview 2) http://www.polymath.co.kr/contents/view/18134 ﻿

Introduction: AI ? http://www.polymath.co.kr/contents/view/18109?page=1 ﻿

#1 Deep Learning and Matrix http://www.polymath.co.kr/contents/view/18142

#2 Least square solutions 1 http://www.polymath.co.kr/contents/view/19796?page=1

#3 Least square solutions 2 http://www.polymath.co.kr/contents/view/21717?page=1

Sage Quick Reference: https://wiki.sagemath.org/quickref

[KMS] Symposium for AI and University-Level Mathematics

2020.6.19. & 2020. 12.17. (ZOOM conference)

 Cyber Lab [K-MOOC] Introductory Math for Artificial Intelligence    [Web book] http://matrix.skku.ac.kr/kofac/Book/                   [Source] https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_overview.htmtml