2021, Fall semester PBL Report (Final)

Introductory Mathematics for Artificial Intelligence


                    Prof : Sang-Gu LEE

Due day:  Dec. 5th, 2021 (11 AM) in HW box in I-campus


Name: Kim Daniil (김다니일), Pravas Giri, Park Junho (박준호), 김다솔, …

Major: Software, Global Economics, Software, …

Student ID: 2021, 2014, 2021,…

e-mail/HP number: kim.daniil**,   junho**, giri**/010-***

Ch 1. Participation (10pt)

We have learned ‘Basic Mathematics(행렬, 도함수, 통계)’ to understand and talk about the following concepts in 14 weeks.

1. SVD(Singular Value Decomposition)

2. GDM(Gradient Descent Method)

3. Data and Covariance Matrix

4. PCA(Principal Components Analysis)

5. Rank Reduction and the role of SVD in PCA

6. BP(Back-Propagation) algorithm in ML(Machine Learning) and ANN(Artificial Neural Network) as we can see in http://matrix.skku.ac.kr/2021-Final-PBL-E/

We could practice the our codes in http://matrix.skku.ac.kr/KOFAC/

Math Lab Review (실습실)

 9학년(3)  수학    http://matrix.skku.ac.kr/9th-Grade/  

10학년(1)  수학    http://matrix.skku.ac.kr/10th-Grade/  

11학년(2) 수학 1   http://matrix.skku.ac.kr/11th-Grade-1/  

11학년(2) 수학 2   http://matrix.skku.ac.kr/11th-Grade-2/  

12학년(3) 미적분   http://matrix.skku.ac.kr/12th-Grade-1/  

12학년(3) 확률통계 http://matrix.skku.ac.kr/12th-Grade-2/


 (1) State more than 10 Math Definitions and concepts what you learned in the first 14 weeks.


1.     Data can be represented as an Ordered Pair (n-tuple). Example:

1. Extreme point: A point at which a function has a local

maximum or minimum.

2. GDM(Gradient Descent Method): The method finding the

optimal point(a extreme point), as setting initial iterate, initial

learning rate and tolerance, and then compute the next point

equal to previous point learning rate * slope..

3. Joint probability: The probability of which two or more random

variables which signify events occuring at the same time.

4. Conditional Probability: The probability that an event occurs

under the condition that the other event already occured.

5. Bayes Theorem: A conditional probability explained when we

know the probabilities of the events having reversed priority.

6. Covariance: the measurement of distances of two other

variables from the mean.

7. PCA(Principal Components Analysis): Express information as

covariance matrix to reduce dimension of data as negligible


8. ANN(Artificial Neural Network): System having hidden layers as

operating weight matrix from input to output.

9. Sigmoid Activation Function:      , which replaces

Heaviside function(step function) for some analytic advantages.

10. BP(Back-Propagation): Updating method for weights matrix

and activation function as errors between predicted and observed value adjusting



Height (cm)

Weight (kg)


Data Representation






(160, 80, 19)





(170, 70, 27)





(180, 56, 30)


1.     Each number is called a component of the data.

2.     Vector is a line (arrow), that has a starting point – the origin, and the endpoint.

3.     The components of the vector are  numbers, which are called scalars.

4.     Data can also be represented as a rectangular (2D) array (grid), that is called Matrix.

5.     Tensors are just “data containers”. A 0-dim Tensor is a single number, a 1-dim Tensor is a Vector, a 2-dim Tensor is a Matrix and so on.

6.       Diagonal Matrix. A Diagonal Matrix is a matrix, in which the entries outside the main diagonal are all 0.

7.       Identity Matrix. An Identity Matrix  is a square matrix, that has 1s along the main diagonal and 0s for all other entries.

8.       Triangular Matrix. A Triangular Matrix is a square matrix, where all the entries above (lower triangular) or below (upper triangular) the main diagonal are 0s.

 – lower triangular
 – upper triangular

9.       Symmetric Matrix. A Symmetric Matrix is a square matrix that is equal to its transpose.

10.   Inverse Matrix. An Inverse Matrix of an  matrix A is an  matrix , such that .  denoted as .

11.    Singular Matrix is a matrix that cannot be inverted.

12.    Transposed Matrix is a flipped version of the initial matrix. Basically, a matrix with rows and columns swapped.

13.  The determinant is a special number that can be calculated from a matrix that helps us find the inverse of a matrix and the solutions of the system of equations.

14.    Coefficient matrix is a matrix, made up of coefficients from the equations in a system of equations.

15.   Variable matrix is a  matrix that is made up of unknown variables in the system of equations.

16.    Constant matrix is a matrix made up of constants in the system of equations.

17.    Augmented matrix is a result matrix, that we get by merging multiple other matrices.

18.    ERO (Elementary Row Operations) are operations that can be performed in matrix: Row swap, Scalar multiplication, Row sum.

19.    REF (Row Echelon Form) is a matrix form, where

o    The row vectors, all entries of which are zeroes should be at the very bottom of the matrix.

o    Each pivot should be in a column strictly to the right of the pivots, occurring in the rows above it.

20.  Matrix Pivot is the first non-zero entry of a particular row.

21.  RREF (Reduced Row Echelon Form) is a matrix in REF form, where

o    All pivots are 1s.

o    And the entries above pivots are all 0s.

22.  Free Variables are located in the columns that does not have a pivot in REF form, and they are basically variables that can take any number.

23.  Gauss-Jordan Elimination - method of matrix system of equations solution by performing ERO, to get the identity matrix.

24.  Cramer’s Rule – method of matrix system of equations solution by finding the determinants ratio.

25.  Classification is a process of identifying to which category a new data belongs, based on the data characteristics.

26.  The distance between two points is defined by the Euclidean Distance.

27.  Norm is the size/length of a vector.

28.  Inner product (dot product) is the sum of the multiplication of the same indexed numbers in two ordered tuples. Example:

29.  Cosine theorem:

30.  Normalized vector is a vector, whose length/norm equals to 1, basically making a unit vector out of it.

31.  Unit Vector is a vector with length = 1.

32.  Eigenvector is a vector which direction remains unchanged when a linear transformation is applied to it.

33.  Eigenvalue is a value that can be found from the following formula: , where  is a square matrix,  is an Eigen Vector and  is an Eigen Value.

34.  The Least-Squares method is used to find a straight line (the least-squares line, the best fit line) that has the minimal distance to each data point.

35.  A projection is the transformation of points and lines in one plane onto another plane. The example of projection is shadow.

36.  Curve fitting is a process of finding the best fit curve (a quadratic approximation) to describe an array of data.

37.  LU Decomposition is a matrix decomposition that results a product of the Lower triangular matrix and the Upper triangular matrix.

38.  Permutation Matrix is a matrix , that shows all the changes in the row positions of the initial matrix .

39.  QR Decomposition is a matrix decomposition that makes finding the least-squares solution easier. For the QR Decomposition the following equation is also true:

40.  SVD Decomposition helps us to reduce the size of the original matrix, hence reducing the amount of computational power required. We also need to keep in mind that SVD always exists for any rectangular or square matrix.

41.  Kernel. A Kernel of a matrix  is a matrix , such that .

42.  Tangent (касательная). A Tangent is a straight line that only “touches” a function at only 1 point.

43.  Limit. The limit of a function  is the value  that the function  approaches as its argument  approaches .

44.  Slope. A Slope is a number that describes both the direction and the steepness of the function.

45.  Continuity. A function is Continuous at a point , if it is defined at this point.

46.  Derivative. The derivative of a function  at a number , denoted by , is the instantaneous rate of change of  with respect to , when .

47.  Differentiation. Differentiation is the process of finding the derivative of a function.

48.  Fermat’s Theorem: If  has a local maximum or minimum at  and if  exists, then .

49.  Gradient Descent Method/Algorithm is an optimization algorithm to find the minimum of a complex function.

50.  Factorial of  is a recursive product of all positive integers. . Note: .

51.  Permutations is a number of ways, we can choose  elements out of  elements, with the consideration of their order. There are two formulas:

1.     Without Repetitions: (all chosen elements are distinct)

2.     With Repetitions: (all chosen elements are not necessarily distinct)

52.  Combinations is a number of ways, we can choose  elements out of  elements, but without the consideration of their order. There are two formulas:

1.     Without Repetitions: (all chosen elements are distinct)

2.     With Repetitions: (all chosen elements are not necessarily distinct)

53.  Sample Space or Probability Space is a set of possible outcomes.

54.  Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur.

55.  Mathematical probability is:

56.  Conditional probability is defined as the probability of an event or outcome occurring, based on the occurrence of a previous event or outcome. It is calculated by multiplying the probability of the prior event by the updated probability of the succeeding, or conditional event:

57.  Bayes’ Theorem describes the probability of the occurrence of a particular event, considering the conditions, related to this event.

58.  Discrete Probability Distribution describes the probability of occurrence of each value of a discrete (or in other words, explicitly defined) random variable.

59.  Probability Mass Function is a function that gives the probability that a discrete random variable is exactly equal to some value.

60.  Continuous probability distribution describes the probability of occurrence of each value of a continuous (or in other words, not explicitly defined or conditionally defined, e.g., ) random variable.

61.  Probability Density Function is a function that is providing a relative likelihood that the value of the continuous random variable would be close to a particular sample.

62.  Expectation of a random variable is an average value, that we can get from random variable. It is also called a mean value.

63.  Variance of a random variable is the spread of a set of data, in relation to their average value.

64.  Standard Deviation is the square root of its variance.

65.  Standardized Random Variables are similar to “Normalization” (like vectors). The expected value of a Standardized Random Variable is always 0, and its variance is 1. Standardizing makes it easier to compare variables of different types and units:

66.  Joint Probability Distribution shows a probability distribution of multiple variables and their relationship.

67.  Marginal Density Function – a density function of a one particular variable.

68.  Covariance is a measure of the joint variability of two random variables.

69.  Correlation Coefficient is the measure of relationship strength between the two variables.

70.  Covariance Matrix is a square matrix with the variance in the main diagonal and all covariances in non-diagonal entries. Any covariance matrix is symmetric and positive semi-definite. It visualizes the data distribution.

71.  Poisson Distribution – discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

72.  Bernoulli Trial – random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted.

73.  Central Limit Theorem – when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.

74.  Hypergeometric Distribution – probability of k successes (random draws for which the object drawn has a specified feature) in n draws, without replacement, from a finite population of size N that contains exactly K objects with that feature, wherein each draw is either a success or a failure.

75.  Gamma Distribution – two-parameter family of continuous probability distributions.

76.  Dimensionality Reduction Techniques – algorithms that transform the data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains meaningful properties of the original data.

77.  Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction techniques. The PCA tries to combine existing variables to define the new variables, called Principal Components, while minimizing the information loss.

78.  Principal Components represent the directions of the data that explain a maximal amount of variance.

79.  Machine Learning (ML) is a subfield of AI that studies Computer Algorithms, that are capable of self-improvement through experiencing sample data, without direct programmer’s interference.

80.  Training Data – a sample data, used to teach Machine Learning.

81.  Artificial Neural Networks (ANN) are a commonly used, specific class of ML algorithms. ANNs are modeled on the human brain, in which thousands or millions of processing nodes, called “neurons”, are interconnected and organized into layers.
ANNs always has an Input Layer and an Output Layer.

82.  Sometimes, Neural Networks can have a hidden layer in between I and O Layers, where the weights are adjusted. Such Neural Networks are called “Deep Learning”.

83.  Backpropagation – the process of weight adjustment. The flow of the Backpropagation algorithm is as follows:

1.     Data Division (~80% for learning and the rest ~20% for the test cases)

2.     Weights Randomization (initial setup of random weights)

3.     Matrices and Activation Functions setup

4.     Standard Calculation (using mentioned formulas)

5.     Calculate Error (compare with the expected result)

6.     Use Gradient Descent Method, to adjust the weights, so that the error is minimal.

7.     Repeat 4-6.

8.     Stop when result is satisfactory (the error is almost negligible or non-existent).

84.  Weight or  decides how influential the input will be on the output.

85.  Bias or  is a constant that helps us fit our model for the given data.

86.  Activation Function of a node defines the output of that node given an input or set of inputs.

87.  Linear Activation Function – simplest Activation Function. If we use a linear activation function in a neural network, then this model can only learn linearly separable problems:


88.  Non-Linear Activation Functions – the most popular type activation functions in modern ANNs.  They allow ANNs to easily learn a non-linearly separable problem. One such function is a Sigmoid Function.

89.   Sigmoid Function – a special form of the logistic function:  . With the addition of just one hidden layer and this activation function in it, neural network can learn complex decision functions.

90.  MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples.
All the examples are monochrome, centered and fully normalized (no distortion or skew) 28x28 images.

91.  t-SNE (t-Distributed Stochastic Neighbor Embedding)– a technique for dimensionality reduction (like PCA) that is particularly well suited for the visualization of high-dimensional datasets.


1.      Functions and its Graph

The relationship in which the value of the two variables x and y is uniquely determined according to the value is called a function and is expressed as .

2.      Vector

A vector is an abstract data type used to represent properties that have both direction and magnitude. Vectors are commonly used to represent movement. For example, vector can be used to represent the distance between the two points.

For this above figure, whose starting point is A, ending point is B the vector will be .

Vector operations:

a.    Scalar multiplication: For scalar  and vector .

b.   Vector addition: For two vectors  and .

3.      Matrix Operations

Matrix is a rectangular array of numbers or polynomials arranged in rows and columns.


a.      Scalar multiplication

The term scalar multiplication refers to the product of a real number and a matrix. In a scalar multiplication, each entry in the matrix is multiplied by the given scalar. For scalar ,

b.      Matrix addition and subtraction

A matrix can only be added or subtracted from another matrix if they have same dimensions.

c.       Matrix multiplication

For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix.

d.      Transpose of matrix

The transpose of a matrix is an operator which flips a matrix over its diagonal which means it switches the row and column indices of the matrix A by producing another matrix which is denoted by .


e.       Diagonal matrix

A matrix in which the entries outside the main diagonal are all zero.

f.        Identity matrix

A square matrix that has 1’s along the diagonal and all the remaining are zero.

g.      Triangular matrix

It is a special type of square matrix. A square matrix having all the entries above main diagonal zero is called lower triangular matrix. Similarly, a square matrix having all the entries below main diagonal zero is called upper triangular matrix.



h.      Symmetric matrix

A symmetric matrix is a square matrix  which is equal to its transpose.

i.        Inverse matrix

A square matrix  with an associated matrix  such that  multiplied by  and  multiplied by  both equal the identity matrix.

Properties of inverse matrix:

1.      If A is nonsingular, then so is A-1 and

          (A-1-1  =  A

2.      If A and B are nonsingular matrices, then AB is nonsingular and

        (AB) -1  =  B-1A-1

3.      If A is nonsingular then    

          (AT) -1  =  (A -1)T

4.      If A and B are matrices with

        AB  =  In

then A and B are inverses of each other.

4.      Tensor

Tensor is a container, which is a storage that can put data together since most of them deal with numeric data. We can think of a 1-dim Tensor as a Vector, a 2-dim tensor as a Matrix, and a matrix’s generalized form as a tensor. However, tensor and matrix are different. A matrix is just a container for entries and it doesn’t change if any change occurs in the system, whereas tensor is an entity in the system that interacts with other entities in a system an changes its values when other values change.


5.      Data Similarity

Data similarity is calculating how close or similar the data is in each category.

a.      Distance

The distance between the two points  and  is defined by

This is also called ‘Euclidean Distance’.

b.      Norm

For a vector  the size of  is called a norm.

In case of two vectors  and  the norm is the distance between two points  and .


6.      Cosine Similarity

The cosine similarity can be calculated by the angle between two vectors which can be defined with an inner product and measuring the similarity with the cosine value  of by using an inner product.

a.      Inner Product

The inner product between two vectors  and  is

b.      Angle between two vectors

The angle  between two vectors is given by 

7.      Gauss-Jordan Elimination

This is an algorithm for solving system of linear equations. While solving the final form of the augmented matrix on the left side of the equation will become identity matrix.

We can solve a equation following Elementary Row Operations (ERO).

a.       Exchange two equations.

b.      Multiply a row by a nonzero real number.

c.       Add a nonzero multiple of a row to another row.


While finding RREF of an augmented matrix .

If the right side of the solution of given linear system of equations is not 0 then it has no solutions.

If it is 0, we should check if the solution is identity matrix. If yes, it has unique solutions.

However, if there is free variable then it will have infinitely many solutions.


8.      Singular Value Decomposition

Singular value decomposition of a matrix is a factorization of that matrix into three matrices  where U is an orthogonal matrix, V is an orthogonal matrix, and  is a rectangular diagonal matrix with non-negative real numbers on the main diagonal.

The key point of SVD is it exists for any sort of rectangular or square matrices.


9.      Limit of functions

The limit of a function is a fundamental concept in calculus and analysis concerning the behavior of that function near a particular input.


This only works when . Because when  the function cannot be defined.


10.  Derivatives and Differentiation

a.      Derivatives

When is differentiable at every point x in an interval,  is called differentiable in that interval. In this case, the derivative at that point is called the derivative of  at . This is denoted by:  etc.

b.      Differentiation.

The derivative of function  is called differentiation of  which can be found by =.

c.       Tangent line

The tangent line can be found by the following formula:


m=slope of the tangent line

)= The point of slope


11.  Local maximum and Local Minimum

When a function  has  at c in the domain with .

12.  Absolute Maximum

It is point where a function obtains its greatest possible value.

13.  Absolute Minimum

It is a point where a function obtains its smallest possible value.

14.  Gradient Descent Method (GDM)

This is a method which is used when we need to find the critical value of a complicated function. The basic idea is to find the slope of the function (derivative), move it toward downhill (learning rate), and repeat it until it reaches the extreme value.

15.  Permutations