2021, Fall semester PBL Report (Final)
Introductory Mathematics for Artificial Intelligence
http://matrix.skku.ac.kr/intro-math4ai/
Prof : Sang-Gu LEE
Due date: Dec. 5th, 2021 (11 AM), in the HW box on I-campus
Name: Kim Daniil (김다니일), Pravas Giri, Park Junho (박준호), 김다솔, …
Major: Software, Global Economics,
Software, …
Student ID: 2021, 2014, 2021,…
e-mail/HP number: kim.daniil**, junho**, giri**/ … 010-***
Ch 1. Participation (10pt)
We have learned 'Basic Mathematics' (matrices, derivatives, statistics) over 14 weeks in order to understand and discuss the following concepts.
1. SVD (Singular Value Decomposition)
2. GDM (Gradient Descent Method)
3. Data and Covariance Matrix
4. PCA (Principal Component Analysis)
5. Rank Reduction and the role of SVD in PCA
6. BP (Back-Propagation) algorithm in ML (Machine Learning) and ANN (Artificial Neural Network), as shown at http://matrix.skku.ac.kr/2021-Final-PBL-E/
We practiced our code at http://matrix.skku.ac.kr/KOFAC/
○ Math Lab Review (Practice Lab)
Grade 9 (middle school, 3rd year) Mathematics: http://matrix.skku.ac.kr/9th-Grade/
Grade 10 (high school, 1st year) Mathematics: http://matrix.skku.ac.kr/10th-Grade/
Grade 11 (high school, 2nd year) Mathematics 1: http://matrix.skku.ac.kr/11th-Grade-1/
Grade 11 (high school, 2nd year) Mathematics 2: http://matrix.skku.ac.kr/11th-Grade-2/
Grade 12 (high school, 3rd year) Calculus: http://matrix.skku.ac.kr/12th-Grade-1/
Grade 12 (high school, 3rd year) Probability and Statistics: http://matrix.skku.ac.kr/12th-Grade-2/
(1) State more than 10 math definitions and concepts that you learned in the first 14 weeks.
Data can be represented as an ordered list of numbers (an n-tuple); see the example table after item 10 below.
1. Extreme point: A point at which a function has a local maximum or minimum.
2. GDM (Gradient Descent Method): A method for finding an optimal point (an extreme point). Choose an initial iterate, a learning rate, and a tolerance, then repeatedly compute: next point = previous point − learning rate × slope.
3. Joint probability: The probability that two or more events, represented by random variables, occur at the same time.
4. Conditional probability: The probability that an event occurs under the condition that another event has already occurred.
5. Bayes' Theorem: Expresses a conditional probability in terms of the reversed conditional probability, i.e., when we know the probabilities of the events with the roles of condition and outcome exchanged.
6. Covariance: A measure of how far two variables jointly deviate from their means.
7. PCA (Principal Component Analysis): Expresses the information in the data through its covariance matrix and reduces the dimension of the data by discarding negligible components.
8. ANN (Artificial Neural Network): A system with hidden layers that applies weight matrices to map inputs to outputs.
9. Sigmoid activation function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, which replaces the Heaviside (step) function because of some analytic advantages.
10. BP (Back-Propagation): A method for updating the weight matrices by adjusting them according to the error between the predicted and observed values, propagated backward through the network.
Name | Height (cm) | Weight (kg) | Age | Data Representation (3-tuple)
Kim  | 160         | 80          | 19  | (160, 80, 19)
Lee  | 170         | 70          | 27  | (170, 70, 27)
Park | 180         | 56          | 30  | (180, 56, 30)
1. Each number is called a component of the data.
2. A vector is an arrow that has a starting point (the origin) and an endpoint.
3. The components of a vector are numbers, which are called scalars.
4. Data can also be represented as a rectangular (2D) array (grid), which is called a matrix.
5. Tensors are just “data containers”. A 0-dim Tensor is a single number, a 1-dim Tensor is a Vector, a 2-dim Tensor is a Matrix and so on.
6. Diagonal Matrix: a matrix in which the entries outside the main diagonal are all 0.
7. Identity Matrix: a square matrix that has 1s along the main diagonal and 0s for all other entries.
8. Triangular Matrix: a square matrix where all the entries above the main diagonal are 0 (lower triangular) or all the entries below the main diagonal are 0 (upper triangular). For example, $\begin{pmatrix}1 & 0\\ 2 & 3\end{pmatrix}$ is lower triangular and $\begin{pmatrix}1 & 2\\ 0 & 3\end{pmatrix}$ is upper triangular.
9. Symmetric Matrix: a square matrix that is equal to its transpose.
10. Inverse Matrix: the inverse of an $n \times n$ matrix $A$ is an $n \times n$ matrix $B$ such that $AB = BA = I_n$; it is denoted $A^{-1}$.
11. Singular Matrix: a matrix that cannot be inverted.
12. Transposed Matrix: a flipped version of the initial matrix; basically, a matrix with its rows and columns swapped.
13. Determinant: a special number that can be calculated from a square matrix; it helps us find the inverse of a matrix and the solutions of a system of equations.
14. Coefficient Matrix: a matrix made up of the coefficients of the equations in a system of equations.
15. Variable Matrix: a matrix made up of the unknown variables of the system of equations.
16. Constant Matrix: a matrix made up of the constants of the system of equations.
17. Augmented Matrix: a matrix obtained by merging two or more matrices side by side (for example, the coefficient matrix and the constant matrix).
18. ERO (Elementary Row Operations): operations that can be performed on the rows of a matrix: row swap, multiplication of a row by a nonzero scalar, and adding a multiple of one row to another row.
19. REF (Row Echelon Form): a matrix form in which
o the rows whose entries are all zero are at the very bottom of the matrix, and
o each pivot is in a column strictly to the right of the pivots in the rows above it.
20. Matrix Pivot: the first non-zero entry of a particular row.
21. RREF (Reduced Row Echelon Form): a matrix in REF form in which
o all pivots are 1s, and
o the entries above each pivot are all 0s.
22. Free Variables: the variables located in the columns that do not have a pivot in REF form; they can take any value.
23. Gauss-Jordan Elimination: a method for solving a system of equations in matrix form by performing EROs until the coefficient part becomes the identity matrix.
24. Cramer's Rule: a method for solving a system of equations in matrix form by taking ratios of determinants.
25. Classification is the process of identifying to which category a new data point belongs, based on the characteristics of the data.
26. The distance between two points is defined by the Euclidean distance.
27. Norm is the size/length of a vector.
28. Inner product (dot product) is the sum of the products of the same-indexed components of two ordered tuples. Example: $(1, 2, 3) \cdot (4, 5, 6) = 1\cdot 4 + 2\cdot 5 + 3\cdot 6 = 32$.
29. Cosine theorem (law of cosines): $\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\|u\|\,\|v\|\cos\theta$, which leads to $\cos\theta = \dfrac{u \cdot v}{\|u\|\,\|v\|}$.
30. Normalized vector: a vector rescaled so that its length/norm equals 1, basically making a unit vector out of it.
31. Unit Vector: a vector with length 1.
32. Eigenvector: a vector whose direction remains unchanged when a linear transformation is applied to it.
33. Eigenvalue: a value $\lambda$ satisfying $Av = \lambda v$, where $A$ is a square matrix, $v$ is an eigenvector, and $\lambda$ is an eigenvalue.
34. The Least-Squares method is used to find a straight line (the least-squares line, the best-fit line) that minimizes the sum of the squared distances to the data points.
35. A projection is the transformation of points and lines in one plane onto another plane. An example of a projection is a shadow.
36. Curve fitting is the process of finding the best-fit curve (for example, a quadratic approximation) to describe an array of data.
37. LU Decomposition is a matrix decomposition that expresses a matrix as the product $A = LU$ of a lower triangular matrix $L$ and an upper triangular matrix $U$.
38. Permutation Matrix: a matrix $P$ obtained by permuting the rows of the identity matrix; it records the changes in the row positions of the initial matrix $A$.
39. QR Decomposition is a matrix decomposition that makes finding the least-squares solution easier: $A = QR$, where $Q$ has orthonormal columns and $R$ is an upper triangular matrix.
40. SVD (Singular Value Decomposition) helps us to reduce the size of the original matrix, hence reducing the amount of computational power required. We also need to keep in mind that an SVD always exists for any rectangular or square matrix.
41. Kernel: the kernel (null space) of a matrix $A$ is the set of vectors $x$ such that $Ax = 0$.
42. Tangent: a tangent is a straight line that "touches" the graph of a function at only one point.
43. Limit: the limit of a function $f(x)$ is the value $L$ that the function approaches as its argument $x$ approaches some value $a$; we write $\lim_{x \to a} f(x) = L$.
44. Slope: a number that describes both the direction and the steepness of a line (or of a function at a point).
45. Continuity: a function is continuous at a point $a$ if it is defined at that point, its limit as $x \to a$ exists, and the limit equals $f(a)$.
46. Derivative: the derivative of a function $f$ at a number $a$, denoted by $f'(a)$, is the instantaneous rate of change of $y = f(x)$ with respect to $x$ when $x = a$.
47. Differentiation: the process of finding the derivative of a function.
48. Fermat's Theorem: if $f$ has a local maximum or minimum at $c$ and if $f'(c)$ exists, then $f'(c) = 0$.
49. Gradient Descent Method/Algorithm: an optimization algorithm used to find the minimum of a complex function.
50. Factorial of $n$: the product of all positive integers up to $n$, written $n! = n(n-1)(n-2)\cdots 2 \cdot 1$. Note: $0! = 1$.
51. Permutations: the number of ways we can choose $k$ elements out of $n$ elements, taking their order into account. There are two formulas:
1. Without repetition (all chosen elements are distinct): $P(n, k) = \dfrac{n!}{(n-k)!}$
2. With repetition (the chosen elements are not necessarily distinct): $n^k$
52. Combinations: the number of ways we can choose $k$ elements out of $n$ elements without taking their order into account. There are two formulas:
1. Without repetition (all chosen elements are distinct): $C(n, k) = \dbinom{n}{k} = \dfrac{n!}{k!(n-k)!}$
2. With repetition (the chosen elements are not necessarily distinct): $\dbinom{n+k-1}{k}$
53. Sample Space or Probability Space is a set of possible outcomes.
54. Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur.
55. Mathematical probability is: $P(A) = \dfrac{\text{number of outcomes favorable to } A}{\text{total number of possible outcomes}}$.
56. Conditional probability is defined as the probability of an event or outcome occurring, given that a previous event or outcome has occurred: $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$, or equivalently $P(A \cap B) = P(A)\,P(B \mid A)$.
57. Bayes’ Theorem describes the probability of the occurrence of a particular event, considering the conditions, related to this event.
58. Discrete Probability Distribution describes the probability of occurrence of each value of a discrete random variable (one that takes countably many distinct values).
59. Probability Mass Function is a function that gives the probability that a discrete random variable is exactly equal to some value.
60. Continuous Probability Distribution describes the probability of occurrence of the values of a continuous random variable (one that can take any value in an interval).
61. Probability Density Function is a function that provides the relative likelihood that the value of a continuous random variable is close to a particular sample.
62. Expectation of a random variable is the average value we expect to obtain from the random variable; it is also called the mean value.
63. Variance of a random variable is the expected squared deviation from its mean; it measures the spread of the values relative to the mean.
64. Standard Deviation is the square root of the variance.
65. Standardized Random Variables are similar to "normalized" vectors: a standardized random variable always has expected value 0 and variance 1. Standardizing makes it easier to compare variables of different types and units: $Z = \dfrac{X - \mu}{\sigma}$.
66. Joint Probability Distribution shows the probability distribution of multiple variables and their relationship.
67. Marginal Density Function: the density function of one particular variable, obtained from a joint distribution by summing or integrating out the other variables.
68. Covariance is a measure of the joint
variability of two random variables.
69. Correlation Coefficient is the
measure of relationship strength between the two variables.
70. Covariance Matrix is a square matrix
with the variance in the main diagonal and all covariances in non-diagonal
entries. Any covariance matrix is symmetric and positive semi-definite. It
visualizes the data distribution.
71. Poisson Distribution – discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.
72. Bernoulli Trial – random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted.
73. Central Limit Theorem – when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.
74. Hypergeometric Distribution – probability of k successes (random draws for which the object drawn has a specified feature) in n draws, without replacement, from a finite population of size N that contains exactly K objects with that feature, wherein each draw is either a success or a failure.
75. Gamma Distribution – two-parameter family of continuous probability distributions.
76. Dimensionality Reduction Techniques – algorithms that transform the data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains meaningful properties of the original data.
77. Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction techniques. The PCA tries to combine existing variables to define the new variables, called Principal Components, while minimizing the information loss.
78. Principal Components represent the directions of the data that explain a maximal amount of variance.
79. Machine Learning (ML) is a subfield of AI that studies Computer Algorithms, that are capable of self-improvement through experiencing sample data, without direct programmer’s interference.
80. Training Data – a sample data, used to teach Machine Learning.
81. Artificial Neural Networks (ANN) are a commonly used, specific class of ML algorithms. ANNs are modeled on the human brain, in which thousands or millions of processing nodes, called "neurons", are interconnected and organized into layers. An ANN always has an input layer and an output layer.
82. Neural networks can also have hidden layers between the input and output layers, where the weights are adjusted. Networks with many such hidden layers are the basis of "Deep Learning".
83. Backpropagation – the process of weight adjustment (a minimal numerical sketch is given after this list). The flow of the Backpropagation algorithm is as follows:
1. Data Division (~80% for learning and the rest ~20% for the test cases)
2. Weights Randomization (initial setup of random weights)
3. Matrices and Activation Functions setup
4. Standard Calculation (using mentioned formulas)
5. Calculate Error (compare with the expected result)
6. Use Gradient Descent Method, to adjust the weights, so that the error is minimal.
7. Repeat 4-6.
8. Stop when result is satisfactory (the error is almost negligible or non-existent).
84. Weight ($w$) decides how influential an input will be on the output.
85. Bias ($b$) is a constant that helps us fit our model to the given data.
86. Activation Function of a node defines the output of that node given an input or set of inputs.
87. Linear Activation Function – the simplest activation function, e.g. $f(x) = x$ or $f(x) = ax$. If we use a linear activation function in a neural network, then the model can only learn linearly separable problems.
88. Non-Linear Activation Functions – the most popular type of activation functions in modern ANNs. They allow ANNs to learn non-linearly separable problems. One such function is the Sigmoid Function.
89. Sigmoid Function – a special form of the logistic function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$. With the addition of just one hidden layer and this activation function in it, a neural network can learn complex decision functions.
90. MNIST database (Modified
National Institute of Standards and Technology database) is a large
collection of handwritten digits. It has a training set of 60,000 examples, and
a test set of 10,000 examples.
All the examples are monochrome, centered and fully normalized (no distortion
or skew) 28x28 images.
91. t-SNE (t-Distributed Stochastic Neighbor Embedding) – a technique for dimensionality reduction (like PCA) that is particularly well suited for the visualization of high-dimensional datasets.
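The backpropagation flow in item 83 and the sigmoid function in item 89 can be illustrated numerically. Below is a minimal Python/NumPy sketch of one backpropagation step for a single sigmoid neuron; the input, target, initial weights, and learning rate are made-up values for illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation: 1 / (1 + e^(-z))

x = np.array([0.5, -1.0])   # one training input (made up)
t = 1.0                     # target (observed) output
w = np.array([0.1, 0.2])    # initial weights (normally chosen at random)
b = 0.0                     # initial bias
lr = 0.5                    # learning rate

# forward pass
z = w @ x + b
y = sigmoid(z)

# backward pass: error and gradient via the chain rule,
# using sigmoid'(z) = y * (1 - y)
error = y - t
dz = error * y * (1 - y)

# gradient descent update of the weights and bias
w = w - lr * dz * x
b = b - lr * dz

print(y, error, w, b)

Repeating this forward/backward step over the training data is exactly steps 4-7 of the flow in item 83.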
1. Functions and their Graphs
A relationship in which the value of the variable y is uniquely determined by the value of the variable x is called a function and is expressed as $y = f(x)$.
2. Vector
A vector is an abstract data type used to represent quantities that have both direction and magnitude. Vectors are commonly used to represent movement; for example, a vector can be used to represent the displacement between two points.
For a directed segment whose starting point is A and whose ending point is B, the vector is written $\vec{AB}$.
Vector operations:
a. Scalar multiplication: for a scalar $k$ and a vector $v = (v_1, \dots, v_n)$, $kv = (kv_1, \dots, kv_n)$.
b. Vector addition: for two vectors $u = (u_1, \dots, u_n)$ and $v = (v_1, \dots, v_n)$, $u + v = (u_1 + v_1, \dots, u_n + v_n)$.
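As a quick illustration of the two vector operations above, here is a minimal Python/NumPy sketch; the example vectors are made up.

import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(2 * u)      # scalar multiplication: [2 4 6]
print(u + v)      # vector addition:       [5 7 9]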
3. Matrix Operations
A matrix is a rectangular array of numbers or polynomials arranged in rows and columns.
a. Scalar multiplication
The term scalar multiplication refers to the product of a real number and a matrix. In a scalar multiplication, each entry in the matrix is multiplied by the given scalar: for a scalar $k$, $(kA)_{ij} = k\,a_{ij}$.
b. Matrix addition and subtraction
A matrix can only be added to or subtracted from another matrix if they have the same dimensions.
c. Matrix multiplication
For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix.
d. Transpose of a matrix
The transpose of a matrix is an operator which flips a matrix over its diagonal; it switches the row and column indices of the matrix $A$, producing another matrix denoted by $A^T$.
e. Diagonal matrix
A matrix in which the entries outside the main diagonal are all zero.
f. Identity matrix
A square matrix that has 1s along the diagonal and 0s for all the remaining entries.
g. Triangular matrix
A special type of square matrix. A square matrix having all the entries above the main diagonal equal to zero is called a lower triangular matrix. Similarly, a square matrix having all the entries below the main diagonal equal to zero is called an upper triangular matrix.
h. Symmetric matrix
A symmetric matrix is a square matrix which is equal to its transpose.
i. Inverse matrix
A square matrix $A$ with an associated matrix $B$ such that $A$ multiplied by $B$ and $B$ multiplied by $A$ both equal the identity matrix; $B$ is denoted $A^{-1}$.
Properties of the inverse matrix:
1. If $A$ is nonsingular, then so is $A^{-1}$, and $(A^{-1})^{-1} = A$.
2. If $A$ and $B$ are nonsingular matrices, then $AB$ is nonsingular and $(AB)^{-1} = B^{-1}A^{-1}$.
3. If $A$ is nonsingular, then $(A^T)^{-1} = (A^{-1})^T$.
4. If $A$ and $B$ are matrices with $AB = I_n$, then $A$ and $B$ are inverses of each other.
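These operations and inverse properties can be checked numerically. Below is a minimal Python/NumPy sketch; the matrices A and B are made-up examples.

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])

print(3 * A)                         # scalar multiplication
print(A + B)                         # matrix addition
print(A @ B)                         # matrix multiplication (columns of A match rows of B)
print(A.T)                           # transpose
A_inv = np.linalg.inv(A)             # inverse (A is nonsingular: det(A) = -2)
print(A_inv @ A)                     # approximately the identity matrix
print(np.linalg.inv(A @ B))          # property 2: equals inv(B) @ inv(A)
print(np.linalg.inv(B) @ A_inv)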
4. Tensor
A tensor is a container that can store data together, since most of the data we deal with is numeric. We can think of a 1-dim tensor as a vector, a 2-dim tensor as a matrix, and a tensor as the generalized form of a matrix. However, a tensor and a matrix are different: a matrix is just a container for entries and does not change if anything changes in the system, whereas a tensor is an entity in the system that interacts with other entities and changes its values when other values change.
5. Data Similarity
Data similarity measures how close or similar the data in each category are.
a. Distance
The distance between two points $P = (p_1, \dots, p_n)$ and $Q = (q_1, \dots, q_n)$ is defined by $d(P, Q) = \sqrt{(p_1 - q_1)^2 + \cdots + (p_n - q_n)^2}$. This is also called the 'Euclidean distance'.
b. Norm
For a vector $v$, the size (length) of $v$ is called its norm and is written $\|v\|$. For two vectors $u$ and $v$, the norm $\|u - v\|$ is the distance between the two corresponding points.
6. Cosine Similarity
The cosine similarity is computed from the angle between two vectors, which can be defined using an inner product; the similarity is measured by the cosine value of the angle $\theta$.
a. Inner Product
The inner product of two vectors $u = (u_1, \dots, u_n)$ and $v = (v_1, \dots, v_n)$ is $u \cdot v = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$.
b. Angle between two vectors
The angle $\theta$ between two vectors is given by $\cos\theta = \dfrac{u \cdot v}{\|u\|\,\|v\|}$.
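A minimal Python/NumPy sketch of the distance, norm, inner product, and cosine similarity defined in sections 5 and 6, using made-up example vectors:

import numpy as np

u = np.array([1., 2., 3.])
v = np.array([4., 5., 6.])

distance = np.linalg.norm(u - v)                    # Euclidean distance between the two points
norm_u = np.linalg.norm(u)                          # norm (length) of u
inner = np.dot(u, v)                                # inner product: 1*4 + 2*5 + 3*6 = 32
cos_theta = inner / (norm_u * np.linalg.norm(v))    # cosine of the angle between u and v

print(distance, norm_u, inner, cos_theta)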
7. Gauss-Jordan Elimination
This is an algorithm for solving a system of linear equations. While solving, the coefficient part on the left side of the augmented matrix is reduced, ideally to the identity matrix.
We can solve an equation using the following Elementary Row Operations (ERO):
a. Exchange two equations (rows).
b. Multiply a row by a nonzero real number.
c. Add a nonzero multiple of a row to another row.
We find the RREF of the augmented matrix $[A \mid b]$.
If some row of the RREF is zero on the left side but has a nonzero entry on the right side, the system has no solutions.
Otherwise, if the coefficient part reduces to the identity matrix, the system has a unique solution.
However, if there is a free variable, the system has infinitely many solutions.
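A minimal Python/SymPy sketch of Gauss-Jordan elimination via the RREF of an augmented matrix, using a made-up 2x2 system:

import sympy as sp

# System:  x + 2y = 5,  3x + 4y = 6  ->  augmented matrix [A | b]
aug = sp.Matrix([[1, 2, 5],
                 [3, 4, 6]])

rref, pivot_cols = aug.rref()   # reduced row echelon form and pivot columns
print(rref)                     # left block is the identity -> unique solution
print(pivot_cols)               # (0, 1): both variable columns have pivots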
8. Singular Value Decomposition
The singular value decomposition of a matrix $A$ is a factorization of that matrix into three matrices, $A = U \Sigma V^T$, where $U$ is an orthogonal matrix, $V$ is an orthogonal matrix, and $\Sigma$ is a rectangular diagonal matrix with non-negative real numbers on the main diagonal.
The key point of the SVD is that it exists for any rectangular or square matrix.
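A minimal Python/NumPy sketch of the SVD $A = U \Sigma V^T$ for a made-up rectangular matrix, checking that the three factors reconstruct $A$:

import numpy as np

A = np.array([[3., 1., 1.],
              [-1., 3., 1.]])

U, s, Vt = np.linalg.svd(A)            # s holds the singular values (diagonal of Sigma)
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)   # rebuild the rectangular diagonal matrix Sigma

print(np.allclose(A, U @ Sigma @ Vt))  # True: the decomposition reconstructs A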
9. Limit of functions
The limit of a function is a fundamental concept in calculus and analysis concerning the behavior of that function near a particular input: $\lim_{x \to a} f(x) = L$ means that $f(x)$ approaches $L$ as $x$ approaches $a$.
The limit only considers values with $x \neq a$, because the function may not be defined at $x = a$ itself.
10. Derivatives and Differentiation
a. Derivatives
When $f$ is differentiable at every point $x$ in an interval, $f$ is called differentiable in that interval. In this case, the derivative at a point $x$ is called the derivative of $f$ at $x$. This is denoted by $f'(x)$, $\dfrac{dy}{dx}$, $\dfrac{df}{dx}$, etc.
b. Differentiation
The process of finding the derivative of a function $f$ is called differentiation; the derivative can be found by $f'(x) = \lim_{h \to 0} \dfrac{f(x+h) - f(x)}{h}$.
c. Tangent line
The tangent line can be found by the following formula: $y - y_1 = m(x - x_1)$, where $m$ is the slope of the tangent line and $(x_1, y_1)$ is the point of tangency.
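A minimal Python/SymPy sketch of differentiation and the tangent-line formula above, using the made-up function $f(x) = x^2$ at the point $x = 1$:

import sympy as sp

x = sp.symbols('x')
f = x**2

fprime = sp.diff(f, x)                  # derivative: 2*x
a = 1
m = fprime.subs(x, a)                   # slope of the tangent at x = 1 -> 2
tangent = f.subs(x, a) + m * (x - a)    # tangent line: y = 1 + 2*(x - 1) = 2*x - 1

print(fprime, m, sp.expand(tangent))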
11. Local Maximum and Local Minimum
A function $f$ has a local maximum at a point $c$ in its domain if $f(c) \geq f(x)$ for all $x$ near $c$, and a local minimum at $c$ if $f(c) \leq f(x)$ for all $x$ near $c$.
12. Absolute Maximum
It is the point where a function attains its greatest possible value.
13. Absolute Minimum
It is the point where a function attains its smallest possible value.
14. Gradient Descent Method (GDM)
This is a method used when we need to find the critical (extreme) value of a complicated function. The basic idea is to compute the slope of the function (the derivative), move downhill by a step proportional to the learning rate, and repeat until we reach the extreme value.
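A minimal Python sketch of the gradient descent method described above, applied to the made-up function $f(x) = (x-3)^2$, whose minimum is at $x = 3$:

def gradient_descent(fprime, x0, learning_rate=0.1, tol=1e-8, max_iter=10_000):
    # Start from the initial iterate and repeat x_next = x - learning_rate * f'(x)
    # until the step is smaller than the tolerance.
    x = x0
    for _ in range(max_iter):
        step = learning_rate * fprime(x)
        if abs(step) < tol:        # stop when the update is negligible
            break
        x = x - step
    return x

print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # approximately 3.0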
15. Permutations