1. Linear Algebra
"Neural networks are just matrix multiplications with purpose."
Linear algebra provides the language of tensors, the structure of computations, and the geometry of data flow in machine and deep learning. Let's break down its core operations, starting with the dot product and matrix multiplication, both central to model layers, loss calculations, and gradient flow.
Dot Product (Inner Product)
Definition:
Given two vectors:
$$ \mathbf{a} = [a_1, a_2, ..., a_n], \quad \mathbf{b} = [b_1, b_2, ..., b_n] $$
The dot product is:
$$ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i $$
Interpretation:
- Measures similarity (cosine of the angle between vectors)
- Returns a scalar
- Core to attention mechanisms, feature projections, and loss functions
Applications:
- Attention scores in Transformers
- Projection of embeddings
- Energy functions in similarity-based models
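A minimal NumPy sketch of the operation (the vectors below are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: sum of element-wise products -> a scalar
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32.0

# Similarity interpretation: cosine of the angle between the vectors
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cos_sim)
```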
Matrix Multiplication
Definition:
Given matrix \( A \) of shape \( m \times n \) and matrix \( B \) of shape \( n \times p \):
$$ C = A \cdot B \Rightarrow C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} $$
Visual Example:
Let:
$$ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} $$
Then:
$$ C = AB = \begin{bmatrix} 1\cdot5 + 2\cdot7 & 1\cdot6 + 2\cdot8 \\ 3\cdot5 + 4\cdot7 & 3\cdot6 + 4\cdot8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} $$
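The same product in NumPy, as a quick check of the result above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B                             # matrix multiplication
print(C)                              # [[19 22]
                                      #  [43 50]]

# Not commutative: B @ A gives a different matrix
print(np.array_equal(A @ B, B @ A))   # False
```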
Key Properties:
- Associative: \( (AB)C = A(BC) \)
- Not Commutative: \( AB \ne BA \)
- Used for linear transformations in neural nets
Why It Matters in AI
- Each layer in a neural network is a matrix multiplication:
$$ z = W x + b $$
where \( W \) is the weight matrix and \( x \) the input vector (see the sketch after this list).
- Embeddings, convolutions, attention, and recurrent steps all use matrix ops.
- Matrix multiplication is GPU-optimized, making it the heart of deep learning acceleration.
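A minimal sketch of a single dense layer built from these operations, using NumPy with illustrative shapes (not tied to any specific framework):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))        # input vector with 4 features
W = rng.normal(size=(3, 4))      # weight matrix: 3 outputs x 4 inputs
b = np.zeros(3)                  # bias vector

z = W @ x + b                    # one linear layer: z = Wx + b
a = np.maximum(z, 0)             # ReLU non-linearity between layers
print(z.shape, a.shape)          # (3,) (3,)
```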
Related Topics
- Matrix-vector multiplication and linear transforms
- Outer products, Hadamard product, norms, and projections
- How linear layers in neural networks apply weight matrices
Norms and Distances
"Distance is not just how far; it's how your model feels difference."
In machine learning, norms quantify the magnitude of vectors, and distance metrics measure how similar or dissimilar two data points are. These are fundamental to loss functions, regularization, and clustering.
Vector Norms (Length of a Vector)
A norm is a function \( \|x\| \) that assigns a non-negative length or size to a vector \( x \).
L2 Norm (Euclidean Norm)
$$ \|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2} $$
- Interpreted as the straight-line distance from the origin to the point.
- Used in:
- Gradient descent convergence
- L2 regularization (Ridge): \( \lambda \|w\|_2^2 \)
L1 Norm (Manhattan Norm)
$$ \|x\|_1 = \sum_{i=1}^n |x_i| $$
- Interpreted as the distance if moving only along axes.
- Used in:
- Lasso regression
- Promotes sparse weights in models
L∞ Norm (Max Norm)
$$ \|x\|_\infty = \max_i |x_i| $$
- Measures the maximum absolute value.
- Useful for:
- Robust optimization
- Constraint satisfaction
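All three norms computed side by side in NumPy (the vector is illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.linalg.norm(x, ord=1)       # |3| + |-4| + |1| = 8.0
l2   = np.linalg.norm(x)              # sqrt(9 + 16 + 1) ≈ 5.099
linf = np.linalg.norm(x, ord=np.inf)  # max absolute value = 4.0
print(l1, l2, linf)
```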
Distance Metrics
Euclidean Distance (L2)
$$ d(x, y) = \|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2} $$
- Most common distance metric.
- Used in:
- k-NN
- Clustering (e.g., k-means)
- Image similarity
Manhattan Distance (L1)
$$ d(x, y) = \|x - y\|_1 = \sum_i |x_i - y_i| $$
- Better in high-dimensional spaces for sparse data.
Cosine Similarity
$$ \cos(\theta) = \frac{x \cdot y}{\|x\|\|y\|} $$
- Measures angle between vectors, not magnitude.
- Used in:
- NLP word embeddings
- Recommender systems
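A short NumPy sketch comparing the three metrics on two example vectors:

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 0.0])

euclidean = np.linalg.norm(x - y)    # L2 distance
manhattan = np.sum(np.abs(x - y))    # L1 distance
cosine    = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # angle-based similarity
print(euclidean, manhattan, cosine)
```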
In Practice
| Application | Metric Used |
|---|---|
| Regularization | L1 / L2 Norms |
| Optimization | Gradient Norms |
| Model Compression | L1 for pruning |
| Image Similarity | Euclidean or Cosine Distance |
| Attention Mechanisms | Dot product + normalization |
Visualization Tip
Plot \( \|x\|_1 \), \( \|x\|_2 \), and \( \|x\|_\infty \) in 2D: they form diamond, circle, and square shapes respectively, showing the different constraint "balls".
Related Topics
- Linear Algebra Basics
- Matrix Norms and Stability
- Distance in Latent Spaces
- Regularization with Norms
- Projections and Basis Transforms
Eigenvalues & Eigenvectors
"If a transformation stretches space, eigenvectors point where nothing bends."
What is an Eigenvector?
Given a square matrix \( A \), an eigenvector \( \mathbf{v} \) satisfies:
$$ A \mathbf{v} = \lambda \mathbf{v} $$
- \( \mathbf{v} \): the eigenvector, a direction unchanged by the transformation \( A \)
- \( \lambda \): the eigenvalue, a scalar scaling the eigenvector
This means: applying \( A \) to \( \mathbf{v} \) stretches or compresses it, but does not rotate or bend it.
How to Compute
Start from: $$ A \mathbf{v} = \lambda \mathbf{v} \Rightarrow (A - \lambda I)\mathbf{v} = 0 $$
To find \( \lambda \), solve the characteristic equation:
$$ \det(A - \lambda I) = 0 $$
Then for each \( \lambda \), solve:
$$ (A - \lambda I)\mathbf{v} = 0 $$
Example
Let:
$$ A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} $$
Solve: $$ \det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2 - \lambda)^2 - 1 = 0 \Rightarrow \lambda = 1, 3 $$
Each \( \lambda \) gives an eigenvector \( \mathbf{v} \).
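NumPy recovers the same eigenvalues and (normalized) eigenvectors:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)        # [3. 1.]  (order may vary)
print(eigvecs)        # columns are the eigenvectors

# Check the defining property A v = lambda v for the first pair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))   # True
```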
Why It Matters in AI
| Area | Use of Eigen Concepts |
|---|---|
| PCA | Principal components = eigenvectors of covariance |
| Spectral Clustering | Use eigenvectors of graph Laplacian |
| Stability Analysis | Eigenvalues of Jacobians tell if training is stable |
| Weight Analysis | Norms/eigenvalues of weight matrices reflect expressivity |
| Dynamical Systems | In RNNs, eigenvalues affect gradient explosion/vanishing |
Geometric Insight
- Eigenvectors define the invariant axes of transformation.
- Eigenvalues describe how much data is stretched or squashed along those axes.
Deep Learning Insight
In PCA, eigenvectors of the covariance matrix give directions of maximum variance:
$$ \Sigma x = \lambda x $$
In recurrent networks:
- \( |\lambda| > 1 \): gradients may explode
- \( |\lambda| < 1 \): gradients may vanish
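A minimal PCA sketch based on the covariance relation above, using synthetic 2-D data (the values and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices, ascending eigenvalues
top_component = eigvecs[:, -1]          # eigenvector with the largest eigenvalue
projected = Xc @ top_component          # 1-D projection with maximum variance
print(eigvals, top_component, projected.shape)
```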
Related Topics
- Principal Component Analysis
- Stability in Dynamical Systems
- Spectral Graph Theory
- Matrix Decompositions
- Gradient Flow and Eigenvalues
Derivatives & Partial Derivatives
"To teach a model, we show it how it's wrong: via derivatives."
What is a Derivative?
A derivative measures how a function changes as its input changes: the rate of change, or slope.
Formal Definition:
$$ f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$
- \( f'(x) \): how much \( f \) changes with a tiny change in \( x \)
- Visualized as the slope of the tangent line at point \( x \)
Example:
$$ f(x) = x^2 \Rightarrow f'(x) = 2x $$
At \( x = 3 \), the slope is \( f'(3) = 6 \)
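A quick numerical check of that slope with a central difference (the step size is an arbitrary small value):

```python
f = lambda x: x**2
h = 1e-6
x = 3.0
slope = (f(x + h) - f(x - h)) / (2 * h)   # central difference approximation
print(slope)                              # ≈ 6.0
```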
Partial Derivatives
In ML, functions often depend on many variables, like weights \( w_1, w_2, ..., w_n \). A partial derivative shows how the output changes when only one variable changes.
Notation:
$$ \frac{\partial f}{\partial x_i} $$
Example:
$$ f(x, y) = 3x^2 + 2xy $$
$$ \frac{\partial f}{\partial x} = 6x + 2y, \quad \frac{\partial f}{\partial y} = 2x $$
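The same finite-difference idea verifies these partials at an arbitrary sample point, here \( (x, y) = (1, 2) \):

```python
def f(x, y):
    return 3 * x**2 + 2 * x * y

h = 1e-6
x, y = 1.0, 2.0

df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # expect 6x + 2y = 10
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # expect 2x = 2
print(df_dx, df_dy)
```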
Why It Matters in AI
| Use Case | Derivative Role |
|---|---|
| Gradient Descent | Uses derivatives to minimize loss |
| Backpropagation | Chain rule of partials through layers |
| Optimization | Maximize likelihood, minimize error |
| Sensitivity Analysis | How a weight shift affects output |
In Deep Learning
For a loss function \( L(w) \), we compute:
$$ \frac{\partial L}{\partial w_i} \Rightarrow \text{tells how to adjust } w_i $$
These gradients are passed backward through the network via backpropagation, powered by the chain rule.
Visualization Tip
- Plot \( f(x) = x^2 \) and show tangent slopes at different \( x \)
- Illustrate how descending the slope reduces the loss
Related Topics
- Chain Rule (for Backpropagation)
- Gradients, Jacobians, Hessians
Chain Rule (for Backpropagation)
"The chain rule is how intelligence flows backward."
What Is the Chain Rule?
The chain rule in calculus tells us how to compute the derivative of a composite function: a function made of functions.
Scalar Version:
Let: $$ y = f(u), \quad u = g(x) \Rightarrow y = f(g(x)) $$ Then the derivative of \( y \) with respect to \( x \) is: $$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} $$
Example:
Let \( y = \sin(x^2) \):
With \( u = x^2 \) and \( y = \sin(u) \):
$$ \frac{dy}{dx} = \cos(x^2) \cdot 2x $$
Vector Chain Rule (Deep Learning Form)
In deep learning, functions are stacked:
$$ L = f(g(h(x))) $$ where:
- \( h(x) \): Linear transformation
- \( g \): Activation function
- \( f \): Loss function
Using the vector chain rule, you compute: $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} $$
Backpropagation in Practice
Suppose:
- \( z = w^T x + b \)
- \( a = \text{ReLU}(z) \)
- \( L = \text{MSE}(a, y) \)
Then:
$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} $$
This is computed layer by layer, multiplying the local derivatives in reverse order.
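A hand-written sketch of this gradient for a single example, assuming a scalar output and squared-error loss (the values are illustrative):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])   # input
w = np.array([0.3, 0.1, -0.4])   # weights
b = 0.2
y = 1.0                          # target

# Forward pass
z = w @ x + b                    # linear step
a = max(z, 0.0)                  # ReLU
L = (a - y) ** 2                 # squared error for one sample

# Backward pass: chain rule, multiplied in reverse
dL_da = 2 * (a - y)              # dL/da
da_dz = 1.0 if z > 0 else 0.0    # ReLU derivative
dz_dw = x                        # dz/dw
grad_w = dL_da * da_dz * dz_dw   # dL/dw
print(L, grad_w)
```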
Why It Matters
| Concept | Chain Rule Use |
|---|---|
| Backpropagation | Enables neural networks to learn |
| Auto-diff (PyTorch, TF) | Automates gradient computation |
| Sensitivity analysis | Traces how outputs change via internals |
Visualization
Imagine a flowchart:
- Data flows forward layer-by-layer
- Gradients flow backward using local derivatives
Related Topics
- Derivatives & Partial Derivatives
- Gradients, Jacobians, Hessians
Gradients, Jacobians, Hessians
"Where the derivative points in one direction, gradients tell us where to go in many."
1. Gradient: Direction of Steepest Ascent/Descent
The gradient of a scalar function \( f:\mathbb{R}^n \rightarrow \mathbb{R} \) is a vector of partial derivatives:
$$ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$
- Points in the direction of maximum increase
- Used in gradient descent to move in the opposite direction
Example:
If \( f(x, y) = x^2 + y^2 \), then: $$ \nabla f(x, y) = \begin{bmatrix} 2x \\ 2y \end{bmatrix} $$
2. Jacobian Matrix: Vector-Valued Derivatives
For a vector-valued function \( \mathbf{f}:\mathbb{R}^n \rightarrow \mathbb{R}^m \), the Jacobian is an \( m \times n \) matrix:
$$ J_{ij} = \frac{\partial f_i}{\partial x_j} $$
Example:
Let: $$ \mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ \sin(xy) \end{bmatrix} $$ Then the Jacobian is: $$ J = \begin{bmatrix} 2x & 1 \\ y \cos(xy) & x \cos(xy) \end{bmatrix} $$
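A finite-difference check of this Jacobian at an arbitrary point:

```python
import numpy as np

def f(v):
    x, y = v
    return np.array([x**2 + y, np.sin(x * y)])

def numerical_jacobian(f, v, h=1e-6):
    v = np.asarray(v, dtype=float)
    J = np.zeros((len(f(v)), len(v)))
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        J[:, j] = (f(v + e) - f(v - e)) / (2 * h)   # central difference per input
    return J

v = np.array([1.0, 2.0])
print(numerical_jacobian(f, v))
# Analytic: [[2x, 1], [y*cos(xy), x*cos(xy)]] at (1, 2) -> [[2, 1], [2*cos(2), cos(2)]]
```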
Why the Jacobian Matters
| Use Case | Role of Jacobian |
|---|---|
| Neural nets (forward/backprop) | Maps input deltas to output deltas |
| Auto-differentiation | Tracks vector transformations |
| Generative models (e.g. flows) | Controls volume distortion in density |
3. Hessian Matrix: Second-Order Derivatives
The Hessian is a square matrix of second-order partial derivatives of a scalar function: $$ H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$
- Describes local curvature of the function
- Used in second-order optimization like Newton's Method
Example:
For \( f(x, y) = x^2 + xy + y^2 \), the Hessian is: $$ H = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} $$
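For this quadratic the Hessian is constant; a small finite-difference sketch confirms it:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + x * y + y**2

def numerical_hessian(f, v, h=1e-4):
    v = np.asarray(v, dtype=float)
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            # Second-order central difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4 * h**2)
    return H

print(numerical_hessian(f, [1.0, 1.0]))   # ≈ [[2, 1], [1, 2]]
```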
Applications in AI
| Tool | Mathematical Object |
|---|---|
| Gradient Descent | Gradient \( \nabla f \) |
| Layer-wise backpropagation | Jacobian |
| Curvature-based optimization | Hessian |
| Saddle point detection | Eigenvalues of Hessian |
| Neural Tangent Kernel | Hessian-like constructs |
Visualization Tips
- Gradient: an arrow in the scalar field; gradient descent steps "downhill" along its negative
- Jacobian: the local linear approximation of a vector mapping
- Hessian: the shape of the surface, convex (bowl) or saddle
Related Topics
- Derivatives & Partial Derivatives
- Chain Rule (for Backpropagation)
- Eigenvalues & Eigenvectors
Bayes' Rule
"What's the probability of a cause, given an effect?"
The Rule Itself
$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$
- Posterior \( P(A|B) \): probability of A given B (what we want to find)
- Likelihood \( P(B|A) \): how likely is B if A is true
- Prior \( P(A) \): belief in A before seeing B
- Evidence \( P(B) \): total probability of B across all causes
Intuitive Example
Let A = "patient has disease", B = "test is positive":
$$ P(\text{disease}|\text{positive}) = \frac{P(\text{positive}|\text{disease}) \cdot P(\text{disease})}{P(\text{positive})} $$
In Machine Learning
| Use Case | Role of Bayes' Rule |
|---|---|
| Naive Bayes Classifier | Estimates \( P(\text{class} \mid \text{features}) \) |
| Bayesian Inference | Updates posterior beliefs over parameters |
| Generative Models | Infers latent variables given observed data |
| Uncertainty Quantification | Models belief over predictions |
Naive Bayes Model Equation
Assuming conditional independence: $$ P(y|x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i|y) $$ where \( P(y) \) is the class prior and \( P(x_i|y) \) is the likelihood of each feature.
Example: Naive Bayes in NLP
For a spam classifier: $$ P(\text{spam}|\text{words}) \propto P(\text{spam}) \cdot P(w_1|\text{spam}) \cdot P(w_2|\text{spam}) \cdots $$
Numerically Stable Form (Log Space)
To prevent underflow in computation: $$ \log P(A|B) = \log P(B|A) + \log P(A) - \log P(B) $$
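A toy sketch combining the spam example with the log-space form; the priors and word likelihoods below are made-up numbers for illustration:

```python
import math

# Hypothetical class priors and per-word likelihoods
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "win": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "win": 0.01, "meeting": 0.25},
}

words = ["free", "win"]

def log_score(label):
    # log P(label) + sum_i log P(word_i | label): proportional to the log posterior
    return math.log(prior[label]) + sum(math.log(likelihood[label][w]) for w in words)

scores = {label: log_score(label) for label in prior}
print(max(scores, key=scores.get), scores)   # picks the higher-scoring class
```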
Visualization Tip
- Use a dynamic diagram showing how the posterior shifts based on likelihood strength
- Prior belief gets "reshaped" by incoming evidence
Related Topics
- Likelihood Functions & Maximum Likelihood Estimation
- KL Divergence & Entropy
- Probability Distributions (Gaussian, Bernoulli, etc.)
Expectation & Variance
"Expectation captures what's typical; variance captures how much it can surprise you."
1. Expectation (Mean)
The expectation of a random variable is its average value if you repeat the process infinitely.
Discrete Case:
$$ \mathbb{E}[X] = \sum_{i} x_i \cdot P(x_i) $$
Continuous Case:
$$ \mathbb{E}[X] = \int x \cdot p(x) \, dx $$
Intuition:
- Think of it as the center of mass of the probability distribution.
- In supervised learning, we often assume:
$$ \hat{y} = \mathbb{E}[Y \mid X] $$ This means the model predicts the expected label given an input.
Example:
If a coin has outcomes \(\{0, 1\}\) with \(P(1) = 0.7\), then:
$$ \mathbb{E}[X] = 0 \cdot 0.3 + 1 \cdot 0.7 = 0.7 $$
2. Variance
The variance measures how far the values of a random variable deviate from the mean.
$$
\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]
$$
or equivalently:
$$
\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2
$$
- Low variance: values are tightly clustered
- High variance: values are spread out
Example:
For \(X \in \{1, 3\}\) with equal probability:
- \(\mathbb{E}[X] = 2\)
- \(\mathrm{Var}(X) = 0.5(1^2 + 3^2) - 2^2 = 5 - 4 = 1\)
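A quick check of this example, plus the sample-based estimates NumPy computes:

```python
import numpy as np

values = np.array([1.0, 3.0])
probs  = np.array([0.5, 0.5])

mean = np.sum(values * probs)               # E[X] = 2.0
var  = np.sum(values**2 * probs) - mean**2  # E[X^2] - (E[X])^2 = 1.0
print(mean, var)

# Empirical estimates from samples converge to the same numbers
samples = np.random.default_rng(0).choice(values, size=100_000, p=probs)
print(samples.mean(), samples.var())
```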
In Machine Learning
| Concept | How Expectation/Variance Is Used |
|---|---|
| Loss Functions | Minimize expected error over data: \( \mathbb{E}[\text{Loss}] \) |
| Bayesian Inference | Posterior = weighted average (expectation) |
| Generalization Analysis | Bias-Variance decomposition |
| Variational Autoencoders | Variance in latent variables |
| Regularization | Implicitly controls variance |
Bias-Variance Tradeoff
$$ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$
- Bias: error from wrong assumptions
- Variance: error from model sensitivity to training data
Visualization Tip
- Plot a distribution and overlay its mean and ±1 standard deviation (√variance) range.
- Show how flatter/wider distributions have higher variance.
Related Topics
- Bayes' Rule
- Probability Distributions (Bernoulli & Gaussian)
Probability Distributions (Bernoulli & Gaussian)
"All uncertainty in ML flows through a distribution."
1. Bernoulli Distribution (Binary Outcomes)
The Bernoulli distribution models a single binary outcome: success (1) or failure (0).
PMF (Probability Mass Function):
$$ P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0,1\} $$
- p: probability of success (i.e., \( X = 1 \))
- Mean: \( \mathbb{E}[X] = p \)
- Variance: \( \mathrm{Var}(X) = p(1 - p) \)
Example:
In binary classification, a model might output: $$ \hat{y} = \text{sigmoid}(z) = p $$ Then the label is sampled as: $$ y \sim \text{Bernoulli}(p) $$
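A minimal sketch of that sampling step (the logit value is arbitrary):

```python
import numpy as np

z = 0.8                                   # an arbitrary logit from a model
p = 1.0 / (1.0 + np.exp(-z))              # sigmoid -> probability of class 1

rng = np.random.default_rng(0)
y = rng.binomial(n=1, p=p, size=10)       # Bernoulli(p) = Binomial(n=1, p)
print(p, y, y.mean())                     # the sample mean approaches p
```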
Applications in ML:
| Task | Why Bernoulli? |
|---|---|
| Binary classification | Models 0/1 output probabilities |
| Logistic regression | Predicts \( p \), then samples label |
| Bernoulli Naive Bayes | For binary feature datasets |
2. Gaussian Distribution (Normal Distribution)
The Gaussian models continuous data with bell-shaped symmetry.
PDF (Probability Density Function):
$$ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$
- Mean: \( \mathbb{E}[X] = \mu \)
- Variance: \( \mathrm{Var}(X) = \sigma^2 \)
- Continuous support: \( x \in \mathbb{R} \)
Multivariate Gaussian:
$$ p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right) $$
- \( \mu \): mean vector
- \( \Sigma \): covariance matrix
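A direct implementation of both densities in NumPy (the parameters below are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def multivariate_gaussian_pdf(x, mu, Sigma):
    # d-dimensional normal density with mean vector mu and covariance Sigma
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))                    # ≈ 0.3989
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(multivariate_gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```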
Applications in ML:
| Use Case | Role of Gaussian |
|---|---|
| Regression error modeling | Assume residuals are Normal |
| Variational Inference | Latent variable priors (VAEs) |
| Gaussian Mixture Models | Cluster soft assignments |
| Kalman Filters | State uncertainty |
Visualization Ideas
- Bernoulli: two bars at 0 and 1, height = \( p \) and \( 1 - p \)
- Gaussian: classic bell curve; adjust \( \mu, \sigma \) to show shape variation