1. Linear Algebra
"Neural networks are just matrix multiplications with purpose."
Linear algebra provides the language of tensors, the structure of computations, and the geometry of data flow in machine and deep learning. Let's break down its core operations, starting with the dot product and matrix multiplication, both central to model layers, loss calculations, and gradient flow.
Dot Product (Inner Product)
Definition:
Given two vectors:
$$ \mathbf{a} = [a_1, a_2, ..., a_n], \quad \mathbf{b} = [b_1, b_2, ..., b_n] $$
The dot product is:
$$ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i $$
Interpretation:
- Measures similarity (cosine of the angle between vectors)
- Returns a scalar
- Core to attention mechanisms, feature projections, and loss functions
Applications:
- Attention scores in Transformers
- Projection of embeddings
- Energy functions in similarity-based models
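A minimal NumPy sketch of the operation (the vectors below are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: sum of element-wise products -> a scalar
dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32.0

# Similarity interpretation: cosine of the angle between the vectors
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cos_sim)
```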
Matrix Multiplication
Definition:
Given matrix \( A \) of shape \( m \times n \) and matrix \( B \) of shape \( n \times p \):
$$ C = A \cdot B \Rightarrow C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} $$
Visual Example:
Let:
$$ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} $$
Then:
$$ C = AB = \begin{bmatrix} 1\cdot5 + 2\cdot7 & 1\cdot6 + 2\cdot8 \\ 3\cdot5 + 4\cdot7 & 3\cdot6 + 4\cdot8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} $$
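The same product in NumPy, as a quick check of the result above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B                             # matrix multiplication
print(C)                              # [[19 22]
                                      #  [43 50]]

# Not commutative: B @ A gives a different matrix
print(np.array_equal(A @ B, B @ A))   # False
```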
Key Properties:
- Associative: \( (AB)C = A(BC) \)
- Not Commutative: \( AB \ne BA \)
- Used for linear transformations in neural nets
Why It Matters in AI
- Each layer in a neural network is a matrix multiplication:
$$ z = W x + b $$
where \( W \) is the weight matrix and \( x \) the input vector (see the sketch after this list).
- Embeddings, convolutions, attention, and recurrent steps all use matrix ops.
- Matrix multiplication is GPU-optimized, making it the heart of deep learning acceleration.
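A minimal sketch of a single dense layer built from these operations, using NumPy with illustrative shapes (not tied to any specific framework):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4,))        # input vector with 4 features
W = rng.normal(size=(3, 4))      # weight matrix: 3 outputs x 4 inputs
b = np.zeros(3)                  # bias vector

z = W @ x + b                    # one linear layer: z = Wx + b
a = np.maximum(z, 0)             # ReLU non-linearity between layers
print(z.shape, a.shape)          # (3,) (3,)
```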
Related Topics
- Matrix-vector multiplication and linear transforms
- Outer products, Hadamard product, norms, and projections
- How linear layers in neural networks apply weight matrices
Norms and Distances
"Distance is not just how far; it's how your model feels difference."
In machine learning, norms quantify the magnitude of vectors, and distance metrics measure how similar or dissimilar two data points are. These are fundamental to loss functions, regularization, and clustering.
Vector Norms (Length of a Vector)
A norm is a function \( \|x\| \) that assigns a non-negative length or size to a vector \( x \).
L2 Norm (Euclidean Norm)
$$ \|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2} $$
- Interpreted as the straight-line distance from the origin to the point.
- Used in:
- Gradient descent convergence
- L2 regularization (Ridge): \( \lambda \|w\|_2^2 \)
L1 Norm (Manhattan Norm)
$$ \|x\|_1 = \sum_{i=1}^n |x_i| $$
- Interpreted as the distance if moving only along axes.
- Used in:
- Lasso regression
- Promotes sparse weights in models
L∞ Norm (Max Norm)
$$ \|x\|_\infty = \max_i |x_i| $$
- Measures the maximum absolute value.
- Useful for:
- Robust optimization
- Constraint satisfaction
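All three norms computed side by side in NumPy (the vector is illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.linalg.norm(x, ord=1)       # |3| + |-4| + |1| = 8.0
l2   = np.linalg.norm(x)              # sqrt(9 + 16 + 1) ≈ 5.099
linf = np.linalg.norm(x, ord=np.inf)  # max absolute value = 4.0
print(l1, l2, linf)
```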
Distance Metrics
Euclidean Distance (L2)
$$ d(x, y) = \|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2} $$
- Most common distance metric.
- Used in:
- k-NN
- Clustering (e.g., k-means)
- Image similarity
Manhattan Distance (L1)
$$ d(x, y) = \|x - y\|_1 = \sum_i |x_i - y_i| $$
- Better in high-dimensional spaces for sparse data.
Cosine Similarity
$$ \cos(\theta) = \frac{x \cdot y}{\|x\|\|y\|} $$
- Measures angle between vectors, not magnitude.
- Used in:
- NLP word embeddings
- Recommender systems
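A short NumPy sketch comparing the three metrics on two example vectors:

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 0.0])

euclidean = np.linalg.norm(x - y)    # L2 distance
manhattan = np.sum(np.abs(x - y))    # L1 distance
cosine    = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # angle-based similarity
print(euclidean, manhattan, cosine)
```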
In Practice
| Application | Metric Used |
|---|---|
| Regularization | L1 / L2 Norms |
| Optimization | Gradient Norms |
| Model Compression | L1 for pruning |
| Image Similarity | Euclidean or Cosine Distance |
| Attention Mechanisms | Dot product + normalization |
Visualization Tip
Plot \( \|x\|_1 \), \( \|x\|_2 \), and \( \|x\|_\infty \) in 2D: they form diamond, circle, and square shapes respectively, showing the different constraint "balls".
Related Topics
- Linear Algebra Basics
- Matrix Norms and Stability
- Distance in Latent Spaces
- Regularization with Norms
- Projections and Basis Transforms
Eigenvalues & Eigenvectors
"If a transformation stretches space, eigenvectors point where nothing bends."
What is an Eigenvector?
Given a square matrix \( A \), an eigenvector \( \mathbf{v} \) satisfies:
$$ A \mathbf{v} = \lambda \mathbf{v} $$
- \( \mathbf{v} \): the eigenvector, a direction unchanged by the transformation \( A \)
- \( \lambda \): the eigenvalue, a scalar scaling the eigenvector
This means: applying \( A \) to \( \mathbf{v} \) stretches or compresses it, but does not rotate or bend it.
How to Compute
Start from: $$ A \mathbf{v} = \lambda \mathbf{v} \Rightarrow (A - \lambda I)\mathbf{v} = 0 $$
To find \( \lambda \), solve the characteristic equation:
$$ \det(A - \lambda I) = 0 $$
Then for each \( \lambda \), solve:
$$ (A - \lambda I)\mathbf{v} = 0 $$
Example
Let:
$$ A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} $$
Solve: $$ \det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2 - \lambda)^2 - 1 = 0 \Rightarrow \lambda = 1, 3 $$
Each \( \lambda \) gives an eigenvector \( \mathbf{v} \).
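NumPy recovers the same eigenvalues and (normalized) eigenvectors:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)        # [3. 1.]  (order may vary)
print(eigvecs)        # columns are the eigenvectors

# Check the defining property A v = lambda v for the first pair
v = eigvecs[:, 0]
print(np.allclose(A @ v, eigvals[0] * v))   # True
```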
Why It Matters in AI
| Area | Use of Eigen Concepts |
|---|---|
| PCA | Principal components = eigenvectors of covariance |
| Spectral Clustering | Use eigenvectors of graph Laplacian |
| Stability Analysis | Eigenvalues of Jacobians tell if training is stable |
| Weight Analysis | Norms/eigenvalues of weight matrices reflect expressivity |
| Dynamical Systems | In RNNs, eigenvalues affect gradient explosion/vanishing |
Geometric Insight
- Eigenvectors define the invariant axes of transformation.
- Eigenvalues describe how much data is stretched or squashed along those axes.
Deep Learning Insight
In PCA, eigenvectors of the covariance matrix give directions of maximum variance:
$$ \Sigma x = \lambda x $$
In recurrent networks:
- \( |\lambda| > 1 \): gradients may explode
- \( |\lambda| < 1 \): gradients may vanish
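A minimal PCA sketch based on the covariance relation above, using synthetic 2-D data (the values and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices, ascending eigenvalues
top_component = eigvecs[:, -1]          # eigenvector with the largest eigenvalue
projected = Xc @ top_component          # 1-D projection with maximum variance
print(eigvals, top_component, projected.shape)
```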
Related Topics
- Principal Component Analysis
- Stability in Dynamical Systems
- Spectral Graph Theory
- Matrix Decompositions
- Gradient Flow and Eigenvalues
Derivatives & Partial Derivatives
"To teach a model, we show it how it's wrong: via derivatives."
What is a Derivative?
A derivative measures how a function changes as its input changes: the rate of change, or slope.
Formal Definition:
$$ f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$
- \( f'(x) \): how much \( f \) changes with a tiny change in \( x \)
- Visualized as the slope of the tangent line at point \( x \)
Example:
$$ f(x) = x^2 \Rightarrow f'(x) = 2x $$
At \( x = 3 \), the slope is \( f'(3) = 6 \)
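A quick numerical check of that slope with a central difference (the step size is an arbitrary small value):

```python
f = lambda x: x**2
h = 1e-6
x = 3.0
slope = (f(x + h) - f(x - h)) / (2 * h)   # central difference approximation
print(slope)                              # ≈ 6.0
```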
Partial Derivatives
In ML, functions often depend on many variables, like weights \( w_1, w_2, ..., w_n \). A partial derivative shows how the output changes when only one variable changes.
Notation:
$$ \frac{\partial f}{\partial x_i} $$
Example:
$$ f(x, y) = 3x^2 + 2xy $$
$$ \frac{\partial f}{\partial x} = 6x + 2y, \quad \frac{\partial f}{\partial y} = 2x $$
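The same finite-difference idea verifies these partials at an arbitrary sample point, here \( (x, y) = (1, 2) \):

```python
def f(x, y):
    return 3 * x**2 + 2 * x * y

h = 1e-6
x, y = 1.0, 2.0

df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # expect 6x + 2y = 10
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # expect 2x = 2
print(df_dx, df_dy)
```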
Why It Matters in AI
| Use Case | Derivative Role |
|---|---|
| Gradient Descent | Uses derivatives to minimize loss |
| Backpropagation | Chain rule of partials through layers |
| Optimization | Maximize likelihood, minimize error |
| Sensitivity Analysis | How a weight shift affects output |
In Deep Learning
For a loss function \( L(w) \), we compute:
$$ \frac{\partial L}{\partial w_i} \Rightarrow \text{tells how to adjust } w_i $$
These gradients are passed backward through the network via backpropagation, powered by the chain rule.
Visualization Tip
- Plot \( f(x) = x^2 \) and show tangent slopes at different \( x \)
- Illustrate how descending the slope reduces the loss
Related Topics
- Chain Rule (for Backpropagation)
- Gradients, Jacobians, Hessians
Chain Rule (for Backpropagation)
"The chain rule is how intelligence flows backward."
What Is the Chain Rule?
The chain rule in calculus tells us how to compute the derivative of a composite function: a function made of functions.
Scalar Version:
Let: $$ y = f(u), \quad u = g(x) \Rightarrow y = f(g(x)) $$ Then the derivative of \( y \) with respect to \( x \) is: $$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} $$
Example:
Let \( y = \sin(x^2) \):
With \( u = x^2 \) and \( y = \sin(u) \):
$$ \frac{dy}{dx} = \cos(x^2) \cdot 2x $$
Vector Chain Rule (Deep Learning Form)
In deep learning, functions are stacked:
$$ L = f(g(h(x))) $$ where:
- \( h(x) \): Linear transformation
- \( g \): Activation function
- \( f \): Loss function
Using the vector chain rule, you compute: $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} $$
Backpropagation in Practice
Suppose:
- \( z = w^T x + b \)
- \( a = \text{ReLU}(z) \)
- \( L = \text{MSE}(a, y) \)
Then:
$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} $$
This is computed layer by layer, multiplying the local derivatives in reverse order.
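A hand-written sketch of this gradient for a single example, assuming a scalar output and squared-error loss (the values are illustrative):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])   # input
w = np.array([0.3, 0.1, -0.4])   # weights
b = 0.2
y = 1.0                          # target

# Forward pass
z = w @ x + b                    # linear step
a = max(z, 0.0)                  # ReLU
L = (a - y) ** 2                 # squared error for one sample

# Backward pass: chain rule, multiplied in reverse
dL_da = 2 * (a - y)              # dL/da
da_dz = 1.0 if z > 0 else 0.0    # ReLU derivative
dz_dw = x                        # dz/dw
grad_w = dL_da * da_dz * dz_dw   # dL/dw
print(L, grad_w)
```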
Why It Matters
| Concept | Chain Rule Use |
|---|---|
| Backpropagation | Enables neural networks to learn |
| Auto-diff (PyTorch, TF) | Automates gradient computation |
| Sensitivity analysis | Traces how outputs change via internals |
Visualization
Imagine a flowchart:
- Data flows forward layer-by-layer
- Gradients flow backward using local derivatives
Related Topics
- Derivatives & Partial Derivatives
- Gradients, Jacobians, Hessians
Gradients, Jacobians, Hessians
"Where the derivative points in one direction, gradients tell us where to go in many."
1. Gradient: Direction of Steepest Ascent/Descent
The gradient of a scalar function \( f:\mathbb{R}^n \rightarrow \mathbb{R} \) is a vector of partial derivatives:
$$ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$
- Points in the direction of maximum increase
- Used in gradient descent to move in the opposite direction
Example:
If \( f(x, y) = x^2 + y^2 \), then: $$ \nabla f(x, y) = \begin{bmatrix} 2x \\ 2y \end{bmatrix} $$
2. Jacobian Matrix: Vector-Valued Derivatives
For a vector-valued function \( \mathbf{f}:\mathbb{R}^n \rightarrow \mathbb{R}^m \), the Jacobian is an \( m \times n \) matrix:
$$ J_{ij} = \frac{\partial f_i}{\partial x_j} $$
Example:
Let: $$ \mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ \sin(xy) \end{bmatrix} $$ Then the Jacobian is: $$ J = \begin{bmatrix} 2x & 1 \\ y \cos(xy) & x \cos(xy) \end{bmatrix} $$
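A finite-difference check of this Jacobian at an arbitrary point:

```python
import numpy as np

def f(v):
    x, y = v
    return np.array([x**2 + y, np.sin(x * y)])

def numerical_jacobian(f, v, h=1e-6):
    v = np.asarray(v, dtype=float)
    J = np.zeros((len(f(v)), len(v)))
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        J[:, j] = (f(v + e) - f(v - e)) / (2 * h)   # central difference per input
    return J

v = np.array([1.0, 2.0])
print(numerical_jacobian(f, v))
# Analytic: [[2x, 1], [y*cos(xy), x*cos(xy)]] at (1, 2) -> [[2, 1], [2*cos(2), cos(2)]]
```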
Why the Jacobian Matters
| Use Case | Role of Jacobian |
|---|---|
| Neural nets (forward/backprop) | Maps input deltas to output deltas |
| Auto-differentiation | Tracks vector transformations |
| Generative models (e.g. flows) | Controls volume distortion in density |
3. Hessian Matrix: Second-Order Derivatives
The Hessian is a square matrix of second-order partial derivatives of a scalar function: $$ H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$
- Describes local curvature of the function
- Used in second-order optimization like Newton's Method
Example:
For \( f(x, y) = x^2 + xy + y^2 \), the Hessian is: $$ H = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} $$
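For this quadratic the Hessian is constant; a small finite-difference sketch confirms it:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + x * y + y**2

def numerical_hessian(f, v, h=1e-4):
    v = np.asarray(v, dtype=float)
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            # Second-order central difference for d^2 f / (dx_i dx_j)
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4 * h**2)
    return H

print(numerical_hessian(f, [1.0, 1.0]))   # ≈ [[2, 1], [1, 2]]
```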
Applications in AI
| Tool | Mathematical Object |
|---|---|
| Gradient Descent | Gradient \( \nabla f \) |
| Layer-wise backpropagation | Jacobian |
| Curvature-based optimization | Hessian |
| Saddle point detection | Eigenvalues of Hessian |
| Neural Tangent Kernel | Hessian-like constructs |
Visualization Tips
- Gradient: an arrow in the scalar field; gradient descent steps "downhill" along its negative
- Jacobian: the local linear approximation of a vector mapping
- Hessian: the shape of the surface, convex (bowl) or saddle
Related Topics
- Derivatives & Partial Derivatives
- Chain Rule (for Backpropagation)
- Eigenvalues & Eigenvectors
Bayes' Rule
"What's the probability of a cause, given an effect?"
The Rule Itself
$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$
- Posterior \( P(A|B) \): probability of A given B (what we want to find)
- Likelihood \( P(B|A) \): how likely is B if A is true
- Prior \( P(A) \): belief in A before seeing B
- Evidence \( P(B) \): total probability of B across all causes
Intuitive Example
Let A = "patient has disease", B = "test is positive":
$$ P(\text{disease}|\text{positive}) = \frac{P(\text{positive}|\text{disease}) \cdot P(\text{disease})}{P(\text{positive})} $$
In Machine Learning
| Use Case | Role of Bayes' Rule |
|---|---|
| Naive Bayes Classifier | Estimates \( P(\text{class} \mid \text{features}) \) |
| Bayesian Inference | Updates posterior beliefs over parameters |
| Generative Models | Infers latent variables given observed data |
| Uncertainty Quantification | Models belief over predictions |
Naive Bayes Model Equation
Assuming conditional independence: $$ P(y|x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i|y) $$ where \( P(y) \) is the class prior and \( P(x_i|y) \) is the likelihood of each feature.
Example: Naive Bayes in NLP
For a spam classifier: $$ P(\text{spam}|\text{words}) \propto P(\text{spam}) \cdot P(w_1|\text{spam}) \cdot P(w_2|\text{spam}) \cdots $$
Numerically Stable Form (Log Space)
To prevent underflow in computation: $$ \log P(A|B) = \log P(B|A) + \log P(A) - \log P(B) $$
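A toy sketch combining the spam example with the log-space form; the priors and word likelihoods below are made-up numbers for illustration:

```python
import math

# Hypothetical class priors and per-word likelihoods
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "win": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "win": 0.01, "meeting": 0.25},
}

words = ["free", "win"]

def log_score(label):
    # log P(label) + sum_i log P(word_i | label): proportional to the log posterior
    return math.log(prior[label]) + sum(math.log(likelihood[label][w]) for w in words)

scores = {label: log_score(label) for label in prior}
print(max(scores, key=scores.get), scores)   # picks the higher-scoring class
```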
Visualization Tip
- Use a dynamic diagram showing how the posterior shifts based on likelihood strength
- Prior belief gets "reshaped" by incoming evidence
Related Topics
- Likelihood Functions & Maximum Likelihood Estimation
- KL Divergence & Entropy
- Probability Distributions (Gaussian, Bernoulli, etc.)
Expectation & Variance
"Expectation captures what's typical; variance captures how much it can surprise you."
1. Expectation (Mean)
The expectation of a random variable is its average value if you repeat the process infinitely.
Discrete Case:
$$ \mathbb{E}[X] = \sum_{i} x_i \cdot P(x_i) $$
Continuous Case:
$$ \mathbb{E}[X] = \int x \cdot p(x) \, dx $$
Intuition:
- Think of it as the center of mass of the probability distribution.
- In supervised learning, we often assume:
$$ \hat{y} = \mathbb{E}[Y \mid X] $$ This means the model predicts the expected label given an input.
Example:
If a coin has outcomes \(\{0, 1\}\) with \(P(1) = 0.7\), then:
$$ \mathbb{E}[X] = 0 \cdot 0.3 + 1 \cdot 0.7 = 0.7 $$
2. Variance
The variance measures how far the values of a random variable deviate from the mean.
$$
\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]
$$
or equivalently:
$$
\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2
$$
- Low variance: values are tightly clustered
- High variance: values are spread out
Example:
For \(X \in \{1, 3\}\) with equal probability:
- \(\mathbb{E}[X] = 2\)
- \(\mathrm{Var}(X) = 0.5(1^2 + 3^2) - 2^2 = 5 - 4 = 1\)
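A quick check of this example, plus the sample-based estimates NumPy computes:

```python
import numpy as np

values = np.array([1.0, 3.0])
probs  = np.array([0.5, 0.5])

mean = np.sum(values * probs)               # E[X] = 2.0
var  = np.sum(values**2 * probs) - mean**2  # E[X^2] - (E[X])^2 = 1.0
print(mean, var)

# Empirical estimates from samples converge to the same numbers
samples = np.random.default_rng(0).choice(values, size=100_000, p=probs)
print(samples.mean(), samples.var())
```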
In Machine Learning
| Concept | How Expectation/Variance Is Used |
|---|---|
| Loss Functions | Minimize expected error over data: \( \mathbb{E}[\text{Loss}] \) |
| Bayesian Inference | Posterior = weighted average (expectation) |
| Generalization Analysis | Bias-Variance decomposition |
| Variational Autoencoders | Variance in latent variables |
| Regularization | Implicitly controls variance |
Bias-Variance Tradeoff
$$ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$
- Bias: error from wrong assumptions
- Variance: error from model sensitivity to training data
Visualization Tip
- Plot a distribution and overlay its mean and ±1 standard deviation (√variance) range.
- Show how flatter/wider distributions have higher variance.
Related Topics
- Bayes' Rule
- Probability Distributions (Bernoulli & Gaussian)
Probability Distributions (Bernoulli & Gaussian)
"All uncertainty in ML flows through a distribution."
1. Bernoulli Distribution (Binary Outcomes)
The Bernoulli distribution models a single binary outcome: success (1) or failure (0).
PMF (Probability Mass Function):
$$ P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0,1\} $$
- p: probability of success (i.e., \( X = 1 \))
- Mean: \( \mathbb{E}[X] = p \)
- Variance: \( \mathrm{Var}(X) = p(1 - p) \)
Example:
In binary classification, a model might output: $$ \hat{y} = \text{sigmoid}(z) = p $$ Then the label is sampled as: $$ y \sim \text{Bernoulli}(p) $$
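A minimal sketch of that sampling step (the logit value is arbitrary):

```python
import numpy as np

z = 0.8                                   # an arbitrary logit from a model
p = 1.0 / (1.0 + np.exp(-z))              # sigmoid -> probability of class 1

rng = np.random.default_rng(0)
y = rng.binomial(n=1, p=p, size=10)       # Bernoulli(p) = Binomial(n=1, p)
print(p, y, y.mean())                     # the sample mean approaches p
```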
Applications in ML:
| Task | Why Bernoulli? |
|---|---|
| Binary classification | Models 0/1 output probabilities |
| Logistic regression | Predicts \( p \), then samples label |
| Bernoulli Naive Bayes | For binary feature datasets |
2. Gaussian Distribution (Normal Distribution)
The Gaussian models continuous data with bell-shaped symmetry.
PDF (Probability Density Function):
$$ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$
- Mean: \( \mathbb{E}[X] = \mu \)
- Variance: \( \mathrm{Var}(X) = \sigma^2 \)
- Continuous support: \( x \in \mathbb{R} \)
Multivariate Gaussian:
$$ p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right) $$
- \( \mu \): mean vector
- \( \Sigma \): covariance matrix
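A direct implementation of both densities in NumPy (the parameters below are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate normal density
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def multivariate_gaussian_pdf(x, mu, Sigma):
    # d-dimensional normal density with mean vector mu and covariance Sigma
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))                    # ≈ 0.3989
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(multivariate_gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```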
Applications in ML:
| Use Case | Role of Gaussian |
|---|---|
| Regression error modeling | Assume residuals are Normal |
| Variational Inference | Latent variable priors (VAEs) |
| Gaussian Mixture Models | Cluster soft assignments |
| Kalman Filters | State uncertainty |
Visualization Ideas
- Bernoulli: two bars at 0 and 1, height = \( p \) and \( 1 - p \)
- Gaussian: classic bell curve; adjust \( \mu, \sigma \) to show shape variation