🔢 1. Linear Algebra

"Neural networks are just matrix multiplications with purpose."

Linear algebra provides the language of tensors, the structure of computations, and the geometry of data flow in machine and deep learning. Let's break down its core operations, starting with the dot product and matrix multiplication, both central to model layers, loss calculations, and gradient flow.


๐Ÿ“ Dot Product (Inner Product)

Definition:

Given two vectors:

$$ \mathbf{a} = [a_1, a_2, ..., a_n], \quad \mathbf{b} = [b_1, b_2, ..., b_n] $$

The dot product is:

$$ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i $$

Interpretation:

  • Measures alignment between vectors: \( \mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\| \cos\theta \), so it relates directly to the cosine of the angle between them
  • Returns a scalar
  • Core to attention mechanisms, feature projections, and loss functions

Applications:

  • Attention scores in Transformers
  • Projection of embeddings
  • Energy functions in similarity-based models
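
A minimal NumPy sketch of the dot product, written out as a sum and via `np.dot`; normalizing it by the vector lengths gives the cosine similarity used for attention-style scoring (the example vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: sum_i a_i * b_i
manual = np.sum(a * b)      # 1*4 + 2*5 + 3*6 = 32
builtin = np.dot(a, b)      # same scalar via np.dot

# Cosine similarity = dot product normalized by the vector lengths
cos_sim = builtin / (np.linalg.norm(a) * np.linalg.norm(b))

print(manual, builtin, cos_sim)
```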

๐Ÿ“ Matrix Multiplication

Definition:

Given matrix \( A \) of shape \( m \times n \) and matrix \( B \) of shape \( n \times p \):

$$ C = A \cdot B \Rightarrow C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj} $$

Visual Example:

Let:

$$ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} $$

Then:

$$ C = AB = \begin{bmatrix} 1\cdot5 + 2\cdot7 & 1\cdot6 + 2\cdot8 \\ 3\cdot5 + 4\cdot7 & 3\cdot6 + 4\cdot8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} $$

Key Properties:

  • Associative: \( (AB)C = A(BC) \)
  • Not Commutative: \( AB \ne BA \)
  • Used for linear transformations in neural nets
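
A short NumPy check of the worked example above, plus a demonstration that swapping the operands changes the result:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B                      # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
print(C)                       # [[19 22]
                               #  [43 50]]  -- matches the worked example

print(np.array_equal(A @ B, B @ A))   # False: matrix multiplication is not commutative
```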

🧠 Why It Matters in AI

  • Each dense layer in a neural network is essentially a matrix multiplication plus a bias (see the sketch after this list):

    $$ z = W x + b $$

    where \( W \) is the weight matrix and \( x \) the input vector.
  • Embeddings, convolutions, attention, and recurrent steps all rely on matrix operations.
  • Matrix multiplication is GPU-optimized, making it the heart of deep learning acceleration.
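
A minimal sketch of that affine map as a layer's forward pass (the layer sizes and the ReLU choice are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(4, 3))    # weight matrix: 3 inputs -> 4 outputs (hypothetical sizes)
x = rng.normal(size=3)         # input vector
b = np.zeros(4)                # bias vector

z = W @ x + b                  # pre-activation: z = Wx + b
a = np.maximum(z, 0.0)         # e.g., a ReLU nonlinearity applied element-wise
print(z.shape, a.shape)        # (4,) (4,)
```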

📚 Related Topics

  • Matrix-vector multiplication and linear transforms
  • Outer products, the Hadamard product, norms, and projections
  • How these operations appear inside neural layers (e.g., why linear layers use weight matrices)

๐Ÿ“ Norms and Distances

โ€œDistance is not just how far โ€” itโ€™s how your model feels difference.โ€

In machine learning, norms quantify the magnitude of vectors, and distance metrics measure how similar or dissimilar two data points are. These are fundamental to loss functions, regularization, and clustering.


๐Ÿ“ Vector Norms (Length of a Vector)

A norm is a function \( \|x\| \) that assigns a non-negative length or size to a vector \( x \).

L2 Norm (Euclidean Norm)

$$ \|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2} $$

  • Interpreted as the straight-line distance from the origin to the point.
  • Used in:
    • Gradient descent convergence
    • L2 regularization (Ridge): \( \lambda \|w\|_2^2 \)

L1 Norm (Manhattan Norm)

$$ \|x\|_1 = \sum_{i=1}^n |x_i| $$

  • Interpreted as the distance if moving only along axes.
  • Used in:
    • Lasso regression
    • Promotes sparse weights in models

L∞ Norm (Max Norm)

$$ \|x\|_\infty = \max_i |x_i| $$

  • Measures the maximum absolute value.
  • Useful for:
    • Robust optimization
    • Constraint satisfaction
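
A quick sketch comparing the three norms above on one vector, using `np.linalg.norm` with different `ord` values:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l2   = np.linalg.norm(x)              # sqrt(9 + 16 + 1)  ~= 5.099
l1   = np.linalg.norm(x, ord=1)       # |3| + |-4| + |1|   = 8.0
linf = np.linalg.norm(x, ord=np.inf)  # max(|3|, |4|, |1|) = 4.0

print(l2, l1, linf)
```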

🧭 Distance Metrics

Euclidean Distance (L2)

$$ d(x, y) = \|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2} $$

  • Most common distance metric.
  • Used in:
    • k-NN
    • Clustering (e.g., k-means)
    • Image similarity

Manhattan Distance (L1)

$$ d(x, y) = \|x - y\|_1 = \sum_i |x_i - y_i| $$

  • Better in high-dimensional spaces for sparse data.

Cosine Similarity

$$ \cos(\theta) = \frac{x \cdot y}{\|x\|\|y\|} $$

  • Measures angle between vectors, not magnitude.
  • Used in:
    • NLP word embeddings
    • Recommender systems
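
A small sketch computing the three metrics above on a pair of vectors (plain NumPy; the example values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

euclidean = np.linalg.norm(x - y)      # sqrt(1 + 4 + 1) ~= 2.449
manhattan = np.sum(np.abs(x - y))      # 1 + 2 + 1 = 4
cosine    = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # angle-based similarity

print(euclidean, manhattan, cosine)
```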

🧠 In Practice

| Application | Metric Used |
| --- | --- |
| Regularization | L1 / L2 Norms |
| Optimization | Gradient Norms |
| Model Compression | L1 for pruning |
| Image Similarity | Euclidean or Cosine Distance |
| Attention Mechanisms | Dot product + normalization |

๐Ÿ” Visualization Tip

Plot \( \|x\|_1 \), \( \|x\|_2 \), \( \|x\|_\infty \) in 2D โ€” they form diamond, circle, and square shapes respectively, showing different constraint โ€œballsโ€.


🧠 Eigenvalues & Eigenvectors

"If a transformation stretches space, eigenvectors point where nothing bends."

๐Ÿ“ What is an Eigenvector?

Given a square matrix \( A \), an eigenvector \( \mathbf{v} \) satisfies:

$$ A \mathbf{v} = \lambda \mathbf{v} $$

  • \( \mathbf{v} \): the eigenvector, a direction left unchanged (up to scaling) by the transformation \( A \)
  • \( \lambda \): the eigenvalue, the scalar that scales the eigenvector

This means: applying \( A \) to \( \mathbf{v} \) stretches or compresses it, but does not rotate or bend it.

🔢 How to Compute

Start from: $$ A \mathbf{v} = \lambda \mathbf{v} \Rightarrow (A - \lambda I)\mathbf{v} = 0 $$

To find \( \lambda \), solve the characteristic equation:

$$ \det(A - \lambda I) = 0 $$

Then for each \( \lambda \), solve:

$$ (A - \lambda I)\mathbf{v} = 0 $$

📦 Example

Let:

$$ A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} $$

Solve: $$ \det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = (2 - \lambda)^2 - 1 = 0 \Rightarrow \lambda = 1, 3 $$

Each \( \lambda \) gives an eigenvector: \( \lambda = 3 \) has \( \mathbf{v} \propto [1, 1]^T \), and \( \lambda = 1 \) has \( \mathbf{v} \propto [1, -1]^T \).
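
The same example can be verified numerically with `np.linalg.eig`, which returns the eigenvalues and unit-norm eigenvector columns (a minimal sketch):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)          # [3. 1.] (order may vary)

# Verify A v = lambda v for each eigenpair (columns of `eigenvectors`)
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))   # True, True
```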

🧭 Why It Matters in AI

| Area | Use of Eigen Concepts |
| --- | --- |
| PCA | Principal components = eigenvectors of covariance |
| Spectral Clustering | Use eigenvectors of graph Laplacian |
| Stability Analysis | Eigenvalues of Jacobians tell if training is stable |
| Weight Analysis | Norms/eigenvalues of weight matrices reflect expressivity |
| Dynamical Systems | In RNNs, eigenvalues affect gradient explosion/vanishing |

📊 Geometric Insight

  • Eigenvectors define the invariant axes of transformation.
  • Eigenvalues describe how much data is stretched or squashed along those axes.

🧪 Deep Learning Insight

In PCA, eigenvectors of the covariance matrix give directions of maximum variance:

$$ \Sigma x = \lambda x $$

In recurrent networks:

  • \( |\lambda| > 1 \): gradients may explode
  • \( |\lambda| < 1 \): gradients may vanish
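
A minimal PCA sketch along these lines: eigendecompose the sample covariance of some synthetic data and project onto the leading eigenvectors (`eigh` is used here because the covariance matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic data: 200 samples, 3 features
X = X - X.mean(axis=0)                 # center the data

cov = np.cov(X, rowvar=False)          # 3x3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues, orthonormal eigenvectors

# Principal components = eigenvectors with the largest eigenvalues
top2 = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]
X_projected = X @ top2                 # project onto the top-2 directions of variance
print(X_projected.shape)               # (200, 2)
```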

🔺 Derivatives & Partial Derivatives

"To teach a model, we show it how it's wrong: via derivatives."

๐Ÿ“ What is a Derivative?

A derivative measures how a function changes as its input changes: the rate of change, or slope.

Formal Definition:

$$ f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$

  • \( f'(x) \): how much \( f \) changes with a tiny change in \( x \)
  • Visualized as the slope of the tangent line at point \( x \)

📘 Example:

$$ f(x) = x^2 \Rightarrow f'(x) = 2x $$
At \( x = 3 \), the slope is \( f'(3) = 6 \).
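
A finite-difference sketch that checks this slope numerically (the step size `h` is an arbitrary small value):

```python
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(f, 3.0))   # ~6.0, matching f'(3) = 2 * 3
```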

๐Ÿ“ Partial Derivatives

In ML, functions often depend on many variables, like weights \( w_1, w_2, ..., w_n \). A partial derivative shows how the output changes when only one variable changes.

Notation:

$$ \frac{\partial f}{\partial x_i} $$

📘 Example:

$$ f(x, y) = 3x^2 + 2xy $$
$$ \frac{\partial f}{\partial x} = 6x + 2y, \quad \frac{\partial f}{\partial y} = 2x $$
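
The same finite-difference idea extends to partials by perturbing one variable while holding the others fixed (a small sketch checking the example above at an arbitrary point):

```python
def f(x, y):
    return 3 * x ** 2 + 2 * x * y

def partial_x(f, x, y, h=1e-6):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)   # only x is perturbed

def partial_y(f, x, y, h=1e-6):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)   # only y is perturbed

# At (x, y) = (1, 2): df/dx = 6*1 + 2*2 = 10, df/dy = 2*1 = 2
print(partial_x(f, 1.0, 2.0), partial_y(f, 1.0, 2.0))
```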

🧠 Why It Matters in AI

| Use Case | Derivative Role |
| --- | --- |
| Gradient Descent | Uses derivatives to minimize loss |
| Backpropagation | Chain rule of partials through layers |
| Optimization | Maximize likelihood, minimize error |
| Sensitivity Analysis | How a weight shift affects output |

📦 In Deep Learning

For a loss function \( L(w) \), we compute:

$$ \frac{\partial L}{\partial w_i} \Rightarrow \text{tells how to adjust } w_i $$

These gradients are passed backward through the network via backpropagation, powered by the chain rule.

๐Ÿ” Visualization Tip

  • Plot \( f(x) = x^2 \) and show tangent slopes at different \( x \)
  • Illustrate how descending the slope reduces the loss

🔗 Chain Rule (for Backpropagation)

"The chain rule is how intelligence flows backward."

๐Ÿ“ What Is the Chain Rule?

The chain rule in calculus tells us how to compute the derivative of a composite function, i.e., a function made of functions.

🔢 Scalar Version:

Let: $$ y = f(u), \quad u = g(x) \Rightarrow y = f(g(x)) $$ Then the derivative of \( y \) with respect to \( x \) is: $$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} $$

📘 Example:

Let \( y = \sin(x^2) \):
\( u = x^2 \), \( y = \sin(u) \) ⇒ $$ \frac{dy}{dx} = \cos(x^2) \cdot 2x $$

๐Ÿ“ Vector Chain Rule (Deep Learning Form)

In deep learning, functions are stacked:

$$ L = f(g(h(x))) $$ where:

  • \( h(x) \): Linear transformation
  • \( g \): Activation function
  • \( f \): Loss function

Using the vector chain rule, you compute: $$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} $$

🎯 Backpropagation in Practice

Suppose:

  • \( z = w^T x + b \)
  • \( a = \text{ReLU}(z) \)
  • \( L = \text{MSE}(a, y) \)

Then:

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} $$

This is computed layer by layer and multiplied in reverse order.
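
A hand-rolled sketch of this exact chain for one training example (scalar output; MSE is taken as the squared error \( (a - y)^2 \), an illustrative simplification):

```python
import numpy as np

# Forward pass
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.2])
b = 0.1
y = 1.0

z = w @ x + b                  # linear step: z = w^T x + b
a = max(z, 0.0)                # ReLU
L = (a - y) ** 2               # squared error loss

# Backward pass: chain rule, multiplied in reverse
dL_da = 2 * (a - y)            # dL/da
da_dz = 1.0 if z > 0 else 0.0  # ReLU derivative
dz_dw = x                      # dz/dw = x

dL_dw = dL_da * da_dz * dz_dw
print(dL_dw)                   # gradient used to update w
```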

🔄 Why It Matters

| Concept | Chain Rule Use |
| --- | --- |
| Backpropagation | Enables neural networks to learn |
| Auto-diff (PyTorch, TF) | Automates gradient computation |
| Sensitivity analysis | Traces how outputs change via internals |

🧠 Visualization

Imagine a flowchart:

  • Data flows forward layer-by-layer
  • Gradients flow backward using local derivatives

🧮 Gradients, Jacobians, Hessians

"Where the derivative points in one direction, gradients tell us where to go in many."

๐Ÿ“ 1. Gradient: Direction of Steepest Ascent/Descent

The gradient of a scalar function \( f:\mathbb{R}^n \rightarrow \mathbb{R} \) is a vector of partial derivatives:

$$ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$

  • Points in the direction of maximum increase
  • Used in gradient descent to move in the opposite direction

📘 Example:

If \( f(x, y) = x^2 + y^2 \), then: $$ \nabla f(x, y) = \begin{bmatrix} 2x \\ 2y \end{bmatrix} $$
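
A small gradient-descent sketch on this same function, stepping against the analytic gradient (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def grad_f(p):
    # gradient of f(x, y) = x^2 + y^2 is [2x, 2y]
    return 2 * p

p = np.array([3.0, -2.0])     # starting point
lr = 0.1                      # learning rate (hypothetical)

for _ in range(50):
    p = p - lr * grad_f(p)    # step opposite the gradient (steepest descent)

print(p)                      # close to [0, 0], the minimum
```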

๐Ÿ“ 2. Jacobian Matrix: Vector-Valued Derivatives

For a vector-valued function \( \mathbf{f}:\mathbb{R}^n \rightarrow \mathbb{R}^m \), the Jacobian is an \( m \times n \) matrix:

$$ J_{ij} = \frac{\partial f_i}{\partial x_j} $$

📘 Example:

Let: $$ \mathbf{f}(x, y) = \begin{bmatrix} x^2 + y \\ \sin(xy) \end{bmatrix} $$ Then the Jacobian is: $$ J = \begin{bmatrix} 2x & 1 \\ y \cos(xy) & x \cos(xy) \end{bmatrix} $$
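
A finite-difference Jacobian for the example above (a sketch of what autodiff frameworks compute for you):

```python
import numpy as np

def f(p):
    x, y = p
    return np.array([x ** 2 + y, np.sin(x * y)])

def numerical_jacobian(f, p, h=1e-6):
    p = np.asarray(p, dtype=float)
    J = np.zeros((f(p).size, p.size))
    for j in range(p.size):
        dp = np.zeros_like(p)
        dp[j] = h
        J[:, j] = (f(p + dp) - f(p - dp)) / (2 * h)   # column j = df/dx_j
    return J

print(numerical_jacobian(f, [1.0, 2.0]))
# approx [[2*1, 1], [2*cos(2), 1*cos(2)]], matching the analytic Jacobian
```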

🧠 Why the Jacobian Matters

| Use Case | Role of Jacobian |
| --- | --- |
| Neural nets (forward/backprop) | Maps input deltas to output deltas |
| Auto-differentiation | Tracks vector transformations |
| Generative models (e.g. flows) | Controls volume distortion in density |

๐Ÿ“ 3. Hessian Matrix: Second-Order Derivatives

The Hessian is a square matrix of second-order partial derivatives of a scalar function: $$ H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$

  • Describes local curvature of the function
  • Used in second-order optimization like Newtonโ€™s Method

📘 Example:

For \( f(x, y) = x^2 + xy + y^2 \), the Hessian is: $$ H = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} $$
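
Because this function is quadratic, a single Newton step using the Hessian lands exactly on the minimum (a sketch; real objectives typically need damping or line search):

```python
import numpy as np

H = np.array([[2.0, 1.0], [1.0, 2.0]])     # Hessian of f(x, y) = x^2 + xy + y^2

def grad(p):
    x, y = p
    return np.array([2 * x + y, x + 2 * y])

p = np.array([4.0, -1.0])                  # arbitrary starting point
p_new = p - np.linalg.solve(H, grad(p))    # Newton step: p - H^{-1} grad f(p)
print(p_new)                               # [0. 0.], the exact minimizer
```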

🧭 Applications in AI

| Tool | Mathematical Object |
| --- | --- |
| Gradient Descent | Gradient \( \nabla f \) |
| Layer-wise backpropagation | Jacobian |
| Curvature-based optimization | Hessian |
| Saddle point detection | Eigenvalues of Hessian |
| Neural Tangent Kernel | Hessian-like constructs |

🧪 Visualization Tips

  • Gradient → vector pointing "downhill" in scalar fields
  • Jacobian → local linear approximation of vector mappings
  • Hessian → shape of surface: convex (bowl) or saddle

🎲 Bayes' Rule

"What's the probability of a cause, given an effect?"

๐Ÿ“ The Rule Itself

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

  • Posterior \( P(A|B) \): probability of A given B (what we want to find)
  • Likelihood \( P(B|A) \): how likely is B if A is true
  • Prior \( P(A) \): belief in A before seeing B
  • Evidence \( P(B) \): total probability of B across all causes

📘 Intuitive Example

Let A = "patient has disease", B = "test is positive":

$$ P(\text{disease}|\text{positive}) = \frac{P(\text{positive}|\text{disease}) \cdot P(\text{disease})}{P(\text{positive})} $$
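
Plugging in hypothetical numbers (1% prevalence, 95% sensitivity, 5% false-positive rate, all assumed purely for illustration):

```python
p_disease = 0.01                       # prior P(disease)            -- assumed
p_pos_given_disease = 0.95             # sensitivity P(+ | disease)  -- assumed
p_pos_given_healthy = 0.05             # false-positive rate         -- assumed

# Evidence: total probability of a positive test over both causes
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: posterior P(disease | positive)
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)                       # ~0.16: a positive test is far from certain
```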

🧠 In Machine Learning

| Use Case | Role of Bayes' Rule |
| --- | --- |
| Naive Bayes Classifier | Estimates \( P(\text{class} \mid \text{features}) \) |
| Bayesian Inference | Updates posterior beliefs over parameters |
| Generative Models | Infers latent variables given observed data |
| Uncertainty Quantification | Models belief over predictions |

📦 Naive Bayes Model Equation

Assuming conditional independence: $$ P(y|x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i|y) $$ where \( P(y) \) is the class prior and \( P(x_i|y) \) is the likelihood of each feature.

🧮 Example: Naive Bayes in NLP

For a spam classifier: $$ P(\text{spam}|\text{words}) \propto P(\text{spam}) \cdot P(w_1|\text{spam}) \cdot P(w_2|\text{spam}) \cdots $$

🔎 Numerically Stable Form (Log Space)

To prevent underflow in computation: $$ \log P(A|B) = \log P(B|A) + \log P(A) - \log P(B) $$
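
A tiny log-space scoring sketch for the spam example; the priors and word likelihoods are made-up numbers, and only the comparison between classes matters:

```python
import math

# Hypothetical priors and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.05, "meeting": 0.001},
    "ham":  {"free": 0.002, "meeting": 0.03},
}

words = ["free", "free", "meeting"]

# Sum of logs instead of a product of probabilities -> no underflow
scores = {
    c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
    for c in priors
}
print(max(scores, key=scores.get), scores)   # class with the highest log-score
```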

📊 Visualization Tip

  • Use a dynamic diagram showing how the posterior shifts based on likelihood strength
  • Prior belief gets "reshaped" by incoming evidence

📈 Expectation & Variance

"Expectation captures what's typical; variance captures how much it can surprise you."

๐Ÿ“ 1. Expectation (Mean)

The expectation of a random variable is its average value if you repeat the process infinitely.

Discrete Case:

$$ \mathbb{E}[X] = \sum_{i} x_i \cdot P(x_i) $$

Continuous Case:

$$ \mathbb{E}[X] = \int x \cdot p(x) \, dx $$

🧠 Intuition:

  • Think of it as the center of mass of the probability distribution.
  • In supervised learning, we often assume:

$$ \hat{y} = \mathbb{E}[Y \mid X] $$ This means the model predicts the expected label given an input.

📘 Example:

If a coin has outcomes \(\{0, 1\}\) with \(P(1) = 0.7\), then:

$$ \mathbb{E}[X] = 0 \cdot 0.3 + 1 \cdot 0.7 = 0.7 $$

๐Ÿ“ 2. Variance

The variance measures how far the values of a random variable deviate from the mean.

$$ \mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] $$
or equivalently: $$ \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$

  • Low variance → values are tightly clustered
  • High variance → values are spread out

📘 Example:

For \(X \in \{1, 3\}\) with equal probability:

  • \(\mathbb{E}[X] = 2\)
  • \(\mathrm{Var}(X) = 0.5(1^2 + 3^2) - 2^2 = 5 - 4 = 1\)
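
The same numbers can be checked by simulation; the sample mean and variance approach the exact values as the sample grows (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.choice([1, 3], size=100_000)   # X in {1, 3} with equal probability

print(samples.mean())   # ~2.0, the expectation
print(samples.var())    # ~1.0, the variance
```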

🧠 In Machine Learning

| Concept | How Expectation/Variance Is Used |
| --- | --- |
| Loss Functions | Minimize expected error over data: \( \mathbb{E}[\text{Loss}] \) |
| Bayesian Inference | Posterior = weighted average (expectation) |
| Generalization Analysis | Bias-Variance decomposition |
| Variational Autoencoders | Variance in latent variables |
| Regularization | Implicitly controls variance |

📊 Bias-Variance Tradeoff

$$ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$

  • Bias: error from wrong assumptions
  • Variance: error from model sensitivity to training data

๐Ÿ” Visualization Tip

  • Plot a distribution and overlay its mean and ยฑ1 standard deviation (โˆšvariance) range.
  • Show how flatter/wider distributions have higher variance.

🎲 Probability Distributions (Bernoulli & Gaussian)

"All uncertainty in ML flows through a distribution."

๐Ÿ“ 1. Bernoulli Distribution (Binary Outcomes)

The Bernoulli distribution models a single binary outcome: success (1) or failure (0).

PMF (Probability Mass Function):

$$ P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0,1\} $$

  • p: probability of success (i.e., \( X = 1 \))
  • Mean: \( \mathbb{E}[X] = p \)
  • Variance: \( \mathrm{Var}(X) = p(1 - p) \)

📘 Example:

In binary classification, a model might output: $$ \hat{y} = \text{sigmoid}(z) = p $$ Then the label is sampled as: $$ y \sim \text{Bernoulli}(p) $$
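
A small sketch of that sampling step (the logit value is arbitrary; Bernoulli draws are taken as Binomial draws with \( n = 1 \)):

```python
import numpy as np

rng = np.random.default_rng(0)

z = 0.8                                   # model logit (hypothetical value)
p = 1.0 / (1.0 + np.exp(-z))              # sigmoid output -> success probability p

y = rng.binomial(n=1, p=p, size=10_000)   # Bernoulli(p) draws (Binomial with n = 1)
print(y.mean(), p)                        # sample mean ~ p, the expectation
print(y.var(), p * (1 - p))               # sample variance ~ p(1 - p)
```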

🧠 Applications in ML:

| Task | Why Bernoulli? |
| --- | --- |
| Binary classification | Models 0/1 output probabilities |
| Logistic regression | Predicts \( p \), then samples label |
| Bernoulli Naive Bayes | For binary feature datasets |

๐Ÿ“ 2. Gaussian Distribution (Normal Distribution)

The Gaussian models continuous data with bell-shaped symmetry.

PDF (Probability Density Function):

$$ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$

  • Mean: \( \mathbb{E}[X] = \mu \)
  • Variance: \( \mathrm{Var}(X) = \sigma^2 \)
  • Continuous support: \( x \in \mathbb{R} \)

📘 Multivariate Gaussian:

$$ p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right) $$

  • \( \mu \): mean vector
  • \( \Sigma \): covariance matrix
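
A univariate sketch comparing samples against the density formula (the values of \( \mu \) and \( \sigma \) are arbitrary):

```python
import numpy as np

mu, sigma = 1.5, 2.0                       # assumed parameters
rng = np.random.default_rng(0)
samples = rng.normal(loc=mu, scale=sigma, size=100_000)

print(samples.mean(), samples.std())       # ~1.5 and ~2.0

def gaussian_pdf(x, mu, sigma):
    # the PDF from the formula above
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(gaussian_pdf(mu, mu, sigma))         # density at the mean: 1 / sqrt(2*pi*sigma^2)
```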

🧠 Applications in ML:

| Use Case | Role of Gaussian |
| --- | --- |
| Regression error modeling | Assume residuals are Normal |
| Variational Inference | Latent variable priors (VAEs) |
| Gaussian Mixture Models | Cluster soft assignments |
| Kalman Filters | State uncertainty |

๐Ÿ” Visualization Ideas

  • Bernoulli: two bars at 0 and 1, height = \( p \) and \( 1 - p \)
  • Gaussian: classic bell curve; adjust \( \mu, \sigma \) to show shape variation
