$$ y = f\left(\sum_{i=1}^{n}w_ix_i + b\right) $$
Linear Regression
- \(y\): Predicted value or output of the model.
- \(f\): Activation function (e.g., the identity function for regression).
- \(w_i\): Weights assigned to each input feature.
- \(x_i\): Input features or variables.
- \(b\): Bias term, allowing flexibility in the prediction.
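As a concrete illustration, here is a minimal NumPy sketch of this formula, with the activation defaulting to the identity so it reduces to a plain linear regression prediction; the weights, inputs, and bias are made-up example values.

```python
import numpy as np

def predict(x, w, b, f=lambda z: z):
    """Compute y = f(sum_i w_i * x_i + b); f defaults to the identity."""
    return f(np.dot(w, x) + b)

# Made-up example values: three input features.
x = np.array([1.0, 2.0, 3.0])    # input features x_i
w = np.array([0.4, -0.1, 0.25])  # weights w_i
b = 0.5                          # bias term

print(predict(x, w, b))  # identity activation -> linear regression output
```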
$$ P(y \mid x) = \frac{e^{w_y^T x}}{\sum_{k=1}^{K}e^{w_k^T x}} $$
Softmax Function
- \(P(y \mid x)\): Probability of class \(y\) given input \(x\).
- \(e\): Exponential function, ensuring positive outputs.
- \(w_y^T x\): Dot product of the weights for class \(y\) and the inputs.
- Denominator: Sum of exponentials over all \(K\) classes, ensuring the probabilities sum to 1.
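A minimal NumPy sketch of this computation follows; the per-class weight matrix `W` and input `x` are made-up, and the maximum logit is subtracted before exponentiating for numerical stability (a standard trick not shown in the formula).

```python
import numpy as np

def softmax_probs(W, x):
    """P(y=k | x) = exp(w_k.x) / sum_j exp(w_j.x), computed stably."""
    logits = W @ x                  # one dot product w_k^T x per class
    logits -= logits.max()          # shift logits for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Made-up example: 3 classes, 2 input features.
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
x = np.array([0.6, 1.2])

p = softmax_probs(W, x)
print(p, p.sum())  # probabilities over the 3 classes; they sum to 1
```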
$$ L = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i\log(p_i) + (1-y_i)\log(1-p_i) \right] $$
Binary Cross-Entropy Loss
- \(L\): Loss value (lower is better).
- \(N\): Total number of samples.
- \(y_i\): True label (0 or 1).
- \(p_i\): Predicted probability for the positive class.
- \(\log\): Natural logarithm, which imposes steep penalties on confident but incorrect predictions.
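The loss can be computed directly from the formula; the sketch below uses NumPy with made-up labels and predicted probabilities, clipping \(p_i\) to avoid taking \(\log(0)\).

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """L = -(1/N) * sum[ y*log(p) + (1-y)*log(1-p) ]."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Made-up example: 4 samples with true labels and predicted probabilities.
y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.95])

print(binary_cross_entropy(y_true, p_pred))
```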
$$ \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t $$
Adam Optimizer Algorithm
- \(\theta_t\): Updated parameter values at time step \(t\).
- \(\theta_{t-1}\): Previous parameter values.
- \(\eta\): Learning rate, which controls the step size for updates.
- \(\hat{m}_t\): Bias-corrected first moment estimate (mean of gradients).
- \(\hat{v}_t\): Bias-corrected second moment estimate (uncentered variance of the gradients).
- \(\epsilon\): A small constant to prevent division by zero (e.g., \(10^{-8}\)).
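To tie the symbols together, here is a minimal sketch of a single Adam parameter update in NumPy, including the moment accumulators and bias correction; the hyperparameter defaults follow common practice, and the toy gradient function is an assumption for illustration only.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t: returns (theta_t, m_t, v_t)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

# Made-up example: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # moves toward the minimum at 0
```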