Logistic Regression
Some JavaScript to enable auto-numbering of mathematical equations:
%%javascript
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
MathJax.Hub.Queue(
["resetEquationNumbers", MathJax.InputJax.TeX],
["PreProcess", MathJax.Hub],
["Reprocess", MathJax.Hub]
);
Here, we will work out the math for a binary logistic classifier.
Logistic Regression is similar to Linear Regression, but instead of a real-valued output $y$, the output will be either 0 or 1, since we need to classify each example into one of two categories.
In the linear regression post, we defined our hypothesis function as:
$$ \begin{equation} h_\theta(x) = \theta_0 + \theta_1x \end{equation} $$
Now, we can also have multiple input features, i.e. $x_1, x_2, x_3, \ldots$ and so on, in which case our hypothesis function becomes:
$$ \begin{equation} h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \ldots \end{equation} $$
We have added $x_0=1$ alongside $\theta_0$ for simplification. Now the hypothesis function can be expressed as a combination of just two vectors: $X=[x_0, x_1, x_2, x_3, ...]$ and $\theta = [\theta_0, \theta_1, \theta_2, ...]$
$$ \begin{equation} h_\theta(x) = \theta^TX \end{equation} $$
Still, the output of this function will be a real value, so we'll apply an activation function to squash the output into the range $(0, 1)$, which we can interpret as the probability of class 1. We'll use the sigmoid function $g(z)$ for this purpose. TODO: Explore other activation functions
\begin{equation} g(z) = \frac{1}{1+e^{-z}} \end{equation}
\begin{equation} h(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}} \end{equation}
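To make this concrete, here is a minimal NumPy sketch of the sigmoid and the hypothesis function (the names `sigmoid`, `hypothesis`, and the toy `X`/`theta` below are just illustrative, and `X` is assumed to already contain the bias column $x_0 = 1$):
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h(X) = g(theta^T x), computed for every row of X at once.
    # X has shape (m, n+1), with the bias column x_0 = 1 already included.
    return sigmoid(X @ theta)

# Toy data: 3 examples, bias column plus 2 features.
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7]])
theta = np.zeros(3)
print(hypothesis(theta, X))  # all 0.5 when theta is zero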
The most commonly used loss function for logistic regression is log-loss (or cross-entropy). TODO: Why log-loss? Explore other loss functions.
So, the loss function $l(\theta)$ for $m$ training examples is:
$$ \begin{equation} l(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)})) \right] \end{equation} $$
which can also be represented as:
$$ \begin{equation} l(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)})) \right] \end{equation} $$
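Continuing the sketch from the cell above (so it reuses `sigmoid`, `hypothesis`, `X`, and `theta`), the averaged log-loss of equation (6) can be written as:
def log_loss(theta, X, y):
    # l(theta) = -(1/m) * sum over examples of
    #            y*log(h(x)) + (1-y)*log(1-h(x))
    h = hypothesis(theta, X)
    m = len(y)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

y = np.array([1.0, 0.0, 1.0])
print(log_loss(theta, X, y))  # -log(0.5) ≈ 0.693 when theta is zero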
Now, similar to linear regression, we need to find out the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.
$$ \begin{equation} \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} l(\theta) \end{equation} $$
where $\alpha$ is the learning rate.
From (8), we can see that we need to find $\frac{\partial}{\partial \theta_j} l(\theta)$ to derive the gradient descent rule. Let's start by working with just one training example.
Dropping the subscript $j$ and working with the full vector $\theta$, $\frac{\partial}{\partial \theta} l(\theta)$ can be broken down using the chain rule as follows:
$$ \begin{equation} \frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial h(x)}l(\theta).\frac{\partial}{\partial \theta}h(x) \end{equation} $$
$$ \begin{equation} \frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial g(\theta^T x)}l(\theta).\frac{\partial}{\partial \theta}g(\theta^Tx) \end{equation} $$
Calculating $\frac{\partial}{\partial \theta}g(\theta^Tx)$ first:
$$ \frac{\partial}{\partial \theta}g(\theta^Tx) = \frac{\partial}{\partial \theta} \left(\frac{1}{1+e^{-\theta^T x}}\right) $$
$$ = \frac{\partial}{\partial \theta}({1+e^{-\theta^T x}})^{-1} $$
Using the chain rule of derivatives,
$$ =-({1+e^{-\theta^T x}})^{-2}.(e^{-\theta^T x}).(-x) $$
$$ =\frac{e^{-\theta^T x}}{(1+e^{-\theta^T x})^2}.(x) $$
$$ =\frac{1+e^{-\theta^T x}-1}{(1+e^{-\theta^T x})^2}.(x) $$
$$ =\left(\frac{1+e^{-\theta^T x}}{(1+e^{-\theta^T x})^2}-\frac{1}{(1+e^{-\theta^T x})^2}\right).(x) $$
$$ =\left(\frac{1}{(1+e^{-\theta^T x})}-\frac{1}{(1+e^{-\theta^T x})^2}\right).(x) $$
$$ =(g(\theta^T x)-g(\theta^T x)^2).(x) $$
$$ \begin{equation} \frac{\partial}{\partial \theta}g(\theta^Tx) = g(\theta^T x)(1-g(\theta^T x)).x \end{equation} $$
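A quick numerical check of this identity, using a central finite difference on the `sigmoid` from the earlier cell (the test point and step size below are arbitrary):
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # should agree to roughly 6 decimal places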
Now, calculating $\frac{\partial}{\partial g(\theta^T x)}l(\theta)$,
$$ \frac{\partial}{\partial g(\theta^T x)}l(\theta) = \frac{\partial}{\partial g(\theta^T x)}\left(-\left(y\log(g(\theta^T x)) + (1-y)\log(1-g(\theta^T x))\right)\right) $$
Again, using the chain rule,
$$ = -\left(\frac{y}{g(\theta^T x)} + \frac{1-y}{1-g(\theta^T x)}.(-1)\right) $$
$$ = -\left(\frac{y-y.g(\theta^T x)-g(\theta^T x)+y.g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right) $$
$$ = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right) $$
$$ \begin{equation} \frac{\partial}{\partial g(\theta^T x)}l(\theta) = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right) \end{equation} $$
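This factor can also be checked numerically by treating the per-example loss as a function of $g$ alone (the helper `per_example_loss` and the test values below are just for illustration):
def per_example_loss(g, y):
    # loss as a function of the sigmoid output g for a single example
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

g, y_i, eps = 0.3, 1.0, 1e-6
numeric = (per_example_loss(g + eps, y_i) - per_example_loss(g - eps, y_i)) / (2 * eps)
analytic = -(y_i - g) / (g * (1 - g))
print(numeric, analytic)  # both ≈ -3.333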
Finally, combining (10), (11), and (12), we get
$$ \frac{\partial}{\partial \theta} l(\theta) = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right).g(\theta^T x)(1-g(\theta^T x)).x $$
$$ \frac{\partial}{\partial \theta} l(\theta) = -(y-g(\theta^T x)).x $$
$$ \begin{equation} \frac{\partial}{\partial \theta} l(\theta) = -(y-h(x)).x \end{equation} $$
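In code, the gradient over all $m$ examples (vectorized over the rows of `X`, and divided by $m$ to match the $\frac{1}{m}$ factor in (6)) might look like this sketch, again reusing the earlier cells:
def gradient(theta, X, y):
    # Averages -(y - h(x)) * x over the m rows of X.
    return -(X.T @ (y - hypothesis(theta, X))) / len(y)

print(gradient(theta, X, y))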
Plugging this back into (8), the update rule for a single training example becomes:
$$ \begin{equation} \theta_j = \theta_j + \alpha(y-h(x)).x_j \end{equation} $$
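Putting (8) and (14) together gives a simple batch gradient descent loop. This is only a sketch with an arbitrary learning rate and iteration count, reusing the toy data from the cells above:
def fit(X, y, alpha=0.1, iterations=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        # theta := theta + alpha * average of (y - h(x)) * x
        theta = theta + alpha * (X.T @ (y - hypothesis(theta, X))) / len(y)
    return theta

theta_hat = fit(X, y)
print(hypothesis(theta_hat, X))  # predictions move toward y = [1, 0, 1]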