## Tuesday, 4 September 2018

### Math of Intelligence : Logistic Regression

# Logistic Regression

Some javascript to enable auto numbering of mathematical equations. Reference

%%javascript
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});

MathJax.Hub.Queue(
["resetEquationNumbers", MathJax.InputJax.TeX],
["PreProcess", MathJax.Hub],
["Reprocess", MathJax.Hub]
);


Here, we will be figuring out the math for a binary logistic classifier.

Logistic regression is similar to linear regression, but instead of a real-valued output, $y$ is either 0 or 1, since we need to classify each input into one of two categories.

In the linear regression post, we defined our hypothesis function as:

$$h_\theta(x) = \theta_0 + \theta_1x \tag{1}$$

We can also have multiple input features, i.e. $x_1, x_2, x_3, \ldots$ and so on. In that case our hypothesis function becomes:

$$h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \ldots \tag{2}$$

We have added $x_0=1$ with $\theta_0$ for simplification. Now, the hypothesis function can be expressed as a combination of just 2 vectors: $X=[x_0, x_1, x_2, x_3, ...]$ and $\theta = [\theta_0, \theta_1, \theta_2, ...]$

$$h_\theta(x) = \theta^TX \tag{3}$$
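As a rough illustration of the vector form above, here is a minimal sketch in plain Python. The function name `linear_hypothesis` and the sample numbers are made up for this example; the only thing taken from the post is the convention of prepending $x_0 = 1$ so that $\theta_0$ acts as the intercept.

```python
def linear_hypothesis(theta, features):
    """Return theta^T x, prepending the bias feature x_0 = 1.

    `theta` and `features` are plain lists of floats (hypothetical names,
    chosen for this sketch only).
    """
    x = [1.0] + list(features)  # x_0 = 1 pairs with theta_0
    if len(theta) != len(x):
        raise ValueError("theta must have exactly one more entry than features")
    return sum(t * xi for t, xi in zip(theta, x))


# Illustrative values: 0.5*1 + 2.0*3.0 + (-1.0)*4.0 = 2.5
print(linear_hypothesis([0.5, 2.0, -1.0], [3.0, 4.0]))
```

The result is still an unbounded real number, which is why an extra squashing step is needed before it can be read as a class.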

This function still outputs a real value, so we'll apply an activation function to squash it into the range $(0, 1)$. The result can be read as the probability that $y = 1$, and thresholding it (for example at 0.5) gives a label of 0 or 1. We'll use the sigmoid function $g(z)$ for this purpose. TODO: Explore other activation functions

$$g(z) = \frac{1}{1+e^{-z}} \tag{4}$$

$$h(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}} \tag{5}$$
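A small sketch of the sigmoid and the resulting hypothesis may help here. The names `sigmoid`, `predict_proba`, and `predict_label`, as well as the 0.5 threshold, are assumptions made for this example rather than anything fixed by the math above. The two-branch form of `sigmoid` is an implementation choice to avoid overflow in `math.exp` for large negative inputs; it computes the same $g(z)$.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) lies in (0, 1)
    return ez / (1.0 + ez)


def predict_proba(theta, x):
    """h(x) = g(theta^T x); x is assumed to already include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


def predict_label(theta, x, threshold=0.5):
    """Turn the probability into a 0/1 label (the threshold is an assumption)."""
    return 1 if predict_proba(theta, x) >= threshold else 0


print(sigmoid(0.0))  # 0.5, the midpoint of the curve
```

Note that the sigmoid itself never returns a hard 0 or 1 label; the thresholding step is what produces the class.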

The most commonly used loss function for logistic regression is log-loss (also called cross-entropy). TODO: Why log-loss? Explore other loss functions.

So, the loss function $l(\theta)$ for $m$ training examples is:

$$l(\theta) = -\frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)})) \right) \tag{6}$$

Substituting $h(x) = g(\theta^T x)$, this can also be written as:

$$l(\theta) = -\frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log(g(\theta^T x^{(i)})) + (1-y^{(i)})\log(1-g(\theta^T x^{(i)})) \right) \tag{7}$$
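To make the averaged log-loss concrete, here is a minimal sketch. The function name `log_loss` and the clipping constant `eps` are assumptions of this example: clipping is a common guard against `log(0)` and is not part of the formula itself. Each row of `X` is assumed to already contain $x_0 = 1$.

```python
import math


def sigmoid(z):
    """Logistic function, arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(theta, X, y, eps=1e-12):
    """Mean cross-entropy over m examples.

    `eps` clips predictions away from exactly 0 and 1 so the logarithm
    stays finite (an implementation guard, assumed for this sketch).
    """
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
        h = min(max(h, eps), 1.0 - eps)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m


# With theta = 0 every prediction is 0.5, so the loss equals log(2).
print(log_loss([0.0, 0.0], [[1.0, 2.0], [1.0, -1.0]], [1, 0]))
```

A confident wrong prediction makes one of the logarithms very negative, which is what makes this loss penalize such mistakes heavily.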

Now, similar to linear regression, we need to find out the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} l(\theta) \tag{8}$$

where $\alpha$ is the learning rate.

From (8), we need $\frac{\partial}{\partial \theta_j} l(\theta)$ to derive the gradient descent rule. Let's start by working with just one training example, so the sum and the $\frac{1}{m}$ factor drop out.

$\frac{\partial}{\partial \theta_j} l(\theta)$ can be broken down as follows:

$$\frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial h(x)}l(\theta).\frac{\partial}{\partial \theta}h(x) \tag{9}$$

$$\frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial g(\theta^T x)}l(\theta).\frac{\partial}{\partial \theta}g(\theta^Tx) \tag{10}$$

Calculating $\frac{\partial}{\partial \theta}g(\theta^Tx)$ first:

$$\frac{\partial}{\partial \theta}g(\theta^Tx) = \frac{\partial}{\partial \theta} \left(\frac{1}{1+e^{-\theta^T x}}\right)$$

$$= \frac{\partial}{\partial \theta}({1+e^{-\theta^T x}})^{-1}$$

Using the chain rule of derivatives,

$$=-({1+e^{-\theta^T x}})^{-2}.(e^{-\theta^T x}).(-x)$$

$$=\frac{e^{-\theta^T x}}{(1+e^{-\theta^T x})^2}.(x)$$

$$=\frac{1+e^{-\theta^T x}-1}{(1+e^{-\theta^T x})^2}.(x)$$

$$=\left(\frac{1+e^{-\theta^T x}}{(1+e^{-\theta^T x})^2}-\frac{1}{(1+e^{-\theta^T x})^2}\right).(x)$$

$$=\left(\frac{1}{(1+e^{-\theta^T x})}-\frac{1}{(1+e^{-\theta^T x})^2}\right).(x)$$

$$=(g(\theta^T x)-g(\theta^T x)^2).(x)$$

$$\frac{\partial}{\partial \theta}g(\theta^Tx) = g(\theta^T x)(1-g(\theta^T x)).x \tag{11}$$
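One way to sanity-check this result is to compare the closed form $g(z)(1-g(z))$ against a numerical derivative. The sketch below does that for the scalar case; the helper names and the step size `h` are assumptions of this example, and the central difference is only an approximation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed form from the derivation: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)


def central_difference(f, z, h=1e-6):
    """Numerical derivative (f(z+h) - f(z-h)) / (2h); h is an assumed step."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 1.5):
    # The two columns should agree to several decimal places.
    print(z, sigmoid_derivative(z), central_difference(sigmoid, z))
```

The extra factor of $x$ in the equation above comes from the chain rule applied to $z = \theta^T x$.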

Now, calculating $\frac{\partial}{\partial g(\theta^T x)}l(\theta)$,

$$\frac{\partial}{\partial g(\theta^T x)}l(\theta) = \frac{\partial}{\partial g(\theta^T x)}\left(-\left(y.\log(g(\theta^T x)) + (1-y)\log(1-g(\theta^T x))\right)\right)$$

Again, using the chain rule,

$$= -\left(\frac{y}{g(\theta^T x)} + \frac{1-y}{1-g(\theta^T x)}.(-1)\right)$$

$$= -\left(\frac{y-y.g(\theta^T x)-g(\theta^T x)+y.g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right)$$

$$= -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right)$$

$$\frac{\partial}{\partial g(\theta^T x)}l(\theta) = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right) \tag{12}$$

Finally, combining (10), (11), and (12), we get

$$\frac{\partial}{\partial \theta} l(\theta) = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x))}\right).g(\theta^T x)(1-g(\theta^T x)).x$$

$$\frac{\partial}{\partial \theta} l(\theta) = -(y-g(\theta^T x)).x$$

$$\frac{\partial}{\partial \theta} l(\theta) = -(y-h(x)).x \tag{13}$$
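The single-example gradient above can be checked numerically. The sketch below compares it with a finite-difference estimate of the loss; all function names, the step size, and the sample values of `theta`, `x`, and `y` are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def example_loss(theta, x, y):
    """Log-loss for one training example; x is assumed to include x_0 = 1."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))


def analytic_gradient(theta, x, y):
    """Gradient from the derivation: -(y - h(x)) * x, one entry per theta_j."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return [-(y - h) * xj for xj in x]


def numeric_gradient(theta, x, y, step=1e-6):
    """Central-difference estimate of each partial derivative (assumed step)."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += step
        down[j] -= step
        grad.append((example_loss(up, x, y) - example_loss(down, x, y)) / (2.0 * step))
    return grad


# Illustrative values only.
theta, x, y = [0.1, -0.3, 0.7], [1.0, 2.0, -1.5], 1
print(analytic_gradient(theta, x, y))
print(numeric_gradient(theta, x, y))
```

If the two printed lists disagree beyond a few decimal places, that usually points to a sign or indexing slip in the derivation or the code.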

Plugging this back into (8), the update for each component $\theta_j$ uses the matching feature $x_j$:

$$\theta_j = \theta_j + \alpha(y-h(x)).x_j \tag{14}$$
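Putting the pieces together, here is a minimal sketch of training with the update rule above, applied one example at a time. The function name `train`, the learning rate, the epoch count, and the tiny dataset are all assumptions chosen for illustration; they are not tuned values, and the data is invented purely to show the loop running.

```python
import math


def sigmoid(z):
    """Logistic function, arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(X, y, alpha=0.1, epochs=1000):
    """Apply theta_j += alpha * (y - h(x)) * x_j for every example, every epoch.

    Each row of X is assumed to already include x_0 = 1.
    """
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
            error = y_i - h
            # All components are updated together from the same prediction.
            theta = [t + alpha * error * xj for t, xj in zip(theta, x_i)]
    return theta


# Made-up, linearly separable toy data: [x_0, x_1] with x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = train(X, y)
predictions = [
    1 if sigmoid(sum(t * xj for t, xj in zip(theta, x_i))) >= 0.5 else 0
    for x_i in X
]
print(theta, predictions)
```

Updating after every example is one possible reading of the single-example rule; averaging the gradient over all $m$ examples before each step would match the $\frac{1}{m}$ form of the loss more directly.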
