## Tuesday, 4 September 2018

### Math of Intelligence : Logistic Regression

Logistic Regression

# Logistic Regression¶

Some javascript to enable auto numbering of mathematical equations. Reference

In :
%%javascript
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});

MathJax.Hub.Queue(
["resetEquationNumbers", MathJax.InputJax.TeX],
["PreProcess", MathJax.Hub],
["Reprocess", MathJax.Hub]
);


Here, we will be figuring out the math for a binary logistic classifier.

Logistic Regression is similar to Linear Regression but instead of a real valued output $y$, it will be either 0 or 1 since we need to classify into one of 2 categories.

In the linear regression post, we have defined our hypothesis function as:

$$\begin{equation} h_\theta(x) = \theta_0 + \theta_1x \end{equation}$$

Now, we can also have multiple input features i.e $x_1, x_2, x_3...$ and so on, so in that case our hypothesis function becomes:

$$\begin{equation} h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \theta_1x_3 .... \end{equation}$$

We have added $x_0=1$ with $\theta_0$ for simplification. Now, the hypothesis function can be expressed as a combination of just 2 vectors: $X=[x_0, x_1, x_2, x_3, ...]$ and $\theta = [\theta_0, \theta_1, \theta_2, ...]$

$$\begin{equation} h_\theta(x) = \theta^TX \end{equation}$$

Still, the output of this function will be a real value, so we'll apply an activation function to convert the output to 0 or 1. We'll use the sigmoid function $g(z)$ for this purpose. TODO: Explore other activation functions

\begin{equation} g(z) = \frac{1}{1+e^{-z}} \end{equation}

\begin{equation} h(X) = g(\theta^TX) = \frac{1}{1+e^{-\theta^TX}} \end{equation}

The most commonly used loss function for logistic regression is log-loss (or cross-entropy) TODO: Why log-loss? Explore other loss functions.

So, the loss function $l(\theta)$ for $m$ training examples is:

$$\begin{equation} l(\theta) = -\frac{1}{m}(\sum_{i=1}^m y^{(i)}log(h(x^{(i)}) + (1-y^{(i)})log(1-h(x^{(i)})) \end{equation}$$

which can also be represented as:

$$\begin{equation} l(\theta) = -(\sum_{i=1}^m y^{(i)}log(g(\theta^T x^{(i)})) + (1-y^{(i)})log(1-g(\theta^T x^{(i)})) \end{equation}$$

Now, similar to linear regression, we need to find out the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.

$$\begin{equation} \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} l(\theta) \end{equation}$$

where $\alpha$ is the learning rate.

From (8), we get that we need to find out $\frac{\partial}{\partial \theta_j} l(\theta)$ to derive the gradient descent rule. Lets start by working with just one training example.

$\frac{\partial}{\partial \theta_j} l(\theta)$ can be broken down as follows:

$$\begin{equation} \frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial h(x)}l(\theta).\frac{\partial}{\partial \theta}h(x) \end{equation}$$

$$\begin{equation} \frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial g(\theta^T x)}l(\theta).\frac{\partial}{\partial \theta}g(\theta^Tx) \end{equation}$$

Calculating $\frac{\partial}{\partial \theta}g(\theta^Tx)$ first:

$$\frac{\partial}{\partial \theta}g(\theta^Tx) = \frac{\partial}{\partial \theta} \left(\frac{1}{1+e^{-\theta^T x}}\right)$$

$$= \frac{\partial}{\partial \theta}({1+e^{-\theta^T x}})^{-1}$$

Using the chain rule of derivatives,

$$=-({1+e^{-\theta^T x}})^{-2}.(e^{-\theta^T x}).(-x)$$

$$=\frac{e^{-\theta^T x}}{(1+e^{-\theta^T x})^2}.(x)$$

$$=\frac{1+e^{-\theta^T x}-1}{(1+e^{-\theta^T x})^2}.(x)$$

$$=\left(\frac{1+e^{-\theta^T x}}{(1+e^{-\theta^T x})^2}-\frac{1}{(1+e^{-\theta^T x})^2}\right).(x)$$

$$=\left(\frac{1}{(1+e^{-\theta^T x})}-\frac{1}{(1+e^{-\theta^T x})^2}\right).(x)$$

$$=(g(\theta^T x)-g(\theta^T x)^2).(x)$$

$$\begin{equation} \frac{\partial}{\partial \theta}g(\theta^Tx) =g(\theta^T x)(1-g(\theta^T x).x \end{equation}$$

Now, calculating $\frac{\partial}{\partial g(\theta^T x)}l(\theta)$,

$$\frac{\partial}{\partial g(\theta^T x)}l(\theta) = \frac{\partial}{\partial g(\theta^T x)}.(-(y.log(g(\theta^T x) + (1-y)log(1-g(\theta^T x)))$$

Again, using the chain rule,

$$= -\left(\frac{y}{g(\theta^T x)} + \frac{1-y}{1-g(\theta^T x)}.(-1)\right)$$

$$= -\left(\frac{y-y.g(\theta^T x)-g(\theta^T x)+y.g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x)}\right)$$

$$= -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x)}\right)$$

$$\begin{equation} \frac{\partial}{\partial g(\theta^T x)}l(\theta) = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x)}\right) \end{equation}$$

Finally, combining (10),(11),(12), we get

$$\frac{\partial}{\partial \theta} l(\theta) = -\left(\frac{y-g(\theta^T x)}{g(\theta^T x).(1-g(\theta^T x)}\right).g(\theta^T x)(1-g(\theta^T x).x$$

$$\frac{\partial}{\partial \theta} l(\theta) = -(y-g(\theta^T x)).x$$

$$\begin{equation} \frac{\partial}{\partial \theta} l(\theta) = -(y-h(x)).x \end{equation}$$

Plugging this back in (8),

$$\begin{equation} \theta_j = \theta_j + \alpha(y-h(x)).x \end{equation}$$

### Math of Intelligence : Logistic Regression

Logistic Regression Logistic Regression ¶ Some javascript to enable auto numberi...