Logistic Regression
Some JavaScript to enable auto-numbering of mathematical equations:

    %%javascript
    MathJax.Hub.Config({
        TeX: { equationNumbers: { autoNumber: "AMS" } }
    });
    MathJax.Hub.Queue(
        ["resetEquationNumbers", MathJax.InputJax.TeX],
        ["PreProcess", MathJax.Hub],
        ["Reprocess", MathJax.Hub]
    );
Here, we will work out the math for a binary logistic classifier.
Logistic Regression is similar to Linear Regression, but instead of a real-valued output y, the output will be either 0 or 1, since we need to classify each example into one of two categories.
In the linear regression post, we defined our hypothesis function as:

\begin{equation}
h_\theta(x) = \theta_0 + \theta_1 x
\end{equation}

Now, we can also have multiple input features, i.e. $x_1, x_2, x_3, \dots$ and so on, in which case our hypothesis function becomes:

\begin{equation}
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots
\end{equation}

We have added $x_0 = 1$ with $\theta_0$ for simplification. Now, the hypothesis function can be expressed as a combination of just two vectors, $X = [x_0, x_1, x_2, x_3, \dots]$ and $\theta = [\theta_0, \theta_1, \theta_2, \dots]$:

\begin{equation}
h_\theta(x) = \theta^T X
\end{equation}
Still, the output of this function will be a real value, so we'll apply an activation function to squash it into the range $(0, 1)$, which we can then threshold to get a 0 or 1 prediction. We'll use the sigmoid function $g(z)$ for this purpose. TODO: Explore other activation functions

\begin{equation}
g(z) = \frac{1}{1 + e^{-z}}
\end{equation}

\begin{equation}
h(X) = g(\theta^T X) = \frac{1}{1 + e^{-\theta^T X}}
\end{equation}
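As a quick illustration, here is a minimal NumPy sketch of this hypothesis function: we prepend $x_0 = 1$ to the feature vector, take the dot product with $\theta$, and pass the result through the sigmoid. The feature values and parameters below are made up purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta^T x), where x already includes the bias term x0 = 1."""
    return sigmoid(np.dot(theta, x))

# Toy example: two input features, plus the bias term x0 = 1 prepended.
x = np.array([1.0, 2.5, -0.3])        # [x0, x1, x2]
theta = np.array([0.5, -1.0, 2.0])    # [theta0, theta1, theta2]
print(hypothesis(theta, x))           # a value between 0 and 1
```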
The most commonly used loss function for logistic regression is log-loss (or cross-entropy). TODO: Why log-loss? Explore other loss functions.

So, the loss function $l(\theta)$ for $m$ training examples is:

\begin{equation}
l(\theta) = -\frac{1}{m} \left( \sum_{i=1}^{m} y^{(i)} \log\left(h(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h(x^{(i)})\right) \right)
\end{equation}
which can also be represented as:

\begin{equation}
l(\theta) = -\frac{1}{m} \left( \sum_{i=1}^{m} y^{(i)} \log\left(g(\theta^T x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - g(\theta^T x^{(i)})\right) \right)
\end{equation}
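To make the loss concrete, here is a small vectorized sketch that evaluates the log-loss over $m$ training examples. The arrays `X`, `y`, and `theta` below are placeholder toy values, not data from this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    """l(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) ), with h = g(X @ theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: 3 examples, a bias column x0 = 1 plus one feature.
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.3]])
y = np.array([0, 0, 1])
theta = np.array([0.1, 0.8])
print(log_loss(theta, X, y))
```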
Now, similar to linear regression, we need to find the value of $\theta$ that minimizes the loss. We can again use gradient descent for that. TODO: Explore other methods to minimize the loss function.

\begin{equation}
\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} l(\theta)
\end{equation}

where $\alpha$ is the learning rate.
From (8), we see that we need to find $\frac{\partial}{\partial \theta_j} l(\theta)$ to derive the gradient descent rule. Let's start by working with just one training example.

$\frac{\partial}{\partial \theta} l(\theta)$ can be broken down as follows:

\begin{equation}
\frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial h(x)} l(\theta) \cdot \frac{\partial}{\partial \theta} h(x)
\end{equation}

\begin{equation}
\frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial g(\theta^T x)} l(\theta) \cdot \frac{\partial}{\partial \theta} g(\theta^T x)
\end{equation}
Calculating $\frac{\partial}{\partial \theta} g(\theta^T x)$ first:

\begin{align*}
\frac{\partial}{\partial \theta} g(\theta^T x) &= \frac{\partial}{\partial \theta} \left( \frac{1}{1 + e^{-\theta^T x}} \right) \\
&= \frac{\partial}{\partial \theta} \left( 1 + e^{-\theta^T x} \right)^{-1}
\end{align*}
Using the chain rule of derivatives,
\begin{align}
&= -\left( 1 + e^{-\theta^T x} \right)^{-2} \cdot e^{-\theta^T x} \cdot (-x) \notag \\
&= \frac{e^{-\theta^T x}}{\left( 1 + e^{-\theta^T x} \right)^{2}} \cdot x \notag \\
&= \frac{1 + e^{-\theta^T x} - 1}{\left( 1 + e^{-\theta^T x} \right)^{2}} \cdot x \notag \\
&= \left( \frac{1 + e^{-\theta^T x}}{\left( 1 + e^{-\theta^T x} \right)^{2}} - \frac{1}{\left( 1 + e^{-\theta^T x} \right)^{2}} \right) \cdot x \notag \\
&= \left( \frac{1}{1 + e^{-\theta^T x}} - \frac{1}{\left( 1 + e^{-\theta^T x} \right)^{2}} \right) \cdot x \notag \\
&= \left( g(\theta^T x) - g(\theta^T x)^{2} \right) \cdot x \notag \\
\frac{\partial}{\partial \theta} g(\theta^T x) &= g(\theta^T x) \left( 1 - g(\theta^T x) \right) \cdot x
\end{align}
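We can sanity-check this identity numerically: the finite-difference derivative of the sigmoid should match $g(z)(1 - g(z))$. The value of `z` below is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
eps = 1e-6

numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytical = sigmoid(z) * (1 - sigmoid(z))                     # g(z) * (1 - g(z))

print(numerical, analytical)  # the two values should agree to several decimal places
```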
Now, calculating $\frac{\partial}{\partial g(\theta^T x)} l(\theta)$:

\begin{align*}
\frac{\partial}{\partial g(\theta^T x)} l(\theta) &= \frac{\partial}{\partial g(\theta^T x)} \left( -\left( y \log\left(g(\theta^T x)\right) + (1 - y) \log\left(1 - g(\theta^T x)\right) \right) \right)
\end{align*}
Again, using the chain rule,
\begin{align}
&= -\left( \frac{y}{g(\theta^T x)} + \frac{1 - y}{1 - g(\theta^T x)} \cdot (-1) \right) \notag \\
&= -\left( \frac{y - y \cdot g(\theta^T x) - g(\theta^T x) + y \cdot g(\theta^T x)}{g(\theta^T x) \left( 1 - g(\theta^T x) \right)} \right) \notag \\
&= -\left( \frac{y - g(\theta^T x)}{g(\theta^T x) \left( 1 - g(\theta^T x) \right)} \right) \notag \\
\frac{\partial}{\partial g(\theta^T x)} l(\theta) &= -\frac{y - g(\theta^T x)}{g(\theta^T x) \left( 1 - g(\theta^T x) \right)}
\end{align}
Finally, combining (10), (11), and (12), we get:

\begin{align*}
\frac{\partial}{\partial \theta} l(\theta) &= -\frac{y - g(\theta^T x)}{g(\theta^T x) \left( 1 - g(\theta^T x) \right)} \cdot g(\theta^T x) \left( 1 - g(\theta^T x) \right) \cdot x \\
\frac{\partial}{\partial \theta} l(\theta) &= -\left( y - g(\theta^T x) \right) \cdot x \\
\frac{\partial}{\partial \theta} l(\theta) &= -\left( y - h(x) \right) \cdot x
\end{align*}
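Before plugging this back into the update rule, here is a quick numerical sanity check of the result: for a single training example, the finite-difference gradient of the per-example loss should match $-(y - h(x)) \cdot x$. The example values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_single(theta, x, y):
    """Per-example log-loss: -( y*log(h) + (1-y)*log(1-h) )."""
    h = sigmoid(np.dot(theta, x))
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

x = np.array([1.0, 0.5, -1.2])   # includes the bias term x0 = 1
y = 1.0
theta = np.array([0.2, -0.4, 0.7])

# Analytical gradient from the derivation: -(y - h(x)) * x
analytical = -(y - sigmoid(np.dot(theta, x))) * x

# Numerical gradient via central differences, one component at a time
eps = 1e-6
numerical = np.zeros_like(theta)
for j in range(len(theta)):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[j] += eps
    t_minus[j] -= eps
    numerical[j] = (loss_single(t_plus, x, y) - loss_single(t_minus, x, y)) / (2 * eps)

print(analytical)
print(numerical)  # should closely match the analytical gradient
```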
Plugging this back into (8), we arrive at the gradient descent update rule for logistic regression:

\begin{equation}
\theta_j = \theta_j + \alpha \left( y - h(x) \right) \cdot x_j
\end{equation}
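Putting it all together, here is a minimal sketch of batch gradient descent for logistic regression using this update rule. It averages the per-example gradients over all $m$ examples (matching the $\frac{1}{m}$ in the loss); the toy dataset, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent: theta_j := theta_j + alpha * mean((y - h(x)) * x_j)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)             # predictions for all m examples
        gradient = -(X.T @ (y - h)) / m    # d l(theta) / d theta
        theta -= alpha * gradient          # i.e. theta += alpha * X.T @ (y - h) / m
    return theta

# Toy dataset: a bias column x0 = 1 plus a single feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0, 0, 1, 1])

theta = train_logistic_regression(X, y)
print(theta)
print(sigmoid(X @ theta))  # predicted probabilities; threshold at 0.5 to classify
```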