Logistic Regression Algorithm
A classification technique (using geometric and probabilistic interpretations).
Objective: to find the plane (hyperplane) that separates the +ve and -ve points.
Approach 1:: Geometry
Let x denote the +ve points
and circles denote the -ve points.
So, y = {+1, -1}
+1 => +ve point
-1 => -ve point
Here the points are “almost” linearly separable.
We assume the plane passes through the origin, so there is no constant term in its equation.
In general there can be many planes that separate the +ve and -ve points, but we have to find the best plane P.
We will find the distance of each point to the plane.
Assume b = 0, which means the plane passes through the origin,
and that W is a unit vector; the signed distance of Xᵢ is then dᵢ = Wᵀ*Xᵢ / ||W|| = Wᵀ*Xᵢ.
https://avinash-k-mishra.medium.com/distance-of-a-point-to-a-plane-d4b3591e7fb3
Now we know that in the ideal case, for the above distance:
a. If it is positive, the class is y = +1
b. If it is negative, the class is y = -1
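With W a unit vector and the plane through the origin, the signed distance is just Wᵀ*Xᵢ and its sign gives the predicted class. A minimal sketch, with a made-up unit normal and points:

```python
import numpy as np

# Hypothetical unit normal of a plane through the origin (||w|| = 1).
w = np.array([0.6, 0.8])

X = np.array([[2.0, 1.0],       # expected on the +ve side
              [-1.0, -2.0]])    # expected on the -ve side

signed_dist = X @ w             # w^T x_i for each point
y_pred = np.where(signed_dist > 0, 1, -1)
print(y_pred)                   # [ 1 -1]
```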
Based on the plane (decision surface) we can have 4 different cases:
Case 1:
If Yᵢ = +1 and the classifier correctly classifies the point, i.e., Wᵀ*Xᵢ > 0, then Yᵢ·Wᵀ*Xᵢ > 0
Case 2:
If Yᵢ = -1 and the classifier correctly classifies the point, i.e., Wᵀ*Xᵢ < 0, then Yᵢ·Wᵀ*Xᵢ > 0
Case 3:
If Yᵢ = +1 and the classifier misclassifies the point, i.e., Wᵀ*Xᵢ < 0, then Yᵢ·Wᵀ*Xᵢ < 0
Case 4:
If Yᵢ = -1 and the classifier misclassifies the point, i.e., Wᵀ*Xᵢ > 0, then Yᵢ·Wᵀ*Xᵢ < 0
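All four cases collapse into a single test: a point is correctly classified exactly when Yᵢ·Wᵀ*Xᵢ > 0. A minimal NumPy sketch (the plane normal and points here are made up for illustration):

```python
import numpy as np

def is_correct(y, w, x):
    """True when point x with label y (+1 or -1) lands on its own side of the plane."""
    return y * np.dot(w, x) > 0

w = np.array([1.0, 0.0])                              # hypothetical plane normal

assert is_correct(+1, w, np.array([2.0, 3.0]))        # case 1: w.x > 0, y = +1
assert is_correct(-1, w, np.array([-1.0, 5.0]))       # case 2: w.x < 0, y = -1
assert not is_correct(+1, w, np.array([-2.0, 1.0]))   # case 3: misclassified
assert not is_correct(-1, w, np.array([4.0, -1.0]))   # case 4: misclassified
```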
So now our objective is to maximize the number of correctly classified points and minimize the number of misclassified points. Thus we can find W so that our objective function reduces to maximizing the sum of signed distances:

W* = argmax over W of Σᵢ Yᵢ·Wᵀ*Xᵢ
But the above objective function fails in some scenarios. Let’s analyse one.
Image 2 gives +1 as the sum, yet 4 points are misclassified, whereas Image 1 gives -42 with only 1 point misclassified. Thus our objective function can choose the wrong plane when maximizing the signed-distance sum, because a single far-away outlier dominates it.
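This failure mode is easy to reproduce numerically: one extreme misclassified outlier outweighs many small correct distances. A sketch with made-up values of Yᵢ·Wᵀ*Xᵢ for two hypothetical planes:

```python
import numpy as np

# Hypothetical y_i * (w^T x_i) values for each point under two planes.
plane1 = np.array([1.0, 2.0, 2.0, 2.0, 1.0, -50.0])  # 1 misclassified outlier
plane2 = np.array([1.0] * 5 + [-1.0] * 4)            # 4 misclassified points

print(plane1.sum(), int((plane1 < 0).sum()))  # -42.0, 1 misclassified
print(plane2.sum(), int((plane2 < 0).sum()))  # 1.0, 4 misclassified
# Maximizing the raw sum prefers plane2, even though it is the worse classifier.
```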
Squashing the signed distance with the sigmoid function
After applying the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ), we add a tapering behaviour that bounds the contribution of any single signed distance: if the signed distance is a large positive number the output approaches 1, if it is a large negative number the output approaches 0, and if the point lies on the plane the output is exactly 0.5.
This also gives our objective function a probabilistic interpretation: σ(Wᵀ*Xᵢ) can be read as the probability that Xᵢ belongs to the +ve class.
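The squashing behaviour can be checked directly. A quick sketch of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid squashing: maps any real signed distance into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))      # 0.5  -> point lies on the plane
print(sigmoid(100.0))    # ~1.0 -> far on the +ve side
print(sigmoid(-100.0))   # ~0.0 -> far on the -ve side
```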
To simplify further, note that for any monotonically increasing function g, argmax of f(x) equals argmax of g(f(x)).
log(z) is a monotonically increasing function, so we can write our objective function as

W* = argmax over W of Σᵢ log σ(Yᵢ·Wᵀ*Xᵢ)

And since σ(z) = 1 / (1 + e⁻ᶻ) and log(1/x) = -log(x), this becomes

W* = argmin over W of Σᵢ log(1 + exp(-Yᵢ·Wᵀ*Xᵢ))
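Putting the sigmoid, the log, and the log(1/x) = -log(x) identity together yields the familiar logistic loss Σᵢ log(1 + exp(-Yᵢ·Wᵀ*Xᵢ)). A sketch on made-up toy data, checking that a separating W scores a lower loss than its opposite:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum_i log(1 + exp(-y_i * w^T x_i)), with y_i in {+1, -1}."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))

# Toy data: two +ve points, one -ve point (made up for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])

# A w pointing toward the +ve points should beat the flipped w.
print(logistic_loss(np.array([1.0, 1.0]), X, y) <
      logistic_loss(np.array([-1.0, -1.0]), X, y))   # True
```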
Regularization
To stop W from tending towards infinity, we have to add a regularization term to our optimization problem (either L1 or L2).
L1-regularization: W* = argmin over W of Σᵢ log(1 + exp(-Yᵢ·Wᵀ*Xᵢ)) + λ·||W||₁ (λ is the hyper-parameter, ||W||₁ is the L1-norm of W)
L2-regularization: W* = argmin over W of Σᵢ log(1 + exp(-Yᵢ·Wᵀ*Xᵢ)) + λ·Wᵀ*W (λ is the hyper-parameter, Wᵀ*W is the square of the L2-norm of W)
- If λ = 0, the function overfits; when λ is very large, it underfits.
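The L2-regularized objective can be sketched directly: the penalty term λ·WᵀW grows with λ, discouraging large W (data here is made up for illustration):

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Logistic loss plus the L2 penalty lam * ||w||^2."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1, -1])
w = np.array([3.0, 3.0])

# lam = 0 recovers the plain logistic loss; a larger lam adds a cost
# proportional to ||w||^2, so the same w scores strictly worse.
print(regularized_loss(w, X, y, 0.0) < regularized_loss(w, X, y, 1.0))  # True
```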
Approach 2:: Probability
Using Naive Bayes with Bernoulli random variables, we can deduce the same sigmoid model for the posterior, P(y = 1 | x) = σ(Wᵀ*X).