Logistic regression is a model used to predict a categorical variable. It measures the relationship between the **categorical dependent variable** and **independent variables** by estimating probabilities using a **logistic function**.

For example, logistic regression can be used if we want to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, and votes in previous elections. It can also be used for predicting the probability of failure of a given process, system, or product.

Apart from binary classification, it can also be used for ordinal or multinomial outcomes. Multinomial logistic regression deals with situations where the outcome can have three or more unordered possible types.

As the outcome of logistic regression is binary, Y needs to be transformed so that the regression process can be used. The logistic regression model can be written as follows,

logit(p(x)) = log( p(x) / (1 − p(x)) ) = β₀ + x·β

Solving for p(x),

p(x) = e^(β₀ + x·β) / (1 + e^(β₀ + x·β)) = 1 / (1 + e^(−(β₀ + x·β)))

We should predict Y = 1 when p(x) ≥ 0.5 and Y = 0 when p(x) < 0.5. The inverse of the logit, p(x), is the S-shaped (sigmoid) curve shown in the following graph,
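As a concrete illustration, the sigmoid p(x) and the 0.5 decision rule can be sketched in a few lines of Python. The coefficient values below are arbitrary, chosen only for the example:

```python
import math

def sigmoid(z):
    # p(x) = 1 / (1 + e^(-(b0 + x.w))): the inverse of the logit
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, b0, w):
    # Linear score b0 + x.w, squashed through the sigmoid into (0, 1)
    z = b0 + sum(wj * xj for wj, xj in zip(w, x))
    p = sigmoid(z)
    # Predict Y = 1 when p(x) >= 0.5, else Y = 0
    return (1 if p >= 0.5 else 0), p

# Arbitrary illustrative parameters: here z = -1 + 0.5 + 0.5 = 0, so p = 0.5
label, prob = predict([1.0, 2.0], b0=-1.0, w=[0.5, 0.25])
```

Note that p(x) = 0.5 exactly when the linear score β₀ + x·β is zero, so the 0.5 threshold corresponds to a linear decision boundary in feature space.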

Our goal is to find β₀ and β using the training data. Maximum likelihood estimation is used to estimate the parameters.

For each training data-point, we have a vector of features, x_i, and an observed class, y_i. The probability of that class is either p(x_i), if y_i = 1, or 1 − p(x_i), if y_i = 0. The likelihood is,

L(β₀, β) = ∏_{i=1}^{n} p(x_i)^{y_i} (1 − p(x_i))^{1 − y_i}

Taking logs turns the product into a sum; the log-likelihood is,

ℓ(β₀, β) = ∑_{i=1}^{n} [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ]

To find the maximum likelihood estimates, we differentiate the log-likelihood with respect to each parameter *β*_j and set the derivatives equal to zero. We cannot solve these equations analytically, as there is no closed-form solution. However, we can solve them approximately using numerical methods such as Newton-Raphson or stochastic gradient descent.
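A minimal numerical sketch of this fitting step, using plain gradient ascent on the log-likelihood rather than Newton-Raphson, and a tiny made-up one-feature dataset (the data and learning rate are assumptions for illustration only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up 1-D dataset: the labels overlap in x, so the MLE is finite
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 1, 0, 1]

b0, b1 = 0.0, 0.0   # parameters to estimate
lr = 0.1            # learning rate (step size)

for _ in range(5000):
    # Gradient of the log-likelihood: sum_i (y_i - p(x_i)) * [1, x_i]
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    # Ascend, since we are maximizing the likelihood
    b0 += lr * g0
    b1 += lr * g1
```

Because the log-likelihood is concave in the parameters, such iterative methods converge to the global maximum (when it exists); Newton-Raphson does the same ascent but rescales the step by the inverse Hessian.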

**Multinomial logistic regression (Logistic regression with more than two classes)**

If Y can take on more than two values, say classes k = 0, 1, …, K − 1, with a separate intercept β₀^(k) and coefficient vector β^(k) for each class, the predicted conditional probabilities will be,

Pr(Y = k | X = x) = e^(β₀^(k) + x·β^(k)) / ∑_j e^(β₀^(j) + x·β^(j))

When there are only two classes (say, 0 and 1), the above equation reduces to the previous equation, with *β*₀ = β₀^(1) − β₀^(0) and *β* = β^(1) − β^(0).
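These multinomial probabilities are just a softmax over per-class linear scores; a small sketch, with class parameters made up purely for illustration:

```python
import math

def softmax_probs(x, intercepts, weights):
    # Per-class linear score: b0^(k) + x . beta^(k)
    scores = [b + sum(wj * xj for wj, xj in zip(w, x))
              for b, w in zip(intercepts, weights)]
    # Exponentiate and normalize so the class probabilities sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes, two features; arbitrary illustrative parameters
probs = softmax_probs([1.0, 0.5],
                      intercepts=[0.0, 0.5, -0.5],
                      weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

With exactly two classes, dividing numerator and denominator by e^(score of class 0) recovers the binary sigmoid with the difference parameters given above.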