How to optimize weights in Logistic Regression?

Question

How to optimize weights in Logistic Regression?

asked Jun 5, 2019 in Machine Learning by AskDataScience (115k points)

The hypothesis (model) of Logistic Regression which is a binary classifier ( $y =\{0,1\} $) is given in the equation below:

Hypothesis

$S(z)=P(y=1 | x)=h_{\theta}(x)=\frac{1}{1+\exp \left(-\theta^{\top} x\right)}$

Which calculates probability of Class 1, and by setting a threshold (such as $h_{\theta}(x) > 0.5 $) we can classify to 1, or 0.

Cost function

The cost function for Logistic Regression is defined as below. It is called binary cross entropy loss function:

$J(\theta)=-\frac{1}{m} \sum_{i}^{m}\left(y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right)$

Iterative updates

Assume we start all the model parameters with a random number (in this case the only model parameters we have are $\theta_j$ and assume we initialized all of them with 1: for all $\theta_j = 1$ for $j=\{0,1,...,n\}$ and $n$ is the number of features we have)

$\theta_{j_{n e w}} \leftarrow \theta_{j_{o l d}}+\alpha \times \frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}-\sigma\left(\theta_{j_{o l d}}^{\top}\left(x^{(i)}\right)\right)\right] x_{j}^{(i)}$

Where:
$m =$ number of rows in the training batch
$x^{(i)} = $ the feature vector for sample $i$
$\theta_j = $ the coefficient vector corresponding the features
$y^{(i)} = $ actual class label for sample $i$ in the training batch
$x_{j}^{(i)} = $ the element (column) $j$ in the feature vector for sample $i$
$\alpha =$ the learning rate

Dataset

The training dataset of pass/fail in an exam for 5 students is given in the table below:

If we initialize all the model parameters with 1 (all $\theta_j = 1$), and the learning rate is $\alpha = 0.1$, and if we use batch gradient descent, what will be the:

$a)$ Accuracy of the model at initialization of the train set ($\text{accuracy} = \frac{\text{number of correct classifications}}{\text{all classifications}}$)?
$b)$ Cost at initialization?
$c)$ Cost after 1 epoch?
$d)$ Repeat all $a,b,c$ steps if we use mini-batch gradient descent and $\text{batch size} = 2$

(Hint: For $x_{j}^{(i)}$ when $j=0$ we have $x_{0}^{(i)} = 1$ for all $i$ )