+2 votes
6.9k views
asked in Machine Learning by (116k points)  

The hypothesis (model) of Logistic Regression, which is a binary classifier ($y \in \{0,1\}$), is given by the equation below:

Hypothesis

$S(z)=P(y=1 | x)=h_{\theta}(x)=\frac{1}{1+\exp \left(-\theta^{\top} x\right)}$

This calculates the probability of class 1, and by setting a threshold (such as $h_{\theta}(x) > 0.5$) we can classify a sample as 1 or 0.
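A minimal sketch of this hypothesis and thresholding in NumPy (the function names `sigmoid`, `predict_proba`, and `predict` are my own, not from the question):

```python
# A minimal sketch of the logistic regression hypothesis, assuming NumPy.
import numpy as np

def sigmoid(z):
    """S(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """P(y = 1 | x) = sigmoid(theta^T x), computed for every row of X."""
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    """Classify a sample as 1 when its probability exceeds the threshold, else 0."""
    return (predict_proba(theta, X) > threshold).astype(int)
```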

Cost function

The cost function for Logistic Regression is defined below. It is called the binary cross-entropy loss function:

$J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left(y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right)$
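A minimal NumPy sketch of this cost; note that the cost becomes infinite if $h_{\theta}(x^{(i)})$ saturates to exactly 0 or 1, which is why implementations often clip it:

```python
# A minimal sketch of the binary cross-entropy cost J(theta), assuming NumPy.
import numpy as np

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for every sample
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```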

Iterative updates

Assume we start all the model parameters from some initial value (in this case the only model parameters we have are the $\theta_j$, and assume we initialize all of them with 1: $\theta_j = 1$ for $j=\{0,1,\ldots,n\}$, where $n$ is the number of features we have).

$\theta_{j_{new}} \leftarrow \theta_{j_{old}}+\alpha \times \frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}-\sigma\left(\theta_{old}^{\top} x^{(i)}\right)\right] x_{j}^{(i)}$

Where:
$m =$ number of rows in the training batch
$x^{(i)} = $ the feature vector for sample $i$
$\theta_j = $ the coefficient corresponding to feature $j$ (the $\theta_j$ together form the coefficient vector $\theta$)
$y^{(i)} = $ actual class label for sample $i$ in the training batch
$x_{j}^{(i)} = $ the element (column) $j$ in the feature vector for sample $i$
$\alpha =$ the learning rate
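
A minimal sketch of one such update in vectorized NumPy form, assuming the design matrix `X` already contains the bias column $x_0 = 1$ (see the hint below):

```python
# A minimal sketch of one batch gradient-descent update over all theta_j at once,
# assuming NumPy and that X already includes the bias column x_0 = 1.
import numpy as np

def gradient_step(theta, X, y, alpha):
    """theta_new = theta_old + alpha * (1/m) * X^T (y - sigmoid(X theta_old))."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigma(theta^T x^(i)) for every sample
    return theta + alpha * (1.0 / m) * (X.T @ (y - h))
```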

Dataset

The training dataset of pass/fail in an exam for 5 students is given in the table below:

If we initialize all the model parameters with 1 (all $\theta_j = 1$), and the learning rate is $\alpha = 0.1$, and if we use batch gradient descent, what will be the:

$a)$ Accuracy of the model on the training set at initialization ($\text{accuracy} = \frac{\text{number of correct classifications}}{\text{all classifications}}$)?
$b)$ Cost at initialization?
$c)$ Cost after 1 epoch?
$d)$ Repeat steps $a$, $b$, $c$ if we use mini-batch gradient descent with $\text{batch size} = 2$.

(Hint: For $x_{j}^{(i)}$ when $j=0$ we have $x_{0}^{(i)}  = 1$ for all $i$ )
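
Since the dataset table is not reproduced in text here, below is a hedged NumPy sketch of how parts a–d could be computed; the feature values and labels in it are placeholders only and should be replaced with the 5 rows from the table:

```python
# A hedged sketch of parts a-d: batch vs. mini-batch (size 2) gradient descent.
# The feature values and labels below are PLACEHOLDERS, not the actual exam data.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def accuracy(theta, X, y):
    return np.mean((sigmoid(X @ theta) > 0.5).astype(int) == y)

# Placeholder dataset: bias column x_0 = 1 plus one feature, with 0/1 labels.
X = np.column_stack([np.ones(5), np.array([0.5, 1.0, 2.0, 3.0, 4.0])])
y = np.array([0, 0, 1, 1, 1])

theta = np.ones(X.shape[1])   # all theta_j initialized to 1
alpha = 0.1

print("a) accuracy at initialization:", accuracy(theta, X, y))
print("b) cost at initialization:", cost(theta, X, y))

# c) one epoch of batch gradient descent = a single update over all rows
h = sigmoid(X @ theta)
theta = theta + alpha * (1 / len(y)) * (X.T @ (y - h))
print("c) cost after 1 epoch (batch):", cost(theta, X, y))

# d) one epoch of mini-batch gradient descent with batch size 2
theta = np.ones(X.shape[1])   # re-initialize
for start in range(0, len(y), 2):
    Xb, yb = X[start:start + 2], y[start:start + 2]
    hb = sigmoid(Xb @ theta)
    theta = theta + alpha * (1 / len(yb)) * (Xb.T @ (yb - hb))
print("d) cost after 1 epoch (mini-batch, size 2):", cost(theta, X, y))
```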

  

2 Answers

+3 votes
answered by (200 points)  
edited by

Here is my attempt at the answer. Link to video solution (it also includes a short introduction to logistic regression; go to 13:00 to skip the logistic regression explanation):

commented by (140 points)  
Hello Wahba,
It was amazing. However, in part d, in the first iteration with the first batch, the update of theta1 should be (-1.2).
Please correct me if I am wrong!
commented by (200 points)  
edited by
+1
Ahh, good catch. I must have messed it up when I plugged this into my calculator.

Yeah, theta1 should be -1.2 after the first iteration in part d.

In the second iteration, you get a theta0 of 1 and a theta1 of 1.85. At this point, my calculator is not accurate enough to calculate the sigmoid, which gives me a result of 1 for each sample.

In the 3rd iteration, since the output of the model is the same as the actual value (because my calculator is not accurate enough), the model does not change.

Because of this, when you calculate your error, you end up with an infinite error/cost again.
:(
commented by (140 points)  
Thanks for the clarification Wahab, really appreciate it.
+1 vote
answered by (160 points)  
edited by

Here is my answer for questions a, b, c & d.


commented by (200 points)  
Hi Yoga,
I think there is one small mistake in your solution. When you are updating your thetas, each term in the summation is multiplied by the corresponding feature value. In the case of theta0, you need to multiply the terms by X0, which is 1, and not by X1 like you did. That is why your theta0 and theta1 are the same when you train your model.

There is a better explanation in the following video (the timestamp is embedded in the link; remove the space, as I can't post a direct link as a comment): https://youtu .be/V9jCH1Hrpzo?t=1260
commented by (100 points)  
Hi Yoga,
Thank you for the good explanation and solution, but you have one very small mistake in part c: when you calculate the losses for x = 33, 28 and 39, you mistakenly compute $y \log z$ instead of $y \log p$.
...