+1 vote
10.4k views
asked in Machine Learning by (230 points)  
Hi,

Since I am still confused about when to use score() and accuracy_score(), I want to confirm my assumptions.
Q1: With score(), we use the held-out split to test accuracy via knn.score(X_test, y_test), so as to avoid the bias of evaluating on the same training data, right? Here knn.score(X_test, y_test) just compares the predicted and actual values for each test pair.

Q2: accuracy_score from sklearn.metrics tests the predicted target values y_pred against y_test, via accuracy_score(y_test, y_pred). It just compares the actual and predicted target values, right?

Q3: My results are the same with both methods. Are they doing the same thing?

Q4: Using accuracy_score(), I can also compare the training targets y_train with y_train_pred (returned from knn.predict(X_train)). Is it then OK to report training accuracy as accuracy_score(y_train, y_train_pred)? Since the prediction is already done and I am just comparing against the original data, does the bias no longer exist?

Thanks.
  

2 Answers

+2 votes
answered by (480 points)  
selected by
 
Best answer

Q1: For a classifier, knn.score(X_test, y_test) calls accuracy_score from sklearn.metrics. For a regressor, score calls r2_score, the coefficient of determination defined in the statistics course.

You can find the source code of knn.score here (scikit-learn is open source): https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L324
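
For instance, a minimal sketch of that equivalence (the datasets and variable names here are my own illustration, not from the course):

    from sklearn.datasets import load_iris, load_diabetes
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
    from sklearn.metrics import accuracy_score, r2_score

    # Classifier: score() reports accuracy
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier().fit(X_train, y_train)
    print(knn.score(X_test, y_test))
    print(accuracy_score(y_test, knn.predict(X_test)))  # identical value

    # Regressor: score() reports R^2, the coefficient of determination
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    reg = KNeighborsRegressor().fit(X_train, y_train)
    print(reg.score(X_test, y_test))
    print(r2_score(y_test, reg.predict(X_test)))        # identical value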

Q2: accuracy_score is not a method of knn, but a function in sklearn.metrics. If the normalize argument is True (the default), accuracy_score(y_test, knn.predict(X_test)) returns the same result as knn.score(X_test, y_test). You can check the documentation below for more details:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
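
A quick sketch of what normalize changes (reusing knn, X_test, and y_test from the sketch above):

    from sklearn.metrics import accuracy_score

    y_pred = knn.predict(X_test)  # knn, X_test, y_test as in the earlier sketch
    print(accuracy_score(y_test, y_pred))                   # fraction correct (default, normalize=True)
    print(accuracy_score(y_test, y_pred, normalize=False))  # raw count of correct predictions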

Q3: As explained above, yes, they return the same result, but only in the situation described (a classifier, with the default normalize=True).

Q4: If there is bias after the split, the bias still exists whichever data set is compared. Here the bias exists when the data distribution in the train set and the data distribution in the whole set are not the same. Taking the Iris dataset as an example: if the distribution of the three classes (Setosa, Versicolour, Virginica) is 50-50-50 across the 150 samples, and you make a 20-80 test-train split, then the distribution of the three classes in the train set should be 40-40-40. If not, there is bias, because your train set differs from the population in terms of class distribution.
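
If you want the split to preserve the class distribution, train_test_split takes a stratify argument; a minimal sketch:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # 20% test / 80% train, keeping the 50-50-50 class ratio in both halves
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    print(np.bincount(y_train))  # [40 40 40]
    print(np.bincount(y_test))   # [10 10 10]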

This may be why Elon doesn't trust simulation and insists on using data from the real world to train Tesla's Autopilot system.

commented by (230 points)  
Thank you, very clear and easy to understand.
0 votes
answered by (115k points)  

Q1, 2, 3: Please take a look at the example here and see what the differences are. The functions behave differently for regression and for classification.

Q4: You need to know a bit more about the cross-validation procedure to see how to avoid bias. If you have access to DataCamp, complete this course first to understand the whole pipeline; it covers cross-validation and how it helps avoid bias.
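
A rough sketch of what cross-validation looks like in scikit-learn (the dataset and k value are placeholders of my own choosing):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=5)

    # 5-fold CV: every sample is used for testing exactly once,
    # giving a less biased accuracy estimate than a single split
    scores = cross_val_score(knn, X, y, cv=5)
    print(scores.mean(), scores.std())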

...