score() vs accuracy_score() in sklearn

kaADSS asked Jan 21, 2020

13,079 views

Hi,

I'm still confused about when to use score() and accuracy_score(), so I want to confirm my assumptions.
Q1: With score(), we measure accuracy on the held-out split using knn.score(X_test, y_test), to avoid the bias of testing on the same data used for training, right? Here knn.score(X_test, y_test) just compares the predictions for X_test with y_test.

Q2: accuracy_score from sklearn.metrics checks the predicted target values "y_pred" against y_test, using accuracy_score(y_test, y_pred). Does it just compare the actual target values with the predicted target values?

Q3: My result is the same with both methods. Are they doing the same thing?

Q4: With accuracy_score(), I can also compare the training targets y_train with y_train_pred (returned from knn.predict(X_train)). Is it then OK to report the accuracy as accuracy_score(y_train, y_train_pred)? Since the prediction is already done and I am just comparing against the original data, does the bias no longer exist?

Thanks.

kaADSS

230 points

2 Answers

Best answer

Q1: For a classifier, knn.score(X_test, y_test) predicts on X_test and then calls accuracy_score from sklearn.metrics. For a regressor, score calls r2_score instead, which is the coefficient of determination from statistics.

You can find the source code of knn.score here. It’s open source. https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L324
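To illustrate the regressor case from Q1, here is a minimal sketch. The dataset (load_diabetes), the KNeighborsRegressor settings, and the random_state value are my own assumptions, chosen only to make the example runnable.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumed example data: the diabetes regression dataset bundled with scikit-learn.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed model settings, for illustration only.
reg = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# score() on a regressor reports R^2 on the data passed in.
r2_from_score = reg.score(X_test, y_test)

# The same quantity computed explicitly from the predictions.
r2_from_metric = r2_score(y_test, reg.predict(X_test))

print(r2_from_score, r2_from_metric)
```

The two printed numbers should agree, since score() for a regressor is R^2 of its own predictions.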

Q2: accuracy_score is not a method of knn but a function in sklearn.metrics. If the normalize argument is True (the default), accuracy_score(y_test, knn.predict(X_test)) returns the same result as knn.score(X_test, y_test). With normalize=False it returns the number of correctly classified samples instead of the fraction. See the documentation below for more details:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
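A quick way to check the Q2 claim yourself is sketched below. The Iris data, n_neighbors=5, and random_state=0 are assumptions for illustration, not something taken from the question.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data and split; any classification data would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# score() predicts internally, then computes the fraction of correct labels.
acc_from_score = knn.score(X_test, y_test)

# accuracy_score takes the true labels first, then the predictions.
acc_from_metric = accuracy_score(y_test, y_pred)

# normalize=False returns the count of correct predictions, not the fraction.
n_correct = accuracy_score(y_test, y_pred, normalize=False)

print(acc_from_score, acc_from_metric, n_correct)
```

The first two values should be identical, and the third should equal that fraction multiplied by len(y_test).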

Q3: As explained above, yes, they return the same result, but only in the given situation (a classifier, with normalize left at True).

Q4: If there is bias after the split, the bias still exists whichever data set you compare against. Here the bias exists when the data distribution in the train set is not the same as the data distribution in the whole set. Taking the Iris dataset as an example, if the three classes (Setosa, Versicolour, Virginica) are distributed 50-50-50 across the 150 samples and you make a 20-80 test/train split, then the three classes in the train set should be distributed 40-40-40. If not, there's bias, because your train set differs from the population in terms of data distribution.
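The class balance described in Q4 can be checked directly. The sketch below uses the stratify argument of train_test_split, which is one way (my suggestion, not necessarily what the asker did) to keep the class proportions of the full set in both splits; random_state=0 is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Class counts in the whole data set: 50 samples per class.
full_counts = np.bincount(y)

# stratify=y asks the splitter to preserve the class proportions of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(full_counts, train_counts, test_counts)
```

With stratification, the train counts should come out as 40 per class and the test counts as 10 per class. Without stratify, the counts depend on the random shuffle and can drift from that ratio.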

This may be why Elon doesn't trust simulation and insists on using real-world data to train the Tesla Autopilot system.

XingLi answered Jan 21, 2020 • selected Jan 21, 2020 by tofighi

XingLi

480 points
