+2 votes
429 views
asked in Machine Learning by (180 points)  

Regarding the DataCamp tutorial "Python Machine Learning: Scikit-Learn Tutorial": the author considers which use cases are relevant to the digits data set so that she can select an appropriate machine learning algorithm. The reader is directed to the scikit-learn machine learning map. Here is the excerpt from the tutorial:

As your use case was one for clustering, you can follow the path on the map towards “KMeans”. You’ll see the use case that you have just thought about requires you to have more than 50 samples (“check!”), to have labeled data (“check!”), to know the number of categories that you want to predict (“check!”) and to have less than 10K samples (“check!”).

However, if you follow the learning map based on the listed use cases, KMeans is not the algorithm you would arrive at. According to the map, you would only arrive at the KMeans algorithm if you do NOT have labelled data. But the digits dataset contains labels.

When KMeans does not return optimal results, the learning map suggests trying the Spectral Clustering or GMM algorithms. But the author selected SVC (which is a classification algorithm, not a clustering algorithm), when KMeans didn't work.

Did the author select the wrong algorithm or is the learning map incorrect? Should classification or clustering have been used?

  

1 Answer

0 votes
answered by (116k points)  

The tutorial aims to cover examples of the main topics in scikit-learn. Before it reaches the clustering algorithm, it states that clustering the digits is posed as a "research question", under the following assumption:

"Do you think that, in a case where you knew that there are 10 possible digits labels to assign to the data points, but you have no access to the labels, the observations would group or “cluster” together by some criterion in such a way that you could infer the labels?

Now this is a research question!"
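To make the distinction concrete, here is a minimal sketch (not the tutorial's own code) that runs both approaches on the digits data. KMeans is fitted without ever seeing the labels, which is the "no access to the labels" scenario from the quote, while SVC is fitted with them. The variable names, the 75/25 split, the random seed and the default SVC settings are my own assumptions for illustration, not values taken from the tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the digits data: 8x8 images flattened to 64 features, targets 0-9.
digits = load_digits()
X, y = digits.data, digits.target

# Illustrative split and seed (assumed, not the tutorial's values).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Clustering (unsupervised): the labels are never passed to fit().
# Only the number of categories (10) is assumed to be known.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
kmeans.fit(X_train)
test_clusters = kmeans.predict(X_test)

# Cluster ids are arbitrary, so they cannot be compared to digit labels
# directly. The adjusted Rand index measures how well the grouping agrees
# with the true labels regardless of how the clusters are numbered.
ari = adjusted_rand_score(y_test, test_clusters)

# Classification (supervised): the labels are required by fit().
svc = SVC()  # default settings, chosen only for illustration
svc.fit(X_train, y_train)
svc_accuracy = accuracy_score(y_test, svc.predict(X_test))

print(f"KMeans agreement with true labels (ARI): {ari:.3f}")
print(f"SVC accuracy on held-out data: {svc_accuracy:.3f}")
```

The labels are used in the KMeans half only to check the result afterwards, never to fit the model. That is how a labeled data set can still be used for a clustering experiment, and why the map's "labeled data" branch leads to a classifier such as SVC instead.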

...