In summary:
- Stochastic gradient descent (SGD) is, in general, an iterative learning algorithm that uses randomly selected (or shuffled) samples from the training dataset to update a model. The term is also commonly used in the narrower sense of updating the model parameters using just one sample at a time.
- Batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated.
- The number of epochs is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.
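For illustration (the numbers are arbitrary), a training set of 1,000 samples processed with a batch size of 10 is split into 100 batches, so the model parameters are updated 100 times per epoch; training for 5 epochs therefore performs 500 updates in total.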
Let's review some basic definitions:
What Is a Sample?
A sample is a single row of data. A sample may also be called an instance, an observation, an input vector, or a feature vector.
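As a small, purely illustrative example (the features and target values are invented), a sample can be represented as a feature vector paired with a target value:

```python
import numpy as np

# One sample: a feature vector (e.g., house size in square meters, bedrooms, age in years)
# and its target value (e.g., sale price). All values here are made up.
x = np.array([120.0, 3.0, 15.0])  # input vector / feature vector
y = 250_000.0                     # target
```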
What Is a Model Parameter?
A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.
- They are required by the model when making predictions.
- They are estimated or learned from data.
- They are often not set manually by the practitioner.
- They are often saved as part of the learned model.
Some examples of model parameters include:
- The weights in an artificial neural network.
- The coefficients in a linear regression or logistic regression task.
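For instance, here is a minimal sketch (using scikit-learn as an assumed library; the toy data is invented for illustration) showing that the coefficients of a linear regression are estimated from the data rather than set by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 2 * x + 1 with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.1, size=100)

# The coefficient and intercept are model parameters: they are estimated
# from the data during fitting and stored as part of the learned model.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # close to [2.0] and 1.0
```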
What Is a Model Hyperparameter?
A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
- They are often used in processes to help estimate model parameters.
- They are often specified by the practitioner.
- They can often be set using heuristics.
- They are often tuned for a given predictive modeling problem.
A good rule of thumb for telling parameters and hyperparameters apart is as follows:
If you have to specify a value manually, then it is probably a model hyperparameter.
Some examples of model hyperparameters include:
- The learning rate for training a neural network.
- The k in k-nearest neighbors.
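As an illustration of both examples (again assuming scikit-learn; the specific values are arbitrary), these hyperparameters are passed in by the practitioner rather than learned during training:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameters are set by the practitioner before training, not estimated from data.
sgd = SGDClassifier(learning_rate="constant", eta0=0.01)  # the learning rate used during SGD training
knn = KNeighborsClassifier(n_neighbors=5)                 # the "k" in k-nearest neighbors
```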
What Is a Batch?
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
When all training samples are used to create one batch, the learning algorithm is called batch gradient descent. When the batch is the size of one sample, the learning algorithm is called stochastic gradient descent. When the batch size is more than one sample and less than the size of the training dataset, the learning algorithm is called mini-batch gradient descent.
- Batch Gradient Descent. Batch Size = Size of Training Set
- Stochastic Gradient Descent. Batch Size = 1
- Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
In the case of mini-batch gradient descent, popular batch sizes include 32, 64, and 128 samples. You may see these values used in models in the literature and in tutorials. If the dataset does not divide evenly by the batch size, it simply means that the final batch has fewer samples than the other batches. Alternatively, you can remove some samples from the dataset or change the batch size so that the number of samples in the dataset divides evenly by the batch size.
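For example, the sketch below (with a sample count and batch size chosen purely for illustration) shows how a dataset that does not divide evenly simply yields a smaller final batch:

```python
import numpy as np

# Illustrative only: split 1,000 samples into mini-batches of 32.
n_samples, batch_size = 1_000, 32
indices = np.random.permutation(n_samples)  # shuffle once per epoch

# One parameter update would be performed per batch.
batches = [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]

print(len(batches))      # 32 batches: 31 full batches plus one smaller final batch
print(len(batches[-1]))  # the final batch holds only the remaining 8 samples
```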
What Is an Epoch?
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. An epoch comprises one or more batches. For example, as above, an epoch that uses a single batch containing the whole training set corresponds to the batch gradient descent learning algorithm.
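To tie the two terms together, here is a minimal, framework-free sketch of the loop structure described above; update_parameters is a hypothetical placeholder for whatever gradient step a real model would perform, and the sample count, batch size, and epoch count are arbitrary:

```python
import numpy as np

def update_parameters(batch):
    # Hypothetical placeholder: compute gradients on this batch and update the weights.
    pass

n_samples, batch_size, n_epochs = 200, 5, 1_000
data = np.arange(n_samples)

for epoch in range(n_epochs):                      # one epoch = one complete pass over the data
    np.random.shuffle(data)                        # reshuffle between epochs
    for start in range(0, n_samples, batch_size):  # one parameter update per batch
        update_parameters(data[start:start + batch_size])

# 200 / 5 = 40 updates per epoch, so 1,000 epochs perform 40,000 updates in total.
```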
For more information, please take a look at this article.