In this post, I'll try to break down how to select an algorithm from a practical standpoint. The Machine Learning training workflow itself won't be explained in detail here. Here are five steps you can follow to end up with a well-suited algorithm.
1. Identify the type of problem
This will help you reduce the number of algorithms to choose from. If you have labeled data (target values) and use it for training, it's a supervised learning problem.
If you have unlabeled data (no target values) and your purpose is to find patterns or structures in it, it's an unsupervised learning problem. If your solution involves interacting with an environment and getting feedback from it, it's a reinforcement learning problem.
Depending on the type of output, a supervised problem is either classification or regression. If the output is numerical, it is a regression problem; if the output is categorical, it is a classification problem.
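The rules above can be sketched as a small helper. This is only a rough heuristic, not a definitive test: the function name infer_problem_type is hypothetical, it assumes the targets arrive as a plain Python list, and it treats whole-number targets as categories, which is wrong for things like integer prices. Reinforcement learning is left out because it depends on how the solution interacts with an environment, not on the data itself.

```python
from numbers import Real


def infer_problem_type(targets):
    """Guess the learning problem from the target column (heuristic only)."""
    # No targets at all: nothing to supervise, so look for structure instead.
    if targets is None or len(targets) == 0:
        return "unsupervised"

    distinct = set(targets)
    # Numeric targets with fractional values suggest a continuous output.
    is_numeric = all(
        isinstance(t, Real) and not isinstance(t, bool) for t in distinct
    )
    if is_numeric and any(not float(t).is_integer() for t in distinct):
        return "regression"

    # Otherwise treat the targets as categories.
    if len(distinct) <= 2:
        return "binary classification"
    return "multi-class classification"


print(infer_problem_type(None))                     # unsupervised
print(infer_problem_type([12.5, 99.9, 43.2]))       # regression
print(infer_problem_type(["spam", "ham", "spam"]))  # binary classification
print(infer_problem_type([0, 1, 2, 1, 0]))          # multi-class classification
```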
Below is an easy-to-follow table that classifies the algorithm depending on the problem type.
2. Be familiar with the data
To dive deeper and understand the problem you may have to answer these questions: What's the available data? How many features do you have? Is the input of your data categorical or numerical? Is there a linear relationship in your data? If you have categorical target values, are they binary or multi-class? Do you have lots of outliers or anomalies?
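A few of these questions can be answered with a quick profiling pass. The sketch below uses only the standard library and a made-up toy dataset; the helper name profile_feature and the 1.5 × IQR outlier rule (Tukey's rule) are choices made for this illustration, not the only way to do it.

```python
import statistics


def profile_feature(values):
    """Summarize one feature column: its kind and, if numeric, its outlier count."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not numeric:
        return {"kind": "categorical", "n_categories": len(set(values))}

    # Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = sum(1 for v in values if v < low or v > high)
    return {"kind": "numerical", "n_outliers": n_outliers}


# Hypothetical toy dataset: column name -> values.
dataset = {
    "age": [23, 31, 35, 40, 41, 44, 52, 58, 60, 250],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon",
             "Paris", "Nice", "Lyon", "Paris", "Nice"],
}

profile = {name: profile_feature(col) for name, col in dataset.items()}
print(f"{len(dataset)} features, {len(dataset['age'])} rows")
for name, summary in profile.items():
    print(name, summary)
```

On real data you would more likely use a dataframe library for this, but the questions being asked are the same.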
Little data and a high number of features would lead us to use algorithms with high bias and low variance so that they generalize well (linear regression, Naïve Bayes, linear SVM, logistic regression).
Support Vector Machines are particularly well suited for problems with a high number of features. Lots of data and fewer features, on the other hand, would lead us to use algorithms with low bias and higher variance so that they learn better and don't underfit (KNN, Decision Tree, Random Forest, Kernel SVM, and neural nets).
If your data contains lots of outliers and you don't want to or can't get rid of them (because you think they are important), you might want to avoid algorithms that are sensitive to outliers (linear regression, logistic regression, etc.). Random Forest, on the other hand, is much less sensitive to outliers.
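To get a feel for how sensitive a least-squares fit is, here is a minimal sketch with made-up data: ten points lying exactly on y = 2x, then the same points with the last one replaced by an outlier. The helper ols_slope is written for this illustration and just applies the closed-form slope of simple linear regression.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


xs = list(range(10))
clean_ys = [2 * x for x in xs]       # points lie exactly on y = 2x
outlier_ys = clean_ys[:-1] + [100]   # same data, last point replaced by an outlier

clean_slope = ols_slope(xs, clean_ys)
outlier_slope = ols_slope(xs, outlier_ys)
print(f"slope without outlier:  {clean_slope:.2f}")
print(f"slope with one outlier: {outlier_slope:.2f}")
```

A single extreme point is enough to pull the fitted slope far away from 2, which is the behaviour you want to avoid when outliers have to stay in the data.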
Some algorithms are made to work better with linear relationships (linear regression, logistic regression, linear SVM). If your data does not contain linear relationships, or your input is not numerical and has no natural order (so it cannot simply be converted to numbers), you might want to try algorithms that can manage high-dimensional and complex data structures (Random Forest, Kernel SVM, Gradient Boosting, Neural Nets).
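As an illustration of the linear versus non-linear point, the sketch below compares a linear model with a Random Forest on synthetic data that no straight line can separate. It assumes scikit-learn is installed; the dataset (make_circles) and all settings are chosen for the demonstration and are not taken from a real project.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two concentric rings, which no straight line can separate.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

linear_model = LogisticRegression().fit(X_train, y_train)
forest_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_train, y_train
)

# score() returns the accuracy on the held-out test set.
linear_acc = linear_model.score(X_test, y_test)
forest_acc = forest_model.score(X_test, y_test)
print(f"logistic regression accuracy: {linear_acc:.2f}")
print(f"random forest accuracy:       {forest_acc:.2f}")
```

On data like this the linear model should do little better than guessing, while the tree ensemble can follow the ring-shaped boundary.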
If your target values are binary, logistic regression and SVM are good first choices. However, if you have a multi-class target, you might need to opt for a more complex model like Random Forest or Gradient Boosting.
3. What are your expectations?
Choosing an algorithm also depends on what your end goal is. Does the model meet the business goals? You might have a threshold for accuracy or other metrics (speed, recall, precision, memory footprint…) that you must reach or must not exceed. In that case, compare your candidate algorithms on those metrics and choose accordingly.
Sometimes you might prefer algorithms that are easier to train and give a good enough result (Naïve Bayes, linear regression, and logistic regression). This might be because of time restrictions, simple data, a need for interpretability, etc. These simpler, more approximate methods also tend to avoid overfitting and generalize well.
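Checking a model against such thresholds can be sketched in a few lines. The labels, predictions, and the 0.9 recall requirement below are invented for the example, and the helper names are hypothetical; in practice you would compute the same metrics with your ML library of choice.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Fall back to 0.0 when there is nothing to divide by.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def meets_requirements(metrics, minimums):
    """Return True if every required metric reaches its minimum value."""
    return all(metrics[name] >= floor for name, floor in minimums.items())


# Hypothetical labels and predictions from some candidate model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)

# Hypothetical business requirement: recall of at least 0.9.
print(metrics)
print(meets_requirements(metrics, {"recall": 0.9}))  # False: recall is 0.75
```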
Another important thing to consider is the number of hyperparameters your algorithm has. With an exhaustive search, the time required to tune a model grows exponentially with the number of hyperparameters, since every combination has to be tried to find a setting that performs well. So, if you are time-restricted, take this into consideration.
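One way to see how quickly the cost grows: in an exhaustive grid search, the number of candidate configurations is the product of the number of values tried for each hyperparameter. The grid below is purely illustrative; its names and values are assumptions, not recommendations.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only.
grid = {
    "max_depth": [3, 5, 10],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 300, 1000],
}


def count_configurations(grid):
    """Number of configurations an exhaustive grid search has to evaluate."""
    return len(list(product(*grid.values())))


print(count_configurations(grid))  # 27 = 3 ** 3

# Adding one more hyperparameter with three values triples the work again.
grid["subsample"] = [0.6, 0.8, 1.0]
print(count_configurations(grid))  # 81 = 3 ** 4
```

With cross-validation, each of these configurations is trained once per fold, so the total number of fits is higher still.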
Knowing what metric is important in your problem might play a key role in deciding what model to pick. However, metrics are not always the only things that drive your decision.
There is a well-known trade-off between accuracy and interpretability, and depending on your end goal you may have to favor one over the other when choosing an algorithm.
4. Test different models
Once you have an idea of what you are looking for and which algorithms may work for you and your team, it is always good to start with baseline models. It is best to start with the simplest algorithms: if their performance satisfies the requirements, you do not even have to consider neural networks, for instance, and spend extra time on them.
Try a few different combinations of hyperparameters to give each model class a chance to perform well, but do not make the mistake of spending too much time on this. Once you have evaluated all the algorithms with basic hyperparameters, select the ones that best fit the business problem.
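A baseline comparison along these lines might look like the sketch below. It assumes scikit-learn is installed and uses one of its bundled datasets as a stand-in for your own data; the three candidates and their settings are arbitrary picks for the illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; replace with your own features X and targets y.
X, y = load_breast_cancer(return_X_y=True)

# Simplest first: a majority-class baseline, then two cheap models.
candidates = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
baseline_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

ranked = sorted(baseline_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

If a simple candidate already clears the requirement, there may be no need to move on to heavier models.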
5. Compare and tune hyperparameters
You can set up a machine learning pipeline that compares the performance of the shortlisted algorithms on the dataset using the selected evaluation criteria. Use grid search to get the best out of each algorithm. Machine Learning is an iterative process, so from there on you might want to go back and play around with the selected features. Note that some feature engineering might shift the best algorithm options.
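As a minimal sketch of this step, the example below wraps preprocessing and a model in a scikit-learn Pipeline and tunes it with GridSearchCV. It assumes scikit-learn is installed; the bundled dataset, the choice of SVC, and the grid values are placeholders for whatever you shortlisted in the previous step.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset; replace with your own features X and targets y.
X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", SVC()),
])

# Illustrative grid; the "model__" prefix routes each value to the SVC step.
param_grid = {
    "model__C": [0.1, 1, 10],
    "model__kernel": ["linear", "rbf"],
}

# Every combination is evaluated with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, so the cross-validated score is not inflated by information from the validation fold.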
Selecting an algorithm for a specific problem is not a purely subjective decision or a matter of preference. It becomes easier with practice, but the steps described here should help you develop a structured approach to choosing your algorithm instead of relying on trial and error.