6 Most used machine learning algorithms

There are so many machine learning algorithms that it is sometimes hard to find the right one. A handful of them, however, suit many applications and are therefore very popular. Below I’ve listed six algorithms I often use: decision tree, K-nearest neighbor, support vector machine, neural network, naive Bayes and linear regression.

Decision tree

The decision tree is a machine learning algorithm that is easy to interpret, even for people without technical knowledge. The outcome is a model in the shape of a tree, usually visualized upside down. Each internal node in the tree represents a decision rule. The final nodes of the decision tree classify the target variable. The model is trained on a known target variable, which makes it a supervised learning algorithm.

A decision tree starts with a single node, called the root node, which includes the whole data sample. The algorithm calculates which decision rule separates the data best and creates the first split, with two branches and a new node at the end of each branch. The root node is now a decision point, and each of the two new nodes holds the subset of the sample data that falls on its side of the decision rule. This process of splitting and creating decision rules continues until a stopping condition is met. The final nodes are called leaves and determine the classification of new data points. When the leaves contain too few data points, the model is overfitted. This can be addressed by setting the right constraints (for example a maximum depth or a minimum number of samples per leaf) and by pruning.
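
To make the splitting step concrete, here is a minimal sketch of how a single decision rule could be chosen. It assumes Gini impurity as the quality measure, which is one common choice and not necessarily the one every implementation uses. The function names and the toy data (`ages`, `buys`) are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best rule 'value <= threshold'."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each child node by the share of the data it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy sample: one numeric input variable and a binary target.
ages = [22, 25, 31, 38, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, buys)
print(threshold, impurity)
```

On this toy sample the best rule is `age <= 31`, which separates the two classes perfectly. A full tree would repeat the same search on each of the two resulting subsets until a stopping condition is reached.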

Figure: Decision tree with two decision rules and three leaves

The versatility of the decision tree makes it a model that is used in many applications. When relationships between variables are complex and non-linear, a tree model is a good candidate. It can be applied to both categorical and continuous (target) variables, and it is relatively robust to outliers and missing values. A decision tree can also be used to identify significant variables for other models and applications.

Read more about decision trees in the other articles I wrote.


K-nearest neighbor

The K-nearest neighbor algorithm is easy to understand and simple to implement. For each test data point, the algorithm calculates the distance to the data points in the training set and selects the K nearest ones. These K nearest points determine the final classification of the test data, typically by majority vote. Any distance metric can be used, but the Euclidean distance is a common choice.
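
The idea fits in a few lines of code. The sketch below assumes Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, smallest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two small groups in a 2-dimensional space.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["circle", "circle", "circle", "square", "square", "square"]

print(knn_predict(points, labels, (6.0, 6.5), k=3))
```

Note that nothing is precomputed: all the work happens when a prediction is requested.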

A small value for K results in low bias but high variance. A large value for K results in high bias but low variance. The challenge is to find the optimal value for K, which can be determined using different approaches. The most common approach is k-fold cross-validation, which runs the algorithm for different values of K and selects the value that performs best on the held-out folds.
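
As a rough sketch of that selection procedure, the code below compares a few candidate values of K using cross-validation. The fold layout (every third observation goes to the same fold), the candidate values and the toy data, including one deliberately mislabeled point, are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """`train` is a list of (point, label) pairs; majority vote of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k, n_folds=3):
    """Share of correct predictions when each fold is held out once."""
    correct = 0
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, point, k) == label for point, label in test)
    return correct / len(data)


# Hypothetical toy data: two groups, interleaved so every fold contains both.
circles = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
squares = [(5.0, 5.0), (5.2, 4.8), (4.8, 5.1), (5.1, 5.3), (4.9, 4.7), (5.3, 5.0)]
data = []
for circle, square in zip(circles, squares):
    data.append((circle, "circle"))
    data.append((square, "square"))
data[4] = (data[4][0], "square")  # one deliberately mislabeled point to mimic noise

scores = {k: cross_val_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In practice you would use more data, more folds and a wider range of candidate values, but the principle stays the same.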

Figure: K-nearest neighbor with K=5. The test data (black) becomes a square because 3 out of 5 nearest observations are squares.

K-nearest neighbor differs from most modeling techniques in that it is a type of lazy learning. Lazy learning doesn’t create a model or set of rules during the training phase, but performs its calculations during the test phase. This means predictions become slower as the training set grows. On the other hand, the advantage is that you can add new observations to the training set and they immediately affect the results.

K-nearest neighbor is a supervised learning algorithm used in applications such as economic forecasting and recommendation systems. It makes no assumptions about the probability distribution of the variables, so it performs well when that distribution is unknown. Because it relies on distances, best practice is to normalize the data. The algorithm performs best when the number of input variables is small.


Support vector machine

A support vector machine is a supervised learning algorithm for classification problems, although it can also be used for regression problems. The concept was developed during the 1990s. The algorithm creates hyperplanes that separate the data points into categories.

Each data point is plotted in an n-dimensional space, just like with K-nearest neighbor. The difference is that support vector machines create hyperplanes. A hyperplane is a boundary (a line in a two-dimensional space) that divides the data points into groups. The hyperplane is chosen such that the classification error is minimized. Different criteria determine what the best hyperplane is. One example is the maximal margin, where the margin is the distance from the hyperplane to the closest data point. Another example is the soft margin classifier, which uses slack variables to make the groups less strict. Transformations that make it possible to find non-linear patterns are called kernels. The training set determines the position and orientation of the hyperplanes within the space. New test data is categorized based on the side of the hyperplane it falls on.
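
The sketch below illustrates what a hyperplane and its margin look like in code. It does not train a support vector machine: the weights and bias are simply assumed to be given, and the function names and toy points are hypothetical.

```python
import math


def decision_value(weights, bias, point):
    """Signed value of w . x + b; the sign tells on which side the point lies."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


def margin(weights, bias, points):
    """Distance from the hyperplane to the closest data point."""
    norm = math.sqrt(sum(w * w for w in weights))
    return min(abs(decision_value(weights, bias, p)) / norm for p in points)


# Hypothetical hyperplane x1 + x2 - 6 = 0 in a 2-dimensional space.
weights, bias = (1.0, 1.0), -6.0
points = [(1.0, 2.0), (2.0, 1.0), (4.0, 5.0), (5.0, 4.0)]

print([classify(weights, bias, p) for p in points])
print(margin(weights, bias, points))
```

Here the first two points fall on the negative side, the last two on the positive side, and the margin is roughly 2.12. A maximal-margin classifier searches for the weights and bias that make this margin as large as possible.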

Figure: Support vector machines creating a hyperplane within a 2-dimensional space

In an ideal world, all datasets are perfect and well structured. Unfortunately, we often have missing data and outliers. A support vector machine with a soft margin can tolerate some outliers, but missing values usually have to be imputed or removed before training. It classifies numerical data, so categorical variables should be transformed into binary dummy variables. The algorithm is a good fit for classification problems with many variables. Support vector machines become more complicated when the model contains more parameters, such as slack variables.
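
Transforming a categorical variable into binary dummy variables could look like the sketch below. The function name `dummy_encode`, the `is_` prefix and the example values are made up for illustration.

```python
def dummy_encode(values):
    """Turn a list of category values into one binary (0/1) variable per category."""
    categories = sorted(set(values))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical categorical variable with two distinct values.
colors = ["red", "green", "red"]
for row in dummy_encode(colors):
    print(row)
```

Each observation ends up with exactly one dummy variable set to 1, which gives the algorithm numerical inputs to work with.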


Neural network

The neural network finds its origin in 1943, when it was first described by Warren S. McCulloch (neuroscientist) and Walter Pitts (logician). Over the years, the neural network concept has been modified, adapted and extended by many others.

The creators saw that some problems are incredibly simple for a computer to calculate, but are difficult or time-consuming for a human. On the other hand, there are also problems that are natural for a human to solve, but difficult for a computer. A good example is pattern recognition, such as recognizing numbers in different handwritings. For us this is easy, but for computers it is much more difficult. Someone needs to ‘teach’ the computer how to recognize a number in different handwritings. A neural network is able to derive meaning from complicated or imprecise data and to detect patterns that are too complex to be noticed by humans.

Figure: Neural network with two layers and eight elements

Traditional problem-solving approaches use an algorithm or set of rules to find an optimal solution. Computers follow procedures and execute one instruction at a time. The neural network is different because it does not use a fixed algorithm or set of rules; it processes information in parallel throughout a network of elements. It learns from examples and can change its internal logic as it learns. A neural network can be thought of as an expert, focused on one application at a time, in the category of information it has been given to analyse. Once the neural network has been trained, it can interpret new inputs and give you the desired output.

The concept is inspired by how the human mind processes information. A neural network consists of three parts: an input layer, hidden layers and an output layer. The input consists of the examples used to train the neural network. The hidden layers consist of elements that can be interpreted as neurons in the human brain. Each element processes information independently with its own weights, and the elements within a layer work in parallel.
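
To show how information flows from input to output, here is a minimal sketch of a forward pass through a small network. It only covers prediction, not training. The sigmoid activation, the layer sizes and all weights are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Each element computes a weighted sum of the inputs plus a bias, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(element_weights, inputs)) + bias)
        for element_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through every layer in turn and return the final output."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs, a hidden layer with 3 elements, 1 output element.
hidden = ([[0.5, -0.6], [0.1, 0.8], [-0.3, 0.2]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.2])

prediction = forward([1.0, 0.5], [hidden, output])
print(prediction)
```

Training a network means adjusting these weights and biases, based on examples, until the outputs match the desired results.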

The neural network can be used for supervised learning or unsupervised learning. Common applications are image recognition, speech recognition and natural language processing.


Naive Bayes

The Naive Bayes concept was developed in the 1950s. It uses Bayes’ theorem of probability to predict the class of test data based on our prior knowledge. Naive Bayes is easy and fast, which makes it a good model if you need quick results. Since the algorithm does not use any coefficients to fit the model, it is fast and easier to explain to the business. It can handle large datasets, because it calculates probabilities of groups instead of interpreting each data point one by one.

The algorithm calculates probabilities for a number of different hypotheses. The calculation is simply the frequency of each attribute value for a given class value, divided by the frequency of instances with that class value. For example, if we have a target variable with the values ‘play outside’ and ‘not play outside’ and an outlook variable with the values ‘rain’ and ‘no rain’, then the algorithm calculates the probability for each combination. For new test data, the hypothesis with the highest probability given its input variables is selected.
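
The frequency calculation for this example can be sketched as follows. The observations are invented toy data and the function names are hypothetical. With only one input variable the independence assumption plays no role yet; with more input variables, the likelihoods of the individual variables would be multiplied.

```python
from collections import Counter

# Hypothetical observations as (outlook, class) pairs.
observations = [
    ("rain", "not play outside"),
    ("rain", "not play outside"),
    ("rain", "not play outside"),
    ("rain", "play outside"),
    ("no rain", "play outside"),
    ("no rain", "play outside"),
    ("no rain", "play outside"),
    ("no rain", "not play outside"),
]


def fit(observations):
    """Count how often each class and each (outlook, class) combination occurs."""
    class_counts = Counter(label for _, label in observations)
    pair_counts = Counter(observations)
    return class_counts, pair_counts


def predict(outlook, class_counts, pair_counts):
    """Score every class as P(class) * P(outlook | class) and pick the highest."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        prior = count / total
        likelihood = pair_counts[(outlook, label)] / count
        scores[label] = prior * likelihood
    return max(scores, key=scores.get), scores


class_counts, pair_counts = fit(observations)
label, scores = predict("rain", class_counts, pair_counts)
print(label, scores)
```

For ‘rain’ the score is 0.375 for ‘not play outside’ and 0.125 for ‘play outside’, so the first hypothesis is selected.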

Figure: Naive Bayes calculates the probability per hypothesis

The algorithm assumes independence among the variables. Although truly independent variables are rare in practice, the Naive Bayes algorithm often outperforms other algorithms. It can handle categorical variables and numerical variables. In case of categorical variables, all categories should be present in the training data, otherwise the probability of an unknown category will be 0. Smoothing techniques, such as the Laplace estimation, can deal with these zeros. In case of numerical variables, you will need to determine the distribution of the variables and, where needed, apply transformations to bring them closer to a normal distribution. The algorithm is used for text classification problems, spam filtering, sentiment analysis and recommendation systems.
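
A minimal sketch of Laplace (add-one) smoothing is shown below. The function name and the parameter `alpha` are my own naming. The idea is to add a small constant to every count so that no probability becomes exactly 0.

```python
def laplace_probability(count, class_total, n_categories, alpha=1.0):
    """Smoothed estimate of P(category | class).

    count        -- how often the category occurs within the class
    class_total  -- number of training observations in the class
    n_categories -- number of distinct categories of the variable
    alpha        -- smoothing constant (1.0 gives classic add-one smoothing)
    """
    return (count + alpha) / (class_total + alpha * n_categories)


# Hypothetical case: a category that never occurs within a class of 4 observations.
print(laplace_probability(count=0, class_total=4, n_categories=2))
```

Without smoothing this probability would be 0/4 = 0. With add-one smoothing it becomes 1/6, so a single unseen category no longer wipes out the whole product of probabilities.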


Linear regression

The linear regression model is used for many applications, such as predicting stock price movements and price elasticity. It is used in both statistics and machine learning. It predicts a numeric (continuous) variable by means of a linear equation.

Figure: Linear regression estimates the best fit using a linear equation

The model is simply a formula based on the training data. The goal is to create a linear equation that approximates the training data. The linear equation relates the prediction to a set of input variables. The prediction is calculated from a starting coefficient (the intercept) and a number of input variables, each multiplied by its own factor Beta. Our goal is to estimate the Beta belonging to each input variable. Each Beta gets its own value such that the error on the training data is minimized. By minimizing the error, the accuracy of the regression model increases. Usually the model minimizes the sum of squared residuals, which is called the ordinary least squares method. The resulting formula can be applied to the test data to find the predictions.
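
For a single input variable, the ordinary least squares estimates have a simple closed form, sketched below. The names `beta0` (starting coefficient) and `beta1` (factor of the input variable) and the toy data are assumptions for illustration. With several input variables you would normally rely on a library instead.

```python
def fit_simple_ols(xs, ys):
    """Estimate beta0 and beta1 in y = beta0 + beta1 * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def predict(beta0, beta1, x):
    """Apply the fitted linear equation to a new input value."""
    return beta0 + beta1 * x


# Hypothetical training data that roughly follows y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0, beta1 = fit_simple_ols(xs, ys)
print(beta0, beta1, predict(beta0, beta1, 6.0))
```

On this toy data the estimates come out at roughly 1.09 for the starting coefficient and 1.97 for Beta, close to the values used to make up the data.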

The variables should be numeric and need to have a linear relationship with the target. Categorical values should therefore be transformed into binary dummy variables. The best results are obtained when the variables follow a Gaussian distribution. If the variables do not have a Gaussian distribution, you can apply a transformation.
