Machine Learning Algorithms: How They Work and What They Are

Machine learning, or automatic learning, is the subset of artificial intelligence in which algorithms learn autonomously from data and from their own errors, without precise, explicit instructions from users.

In other words, machine learning algorithms harness the computing power of machines to mimic the way humans learn from their actions and refine their reasoning through experience. They aim to distinguish increasingly well between what is correct and what is incorrect, thereby guiding decision-making more accurately.

The rapid and unpredictable dynamics of contemporary market scenarios necessitate that companies of all sectors and sizes leverage their most crucial resource: the digital data obtained from their systems and communication channels with users.

Simultaneously, any data analysis, no matter how meticulous and thorough, runs the risk of quickly becoming outdated if it fails to incorporate real-time trends from the relevant scenarios, which reflect ongoing changes in the data.

The continuous availability of updated data is essential so that analytical models can continue to provide adequate answers over time. In fact, much of the success of machine learning lies in the continuous and automatic retraining to maintain the high analytical performance of predictive models.

But how do machine learning algorithms technically work? By way of introduction, it is useful to note that many types of algorithms have been developed over time, over a hundred of them recognised by the scientific literature, to offer workable answers to very different needs.

There is no single solution to a given problem, nor to a problem yet to be discovered. Beyond the practices and customs that characterise the machine learning market, it is largely the experience and culture of data engineers and data scientists that determines which algorithms to use, case by case.

What are machine learning algorithms?

Machine learning algorithms are applications developed in code to help people explore, analyse, and find meaning in complex data sets. In other words, an algorithm is a finite set of unambiguous detailed instructions that can be followed by a computer to achieve a given analytical goal.

To understand the underlying nature of an algorithm, it is appropriate to describe the machine learning processes in their main phases.

First of all, everything needed to apply a predictive model must be put in place: tools and technologies for data exploration and data preparation, combined with the specialised skills of data engineers and data scientists, are the essential elements for setting up the advanced analytics environment considered most suitable for the specific case.

The initial definition is followed by a cyclical and inductive process: the system performs the pre-set analysis on a sample of data (input data) to obtain models (output).

Based on the outputs, new rules are extracted, which the system uses again to optimise the analytical model, refining its knowledge of the scenario. In this way the outputs of one cycle become the input of the following cycle, to achieve increasingly accurate results.

Based on this premise, it is clear that machine learning systems are not born “already trained”: they must be developed and updated every time the reference scenario involves significant variations in the data.

For the algorithms to “come to life” and start actually learning, both the initial setup phase and the subsequent training phases are fundamental, the latter tied to the continuous availability of new data on which to retrain the model. The better the quality and variety of the data, the greater the probability that the algorithm will formulate reliable predictions.

This explains the fundamental importance of the training set, the first data set on which the algorithm is trained. A successful analysis cannot do without training data that are as representative as possible of the scenario to be analysed.
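To make the role of the training set concrete, here is a minimal sketch, assuming scikit-learn and its bundled Iris data purely for illustration: part of the data is held out so that the model trained on the rest can be checked against examples it has never seen.

```python
# Minimal sketch: hold out part of the data to check that the training set
# generalises to unseen examples (scikit-learn and the Iris data assumed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# The training set is what the algorithm learns from; the test set stands in
# for the new data the model will meet once deployed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on unseen data:", model.score(X_test, y_test))
```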

Learning methods

Machine learning now boasts a long tradition of development, which has led to the establishment of three main learning methods, each encompassing a great variety of algorithms. These are technical processes, but they can also be explained at a high level. Let’s see what supervised learning, unsupervised learning, and reinforcement learning consist of.

Supervised learning

With supervised learning, models are trained to recognise and map input data to specific predefined classes. The data sets used to train the models therefore contain labelled data, for which the expected outputs are known.

Classification and regression analyses are part of this group.

Classification algorithms predict a result expressed as a discrete value, which indicates that an object belongs to a specific class. They answer questions that admit a single answer. A classic example of this type of algorithm is taxonomy, i.e. the classification of every living being in the animal kingdom.

Regression algorithms differ from classification algorithms because their results are expressed as continuous values. A recurring example comes from the financial sector, in particular the Capital Asset Pricing Model (CAPM), which describes the relationship between the expected return and the risk of a security.
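As a worked illustration of the CAPM relation mentioned above, the expected return of a security can be written as E[R_i] = R_f + β_i (E[R_m] − R_f); the snippet below evaluates it for made-up figures (a hypothetical risk-free rate, beta, and market return).

```python
# CAPM: expected return of a security as a function of its systematic risk.
# All figures below are illustrative, not real market data.
def capm_expected_return(risk_free_rate: float, beta: float, market_return: float) -> float:
    """E[R_i] = R_f + beta_i * (E[R_m] - R_f)"""
    return risk_free_rate + beta * (market_return - risk_free_rate)

# Hypothetical inputs: 2% risk-free rate, beta of 1.3, 8% expected market return.
print(capm_expected_return(0.02, 1.3, 0.08))  # 0.098, i.e. a 9.8% expected return
```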

This family of models implements very precise and accurate algorithms but is not always applicable if you do not have a labelled training set.

Unsupervised learning

Unsupervised learning includes situations in which a labelled dataset is not available, but one still wants to try to model the data by creating homogeneous groups not defined a priori, based on the characteristics of the data themselves.

Clustering and associative analysis are part of this group.

Clustering is a technique that aggregates data into groups (called clusters) that are internally homogeneous, according to a notion of similarity or distance. The groups themselves are kept separate from one another, as different and distant as possible. Each cluster gathers data that are similar to one another, much as in supervised classification models, with the difference that the classes are not known a priori.

Associative analyses have a similar but broader purpose: to extract frequent patterns from data, rules hidden among the characteristics of each data item, so that the relationships they have in common can be deduced. Each rule answers a specific question such as: “If a customer buys eggs, how likely is it that they also buy flour, beer, or diapers?”. These analyses are often used to guide promotional campaigns or to decide the placement of products on shelves.
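A minimal sketch of the idea in plain Python, using a handful of invented shopping baskets: it computes the support and confidence of the rule “eggs ⇒ flour”, the two measures most association-rule algorithms start from.

```python
# Toy association-rule example on made-up shopping baskets.
# support(eggs -> flour): fraction of all baskets containing both items.
# confidence(eggs -> flour): among baskets with eggs, fraction that also contain flour.
baskets = [
    {"eggs", "flour", "milk"},
    {"eggs", "beer"},
    {"eggs", "flour"},
    {"bread", "milk"},
]

with_eggs = [b for b in baskets if "eggs" in b]
with_both = [b for b in with_eggs if "flour" in b]

support = len(with_both) / len(baskets)        # 2/4 = 0.50
confidence = len(with_both) / len(with_eggs)   # 2/3 ≈ 0.67
print(f"support={support:.2f}, confidence={confidence:.2f}")
```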

Reinforcement learning

Unlike the previous methods, reinforcement learning works through sequential decisions, in which the prediction depends not only on the characteristics of the data but also on the current state of the system, which is used to determine the future one.

The basic idea is to improve the behaviour of the system through the introduction of “reinforcements”, i.e. reward signals: numerical values indicating the quality of an action and of the state it leads to, with the ultimate aim of encouraging actions that lead to more favourable states of the system.
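As a minimal sketch of these ideas on a toy problem invented for illustration, the tabular Q-learning loop below rewards an agent for reaching the right end of a short corridor; the reward signal gradually steers the value estimates, and therefore the actions, toward the favourable state.

```python
# Toy tabular Q-learning on a 5-cell corridor (illustrative example only).
# The agent receives a reward of 1 for reaching the rightmost cell, 0 otherwise.
import random

n_states, actions = 5, [-1, +1]            # actions: move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]  # value estimate per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration

for _ in range(2000):
    s = 0
    while s != n_states - 1:
        # Explore occasionally, otherwise pick the action currently valued highest.
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Reinforcement update: move Q(s, a) toward reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned preference for moving right in each state:",
      [round(q[1] - q[0], 2) for q in Q])
```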

Reinforcement learning-based algorithms are frequently used in deep learning.

The algorithms used

Over the years, many machine learning algorithms have been developed, over a hundred of them officially recognised by the scientific literature, each with its own characteristics and analytical objectives. Among the most commonly used are:

Linear regression

Regression algorithms are used to predict a numerical value based on other input variables. Their use extends to many predictive analytics applications, such as financial stock price prediction, weather forecasting, and market demand assessment.

In this context, linear regression is the simplest type as it aims to predict a numerical value based on a single input variable. The algorithm generates a line that represents the relationship between the input variable and the expected numeric value.
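A minimal sketch, assuming scikit-learn and synthetic data generated on the spot, of fitting the line described above:

```python
# Linear regression sketch: fit a line y ≈ a*x + b to noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))               # single input variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 1, 100)   # true line plus noise

model = LinearRegression().fit(x, y)
print("estimated slope:", model.coef_[0], "estimated intercept:", model.intercept_)
print("prediction for x = 7:", model.predict([[7.0]])[0])
```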

Logistic regression

Logistic regression is more complex than linear regression and is used to predict the value of a binary variable based on other input variables. The algorithm generates an “S” curve that describes the probability of the output value based on the input variable. Among possible applications, logistic regression can predict whether a patient will develop a disease based on documented risk factors (input data).
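A hedged sketch of that “S” curve, with scikit-learn and an invented risk factor standing in for real clinical data:

```python
# Logistic regression sketch: probability of a binary outcome from one risk factor.
# The data are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
risk_factor = rng.uniform(0, 10, size=(200, 1))
# Toy generating rule: the outcome becomes more likely as the risk factor grows.
outcome = (risk_factor.ravel() + rng.normal(0, 1.5, 200) > 5).astype(int)

model = LogisticRegression().fit(risk_factor, outcome)
for value in (2.0, 5.0, 8.0):
    p = model.predict_proba([[value]])[0, 1]   # a point on the S-shaped curve
    print(f"risk factor {value}: predicted probability {p:.2f}")
```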

Naïve Bayes

Naïve Bayes is an algorithm based on Bayes’ theorem, which assumes that, within a given class, the presence or absence of a particular feature in a document is unrelated to the presence or absence of the other features.

The Naïve Bayes algorithm is used in classification, that is, to assign a class to each instance of data, such as classifying whether an email is spam or not.

The Naïve Bayes algorithm works by estimating the conditional probabilities of the independent variables (the features) given the dependent variable (the class). Simple and fast to run, this algorithm can nonetheless suffer from accuracy problems.
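A minimal sketch of the spam example, assuming scikit-learn and a handful of invented messages: word counts are the features, and the classifier treats them as conditionally independent.

```python
# Naïve Bayes sketch for spam filtering (tiny, made-up messages).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",
    "limited offer click here",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()          # word counts as features
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["free prize offer"])
print("spam probability:", model.predict_proba(test)[0, 1])
```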

K-nearest neighbour

K-nearest neighbour (KNN) is a classification algorithm that is based on the characteristics of objects close to the one considered. In other words, the KNN algorithm classifies an object based on its proximity to other known objects.

The KNN algorithm operates by estimating the distance between the characteristics of the object to be classified and those of the objects already known to the system. Subsequently, it utilises the ‘k’ objects closest to the object being classified to determine its class. The selection of the value of ‘k’ is determined through various heuristic techniques.
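A minimal sketch with scikit-learn and synthetic two-class data: the classifier looks at the k = 5 closest known points to decide the class of a new one.

```python
# K-nearest neighbour sketch: classify a point by the majority class of its
# 'k' closest neighbours (synthetic two-class data, scikit-learn assumed).
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # 'k' chosen here by hand
knn.fit(X, y)

new_point = [[0.0, 2.0]]
print("predicted class:", knn.predict(new_point)[0])
print("distances and indices of the 5 nearest neighbours:")
print(knn.kneighbors(new_point))
```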

K-means

K-means is an algorithm whose goal is to divide a data set into k clusters, where k is a number fixed in advance. It works by defining k centroids, which represent the central points of each cluster.

The k-means algorithm is mainly used for clustering problems. For example, it can be used in marketing applications, to create homogeneous groups of customers based on their purchasing preferences, or in computer vision, to group images based on predominant colours or other similar characteristics.
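A minimal sketch, assuming scikit-learn and synthetic data in place of real customer records: the algorithm is asked for k = 3 clusters and returns a cluster index for each point together with the centroids.

```python
# K-means sketch: group synthetic data into k = 3 clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)              # cluster index assigned to each point

print("cluster sizes:", [list(labels).count(c) for c in range(3)])
print("centroids:")
print(kmeans.cluster_centers_)
```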

Decision tree

The decision tree algorithm is a graphical representation of a set of decision rules and their consequences. Each internal node of the tree represents a variable, and each arc leaving an internal node represents a possible value of that variable. Finally, each leaf of the tree represents the value predicted for the target variable given the values of the other properties.

Decision trees are also used as building blocks of more complex algorithms, which combine many of them.
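As a minimal sketch, assuming scikit-learn and its bundled Iris data, the example below fits a shallow tree and prints its rules, making the node-and-leaf structure described above visible.

```python
# Decision tree sketch: fit a small tree and print its decision rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Each internal node tests a variable; each leaf holds a predicted class.
print(export_text(tree, feature_names=list(data.feature_names)))
```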

Random forest

Random Forest is a supervised learning algorithm that uses an ensemble technique to improve model accuracy and stability. It is used for both classification and regression problems, to predict a numerical value or class based on one or more input variables.

The random forest algorithm works by creating a set of decision trees, each trained on a random subset of the data. This makes the trees largely independent of one another, so their individual predictions are only weakly correlated and are then combined, typically by vote or averaging, into the final result.
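A minimal sketch, assuming scikit-learn and synthetic data: an ensemble of one hundred trees, each trained on a random portion of the data, votes on the final class.

```python
# Random forest sketch: many decision trees trained on random subsets of the
# data, combined into a single prediction (synthetic data, scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("number of trees combined:", len(forest.estimators_))
```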

Developed by Leo Breiman and Adele Cutler in 2001, random forest is appreciated for its accuracy and flexibility, to the point of being widespread in many fields, such as image classification, medical diagnosis, and fraud detection.

Gradient Boosting

Gradient boosting is a machine learning algorithm used for both regression and classification, which works incrementally, sequentially creating models capable of correcting the errors of previous models.

The gradient boosting algorithm is in fact based on the creation of a set of decision trees, i.e. weak predictive models, which are combined to create a stronger predictive model.

In each iteration, the algorithm fits a new weak model to the residual errors left by the ensemble built so far. The maximum number of iterations can be set in advance based on the available time and budget; otherwise, the process continues until the algorithm produces a model deemed acceptable against the required parameters.

Gradient boosting produces very precise models, but achieving this result requires a significant amount of data, as well as relatively long computation times. Its quality and reliability make it ideal for numerous applications, for example in the fintech and healthcare sectors.
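A minimal sketch, assuming scikit-learn and a synthetic regression problem: trees are added one at a time, each fitted to the errors left by the ensemble so far, with the number of iterations and the learning rate set in advance.

```python
# Gradient boosting sketch: trees added sequentially, each one fitted to the
# residual errors of the ensemble built so far (synthetic regression data).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators caps the number of iterations; learning_rate shrinks each step.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
gbr.fit(X_train, y_train)

print("R^2 on held-out data:", gbr.score(X_test, y_test))
```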