Amid all the hype and hysteria about ChatGPT, Bard, and other generative large language models (LLMs), it’s worth taking a step back to look at the gamut of AI algorithms and their uses. After all, many “traditional” machine learning algorithms have been solving important problems for decades—and they’re still going strong. Why should LLMs get all the attention?
Before we dive in, recall that machine learning is a class of methods for automatically creating predictive models from data. Machine learning algorithms are the engines of machine learning, meaning it is the algorithms that turn a data set into a model. Which kind of algorithm works best (supervised, unsupervised, classification, regression, etc.) depends on the kind of problem you’re solving, the computing resources available, and the nature of the data.
In the next section, I’ll briefly survey the different kinds of machine learning and the different kinds of machine learning models. Then I’ll discuss 14 of the most commonly used machine learning and deep learning algorithms, and explain how those algorithms relate to the creation of models for prediction, classification, image processing, language processing, game-playing and robotics, and generative AI.
Kinds of machine learning
Machine learning can solve non-numeric classification problems (e.g., “predict whether this applicant will default on his loan”) and numeric regression problems (e.g., “predict the sales of food processors in our retail locations for the next three months”). Both kinds of models are primarily trained using supervised learning, which means the training data has already been tagged with the answers.
Tagging training data sets can be expensive and time-consuming, so supervised learning is often enhanced with semi-supervised learning. Semi-supervised learning applies the supervised learning model from a small tagged data set to a larger untagged data set, and adds whatever predicted data that has a high probability of being correct to the model for further predictions. Semi-supervised learning can sometimes go off the rails, so you can improve the process with human-in-the-loop (HITL) review of questionable predictions.
While the biggest problem with supervised learning is the expense of labeling the training data, the biggest problem with unsupervised learning (where the data is not labeled) is that it often doesn’t work very well. Nevertheless, unsupervised learning does have its uses: It can sometimes be good for reducing the dimensionality of a data set, exploring the data’s patterns and structure, finding groups of similar objects, and detecting outliers and other noise in the data.
The potential of an agent that learns for the sake of learning is far greater than a system that reduces complex pictures to a binary decision (e.g., dog or cat). Uncovering patterns rather than carrying out a pre-defined task can yield surprising and useful results, as demonstrated when researchers at Lawrence Berkeley Lab ran a text processing algorithm (Word2vec) on several million material science abstracts to predict discoveries of new thermoelectric materials.
Reinforcement learning trains an actor or agent to respond to an environment in a way that maximises some value, usually by trial and error. That’s different from supervised and unsupervised learning, but is often combined with them. It has proven useful for training computers to play games and for training robots to perform tasks.
Neural networks, which were originally inspired by the architecture of the biological visual cortex, consist of a collection of connected units, called artificial neurons, organised in layers. The artificial neurons often use sigmoid or ReLU (rectified linear unit) activation functions, as opposed to the step functions used for the early perceptrons. Neural networks are usually trained with supervised learning.
Deep learning uses neural networks that have a large number of “hidden” layers to identify features. Hidden layers come between the input and output layers. The more layers in the model, the more features can be identified. At the same time, the more layers in the model, the longer it takes to train. Hardware accelerators for neural networks include GPUs, TPUs, and FPGAs.
Fine-tuning can speed up the customisation of models significantly by training a few final layers on new tagged data without modifying the weights of the rest of the layers. Models that lend themselves to fine-tuning are called base models or foundational models.
Vision models often use deep convolutional neural networks. Vision models can identify the elements of photographs and video frames, and are usually trained on very large photographic data sets.
Language models sometimes use convolutional neural networks, but recently tend to use recurrent neural networks, long short-term memory, or transformers. Language models can be constructed to translate from one language to another, to analyse grammar, to summarise text, to analyse sentiment, and to generate text. Language models are usually trained on very large language data sets.
Popular machine learning algorithms
The list that follows is not comprehensive, and the algorithms are ordered roughly from simplest to most complex.
Linear regression, also called least squares regression, is the simplest supervised machine learning algorithm for predicting numeric values. In some cases, linear regression doesn’t even require an optimiser, since it is solvable in closed form. Otherwise, it is easily optimised using gradient descent (see below). The assumption of linear regression is that the objective function is linearly correlated with the independent variables. That may or may not be true for your data.
To the despair of data scientists, business analysts often blithely apply linear regression to prediction problems and then stop, without even producing scatter plots or calculating correlations to see if the underlying assumption is reasonable. Don’t fall into that trap. It’s not that hard to do your exploratory data analysis and then have the computer try all the reasonable machine learning algorithms to see which ones work the best. By all means, try linear regression, but treat the result as a baseline, not a final answer.
Optimisation methods for machine learning, including neural networks, typically use some form of gradient descent algorithm to drive the back propagation, often with a mechanism to help avoid becoming stuck in local minima, such as optimising randomly selected mini-batches (stochastic gradient gescent) and applying momentum corrections to the gradient. Some optimisation algorithms also adapt the learning rates of the model parameters by looking at the gradient history (AdaGrad, RMSProp, and Adam).
Classification algorithms can find solutions to supervised learning problems that ask for a choice (or determination of probability) between two or more classes. Logistic regression is a method for solving categorical classification problems that uses linear regression inside a sigmoid or logit function, which compresses the values to a range of 0 to 1 and gives you a probability. Like linear regression for numerical prediction, logistic regression is a good first method for categorical prediction, but shouldn’t be the last method you try.
Support vector machines
Support vector machines (SVMs) are a kind of parametric classification model, a geometric way of separating and classifying two label classes. In the simplest case of well-separated classes with two variables, an SVM finds the straight line that best separates the two groups of points on a plane.
In more complicated cases, the points can be projected into a higher-dimensional space and the SVM finds the plane or hyperplane that best separates the classes. The projection is called a kernel, and the process is called the kernel trick. After you reverse the projection, the resulting boundary is often nonlinear.
When there are more than two classes, SVMs are used on the classes pairwise. When classes overlap, you can add a penalty factor for points that are misclassified; this is called a soft margin.
Decision trees (DTs) are a non-parametric supervised learning method used for both classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Decision trees are easy to interpret and cheap to deploy, but computationally expensive to train and prone to overfitting.
The random forest model produces an ensemble of randomised decision trees, and is used for both classification and regression. The aggregated ensemble either combines the votes modally or averages the probabilities from the decision trees. Random forest is a kind of bagging ensemble.
XGBoost (eXtreme Gradient Boosting) is a scalable, end-to-end, tree-boosting system that has produced state-of-the-art results on many machine learning challenges. Bagging and boosting are often mentioned in the same breath. The difference is that instead of generating an ensemble of randomised trees (RDFs), gradient tree boosting starts with a single decision or regression tree, optimises it, and then builds the next tree from the residuals of the first tree.
The k-means clustering problem attempts to divide n observations into k clusters using the Euclidean distance metric, with the objective of minimising the variance (sum of squares) within each cluster. It is an unsupervised method of vector quantisation, and is useful for feature learning, and for providing a starting point for other algorithms.
Lloyd’s algorithm (iterative cluster agglomeration with centroid updates) is the most common heuristic used to solve the problem. It is relatively efficient, but doesn’t guarantee global convergence. To improve that, people often run the algorithm multiple times using random initial cluster centroids generated by the Forgy or random partition methods.
K-means assumes spherical clusters that are separable so that the mean converges towards the cluster center, and also assumes that the ordering of the data points does not matter. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment.
Principal component analysis
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated numeric variables into a set of values of linearly uncorrelated variables called principal components. Karl Pearson invented PCA in 1901. PCA can be accomplished by eigenvalue decomposition of a data covariance (or correlation) matrix, or singular value decomposition (SVD) of a data matrix, usually after a normalisation step applied to the initial data.
Popular deep learning algorithms
There are a number of very successful and widely adopted deep learning paradigms, the most recent being the transformer architecture behind today’s generative AI models.
Convolutional neural networks
Convolutional neural networks (CNNs) are a type of deep neural network often used for machine vision. They have the desirable property of being position-independent.
The understandable summary of a convolution layer when applied to images is that it slides over the image spatially, computing dot products; each unit in the layer shares one set of weights. A convnet typically uses multiple convolution layers, interspersed with activation functions. CNNs can also have pooling and fully connected layers, although there is a trend toward getting rid of these types of layers.
Recurrent neural networks
While convolutional neural networks do a good job of analysing images, they don’t really have a mechanism that accounts for time series and sequences, as they are strictly feed-forward networks. Recurrent neural networks (RNNs), another kind of deep neural network, explicitly include feedback loops, which effectively gives them some memory and dynamic temporal behavior and allows them to handle sequences, such as speech.
That doesn’t mean that CNNs are useless for natural language processing; it does mean that RNNs can model time-based information that escapes CNNs. And it doesn’t mean that RNNs can only process sequences. RNNs and their derivatives have a variety of application areas, including language translation, speech recognition and synthesis, robot control, time series prediction and anomaly detection, and handwriting recognition.
While in theory an ordinary RNN can carry information over an indefinite number of steps, in practice it generally can’t go many steps without losing the context. One of the causes of the problem is that the gradient of the network tends to vanish over many steps, which interferes with the ability of a gradient-based optimiser such as stochastic gradient descent (SGD) to converge.
Long short-term memory
Long short-term memory networks (LSTMs) were explicitly designed to avoid the vanishing gradient problem and allow for long-term dependencies. The design of an LSTM adds some complexity compared to the cell design of an RNN, but works much better for long sequences.
In LSTMs, the network is capable of forgetting (gating) previous information as well as remembering it, in both cases by altering weights. This effectively gives an LSTM both long-term and short-term memory, and solves the vanishing gradient problem. LSTMs can deal with sequences of hundreds of past inputs.
Transformers are neural networks that solely use attention mechanisms, dispensing with recurrence and convolutions entirely. Transformers were invented at Google.
Attention units (and transformers) are part of Google’s BERT (Bidirectional Encoder Representations from Transformers) algorithm and OpenAI’s GPT-2 algorithm (transformer model with unsupervised pre-training) for natural language processing. Transformers continue to be integral to the neural architecture of the latest large language models, such as ChatGPT/Bing Chat (based on GPT-3.5 or GPT-4) and Bard (based on LaMDA, which stands for Language Model for Dialogue Applications).
Attention units are not terribly sensitive to how close two words in a sentence appear, unlike RNNs; that makes them good at tasks that RNNs don’t do well, such as identifying antecedents of pronouns that may be separated from the referent pronouns by several sentences. Attention units are good at looking at a context larger than just the last few words preceding the current word.
Q-learning is a model-free, value-based, off-policy algorithm for reinforcement learning that will find the best series of actions based on the current state. The “Q” stands for quality. Quality represents how valuable the action is in maximising future rewards. Q-learning is essentially learning by experience.
Q-learning is often combined with deep neural networks. It’s used with convolutional neural networks trained to extract features from video frames, for example for teaching a computer to play video games or for learning robotic control. AlphaGo and AlphaZero are famous successful game-playing programs from Google DeepMind that were trained with reinforcement learning combined with deep neural networks.
As we’ve seen, there are many kinds of machine learning problems, and many algorithms for each kind of problem. These range in complexity from linear regression for numeric prediction to convolutional neural networks for image processing, transformer-based models for generative AI, and reinforcement learning for game-playing and robotics.