A method for fully approximating test data is being developed using algorithmic techniques recently perfected at the Department of Mathematics and Statistics of Université de Montréal. Error-free predictions of data and events would thus become attainable. The goal of this project is to automate intelligent machines so that they can make correct decisions on their own.

Today, the design and programming of intelligent robots—e.g., to assist caregivers and other professionals, or to protect a home from potential dangers—is coordinated by humans. In the near future, however, machines will gain the technical ability to automate themselves. That is the objective of my research project, which introduces a novel technique whereby machines can learn from data, based on a study of hidden-layer neural networks conducted with Professor Alejandro Murua of Université de Montréal.

To equip machines with the ability to learn, current models are trained on baseline data, and the resulting model is then tested on new data. Scientific research has yielded a number of techniques—especially in the area of deep learning, following Frank Rosenblatt’s introduction of the perceptron in 1958—that improve the generalization of models to new data so as to achieve accurate forecasts. The goal is therefore for machines to be able to predict the future more accurately: quite the challenge!
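
To make that workflow concrete, here is a minimal sketch in Python (using NumPy and scikit-learn); the synthetic dataset and the small network are invented for the illustration and are not the models of the project:

```python
# Minimal sketch: train a small neural network on "baseline" data,
# then evaluate it on data held out from training.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# Split into training (baseline) data and new, unseen test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
model.fit(X_train, y_train)                       # learn from the baseline data
print("test R^2:", model.score(X_test, y_test))   # evaluate on the new data
```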

In that regard, a well-known problem in machine learning is that of “model generalization.” What does this mean, exactly? A model trained to fit its training data often makes errors when it is applied to new, unseen data. To remedy that shortcoming, in my research I am seeking a model that would know both the training data and the test data as accurately as possible. To that end, Prof. Murua and I focused on differential machine learning coupled with the technique known as back-propagation.
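
A small numerical illustration of this generalization problem (with invented data, not the project’s): a high-degree polynomial can fit the training points almost exactly and still miss new test points:

```python
# A model that fits the training data almost perfectly can still make
# large errors on new test data (overfitting).
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, 12))
y_train = x_train**2 + 0.2 * rng.standard_normal(12)
x_test = np.sort(rng.uniform(-1, 1, 12))
y_test = x_test**2 + 0.2 * rng.standard_normal(12)

# A degree-11 polynomial can pass through every training point...
coeffs = np.polyfit(x_train, y_train, deg=11)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"training error: {train_err:.4f}, test error: {test_err:.4f}")
```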

Our methodology is based on gradient descent, first proposed by the mathematician Augustin-Louis Cauchy in 1847 and sometimes described as “descending the curve of the derivative function.” It is applied locally to each observation of the training data, one data point at a time, via a back-propagation algorithm. Each per-observation update reduces the local training error associated with that single observation. In practice, we observe that the technique converges and rapidly reduces the training error. Back-propagation is the engine of this technique. When, before the algorithm is applied, another back-propagation pass is used to identify the initial (ideal) parameters of the model, we speak of double back-propagation.
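
The sketch below illustrates this per-observation gradient descent on a one-hidden-layer network, written with NumPy only; the data, architecture, and learning rate are invented for the example, and the double back-propagation initialization step is not shown:

```python
# Gradient descent applied one observation at a time via back-propagation,
# on a one-hidden-layer network with a tanh activation and squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(X)                          # target function to learn

# Network parameters: hidden layer of 8 units, scalar output.
W1, b1 = rng.standard_normal((1, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)) * 0.5, np.zeros(1)
lr = 0.05

for epoch in range(200):
    for x_i, y_i in zip(X, y):         # one data point at a time
        # Forward pass.
        h = np.tanh(x_i @ W1 + b1)     # hidden activations
        pred = h @ W2 + b2
        err = pred - y_i               # local error for this observation

        # Back-propagation: gradients of the squared error w.r.t. parameters.
        grad_W2 = np.outer(h, err)
        grad_b2 = err
        grad_h = (W2 @ err) * (1 - h**2)
        grad_W1 = np.outer(x_i, grad_h)
        grad_b1 = grad_h

        # Local gradient-descent step: reduces the error on this observation.
        W2 -= lr * grad_W2; b2 -= lr * grad_b2
        W1 -= lr * grad_W1; b1 -= lr * grad_b1

train_error = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)
print("training error:", train_error)
```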

To reduce the training error for the test data, I perform a vector reconciliation of the training inputs and the test-phase inputs, using a chosen metric or criterion. Once the metric is chosen, I use the Taylor Approximation Theorem to refine the model and make it converge toward the true value. This theorem was set out by Brook Taylor in his treatise Methodus incrementorum directa et inversa, published in 1715. It allows me to interpolate a function from various values along its curve.
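
As a minimal illustration of the idea, a first-order Taylor approximation estimates the value of a function at a new point from its value and derivative at a nearby known point (the function sin here is just an example):

```python
# First-order Taylor approximation: estimate f at a nearby point from the
# value and the derivative at a known point.
import math

def f(x):
    return math.sin(x)

def f_prime(x):
    return math.cos(x)

x_known = 1.0          # a point where the function is known (a "training" point)
x_new = 1.1            # a nearby point to predict (a "test" point)

# f(x_new) ≈ f(x_known) + f'(x_known) * (x_new - x_known)
approx = f(x_known) + f_prime(x_known) * (x_new - x_known)
print("approximation:", approx)        # ≈ 0.8955
print("true value:   ", f(x_new))      # ≈ 0.8912
```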

The basic condition for applying the Taylor Approximation Theorem is that the model be differentiable, i.e., that it can be explored with the gradient: if the model under study is assumed to be differentiable, we can determine all of the test data by applying vector differentiation, which we could also call “data augmentation in (very small) steps.” With vector differentiation, convergence is thus guaranteed, and the machine learning is said to be “differential.”
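
Sketched numerically, the “very small steps” idea looks like this: the model’s value at a test input is reconstructed from a training input by accumulating many tiny gradient steps along the displacement vector (the toy model g and the inputs are invented for the example):

```python
# Reconstruct the value of a differentiable model at a test input by taking
# many tiny first-order (gradient) steps from a known training input.
import numpy as np

def g(x):
    # A differentiable toy "model": R^2 -> R
    return np.sin(x[0]) + x[0] * x[1]

def grad_g(x):
    # Its gradient vector
    return np.array([np.cos(x[0]) + x[1], x[0]])

x_train = np.array([0.5, -1.0])        # input seen during training
x_test = np.array([0.8, -0.7])         # nearby test input

n_steps = 1000
step = (x_test - x_train) / n_steps    # a very small displacement vector

x = x_train.copy()
value = g(x_train)                     # known value at the training input
for _ in range(n_steps):
    value += grad_g(x) @ step          # first-order update along the step
    x += step

print("reconstructed value:", value)
print("true value:         ", g(x_test))
```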

The combination of double back-propagation and differential machine learning paves the way for perfect knowledge of all data, whether train or test. This represents a major advance for the field of statistics and data science.

I had the idea of merging these elements while running tests with a number of structured models that deleted connections from the chosen architecture, but this still did not yield competitive optimization at the outset. This led to a first lesson: making models neuron-sparse with a probabilistic approach to real data distributions is not enough to compete with existing standard methods.
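
The connection-deletion idea mentioned above can be pictured as a random mask applied to a layer’s weights, in the spirit of dropout; this generic sketch is only an illustration, not the models actually tested:

```python
# Probabilistically deleting connections in a fully connected layer.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))         # weights of a fully connected layer
keep_prob = 0.7                         # each connection is kept with probability 0.7

mask = rng.random(W.shape) < keep_prob  # random deletion pattern
W_sparse = W * mask                     # deleted connections are set to zero

print("fraction of connections kept:", mask.mean())
```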

I therefore sought ways of enhancing optimization with other models using new criteria. When these methods also proved insufficient, I redoubled my efforts at the optimization stage and took a second look at the parameters of the convergent gradient-descent algorithm in its stochastic (random) form, examined in depth by Léon Bottou and described in his chapter “Stochastic Gradient Descent Tricks,” published in the Lecture Notes in Computer Science series (Springer, 2012).
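
In that stochastic form, the gradient step is taken one randomly chosen observation at a time, with a learning rate that decreases over iterations; the sketch below uses a common decreasing schedule on a least-squares problem and does not reproduce Bottou’s exact recommendations:

```python
# Stochastic gradient descent on a least-squares problem, with a
# decreasing learning rate.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(500)

w = np.zeros(3)
eta0, tau = 0.1, 100.0
t = 0
for epoch in range(20):
    for i in rng.permutation(len(X)):      # visit observations in random order
        eta = eta0 / (1.0 + t / tau)       # decreasing learning rate
        grad = (X[i] @ w - y[i]) * X[i]    # gradient of the squared error at one point
        w -= eta * grad
        t += 1

print("estimated weights:", w)             # close to [1.0, -2.0, 0.5]
```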

In my opinion, the potential of this discovery is huge. This novel technique will also have applications to the design of intelligent robots, self-driving cars, surveillance and intrusion detection systems, etc. Eventually, these machines will be better able to self-automate and achieve perfection, provided that the conditions for approximation listed above are met.

This article was produced by Nonvikan Karl-Augustt Alahassa, PhD in Statistics / Department of Mathematics and Statistics (Université de Montréal), with the guidance of Marie-Paule Primeau, science communication advisor, as part of our “My research project in 800 words” initiative.