The secret to AI success? Focusing on data preparation

Datasets are essential to AI models. They provide the truth by which we train AI models and measure a model’s success.

Engineers often look to the AI model as the key to delivering highly accurate results, but in reality it is often the data that determines an AI model’s success. Data flows through every step of the AI workflow, from model training to deployment, and the way it is prepared can be the main driver of accuracy when designing robust AI models.

Engineers can use these five tips to improve their data preparation process and drive success when developing a complete AI system.

Tip 1: Don’t settle for the data you have

“How much data do I need?” This is probably the most frequently asked question at the beginning of most AI projects. Rather than asking if they have enough data, engineers should be asking themselves whether they have enough of the right data for a model to achieve its goal with high accuracy.

If the model does not have enough samples from which to learn and understand the nuances of the data, it is highly unlikely you will arrive at an accurate model.

If you don’t have enough of the right data, don’t settle for what you have. There are various techniques you can use to augment and cultivate new data and overcome the shortage:

  1. Generate new data through simulation of a physical model, a common scenario in predictive maintenance applications. Consider, for example, a hydraulic pump used in oil extraction. You often know the critical failure causes, such as a seal leak in the pump. These failures rarely happen and are destructive, making it very difficult to get actual failure data. With tools such as Simulink and Simscape, which allow engineers to design and simulate physical systems, you can create a realistic model of the pump and run simulations under various failure scenarios. This approach lets you generate the data needed to train an AI model so that future occurrences can be detected on real systems in the field (a simplified sketch appears below).
  2. Use modern deep learning techniques, such as generative adversarial networks (GANs), to generate synthetic data with characteristics and features similar to the original data. Notably, GANs can generate both image and time-series data. For images, a GAN can produce synthetic samples for training an object detector or image classifier. For time-series data, a GAN can produce synthetic sensor samples, as in this demonstration of training a GAN for sound synthesis.

In both scenarios, simulation and deep learning, the synthetic data produced can then be used to train AI models, alleviating the shortage of data and allowing engineers to keep their focus on building accurate models.
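
As a simplified illustration of the simulation route, the sketch below generates labelled “healthy” and “seal leak” vibration records in plain MATLAB. The signal model, a dominant pump frequency plus a fault harmonic and noise, and all parameter values are invented for illustration; a real project would obtain these records from a Simulink or Simscape model of the pump.

    % Hypothetical vibration model of a pump: healthy records contain the
    % pump's fundamental frequency plus noise; "seal leak" records add a
    % low-frequency fault component. All parameters are illustrative only.
    fs = 1000;                                   % sample rate (Hz)
    t = (0:1/fs:1-1/fs)';                        % one-second records
    numRecords = 200;
    signals = zeros(numel(t), numRecords);
    labels = strings(numRecords, 1);
    for k = 1:numRecords
        x = sin(2*pi*50*t) + 0.2*randn(size(t)); % nominal pump vibration
        if rand < 0.5                            % simulate a fault in half the records
            x = x + 0.5*sin(2*pi*12*t);          % synthetic seal-leak signature
            labels(k) = "sealLeak";
        else
            labels(k) = "healthy";
        end
        signals(:,k) = x;
    end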

Atlas Copco used Simulink’s physical modelling capabilities to simulate its pumps, creating the data its AI models needed and ensuring that all field scenarios were represented, even those that occur only very rarely.

The company also used MATLAB and Simulink to distinguish between different types of data, creating predictive maintenance schedules and providing its field teams with reliable information for every combination of products, which enabled thousands of sales engineers to demonstrate reliable performance.

Tip 2: More data doesn’t (necessarily) mean a more successful model

A common frustration when designing AI models is that even with large amounts of data, model performance does not improve.

Recently, engineers at MathWorks were designing a neural network to identify different classes of animals, and the data collected was massive: millions of samples of various wildlife captured on camera. They assumed the data would yield great results, but the models they designed plateaued at 80 per cent accuracy.

With hyperparameter tuning, they were able to gain a few more percentage points. Taking a step back and looking at the data rather than the model, they saw that the animals were never perfectly posed; sometimes an image of a bear was simply an ear or a snout.

Removing those ambiguous samples yielded a highly accurate model, because it was no longer being trained on confounding images.

For any problem where more data isn’t translating to higher accuracy, the solution lies in cleaning, cropping, labelling, and transforming the data to provide the model with as many high-quality samples as possible. Tools such as Computer Vision Toolbox and Signal Processing Toolbox provide engineers with automated video and signal labelling capabilities to quickly create clean samples for training.
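
One practical way to act on this is to score the training set with an interim model and set aside the samples the network finds most ambiguous for review, removal, or re-labelling. The sketch below assumes trainedNet is a classification network and imds an imageDatastore you have already created; the 0.6 confidence threshold is an illustrative choice, not a recommendation.

    % Flag training samples the interim model is least confident about.
    [~, scores] = classify(trainedNet, imds);    % softmax scores per image
    confidence = max(scores, [], 2);             % top score for each sample
    ambiguous = confidence < 0.6;                % illustrative threshold
    flaggedFiles = imds.Files(ambiguous);        % review these before retraining
    cleanImds = subset(imds, find(~ambiguous));  % cleaner training subset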

Tip 3: Apply your domain expertise to transform your data

Accurate models are never a surprise to the engineers creating them when they are made with thoughtful, well-prepared data. This is especially important to engineers and scientists using signal data.

Raw signal data is rarely fed directly to AI models, as it tends to be noisy and memory intensive. Instead, time-frequency techniques are often used to transform the data and extract the most important features for the models to learn.

For example, UT Austin used MATLAB signal transformation functions and apps to transform brainwaves into images using wavelets; these images were then used as input to train deep learning models.

This technique allows a compact transformation of signals into images while preserving the overall signal characteristics. The model was able to detect words and phrases with an accuracy of more than 96 per cent.
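
A minimal version of that signal-to-image step, assuming the Wavelet Toolbox and Image Processing Toolbox plus a signal vector sig sampled at rate fs, might look like the sketch below. The 224-by-224 target size is chosen to match the input size of common pretrained convolutional networks.

    % Turn a 1-D signal into a scalogram image a deep network can consume.
    cfs = cwt(sig, fs);                          % continuous wavelet transform
    scalogram = rescale(abs(cfs));               % magnitudes scaled to [0,1]
    rgbImage = ind2rgb(im2uint8(scalogram), jet(256));
    rgbImage = imresize(rgbImage, [224 224]);    % match network input size
    imshow(rgbImage)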

Figure 1: Classifying the brain signals corresponding to the imagined word “goodbye” using time-frequency transformation (scalogram) and deep neural networks.

Tip 4: Use data as insight into your model

In the past, models have been thought of as black boxes. Fortunately, a variety of research is being done on debugging and validating models, and new techniques are available to understand models at a deeper level. These debugging techniques often rely on data to give insight into the model through visualisations. Examples include LIME and occlusion mapping, visualisation techniques that highlight the locations in an image that were most essential to the model’s decision making.

If you look at the image shown in Figure 2, the mug is being mistaken for a buckle. Using debugging techniques, engineers can ask the model why it predicted a certain category and where in the image it focused when making that decision.

Figure 2: Visualisation techniques can give insight into why a model makes decisions. ©1984–2021 The MathWorks, Inc.

In this image, it is clear the model is focusing on the wristwatch rather than the mug.
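
In MATLAB, that question can be posed with the Deep Learning Toolbox function imageLIME, roughly as sketched below. Here squeezenet and the built-in peppers.png image are stand-ins for whatever network and misclassified image you are actually debugging.

    % Highlight the image regions that most influenced the predicted class.
    net = squeezenet;                            % stand-in pretrained network
    inputSize = net.Layers(1).InputSize(1:2);
    img = imresize(imread('peppers.png'), inputSize);
    predictedLabel = classify(net, img);
    scoreMap = imageLIME(net, img, predictedLabel);
    imshow(img)
    hold on
    imagesc(scoreMap, 'AlphaData', 0.5)          % overlay the importance map
    colormap jet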

Debugging techniques such as LIME provide insight into the model through data. Data is therefore equally important to the debugging process: it offers insight into both the model and the features that matter most, and that debugging information can then be used to improve the model.

Tip 5: Manage data in preparation for production

A highly accurate model prototype doesn’t automatically translate to a highly accurate model in a production system. At the final deployment stage of the workflow, extra considerations apply to how data is handled.

You need to make sure that your data pipelines are set up to process incoming raw data before it reaches the models you’ve built for prediction. In addition, you must verify that the new model will work as intended on production data.

Production data is more challenging to work with and often involves a few more hurdles (a basic cleaning sketch follows this list):

  • Live sensor data is often dirty, due to missing values, outliers, or even sensor failures.
  • Models often combine signals from multiple sensors, all of which need to be synchronised.
  • Physical systems change over time, either from wear and tear or from additional components being added. This can cause model drift, where the performance degrades over time.
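
For the first two hurdles, MATLAB’s timetable functions offer a starting point, roughly as sketched below. The sensor timetables are fabricated for illustration, and the one-second grid and linear interpolation are arbitrary choices a real pipeline would tune.

    % Two hypothetical sensor feeds with mismatched, irregular timestamps.
    tA = timetable(seconds(sort(10*rand(50,1))), randn(50,1), 'VariableNames', {'vibration'});
    tB = timetable(seconds(sort(10*rand(40,1))), randn(40,1), 'VariableNames', {'temperature'});
    tA.vibration(5) = NaN;                       % simulate a dropped reading
    tA.vibration(9) = 40;                        % simulate a spurious spike

    % Clean and align the feeds before they reach the model for prediction.
    tA.vibration = filloutliers(tA.vibration, 'linear');   % tame outliers
    combined = synchronize(tA, tB, 'regular', 'linear', 'TimeStep', seconds(1));
    combined = fillmissing(combined, 'linear');             % patch remaining gaps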

Creating a successful AI system requires model lifecycle management. This includes training, deploying, monitoring, updating, and maintaining the model throughout its useful life. Data is key to all these processes, from the data needed to train a model to the data needed to verify and validate a model before it’s deployed.

As the five tips above highlight, data is key to driving success in AI. But even more important is data preparation, and it is engineers who use their skills to turn well-prepared data into a successful AI model.

Remember to step back and ask what business problem needs to be solved and which parts of the data will produce the most insightful results. Though it is tempting to focus on model development, the data and the way it is prepared can be key in the creation of highly accurate and robust models.

By Jean Baptiste Lanfrey, manager – Application Engineering at MathWorks Australia
