When discussing the implementation of machine learning solutions, it’s easy to focus on the processes of data preparation, model development, and the deployment of the solution to production. However, it’s important for teams to recognize that the deployment of the machine learning model is hardly the end. Instead, it can only be considered the end of the beginning.
Unlike a traditional software application, the behavior of a machine learning (ML) model is not just defined by its code. In fact, the behavior of an ML solution is also heavily impacted by the data it was trained on, and the effectiveness of the model will change as the data being fed to it evolves (typically for the worse). Below, I will discuss this phenomenon and how the deterioration of an ML model’s efficiency and accuracy can be remediated through monitoring, constant labeling, and model retraining.
As we mentioned, a traditional software product is defined by its code, while the behavior of an ML solution is dependent upon both code and data. Therefore, there’s one less aspect to consider in the realm of post-release monitoring when developing and releasing a typical software product. That’s not to say that monitoring isn’t important in a traditional instance, but the likelihood that stable functionality in production will become less reliable when the code remains untouched is really low.
Machine learning models, on the other hand, differ greatly in this respect. The response from the model is dependent upon both its code and the data being provided. If the input to the model begins to evolve, as it very likely could over time, the model will probably not be trained to handle the newer input with any level of effectiveness. This will lead to a deterioration in the accuracy of its predictive capabilities as time goes on. To ensure that this is detected, monitoring the ML solution is critical.
There are multiple types of drift to consider in the maintenance of a machine learning solution. Data drift, for instance, occurs when the input being fed to the model changes, impacting its predictive capabilities. Consider the scenario in which a model is developed to detect instances of spam comments in response to videos posted on a website. At first, the model may be extremely effective in identifying comments that should be classified as spam. But, over time, those posting the spam comments may alter their tactics to evade detection. The data then drifts, as there now exists input that should be marked as spam but the model is unaware and will not do so. If the model isn’t retrained to evolve with these tactics, a higher percentage of spam comments will go undetected.
Concept drift, on the other hand, occurs when the interpretation of the data changes. One example of this is a change in the classification space. Consider the scenario in which a model is developed and trained to determine whether an image is of a car, truck, or train. However, after some time, images of bicycles are fed to the model. In this case, concept drift has occurred and the model will need to be altered to be able to properly classify the new images.
Both of these types of drift result in a deterioration in the predictive capabilities of the model, known as model drift. Therefore, it’s critical to detect instances of drift as early as possible in order to ensure that the service being powered by the model continues to provide value.
Priority number one in ensuring model effectiveness over time is monitoring the ML solution for instances of drift. With respect to machine learning, this practice includes tracking metrics involving model behavior to help detect a decline in performance. In the effort to detect potential drift as early as possible, teams can set baselines for model behavior, and when the model begins to stray from that baseline value, they can raise alerts to notify the proper personnel to analyze and remediate the problem.
For instance, let’s consider a machine learning solution that was constructed to detect fraudulent credit card transactions. If the model is typically expected to detect instances of fraud in 0.5% of cases, then this may be a good baseline to set. But what if fraud is detected by the model in 5% of cases all of a sudden? It could be that fraud is occurring with 10x greater frequency, but that probably isn’t the case. Instead, it’s more likely that some new trend in the data has emerged that is impacting the accuracy and effectiveness of the model’s predictive capabilities. Thus, an alert should be raised when the baseline is dramatically exceeded, and then the process of evaluating model performance should commence.
This example represents one manner in which organizations monitor data drift - by monitoring the distribution of the classifications applied to production input over time. When the frequencies with which the classifications are being applied is no longer in line with what used to happen, there is potential that drift has occured.
Furthermore, monitoring input data to the model and comparing this data to that on which the model was trained can be critical in identifying instances of data drift. When differences between the training data set and production input exceeds an acceptable threshold, the model may need to be retrained to deal with significant changes to the input being fed to the model in production.
This type of monitoring information and their associated alerts provides data engineers, data scientists, and other critical personnel with the level of detail necessary to evaluate the cause of the problem and make the appropriate changes to address it (i.e. re-evaluating the viability of the current model, re-labeling and retraining to regain performance, etc.).
One of the most important aspects of producing an effective machine learning workflow is curating high-quality labeled data to train the model.
Data labeling is the process of tagging or annotating groups of samples with one or more labels. When done for large quantities of data, labeling provides the basis for an ML model to derive common characteristics from similarly tagged data. This is important in the process for producing a model that can accurately classify input. You see, a supervised model learns by example. And the labeled training data represents the “examples” from which the model learns. The model evaluates the labels against the data to which they are attached, learning the relationship between the two. It then uses what it’s learned about this relationship to classify new datapoints in the future. Therefore, it is labeled data that enables a machine learning solution to hone the predictive capabilities that can then be leveraged to accurately classify input in a production environment.
But when input data changes and drift occurs, the model’s understanding of the relationship between the input data and the appropriate label will be outdated, and therefore likely incorrect. When evolving data is determined to be the cause of the decay in predictive capabilities, one potential solution is to re-label large quantities of data in a manner that corresponds with the data’s evolution. The ML solution can then be retrained using the newly labeled data, thereby updating its behavior to regain its effectiveness and accuracy.
Data labeling, in it’s typical form, is an arduous and expensive undertaking. As a time-consuming and manual process, it requires individuals with extensive domain expertise and a team of dedicated data labelers. These factors create a bottleneck in the process for producing high-quality labeled training data. And this bottleneck prevents ML teams from refreshing and retraining their models with any level of efficiency.
With that said, there now exist more efficient options to address the task of data labeling. This is primarily due to the creation of automated data labeling platforms that can streamline this portion of the process for producing training data sets. Watchful is a platform with exactly this capability. Through the use of “hinters,” ML teams can craft heuristic functionality that enables programmatic data labeling. Additionally, Watchful has the ability to learn from an ML team’s use of the platform, recommending new hinters based on its understanding of what the team is attempting to accomplish. In doing so, Watchful provides an organization with valuable opportunities for improving the quality of their training data.
All in all, a platform like Watchful allows ML teams to quickly annotate large amounts of data with less manpower and in less time. With Watchful, an ML team may only require 1 domain expert to address the task of data labeling rather than requiring a slew of experts to accomplish the same task. And, by employing hinters to automate the process of actually labeling the data, what might take one month to label manually could instead take just a day. Furthermore, it stands to reason that it’s less time-consuming for ML teams to modify the hinters used to programmatically label the training data than it would be to manually re-label the data by hand. So when drift occurs, ML teams leveraging Watchful can more effectively focus their efforts on editing their hinters. Thereby, more efficiently producing new training data sets that can be used to retrain the model and restore performance.
In many ways, the maintenance of machine learning solutions can be more complicated than maintaining a traditional software application.