Production AI models degrade over time as data, users, and environments change, often without obvious failure signals. This post outlines how to detect model drift, monitor the right performance and input signals, and apply structured maintenance workflows to evaluate, retrain, and release models safely in production.


Just like ordinary software, no production AI model is “ship once.” With regular software, operating systems change, bugs crop up, and vulnerabilities are discovered. AI software faces some of the same issues, plus novel conditions that require model maintenance, such as shifting data, changing user behavior, and evolving environments.
With this guide, you’ll learn how to identify these issues with monitoring and quality-control tools, then remedy them with AI model maintenance techniques such as evaluation loops, data refresh, and retraining. You’ll also see how to incorporate repeatable human judgments to evaluate, label, and calibrate the rubrics for your AI model.
Constant model maintenance is necessary in order to keep an AI product functioning within defined performance thresholds, even as data, users, and environments change.
For example, imagine an image recognition model trained in a world before Labubus became popular. Every time the model is presented with an image containing a Labubu, it mislabels the image or returns a low-confidence guess because the object was never part of its training data. The AI product now requires model maintenance to handle these images correctly.
Here’s the standard set of maintenance activities you will need to perform: continuous monitoring, regular evaluation, data refresh, retraining, and controlled release.
Although generative AI is still new, regulators and standards bodies have already published formal guidance on maintenance. Post-deployment monitoring plans are increasingly treated as a governance requirement, with frameworks such as those from the NIST AI Resource Center outlining expectations for ongoing evaluation and change control.
AI models operate in an environment that’s always changing, but they have no awareness of these changes unless they’re retrained. Changing fashions, evolving language, and advancing technology are just a few of the shifts that can require ML model maintenance to keep the product relevant to its users.
Deciding when to perform model maintenance means monitoring and alerting on specific signals generated by the model itself. This can involve manually testing the model by submitting queries and judging the responses. Alternatively, it can involve automated signals such as cost-per-prediction and complaint rate. Each cluster of signals can be organized into a different “layer” of a monitoring approach.
When models are designed to label input data (for example, classification or routing systems), you can judge performance based on how often they produce correct labels.
For these systems, standard classification metrics apply. For example, the F1 score, the harmonic mean of precision and recall, balances false positives and false negatives for a given label. Another useful signal is confidence vs. correctness (calibration), which compares how confident the model is in its predictions against how often those predictions are actually correct. Lastly, slice metrics track performance on specific subsets of interest (such as languages, regions, channels, or long-tail intents), where aggregate metrics can mask localized degradation.
Scope note: These metrics apply primarily to supervised classification tasks. They are not sufficient on their own for evaluating generative or large language models.
The importance of slice metrics in AI model maintenance can’t be overstated. Imagine an AI model designed to converse with customers and recommend products. The conversational aspect of the model might perform very well, but the product recommendation engine might be starting to decay. If you’re not monitoring the product recommendation slice, that is, the part of the model that drives revenue, then you won’t notice the drop in model performance until you start losing money.
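To make this concrete, here is a minimal sketch of how per-slice F1 and a confidence-vs.-correctness check might be computed from a prediction log. The column names, slices, and the 0.9 confidence cutoff are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: per-slice F1 and a confidence-vs.-correctness check computed
# from a log of labeled predictions. Column names and cutoffs are illustrative.
import pandas as pd
from sklearn.metrics import f1_score

# Assumed log schema: true label, predicted label, model confidence, slice id.
log = pd.DataFrame({
    "y_true":     ["refund", "refund", "shipping", "shipping", "refund"],
    "y_pred":     ["refund", "shipping", "shipping", "refund", "refund"],
    "confidence": [0.97, 0.58, 0.64, 0.55, 0.93],
    "slice":      ["en", "en", "es", "es", "es"],
})

# Aggregate F1 can hide localized decay, so compute it per slice as well.
overall_f1 = f1_score(log["y_true"], log["y_pred"], average="macro")
slice_f1 = {
    name: f1_score(group["y_true"], group["y_pred"], average="macro")
    for name, group in log.groupby("slice")
}

# Confidence vs. correctness: high-confidence predictions should be correct far
# more often than low-confidence ones; a shrinking gap is a sign of decay.
log["correct"] = log["y_true"] == log["y_pred"]
high = log[log["confidence"] >= 0.9]["correct"].mean()
low = log[log["confidence"] < 0.9]["correct"].mean()

print(f"overall macro F1: {overall_f1:.2f}")
print(f"per-slice macro F1: {slice_f1}")
print(f"accuracy at high vs. low confidence: {high:.2f} vs. {low:.2f}")
```

If the per-slice numbers diverge from the aggregate, that divergence is usually the earliest sign that a specific slice, such as the recommendation slice described above, is decaying.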
For generative models and other systems that are not designed to produce explicit labels, you can still make inferences about performance using proxy signals.
For example, you can track the out-of-vocabulary (OOV) rate, which measures how often the model encounters tokens, entities, or terms that were not present in its training data. Novelty detection similarly indicates how frequently the model is exposed to new task types, topics, or prompt structures that fall outside its original training or evaluation scope.
These indicators are often accompanied by feature distribution drift, where the characteristics of the input data change over time. In production systems, this commonly reflects shifts in user demographics, product offerings, or use cases, all of which can lead to new query patterns that stress the model in different ways.
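As a rough illustration, the sketch below computes an OOV rate against a training-time vocabulary and runs a two-sample Kolmogorov-Smirnov test on a numeric input feature. The chosen feature (prompt length), vocabulary, and significance level are assumptions for the example, not a prescribed drift test.

```python
# Minimal sketch: two input-drift proxies, assuming you keep a reference sample
# of training-time data. Feature names and thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def oov_rate(live_texts, training_vocab):
    """Fraction of live tokens never seen in the training vocabulary."""
    tokens = [tok for text in live_texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return sum(tok not in training_vocab for tok in tokens) / len(tokens)

def feature_drift(reference_values, live_values, alpha=0.01):
    """Two-sample KS test on a numeric feature; True means drift detected."""
    stat, p_value = ks_2samp(reference_values, live_values)
    return p_value < alpha, stat

# Example: prompt length drifts upward as users adopt new query patterns.
rng = np.random.default_rng(0)
reference_lengths = rng.normal(40, 10, size=5_000)   # training-time sample
live_lengths = rng.normal(55, 14, size=5_000)        # recent production sample

drifted, stat = feature_drift(reference_lengths, live_lengths)
print(f"prompt-length drift detected: {drifted} (KS statistic {stat:.2f})")

vocab = {"track", "my", "order", "refund", "status"}
print(f"OOV rate: {oov_rate(['where is my labubu order'], vocab):.2f}")
```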
Drilling into operational metrics helps translate anomalous model behavior into observable system issues. Queries that the model was not trained to handle often require more compute resources or additional processing steps.
This can manifest as higher latency, lower throughput, increased error rates, or rising cost per prediction. Queue depth and timeout frequency often rise first, making operational metrics an early signal that model maintenance is required.
Lastly, declining customer engagement metrics may reveal that your model is no longer meeting user expectations. You may see that customers are no longer buying the products your model recommends (search conversion), are disengaging from the model after shorter conversations (deflection rate), or are submitting more complaints about model performance. If your model is designed to prevent fraud or catch cybercriminals, you may find related incidents begin to tick up. These could be your final warning that your model needs maintenance.
AI model maintenance can be difficult because model decay often happens slowly over time. Operators should set up alert categories that span prediction quality, input characteristics, operational metrics, and engagement signals, so that each monitoring layer has its own thresholds.
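One way to organize this, sketched below, is a small configuration that assigns thresholds to each monitoring layer plus a single check that flags breaches. The layer names, metrics, and threshold values are illustrative assumptions rather than recommended settings.

```python
# Minimal sketch: layered alert thresholds as configuration, then a single
# pass that flags any breached signal. Names and thresholds are illustrative.
ALERTS = {
    "prediction_quality": {      # label-based performance signals
        "macro_f1_min": 0.85,
        "checkout_slice_f1_min": 0.90,
    },
    "input_characteristics": {   # proxy signals for unlabeled or generative traffic
        "oov_rate_max": 0.05,
        "drift_score_max": 0.20,
    },
    "operational": {             # system-level early warnings
        "p95_latency_ms_max": 800,
        "cost_per_prediction_usd_max": 0.02,
    },
    "engagement": {              # lagging business signals
        "complaint_rate_max": 0.01,
    },
}

def breached(layer, metric, value, thresholds=ALERTS):
    limit = thresholds[layer][metric]
    return value < limit if metric.endswith("_min") else value > limit

# Example: evaluate the latest monitoring snapshot against every layer.
snapshot = {
    ("prediction_quality", "macro_f1_min"): 0.82,
    ("operational", "p95_latency_ms_max"): 950,
}
for (layer, metric), value in snapshot.items():
    if breached(layer, metric, value):
        print(f"ALERT [{layer}] {metric} breached with value {value}")
```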
Now that you know what to alert on, let’s see what an AI model maintenance workflow looks like in practice.
Model maintenance shouldn’t be rushed, but at the same time, you shouldn’t defer necessary changes. For text categorization and LLM evaluation, the bottleneck is often high-quality human judgment, rubric-calibrated and QA’d, rather than the model training step itself.
Quality control in model maintenance ensures that retraining decisions are based on consistent, reliable evaluation rather than subjective judgment or outdated taxonomies. Without calibration and agreement checks, retraining amplifies inconsistency instead of correcting it.
Let’s look at some best practices: calibrate labeling rubrics before each evaluation cycle, run inter-annotator agreement checks, adjudicate disagreements against the rubric, and keep taxonomies and label definitions current.
Adhering to these best practices can have measurable business impacts. Hunting down wrong-branch errors, where items land in the wrong part of the taxonomy, means that consumers receive better search relevance. Fixing responses that are too general improves personalization. Emphasizing quality control results in stable retrains and happier customers while aligning with governance expectations around post-deployment monitoring plans.
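As a concrete illustration of an agreement check, the sketch below computes Cohen’s kappa between two annotators labeling the same items. The labels and the 0.7 threshold are illustrative assumptions; the point is that agreement is measured before human judgments feed a retrain.

```python
# Minimal sketch: inter-annotator agreement check before trusting human
# judgments for retraining. The 0.7 kappa threshold is an illustrative choice.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["billing", "shipping", "billing", "returns", "shipping"]
annotator_b = ["billing", "shipping", "returns", "returns", "shipping"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

if kappa < 0.7:
    # Below-threshold agreement usually means the rubric needs recalibration
    # (clearer label definitions, adjudicated examples) before any retrain.
    print("Recalibrate the rubric and re-run adjudication before retraining.")
```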
Retraining is a core part of model maintenance, but doing it without disrupting production requires defined triggers, stable evaluation sets, and controlled release strategies. You can shape the maintenance schedule in order to deliver maximum value to users while leaving your developers free to create new features.
The first step is to understand when to perform model maintenance. Retraining should be policy-driven rather than calendar-based: sustained KPI deviation, meaningful drift confirmed by monitoring signals, and confirmed taxonomy or label-definition changes should be the events that trigger it.
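A policy-driven trigger can be as simple as the sketch below, which assumes monitoring already produces a drift score, a count of consecutive windows with KPI deviation, and a flag for confirmed taxonomy changes. The field names and thresholds are illustrative.

```python
# Minimal sketch: a policy-driven retraining trigger instead of a calendar.
# All thresholds and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class MonitoringState:
    drift_score: float            # e.g., aggregated feature-drift statistic
    kpi_deviation_windows: int    # consecutive windows below the KPI target
    taxonomy_changed: bool        # confirmed label-definition change

def should_retrain(state: MonitoringState) -> bool:
    sustained_kpi_deviation = state.kpi_deviation_windows >= 3
    meaningful_drift = state.drift_score > 0.2
    return sustained_kpi_deviation or meaningful_drift or state.taxonomy_changed

print(should_retrain(MonitoringState(0.05, 4, False)))  # True: sustained KPI dip
print(should_retrain(MonitoringState(0.03, 1, False)))  # False: no trigger fires
```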
Secondly, you should be consistent in your use of evaluation sets. Each new version should be tested on a frozen benchmark, a rolling recent set (reflecting changes to labels and taxonomy), and a targeted “edge case” set. This ensures that retrained models will retain their performance characteristics.
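For instance, a minimal promotion check over the three sets might look like the sketch below, where the scores stand in for whatever your evaluation harness produces and the regression margin is an assumption.

```python
# Minimal sketch: compare a candidate against the current model on a frozen
# benchmark, a rolling recent set, and an edge-case set before promotion.
EVAL_SETS = ("frozen_benchmark", "rolling_recent", "edge_cases")

def passes_evaluation(candidate_scores, baseline_scores, margin=0.01):
    """Reject the candidate if it regresses on any of the three sets."""
    for name in EVAL_SETS:
        if candidate_scores[name] < baseline_scores[name] - margin:
            print(f"candidate regressed on {name}")
            return False
    return True

baseline = {"frozen_benchmark": 0.91, "rolling_recent": 0.88, "edge_cases": 0.74}
candidate = {"frozen_benchmark": 0.92, "rolling_recent": 0.90, "edge_cases": 0.69}
print(passes_evaluation(candidate, baseline))  # False: edge-case regression
```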
Releasing the retrained model to your entire user base at once is a gamble. Instead, staged release strategies let you test the model with a live audience while allowing you to roll back underperforming versions.
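One common staged pattern is a percentage-based canary ramp, sketched below with placeholder deployment hooks. The stage sizes and health check are illustrative assumptions, not a prescribed rollout plan.

```python
# Minimal sketch: a percentage-based staged rollout with a health check
# between stages. Stage sizes and the health-check hook are illustrative.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of traffic per stage

def staged_rollout(promote, roll_back, healthy):
    """Advance through stages only while the health check keeps passing."""
    for fraction in ROLLOUT_STAGES:
        promote(fraction)
        if not healthy(fraction):
            roll_back()
            return False
    return True

# Example wiring with trivial stand-ins for the real deployment hooks.
ok = staged_rollout(
    promote=lambda f: print(f"routing {f:.0%} of traffic to the new model"),
    roll_back=lambda: print("rolling back to the previous version"),
    healthy=lambda f: f < 0.25,   # pretend monitoring fails at the 25% stage
)
print(f"rollout completed: {ok}")
```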
Lastly, how do you prevent your retrained model’s performance from quickly degrading again? Best practice is to implement slice-level gates before release, ensuring the model performs well on your most critical metrics. If your model can impact worker safety or supports a regulated industry, it is important to gate releases on those slices as well. You should always have a rollback plan in place, just in case a new version fails despite your quality checks.
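A slice-level gate can be expressed as a small check like the sketch below; the slice names and score floors are illustrative assumptions.

```python
# Minimal sketch: slice-level gates checked before any release. Slice names
# and thresholds are illustrative; add gates for safety-critical or regulated
# slices wherever they apply.
SLICE_GATES = {
    "product_recommendations": 0.88,   # the revenue-driving slice
    "regulated_disclosures": 0.95,     # compliance/safety-critical slice
    "long_tail_intents": 0.75,
}

def passes_slice_gates(slice_scores):
    failures = [s for s, floor in SLICE_GATES.items() if slice_scores[s] < floor]
    for s in failures:
        print(f"release blocked: {s} scored below its gate")
    return not failures

scores = {"product_recommendations": 0.91,
          "regulated_disclosures": 0.93,
          "long_tail_intents": 0.80}
print(passes_slice_gates(scores))   # False: the regulated slice misses its gate
```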
Organizations can support model maintenance using a range of tooling, from general-purpose monitoring platforms to specialized systems designed for drift detection and evaluation. At a minimum, tools should support automated monitoring, slice-level reporting, evaluation harnesses, and dataset or version management for controlled releases.
Even with automation, effective maintenance typically requires human-in-the-loop evaluation and adjudication. Teams may handle this entirely in-house for maximum control, or adopt a hybrid operating model in which internal teams retain ownership while managed partners provide repeatable, QA-controlled evaluation and labeling capacity.
If you’re beginning the model maintenance process or would like to improve an existing program, follow this short checklist to ensure you’re incorporating best practices: monitor continuously, evaluate consistently, refresh data deliberately, and release changes safely.
ML model maintenance is not a one-time task, but an operating discipline that determines whether production systems remain reliable over time.
By adhering to best practices, you’ll ensure that your model not only performs well but also improves over time in scalable, repeatable ways. Follow our checklist to monitor your model, evaluate performance, refresh data when needed, and release safely.
Effective model maintenance depends on discipline rather than sophistication. Teams that monitor continuously, evaluate consistently, refresh data deliberately, and release changes safely are far more likely to sustain model performance as data, users, and environments shift.