As a data scientist, you may be familiar with the following scenario: your model performed well in training, but it’s producing less-than-desirable results when faced with new data in production.
While many variables may be at fault, the quality of your data is a common culprit. Bad training data can have an outsized influence on model predictions. Perhaps bias has crept in, your dataset is imbalanced, or your model is over- or underfitting as a result.
Data validation enables ML practitioners to check the accuracy and quality of the source data they use to train their models. It can help you:
- Catch anomalies, errors, or outliers in your data;
- Spot differences in data you collected early on vs. more recently;
- Gain confidence in the data you are using to feed your models.
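As a rough illustration, a first-pass validation step for a small tabular dataset might flag outliers and class imbalance before training. This is a minimal sketch in plain Python; the IQR rule, the 20% imbalance threshold, and the sample values are illustrative assumptions, not fixed recommendations:

```python
from collections import Counter
from statistics import quantiles

def validate(features, labels):
    """Basic sanity checks on a labeled training set (illustrative only)."""
    report = {}
    # Flag outliers in a numeric feature using the interquartile-range rule
    q1, _, q3 = quantiles(features, n=4)  # quartiles of the feature values
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    report["outliers"] = sum(1 for x in features if x < lo or x > hi)
    # Flag class imbalance: any label under 20% of the data (assumed threshold)
    counts = Counter(labels)
    report["imbalanced"] = any(c / len(labels) < 0.20 for c in counts.values())
    return report

features = [1.0, 1.2, 0.9, 1.1, 25.0, 1.05, 0.95, 1.15]
labels = ["cat"] * 7 + ["dog"]
print(validate(features, labels))  # → {'outliers': 1, 'imbalanced': True}
```

In practice you would run checks like these on every new batch of training data, so that the 25.0 reading and the single "dog" example above get surfaced for human review instead of silently skewing the model.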
Read on to learn how to incorporate a scalable data validation process into your model-building pipeline, to catch errors early and speed your path to high-performing models in production.
Why you need a human-in-the-loop
To get models off the ground, you might start with pre-annotated datasets or off-the-shelf models — but these are not necessarily engineered to tackle the unique problems you’re looking to solve.
Crucially, these off-the-shelf solutions will likely flounder when faced with the unique edge cases and anomalies present in your datasets; this is because a model cannot learn things it has not seen in training.
In contrast, the human brain is wired to easily catch anomalies and exceptions, which is why adding a human validation step is key to ensuring you have a balanced dataset. A human-in-the-loop and a validation step baked into your data pipeline can proactively surface and resolve edge cases, data anomalies, and false-detection errors.
At Sama, we’ve seen clients using open-source models get to about 50-70% accuracy in their predictions out of the box. When paired with a human-in-the-loop (HITL) data validation process, they were able to achieve 95%+ quality.
Why crowdsourcing won’t cut it
Many annotation services offer low prices for data annotation and validation by relying on crowdsourcing. There are a few problems with this approach:
- Agent training or skill isn’t guaranteed;
- They may work on unsecured machines; and
- They may not stay for the whole engagement.
This last point is important beyond the obvious avoidance of project disruption: agents who stay grow from their experience and develop expertise. They learn from mistakes and make adjustments over time. Overall, having dedicated agents on your project safeguards against project delays, allowing you to get an accurate model as quickly as possible.
Selecting the right data validation and annotation partner
Who can you work with to build a scalable data validation process for your model pipeline, to catch errors early and speed your path to production? Here are some attributes to look for in a data validation and annotation partner.
A workforce that can scale with your ML projects
It may be straightforward to get good results on small amounts of data, but as you near production, your datasets can explode in size. At this stage, the quality that a single person or even a small team can achieve may not be reproducible at scale in a timely manner.
Efficient annotation and validation partners will pair you with a dedicated team who become experts on your data early on in the process and can scale on demand to maintain a high standard of quality.
A platform with good data-handling and editing capabilities
A good platform has to, among other things, enable the seamless uploading of data that has (pre-)annotations encoded already, preferably through an API.
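In practice, uploading pre-annotated data through an API usually amounts to posting a JSON payload that encodes each annotation alongside its source asset. The endpoint URL, field names, and schema below are purely hypothetical, sketched loosely after common bounding-box formats:

```python
import json
from urllib import request

# Hypothetical payload: one image with two pre-annotated bounding boxes.
# The field names and endpoint are illustrative, not a real API.
payload = {
    "asset_url": "https://example.com/images/frame_0001.jpg",
    "annotations": [
        {"label": "car",        "bbox": [34, 50, 120, 80]},   # [x, y, w, h]
        {"label": "pedestrian", "bbox": [210, 44, 40, 95]},
    ],
}

body = json.dumps(payload).encode("utf-8")
req = request.Request(
    "https://api.example.com/v1/tasks",   # hypothetical endpoint
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # not executed here; shown only for the shape of the call
```

The point is less the exact schema than the workflow it enables: pre-annotations arrive programmatically, so agents spend their time correcting model output rather than drawing every annotation from scratch.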
Beyond that, for the actual agent working on the data validation, editing tools that allow for easy clean-up of pre-annotated data are key. The platform should let the agent adjust annotations across and within frames, and easily determine whether an annotation actually falls within quality requirements.
Other helpful features would be a bulk action that allows quick removal of unnecessary annotations, and functionality that allows for quick redraws of annotations.
A platform and service with good reporting and visualizations
It’s critical that good, easy-to-digest reporting be available to you. After all, there’s no real point in doing data validation if there’s no easy way to determine whether corrections made to your previous data yield better model predictions.
At a minimum, the reporting should provide you with a direct comparison of model performance with the validated data versus unvalidated data. As a general rule, the more reporting that’s available, the better.
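The most basic version of that comparison is the same model evaluated on a held-out set after training on raw versus human-validated labels. A toy sketch of the metric side (the label lists and accuracy figures below are made up purely for illustration):

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the ground-truth labels."""
    assert len(predictions) == len(ground_truth)
    hits = sum(p == g for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

# Hypothetical predictions from two runs of the same model:
# one trained on raw labels, one trained on human-validated labels.
ground_truth    = ["car", "car", "pedestrian", "car", "cyclist", "car"]
raw_model       = ["car", "bus", "car",        "car", "cyclist", "bus"]
validated_model = ["car", "car", "pedestrian", "car", "cyclist", "car"]

print(f"trained on raw data:       {accuracy(raw_model, ground_truth):.0%}")
print(f"trained on validated data: {accuracy(validated_model, ground_truth):.0%}")
```

A real report would go further, breaking the comparison down per class and per data batch, but even this simple side-by-side view tells you whether the validation effort is paying off.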
Sama offers industry-leading human-in-the-loop data validation
Sama has been building up our expertise in data annotation for over a decade now, and our in-house agents are both well-trained and long-tenured. We upskill, assign, and directly manage a dedicated workforce of annotation and validation specialists who become experts on your data, at scale.
When coupled with our ML-powered data annotation and validation platform, our experts can help you efficiently label and validate your data, for the fastest path to accuracy.