10 Frequently Asked Data Labeling Questions


A day in the life of an ML engineer or a data scientist is not as glamorous as you might think; data-related tasks — from aggregating to labeling and augmenting data — can take up to 80% of their time.

At Sama, we’ve helped hundreds of organizations overcome data challenges at every stage of the AI model lifecycle. A lot of the same questions come up, and we’ve compiled them here along with recommendations for approaching your data annotation strategy holistically and sustainably.

1. Where do I start with my data?

In the last decade, businesses from virtually every industry have invested in collecting and storing unprecedented amounts of data. Wanting to see a return on their investment, these businesses are now turning to AI to extract value from the data they’ve collected. But more often than not, they’re faced with challenges that can make their data feel like more of a burden than a gold mine:

  • Noisy, unbalanced data
  • Lack of necessary data
  • Incompatible data formats
  • No clear picture of available data

If you’re finding yourself in this position—with a vast data lake and no paddle to cross it—go back to basics. Identify your business objectives and then work your way back to your data requirements.

If you start by trying to make sense of your data, you’re likely to find the process of wading through overwhelming. Even worse, you may wind up addressing the wrong business needs.

2. How do I ensure the data sent for annotation is representative of data my model will observe in production?

We hear this question less often than we should, because far too many teams simply assume it to be the case. The reality is that the data you use to train your model will often differ significantly from what that model sees in a live production environment.

If your model performed well in a testing environment but less so in production, this could be the culprit. Our recommendation is simple:

Don’t assume your training data will look exactly like your production data. 

If you’re an ML engineer, keep communication lines with your business units open. Talk to domain experts to gain a deep understanding of what your production data will look like. You may find there is a lot you don’t know you don’t know.

3. How do I make sure there is no inherent bias in my data?

There’s a reason this topic comes up again and again at AI conferences and in academic papers… though it should come up just as much in the boardroom.

Biased data can come in many different forms — from societal bias to unrepresentative datasets, to bias due to feedback loops or system drift. Whether the negative downstream impacts of bias are societal or business-related, the only way to mitigate bias while building a model is to be proactive:

1. Stay up to date on the field of research:

  • AI Now Institute’s annual reports
  • Partnership on AI
  • Alan Turing Institute’s Fairness, Transparency, Privacy

2. Establish responsible processes to mitigate bias:

  • Google AI recommended practices
  • IBM Fairness 360 framework
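Beyond process frameworks, a lightweight first check is to measure how outcomes are distributed across subgroups in your labeled data. The sketch below is illustrative, not a full fairness audit; the grouping attribute and the notion of a "positive" label are assumptions you would replace with your own:

```python
from collections import Counter

def group_positive_rates(labels, groups):
    """Share of positive labels per subgroup. Large gaps between
    subgroups can flag unrepresentative or biased data worth a
    closer look (this is a screening heuristic, not a verdict)."""
    totals, positives = Counter(), Counter()
    for y, g in zip(labels, groups):
        totals[g] += 1
        positives[g] += int(y == 1)  # "positive" is use-case specific
    return {g: positives[g] / totals[g] for g in totals}
```

A gap in these rates is not proof of bias on its own, but it tells you where to look first.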

4. Which parts of my training data should I get annotated first?

Especially if you have a large dataset, not all of it needs to be (or even can be) annotated. How do you know which parts of your data to prioritize?

There are techniques and products on the market that can help you classify your dataset, enabling you to cluster and rebalance your data so that you only send a well-distributed subset of it to be labeled.

Doing this will ensure that your dataset is balanced and holds the information that will have the greatest impact on your model’s performance.
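One simple way to pick a well-spread subset, assuming you already have an embedding for each example, is greedy farthest-point sampling. This is a minimal sketch of the idea, not how any particular curation product works:

```python
import numpy as np

def select_diverse_subset(embeddings: np.ndarray, k: int, seed: int = 0) -> list:
    """Greedy farthest-point sampling: pick k items that spread out
    across embedding space, a cheap proxy for a balanced labeling subset."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    chosen = [int(rng.integers(n))]  # start from a random example
    # Distance from every point to its nearest already-chosen point.
    dists = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest from the current selection
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```

Clustering (e.g. k-means) followed by per-cluster sampling achieves a similar effect; the point is to label a subset that covers the variety in your data, not the data in arrival order.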

Don’t pay for labels your model doesn’t need with Sama Curate.

5. What do I do with special cases?

In some cases, data rebalancing and filtering may not be enough. Even if a model produces good results overall from a purely technical perspective, it may still fall short from a business or societal standpoint in certain scenarios.

Take for example a vehicle detection model trained to recognize vehicles on the road. The computer vision model may perform sufficiently at recognizing “vans,” but an extra level of care may be required to recognize specific types of vans — consider the use case in which road safety codes dictate that one must keep a certain distance from a medical transportation vehicle.

In these situations, human-in-the-loop input is very important to help catch nuances that algorithms are likely to miss.

Luckily, there are tools to support and accelerate capturing this human input. Similarity search allows you to preprocess your entire dataset and gain insight into images that appear similar (at least to your model) and may require the judgment of a human in the loop.

Once you conduct a similarity search, you will be left with a subset of similar images to help you focus your annotation efforts. This allows you to streamline your process by only requiring you to browse through hundreds of images, rather than thousands or hundreds of thousands to find relevant examples to annotate.

In the example above, it would make sense to perform a similarity search to pull out all vehicles that look like medical transportation vehicles and regular vans, and conduct a thorough labeling of the returned results.
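At its core, similarity search ranks your dataset against a query image by comparing embeddings. A minimal sketch using cosine similarity (the embedding model itself is assumed; production systems would use an approximate-nearest-neighbor index rather than a full scan):

```python
import numpy as np

def most_similar(query: np.ndarray, embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k embeddings most similar to `query`,
    ranked by cosine similarity."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                       # cosine similarity per row
    return np.argsort(scores)[::-1][:top_k]
```

Querying with the embedding of one medical transportation vehicle would surface the look-alike vans worth a human's attention.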

6. What kind of labels do I need, and what quality?

When you set out to label your datasets, it may be tempting to request the most precise and detail-oriented labels. Take for example this image of a motorcycle on a showroom floor:

LEFT: A polygon annotation with over 100 points. MIDDLE: A coarser polygon annotation with closer to 30 points. RIGHT: A bounding box identifying the location of the motorcycle.

It may seem like a no-brainer that, in an ideal world, you should label your images with the most granular level of detail. And why shouldn’t you? Well, for starters, the label on the left is orders of magnitude more time-intensive and costly than the label on the right.

An autonomous vehicle application of the image above would likely require the highest level of precision possible. But if you’re just looking to differentiate between a Honda and a Kawasaki, bounding boxes will likely suffice — and will get you to production more quickly.

This is a simplified example, but the takeaway here is to resist the temptation to be over-prescriptive in your data annotation requirements, especially early in the process. Depending on your use case, you may find that a label type you hadn’t considered can help your models perform sufficiently well.

In short: avoid being over-prescriptive with your data annotation requirements until you’re sure what kind of labels your model needs.
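The cost difference is easy to see in the geometry itself: a bounding box is just the extremes of a polygon's vertices, so a 100-point polygon collapses to four numbers. A small illustrative sketch:

```python
def polygon_to_bbox(points):
    """Collapse a polygon annotation (a list of (x, y) vertices) into
    an axis-aligned bounding box (xmin, ymin, xmax, ymax) -- the
    cheaper, coarser label type discussed above."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

Going the other direction is impossible: a box cannot recover the object's outline, which is exactly the precision you pay annotators for when the use case demands it.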

7. How do I capture edge cases and deal with complexity?

Even with data pre-processing and curation, you’re likely to come across edge cases you had not accounted for in your annotation strategy.

Avoid over-designing your annotation strategy by attempting to predict every edge case (and dreaming up ones that will likely never occur). Instead, account for the range of variability in your production environment by putting a plan in place to uncover and quickly address edge cases as they arise.

Set aside budget and time to catch and resolve edge cases. And crucially: ensure your labeling process is iterative with tight feedback loops between your annotators and your ML engineers. Which brings me to my next point…

8. How do I manage ambiguity?

In addition to edge cases, your dataset is likely to contain ambiguous examples. Let’s say you’re a grocer who wants to label produce. Should these apples be labeled as red or green?

Identify ambiguous examples such as this as early as possible – and importantly, build tight feedback loops between subject matter experts and your annotators. Empower them to be in constant communication, capturing and iterating on instructions that can then be socialized with the rest of your annotation workforce.

Clear annotation instructions from the start are key, but you should always assume that ambiguity will arise. You’ll be happy you have lines of communication open when they do.

9. How can I make sure my model in production is still performing as expected?

As production data increasingly differs from the data your model was trained on — a phenomenon known as data drift — your model's performance will start to degrade. With supervised learning, a model cannot learn things it has not seen in training.


How can you avoid a situation like this? Be proactive in monitoring data drift. If you detect that the nature of your data in production is changing, enrich your model with more representative examples, retrain, and put it back into production.
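One common drift signal is the Population Stability Index (PSI), which compares the binned distribution of a feature in training data against production data. A minimal sketch (the 0.2 alert threshold is a widely used rule of thumb, not a universal constant, and `np.histogram` silently drops production values outside the training range — a simplification here):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI for one feature: compare the binned distribution of training
    data (`expected`) with production data (`actual`). Values near 0
    mean stable; by common convention, > 0.2 suggests drift worth a
    retraining review."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # avoid log(0) and division by zero on empty bins
    e_pct, a_pct = e_pct + eps, a_pct + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Run a check like this on key features (or on model confidence scores) on a schedule, and treat a sustained rise as the trigger to enrich and retrain.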

10. Who can I trust to annotate my training data?

For many organizations, technical challenges stand in the way of benefiting from the great data boom. Not all businesses have the resources or know-how to effectively turn the data they have into a competitive edge. As ML models become increasingly “off the shelf,” the real competitive advantage will lie with your data and what you do with it.

Your model is only as good as the data it’s trained on. There are many considerations when selecting a training data partner, and understanding what to look for is critical to the success of your training data strategy.

Sama is the only data labeling platform that solves for accuracy, efficiency, and ethics. We reduce time to quality using automation, advanced analytics, and a highly agile training data methodology. Our directly managed annotation workforce is selected and trained to become experts on your data.

Learn how to choose the data labeling partner that is right for you.

Related Resources

In-House vs Outsourcing Data Annotation for ML: Pros & Cons

Sama’s Experiment-Driven Approach to Solving for High-Quality Labels at Scale

ML Assisted Annotation Powered by MICROMODEL Technology