For companies striving to unlock the full potential of artificial intelligence, access to accurate and scalable datasets often represents a significant bottleneck. This is in part because many common approaches to data labeling come with drawbacks regarding accuracy, cost, and time investment.
In-house labeling approaches lean on the institutional knowledge of a skilled workforce to produce high-quality annotations. However, this approach can be expensive, time-consuming, and difficult to scale. Crowd-based methods are more cost-effective and scalable but come with their own set of disadvantages. Low-quality labels provided by massively distributed workforces can come with downstream risks such as a slower path to production and a lack of AI governance and ethics.
While navigating these tradeoffs may seem daunting, the artificial intelligence industry as a whole is maturing. And the good news is that as more organizations successfully implement ML to drive business goals, best practices for data annotation are crystallizing. This post describes the pros and cons of different annotation options and answers several key questions, including:
- What are the pros and cons of in-house data annotation?
- What are the pros and cons of outsourcing data annotation?
- What is the difference between crowdsourcing and managed workforces for data annotation?
- Are there benefits to a combined approach: in-house first, and outsourcing later?
- What should I consider when selecting an annotation partner?
What are the pros and cons of in-house data annotation?
In-house annotators — whether data scientists or a small dedicated team of labelers you’ve added to your own team or hired through a partner — have the advantage of being well versed in your business. They have a good understanding of your data and processes as well as the objectives of your machine learning initiatives.
There are some drawbacks to in-house annotation that merit consideration. The costs to hire, train, and retain annotation specialists can be significant, especially in cases where your own in-house data scientists are taking on annotation work, or where a partner has to hire data scientists to add to their own teams. Their time is better spent on data analytics and on building and fine-tuning the models that your labeled data will fuel. There will also be costs associated with sourcing your own annotation tool — whether you’re investing in a team to develop a tool in-house, using an open-source solution with limited features, or paying licensing costs to a labeling platform.
Managing an in-house annotation team can be time-consuming as well, especially if there is high turnover and the need to scale up the team during periods of peak demand for annotations. You will also need to set aside time for quality assurance regardless of whether it’s your own team or your vendor’s; in some cases, ML engineers can spend several hours reviewing annotations and providing feedback to annotators.
Beyond cost, there is the more subtle problem of bias. Annotators who are primarily exposed to your organization’s way of looking at data and the problem you are attempting to solve will adopt a mindset around labeling shaped by that perspective. This can lead to missed opportunities to create useful training examples that fall outside your norm. In this situation, hiring a vendor with its own in-house workforce of annotators can be especially advantageous.
Despite these drawbacks, in-house annotation confers some truly meaningful advantages, particularly if you hire a vendor with their own in-house annotators to handle the labeling tasks. It’s often the best option for earlier stages of the ML production lifecycle, when the volume of data is comparatively small and models are still being developed and fine-tuned.
Labeling data in-house using skilled annotators can give precious insights into potential model errors and edge cases, which can save time and money in the long run if they are tackled early enough. You can experiment and iterate quickly because the feedback loop can be lightning-fast. Annotators have direct access to the ML team, and they can work together to update instructions as unforeseen situations arise, saving hours of rework later on.
Finally, labeling your data with a properly vetted partner who employs in-house annotators gives you full control over your data and physical security. Here’s how and why:
- They’ll have a dedicated workforce for your project, meaning their attention is focused on your data alone, and the vendor can more easily, quickly, and directly address any misunderstandings that annotators may have with instructions;
- They’ll work on owned infrastructure as opposed to personal computers that operate outside the company’s network – this puts annotators’ work within the coverage of standardized, company-approved security measures and processes;
- There will be no reason for the in-house annotators to ever share data or instructions with other co-workers or clients to get clarity on instructions (as happens with outsourced/crowdsourced annotators);
- It will be easier to ensure that your data will not be used to train unauthorized models; and
- They will be able to easily manage and anonymize projects with dedicated codes and processes so that in-house annotators won’t have to know who they are working for. This is a much harder feat to achieve with crowdsourced contractors.
What are the pros and cons of outsourcing data annotation?
The need for large volumes of data and low-cost annotation has driven the growth of a variety of outsourced data labeling solutions, from crowdsourcing to business process outsourcing (BPO) solutions.
Crowdsourcing: Traditional crowdsourcing platforms optimize for quantity over quality; though clients can affordably access a large distributed third-party workforce for their machine learning projects, annotators do not often have domain expertise and resulting datasets lack quality control.
Business Process Outsourcing (BPO): BPO companies may offer more bespoke solutions, but implementation can be expensive and slow, and this approach is not optimized for scaling or the integration of new tools.
Outsourcing data labeling can provide a quick path to receiving a high volume of simple, low-context labeled data, which may suffice depending on your use case. A false negative in an autonomous vehicle or biomedical algorithm could mean life or death; in the case of an e-commerce chatbot, however, it may just result in poor customer service. In short, better quality training data generally leads to higher and more reliable model performance, especially for the same amount of data.
Since the weight and severity of a false negative differs across verticals, it’s important to define the level of data quality and domain expertise needed to train your algorithm as a part of your training data strategy.
It is important to note, however, that low-performing models are not the only potential downstream impact of poor data quality. Just as manufacturers have found when making outsourcing decisions based solely or primarily on costs, companies using outsourced, low-cost annotators can quickly run into unanticipated problems.
For starters, massively distributed annotation workforces often come hand-in-hand with opaque practices regarding AI governance and ethics. The complexity of the data procurement process combined with a lack of standards around equitable data supply chains has several downstream implications for essential but largely unseen annotators. For some, the decision to crowdsource the annotation process can result in unwittingly doing business with an unethical partner who does not follow fair labor practices.
Additionally, cutting corners early on can slow the path to production in later stages of ML model development. Crowdsourcing — especially when annotators are anonymous — does not lend itself to an agile labeling process. Many ML engineers prefer to stay close to their data in the early stages of their AI projects, with tight feedback loops to uncover and mitigate edge cases, iterate on labeling instructions, and ultimately get better results more quickly.
There is also the question of security. Your data is valuable intellectual property, especially if it is important enough to be a key component in your machine learning initiatives. Yet crowdsourcing typically relies on a large distributed workforce, making it difficult to control physical security measures: are all annotators labeling your data from a secure location, on a secure machine and network? How would such a vendor even identify a security breach of client data when their contractors aren’t subject to the IT and security monitoring inherent to company infrastructure?
The short answer is that there is no way to guarantee that there won’t be any data leak or even to know if such a leak occurred when you’re working with crowdsourced annotators. In fact, it’s been vividly shown that sensitive data entrusted to data annotation companies employing outsourced annotators can be (maliciously or unintentionally) leaked online. Obtaining the reassurance that your data — and your clients’ data — is secure becomes a challenge when your annotators remain anonymous.
What is the difference between crowdsourcing and managed workforces for data annotation?
The limitations of low-cost, crowdsourced annotation are clear and substantial, and the industry has responded by developing product-led, AI-driven alternatives. Businesses now have the option of using a service that combines the best qualities of existing services for their data annotation projects.
Labeling platforms with directly managed workforces prioritize innovation in their tech and place more emphasis on tight feedback loops and QA processes — consequently improving label quality, eliminating questionable, unethical labor practices, and mitigating data security risks.
This new category of labeling providers is product-led, often with a dedicated team of machine learning engineers building better annotation tools. This is an important point: the platform improves over time because the ML engineers building the service can do experiment-driven development, for example by using A/B tests, to continuously improve the platform.
These improvements are coupled with a highly skilled workforce that is trained in the specific domain they are annotating and directly managed by your team. The combination of domain-trained annotators and a platform designed to support experiments that further improve data quality is the key reason product-led services with domain expertise provide the highest-quality annotated data while keeping costs under control.
Are there benefits to a combined approach: in-house first, and outsourcing later?
In practice, there is no one-size-fits-all approach for deciding between in-house and outsourced data annotations, but Sama has seen a common pattern amongst our customers. Some clients find in-house labeling works best early in the project as they are refining their requirements and discovering edge cases. Once models perform well on a subset of their data and the clients have a good understanding of requirements, they then look for a partner to help them scale annotations.
Here are some of the most critical questions you need to ask yourself when you consider which approach or combination of approaches to take:
- What stage am I at in the ML production lifecycle?
- How much data do I need to annotate?
- How complex are my annotation requirements? What type of model am I gathering data to train?
- How critical is data security to the reputation and success of my project and organization?
- How much money and time do I have?
What should I consider when selecting an annotation partner?
If you’re convinced that you need to select a training data partner, you’ll have to consider a number of things. To start, when choosing an annotation partner, we recommend looking for the following to maximize the chances of success for your training data strategy:
A robust, AI-powered annotation platform. Look for indicators that the company is product-led: actively developing new techniques for improving annotation quality and validating their advances, for example, through A/B testing, technical conferences, or peer-reviewed publications.
A skilled workforce. Ensure that annotators are specifically trained on your use case, and that you can remain in constant communication with labelers to help monitor quality, respond to edge cases, and iterate on instructions as you go. Look for evidence that your annotation partner understands the impact of the data on the AI models that are being developed and trained.
Flexible engagement models. Labeling needs change frequently during the course of model development, and your partner should have the ability to customize workflows and QA processes accordingly. You want to find a partner that has gone through the iterative process many times, and can scale with your projects on demand if needed.
Rigorous quality assurance (QA) processes. These should combine automated, AI-powered checks with human-in-the-loop review.
Evidence of fair labor practices. Look for organizations that have documented ethical supply chains, verified by independent third-party review. Look for other indicators of sound business practices, for example, B Corp Certification.
Adherence to industry-standard security practices. Assess the vendor’s data retention practices; ideally, the annotation service should not retain your data. Verify they can comply with relevant industry and government regulations, such as GDPR for European Union customers. Finally, validate that they adhere to best practices for physical security, such as ISO-certified delivery centers, biometric authentication, and user authentication with 2FA.
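To make the QA recommendation above concrete, here is a minimal sketch of one common automated check: routing items by inter-annotator agreement. All names and the threshold are hypothetical, and real QA pipelines layer many more signals (gold tasks, model-assisted checks, reviewer audits) on top of a simple consensus rule like this.

```python
from collections import Counter

def triage(annotations, min_agreement=0.8):
    """Split labeled items into auto-accepted and human-review queues.

    annotations: dict mapping item_id -> list of labels from several annotators.
    min_agreement: hypothetical threshold; real values are tuned per project.
    """
    accepted, needs_review = {}, []
    for item_id, labels in annotations.items():
        # Majority label and its share of all annotations for this item.
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement >= min_agreement:
            accepted[item_id] = top_label   # consensus: accept automatically
        else:
            needs_review.append(item_id)    # disagreement: escalate to human QA
    return accepted, needs_review

accepted, needs_review = triage({
    "img_001": ["car", "car", "car"],       # unanimous -> auto-accepted
    "img_002": ["car", "truck", "car"],     # 2/3 agreement -> human review
})
```

The design point is that automation handles the easy, unanimous cases at scale, while humans-in-the-loop spend their time only on the items where annotators genuinely disagree.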
These are but a few of the dimensions you should consider when selecting an annotation partner who can deliver the high-quality annotations you need to get your ML models into production more quickly.
Why work with Sama?
Learn more about the differences between your annotation services options and why Sama has been chosen as an annotation partner by major corporations advancing the state of the art in applied machine learning.