For companies striving to unlock the full potential of artificial intelligence, access to accurate and scalable datasets often represents a significant bottleneck. This is in part because many common approaches to data labeling come with drawbacks in terms of accuracy, cost, and time investment.
In-house labeling approaches lean on the institutional knowledge of a skilled workforce to produce high-quality annotations. However, this approach can be expensive, time-consuming, and difficult to scale. Crowd-based methods are more cost-effective and scalable but come with their own set of disadvantages. Low-quality labels provided by massively distributed workforces can come with downstream risks such as a slower path to production and a lack of AI governance and ethics.
While navigating these tradeoffs may seem daunting, the artificial intelligence industry as a whole is maturing. And the good news is that as more organizations successfully implement ML to drive business goals, best practices for data annotation are crystallizing. This post describes the pros and cons of different annotation options and answers several key questions, including:
- What are the pros and cons of in-house data annotation?
- What are the pros and cons of outsourcing data annotation?
- What is the difference between crowdsourcing and managed workforces for data annotation?
- Are there benefits to a combined approach: in-house first, and outsourcing later?
- What should I consider when selecting an annotation partner?
What are the pros and cons of in-house data annotation?
In-house annotators — whether data scientists or a small dedicated team of labelers — have the advantage of being well versed in your business. They have a good understanding of your data and processes as well as the objectives of your machine learning initiatives.
In-house annotation is often the best option for earlier stages of the ML production lifecycle, when the volume of data is small and models are still being developed and fine-tuned. Labeling data in-house can give precious insights into potential model errors and edge cases, which can save time and money in the long run if they are tackled early enough. You can experiment and iterate quickly because the feedback loop can be lightning-fast. Annotators have direct access to the ML team, and they can work together to update instructions, saving hours of rework later on. Finally, labeling your data in-house gives you full control over your data and physical security.
There are, however, significant drawbacks to in-house annotation. The costs to hire, train, and retain annotation specialists can be significant, especially in cases where in-house data scientists are taking on annotation work. Their time is better spent on data analytics and on building and fine-tuning the models that your labeled data will fuel. There will also be costs associated with sourcing your own annotation tool — whether you’re investing in a team to develop a tool in-house, using an open-source solution with limited features, or paying licensing costs to a labeling platform.
Managing an in-house annotation team can be time-consuming as well, especially if there is high turnover due to the seasonal nature of the work and the need to scale up the team at periods of peak demand for annotations. You will also need to set aside time for quality assurance; in some cases, ML engineers can spend several hours reviewing annotations and providing feedback to internal annotators.
Beyond cost, there is the more subtle problem of bias. Annotators who are primarily exposed to your organization’s way of looking at data and the problem you are attempting to solve will adopt a mindset around labeling shaped by that perspective. This can lead to missed opportunities to create useful training examples that fall outside your norm.
What are the pros and cons of outsourcing data annotation?
The need for large volumes of data and low-cost annotation has driven the growth of a variety of outsourced data labeling solutions, from crowdsourcing to business process outsourcing (BPO) solutions.
Crowdsourcing: Traditional crowdsourcing platforms optimize for quantity over quality; though clients can affordably access a large distributed third-party workforce for their machine learning projects, annotators do not often have domain expertise and resulting datasets lack quality control.
Business Process Outsourcing (BPO): BPO companies may offer more bespoke solutions, but implementation can be expensive and slow, and this approach is not optimized for scaling or the integration of new tools.
Outsourcing data labeling can provide a quick path to receiving a high volume of simple, low-context labeled data, which may suffice depending on your use case. A false negative in an autonomous vehicle or biomedical algorithm could mean life or death; in the case of an e-commerce chatbot, however, it may just result in poor customer service. In short, better quality training data generally leads to higher and more reliable model performance, especially for the same amount of data.
Since the weight and severity of a false negative differs across verticals, it’s important to define the level of data quality and domain expertise needed to train your algorithm as a part of your training data strategy.
It is important to note, however, that low-performing models are not the only potential downstream impact of poor data quality. Just as manufacturers have found when making outsourcing decisions based solely or primarily on costs, companies using outsourced, low-cost annotators can quickly run into unanticipated problems.
For starters, massively distributed annotation workforces often come hand-in-hand with opaque practices with regard to AI governance and ethics. The complexity of the data procurement process, combined with a lack of standards around equitable data supply chains, has several downstream implications for essential but largely unseen annotators. For some, the decision to crowdsource the annotation process can result in unwittingly doing business with an unethical partner who does not follow fair labor practices.
Additionally, cutting corners early on can slow the path to production in later stages of ML model development. Crowdsourcing — especially when annotators are anonymous — does not lend itself to an agile labeling process. Many ML engineers prefer to stay close to their data in the early stages of their AI projects, with tight feedback loops to uncover and mitigate edge cases, iterate on labeling instructions, and ultimately get better results more quickly.
There is also the question of security. Your data is valuable intellectual property, especially if it is important enough to be a key component in your machine learning initiatives. Yet crowdsourcing typically relies on a large distributed workforce, making it difficult to control physical security measures: are all annotators labeling your data from a secure location, on a secure machine and network? There is no way to guarantee that there won’t be any data leak or even to know if such a leak occurred. Obtaining the reassurance that your data — and your clients’ data — is secure becomes a challenge when your annotators remain anonymous.
What is the difference between crowdsourcing and managed workforces for data annotation?
The limitations of low-cost, crowdsourced annotation are clear and substantial, and the industry has responded by developing product-led, AI-driven alternatives. Businesses now have the option of using a service that combines the best qualities of existing services for their data annotation projects.
Labeling platforms with directly managed workforces prioritize innovation in their tech and place more emphasis on tight feedback loops and QA processes — consequently improving label quality, eliminating questionable, unethical labor practices, and mitigating data security risks.
This new category of labeling providers is product-led, often with a dedicated team of machine learning engineers building better annotation tools. This is an important point: the platform improves over time because the ML engineers building the service can do experiment-driven development, for example by using A/B tests, to continuously improve the platform.
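To make the A/B-testing idea concrete, here is a toy sketch of how a platform team might compare label accuracy between two tool variants using a two-proportion z-test. The function name and the review counts are hypothetical illustrations, not part of any particular vendor's workflow.

```python
from math import sqrt

def ab_test_label_quality(correct_a, total_a, correct_b, total_b):
    """Two-proportion z-test: does tool variant B produce labels with
    significantly different accuracy than variant A?"""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled accuracy under the null hypothesis of no difference
    p = (correct_a + correct_b) / (total_a + total_b)
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    return p_a, p_b, z

# Hypothetical QA review counts: 1,840/2,000 labels correct with the
# current tool (A) vs. 1,910/2,000 with a new assistive feature (B).
p_a, p_b, z = ab_test_label_quality(1840, 2000, 1910, 2000)
print(f"accuracy A={p_a:.1%}, B={p_b:.1%}, z={z:.2f}")
```

A |z| value above roughly 1.96 would indicate the accuracy difference is significant at the 5% level, giving the team evidence to roll out the new variant.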
These improvements are coupled with a highly skilled workforce that is trained in the specific domain it annotates and directly managed by your team. The combination of domain-trained annotators and a platform designed to support experiments that further improve data quality is the key reason product-led services with domain expertise provide the highest-quality annotated data while keeping costs under control.
Are there benefits to a combined approach: in-house first, and outsourcing later?
In practice, there is no one-size-fits-all approach for deciding between in-house and outsourced data annotation, but Sama has seen a common pattern among our customers. Some clients find in-house labeling works best early in the project, as they are refining their requirements and discovering edge cases. Once their models perform well on a subset of their data and they have a good understanding of requirements, these clients then look for a partner to help them scale annotations.
Borrow the keys to our ML-powered platform with Sama Go Beta
Get started quickly with test sets, stay close to your data to uncover edge cases early, and scale up annotations strategically as you experiment with your models. When you’re ready to scale, hit the Boost button so we can convert all your hard work into what we do best: quality annotation at scale, delivered by our workforce of highly skilled, trained annotators.
What should I consider when selecting an annotation partner?
There are many considerations when selecting a training data partner, and understanding what to look for is critical to the success of your training data strategy. To start, when choosing an annotation partner, we recommend looking for:
A robust, AI-powered annotation platform. Look for indicators that the company is product-led: actively developing new techniques for improving annotation quality and validating their advances, for example, through A/B testing, technical conferences, or peer-reviewed publications.
A skilled workforce. Ensure that annotators are specifically trained on your use case, and that you can remain in constant communication with labelers to help monitor quality, respond to edge cases, and iterate on instructions as you go. Look for evidence that your annotation partner understands the impact of the data on the AI models that are being developed and trained.
Flexible engagement models. Labeling needs change frequently during the course of model development, and your partner should have the ability to customize workflows and QA processes accordingly. You want to find a partner that has gone through the iterative process many times, and can scale with your projects on demand if needed.
Rigorous quality assurance (QA) processes. These should combine automated, AI-powered QA with human-in-the-loop review.
Evidence of fair labor practices. Look for organizations that have documented ethical supply chains, verified by independent third-party review. Look for other indicators of sound business practices, for example, B Corp Certification.
Adherence to industry-standard security practices. Assess the data retention practices of the annotation provider. Ideally, the annotating service should not retain your data. Verify they can comply with relevant industry and government regulations, such as GDPR for European Union customers. Finally, validate that they adhere to best practices for physical security, such as ISO-certified delivery centers, biometric authentication, and user authentication with 2FA.
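As one illustration of how automated QA and human review can work together, the toy sketch below accepts labels when annotators agree and routes low-agreement items to human reviewers. The function name, the two-thirds agreement threshold, and the sample labels are all hypothetical assumptions, not a description of any specific vendor's pipeline.

```python
from collections import Counter

def qa_consensus(labels_per_item):
    """Toy consensus QA pass: accept items where annotators agree,
    flag low-agreement items for human-in-the-loop review."""
    accepted, needs_review = {}, []
    for item_id, labels in labels_per_item.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= 2 / 3:  # agreement threshold (assumption)
            accepted[item_id] = label
        else:
            needs_review.append(item_id)
    return accepted, needs_review

# Three annotators per image; "img_2" has no clear majority,
# so it is flagged for a human reviewer.
accepted, review = qa_consensus({
    "img_1": ["cat", "cat", "cat"],
    "img_2": ["cat", "dog", "bird"],
    "img_3": ["dog", "dog", "cat"],
})
print(accepted, review)
```

In practice, production QA pipelines layer richer signals (annotator track records, model-assisted checks) on top of simple agreement rules like this one.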
These are but a few of the dimensions you should consider when selecting an annotation partner who can deliver the high-quality annotations you need to get your ML models into production more quickly.
Why work with Sama?
Learn more about the differences between your annotation services options and why Sama has been chosen as an annotation partner by major corporations advancing the state of the art in applied machine learning.