Sama is more than just a data annotation company
We’ve had quite a year at Sama. We’ve expanded, greatly enhanced our platform’s technology, and have been growing our presence in data annotation circles. But what we really want the world to know is that pushing the envelope in generating data for machine learning is a key part of our DNA. We think long and work hard on improving processes and resources. Now it’s time to introduce a key result of our work: the Sama-Coco dataset.
Introducing the Sama-Coco dataset
Many of you will have heard of and used the Coco dataset. Now we’re proud to release a relabelling of the Coco-2017 dataset, this one by our very own in-house Sama associates (here’s more information about our people!). And we want to invite the Machine Learning (ML) community to use it for anything you would like to do – all free of charge and ungated.
This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. We’ve already started to use it to explore the impact of data quality on model performance, and we’ll be publishing the results of those studies soon. To get started, here are the ungated links to the Sama-Coco and original Coco-2017 datasets so that you can get right to them.
What does the Sama-Coco dataset look like?
One of our main aims was to have our associates re-annotate the original Coco-2017 images with precise polygons and to study the result. This produced a very different dataset, whose characteristics are summarized in the table below.
|Metric||Coco-2017||Sama-Coco||Difference|
|Number of images||123 287||123 287||0|
|Number of classes||80||80||0|
|Number of classes with more objects annotated||33||47||–|
|Number of instances||896 782||1 115 464||218 682 (x1.24)|
|Number of crowds||10 498||47 428||36 930 (x4.5)|
|Objects composed of more than one polygon||86 156||175 698||89 952 (x2)|
|Number of vertices||22 735 106||41 638 434||18 903 328 (x1.8)|
|Very small objects (<= 10×10 pixels)||78 213||48 394||-29 819 (x0.6)|
|Small objects (< 32×32 pixels)||371 655 (41.4%)||555 006 (49.8%)||183 351 (x1.49)|
|Medium objects (>= 32×32 and < 96×96 pixels)||307 732 (34.3%)||354 290 (31.8%)||46 558 (x1.15)|
|Large objects (>= 96×96 pixels)||217 395 (24.2%)||206 168 (18.4%)||-11 227|
Some key features should be highlighted:
- The number of images and the number of object classes are the same in both the Sama-Coco and the original Coco-2017 datasets.
- Both the number of instances and the number of crowds are significantly greater in Sama-Coco. This is partially because our associates were tasked with decomposing large, singular crowds into individual elements and smaller crowds. They were also instructed to be more precise and comprehensive when annotating instances and crowds.
- As a consequence, the total number of vertices in Sama-Coco rose substantially, to roughly 1.8 times the Coco-2017 count. The number of large objects dropped, as the individual members or elements of big crowds or clusters of objects were relabeled as their own unique items.
- There is a significant reduction in the number of very small objects – those measuring 10×10 pixels or less. It was a conscious choice at the outset to ask associates not to annotate such small objects. We were attempting to balance quality and time allocated to labeling when we made this decision. We believe the significantly greater number of other small objects (between 10×10 and 32×32 pixels) and medium objects (between 32×32 and 96×96 pixels) that emerged in our dataset justifies this course of action.
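For readers who want to reproduce this kind of summary on their own copy of either dataset, here is a minimal sketch. It assumes the annotations ship as a standard COCO-format JSON file (with `images`, `categories`, and `annotations` keys), that size buckets are decided by each annotation's `area` field using the 32×32 and 96×96 thresholds, and that "very small" means an area of 10×10 pixels or less. The function names are ours for illustration only, and this is not necessarily how the figures above were computed.

```python
import json


def load_annotations(path):
    """Read a COCO-format annotation file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def size_bucket(area):
    """Assumed size buckets, based on pixel area."""
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


def summarize(coco):
    """Count instances, crowds, polygons, vertices, and size buckets."""
    anns = coco["annotations"]
    sizes = {"small": 0, "medium": 0, "large": 0}
    multi_polygon = 0
    vertices = 0
    for ann in anns:
        sizes[size_bucket(ann["area"])] += 1
        seg = ann.get("segmentation")
        # Polygon segmentations are lists of flat [x1, y1, x2, y2, ...] lists.
        # RLE masks are stored as dicts and are skipped for vertex counting.
        if isinstance(seg, list):
            if len(seg) > 1:
                multi_polygon += 1
            vertices += sum(len(polygon) // 2 for polygon in seg)
    return {
        "images": len(coco["images"]),
        "classes": len(coco["categories"]),
        "instances": len(anns),
        "crowds": sum(1 for ann in anns if ann.get("iscrowd", 0) == 1),
        "multi_polygon_objects": multi_polygon,
        "vertices": vertices,
        "very_small": sum(1 for ann in anns if ann["area"] <= 10 * 10),
        "small": sizes["small"],
        "medium": sizes["medium"],
        "large": sizes["large"],
    }
```

Calling `summarize(load_annotations(path))` on each dataset's annotation file, where `path` is wherever you saved the download, gives two dictionaries that can be compared row by row.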
Of course, seeing is believing. Here are two illustrative examples of the differences between the two datasets.
In this first example, Coco labellers largely treated this as one singular crowd, whereas in Sama-Coco, each person was individually labeled.
This second example shows how most annotations were carried out with an acute level of precision. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.
How was Sama-Coco generated?
Up to 500 associates revisited all 123,287 images, which were pre-loaded with the annotations from the Coco-2017 dataset, and followed this procedure:
- Distinguish crowd from non-crowd images. Coco loosely defines a crowd as a group of instances of the same class that are co-located.
- Prioritize annotating individual instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first such instances individually and then label the remainder as part of a crowd. The exact number of instances to annotate individually changed over the course of the project; this requirement was set to balance budget, time, and quality considerations.
- Ignore objects that were smaller than 10×10 pixels (it should be noted that some associates deleted Coco-2017 pre-annotations for such small objects whereas others simply ignored them).
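The instance-versus-crowd rule in the procedure above can be sketched in a few lines. This is only an illustration, not the tooling our associates used: the function name `triage`, the `max_individual` parameter (the per-class limit, which changed during the project and is not stated here), and the reading of "smaller than 10×10 pixels" as an area below 100 square pixels are all assumptions.

```python
from collections import Counter


def triage(objects, max_individual, min_area=10 * 10):
    """Split one image's objects into individual, crowd, and ignored lists.

    `objects` is an iterable of (category, area) pairs in labeling order.
    `max_individual` is a hypothetical per-class limit for one image.
    """
    individual, crowd, ignored = [], [], []
    labeled_so_far = Counter()
    for category, area in objects:
        if area < min_area:
            # Too small to annotate under the assumed 10x10 threshold.
            ignored.append((category, area))
        elif labeled_so_far[category] < max_individual:
            # Still under the per-class limit: label as its own instance.
            labeled_so_far[category] += 1
            individual.append((category, area))
        else:
            # Limit reached for this class: the remainder goes into a crowd.
            crowd.append((category, area))
    return individual, crowd, ignored
```

For example, with `max_individual=2`, an image containing four people and one dog would yield two individually labeled people, the dog as its own instance, and the remaining people folded into a crowd, minus any that fall under the size threshold.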
We strive to improve our annotation operations through research, experimentation, and qualification as we try to advance knowledge in the broader ML community. Given Coco-2017’s status as a well-established benchmark, relabelling it with our quality rubric was an opportunity to produce a dataset that’s simultaneously familiar and distinct. While both datasets share the same base images, Sama-Coco has more instances for 47 of the 80 classes and shows a marked overall improvement in the accuracy of polygon annotations. In some cases, such as the person class, the number of instances is significantly higher than in Coco-2017. Because Sama-Coco is distinct from Coco-2017, we anticipate that ML practitioners will find each of the two datasets suitable for different tasks.
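If you would like to check per-class differences yourself, the sketch below counts instances per class in two annotation sets and lists the classes where the first has more. It assumes both are standard COCO-format dicts (for example, loaded with `json.load`) that use the same category names; the function names are illustrative.

```python
from collections import Counter


def instances_per_class(coco):
    """Map each category name to its number of annotations."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


def classes_with_more(first, second):
    """Return sorted class names where `first` has strictly more instances."""
    first_counts = instances_per_class(first)
    second_counts = instances_per_class(second)
    all_names = first_counts.keys() | second_counts.keys()
    return sorted(n for n in all_names if first_counts[n] > second_counts[n])
```

Running `classes_with_more(sama_coco, coco_2017)` and then swapping the arguments shows which classes gained and which lost instances in the relabelling.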
It has already been a useful dataset for us as we leverage it in studies of annotation quality. We’re sure it will be useful for other AI practitioners with similar aims – stay tuned for the results of our quality experiments.
And so, we invite you all to download and use the dataset from the links above.
We’d love to hear from you about your experience with Sama-Coco! Please contact [email protected] with your feedback.