We are proud to offer the Sama-Coco dataset, a relabeling of the Coco-2017 dataset by our own in-house Sama associates. We invite the Machine Learning (ML) community to use it for any purpose, free of charge and ungated.
Here’s a quick overview of the two datasets’ most important characteristics:
Figure: Number of instances per class (10 most frequent classes)
Sama-Coco’s Key Features
Some key features should be highlighted:
The number of images and the number of object classes are the same in Sama-Coco and the original Coco-2017 dataset.
Sama-Coco contains significantly more crowds. This is partly because our associates were tasked with decomposing large, single crowds into individual instances and smaller crowds. Although both datasets share the same base, Sama-Coco has more instances for 47 of the 80 classes. For some classes, such as person, the number of instances is significantly higher than in Coco-2017.
Associates were instructed to be more precise and comprehensive when annotating instances and crowds. As a result, the total number of vertices nearly doubled. The number of large objects also dropped significantly, because the individual members of big crowds or clusters of objects were relabeled as separate items.
There is a significant reduction in the number of very small objects (those measuring 10×10 pixels or less). We made a conscious choice at the outset to ask associates not to annotate such small objects, in an attempt to balance quality against the time allocated to labeling. We believe the significantly greater number of other small objects (between 10×10 and 32×32 pixels) and medium objects (between 32×32 and 96×96 pixels) in our dataset justifies this decision.
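The size ranges and per-class counts described above can be reproduced from any COCO-format annotation file. The following is a minimal sketch, not the script we used to produce our statistics: it assumes sizes are measured with each annotation's `area` field, that range boundaries are inclusive at the upper end, and that crowds (`iscrowd` set to 1) are tallied separately from instances. The `toy` dictionary is hand-made for illustration and does not contain real dataset figures.

```python
from collections import Counter

# Area thresholds in square pixels, taken from the size ranges in the text.
VERY_SMALL_MAX = 10 * 10   # "10x10 pixels or less"
SMALL_MAX = 32 * 32        # upper end of the small range
MEDIUM_MAX = 96 * 96       # upper end of the medium range


def size_bucket(area):
    """Map an annotation area (square pixels) to a size category.

    Assumption: boundaries are inclusive at the upper end of each range.
    """
    if area <= VERY_SMALL_MAX:
        return "very_small"
    if area <= SMALL_MAX:
        return "small"
    if area <= MEDIUM_MAX:
        return "medium"
    return "large"


def summarize(coco):
    """Count instances per class and per size bucket in a COCO-format dict."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    per_class = Counter()
    per_size = Counter()
    crowds = 0
    for ann in coco["annotations"]:
        if ann.get("iscrowd", 0):
            crowds += 1
            continue  # crowds are counted separately from instances
        per_class[names[ann["category_id"]]] += 1
        per_size[size_bucket(ann["area"])] += 1
    return per_class, per_size, crowds


# Hand-made example in COCO annotation format (illustrative values only).
toy = {
    "categories": [{"id": 1, "name": "person"}, {"id": 4, "name": "motorcycle"}],
    "annotations": [
        {"category_id": 1, "area": 64.0, "iscrowd": 0},
        {"category_id": 1, "area": 500.0, "iscrowd": 0},
        {"category_id": 4, "area": 5000.0, "iscrowd": 0},
        {"category_id": 1, "area": 20000.0, "iscrowd": 1},
    ],
}

per_class, per_size, crowds = summarize(toy)
print(per_class.most_common(10))  # the 10 most frequent classes
print(dict(per_size))
print(crowds)
```

To compare the two datasets, load each annotation file with `json.load` and pass the resulting dictionaries to `summarize`; the file names depend on where the annotations were downloaded.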
Illustrative Differences between Sama-Coco and Coco-2017
Here, we cover two images that are illustrative of some of the differences between Sama-Coco and Coco-2017.
In the first example, Coco labelers largely treated the group as a single crowd, whereas in Sama-Coco each person was labeled individually.
The second example shows the high level of precision with which most annotations were carried out. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.
How Sama-Coco was Labeled
Up to 500 associates revisited all 123,287 images, which were pre-loaded with annotations from the Coco-2017 dataset, and performed three key tasks. They had to:
Distinguish crowd from non-crowd images (note that both Sama-Coco and Coco-2017 loosely defined a crowd as a group of instances of the same class that are co-located).
Prioritize annotating instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first instances, up to that number, individually and then label the remainder as part of a crowd. The exact number of instances to annotate changed over the course of the project. This requirement was introduced to balance budget, time, and quality considerations.
Ignore objects that were smaller than 10×10 pixels (some associates deleted Coco-2017 pre-annotations for such small objects whereas others simply ignored them).
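The second and third tasks amount to a simple rule per class and per image, which can be sketched as follows. This is an illustration only, not the tooling our associates used. The parameter `max_individual` is hypothetical, since the actual cap changed over the course of the project, and "smaller than 10×10 pixels" is interpreted here as a bounding-box area below 100 square pixels, which is an assumption.

```python
MIN_AREA = 10 * 10  # square pixels; objects below this were to be ignored


def plan_annotations(boxes, max_individual):
    """Split one class's boxes in one image into individual instances and a crowd.

    boxes: list of (x, y, width, height) tuples, in the order encountered.
    max_individual: hypothetical cap on individually labeled instances.
    Returns (individual, crowd), both lists of boxes.
    """
    # Task 3: drop objects smaller than 10x10 pixels (area-based, an assumption).
    kept = [b for b in boxes if b[2] * b[3] >= MIN_AREA]
    # Task 2: label the first instances individually, the remainder as a crowd.
    individual = kept[:max_individual]
    crowd = kept[max_individual:]
    return individual, crowd


# Illustrative boxes for a single class in a single image.
boxes = [(0, 0, 20, 20), (5, 5, 5, 5), (10, 10, 30, 40), (0, 0, 12, 12), (1, 1, 50, 50)]
individual, crowd = plan_annotations(boxes, max_individual=2)
print(len(individual), "individual instances,", len(crowd), "boxes in the crowd")
```

When an image contains no more than `max_individual` instances of a class, the crowd list is empty and every remaining object is labeled individually.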
Sama-Coco Installation Instructions for the FiftyOne App
Load Sama-Coco directly from the FiftyOne app. You can explore all 123,287 images within FiftyOne and compare them side by side with the original MS Coco dataset.
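For readers who prefer the Python API, loading might look like the sketch below. It assumes FiftyOne is installed (`pip install fiftyone`) and that the dataset is registered in the FiftyOne Dataset Zoo under the name `sama-coco`; please confirm the name and the supported arguments against the FiftyOne documentation. The helper names `load_sama_coco` and `explore` are our own, chosen for illustration.

```python
def load_sama_coco(split="validation", max_samples=50):
    """Load a slice of Sama-Coco from the FiftyOne Dataset Zoo.

    Assumption: the zoo name is "sama-coco". max_samples keeps the first
    download small; pass None to load the whole split.
    """
    # Imported lazily so this file can be imported without FiftyOne installed.
    import fiftyone.zoo as foz

    return foz.load_zoo_dataset(
        "sama-coco",
        split=split,
        max_samples=max_samples,
    )


def explore(dataset):
    """Open the FiftyOne app on a dataset and block until it is closed."""
    import fiftyone as fo

    session = fo.launch_app(dataset)
    session.wait()


# Example usage (downloads data on first run):
# dataset = load_sama_coco()
# explore(dataset)
```

For a side-by-side comparison, the original dataset can be loaded in the same way by passing `"coco-2017"` as the zoo name to `load_zoo_dataset`.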