The Sama-Coco Dataset

We are proud to offer the Sama-Coco dataset, a relabelling of the Coco-2017 dataset by our own in-house Sama associates (here’s more information about our people!). We invite the Machine Learning (ML) community to use it for anything you would like to do – all free of charge and ungated.

This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. Here are the ungated links to the two datasets (both covered by the Creative Commons license) so that you can get started right away.

Download                            Coco-2017                             Sama-Coco
Validation Images                   2017 Val images [5K/1GB]
Train Images                        2017 Train images [118K/18GB]
Validation Detection Annotations    2017 Train/Val annotations [241MB]    sama-coco-val.zip [5.7MB]*
Train Detection Annotations         2017 Train/Val annotations [241MB]    sama-coco-train.zip [154.4MB]*

*Creative Commons License

Sama-Coco by the Numbers

Here’s a quick overview of the two datasets’ most important characteristics:

Overview                                        Coco-2017         Sama-Coco         Difference
Number of images                                123 287           123 287           0
Number of classes                               80                80                0
Number of classes with more objects annotated   33                47

Instances                                       Coco-2017         Sama-Coco         Difference
Number of instances (crowds included)           896 782           1 115 464         218 682 (x1.24)
Number of crowds                                10 498            47 428            36 930 (x4.5)
Objects composed of more than one polygon       86 156            175 698           89 952 (x2)
Number of vertices                              22 735 106        41 638 434        18 903 328 (x1.8)

Object Sizes                                    Coco-2017         Sama-Coco         Difference
Very small objects (<=10×10 pixels)             78 213            48 394            -29 819 (x0.6)
Small objects (<32×32 pixels)                   371 655 (41.4%)   555 006 (49.8%)   183 351 (x1.49)
Medium objects (>=32×32 and <96×96 pixels)      307 732 (34.3%)   354 290 (31.8%)   46 558 (x1.15)
Large objects (>=96×96 pixels)                  217 395 (24.2%)   206 168 (18.4%)   -11 227
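
The counts in the table above can be recomputed from the annotation files. What follows is a minimal sketch, not the method Sama used: it assumes that the Sama-Coco files follow the same COCO JSON layout as the Coco-2017 annotations (an "annotations" list whose entries carry "category_id", "iscrowd" and "segmentation"), which this post does not confirm. The helper names summarize and classes_with_more are hypothetical.

```python
from collections import Counter


def summarize(coco):
    """Recompute overview counts from a COCO-format annotation dict.

    Assumes the standard COCO layout; not confirmed for Sama-Coco.
    """
    crowds = multi_polygon = vertices = 0
    per_class = Counter()
    for ann in coco.get("annotations", []):
        per_class[ann["category_id"]] += 1
        if ann.get("iscrowd", 0) == 1:
            crowds += 1
        seg = ann.get("segmentation")
        # Polygon segmentations are lists of flat [x1, y1, x2, y2, ...] lists.
        # RLE segmentations are dicts and contribute no vertices here.
        if isinstance(seg, list):
            if len(seg) > 1:
                multi_polygon += 1
            vertices += sum(len(poly) // 2 for poly in seg)
    return {
        "images": len(coco.get("images", [])),
        "classes": len(coco.get("categories", [])),
        "instances": sum(per_class.values()),
        "crowds": crowds,
        "multi_polygon": multi_polygon,
        "vertices": vertices,
        "per_class": per_class,
    }


def classes_with_more(a, b):
    """Return the category ids for which summary a has more instances than b."""
    ids = set(a["per_class"]) | set(b["per_class"])
    return sorted(c for c in ids if a["per_class"][c] > b["per_class"][c])
```

After extracting the archives, each annotation file could be read with json.load and passed to summarize; comparing two summaries with classes_with_more gives the kind of per-class comparison reported in the "Number of classes with more objects annotated" row. The file names inside the archives are not given in this post.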

Sama-Coco’s Key Features

Some key features should be highlighted:

  • The number of images and the number of object classes are the same in Sama-Coco and the original Coco-2017 dataset.
  • The number of crowds is significantly greater in Sama-Coco (47 428 versus 10 498). This is partly because our associates were tasked with decomposing large single crowds into individual instances and smaller crowds. Although both datasets are built on the same images, Sama-Coco has more instances for 47 of the 80 classes. In some cases, such as the person class, the number of instances is significantly higher than in Coco-2017.
  • Associates were instructed to be more precise and comprehensive when annotating instances and crowds. As a result, the total number of vertices nearly doubled (x1.8). The number of large objects also dropped, by 11 227 (about 5%), as the individual members of big crowds or clusters of objects were relabeled as their own items.
  • There is a significant reduction in the number of very small objects, those measuring 10×10 pixels or less (48 394 versus 78 213). It was a conscious choice at the outset to ask associates not to annotate such small objects. This decision balanced quality against the time allocated to labeling, and we believe it is justified by the significantly greater number of other small objects (between 10×10 and 32×32 pixels) and medium objects (between 32×32 and 96×96 pixels) in our dataset.
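
The size categories discussed above can be expressed as a small bucketing function. This is a sketch under stated assumptions: it assumes size is judged by an area value compared against the 10×10, 32×32 and 96×96 pixel thresholds (as in the COCO evaluation convention) and that the value is read from each annotation's "area" field. The post does not say whether Sama measured mask area or bounding-box area, and the names size_bucket, is_very_small, bucket_counts and share are hypothetical.

```python
from collections import Counter

VERY_SMALL_MAX = 10 * 10  # "very small": area <= 10x10 pixels
SMALL_MAX = 32 * 32       # "small": area < 32x32 pixels
MEDIUM_MAX = 96 * 96      # "medium": area < 96x96 pixels, "large" otherwise


def size_bucket(area):
    """Map an object area in pixels to small, medium or large."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


def is_very_small(area):
    """Very small objects are a subset of the small bucket."""
    return area <= VERY_SMALL_MAX


def bucket_counts(annotations):
    """Count annotations per size bucket, plus the very small subset."""
    counts = Counter(size_bucket(a["area"]) for a in annotations)
    counts["very_small"] = sum(is_very_small(a["area"]) for a in annotations)
    return counts


def share(count, total):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)
```

For example, share(555006, 1115464) gives 49.8, the percentage reported for small objects in Sama-Coco.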

Illustrative Differences between Sama-Coco and Coco-2017

Here are two images that illustrate some of the differences between Sama-Coco and Coco-2017.

In this first example, the Coco labelers largely treated the group of people as a single crowd, whereas in Sama-Coco each person was individually labeled.

This second example shows how most annotations were carried out with a high level of precision. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.

How Sama-Coco was Labeled

We revisited all 123 287 images, which came pre-loaded with their annotations from the Coco-2017 dataset. Up to 500 associates worked on them, performing three key tasks. They had to:

  • Distinguish crowd from non-crowd images (note that both Sama-Coco and Coco-2017 loosely defined a crowd as a group of instances of the same class that are co-located).
  • Prioritize annotating instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first such instances individually and then label the remainder as part of a crowd. The exact number of instances to annotate changed over the course of the project. This requirement was introduced to balance budget, time, and quality considerations.
  • Ignore objects that were 10×10 pixels or smaller (some associates deleted Coco-2017 pre-annotations for such small objects whereas others simply ignored them).
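
The second and third rules above can be sketched in code. This is only an illustration under stated assumptions, not Sama's actual tooling: the per-image limit max_individual is a hypothetical parameter (the real number changed during the project and is not given here), "first" is taken to mean list order, and an object is treated as too small when its bounding-box area is at most 10×10 pixels. The function name plan_labels is also hypothetical.

```python
def plan_labels(boxes, max_individual, min_area=10 * 10):
    """Split same-class instances in one image into label groups.

    boxes: list of (width, height) tuples, in the order they are encountered.
    max_individual: hypothetical per-image limit; the real value varied.
    min_area: objects with width * height at or below this are ignored.
    Returns a dict of index lists: individual, crowd and ignored.
    """
    plan = {"individual": [], "crowd": [], "ignored": []}
    for index, (width, height) in enumerate(boxes):
        if width * height <= min_area:
            # Too small to annotate under the third rule.
            plan["ignored"].append(index)
        elif len(plan["individual"]) < max_individual:
            # Instances take priority over crowds until the limit is reached.
            plan["individual"].append(index)
        else:
            # Everything past the limit is grouped into a crowd.
            plan["crowd"].append(index)
    return plan
```

With five candidate objects and a limit of two, the two earliest objects that are large enough would be labeled individually, the tiny one would be skipped, and the rest would be grouped as a crowd.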

Please Give Us Your Feedback!

We’d love to hear from you about your experience with Sama-Coco! Please contact [email protected] with your feedback. Thanks!