Treat not a trick – the brand new Sama-Coco dataset

Sama is more than just a data annotation company

We’ve had quite a year at Sama. We’ve expanded, greatly enhanced our platform’s technology, and have been growing our presence in data annotation circles. But what we really want the world to know is that pushing the envelope in generating data for machine learning is a key part of our DNA. We think long and work hard on improving processes and resources. Now it’s time to introduce a key result of our work: the Sama-Coco dataset.


Introducing the Sama-Coco dataset

Many of you will have heard of and used the Coco dataset. Now we’re proud to release a relabelling of the Coco-2017 dataset, this one by our very own in-house Sama associates (here’s more information about our people!). And we want to invite the Machine Learning (ML) community to use it for anything you would like to do – all free of charge and ungated.

This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. We’ve already started to use it to explore the impact of data quality on model performance, and we’ll be publishing the results of those studies soon. To get started, here are the ungated links to the Sama-Coco and original Coco-2017 datasets so that you can get right to them.

                                   Coco-2017                             Sama-Coco
Validation images                  2017 Val images [5K/1GB]              —
Train images                       2017 Train images [118K/18GB]         —
Validation detection annotations   2017 Train/Val annotations [241MB]    [5.7MB]*
Train detection annotations        —                                     [154.4MB]*

*Creative Commons License
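Both annotation files follow the standard COCO JSON schema, so no special tooling is needed to read them. As a minimal sketch (the helper name and any file path you pass in are illustrative, not part of the release):

```python
import json
from collections import defaultdict

def load_coco(path):
    """Load a COCO-format annotation file and index its annotations by image id."""
    with open(path) as f:
        data = json.load(f)
    # group annotations per image for convenient per-image access
    anns_by_image = defaultdict(list)
    for ann in data["annotations"]:
        anns_by_image[ann["image_id"]].append(ann)
    # map category id -> human-readable class name
    categories = {c["id"]: c["name"] for c in data["categories"]}
    return data, anns_by_image, categories
```

For example, `load_coco("sama_coco_val.json")` (a hypothetical local filename) returns the raw dict, a per-image annotation index, and a class-name lookup.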

What does the Sama-Coco dataset look like?

One of our main aims was to have our associates re-annotate the original Coco-2017 images with precise polygons. The result is a very different dataset, whose characteristics are summarized in the tables below.

Overview                                    Coco-2017    Sama-Coco    Difference
Number of images                            123,287      123,287      0
Number of classes                           80           80           0
Classes with more objects annotated         33           47           —

Instances                                   Coco-2017    Sama-Coco    Difference
Number of instances (crowds included)       896,782      1,115,464    +218,682 (x1.24)
Number of crowds                            10,498       47,428       +36,930 (x4.5)
Objects composed of more than one polygon   86,156       175,698      +89,542 (x2)
Number of vertices                          21,726,743   40,258,235   +18,531,492 (x1.85)

Object Sizes                                Coco-2017         Sama-Coco         Difference
Very small objects (<=10×10 pixels)         78,213            48,394            -29,819 (x0.6)
Small objects (<32×32 pixels)               371,655 (41.4%)   555,006 (49.8%)   +183,351 (x1.49)
Medium objects (>=32×32 and <96×96 pixels)  307,732 (34.3%)   354,290 (31.8%)   +46,558 (x1.15)
Large objects (>=96×96 pixels)              217,395 (24.2%)   206,168 (18.4%)   -11,227 (x0.95)
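The instance-level figures above can be recomputed from any COCO-format annotation file. Here is a sketch of the counting logic, assuming the standard COCO fields: `iscrowd`, and `segmentation` stored as a list of flat polygon coordinate lists for non-crowd objects (crowd segmentations use RLE dicts and so contribute no polygon vertices):

```python
def instance_stats(annotations):
    """Count instances, crowds, multi-polygon objects, and polygon vertices
    in a list of COCO-format annotation dicts."""
    n_instances = len(annotations)
    n_crowds = sum(1 for a in annotations if a.get("iscrowd", 0) == 1)
    multi_polygon = 0
    n_vertices = 0
    for a in annotations:
        seg = a.get("segmentation")
        if isinstance(seg, list):  # polygon format; RLE crowds are dicts
            if len(seg) > 1:
                multi_polygon += 1
            # each polygon is a flat [x1, y1, x2, y2, ...] list
            n_vertices += sum(len(poly) // 2 for poly in seg)
    return {
        "instances": n_instances,
        "crowds": n_crowds,
        "multi_polygon_objects": multi_polygon,
        "vertices": n_vertices,
    }
```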

Some key features should be highlighted:

  • The core number of images and item classes are the same across both the Sama-Coco and the original Coco-2017 datasets.
  • The numbers of instances and crowds are significantly greater in Sama-Coco. This is partly because our associates were tasked with decomposing large, singular crowds into smaller individual elements and smaller crowds. They were also instructed to be more precise and comprehensive when annotating instances and crowds.
  • As a consequence, the total number of vertices in Sama-Coco nearly doubled, and the number of large objects dropped significantly, as the individual members of big crowds or clusters of objects were relabeled as their own unique items.
  • There is a significant reduction in the number of very small objects – those measuring 10×10 pixels or less. It was a conscious choice at the outset to ask associates not to annotate such small objects. We were attempting to balance quality and time allocated to labeling when we made this decision. We believe the significantly greater number of other small objects (between 10×10 and 32×32 pixels) and medium objects (between 32×32 and 96×96 pixels) that emerged in our dataset justifies this course of action.
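The size buckets above can be reproduced by thresholding each annotation's pixel area. A sketch, assuming cut-offs of 10² = 100, 32² = 1,024, and 96² = 9,216 pixels to match the table's boundaries (the function name is ours, not part of the dataset):

```python
def size_bucket(area):
    """Classify an object's pixel area into the size buckets used above."""
    if area <= 100:    # <= 10x10 px: "very small"
        return "very_small"
    if area < 1024:    # < 32x32 px
        return "small"
    if area < 9216:    # >= 32x32 and < 96x96 px
        return "medium"
    return "large"     # >= 96x96 px
```

Counting buckets over a whole annotation file is then a one-liner with `collections.Counter`, e.g. `Counter(size_bucket(a["area"]) for a in annotations)`.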

Of course, seeing is believing. Here are two illustrative examples of the differences between the two datasets.

In this first example, Coco labellers largely treated this as one singular crowd, whereas in Sama-Coco, each person was individually labeled.

This second example shows how most annotations were carried out with an acute level of precision. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.

How was Sama-Coco generated?

We revisited all 123,287 images, pre-loaded with their Coco-2017 annotations, with up to 500 associates performing the following procedure:

  • Distinguish crowd from non-crowd images. Coco loosely defines a crowd as a group of instances of the same class that are co-located.
  • Prioritize annotating instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first of such instances individually and then label the remainder as part of a crowd. The exact number of instances to annotate changed across the project; this rule was adopted to balance budget, time, and quality considerations.
  • Ignore objects that were smaller than 10×10 pixels (it should be noted that some associates deleted Coco-2017 pre-annotations for such small objects whereas others simply ignored them).
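As a rough illustration, the rules above could be expressed as a post-hoc filter over COCO-format annotations. The helper and the `max_individual` cap are hypothetical; as noted, the real per-class threshold varied during the project:

```python
def apply_protocol(anns, max_individual=10, min_side=10):
    """Sketch of the relabelling rules: drop objects smaller than
    min_side x min_side pixels, keep up to max_individual instances per
    class as individual objects, and flag the remainder as crowd members."""
    # drop tiny objects (bbox is [x, y, width, height])
    kept = [a for a in anns if a["bbox"][2] >= min_side and a["bbox"][3] >= min_side]
    seen_per_class = {}
    out = []
    for a in kept:
        cid = a["category_id"]
        seen_per_class[cid] = seen_per_class.get(cid, 0) + 1
        a = dict(a)  # copy so the input list is left untouched
        a["iscrowd"] = 1 if seen_per_class[cid] > max_individual else 0
        out.append(a)
    return out
```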

Why Sama-Coco?

We strive to improve our annotation operations through research, experimentation, and qualification as we try to advance knowledge in the broader ML community. Given Coco-2017’s status as a well-established benchmark, relabelling it with our quality rubric was an opportunity to produce a dataset that’s simultaneously familiar and distinct. While both datasets share the same base, Sama-Coco has more instances for 47 of the 80 classes and shows a marked overall improvement in the accuracy of polygon annotations. In some cases, such as the person class, the number of instances is significantly higher than in Coco-2017. Because Sama-Coco is distinct from Coco-2017, we anticipate ML practitioners will find each of the two datasets suitable for different tasks.

It has already been a useful dataset for us as we leverage it in studies of annotation quality. We’re sure it will be useful for other AI practitioners with similar aims – stay tuned for the results of our quality experiments.

And so, we invite you all to download and use the dataset from the links above.

We’d love to hear from you about your experience with Sama-Coco! Please contact [email protected] with your feedback.
