Treat not a trick - the brand new Sama-Coco dataset

We’re proud to publicly release a relabelling of the Coco-2017 dataset, by our very own in-house Sama associates (here’s more information about our people!).


Sama is more than just a data annotation company

We’ve had quite a year at Sama. We’ve expanded, greatly enhanced our platform’s technology, and have been growing our presence in data annotation circles. But what we really want the world to know is that pushing the envelope in generating data for machine learning is a key part of our DNA. We think long and work hard on improving processes and resources. Now it’s time to introduce a key result of our work: the Sama-Coco dataset.

Introducing the Sama-Coco dataset

Many of you will have heard of and used the Coco dataset. Now we’re proud to release a relabelling of the Coco-2017 dataset, this one by our very own in-house Sama associates (here’s more information about our people!). And we want to invite the Machine Learning (ML) community to use it for anything you would like to do - all free of charge and ungated.

This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. We’ve already started to use it to explore the impact of data quality on model performance, and we’ll be publishing the results of those studies soon.

To get started, here are the ungated links to the Sama-Coco and original Coco-2017 datasets so that you can get right to them.

| | Coco-2017 | Sama-Coco |
|---|---|---|
| Validation Images | 2017 Val images | 2017 Val images |
| Train Images | 2017 Train images | 2017 Train images |
| Validation Detection Annotations | 2017 Train/Val annotations | sama-coco-val.zip * |
| Train Detection Annotations | 2017 Train/Val annotations | sama-coco-train.zip * |

*Creative Commons License

What does the Sama-Coco dataset look like?

One of our main aims was to study how our associates annotate the original Coco-2017 images with precise polygons. This resulted in a very different dataset, with characteristics that are summarized in the tables below.

| Overview | Coco-2017 | Sama-Coco | Difference |
|---|---|---|---|
| Number of images | 123 287 | 123 287 | 0 |
| Number of classes | 80 | 80 | 0 |
| Number of classes with more objects annotated | 33 | 47 | - |

| Instances | Coco-2017 | Sama-Coco | Difference |
|---|---|---|---|
| Number of instances (crowds included) | 896 782 | 1 115 464 | 218 682 (x1.24) |
| Number of crowds | 10 498 | 47 428 | 36 930 (x4.5) |
| Objects composed of more than one polygon | 86 156 | 175 698 | 89 952 (x2) |
| Number of vertices | 21 726 743 | 40 258 235 | 18 531 492 (x1.85) |

| Object Sizes | Coco-2017 | Sama-Coco | Difference |
|---|---|---|---|
| Very small objects (<=10x10 pixels) | 78 213 | 48 394 | -29 819 (x0.6) |
| Small objects (<32x32 pixels) | 371 655 (41.4%) | 555 006 (49.8%) | 183 351 (x1.49) |
| Medium objects (>=32x32 and <96x96 pixels) | 307 732 (34.3%) | 354 290 (31.8%) | 46 558 (x1.15) |
| Large objects (>=96x96 pixels) | 217 395 (24.2%) | 206 168 (18.4%) | -11 227 |

Some key features should be highlighted:

The number of images and the number of object classes are the same across both the Sama-Coco and the original Coco-2017 datasets.
The number of crowds is significantly greater in Sama-Coco. This is partially because our associates were tasked with decomposing large, singular crowds into individual elements and smaller crowds. They were also instructed to be more precise and comprehensive when annotating instances and crowds.
As a consequence, the total number of vertices in Sama-Coco rose significantly - it nearly doubled. The number of large objects dropped, and their share of all instances fell from 24.2% to 18.4%, as the individual members or elements of big crowds or clusters of objects were relabeled as their own unique items.
There is a significant reduction in the number of very small objects - those measuring 10x10 pixels or less. It was a conscious choice at the outset to ask associates not to annotate such small objects. We were attempting to balance quality and time allocated to labeling when we made this decision. We believe the significantly greater number of other small objects (between 10x10 and 32x32 pixels) and medium objects (between 32x32 and 96x96 pixels) that emerged in our dataset justifies this course of action.
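
For readers who want to compute similar statistics on their own copy of the annotations, here is a minimal sketch. It assumes the annotation files follow the standard COCO JSON format (with `area`, `iscrowd` and `segmentation` fields), and it interprets the size thresholds as pixel areas, which is the usual COCO convention but is not confirmed here. The function names are illustrative, and the counts it produces may not match the tables exactly if the original statistics were computed differently.

```python
from collections import Counter

# Size thresholds from the tables, interpreted as pixel areas (assumption).
VERY_SMALL_MAX = 10 * 10
SMALL_MAX = 32 * 32
MEDIUM_MAX = 96 * 96


def size_bucket(area):
    """Return 'small', 'medium' or 'large' for a pixel area."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


def summarize(annotations):
    """Compute table-style statistics from a list of COCO annotation dicts."""
    stats = Counter()
    for ann in annotations:
        stats["instances"] += 1
        if ann.get("iscrowd", 0) == 1:
            stats["crowds"] += 1
        seg = ann.get("segmentation")
        # Polygon segmentations are lists of flat [x1, y1, x2, y2, ...] lists;
        # RLE-encoded masks are dicts and contribute no vertices here.
        if isinstance(seg, list):
            if len(seg) > 1:
                stats["multi_polygon"] += 1
            stats["vertices"] += sum(len(poly) // 2 for poly in seg)
        area = ann["area"]
        # Very small objects are also counted as small, as in the table.
        if area <= VERY_SMALL_MAX:
            stats["very_small"] += 1
        stats[size_bucket(area)] += 1
    return stats


# Tiny hand-made sample standing in for a real annotation list.
sample = [
    {"area": 64, "iscrowd": 0, "segmentation": [[0, 0, 8, 0, 8, 8, 0, 8]]},
    {"area": 2500, "iscrowd": 0,
     "segmentation": [[0, 0, 50, 0, 50, 50], [60, 60, 70, 60, 70, 70]]},
    {"area": 10000, "iscrowd": 1,
     "segmentation": {"counts": [], "size": [100, 100]}},
]
print(dict(summarize(sample)))
```

On a real file, the list passed to `summarize` would be the `"annotations"` entry of the JSON loaded with `json.load`.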

Of course, seeing is believing. Here are two illustrative examples of the differences between the two datasets.

In this first example, Coco labellers largely treated this as one singular crowd, whereas in Sama-Coco, each person was individually labeled.

This second example shows how most annotations were carried out with an acute level of precision. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.

How was Sama-Coco generated?

We revisited all 123,287 images, which were pre-loaded with the annotations from the Coco-2017 dataset. Up to 500 associates worked through them using the following procedure:

Distinguish crowd from non-crowd images. Coco loosely defines a crowd as a group of instances of the same class that are co-located.
Prioritize annotating instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first of those instances individually and then label the rest as part of a crowd. The exact number of instances to annotate individually changed over the course of the project; this requirement was set to balance budget, time, and quality considerations.
Ignore objects that were smaller than 10x10 pixels (it should be noted that some associates deleted Coco-2017 pre-annotations for such small objects whereas others simply ignored them).
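
The procedure above was carried out by human associates, not by code, but its decision logic can be sketched roughly as follows. Everything here is an illustrative assumption: the `triage` function, the `label`, `width` and `height` keys, the area-based reading of "smaller than 10x10 pixels", and the `max_individual` cap (the real cap changed during the project and is not stated).

```python
from collections import defaultdict

MIN_AREA = 10 * 10  # objects smaller than 10x10 pixels are ignored (area-based assumption)


def triage(objects, max_individual):
    """Split one image's candidate objects into individual labels and crowds.

    objects: list of dicts with hypothetical keys 'label', 'width', 'height'.
    max_individual: hypothetical per-class cap on individually labeled instances.
    """
    individual = []
    crowds = defaultdict(list)
    seen = defaultdict(int)
    for obj in objects:
        if obj["width"] * obj["height"] < MIN_AREA:
            continue  # too small to annotate
        label = obj["label"]
        if seen[label] < max_individual:
            individual.append(obj)  # labeled as its own instance
        else:
            crowds[label].append(obj)  # the rest of the class goes into a crowd
        seen[label] += 1
    return individual, dict(crowds)


# Toy image: four people, one tiny person, one dog.
objects = [
    {"label": "person", "width": 20, "height": 40},
    {"label": "person", "width": 22, "height": 41},
    {"label": "person", "width": 18, "height": 39},
    {"label": "person", "width": 21, "height": 40},
    {"label": "person", "width": 5, "height": 5},
    {"label": "dog", "width": 30, "height": 30},
]
individual, crowds = triage(objects, max_individual=2)
print(len(individual), {label: len(members) for label, members in crowds.items()})
```

With a cap of 2, the first two people and the dog are kept as individual instances, the remaining two people form a person crowd, and the 5x5 object is skipped.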

Why Sama-Coco?

We strive to improve our annotation operations through research, experimentation, and qualification as we try to advance knowledge in the broader ML community. Given Coco-2017’s status as a well-established benchmark, relabelling it with our quality rubric was an opportunity to produce a dataset that’s simultaneously familiar and distinct. While both datasets share the same base images, Sama-Coco has more instances for 47 of the 80 classes and a marked overall improvement in the accuracy of polygon annotations. In some cases, such as the person class, the number of instances is significantly higher than in Coco-2017. Because Sama-Coco is distinct from Coco-2017, we anticipate ML practitioners will find each of the two datasets suitable for different tasks.

It has already been a useful dataset for us as we leverage it in studies of annotation quality. We’re sure it will be useful for other AI practitioners with similar aims - stay tuned for the results of our quality experiments.
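
To see which classes gained instances in your own copy of the data, a per-class comparison along these lines may help. It is a sketch that assumes both annotation files use the standard COCO JSON layout (`categories` with `id` and `name`, `annotations` with `category_id`); the function names and the toy data are illustrative only.

```python
from collections import Counter


def instances_per_class(coco):
    """Count annotations per category name in a loaded COCO-format dict."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


def classes_with_more(coco_a, coco_b):
    """Return {class: (count_a, count_b)} for classes where coco_b has more."""
    a = instances_per_class(coco_a)
    b = instances_per_class(coco_b)
    return {
        name: (a[name], b[name])
        for name in sorted(set(a) | set(b))
        if b[name] > a[name]
    }


# Toy stand-ins for two loaded annotation files.
categories = [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}]
original = {
    "categories": categories,
    "annotations": [{"category_id": 1}, {"category_id": 18}, {"category_id": 18}],
}
relabeled = {
    "categories": categories,
    "annotations": [{"category_id": 1}, {"category_id": 1},
                    {"category_id": 1}, {"category_id": 18}],
}
print(classes_with_more(original, relabeled))
```

With real data, each dict would come from `json.load` on the corresponding annotation file, for example `instances_val2017.json` from the Coco-2017 annotations archive and the JSON file contained in `sama-coco-val.zip`.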

And so, we invite you all to download and use the dataset from the links above. We’d love to hear from you about your experience with Sama-Coco! Please contact samacoco@samasource.org with your feedback.
