The Sama-Coco Dataset

We are proud to offer the Sama-Coco dataset, a relabelling of the Coco-2017 dataset by our own in-house Sama associates (here’s more information about our people!). We invite the Machine Learning (ML) community to use it for any purpose, free of charge and ungated.

This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. Here are the ungated links to the two datasets (both covered by the Creative Commons license) so that you can get started right away.

| | Coco-2017 | Sama-Coco |
| --- | --- | --- |
| Validation Images | 2017 Val images [5K/1GB] | same images as Coco-2017 |
| Train Images | 2017 Train images [118K/18GB] | same images as Coco-2017 |
| Validation Detection Annotations | 2017 Train/Val annotations [241MB] | sama-coco-val.zip [5.7MB]* |
| Train Detection Annotations | 2017 Train/Val annotations [241MB] | Sama-coco-train.zip [154.4MB]* |
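
Once an annotation archive has been downloaded, it can be read without unpacking it to disk. The sketch below assumes the zip holds a COCO-format JSON file with the usual `images`, `annotations` and `categories` keys. The archive's internal layout is not described on this page, so the helper name `load_coco_json_from_zip` and the choice of the first `.json` member are illustrative assumptions.

```python
import json
import zipfile


def load_coco_json_from_zip(zip_path):
    """Parse the first .json member of a zip archive and return it as a dict.

    Assumption: the archive contains one COCO-format JSON file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Sort member names so the choice is deterministic if there are several.
        json_members = sorted(
            name for name in archive.namelist() if name.lower().endswith(".json")
        )
        if not json_members:
            raise ValueError(f"no .json file found in {zip_path}")
        with archive.open(json_members[0]) as handle:
            return json.load(handle)
```

Calling `load_coco_json_from_zip("sama-coco-val.zip")` would then return a dictionary whose `annotations` list can be inspected directly, provided the archive matches the assumption above.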

Sama-Coco by the Numbers

Here’s a quick overview of the two datasets’ most important characteristics:

| Overview | Coco-2017 | Sama-Coco | Difference |
| --- | --- | --- | --- |
| Number of images | 123 287 | 123 287 | 0 |
| Number of classes | 80 | 80 | 0 |
| Number of classes with more objects annotated | 33 | 47 | – |

| Instances | Coco-2017 | Sama-Coco | Difference |
| --- | --- | --- | --- |
| Number of instances (crowds included) | 896 782 | 1 115 464 | 218 682 (x1.24) |
| Number of crowds | 10 498 | 47 428 | 36 930 (x4.5) |
| Objects composed of more than one polygon | 86 156 | 175 698 | 89 952 (x2) |
| Number of vertices | 21 726 743 | 40 258 235 | 18 531 492 (x1.85) |

| Object Sizes | Coco-2017 | Sama-Coco | Difference |
| --- | --- | --- | --- |
| Very small objects (<=10×10 pixels) | 78 213 | 48 394 | -29 819 (x0.6) |
| Small objects (<32×32 pixels) | 371 655 (41.4%) | 555 006 (49.8%) | 183 351 (x1.49) |
| Medium objects (>=32×32 and <96×96 pixels) | 307 732 (34.3%) | 354 290 (31.8%) | 46 558 (x1.15) |
| Large objects (>=96×96 pixels) | 217 395 (24.2%) | 206 168 (18.4%) | -11 227 |

[Figure: Number of instances per class (10 most frequent classes)]
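
The per-class counts behind a chart like this can be recomputed from any COCO-format annotation dictionary. The following is a minimal sketch, assuming the standard `categories` and `annotations` keys; the function name `top_classes` is made up for illustration.

```python
from collections import Counter


def top_classes(coco, n=10):
    """Return the n most frequent classes as (class name, instance count) pairs."""
    # Map numeric category IDs to readable class names.
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    # Count one instance per annotation (crowd annotations included).
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return counts.most_common(n)
```

Running it on both annotation files and comparing the two lists gives a quick view of which classes gained the most instances.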


Sama-Coco’s Key Features

Some key features should be highlighted:

The number of images and the number of classes are the same in both the Sama-Coco and the original Coco-2017 datasets.
Sama-Coco contains significantly more crowds (47 428 versus 10 498) and more instances overall. This is partially because our associates were tasked with decomposing large, singular crowds into individual elements and smaller crowds. While both datasets share the same base, Sama-Coco has more instances for 47 of the 80 classes. In some cases, such as the person class, the number of instances is significantly higher than in Coco-2017.
Associates were instructed to be more precise and comprehensive when annotating instances and crowds. This led to a sharp rise in the total number of vertices, which nearly doubled (x1.85). The number of large objects also dropped, from 217 395 to 206 168, as the individual members or elements of big crowds or clusters of objects were relabeled as their own unique items.
There is a significant reduction in the number of very small objects – those measuring 10×10 pixels or less. It was a conscious choice at the outset to ask associates not to annotate such small objects. We were attempting to balance quality and time allocated to labeling when we made this decision, and we believe that the significantly greater number of other small objects (between 10×10 and 32×32 pixels) and medium objects (between 32×32 and 96×96 pixels) that emerged in our dataset justifies this decision.
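
The figures discussed above (instances, crowds, multi-polygon objects, vertices and size buckets) can be approximated from a COCO-format annotation dictionary. This is a sketch under stated assumptions rather than the script used to produce the published numbers: it assumes the "N×N pixels" thresholds are compared against each annotation's `area` field, that polygon segmentations are flat `[x1, y1, x2, y2, ...]` lists, and that RLE-encoded segmentations contribute no vertices. Very small objects are counted as a subset of small objects, which matches the published small, medium and large counts adding up to the instance total.

```python
def size_bucket(area):
    """Map a pixel area to a size bucket.

    Assumption: "N×N pixels" thresholds are applied to the `area` field
    as N * N, following the usual COCO convention.
    """
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


def summarize(coco):
    """Count instances, crowds, multi-polygon objects, vertices and sizes."""
    stats = {
        "instances": 0, "crowds": 0, "multi_polygon": 0, "vertices": 0,
        "very_small": 0, "small": 0, "medium": 0, "large": 0,
    }
    for ann in coco["annotations"]:
        stats["instances"] += 1  # crowds are included in the total
        if ann.get("iscrowd", 0):
            stats["crowds"] += 1
        seg = ann.get("segmentation")
        if isinstance(seg, list):  # polygon format; RLE dicts are skipped
            if len(seg) > 1:
                stats["multi_polygon"] += 1
            # Each polygon is a flat list of x, y pairs.
            stats["vertices"] += sum(len(poly) // 2 for poly in seg)
        area = ann.get("area", 0.0)
        stats[size_bucket(area)] += 1
        if area <= 10 * 10:
            stats["very_small"] += 1  # subset of "small"
    return stats
```

Whether the published counts treat crowd segmentations the same way is not stated here, so small deviations from the table would not be surprising.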

Illustrative Differences between Sama-Coco and Coco-2017

Here, we cover two images that are illustrative of some of the differences between Sama-Coco and Coco-2017.

In this first example, Coco labellers largely treated this as one singular crowd, whereas in Sama-Coco, each person was individually labeled.

This second example shows how most annotations were carried out with an acute level of precision. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.

How Sama-Coco was Labeled

We revisited all 123 287 images, which were pre-loaded with the annotations from the Coco-2017 dataset. Up to 500 associates worked on them, performing three key tasks. They had to:

Distinguish crowd from non-crowd images (note that both Sama-Coco and Coco-2017 loosely define a crowd as a group of co-located instances of the same class).
Prioritize annotating individual instances of objects over crowds of objects. However, when associates encountered more than a certain number of instances of a specific class in a single image, they were told to label the first such instances individually and then label the remainder as part of a crowd. The exact number of instances to annotate individually changed over the course of the project. This requirement was introduced to balance budget, time, and quality considerations.
Ignore objects that were smaller than 10×10 pixels (some associates deleted Coco-2017 pre-annotations for such small objects, whereas others simply ignored them).
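
The second and third rules can be illustrated with a small sketch. It is only a model of the instructions as described above, not the tooling the associates used: the function name `plan_labels`, the `max_individual` parameter (the per-class cap, whose actual value changed during the project and is not given here), the area-based reading of "smaller than 10×10 pixels", and the use of input order to decide which instances come "first" are all assumptions.

```python
def plan_labels(boxes, max_individual, min_side=10):
    """Split one class's objects in one image into labeling groups.

    boxes: list of (width, height) tuples, in the order they are encountered.
    max_individual: hypothetical cap on individually labeled instances.
    min_side: objects with area below min_side * min_side are ignored.
    """
    min_area = min_side * min_side
    kept = [box for box in boxes if box[0] * box[1] >= min_area]
    ignored = [box for box in boxes if box[0] * box[1] < min_area]
    return {
        "individual": kept[:max_individual],  # labeled one by one
        "crowd": kept[max_individual:],       # remainder goes into a crowd
        "ignored": ignored,                   # too small to annotate
    }
```

For example, `plan_labels([(5, 5), (20, 20), (30, 30), (40, 40)], max_individual=2)` puts the first two sufficiently large boxes in `individual`, the `(40, 40)` box in `crowd`, and the `(5, 5)` box in `ignored`.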

Sama-Coco Installation Instructions for the FiftyOne App

Load Sama-Coco directly from the FiftyOne app. Explore all 123 287 images within FiftyOne and compare them side by side with the original MS Coco dataset.

    import fiftyone as fo
    import fiftyone.zoo as foz

    # Load the validation split of Sama-Coco and of the original Coco-2017
    dataset = foz.load_zoo_dataset("sama-coco", split="validation", label_types="segmentations", include_id=True)
    coco_val_dataset = foz.load_zoo_dataset("coco-2017", split="validation", label_types="segmentations", include_id=True)

    # Give each label field a distinct name so both survive the merge
    dataset.rename_sample_field("segmentations", "sama_segmentations")
    coco_val_dataset.rename_sample_field("segmentations", "coco_segmentations")

    # Merge on the COCO image ID stored by include_id=True, then open the app
    dataset.merge_samples(coco_val_dataset, key_field="coco_id")
    session = fo.launch_app(dataset)