We’re proud to publicly release a relabelling of the Coco-2017 dataset, by our very own in-house Sama associates (here’s more information about our people!).
We’ve had quite a year at Sama. We’ve expanded, greatly enhanced our platform’s technology, and have been growing our presence in data annotation circles. But what we really want the world to know is that pushing the envelope in generating data for machine learning is a key part of our DNA. We think long and work hard on improving processes and resources. Now it’s time to introduce a key result of our work: the Sama-Coco dataset.
Many of you will have heard of and used the Coco dataset. Now we're proud to release a relabelling of the Coco-2017 dataset, this one by our very own in-house Sama associates (here's more information about our people!). We invite the Machine Learning (ML) community to use it for anything you would like to do - all free of charge and ungated. This is part of our ongoing effort to redefine data quality for the modern age, and to contribute to the wider research and development efforts of the ML community. We've already started to use it to explore the impact of data quality on model performance, and we'll be publishing the results of those studies soon. To get started, here are the ungated links to the Sama-Coco and original Coco-2017 datasets so that you can get right to them.

                                     Coco-2017                     Sama-Coco
Validation Images                    2017 Val images               —
Train Images                         2017 Train images             —
Validation Detection Annotations     2017 Train/Val annotations    sama-coco-val.zip *
Train Detection Annotations                                        sama-coco-train.zip *
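Once downloaded and unzipped, the annotation files can be explored with nothing beyond the standard library. The sketch below assumes the files follow the standard COCO JSON layout ("images", "annotations", and "categories" lists) - a reasonable assumption given that Sama-Coco is a relabelling of Coco-2017, though file names and paths shown are hypothetical:

```python
import json

def index_annotations(coco):
    """Group a COCO-style annotation dict's annotations by image id."""
    by_image = {}
    for ann in coco["annotations"]:
        by_image.setdefault(ann["image_id"], []).append(ann)
    return by_image

# In practice you would load the real file, e.g.:
#   with open("sama-coco-val/annotations.json") as f:  # hypothetical path
#       coco = json.load(f)
# Here we use a tiny in-memory stand-in with the same assumed structure:
coco = {
    "images": [{"id": 1, "file_name": "000000000001.jpg",
                "width": 640, "height": 480}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 1, "iscrowd": 0,
                     "bbox": [10, 20, 50, 40],
                     "segmentation": [[10, 20, 60, 20, 60, 60, 10, 60]]}],
    "categories": [{"id": 1, "name": "person"}],
}
by_image = index_annotations(coco)
print(len(by_image[1]))  # number of objects annotated on image 1
```

Because both datasets share the same JSON conventions, the same loading code should work unchanged for Coco-2017 and Sama-Coco annotation files.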
One of our main aims was to have our associates re-annotate the original Coco-2017 images with precise polygons. This resulted in a very different dataset, with characteristics summarized in the tables below.

Overview                                         Coco-2017    Sama-Coco    Difference
Number of images                                 123 287      123 287      0
Number of classes                                80           80           0
Number of classes with more objects annotated    33           47           —

Instances                                        Coco-2017    Sama-Coco    Difference
Number of instances (crowds included)            896 782      1 115 464    218 682 (x1.24)
Number of crowds                                 10 498       47 428       36 930 (x4.5)
Objects composed of more than one polygon        86 156       175 698      89 952 (x2)
Number of vertices                               21 726 743   40 258 235   18 531 492 (x1.85)

Object Sizes                                     Coco-2017        Sama-Coco        Difference
Very small objects (<=10x10 pixels)              78 213           48 394           -29 819 (x0.6)
Small objects (<32x32 pixels)                    371 655 (41.4%)  555 006 (49.8%)  183 351 (x1.49)
Medium objects (>=32x32 and <96x96 pixels)       307 732 (34.3%)  354 290 (31.8%)  46 558 (x1.15)
Large objects (>=96x96 pixels)                   217 395 (24.2%)  206 168 (18.4%)  -11 227

Some key features stand out: Sama-Coco contains roughly 1.24x more instances overall, about 4.5x more crowd annotations, and nearly twice as many polygon vertices - all reflecting the finer-grained labelling.
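Statistics like those in the tables above can be reproduced from any COCO-format annotation file with a short script. Below is a minimal sketch, assuming the standard COCO fields (`iscrowd`, `bbox` as [x, y, width, height], and polygon `segmentation` as flat lists of alternating x/y coordinates), and assuming the size buckets are defined on bounding-box area with "very small" counted as a subset of "small" (the small/medium/large percentages in the table sum to ~100%):

```python
def dataset_stats(annotations):
    """Compute table-style summary stats from a list of COCO-format annotations."""
    stats = {"instances": 0, "crowds": 0, "multi_polygon": 0, "vertices": 0,
             "very_small": 0, "small": 0, "medium": 0, "large": 0}
    for ann in annotations:
        stats["instances"] += 1
        if ann.get("iscrowd", 0) == 1:
            # Crowd regions are RLE-encoded, not polygons, so no vertex count.
            stats["crowds"] += 1
        else:
            polygons = ann.get("segmentation", [])
            if len(polygons) > 1:
                stats["multi_polygon"] += 1
            # Each polygon is a flat [x1, y1, x2, y2, ...] list.
            stats["vertices"] += sum(len(p) // 2 for p in polygons)
        w, h = ann["bbox"][2], ann["bbox"][3]
        area = w * h
        if area <= 10 * 10:
            stats["very_small"] += 1  # counted as a subset of "small"
        if area < 32 * 32:
            stats["small"] += 1
        elif area < 96 * 96:
            stats["medium"] += 1
        else:
            stats["large"] += 1
    return stats

# Toy example: one medium-sized polygon object and one very small crowd region.
anns = [
    {"iscrowd": 0, "bbox": [0, 0, 50, 40],
     "segmentation": [[0, 0, 50, 0, 50, 40, 0, 40]]},
    {"iscrowd": 1, "bbox": [0, 0, 8, 8],
     "segmentation": {"counts": [], "size": [8, 8]}},
]
print(dataset_stats(anns))
```

Running this helper over the full Sama-Coco and Coco-2017 annotation files should let you verify the counts above yourself.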
Of course, seeing is believing. Here are two illustrative examples of the differences between the two datasets. In this first example, Coco labellers largely treated the scene as one singular crowd, whereas in Sama-Coco each person was individually labeled.
This second example shows how most annotations were carried out with an acute level of precision. Coco’s motorcycle annotation is rather coarse, whereas Sama-Coco’s is more fine-grained.
We revisited all 123,287 images, pre-loaded with their annotations from the Coco-2017 dataset, with up to 500 associates performing the following procedure:
We strive to improve our annotation operations through research, experimentation, and qualification as we try to advance knowledge in the broader ML community. Given Coco-2017's status as a well-established benchmark, relabelling it with our quality rubric was an opportunity to produce a dataset that's simultaneously familiar and distinct. While both datasets share the same base images, Sama-Coco has more instances for 47 of the 80 classes and shows a marked overall improvement in the accuracy of its polygon annotations. In some cases, such as the person class, the number of instances is significantly higher than that in Coco-2017. Because Sama-Coco is distinct from Coco-2017, we anticipate ML practitioners will find each of the two datasets suitable for different tasks. Sama-Coco has already been useful to us in studies of annotation quality, and we're sure it will be useful for other AI practitioners with similar aims - stay tuned for the results of our quality experiments.
And so, we invite you all to download and use the dataset from the links above. We'd love to hear from you about your experience with Sama-Coco! Please contact samacoco@samasource.org with your feedback.