
Best Open Source Datasets for Machine Learning

Best open source datasets for machine learning and three dataset finders, including one that was featured in the Fine-Grained Visual Categorization (FGVC) workshop at CVPR 2019 on June 17.


Artificial intelligence is a double-edged sword. On one hand, homes are smarter, health tech is advancing at a rapid pace, and driverless vans will soon deliver your groceries.

On the other hand, privacy violations, discrimination, and a whole host of effects not yet known or experienced are reasons to pause before embracing these technologies.

Confronting the risks of AI begins with facing your data difficulties, including ingesting high-quality data before any sorting, linking, or programming occurs. Here are eleven open source datasets for machine learning and three dataset finders, including one dataset that was featured in the Fine-Grained Visual Categorization (FGVC) workshop at CVPR 2019 on June 17.

Machine Learning Datasets

The datasets below span multiple data types, including scholarly articles, images, survey data, and autonomous driving data. This overview table gives readers a quick way to compare the options before moving into the detailed list.

| Dataset | Data Type | Key Detail |
| --- | --- | --- |
| COVID-19 Open Research Dataset | Scholarly articles | 45,000+ articles on COVID-19 and coronaviruses |
| Google Open Images | Images | 9M+ images across 6,000 categories |
| Waymo Open Dataset | Autonomous driving data | One of the largest and most diverse autonomous driving datasets |
| ImageNet | Images | Organized by WordNet hierarchy |
| iMaterialist-Fashion | Clothing images | 50K+ images with fine-grained segmentation labels |
| Fishnet.AI | Fisheries images | ~35,000 images with bounding boxes from tuna fishing cameras |
| Visual Genome | Images + language | Connects image concepts to structured language |
| UCI Machine Learning Repository | Mixed | 474 datasets maintained by UC Irvine |
| Pew Research Center | Survey data | Raw survey data, account required |
| Labelme | Annotated images | Accessible via MATLAB toolbox |
| Labeled Faces in the Wild | Face photographs | 13,000+ face photos for facial recognition |

  1. COVID-19 Open Research Dataset: The Allen Institute for AI partnered with leading research groups to prepare this research dataset of over 45,000 scholarly articles about COVID-19 and the coronavirus family of viruses.
  2. Google Open Images: Google AI introduced over 9 million images spanning 6,000 categories, “enough to train a deep neural network from scratch.”
  3. Waymo Open Dataset: Waymo released one of the largest, most diverse autonomous driving datasets to date. All you need to access the dataset is a Gmail account.
  4. ImageNet: If you’re looking for an image database organized according to the WordNet hierarchy, give ImageNet a try.
  5. iMaterialist-Fashion: Sama and Cornell Tech announced the iMaterialist-Fashion dataset in May 2019, with over 50K clothing images labeled for fine-grained segmentation. The dataset was used in the FGVC workshop at CVPR, co-sponsored by Google AI.
  6. Fishnet.AI: Working together with Sama, The Nature Conservancy released Fishnet.AI, an AI training dataset for fisheries. This dataset of approximately 35,000 images, with an average of 5 bounding boxes per image, was collected from on-board monitoring cameras for longline tuna fishing activity in the Western and Central Pacific.
  7. Visual Genome: Visual Genome is the product of nine technology professionals with a goal of connecting structured image concepts to language.
  8. UCI Machine Learning Repository: The University of California, Irvine (UCI) maintains 474 datasets as a service to the machine learning community.
  9. Pew Research Center: Gain access to raw data from survey research via Pew Research Center. An account is required to access their datasets, but registration is easy.
  10. Labelme: Use the Labelme MATLAB toolbox to access a large dataset of annotated images.
  11. Labeled Faces in the Wild (LFW): Develop your facial recognition application using LFW, a collection of over 13,000 face photographs collected from around the web.
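
Before training on an annotated image dataset, it can help to check that a download matches its published description, such as the average of 5 bounding boxes per image quoted for Fishnet.AI. The sketch below assumes the annotations have already been loaded into a list of records, one per bounding box, each with an `image_id` field. That record layout, the function name, and the toy records are hypothetical and are not the actual schema or contents of Fishnet.AI or any other dataset listed here.

```python
from collections import Counter


def boxes_per_image(annotations):
    """Return (number of annotated images, average boxes per image).

    `annotations` is a list of dicts, one per bounding box, each with an
    `image_id` key. This layout is an assumption made for illustration;
    real datasets ship their own annotation formats.
    Images that have no boxes at all never appear in the list, so they
    are not counted here.
    """
    counts = Counter(record["image_id"] for record in annotations)
    if not counts:
        return 0, 0.0
    return len(counts), sum(counts.values()) / len(counts)


# Toy records for illustration only, not taken from any real dataset.
sample = [
    {"image_id": "img_001", "label": "fish", "bbox": [10, 20, 50, 40]},
    {"image_id": "img_001", "label": "person", "bbox": [60, 15, 30, 80]},
    {"image_id": "img_002", "label": "fish", "bbox": [5, 5, 25, 25]},
    {"image_id": "img_002", "label": "fish", "bbox": [40, 40, 20, 20]},
    {"image_id": "img_002", "label": "person", "bbox": [70, 10, 35, 90]},
    {"image_id": "img_003", "label": "fish", "bbox": [12, 30, 44, 22]},
]

num_images, average = boxes_per_image(sample)
print(f"{num_images} images, {average:.1f} boxes per image on average")
# prints "3 images, 2.0 boxes per image on average"
```

A large gap between a figure like this and the published one is usually a sign of a partial download or a parsing mistake, and is worth resolving before any training run.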

Dataset Finders

Finding the right dataset is often as important as evaluating the dataset itself. The tools below help researchers and practitioners discover niche datasets, large public data collections, and sources distributed across the web.

| Tool | What It Is | Best For | Key Detail |
| --- | --- | --- | --- |
| Kaggle Datasets | Dataset hosting and discovery platform | Finding niche datasets, downloading ready-to-use datasets, exploring dataset popularity | Large community-driven catalog; dataset pages include context, licenses, and notebooks |
| AWS Registry of Open Data | Curated registry of open datasets hosted on AWS | Very large datasets, public data at scale, cloud-native access patterns | Many datasets are accessible directly in AWS, often via S3; useful for large files and programmatic access |
| Google Dataset Search | Dataset search engine that indexes dataset metadata across the web | Discovering datasets across publishers, libraries, and websites | Indexes dataset metadata; helps locate the source dataset and documentation quickly |

  1. Kaggle: Data scientists and machine learning practitioners can find and publish datasets on Kaggle, an online community that was acquired by Google in 2017. Kaggle’s master list of datasets boasts a wide range of niche data sources.
  2. Amazon Web Services (AWS): With over 110 datasets and counting, the Registry of Open Data on AWS includes a web crawl of billions of web pages, NASA satellite imagery, and more. If you want to add to the registry, there’s an AWS Labs GitHub repository for that.
  3. Google Dataset Search: Google Dataset Search indexes datasets from digital libraries, personal websites, and publisher pages, so you can find them when you need them. It’s currently in beta, but the predictive interface makes it easy to see at a glance what datasets are available on your selected topic.
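
Datasets in the AWS registry that live in public S3 buckets can often be browsed programmatically without an AWS account. The following is a minimal sketch using only the Python standard library. It assumes the bucket permits anonymous listing through S3's ListObjectsV2 REST endpoint, which not every registry dataset does. The bucket name `example-open-data` and the helper function names are hypothetical; the registry page for a given dataset is the place to find its real bucket and access notes.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in S3 bucket-listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly listable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Extract (key, size in bytes) pairs from a ListBucketResult document.

    Accepts either str or bytes, as returned by an HTTP response body.
    """
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of a bucket listing (requires network access)."""
    url = build_list_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url) as response:
        return parse_listing(response.read())


# Building the URL needs no network; "example-open-data" is a placeholder.
print(build_list_url("example-open-data", prefix="images/", max_keys=5))
```

Listing a prefix first, rather than downloading blindly, shows file names and sizes up front, which matters for the very large datasets the registry hosts. For anything beyond a quick look, an official AWS SDK or the AWS CLI is likely the more robust choice, since this sketch reads only a single page of results.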

This is just a small sample of the free, open source datasets that are available for machine learning use cases. If you have a dataset or dataset finder you’d like to add, let us know.

Author
Sharon L. Hadden
