What is Training Data?

Daniele Packard

December 18, 2017

4 Minute Read

Historically, to get a computer to do something, you had to explicitly program it to execute a series of steps that accomplished your desired task. This covered everything from the simplest arithmetic to manipulating objects in a complex digital world like those you see in video games.

Machine learning is a fast-developing field of computer science that allows computers to complete tasks they have not been explicitly programmed to do, such as reacting to new or unforeseen situations and then learning from that new input. Not only does this create an entirely novel and potentially more efficient way for computers to engage with the world, but it has also opened the door to applications once thought too complex to achieve with traditional computer science (hello, autonomous vehicles!).

How does it work? Training data.

The best way for a computer to gain knowledge is to start by showing it exactly what it is you want it to do. To do this, we use training data, the input the machine learning algorithm references and learns from. The computer will use this to look for patterns, extrapolate connections, create rules, and ultimately learn how to accomplish what it is you are trying to achieve.

The level of complexity and nuance needed in your training dataset depends on the desired goal. For example, a model that gives a binary yes/no answer to whether a dog is in an image will need training images that are categorized as containing a dog or not. By contrast, a model that must tell you not only whether a dog is in an image but also where the animal is will need training images in which the location of the dog is specified.
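To make the difference concrete, here is a minimal sketch of what the two kinds of labeled examples might look like. The class names, file names, and the pixel box format (x_min, y_min, x_max, y_max) are assumptions chosen for illustration, not a standard annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ClassificationExample:
    """Label for the yes/no task: is there a dog in the image?"""
    image_path: str
    has_dog: bool


@dataclass
class DetectionExample:
    """Label for the 'where is the dog?' task: one box per dog."""
    image_path: str
    dog_boxes: List[Box]


def to_classification(example: DetectionExample) -> ClassificationExample:
    # Location labels can be reduced to yes/no labels,
    # but yes/no labels cannot be turned back into locations.
    return ClassificationExample(example.image_path, len(example.dog_boxes) > 0)


# Illustrative file names only; no real images are read.
detection_set = [
    DetectionExample("park.jpg", [(40, 60, 200, 220)]),
    DetectionExample("kitchen.jpg", []),
]
classification_set = [to_classification(e) for e in detection_set]
```

The richer detection labels cost more to produce, which is why the goal of the model should decide how detailed the labeling needs to be.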

Different industries and applications have very different training data needs. A financial company trying to automate fraud detection will need appropriately categorized examples of where fraud did and did not happen. Meanwhile, a biomedical company looking to automate medical image analysis will need a doctor to label images (although it has been shown that for certain applications, pigeons and machine learning can help medical image analysis!).

As a user of digital products, you are creating training data all the time, often without even realizing it. When you tell your fitness app your preferences, you are contributing to a machine learning model. When you flag certain messages in your inbox as spam, you are creating training data for your spam filter algorithm.
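As a rough illustration of the spam example, the sketch below treats each user flag as a labeled example and learns from simple word counts. This is a deliberately simplified stand-in, not how any particular mail provider works; the messages and the function names train and predict_spam are made up for the example.

```python
from collections import Counter


def train(flagged):
    """Count words per class from (message_text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    for text, is_spam in flagged:
        counts[is_spam].update(text.lower().split())
    return counts


def predict_spam(counts, text):
    """Call a message spam if its words appeared more often in spam."""
    words = text.lower().split()
    spam_score = sum(counts[True][w] for w in words)
    ham_score = sum(counts[False][w] for w in words)
    return spam_score > ham_score


# Each flag a user sets becomes one labeled training example.
user_flags = [
    ("win a free prize now", True),
    ("free money click now", True),
    ("meeting notes for monday", False),
    ("lunch on monday", False),
]
model = train(user_flags)

looks_like_spam = predict_spam(model, "claim your free prize")
looks_like_mail = predict_spam(model, "notes from monday lunch")
```

Every additional flag changes the word counts, so the filter's behavior follows directly from what users have labeled.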

The need for high-quality training datasets is widely recognized, and the counterproductive effects of introducing bad data to your model can be succinctly summarized as “garbage in, garbage out”. In other words, poor training data that inaccurately represents what you want your model to achieve will yield an equally poor model.
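A small experiment can make "garbage in, garbage out" tangible. The sketch below uses a toy one-dimensional task and a nearest-neighbor rule, both chosen here purely for illustration, and compares a correctly labeled training set with one where four labels have been flipped by hand.

```python
def nearest_label(train, x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def accuracy(train, test):
    correct = sum(nearest_label(train, x) == y for x, y in test)
    return correct / len(test)


def true_label(x):
    # Toy ground truth (an assumption for this example): 1 when x >= 5.
    return int(x >= 5)


clean_train = [(x, true_label(x)) for x in range(10)]

# "Garbage": the same points, with four labels flipped by hand.
mislabeled = {1, 3, 6, 8}
noisy_train = [(x, 1 - y if x in mislabeled else y) for x, y in clean_train]

test_set = [(x + 0.25, true_label(x + 0.25)) for x in range(10)]

clean_accuracy = accuracy(clean_train, test_set)
noisy_accuracy = accuracy(noisy_train, test_set)
print(clean_accuracy, noisy_accuracy)
```

The learning rule is identical in both runs; only the labels differ, and the model trained on the mislabeled set reproduces those mistakes on new inputs.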

Your training data also needs to be diverse enough to cover all the scenarios your model may encounter, in order to avoid creating biases. For example, a model built to estimate a person's age from training data composed only of images of adults and their respective ages will be clueless when presented with the image of a child. Biases can lead to anything from a benign oversight to errors that are dangerous or socially controversial.
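One simple way to catch a gap like the one in the age example is to count how many training examples fall into each group you expect the model to handle. The bucket names and age cut-offs below are assumptions made for the sketch, as is the list of ages.

```python
from collections import Counter


def age_bucket(age):
    # Hypothetical cut-offs, chosen only for illustration.
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def coverage_gaps(training_ages, buckets=("child", "adult", "senior")):
    """Return the buckets that have no training examples at all."""
    seen = Counter(age_bucket(a) for a in training_ages)
    return [b for b in buckets if seen[b] == 0]


# Made-up age labels from an adults-only training set.
training_ages = [22, 35, 41, 58, 67, 30]
gaps = coverage_gaps(training_ages)
print(gaps)
```

A check like this only finds gaps along dimensions you thought to measure, so it complements rather than replaces a careful look at where the data came from.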

Ultimately, the need for training data is growing in parallel with the applications of machine learning. While some are researching ways to minimize the need for training data or generate it digitally, there is growing evidence and research validating how important high volumes of training data can be. And though the degree to which training data will be essential to the future of AI is up for debate, it will surely continue to play an important role in machine learning.

*There are various types of machine learning. For the purpose of this blog article, we are discussing supervised machine learning.