We reached out to various ML experts, asking them the questions: Why is high-quality training data so important? Why do so many projects fail in ML?
In our guide on Data Quality we discussed the need for high-quality data for Machine Learning models. It is widely accepted that without ample amounts of high-quality training data, the application of AI and Machine Learning is impossible. This has also been seen in studies, including an IDC survey, in which only 30% of companies reported a 90% or higher success rate in their AI rollout, with reported failure rates of 10 to 49 percent. A key reason for this? Data!

We partnered with our friends from RE•WORK to reach out to various ML experts, specifically asking them the questions: Why is high-quality training data so important? Why do so many projects fail in ML?

Manmeet Singh
Machine Learning Lead, Apple
"The core of any Machine Learning model is what input is being fed to it as the model generalizes based on these training examples. The criteria to choose a ML model is heavily dependent on the kind of input available. For the model to learn anything relevant, training data plays a key role. Imagine in a supervised setting, we are trying to do object recognition. If the labels themselves are messed up what would the model learn? Besides the quality, the quantity of training examples also plays a major role. Training data forms the basis of business decisions based on the offline KPIs being measured on their information. They are the building block for defining a roadmap to the product cycle."

Indu Khatri
Machine Learning Lead, HSBC
"There are two main reasons why quality training data is important. First one is that many problems are solved using Supervised Learning and training data forms the backbone for such applications. The second and more deeper reason is that with the advent of AutoML, democratization of ML skills, and open sourcing of cutting edge research and tools; they are no longer competitive advantages.
The only way businesses can sustain a competitive advantage in AI applications is through differentiated training data."

Lavi Nigam
Data Scientist, Gartner
"In supervised learning, algorithms are dependent on training data to extract relevant patterns for future predictability, hence clean, unbiased and processed training data is crucially important. It’s like the “garbage in garbage out” rule for training any ML/DL models. Although, sometimes, we have lack of training data available and in such cases we have many new semi-supervised learning algorithms coming up. I see great future in such techniques as businesses all across the world don't work at Google/Facebook data scales mostly."

Zhiyong (Sean) Xie
Director, AI, Pfizer
“‘Give me a lever long enough, and I shall move the world’. With deep learning methods, we may say ‘Give me data big enough and I can predict anything.’ Data is the foundation of Machine Learning, especially the deep learning method, with the machines learning everything from data. If you feed a machine biased data, it gives you biased predictions. Garbage in, garbage out.”

Shuo Zhang
Senior ML Engineer, Bose Corporation
“One point I'd like to stress is the domain knowledge and domain specificity in ML. People sometimes think of ML as a ubiquitous technology that can be universally applied to any domain. But in reality, blindly applying ML is dangerous, and you should always know your domain and your data in a very deep way.”

Piero Molino
NLP Research Scientist, Staff ML Team, Stanford
"How would you train models without it?
More seriously, quality and size of data are important because we are trying to tackle difficult tasks and we haven't figured out yet methods that are very robust to noise or that can be trained from smaller amounts of data."

Sparkle R.
Associate Director Data Science, Johnson & Johnson
"While innovative ML tools and techniques are being developed at a rapid pace, ML practitioners can sometimes get drawn into the novelty of applying newer approaches without realizing that the fundamentals of science and lack of high-quality data make many of these algorithms impractical for some healthcare facing solutions. Additionally, in our haste to publish our findings, researchers often forget that these systems will be used by an end-user (patient and provider), and expert level performance in development does not guarantee real-world clinical utility, adoption and transferability across heterogeneous healthcare systems. So, it is equally important that the end user’s workflow is also considered to ensure that the lab to real world transition is successful."

Yaman Kumar
PhD Computer Science, University of Buffalo
"Training data is so important since most of the people, books, blogs, videos start and end with supervised learning. We currently do not know how to work in unsupervised settings. As we move towards unsupervised settings, the requirement of training data as we know it will reduce. In that world, gathering training data would not be an arduous task as it is supposed to be now."

Kiana Alikhademi
Artificial Intelligence & Computer Science, University of Florida
"Training data is the backbone of any machine learning system, without sufficient training data, it is impossible for a machine to learn patterns or solve problems. The importance of training data illustrates why inadequate, or low quality training data, could lead to Machine Learning systems’ failure.
Training data ought to be representative of all different groups within the sample without inheriting any societal prejudices."

Jack Brzezinski
Chief AI Scientist at AI Systems & Strategy Lab
"I feel that the role of knowledge will be increasing. Structures, various knowledge representation types will be essential for the next wave of AI innovation. The lawmakers might soon require AI, ML models to be compliant with multiple statutes or regulations."