Data Protection and Privacy for Training Data

Heather Gadonniex

June 10, 2020


  • Legal privacy and security requirements often reduce the amount of training data available
  • The growing popularity of AI has been mirrored by growing concerns about the privacy, security and ethical use of data
  • PII includes any information that could point toward an individual's identity, including, but not limited to, social media information and IP addresses
  • Sama enables AI companies to scale training data faster without compromising quality, privacy or security

With the steady rise in both the popularity and progress of Artificial Intelligence (AI) over recent years, many have been quick to raise potential privacy and security concerns, with buzzwords like ‘ethics’ and ‘responsibility’ never far from the discussion. While the initial public perception of AI was “will automation steal my job?”, steady progress has brought AI and Machine Learning technology into our living rooms, cars, phones and more, mostly without people noticing. An important question has therefore emerged: What level of trust can, and should, we place in these AI systems?

In an age of increasingly complex governmental data privacy requirements, it can be hard to understand not only how much personal data is available, but also how it is protected, both in law (GDPR, information privacy law, etc.) and through the design of solution providers' products.

Personally identifiable information (PII) is any information that could identify a specific individual. This wide definition can create challenges, especially when searching for AI training data, as it can cover anything from IP addresses and imagery to behavioral data, social media information, and more.
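
To make the breadth of that definition concrete, here is a minimal sketch of how a few kinds of PII might be masked in free text before it is used as training data. The patterns, placeholder labels and the `redact_pii` function name are assumptions made for illustration. They are deliberately simplified and would miss many real-world cases, so treat this as a starting point, not a complete PII detector.

```python
import re

# Assumption: simplified, illustrative patterns. Real PII detection needs far
# broader coverage (names, addresses, IPv6, imagery, behavioral data, ...).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
# A social media style handle: "@" not preceded by a word character or "@".
HANDLE = re.compile(r"(?<![\w@])@\w{2,}")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses, IPv4 addresses and @handles with placeholders."""
    # E-mails are masked first so their "@domain" part is not read as a handle.
    text = EMAIL.sub("[EMAIL]", text)
    text = IPV4.sub("[IP_ADDRESS]", text)
    text = HANDLE.sub("[HANDLE]", text)
    return text


sample = "User @jane_doe logged in from 192.168.0.12, contact jane@example.com"
print(redact_pii(sample))
```

Masking with typed placeholders, as opposed to deleting the match, keeps the sentence structure intact, which can matter when the redacted text is itself used for training.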

Why do we need such swathes of data? Put simply, more data allows for more training across a wider range of training scenarios. This, in turn, often leads to more accurate models, owing to both the amount of training and the variety of scenarios the models have been exposed to [1]. At this stage, it is also important to recognize the distinctions between structured and unstructured data and between supervised and unsupervised learning. You can read more on this here.

To summarize, the main challenges faced by those who need large training datasets include, but are not limited to, the following:

  • Inability to use all owned data due to GDPR and CCPA privacy restrictions
  • Lack of anonymized video training data available for use
  • Previously used techniques, such as pixelating imagery, reduce model performance for video analysis
  • The need for manual human intervention

Recent developments on our platform aim to address these challenges through Vector Annotation, Semantic Segmentation, Lidar/3D Annotation, and Dynamic Labelling. Coupled with ISO-certified data centers, vulnerability testing of systems, data storage encryption and GDPR compliance, these developments are designed to provide not only a high level of internal security, but also a dynamic and innovative data utilization service.

Many applications have previously struggled to keep personally identifying information safe across a variety of data sources, especially where street-level images, cars, retail spaces and similar are concerned. Sama, however, uses deep learning pre-annotation technology to anonymize data without the need for human intervention. The process is as follows:

  • Data is run through our anonymizer technology service before any labeling occurs
  • This service automatically detects faces and license plates and obscures them until they are unrecognizable
  • The resulting AI-generated content provides training data that looks like real-time data when people and vehicles are the primary objects of interest for the algorithm
  • Because no human intervention is required, the privacy of PII is further protected
  • This service then allows for increased test data at a faster pace, without compromising security, allowing you to scale your AI without compromising the trust of stakeholders
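
The detect-then-obscure steps above can be sketched in a few lines. This is not Sama's implementation: the detector is a hypothetical stub that returns a fixed box, the image is a small grayscale grid of plain Python lists, and the "unrecognizable" check is an assumed stand-in (the region is blurred repeatedly until the gap between its brightest and darkest pixel falls below a threshold). A real pipeline would use a trained detection model and an image library.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def detect_sensitive_regions(image: Image) -> List[Box]:
    """Hypothetical stand-in for a face / license-plate detector."""
    # Assumption: a real system would run a trained detection model here.
    return [(1, 1, 4, 4)]


def blur_once(image: Image, box: Box) -> None:
    """Apply one 3x3 mean-blur pass to the pixels inside `box`, in place."""
    top, left, bottom, right = box
    source = [row[:] for row in image]  # read from a copy so the pass is consistent
    for r in range(top, bottom):
        for c in range(left, right):
            # Average only over neighbours that lie inside the box.
            neighbours = [
                source[rr][cc]
                for rr in range(max(top, r - 1), min(bottom, r + 2))
                for cc in range(max(left, c - 1), min(right, c + 2))
            ]
            image[r][c] = sum(neighbours) / len(neighbours)


def spread(image: Image, box: Box) -> float:
    """Difference between the brightest and darkest pixel inside `box`."""
    top, left, bottom, right = box
    values = [image[r][c] for r in range(top, bottom) for c in range(left, right)]
    return max(values) - min(values)


def anonymize(image: Image, threshold: float = 1.0, max_passes: int = 500) -> Image:
    """Return a copy of `image` with every detected region blurred flat."""
    result = [row[:] for row in image]
    for box in detect_sensitive_regions(result):
        passes = 0
        # Keep blurring until the region carries almost no detail.
        while spread(result, box) > threshold and passes < max_passes:
            blur_once(result, box)
            passes += 1
    return result


frame = [
    [10.0, 10.0, 10.0, 10.0, 10.0],
    [10.0, 0.0, 255.0, 0.0, 10.0],
    [10.0, 255.0, 0.0, 255.0, 10.0],
    [10.0, 0.0, 255.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 10.0, 10.0],
]
anonymized = anonymize(frame)
for row in anonymized:
    print([round(value, 1) for value in row])
```

Only the pixels inside the detected box change, so the rest of the frame stays available for labeling. Running this step before any labeling occurs means annotators never see the original faces or plates.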

Want to learn more? Check out our Anonymization webinar here.