Synthetic Data Over Obfuscated Production Data for GDPR Compliance

Software testing is, and always will be, a crucial part of the development and QA process, and it has traditionally been carried out using readily available production data. However, data protection regulations such as GDPR effectively make it illegal to process personal and private data for any purpose not authorised by the individual(s) concerned. Organisations therefore need an acceptable solution that keeps them on the right side of the rules, or they face the risk of significant financial penalties.

As discussed in previous blogs, one possible solution is to use a data obfuscation technique such as encryption, tokenisation or data masking. However, whilst these approaches have many practical applications, such as departmental data sharing or protecting financial transactions, not every method gets you off the GDPR hook when it comes to the testing or training environment. They can also come with cost implications and practical usability downsides.

Data obfuscation typically involves clouding or replacing one or more of the data elements, such as a name or address line, with meaningless fake values of similar length and structure. However, from a GDPR perspective this would still be considered non-compliant because, depending on which approach is used, obfuscated data can potentially be reverse engineered back to its original state, disclosing an individual’s personal and private information.
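As a rough illustration of what "fake values of similar length and structure" means in practice, here is a minimal Python sketch of character-level masking. The function names (mask_value, mask_record), the sample record and the choice of which fields to mask are all hypothetical; real masking tools are considerably more sophisticated.

```python
import random
import string


def mask_value(value: str, rng: random.Random) -> str:
    """Swap each letter or digit for a random one, keeping length, case and punctuation."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # spaces, apostrophes, hyphens stay where they are
    return "".join(out)


def mask_record(record: dict, fields_to_mask: set, rng: random.Random) -> dict:
    """Mask only the named fields; every other field passes through unchanged."""
    return {
        key: mask_value(val, rng) if key in fields_to_mask else val
        for key, val in record.items()
    }


# Hypothetical record: only the name and address are masked.
record = {
    "name": "Jane O'Neil",
    "address": "12 High Street",
    "postcode": "AB1 2CD",
    "birth_year": "1984",
}
masked = mask_record(record, {"name", "address"}, random.Random(0))
print(masked)
```

Note that the postcode and birth year pass through untouched in this sketch. Fields left in the clear like this are exactly the kind of residual clue that makes masked data open to re-identification.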

Reverse engineering of obfuscated data has proven to be relatively simple because not all the data elements are obfuscated, leaving multiple clues and signatures in the data that can be traced back to real people. In a 2019 study, researchers in Belgium and the UK developed a model showing that nearly every real person could be correctly re-identified in an anonymised dataset using just 15 demographic attributes. A similar study was able to re-identify 90% of the individuals in a dataset of 1.1 million people using three months of credit card metadata.
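To see why a handful of unmasked attributes is enough, it helps to count how many records become unique as attributes are combined. The Python sketch below is a simple illustration of that idea only; it is not the model used in the studies mentioned above, and the unique_fraction function and the sample records are invented for the example.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records that are the only one with their attribute combination."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Hypothetical masked dataset: names are gone, but three attributes remain in the clear.
records = [
    {"postcode": "AB1", "birth_year": 1984, "gender": "F"},
    {"postcode": "AB1", "birth_year": 1984, "gender": "M"},
    {"postcode": "AB1", "birth_year": 1990, "gender": "F"},
    {"postcode": "CD2", "birth_year": 1984, "gender": "F"},
    {"postcode": "CD2", "birth_year": 1984, "gender": "F"},
]

print(unique_fraction(records, ["postcode"]))                          # 0.0
print(unique_fraction(records, ["postcode", "birth_year"]))            # 0.2
print(unique_fraction(records, ["postcode", "birth_year", "gender"]))  # 0.6
```

Even in this tiny example, each extra attribute singles out more records. A unique record can be matched against any outside source that holds the same attributes alongside a name.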

To stand a chance of being GDPR compliant, obfuscated data needs to be protected and audited as strictly as the source data. This can be resource intensive and potentially expensive to implement, and it can also leave other stakeholders open to liability issues.

Encryption is another option and can be a much stronger obfuscation technique, because the encrypted data is unreadable and cannot be restored without the encryption key. But unless the key is destroyed after use, encrypted data would still not be considered GDPR compliant for test purposes, and without the key the encrypted data is effectively useless in a test environment.

An alternative approach is to use synthetic data as a replacement for production data. Because all the source data is completely replaced with randomly generated values, the result does not fall under the normal GDPR rules and can be used freely in the test environment without limiting or compromising the testing processes.

Synthetic data generation uses the raw production data to produce a completely new dataset that has the same characteristics, attributes and predictive potential as the original dataset. This makes it indistinguishable from the real data for testing purposes, and because it is entirely fictitious it cannot be linked to any real people, which means there is no risk of a breach of data privacy or accidental disclosure to a non-authorised third party.
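The general idea can be sketched in a few lines of Python: learn a summary profile from the source data, then sample brand new rows from that profile rather than copying any source row. This is a deliberately simplified sketch and not how any particular product works. The fit_profile and generate functions, the column names and the sample rows are all hypothetical, and each column is sampled independently, so relationships between columns are not preserved as a production-grade generator would need to do.

```python
import random
import statistics


def fit_profile(rows, numeric_cols, categorical_cols):
    """Summarise each column: mean and standard deviation, or category frequencies."""
    profile = {}
    for col in numeric_cols:
        values = [r[col] for r in rows]
        profile[col] = ("numeric", statistics.mean(values), statistics.stdev(values))
    for col in categorical_cols:
        values = [r[col] for r in rows]
        categories = sorted(set(values))
        weights = [values.count(c) for c in categories]
        profile[col] = ("categorical", categories, weights)
    return profile


def generate(profile, n, rng):
    """Draw n new rows from the profile; no source row is ever copied."""
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in profile.items():
            if spec[0] == "numeric":
                _, mean, sd = spec
                row[col] = round(rng.gauss(mean, sd), 2)
            else:
                _, categories, weights = spec
                row[col] = rng.choices(categories, weights=weights, k=1)[0]
        rows.append(row)
    return rows


# Hypothetical source data standing in for a production table.
source = [
    {"order_value": 20.0, "region": "North"},
    {"order_value": 35.5, "region": "North"},
    {"order_value": 50.0, "region": "South"},
    {"order_value": 64.5, "region": "North"},
    {"order_value": 80.0, "region": "South"},
]

profile = fit_profile(source, ["order_value"], ["region"])
synthetic = generate(profile, 1000, random.Random(42))
print(synthetic[:3])
```

Because rows are drawn from the profile rather than derived from individual records, the number generated is not tied to the size of the source table; here five source rows yield a thousand synthetic ones. One caveat with a naive approach like this is that rare category values are carried over as-is, so a real generator would need to handle them with more care.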

Synthetic data also has additional benefits in the testing environment. As well as removing the issue of protecting data privacy, synthetic data is highly scalable: quality data can be created in unlimited quantities to enable wider test innovation. System stress testing, for example, can measure performance under extreme production workloads, helping to future-proof your corporate systems.

Which Form of Data Masking is Right for my Organization?

All companies have different needs when it comes to masking or obfuscating their datasets, as much depends on their specific data usage. If you are not sure that your approach to data security and privacy is fully compliant with GDPR or other data protection regulations, or you just want reassurance that you are doing the right thing, a call with one of our data management experts will provide the answers you need, helping to minimise the risk of a data breach and avoid a hefty financial hit.

