Angsuman Dutta, Pricchaa Inc

In preparation for General Data Protection Regulation (GDPR) compliance, a global financial services organisation assessed its core information processing environments to identify opportunities to strengthen its privacy data protection programs.

This article focuses on the technology challenges, approach, and lessons learned for the centralised testing environment.


Like many DevOps groups across the industry, this financial organisation has adopted a continuous testing and quality testing regime to deliver quality products using agile methodology.

The organisation prefers to prepare its test data from production data. While the majority of testing is done by an internal team, certain applications are tested by outsourced offshore teams.

The test environment is fairly complex, comprising Oracle, Hadoop (Parquet files), Hive, Cassandra, MS SQL, SAS and Linux-based systems. Incremental data volume varies between 10 million and 15 million records on a weekly basis. Certain major releases of big data-based applications require up to 5 GB of data (~75 million records).


To comply with GDPR and prevent privacy data breaches, the testing team needed to detect and de-identify personally identifiable information (PII) elements. If the team used the available de-identification methods based on product-specific encryption technology, such as MS SQL encryption, much of the data would become unusable for testing for the following reasons:

  • current methods scramble the data and make it unusable;
  • current methods do not preserve referential relationships between the various data sources.

Masking the data presents similar challenges. For example, to test an application that calculates the end-of-month summary balance of a customer account using an Oracle data source and a Hadoop data source, the team would not be able to use data encrypted with the available technology, because the account records in the two sources could no longer be matched.

In addition, PII often appears within comment and description fields; encrypting or masking the entire field would result in the loss of important information.

More importantly, data encryption using the available methods is computationally time-consuming and requires a large hardware infrastructure.


The organisation identified the following solution criteria to mitigate the challenges identified during the assessment:

  • Autonomous detection

Leveraging a centralised library, the solution should examine all incoming data, including embedded documents, for the presence of PII elements. The solution should also use machine learning techniques to classify sensitive documents present in the big data repository.
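
As a rough illustration of the detection criterion, the sketch below scans incoming records against a small pattern library and reports which fields contain which PII types. The pattern names, the two regular expressions and the `scan_records` helper are assumptions made for this example, not part of the organisation's solution, and the machine learning classification of documents is not shown.

```python
import re

# Hypothetical centralised pattern library. A real library would need many
# more patterns, and far more robust ones, than these two.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_records(records):
    """Return {field name: set of PII types found} across all records."""
    findings = {}
    for record in records:
        for field, value in record.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(field, set()).add(pii_type)
    return findings

sample = [
    {"id": 1, "comment": "Customer jane@example.com called about fees"},
    {"id": 2, "comment": "SSN 123-45-6789 provided for verification"},
    {"id": 3, "comment": "No issues reported"},
]
report = scan_records(sample)
print(report)
```

Because the scan looks inside free-text values rather than relying on column names, it can surface PII embedded in comment and description fields.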

  • Format preserving encryption

Based on the type of PII data and the preference of the user, the solution should encrypt the data elements in one of the following three modes:

1) blind mode: it should encrypt a data element if the element matches a specific regular expression;

2) column mode: it should encrypt the content of a specific column or field;

3) mixed mode: it should encrypt the data elements within a specific column if they match a specific regular expression.
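
The difference between the three modes can be sketched as follows. This is a minimal illustration only: `placeholder_encrypt` is a stand-in that overwrites letters and digits while keeping punctuation, not real format-preserving encryption, and the `protect` function, its parameters and the sample pattern are hypothetical names chosen for this example.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # sample pattern, an assumption

def placeholder_encrypt(text):
    """Stand-in for encryption: digits become '0', letters become 'X'."""
    return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "0", text))

def protect(record, mode, pattern=None, columns=()):
    """Apply blind, column or mixed mode to the string fields of one record."""
    if mode not in ("blind", "column", "mixed"):
        raise ValueError(f"unknown mode: {mode}")
    if mode in ("blind", "mixed") and pattern is None:
        raise ValueError("blind and mixed modes need a pattern")
    out = dict(record)
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        in_scope = field in columns
        if mode == "blind":
            # Any field: protect only the parts that match the pattern.
            out[field] = pattern.sub(lambda m: placeholder_encrypt(m.group()), value)
        elif mode == "column" and in_scope:
            # Named columns: protect the entire content.
            out[field] = placeholder_encrypt(value)
        elif mode == "mixed" and in_scope:
            # Named columns: protect only the parts that match the pattern.
            out[field] = pattern.sub(lambda m: placeholder_encrypt(m.group()), value)
    return out

row = {"name": "Jane Doe", "note": "SSN 123-45-6789 on file", "memo": "alt 987-65-4321"}
blind = protect(row, "blind", pattern=SSN)
column = protect(row, "column", columns=("name",))
mixed = protect(row, "mixed", pattern=SSN, columns=("note",))
print(blind, column, mixed, sep="\n")
```

In this sketch, mixed mode leaves the surrounding text of the note readable and leaves the memo field untouched, which is the behaviour that whole-field encryption or masking cannot offer.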

  • Cross-platform referential integrity

The solution must be able to retain referential integrity between records across platforms.
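
One way to picture this requirement: if the same identifier is always transformed into the same token, records from different platforms still join after de-identification. The sketch below uses a keyed hash (HMAC-SHA256) to replace digits deterministically while keeping length and separators. It is an assumption-laden stand-in rather than the solution described here: it is one-way pseudonymisation, not reversible format-preserving encryption, distinct inputs can collide, and the key, function name and sample rows are invented for illustration.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # assumption: one key shared by all platforms

def pseudonymize_digits(value, key=SECRET_KEY):
    """Deterministically replace each ASCII digit, keeping length and separators."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    used = 0
    for ch in value:
        if ch in "0123456789":
            out.append(str(digest[used % len(digest)] % 10))
            used += 1
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical rows standing in for an Oracle table and a Hadoop data set.
oracle_rows = [{"account_id": "4471-0098", "balance": 1200.50}]
hadoop_rows = [{"account_id": "4471-0098", "txn_amount": -200.00}]

oracle_safe = [{**r, "account_id": pseudonymize_digits(r["account_id"])} for r in oracle_rows]
hadoop_safe = [{**r, "account_id": pseudonymize_digits(r["account_id"])} for r in hadoop_rows]

# The join key still matches across the two sources after de-identification.
token = oracle_safe[0]["account_id"]
joined = [(o, h) for o in oracle_safe for h in hadoop_safe
          if o["account_id"] == h["account_id"]]
print(token, len(joined))
```

A true format-preserving cipher behaves as a permutation over the input space, so it avoids the collisions this hash-based sketch allows.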

  • Big data volume

The solution should be able to detect and encrypt sensitive data in 100 GB of data in less than one hour using commodity hardware.
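
To put that target in perspective, the back-of-the-envelope calculation below converts it into a sustained throughput. It assumes decimal gigabytes and reuses the ratio quoted earlier (5 GB for roughly 75 million records) as a rough guide to record size, which may not hold for every data set.

```python
TARGET_BYTES = 100 * 10**9   # 100 GB, assuming decimal units
WINDOW_SECONDS = 60 * 60     # one hour

# Sustained throughput needed to finish inside the window.
required_mb_per_s = TARGET_BYTES / WINDOW_SECONDS / 10**6

# Rough record rate, assuming 5 GB holds about 75 million records.
records_per_gb = 75_000_000 / 5
required_records_per_s = 100 * records_per_gb / WINDOW_SECONDS

print(f"{required_mb_per_s:.1f} MB/s")
print(f"{required_records_per_s:,.0f} records/s")
```

Under these assumptions, detection and encryption together must sustain about 27.8 MB/s, or roughly 417,000 records per second.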

  • Data usage monitoring

The solution should be able to record and retain information on all privacy data usage for audit and compliance purposes. In addition, the solution should be able to identify abnormal data usage using machine learning.
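
The monitoring idea can be illustrated with a much simpler technique than machine learning. The sketch below counts accesses per user from an audit log and flags users whose count sits far above the median, scored by the median absolute deviation. The log format, the user names, the threshold and the `flag_unusual_users` helper are all assumptions for this example; a production system would use richer features and a trained model.

```python
from collections import Counter
from statistics import median

def flag_unusual_users(access_log, threshold=5.0):
    """Flag users whose access count is far above the median (MAD-based score)."""
    counts = Counter(entry["user"] for entry in access_log)
    values = list(counts.values())
    med = median(values)
    # Fall back to 1.0 so identical counts do not cause a division by zero.
    mad = median(abs(v - med) for v in values) or 1.0
    return sorted(user for user, count in counts.items()
                  if (count - med) / mad > threshold)

# Hypothetical audit log: one entry per read of a table holding privacy data.
access_log = (
    [{"user": "tester_a", "table": "customers"}] * 3
    + [{"user": "tester_b", "table": "customers"}] * 4
    + [{"user": "tester_c", "table": "customers"}] * 5
    + [{"user": "tester_d", "table": "customers"}] * 4
    + [{"user": "tester_e", "table": "customers"}] * 60
)
flagged = flag_unusual_users(access_log)
print(flagged)
```

Retaining the raw log entries alongside the flags keeps the evidence needed for audit and compliance reviews.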

Lessons learned

  • Understand business and technology landscape

It is imperative to understand the current technology landscape, business practices and emerging trends. If your technology platform and domain are monolithic today, do you expect them to remain monolithic in the near future? What would be the impact should you move some of your testing to a cloud platform? What about big data applications?

  • Evaluate risks

Assess data security risks through the lens of GDPR and beyond. In addition to PII and protected health information (PHI), most organisations deal with a range of sensitive data that may not be associated with an individual. How do you detect, encrypt and monitor other types of sensitive data, such as B2B contract information, in your testing environment?

  • Beyond retrofitting 

Define the ideal solution characteristics prior to evaluating solutions. Retrofitting a solution to meet your business needs is often time-consuming and costly.

By Angsuman Dutta, founder of Pricchaa Inc