Skip to main content

Test data sampling

· 2 min read
Marie Padberg

Before new applications go into production, data-driven tests are essential to ensure the quality of software and applications. For these tesings we have built a test data generator.

Personal data is a core element of the entire business in many industries, e.g. insurance, banking and of course retail. The sensitive and highly-differentiated information stored here cannot be used for testing for data protection reasons. Instead, synthetic data is often used, the structure of which usually does not reflect the original data stock.

This means that realistic test conditions cannot be created. The result is significantly higher costs, because fixinings have to be installed during operation. In the worst case, this can even lead to acceptance problems among users and customers.

What is needed instead is test data that is realistic and representative of the overall data stock. However, the creation of such test data means a lot of work in many cases, because dependencies have to be preserved, data types must not change, outliers need attention, personal data should not be used without pseudonomization, and so on.

Therefore, we have developed a test data generator that creates a representative sample from a large original data set and then pseudonomizes it..

Two different sampling methods can be selected, which we have evaluated in advance using statistical methods. Furthermore, different pseudonomizations are available. Finally, a download of the test data and a short report, with a comparison of the original and test data, are provided.

Currently, we have deployed the entire system as an on-demand site using Azure Functions. This means that when demand is high, more resources are kept available for as long as necessary. When demand drops, resources are reduced again. Therefore, loading the page can sometimes take a few seconds.

We are still in the testing phase of the test data generator.