How To Stop Living With Your Seed Data Sucking



Throughout my career I've come across projects where seed data was created in the very early days of a repo, and only the bare minimum is done to update it as the schema changes. If a new string column is added, developers will at best put a "blah" in required string/varchar fields just to get tests to pass. This data is nowhere near robust enough to back rigorous tests, and it is a pain to maintain for thorough regression testing.

The main problems, as I see them, are:

  1. When writing unit tests, it can be challenging to anticipate the edge cases your users will find, and to find the time and creativity to challenge your code with good seed data.
  2. As your users expose new edge cases, developers have to go back and update this seed data if test coverage is to account for them (which is generally a good idea).
  3. Unit tests typically mock data from other services or from external APIs/data sources, while integration tests are meant to exercise your product more thoroughly, with data that is as realistic as possible. Developers typically don't have a great way to generate datasets robust enough to make integration testing worth the effort.

I don't have a great solution to the first problem; unfortunately, we all lack crystal balls. However, let's discuss the other challenges...

How Do We Come Up With Realistic Data for Integration Testing?

There are SaaS providers that create synthetic data: fake data that has the same characteristics as real data. I haven't used it for machine learning modelling, but I would imagine it excels there, since it can use logic to cover every permutation of the data. However, I'm not sold on it for QA/testing purposes. In my experience, the most realistic form of production-like data is... production data. Customers of the products I've supported have always had a knack for doing unexpected things, so following their lead seems like the best approach. The problems with using production data have always been grabbing a fairly small sample (you don't want gigabytes of data belonging to users your tests never touch) and protecting customers' PII and sensitive information. There are solutions for both of these problems...
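To make the two problems concrete, here is a minimal sketch of sampling and redacting in Python. It is only an illustration, not how any particular tool does it: the field names (id, email, full_name, created_at), the salt, and the helper functions are all hypothetical.

```python
import hashlib
from datetime import date


def pseudonymize_email(email: str, salt: str = "test-salt") -> str:
    """Swap a real email for a stable fake one.

    The same input always maps to the same output, so rows that
    reference the same person in different tables still line up.
    """
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()[:12]
    return f"user_{digest}@example.com"


def sample_recent_users(users: list[dict], limit: int) -> list[dict]:
    """Keep only the `limit` most recently created users."""
    return sorted(users, key=lambda u: u["created_at"], reverse=True)[:limit]


def redact_user(user: dict) -> dict:
    """Return a copy of the row with PII fields replaced."""
    redacted = dict(user)
    redacted["email"] = pseudonymize_email(user["email"])
    redacted["full_name"] = f"Test User {user['id']}"
    return redacted


# Hypothetical production rows.
users = [
    {"id": 1, "email": "alice@corp.test", "full_name": "Alice A", "created_at": date(2021, 1, 5)},
    {"id": 2, "email": "bob@corp.test", "full_name": "Bob B", "created_at": date(2021, 3, 9)},
    {"id": 3, "email": "carol@corp.test", "full_name": "Carol C", "created_at": date(2021, 6, 1)},
]

sample = [redact_user(u) for u in sample_recent_users(users, limit=2)]
```

A salted hash keeps the fake values consistent between runs, but it is not a substitute for reviewing which columns actually hold sensitive data.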

How Do We Support New Features with Production Data?

We are back at the chicken and egg problem described above: a feature that hasn't been released has no production data yet. However, in my experience integration tests are great for regression testing, and the thing that regresses is often a legacy feature or a part of your code that is rarely on your happy path. So you can use production data as test data for existing features, and do your best with fake data for unreleased features.

Does This Mean Resetting the Data Used For Integration Tests Periodically?

Yes. Fortunately, this too can be automated (and integrated with your CI/CD platform). When should this be done? I've also found value in using these same datasets for local testing, particularly when a customer has reported a problem. So whenever I install a dataset locally to chase down an issue, once the issue is resolved I update my integration tests to use that same dataset. You could also do this before a major release, at the beginning of a coding cycle, or whenever works for you and your team. This requires creating some new habits, but it's worth it to have smooth releases without the frenzy of firefighting.
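As a rough idea of what an automated reset step could look like, here is a small Python sketch that rebuilds a table from a dataset export before the tests run. Everything in it is an assumption for illustration: the users table, its columns, the CSV format, and the use of SQLite in place of your real database.

```python
import csv
import io
import sqlite3


def reset_users_table(conn: sqlite3.Connection, csv_text: str) -> int:
    """Drop and rebuild the users table from a CSV export.

    Running it again replaces the old rows instead of appending,
    so a CI job can call it before every test run.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute("DROP TABLE IF EXISTS users")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
    conn.executemany("INSERT INTO users (id, email) VALUES (:id, :email)", rows)
    conn.commit()
    return len(rows)


# Hypothetical export; in practice this would be the latest redacted dataset.
DATASET_CSV = "id,email\n1,user_a@example.com\n2,user_b@example.com\n"

conn = sqlite3.connect(":memory:")
first_load = reset_users_table(conn, DATASET_CSV)
second_load = reset_users_table(conn, DATASET_CSV)  # safe to repeat
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Because the reset is repeatable, the same function can back both a scheduled CI job and a one-off local refresh.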

With This Production Data Sample, How Do We Impersonate Specific Users?

Your sample can be for the most recent 5-10 users, some of your super users, or even just a random selection of users. Your approach of choice will influence your tactics, but it's not hard to take over an account once you have physical access to the underlying data.
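One way to take over an account in a test dataset is to overwrite the stored credentials with ones you know. The Python sketch below assumes, purely for illustration, that the application stores a hex-encoded PBKDF2-HMAC-SHA256 hash next to a per-user salt; the field names and iteration count are hypothetical and would need to match whatever your application really does.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumption: must match your application's setting


def hash_password(password: str, salt: bytes) -> str:
    """Hash a password the way the (assumed) application does."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS).hex()


def take_over_account(user: dict, known_password: str) -> dict:
    """Return a copy of the user row with credentials replaced by known ones."""
    salt = os.urandom(16)
    updated = dict(user)
    updated["password_salt"] = salt.hex()
    updated["password_hash"] = hash_password(known_password, salt)
    return updated


def verify_password(user: dict, candidate: str) -> bool:
    """Check a candidate password against the stored salt and hash."""
    salt = bytes.fromhex(user["password_salt"])
    return hmac.compare_digest(user["password_hash"], hash_password(candidate, salt))


# Hypothetical row from a redacted sample; the original hash is unusable to us.
sampled_user = {"id": 42, "email": "user_ab12@example.com",
                "password_salt": "00", "password_hash": "unknown"}
test_user = take_over_account(sampled_user, "integration-test-password")
```

A test can then log in as this user with the known password and see exactly the data that customer sees.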

How Do We Actually Do All Of This?

Check out Redactics. It is free, there are plans to open source it, and you can use it to:

  • Generate production data samples (support is coming to pull data for specific users, but you can extract data from specific time periods).
  • Handle all PII and sensitive data so that it never finds its way into your testing environments.
  • Automatically generate datasets on-demand or on schedule that are pushed to an Amazon S3 bucket your tests can pull from.
  • Install these same datasets locally for local testing/QA.
  • Automate not only generating these datasets, but also installing them directly into your environment. For example, for local usage there is a CLI tool to install the data locally via docker-compose.

Thoughts? Really curious to hear what you have to say!