Last updated: Oct 11, 2022
We know that rich test data helps expose problems in our code, and we know that there is no richer real-world data than actual real-world data. So, why don't we test using production datasets?
Well, for starters, doing so is tedious. It typically requires:
- Concocting a script to generate some sort of production data dump, typically based on a sample from a recent time period (if your company has been around a while, you probably don't want to have to install a multi-terabyte database or recreate ancient rows of data).
- Working with the right people to get the data, and repeating this process whenever the schema changes or you wish to refresh this data.
- Ideally redacting sensitive information to protect the personal data of your customers. This sounds painfully obvious, but according to some studies this step is skipped nearly 30% of the time!
I don't know whether the primary blocker is the resource allocation required, the challenge of coordinating efforts across multiple teams, or the security and data privacy concerns. Either way, production data is out of play more often than not. Instead, developers create dummy data, and more recently companies have formed that specialize in selling synthetic datasets that resemble production data.
Too Bad, So Sad?
I reject the idea that production data should be out of play. I think synthetic data has its utility (e.g., challenging your machine learning models), but it is largely unnecessary. It will never "out real" real production data. It also can't be used to recreate actual production problems that involve data being in a certain state. As for dummy data, throughout my career I've seen a pattern where some dummy data was created during the early days of a service, but it is rarely maintained because nobody has the time to do so. It is the ultimate backlog task that keeps descending into the abyss.
If your career has been like mine, you have learned that customers have an uncanny knack for finding weird edge cases, and having your customers report problems to you is obviously less than ideal. The purpose of automated tests is to find these problems before they are deployed in the first place, so is this the best we can do?
How Can We Put Production Data in Play, and Should We?
I really hate for this to sound like an ad, but I would feel dishonest not providing a full disclosure that I built a solution to this called Redactics. In the spirit of full transparency, it is a new company and we are in the process of finding funding for it, but developer usage is free, so this is not some sort of shameless money-making ploy!
Redactics generates production data samples with appropriate redactions — i.e. "safe" data sets — and pushes them out to wherever you need them to go, whether this is an Amazon S3 bucket or directly to a "digital twin" database. I'll spare you from all of the sales-y stuff about it, but I'd love it if you'd check it out and give us your feedback!
Returning to the question of whether we should test using production data: if you're motivated to tackle the tedium and the annoyingly slow feedback loop of perfecting your scripts, and you don't mind maintaining these scripts going forward, this just leaves the question of data privacy. Data privacy regulations are growing in prominence. Whether you are concerned about compliance, about a mistake leading to a data breach or bad PR, or just have common-sense concerns, doing something about this sensitive information is definitely a good idea.
The easiest thing to do is simply omit these fields or replace them with gibberish, i.e., a primitive sort of redaction. There are far more sophisticated approaches, such as differential privacy, that are helpful when aggregate properties of the data need to be preserved, but this is generally more of a data science concern. More often than not, as far as your tests go, it really doesn't matter whether an email address is a real customer email address or just a dummy address. If you take the "redact when unsure" approach, you shouldn't have to worry about the most common data privacy risks, including data re-identification.
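As a rough illustration of this primitive sort of redaction, here is a minimal Python sketch. The field names in `DROP_FIELDS` and `MASK_FIELDS` and the `redact_row` helper are made up for this example (they are not part of Redactics), and the random replacement values are just one way of producing "gibberish".

```python
import secrets

# Hypothetical field lists; adjust these to your own schema.
DROP_FIELDS = {"ssn", "api_key"}      # omitted entirely
MASK_FIELDS = {"email", "full_name"}  # replaced with gibberish


def redact_row(row: dict) -> dict:
    """Return a copy of `row` with sensitive fields dropped or masked."""
    safe = {}
    for field, value in row.items():
        if field in DROP_FIELDS:
            continue  # omit the field altogether
        if field in MASK_FIELDS and value is not None:
            token = secrets.token_hex(8)  # random, unrelated to the original value
            # Keep emails shaped like emails so format validation still passes.
            safe[field] = f"{token}@example.com" if field == "email" else token
        else:
            safe[field] = value  # non-sensitive fields pass through untouched
    return safe


original = {"id": 1042, "email": "jane.doe@corp.test", "full_name": "Jane Doe",
            "ssn": "123-45-6789", "plan": "pro"}
redacted = redact_row(original)
```

Because the replacement values are random rather than derived from the originals, there is nothing to reverse. The primary key is left alone on purpose, so rows can still be joined and looked up.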
We used the 18 HIPAA Identifiers to determine what sort of sensitive information we should target, but you'll also want to target API keys, credit card info (if applicable to your company), and other sorts of tokens that should be considered confidential, even though they might not be personally identifiable information (PII). We built a scanner to help automate finding this PII, but if you are good with regex searches you can build your own scanner, since much of this information follows searchable patterns. Otherwise, you'll just have to review your schema to identify all of the fields that warrant some sort of redaction.
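If you want to try the build-your-own route, a regex scanner can start out as small as the sketch below. This is not how the Redactics scanner works; the three patterns and the `scan_rows` helper are illustrative assumptions, and they are deliberately loose heuristics that will produce some false positives (for example, any long run of digits looks like a card number).

```python
import re

# Illustrative patterns only; extend these for the identifiers you care about.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def scan_rows(rows):
    """Map each column name to the set of pattern labels found in its values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # this sketch only inspects text values
            for label, pattern in PATTERNS.items():
                if pattern.search(value):
                    findings.setdefault(column, set()).add(label)
    return findings


sample_rows = [
    {"id": 1, "contact": "jane@example.com", "note": "SSN 123-45-6789", "plan": "pro"},
    {"id": 2, "contact": "n/a", "note": "paid with 4111 1111 1111 1111", "plan": "trial"},
]
findings = scan_rows(sample_rows)
```

Running this against a sample of rows from each table gives you a shortlist of columns to review by hand, which is usually faster than reading the schema cold.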
We've found that the last month of safe data from our key tables (leaving out useless info like records of logging in) works well. Our CI/CD tests simply downloaded our schema and a CSV of our safe data from an Amazon S3 bucket, applied them to our database container, and ran the tests. You might not find this as useful for simple unit testing, but this is a great way to back integration tests or more complete end-to-end tests.
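To give a feel for the "apply it to the database and run the tests" step, here is a minimal sketch of a loader. It assumes the schema and CSV have already been downloaded to local files, and it uses Python's built-in `sqlite3` as a stand-in for the database container so that the example runs anywhere. The `load_safe_data` helper, the `orders` table, and the file names are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_safe_data(conn, schema_path, csv_path, table):
    """Apply the schema, then bulk-insert the CSV rows into `table`."""
    conn.executescript(Path(schema_path).read_text())
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        rows = [tuple(row[name] for name in columns) for row in reader]
    if not rows:
        return 0
    # Table and column names come from our own fixture files, not user input.
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), ", ".join("?" for _ in columns))
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


# Demo with throwaway files standing in for the downloaded artifacts.
with tempfile.TemporaryDirectory() as workdir:
    schema_file = Path(workdir) / "schema.sql"
    data_file = Path(workdir) / "orders.csv"
    schema_file.write_text(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT, total REAL);")
    data_file.write_text(
        "id,email,total\n1,a1b2@example.com,19.99\n2,c3d4@example.com,5.00\n")

    conn = sqlite3.connect(":memory:")
    loaded = load_safe_data(conn, schema_file, data_file, "orders")
    order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In a real pipeline you would point the same idea at your actual database engine and its bulk-load tooling, and call the loader from a test fixture so every run starts from the same safe dataset.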
Furthermore, with this same process we are able to install these same datasets locally to reproduce problems a company's customers have reported. Since all of the PII has been replaced we can't just find the customer by email, but we can reference the customer by their original numerical primary key in the database.
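In practice this lookup is nothing more than a query by ID. The sketch below uses an in-memory `sqlite3` database with a made-up `customers` table and `find_customer` helper purely to show the shape of it.

```python
import sqlite3

# Hypothetical redacted dataset: emails are placeholders, IDs are the originals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1041, "redacted-1041@example.com", "trial"),
     (1042, "redacted-1042@example.com", "pro")],
)


def find_customer(conn, customer_id):
    """Fetch a customer by original primary key, or None if the ID is absent."""
    cursor = conn.execute(
        "SELECT id, email, plan FROM customers WHERE id = ?", (customer_id,))
    return cursor.fetchone()


# Support hands you the customer ID from the bug report, not the email.
reported = find_customer(conn, 1042)
```

This only works if the redaction step leaves primary keys (and the foreign keys that reference them) untouched, which is one more reason not to treat IDs as fields to scramble.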
So, the answer to "should we test with production data" is, in my opinion, yes, so long as you can overcome these challenges. Having a virtual repository of problematic data states is a luxury for testing, and can be a massive time saver for recreating actual problems and crushing those bugs.
Do you have any comments? Disagreements? Further insight? We'd love to hear from you, so please comment below!