What every developer should know about data security/privacy

There are studies, including this one, that show that the vast majority of data breaches are due to human error.

Some of this human error is the result of being vulnerable to manipulation, social engineering, etc. Other forms are due to bad practices or even technical debt that lead to, for example, not adhering to the principle of least privilege.

The obvious remedy for the former is security training (whether required for compliance or voluntary). I'm sure that, to some, understanding what Johnny did wrong to infect his PC with malware, or how Jane was tricked by a hacker wearing a hoodie into clicking on a thing in her email, can be useful to some extent (depending on how charitable one wants to be). To others (including myself) this is just super lame. I've always wanted to better understand how and when to access the production database, how to identify sensitive information (including PII), and what specific things developers should and should not be doing in their day-to-day development. Here is what I found that I feel is worth sharing with developers:

Understand that Unfettered Access to the Production Database is Not Ideal

I know this is super obvious to many, but this is as good a starting place as any. Some less experienced developers I've met have felt that under no terms should anybody have access to the production data, while in some companies VPN access to the database is embedded in the working culture. In reality, there are always "break glass" scenarios where somebody needs to make an exception to any established rules for the purposes of fire-fighting. Even if you are not fire-fighting, there may be times when access is required in lieu of a better option. Therefore, my position is not the absolutist one that access should be a "no, never, under any circumstances", but at the very least, we should all strive towards providing better, more sustainable sorts of solutions. I think many developers understand this, but we tend to kick that can down the street without a justifiable replacement option.

Even if the risk of leaking sensitive information is somehow not compelling, we all know that a single bad query can cause a myriad of problems, whether these are load related or actual data integrity issues caused by an errant query (or script). To state the obvious yet again, once a change is committed there is no "undo" button in a database, sadly.
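One partial guard against the errant query is to run writes inside a transaction and refuse to commit when far more rows are affected than expected. The following is only a minimal sketch using Python's built-in sqlite3 module; the `guarded_write` helper, the `accounts` table, and the `max_rows` threshold are hypothetical names made up for illustration, and this does nothing for changes that have already been committed.

```python
import sqlite3


def guarded_write(conn, sql, params, max_rows):
    """Run one write statement; commit only if it touched at most max_rows rows."""
    # In sqlite3's default mode a transaction is opened implicitly before an UPDATE.
    cur = conn.execute(sql, params)
    affected = cur.rowcount
    if affected > max_rows:
        # More rows than expected: undo the uncommitted change.
        conn.rollback()
        return False, affected
    conn.commit()
    return True, affected


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO accounts (status) VALUES (?)", [("active",)] * 3)
conn.commit()

# Errant statement: the WHERE clause was forgotten, so every row matches.
ok, affected = guarded_write(
    conn, "UPDATE accounts SET status = ?", ("closed",), max_rows=1
)
statuses = [row[0] for row in conn.execute("SELECT status FROM accounts")]
print(ok, affected, statuses)
```

The same idea applies to other databases that support transactions, although the driver calls and the way row counts are reported will differ.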

Moving Data Off of the Production Database For Analysis/Testing

Moving data elsewhere is a perfectly rational thing to entertain — it's sort of the idea of taking the bullets out of the loaded gun.

The problem is, if by doing so you create a copy of this data and this data contains sensitive information, you are potentially creating data leaks, especially if these data copies start breeding more and more copies throughout your organization (including on laptops). It is important to understand the risks involved here and start thinking about a possible solution rather than trading off one can to kick down the street for another.

Are Data Leaks Really All That Common? Is the Risk Overhyped a Little?

Truthfully, the actual risk of a copy of production data on an engineer's laptop is probably minimal, but if this becomes a regular sort of practice, it becomes extremely difficult to manage. If your company is subject to some sort of compliance, this may stand in the way of passing an audit. If it isn't and your company is preparing for an exit, an auditor will likely look for skeletons in your closet in the form of bad data management practices. Nobody wants the PR nightmare of a data leak or some sort of exposé on sloppy data management practices. In many cases the psychological pressures and possible narratives here dominate over the actual reality.

This is why there are very successful, highly valued companies (an example) that help identify sensitive information throughout organizations. I'm not criticizing the existence of these companies or dismissing their usefulness. They are clearly a necessary evil, but I feel that bad practices with data are sort of a form of original sin that gives companies like this their business in the first place. By the time you need them you are kind of trying to put that horse back in the barn.

How Do We Keep That Horse in the Barn?

I suggest checking out Redactics. What I feel is compelling about its approach is that it automates creating production samples and sending that data to wherever it should go (whether this is another database or an Amazon S3 bucket, for example), and redacting data is a part of this same process. By the time the data arrives at its desired destination, the sensitive information and PII are already gone. There is nothing to account for (aside from the engineers that have physical access to the infrastructure where this software runs). Moreover, this software is free for developer usage without caveat (and may soon be open-sourced), so this product is not some pay-to-play sort of solution targeted only at large enterprise companies.
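To make the general idea of redacting during the copy concrete, here is a minimal sketch in Python. It is not how Redactics is implemented; the `users` table, the `RULES` mapping, the `PSEUDONYM_KEY` secret, and the `export_redacted` helper are all hypothetical, and in-memory SQLite databases stand in for a real source and destination.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical secret; in practice it would live outside source control.
PSEUDONYM_KEY = b"replace-me"


def pseudonymize_email(value):
    # Keyed hash: the same input always maps to the same fake address,
    # so joins and uniqueness checks still behave in the copy.
    digest = hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


# Hypothetical redaction rules, keyed by column name.
RULES = {
    "full_name": lambda _value: "REDACTED",
    "email": pseudonymize_email,
}


def export_redacted(source, dest, table, columns, rules, limit):
    """Copy up to `limit` rows, applying redaction rules before anything is written."""
    # Table and column names come from trusted configuration, not user input.
    col_list = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    query = f"SELECT {col_list} FROM {table} LIMIT ?"
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({marks})"
    for row in source.execute(query, (limit,)):
        cleaned = [
            rules[col](val) if col in rules and val is not None else val
            for col, val in zip(columns, row)
        ]
        dest.execute(insert, cleaned)
    dest.commit()


DDL = "CREATE TABLE users (id INTEGER, full_name TEXT, email TEXT, plan TEXT)"
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute(DDL)
source.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, "Jane Doe", "jane@corp.test", "pro"), (2, "Johnny Roe", "johnny@corp.test", "free")],
)
source.commit()

export_redacted(source, dest, "users", ["id", "full_name", "email", "plan"], RULES, limit=100)
print(dest.execute("SELECT * FROM users ORDER BY id").fetchall())
```

The point of the design is that the original values never reach the destination, so there is no unredacted copy to track down later.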

Know Your Data

Whether your solution of choice automatically identifies sensitive information and PII or you have to find it yourself, I feel it's important for developers to be aware of specific fields that contain this information, and at least document this in some way so it can be accounted for, however you choose to do so.

With this information, for example, you could create table grants that block access to this data for non-privileged users, or if you are using a solution like Redactics you can ensure that all of these fields are included in your redaction configuration. Redactics does include a PII scanner tool that looks for PII, based on HIPAA identifiers, in a limited sample of your production data, but this tool is intended to be a guide and an additional check. It is much better to simply know your data.
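As a rough sketch of putting such documentation to work, the Python below keeps a small inventory of sensitive fields and derives two things from it: PostgreSQL-style column-level grants for a non-privileged role, and a list of sensitive fields missing from a redaction configuration. The schema, the `analyst` role, and both helper functions are hypothetical, and the redaction configuration here is just a plain dictionary, not the Redactics format.

```python
# Hypothetical schema and sensitive-field inventory.
SCHEMA = {"users": ["id", "full_name", "email", "plan", "created_at"]}
SENSITIVE = {"users": {"full_name", "email"}}


def grant_statements(schema, sensitive, role):
    """Build PostgreSQL-style statements exposing only non-sensitive columns to role."""
    statements = []
    for table, columns in schema.items():
        safe = [c for c in columns if c not in sensitive.get(table, set())]
        # Start from no access, then grant SELECT on the safe columns only.
        statements.append(f"REVOKE ALL ON {table} FROM {role};")
        if safe:
            statements.append(f"GRANT SELECT ({', '.join(safe)}) ON {table} TO {role};")
    return statements


def missing_redactions(sensitive, redaction_config):
    """Return sensitive columns that the redaction configuration does not cover."""
    gaps = {}
    for table, columns in sensitive.items():
        missing = columns - redaction_config.get(table, set())
        if missing:
            gaps[table] = missing
    return gaps


for statement in grant_statements(SCHEMA, SENSITIVE, "analyst"):
    print(statement)

# A configuration that forgot to redact full_name.
print(missing_redactions(SENSITIVE, {"users": {"email"}}))
```

A check like `missing_redactions` could run in CI so that a newly documented sensitive field without a redaction rule fails the build.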

It is also important to note that it is not just personal information that we need to be careful about, but also sensitive or confidential information like API keys and hashed passwords. In my experience, this information is rarely actually needed for QA, analysis, etc., so my rule of thumb is to apply a redaction rule to whatever you are unsure of, especially if there is no harm in doing so.

Data Re-identification

In some cases leaving certain (needed) data in its original state can have unintended consequences from a security perspective. For example, if you redact a person's name and email but not the town they live in, then in a small enough town somebody who has access to this data and is really invested in knowing who this person is might be able to figure it out if they also have the person's age.
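One way to gauge this risk is to count how many rows share each combination of the fields left unredacted, and flag combinations that are too rare. This is the idea behind k-anonymity, a technique not named in the original post; the sketch below assumes hypothetical rows and a hypothetical `risky_groups` helper.

```python
from collections import Counter


def risky_groups(rows, quasi_identifiers, k):
    """Return value combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical redacted sample: names are gone, town and age remain.
rows = [
    {"name": "REDACTED", "town": "Smallville", "age": 34},
    {"name": "REDACTED", "town": "Smallville", "age": 71},
    {"name": "REDACTED", "town": "Metropolis", "age": 34},
    {"name": "REDACTED", "town": "Metropolis", "age": 34},
]

flagged = risky_groups(rows, ["town", "age"], k=2)
print(flagged)
```

Any flagged combination points at a single person in the sample, so those fields would be candidates for coarsening (age ranges, regions instead of towns) or redaction.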

Redacting fields is not always a bullet-proof solution, especially if the use case requires retaining a certain amount of data for analysis (analytics, machine learning, etc.). If you feel this risk is relevant to you, looking at a tactic such as differential privacy might be worthwhile.

What is Differential Privacy?

Differential privacy is sometimes pictured as a giant shuffle button, but more precisely it means adding carefully calibrated random noise to a dataset or to the results of queries against it. This way, the numerical values in the aggregate remain approximately the same, yet what anyone can learn about a single person from a given row or result is limited, which makes data re-identification much harder. A number of Big Tech companies have adopted differential privacy tactics in their platforms and applications.
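One widely used differential privacy technique is the Laplace mechanism, which adds random noise to an aggregate result before it is released. The following is a minimal sketch for a counting query, not a production-grade implementation; the `private_count` helper, the sample ages, and the `epsilon` value are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centered on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values matching predicate."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1),
    # so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 67, 38, 45, 31, 58]

noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

A smaller `epsilon` means more noise and stronger privacy, while a larger one keeps results closer to the true count. Averaged over many queries the noise cancels out, which is why real systems also track how many queries are answered.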

Don't be Paralyzed

Perhaps some developers are so concerned about the possible risks here that they stick with their seed/dummy data, or find workarounds that don't involve working with their production data even if these are less efficient. You simply need to be mindful of these issues, and with a solution in place that you are confident in (whether Redactics or something else), you don't have to be blocked. Chances are this same solution will unblock a whole bunch of other use cases in your company. Your production data is extremely valuable when leveraged to its full potential!
