How To Sync Data Between Two Databases

How can we sync two databases, both copying all initial data, and then finding new and updated subsequent records and delta copying these records?

The Automated (and Free) Solution

Redactics will do this running in your own infrastructure, and is free for developer usage (and we are working on open sourcing our developer-focused offerings). It is a pure SQL solution making it extremely lightweight. Delta updates approach near realtime replication (and are lightweight enough to handle aggressive table operational concurrency), and are fast enough to run every few minutes or so for many use cases.

The Homemade Solution

If you want to do this yourself, here is an approach that works well:

Ensure that each of your tables you wish to sync have numerical primary keys, as most SQL-based databases do (typically auto-incrementing).
Then, considering that new primary keys will always be increasing in size, for identifying new rows you can just create SQL queries to search for records with larger primary key values than your last primary key.
For updates, ensure that your master tables have an updated date field that is updated each time the record is updated. Then, you can use SQL to search for last updated dates newer than your most recent update. It is okay if you need to search for dates a little older than the present time because there generally is no harm in re-applying the same update to existing data.
If you want to write new data to your target database, you'll need a strategy to prevent conflicting primary keys. An approach we like is creating a new column called something like source_primary_key, generating CSVs (without headers) using the above two approaches (one CSV for new rows, one CSV for updates), and then piping this CSV into awk to remove the primary key column, and move its value to your source_primary_key column. For example, if your primary key column is your first column and your source_primary_key your fifth column: cat /path/to/csv.csv | gawk -vOFS=, -F "," "{print $2,$3,$4,$5,$1}" | [your SQL client]. We found this library necessary for handling data that contains quotation marks.
For row deletions we generally recommend soft deleting records, which would be treated as an update, since setting the value of a disabled field to true would also update this row's last updated date.

Please let me know if you have any luck with either approach!

How To Sync Data Between Two Databases

The Automated (and Free) Solution

The Homemade Solution

Comments

More from this blog

The Easiest Way to Clone a PostgreSQL Database

How To Get the Most Out of Airflow's Dynamic Task Mapping

How To Automate Database Migration Testing/Dry-runs in Your CI/CD Pipelines

How To Poll an Airflow Job (i.e. DAG Run)

What Every Developer Should Know About Data Security/Privacy

Command Palette

The Automated (and Free) Solution

The Homemade Solution

Comments

More from this blog