Case Study: Apache Airflow As a Managed Appliance

·

7 min read

We bet our business around our customers being sold on running our software in their own infrastructure, rather than building a more traditional SaaS platform. Redactics is all things data management: primarily data pipelining with PII removal for test and demo environments (including ML, analytics, etc.), data delivery to stakeholders, database migration/schema dry-running, data destruction, etc. In short, we are trying to position ourselves as a robust toolkit for building out automated data workflows. We are a new company and would love to hear from all of you, but I'll stop here so that this doesn't come across as a self-serving sales pitch!

We feel that there would be numerous technological disadvantages to building this company as a SaaS platform: data security concerns in shipping data to the cloud, performance, cost overhead, etc. It seems like certain areas of technology have a knack of swinging back and forth like a pendulum - for example the classic mainframe vs. thin client question. Relatively speaking, there are not many companies taking the managed appliance approach (although we do supplement our local agent with our cloud-hosted APIs), so who knows, perhaps we are part of the pendulum swinging in the other direction (for very specific use cases) or are just out of our minds!

The existence of Airflow and Kubernetes really tipped the scales for us in making this decision, it is hard to imagine building this company without these technologies. Airflow can certainly scale horizontally to manage a gazillion jobs and workloads, but we were impressed with how well it runs with a minimalist resource footprint, and how resilient it has been for us on very modest hardware. The improvements to the scheduler in 2.x and the dynamic task mapping features in 2.3 have been huge improvements, the latter in allowing our DAGs to be remote managed from the cloud (more on this later).

We thought there might be an audience interested in understanding our tactics that have been successful for us.

Kubernetes and Helm

We are big fans of Helm, and it is a key ingredient to how our customers install/bootstrap our Agent software. Our Dashboard builds out both a workflow configuration for our customers, as well as a helm install/upgrade command for installing and updating our Agent software. We run our own Chartmuseum registry so that customers can take the "if it ain't broke, don't fix it" approach, pinning their version so they can upgrade (and downgrade) at their leisure. A single command that users can copy and paste (as well as one time Docker registry and ChartMuseum authentication steps) really fits well with our approach of building a turnkey-style appliance.

The popularity and ubiquity of managed Kubernetes services (EKS, GKE, AKS, etc.) was another important factor, so we can provide a simple recipe (or Terraform plan) for setting up this infrastructure for customers not already using Kubernetes.

Database Authentication

If you've used Airflow below you know that it includes support for encrypted variables, which is an appropriate way to store sensitive information such as database passwords. These are not available to the KubernetesPodOperator though, so these will need to be provisioned as secrets. We created a script to run airflow connections delete and airflow connections add commands to recreate these connections based on values passed in to Helm via a local values.yaml file. This way, as new input sources are defined or passwords rotated these will be applied on each helm upgrade command, and we don't have to get our users to input their passwords into the Redactics Dashboard. In other words, their connection information is sort of a local augmentation to the configuration generated by the Redactics Dashboard, and the Dashboard generates a template configuration file with "changemes" where authentication information should be provided, and they create their local configuration file by replacing these "changemes" with their real values.

The KubernetesPodOperator, Resource Management

Because we are installing onto Kubernetes, we have the luxury of leveraging the KubernetesPodOperator for some of our workflow steps, particularly those with dependencies that go beyond Python. Setting container resource limits help us stay true to our goals of sticking in a very minimalist resource footprint, even when datasets grow in size. This is particularly helpful when running steps in parallel, and because the number of steps in our workflows is dynamic and context dependent, it is important to set the max_active_tasks DAG argument to ensure that your resources operate within their allocated footprint. Otherwise your workflows could be really slow and/or you could face pod evictions.

We set our Helm chart for Airflow to skip installing the webserver, as we use on_failure_callback callbacks to read the logs from the filesystem and send them to our APIs so that our Redactics Dashboard is what customers interface with, so that they don't have to jump between our Dashboard and the Airflow webserver. We send pod exec commands to the Airflow CLI in the scheduler rather than the REST API for starting workflows manually. Our Agent software installs a few different pods, but the only Airflow pod needed is a single schedule. It sends a heartbeat to our APIs as well so we can provide feedback in our Dashboard about the status and health of the installation.

Dynamic Task Mapping

We were anxious to try out Airflow 2.3's Dynamic Task Mapping along with the Taskflow API introduced prior, as without these features we could only inject preset workflow configurations into our DAGs, which meant that whenever we wanted to update our workflow configurations these DAGs also had to be updated, which meant another helm upgrade command which needed to be run. Now, our DAGs do the following:

  • Fetch the workflow config from our dashboard's API (and we can control how often the DAG is refreshed with this Airflow variable, or else mitigate the load with Redis/memory caching
  • Several workflow steps have KubernetesPodOperator commands that are dynamically generated based on these workflow configs fetched from our API with Python functions such as the following:
@task(on_failure_callback=post_logs)
        def gen_table_resets(input_id, schema, **context):
            tables = []
            for input in wf_config["inputs"]:
                if input["id"] == input_id:
                    for table in initial_copies:
                        tables.append(table)
            if len(tables):
                return [["/scripts/table-resets.sh", dag_name, schema, ",".join(tables)]]
            else:
                return []
  • Each workflow configuration has its own DAG so that this work is tracked independently. The dag_name variable, above, or a unique configuration ID. This ensures that if you use Redactics with multiple databases (as we hope you will), a problem with one workflow is isolated and doesn't impact the others.

Reporting

We mentioned leveraging the on_failure_callback callbacks to report issues to our APIs, but we also leverage the on_success_callback callbacks to render progress bars and show visual feedback of the workflow being run by sending this to our APIs for display in our Dashboard.

Our CLI supports outputting diagnostic information about the health of the pods, zipping up logs, etc. but so far our customers have not needed this capability. We built this for the possibility of having to deep-dive specific issues with our customers, but one advantage to using Kubernetes is when customers are already familiar with it, they can be somewhat self-sufficient. For example, our Agent supports being pinned to specific nodes, and this works the same way you'd assign any other pod to a node.

Persistent Storage, Parallel Tasks

We built an open source streaming network file system, sort of like a NAS with an http interface, so that pods could easily access and share files, including concurrent usage by tasks running in parallel. We decided on this approach rather than shipping files up to Amazon S3 or the like, in part because this fit better in our "use your infrastructure" appliance approach without having to get customers to navigate creating buckets for use with the Agent (and dealing with possible data privacy issues in doing so). Performance, of course, was another important factor in evaluating our options here, as was providing customers the option to not have to open up any sort of egress networking. We wrote another article about this in case you're curious to learn more, but this was definitely a challenge, as ReadWriteMany Kubernetes persistent storage options are not a given.

Airflow supports running tasks in parallel, and this is a great way to improve the performance of your workflows where possible.

Summary

Airflow is a superb choice for workloads in general, but even those that need the sort of "Airflow-slim" minimalist approach. Your only challenge might be some sort of shared filesystem. There is no reason I can think of why Redactics and a managed Airflow appliance in general could not work on a VM as well, providing you could come up with a way to simplify installing and updating this software.

We hope that some of this information is useful! Please be sure to let us know...