Christopher Bergh - How the DataOps principles help data engineers make data pipelines trustworthy

DataOps is a set of practices and principles that reduce the end-to-end cycle time of data analysis.

For me, the most crucial concept of DataOps is separating the work into two pipelines. The Value Pipeline is the code that runs in production and extracts value from the data. The Innovation Pipeline is the work that produces insights and generates new ideas.

That distinction is crucial because those two pipelines have different “moving parts.” In the Value Pipeline, the code does not change, but we may encounter unexpected data values. Hence, we should focus on validating the input data, verifying the pipeline output, and monitoring the process in production. On the other hand, the Innovation Pipeline reuses the same data repeatedly, but we modify the code. That is the part of the process where we must write automated tests to ensure that the code works correctly.
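To make the Value Pipeline side concrete, here is a minimal sketch of an input data check, assuming the batch arrives as a pandas DataFrame; the column names and thresholds are hypothetical examples, not something taken from the interview.

```python
import pandas as pd


def validate_input(batch: pd.DataFrame) -> None:
    """Fail fast when the incoming data looks wrong, before the unchanged
    production code starts processing it."""
    assert not batch.empty, "Input batch is empty"
    assert batch["order_id"].is_unique, "Duplicate order_id values in the input"
    assert batch["amount"].ge(0).all(), "Negative amounts in the input"
    # Freshness check: the newest row should not be older than one day.
    assert batch["created_at"].max() >= pd.Timestamp.now() - pd.Timedelta(days=1), \
        "Input data looks stale"


if __name__ == "__main__":
    batch = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 25.5, 7.0],
        "created_at": [pd.Timestamp.now()] * 3,
    })
    validate_input(batch)  # raises AssertionError when a check fails
```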

Recently, I interviewed Christopher Bergh, CEO of DataKitchen and a co-author of the “DataOps Manifesto” and the “DataOps Cookbook,” because I wanted to learn more about DataOps and the benefits of following that practice.

Customers must trust that the data is right, fresh, and ready to use.

First, I wanted to know Christopher Bergh’s opinion about a data engineering team’s most essential responsibilities. He said that we should focus on making our customers successful. In our case, the customers are usually data analysts or other backend teams. Our primary responsibility is building integrated, trusted datasets, and we have achieved that goal when our customers are sure that the data is right, fresh, and ready to use.

After that, I asked about the one thing data engineers should do every day. I was not surprised to hear that we should be writing automated tests. In fact, we should write tons of tests. According to Christopher Bergh, we should spend 20% of our time writing tests.

Tests are the gift that you give to your future self.

He pointed out that most data engineers write very few tests and that unit tests are nice but insufficient. After all, we spend most of our time integrating multiple data sources. To verify the code, we also need regression tests and end-to-end tests. In addition to that, we should write data checks that validate the data before it gets into the pipeline, while it is being processed, and after the output is calculated.
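As an illustration of the regression-test part, here is a minimal pytest-style sketch that pins the output of one pipeline step to a known-good baseline; the daily_revenue transformation and the baseline numbers are hypothetical.

```python
import pandas as pd


def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Example transformation under test: total revenue per day."""
    return (
        orders.assign(day=orders["created_at"].dt.date)
        .groupby("day", as_index=False)["amount"]
        .sum()
    )


def test_daily_revenue_regression():
    # A small, frozen input that exercises the step end to end.
    orders = pd.DataFrame({
        "created_at": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
        "amount": [10.0, 5.0, 7.5],
    })
    result = daily_revenue(orders)
    # The expected output is the regression baseline: if a code change
    # alters these numbers, the test fails and we review the change.
    expected = pd.DataFrame({
        "day": pd.to_datetime(["2024-01-01", "2024-01-02"]).date,
        "amount": [15.0, 7.5],
    })
    pd.testing.assert_frame_equal(result, expected)
```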

It may not be the answer you wanted to hear. Many developers don’t like writing tests, but Christopher Bergh thinks that “tests are the gift that you give to your future self.” It is also the most effective way to improve data quality. Christopher Bergh calls it “defensive data coding.” It is all about not letting bad data get into the production pipeline. He advises keeping exception tables where we store the bad rows and report them back to the data providers. This increases the odds that the problem gets fixed upstream, so we won’t keep fixing and patching the same thing over and over again.
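Here is a minimal sketch of that exception-table idea, assuming pandas and a SQLite table; the table name, the columns, and the validity rule are hypothetical.

```python
import sqlite3

import pandas as pd


def split_bad_rows(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Separate rows that violate the data contract instead of letting
    them into the production pipeline."""
    is_valid = df["customer_id"].notna() & df["amount"].notna() & df["amount"].ge(0)
    return df[is_valid], df[~is_valid]


def process_batch(df: pd.DataFrame, conn: sqlite3.Connection) -> pd.DataFrame:
    good, bad = split_bad_rows(df)
    if not bad.empty:
        # Store the rejected rows so they can be reported back to the
        # data provider and fixed upstream.
        bad.to_sql("orders_exceptions", conn, if_exists="append", index=False)
    return good


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = pd.DataFrame({
        "customer_id": [1, None, 3],
        "amount": [10.0, 5.0, -2.0],
    })
    clean = process_batch(batch, conn)
    rejected = pd.read_sql("SELECT * FROM orders_exceptions", conn)
    print(f"{len(clean)} rows kept, {len(rejected)} rows quarantined")
```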

When asked about tracking data quality, Christopher Bergh told me that it is better to track error rates and the time required to process the data because error rates and delays are a superset of data quality. He also suggests keeping a spreadsheet with all the detected errors because it helps us spot patterns in the errors, and if we see a pattern, we can write a test to detect it early.
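A minimal sketch of that kind of tracking, assuming we append every run’s duration and error to a CSV file (a plain stand-in for the spreadsheet he mentions); the file name and the wrapped step are hypothetical.

```python
import csv
import time
from datetime import datetime, timezone


def record_run(run_name: str, run_fn, metrics_file: str = "pipeline_runs.csv") -> None:
    """Run one pipeline step and append its duration and error (if any) to a CSV file."""
    started = time.monotonic()
    error = ""
    try:
        run_fn()
    except Exception as exc:
        error = repr(exc)
        raise  # still fail the run; the error is recorded below
    finally:
        duration = time.monotonic() - started
        with open(metrics_file, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.now(timezone.utc).isoformat(), run_name, f"{duration:.2f}", error]
            )


if __name__ == "__main__":
    record_run("load_orders", lambda: time.sleep(0.1))
```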

Good data engineers should do DataOps just like good software engineers should do DevOps.

I was glad to hear that Christopher Bergh recommends the functional data engineering approach because, a few months ago, I published the “Data flow - what functional programming and Unix philosophy can teach us about data streaming” article. We should assume that the raw input data is immutable and that our data transformations are pure functions that always produce deterministic results. If we store the intermediate results, we get an idempotent pipeline that we can partially restart when we need to fix the code and reprocess the data. Having immutable input data also allows us to reuse most of the production tests during development by parameterizing them to run on a subset of the production data.
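As a sketch of that approach, here is an idempotent pipeline step built around a pure transformation, assuming date-partitioned Parquet files on local disk; the paths, the partitioning scheme, and the transformation are hypothetical.

```python
from pathlib import Path

import pandas as pd


def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """A pure function: the same input always yields the same output,
    and the input DataFrame is never mutated."""
    return (
        orders.assign(day=orders["created_at"].dt.date)
        .groupby("day", as_index=False)["amount"]
        .sum()
    )


def run_partition(run_date: str, raw_dir: Path, out_dir: Path) -> None:
    """Process a single date partition. Rerunning it overwrites the same
    output file, so the step is idempotent and the pipeline can be
    partially restarted after a code fix."""
    raw = pd.read_parquet(raw_dir / f"orders_{run_date}.parquet")  # immutable input
    out_dir.mkdir(parents=True, exist_ok=True)
    daily_revenue(raw).to_parquet(out_dir / f"daily_revenue_{run_date}.parquet", index=False)
```

Because the raw partitions never change, the same tests we run in production (like the regression test sketched earlier) can be pointed at one small partition during development.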

I also asked Christopher Bergh about the crucial skills data engineers should learn. He suggests starting with languages like SQL and Python, but if you have time to learn only one language, choose SQL because most of the data is structured anyway. Additionally, he suggests learning the DataOps principles and getting used to seeing your work as a part of a larger system. He says that “Good data engineers should do DataOps just like good software engineers should do DevOps.”

When asked about training and mentoring data engineers, he pointed out that data engineering is a craft and a team effort, so the only way to learn it is to work on a team with more experienced engineers.

Finally, I wanted to know which books he recommends. Naturally, he recommended the “DataOps Cookbook” (which you can get for free when you subscribe to the DataKitchen newsletter) and “Data Teams” by Jesse Anderson, as well as getting a good SQL or Python book.
