Dan North created the CUPID properties as a replacement for the SOLID principles. I wouldn’t call them a replacement, though. SOLID still works in its original context: large, monolithic backend applications (probably written in Java).
Can we apply SOLID or CUPID to data engineering?
SOLID principles in data engineering
SOLID makes no sense in data engineering. At least, not as a whole set of principles.
Single Responsibility Principle
SRP works everywhere. A single unit of code should have one responsibility. We may struggle to define a unit’s scope: a function? A class? A module? An entire application?
Similarly, we have difficulties defining a responsibility. Assume I have a single microservice for managing a pricing page on a website. The service handles authorization to verify that the user making changes has permission to make them. It stores an audit log of all changes and sends emails to all admin users whenever a pricing change gets published. Is this one responsibility or four?
The scope of SRP is subject to individual interpretation. Should we even consider the responsibilities at the level of the entire application? Perhaps we should think of SRP only at the level of code modules or classes.
What about data engineering?
I wouldn’t apply SRP at the level of entire pipelines and the outputs they produce. Sometimes it’s faster to calculate multiple results in a single pipeline because it reuses the same data. However, the functions defining the pipeline must have only one responsibility. Always.
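A minimal sketch in plain Python (hypothetical function names, standing in for a real pipeline framework): one pipeline run produces two results because they reuse the same parsed data, while each function keeps a single responsibility.

```python
# A hypothetical pipeline: one run produces two outputs because they share
# the same input data, but each function has a single responsibility.

def load_orders(raw_rows):
    """Responsibility: parsing raw records into typed rows."""
    return [{"product": r[0], "amount": float(r[1])} for r in raw_rows]

def total_revenue(orders):
    """Responsibility: one aggregation, nothing else."""
    return sum(o["amount"] for o in orders)

def revenue_per_product(orders):
    """Responsibility: a different aggregation over the same data."""
    totals = {}
    for o in orders:
        totals[o["product"]] = totals.get(o["product"], 0.0) + o["amount"]
    return totals

def run_pipeline(raw_rows):
    """The pipeline reuses the parsed data to calculate both results."""
    orders = load_orders(raw_rows)
    return total_revenue(orders), revenue_per_product(orders)

total, per_product = run_pipeline([("a", "10.0"), ("b", "5.0"), ("a", "2.5")])
```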
Open-Closed Principle

What if we modify the behavior of an existing data pipeline by adding new code instead of changing the existing code? How is that even possible? We could use polymorphism: create a new class in which we define the new, updated behavior, and then modify a Factory Object to return the new implementation.
Would that work? Yes. Does it make sense? Not at all.
If we tried to define Apache Spark transformations using polymorphism and class hierarchies, we would create a performance and debugging nightmare.
When you have to change the behavior of a data transformation, change its code.
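To make the contrast concrete, here is a hedged sketch (all names hypothetical) of what the by-the-book Open-Closed approach would demand, next to the plain alternative of just editing the function:

```python
# What OCP-by-the-book would demand: a class hierarchy plus a factory,
# just to change one transformation.

class PriceTransformation:
    def apply(self, prices):
        return [p * 1.10 for p in prices]  # old behavior: 10% markup

class DiscountedPriceTransformation(PriceTransformation):
    def apply(self, prices):
        return [p * 1.05 for p in prices]  # new behavior added as a subclass

def transformation_factory(use_new_behavior):
    # We still have to modify this factory to ship the change.
    return DiscountedPriceTransformation() if use_new_behavior else PriceTransformation()

# What actually makes sense in a pipeline: change the function's code.
def apply_markup(prices):
    return [p * 1.05 for p in prices]
```

Both roads end at the same behavior, but the first one adds two classes and a factory between the reader and the transformation logic.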
Liskov Substitution Principle
Backend developers have problems understanding the Liskov Substitution Principle. However, in data engineering, we use it all the time!
We replace elements of data processing pipelines, or even entire pipelines, and the data consumers don’t notice. That’s the Liskov Substitution Principle!
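As a sketch (a hypothetical example in plain Python): two implementations of the same pipeline step honor the same contract - same input, same output - so a consumer can’t tell which one produced the data.

```python
# Two interchangeable pipeline steps: as long as both honor the same
# contract, swapping one for the other is invisible to consumers.

def deduplicate_v1(rows):
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out

def deduplicate_v2(rows):
    # A rewritten implementation with the same observable behavior.
    by_id = {}
    for row in rows:
        by_id.setdefault(row["id"], row)
    return list(by_id.values())

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
assert deduplicate_v1(rows) == deduplicate_v2(rows)
```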
Interface Segregation Principle
The Interface Segregation Principle tells us that no client should be forced to depend on methods it doesn’t use. You won’t need to write such multipurpose code while developing a data pipeline. I’m 99% sure. If I’m wrong, send me a message.
What about applying the principle to entire pipelines? What if we use the output dataset as our interface?
I wouldn’t follow ISP while designing a pipeline’s output. You will end up with lots of repeated calculations. A single output dataset may contain columns used by multiple clients, and those clients’ column subsets don’t even need to overlap.
Calculate the results once and let the consumers choose what they need. Don’t try to create a separate pipeline for every purpose.
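A minimal illustration (hypothetical columns and consumers): one wide output dataset, with each consumer selecting only the columns it needs, instead of a separate pipeline per client.

```python
# One wide output dataset; each consumer selects only the columns it
# needs instead of forcing a separate pipeline per client.

WIDE_OUTPUT = [
    {"user_id": 1, "revenue": 10.0, "visits": 3, "country": "PL"},
    {"user_id": 2, "revenue": 5.0, "visits": 7, "country": "DE"},
]

def select(rows, columns):
    return [{c: row[c] for c in columns} for row in rows]

# Every consumer reads the same dataset, calculated once.
finance_view = select(WIDE_OUTPUT, ["user_id", "revenue"])
marketing_view = select(WIDE_OUTPUT, ["user_id", "visits", "country"])
```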
Dependency Inversion Principle
DIP makes me think SOLID applies only to old-school, complex, backend applications written in Java.
It’s overkill when you implement a backend microservice. Create the dependency in the main function. You don’t need a framework to pass a parameter to a constructor, do you?
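For example, plain constructor injection wired up by hand in main - no framework involved (all class names are hypothetical):

```python
# Passing a dependency through a constructor by hand; no
# dependency-injection framework required.

class PostgresUserStore:
    def find(self, user_id):
        return {"id": user_id, "name": "example"}

class UserService:
    def __init__(self, store):
        self.store = store  # the dependency is just a constructor parameter

    def display_name(self, user_id):
        return self.store.find(user_id)["name"]

def main():
    # main wires the dependencies together - nothing more is needed.
    service = UserService(PostgresUserStore())
    return service.display_name(42)
```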
In data engineering, Dependency Inversion is an over-overkill. It makes no sense at all. Don’t even try.
CUPID properties in data engineering

Dan North doesn’t use the word “principles” to describe CUPID. Instead, Dan prefers calling them “properties” because we use them as characteristics of good code, not rules we must follow.
What are those properties?
- Composable
- Unix Philosophy
- Predictable
- Idiomatic
- Domain-based
Composable

Data pipelines are composable by definition. Every output dataset may become an input for something else.
At the code level, it gets harder. Rarely can we extract a function encapsulating a part of one data transformation and use it in another. However, maybe that’s a good thing. We should reuse output datasets to avoid calculating the same thing multiple times. We don’t need code reuse when we have data reuse.
Unix Philosophy

Unix Philosophy means we can build a new program on top of another. We can use one tool’s output as another application’s data source. By all means, follow the Unix philosophy in data engineering!
You can start by reading my article about applying Unix Philosophy to data engineering.
Predictable

Predictability should be a principle, not a property. Unpredictable data pipelines are useless.
For me, predictability means automated testing. It’s the easiest way to ensure complex code becomes predictable. If you struggle with writing automated tests for your data pipelines, look at my article about adding tests to existing code in data engineering.
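A minimal example of such an automated test (the transformation and its name are hypothetical): a pure function over a fixed input gives a fully predictable output.

```python
# A minimal automated test for a pure transformation: given a fixed
# input, the output is fully predictable.

def normalize_emails(rows):
    return [{"email": r["email"].strip().lower()} for r in rows]

def test_normalize_emails():
    result = normalize_emails([{"email": "  Alice@Example.COM "}])
    assert result == [{"email": "alice@example.com"}]

test_normalize_emails()
```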
Idiomatic

Write whatever looks normal in the language and tools you use. Don’t try to write Java in Python. Don’t try to copy the Spring Framework to Scala.
Domain-based

You won’t use Domain-Driven Design in data engineering.
You don’t need DDD. It’s sufficient to use terms from the business domain as the variable names and encapsulate domain operations into functions.
The function’s name describes the domain term. The function’s implementation describes the technical implementation. And try not to mix technical details with domain concepts within one function.
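For instance (a hypothetical pricing rule, invented for illustration): the function’s name speaks the domain language, and the body keeps the technical details to itself.

```python
# The function name speaks the domain language; the body holds the
# technical implementation, and the two don't mix.

def qualifies_for_volume_discount(order):
    # Domain rule (assumed for illustration): 10+ items of any one product.
    return any(quantity >= 10 for quantity in order["quantities"].values())

order = {"quantities": {"sku-1": 12, "sku-2": 3}}
```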
Doesn’t CUPID look like functional programming to you? The first three CUPID properties say “use functional programming” (or your code should behave as if you used functional programming).
- Composability. That’s a property of all functions.
- Unix Philosophy. That’s function chaining.
- Predictability. Write pure functions and don’t break referential transparency.
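All three properties fit in one plain-Python sketch: small pure functions (predictable), composed by chaining (Unix-style), each usable on its own (composable).

```python
# Small pure functions, composed by chaining, no shared state.

def parse(rows):
    return [float(r) for r in rows]

def drop_negative(values):
    return [v for v in values if v >= 0]

def total(values):
    return sum(values)

# Function chaining: each output feeds the next step.
result = total(drop_negative(parse(["1.5", "-2.0", "3.0"])))
```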
CUPID = “functional programming” without scaring people off using those two words.
Functional programming is the perfect mental model for data engineering. It doesn’t mean we should write data engineering code in Haskell or Scala! Data pipelines should have the same properties as functional programming code. And for me, CUPID is all about functional programming.