I have just finished reading “Practical Data Cleaning” by Lee Baker. I can’t recommend that book. It is too short and disappointingly basic. If you have never worked with real-world data, if you have only worked with ready-to-use Kaggle datasets, then maybe you can learn something from this book.
I am disappointed, but there is one idea that cannot be stressed enough. One thing that should be repeated every single day until we all accept it as a regular daily practice. Writing things down!
Do you have a special value that indicates missing data? Write it down somewhere. Preferably in the code that cleans the dataset. Preferably as a constant with a name that conveys its meaning.
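Here is a minimal sketch of what I mean. The sentinel value `-999`, the constant name, and the `clean_readings` function are all made up for illustration; your dataset will have its own magic value.

```python
# Hypothetical example: the upstream export writes -999 when a reading is missing.
# Naming the value documents its meaning right where the cleaning happens.
MISSING_READING_SENTINEL = -999


def clean_readings(raw_readings):
    """Replace the missing-data sentinel with None so it is never mistaken for a measurement."""
    return [
        None if value == MISSING_READING_SENTINEL else value
        for value in raw_readings
    ]


cleaned = clean_readings([21.5, -999, 19.0])
```

Anyone who reads this code learns what the odd value means without having to ask around.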
We all “love” looking in the code for other usages of data we are trying to use because we hope that someone else has already figured out the meaning of values, don’t we?
We “love” it when information about the data source is available only two days a week, when a colleague is in the office and has some spare time to answer our questions.
We “love” being surprised by a variable that has way too many values, none of which mean what we expect. Lately, I came across data in which one variable indicated a boolean property, but surprisingly 1 meant “negative” and 0 meant “positive.” There were also values larger than 1, which I had to convert to 0. For some time it had been possible to choose more options, but people did not like it, so the feature was removed, and nobody updated the values in the database.
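Written down in code, that knowledge could look roughly like the sketch below. The names are hypothetical, and rejecting negative values is my own assumption, because I never saw any in that column.

```python
# Raw encoding of the flag as stored in the database (names are hypothetical).
RAW_NEGATIVE = 1  # counter-intuitive: 1 means "negative"
RAW_POSITIVE = 0  # 0 means "positive"


def decode_flag(raw_value: int) -> bool:
    """Return True for "positive" and False for "negative".

    Values larger than 1 are leftovers from a removed feature that allowed
    choosing more options. They were never migrated, so they are treated as 0.
    """
    if raw_value < 0:
        # Assumption: negative values are not expected, so fail loudly.
        raise ValueError(f"Unexpected flag value: {raw_value}")
    if raw_value > RAW_NEGATIVE:
        raw_value = RAW_POSITIVE  # legacy multi-option value, converted to 0
    return raw_value == RAW_POSITIVE
```

The docstring carries the history of the column, so the next person does not have to rediscover it.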
How can we know about stuff like that if it is not written down anywhere? Please, please document the data. The next person (or you in 6 months) will be very, very grateful.
If you find something strange and spend two days trying to find an explanation, don’t try to remember it. Write it down. If you found incorrect data caused by a bug, write down some information about that bug. Or even better, write a (well-documented) function that converts the faulty data into something correct. If you get an explanation: “we have always been doing it like this, but nobody remembers why,” please dig deeper and find out why you are doing it.
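A function like that does not need to be complicated. The sketch below is built around an entirely invented bug (an export job that wrote amounts in cents instead of whole currency units for two weeks), so the dates, names, and the fix itself are placeholders for whatever you actually found.

```python
from datetime import date

# Hypothetical bug: between these dates (inclusive) an export job wrote
# amounts in cents instead of whole currency units.
BUG_FIRST_DAY = date(2020, 3, 1)
BUG_LAST_DAY = date(2020, 3, 14)
CENTS_PER_UNIT = 100


def corrected_amount(amount: float, recorded_on: date) -> float:
    """Return the amount in whole currency units.

    Records written while the export bug was active are divided by 100.
    Records outside that window are returned unchanged.
    """
    if BUG_FIRST_DAY <= recorded_on <= BUG_LAST_DAY:
        return amount / CENTS_PER_UNIT
    return amount
```

The constants and the docstring record what went wrong and when, which is exactly the information that otherwise lives in one person's head.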
We don’t need data mythology. After all, we call the practice “data science” or “data engineering.” That should hold us to higher standards.