One or two weeks ago, a machine learning engineer who works on my team started complaining that working with databases is awful.
Well, is it? I had never thought of it that way. I’m a data engineer, so databases are an everyday thing for me. From an ML engineer’s point of view, though, databases look like needless complexity. After all, ML engineers want to get a dataset and don’t care how it happens.
Can we give machine learning engineers the datasets they need without forcing them to get familiar with many databases? Fortunately, we can use feature stores for that!
What does machine learning training look like without a feature store?
ML engineers are used to getting data as CSV files. They can easily load such files into Pandas. Usually, running a project without a feature store leads to a couple of problems that don’t get your attention until it is too late.
How does it happen?
At first, the ML engineer reaches out to a data engineer to get the CSV files. The data engineer runs a couple of queries, dumps the data into CSV, and sends them to the ML engineer. In the best case, they write down the query and upload the files to S3.
Later, the ML engineer requests more data, so the data engineer must run a few more queries. Now they have a problem. What was the time range of the original dataset? How can we be sure we retrieve the same range of dates now? What if something was updated later? Can we get a snapshot?
If the team still has the original queries, they may figure it out. If not, the model’s reproducibility is lost. You will never retrain it using fresh data because you don’t even know what data you used to train the first version.
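A minimal sketch of what "keeping the original queries" means in practice: the query and its parameters live together in version-controlled code rather than in someone's shell history. The table, columns, and date range below are all invented for illustration.

```python
import sqlite3

# Hypothetical extraction query. Pinning the exact date window of the
# original dump means a rerun retrieves the same range of rows.
EXTRACT_QUERY = """
    SELECT user_id, review_text, created_at
    FROM reviews
    WHERE created_at >= ? AND created_at < ?
"""
DATE_RANGE = ("2021-01-01", "2021-07-01")  # recorded once, reused forever

# In-memory stand-in for the production database, just to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (user_id INT, review_text TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        (1, "great product", "2021-03-15"),
        (2, "late delivery", "2021-08-02"),  # ingested after the pinned window
    ],
)

rows = conn.execute(EXTRACT_QUERY, DATE_RANGE).fetchall()
print(rows)  # only the row inside the recorded date range
```

Even this small amount of discipline answers the questions above: the time range is explicit, and rerunning the script retrieves the same window regardless of what was ingested later.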
When your ML project turns into such a mess, the best thing you can do is to start it from scratch—this time, doing it correctly.
Of course, you may be lucky and get away with it. You may still complete the project successfully. It’s easy to procure the data later if you train the model using a simple dataset. After all, if your model processes online reviews and you care only about the text, you know you need to get the text and nothing else.
Yet, you won’t get reproducible results in many projects without a feature store.
Benefits of having a feature store
A feature store is an intermediate software layer between the data sources and machine learning engineers. It creates an abstraction layer hiding the complexity of databases.
A data engineering team builds ETL to transform the data and ingest it into the feature store. On the other side of the feature store API, the machine learning engineers can request the features they need. The intermediate software will merge the data sources and filter the data by the keys provided by ML engineers.
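Conceptually, the merge-and-filter the feature store performs looks like this Pandas sketch. The source tables, keys, and feature names are made up for illustration; a real feature store does this behind its API.

```python
import pandas as pd

# Two hypothetical source tables the data engineering team has ingested.
orders = pd.DataFrame({"user_id": [1, 2, 3], "order_count": [5, 2, 9]})
profiles = pd.DataFrame({"user_id": [1, 2, 3], "country": ["PL", "DE", "US"]})

# What the ML engineer provides: entity keys and the features they want.
requested_keys = [1, 3]
requested_features = ["order_count", "country"]

# The "feature store" joins the sources and returns only the
# requested entities and features.
merged = orders.merge(profiles, on="user_id")
features = merged[merged["user_id"].isin(requested_keys)][
    ["user_id", *requested_features]
]
print(features)
```

The ML engineer never sees the underlying databases, only the resulting dataframe.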
In the end, both teams work in a familiar environment.
However, that’s not the only reason to use a feature store!
We can use a feature store to increase the reproducibility of ML training code. The code retrieving the data from a feature store is a part of your training pipeline. You can store it in the repository and rerun it later.
The feature store keeps a snapshot of values, so you will get the same data if you retrieve the same snapshot later.
When you have a feature store, you outsource taking care of data retrieval reproducibility to a trustworthy piece of software.
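A sketch of why snapshots make retrieval reproducible, assuming a hypothetical feature table with an ingestion timestamp per row: asking for the state "as of" a fixed date always returns the same data, even after newer rows arrive.

```python
import pandas as pd

# Hypothetical feature history: each row records when the value was ingested.
history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "order_count": [5, 7, 2],
    "ingested_at": pd.to_datetime(["2021-06-01", "2021-09-01", "2021-06-01"]),
})

def snapshot(df: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Return the latest value per key as of the given timestamp."""
    visible = df[df["ingested_at"] <= pd.Timestamp(as_of)]
    return (visible.sort_values("ingested_at")
                   .groupby("user_id", as_index=False)
                   .last())

# Retrieving the same snapshot date later yields identical training data,
# even though a newer value for user 1 was ingested in September.
train_df = snapshot(history, "2021-07-01")
print(train_df)
```

Committing the `as_of` date alongside the training code is enough to rerun the exact same retrieval months later.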
Which feature store should you use?
Many feature store implementations exist. Today every major ML company builds one.
Feast seems to be a good choice if you prefer an open-source implementation. In July 2021, I wrote an article about using the Feast feature store. Of course, using an open-source project requires way more effort than buying a third-party service.
If you want a fire-and-forget approach to feature stores, you should look at the Qwak ML platform. (For disclosure, I do freelance content writing and MLOps evangelism for Qwak.) Qwak lets you forget about the Ops aspect of running a feature store, so you can focus on procuring the data and building the models.
The AWS Sagemaker Feature Store may be an obvious choice for AWS users. Personally, I find it a little bit overengineered. It seems AWS tries to turn Sagemaker Studio into “the one ML tool to rule them all.”
When should you invest time in a feature store?
I wouldn’t recommend starting an ML project by setting up a feature store.
Although you will need the feature store soon, I recommend focusing your initial MLOps efforts on the deployment pipeline. Nothing demotivates engineers as fast as having ready-to-use code that they cannot deploy.
At the same time, make sure you will have enough information to populate the feature store with useful data. Starting with the deployment code isn’t an excuse to do a sloppy job during data retrieval. Keep the code and queries! You will need them later.
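One cheap way to keep the code and queries is to write a small manifest next to every CSV dump. This is a hypothetical sketch, not a standard format; the query, parameters, and file names are invented for illustration.

```python
import hashlib
import json
import pathlib
import tempfile

# The query and parameters that produced the dump, recorded verbatim.
query = (
    "SELECT user_id, review_text FROM reviews "
    "WHERE created_at >= :start AND created_at < :end"
)
params = {"start": "2021-01-01", "end": "2021-07-01"}

manifest = {
    "query": query,
    "params": params,
    # The hash lets you verify later that a CSV found on S3 is the one
    # this manifest describes.
    "query_sha256": hashlib.sha256(query.encode()).hexdigest(),
}

# Store the manifest alongside the data (a temp dir stands in for S3 here).
out_dir = pathlib.Path(tempfile.mkdtemp())
manifest_path = out_dir / "reviews_dump.manifest.json"
manifest_path.write_text(json.dumps(manifest, indent=2))
print(manifest_path.read_text())
```

When the time comes to populate the feature store, these manifests tell you exactly what data the first model was trained on.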
After you deploy the model, see it running in production, and discuss training a new version, you can start working on the feature store.
It makes no sense to start earlier because your effort may be wasted if your model performs poorly in production.
As long as you documented the data retrieval, you can push the features into a feature store while building the second version of the model.