In this article, I’ll show how to use the Feast feature store in a local environment. We will download a dataset, store it in a Parquet file, define a new FeatureView in the Feast repository, and retrieve the features using Feast.
Table of Contents
- How to prepare a dataset for the Feast feature store
- Defining features in the Feast repository
- Retrieving values from the feature store
How to prepare a dataset for the Feast feature store
The Feast feature store works with time-series features. Therefore, every dataset must contain a timestamp in addition to the entity id. Multiple observations of the same entity may exist, as long as they have different timestamps.
In our example, we are going to use the Iris dataset. We have a single observation, and we don’t have an entity identifier. What can we do?
We can use the time when the dataset was obtained as the observation date and turn the DataFrame index into entity ids. That is fine as long as we don’t use those ids to join with other datasets. After all, the values have no business meaning, and we created them only to identify the observations.
import seaborn
from datetime import datetime

data = seaborn.load_dataset('iris')
data.reset_index(level=0, inplace=True)  # turns the index into a column
data = data.rename(columns={'index': 'iris_id'})
data['observation_date'] = datetime(2021, 7, 9, 10, 0, 0)  # when the dataset was obtained
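Before writing the data out, it is worth a quick look at the resulting columns to confirm the entity id and the timestamp are in place:

data.head()
# expect: iris_id, sepal_length, sepal_width, petal_length, petal_width, species, observation_date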
Now, we have to create and initialize a Feast repository. I run the code in a Jupyter Notebook, hence the exclamation marks at the beginning of the commands; if you use the command line, you won’t need them.
!feast init feature_repo
!cd feature_repo && feast apply
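If you are curious what feast init generated, you can list the repository. It should contain a feature_store.yaml configuration file, an example feature definition module, and a data directory with the sample driver dataset:

!ls -R feature_repo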
The freshly created Feast repository contains an example dataset, so we should see the following output:
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
When the repository is ready, we can store the dataset in the data directory as a Parquet file:
data.to_parquet('/content/feature_repo/data/iris.parquet')
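To double-check that the file landed where Feast will look for it, we can read it back with pandas. This is a quick sanity check, not something Feast requires:

import pandas as pd

check = pd.read_parquet('/content/feature_repo/data/iris.parquet')
print(check.shape)  # the full Iris dataset should give (150, 7)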
Defining features in the Feast repository
In the next step, we have to prepare a Python file describing the FeatureView. It must define the data input location, the entity identifier, and the available feature columns. We store this file in the feature repository using the Jupyter %%writefile magic:
%%writefile /content/feature_repo/iris.py
from datetime import timedelta

from feast import Entity, Feature, FeatureView, ValueType
from feast.data_source import FileSource

# Offline data location and the column holding the event timestamp
iris_observations = FileSource(
    path="/content/feature_repo/data/iris.parquet",
    event_timestamp_column="observation_date",
)

# The column containing the entity identifier
iris = Entity(name="iris_id", value_type=ValueType.INT64, description="Iris identifier")

iris_observations_view = FeatureView(
    name="iris_observations",
    entities=["iris_id"],
    ttl=timedelta(days=1),  # how long a value stays valid after its event timestamp
    features=[
        Feature(name="sepal_length", dtype=ValueType.FLOAT),
        Feature(name="sepal_width", dtype=ValueType.FLOAT),
        Feature(name="petal_length", dtype=ValueType.FLOAT),
        Feature(name="petal_width", dtype=ValueType.FLOAT),
        Feature(name="species", dtype=ValueType.STRING),
    ],
    online=False,  # we only need the offline (historical) store in this example
    input=iris_observations,
    tags={},
)
The code above does three things:
- It defines the feature source location, in this case a path on the local file system. Note that the FileSource also requires the name of the column containing the event timestamp.
- The Entity object describes which column contains the entity identifier. In our example, the value has no business meaning, but we still need it.
- Finally, it defines the FeatureView, which combines the available column names (and types) with the entity identifier and the data location. We have only historical data in our example, so I set the online parameter to False.
Since Feast 0.11, we can skip the features parameter in FeatureView, and the library will infer the column names and types from the data.
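For instance, here is a minimal sketch of the same view relying on schema inference (my own variation, assuming Feast 0.11 or newer; everything else stays as defined above):

iris_observations_view = FeatureView(
    name="iris_observations",
    entities=["iris_id"],
    ttl=timedelta(days=1),
    online=False,
    input=iris_observations,  # Feast inspects the Parquet schema to infer feature names and types
    tags={},
)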
When we have the FeatureView definition, we can reload the repository and use the new feature:
!cd feature_repo && feast apply
Output:
Registered entity driver_id
Registered entity iris_id
Registered feature view driver_hourly_stats
Registered feature view iris_observations
Deploying infrastructure for driver_hourly_stats
Deploying infrastructure for iris_observations
What does TTL mean?
In the example below, we retrieve values from the feature store. We must specify the event_timestamp. The ttl describes the maximal time difference between the actual event timestamp and the timestamp we want to retrieve. Of course, it is a difference “in the past.” We can never retrieve events “in the future.”
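In other words, for a requested timestamp, Feast returns the freshest feature value whose event timestamp falls between the requested timestamp minus the ttl and the requested timestamp itself. A quick sketch of that window for our setup:

from datetime import datetime, timedelta

requested = datetime(2021, 7, 10, 10, 0, 0)
ttl = timedelta(days=1)

# Feast considers values with event timestamps in [requested - ttl, requested]
window_start = requested - ttl
print(window_start)  # 2021-07-09 10:00:00 -- our observation_date sits exactly at the edge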
Retrieving values from the feature store
To retrieve the features, we must specify the entity ids and the desired observation time:
import pandas as pd
from datetime import datetime
from feast import FeatureStore

# Request the features for entities 0..99 as of the given timestamp
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 10, 10, 0, 0),
    }
)

store = FeatureStore(repo_path="feature_repo")
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "iris_observations:sepal_length",
        "iris_observations:sepal_width",
        "iris_observations:species",
    ],
).to_df()
Feast joins the requested feature columns onto the given entity_df DataFrame, so when data is not available, we get the entity_df values joined with nulls or NaNs.
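If you want to train only on rows that actually had data within the TTL window, you can drop the incomplete ones afterwards. This is plain pandas, not a Feast feature:

training_df = training_df.dropna(subset=['iris_observations__sepal_length'])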
Working with TTLs and the event_timestamp
In the previous example, I specified an event_timestamp exactly one day after the observation_date, which is still within the one-day TTL. Because of that, Feast retrieved 100 observations:
event_timestamp iris_id iris_observations__sepal_length iris_observations__sepal_width iris_observations__species
0 2021-07-10 10:00:00+00:00 0 5.1 3.5 setosa
1 2021-07-10 10:00:00+00:00 72 6.3 2.5 versicolor
2 2021-07-10 10:00:00+00:00 71 6.1 2.8 versicolor
3 2021-07-10 10:00:00+00:00 70 5.9 3.2 versicolor
4 2021-07-10 10:00:00+00:00 69 5.6 2.5 versicolor
Values in the future
When I specify an event_timestamp “in the future” (more than one TTL after the available observations), Feast returns NaNs because no data exists within the window:
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 11, 10, 0, 0),
    }
)
...
event_timestamp iris_id iris_observations__sepal_length iris_observations__sepal_width iris_observations__species
0 2021-07-11 10:00:00+00:00 0 NaN NaN NaN
Values in the past
In our example, the observation date is datetime(2021, 7, 9, 10, 0, 0) and the TTL is one day, so we can request values with timestamps between datetime(2021, 7, 9, 10, 0, 0) and datetime(2021, 7, 10, 10, 0, 0). If we use a date in this range, we get the freshest available value (the only one we have in the database).
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 9, 10, 0, 0),
    }
)
...
Returns:
event_timestamp iris_id iris_observations__sepal_length iris_observations__sepal_width iris_observations__species
0 2021-07-09 10:00:00+00:00 0 5.1 3.5 setosa
Retrieving expired values
However, when we go back one second further, Feast returns NaNs because the available observation no longer falls in the range between event_timestamp - ttl and event_timestamp. From the request’s point of view, it is now in the future:
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 9, 9, 59, 59),
    }
)
...
event_timestamp iris_id iris_observations__sepal_length iris_observations__sepal_width iris_observations__species
0 2021-07-09 09:59:59+00:00 0 NaN NaN NaN
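To build an intuition for these TTL rules, the point-in-time join Feast performs behaves much like pandas merge_asof with a backward direction and a tolerance. Here is a rough sketch using the data and entity_df frames from above; it mimics the semantics but is not Feast’s actual implementation:

import pandas as pd

# merge_asof requires both frames to be sorted by their timestamp keys
features = data.sort_values('observation_date')
requests = entity_df.sort_values('event_timestamp')

joined = pd.merge_asof(
    requests,
    features,
    left_on='event_timestamp',
    right_on='observation_date',
    by='iris_id',                    # match observations of the same entity
    direction='backward',            # only values at or before the requested timestamp
    tolerance=pd.Timedelta(days=1),  # mimics the ttl: ignore values older than one day
)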