---
title: "How to add a new dataset to the Feast feature store"
description: "How to use Feast feature store in a local environment"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/adding-datasets-to-feast-feature-store
---

In this article, I'll show how to use the Feast feature store in a local environment. We will download a dataset, store it in a Parquet file, define a new FeatureView in the Feast repository, and retrieve it using Feast.

## How to prepare a dataset for the Feast feature store

The Feast feature store works with time-series features. Therefore, every dataset must contain the timestamp in addition to the entity id. Different observations of the same entity may exist if such observations have a different timestamp.

In our example, we are going to use the Iris dataset. It contains a single observation of every flower, and it has neither a timestamp nor an entity identifier. What can we do?

We can use the time when the dataset was obtained as the observation date and turn the DataFrame index into the entity ids. It is OK, as long as we don't use such ids to join with different datasets. After all, those values have no business meaning, and we created them only to identify the observations.

```python
import seaborn
from datetime import datetime

data = seaborn.load_dataset('iris')

data.reset_index(level=0, inplace=True) # turns the index into a column
data = data.rename(columns={'index': 'iris_id'})

data['observation_date'] = datetime(2021, 7, 9, 10, 0, 0)
```
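Before writing the file, it may be worth checking that the frame has the shape described above: an entity id column, a timestamp column, and no repeated (entity id, timestamp) pairs. The sketch below is only an illustration. `check_feast_ready` is a hypothetical helper (not part of Feast), and the tiny `sample` frame with made-up values stands in for the prepared Iris data.

```python
from datetime import datetime

import pandas as pd


def check_feast_ready(df, entity_column, timestamp_column):
    """Hypothetical helper: list the problems found in a frame (empty list = looks fine)."""
    problems = []
    for column in (entity_column, timestamp_column):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if problems:
        return problems
    if df[entity_column].isna().any():
        problems.append("entity ids contain nulls")
    if not pd.api.types.is_datetime64_any_dtype(df[timestamp_column]):
        problems.append("timestamp column is not a datetime type")
    # The same entity may appear many times, but only with different timestamps
    if df.duplicated(subset=[entity_column, timestamp_column]).any():
        problems.append("duplicate (entity id, timestamp) pairs")
    return problems


# A stand-in for the prepared Iris frame (illustrative values only)
sample = pd.DataFrame({
    "iris_id": [0, 1, 2],
    "sepal_length": [5.1, 4.9, 4.7],
    "observation_date": datetime(2021, 7, 9, 10, 0, 0),
})

problems = check_feast_ready(sample, "iris_id", "observation_date")
print(problems)
```

With the real data, the call would be `check_feast_ready(data, "iris_id", "observation_date")`.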

**Now, we have to create and initialize a Feast repository.** I run the code in a Jupyter Notebook. Hence there are exclamation marks at the beginning of the commands. If you use the command line, you won't need them.

```bash
!feast init feature_repo
!cd feature_repo && feast apply
```

The freshly created Feast repository contains an example dataset, so we should see the following output:

```
Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
```

When the repository is ready, we can **store the dataset** in the `data` directory as a Parquet file:

```python
data.to_parquet('/content/feature_repo/data/iris.parquet')
```

## Defining features in Feast repository

In the next step, we have to **prepare a Python file describing the FeatureView**. It must define the data input location, the entity identifier, and the available feature columns. We store this file in the feature repository using the Jupyter `%%writefile` cell magic:

```python
%%writefile /content/feature_repo/iris.py
from datetime import timedelta

from google.protobuf.duration_pb2 import Duration

from feast import Entity, Feature, FeatureView, ValueType
from feast.data_source import FileSource

iris_observations = FileSource(
    path="/content/feature_repo/data/iris.parquet",
    event_timestamp_column="observation_date",
)

iris = Entity(name="iris_id", value_type=ValueType.INT64, description="Iris identifier",)

iris_observations_view = FeatureView(
    name="iris_observations",
    entities=["iris_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="sepal_length", dtype=ValueType.FLOAT),
        Feature(name="sepal_width", dtype=ValueType.FLOAT),
        Feature(name="petal_length", dtype=ValueType.FLOAT),
        Feature(name="petal_width", dtype=ValueType.FLOAT),
        Feature(name="species", dtype=ValueType.STRING),
    ],
    online=False,
    input=iris_observations,
    tags={},
)
```

The code above does three things:

* It **defines the feature source location**, in this case a path in the local file system. Note that the `FileSource` also requires the name of the column containing the event timestamp.
* The `Entity` object describes which column contains **the entity identifier**. In our example, the value is useless and has no business meaning, but we still need it.
* Finally, we **define the `FeatureView`, which combines the available column names (and types) with the entity identifier and the data location**. We have only historical data in our example, so I set the `online` parameter to False.

Since Feast 0.11, we can skip the `features` parameter in `FeatureView`, and the library will infer the column names and types from the data.

When we have the `FeatureView` definition, we can **reload the repository** and use the new feature:

```bash
!cd feature_repo && feast apply
```

Output:

```
Registered entity driver_id
Registered entity iris_id
Registered feature view driver_hourly_stats
Registered feature view iris_observations
Deploying infrastructure for driver_hourly_stats
Deploying infrastructure for iris_observations
```

### What does TTL mean?

In the example below, we retrieve the value from the feature store. We must specify the `event_timestamp`. **The `ttl` describes the maximal time difference between the requested `event_timestamp` and the timestamp of the stored observation**. Of course, it is a difference "in the past": Feast returns the latest observation recorded at or before the `event_timestamp`, provided it is not older than the `ttl`. We can never retrieve events "in the future."
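To make the rule concrete, here is a minimal sketch of such a lookup in plain Python. It is a simplified model, not Feast's implementation. The function name `latest_within_ttl`, the `"old"`/`"new"` values, and the timestamps are made up for illustration. I assume both ends of the window are inclusive, which agrees with the outputs shown later in this article.

```python
from datetime import datetime, timedelta


def latest_within_ttl(observations, requested_at, ttl):
    """Return the newest value observed inside [requested_at - ttl, requested_at], or None."""
    candidates = [
        (observed_at, value)
        for observed_at, value in observations
        if requested_at - ttl <= observed_at <= requested_at
    ]
    if not candidates:
        return None  # nothing fresh enough, and we never look into the future
    return max(candidates, key=lambda pair: pair[0])[1]


# Two observations of the same (hypothetical) entity
observations = [
    (datetime(2021, 7, 8, 10, 0, 0), "old"),
    (datetime(2021, 7, 9, 10, 0, 0), "new"),
]
ttl = timedelta(days=1)

after_both = latest_within_ttl(observations, datetime(2021, 7, 9, 12, 0, 0), ttl)
between = latest_within_ttl(observations, datetime(2021, 7, 9, 9, 0, 0), ttl)
too_late = latest_within_ttl(observations, datetime(2021, 7, 11, 10, 0, 0), ttl)
print(after_both, between, too_late)
```

The second call returns the older value because the newer one lies after the requested timestamp. The third call finds nothing because both observations are older than the TTL.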

## Retrieving value from the feature store

To retrieve the value, we must **specify the entity ids and the desired observation time**:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 10, 10, 0, 0)
    }
)

store = FeatureStore(repo_path="feature_repo")

training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "iris_observations:sepal_length",
        "iris_observations:sepal_width",
        "iris_observations:species",
    ],
).to_df()
```

Feast joins the requested feature columns with the given `entity_df` DataFrame, so when data is not available, we get the `entity_df` rows joined with nulls or NaNs.
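The effect resembles a left join in pandas. The sketch below only illustrates that behavior and ignores timestamps and TTLs entirely. The `features` frame and the id `999` are made up for the example.

```python
import pandas as pd

# Entities we ask for; 999 does not exist in the feature data
entity_df = pd.DataFrame({"iris_id": [0, 1, 999]})

# A stand-in for the stored features (illustrative values only)
features = pd.DataFrame({
    "iris_id": [0, 1],
    "species": ["setosa", "setosa"],
})

# Every requested entity is kept; missing features become NaN
joined = entity_df.merge(features, on="iris_id", how="left")
print(joined)
```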

### Working with TTLs and the event_timestamp

In the previous example, I specified an `event_timestamp` exactly one day after the `observation_date`, so the observations were still within the one-day TTL. Because of that, Feast retrieved 100 observations:

```
event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-10 10:00:00+00:00	0	5.1	3.5	setosa
1	2021-07-10 10:00:00+00:00	72	6.3	2.5	versicolor
2	2021-07-10 10:00:00+00:00	71	6.1	2.8	versicolor
3	2021-07-10 10:00:00+00:00	70	5.9	3.2	versicolor
4	2021-07-10 10:00:00+00:00	69	5.6	2.5	versicolor
```

#### Values in the future

When I specify the `event_timestamp` "in the future" (a date more than one TTL after the available data), Feast will return NaNs because the only observation we have is too old at that point:

```python
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 11, 10, 0, 0)
    }
)
...
```

```
event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-11 10:00:00+00:00	0	NaN	NaN	NaN
```

#### Values in the past

In our example, the observation date is `datetime(2021, 7, 9, 10, 0, 0)` and the TTL is 1 day, so we can request values between `datetime(2021, 7, 9, 10, 0, 0)` and `datetime(2021, 7, 10, 10, 0, 0)`. If we use a date in this range, we still get the freshest available value (the only one we have in the database).

```python
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 9, 10, 0, 0)
    }
)
...
```

Returns:

```
event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-09 10:00:00+00:00	0	5.1	3.5	setosa
```

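#### Retrieving values before the observation date

However, when we go back one second further, Feast will return NaNs because the available data is not in the range between `event_timestamp` - `ttl` and the `event_timestamp`. From the point of view of the requested timestamp, the observation has not happened yet: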

```python
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 9, 9, 59, 59)
    }
)
...
```

```
event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-09 09:59:59+00:00	0	NaN	NaN	NaN
```
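The four requests used in this section can be checked with plain date arithmetic. This is only a sketch of the rule as I understand it, assuming an inclusive window from `event_timestamp` - `ttl` to `event_timestamp`. The helper `is_served` is hypothetical and does not call Feast.

```python
from datetime import datetime, timedelta

observation_date = datetime(2021, 7, 9, 10, 0, 0)
ttl = timedelta(days=1)

# The event timestamps requested in the examples above
requests = [
    datetime(2021, 7, 10, 10, 0, 0),
    datetime(2021, 7, 11, 10, 0, 0),
    datetime(2021, 7, 9, 10, 0, 0),
    datetime(2021, 7, 9, 9, 59, 59),
]


def is_served(event_timestamp):
    """True when the observation lies inside [event_timestamp - ttl, event_timestamp]."""
    return event_timestamp - ttl <= observation_date <= event_timestamp


results = [is_served(ts) for ts in requests]
for ts, served in zip(requests, results):
    print(ts.isoformat(), "value" if served else "NaN")
```

The result agrees with the outputs above: the first and third requests return values, and the other two return NaNs.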