---
title: "Best practices about partitioning data in S3 by date"
description: "How to partition data in S3 by date in a way that makes your life easier"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/partitioning-s3-data-by-date
---

In one of my projects, we have partitioned data in S3 in a way that makes everything complicated. Seriously. Writing any query that is more complicated than retrieving everything from one partition is a nightmare.

## Don't partition by year, month, and day

It seems to be a good idea to have files partitioned by three values: year, month, and day. This allows you to easily retrieve all rows assigned to a particular year or a month in a year. The object keys look so tidy. For example, you can have data stored in `s3://some_bucket/some_key/year=2020/month=12/day=01`. Which, seemingly, makes sense because that is the natural way of partitioning things by date. It is so perfect, except it isn't. At all.

Creating a hierarchy of partitions and partitioning files separately by year, month, and the day is a terrible idea if you ever want to make date range queries. What if you don't want all rows added in December 2020? What if I want rows between 2020.11.20 and 2020.12.20? Good luck with that!

What will you do?
List every single date in a huge WHERE statement like this:

```sql
WHERE (year = '2020' and month = '11' and day = '20') OR
(year = '2020' and month = '11' and day = '21') OR
(year = '2020' and month = '11' and day = '22') OR
...
(year = '2020' and month = '12' and day = '18') OR
(year = '2020' and month = '12' and day = '19') OR
(year = '2020' and month = '12' and day = '20')
```

Do you know that Athena queries have a length limit? No? You will find out that soon ;)

You may try grouping predicates by the common parts of the query:

```sql
WHERE (year = '2020' and month = '11' and day in ('20', '21', '22', ...) OR (year = '2020' and month = '12' and day in ('01', '02', ...)
```

For sure, your query will be shorter. For sure, the code that generates that SQL statement will have tons of bugs. Don't you love leap years? ;)

## The correct way to partition by date

Is there a better way? Yeah! Forget about splitting the partition into chunks. You can put the whole date (or maybe the date with time) as the partition name and have object keys that look like this: `s3://some_bucket/some_key/dt=2020-12-01`

If you write your dates in the ISO8601 format, you will end up with dates in alphabetical order, so all of your range queries will work correctly!

```sql
WHERE dt >= '2020-11-20' AND dt <= '2020-12-20'
```

Isn't it beautiful when dates work like dates?

You may say that now it is more difficult to get every value from 2020.
I agree that `WHERE dt >= '2020-01-01' AND dt <= '2020-12-31'` is a little longer than `WHERE year = '2020'`, but I think that shortening the notation in a few special cases is not worth the additional effort on a daily basis.