---
title: "How to Speed Up AWS Athena Queries Using Partition Projection"
description: "How to define partition projection while creating an Athena table"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/speed-up-aws-athena-queries-using-partition-projection
---

Partition Projection in AWS Athena is a recently added feature that speeds up queries by defining the available partitions as a part of table configuration instead of retrieving the metadata from the Glue Data Catalog.

It makes Athena queries faster because there is no need to query the metadata catalog. Also, when partition projection is enabled, it is no longer necessary to reload partitions to rebuild the metadata catalog after storing new files in S3.

## How to Create a Table with Partition Projection

The AWS documentation shows how to add [Partition Projection](https://docs.aws.amazon.com/athena/latest/ug/partition-projection-setting-up.html) to an existing table. In this article, I will define a new table with partition projection using the `CREATE TABLE` statement.

Let's assume that I have an S3 bucket full of Parquet files stored in partitions that denote the date when the file was stored. For example, I have an S3 key which looks like this: `s3://my_bucket_name/files/year=2020/month=08/day=29/f_001`

We see that my files are partitioned by year, month, and day. In this case, the standard `CREATE TABLE` statement that uses the Glue Data Catalog to store the partition metadata looks like this:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name
(
  -- some columns here
)
PARTITIONED BY (
  year STRING,
  month STRING,
  day STRING
)
STORED AS PARQUET
LOCATION 's3://my_bucket_name/files/';
```

### Enabling Partition Projection

When I want to enable the Partition Projection feature for `my_table_name` table, I must `TBLPROPERTIES` that describes my partitioning method. I have three text columns that denote parts of a date.

In my case, the `year` may have values between 2015 and 2020. Therefore, its properties look like this:

```sql
projection.year.type = 'integer',
projection.year.range= '2015,2020'
```

The `month` column is a little bit more tricky because I use the leading zeros. The numeric values of months between January and September are preceded by zero. Because of that, in addition to specifying the value range, I have to configure the number of digits:

```sql
projection.month.type = 'integer',
projection.month.range= '1,12'
projection.month.digits = 2
```

Finally, I have to do the same with the `day` column. There are no tricks that allow me to have fewer `day` partitions in months shorter than 31 days. I must use the range of 1-31. Athena will ignore empty partitions while retrieving the data.

```sql
projection.day.type = 'integer',
projection.month.range= '1,31'
projection.month.digits = 2
```

The new `CREATE TABLE` statement that defines projected partitions looks like this:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name
(
  -- some columns here
)
PARTITIONED BY (
  year STRING,
  month STRING,
  day STRING
)
STORED AS PARQUET
LOCATION 's3://my_bucket_name/files/'
TBLPROPERTIES (
    ‘projection.enabled’ = ‘true’,
	‘projection.year.type’ = ‘integer’,
    ‘projection.year.range’ = ‘2015,2020’,
    ‘projection.month.type’ = ‘integer’,
    ‘projection.month.range’ = ‘1,12’
    ‘projection.month.digits’ = ‘2’,
    ‘projection.day.type’ = ‘integer’,
    ‘projection.month.range’ = ‘1,31’
    ‘projection.month.digits’ = 2’
)
```

Now, when I query the table in Athena, it will generate the S3 key prefix that it needs to scan without using the Glue Data Catalog.

## When to Avoid Using Partition Projection

The Partition Projection feature is available only in AWS Athena. Even if a table definition contains the partition projection configuration, other tools will not use those values. For example, **Apache Spark, Hive, Presto read partition metadata directly from Glue Data Catalog and do not support partition projection**. When we have a table used by a service running on EMR, we MUST continue using Glue Data Catalog.

EMR and Redshift Spectrum look for the partitions in the metadata. If we add new files to the S3 location and a new partition should be created, we MUST reload the partitions. Otherwise, only AWS Athena will find the data (because of partition projection).

Adding new files and creating new partitions causes another issue. The partition projection configuration is static. If we want to change it, we must recreate the table.

If we add a new partition value outside of the range defined as a partition projection, Athena will not find those files. **When we have partition projection enabled, Athena does not retrieve the metadata from Glue. Therefore, reloading the partition metadata in Glue Data Catalog will not help solve that issue.** When such a problem occurs, we can use the [Dynamic ID Partitioning](https://docs.aws.amazon.com/athena/latest/ug/partition-projection-dynamic-id-partitioning.html) to avoid recreating the table every time we want to add a new partition value.