Partition Projection in AWS Athena is a recently added feature that speeds up queries by defining the available partitions as a part of table configuration instead of retrieving the metadata from the Glue Data Catalog.
Table of Contents
It makes Athena queries faster because there is no need to query the metadata catalog. Also, when partition projection is enabled, it is no longer necessary to reload partitions to rebuild the metadata catalog after storing new files in S3.
How to Create a Table with Partition Projection
The AWS documentation shows how to add Partition Projection to an existing table. In this article, I will define a new table with partition projection using the CREATE TABLE
statement.
Let’s assume that I have an S3 bucket full of Parquet files stored in partitions that denote the date when the file was stored. For example, I have an S3 key which looks like this: s3://my_bucket_name/files/year=2020/month=08/day=29/f_001
We see that my files are partitioned by year, month, and day. In this case, the standard CREATE TABLE
statement that uses the Glue Data Catalog to store the partition metadata looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name
(
-- some columns here
)
PARTITIONED BY (
year STRING,
month STRING,
day STRING
)
STORED AS PARQUET
LOCATION 's3://my_bucket_name/files/';
Want to build AI systems that actually work?
Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.
Enabling Partition Projection
When I want to enable the Partition Projection feature for my_table_name
table, I must TBLPROPERTIES
that describes my partitioning method. I have three text columns that denote parts of a date.
In my case, the year
may have values between 2015 and 2020. Therefore, its properties look like this:
projection.year.type = 'integer',
projection.year.range= '2015,2020'
The month
column is a little bit more tricky because I use the leading zeros. The numeric values of months between January and September are preceded by zero. Because of that, in addition to specifying the value range, I have to configure the number of digits:
projection.month.type = 'integer',
projection.month.range= '1,12'
projection.month.digits = 2
Finally, I have to do the same with the day
column. There are no tricks that allow me to have fewer day
partitions in months shorter than 31 days. I must use the range of 1-31. Athena will ignore empty partitions while retrieving the data.
projection.day.type = 'integer',
projection.month.range= '1,31'
projection.month.digits = 2
The new CREATE TABLE
statement that defines projected partitions looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name
(
-- some columns here
)
PARTITIONED BY (
year STRING,
month STRING,
day STRING
)
STORED AS PARQUET
LOCATION 's3://my_bucket_name/files/'
TBLPROPERTIES (
‘projection.enabled’ = ‘true’,
‘projection.year.type’ = ‘integer’,
‘projection.year.range’ = ‘2015,2020’,
‘projection.month.type’ = ‘integer’,
‘projection.month.range’ = ‘1,12’
‘projection.month.digits’ = ‘2’,
‘projection.day.type’ = ‘integer’,
‘projection.month.range’ = ‘1,31’
‘projection.month.digits’ = 2’
)
Now, when I query the table in Athena, it will generate the S3 key prefix that it needs to scan without using the Glue Data Catalog.
When to Avoid Using Partition Projection
The Partition Projection feature is available only in AWS Athena. Even if a table definition contains the partition projection configuration, other tools will not use those values. For example, Apache Spark, Hive, Presto read partition metadata directly from Glue Data Catalog and do not support partition projection. When we have a table used by a service running on EMR, we MUST continue using Glue Data Catalog.
EMR and Redshift Spectrum look for the partitions in the metadata. If we add new files to the S3 location and a new partition should be created, we MUST reload the partitions. Otherwise, only AWS Athena will find the data (because of partition projection).
Adding new files and creating new partitions causes another issue. The partition projection configuration is static. If we want to change it, we must recreate the table.
If we add a new partition value outside of the range defined as a partition projection, Athena will not find those files. When we have partition projection enabled, Athena does not retrieve the metadata from Glue. Therefore, reloading the partition metadata in Glue Data Catalog will not help solve that issue. When such a problem occurs, we can use the Dynamic ID Partitioning to avoid recreating the table every time we want to add a new partition value.