Today, I am going to show you how to import data from Amazon S3 into a PostgreSQL database running as an Amazon RDS service.
The first thing we have to do is install the aws_s3 extension in PostgreSQL. To do this, log in as an administrator and run the following statement:
CREATE EXTENSION aws_s3 CASCADE;
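To confirm the installation succeeded, we can query the pg_extension catalog. The CASCADE option should have installed the aws_commons dependency as well, so both names should appear:

```sql
-- List the installed extensions; both aws_s3 and aws_commons should show up
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('aws_s3', 'aws_commons');
```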
The RDS instance needs read-only access to the S3 bucket we want to import from, plus permission to use the s3Import feature. Let's use Terraform to define the policies and roles:
resource "aws_iam_policy" "s3-access" {
  name        = "s3-access"
  description = "A policy for S3 access."

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "s3import",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::bucket-name",
        "arn:aws:s3:::bucket-name/*"
      ]
    }
  ]
}
EOF
}
Now we have to allow the RDS service to assume the role associated with the S3 access policy:
resource "aws_iam_role" "rds-assume-role" {
  name = "rds"

  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "rds.amazonaws.com"
      },
      "Effect": "Allow"
    }
  ]
}
EOF
}
resource "aws_iam_role_policy_attachment" "rds-to-s3" {
  role       = aws_iam_role.rds-assume-role.name
  policy_arn = aws_iam_policy.s3-access.arn
}
There is one more thing we need. We have to allow our RDS instance to import data from S3:
resource "aws_db_instance_role_association" "rds_s3import_role" {
  db_instance_identifier = aws_db_instance.rds-database.id
  feature_name           = "s3Import"
  role_arn               = aws_iam_role.rds-assume-role.arn
}
Finally, we have to add two rules to the security group used by the RDS database. The database needs to be able to make outbound HTTP and HTTPS calls to communicate with S3:
# Add those to the security group used by the RDS database
egress {
  from_port   = 80
  to_port     = 80
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

egress {
  from_port   = 443
  to_port     = 443
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}
Importing the data
Finally, we can use the aws_s3 extension to import a CSV file into a database table. The function takes the target table, an optional column list, COPY options, and then the bucket name, object key, and AWS region as separate arguments:

SELECT aws_s3.table_import_from_s3('schema.table_name', '', '(FORMAT CSV)', 'bucket-name', 'file_key', 'aws-region-id');
Obviously, before running that command, we have to create the table with a schema that matches the CSV file.
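As a sketch, assuming a CSV file with three columns (the table, column, and bucket names below are all hypothetical), the setup and import might look like this:

```sql
-- Hypothetical table whose columns match the CSV file, in order
CREATE TABLE my_schema.users (
    id         integer,
    name       text,
    created_at timestamp
);

-- Import users.csv from the bucket into the table
SELECT aws_s3.table_import_from_s3(
    'my_schema.users',   -- target table
    '',                  -- empty column list = all columns, in file order
    '(FORMAT CSV)',      -- options passed through to PostgreSQL COPY
    'bucket-name',       -- S3 bucket
    'users.csv',         -- object key within the bucket
    'eu-west-1'          -- region of the bucket (an assumption)
);
```

On success, the function returns a summary line reporting how many rows were imported.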