Copy directories in S3 using s3-dist-cp

S3 has no concept of directories, but that does not stop us from using / as a delimiter in object keys and treating objects that share a key prefix as files in the same directory.

That causes a problem when we want to copy one directory’s content into another because we cannot just copy the files to a different location. We have to preserve the part of every object key that follows the source prefix.

On a regular file system, when we have these files:

/home/user/the_directory/file_A
/home/user/the_directory/file_B
/home/user/the_directory/file_C
/home/user/the_directory/file_D

and we want to copy them to the home directory of user another_user, the expected result looks like this:

/home/another_user/the_directory/file_A
/home/another_user/the_directory/file_B
/home/another_user/the_directory/file_C
/home/another_user/the_directory/file_D

How do we achieve the same outcome in S3?

We need two things:

  • a running EMR cluster
  • the s3-dist-cp script, which is available on all EMR clusters

Let’s pretend that the above directory structure is also the structure of our S3 keys. For example, we have a file in this location: s3://home/user/the_directory/file_A.
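To make the goal concrete, here is a minimal sketch of the key mapping we are after: everything after the source prefix is kept and appended to the destination prefix. The remap_key function is a hypothetical helper written only for illustration. It is not part of s3-dist-cp and does not touch S3 at all.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only (not part of s3-dist-cp).
# Rewrites a single object key by swapping the source prefix
# for the destination prefix and keeping the rest of the key.
remap_key() {
  local key="$1"
  local src="${2%/}"    # drop a trailing slash, if any
  local dest="${3%/}"
  case "$key" in
    "$src"/*) printf '%s\n' "$dest/${key#"$src"/}" ;;
    *) return 1 ;;      # the key is not under the source prefix
  esac
}

remap_key "s3://home/user/the_directory/file_A" "s3://home/user" "s3://home/another_user"
# prints s3://home/another_user/the_directory/file_A
```

This is the transformation we expect the copy to apply to every object under the source prefix.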

First, we have to SSH into the master node of the cluster.
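A minimal sketch of that step, assuming the cluster was started with an EC2 key pair and that we log in as the hadoop user, which is the default user on EMR nodes. The connect_to_master function name, the key file path, and the DNS name are placeholders, so replace them with your own values.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around ssh, shown for illustration only.
# $1: path to the private key of the EC2 key pair used by the cluster (assumption)
# $2: public DNS name of the cluster's master node (assumption)
connect_to_master() {
  local key_file="$1"
  local master_dns="$2"
  ssh -i "$key_file" "hadoop@${master_dns}"
}

# Example call with placeholder values:
# connect_to_master "$HOME/.ssh/my-emr-key.pem" "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
```

The security group of the master node must allow inbound SSH traffic from your machine for this to work.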

After that, we run the s3-dist-cp command, passing the source prefix as --src and the target prefix as --dest. The script automatically preserves the part of every object key that follows the source prefix:

s3-dist-cp --src=s3://home/user --dest=s3://home/another_user
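When the job finishes, we can check the result by listing the objects under the destination prefix. This is a sketch that assumes the AWS CLI is installed and has read access to the bucket. The list_copied_objects function is a hypothetical wrapper added for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the AWS CLI, shown for illustration only.
# Lists every object stored under the given S3 prefix.
list_copied_objects() {
  local dest_prefix="$1"
  aws s3 ls "$dest_prefix" --recursive
}

# Example call using the destination from the command above:
# list_copied_objects "s3://home/another_user/"
```

If the copy worked, the listing contains keys such as another_user/the_directory/file_A.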