In this article, we are going to set up a Dask cluster using Docker Compose and configure it to download data from Amazon S3.
Table of Contents
I am going to run all of my code on the local machine, but when you have Dask workers configured you can deploy them in a Mesos/Kubernetes cluster using the same configuration (just change the connection strings).
Docker Compose
First, I’m going to define my Docker Compose file (I will call it: docker-compose.yaml
). The file starts with the version, so in the first line I put:
version: "3.1"
Scheduler
The first service I must run is the Dask scheduler, so I add the following code to my docker-compose.yaml
file:
services:
scheduler:
image: daskdev/dask
hostname: dask-scheduler
ports:
- "8786:8786"
- "8787:8787"
command: ["dask-scheduler"]
We see that it will deploy the Dask scheduler as a dask-scheduler
host and expose the endpoints used by the workers.
Workers
In the next step, I have to deploy the workers. In this example, I will configure only one worker, but in a real-life scenerio, you will need many of them (if one worker is enough, you don’t need Dask at all).
Note that the worker configuration needs the address of the Dask scheduler and we must mount the directory containing the AWS credentials. I’m going to assume that you pass the credentials as a Docker volume, so the sensitive data is not displayed in any configuration console containing environment variables.
worker:
image: daskdev/dask
volumes:
- $HOME/.aws/credentials:/home/app/.aws/credentials:ro
hostname: dask-worker
command: ["dask-worker", "tcp://scheduler:8786"]
Workers (with environment variables)
If for some reason, you need to specify the AWS credentials as environment variables, you can configure the worker in the following way:
worker:
image: daskdev/dask
environment:
- AWS_ACCESS_KEY_ID=your_access_key
- AWS_SECRET_ACCESS_KEY=your_secred
- AWS_DEFAULT_REGION=the_aws_region
hostname: dask-worker
command: ["dask-worker", "tcp://scheduler:8786"]
Please remember, that it is not a recommented way because the environment variables (so also the keys and passwords) are usually visible in an administration console, so everyone who can manage your deployed services has access to the keys.
Notebook
Finally, when I have the scheduler and workers configured, I can add the notebook to the configuration.
notebook:
image: daskdev/dask-notebook
hostname: notebook
ports:
- "8888:8888"
environment:
- DASK_SCHEDULER_ADDRESS="tcp://scheduler:8786"
Now, when I run the docker-compose --file docker-compose.yaml up
command, all of the services will start, and I can open the localhost:8888 address in my browser to access the notebook.
Obviously, when you deploy the configuration using Kubernetes or Mesos, the command is going to be different. Check the documentation of the service you are using for deployment.
Reading files from S3
To create a DataFrame which contains data from an S3 bucket, I have to run the following code in the notebook:
import dask.dataframe as dd
df = dd.read_csv('s3://bucket/directory/file_name-*.csv')