While Airflow offers a pre-built docker-compose.yaml file that takes care of most of the complicated bits, there are a few customizations I make to that file to improve performance on our VMs and to set things up. The instructions below walk you through the process. They currently install and set up Airflow 3.2.1.
Getting the Docker Compose file
To make it as easy as possible for you, I have posted the altered docker-compose.yaml file as a gist. The easiest way to get it onto your VM is to navigate into the airflow folder and then run:
curl -LfO https://gist.github.com/jrembold/99007351043c7564a61781c53c57ca4b/raw/1a044f58b745530c3aff9a44c0bc96501238babc/docker-compose.yaml
This will download the file directly into the current directory. Step 1 complete!
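If you would like to confirm the download worked before moving on, a quick sanity check can help. The sketch below is only an illustration: check_compose_file is a hypothetical helper name of my own, and it assumes the file was saved as docker-compose.yaml and that it defines the airflow-init service used later in these instructions.

```shell
# Hypothetical helper: reports whether a Compose file looks usable.
# Assumption: a usable file is non-empty and mentions the airflow-init service.
check_compose_file() {
  if [ -s "$1" ] && grep -q 'airflow-init' "$1"; then
    echo "looks good"
  else
    echo "missing or unexpected"
  fi
}

# Check the file in the current directory.
check_compose_file docker-compose.yaml
```

If this reports "missing or unexpected", the most likely cause is that the command was run from a different folder than the one you downloaded into.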
Extending the Image
There will quickly come a time when you would like to use a Python library that is less standard and thus not included in the base Airflow image. The easiest way to remedy this is to extend the Airflow image. Doing so requires both writing a Dockerfile to create the new, extended image, and then telling Docker Compose to build an image from that file instead of using the existing image. This has already been set up in the docker-compose.yaml file, so you just need to add the following to a new Dockerfile that you create in the same directory:
FROM apache/airflow:3.2.1
ADD requirements.txt .
RUN pip install apache-airflow==${AIRFLOW_VERSION} -r requirements.txt
Then you need to create the requirements.txt file, which just has a library name on each line. Initially, I’d recommend you add DuckDB (all the other common libraries like Pandas are already included in the image) and a library to help Airflow connect to S3 storage. Thus your requirements.txt should look like:
duckdb
apache-airflow-providers-amazon[s3fs]
Creating Directories
The Airflow Docker Compose file uses bind mounts to make some of the important folders inside the containers accessible from your VM. For those bind mounts to have a target, we need to create several folders. While still in your airflow folder, run:
mkdir -p ./dags ./logs ./plugins ./config
This will create four folders in that directory. The most important to us will be the dags folder, as that is where Airflow looks for new DAGs to process. We do need to tweak the permissions of this folder slightly so that both you as a user and the Airflow container user can access the files within it. Otherwise, you may create a file that Airflow never “sees”, so the DAG is never loaded. To set up these permissions, run:
chmod 775 ./dags
chmod g+s ./dags
These commands give the owner and group read/write/execute permissions on the folder (everyone else gets read/execute), and then set the setgid bit so that all future files created in the folder inherit the folder's group.
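To see what those two chmod commands actually do, you can try them on a throwaway folder and inspect the result. This is just a sketch: the scratch directory and the variable names are my own, chosen so nothing in your real airflow folder is touched.

```shell
# Scratch directory so the real airflow folder is left alone
# (the names "scratch" and "mode" are purely illustrative).
scratch="$(mktemp -d)"
mkdir -p "$scratch/dags"

chmod 775 "$scratch/dags"   # rwx for owner and group, r-x for everyone else
chmod g+s "$scratch/dags"   # setgid: new files take on the folder's group

# The first ten characters of `ls -ld` are the mode string.
mode="$(ls -ld "$scratch/dags" | cut -c1-10)"
echo "$mode"
```

On a typical Linux VM the mode string should read drwxrwsr-x, where the s in the group position marks the setgid bit. Running ls -ld ./dags in your airflow folder should show the same pattern.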
Setting up the User Environment Variable
In addition to setting up the above folder permissions, we need to let Airflow know your current user ID, so that its containers run as you and the files they create in the mounted folders stay accessible to you. We do this by providing a key-value pairing inside a .env file, which is where Docker Compose automatically looks for environment variables. We can both create the file and automatically populate it with the correct value by running:
echo -e "AIRFLOW_UID=$(id -u)" > .env
Note that once the file is created, a standard ls will not show it, because the leading dot marks it as a hidden file. To see hidden files with ls, you must add the -a flag.
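If you want to convince yourself of what ends up in that file, and of how hidden files behave, here is a small sketch that does the same thing in a throwaway folder. The scratch directory and variable names are my own, and I use printf as a portable stand-in for the echo -e command.

```shell
# Work in a scratch directory so your real .env is not overwritten
# (all variable names here are illustrative).
scratch="$(mktemp -d)"
env_file="$scratch/.env"

# Same key-value pair as the command in the instructions.
printf 'AIRFLOW_UID=%s\n' "$(id -u)" > "$env_file"

plain="$(ls "$scratch")"      # a plain ls skips dotfiles
all="$(ls -a "$scratch")"     # -a includes them

echo "ls shows:    [$plain]"
echo "ls -a shows:"
echo "$all"
cat "$env_file"               # AIRFLOW_UID= followed by your numeric user ID
```

The plain ls line comes back empty, while ls -a lists .env alongside the . and .. entries.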
Starting it all up!
At this point, you are ready to start things up! Airflow recommends running the airflow-init service first, to set up everything it needs:
docker compose up airflow-init
That will take a minute, but eventually you’ll see a message like “airflow-init-1 exited with code 0”. At that point you can run docker compose up, or docker compose up -d to run it in the background, to start up the whole stack!
Remember that the Airflow GUI is hosted on your VM (localhost) at port 8080. So that is where you should point your SSH tunnel if you want to view the GUI on your own system. The default login user and password are both airflow!
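As a reminder of what such a tunnel can look like, here is a minimal sketch that just assembles and prints the command. The username and host below are placeholders I made up, and using 8080 as the local port is my own choice; substitute whatever you normally use to SSH into your VM.

```shell
# Placeholders: replace with your own VM login details (both are hypothetical).
vm_user="your_username"
vm_host="your.vm.address"

# -L forwards local port 8080 to port 8080 on the VM's localhost;
# -N opens the tunnel without starting a remote shell.
tunnel_cmd="ssh -N -L 8080:localhost:8080 ${vm_user}@${vm_host}"
echo "$tunnel_cmd"
```

Run the printed command on your own computer (not on the VM), and then open http://localhost:8080 in your browser.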