Using Terraform to write infrastructure-as-code for running Airflow DAG tasks
Apache Airflow made it easy to schedule and manage a large number of DAGs, and Travis CI helped us automate the container building and deployment process. This reduced the time it took our data science team to build and push different models. But while building and pushing containers became quick, the infrastructure required for those containers became the biggest bottleneck. A simple DAG that should be triggered when a file is dropped into an S3 bucket, process it, and write the result into a different output bucket requires the following infrastructure:
- Elastic Container Registry entry for container definition
- Elastic Container Service task definition
- S3 bucket for storing input and output files
- SQS trigger on file drop into the S3 bucket
- CloudWatch logs
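To give a sense of what wiring the trickiest of these pieces looks like, here is a sketch of the S3-to-SQS trigger: a bucket, a queue, a queue policy that lets S3 publish to it, and the notification itself. Resource and bucket names are illustrative placeholders, not our actual configuration:

```hcl
# Input bucket whose file drops should trigger the DAG task
resource "aws_s3_bucket" "input" {
  bucket = "sample-input-bucket" # placeholder name
}

# Queue that receives the S3 event notifications
resource "aws_sqs_queue" "file_drop" {
  name = "sample-file-drop-queue"
}

# Allow S3 to send messages to the queue
resource "aws_sqs_queue_policy" "file_drop" {
  queue_url = aws_sqs_queue.file_drop.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.file_drop.arn
      Condition = { ArnEquals = { "aws:SourceArn" = aws_s3_bucket.input.arn } }
    }]
  })
}

# Wire the bucket's object-created events to the queue
resource "aws_s3_bucket_notification" "input" {
  bucket = aws_s3_bucket.input.id
  queue {
    queue_arn = aws_sqs_queue.file_drop.arn
    events    = ["s3:ObjectCreated:*"]
  }
}
```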
While this is easy to create for a single task, scaling it over hundreds of containers across multiple environments (sandbox, dev, prod), and cleaning up in case of mistakes, became a big overhead. Something that was tested and working fine in sandbox would fail in dev or prod due to missing or incorrect configuration. Switching to infrastructure-as-code gave us stability and the ability to quickly build and remove infrastructure without adverse effects such as unintentionally removing another developer's resources. Using Terraform, we can now provision and manage any cloud, infrastructure, or service as code.
Once Terraform is installed and the .tf config files are created, the commands below can be used to easily create and remove infrastructure. Further, since the infrastructure is code, replicating it across environments and DAGs becomes much easier.
```shell
# Initialize Terraform
terraform init

# Preview the changes that will be applied
terraform plan

# Apply the changes
terraform apply

# Remove/destroy the infrastructure
terraform destroy
```
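Because the environment name is driven by a single variable, promoting the same configuration from sandbox to dev to prod is just a matter of passing a different value on the command line (the `env_name` variable name here is illustrative):

```shell
# Plan and apply the same code against a different stage
terraform plan -var="env_name=dev"
terraform apply -var="env_name=dev"
```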
The snippets below illustrate the various .tf files used for infrastructure creation. We decided to break the .tf files down by resource to keep them organized and simple enough to read and modify with a large number of containers in a single repository:
- versions.tf - Specifies the Terraform version required
- vars.tf - Local variables referenced in other files. Note the ENV_NAME variable, which is the only value that changes when moving across stages from sandbox to dev to prod
- provider.tf - Specifies the cloud provider, which in this example is AWS
- ecr.tf - Elastic Container Registry entry for holding the container image. You can find the corresponding container code in our previous blog post on Travis CI here
- ecs_td.tf - Elastic Container Service task definition for running the sample container
- s3.tf - Creates the S3 bucket where a file drop will trigger SQS
- cloudwatch.tf - Creates CloudWatch logs to monitor the container runs
- sqs.tf - Defines the queue and corresponding policy for the S3 file drop
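A minimal sketch of what several of these files might contain is shown below. Resource names, the region, the default environment, and the container image are illustrative placeholders, and IAM roles (e.g. the task execution role Fargate needs to pull from ECR and write logs) are omitted for brevity:

```hcl
# versions.tf - pin the Terraform and provider versions
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}

# vars.tf - ENV_NAME is the only value that changes across stages
variable "env_name" {
  type    = string
  default = "sandbox" # switch to "dev" or "prod" per environment
}

# provider.tf - AWS as the cloud provider
provider "aws" {
  region = "us-east-1" # placeholder region
}

# ecr.tf - registry entry for the container image
resource "aws_ecr_repository" "sample" {
  name = "sample-container-${var.env_name}"
}

# cloudwatch.tf - log group for monitoring container runs
resource "aws_cloudwatch_log_group" "sample" {
  name              = "/ecs/sample-container-${var.env_name}"
  retention_in_days = 14
}

# ecs_td.tf - task definition that runs the sample container
resource "aws_ecs_task_definition" "sample" {
  family                   = "sample-task-${var.env_name}"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  container_definitions = jsonencode([{
    name  = "sample-container"
    image = "${aws_ecr_repository.sample.repository_url}:latest"
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.sample.name
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "sample"
      }
    }
  }])
}
```

Keeping one resource per file means a new container usually only needs an ecr.tf and ecs_td.tf entry, while versions.tf, vars.tf, and provider.tf stay untouched.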