ML Model Deployment Infrastructure at Ensemble Energy

This is the first in a series of posts giving an overview of what our team at Ensemble Energy has been doing over the past few months: automating the deployment of machine learning models at scale, and managing and monitoring a large fleet of containers that generate predictive insights on high-speed time-series data.

Ensuring that data is in the right place at the right time is critical for our data science team. Apart from ETL (Extract, Transform, and Load) operations, we had several jobs running on regular schedules, originally set up as cron jobs for ease of deployment. The biggest challenge was the lack of visibility and the inability to reprocess failed tasks or take corrective action without pulling developer time away from other work. On top of that, managing all of this across environments such as sandbox, development, and production was a nightmare.
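For context, a job in the original setup was essentially a standalone script fired by cron on a fixed schedule. The sketch below is hypothetical (the script, function, and schedule are made up for illustration), but it captures the pattern: if a run fails, nothing retries it, nobody is alerted, and re-running a missed window is a manual chore.

```python
# etl_pull_scada.py -- hypothetical example of a cron-era ETL job.
# Scheduled with a crontab entry such as:
#   */10 * * * * /usr/bin/python3 /opt/jobs/etl_pull_scada.py >> /var/log/etl.log 2>&1
# If the script dies, cron will not retry it; the only trace is a log line.

import logging
import sys
from datetime import datetime, timedelta

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_pull_scada")


def extract_window(start: datetime, end: datetime) -> list:
    """Pull raw time-series rows for the window (placeholder)."""
    return []  # e.g. query the source historian / database here


def load(rows: list) -> None:
    """Write transformed rows to the warehouse (placeholder)."""
    pass


if __name__ == "__main__":
    end = datetime.utcnow()
    start = end - timedelta(minutes=10)
    try:
        load(extract_window(start, end))
        log.info("processed window %s - %s", start, end)
    except Exception:
        # A non-zero exit code is all cron sees; reprocessing the failed window is manual.
        log.exception("job failed for window %s - %s", start, end)
        sys.exit(1)
```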

To address these challenges, we built a robust container deployment infrastructure on industry-standard tooling. The following posts will dive deep into our use of each tool:

  1. Apache Airflow for scheduling and monitoring ETL and ML model scoring (a minimal DAG sketch follows this list)
    blog.ensembleenergy.ai/apache-airflow/

  2. Travis CI to automate building container images and pushing them to Amazon Elastic Container Registry (ECR)
    blog.ensembleenergy.ai/travis-ci-ecr/

  3. Terraform to define infrastructure-as-code for provisioning and managing the resources that run the Airflow DAG tasks
    blog.ensembleenergy.ai/infrastructure-as-code/
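As a small preview of the Airflow post, the sketch below shows roughly how a cron job like the one above looks once it becomes a DAG. The DAG, task, and callable names are illustrative, not our production code; the point is that the schedule, retries, and failure alerting now live in one place.

```python
# dags/scada_etl.py -- hypothetical DAG illustrating the cron job above as Airflow tasks.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

default_args = {
    "owner": "data-eng",
    "retries": 2,                      # automatic retries instead of silent failures
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,          # alerting handled by the scheduler, not the job
}


def extract(**context):
    """Pull the raw window for this run; the execution date scopes the window."""
    pass


def transform_and_load(**context):
    """Transform the extracted rows and write them to the warehouse."""
    pass


with DAG(
    dag_id="scada_etl",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="*/10 * * * *",  # same cadence as the old crontab entry
    catchup=False,
) as dag:
    extract_task = PythonOperator(
        task_id="extract", python_callable=extract, provide_context=True
    )
    load_task = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform_and_load,
        provide_context=True,
    )

    extract_task >> load_task
```

Failed task instances can then be cleared and re-run from the Airflow UI, which is exactly the corrective step the cron setup lacked.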