Add a Flask Worker in AWS Elastic Beanstalk

I was not successful setting up a Django server in an AWS Worker Environment, so I set up a Flask server instead. This guide assumes that you have a web server running in another environment, and that you’ve configured an SQS queue to enqueue jobs for your worker. We obviate the need for Celery by defining periodic tasks in a cron.yaml file on the Flask worker.
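For context, the web server enqueues work by sending a message to the SQS queue, and the worker environment’s daemon then POSTs each message body to your Flask app. Here is a minimal sketch of the enqueue side; the payload shape and queue URL are hypothetical, not something this setup mandates:

```python
import json

def build_job_message(task, **params):
    """Serialize a hypothetical job description for SQS.

    The worker daemon will deliver this body as the POST payload
    to the Flask endpoint configured for the worker environment.
    """
    return json.dumps({"task": task, "params": params})

body = build_job_message("scrape", source="archives")
# With boto3 (not shown here), the web server would send it roughly as:
#   boto3.client("sqs").send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```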

WSGI Configuration

I used the awsebcli tool to create my worker environment named “sqs-worker”: eb create -t worker sqs-worker. Then I needed to set the WSGIPath for the new environment, so I ran eb config sqs-worker and used the name of my application file, application.py.

  aws:elasticbeanstalk:container:python:
    NumProcesses: '1'
    NumThreads: '15'
    StaticFiles: /static/=static/
    WSGIPath: application.py

Worker Layout

Here is the structure of the project for my worker. I kept this in a separate git repo from my Django web server that was running in a separate environment.

worker/
├── .ebextensions
│   └── 01_logging.config
├── .elasticbeanstalk/...
├── venv/...
├── .gitignore
├── application.py
├── cron.yaml
├── requirements.txt
└── tasks.py

.ebextensions

The .ebextensions directory contains additional configuration for the Amazon Linux environment, which is a separate concern from your Python application. I have one config file in there that creates a log file, flask.log, for debugging. This config file also arranges for flask.log to be included in the log files collected by Amazon, so you’ll see it whenever you run eb logs <worker-name> from the command line.

commands:
  00_create_dir:
    command: mkdir -p /var/log/app-logs
  01_change_permissions:
    command: chmod g+s /var/log/app-logs
  02_change_owner:
    command: chown wsgi:wsgi /var/log/app-logs

files:
  "/opt/elasticbeanstalk/tasks/taillogs.d/flask.conf":
    mode: "000755"
    owner: root
    group: root
    content: |
      /var/log/app-logs/flask.log
  "/opt/python/log/flask.log" :
    mode: "000666"
    owner: ec2-user
    group: ec2-user
    content: |
      # flask log file

Miscellaneous Items in Worker

The .elasticbeanstalk directory was created by the awsebcli tool; its contents are beyond the scope of this question.

The venv directory is the Python virtual environment, created with python3 -m venv venv. It can be activated by running source venv/bin/activate. Packages can then be installed into the virtual environment as usual with pip install <package>, and requirements.txt can be generated with pip freeze > requirements.txt.

The .gitignore file informs git what to ignore. Since the awsebcli tool deploys the latest git commit, this file will probably be useful to you. The last section, regarding “Elastic Beanstalk Files”, was added automatically by awsebcli.

# .gitignore
__pycache__
.pytest_cache
.vscode
db.sqlite3
venv
chromedriver
# Elastic Beanstalk Files
.elasticbeanstalk/*
!.elasticbeanstalk/*.cfg.yml
!.elasticbeanstalk/*.global.yml

Flask Application

I found that I needed to name my Flask app file application.py, and that I needed to name my Flask application object application. Here, tasks.py is a simple Python script with several scraping functions defined. Note that we’re logging to /opt/python/log/flask.log; that file is created by the .ebextensions config file shown above. I listed several endpoints to illustrate that you can use different endpoints in your cron.yaml file.

# application.py
import datetime
from flask import Flask, Response
import logging

import tasks

application = Flask(__name__)

logging.basicConfig(filename='/opt/python/log/flask.log', level=logging.INFO)

def get_timestamp():
    date_fmt = '%Y/%m/%d %H:%M:%S'
    date_now = datetime.datetime.now()
    date_str = datetime.datetime.strftime(date_now, date_fmt)
    return date_str

@application.route('/worker/scrape/', methods = ['POST'])
def scrape():
    application.logger.info(get_timestamp())
    data = tasks.scrape()
    application.logger.info(data)
    return Response(status=200)

@application.route('/worker/archives/', methods = ['POST'])
def scrape_archives():
    application.logger.info(get_timestamp())
    data = tasks.scrape_archives()
    application.logger.info(data)
    return Response(status=200)

@application.route('/worker/posts/', methods = ['POST'])
def scrape_posts():
    application.logger.info(get_timestamp())
    data = tasks.scrape_posts()
    application.logger.info(data)
    return Response(status=200)
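The tasks module referenced above might look like the following stub. The function names come from application.py, but the bodies here are placeholders; the real functions do the actual scraping and return whatever data you want logged:

```python
# tasks.py (hypothetical stub; replace the bodies with real scraping logic)

def scrape():
    """Scrape the main target and return a summary for logging."""
    return {"task": "scrape", "items": 0}

def scrape_archives():
    """Scrape archive pages and return a summary for logging."""
    return {"task": "scrape_archives", "items": 0}

def scrape_posts():
    """Scrape individual posts and return a summary for logging."""
    return {"task": "scrape_posts", "items": 0}
```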

Cron

Here is my cron.yaml file. I am using crontab notation so that each task runs every three minutes, staggered one minute apart.

# cron.yaml
version: 1
cron:
 - name: "scrape"
   url: "/worker/scrape/"
   schedule: "0-59/3 * * * *"
 - name: "scrape-archives"
   url: "/worker/archives/"
   schedule: "1-59/3 * * * *"
 - name: "scrape-posts"
   url: "/worker/posts/"
   schedule: "2-59/3 * * * *"
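As a sanity check, you can verify that the three schedules above are staggered with no overlap. A small sketch (the minute-field parser handles only the a-b/s form used here, not general crontab syntax):

```python
def expand_minutes(field):
    """Expand a crontab minute field like '0-59/3' into the set of minutes it matches."""
    rng, _, step = field.partition('/')
    lo, _, hi = rng.partition('-')
    return set(range(int(lo), int(hi) + 1, int(step) if step else 1))

# The three schedules from cron.yaml above:
schedules = ["0-59/3", "1-59/3", "2-59/3"]
minutes = [expand_minutes(s) for s in schedules]

# Together the three sets cover every minute of the hour with no
# overlap, so exactly one task fires each minute.
```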

Helpful References

The Python Platform on AWS

Deploying Flask on AWS

Deploying Django on AWS

Overview of Worker Environment

Flask AWS Gotchas

Videos from AWS