I was not successful in setting up a Django server in an AWS Elastic Beanstalk worker environment, so I set up a Flask server instead. This guide assumes that you have a web server running in another environment, and that you've configured an SQS queue to enqueue jobs for your worker. A `cron.yaml` file on the Flask worker defines the periodic tasks, which obviates the need for Celery.
WSGI Configuration
I used the `awsebcli` tool to create my worker environment, named "sqs-worker": `eb create -t worker sqs-worker`. Then I needed to configure the `WSGIPath` for the new environment, so I ran `eb config sqs-worker` and set `WSGIPath` to the name of my application file, `application.py`.
```yaml
aws:elasticbeanstalk:container:python:
  NumProcesses: '1'
  NumThreads: '15'
  StaticFiles: /static/=static/
  WSGIPath: application.py
```
Worker Layout
Here is the structure of the project for my worker. I kept this in its own git repo, separate from my Django web server, which was running in the other environment.
```
worker/
├── .ebextensions
│   └── 01_logging.config
├── .elasticbeanstalk/...
├── venv/...
├── .gitignore
├── application.py
├── cron.yaml
├── requirements.txt
└── tasks.py
```
.ebextensions
The `.ebextensions` directory contains additional configuration for the Amazon Linux AMI environment, which is a separate concern from your Python application. I have one config file in there that creates a log file, `flask.log`, for debugging. The config file also arranges for `flask.log` to be included in the log files collected by Amazon, so you'll see it whenever you run `eb logs <worker-name>` from the command line.
```yaml
commands:
  00_create_dir:
    command: mkdir -p /var/log/app-logs
  01_change_permissions:
    command: chmod g+s /var/log/app-logs
  02_change_owner:
    command: chown wsgi:wsgi /var/log/app-logs

files:
  "/opt/elasticbeanstalk/tasks/taillogs.d/flask.conf":
    mode: "000755"
    owner: root
    group: root
    content: |
      /var/log/app-logs/flask.log

  "/opt/python/log/flask.log":
    mode: "000666"
    owner: ec2-user
    group: ec2-user
    content: |
      # flask log file
```
Miscellaneous Items in Worker
The `.elasticbeanstalk` directory was created by the `awsebcli` tool, which I think is beyond the scope of this question.
The `venv` directory is the virtual environment for Python; it was created by `python3 -m venv venv`. The virtual environment can be activated by running `source venv/bin/activate`. Packages can then be installed into it as usual with pip, `pip install <package>`, and `requirements.txt` can be generated with `pip freeze > requirements.txt`.
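For reference, a `requirements.txt` for this worker might look something like the sketch below. This is an assumption on my part: the only package the Flask app strictly needs is Flask itself, and `requests`/`beautifulsoup4` are just plausible guesses at what the scraping functions in `tasks.py` might use. Your actual `pip freeze` output will contain exact pinned versions and transitive dependencies.

```
# requirements.txt (illustrative; pip freeze emits exact pinned versions)
Flask
requests
beautifulsoup4
```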
The `.gitignore` file tells git what to ignore. Since the `awsebcli` tool deploys the latest git commit, this file will probably be useful to you. The last section, regarding "Elastic Beanstalk Files", was added automatically by `awsebcli`.
```
# .gitignore
__pycache__
.pytest_cache
.vscode
db.sqlite3
venv
chromedriver

# Elastic Beanstalk Files
.elasticbeanstalk/*
!.elasticbeanstalk/*.cfg.yml
!.elasticbeanstalk/*.global.yml
```
Flask Application
I found that I needed to name my Flask app file `application.py`, and that I needed to name my Flask application object `application`. Here, `tasks.py` is a simple Python script with several scraping functions defined; a sketch of it follows the listing below. Note that we're logging to `/opt/python/log/flask.log`, which is created by the `.ebextensions` config file shown above. I listed several endpoints to illustrate that you can use different endpoints in your `cron.yaml` file.
```python
# application.py
import datetime
import logging

from flask import Flask, Response

import tasks

application = Flask(__name__)
logging.basicConfig(filename='/opt/python/log/flask.log', level=logging.INFO)


def get_timestamp():
    date_fmt = '%Y/%m/%d %H:%M:%S'
    date_now = datetime.datetime.now()
    return datetime.datetime.strftime(date_now, date_fmt)


@application.route('/worker/scrape/', methods=['POST'])
def scrape():
    application.logger.info(get_timestamp())
    data = tasks.scrape()
    application.logger.info(data)
    return Response(status=200)


@application.route('/worker/archives/', methods=['POST'])
def scrape_archives():
    application.logger.info(get_timestamp())
    data = tasks.scrape_archives()
    application.logger.info(data)
    return Response(status=200)


@application.route('/worker/posts/', methods=['POST'])
def scrape_posts():
    application.logger.info(get_timestamp())
    data = tasks.scrape_posts()
    application.logger.info(data)
    return Response(status=200)
```
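The original `tasks.py` is not reproduced here, so the following is only a minimal sketch of its shape. The three function names match what `application.py` calls; the bodies, the target URL, and the use of `requests` and BeautifulSoup are all assumptions for illustration.

```python
# tasks.py -- illustrative sketch only; the real scraping logic is not
# shown in the original. requests/BeautifulSoup and the URL are assumptions.
import requests
from bs4 import BeautifulSoup

SITE_URL = 'https://example.com'  # hypothetical site to scrape


def scrape():
    # Fetch the front page and return its title as a stand-in result.
    response = requests.get(SITE_URL)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.title.string if soup.title else None


def scrape_archives():
    # Placeholder: scrape the archive pages.
    return 'scraped archives'


def scrape_posts():
    # Placeholder: scrape individual posts.
    return 'scraped posts'
```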
Cron
Here is my `cron.yaml` file. I am using crontab notation to run each task every three minutes, staggered so that a different task fires each minute.
```yaml
# cron.yaml
version: 1
cron:
  - name: "scrape"
    url: "/worker/scrape/"
    schedule: "0-59/3 * * * *"
  - name: "scrape-archives"
    url: "/worker/archives/"
    schedule: "1-59/3 * * * *"
  - name: "scrape-posts"
    url: "/worker/posts/"
    schedule: "2-59/3 * * * *"
```
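In the worker environment, the sqsd daemon makes these POST requests for you on the given schedules, so nothing calls the endpoints by hand. Before deploying, though, you can smoke-test them locally with Flask's test client. This is a sketch of my own, not part of the deployed worker; note that `application.py` configures logging to `/opt/python/log/flask.log`, so create that path locally (or point `logging.basicConfig` somewhere else) before importing it.

```python
# smoke_test.py -- exercise the worker endpoints locally, the way sqsd
# will in the worker environment. Not deployed with the worker.
from application import application


def main():
    client = application.test_client()
    for url in ('/worker/scrape/', '/worker/archives/', '/worker/posts/'):
        response = client.post(url)
        print(url, response.status_code)


if __name__ == '__main__':
    main()
```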