I was not successful setting up a Django server in an AWS Elastic Beanstalk worker environment, so I set up a Flask server instead. This guide assumes that you have a web server running in another environment, and that you've configured an SQS queue to enqueue jobs for your worker. We obviate the need for Celery by using a `cron.yaml` file on the Flask worker to define periodic tasks.
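For context, the web-server side can drop a job onto the queue with `boto3`, and the worker environment's daemon will POST the message body to your Flask app. A minimal sketch — the queue name `sqs-worker-queue` is hypothetical, and this assumes `boto3` is installed with AWS credentials configured:

```python
import json


def build_job_message(task, payload):
    """Serialize a job description into the JSON string sent as the SQS message body."""
    return json.dumps({'task': task, 'payload': payload})


def enqueue_job(task, payload, queue_name='sqs-worker-queue'):
    """Send one job message. Requires boto3 and AWS credentials; the
    queue name here is a placeholder for your actual worker queue."""
    import boto3  # imported inside so the pure helper above works without boto3
    sqs = boto3.client('sqs')
    queue_url = sqs.get_queue_url(QueueName=queue_name)['QueueUrl']
    return sqs.send_message(QueueUrl=queue_url,
                            MessageBody=build_job_message(task, payload))
```

Note that for the periodic tasks below, no enqueuing is needed at all: the worker's daemon fires the `cron.yaml` schedules on its own.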
I used the `awsebcli` tool to create my worker environment named "sqs-worker": `eb create -t worker sqs-worker`. Then I needed to configure the `WSGIPath` for the new environment, so I ran `eb config sqs-worker` and simply used the name of my application file:
```yaml
aws:elasticbeanstalk:container:python:
  NumProcesses: '1'
  NumThreads: '15'
  StaticFiles: /static/=static/
  WSGIPath: application.py
```
Here is the structure of the project for my worker. I kept it in a git repo separate from my Django web server, which runs in its own environment.
```
worker/
├── .ebextensions
│   └── 01_logging.config
├── .elasticbeanstalk/...
├── venv/...
├── .gitignore
├── application.py
├── cron.yaml
├── requirements.txt
└── tasks.py
```
The `.ebextensions` directory contains additional configuration for the Amazon Linux environment, which is a separate concern from your Python application. I have one config file in there that creates a log file, `flask.log`, for debugging. This config file also arranges for `flask.log` to be included in the log files collected by Amazon, so that you'll see it whenever you run `eb logs <worker-name>` from the command line.
```yaml
commands:
  00_create_dir:
    command: mkdir -p /var/log/app-logs
  01_change_permissions:
    command: chmod g+s /var/log/app-logs
  02_change_owner:
    command: chown wsgi:wsgi /var/log/app-logs
files:
  "/opt/elasticbeanstalk/tasks/taillogs.d/flask.conf":
    mode: "000755"
    owner: root
    group: root
    content: |
      /var/log/app-logs/flask.log
  "/opt/python/log/flask.log":
    mode: "000666"
    owner: ec2-user
    group: ec2-user
    content: |
      # flask log file
```
Miscellaneous Items in Worker
The `.elasticbeanstalk` directory was created by the `awsebcli` tool, which I think is beyond the scope of this question. The `venv` directory is the Python virtual environment, created with `python3 -m venv venv`. The virtual environment can be activated by running `source venv/bin/activate`. Then packages can be installed into it as usual with pip (`pip install <package>`), and `requirements.txt` can be generated with `pip freeze > requirements.txt`.
The `.gitignore` file tells git what to ignore. Since the `awsebcli` tool uses the latest git commit, this file will probably be useful to you. The last section, regarding "Elastic Beanstalk Files", was added automatically by the `awsebcli` tool:
```
# .gitignore
__pycache__
.pytest_cache
.vscode
db.sqlite3
venv
chromedriver

# Elastic Beanstalk Files
.elasticbeanstalk/*
!.elasticbeanstalk/*.cfg.yml
!.elasticbeanstalk/*.global.yml
```
I found that I needed to name my Flask app file `application.py`, and that I needed to name my Flask application object `application`, as you can see in the code below.
`tasks.py` is a simple Python script with several scraping functions defined. Note that we're logging to `/opt/python/log/flask.log`; I create `flask.log` in an `.ebextensions` config file, described above. I listed several endpoints to illustrate that you can target different endpoints from your `cron.yaml` file:
```python
# application.py
import datetime
import logging

from flask import Flask, Response

import tasks

application = Flask(__name__)
logging.basicConfig(filename='/opt/python/log/flask.log', level=logging.INFO)


def get_timestamp():
    date_fmt = '%Y/%m/%d %H:%M:%S'
    date_now = datetime.datetime.now()
    date_str = datetime.datetime.strftime(date_now, date_fmt)
    return date_str


@application.route('/worker/scrape/', methods=['POST'])
def scrape():
    application.logger.info(get_timestamp())
    data = tasks.scrape()
    application.logger.info(data)
    return Response(status=200)


@application.route('/worker/archives/', methods=['POST'])
def scrape_archives():
    application.logger.info(get_timestamp())
    data = tasks.scrape_archives()
    application.logger.info(data)
    return Response(status=200)


@application.route('/worker/posts/', methods=['POST'])
def scrape_posts():
    application.logger.info(get_timestamp())
    data = tasks.scrape_posts()
    application.logger.info(data)
    return Response(status=200)
```
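The post doesn't show `tasks.py` itself, only that it defines `scrape`, `scrape_archives`, and `scrape_posts`. A minimal stdlib-only sketch of what it might look like — the URLs and the `<h2>`-based parsing are placeholders, not the author's actual scraping logic:

```python
# tasks.py (sketch) -- function names match the imports in application.py;
# the target URLs and parsing rules below are hypothetical.
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collect the text of every <h2> tag as a crude 'post title' extractor."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())


def extract_titles(html):
    """Pure helper: parse an HTML string and return the <h2> titles."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def scrape(url='https://example.com/'):  # placeholder URL
    with urlopen(url) as resp:
        return extract_titles(resp.read().decode('utf-8'))


def scrape_archives():
    return scrape('https://example.com/archives/')  # placeholder


def scrape_posts():
    return scrape('https://example.com/posts/')  # placeholder
```

Keeping the parsing in a pure function like `extract_titles` makes the scraper testable without network access.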
Here is my `cron.yaml` file. I am using crontab notation to run each task every three minutes, staggered so that one task starts each minute:
```yaml
# cron.yaml
version: 1
cron:
  - name: "scrape"
    url: "/worker/scrape/"
    schedule: "0-59/3 * * * *"
  - name: "scrape-archives"
    url: "/worker/archives/"
    schedule: "1-59/3 * * * *"
  - name: "scrape-posts"
    url: "/worker/posts/"
    schedule: "2-59/3 * * * *"
```
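To see why these three schedules interleave cleanly, you can expand a cron minute field of the form `start-end/step` into the set of minutes it fires. This quick sketch (it only handles that one field shape, not full crontab syntax) shows the three sets are disjoint and together cover every minute of the hour:

```python
def expand_minutes(field):
    """Expand a cron minute field like '0-59/3' into the set of firing minutes."""
    span, step = field.split('/')
    start, end = (int(x) for x in span.split('-'))
    return set(range(start, end + 1, int(step)))


scrape = expand_minutes('0-59/3')    # minutes 0, 3, 6, ..., 57
archives = expand_minutes('1-59/3')  # minutes 1, 4, 7, ..., 58
posts = expand_minutes('2-59/3')     # minutes 2, 5, 8, ..., 59
```

Each set has 20 minutes, no minute appears in two sets, and their union is all 60 minutes, so exactly one task starts per minute while each individual task still runs every three minutes.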