Django Spam Classifier

Contact form spam getting you down? We know the feeling. It's demeaning, draining and relentless.

This is a very basic Django app that uses the dbacl Bayesian text classification tool to filter out contact form spam. It's not perfect, but it works very well at blocking the really offensive English-language text spam. The app was written to avoid depending on external services like reCAPTCHA or Akismet. These services work well enough, but introduce some privacy concerns.

Limitations

It currently doesn't work well on non-English text, very short input, garbage input, or HTML-only input containing a single hyperlink. It's possible that dbacl has options to deal with these cases more effectively.

Additionally, dbacl does not appear to be actively maintained and is currently not available on Debian Bullseye. I may switch to bogofilter or another Bayesian filtering option in the future.

Getting started

Install django-spam-classifier
Install dbacl via your OS package manager
Add a BASE_DIR setting
Enable Django's django.contrib.sites app and configure your site domain via the Django Admin (used for training links in emails)
Add 'classifier' to your INSTALLED_APPS setting
Add path('', include('classifier.urls')), to your project's urls.py
Run python manage.py migrate
Create the classifier_data directory to hold the classifier database
In your contact form, call classifier.is_spam() on all text accepted by the form:
```
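from classifier import is_spam  # import path inferred from "classifier.is_spam()" above

# submission_fields: the list of text values accepted by your form.
spam, submission = is_spam('\n'.join(submission_fields))
if spam:
    # Throw away the form submission and don't notify anyone.
    pass
else:
    # Process the form submission as normal.
    pass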
```
This uses dbacl internally to classify the submission as spam or not spam, with a confidence of 0-100. Submissions classified with high confidence are handled as you'd expect. If the confidence is below RECORD_AND_DISCARD_CONFIDENCE, the submission is treated as not spam, because the confidence is too low to make a safe decision; the body is recorded in the Submissions model so it can be classified manually via the Django Admin. If the confidence is above RECORD_AND_DISCARD_CONFIDENCE but below SILENTLY_DISCARD_CONFIDENCE, the submission is treated as spam and discarded, but is also recorded in the Submissions model for manual classification.
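
The threshold behaviour described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the app's actual code: the Action enum and the decide function are hypothetical names, the default values are the ones quoted in the settings section of this README, and the handling of exact boundary values and of confidently not-spam submissions is assumed.

```python
from enum import Enum

# Default thresholds as quoted in this README's settings section.
RECORD_AND_DISCARD_CONFIDENCE = 60
SILENTLY_DISCARD_CONFIDENCE = 80


class Action(Enum):
    """Hypothetical outcomes, named for illustration only."""
    DELIVER = "deliver"
    DELIVER_AND_RECORD = "deliver, record for manual review"
    DISCARD_AND_RECORD = "discard, record for manual review"
    DISCARD = "discard silently"


def decide(label: str, confidence: int) -> Action:
    """Map a classification (label plus 0-100 confidence) to an action."""
    if confidence < RECORD_AND_DISCARD_CONFIDENCE:
        # Too uncertain either way: let it through, but keep it for training.
        return Action.DELIVER_AND_RECORD
    if label != "spam":
        # Assumption: confident not-spam is simply delivered.
        return Action.DELIVER
    if confidence < SILENTLY_DISCARD_CONFIDENCE:
        # Confident enough to discard, but still worth a manual look.
        return Action.DISCARD_AND_RECORD
    return Action.DISCARD


for label, confidence in [("spam", 15), ("spam", 70), ("spam", 95), ("not-spam", 95)]:
    print(label, confidence, "->", decide(label, confidence).value)
```

With the defaults, a submission labelled spam at 15% confidence is still delivered (and recorded), which matches the kind of low-confidence score shown in the notification footer example.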

Add a training link to the footer of any notification email you send:

email_body = email_body + spam_footer(submission, site)

This appends something like:

--
Spam score: spam (15% confidence)
Train as spam: https://example.com/classifier/1704/spam/
Train as not spam: https://example.com/classifier/1704/not-spam/
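
To make the shape of that footer concrete, here is a rough re-creation of the text. The function format_footer and its parameters are hypothetical and are not the app's spam_footer; the URL pattern is inferred only from the sample output above. It mainly shows why the site domain has to be configured: the training links are absolute URLs.

```python
def format_footer(submission_id: int, label: str, confidence: int, domain: str) -> str:
    """Build footer text resembling the sample output (illustrative only)."""
    # Assumption: training URLs follow the pattern seen in the sample output.
    base = f"https://{domain}/classifier/{submission_id}"
    lines = [
        "--",
        f"Spam score: {label} ({confidence}% confidence)",
        f"Train as spam: {base}/spam/",
        f"Train as not spam: {base}/not-spam/",
    ]
    return "\n".join(lines)


print(format_footer(1704, "spam", 15, "example.com"))
```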

Ensure you have a logging configuration set up so you can see log messages
Add a cron job to regularly (e.g. daily) update the training database with any new manual classifications you've made:
```
python manage.py train
```
Visit the Django Admin and classify the low-confidence submissions you receive.

Tune the Django settings as desired (optional):

CLASSIFIER = {
    'SILENTLY_DISCARD_CONFIDENCE': 90,  # Defaults to 80
    'RECORD_AND_DISCARD_CONFIDENCE': 75,  # Defaults to 60
}
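
The other settings-related steps from "Getting started" could sit alongside CLASSIFIER in a settings.py fragment like the sketch below. SITE_ID = 1 and the root-logger LOGGING block are assumptions (typical Django choices, not something this app prescribes), INSTALLED_APPS shows only the entries relevant here, and BASE_DIR is assumed to be defined already, as in Django's default project template.

```python
# settings.py (fragment): a sketch, not a complete settings module.

INSTALLED_APPS = [
    # ... your existing apps ...
    'django.contrib.sites',  # provides the site domain used in training links
    'classifier',
]

# Assumption: the Site row you edited in the Django Admin has ID 1.
SITE_ID = 1

# Assumption: send everything at INFO and above to the console via the
# root logger, so that log messages from the app are visible.
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'root': {'handlers': ['console'], 'level': 'INFO'},
}
```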

Development

Create a venv and install the development requirements:

python3 -m venv --system-site-packages [VENV_PATH]
source [VENV_PATH]/bin/activate
python -m pip install Django pytz

TODO: There is undoubtedly a better way of installing dev-dependencies. Perhaps poetry or flit? Are they the only tools that handle this? What's generally accepted?

Run tests with tox or:

PYTHONPATH=src:.:$PYTHONPATH DJANGO_SETTINGS_MODULE=tests.test_settings pytest tests

Create migrations with:

DJANGO_SETTINGS_MODULE=tests.test_settings python -m django makemigrations

Release history

0.1.0 (this version): Aug 26, 2022
0.0.7: Oct 1, 2021

Download files

Source Distribution

django-spam-classifier-0.1.0.tar.gz (12.8 kB)

Uploaded Aug 26, 2022 Source

Built Distribution

django_spam_classifier-0.1.0-py3-none-any.whl (13.0 kB)

Uploaded Aug 26, 2022 Python 3

Hashes for django-spam-classifier-0.1.0.tar.gz

Algorithm	Hash digest
SHA256	`f67195d1344d3cad52cf585be2575bab1a461b5b593b141d08a700e20d920f77`
MD5	`0604212f8a245c1f88529aec3a199fd1`
BLAKE2b-256	`c8f94ea6a3b8fe5408485845212adc48820d402c1fde0f54818a6fd411da4545`

Hashes for django_spam_classifier-0.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`8b8566f2ed52c95b65c5e1dd5b95d88c617010357bcebbe52248337854751ba8`
MD5	`f68dfcc2a271becd4448adde7a16392e`
BLAKE2b-256	`d86331525fb7be2f88d78ad4a484bdb0d03c3139c34bac80a7d09fc3e8b801c5`