Finding Duplicate Images
Finds equal or similar images in a directory containing (many) image files.
Usage
$ pip install duplicate_images
$ find-dups -h
to print the help screen. Or just
$ find-dups $IMAGE_ROOT
for a test run.
Image comparison algorithms
Use the --algorithm option to select how equal images are found.
ahash, colorhash, dhash, phash, whash: five different image hashing algorithms. See
https://pypi.org/project/ImageHash for an introduction to image hashing and
https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid for some gory details on which
image hashing algorithm performs best in which situation. To start, I recommend using phash,
and only evaluating the other algorithms if phash does not perform satisfactorily in your use
case.
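To make the idea of an image hash more concrete, here is a minimal pure-Python sketch of the principle behind an average hash: every pixel contributes one bit, depending on whether it is brighter than the mean, and two images count as similar when few bits differ. This is not the code that find-dups or ImageHash runs. The names average_hash and hamming_distance and the tiny hand-written pixel grids are assumptions made for illustration; a real implementation would first scale each image down to a small grayscale thumbnail.

```python
from typing import Sequence

Pixels = Sequence[Sequence[int]]  # rows of grayscale values, 0..255


def average_hash(pixels: Pixels) -> int:
    """Return one bit per pixel, packed into an int: 1 if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Hypothetical 2x2 "images": a pattern, a brightened copy, and an inverted pattern.
original = [[10, 200], [220, 30]]
brightened = [[20, 210], [230, 40]]
different = [[200, 10], [30, 220]]

print(hamming_distance(average_hash(original), average_hash(brightened)))
print(hamming_distance(average_hash(original), average_hash(different)))
```

The brightened copy hashes to the same bits as the original, while the inverted pattern differs in every bit, which is why such hashes tolerate small changes like brightness adjustments.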
Actions for matching image pairs
Use the --on-equal option to select what to do to pairs of equal images.
delete-first: deletes the first of the two files
delete-second: deletes the second of the two files
delete-bigger or d>: deletes the file with the bigger size
delete-smaller or d<: deletes the file with the smaller size
eog: launches the eog image viewer to compare the two files
xv: launches the xv image viewer to compare the two files
print: prints the two files
quote: prints the two files with quotes around each
none: does nothing
The default action is print.
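One way to picture how such actions could be wired up is a lookup table from the option value to a small handler function. The sketch below is an assumption for illustration, not the project's actual implementation: ACTIONS, apply_action and the handler names are hypothetical, the quote style is chosen arbitrarily, and the eog and xv viewers are left out because they need an external program.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

Pair = Tuple[Path, Path]


def delete_first(pair: Pair) -> None:
    pair[0].unlink()


def delete_second(pair: Pair) -> None:
    pair[1].unlink()


def delete_bigger(pair: Pair) -> None:
    # max() returns the first item on a tie, so equal sizes remove the first file
    max(pair, key=lambda path: path.stat().st_size).unlink()


def delete_smaller(pair: Pair) -> None:
    # min() also returns the first item on a tie
    min(pair, key=lambda path: path.stat().st_size).unlink()


def print_pair(pair: Pair) -> None:
    print(pair[0], pair[1])


def quote_pair(pair: Pair) -> None:
    print(f"'{pair[0]}' '{pair[1]}'")


# Hypothetical table mapping each option value to its handler
ACTIONS: Dict[str, Callable[[Pair], None]] = {
    "delete-first": delete_first,
    "delete-second": delete_second,
    "delete-bigger": delete_bigger,
    "d>": delete_bigger,
    "delete-smaller": delete_smaller,
    "d<": delete_smaller,
    "print": print_pair,
    "quote": quote_pair,
    "none": lambda pair: None,
}


def apply_action(name: str, pair: Pair) -> None:
    """Run the handler registered under `name` on one pair of matching files."""
    ACTIONS[name](pair)
```

Calling apply_action("delete-bigger", (path_a, path_b)) would then remove whichever of the two files is larger. Since the delete actions are irreversible, a first run with print or none is the safer choice.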
Parallel execution
Use the --parallel
option to utilize all free cores on your system.
Progress and verbosity control
--progress: prints one progress bar for reading the images and another for finding duplicates among the scanned images
--debug: prints debugging output
Development notes
Needs Python 3 and the Pillow imaging library to run, and additionally Wand for the test suite.
Uses Poetry for dependency management.
Installation
From source:
$ git clone https://gitlab.com/lilacashes/DuplicateImages.git
$ cd DuplicateImages
$ pip3 install poetry
$ poetry install
Running
$ poetry run find-dups $PICTURE_DIR
or
$ poetry run find-dups -h
for a list of all possible options.
Test suite
Running it all:
$ poetry run pytest
$ poetry run mypy duplicate_images tests
$ poetry run flake8
$ poetry run pylint duplicate_images tests
or simply
$ .git_hooks/pre-push
Setting the test suite to be run before every push:
$ cd .git/hooks
$ ln -s ../../.git_hooks/pre-push .
Publishing
$ poetry config repositories.testpypi https://test.pypi.org/legacy/
$ poetry build
$ poetry publish --username $PYPI_USER --password $PYPI_PASSWORD --repository testpypi && \
poetry publish --username $PYPI_USER --password $PYPI_PASSWORD
(obviously assuming that username and password are the same on PyPI and TestPyPI)
Profiling
CPU time
To show the top functions by time spent in the function alone:
$ poetry run python -m cProfile -s tottime ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --action-equal none $IMAGE_DIR 2>&1 | head -n 15
or, to show the top functions by time spent, including called functions:
$ poetry run python -m cProfile -s cumtime ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --action-equal none $IMAGE_DIR 2>&1 | head -n 15
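In cProfile, tottime counts only the time spent inside a function itself, while cumtime also includes the time spent in everything it calls. The following self-contained sketch shows the difference using the standard library's cProfile and pstats modules from within Python. The functions inner, outer and profile_report are hypothetical stand-ins and have nothing to do with the project's own code.

```python
import cProfile
import io
import pstats


def inner(n: int) -> int:
    """Does the actual work, so its own time (tottime) is high."""
    return sum(i * i for i in range(n))


def outer(n: int) -> int:
    """Does almost nothing itself but calls inner(), so only its cumtime is high."""
    return inner(n) + inner(n)


def profile_report(sort_key: str, limit: int = 15) -> str:
    """Profile outer() and return the top `limit` entries sorted by `sort_key`."""
    profiler = cProfile.Profile()
    profiler.enable()
    outer(200_000)
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats(sort_key).print_stats(limit)
    return stream.getvalue()


print(profile_report("tottime"))  # sorted by time spent in each function alone
print(profile_report("cumtime"))  # sorted by time including called functions
```

In the cumtime report outer should rank near the top, whereas in the tottime report it should drop below the code doing the real work.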
Memory usage
$ poetry run fil-profile run ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --action-equal none $IMAGE_DIR 2>&1
This will open a browser window showing the functions using the most memory (see https://pypi.org/project/filprofiler for more details).