repro-zipfile

A tiny, zero-dependency replacement for Python's zipfile.ZipFile for creating reproducible/deterministic ZIP archives.

"Reproducible" or "deterministic" in this context means that the binary content of the ZIP archive is identical if you add files with identical binary content in the same order. This Python package provides a ReproducibleZipFile class that works exactly like zipfile.ZipFile from the Python standard library, except that all files written to the archive have their last-modified timestamps set to a fixed value.

Installation

repro-zipfile is available from PyPI. To install, run:

pip install repro-zipfile

Usage

Simply import ReproducibleZipFile and use it in the same way you would use zipfile.ZipFile from the Python standard library.

from repro_zipfile import ReproducibleZipFile

with ReproducibleZipFile("archive.zip", "w") as zp:
    # Use write to add a file to the archive
    zp.write("examples/data.txt", arcname="data.txt")
    # Or writestr to write data to the archive
    zp.writestr("lore.txt", data="goodbye")

Note that files must be written to the archive in the same order to reproduce an identical archive. Be aware that functions like os.listdir, glob.glob, Path.iterdir, and Path.glob return files in arbitrary order, so you should call sorted on their returned values first.
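As a minimal sketch of that advice, the helper below walks a directory, sorts the files by their archive names, and writes them in that order. The name `add_directory` is hypothetical and not part of repro-zipfile. It only calls `write`, so it should work with either `zipfile.ZipFile` or `ReproducibleZipFile`.

```python
from pathlib import Path
from zipfile import ZipFile


def add_directory(zp: ZipFile, root: str) -> list:
    """Write every file under root to zp in sorted order; return the archive names.

    Hypothetical helper for illustration; not part of repro-zipfile.
    """
    root_path = Path(root)
    # Path.rglob yields entries in an arbitrary, filesystem-dependent order,
    # so sort by the POSIX-style relative path to keep the write order stable.
    files = sorted(
        (p for p in root_path.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root_path).as_posix(),
    )
    names = []
    for path in files:
        arcname = path.relative_to(root_path).as_posix()
        zp.write(path, arcname=arcname)
        names.append(arcname)
    return names
```

Sorting on the relative POSIX path rather than on the `Path` objects themselves is a design choice here: it avoids platform-specific comparison rules, such as case-insensitive ordering on Windows.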

See examples/usage.py for an example script that you can run, and examples/demo_vs_zipfile.py for a demonstration in contrast with the standard library's zipfile module.

Set timestamp value with SOURCE_DATE_EPOCH

repro_zipfile supports setting the fixed timestamp value using the SOURCE_DATE_EPOCH environment variable. This should be an integer number of seconds since the Unix epoch, corresponding to the timestamp you want to set. SOURCE_DATE_EPOCH is a standard created by the Reproducible Builds project.
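To illustrate how an epoch value relates to the timestamp stored in an archive, here is a rough sketch of converting SOURCE_DATE_EPOCH into the six-field tuple that `zipfile.ZipInfo.date_time` uses. The function `date_time_from_env` is hypothetical, and interpreting the value as UTC is an assumption; this is not necessarily how repro-zipfile implements it.

```python
import os
import time

# Default described in this README: the earliest timestamp the ZIP format supports.
DEFAULT_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def date_time_from_env(environ=os.environ):
    """Return a (year, month, day, hour, minute, second) tuple.

    Hypothetical helper: assumes SOURCE_DATE_EPOCH is interpreted as UTC.
    """
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is None:
        return DEFAULT_DATE_TIME
    # time.gmtime converts seconds since the epoch to a UTC struct_time;
    # its first six fields match the layout of ZipInfo.date_time.
    return tuple(time.gmtime(int(value))[:6])
```

For example, 1704067200 is 2024-01-01 00:00:00 UTC, so `date_time_from_env({"SOURCE_DATE_EPOCH": "1704067200"})` returns `(2024, 1, 1, 0, 0, 0)`.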

How does repro-zipfile work?

The primary reason that ZIP archives aren't automatically reproducible is that they include the last-modified timestamps of files. This means that files with identical content but different last-modified times produce different ZIP archives. repro_zipfile.ReproducibleZipFile is a subclass of zipfile.ZipFile that overrides the write and writestr methods to set the modified timestamp of all files written to the archive to a fixed value. By default, this value is 1980-01-01 00:00 UTC, which is the earliest timestamp that is supported by the ZIP format. You can customize this value as documented in the previous section. Note that repro-zipfile does not modify the original files; it only changes the metadata written to the archive.

You can effectively reproduce what ReproducibleZipFile does with something like this:

from zipfile import ZipFile, ZipInfo

with ZipFile("archive.zip", "w") as zp:
    # Instead of write: build the ZipInfo first so the fixed timestamp is
    # set before the entry's header is written to the archive
    zinfo = ZipInfo.from_file("examples/data.txt", arcname="data.txt")
    zinfo.date_time = (1980, 1, 1, 0, 0, 0)
    with open("examples/data.txt", "rb") as fp:
        zp.writestr(zinfo, fp.read())
    # Instead of writestr with a name: pass a ZipInfo with the fixed timestamp
    zinfo = ZipInfo("lore.txt", date_time=(1980, 1, 1, 0, 0, 0))
    zp.writestr(zinfo, data="goodbye")

It's not hard to do, but we believe ReproducibleZipFile is sufficiently more convenient to justify a small package!
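For readers curious about the subclassing approach, the following is a minimal sketch of a `ZipFile` subclass that pins the timestamp in `writestr`. The class name `FixedTimeZipFile` and the helper `build_archive` are hypothetical; this is not the actual repro-zipfile source, and it leaves out `write`, SOURCE_DATE_EPOCH handling, and directory entries.

```python
import io
from zipfile import ZipFile, ZipInfo

FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


class FixedTimeZipFile(ZipFile):
    """Illustrative subclass only; not the actual repro-zipfile source."""

    def writestr(self, zinfo_or_arcname, data, *args, **kwargs):
        if isinstance(zinfo_or_arcname, ZipInfo):
            zinfo = zinfo_or_arcname
        else:
            # Build the entry metadata ourselves instead of letting
            # ZipFile.writestr stamp it with the current local time.
            zinfo = ZipInfo(zinfo_or_arcname)
            zinfo.compress_type = self.compression
        # Pin the timestamp before anything is written to the archive.
        zinfo.date_time = FIXED_DATE_TIME
        super().writestr(zinfo, data, *args, **kwargs)


def build_archive() -> bytes:
    """Create a one-file archive in memory and return its raw bytes."""
    buffer = io.BytesIO()
    with FixedTimeZipFile(buffer, "w") as zp:
        zp.writestr("lore.txt", data="goodbye")
    return buffer.getvalue()
```

Calling `build_archive()` at different times should return the same bytes, because the only time-dependent field has been fixed.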

Why care about reproducible ZIP archives?

ZIP archives are useful for bundling multiple files, especially if the files are large and can be compressed. Creating reproducible ZIP archives is often useful for:

  • Building a software package. Reproducible builds are a development best practice that makes it easier to verify distributed software packages. See the Reproducible Builds project for more explanation.
  • Working with data. Verify that your data pipeline produced the same outputs, and avoid further reprocessing of identical data.
  • Packaging machine learning model artifacts. Manage model artifact packages more effectively by knowing when they contain identical models.
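One way to check for identical outputs in the use cases above is to compare cryptographic digests of the archives. The sketch below uses only the standard library and sets timestamps explicitly through `ZipInfo`, standing in for what `ReproducibleZipFile` does; the function name `archive_digest` is hypothetical.

```python
import hashlib
import io
from zipfile import ZipFile, ZipInfo


def archive_digest(date_time):
    """Build a small in-memory archive and return its SHA-256 hex digest."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as zp:
        zp.writestr(ZipInfo("data.txt", date_time=date_time), "same content")
    return hashlib.sha256(buffer.getvalue()).hexdigest()


fixed = (1980, 1, 1, 0, 0, 0)
# Identical content and identical timestamps give identical digests...
same = archive_digest(fixed) == archive_digest(fixed)
# ...while a different last-modified time changes the bytes of the archive.
different = archive_digest(fixed) != archive_digest((2024, 1, 1, 0, 0, 0))
print(same, different)
```

In a pipeline, a matching digest could be used to skip reprocessing or re-uploading an archive that has not changed.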
