Automatically track data and artifacts

Boxs

Boxs is a Python library that manages data automatically and keeps track of the different versions of the same data created in different runs of the same script. There is no need to think about file paths and S3 keys manually: just store the data and let boxs manage the rest. Besides managing the version history of the data, boxs tracks the dependencies between different artifacts. It is meant as a tool that makes it easy to manage artifacts in data science and machine learning workflows.

What it does

Boxs provides simple functions for storing arbitrary values and loading them at a later point. Instead of passing around file paths or S3 keys that have to be defined and versioned manually, boxs uses references to data that are generated automatically when the data is first stored.

The data items of every run of the python scripts are stored next to each other without overwriting anything, which allows inspecting and comparing artifacts between different runs. For easier usage, users can name individual data items or runs so that they can be referred to by a simple name.

Additionally, boxs can automatically create meta-data for each stored data item. This meta-data, such as data type, size, number of lines, or checksum, can be accessed during execution and makes it easier to optimize data handling.
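To make the kind of meta-data mentioned above concrete, here is a minimal sketch using only the Python standard library. The `describe` helper and its dictionary keys are hypothetical and are not part of the boxs API; the sketch only illustrates how size, line count, and checksum could be derived from a serialized value.

```python
import hashlib


def describe(payload: bytes) -> dict:
    """Collect simple meta-data for a serialized value (hypothetical helper, not boxs)."""
    return {
        "size_in_bytes": len(payload),
        "number_of_lines": len(payload.splitlines()),
        "checksum_sha256": hashlib.sha256(payload).hexdigest(),
    }


# A small CSV-like payload with a header and two rows.
meta = describe(b"a,b\n1,2\n3,4\n")
print(meta["size_in_bytes"], meta["number_of_lines"])
```

Having such values available without re-reading the data is what allows a script to decide, for example, whether an item is small enough to load into memory.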

How it works

Boxs organizes the data it manages in so-called Boxes, namespaces that group related data together. Boxes don't store the data themselves, but use different Storage implementations to store the data and its meta-data. Within a storage, a data item is identified by three identifiers:

  • box_id: The id of the box that this data item belongs to.
  • data_id: An identifier for the data item that is derived from where the data was created. This identifier is the same across multiple runs.
  • run_id: The id of the run during which the data item was created. A new random run_id is generated when the Python interpreter starts.
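The interplay of the three identifiers can be sketched as follows. The path layout and the `item_path` helper are assumptions made for illustration only; they do not describe how boxs or its Storage implementations actually lay out data.

```python
import uuid
from pathlib import PurePosixPath

# Generated once when the interpreter starts, so every script run gets its own id.
RUN_ID = uuid.uuid4().hex


def item_path(box_id: str, data_id: str, run_id: str = RUN_ID) -> PurePosixPath:
    """Build a hypothetical storage location from the three identifiers."""
    return PurePosixPath(box_id) / data_id / run_id


# The same data item created in two different runs ends up in two
# sibling locations, so neither run overwrites the other.
first = item_path("my-box-id", "training-data", run_id="run-1")
second = item_path("my-box-id", "training-data", run_id="run-2")
print(first)
print(second)
```

Because the data_id stays stable across runs while the run_id changes, all versions of one item share a common prefix and can be listed and compared side by side.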

How to use it

Boxs can be easily installed from PyPI using pip:

pip install boxs

The first step is to define a box and the underlying Storage, which can be used for storing the data:

import boxs

storage = boxs.FileSystemStorage('/my/storage/directory')
box = boxs.Box('my-box-id', storage)

The API of boxs is quite simple and consists mainly of three functions:

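# Store a value and get back a reference to the stored data
data = boxs.store(value, *parents, name=None, box=box)

# Print the information boxs keeps about the stored data
print(boxs.info(data))

# Load the value again using the reference
value = boxs.load(data)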
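The role of the parents argument in dependency tracking can be illustrated with a toy model. Everything in this sketch (`Ref`, the in-memory `_values` dictionary, `ancestors`, and using the name directly as the data_id) is hypothetical and far simpler than boxs itself; it only shows how references that remember their parents form a dependency graph.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference to a stored value; not the type boxs returns."""
    data_id: str
    parents: Tuple["Ref", ...] = ()


# Toy in-memory storage, keyed by data_id.
_values: Dict[str, Any] = {}


def store(value: Any, *parents: Ref, name: str) -> Ref:
    """Keep the value and return a reference that remembers its parents."""
    ref = Ref(data_id=name, parents=parents)
    _values[ref.data_id] = value
    return ref


def load(ref: Ref) -> Any:
    """Look up the value that a reference points to."""
    return _values[ref.data_id]


def ancestors(ref: Ref) -> List[str]:
    """List the data_ids of everything this item was derived from."""
    names: List[str] = []
    for parent in ref.parents:
        names.append(parent.data_id)
        names.extend(ancestors(parent))
    return names


raw = store([3, 1, 2], name="raw")
cleaned = store(sorted(load(raw)), raw, name="cleaned")
summary = store(len(load(cleaned)), cleaned, name="summary")
print(ancestors(summary))
```

Passing the reference of the input along with the derived value is all that is needed to reconstruct later which artifacts an item depends on.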

For more information, please take a look at Getting started or the User guide.

Develop

Boxs uses tox to build and test the library. Tox runs all tests on different python versions, can generate the documentation and run linters and style checks to improve the code quality. In order to install all the necessary python modules, please run:

pip install tox

Afterwards the tests can be run by just calling

tox

from the project directory. For this to work, you need to have multiple Python interpreters installed. If you don't want to run the tests on all supported Python versions, edit the tox.ini file and set

envlist = py36,py37,py38,py39

to contain only the Python version you want to use. Another option is to run tox with the additional command line argument '--skip-missing-interpreters', which skips Python versions that aren't installed.

Documentation

The latest version of the documentation can always be found at https://docs.kant.ai/boxs/latest. The documentation is written in Markdown and is located in the docs directory of the project. It can be built into static HTML using MkDocs. To generate the documentation manually, use tox to build the HTML pages from the Markdown sources:

tox -e docs

Release

Releasing a new package version

Releasing new versions of boxs is done using flit.

pip install flit

In order to publish a new release, you need an account on PyPI or on its test environment.

Add those accounts to your ~/.pypirc:

[distutils]
index-servers =
  pypi
  pypitest

[pypi]
username: <my-user>

[pypitest]
repository: https://test.pypi.org/legacy/
username: <my-test-user>

Publishing a new release to test

flit publish --repository pypitest

Releasing a new version of the documentation

The package uses mike to manage multiple versions of the documentation. The already generated documentation is kept in the docs-deployment branch and is deployed automatically when the branch is pushed to the repository.

In order to build a new version of the documentation, we need to use the corresponding tox environment:

VERSION_TAG='<my-version>' tox -e docs-release

The VERSION_TAG environment variable should be set to the new version in the format '<major>.<minor>' (for example, 0.1). This builds the documentation and adds it as new commits to the docs-deployment branch.

Pushing the updated branch to the GitLab repository automatically deploys the documentation to the official documentation website.
