lakefs-spec: An fsspec implementation for lakeFS
This repository contains a filesystem-spec implementation for the lakeFS project.
Its main goal is to facilitate versioned data operations in lakeFS directly from Python code, for example using pandas. See the examples below for inspiration.
Installation
To install the package directly from PyPI via pip, run
python3 -m pip install lakefs-spec
or, for the bleeding edge version,
python3 -m pip install git+https://github.com/appliedAI-Initiative/lakefs-spec.git
To add the project as a dependency using poetry, use
poetry add lakefs-spec
or, for the development version,
poetry add git+https://github.com/appliedAI-Initiative/lakefs-spec.git
Usage
As an example, we use the lakeFS file system to read a pandas DataFrame directly from a branch. To follow this small tutorial, first complete Step 1 in the lakeFS quickstart by launching an instance, then create a pre-populated repository by clicking the green button on the login page.
Then, run the following code to download the sample dataframe directly from the main branch:
import pandas as pd

# change these settings to match your instance's address and credentials
storage_options = {
    "host": "localhost:8000",
    "username": "username",
    "password": "password",
}
df = pd.read_parquet("lakefs://quickstart/main/lakes.parquet", storage_options=storage_options)
Paths and URIs
The lakeFS filesystem expects URIs that follow the lakeFS protocol.
URIs need to have the form lakefs://<repo>/<ref>/<resource>, with the repository name, the ref name (either a branch name or a commit SHA, depending on the operation), and the resource name.
The resource can be a single file name, or a directory name for recursive operations.
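To make the URI layout concrete, here is a minimal sketch that splits a URI of the form above into its three parts. The helper `split_lakefs_uri` and the `LakeFSPath` tuple are hypothetical names made up for illustration; they are not part of lakefs-spec, which does its own path handling internally.

```python
from typing import NamedTuple


class LakeFSPath(NamedTuple):
    """Hypothetical container for the three URI components."""

    repo: str
    ref: str
    resource: str


def split_lakefs_uri(uri: str) -> LakeFSPath:
    """Split lakefs://<repo>/<ref>/<resource> into repo, ref, and resource."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    # Split at most twice so that slashes inside the resource are preserved.
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected lakefs://<repo>/<ref>/<resource>, got {uri!r}")
    resource = parts[2] if len(parts) == 3 else ""
    return LakeFSPath(parts[0], parts[1], resource)


path = split_lakefs_uri("lakefs://quickstart/main/lakes.parquet")
print(path.repo, path.ref, path.resource)
```

Everything after the second slash is treated as the resource, so nested directory paths stay in one piece.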
Client-side caching
To reduce the number of IO operations, you can enable client-side caching of both uploaded and downloaded files. Caching works by calculating the MD5 checksum of the local file and comparing it to that of the lakeFS remote file. If they match, the operation is skipped, and no additional client-server communication (including uploads and downloads) happens.
Client-side caching is enabled by default in the lakeFS file system, and can be controlled through the precheck_files argument in the constructor:
from lakefs_spec import LakeFSFileSystem

# precheck_files=True (the default) enables client-side caching;
# pass precheck_files=False to disable it.
fs = LakeFSFileSystem(host="localhost:8000", precheck_files=True)
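The checksum comparison described above can be sketched with the standard library alone. This is an illustration of the idea, not the lakefs-spec implementation: the function names are hypothetical, and the remote checksum is assumed to be available as an MD5 hex digest string.

```python
import hashlib
from pathlib import Path


def md5_hexdigest(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def can_skip_transfer(local_path, remote_checksum):
    """Return True if the local file exists and matches the remote checksum."""
    if not Path(local_path).is_file():
        # Nothing to compare against, so the transfer has to happen.
        return False
    return md5_hexdigest(local_path) == remote_checksum
```

A caller would only start the upload or download when `can_skip_transfer` returns False.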
Automatic commit creation with a commit hook
Some operations, like fs.put() or fs.rm(), change the state of a lakeFS repository by changing files. According to the lakeFS working model, these changes are tracked as uncommitted changes, similarly to the git version control system.
With lakefs-spec, you can optionally commit changes caused by file system operations directly after they are made, by using a commit hook. A commit hook is a Python function taking the fsspec event name that caused the changes (e.g. put or rm), as well as the remote resource path, and returning a CommitCreation object that is then used by lakeFS to create a commit directly on the chosen branch.
An example of a commit hook:
from lakefs_client.models import CommitCreation


def my_commit_hook(event: str, rpath: str) -> CommitCreation:
    # choose the commit message based on the event that caused the change
    if event == "rm":
        message = f"❌ Remove file {rpath}"
    else:
        message = f"✅ Add file {rpath}"
    return CommitCreation(message=message)
To enable automatic commits after stateful filesystem operations, set postcommit=True in the filesystem constructor. If you would like to use your own commit hook, supply a Python callable with the aforementioned signature as the commithook argument:
from lakefs_spec import LakeFSFileSystem
# use the example commit hook from above
fs = LakeFSFileSystem(host="localhost:8000", postcommit=True, commithook=my_commit_hook)
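If a hook should treat more than two events differently, the message logic can be kept in a small lookup table. The following is a sketch only: `build_commit_message` is a hypothetical helper, and "put" and "rm" are the only event names taken from the text above, so every other event falls back to a generic message.

```python
def build_commit_message(event: str, rpath: str) -> str:
    """Build a commit message for an fsspec event (hypothetical helper)."""
    # One template per known event; extend this mapping as needed.
    templates = {
        "put": "Add file {rpath}",
        "rm": "Remove file {rpath}",
    }
    # Unknown events get a generic message that still names the event.
    template = templates.get(event, "Change file {rpath} (event: {event})")
    return template.format(event=event, rpath=rpath)


print(build_commit_message("rm", "data/lakes.parquet"))
```

Inside a commit hook, the result would be wrapped as CommitCreation(message=build_commit_message(event, rpath)), matching the signature shown earlier.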
Scoped filesystem behavior changes
To selectively enable or disable automatic commits or client-side caching, you can use a scope context manager:
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem(host="localhost:8000")

with fs.scope(precheck_files=False):
    # get a fresh version of the file by disabling caching checks
    fs.get("lakefs://my-repo/my-branch/my-file.txt", "my-file.txt")

# do something with the text file...
...

# create a commit on upload by enabling automatic commits in a scoped section
with fs.scope(postcommit=True):
    fs.put("my-file.txt", "lakefs://my-repo/my-branch/my-new-file.txt")
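For readers curious how such a scope can work, here is a minimal sketch of a context manager that temporarily overrides settings and restores them afterwards. `ToyFileSystem` is a hypothetical stand-in written for this example; it is not how lakefs-spec implements scope, and the default values are assumptions based on the descriptions above.

```python
from contextlib import contextmanager


class ToyFileSystem:
    """Hypothetical stand-in, not the lakefs-spec implementation."""

    def __init__(self, precheck_files=True, postcommit=False):
        # Assumed defaults: caching checks on, automatic commits off.
        self.precheck_files = precheck_files
        self.postcommit = postcommit

    @contextmanager
    def scope(self, **overrides):
        # Remember the current values of the attributes being overridden.
        # getattr raises AttributeError for unknown setting names.
        previous = {name: getattr(self, name) for name in overrides}
        for name, value in overrides.items():
            setattr(self, name, value)
        try:
            yield self
        finally:
            # Restore the old values even if the block raised an exception.
            for name, value in previous.items():
                setattr(self, name, value)


toy = ToyFileSystem()
with toy.scope(postcommit=True):
    print(toy.postcommit)  # True inside the scope
print(toy.postcommit)  # False again after the scope
```

Restoring in a finally block means a failed operation inside the scope cannot leave the changed setting in place.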
Developing and contributing to lakefs-spec
We welcome contributions to the project! For information on the general development workflow, head over to the contribution guide.