An fsspec implementation for lakeFS.

lakefs-spec: An `fsspec` implementation for lakeFS

This repository contains a filesystem-spec implementation for the lakeFS project. Its main goal is to facilitate versioned data operations in lakeFS directly from Python code, for example using pandas. See the examples below for inspiration.

Installation

To install the package directly from PyPI via pip, run

pip install --upgrade pip
pip install lakefs-spec

or, for the bleeding edge version,

pip install git+https://github.com/appliedAI-Initiative/lakefs-spec.git

To add the project as a dependency using poetry, use

poetry add lakefs-spec

or, for the development version,

poetry add git+https://github.com/appliedAI-Initiative/lakefs-spec.git

Usage

As an example, we use the lakeFS file system to read a pandas DataFrame directly from a branch. To follow this short tutorial, first complete Step 1 of the lakeFS quickstart by launching an instance, then create a pre-populated repository by clicking the green button on the login page.

Then, run the following code to download the sample dataframe directly from the main branch:

import pandas as pd

# change these settings to match your instance's address and credentials
storage_options = {
    "host": "localhost:8000",
    "username": "username",
    "password": "password",
}

df = pd.read_parquet('lakefs://quickstart/main/lakes.parquet', storage_options=storage_options)

Paths and URIs

The lakeFS filesystem expects URIs that follow the lakeFS protocol. URIs need to have the form lakefs://<repo>/<ref>/<resource>, with the repository name, the ref name (either a branch name or a commit SHA, depending on the operation), and resource name. The resource can be a single file name, or a directory name for recursive operations.
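
To make the URI layout concrete, here is a minimal sketch of how such a URI could be split into its three components. The names split_lakefs_uri and LakeFSPath are hypothetical helpers written for this illustration; they are not part of lakefs-spec, and the library's own parsing may differ.

```python
from typing import NamedTuple


class LakeFSPath(NamedTuple):
    """Hypothetical container for the three URI components."""

    repo: str
    ref: str
    resource: str


def split_lakefs_uri(uri: str) -> LakeFSPath:
    """Split lakefs://<repo>/<ref>/<resource> into its parts (illustrative only)."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a lakefs:// URI, got {uri!r}")
    # Split at most twice, so slashes inside the resource are preserved.
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"URI must contain a repository and a ref: {uri!r}")
    resource = parts[2] if len(parts) == 3 else ""
    return LakeFSPath(parts[0], parts[1], resource)


print(split_lakefs_uri("lakefs://quickstart/main/lakes.parquet"))
```

Because the string is split at most twice, a nested resource such as data/a.csv stays intact in the resource field.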

Client-side caching

To reduce the number of IO operations, you can enable client-side caching of both uploaded and downloaded files. Caching works by calculating the MD5 checksum of the local file and comparing it to that of the remote file in lakeFS. If they match, the operation is skipped, and no further client-server communication (including uploads and downloads) takes place.

Client-side caching is enabled by default in the lakeFS file system, and can be controlled through the precheck_files argument in the constructor:

from lakefs_spec import LakeFSFileSystem

# precheck_files=True is the default; set precheck_files=False to disable client-side caching.
fs = LakeFSFileSystem(host="localhost:8000", precheck_files=True)
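
The checksum comparison described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: needs_transfer is a hypothetical helper, and remote_checksum is assumed to be an MD5 hex digest that the server reports for the remote file.

```python
import hashlib
from pathlib import Path


def md5_checksum(path, blocksize=2**20):
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_transfer(local_path, remote_checksum):
    """Hypothetical helper: True if the file has to be up- or downloaded."""
    local = Path(local_path)
    # Without a local file or a remote checksum there is nothing to compare.
    if not local.exists() or remote_checksum is None:
        return True
    # Matching checksums mean the transfer can be skipped.
    return md5_checksum(local) != remote_checksum
```

Reading the file in chunks keeps memory usage flat for large files, which matters when the check runs before every transfer.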

Automatic commit creation with a commit hook

Some operations, like fs.put() or fs.rm(), change the state of a lakeFS repository by changing files. According to the lakeFS working model, these changes are tracked as uncommitted changes, similarly to the git version control system.

With lakefs-spec, you can optionally commit changes caused by file system operations directly after they are made, by using a commit hook. A commit hook is a Python function that takes two arguments: the fsspec event that caused the changes (e.g. put or rm), and a context object containing useful information such as the repository, branch name, changed resource, and the lakeFS diff. It returns a CommitCreation object, which lakeFS then uses to create a commit on the chosen branch.

An example of a commit hook:

from lakefs_client.models import CommitCreation
from lakefs_spec.commithook import FSEvent, HookContext

def my_commit_hook(event: FSEvent, ctx: HookContext) -> CommitCreation:
    if event == FSEvent.RM:
        message = f"❌ Remove file {ctx.resource}"
    else:
        message = f"✅ Add file {ctx.resource}"
    
    return CommitCreation(message=message)

To enable automatic commits after stateful filesystem operations, set postcommit=True in the filesystem constructor. To use your own commit hook, supply a Python callable with the signature described above as the commithook argument:

from lakefs_spec import LakeFSFileSystem

# use the example commit hook from above
fs = LakeFSFileSystem(host="localhost:8000", postcommit=True, commithook=my_commit_hook)

Scoped filesystem behavior changes

To selectively enable or disable automatic commits or client-side caching, you can use a scope context manager:

from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem(host="localhost:8000")

with fs.scope(precheck_files=False):
    # get a fresh version of the file by disabling caching checks
    fs.get("lakefs://my-repo/my-branch/my-file.txt", "my-file.txt")

# do something with the text file...
...

# create a commit on upload by enabling automatic commits in a scoped section
with fs.scope(postcommit=True):
    fs.put("my-file.txt", "lakefs://my-repo/my-branch/my-new-file.txt")
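
For readers curious how a scoped override like this can work in general, the following self-contained sketch shows the underlying pattern: remember the current values, apply the overrides, and restore them on exit. ScopedSettings is a stand-in class written for this illustration and is not the LakeFSFileSystem implementation.

```python
from contextlib import contextmanager


class ScopedSettings:
    """Stand-in for a filesystem object with two behavior flags."""

    def __init__(self, precheck_files=True, postcommit=False):
        self.precheck_files = precheck_files
        self.postcommit = postcommit

    @contextmanager
    def scope(self, **overrides):
        # Remember the current values so they can be restored afterwards.
        previous = {name: getattr(self, name) for name in overrides}
        try:
            for name, value in overrides.items():
                setattr(self, name, value)
            yield self
        finally:
            # Restore the old values even if the body raised an exception.
            for name, value in previous.items():
                setattr(self, name, value)


settings = ScopedSettings()
with settings.scope(postcommit=True):
    inside = settings.postcommit  # overridden within the block
after = settings.postcommit  # restored once the block exits
print(inside, after)
```

Restoring in a finally clause means a failed operation inside the block cannot leave the object in its temporary configuration.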

Implicit initialization and instance caching

Aside from explicit initialization, you can also use environment variables and a configuration file (by default ~/.lakectl.yaml) to initialize a lakeFS file system. The environment variables for the lakeFS client arguments are the names of the constructor arguments prefixed with LAKEFS_:

import os
from lakefs_spec import LakeFSFileSystem

os.environ["LAKEFS_HOST"] = "localhost:8000"
os.environ["LAKEFS_USERNAME"] = "username"
os.environ["LAKEFS_PASSWORD"] = "password"

fs = LakeFSFileSystem()
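
The naming rule for the environment variables can be illustrated with a small sketch. settings_from_env is a hypothetical helper written for this example; it only demonstrates the LAKEFS_ prefix convention and does not reflect how lakefs-spec reads its settings internally.

```python
import os


def settings_from_env(names, environ=None):
    """Collect constructor arguments from LAKEFS_-prefixed variables (illustrative)."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name in names:
        # "host" maps to "LAKEFS_HOST", "username" to "LAKEFS_USERNAME", and so on.
        value = environ.get(f"LAKEFS_{name.upper()}")
        if value is not None:
            settings[name] = value
    return settings


# A plain dict stands in for os.environ to keep the example side-effect free.
env = {"LAKEFS_HOST": "localhost:8000", "LAKEFS_USERNAME": "username", "PATH": "/usr/bin"}
print(settings_from_env(["host", "username", "password"], env))
```

Variables without the prefix are ignored, and arguments with no matching variable are simply left out of the result.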

To initialize the lakeFS file system from a lakectl YAML configuration file, specify the configfile argument:

from lakefs_spec import LakeFSFileSystem

# No argument means the default config (~/.lakectl.yaml) will be used.
fs = LakeFSFileSystem(configfile="path/to/my/lakectl.yaml")

⚠️ To be able to read settings from a YAML configuration file, pyyaml has to be installed. You can do this by installing lakefs-spec together with the yaml extra:

pip install --upgrade "lakefs-spec[yaml]"

A note on mixing environment variables and `lakectl` configuration files

lakeFS file system instances are cached, and existing lakeFS instances are reused from an instance cache when requested.

For implicit initialization from environment variables and configuration files as described above, this means that whichever initialization method is used first populates the cache. When the other method is used afterwards, a cache hit occurs and no new instance is created. This can lead to surprising misconfigurations:

import os
from lakefs_spec import LakeFSFileSystem

# set envvars
os.environ["LAKEFS_HOST"] = "localhost:8000"
os.environ["LAKEFS_USERNAME"] = "username"
os.environ["LAKEFS_PASSWORD"] = "password"

# creates a cache entry for the bare instance
fs = LakeFSFileSystem()

# ~/.lakectl.yaml
#  server:
#    endpoint_url: http://example-host

# this time, try to read in the default lakectl config, with http://example-host set as host.
fs = LakeFSFileSystem()
print(fs.client._api.configuration.host) # <- prints localhost:8000!

The best way to avoid this is to use either environment variables or lakectl configuration files exclusively. If you do have to mix both methods, you can clear the instance cache like so:

from lakefs_spec import LakeFSFileSystem

LakeFSFileSystem._cache.clear()
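
The effect can be reproduced without a lakeFS server using a simplified stand-in. The sketch below assumes a cache keyed on constructor arguments, which is the general idea behind instance caching; CachedMeta, DemoFileSystem, and AMBIENT are hypothetical names for this illustration and are not the actual fsspec or lakefs-spec classes.

```python
# Stand-in for implicit configuration sources (environment variables, config file).
AMBIENT = {"host": "localhost:8000"}


class CachedMeta(type):
    """Metaclass that reuses instances created with identical arguments."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}

    def __call__(cls, *args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cls._cache:
            cls._cache[key] = super().__call__(*args, **kwargs)
        return cls._cache[key]


class DemoFileSystem(metaclass=CachedMeta):
    def __init__(self, host=None):
        # Implicit initialization: fall back to the ambient configuration.
        self.host = host if host is not None else AMBIENT["host"]


first = DemoFileSystem()
AMBIENT["host"] = "http://example-host"  # the configuration source changes...
second = DemoFileSystem()  # ...but the same arguments hit the cache
print(second is first, second.host)

DemoFileSystem._cache.clear()
third = DemoFileSystem()  # a fresh instance picks up the new configuration
print(third.host)
```

Both bare constructor calls produce the same cache key, so the second call never looks at the changed configuration until the cache is cleared.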

Developing and contributing to `lakefs-spec`

We welcome contributions to the project! For information on the general development workflow, head over to the contribution guide.


