Wrangl
Ray-based parallel data preprocessing for NLP and ML. See docs here.
```bash
# from a local checkout; add [dev] to also run tests and build docs
pip install -e .

# or install the latest version from GitHub
pip install git+https://github.com/vzhong/wrangl
```
See examples and test cases for usage.
How to process in parallel
Here is a trivial example where we simply repeat each input string ten times.
```python
import ray

from wrangl.data import Dataloader, Processor


# define how to load your data
class MyDataloader(Dataloader):

    def __init__(self, strings, pool: ray.util.ActorPool, cache_size: int = 1024):
        super().__init__(pool, cache_size=cache_size)
        self.current = 0
        self.strings = strings

    def reset(self):
        self.current = 0

    def load_next(self):
        # return the next raw example, or None once the data is exhausted
        if self.current < len(self.strings):
            ret = self.strings[self.current]
            self.current += 1
            return ret
        else:
            return None


# define how a worker processes each example
@ray.remote
class MyProcessor(Processor):

    def process(self, raw):
        return raw * 10


if __name__ == '__main__':
    ray.init()
    strings = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    pool = ray.util.ActorPool([MyProcessor.remote() for _ in range(3)])
    loader = MyDataloader(strings, pool, cache_size=5)
    out = []
    # you can use ordered=False here for faster speed
    # if you do not care about retrieving examples in order
    for batch in loader.batch(2, ordered=True):
        out.extend(batch)
    expect = [s * 10 for s in strings]
    assert expect == out
```
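The `ordered=True`/`ordered=False` trade-off can be sketched with the standard library alone. This illustration does not use wrangl or Ray; it uses `concurrent.futures` as a stand-in for the actor pool, where `map` plays the role of ordered retrieval and `as_completed` the role of unordered retrieval:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process(raw):
    # mirror MyProcessor.process above: repeat the input ten times
    return raw * 10

strings = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

with ThreadPoolExecutor(max_workers=3) as pool:
    # ordered: map() yields results in submission order, even if a
    # later item finishes first
    ordered = list(pool.map(process, strings))

    # unordered: as_completed() yields whichever future finishes first,
    # so the consumer never waits on a slow early item
    futures = [pool.submit(process, s) for s in strings]
    unordered = [f.result() for f in as_completed(futures)]

assert ordered == [s * 10 for s in strings]
assert sorted(unordered) == sorted(ordered)
```

The same reasoning applies to the actor pool: unordered retrieval trades reproducible output order for throughput.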
Additional utilities
Annotate data from the command line:

```bash
wannotate -h
```
Run tests
```bash
python -m unittest discover tests
```
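A test picked up by that discovery command might look like the following minimal sketch. The test case here is illustrative, not one of wrangl's actual tests; discovery only requires that the file and class follow the standard `unittest` naming conventions:

```python
import unittest


class TestRepeat(unittest.TestCase):
    """Illustrative test mirroring the repeat example above."""

    def test_repeat(self):
        strings = ['a', 'b', 'c']
        out = [s * 10 for s in strings]
        self.assertEqual(len(out), 3)
        self.assertEqual(out[0], 'aaaaaaaaaa')


# `unittest discover` would find and run this automatically;
# here we run the test case directly
suite = unittest.TestLoader().loadTestsFromTestCase(TestRepeat)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```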
Download files
Source distribution: wrangl-0.0.4.tar.gz (13.0 kB)
Built distribution: wrangl-0.0.4-py3-none-any.whl (13.3 kB)