Wrangl
Ray-based parallel data preprocessing for NLP and ML. See docs here.
```bash
# from a local checkout; add [dev] to also run tests and build docs
pip install -e .

# or install the latest version from GitHub
pip install git+https://github.com/vzhong/wrangl
```
See examples and test cases for usage.
How to process in parallel
Here is a trivial example where we simply repeat each input string ten times.
```python
import ray

from wrangl.data import Dataloader, Processor


# define how to load your data
class MyDataloader(Dataloader):

    def __init__(self, strings, pool: ray.util.ActorPool, cache_size: int = 1024):
        super().__init__(pool, cache_size=cache_size)
        self.current = 0
        self.strings = strings

    def reset(self):
        self.current = 0

    def load_next(self):
        # return the next raw example, or None once the data is exhausted
        if self.current < len(self.strings):
            ret = self.strings[self.current]
            self.current += 1
            return ret
        else:
            return None


# define how a worker processes each example
@ray.remote
class MyProcessor(Processor):

    def process(self, raw):
        return raw * 10


if __name__ == '__main__':
    ray.init()
    strings = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    pool = ray.util.ActorPool([MyProcessor.remote() for _ in range(3)])
    loader = MyDataloader(strings, pool, cache_size=5)
    out = []
    # you can use ordered=False here for faster speed
    # if you do not care about retrieving examples in order
    for batch in loader.batch(2, ordered=True):
        out.extend(batch)
    expect = [s * 10 for s in strings]
    assert expect == out
```
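The `ordered=True`/`ordered=False` trade-off can be sketched with the standard library alone. This illustration does not use wrangl or Ray; it uses `concurrent.futures` as a stand-in for the actor pool, where `map` plays the role of ordered retrieval and `as_completed` the role of unordered retrieval:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process(raw):
    # mirror MyProcessor.process above: repeat the input ten times
    return raw * 10

strings = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

with ThreadPoolExecutor(max_workers=3) as pool:
    # ordered: map() yields results in submission order, even if a
    # later item finishes first
    ordered = list(pool.map(process, strings))

    # unordered: as_completed() yields whichever future finishes first,
    # so the consumer never waits on a slow early item
    futures = [pool.submit(process, s) for s in strings]
    unordered = [f.result() for f in as_completed(futures)]

assert ordered == [s * 10 for s in strings]
assert sorted(unordered) == sorted(ordered)
```

The same reasoning applies to the actor pool: unordered retrieval trades reproducible output order for throughput.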
Additional utilities
Annotate data from the command line:

```bash
wannotate -h
```
Run tests
```bash
python -m unittest discover tests
```
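A test picked up by that discovery command might look like the following minimal sketch. The test case here is illustrative, not one of wrangl's actual tests; discovery only requires that the file and class follow the standard `unittest` naming conventions:

```python
import unittest


class TestRepeat(unittest.TestCase):
    """Illustrative test mirroring the repeat example above."""

    def test_repeat(self):
        strings = ['a', 'b', 'c']
        out = [s * 10 for s in strings]
        self.assertEqual(len(out), 3)
        self.assertEqual(out[0], 'aaaaaaaaaa')


# `unittest discover` would find and run this automatically;
# here we run the test case directly
suite = unittest.TestLoader().loadTestsFromTestCase(TestRepeat)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```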
Download files
Source distribution: wrangl-0.0.4.tar.gz (13.0 kB)
Built distribution: wrangl-0.0.4-py3-none-any.whl (13.3 kB)