fastchunking

Fast chunking library.

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 2
- Python :: 3
Topic
- Software Development :: Libraries :: Python Modules

Project description

What it is

fastchunking is a Python library that contains efficient and easy-to-use implementations of string chunking algorithms.

It has been developed as part of the work [LS16] at CISPA, Saarland University.

Installation

$ pip install fastchunking

Usage and Overview

fastchunking provides efficient implementations for different string chunking algorithms, e.g., static chunking (SC) and content-defined chunking (CDC).

Static Chunking (SC)

Static chunking splits a message into fixed-size chunks.

Let us consider a random example message that shall be chunked:

>>> import os
>>> message = os.urandom(1024*1024)

Static chunking is trivial when chunking a single message:

>>> import fastchunking
>>> sc = fastchunking.SC()
>>> chunker = sc.create_chunker(chunk_size=4096)
>>> chunker.next_chunk_boundaries(message)
[4096, 8192, 12288, ...]
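
The returned boundaries can be turned into actual chunks by slicing the message. The following is a minimal sketch, assuming each boundary is the end offset of a chunk relative to the start of the data passed in; split_at_boundaries is a hypothetical helper and not part of fastchunking:

```python
def split_at_boundaries(message, boundaries):
    """Slice `message` into chunks that end at each offset in `boundaries`.

    Hypothetical helper for illustration; not part of fastchunking.
    """
    chunks = []
    start = 0
    for end in boundaries:
        chunks.append(message[start:end])
        start = end
    if start < len(message):
        # Data after the last boundary forms a trailing, possibly short chunk.
        chunks.append(message[start:])
    return chunks


message = bytes(10000)  # 10000 zero bytes as a stand-in message
chunks = split_at_boundaries(message, [4096, 8192])
print([len(chunk) for chunk in chunks])  # [4096, 4096, 1808]
```

Concatenating the resulting chunks yields the original message again.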

A large message can also be chunked in fragments, though:

>>> chunker = sc.create_chunker(chunk_size=4096)
>>> chunker.next_chunk_boundaries(message[:10240])
[4096, 8192]
>>> chunker.next_chunk_boundaries(message[10240:])
[2048, 6144, 10240, ...]
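
The boundaries of the second fragment start at 2048 because the first fragment left 10240 - 8192 = 2048 bytes of an unfinished chunk behind. A pure-Python sketch of this bookkeeping is shown below; StaticChunkerSketch is an illustrative stand-in, not fastchunking's implementation, and whether the library reports a boundary that coincides with the end of a fragment is an assumption here:

```python
class StaticChunkerSketch:
    """Illustrative stateful static chunker (not fastchunking's code)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.carry = 0  # bytes seen since the last boundary

    def next_chunk_boundaries(self, fragment):
        # The first boundary lies where the chunk left open by the
        # previous fragment is completed.
        first = self.chunk_size - self.carry
        boundaries = list(range(first, len(fragment) + 1, self.chunk_size))
        # Remember how many bytes of an unfinished chunk remain.
        self.carry = (self.carry + len(fragment)) % self.chunk_size
        return boundaries


message = bytes(1024 * 1024)
chunker = StaticChunkerSketch(chunk_size=4096)
print(chunker.next_chunk_boundaries(message[:10240]))      # [4096, 8192]
print(chunker.next_chunk_boundaries(message[10240:])[:3])  # [2048, 6144, 10240]
```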

Content-Defined Chunking (CDC)

fastchunking supports content-defined chunking, i.e., splitting messages into chunks of variable length whose boundaries are determined by the message content.

Currently, a chunking strategy based on Rabin-Karp rolling hashes is supported.

Because computing a rolling hash over Python strings in pure Python is very slow in any interpreter, most of the computation is performed by a C++ extension based on the ngramhashing library by Daniel Lemire; see https://github.com/lemire/rollinghashcpp

Let us consider a random message that should be chunked:

>>> import os
>>> message = os.urandom(1024*1024)

When using content-defined chunking, we have to specify a rolling hash window size (here: 48 bytes) and, optionally, a seed value that affects the pseudo-random distribution of the generated chunk boundaries.

Apart from that, usage is similar to static chunking:

>>> import fastchunking
>>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)
>>> chunker = cdc.create_chunker(chunk_size=4096)
>>> chunker.next_chunk_boundaries(message)
[7475L, 10451L, 12253L, 13880L, 15329L, 19808L, ...]

Chunking in fragments is straightforward:

>>> chunker = cdc.create_chunker(chunk_size=4096)
>>> chunker.next_chunk_boundaries(message[:10240])
[7475L]
>>> chunker.next_chunk_boundaries(message[10240:])
[211L, 2013L, 3640L, 5089L, 9568L, ...]
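
To illustrate the idea behind content-defined chunking, the following pure-Python sketch keeps a polynomial (Rabin-Karp style) hash over a sliding window and declares a boundary whenever the hash is divisible by the target chunk size. It is a simplified stand-in under stated assumptions, not the algorithm of the C++ extension: the class name, hash base, modulus, and boundary condition are all chosen for illustration, and no seed is modeled. Because the window state is kept between calls, chunking in fragments yields the same absolute boundaries as chunking the whole message:

```python
import random
from collections import deque


class RollingHashChunkerSketch:
    """Simplified rolling-hash chunker (not fastchunking's algorithm)."""

    BASE = 257
    MOD = (1 << 61) - 1  # large prime modulus, chosen for illustration

    def __init__(self, window_size, chunk_size):
        self.window_size = window_size
        self.chunk_size = chunk_size
        self.window = deque()
        self.hash = 0
        # Weight of the oldest byte in a full window, needed to remove it.
        self.top_weight = pow(self.BASE, window_size - 1, self.MOD)

    def next_chunk_boundaries(self, fragment):
        boundaries = []
        for offset, byte in enumerate(bytearray(fragment), start=1):
            if len(self.window) == self.window_size:
                # Slide the window: drop the contribution of the oldest byte.
                oldest = self.window.popleft()
                self.hash = (self.hash - oldest * self.top_weight) % self.MOD
            self.window.append(byte)
            self.hash = (self.hash * self.BASE + byte) % self.MOD
            # Boundary after this byte if the full window's hash matches.
            window_full = len(self.window) == self.window_size
            if window_full and self.hash % self.chunk_size == 0:
                boundaries.append(offset)
        return boundaries


rng = random.Random(0)
data = bytes(bytearray(rng.getrandbits(8) for _ in range(64 * 1024)))

whole = RollingHashChunkerSketch(48, 4096).next_chunk_boundaries(data)

chunker = RollingHashChunkerSketch(48, 4096)
first = chunker.next_chunk_boundaries(data[:10240])
second = chunker.next_chunk_boundaries(data[10240:])
# Offsets of the second fragment are relative to its own start.
print(whole == first + [10240 + b for b in second])  # True
```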

Multi-Level Chunking (ML-*)

Multiple chunkers of the same type (but with different chunk sizes) can be efficiently used in parallel, e.g., to perform multi-level chunking [LS16].

Again, let us consider a random message that should be chunked:

>>> import os
>>> message = os.urandom(1024*1024)

Using multi-level chunking, e.g., ML-CDC, is easy:

>>> import fastchunking
>>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)
>>> chunk_sizes = [1024, 2048, 4096]
>>> chunker = cdc.create_multilevel_chunker(chunk_sizes)
>>> chunker.next_chunk_boundaries_with_levels(message)
[(1049L, 2L), (1511L, 1L), (1893L, 2L), (2880L, 1L), (2886L, 0L),
(3701L, 0L), (4617L, 0L), (5809L, 2L), (5843L, 0L), ...]

The second value in each tuple is the index (into chunk_sizes) of the largest chunk size whose chunker produced that boundary. Here, the first boundary was created by the chunker with index 2, i.e., the chunker with a target chunk size of 4096 bytes.
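
The boundaries of a single level can be recovered from such a list by filtering on the level value. The sketch below assumes, as the wording "highest chunk size" suggests, that a boundary reported at a given level is also a boundary for all chunkers with a smaller index; boundaries_for_level is a hypothetical helper and not part of fastchunking. The sample data is taken from the output above:

```python
def boundaries_for_level(boundaries_with_levels, level):
    """Return positions that are boundaries for the chunker at `level`.

    Assumes a boundary at level k also applies to all levels below k.
    """
    return [
        position
        for position, boundary_level in boundaries_with_levels
        if boundary_level >= level
    ]


sample = [(1049, 2), (1511, 1), (1893, 2), (2880, 1), (2886, 0),
          (3701, 0), (4617, 0), (5809, 2), (5843, 0)]

print(boundaries_for_level(sample, 2))  # [1049, 1893, 5809]
print(boundaries_for_level(sample, 1))  # [1049, 1511, 1893, 2880, 5809]
```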

Performance

The computation cost of static chunking is barely measurable: because chunk boundaries depend only on the message length and not on its content, the work is essentially limited to a single xrange call.

Content-defined chunking, however, is expensive: The algorithm has to compute hash values for rolling hash window contents at every byte position of the message that is to be chunked. To minimize costs, fastchunking works as follows:

The message (fragment) is passed in its entirety to the C++ extension.

Chunking is performed within the C++ extension.

The resulting list of chunk boundaries is communicated back to Python and converted into a Python list.

Using 100 MiB of random content, the author measured the following throughput on an Intel Core i7-4600U in a single, non-representative test run:

chunk size	throughput
64 bytes	49 MiB/s
128 bytes	57 MiB/s
256 bytes	62 MiB/s
512 bytes	63 MiB/s
1024 bytes	67 MiB/s
2048 bytes	68 MiB/s
4096 bytes	70 MiB/s
8192 bytes	71 MiB/s
16384 bytes	71 MiB/s
32768 bytes	71 MiB/s
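
A comparable measurement can be taken with a small timing harness. The following is a sketch only: measure_throughput_mib_per_s is a hypothetical helper, and the placeholder chunk function merely stands in for whichever chunker is to be measured, so the numbers it produces say nothing about fastchunking itself:

```python
import os
import time


def measure_throughput_mib_per_s(chunk_fn, data):
    """Time a single call of `chunk_fn(data)` and return MiB per second."""
    start = time.perf_counter()
    chunk_fn(data)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) / (1024.0 * 1024.0) / elapsed


def placeholder_chunk_fn(data):
    # Stand-in workload: fixed-size boundaries every 4096 bytes.
    return list(range(4096, len(data) + 1, 4096))


data = os.urandom(1024 * 1024)
rate = measure_throughput_mib_per_s(placeholder_chunk_fn, data)
print("%.1f MiB/s" % rate)
```

As with the figures above, a single run is not representative; repeating the measurement and reporting a median gives more stable numbers.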

Testing

fastchunking uses tox for testing, so simply run:

$ tox

References

[LS16] Dominik Leibenger and Christoph Sorge (2016). sec-cs: Getting the Most out of Untrusted Cloud Storage. arXiv preprint.

Release history
- 0.0.3: Feb 14, 2017
- 0.0.2: Jun 10, 2016
- 0.0.1: Jun 10, 2016

Download files

Source Distribution

fastchunking-0.0.2.zip (30.6 kB), uploaded Jun 10, 2016

Hashes for fastchunking-0.0.2.zip
Algorithm	Hash digest
SHA256	`a1719609cca099e7d5518481e2d194f19a5ae075866a4f8ce12b35a5c8860219`
MD5	`81dfabba4bedaff9f257273ca0ea2e2d`
BLAKE2b-256	`5cd39d49991625c91377a148f6cad08e61e1e2cc195cbf378e14c8cde5db6536`
