Computing (batch-wise) sample statistics.

pyBatchedMoments

pyBatchedMoments is a Python library for computing (batch-wise) sample statistics, such as mean, variance, standard deviation, skewness and kurtosis.

Some applications need simple statistics of a population, but the textbook formulae can lose precision and be numerically unstable. In addition, for large populations only a single pass over the values may be feasible, so an incremental (batch-wise) approach is needed.
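
To illustrate the precision issue, the following sketch compares the textbook formula Var = E[x^2] - E[x]^2 with a single-pass update in the style of Welford's algorithm. It is a plain-Python illustration under that assumption, not the library's implementation; the function names are made up for this example.

```python
def naive_variance(values):
    """Textbook formula: E[x^2] - E[x]^2 (population variance)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def welford_variance(values):
    """Single-pass update in the style of Welford's algorithm."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n


# small spread around a large offset; the exact population variance is 22.5
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
print(naive_variance(data))    # off from 22.5 because of cancellation
print(welford_variance(data))  # close to 22.5
```

Both functions read the data only once, but the naive version subtracts two numbers of magnitude 1e18, so the small true variance is lost in rounding.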

Installation

To install the current release, run

pip install batchedmoments

From Source

To install the latest development version (e.g. in editable mode), run

git clone https://github.com/sbrodehl/pyBatchedMoments.git
pip install -e pyBatchedMoments

Examples

We start with the simple use case of computing sample statistics of a few numbers.

from batchedmoments import BatchedMoments

data = [2, 8, 0, 4, 1, 9, 9, 0]
bm = BatchedMoments()
bm(data)

# use computed values
# bm.mean, bm.std, ...

The results match those of numpy (mean, std and var) and scipy.stats (skew and kurtosis).
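
As a cross-check, the reference values for the data above can be computed directly from the definitions. The sketch below uses only the standard library and assumes the default conventions of numpy (population variance, ddof=0) and scipy.stats (biased skewness, excess kurtosis); the helper `central_moment` is hypothetical and not part of any of these libraries.

```python
import math

data = [2, 8, 0, 4, 1, 9, 9, 0]
n = len(data)
mean = sum(data) / n


def central_moment(k):
    """k-th central moment of the data (divides by n, not n - 1)."""
    return sum((x - mean) ** k for x in data) / n


var = central_moment(2)                        # same definition as numpy.var (ddof=0)
std = math.sqrt(var)                           # same definition as numpy.std (ddof=0)
skew = central_moment(3) / var ** 1.5          # biased skewness
kurtosis = central_moment(4) / var ** 2 - 3.0  # excess (Fisher) kurtosis

print(mean, var, std, skew, kurtosis)
```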

Batched Computation

Where pyBatchedMoments really shines is when the data is not available all at once. In this case, the data can be batched (split into manageable parts), and the statistics can be computed batch-wise.

from batchedmoments import BatchedMoments

# a generator expression which yields batches of data
data_iter = (list(range(n, n + 10)) for n in range(0, 1000, 10))

bm = BatchedMoments()
for batch in data_iter:
    bm(batch)

# use computed values
# bm.mean, bm.std, ...
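
When the data arrives as one long stream rather than in ready-made batches, it first has to be split. A minimal sketch of such a splitter, using only the standard library; the helper name `batches` is an assumption for this example and not part of pyBatchedMoments.

```python
from itertools import islice


def batches(iterable, size):
    """Yield consecutive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# 25 values split into batches of 10, 10 and 5
sizes = [len(b) for b in batches(range(25), 10)]
print(sizes)  # [10, 10, 5]
```

Each yielded list could then be passed to a BatchedMoments object in a loop like the one above.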

Distributed / Parallel Computation

The sample statistics of individual batches can be computed independently and combined later with the addition operator. The following example shows a multiprocessing use case, but the batches can just as well be distributed across different computers (nodes).

import multiprocessing
from multiprocessing import Pool
from batchedmoments import BatchedMoments

# the guard is needed on platforms that spawn worker processes (e.g. Windows, macOS)
if __name__ == "__main__":
    # a generator expression which yields batches of data
    data = (list(range(n, n + 10)) for n in range(0, 1000, 10))
    # create object and initialize with first batch of data
    bm = BatchedMoments()(next(data))
    with Pool(processes=multiprocessing.cpu_count()) as pool:
        # each worker computes the statistics of one batch
        for dbm in pool.imap_unordered(BatchedMoments(), data):
            # combine the partial result with the running total
            bm += dbm

    # use computed values
    # bm.mean, bm.std, ...
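
Combining independently computed statistics works because the count, the mean and the sum of squared deviations of two disjoint batches can be merged exactly. The following sketch shows this pairwise update (often attributed to Chan et al.) for mean and variance in plain Python. The `Partial` tuple and the helper names are invented for this illustration and do not reflect the internals of BatchedMoments.

```python
from collections import namedtuple
from functools import reduce

# count, mean and sum of squared deviations from the mean
Partial = namedtuple("Partial", "n mean m2")


def summarize(batch):
    """Compute the partial statistics of a single batch."""
    n = len(batch)
    mean = sum(batch) / n
    m2 = sum((x - mean) ** 2 for x in batch)
    return Partial(n, mean, m2)


def merge(a, b):
    """Merge the partial statistics of two disjoint batches."""
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


parts = [list(range(n, n + 10)) for n in range(0, 1000, 10)]
total = reduce(merge, (summarize(p) for p in parts))
print(total.mean, total.m2 / total.n)  # mean and population variance of 0..999
```

Because `merge` only needs the two summaries, the batches themselves never have to be in the same process or on the same machine.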

Reduction of Axes

The axis=... keyword specifies the axis or axes along which the sample statistics are computed. The default (None) computes the sample statistics of the flattened array.

For data of shape (1000, 3, 28, 28) and axis=0, the computed statistics have shape (3, 28, 28). With axis=(0, 2, 3), the computed statistics have shape (3,).

The reduce method can reduce the shape of the computed statistics further at a later stage. For example, with data of shape (1000, 3, 28, 28) and axis=(2, 3), the computed statistics have shape (1000, 3). Calling reduce(0) then reduces them to shape (3,).
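
The shape arithmetic above can be checked with a small helper. This is a sketch that assumes the statistics follow numpy's usual convention for reductions (reduced axes are dropped); `reduced_shape` is a hypothetical function written for this example, not part of pyBatchedMoments.

```python
def reduced_shape(shape, axis=None):
    """Shape left over after reducing `shape` along `axis` (int, tuple or None)."""
    if axis is None:
        return ()  # flattened data reduces to a scalar
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    axes = {a % len(shape) for a in axes}  # allow negative axes
    return tuple(d for i, d in enumerate(shape) if i not in axes)


shape = (1000, 3, 28, 28)
print(reduced_shape(shape, axis=0))          # (3, 28, 28)
print(reduced_shape(shape, axis=(0, 2, 3)))  # (3,)
per_image = reduced_shape(shape, axis=(2, 3))
print(per_image)                             # (1000, 3)
print(reduced_shape(per_image, axis=0))      # (3,)
```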

Machine Learning Use Case

A prime use case for pyBatchedMoments is computing sample statistics of machine learning data sets. Here we use torchvision.datasets to compute the sample mean and sample standard deviation needed to normalize the data set.

from torch.utils.data import DataLoader
from torchvision import transforms, datasets
from batchedmoments import BatchedMoments

image_data = datasets.FashionMNIST(
    "/tmp/FashionMNIST",
    download=True,
    train=True,
    transform=transforms.Compose([
        transforms.ToTensor()
    ])
)
data_loader = DataLoader(
    image_data,
    batch_size=1024,
)

bm = BatchedMoments(axis=(0, 2, 3))
for imgs, _ in data_loader:
    bm(imgs.numpy())

# use computed values
# bm.mean, bm.std, ...
# mean=0.28604060219395394 std=0.35302424954262396
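
The computed mean and standard deviation are typically used to standardize the inputs, i.e. to map each value x to (x - mean) / std. A minimal plain-Python sketch using the values from the comment above; the `normalize` helper is hypothetical, and in a torchvision pipeline this role is usually played by transforms.Normalize.

```python
MEAN = 0.28604060219395394  # values taken from the example output above
STD = 0.35302424954262396


def normalize(pixels, mean=MEAN, std=STD):
    """Standardize pixel values using the given mean and standard deviation."""
    return [(p - mean) / std for p in pixels]


# ToTensor() scales 8-bit pixel values to [0, 1]; 0.0 is black, 1.0 is white
print(normalize([0.0, MEAN, 1.0]))
```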

License

pyBatchedMoments uses an MIT-style license, as found in the LICENSE file.
