batchedmoments 1.0.2
pyBatchedMoments
pyBatchedMoments is a Python library for computing (batch-wise) sample statistics,
such as mean, variance, standard deviation, skewness and kurtosis.
In certain applications one needs to compute simple statistics of a population, but with textbook formulae
the calculation can suffer from loss of precision and become numerically unstable.
Additionally, for large populations only a single pass over the values is feasible; therefore,
an incremental (batch-wise) approach is needed.
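The loss of precision is easy to demonstrate with the textbook formula Var(x) = E[x²] − E[x]²; the following standalone numpy sketch (not using the library) shows the naive single-pass result drifting away from the true variance for values with a large common offset:

```python
import numpy as np

# values with a large common offset; the true variance is 22.5
x = np.array([4.0, 7.0, 13.0, 16.0]) + 1e9

# textbook single-pass formula: E[x^2] - E[x]^2
# the squares are so large that the fractional part is rounded away
naive = np.mean(x ** 2) - np.mean(x) ** 2

# numerically stable two-pass computation
stable = np.var(x)

print(naive, stable)  # naive deviates from 22.5; stable is exact
```

With the offset removed, both formulas agree; the cancellation only bites when the mean is large relative to the spread.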
Installation
To install the current release, run
pip install batchedmoments
From Source
To install the latest development version (e.g. in editable mode), run
git clone https://github.com/sbrodehl/pyBatchedMoments.git
pip install -e pyBatchedMoments
Examples
We start with the simple use case of sample statistics of some (random) numbers.
from batchedmoments import BatchedMoments
data = [2, 8, 0, 4, 1, 9, 9, 0]
bm = BatchedMoments()
bm(data)
# use computed values
# bm.mean, bm.std, ...
The result is equivalent to numpy (mean, std and var)
and scipy.stats (skew and kurtosis).
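As a quick cross-check (assuming numpy and scipy are installed), the reference values for the example data can be computed directly:

```python
import numpy as np
from scipy.stats import kurtosis, skew

data = [2, 8, 0, 4, 1, 9, 9, 0]
arr = np.asarray(data, dtype=float)

# population statistics (ddof=0): mean=4.125, var=13.859375
print(np.mean(arr), np.var(arr), np.std(arr))

# scipy defaults: biased skewness, Fisher (excess) kurtosis
print(skew(arr), kurtosis(arr))
```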
Batched Computation
Where pyBatchedMoments really shines is when the data is not available all at once.
In this case, the data can be batched (split into manageable parts) and the statistics computed batch-wise.
from batchedmoments import BatchedMoments
# a generator function which returns batches of data
data_iter = iter(list(range(n, n + 10)) for n in range(0, 1000, 10))
bm = BatchedMoments()
for batch in data_iter:
    bm(batch)
# use computed values
# bm.mean, bm.std, ...
Distributed / Parallel Computation
The sample statistics of single batches can be computed independently and later be combined with the add operator.
The following example shows a multiprocessing use case, but the batches can just as well be computed in a
distributed fashion across different machines (nodes).
import multiprocessing
from multiprocessing import Pool
from batchedmoments import BatchedMoments
# a generator function which returns batches of data
data = iter(list(range(n, n + 10)) for n in range(0, 1000, 10))
# create object and initialize with first batch of data
bm = BatchedMoments()(next(data))
with Pool(processes=multiprocessing.cpu_count()) as pool:
    for dbm in pool.imap_unordered(BatchedMoments(), data):
        bm += dbm
# use computed values
# bm.mean, bm.std, ...
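Why can independently computed batch statistics be merged at all? The count, mean, and sum of squared deviations (M2) of two batches can be combined exactly with a pairwise update in the style of Chan et al.; the sketch below is illustrative only and is not the library's actual internals:

```python
def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge (count, mean, M2) of two independently processed batches.

    M2 is the sum of squared deviations from the mean;
    the population variance is M2 / n.
    """
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

# merging the two halves of [2, 8, 0, 4, 1, 9, 9, 0]:
# [2, 8, 0, 4] has n=4, mean=3.5,  M2=35.0
# [1, 9, 9, 0] has n=4, mean=4.75, M2=72.75
n, mean, m2 = combine(4, 3.5, 35.0, 4, 4.75, 72.75)
print(n, mean, m2 / n)  # 8 4.125 13.859375
```

Because the update is exact and associative, the batches can be merged in any order, which is what makes the unordered multiprocessing result above deterministic.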
Reduction of Axes
The axis=... keyword allows specifying axis or axes along which the sample statistics are computed.
The default (None) is to compute the sample statistics of the flattened array.
Working with data of shape (1000, 3, 28, 28) and specifying axis=0, the computed statistics will have shape (3, 28, 28).
If axis=(0, 2, 3) the computed statistics will have shape (3,).
Using the reduce method the shape of the computed statistics can be further reduced at a later stage.
E.g. with data of shape (1000, 3, 28, 28) and axis=(2, 3) the computed statistics will have shape (1000, 3).
By using reduce(0) the computed statistics will be reduced to shape (3,).
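The two-stage reduction can be pictured with plain numpy (an illustrative sketch of the shapes only, not the library's reduce implementation):

```python
import numpy as np

data = np.random.rand(1000, 3, 28, 28)

# axis=0: statistics per (channel, row, column)
print(np.mean(data, axis=0).shape)          # (3, 28, 28)

# axis=(0, 2, 3): statistics per channel
print(np.mean(data, axis=(0, 2, 3)).shape)  # (3,)

# axis=(2, 3) first, then reduce over axis 0 at a later stage;
# for the mean this matches the direct reduction because all
# groups have equal size
per_sample = np.mean(data, axis=(2, 3))     # shape (1000, 3)
print(np.mean(per_sample, axis=0).shape)    # (3,)
```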
Machine Learning Use Case
A prime example of where pyBatchedMoments can be used is computing sample statistics of machine learning data sets.
Here we use torchvision.datasets to compute the sample mean and sample standard deviation needed for normalization of the data set.
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
from batchedmoments import BatchedMoments
image_data = datasets.FashionMNIST(
    "/tmp/FashionMNIST",
    download=True,
    train=True,
    transform=transforms.Compose([
        transforms.ToTensor()
    ])
)
data_loader = DataLoader(
    image_data,
    batch_size=1024,
)
bm = BatchedMoments(axis=(0, 2, 3))
for imgs, _ in data_loader:
    bm(imgs.numpy())
# use computed values
# bm.mean, bm.std, ...
# mean=0.28604060219395394 std=0.35302424954262396
License
pyBatchedMoments uses an MIT-style license, as found in the LICENSE file.