bigarray

Fast and scalable numpy array using memory-mapped I/O

Stable build:

pip install bigarray

Nightly build from GitHub:

pip install git+https://github.com/trungnt13/bigarray@master


The three principles

  • Transparency: everything is a numpy.array; metadata and extra features (e.g. multiprocessing support, indexing) are handled quietly in the background (see the sketch after this list).
  • Pragmatism: fast but easy, with a simplified API for the common use cases
  • Focus: "Do one thing and do it well"
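
For example, an array opened from disk can be handed straight to numpy routines with no conversion step. A minimal sketch of this, assuming an array was already written to /tmp/array as in the example further below:

import numpy as np

from bigarray import PointerArray

# open an existing array; indexing and numpy functions
# work on it as on a plain ndarray
x = PointerArray('/tmp/array')
print(x.shape, x.dtype)
print(np.sum(x), np.mean(x))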

The benchmarks

On the open-plus-iterate benchmark below, memory-mapped arrays are about 535 times faster than HDF5 data (using h5py) and 223 times faster than a plain numpy.array (see the detailed benchmark code).

Array size: 1220.70 (MB)

Create HDF5   in: 0.0005580571014434099 s
Create Memmap in: 0.000615391880273819 s

Numpy save in: 0.5713834380730987 s
Writing data to HDF5  : 0.5530977640300989 s
Writing data to Memmap: 0.7038380969315767 s

Numpy saved size: 1220.70 (MB)
HDF5 saved size: 1220.71 (MB)
Mmap saved size: 1220.70 (MB)

Load Numpy array: 0.3723734531085938 s
Load HDF5 data  : 0.00041177100501954556 s
Load Memmap data: 0.00017150305211544037 s

Test correctness of stored data
Numpy : True
HDF5  : True
Memmap: True

Iterate Numpy data   : 0.00020254682749509811 s
Iterate HDF5 data    : 0.8945782391820103 s
Iterate Memmap data  : 0.0014937107916921377 s
Iterate Memmap (2nd) : 0.0011746759992092848 s

Numpy total time (open+iter): 0.3725759999360889 s
H5py  total time (open+iter): 0.8949900101870298 s
Mmap  total time (open+iter): 0.001665213843807578 s
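
The gap is explained by how each backend opens data: np.load reads the whole file into memory before returning, h5py and np.memmap merely open a handle, and a memmap then serves pages lazily as they are touched. A minimal sketch of such an open-plus-iterate comparison (the file names and the exact array shape are illustrative, not taken from the benchmark script above):

import timeit

import h5py
import numpy as np

shape = (16000, 10000)  # ~1220 MB of float64
data = np.random.rand(*shape)

# write the same data once in all three formats
np.save('/tmp/bench.npy', data)
with h5py.File('/tmp/bench.h5', 'w') as f:
  f.create_dataset('x', data=data)
mm = np.memmap('/tmp/bench.mmap', dtype='float64', mode='w+', shape=shape)
mm[:] = data
mm.flush()
del mm, data

def iterate(x):
  # touch every 100th row so each backend must fetch real data
  return sum(float(x[i, 0]) for i in range(0, shape[0], 100))

def bench_numpy():
  return iterate(np.load('/tmp/bench.npy'))  # loads the entire file first

def bench_h5py():
  with h5py.File('/tmp/bench.h5', 'r') as f:
    return iterate(f['x'])  # one read per access

def bench_mmap():
  x = np.memmap('/tmp/bench.mmap', dtype='float64', mode='r', shape=shape)
  return iterate(x)  # only the touched pages are read

for fn in (bench_numpy, bench_h5py, bench_mmap):
  print(fn.__name__, ':', timeit.timeit(fn, number=1), 's')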

Example

from multiprocessing import Pool

import numpy as np

from bigarray import PointerArray, PointerArrayWriter

n = 80 * 10  # total number of samples: 80 names, 10 values each
jobs = [(i, i + 10) for i in range(0, n // 10, 10)]  # 8 jobs of 10 names each
path = '/tmp/array'

# ====== Multiprocessing writing ====== #
writer = PointerArrayWriter(path, shape=(n,), dtype='int32', remove_exist=True)


def fn_write(job):
  start, end = job
  # it is crucial that each process writes to a different region of the array
  writer.write(
      {"name%i" % i: np.arange(i * 10, i * 10 + 10) for i in range(start, end)},
      start_position=start * 10)


# using 2 processes to generate and write data
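# note: the workers inherit `writer` via fork; on spawn-based platforms
# (e.g. macOS, Windows) the writer may need to be re-created in each worker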
with Pool(2) as p:
  p.map(fn_write, jobs)
writer.flush()
writer.close()

# ====== Multiprocessing reading ====== #
x = PointerArray(path)
print(x['name0'])
print(x['name66'])
print(x['name78'])

# normal indexing
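# (x.indices maps each entry name to its (start, end) positions in the flat array)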
for name, (s, e) in x.indices.items():
  data = x[s:e]
# fast indexing
for name in x.indices:
  data = x[name]


# multiprocess indexing
def fn_read(job):
  start, end = job
  total = 0
  for i in range(start, end):
    total += np.sum(x['name%d' % i])
  return total

# use multiprocessing to calculate the sum of all arrays
with Pool(2) as p:
  total_sum = sum(p.map(fn_read, jobs))
print(np.sum(x), total_sum)

Output:

[0 1 2 3 4 5 6 7 8 9]
[660 661 662 663 664 665 666 667 668 669]
[780 781 782 783 784 785 786 787 788 789]
319600 319600
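
The final line confirms correctness: the direct np.sum over the whole array and the per-name partial sums gathered from two processes both equal 0 + 1 + ... + 799 = 319600.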
