ragged 0.1.0
Ragged
Introduction
Ragged is a library for manipulating ragged arrays as though they were
NumPy or CuPy arrays, following the
Array API specification.
For example, this is a
ragged/jagged array:
>>> import ragged
>>> a = ragged.array([[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]])
>>> a
ragged.array([
[[1.1, 2.2, 3.3], []],
[[4.4]],
[],
[[5.5, 6.6, 7.7, 8.8], [9.9]]
])
The values are all floating-point numbers, so a.dtype is float64,
>>> a.dtype
dtype('float64')
but a.shape has non-integer dimensions to account for the fact that some of
its list lengths are non-uniform:
>>> a.shape
(4, None, None)
In general, a ragged.array can have any mixture of regular and irregular
dimensions, though shape[0] (the length) is always an integer. This convention
follows the Array API's specification for
array.shape,
which must be a tuple of int or None:
array.shape: Tuple[Optional[int], ...]
(Our use of None to indicate a dimension without a single-valued size differs
from the Array API's intention of specifying dimensions of unknown size,
but it follows the technical specification. Array API-consuming libraries
can try using Ragged to find out if they are ragged-ready.)
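As a rough illustration of what a None entry in the shape means, the following pure-Python sketch derives a shape tuple from nested lists. The helper name ragged_shape is hypothetical and is not part of Ragged. Ragged may determine regularity differently (for example, from the array's type rather than from its values), so its answer can differ for lists that merely happen to have equal lengths.

```python
def ragged_shape(data):
    """Return a shape tuple for nested lists, using None for irregular dimensions.

    Hypothetical helper for illustration only; not Ragged's implementation.
    """
    shape = [len(data)]          # shape[0], the length, is always an integer
    level = list(data)
    # Descend one nesting level at a time while every item is itself a list.
    while level and all(isinstance(x, list) for x in level):
        lengths = {len(x) for x in level}
        # One distinct length means a regular dimension; otherwise it is ragged.
        shape.append(lengths.pop() if len(lengths) == 1 else None)
        level = [item for x in level for item in x]
    return tuple(shape)


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
print(ragged_shape(a))                  # (4, None, None)
print(ragged_shape([[1, 2], [3, 4]]))   # (2, 2)
```

In this sketch, a dimension is reported as an integer only when every list at that depth has the same length.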
All of the normal elementwise and reducing functions apply, as well as slices:
>>> ragged.sqrt(a)
ragged.array([
[[1.05, 1.48, 1.82], []],
[[2.1]],
[],
[[2.35, 2.57, 2.77, 2.97], [3.15]]
])
>>> ragged.sum(a, axis=0)
ragged.array([
[11, 8.8, 11, 8.8],
[9.9]
])
>>> ragged.sum(a, axis=-1)
ragged.array([
[6.6, 0],
[4.4],
[],
[28.6, 9.9]
])
>>> a[-1, 0, 2]
ragged.array(7.7)
>>> a[a * 10 % 2 == 0]
ragged.array([
[[2.2], []],
[[4.4]],
[],
[[6.6, 8.8], []]
])
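To make the two reductions above more concrete, here is a pure-Python sketch that reproduces their results for this particular three-level input. The function names sum_last_axis and sum_first_axis are hypothetical, and the padding strategy is an assumption chosen to match the outputs shown above; it is not a description of how Ragged computes them.

```python
from itertools import zip_longest


def sum_last_axis(nested):
    """Sum each innermost list, like axis=-1 on a three-level nested list."""
    return [[sum(inner) for inner in outer] for outer in nested]


def sum_first_axis(nested):
    """Sum across the outermost lists, like axis=0 on a three-level nested list.

    Items are aligned by position; missing positions contribute nothing.
    """
    result = []
    # Align the sub-lists of every outer entry by position, padding with [].
    for column in zip_longest(*nested, fillvalue=[]):
        # Align the values of those sub-lists by position, padding with 0.0.
        result.append([sum(vals) for vals in zip_longest(*column, fillvalue=0.0)])
    return result


a = [[[1.1, 2.2, 3.3], []], [[4.4]], [], [[5.5, 6.6, 7.7, 8.8], [9.9]]]
by_last = sum_last_axis(a)    # same nesting as ragged.sum(a, axis=-1) above
by_first = sum_first_axis(a)  # same nesting as ragged.sum(a, axis=0) above
```

Up to floating-point rounding, by_last and by_first hold the same values as the two ragged.sum results shown above.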
All of the methods, attributes, and functions in the Array API will be
implemented for Ragged, as well as conveniences that are not required by the
Array API. See
open issues marked "todo"
for Array API functions that still need to be written (out of 120 in total).
Ragged has two device values, "cpu" (backed by NumPy) and "cuda"
(backed by CuPy). Eventually, all operations will be identical for CPU and
GPU.
Implementation
Ragged is implemented using Awkward Array
(code,
docs), which is an array library for arbitrary
tree-like (JSON-like) data. Because of its generality, Awkward Array cannot
follow the Array API—in fact, its array objects can't have separate dtype
and shape attributes (the array type can't be factorized). Ragged is
therefore both a specialization of Awkward Array for numeric data in
fixed-length and variable-length lists and a formalization that adheres to
the Array API and its fully typed protocols.
See "Why does this library exist?" under the Discussions tab for more details.
Ragged is a thin wrapper around Awkward Array, restricting it to ragged
arrays and transforming its function arguments and return values to fit the
specification.
Awkward Array, in turn, is time- and memory-efficient, ready for big
datasets. Consider the following:
import gc      # control for garbage collection
import psutil  # measure process memory
import time    # measure time

import math
import ragged

this_process = psutil.Process()

def measure_memory(task):
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    out = task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    print(f"memory: {(stop_memory - start_memory) * 1e-9:.3f} GB")
    return out

def measure_time(task):
    gc.disable()
    start_time = time.perf_counter()
    out = task()
    stop_time = time.perf_counter()
    gc.enable()
    print(f"time: {stop_time - start_time:.3f} sec")
    return out

def make_big_python_object():
    out = []
    for i in range(10000000):
        out.append([j * 1.1 for j in range(i % 10)])
    return out

# The functions below read the globals pyobj and arr,
# which are assigned in the interactive session that follows.
def make_ragged_array():
    return ragged.array(pyobj)

def compute_on_python_object():
    out = []
    for row in pyobj:
        out.append([math.sqrt(x) for x in row])
    return out

def compute_on_ragged_array():
    return ragged.sqrt(arr)
The ragged.array is about 3 times smaller:
>>> pyobj = measure_memory(make_big_python_object)
memory: 2.687 GB
>>> arr = measure_memory(make_ragged_array)
memory: 0.877 GB
and a sample calculation on it (square root of each value) is about 50 times faster:
>>> result = measure_time(compute_on_python_object)
time: 4.180 sec
>>> result = measure_time(compute_on_ragged_array)
time: 0.082 sec
Awkward Array and Ragged are generally smaller and faster than their
Python equivalents for the same reasons that NumPy is smaller and faster
than Python lists. See Awkward Array
papers and presentations
for more.
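One reason a columnar layout is compact is that a list of variable-length lists can be stored as two flat buffers, one holding all values and one holding the offsets where each list begins and ends. The sketch below shows that idea for a single level of nesting using only the standard library. The names to_offsets_and_content and get_row are hypothetical, and this is a simplified illustration of the general technique rather than Awkward Array's or Ragged's actual code.

```python
from array import array


def to_offsets_and_content(rows):
    """Flatten a list of float lists into (offsets, content) buffers."""
    offsets = array("q", [0])   # 64-bit integers; row i spans offsets[i]:offsets[i+1]
    content = array("d")        # 64-bit floats, all rows concatenated
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


def get_row(offsets, content, i):
    """Recover row i as a Python list from the flat buffers."""
    return list(content[offsets[i]:offsets[i + 1]])


rows = [[1.1, 2.2, 3.3], [], [4.4], [5.5, 6.6, 7.7, 8.8], [9.9]]
offsets, content = to_offsets_and_content(rows)
print(list(offsets))                  # [0, 3, 3, 4, 8, 9]
print(get_row(offsets, content, 3))   # [5.5, 6.6, 7.7, 8.8]
```

Each value costs 8 bytes in the content buffer and each list costs one 8-byte offset, with no per-element Python object; an empty list shows up only as two equal neighboring offsets.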
Installation
Ragged is on PyPI:
pip install ragged
and will someday be on conda-forge.
ragged is a pure-Python library that only depends on awkward (which, in
turn, only depends on numpy and a compiled extension). In principle (i.e.
eventually), ragged can be loaded into Pyodide and JupyterLite.
Acknowledgements
Support for this work was provided by NSF grant
OAC-2103945 and the
gracious help of
Awkward Array contributors.