pydantictopyarrow 0.1.3
pydantic-to-pyarrow
pydantic-to-pyarrow is a Python library that converts
pydantic models to pyarrow schemas.
pydantic is a Python library
for data validation based on type hints / annotations. It supports
both simple and complex validation rules.
pyarrow is a Python library
for working with Apache Arrow, a development platform for in-memory analytics. The library
also makes it easy to write parquet files.
Why might you want to convert models to schemas? One scenario is for a data
processing pipeline:
1. Import / extract the data from its source
2. Validate the data using pydantic
3. Process the data in pyarrow / pandas / polars
4. Store the raw and / or processed data in parquet.
The easiest approach for steps 3 and 4 above is to let pyarrow infer
the schema from the data. The most involved approach is to
specify the pyarrow schema separately from the pydantic model. In between, many
applications could benefit from converting the pydantic model to a
pyarrow schema. This library aims to do that.
Installation
pip install pydantic-to-pyarrow
Conversion Table
The conversions below can still overflow the pyarrow types.
For example, the Python 3 int type is unbounded, whereas
the pa.int64() type has a fixed
maximum. In most cases this should not be an issue, but if you are
concerned about overflows, you should not use this library and
should instead specify the full schema manually.
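To make the overflow boundary concrete, the sketch below checks whether a Python int fits the 64-bit ranges before the data reaches pyarrow. The helper names fits_int64 and fits_uint64 are hypothetical and are not part of this library; only the standard library is used.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def fits_uint64(value: int) -> bool:
    """Return True if value is representable as an unsigned 64-bit integer."""
    return 0 <= value <= UINT64_MAX


print(fits_int64(2**63 - 1))  # True: largest signed 64-bit value
print(fits_int64(2**63))      # False: one past the signed maximum
print(fits_uint64(2**63))     # True: still within the unsigned range
print(fits_uint64(2**64))     # False: one past the unsigned maximum
```

A check like this could run as a pydantic validator if out-of-range values are a realistic concern for a given field.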
| Python / Pydantic | Pyarrow | Overflow |
| --- | --- | --- |
| str | pa.string() | |
| Literal[strings] | pa.dictionary(pa.int32(), pa.string()) | |
| int | pa.int64() if no minimum constraint, pa.uint64() if minimum is zero | Yes, at 2^63 (for signed) or 2^64 (for unsigned) |
| Literal[ints] | pa.int64() | |
| float | pa.float64() | Yes |
| decimal.Decimal | pa.decimal128 ONLY if supplying max_digits and decimal_places for pydantic field | Yes |
| datetime.date | pa.date32() | |
| datetime.time | pa.time64("us") | |
| datetime.datetime | pa.timestamp("ms", tz=None) ONLY if param allow_losing_tz=True | |
| pydantic.types.NaiveDatetime | pa.timestamp("ms", tz=None) | |
| pydantic.types.AwareDatetime | pa.timestamp("ms", tz=None) ONLY if param allow_losing_tz=True | |
| Optional[...] | The pyarrow field is nullable | |
| Pydantic Model | pa.struct() | |
| List[...] | pa.list_(...) | |
| Dict[..., ...] | pa.map_(pa key_type, pa value_type) | |
| Enum of str | pa.dictionary(pa.int32(), pa.string()) | |
| Enum of int | pa.int64() | |
If a field is marked as excluded (Field(exclude=True)), it is left out
of the pyarrow schema when exclude_fields is set to True.
An Example
from typing import Dict, List, Optional

from pydantic import BaseModel, Field
from pydantic_to_pyarrow import get_pyarrow_schema


class NestedModel(BaseModel):
    str_field: str


class MyModel(BaseModel):
    int_field: int
    opt_str_field: Optional[str]
    py310_opt_str_field: str | None  # this syntax requires Python 3.10+
    nested: List[NestedModel]
    dict_field: Dict[str, int]
    excluded_field: str = Field(exclude=True)


# exclude_fields=True leaves excluded_field out of the schema, as described above
pa_schema = get_pyarrow_schema(MyModel, exclude_fields=True)
print(pa_schema)
#> int_field: int64 not null
#> opt_str_field: string
#> py310_opt_str_field: string
#> nested: list<item: struct<str_field: string not null>> not null
#>   child 0, item: struct<str_field: string not null>
#>       child 0, str_field: string not null
#> dict_field: map<string, int64> not null
#>   child 0, entries: struct<key: string not null, value: int64> not null
#>       child 0, key: string not null
#>       child 1, value: int64
Development
Prerequisites:
Any Python 3.8 through 3.11
poetry for dependency management
git
make