proxiflow 0.1.9

Creator: bradpython12

Last updated:

Add to Cart

Description:

proxiflow 0.1.9

ProxiFlow
ProxiFlow is a data preprocessig tool for machine learning that performs
data cleaning, normalization, and feature engineering.
The biggest advantage if this library (which is basically a wrapper over polars data frame) is that it is configurable via YAML configuration file which makes it suitable for MLOps pipelines or for building API requests over it.
Documentation
Read the full documentation here.
Usage
To use ProxiFlow, install it via pip:
pip install proxiflow

You can then call it from the command line:
proxiflow --config-file myconfig.yaml --input-file mydata.csv --output-file cleaned_data.csv

Here's an example of a YAML configuration file:
input_format: csv
output_format: csv

data_cleaning: #mandatory
# NOTE: Not handling missing values can cause errors during data normalization
handle_missing_values:
drop: false
mean: true # Only Int and Float columns are handled
# mode: true # Turned off for now.
knn: true

handle_outliers: true # Only Float columns are handled
remove_duplicates: true

data_normalization: # mandatory
min_max: #mandatory but values are not mandatory. It can be left empty
# Specify columns:
- Age # not mandatory
z_score:
- Price
log:
- Floors

feature_engineering:
one_hot_encoding: # mandatory
- Bedrooms # not mandatory

feature_scaling: # mandatory
degree: 2 # not mandatory. It specifies the polynominal degree
columns: # not mandatory
- Floors # not mandatory

The above configuration specifies that duplicate rows should be removed
and missing values should be dropped.
API
ProxiFlow can also be used as a Python library. Here's an example:
import polars as pl
from proxiflow.config import Config
from proxiflow.core import Cleaner

# Load the data
df = pl.read_csv("mydata.csv")

# Load the configuration
config = Config("myconfig.yaml")

# Clean the data
cleaner = Cleaner(config)
cleaned_data = cleaner.clean_data(data)

# Perform data normalization
normalizer = Normalizer(config)
normalized_data = normalizer.normalize(cleaned_data)

# Perform feature engineering
engineer = Engineer(config)
engineered_data = engineer.execute(normalized_data)

# Write the output data
engineered_data.write_csv("cleaned_data.csv")

Log

[x] Data cleaning

[x] Missing values handling

[x] Mean
[x] Drop
[x] KNN Imputer
[x] Median




[x] Data normalization

[x] Min Max normalization
[x] Z-Score normalization
[x] Logarithmic normalization


[ ] Feature engineering

[x] One Hot Encoding
[x] Feature Scaling
[ ] Recursive Feature Elimination
[ ] SelectKBest
[ ] LASSO regularization


[ ] Text Preprocessing

[ ] Tokenization
[ ] Stemming
[ ] Stopword removal
[ ] Text Vectorization

[ ] Bag of Words
[ ] TF-IDF


[ ] Word embeddings

[ ] Word2Vec
[ ] GloVe
[ ] BERT




[ ] Categorical Encoding

[ ] Target encoding
[ ] Count encoding
[ ] Binary encoding


[ ] Dimensionality reduction

[ ] Principal Component Analysis (PCA)
[ ] t-Distributed Stochastic Neighbor Embedding (t-SNE)
[ ] Uniform Manifold Approximation and Projection (UMAP)

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.