pgbinny 0.0.3

pg_binny

Discretize a whole dataframe into ≤N bins, using Top N categories.

The discretize function handles discrete & continuous columns:

Continuous columns are cut into N bins using the supplied cutting function (defaults to qcut, for quantile cuts).
Categorical columns: take the Top N-1, with the rest tossed into "Other".
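
To make the two rules concrete, here is a minimal sketch of the idea in plain pandas. It is not pg_binny's actual code: the function name discretize_sketch and the "more than nbins unique numeric values means continuous" test are assumptions made for illustration.

```python
import pandas as pd

def discretize_sketch(df: pd.DataFrame, nbins: int = 10, cut=pd.qcut) -> pd.DataFrame:
    """Hypothetical illustration of the two rules above; not pg_binny's implementation."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        s = df[col]
        # Assumption: a numeric column with more than nbins distinct values is "continuous".
        if pd.api.types.is_numeric_dtype(s) and s.nunique() > nbins:
            # Continuous: cut into at most nbins intervals.
            out[col] = cut(s, nbins, duplicates="drop")
        else:
            # Categorical: keep the Top N-1 values, toss the rest into "Other".
            top = s.value_counts().index[: nbins - 1]
            s = s.astype(object)
            out[col] = s.where(s.isin(top), "Other").astype("category")
    return out

# Toy data: 20 distinct numbers and 5 distinct colors.
demo = pd.DataFrame({
    "x": range(20),
    "color": ["red"] * 8 + ["blue"] * 6 + ["green"] * 3 + ["pink"] * 2 + ["gray"],
})
binned = discretize_sketch(demo, nbins=3)
print(binned["color"].value_counts())
```

With nbins=3, x ends up in three quantile intervals and color keeps red and blue, with the other three colors lumped into "Other".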

TODO: Describe and show the plot helpers too.
Install
conda install pg_binny
-or-
pip install pg_binny
-or (locally)-
pip install -e . (That's "pip install -e dot")
How to use
Make a sample dataframe.
import pandas as pd
import pg_binny as binny

dataset = 'car_crashes'
try:
    import seaborn as sns
    df = sns.load_dataset(dataset)
except ModuleNotFoundError:
    # Fall back to the raw CSV if seaborn is not installed
    df = pd.read_csv(f'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/{dataset}.csv')
df.sample(5)

    total  speeding  alcohol  not_distracted  no_previous  ins_premium  ins_losses  abbrev
19   15.1     5.738    4.530          13.137       12.684       661.88       96.57      ME
15   15.7     2.669    3.925          15.229       13.659       649.06      114.47      IA
35   14.1     3.948    4.794          13.959       11.562       697.73      133.52      OH
50   17.4     7.308    5.568          14.094       15.660       791.14      122.04      WY
43   19.4     7.760    7.372          17.654       16.878      1004.75      156.83      TX

Discretize with default bins
dfd = binny.discretize(df)
dfd.sample(5)

---------------------------------------------------------------------------

AttributeError Traceback (most recent call last)

<ipython-input-2-41b3e27056d4> in <module>
----> 1 dfd = binny.discretize(df)
2 dfd.sample(5)
3

AttributeError: module 'pg_binny' has no attribute 'discretize'

dfd['speeding'].dtype

CategoricalDtype(categories=[(1.7910000000000001, 2.413], (2.413, 3.496], (3.496, 3.948], (3.948, 4.095], (4.095, 4.608], (4.608, 5.032], (5.032, 6.014], (6.014, 6.923], (6.923, 7.76], (7.76, 9.45]], ordered=True)

dfd['total'].dtype

CategoricalDtype(categories=[(5.899, 11.1], (11.1, 12.3], (12.3, 13.6], (13.6, 14.5], (14.5, 15.6], (15.6, 17.4], (17.4, 18.1], (18.1, 19.4], (19.4, 21.4], (21.4, 23.9]], ordered=True)

You can set the number of bins and the cutting function (it defaults to a quantile cut, but you may prefer plain old cut, or something else).
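
To see why the choice of cutting function matters, here is a small pandas-only sketch comparing pd.qcut and pd.cut on skewed data. The toy values are made up for illustration and pg_binny itself is not called.

```python
import pandas as pd

# Skewed toy data: nine small values and one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# pd.qcut: quantile bins, so each bin holds about the same number of rows.
by_quantile = pd.qcut(values, 5)

# pd.cut: equal-width bins, so the outlier stretches the range
# and most rows pile into the first bin.
by_width = pd.cut(values, 5)

print(by_quantile.value_counts(sort=False))
print(by_width.value_counts(sort=False))
```

Here the quantile version puts two values in each bin, while the equal-width version puts nine of the ten values in the first bin. Going by the documented signature, the same choice would be passed to pg_binny as binny.discretize(df, nbins=5, cut=pd.cut).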
?binny.discretize

Signature:
binny.discretize(
    df,
    nbins=10,
    cut=<function qcut at 0x7fae29d843b0>,
    verbose=2,
    drop_useless=True,
)
Docstring:
Discretize columns in {df} to have at most {nbins} categories.
* Categorical columns: take the Top n-1 plus "Other"
* Continuous columns: cut into {nbins} using {cut}.

Returns a new discretized dataframe with the same column names.
Promotes discrete columns to categories.

Parameters
-----------
df: Dataframe to discretize
nbins: Max number of bins to use. May return fewer.
cut: Cutting method. Default `pd.qcut`. Consider pd.cut, or write your own.
verbose: 0: silent, 1: colnames, 2: (Default) top N for each column
drop_useless: Removes columns that have < 2 unique values.

Replaces numerical NA values with 'NA'.
File: /Volumes/Peregrine/binny/pg_binny/core.py
Type: function

Other functions
[x for x in dir(binny) if x[:2] not in ['__', 'pa', 'pd', 'rc']]

['autolabel',
'clean_category',
'discretize',
'drop_singletons',
'is_numeric',
'isnum']

?binny.autolabel

Signature: binny.autolabel(ax, border=False) -> None
Docstring:
Label bars in a barplot {ax} with their height.
Thanks to matplotlib, composition.ai, and jsoma/chart.py.

TODO: how to label with their legend labels?
File: /Volumes/Peregrine/binny/pg_binny/core.py
Type: function

?binny.clean_category

Signature: binny.clean_category(df, col: str) -> None
Docstring:
Remove unused categories from df.col, inplace.
If not a category, do nothing.
File: /Volumes/Peregrine/binny/pg_binny/core.py
Type: function
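
Unused categories typically show up after filtering rows: the rows go away, but the category list does not. The sketch below shows the effect and one way to clean it up with pandas' own remove_unused_categories. The helper clean_category_sketch is a hypothetical stand-in written for this example, not pg_binny's code.

```python
import pandas as pd

def clean_category_sketch(df: pd.DataFrame, col: str) -> None:
    """Hypothetical stand-in: drop unused categories from df[col] in place."""
    if isinstance(df[col].dtype, pd.CategoricalDtype):
        df[col] = df[col].cat.remove_unused_categories()
    # Not a category: do nothing.

states = pd.DataFrame({"abbrev": pd.Categorical(["ME", "IA", "OH", "TX"]),
                       "total": [15.1, 15.7, 14.1, 19.4]})
subset = states[states["abbrev"].isin(["ME", "IA"])].copy()

# Filtering kept two rows, but all four categories are still listed.
before = list(subset["abbrev"].cat.categories)

clean_category_sketch(subset, "abbrev")
after = list(subset["abbrev"].cat.categories)

# A non-categorical column is left untouched.
clean_category_sketch(subset, "total")
print(before, after)
```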

?binny.is_numeric

Signature: binny.is_numeric(col: str)
Docstring:
Returns True iff already numeric, or can be coerced.
Usage: df.apply(is_numeric)
Usage: is_numeric(df['colname'])

Returns Boolean series.

From:
https://stackoverflow.com/questions/54426845/how-to-check-if-a-pandas-dataframe-contains-only-numeric-column-wise
File: /Volumes/Peregrine/binny/pg_binny/core.py
Type: function
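
One common way to write an "already numeric, or can be coerced" check is with pd.to_numeric. The sketch below assumes that approach. is_numeric_sketch is a hypothetical name, and pg_binny's own implementation may differ (for example, in how it treats missing values).

```python
import pandas as pd

def is_numeric_sketch(col: pd.Series) -> bool:
    """Hypothetical check: True if every value in col can be coerced to a number."""
    # errors="coerce" turns anything unparseable into NaN,
    # so a column passes only if no NaN appears after coercion.
    return bool(pd.to_numeric(col, errors="coerce").notna().all())

sample = pd.DataFrame({
    "total": [15.1, 15.7, 14.1],     # already numeric
    "code": ["19", "15", "35"],      # strings that parse as numbers
    "abbrev": ["ME", "IA", "OH"],    # strings that do not
})

# Apply column-wise to get one Boolean per column.
flags = sample.apply(is_numeric_sketch)
print(flags)
```

Note that under this sketch a column containing missing values counts as non-numeric, since NaN fails the notna() test.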

History
pg_binny is an example of extracting some frequently copy/pasted routines into a general-purpose nbdev project.
It was originally going to be called binny because it bins things, but that name was already taken on PyPI (for... a project that bins things). The prefix pg is short for the project we were working on.
The routines and text are completely general.