GitLocker: The Coding Marketplace

Description:

PandaPlyr 0.1.8

PandaPlyr

PandaPlyr is a Python package designed to provide a familiar and efficient data manipulation experience similar to the popular dplyr package in R. It aims to simplify and streamline the process of working with tabular data by providing a concise and intuitive syntax. The purpose of pandaplyr is to make chained operations on pandas DataFrames easier and more readable.
Table of Contents

Installing
Features

group_by and summarise / summarize
mutate
where
select
rename
arrange and order_by
left_join, right_join, full_join
union and union_all
distinct
fill_na
drop_na
sample_n and sample_frac
head and tail

User-defined functions
Future features
Contact

Installing
Installers for the latest version are available at the Python
Package Index (PyPI).
pip install PandaPlyr

Use case and example
I have two dataframes - grade_df (left) and subject_df (right).

StudentID
Subject
Grade

1
CompSci
80

1
English
85

1
History
75

1
LinearAlg
75

...
...
...

Subject
SubjectType

CompSci
STEM

LinearAlg
STEM

Philosophy
Humanities

English
Humanities

History
Humanities

Merge the dataframes and find the student with the highest average grade for Humanities classes only,
but exclude any students ('StudentID') who are not enrolled in at least 2 humanities courses.
import pandas as pd
import PandaPlyr as pp
grade_df = pp.utils.read_grades_dataset()
subject_df = pp.utils.read_subject_dataset()

In pandas
merged_df = pd.merge(grade_df, subject_df, on='Subject')

humanities_df = merged_df[merged_df['SubjectType'] == 'Humanities']

course_counts = (
humanities_df
.groupby('StudentID', as_index = False)
.agg(CourseCount = ('Subject', 'count'))
)

filtered_students = course_counts.loc[course_counts['CourseCount'] >= 2][['StudentID']]

top_student_pandas = (
humanities_df.loc[humanities_df['StudentID'].isin(filtered_students['StudentID'])]
.groupby('StudentID', as_index=False)
.agg(AverageGrade = ('Grade', 'mean'))
.sort_values('AverageGrade', ascending = False)
.head(1)
)

In pandaplyr
top_student_pp = (
grade_df >>
pp.inner_join(subject_df, 'Subject') >>
pp.where('SubjectType == "Humanities"') >>
pp.group_by('StudentID', 'SubjectType') >>
pp.summarise(AverageGrade = ('Grade', 'mean'), CourseCount = ('Grade', 'count')) >>
pp.where('(CourseCount >= 2)') >>
pp.mutate(MaxAverageGrade = 'AverageGrade.max()') >>
pp.where('MaxAverageGrade == AverageGrade') >>
pp.select('StudentID', 'AverageGrade')
)

The same answer takes 8 fewer lines, ~ 145 fewer characters, and is lot more readable.

StudentID
AverageGrade

2
87.0

Features
Here's a quick summary of the classes, methods, and functions we'll cover:
group_by() and summarise() / summarize()
These functions allow group-wise aggregations on your DataFrame for one or more columns. The syntax is as follows:
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'Z' : ['x', 'x', 'y', 'x', 'x', 'y'],
'B': [10, 20, 30, 40, 50, 60]})
new_df = df >> group_by('A', 'Z') >> summarise(AVG_B = ('B', 'mean'))
print(new_df)

A
Z
B

0
foo
x
10

1
foo
x
20

2
foo
y
30

3
bar
x
40

4
bar
x
50

5
bar
y
60

------>

A
Z
AVG_B

foo
x
15.0

foo
y
30.0

bar
x
45.0

bar
y
60.0

Note that you can pass the columns as separate arguments, or inside a list. By default, it will not return indices.
Functions summarize() and summarise() are identical.

mutate()
The mutate function lets you add new columns or modify existing ones.
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': [10, 20, 30, 40, 50, 60],
'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> mutate(B_X_2 = 'B * 2',
B_PLUS_C = 'B + C',
CONST = 1)
print(new_df)

A
B
C
B_X_2
B_PLUS_C
CONST

0
foo
10
1
20
11
1

1
foo
20
2
40
22
1

2
foo
30
3
60
33
1

3
bar
40
4
80
44
1

4
bar
50
5
100
55
1

5
bar
60
6
120
66
1

where()
This function allows you to filter rows in your DataFrame based on a condition.
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': [10, 20, 30, 40, 50, 60],
'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> where('A == "foo" | C == 6')
print(new_df)

A
B
C

0
foo
10
1

1
foo
20
2

2
foo
30
3

3
bar
40
4

4
bar
50
5

5
bar
60
6

------>

A
B
C

0
foo
10
1

1
foo
20
2

2
foo
30
3

5
bar
60
6

select()
The select function can be used to select specific columns in your DataFrame.
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': [10, 20, 30, 40, 50, 60],
'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> select('A', 'B')
print(new_df)

A
B
C

0
foo
10
1

1
foo
20
2

2
foo
30
3

3
bar
40
4

4
bar
50
5

5
bar
60
6

------>

A
B

0
foo
10

1
foo
20

2
foo
30

3
bar
40

4
bar
50

5
bar
60

rename()
You can rename columns in your DataFrame using the rename function.
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': [10, 20, 30, 40, 50, 60],
'C': [1, 2, 3, 4, 5, 6]})
new_df = df >> rename(Z = 'C')
print(new_df)

A
B
C

0
foo
10
1

1
foo
20
2

2
foo
30
3

3
bar
40
4

4
bar
50
5

5
bar
60
6

------>

A
B
Z

0
foo
10
1

1
foo
20
2

2
foo
30
3

3
bar
40
4

4
bar
50
5

5
bar
60
6

arrange() and order_by()
Use arrange or order_by (which are 100% identical) to sort your DataFrame by one or more columns.
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
'B': [10, 20, 30, 40, 50, 60]})
new_df = df >> arrange('B', 'desc')
print(new_df)

A
B

0
foo
10

1
foo
20

2
foo
30

3
bar
40

4
bar
50

5
bar
60

------>

A
B

5
bar
60

4
bar
50

3
bar
40

2
foo
30

1
foo
20

0
foo
10

left_join(), right_join(), full_join()
These functions allow you to join multiple DataFrames together.
import pandas as pd
from PandaPlyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
'B': [1, 2, 3]})

df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
'C': [10, 20, 30]})

new_df = df1 >> left_join(df2, on = 'A', fill_na = 0)
print(new_df)

A
B
C

0
foo
1
10

1
foo
1
30

2
bar
2
20

3
other
3
0

Note that left_join and full_join have an optional fill_na argument to replace numpy.nan values from merged fields.

union() and union_all()
union and union_all let you concatenate two DataFrames together.
Note that union removes duplicates while union_all doesn't.
import pandas as pd
from PandaPlyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
'B': [1, 2, 3]})

df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'],
'B': [3, 4, 5]})

new_df = df1 >> union(df2)
print(new_df)

A
B

0
foo
1

1
bar
2

2
other
3

3
bar
4

4
foo
5

import pandas as pd
from PandaPlyr import *
df1 = pd.DataFrame({'A': ['foo', 'bar', 'other'],
'B': [1, 2, 3]})

df2 = pd.DataFrame({'A': ['other', 'bar', 'foo'],
'B': [3, 4, 5]})

new_df = df1 >> union_all(df2)
print(new_df)

A
B

0
foo
1

1
bar
2

2
other
3

3
other
3

4
bar
4

5
foo
5

distinct()
distinct removes duplicate rows in your DataFrame.
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': ['foo', 'bar', 'other', 'other']})

new_df = df >> distinct()
print(new_df)

A

0
foo

1
bar

2
other

3
other

------>

A

0
foo

1
bar

2
other

fill_na()
replaces numpy.nan, None, and (unlike pandas fillna) it works on numpy.inf and -numpy.inf
import pandas as pd
from PandaPlyr import *
df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})
new_df = df >> fill_na('A', 0)
print(new_df)

A

0
1.0

1
0.0

2
0.0

3
0.0

4
0.0

drop_na()
Removes records with numpy.nan, None, and (unlike pandas dropna) it works on numpy.inf and -numpy.inf
import pandas as pd
from src.PandaPlyr import *
df = pd.DataFrame({'A': [1, np.nan, None, np.inf, -np.inf]})
new_df = df >> drop_na()
print(new_df)

A

0
1.0

sample_n() and sample_frac()
Randomly samples n rows (sample_n) or a fraction of rows (sample_frac) from the DataFrame. The random_state argument ensures reproducible results.
import pandas as pd
from src.PandaPlyr import *
df = pd.DataFrame({'A': range(10)})
new_df = df >> sample_n(5, random_state=42)
print(new_df)

A

0
8.0

0
1.0

0
5.0

0
0.0

0
2.0

head() and tail()
Return the first or last n rows of a DataFrame, respectively.
import pandas as pd
from src.PandaPlyr import *
df = pd.DataFrame({'A': range(10)})
new_df = df >> head(5) >> tail(2)
print(new_df)

A

0
3

0
4

The Pipe class allows us to use the '>>' operator to chain operations together in a pipeline.

User-defined functions
You can define your own functions using the @Pipe decorator
import pandas as pd
from PandaPlyr import *

@Pipe
def median_impute(df, *args):
"""Replace missing values with the median value in the column"""
for col in args:
col_median = np.nanmedian(df[col])
df = df >> fillna(col, value = col_median)
return df

df = pd.DataFrame({'A': ['X', None, 'Y', np.inf, 'X', 'Y'],
'B': [1, 2, 3, None, 5, 6]})

new_df = df >> fillna('A', 'Missing') >> median_impute('B')
print(new_df)

A
B

0
X
1

1
None
2

2
Y
3

3
np.inf
None

4
X
5

5
Y
6

------>

A
B

0
X
1

1
Missing
2

2
Y
3

3
Missing
3

4
X
4

5
Y
5

Future features

Benchmarking module
Polars backend
Intelligent multiprocessing

Contact