poola 0.0.16

Creator: railscoderz

Last updated:

0 purchases

poola 0.0.16 Image
poola 0.0.16 Images
Add to Cart

Description:

poola 0.0.16

poola

Python package for pooled screen analysis

Install
Install from github for the latest development release:
pip install git+git://github.com/gpp-rnd/poola.git#egg=poola
Or install the most recent distribution from PyPi:

pip install poola
How to use
Additional packages required for this tutorial can be install using pip install -r requirements.txt
from poola import core as pool
import pandas as pd
import seaborn as sns
import gpplot
import matplotlib.pyplot as plt

To demonstrate the functionality of this module we'll use read counts from Sanson et al. 2018.
supp_reads = 'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-07901-8/MediaObjects/41467_2018_7901_MOESM4_ESM.xlsx'
read_counts = pd.read_excel(supp_reads,
sheet_name = 'A375_orig_tracr raw reads',
header = None,
skiprows = 3,
names = ['sgRNA Sequence', 'pDNA', 'A375_RepA', 'A375_RepB'],
engine='openpyxl')
guide_annotations = pd.read_excel(supp_reads,
sheet_name='sgRNA annotations',
engine='openpyxl')

The input data has three columns with read counts and one column with sgRNA annotations
read_counts.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




sgRNA Sequence
pDNA
A375_RepA
A375_RepB




0
AAAAAAAATCCGGACAATGG
522
729
774


1
AAAAAAAGGATGGTGATCAA
511
1484
1393


2
AAAAAAATGACATTACTGCA
467
375
603


3
AAAAAAATGTCAGTCGAGTG
200
737
506


4
AAAAAACACAAGCAAGACCG
286
672
352




lognorms = pool.lognorm_columns(reads_df=read_counts, columns=['pDNA', 'A375_RepA', 'A375_RepB'])
filtered_lognorms = pool.filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'], z_low=-3)
print('Filtered ' + str(lognorms.shape[0] - filtered_lognorms.shape[0]) + ' columns due to low pDNA abundance')

Filtered 576 columns due to low pDNA abundance

Note that the column names for the lognorms remain the same
lognorms.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




sgRNA Sequence
pDNA
A375_RepA
A375_RepB




0
AAAAAAAATCCGGACAATGG
4.192756
3.373924
3.521755


1
AAAAAAAGGATGGTGATCAA
4.163726
4.326828
4.312620


2
AAAAAAATGACATTACTGCA
4.041390
2.540624
3.196767


3
AAAAAAATGTCAGTCGAGTG
2.930437
3.388159
2.973599


4
AAAAAACACAAGCAAGACCG
3.388394
3.268222
2.528233




lfc_df = pool.calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=['A375_RepA', 'A375_RepB'])

We drop the pDNA column after calculating log-fold changes
lfc_df.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




sgRNA Sequence
A375_RepA
A375_RepB




0
AAAAAAAATCCGGACAATGG
-0.818831
-0.671000


1
AAAAAAAGGATGGTGATCAA
0.163102
0.148894


2
AAAAAAATGACATTACTGCA
-1.500766
-0.844622


3
AAAAAAATGTCAGTCGAGTG
0.457721
0.043161


4
AAAAAACACAAGCAAGACCG
-0.120172
-0.860161




Since we only have two conditions it's easy to visualize replicates as a point densityplot using gpplot
plt.subplots(figsize=(4,4))
gpplot.point_densityplot(data=lfc_df, x='A375_RepA', y='A375_RepB')
gpplot.add_correlation(data=lfc_df, x='A375_RepA', y='A375_RepB')
sns.despine()


Since we see a strong correlation, we'll average the log-fold change of each sgRNA across replicates
avg_replicate_lfc_df = pool.average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence', condition_indices=[0],
sep='_')

After averaging log-fold changes our dataframe is melted, so the condition column specifies the experimental condition (A375 here) and the n_obs specifies the number of replicates
avg_replicate_lfc_df.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




sgRNA Sequence
condition
avg_lfc
n_obs




0
AAAAAAAATCCGGACAATGG
A375
-0.744916
2


1
AAAAAAAGGATGGTGATCAA
A375
0.155998
2


2
AAAAAAATGACATTACTGCA
A375
-1.172694
2


3
AAAAAAATGTCAGTCGAGTG
A375
0.250441
2


4
AAAAAACACAAGCAAGACCG
A375
-0.490166
2




It's sometimes helpful to group controls into pseudo-genes so they're easier to compare with target genes. Our annotation file maps from sgRNA sequences to gene symbols
remapped_annotations = pool.group_pseudogenes(annotations=guide_annotations, pseudogene_size=4,
gene_col='Annotated Gene Symbol',
control_regex=['NO_CURRENT'])
remapped_annotations.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




sgRNA Sequence
Annotated Gene Symbol
Annotated Gene ID




0
AAAAAAAATCCGGACAATGG
SLC25A24
29957


1
AAAAAAAGGATGGTGATCAA
FASTKD3
79072


2
AAAAAAATGACATTACTGCA
BCAS2
10286


3
AAAAAAATGTCAGTCGAGTG
GPR18
2841


4
AAAAAACACAAGCAAGACCG
ZNF470
388566




We provide two methods for scaling log-fold change values to controls:

Z-score from a set of negative controls
Scale scores between a set of negative and positive controls

For both scoring methods, you can input either a regex or a list of genes to define control sets
For our set of negative controls, we'll use nonessential genes
nonessential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/non-essential-genes-Hart2014.txt',
names=['gene'])
.gene)
annot_guide_lfcs = pool.annotate_guide_lfcs(avg_replicate_lfc_df, remapped_annotations, 'Annotated Gene Symbol',
merge_on='sgRNA Sequence', z_score_neg_ctls=True,
z_score_neg_ctl_genes=nonessential_genes)
annot_guide_lfcs.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




sgRNA Sequence
condition
avg_lfc
n_obs
Annotated Gene Symbol
Annotated Gene ID
z_scored_avg_lfc




0
AAAAAAAATCCGGACAATGG
A375
-0.744916
2
SLC25A24
29957
-1.651921


1
AAAAAAAGGATGGTGATCAA
A375
0.155998
2
FASTKD3
79072
0.166585


2
AAAAAAATGACATTACTGCA
A375
-1.172694
2
BCAS2
10286
-2.515396


3
AAAAAAATGTCAGTCGAGTG
A375
0.250441
2
GPR18
2841
0.357219


4
AAAAAACACAAGCAAGACCG
A375
-0.490166
2
ZNF470
388566
-1.137705




To aggregate scores at the gene level, we specify columns to average, and columns to z-score.
From our z-scores we calculate a p-value and FDR using the Benjamini-Hochberg procedure
gene_lfcs = pool.aggregate_gene_lfcs(annot_guide_lfcs, 'Annotated Gene Symbol',
average_cols=['avg_lfc'],
zscore_cols=['z_scored_avg_lfc'])
gene_lfcs.head()


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




condition
Annotated Gene Symbol
n_guides
avg_lfc
z_scored_avg_lfc
z_scored_avg_lfc_p_value
z_scored_avg_lfc_fdr




0
A375
A1BG
3
-0.084621
-0.552711
0.580461
0.872833


1
A375
A1CF
4
0.500314
1.723181
0.084856
0.338055


2
A375
A2M
4
0.288629
0.868604
0.385064
0.759829


3
A375
A2ML1
4
-0.481786
-2.241580
0.024989
0.132164


4
A375
A3GALT2
4
0.059362
-0.056952
0.954583
0.990693




Finally, to evaluate the quality this screen, we'll calculate the ROC-AUC between essential and nonessential genes for each condition
essential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/essential-genes-Hart2015.txt',
names=['gene'])
.gene)
roc_aucs, roc_df = pool.get_roc_aucs(lfcs=gene_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
condition_col='condition', score_col='avg_lfc')
print('ROC-AUC: ' + str(round(roc_aucs['ROC-AUC'].values[0], 3)))

ROC-AUC: 0.976

Note that we can also use this function to calculate roc-aucs at the guide level
annotated_guide_lfcs = lfc_df.merge(guide_annotations, how='inner', on='sgRNA Sequence')
roc_aucs, roc_df = pool.get_roc_aucs(lfcs=annotated_guide_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
condition_list=['A375_RepA', 'A375_RepB'])
roc_aucs


<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}

.dataframe thead th {
text-align: right;
}

</style>




condition
ROC-AUC




0
A375_RepA
0.918505


1
A375_RepB
0.917600




And plot ROC curves from the roc_df
plt.subplots(figsize=(4,4))
sns.lineplot(data=roc_df, x='fpr', y='tpr', hue='condition', ci=None)
gpplot.add_xy_line()
sns.despine()

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.