conifer-analysis 0.1.0

Post-process conifer output for downstream statistical analysis.
conifer-analysis uses dask to analyze conifer results in a distributed,
out-of-core fashion, so the data do not have to fit into memory all at once.
This can be helpful when processing many such results.

Say that you have a directory full of conifer results. You can generate a
histogram of the confidence values per file (sample) and per taxon using the
provided pipeline confidence_hist. Even when you work locally, it can be
helpful to explicitly create a distributed client so that you control the
number of workers:
from dask.distributed import Client
from conifer_analysis import confidence_hist

client = Client(n_workers=8)
You can then visit the default dashboard in your browser (by default at
http://localhost:8787/status) to observe tasks live. Next, we run the pipeline,
which returns a pandas DataFrame:
hist = confidence_hist("data/*.tsv")
As an example of the returned shape, the output of hist.info() looks like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7700 entries, 0 to 7699
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   path         7700 non-null   category
 1   name         7700 non-null   category
 2   taxonomy_id  7700 non-null   category
 3   bin          7700 non-null   interval[float64, right]
 4   midpoints    7700 non-null   float64
 5   read1_hist   7700 non-null   int64
 6   read2_hist   7700 non-null   int64
 7   avg_hist     7700 non-null   int64
dtypes: category(3), float64(1), int64(3), interval(1)
memory usage: 385.3 KB
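
Because the result is an ordinary pandas DataFrame, it can be summarized with
the usual pandas tools. The following is a minimal sketch, not part of
conifer-analysis itself: it assumes that avg_hist holds the number of reads
counted in each bin and that midpoints holds the bin centers, and it uses a
small made-up table in place of real results. The helper name mean_confidence
is hypothetical.

```python
import pandas as pd

# Tiny made-up table that reuses column names from the output above.
# The values are illustrative only; they are not real conifer results.
hist = pd.DataFrame(
    {
        "path": ["a.tsv", "a.tsv", "b.tsv", "b.tsv"],
        "name": ["Escherichia coli"] * 4,
        "taxonomy_id": ["562"] * 4,
        "midpoints": [0.25, 0.75, 0.25, 0.75],
        "avg_hist": [10, 30, 20, 20],
    }
)


def mean_confidence(table: pd.DataFrame) -> pd.Series:
    """Approximate the mean confidence per file and taxon from binned counts.

    Hypothetical helper: assumes avg_hist contains per-bin counts.
    """
    # Weight each bin midpoint by the number of reads counted in that bin.
    weighted = table["midpoints"] * table["avg_hist"]
    keys = [table["path"], table["taxonomy_id"]]
    # observed=True skips unused category combinations in categorical keys.
    total_weighted = weighted.groupby(keys, observed=True).sum()
    total_counts = table["avg_hist"].groupby(keys, observed=True).sum()
    return (total_weighted / total_counts).rename("mean_confidence")


summary = mean_confidence(hist)
print(summary)
```

The estimate is only as fine-grained as the bins, since every read in a bin is
treated as if its confidence were equal to the bin midpoint.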

Installation is as simple as:
pip install conifer-analysis
If you want to observe tasks in the dask dashboard, you will need additional
dependencies:
pip install "conifer-analysis[dashboard]"


Copyright © 2022, Moritz E. Beber.
Free software distributed under the Apache Software License 2.0.


