pyrodigal 3.5.2

🔥 Pyrodigal
Cython bindings and Python interface to Prodigal, an ORF
finder for genomes and metagenomes. Now with SIMD!

🗺️ Overview
Pyrodigal is a Python module that provides bindings to Prodigal using
Cython. It directly interacts with the Prodigal
internals, which has the following advantages:

single dependency: Pyrodigal is distributed as a Python package, so you
can add it as a dependency to your project, and stop worrying about the
Prodigal binary being present on the end-user machine.
no intermediate files: Everything happens in memory, in a Python object
you fully control, so you don't have to invoke the Prodigal CLI using a
sub-process and temporary files. Sequences can be passed directly as
strings or bytes, which avoids the overhead of formatting your input to
FASTA for Prodigal.
better memory usage: Pyrodigal uses more compact data structures than the
original Prodigal implementation, so less memory is needed to store the same
information. A heuristic estimates the number of nodes to allocate from the
sequence GC%, which minimizes reallocations.
better performance: Pyrodigal uses SIMD instructions to compute which
dynamic programming nodes can be ignored when scoring connections. This can
save from a third to half of the runtime, depending on the sequence. The
Benchmarks page of the documentation contains comprehensive comparisons, and
the JOSS paper explains how this is achieved.
same results: Pyrodigal is tested to make sure it produces
exactly the same results as Prodigal v2.6.3+31b300a. This was verified
extensively by Julian Hahnfeld and can be
checked with his comparison repository.

📋 Features
The library now features everything from the original Prodigal CLI:

run mode selection: Choose between single mode, using a training
sequence to count nucleotide hexamers, or metagenomic mode, using
pre-trained data from different organisms (prodigal -p).
region masking: Prevent genes from being predicted across regions
containing unknown nucleotides (prodigal -m).
closed ends: Genes will be identified as running over edges if they
are larger than a certain size, but this can be disabled (prodigal -c).
training configuration: During the training process, a custom
translation table can be given (prodigal -g), and the Shine-Dalgarno motif
search can be forcefully bypassed (prodigal -n).
output files: Output files can be written in a format mostly
compatible with the Prodigal binary, including the protein translations
in FASTA format (prodigal -a), the gene sequences in FASTA format
(prodigal -d), or the potential gene scores in tabular format
(prodigal -s).
training data persistence: Getting training data from a sequence and
using it for other sequences is supported; in addition, a training data
file can be saved and loaded transparently (prodigal -t).
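
As a rough guide for porting a Prodigal command line, the sketch below translates a few of the flags listed above into keyword arguments. The helper name prodigal_flags_to_kwargs is hypothetical, and the keyword names (meta, mask and closed for the GeneFinder constructor; translation_table and force_nonsd for train) are assumptions based on the Pyrodigal 3.x API reference, so check them against your installed version.

```python
import argparse


def prodigal_flags_to_kwargs(argv):
    """Translate a subset of Prodigal CLI flags into Pyrodigal keyword arguments.

    Hypothetical helper: returns two dictionaries, one intended for the
    GeneFinder constructor and one intended for GeneFinder.train.
    The keyword names are assumptions, not taken from this README.
    """
    parser = argparse.ArgumentParser(prog="prodigal", add_help=False)
    parser.add_argument("-p", choices=["single", "meta"], default="single")
    parser.add_argument("-m", action="store_true")   # region masking
    parser.add_argument("-c", action="store_true")   # closed ends
    parser.add_argument("-g", type=int, default=11)  # translation table
    parser.add_argument("-n", action="store_true")   # bypass Shine-Dalgarno search
    args = parser.parse_args(argv)

    finder_kwargs = {"meta": args.p == "meta", "mask": args.m, "closed": args.c}
    train_kwargs = {"translation_table": args.g, "force_nonsd": args.n}
    return finder_kwargs, train_kwargs


finder_kwargs, train_kwargs = prodigal_flags_to_kwargs(["-p", "single", "-c", "-g", "4"])
print(finder_kwargs)
print(train_kwargs)
```

Under those assumptions, the dictionaries would be passed as pyrodigal.GeneFinder(**finder_kwargs) and gene_finder.train(sequence, **train_kwargs). Training only applies to single mode.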

In addition, the following new features are available:

custom gene size threshold: While Prodigal uses a minimum gene size
of 90 nucleotides (60 if on edge), Pyrodigal lets you customize this
threshold, so that smaller ORFs can be identified if needed.
custom metagenomic models: Since v3.0.0, you can use your own
metagenomic models to run Pyrodigal in meta-mode. See for instance
pyrodigal-gv, which provides additional models for giant viruses and
gut phages.

🐏 Memory
Pyrodigal makes several changes compared to the original Prodigal binary
regarding memory management:

Sequences are stored as raw bytes instead of compressed bitmaps. This means
that the sequence itself takes 3/8th more space, but since the memory used
for storing the sequence is often negligible compared to the memory used to
store dynamic programming nodes, this is an acceptable trade-off for better
performance when extracting said nodes.
Node fields use smaller data types to fit into 128 bytes, compared to the
176 bytes of the original Prodigal data structure.
Node arrays are pre-allocated based on the sequence GC%, which is used to
extrapolate the probability of finding a start or stop codon.
Genes are stored in a more compact data structure than in Prodigal (which
reserves a buffer to store string data), saving around 1KiB per gene.
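
To illustrate why GC% is a useful predictor of the node count, here is a minimal sketch that estimates the number of start and stop codons from the GC fraction. It assumes independent nucleotides with equal G/C and equal A/T frequencies, and the start and stop codons of translation table 11. This is not Pyrodigal's actual heuristic: the function names and the model are assumptions made for illustration, and only the two node sizes (128 and 176 bytes) come from the figures above.

```python
START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAA", "TAG", "TGA")


def codon_probability(codon, gc):
    """Probability of a codon when each base is drawn independently."""
    base_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in codon:
        prob *= base_prob[base]
    return prob


def estimate_node_count(length, gc):
    """Expected number of start/stop codons over both strands (illustrative model)."""
    per_position = sum(codon_probability(c, gc) for c in START_CODONS + STOP_CODONS)
    # Two strands; under this symmetric model both have the same composition.
    return 2 * length * per_position


for gc in (0.3, 0.5, 0.7):
    nodes = estimate_node_count(1_000_000, gc)
    compact = nodes * 128 / 2**20   # MiB with 128-byte nodes
    original = nodes * 176 / 2**20  # MiB with 176-byte nodes
    print(f"GC={gc:.0%}: ~{nodes:,.0f} nodes, {compact:.1f} MiB vs {original:.1f} MiB")
```

In this toy model, AT-rich sequences contain more stop codons and therefore need more nodes, which is why a GC-based estimate can size the array up front rather than growing it repeatedly.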

🧢 Thread-safety
pyrodigal.GeneFinder
instances are thread-safe. In addition, the
find_genes
method is re-entrant. This means you can train an
GeneFinder
instance once, and then use a pool to process sequences in parallel:
import multiprocessing.pool
import pyrodigal

# training_sequence and sequences are assumed to be defined by the caller
gene_finder = pyrodigal.GeneFinder()
gene_finder.train(training_sequence)

with multiprocessing.pool.ThreadPool() as pool:
    predictions = pool.map(gene_finder.find_genes, sequences)
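
When sequences carry identifiers, it can be convenient to keep them attached to the predictions. The following is a minimal sketch, assuming only that the object passed in exposes a find_genes method as GeneFinder does. The helper name find_genes_by_id and its threads parameter are hypothetical.

```python
from multiprocessing.pool import ThreadPool


def find_genes_by_id(gene_finder, sequences, threads=None):
    """Run gene_finder.find_genes on each sequence in a thread pool.

    sequences maps identifiers to sequences (str or bytes); the result
    maps the same identifiers to whatever find_genes returned.
    """
    ids = list(sequences)
    with ThreadPool(threads) as pool:
        # pool.map preserves input order, so results line up with ids
        results = pool.map(gene_finder.find_genes, [sequences[i] for i in ids])
    return dict(zip(ids, results))
```

With a trained finder, this could be called as find_genes_by_id(gene_finder, {"contig_1": seq1, "contig_2": seq2}), where seq1 and seq2 are placeholder sequences.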

🔧 Installing
Pyrodigal can be installed directly from PyPI,
which hosts some pre-built wheels for the x86-64 architecture (Linux/MacOS/Windows)
and the Aarch64 architecture (Linux/MacOS), as well as the code required to compile
from source with Cython:
$ pip install pyrodigal

Pyrodigal is also available as a Bioconda package:
$ conda install -c bioconda pyrodigal

Check the install page
of the documentation for other ways to install Pyrodigal on your machine.
💡 Example
Let's load a sequence from a GenBank file, use a GeneFinder to find all the
genes it contains, and print the proteins in two-line FASTA format.
🔬 Biopython
To use the GeneFinder in single mode (corresponding to prodigal -p single,
the default operation mode of Prodigal), you must explicitly call the train
method with the sequence you want to use for training before trying to find
genes, or you will get a RuntimeError:
import Bio.SeqIO
import pyrodigal

record = Bio.SeqIO.read("sequence.gbk", "genbank")

orf_finder = pyrodigal.GeneFinder()
orf_finder.train(bytes(record.seq))
genes = orf_finder.find_genes(bytes(record.seq))

However, in meta mode (corresponding to prodigal -p meta), you can find genes directly:
import Bio.SeqIO
import pyrodigal

record = Bio.SeqIO.read("sequence.gbk", "genbank")

orf_finder = pyrodigal.GeneFinder(meta=True)
for i, pred in enumerate(orf_finder.find_genes(bytes(record.seq))):
    print(f">{record.id}_{i+1}")
    print(pred.translate())

On older versions of Biopython (before 1.79) you will need to use
record.seq.encode() instead of bytes(record.seq).
🧪 Scikit-bio
import skbio.io
import pyrodigal

seq = next(skbio.io.read("sequence.gbk", "genbank"))

orf_finder = pyrodigal.GeneFinder(meta=True)
for i, pred in enumerate(orf_finder.find_genes(seq.values.view('B'))):
    # scikit-bio keeps the GenBank header fields in seq.metadata
    print(f">{seq.metadata['LOCUS']['locus_name']}_{i+1}")
    print(pred.translate())

We need to use the view
method to get the sequence viewable by Cython as an array of unsigned char.
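
The same idea can be shown with the standard library alone. The sketch below is an analogy, not scikit-bio code: it reinterprets a buffer of single-character items as unsigned bytes without copying, which is comparable to what view('B') does for the NumPy array behind seq.values.

```python
data = b"ATGC"

# A buffer whose items are 1-byte characters (comparable to a NumPy 'S1' array).
chars = memoryview(data).cast("c")
print(chars[0])        # b'A'

# Reinterpret the same memory as unsigned char; no bytes are copied.
codes = chars.cast("B")
print(codes.format)    # B
print(list(codes))     # [65, 84, 71, 67]
```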
🔖 Citation
Pyrodigal is scientific software, with a published paper in the Journal of
Open Source Software. Please cite both Pyrodigal and Prodigal if you are
using it in an academic work, for instance as:

Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt et al., 2010).

Detailed references are available on the Publications page of the
online documentation.
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue
tracker if you need to report or ask something. If you are filing a bug
report, please include as much information as you can about the issue, and
try to recreate the same bug in a simple, easily reproducible situation.
🏗️ Contributing
Contributions are more than welcome! See
CONTRIBUTING.md
for more details.
📋 Changelog
This project adheres to Semantic Versioning
and provides a changelog
in the Keep a Changelog format.
⚖️ License
This library is provided under the GNU General Public License v3.0.
The Prodigal code was written by Doug Hyatt and is distributed under the
terms of the GPLv3 as well. See vendor/Prodigal/LICENSE for more information.
This project is in no way affiliated with, sponsored by, or otherwise endorsed
by the original Prodigal authors. It was developed by Martin Larralde during
his PhD project at the European Molecular Biology Laboratory in the Zeller team.
