pytantan 0.1.1
PyTantan
Cython bindings and Python interface to Tantan, a fast method for identifying repeats in DNA and protein sequences.
Overview
Tantan is a fast method developed
by Martin Frith[1] to identify simple repeats in DNA or protein
sequences. It can be used to mask repeat regions in reference sequences, which
helps avoid false homology predictions between repeated regions.
PyTantan is a Python module that provides bindings to Tantan
using Cython. It implements a user-friendly, Pythonic
interface to mask a sequence with various parameters. It interacts with the
Tantan interface rather than with the CLI, which has the following advantages:
no binary dependency: PyTantan is distributed as a Python package, so
you can add it as a dependency to your project, and stop worrying about the
tantan binary being present on the end-user machine.
no intermediate files: Everything happens in memory, in a Python object
you control, so you don't have to invoke the Tantan CLI using a sub-process
and temporary files.
better portability: Tantan uses SIMD to accelerate alignment scoring,
but doesn't support dynamic dispatch, so it has to be compiled on the local
machine to be able to use the full capabilities of the local CPU. PyTantan
ships several versions of Tantan instead, each compiled with different
target features, and selects the best one for the local platform at runtime.
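The runtime selection described in the last point can be pictured with a minimal sketch. It is not PyTantan's actual dispatch code: the build names, the feature flags, and the `select_build` helper are all hypothetical and serve only to illustrate picking the most specialized build that the local CPU supports.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical build table, ordered from most to least specialized.
# Each entry pairs a build name with the CPU feature flag it requires;
# these names are illustrative and are not taken from PyTantan itself.
BUILDS: Sequence[Tuple[str, str]] = (
    ("avx2", "avx2"),
    ("sse4", "sse4_1"),
    ("neon", "neon"),
)


def select_build(
    cpu_features: Iterable[str],
    builds: Sequence[Tuple[str, str]] = BUILDS,
    fallback: str = "generic",
) -> str:
    """Return the first build whose required feature the CPU reports."""
    available = set(cpu_features)
    for name, required in builds:
        if required in available:
            return name
    # No specialized build is usable: fall back to a portable one.
    return fallback


print(select_build({"sse2", "sse4_1", "avx2"}))  # avx2
print(select_build({"sse2", "sse4_1"}))          # sse4
print(select_build({"sse2"}))                    # generic
```

In practice the set of CPU features would come from a detection library such as archspec. The exact flag names reported on a given platform are an assumption here and should be checked against that library's documentation.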
Installing
PyTantan is available for all modern Python versions (3.6+), depending only on the
scoring-matrices package, and
optionally on the lightweight archspec
package for runtime CPU feature detection.
It can be installed directly from PyPI,
which hosts some pre-built wheels for Linux and macOS, as well as the code
required to compile from source with Cython:
$ pip install pytantan
Check the install page
of the documentation for other ways to install PyTantan on your machine.
Example
The top-level function pytantan.mask_repeats can be used to mask a sequence
without having to manage intermediate objects:
import pytantan
masked = pytantan.mask_repeats("ATTATTATTATTATT")
print(masked) # ATTattattattatt
The mask symbol (and other parameters) can be given as keyword arguments:
import pytantan
masked = pytantan.mask_repeats("ATTATTATTATTATT", mask='N')
print(masked) # ATTNNNNNNNNNNNN
To mask several sequences iteratively with the same parameters, consider
creating a RepeatFinder once and calling the mask_repeats method for
each sequence to avoid resource re-initialization.
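As a rough sketch of that batch pattern, the hypothetical helper below (`mask_all` is not part of PyTantan) applies one masking callable to many named sequences. It defaults to the top-level `pytantan.mask_repeats` shown above. The constructor arguments of `RepeatFinder` are not described in this README, so the sketch does not build one itself and instead accepts any callable, such as the bound `mask_repeats` method of a finder you created.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def mask_all(
    records: Iterable[Tuple[str, str]],
    masker: Optional[Callable[..., str]] = None,
    **options: Any,
) -> Dict[str, str]:
    """Mask every (name, sequence) pair with the same callable and options.

    `masker` is any callable taking a sequence and returning the masked
    sequence. When omitted, the top-level `pytantan.mask_repeats` is used.
    Keyword options (for instance `mask="N"`) are forwarded unchanged.
    """
    if masker is None:
        # Imported lazily so the helper can be used with other maskers
        # even when PyTantan is not installed.
        import pytantan

        masker = pytantan.mask_repeats
    return {name: masker(sequence, **options) for name, sequence in records}
```

With a `RepeatFinder` instance named `finder`, a call such as `mask_all(records, masker=finder.mask_repeats)` would reuse that single object for every sequence. Check the API reference for the arguments that the constructor and the method accept.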
Feedback
Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker
if you need to report or ask something. If you are filing a bug report,
please include as much information as you can about the issue, and try to
recreate the same bug in a simple, easily reproducible situation.
Contributing
Contributions are more than welcome! See
CONTRIBUTING.md
for more details.
Changelog
This project adheres to Semantic Versioning
and provides a changelog
in the Keep a Changelog format.
License
This library is provided under the GNU General Public License v3.0 or later.
Tantan is developed by Martin Frith and is distributed under the
terms of the GPLv3 or later as well. See vendor/tantan/COPYING.txt for more information.
This project is in no way affiliated with, sponsored, or otherwise endorsed
by the Tantan authors. It was developed
by Martin Larralde during his PhD project
at the Leiden University Medical Center in
the Zeller team.
References
[1] Frith, Martin C. “A new repeat-masking method enables specific detection of homologous sequences.” Nucleic Acids Research vol. 39,4 (2011): e23. doi:10.1093/nar/gkq1212