snps 2.9.0

Creator: bradpython12

Last updated:

Add to Cart

Description:

snps 2.9.0

snps
tools for reading, writing, merging, and remapping SNPs 🧬
snps strives to be an easy-to-use and accessible open-source library for working with
genotype data

Features

Input / Output

Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing
sources with a SNPs
object
Read and write VCF files (e.g., convert 23andMe to VCF)
Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
Handle several variations of file types, validated via
openSNP parsing analysis



Build / Assembly Detection and Remapping

Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
Remap SNPs between builds / assemblies



Data Cleaning

Perform quality control (QC) / filter low quality SNPs based on chip clusters
Fix several common issues when loading SNPs
Sort SNPs based on chromosome and position
Deduplicate RSIDs
Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
Deduplicate alleles on MT
Assign PAR SNPs to the X or Y chromosome



Analysis

Derive sex from SNPs
Detect deduced genotype / chip array and chip version based on chip clusters
Predict ancestry from SNPs (when installed with ezancestry)




Supported Genotype Files
snps supports VCF files and
genotype files from the following DNA testing sources:

23andMe
Ancestry
CircleDNA
CĂłdigo 46
DNA.Land
Family Tree DNA
Genes for Good
LivingDNA
Mapmygenome
MyHeritage
PLINK
Sano Genetics
tellmeGen

Additionally, snps can read a variety of “generic” CSV and TSV files.


Dependencies
snps requires Python 3.8+ and the following Python
packages:

numpy
pandas
atomicwrites



Installation
snps is available on the
Python Package Index. Install snps (and its required
Python dependencies) via pip:
$ pip install snps
For ancestry prediction
capability, snps can be installed with ezancestry:
$ pip install snps[ezancestry]


Examples

Download Example Data
First, let’s setup logging to get some helpful output:
>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we’re ready to download some example data from openSNP:
>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz



Load Raw Data
Load a 23andMe raw data file:
>>> from snps import SNPs
>>> s = SNPs("resources/662.23andme.340.txt.gz")
>>> s.source
'23andMe'
>>> s.count
991786

The SNPs class accepts a path to a file or a bytes object. A Reader class attempts to
infer the data source and load the SNPs. The loaded SNPs are
normalized and
available via a pandas.DataFrame:
>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> df.chrom.dtype.name
'object'
>>> df.pos.dtype.name
'uint32'
>>> df.genotype.dtype.name
'object'
>>> len(df)
991786

snps also attempts to detect the build / assembly of the data:
>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'



Merge Raw Data Files
The dataset consists of raw data files from two different DNA testing sources - let’s combine
these files. Specifically, we’ll update the SNPs object with SNPs from a
Family Tree DNA file.
>>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> s.source
'23andMe, FTDNA'
>>> s.count
1006960
>>> s.build
37
>>> s.build_detected
True

If the SNPs being merged have a build that differs from the destination build, the SNPs to merge
will be remapped automatically. After this example merge, the build is still detected, since the
build was detected for all SNPs objects that were merged.
As the data gets added, it’s compared to the existing data, and SNP position and genotype
discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These
discrepant SNPs are available for inspection after the merge via properties of the SNPs object.
>>> len(s.discrepant_merge_genotypes)
151

Additionally, any non-called / null genotypes will be updated during the merge, if the file
being merged has a called genotype for the SNP.
Moreover, merge takes a chrom parameter - this enables merging of only SNPs associated
with the specified chromosome (e.g., “Y” or “MT”).
Finally, merge returns a list of dict, where each dict has information corresponding
to the results of each merge (e.g., SNPs in common).
>>> sorted(list(merge_results[0].keys()))
['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged']
>>> merge_results[0]["merged"]
True
>>> len(merge_results[0]["common_rsids"])
692918



Remap SNPs
Now, let’s remap the merged SNPs to change the assembly / build:
>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38
(GRCh38).


Save SNPs
Ok, so far we’ve merged the SNPs from two files (ensuring the same build in the process and
identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let’s save
the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:
>>> saved_snps = s.to_tsv("out.txt")
Saving output/out.txt
>>> print(saved_snps)
output/out.txt

Moreover, let’s get the reference sequences for this assembly and save the SNPs as a VCF file:
>>> saved_snps = s.to_vcf("out.vcf")
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
Saving output/out.vcf
1 SNP positions were found to be discrepant when saving VCF

When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked
as discrepant and are available via a property of the SNPs object.
All output files are saved to the
output directory.



Documentation
Documentation is available here.


Acknowledgements
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP,
Open Humans, and Sano Genetics.
snps incorporates code and concepts generated with the assistance of
OpenAI’s ChatGPT. ✨


License
snps is licensed under the BSD 3-Clause License.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Files:

Customer Reviews

There are no reviews.