ScaleHDALSPAC 0.3
ScaleHD: Automated Huntington Disease genotyping
=========================================================
ScaleHD is a package for automating the process of genotyping microsatellite repeats in Huntington Disease data.
We utilise machine learning approaches to take into account natural data 'artefacts', such as PCR slippage and somatic
mosaicism, when processing data. This provides the end-user with a simple-to-use platform which can robustly predict genotypes from input data.

By default, input is a pair of unaligned .fastq sequence data files per sample: one of forward reads and one of reverse reads. We utilise both
in order to reduce the complex dimensionality issue posed by Huntington Disease's multiple repeat tract genetic structure. Reverse reads allow
us to determine the current sample's CCG state, which provides us with a mechanism by which to more easily call the entire genotype. Forward reads
are utilised in a similar approach, to determine the CAG and intervening structure.

The general overview of the application is as follows:

1) Input FastQ files are subsampled, if an overwhelming number of reads are present. This can be overruled with the -b flag.
2) Sequence quality control is carried out per the user's instructions. We recommend trimming of any 5-prime spacer+primer combinations, for optimal alignment.
3) Alignment of these files, to a typical HD structure (CAG_1_1_CCG_2) reference, is carried out.
4) Assemblies are scanned with Digital Signal Processing (DSP) to detect any possible atypical structures (e.g. CAG_2_1_CCG_3).
    4.1) If no atypical alleles are detected, proceed as normal.
    4.2) If atypical alleles are detected, a custom reference is generated, and re-alignment to this is carried out.
5) With the appropriate allele information and sequence assembly(ies) present, samples are genotyped.
6) Output is written for the current sample; the procedure is repeated for the next sample in the queue (if present).

What's New
==========
* Added an n-aligned matrix of repeat-count distributions, on a SHD instance-wide basis.
* Instances of SHD will output the utilised configuration file with other results.
* Removed the -b/--boost flag, and made subsampling the default behaviour (given acceptable read-count in raw input data).
* Added the -b/--broadscope flag, which forces alignment and DSP to be executed on all present reads (i.e. no subsampling).
* Added the -e/--enshrine flag, which forces SHD to retain all aligned reads which are not uniquely mapped (which are removed by default).
* Implemented DSP to function within the intervening sequence, rather than utilising a string derivation method.
* Added many report flags for SHD's instance report output; more on this in the Output section of this readme.

Installation Prerequisites
==========================
If you do not have sudo access (to install requisite packages), you should run ScaleHD in a user-bound local python environment, or discrete installation. This guide will assume you have sudo access. However, we detail an extra stage on setting up a local python environment for use with ScaleHD.

0. (Optional 1 - no sudo) Python 2.7 Setup
    ~~~~
    cd desired-directory
    tar xjvf Python-2.7.tar.bz2
    cd Python-2.7
    ./configure --enable-shared --prefix=/your/custom/installation/path
    make
    make install
    ~~~~
0. (Optional 2 - no sudo) Bash profile edit.. in your ~/.bash_profile file
    ~~~~
    export PATH=/your/custom/installation/path/bin:$PATH
    export LD_LIBRARY_PATH=/your/custom/installation/path/lib:$LD_LIBRARY_PATH
    ~~~~
1. Get PIP if not already installed!
    ~~~~
    wget https://bootstrap.pypa.io/get-pip.py
    python ~/path/to/get-pip.py
    ~~~~
2. Install Cython/Scipy stack separately (Setuptools seems to install incorrectly..)
    ~~~~
    pip install cython
    pip install scipy
    pip install numpy
    ~~~~
3. Install ScaleHD from src (pip coming soon...)
    ~~~~
    cd ~/path/to/ScaleHD/src/
    python setup.py install
    ~~~~
4. Install required third-party binaries. Please make sure any binaries you do install are included on your $PATH so that they can be found by your system. **Please note**, ScaleHD will utilise GNU WHICH/TYPE to determine if a command is on your $PATH. If either WHICH/TYPE or a dependency is missing, ScaleHD will inform you and exit.

    Quality Control:
        Cutadapt
        FastQC (Java required)
    Alignment:
        BWA
        SeqTK
        Samtools
        Generatr (setup.py will install this for you)
    Genotyping:
        R
        Samtools
        Generatr (as above)
        Picard (alias required*)
        GATK (alias required*)

    *Aliases are required for certain third-party software which are not distributed as installable binaries. An example of an alias would look like:

        alias gatk="java -jar /Users/homedir/Documents/Builds/GenomeAnalysisTK.jar"

5. Check that libxml2-dev and libxslt-dev are installed...

Usage
=====
General usage is as follows:

    scalehd [-h/--help] [-v] [-c CONFIG] [-t THREADS] [-e] [-b] [-g] [-j "jobname"] [-o OUTPUT]

e.g.

    scalehd -v -c /path/to/config.xml -t 12 -j "ExampleJobPrefix" -o /path/to/master/output

ScaleHD flags are:

    -h/--help:: Simple help message explaining flags in detail
    -v/--verbose:: Enables verbose mode in the terminal (i.e. shows user feedback)
    -c/--config:: Will execute all settings specified in the given ArgumentConfig.xml [filepath].
    -t/--threads:: Number of threads to utilise. Mainly will affect alignment performance [integer].
    -e/--enshrine:: Forces aligned reads which are not uniquely mapped to be retained; default behaviour without this flag removes said reads.
    -b/--broadscope:: Forces subsampling of raw and aligned reads to be disabled.
    -g/--groupsam:: Groups all aligned assemblies generated into one output folder, with appropriate sample names. If not specified, assemblies will be left in the sample's specific output subfolder.
    -j/--jobname:: Specifies a prefix to use for the root output directory. Optional. If you specify a JobName that already exists within your specified -o output folder, ScaleHD will prompt the user to decide if they wish to delete the pre-existing folder and replace.
    -o/--output:: Desired output directory.

Data Primer
===========
A short note on the requirements of filenames/structure for ScaleHD to function. A sample's filename (here, ExampleSampleName) must adhere to the following structure:

    ExampleSampleName_R1.fastq
    ExampleSampleName_R2.fastq

You must utilise both forward (R1) and reverse (R2) reads, per sample pair. If the respective files do not end in _R1.fastq (.fq) or _R2.fastq (.fq), ScaleHD will not run correctly.

Since this is a highly HD-specific application, we can offer some insight into providing the best approaches for genotyping. Due to the similarity of both repeat tracts in HD (CAG and CCG), which flank an intervening sequence that is itself highly similar to both regions, alignment can be fussy about your input data. Thus, we highly recommend trimming any spacers or primers present on the 5-prime end of your reads; this enables reads to start at the same position and provides the aligner with a more discrete boundary between the different HD repeat tracts.

Individual settings for different stages in ScaleHD are set within a configuration XML document. The particular acceptable data types/ranges for each parameter vary. The configuration XML document for ScaleHD settings must also adhere to the following structure:

    <config data_dir="/path/to/reads/" forward_reference="/path/to/forward/refseq.fa" reverse_reference="/path/to/reverse/refseq.fa">
      <instance_flags quality_control="BOOL" sequence_alignment="BOOL" atypical_realignment="BOOL" genotype_prediction="BOOL" snp_calling="BOOL"/>
      <trim_flags trim_type="x" quality_threshold="x" adapter_flag="x" adapter="x" error_tolerance="x"/>
      <alignment_flags min_seed_length="x" band_width="x" seed_length_extension="x" skip_seed_with_occurrence="x" chain_drop="x" seeded_chain_drop="x" seq_match_score="x" mismatch_penalty="x" indel_penalty="x" gap_extend_penalty="x" prime_clipping_penalty="x" unpaired_pairing_penalty="x"/>
      <prediction_flags plot_graphs="BOOL"/>
    </config>

With each parameter data type/rule being as follows:

    CONFIG
        data_dir: Must be a real path, with an even number of ONLY *.fastq or *.fq files within.
        forward_reference: Must be a real reference file (*.fasta, *.fa or *.fas).
        reverse_reference: See forward_reference.
    INSTANCE
        quality_control: Boolean, TRUE/FALSE
        sequence_alignment: Boolean, TRUE/FALSE
        atypical_realignment: Boolean, TRUE/FALSE
        genotype_prediction: Boolean, TRUE/FALSE
        snp_calling: Boolean, TRUE/FALSE
    TRIM
        trim_type: String, "Quality", "Adapter" or "Both"
        quality_threshold: Integer, within the range 0-38
        adapter_flag: String, one of: '-a', '-g', '-a$', '-g^', '-b'. ([See Cutadapt](http://cutadapt.readthedocs.io/en/stable/guide.html#removing-adapters))
        adapter: String, consisting of only 'A','T','G','C'
        error_tolerance: Float, within the range of 0.0 to 1.0 (in 0.01 increments).
    ALIGNMENT
        All flags present are direct equivalents of parameters present in BWA-MEM. See [the BWA manual for more information](http://bio-bwa.sourceforge.net/bwa.shtml).
    PREDICTION
        plot_graphs: Boolean, TRUE/FALSE

Output
======
A brief overview of flags provided in the output is as follows:

    SampleName:: The extracted filename of the sample that was processed.
    Primary/Secondary GTYPE:: Allele genotype in the format CAG_x_y_CCG_z
    Status:: Atypical or Typical structure
    BSlippage:: Slippage ratio of allele's read peak ('N minus 2' to 'N minus 1'), over 'N'.
    Somatic Mosaicism:: Mosaicism ratio of allele's read peak ('N plus 1' to 'N plus 10'), over 'N'.
    Confidence:: Confidence in genotype prediction (0-100).
    Exception Raised:: If, during a particular stage of the pipeline, exceptions caused the processing to fail, this flag will inform the user in which stage it crashed.
    Homozygous Haplotype:: If True, both alleles have an identical genotype.
    Neighbouring Peaks:: If True, both alleles exist within the same CCG distribution, neighbouring each other.
    Diminished Peaks:: If True, an expanded peak has very few reads and was detected independently. Manual inspection recommended.
    Novel Atypical:: If True, an intervening sequence structure that has not been readily observed before was detected. Manual inspection recommended.
    Alignment Warning:: If True, determining the CCG value(s) returned more peaks than is 'possible'. Manual inspection recommended.
    Atypical Alignment Warning:: In the case of atypical re-alignment, particularly awful alignment quality can return more than one peak, which should not happen.
    CCG Rewritten:: CCG was rewritten from the FOD-derived value, i.e. DSP overwrote the FOD results.
    CCG Zygosity Rewritten:: A sample (aligned to a typical reference) that was heterozygous (CCG), was detected to be an atypical homozygous (CCG) sample.
    CCT Uncertainty:: The most common CCT 'sizes' returned from DSP were too similar in count (e.g. CCT2 == 54%, CCT3 == 46%) to be certain.
    SVM Failure:: SVM CCG zygosity calling was incorrect, as a result of the resultant confusion matrix providing differing results from a brute force ratio check. Manual inspection highly recommended.
    Differential Confusion:: The allele sorting algorithm is confused between a potential neighbouring peak, and a homozygous haplotype. Manual inspection highly recommended.
    Peak Inspection Warning:: At least one allele failed inspection on the repeat-count distribution the genotype(s) was(were) derived from. Common in very low read count samples/poor sequencing.
    Low Distribution Reads:: A warning which is triggered when at least one allele's repeat count distribution contains an unappealingly low number of reads.
    Low Peak Reads:: A fatal warning which is triggered when, in a given repeat count distribution, the returned N value contains a very low number of reads. Manual inspection highly recommended.
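
To make the BSlippage and Somatic Mosaicism entries above more concrete, here is a minimal Python sketch of how such ratios could be computed from a repeat-count distribution. It is not ScaleHD's own code: the function name `peak_ratios`, the dictionary input format, and the reading of "ratio" as "summed reads in the window divided by reads at N" are all assumptions made for illustration.

```python
def peak_ratios(distribution, n):
    """Return (slippage, mosaicism) ratios for an allele peak at repeat count n.

    distribution: dict mapping repeat count -> number of reads (hypothetical format).
    Assumption: each ratio is the summed reads in the window divided by reads at N.
    """
    peak_reads = distribution.get(n, 0)
    if peak_reads == 0:
        raise ValueError("no reads at the peak position N=%d" % n)

    # Slippage window: 'N minus 2' to 'N minus 1'
    slippage_reads = sum(distribution.get(n - k, 0) for k in (1, 2))
    # Mosaicism window: 'N plus 1' to 'N plus 10'
    mosaicism_reads = sum(distribution.get(n + k, 0) for k in range(1, 11))

    return slippage_reads / float(peak_reads), mosaicism_reads / float(peak_reads)


# Made-up read counts for a single allele with its peak at 17 repeats.
example = {15: 10, 16: 40, 17: 200, 18: 20, 19: 5, 30: 3}
slippage, mosaicism = peak_ratios(example, 17)
print(slippage, mosaicism)  # 0.25 0.125
```

In this example the reads at 30 repeats fall outside the 'N plus 10' window and so do not contribute to the mosaicism ratio.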
For personal and professional use. You cannot resell or redistribute these repositories in their original state.