smallScripts

Personal collection of small tools to manipulate sequence files

A. General utility

checkmd5sum.py: Compares the md5sum computed from a file against the md5sum provided for that file (missing lines, WIP).
renameFilesRecursive.py: Given an input file of old/new names, renames all files recursively in an input directory. Designed for file prefixes, i.e. to replace random prefixes with QBiC barcodes.
matchColumnInOneFile.py: Sorts file2 by column1 so that it follows the order of column1 in file1. Requires the pandas package.
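
The reordering idea behind matchColumnInOneFile.py can be sketched as follows. This is not the script itself (which uses pandas) but a dependency-free illustration. The function name `sort_by_reference`, the tab-delimited input, and the sample IDs are assumptions made for the example.

```python
import csv
import io


def sort_by_reference(ref_rows, rows, key_index=0):
    """Return `rows` ordered by where each row's key first appears in `ref_rows`."""
    ref_rows = list(ref_rows)
    rank = {}
    for position, ref in enumerate(ref_rows):
        # Keep the first position if a key is repeated in the reference.
        rank.setdefault(ref[key_index], position)
    missing = len(ref_rows)  # keys absent from the reference sort last
    # sorted() is stable, so rows with equal rank keep their original order.
    return sorted(rows, key=lambda row: rank.get(row[key_index], missing))


# Hypothetical file contents, tab-delimited (an assumption for this sketch).
file1 = "sampleC\t1\nsampleA\t2\nsampleB\t3\n"
file2 = "sampleA\tx\nsampleB\ty\nsampleC\tz\n"

ref = list(csv.reader(io.StringIO(file1), delimiter="\t"))
data = list(csv.reader(io.StringIO(file2), delimiter="\t"))

reordered = sort_by_reference(ref, data)
for row in reordered:
    print("\t".join(row))
```

Rows whose key does not occur in file1 are placed at the end here; the actual script may handle them differently.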

format_fasta_oneline.py: Converts a multi-line FASTA file so that each sequence sits on a single line (seq1Name, newline, seq1 on one line, newline, seq2Name, ...). Uses only awk, so it does not depend on Biopython or Perl.
filterMinSpecies.py: Goes through each FASTA file in an input directory, counts its sequences, and copies the file to the output directory only if the number of sequences is >= minNumSpecPer percent of totalSpecies. E.g. if totalSpecies is 60 and minNumSpecPer is 50, a FASTA file must have at least 30 sequences to be copied to the output folder.
seqLengthDist.py: 1) Gets a distribution count of sequence lengths and 2) generates a histogram of the length distribution.
statsFasta.py: Outputs stats for a FASTA file (number of seqs, average seq length, longestSeq, shortestSeq, %GC).
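
As a rough illustration of the kind of summary statsFasta.py reports, here is a minimal sketch. The function names `read_fasta` and `fasta_stats`, the dictionary keys, and the choice to compute %GC over all characters (including any N or gap symbols) are assumptions for this example, not the script's actual behaviour.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_stats(records):
    """Summarise sequence count, lengths and GC content."""
    lengths = []
    gc_count = 0
    for _, seq in records:
        seq = seq.upper()
        lengths.append(len(seq))
        gc_count += seq.count("G") + seq.count("C")
    total = sum(lengths)
    if not lengths:
        return {"num_seqs": 0}
    return {
        "num_seqs": len(lengths),
        "mean_length": total / len(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        # Assumption: %GC is taken over every character in the sequences.
        "gc_percent": 100.0 * gc_count / total if total else 0.0,
    }


# Hypothetical two-record FASTA input with one wrapped sequence.
example = """>seq1
ACGT
ACGT
>seq2
GGCC
""".splitlines()

stats = fasta_stats(read_fasta(example))
print(stats)
```

The same `read_fasta` helper also shows what format_fasta_oneline.py achieves with awk: each record comes back as a name plus one unwrapped sequence string.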

getIgphymlMSAclone.py: Parses the input file for igPhyML, typically named xxx_db-pass_productive-T_clone-pass_germ-pass.tsv, to create a standard MSA file in FASTA format.
readIgphymlOutPhylo.R: R script template that reads igPhyML output, typically xxx_igphyml-pass.tab, and generates trees in Newick format.

bedstat.py: Customise and view statistical properties of a BED file. Written in Python 2.
parseEnsemblGeneNamesDesc.py: Parses the downloaded Ensembl BioMart file to create a clean output. Combines multiple phenotypes from the same geneID into one entry. Written in Python 2.
parseProteinFunction_csv.py: Parses protein sequences downloaded from Ensembl and outputs the proteinID and function in two columns. Written in Python 2.