sr-murthy/rosalind

Solutions to the ROSALIND problem set

Selected Python solutions to the ROSALIND bioinformatics problem set, including some data structures and generic utilities.

Notes

  • The solutions are a work in progress, and more will be added as time permits. The implementations are based on solutions to the graded problems on ROSALIND.

  • solutions.py is the main solution set, while utils.py contains generic utilities which are used in the solutions, as required.

  • A basic set of tests has been added, mostly based on solutions to the example problems in ROSALIND.

  • Solutions always return raw values and do no output formatting, even for problems such as GC (Computing GC Content), where the marking in ROSALIND depends on the answer being formatted in a particular way.

  • In problems with numerical solutions, decimal.Decimal objects are returned instead of float values, where possible, to keep results as exact as possible.

  • Several functions (in utils.py and solutions.py) use caching with functools.cache, which requires that the arguments passed to these functions are hashable: in particular, all array-valued arguments must be tuples, because Python tuples are immutable and hence hashable (provided their elements are hashable too).

  • The function docstrings are written using the NumPy docstring style.

  • For more background on linguistic complexity (LC) refer here (page W630).

  • The counting of k-mers in the KMER (k-Mer Composition) problem must take overlapping substrings into account. The Python standard library str.count method only counts non-overlapping occurrences, so a custom function is used for this purpose.

  • The solutions to several problems, including SSEQ (Finding a Spliced Motif), find and return arrays of indices of a matching subsequence or substring as 1-indexed arrays, as required by the problems. To do this they convert the 0-indexed array indices returned by the generic utility functions that they call.

  • The solution to EDIT (Edit Distance) is (cached) recursive, which is slower than equivalent iterative implementations, but more readable and easier to understand. Separately, the solution allows the insertion, deletion, and substitution costs to be customised, with default values of 1, 1, 1 respectively.
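
To illustrate the note on EDIT above, here is a minimal sketch of a cached recursive edit distance with customisable costs. It is not the code in solutions.py: the function name edit_distance, the parameter names ins_cost, del_cost and sub_cost, and the inner helper dist are all assumptions made for this example.

```python
from functools import cache


def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Illustrative sketch only; names and signature are hypothetical."""

    @cache
    def dist(i, j):
        # Cost of turning the prefix s[:i] into the prefix t[:j].
        if i == 0:
            return j * ins_cost          # insert the first j symbols of t
        if j == 0:
            return i * del_cost          # delete the first i symbols of s
        match_or_sub = 0 if s[i - 1] == t[j - 1] else sub_cost
        return min(
            dist(i - 1, j) + del_cost,           # delete s[i - 1]
            dist(i, j - 1) + ins_cost,           # insert t[j - 1]
            dist(i - 1, j - 1) + match_or_sub,   # keep or substitute
        )

    return dist(len(s), len(t))


print(edit_distance("PLEASANTLY", "MEANLY"))
print(edit_distance("kitten", "sitting", sub_cost=2))
```

Caching on the integer pair (i, j) keeps the arguments hashable without passing the strings around. A recursive version like this can hit Python's recursion limit on long inputs, which is one practical reason an iterative table-based version may be preferred for large strings.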
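
The note on KMER above says that str.count skips overlapping matches. The sketch below shows the difference, using a hypothetical helper count_overlapping that is not taken from utils.py and is only one of several ways to do this.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones.

    Hypothetical helper written for illustration only.
    """
    if not pattern:
        raise ValueError("pattern must be a non-empty string")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one position after the last match start, so that
        # matches sharing characters with it are still found.
        start = text.find(pattern, start + 1)
    return count


non_overlapping = "AAAA".count("AA")           # built-in, non-overlapping
overlapping = count_overlapping("AAAA", "AA")  # includes overlaps
print(non_overlapping, overlapping)
```

For "AAAA" and the 2-mer "AA", str.count reports 2 while the overlapping count is 3 (matches starting at positions 0, 1 and 2).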
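
The note on functools.cache above can be seen in a small example. The function gc_count below is a hypothetical stand-in, not a function from this repository; it only shows that a cached function accepts a tuple argument but rejects a list.

```python
from functools import cache


@cache
def gc_count(bases):
    # Hypothetical helper: number of G and C symbols in a tuple of bases.
    return sum(1 for base in bases if base in ("G", "C"))


# A tuple is hashable, so the call works and its result is cached.
from_tuple = gc_count(("A", "G", "C", "T"))

# A list is unhashable, so the cache lookup raises TypeError
# before the function body runs.
try:
    gc_count(["A", "G", "C", "T"])
    list_accepted = True
except TypeError:
    list_accepted = False

print(from_tuple, list_accepted)
```

Callers holding a list can convert it with tuple(...) before calling a cached function.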

About

Python solutions to the ROSALIND (https://rosalind.info/) bioinformatics problem set.
