BEstimate

BEstimate is a Python module that systematically identifies guide RNA (gRNA) targetable sites across given sequences for given Base Editors. It reports the functional and clinical effects of the potential edits on the resulting proteins, together with on-target scores and off-target consequences of the identified sequences. It provides in silico analysis of sequences to identify positions that are editable by Base Editors, and their features, before experiments begin.

Quick start installation

Clone the BEstimate source

git clone https://github.com/Garnett-Lab/BEstimate.git
cd BEstimate

Use Conda to create and activate a dedicated Python 3.13 virtual environment and install BEstimate:

conda env create -yf bestimate.yml && conda activate bestimate
pip install -e .

Installation for on-target analysis

On-target scoring requires Python 3.10 and must be installed in a separate environment from the main BEstimate installation above, which requires Python 3.12 or 3.13.

Make sure you are in the cloned repository from the last installation step. Then use Conda to create and activate a dedicated Python 3.10 virtual environment:

conda env create -yf bestimate_ontarget.yml && conda activate bestimate_ontarget

Install the on-target dependencies:

pip install -r requirements_ontarget.txt

Install the FORECasT-BE dependency:

git clone https://github.com/ananth-pallaseni/FORECasT-BE.git
cd FORECasT-BE
pip install --no-deps -e .

Run BEstimate

Examples with BEstimate

Activate the bestimate Python 3.13 environment if not already activated:

conda activate bestimate

To analyse the SRY gene with an NGG PAM sequence and a CBE (C to T editing), without VEP and protein analysis:

BEstimate -gene SRY -assembly GRCh38 -pamseq NGG -pamwin 21-23 -actwin 4-8 -protolen 20 -edit C -edit_to T -o ../output/ -ofile SRY_CBE_NGG

To run the same analysis with a different PAM, change only the -pamseq value, for example -pamseq NGN.

Warning: The length of the PAM sequence must match the length of the -pamwin window. Here, NGN matches 21-23 (3 nucleotides). For a two-nucleotide PAM such as NG, use -pamseq NG with -pamwin 21-22.
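
The length rule in the warning above can be checked before launching a run. The following is a minimal sketch, not part of BEstimate; the helper name `pam_matches_window` is hypothetical, and it assumes the window is written as `start-end` with both ends inclusive, as in the examples.

```python
def pam_matches_window(pamseq: str, pamwin: str) -> bool:
    """Return True when the PAM length equals the span of the PAM window.

    Hypothetical helper for illustration; BEstimate performs its own checks.
    """
    start, end = (int(part) for part in pamwin.split("-"))
    # The window is inclusive on both ends, e.g. 21-23 covers 3 positions.
    return len(pamseq) == end - start + 1


print(pam_matches_window("NGN", "21-23"))  # True
print(pam_matches_window("NG", "21-23"))   # False
print(pam_matches_window("NG", "21-22"))   # True
```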

To analyse a specific transcript and run the protein analysis:

BEstimate -gene SRY -assembly GRCh38 -transcript ENST00000383070 -edit C -edit_to T -vep -o ../output/ -ofile SRY_CBE_NGG

Warning: Because this run used the -vep option, the input file for on-target scoring is summary_df.csv. Without -vep, edit_df can be used instead.

# Activate the Python 3.10 on-target environment (conda) if not already activated
conda activate bestimate_ontarget
# and move to the root of BEstimate folder
cd ../
python -m BEstimate BEstimate/x_ontarget.py -gene SRY -assembly GRCh38 -iname 'summary_df' -edit C -o ../output/ -ofile SRY_CBE_NGG -rs3 -fc

To analyse a specific point mutation with an NGN PAM and with VEP and protein analysis, first prepare a mutation file, for example PIK3CA_mutation_file.txt containing 3:g.179218303G>A:

# activate the Python 3.13 environment if you have not already done so
conda activate bestimate
BEstimate -gene PIK3CA -assembly GRCh38 -pamseq NGN -pamwin 21-23 -actwin 4-8 -protolen 20 -mutation_file PIK3CA_mutation_file.txt -edit A -edit_to G -vep -ofile PIK3CA_NGN_ABE_mE545K -o ../output/
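
The mutation entry used above follows the pattern chromosome:g.position, reference base, `>`, alternative base. As a hedged illustration of that format only, the sketch below parses such a line. The function name `parse_mutation` and the regular expression are assumptions made for this example; BEstimate's own parser may accept other forms.

```python
import re

# Pattern for entries such as 3:g.179218303G>A (assumed: single-nucleotide changes only).
HGVS_SNV = re.compile(
    r"^(?P<chrom>[0-9]{1,2}|X|Y|MT):g\.(?P<pos>[0-9]+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_mutation(line: str) -> dict:
    """Split one mutation entry into chromosome, position, reference and alternative base."""
    match = HGVS_SNV.match(line.strip())
    if match is None:
        raise ValueError(f"Not a genomic SNV in the expected form: {line!r}")
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    return fields


print(parse_mutation("3:g.179218303G>A"))
# {'chrom': '3', 'pos': 179218303, 'ref': 'G', 'alt': 'A'}
```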

To run on-target with mutations, switch to the on-target environment and give the mutation_file again:

# Activate the Python 3.10 on-target environment (conda) if not already activated
conda activate bestimate_ontarget
python -m BEstimate BEstimate/x_ontarget.py -gene PIK3CA -assembly GRCh38 -mutation_file PIK3CA_mutation_file.txt -iname 'summary_df' -edit C -o ../output/ -ofile PIK3CA_NGN_ABE_mE545K -rs3 -fc

If you have your own library for the MYC gene, you can pass the CSV file (library.csv) to BEstimate for annotation as shown below.

Warning: The file must contain the columns CRISPR_PAM_Sequence (gRNA sequence with its PAM), Direction (left or right; the orientation of the gRNA) and Location (genomic location of the CRISPR_PAM_Sequence):

# Activate the Python 3.13 environment if you have not already done so
conda activate bestimate
BEstimate -gene MYC -assembly GRCh38 -pamseq NGN -pamwin 21-23 -actwin 3-9 -protolen 20 -library_file library.csv -edit A -edit_to G -vep -ofile annotated_library -o ../output/
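
Before passing a library file to BEstimate, it can help to confirm that the three required column names are present. The sketch below uses only the Python standard library. The helper `missing_library_columns` is hypothetical and checks the header row only, not the cell contents.

```python
import csv
import io

# Column names required by the warning above.
REQUIRED_COLUMNS = {"CRISPR_PAM_Sequence", "Direction", "Location"}


def missing_library_columns(handle) -> set:
    """Return the required column names that are absent from a CSV header."""
    reader = csv.DictReader(handle)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# In-memory headers stand in for library.csv in this illustration.
complete = io.StringIO("CRISPR_PAM_Sequence,Direction,Location\n")
incomplete = io.StringIO("CRISPR_PAM_Sequence,Location\n")

print(missing_library_columns(complete))    # set()
print(missing_library_columns(incomplete))  # {'Direction'}
```

For a real file, open it with `open("library.csv", newline="")` and pass the handle to the same function.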

Off-Target Analysis

To run the off-target analysis, the Ensembl genome must first be indexed for the PAM sequence of interest.

The x_genome program downloads the required files and indexes the genome for CRISPRs. It will:

  • Download the specified FASTA genome assembly files from the Ensembl project,
  • Gather CRISPRs from the FASTA files into CSV files detailing chromosome, position in chromosome, as well as PAM position,
  • Generate a binary list of gRNA signatures (accounting for PAM position),
  • Insert the CRISPRs into a SQLite database for cross-referencing the gRNAs found in the binary list.

To run the x_genome program, see the Command line usage and options section below.

For example:

conda activate bestimate
x_genome --pamseq NGN --assembly GRCh38 --ensembl_version 113
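
The CRISPR-gathering step listed above can be pictured with a toy scan over a short sequence. This is a simplified sketch and not x_genome's implementation: it looks at the forward strand only, supports only A, C, G, T and N in the PAM, and the function name `find_forward_sites` is hypothetical.

```python
import re

# Minimal IUPAC mapping; N matches any base (assumption: no other ambiguity codes needed).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def find_forward_sites(sequence: str, pamseq: str = "NGG", protolen: int = 20):
    """Yield (0-based protospacer start, protospacer, PAM) for forward-strand sites."""
    pam_regex = "".join(IUPAC[base] for base in pamseq)
    # A lookahead lets overlapping sites be reported.
    pattern = re.compile(rf"(?=([ACGT]{{{protolen}}})({pam_regex}))")
    for match in pattern.finditer(sequence.upper()):
        yield match.start(), match.group(1), match.group(2)


toy = "TTTT" + "ACGTACGTACGTACGTACGT" + "AGG" + "TTTT"
for site in find_forward_sites(toy):
    print(site)
# (4, 'ACGTACGTACGTACGTACGT', 'AGG')
```

Reverse-strand sites would need the same scan on the reverse complement, which is omitted here for brevity.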

The gathering of CRISPRs from the genome assembly takes a while and requires a fair amount of disk storage. For example, using the GRCh38 genome assembly:

PAM sequence    Space (GB)    Run time
NGG             38            ~3 hours
NGN             140           ~9 hours

You can then run the off-target analysis, shown here for the SRY gene:

BEstimate -gene SRY -assembly GRCh38 -pamseq NGN -edit A -edit_to G -vep -ot -o ../output -ot_path ../offtargets -ofile SRY_ABE_NGN
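
The off-target output reports, for each gRNA, the number of exact alignments and of alignments with one to four mismatches (the exact and mm1 to mm4 columns described under Output Interpretation). The sketch below shows what those counts mean using a naive comparison against a list of candidate sequences. It is for illustration only and does not reflect the indexed search that x_crispranalyser performs; the function names are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_profile(guide: str, candidates, max_mismatches: int = 4) -> dict:
    """Tally candidates by mismatch count, using the exact/mm1..mm4 column names."""
    counts = Counter()
    for candidate in candidates:
        distance = hamming(guide, candidate)
        if distance <= max_mismatches:
            counts["exact" if distance == 0 else f"mm{distance}"] += 1
    return {key: counts.get(key, 0) for key in ("exact", "mm1", "mm2", "mm3", "mm4")}


# Short toy sequences; real gRNAs are longer.
print(mismatch_profile("ACGTACGT", ["ACGTACGT", "ACGTACGA", "TCGTACGA", "TTTTTTTT"]))
# {'exact': 1, 'mm1': 1, 'mm2': 1, 'mm3': 0, 'mm4': 0}
```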

Command line usage and options

Once the package is installed, three programs are available from the command line:

  • BEstimate - the main program to find and analyse Base Editor sites
  • x_genome - the program to download and index a genome for off-target analysis
  • x_crispranalyser - the program to run off-target analysis on guides

x_ontarget is run directly from the cloned source in the separate Python 3.10 environment; see Installation for on-target analysis.

BEstimate command line options:
BEstimate --help
usage: BEstimate [inputs]

********************************** Find and Analyse Base Editor sites **********************************

Mandatory Inputs:
  -h, --help                    show this help message and exit
  --version                     Show program's version number and exit.
  -gene GENE                    The hugo symbol of the interested gene!
  -assembly ASSEMBLY            The genome assembly that will be used!
  -transcript TRANSCRIPT        The interested ensembl transcript id
  -uniprot UNIPROT              The interested Uniprot id
  -pamseq PAMSEQ                The PAM sequence in which features used for searching activity window and editable nucleotide.
  -pamwin PAMWINDOW             The index of the PAM sequence when starting from the first index of protospacer as 1.
  -actwin ACTWINDOW             The index of the activity window when starting from the first index of protospacer as 1.
  -protolen PROTOLEN            The total protospacer and PAM length.
  -vep                          The boolean option if user wants to analyse the edits through VEP and Uniprot.
  -mutation_file MUTATION_FILE  A file for the mutations on the interested gene that you need to integrate into guide and/or annotation analysis
  -library_file LIBRARY_FILE    Existing library file: should include:
			  1. gRNA + PAM Sequence (5>3'): column name -> CRISPR_PAM_Sequence
			  2. gRNA genomic location (smallest>largest): column name -> Location
			  3. gRNA directionality: column name -> Direction
  -flank                        The boolean option if the user wants to add flanking sequences of the gRNAs
  -flank3 FLAN_3                The number of nucleotides in the 3' flanking region
  -flank5 FLAN_5                The number of nucleotides in the 5' flanking region
  -edit {A,T,G,C}               The nucleotide which will be edited.
  -edit_to {A,T,G,C}            The nucleotide after edition.
  -o OUTPUT_PATH                The path for output. If not specified the current directory will be used!
  -ofile OUTPUT_PATH            The output file name, if not specified "position" will be used!
  -ot OFF_TARGET                The boolean option if off targets will be computed or not
  -genome GENOME                (If -ot provided) name of the genome file
  -v_ensembl VERSION            The ensembl version in which genome will be retrieved (if the assembly is GRCh37 then please use <=75)
  -ot_path OT_PATH              The path of the offtargets folder. default is os.getcwd() + "/../offtargets/"

x_genome command line options:
usage: x_genome [inputs]

Script for indexing CRISPRs for finding off-targets

options:
  -h, --help            show this help message and exit
  --version             Show the version number and exit
  --pamseq PAMSEQ, -p PAMSEQ
                        The PAM sequence in which features used for searching activity window and editable nucleotide.
  --assembly {GRCh38,GRCh37}, -a {GRCh38,GRCh37}
                        The genome assembly that will be used!
  --output_path OUTPUT_PATH, -o OUTPUT_PATH
                        The path for output. If not specified the current directory will be used!
  --ensembl_version ENSEMBL_VERSION, -e ENSEMBL_VERSION
                        The ensembl version in which genome will be retrieved (if the assembly is GRCh37 then please use <=75)
  --offtargets_path OFFTARGETS_PATH, -ot OFFTARGETS_PATH
                        The path to the root offtargets output directory

x_crispranalyser command line options:
usage: x_crispranalyser [inputs]

Script for finding off-targets

options:
  -h, --help            show this help message and exit
  --version             Show the version number and exit
  --input_csv INPUT_CSV, -i INPUT_CSV
                        The input CSV file to be analysed
  --binary_index BINARY_INDEX, -b BINARY_INDEX
                        The CRISPR binary index file generated by x_genome.py
  --output_csv OUTPUT_CSV, -o OUTPUT_CSV
                        The output CSV generated
  --db_file DB_FILE, -d DB_FILE
                        The CRISPR DB file generated by x_genome.py

x_ontarget command line options:
usage: x_ontarget [inputs]

Script for predicting on-target scores

options:
  -gene GENE            The hugo symbol of the interested gene!
  -assembly ASSEMBLY    The genome assembly that will be used!
  -mutation_file MUTATION_FILE
                        A file for the mutations on the interested gene that you need to integrate into guide and/or annotation analysis
  -rs3 RS3              The boolean option if the user wants to add on-target RuleSet3 scoring for the gRNAs
  -fc FCAST             The boolean option if the user wants to add on-target FORECasT-BE gRNA efficiency info
  -edit EDIT            The searched nucleotide
  -iname INPUT          The input BEstimate file extension (edit_df / summary_df / ot_annotated_df)
  -ofile OUTPUT_FILE    The output file name
  -o OUTPUT_PATH        The path for output. If not specified the current directory will be used!

Output Interpretation

BEstimate produces several files for different purposes:

  • crispr_df: All gRNAs with and without editable sites
crispr_df column interpretation
columns:
    - Hugo_Symbol: Hugo Symbol of the corresponding gene location.
    - CRISPR_PAM_Sequence: gRNA target sequence with including PAM region.
    - gRNA_Target_Sequence: gRNA target sequence.
    - Location: Location of the gRNA target sequence (chromosome:start location:end location).
    - Direction: Direction of the gRNA.
    - Gene_ID: Ensembl Gene ID of the interested Hugo Symbol.
    - Transcript_ID: Ensembl Transcript ID of the interested Hugo Symbol in the corresponding gene location.
    - Exon_ID:  Ensembl Exon ID of the interested Hugo Symbol in the corresponding gene location.
    - guide_in_CDS: If gRNA has any nucleotide inside a coding sequence of the gene.
    - gRNA_flanking_sequences: If the user has given the `-flank` option, the gRNA with its flanking sequences
    - Poly_T: Boolean output representing if the gRNA has a Poly-T region
    - GC%: GC content of the gRNA
  • edit_df: gRNAs with editable sites within the targeted sequence
edit_df column interpretation
columns:
    - Hugo_Symbol: Hugo Symbol of the corresponding gene location.
    - CRISPR_PAM_Sequence: gRNA target sequence with including PAM region.
    - gRNA_Target_Sequence: gRNA target sequence.
    - Location: Location of the gRNA target sequence (chromosome:start location:end location).
    - Edit_Location: Location of the possible edit with the given Base Editor.
    - Direction: Direction of the gRNA.
    - Strand: Strand of the interested gene on the genome (-1 or 1)
    - Gene_ID: Ensembl Gene ID for the corresponding gRNA target site region.
    - Transcript_ID: Ensembl Transcript ID for the corresponding gRNA target site region.
    - Exon_ID:  Ensembl Exon ID for the corresponding gRNA target site region.
    - guide_in_CDS: Boolean output representing if any nucleotide of the gRNA is in the CDS region or not.
    - gRNA_flanking_sequences: If the user has given the `-flank` option, the gRNA with its flanking sequences
    - Edit_in_Exon: Boolean output representing if the specified edit at the Edit_Location is in an exon or not.
    - Edit_in_CDS: Boolean output representing if the specified edit at the Edit_Location is in the CDS or not.
    - GC%: GC content of the gRNA
    - # Edits/guide: Number of editable nucleotides within the activity window of the gRNA
    - Poly_T: Boolean output representing if CRISPR_PAM_Sequence has 4 consecutive T nucleotides.
    - mutation_on_guide: If any mutation is provided, boolean output representing if the mutation is included within the gRNA sequence.
    - guide_change_mutation: If any mutation is provided, boolean output representing if the gRNA can change the specified mutation.
    - mutation_on_window: If any mutation is provided, boolean output representing if the mutation is included within the activity window of the gRNA sequence
    - mutation_on_PAM: If any mutation is provided, boolean output representing if the mutation is included within the PAM sequence of the gRNA sequence
  • hgvs_df: If -vep provided, a file with VEP API inputs for each gRNA
  • vep_df: If -vep provided, an edit_df file with VEP API annotation for each gRNA
  • protein_df: If -vep provided, a vep_df file with protein and structural annotation for each gRNA
protein_df column interpretation
additional columns:
    - HGVS: HGVS nomenclature of the gRNA potential edits
    - Protein_ID:  Ensembl Protein ID of the potential edit.
    - VEP_input: HGVS nomenclature which was used for VEP API.
    - allele: Allelic change of the potential edit.
    - variant_classification: VEP classification of the resulting variant of the potential edit. (Substitution, SNV etc.)
    - most_severe_consequence: The most severe consequence of the resulting variant of the potential edit.
    - consequence_terms: List of functional consequences of the resulting variant of the potential edit (splice region, missense, stop gain etc.)
    - variant_biotype:  The biotype of the variant created by the potential edit.
    - Regulatory_ID: Ensembl Regulatory ID of the potential edit.
    - Motif_ID: Ensembl Model ID of the potential edit.
    - TFs_on_motif: List of transcription factors on the Ensembl Regulatory ID of the potential edit.
    - cDNA_Change: cDNA changes resulting from the possible edits with the corresponding guides (position old nucleotide> new nucleotide)
    - Edited_codon: Codon sequence before the corresponding edit.
    - New_codon: Codon sequence after the corresponding edit.
    - Protein_ID: Ensembl Protein ID for the corresponding gRNA target site region.
    - CDS_Position: Position on the coding sequence of the potential edit.
    - Protein_Position_ensembl: Location of the corresponding edit on the resulting protein (as Ensembl PID index).
    - Protein_Position: Location of the corresponding edit on the resulting protein (as Uniprot index).
    - Protein_Change: Protein sequence change with the corresponding edit (old amino acid - position - new amino acid).
    - Edited_AA: One letter representation of the amino acid before the corresponding edit.
    - Edited_AA_Prop: Chemical properties of the amino acid before the corresponding edit.
    - New_AA: One letter representation of the amino acid after the corresponding edit.
    - New_AA_Prop: Chemical properties of the amino acid after the corresponding edit.
    - is_Synonymous: Boolean output representing if the resulting edit causes synonymous or non-synonymous mutations.
    - is_Stop: Boolean output representing if the resulting edit causes stop codon or not.
    - proline_addition: Boolean output representing if potential edit created a Proline amino acid or not.
    - swissprot_vep: SwissProt ID of the corresponding gRNA target site region from Ensembl VEP.
    - uniprot_provided: Uniprot ID from the user.
    - polyphen_score: Polyphen Score of the corresponding edit.
    - polyphen_prediction: Polyphen Prediction of the corresponding edit.
    - sift_score: Sift Score of the corresponding edit.
    - sift_prediction: Sift Prediction of the corresponding edit.
    - cadd_phred: CADD Prediction of the corresponding edit.
    - cadd_raw: CADD raw score of the corresponding edit.
    - lof: representing if the corresponding edit causes Loss of function with high confidence (HC)/Low confidence (LC) or not by implementing [LOFTEE](https://github.com/konradjk/loftee) through VEP.
    - impact: Impact of the corresponding edit.
    - blosum62: blosum62 Score of the corresponding edit.
    - is_clinical: representing if the corresponding edit has Clinical consequences in ClinVar or not.
    - clinical_id: dbSNP id of the clinical allele.
    - clinical_significance: Clinical significance of the corresponding edit.
    - cosmic_id: COSMIC id of the clinical allele.
    - clinvar_id: ClinVar id of the clinical allele.
    - ancestral_populations: The conservation score from the Ensembl Compara databases
    - Domain: Protein domain in which the corresponding edit occurs.
    - curated_Domain: Curated name of the protein domain in which the corresponding edit occurs.
    - PTM: Post Translational Modification sites in which the corresponding edit occurs.
    - is_disruptive_interface_EXP: Boolean output representing if the corresponding edit occurs in the interface region and has experimental evidence (PDB). *High confidence*
    - is_disruptive_interface_MOD: Boolean output representing if the corresponding edit occurs in the interface region and has evidence from a model (Interactome3D). *Medium confidence*
    - is_disruptive_interface_PRED: Boolean output representing if the corresponding edit occurs in the interface region and has evidence from prediction (Interactome Insider). *Low confidence*
    - disrupted_PDB_int_partners: List of partner proteins whose interactions are disrupted by the corresponding edit according to experimental evidence - PDB.
    - disrupted_I3D_int_partners: List of partner proteins whose interactions are disrupted by the corresponding edit according to Interactome3D.
    - disrupted_Eclair_int_partners: List of partner proteins whose interactions are disrupted by the corresponding edit according to Eclair - prediction algorithm of Interactome Insider.
    - disrupted_PDB_int_genes: List of partner genes (Hugo Symbols) whose interactions are disrupted by the corresponding edit according to experimental evidence - PDB.
    - disrupted_I3D_int_genes: List of partner genes (Hugo Symbols) whose interactions are disrupted by the corresponding edit according to Interactome3D.
    - disrupted_Eclair_int_genes: List of partner genes (Hugo Symbols) whose interactions are disrupted by the corresponding edit according to Eclair - prediction algorithm of Interactome Insider.
  • ot_annotated_df: If -ot provided, an edit_df/protein_df file with off-target annotation for each gRNA
ot_annotated_df column interpretation
additional columns:
    - exact: The number of exact alignments of gRNA sequence.
    - mm1: The number of alignments of gRNA sequence with one mismatch.
    - mm2: The number of alignments of gRNA sequence with two mismatches.
    - mm3: The number of alignments of gRNA sequence with three mismatches.
    - mm4: The number of alignments of gRNA sequence with four mismatches.
  • scored_df: If -rs3 or -fc provided, an edit_df/protein_df/ot_annotated_df file with on-target scoring for each gRNA
scored_df column interpretation
additional columns:
    - rs3_sequence: The flanking region and gRNA sequence for the RS3 analysis.
    - RuleSet3_Hsu2013: gRNA on-target activity score based on Hsu2013 tracrRNA gRNA with Rule Set 3
    - RuleSet3_Chen2013: gRNA on-target activity score based on Chen2013 tracrRNA gRNA with Rule Set 3
    - FORECasT-BE: gRNA predicted guide efficacy
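
Several of the simpler columns above (GC%, Poly_T and # Edits/guide) can be reproduced by hand for a single guide. The sketch below follows the column descriptions in this README: four consecutive T nucleotides for Poly_T, and a 1-based, inclusive activity window such as 4-8. The function names are hypothetical and BEstimate's own calculations may differ in detail, for example in how guide direction is handled.

```python
def gc_percent(grna: str) -> float:
    """Percentage of G and C bases in the gRNA."""
    grna = grna.upper()
    return 100.0 * sum(base in "GC" for base in grna) / len(grna)


def has_poly_t(sequence: str, run_length: int = 4) -> bool:
    """True when the sequence contains run_length consecutive T nucleotides."""
    return "T" * run_length in sequence.upper()


def edits_in_window(grna: str, edit: str, actwin: str = "4-8") -> int:
    """Count editable nucleotides inside the activity window."""
    start, end = (int(part) for part in actwin.split("-"))
    # Positions are 1-based and inclusive, so shift the start by one for slicing.
    return grna.upper()[start - 1:end].count(edit.upper())


guide = "ACGCCTTTTGACGTACGTAC"  # made-up 20-nt sequence for illustration
print(gc_percent(guide))            # 50.0
print(has_poly_t(guide))            # True
print(edits_in_window(guide, "C"))  # 2
```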

Contact

BEstimate is the product of Cansu Dinçer, Matthew Coelho and Mathew Garnett from the Garnett Group at the Wellcome Sanger Institute. Off-target analysis has been adapted by Bo Fussing from the Cellular Informatics team within the Wellcome Sanger Institute.

If you have any problems or feedback regarding BEstimate, please contact the authors at the email addresses listed under License.

License

GNU AFFERO GENERAL PUBLIC LICENSE

BEstimate: A Python module to design and annotate base editor gRNAs

Copyright (C) 2025-2026 Genome Research Limited

Authors: Cansu Dinçer (cd7@sanger.ac.uk), Bo Fussing (bf15@sanger.ac.uk)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

Further Disclaimer

This tool is for research purposes and not for clinical use. For policies regarding the underlying data, please also refer to:

Development

Software Requirements

Development requirements are quite minimal, with two approaches: a virtual environment or a VSCode devcontainer.

Virtual Environment Requirements
  • Git Hubflow tools:
  • Python 3.12 or higher. It is up to the developer how to install a specific Python if the system default is not suitable.
    • It's recommended to use pyenv as it offers flexibility in managing multiple Python versions. Works on Linux and MacOS, and installing via the command line is straightforward:
      curl https://pyenv.run | bash
      pyenv install 3.12.4
      cd <project-repository-directory>
      pyenv local 3.12.4
    • Via poetry itself. Since poetry 2.1.0 running poetry python install 3.12 will install a standalone Python
    • On Ubuntu Linux via apt-get. The default python3 is typically old, so a PPA is needed for newer versions. A well-known one is the Deadsnakes PPA:
      sudo apt update
      sudo apt install -y software-properties-common
      
      # Add the Deadsnakes PPA and refresh package lists
      sudo add-apt-repository -y ppa:deadsnakes/ppa
      sudo apt update
      
      # Install Python 3.12, its dev headers, and venv support
      sudo apt install -y python3.12 python3.12-dev python3.12-venv
      
      # Bootstrap pip for this interpreter, then upgrade basics
      python3.12 --version  # sanity check
      python3.12 -m ensurepip --upgrade
      python3.12 -m pip install --upgrade pip setuptools wheel
    • On MacOS via Homebrew
      brew update
      brew install python@3.12
      python3.12 --version # sanity check
  • Poetry 2.2 or higher. Poetry is used for dependency management and packaging, and there are multiple ways to install it. The official docs will have up-to-date instructions, but in summary:
    • On Linux, one typically installs pipx with sudo apt-get and then runs pipx install poetry to ensure Poetry is up to date and isolated from system Python packages. Do not directly apt-get install poetry as that version is out of date and not compatible with this project.
    • On MacOS, if you have Homebrew installed, you can use it to install Poetry: brew update && brew install poetry.

VSCode devcontainer Requirements

Working with a devcontainer is the easiest way to get started with development as all dependencies and tools are pre-installed. Working with VSCode is optional as other IDEs/editors can attach to a running container or you can run commands directly in the container with docker run.

You will need:

Up-to-date instructions for installing Docker and VSCode can be found on their respective websites, but the VSCode instructions summarise all the main steps.

An important note about SSH and working with GitLab/GitHub from within a devcontainer: forward your keys via your ssh-agent:

~/.ssh/config on your Laptop/Workstation (not OpenStack/Farm or other remote host):

Host *
  AddKeysToAgent yes
  UseKeychain yes
  IdentityFile ~/.ssh/id_rsa
  TCPKeepAlive yes
  ServerAliveInterval 120

One time setup

Virtual Environment Setup

The process to set up the development environment is as follows:

git clone https://github.com/Garnett-Lab/BEstimate.git
cd BEstimate
python3.12 -m venv .venv
source .venv/bin/activate
poetry install
pre-commit install --install-hooks

DevContainer Setup

The process to set up the development environment is as follows:

git clone https://github.com/Garnett-Lab/BEstimate.git
cd BEstimate
docker build -t bestimate-dev:local -f .devcontainer/Dockerfile .
docker run -it --rm -v "$(pwd)":/opt/repo bestimate-dev:local bash

# Inside the container
pwd # should be /opt/repo
BEstimate --version # should show version
exit # exit the container

Next, if using VSCode, open the project folder within VSCode and open the command palette (Shift+CMD+P or Shift+CTRL+P) and select "Remote-Containers: Rebuild and Reopen in Container"

Adding new dependencies

To add a new dependency, use Poetry to add it to the pyproject.toml and poetry.lock files:

poetry add <package-name>
# or for a development dependency:
poetry add --group dev <package-name>

To constrain the version of a package, see the Poetry versioning docs.

What does the lockfile poetry.lock do?

The poetry.lock file ensures that everyone working on the project uses the same versions of dependencies, which helps to avoid "it works on my machine" problems.

It separates dependency-of-dependencies from human-specified dependencies in pyproject.toml, and pins them to specific versions.

However, the lockfile is a 'disposable' file in that it can be regenerated from the pyproject.toml file if needed. The lockfile should always be committed to version control. If there are merge conflicts in the lockfile, discard it and regenerate it with poetry lock.

Finally, downstream users of the BEstimate package do not benefit from the lockfile, and install the dependencies as specified in pyproject.toml (but not dev dependencies).

What does the requirements.txt do?

The requirements.txt file is generated by Poetry and pre-commit as an artifact and to allow developers to install dependencies with pip install -r requirements.txt in environments where Poetry is not available. It should not be manually edited.

Formatting and pre-commit hooks

This project uses pre-commit to manage elements of code formatting and linting. See the One time setup section for installation and setup.

Pre-commit runs automatically on git commit. It will generally modify and correct files itself; these files need to be staged again before the commit can complete.

To run the pre-commit hooks manually:

pre-commit run -a

To commit while bypassing pre-commit (there are occasionally good reasons to do this):

git commit --no-verify -m "My commit message"

The pre-commit configuration is in .pre-commit-config.yaml and includes:

  • built-in hooks for checking for end-of-file newlines, trailing whitespace and ensuring valid JSON, YAML and TOML files
  • black - Python code formatter
  • flake8 - Python code linter
  • poetry-check
  • poetry-export - generates requirements.txt from poetry.lock

Testing

Tests are in the tests/ directory and use pytest.

To run the tests:

pytest tests/

Or with the development Docker image:

docker build -f Dockerfile-dev -t 'bestimate-dev:local' .
docker run -it --rm bestimate-dev:local pytest tests/

Or with the public image:

docker build -t bestimate:local -f Dockerfile .
# The tests don't exist in the image and pytest is not installed
docker run -it --rm \
    -v ./tests/:/opt/repo/tests \
    -w /opt/repo \
    bestimate:local \
    bash -c  'pip install pytest && python -m pytest tests/'

CICD (GitLab CI)

The CI in .gitlab-ci.yml uses the CICD template repository and includes stages that:

  • build two Docker images, from Dockerfile and Dockerfile.dev
  • run e2e tests, pytest and pre-commit against the built images
  • publish
    • if on a tag e.g 1.2.3 publishes the image to GitLab Container Registry as <image>:<tag> and publishes the package to GitLab PyPI as <package>:<tag>
    • if on main branch, publishes the image to Docker Hub as <image>:latest
    • if on develop branch, publishes the image to Docker Hub as <image>:develop-latest

Certain CI variables are maintained in this repository's CICD settings. Of note are:

  • GITLAB_DEPLOY_TOKEN_RW and GITLAB_DEPLOY_USERNAME_RW - used to authenticate with the GitLab container registry and PyPI (the GITLAB_CI_TOKEN doesn't have API write permissions)
  • DOCKER_HUB_USER and DOCKER_HUB_ACCESS_TOKEN - used to authenticate with Docker Hub to allow pulling images without interfering with the Sanger/DockerHub rate limits

Git and Tagged releases

This repo implements the GitFlow branching model and uses hubflow as a tool to enable this from the CLI.

# Switch to the develop branch
git checkout develop

# Start a new release branch e.g. 0.1.0 not v0.1.0
git hf release start <project_version>

Now, do the following things:

  • CHANGELOG.md: Under the heading of the newest release version, describe what was changed, fixed, added.
  • pyproject.toml: Increment the project version to the current release version
  • Commit these changes
  • Run pre-commit run -a to ensure no formatting issues

Finally

git hf release finish <project_version>
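
The release steps above expect a version such as 0.1.0 rather than v0.1.0. A small check like the one below could catch a stray prefix before starting the release branch. It is a sketch that assumes versions are three dot-separated numbers; the name `is_release_version` is hypothetical and not part of the repository tooling.

```python
import re

# Assumed form: MAJOR.MINOR.PATCH with digits only, no leading "v".
RELEASE_VERSION = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+")


def is_release_version(version: str) -> bool:
    """True when the string is a plain three-part numeric version."""
    return RELEASE_VERSION.fullmatch(version) is not None


print(is_release_version("0.1.0"))   # True
print(is_release_version("v0.1.0"))  # False
```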
