BEstimate is a Python module that systematically identifies guide RNA (gRNA) targetable sites across given sequences for given base editors. It also reports the functional and clinical effects of the potential edits on the resulting proteins, on-target scores, and off-target consequences of the identified sequences. It provides in silico analysis of sequences to identify positions that base editors can edit, together with their features, before experiments begin.
Clone the BEstimate source
git clone https://github.com/Garnett-Lab/BEstimate.git
cd BEstimate
Use Conda to create and activate a dedicated Python 3.13 virtual environment and install BEstimate:
conda env create -yf bestimate.yml && conda activate bestimate
pip install -e .
On-target scoring requires Python 3.10 and must be installed in a separate environment from the main BEstimate installation above, which requires Python 3.12 or 3.13.
Make sure you are in the cloned repository from the last installation step. Then use Conda to create and activate a dedicated Python 3.10 virtual environment:
conda env create -yf bestimate_ontarget.yml && conda activate bestimate_ontarget
Install the on-target dependencies:
pip install -r requirements_ontarget.txt
Install the FORECasT-BE dependency:
git clone https://github.com/ananth-pallaseni/FORECasT-BE.git
cd FORECasT-BE
pip install --no-deps -e .
Activate the bestimate Python 3.13 environment if not already activated:
conda activate bestimate
To run for the SRY gene with an NGG PAM sequence, with CBE (C to T editing) and without VEP and protein analysis:
BEstimate -gene SRY -assembly GRCh38 -pamseq NGG -pamwin 21-23 -actwin 4-8 -protolen 20 -edit C -edit_to T -o ../output/ -ofile SRY_CBE_NGG
The same analysis can be run for a different PAM simply by changing -pamseq, for example to NGN.
Warning: The PAM sequence must match the length of the -pamwin window. Here, NGN matches 21-23 (3 nucleotides). For a two-nucleotide PAM such as NG, use -pamseq NG with -pamwin 21-22.
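The concordance rule in the warning can be checked before launching a run. The following is a minimal sketch, not part of BEstimate: the helper names `parse_window`, `pam_matches_window` and `slice_window` are hypothetical, and the 23-nucleotide example site is made up for illustration. It assumes windows are 1-based and inclusive, as the option descriptions state.

```python
def parse_window(window):
    """Parse a 1-based inclusive window such as '21-23' into (start, end)."""
    start, end = (int(part) for part in window.split("-"))
    if start < 1 or end < start:
        raise ValueError(f"invalid window: {window!r}")
    return start, end


def pam_matches_window(pamseq, pamwin):
    """Return True when the PAM length equals the span of the PAM window."""
    start, end = parse_window(pamwin)
    return len(pamseq) == end - start + 1


def slice_window(sequence, window):
    """Return the bases of `sequence` covered by a 1-based inclusive window."""
    start, end = parse_window(window)
    return sequence[start - 1:end]


# Made-up 20-nt protospacer followed by a 3-nt PAM (illustrative only).
site = "GACCTCAAGTTTGCACTGAATGG"
print(pam_matches_window("NGN", "21-23"))  # True
print(pam_matches_window("NG", "21-23"))   # False
print(slice_window(site, "4-8"))           # CTCAA (activity window)
print(slice_window(site, "21-23"))         # TGG (PAM)
```

A check like this only validates the lengths; it does not verify that the PAM window sits directly after the protospacer.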
If you would like to run for a specific transcript and run the protein analysis:
BEstimate -gene SRY -assembly GRCh38 -transcript ENST00000383070 -edit C -edit_to T -vep -o ../output/ -ofile SRY_CBE_NGG
Warning: Because this run used the -vep option, the input file for on-target scoring is summary_df.csv. Without -vep, edit_df can be used instead.
# Activate the Python 3.10 on-target environment (conda) if not already activated
conda activate bestimate_ontarget
# and move to the root of BEstimate folder
cd ../
python -m BEstimate BEstimate/x_ontarget.py -gene SRY -assembly GRCh38 -iname 'summary_df' -edit C -o ../output/ -ofile SRY_CBE_NGG -rs3 -fc
To run with a specific point mutation, with an NGN PAM and with VEP and protein analysis:
Prepare a PIK3CA_mutation_file.txt for example with 3:g.179218303G>A
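The example entry follows the HGVS genomic substitution pattern (chromosome, `g.`, position, reference base, `>`, alternate base). As a sanity check before running BEstimate, an entry can be parsed with a short script. This is a sketch under assumptions: `parse_genomic_snv` is a hypothetical helper, it only handles single-nucleotide substitutions, and it says nothing about which other notations BEstimate itself accepts.

```python
import re

# Pattern for entries such as "3:g.179218303G>A" (single-nucleotide substitutions only).
HGVS_G_SNV = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_genomic_snv(entry):
    """Split an HGVS genomic SNV into chromosome, position, ref and alt."""
    match = HGVS_G_SNV.match(entry.strip())
    if match is None:
        raise ValueError(f"not a genomic SNV in HGVS notation: {entry!r}")
    return {
        "chrom": match["chrom"],
        "pos": int(match["pos"]),
        "ref": match["ref"],
        "alt": match["alt"],
    }


print(parse_genomic_snv("3:g.179218303G>A"))
```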
# activate the Python 3.13 environment if you have not already done so
conda activate bestimate
BEstimate -gene PIK3CA -assembly GRCh38 -pamseq NGN -pamwin 21-23 -actwin 4-8 -protolen 20 -mutation_file PIK3CA_mutation_file.txt -edit A -edit_to G -vep -ofile PIK3CA_NGN_ABE_mE545K -o ../output/
To run on-target with mutations, switch to the on-target environment and give the mutation_file again:
# Activate the Python 3.10 on-target environment (conda) if not already activated
conda activate bestimate_ontarget
python -m BEstimate BEstimate/x_ontarget.py -gene PIK3CA -assembly GRCh38 -mutation_file PIK3CA_mutation_file.txt -iname 'summary_df' -edit C -o ../output/ -ofile PIK3CA_NGN_ABE_mE545K -rs3 -fc
If you have your own library for the MYC gene, you can supply it as a CSV file (library.csv) to use BEstimate annotation, as follows.
Warning: The file must contain the columns CRISPR_PAM_Sequence (gRNA sequence with its PAM), Direction (left or right; the orientation of the gRNA) and Location (genomic location of the CRISPR_PAM_Sequence):
# Activate the Python 3.10 on-target environment (conda) if not already activated
BEstimate -gene MYC -assembly GRCh38 -pamseq NGN -pamwin 21-23 -actwin 3-9 -protolen 20 -library_file library.csv -edit A -edit_to G -vep -ofile annotated_library -o ../output/
To run the off-target analysis, you first need the Ensembl genome indexed for the PAM sequence of interest.
The x_genome program will download the required files and index the genome for CRISPRs as follows.
- Download the specified FASTA genome assembly files from the Ensembl project,
- Gather CRISPRs from the FASTA files into CSV files detailing chromosome, position in chromosome, as well as PAM position,
- Generate a binary list of gRNA signatures (accounting for PAM position),
- Insert the CRISPRs into a SQLite database for cross-referencing the gRNAs found in the binary list.
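The gathering and indexing steps listed above can be illustrated on a toy sequence. This is a simplified sketch, not the x_genome implementation: it scans the forward strand only, the table layout and the function names `encode_2bit`, `find_forward_sites` and `index_sites` are hypothetical, and the two-bit packing is just one plausible way to build a binary signature.

```python
import re
import sqlite3

# IUPAC codes needed for the PAMs used in this README (N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def encode_2bit(sequence):
    """Pack a DNA string into an integer, two bits per base."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in sequence:
        value = (value << 2) | codes[base]
    return value


def find_forward_sites(chrom, sequence, pamseq="NGG", protolen=20):
    """Yield (chrom, start, pam_start, site) for forward-strand sites, 1-based."""
    pam = "".join(IUPAC[base] for base in pamseq)
    # A lookahead lets overlapping sites be reported.
    pattern = re.compile(rf"(?=([ACGT]{{{protolen}}}{pam}))")
    for match in pattern.finditer(sequence.upper()):
        start = match.start() + 1
        yield (chrom, start, start + protolen, match.group(1))


def index_sites(sites, protolen=20):
    """Store sites in an in-memory SQLite table with a protospacer signature."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE crisprs (chrom TEXT, start INTEGER, pam_start INTEGER, "
        "sequence TEXT, signature INTEGER)"
    )
    rows = [(c, s, p, seq, encode_2bit(seq[:protolen])) for c, s, p, seq in sites]
    conn.executemany("INSERT INTO crisprs VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Made-up sequence containing a single NGG site (illustrative only).
toy = "TT" + "GACCTCAAGTTTGCACTGAA" + "TGG" + "AC"
sites = list(find_forward_sites("chrToy", toy))
conn = index_sites(sites)
print(sites)
print(conn.execute("SELECT COUNT(*) FROM crisprs").fetchone()[0])
```

A genome-scale index also has to cover the reverse strand and stream chromosomes from disk, which is where the run times and storage figures in this section come from.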
To run the x_genome program, see the command line options section below.
For example:
conda activate bestimate
x_genome --pamseq NGN --assembly GRCh38 --ensembl_version 113
Gathering CRISPRs from the genome assembly takes a while and requires a fair amount of disk storage. For example, using the GRCh38 genome assembly:
| Pam Sequence | Space (GB) | Run Time |
|---|---|---|
| NGG | 38 | ~3 Hours |
| NGN | 140 | ~9 Hours |
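Given these storage figures, it may be worth checking free disk space before starting the index. The sketch below uses the standard library's `shutil.disk_usage`. The helper `has_enough_space` and the `REQUIRED_GB` mapping are hypothetical, the figures are copied from the table above, and a gigabyte is assumed to be 10^9 bytes.

```python
import shutil

# Approximate disk needs in GB for GRCh38, copied from the table above.
REQUIRED_GB = {"NGG": 38, "NGN": 140}


def has_enough_space(path, pamseq, free_bytes=None):
    """Return True if `path` has enough free space to index `pamseq`.

    `free_bytes` can be passed explicitly (useful for testing); otherwise it
    is read from the filesystem that holds `path`.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= REQUIRED_GB[pamseq] * 10**9


print(has_enough_space(".", "NGG"))
```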
Then, you can run the off-target analysis, see below for the SRY gene:
BEstimate -gene SRY -assembly GRCh38 -pamseq NGN -edit A -edit_to G -vep -ot -o ../output -ot_path ../offtargets -ofile SRY_ABE_NGN
Three programs are available from the command line once the package is installed:
- BEstimate - the main program to find and analyse Base Editor sites
- x_genome - the program to download and index a genome for off-target analysis
- x_crispranalyser - the program to run off-target analysis on guides
x_ontarget is run directly from the cloned source in the separate Python 3.10 environment — see Installation for on-target analysis.
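When the same BEstimate analysis is repeated for several PAMs, the command line can be assembled in a script. The sketch below only builds and prints the commands. `bestimate_command` is a hypothetical helper, and every flag and value is taken from the SRY example earlier in this README. Running the result (for instance with `subprocess.run`) is left to the reader and requires the bestimate environment to be active.

```python
import shlex


def bestimate_command(gene, pamseq, pamwin, ofile, output_dir="../output/"):
    """Build the argument list for a CBE (C to T) run, mirroring the SRY example."""
    return [
        "BEstimate",
        "-gene", gene,
        "-assembly", "GRCh38",
        "-pamseq", pamseq,
        "-pamwin", pamwin,
        "-actwin", "4-8",
        "-protolen", "20",
        "-edit", "C",
        "-edit_to", "T",
        "-o", output_dir,
        "-ofile", ofile,
    ]


# NGG and NGN are both three nucleotides long, so the same PAM window applies.
for pamseq in ("NGG", "NGN"):
    cmd = bestimate_command("SRY", pamseq, "21-23", f"SRY_CBE_{pamseq}")
    print(shlex.join(cmd))
```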
BEstimate command line options:
BEstimate --help
usage: BEstimate [inputs]
********************************** Find and Analyse Base Editor sites **********************************
Mandatory Inputs:
-h, --help show this help message and exit
--version Show program's version number and exit.
-gene GENE The hugo symbol of the interested gene!
-assembly ASSEMBLY The genome assembly that will be used!
-transcript TRANSCRIPT The interested ensembl transcript id
-uniprot UNIPROT The interested Uniprot id
-pamseq PAMSEQ The PAM sequence in which features used for searching activity window and editable nucleotide.
-pamwin PAMWINDOW The index of the PAM sequence when starting from the first index of protospacer as 1.
-actwin ACTWINDOW The index of the activity window when starting from the first index of protospacer as 1.
-protolen PROTOLEN The total protospacer and PAM length.
-vep The boolean option if user wants to analyse the edits through VEP and Uniprot.
-mutation_file MUTATION_FILE A file for the mutations on the interested gene that you need to integrate into guide and/or annotation analysis
-library_file LIBRARY_FILE Existing library file: should include:
1. gRNA + PAM Sequence (5>3'): column name -> CRISPR_PAM_Sequence
2. gRNA genomic location (smallest>largest): column name -> Location
3. gRNA directionality: column name -> Direction
-flank The boolean option if the user wants to add flanking sequences of the gRNAs
-flank3 FLAN_3 The number of nucleotides in the 3' flanking region
-flank5 FLAN_5 The number of nucleotides in the 5' flanking region
-edit {A,T,G,C} The nucleotide which will be edited.
-edit_to {A,T,G,C} The nucleotide after edition.
-o OUTPUT_PATH The path for output. If not specified the current directory will be used!
-ofile OUTPUT_PATH The output file name, if not specified "position" will be used!
-ot OFF_TARGET The boolean option if off targets will be computed or not
-genome GENOME (If -ot provided) name of the genome file
-v_ensembl VERSION The ensembl version in which genome will be retrieved (if the assembly is GRCh37 then please use <=75)
-ot_path OT_PATH The path of the offtargets folder. default is os.getcwd() + "/../offtargets/"
x_genome command line options:
usage: x_genome [inputs]
Script for indexing CRISPRs for finding off-targets
options:
-h, --help show this help message and exit
--version Show the version number and exit
--pamseq PAMSEQ, -p PAMSEQ
The PAM sequence in which features used for searching activity window and editable nucleotide.
--assembly {GRCh38,GRCh37}, -a {GRCh38,GRCh37}
The genome assembly that will be used!
--output_path OUTPUT_PATH, -o OUTPUT_PATH
The path for output. If not specified the current directory will be used!
--ensembl_version ENSEMBL_VERSION, -e ENSEMBL_VERSION
The ensembl version in which genome will be retrieved (if the assembly is GRCh37 then please use <=75)
--offtargets_path OFFTARGETS_PATH, -ot OFFTARGETS_PATH
The path to the root offtargets output directory
x_crispranalyser command line options:
usage: x_crispranalyser [inputs]
Script for finding off-targets
options:
-h, --help show this help message and exit
--version Show the version number and exit
--input_csv INPUT_CSV, -i INPUT_CSV
The input CSV file to be analysed
--binary_index BINARY_INDEX, -b BINARY_INDEX
The CRISPR binary index file generated by x_genome.py
--output_csv OUTPUT_CSV, -o OUTPUT_CSV
The output CSV generated
--db_file DB_FILE, -d DB_FILE
The CRISPR DB file generated by x_genome.py
x_ontarget command line options:
usage: x_ontarget [inputs]
Script for predicting on-target scores
options:
-gene GENE The hugo symbol of the interested gene!
-assembly ASSEMBLY The genome assembly that will be used!
-mutation_file MUTATION_FILE
A file for the mutations on the interested gene that you need to integrate into guide and/or annotation analysis
-rs3 RS3, The boolean option if the user wants to add on target RuleSet3 scoring for the gRNAs
-fc FCAST, The boolean option if the user wants to add on target ForeCast gRNAs efficiency info
-edit EDIT, The searched nuceleotide
-iname INPUT, The input BEstimate file extension (edit_df/ summary_df/ ot_annotated_df)
-ofile OUTPUT_FILE, The output file name
-o OUTPUT_PATH, The path for output. If not specified the current directory will be used!
BEstimate produces several files for different purposes:
- crispr_df: All gRNAs with and without editable sites
columns:
- Hugo_Symbol: Hugo Symbol of the corresponding gene location.
- CRISPR_PAM_Sequence: gRNA target sequence including the PAM region.
- gRNA_Target_Sequence: gRNA target sequence.
- Location: Location of the gRNA target sequence (chromosome:start location:end location).
- Direction: Direction of the gRNA.
- Gene_ID: Ensembl Gene ID of the interested Hugo Symbol.
- Transcript_ID: Ensembl Transcript ID of the interested Hugo Symbol in the corresponding gene location.
- Exon_ID: Ensembl Exon ID of the interested Hugo Symbol in the corresponding gene location.
- guide_in_CDS: If gRNA has any nucleotide inside a coding sequence of the gene.
- gRNA_flanking_sequences: If the user has given the `-flank` option, the gRNA with the flanking sequences
- Poly_T: Boolean output representing if the gRNA has a Poly-T region
- GC%: GC content of the gRNA
- edit_df: gRNAs with editable sites within the targeted sequence
columns:
- Hugo_Symbol: Hugo Symbol of the corresponding gene location.
- CRISPR_PAM_Sequence: gRNA target sequence including the PAM region.
- gRNA_Target_Sequence: gRNA target sequence.
- Location: Location of the gRNA target sequence (chromosome:start location:end location).
- Edit_Location: Location of the possible edit with the given Base Editor.
- Direction: Direction of the gRNA.
- Strand: Strand of the interested gene on the genome (-1 or 1)
- Gene_ID: Ensembl Gene ID for the corresponding gRNA target site region.
- Transcript_ID: Ensembl Transcript ID for the corresponding gRNA target site region.
- Exon_ID: Ensembl Exon ID for the corresponding gRNA target site region.
- guide_in_CDS: Boolean output representing if any nucleotide of the gRNA is in the CDS region or not.
- gRNA_flanking_sequences: If the user has given the `-flank` option, the gRNA with the flanking sequences
- Edit_in_Exon: Boolean output representing if the specified edit at the Edit_Location happens in an exon or not.
- Edit_in_CDS: Boolean output representing if the specified edit at the Edit_Location happens in the CDS or not.
- GC%: GC content of the gRNA
- # Edits/guide: Number of editable nucleotides within the activity window of the gRNA
- Poly_T: Boolean output representing if CRISPR_PAM_Sequence has 4 consecutive T nucleotides.
- mutation_on_guide: If any mutation is provided, boolean output representing if the mutation is included within the gRNA sequence.
- guide_change_mutation: If any mutation is provided, boolean output representing if the gRNA can change the specified mutation.
- mutation_on_window: If any mutation is provided, boolean output representing if the mutation is included within the activity window of the gRNA sequence
- mutation_on_PAM: If any mutation is provided, boolean output representing if the mutation is included within the PAM sequence of the gRNA sequence
- hgvs_df: If -vep is provided, a file with VEP API inputs for each gRNA
- vep_df: If -vep is provided, an edit_df file with VEP API annotation for each gRNA
- protein_df: If -vep is provided, a vep_df file with protein and structural annotation for each gRNA
additional columns:
- HGVS: HGVS nomenclature of the gRNA potential edits
- Protein_ID: Ensembl Protein ID of the potential edit.
- VEP_input: HGVS nomenclature which was used for VEP API.
- allele: Allelic change of the potential edit.
- variant_classification: VEP classification of the resulting variant of the potential edit. (Substitution, SNV etc.)
- most_severe_consequence: The most severe consequence of the resulting variant of the potential edit.
- consequence_terms: List of functional consequences of the resulting variant of the potential edit (splice region, missense, stop gain etc.)
- variant_biotype: The biotype of the variant created by the potential edit.
- Regulatory_ID: Ensembl Regulatory ID of the potential edit.
- Motif_ID: Ensembl Model ID of the potential edit.
- TFs_on_motif: List of transcription factors on the Ensembl Regulatory ID of the potential edit.
- cDNA_Change: cDNA changes resulting from the possible edits with the corresponding guides (position old nucleotide> new nucleotide)
- Edited_codon: Codon sequence before the corresponding edit.
- New_codon: Codon sequence after the corresponding edit.
- Protein_ID: Ensembl Protein ID for the corresponding gRNA target site region.
- CDS_Position: Position on the coding sequence of the potential edit.
- Protein_Position_ensembl: Location of the corresponding edit on the resulting protein (as Ensembl PID index).
- Protein_Position: Location of the corresponding edit on the resulting protein (as Uniprot index).
- Protein_Change: Protein sequence change with the corresponding edit (old amino acid - position - new amino acid).
- Edited_AA: One letter representation of the amino acid before the corresponding edit.
- Edited_AA_Prop: Chemical properties of the amino acid before the corresponding edit.
- New_AA: One letter representation of the amino acid after the corresponding edit.
- New_AA_Prop: Chemical properties of the amino acid after the corresponding edit.
- is_Synonymous: Boolean output representing if the resulting edit causes synonymous or non-synonymous mutations.
- is_Stop: Boolean output representing if the resulting edit causes stop codon or not.
- proline_addition: Boolean output representing if potential edit created a Proline amino acid or not.
- swissprot_vep: SwissProt ID of the corresponding gRNA target site region from Ensembl VEP.
- uniprot_provided: Uniprot ID from the user.
- polyphen_score: Polyphen Score of the corresponding edit.
- polyphen_prediction: Polyphen Prediction of the corresponding edit.
- sift_score: Sift Score of the corresponding edit.
- sift_prediction: Sift Prediction of the corresponding edit.
- cadd_phred: CADD Prediction of the corresponding edit.
- cadd_raw: CADD raw score of the corresponding edit.
- lof: representing if the corresponding edit causes Loss of function with high confidence (HC)/Low confidence (LC) or not by implementing [LOFTEE](https://github.com/konradjk/loftee) through VEP.
- impact: Impact of the corresponding edit.
- blosum62: blosum62 Score of the corresponding edit.
- is_clinical: representing if the corresponding edit has Clinical consequences in ClinVar or not.
- clinical_id: dbSNP id of the clinical allele.
- clinical_significance: Clinical significance of the corresponding edit.
- cosmic_id: COSMIC id of the clinical allele.
- clinvar_id: ClinVar id of the clinical allele.
- ancestral_populations: The conservation score from the Ensembl Compara databases
- Domain: Protein domain in which corresponding edit happening.
- curated_Domain: Curated name of the protein domain in which corresponding edit happening.
- PTM: Post Translational Modification sites in which corresponding edit happening.
- is_disruptive_interface_EXP: Boolean output representing if the corresponding edit happening on the interface region and having an experimental evidence (PDB). *High confidence*
- is_disruptive_interface_MOD: Boolean output representing if the corresponding edit happening on the interface region and having evidence from a model (Interactome3D). *Medium confidence*
- is_disruptive_interface_PRED: Boolean output representing if the corresponding edit happening on the interface region and having evidence from prediction (Interactome Insider). *Low confidence*
- disrupted_PDB_int_partners: List of partner proteins whose interactions are disrupted by the corresponding edit according to experimental evidence - PDB.
- disrupted_I3D_int_partners: List of partner proteins whose interactions are disrupted by the corresponding edit according to Interactome3D.
- disrupted_Eclair_int_partners: List of partner proteins whose interactions are disrupted by the corresponding edit according to Eclair - prediction algorithm of Interactome Insider.
- disrupted_PDB_int_genes: List of partner genes (Hugo Symbols) whose interactions are disrupted by the corresponding edit according to experimental evidence - PDB.
- disrupted_I3D_int_genes: List of partner genes (Hugo Symbols) whose interactions are disrupted by the corresponding edit according to Interactome3D.
- disrupted_Eclair_int_genes: List of partner genes (Hugo Symbols) whose interactions are disrupted by the corresponding edit according to Eclair - prediction algorithm of Interactome Insider.
- ot_annotated_df: If -ot is provided, an edit_df/protein_df file with off-target annotation for each gRNA
additional columns:
- exact: The number of exact alignments of gRNA sequence.
- mm1: The number of alignments of gRNA sequence with one mismatch.
- mm2: The number of alignments of gRNA sequence with two mismatches.
- mm3: The number of alignments of gRNA sequence with three mismatches.
- mm4: The number of alignments of gRNA sequence with four mismatches.
- scored_df: If -rs3 or -fc is provided, an edit_df/protein_df/ot_annotated_df file with on-target scoring for each gRNA
additional columns:
- rs3_sequence: The flanking region and gRNA sequence for the RS3 analysis.
- RuleSet3_Hsu2013: gRNA on-target activity score from Rule Set 3 with the Hsu2013 tracrRNA
- RuleSet3_Chen2013: gRNA on-target activity score from Rule Set 3 with the Chen2013 tracrRNA
- FORECasT-BE: gRNA predicted guide efficacy
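Several of the simpler columns described above (GC%, Poly_T and # Edits/guide) can be reproduced by hand for a single guide, which helps when spot-checking an output file. The sketch below follows the column descriptions in this README rather than the BEstimate source: the function names and the example guide are hypothetical, and whether BEstimate computes GC% with or without the PAM is not stated here, so the protospacer alone is assumed.

```python
def gc_content(guide):
    """Percentage of G and C bases in the guide."""
    guide = guide.upper()
    return 100.0 * sum(base in "GC" for base in guide) / len(guide)


def has_poly_t(sequence, run=4):
    """True if the sequence contains `run` consecutive T nucleotides."""
    return "T" * run in sequence.upper()


def count_editable(guide, edit="C", actwin=(4, 8)):
    """Count occurrences of `edit` inside the 1-based inclusive activity window."""
    start, end = actwin
    return guide.upper()[start - 1:end].count(edit)


# Made-up 20-nt protospacer (illustrative only).
guide = "GACCTCAAGTTTGCACTGAA"
print(gc_content(guide))               # 45.0
print(has_poly_t(guide + "TGG"))       # False (only three consecutive T)
print(count_editable(guide, edit="C")) # 2 (window 4-8 is CTCAA)
```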
BEstimate is the product of Cansu Dinçer, Matthew Coelho and Mathew Garnett from Garnett Group at the Wellcome Sanger Institute. Off-target analysis has been adapted by Bo Fussing from the Cellular Informatics team within the Wellcome Sanger Institute.
If you have any problems or feedback regarding BEstimate, please contact here.
GNU AFFERO GENERAL PUBLIC LICENSE
BEstimate: A Python module to design and annotate base editor gRNAs
Copyright (C) 2025-2026 Genome Research Limited
Authors: Cansu Dinçer (cd7@sanger.ac.uk), Bo Fussing (bf15@sanger.ac.uk)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
This tool is for research purposes and not for clinical use. For policies regarding the underlying data, please also refer to:
- Ensembl terms and conditions
- Uniprot terms and conditions
- Interactome Insider terms and conditions
- Rule Set 3 terms and conditions
- FORECasT-BE terms and conditions
Development requirements are minimal, and there are two approaches: a virtual environment or a VSCode devcontainer.
Virtual Environment Requirements
- Git Hubflow tools:
- On linux see hubflow installation instructions
- On MacOS with Homebrew:
brew install hubflow
- Python 3.12 or higher. It is up to the developer how to install a specific Python if the system default is not suitable.
  - It's recommended to use pyenv as it offers flexibility in managing multiple Python versions. It works on Linux and MacOS, and installing via the command line is straightforward:
    curl https://pyenv.run | bash
    pyenv install 3.12.4
    cd <project-repository-directory>
    pyenv local 3.12.4
  - Via Poetry itself. Since Poetry 2.1.0, running poetry python install 3.12 will install a standalone Python.
  - On Ubuntu Linux via apt-get. Typically the default python3 is old, so a PPA is needed for newer versions. The best known is the Deadsnakes PPA:
    sudo apt update
    sudo apt install -y software-properties-common
    # Add the Deadsnakes PPA and refresh package lists
    sudo add-apt-repository -y ppa:deadsnakes/ppa
    sudo apt update
    # Install Python 3.12, its dev headers, and venv support
    sudo apt install -y python3.12 python3.12-dev python3.12-venv python3.12-distutils
    # Bootstrap pip for this interpreter, then upgrade basics
    python3.12 --version # sanity check
    python3.12 -m ensurepip --upgrade
    python3.12 -m pip install --upgrade pip setuptools wheel
  - On MacOS via Homebrew:
    brew update
    brew install python@3.12
    python3.12 --version # sanity check
- Poetry 2.2 or higher. Poetry is used for dependency management and packaging, and there are multiple ways to install it. The official docs will have up-to-date instructions, but in summary:
  - On Linux one typically runs sudo apt-get install pipx and then pipx install poetry to ensure Poetry is up to date and isolated from system Python packages. Do not directly apt-get install poetry as that version is out of date and not compatible with this project.
  - On MacOS, if you have Homebrew installed, you can use it to install Poetry: brew update && brew install poetry.
VSCode devcontainer Requirements
Working with a devcontainer is the easiest way to get started with development
as all dependencies and tools are pre-installed. Working with VSCode is optional
as other IDEs/editors can attach to a running container or you can run commands
directly in the container with docker run.
You will need:
- Docker
- (Optional) VSCode with the Remote - Containers extension
Up-to-date instructions for installing Docker and VSCode can be found on their respective websites, but the VSCode instructions summarise all the main steps.
An important note about SSH and working with GitLab/GitHub from within a devcontainer: forward your keys via your ssh-agent:
~/.ssh/config on your Laptop/Workstation (not OpenStack/Farm or other remote host):
Host *
AddKeysToAgent yes
UseKeychain yes
IdentityFile ~/.ssh/id_rsa
TCPKeepAlive yes
ServerAliveInterval 120
Virtual Environment Setup
The process to setup the development environment is as follows:
git clone https://github.com/Garnett-Lab/BEstimate.git
cd BEstimate
python3.12 -m venv .venv
source .venv/bin/activate
poetry install
pre-commit install --install-hooks
DevContainer Setup
The process to setup the development environment is as follows:
git clone https://github.com/Garnett-Lab/BEstimate.git
cd BEstimate
docker build -t bestimate-dev:local -f .devcontainer/Dockerfile .
docker run -it --rm -v $(pwd):/opt/repo bestimate-dev:local bash
# Inside the container
pwd # should be /opt/repo
BEstimate --version # should show version
exit # exit the container
Next, if using VSCode, open the project folder within VSCode, open the command palette (Shift+CMD+P or Shift+CTRL+P) and select "Remote-Containers: Rebuild and Reopen in Container".
To add a new dependency, use Poetry to add it to the pyproject.toml and
poetry.lock files:
poetry add <package-name>
# or for a development dependency:
poetry add --group dev <package-name>
To constrain the version of a package, see the Poetry versioning docs.
What does the lockfile poetry.lock do?
The poetry.lock file ensures that everyone working on the project uses the
same versions of dependencies, which helps to avoid "it works on my machine"
problems.
It separates dependency-of-dependencies from human-specified dependencies in
pyproject.toml, and pins them to specific versions.
However the lockfile is a 'disposable' file in that it can be regenerated from
the pyproject.toml file if needed. The lockfile should always be committed to
version control. If there are merge conflicts in the lockfile, discard it and
regenerate it with poetry lock.
Finally, downstream users of the BEstimate package do not benefit from the
lockfile, and install the dependencies as specified in pyproject.toml (but not
dev dependencies).
What does the requirements.txt do?
The requirements.txt file is generated by Poetry and pre-commit as an artifact
and to allow developers to install dependencies with pip install -r requirements.txt
in environments where Poetry is not available. It should not be manually edited.
This project uses pre-commit to manage elements of code formatting and linting. See the One time setup section for installation and setup.
Pre-commit will run automatically on git commit. Generally pre-commit will
modify and correct files, these need to be staged again before the commit can
complete.
To run the pre-commit hooks manually:
pre-commit run -a
To commit while bypassing pre-commit (there are reasons to do this):
git commit --no-verify -m "My commit message"
The pre-commit configuration is in .pre-commit-config.yaml and includes:
- built-in hooks for checking for end-of-file newlines, trailing whitespace and ensuring valid JSON, YAML and TOML files
- black - Python code formatter
- flake8 - Python code linter
- poetry-check
- poetry-export - generates requirements.txt from poetry.lock
Tests are in the tests/ directory and use pytest.
To run the tests:
pytest tests/
Or with the development Docker image:
docker build -f Dockerfile-dev -t 'bestimate-dev:local' .
docker run -it --rm bestimate-dev:local pytest tests/
Or with the public image:
docker build -t bestimate:local -f Dockerfile .
# The tests don't exist in the image and pytest is not installed
docker run -it --rm \
-v ./tests/:/opt/repo/tests \
-w /opt/repo \
bestimate:local \
bash -c 'pip install pytest && python -m pytest tests/'
The CI in .gitlab-ci.yml uses the CICD template repository and includes the following stages:
- build: builds two Docker images, from Dockerfile and Dockerfile.dev
- tests: runs e2e, pytest and pre-commit against the built images
- publish:
  - if on a tag, e.g. 1.2.3, publishes the image to GitLab Container Registry as <image>:<tag> and publishes the package to GitLab PyPI as <package>:<tag>
  - if on main branch, publishes the image to Docker Hub as <image>:latest
  - if on develop branch, publishes the image to Docker Hub as <image>:develop-latest
Certain CI variables are maintained in this repository's CICD settings. Of note are:
- GITLAB_DEPLOY_TOKEN_RW and GITLAB_DEPLOY_USERNAME_RW - used to authenticate with the GitLab container registry and PyPI (the GITLAB_CI_TOKEN doesn't have API write permissions)
- DOCKER_HUB_USER and DOCKER_HUB_ACCESS_TOKEN - used to authenticate with Docker Hub so that images can be pulled without hitting the Sanger/Docker Hub rate limits
This repo implements the GitFlow branching model and uses hubflow as a tool to enable this from the CLI.
# Switch to the develop branch
git checkout develop
# Start a new release branch e.g. 0.1.0 not v0.1.0
git hf release start <project_version>
Now, do the following things:
- CHANGELOG.md: Under the heading of the newest release version, describe what was changed, fixed, added.
- pyproject.toml: Increment the project version to the current release version
- Commit these changes
- Run pre-commit run -a to ensure no formatting issues
Finally
git hf release finish <project_version>
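The release steps above expect a version such as 0.1.0 rather than v0.1.0. A small check like the following could guard against the wrong form before starting a release. It is a sketch only: `is_release_version` is a hypothetical helper, and it assumes plain three-part numeric versions, which is all the example in this section shows.

```python
import re

# Three dot-separated numbers, with no leading "v" (e.g. 0.1.0).
RELEASE_VERSION = re.compile(r"\d+\.\d+\.\d+")


def is_release_version(version):
    """Return True for versions like '0.1.0' and False for 'v0.1.0'."""
    return RELEASE_VERSION.fullmatch(version) is not None


print(is_release_version("0.1.0"))   # True
print(is_release_version("v0.1.0"))  # False
```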