This project demonstrates how to fetch, analyze, and manipulate the SARS-CoV-2 genome (the virus that causes COVID-19) using Biopython.
- Fetches the complete SARS-CoV-2 genome from the NCBI database.
- Analyzes nucleotide sequences (length, composition, etc.).
- Uses Biopython modules: Entrez (for data retrieval) and SeqIO (for parsing sequences).
- Accession ID: MN908947.3
- Source: NCBI Nucleotide Database
- Description: Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome.
from Bio import Entrez, SeqIO
Entrez.email = "your_email@example.com" # Required by NCBI
handle = Entrez.efetch(db="nucleotide", id="MN908947.3", rettype="gb", retmode="text")
recs = list(SeqIO.parse(handle, 'gb'))
handle.close()

- Entrez.efetch(): Fetches genome data from NCBI.
- rettype="gb": Retrieves data in GenBank format.
- SeqIO.parse(): Parses the GenBank text and yields one SeqRecord object per record.
covid_dna = recs[0].seq
print(f"Length of the genome: {len(covid_dna)}")

- Extracts the genome sequence as a Seq object.
- Prints the number of nucleotides.
You can perform:
- Length analysis (number of nucleotides).
- Base composition: count of A, T, G, C.
- Sub-sequence extraction for specific regions.
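The composition and sub-sequence items above can be sketched with the standard library alone. This is a minimal illustration, not the project's own code: `base_composition`, `extract_region`, and the short `toy_dna` string are hypothetical stand-ins, and the real genome would be passed in as `str(covid_dna)`.

```python
from collections import Counter


def base_composition(sequence):
    """Count each symbol in a nucleotide sequence, ignoring case."""
    return Counter(str(sequence).upper())


def extract_region(sequence, start, end):
    """Return the region from `start` to `end`, both 1-based and inclusive.

    Sequence databases usually report coordinates this way, while Python
    slices are 0-based and end-exclusive, hence the `start - 1`.
    """
    return sequence[start - 1:end]


# Hypothetical stand-in for the genome; use str(covid_dna) for the real record.
toy_dna = "ATGCGCGATTAACCGGTTAA"

counts = base_composition(toy_dna)
for base in "ATGC":
    share = counts[base] / len(toy_dna)
    print(f"{base}: {counts[base]} ({share:.1%})")

# Positions 4 through 9 of the toy sequence.
region = extract_region(toy_dna, 4, 9)
print(f"Region 4..9: {region}")
```

Counting every symbol rather than only A, T, G, and C means ambiguity codes such as N would also show up in the result if a record contains them.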
Example:
from Bio.SeqUtils import gc_fraction
gc_content = gc_fraction(covid_dna) * 100
print(f"GC Content: {gc_content:.2f}%")

- Translate the genome into protein sequences.
- Identify open reading frames (ORFs).
- Perform BLAST analysis to compare with other viral genomes.
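As a rough sketch of the translation and ORF items above, the following scans the three forward reading frames for stretches that begin with ATG and end at a stop codon, then translates each one with the standard genetic code. It is a simplified illustration under stated assumptions: `find_orfs`, `min_codons`, and the toy sequence are hypothetical names, the reverse strand is not searched, and unknown codons are written as `X`. With Biopython, `covid_dna.translate(to_stop=True)` is the library route for translating a single frame.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end, protein) tuples for forward-strand ORFs.

    `start` and `end` are 0-based slice bounds, with `end` just past the
    stop codon. `min_codons` is the minimum number of coding codons.
    """
    dna = str(dna).upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    protein = "".join(
                        CODON_TABLE.get(dna[j:j + 3], "X")
                        for j in range(start, i, 3)
                    )
                    orfs.append((start, i + 3, protein))
                start = None  # resume the search after this stop codon
    return sorted(orfs)


# Hypothetical toy input; pass str(covid_dna) to scan the real genome.
for start, end, protein in find_orfs("CCATGAAATTTGGGTAACC"):
    print(f"ORF at {start}..{end}: {protein}")
```

On a full genome, a much larger `min_codons` value (for example 100) keeps the output to ORFs long enough to be plausible genes.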
- Python 3.x
- Biopython (pip install biopython)
- Internet access (to fetch data from NCBI)
- Install dependencies:
  pip install biopython
- Set your email in the Entrez.email field (mandatory for NCBI requests).
- Run the notebook or script to fetch and analyze the genome.
- NCBI GenBank Accession: MN908947.3
- Biopython documentation: https://biopython.org/wiki/Documentation
Open-source and free to use for research and educational purposes.