This project demonstrates how to download and parse FASTA sequence data using Biopython.
The dataset used here is the ls_orchid.fasta file from the Biopython documentation examples.
- `ls_orchid.fasta` → FASTA file containing orchid DNA sequences (downloaded from the Biopython GitHub examples).
- `parser.py` → Python script to parse and store sequences using Biopython's `SeqIO` module.

The `SeqIO` module allows reading and writing of sequence file formats such as FASTA and GenBank:

    from Bio import SeqIO

We create an empty list called `sequences` to store the DNA sequences extracted from the FASTA file:

    sequences = []

    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        sequences.append(seq_record.seq)

- `SeqIO.parse()` reads the FASTA file one record at a time.
- Each record (`seq_record`) contains:
  - `seq_record.id` → Identifier of the sequence.
  - `seq_record.seq` → Actual DNA sequence.
- We append only the sequence (`seq_record.seq`) to our `sequences` list.
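
To make "one record at a time" concrete, here is a minimal pure-Python sketch of a FASTA reader written as a generator. It is a simplified stand-in for illustration, not Biopython's implementation: the function name `iter_fasta` and the two-record example text are hypothetical, and it only handles plain FASTA (a `>` header line followed by sequence lines).

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA handle, one record at a time."""
    identifier: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Take the first whitespace-delimited word after ">" as the identifier.
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines of the current record are joined into one string.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up FASTA text, used here in place of ls_orchid.fasta.
example = io.StringIO(
    ">seq1 first toy record\nACGT\nTTGA\n>seq2 second toy record\nGGCC\n"
)

sequences = []
for seq_id, seq in iter_fasta(example):
    sequences.append(seq)  # keep only the sequence, as parser.py does
```

Because the reader is a generator, only the current record is held in memory while looping. The list grows only because each sequence is appended to it.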
After running the script, the `sequences` list holds every DNA sequence from the FASTA file as a `Seq` object.
Example output (first few sequences):

    [Seq('MATTYGGTTGGA...'), Seq('CTTAGGCTCCTG...'), ...]

- Install Biopython:

      pip install biopython

- Download the FASTA file (Python version of wget):

      import urllib.request
      url = "https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta"
      urllib.request.urlretrieve(url, "ls_orchid.fasta")

- Run the parser script (`parser.py`) to load sequences.
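
The download step can be made repeatable with a small wrapper that skips the network call when a local copy already exists. This is a sketch under assumptions: the helper name `fetch_fasta` and the non-empty-file check are illustrative choices, not part of the original script.

```python
import os
import urllib.request

# URL taken from the download step; the default destination matches parser.py.
ORCHID_URL = (
    "https://raw.githubusercontent.com/biopython/biopython/"
    "master/Doc/examples/ls_orchid.fasta"
)


def fetch_fasta(url: str = ORCHID_URL, dest: str = "ls_orchid.fasta") -> str:
    """Download `url` to `dest` unless a non-empty `dest` already exists.

    Returns the path of the local file. (Hypothetical helper.)
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return dest  # reuse the local copy; no network access needed
    urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `fetch_fasta()` before parsing means the script can be re-run without downloading the file again.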

The parsed sequences can then be used for:

- DNA sequence analysis
- Motif finding
- Sequence alignment
- Bioinformatics pipelines