Description:
I'm trying to download Multiple Sequence Alignment (MSA) files for several million proteins from the AlphaFold database (https://alphafold.ebi.ac.uk/).
For individual proteins, I can successfully download MSA files through the "Download files" section on the protein entry pages (e.g., https://alphafold.ebi.ac.uk/files/msa/AF-G1JSI4-F1-msa_v6.a3m). However, I need to download MSA files at scale using a list of protein IDs.
I've explored several options but encountered limitations:
- Direct API-style downloads using the individual protein links - This appears to work for single files, but I'm concerned about potential rate limiting when scaling to millions of requests. I couldn't find documentation about API rate limits or bulk download policies.
- Google Cloud bucket - The available data appears to be limited to v4 and doesn't include MSA files.
- EBI FTP server (https://ftp.ebi.ac.uk/pub/databases/alphafold/) - While the changelog mentions MSA updates, I couldn't locate the actual MSA files in the directory structure.
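For reference, this is roughly what I would run for the first option if per-file downloads turn out to be acceptable. It is only a sketch under assumptions: the URL pattern is inferred from the single example link above (fragment `F1`, suffix `v6`), and the helper names (`msa_url`, `backoff_delay`, `fetch_msa`, `download_all`), the 0.5 s pause, and the retry policy are my own placeholders, not documented AlphaFold DB limits.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

# Base path copied from the example link; not confirmed as a stable API.
BASE_URL = "https://alphafold.ebi.ac.uk/files/msa"


def msa_url(accession: str, version: str = "v6") -> str:
    """Build an MSA URL by following the pattern of the example link.

    Assumption: every entry uses fragment F1 and the same version suffix.
    """
    return f"{BASE_URL}/AF-{accession}-F1-msa_{version}.a3m"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def fetch_msa(accession: str, out_dir: Path, max_retries: int = 5) -> bool:
    """Download one MSA file; return True on success, False if given up."""
    target = out_dir / f"{accession}.a3m"
    if target.exists():  # skip finished files so an interrupted run can resume
        return True
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(msa_url(accession), timeout=30) as resp:
                data = resp.read()
            target.write_bytes(data)
            return True
        except urllib.error.HTTPError as err:
            if err.code == 404:  # no MSA for this accession; do not retry
                return False
            time.sleep(backoff_delay(attempt))  # e.g. 429/5xx: wait, then retry
        except OSError:  # connection errors and timeouts
            time.sleep(backoff_delay(attempt))
    return False


def download_all(accessions, out_dir: str = "msa", pause: float = 0.5) -> list:
    """Fetch files one at a time with a fixed pause; return failed accessions."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for acc in accessions:
        if not fetch_msa(acc, out):
            failed.append(acc)
        time.sleep(pause)  # placeholder throttle; the real limit is unknown to me
    return failed
```

Calling `download_all(["G1JSI4"])` would write `msa/G1JSI4.a3m` and return the accessions that failed. At one request every 0.5 s, a million files would take about six days, which is why I'd prefer a bulk option if one exists.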
Questions:
- What is the recommended approach for bulk downloading MSA files given a list of protein IDs?
- Are there any rate limits or best practices I should follow when making large numbers of requests to the individual download endpoints?
Thank you for your assistance and for maintaining this valuable resource!