Complete gget dataverse module: fix CLI args, add logging, improve error handling, docs & checksum verification #2
…gging, create documentation Co-authored-by: abearab <53412130+abearab@users.noreply.github.com>
…integration Co-authored-by: abearab <53412130+abearab@users.noreply.github.com>
Removed author attribution from the documentation.
@copilot good job! Can you also add an option to confirm the "checksum" code for local files after download and match them with remote files? e.g. for the file mentioned in test you can find remote info in this link and I copied the results below
Co-authored-by: abearab <53412130+abearab@users.noreply.github.com>
I've added comprehensive checksum verification functionality to the gget dataverse module! The new feature allows users to verify MD5 checksums of downloaded files against remote checksums stored in Dataverse.
Key Features Added:
CLI Usage
# Verify checksums during download
gget dataverse datasets.csv --checksum
Python API
# Enable checksum verification
gget.dataverse("datasets.csv", verify_checksum=True)
What It Does:
Implementation Details:
The feature is backward compatible (optional parameter, defaults to False) and includes proper error handling for when checksums aren't available or API calls fail.
Commit: 37ba1fa
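For readers following along, the comparison described above (local MD5 versus the value reported by Dataverse) can be sketched with the standard library alone. This is a minimal illustration, not the code from the commit: the helper names `md5_of_file` and `checksum_matches` are hypothetical, and returning `None` when no remote checksum is available is an assumed convention.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for large downloads.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a local file's MD5 with an expected (remote) value.

    Returns None when no expected checksum is available, so a caller can
    tell "could not verify" apart from "mismatch". (Assumed convention.)
    """
    if not expected_md5:
        return None
    return md5_of_file(path) == expected_md5.strip().lower()
```

Keeping the "no checksum available" case separate from a mismatch is one way to match the behaviour described above, where missing checksums are handled rather than treated as failures.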
abearab left a comment
@copilot I think your code can be much simpler for md5 check. See my quick way to get the md5:
import requests
response = requests.get("https://dataverse.harvard.edu/api/files/6180617")
response
<Response [200]>
response.json()
{'status': 'OK', 'data': {'label': 'nodes.tab', 'restricted': False, 'version': 1, 'datasetVersionId': 326818, 'dataFile': {'id': 6180617, 'persistentId': '', 'filename': 'nodes.tab', 'contentType': 'text/tab-separated-values', 'friendlyType': 'Tab-Delimited', 'filesize': 8893757, 'storageIdentifier': 's3://dvn-cloud:1805e69f00e-fcf0acc588bb', 'originalFileFormat': 'text/csv', 'originalFormatLabel': 'Comma Separated Values', 'originalFileSize': 7869553, 'originalFileName': 'nodes.csv', 'UNF': 'UNF:6:4tpdOT6mYCXVm4muYwe/7Q==', 'rootDataFileId': -1, 'md5': '7f9ab4109c54049e819ecd14e15a6038', 'checksum': {'type': 'MD5', 'value': '7f9ab4109c54049e819ecd14e15a6038'}, 'tabularData': True, 'creationDate': '2022-04-24', 'publicationDate': '2022-04-25', 'fileAccessRequest': False}}}
metadata = response.json()
metadata
{'status': 'OK', 'data': {'label': 'nodes.tab', 'restricted': False, 'version': 1, 'datasetVersionId': 326818, 'dataFile': {'id': 6180617, 'persistentId': '', 'filename': 'nodes.tab', 'contentType': 'text/tab-separated-values', 'friendlyType': 'Tab-Delimited', 'filesize': 8893757, 'storageIdentifier': 's3://dvn-cloud:1805e69f00e-fcf0acc588bb', 'originalFileFormat': 'text/csv', 'originalFormatLabel': 'Comma Separated Values', 'originalFileSize': 7869553, 'originalFileName': 'nodes.csv', 'UNF': 'UNF:6:4tpdOT6mYCXVm4muYwe/7Q==', 'rootDataFileId': -1, 'md5': '7f9ab4109c54049e819ecd14e15a6038', 'checksum': {'type': 'MD5', 'value': '7f9ab4109c54049e819ecd14e15a6038'}, 'tabularData': True, 'creationDate': '2022-04-24', 'publicationDate': '2022-04-25', 'fileAccessRequest': False}}}
metadata['data']
{'label': 'nodes.tab', 'restricted': False, 'version': 1, 'datasetVersionId': 326818, 'dataFile': {'id': 6180617, 'persistentId': '', 'filename': 'nodes.tab', 'contentType': 'text/tab-separated-values', 'friendlyType': 'Tab-Delimited', 'filesize': 8893757, 'storageIdentifier': 's3://dvn-cloud:1805e69f00e-fcf0acc588bb', 'originalFileFormat': 'text/csv', 'originalFormatLabel': 'Comma Separated Values', 'originalFileSize': 7869553, 'originalFileName': 'nodes.csv', 'UNF': 'UNF:6:4tpdOT6mYCXVm4muYwe/7Q==', 'rootDataFileId': -1, 'md5': '7f9ab4109c54049e819ecd14e15a6038', 'checksum': {'type': 'MD5', 'value': '7f9ab4109c54049e819ecd14e15a6038'}, 'tabularData': True, 'creationDate': '2022-04-24', 'publicationDate': '2022-04-25', 'fileAccessRequest': False}}
metadata['data']['dataFile']
{'id': 6180617, 'persistentId': '', 'filename': 'nodes.tab', 'contentType': 'text/tab-separated-values', 'friendlyType': 'Tab-Delimited', 'filesize': 8893757, 'storageIdentifier': 's3://dvn-cloud:1805e69f00e-fcf0acc588bb', 'originalFileFormat': 'text/csv', 'originalFormatLabel': 'Comma Separated Values', 'originalFileSize': 7869553, 'originalFileName': 'nodes.csv', 'UNF': 'UNF:6:4tpdOT6mYCXVm4muYwe/7Q==', 'rootDataFileId': -1, 'md5': '7f9ab4109c54049e819ecd14e15a6038', 'checksum': {'type': 'MD5', 'value': '7f9ab4109c54049e819ecd14e15a6038'}, 'tabularData': True, 'creationDate': '2022-04-24', 'publicationDate': '2022-04-25', 'fileAccessRequest': False}
metadata['data']['dataFile']['md5']
'7f9ab4109c54049e819ecd14e15a6038'
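The lookup in the session above could be wrapped in a small helper, roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, `urllib` from the standard library is swapped in for `requests` so the block has no third-party dependency, and the payload layout is assumed to match the response printed above.

```python
import json
from urllib.request import urlopen

# Endpoint pattern taken from the session above; the helper names are hypothetical.
FILE_API = "https://dataverse.harvard.edu/api/files/{file_id}"


def extract_md5(metadata):
    """Pull the MD5 value out of a file-metadata payload, or return None."""
    data_file = metadata.get("data", {}).get("dataFile", {})
    checksum = data_file.get("checksum") or {}
    # Prefer the typed checksum entry, fall back to the plain 'md5' field.
    if checksum.get("type") == "MD5" and checksum.get("value"):
        return checksum["value"]
    return data_file.get("md5")


def fetch_remote_md5(file_id, timeout=30):
    """Fetch metadata for one file and return its MD5 (needs network access)."""
    with urlopen(FILE_API.format(file_id=file_id), timeout=timeout) as response:
        return extract_md5(json.load(response))
```

Separating the parsing step from the network call keeps the parsing testable offline, for example against the dictionary shown in the session.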
Co-authored-by: abearab <53412130+abearab@users.noreply.github.com>
You're absolutely right! I've simplified the MD5 checksum retrieval by using the direct file API endpoint (…). The change was minimal - just updated the URL and variable naming while keeping all the same functionality and error handling. The function now uses your suggested approach for a much simpler implementation.
Commit: 4477420
This PR completes the gget dataverse module implementation by addressing all remaining items from the review feedback in scverse#170, plus adding comprehensive checksum verification functionality. The module now follows gget conventions and is ready for production use.
Key Changes Made
CLI Argument Structure
- Changed table from optional -t/--table to required positional argument, matching other gget modules like enrichr and bgee
- Changed path from required -o/--path to optional -o/--out with current directory as default
- Usage is now gget dataverse datasets.csv instead of gget dataverse -t datasets.csv -o ./
Logging Integration
- Removed print_sys usage and implemented standard logging with set_up_logger()
- Consistent with bgee, enrichr, and other gget modules
- Supports the --quiet flag for consistent user experience
Enhanced Error Handling
- … (id, name, type)
Checksum Verification
- Added --checksum flag and verify_checksum parameter to verify file integrity
- Uses the direct file API endpoint (/api/files/{file_id}) for efficient checksum retrieval
Documentation
- Added docs/src/en/dataverse.md with usage examples and detailed parameter descriptions
- Added docs/src/es/dataverse.md for Spanish-speaking users
- Updated docs/src/SUMMARY.md to include dataverse in both language sections
Testing Improvements
- tests/test_dataverse.py with comprehensive validation tests
Usage Examples
Command Line
Python
File Format
The input CSV/TSV must contain three required columns:
- id: Unique identifier for the datafile in Dataverse
- name: Dataset name for the file
- type: File extension (csv, tsv, tab, pkl, etc.)
Checksum Verification
The checksum verification feature:
The module is now ready for production use and follows all gget coding conventions with optimized file integrity verification.
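As an illustration of the File Format requirement above, a check for the three required columns might look like the sketch below. It is not the module's actual validation code: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the sketch assumes the table is passed in as text with a header row.

```python
import csv
import io

# Column names come from the File Format section; the constant name is hypothetical.
REQUIRED_COLUMNS = ("id", "name", "type")


def missing_columns(table_text, delimiter=","):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    header = next(reader, [])  # an empty input yields an empty header
    present = {column.strip() for column in header}
    return [column for column in REQUIRED_COLUMNS if column not in present]
```

A caller could pass `delimiter="\t"` for TSV input and raise or log an error whenever the returned list is non-empty.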