Add on-the-fly compression conversion during download (Issue #18) #45
base: main
Conversation
📝 Walkthrough

Adds on-the-fly compression conversion to downloads: new CLI options `--convert-to` and `--convert-from`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI User
    participant CLI as databusclient.cli
    participant API as databusclient.api.download
    participant DL as _download_file
    participant Conv as Conversion Logic
    participant FS as File System
    User->>CLI: run download with --convert-to gz --convert-from bz2
    CLI->>API: download(..., convert_to='gz', convert_from='bz2')
    API->>DL: _download_file(..., convert_to, convert_from)
    DL->>FS: fetch original file (e.g., file.bz2)
    FS-->>DL: file.bz2
    DL->>Conv: _detect_compression_format(file.bz2)
    Conv-->>DL: 'bz2'
    DL->>Conv: _should_convert_file(..., convert_to='gz', convert_from='bz2')
    Conv-->>DL: true
    DL->>Conv: _convert_compression_format(file.bz2 → file.gz)
    Conv->>Conv: decompress bz2, compress to gz
    Conv-->>DL: file.gz
    DL->>FS: remove original file.bz2, write file.gz
    DL-->>API: download+conversion complete
    API-->>CLI: success
    CLI-->>User: finished
```
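The detection and decision steps shown in the diagram can be sketched as follows. This is a minimal reconstruction from the walkthrough above, not the PR's exact code; the function names (without underscores) and the extension mapping are illustrative.

```python
from typing import Optional, Tuple

# Extension mapping as described in the review; illustrative reconstruction.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}

def detect_compression_format(filename: str) -> Optional[str]:
    """Return 'bz2', 'gz', or 'xz' from a case-insensitive extension check, else None."""
    lower = filename.lower()
    for fmt, ext in COMPRESSION_EXTENSIONS.items():
        if lower.endswith(ext):
            return fmt
    return None

def should_convert_file(filename: str, convert_to: Optional[str],
                        convert_from: Optional[str]) -> Tuple[bool, Optional[str]]:
    """Decide whether to convert, and report the detected source format."""
    if convert_to is None:
        return False, None          # no target format requested
    source = detect_compression_format(filename)
    if source is None:
        return False, None          # uncompressed file, nothing to convert
    if source == convert_to:
        return False, source        # already in the target format
    if convert_from is not None and source != convert_from:
        return False, source        # filtered out by --convert-from
    return True, source
```

The decision function returns the detected source format even when no conversion is needed, so callers can log why a file was skipped.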
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_compression_conversion.py (1)
66-135: Comprehensive happy-path coverage. The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.
Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.
💡 Optional: Add error condition tests
Consider adding tests like:
```python
def test_convert_with_corrupted_file():
    """Test conversion fails gracefully with corrupted input"""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create invalid compressed file
        bad_file = os.path.join(tmpdir, "bad.txt.bz2")
        with open(bad_file, 'wb') as f:
            f.write(b"not actually compressed data")

        target_file = os.path.join(tmpdir, "bad.txt.gz")

        # Should raise RuntimeError and not create partial target
        with pytest.raises(RuntimeError):
            _convert_compression_format(bad_file, target_file, "bz2", "gz")

        assert os.path.exists(bad_file)  # Original preserved on failure
        assert not os.path.exists(target_file)  # No partial output
```

This would validate the error handling mentioned in the conversion implementation.
databusclient/api/download.py (1)
134-138: Consider preserving original exception details. Wrapping the exception in `RuntimeError` on line 138 may obscure the root cause. Consider chaining the exception or providing more context.

🔎 Suggested improvement

```diff
 except Exception as e:
     # If conversion fails, ensure the partial target file is removed
     if os.path.exists(target_file):
         os.remove(target_file)
-    raise RuntimeError(f"Compression conversion failed: {e}")
+    raise RuntimeError(f"Compression conversion failed: {e}") from e
```

This preserves the full exception chain for debugging.
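Why `from e` matters can be shown with a minimal, standalone illustration (independent of the PR code): the original error stays attached as `__cause__`, so tracebacks show both failures.

```python
# Chained exceptions preserve the root cause on the raised error.
def convert_or_fail():
    try:
        raise OSError("disk full while writing target")
    except Exception as e:
        raise RuntimeError(f"Compression conversion failed: {e}") from e

try:
    convert_or_fail()
except RuntimeError as err:
    assert isinstance(err.__cause__, OSError)  # root cause preserved for debugging
```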
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_compression_conversion.py
- tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
- `download` (700-836)
- `_detect_compression_format` (31-45)
- `_should_convert_file` (48-79)
- `_get_converted_filename` (82-99)
- `_convert_compression_format` (102-138)
🔇 Additional comments (18)
.gitignore (1)
3-4: LGTM! The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.
tests/test_download.py (2)
3-3: LGTM! The pytest import is correctly added to support the skip decorator.

30-30: LGTM! Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.
tests/test_compression_conversion.py (4)
17-24: LGTM! The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.

26-57: LGTM! The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.

59-64: LGTM! The filename conversion tests validate all format pair transformations concisely.

137-138: LGTM! Direct pytest execution support is convenient for development and debugging.
README.md (3)
169-173: LGTM! The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.

185-211: LGTM! The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.

264-274: LGTM! The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.
databusclient/cli.py (2)
161-170: LGTM! The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.

180-197: LGTM! The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the `api_download` call.

databusclient/api/download.py (6)
3-28: LGTM! Clean compression module setup. The imports and constant mappings are well-structured and use standard library modules for compression handling.

31-45: LGTM! Robust extension detection. The case-insensitive matching ensures reliable format detection across different filename conventions.

48-79: LGTM! Conversion decision logic is sound. The function correctly handles all conversion scenarios including filtering and format matching.

141-263: LGTM! Clean integration of conversion into download flow. The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.

265-299: LGTM! Consistent parameter propagation across all download functions. All internal download functions properly accept and forward `convert_to` and `convert_from` parameters through the call chain, with clear documentation. Also applies to: 437-476, 478-512, 515-557, 624-663.

700-836: The CLI already validates compression format parameters using Click's built-in `Choice` validator, which automatically rejects invalid values before they reach the API function. No additional validation is needed. Likely an incorrect or invalid review comment.
Hi @DhanashreePetare! Just a reminder that this PR has conflicts. No rush
Force-pushed from 98e4f35 to 5277f26
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 82-99: The filename comparison is case-sensitive in
_get_converted_filename, causing mismatches with _detect_compression_format
which lowercases names; change the check to use a case-insensitive test (e.g.,
filename.lower().endswith(source_ext)) while still slicing the original filename
by len(source_ext) so the returned name preserves original casing; keep using
COMPRESSION_EXTENSIONS for source_ext/target_ext and only alter the conditional
and slicing logic accordingly.
- Around line 257-262: The bug is that the filename variable `file` is being
shadowed by a file handle in the `with` block so calls to
`_should_convert_file(file, ...)` end up passing a closed file handle instead of
the original filename string; fix by using the original filename string when
calling `_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` (e.g., pass the variable that holds the
URL-derived filename) or rename the `with` block file handle (e.g., `fh` or
`fhandle`) so `file` continues to refer to the filename string; update calls to
`_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` accordingly to avoid calling `.lower()` on a file
object.
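The first agent instruction above (case-insensitive suffix matching in `_get_converted_filename`) can be sketched like this. It is a sketch of the described fix, not the PR's actual code; the unprefixed function name is illustrative.

```python
# Extension mapping as used by the PR's COMPRESSION_EXTENSIONS constant.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}

def get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
    """Swap the compression extension, matching it case-insensitively."""
    source_ext = COMPRESSION_EXTENSIONS[source_format]
    target_ext = COMPRESSION_EXTENSIONS[target_format]
    if filename.lower().endswith(source_ext):
        # Slice by extension length so the stem keeps its original casing.
        return filename[:-len(source_ext)] + target_ext
    return filename + target_ext
```

The key point from the instruction: lowercase only for the suffix *test*, but slice the *original* string so `DATA.TXT.GZ` becomes `DATA.TXT.xz` rather than `DATA.TXT.GZ.xz`.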
🧹 Nitpick comments (1)
tests/test_compression_conversion.py (1)
59-63: Consider adding a test for uppercase filenames. These tests cover the happy path but don't exercise the case-sensitivity edge case (e.g., `DATA.TXT.GZ`). Adding such a test would help catch the bug flagged in `_get_converted_filename`.

Suggested test addition

```python
def test_get_converted_filename():
    """Test filename conversion"""
    assert _get_converted_filename("data.txt.bz2", "bz2", "gz") == "data.txt.gz"
    assert _get_converted_filename("data.txt.gz", "gz", "xz") == "data.txt.xz"
    assert _get_converted_filename("data.txt.xz", "xz", "bz2") == "data.txt.bz2"
    # Uppercase extension handling
    assert _get_converted_filename("DATA.TXT.GZ", "gz", "xz") == "DATA.TXT.xz"
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_compression_conversion.py
- tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
download(700-836)_detect_compression_format(31-45)_should_convert_file(48-79)_get_converted_filename(82-99)_convert_compression_format(102-138)
🔇 Additional comments (14)
.gitignore (1)
3-4: LGTM! Appropriate additions to ignore test artifacts (`test-download/`) and sensitive token files (`vault-token.dat`).

databusclient/api/download.py (6)
17-28: LGTM! Clean dictionary mappings for compression formats. This design allows easy extension to additional formats in the future.

31-45: LGTM! Case-insensitive detection logic is correct and well-documented.

48-79: LGTM! Decision logic correctly handles all edge cases: no target specified, uncompressed files, same source/target formats, and source format filtering.

102-138: LGTM! The chunked decompress → recompress pipeline is correctly implemented with proper cleanup on failure. The 8 KiB chunk size is reasonable for streaming.
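The chunked pipeline praised here can be sketched as below. This is an illustration of the technique (streaming via the stdlib `bz2`/`gzip`/`lzma` modules), not the PR's exact `_convert_compression_format`; the `COMPRESSION_MODULES` mapping and function name are assumptions.

```python
import bz2
import gzip
import lzma
import os

# Hypothetical format -> stdlib module mapping, in the spirit of the PR's constants.
COMPRESSION_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}

def convert_compression(source_path, target_path, source_fmt, target_fmt,
                        chunk_size=8192):
    """Stream-decompress source and recompress to target in fixed-size chunks."""
    src_mod = COMPRESSION_MODULES[source_fmt]
    dst_mod = COMPRESSION_MODULES[target_fmt]
    try:
        with src_mod.open(source_path, "rb") as src, \
             dst_mod.open(target_path, "wb") as dst:
            while chunk := src.read(chunk_size):
                dst.write(chunk)
    except Exception as e:
        if os.path.exists(target_path):
            os.remove(target_path)   # clean up partial output on failure
        raise RuntimeError(f"Compression conversion failed: {e}") from e
    os.remove(source_path)           # drop the original only after success
```

Because decompression and recompression happen per chunk, memory use stays bounded regardless of file size.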
265-298: LGTM! The `convert_to` and `convert_from` parameters are correctly propagated through `_download_files` to the underlying `_download_file` calls.

700-727: LGTM! The public `download()` API correctly exposes and documents the new compression conversion parameters, with clear docstrings explaining the supported formats.

tests/test_compression_conversion.py (5)
1-14: LGTM! Appropriate imports for testing the compression conversion utilities.

17-23: LGTM! Good test coverage including case-insensitive matching.

26-56: LGTM! Comprehensive test coverage for all decision branches in `_should_convert_file`.

66-134: LGTM! Well-structured conversion tests covering all format pairs with proper data integrity verification. The use of `tempfile.TemporaryDirectory` ensures proper cleanup.

137-138: LGTM! Standard pytest main block for direct execution.
tests/test_download.py (2)
3-4: LGTM! Added pytest import to support the skip decorator.

30-32: LGTM! Skipping the long-running collection download test is appropriate for CI stability. The skip reason clearly documents why this test is disabled.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
databusclient/api/download.py (8)
269-303: Critical syntax error causing pipeline failure. The function `_download_file` has malformed syntax with duplicate function signatures and docstrings. Lines 276-290 contain the new docstring with convert parameters, but lines 291-303 contain remnants of the original function definition mixed in, causing a `SyntaxError: Simple statements must be separated by newlines or semicolons`. The function should have a single, clean signature and docstring.

🐛 Proposed fix - consolidate the function definition

```diff
 def _download_file(
     url,
     localDir,
     vault_token_file=None,
     databus_key=None,
     auth_url=None,
     client_id=None,
     convert_to=None,
     convert_from=None,
-) -> None:
-    """
-    Download a file from the internet with a progress bar using tqdm.
-
-    Parameters:
-    - url: the URL of the file to download
-    - localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
-    - vault_token_file: Path to Vault refresh token file
-    - databus_key: Databus API key for protected downloads
-    - auth_url: Keycloak token endpoint URL
-    - client_id: Client ID for token exchange
-    - convert_to: Target compression format for on-the-fly conversion
-    - convert_from: Optional source compression format filter
     validate_checksum: bool = False,
     expected_checksum: str | None = None,
 ) -> None:
     """Download a file from the internet with a progress bar using tqdm.
 
     Args:
         url: The URL of the file to download.
         localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
         vault_token_file: Path to Vault refresh token file.
         databus_key: Databus API key for protected downloads.
         auth_url: Keycloak token endpoint URL.
         client_id: Client ID for token exchange.
+        convert_to: Target compression format for on-the-fly conversion.
+        convert_from: Optional source compression format filter.
+        validate_checksum: Whether to validate the file checksum after download.
+        expected_checksum: Expected SHA256 checksum for validation.
     """
```
443-470: Checksum validation will fail after compression conversion. After conversion, the original file is deleted (in `_convert_compression_format`, line 135), but checksum validation at line 453 still attempts to validate `filename`, which no longer exists. Additionally, the expected checksum from Databus metadata corresponds to the original compressed file, not the converted one.

Either:
- Skip checksum validation when conversion is performed, or
- Validate the checksum before conversion

🐛 Proposed fix - validate checksum before conversion

```diff
 # --- 5. Verify download size ---
 if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
     raise IOError("Downloaded size does not match Content-Length header")
 
-# --- 6. Convert compression format if requested ---
-should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
-if should_convert and source_format:
-    target_filename = _get_converted_filename(file, source_format, convert_to)
-    target_filepath = os.path.join(localDir, target_filename)
-    _convert_compression_format(filename, target_filepath, source_format, convert_to)
-# --- 6. Optional checksum validation ---
+# --- 6. Optional checksum validation (must happen before conversion) ---
 if validate_checksum:
     # reuse compute_sha256_and_length from webdav extension
     try:
         actual, _ = compute_sha256_and_length(filename)
     except (OSError, IOError) as e:
         print(f"WARNING: error computing checksum for (unknown): {e}")
         actual = None
 
     if expected_checksum is None:
         print(f"WARNING: no expected checksum available for (unknown); skipping validation")
     elif actual is None:
         print(f"WARNING: could not compute checksum for (unknown); skipping validation")
     else:
         if actual.lower() != expected_checksum.lower():
             try:
                 os.remove(filename)  # delete corrupted file
             except OSError:
                 pass
             raise IOError(
                 f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
             )
 
+# --- 7. Convert compression format if requested ---
+should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+if should_convert and source_format:
+    target_filename = _get_converted_filename(file, source_format, convert_to)
+    target_filepath = os.path.join(localDir, target_filename)
+    _convert_compression_format(filename, target_filepath, source_format, convert_to)
```
473-507: Same syntax error pattern as `_download_file`. The `_download_files` function has the same malformed definition with duplicate/overlapping docstrings. This needs to be consolidated similar to the fix for `_download_file`.

659-695: Same syntax error pattern in `_download_collection`. This function also has overlapping/duplicate docstrings that need consolidation.

720-752: Same syntax error pattern in `_download_version`. This function also has overlapping/duplicate docstrings that need consolidation.

777-812: Same syntax error pattern in `_download_artifact`. This function also has overlapping/duplicate docstrings that need consolidation.

906-941: Same syntax error pattern in `_download_group`. This function also has overlapping/duplicate docstrings that need consolidation.

1004-1029: Same syntax error pattern in the `download` function. The main `download` function also has overlapping/duplicate docstrings that need consolidation.

databusclient/cli.py (3)
165-177: Malformed click option definition. The `--convert-from` option definition is incomplete. Line 174 shows `"--validate-checksum",`, which appears to be the start of a separate option that was incorrectly merged into the `--convert-from` definition.

🐛 Proposed fix - separate the options properly

```diff
 @click.option(
     "--convert-from",
     type=click.Choice(["bz2", "gz", "xz"], case_sensitive=False),
     help="Source compression format to convert from (optional filter). Only files with this compression will be converted.",
+)
+@click.option(
     "--validate-checksum", is_flag=True, help="Validate checksums of downloaded files"
 )
```
178-191: Malformed function signature. The function signature has invalid syntax with a closing parenthesis, then a parameter, then another closing parenthesis on separate lines (189-191). This needs to be a single, properly formatted parameter list.

🐛 Proposed fix

```diff
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     convert_to,
     convert_from,
-):
-    validate_checksum,
-):
+    validate_checksum,
+):
```
196-221: Duplicate api_download calls - only one should exist. There are two calls to `api_download`: one at lines 196-207 (outside the try block, has conversion params, missing validate_checksum) and another at lines 209-219 (inside the try block, has validate_checksum, missing conversion params). This should be a single call with all parameters inside the try block.

🐛 Proposed fix - consolidate into single call

```diff
-    api_download(
-        localDir=localdir,
-        endpoint=databus,
-        databusURIs=databusuris,
-        token=vault_token,
-        databus_key=databus_key,
-        all_versions=all_versions,
-        auth_url=authurl,
-        client_id=clientid,
-        convert_to=convert_to,
-        convert_from=convert_from,
-    )
     try:
         api_download(
             localDir=localdir,
             endpoint=databus,
             databusURIs=databusuris,
             token=vault_token,
             databus_key=databus_key,
             all_versions=all_versions,
             auth_url=authurl,
             client_id=clientid,
-            validate_checksum=validate_checksum
+            convert_to=convert_to,
+            convert_from=convert_from,
+            validate_checksum=validate_checksum,
         )
     except DownloadAuthError as e:
         raise click.ClickException(str(e))
```
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 6-8: The imports in this module duplicate List from typing; remove
the redundant "from typing import List" (the second occurrence) and consolidate
typing imports so you only have one line like "from typing import List,
Optional, Tuple" while keeping the existing "import re" intact; update the
import block (references: List, Optional, Tuple, re) accordingly.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)
85-102: Consider case-insensitive extension matching. The `_get_converted_filename` function uses `filename.endswith(source_ext)`, which is case-sensitive, but `_detect_compression_format` lowercases the filename before checking. If a file has a mixed-case extension like `.BZ2`, detection will work but filename conversion will append rather than replace the extension.

♻️ Proposed fix for case-insensitive handling

```diff
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (1)
databusclient/cli.py (1)
`download` (178-190)
🪛 GitHub Actions: Python CI (Lint & pytest)
databusclient/api/download.py
[error] 294-294: ruff check failed: SyntaxError: Simple statements must be separated by newlines or semicolons.
🔇 Additional comments (6)
README.md (2)
177-180: LGTM! The documentation for the `--convert-to` and `--convert-from` options is clear and provides helpful examples. The descriptions accurately reflect the implementation behavior (decompress → recompress pipeline, format filtering).

272-282: LGTM! The examples section effectively demonstrates the three key use cases for compression conversion: converting all files to a target format, filtering by source format, and unifying a collection. This aligns well with the PR objectives.
databusclient/api/download.py (4)
20-31: LGTM! The compression format mappings are clean and provide a single source of truth for supported formats. The module-to-format mapping correctly uses Python's built-in compression libraries.

34-48: LGTM! The compression detection is case-insensitive, which is a good defensive approach for handling filenames with varied casing.

51-82: LGTM! The conversion decision logic correctly handles all edge cases: no target format, uncompressed files, same source/target format, and optional source format filtering.

105-141: LGTM! The conversion implementation properly:
- Uses chunked streaming (8KB) to avoid memory issues with large files
- Cleans up partial target files on failure
- Removes the original file only after successful conversion
- Provides user feedback during conversion
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
432-459: Critical: Variable shadowing causes runtime crash and checksum validation on deleted file. Two critical issues:

1. Variable shadowing bug: At line 306, `file = url.split("/")[-1]` stores the filename string. But at line 422, `with open(filename, "wb") as file:` shadows this with a file handle. At line 433, `_should_convert_file(file, ...)` passes the closed file handle, not the filename string. This will crash with `AttributeError: '_io.BufferedWriter' object has no attribute 'lower'`.
2. Checksum on deleted file: If conversion succeeds, `_convert_compression_format` deletes the original file at `filename`, but line 442 tries to compute the checksum on `filename`, which no longer exists.

🐛 Proposed fix

```diff
-    file = url.split("/")[-1]
-    filename = os.path.join(localDir, file)
+    base_filename = url.split("/")[-1]
+    filename = os.path.join(localDir, base_filename)
     print(f"Download file: {url}")
```

And update the download and conversion section:

```diff
     # --- 4. Download with progress bar ---
     total_size_in_bytes = int(response.headers.get("content-length", 0))
     block_size = 1024  # 1 KiB
     progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
-    with open(filename, "wb") as file:
+    with open(filename, "wb") as f:
         for data in response.iter_content(block_size):
             progress_bar.update(len(data))
-            file.write(data)
+            f.write(data)
     progress_bar.close()
 
     # --- 5. Verify download size ---
     if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
         raise IOError("Downloaded size does not match Content-Length header")
 
     # --- 6. Convert compression format if requested ---
-    should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+    should_convert, source_format = _should_convert_file(base_filename, convert_to, convert_from)
+    checksum_filepath = filename  # Track which file to validate
     if should_convert and source_format:
-        target_filename = _get_converted_filename(file, source_format, convert_to)
+        target_filename = _get_converted_filename(base_filename, source_format, convert_to)
         target_filepath = os.path.join(localDir, target_filename)
         _convert_compression_format(filename, target_filepath, source_format, convert_to)
-    # --- 6. Optional checksum validation ---
+        checksum_filepath = target_filepath  # Validate converted file
+    # --- 7. Optional checksum validation ---
     if validate_checksum:
         # reuse compute_sha256_and_length from webdav extension
         try:
-            actual, _ = compute_sha256_and_length(filename)
+            actual, _ = compute_sha256_and_length(checksum_filepath)
         except (OSError, IOError) as e:
-            print(f"WARNING: error computing checksum for (unknown): {e}")
+            print(f"WARNING: error computing checksum for {checksum_filepath}: {e}")
             actual = None
```

Note: Checksum validation after conversion will compare against the original expected checksum, which won't match the converted file. Consider whether checksum validation should be skipped or adapted when conversion is performed.
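The shadowing pitfall is easy to reproduce with a toy snippet, independent of the PR code: after the with-block, the name bound by `as` still refers to the (now closed) handle, not the earlier string.

```python
import os
import tempfile

def buggy_download(data: bytes) -> str:
    file = "data.txt.bz2"                       # filename string
    path = os.path.join(tempfile.gettempdir(), file)
    with open(path, "wb") as file:              # shadows the string with a handle
        file.write(data)
    os.remove(path)
    return file.lower()                         # AttributeError: handle has no .lower()

try:
    buggy_download(b"payload")
except AttributeError as e:
    print(f"reproduced: {e}")
```

Renaming either variable (the review suggests `base_filename` or `f`) removes the collision entirely.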
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 85-102: The function _get_converted_filename fails for uppercase
extensions; change the conditional to perform a case-insensitive suffix check
(e.g., use filename.lower().endswith(source_ext.lower())) so files like
"file.BZ2" are recognized, then slice off the matched extension length from the
original filename (filename[:-len(source_ext)]) and append target_ext as before;
keep using COMPRESSION_EXTENSIONS for source_ext/target_ext and leave the append
behavior when no match unchanged.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)
105-141: LGTM with optional improvement note. Good error handling with cleanup of partial output on failure. The chunked I/O approach is memory-efficient.

Consider using `logging` instead of `print()` for better control over verbosity, but this is fine for a CLI tool.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (1)
`compute_sha256_and_length` (62-72)

databusclient/extensions/webdav.py (1)

`compute_sha256_and_length` (16-34)
🔇 Additional comments (12)
databusclient/api/download.py (10)
3-6: LGTM! Standard library imports for compression modules and typing are correctly added.

20-31: LGTM! Clean and correct mappings for the supported compression formats.

34-48: LGTM! Case-insensitive extension detection is correctly implemented.

51-82: LGTM! Conversion decision logic correctly handles all cases: no target format, uncompressed files, same-format skip, and source format filtering.

462-503: LGTM! Parameters correctly propagated to `_download_files` and forwarded to individual file downloads.

639-686: LGTM! Collection download correctly propagates conversion parameters.

689-733: LGTM! Version download correctly propagates conversion parameters.

736-788: LGTM! Artifact download correctly propagates conversion parameters through version iterations.

852-893: LGTM! Group download correctly propagates conversion parameters to artifact downloads.

930-1098: LGTM! The main `download` function correctly propagates conversion parameters to all download paths (collection, file, version, artifact, group, and query-based downloads).

databusclient/cli.py (2)
165-174: LGTM! CLI options are correctly defined with appropriate choices and help text. Case-insensitive matching is a good UX choice.

180-212: LGTM! Parameters correctly added to the function signature and properly passed to `api_download`. The docstring update accurately describes the new capability.
databusclient/api/download.py (outdated)

```python
    """
    Detect compression format from file extension.
    Parameters:
    - filename: Name of the file
    Returns:
    - Compression format string ('bz2', 'gz', 'xz') or None if not compressed
    """
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)
databusclient/api/download.py (outdated)

```python
    """
    Determine if a file should be converted and what the source format is.
    Parameters:
    - filename: Name of the file
    - convert_to: Target compression format
    - convert_from: Optional source compression format filter
    Returns:
    - Tuple of (should_convert: bool, source_format: Optional[str])
    """
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)
databusclient/api/download.py (outdated)

```python
    """
    Generate the new filename after compression format conversion.
    Parameters:
    - filename: Original filename
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')
    Returns:
    - New filename with updated extension
    """
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)
Addresses all feedback from @Integer-Ctrl on PR dbpedia#45
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
439-466: Critical bug: Checksum validation fails after conversion because the original file was deleted. When conversion is performed (lines 441-444), `_convert_compression_format` removes the original source file (`filename`). However, checksum validation (lines 446-466) still attempts to compute the checksum of `filename`, which no longer exists after successful conversion. This will raise an `OSError` during `compute_sha256_and_length`. Additionally, checksum validation should probably validate the original file before conversion, not after, since the expected checksum from metadata corresponds to the original compressed format.
🐛 Proposed fix: Validate checksum before conversion

```diff
+    # --- 6. Optional checksum validation (before conversion) ---
+    if validate_checksum:
+        try:
+            actual, _ = compute_sha256_and_length(filename)
+        except (OSError, IOError) as e:
+            print(f"WARNING: error computing checksum for (unknown): {e}")
+            actual = None
+
+        if expected_checksum is None:
+            print(f"WARNING: no expected checksum available for (unknown); skipping validation")
+        elif actual is None:
+            print(f"WARNING: could not compute checksum for (unknown); skipping validation")
+        else:
+            if actual.lower() != expected_checksum.lower():
+                try:
+                    os.remove(filename)  # delete corrupted file
+                except OSError:
+                    pass
+                raise IOError(
+                    f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
+                )
+
-    # --- 6. Convert compression format if requested ---
+    # --- 7. Convert compression format if requested ---
     should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
     if should_convert and source_format:
         target_filename = _get_converted_filename(file, source_format, convert_to)
         target_filepath = os.path.join(localDir, target_filename)
         _convert_compression_format(filename, target_filepath, source_format, convert_to)

-    # --- 6. Optional checksum validation ---
-    if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
-        try:
-            actual, _ = compute_sha256_and_length(filename)
-        except (OSError, IOError) as e:
-            print(f"WARNING: error computing checksum for (unknown): {e}")
-            actual = None
-
-        if expected_checksum is None:
-            print(f"WARNING: no expected checksum available for (unknown); skipping validation")
-        elif actual is None:
-            print(f"WARNING: could not compute checksum for (unknown); skipping validation")
-        else:
-            if actual.lower() != expected_checksum.lower():
-                try:
-                    os.remove(filename)  # delete corrupted file
-                except OSError:
-                    pass
-                raise IOError(
-                    f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
-                )
```
♻️ Duplicate comments (1)
databusclient/api/download.py (1)
129-129: Misleading comment about progress indication. The comment states "with progress indication" but no progress bar is implemented for the conversion process. This was flagged in a previous review.
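If actual progress output is wanted, the conversion loop can report decompressed bytes as it streams chunks between the two codecs. A minimal sketch (the function name, chunk size, and print format are illustrative, not the PR's code):

```python
import bz2
import gzip
import lzma

_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}


def convert_with_progress(src_path, dst_path, src_fmt, dst_fmt,
                          chunk_size=1 << 20):
    """Stream-decompress src_path and recompress into dst_path,
    printing the number of decompressed bytes processed so far."""
    processed = 0
    with _MODULES[src_fmt].open(src_path, "rb") as fin, \
         _MODULES[dst_fmt].open(dst_path, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            processed += len(chunk)
            print(f"\r  converted {processed / 1_000_000:.1f} MB",
                  end="", flush=True)
    print()
```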
🧹 Nitpick comments (4)
databusclient/api/download.py (2)
476-477: Consider using `Optional[str]` for type consistency. The type hints use `str = None` instead of `Optional[str] = None`, which is inconsistent with other parameters in the codebase (e.g., line 6 imports `Optional` and other functions use it). While this works, `Optional[str]` more explicitly conveys that `None` is a valid value.
439-445: Duplicate step numbers in comments. Both the conversion step (line 439) and checksum validation step (line 445) are numbered "6". After fixing the checksum validation ordering issue, update the step numbers accordingly.
tests/test_compression_conversion.py (2)
59-64: Consider adding a test for the fallback extension append case. The `_get_converted_filename` function has a fallback path that appends the target extension when the source extension isn't found (line 100 in download.py). This branch isn't tested here. While unlikely to be hit given how the code flows, testing it would improve coverage.

💡 Example test case

```python
# Test fallback when source extension not found (edge case)
assert _get_converted_filename("data.txt", "bz2", "gz") == "data.txt.gz"
```
179-194: Consider verifying that the source file is preserved on conversion failure. The test verifies that the partial target file is cleaned up, but doesn't verify that the source file (`corrupted.bz2`) still exists after the failure. Since the source is only removed on successful conversion, verifying its preservation ensures the error handling doesn't accidentally delete user data.

💡 Suggested addition

```diff
 # Verify target file was cleaned up
 assert not os.path.exists(target_file)
+
+# Verify source file is preserved (not deleted on failure)
+assert os.path.exists(source_file)
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- tests/test_compression_conversion.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
`download` (937-1105), `_detect_compression_format` (34-47), `_should_convert_file` (50-80), `_get_converted_filename` (83-100), `_convert_compression_format` (103-148)
databusclient/api/download.py (3)
databusclient/api/utils.py (1)
`compute_sha256_and_length` (62-72)
databusclient/extensions/webdav.py (1)
`compute_sha256_and_length` (16-34)
tests/test_download_auth.py (1)
`iter_content` (19-23)
🔇 Additional comments (12)
databusclient/api/download.py (5)
20-31: LGTM! The compression format mappings are well-organized and correctly associate format identifiers with their file extensions and corresponding Python modules.

34-47: LGTM! The detection function correctly handles case-insensitive extension matching and returns `None` for uncompressed files.
50-80: LGTM! The conversion decision logic correctly handles all edge cases: no target specified, uncompressed files, same source/target formats, and the optional `convert_from` filter.
83-100: LGTM! The filename conversion correctly uses case-insensitive extension matching while preserving the original filename case in the output.
654-655: LGTM! The `convert_to` and `convert_from` parameters are consistently propagated through all download entry points and helper functions, enabling on-the-fly compression conversion throughout the download workflow.

Also applies to: 689-690, 703-704, 736-737, 751-752, 791-792, 867-868, 897-898, 946-947, 991-992, 1019-1020, 1033-1034, 1049-1050, 1065-1066, 1101-1102
tests/test_compression_conversion.py (7)
17-24: LGTM! Good coverage of compression format detection including case-insensitive handling.

26-57: LGTM! Comprehensive test coverage for the conversion decision logic, covering all branch conditions including the `convert_from` filter.
66-88: LGTM! Solid end-to-end test verifying file conversion, source removal, and data integrity.

91-134: LGTM! Good coverage of additional conversion paths (gz→xz, xz→bz2) ensuring all supported format combinations work correctly.

137-146: LGTM! Good coverage of case-insensitive extension handling, addressing PR feedback.
148-176: LGTM! Good validation tests ensuring unsupported formats raise appropriate `ValueError` exceptions, addressing PR feedback about input validation.

197-198: LGTM! Standard pytest entry point for direct execution.
|
I apologize for my naivety earlier. I've now fixed all the issues.
Please review it and let me know. Thank you! |
Pull Request
Description
This PR implements on-the-fly compression format conversion during download, allowing users to convert downloaded files between `bz2`, `gz`, and `xz` formats automatically during the download process. This feature makes it easier to unify datasets with consistent compression formats, save disk space, or integrate data into pipelines that expect specific formats.

Key Features:
- `--convert-to` CLI option to specify target compression format (bz2, gz, xz)
- `--convert-from` CLI option to filter which source formats to convert
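In effect, the two options combine into a per-file pipeline: detect the source format, skip files that don't match the `--convert-from` filter, rename the extension, and recompress. A condensed sketch of that pipeline (the `maybe_convert` helper and `_FORMATS` mapping are illustrative names, not the PR's actual API):

```python
import bz2
import gzip
import lzma
import os

_FORMATS = {"bz2": bz2, "gz": gzip, "xz": lzma}


def maybe_convert(path, convert_to, convert_from=None):
    """Convert path to convert_to if applicable; return the final path."""
    source = next((f for f in _FORMATS if path.lower().endswith("." + f)), None)
    if source is None or source == convert_to:
        return path  # uncompressed or already in the target format
    if convert_from is not None and source != convert_from:
        return path  # filtered out by --convert-from
    new_path = path[: -len(source) - 1] + "." + convert_to
    with _FORMATS[source].open(path, "rb") as fin, \
         _FORMATS[convert_to].open(new_path, "wb") as fout:
        while chunk := fin.read(1 << 20):
            fout.write(chunk)
    os.remove(path)  # the original is replaced, mirroring the PR's behavior
    return new_path
```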
Resolves #18
Type of change
Checklist:
poetry run pytest- all tests passedpoetry run ruff check- no linting errorsSummary by CodeRabbit
New Features
Documentation
Tests
Chores