Add on-the-fly compression conversion during download (Issue #18) #45
base: main
Conversation
📝 Walkthrough

Adds on-the-fly compression conversion to downloads: new CLI options `--convert-to` and `--convert-from`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as CLI User
    participant CLI as databusclient.cli
    participant API as databusclient.api.download
    participant DL as _download_file
    participant Conv as Conversion Logic
    participant FS as File System
    User->>CLI: run download with --convert-to gz --convert-from bz2
    CLI->>API: download(..., convert_to='gz', convert_from='bz2')
    API->>DL: _download_file(..., convert_to, convert_from)
    DL->>FS: fetch original file (e.g., file.bz2)
    FS-->>DL: file.bz2
    DL->>Conv: _detect_compression_format(file.bz2)
    Conv-->>DL: 'bz2'
    DL->>Conv: _should_convert_file(..., convert_to='gz', convert_from='bz2')
    Conv-->>DL: true
    DL->>Conv: _convert_compression_format(file.bz2 → file.gz)
    Conv->>Conv: decompress bz2, compress to gz
    Conv-->>DL: file.gz
    DL->>FS: remove original file.bz2, write file.gz
    DL-->>API: download+conversion complete
    API-->>CLI: success
    CLI-->>User: finished
```
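The detection and decision steps shown in the diagram can be sketched as follows. This is a minimal reconstruction from the walkthrough above, not the PR's exact code; the function names (without underscores) and the extension mapping are illustrative.

```python
from typing import Optional, Tuple

# Extension mapping as described in the review; illustrative reconstruction.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}

def detect_compression_format(filename: str) -> Optional[str]:
    """Return 'bz2', 'gz', or 'xz' from a case-insensitive extension check, else None."""
    lower = filename.lower()
    for fmt, ext in COMPRESSION_EXTENSIONS.items():
        if lower.endswith(ext):
            return fmt
    return None

def should_convert_file(filename: str, convert_to: Optional[str],
                        convert_from: Optional[str]) -> Tuple[bool, Optional[str]]:
    """Decide whether to convert, and report the detected source format."""
    if convert_to is None:
        return False, None          # no target format requested
    source = detect_compression_format(filename)
    if source is None:
        return False, None          # uncompressed file, nothing to convert
    if source == convert_to:
        return False, source        # already in the target format
    if convert_from is not None and source != convert_from:
        return False, source        # filtered out by --convert-from
    return True, source
```

The decision function returns the detected source format even when no conversion is needed, so callers can log why a file was skipped.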
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/test_compression_conversion.py (1)
66-135: Comprehensive happy-path coverage. The conversion tests validate all format pairs (bz2↔gz↔xz) with proper data integrity checks and verification that original files are removed after successful conversion. The tests use temporary directories for isolation, which is good practice.
Consider adding tests for error conditions such as corrupted source files, permission errors, or disk space issues to improve test robustness and ensure graceful failure handling.
💡 Optional: Add error condition tests
Consider adding tests like:
```python
def test_convert_with_corrupted_file():
    """Test conversion fails gracefully with corrupted input"""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create invalid compressed file
        bad_file = os.path.join(tmpdir, "bad.txt.bz2")
        with open(bad_file, 'wb') as f:
            f.write(b"not actually compressed data")

        target_file = os.path.join(tmpdir, "bad.txt.gz")

        # Should raise RuntimeError and not create partial target
        with pytest.raises(RuntimeError):
            _convert_compression_format(bad_file, target_file, "bz2", "gz")

        assert os.path.exists(bad_file)  # Original preserved on failure
        assert not os.path.exists(target_file)  # No partial output
```

This would validate the error handling mentioned in the conversion implementation.
databusclient/api/download.py (1)
134-138: Consider preserving original exception details. Wrapping the exception in `RuntimeError` on line 138 may obscure the root cause. Consider chaining the exception or providing more context.

🔎 Suggested improvement

```diff
 except Exception as e:
     # If conversion fails, ensure the partial target file is removed
     if os.path.exists(target_file):
         os.remove(target_file)
-    raise RuntimeError(f"Compression conversion failed: {e}")
+    raise RuntimeError(f"Compression conversion failed: {e}") from e
```

This preserves the full exception chain for debugging.
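Why `from e` matters can be shown with a minimal, standalone illustration (independent of the PR code): the original error stays attached as `__cause__`, so tracebacks show both failures.

```python
# Chained exceptions preserve the root cause on the raised error.
def convert_or_fail():
    try:
        raise OSError("disk full while writing target")
    except Exception as e:
        raise RuntimeError(f"Compression conversion failed: {e}") from e

try:
    convert_or_fail()
except RuntimeError as err:
    assert isinstance(err.__cause__, OSError)  # root cause preserved for debugging
```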
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_compression_conversion.py
- tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
- `download` (700-836)
- `_detect_compression_format` (31-45)
- `_should_convert_file` (48-79)
- `_get_converted_filename` (82-99)
- `_convert_compression_format` (102-138)
🔇 Additional comments (18)
.gitignore (1)
3-4: LGTM! The additions are appropriate for keeping test artifacts and sensitive authentication data out of version control.
tests/test_download.py (2)
3-3: LGTM! The pytest import is correctly added to support the skip decorator.

30-30: LGTM! Skipping the long-running, network-dependent collection test is a sensible practice for faster local test execution. The reason is clearly documented.
tests/test_compression_conversion.py (4)
17-24: LGTM! The format detection tests comprehensively cover all supported formats (bz2, gz, xz), uncompressed files, and case insensitivity. This validates the detection logic correctly.

26-57: LGTM! The conversion decision logic tests are thorough, covering all key scenarios: no target specified, uncompressed files, same source/target formats, valid conversions, and the convert_from filter behavior.

59-64: LGTM! The filename conversion tests validate all format pair transformations concisely.

137-138: LGTM! Direct pytest execution support is convenient for development and debugging.
README.md (3)
169-173: LGTM! The documentation for the new compression conversion options is clear and accurately describes the functionality. The supported formats and optional filtering behavior are well explained.

185-211: LGTM! The updated help text and reformatted option descriptions improve readability and accurately reflect the new compression conversion capabilities.

264-274: LGTM! The examples section effectively demonstrates the three primary use cases: converting all compressed files, filtering by source format, and converting entire collections. This will help users understand how to leverage the new feature.
databusclient/cli.py (2)
161-170: LGTM! The new CLI options are well-integrated with appropriate validation (restricted choices: bz2, gz, xz) and case-insensitive matching, which aligns with the format detection logic. The help text clearly describes each option's purpose.

180-197: LGTM! The parameters are correctly added to the function signature, the docstring is updated to reflect the new functionality, and the values are properly propagated to the `api_download` call.

databusclient/api/download.py (6)
3-28: LGTM! Clean compression module setup. The imports and constant mappings are well-structured and use standard library modules for compression handling.

31-45: LGTM! Robust extension detection. The case-insensitive matching ensures reliable format detection across different filename conventions.

48-79: LGTM! Conversion decision logic is sound. The function correctly handles all conversion scenarios including filtering and format matching.

141-263: LGTM! Clean integration of conversion into download flow. The conversion logic is correctly placed after download completion and size verification. The function properly handles filenames and paths when calling conversion helpers.

265-299: LGTM! Consistent parameter propagation across all download functions. All internal download functions properly accept and forward `convert_to` and `convert_from` parameters through the call chain, with clear documentation. Also applies to: 437-476, 478-512, 515-557, 624-663.

700-836: The CLI already validates compression format parameters using Click's built-in `Choice` validator, which automatically rejects invalid values before they reach the API function. No additional validation is needed. Likely an incorrect or invalid review comment.
Hi @DhanashreePetare! Just a reminder that this PR has conflicts. No rush
Force-pushed from 98e4f35 to 5277f26
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 82-99: The filename comparison is case-sensitive in
_get_converted_filename, causing mismatches with _detect_compression_format
which lowercases names; change the check to use a case-insensitive test (e.g.,
filename.lower().endswith(source_ext)) while still slicing the original filename
by len(source_ext) so the returned name preserves original casing; keep using
COMPRESSION_EXTENSIONS for source_ext/target_ext and only alter the conditional
and slicing logic accordingly.
- Around line 257-262: The bug is that the filename variable `file` is being
shadowed by a file handle in the `with` block so calls to
`_should_convert_file(file, ...)` end up passing a closed file handle instead of
the original filename string; fix by using the original filename string when
calling `_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` (e.g., pass the variable that holds the
URL-derived filename) or rename the `with` block file handle (e.g., `fh` or
`fhandle`) so `file` continues to refer to the filename string; update calls to
`_should_convert_file`, `_get_converted_filename`, and
`_convert_compression_format` accordingly to avoid calling `.lower()` on a file
object.
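The first agent instruction above (case-insensitive suffix matching in `_get_converted_filename`) can be sketched like this. It is a sketch of the described fix, not the PR's actual code; the unprefixed function name is illustrative.

```python
# Extension mapping as used by the PR's COMPRESSION_EXTENSIONS constant.
COMPRESSION_EXTENSIONS = {"bz2": ".bz2", "gz": ".gz", "xz": ".xz"}

def get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
    """Swap the compression extension, matching it case-insensitively."""
    source_ext = COMPRESSION_EXTENSIONS[source_format]
    target_ext = COMPRESSION_EXTENSIONS[target_format]
    if filename.lower().endswith(source_ext):
        # Slice by extension length so the stem keeps its original casing.
        return filename[:-len(source_ext)] + target_ext
    return filename + target_ext
```

The key point from the instruction: lowercase only for the suffix *test*, but slice the *original* string so `DATA.TXT.GZ` becomes `DATA.TXT.xz` rather than `DATA.TXT.GZ.xz`.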
🧹 Nitpick comments (1)
tests/test_compression_conversion.py (1)
59-63: Consider adding a test for uppercase filenames. These tests cover the happy path but don't exercise the case-sensitivity edge case (e.g., `DATA.TXT.GZ`). Adding such a test would help catch the bug flagged in `_get_converted_filename`.

Suggested test addition

```python
def test_get_converted_filename():
    """Test filename conversion"""
    assert _get_converted_filename("data.txt.bz2", "bz2", "gz") == "data.txt.gz"
    assert _get_converted_filename("data.txt.gz", "gz", "xz") == "data.txt.xz"
    assert _get_converted_filename("data.txt.xz", "xz", "bz2") == "data.txt.bz2"
    # Uppercase extension handling
    assert _get_converted_filename("DATA.TXT.GZ", "gz", "xz") == "DATA.TXT.xz"
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_compression_conversion.py
- tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
download(700-836)_detect_compression_format(31-45)_should_convert_file(48-79)_get_converted_filename(82-99)_convert_compression_format(102-138)
🔇 Additional comments (14)
.gitignore (1)
3-4: LGTM! Appropriate additions to ignore test artifacts (`test-download/`) and sensitive token files (`vault-token.dat`).

databusclient/api/download.py (6)
17-28: LGTM! Clean dictionary mappings for compression formats. This design allows easy extension to additional formats in the future.

31-45: LGTM! Case-insensitive detection logic is correct and well-documented.

48-79: LGTM! Decision logic correctly handles all edge cases: no target specified, uncompressed files, same source/target formats, and source format filtering.

102-138: LGTM! The chunked decompress → recompress pipeline is correctly implemented with proper cleanup on failure. The 8 KiB chunk size is reasonable for streaming.
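The chunked pipeline praised here can be sketched as below. This is an illustration of the technique (streaming via the stdlib `bz2`/`gzip`/`lzma` modules), not the PR's exact `_convert_compression_format`; the `COMPRESSION_MODULES` mapping and function name are assumptions.

```python
import bz2
import gzip
import lzma
import os

# Hypothetical format -> stdlib module mapping, in the spirit of the PR's constants.
COMPRESSION_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}

def convert_compression(source_path, target_path, source_fmt, target_fmt,
                        chunk_size=8192):
    """Stream-decompress source and recompress to target in fixed-size chunks."""
    src_mod = COMPRESSION_MODULES[source_fmt]
    dst_mod = COMPRESSION_MODULES[target_fmt]
    try:
        with src_mod.open(source_path, "rb") as src, \
             dst_mod.open(target_path, "wb") as dst:
            while chunk := src.read(chunk_size):
                dst.write(chunk)
    except Exception as e:
        if os.path.exists(target_path):
            os.remove(target_path)   # clean up partial output on failure
        raise RuntimeError(f"Compression conversion failed: {e}") from e
    os.remove(source_path)           # drop the original only after success
```

Because decompression and recompression happen per chunk, memory use stays bounded regardless of file size.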
265-298: LGTM! The `convert_to` and `convert_from` parameters are correctly propagated through `_download_files` to the underlying `_download_file` calls.

700-727: LGTM! The public `download()` API correctly exposes and documents the new compression conversion parameters, with clear docstrings explaining the supported formats.

tests/test_compression_conversion.py (5)
1-14: LGTM! Appropriate imports for testing the compression conversion utilities.

17-23: LGTM! Good test coverage including case-insensitive matching.

26-56: LGTM! Comprehensive test coverage for all decision branches in `_should_convert_file`.

66-134: LGTM! Well-structured conversion tests covering all format pairs with proper data integrity verification. The use of `tempfile.TemporaryDirectory` ensures proper cleanup.

137-138: LGTM! Standard pytest main block for direct execution.
tests/test_download.py (2)
3-4: LGTM! Added pytest import to support the skip decorator.

30-32: LGTM! Skipping the long-running collection download test is appropriate for CI stability. The skip reason clearly documents why this test is disabled.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
databusclient/api/download.py (8)
269-303: Critical syntax error causing pipeline failure. The function `_download_file` has malformed syntax with duplicate function signatures and docstrings. Lines 276-290 contain the new docstring with convert parameters, but lines 291-303 contain remnants of the original function definition mixed in, causing a `SyntaxError: Simple statements must be separated by newlines or semicolons`. The function should have a single, clean signature and docstring.

🐛 Proposed fix - consolidate the function definition

```diff
 def _download_file(
     url,
     localDir,
     vault_token_file=None,
     databus_key=None,
     auth_url=None,
     client_id=None,
     convert_to=None,
     convert_from=None,
-) -> None:
-    """
-    Download a file from the internet with a progress bar using tqdm.
-
-    Parameters:
-    - url: the URL of the file to download
-    - localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
-    - vault_token_file: Path to Vault refresh token file
-    - databus_key: Databus API key for protected downloads
-    - auth_url: Keycloak token endpoint URL
-    - client_id: Client ID for token exchange
-    - convert_to: Target compression format for on-the-fly conversion
-    - convert_from: Optional source compression format filter
     validate_checksum: bool = False,
     expected_checksum: str | None = None,
 ) -> None:
     """Download a file from the internet with a progress bar using tqdm.
 
     Args:
         url: The URL of the file to download.
         localDir: Local directory to download file to. If None, the databus folder structure is created in the current working directory.
         vault_token_file: Path to Vault refresh token file.
         databus_key: Databus API key for protected downloads.
         auth_url: Keycloak token endpoint URL.
         client_id: Client ID for token exchange.
+        convert_to: Target compression format for on-the-fly conversion.
+        convert_from: Optional source compression format filter.
+        validate_checksum: Whether to validate the file checksum after download.
+        expected_checksum: Expected SHA256 checksum for validation.
     """
```
443-470: Checksum validation will fail after compression conversion. After conversion, the original file is deleted (in `_convert_compression_format`, line 135), but checksum validation at line 453 still attempts to validate `filename`, which no longer exists. Additionally, the expected checksum from Databus metadata corresponds to the original compressed file, not the converted one.

Either:
- Skip checksum validation when conversion is performed, or
- Validate the checksum before conversion

🐛 Proposed fix - validate checksum before conversion

```diff
 # --- 5. Verify download size ---
 if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
     raise IOError("Downloaded size does not match Content-Length header")
 
-# --- 6. Convert compression format if requested ---
-should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
-if should_convert and source_format:
-    target_filename = _get_converted_filename(file, source_format, convert_to)
-    target_filepath = os.path.join(localDir, target_filename)
-    _convert_compression_format(filename, target_filepath, source_format, convert_to)
-# --- 6. Optional checksum validation ---
+# --- 6. Optional checksum validation (must happen before conversion) ---
 if validate_checksum:
     # reuse compute_sha256_and_length from webdav extension
     try:
         actual, _ = compute_sha256_and_length(filename)
     except (OSError, IOError) as e:
         print(f"WARNING: error computing checksum for (unknown): {e}")
         actual = None
 
     if expected_checksum is None:
         print(f"WARNING: no expected checksum available for (unknown); skipping validation")
     elif actual is None:
         print(f"WARNING: could not compute checksum for (unknown); skipping validation")
     else:
         if actual.lower() != expected_checksum.lower():
             try:
                 os.remove(filename)  # delete corrupted file
             except OSError:
                 pass
             raise IOError(
                 f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
             )
 
+# --- 7. Convert compression format if requested ---
+should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+if should_convert and source_format:
+    target_filename = _get_converted_filename(file, source_format, convert_to)
+    target_filepath = os.path.join(localDir, target_filename)
+    _convert_compression_format(filename, target_filepath, source_format, convert_to)
```
473-507: Same syntax error pattern as `_download_file`. The `_download_files` function has the same malformed definition with duplicate/overlapping docstrings. This needs to be consolidated similar to the fix for `_download_file`.

659-695: Same syntax error pattern in `_download_collection`. This function also has overlapping/duplicate docstrings that need consolidation.

720-752: Same syntax error pattern in `_download_version`. This function also has overlapping/duplicate docstrings that need consolidation.

777-812: Same syntax error pattern in `_download_artifact`. This function also has overlapping/duplicate docstrings that need consolidation.

906-941: Same syntax error pattern in `_download_group`. This function also has overlapping/duplicate docstrings that need consolidation.

1004-1029: Same syntax error pattern in the `download` function. The main `download` function also has overlapping/duplicate docstrings that need consolidation.

databusclient/cli.py (3)
165-177: Malformed click option definition. The `--convert-from` option definition is incomplete. Line 174 shows `"--validate-checksum",`, which appears to be the start of a separate option that was incorrectly merged into the `--convert-from` definition.

🐛 Proposed fix - separate the options properly

```diff
 @click.option(
     "--convert-from",
     type=click.Choice(["bz2", "gz", "xz"], case_sensitive=False),
     help="Source compression format to convert from (optional filter). Only files with this compression will be converted.",
+)
+@click.option(
     "--validate-checksum", is_flag=True, help="Validate checksums of downloaded files"
 )
```
178-191: Malformed function signature. The function signature has invalid syntax with a closing parenthesis, then a parameter, then another closing parenthesis on separate lines (189-191). This needs to be a single, properly formatted parameter list.

🐛 Proposed fix

```diff
 def download(
     databusuris: List[str],
     localdir,
     databus,
     vault_token,
     databus_key,
     all_versions,
     authurl,
     clientid,
     convert_to,
     convert_from,
-):
-    validate_checksum,
-):
+    validate_checksum,
+):
```
196-221: Duplicate api_download calls - only one should exist. There are two calls to `api_download`: one at lines 196-207 (outside the try block, has conversion params, missing validate_checksum) and another at lines 209-219 (inside the try block, has validate_checksum, missing conversion params). This should be a single call with all parameters inside the try block.

🐛 Proposed fix - consolidate into single call

```diff
-    api_download(
-        localDir=localdir,
-        endpoint=databus,
-        databusURIs=databusuris,
-        token=vault_token,
-        databus_key=databus_key,
-        all_versions=all_versions,
-        auth_url=authurl,
-        client_id=clientid,
-        convert_to=convert_to,
-        convert_from=convert_from,
-    )
     try:
         api_download(
             localDir=localdir,
             endpoint=databus,
             databusURIs=databusuris,
             token=vault_token,
             databus_key=databus_key,
             all_versions=all_versions,
             auth_url=authurl,
             client_id=clientid,
-            validate_checksum=validate_checksum
+            convert_to=convert_to,
+            convert_from=convert_from,
+            validate_checksum=validate_checksum,
         )
     except DownloadAuthError as e:
         raise click.ClickException(str(e))
```
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 6-8: The imports in this module duplicate List from typing; remove
the redundant "from typing import List" (the second occurrence) and consolidate
typing imports so you only have one line like "from typing import List,
Optional, Tuple" while keeping the existing "import re" intact; update the
import block (references: List, Optional, Tuple, re) accordingly.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)
85-102: Consider case-insensitive extension matching. The `_get_converted_filename` function uses `filename.endswith(source_ext)`, which is case-sensitive, but `_detect_compression_format` lowercases the filename before checking. If a file has a mixed-case extension like `.BZ2`, detection will work but filename conversion will append rather than replace the extension.

♻️ Proposed fix for case-insensitive handling

```diff
 def _get_converted_filename(filename: str, source_format: str, target_format: str) -> str:
     source_ext = COMPRESSION_EXTENSIONS[source_format]
     target_ext = COMPRESSION_EXTENSIONS[target_format]
-    if filename.endswith(source_ext):
+    if filename.lower().endswith(source_ext):
         return filename[:-len(source_ext)] + target_ext
     return filename + target_ext
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- README.md
- databusclient/api/download.py
- databusclient/cli.py
- tests/test_download.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_download.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (1)
databusclient/cli.py (1)
`download` (178-190)
🪛 GitHub Actions: Python CI (Lint & pytest)
databusclient/api/download.py
[error] 294-294: ruff check failed: SyntaxError: Simple statements must be separated by newlines or semicolons.
🔇 Additional comments (6)
README.md (2)
177-180: LGTM! The documentation for the `--convert-to` and `--convert-from` options is clear and provides helpful examples. The descriptions accurately reflect the implementation behavior (decompress → recompress pipeline, format filtering).

272-282: LGTM! The examples section effectively demonstrates the three key use cases for compression conversion: converting all files to a target format, filtering by source format, and unifying a collection. This aligns well with the PR objectives.
databusclient/api/download.py (4)
20-31: LGTM! The compression format mappings are clean and provide a single source of truth for supported formats. The module-to-format mapping correctly uses Python's built-in compression libraries.

34-48: LGTM! The compression detection is case-insensitive, which is a good defensive approach for handling filenames with varied casing.

51-82: LGTM! The conversion decision logic correctly handles all edge cases: no target format, uncompressed files, same source/target format, and optional source format filtering.

105-141: LGTM! The conversion implementation properly:
- Uses chunked streaming (8KB) to avoid memory issues with large files
- Cleans up partial target files on failure
- Removes the original file only after successful conversion
- Provides user feedback during conversion
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
432-459: Critical: Variable shadowing causes runtime crash and checksum validation on deleted file. Two critical issues:

1. Variable shadowing bug: At line 306, `file = url.split("/")[-1]` stores the filename string. But at line 422, `with open(filename, "wb") as file:` shadows this with a file handle. At line 433, `_should_convert_file(file, ...)` passes the closed file handle, not the filename string. This will crash with `AttributeError: '_io.BufferedWriter' object has no attribute 'lower'`.
2. Checksum on deleted file: If conversion succeeds, `_convert_compression_format` deletes the original file at `filename`, but line 442 tries to compute the checksum on `filename`, which no longer exists.

🐛 Proposed fix

```diff
-    file = url.split("/")[-1]
-    filename = os.path.join(localDir, file)
+    base_filename = url.split("/")[-1]
+    filename = os.path.join(localDir, base_filename)
     print(f"Download file: {url}")
```

And update the download and conversion section:

```diff
     # --- 4. Download with progress bar ---
     total_size_in_bytes = int(response.headers.get("content-length", 0))
     block_size = 1024  # 1 KiB
     progress_bar = tqdm(total=total_size_in_bytes, unit="iB", unit_scale=True)
-    with open(filename, "wb") as file:
+    with open(filename, "wb") as f:
         for data in response.iter_content(block_size):
             progress_bar.update(len(data))
-            file.write(data)
+            f.write(data)
     progress_bar.close()
 
     # --- 5. Verify download size ---
     if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
         raise IOError("Downloaded size does not match Content-Length header")
 
     # --- 6. Convert compression format if requested ---
-    should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
+    should_convert, source_format = _should_convert_file(base_filename, convert_to, convert_from)
+    checksum_filepath = filename  # Track which file to validate
     if should_convert and source_format:
-        target_filename = _get_converted_filename(file, source_format, convert_to)
+        target_filename = _get_converted_filename(base_filename, source_format, convert_to)
         target_filepath = os.path.join(localDir, target_filename)
         _convert_compression_format(filename, target_filepath, source_format, convert_to)
-    # --- 6. Optional checksum validation ---
+        checksum_filepath = target_filepath  # Validate converted file
+    # --- 7. Optional checksum validation ---
     if validate_checksum:
         # reuse compute_sha256_and_length from webdav extension
         try:
-            actual, _ = compute_sha256_and_length(filename)
+            actual, _ = compute_sha256_and_length(checksum_filepath)
         except (OSError, IOError) as e:
-            print(f"WARNING: error computing checksum for (unknown): {e}")
+            print(f"WARNING: error computing checksum for {checksum_filepath}: {e}")
             actual = None
```

Note: Checksum validation after conversion will compare against the original expected checksum, which won't match the converted file. Consider whether checksum validation should be skipped or adapted when conversion is performed.
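The shadowing pitfall is easy to reproduce with a toy snippet, independent of the PR code: after the with-block, the name bound by `as` still refers to the (now closed) handle, not the earlier string.

```python
import os
import tempfile

def buggy_download(data: bytes) -> str:
    file = "data.txt.bz2"                       # filename string
    path = os.path.join(tempfile.gettempdir(), file)
    with open(path, "wb") as file:              # shadows the string with a handle
        file.write(data)
    os.remove(path)
    return file.lower()                         # AttributeError: handle has no .lower()

try:
    buggy_download(b"payload")
except AttributeError as e:
    print(f"reproduced: {e}")
```

Renaming either variable (the review suggests `base_filename` or `f`) removes the collision entirely.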
🤖 Fix all issues with AI agents
In @databusclient/api/download.py:
- Around line 85-102: The function _get_converted_filename fails for uppercase
extensions; change the conditional to perform a case-insensitive suffix check
(e.g., use filename.lower().endswith(source_ext.lower())) so files like
"file.BZ2" are recognized, then slice off the matched extension length from the
original filename (filename[:-len(source_ext)]) and append target_ext as before;
keep using COMPRESSION_EXTENSIONS for source_ext/target_ext and leave the append
behavior when no match unchanged.
🧹 Nitpick comments (1)
databusclient/api/download.py (1)
105-141: LGTM with optional improvement note. Good error handling with cleanup of partial output on failure. The chunked I/O approach is memory-efficient.

Consider using `logging` instead of `print()` for better control over verbosity, but this is fine for a CLI tool.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- databusclient/cli.py
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/api/download.py (2)
databusclient/api/utils.py (1)
`compute_sha256_and_length` (62-72)

databusclient/extensions/webdav.py (1)

`compute_sha256_and_length` (16-34)
🔇 Additional comments (12)
databusclient/api/download.py (10)
3-6: LGTM! Standard library imports for compression modules and typing are correctly added.

20-31: LGTM! Clean and correct mappings for the supported compression formats.

34-48: LGTM! Case-insensitive extension detection is correctly implemented.

51-82: LGTM! Conversion decision logic correctly handles all cases: no target format, uncompressed files, same-format skip, and source format filtering.

462-503: LGTM! Parameters correctly propagated to `_download_files` and forwarded to individual file downloads.

639-686: LGTM! Collection download correctly propagates conversion parameters.

689-733: LGTM! Version download correctly propagates conversion parameters.

736-788: LGTM! Artifact download correctly propagates conversion parameters through version iterations.

852-893: LGTM! Group download correctly propagates conversion parameters to artifact downloads.

930-1098: LGTM! The main `download` function correctly propagates conversion parameters to all download paths (collection, file, version, artifact, group, and query-based downloads).

databusclient/cli.py (2)
165-174: LGTM! CLI options are correctly defined with appropriate choices and help text. Case-insensitive matching is a good UX choice.

180-212: LGTM! Parameters correctly added to the function signature and properly passed to `api_download`. The docstring update accurately describes the new capability.
databusclient/api/download.py (outdated)

```python
    """
    Detect compression format from file extension.
    Parameters:
    - filename: Name of the file
    Returns:
    - Compression format string ('bz2', 'gz', 'xz') or None if not compressed
    """
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)
databusclient/api/download.py (outdated)

```python
    """
    Determine if a file should be converted and what the source format is.
    Parameters:
    - filename: Name of the file
    - convert_to: Target compression format
    - convert_from: Optional source compression format filter
    Returns:
    - Tuple of (should_convert: bool, source_format: Optional[str])
    """
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)
databusclient/api/download.py (outdated)

```python
    """
    Generate the new filename after compression format conversion.
    Parameters:
    - filename: Original filename
    - source_format: Source compression format ('bz2', 'gz', 'xz')
    - target_format: Target compression format ('bz2', 'gz', 'xz')
    Returns:
    - New filename with updated extension
    """
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use python google style docstring (see example https://google.github.io/styleguide/pyguide.html#244-decision)
Addresses all feedback from @Integer-Ctrl on PR dbpedia#45
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/api/download.py (1)
439-466: Critical bug: Checksum validation fails after conversion because the original file was deleted. When conversion is performed (lines 441-444), `_convert_compression_format` removes the original source file (`filename`). However, checksum validation (lines 446-466) still attempts to compute the checksum of `filename`, which no longer exists after successful conversion. This will raise an `OSError` during `compute_sha256_and_length`. Additionally, checksum validation should probably validate the original file before conversion, not after, since the expected checksum from metadata corresponds to the original compressed format.
🐛 Proposed fix: Validate checksum before conversion

```diff
+    # --- 6. Optional checksum validation (before conversion) ---
+    if validate_checksum:
+        try:
+            actual, _ = compute_sha256_and_length(filename)
+        except (OSError, IOError) as e:
+            print(f"WARNING: error computing checksum for (unknown): {e}")
+            actual = None
+
+        if expected_checksum is None:
+            print(f"WARNING: no expected checksum available for (unknown); skipping validation")
+        elif actual is None:
+            print(f"WARNING: could not compute checksum for (unknown); skipping validation")
+        else:
+            if actual.lower() != expected_checksum.lower():
+                try:
+                    os.remove(filename)  # delete corrupted file
+                except OSError:
+                    pass
+                raise IOError(
+                    f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
+                )
+
-    # --- 6. Convert compression format if requested ---
+    # --- 7. Convert compression format if requested ---
     should_convert, source_format = _should_convert_file(file, convert_to, convert_from)
     if should_convert and source_format:
         target_filename = _get_converted_filename(file, source_format, convert_to)
         target_filepath = os.path.join(localDir, target_filename)
         _convert_compression_format(filename, target_filepath, source_format, convert_to)

-    # --- 6. Optional checksum validation ---
-    if validate_checksum:
-        # reuse compute_sha256_and_length from webdav extension
-        try:
-            actual, _ = compute_sha256_and_length(filename)
-        except (OSError, IOError) as e:
-            print(f"WARNING: error computing checksum for (unknown): {e}")
-            actual = None
-
-        if expected_checksum is None:
-            print(f"WARNING: no expected checksum available for (unknown); skipping validation")
-        elif actual is None:
-            print(f"WARNING: could not compute checksum for (unknown); skipping validation")
-        else:
-            if actual.lower() != expected_checksum.lower():
-                try:
-                    os.remove(filename)  # delete corrupted file
-                except OSError:
-                    pass
-                raise IOError(
-                    f"Checksum mismatch for (unknown): expected {expected_checksum}, got {actual}"
-                )
```
♻️ Duplicate comments (1)
databusclient/api/download.py (1)
129-129: Misleading comment about progress indication. The comment states "with progress indication" but no progress bar is implemented for the conversion process. This was flagged in a previous review.
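If actual progress output is wanted, the conversion loop can report decompressed bytes as it streams chunks between the two codecs. A minimal sketch (the function name, chunk size, and print format are illustrative, not the PR's code):

```python
import bz2
import gzip
import lzma

_MODULES = {"bz2": bz2, "gz": gzip, "xz": lzma}


def convert_with_progress(src_path, dst_path, src_fmt, dst_fmt,
                          chunk_size=1 << 20):
    """Stream-decompress src_path and recompress into dst_path,
    printing the number of decompressed bytes processed so far."""
    processed = 0
    with _MODULES[src_fmt].open(src_path, "rb") as fin, \
         _MODULES[dst_fmt].open(dst_path, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            processed += len(chunk)
            print(f"\r  converted {processed / 1_000_000:.1f} MB",
                  end="", flush=True)
    print()
```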
🧹 Nitpick comments (4)
databusclient/api/download.py (2)
476-477: Consider using `Optional[str]` for type consistency. The type hints use `str = None` instead of `Optional[str] = None`, which is inconsistent with other parameters in the codebase (e.g., line 6 imports `Optional` and other functions use it). While this works, `Optional[str]` more explicitly conveys that `None` is a valid value.
439-445: Duplicate step numbers in comments. Both the conversion step (line 439) and checksum validation step (line 445) are numbered "6". After fixing the checksum validation ordering issue, update the step numbers accordingly.
tests/test_compression_conversion.py (2)
59-64: Consider adding a test for the fallback extension append case. The `_get_converted_filename` function has a fallback path that appends the target extension when the source extension isn't found (line 100 in download.py). This branch isn't tested here. While unlikely to be hit given how the code flows, testing it would improve coverage.

💡 Example test case

```python
# Test fallback when source extension not found (edge case)
assert _get_converted_filename("data.txt", "bz2", "gz") == "data.txt.gz"
```
179-194: Consider verifying that the source file is preserved on conversion failure. The test verifies that the partial target file is cleaned up, but doesn't verify that the source file (`corrupted.bz2`) still exists after the failure. Since the source is only removed on successful conversion, verifying its preservation ensures the error handling doesn't accidentally delete user data.

💡 Suggested addition

```diff
 # Verify target file was cleaned up
 assert not os.path.exists(target_file)
+
+# Verify source file is preserved (not deleted on failure)
+assert os.path.exists(source_file)
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- databusclient/api/download.py
- tests/test_compression_conversion.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_compression_conversion.py (1)
databusclient/api/download.py (5)
`download` (937-1105), `_detect_compression_format` (34-47), `_should_convert_file` (50-80), `_get_converted_filename` (83-100), `_convert_compression_format` (103-148)
databusclient/api/download.py (3)
databusclient/api/utils.py (1)
`compute_sha256_and_length` (62-72)
databusclient/extensions/webdav.py (1)
`compute_sha256_and_length` (16-34)
tests/test_download_auth.py (1)
`iter_content` (19-23)
🔇 Additional comments (12)
databusclient/api/download.py (5)
20-31: LGTM! The compression format mappings are well-organized and correctly associate format identifiers with their file extensions and corresponding Python modules.

34-47: LGTM! The detection function correctly handles case-insensitive extension matching and returns `None` for uncompressed files.
50-80: LGTM! The conversion decision logic correctly handles all edge cases: no target specified, uncompressed files, same source/target formats, and the optional `convert_from` filter.
83-100: LGTM! The filename conversion correctly uses case-insensitive extension matching while preserving the original filename case in the output.
654-655: LGTM! The `convert_to` and `convert_from` parameters are consistently propagated through all download entry points and helper functions, enabling on-the-fly compression conversion throughout the download workflow.

Also applies to: 689-690, 703-704, 736-737, 751-752, 791-792, 867-868, 897-898, 946-947, 991-992, 1019-1020, 1033-1034, 1049-1050, 1065-1066, 1101-1102
tests/test_compression_conversion.py (7)
17-24: LGTM! Good coverage of compression format detection including case-insensitive handling.

26-57: LGTM! Comprehensive test coverage for the conversion decision logic, covering all branch conditions including the `convert_from` filter.
66-88: LGTM! Solid end-to-end test verifying file conversion, source removal, and data integrity.

91-134: LGTM! Good coverage of additional conversion paths (gz→xz, xz→bz2) ensuring all supported format combinations work correctly.

137-146: LGTM! Good coverage of case-insensitive extension handling, addressing PR feedback.
148-176: LGTM! Good validation tests ensuring unsupported formats raise appropriate `ValueError` exceptions, addressing PR feedback about input validation.

197-198: LGTM! Standard pytest entry point for direct execution.
|
I apologize for my naivety earlier. I've now fixed all the issues.
Please review it and let me know. Thank you! |
Pull Request
Description
This PR implements on-the-fly compression format conversion during download, allowing users to convert downloaded files between `bz2`, `gz`, and `xz` formats automatically during the download process. This feature makes it easier to unify datasets with consistent compression formats, save disk space, or integrate data into pipelines that expect specific formats.

Key Features:
- `--convert-to` CLI option to specify target compression format (bz2, gz, xz)
- `--convert-from` CLI option to filter which source formats to convert
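In effect, the two options combine into a per-file pipeline: detect the source format, skip files that don't match the `--convert-from` filter, rename the extension, and recompress. A condensed sketch of that pipeline (the `maybe_convert` helper and `_FORMATS` mapping are illustrative names, not the PR's actual API):

```python
import bz2
import gzip
import lzma
import os

_FORMATS = {"bz2": bz2, "gz": gzip, "xz": lzma}


def maybe_convert(path, convert_to, convert_from=None):
    """Convert path to convert_to if applicable; return the final path."""
    source = next((f for f in _FORMATS if path.lower().endswith("." + f)), None)
    if source is None or source == convert_to:
        return path  # uncompressed or already in the target format
    if convert_from is not None and source != convert_from:
        return path  # filtered out by --convert-from
    new_path = path[: -len(source) - 1] + "." + convert_to
    with _FORMATS[source].open(path, "rb") as fin, \
         _FORMATS[convert_to].open(new_path, "wb") as fout:
        while chunk := fin.read(1 << 20):
            fout.write(chunk)
    os.remove(path)  # the original is replaced, mirroring the PR's behavior
    return new_path
```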
Resolves #18
Type of change
Checklist:
poetry run pytest- all tests passedpoetry run ruff check- no linting errorsSummary by CodeRabbit
New Features
Documentation
Tests
Chores