
Feature/67 simple blob storage db scan query #68

Merged
jathavaan merged 6 commits into main from feature/67-simple-blob-storage-db-scan-query
Feb 18, 2026

Conversation

@jathavaan
Collaborator

This pull request introduces a new entrypoint for running a blob storage database scan and refactors the way scripts are invoked via the command line. It also updates dependency injection wiring and makes a minor correction in threading initialization. The most important changes are grouped below:

Entrypoint and Script Invocation Improvements:

  • Added a new entrypoint function blob_storage_db_scan in src/presentation/entrypoints/blob_storage_db_scan.py, which scans parquet files in blob storage using DuckDB and a virtual filesystem path service.
  • Refactored main.py to use argparse for selecting which script to run (conflation-pipeline or blob-storage-db-scan), improving extensibility and error handling for script IDs.
  • Updated the docker-compose.yml file to add a new service blob_storage_db_scan and changed the command for the main service to accept script IDs as arguments.

Dependency Injection and Wiring:

  • Updated dependency injection wiring in src/presentation/configuration/app_config.py to include the new blob_storage_db_scan entrypoint.
  • Added blob_storage_db_scan to the __init__.py of the entrypoints module for easier imports.

Minor Technical Fixes:

  • Fixed the type annotation for the target parameter in _initialize_threading in monitor.py to accept object | None instead of a callable, aligning with how it's used.

@jathavaan jathavaan self-assigned this Feb 18, 2026
Copilot AI review requested due to automatic review settings February 18, 2026 11:54
@jathavaan jathavaan linked an issue Feb 18, 2026 that may be closed by this pull request
2 tasks
Contributor

Copilot AI left a comment


Pull request overview

Adds a new CLI-selectable entrypoint to scan parquet datasets in Azure Blob Storage via DuckDB, while refactoring script invocation to support multiple runnable “script IDs” and updating wiring/runtime configuration accordingly.

Changes:

  • Introduces blob_storage_db_scan entrypoint and wires it into the DI container and entrypoints module.
  • Refactors main.py to dispatch execution based on an argparse-provided script ID.
  • Updates docker-compose.yml to run the default pipeline via a script ID and adds a new scan service; adjusts Windows-only dependencies via environment markers.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/presentation/entrypoints/blob_storage_db_scan.py Adds a new entrypoint intended to scan parquet files in blob storage using DuckDB + virtual filesystem paths.
src/presentation/entrypoints/__init__.py Exposes the new entrypoint for simpler imports.
src/presentation/configuration/app_config.py Wires the new entrypoint module into Dependency Injector.
src/application/common/monitor.py Updates threading initialization type annotation.
main.py Adds argparse-based script selection and dispatch between entrypoints.
docker-compose.yml Runs the main service with a script ID and adds a new scan service.
requirements.txt Restricts pywin32/pywinpty installation to Windows via environment markers.
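For reference, PEP 508 environment markers in requirements.txt take this form (the pinned versions below are illustrative, not taken from the PR):

```text
pywin32==311; sys_platform == "win32"
pywinpty==2.0.15; sys_platform == "win32"
```

pip evaluates the marker at install time, so these packages are simply skipped on non-Windows platforms.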


Copilot AI review requested due to automatic review settings February 18, 2026 13:26
@jathavaan jathavaan merged commit d2680d8 into main Feb 18, 2026
2 checks passed
@jathavaan jathavaan deleted the feature/67-simple-blob-storage-db-scan-query branch February 18, 2026 13:26

- def monitor_cpu_and_ram(run_id: str, query_id: str, interval: float = 0.05):
+ def monitor_cpu_and_ram(run_id: str, query_id: str, interval: float = 0.00005):

Copilot AI Feb 18, 2026


The new default sampling interval (50µs) is extremely small and will likely create very large sample lists, distort the benchmark being measured, and add significant CPU overhead (especially for long-running scripts). Consider restoring the previous default or using a more reasonable minimum interval (and/or bounding the number of samples).
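A sketch of the suggested mitigation, clamping the interval and capping the sample count. The threshold values here are illustrative, not from the PR, and the real monitor body is not shown in this excerpt:

```python
MIN_INTERVAL = 0.01     # 10 ms floor; a 50 µs interval would dominate the measured workload
MAX_SAMPLES = 100_000   # hard cap so long-running scripts cannot exhaust memory


def clamp_interval(interval: float) -> float:
    """Floor the sampling interval so extreme defaults cannot distort the benchmark."""
    return max(interval, MIN_INTERVAL)


def record_sample(samples: list[float], value: float) -> bool:
    """Append a reading unless the cap is hit; returns False when sampling should stop."""
    if len(samples) >= MAX_SAMPLES:
        return False
    samples.append(value)
    return True
```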

Comment on lines 12 to +15
  def main() -> None:
      initialize_dependencies()
-     run_pipeline()
+     script_id = get_script_id()


Copilot AI Feb 18, 2026


initialize_dependencies() is called before parsing CLI args. If the user passes no/invalid args, the process will still perform dependency wiring/initialization (and potentially create external connections) before exiting. Parse/validate arguments first, then initialize dependencies for the selected script.
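A sketch of the suggested ordering, with stand-in bodies for the PR's `get_script_id` and `initialize_dependencies` (the `log` parameter exists only so the sketch can show the ordering):

```python
VALID_IDS = ("conflation-pipeline", "blob-storage-db-scan")


def get_script_id_stub(argv: list[str]) -> str:
    """Stand-in for argparse parsing; SystemExit(2) mirrors argparse on bad input."""
    if argv and argv[0] in VALID_IDS:
        return argv[0]
    raise SystemExit(2)


def initialize_dependencies(log: list[str]) -> None:
    """Stand-in for DI wiring that may open external connections."""
    log.append("wired")


def main(argv: list[str], log: list[str]) -> None:
    script_id = get_script_id_stub(argv)  # parse/validate FIRST; exits on bad input
    initialize_dependencies(log)          # only then pay the wiring cost
    log.append(f"ran {script_id}")
```

With this ordering, an invalid script ID exits before any dependency wiring runs.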

            blob_storage_db_scan()
            return
        case _:
            raise ValueError("Script ID is invalid")

Copilot AI Feb 18, 2026


The error message "Script ID is invalid" doesn’t tell the user what was received or what valid IDs are. Prefer letting argparse validate via choices=... (or call parser.error(...)) so the CLI prints a helpful usage message and lists valid script IDs.

Comment on lines +29 to +31
parser.add_argument("id", help="ID of script to run")
args = parser.parse_args()
return args.id

Copilot AI Feb 18, 2026


The positional argument name id is ambiguous and reads like an internal identifier rather than a script selector. Consider renaming it to script_id and using choices so invalid values are rejected consistently at the parsing layer.

Suggested change:

- parser.add_argument("id", help="ID of script to run")
- args = parser.parse_args()
- return args.id
+ parser.add_argument(
+     "script_id",
+     help="ID of script to run",
+     choices=["conflation-pipeline", "blob-storage-db-scan"],
+ )
+ args = parser.parse_args()
+ return args.script_id

Comment on lines +14 to +19
path = path_service.create_virtual_filesystem_path(
    storage_scheme="az",
    release="2026-02-16.3",
    container=StorageContainer.DATA,
    theme=Theme.BUILDINGS,
    region="*",

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entrypoint hard-codes release/theme/region/file pattern values, which makes the script difficult to reuse for other scans. Consider accepting these via CLI args/env vars (with sensible defaults) so the scan can be targeted without code changes.
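One hedged way to lift the hard-coded values into environment variables with the current values as defaults. The variable names are illustrative (not part of the PR), and the enum members are reduced to strings for the sketch:

```python
import os


def scan_config() -> dict[str, str]:
    """Read scan targets from the environment, defaulting to the PR's
    current hard-coded values. SCAN_* names are hypothetical."""
    return {
        "release": os.environ.get("SCAN_RELEASE", "2026-02-16.3"),
        "theme": os.environ.get("SCAN_THEME", "buildings"),
        "region": os.environ.get("SCAN_REGION", "*"),
        "file_name": os.environ.get("SCAN_FILE_NAME", "*.parquet"),
    }
```

The entrypoint would then pass `scan_config()` values into `create_virtual_filesystem_path`, so a different release or region can be scanned without a code change.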

    file_name="*.parquet"
)

db_context.execute(f"SELECT count(*) AS count FROM read_parquet('{path}')")

Copilot AI Feb 18, 2026


The query result is not fetched/logged, so the script produces no observable output (and it’s easy to miss whether the scan actually succeeded). Consider fetching the count and logging/printing it (or otherwise surfacing scan completion).

Suggested change:

- db_context.execute(f"SELECT count(*) AS count FROM read_parquet('{path}')")
+ result = db_context.execute(f"SELECT count(*) AS count FROM read_parquet('{path}')")
+ row_count = result.fetchone()[0]
+ print(f"Blob storage DB scan completed; row count: {row_count}")


Development

Successfully merging this pull request may close these issues.

Simple blob storage DB scan query
