Feature/67 simple blob storage db scan query #68
Conversation
Pull request overview
Adds a new CLI-selectable entrypoint to scan parquet datasets in Azure Blob Storage via DuckDB, while refactoring script invocation to support multiple runnable “script IDs” and updating wiring/runtime configuration accordingly.
Changes:
- Introduces a `blob_storage_db_scan` entrypoint and wires it into the DI container and entrypoints module.
- Refactors `main.py` to dispatch execution based on an argparse-provided script ID.
- Updates `docker-compose.yml` to run the default pipeline via a script ID and adds a new scan service; adjusts Windows-only dependencies via environment markers.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `src/presentation/entrypoints/blob_storage_db_scan.py` | Adds a new entrypoint intended to scan parquet files in blob storage using DuckDB + virtual filesystem paths. |
| `src/presentation/entrypoints/__init__.py` | Exposes the new entrypoint for simpler imports. |
| `src/presentation/configuration/app_config.py` | Wires the new entrypoint module into Dependency Injector. |
| `src/application/common/monitor.py` | Updates threading initialization type annotation. |
| `main.py` | Adds argparse-based script selection and dispatch between entrypoints. |
| `docker-compose.yml` | Runs the main service with a script ID and adds a new scan service. |
| `requirements.txt` | Restricts pywin32/pywinpty installation to Windows via environment markers. |
```diff
-def monitor_cpu_and_ram(run_id: str, query_id: str, interval: float = 0.05):
+def monitor_cpu_and_ram(run_id: str, query_id: str, interval: float = 0.00005):
```
The new default sampling interval (50µs) is extremely small and will likely create very large sample lists, distort the benchmark being measured, and add significant CPU overhead (especially for long-running scripts). Consider restoring the previous default or using a more reasonable minimum interval (and/or bounding the number of samples).
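A minimal sketch of the bounded approach, assuming `psutil` is available for sampling; the `stop_event` parameter, the cap constants, and the return shape are illustrative assumptions, not the repository's actual monitor:

```python
import threading
import time

import psutil

MIN_INTERVAL_S = 0.05   # refuse sub-50 ms sampling; finer granularity mostly measures the monitor itself
MAX_SAMPLES = 100_000   # hard cap so long-running scripts cannot grow the sample list unbounded


def monitor_cpu_and_ram(stop_event: threading.Event, interval: float = 0.05) -> list[tuple[float, float, float]]:
    interval = max(interval, MIN_INTERVAL_S)  # clamp rather than trust the caller
    samples: list[tuple[float, float, float]] = []
    while not stop_event.is_set() and len(samples) < MAX_SAMPLES:
        samples.append((time.monotonic(), psutil.cpu_percent(), psutil.virtual_memory().percent))
        stop_event.wait(interval)  # doubles as the sleep and wakes early on stop
    return samples
```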
```diff
 def main() -> None:
     initialize_dependencies()
-    run_pipeline()
+    script_id = get_script_id()
```
initialize_dependencies() is called before parsing CLI args. If the user passes no/invalid args, the process will still perform dependency wiring/initialization (and potentially create external connections) before exiting. Parse/validate arguments first, then initialize dependencies for the selected script.
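A sketch of the suggested ordering, reusing the names visible in this diff (`get_script_id`, `initialize_dependencies`, `run_pipeline`, `blob_storage_db_scan`); the dispatch body is abbreviated:

```python
def main() -> None:
    # Parse/validate CLI arguments first: bad input exits with a usage
    # message before any wiring or external connections are created.
    script_id = get_script_id()

    # Only initialize dependencies once the requested script is known.
    initialize_dependencies()

    match script_id:
        case "conflation-pipeline":
            run_pipeline()
        case "blob-storage-db-scan":
            blob_storage_db_scan()
```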
```python
            blob_storage_db_scan()
            return
        case _:
            raise ValueError("Script ID is invalid")
```
The error message "Script ID is invalid" doesn’t tell the user what was received or what valid IDs are. Prefer letting argparse validate via choices=... (or call parser.error(...)) so the CLI prints a helpful usage message and lists valid script IDs.
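If the `match` fallback is kept as a defensive check alongside `choices=...`, a sketch of a message that names the offending value and the accepted IDs (assuming the matched variable is called `script_id`):

```python
        case _:
            # Unreachable once argparse enforces choices, but useful as a safety net.
            raise ValueError(
                f"Unknown script ID {script_id!r}; expected one of: "
                "conflation-pipeline, blob-storage-db-scan"
            )
```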
| parser.add_argument("id", help="ID of script to run") | ||
| args = parser.parse_args() | ||
| return args.id |
The positional argument name id is ambiguous and reads like an internal identifier rather than a script selector. Consider renaming it to script_id and using choices so invalid values are rejected consistently at the parsing layer.
| parser.add_argument("id", help="ID of script to run") | |
| args = parser.parse_args() | |
| return args.id | |
| parser.add_argument( | |
| "script_id", | |
| help="ID of script to run", | |
| choices=["conflation-pipeline", "blob-storage-db-scan"], | |
| ) | |
| args = parser.parse_args() | |
| return args.script_id |
```python
path = path_service.create_virtual_filesystem_path(
    storage_scheme="az",
    release="2026-02-16.3",
    container=StorageContainer.DATA,
    theme=Theme.BUILDINGS,
    region="*",
```
This entrypoint hard-codes release/theme/region/file pattern values, which makes the script difficult to reuse for other scans. Consider accepting these via CLI args/env vars (with sensible defaults) so the scan can be targeted without code changes.
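A sketch of one way to lift these values into CLI arguments; `parse_scan_args` is a hypothetical helper, and the defaults mirror today's hard-coded values so current behaviour is preserved:

```python
import argparse


def parse_scan_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Scan parquet files in blob storage")
    # Defaults reproduce the currently hard-coded scan target.
    parser.add_argument("--release", default="2026-02-16.3")
    parser.add_argument("--theme", default="buildings", help="mapped onto the Theme enum by the caller")
    parser.add_argument("--region", default="*")
    parser.add_argument("--file-name", default="*.parquet")
    return parser.parse_args()
```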
| file_name="*.parquet" | ||
| ) | ||
|
|
||
| db_context.execute(f"SELECT count(*) AS count FROM read_parquet('{path}')") |
The query result is not fetched/logged, so the script produces no observable output (and it’s easy to miss whether the scan actually succeeded). Consider fetching the count and logging/printing it (or otherwise surfacing scan completion).
Suggested change:

```diff
-db_context.execute(f"SELECT count(*) AS count FROM read_parquet('{path}')")
+result = db_context.execute(f"SELECT count(*) AS count FROM read_parquet('{path}')")
+row_count = result.fetchone()[0]
+print(f"Blob storage DB scan completed; row count: {row_count}")
```
This pull request introduces a new entrypoint for running a blob storage database scan and refactors the way scripts are invoked via the command line. It also updates dependency injection wiring and makes a minor correction in threading initialization. The most important changes are grouped below:
Entrypoint and Script Invocation Improvements:
- Added a new entrypoint `blob_storage_db_scan` in `src/presentation/entrypoints/blob_storage_db_scan.py`, which scans parquet files in blob storage using DuckDB and a virtual filesystem path service.
- Refactored `main.py` to use `argparse` for selecting which script to run (`conflation-pipeline` or `blob-storage-db-scan`), improving extensibility and error handling for script IDs.
- Updated the `docker-compose.yml` file to add a new service `blob_storage_db_scan` and changed the command for the main service to accept script IDs as arguments.

Dependency Injection and Wiring:
- Updated `src/presentation/configuration/app_config.py` to include the new `blob_storage_db_scan` entrypoint.
- Added `blob_storage_db_scan` to the `__init__.py` of the entrypoints module for easier imports.

Minor Technical Fixes:
- Changed the `target` parameter in `_initialize_threading` in `monitor.py` to accept `object | None` instead of a callable, aligning with how it's used.