updated some default postgres import behaviour#8
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates Postgres-related vocabulary loading behaviours for OMOP_Alchemy 0.6.3: it removes the explicit psycopg2-binary dependency now that orm-loader>=0.4.0 no longer requires it, switches the Athena CSV import from literal to auto quote mode (fixing VARCHAR(255) overflow on quoted concept names), and makes load-vocab-source default to merge_strategy="replace" with a 100k chunk size. It also renames the CLI entry point from omop-maint to omop-alchemy, adds PostgreSQL integration tests (with a docker-compose harness and GitHub Actions workflow), commits a minimal Athena fixture set, and removes the obsolete notebooks directory.
Changes:
- Switch vocabulary loading defaults:
quote_mode="auto",merge_strategy="replace", defaultchunksize=100_000; droppsycopg2-binaryextra and bumporm-loaderto>=0.4.0. - Rename CLI entry point
omop-maint→omop-alchemyacross code, docs, and CHANGELOG. - Add committed Athena fixtures, Postgres integration test suite + docker-compose harness + CI workflow, and remove old notebooks.
Reviewed changes
Copilot reviewed 35 out of 38 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml, uv.lock | Bump version to 0.6.3, drop psycopg2-binary, require orm-loader>=0.4.0, register pytest postgres marker, rename script. |
| omop_alchemy/config.py | Alias bare postgresql:// URL to psycopg driver, keep psycopg2 entry for clearer errors. |
| omop_alchemy/maintenance/load_vocab.py | Default quote_mode="auto"; default chunksize=100_000; update detail strings. |
| omop_alchemy/maintenance/cli.py | Default --merge-strategy to replace; default --chunksize to 100k (0 disables); rename CLI references. |
| omop_alchemy/maintenance/{ui,info,doctor,backup}.py | Update user-facing strings from omop-maint to omop-alchemy. |
| tests/conftest.py | Add pg_engine/pg_session fixtures; use uppercase fixture filenames; quote_mode="auto". |
| tests/test_load_vocab.py | Rework against the new minimal committed fixture (e.g. assert on concept 8507 MALE). |
| tests/test_load_vocab_source.py | Adapt fakes to new signatures; add quote-mode regression test (misnamed). |
| tests/test_load_vocab_postgres.py | New end-to-end Postgres integration tests covering quote mode, schema, chunksize, replace vs upsert. |
| tests/test_config_driver.py | New tests for driver mapping and _missing_driver_message (contains some dead code). |
| tests/fixtures/athena_source/*.csv | Add minimal committed Athena fixtures (7 concepts). |
| tests/docker-compose.yaml, tests/README.md | Add Postgres test container and updated test docs. |
| .github/workflows/tests.yml | New CI workflow for unit+SQLite and Postgres integration jobs. |
| .gitignore | Permit committed Athena fixtures; ignore notebooks/. |
| notebooks/*.ipynb, notebooks/concept_enums.py | Delete obsolete notebooks. |
| docs/**, CHANGELOG.md | Update CLI name in docs; document 0.6.3 changes (including breaking ones). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def fake_create_engine(url, **kwargs): | ||
| raise ModuleNotFoundError.__new__( | ||
| ModuleNotFoundError, | ||
| ) |
| def test_load_vocab_source_uses_csv_not_literal_quote_mode(monkeypatch, tmp_path): | ||
| """Regression: Athena load must use csv quote mode so that quoted concept_name | ||
| values are not padded with surrounding double-quote characters, which would | ||
| cause 'value too long for type character varying(255)' on CONCEPT.csv.""" |
| def fake_create_engine(url, **kwargs): | ||
| raise ModuleNotFoundError.__new__( | ||
| ModuleNotFoundError, | ||
| ) |
nicoloesch
left a comment
There was a problem hiding this comment.
Review
The three core fixes (quote mode, merge strategy default, psycopg2 removal) are correct and the logic is sound. The surrounding scope needs trimming before this is mergeable.
Changes requested
1. Remove all Docker changes
The Docker additions mix dev-environment concerns (non-root user setup, .venv embedded in the container path) with runtime image concerns. These belong in a separate, dedicated Docker track. Nothing in the stated fix scope requires touching docker/, and having it here creates maintenance confusion. Please revert docker/, .dockerignore, and .github/workflows/docker-python.yml.
NOTE: We need to think about how we package our things together, including the docker support. I am aware that I am a culprit of this too in omop-spires where I orchestrate them all but we need a better strategy here.
2. Remove initial_load and expose insert_if_empty as a first-class merge strategy
initial_load=True is a boolean that silently overrides merge_strategy to "insert_if_empty". This is indirect and adds a parameter interaction that is hard to reason about. The CLI is already complex and this makes it worse without adding meaningful value. The simpler fix is to accept "insert_if_empty" directly as a valid merge_strategy value, ideally constrained with a Literal type. Remove initial_load from both the Python API and the CLI.
3. Rework test fixtures to use in-memory generation
The static CSV files in tests/fixtures/athena_source/ and the shutil.copy calls in _make_concept_source are unnecessary maintenance surface. Generate fixture content in-memory in conftest using a factory function (e.g. build_athena_source(tmp_path, ...)) that writes tab-delimited strings to a temporary path. This is cleaner, removes the scattered static files, and makes the test intent explicit.
4. Remove tests/docker-compose.yaml
This is infrastructure scattered away from the rest of the project config. The GitHub Actions postgres integration tests use native service containers and do not depend on this file. It serves no purpose in the repository.
Minor
Quote the db_schema identifier in SET search_path (load_vocab.py:265)
connection.exec_driver_sql(f"SET search_path TO {db_schema}")db_schema is interpolated directly without identifier quoting. A schema name with uppercase letters or special characters will produce unexpected behaviour. This is pre-existing but we are already in this file. A simple double-quote wrap with internal quote escaping is sufficient.
** CI postgres integration job only covers Python 3.12
The unit and SQLite jobs run a matrix over ["3.12", "3.13"] but the postgres job hardcodes 3.12. Extend the matrix or pin to the latest stable to avoid missing 3.13 regressions in the postgres path.
Passing
- Quote mode fix (
literaltoauto) is correct and the regression test is well-constructed - Merge strategy default change (
upserttoreplace) is the right call and is clearly documented - psycopg2 dependency removal is clean;
_missing_driver_messageprovides useful install hints - CLI rename (
omop-mainttoomop-alchemy) is the right long-term call - Chunksize default and
0toNonehandling in the CLI are correct tests.ymlCI workflow structure is clean; keep it once the docker-compose dependency is removed
| try: | ||
| yield session | ||
| finally: | ||
| session.rollback() |
There was a problem hiding this comment.
is that what we want to rollback the entire DB after something is done? That way we always need to do everything within the with scope of the session as afterwards everything is gone unless we do commit but then what is the point of the rollback?
There was a problem hiding this comment.
I think this is normal in a tests module?
There was a problem hiding this comment.
This feels like a duplicate. Why would the docker-compose.yaml in the test differ from the regular one?
There was a problem hiding this comment.
it's literally just for integration testing convenience for something disposable - have renamed to example
| "CONCEPT_RELATIONSHIP.csv", | ||
| "CONCEPT_SYNONYM.csv", | ||
| ): | ||
| shutil.copy(_FIXTURE_SOURCE / fname, source_path / fname) |
There was a problem hiding this comment.
I don't think we should be handling and copying these files. Why not have it all in memory and properly set it up in conftest.py? If we keept the files, use tmp_path or something to create/write the file.
There was a problem hiding this comment.
agreed re: using tmp_path for tests that construct or customize a source directory. the bit about conftest is addressed above, but have updated so that _copy_fixture_source uses tmp_path instead
CLOSES: #7, CLOSES: #9, CLOSES: #10