Skip to content

Fixes #24559: Add Greenplum 7 partition support#27288

Open
RajdeepKushwaha5 wants to merge 8 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/24559-greenplum7-partition-support
Open

Fixes #24559: Add Greenplum 7 partition support#27288
RajdeepKushwaha5 wants to merge 8 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/24559-greenplum7-partition-support

Conversation

@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor

@RajdeepKushwaha5 RajdeepKushwaha5 commented Apr 11, 2026

PR Description

Describe your changes:

Fixes #24559

Problem:
Greenplum 7 removed the GP-specific catalog tables pg_partition and pg_partition_rule that the connector relied on for table listing and partition detection. This caused UndefinedTable errors when ingesting metadata from GP7 instances.

Root Cause:
GP7 adopted PostgreSQL 10+ declarative partitioning, replacing:

  • pg_partition / pg_partition_rulepg_partitioned_table + relispartition column in pg_class
  • parkind / parattspartstrat / partattrs

Solution:
Added runtime version detection and dual-path query support:

  1. Version detection via SELECT version() — parses Greenplum Database X.Y.Z from the output using regex (note: SHOW server_version returns the underlying PostgreSQL version, not Greenplum's, so it cannot be used)
  2. GP6 path (existing behavior): uses pg_partition_rule JOIN for table names, pg_partition with parkind/paratts for partition details
  3. GP7+ path (new): uses relispartition = false for table names, pg_partitioned_table with partstrat/partattrs for partition details
  4. Version caching: detected version is cached per source instance to avoid repeated queries
  5. Graceful fallback: defaults to GP7 queries if version detection fails (GP7 is the current release)

Testing:

  • 9 unit tests (version detection for v6/v7, caching, error fallback, unparseable fallback, partition details for range/list/empty)
  • Tested against real Greenplum 7 database using Docker (andruche/greenplum:7):
    • SELECT version() correctly returns Greenplum Database 7.0.0
    • Confirmed pg_partition and pg_partition_rule do NOT exist in GP7
    • Confirmed pg_partitioned_table EXISTS in GP7
    • V7 table names query correctly returns only parent tables (filters child partitions)
    • V7 partition details query correctly returns partition strategy and column for both range and list partitioned tables

Files changed:

  • ingestion/src/metadata/ingestion/source/database/greenplum/queries.py — added V7 queries and SELECT version() query
  • ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py — added version detection, query branching, V6/V7 partition detail methods
  • ingestion/tests/unit/topology/database/test_greenplum.py — added 8 new tests (version detection + partition details)

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

Summary by Gitar

  • Refinement of partition results:
    • Added _build_partition_result helper to handle expression-based partition keys, ensuring tables are correctly marked as partitioned.
  • Robustness improvements:
    • Added SQLAlchemyError handling during version detection to ensure graceful fallback to GP6.
    • Applied explicit pg_catalog schema-qualification to partition queries to ensure consistent metadata resolution.

This will update automatically on new commits.

@RajdeepKushwaha5 RajdeepKushwaha5 requested a review from a team as a code owner April 11, 2026 21:24
Copilot AI review requested due to automatic review settings April 11, 2026 21:24
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Greenplum 7 compatibility to the ingestion connector by detecting the Greenplum major version at runtime and switching catalog queries to PostgreSQL 10+ declarative partitioning equivalents, preventing failures caused by removed GP6 catalog tables.

Changes:

  • Implement Greenplum major version detection via SELECT version() with per-source caching and GP6/GP7 query branching.
  • Introduce Greenplum 7+ table listing and partition metadata queries (while preserving the existing GP6 query path).
  • Add unit tests covering version detection, caching/fallback behavior, and GP6/GP7 partition detail parsing.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Adds version detection + conditional logic to choose GP6 vs GP7+ query paths for table listing and partition details.
ingestion/src/metadata/ingestion/source/database/greenplum/queries.py Splits queries into GP6/GP7 variants; adds SELECT version() query and GP7+ partition/table-name queries.
ingestion/tests/unit/topology/database/test_greenplum.py Adds unit tests for version parsing/caching/fallback and partition details for GP6/GP7 paths.

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/queries.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/queries.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/queries.py Outdated
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/24559-greenplum7-partition-support branch from 160fdce to df11c66 Compare April 11, 2026 21:29
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@harshach
Copy link
Copy Markdown
Collaborator

@RajdeepKushwaha5 address the review comments raised here #27288 (comment)

@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/24559-greenplum7-partition-support branch from df11c66 to 57a7949 Compare April 11, 2026 21:32
Copilot AI review requested due to automatic review settings April 11, 2026 21:32
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Outdated
@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Apr 11, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Outdated
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/24559-greenplum7-partition-support branch from 57a7949 to f901064 Compare April 11, 2026 21:36
Copilot AI review requested due to automatic review settings April 11, 2026 21:38
@RajdeepKushwaha5 RajdeepKushwaha5 force-pushed the fix/24559-greenplum7-partition-support branch from 244ba77 to e831870 Compare April 11, 2026 21:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor Author

ping @chirag-madlani @karanh37 @harshach

Copilot AI review requested due to automatic review settings April 26, 2026 20:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py Outdated
Comment thread ingestion/tests/unit/topology/database/test_greenplum.py Outdated
…tion keys

When _build_partition_result receives rows from the partition catalog but all
column_name values are NULL (expression-based partition keys), the table IS
partitioned - we just cannot surface the column details yet.

Previously returned (False, None) which caused expression-key partitioned
tables to be ingested as TableType.Regular. Now returns (True, None) so they
are correctly ingested as TableType.Partitioned.

Updated test name and assertion to reflect the corrected behavior.

Addresses Copilot review comment.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread ingestion/src/metadata/ingestion/source/database/greenplum/queries.py Outdated
… queries

Qualify pg_class, pg_namespace, and pg_partitioned_table with pg_catalog. in
both GREENPLUM_PARTITION_DETAILS_V6 and GREENPLUM_PARTITION_DETAILS_V7 to
avoid search_path dependency and prevent potential name shadowing by
user-defined objects with the same names.

Addresses Copilot review comment.
…7-partition-support

# Conflicts:
#	ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +132 to +142
" SELECT version() output, defaulting to 7"
)
logger.debug("Full version() output: %s", version_string)
self._greenplum_version = 7
except SQLAlchemyError as exc:
logger.debug(traceback.format_exc())
logger.warning(
"Could not determine Greenplum version (%s), defaulting to 7",
exc,
)
self._greenplum_version = 7
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version-detection fallback sets _greenplum_version = 7 when SELECT version() is unparseable or raises SQLAlchemyError. That can cause a regression for Greenplum 6 clusters in the (rare but possible) case where version detection fails while the existing self.connection remains usable, because the GP7 table/partition queries reference catalogs/columns that don’t exist on GP6. Consider a safer fallback strategy (e.g., try the GP7 query and on UndefinedTable/UndefinedColumn fall back to GP6, or default to GP6 unless GP7 is positively detected).

Suggested change
" SELECT version() output, defaulting to 7"
)
logger.debug("Full version() output: %s", version_string)
self._greenplum_version = 7
except SQLAlchemyError as exc:
logger.debug(traceback.format_exc())
logger.warning(
"Could not determine Greenplum version (%s), defaulting to 7",
exc,
)
self._greenplum_version = 7
" SELECT version() output, defaulting conservatively to 6"
)
logger.debug("Full version() output: %s", version_string)
self._greenplum_version = 6
except SQLAlchemyError as exc:
logger.debug(traceback.format_exc())
logger.warning(
"Could not determine Greenplum version (%s), defaulting conservatively to 6",
exc,
)
self._greenplum_version = 6

Copilot uses AI. Check for mistakes.
Comment on lines 155 to 163
table_names_query = (
GREENPLUM_GET_TABLE_NAMES_V7
if self._is_v7_or_later()
else GREENPLUM_GET_TABLE_NAMES_V6
)
result = self.connection.execute(
sql.text(GREENPLUM_GET_TABLE_NAMES),
sql.text(table_names_query),
{"schema": schema_name},
)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_table_names_and_types now switches between GP6 and GP7 table-listing queries based on _is_v7_or_later(), but there’s no unit test asserting that the correct SQL is chosen/executed for each major version. Adding a small test that sets _greenplum_version to 6/7 and verifies self.connection.execute receives the expected query string would help prevent regressions in the table discovery path (the main failure reported in #24559).

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@sonarqubecloud
Copy link
Copy Markdown

RajdeepKushwaha5 and others added 2 commits April 27, 2026 21:11
…-listing query branch tests

Address reviewer feedback on PR open-metadata#27288:

- Default to Greenplum 6 (conservative) when version detection fails or output is unparseable, since GP7 queries reference catalogs/columns that do not exist on GP6 (was defaulting to 7, which could regress GP6 clusters).

- Add unit tests verifying query_table_names_and_types selects GREENPLUM_GET_TABLE_NAMES_V6 vs GREENPLUM_GET_TABLE_NAMES_V7 based on detected major version, covering the table discovery path that originally failed in open-metadata#24559.

- Apply ruff format to satisfy py-checkstyle CI.
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 27, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Greenplum 7 partition support added, with resolutions for SQLAlchemy 2.0 deprecations, potential NoneType errors in regex, and identifier quoting issues. No open issues remain.

✅ 4 resolved
Bug: engine.execute() removed in SQLAlchemy 2.0, will crash at runtime

📄 ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py:121
The project requires sqlalchemy>=2.0.0 (setup.py:167), but self.engine.execute() was removed in SQLAlchemy 2.0. This call in _get_greenplum_major_version will raise AttributeError at runtime, causing version detection to always fall back to GP7 (silently masking the real error). Ironically, the same file already uses the correct pattern (with self.engine.connect() as conn: conn.execute(...)) in the partition detail methods just below.

Edge Case: version_string can be None, causing TypeError in re.search

📄 ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py:123-124
If result.scalar() returns None (e.g., empty result set), version_string will be None. The subsequent re.search(r"Greenplum Database (\d+)", version_string) at line 124 will raise TypeError: expected string or bytes-like object, got 'NoneType'. This is caught by the outer except Exception block so it won't crash, but it bypasses the more informative "Could not parse Greenplum version" warning and instead logs a confusing TypeError traceback.

Bug: V7 query uses regnamespace::text which quotes identifiers

📄 ingestion/src/metadata/ingestion/source/database/greenplum/queries.py:77 📄 ingestion/src/metadata/ingestion/source/database/greenplum/queries.py:99 📄 ingestion/src/metadata/ingestion/source/database/greenplum/queries.py:102
The V7 partition details query uses par.relnamespace::regnamespace::text for the schema comparison in the WHERE clause. PostgreSQL's regnamespace::text cast adds double-quotes around identifiers that contain uppercase letters, spaces, or are reserved words (e.g., "MySchema" instead of MySchema). The schema_name parameter passed in from OpenMetadata is the raw unquoted name, so this will fail to match for any schema that requires quoting.

The V6 query correctly uses ns.nspname which returns the raw catalog name. The V7 query should do the same by joining pg_namespace instead of relying on the regnamespace cast.

Edge Case: Expression-based partition keys return True with empty columns

📄 ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py:232-246 📄 ingestion/src/metadata/ingestion/source/database/greenplum/metadata.py:259-273
If a table's partition key is an expression (not a plain column), partattrs contains 0 for that position. The LEFT JOIN to information_schema.columns on ordinal_position = column_index won't match, producing a NULL column_name. The if row.column_name filter removes that row, but since result is still truthy, the method returns (True, TablePartition(columns=[])) — marking the table as partitioned but with no partition column details. This applies to both V6 and V7 paths.

This is a pre-existing limitation (V6 has the same behavior), so it's low priority, but worth a comment or a TODO for future improvement.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

},
}

PartitionRow = namedtuple("PartitionRow", ["schema", "table_name", "partition_strategy", "column_name"])
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This namedtuple(...) assignment is long enough to likely exceed the project's formatter/linter line-length limit. Please wrap it across multiple lines (or switch to typing.NamedTuple with one field per line) so black/CI formatting checks don't fail.

Suggested change
PartitionRow = namedtuple("PartitionRow", ["schema", "table_name", "partition_strategy", "column_name"])
PartitionRow = namedtuple(
"PartitionRow",
["schema", "table_name", "partition_strategy", "column_name"],
)

Copilot uses AI. Check for mistakes.
"""Expression-based partition keys yield NULL column_name rows; the
table should still be reported as partitioned (is_partitioned=True)
because a row exists in the partitioning catalog, even though we
cannot resolve the column names yet. partition details are None."""
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring has minor grammar/formatting issues (extra double-space and sentence casing: "yet. partition details..."). Please clean this up for readability.

Suggested change
cannot resolve the column names yet. partition details are None."""
cannot resolve the column names yet. Partition details are None."""

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Problem with Greenplum 7 support: connector fails on missing pg_catalog.pg_partition_rule (no tables ingested)

3 participants