Skip to content

Honor caller-provided rowType in HiveViewExpander.expandView#607

Open
simbadzina wants to merge 1 commit into
linkedin:masterfrom
simbadzina:fix-hiveviewexpander-rowtype-alignment
Open

Honor caller-provided rowType in HiveViewExpander.expandView#607
simbadzina wants to merge 1 commit into
linkedin:masterfrom
simbadzina:fix-hiveviewexpander-rowtype-alignment

Conversation

@simbadzina
Copy link
Copy Markdown
Collaborator

@simbadzina simbadzina commented May 13, 2026

Summary

When a downstream view does SELECT * FROM upstream and upstream's HMS-recorded row type disagrees with what its re-parsed body produces in field order, the downstream's positional access can land on the wrong field. The case that surfaces this most painfully is prefix-name sibling pairs:

Field index Caller's expected rowType Re-parsed body's rowType
0 id (INTEGER) idHash (VARCHAR)
1 idHash (VARCHAR) id (INTEGER)

A case-insensitive lookup of id against the body's row type can match idHash — the resolver gets confused when a row type contains both a short name and a longer name that has it as a prefix. Every consumer of the view then silently reads values from the wrong source column.

The root cause is that HiveViewExpander.expandView() was discarding its rowType argument and returning whatever the re-parsed view body produced, violating the ViewExpander contract that the returned RelRoot's row type must match the caller-provided rowType.

Fix

After convertQuery, wrap the returned RelRoot in a LogicalProject that re-aligns fields to the caller-provided rowType by name (case-insensitive):

flowchart LR
  C["Caller's rowType<br/>[id INTEGER, idHash VARCHAR]"] --> E
  B["Re-parsed body's rowType<br/>[idHash VARCHAR, id INTEGER]"] --> E[alignToRowType]
  E --> Q{Arity match<br/>and field names<br/>aligned by order?}
  Q -->|Yes| K[Return original RelRoot]
  Q -->|All names present<br/>but different order| W["Wrap in LogicalProject<br/>that re-binds by name<br/>(case-insensitive)"]
  Q -->|Arity differs<br/>or a name is missing| L[Log warn, return original]
Loading

The safe-fallback paths preserve pre-fix behavior for callers whose expanded body genuinely produces a different shape, while making such cases visible in logs.

Test plan

  • Unit tests for alignToRowType in HiveViewExpanderTest covering: already-aligned (no-op fast path), case-only differences (no-op), reordered fields (wraps in Project), single prefix-sibling pair (id / idHash), multiple prefix-sibling pairs in one row type (id/idHash, email/emailVerified, country/countryCode simultaneously — mirrors the realistic failure shape at production scale), missing field (safe fallback), and different arity (safe fallback).
  • Integration test testExpandViewAlignsToCallerRowType in HiveToRelConverterTest that drives the full expandView path with a deliberately reversed caller-provided rowType.
  • Sanity-check: temporarily reverting alignToRowType to return root; causes exactly the tests targeting the fix to fail (testReorderedFieldsWrapInProject, testPrefixSiblingsReorderedCorrectly, testMultiplePrefixSiblingsAllReorderedCorrectly, testExpandViewAlignsToCallerRowType); the safe-fallback and no-op tests still pass. Restoring the fix turns the suite green.
  • Full :coral-hive:test and :coral-trino:test suites pass with no regressions.

Manual verification

Validated end-to-end against a real production view-on-view chain (Avro-backed base table → intermediate Hive view → downstream view) using internal tooling that drives the full Coral translation path. Confirmed:

  • Without the fix: every prefix-name sibling pair in the downstream view's translated SQL had its source column references swapped, in both Trino and Spark dialects. Three pairs in the same view were affected simultaneously.
  • With the fix: every reference is correct; the no-op fast path is preserved for upstream views that don't trigger the bug.

Also characterized the failure surface by varying the upstream view shape: the bug fires specifically when the upstream lists the longer prefix-extension name before its shorter sibling (e.g. idHash before id). SELECT * upstreams and shorter-name-first explicit projections do not trigger the bug.

🤖 Generated with Claude Code

@simbadzina simbadzina marked this pull request as draft May 13, 2026 20:45
@simbadzina simbadzina force-pushed the fix-hiveviewexpander-rowtype-alignment branch 6 times, most recently from 6d5714a to 55e0f7a Compare May 13, 2026 21:46
@simbadzina simbadzina requested a review from Copilot May 13, 2026 21:46
HiveViewExpander.expandView() previously discarded the rowType argument and
returned whatever the re-parsed view body produced. When the caller-recorded
row type and the expanded RelNode's row type differed in field order,
downstream consumers that rely on positional access could land on the wrong
RelNode-output position. Prefix-named sibling columns (one name a prefix of
another) are particularly prone to swapping under such mismatches.

This change wraps the expanded RelRoot in a LogicalProject that re-aligns
fields to the caller's rowType by name (case-insensitive). The wrapper is a
no-op when the orderings already match. If a caller-expected field is missing
from the expanded body, or if arities differ, the method falls back to
returning the original RelRoot unchanged.

Adds:
- Unit tests for alignToRowType covering aligned, case-only-difference,
  reorder, prefix-pair, missing-field, and arity-mismatch cases.
- An integration test that drives the full expandView path with a
  deliberately reversed caller-provided rowType.
@simbadzina simbadzina force-pushed the fix-hiveviewexpander-rowtype-alignment branch from 55e0f7a to 26a65b6 Compare May 13, 2026 21:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Hive view expansion so expanded view outputs can be reordered to match the caller-provided row type, preventing positional column mismatches in downstream view-on-view scenarios.

Changes:

  • Adds HiveViewExpander.alignToRowType to reorder expanded view fields by case-insensitive name when needed.
  • Adds unit coverage for no-op, reorder, prefix-sibling, missing-field, and arity fallback cases.
  • Adds an integration fixture and test for the full expandView path.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
coral-hive/src/main/java/com/linkedin/coral/hive/hive2rel/HiveViewExpander.java Adds row-type alignment after view query conversion.
coral-hive/src/test/java/com/linkedin/coral/hive/hive2rel/HiveViewExpanderTest.java Adds unit tests for alignment behavior.
coral-hive/src/test/java/com/linkedin/coral/hive/hive2rel/HiveToRelConverterTest.java Adds integration regression test for caller row-type ordering.
coral-hive/src/test/java/com/linkedin/coral/hive/hive2rel/TestUtils.java Adds Hive test table/view fixtures used by the integration test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@simbadzina simbadzina marked this pull request as ready for review May 13, 2026 21:53
Comment on lines +67 to +72
/**
* Wrap {@code root} in a {@link LogicalProject} that re-orders its output
* fields to match {@code expected} by name (case-insensitive). No-op when the
* orderings already agree. If a name is missing from {@code root} or arities
* differ, returns {@code root} unchanged and logs a warning.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not the right behavior to throw an exception? There should not be a disagreement between the caller expectation and the view structure in the first place.

Copy link
Copy Markdown
Collaborator Author

@simbadzina simbadzina May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean for when there is a mismatch like an missing field, or even when there is an ordering issue?

For the ordering, I think we can self-heal drift with the alignment code since there is a complete info.

I'm assuming a change like a new column could show up a mismatch, missing field.
Automatically taking the root would expose the column to consumers, whereas an error would break the view. I think for select * we'd want the new column to seamless appear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants