Skip to content

Add support for Remote UDF#3860

Open
adrw-google wants to merge 19 commits into
GoogleCloudPlatform:mainfrom
adrw-google:remote_udf
Open

Add support for Remote UDF#3860
adrw-google wants to merge 19 commits into
GoogleCloudPlatform:mainfrom
adrw-google:remote_udf

Conversation

@adrw-google
Copy link
Copy Markdown
Member

@adrw-google adrw-google commented May 26, 2026

Support for Spanner Remote UDF

GSQL dialect has been supported for 3+ mo in prod. PG dialect is rolling out, but the changes here should be safe to push (existing databases won't have REMOTE UDFs). Added unit tests for PG dialect. Integration tests will be pushed once we deploy support in prod.

Tested using Export/Import integration tests.

Some minor fixes:

  • Integration tests require ENTERPRISE_PLUS instance due to full text search (otherwise they fail).
  • Simplified prefix in integration tests so that export IT can easily produce input for import IT (with correct checksums and lengths)

@adrw-google adrw-google requested a review from a team as a code owner May 26, 2026 21:07
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables support for Remote UDFs within the Spanner connector. It extends the existing schema conversion logic to include language, options, and definition fields, ensuring that Remote UDFs are correctly handled during export and import operations. The changes maintain compatibility with both Google Standard SQL and PostgreSQL dialects and include robust testing to ensure stability.

Highlights

  • Remote UDF Support: Added comprehensive support for Remote User-Defined Functions (UDFs) across both Google Standard SQL and PostgreSQL dialects.
  • Schema Conversion Updates: Updated Avro-to-DDL and DDL-to-Avro converters to handle new UDF metadata fields, including language, options, and function definitions.
  • Information Schema Scanning: Enhanced the Information Schema scanner to correctly identify and extract UDF options and remote function definitions.
  • Testing Infrastructure: Added extensive unit and integration tests, including updates to existing IT pipelines to verify Remote UDF functionality.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Spanner Remote User-Defined Functions (UDFs) across both GoogleSQL and PostgreSQL dialects, updating DDL generation, Avro schema conversion, information schema scanning, and associated integration tests. Several issues were identified in the review: a duplicate .createUdf("s1.Foo2") call in CopyDbTest.java overwrites a test definition; the PostgreSQL remote UDF test in pgUdfs() incorrectly uses .options() and has a mismatched name; the GoogleSQL DDL resource file uses backticks instead of single/double quotes for option values; and the random DDL generator outputs invalid JSON for PostgreSQL remote UDF definitions. Additionally, a redundant null check on udf.options() was noted, along with potential NullPointerException risks when reading routine options, and a minor typo in a comment.

UdfParameter.parse(
"arg1 STRING DEFAULT 'bar'", "s1.Foo2", Dialect.GOOGLE_STANDARD_SQL))
.endUdf()
.createUdf("s1.Foo2")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In udfs(), .createUdf("s1.Foo2") is called again, which overwrites the previously defined s1.Foo2 UDF instead of creating a new s1.Foo3 UDF. It should be .createUdf("s1.Foo3").

Suggested change
.createUdf("s1.Foo2")
.createUdf("s1.Foo3")

Comment on lines +1153 to +1160
.name("s1.Foo2")
.language("REMOTE")
.type("BIGINT")
.addParameter(
UdfParameter.parse("arg0 BIGINT", "s1.Foo3", Dialect.POSTGRESQL))
.options(
ImmutableList.of(
"endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\""))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In pgUdfs(), there are two issues with the definition of s1.Foo3:

  1. .name("s1.Foo2") is used instead of .name("s1.Foo3").
  2. .options(...) is used, which is not supported in the PostgreSQL dialect and will throw an IllegalArgumentException during DDL generation. Instead, PostgreSQL remote UDFs specify their options via the definition property as a JSON string (e.g., using .definition(...)).
            .name("s1.Foo3")
            .language("REMOTE")
            .type("BIGINT")
            .addParameter(
                UdfParameter.parse("arg0 BIGINT", "s1.Foo3", Dialect.POSTGRESQL))
            .definition(
                "{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}")

CREATE SCHEMA `%PREFIX%UdfSchema`;

CREATE FUNCTION `%PREFIX%UdfSchema`.`Remote`(x INT64, y INT64) RETURNS INT64 NOT DETERMINISTIC LANGUAGE REMOTE OPTIONS (
endpoint = `https://us-central1-myproject.cloudfunctions.net/myfunc`,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In GoogleSQL, string literals for option values must be enclosed in single or double quotes, not backticks. Using backticks will result in a syntax error.

  endpoint = 'https://us-central1-myproject.cloudfunctions.net/myfunc',

Comment on lines +154 to +158
if (udf.options() != null) {
for (int j = 0; j < udf.options().size(); j++) {
recordBuilder.prop(SPANNER_OPTION + j, udf.options().get(j));
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The udf.options() method returns an ImmutableList<String> which is non-nullable (it is not annotated with @Nullable and is initialized to an empty list by default in the builder). Therefore, the null check udf.options() != null is redundant and can be safely removed.

      for (int j = 0; j < udf.options().size(); j++) {
        recordBuilder.prop(SPANNER_OPTION + j, udf.options().get(j));
      }
References
  1. Ensure that mandatory parameters are validated for nullability at the entry point or in upstream components to prevent NullPointerExceptions in downstream transforms, avoiding the need for redundant null checks.

Comment on lines +1132 to +1133
String optionType = resultSet.getString(3);
String optionValue = resultSet.getString(4);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential NullPointerExceptions, guard against null values for optionType and optionValue using resultSet.isNull(...) before calling methods on them or passing them to the escaper. This aligns with defensive programming best practices.

Suggested change
String optionType = resultSet.getString(3);
String optionValue = resultSet.getString(4);
String optionType = resultSet.isNull(3) ? "" : resultSet.getString(3);
String optionValue = resultSet.isNull(4) ? "" : resultSet.getString(4);

if (language() == null || language().isEmpty() || "SQL".equalsIgnoreCase(language())) {
appendable.append(" RETURN ").append(definition());
} else {
// Other langugges use AS definition instead of sql body.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo in the comment: "langugges" should be "languages".

Suggested change
// Other langugges use AS definition instead of sql body.
// Other languages use AS definition instead of sql body.

if (getDialect() == Dialect.GOOGLE_STANDARD_SQL) {
udfBuilder.options(ImmutableList.of("endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\""));
} else {
udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The definition for PostgreSQL remote UDFs should be a valid JSON string enclosed in curly braces {}. Currently, it is missing the curly braces, which will result in invalid JSON.

Suggested change
udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"");
udfBuilder.definition("{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}");

@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.53%. Comparing base (02d6629) to head (72c19ba).
⚠️ Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
...ava/com/google/cloud/teleport/spanner/ddl/Udf.java 81.25% 0 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3860      +/-   ##
============================================
+ Coverage     54.20%   55.53%   +1.32%     
- Complexity     6895     7036     +141     
============================================
  Files          1096     1103       +7     
  Lines         67353    67602     +249     
  Branches       7558     7594      +36     
============================================
+ Hits          36512    37546    +1034     
+ Misses        28375    27634     -741     
+ Partials       2466     2422      -44     
Components Coverage Δ
spanner-templates 76.32% <85.36%> (+3.02%) ⬆️
spanner-import-export 68.56% <85.36%> (+0.05%) ⬆️
spanner-live-forward-migration 90.22% <ø> (+7.21%) ⬆️
spanner-live-reverse-replication 83.96% <ø> (+6.88%) ⬆️
spanner-bulk-migration 92.61% <ø> (+1.65%) ⬆️
gcs-spanner-dv 89.00% <ø> (+3.27%) ⬆️
Files with missing lines Coverage Δ
...he/beam/it/gcp/spanner/SpannerResourceManager.java 73.75% <100.00%> (+0.09%) ⬆️
...oud/teleport/spanner/AvroSchemaToDdlConverter.java 86.83% <100.00%> (+0.05%) ⬆️
...va/com/google/cloud/teleport/spanner/AvroUtil.java 93.75% <ø> (ø)
...oud/teleport/spanner/DdlToAvroSchemaConverter.java 97.55% <100.00%> (+0.03%) ⬆️
...ava/com/google/cloud/teleport/spanner/ddl/Udf.java 84.15% <81.25%> (-2.86%) ⬇️

... and 60 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@shreyakhajanchi shreyakhajanchi added the addition New feature or request label May 29, 2026
Comment thread v1/src/main/java/com/google/cloud/teleport/spanner/ddl/Udf.java.orig Outdated
Comment on lines +113 to +119
gcsClient.uploadArtifact(
"input/UdfSchema.avro-00000-of-00001",
Resources.getResource("ImportPipelineIT/" + subdirectory + "/UdfSchema.avro").getPath());
gcsClient.uploadArtifact(
"input/UdfSchema-manifest.json",
Resources.getResource("ImportPipelineIT/" + subdirectory + "/UdfSchema-manifest.json")
.getPath());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have these files? I don't think we are asserting any non-remote udfs right now in the pipeline.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Jacob has a concurrent PR to add SQL UDFs, which will add these files too. Note that all UDFs (remote or sql) need to be inside a named schema, and these avro files are necessary to create one.

.stream()
.map(row -> row.getString(0))
.collect(Collectors.toList()))
.containsExactly("UdfSchema.Remote");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be better to assert other fields as well from the schema specially UDF type, options , etc?

{
  "type" : "record",
  "name" : "UdfSchema_Remote",
  "namespace" : "spannerexport",
  "fields" : [ ],
  "spannerName" : "UdfSchema.Remote",
  "spannerUdfName" : "UdfSchema.Remote",
  "spannerEntity" : "spannerUdf",
  "spannerOption_0" : "endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\"",
  "spannerOption_1" : "max_batching_rows=10",
  "spannerUdfLanguage" : "REMOTE",
  "googleStorage" : "CloudSpanner",
  "spannerUdfParameter_0" : "`x` INT64",
  "spannerUdfParameter_1" : "`y` INT64",
  "spannerUdfSecurity" : "INVOKER",
  "spannerUdfType" : "INT64",
  "googleFormatVersion" : "1.0.0",
  "spannerUdfDefinition" : ""
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test should assert that the functions were actually created in the database in the simplest possible way. Using InformationSchemaScanner could introduce false positives here, if there are bugs in InformationSchemaScanner (i.e. it sometimes adds default values).

We could just take a full dump of INFO_SCHEMA.ROUTINES here, but that might be flaky if we decide to change INFO_SCHEMA implementation. This shouldn't happen frequently though.

We could just re-export the database and compare the avro files to be identical. That could give us a much better coverage.

I opened b/517150731 for the migrations team to ponder and design how these tests should work. I'm adding a minimal viable coverage in this PR.

@@ -207,18 +207,19 @@ private void testSpannerToGCSAvroBase(
// an empty database without any tables.
spannerResourceManager.executeDdlStatement(setDefaultTimeZoneStatement);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should assert the remote udf handling in export pipeline as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened b/517150731 for this. It's up to the migrations teams to design how these tests should work. I'm not familiar with avro enough to pull this off and comparing the binary payload of output to golden doesn't yield diffs that are understandable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

addition New feature or request size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants