Add support for Remote UDF by adrw-google · Pull Request #3860 · GoogleCloudPlatform/DataflowTemplates

adrw-google · 2026-05-26T21:07:31Z

Support for Spanner Remote UDF

GSQL dialect has been supported for 3+ mo in prod. PG dialect is rolling out, but the changes here should be safe to push (existing databases won't have REMOTE UDFs). Added unit tests for PG dialect. Integration tests will be pushed once we deploy support in prod.

Tested using Export/Import integration tests.

Some minor fixes:

Integration tests require ENTERPRISE_PLUS instance due to full text search (otherwise they fail).
Simplified prefix in integration tests so that export IT can easily produce input for import IT (with correct checksums and lengths)

gemini-code-assist · 2026-05-26T21:16:55Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables support for Remote UDFs within the Spanner connector. It extends the existing schema conversion logic to include language, options, and definition fields, ensuring that Remote UDFs are correctly handled during export and import operations. The changes maintain compatibility with both Google Standard SQL and PostgreSQL dialects and include robust testing to ensure stability.

Highlights

Remote UDF Support: Added comprehensive support for Remote User-Defined Functions (UDFs) across both Google Standard SQL and PostgreSQL dialects.
Schema Conversion Updates: Updated Avro-to-DDL and DDL-to-Avro converters to handle new UDF metadata fields, including language, options, and function definitions.
Information Schema Scanning: Enhanced the Information Schema scanner to correctly identify and extract UDF options and remote function definitions.
Testing Infrastructure: Added extensive unit and integration tests, including updates to existing IT pipelines to verify Remote UDF functionality.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for Spanner Remote User-Defined Functions (UDFs) across both GoogleSQL and PostgreSQL dialects, updating DDL generation, Avro schema conversion, information schema scanning, and associated integration tests. Several issues were identified in the review: a duplicate .createUdf("s1.Foo2") call in CopyDbTest.java overwrites a test definition; the PostgreSQL remote UDF test in pgUdfs() incorrectly uses .options() and has a mismatched name; the GoogleSQL DDL resource file uses backticks instead of single/double quotes for option values; and the random DDL generator outputs invalid JSON for PostgreSQL remote UDF definitions. Additionally, a redundant null check on udf.options() was noted, along with potential NullPointerException risks when reading routine options, and a minor typo in a comment.

gemini-code-assist · 2026-05-26T21:18:31Z

                UdfParameter.parse(
                    "arg1 STRING DEFAULT 'bar'", "s1.Foo2", Dialect.GOOGLE_STANDARD_SQL))
            .endUdf()
+            .createUdf("s1.Foo2")


In udfs(), .createUdf("s1.Foo2") is called again, which overwrites the previously defined s1.Foo2 UDF instead of creating a new s1.Foo3 UDF. It should be .createUdf("s1.Foo3").

Suggested change

.createUdf("s1.Foo2")

.createUdf("s1.Foo3")

gemini-code-assist · 2026-05-26T21:18:31Z

+            .name("s1.Foo2")
+            .language("REMOTE")
+            .type("BIGINT")
+            .addParameter(
+                UdfParameter.parse("arg0 BIGINT", "s1.Foo3", Dialect.POSTGRESQL))
+            .options(
+                ImmutableList.of(
+                    "endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\""))


In pgUdfs(), there are two issues with the definition of s1.Foo3:

.name("s1.Foo2") is used instead of .name("s1.Foo3").

.options(...) is used, which is not supported in the PostgreSQL dialect and will throw an IllegalArgumentException during DDL generation. Instead, PostgreSQL remote UDFs specify their options via the definition property as a JSON string (e.g., using .definition(...)).

.name("s1.Foo3") .language("REMOTE") .type("BIGINT") .addParameter( UdfParameter.parse("arg0 BIGINT", "s1.Foo3", Dialect.POSTGRESQL)) .definition( "{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}")

gemini-code-assist · 2026-05-26T21:18:31Z

+CREATE SCHEMA `%PREFIX%UdfSchema`;
+
+CREATE FUNCTION `%PREFIX%UdfSchema`.`Remote`(x INT64, y INT64) RETURNS INT64 NOT DETERMINISTIC LANGUAGE REMOTE OPTIONS (
+  endpoint = `https://us-central1-myproject.cloudfunctions.net/myfunc`,


In GoogleSQL, string literals for option values must be enclosed in single or double quotes, not backticks. Using backticks will result in a syntax error.

endpoint = 'https://us-central1-myproject.cloudfunctions.net/myfunc',

gemini-code-assist · 2026-05-26T21:18:31Z

+      if (udf.options() != null) {
+        for (int j = 0; j < udf.options().size(); j++) {
+          recordBuilder.prop(SPANNER_OPTION + j, udf.options().get(j));
+        }
+      }


The udf.options() method returns an ImmutableList<String> which is non-nullable (it is not annotated with @Nullable and is initialized to an empty list by default in the builder). Therefore, the null check udf.options() != null is redundant and can be safely removed.

for (int j = 0; j < udf.options().size(); j++) { recordBuilder.prop(SPANNER_OPTION + j, udf.options().get(j)); }

References

Ensure that mandatory parameters are validated for nullability at the entry point or in upstream components to prevent NullPointerExceptions in downstream transforms, avoiding the need for redundant null checks.

gemini-code-assist · 2026-05-26T21:18:31Z

+      String optionType = resultSet.getString(3);
+      String optionValue = resultSet.getString(4);


To prevent potential NullPointerExceptions, guard against null values for optionType and optionValue using resultSet.isNull(...) before calling methods on them or passing them to the escaper. This aligns with defensive programming best practices.

Suggested change

String optionType = resultSet.getString(3);

String optionValue = resultSet.getString(4);

String optionType = resultSet.isNull(3) ? "" : resultSet.getString(3);

String optionValue = resultSet.isNull(4) ? "" : resultSet.getString(4);

gemini-code-assist · 2026-05-26T21:18:31Z

+          if (language() == null || language().isEmpty() || "SQL".equalsIgnoreCase(language())) {
+            appendable.append(" RETURN ").append(definition());
+          } else {
+            // Other langugges use AS definition instead of sql body. 


There is a typo in the comment: "langugges" should be "languages".

Suggested change

// Other langugges use AS definition instead of sql body.

// Other languages use AS definition instead of sql body.

gemini-code-assist · 2026-05-26T21:18:31Z

+      if (getDialect() == Dialect.GOOGLE_STANDARD_SQL) {
+        udfBuilder.options(ImmutableList.of("endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\""));
+      } else {
+        udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"");


The definition for PostgreSQL remote UDFs should be a valid JSON string enclosed in curly braces {}. Currently, it is missing the curly braces, which will result in invalid JSON.

Suggested change

udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"");

udfBuilder.definition("{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}");

codecov · 2026-05-26T21:38:42Z

Codecov Report

❌ Patch coverage is 85.71429% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.53%. Comparing base (02d6629) to head (72c19ba).
⚠️ Report is 27 commits behind head on main.

Files with missing lines	Patch %	Lines
...ava/com/google/cloud/teleport/spanner/ddl/Udf.java	81.25%	0 Missing and 6 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3860      +/-   ##
============================================
+ Coverage     54.20%   55.53%   +1.32%     
- Complexity     6895     7036     +141     
============================================
  Files          1096     1103       +7     
  Lines         67353    67602     +249     
  Branches       7558     7594      +36     
============================================
+ Hits          36512    37546    +1034     
+ Misses        28375    27634     -741     
+ Partials       2466     2422      -44

Components	Coverage Δ
spanner-templates	`76.32% <85.36%> (+3.02%)`	⬆️
spanner-import-export	`68.56% <85.36%> (+0.05%)`	⬆️
spanner-live-forward-migration	`90.22% <ø> (+7.21%)`	⬆️
spanner-live-reverse-replication	`83.96% <ø> (+6.88%)`	⬆️
spanner-bulk-migration	`92.61% <ø> (+1.65%)`	⬆️
gcs-spanner-dv	`89.00% <ø> (+3.27%)`	⬆️

Files with missing lines	Coverage Δ
...he/beam/it/gcp/spanner/SpannerResourceManager.java	`73.75% <100.00%> (+0.09%)`	⬆️
...oud/teleport/spanner/AvroSchemaToDdlConverter.java	`86.83% <100.00%> (+0.05%)`	⬆️
...va/com/google/cloud/teleport/spanner/AvroUtil.java	`93.75% <ø> (ø)`
...oud/teleport/spanner/DdlToAvroSchemaConverter.java	`97.55% <100.00%> (+0.03%)`	⬆️
...ava/com/google/cloud/teleport/spanner/ddl/Udf.java	`84.15% <81.25%> (-2.86%)`	⬇️

... and 60 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

shreyakhajanchi · 2026-06-04T07:45:40Z

+    gcsClient.uploadArtifact(
+        "input/UdfSchema.avro-00000-of-00001",
+        Resources.getResource("ImportPipelineIT/" + subdirectory + "/UdfSchema.avro").getPath());
+    gcsClient.uploadArtifact(
+        "input/UdfSchema-manifest.json",
+        Resources.getResource("ImportPipelineIT/" + subdirectory + "/UdfSchema-manifest.json")
+            .getPath());


why do we have these files? I don't think we are asserting any non-remote udfs right now in the pipeline.

I think Jacob has a concurrent PR to add SQL UDFs, which will add these files too. Note that all UDFs (remote or sql) need to be inside a named schema, and these avro files are necessary to create one.

shreyakhajanchi · 2026-06-04T07:46:44Z

+                .stream()
+                .map(row -> row.getString(0))
+                .collect(Collectors.toList()))
+        .containsExactly("UdfSchema.Remote");


Will it be better to assert other fields as well from the schema specially UDF type, options , etc?

{ "type" : "record", "name" : "UdfSchema_Remote", "namespace" : "spannerexport", "fields" : [ ], "spannerName" : "UdfSchema.Remote", "spannerUdfName" : "UdfSchema.Remote", "spannerEntity" : "spannerUdf", "spannerOption_0" : "endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\"", "spannerOption_1" : "max_batching_rows=10", "spannerUdfLanguage" : "REMOTE", "googleStorage" : "CloudSpanner", "spannerUdfParameter_0" : "`x` INT64", "spannerUdfParameter_1" : "`y` INT64", "spannerUdfSecurity" : "INVOKER", "spannerUdfType" : "INT64", "googleFormatVersion" : "1.0.0", "spannerUdfDefinition" : "" }

I think the test should assert that the functions were actually created in the database in the simplest possible way. Using InformationSchemaScanner could introduce false positives here, if there are bugs in InformationSchemaScanner (i.e. it sometimes adds default values).

We could just take a full dump of INFO_SCHEMA.ROUTINES here, but that might be flaky if we decide to change INFO_SCHEMA implementation. This shouldn't happen frequently though.

We could just re-export the database and compare the avro files to be identical. That could give us a much better coverage.

I opened b/517150731 for the migrations team to ponder and design how these tests should work. I'm adding a minimal viable coverage in this PR.

shreyakhajanchi · 2026-06-04T07:53:35Z

@@ -207,18 +207,19 @@ private void testSpannerToGCSAvroBase(
    // an empty database without any tables.
    spannerResourceManager.executeDdlStatement(setDefaultTimeZoneStatement);


We should assert the remote udf handling in export pipeline as well

Opened b/517150731 for this. It's up to the migrations teams to design how these tests should work. I'm not familiar with avro enough to pull this off and comparing the binary payload of output to golden doesn't yield diffs that are understandable.

adrw-google added 5 commits May 26, 2026 13:32

Add support for REMOTE UDFs.

6235a56

Update PG UDF support and fix tests

578567a

Add support for REMOTE UDFs.

001f595

Add support for REMOTE UDFs.

95c0938

Add support for REMOTE UDFs.

34a3784

adrw-google requested a review from a team as a code owner May 26, 2026 21:07

adrw-google requested review from shreyakhajanchi and sm745052 May 26, 2026 21:07

pull-request-size Bot added the size/XL label May 26, 2026

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

adrw-google added 2 commits May 26, 2026 14:19

Add support for REMOTE UDFs.

91bcc4f

Add support for REMOTE UDFs.

082e1b7

pull-request-size Bot added size/XXL and removed size/XL labels May 26, 2026

Add support for REMOTE UDFs.

5036177

adrw-google added 6 commits May 26, 2026 14:51

Add support for REMOTE UDFs.

14de2c4

Spotless fixes.

e326098

More spotless.

8b62fd0

Consider CHANGE STREAM created functions when exporting UDF.

94aa3eb

Fixes for formatting and tests

71e282c

More text fixes

f16519e

shreyakhajanchi added the addition New feature or request label May 29, 2026

shreyakhajanchi reviewed May 29, 2026

View reviewed changes

Respond to comments.

0a590ad

pull-request-size Bot added size/XL and removed size/XXL labels Jun 1, 2026

adrw-google added 2 commits June 1, 2026 10:37

Fix for presubmits.

d097c3c

Spotless

7c39e71

adrw-google added 2 commits June 2, 2026 13:29

More spotless.

a370772

Spotless

72c19ba

shreyakhajanchi reviewed Jun 4, 2026

View reviewed changes

		String optionType = resultSet.getString(3);
		String optionValue = resultSet.getString(4);

	// Other langugges use AS definition instead of sql body.
	// Other languages use AS definition instead of sql body.

	udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"");
	udfBuilder.definition("{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}");

		@@ -207,18 +207,19 @@ private void testSpannerToGCSAvroBase(
		// an empty database without any tables.
		spannerResourceManager.executeDdlStatement(setDefaultTimeZoneStatement);

Conversation

adrw-google commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot commented May 26, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shreyakhajanchi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

adrw-google Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

shreyakhajanchi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

adrw-google Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

shreyakhajanchi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

adrw-google Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

adrw-google commented May 26, 2026 •

edited

Loading

codecov Bot commented May 26, 2026 •

edited

Loading