Add support for Remote UDF#3860
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enables support for Remote UDFs within the Spanner connector. It extends the existing schema conversion logic to include language, options, and definition fields, ensuring that Remote UDFs are correctly handled during export and import operations. The changes maintain compatibility with both Google Standard SQL and PostgreSQL dialects and include robust testing to ensure stability. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for Spanner Remote User-Defined Functions (UDFs) across both GoogleSQL and PostgreSQL dialects, updating DDL generation, Avro schema conversion, information schema scanning, and associated integration tests. Several issues were identified in the review: a duplicate .createUdf("s1.Foo2") call in CopyDbTest.java overwrites a test definition; the PostgreSQL remote UDF test in pgUdfs() incorrectly uses .options() and has a mismatched name; the GoogleSQL DDL resource file uses backticks instead of single/double quotes for option values; and the random DDL generator outputs invalid JSON for PostgreSQL remote UDF definitions. Additionally, a redundant null check on udf.options() was noted, along with potential NullPointerException risks when reading routine options, and a minor typo in a comment.
| UdfParameter.parse( | ||
| "arg1 STRING DEFAULT 'bar'", "s1.Foo2", Dialect.GOOGLE_STANDARD_SQL)) | ||
| .endUdf() | ||
| .createUdf("s1.Foo2") |
| .name("s1.Foo2") | ||
| .language("REMOTE") | ||
| .type("BIGINT") | ||
| .addParameter( | ||
| UdfParameter.parse("arg0 BIGINT", "s1.Foo3", Dialect.POSTGRESQL)) | ||
| .options( | ||
| ImmutableList.of( | ||
| "endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\"")) |
There was a problem hiding this comment.
In pgUdfs(), there are two issues with the definition of s1.Foo3:
.name("s1.Foo2")is used instead of.name("s1.Foo3")..options(...)is used, which is not supported in the PostgreSQL dialect and will throw anIllegalArgumentExceptionduring DDL generation. Instead, PostgreSQL remote UDFs specify their options via thedefinitionproperty as a JSON string (e.g., using.definition(...)).
.name("s1.Foo3")
.language("REMOTE")
.type("BIGINT")
.addParameter(
UdfParameter.parse("arg0 BIGINT", "s1.Foo3", Dialect.POSTGRESQL))
.definition(
"{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}")| CREATE SCHEMA `%PREFIX%UdfSchema`; | ||
|
|
||
| CREATE FUNCTION `%PREFIX%UdfSchema`.`Remote`(x INT64, y INT64) RETURNS INT64 NOT DETERMINISTIC LANGUAGE REMOTE OPTIONS ( | ||
| endpoint = `https://us-central1-myproject.cloudfunctions.net/myfunc`, |
| if (udf.options() != null) { | ||
| for (int j = 0; j < udf.options().size(); j++) { | ||
| recordBuilder.prop(SPANNER_OPTION + j, udf.options().get(j)); | ||
| } | ||
| } |
There was a problem hiding this comment.
The udf.options() method returns an ImmutableList<String> which is non-nullable (it is not annotated with @Nullable and is initialized to an empty list by default in the builder). Therefore, the null check udf.options() != null is redundant and can be safely removed.
for (int j = 0; j < udf.options().size(); j++) {
recordBuilder.prop(SPANNER_OPTION + j, udf.options().get(j));
}References
- Ensure that mandatory parameters are validated for nullability at the entry point or in upstream components to prevent NullPointerExceptions in downstream transforms, avoiding the need for redundant null checks.
| String optionType = resultSet.getString(3); | ||
| String optionValue = resultSet.getString(4); |
There was a problem hiding this comment.
To prevent potential NullPointerExceptions, guard against null values for optionType and optionValue using resultSet.isNull(...) before calling methods on them or passing them to the escaper. This aligns with defensive programming best practices.
| String optionType = resultSet.getString(3); | |
| String optionValue = resultSet.getString(4); | |
| String optionType = resultSet.isNull(3) ? "" : resultSet.getString(3); | |
| String optionValue = resultSet.isNull(4) ? "" : resultSet.getString(4); |
| if (language() == null || language().isEmpty() || "SQL".equalsIgnoreCase(language())) { | ||
| appendable.append(" RETURN ").append(definition()); | ||
| } else { | ||
| // Other langugges use AS definition instead of sql body. |
| if (getDialect() == Dialect.GOOGLE_STANDARD_SQL) { | ||
| udfBuilder.options(ImmutableList.of("endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\"")); | ||
| } else { | ||
| udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\""); |
There was a problem hiding this comment.
The definition for PostgreSQL remote UDFs should be a valid JSON string enclosed in curly braces {}. Currently, it is missing the curly braces, which will result in invalid JSON.
| udfBuilder.definition("\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\""); | |
| udfBuilder.definition("{\"endpoint\": \"https://us-central1-myproject.cloudfunctions.net/myfunc\"}"); |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3860 +/- ##
============================================
+ Coverage 54.20% 55.53% +1.32%
- Complexity 6895 7036 +141
============================================
Files 1096 1103 +7
Lines 67353 67602 +249
Branches 7558 7594 +36
============================================
+ Hits 36512 37546 +1034
+ Misses 28375 27634 -741
+ Partials 2466 2422 -44
🚀 New features to boost your workflow:
|
| gcsClient.uploadArtifact( | ||
| "input/UdfSchema.avro-00000-of-00001", | ||
| Resources.getResource("ImportPipelineIT/" + subdirectory + "/UdfSchema.avro").getPath()); | ||
| gcsClient.uploadArtifact( | ||
| "input/UdfSchema-manifest.json", | ||
| Resources.getResource("ImportPipelineIT/" + subdirectory + "/UdfSchema-manifest.json") | ||
| .getPath()); |
There was a problem hiding this comment.
why do we have these files? I don't think we are asserting any non-remote udfs right now in the pipeline.
There was a problem hiding this comment.
I think Jacob has a concurrent PR to add SQL UDFs, which will add these files too. Note that all UDFs (remote or sql) need to be inside a named schema, and these avro files are necessary to create one.
| .stream() | ||
| .map(row -> row.getString(0)) | ||
| .collect(Collectors.toList())) | ||
| .containsExactly("UdfSchema.Remote"); |
There was a problem hiding this comment.
Will it be better to assert other fields as well from the schema specially UDF type, options , etc?
{
"type" : "record",
"name" : "UdfSchema_Remote",
"namespace" : "spannerexport",
"fields" : [ ],
"spannerName" : "UdfSchema.Remote",
"spannerUdfName" : "UdfSchema.Remote",
"spannerEntity" : "spannerUdf",
"spannerOption_0" : "endpoint=\"https://us-central1-myproject.cloudfunctions.net/myfunc\"",
"spannerOption_1" : "max_batching_rows=10",
"spannerUdfLanguage" : "REMOTE",
"googleStorage" : "CloudSpanner",
"spannerUdfParameter_0" : "`x` INT64",
"spannerUdfParameter_1" : "`y` INT64",
"spannerUdfSecurity" : "INVOKER",
"spannerUdfType" : "INT64",
"googleFormatVersion" : "1.0.0",
"spannerUdfDefinition" : ""
}
There was a problem hiding this comment.
I think the test should assert that the functions were actually created in the database in the simplest possible way. Using InformationSchemaScanner could introduce false positives here, if there are bugs in InformationSchemaScanner (i.e. it sometimes adds default values).
We could just take a full dump of INFO_SCHEMA.ROUTINES here, but that might be flaky if we decide to change INFO_SCHEMA implementation. This shouldn't happen frequently though.
We could just re-export the database and compare the avro files to be identical. That could give us a much better coverage.
I opened b/517150731 for the migrations team to ponder and design how these tests should work. I'm adding a minimal viable coverage in this PR.
| @@ -207,18 +207,19 @@ private void testSpannerToGCSAvroBase( | |||
| // an empty database without any tables. | |||
| spannerResourceManager.executeDdlStatement(setDefaultTimeZoneStatement); | |||
There was a problem hiding this comment.
We should assert the remote udf handling in export pipeline as well
There was a problem hiding this comment.
Opened b/517150731 for this. It's up to the migrations teams to design how these tests should work. I'm not familiar with avro enough to pull this off and comparing the binary payload of output to golden doesn't yield diffs that are understandable.
Support for Spanner Remote UDF
GSQL dialect has been supported for 3+ mo in prod. PG dialect is rolling out, but the changes here should be safe to push (existing databases won't have REMOTE UDFs). Added unit tests for PG dialect. Integration tests will be pushed once we deploy support in prod.
Tested using Export/Import integration tests.
Some minor fixes: